Rewrite a Python script into PHP (python/php programmer requested)
I have a requirement to use an interesting script written in Python.
I'd like a PHP implementation of it as I don't script with Python and want something I can work with in PHP.
$50 via PayPal sounds fair; I imagine it would take someone who can program in both languages around 30 minutes.
"InputWordList.txt" is a list of tab-delimited [WORD] [WORD FREQUENCY] pairs, one per line. Here is a data source:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/en/en_full.txt
I'll consider the task complete when it returns the same data as the Python version.
import re

def viterbi_segment(text):
    # probs[i]: best probability found for any segmentation of text[:i]
    # lasts[i]: start index of the last word in that segmentation
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk backwards through lasts to recover the words
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word):
    return dictionary.get(word, 0) / total

def words(text):
    return re.findall('[a-z]+', text.lower())

# CREATE DICTIONARY OF WORDS TO COMPARE TO
dictionary = {}
with open('InputWordList.txt') as word_file:
    for entry in word_file:
        w = entry.split()
        dictionary[w[0]] = int(w[1])
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

# SPLIT URL
# Note: this assignment rebinds the name "words", shadowing the helper above
words, prob = viterbi_segment('thisisacombinedurl.com')
Comments
Hmm, can you explain a little more about what this script actually does?
https://en.wikipedia.org/wiki/Viterbi_algorithm
Essentially, it's to find words in domain names, or more specifically, the most likely combination.
There's a problem with your input text file.
When running your python script to see what it does, it chokes on the line containing "ì"
Edit: That's line 24607 in your code.
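The thread doesn't say what the actual error was, so the following is only a sketch of one way to make the loading step more tolerant instead of editing the data file. It assumes Python 3 and that the word list is UTF-8 encoded; the function names `load_word_counts` and `load_word_counts_from_file` are made up for this example and are not part of the original script.

```python
def load_word_counts(lines):
    """Build a {word: count} dict from 'WORD<whitespace>COUNT' lines.

    Lines that do not have exactly two fields, or whose count is not an
    integer, are skipped instead of raising an exception.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        word, count = fields
        try:
            counts[word] = int(count)
        except ValueError:
            continue  # count field is not a number
    return counts


def load_word_counts_from_file(path):
    # Assumption: the file is UTF-8. errors="replace" keeps a stray byte
    # from aborting the whole load; it is swapped for U+FFFD instead.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return load_word_counts(handle)
```

Something like `dictionary = load_word_counts_from_file('InputWordList.txt')` would then stand in for the loading loop. Whether skipping bad lines is acceptable depends on how closely the PHP port has to match the original output.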
Why not use Regex instead?
So remove the line
There's no point in using regex here.
That's not really Viterbi, it's a greedy approximation. For instance, I'm guessing it would break "thesemaphore" into "these map hore", because it doesn't backtrack.
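One way to check that guess is to run the same algorithm on a tiny hand-made dictionary. The sketch below restates the posted function with the dictionary passed in as a parameter (a change made only to keep the example self-contained), and the counts in `toy_counts` are invented for illustration. Note that "hore" is deliberately absent from the toy dictionary, so any split ending in "hore" scores zero.

```python
def viterbi_segment(text, counts):
    """Same dynamic program as the posted script, with counts passed in."""
    total = float(sum(counts.values()))
    max_word_length = max(map(len, counts))
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        # Every start index j within max_word_length of i is considered,
        # each combined with the best score already stored for text[:j].
        prob_k, k = max(
            (probs[j] * counts.get(text[j:i], 0) / total, j)
            for j in range(max(0, i - max_word_length), i)
        )
        probs.append(prob_k)
        lasts.append(k)
    # Walk backwards through lasts to recover the chosen words
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]


# Invented counts, for illustration only
toy_counts = {"the": 100, "these": 50, "map": 20, "semaphore": 5}
words, prob = viterbi_segment("thesemaphore", toy_counts)
print(words)  # ['the', 'semaphore']
```

With these toy counts the result is "the semaphore": the table keeps the best score for every prefix, so the "these" prefix does not lock in the final answer. Results with a real frequency list depend on the data, in particular on whether fragments such as "hore" appear in it.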
Just search GitHub:
https://github.com/search?langOverride=&p=13&q=viterbi&repo=&start_value=1&type=Repositories
You'll find a PHP match on the first page of results:
https://github.com/trdarr/viterbi/blob/master/viterbi.php
I've converted most of your script, but do you by chance know what this part means?
prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                for j in range(max(0, i - max_word_length), i))
@rincewind
Thanks. I searched before posting but didn't find anything that looked as quick as this.
@Pandarain
Not really, that's why I need it converted. It looks like it finds the maximum probability from a set of probabilities.
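For anyone porting that line, here is a rough unrolled equivalent. The generator builds one `(probability, j)` tuple per candidate start index `j`, and `max` compares tuples element by element, so it returns the highest probability together with the `j` that produced it (ties go to the larger `j`). The helper name `best_split` and the toy values are made up for this sketch.

```python
def best_split(probs, text, i, max_word_length, word_prob):
    """Explicit-loop version of the max(...) expression for one value of i."""
    best = None
    for j in range(max(0, i - max_word_length), i):
        # Score of "best segmentation of text[:j]" followed by word text[j:i]
        candidate = (probs[j] * word_prob(text[j:i]), j)
        # Tuple comparison: probability first, then j as the tie-breaker
        if best is None or candidate > best:
            best = candidate
    return best


# Toy word probabilities, invented for illustration
toy_probs = {"a": 0.5, "b": 0.4, "ab": 0.1}


def toy_word_prob(word):
    return toy_probs.get(word, 0.0)


probs = [1.0, 0.5]  # best scores for text[:0] and text[:1]
prob_k, k = best_split(probs, "ab", 2, 2, toy_word_prob)
print(prob_k, k)
```

Here splitting as "a" + "b" (j = 1) beats treating "ab" as a single word (j = 0), so `k` comes back as 1. In PHP the same thing can be written as a plain `for` loop that tracks the best probability and its index.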
Hi @ricardo,
I've converted the script.
Is it acceptable that it won't be as fast as Python's?
Update: bug fixed. It's fast now.
Hi @scpal,
If you're OK with it, paste it in here and we can compare the output of both. If output is the same I'll send over some cash. Thanks for taking the time to rewrite it.
The site keeps blocking my post, so I can't paste it here.
Please check it out at: http://pastebin.com/ydzfbknd
Cheers
Excellent. I'll check it out today (5AM here)
Just a small fix to remove the type casting: http://pastebin.com/D92PFeBD
It looks like both versions produce the same output, which is what I wanted. Now that it's in PHP I can do some more work on it. Thank you.
PM me your PayPal and I'll get some cash sent over. Thanks!