You don't need to read the whole file in at once, nor use random.choice(), if you use the reservoir-sampling algorithm. (This is the algorithm used in fortune on Unix!) The algorithm is based on the idea that you select later samples with a decreasing probability: enumerate the qualifying lines, for linenum, line in enumerate(ifilter(lambda x: len(x) - 1 > _MAX_LEN, fd)), and keep the current line if randint(0, linenum) == 0.

Turns out, though, that this isn't necessarily faster. I've tested it against a read-the-whole-file implementation, def goalword() built on words = filter(lambda x: len(x) - 1 > _MAX_LEN, fd), timing both with:

print(timeit.timeit("goalword0()", setup="from __main__ import goalword0; import random; random.seed(42)", number=100, timer=time.clock))
print(timeit.timeit("goalword1()", setup="from __main__ import goalword1; import random; random.seed(42)", number=100, timer=time.clock))

The filter() version is significantly faster (45.04 for the for loop). My guess is this is due to the overhead of the extra Python opcode operations in the for loop versus filter() being implemented in native C; i.e., we stay out of the interpreter for more of the work, as in the random.choice() implementation. My word list is not insignificant: $ curl -O …

"In theory there is no difference between theory and practice. …" If you're picking more than one word per execution, reading the whole file into a list will likely be more time efficient. If you're just picking one word, however, this implementation is the most time- and space-efficient, since you read the whole file once but only retain one word in memory. So, while not necessarily always faster, if you need to avoid loading the whole list into memory, reservoir sampling is what you want.

A MUCH more efficient solution for VERY large files (at a cost of a bias toward some words) would be as follows: seek to a random position in the file; discard the first "line", because you will probably start in the middle of a word; find the first word from there on which meets your criteria; if you get to the end, seek to the start and continue on.
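The reservoir-sampling idea above can be sketched as follows. This is a minimal sketch, not the answer's exact code: the `_MAX_LEN` value, the function name, and the file handling are assumptions, and the Python 3 built-in `filter()` stands in for the Python 2 `ifilter()`.

```python
import random

_MAX_LEN = 6  # assumed threshold; the original answer's value is not shown


def reservoir_word(filename):
    """Pick one qualifying word uniformly at random, keeping only one in memory."""
    choice = None
    with open(filename) as fd:
        # len(line) - 1 subtracts the trailing newline, as in the original filter.
        qualifying = filter(lambda line: len(line) - 1 > _MAX_LEN, fd)
        for linenum, line in enumerate(qualifying):
            # Replace the current choice with probability 1/(linenum + 1):
            # the first qualifying line is always taken, the nth kept 1/n of the time.
            if random.randint(0, linenum) == 0:
                choice = line.rstrip('\n')
    return choice
```

Each qualifying line ends up selected with equal probability, yet at most one word is ever held in memory.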
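The seek-to-a-random-offset approach for very large files might look like the sketch below. Everything here is a hypothetical illustration (function name, parameters, wrap-around strategy); the bias the answer mentions is real: a word that follows a long run of non-qualifying text is more likely to be landed on.

```python
import os
import random


def seek_random_word(filename, min_length=7):
    """Jump to a random byte offset and return the next qualifying word.

    Biased: words preceded by long stretches of the file are picked more often.
    """
    size = os.path.getsize(filename)
    with open(filename) as fd:
        fd.seek(random.randrange(size))
        fd.readline()  # discard the partial "line" we probably landed inside
        for _ in range(2):  # at most one wrap-around pass
            for line in fd:
                word = line.rstrip('\n')
                if len(word) >= min_length:
                    return word
            fd.seek(0)  # hit EOF without a match: seek to the start, continue on
    return None  # no word in the whole file meets the criteria
```

Note that no part of the file beyond the current line is ever buffered, which is why this scales to files far larger than memory.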
Some review points first:

- Python functions are written in snake_case.
- You could use function parameters for word length and filename.
- You could use iterators and generators to avoid loading the whole file into memory.
- If you read the file line by line, you need to take care with the trailing newline character: len("test\n") is 5.

Thanks to a generator, words = (line.rstrip('\n') for line in wordbook), the script only loops once over every line before building large_words and calling random.choice(large_words). A commenter noted that rstrip is called on every line even though it's only needed for one word; you could instead return random.choice(large_words).rstrip('\n'). On a dictionary of English words (/usr/share/dict/american-english), this function is 3 times faster than the previous one.

In the above examples, goal_word(30) fails with IndexError: Cannot choose from an empty sequence, and there's no indication that the desired length is too long. Before calling random.choice, the script could simply check that large_words isn't empty and raise ValueError("No word found with at least %s characters." % min_length), which reports:

# ValueError: No word found with at least 30 characters.
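Assembling the fragments of this answer into one runnable function might look like this. The function name, signature, generator line, and ValueError message come from the answer; the surrounding glue is an assumption about how the pieces fit together.

```python
import random


def goal_word(min_length=7, filename="words.txt"):
    """Return a random word of at least min_length characters from filename."""
    with open(filename) as wordbook:
        # Generator: newlines are stripped lazily, so the file is read only once.
        words = (line.rstrip('\n') for line in wordbook)
        large_words = [word for word in words if len(word) >= min_length]
    if not large_words:
        raise ValueError("No word found with at least %s characters." % min_length)
    return random.choice(large_words)
```

With the emptiness check in place, goal_word(30) on a typical dictionary raises a descriptive ValueError instead of an IndexError from random.choice.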