/ Cookbook.FileReadRandomLine

You have a file of unknown size, and you need to select one line at random from it. The file may be very large, so you don't want to put it all in memory at once, and want to do this in a single pass.

;; random-line : (U string path) -> string ;; ;; Returns a random line from the given file, using only a single pass through the file. ;; Each line has an equal chance of being chosen. (define (random-line/stdin) (let reader ([chance 1] ; chance that a line becomes the new result [result ""]) ; current result if there are no more lines (cond [(eof-object? (peek-char)) result] ; end of file, return current result [(zero? (random chance)) ; this line is the new result (reader (add1 chance) (read-line (current-input-port) 'any))] ; continue [else (read-line (current-input-port) 'any) ; discard a line (reader (add1 chance) result)]))) ; continue (define (random-line filename) (with-input-from-file filename random-line/stdin))

As an example, consider the case that there are 3 lines in the file. During the first iteration, there is a 1/1 chance that the line read is stored as the new result. During the second iteration, there is a 1/2 chance that the line read is stored as the new result, and a 1/2 chance that the first line remains. During the third iteration, there is a 1/3 chance that the third line is stored as the result, and a 2/3 chance that the previous result remains. 2/3 * 1/2 = 1/3, thus there is a cumulative 1/3 chance that the first line will be the result and a cumulative 1/3 chance that the second line will be the result. On the fourth iteration, the result (which has an equal probability of being any of the three lines read so far) is returned.

Sampling an element from a stream in a single pass is known as reservoir sampling, the reservoir being `result`

in the implementation above. This technique can be used to sample from any data structure that can be sequentially iterated. For example, the same technique could be used to sample an element from a tree using only a single pass through the tree. It is also possible to sample from a non-uniform distribution, though the details are beyond the scope of this article.

And of course, the program has been tested. I had it draw 3000 random liens from a 6 line file, and got each line ~500 times, with no apparent favoritism toward earlier or later lines.
This function WOULD fail if the number of lines in the file is enough to make the random number generator begin giving skewed results, ie: a significant fraction of 2147483647.

-- BrianJ - 11 Dec 2006

Nice recipe BrianJ! I've added a comment for the code. You might want to move your discussion of the algorithm into the Discussion section above, and the code could use `with-input-from-file`

, making it more robust against errors. It is also worth noting that reserviour sampling can be used for many other situations. For example, I've used it draw random sub-trees from a tree. I was first shown this trick by OlegK.

-- NoelWelsh - 11 Dec 2006

Thanks for your comments. I'm a little busy due to exams tomorrow, but I'll try to implement the rest of your suggestions soon. UPDATE: I used `with-input-from-file`

, as requested.

-- GordonWeakliem - 13 Aug 2004

-- BrianJ - 11 Dec 2006

CookbookForm | |
---|---|

TopicType: | Recipe |

ParentTopic: | FileRecipes |

TopicOrder: | 999 |

The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.

Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.

Ideas, requests, problems regarding Schematics Cookbook? Send feedback. / You are Main.guest