You have a file of unknown size, and you need to select one line at random from it. The file may be very large, so you don't want to put it all in memory at once, and want to do this in a single pass.
(define (random-line filename)
(let reader ([infile (open-input-file filename)]
[chance 1]
[result ""])
(cond [(eof-object? (peek-char infile))
(close-input-port infile)
result]
[(zero? (random chance))
(reader infile (add1 chance) (read-line infile 'any))]
[else (read-line infile 'any)
(reader infile (add1 chance) result)])))
How this works: This program stores one line, it's eventual result, at all times. It continues reading lines until it comes to the end of the file; when it reads line
n there is a 1/
n chance that it replaces the result, and an (
n -1)/
n chance that the result remains the same.
As an example, consider the case that there are 3 lines in the file. During the first iteration, there is a 1/1 chance that the line read is stored as the new result. During the second iteration, there is a 1/2 chance that the line read is stored as the new result, and a 1/2 chance that the first line remains. During the third iteration, there is a 1/3 chance that the third line is stored as the result, and a 2/3 chance that the previous result remains. 2/3 * 1/2 = 1/3, thus there is a cumulative 1/3 chance that the first line will be the result and a cumulative 1/3 chance that the second line will be the result. On the fourth iteration, the result (which has an equal probability of being any of the three lines read so far) is returned.
If the record size is guaranteed to be constant, this problem has a much simpler solution. However, I assumed that you meant for this function to get a random line from any text file. I apologize if the code style isn't Cookbook Convention; this is my first edit. Feel free to tweak it.
And of course, the program has been tested. I had it draw 3000 random liens from a 6 line file, and got each line ~500 times, with no apparent favoritism toward earlier or later lines.
This function WOULD fail if the number of lines in the file is enough to make the random number generator begin giving skewed results, ie: a significant fraction of 2147483647.
--
BrianJ - 11 Dec 2006
Nice recipe
BrianJ! I've added a comment for the code. You might want to move your discussion of the algorithm into the Discussion section above, and the code could use
with-input-from-file, making it more robust against errors. It is also worth noting that reserviour sampling can be used for many other situations. For example, I've used it draw random sub-trees from a tree. I was first shown this trick by
OlegK.
--
NoelWelsh - 11 Dec 2006
Thanks for your comments. I'm a little busy due to exams tomorrow, but I'll try to implement the rest of your suggestions soon.
--
GordonWeakliem - 13 Aug 2004
--
BrianJ - 11 Dec 2006