You have a file of unknown size, and you need to select one line at random from it. The file may be very large, so you don't want to put it all in memory at once, and want to do this in a single pass.

500 Can't connect to 127.0.0.1:8778 (connect: Connection refused)

How this works: This program stores one line, it's eventual result, at all times. It continues reading lines until it comes to the end of the file; when it reads line

*n* there is a 1/

*n* chance that it replaces the result, and an (

*n* -1)/

*n* chance that the result remains the same.

As an example, consider the case that there are 3 lines in the file. During the first iteration, there is a 1/1 chance that the line read is stored as the new result. During the second iteration, there is a 1/2 chance that the line read is stored as the new result, and a 1/2 chance that the first line remains. During the third iteration, there is a 1/3 chance that the third line is stored as the result, and a 2/3 chance that the previous result remains. 2/3 * 1/2 = 1/3, thus there is a cumulative 1/3 chance that the first line will be the result and a cumulative 1/3 chance that the second line will be the result. On the fourth iteration, the result (which has an equal probability of being any of the three lines read so far) is returned.

Sampling an element from a stream in a single pass is known as reservoir sampling, the reservoir being `result`

in the implementation above. This technique can be used to sample from any data structure that can be sequentially iterated. For example, the same technique could be used to sample an element from a tree using only a single pass through the tree. It is also possible to sample from a non-uniform distribution, though the details are beyond the scope of this article.

If the record size is guaranteed to be constant, this problem has a much simpler solution. However, I assumed that you meant for this function to get a random line from any text file. I apologize if the code style isn't Cookbook Convention; this is my first edit. Feel free to tweak it.

And of course, the program has been tested. I had it draw 3000 random liens from a 6 line file, and got each line ~500 times, with no apparent favoritism toward earlier or later lines.
This function WOULD fail if the number of lines in the file is enough to make the random number generator begin giving skewed results, ie: a significant fraction of 2147483647.

-- BrianJ - 11 Dec 2006

Nice recipe BrianJ! I've added a comment for the code. You might want to move your discussion of the algorithm into the Discussion section above, and the code could use `with-input-from-file`

, making it more robust against errors. It is also worth noting that reserviour sampling can be used for many other situations. For example, I've used it draw random sub-trees from a tree. I was first shown this trick by OlegK.

-- NoelWelsh - 11 Dec 2006

Thanks for your comments. I'm a little busy due to exams tomorrow, but I'll try to implement the rest of your suggestions soon. UPDATE: I used `with-input-from-file`

, as requested.

-- GordonWeakliem - 13 Aug 2004

-- BrianJ - 11 Dec 2006