s c h e m a t i c s : c o o k b o o k

/ Cookbook.FileReadRandomLine

This Web

TOC (with recipes)

Other Webs



Schematics Home
Sourceforge Page
Original Cookbook

Scheme Links

Scheme FAQ
Scheme Cross Reference
Scheme48 SCM
MIT Scheme scsh
JScheme Kawa
Chicken Guile
Bigloo Tiny
Gambit LispMe

Lambda the Ultimate

Retrieving a Line at Random from a File of Unknown Size


You have a file of unknown size, and you need to select one line at random from it. The file may be very large, so you don't want to put it all in memory at once, and want to do this in a single pass.


500 Can't connect to (connect: Connection refused)


How this works: This program stores one line, it's eventual result, at all times. It continues reading lines until it comes to the end of the file; when it reads line n there is a 1/ n chance that it replaces the result, and an ( n -1)/ n chance that the result remains the same.

As an example, consider the case that there are 3 lines in the file. During the first iteration, there is a 1/1 chance that the line read is stored as the new result. During the second iteration, there is a 1/2 chance that the line read is stored as the new result, and a 1/2 chance that the first line remains. During the third iteration, there is a 1/3 chance that the third line is stored as the result, and a 2/3 chance that the previous result remains. 2/3 * 1/2 = 1/3, thus there is a cumulative 1/3 chance that the first line will be the result and a cumulative 1/3 chance that the second line will be the result. On the fourth iteration, the result (which has an equal probability of being any of the three lines read so far) is returned.

Sampling an element from a stream in a single pass is known as reservoir sampling, the reservoir being result in the implementation above. This technique can be used to sample from any data structure that can be sequentially iterated. For example, the same technique could be used to sample an element from a tree using only a single pass through the tree. It is also possible to sample from a non-uniform distribution, though the details are beyond the scope of this article.

Comments about this recipe

If the record size is guaranteed to be constant, this problem has a much simpler solution. However, I assumed that you meant for this function to get a random line from any text file. I apologize if the code style isn't Cookbook Convention; this is my first edit. Feel free to tweak it.

And of course, the program has been tested. I had it draw 3000 random liens from a 6 line file, and got each line ~500 times, with no apparent favoritism toward earlier or later lines. This function WOULD fail if the number of lines in the file is enough to make the random number generator begin giving skewed results, ie: a significant fraction of 2147483647.
-- BrianJ - 11 Dec 2006

Nice recipe BrianJ! I've added a comment for the code. You might want to move your discussion of the algorithm into the Discussion section above, and the code could use with-input-from-file, making it more robust against errors. It is also worth noting that reserviour sampling can be used for many other situations. For example, I've used it draw random sub-trees from a tree. I was first shown this trick by OlegK.

-- NoelWelsh - 11 Dec 2006

Thanks for your comments. I'm a little busy due to exams tomorrow, but I'll try to implement the rest of your suggestions soon. UPDATE: I used with-input-from-file, as requested.


-- GordonWeakliem - 13 Aug 2004

-- BrianJ - 11 Dec 2006

TopicType: Recipe
ParentTopic: FileRecipes
TopicOrder: 999

Copyright © 2004 by the contributing authors. All material on the Schematics Cookbook web site is the property of the contributing authors.
The copyright for certain compilations of material taken from this website is held by the SchematicsEditorsGroup - see ContributorAgreement & LGPL.
Other than such compilations, this material can be redistributed and/or modified under the terms of the GNU Lesser General Public License (LGPL), version 2.1, as published by the Free Software Foundation.
Ideas, requests, problems regarding Schematics Cookbook? Send feedback.
/ You are Main.guest