Getting random documents from a CouchDB
Hi,
for a while now I’ve been using and trying out Apache CouchDB, the rather new and very cool document-oriented database written in Erlang.
The problem
To get used to the Couch a bit (it’s very comfortable, by the way), I’m currently writing a quote management system with CouchDB as backend. Now, every quote thingy has a feature like »random quotes«, and this is done with SQL backends using ORDER BY RANDOM() or something like that. Now CouchDB doesn’t use SQL, but instead view-based querying, where views are based on map/reduce functions (read more in the CouchDB wiki). It’s not so easy to order by RANDOM() here.
The solution
… is to create a view that I called random_quotes, with only a map function, which looks as follows:
function(doc) { emit(Math.random(), doc); }
Now you’ll get a random number associated with each of your documents. To get 1 or n random documents from your CouchDB now, you query the view with the startkey parameter set to a random number between 0 and 1 (e. g. in Python: startkey = random.random()), and set limit to the maximum of rows that you want to get.
Now you could possibly get less rows than you wanted, because if your startkey is too high, it won’t cover that much documents. The solution is to count what you’ve got, and get documents from the beginning of the view, with a limit of as much documents as you still need, and the endkey set to your previous startkey.
I have written a python function that does this, using CouchDBKit, to which I contributed a object mapper:
import random def get_random(db, n): """Get an iterator over n random quotes from the database. Parameters: db: couchdbkit.session.Session n: int """ # see above how this view looks like view_results = db.view('quote/random_quotes', startkey=random.random(), limit=n) count = view_results.count() if count >= n: return view_results else: # not enough documents yet: new_limit = n - count other_results = view_results.view( endkey=view_results.params['startkey'], limit=new_limit) # we'll get at most as much documents as the DB contains, rather than # getting duplicates return chain(view_results, other_results)
Update: Note that if you query for more than a document at a time, you are likely to get »bad« randomness, i. e. always the same set of documents for a given random startkey. I haven’t yet found a solution to this, so if you need true randomness, you better query for single documents and aggregate them yourself.
Have fun with that, and a nice day,
Jannis.
Hi,
this map function will blow up on you. Map functions must guarantee the same output for the same input. Things like randomness or Date.now() cannot be used. The transitivity is needed for the incremental updates of view indexes.
Cheers
Jan
–
Jan said this on May 18th, 2009 at 10:36 am
I’m aware that view indexes are updated incrementally, so old documents will keep their old random number as key. The trick is that I query the view with a (client-side) random start key, so I only needed a range in which the key can vary.
Random numbers weren’t needed at all, I could as well hash the documents in a way to get a fitting key. I only use random numbers to get entropy and distribute keys well. (And – I’ve tested it – it works perfectly.)
Xjs said this on May 18th, 2009 at 3:21 pm
The problem with this approach isn’t what Jan noted, because if you test your view without the startkey you’ll see that all the random numbers are the same for each run. The issue is that they aren’t evenly distributed. Let’s say you have 4 documents to pick from, and they get the random numbers 0.01, 0.02, 0.9, and 0.91. Almost 90% of the time you are going to be pulling the same “random” document from your view.
Jeff M.
Jeff M said this on July 1st, 2010 at 12:26 am
In fact, if the random numbers consistently aren’t evenly distributed, this is an issue. But otherwise, due to the law of large numbers, everything will be fine as soon as we have some more than 4 documents.
The method could be perfected by writing a hash function optimized for even distribution. Perhaps one could use the _id-s, too, since they are in fact only numbers as well.
Xjs said this on July 1st, 2010 at 6:50 am