Gutenovox, a Mashup of Gutenberg and LibriVox Catalogs

I recently did my first recording for LibriVox, which is a site where volunteers read public domain works. I wanted to contribute to a short science fiction collection, and it was challenging to find a work that hadn’t been read already. That is not because there aren’t any, but because it involves looking through lists of stories on Project Gutenberg and then searching for each one in the LibriVox catalog. “This is what computers are for,” I thought. So over the last week or so, I created a mashup site of the two catalogs, so that one can search the Project Gutenberg catalog and see which LibriVox recordings already exist, plus an estimated reading time. It works pretty well, but there are a few limitations.

For one, the Project Gutenberg (“PG”) catalog I am using is from April 2014, and is not likely to get updated. Their catalog export was, and still is, in the not-much-used and super-confusing RDF format. Fortunately a programmer in Estonia, Emilis Dambauskas, had made a SQLite version of it. Unfortunately, PG changed their data format radically in April 2014: it is still RDF, but split across many files and with a very different structure. I wrote to Emilis, and he said that it was too much for him to change. I spent a few hours with the new RDF files and found them too challenging to be worth spending a lot of time on. Although new books are being added to PG all the time, the April 2014 catalog will do for now. Hopefully PG will start exporting in something more developer-friendly before too long, or someone will write a converter that works with the new format.

The other limitation is in the matching of books. For single-author works, the matching is done by Gutenberg ebook_id. While that works a lot of the time, it is possible that someone may read a book that is in the PG catalog, but not use the PG text. This apparently happens more often than you might think, as LibriVox veteran RuthieG explains: “it happens from time to time that a LibriVox recording has already been made by the time the PG text is released. I actually prefer to read from a scan of the actual book, for two reasons: first because PG texts are not necessarily from one particular edition of a book, or don’t state which edition they are, and secondly, because many of the early PG (pre-Distributed Proofreaders) transcriptions have a number of transcription errors.” So recordings like that don’t show up (yet).
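To make the ID-based matching concrete, here is a minimal sketch of the idea in Python. It is not the actual Gutenovox code: the function name and the dictionary-shaped records are made up for illustration, and the only thing taken from the description above is that both sides can carry an ebook_id.

```python
def match_by_ebook_id(pg_books, lv_recordings):
    """Map each PG ebook_id to the LibriVox recordings that reference it.

    Both arguments are lists of dicts. The record shape is an assumption
    made for this sketch, not the real catalog or API layout.
    """
    recordings_by_id = {}
    for rec in lv_recordings:
        ebook_id = rec.get("ebook_id")
        # A recording read from a scan (not the PG text) has no ebook_id,
        # so it can never be matched this way.
        if ebook_id is not None:
            recordings_by_id.setdefault(ebook_id, []).append(rec)

    return {
        book["ebook_id"]: recordings_by_id.get(book["ebook_id"], [])
        for book in pg_books
    }


# Illustrative data only; the IDs and titles are invented.
pg_books = [
    {"ebook_id": 101, "title": "First Story"},
    {"ebook_id": 102, "title": "Second Story"},
]
lv_recordings = [
    {"ebook_id": 101, "title": "First Story"},
    {"ebook_id": None, "title": "Second Story"},  # read from a scan
]
matches = match_by_ebook_id(pg_books, lv_recordings)
```

In this toy data the second recording exists, but it is invisible to the lookup because it carries no ebook_id, which is exactly the gap described above.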

For collective works, the LibriVox API does not provide the ebook_id, which seems to be an oversight, but no one is working on the API at the moment. So I worked around it by matching on the title. But not all titles match exactly (“A” and “The” are often omitted, as is the occasional comma), and it is also possible for two different books to share a title, leading to false positives. In the end, though, title matching seems like the better method overall, because minor differences can be overcome with Levenshtein distance or a similar measure. Combined with the author’s last name, a fuzzy title match should get almost all books. So I hope to move Gutenovox over to that method sometime soon. In the meantime, the current matching seems to get about 90% of ebooks, so it is good enough to be pretty useful.
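Here is a rough sketch of what that fuzzy matching could look like in Python. It is not what Gutenovox does today. The function names, the list of articles to drop, and the 20% threshold are all assumptions I picked for illustration, and the threshold in particular would need tuning against the real catalogs.

```python
import re

# Leading or embedded articles are dropped before comparing (an assumption).
ARTICLES = {"a", "an", "the"}


def normalize_title(title):
    """Lowercase, strip punctuation, and drop articles."""
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    return " ".join(w for w in words if w not in ARTICLES)


def levenshtein(a, b):
    """Edit distance via the standard row-by-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def is_match(pg_title, pg_last_name, lv_title, lv_last_name, max_ratio=0.2):
    """True if last names agree and the titles are within max_ratio edits.

    max_ratio is a hypothetical threshold: the edit distance divided by
    the length of the longer normalized title.
    """
    if pg_last_name.strip().lower() != lv_last_name.strip().lower():
        return False
    a, b = normalize_title(pg_title), normalize_title(lv_title)
    longest = max(len(a), len(b))
    if longest == 0:
        return False
    return levenshtein(a, b) / longest <= max_ratio
```

For example, `is_match("The Time Machine", "Wells", "Time Machine", "wells")` is true because the titles are identical once the article is dropped, while comparing “The Time Machine” with “The War of the Worlds” fails even though the author is the same. Requiring the last names to agree is what guards against two different books that happen to share a title.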