PDFinder

Even more than most of my projects, PDFinder is a "Partly-baked Idea". However, it is one that I've been thinking about for a few years, so it's probably worth writing up. Maybe someone with the necessary resources will get interested.

PDFinder is a proposal to index all of the WWW's Portable Document Format (PDF) documents. Yep; all 300 million of them:

[Screenshot: SS_PDF_cnt.png — Google's reported count of indexed PDF documents]

Of course, as the screenshot above demonstrates, Google has already indexed these PDFs. So, what does PDFinder bring to the party? Hopefully, the ability to find related documents and perform slightly more intelligent searches.

Why PDF?

But before we get into questions of "how" or even precisely "what", let's discuss the things that make PDFs particularly valuable to index. Basically, it comes down to the fact that PDF is a publication format for (more or less) static content. Web pages can change from moment to moment; PDF files, in contrast, are generally written once and then left alone.

Also, because PDF is a publication format, it tends to be used for any "important" published work. This does not imply the converse: many PDFs are completely trivial. However, if an academic researcher wants a paper to reach a large audience, s/he will make sure that a PDF copy is available.

How would this work?

The basic idea is that we calculate a cosine vector for each PDF file. We can then compare pairs of vectors (multiplying corresponding terms and summing the results) to find out how closely related the documents are.
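
To make the comparison step concrete, here's a minimal Python sketch. It assumes each document's cosine vector is stored as a fixed-width bit vector (represented here as a Python integer, with set bits marking hashed terms); the function names are mine, not part of the proposal.

    # Compare two bit-vector "cosine vectors". A set bit marks a hashed term.

    def dot(a: int, b: int) -> int:
        """Dot product of two bit vectors: count of bits set in both."""
        return (a & b).bit_count()          # bit_count() needs Python 3.10+

    def cosine_similarity(a: int, b: int) -> float:
        """Normalized similarity in [0, 1]; 1.0 means identical bit sets."""
        if a == 0 or b == 0:
            return 0.0
        return dot(a, b) / (a.bit_count() * b.bit_count()) ** 0.5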

There may well be a better way to do this (I'm certainly no expert), but here's a plausible first cut at the problem (a rough code sketch follows the list):

  • Duplicate removal

    Using Git or some similar technology, make sure that only one copy of each document is stored (Git uses SHA-1 cryptographic checksums for this).

  • Term extraction

    Pull out words and short word sequences, discarding punctuation and really common words. If you like, boil the words down to "stems".

  • Cosine vector generation

    Generate a numeric hash (eg, a number from 0 to 80K) from each extracted term. Use this to index into a 10 KB bit vector, setting a bit wherever the hash points.

  • Index generation

    Store the cosine vectors in a database, along with the SHA-1 checksums, URLs, etc. Build an inverted index that can be used to look up documents that have particular bits set.
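
Here's a rough Python sketch of the extraction and vector-generation steps. The 80K-bucket / 10 KB bit vector and the SHA-1 de-duplication key follow the list above; the trivial tokenizer, the stopword list, and the choice of MD5 as the term hash are illustrative assumptions, and the sketch presumes the PDF's text has already been extracted.

    import hashlib
    import re

    VECTOR_BITS = 80 * 1024     # a 10 KB bit vector, ie ~80K buckets
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

    def sha1_key(pdf_bytes: bytes) -> str:
        """De-duplication key: store each distinct document only once."""
        return hashlib.sha1(pdf_bytes).hexdigest()

    def extract_terms(text: str) -> list[str]:
        """Words and adjacent word pairs, minus punctuation and stopwords."""
        words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
                 if w not in STOPWORDS]
        pairs = [f"{w1} {w2}" for w1, w2 in zip(words, words[1:])]
        return words + pairs

    def term_vector(terms: list[str]) -> int:
        """Hash each term into one of VECTOR_BITS buckets; set that bit."""
        vec = 0
        for term in terms:
            bucket = int(hashlib.md5(term.encode()).hexdigest(), 16) % VECTOR_BITS
            vec |= 1 << bucket
        return vec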

When Joe User shows up with some "starter" documents (eg, a research paper or Wikipedia page), we perform a similar round of term extraction and cosine vector generation. However, we then use the result to look up documents which (appear to) have similar sets of terms.

Using this first result, we pull the indicated documents out of storage and do a full comparison. The results of this are fed back to Joe, who goes off to see which ones are actually relevant.

Of course, if Joe shows up with a large set of documents (possibly selected from the first-round results), we can do even better for him.
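
The lookup side might then be a hypothetical two-stage query: gather candidate documents from the inverted index, then rank them by full similarity. The sketch assumes an inverted_index dict mapping bit positions to sets of document IDs, a vectors dict of stored bit vectors, and the cosine_similarity helper sketched earlier; all of these names are mine.

    def candidates(query_vec: int, inverted_index: dict[int, set[str]]) -> set[str]:
        """First pass: every document sharing at least one set bit."""
        docs: set[str] = set()
        for bit in range(query_vec.bit_length()):
            if (query_vec >> bit) & 1:
                docs |= inverted_index.get(bit, set())
        return docs

    def related(query_vec, inverted_index, vectors, top_n=20):
        """Second pass: full comparison against each candidate's vector."""
        scored = [(cosine_similarity(query_vec, vectors[d]), d)
                  for d in candidates(query_vec, inverted_index)]
        return sorted(scored, reverse=True)[:top_n]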

What would this cost?

Well, some folks (eg, Google, The Internet Archive) already have these files coming in, so for them it would simply be a matter of storing and indexing them.

A quick look at some random PDFs indicates that (a) the average size is about 1.3 MB and (b) they don't compress much further. So, to store 300 million PDFs, we're looking at perhaps 0.4 Petabytes (300 million × 1.3 MB ≈ 390 TB). The index, of course, would be much smaller: 300 million of the 10 KB cosine vectors come to roughly 3 TB, so figure a few Terabytes in all.

The input processing and retrieval could be done by a small number of PCs. Of course, if zillions of researchers start using the service, more would be needed. In summary, this is not a huge project...


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!
