PDFinder
Even more than most of my projects,
PDFinder is a "Partly-baked Idea".
However, it is one that I've been thinking about for a few years,
so it's probably worth writing up.
Maybe someone with the necessary resources will get interested.
PDFinder is a proposal to index all of the WWW's
Portable Document Format (PDF) documents.
Yep; all 300 million of them:
Of course, as the screenshot above demonstrates,
Google has already indexed these PDFs.
So, what does PDFinder bring to the party?
Hopefully, the ability to find related documents
and perform slightly more intelligent searches.
Why PDF?
But before we get into questions of "how" or even precisely "what",
let's discuss the things that make PDFs particularly valuable to index.
Basically, it comes down to the fact that PDF is a publication format
for (more or less) static content.
Web pages can change from moment to moment;
PDF files, in contrast, are generally written once and then left alone.
Also, because PDF is a publication format,
it tends to be used for just about any "important" published work.
This does not imply the converse:
many PDFs are completely trivial.
However, if an academic researcher wants a paper to reach a large audience,
s/he will make sure that a PDF copy is available.
How would this work?
The basic idea is that we calculate a
cosine vector (a normalized term-frequency vector) for each PDF file.
We can then compare pairs of vectors
(multiplying corresponding terms and summing the results, ie, taking their dot product)
to find out how closely related the documents are.
There may well be a better way to do this (I'm certainly no expert),
but here's a plausible first cut at the problem:
- Index generation
Extract terms from each PDF and generate its cosine vector (sketched below). Store the vectors in a database, along with the SHA-1 checksums, URLs, etc. Build an inverted index that can be used to look up documents whose vectors have particular terms (bits) set.
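To make this concrete, here's a rough Python sketch of the indexing step.
Don't take the details too seriously:
extract_terms, cosine_vector, and index_document are just made-up names,
the dictionaries stand in for a real database and inverted index,
and I'm assuming that the raw text has already been pulled out of each PDF
(eg, with pdftotext).

    import hashlib
    import math
    import re
    from collections import Counter, defaultdict

    def extract_terms(text):
        """Lowercase the text and split it into simple word-like terms."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def cosine_vector(terms):
        """Build a unit-length term-frequency vector as {term: weight}."""
        counts = Counter(terms)
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        return {term: count / norm for term, count in counts.items()}

    # Toy stand-ins for the real storage layer.
    documents = {}                     # sha1 -> {"url": ..., "vector": ...}
    inverted_index = defaultdict(set)  # term -> set of SHA-1 checksums

    def index_document(url, raw_bytes, text):
        """Record one PDF's checksum, URL, and vector; update the inverted index."""
        sha1 = hashlib.sha1(raw_bytes).hexdigest()
        vector = cosine_vector(extract_terms(text))
        documents[sha1] = {"url": url, "vector": vector}
        for term in vector:
            inverted_index[term].add(sha1)
        return sha1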
When Joe User shows up with some "starter" documents
(eg, a research paper or Wikipedia page),
we perform a similar round of term extraction
and cosine vector generation.
However, we then use the result to look up documents
which (appear to) have similar sets of terms.
Using this first result,
we pull the indicated documents out of storage
and do a full comparison.
The results of this are fed back to Joe,
who goes off to see which ones are actually relevant.
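Continuing the sketch above (with the same caveats and made-up names),
the query side might look something like this;
a real system would prune the candidate set far more aggressively
than "shares at least one term".

    def cosine_similarity(v1, v2):
        """Dot product of two unit-length sparse vectors."""
        if len(v2) < len(v1):
            v1, v2 = v2, v1
        return sum(weight * v2.get(term, 0.0) for term, weight in v1.items())

    def find_related(starter_text, max_results=10):
        """Turn a starter document into a query vector, gather candidates
        from the inverted index, then rank them by full comparison."""
        query = cosine_vector(extract_terms(starter_text))

        # First round: any indexed document sharing at least one term.
        candidates = set()
        for term in query:
            candidates |= inverted_index.get(term, set())

        # Second round: full comparison against each candidate's stored vector.
        scored = sorted(
            ((cosine_similarity(query, documents[sha1]["vector"]),
              documents[sha1]["url"])
             for sha1 in candidates),
            reverse=True)
        return scored[:max_results]

Joe's first round then amounts to calling find_related(starter_text),
which hands back (score, URL) pairs for him to sift through.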
Of course, if Joe shows up with a large set of documents
(possibly selected from the first-round results),
we can do even better for him.
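Exactly how that would work is left open;
one possibility (just a guess) is to average the starter documents' vectors
into a single "centroid" query and feed it to the same lookup,
reusing the helpers from the sketches above:

    def combined_vector(texts):
        """Average several starter documents' vectors into one query vector."""
        vectors = [cosine_vector(extract_terms(t)) for t in texts]
        if not vectors:
            return {}
        combined = defaultdict(float)
        for v in vectors:
            for term, weight in v.items():
                combined[term] += weight / len(vectors)
        norm = math.sqrt(sum(w * w for w in combined.values())) or 1.0
        return {term: w / norm for term, w in combined.items()}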
What would this cost?
Well, some folks (eg, Google, The Internet Archive)
already have these files coming in,
so for them it would simply be a matter of storing
and indexing them.
A quick look at some random PDFs indicates that
(a) the average size is about 1.3 MB and
(b) they don't compress much further.
So, to store 300 million PDFs,
we're looking at perhaps 0.4 Petabytes.
The index, of course,
would be much smaller (eg, a few Terabytes).
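For what it's worth, here's the arithmetic behind that 0.4 Petabyte figure
(decimal units throughout):

    num_pdfs = 300_000_000  # documents
    avg_mb = 1.3            # average PDF size, in megabytes
    total_pb = num_pdfs * avg_mb / 1_000_000_000  # MB -> PB
    print(total_pb)         # ~0.39, ie, roughly 0.4 Petabytes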
The input processing and retrieval
could be done by a small number of PCs.
Of course, if zillions of researchers
start using the service, more would be needed.
In summary, this is not a huge project...
This wiki page is maintained by
Rich Morin,
an independent consultant specializing in software design, development, and documentation.
Please feel free to
email
comments, inquiries, suggestions, etc!