Although Ontiki is focused on Wikipedia (WP),
it uses an indirect approach.
Specifically, it relies on other data sets
to provide information about (or complementary to) WP.
In the diagram of these data sets, line styles indicate connection status:
- Solid line - implemented connection
- Dashed line - planned connection
- Dotted line - hoped-for connection
DBpedia supports a variety of download distributions and formats.
for Ontiki-specific details and notes.
YAGO supports two download formats:
We are currently loading the Turtle format into Neo4j
by means of our own data preparation suite (
and the Neo4j Import Tool (aka
The TSV-to-PostgreSQL load script creates a single database table, called
It has four columns:
See the YAGO web
for Ontiki-specific details and notes.
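As a rough illustration of the Turtle-to-Neo4j step described above, the sketch below turns simple triples into the two CSV files (nodes and relationships) that the Neo4j Import Tool reads. This is not the Ontiki data preparation suite; the function names, the `Entity` label, and the sample triples are all assumptions made up for the example. It also assumes one `<s> <p> <o> .` statement per line, which real Turtle does not guarantee.

```python
import csv
import io


def parse_triples(lines):
    """Yield (subject, predicate, object) from one-triple-per-line input.

    Assumption: each statement looks like `<s> <p> <o> .`. Directives,
    comments, and anything more complex are skipped, not handled.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue
        parts = line.split()
        if len(parts) == 4 and parts[3] == ".":
            yield tuple(p.strip("<>") for p in parts[:3])


def to_import_csv(triples):
    """Return (nodes_csv, rels_csv) using import-tool style headers."""
    triples = list(triples)
    names = sorted({t[0] for t in triples} | {t[2] for t in triples})

    node_buf = io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["name:ID", ":LABEL"])
    for name in names:
        node_writer.writerow([name, "Entity"])  # hypothetical label

    rel_buf = io.StringIO()
    rel_writer = csv.writer(rel_buf, lineterminator="\n")
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for subj, pred, obj in triples:
        rel_writer.writerow([subj, obj, pred])

    return node_buf.getvalue(), rel_buf.getvalue()


# Made-up sample input, for illustration only.
sample = [
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
    "<Albert_Einstein> <wasBornIn> <Ulm> .",
    "<Ulm> <isLocatedIn> <Germany> .",
]
nodes_csv, rels_csv = to_import_csv(parse_triples(sample))
```

A production loader would more likely use a real Turtle parser and stream its output to files, since the full data set is unlikely to fit comfortably in memory.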
My wish list for Ontiki contains a number of ponies and unicorns.
Several collections of scientific papers approach the scale of WP.
Although their audiences are much smaller than WP's,
improving navigation on them would still be useful.
In addition, the collections have much richer metadata than WP, including:
- explicit linkage (citations, ...)
- index information (categories, keywords, ...)
- provenance (author, date, institution, ...)
Finally, it might be interesting and worthwhile
to fold some of these collections into Ontiki.
Users could then "drill down" from WP pages into relevant papers,
use WP to provide simple definitions and cross-field connections, etc.
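To make the idea a bit more concrete, here is a minimal sketch of how a paper record might carry the three kinds of metadata listed above, plus a link back to WP pages for "drilling down". Everything in it is hypothetical: the `Paper` class, its field names, the `wp_pages` link, and the placeholder records are assumptions for illustration, not part of Ontiki or of any collection's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paper:
    paper_id: str
    title: str
    cites: List[str] = field(default_factory=list)     # explicit linkage
    keywords: List[str] = field(default_factory=list)  # index information
    authors: List[str] = field(default_factory=list)   # provenance
    date: str = ""                                      # provenance
    institution: str = ""                               # provenance
    wp_pages: List[str] = field(default_factory=list)  # hypothetical WP links


def drill_down(papers, wp_page):
    """Return the papers linked to the given WP page title."""
    return [p for p in papers if wp_page in p.wp_pages]


def cited_by(papers, paper_id):
    """Return ids of papers whose citation list includes paper_id."""
    return [p.paper_id for p in papers if paper_id in p.cites]


# Placeholder records, not real papers.
papers = [
    Paper("paper-1", "Placeholder title A",
          keywords=["graphs"], wp_pages=["Graph theory"]),
    Paper("paper-2", "Placeholder title B",
          cites=["paper-1"], wp_pages=["Graph theory", "Ontology"]),
    Paper("paper-3", "Placeholder title C",
          cites=["paper-1", "paper-2"], wp_pages=["Ontology"]),
]

related = drill_down(papers, "Graph theory")
citing = cited_by(papers, "paper-1")
```

In a graph database, the `cites` and `wp_pages` lists would more naturally become relationships than list-valued properties.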
The arXiv (pronounced "archive", as if the "X" were the Greek letter Chi, χ)
is a repository of electronic preprints, known as e-prints, of scientific papers
in the fields of mathematics, physics, astronomy, computer science, quantitative
biology, statistics, and quantitative finance, which can be accessed online. In
many fields of mathematics and physics, almost all scientific papers are
self-archived on the arXiv.
-- arXiv (WP)
BioMed Central, etc.
BioMed Central (BMC) is a United Kingdom-based, for-profit scientific publisher
specialising in open access journal publication. BioMed Central and its sister
companies Chemistry Central and PhysMath Central publish over 200 scientific
journals. Most BioMed Central journals are now published only online. BioMed
Central describes itself as the first and largest open access science publisher.
It is owned by Springer Science+Business Media.
-- BioMed Central (WP)
Public Library of Science
PLOS (for Public Library of Science) is a nonprofit open access scientific
publishing project aimed at creating a library of open access journals and
other scientific literature under an open content license. It launched its
first journal, PLOS Biology, in October 2003 and publishes seven journals,
all peer reviewed, as of April 2012.
-- PLOS (WP)
Although there are a number of citation indexes,
most are proprietary and require payment for access.
Given the size of the effort involved, this isn't surprising,
but it is a bit disappointing.
CiteSeer was a public search engine and digital library for scientific
and academic papers, primarily in the fields of computer and information
science that has been replaced by CiteSeerX. Many consider it to be the
first academic paper search engine.
-- CiteSeer (WP)
CiteSeerX is a happy exception to the generally restrictive nature
of citation index distribution:
CiteSeerX data and metadata are available for others to use.
Data available includes CiteSeerX metadata, databases,
data sets of PDF files, and text of PDF files.
-- CiteSeerX Data
GetCITED was a website database that listed publication and citation
information on academic articles whose information was entered by members.
It aimed to include journal articles, book chapters and other publications,
both peer-reviewed and non-reviewed.
The objective was to make this information publicly available,
as such information is held in restricted databases.
It indexed over 3,000,000 publications from over 300,000 authors.
Its last copyright amendment was in 2013 and it has largely been supplanted
by other tools including Google Scholar.
The website ceased to function in mid-2014.
-- GetCITED (WP)
Because the GetCITED data set was still being maintained in 2013,
it might still be useful if it is available for download.
Google Scholar is a freely accessible web search engine that indexes the
full text of scholarly literature across an array of publishing formats
... the Google Scholar index includes most peer-reviewed online journals
of Europe and America's largest scholarly publishers,
plus scholarly books and other non-peer reviewed journals.
-- Google Scholar (WP)
Unfortunately, Google Scholar does not appear to distribute its data
or even to offer an online API for the service.
This wiki page is maintained by Rich Morin,
an independent consultant specializing in software design, development, and documentation.
Please feel free to email
comments, inquiries, suggestions, etc!