Data Sets


Although Ontiki is focused on Wikipedia (WP), it uses an indirect approach. Specifically, it relies on other data sets, such as DBpedia and YAGO, to provide information about (or complementary to) WP.

Data Flow


  • Solid line - implemented connection

  • Dashed line - planned connection

  • Dotted line - hoped-for connection

Data Formats


DBpedia supports a variety of download distributions and formats. See DBpedia for Ontiki-specific details and notes.


YAGO supports two download formats: Turtle and TSV.

We currently load the Turtle format into Neo4j using our own data preparation suite (it_prep) and the Neo4j Import Tool (the neo4j-import command).

The TSV-to-PostgreSQL load script creates a single database table, yagoFacts, with four columns: id, subject, predicate, and object.
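As a rough illustration of the four-column yagoFacts layout, here is a minimal sketch. It is not the actual load script: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and it assumes each TSV line holds exactly the id, subject, predicate, and object fields. The sample rows and the load_facts helper are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows; real YAGO TSV files may differ in detail.
SAMPLE_TSV = (
    "<id_1>\t<Albert_Einstein>\t<wasBornIn>\t<Ulm>\n"
    "<id_2>\t<Ulm>\t<isLocatedIn>\t<Germany>\n"
)


def load_facts(conn, tsv_file):
    """Create the yagoFacts table and load four-column TSV rows into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS yagoFacts "
        "(id TEXT, subject TEXT, predicate TEXT, object TEXT)"
    )
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep the first four fields; skip lines that have fewer than four.
    rows = [tuple(row[:4]) for row in reader if len(row) >= 4]
    conn.executemany("INSERT INTO yagoFacts VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# SQLite in-memory database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
count = load_facts(conn, io.StringIO(SAMPLE_TSV))

# Look up a single fact by subject and predicate.
born = conn.execute(
    "SELECT object FROM yagoFacts WHERE subject = ? AND predicate = ?",
    ("<Albert_Einstein>", "<wasBornIn>"),
).fetchall()
```

Keeping everything in one subject/predicate/object table makes loading simple. Queries that follow chains of facts would need self-joins on that table.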

See the YAGO web for Ontiki-specific details and notes.

Wish List

My wish list for Ontiki contains a number of ponies and unicorns. YMMV...

Papers, etc.

Several collections of scientific papers approach the scale of WP. Although their audiences are much smaller than WP's, improving navigation on them would still be useful. In addition, the collections have much richer metadata than WP, including:

  • explicit linkage (citations, ...)

  • index information (categories, keywords, ...)

  • provenance (author, date, institution, ...)

Finally, it might be interesting and worthwhile to fold some of these collections into Ontiki. Users could then "drill down" from WP pages into relevant papers, use WP to provide simple definitions and cross-field connections, etc.
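To make the metadata categories above a bit more concrete, here is a small sketch of how a paper record might be modeled and matched to a WP topic. Nothing here reflects an existing Ontiki design: the Paper class, its field names, the papers_for_topic helper, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Paper:
    """Hypothetical record covering the metadata categories listed above."""
    title: str
    # Provenance
    authors: List[str]
    date: str
    institution: str = ""
    # Index information
    keywords: Set[str] = field(default_factory=set)
    # Explicit linkage: titles of the papers this one cites
    citations: List[str] = field(default_factory=list)


def papers_for_topic(papers, topic):
    """Return the papers whose keywords include the topic (case-insensitive)."""
    wanted = topic.lower()
    return [p for p in papers if wanted in {k.lower() for k in p.keywords}]


# Made-up sample data, for illustration only.
papers = [
    Paper("Paper A", ["Author One"], "2014-01-01",
          keywords={"graph theory", "navigation"}),
    Paper("Paper B", ["Author Two"], "2013-06-15",
          keywords={"statistics"}, citations=["Paper A"]),
]

# "Drill down" from a WP page title to the papers tagged with that topic.
matches = papers_for_topic(papers, "Graph Theory")
```

A real system would presumably need something more robust than exact keyword matching, but the sketch shows how index information could connect a WP page to related papers.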


arXiv

The arXiv (pronounced "archive", as if the "X" were the Greek letter Chi, χ) is a repository of electronic preprints, known as e-prints, of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online. In many fields of mathematics and physics, almost all scientific papers are self-archived on the arXiv.

-- arXiv (WP)

BioMed Central, etc.

BioMed Central (BMC) is a United Kingdom-based, for-profit scientific publisher specialising in open access journal publication. BioMed Central and its sister companies Chemistry Central and PhysMath Central publish over 200 scientific journals. Most BioMed Central journals are now published only online. BioMed Central describes itself as the first and largest open access science publisher. It is owned by Springer Science+Business Media.

-- BioMed Central (WP)

Public Library of Science

PLOS (for Public Library of Science) is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license. It launched its first journal, PLOS Biology, in October 2003 and publishes seven journals, all peer reviewed, as of April 2012.

-- PLOS (WP)

Citation Indexes

Although there are a number of citation indexes, most are proprietary and require payment for access. Given the size of the effort involved, this isn't surprising, but it is a bit disappointing.


CiteSeer

CiteSeer was a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science that has been replaced by CiteSeerX. Many consider it to be the first academic paper search engine.

-- CiteSeer (WP)

CiteSeerX is a happy exception to the generally restrictive nature of citation index distribution:

CiteSeerX data and metadata are available for others to use. Data available includes CiteSeerX metadata, databases, data sets of PDF files, and text of PDF files.

-- CiteSeerX Data


GetCITED

GetCITED was a website database that listed publication and citation information on academic articles whose information was entered by members. It aimed to include journal articles, book chapters and other publications, both peer-reviewed and non-reviewed. The objective was to make this information publicly available, as such information is held in restricted databases. It indexed over 3,000,000 publications from over 300,000 authors. Its last copyright amendment was in 2013 and it has largely been supplanted by other tools including Google Scholar. The website ceased to function in mid-2014.

-- GetCITED (WP)

Because the GetCITED data set was still being maintained in 2013, it might still be useful if it is available for download.

Google Scholar

Google Scholar is a freely accessible web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. ... the Google Scholar index includes most peer-reviewed online journals of Europe and America's largest scholarly publishers, plus scholarly books and other non-peer reviewed journals.

-- Google Scholar (WP)

Unfortunately, Google Scholar does not appear to distribute its data or even make the service available via online APIs.

This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r13 - 14 Jan 2015, RichMorin