YAGO (Yet Another Great Ontology)
is an interesting knowledge base,
linking a large number of (mostly English) words to concepts, related words,
and specific entities (eg, items, locations, people).
YAGO appears to have great promise as infrastructure
for knowledge mining, recommendation engines, etc.
I also hope to use it as a testbed for some ideas
on interactive exploration of large-scale graphs (eg, Wikipedia).
Several related resources are worth noting; the first two are described here in their own words:
"DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data."
"WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations."
YAGO-SUMO integrates YAGO with
SUMO (the Suggested Upper Merged Ontology),
adding some axiomatic rigor.
Conveniently, SUMO has also been linked to WordNet.
YAGO2s (version 2.5.3) is distributed in two formats
(TSV and Turtle).
Each of these is archived and compressed into a 2.2 GB file.
Uncompressing the Turtle archive yields a large folder (~18.5 GB)
containing 25 (*.ttl) files.
The files contain about 300 million (!) triples.
See Contents for details.
Although there are millions of subjects and objects in YAGO,
no more than 37 unique predicates are used in any single file.
Indeed, the entire ontology uses only 103 unique predicates.
See Predicates for the full list.
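As a rough illustration of how such predicate counts could be gathered, here is a minimal Python sketch. It assumes a simplified tab-separated layout of subject, predicate, object (the real YAGO files may carry extra columns), and the sample rows are made up for illustration rather than copied from YAGO.

```python
from collections import Counter

def count_predicates(lines):
    """Count how often each predicate appears in tab-separated triples."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumed column layout: subject, predicate, object
        counts[fields[1]] += 1
    return counts

# Illustrative rows only; not taken from the actual YAGO distribution.
sample = [
    "<Gaviota_State_Park>\t<isLocatedIn>\t<California>\n",
    "<Gaviota_State_Park>\t<linksTo>\t<California>\n",
    "<California>\t<isLocatedIn>\t<United_States>\n",
]

counts = count_predicates(sample)
print("unique predicates:", len(counts))
for predicate, n in counts.most_common():
    print(predicate, n)
```

Because the function accepts any iterable of lines, it could be handed an open file object, so a multi-gigabyte file would be streamed rather than loaded into memory.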
Using one of the YAGO demo pages, we can explore the ontology a bit.
To avoid being overwhelmed, let's pick a relatively obscure topic
(Gaviota State Park) as our starting point.
The interactive diagram shows each triple that mentions the topic,
relating it to other topics and values via attributed edges.
See Exploration for details.
There are a number of open source Semantic Web frameworks
(eg, Jena) which can import a set of RDF triples,
support SPARQL queries on it, etc.
However, it's not clear that any of these is a good match for the projects I have in mind.
Specifically, I'd like a tool with a more convenient and powerful query language,
faster graph traversal, etc.
Neo4j appears to be such a tool.
It supports a pattern-based query language (Cypher)
and uses index-free adjacency (each node holds direct references to its neighbors).
This should provide convenient, flexible, and rapid graph traversal.
Neo4j can (reportedly) handle billions of nodes and edges with ease.
So, I'm working on the problem of importing YAGO into Neo4j.
My approach (detailed in Neo4j Mapping) is as follows:
Each RDF subject becomes a Neo4j node.
Some RDF objects (eg, numbers, dates) become node properties.
Most other objects become nodes.
The actual mapping is determined by a control table,
similar to the one shown in Predicates.
So, for example, hasArea and linksTo might be handled as follows:
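One way a control table could drive this mapping is sketched below in Python. Everything named here is an assumption made for illustration: the `CONTROL` table layout, the `Entity` label, the `name` and `area` property keys, and the `LINKS_TO` relationship type are hypothetical, and the area value in the usage lines is a placeholder. The function only builds parameterized Cypher text; it does not talk to a Neo4j server.

```python
# Hypothetical control table: predicate -> (kind, Neo4j name).
#   "property"     : the RDF object is stored as a property on the subject node
#   "relationship" : the RDF object becomes its own node, linked to the subject
CONTROL = {
    "hasArea": ("property", "area"),
    "linksTo": ("relationship", "LINKS_TO"),
}

def to_cypher(subject, predicate, obj):
    """Return (query, params) for one triple, as directed by CONTROL."""
    kind, name = CONTROL[predicate]
    if kind == "property":
        # Literal values (numbers, dates) are attached to the subject node.
        query = f"MERGE (s:Entity {{name: $subject}}) SET s.{name} = $value"
        return query, {"subject": subject, "value": obj}
    # Other objects become nodes joined to the subject by a relationship.
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{name}]->(o)"
    )
    return query, {"subject": subject, "object": obj}

# Placeholder values, for illustration only.
area_query, area_params = to_cypher("Gaviota_State_Park", "hasArea", 1.0)
link_query, link_params = to_cypher("Gaviota_State_Park", "linksTo", "California")
print(area_query, area_params)
print(link_query, link_params)
```

Using `MERGE` rather than `CREATE` keeps a node from being duplicated when the same subject appears in many triples, and passing values as parameters avoids quoting problems in entity names.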