YAGO

Overview

YAGO (Yet Another Great Ontology) is a knowledge base that links a large number of (mostly English) words to concepts, related words, and specific entities (eg, items, locations, people). It shows promise as infrastructure for knowledge mining, recommendation engines, etc. I also hope to use it as a testbed for some ideas on interactive exploration of large-scale graphs (eg, Wikipedia).

Origins

YAGO is a huge semantic knowledge base (ie, ontology) of about 300 million facts, developed at the Max Planck Institute for Informatics. It is mechanically generated, using rule-based software, from three other knowledge bases:

  • DBpedia

    "DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data."

  • GeoNames

    "The GeoNames geographical database covers all countries and contains over eight million placenames ..."

  • WordNet

    "WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations."

YAGO-SUMO integrates YAGO with SUMO (the Suggested Upper Merged Ontology), adding some axiomatic rigor. Conveniently, SUMO has also been linked to WordNet.

Contents

YAGO2s (version 2.5.3) is distributed in two formats (TSV and Turtle). Each of these is archived and compressed into a 2.2 GB file. Uncompressing the Turtle archive yields a large folder (~18.5 GB) containing 25 (*.ttl) files. The files contain about 300 million (!) triples. See Contents for details.

Predicates

Although there are millions of subjects and objects in YAGO, no more than 37 unique predicates are used in any single file. Indeed, the entire ontology uses only 103 unique predicates. See Predicates for the full list.
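Counts like these are easy to reproduce once the triples have been parsed. Here is a minimal Python sketch, assuming the parsing step has already produced (file name, subject, predicate, object) tuples. The file names and sample rows are made up for illustration; they are not actual YAGO data.

```python
from collections import defaultdict

def predicates_by_file(rows):
    """Map each file name to the set of predicates seen in that file.

    `rows` yields (file_name, subject, predicate, object) tuples.
    """
    seen = defaultdict(set)
    for file_name, _subject, predicate, _obj in rows:
        seen[file_name].add(predicate)
    return dict(seen)

def summarize(seen):
    """Return (largest per-file predicate count, overall predicate count)."""
    per_file_max = max((len(preds) for preds in seen.values()), default=0)
    overall = len(set().union(*seen.values())) if seen else 0
    return per_file_max, overall

# Made-up sample rows (hypothetical file names and facts).
sample = [
    ("facts.ttl",  "Gaviota_State_Park", "linksTo", "California"),
    ("facts.ttl",  "Gaviota_State_Park", "hasArea", "1.128E7"),
    ("facts.ttl",  "California",         "linksTo", "Pacific_Ocean"),
    ("labels.ttl", "California",         "label",   "California"),
]

per_file_max, overall = summarize(predicates_by_file(sample))
```

Keeping one set per file costs very little memory, because the number of distinct predicates is tiny compared to the number of triples.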

Exploration

Using one of the YAGO demo pages, we can explore the ontology a bit. To avoid being overwhelmed, let's pick a relatively obscure topic (Gaviota State Park) as our starting point. The interactive diagram shows each triple that mentions the topic, relating it to other topics and values via attributed edges. See Exploration for details.
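The same kind of view can be approximated offline. The following Python sketch collects every triple that mentions a topic, split into outgoing and incoming edges. It assumes the triples are already available as (subject, predicate, object) tuples; the sample facts are illustrative, not copied from YAGO, and this is not how the demo page itself is implemented.

```python
def neighborhood(triples, topic):
    """Return (outgoing, incoming) edges for `topic`.

    outgoing: (predicate, object) pairs where `topic` is the subject.
    incoming: (subject, predicate) pairs where `topic` is the object.
    """
    outgoing = [(p, o) for s, p, o in triples if s == topic]
    incoming = [(s, p) for s, p, o in triples if o == topic]
    return outgoing, incoming

# Illustrative triples only (not real YAGO facts).
triples = [
    ("Gaviota_State_Park", "hasArea", "1.128E7"),
    ("Gaviota_State_Park", "linksTo", "California"),
    ("California",         "linksTo", "Gaviota_State_Park"),
    ("California",         "linksTo", "Pacific_Ocean"),
]

outgoing, incoming = neighborhood(triples, "Gaviota_State_Park")
```

A linear scan like this is fine for a small extract, but it would be far too slow to run repeatedly over the full set of triples; that is part of the motivation for loading the data into a graph database.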

Installation

There are a number of open source Semantic Web frameworks (eg, Jena) which can import a set of RDF triples, support SPARQL queries on it, etc. However, it's not clear that any of these is a good match for my imagined projects. Specifically, I'd like a tool with a more convenient and powerful query language, faster graph traversal, etc.

Neo4j appears to be such a tool. It supports a pattern-based query language (Cypher) and stores the graph using index-free adjacency, so each node refers directly to its neighbors. This should provide convenient, flexible, and rapid graph traversal. Neo4j can (reportedly) handle billions of nodes and edges with ease. So, I'm working on the problem of importing YAGO into Neo4j.

My approach (detailed in Neo4j Mapping) is as follows:

  • Each RDF subject becomes a Neo4j node.

  • Some RDF objects (eg, numbers, dates) become node properties.

  • Most other objects become nodes.

The actual mapping is determined by a control table, similar to the one shown in Predicates. So, for example, hasArea and linksTo might be handled as follows:

( {
  node:         'Gaviota_State_Park',
  has_area:     '1.128E7',
  has_area_M:   'number (m^2)'
} ) -[ :Links_to ]-> ( {
  node:         'California'
} )
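To make the control-table idea concrete, here is a rough Python sketch of the mapping step. The table layout, the "kind" / "name" / "meta" keys, and the function name are all hypothetical; the real control table may look quite different. The sketch only builds in-memory node and edge structures and does not talk to Neo4j.

```python
# Hypothetical control table: RDF predicate -> mapping rule.
CONTROL = {
    "hasArea": {"kind": "property", "name": "has_area", "meta": "number (m^2)"},
    "linksTo": {"kind": "edge",     "name": "Links_to"},
}

def map_triples(triples, control=CONTROL):
    """Turn (subject, predicate, object) triples into nodes and edges.

    Returns (nodes, edges), where `nodes` maps a node name to its
    property dict and `edges` is a list of (source, type, target).
    """
    nodes = {}
    edges = []
    for subject, predicate, obj in triples:
        # Each RDF subject becomes a node.
        node = nodes.setdefault(subject, {"node": subject})
        rule = control.get(predicate)
        if rule is None:
            continue  # predicates without a rule are skipped in this sketch
        if rule["kind"] == "property":
            # Literal objects (eg, numbers, dates) become node properties,
            # with a companion "_M" property describing the value.
            node[rule["name"]] = obj
            node[rule["name"] + "_M"] = rule["meta"]
        else:
            # Other objects become nodes, linked by a typed edge.
            nodes.setdefault(obj, {"node": obj})
            edges.append((subject, rule["name"], obj))
    return nodes, edges

nodes, edges = map_triples([
    ("Gaviota_State_Park", "hasArea", "1.128E7"),
    ("Gaviota_State_Park", "linksTo", "California"),
])
```

Driving the mapping from a table keeps the per-predicate decisions (property vs. node, naming, units) in one place, so they can be revised without touching the loader code.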

Once I have things loaded, I expect to use Cypher to explore the graph and help me decide what ancillary indexes and links I'll need. See Installation for details.


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r63 - 04 Apr 2016, RichMorin