YAGO - Syntax

YAGO is based on RDF (Resource Description Framework), a family of specifications from the W3C (World Wide Web Consortium). The distribution I'm importing is encoded in Turtle (Terse RDF Triple Language).

Neo4j and Turtle both support Unicode, but a few minor character set issues must be handled, along with some namespace and related issues. This page is only a descriptive summary; see the conversion code for definitive information.

Issues

@base and @prefix

YAGO's Turtle (*.ttl) files use @base and @prefix directives, allowing them to shorten URIs in the RDF triples. Using the Unix command line, I did a simple and quick sanity check, making sure that there were no cross-file usage conflicts:

$ head -50 *.ttl | egrep '^@' | sort | uniq -c
25 @base         <http://yago-knowledge.org/resource/>         .
25 @prefix dbp:  <http://dbpedia.org/ontology/>                .
...

The @prefix notation is convenient, but not fully utilized in YAGO. So, I extended and regularized things a bit, changing @base to an explicit @prefix, adding more prefixes, etc:

dbpo:  http://dbpedia.org/ontology/
dbpr:  http://dbpedia.org/resource/
dbpy:  http://dbpedia.org/class/yago/
owl:   http://www.w3.org/2002/07/owl#
rdf:   http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:  http://www.w3.org/2000/01/rdf-schema#
skos:  http://www.w3.org/2004/02/skos/core#
wpen:  http://en.wikipedia.org/wiki/
xsd:   http://www.w3.org/2001/XMLSchema#
yago:  http://yago-knowledge.org/resource/

Note: yago appears to be used far more than any other prefix, so my tentative plan (to save space) is to make it the default.
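
As a rough illustration of how the prefix table above could be applied, here is a minimal Python sketch. It is not the conversion code itself; the names PREFIXES, compact, and expand are hypothetical, and the longest-match rule is an assumption that only matters if one namespace is ever nested inside another.

```python
# Prefix table copied from the list above.
# PREFIXES, compact, and expand are hypothetical names used for illustration.
PREFIXES = {
    "dbpo": "http://dbpedia.org/ontology/",
    "dbpr": "http://dbpedia.org/resource/",
    "dbpy": "http://dbpedia.org/class/yago/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "wpen": "http://en.wikipedia.org/wiki/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "yago": "http://yago-knowledge.org/resource/",
}


def compact(uri: str) -> str:
    """Shorten a full URI to prefix:local; return it unchanged if no namespace matches."""
    best = None
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            # Assumption: prefer the longest matching namespace.
            if best is None or len(namespace) > len(PREFIXES[best]):
                best = prefix
    if best is None:
        return uri
    return f"{best}:{uri[len(PREFIXES[best]):]}"


def expand(name: str) -> str:
    """Expand prefix:local to a full URI; return it unchanged if the prefix is unknown."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return name


print(compact("http://yago-knowledge.org/resource/hasLatitude"))
print(expand("rdfs:label"))
```

A full URI such as http://example.org/x passes through both functions unchanged, because "http" is not a key in the table.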

Identifiers

By default, Cypher identifiers use a rather restricted character set:

/^[A-Za-z_][0-9A-Za-z_]*$/

This conflicts with YAGO's use of colons (eg, yago:hasLatitude) and could conflict with other (eg, Unicode) characters, as well. However, the only YAGO-based identifiers I'm using are names of relations and properties. These are derived from my expansions of YAGO predicates, so I can ensure that there are no character set problems.
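
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page does not say how the expanded predicates are turned into identifiers, so replacing each disallowed character with an underscore is a guess, and the function names are hypothetical.

```python
import re

# Unquoted identifier pattern: a letter or underscore, then letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][0-9A-Za-z_]*")


def is_plain_identifier(name: str) -> bool:
    """Return True if name fits the restricted character set."""
    return IDENTIFIER.fullmatch(name) is not None


def to_identifier(name: str) -> str:
    """Map a prefixed name to a plain identifier.

    Assumption: every disallowed character (including the colon) becomes "_",
    and a leading underscore is added if the result would start with a digit.
    """
    candidate = re.sub(r"[^0-9A-Za-z_]", "_", name)
    if not re.match(r"[A-Za-z_]", candidate):
        candidate = "_" + candidate
    return candidate


print(is_plain_identifier("yago:hasLatitude"))
print(to_identifier("yago:hasLatitude"))
```

Note that this mapping is lossy: two different names can collapse to the same identifier, so a real conversion would need to check for collisions.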

Provenance

YAGO2s is divided into 25 "themes" (eg, yagoGeonamesData), each of which has a unique provenance (ie, nature, origin). I may capture this information (eg, in a theme property).

Some YAGO names (eg, geoclass_reefs) have a prefix (geoclass) which further characterizes their provenance. I may split off this information in a future revision.

Values

RDF's literal objects can contain both values and metadata (eg, units). I plan to strip off the metadata and store it in a companion property (*_M), eg:

RDF Literal                 prop    prop_M
-----------                 ----    ------
42, "42"^^xsd:integer       42      'number'
"1.85"^^<m>                 1.85    'number (^^<m>)'
"1.2"^^xsd:a                1.2     'number (^^xsd:a)'
"a"@b                       'a'     'string (@b)'

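The value/metadata split in the table above could be sketched as follows. This is a simplified Python illustration, not the conversion code: the function names are hypothetical, the parsing ignores escaped quotes and other Turtle details, and treating xsd:integer as the only datatype that collapses to a bare 'number' is an assumption drawn from the first table row.

```python
import re

# A quoted literal with an optional ^^datatype or @language suffix.
LITERAL = re.compile(r'^"(?P<text>.*)"(?P<suffix>\^\^\S+|@\S+)?$', re.DOTALL)
NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

# Assumption: only this datatype is dropped from the metadata (first table row).
PLAIN_NUMBER_TYPES = {"^^xsd:integer"}


def to_number(text):
    """Convert numeric text to int, or to float if it has a fraction or exponent."""
    return float(text) if any(c in text for c in ".eE") else int(text)


def split_literal(literal):
    """Return (prop, prop_M) for a Turtle literal, following the table above."""
    match = LITERAL.match(literal)
    if match is None:
        # Bare token such as 42 (no quotes, no suffix).
        text, suffix = literal, None
    else:
        text, suffix = match.group("text"), match.group("suffix")

    # A language tag always marks a string.
    if suffix is not None and suffix.startswith("@"):
        return text, f"string ({suffix})"

    if NUMBER.fullmatch(text):
        if suffix is None or suffix in PLAIN_NUMBER_TYPES:
            return to_number(text), "number"
        return to_number(text), f"number ({suffix})"

    # Anything else is kept as a string, with its suffix if it has one.
    return text, "string" if suffix is None else f"string ({suffix})"


for sample in ['42', '"42"^^xsd:integer', '"1.85"^^<m>', '"1.2"^^xsd:a', '"a"@b']:
    print(sample, split_literal(sample))
```

The two return values correspond to the prop and prop_M columns; a caller would store them as a property and its *_M companion.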

This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: 04 Apr 2016, RichMorin