ETL Data Flow

This is a high-level overview of Ontiki's ETL (extract, transform, load) data flow. Currently, only YAGO's Turtle distribution is supported. See Data Sets for other possibilities.


Here are some Cypher renderings of the Neo4j data structure we want to create. This pattern tells us that the wiki category and YAGO DBpedia class of Danish ecologists are equivalent. First, in idiomatic Cypher:

# ( subj ) -[ :Pred ]-> ( obj )

( yago_wikicategory_Danish_ecologists {        # subj
  _label: 'yago',
  name:   'wikicategory_Danish_ecologists',
  ns:     'yago'
} )

-[ :Equivalent_class {                         # pred
  ns:     'owl',
  src:    'yagoDBPediaClasses'
} ]->

( dbpy_DanishEcologists {                      # obj
  _label: 'dbpy',
  name:   'DanishEcologists',
  ns:     'dbpy'
} )

Now, as an Ontiki claim, with the subj and obj as defined above:

# ( subj ) -[ :Claim_in ]-> ( pred ) -[ :Claim_to ]-> ( obj )

( subj ) -[ :Claim_in  ]-> ( yago_Equivalent_class {
  _label: 'claim',
  name:   '53989cfa-164b-4d8b-a2e3-153f5f458bf7',
  ns:     'uuid',
  src:    'yagoDBPediaClasses'
} ) -[ :Claim_to ]-> ( obj )


The examples below are simplified somewhat from the actual syntax, e.g.:

  • Fields are separated by white space.

  • Sharp signs (#) are used to begin comments.

  • Records start at the left-hand margin; continuation lines are indented.
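
The conventions above are enough to read the simplified listings mechanically. The following Python sketch is only an illustration of those three rules; parse_records is a hypothetical helper, not part of Ontiki, and it assumes that a sharp sign never occurs inside a field.

```python
def parse_records(text):
    """Parse a simplified listing into records.

    Each record is a list of lines, and each line is a list of
    whitespace-separated fields. Assumption: '#' never appears
    inside a field, so everything after it is a comment.
    """
    records = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop the comment, if any
        if not line.strip():
            continue                          # blank or comment-only line
        fields = line.split()                 # fields are separated by white space
        if line[0].isspace() and records:
            records[-1].append(fields)        # indented: continuation of the last record
        else:
            records.append([fields])          # left-hand margin: a new record
    return records


sample = "# a comment\nnode_a  _label  claim  # trailing comment\n  src  somewhere\nnode_b\n"
records = parse_records(sample)
```

Here records holds two entries: one for node_a (with a continuation line) and one for node_b.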


We use YAGO's Turtle distribution, which contains ~300M records (specifically, RDF N-Triples) in ~30 files.

# .../yagoDBPediaClasses.ttl

<wikicategory_Danish_ecologists>  owl:equivalentClass
  <>  .
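
As a rough illustration of how one such record might be taken apart, here is a minimal Python sketch. It is not the parse_ttl script: split_triple and rel_name are hypothetical helpers, the object name in the usage line is a placeholder, and the rule for turning owl:equivalentClass into the relation name Equivalent_class is inferred from the single example on this page. The sketch assumes that no term contains white space, so it would not handle quoted literals.

```python
import re


def split_triple(record):
    """Split one simplified triple record into (subject, predicate, object).

    Assumption: terms contain no white space (no quoted literals).
    """
    terms = record.split()
    if terms and terms[-1] == ".":
        terms.pop()                           # drop the terminating dot
    if len(terms) != 3:
        raise ValueError("expected three terms, got %d" % len(terms))
    # Strip the angle brackets from <...> terms; leave prefixed names alone.
    return tuple(
        t[1:-1] if t.startswith("<") and t.endswith(">") else t
        for t in terms
    )


def rel_name(predicate):
    """Map a prefixed predicate to (namespace, relation name).

    Assumed rule: camelCase becomes snake_case with an initial capital.
    """
    ns, sep, local = predicate.partition(":")
    if not sep:                               # no prefix present
        ns, local = "", predicate
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", local).lower()
    return ns, snake.capitalize()


# 'example_object' is a placeholder, not a value from the YAGO data.
triple = split_triple(
    "<wikicategory_Danish_ecologists>  owl:equivalentClass\n  <example_object>  ."
)
ns_and_name = rel_name(triple[1])
```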


The parse_ttl script generates a file of node information (e.g., properties), based on the triples it receives. We don't have enough room in memory to keep track of the nodes we've seen, so we repeat the node name in each line. Later, we'll use the Unix sort -u command to create a sorted, unique set of lines.

Our example triple requires three Neo4j nodes. The first two represent the subject and object; the third supports a claim:

# .../yagoDBPediaClasses.ttl.nodes

yago_wikicategory_Danish_ecologists        # subject

dbpy_DanishEcologists                      # object

uuid_53989cfa-164b-4d8b-a2e3-153f5f458bf7  # claim
uuid_53989cfa-164b-4d8b-a2e3-153f5f458bf7  _label  claim
uuid_53989cfa-164b-4d8b-a2e3-153f5f458bf7  src     yagoDBPediaClasses
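
The emit-then-deduplicate approach can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the parse_ttl implementation: node_lines and sort_unique are hypothetical names, the node names are passed in ready-made, and sort_unique only mimics sort -u in memory (on the full data set the real command works on files, and its ordering depends on the locale).

```python
import uuid


def node_lines(subj, obj, src, claim_id=None):
    """Return the node lines needed for one triple.

    The node name is repeated on every line, so no memory of
    previously seen nodes is needed while emitting.
    """
    claim_id = claim_id or str(uuid.uuid4())  # fresh UUID unless one is supplied
    claim = "uuid_" + claim_id
    return [
        subj,                                 # subject node
        obj,                                  # object node
        claim,                                # claim node
        claim + "  _label  claim",
        claim + "  src     " + src,
    ]


def sort_unique(lines):
    """In-memory stand-in for 'sort -u' (code point order, as with LC_ALL=C)."""
    return sorted(set(lines))


cid = "53989cfa-164b-4d8b-a2e3-153f5f458bf7"
once = node_lines("yago_wikicategory_Danish_ecologists",
                  "dbpy_DanishEcologists", "yagoDBPediaClasses", cid)
unique = sort_unique(once + once)             # duplicates collapse to five lines
```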


To support a hybrid claim, we need to define three relations. The first connects the subject and object nodes directly. The last two (Claim_in, Claim_to) connect the claim node to the subject and object nodes.

# .../yagoDBPediaClasses.ttl.rels

  Equivalent_class  dbpy_DanishEcologists
  ns  owl
  src  yagoDBPediaClasses

  Claim_in  53989cfa-164b-4d8b-a2e3-153f5f458bf7

  Claim_to  dbpy_DanishEcologists
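
For illustration, the three relations of a hybrid claim could be assembled as follows. This Python sketch is an assumption-laden outline, not Ontiki code: hybrid_claim_rels is a hypothetical helper, and it returns (source, type, target, properties) tuples rather than reproducing the .rels file layout. The direction of Claim_in and Claim_to follows the claim pattern ( subj ) -[ :Claim_in ]-> ( claim ) -[ :Claim_to ]-> ( obj ).

```python
def hybrid_claim_rels(subj, obj, rel_type, ns, src, claim):
    """Return the three relations that make up one hybrid claim.

    Each relation is a (source, type, target, properties) tuple.
    """
    return [
        (subj, rel_type, obj, {"ns": ns, "src": src}),  # direct subject-to-object relation
        (subj, "Claim_in", claim, {}),                   # subject into the claim node
        (claim, "Claim_to", obj, {}),                    # claim node on to the object
    ]


rels = hybrid_claim_rels(
    subj="yago_wikicategory_Danish_ecologists",
    obj="dbpy_DanishEcologists",
    rel_type="Equivalent_class",
    ns="owl",
    src="yagoDBPediaClasses",
    claim="uuid_53989cfa-164b-4d8b-a2e3-153f5f458bf7",
)
```

Keeping the direct relation alongside the claim pair means that simple queries can follow one edge, while provenance-aware queries can go through the claim node.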

This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r3 - 11 Jan 2015, RichMorin