YAGO - Importing
Neo4j supports a variety of importing options, but only a couple
are appropriate for graphs as large as YAGO.
300 million triples turn into a lot
of nodes, relations, and properties,
so we need an approach that is both efficient and robust.
As discussed in Importing (Cautionary Tales),
I tried a number of alternatives before settling on the current approach.
Although I'm still not out of the woods, I have
to generate TSV files that the Superfast Batch Importer (SBI) can load.
The SBI, developed by Michael Hunger and S. Raghuram (Raghu),
takes TSV (tab-separated value) files as input.
Version 22 (now in Beta) can generate numeric IDs for nodes,
if unique name strings are provided in the input data.
My approach depends on this capability.
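
To make the idea concrete, here is a minimal Python sketch of what "generating numeric IDs from unique name strings" amounts to. It is only an illustration of the concept, not the SBI's actual implementation; the function names (`assign_ids`, `resolve_rels`) and the sample node names are hypothetical.

```python
def assign_ids(names):
    """Map each unique name string to a numeric ID, in first-seen order."""
    ids = {}
    for name in names:
        if name not in ids:
            # The next free ID is simply the number of names seen so far.
            ids[name] = len(ids)
    return ids


def resolve_rels(rels, ids):
    """Replace the name strings in (start, type, end) triples with numeric IDs."""
    return [(ids[start], rel_type, ids[end]) for start, rel_type, end in rels]


if __name__ == "__main__":
    # Hypothetical sample data; a repeated name gets the same ID.
    ids = assign_ids(["<Albert_Einstein>", "<Ulm>", "<Albert_Einstein>"])
    print(ids)
    print(resolve_rels([("<Albert_Einstein>", "wasBornIn", "<Ulm>")], ids))
```

Because the importer handles this mapping, the generated TSV files only need to name nodes consistently; they never have to carry numeric IDs themselves.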
To get more benefit from parallel parsing,
YAGO's Turtle files (
) are split into files (
containing no more than 10M lines each.
See the Optimization
page for details.
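
The splitting step could be sketched in Python roughly as follows. This is an assumption-laden illustration, not the script actually used here: the function name `split_lines` and the chunk naming scheme (source name plus a numeric suffix) are hypothetical. It also assumes that each Turtle statement sits on a single line, so that cutting at line boundaries is safe; any prefix declarations at the top of the file would end up only in the first chunk.

```python
import itertools
from pathlib import Path


def split_lines(src, out_dir, max_lines=10_000_000):
    """Split src into numbered chunk files of at most max_lines lines each."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with src.open("r", encoding="utf-8", newline="") as infile:
        for index in itertools.count():
            # Peek at one line so that no empty trailing chunk is created.
            first = next(infile, None)
            if first is None:
                break
            chunk = out_dir / f"{src.name}.{index:03d}"
            with chunk.open("w", encoding="utf-8", newline="") as outfile:
                outfile.write(first)
                # Stream the rest of this chunk without loading it into memory.
                outfile.writelines(itertools.islice(infile, max_lines - 1))
            chunks.append(chunk)
    return chunks
```

Each resulting chunk can then be handed to a separate parser process, which is where the parallelism pays off.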
- A Turtle parsing script (
parse_ttl) generates files on relations and nodes.
- The relation files (
*.ttl.*.rels) are sorted and uniqued into
- The node files (
*.ttl.*.nodes) are sorted and uniqued into
- The node sets in
d_nodes.t2 are merged (by
redo_nodes) into TSV rows.
- The header files (
h_*.tsv) are generated by the
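
The "sorted and uniqued" and "merged into TSV rows" steps in the list above can be sketched in Python as shown here. This is a small in-memory illustration under stated assumptions, not the code behind `redo_nodes`: the function names are hypothetical, and the line layout (node name, property key, and value separated by tabs) is assumed for the example rather than taken from the real intermediate files. At YAGO's scale, an external sort would be needed instead of `sorted(set(...))`.

```python
import heapq
import itertools


def sort_unique(lines):
    """Return the distinct lines in sorted order (similar to `sort -u`)."""
    return sorted(set(lines))


def merge_sorted_unique(*sorted_runs):
    """Merge already-sorted runs, dropping duplicates across runs."""
    merged = heapq.merge(*sorted_runs)
    # Duplicates are adjacent after merging, so groupby collapses them.
    return [line for line, _ in itertools.groupby(merged)]


def nodes_to_rows(sorted_lines):
    """Collapse sorted 'name<TAB>key<TAB>value' lines into one dict per node."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in sorted_lines)
    rows = []
    for name, group in itertools.groupby(parsed, key=lambda fields: fields[0]):
        row = {"name": name}
        for _, key, value in group:
            # A repeated key overwrites the earlier value in this sketch.
            row[key] = value
        rows.append(row)
    return rows
```

Sorting first is what makes the merge cheap: once all lines for a node are adjacent, a single pass is enough to build its row.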
The data flow looks something like this:
This wiki page is maintained by Rich Morin,
an independent consultant specializing in software design, development, and documentation.
Please feel free to email
comments, inquiries, suggestions, etc!