YAGO - Importing

Neo4j supports a variety of import options, but only a couple are appropriate for graphs as large as YAGO. YAGO's 300 million triples turn into a lot of nodes, relations, and properties, so we need an approach that is both efficient and robust.

As discussed in Importing (Cautionary Tales), I tried a number of alternatives before settling on the current approach. Although I'm still not out of the woods, I have been able to generate TSV files that the Superfast Batch Importer (SBI) can load.

Current Approach

The SBI, developed by Michael Hunger and S. Raghuram (Raghu), takes TSV (tab-separated value) files as input. Version 22 (now in Beta) can generate numeric IDs for nodes, if unique name strings are provided in the input data. My approach depends on this capability.
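To make the ID-generation idea concrete, here is a minimal Python sketch of mapping unique name strings to numeric IDs and then rewriting relations in terms of those IDs. It illustrates the concept only and is not the SBI's code. The function names (assign_ids, resolve_rels), the zero-based numbering, and the (start, end, type) tuple layout are all assumptions made for the example.

```python
def assign_ids(names):
    """Give each distinct name a numeric ID, in order of first appearance.

    Assumption: IDs start at 0; the real importer may number differently.
    """
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids)
    return ids


def resolve_rels(rels, ids):
    """Replace the name strings in (start, end, type) tuples with numeric IDs."""
    return [(ids[start], ids[end], rel_type) for start, end, rel_type in rels]
```

For example, assign_ids(["b", "a", "b"]) maps "b" to 0 and "a" to 1, so a relation from "a" to "b" resolves to the ID pair (1, 0). The point is that the input files only need to carry unique names; the numeric IDs can be derived later.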

Overview

To gain the benefits of parallel parsing, YAGO's Turtle files (*.ttl) are split into files (*.ttl.*) containing no more than 10M lines each. See the Optimization page for details.
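The splitting step can be sketched in Python as follows. This is a rough illustration, not the script actually used here: the function name split_by_lines and the three-digit numeric suffix are assumptions, and it assumes that a line-based split leaves each piece parseable on its own.

```python
import itertools
from pathlib import Path


def split_by_lines(src, max_lines=10_000_000):
    """Split src into src.000, src.001, ... with at most max_lines lines each.

    Lines are streamed, so a whole chunk is never held in memory.
    Returns the list of part paths, in order.
    """
    src = Path(src)
    parts = []
    with src.open("r", encoding="utf-8") as fin:
        for index in itertools.count():
            chunk = itertools.islice(fin, max_lines)
            first = next(chunk, None)
            if first is None:
                break  # input exhausted, no empty trailing file is created
            part = src.with_name(f"{src.name}.{index:03d}")
            with part.open("w", encoding="utf-8") as fout:
                fout.write(first)
                fout.writelines(chunk)  # the remaining max_lines - 1 lines
            parts.append(part)
    return parts
```

Calling split_by_lines("sample.ttl", max_lines=10) on a 25-line file would produce sample.ttl.000, sample.ttl.001, and sample.ttl.002, holding 10, 10, and 5 lines. Each part can then be handed to a separate parser process.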

  • A Turtle parsing script (parse_ttl) generates relation and node files from each split file.

  • The relation files (*.ttl.*.rels) are sorted and uniqued into d_rels.tsv.

  • The node files (*.ttl.*.nodes) are sorted and uniqued into d_nodes.t2.

  • The node sets in d_nodes.t2 are merged (by redo_nodes) into TSV rows.

  • The header files (h_*.tsv) are generated by the gen_hdrs script.
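The "sort and unique" and "merge" steps above can be sketched in Python as shown below. This is only an illustration under stated assumptions: the actual layout of d_nodes.t2 and the behavior of redo_nodes are not described on this page, so the sketch assumes each node line has the form name, key, value (tab-separated) and that merging means collecting all the lines for one name into a single TSV row. The function names and the columns parameter are hypothetical. At YAGO's scale the sorting would need an external sort (for instance the Unix sort -u command) rather than an in-memory one.

```python
import itertools


def sort_unique(lines):
    """Return the distinct lines in sorted order (in-memory; small inputs only)."""
    return sorted(set(lines))


def merge_node_rows(sorted_lines, columns):
    """Merge consecutive "name<TAB>key<TAB>value" lines into one row per name.

    Assumption: input is sorted by name, so each node's lines are adjacent.
    Output rows are "name<TAB>v1<TAB>v2...", with one value per entry in
    columns and an empty string where a node has no value for a column.
    """
    fields = (line.rstrip("\n").split("\t") for line in sorted_lines)
    rows = []
    for name, group in itertools.groupby(fields, key=lambda f: f[0]):
        props = {}
        for _, key, value in group:
            props.setdefault(key, value)  # keep the first value seen per key
        rows.append("\t".join([name] + [props.get(c, "") for c in columns]))
    return rows
```

Because the input is sorted first, the merge needs only a single pass and never has to keep more than one node's properties in memory.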

The data flow looks something like this:


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r17 - 04 Apr 2016, RichMorin