DBpedia

Overview

DBpedia supports a variety of download distributions and formats:

The downloads are provided as N-Triples and N-Quads, where the N-Quads version contains additional provenance information for each statement. All files are bzip2 packed. In addition to the RDF version of the data, we also provide a tabular version of some of the core DBpedia data sets as CSV and JSON files.

-- http://wiki.dbpedia.org/Downloads2014

For each class in the DBpedia ontology (such as Person, Radio Station, Ice Hockey Player, or Band), we provide a single CSV/JSON file which contains all instances of this class. Each instance is described by its URI, an English label and a short abstract, the mapping-based infobox data describing the instance (extracted from the English edition of Wikipedia), geo-coordinates, and external links.

-- http://wiki.dbpedia.org/DBpediaAsTables

DBpedia is considered the Semantic Web mirror of Wikipedia. Over time, Wikipedia articles are revised, which makes the data in DBpedia outdated. The main objective of DBpedia Live is to keep DBpedia always in synchronization with Wikipedia.

-- http://wiki.dbpedia.org/DBpediaLive

Finally, it appears that the entire working data of DBpedia is available for download, should one be so inclined.
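
Because the RDF downloads are bzip2-packed N-Triples files, they can be read as a stream without unpacking them first. The following is a minimal sketch, not a full N-Triples parser: the names parse_triples and read_bz2_triples are made up for this page, and the regular expression assumes one well-formed statement per line.

```python
import bz2
import re

# Simplified pattern for one N-Triples statement: subject, predicate, and
# object, terminated by " ." (an assumption; this is not a complete grammar).
TRIPLE_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')


def parse_triples(lines):
    """Yield (subject, predicate, object) strings from N-Triples lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def read_bz2_triples(source):
    """Stream triples from a bzip2-packed file (a path or a binary file object)."""
    with bz2.open(source, 'rt', encoding='utf-8') as handle:
        yield from parse_triples(handle)
```

For example, list(read_bz2_triples('homepages_en.nt.bz2')) would load a whole file into memory, assuming the download has that name locally. For the larger data sets it is probably wiser to iterate over the generator instead.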

Dataset Descriptions

This information was adapted from the DBpedia 2014 Downloads page.

DBpedia Ontology

The DBpedia ontology is defined in Web Ontology Language (OWL). Specifically, the file dbpedia_2014.owl contains ~30K lines of Resource Description Framework (RDF), encoded in Extensible Markup Language (XML). See DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia for details.
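
Since the ontology file is RDF encoded in XML, a standard XML parser is enough for a first look at the class hierarchy. The sketch below assumes that classes appear as owl:Class elements with an rdf:about attribute and an optional rdfs:subClassOf child; that layout is an assumption about the file, and list_classes is a name invented here.

```python
import xml.etree.ElementTree as ET

OWL = 'http://www.w3.org/2002/07/owl#'
RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RDFS = 'http://www.w3.org/2000/01/rdf-schema#'


def list_classes(root):
    """Return (class URI, parent URI or None) pairs found under an RDF/XML root."""
    result = []
    for cls in root.iter(f'{{{OWL}}}Class'):
        uri = cls.get(f'{{{RDF}}}about')
        if uri is None:
            continue  # skip anonymous class expressions
        parent = cls.find(f'{{{RDFS}}}subClassOf')
        parent_uri = parent.get(f'{{{RDF}}}resource') if parent is not None else None
        result.append((uri, parent_uri))
    return result
```

A possible call, assuming the file has been downloaded to the working directory, is list_classes(ET.parse('dbpedia_2014.owl').getroot()).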

Direct Extraction

Some information is extracted directly from Wikipedia pages.

  • Extended Abstracts: en, ..., tr

    The file long_abstracts_en.nt contains ~4.6M triples.
    Objects are full abstracts of Wikipedia articles.

  • Geographic Coordinates: en, ..., tr

    The file geo_coordinates_en.nt contains ~2.1M triples.
    Objects are geographic coordinates.

  • Homepages: en, ..., tr

    The file homepages_en.nt contains ~0.6M triples.
    Objects are homepages of persons, organizations, etc.

  • Images: en, ..., tr

    The file images_en.nt contains ~6.9M triples.
    Objects are links to main and thumbnail images.

  • Raw Infobox Properties: en, ..., tr

    The file infobox_properties_en.nt contains ~68M triples.
    Objects are properties, extracted from Wikipedia infoboxes.

  • Raw Infobox Property Definitions: en, ..., tr

    The file infobox_property_definitions_en.nt contains ~0.1M triples.
    Objects are definitions of properties / predicates used in infoboxes.

  • Short Abstracts: en, ..., tr

    The file short_abstracts_en.nt contains ~4.6M triples.
    Objects are short abstracts (max. 500 characters long) of articles.

  • Titles: en, ..., tr

    The file labels_en.nt contains ~11M triples.
    Objects are article titles, in the corresponding language.

Mapping-based Extraction

Mapping-based extraction is used to canonicalize information from non-English Wikipedias. The information is typically delivered in N-Triples (*.nt), N-Quads (*.nq), and Turtle (*.ttl) formats. Only the N-Quads format (*.nq) includes provenance information.
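
To illustrate the difference between the triple and quad formats: a quad line carries a fourth term, which is where the provenance information lives. The sketch below splits one statement into its terms and returns None for the fourth term when it is absent. The function name split_statement and the simplified regular expression are assumptions for illustration; the pattern does not cover the full grammar.

```python
import re

# Subject, predicate, object, and an optional fourth term (IRI or blank node),
# followed by the terminating " ." of the statement.
STATEMENT_RE = re.compile(
    r'^(\S+)\s+(\S+)\s+(.+?)(?:\s+(<[^>]*>|_:\S+))?\s*\.\s*$'
)


def split_statement(line):
    """Return (subject, predicate, object, graph) or None if the line does not match.

    The graph term is None for plain triples (*.nt) and set for quads (*.nq).
    """
    match = STATEMENT_RE.match(line.strip())
    if match is None:
        return None
    return match.groups()
```

The same function can therefore be run over either kind of file; only the quad files should yield a non-empty fourth term.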

Here is a summary of the file types:

  • Mapping-based Properties: en, ..., tr

    The file mappingbased_properties_en.nt contains ~33M triples.
    Objects are Infobox properties, extracted via mapping.

  • Mapping-based Properties (Cleaned): en

    The file mappingbased_properties_cleaned_en.nt contains ~33M triples.
    Objects are Infobox properties, extracted via mapping and cleaned.

  • Mapping-based Properties (Specific): en, ..., tr

    The file specific_mappingbased_properties_en.nt contains ~0.8M triples.
    Objects are Infobox properties from the mapping-based extraction,
    using convenient units of measurement.

  • Mapping-based Types: en, ..., tr

    The file instance_types_en.nt contains ~28M triples.
    Triples are of the form $instance rdf:type $class.

  • Mapping-based Types (Heuristic): en

    The file instance_types_heuristic_en.nt contains ~3.1M triples.
    Triples are of the form $instance rdf:type $class,
    generated per Paulheim/Bizer: Type Inference on Noisy RDF Data.
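
As a small example of working with the type files, the rdf:type triples can be tallied to see how many instances each class has. This is a sketch under the assumption that every term in these lines is an IRI, so that splitting on whitespace is safe; count_instances_per_class is a name made up here.

```python
from collections import Counter

RDF_TYPE = '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>'


def count_instances_per_class(lines):
    """Count rdf:type statements per class from N-Triples (or N-Quads) lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Expect: subject, predicate, class, [graph], "."
        if len(parts) >= 4 and parts[1] == RDF_TYPE and parts[-1] == '.':
            counts[parts[2]] += 1
    return counts
```

Calling most_common(10) on the result gives the ten most populated classes.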

Categories

  • Article Categories: en, ..., tr

    The file article_categories_en.nt contains ~19M triples.
    Links relate articles to categories, using the SKOS vocabulary.

  • Categories (Labels): en, ..., tr

    The file category_labels_en.nt contains ~1.1M triples.
    Objects are labels for categories, in various languages.

  • Categories (SKOS): en, ..., tr

    The file skos_categories_en.nt contains ~4.5M triples.
    Information about which concept is a category
    and how categories are related, using the SKOS Vocabulary.
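
The category relationships can be walked as a graph. The sketch below assumes that the relation between a category and its parent is expressed with the skos:broader predicate (an assumption about the file contents) and collects every category reachable by following it. The function names are invented for this page.

```python
from collections import defaultdict, deque

SKOS_BROADER = '<http://www.w3.org/2004/02/skos/core#broader>'


def build_broader_index(lines):
    """Map each category to the set of its directly broader categories."""
    broader = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[1] == SKOS_BROADER:
            broader[parts[0]].add(parts[2])
    return broader


def ancestors(category, broader):
    """Return all categories reachable via skos:broader, tolerating cycles."""
    seen = set()
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, ()):
            if parent not in seen and parent != category:
                seen.add(parent)
                queue.append(parent)
    return seen
```

The breadth-first walk keeps a set of visited categories, since category graphs derived from a wiki may well contain loops.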

IDs, etc.

  • Page IDs: en

    The file page_ids_en.nt contains ~13M triples.
    Objects are article page IDs.

  • Revision IDs: en

    The file revision_ids_en.nt contains ~13M triples.
    Objects are article revision IDs.

  • Revision URIs: en

    The file revision_uris_en.nt contains ~13M triples.
    Objects are article revision URIs.

  • Disambiguation Links: en

    The file disambiguations_en.nt contains ~1.4M triples.
    Links are extracted from Wikipedia disambiguation pages.

  • External Links: en, ..., tr

    The file external_links_en.nt contains ~7M triples.
    Links from articles to external web pages.

  • Inter-Language Links: en

    The file interlanguage_links_en.nt contains ~29M triples.
    Dataset linking a DBpedia resource to the same resource in other languages and in Wikidata.

  • IRI-same-as-URI Links: en

    The file iri_same_as_uri_en.nt contains ~0.9M triples.
    Links (owl:sameAs) between the IRI and URI format of DBpedia resources.
    Only extracted when the IRI and URI differ.

  • Links to Wikipedia Article: en

    The file wikipedia_links_en.nt contains ~44M triples.
    Dataset linking DBpedia resources to corresponding articles.

  • Old Interlanguage Links: en

    The file old_interlanguage_links_en.nt contains ~0.2M triples.
    Remaining interlanguage links extracted directly from Wikipedia articles.

  • Wikipedia Pagelinks: en, ..., tr

    The file page_links_en.nt contains ~153M triples.
    Dataset containing internal links between DBpedia instances.
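
The IRI-same-as-URI entry above turns on the difference between the two forms of a resource name: an IRI may contain non-ASCII characters, while the URI form percent-encodes them. A rough sketch of that conversion follows. It uses plain UTF-8 percent-encoding from the Python standard library, which may not match DBpedia's own encoding rules in every detail, and the function names are hypothetical.

```python
from urllib.parse import quote

# ASCII characters left untouched: URI delimiters plus "%", so that text
# which is already percent-encoded is not encoded a second time.
_SAFE = "/:?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-encode the non-ASCII characters of an IRI (UTF-8 based)."""
    return quote(iri, safe=_SAFE)


def needs_same_as_link(iri):
    """True when the IRI and URI forms differ, i.e. a link would be needed."""
    return iri_to_uri(iri) != iri
```

An ASCII-only name comes back unchanged, which mirrors the note that links are only extracted when the two forms differ.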

Miscellany

  • Page Outdegree: en

    The file out_degree_en.nt contains ~11M triples.
    Number of links from a Wikipedia article to another Wikipedia article.

  • Page Length: en

    The file page_length_en.nt contains ~11M triples.
    Number of characters contained in an article's source.

  • Persondata: en, de

    The file persondata_en.nt contains ~7.9M triples.
    Information about persons (date and place of birth, etc.).

  • Redirects: en

    The file redirects_en.nt contains ~6.4M triples.
    Dataset containing redirects between articles in Wikipedia.

  • Surface Forms: bg, ..., tr

    The file surface_forms_bg.nt contains ~???M triples.
    Texts used to refer to Wikipedia articles.

  • Transitive Redirects: en

    The file redirects_transitive_en.nt contains ~7.9M triples.
    Dataset in which multiple redirects have been resolved and redirect cycles have been removed.
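
The idea behind the transitive redirects (follow each chain of redirects to its end, and drop anything caught in a cycle) can be sketched as follows. This is only an illustration of the technique, not DBpedia's extraction code; resolve_redirects is a made-up name and the input is a plain dictionary from source title to target title.

```python
def resolve_redirects(redirects):
    """Map each redirect source to its final target, dropping redirect cycles.

    `redirects` maps a source page to the page it redirects to.
    """
    resolved = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:
            if target in seen:
                target = None  # the chain loops back on itself
                break
            seen.add(target)
            target = redirects[target]
        if target is not None:
            resolved[source] = target
    return resolved
```

Given {'A': 'B', 'B': 'C'}, both A and B end up pointing at C, while a pair such as X and Y that redirect to each other is left out of the result.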

External Dataset Linkage

DBpedia has been linked to a number of external data sets.

  • CORDIS (EU repository and portal)

  • DailyMed (drug listings submitted to FDA)

  • DBLP (computer science bibliography)

  • DBTune (music-related information)

  • DrugBank (drug and target information)

  • EUNIS (European Nature Information System)

  • GADM-RDF (global administrative areas)

  • GHO (WHO's Global Health Observatory)

  • OpenEI (Open Energy Information)

  • Revyu (reviews and ratings)

  • SIDER (side effects of drugs)

  • RDF-TCM (traditional Chinese medicines)

  • UMBEL (Upper Mapping and Binding Exchange Layer)

  • YAGO (Yet Another Great Ontology)
    • YAGO instance links
    • YAGO type information
    • YAGO type hierarchy
    • YAGO type links


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r3 - 15 Jan 2015, RichMorin