Claims

This page discusses the "claim", one of Ontiki's fundamental idioms for knowledge representation.

Motivation

Unlike Wikipedia, which attempts to achieve an informed consensus (with citations) on each topic, Ontiki supports the definition and use of variant ontological views. That is, information sources (e.g., data sets, users) can make arbitrary assertions. This isn't unique to Ontiki; it's true of the entire World Wide Web:

People can build their own webpage and say anything they want on it. ... it is up to the reader to come to a conclusion about what to believe. ... This feature is so instrumental in its character that we give it a name: the AAA Slogan: "Anyone can say Anything about Any topic."

-- Semantic Web for the Working Ontologist

Because assertions may be in apparent or even complete contradiction to each other, knowing each assertion's context becomes critical for analysis, evaluation, filtering, etc. More generally, various kinds of metadata (e.g., culture, definition, language, provenance, scope, support) can all be useful in knowledge representation.

To encode this metadata, we'll need to make assertions about other assertions. For lack of better terminology, let's call these "meta-assertions". Most graph databases have little specific support for them, but some sort of workaround (i.e., indirect solution) is generally possible.

Ontiki's claim takes the form of a Neo4j-based programming idiom: a data structure that serves as an abstract data type. Developers can define claims with minimal concern about implementation details; users can employ claims without any realization that they are doing so.

Possible Uses

Claims are low-level building blocks which can be applied in a number of ways. Ontiki uses them primarily for administrative purposes (e.g., provenance, scope), but they can also serve in a variety of higher-level patterns.

Implementation

A claim encodes information (e.g., a logical assertion) that some party (e.g., a data set or human user) has made. Implemented as a Neo4j "intermediate node", a claim is an abstract data type which resolves the meta-assertion limitation noted above.

Because a claim is a node, it can be used as an endpoint for relationships. These might lead to any node in the graph: an author, a general definition, a temporal framework, etc. The claim node can also have its own properties (e.g., instance and/or transaction IDs).
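As a rough sketch, a claim node might be created and tied to its author like this (the labels, property names, and lookup are placeholders, not Ontiki's actual schema; Claim_By is one of the relationship types described under Internal View, below):

// placeholder labels, property names, and lookup; for illustration only
MATCH (author:Entity { name: 'some_user' })
CREATE (cn_42:Claim { instance_id: 42, tx_id: 1017 }),
       (cn_42)-[:Claim_By]->(author)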

External View

The external view of a claim presents it as a single, annotated link between two entities.

Cypher has no syntax to represent the external form of a claim. Something like this might capture the basic assertion:

(en_1) -[| cl_42 |]-> (en_2)

Capturing the metadata is another matter entirely. Here is one (rather prolix) possibility:

(en_1) -[| cl_42 |]-> (en_2),
(en_1) -[| cl_42 |]-> (en_3).

Internal View

The internal view of a claim exposes the intermediate node and its relationships to the entities involved.

Here is a definition, in vanilla Cypher syntax:

(en_1)  -[:Claim_In]-> (cn_42),
(cn_42) -[:Claim_To]-> (en_2),
(cn_42) -[:Claim_By]-> (en_3)

Other relationships (e.g., Claim_At, Claim_Of) could be added, as desired.
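As a sketch, a complete claim of this shape might be created in a single statement (the Entity and Time labels, the lookup properties, and the timestamp value are placeholders):

// placeholder labels and lookups, for illustration only
MATCH (en_1:Entity { name: 'subject' }),
      (en_2:Entity { name: 'object'  }),
      (en_3:Entity { name: 'author'  })
CREATE (cn_42:Claim { instance_id: 42 }),
       (en_1)  -[:Claim_In]-> (cn_42),
       (cn_42) -[:Claim_To]-> (en_2),
       (cn_42) -[:Claim_By]-> (en_3),
       (cn_42) -[:Claim_At]-> (:Time { iso8601: '2015-01-07' })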

Hybrid Claims

In order to support experimentation, I may implement a "hybrid claim", which adds a direct relationship to the claim data structure.
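Here is a sketch of that idea; the Claims_Direct relationship type is a placeholder for whatever the direct edge would be called:

// hybrid form: the usual intermediate node, plus a direct shortcut edge
MATCH (en_1:Entity { name: 'subject' }),
      (en_2:Entity { name: 'object'  })
CREATE (cn_42:Claim),
       (en_1)  -[:Claim_In]->      (cn_42),
       (cn_42) -[:Claim_To]->      (en_2),
       (en_1)  -[:Claims_Direct]-> (en_2)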

Performance

Neo4j has fixed storage costs for nodes (9 bytes) and relationships (33 bytes). The minimal claim requires a node and two relationships (9 + 2 * 33 = 75 bytes). Each piece of metadata (e.g., Claim_At, Claim_By, Claim_Of) adds another relationship and possibly another node. So, a well-annotated claim might use quite a bit of memory; for example, a claim with four metadata relationships, each leading to its own node, occupies 75 + 4 * (33 + 9) = 243 bytes, versus 33 bytes for a single direct relationship.

The cost in query speed also increases, but not as dramatically in most cases. Rather than traversing one relationship, we're traversing (at least) two. However, relationship traversal tends to be inexpensive, as long as the working set fits in memory.
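In query terms, the difference looks roughly like this (two separate queries, reusing the placeholder names from the sketches above):

// one hop, via a direct (or hybrid) relationship
MATCH (en_1:Entity { name: 'subject' })-[:Claims_Direct]->(target)
RETURN target

// two hops, through the claim's intermediate node
MATCH (en_1:Entity { name: 'subject' })-[:Claim_In]->(cn:Claim)-[:Claim_To]->(target)
RETURN target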

Graph Databases

I've considered several graph databases for use in projects such as Ontiki. Most of these (e.g., Datomic, RDF Triplestores) are based on a single, indexed table. As a result, each edge traversal requires an O(log N) lookup. This is far better than a linear search of the table, but it's still expensive.

Neo4j, in contrast, is a pointer-based (i.e., index-free) database, which can traverse an edge simply by following a pointer to another location in memory. Once the database has been "warmed up" (to minimize page faults), this can be very fast (e.g., millions of edge traversals per second).

Datomic

Datomic can support meta-assertions, at some cost in convenience. Individual datoms are atomic, immutable objects, represented as (entity, attribute, value, transaction, ...) tuples. Because each datom has a transaction ID, it can be linked to other datoms. This can be used to encode arbitrary metadata and relationships.

Specifically, transaction IDs can be used to connect datoms: datoms that share a transaction ID can be looked up together, and later datoms can use that ID as an entity or value, attaching descriptive information to the transaction.

I find Datomic's peer-based distribution model, inherent immutability, and Datalog-based query language to be very attractive. Unfortunately, because Datomic is proprietary, I don't consider it suitable for use in an open source project.

Neo4j

Neo4j's property graphs are quite general, and its Cypher query language feels like a good starting point for describing patterns. However, relationships can only connect pairs of nodes, and there is no direct way to refer to a specific instance of a relationship. So, the prevailing practice is to create an "intermediate node"; see Implementation (above) for my proposed use of this approach. The same trick also finesses the problem of creating N-ary relationships, as sketched below.
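As a generic illustration (unrelated to Ontiki's schema), an intermediate node can model a three-way relationship among a buyer, a seller, and an item:

// a 3-ary relationship modeled with an intermediate Purchase node
CREATE (buyer:Person  { name: 'Alice'  }),
       (seller:Person { name: 'Bob'    }),
       (item:Product  { name: 'Widget' }),
       (p:Purchase    { date: '2015-01-07' }),
       (p)-[:Buyer]->(buyer),
       (p)-[:Seller]->(seller),
       (p)-[:Item]->(item)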

RDF Triplestores

Triplestores, commonly used to support RDF and the Semantic Web, store sets of tuples in indexed tables. Any column (e.g., subject, predicate, object) can be indexed, but generally, the predicate name only indicates its class (as opposed to a specific instance).

Triplestores typically use the SPARQL language for queries. Some (e.g., Jena) support OWL-based reasoning. Unlike conventional databases, they follow the open-world assumption:

... the open-world assumption is the assumption that the truth value of a statement may be true irrespective of whether or not it is known to be true. It is the opposite of the closed-world assumption, which holds that any statement that is true is also known to be true.

-- open-world assumption (WP)


