Data Storage

This page describes Puffy Data's high-level approach to data storage.

Objectives

We'd like to minimize the amount of duplicated data, while retaining as much flexibility as possible. So, each Ref Tuple should be small, containing only enough information to enable the desired forms of access to a shared global resource.

Processing latency and throughput are also major considerations. In read mode, Ref Tuples must be retrieved from the cache and inflated into data structures. In write mode, data structures must be deflated into Ref Tuples, then (sequentially!) stored in the cache by a Transactor process. Finally, any storage used by Puffy Data must be garbage collected.
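
To make the two directions concrete, here is a rough sketch in Elixir. Every name here is hypothetical, a plain Map stands in for the cache, and term_to_binary/1 stands in for real deflation; this shows the shape of the flow, not Puffy Data's actual implementation. (The three-part Ref Tuple shape used here is covered below.)

  defmodule FlowSketch do
    # Read mode: fetch a Ref Tuple from the cache, then inflate its
    # blob back into a live data structure.
    def read(cache, key) do
      {_type_key, _meta_map, data_blob} = Map.fetch!(cache, key)
      :erlang.binary_to_term(data_blob)
    end

    # Write mode: deflate a structure into a blob, wrap it in a Ref
    # Tuple, and store it. (The real Transactor does this sequentially.)
    def write(cache, key, structure) do
      data_blob = :erlang.term_to_binary(structure)
      Map.put(cache, key, {:Clj_Atom, %{}, data_blob})
    end
  end

  cache = FlowSketch.write(%{}, :my_key, [1, 2, 3])
  FlowSketch.read(cache, :my_key)    # => [1, 2, 3]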

Overview

Erlang's Binary Heap is a shared cache for large (> 64 byte) binaries. Available to any process, it is often employed as a way to share large data structures. So, we can use it, along with Erlang Term Storage (ETS), to meet Puffy Data's need for bulk storage.
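
As a quick illustration (using only stock OTP calls), a binary over that limit is allocated off-heap as a reference-counted ("refc") binary, so handing it to another process shares a reference rather than copying the payload:

  # 1 KB of random bytes -- well over the 64-byte limit, so the
  # binary lives on the shared heap.
  blob = :crypto.strong_rand_bytes(1024)

  pid = spawn(fn ->
    receive do
      {:blob, b} -> IO.puts("got #{byte_size(b)} shared bytes")
    end
  end)

  # Only a small reference travels in the message; the payload stays put.
  send(pid, {:blob, blob})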

A Ton of Tuples

Any number of Ref Tuples can exist simultaneously. Each one tells Puffy Data how to inflate (and otherwise handle) a specific view of the data. Each Ref Tuple has three components, in this order:

  • type_key - the type (e.g., :Clj_Atom)
  • meta_map - metadata for the view, etc.
  • data_blob - the data blob (cached)

Here's an example Ref Tuple:

{ :Clj_Atom, meta_map, data_blob }
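
As a minimal sketch (the metadata and blob contents here are placeholders, not Puffy Data's actual values), building one and pattern matching it apart might look like this:

  meta_map  = %{view: :default}                  # hypothetical metadata
  data_blob = :erlang.term_to_binary([1, 2, 3])  # placeholder blob
  ref_tuple = {:Clj_Atom, meta_map, data_blob}

  # Pattern matching recovers the three components:
  {type_key, ^meta_map, ^data_blob} = ref_tuple
  type_key    # => :Clj_Atom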

A Heap of Blobs

We use the Binary Heap to store Blobs (named after Binary Large OBjects). The following gives the general idea, though the actual data structures are somewhat more complicated:

Each Blob contains a "delta" (set of data and index values). If a Blob needs to share structure with an ancestor, it must contain the appropriate pointers and metadata. Blobs may also contain assorted metadata (e.g., transaction ID), helping Puffy Data to perform optimized searches, etc.
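
Here is one way such a Blob might be laid out, sketched in Elixir; the field names are illustrative, not Puffy Data's actual format:

  # A hypothetical Blob: a delta of new data and index values, plus
  # pointers (hashes) into ancestor Blobs and assorted metadata.
  ancestor_hash = :crypto.hash(:sha, "ancestor blob contents")

  blob_term = %{
    delta:   %{3 => "value added at index 3"},  # new data / index values
    parents: [ancestor_hash],                   # shared-structure pointers
    txn_id:  42                                 # e.g., transaction ID
  }

  # Serialized, the term becomes a binary big enough (> 64 bytes) to
  # land on the shared Binary Heap.
  blob = :erlang.term_to_binary(blob_term)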

ETS, et cetera

At many points in Puffy Data's operation, there may be no Ref Tuples that reference a given Blob. Other Blobs may still reference it implicitly, but this is invisible to the garbage collector. So, the Blob is in peril of disappearing from the Binary Heap.

To prevent this, we can keep a List of active Blobs in the Transactor process. However, if (when!) the Transactor crashes, we'll need a way to restore the List. We accomplish this by saving the Blobs in an ETS set (i.e., a key/value store).

Specifically, we store each Blob as a value, using a cryptographic hash function (e.g., SHA-1) to generate the key. This is inspired by techniques used in distributed revision control systems such as Git.
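
A minimal sketch of this scheme (the table and function names are ours, not Puffy Data's):

  # Create a public ETS set. In practice, it would be owned by a
  # process that outlives the Transactor, so a crash doesn't take the
  # table down with it.
  table = :ets.new(:blob_store, [:set, :public])

  # Store each Blob under the SHA-1 hash of its contents, Git-style.
  store = fn blob when is_binary(blob) ->
    key = :crypto.hash(:sha, blob)
    :ets.insert(table, {key, blob})
    key
  end

  key = store.(:erlang.term_to_binary(%{delta: %{0 => "hello"}}))
  [{^key, _blob}] = :ets.lookup(table, key)

  # After a Transactor crash, the List of active Blobs can be rebuilt
  # by folding over the table.
  active = :ets.foldl(fn {k, _blob}, acc -> [k | acc] end, [], table)

Because each key is derived from the Blob's contents, identical Blobs deduplicate for free, which fits our goal of minimizing duplicated data.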

Blob Internals

The internal structure of a Blob is conceptually opaque to the calling code. However, there is nothing secret (let alone magical) about it. Each Blob contains enough information to define a version of the data structure, relying on other Blobs as needed:

  • a mapping table into predecessor Blobs

Each Blob may contain references into zero or more of its "ancestors", as needed.
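
For instance (purely illustrative), a mapping table might record, for each logical index, whether the value lives in this Blob's own delta or at some index within an ancestor:

  ancestor_hash = :crypto.hash(:sha, "ancestor blob contents")

  # Hypothetical mapping table: logical index -> location of the value.
  mapping = %{
    0 => :self,                 # stored in this Blob's own delta
    1 => {ancestor_hash, 7},    # index 7 of the ancestor Blob
    2 => {ancestor_hash, 8}
  }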

To be continued...


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!
