Introduction

Etymology

puffy (comparative puffier, superlative puffiest)

  1. Of or pertaining to puffs or puffiness; being pillow-like, exhibiting swelling, inflated
  2. Speaking or writing in an exaggeratedly eloquent and self-important manner

-- puffy, in Wiktionary

Puffy Data "inflates" tuples (e.g., in message bodies) into persistent data structures. Like cloud computing, it relies on shared resources such as cached binaries and lightweight server processes (i.e., actors). Conveniently, the name sounds a bit like "PF'y data" (i.e., akin to purely functional data). Of course, both clauses of the definition above may apply (:-).

Motivations

Elixir has all of Erlang's data structures, and more. However, there are still possibilities for improvement in capabilities and efficiency.

Capabilities

Clojure has very powerful data management capabilities. In particular, Clojure's reference types (Agent, Atom, Ref, Var) provide safe, efficient ways for colocated processes to share state. This seems like a useful complement to state-retaining processes (e.g., Elixir's Agents), Erlang Term Storage (ETS), and the Mnesia DBMS.

Based on these foundations, it may be possible to add other interesting types. For example, Datomic has some very powerful capabilities for time-sensitive data storage and retrieval, massively scalable read access, etc.

Performance

Some of Clojure's data types have much better performance than their Elixir alternatives. For example, the Vector type allows O(1) appending and indexing (as opposed to O(n) for a List). The Sorted-Map type is also very convenient and efficient for certain tasks.

Also, deep copying of inter-process message data isn't free. The copying operation takes processor time and each copy of the data takes up memory. Saša Jurić discusses this in Section 5.4.3 of Elixir in Action (Shared nothing concurrency):

Deep copying is an in-memory operation, so it should be reasonably fast. Occasionally sending a big message shouldn't present a problem. But having many processes frequently sending big messages may affect the performance of the system.

The notion of small versus big is subjective. Simple data, such as a number, an atom, or a tuple with few elements, is obviously small. A list of a million complex structures is big. The border lies somewhere in between and depends on your special case.

Persistent Data Structures

Persistent data structures aren't a new concept for Elixir, but introductions to Elixir's data structures tend to gloss over this fact. As described in Does Elixir have persistent data structures similar to Clojure?, Elixir's List, HashDict, and HashTree implementations all use this approach.

There is a wealth of literature on this topic (see Resources), but David Karger's lecture notes provide a very concise and readable description:

Once changes have been made to an ephemeral data structure, no mechanism exists to revert to previous states. Persistent data structures are really data structures with archaeology.

Partial persistence lets you make modifications only to the present data structure, but allows queries of any previous version. These previous versions might be accessed via a timestamp. Full persistence lets you make queries and modifications to all previous versions of the data structure. With this type of persistence, the versions don't form a simple linear path – they form a version tree.

The obvious way to provide persistence is to make a copy of the data structure each time it is changed. This has the drawback of requiring space and time proportional to the space occupied by the original data structure. In turns out that we can achieve persistence with O(1) additional space and O(1) slowdown per operation for a broad class of data structures.

Note: The use of the term "modification" in the text above could confuse the unwary. So, for the record, persistent data structures are immutable: no in-place modification is ever performed on them.

The magic behind persistent data structures is "structural sharing". Each "copy" of the structure uses as much of the old structure as possible, adding only the new values and some overhead. (See the Implementation page for more information.)

Clojure and Datomic take great advantage of this idea. Clojure makes it the basis of a set of concurrency-friendly (e.g., transactional) "reference data types". Datomic uses it to allow massive scaling of database query operations, convenient temporal analysis of data, and much more.

Bringing It Together

Persistent data structures are supported by a number of functional programming languages (e.g., Clojure, Haskell, Scala) and have also been ported to more conventional languages (e.g., Java, JavaScript, Python). So, it seems like time for Elixir to get on the bus.

If, at the same time, we could extend Elixir's set of data types and make large-scale message passing more convenient and efficient, it begins to sound like a no-brainer. So, I'm looking into it...

The Erlang environment already provides some related facilities. Let's take a look.

Erlang Term Storage

Erlang Term Storage (ETS) is the traditional way to share large amounts of data:

This module is an interface to the Erlang built-in term storage BIFs. These provide the ability to store very large quantities of data in an Erlang runtime system, and to have [in general] constant access time ... Data is organized as a set of dynamic tables, which can store tuples. Each table is created by a process. When the process terminates, the table is automatically destroyed. Every table has access rights set at creation.

Tables are divided into four different types, set, ordered_set, bag, and duplicate_bag. A set or ordered_set table can only have one object associated with each key. A bag or duplicate_bag can have many objects associated with each key.

The number of tables stored at one Erlang node is limited. ... Note that there is no automatic garbage collection for tables. Even if there are no references to a table from any process, it will not automatically be destroyed unless the owner process terminates. It can be destroyed explicitly by using delete/1. ...

This module provides some limited support for concurrent access. All updates to single objects are guaranteed to be both atomic and isolated. ... No other support is available within ETS that would guarantee consistency between objects. ...

The Binary Heap

Saša Jurić describes an interesting and useful optimization that might provide an opportunity for another solution:

A special case where deep copying doesn't take place involves binaries (including strings) that are larger than 64 bytes. These are maintained on a special shared binary heap and sending them doesn't result in a deep copy. This can be useful when you need to send information to many processes and the processes don't need to decode the string.

If the same data structure is being used by multiple processes, this optimization will reduce both copying time and data storage. However, the use of global data interferes with per-process garbage collection. So, YMMV.

Target Data Types

Clojure supports a wide variety of well-documented data types. Agents, Atoms, Refs, and Vars have their own pages. Data Structures describes ArrayMap, Character, Keyword, List, Map, nil, Number, Set, String, StructMap, Symbol, and Vector. Datatypes describes deftype, defrecord, and reify.

Only a few of these data types are present in Elixir, but any of them could be added to the language, by appropriate use of macros and protocols. My intention is to start with Elixir's existing types (e.g., lists, maps, sets), then move on to some simple structures from Clojure (e.g., vectors).

Clojure's reference data types (e.g., Agents, Atoms, Refs, and Vars) would seem to be philosophically at odds with the Actor model, as used in Elixir, Erlang, etc. They are also limited to colocated (as opposed to distributed) processes. (See Message Passing and Actors for Clojure's rationale.)

However, they offer an interesting alternative (with great performance and safety benefits) to message passing and other state-sharing approaches. So, even though integrating them may be challenging, we can't really leave them out of the design.


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r11 - 04 Apr 2016, RichMorin
This site is powered by Foswiki Copyright © by the contributing authors. All material on this wiki is the property of the contributing authors.
Foswiki version v2.1.6, Release Foswiki-2.1.6, Plugin API version 2.4
Ideas, requests, problems regarding CFCL Wiki? Send us email