Neo4jc

This page explores a possible approach to creating a job control framework for IPM Lab, based on:

  • a task management tool (eg, Rake)

Overview

IPM Lab provides a number of programs for performing common modeling tasks. These programs can be run from the Unix shell, a shell script, or some other program. Each executing program (ie, process) must be told about its input and output files, runtime parameters, etc.
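
As a rough illustration of what "telling a process about its files and parameters" could involve, here is a minimal Ruby sketch. The JobSpec structure, the evaluate_model program name, and the --in/--out/--name=value flag style are all assumptions made for this example; they are not IPM Lab's actual conventions.

```ruby
require 'shellwords'

# Hypothetical description of one process: program, files, and parameters.
JobSpec = Struct.new(:program, :inputs, :outputs, :params, keyword_init: true) do
  # Build a shell-safe command line. The flag style used here is an
  # assumption for illustration, not IPM Lab's real interface.
  def command_line
    args  = inputs.map  { |path| "--in=#{path}" }
    args += outputs.map { |path| "--out=#{path}" }
    args += params.map  { |name, value| "--#{name}=#{value}" }
    Shellwords.join([program, *args])
  end
end

job = JobSpec.new(
  program: 'evaluate_model',       # hypothetical program name
  inputs:  ['data/obs 01.csv'],    # note the space, which must be escaped
  outputs: ['out/scores.json'],
  params:  { seed: 42 }
)

puts job.command_line
```

Even in this toy form, every process needs its own record of this kind, which is why the bookkeeping grows quickly as the number of processes rises.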

Although a typical modeling experiment will use only a few dozen programs, the number of processes and files may be far larger (eg, thousands). This produces a great deal of incidental complexity, which is seldom of interest to researchers (or even developers). We need a way to manage this complexity and hide it from casual view, while still allowing control over it and visibility into it as needed.

When specifying an experiment, developers and researchers need to define general patterns of processing and data flow. However, many specific details can be handled automatically, including:

  • File management (eg, archiving, conversion, indexing)

  • Model structure generation and model evaluation

  • Analysis and presentation of modeling results

  • Process management (eg, parallel execution)

Graphs

Graph-based data structures (eg, nodes, properties, relations) are good at encoding complexity. Graph databases (eg, Neo4j) provide robust storage for these structures, as well as tools (eg, libraries, query languages) for dealing with them. Here are some instances of directed graphs that a graph database might handle for IPM Lab:

  • abstract data flow (eg, programs, file types)

  • concrete data flow (eg, processes, files)

  • model information (eg, entities, scores)
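
To make the idea of nodes, properties, and relations concrete, here is a small in-memory sketch of an abstract data flow graph. It is only a stand-in for a graph database: the TinyGraph class, the node names, and the READ_BY and WRITES relation types are hypothetical, and a real system would store and query this information in something like Neo4j.

```ruby
require 'set'

# A tiny in-memory directed property graph (a stand-in for a graph database).
class TinyGraph
  Node = Struct.new(:id, :label, :props)
  Edge = Struct.new(:from, :type, :to)

  def initialize
    @nodes = {}
    @edges = []
  end

  def add_node(id, label, props = {})
    @nodes[id] = Node.new(id, label, props)
  end

  def add_edge(from, type, to)
    @edges << Edge.new(from, type, to)
  end

  # Ids of every node reachable from +start+ by following edge direction.
  def downstream(start)
    seen  = Set.new
    queue = [start]
    until queue.empty?
      current = queue.shift
      @edges.each do |edge|
        next unless edge.from == current
        queue << edge.to if seen.add?(edge.to)  # add? is nil if already seen
      end
    end
    seen.to_a
  end
end

# Hypothetical abstract data flow: programs and file types.
graph = TinyGraph.new
graph.add_node(:raw_data,  :FileType, format: 'csv')
graph.add_node(:generate,  :Program)
graph.add_node(:structure, :FileType, format: 'json')
graph.add_node(:evaluate,  :Program)
graph.add_node(:scores,    :FileType, format: 'json')
graph.add_edge(:raw_data,  :READ_BY, :generate)
graph.add_edge(:generate,  :WRITES,  :structure)
graph.add_edge(:structure, :READ_BY, :evaluate)
graph.add_edge(:evaluate,  :WRITES,  :scores)

p graph.downstream(:structure)  # => [:evaluate, :scores]
```

A query such as downstream answers questions like "what must be regenerated if this file type changes?", which is the kind of lookup a graph database is designed to handle.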

Data Flow

We can consider the data flow in an IPM Lab experiment at two levels. The abstract data flow, defined by the user, might involve a few dozen programs, file types, etc. The concrete data flow, implemented by IPM Lab's infrastructure, may involve thousands of processes and files.
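
The relationship between the two levels can be sketched as an expansion step: one abstract description is applied to many files, yielding many concrete processes. The structures, file extensions, and directory names below are hypothetical, chosen only to show the shape of the expansion.

```ruby
# Abstract step: one program that turns one file type into another.
AbstractStep = Struct.new(:program, :input_ext, :output_ext)

# Concrete process: one run of that program on specific files.
ConcreteProcess = Struct.new(:program, :input, :output)

# Expand one abstract step into one concrete process per matching input file.
def expand(step, files, out_dir)
  matching = files.select { |path| File.extname(path) == step.input_ext }
  matching.map do |path|
    base = File.basename(path, step.input_ext)
    ConcreteProcess.new(step.program, path,
                        File.join(out_dir, base + step.output_ext))
  end
end

step  = AbstractStep.new('evaluate_model', '.model', '.score')
files = (1..1000).map { |n| format('models/m%04d.model', n) }
procs = expand(step, files, 'scores')

puts procs.size          # 1000 processes from a single abstract step
puts procs.first.output  # scores/m0001.score
```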

Abstract Data Flow

From the user's perspective, a typical experiment might look like a series of processing tasks, coupled to some analytics software:

Concrete Data Flow

However, from IPM Lab's perspective, things are more complicated. In the diagram below, we see that Git and Neo4j have been added, along with some supervisory software (eg, the main script, Rake).

Information from Neo4j drives the analytics software; it can also influence experimental behavior, via the main script and the model structure generator.

Note: This architecture contains both implicit and explicit cycles. So, for example, it is possible to fashion experiments in which evaluation results can influence the generation of succeeding models.
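
A feedback cycle of this kind can be illustrated with a deliberately tiny loop. Everything here is a toy assumption: models are plain integers, generate proposes neighbours of the best model so far, and evaluate is an arbitrary scoring function. The point is only the shape of the cycle, in which each round's scores steer the next round's generation.

```ruby
# Hypothetical stand-in for the model structure generator:
# propose neighbours of the best model seen so far.
def generate(previous_best)
  [previous_best - 1, previous_best + 1]
end

# Hypothetical stand-in for model evaluation:
# higher is better, peaking at model == 7.
def evaluate(model)
  -(model - 7).abs
end

best    = 0
history = [best]

10.times do
  candidates = generate(best)                       # generation uses prior results
  winner     = candidates.max_by { |m| evaluate(m) }
  break if evaluate(winner) <= evaluate(best)       # stop when nothing improves
  best = winner
  history << best
end

p history  # => [0, 1, 2, 3, 4, 5, 6, 7]
```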

Infrastructure

In order to manage the control and data flows, we need a bit of infrastructure:

  • The operating system's processes, files, and directories
    provide the low-level building blocks.

  • A graph database (eg, Neo4j) gives us rapid access
    to high-level information (eg, configuration, scores).

  • A software task management tool (eg, Rake) handles
    the details of running processes, parallelization, etc.
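
As a sketch of how a task management tool might take on this work, the script below uses Rake's file and multitask declarations from plain Ruby. The file names and the "evaluation" step (which just upcases the input) are placeholders, and the script works in a temporary directory so that it can be run safely; a real setup would more likely live in a Rakefile and call IPM Lab programs.

```ruby
require 'rake'
require 'tmpdir'

all_built = Dir.mktmpdir do |dir|
  # Hypothetical input and output files for three models.
  inputs  = %w[a b c].map { |name| File.join(dir, "#{name}.model") }
  outputs = inputs.map { |path| path.sub(/\.model\z/, '.score') }
  inputs.each { |path| File.write(path, "model\n") }

  # One file task per output. Rake runs the block only when the output
  # is missing or older than its input.
  inputs.zip(outputs).each do |input, output|
    file output => input do
      File.write(output, File.read(input).upcase)  # stand-in for a real program
    end
  end

  # A multitask runs its prerequisites in parallel threads.
  multitask evaluate_all: outputs

  Rake::Task[:evaluate_all].invoke
  outputs.all? { |path| File.exist?(path) }
end

puts all_built
```

The attraction of this style is that the experiment states what depends on what, while the tool decides what needs to run and what can run in parallel.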


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r9 - 04 Apr 2016, RichMorin