Tree Globbing

Motivation

When I construct large systems, I like to use sets of communicating processes, possibly using intermediate files as messages. This is far from a novel approach; indeed, it is central to the Unix Philosophy. It is also reminiscent of Stuart Sierra's notions about Thinking in Data.

I have used this approach a number of times. For example, I used it for a documentation system (self-documenting, complete with image maps and movies :-) in the Fermi project. I'm using a similar approach as the basis for the "IPM Lab" that I'm developing for ISLE.

I particularly like the fact that the serialized messages can be examined (or even edited!) by the programmer (eg, as part of modification or debugging efforts). As Gene Dronek points out, the recorded, static messages are points of stability in the churning sea of process activity. So, it behooves us to consider carefully what sorts of messages we want to send about...

Message Structure

Each message type needs to have a specified data structure, agreed upon (in some fashion) by its senders and receivers. The serialization format (ie, concrete syntax) is largely an implementation detail, but the abstract syntax is generally an arborescence:

In graph theory, an arborescence is a directed graph in which, for a vertex u called the root and any other vertex v, there is exactly one directed path from u to v.

Equivalently, an arborescence is a directed, rooted tree in which all edges point away from the root. Every arborescence is a directed acyclic graph (DAG), but not every DAG is an arborescence.

-- Arborescence (graph theory) (WP)

For extended discussion of this data structure, see my Arborescence web.

Examples

In the example code below, I use irb (Interactive Ruby Shell) to create and access a pair of arborescences (a, b). Note that Ruby allows arrays and hashes to be accessed using the same (bracket-based) syntax:

$ irb
>> a = { :foo => [ 42, :sym, 'str' ] }
=> {:foo=>[42, :sym, "str"]}
>> a[:foo][1]
=> :sym
>> b = { :foo => [ 42, :sym, 'str', {:bar => [ :a, :b]}] }
=> {:foo=>[42, :sym, "str", {:bar=>[:a, :b]}]}
>> b[:foo][3][:bar][1]
=> :b

Serialization (mostly YAML)

An arborescence can be encoded in any modern serialization format. However, YAML has a number of very convenient features: it lets me add comments, incorporate blocks of text cleanly, minimize the use of quotes, etc. So, for convenience and clarity, I use YAML by default.

With larger structures, I may use some specification and traversal trickery. For example, I needed to define the infix notation for an algebraic equation in the Ivlev grazing process. In YAML, this is pretty awkward and bulky to write down:

process:
  grazing:
    ivlev:
      equations:
        ALG:
          Graz.dn__a:
            expr_infix:       Graz.pc__a
            ...

However, some key-expansion trickery (after the structure is loaded) can support Unix file path syntax for the top-level key. This simplifies the YAML, reducing both the number of lines and the depth of indentation:

process/grazing/ivlev/equations/ALG:
  Graz.dn__a:
    expr_infix:       Graz.pc__a
    ...

This makes it easier to edit the serialized arborescence, but does nothing to help us traverse it. Fortunately, there are many more tricks we can steal...

Traversal (mostly Ruby)

As shown above, it's possible to index into an arborescence. Loops (or recursion) and working variables can be used to implement reasonably efficient tree traversal. However, this can also lead to verbose and ugly (IMHO) code. In the IPM project, I wanted to add a prefix-format expression (expr_prefix) to each equation definition sub-hash, based on the existing infix form. How hard could that be? (ducks :-)

  @sh['process'].each do |n_proc_type, proc_type|
    proc_type.each do |n_proc_item, proc_item|
      process = proc_item['equations']
      process.each do |n_eqn_type, eqn_type|
        eqn_type.each do |n_item, item|
          item['expr_prefix'] = rx2lx( item['expr_infix'] )
        end
      end
    end

Although this does the job, a single line of critical code is buried inside eight lines of incidental complexity (eg, looping, variable manipulation). Fortunately, our use of an arborescence will let us DRY this out. For example, we might borrow some globbing syntax from the Unix shell or XPath. This would let us remove almost all of the loop control code:

  @sh.ar_glob('process/*/*/equations') do |item|
    item['expr_prefix'] = rx2lx( item['expr_infix'] )
  end

However, this doesn't give us a convenient way to access loop keys and values (for tracing or use). So, let's change the asterisks to named placeholders and provide hashes of keys and values (hk, hv) to access them:

  glob_patt = 'process/:proc_type/:proc_name/equations/' +
              ':eqn_type/:var_name'

  @sh.ar_glob(glob_patt) do |hk, hv, item|
    item['expr_prefix'] = rx2lx( item['expr_infix'] )
  end

Careful Reader will recognize that glob_patt borrows its syntax from router control strings, as used in web application frameworks such as Ruby on Rails and Sinatra: In any case, we can now add tracing code, as desired:

  glob_patt = 'process/:proc_type/:proc_name/equations/' +
              ':eqn_type/:var_name'
  @sh.ar_glob(glob_patt) do |hk, hv, item|
    l = %w[ glob_patt#  hk~  hv~  item~ ]
    et(l, binding, names) if @debug

    item['expr_prefix'] = rx2lx( item['expr_infix'] )
  end

This might produce a trace of the form:

et> lib_lint.rb:73:in `add_expr_prefix'
et>   hk~     {:eqn_type=>"ODE", ...}
et>   hv~     {:eqn_type=>{"Prod.conc"=>{"note_precis"=>"???", ...}
et>   item~   {"note_precis"=>"???", ...}

Generalization

Although the examples above show only trees of hashes, ar_glob also supports heterogeneous trees. The usage looks (mostly) the same, but the returned keys may be either hash keys or array indices, depending on the structure of the input data.

Implementation

For the curious (and brave), here's some Ruby implementation code:

  class Object
  #
  # Various kinds of Objects (eg, Array, Hash, Nil, String) could be handed
  # to these routines.  So, we define the receiver's class very generally.

    def ar_each
    #
    # Hash#each isn't well suited to traversing a level of an arborescence.
    # First, the level might be an array, so there's a type mis-match.  An
    # array is ordered, but unlike a hash, the keys are implicit.  Finally,
    # hash keys aren't sorted, so the generated results won't appear in a
    # predictable (ie, human-friendly) order.
    #
    # Also, YAML encoding problems can cause the script to crash with no
    # useful clues as to the reason.  For example, a sub-hash may be missing
    # or the structure may not be what the template file expects.
    #
    # This "monkey-patched each" method works around these problems.  It:
    #
    #   iterates through Hashes, in key order
    #   traces everything else (eg, nil, "...")
    #
    # Alert interpretations:
    #
    #   nil   - This sub-hash has not been defined in the YAML file.  This
    #           could indicate a typo in the key or a lack of information
    #           (eg, an entity which has no constants)
    #
    #   "..." - This sub-hash is actually a text string.  This indicates a
    #           mis-match between the YAML file and the ERb template.
    #

      debug   = false  #TD - set to true to trace
      status  = false

      case self
      when Hash;  status  = true
      else;       debug   = true
      end

      if debug #T
        selfish = self.inspect
        selfish = selfish[0,200] + '...' if selfish.size > 200
        puts "\nAlert (ar_each): not a Hash: #{ selfish }\n\n"
        puts caller #T
      end

      return unless status

      ar_keys(self).each {|key| yield(key, self[key]) }
    end

    def ar_glob(glob_patt, &block)
    #
    # This is a specialization of glob for traversing an arborescence.
    # The syntax is shamelessly stolen from URL-handling (ie, router)
    # strings, as used in Rails and Sinatra, eg:
    #
    #   'process/:proc_type/:eqn_type/equations'

      debug   = false  #TD - set to true to trace
      status  = false

      case self
      when Hash;  status  = true
      else;       debug   = true
      end

      if debug #T
        selfish = self.inspect
        selfish = selfish[0,200] + '...' if selfish.size > 200
        puts "\nAlert (ar_each): not a Hash: #{ selfish }\n\n"
      end

      return unless status

      def ar_walk(ar_in, glob_in,
        hk_in={}, hv_in={}, lvl_in=0, &block)
      #
      # Handle processing of a single glob level.

        @debug = true

        re_patt   = %r{^([^/]+)/(.+)$}
        lvl_to    = lvl_in + 1

#       puts "\nin ar_walk"                                               #T
        l = %w[ lvl_in  lvl_to  re_patt ]
#       et(l, binding) if @debug                                          #T

        if (glob_in =~ re_patt)                   # path node
          glob_this   = Regexp.last_match(1)
          glob_to     = Regexp.last_match(2)

#         puts "\ndo path node"                                           #T
          l = %w[ glob_in  glob_this  glob_to ]
#         et(l, binding) if @debug                                        #T

          if (glob_this =~ /^:(.+)$/)             # wildcard path node
            card_this   = Regexp.last_match(1)

#           puts "\ndo wild card"                                         #T
            l = %w[ card_this ]
#           et(l, binding) if @debug                                      #T

            if (ar_in)
              keys_this   = ar_keys(ar_in)

              keys_this.each do |key_this|
                ar_to     = ar_in[key_this]
                ct_sym    = card_this.to_sym
                hk_to     = hk_in.merge( {ct_sym =>  key_this } )
                hv_to     = hv_in.merge( {ct_sym =>  ar_to   } )

                ar_walk(ar_to, glob_to, hk_to, hv_to, lvl_to, &block)
              end
            end

          else                                    # constant path node
            ar_to    = ar_in[glob_this]

#           puts "\ndo constant"                                          #T
            l = %w[ ar_in#  ar_to# glob_to ]
#           et(l, binding) if @debug                                      #T

            ar_walk(ar_to, glob_to, hk_in, hv_in, lvl_to, &block)
          end

        else                                      # leaf node

#         puts "\ndo leaf node"                                           #T
          l = %w[ ar_in~  glob_in  list_in~ ]
#         et(l, binding) if @debug                                        #T

          if (glob_in =~ /^:(.+)$/)               # wildcard leaf node
            card_this     = Regexp.last_match(1)
            if (ar_in)
              keys_this   = ar_keys(ar_in)

#             puts "\ndo wild card leaf node"                             #T
              l = %w[ glob_this  keys_this ]
#             et(l, binding) if @debug                                    #T

              keys_this.each do |key_this|
                ar_to     = ar_in[key_this]
                ct_sym    = card_this.to_sym
                hk_to     = hk_in.merge( {ct_sym =>  key_this } )
                hv_to     = hv_in.merge( {ct_sym =>  ar_to   } )

                l = %w[ ar_in~     ar_to~
                        card_this   ct_sym
                        glob_in     glob_this
                        hk_in       hk_to
                        hv_in~      hv_to~ ]
#               et(l, binding) if @debug                                  #T

                yield(hk_to, hv_to, ar_to)
              end
            end

          else                                    # constant leaf node
#           puts "\ndo constant leaf node"                                #T
            ar_to     = ar_in[glob_in]
            glob_to   = glob_this

            l = %w[ ar_in~   ar_to~
                    glob_in   glob_to   list_to~ ]
#           et(l, binding) if @debug                                      #T

            yield(hk_in, hv_in, ar_to)
          end
        end
      end

      ar_walk(self, glob_patt, &block)
    end

    def ar_keys(ar)
    #
    # Get an ordered list of keys for a level (eg, array, hash) in the
    # arborescence level.  Array "keys" are numeric (0, 1, ...).  Hash
    # keys are sorted by their string equivalents (via sort_by and to_s).

      if ar.kind_of?(Array)
        [0 .. ar.size-1]
      else
        ar.keys.sort_by {|key| key.to_s }
      end
    end

    def safe_split
    #
    # Like String#split, but handles nils.

      self.to_s.split
    end

  end

Summary

We've eliminated a great deal of looping code by adopting (and adapting :-) some popular idioms. The techniques we're using are both familiar and general, so they should be easy to use and widely applicable.

My implementation depends strongly on Ruby's notions of blocks. I'm curious to know how it might look in a language (eg, Clojure, JavaScript) that doesn't use this control structure. For example, as Brandon Bloom suggests, Clojure's get-in function (and friends) could be extended to support some sort of wildcard syntax, eg:

(query-in a-map [:foo wild :bar])


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r14 - 04 Apr 2016, RichMorin
This site is powered by Foswiki Copyright © by the contributing authors. All material on this wiki is the property of the contributing authors.
Foswiki version v2.1.6, Release Foswiki-2.1.6, Plugin API version 2.4
Ideas, requests, problems regarding CFCL Wiki? Send us email