Edit Hacks - meta

This page discusses some ideas about extracting metadata from hand-edited text files (e.g., code, data).

Motivation

Hand-edited text files almost always contain extractable metadata. English text, for example, can yield keyword indices, sentiment, structure, and more. Data files commonly follow a handful of standards (e.g., CSV, JSON, XML, YAML). Source code can yield large amounts of metadata, once the language has been determined and its syntax parsed. Although the general problem of metadata extraction is AI-complete, a large subset can be handled by existing technology (e.g., parsing libraries).

Pattern recognition

Pattern recognition can be used in several ways, from selecting a parser to detecting ad hoc formatting based on white space. Again, the general problem is intractable, but naive approaches may yield useful results.

Parser Selection

Typically, the file name extension will indicate the language being used in a file. However, the extension may be missing or ambiguous. If the user doesn't know the language (or we don't want to bother her by asking), we can try to guess it in order to select a parser.
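As a rough illustration of such guessing, here is a minimal Ruby sketch. It consults the extension first and falls back on a few content checks. The names guess_language, valid_json?, and EXTENSIONS, as well as the particular heuristics, are assumptions made for this example, not part of any existing tool:

```ruby
require 'json'

# Hypothetical extension table; a real one would be much larger.
EXTENSIONS = {
  '.rb'   => :ruby,
  '.json' => :json,
  '.yml'  => :yaml,
  '.yaml' => :yaml,
  '.csv'  => :csv,
  '.xml'  => :xml
}.freeze

# Guess a language from the file name, falling back on the content.
# Returns a symbol, or nil if no heuristic matches.
def guess_language(name, text)
  by_ext = EXTENSIONS[File.extname(name.to_s).downcase]
  return by_ext if by_ext

  first = text.lines.first.to_s
  return :ruby if first =~ /\A#!.*\bruby\b/        # shebang line
  return :xml  if text.lstrip.start_with?('<?xml') # XML declaration
  return :json if valid_json?(text)
  return :yaml if first.strip == '---'             # YAML document marker
  nil
end

# True if the whole text parses as JSON.
def valid_json?(text)
  JSON.parse(text)
  true
rescue JSON::ParserError
  false
end
```

Returning nil rather than a default leaves the caller free to ask the user or to try several parsers in turn.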

Recognition

Hand-edited code and data files commonly use white space to make their internal structure more visually evident. The most common cases of this are indented hierarchies and other forms of vertical alignment (e.g., columnar data). These aren't all that hard to recognize, if we're willing to accept occasional false positives.

One possible approach would calculate the likelihood of a given "card column" (CC) in the current line being semantically significant (e.g., starting a data column). This information could then be used to alert the user about instances of vertical alignment, support tabbing, etc. A "proof of concept" script that implements this approach is available here.
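The following Ruby sketch shows one naive way to score columns. It is not the script mentioned above. It assumes that tabs have already been expanded, that a field "starts" wherever a non-space character follows at least two spaces, and that a column is significant when at least half of the non-blank lines start a field there. The method names and the 0.5 threshold are illustrative assumptions:

```ruby
# Columns (0-based) where a non-space character follows two or more spaces.
def field_starts(line)
  starts = []
  line.scan(/(?<=  )\S/) { starts << Regexp.last_match.begin(0) }
  starts
end

# Map each candidate column to the fraction of non-blank lines
# that start a field at that column.
def column_scores(lines)
  counts = Hash.new(0)
  lines.each { |line| field_starts(line).each { |col| counts[col] += 1 } }
  total = lines.count { |line| !line.strip.empty? }
  return {} if total.zero?

  counts.transform_values { |n| n.to_f / total }
end

# Columns whose score meets the (assumed) threshold, in ascending order.
def boundaries(lines, threshold = 0.5)
  column_scores(lines).select { |_col, score| score >= threshold }.keys.sort
end
```

Scoring over a sliding window of nearby lines, rather than the whole file, would probably suit files that mix several aligned regions. Column 0 is never reported, because nothing precedes it.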

Presentation

Once we have recognized semantically significant boundaries, we can worry about making them accessible. Here, by way of example, is some Ruby code for a data structure literal. Although the spacing hints at the implicit structure, most screen readers will neither recognize this nor assist the user in navigating the fields:

foo = {
  bar:   42,   # foo[:bar] is 42
  baz:   43    # foo[:baz] is 43
}

Here is the same code, laid out in a tabular representation. Most screen readers will recognize this and assist the user in navigating the fields:

| foo = { | | |
| bar: | 42, | # foo[:bar] is 42 |
| baz: | 43 | # foo[:baz] is 43 |
| } | | |

If the user is able to toggle between these layouts, she can use whichever one works best for a given task.
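Given a list of boundary columns, one way to derive table cells from an aligned line is to cut the line at each boundary it actually honors. The Ruby sketch below assumes a tab-expanded line and an ascending list of columns. The name split_at is hypothetical:

```ruby
# Split a tab-expanded line into cells at the given boundary columns.
# `columns` is assumed to be sorted in ascending order. A column is used
# only if this line really starts a field there, so short lines such as
# "foo = {" are not cut in the middle of a word.
def split_at(line, columns)
  cuts = columns.select do |c|
    c.positive? && c < line.length && line[c - 1] == ' ' && line[c] != ' '
  end
  ([0] + cuts + [line.length])
    .each_cons(2)
    .map { |from, to| line[from...to].strip }
    .reject(&:empty?)
end
```

Each resulting array can then be rendered as one table row. Dropping empty cells keeps the sketch short, but it also discards indentation, so a fuller version would need to record the nesting level separately.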

Implementation

I've written a "proof of concept" script (awsb) to analyze and display white space boundaries. It prints each line in the input (with tabs expanded into spaces), then prints a line that indicates the location of "boundary" character positions.
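The display step can be sketched as follows. This is not the awsb source. The tab width of 8, the '^' marker character, and the method names are all assumptions made for illustration:

```ruby
# Expand tabs to spaces, assuming tab stops every `width` columns.
def expand_tabs(line, width = 8)
  out = +''
  line.each_char do |ch|
    if ch == "\t"
      # Pad to the next tab stop.
      out << ' ' * (width - out.length % width)
    else
      out << ch
    end
  end
  out
end

# Build a marker line with '^' under each boundary column.
def marker_line(length, columns)
  marks = ' ' * length
  columns.each { |c| marks[c] = '^' if c < length }
  marks.rstrip
end

# Print a line followed by its marker line.
def show(line, columns)
  text = expand_tabs(line.chomp)
  puts text
  puts marker_line(text.length, columns)
end
```

Expanding tabs before any analysis matters, because a tab occupies one character in the string but a variable number of columns on the screen.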

Most modern programming languages have libraries to parse common data formats. Many provide ways to parse their own code, yielding data structures that describe the code's Abstract Syntax Trees (ASTs). Assorted tools (e.g., documentation generators, prettyprinters, syntax highlighters) also tend to be available for popular languages.
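Ruby is one example: its standard library includes Ripper, which exposes the language's own lexer and parser. The short sketch below pulls simple metadata (identifiers and comments) out of a source string. The helper name tokens_of and the sample source are assumptions for this example:

```ruby
require 'ripper'

# Return the text of every token of the given type
# (e.g. :on_ident for identifiers, :on_comment for comments).
def tokens_of(source, type)
  Ripper.lex(source).select { |tok| tok[1] == type }.map { |tok| tok[2] }
end

source = "foo = { bar: 42 } # answer\n"

# Nested arrays describing the syntax tree, starting with :program.
tree = Ripper.sexp(source)

identifiers = tokens_of(source, :on_ident)
comments    = tokens_of(source, :on_comment)
```

Token streams are often enough for indexing names and comments. The tree from Ripper.sexp is more useful when nesting matters.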

The GNU Compiler Collection (GCC) can parse various programming languages, generating a common intermediate representation. Finally, compiler-compiler technology can be used to parse specific languages or syntactically-related clusters.


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r11 - 28 Jun 2016, RichMorin