Handling indentation (etc.) for various programming languages
is a critical part of our handling of Monospace.
Fortunately, Pygments
solves a large part of this problem very nicely.
Pygments is a syntax highlighter
written in Python.
It handles 300+ languages, as well as other text formats.
As discussed in the Pygments document "Write your own lexer",
adding lexical analysis
for a new language
appears to be fairly easy.
Lexers are written largely in a declarative style:
each lexer is a table of rules containing regular expressions
and (if need be) references to callback functions.
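As a rough illustration of this declarative style, here is a minimal sketch of a lexer for a made-up language. The language, the `TinyLexer` class, and the `classify_name` callback are all hypothetical; only `RegexLexer` and the token types come from Pygments.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, Text, Whitespace


def classify_name(lexer, match):
    """Callback: capitalized names become constants, others plain names."""
    text = match.group(0)
    ttype = Name.Constant if text[0].isupper() else Name
    yield match.start(), ttype, text


class TinyLexer(RegexLexer):
    """Hypothetical lexer for an invented language (illustration only)."""

    name = "Tiny"
    aliases = ["tiny"]

    # Each rule is (regular expression, token type or callback).
    tokens = {
        "root": [
            (r"#.*?$", Comment.Single),
            (r"\b(?:let|print)\b", Keyword),
            (r"[A-Za-z_]\w*", classify_name),
            (r"\s+", Whitespace),
            (r".", Text),
        ]
    }


for ttype, value in TinyLexer().get_tokens("let Max # cap\n"):
    print(ttype, repr(value))
```

The rules are tried in order, so the keyword rule must come before the more general name rule.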
Several output formats are available, including HTML
and ANSI escape codes.
We're using HTML, with Pygments' generated CSS.
In this format, a wrapper element
(by default, a div with the class "highlight") contains a set of code snippets,
each in a span whose class attribute indicates the desired display style.
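A minimal sketch of producing this HTML and its matching CSS follows. The Ruby sample is made up for illustration, and the `.highlight` selector is simply Pygments' default wrapper class, not necessarily the one our tool uses.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import RubyLexer

# Made-up sample input.
source = "def greet\n  puts 'hi'\nend\n"

formatter = HtmlFormatter()

# Markup: a wrapper element containing one span per styled token.
html = highlight(source, RubyLexer(), formatter)

# Stylesheet: rules for the short class names, scoped to the wrapper.
css = formatter.get_style_defs(".highlight")

print(html)
print(css)
```

Because the markup carries only class names, the same HTML can be restyled by swapping the stylesheet.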
Pygments recognizes and delineates keywords and other tokens from about 300 formats.
We can use this information in a number of ways, including code coloring, code folding, and indexing.
Code coloring has many benefits for sighted readers.
In a static context, such as our code reading tool,
styling choices (e.g., color, font, size) can make certain elements easier to recognize,
speeding up some kinds of visual scanning.
Color choices can also help to produce a desired "feel" to the page.
In a dynamic context, such as a text editor,
code coloring can alert the sighted programmer about mismatched quotes
and other issues whose effects ripple through the code.
However, it's not clear how to let a blind programmer gain similar value.
It might be useful to add naive introspection and navigation capabilities,
based on Pygments tokens.
For example, by selecting an item and requesting a contextual menu,
the user could find out what kind of token it is,
search for other uses of the item, etc.
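One way such introspection might work is sketched below. The function names `token_at` and `other_uses` are hypothetical, and the sketch assumes the selection is available as a character offset into the source text.

```python
from pygments.lexers import RubyLexer


def token_at(source, offset, lexer):
    """Return (token_type, text, start) for the token covering offset."""
    for start, ttype, value in lexer.get_tokens_unprocessed(source):
        if start <= offset < start + len(value):
            return ttype, value, start
    return None


def other_uses(source, offset, lexer):
    """Return start offsets of tokens matching the selected token."""
    selected = token_at(source, offset, lexer)
    if selected is None:
        return []
    ttype, value, _ = selected
    return [
        start
        for start, t, v in lexer.get_tokens_unprocessed(source)
        if t is ttype and v == value
    ]


source = "def greet\n  puts 'hi'\nend\n"
ttype, value, start = token_at(source, 5, RubyLexer())
print(ttype, value, start)
```

The token type's string form (e.g., "Token.Name.Function") could be turned into a spoken or displayed description for the contextual menu.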
Our current code folding module scans Ruby code for tokens (e.g., keywords).
It uses this information to determine the line ranges of Ruby methods and associated code.
Finally, it returns a data structure (list of hashes) describing each foldable section.
The scanning code, which uses regular expressions and hand-coded logic,
only handles a few tokens for a single language.
Using Pygments as a code scanning front end would let us handle a much larger range of tokens,
covering about 300 languages and other text formats.
Note, however, that Pygments is not a complete replacement for our scanning code.
For example, it can't recognize which end
keyword terminates a method.
Similar issues will exist for other languages and formats,
so we'll need a way to recognize sections for each language we wish to support.
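For Ruby, section recognition on top of Pygments tokens might look like the sketch below. It is a heuristic, not a parser: the `OPENERS` set and the `fold_ranges` function are assumptions made for illustration, block keywords are only counted at the start of a line (to skip modifier forms such as "puts x if y"), and constructs such as "while x do" or "y = if x" will be miscounted.

```python
from pygments.lexers import RubyLexer
from pygments.token import Keyword

# Keywords assumed to open a block when they start a line.
OPENERS = {"def", "class", "module", "if", "unless", "case",
           "begin", "while", "until", "for"}


def fold_ranges(source):
    """Return a list of dicts giving the line range of each section."""
    stack, ranges = [], []
    line, at_line_start = 1, True
    # stripnl=False keeps leading blank lines, so line numbers stay true.
    for ttype, value in RubyLexer(stripnl=False).get_tokens(source):
        if ttype in Keyword:
            if value == "end" and stack:
                kind, start = stack.pop()
                ranges.append({"kind": kind, "start": start, "end": line})
            elif value == "do" or (value in OPENERS and at_line_start):
                stack.append((value, line))
        if "\n" in value:
            line += value.count("\n")
            # Still at line start if only blanks follow the last newline.
            at_line_start = not value.rsplit("\n", 1)[1].strip()
        elif value.strip():
            at_line_start = False
    return ranges


sample = (
    "class Greeter\n"
    "  def greet(name)\n"
    '    puts "hi" if name\n'
    "  end\n"
    "end\n"
)
for section in fold_ranges(sample):
    print(section)
```

The stack pairs each end keyword with the most recent unclosed opener, which is the piece of information Pygments alone does not provide.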
By scanning the marked-up code, we can generate an index of significant items
(e.g., classes, methods, modules).
This could, for example, be used to generate a page-level Table of Contents.
Alternatively, it could help to support navigation.
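A minimal sketch of such an index builder follows, using only Python's standard HTML parser. It assumes Pygments' default short class names (nc, nf, and nn for class, function, and namespace names); the `IndexBuilder` class, the `build_index` function, and the entry layout are hypothetical.

```python
from html.parser import HTMLParser

# Pygments short class names for the items we want to index.
INDEXED = {"nc": "class", "nf": "method", "nn": "module"}


class IndexBuilder(HTMLParser):
    """Collect indexable names, with line numbers, from highlighted HTML."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._kind = None   # kind of the span currently open, if indexed
        self._line = 1      # current line within the highlighted code

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._kind = INDEXED.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span":
            self._kind = None

    def handle_data(self, data):
        if self._kind:
            self.entries.append(
                {"kind": self._kind, "name": data, "line": self._line}
            )
        self._line += data.count("\n")


def build_index(html):
    """Return a list of index entries for one highlighted code block."""
    builder = IndexBuilder()
    builder.feed(html)
    builder.close()
    return builder.entries
```

The line numbers are relative to the start of the highlighted block, so a page with several blocks would need an offset per block.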
Classes and Tokens
The following table is adapted, in part, from
the Pygments Builtin Tokens page.
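|| CSS class || Token || Description
|| c || Comment || any comment
|| ch || Comment.Hashbang || hashbang (aka shebang) comments
|| cm || Comment.Multiline || multiline comments
|| cp || Comment.Preproc || preprocessor comments
|| c1 || Comment.Single || comments that end at the end of a line
|| cs || Comment.Special || special data in comments
|| err || Error || represents lexer errors
|| g || Generic || generic, unstyled token
|| gd || Generic.Deleted || marks token as deleted
|| ge || Generic.Emph || marks token as emphasized
|| gr || Generic.Error || marks token as an error message
|| gh || Generic.Heading || marks token as a heading
|| gi || Generic.Inserted || marks token as inserted
|| go || Generic.Output || marks token as program output
|| gp || Generic.Prompt || marks token as a command prompt
|| gs || Generic.Strong || marks token as bold
|| gu || Generic.Subheading || marks token as a subheading
|| gt || Generic.Traceback || marks token as a part of an error traceback
|| k || Keyword || any kind of keyword
|| kc || Keyword.Constant || keywords that are constants
|| kd || Keyword.Declaration || keywords used for variable declarations
|| kn || Keyword.Namespace || keywords used for namespace declarations
|| kp || Keyword.Pseudo || keywords that aren’t really keywords
|| kr || Keyword.Reserved || reserved keywords
|| kt || Keyword.Type || builtin types that can’t be used as identifiers
|| l || Literal || any literal
|| ld || Literal.Date || date literals
|| n || Name || any name
|| na || Name.Attribute || names of attributes (e.g., in HTML)
|| nb || Name.Builtin || builtin names
|| bp || Name.Builtin.Pseudo || builtin names that are implicit
|| nc || Name.Class || names of classes
|| no || Name.Constant || names of constants
|| nd || Name.Decorator || names of decorators
|| ni || Name.Entity || special entities (e.g., in HTML)
|| ne || Name.Exception || names of exceptions
|| nf || Name.Function || names of functions or methods
|| nl || Name.Label || names of statement labels
|| nn || Name.Namespace || names of namespaces
|| py || Name.Property || names of properties
|| nt || Name.Tag || names of tags (e.g., in HTML)
|| nv || Name.Variable || names of variables
|| vc || Name.Variable.Class || names of class variables
|| vg || Name.Variable.Global || names of global variables
|| vi || Name.Variable.Instance || names of instance variables
|| m || Number || any number
|| mb || Number.Bin || binary number
|| mf || Number.Float || floating point number
|| mh || Number.Hex || hexadecimal number
|| mi || Number.Integer || integer number
|| il || Number.Integer.Long || long integer number
|| mo || Number.Oct || octal number
|| o || Operator || any punctuation operator
|| ow || Operator.Word || any operator that is a word
|| x || Other || data not matched by the parser
|| p || Punctuation || any punctuation which is not an operator
|| s || String || any string literal
|| sb || String.Backtick || strings enclosed in backticks
|| sc || String.Char || single characters
|| sd || String.Doc || documentation strings
|| s2 || String.Double || strings enclosed in double quotes
|| se || String.Escape || escape sequences in strings
|| sh || String.Heredoc || "here document" strings
|| si || String.Interpol || interpolated parts in strings
|| sx || String.Other || other strings
|| sr || String.Regex || regular expression literals
|| s1 || String.Single || strings enclosed in single quotes
|| ss || String.Symbol || symbols (interned strings)
|| w || Text.Whitespace || spaces and tabs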
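The pairing of token types and short CSS class names can also be read directly from Pygments, which may help keep a table like this one current. A minimal sketch, in which the `class_table` helper is hypothetical and `STANDARD_TYPES` is Pygments' own mapping:

```python
from pygments.token import STANDARD_TYPES, Comment, Name


def class_table():
    """Return (css_class, token_name) pairs, sorted by token name."""
    pairs = []
    for ttype, css in STANDARD_TYPES.items():
        if css:  # a few types, such as the root Token, have no class
            pairs.append((css, str(ttype)))
    return sorted(pairs, key=lambda pair: pair[1])


# Look up individual classes, then list the whole mapping.
print(STANDARD_TYPES[Name.Function])
print(STANDARD_TYPES[Comment.Single])
for css, token_name in class_table():
    print(css, token_name)
```

The token names print in their fully qualified form (e.g., "Token.Name.Function"), and the list may include types added in newer Pygments releases.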
This wiki page is maintained by Rich Morin,
an independent consultant specializing in software design, development, and documentation.
Please feel free to email
comments, inquiries, suggestions, etc.!