File Characterization and Conversion
In a project of this scope,
there are many plausible "starting points".
One of the best involves characterizing files,
converting existing metadata where possible.
The "man" pages offer good examples
of both file characterization and data conversion.
They are written in a simple and well-defined format,
have a well-developed formatting chain,
and (generally) follow well-defined conventions
in file naming and location.
Thus, it is fairly simple to recognize "man" pages
in any of the common formats
(e.g., ASCII, PostScript).
Each "recognized" file can then be characterized
(e.g., as to format, topic, version)
and the characterizing information can be saved as XML.
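As a minimal sketch of this recognize-and-characterize step, the Python fragment below identifies a "man" page from its name and location alone and records the result as an XML element. The path convention (a `man<section>` directory holding `<topic>.<section>` files, optionally gzipped), the function name, and the element and attribute names are all assumptions made for illustration; they are not a schema defined by this project.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Assumed convention: .../man<section>/<topic>.<section>[.gz]
MAN_DIR = re.compile(r"man[1-9][a-z]*")
MAN_FILE = re.compile(r"^(?P<topic>.+?)\.(?P<section>[1-9][a-z]*)(?:\.gz)?$")

def characterize_man_page(path):
    """Return an XML element describing a "man" page,
    or None if the path does not follow the assumed convention."""
    p = PurePosixPath(path)
    m = MAN_FILE.match(p.name)
    if m is None or not MAN_DIR.fullmatch(p.parent.name):
        return None
    # Hypothetical element and attribute names, chosen for this sketch only.
    elem = ET.Element("file", {"path": str(p), "type": "man-page"})
    ET.SubElement(elem, "topic").text = m.group("topic")
    ET.SubElement(elem, "section").text = m.group("section")
    ET.SubElement(elem, "compressed").text = str(p.suffix == ".gz").lower()
    return elem

if __name__ == "__main__":
    page = characterize_man_page("/usr/share/man/man1/ls.1.gz")
    print(ET.tostring(page, encoding="unicode"))
```

Paths that do not match simply yield `None`, which leaves them for other recognizers (or for a human) to handle.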
Source code (being highly structured data)
can itself be parsed into XML.
This allows the publishing side of XML to be used,
generating HTML, indexed PDF, PostScript,
or whatever other format is desired.
As files are parsed,
interesting features can be extracted.
In a "man" page,
these might be keywords, "File" and "See Also" entries, etc.
In C source code, use of system resources
(e.g., data structures, include files, library functions)
could be characterized.
Detailed metadata of this sort has many uses;
automated indexing and hyperlink generation are obvious examples.
A programmer, for example,
might wish to look over examples of source code
that uses a particular combination of system resources.
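The kind of query described here can be sketched in a few lines of Python. The regular expressions below are a rough heuristic, not a C parser: they collect `#include` targets and any identifier followed by an opening parenthesis (which catches function definitions as well as calls). The function names `extract_features` and `files_using` are hypothetical, introduced only for this example.

```python
import re

INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)
NAME_BEFORE_PAREN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# C keywords that may be followed by "(" but are not functions.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

def extract_features(source):
    """Collect include files and function-like names from C source text."""
    return {
        "includes": set(INCLUDE.findall(source)),
        "functions": set(NAME_BEFORE_PAREN.findall(source)) - KEYWORDS,
    }

def files_using(index, includes=(), functions=()):
    """Return the files whose features contain every requested resource."""
    return sorted(
        name
        for name, feats in index.items()
        if set(includes) <= feats["includes"]
        and set(functions) <= feats["functions"]
    )
```

Given an index built as `{filename: extract_features(text)}`, a call such as `files_using(index, includes=["sys/socket.h"], functions=["printf"])` lists the files that use that particular combination. A real implementation would want a proper parser, since comments and string literals can fool these patterns.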
Any collection of files can be characterized,
but automated recognition and conversion of arbitrary files
is not a trivial (or even feasible) task.
Consider, for example, the distribution tree
for an Open Source package.
Each file and directory in the distribution has a purpose,
but what is it?
A program can make a good guess as to many files' types and formats
(e.g., employing file naming and "magic" conventions).
Other files, showing up frequently enough to exhibit a dependable pattern,
can be used to improve the recognition software.
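A rough illustration of such guessing, in Python: the function below checks a few leading "magic" bytes first, then falls back on well-known file names and suffixes. The tables are deliberately tiny and are assumptions of this sketch (real recognizers, such as the file(1) command, use far larger rule sets). A result of `None` marks a file that needs either a new rule or a human.

```python
import os

# Leading byte sequences ("magic") for a few common formats.
MAGIC = [
    (b"%!PS", "PostScript"),
    (b"%PDF-", "PDF"),
    (b"\x1f\x8b", "gzip"),
    (b"\x7fELF", "ELF binary"),
    (b"#!", "script"),
]
# Naming conventions; illustrative only.
NAMES = {"Makefile": "make rules", "README": "plain text"}
SUFFIXES = {".c": "C source", ".h": "C header", ".html": "HTML", ".xml": "XML"}

def guess_type(name, head=b""):
    """Guess a file's type from its first bytes, then from its name.
    Returns None when no rule applies."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    base = os.path.basename(name)
    if base in NAMES:
        return NAMES[base]
    return SUFFIXES.get(os.path.splitext(base)[1].lower())
```

Here `head` is expected to hold the first few bytes of the file, read by the caller. Keeping the rules in plain tables makes it easy to add an entry whenever an unrecognized pattern turns up often enough to be dependable.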
Some files and many directories, however,
will always require human assistance.
In any case,
some knowledgeable human should make a pass
over the result of the total process.
There are many things that can go wrong
with mechanical analysis of "raw" data.
Also, humans are very good at seeing patterns;
a human might well discern (or remember) information
(e.g., a file's purpose) that a program would not.
Quality Assurance Benefits
In the process of examining a software distribution,
certain questions are likely to arise:
"What is this file used for, anyway?",
"Why aren't these files in the same directory?"
If no good answers can be found,
the developer may be inspired to change the software
(rather than explain the unexplainable :-).
In addition, global indexing of the software
can make some questions easier to answer.
family of functions, for example,
is widely regarded as a security sinkhole.
If use of library functions and system calls is indexed,
comprehensive examination becomes much more feasible.
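To make the idea concrete, here is a small Python sketch of such an index: it maps each function-like name to the files and line numbers where it appears, so that every use of a suspect function can be listed at once. The pattern is a heuristic (it also matches definitions and C keywords such as `if`), and the names `build_call_index` and `audit` are hypothetical. Any watch list passed to `audit` is the auditor's own choice; nothing here is meant to identify a specific family of functions.

```python
import re
from collections import defaultdict

NAME_BEFORE_PAREN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

def build_call_index(sources):
    """Map each function-like name to a list of (file, line number) pairs.
    `sources` is a dict of {filename: source text}."""
    index = defaultdict(list)
    for fname in sorted(sources):
        for lineno, line in enumerate(sources[fname].splitlines(), start=1):
            for name in NAME_BEFORE_PAREN.findall(line):
                index[name].append((fname, lineno))
    return index

def audit(index, watch_list):
    """Return the index entries for the names on a caller-supplied watch list."""
    return {name: index[name] for name in watch_list if name in index}
```

For example, `audit(build_call_index(sources), ["strcpy", "gets"])` would return only the watched names that actually occur, each with its list of locations, which is the starting point for a comprehensive review.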
In short, the documentation process
can be a significant aid to quality assurance.
Anecdotal evidence (e.g., from the OpenBSD Project)
indicates that many Eclectic Systems
have hidden bugs;
systematic characterization and indexing might be a good way to bring some of them to light.
Other opportunities include conversion of metadata
from existing indexing and package management systems into XML.
Basically, any documentation resource that hasn't been characterized (at least)
and converted (where appropriate) is a golden opportunity!
Just as package distributions can be characterized,
so can installable operating system distributions.
An OS integrator might start by characterizing all of the (5K or so) files
that make up their "vanilla" distribution.
The bad news about such a project is that it starts as a pretty big job
and never really gets done.
New packages arrive on the scene, details change, etc.
The good news is that the task can be partitioned
(e.g., characterize the files in
and that most distributions
contain common files in (relatively) common places.
Also, many things are stable, even in a changing release.
The type, format, and purpose of a
file, for example,
do not tend to change from release to release.
Finally, there is the possibility
that a package which an integrator is trying to characterize
has already been handled by its developer or some other helpful party.
In short, we should expect a snowball effect.
-- Main.RichMorin - 12 Jun 2003