File Characterization and Conversion

In a project of this scope, there are many plausible "starting points". One of the best involves characterizing files and converting existing metadata where possible.

The "man" pages offer good examples of both file characterization and data conversion. They are written in a simple and well-defined format (i.e., troff -man), have a well-developed formatting chain, and (generally) follow well-defined conventions in file naming and location.

Thus, it is fairly simple to recognize "man" pages in any of the common formats (e.g., ASCII, PostScript, troff). Each "recognized" file can then be characterized (e.g., as to format, topic, version) and the characterizing information can be saved as XML.
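As a sketch of this idea, a small script can recognize a troff man page by its conventional name (topic dot section, as in printf.3) and emit the characterizing information as XML. The element names and the exact naming pattern below are illustrative assumptions, not part of any existing tool:

```python
import re
import xml.etree.ElementTree as ET

# Man pages conventionally end in a section suffix: ls.1, printf.3, tty.4, ...
MAN_NAME = re.compile(r"^(?P<topic>[\w.+-]+)\.(?P<section>[1-9][a-zA-Z]?)$")

def characterize(filename):
    """Return an XML element describing a man page, or None if unrecognized."""
    m = MAN_NAME.match(filename)
    if m is None:
        return None
    elem = ET.Element("file", format="troff-man")
    ET.SubElement(elem, "topic").text = m.group("topic")
    ET.SubElement(elem, "section").text = m.group("section")
    return elem

record = characterize("printf.3")
print(ET.tostring(record, encoding="unicode"))
```

A production version would, of course, also check the file's content (not just its name) before trusting the characterization.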

Meanwhile, the troff source code (being highly structured data) can itself be parsed into XML. This allows the publishing side of XML to be used, generating HTML, indexed PDF, PostScript, or whatever other format is desired.
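To give the flavor of this, here is a deliberately rough sketch that maps two troff -man macros (.TH and .SH) onto XML elements. Real troff is far richer (fonts, escapes, tables, nested macros), and the element names are assumptions of mine, so treat this as an outline rather than a converter:

```python
import xml.etree.ElementTree as ET

def man_to_xml(troff_text):
    """Rough sketch: map .TH and .SH troff macros to XML elements.
    A real converter must also handle fonts, escapes, tables, etc."""
    root = ET.Element("manpage")
    section = None
    for line in troff_text.splitlines():
        if line.startswith(".TH"):
            parts = line.split()
            if len(parts) >= 3:
                root.set("title", parts[1])
                root.set("section", parts[2])
        elif line.startswith(".SH"):
            name = line[3:].strip().strip('"')
            section = ET.SubElement(root, "section", name=name)
        elif section is not None and not line.startswith("."):
            # Accumulate running text under the current section.
            section.text = (section.text or "") + line + "\n"
    return root

src = '.TH LS 1\n.SH NAME\nls - list directory contents\n'
print(ET.tostring(man_to_xml(src), encoding="unicode"))
```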

As files are parsed, interesting features can be extracted. In a "man" page, these might be keywords, "File" and "See Also" entries, etc. In C source code, use of system resources (e.g., data structures, include files, library functions) could be characterized.
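Two of these features are easy to mine with plain pattern matching, as a first approximation: cross-references of the form fopen(3) in a man page, and #include lines in C source. The regular expressions below are simplifications I am assuming for illustration (a real indexer would work from parsed structure):

```python
import re

# Cross-references like fopen(3), and C #include lines.
XREF = re.compile(r"\b([\w.+-]+)\((\d[a-zA-Z]?)\)")
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)

def man_xrefs(text):
    """Extract (topic, section) cross-references from man page text."""
    return sorted(set(XREF.findall(text)))

def c_includes(source):
    """Extract the names of included headers from C source."""
    return sorted(set(INCLUDE.findall(source)))

print(man_xrefs("See also fopen(3), fclose(3)."))
print(c_includes('#include <stdio.h>\n#include "local.h"\n'))
```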

Detailed metadata of this sort has many uses; automated indexing and hyperlink generation are obvious examples. A programmer, for example, might wish to look over examples of source code that uses a particular combination of system resources.

Any collection of files can be characterized, but automated recognition and conversion of arbitrary files is not a trivial (or even generally feasible) task. Consider, for example, the distribution tree for an Open Source offering. Each file and directory in the distribution has a purpose, but what is it?

A program can make a good guess as to many files' types and formats (e.g., employing file naming and "magic" conventions). Other files, showing up frequently enough to exhibit a dependable pattern, can be used to improve the recognition software. Some files and many directories, however, will always require human assistance.
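The "magic" side of this can be sketched in a few lines: read the first bytes of a file and compare them against known signatures. The table below is a tiny illustrative subset; real tools such as file(1) and libmagic know thousands of signatures:

```python
# A few well-known "magic" signatures; file(1)/libmagic know thousands.
MAGIC = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"\x1f\x8b", "gzip"),
    (b"#!", "script"),
]

def guess_type(path):
    """Guess a file's type from its leading bytes; 'unknown' on no match."""
    with open(path, "rb") as f:
        head = f.read(8)
    for sig, name in MAGIC:
        if head.startswith(sig):
            return name
    return "unknown"
```

In practice this check would be combined with the file-naming conventions mentioned above, since neither heuristic is reliable on its own.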

In any case, some knowledgeable human should make a pass over the result of the total process. There are many things that can go wrong with mechanical analysis of "raw" data. Also, humans are very good at seeing patterns; a human might well discern (or remember) information (e.g., a file's purpose) that a program would not.

Quality Assurance Benefits

In the process of examining a software distribution, certain questions are likely to arise: "What is this file used for, anyway?", "Why aren't these files in the same directory?" If no good answers can be found, the developer may be inspired to change the software (rather than explain the unexplainable :-).

In addition, global indexing of the software can make some questions easier to answer. The scanf(3) family of functions, for example, is widely regarded as a security sinkhole. If use of library functions and system calls is indexed, comprehensive examination becomes much more feasible.
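A crude version of such an index can be built by scanning a source tree for calls to a watch-list of functions. The list and the text-matching approach below are simplifying assumptions (a real indexer would parse the C rather than grep it), but they convey the idea:

```python
import os
import re

# Functions worth auditing; extend the watch-list as needed.
CALL = re.compile(r"\b(scanf|fscanf|sscanf|gets|strcpy)\s*\(")

def index_calls(root):
    """Map each watched function to the source files that appear to call it."""
    index = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                for func in set(CALL.findall(f.read())):
                    index.setdefault(func, []).append(path)
    return index
```

Given such an index, "find every caller of scanf" becomes a dictionary lookup instead of a tree-wide search.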

In short, the documentation process can be a significant aid to quality assurance. Anecdotal evidence (e.g., from the OpenBSD Project) indicates that many Eclectic Systems have hidden bugs; this might be a good way to bring some of them to light.

Other Opportunities

Other opportunities include conversion of metadata from existing indexing and package management systems into XML. Basically, any documentation resource that hasn't been characterized (at least) and converted (where appropriate) is a golden opportunity!

Just as package distributions can be characterized, so can installable operating system distributions. An OS integrator might start by characterizing all of the (5K or so) files that comprise their "vanilla" distribution.

The bad news about such a project is that it starts as a pretty big job and never really gets done. New packages arrive on the scene, details change, etc.

The good news is that the task can be partitioned (e.g., characterize the files in /etc) and that most distributions contain common files in (relatively) common places. Also, many things are stable, even in a changing release. The type, format, and purpose of a makefile, for example, do not tend to change from release to release.

Finally, there is the possibility that a package which an integrator is trying to characterize has already been handled by its developer or some other helpful party. In short, we should expect a snowball effect.

-- Main.RichMorin - 12 Jun 2003