TOC Talk

This page discusses the generation and use of "Table of Contents" (TOC) lists.

Introduction

"Table of Contents" (TOC) lists can be very useful for navigation within and among web pages. There are many pages (and other documents) that could benefit from generated (or simply improved) TOCs.

A TOC can be generated for a web page by harvesting the section headers. For example, the TOC at the top of this page was created by the wiki engine. In a similar manner, it should be possible to generate TOCs for HTML pages that we generate (e.g., from code listings), retrieve from the web, or extract from EPUB documents.
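As a rough illustration of harvesting section headers, here is a minimal Ruby sketch. The method name harvest_headers and its arguments are hypothetical, and a regular expression is used only to keep the example free of dependencies; a real HTML parser would cope far better with messy markup. The output uses the href/level/name hash format shown later on this page.

```ruby
# Hypothetical helper: collect h1-h6 headers from an HTML string.
# The regex approach is an assumption made for brevity; it ignores
# comments, scripts, and malformed markup.
def harvest_headers(html, file)
  pattern = %r{<h([1-6])([^>]*)>(.*?)</h\1>}im
  html.scan(pattern).map do |digit, attrs, inner|
    id   = attrs[/\bid\s*=\s*["']([^"']*)["']/i, 1]  # nil if no id
    name = inner.gsub(/<[^>]+>/, '').strip           # drop nested tags
    {
      href:  id ? "#{file}##{id}" : file,
      level: digit.to_i,
      name:  name
    }
  end
end
```

Headers without an id attribute can only be linked at the file level, so a fuller version might need to add anchors to the page.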

Information sources

EPUB documents may contain a variable set of information sources, including the OPF file and one or both of the Nav and NCX files. Each file has a different format and the reported information may well be conflicting and/or erroneous. Some of the issues include:

  • omitting mention of pages in lists, etc.
  • using ordered lists for already numbered titles
  • treating a "part" page as if it were a chapter
  • treating sections as if they were chapters

The Navigation (Nav) file, defined in EPUB version 3, uses XHTML format. So, it can be used directly as a TOC page or harvested for information. It contains a list (or possibly a tree) of page titles and links.
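The following sketch shows one way the Nav file might be harvested, using REXML from the Ruby distribution. The method names (child_elements, text_of, walk_nav_list, harvest_nav) are hypothetical. It assumes the TOC lives in a nav element whose epub:type attribute is "toc" and that entries are li elements holding an a element and, optionally, a nested ol.

```ruby
require 'rexml/document'

# Direct child elements of `element` with the given local name.
def child_elements(element, name)
  found = []
  element.each_element { |e| found << e if e.name == name }
  found
end

# Concatenate all text below a node, collapsing whitespace.
def text_of(node)
  parts = node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then text_of(child)
    else ''
    end
  end
  parts.join.gsub(/\s+/, ' ').strip
end

# Walk an <ol>, emitting one hash per <li><a> entry.
def walk_nav_list(list, level, out)
  child_elements(list, 'li').each do |li|
    link = child_elements(li, 'a').first
    if link
      out << { href: link.attributes['href'], level: level, name: text_of(link) }
    end
    child_elements(li, 'ol').each { |sub| walk_nav_list(sub, level + 1, out) }
  end
  out
end

def harvest_nav(xhtml)
  nav = nil
  REXML::Document.new(xhtml).root.each_recursive do |e|
    nav ||= e if e.name == 'nav' && e.attributes['epub:type'] == 'toc'
  end
  return [] unless nav
  list = child_elements(nav, 'ol').first
  list ? walk_nav_list(list, 1, []) : []
end
```

Entries that use a span instead of a link (allowed for unlinked headings) are skipped here, though their nested lists are still walked.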

NCX file

The Navigation Control XML (NCX) file was borrowed from the DAISY Digital Talking Book (DTB) specification. It was defined for EPUB version 2 and is not required for version 3 documents. However, it is often included for compatibility with older reading software. It contains a tree of navMap, navPoint, and navLabel elements, containing page titles and file paths.
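A comparable sketch for the NCX file follows, again using REXML. The method names are hypothetical. It assumes that each navPoint holds a navLabel with a text child (the title) and a content element whose src attribute gives the file path, and that nested navPoint elements express the hierarchy.

```ruby
require 'rexml/document'

# Direct child elements of `element` with the given local name.
def child_elements(element, name)
  found = []
  element.each_element { |e| found << e if e.name == name }
  found
end

# Walk nested navPoint elements, emitting one hash per entry.
def walk_nav_points(parent, level, out)
  child_elements(parent, 'navPoint').each do |point|
    label   = child_elements(point, 'navLabel').first
    text_el = label && child_elements(label, 'text').first
    content = child_elements(point, 'content').first
    out << {
      href:  content && content.attributes['src'],
      level: level,
      name:  text_el ? text_el.text.to_s.strip : nil
    }
    walk_nav_points(point, level + 1, out)
  end
  out
end

def harvest_ncx(xml)
  root    = REXML::Document.new(xml).root
  nav_map = child_elements(root, 'navMap').first
  nav_map ? walk_nav_points(nav_map, 1, []) : []
end
```

Missing labels or content elements yield nil values rather than errors, which leaves the decision about incomplete entries to a later reconciliation step.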

OPF file

The Open Packaging Format (OPF) file contains information which can be used to generate a serial "reading list". The manifest element is an unsorted collection of item elements, each of which has href, id, and media-type attributes. The spine element is a sorted list of itemref elements, each of which has an idref attribute that refers to the id of a manifest item.
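Here is a minimal sketch of building the reading list with REXML. The method names are hypothetical. It assumes that each spine idref matches the id of a manifest item; unmatched references are silently dropped, which a production version might prefer to report.

```ruby
require 'rexml/document'

# Direct child elements of `element` with the given local name.
def child_elements(element, name)
  found = []
  element.each_element { |e| found << e if e.name == name }
  found
end

# Return the manifest hrefs in spine (reading) order.
def reading_list(opf_xml)
  package  = REXML::Document.new(opf_xml).root
  manifest = child_elements(package, 'manifest').first
  spine    = child_elements(package, 'spine').first
  return [] unless manifest && spine

  # Index the unsorted manifest items by their id attribute.
  hrefs = {}
  child_elements(manifest, 'item').each do |item|
    hrefs[item.attributes['id']] = item.attributes['href']
  end

  # The spine supplies the order; each idref names a manifest item.
  child_elements(spine, 'itemref')
    .map { |ref| hrefs[ref.attributes['idref']] }
    .compact
end
```

Note that the OPF file supplies no titles or levels, and that the href values are relative to the location of the OPF file itself.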

Page files

Although the title element of a page file can be harvested, it isn't always useful. In some EPUB documents, it may be a fixed string or simply missing.

Integration

We'd like to integrate these sources into a more complete and polished TOC, reconciling the harvested information along the way. Our expected approach is to collect any available information, then produce a corrected, integrated, and reconciled result. We currently harvest lists of hashes from the Nav, NCX, and OPF files, using a consistent and convenient format:

[
  {
    href:  'ch01.html#_installation',
    level:  1,
    name:  'Installation'
  },
  ...
]

Using this data, we generate TOC pages (Nav', NCX', and OPF'), containing trees of links.
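To make the step from a flat list to a tree of links concrete, here is a minimal sketch. The method names build_tree and render_tree are hypothetical, as is the choice of nested ul elements for the output. It assumes levels start at 1; an item that skips a level is attached to the nearest available parent.

```ruby
require 'erb'

# Turn a flat list of { href:, level:, name: } hashes into a tree.
def build_tree(items)
  root  = { children: [] }
  stack = [root]                 # stack[n] is the latest node at level n
  items.each do |item|
    node  = item.merge(children: [])
    level = [[item[:level], 1].max, stack.size].min  # clamp skipped levels
    stack = stack[0, level]      # pop back to the parent of this level
    stack.last[:children] << node
    stack << node
  end
  root[:children]
end

# Render a tree as nested <ul> lists of links.
def render_tree(nodes)
  return '' if nodes.empty?
  rows = nodes.map do |node|
    href = ERB::Util.html_escape(node[:href])
    name = ERB::Util.html_escape(node[:name])
    "<li><a href=\"#{href}\">#{name}</a>#{render_tree(node[:children])}</li>"
  end
  "<ul>#{rows.join}</ul>"
end
```

Building the tree first, rather than emitting tags while scanning the list, keeps the open/close bookkeeping out of the rendering code.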

Ambiguity

Given that the information sources are imperfect, we have ample opportunity for ambiguity. Here is a simple example of the diamond problem:

  • The OPF spine begins with items A, B1, and C.
  • The Nav tree begins with items A, B2, and C.
  • The NCX tree begins with items A, B3, and C.
  • Items B1-B3 should all be between A and C.
  • However, what order should they be in?

Because we don't see any clean and reliable way to merge the TOC entries, we have decided to take an entirely different approach. The main TOC is based on the best available data set, using the preference order: Nav, NCX, OPF. Any other items are listed as "extras".
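The preference-order approach might be sketched as follows. The method name choose_toc, the :source key, and the use of exact href matching to decide whether an item is already covered are all assumptions made for illustration.

```ruby
PREFERENCE = %i[nav ncx opf].freeze

# `sources` maps :nav, :ncx, and :opf to lists of TOC hashes
# (any of which may be missing or empty).
def choose_toc(sources)
  main_key = PREFERENCE.find { |key| sources[key] && !sources[key].empty? }
  return { source: nil, main: [], extras: [] } unless main_key

  main   = sources[main_key]
  known  = main.map { |item| item[:href] }
  extras = []

  # Anything the main TOC does not mention becomes an "extra".
  PREFERENCE.each do |key|
    next if key == main_key
    (sources[key] || []).each do |item|
      next if known.include?(item[:href])
      known << item[:href]
      extras << item.merge(source: key)
    end
  end

  { source: main_key, main: main, extras: extras }
end
```

Applied to the example above, the main TOC would be A, B2, C (from the Nav file), with B3 and B1 listed as extras from the NCX and OPF files, so no ordering among B1-B3 ever has to be guessed.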

Restructuring

Some EPUB documents take liberties in reporting their structure. For example, Pragmatic Bookshelf's "Programming Ruby 1.9 & 2.0" lists Part instances (and some Section instances) as if they were Chapters. This isn't a problem, visually, but it distorts the hierarchy and gets in the way of our attempts at providing navigation aids.

So, let's examine a possible remedy. The first task is to determine whether the document has this issue. @tocs is a list of hashes summarizing TOC items. By scanning this list, we can locate any instances where items with different levels in the structure are reported at the same level. If we don't find any such instances, we're done! Otherwise, we need to impose some rules, e.g.:

  • Chapter, section, and subsection items are assumed to be subsumed under the preceding Part item. So, if need be, we increment their level numbers.

  • Section and subsection items are assumed to be subsumed under the preceding Chapter item. So, if need be, we increment their level numbers.

  • Any item (e.g., Appendix, Index) which is not one of the above terminates the range of the Part.
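The detection step and the rules above could be sketched like this. The sketch assumes that each hash in the list carries a hypothetical :kind key (:part, :chapter, :section, :subsection, or anything else for items such as Appendix or Index), assigned by some earlier classification step that is not shown. "Incrementing" is interpreted here as raising a level to at least one more than its enclosing item; other interpretations are possible.

```ruby
STRUCTURAL = %i[part chapter section subsection].freeze

# True if some level holds more than one structural kind.
def mixed_levels?(tocs)
  tocs.select { |t| STRUCTURAL.include?(t[:kind]) }
      .group_by { |t| t[:level] }
      .any? { |_level, items| items.map { |t| t[:kind] }.uniq.size > 1 }
end

# Return a copy of the list with levels adjusted per the rules.
def restructure(tocs)
  return tocs unless mixed_levels?(tocs)

  part_level = chapter_level = nil
  tocs.map do |item|
    level = item[:level]
    case item[:kind]
    when :part
      part_level, chapter_level = level, nil
    when :chapter
      level = [level, part_level + 1].max if part_level
      chapter_level = level
    when :section, :subsection
      parent = chapter_level || part_level
      level  = [level, parent + 1].max if parent
    else
      part_level = chapter_level = nil  # e.g., Appendix, Index
    end
    item.merge(level: level)
  end
end
```

This version does not push subsections below sections, because the rules as written do not cover that case.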

To be continued...


This wiki page is maintained by Rich Morin, an independent consultant specializing in software design, development, and documentation. Please feel free to email comments, inquiries, suggestions, etc!

Topic revision: r14 - 04 Dec 2016, RichMorin