Model-Based Documentation

Rich Morin


As these authors imply, there are many kinds of information and many ways to lose the use of it. Whatever the reason, if you can't locate information when you need it, you have lost the benefits of having it. Access to information is critical to the success of any enterprise. Its loss can result in confusion, frustration, and inefficiency, to say nothing of direct financial impact.

Any significant enterprise can be thought of as a system: an assemblage of interrelated elements comprising a unified whole. In computer-based systems, large amounts of detailed information are collected about these elements. However, even well-managed systems may suffer from "indirect" forms of information loss:

Model-based Documentation (MBD) addresses these problems (among others) by providing an integrated approach to the semi-automated generation of detailed system documentation. A consistent "model" (e.g., set of entities and relationships, based on the system being documented) assists and guides the process:

Let's look at how this is accomplished...

Systems, Data, and MBD

Systems are composed of entities and relationships. Some entities are tangible: buildings, computers, employees. Others are more abstract: bug reports, meetings, programs, releases, telephone numbers. A system's capabilities and internal structure are defined by its entities and the relationships between them.

Because most enterprises store their operational data on computers, a large amount of detailed information is likely to be stored online. Unfortunately, the information is frequently balkanized by technical or organizational boundaries, format incompatibilities, etc. Even without these impediments, the complexity of production systems can be overwhelming.

MBD addresses these problems by modeling key entities and relationships. This mapping, like those used in object-oriented programming and computer-based ontologies, takes advantage of the fact that large numbers of instances can be described by small numbers of classes. For example, we can define classes for methods, data structures, and the relationships between them. We can then track (and report on) arbitrary numbers of data structures, method calls, etc.
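The class-vs-instance idea above can be sketched in a few lines of code. This is a minimal illustration, not part of any real MBD tool; the class and field names are invented for the example.

```python
from dataclasses import dataclass, field

# A small number of classes describing an arbitrary number of
# instances -- the core economy that MBD modeling relies on.
# (All names here are illustrative, not from a real system.)

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str          # e.g., "method", "data_structure"

@dataclass(frozen=True)
class Relationship:
    subject: Entity
    predicate: str     # e.g., "calls", "writes"
    obj: Entity

@dataclass
class Model:
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def related(self, entity, predicate):
        """All entities linked from `entity` by `predicate`."""
        return [r.obj for r in self.relationships
                if r.subject == entity and r.predicate == predicate]

# Two classes suffice to track any number of methods and structures:
parse = Entity("parse_input", "method")
tree  = Entity("syntax_tree", "data_structure")
model = Model([parse, tree], [Relationship(parse, "writes", tree)])
print([e.name for e in model.related(parse, "writes")])   # ['syntax_tree']
```

Once instances and relationships are captured this way, reports ("which methods write which structures?") become simple queries over the model.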

By cataloging high-level information on available data sources, MBD makes it easier to identify and locate needed data. We can start to think about questions that this wealth of detailed information could be used to answer. As information is extracted and used, the model will increase in detail, aiding the investigation of more complex questions.


MBD normally begins with the construction of a (rough) model of the system in question. This should catalog and describe the system's major entities (i.e., components and subsystems) and relationships (e.g., control and data flows). Note that this isn't a physical model (as in model airplanes), let alone a simulation. Instead, it follows this definition:

Some parts of the model will become visible in the documentation, providing a consistent structure, convenient navigation, etc. Other parts will remain hidden from the user, but will motivate and guide the development process, allow disparate data sources to be used together, etc. In either case, the model will improve the utility and convenience of the documentation.

Meanwhile, the presence of the model will help to motivate and support analytical thinking. A software development manager might, for example, ask for comparative frequencies of pre- and post-release bugs. Alternatively, it might be useful to rank software modules by bug frequency, check-in activity, and/or size.

A simple chart or table can be very illuminating, if it shows the right information in the right way! Of course, getting to the right view of the right data can be challenging. Consequently, MBD efforts tend to be exploratory, collaborative, and iterative:

Extraction and Analysis

Before a report can be generated, the necessary data must be extracted from the correct data source. Some amount of analysis may also be needed. In some cases, everything may be highly organized and readily accessible. In other cases, substantial effort may be required to extract the needed data.

Data Warehousing

It's very useful to collect all of the extracted data into an MBD-specific data warehouse (roughly, database). This allows consistent access methods to be used, reduces duplication of effort, keeps the MBD project's storage under its own administrative control, etc.

Views of the data can be generated and presented, of course, before a data warehouse is in place. In fact, this is a common approach for prototyping reports, etc. Folding interesting data into the warehouse, however, may expedite future inquiries. It also opens the door to the use of existing analysis programs.

Mechanized Analysis

Once the data is in a consistent and easily accessible form, mechanized analysis can be employed to find "implicit" information: correlations, patterns, hidden structures, etc. Some of this analysis will require special-purpose programming, but many tools and libraries are available to reduce this burden. Let's consider some examples...

Business Intelligence (BI) techniques (e.g., data mining, OLAP) are commonly used in large enterprises to mechanically extract information and even knowledge from data. With the advent of Open Source BI tools, even small enterprises can now take advantage of these techniques.

Documentation generators can provide detailed documentation for software projects, reducing development time, errors, and inefficiency. Documentation frameworks can also help to organize and increase the consistency of programmer-generated documentation.

Extremely powerful tools are now available for mathematical analysis and display, using techniques from graph theory, operations research, statistics, symbolic mathematics, etc.

Most enterprises have large bodies of existing documentation and other "unstructured information". Many tools exist to help with this material, including document indexing and digital library suites, etc.


Presentation spans a wide range, from specialized "one-shot" reports (possibly created interactively) to integrated, comprehensive sets of detailed documentation (e.g., for a software project's code base, requirements, tests, etc.). Throughout this range, the user's "mental model" is a key consideration.

In the case of a custom report, the user's background and interests are clear. So, the report can be "tuned" to meet known requirements. Detailed documentation, intended for arbitrary users, has no such luxury. It must be organized and presented so that any user can easily assimilate it.

MBD addresses this problem by presenting the user with a consistent (though possibly simplified) model of the system in question. As the user navigates through the documentation, s/he will learn about the system's basic organization.

A typical MBD-based web site will have several kinds of web pages. Some, such as tutorials and class descriptions, will be manually edited. Others, such as entity descriptions and indexes, will be mechanically generated.

MBD bridges the gap between traditional documentation and report generation techniques, leveraging the strengths of each. It works well for generating timely, integrated, and detailed documentation for large systems. At the same time, it facilitates the rapid prototyping of specialized documents and reports.

Like the term AJAX, MBD describes an existing (though not commonplace) way of using existing technologies. It is my hope that, by coining the term and pointing out the utility of this approach, I can convince others to think about, and perhaps experiment with, this approach.

Extraction and analysis can fill a data warehouse, but the data doesn't become useful until it becomes part of a human's "mental model". Modeling is fundamental to both intelligence and communication, so thinking about the underlying model(s) is an inseparable part of good documentation design. MBD encourages the developer to keep these models in mind, at all stages of the development process.


MBD (Model-based Documentation) is an integrated approach to the design and development of semi-automated document production systems. Specifically, MBD uses a consistent "system model" (at various levels of abstraction) to provide conceptual clarity, ease navigation, and guide the development process.

MBD bridges the gap between traditional documentation and report generation techniques, leveraging the strengths of each. It works well for generating timely, integrated, and detailed documentation for large systems. At the same time, it facilitates the rapid prototyping of specialized documents and reports. Although I have only used MBD in a software development context, I believe it to be applicable to any substantial system that depends extensively on computers.

This page introduces some key concepts in MBD, discussing the role of models in intelligence, communication, and documentation. It then presents an abstract data flow model for MBD processing.


In the popular MVC (Model-View-Controller) architecture, the model stores the data that underlies the application. My use of the term is a bit broader, including "mental models", etc. Here are some applicable definitions:

Either of these definitions could, with a bit of stretching, describe documentation. More to the point, documentation which is based on these sorts of models has a built-in conceptual and organizational framework. This framework can assist both the documenter and the reader, as discussed below.

More generally, because modeling is fundamental to both intelligence and communication, thinking about the underlying model(s) is an inseparable part of good documentation design. Let's explore this connection a bit further...


Jeff Hawkins ("On Intelligence") contends that intelligence and memory are byproducts of modeling activity in the neocortex. Specifically, our brains record and recognize "invariant patterns" in received data, then make and test predictions about other, related patterns. After hearing a few notes from a known song, we can recognize a pattern and predict the following notes. If the prediction fails, our attention will be drawn to the disparity and our brains will try to resolve it. Is this a stylized rendition, a different song, or what?

Using billions of interconnected neurons, the neocortex can recognize and predict many levels and variations of patterns. The incoming flow of information interacts with the stored patterns, causing them to be interlinked, modified, reinforced, etc. Collectively, the patterns form a "model" of perceived reality. We may think that we are interacting with "reality", or at least what our senses report about it, but we are actually dealing with this internal model.


Communication, similarly, is based on the transmission and sharing of models. The sender offers assertions, qualifications, connections, etc. If the communication is successful, these will be incorporated into the receiver's models. Modeling also plays a significant role in message preparation. A "mental model" of readers' typical backgrounds (or really, of their mental models) influences the sender's choice of material, presentation style and order, etc.

In preparing these pages, I thought about the key concepts I wanted to present and the likely backgrounds of my readers. Using familiar concepts (e.g., brain, memory) as a base, I presented new concepts (e.g., invariant patterns). Because "neocortex" might be an unfamiliar concept, I defined it by context and added an HTML link. As I got further into the material, I was able to rely on concepts already presented. Thus, my model of the readers' models was predicated on their ongoing comprehension of the material being presented.

This sort of model-based "second guessing" is very common in human communication. Different forms of communication (e.g., articles, conversations, documentation, formal and informal presentations) employ it to different degrees, however. In casual conversation and informal presentations, the speaker may not pay much attention to organizing the presentation. After all, the listener(s) can easily ask about anything that isn't understood. Thus, interactivity reduces the need for organization.

Formal presentations and most forms of written communication do not allow for easy interaction, so authors tend to spend significant effort on organization. They write outlines, move material around, and generally polish the text until the ideas "flow" smoothly. Some trickery may be required, because many topics don't "serialize" cleanly. For example, the author may have to use a "placeholder" definition, because a full definition would interrupt the flow and/or rely on material not yet presented.


Documentation differs from other forms of written communication in several ways, including complexity, scale, information level, and typical usage. These differences affect the way that documentation is (or at least should be :-) designed.

In summary, documentation creation presents unique challenges. The writer must present vast amounts of information, with very little knowledge of the readers' mental models. About all that the writer can predict is that the reader will not approach the material in a sequential fashion.


Because the writer can't predict the reader's background, other means must be used to supply context. In practice, this means that the author must provide ways for the reader to find (i.e., navigate to) any needed background information. Navigation can be supported by various mechanisms: indexes, links, search engines, etc. Using mechanized techniques, it's quite easy to generate web pages full of links, pop-ups, clickable diagrams, etc. The tricky part is to make the result comprehensible and useful.

MBD addresses this problem, where possible, by basing the site's organization and navigation on an abstract model of the documented system's entities and relationships. Because the model is consistent, readers soon learn how to navigate around the documents. Overview text and diagrams can ease and expedite this process.

Modeling MBD

Typically, MBD uses a "suite" of cooperating programs, mixing general-purpose utilities (e.g., text formatters) with special-purpose programs (e.g., data filters). Taking our own advice, let's model a typical MBD suite. The following diagram shows (very abstractly!) how the suite fits into the overall documentation data flow:

Our goal is to document a system, using a combination of its own Data (e.g., databases, files) and any related Info (e.g., documentation, institutional memory). If an existing document is useful and presentable, we'll simply add it to our collection of Documents. Alternatively, information can be extracted from Info and/or Data sources, analyzed, and presented in Documents.

The dotted line connecting the Info and Data boxes indicates that they are closely related. For example, a system's documentation should describe the nature of its data. This relationship isn't part of the MBD suite, but it forms an important part of the suite's working environment.

Processing Phases

An MBD suite can be considered as having three processing phases:

A trivial MBD application might combine all three phases in a single program. For example, a Perl script might access Bugzilla's MySQL database, tally selected bug reports, and generate a web page. As more types of data sources and generated documents come into play, however, a single program will become unwieldy.
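The Perl example above can be sketched in a few lines of Python. This is a toy analogue only: sqlite3 stands in for Bugzilla's MySQL database, and the table schema is invented for the illustration.

```python
import sqlite3

# A single-program MBD application in miniature: extract bug
# reports from a database, tally them, and emit a web page.
# (sqlite3 stands in for MySQL; the schema is hypothetical.)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs (id INTEGER, severity TEXT)")
conn.executemany("INSERT INTO bugs VALUES (?, ?)",
                 [(1, "critical"), (2, "minor"), (3, "critical")])

# Extraction + analysis: tally selected bug reports.
rows = conn.execute(
    "SELECT severity, COUNT(*) FROM bugs GROUP BY severity ORDER BY severity"
).fetchall()

# Presentation: generate a (minimal) web page.
html = "<table>\n" + "".join(
    f"  <tr><td>{sev}</td><td>{n}</td></tr>\n" for sev, n in rows
) + "</table>"
print(html)
```

Combining all three phases in one script works at this scale; the point of the surrounding discussion is that it stops working as sources and reports multiply.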

Real-world MBD applications may have dozens of data sources, generate many kinds of reports, etc. Consequently, they will have many instances of Analysis, Extraction, and Presentation programs. There may also be cases where multiple levels of analysis are needed.

Data Flow

Here is a slightly less abstract diagram, showing the flow of data through these (sets of) programs:

As the diagram indicates, the data traverses a directed graph. With a bit of care, the graph can be constrained to be acyclic (no program can modify data that affects its own input). This may sound quite theoretical, but it has some useful consequences.

In particular, it means that dependency-based tools such as make (or even parallel make) can be used to control the suite's operation. Better yet, we can run the entire suite under Cron and have an automated "service" that updates the relevant output documents whenever an underlying data source changes.

The following diagram shows the data flow for a smallish MBD suite. If data source S1 changes, programs E1, A1, and P1 will be run, producing an updated version of document D1. Programs A2, P2, and P3 will also be run, updating all of the other documents. Changing S3 or S4, in contrast, would only cause D3 and D4 to be updated.
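The propagation rule described above amounts to finding everything downstream of the changed source in the dependency graph. Here is a minimal sketch; the edge set below is invented for illustration (it borrows the S/E/A/P/D naming but is not the diagram's actual graph), and in a real suite the edges would come from each program's declared inputs, as in a Makefile.

```python
from collections import deque

# Make-style change propagation over an acyclic data flow graph.
# Keys map each node to the nodes that consume its output.
# (This edge set is hypothetical, for illustration only.)

edges = {
    "S1": ["E1"],
    "E1": ["A1"],
    "A1": ["P1"],
    "P1": ["D1"],
    "S2": ["E2"],
    "E2": ["A1", "P2"],
    "P2": ["D2"],
}

def affected(source):
    """Everything downstream of `source`, i.e., what must be rerun."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

print(sorted(affected("S1")))   # ['A1', 'D1', 'E1', 'P1']
```

Because the graph is acyclic, this traversal always terminates, which is precisely what lets make (or a cron-driven wrapper) schedule the reruns safely.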

To be sure, this description leaves out many implementation details. For example, it says nothing about information storage and transmission. Are we using files, a database, or what? We also need to ensure that each program's input information is complete and consistent while it is processing. Given that the input could be several thousand files, a bit of trickery may be needed.

Nonetheless, this approach is not science fiction. I recently created a web site that contains tens of thousands of web pages, image-mapped diagrams, etc. Because the pages are heavily cross-linked, the site contains hundreds of thousands of links and tooltips. Nonetheless, the site is easy to navigate, understand, etc.


Modeling efforts can usefully range from informal sketches to formal, notation-heavy specifications. My preference is to keep things informal and flexible until the need for more formality becomes evident. This is in line with some of the ideas of the Agile software development movement.

Diagrams are very useful for showing connectivity (e.g., control and data flow, method usage). If all you're trying to do is capture the basic structure, use any icons (e.g., boxes, circles) that seem comfortable. Picking a consistent notation becomes important, however, if the reader is separated by time or distance.

Similarly, although paper and whiteboards work well for small diagrams and informal meetings, diagram generation tools (e.g., Dia, OmniGraffle, Visio) also have their place. Aside from handling presentation details (e.g., icon and arrow styles, fonts, layout), these tools manage connectivity constraints, etc.

As the size and complexity of the model grows, the need for even more structure and support will become evident:

Just as general-purpose drawing tools cannot support architectural, electronic, or mechanical design, diagram generation tools cannot analyze the diagrams they record.

Knowledge Representation

Tools for generating and analyzing conceptual schemas must be able to represent and manage knowledge about the system under study. Here is an elegant description of knowledge representation, taken from John F. Sowa's excellent book on the topic:

Most of the modeling methods and tools discussed below are aimed at assisting with ontology development. They keep track of the definitions of classes, instances, relationships, etc. If our goal were to create an expert system, both the ontology and the logic rules would need to be extremely detailed and precise.

In MBD, however, most of the reasoning will be done by humans. The developer will look over the ontology and decide what to present. The user will look over the presented material and decide which parts are currently of interest. As long as the material is plausibly interesting, the user is unlikely to complain.

So, these approaches and tools often strive for more detail and precision than MBD requires. Don't get caught up in the details; your main objective is to produce a useful but simplified model!

Also note that knowledge representation is an active research area. There are many approaches and theories, a few emerging standards, and little interoperability between existing tools.

Methods and Tools

Assorted communities (e.g., AI, DBMS, Semantic Web) have developed methods and tools for knowledge representation. Several of these approaches appear to be quite applicable to MBD, but I have yet to find a category killer. Before we look at the available offerings, let's consider the general characteristics we're looking for.


The most critical characteristic, from my perspective, is the model's fundamental organization. The components of a system can have arbitrary relationships; the model must be able to encode these, allow the user to traverse them, etc.

Because the relationships are arbitrary and (initially) unknown, the approach must not restrict the modeler to, say, a list-based or even hierarchical organization. Consequently, most outliners and mind mapping tools aren't suitable.

Because HTML links can only be traversed in one direction, most HTML-based approaches (e.g., typical wikis) are also unsuitable. (Pairs of links can be used for bi-directional relationships, but this is tedious and error-prone if done manually.)

In addition, the model should allow the user to interact with assorted subsets and views of the system. For these and other reasons, I believe that the model must be based on a fully-traversable (and very extensible) graph-based organization.

Diagramming Format

A modeling tool must allow the user to interact with (e.g., view, navigate, edit) the model. Although most of the detailed information will be textual in nature, text is a poor medium for presenting relationships. So, most modeling tools use some sort of diagramming format.

The design of this format is both critical and challenging. If the format is too simple, it won't be able to convey the needed information. If it is too complex, the user will become confused and frustrated. Ideally, the tool should allow the user to use simplified notation, adding details as desired.

Interchange Format

Modeling tools should have reliable and convenient ways to exchange information with other tools. Unfortunately, this is seldom going to be the case for existing tools. There are many formats for encoding conceptual models, varying at syntactic, structural, and semantic levels.

Efforts are being made to provide paths between these formats. For example, the International Organization for Standardization (ISO) has a working group which recently proposed a standard:

The World Wide Web Consortium (W3C) is working on a related, though less formal standard:

Although these efforts are proceeding well, no adopted standards appear to be imminent. In the meanwhile, although one can hope for a documented interchange format, about all that one can reasonably require is a readable (e.g., XML) file format. Now, let's look at some of the available offerings...

Concept Maps

For informal modeling, I would suggest the use of concept maps. They aren't overloaded with notation, but they have enough structure to capture the basic entities and relationships of a system. Unlike mind maps, concept maps aren't restricted to hierarchies.

Dr. John Sowa has written a nice overview of Concept Mapping, covering concept maps, conceptual graphs, topic maps, etc. The classic reference on concept maps is Learning How to Learn.

Data(base) Modeling

The database community uses assorted variations on entity-relationship diagrams (ERDs). Unfortunately, these can cause the modeler to focus on low-level (e.g., database-related) issues, rather than high-level concepts.

In addition, ERDs can run into difficulties when a relationship needs to be treated as an entity. If we say "Romeo loves Juliet", how do we discuss the different meanings of "loves"? Modeling Methodologies is a good introduction to some of these issues.

Object-Role Modeling (ORM2) appears to handle these issues nicely, at some cost in notational complexity. I don't know of any Open Source ORM2 tools (though one is promised for 2006), but some gratis and inexpensive tools have emerged. Some versions of Microsoft's Visio handle various aspects of ORM2 diagrams. For more information, see the Object-Role Modeling web site.

Unified Modeling Language (UML) Class Diagrams are also complex, but they may appeal to programmers who are already familiar with this notation. UML is very well documented and many supporting tools are available for it.

Knowledge Engineering

The Expert Systems community has been working with problems of Knowledge Engineering for several decades. Not surprisingly, they have some useful tools to offer.

I'm particularly interested in Protégé. As described in An AI tool for the real world, Protégé is an Open Source, well-supported, standards-friendly tool for creating models, ontologies, etc.

Conceptual Graphs (CGs) are structurally similar to ORM2 diagrams, but they are based on a form of predicate calculus known as first-order logic (FOL). So, they are a good match for expert system technology.

It's quite likely that Protégé could be augmented to support CGs, ORM2, or other diagramming notations. This might ease the recognition and specification of complex sets of relationships.

Semantic Web

The Semantic Web community is developing standards (e.g., Resource Description Framework, Topic Maps) for describing concepts, encoding document metadata, etc. The standards are still "works in progress", but they are worth watching because of their large and active developer communities.

Resource Description Framework (RDF) is based on sets of three-part declarations (i.e., "triples"): subject, predicate, object. The apparent simplicity of this approach is balanced by the need to create large numbers of triples when complex concepts need to be expressed.

Topic Maps use a much richer vocabulary, including terms such as association, name, occurrence, scope, topic, etc. This allows relatively small expressions to express complex concepts, conditional assertions, etc. A gratis tool (Ontopia Omnigator) is available; Open Source tools are under development.


Although the use of models is central to MBD, modeling is a tool, rather than a goal. If you develop a crystal-clear model, but generate no documentation, you haven't really accomplished your objective.

So, curb your desire to generate the "perfect" model. Instead, try to generate a useful and flexible model, improving it as you proceed. As you work with the model, you'll find areas that could use clarification, expansion, etc. Your modeling approach should make it easy to make these changes.

Also, avoid the temptation to "start at the bottom", detailing every data item, field, etc. This sort of information can be researched as it is needed, but it doesn't serve the general purposes of the model: finding useful information, understanding the system, etc.

Semantic Wikis

Semantic Wikis show promise for creating models and presenting information. Ideally, they would combine a wiki's convenience and freedom with the strengths of semantically-aware (e.g., ontology-based) systems. This could allow a graceful merging of human-edited and mechanically generated content.


Wikis provide a very convenient means for generating informal documentation. They are easy to edit, support both individual and collaborative efforts, and scale extremely well.

Wiki links can be created by the simple act of typing in a CamelCase word. If a page doesn't already exist, the act of clicking on its link will create one. Simplified markup languages are also available, easing the process of page creation.
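The CamelCase convention above is easy to mechanize. Here is a minimal sketch of the classic rule; actual wiki engines vary in the details (some require more humps, some honor escapes), so this pattern is a simplified common form.

```python
import re

# Classic wiki-link detection: a word with two or more
# capitalized "humps" becomes a link to a same-named page.
# (A simplified form; real engines differ in edge cases.)

CAMEL = re.compile(r"\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b")

def linkify(text):
    """Wrap CamelCase words in HTML links to same-named pages."""
    return CAMEL.sub(r'<a href="\1">\1</a>', text)

print(linkify("See the FrontPage for details."))
# -> See the <a href="FrontPage">FrontPage</a> for details.
```

Note that single-hump words ("See", "Page") are left alone; only genuine CamelCase compounds become links.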

Most wikis have rollback mechanisms, allowing contributions or changes to be removed, merged with previous content, etc. This allows public wikis such as Wikipedia to reach and retain a high level of quality, despite occasional misguided postings.

The web's basic architecture reduces the apparent complexity of web site (and thus wiki) generation. Although collections of pages and links form a graph-based data structure, few users think about this fact. Looking at any given page, the user sees only content and links; the global structure can be (and usually is) ignored.

As sites such as Wikipedia demonstrate, wikis can take full advantage of these simplifications. Their users navigate through enormous graphs, seeing only pages that appear to be of interest.

Web Limitations

HTML links (e.g., <A HREF="...">...</A>) have limitations that interfere with their use in semantic wikis. Specifically, they are uni-directional, untyped, and binary. The first problem can be worked around fairly easily, but the others require structural changes.

Because a link only goes to a given page, the entire graph must be traversed in order to find backlinks (links that come from other pages). For search engines such as Google, this can be a massive problem, because the "graph" in question is the entire web.

Few wikis bother to track backlinks, even though the problem is much more tractable for them. Even fewer can display clickable context diagrams, showing a page's "local neighborhood". Pimki (an experimental "Personal Information Management" wiki) does both, but it is a conspicuous exception.

Even Pimki, however, is constrained by HTML links' other limitations. Although a link can have many attributes, most only contain the target URL and some text content to be highlighted and displayed. Nothing, in any case, indicates which links are of what "type".

Without typed links (e.g., Is_A, Has_A, Used_By), Pimki has very little information to work with. It cannot, for example, filter by link type or assess the "strength" of given links, much less make deductions (e.g., inherited characteristics) based on link types.

Finally, HTML links are "binary", in the sense that they only connect two pages. This isn't a catastrophic problem: Resource Description Framework (RDF) is also based on binary links. However, many users may find it easier to say "John is taking the plane to Chicago" than to specify the equivalent set of binary relationships.

So, add structure!

Semantic wikis address the "type" problem head-on, allowing pages and/or links to have specified types. Thus, we might say that the /etc/passwd page deals with an instance of the class Control_File. With this information, the wiki can generate bi-directional links, display summary or inherited information, etc.

Similarly, if we say that a Control_File may be written by a Program (or really, by a Process that is running the Program), the wiki knows that it's OK for a user to assert that the /etc/passwd file may be written by the /bin/passwd program.
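A minimal sketch shows what typed links buy us. The link types and file names follow the examples above; the storage scheme (a flat list of tuples) is invented for illustration and is far simpler than what a real semantic wiki would use.

```python
from collections import defaultdict

# Typed links as (source, type, target) tuples. With types
# recorded, backlinks and type-based filtering -- impossible
# with bare HTML links -- become trivial queries.

links = [
    ("/etc/passwd", "Is_A",       "Control_File"),
    ("/etc/passwd", "Written_By", "/bin/passwd"),
    ("/bin/passwd", "Is_A",       "Program"),
]

backlinks = defaultdict(list)          # target -> incoming links
for src, typ, dst in links:
    backlinks[dst].append((src, typ))

def by_type(typ):
    """All (source, target) pairs connected by a given link type."""
    return [(s, d) for s, t, d in links if t == typ]

print(backlinks["Control_File"])       # [('/etc/passwd', 'Is_A')]
print(by_type("Written_By"))           # [('/etc/passwd', '/bin/passwd')]
```

From here, inheritance-style deductions (e.g., anything that Is_A Control_File may be written by some Program) are a matter of chaining these queries.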

The final limitation (binary relationships) is addressed by some semantic wikis, but not by others. In fact, some would contend that this limitation is beneficial, in that it simplifies the specification of relationships. However, the complexity re-asserts itself by greatly expanding the number of relationships required.

Some Examples

Here are more than a dozen examples of semantic wikis, including (where available) brief summaries. For an up-to-date list, and other useful information, be sure to visit the Semantic Wiki State Of The Art page on the Semantic Wiki Interest Group web site.

Note: The "OWW" link following some entries is not an exclamation, merely a link to the corresponding page in the OntoWorld Wiki.


Nearly all of the wikis listed above are based on the RDF notion of (subject, predicate, object) expressions, known as "triples". In each link, the current and target pages are used, respectively, as the subject and object of the triple. The predicate is then added by means of a "type" indicator, such as:

[[type:IsWrittenBy /bin/passwd]]
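Mechanically, such a link parses into a triple: the current page supplies the subject, and the markup supplies the predicate and object. A minimal sketch (the exact markup syntax varies between semantic wikis, so this pattern is illustrative):

```python
import re

# Parse "[[type:Predicate Target]]" wiki markup into RDF-style
# (subject, predicate, object) triples. The subject is the page
# containing the link. (Markup details vary between wikis.)

LINK = re.compile(r"\[\[type:(\w+)\s+(\S+)\]\]")

def triples(page, text):
    """Yield one triple per typed link found in `text`."""
    return [(page, pred, obj) for pred, obj in LINK.findall(text)]

print(triples("/etc/passwd", "[[type:IsWrittenBy /bin/passwd]]"))
# -> [('/etc/passwd', 'IsWrittenBy', '/bin/passwd')]
```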

This is convenient and flexible (and thus in keeping with the spirit of wikis), but it has both structural and human interface problems. Ontiki, my (speculative) design for an "ontology-aware wiki", attempts to address these issues.

There is no consensus as to what semantic wikis should do, let alone how they should do it. A "category killer" may emerge in a few years, but even this is not certain: none has for basic wikis. Nonetheless, semantic wikis are fun to play with and can be used to solve problems that other tools do not. So, enjoy...

The Extraction Phase

The concepts page presents a data flow "model" for an MBD (Model-based Documentation) suite. In this model, the extraction phase is responsible for accessing data sources, selecting and organizing the desired data, etc. Its output should be encoded in a convenient and reliable representation for follow-on analysis or presentation.

This page looks at various kinds of data sources, including system data and hand-edited files, offering hints about how to extract data from them. The question of how this data should be stored is a subject for another page (available RSN :-), but some general comments may be useful at this point.

MBD's extraction phase corresponds roughly to the first part of data warehousing. The "live" data gets collected, filtered, and saved in a manner that eases follow-on analysis (e.g., data mining, OLAP) and presentation. The storage representation should handle structural issues, retaining "interesting" structures from the input data and allowing the addition of structures "discovered" in the extraction or analysis phase.

The actual data will normally be stored in (collections of) files and/or a relational database system. In general, data access in the warehouse should be far faster, easier, and more consistent than it was in the "live" system. If it isn't, you're doing something wrong (:-).

System Data

Computer-based systems maintain large amounts of data. MBD systems can "harvest" this data, extracting information on specific entities: facts, relationships, etc. The resulting information can be used to generate detailed reports, summary plots and diagrams, etc. The first step, however, is data extraction.

The extraction code must access the incoming data, parse it, reject noise, correct errors, and pass the result along in a "convenient" format. The specific tasks are defined by (a) the format and content of the incoming data and (b) the current and expected data needs.

The incoming data may be simple or complex in structure and may arrive in any of a variety of manners (e.g., files, APIs) and formats (e.g., binary, text). The following sections cover some common cases.

Flat Files

Line-oriented "flat file" formats (e.g., CSV files, Unix control files) are usually easy to parse and understand. Each line (including any "continuation" lines) is a record. Individual fields can usually be extracted by regular expressions; special handling may be required for particularly complex formats.

If you have a lot of flat files to parse, consider writing a parameterized input filter. Declarative control files are far easier to maintain than hand-crafted (and nearly identical) "input filters". If your control files are cleanly formatted and well commented, they can serve as documentation for the input files.
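As a sketch of this idea in Python (the spec and field names are illustrative, not taken from any real control file), a declarative spec can drive a single generic filter:

```python
import re

# A declarative "control file" entry: one per flat-file format.
PASSWD_SPEC = {
    "comment":   re.compile(r'^\s*#'),   # lines to skip
    "delimiter": ":",
    "fields":    ["login", "pw", "uid", "gid", "gecos", "home", "shell"],
}

def parse_flat_file(lines, spec):
    """Generic, parameterized input filter for line-oriented flat files."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or spec["comment"].match(line):
            continue
        values = line.split(spec["delimiter"])
        yield dict(zip(spec["fields"], values))

sample = ["# local accounts",
          "root:*:0:0:System Administrator:/var/root:/bin/sh"]
for record in parse_flat_file(sample, PASSWD_SPEC):
    print(record["login"], record["shell"])   # root /bin/sh
```

Adding support for a new flat-file format then means writing a new spec, not a new script.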

Exchange Mechanisms

If a data source is used by more than one program, it is likely to be available via a well-documented data exchange mechanism. For example, it might be a text file, encoded in a documented dialect of a standard data serialization format (e.g., XML, YAML). Alternatively, it might be offered via an API (Application Programming Interface) such as SQL or Apple's Core Data framework.

The exchange mechanism can be expected to handle the "syntax" level (e.g., dividing the data into fields). It may also help with structural or even semantic issues, but there is no guarantee of this. The good news is that you can often extract the needed data without understanding the entire structure and semantics of an arbitrarily complex data source.

Generally, use of an exchange mechanism follows a well-trodden path: study the documentation, decide what data you want, use a query to extract it. That said, there's nothing wrong with speculative data collection: if capturing some extra data is easy, storing it can be a very worthwhile gamble.
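The well-trodden path can be sketched in a few lines of Python, using an in-memory SQLite database to stand in for the live system's RDBMS (the table and column names are hypothetical):

```python
import sqlite3

# A stand-in for the live system's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (name TEXT, os TEXT, location TEXT)")
conn.executemany("INSERT INTO hosts VALUES (?, ?, ?)",
                 [("web1", "Linux",   "rack 4"),
                  ("db1",  "Solaris", "rack 7")])

# Extract only the data we currently need -- though capturing the
# "location" column as well is a cheap, speculative gamble.
rows = conn.execute("SELECT name, os, location FROM hosts").fetchall()
for name, os_name, location in rows:
    print(name, os_name, location)
```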

It's also useful to look at the way the data source is used by the system's own code. A program's SQL statements and surrounding code, for example, can provide quite a bit of information on the purpose and interaction of database fields and tables.

If you get lost, look for a support infrastructure (e.g., books, forums, mailing lists, web sites). By asking authors and "resident experts", you may be able to get answers to specific questions (e.g., "What is this field used for?"). If you develop any documentation, be sure to share it with the appropriate community!

Screen Scraping

Wikipedia defines "screen scraping" as:

Because the "display" was not intended for use by a program, it may not provide simple (or even reliable) indications of its structure. After all, humans are much better than programs at discerning structure from context, formatting, etc. Fortunately, machine-generated documents tend to have reasonably predictable structures. Once you can handle the base layout and the common variations, you're mostly done. By coding for robustness (e.g., "print a diagnostic and continue"), you can deal with new variations as they appear.

Of course, if the generating program gets changed in a way that modifies the document structure, your extraction program will "break". So, it's a good idea to push for machine-friendly (e.g., XML) forms of any documents you rely upon.

These problems aside, screen scraping may involve assorted low-level formats: "plain text" (e.g., log files, nroff output), HTML, PDF, PostScript, etc. Access methods are often available for binary formats, but you'll have to handle most text-based formats yourself. Here are some hints...

Web Pages

If you only need a few items from a web page, you may be able to extract them using special-purpose code. For example, if a particular snippet of HTML always appears in a line by itself, or in a particular place in a table, you may be able to recognize it and handle it as a special case. Be aware that this "hack" may be brittle in the face of even minor formatting variations.

If you need to extract a lot of data, consider transforming the page into XHTML, a formalized dialect of HTML. Because XHTML files comply with XML syntax, you can load them with your favorite XML-handling tools. HTML Tidy will perform this conversion, as well as cleaning up ugly (e.g., Microsoft Word) HTML.

Although the resulting document will be valid XML, you'll still have some "detective work" to do. Unlike XML that was generated for information exchange, converted HTML is likely to have structural irregularities (e.g., interleaved tags and text). Worse, you won't have a schema to guide you in figuring out the variations.
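As a minimal sketch of the "detective work", here is some Python that walks a (stand-in) tidied XHTML fragment with the standard library's XML tools; a real converted page would be messier:

```python
import xml.etree.ElementTree as ET

# A stand-in for a page that HTML Tidy has converted to XHTML.
xhtml = """<html><body>
  <table>
    <tr><td>/etc/passwd</td><td>Control_File</td></tr>
    <tr><td>/bin/passwd</td><td>Program</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(xhtml)

# Pull (name, type) pairs out of each table row.
pairs = [tuple(td.text for td in tr.findall("td"))
         for tr in root.iter("tr")]
print(pairs)
# [('/etc/passwd', 'Control_File'), ('/bin/passwd', 'Program')]
```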

PostScript, PDF, etc.

PostScript is a text-based "page description language". Although it is actually an imperative, Turing-complete programming language, this is seldom a real problem for parsing, etc. RPN syntax aside, most PostScript commands are used as formatting declarations. So, it is generally possible to examine a representative PostScript document, determine what "idioms" are being used, and write a specialized script to extract desired information.

An alternative approach uses the fact that PostScript files are, in fact, programs. By editing the PostScript code (and/or overriding selected operators), it is possible to make a document log information about itself (e.g., to standard output or a designated file). If the idea of hacking PostScript doesn't appeal to you, however, read on.

PDF (Portable Document Format) is a binary, declarative translation of PostScript. Although a binary format may seem daunting, there are libraries and other tools which can help in parsing PDF. There are also reliable tools for PostScript translation. So, you may want to convert all your incoming PostScript documents to PDF, then parse them all with the same tool(s).

Here are some useful tools for dealing with PDF and PostScript documents. Although there is some overlap in their capabilities, all three are well worth having.

For more tools (and tricks), see "PDF Hacks: 100 Industrial-Strength Tips & Tools" (Sid Steward; O'Reilly, 2004). It's an easy read and will give you some ideas about unconventional ways to use PDF. If you get serious, you'll also want to get a copy of Adobe's "PDF Reference" (sixth edition). It contains well over 1000 pages of definitive and detailed information.

There are dozens of books on PostScript, ranging from introductions to reference manuals. Again, the Adobe books are definitive, but you may want to look at some others as a way to get started, explore undocumented areas, etc.

Arcane Formats

The data you need may be stored in some arcane format. Examples include binary libraries, spreadsheets, word processing documents, and source code files. Rather than researching (or worse, reverse-engineering) the format, look around for a library or command-line tool that already knows how to parse it.

For example, you might want to extract linkage information from binary library files. The nm command (and variations) does a fine job of generating reports on library and object files. These reports have a regular format and are easy to parse.
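A sketch of such a parser, in Python (the sample lines are illustrative of typical nm output, not taken from any real library):

```python
import re

# nm output is regular: an optional address, a one-letter symbol
# type, and a symbol name.
NM_LINE = re.compile(r'^(?:([0-9a-fA-F]+)\s+)?([A-Za-z])\s+(\S+)$')

sample = """0000000000001230 T main
                 U printf
0000000000004010 D version_string"""

symbols = []
for line in sample.splitlines():
    m = NM_LINE.match(line.strip())
    if m:
        addr, kind, name = m.groups()
        symbols.append((kind, name))

print(symbols)
# [('T', 'main'), ('U', 'printf'), ('D', 'version_string')]
```

Undefined ('U') symbols, in particular, reveal which libraries a given object file depends upon.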

Perl's CPAN (Comprehensive Perl Archive Network) has modules that import spreadsheets and many other oddball files. Other scripting languages (e.g., Python, Ruby) have similar online collections.

Finally, you may be able to find a tool that reads the file(s) and exports the data into XML or some other parsable format. Doxygen, for example, will peruse source code files in several languages, dumping its accumulated knowledge as XML.

Hand-edited Files

There are times when you'll need to augment your machine-harvested data with hand-edited information. The supplementary data may have been collected from interviews, copied and pasted from a PDF document, or obtained in some other manner. Regardless, the current objective is to encode it for convenient use by your documentation scripts.

The ideal file format for this purpose would be flexible and powerful, supported by an active user community, easy to read and edit, and a good "fit" for scripting languages such as Perl, Python, and Ruby. Flat files (e.g., CSV) fail the first test: a two-dimensional array is neither flexible nor powerful. Any format that you might cobble up on your own fails the second test (user base). XML fails the last two tests; nobody but a masochist likes to edit XML or traverse its data structures.


Fortunately, YAML (YAML Ain't Markup Language) meets all of these criteria quite handily:

Finally, I'll let you judge YAML's flexibility and readability for yourself:

  # This is a comment
  abc:
    -  123
    -  def:  'fed'
       ghi:  'ihg'

Assuming that this text was loaded into a Perl data structure referenced by $r, the expression $r->{abc}[0] would have the value 123. The expression $r->{abc}[1]{def} would have the value 'fed'.

The ability to create and edit data structures in one window, then test them out in another, is incredibly seductive. I have edited tens of thousands of lines of YAML, using it to store a wide variety of data. By post-processing the encoded text strings, I have even created (declarative) "little languages" of various sorts, using a YAML loader to handle the first-order parsing. The FSW case study page and this tutorial contain more information on YAML.

The Analysis Phase

The concepts page presents a data flow "model" for an MBD (Model-based Documentation) suite. In this model, the analysis phase is responsible for combining or tallying data, discerning patterns, etc. This page considers the relationship between modeling and analysis, looks at some representative forms of analysis, and makes some broad generalizations about data management.

Modeling and Analysis

The administrators and designers of a complex system generally understand its overall nature. They may be unclear on specific details, but they understand the major entities and relationships, along with some "minor" ones which have attracted their attention, for whatever reason. In short, they have a working "mental model" of the system.

By interviewing these "local experts", studying the existing documentation, and generally poking around, an MBD developer can "discover" the types of entities and relationships in the system. Using this information, s/he can create an informal model that will support and guide the creation of MBD software.

Analysis functions, in particular, rely on this sort of model. They look for, characterize, and/or summarize the specific values and relationships concerning instances of known entities. Typically, they produce explicit formulations of information which was only implicitly available in the input data.

In a software development project, the entities might include bug reports, data structures, developers, functions, libraries, managers, programs, projects, releases, test suites, etc. By analyzing information on these instances, we can "discover" and present useful information about individual instances and/or the system as a whole.

For example, we could count the bugs found during each stage of each release cycle. This might tell us useful things about our design and testing procedures. We could also look for "hot spots" in the code (e.g., areas that seem to have lots of bugs), as a way to guide refactoring efforts.

We could also track the usage of each data structure, function, library, etc. This would let us create links to related items and even draw nifty-looking (and useful) diagrams showing critical relationships. Anyone who has had to "drop into" a large body of code can appreciate the value of this sort of contextual information.
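The tallying described above amounts to a few lines of code. Here is a Python sketch (the bug-report records and their fields are hypothetical):

```python
from collections import Counter

# Hypothetical bug-report records, harvested in the extraction phase.
bugs = [
    {"release": "1.0", "stage": "alpha", "file": "net/io.c"},
    {"release": "1.0", "stage": "alpha", "file": "net/io.c"},
    {"release": "1.0", "stage": "beta",  "file": "ui/form.c"},
    {"release": "1.1", "stage": "alpha", "file": "net/io.c"},
]

# Bugs found during each stage of each release cycle...
per_stage = Counter((b["release"], b["stage"]) for b in bugs)

# ...and "hot spots" in the code, to guide refactoring.
hot_spots = Counter(b["file"] for b in bugs)

print(per_stage[("1.0", "alpha")])    # 2
print(hot_spots.most_common(1))       # [('net/io.c', 3)]
```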

Analysis Functions

As Gerald Weinberg points out in An Introduction to General Systems Thinking, different approaches are needed for different sorts of problems. Statistics deals well with "unorganized complexity", but requires both randomness and large population sizes. Pure analysis, on the other hand, handles "organized simplicity", but is limited by complexity and randomness. Thus, there is a substantial area of "organized complexity" where neither tool can be used:

As discussed below, data mining tools such as Weka apply both statistical and analytical techniques (e.g., analyzing statistical results). Similarly, this document uses the term "analysis" to include statistics, etc.

Analysis functions can range from trivial to extremely complex. The ones which involve only simple statistics and record-keeping lie near the "trivial" end. Data mining and document analysis functions, which can involve much more intricate calculations, go well into the "complex" end. In fact, some forms of analysis can be described as "interesting research projects" (:-).

Fortunately, two factors operate to make analysis less challenging. First, most analysis falls on the trivial end: the details may be complex, but the principles involved are simple. Second, tools are available to perform many kinds of analysis, leaving the MBD implementor with the relatively simple tasks of tool selection, installation, and use.

Although there is no shortage of expensive, proprietary analysis products, many useful tools are also available as Open Source. Their capabilities range from text analysis and indexing through numerical and statistical methods to volumetric data presentation. Typically, they have books and/or extensive documentation which detail the underlying principles and use. Here is a sampling, to whet your appetite...

Business Intelligence

BI (Business Intelligence) techniques (e.g., data mining, OLAP) are commonly used in large enterprises to mechanically extract information and even knowledge from data. Open Source tools now allow even small enterprises to take advantage of these techniques.

Mathematical Analysis

Extremely powerful tools are now available for mathematical analysis and display, using techniques from graph theory, operations research, statistics, symbolic mathematics, etc. Several powerful suites are available to help with these tasks.

Unstructured Data

Although images, text, and video have lots of internal structure, it isn't the sort of thing that fits well into arrays or even schemas. Consequently, they are often referred to as "unstructured data". Here, in any case, are some tools for handling them.


Data Management

The Extraction page glossed over the issue of how MBD data should be managed. Indeed, this page has said very little about data management. Unfortunately, proper coverage of MBD data management would take another whole set of pages.

Still, there are a few things that can and should be said. The first is that some OLAP tools (e.g., Bizgres) rely on a particular RDBMS. If you already have your data in some other database, this could be an issue. On the other hand, disk space is cheap enough that the cost of replicating information should be reasonable.

Moreover, there are good reasons (and strong precedent) for separating your operational data storage from the MBD suite's. An MBD data warehouse can integrate data from disparate sources, making it easier to analyze and present. Because this data is separate from the "live" data, its management policies can be optimized to serve the needs of the MBD effort.

Operational data storage needs to be managed in a conservative manner, because important things can break if it isn't. Adding new tables or columns may involve a great deal of politics, even if no substantial coding changes are required. MBD systems, in contrast, need flexibility and responsiveness: "I need this report now." Also, the live data will not, in general, be integrated into a consistent access framework.

More generally, RDBMS systems aren't optimized for flexibility. In a prototyping environment, a set of YAML files may be far easier to create and use than a set of database tables. Finally, many tools (e.g., OpenCyc, OsiriX, TAP) have their own storage formats. In summary, don't get locked into a single approach.

The Presentation Phase

The concepts page presents a data flow "model" for a Model-based Documentation (MBD) suite. In this model, the presentation phase is responsible for generating various output formats, using assorted tools (e.g., markup languages, documentation generators).

Output Formats

Most documentation needs can be readily met by one of two output formats: PDF or HTML (and friends). Although there is some overlap, it is usually easy to decide which format is appropriate.

There are many ways to generate these formats, but only a few of them are relevant to mechanized production. Consequently, we'll focus our attention on this subset. We'll start with markup languages, then move on to other tools and some web-specific technologies.

Markup Languages

Markup languages such as HTML allow text to be "marked up" with formatting and other ancillary notations. An appropriate tool (e.g., a web browser) can interpret the markup, producing formatted text, hypertext links, etc. Using markup languages, MBD systems can generate attractive PDF documents, web pages, and more.

If you're just producing web pages, HTML is a direct and obvious choice. Even with Cascading Style Sheets (CSS), HTML doesn't allow precise control of formatting, but this is seldom needed.

Unfortunately, HTML can't generate arbitrary PDF documents. Many other markup languages are available, however, offering a wide range of features. Here are three popular "families" of markup languages, to get you started...

Documentation Generators

Documentation generators are actually specialized MBD suites, documenting software-related entities such as data structures, functions, and modules. The results are generally published as web pages, but other formats are sometimes available. In particular, collected information may be available in a form (e.g., XML) that is suitable for follow-on analysis.

Most documentation generators perform a variant of screen scraping, parsing the source code (and specially-formatted comments). Consequently, they are specific to particular programming languages. A few, however, work strictly from comments or binary files. If you are documenting a software project, be sure to investigate this class of tools.

Eye Candy

There are a number of tools that can generate "eye candy" (e.g., diagrams, images, plots) for documents. The trick, of course, is to generate useful eye candy. Here, in any case, are some potentially useful tools.

HTML, redux

There are many ways to generate content for web pages. HTML can be edited by hand, generated directly by a script, or translated from a markup language. Eye candy (e.g., images) has a similar wealth of sources. Now, let's talk about pulling it all together.

Modern web servers and browsers are capable of handling far more than simple HTML. Here are some variations and enhancements which are worth considering for use in MBD:

Page Types

Although MBD-based web sites can have totally arbitrary content and format, certain types of web pages are likely to be useful.

"Entity" pages

Most MBD-generated web pages describe entities, so let's consider the kinds of information that the user should see on such a page:

Because these pages are mechanically generated, it is trivial to provide rich cross-linking between pages, add explanatory descriptions and tooltips, etc. Image-mapped "context diagrams" can be generated to show "close relatives", etc.

Help and tutorial pages can also be linked in, explaining sections, pages, or even sets of pages. Finally, if the user can't find what s/he is looking for, mailto links and a search facility are obvious, trivial, and very useful amenities to add...

"Index" pages

No single index can meet the needs of every user at every time. A user may only be interested in a selected subset of the entities, need them sorted in a particular order, etc. The number of available views (i.e., combinations) can grow very rapidly with the number of options allowed. Nonetheless, it is still a manageable problem.

Mechanical generation of index pages can allow any number of "views" to be shown. With proper planning, navigation between the views can be very simple. For example, the user might navigate between views by clicking in one or more "link tables". Alternatively, in a forms-based interface, checkboxes, menus, and/or radio buttons can be used to good effect.

"Tutorial" pages

To the extent that the presented model matches the organization of the system being documented, the same descriptive material can be used to cover both. That is, understanding of the system helps in understanding of the web site, and vice versa.

Image-mapped diagrams can be used to good effect, allowing the user to examine and "explore" the presented model. Animated diagrams, showing control or data flow, can also be useful. In both cases, mechanized generation techniques can ease the burden of generating the diagrams.

Case Study (FSW)

Due to spacecraft-related security concerns, the GLAST Flight Software (FSW) web site now requires a password. Given SLAC's historic policy of open research, we can hope that the concerns may eventually be handled in a less intrusive manner.

In the meanwhile, although I am able to perform in-person demos of the site (contact me for details), I am unable to give out the password. I apologize for (and sympathize with) any difficulty this may cause.

In a recent contract, I was asked to create a comprehensive web site, providing both overview and detailed documentation for a scientific software development project:

The "production" FSW code is being written in C and assembler. It will be run on multiple, radiation-hardened PowerPC processors under the VxWorks operating system. However, development and test code must run in assorted environments, including Intel-based Linux, SPARC-based Solaris, etc.

Aside from a few dozen hand-written web pages (e.g., tutorials), the site's content is entirely computer-generated. About half of the pages are generated by Doxygen, a well-known documentation generator; the rest are generated by custom Perl scripts.

The input data comes from a variety of sources, including databases, XML files, (electronic and printed) documents, web pages, and assorted file formats (e.g., configuration files, object libraries).

Information can be requested of the development engineers, but there is no guarantee that they will have the time to reply. In short, a fairly typical mechanized documentation scenario.

Design Goals

A number of goals influenced the design:

These goals are typical of many mechanized documentation projects. Equally typical, but worth noting, are some omissions:


As discussed in the "Data Flow" section of the concepts page, MBD techniques can be implemented by means of a DAG (Directed Acyclic Graph) of processing routines. Each routine accepts one or more input data sets and, in traditional batch processing fashion, generates one or more output data sets. This produces a very modular system, because each routine interacts only with its input and output data sets.
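The DAG idea can be sketched in Python (the routine and file names are hypothetical; the ordering uses Kahn's algorithm):

```python
# Each routine names the data sets it reads and writes; a topological
# sort then yields a safe batch execution order.
routines = {
    "harvest": {"reads": [],            "writes": ["raw.yml"]},
    "analyze": {"reads": ["raw.yml"],   "writes": ["stats.yml"]},
    "render":  {"reads": ["stats.yml"], "writes": ["index.html"]},
}

def run_order(routines):
    """Return routine names in dependency order (Kahn's algorithm)."""
    producers = {f: name for name, r in routines.items()
                         for f in r["writes"]}
    deps = {name: {producers[f] for f in r["reads"] if f in producers}
            for name, r in routines.items()}
    order = []
    while deps:
        ready = [n for n, d in deps.items() if not d]
        if not ready:
            raise ValueError("cycle in the data flow")
        for n in sorted(ready):
            order.append(n)
            del deps[n]
        for d in deps.values():
            d.difference_update(ready)
    return order

print(run_order(routines))   # ['harvest', 'analyze', 'render']
```

Because each routine touches only its own input and output data sets, routines can be added, replaced, or re-run independently.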

This implementation consists of several dozen smallish Perl scripts, supplemented by assorted command-line utilities (e.g., dot, make, troff). Most of the scripts are under 500 lines in length, but a couple exceed 1000 lines. An ancillary library of perhaps 1500 lines supplies all of the "generic" code. Few of the routines, in any case, are difficult to comprehend.

YAML files are used for both the hand-edited and intermediate data sets. Because YAML is a simple and powerful data serialization format, each file can be a nicely-formatted textual representation of a loadable data structure. After loading the input structure(s), some scripts create "helper" data structures (e.g., additional indexes into the data).

All text files, whether hand-edited or machine-generated, are thoroughly commented. Each generated file receives an informative header, indicating the file's format, origins, purpose, etc. Section comments provide context and ease navigation.

Base Technologies

For a variety of (mostly pragmatic) reasons, the project uses Open Source tools whenever possible. The ability to modify code is important, as are cost factors and the ability to share applications with other institutions. In any case, many Open Source standbys (e.g., CMT, CVS, Doxygen, GCC, Graphviz, Groff, ImageMagick, Linux, MySQL, Perl, Python, Swish-e) are in use.

That said, some proprietary software is also being used. This includes database systems (e.g., Oracle), operating systems (e.g., Solaris, VxWorks, Windows), and applications (e.g., Adobe's Acrobat and Frame; Microsoft's Excel, Outlook, and Word). The decision to use proprietary software is generally based on either familiarity or the lack of an acceptable Open Source alternative.

Finally, quite a bit of the software infrastructure has been developed from scratch or adapted from Open Source tools. Aside from my own contributions, there are a couple of "test executives", an interactive packet specification application, etc. The group also uses a version of CMT which has been extensively modified to handle local requirements. For more information, see this introduction.

Data and Control Flow

The documentation suite consists of several dozen Perl scripts (~20K lines) and hand-edited YAML reference files (~30K lines). The suite is run early each morning, by means of cron and make. It produces (as needed) tens of thousands of files, in a variety of formats:

Several "tricks" are employed to ease maintenance, increase reliability, and optimize performance.

The DDF declarations, collectively, provide an abstract model of the system's data flow. Specifically, programs and (sets of) data files are represented as nodes in a DAG. Connections between nodes (e.g., read or write access, "include" usage) are represented as edges. Each hand-edited file "knows" its relative path, description, label (for diagrams), and type. Scripts, in addition, know which files they use or create. Generated files are described by their originating scripts.

As odd as all this may sound, this is only a slight variation on a traditional Unix-style batch processing system. The use of cron and make are commonplace, as is the use of textual files for data interchange. The only unusual aspects, really, lie in the makefile generation technique and the use of YAML as an encoding format for data structures.

One interesting aspect of this implementation is that data structures are "first-class citizens" of the design process. Given that OOP (Object-oriented programming) techniques are based on hiding data structures, this may seem odd. However, this approach appears to provide a great deal of modularity, which is one of the major goals of the OOP approach.

It would be fairly trivial to convert this design into an event-based system. Given that the scripts are written in Perl, I would probably turn to POE (Perl Object Environment), which supports a very flexible approach to event-based programming. C++, Python, and Ruby have roughly equivalent systems, known respectively as ACE (Adaptive Communication Environment), Twisted, and dRuby (Distributed Ruby).

Although an event-based approach wouldn't need a generated makefile, it would still be necessary to have a setup script, in order to "program" the event distribution and check for cycles in the data flow. It would also be a good idea to use atomic file writes (e.g., write a temporary file, then rename it to the "final" name) to eliminate incomplete data transmissions.
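The atomic-write trick is worth a sketch of its own; in Python, it might look like this (the file name and contents are illustrative):

```python
import os
import tempfile

def atomic_write(path, text):
    """Write a temporary file, then rename it into place, so that
    readers never see a half-written data set."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp, path)   # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("stats.yml", "bug_count: 42\n")
print(open("stats.yml").read())   # bug_count: 42
```

The temporary file is created in the destination directory so that the final rename stays within one filesystem.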

Little Languages

The Unix community is rife with "little languages" such as dot, eqn, grap, grep, lex, pic, tbl, and yacc. Although limited in scope, they perform their (specialized) functions very well. Unfortunately, creation of little languages generally requires the use of tools such as lex and yacc. By using YAML to handle the first-order parsing issues, I was able to avoid this hassle and create a number of declarative "little languages".

Aside from the DDF entries (described above), I created a language for generating data-flow animation sequences, a couple of "mini-templating" systems, etc. These languages are primarily used with hand-edited files, where they dramatically reduce the amount of typing (and in some cases, thinking).

Most of these "languages", to be sure, consist of simple, special-purpose macro expansions. For example, I make frequent use of brace expansion (à la the shell) to handle expressions such as "e_cat/{cat,html}.yml". In the case of DDF entries, this is extended to create multiple path entries (in the generated file_sets file) for any patt entries containing brace expressions.
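Brace expansion itself is only a few lines of code; here is a Python sketch (not the project's actual Perl implementation) that handles nested alternations recursively:

```python
import re

def brace_expand(patt):
    """Expand shell-style {a,b,...} alternations in a pattern."""
    m = re.search(r'\{([^{}]*)\}', patt)
    if m is None:
        return [patt]
    head, tail = patt[:m.start()], patt[m.end():]
    results = []
    for alt in m.group(1).split(','):
        results.extend(brace_expand(head + alt + tail))
    return results

print(brace_expand("e_cat/{cat,html}.yml"))
# ['e_cat/cat.yml', 'e_cat/html.yml']
```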

In the case of the data flow animation sequences, the "base" data flow diagrams are defined by hand-coded dot files. The YAML file defines highlighting sequences, inter-sequence pauses, etc. After reading this file, a script edits each dot file into a sequence of modified files (one for each "frame"). These files are then turned into images which can be concatenated into a QuickTime movie.

Even with the dense encoding that its little language affords, the animation specification is several hundred lines of intricate YAML. Without this compression, it would be ridiculously large, completely impractical to edit, and far more subject to error. In short, I believe that YAML-based little languages are a powerful addition to the MBD developer's repertoire.


Although some projects actually use documentation as a design tool, most documentation is created "after the fact". So, it's likely that you're dealing with an existing system. Your task is to enhance the current documentation, producing a consistent, integrated result. Using MBD's phases as an outline, let's look at some ways to prepare for this:

Data Catalogs

Information on a system's data flow (e.g., creation, storage, use) is a useful and frequently overlooked form of documentation. In addition, knowing about available data sources is very helpful in mechanized document generation. Finally, researching these issues will uncover many useful tidbits about the system.

It isn't necessary to create a single, formal "data catalog". In fact, the effort to create one may be counter-productive, because many pieces of information won't "fit" a rigid format. Instead, create an informal web page for each "lump" of data, capturing critical information (e.g., format, name, owner, users), as well as informal notes, open questions, etc.
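For example, a page for one "lump" of data might capture something like the following YAML sketch (the field names here are ad hoc, not a fixed schema):

```yaml
# Illustrative catalog entry; field names are ad hoc, not a fixed schema.
name:     build_log
format:   plain text, one line per build step
owner:    release engineering
users:    [developers, QA]
location: /var/log/build
notes: |
  Rotated weekly; older logs are gzipped.
open_questions:
  - Who consumes the timing fields?
```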

Don't be afraid to catalog "oddball" data sources. If a ReadMe file, spreadsheet, or PDF document contains interesting information, list it. Figuring out how to capture the data in a convenient format can come later...

Once you have some pages started, invite others to report errors, omissions, etc. In time, you may have enough information to create formal pages, data flow diagrams, etc. In the meanwhile, you'll have a very handy set of notes!

The Project Wiki

Start up a wiki for the project. Take some care in selecting the wiki software. You'll want it to be comfortable, popular, robust, scalable, and rich in useful features. It should also be undergoing active development in areas such as dynamic content, semantic wikis, etc.

My current choice, FWIW, is MediaWiki, the technology underlying Wikipedia. It is written in PHP, so it installs easily on a wide range of platforms. It has extensive documentation, a variety of mailing lists, etc. Although MediaWiki's principal focus is supporting Wikipedia, some of its development goes beyond this purpose.

Adding mechanically-generated content is quite feasible. MediaWiki uses an RDBMS (typically MySQL), so you can simply do a careful set of table updates. Indeed, the MediaWiki source code can be co-opted to help in this. Alternatively, page edits can be done over HTTP, using a library such as LWP.
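In Ruby (rather than Perl's LWP), building such an HTTP edit request might look like the sketch below. The wiki URL and page title are made up, and a real MediaWiki edit also needs a login session and an edit token:

```ruby
require 'net/http'
require 'uri'

# Sketch of pushing generated content into a wiki page over HTTP.
# The api.php "edit" parameters shown here are the bare minimum;
# authentication and tokens are omitted for brevity.
def build_edit_request(base_url, title, text)
  uri = URI.join(base_url, "api.php")
  req = Net::HTTP::Post.new(uri)
  req.set_form_data("action" => "edit", "title" => title, "text" => text)
  req
end

req = build_edit_request("http://wiki.example.com/", "Generated/HostList", "...")
# The request would then be sent with Net::HTTP.start(...) { |http| http.request(req) }
```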

Because MediaWiki has support for Transclusion, generated content can be written into its own pages and "included" into human-edited pages. This limits the chance of collisions, editing errors, etc.

Dynamically-generated content is also possible. Thomas Gries has extended the InterWiki facility to perform transclusion from arbitrary web servers. His WikiMania 2004 slide set, Getlets: extending the interwiki concept beyond any limits, describes the extension's motivation, capabilities, architecture, and syntax.

Once you have the wiki set up, you'll find that it is a useful place to "publish" ideas, try out experiments, etc. One obvious notion (and my current activity :-) is to use the wiki as a place to publish "system sketches". For example, see my Unix Ontology page.

System Sketches

Any significant system is going to be very complicated. Fortunately, a model doesn't have to capture everything in order to be useful. In fact, it's in the very nature of models to be incomplete. The trick lies in deciding which aspects of the system the model needs to represent.

Think about the types of entities and relationships that underlie the system. Which entities deserve web pages? Which relationships deserve links? As you proceed, publish "snapshots" of your current thinking (e.g., textual descriptions, diagrams) on the wiki.


Diagrams can be very useful, but Humpty Dumpty's advice applies:

    "The question is," said Humpty Dumpty, "which is to be master -- that's all."

So, sketch out and study some informal diagrams, using whatever tools and notations you find convenient. Diagrams, like words, should not be the master.

There are many formal and precise tools for diagramming systems, including Conceptual Graphs, Entity-relationship diagrams, Object-Role Modeling, and Unified Modeling Language diagrams. The problem with these tools, for this exercise, is that their very formality and precision may get in the way.

Use a simple graph notation (e.g., circles for entities, lines for relationships). A diagram editor (e.g., Dia, Graphviz, OmniGraffle, Visio) can help you to produce "pretty" diagrams, but paper or a whiteboard may be more comfortable for initial and/or collaborative exploration.
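For example, an informal sketch might start life as a small hand-coded dot file; the entity and relationship names below are illustrative:

```dot
// Informal system sketch: circles for entities, labeled lines for relationships.
graph sketch {
  node [shape=circle];
  program -- database   [label="reads"];
  program -- library    [label="links"];
  program -- test_suite [label="tested by"];
}
```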


Once you have the appropriate tools, start brainstorming about things that you might want to document. Software systems typically include items such as databases, functions, libraries, modules, and programs. Software development efforts have bug reports, tests, etc. Corporate settings have office locations, reporting structures, etc.

As you add items, consider which relationships might be of interest. A program's web page might link to relevant databases, functions, libraries, programmers, test suites, etc. Don't try for perfection; just try to get the main items and relationships recorded...

Report Summaries

While you're waiting for help with the data catalogs, start drafting a set of report summaries. The initial purpose of these summaries is to define the objectives and strategies for each "report" (i.e., plot, table, web page) you'll be generating. Each summary should contain, at least:

Once the documentation is generated, the summary should be brought up to date, expanded, and made available (e.g., as a "help page") to both developers and users.

Coding Practices

Most programmers agree that code should be well commented, have a clear and consistent style, etc. They understand that these practices improve readability, ease debugging, and reduce errors.

The same line of reasoning can (and should) be applied to mechanically-generated data files. By spending a little extra effort on the generating code, you can make textual output files easy (nay, pleasant) to read.
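For example, a generator might emit a header comment and sorted keys, so that the output file is easy to read and diff. The generator name (mk_shells) and data are made up:

```ruby
require 'yaml'

# A small courtesy in generating code: a header comment plus sorted keys
# makes the output file pleasant to read and stable under diff.
def dump_with_header(data, generator)
  header = "# Generated by #{generator}; edit the source, not this file.\n" \
           "# #{data.size} entries, sorted by key.\n"
  header + data.sort.to_h.to_yaml
end

text = dump_with_header({ "zsh" => "/bin/zsh", "bash" => "/bin/bash" }, "mk_shells")
puts text
```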


Model-based Documentation (MBD) is an integrated approach to the design and development of semi-automated document production systems. Specifically, MBD uses a consistent "system model" (at various levels of abstraction) to provide conceptual clarity, ease navigation, and guide the development process.

The following tables list several dozen Open Source tools, summarizing their applicability for use in an MBD project. The first three tables list tools which are primarily applicable to a single phase; the fourth table lists "Multi-phase" tools.

The last table lists tools which are applicable to Modeling. Of course, these tools may also be relevant to production MBD suites. Please contact me with any comments, questions, etc.


Tool            | Description                            | In
(BEE) ETL Tool  | extract, transform, and load suite     | DS
Clover.ETL      | extract, transform, and load framework | DS
Enhydra Octopus | extract, transform, and load system    | DS
KETL            | extract, transform, and load system    | DS
Kettle          | extract, transform, and load system    | DS


Tool                    | Description                                   | In
COIN-OR                 | operations research suite                     | DS
Maxima                  | computer algebra system                       | DS
mg                      | digital library manager                       | DS
Mondrian                | online analytical processing (OLAP) database  | BD, PD
MySQL                   | Relational DBMS                               | DS
Octave                  | numerical calculation tool                    | DS
PostgreSQL              | Object/Relational DBMS                        | DS
SQLite                  | memory-based Relational DBMS                  | DS
The OpenScience Project | links to scientific software                  | DS
Weka                    | data mining suite                             | PD


Tool          | Description                             | In
Apache        | web server                              | DS
AurigaDoc     | multi-target documentation tool         | DS
Cacti         | networked graphing tool                 |
Cocoon        | web (etc) application framework         | DS
Dia           | diagram creation program                | DS
DocBook       | typesetting suite                       | DS
Enhydra       | application server                      | PD
GIMP          | GNU Image Manipulation Program          | DS
gnuplot       | data plotting program                   | DS
Graphviz      | graph visualization suite               | DS
ImageMagick   | image manipulation suite                | DS
JasperReports | report generator                        | BD
JBoss         | application server                      | PD
JOSSO         | SSO (single sign-on) infrastructure     | PD
JPivot        | spreadsheet-like front end for Mondrian | BD, PD
OpenI         | report generator                        | BD
OpenOffice    | office suite                            | DS
Ploticus      | numerical (etc) display tool            | DS
Rails         | web (etc) application framework         | DS
Rhino         | Javascript engine                       | PD
RRDtool       | data logging and graphing tool          |
TeX (etc)     | typesetting suite                       | DS
Texinfo       | the GNU documentation system            | DS
Troff (etc)   | typesetting suite                       | DS
unroff        | Troff to HTML, etc.                     | DS


Tool                     | Ph. | Description                                       | In
BIRT                     | _AP | Business Intelligence and Reporting Tools         | PD
Doxygen                  | EAP | software documentation generator                  | DS
Eclipse                  | EAP | Java IDE and application platform                 | PD
Ganglia                  | EAP | OS / network monitoring system                    |
Ghostscript              | E_P | PDF / PostScript suite                            | DS
GRASS                    | EAP | Geographic Information System                     | DS
HTML Tidy                | E_P | HTML clean-up tool                                | DS
Joomla!                  | _AP | content management system                         | DS
Kowari                   | EAP | database for metadata                             | DS
Maxima                   | _AP | computer algebra system                           | DS
Nagios                   | EAP | OS / network monitoring system                    |
NeDi                     | EAP | OS / network monitoring system                    |
Natural Docs             | EAP | software documentation generator                  | DS
Osirix                   | _AP | volumetric data visualization tool                | DS
pdftk                    | E_P | PDF ToolKit                                       | DS
PDL                      | _AP | Perl Data Language (array processing and display) | DS
R                        | _AP | numerical/statistical analysis and display tool   | DS
SchemaSpy                | EAP | database diagrammer                               | DS
Scilab                   | _AP | numerical analysis and display tool               | DS
Swish-e                  | _AP | indexing system for HTML, PDF, etc.               | DS
Synopsis                 | EAP | software documentation generator                  | DS
UIMA                     | EA_ | Unstructured Information Mgmt. Arch.              | DS
VTK                      | _AP | 3D visualization toolkit                          | DS
Xpdf                     | E_P | PDF / PostScript suite                            | DS
W3C Open Source Software | _AP | W3C index to Open Source software                 | DS


Tool            | Description                                                       | In
Comet           | constraint-based, OO programming language for modeling and search | DS
KAON            | ontology management infrastructure                                | DS
MindRaider      | Semantic Web outliner                                             | DS
OpenCyc         | knowledge base and commonsense reasoning engine                   | DS
PowerLoom       | knowledge representation system                                   | DS
Protégé         | knowledge base framework                                          | DS
TAP             | RDF-based knowledge engineering suite                             | DS
TM4J            | Topic Map suite                                                   | DS
Topic Map Tools | annotated list of Topic Map tools                                 | DS
ZTM             | Zope Topic Map system                                             | DS

Distribution Status

The last column (In) of each table above indicates where the package can be acquired:


Because MBD spans a number of disciplines, this book list is quite eclectic. In addition, some books must be approached with a firm MBD bias, lest the reader get drawn off into interesting, but distracting, topics. Finally, a book's absence from the list does not mean it is unworthy or inappropriate for MBD. Feel free to contact me with suggestions...

Data Mining, Warehousing, ...

Databases, SQL, ...

Modeling: ORM2, UML, ...

Ruby, Rails, ...

XML: Cocoon, DocBook, ...