MBD: Concepts

Rich Morin, rdm@cfcl.com


MBD (Model-based Documentation) is an integrated approach to the design and development of semi-automated document production systems. Specifically, MBD uses a consistent "system model" (at various levels of abstraction) to provide conceptual clarity, ease navigation, and guide the development process.

MBD bridges the gap between traditional documentation and report generation techniques, leveraging the strengths of each. It works well for generating timely, integrated, and detailed documentation for large systems. At the same time, it facilitates the rapid prototyping of specialized documents and reports. Although I have only used MBD in a software development context, I believe it to be applicable to any substantial system that depends extensively on computers.

This page introduces some key concepts in MBD, discussing the role of models in intelligence, communication, and documentation. It then presents an abstract data flow model for MBD processing.

Models

In the popular MVC (Model-View-Controller) architecture, the model stores the data that underlies the application. My use of the term is a bit broader, including "mental models", etc. Here are some applicable definitions:

    A small object ... that represents in detail another, often larger object.

    A schematic description of a system ... that accounts for its known or inferred properties and may be used for further study of its characteristics ...

    - dictionary.com

Either of these definitions could, with a bit of stretching, describe documentation. More to the point, documentation which is based on these sorts of models has a built-in conceptual and organizational framework. This framework can assist both the documenter and the reader, as discussed below.

More generally, because modeling is fundamental to both intelligence and communication, thinking about the underlying model(s) is an inseparable part of good documentation design. Let's explore this connection a bit further...

Intelligence

Jeff Hawkins ("On Intelligence") contends that intelligence and memory are byproducts of modeling activity in the neocortex. Specifically, our brains record and recognize "invariant patterns" in received data, then make and test predictions about other, related patterns. After hearing a few notes from a known song, we can recognize a pattern and predict the following notes. If the prediction fails, our attention will be drawn to the disparity and our brains will try to resolve it. Is this a stylized rendition, a different song, or what?

Using billions of interconnected neurons, the neocortex can recognize and predict many levels and variations of patterns. The incoming flow of information interacts with the stored patterns, causing them to be interlinked, modified, reinforced, etc. Collectively, the patterns form a "model" of perceived reality. We may think that we are interacting with "reality", or at least what our senses report about it, but we are actually dealing with this internal model.

Communication

Communication, similarly, is based on the transmission and sharing of models. The sender offers assertions, qualifications, connections, etc. If the communication is successful, these will be incorporated into the receiver's models. Modeling also plays a significant role in message preparation. A "mental model" of readers' typical backgrounds (or really, of their mental models) influences the sender's choice of material, presentation style and order, etc.

In preparing these pages, I thought about the key concepts I wanted to present and the likely backgrounds of my readers. Using familiar concepts (e.g., brain, memory) as a base, I presented new concepts (e.g., invariant patterns). Because "neocortex" might be an unfamiliar concept, I defined it by context and added an HTML link. As I got further into the material, I was able to rely on concepts already presented. Thus, my model of the readers' models was predicated on their ongoing comprehension of the material being presented.

This sort of model-based "second guessing" is very common in human communication. Different forms of communication (e.g., articles, conversations, documentation, formal and informal presentations) employ it to different degrees, however. In casual conversation and informal presentations, the speaker may not pay much attention to organizing the presentation. After all, the listener(s) can easily ask about anything that isn't understood. Thus, interactivity reduces the need for organization.

Formal presentations and most forms of written communication do not allow for easy interaction, so authors tend to spend significant effort on organization. They write outlines, move material around, and generally polish the text until the ideas "flow" smoothly. Some trickery may be required, because many topics don't "serialize" cleanly. For example, the author may have to use a "placeholder" definition, because a full definition would interrupt the flow and/or rely on material not yet presented.

Documentation

Documentation differs from other forms of written communication in several ways, including complexity, scale, information level, and typical usage. These differences affect the way that documentation is (or at least should be :-) designed.

  • Detailed documentation for complex systems will also be complex. If a system has thousands of related entities (e.g., data structures, functions), its documentation will need thousands of descriptive entries. For efficient navigation, each "interesting" relationship will require a cross-reference or link.

  • Low-level facts (e.g., function usage) and quantitative summaries (e.g., plots and tables) can typically be harvested from the system under consideration, but abstract information (e.g., descriptions, tutorials) must be created by humans. Consequently, detailed documentation tends to contain a relatively small (but very important!) amount of abstract information.

  • Documentation tends to be used in a non-sequential manner. Most users have particular problems to solve; they are exploring the documentation in the hope of finding (at least part of) a solution. So, they skip around, reading bits and pieces in no predictable order.

In summary, documentation creation presents unique challenges. The writer must present vast amounts of information, with very little knowledge of the readers' mental models. About all that the writer can predict is that the reader will not approach the material in a sequential fashion.

Navigation

Because the writer can't predict the reader's background, other means must be used to supply context. In practice, this means that the author must provide ways for the reader to find (i.e., navigate to) any needed background information. Navigation can be supported by various mechanisms: indexes, links, search engines, etc. Using mechanized techniques, it's quite easy to generate web pages full of links, pop-ups, clickable diagrams, etc. The tricky part is to make the result comprehensible and useful.

MBD addresses this problem, where possible, by basing the site's organization and navigation on an abstract model of the documented system's entities and relationships. Because the model is consistent, readers soon learn how to navigate around the documents. Overview text and diagrams can ease and expedite this process.
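
As a minimal sketch of this idea (in Python, using hypothetical entity names and a made-up URL scheme rather than anything from an actual MBD suite), the model can be a table of entities plus typed relationships, from which every page's navigation links are derived by one uniform rule:

```python
from collections import defaultdict

# Hypothetical model: each entity name maps to its kind.
ENTITIES = {
    "parse_config": "function",
    "Config": "data_structure",
    "load_plugins": "function",
}

# Hypothetical typed relationships: (source, relation, target).
RELATIONS = [
    ("parse_config", "returns", "Config"),
    ("load_plugins", "reads", "Config"),
]


def page_url(name):
    """Map an entity to its page, using the same rule for every kind."""
    return f"/{ENTITIES[name]}/{name}.html"


def nav_links(relations):
    """Derive outgoing ("out") and incoming ("in") links for every entity."""
    links = defaultdict(list)
    for source, relation, target in relations:
        links[source].append(("out", relation, page_url(target)))
        links[target].append(("in", relation, page_url(source)))
    return {name: sorted(links[name]) for name in ENTITIES}


if __name__ == "__main__":
    for name, entries in nav_links(RELATIONS).items():
        print(name, entries)
```

Because every link comes from the same model, a reader who has learned how one page is laid out can expect the same layout on all the others.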

Modeling MBD

Typically, MBD uses a "suite" of cooperating programs, mixing general-purpose utilities (e.g., text formatters) with special-purpose programs (e.g., data filters). Taking our own advice, let's model a typical MBD suite. The following diagram shows (very abstractly!) how the suite fits into the overall documentation data flow:

Our goal is to document a system, using a combination of its own Data (e.g., databases, files) and any related Info (e.g., documentation, institutional memory). If an existing document is useful and presentable, we'll simply add it to our collection of Documents. Alternatively, information can be extracted from Info and/or Data sources, analyzed, and presented in Documents.

The dotted line connecting the Info and Data boxes indicates that they are closely related. For example, a system's documentation should describe the nature of its data. This relationship isn't part of the MBD suite, but it forms an important part of the suite's working environment.

Processing Phases

An MBD Suite can be considered as having three processing phases:

  • Extraction - obtaining and storing the desired data

    Data extraction (i.e., input filtering) is usually straightforward. Although some data sources use complex, undocumented formats, most do not. In addition, tools and libraries are often available to access complex formats.

  • Analysis - combining or tallying data, discerning patterns, etc.

    Analysis may range from trivial (e.g., tallying bug reports) to extremely difficult (e.g., extracting information from unstructured text).

  • Presentation - generating diagrams, indexes, plots, tables, etc.

    Generating documents is seldom challenging, once the content and layout have been determined. However, selecting and learning how to configure and use various tools (e.g., Cocoon, dot, Rails, troff) can require significant start-up effort.

A trivial MBD application might combine all three phases in a single program. For example, a Perl script might access Bugzilla's MySQL database, tally selected bug reports, and generate a web page. As more types of data sources and generated documents come into play, however, a single program will become unwieldy.
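
To make the three phases concrete, here is a hedged sketch of such a single-program application. It is written in Python rather than Perl, and it uses an in-memory SQLite table with an invented two-column schema as a stand-in for Bugzilla's MySQL database; the table layout and sample rows are assumptions for illustration only.

```python
import sqlite3
from collections import Counter
from html import escape


def extract(conn):
    """Extraction: pull (status, component) rows from the data source."""
    return conn.execute("SELECT status, component FROM bugs").fetchall()


def analyze(rows):
    """Analysis: tally open bug reports per component."""
    return Counter(component for status, component in rows if status == "open")


def present(tally):
    """Presentation: render the tally as a small HTML table."""
    body = "".join(
        f"<tr><td>{escape(component)}</td><td>{count}</td></tr>"
        for component, count in sorted(tally.items())
    )
    return f"<table><tr><th>Component</th><th>Open bugs</th></tr>{body}</table>"


def demo_connection():
    """Build a throwaway in-memory database holding made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (status TEXT, component TEXT)")
    conn.executemany(
        "INSERT INTO bugs VALUES (?, ?)",
        [("open", "parser"), ("open", "parser"), ("closed", "parser"), ("open", "ui")],
    )
    return conn


if __name__ == "__main__":
    print(present(analyze(extract(demo_connection()))))
```

Even at this size, keeping the phases in separate functions makes it easier to split them into separate programs later.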

Real-world MBD applications may have dozens of data sources, generate many kinds of reports, etc. Consequently, they will have many instances of Extraction, Analysis, and Presentation programs. There may also be cases where multiple levels of analysis are needed.

Data Flow

Here is a slightly less abstract diagram, showing the flow of data through these (sets of) programs:

As the diagram indicates, the data traverses a directed graph. With a bit of care, the graph can be constrained to be acyclic (no program can modify data that affects its own input). This may sound quite theoretical, but it has some useful consequences.
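
One way to check the acyclic constraint mechanically is a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the node names are hypothetical and simply follow the S/E/A/P naming used on this page.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(inputs_of):
    """Return the nodes in an order where every input comes first.

    inputs_of maps each node to the set of nodes it reads from.
    Raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(inputs_of).static_order())


def is_acyclic(inputs_of):
    """Report whether the data flow graph contains no cycle."""
    try:
        run_order(inputs_of)
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical chain: source S1 feeds E1, which feeds A1, which feeds P1.
    good = {"E1": {"S1"}, "A1": {"E1"}, "P1": {"A1"}}
    print(run_order(good))
    # Letting E1 also read P1's output closes a loop.
    bad = {**good, "E1": {"S1", "P1"}}
    print(is_acyclic(bad))
```

If the check passes, the same ordering can serve as a valid sequence in which to run the programs.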

In particular, it means that dependency-based tools such as make (or even parallel make) can be used to control the suite's operation. Better yet, we can run the entire suite under cron and have an automated "service" that periodically checks the data sources and regenerates the output documents that depend on any that have changed.
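
The rule that make applies can be sketched in a few lines: a target needs rebuilding if it is missing or older than any of its inputs. The Python below is an illustration of that rule, not a replacement for make; the file names in the demo are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def demo():
    """Walk one target through missing, fresh, and out-of-date states."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "bugs.csv")
        target = os.path.join(tmp, "report.html")
        open(source, "w").close()
        missing = is_stale(target, [source])    # target not built yet
        open(target, "w").close()
        os.utime(source, (1000, 1000))          # set timestamps explicitly
        os.utime(target, (2000, 2000))
        fresh = is_stale(target, [source])      # target newer than source
        os.utime(source, (3000, 3000))
        outdated = is_stale(target, [source])   # source changed afterwards
        return missing, fresh, outdated


if __name__ == "__main__":
    print(demo())
```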

The following diagram shows the data flow for a smallish MBD suite. If data source S1 changes, programs E1, A1, and P1 will be run, producing an updated version of document D1. Programs A2, P2, and P3 will also be run, updating all of the other documents. Changing S3 or S4, in contrast, would only cause D3 and D4 to be updated.
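
The propagation described above can be computed by walking the graph downstream from the changed source. The edge list in this sketch is an assumption: it was chosen to be consistent with the description in the text and is not a transcription of the diagram.

```python
from collections import deque

# Hypothetical edges, producer -> consumers. S = source, E = extraction,
# A = analysis, P = presentation, D = document.
FEEDS = {
    "S1": ["E1"], "S2": ["E2"], "S3": ["E3"], "S4": ["E4"],
    "E1": ["A1", "A2"], "E2": ["A2"], "E3": ["P3"], "E4": ["P3"],
    "A1": ["P1"], "A2": ["P2", "P3"],
    "P1": ["D1"], "P2": ["D2"], "P3": ["D3", "D4"],
}


def affected(changed):
    """Return every program and document downstream of a changed source."""
    seen = set()
    queue = deque(FEEDS.get(changed, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(FEEDS.get(node, []))
    return seen


if __name__ == "__main__":
    print(sorted(affected("S1")))
    print(sorted(affected("S3")))
```

The function returns an unordered set; deciding the order in which to run the affected programs is left to the dependency tool.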

To be sure, this description leaves out many implementation details. For example, it says nothing about information storage and transmission. Are we using files, a database, or what? We also need to ensure that each program's input information is complete and consistent while it is processing. Given that the input could be several thousand files, a bit of trickery may be needed.
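
One common trick for file-based storage (offered as an assumption, not necessarily what any particular MBD suite does) is to write each output to a temporary file and then rename it into place, so that a downstream program sees either the old version or the new one but never a half-written file. This sketch covers a single file only; a consistent snapshot of thousands of files needs more than this.

```python
import os
import tempfile


def publish_atomically(path, text):
    """Write text to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # atomic rename on POSIX systems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def demo():
    """Publish twice and return the final content and directory listing."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "report.html")
        publish_atomically(target, "<p>draft</p>")
        publish_atomically(target, "<p>final</p>")
        with open(target, encoding="utf-8") as handle:
            return handle.read(), sorted(os.listdir(tmp))


if __name__ == "__main__":
    print(demo())
```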

Nonetheless, this approach is not science fiction. I recently created a web site that contains tens of thousands of web pages, image-mapped diagrams, etc. Because the pages are heavily cross-linked, the site contains hundreds of thousands of links and tooltips. Even so, the site is easy to navigate and understand.
