Rich Morin,
rdm@cfcl.com
Overview
Where is the wisdom? Lost in the knowledge.
Where is the knowledge? Lost in the information.
-- T.S. Eliot
Where is the information? Lost in the data.
Where is the data? Lost in the #@$$%?!& database.
-- Joe Celko, database consultant/writer
As these authors imply, there are many kinds of information and many ways to lose the use of it. Whatever the reason, if you can't locate information when you need it, you have lost the benefits of having it. Access to information is critical to the success of any enterprise. Its loss can result in confusion, frustration, and inefficiency, to say nothing of direct financial impact.
Any significant enterprise can be thought of as a
system:
an assemblage of inter-related elements comprising a unified whole.
In computer-based systems,
large amounts of detailed information are collected
about these elements.
However, even well-managed systems may suffer
from "indirect" forms of information loss:
Model-based Documentation (MBD) addresses these problems (among others)
by providing an integrated approach
to the semi-automated generation of detailed system documentation.
A consistent "model" (e.g., set of
entities and relationships,
based on the system being documented)
assists and guides the process:
Let's look at how this is accomplished...
Systems are composed of entities and relationships.
Some entities are tangible:
buildings, computers, employees.
Others are more abstract:
bug reports, meetings, programs, releases, telephone numbers.
A system's capabilities and internal structure
are defined by its entities and the relationships between them.
Because most enterprises store their operational data on computers,
a large amount of detailed information is likely to be stored online.
Unfortunately, the information is frequently
balkanized by technical or organizational boundaries,
format incompatibilities, etc.
Even without these impediments,
the complexity of production systems can be overwhelming.
Essentially, all models are wrong, but some are useful.
-- George E. P. Box
MBD addresses these problems by
modeling key entities and relationships.
This mapping, like those used in
object-oriented programming and
computer-based ontologies,
takes advantage of the fact
that large numbers of instances
can be described by small numbers of classes.
For example, we can define classes for methods, data structures,
and the relationships between them.
We can then track (and report on) arbitrary numbers
of data structures, method calls, etc.
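As a rough illustration of this idea, here is a minimal Python sketch.
All of the names in it (Entity, Relationship, Model, and the sample
methods and data structure) are hypothetical, chosen only for this
example. Two small classes describe any number of instances, and a
simple tally serves as a report.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """One instance of a modeled class (e.g., a method)."""
    kind: str   # class name, e.g. "Method" or "DataStructure"
    name: str   # instance name


@dataclass(frozen=True)
class Relationship:
    """A typed, directed connection between two entities."""
    kind: str        # relationship class, e.g. "calls" or "reads"
    source: Entity
    target: Entity


@dataclass
class Model:
    entities: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def relate(self, kind, source, target):
        """Record a relationship and register both of its endpoints."""
        self.entities.update((source, target))
        self.relationships.append(Relationship(kind, source, target))

    def count_by_kind(self):
        """Tally relationships by class, as a simple report."""
        tally = {}
        for rel in self.relationships:
            tally[rel.kind] = tally.get(rel.kind, 0) + 1
        return tally


# Hypothetical sample instances.
model = Model()
parse = Entity("Method", "parse_config")
load = Entity("Method", "load_file")
table = Entity("DataStructure", "symbol_table")
model.relate("calls", parse, load)
model.relate("reads", parse, table)
model.relate("writes", load, table)
report = model.count_by_kind()
```

The class definitions stay the same size however many instances are
recorded; only the data grows.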
By cataloging high-level information on available data sources,
MBD makes it easier to identify and locate needed data.
We can start to think about questions that this wealth
of detailed information could be used to answer.
As information is extracted and used,
the model will increase in detail,
aiding the investigation of more complex questions.
MBD normally begins with the construction of a (rough) model
of the system in question.
This should catalog and describe the system's major entities
(i.e., components and subsystems)
and relationships (e.g., control and data flows).
Note that this isn't a physical model (as in model airplanes),
let alone a simulation.
Instead, it follows this definition:
A schematic description of a system ...
that accounts for its known or inferred properties
and may be used for further study of its characteristics ...
- The American Heritage Dictionary of the English Language,
Fourth Edition
Some parts of the model will become visible in the documentation,
providing a consistent structure, convenient navigation, etc.
Other parts will remain hidden from the user,
but will motivate and guide the development process,
allow disparate data sources to be used together, etc.
In either case, the model will improve the utility and convenience
of the documentation.
Meanwhile, the presence of the model will help
to motivate and support analytical thinking.
A software development manager might, for example,
ask for comparative frequencies of pre- and post-release bugs.
Alternatively, it might be useful to rank software modules
by bug frequency, check-in activity, and/or size.
A simple chart or table can be very illuminating,
if it shows the right information in the right way!
Of course, getting to the right view
of the right data can be challenging.
Consequently, MBD efforts
tend to be exploratory, collaborative, and iterative:
Before a report can be generated,
the necessary data must be extracted from the correct data source.
Some amount of analysis may also be needed.
In some cases, everything may be highly organized and readily accessible.
In other cases, substantial effort may be required
to extract the needed data.
It's very useful to collect all of the extracted data
into an MBD-specific
data warehouse (roughly, database).
This allows consistent access methods to be used,
reduces duplication of effort,
keeps the MBD project's storage
under its own administrative control, etc.
Views of the data can be generated and presented, of course,
before a data warehouse is in place.
In fact, this is a common approach for prototyping reports, etc.
Folding interesting data into the warehouse, however,
may expedite future inquiries.
It also opens the door to the use of existing analysis programs.
Once the data is in a consistent and easily accessible form,
mechanized analysis can be employed to find "implicit" information:
correlations, patterns, hidden structures, etc.
Some of this analysis will require special-purpose programming,
but many tools and libraries are available to reduce this burden.
Let's consider some examples...
Business Intelligence (BI) techniques (e.g.,
data mining,
OLAP)
are commonly used in large enterprises
to mechanically extract information and even knowledge from data.
With the advent of
Open Source BI tools,
even small enterprises can now take advantage of these techniques.
Documentation generators can provide detailed documentation for software projects,
reducing development time, errors, and inefficiency.
Documentation frameworks can also help to organize and
increase the consistency of programmer-generated documentation.
Extremely powerful tools are now available
for mathematical analysis and display,
using techniques from
graph theory,
operations research,
statistics,
symbolic mathematics, etc.
Most enterprises have large bodies of existing documentation
and other "unstructured information".
Many tools exist to help with this material,
including document indexing and digital library suites, etc.
Presentation spans a wide range,
from specialized "one-shot" reports (possibly created interactively)
to integrated, comprehensive sets of detailed documentation
(e.g., for a software project's code base, requirements, tests, etc.).
Throughout this range, the user's "mental model" is a key consideration.
In the case of a custom report,
the user's background and interests are clear.
So, the report can be "tuned" to meet known requirements.
Detailed documentation, intended for arbitrary users,
has no such luxury.
It must be organized and presented so that
any user can easily assimilate it.
MBD addresses this problem by presenting the user
with a consistent (though possibly simplified) model
of the system in question.
As the user navigates through the documentation,
s/he will learn about the system's basic organization.
A typical MBD-based web site will have several kinds of web pages.
Some, such as tutorials and class descriptions,
will be manually edited.
Others, such as entity descriptions and indexes,
will be mechanically generated.
MBD bridges the gap
between traditional documentation and report generation techniques,
leveraging the strengths of each.
It works well for generating timely, integrated,
and detailed documentation for large systems.
At the same time,
it facilitates the rapid prototyping
of specialized documents and reports.
Like the term AJAX,
MBD describes an existing (though not commonplace) way
of using existing technologies.
It is my hope that,
by coining the term and pointing out the utility of this approach,
I can convince others to think about,
and perhaps experiment with this approach.
Extraction and analysis can fill a data warehouse,
but the data doesn't become useful
until it becomes part of a human's "mental model".
Modeling is fundamental to both intelligence and communication,
so thinking about the underlying model(s)
is an inseparable part of good documentation design.
MBD encourages the developer to keep these models in mind,
at all stages of the development process.
MBD (Model-based Documentation)
is an integrated approach to the design and development
of semi-automated document production systems.
Specifically, MBD uses a consistent "system model"
(at various levels of abstraction)
to provide conceptual clarity, ease navigation,
and guide the development process.
Although I have only used MBD in a software development context,
I believe it to be applicable to any substantial system
that depends extensively on computers.
This page introduces some key concepts in MBD,
discussing the role of models in intelligence, communication,
and documentation.
It then presents an abstract data flow model for MBD processing.
In the popular MVC (Model-View-Controller) architecture,
the model stores the data that underlies the application.
My use of the term is a bit broader,
including "mental models", etc.
Here are some applicable definitions:
A small object ...
that represents in detail another, often larger object.
A schematic description of a system ...
that accounts for its known or inferred properties
and may be used for further study of its characteristics ...
Either of these definitions could,
with a bit of stretching, describe documentation.
More to the point,
documentation which is based on these sorts of models
has a built-in conceptual and organizational framework.
This framework can assist both the documenter and the reader,
as discussed below.
More generally,
because modeling is fundamental to both intelligence and communication,
thinking about the underlying model(s)
is an inseparable part of good documentation design.
Let's explore this connection a bit further...
Jeff Hawkins
("On Intelligence")
contends that intelligence and memory
are byproducts of modeling activity in the
neocortex.
Specifically, our brains record and recognize "invariant patterns"
in received data,
then make and test predictions about other, related patterns.
After hearing a few notes from a known song,
we can recognize a pattern and predict the following notes.
If the prediction fails,
our attention will be drawn to the disparity
and our brains will try to resolve it.
Is this a stylized rendition, a different song, or what?
Using billions of interconnected neurons,
the neocortex can recognize and predict
many levels and variations of patterns.
The incoming flow of information interacts with the stored patterns,
causing them to be interlinked, modified, reinforced, etc.
Collectively, the patterns form a "model" of perceived reality.
We may think that we are interacting with "reality",
or at least what our senses report about it,
but we are actually dealing with this internal model.
Communication, similarly,
is based on the transmission and sharing of models.
The sender offers assertions, qualifications, connections, etc.
If the communication is successful,
these will be incorporated into the receiver's models.
Modeling also plays a significant role in message preparation.
A "mental model" of readers' typical backgrounds
(or really, of their mental models)
influences the sender's choice of material,
presentation style and order, etc.
In preparing these pages,
I thought about the key concepts I wanted to present
and the likely backgrounds of my readers.
Using familiar concepts (e.g., brain, memory) as a base,
I presented new concepts (e.g., invariant patterns).
Because "neocortex" might be an unfamiliar concept,
I defined it by context and added an HTML link.
As I got further into the material,
I was able to rely on concepts already presented.
Thus, my model of the readers' models
was predicated on their ongoing comprehension
of the material being presented.
This sort of model-based "second guessing"
is very common in human communication.
Different forms of communication
(e.g., articles, conversations, documentation,
formal and informal presentations)
employ it to different degrees, however.
In casual conversation and informal presentations,
the speaker may not pay much attention to organizing the presentation.
After all, the listener(s) can easily ask
about anything that isn't understood.
Thus, interactivity reduces the need for organization.
Formal presentations and most forms of written communication
do not allow for easy interaction,
so authors tend to spend significant effort on organization.
They write outlines, move material around,
and generally polish the text until the ideas "flow" smoothly.
Some trickery may be required,
because many topics don't "serialize" cleanly.
For example, the author may have to use a "placeholder" definition,
because a full definition would interrupt the flow
and/or rely on material not yet presented.
Documentation differs
from other forms of written communication
in several ways, including
complexity, scale, information level, and typical usage.
These differences affect the way that documentation is
(or at least should be :-) designed.
In summary, documentation creation presents unique challenges.
The writer must present vast amounts of information,
with very little knowledge of the readers' mental models.
About all that the writer can predict is that the reader
will not approach the material in a sequential fashion.
Because the writer can't predict the reader's background,
other means must be used to supply context.
In practice, this means that the author
must provide ways for the reader
to find (i.e., navigate to) any needed background information.
Navigation can be supported by various mechanisms:
indexes, links, search engines, etc.
Using mechanized techniques,
it's quite easy to generate web pages
full of links, pop-ups, clickable diagrams, etc.
The tricky part is to make the result
comprehensible and useful.
MBD addresses this problem, where possible,
by basing the site's organization and navigation
on an abstract model
of the documented system's entities and relationships.
Because the model is consistent,
readers soon learn how to navigate around the documents.
Overview text and diagrams can ease and expedite this process.
Typically, MBD uses a "suite" of cooperating programs,
mixing general-purpose utilities (e.g., text formatters)
with special-purpose programs (e.g., data filters).
Taking our own advice, let's model a typical MBD suite.
The following diagram shows (very abstractly!)
how the suite fits into the overall documentation data flow:
Our goal is to document a system,
using a combination of its own Data (e.g., databases, files)
and any related Info (e.g., documentation, institutional memory).
If an existing document is useful and presentable,
we'll simply add it to our collection of Documents.
Alternatively, information can be extracted
from Info and/or Data sources, analyzed, and presented in Documents.
The dotted line connecting the Info and Data boxes
indicates that they are closely related.
For example, a system's documentation should
describe the nature of its data.
This relationship isn't part of the MBD suite,
but it forms an important part of the suite's working environment.
An MBD Suite can be considered as having three processing phases:
Data extraction (i.e., input filtering) is usually straightforward.
Although some data sources use complex, undocumented formats,
most do not.
In addition, tools and libraries are often available
to access complex formats.
Analysis may range from trivial (e.g., tallying bug reports)
to extremely difficult
(e.g., extracting information from unstructured text).
Generating documents is seldom challenging,
once the content and layout have been determined.
However, selecting and learning how
to configure and use various tools
(e.g., Cocoon, dot, Rails, troff)
can require significant start-up effort.
A trivial MBD application might combine all three phases
in a single program.
For example, a Perl script might access Bugzilla's MySQL database,
tally selected bug reports,
and generate a web page.
As more types of data sources and generated documents come into play,
however, a single program will become unwieldy.
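A trivial application of this kind might look like the following
sketch. It is written in Python rather than Perl, and it uses an
in-memory SQLite database as a stand-in for a real bug tracker. The
"bugs" table, its columns, and the sample rows are assumptions made
for this example; they do not reflect Bugzilla's actual schema.

```python
import sqlite3
from html import escape


def extract(conn):
    """Extraction: pull (component, status) rows from the data source."""
    return conn.execute("SELECT component, status FROM bugs").fetchall()


def analyze(rows):
    """Analysis: tally open bug reports per component."""
    tally = {}
    for component, status in rows:
        if status == "open":
            tally[component] = tally.get(component, 0) + 1
    return tally


def present(tally):
    """Presentation: render the tally as a small HTML table."""
    lines = ["<table>", "<tr><th>Component</th><th>Open bugs</th></tr>"]
    for component, count in sorted(tally.items()):
        lines.append(
            f"<tr><td>{escape(component)}</td><td>{count}</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical sample data standing in for a real bug database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs (id INTEGER, component TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO bugs VALUES (?, ?, ?)",
    [(1, "parser", "open"), (2, "parser", "closed"),
     (3, "ui", "open"), (4, "parser", "open")],
)
tally = analyze(extract(conn))
page = present(tally)
```

Keeping the three phases in separate functions, even within one
script, makes it easier to split them into separate programs later.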
Real-world MBD applications may have dozens of data sources,
generate many kinds of reports, etc.
Consequently, they will have many instances
of Analysis, Extraction, and Presentation programs.
There may also be cases where multiple levels of analysis are needed.
As the diagram indicates,
the data traverses a
directed graph.
With a bit of care,
the graph can be constrained to be
acyclic (no program can modify data that affects its own input).
This may sound quite theoretical,
but it has some useful consequences.
In particular, it means that dependency-based tools such as
The following diagram shows the data flow
for a smallish MBD suite.
If data source S1 changes, programs E1, A1, and P1 will be run,
producing an updated version of document D1.
Programs A2, P2, and P3 will also be run,
updating all of the other documents.
Changing S3 or S4, in contrast,
would only cause D3 and D4 to be updated.
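The update behavior just described amounts to a walk over the graph.
The Python sketch below shows one way to compute it. The edge list is
hypothetical: it was chosen only to be consistent with the
description above, and the programs E2, E3, and E4 are assumed names
that the text does not mention.

```python
from collections import deque

# Hypothetical edges (producer -> consumers), consistent with the
# description in the text but not taken from the original diagram.
EDGES = {
    "S1": ["E1"], "S2": ["E2"], "S3": ["E3"], "S4": ["E4"],
    "E1": ["A1"], "E2": ["A2"], "E3": ["P3"], "E4": ["P3"],
    "A1": ["P1", "A2"], "A2": ["P2", "P3"],
    "P1": ["D1"], "P2": ["D2"], "P3": ["D3", "D4"],
}


def affected(changed):
    """Return every node downstream of a changed node (breadth-first)."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for successor in EDGES.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def stale_documents(changed):
    """Documents (names starting with 'D') that must be regenerated."""
    return sorted(n for n in affected(changed) if n.startswith("D"))
```

With these edges, stale_documents("S1") lists all four documents,
while stale_documents("S3") lists only D3 and D4. Because the graph
is acyclic, the walk always terminates.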
To be sure, this description leaves out many implementation details.
For example, it says nothing
about information storage and transmission.
Are we using files, a database, or what?
We also need to ensure that each program's input information
is complete and consistent while it is processing.
Given that the input could be several thousand files,
a bit of trickery may be needed.
Nonetheless, this approach is not science fiction.
I recently created a
web site that contains tens of thousands of web pages, image-mapped diagrams, etc.
Because the pages are heavily cross-linked,
the site contains hundreds of thousands of links and tooltips.
Nonetheless, the site is easy to navigate, understand, etc.
Modeling efforts can usefully range from informal sketches
to formal, notation-heavy specifications.
My preference is to keep things informal and flexible
until the need for more formality becomes evident.
This is in line with some of the ideas of the
Agile software development movement.
Diagrams are very useful for showing connectivity
(e.g., control and data flow, method usage).
If all you're trying to do is capture the basic structure,
use any icons (e.g., boxes, circles) that seem comfortable.
Picking a consistent notation becomes important, however,
if the reader is separated by time or distance.
Similarly, although paper and whiteboards work well
for small diagrams and informal meetings,
diagram generation tools (e.g.,
Dia,
OmniGraffle,
Visio) also have their place.
Aside from handling presentation details
(e.g., icon and arrow styles, fonts, layout),
these tools manage connectivity constraints, etc.
As the size and complexity of the model grows,
the need for even more structure and support will become evident:
Just as general-purpose drawing tools
cannot support architectural, electronic, or mechanical design,
diagram generation tools cannot analyze the diagrams they record.
Tools for generating and analyzing
conceptual schemas must be able
to represent and manage knowledge about the system under study.
Here is an elegant description of
knowledge representation,
taken from John F. Sowa's excellent
book on the topic:
Knowledge representation is a multidisciplinary subject
that applies theories and techniques from three other fields:
Without logic,
a knowledge representation is vague,
with no criteria for determining
whether statements are redundant or contradictory.
Without ontology,
the terms and symbols are ill-defined, confused, and confusing.
And without computable models,
the logic and ontology cannot be implemented in computer programs.
Knowledge representation is the application of logic and ontology
to the task of constructing computable models for some domain.
Most of the modeling methods and tools discussed below
are aimed at assisting with ontology development.
They keep track of the definitions
of classes, instances, relationships, etc.
If our goal were to create an
expert system,
both the ontology and the logic rules
would need to be extremely detailed and precise.
In MBD, however,
most of the reasoning will be done by humans.
The developer will look over the ontology
and decide what to present.
The user will look over the presented material
and decide which parts are currently of interest.
As long as the material is plausibly interesting,
the user is unlikely to complain.
So, these approaches and tools often strive
for more detail and precision than MBD requires.
Don't get caught up in the details;
your main objective is to produce a useful but simplified model!
Also note that knowledge representation is an active research area.
There are many approaches and theories,
a few emerging standards,
and little interoperability between existing tools.
Assorted communities (e.g.,
AI,
DBMS,
Semantic Web)
have developed methods and tools for knowledge representation.
Several of these approaches appear to be quite applicable to MBD,
but I have yet to find a
category killer.
Before we look at the available offerings,
let's consider the general characteristics we're looking for.
The most critical characteristic, from my perspective,
is the model's fundamental organization.
The components of a system can have arbitrary relationships;
the model must be able to encode these,
allow the user to traverse them, etc.
Because the relationships are arbitrary and (initially) unknown,
the approach must not restrict the modeler to, say,
a list-based or even hierarchical organization.
Consequently, most
outliners and
mind mapping tools
aren't suitable.
Because HTML links
can only be traversed in one direction,
most HTML-based approaches
(e.g., typical wikis) are also unsuitable.
(Pairs of links can be used for bi-directional relationships,
but this is tedious and error-prone, if done manually.)
In addition, the model should allow the user
to interact with assorted subsets and views of the system.
For these and other reasons,
I believe that the model must be based
on a fully-traversable (and very extensible)
graph-based organization.
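To make these requirements concrete, here is a minimal Python sketch
of such an organization. The Graph class and its method names are
hypothetical, as are the sample nodes. Each relationship is recorded
once but indexed in both directions, and a "view" selects a subset
of the relationships.

```python
from collections import defaultdict


class Graph:
    """Typed, directed relationships, traversable in both directions."""

    def __init__(self):
        self._out = defaultdict(set)   # node -> {(kind, target)}
        self._in = defaultdict(set)    # node -> {(kind, source)}

    def add(self, source, kind, target):
        # Recording the inverse entry here avoids maintaining
        # pairs of links by hand.
        self._out[source].add((kind, target))
        self._in[target].add((kind, source))

    def outgoing(self, node, kind=None):
        """Targets reachable from node, optionally filtered by kind."""
        links = self._out.get(node, ())
        return sorted(t for k, t in links if kind in (None, k))

    def incoming(self, node, kind=None):
        """Sources pointing at node, optionally filtered by kind."""
        links = self._in.get(node, ())
        return sorted(s for k, s in links if kind in (None, k))

    def view(self, kinds):
        """Return a new graph limited to the given relationship kinds."""
        subset = Graph()
        for source, links in self._out.items():
            for kind, target in links:
                if kind in kinds:
                    subset.add(source, kind, target)
        return subset


# Hypothetical sample relationships.
g = Graph()
g.add("main", "calls", "parse")
g.add("parse", "calls", "load")
g.add("parse", "reads", "config")
calls_only = g.view({"calls"})
```

Here g.incoming("parse") finds "main" without any manually created
reverse link, and calls_only omits the "reads" relationship entirely.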
A modeling tool must allow the user to interact with
(e.g., view, navigate, edit) the model.
Although most of the detailed information will be textual in nature,
text is a poor medium for presenting relationships.
So, most modeling tools use some sort of diagramming format.
The design of this format is both critical and challenging.
If the format is too simple,
it won't be able to convey the needed information.
If it is too complex, the user will become confused and frustrated.
Ideally, the tool should allow the user to use simplified notation,
adding details as desired.
Modeling tools should have reliable and convenient ways
to exchange information with other tools.
Unfortunately, this is seldom going to be the case for existing tools.
There are many formats for encoding conceptual models,
varying at syntactic, structural, and semantic levels.
Efforts are being made to provide paths between these formats.
For example, the
International Organization for Standardization (ISO)
has a working group which recently proposed a standard:
Common Logic (CL) is an information exchange
and transmission language, based on
first-order logic.
The CL definition allows a variety of different syntactic forms,
called "dialects".
A dialect may use any desired syntax or structure,
but it must be equivalent to the abstract syntax of Common Logic
(and thus, to any other CL dialect), in terms of its semantics.
The World Wide Web Consortium (W3C)
is working on a related, though less formal standard:
SKOS Core is a model and an RDF vocabulary
for expressing the basic structure and content
of concept schemes such as thesauri, classification schemes,
subject heading lists, taxonomies, 'folksonomies',
other types of controlled vocabulary,
and also concept schemes embedded in glossaries and terminologies.
The SKOS Core Vocabulary is an application of the
Resource Description Framework (RDF),
that can be used to express a concept scheme as an RDF graph.
Using RDF allows data to be linked to and/or merged with other data,
enabling data sources to be distributed across the web,
but still be meaningfully composed and integrated.
Although these efforts are proceeding well,
no adopted standards appear to be imminent.
In the meantime,
although one can hope for a documented interchange format,
about all that one can reasonably require
is a readable (e.g., XML) file format.
Now, let's look at some of the available offerings...
For informal modeling, I would suggest the use of
concept maps.
They aren't overloaded with notation,
but they have enough structure
to capture the basic entities and relationships of a system.
Unlike mind maps,
concept maps aren't restricted to hierarchies.
Dr. John Sowa has written a nice overview of
Concept Mapping,
covering concept maps, conceptual graphs, topic maps, etc.
The classic reference on concept maps is
Learning How to Learn.
The database community uses assorted variations on
entity-relationship diagrams (ERDs).
Unfortunately, these can cause the modeler
to focus on low-level (e.g., database-related) issues,
rather than high-level concepts.
In addition, ERDs can run into difficulties
when a relationship needs to be treated as an entity.
If we say "Romeo loves Juliet",
how do we discuss the different meanings of "loves"?
Modeling Methodologies is a good introduction to some of these issues.
Object-Role Modeling (ORM2)
appears to handle these issues nicely,
at some cost in notational complexity.
I don't know of any Open Source ORM2 tools
(though one is promised for 2006),
but some gratis and inexpensive tools have emerged.
Some versions of
Microsoft's
Visio handle various aspects of ORM2 diagrams.
For more information on ORM2, visit
www.orm.net.
Unified Modeling Language (UML)
Class Diagrams are also complex,
but they may appeal to programmers
who are already familiar with this notation.
UML is very well documented
and many supporting tools are available for it.
The Expert Systems community has been working with problems of
Knowledge Engineering for several decades.
Not surprisingly, they have some useful tools to offer.
I'm particularly interested in
Protégé.
As described in
An AI tool for the real world,
Protégé is an Open Source, well-supported,
standards-friendly tool for creating models, ontologies, etc.
Conceptual Graphs (CGs)
are structurally similar to ORM2 diagrams,
but they are based on a form of
predicate calculus known as
first-order logic (FOL).
So, they are a good match for expert system technology.
It's quite likely that Protégé could be augmented
to support CGs, ORM2, or other diagramming notations.
This might ease the recognition and specification
of complex sets of relationships.
The Semantic Web community
is developing standards (e.g.,
Resource Description Framework,
Topic Maps)
for describing concepts, encoding document metadata, etc.
The standards are still "works in progress",
but they are worth watching because
of their large and active developer communities.
Resource Description Framework (RDF)
is based on sets of three-part declarations (i.e., "triples"):
subject, predicate, object.
The apparent simplicity of this approach is balanced
by the need to create large numbers of triples
when complex concepts need to be expressed.
Topic Maps use a much richer vocabulary,
including terms such as association, name, occurrence,
scope, topic, etc.
This allows relatively small expressions
to express complex concepts, conditional assertions, etc.
A gratis tool
(Ontopia Omnigator) is available;
Open Source tools are under development.
Although the use of models is central to MBD,
modeling is a tool, rather than a goal.
If you develop a crystal-clear model,
but generate no documentation,
you haven't really accomplished your objective.
So, curb your desire to generate the "perfect" model.
Instead, try to generate a useful and flexible model,
improving it as you proceed.
As you work with the model,
you'll find areas that could use clarification, expansion, etc.
Your modeling approach should make it easy to make these changes.
Also, avoid the temptation to "start at the bottom",
detailing every data item, field, etc.
This sort of information can be researched as it is needed,
but it doesn't serve the general purposes of the model:
finding useful information, understanding the system, etc.
Semantic Wikis show promise for
creating models and
presenting information.
Ideally, they would combine a wiki's convenience and freedom
with the strengths of semantically-aware
(e.g., ontology-based) systems.
This could allow a graceful merging
of human-edited and mechanically generated content.
Wikis provide a very convenient means
for generating informal documentation.
They are easy to edit,
support both individual and collaborative efforts,
and scale extremely well.
Wiki links can be created by the simple act of typing in a
CamelCase word.
If a page doesn't already exist,
the act of clicking on its link will create one.
Simplified
markup languages are also available,
easing the process of page creation.
Most wikis have
rollback mechanisms,
allowing contributions or changes to be removed,
merged with previous content, etc.
This allows public wikis such as
Wikipedia to reach and retain a high level of quality,
despite occasional misguided postings.
The web's basic architecture reduces the apparent complexity
of web site (and thus wiki) generation.
Although collections of pages and links form a
graph-based data structure,
few users think about this fact.
Looking at any given page, the user sees only content and links;
the global structure can be (and usually is) ignored.
As sites such as Wikipedia demonstrate,
wikis can take full advantage of these simplifications.
Their users navigate through enormous graphs,
seeing only pages that appear to be of interest.
HTML links (e.g.,
Because a link only goes to a given page,
the entire graph must be traversed in order to find
backlinks (links that come from other pages).
For search engines such as
Google,
this can be a massive problem,
because the "graph" in question is the entire web.
Few wikis bother to track backlinks,
even though the problem is much more tractable for them.
Even fewer can display clickable context diagrams,
showing a page's "local neighborhood".
Pimki
(an experimental "Personal Information Management" wiki)
does both, but it is a conspicuous exception.
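For a single wiki, a backlink index can be built in one pass over the
pages. The Python sketch below shows the idea, using only the
standard library's HTML parser. The three-page "wiki" is hypothetical,
and page names are assumed to double as link targets.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)


def build_backlinks(pages):
    """Map each link target to the sorted pages that link to it."""
    backlinks = defaultdict(set)
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for target in collector.targets:
            backlinks[target].add(name)
    return {target: sorted(srcs) for target, srcs in backlinks.items()}


# Hypothetical three-page wiki.
PAGES = {
    "Home": '<p><a href="Models">models</a> <a href="Tools">tools</a></p>',
    "Models": '<p><a href="Home">home</a> <a href="Tools">tools</a></p>',
    "Tools": "<p>No outgoing links here.</p>",
}
backlinks = build_backlinks(PAGES)
```

A page's "local neighborhood" is then its outgoing links plus its
entry in the backlink index.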
Even Pimki, however, is constrained by HTML links' other limitations.
Although a link can have many attributes,
most only contain the target
URL
and some text content to be highlighted and displayed.
Nothing, in any case, indicates which links are of what "type".
Without typed links
(e.g.,
Finally, HTML links are "binary",
in the sense that they only connect two pages.
This isn't a catastrophic problem:
Resource Description Framework (RDF)
is also based on binary links.
However, many users may find it easier to say
"John is taking the plane to Chicago"
than to specify the equivalent set of binary relationships.
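One common way to express such a statement with binary links is to
introduce a node for the event itself and attach each participant
to it. The Python sketch below illustrates this. The event name and
the role names ("agent", "vehicle", "destination") are hypothetical;
they are not drawn from any RDF vocabulary.

```python
def to_triples(event, predicate, roles):
    """Expand one n-ary statement into binary triples.

    Each triple is (subject, predicate, object); all of them hang
    off the caller-supplied event node.
    """
    triples = [(event, "type", predicate)]
    for role, value in roles.items():
        triples.append((event, role, value))
    return triples


# "John is taking the plane to Chicago"
triples = to_triples(
    "event1",
    "taking",
    {"agent": "John", "vehicle": "plane", "destination": "Chicago"},
)
```

A single sentence becomes four triples here, and each additional
participant or qualifier adds another.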
Semantic wikis
address the "type" problem head-on,
allowing pages and/or links to have specified types.
Thus, we might say that the
Similarly, if we say that a
The final limitation (binary relationships) is addressed
by some semantic wikis, but not by others.
In fact, some would contend that this limitation is beneficial,
in that it simplifies the specification of relationships.
However, the complexity re-asserts itself
by greatly expanding the number of relationships required.
Here are more than a dozen examples of semantic wikis,
including (where available) brief summaries.
For an up-to-date list,
and other useful information,
be sure to visit the
Semantic Wiki State Of The Art page on the
Semantic Wiki Interest Group web site.
Note:
The "OWW" link following some entries is not an exclamation,
merely a link to the corresponding page in the OntoWorld Wiki.
COW provides explicit capabilities
for defining concepts and instances,
making queries, etc.
The COW and mWiki source code is available for download,
with no apparent copyright or license notices.
(OWW)
A paper ("Semantic Wikipedia") was submitted about this project.
(OWW)
Because SemperWiki runs on the user's machine,
it can take advantage of local information
(e.g., looking up or making annotations on local files).
In addition, the use of GTK+ offers a more powerful GUI
than a web browser can provide.
Nearly all of the wikis listed above are based on the RDF notion
of (subject, predicate, object) expressions, known as "triples".
In each link, the current and target pages are used, respectively,
as the subject and object of the triple.
The predicate is then added by means of a "type" indicator, such as:
This is convenient and flexible
(and thus in keeping with the spirit of wikis),
but it has both structural and human interface problems.
Ontiki,
my (speculative) design for an "ontology-aware wiki",
attempts to address these issues.
There is no consensus as to what semantic wikis should do,
let alone how they should do it.
A "category killer" may emerge in a few years,
but even this is not certain: none has for basic wikis.
Nonetheless, semantic wikis are fun to play with
and can be used to solve problems that other tools do not.
So, enjoy...
The concepts page presents a data flow "model" for an
MBD (Model-based Documentation) suite.
In this model,
the extraction phase is responsible
for accessing data sources,
selecting and organizing the desired data, etc.
Its output should be encoded
in a convenient and reliable representation for follow-on
analysis or
presentation.
This page looks at various kinds of data sources,
including system data and hand-edited files,
offering hints about how to extract data from them.
The question of how this data should be stored
is a subject for another page (available RSN :-),
but some general comments may be useful at this point.
MBD's extraction phase corresponds roughly to the first part of
data warehousing.
The "live" data gets collected, filtered, and saved in a manner
that eases follow-on analysis (e.g.,
data mining,
OLAP) and presentation.
The storage representation should handle structural issues,
retaining "interesting" structures from the input data
and allowing the addition of structures "discovered"
in the extraction or analysis phase.
The actual data will normally be stored
in (collections of) files and/or a relational database system.
In general, data access in the warehouse
should be far faster, easier, and more consistent
than it was in the "live" system.
If it isn't, you're doing something wrong (:-).
Computer-based systems maintain large amounts of data.
MBD systems can "harvest" this data,
extracting information on specific entities: facts, relationships, etc.
The resulting information can be used to generate detailed reports,
summary plots and diagrams, etc.
The first step, however, is data extraction.
The extraction code must access the incoming data,
parse it, reject noise, correct errors,
and pass the result along in a "convenient" format.
The specific tasks are defined by
(a) the format and content of the incoming data and
(b) the current and expected data needs.
The incoming data may be simple or complex in structure
and may arrive in any of a variety of manners
(e.g., files, APIs) and formats (e.g., binary, text).
The following sections cover some common cases.
Line-oriented "flat file" formats (e.g.,
CSV files,
Unix control files)
are usually easy to parse and understand.
Each line (including any "continuation" lines) is a record.
Individual fields can usually be extracted by
regular expressions;
special handling may be required for particularly complex formats.
If you have a lot of flat files to parse,
consider writing a parameterized input filter.
Declarative control files are far easier to maintain
than hand-crafted (and nearly identical) "input filters".
If your control files are cleanly formatted and well commented,
they can serve as documentation for the input files.
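A parameterized input filter might look something like the sketch below.
The FORMATS table stands in for a declarative control file;
its layout, the format names, and the parse_flat function
are illustrative assumptions, and continuation lines are not handled.

```python
import re

# Hypothetical declarative "control" entries, one per flat-file format.
# Each names a comment character and a regular expression whose named
# groups become the fields of a record.
FORMATS = {
    "hosts": {
        "comment": "#",
        "pattern": r"^(?P<address>\S+)\s+(?P<name>\S+)",
    },
    "passwd": {
        "comment": "#",
        "pattern": r"^(?P<user>[^:]+):[^:]*:(?P<uid>\d+):(?P<gid>\d+):",
    },
}


def parse_flat(lines, fmt):
    """Parse lines of the named format into a list of field dictionaries."""
    spec = FORMATS[fmt]
    regex = re.compile(spec["pattern"])
    records = []
    for line in lines:
        # Drop comments and surrounding white space; skip blank lines.
        line = line.split(spec["comment"], 1)[0].strip()
        if not line:
            continue
        match = regex.match(line)
        if match:  # Lines that do not fit the pattern are skipped.
            records.append(match.groupdict())
    return records


hosts = parse_flat(
    ["127.0.0.1 localhost  # loopback", "", "# servers", "10.0.0.5  buildhost"],
    "hosts",
)
users = parse_flat(["alice:x:501:20:Alice Example:/home/alice:/bin/sh"], "passwd")
print(hosts)
print(users)
```

Supporting a new file format then means adding a table entry,
not writing another filter.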
If a data source is used by more than one program,
it is likely to be available
via a well-documented data exchange mechanism.
For example, it might be a text file,
encoded in a documented dialect
of a standard data serialization format (e.g.,
XML,
YAML).
Alternatively, it might be offered via an
API (Application Programming Interface)
such as SQL or
Apple's
Core Data framework.
The exchange mechanism can be expected to handle the "syntax" level
(e.g., dividing the data into fields).
It may also help with structural or even semantic issues,
but there is no guarantee of this.
The good news is that you can often extract the needed data
without understanding the entire structure and semantics
of an arbitrarily complex data source.
Generally, use of an exchange mechanism
follows a well-trodden path:
study the documentation, decide what data you want,
use a query to extract it.
That said, there's nothing wrong
with speculative data collection:
if capturing some extra data is easy,
storing it can be a very worthwhile gamble.
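As a small illustration of the "decide what you want, then query" path,
the sketch below uses Python's built-in sqlite3 module.
The bug table, its columns, and its contents are invented for the example;
a real data source will have its own (documented) schema.

```python
import sqlite3

# Build a throwaway in-memory database standing in for the "live" system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bug ("
    "id INTEGER PRIMARY KEY, component TEXT, status TEXT, note TEXT)"
)
conn.executemany(
    "INSERT INTO bug (component, status, note) VALUES (?, ?, ?)",
    [
        ("parser", "open", "crash on empty input"),
        ("report", "closed", "typo in heading"),
        ("parser", "open", "slow on large files"),
    ],
)


def extract_open_bugs(connection):
    """Pull only the fields we need, ignoring the rest of the schema."""
    cursor = connection.execute(
        "SELECT id, component FROM bug WHERE status = ? ORDER BY id",
        ("open",),
    )
    return [{"id": row[0], "component": row[1]} for row in cursor]


open_bugs = extract_open_bugs(conn)
print(open_bugs)
```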
It's also useful to look at the way the data source
is used by the system's own code.
A program's SQL statements and surrounding code,
for example, can provide quite a bit of information
on the purpose and interaction of database fields and tables.
If you get lost, look for a support infrastructure
(e.g., books, forums, mailing lists, web sites).
Asking authors and "resident experts",
you may be able to get answers to specific questions
(e.g., "What is this field used for?").
If you develop any documentation,
be sure to share it with the appropriate community!
Wikipedia defines
"screen scraping" as:
Because the "display" was not intended for use by a program,
it may not provide simple (or even reliable) indications
of its structure.
After all, humans are much better than programs
at discerning structure from context, formatting, etc.
Fortunately, machine-generated documents tend
to have reasonably predictable structures.
Once you can handle the base layout and the common variations,
you're mostly done.
By coding for robustness (e.g., "print a diagnostic and continue"),
you can deal with new variations as they appear.
Of course, if the generating program gets changed
in a way that modifies the document structure,
your extraction program will "break".
So, it's a good idea to push for machine-friendly (e.g., XML) forms
of any documents you rely upon.
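The "print a diagnostic and continue" idea can be sketched as follows.
The report layout (date, level, message) and the scrape function
are assumptions made up for the example.

```python
import re
import sys

# Assumed base layout of a machine-generated report line.
LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$"
)


def scrape(lines, diagnostics=sys.stderr):
    """Extract records, reporting (but surviving) unrecognized lines."""
    records = []
    for number, line in enumerate(lines, start=1):
        match = LINE.match(line.rstrip("\n"))
        if match is None:
            # A new variation: complain, then keep going.
            print(f"line {number}: unrecognized layout: {line!r}",
                  file=diagnostics)
            continue
        records.append(match.groupdict())
    return records


report = [
    "2006-03-01 ERROR disk full",
    "*** page 2 ***",
    "2006-03-02 INFO backup done",
]
records = scrape(report)
print(records)
```

The diagnostics double as a to-do list of variations
that the extraction code does not yet understand.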
These problems aside,
screen scraping may involve assorted low-level formats:
"plain text" (e.g., log files,
nroff output),
HTML, PDF, PostScript, etc.
Access methods are often available for binary formats,
but you'll have to handle most text-based formats yourself.
Here are some hints...
If you only need a few items from a web page,
you may be able to extract them using special-purpose code.
For example,
if a particular snippet of HTML always appears in a line by itself,
or in a particular place in a table,
you may be able to recognize it and handle it as a special case.
Be aware that this "hack" may be brittle
in the face of even minor formatting variations.
If you need to extract a lot of data,
consider transforming the page into
XHTML,
a formalized dialect of
HTML.
Because XHTML files comply with XML syntax,
you can load them with your favorite XML-handling tools.
HTML Tidy will perform this conversion,
as well as cleaning up ugly (e.g., Microsoft Word) HTML.
Although the resulting document will be valid XML,
you'll still have some "detective work" to do.
Unlike XML that was generated for information exchange,
converted HTML is likely to have structural irregularities
(e.g., interleaved tags and text).
Worse, you won't have a schema to guide you
in figuring out the variations.
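Once a page is in XHTML form, extraction might proceed along these lines.
This sketch uses Python's standard xml.etree.ElementTree module;
the sample page and the table_rows function are hypothetical,
and real converted pages will be messier than this one.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; "x" is just a local prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Server status</title></head>
  <body>
    <table>
      <tr><th>Host</th><th>Load</th></tr>
      <tr><td><b>alpha</b></td><td>0.42</td></tr>
      <tr><td>beta</td><td>1.07</td></tr>
    </table>
  </body>
</html>"""


def table_rows(xhtml_text):
    """Return the text of each data row as a list of cell strings."""
    root = ET.fromstring(xhtml_text)
    rows = []
    for tr in root.iterfind(".//x:tr", XHTML_NS):
        # itertext() gathers text even when tags and text are interleaved.
        cells = ["".join(td.itertext()).strip()
                 for td in tr.findall("x:td", XHTML_NS)]
        if cells:  # Header rows contain only <th> cells; skip them.
            rows.append(cells)
    return rows


rows = table_rows(page)
print(rows)
```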
PostScript is a text-based "page description language".
Although it is actually an
imperative,
Turing-complete programming language,
this is seldom a real problem for parsing, etc.
RPN syntax aside,
most PostScript commands are used as formatting declarations.
So, it is generally possible
to examine a representative PostScript document,
determine what "idioms" are being used,
and write a specialized script to extract desired information.
An alternative approach uses the fact that PostScript files are,
in fact, programs.
By editing the PostScript code (and/or overriding selected operators),
it is possible to make a document log information about itself
(e.g., to standard output or a designated file).
If the idea of hacking PostScript doesn't appeal to you,
however, read on.
PDF (Portable Document Format)
is a binary,
declarative translation of PostScript.
Although a binary format may seem daunting,
there are libraries and other tools which can help in parsing PDF.
There are also reliable tools for PostScript translation.
So, you may want to convert all your incoming PostScript documents to PDF,
then parse them all with the same tool(s).
Here are some useful tools for dealing
with PDF and PostScript documents.
Although there is some overlap in their capabilities,
all three are well worth having.
Ghostscript is a powerful and flexible set of tools
for processing PDF and PostScript files.
It can be used to render documents, translate between formats, etc.
pdftk (PDF ToolKit) is a command-line tool for manipulating PDF files.
It performs a number of specialized functions (e.g.,
applying watermarks,
encrypting and decrypting documents,
merging and splitting documents,
updating PDF metadata).
Although Xpdf is billed as a "PDF viewer" for the
X Window System,
it is far more than this.
The suite includes tools to extract images and text,
translate PDF to PostScript, etc.
Parts of Xpdf are used by other utilities,
such as search engines
(e.g., Swish-e)
and PDF viewers.
For more tools (and tricks), see
"PDF Hacks: 100 Industrial-Strength Tips & Tools"
(Sid Stewart; O'Reilly, 2004).
It's an easy read and will give you some ideas
about unconventional ways to use PDF.
If you get serious, you'll also want to get a copy of
Adobe's
"PDF Reference" (sixth edition).
It contains well over 1000 pages of definitive and detailed information.
There are dozens of books on PostScript,
ranging from introductions to reference manuals.
Again, the Adobe books are definitive,
but you may want to look at some others as a way to get started,
explore undocumented areas, etc.
The data you need may be stored in some arcane format.
Examples include binary libraries, spreadsheets,
word processing documents, and source code files.
Rather than researching
(or worse, reverse-engineering) the format,
look around for a library or command-line tool
that already knows how to parse it.
For example, you might want to extract linkage information
from binary library files.
The
Perl's
CPAN (Comprehensive Perl Archive Network)
has modules that import spreadsheets and many other oddball files.
Other scripting languages (e.g., Python, Ruby)
have similar online collections.
Finally,
you may be able to find a tool that reads the file(s)
and exports the data into XML or some other parsable format.
Doxygen, for example,
will peruse source code files in several languages,
dumping its accumulated knowledge as XML.
There are times when you'll need
to augment your machine-harvested data
with hand-edited information.
The supplementary data may have been collected from interviews,
copied and pasted from a PDF document,
or obtained in some other manner.
Regardless, the current objective
is to encode it for convenient use
by your documentation scripts.
The ideal file format for this purpose would be
flexible and powerful,
supported by an active user community,
easy to read and edit,
and a good "fit" for scripting languages
such as Perl, Python, and Ruby.
Flat files (e.g., CSV) fail the first test:
a two-dimensional array is neither flexible nor powerful.
Any format that you might cobble up on your own
fails the second test (user base).
XML fails the last two tests;
nobody but a masochist likes to edit XML
or traverse its data structures.
Fortunately,
YAML (YAML Ain't Markup Language)
meets all of these criteria quite handily:
Finally, I'll let you judge YAML's flexibility and readability
for yourself:
Assuming that this text was loaded
into a Perl data structure referenced by
The ability to create and edit data structures in one window,
then test them out in another, is incredibly seductive.
I have edited tens of thousands of lines of YAML,
using it to store a wide variety of data.
By post-processing the encoded text strings,
I have even created (declarative)
"little languages"
of various sorts,
using a YAML loader to handle the first-order parsing.
The FSW case study page
and this tutorial contain more information on YAML.
The concepts page
presents a data flow "model"
for an MBD (Model-based Documentation) suite.
In this model, the analysis phase is responsible
for combining or tallying data, discerning patterns, etc.
This page considers the relationship between modeling and analysis,
looks at some representative forms of analysis,
and makes some broad generalizations about data management.
The administrators and designers of a complex system
generally understand its overall nature.
They may be unclear on specific details,
but they understand the major entities and relationships,
along with some "minor" ones which have attracted their attention,
for whatever reason.
In short, they have a working "mental model" of the system.
By interviewing these "local experts",
studying the existing documentation,
and generally poking around,
an MBD developer can "discover" the types
of entities and relationships in the system.
Using this information,
s/he can create an informal model
that will support and guide the creation of MBD software.
Analysis functions, in particular, rely on this sort of model.
They look for, characterize, and/or summarize the specific
values and relationships concerning instances of known entities.
Typically, they produce explicit formulations
of information which was only implicitly available
in the input data.
In a software development project,
the entities might include bug reports, data structures, developers,
functions, libraries, managers, programs, projects, releases,
test suites, etc.
By analyzing information on these instances,
we can "discover" and present useful information
about individual instances and/or the system as a whole.
For example, we could count the bugs found
during each stage of each release cycle.
This might tell us useful things
about our design and testing procedures.
We could also look for "hot spots" in the code
(e.g., areas that seem to have lots of bugs),
as a way to guide
refactoring efforts.
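Both tallies are simple record-keeping.
Here is a minimal sketch, assuming that each bug report
has already been extracted into a dictionary;
the field names and the sample reports are invented.

```python
from collections import Counter

# Hypothetical extracted bug reports.
bugs = [
    {"release": "1.0", "stage": "design", "file": "parse.c"},
    {"release": "1.0", "stage": "testing", "file": "parse.c"},
    {"release": "1.0", "stage": "testing", "file": "report.c"},
    {"release": "1.1", "stage": "testing", "file": "parse.c"},
]

# Bugs found during each stage of each release cycle.
by_stage = Counter((bug["release"], bug["stage"]) for bug in bugs)

# Candidate "hot spots": the files with the most bug reports.
by_file = Counter(bug["file"] for bug in bugs)
hot_spots = by_file.most_common(1)

print(by_stage)
print(hot_spots)
```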
We could also track the usage
of each data structure, function, library, etc.
This would let us create links to related items
and even draw nifty-looking (and useful)
diagrams showing critical relationships.
Anyone who has had to "drop into" a large body of code
can appreciate the value of this sort of contextual information.
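Usage information of this kind maps naturally onto a directed graph.
The sketch below writes the graph out in the DOT language,
which Graphviz's dot tool can render as a diagram.
The usage table and the to_dot function are assumptions for illustration.

```python
def to_dot(usage, name="usage"):
    """Render a user -> used-items mapping as DOT source text."""
    lines = [f"digraph {name} {{"]
    for user in sorted(usage):
        for used in sorted(usage[user]):
            lines.append(f'  "{user}" -> "{used}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical usage data: each function and the functions it calls.
usage = {"main": ["parse", "report"], "parse": ["lex"]}

dot_text = to_dot(usage)
print(dot_text)
```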
As Gerald Weinberg points out in
An Introduction to General Systems Thinking,
different approaches are needed for different sorts of problems.
Statistics deals well with "unorganized complexity",
but requires both randomness and large population sizes.
Pure analysis, on the other hand,
handles "organized simplicity",
but is limited by complexity and randomness.
Thus, there is a substantial area of "organized complexity"
where neither tool can be used:
As discussed below,
data mining tools such as
Weka
apply both statistical and analytical techniques
(e.g., analyzing statistical results).
Similarly, this document uses the term "analysis"
to include statistics, etc.
Analysis functions can range from trivial to extremely complex.
The ones which involve only simple statistics and record-keeping
lie near the "trivial" end.
Data mining and document analysis functions,
which can involve much more intricate calculations,
go well into the "complex" end.
In fact, some forms of analysis
can be described as "interesting research projects" (:-).
Fortunately, two factors operate to make analysis less challenging.
First, most analysis falls on the trivial end:
the details may be complex, but the principles involved are simple.
Second, tools are available to perform many kinds of analysis,
leaving the MBD implementor with the relatively simple tasks
of tool selection, installation, and use.
Although there is no shortage
of expensive, proprietary analysis products,
many useful tools are also available as
Open Source.
Their capabilities range from text analysis and indexing
through numerical and statistical methods
to volumetric data presentation.
Typically, they have books and/or extensive documentation
which detail the underlying principles and use.
Here is a sampling, to whet your appetite...
BI (Business Intelligence) techniques (e.g.,
data mining,
OLAP)
are commonly used in large enterprises
to mechanically extract information and even knowledge from data.
Open Source tools now allow even small enterprises
to take advantage of these techniques.
BIRT (Business Intelligence and Reporting Tools) is an
Eclipse-based
"reporting system".
It can generate (combinations of)
charts, cross-tabulations, lists, and textual documents.
Bizgres is billed as
"PostgreSQL for Business Intelligence and Data Warehousing".
The Bizgres distribution bundles an enhanced and optimized version of
PostgreSQL with a "stack" of business intelligence tools:
KETL (an extract, transform, and load system),
Mondrian,
JPivot,
JasperReports, and
OpenI.
A unified installation and configuration system
makes all of this (relatively) easy to put into place.
OLAP (online analytical processing)
suites are very popular for business-related analysis.
As defined by Edgar F. Codd,
OLAP is "the dynamic synthesis, analysis, and consolidation
of large volumes of multidimensional data".
This article gives a thorough, yet readable summary of this idea.
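The "consolidation" part of this definition can be illustrated
with a toy roll-up over a handful of fact records.
This is only a sketch of the idea, not of how any OLAP product works;
the dimension names, the sales figures, and roll_up are invented.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions and one measure.
facts = [
    {"region": "west", "quarter": "Q1", "product": "widget", "sales": 100},
    {"region": "west", "quarter": "Q2", "product": "widget", "sales": 150},
    {"region": "east", "quarter": "Q1", "product": "gadget", "sales": 80},
    {"region": "east", "quarter": "Q1", "product": "widget", "sales": 120},
]


def roll_up(rows, dimensions, measure):
    """Sum a measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dimension] for dimension in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_region = roll_up(facts, ["region"], "sales")
by_region_quarter = roll_up(facts, ["region", "quarter"], "sales")
print(by_region)
print(by_region_quarter)
```

Choosing a different list of dimensions gives a different "view"
of the same data.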
Mondrian is an OLAP database that is written in
Java.
It implements the
MDX (Multidimensional Expressions) language and the
XML for Analysis and
JOLAP specifications.
A related project,
JPivot, uses
JSP (JavaServer Pages)
to provide an interactive (spreadsheet-like) front end to Mondrian.
The Pentaho BI Platform is an integrated distribution of business intelligence tools.
It incorporates technology from
BIRT,
Eclipse,
Enhydra,
JBoss,
JOSSO,
JPivot,
Mondrian,
Rhino, and
Weka.
It can use any
RDBMS that supports the
JDBC (Java Database Connectivity)
API.
The platform's principal claim to fame
is that all of the tools run under the control
of a process-driven "workflow engine"
which provides auditing, logging, security, etc.
It can be embedded into server-based applications
or desktop applications.
The R language
and other numerical analysis tools
may be useful for certain types of OLAP work.
The Weka data mining suite
can be used for statistical work.
These offerings are described in more detail below.
Extremely powerful tools are now available
for mathematical analysis and display,
using techniques from
graph theory,
operations research,
statistics,
symbolic mathematics, etc.
Several powerful suites are available to help with these tasks.
COIN-OR hosts dozens of Open Source projects in operations research,
including both linear and nonlinear deterministic optimization,
metaheuristics, stochastic optimization, and more.
Octave,
a program for performing numerical (e.g., scientific) computations,
is mostly compatible with
MATLAB.
PDL (Perl Data Language)
is a vectorized array programming language.
It allows Perl scripts
to perform rapid vector calculations
using simple (but highly expressive) expressions.
It also has support for interactive use, graphics and plotting, etc.
R is a mathematical language and environment
for statistical analysis and display,
modeled after Bell Labs'
S programming language.
Although R is mostly used by statisticians,
it can also be used as a general matrix calculation toolbox
in a program such as
Octave.
Scilab is a program for performing and displaying the results
of numerical (e.g., scientific) computations.
It has a similar (though incompatible) syntax to
MATLAB.
Weka is a Java-based collection of machine learning algorithms for
data mining tasks.
The algorithms can either be applied directly to a dataset
(e.g., via command-line or interactive interfaces)
or called from Java code.
Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization.
Data Mining: Practical Machine Learning Tools and Techniques is a very approachable introduction to both Weka
and its (table-based, statistical) approach to data mining.
Although images, text, and video have lots of internal structure,
it isn't the sort of thing that fits well into arrays or even schemas.
Consequently, they are often referred to as "unstructured data".
Here, in any case, are some tools for handling them.
mg is a system for creating and managing digital libraries.
The approach is described
by an extremely interesting and readable book
(Managing Gigabytes: Compressing and Indexing Documents and Images).
OsiriX is a
Macintosh-based system
for analyzing and presenting volumetric medical data
(e.g., sets of MRI scans or confocal microscope images).
It will accept a variety of "generic" input formats, however,
including AVI, JPEG, MPEG, PDF, QuickTime, and TIFF.
Thus, it could be used for geology, mechanical engineering,
or even completely artificial sets of volumetric "data".
Although Swish-e is billed as a
"system for indexing collections of Web pages or other files",
it is really an extensible toolkit for building indexing systems.
It can handle "plain text, e-mail, PDF, HTML, XML,
Microsoft Word/PowerPoint/Excel,
and just about any file that can be converted to XML or HTML text".
If it doesn't already do what you want,
just add your own analysis functions!
UIMA (Unstructured Information Management Architecture) is a Java
SDK (Software Development Kit)
that supports the implementation, composition, and deployment
of applications working with unstructured information
(e.g., audio, images, video).
UIMA is based on the composition of stackable "analysis engines"
that extract and record document information.
Documentation generators, such as
Doxygen and
Synopsis,
analyze software-related entities
such as data structures, functions, and modules.
Most documentation generators perform a variant of
screen scraping,
parsing the source code (and specially-formatted comments).
Consequently, they tend to be specific
to particular (sets of) programming languages.
OpenCyc is a subset of
Cycorp's
"Cyc" knowledge engineering system.
Both Cyc and OpenCyc are based on the CycL language,
which is based on predicate calculus.
When answering questions,
the "inference engine" traverses a "knowledge base" of "assertions".
Unfortunately, although the knowledge base is
libre (free as in speech),
the inference engine is only
gratis (free as in beer).
TAP is a knowledge engineering project
that hopes to solve some challenging problems related to the
Semantic Web.
I'm particularly intrigued by their notion of "semantic negotiation",
which may allow machines to automagically match up their terms.
The project has released "TAPache"
(an Apache module for data publication)
and "TAP KB" (a knowledge base).
The Extraction page glossed over the issue
of how MBD data should be managed.
Indeed, this page has said very little
about data management.
Unfortunately, proper coverage of MBD data management
would take another whole set of pages.
Still, there are a few things that can and should be said.
The first is that some OLAP tools (e.g., Bizgres)
rely on a particular RDBMS.
If you already have your data in some other database,
this could be an issue.
On the other hand, disk space is cheap enough
that the cost of replicating information should be reasonable.
Moreover, there are good reasons (and strong precedent)
for separating your operational data storage from the MBD suite's.
An MBD data warehouse can integrate data from disparate sources,
making it easier to analyze and present.
Because this data is separate from the "live" data,
its management policies can be optimized
to serve the needs of the MBD effort.
Operational data storage needs to be managed in a conservative manner,
because important things can break if it isn't.
Adding new tables or columns may involve a great deal of politics,
even if no substantial coding changes are required.
MBD systems, in contrast, need flexibility and responsiveness:
"I need this report now."
Also, the live data will not, in general,
be integrated into a consistent access framework.
More generally, RDBMS systems aren't optimized for flexibility.
In a prototyping environment,
a set of YAML files may be far easier to create and use
than a set of database tables.
Finally, many tools (e.g., OpenCyc, OsiriX, TAP)
have their own storage formats.
In summary, don't get locked into a single approach.
The concepts page
presents a data flow "model" for a
Model-based Documentation (MBD) suite.
In this model, the presentation phase
is responsible for generating various output formats,
using assorted tools
(e.g., markup languages, documentation generators).
Most documentation needs can be readily met
by one of two output formats: PDF or HTML (and friends).
Although there is some overlap,
it is usually easy to decide which format is appropriate.
Most PDF viewers provide little interactive capability.
Keyword search is usually supported,
as is text clipping, etc.
Hyperlinks are possible (in recent versions of PDF),
but are rarely used in practice.
(See the Extraction page
for pointers to PDF-specific tools, reference material, etc.)
There are many ways to generate these formats,
but only a few of them are relevant to mechanized production.
Consequently, we'll focus our attention on this subset.
We'll start with markup languages,
then move on to other tools and some web-specific technologies.
Markup languages such as HTML
allow text to be "marked up"
with formatting and other ancillary notations.
An appropriate tool (e.g., a web browser)
can interpret the markup,
producing formatted text, hypertext links, etc.
Using markup languages,
MBD systems can generate attractive PDF documents, web pages, and more.
Note:
Structured Generalized Markup Language (SGML) and
Extensible Markup Language (XML)
aren't really markup languages in the same sense that HTML is.
Rather, they are "metalanguages"
which can be used to create markup languages.
Finally, YAML Ain't Markup Language (YAML)
is a data serialization language,
not a markup language at all.
If you're just producing web pages,
HTML is a direct and obvious choice.
Even with Cascading Style Sheets (CSS),
HTML doesn't allow precise control of formatting,
but this is seldom needed.
Unfortunately, HTML can't generate arbitrary PDF documents.
Many other markup languages are available, however,
offering a wide range of features.
Here are three popular "families" of markup languages,
to get you started...
DocBook is a presentation-neutral markup language for technical documentation:
it can generate PDF files, web pages, and assorted other formats,
using a single set of input files.
DocBook is well suited to books, manual sets,
and other large-scale documentation projects.
By the same token,
DocBook has a substantial learning curve,
involving large amounts of SGML and/or XML arcana
and complicated tool chains.
See the DocBook Project for more information.
The TeX suite
(including LaTeX and other associated packages)
is very powerful,
documented by dozens of books,
and supported by an active user community.
Building and configuring a distribution can be daunting, however.
Do yourself a favor and start with a "canned" distribution
(e.g., TeX Live,
from the TeX Users Group (TUG)).
Texinfo is the documentation format
of the GNU Project.
Although Texinfo predates the Web,
it has built-in hypertext capabilities.
It can be used to generate HTML pages and many other output formats,
including DVI, PDF, and XML.
Troff has been a part of Unix since the earliest days.
It is still used for the Unix "man pages",
but its role in document production
has largely been taken over by other programs.
In addition, it provides little support
for hyperlinks, image maps, and other web-related features.
Nonetheless, the Troff suite
(including pre-processors such as
Documentation generators are actually specialized MBD suites,
documenting software-related entities
such as data structures, functions, and modules.
The results are generally published as web pages,
but other formats are sometimes available.
In particular, collected information may be available
in a form (e.g., XML) that is suitable for follow-on analysis.
Most documentation generators perform a variant of
screen scraping,
parsing the source code (and specially-formatted comments).
Consequently, they are specific to particular programming languages.
A few, however, work strictly from comments or binary files.
If you are documenting a software project,
be sure to investigate this class of tools.
There are a number of tools that can generate
"eye candy" (e.g., diagrams, images, plots) for documents.
The trick, of course,
is to generate useful eye candy.
Here, in any case, are some potentially useful tools.
The GNU Image Manipulation Program (GIMP)
is generally used interactively,
but it can also be used in
batch mode,
by means of scripting extensions such as
Script-Fu.
Gnuplot is a very portable and versatile tool for plotting numerical data.
It works well with
GNU Octave,
a system for performing numerical (e.g., scientific) computations.
The Graphviz suite
contains several useful tools for "graph visualization".
I've made extensive use of the
ImageMagick is an extremely capable suite (read, "Swiss Army Knife")
of image manipulation tools.
It can be used from the command line
or as a library for any of several programming languages.
As mentioned above, a number of pre-processors
have been written for Troff.
Two of these (
There are many ways to generate content for web pages.
HTML can be edited by hand, generated directly by a script,
or translated from a markup language.
Eye candy (e.g., images) has a similar wealth of sources.
Now, let's talk about pulling it all together.
Modern web servers and browsers are capable
of handling far more than simple HTML.
Here are some variations and enhancements
which are worth considering for use in MBD:
Many webmasters use
Cascading Style Sheets (CSS)
to give web pages a consistent "look",
while simplifying their definitions.
The same principle can be applied to mechanically-generated pages,
but the simplification also extends to the generating programs.
If you've gone to the trouble of collecting a bunch of data,
why not make it available for easy follow-on analysis?
Comma-separated Value (CSV) files,
for example, are trivial to generate
and can be loaded into a wide variety of spreadsheet programs.
With a bit more work,
you can generate spreadsheet files,
complete with fancy formatting, headers, formulas, etc.
More generally, try to think of useful files
that you can offer for downloading.
Most web servers can handle almost any sort of data:
bits are bits, after all...
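Generating a CSV file for download takes very little code.
The sketch below uses Python's standard csv module;
the record layout and the to_csv function are assumptions.
Letting the library do the quoting avoids trouble
with commas and quotation marks inside fields.

```python
import csv
import io


def to_csv(records, fieldnames):
    """Return the records as CSV text, with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical collected data; note the awkward characters in one field.
records = [
    {"host": "alpha", "load": 0.42, "note": "ok"},
    {"host": "beta", "load": 1.07, "note": 'needs "attention", soon'},
]

csv_text = to_csv(records, ["host", "load", "note"])
print(csv_text)
```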
Some programs can generate client-side image maps,
allowing images to serve as navigational aids.
I have made heavy use of
I have deep reservations about Java, JavaScript, and other
imperative client-side programming languages.
Any given program may be error-ridden or even actively malicious.
So, I disable these languages in my browser,
enabling them only when
(a) they are needed and
(b) I'm willing to trust the originating site.
On the other hand,
client-side processing can do many splendid things.
Google Maps and other
AJAX applications
clearly demonstrate this power.
So, it's certainly worth keeping these languages in mind.
Just don't use them unless there's a good reason.
If you have an image-generation tool
that produces consistent and controllable layout,
it isn't that big a step to produce stop-motion animations.
QuickTime movies,
for example, can be used to clarify the sequence of
data flow diagrams.
Scalable Vector Graphics (SVG)
is an extremely appealing technology
for client-side image generation, etc.
When combined with AJAX,
it should be able to do some really spectacular things.
I expect to make extensive use of it,
just as soon as more browsers
get around to supporting it.
No matter how well you design your inter-page navigation,
there will be times when it won't fit the user's needs.
Fortunately, it's trivial to add a
search engine, such as
Simple Web Indexing System for Humans - Enhanced (Swish-e),
which can index web pages, PDF files, and more.
As a side benefit,
the indexing process can detect broken links!
Server Side Includes (SSIs)
provide an easy way to "include" standardized bits of HTML
(e.g., headers, footers) into static web pages.
In MBD, they can be used to fold mechanically-produced HTML
into manually-edited web pages (and vice versa).
SSIs can also resolve content generation conflicts.
For example, one script's images and image maps
might be needed by an earlier script's web pages.
There are many other ways to do server-side processing, including
Common Gateway Interface (CGI), script-based
template engines (e.g., Mason,
PHP,
Ruby on Rails),
and XML-based systems such as
Apache Cocoon.
Selecting and integrating these sorts of tools, however,
should be approached with care.
If an approach doesn't scale well or restricts your data model,
it may be a poor choice.
It's also worthwhile to think about cost/performance trade-offs.
Disk storage and offline computer time are very inexpensive,
in most situations.
By generating much of your web content in advance,
you can trade cheap resources for crisp interactive performance.
Although MBD-based web sites
can have totally arbitrary content and format,
certain types of web pages are likely to be useful.
Most MBD-generated web pages describe entities,
so let's consider the kinds of information
that the user should see on such a page:
Because these pages are mechanically generated,
it is trivial to provide rich cross-linking between pages,
add explanatory descriptions and tooltips, etc.
Image-mapped "context diagrams" can be generated
to show "close relatives", etc.
Help and tutorial pages can also be linked in,
explaining sections, pages, or even sets of pages.
Finally, if the user can't find what s/he is looking for,
mailto links and a search facility
are obvious, trivial, and very useful amenities to add...
No single index can meet the needs
of every user at every time.
A user may only be interested in a selected subset of the entities,
need them sorted in a particular order, etc.
The number of available views (i.e., combinations)
can grow very rapidly with the number of options allowed.
Nonetheless, it is still a manageable problem.
Mechanical generation of index pages
can allow any number of "views" to be shown.
With proper planning, navigation between the views can be very simple.
For example, the user might navigate between views
by clicking in one or more
"link tables".
Alternatively, in a forms-based interface,
checkboxes, menus, and/or radio buttons can be used to good effect.
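To illustrate how quickly the number of views grows, here is a small Python sketch that enumerates one index page per combination of options. The option names, values, and file-naming scheme are assumptions made for the example.

```python
from itertools import product

# Hypothetical view options; names and values are illustrative only.
options = {
    "subset": ["all", "programs", "files"],
    "sort": ["name", "date"],
}

def view_names(opts):
    """Return one index-page file name per combination of option values."""
    keys = sorted(opts)
    return [
        "index_" + "_".join(combo) + ".html"
        for combo in product(*(opts[k] for k in keys))
    ]

names = view_names(options)
print(len(names))
```

Two options with three and two values already yield six pages; each added option multiplies the count, which is why mechanical generation is the practical route.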
To the extent that the presented model matches the organization
of the system being documented,
the same descriptive material can be used to cover both.
That is, understanding the system helps in understanding
the web site, and vice versa.
Image-mapped diagrams can be used to good effect,
allowing the user to examine and "explore" the presented model.
Animated diagrams, showing control or data flow,
can also be useful.
In both cases,
mechanized generation techniques can ease the burden
of generating the diagrams.
In the meantime,
although I am able to perform in-person demos of the site
(contact me for details),
I am unable to give out the password.
I apologize for (and sympathize with) any difficulty this may cause.
In a recent contract,
I was asked to create a comprehensive
web site,
providing both overview and detailed documentation
for a scientific software development project:
The FSW (Flight Software) group is creating software
to operate the LAT (Large Area Telescope) instrument
in the GLAST (Gamma-ray Large Area Space Telescope) satellite,
process the data it collects,
and output the resulting scientific information.
The "production" FSW code is being written in C and assembler.
It will be run on multiple, radiation-hardened PowerPC processors
under the VxWorks operating system.
However, development and test code must run in assorted environments,
including Intel-based Linux, SPARC-based Solaris, etc.
Aside from a few dozen hand-written web pages (e.g., tutorials),
the site's content is entirely computer-generated.
About half of the pages are generated by
Doxygen, a well-known
documentation generator;
the rest are generated by custom Perl scripts.
The input data comes from a variety of sources,
including databases, XML files, (electronic and printed) documents,
web pages, and assorted file formats
(e.g., configuration files, object libraries).
Information can be requested of the development engineers,
but there is no guarantee that they will have the time to reply.
In short, a fairly typical mechanized documentation scenario.
A number of goals influenced the design:
These goals are typical of many mechanized documentation projects.
Equally typical, but worth noting, are some omissions:
As discussed in the "Data Flow" section of the
concepts page,
MBD techniques can be implemented by means of a
DAG (Directed Acyclic Graph) of processing routines.
Each routine accepts one or more input data sets and, in traditional
batch processing fashion,
generates one or more output data sets.
This produces a very modular system,
because each routine interacts only with its input and output data sets.
This implementation consists of several dozen smallish Perl scripts,
supplemented by assorted command-line utilities
(e.g.,
YAML files are used
for both the hand-edited and intermediate data sets.
Because YAML is a simple and powerful data serialization format,
each file can be a nicely-formatted textual representation
of a loadable data structure.
After loading the input structure(s),
some scripts create "helper" data structures
(e.g., additional indexes into the data).
All text files, whether hand-edited or machine-generated,
are thoroughly commented.
Each generated file receives an informative header,
indicating the file's format, origins, purpose, etc.
Section comments provide context and ease navigation.
For a variety of (mostly pragmatic) reasons,
the project uses Open Source tools whenever possible.
The ability to modify code is important,
as are cost factors
and the ability to share applications with other institutions.
In any case, many Open Source standbys
(e.g., CMT, CVS, Doxygen, GCC, Graphviz, Groff, ImageMagick,
Linux, MySQL, Perl, Python, Swish-e) are in use.
That said, some proprietary software is also being used.
This includes database systems (e.g., Oracle),
operating systems (e.g., Solaris, VxWorks, Windows),
and applications (e.g., Adobe's Acrobat and Frame;
Microsoft's Excel, Outlook, and Word).
The decision to use proprietary software
is generally based on either familiarity
or the lack of an acceptable Open Source alternative.
Finally, quite a bit of the software infrastructure
has been developed from scratch or adapted from Open Source tools.
Aside from my own contributions,
there are a couple of "test executives",
an interactive packet specification application, etc.
The group also uses a version of CMT
which has been extensively modified to handle local requirements.
For more information, see this
introduction.
The documentation suite consists
of several dozen Perl scripts (~20K lines) and hand-edited
YAML reference files (~30K lines).
The suite is run early each morning,
by means of
Several "tricks" are employed to ease maintenance,
increase reliability, and optimize performance.
Large, hand-edited makefiles are tedious and error-prone to edit.
They are also difficult to document in an automated manner.
To resolve this problem, I went to a two-stage process.
A small, hand-edited makefile causes and controls the generation
of the "real", machine-generated makefile.
Each hand-edited file (whether data or script)
contains a set of YAML
"#DDF"
(Documentation Data Flow) declarations,
encoded as specially-formatted comments.
These are harvested at the beginning of each run
and used to generate both the production makefile
and a set of data flow diagrams.
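The harvesting step might look roughly like the Python sketch below. The actual "#DDF" syntax is not shown here, so the simple "key: value" comment form, the field names (path, uses, creates), and the file names are all assumptions for illustration.

```python
import re

# Hypothetical script source; the "#DDF key: value" form is an assumption.
SCRIPT_TEXT = """\
#!/usr/bin/perl
#DDF path: bin/gen_index
#DDF uses: data/entities.yml
#DDF uses: data/relations.yml
#DDF creates: out/index.flag
print "hello\\n";
"""

DDF_LINE = re.compile(r"^#DDF\s+(\w+):\s*(.+?)\s*$")

def harvest(text):
    """Collect DDF declarations into a dict of lists, keyed by field name."""
    decls = {}
    for line in text.splitlines():
        m = DDF_LINE.match(line)
        if m:
            decls.setdefault(m.group(1), []).append(m.group(2))
    return decls

def make_rule(decls):
    """Emit one makefile rule: targets depend on the script and its inputs."""
    script = decls["path"][0]
    targets = " ".join(decls["creates"])
    prereqs = " ".join([script] + decls.get("uses", []))
    return f"{targets}: {prereqs}\n\t{script}\n"

rule = make_rule(harvest(SCRIPT_TEXT))
print(rule, end="")
```

Because the declarations live in the files they describe, the generated makefile cannot drift out of step with the scripts.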
Some scripts (e.g., extraction routines) are always run,
because there is no way to know whether their input has changed.
Consequently, the content of their output files may be unchanged.
Unfortunately,
The
Although a script may generate thousands of files,
there is no reason to bother
The DDF declarations, collectively,
provide an abstract model of the system's data flow.
Specifically, programs and (sets of) data files
are represented as nodes in a DAG.
Connections between nodes (e.g., read or write access, "include" usage)
are represented as edges.
Each hand-edited file "knows" its relative path, description,
label (for diagrams), and type.
Scripts, in addition, know which files they use or create.
Generated files are described by their originating scripts.
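A model of this kind can be checked and ordered mechanically. The Python sketch below uses the standard graphlib module (Python 3.9 or later) to derive a processing order and to reject cyclic data flow; the node names are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical data flow: each node maps to the set of nodes it depends on.
flow = {
    "extract": {"raw_data"},
    "entities.yml": {"extract"},
    "gen_pages": {"entities.yml"},
    "pages.flag": {"gen_pages"},
}

def run_order(graph):
    """Return a valid processing order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the cycle
        raise ValueError(f"cycle in data flow: {err.args[1]}") from err

order = run_order(flow)
print(order)
```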
As odd as all this may sound,
this is only a slight variation on a traditional Unix-style
batch processing system.
The use of
One interesting aspect of this implementation
is that data structures are "first-class citizens" of the design process.
Given that
OOP (Object-oriented programming) techniques
are based on hiding data structures,
this may seem odd.
However, this approach appears to provide a great deal of modularity,
which is one of the major goals of the OOP approach.
It would be fairly trivial to convert this design
into an event-based system.
Given that the scripts are written in Perl,
I would probably turn to
POE (Perl Object Environment),
which supports a very flexible approach to event-based programming.
C++, Python, and Ruby have roughly equivalent systems,
known respectively as
ACE (Adaptive Communication Environment),
Twisted, and
dRuby (Distributed Ruby).
Although an event-based approach wouldn't need a generated makefile,
it would still be necessary to have a setup script,
in order to "program" the event distribution
and check for cycles in the data flow.
It would also be a good idea to use atomic file writes
(e.g., write a temporary file, then rename it to the "final" name)
to eliminate incomplete data transmissions.
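The write-then-rename idea can be sketched in Python as follows. The function name atomic_write and the file names are illustrative; the rename is atomic on POSIX systems only when the temporary file is on the same filesystem as the target, which is why it is created in the target's directory.

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text to a temporary file, then rename it over the final name."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "report.yml")
    atomic_write(target, "status: old\n")
    atomic_write(target, "status: new\n")
    with open(target, encoding="utf-8") as handle:
        result = handle.read()
    leftovers = sorted(os.listdir(tmp))
print(result, end="")
```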
The Unix community is rife with
"little languages"
such as
Aside from the
Most of these "languages", to be sure,
consist of simple, special-purpose macro expansions.
For example, I make frequent use of brace expansion (à la the shell)
to handle expressions such as "
In the case of the data flow animation sequences,
the "base" data flow diagrams are defined
by hand-coded
Systems, Data, and MBD
Model-building
Extraction and Analysis
Data Warehousing
Mechanized Analysis
Presentation
Concepts
Models
Intelligence
Communication
Documentation
Navigation
Modeling MBD
Processing Phases
Data Flow
Here is a slightly less abstract diagram,
showing the flow of data through these (sets of) programs:
make (or even parallel make)
can be used to control the suite's operation.
Better yet, we can run the entire suite under
Cron and have an automated "service"
that updates the relevant output documents
whenever an underlying data source changes.
Modeling
Knowledge Representation
Methods and Tools
Organization
Diagramming Format
Interchange Format
Concept Maps
Data(base) Modeling
Knowledge Engineering
Semantic Web
Caveats
Semantic Wikis
Background
Web Limitations
<A HREF="...">...</A>)
have limitations that interfere with their use in semantic wikis.
Specifically, they are uni-directional, untyped, and binary.
The first problem can be worked around fairly easily,
but the others require structural changes.
Is_A, Has_A, Used_By),
Pimki has very little information to work with.
It cannot, for example, filter by link type
or assess the "strength" of given links,
much less make deductions (e.g., inherited characteristics)
based on link types.
So, add structure!
/etc/passwd page
deals with an instance of the class Control_File.
With this information, the wiki can generate bi-directional links,
display summary or inherited information, etc.
Control_File
may be written by a Program
(or really, by a Process
that is running the Program),
the wiki knows that it's OK for a user to assert that
the /etc/passwd file may be written by
the /bin/passwd program.
Some Examples
Commentary
[[type:IsWrittenBy /bin/passwd]]
The Extraction Phase
System Data
Flat Files
Exchange Mechanisms
Screen Scraping
... the act of capturing data from a system or program
by capturing and interpreting the contents of some display
that is not actually intended for data transport
or inspection by programs.
Web Pages
PostScript, PDF, etc.
Arcane Formats
nm command (and variations)
does a fine job of generating reports
on library and object files.
These reports have a regular format and are easy to parse.
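As a rough illustration of how little parsing is needed, the Python sketch below reads text in the default (BSD-style) nm layout: a hexadecimal value, a one-letter type, and a symbol name, with the value column blank for undefined symbols. The sample symbols are made up.

```python
# Sample report text; the symbol names are hypothetical.
SAMPLE = """\
0000000000000000 T lat_init
0000000000000040 T lat_read
                 U memcpy
0000000000000000 D lat_config
"""

def parse_nm(text):
    """Return a list of (value, type, name) tuples; value is None if undefined."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            value, kind, name = fields
            symbols.append((int(value, 16), kind, name))
        elif len(fields) == 2:
            kind, name = fields
            symbols.append((None, kind, name))
    return symbols

symbols = parse_nm(SAMPLE)
undefined = [name for value, kind, name in symbols if kind == "U"]
print(undefined)
```

The list of undefined symbols is a natural starting point for a "uses" relationship between object files and libraries.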
Hand-edited Files
YAML
# This is a comment
abc:
- 123
- def: 'fed'
ghi: 'ihg'
When this YAML is loaded into a Perl reference $r,
the expression $r->{abc}[0] yields 123.
The expression $r->{abc}[1]{def} yields 'fed'.
The Analysis Phase
Modeling and Analysis
Analysis Functions

Business Intelligence
Mathematical Analysis
Unstructured Data
Miscellany
Data Management
The Presentation Phase
Output Formats
Markup Languages
eqn,
grap, pic, and tbl)
is still very handy for mechanized document production.
It is available in most Unixish distributions,
either as a variation on the original Bell Labs version or as
Groff (the GNU Project's re-implementation).
Documentation Generators
Eye Candy
dot program,
which lays out diagrams of
directed graphs.
It can be used to generate data flow diagrams,
entity-relationship diagrams, and more.
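Generating input for dot is mostly a matter of printing edges. Here is a minimal Python sketch that emits DOT source for a small directed graph; the node names and the graph name are hypothetical.

```python
def to_dot(edges, name="data_flow"):
    """Return DOT source for a directed graph given (source, target) pairs."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines) + "\n"

# Hypothetical data flow edges.
edges = [("extract", "entities.yml"), ("entities.yml", "gen_pages")]
dot_text = to_dot(edges)
print(dot_text, end="")
```

Saved to a file, the text can be rendered with a command such as `dot -Tpng`; the layout itself is left entirely to dot.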
grap and pic)
can be useful for generating images:
grap does simple data graphing;
pic does constraint-based diagram layout.
Although grap is not part of groff,
an Open Source version is
available.
HTML, redux
dot to generate
context diagrams for web pages.
The diagrams show related pages (and their inter-relationships),
provide informative "pop-up" messages (i.e., tooltips),
and ease inter-page navigation.
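In DOT, link targets and tooltips are ordinary node attributes (URL and tooltip), which dot carries through to a client-side image map when run with `-Tcmapx`. The Python sketch below builds such a graph; the page names, URLs, and tooltip text are assumptions for illustration.

```python
def node_line(name, url, tip):
    """Return a DOT node statement carrying a link target and a tooltip."""
    return f'  "{name}" [URL="{url}", tooltip="{tip}"];'

# Hypothetical pages; names, URLs, and tooltips are illustrative only.
pages = [
    ("passwd_file", "passwd_file.html", "Control file for user accounts"),
    ("passwd_prog", "passwd_prog.html", "Program that updates the file"),
]

lines = ["digraph context {"]
lines += [node_line(*page) for page in pages]
lines.append('  "passwd_prog" -> "passwd_file";')
lines.append("}")
context_dot = "\n".join(lines) + "\n"
print(context_dot, end="")
```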
Page Types
"Entity" pages
"Index" pages
"Tutorial" pages
Case Study (FSW)
Due to spacecraft-related security concerns,
the GLAST Flight Software (FSW) web site now requires a password.
Given SLAC's historic policy of open research,
we can hope that the concerns may eventually be handled
in a less intrusive manner.
Design Goals
Implementation
dot, make, troff).
Most of the scripts are under 500 lines in length,
but a couple exceed 1000 lines.
An ancillary library of perhaps 1500 lines
supplies all of the "generic" code.
Few of the routines, in any case, are difficult to comprehend.
Base Technologies
Data and Control Flow
cron and
make.
It produces (as needed) tens of thousands of files,
in a variety of formats:
dot, grap, troff).
file_update()
make just looks at file time stamps
(not file content),
so this could cause unnecessary processing.
file_update() function is a simple workaround.
It is called with a file name and the new content.
If the new content differs from the existing content,
the file is updated.
Otherwise, the function simply returns.
This ensures that an updated time stamp always reflects changed content.
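The behavior described above can be sketched in Python as follows. This is an illustration of the technique, not the suite's own (Perl) code; the return value and the file names are additions made for the example.

```python
import os
import tempfile

def file_update(path, new_content):
    """Rewrite path only if new_content differs; return True if it was written."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            if handle.read() == new_content:
                return False  # unchanged: leave the time stamp alone
    except FileNotFoundError:
        pass  # no existing file, so the content is new by definition
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(new_content)
    return True

# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "page.html")
    first = file_update(target, "<p>v1</p>\n")
    second = file_update(target, "<p>v1</p>\n")
    third = file_update(target, "<p>v2</p>\n")
print(first, second, third)
```

Only the first and third calls touch the file, so a time-stamp-based tool sees a change only when the content really changed.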
make
with this level of detail.
So, when a script finishes successfully,
it may generate a "flag file" to indicate the fact.
The time stamp on this file is then used as a proxy
for the time stamps of the "real" output files.
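A flag file of this sort needs very little code. In the Python sketch below, the wrapper run_step and the flag names are hypothetical; the flag is touched only if the step finishes without raising an exception.

```python
import os
import tempfile
from pathlib import Path

def run_step(step, flag_path):
    """Run step(); touch flag_path only if it completes without raising."""
    step()
    Path(flag_path).touch()  # refreshes the time stamp if the flag already exists

def failing_step():
    raise RuntimeError("simulated failure")

# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    ok_flag = os.path.join(tmp, "gen_pages.flag")
    bad_flag = os.path.join(tmp, "gen_index.flag")
    run_step(lambda: None, ok_flag)
    try:
        run_step(failing_step, bad_flag)
    except RuntimeError:
        pass
    ok_exists = os.path.exists(ok_flag)
    bad_exists = os.path.exists(bad_flag)
print(ok_exists, bad_exists)
```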
cron and make are commonplace,
as is the use of textual files for data interchange.
The only unusual aspects, really,
lie in the makefile generation technique
and the use of YAML as an encoding format for data structures.
Little Languages
dot, eqn, grap,
grep, lex, pic,
tbl, and yacc.
Although limited in scope,
they perform their (specialized) functions very well.
Unfortunately, creation of little languages generally requires the use
of tools such as lex and yacc.
By using YAML to handle the first-order parsing issues,
I was able to avoid this hassle
and create a number of declarative "little languages".
DDF entries (described above),
I created a language for generating data-flow animation sequences,
a couple of "mini-templating" systems, etc.
These languages are primarily used with hand-edited files,
where they dramatically reduce the amount of typing
(and in some cases, thinking).
e_cat/{cat,html}.yml".
In the case of DDF entries,
this is extended to create multiple path entries
(in the generated file_sets file)
for any patt entries containing brace expressions.
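A simplified version of this expansion can be sketched in Python. The function below handles one or more non-nested brace groups; nested groups and the shell's other special cases are deliberately left out, and the function name is an assumption.

```python
import re

BRACE = re.compile(r"\{([^{}]*)\}")

def expand_braces(pattern):
    """Expand non-nested {a,b} groups into a list of strings, left to right."""
    match = BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for choice in match.group(1).split(","):
        results.extend(expand_braces(head + choice + tail))
    return results

paths = expand_braces("e_cat/{cat,html}.yml")
print(paths)
```

Applied to the example above, the call returns the two paths e_cat/cat.yml and e_cat/html.yml.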
dot files.
The