The concepts page
presents a data flow "model"
for an MBD (Model-based Documentation) suite.
In this model, the analysis phase is responsible
for combining or tallying data, discerning patterns, etc.
This page considers the relationship between modeling and analysis,
looks at some representative forms of analysis,
and makes some broad generalizations about data management.
The administrators and designers of a complex system
generally understand its overall nature.
They may be unclear on specific details,
but they understand the major entities and relationships,
along with some "minor" ones which have attracted their attention,
for whatever reason.
In short, they have a working "mental model" of the system.
By interviewing these "local experts",
studying the existing documentation,
and generally poking around,
an MBD developer can "discover" the types
of entities and relationships in the system.
Using this information,
s/he can create an informal model
that will support and guide the creation of MBD software.
Analysis functions, in particular, rely on this sort of model.
They look for, characterize, and/or summarize the specific
values and relationships concerning instances of known entities.
Typically, they produce explicit formulations
of information which was only implicitly available
in the input data.
In a software development project,
the entities might include bug reports, data structures, developers,
functions, libraries, managers, programs, projects, releases,
test suites, etc.
By analyzing information on these instances,
we can "discover" and present useful information
about individual instances and/or the system as a whole.
For example, we could count the bugs found
during each stage of each release cycle.
This might tell us useful things
about our design and testing procedures.
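
As a minimal sketch of this kind of tally, the following Python counts bugs for each (release, stage) pair. The record layout, stage names, and the `count_bugs_by_stage` helper are all assumptions made for illustration; real bug-tracking data would need its own extraction step.

```python
from collections import Counter

# Hypothetical bug records: (release, stage in which the bug was found).
bug_reports = [
    ("1.0", "design"),
    ("1.0", "testing"),
    ("1.0", "testing"),
    ("1.1", "testing"),
    ("1.1", "production"),
]

def count_bugs_by_stage(reports):
    """Tally bugs for each (release, stage) pair."""
    return Counter(reports)

bug_counts = count_bugs_by_stage(bug_reports)

# Print the tallies in a stable order.
for (release, stage), count in sorted(bug_counts.items()):
    print(f"release {release}, {stage}: {count}")
```
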
We could also look for "hot spots" in the code
(e.g., areas that seem to have lots of bugs),
as a way to guide
refactoring efforts.
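
One simple way to look for such hot spots is to count how many distinct bugs touched each file and report the files above some threshold. The sketch below assumes a hypothetical mapping from bug IDs to file names; the names, the threshold, and the `find_hot_spots` helper are illustrative only.

```python
from collections import Counter

# Hypothetical mapping of bug IDs to the source files each bug touched.
bug_files = {
    "BUG-1": ["parser.c", "lexer.c"],
    "BUG-2": ["parser.c"],
    "BUG-3": ["parser.c", "io.c"],
    "BUG-4": ["io.c"],
}

def find_hot_spots(bug_to_files, threshold):
    """Return (file, bug count) pairs with at least `threshold` bugs,
    ordered from most to fewest bugs (ties broken by file name)."""
    counts = Counter(
        path for files in bug_to_files.values() for path in set(files)
    )
    hot = [(path, n) for path, n in counts.items() if n >= threshold]
    return sorted(hot, key=lambda item: (-item[1], item[0]))

hot_spots = find_hot_spots(bug_files, threshold=2)
print(hot_spots)
```
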
We could also track the usage
of each data structure, function, library, etc.
This would let us create links to related items
and even draw nifty-looking (and useful)
diagrams showing critical relationships.
Anyone who has had to "drop into" a large body of code
can appreciate the value of this sort of contextual information.
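
Usage tracking of this sort can be sketched with Python's standard `ast` module, which parses Python source into a syntax tree. The example below records which plain function names each top-level function calls, then inverts that mapping to show where each function is used. The sample source and the helper names (`call_graph`, `used_by`) are assumptions for illustration, and the approach only handles Python code and simple `name(...)` calls.

```python
import ast
from collections import defaultdict

# Hypothetical source code to analyze.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''

def call_graph(source):
    """Map each top-level function to the set of plain names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            calls = set()
            for inner in ast.walk(node):
                # Only direct calls such as f(x); method calls are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.add(inner.func.id)
            graph[node.name] = calls
    return graph

def used_by(graph):
    """Invert the call graph: map each callee to the functions that call it."""
    users = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            users[callee].add(caller)
    return dict(users)

graph = call_graph(SOURCE)
users = used_by(graph)
print(graph)
print(users)
```

The inverted mapping is the raw material for "related items" links, and the same edges could be handed to a graph-drawing tool.
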
As Gerald Weinberg points out in
An Introduction to General Systems Thinking,
different approaches are needed for different sorts of problems.
Statistics deals well with "unorganized complexity",
but requires both randomness and large population sizes.
Pure analysis, on the other hand,
handles "organized simplicity",
but is limited by complexity and randomness.
Thus, there is a substantial area of "organized complexity"
where neither tool can be used.
As discussed below,
data mining tools such as
Weka
apply both statistical and analytical techniques
(e.g., analyzing statistical results).
Similarly, this document uses the term "analysis"
to include statistics, etc.
Analysis functions can range from trivial to extremely complex.
The ones which involve only simple statistics and record-keeping
lie near the "trivial" end.
Data mining and document analysis functions,
which can involve much more intricate calculations,
go well into the "complex" end.
In fact, some forms of analysis
can be described as "interesting research projects" (:-).
Fortunately, two factors operate to make analysis less challenging.
First, most analysis falls on the trivial end:
the details may be complex, but the principles involved are simple.
Second, tools are available to perform many kinds of analysis,
leaving the MBD implementor with the relatively simple tasks
of tool selection, installation, and use.
Although there is no shortage
of expensive, proprietary analysis products,
many useful tools are also available as
Open Source.
Their capabilities range from text analysis and indexing
through numerical and statistical methods
to volumetric data presentation.
Typically, they have books and/or extensive documentation
that detail their underlying principles and use.
Here is a sampling, to whet your appetite...
BI (Business Intelligence) techniques (e.g.,
data mining,
OLAP)
are commonly used in large enterprises
to mechanically extract information and even knowledge from data.
Open Source tools now allow even small enterprises
to take advantage of these techniques.
BIRT (Business Intelligence and Reporting Tools) is an
Eclipse-based
"reporting system".
It can generate (combinations of)
charts, cross-tabulations, lists, and textual documents.
Bizgres is billed as
"PostgreSQL for Business Intelligence and Data Warehousing".
The Bizgres distribution bundles an enhanced and optimized version of
PostgreSQL with a "stack" of business intelligence tools:
KETL (an extract, transform, and load system),
Mondrian,
JPivot,
JasperReports, and
OpenI.
A unified installation and configuration system
makes all of this (relatively) easy to put into place.
OLAP (online analytical processing)
suites are very popular for business-related analysis.
As defined by Edgar F. Codd,
OLAP is "the dynamic synthesis, analysis, and consolidation
of large volumes of multidimensional data".
This article gives a thorough, yet readable summary of this idea.
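
To make "consolidation of multidimensional data" a little more concrete, here is a minimal pure-Python sketch of a roll-up: a measure is summed over whichever dimensions the caller chooses to keep. The fact table, dimension names, and `roll_up` helper are hypothetical; a real OLAP engine does far more (hierarchies, caching, query languages such as MDX).

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure per row.
FACTS = [
    {"region": "east", "product": "widget", "quarter": "Q1", "sales": 100},
    {"region": "east", "product": "gadget", "quarter": "Q1", "sales": 150},
    {"region": "west", "product": "widget", "quarter": "Q1", "sales": 200},
    {"region": "west", "product": "widget", "quarter": "Q2", "sales": 50},
]

def roll_up(facts, dimensions, measure="sales"):
    """Sum `measure` over all rows, grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
grand_total = roll_up(FACTS, [])

print(by_region)
print(by_region_quarter)
print(grand_total)
```

Dropping a dimension from the list "rolls up" to a coarser summary; adding one "drills down" to a finer one.
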
Mondrian is an OLAP database that is written in
Java.
It implements the
MDX (Multidimensional Expressions) language and the
XML for Analysis and
JOLAP specifications.
A related project,
JPivot, uses
JSP (JavaServer Pages)
to provide an interactive (spreadsheet-like) front end to Mondrian.
The Pentaho BI Platform is an integrated distribution of business intelligence tools.
It incorporates technology from
BIRT,
Eclipse,
Enhydra,
JBoss,
JOSSO,
JPivot,
Mondrian,
Rhino, and
Weka.
It can use any
RDBMS that supports the
JDBC (Java Database Connectivity)
API.
The platform's principal claim to fame
is that all of the tools run under the control
of a process-driven "workflow engine"
which provides auditing, logging, security, etc.
It can be embedded into server-based applications
or desktop applications.
The R language
and other numerical analysis tools
may be useful for certain types of OLAP work.
The Weka data mining suite
can be used for statistical work.
These offerings are described in more detail below.
Extremely powerful tools are now available
for mathematical analysis and display,
using techniques from
graph theory,
operations research,
statistics,
symbolic mathematics, etc.
Several powerful suites are available to help with these tasks.
COIN-OR hosts dozens of Open Source projects in operations research,
including both linear and nonlinear deterministic optimization,
metaheuristics, stochastic optimization, and more.
Octave,
a program for performing numerical (e.g., scientific) computations,
is mostly compatible with
MATLAB.
PDL (Perl Data Language)
is a vectorized array programming language.
It allows Perl scripts
to perform rapid vector calculations
using simple (but highly expressive) expressions.
It also has support for interactive use, graphics and plotting, etc.
R is a mathematical language and environment
for statistical analysis and display,
modeled after Bell Labs'
S programming language.
Although R is mostly used by statisticians,
it can also be used as a general matrix calculation toolbox,
much like
Octave.
Scilab is a program for performing and displaying the results
of numerical (e.g., scientific) computations.
It has a similar (though incompatible) syntax to
MATLAB.
Weka is a Java-based collection of machine learning algorithms for
data mining tasks.
The algorithms can either be applied directly to a dataset
(e.g., via command-line or interactive interfaces)
or called from Java code.
Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization.
Data Mining: Practical Machine Learning Tools and Techniques is a very approachable introduction to both Weka
and its (table-based, statistical) approach to data mining.
Although images, text, and video have lots of internal structure,
it isn't the sort of thing that fits well into arrays or even schemas.
Consequently, they are often referred to as "unstructured data".
Here, in any case, are some tools for handling them.
mg is a system for creating and managing digital libraries.
The approach is described
by an extremely interesting and readable book
(Managing Gigabytes: Compressing and Indexing Documents and Images).
OsiriX is a
Macintosh-based system
for analyzing and presenting volumetric medical data
(e.g., sets of MRI scans or confocal microscope images).
It will accept a variety of "generic" input formats, however
(including AVI, JPEG, MPEG, PDF, QuickTime, and TIFF).
Thus, it could be used for geology, mechanical engineering,
or even completely artificial sets of volumetric "data".
Although Swish-e is billed as a
"system for indexing collections of Web pages or other files",
it is really an extensible toolkit for building indexing systems.
It can handle "plain text, e-mail, PDF, HTML, XML,
Microsoft Word/PowerPoint/Excel,
and just about any file that can be converted to XML or HTML text".
If it doesn't already do what you want,
just add your own analysis functions!
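
The core idea behind an indexing system is an inverted index, which maps each word to the documents that contain it. The sketch below is not Swish-e's API or implementation; it is a toy illustration in Python, with hypothetical documents and helper names (`build_index`, `search`), of the kind of structure such a toolkit builds.

```python
import re
from collections import defaultdict

# Hypothetical documents, already converted to plain text.
DOCUMENTS = {
    "readme.txt": "Install the indexing toolkit and build an index.",
    "notes.txt": "The toolkit can index HTML and XML files.",
}

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(DOCUMENTS)
print(search(index, "toolkit index"))
print(search(index, "xml"))
```
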
UIMA (Unstructured Information Management Architecture) is a Java
SDK (Software Development Kit)
that supports the implementation, composition, and deployment
of applications working with unstructured information
(e.g., audio, images, video).
UIMA is based on the composition of stackable "analysis engines"
that extract and record document information.
Documentation generators, such as
Doxygen and
Synopsis,
analyze software-related entities
such as data structures, functions, and modules.
Most documentation generators perform a variant of
screen scraping,
parsing the source code (and specially formatted comments).
Consequently, they tend to be specific
to particular (sets of) programming languages.
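
The following sketch shows the general idea for one language only: it uses Python's standard `ast` module to parse Python source and collect the docstrings of the module, its functions, and its classes. It is not how Doxygen or Synopsis work internally; the sample source and the `extract_docs` helper are assumptions for illustration. It also shows the language dependence noted above, since this parser understands nothing but Python.

```python
import ast

# Hypothetical source file to document.
SOURCE = '''
"""Utilities for tallying bug reports."""

def tally(reports):
    """Return the number of reports."""
    return len(reports)

class Report:
    """A single bug report."""

    def summary(self):
        return "n/a"
'''

def extract_docs(source):
    """Map the module and each function or class name to its docstring.

    Entities without a docstring map to None.
    """
    tree = ast.parse(source)
    docs = {"<module>": ast.get_docstring(tree)}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            docs[node.name] = ast.get_docstring(node)
    return docs

docs = extract_docs(SOURCE)
for name, text in docs.items():
    print(f"{name}: {text or '(undocumented)'}")
```
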
OpenCyc is a subset of
Cycorp's
"Cyc" knowledge engineering system.
Both Cyc and OpenCyc are based on the CycL language,
which is based on predicate calculus.
When answering questions,
the "inference engine" traverses a "knowledge base" of "assertions".
Unfortunately, although the knowledge base is
libre (free as in speech),
the inference engine is only
gratis (free as in beer).
TAP is a knowledge engineering project
that hopes to solve some challenging problems related to the
Semantic Web.
I'm particularly intrigued by their notion of "semantic negotiation",
which may allow machines to automagically match up their terms.
The project has released "TAPache"
(an Apache module for data publication)
and "TAP KB" (a knowledge base).
The Extraction page glossed over the issue
of how MBD data should be managed.
Indeed, this page has said very little
about data management.
Unfortunately, proper coverage of MBD data management
would take another whole set of pages.
Still, there are a few things that can and should be said.
The first is that some OLAP tools (e.g., Bizgres)
rely on a particular RDBMS.
If you already have your data in some other database,
this could be an issue.
On the other hand, disk space is cheap enough
that the cost of replicating information should be reasonable.
Moreover, there are good reasons (and strong precedent)
for separating your operational data storage from the MBD suite's.
An MBD data warehouse can integrate data from disparate sources,
making it easier to analyze and present.
Because this data is separate from the "live" data,
its management policies can be optimized
to serve the needs of the MBD effort.
Operational data storage needs to be managed in a conservative manner,
because important things can break if it isn't.
Adding new tables or columns may involve a great deal of politics,
even if no substantial coding changes are required.
MBD systems, in contrast, need flexibility and responsiveness:
"I need this report now."
Also, the live data will not, in general,
be integrated into a consistent access framework.
More generally, relational databases aren't optimized for flexibility.
In a prototyping environment,
a set of YAML files may be far easier to create and use
than a set of database tables.
Finally, many tools (e.g., OpenCyc, OsiriX, TAP)
have their own storage formats.
In summary, don't get locked into a single approach.
Modeling and Analysis
Analysis Functions

Business Intelligence
Mathematical Analysis
Unstructured Data
Miscellany
Data Management