MBD: The Analysis Phase

Rich Morin, rdm@cfcl.com

The concepts page presents a data flow "model" for an MBD (Model-based Documentation) suite. In this model, the analysis phase is responsible for combining or tallying data, discerning patterns, etc. This page considers the relationship between modeling and analysis, looks at some representative forms of analysis, and makes some broad generalizations about data management.

Modeling and Analysis

The administrators and designers of a complex system generally understand its overall nature. They may be unclear on specific details, but they understand the major entities and relationships, along with some "minor" ones which have attracted their attention, for whatever reason. In short, they have a working "mental model" of the system.

By interviewing these "local experts", studying the existing documentation, and generally poking around, an MBD developer can "discover" the types of entities and relationships in the system. Using this information, s/he can create an informal model that will support and guide the creation of MBD software.

Analysis functions, in particular, rely on this sort of model. They look for, characterize, and/or summarize the specific values and relationships concerning instances of known entities. Typically, they produce explicit formulations of information which was only implicitly available in the input data.

In a software development project, the entities might include bug reports, data structures, developers, functions, libraries, managers, programs, projects, releases, test suites, etc. By analyzing information on these instances, we can "discover" and present useful information about individual instances and/or the system as a whole.

For example, we could count the bugs found during each stage of each release cycle. This might tell us useful things about our design and testing procedures. We could also look for "hot spots" in the code (e.g., areas that seem to have lots of bugs), as a way to guide refactoring efforts.
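As a rough illustration of the tallies described above, here is a minimal Python sketch. The bug records, field layout, and function names are all hypothetical; a real MBD suite would read these values from whatever the extraction phase produced.

```python
from collections import Counter

# Hypothetical bug records (assumed layout): (release, stage, source file).
BUGS = [
    ("1.0", "design", "parser.c"),
    ("1.0", "testing", "parser.c"),
    ("1.0", "testing", "lexer.c"),
    ("1.1", "testing", "parser.c"),
    ("1.1", "field", "report.c"),
]

def bugs_per_stage(bugs):
    """Count the bugs found in each (release, stage) pair."""
    return Counter((release, stage) for release, stage, _ in bugs)

def hot_spots(bugs, threshold=2):
    """Return (file, count) pairs with at least `threshold` bugs, worst first."""
    per_file = Counter(path for _, _, path in bugs)
    return [(path, n) for path, n in per_file.most_common() if n >= threshold]

if __name__ == "__main__":
    for key, count in sorted(bugs_per_stage(BUGS).items()):
        print(key, count)
    print(hot_spots(BUGS))
```

The threshold for a "hot spot" is an arbitrary choice here; in practice it would probably be scaled by file size or change frequency.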

We could also track the usage of each data structure, function, library, etc. This would let us create links to related items and even draw nifty-looking (and useful) diagrams showing critical relationships. Anyone who has had to "drop into" a large body of code can appreciate the value of this sort of contextual information.
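The usage tracking described above can be sketched as follows. This assumes that extraction has already produced a "uses" mapping (the names below are invented for illustration); the sketch inverts it into a "used by" mapping, suitable for generating links, and renders it as Graphviz DOT text, which is one possible input format for a diagramming tool.

```python
from collections import defaultdict

# Hypothetical extraction output: each function and the items it uses.
USES = {
    "main": ["parse_args", "run"],
    "run": ["load_config", "log"],
    "parse_args": ["log"],
}

def used_by(uses):
    """Invert a 'uses' mapping into a 'used by' mapping."""
    result = defaultdict(set)
    for user, items in uses.items():
        for item in items:
            result[item].add(user)
    return {item: sorted(users) for item, users in result.items()}

def dot_graph(uses):
    """Render the 'uses' mapping as Graphviz DOT text (one edge per use)."""
    lines = ["digraph uses {"]
    for user in sorted(uses):
        for item in sorted(uses[user]):
            lines.append(f'  "{user}" -> "{item}";')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(used_by(USES))
    print(dot_graph(USES))
```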

Analysis Functions

As Gerald Weinberg points out in An Introduction to General Systems Thinking, different approaches are needed for different sorts of problems. Statistics deals well with "unorganized complexity", but requires both randomness and large population sizes. Pure analysis, on the other hand, handles "organized simplicity", but is limited by complexity and randomness. Thus, there is a substantial area of "organized complexity" where neither tool can be used.

As discussed below, data mining tools such as Weka apply both statistical and analytical techniques (e.g., analyzing statistical results). Similarly, this document uses the term "analysis" to include statistics, etc.

Analysis functions can range from trivial to extremely complex. The ones which involve only simple statistics and record-keeping lie near the "trivial" end. Data mining and document analysis functions, which can involve much more intricate calculations, go well into the "complex" end. In fact, some forms of analysis can be described as "interesting research projects" (:-).

Fortunately, two factors operate to make analysis less challenging. First, most analysis falls near the trivial end: the details may be complex, but the principles involved are simple. Second, tools are available to perform many kinds of analysis, leaving the MBD implementor with the relatively simple tasks of tool selection, installation, and use.

Although there is no shortage of expensive, proprietary analysis products, many useful tools are also available as Open Source. Their capabilities range from text analysis and indexing through numerical and statistical methods to volumetric data presentation. Typically, they have books and/or extensive documentation which detail the underlying principles and use. Here is a sampling, to whet your appetite...

Business Intelligence

BI (Business Intelligence) techniques (e.g., data mining, OLAP) are commonly used in large enterprises to mechanically extract information and even knowledge from data. Open Source tools now allow even small enterprises to take advantage of these techniques.

  • BIRT

    BIRT (Business Intelligence and Reporting Tools) is an Eclipse-based "reporting system". It can generate (combinations of) charts, cross-tabulations, lists, and textual documents.

  • Bizgres

    Bizgres is billed as "PostgreSQL for Business Intelligence and Data Warehousing". The Bizgres distribution bundles an enhanced and optimized version of PostgreSQL with a "stack" of business intelligence tools: KETL (an extract, transform, and load system), Mondrian, JPivot, JasperReports, and OpenI. A unified installation and configuration system makes all of this (relatively) easy to put into place.

  • Mondrian

    OLAP (online analytical processing) suites are very popular for business-related analysis. As defined by Edgar F. Codd, OLAP is "the dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data". This article gives a thorough, yet readable summary of this idea.

    Mondrian is an OLAP database that is written in Java. It implements the MDX (Multidimensional Expressions) language and the XML for Analysis and JOLAP specifications. A related project, JPivot, uses JSP (JavaServer Pages) to provide an interactive (spreadsheet-like) front end to Mondrian.

  • Pentaho

    The Pentaho BI Platform is an integrated distribution of business intelligence tools. It incorporates technology from BIRT, Eclipse, Enhydra, JBoss, JOSSO, JPivot, Mondrian, Rhino, and Weka. It can use any RDBMS that supports the JDBC (Java Database Connectivity) API. The platform's principal claim to fame is that all of the tools run under the control of a process-driven "workflow engine" which provides auditing, logging, security, etc. It can be embedded into server-based applications or desktop applications.

  • R, Weka, etc.

    The R language and other numerical analysis tools may be useful for certain types of OLAP work. The Weka data mining suite can be used for statistical work. These offerings are described in more detail below.
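To make the OLAP notion of "consolidation" a bit more concrete, here is a minimal sketch in plain Python. It is not Mondrian, MDX, or any of the tools listed above; the fact table, dimension names, and roll_up function are assumptions made purely for illustration. Each row carries several dimensions and one measure, and a roll-up sums the measure over whichever dimensions are left out.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure ("bugs").
FACTS = [
    {"release": "1.0", "component": "parser", "severity": "high", "bugs": 3},
    {"release": "1.0", "component": "parser", "severity": "low", "bugs": 5},
    {"release": "1.0", "component": "lexer", "severity": "low", "bugs": 2},
    {"release": "1.1", "component": "parser", "severity": "high", "bugs": 1},
]

def roll_up(facts, dimensions, measure="bugs"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

if __name__ == "__main__":
    print(roll_up(FACTS, ["release"]))
    print(roll_up(FACTS, ["component", "severity"]))
```

A real OLAP engine adds hierarchies, precomputed aggregates, and a query language on top of this basic idea.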

Mathematical Analysis

Extremely powerful tools are now available for mathematical analysis and display, using techniques from graph theory, operations research, statistics, symbolic mathematics, etc. Several powerful suites are available to help with these tasks.

  • COIN-OR

    COIN-OR hosts dozens of Open Source projects in operations research, including both linear and nonlinear deterministic optimization, metaheuristics, stochastic optimization, and more.

  • Octave

    Octave, a program for performing numerical (e.g., scientific) computations, is mostly compatible with MATLAB.

  • PDL

    PDL (Perl Data Language) is a vectorized array programming language. It allows Perl scripts to perform rapid vector calculations using simple (but highly expressive) expressions. It also has support for interactive use, graphics and plotting, etc.

  • R

    R is a mathematical language and environment for statistical analysis and display, modeled after Bell Labs' S programming language. Although R is mostly used by statisticians, it can also serve as a general matrix calculation toolbox, in much the same way as a program such as Octave.

  • Scilab

    Scilab is a program for performing and displaying the results of numerical (e.g., scientific) computations. Its syntax is similar to (though incompatible with) that of MATLAB.

  • Weka

    Weka is a Java-based collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (e.g., via command-line or interactive interfaces) or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

    Data Mining: Practical Machine Learning Tools and Techniques is a very approachable introduction to both Weka and its (table-based, statistical) approach to data mining.

Unstructured Data

Although images, text, and video have lots of internal structure, it isn't the sort of thing that fits well into arrays or even schemas. Consequently, they are often referred to as "unstructured data". Here, in any case, are some tools for handling them.

  • mg

    mg is a system for creating and managing digital libraries. The approach is described by an extremely interesting and readable book (Managing Gigabytes: Compressing and Indexing Documents and Images).

  • OsiriX

    OsiriX is a Macintosh-based system for analyzing and presenting volumetric medical data (e.g., sets of MRI scans or confocal microscope images). It will, however, accept a variety of "generic" input formats (including AVI, JPEG, MPEG, PDF, QuickTime, and TIFF). Thus, it could be used for geology, mechanical engineering, or even completely artificial sets of volumetric "data".

  • Swish-e

    Although Swish-e is billed as a "system for indexing collections of Web pages or other files", it is really an extensible toolkit for building indexing systems. It can handle "plain text, e-mail, PDF, HTML, XML, Microsoft Word/PowerPoint/Excel, and just about any file that can be converted to XML or HTML text". If it doesn't already do what you want, just add your own analysis functions!

  • UIMA

    UIMA (Unstructured Information Management Architecture) is a Java SDK (Software Development Kit) that supports the implementation, composition, and deployment of applications working with unstructured information (e.g., audio, images, video). UIMA is based on the composition of stackable "analysis engines" that extract and record document information.
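The core idea behind text indexing tools of this kind can be sketched with a tiny inverted index. This is plain Python, not the Swish-e or UIMA API; the documents, tokenizer, and function names are hypothetical, and real indexers add stemming, ranking, format conversion, and on-disk storage.

```python
import re
from collections import defaultdict

# Hypothetical document collection: file name -> text.
DOCS = {
    "intro.txt": "The analysis phase tallies data and discerns patterns.",
    "tools.txt": "Indexing tools make text data searchable.",
}

def tokenize(text):
    """Split text into lowercase alphanumeric words (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "data"))
    print(search(index, "indexing data"))
```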

Miscellany

  • Documentation Generators

    Documentation generators, such as Doxygen and Synopsis, analyze software-related entities such as data structures, functions, and modules. Most documentation generators perform a variant of screen scraping, parsing the source code (and specially-formatted comments). Consequently, they tend to be specific to particular (sets of) programming languages.

  • OpenCyc

    OpenCyc is a subset of Cycorp's "Cyc" knowledge engineering system. Both Cyc and OpenCyc are based on the CycL language, which is based on predicate calculus. When answering questions, the "inference engine" traverses a "knowledge base" of "assertions". Unfortunately, although the knowledge base is libre (free as in speech), the inference engine is only gratis (free as in beer).

  • TAP

    TAP is a knowledge engineering project that hopes to solve some challenging problems related to the Semantic Web. I'm particularly intrigued by their notion of "semantic negotiation", which may allow machines to automagically match up their terms. The project has released "TAPache" (an Apache module for data publication) and "TAP KB" (a knowledge base).

Data Management

The Extraction page glossed over the issue of how MBD data should be managed. Indeed, this page has said very little about data management. Unfortunately, proper coverage of MBD data management would take another whole set of pages.

Still, there are a few things that can and should be said. The first is that some OLAP tools (e.g., Bizgres) rely on a particular RDBMS. If you already have your data in some other database, this could be an issue. On the other hand, disk space is cheap enough that the cost of replicating information should be reasonable.

Moreover, there are good reasons (and strong precedent) for separating your operational data storage from the MBD suite's. An MBD data warehouse can integrate data from disparate sources, making it easier to analyze and present. Because this data is separate from the "live" data, its management policies can be optimized to serve the needs of the MBD effort.
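As a small sketch of this kind of integration, the following Python code copies extracts from two separate (and entirely hypothetical) operational sources into a stand-alone SQLite database, then answers a question that neither source could answer alone. The table names, columns, and sample rows are assumptions; SQLite is used only because it ships with Python, not because the text recommends it.

```python
import sqlite3

# Hypothetical extracts from two separate operational sources.
BUG_TRACKER = [("parser", 4), ("lexer", 1)]                # (component, open bugs)
VERSION_CONTROL = [("parser", "alice"), ("lexer", "bob")]  # (component, owner)

def build_warehouse(bugs, owners):
    """Load both extracts into a separate in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bugs (component TEXT, open_bugs INTEGER)")
    db.execute("CREATE TABLE owners (component TEXT, owner TEXT)")
    db.executemany("INSERT INTO bugs VALUES (?, ?)", bugs)
    db.executemany("INSERT INTO owners VALUES (?, ?)", owners)
    db.commit()
    return db

def bugs_by_owner(db):
    """Join the two sources: total open bugs for each component owner."""
    query = """
        SELECT owners.owner, SUM(bugs.open_bugs)
        FROM bugs JOIN owners ON bugs.component = owners.component
        GROUP BY owners.owner
        ORDER BY owners.owner
    """
    return db.execute(query).fetchall()

if __name__ == "__main__":
    warehouse = build_warehouse(BUG_TRACKER, VERSION_CONTROL)
    print(bugs_by_owner(warehouse))
```

Because the warehouse is a copy, tables can be added or dropped freely without touching the live systems.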

Operational data storage needs to be managed in a conservative manner, because important things can break if it isn't. Adding new tables or columns may involve a great deal of politics, even if no substantial coding changes are required. MBD systems, in contrast, need flexibility and responsiveness: "I need this report now." Also, the live data will not, in general, be integrated into a consistent access framework.

More generally, relational databases aren't optimized for flexibility. In a prototyping environment, a set of YAML files may be far easier to create and use than a set of database tables. Finally, many tools (e.g., OpenCyc, OsiriX, TAP) have their own storage formats. In summary, don't get locked into a single approach.