MBD: The Extraction Phase


Rich Morin, rdm@cfcl.com

The Concepts page presents a data flow "model" for an MBD (Model-based Documentation) suite. In this model, the extraction phase is responsible for accessing data sources, selecting and organizing the desired data, etc. Its output should be encoded in a convenient and reliable representation for follow-on analysis or presentation.

This page looks at various kinds of data sources, including system data and hand-edited files, offering hints about how to extract data from them. The question of how this data should be stored is a subject for another page (available RSN :-), but some general comments may be useful at this point.

MBD's extraction phase corresponds roughly to the first part of data warehousing. The "live" data gets collected, filtered, and saved in a manner that eases follow-on analysis (e.g., data mining, OLAP) and presentation. The storage representation should handle structural issues, retaining "interesting" structures from the input data and allowing the addition of structures "discovered" in the extraction or analysis phase.

The actual data will normally be stored in (collections of) files and/or a relational database system. In general, data access in the warehouse should be far faster, easier, and more consistent than it was in the "live" system. If it isn't, you're doing something wrong (:-).

System Data

Computer-based systems maintain large amounts of data. MBD systems can "harvest" this data, extracting information on specific entities: facts, relationships, etc. The resulting information can be used to generate detailed reports, summary plots and diagrams, etc. The first step, however, is data extraction.

The extraction code must access the incoming data, parse it, reject noise, correct errors, and pass the result along in a "convenient" format. The specific tasks are defined by (a) the format and content of the incoming data and (b) the current and expected data needs.

The incoming data may be simple or complex in structure and may arrive in any of a variety of manners (e.g., files, APIs) and formats (e.g., binary, text). The following sections cover some common cases.

Flat Files

Line-oriented "flat file" formats (e.g., CSV files, Unix control files) are usually easy to parse and understand. Each line (including any "continuation" lines) is a record. Individual fields can usually be extracted by regular expressions; special handling may be required for particularly complex formats.

If you have a lot of flat files to parse, consider writing a parameterized input filter. Declarative control files are far easier to maintain than hand-crafted (and nearly identical) "input filters". If your control files are cleanly formatted and well commented, they can serve as documentation for the input files.
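
As a rough illustration of the parameterized approach, here is a minimal Python sketch. The FORMATS table stands in for a declarative control file; its layout, the format names, and the sample records are all assumptions made up for this example. In this sketch, lines that do not match the pattern are silently skipped.

```python
import re

# Hypothetical declarative "control" table: one entry per flat-file format.
# Each entry names the fields and gives a regular expression that captures them.
FORMATS = {
    # Colon-separated, passwd-style records (first four fields only).
    "passwd": {
        "fields": ["user", "password", "uid", "gid"],
        "pattern": r"^([^:]*):([^:]*):(\d+):(\d+):",
        "comment": "#",
    },
    # Whitespace-separated host table: address, then canonical name.
    "hosts": {
        "fields": ["address", "name"],
        "pattern": r"^(\S+)\s+(\S+)",
        "comment": "#",
    },
}


def parse_flat(lines, format_name):
    """Yield one dict per record, driven entirely by the FORMATS entry."""
    spec = FORMATS[format_name]
    regex = re.compile(spec["pattern"])
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith(spec["comment"]):
            continue
        match = regex.match(line)
        if match:
            yield dict(zip(spec["fields"], match.groups()))


# Made-up sample input, for illustration only.
sample = [
    "# user database\n",
    "root:x:0:0:System Administrator:/root:/bin/sh\n",
    "alice:x:501:20:Alice Example:/home/alice:/bin/sh\n",
]
records = list(parse_flat(sample, "passwd"))
```

Supporting a new file format then means adding an entry to the table, not writing another filter.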

Exchange Mechanisms

If a data source is used by more than one program, it is likely to be available via a well-documented data exchange mechanism. For example, it might be a text file, encoded in a documented dialect of a standard data serialization format (e.g., XML, YAML). Alternatively, it might be offered via an API (Application Programming Interface) such as SQL or Apple's Core Data framework.

The exchange mechanism can be expected to handle the "syntax" level (e.g., dividing the data into fields). It may also help with structural or even semantic issues, but there is no guarantee of this. The good news is that you can often extract the needed data without understanding the entire structure and semantics of an arbitrarily complex data source.

Generally, use of an exchange mechanism follows a well-trodden path: study the documentation, decide what data you want, use a query to extract it. That said, there's nothing wrong with speculative data collection: if capturing some extra data is easy, storing it can be a very worthwhile gamble.
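
The "decide what you want, then query for it" step might look something like the following Python sketch, which uses the standard sqlite3 module. The in-memory database, the hosts table, and its columns are hypothetical stand-ins for a real data source.

```python
import sqlite3

# Build a small in-memory database so the sketch is self-contained.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (name TEXT, os TEXT, ram_mb INTEGER)")
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?)",
    [("alpha", "FreeBSD", 2048), ("beta", "Linux", 4096), ("gamma", "Linux", 1024)],
)


def extract_hosts(connection, os_name):
    """Return (name, ram_mb) rows for one operating system, sorted by name."""
    cursor = connection.execute(
        "SELECT name, ram_mb FROM hosts WHERE os = ? ORDER BY name",
        (os_name,),
    )
    return cursor.fetchall()


linux_hosts = extract_hosts(conn, "Linux")
```

Adding another column to the SELECT list is all it takes to capture some extra data speculatively.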

It's also useful to look at the way the data source is used by the system's own code. A program's SQL statements and surrounding code, for example, can provide quite a bit of information on the purpose and interaction of database fields and tables.

If you get lost, look for a support infrastructure (e.g., books, forums, mailing lists, web sites). By asking authors and "resident experts", you may be able to get answers to specific questions (e.g., "What is this field used for?"). If you develop any documentation, be sure to share it with the appropriate community!

Screen Scraping

Wikipedia defines "screen scraping" as:

    ... the act of capturing data from a system or program by capturing and interpreting the contents of some display that is not actually intended for data transport or inspection by programs.

Because the "display" was not intended for use by a program, it may not provide simple (or even reliable) indications of its structure. After all, humans are much better than programs at discerning structure from context, formatting, etc. Fortunately, machine-generated documents tend to have reasonably predictable structures. Once you can handle the base layout and the common variations, you're mostly done. By coding for robustness (e.g., "print a diagnostic and continue"), you can deal with new variations as they appear.
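
Here is a small Python sketch of the "print a diagnostic and continue" style. The report layout (name, size in KB, status) is invented for the example; a real scraper would need patterns for its own base layout and common variations.

```python
import re
import sys

# Hypothetical report layout: "<name>  <size> KB  <status>".
ROW = re.compile(r"^(\S+)\s+(\d+)\s+KB\s+(\w+)$")


def scrape_report(lines, diagnostics=sys.stderr):
    """Return parsed rows; report unrecognized lines and keep going."""
    rows = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Blank lines, rulers, and the column header carry no data.
        if not text or text.startswith("---") or text.startswith("Name "):
            continue
        match = ROW.match(text)
        if match is None:
            print(f"line {number}: unrecognized: {text!r}", file=diagnostics)
            continue
        name, size, status = match.groups()
        rows.append({"name": name, "size_kb": int(size), "status": status})
    return rows


# Made-up sample display, including one line the pattern does not expect.
report = [
    "Name      Size     Status",
    "--------------------------",
    "alpha     120 KB   ok",
    "beta      n/a      failed",
    "gamma     42 KB    ok",
]
rows = scrape_report(report)
```

The unexpected "n/a" line produces a diagnostic without stopping the run, so the new variation can be handled later.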

Of course, if the generating program gets changed in a way that modifies the document structure, your extraction program will "break". So, it's a good idea to push for machine-friendly (e.g., XML) forms of any documents you rely upon.

These problems aside, screen scraping may involve assorted low-level formats: "plain text" (e.g., log files, nroff output), HTML, PDF, PostScript, etc. Access methods are often available for binary formats, but you'll have to handle most text-based formats yourself. Here are some hints...

Web Pages

If you only need a few items from a web page, you may be able to extract them using special-purpose code. For example, if a particular snippet of HTML always appears in a line by itself, or in a particular place in a table, you may be able to recognize it and handle it as a special case. Be aware that this "hack" may be brittle in the face of even minor formatting variations.

If you need to extract a lot of data, consider transforming the page into XHTML, a formalized dialect of HTML. Because XHTML files comply with XML syntax, you can load them with your favorite XML-handling tools. HTML Tidy will perform this conversion, as well as cleaning up ugly (e.g., Microsoft Word) HTML.

Although the resulting document will be valid XML, you'll still have some "detective work" to do. Unlike XML that was generated for information exchange, converted HTML is likely to have structural irregularities (e.g., interleaved tags and text). Worse, you won't have a schema to guide you in figuring out the variations.
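
Once a page has been converted, loading it can be quite short. The following Python sketch uses the standard xml.etree.ElementTree module on a tiny, made-up XHTML page; the table contents are assumptions for illustration. Note the use of itertext() to cope with interleaved tags and text. Real converted pages may also carry a DOCTYPE and HTML entities, which this parser may reject without extra handling.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# A tiny stand-in for a page that has already been converted to XHTML.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Disk Usage</title></head>
  <body>
    <table>
      <tr><th>Volume</th><th>Used</th></tr>
      <tr><td>/</td><td>61%</td></tr>
      <tr><td>/home</td><td><b>93%</b> (warning)</td></tr>
    </table>
  </body>
</html>"""


def table_rows(xhtml_text):
    """Return the text of each data row's cells, flattening nested markup."""
    root = ET.fromstring(xhtml_text)
    rows = []
    for tr in root.iter(XHTML_NS + "tr"):
        cells = tr.findall(XHTML_NS + "td")
        if cells:  # header rows contain only <th> elements
            rows.append(["".join(td.itertext()).strip() for td in cells])
    return rows


rows = table_rows(PAGE)
```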

PostScript, PDF, etc.

PostScript is a text-based "page description language". Although it is actually an imperative, Turing-complete programming language, this is seldom a real problem for parsing, etc. RPN syntax aside, most PostScript commands are used as formatting declarations. So, it is generally possible to examine a representative PostScript document, determine what "idioms" are being used, and write a specialized script to extract desired information.
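
As a sketch of the "idiom" approach, suppose a representative document turns out to place its text with simple "(string) show" commands. That is an assumption about one particular generator, not a property of PostScript in general; many programs use other operators, custom procedures, or re-encoded fonts. Under that assumption, a Python script along these lines would pull out the strings:

```python
import re

# Matches "(text) show", allowing backslash escapes inside the string.
# Assumption: the generator emits simple strings without nested parentheses.
SHOW = re.compile(r"\(((?:[^()\\]|\\.)*)\)\s*show\b")


def shown_strings(postscript_text):
    """Return the strings passed to the show operator, in document order."""
    results = []
    for match in SHOW.finditer(postscript_text):
        # Undo the escapes for parentheses and backslashes only.
        results.append(re.sub(r"\\([()\\])", r"\1", match.group(1)))
    return results


# Made-up sample document, for illustration only.
SAMPLE = r"""%!PS
/Helvetica findfont 12 scalefont setfont
72 720 moveto (Quarterly Report) show
72 700 moveto (Total \(USD\): 1200) show
showpage
"""
strings = shown_strings(SAMPLE)
```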

An alternative approach uses the fact that PostScript files are, in fact, programs. By editing the PostScript code (and/or overriding selected operators), it is possible to make a document log information about itself (e.g., to standard output or a designated file). If the idea of hacking PostScript doesn't appeal to you, however, read on.

PDF (Portable Document Format) is a binary, declarative translation of PostScript. Although a binary format may seem daunting, there are libraries and other tools which can help in parsing PDF. There are also reliable tools for PostScript translation. So, you may want to convert all your incoming PostScript documents to PDF, then parse them all with the same tool(s).

Here are some useful tools for dealing with PDF and PostScript documents. Although there is some overlap in their capabilities, all three are well worth having.

  • Ghostscript

    Ghostscript is a powerful and flexible set of tools for processing PDF and PostScript files. It can be used to render documents, translate between formats, etc.

  • pdftk

    pdftk (PDF ToolKit) is a command-line tool for manipulating PDF files. It performs a number of specialized functions (e.g., applying watermarks, encrypting and decrypting documents, merging and splitting documents, updating PDF metadata).

  • Xpdf

    Although Xpdf is billed as a "PDF viewer" for the X Window System, it is far more than this. The suite includes tools to extract images and text, translate PDF to PostScript, etc. Parts of Xpdf are used by other utilities, such as search engines (e.g., Swish-e) and PDF viewers.

For more tools (and tricks), see "PDF Hacks: 100 Industrial-Strength Tips & Tools" (Sid Steward; O'Reilly, 2004). It's an easy read and will give you some ideas about unconventional ways to use PDF. If you get serious, you'll also want to get a copy of Adobe's "PDF Reference" (sixth edition). It contains well over 1000 pages of definitive and detailed information.

There are dozens of books on PostScript, ranging from introductions to reference manuals. Again, the Adobe books are definitive, but you may want to look at some others as a way to get started, explore undocumented areas, etc.

Arcane Formats

The data you need may be stored in some arcane format. Examples include binary libraries, spreadsheets, word processing documents, and source code files. Rather than researching (or worse, reverse-engineering) the format, look around for a library or command-line tool that already knows how to parse it.

For example, you might want to extract linkage information from binary library files. The nm command (and variations) does a fine job of generating reports on library and object files. These reports have a regular format and are easy to parse.
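
Parsing such a report can be as simple as splitting each line on white space. The Python sketch below assumes the common default layout (address, type letter, symbol name, with no address shown for undefined symbols); the sample text is made up, and the exact output varies by platform and options.

```python
def parse_nm(output):
    """Parse default-format nm output into (address, type, name) tuples."""
    symbols = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3:
            address, kind, name = fields
            symbols.append((int(address, 16), kind, name))
        elif len(fields) == 2:
            # Undefined symbols have no address column.
            kind, name = fields
            symbols.append((None, kind, name))
        # Anything else (blank lines, "file.o:" headers) is ignored here.
    return symbols


# Made-up sample report, for illustration only.
SAMPLE = """\
0000000000000000 T main
                 U printf
0000000000000004 C counter
"""
symbols = parse_nm(SAMPLE)
undefined = [name for address, kind, name in symbols if kind == "U"]
```

In practice, the text could come from running the command itself, e.g., via subprocess.run(["nm", path], capture_output=True, text=True).stdout.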

Perl's CPAN (Comprehensive Perl Archive Network) has modules that import spreadsheets and many other oddball files. Other scripting languages (e.g., Python, Ruby) have similar online collections.

Finally, you may be able to find a tool that reads the file(s) and exports the data into XML or some other parsable format. Doxygen, for example, will peruse source code files in several languages, dumping its accumulated knowledge as XML.

Hand-edited Files

There are times when you'll need to augment your machine-harvested data with hand-edited information. The supplementary data may have been collected from interviews, copied and pasted from a PDF document, or obtained in some other manner. Regardless, the current objective is to encode it for convenient use by your documentation scripts.

The ideal file format for this purpose would be flexible and powerful, supported by an active user community, easy to read and edit, and a good "fit" for scripting languages such as Perl, Python, and Ruby. Flat files (e.g., CSV) fail the first test: a two-dimensional array is neither flexible nor powerful. Any format that you might cobble up on your own fails the second test (user base). XML fails the last two tests; nobody but a masochist likes to edit XML or traverse its data structures.

YAML

Fortunately, YAML (YAML Ain't Markup Language) meets all of these criteria quite handily:

  • YAML is supported by a small but active user community. Though the pace toward standardization has been glacial, a formal standard is nearly finished. In any case, a de facto standard has been in place for several years.

  • Because YAML's data structures are based upon scalars (e.g., numbers and strings), mappings (i.e., dictionaries, hashes), and lists, YAML files are quite powerful, yet very easy to load and use.

Finally, I'll let you judge YAML's flexibility and readability for yourself:


  # This is a comment
  abc:
    -  123
    -  def:  'fed'
       ghi:  'ihg'

Assuming that this text was loaded into a Perl data structure referenced by $r, the expression $r->{abc}[0] would have the value 123. The expression $r->{abc}[1]{def} would have the value 'fed'.
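
For comparison, here is a rough Python equivalent of the sample above. It assumes that PyYAML (a third-party package) is available and uses its safe_load function; if the package is missing, the sketch falls back to a literal showing the structure the loader is expected to produce.

```python
SAMPLE = """\
# This is a comment
abc:
  -  123
  -  def:  'fed'
     ghi:  'ihg'
"""

try:
    import yaml  # PyYAML, a third-party package (assumed to be installed)
    r = yaml.safe_load(SAMPLE)
except ImportError:
    # Fallback: the structure the loader is expected to produce.
    r = {"abc": [123, {"def": "fed", "ghi": "ihg"}]}

first = r["abc"][0]           # the number 123
second = r["abc"][1]["def"]   # the string 'fed'
```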

The ability to create and edit data structures in one window, then test them out in another, is incredibly seductive. I have edited tens of thousands of lines of YAML, using it to store a wide variety of data. By post-processing the encoded text strings, I have even created (declarative) "little languages" of various sorts, using a YAML loader to handle the first-order parsing. The FSW case study page and this tutorial contain more information on YAML.
