SGML/XML tools for arbitrary access to entities within document?

Question

Hi,

I am driving much of my software from data excised from formal documents that describe the algorithms involved in much more detail (and in more modalities) than is possible with (textual) "source code".

I'm a huge fan of table-driven applications so I often express components of algorithms in tables, then excise the tables from the document and propagate them into the formal "sources" (all mechanically, of course).

This minimizes the chance for typographical errors to creep into the "code" between the documentation and the executable. It also makes it difficult for the code to evolve without the documentation coming along *with* it!

And, of course, it allows things to be expressed in forms that are more intuitive/self-documenting than would otherwise be available with "ASCII text".

So far, I've been creating ad hoc tools to extract the needed components from the documents. The markup language used in the documents is well documented and the way I build my documents makes it fairly easy to isolate the components of interest and "extract them".

For example, to extract a particular table, I invoke a tool I wrote with the command line: gettable TABLETITLE [,] and redirect the output to a file (which is later massaged by an application-specific tool/script to get it into a form suitable for #include in a source). The tool knows how to parse the (nested) tags of the MU language until it finds the table having the specified TABLETITLE (string); it then extracts lists of (font,string) tuples for each cell in the specified columns of the table.

[other tags associated with the cell only contain cosmetic information -- line spacing, text alignment, etc. -- so they can be ignored]
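As an illustration of the "massage into a form suitable for #include" step, here is a minimal sketch in Python. It assumes the extracted table has already been reduced to rows of plain strings; the function name `rows_to_c_initializer` and the output layout are hypothetical, not the poster's actual tool.

```python
def rows_to_c_initializer(name, rows):
    """Render rows of string cells as lines of a C array initializer.

    Hypothetical helper: `name` is only used in the generated comment,
    and `rows` is a list of lists of strings (one inner list per table row).
    """
    lines = [f"/* generated from table '{name}'; do not edit by hand */"]
    for row in rows:
        # Escape backslashes first, then double quotes, so each cell
        # becomes a valid C string literal.
        cells = ", ".join(
            '"' + c.replace("\\", "\\\\").replace('"', '\\"') + '"' for c in row
        )
        lines.append("    { " + cells + " },")
    return "\n".join(lines) + "\n"
```

The generated text could be written to a file and pulled into a source with #include inside an array definition, so the document stays the single place where the table is edited.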

But, I'm looking at other options for a more generic solution to this problem.

E.g., I wrote a formal grammar for the markup language so I can build a specific parser to extract what I need *using* arbitrary parts of that grammar (e.g., if I later decide the *color* of the text in a cell is important -- highly doubtful!).

I'm also looking at building a formal DTD for the MU language and seeing what XML-ish tools exist to do these sorts of things.

The downside of a more "involved"/capable solution is it gets more tedious to maintain -- especially as the MU language evolves! And, testing the tool becomes a project in itself! :<

So, specifically, what sorts of OTS tools (prefer ones with sources that I can modify) exist that will let me do things like parsing to a particular nested tag, verifying the attribute associated with it matches what I seek (e.g., TABLETITLE) then extracting all (and ONLY!) attributes of specific tags contained nested *within* that context?

I.e., I want to be able to specify what parts of the tree to extract based on criteria I specify on a command line.

Thx,

--don

Clifford Heath · Accepted Answer

Do you know what XPath is? It's a widely known, implemented and used language for exactly that. DTDs suck. Use XSD if you need that kind of thing. Again, all XML tools have support for it.
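To make the suggestion concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. The element and attribute names (`table`, `title`, `row`, `cell`, `font`) are assumptions for illustration only; the real markup language in the question is not XML and would first need converting or a DTD-style mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a document; names are illustrative only.
SAMPLE = """
<doc>
  <section>
    <table title="Opcodes">
      <row><cell font="Courier">ADD</cell><cell font="Times">Add registers</cell></row>
      <row><cell font="Courier">SUB</cell><cell font="Times">Subtract registers</cell></row>
    </table>
    <table title="Other">
      <row><cell font="Times">x</cell></row>
    </table>
  </section>
</doc>
"""

def get_table(root, title, columns):
    """Return (font, string) tuples for the given 1-based columns of one table.

    The table is located at any nesting depth by an XPath attribute predicate.
    Caveat: the title is pasted into the expression, so a title containing
    a single quote would need different handling.
    """
    table = root.find(f".//table[@title='{title}']")
    if table is None:
        raise KeyError(title)
    result = []
    for row in table.findall("row"):
        cells = row.findall("cell")
        result.append([(cells[c - 1].get("font"), cells[c - 1].text) for c in columns])
    return result

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for row in get_table(root, "Opcodes", [1, 2]):
        print(row)
```

The title and column list could come straight from a command line, much like the gettable invocation in the question. A fuller XPath engine would allow more elaborate selection criteria than ElementTree's subset.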

Don Y · Answer

Hi Clifford,

Thanks, I looked at this. But, after reviewing my code, I see a lot of complexity involved in tracking "state" throughout the table being parsed. This is because tables can have cells that straddle the underlying (row,column) matrix that is defined by the markup.

So, the VISIBLE contents (i.e., what a human reader perceives as the contents of that cell) of a cell may actually be defined in a previous row *above* (or, above AND TO THE LEFT) or to the left of this cell. Obviously, you can't retrieve the effective contents of the *desired* cell without examining the rest of the table structure and the attributes in those other cells (indicating where the straddle occurs -- if ever).

So, I'd have to artificially replicate the contents of the effective cell into ALL straddled "markup" cells in order to effectively retrieve contents solely by accessing a specific node. That doesn't really save any labor.

*But*, this is a worthwhile tool to keep in mind! Perhaps I can use it to...
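The replication step described above can be sketched as a single pre-pass over the table. This is only a rough illustration in Python under stated assumptions: every markup cell is present in a full (row,column) matrix, and each cell carries hypothetical `row_span` and `col_span` counts (1 meaning no straddle). The real markup's attribute names and semantics may differ.

```python
def effective_contents(matrix):
    """Copy each straddling cell's text into the markup cells it covers.

    matrix[r][c] is a (text, row_span, col_span) tuple. Covered cells are
    assumed to still exist in the markup as placeholders, and spans are
    assumed to stay inside the matrix.
    """
    effective = [[text for text, _, _ in row] for row in matrix]
    covered = set()
    for r, row in enumerate(matrix):
        for c, (text, row_span, col_span) in enumerate(row):
            if (r, c) in covered:
                continue  # placeholder hidden under an earlier straddle
            for dr in range(row_span):
                for dc in range(col_span):
                    if (dr, dc) != (0, 0):
                        effective[r + dr][c + dc] = text
                        covered.add((r + dr, c + dc))
    return effective
```

After this pass, any single (row,column) lookup returns what a reader would see in that position, whether the defining cell sits above, to the left, or above and to the left. The state tracking still has to happen somewhere; this only concentrates it in one step ahead of any node-based query.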
