SGML/XML tools for arbitrary access to entities within document?


I am driving much of my software from data excised from formal documents that describe the algorithms involved in much more detail (and modalities) than is possible with (textual) "source code".

I'm a huge fan of table-driven applications so I often express components of algorithms in tables, then excise the tables from the document and propagate them into the formal "sources" (all mechanically, of course).

This minimizes the chance for typographical errors to creep into the "code" between the documentation and the executable. It also makes it difficult for the code to evolve without the documentation coming along *with* it!

And, of course, it allows things to be expressed in forms that are more intuitive/self-documenting than would otherwise be available with "ASCII text".

So far, I've been creating ad hoc tools to extract the needed components from the documents. The markup language used in the documents is well documented and the way I build my documents makes it fairly easy to isolate the components of interest and "extract them".

For example, to extract a particular table, I invoke a tool I wrote with the command line: gettable TABLETITLE [,] and redirect the output to a file (which is later massaged by an application-specific tool/script to get it into a form suitable for #include in a source file). The tool knows how to parse the (nested) tags of the markup (MU) language until it finds the table having the specified TABLETITLE (string); it then extracts lists of (font,string) tuples for each cell in the specified columns of the table.
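
To make the shape of that extraction concrete, here is a minimal sketch in Python. It is not the actual gettable tool: it assumes the document is available as well-formed XML, and the element and attribute names (table, title, row, cell, run, font) as well as the get_table() function are hypothetical stand-ins for whatever the real markup uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <table title=...> / <row> / <cell> / <run font=...>.
# The real MU language will use different tags; these are assumptions.
SAMPLE = """\
<doc>
  <section>
    <table title="OPCODES">
      <row>
        <cell align="left"><run font="bold">ADD</run></cell>
        <cell><run font="mono">0x01</run></cell>
        <cell><run font="plain">add </run><run font="italic">src</run></cell>
      </row>
      <row>
        <cell align="left"><run font="bold">SUB</run></cell>
        <cell><run font="mono">0x02</run></cell>
        <cell><run font="plain">subtract</run></cell>
      </row>
    </table>
  </section>
</doc>
"""

def get_table(root, title, columns):
    """Return, for each row, the (font, string) runs of each requested column."""
    for table in root.iter("table"):      # finds tables at any nesting depth
        if table.get("title") != title:
            continue
        rows = []
        for row in table.findall("row"):
            cells = row.findall("cell")
            rows.append([
                # Cosmetic attributes such as align are simply never read.
                [(run.get("font"), run.text or "") for run in cells[c].findall("run")]
                for c in columns
            ])
        return rows
    raise KeyError(title)

root = ET.fromstring(SAMPLE)
table = get_table(root, "OPCODES", [0, 1])   # columns are 0-based here
for row in table:
    print(row)
```

The column list plays the role of the command-line arguments, and the nested lists can then be handed to whatever script produces the #include-able form.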

[other tags associated with the cell only contain cosmetic information -- line spacing, text alignment, etc. -- so they can be ignored]

But, I'm looking at other options for a more generic solution to this problem.

E.g., I wrote a formal grammar for the markup language so I can build a specific parser to extract what I need *using* arbitrary parts of that grammar (e.g., if I later decide the *color* of the text in a cell is important -- highly doubtful!).

I'm also looking at building a formal DTD for the MU language and seeing what XML-ish tools exist to do these sorts of things.

The downside of a more "involved"/capable solution is it gets more tedious to maintain -- especially as the MU language evolves! And, testing the tool becomes a project in itself! :<

So, specifically, what sorts of OTS tools (prefer ones with sources that I can modify) exist that will let me do things like parsing to a particular nested tag, verifying the attribute associated with it matches what I seek (e.g., TABLETITLE) then extracting all (and ONLY!) attributes of specific tags contained nested *within* that context?

I.e., I want to be able to specify what parts of the tree to extract based on criteria I specify on a command line.



Don Y

Do you know what XPATH is? It's a widely known, implemented and used language for exactly that.
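
As a rough illustration of what such a query looks like, here is a sketch using the limited XPath subset built into Python's xml.etree.ElementTree. It assumes the documents can be expressed as XML; the element names (table, row, cell, run), the title and font attributes, and the fonts_and_text() helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; tag and attribute names are assumptions.
SAMPLE = """<doc>
  <table title="LIMITS">
    <row><cell><run font="bold">max</run></cell><cell><run font="mono">255</run></cell></row>
    <row><cell><run font="bold">min</run></cell><cell><run font="mono">0</run></cell></row>
  </table>
  <table title="OTHER">
    <row><cell><run font="plain">ignored</run></cell></row>
  </table>
</doc>"""

def fonts_and_text(root, title, column):
    """Select the runs in one column of the table whose title attribute matches."""
    # XPath positions are 1-based: cell[2] is the second <cell> child of each row.
    # The title is pasted into the expression, so it must not contain a quote.
    path = f".//table[@title='{title}']/row/cell[{column}]/run"
    return [(run.get("font"), run.text) for run in root.findall(path)]

root = ET.fromstring(SAMPLE)
result = fonts_and_text(root, "LIMITS", 2)
print(result)
```

The title and column would come from the command line. ElementTree only implements part of XPath; a library such as lxml offers full XPath 1.0 if the queries get more involved.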

DTDs suck. Use XSD if you need that kind of thing. Again, all XML tools have support for it.

Clifford Heath

Thanks, I looked at this. But, after reviewing my code, I see a lot of complexity involved in tracking "state" throughout the table being parsed.

This is because tables can have cells that straddle the underlying (row,column) matrix that is defined by the markup. So, the VISIBLE contents of a cell (i.e., what a human reader perceives as the contents of that cell) may actually be defined in a cell *above* (or, above AND TO THE LEFT) or to the left of this cell. Obviously, you can't retrieve the effective contents of the *desired* cell without examining the rest of the table structure and the attributes in those other cells (indicating where the straddle occurs -- if ever).

So, I'd have to artificially replicate the contents of the effective cell into ALL straddled "markup" cells in order to effectively retrieve contents solely by accessing a specific node. That doesn't really save any labor.
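
For what it's worth, the replication step itself can be sketched in a few lines of Python. This is only an illustration under assumptions: each markup cell is modelled as a hypothetical (text, rows_spanned, cols_spanned) tuple, every position of the matrix is present in the markup, and the straddle is declared on its top-left cell. The real markup attributes may well differ.

```python
def effective_grid(cells):
    """Copy each straddling cell's text into every matrix position it covers.

    cells: row-major list of rows; each cell is a hypothetical
    (text, rows_spanned, cols_spanned) tuple, with text always a string.
    """
    n_rows = len(cells)
    n_cols = len(cells[0]) if cells else 0
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r, row in enumerate(cells):
        for c, (text, rspan, cspan) in enumerate(row):
            if grid[r][c] is not None:
                continue  # already covered by a straddle that started earlier
            # Row-major order guarantees the top-left cell is seen first.
            for rr in range(r, min(r + rspan, n_rows)):
                for cc in range(c, min(c + cspan, n_cols)):
                    grid[rr][cc] = text
    return grid

# "A" straddles 2 rows x 2 columns; the cells it covers are empty in the markup.
MARKUP = [
    [("mode", 1, 1), ("A", 2, 2), ("", 1, 1)],
    [("run",  1, 1), ("",  1, 1), ("", 1, 1)],
    [("stop", 1, 1), ("B", 1, 1), ("C", 1, 1)],
]

grid = effective_grid(MARKUP)
for row in grid:
    print(row)
```

Here grid[1][2] comes back as "A" even though that markup cell is empty, because its visible contents are defined above and to the left. The catch is the one already noted: this still walks the whole table, so it is the same state tracking, just moved into a preprocessing pass.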

*But*, this is a worthwhile tool to keep in mind! Perhaps I can use it to extract other aspects of the documents that are more well-behaved (like figures).

The advantage the DTD has is that it gives you a (reasonably) terse overview of the overall "structure" of the document -- in much the same way a ToC gives you an overview of a document's contents. E.g., the DTD would make it clear that tables are stored in row-major order; that cells can span rows *or* columns; that the contents of a cell can consist of several substrings, each of which can have a particular font tag; etc.

(sigh) I think I'll stick with the ad hoc approach and just keep the tools I've developed in mind when I consider how I construct (organize) the rest of my documents!

Don Y
