Cheap HTML parser
Jim Davis
davis@dri.cornell.edu
July 1994
Aug 1998

This is code for doing simple processing on HTML. It has known bugs and limitations, but it suffices for simple purposes. Among the limitations: this is an HTML parser, not an SGML parser. It does not accept a DTD; instead, the model of HTML is built into the code. It also does not validate the HTML: it will attempt to parse invalid documents, and the results are undefined if the document contains errors.

It runs under perl 4.0 patch level 36. I don't know whether it works with other versions of perl.

This directory contains:

parse-html.pl
    A simple HTML parser written in perl. As it parses the HTML, it calls a routine for each tag encountered, and for whitespace and content. You can redefine these routines to process the HTML document as you need.
html-to-ascii.pl
    Uses the HTML parser to generate a plain ASCII version of an HTML document.
html-ascii.pl
    The actual routines to generate the ASCII.
tformat.pl
    A low-level text formatter used for generating ASCII. More or less like a subset of nroff.
html-to-rfc.pl
    Uses the HTML parser to generate a plain ASCII version of an HTML document, following the special formatting requirements for Internet drafts and RFCs.
rfc.pl
    Additional routines required for RFC formatting (e.g. page headers and footers).
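
To illustrate the callback style described for parse-html.pl (one routine called per tag, and for whitespace and content), here is a minimal sketch in Python rather than perl. It uses Python's standard html.parser module, not the routines in this directory, whose names are not listed here. The class EventLogger and the function log_events are made up for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical example: record one event per tag and per run of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag; tag names arrive lowercased.
        self.events.append(("begin", tag))

    def handle_endtag(self, tag):
        # Called once for each closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Distinguish pure whitespace from real content.
        kind = "whitespace" if data.isspace() else "content"
        self.events.append((kind, data))


def log_events(html):
    """Parse an HTML string and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Overriding a handler method here plays the same role as redefining a routine in the perl parser: the parsing loop stays the same and only the per-event behavior changes.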

Generating RFCs from HTML

The RFC format requires a header and footer containing, among other things, the names of the authors and a short title. You specify values for these fields with META tags, as in the following example.
<META name="status" content="Internet Draft">
<META name="title" content="Internet audio protocol">
<META name="date" content="July 1983">
<META name="author" content="Nixon, Haldeman">
(The META tag is not officially part of HTML; it was proposed by Roy Fielding.) The tags should be in the HEAD.
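
As a rough illustration of how the name/content pairs in such META tags could be gathered from the HEAD, here is a short Python sketch. It is not the code in rfc.pl or html-to-rfc.pl. It relies on Python's standard html.parser module, and the names MetaCollector and collect_meta are invented for this example. It assumes the document has explicit HEAD tags.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical example: collect META name/content pairs found in HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            # Skip META tags that lack either attribute.
            if name is not None and content is not None:
                self.fields[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def collect_meta(html):
    """Return a dict of META fields that appear inside the HEAD."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.fields
```

META tags outside the HEAD are ignored in this sketch, which mirrors the advice above that the tags should be in the HEAD.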

Although I don't see that it's required, it seems to be the custom that RFCs have a left margin of three characters, so this code does that too.

Since table processing doesn't work, I suggest you use the PRE tag for the title page. As a special hack, the very first PRE tag is not indented.
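
The margin rule and the PRE exception can be pictured with the following Python sketch. It is only an illustration and not how tformat.pl or rfc.pl are written. The (kind, text) block representation and the names add_margin and render_blocks are assumptions made for this example.

```python
LEFT_MARGIN = 3  # customary RFC left margin, as noted above


def add_margin(text, margin=LEFT_MARGIN):
    """Prefix every non-blank line of text with `margin` spaces."""
    pad = " " * margin
    lines = text.split("\n")
    return "\n".join(pad + line if line.strip() else line for line in lines)


def render_blocks(blocks):
    """Render a list of (kind, text) blocks, where kind is "pre" or "para".

    The first "pre" block is emitted flush left (the title-page hack);
    every other block gets the left margin.
    """
    seen_pre = False
    rendered = []
    for kind, text in blocks:
        if kind == "pre" and not seen_pre:
            seen_pre = True
            rendered.append(text)
        else:
            rendered.append(add_margin(text))
    return "\n\n".join(rendered)
```

With this arrangement a title page laid out by hand in the first PRE keeps its own column positions, while later PRE blocks line up with the rest of the indented text.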

Known bugs