htmlchek version 4.1, February 20, 1995

htmlchek was an early HTML-checking program. It hasn't been updated since early 1995, and is not likely to be: I've somewhat lost interest in the project, this 1995 version is still mostly adequate for my non-bleeding-edge webmastering needs, and in any case the way it was programmed would not make it easy to accommodate various newer HTML standards in the full detail necessary for scrupulously correct validation. I'm leaving this page up mainly because certain supplemental HTML manipulation and filtering utilities included in the htmlchek distribution may still be useful to web-page authors who use Awk and/or Perl for text processing. (However, if you have some interest in the validation component, note that the tags it checks for can be custom-configured to some degree. I've removed the documentation for the validation component from these web pages, but it's included in the htmlchek source distribution in both plaintext and HTML forms.)

Supplemental HTML-file processing programs: dehtml, entify, and metachar

dehtml

dehtml removes all HTML markup from a file so you can spell-check the darn thing. The more common [HTML 2.0] ampersand entities are translated to the appropriate single characters, so you can still spell-check when writing in a non-English language, provided your spelling checker understands 8-bit Latin-1 alphabetic characters. Note that dehtml makes no pretensions to being an intelligent HTML-to-text translator; it completely ignores everything within <...>, and passes everything outside <...> through completely unaltered (except known ampersand entities).
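
To make this concrete, here is a minimal Perl sketch of the kind of filtering dehtml performs. This is not the actual dehtml source: a real implementation must also handle tags split across lines and a much larger entity table.

#!/usr/bin/perl
# dehtml-sketch.pl -- an illustrative sketch, not the real dehtml
while (<>) {
    s/<[^>]*>//g;              # drop markup (naively assumes no tag spans two lines)
    s/&lt;/</g;  s/&gt;/>/g;   # translate a few common entities...
    s/&eacute;/\xe9/g;         # ...including one Latin-1 alphabetic character
    s/&amp;/&/g;               # decode "&amp;" last, to avoid double-decoding
    print;
}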

Typical command lines:

awk -f dehtml.awk infile.html > outfile.txt
perl dehtml.pl infile.html > outfile.txt

The shell script file dehtml.sh runs dehtml.awk using the best available interpreter (under Unix / Posix 1003.2):

sh dehtml.sh infile.html > outfile.txt

This program processes all files named on the command line and writes the combined result to STDOUT; to process a number of files individually, use your shell's iteration mechanism; for example:

for a in *.html ; do awk -f dehtml.awk "$a" > otherdir/"$a" ; done

in Unix sh, or:

for %a in (*.htm) do call dehtml %a otherdir\%a

in MS-DOS, where dehtml.bat is the following one-line batch file:

gawk -f dehtml.awk %1 > %2

While dehtml isn't primarily an error-checking program, if it does encounter errors that affect its operation (or HTML constructs beyond its capacity to handle), it reports them on lines beginning with "&&^", intermixed with the normal output.

entify

The relatively tiny entify program translates Latin-1 high-bit alphabetic characters in a file into HTML ampersand entities, for safety when moving the file through transport mechanisms that are not 8-bit clean (principally non-MIME RFC 822 e-mail and Usenet news). This is for the convenience of those writing European languages with editors that use Latin-1 characters; entify can be run just before distributing an HTML file externally.

Typical command lines:

awk -f entify.awk infile.8bit > outfile.html
perl entify.pl infile.8bit > outfile.html

(Note that entify doesn't help in checking whether an HTML file is OK; rather, it is a precautionary measure to prevent the file from being mangled by archaic 7-bit software.)
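
To illustrate the idea (this is not the actual entify source), a Perl sketch might map each high-bit character through a table of entity names, falling back on a numeric character reference; a fuller table would cover all the Latin-1 alphabetics:

#!/usr/bin/perl
# entify-sketch.pl -- an illustrative sketch, not the real entify
%ent = ("\xe9" => "&eacute;", "\xfc" => "&uuml;", "\xe7" => "&ccedil;");
while (<>) {
    # Known characters get a named entity; anything else high-bit, a numeric one.
    s/([\x80-\xff])/$ent{$1} || sprintf("&#%d;", ord($1))/ge;
    print;
}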

metachar

This relatively trivial script protects the HTML/SGML metacharacters `&', `<', and `>' by replacing them with the appropriate ampersand entity references; it is useful for importing plain text into an HTML file. Typical command lines:

awk -f metachar.awk infile.text > outfile.htmltext
perl metachar.pl infile.text > outfile.htmltext
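
The whole job can be sketched in three Perl substitutions (an illustration, not the actual metachar source); the one subtlety is that `&' must be escaped first, since the other two replacements themselves introduce ampersands:

#!/usr/bin/perl
# metachar-sketch.pl -- an illustrative sketch, not the real metachar
while (<>) {
    s/&/&amp;/g;   # must come first, or &lt;/&gt; would be re-escaped
    s/</&lt;/g;
    s/>/&gt;/g;
    print;
}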

Supplemental link extraction programs: makemenu and xtraclnk.pl

makemenu:

This program creates a simple menu for HTML files specified on the command line; the text in each input file's <TITLE>...</TITLE> element is placed in a link to that file in the output menu file. If the toc=1 command-line option is specified, makemenu also includes a simple table of contents for each input file in the menu, based on the file's <H1>-<H6> headings, interpreted as a system of sub-sections, with appropriate indenting.

If there are links inside headings, then makemenu will attempt to preserve the validity of <A HREF="..."> references, and will transform an <A NAME="..."> into an <A HREF="..."> link from the menu file to the heading. However, makemenu is limited by the fact that it does not examine each <A> tag in a heading individually; it only performs global search-and-replace operations on the whole <Hn>...</Hn> element. (For this reason, the values of <A HREF=> and <A NAME=> are operated on only if they are quoted.)

In general, makemenu is a small and somewhat simple program, not an error-checker, so it is possible to confuse it with erroneous or bizarrely-formatted HTML input. The following are typical command lines (makemenu.sh is a Unix / Posix 1003.2 shell script that runs makemenu.awk under the best available awk interpreter, with option checking):

awk -f makemenu.awk [options] infiles.html > menu.html
perl makemenu.pl [options] infiles.html > menu.html
sh makemenu.sh [options] infiles.html > menu.html
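
For a rough idea of the core menu-building step, here is a Perl sketch (not the actual makemenu source; it ignores the toc=1 machinery and the heading/anchor handling described above, and assumes each <TITLE>...</TITLE> element fits on a single line):

#!/usr/bin/perl
# makemenu-sketch.pl -- an illustrative sketch, not the real makemenu
# One menu entry per input file, built from its <TITLE> text.
print "<UL>\n";
while (<>) {
    if (m{<TITLE>(.*?)</TITLE>}i) {
        print qq{<LI><A HREF="$ARGV">$1</A>\n};
        close(ARGV);              # done with this file; move on to the next
    }
}
print "</UL>\n";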

Further documentation is included as comments at the beginning of the makemenu.awk and makemenu.pl source files.

xtraclnk.pl:

This program extracts links and anchors from HTML files, isolating the text contained in <A> and <TITLE> elements. It copies <A HREF="...">Text</A> references from an input file to the output, and converts the text of <TITLE>Text</TITLE> elements and <A NAME="...">Text</A> anchors into references to the input file in which they were found. The output of xtraclnk.pl thus contains essentially nothing but <A HREF="...">Text</A> links (which can optionally be sandwiched between a minimal HTML header and footer, so that the output is itself a valid HTML file).
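
In outline, the extraction works something like the following Perl sketch (the real xtraclnk.pl is more careful; this version assumes each element fits on one line and that attribute values are double-quoted):

#!/usr/bin/perl
# xtraclnk-sketch.pl -- an illustrative sketch, not the real xtraclnk.pl
while (<>) {
    # Pass existing links through unchanged.
    while (m{(<A\s+HREF="[^"]*">.*?</A>)}ig) { print "$1\n"; }
    # Turn named anchors into links back to the file they were found in.
    while (m{<A\s+NAME="([^"]*)">(.*?)</A>}ig) {
        print qq{<A HREF="$ARGV#$1">$2</A>\n};
    }
    # Turn the title into a link to the file itself.
    if (m{<TITLE>(.*?)</TITLE>}i) {
        print qq{<A HREF="$ARGV">$1</A>\n};
    }
}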

This was suggested by an idea of John Harper at Toronto U.; what he had in mind, I think, was to use it as part of a CGI script that would dynamically construct an HTML document with links to all files whose title or anchors contain text matching a user-specified search pattern. However, xtraclnk.pl also has some value as an HTML style-debugging tool: if you have used a lot of context-dependent titles like "Intro", and meaningless link text like "Click Here", this will be very apparent when you view the HTML document (derived with xtraclnk.pl using the title= option) which contains only the text inside the titles and anchors of your other HTML documents. The program can also be used to enforce consistency in link text: if there is random variation among different <A HREF="...">LinkText</A> elements that all point to the same resource, this will be apparent when the output of xtraclnk.pl is sorted.

Also, looking over the sorted output of xtraclnk.pl makes it relatively easy to detect mistaken links that point somewhere other than where they were intended.

Further documentation is in comments at the beginning of the source file itself.

Download information for htmlchek

A .ZIP archive (containing files with Unix (LF) line breaks) is available by HTTP from this site.

Author:

Copyright H. Churchyard 1994, 1995 -- freely redistributable. This code is functional but not very well commented -- sorry!

If you get an awk error under Unix, the most common problem is inadvertently running the incompatible ``old awk''; also, some vendor-supplied awks under Unix have problems of their own (these can be avoided by using gawk from GNU/FSF).

Meaningless Bitmap Graphic:

No Web document would be complete without including a meaningless bitmap graphic.

Author: H. Churchyard -- churchh@crossmyt.com