
[see parent document for copyright information]


Section 1: WebTechs and other validators

What does WebTechs do?
The WebTechs HTML Validation Service (hereinafter simply "WebTechs") is sort of like lint for C. It compares your HTML document to the defined syntax of HTML and reports any discrepancies.
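For instance, the following fragment (an example of my own, not taken from WebTechs' documentation) is not legal HTML, because the <B> and <I> elements overlap instead of nesting; a browser will just guess at what was meant, but a validator will report the error:

    <B><I>This text is improperly nested.</B></I>
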
Why should I validate my HTML pages?
One of the important maxims of computer programming is:
Be conservative in what you produce; be liberal in what you accept.

Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML. Usually this means that the browser will try to make educated guesses about what you probably meant. The problem is that different browsers (or even different versions of the same browser) will make different guesses about the same illegal construct; worse, if your HTML is really pathological, the browser could get hopelessly confused and produce a mangled mess, or even crash.

That's why you want to follow the first half of the maxim by making sure your pages are legal HTML. The best way to do that is by running your documents through one or more HTML validators.

What other validators are there?
Besides WebTechs, the three most commonly mentioned HTML validators are:
    Weblint:  http://www.cre.canon.co.uk/~neilb/weblint/
    HTMLChek: http://uts.cc.utexas.edu/~churchh/htmlchek.html
    The `Kinder, Gentler Validator' (KGV):
              http://ugweb.cs.ualberta.ca/~gerald/validate/

Weblint and HTMLChek are heuristic validators --- that is, they do not completely parse your HTML markup, but simply scan it looking for errors. The advantage of this is that they can detect constructs that are legal HTML but considered "bad style", such as an <IMG> tag without an ALT attribute; the disadvantage is that they can fail to detect some errors. KGV is similar to WebTechs; both operate directly from the HTML language definition, and both strictly obey the rules of SGML. If your document passes either of those two, you know it's clean.
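
For example, both of the following <IMG> tags (the file name is just a placeholder) are legal HTML, so WebTechs and KGV will pass them both, but Weblint will complain that the first one has no ALT text:

    <IMG SRC="photo.gif">
    <IMG SRC="photo.gif" ALT="A photo of my cat">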

I recommend using a combination of validators: one of Weblint or HTMLChek, and one of WebTechs or KGV. Each has features that the others don't, and they complement each other nicely.

How does WebTechs work?
WebTechs is based on James Clark's nsgmls SGML parser. The Validator itself is a CGI script that fetches each of your URLs in turn and passes it through nsgmls.
How does the new error message format work?
WebTechs returns its error messages pretty much exactly as they are generated by nsgmls, which can be a bit cryptic. Here's a sample error:
nsgmls:<OSFD>0:4:41:E: element "SPANKME" undefined

Fields in the error message are separated by colons. The first field just identifies the program generating the error, namely nsgmls. The second field is an artifact of how the WebTechs script communicates with nsgmls, and can be ignored. The third field is the line number in your document at which the error occurred; if you've selected the "Show input" option, this field will be a link to the corresponding line in your source code. The fourth field is the exact character position within that line, starting at 0, at which the error occurred. The fifth field indicates the severity of the error, in this case 'E' for a normal error. The sixth and last field is the actual error message.
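
Laid out field by field, the sample error above breaks down as follows:

    Field 1: nsgmls                        -- the program generating the error
    Field 2: <OSFD>0                       -- WebTechs artifact; ignore it
    Field 3: 4                             -- line number of the error
    Field 4: 41                            -- character position within that line
    Field 5: E                             -- severity ('E' for a normal error)
    Field 6: element "SPANKME" undefined   -- the error message itself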

What do the report options do?
Show Input
The HTML source of your document is included in the report; each error message will have a link to the offending line.
Show Parser Output
The parsed output of nsgmls is included in the report. The syntax of this output is described on James Clark's nsgmls page.
Show Formatted Output
The report includes a link to a temporary document containing the HTML source of your document, so you can see how it is formatted in your browser.
Treat URL Ampersands
This option controls how WebTechs treats ampersands (the '&' character) in the HREF attribute of A elements. If the option is set (as it is by default), ampersands are replaced with the equivalent URL escape sequence '%26', as you can see in the parser output if you have selected it; otherwise, ampersands are left in place, and any character sequence in the URL that looks like an entity reference is treated as one. This primarily affects URLs with CGI parameters, as described in the error section; see the example after this list.
Weblint check
Runs your document through Weblint and includes the results in the report.
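
Here is the example promised under "Treat URL Ampersands": a link to a CGI script with two parameters (the URL is made up for illustration):

    <A HREF="http://www.example.com/cgi-bin/search?name=dsb&age=27">search</A>

With the option set, WebTechs rewrites the '&' as '%26' before parsing. With it unset, the parser sees '&age' and tries to treat it as an entity reference, which will usually produce an error; the SGML-clean way to write such a URL in your document is to use '&amp;' in place of the bare '&'.
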
WebTechs won't let me in. What now?
Like most Web servers, WebTechs is inaccessible from time to time, whether due to scheduled maintenance, network outages or simple server overload. If you can't get in at WebTechs' main site, you may want to try one of the sites that mirror WebTechs. As of 12 Oct 95, there are WebTechs mirrors in Australia, the UK and Austria. You may also want to install a local copy of WebTechs' validation service; WebTechs provides the html-check tool kit for that purpose.
What conformance level should I use?
There's something of an art to this. In most cases, the Wilbur setting is a good first choice; if you're using extensions specific to Netscape or Microsoft, you may have to use those settings instead. There are some tricky spots:
Should I use the strict setting?
Short answer: Not unless you know what you're doing.

Longer answer:
First off, it should be noted that the phrase "strict HTML" is somewhat misleading. It doesn't mean that your document will be validated more strictly; WebTechs is strict enough as it is. ;) "Strict HTML" is a particular well-defined variant of standard HTML, intended to clean up parts of HTML that are less than aesthetically pleasing from an SGML perspective. Here's the scoop:

Far back in the forgotten mists of time, there was HTML 1.0. In HTML 1.0, the <P> tag was a paragraph separator; thus, the proper way to write a body consisting of two paragraphs was:

    <BODY>
    This is a paragraph.
    <P>
    This is another one.
    </BODY>

Today, in HTML 2.0 and 3.0, the <P> element is a paragraph container. Under this new model, our example above parses as:

    <BODY>-+
           |
           +- This is a paragraph.
           |
           +- <P>-+
                  |
                  +- This is another one.
                  |
           +- </P>+
           |
    </BODY>+

Notice what has happened. The second piece of text has been identified as a paragraph, but the first one hasn't. In SGML, an element that can contain both "free-floating" text and block elements like <P> is said to have a mixed content model, and mixed content creates some unpleasant ambiguities in SGML parsing.

In non-strict HTML, several elements have a mixed content model, among them <BODY>, <LI>, <DD> and <TD>. This means we will often have free-floating text whose "purpose" is not specified; in general, browsers and other HTML user agents are expected to treat such text the same way they would treat a paragraph.

In strict HTML, no elements have mixed content models. Thus, our original example is illegal under strict HTML; it would have to be rewritten as:

    <BODY>
    <P>
    This is a paragraph.
    <P>
    This is another one.
    </BODY>

which parses out as:

    <BODY>-+
           |
           +- <P>-+
                  |
                  +- This is a paragraph.
                  |
           +- </P>+
           |
           +- <P>-+
                  |
                  +- This is another one.
                  |
           +- </P>+
           |
    </BODY>+

The problem (a common one in the computer world) is that there are still lots of legacy documents out there written for HTML 1.0, and some legacy browsers that still think <P> is a paragraph separator. Such browsers would probably, for instance, render <LI><P>hmm with the text on a line below the list bullet. I therefore do not recommend modifying your pages to conform to strict HTML, at least not yet.

WebTechs said "No errors found". That means my page is clean, right?
Careful. Did you select "Show input" at the main page? If not, try your page again with that set, and check the output. If you see the contents of your page, then congratulations; your page is indeed clean. If instead you see something like "This document has moved" or an error message, then WebTechs never even saw your page.

What's happening here? Okay, let's say your browser requests the URL http://www.cs.duke.edu/~dsb from my server. That URL is missing the trailing `/', so my server sends back the following:

HTTP/1.0 302 Found
Date: Friday, 31-Dec-99 23:59:59 GMT
Server: NCSA/1.2
MIME-version: 1.0
Location: http://www.cs.duke.edu:80/~dsb/
Content-type: text/html

<HEAD><TITLE>Document moved</TITLE></HEAD>
<BODY><H1>Document moved</H1>
This document has moved <A HREF="http://www.cs.duke.edu:80/~dsb/">here</A>.<P>
</BODY>

The return code 302 means "This document has moved"; the Location: header indicates the new location of the document (in this case, it's the same URL, just with the `/' added). Generally, the browser will automatically transfer to the new URL; a clever browser could also automatically correct the URL in its bookmark file, for instance, or send e-mail to the author of the page from which it got the stale link (as far as I know, no current browser is smart enough to do this, but that's a separate matter). Notice that the server also sent a short note in HTML describing the situation; this is for older browsers that don't recognize the Location: header.

Unfortunately, WebTechs doesn't recognize the Location: header either; it sees the short note and tries to validate that. Now, this tiny note will validate clean under just about any setting, so WebTechs mistakenly reports "No errors found".

How to handle this? Well, if you select "Show input", then the WebTechs results page will contain the HTML source of the server's note, and that should contain a hyperlink to the correct URL; try WebTechs again with that URL.

Arguably, this is a bug in WebTechs; it should really either automatically pick up the new URL or warn you that the URL is incorrect. The Kinder, Gentler HTML Validator does warn about this.

This can also happen with some server error messages. Say, for instance, your browser now requests the corrected URL http://www.cs.duke.edu/~dsb/ from my server, but I forgot to make my home page world-readable. In that case, my server will send back the following:

HTTP/1.0 403 Forbidden
Date: Friday, 31-Dec-99 23:59:59 GMT
Server: NCSA/1.2
MIME-version: 1.0
Content-type: text/html

<HEAD><TITLE>403 Forbidden</TITLE></HEAD>
<BODY><H1>403 Forbidden</H1>
Your client does not have permission to get URL /~dsb/index.html from this server.<P>
</BODY>

Again, WebTechs will fail to realize that an error has occurred and will validate the HTML source of the error message. To be fair, though, WebTechs does catch some server errors; it will, for instance, correctly recognize the error code "404 Not Found", which probably means you have a typo in the URL you submitted.

Help! WebTechs spewed a zillion error messages on my page!
Don't panic. Check to see if the first error is `cannot generate system identifier for public text "(stuff)"' or `cannot open "html.dtd" (No such file or directory)'. In that case, the first line of your page should look something like this:
    <!DOCTYPE HTML PUBLIC "...">

Delete this line and run your page through WebTechs again. If you're lucky, you should get a lot fewer errors. For an explanation of what's happening here, see the discussion of DOCTYPE.
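
For reference, a full DOCTYPE line for the Wilbur (HTML 3.2) setting normally looks like the one below; whether WebTechs accepts a particular public identifier depends on which DTDs its catalog knows about:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">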

If this doesn't help, then you may be experiencing a cascade failure --- one error that gets WebTechs so confused that it can't make sense of the rest of your page. Try correcting the first few errors and running your page through WebTechs again.
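
A classic way to trigger such a cascade (my own illustrative example) is to leave off the closing quote on an attribute value:

    <A HREF="nextpage.html>Next page</A>

The parser keeps reading everything after the missing quote as part of the HREF value, so markup further down the page may get reported as a stream of spurious errors.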

I used the Mozilla setting, but WebTechs still doesn't like my tables.
Unfortunately, this is one of the areas of HTML that Netscape botched pretty badly; some of Mozilla's behavior regarding tables simply cannot be represented in a DTD. The two thorny bits are:
I love your validator/I have a question about your validator.
Well, thanks, but it's not my validator. WebTechs is maintained by Mark Gaither (markg@webtechs.com); questions or problems concerning the validation service itself should go to Mark. This FAQ is maintained separately by me, Scott Bigham (dsb@cs.duke.edu); questions like "This error isn't listed; what does it mean?" and "Are you sure this is illegal HTML?" should go to me.

Last update 05 Oct 97

dsb@cs.duke.edu