Next section: HTML and SGML
Up to main index
[see parent document for copyright information]
lint for C. It compares your HTML document to the defined syntax of HTML and reports any discrepancies.
Be conservative in what you produce; be liberal in what you accept.
Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML. Usually this means that the browser will try to make educated guesses about what you probably meant. The problem is that different browsers (or even different versions of the same browser) will make different guesses about the same illegal construct; worse, if your HTML is really pathological, the browser could get hopelessly confused and produce a mangled mess, or even crash.
That's why you want to follow the first half of the maxim by making sure your pages are legal HTML. The best way to do that is by running your documents through one or more HTML validators.
Weblint: http://www.cre.canon.co.uk/~neilb/weblint/
HTMLChek: http://uts.cc.utexas.edu/~churchh/htmlchek.html
The `Kinder, Gentler Validator' (KGV): http://ugweb.cs.ualberta.ca/~gerald/validate/
Weblint and HTMLChek are heuristic validators --- that is, they
do not completely parse your HTML markup, but simply scan it looking for
errors. The advantage of this is that they can detect constructs that
are legal HTML but considered "bad style", such as an <IMG> tag without
an ALT attribute; the disadvantage is that they can fail to detect some
errors. KGV is similar to
WebTechs; both operate directly from the HTML language definition, and both
strictly obey the rules of SGML. If your
document passes one of these validators, you know it's clean.
I recommend using a combination of validators: one of Weblint or HTMLChek, and one of WebTechs or KGV. Each has features that the others don't, and they complement each other nicely.
WebTechs is built around nsgmls, an SGML parser. The Validator itself
is a CGI script that fetches each of your URLs in turn and passes them
through nsgmls.

The error messages come directly from nsgmls, which can be a bit
cryptic. Here's a sample error:
nsgmls:<OSFD>0:4:41:E: element "SPANKME" undefined
Fields in the error message are separated by colons. The first field
just identifies the program generating the error, namely nsgmls. The
second field is an artifact of how the WebTechs script communicates
with nsgmls, and can be ignored. The third field is the line number at
which the error occurred; if you've selected the "Show input" option,
this field will be a link to the corresponding line in your source
code. The fourth field is the exact character position within the
line, starting at 0, at which the error occurred. The fifth field
indicates the severity of the error, in this case 'E' for a normal
error. The sixth and last field is the actual error message.
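As a rough illustration of that field layout, here is a sketch of how such an error line could be split apart mechanically. The function and its field labels are my own, not nsgmls terminology:

```python
# Split an nsgmls error message into its colon-separated fields.
# Only the first five colons delimit fields; the message text itself
# may contain colons, so the split must be limited.
def parse_nsgmls_error(line):
    program, entity, lineno, column, severity, message = line.split(":", 5)
    return {
        "program": program,     # always "nsgmls"
        "entity": entity,       # artifact of the WebTechs script; ignore
        "line": int(lineno),    # line number at which the error occurred
        "column": int(column),  # character position within the line, from 0
        "severity": severity,   # e.g. 'E' for a normal error
        "message": message.strip(),  # the actual error text
    }

err = parse_nsgmls_error('nsgmls:<OSFD>0:4:41:E: element "SPANKME" undefined')
print(err["line"], err["severity"], err["message"])
# → 4 E element "SPANKME" undefined
```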
nsgmls is included in the output. There is a description of the syntax
of this output on James Clark's nsgmls page.
HREF attribute of A elements. If this option is set (as it is by
default), ampersands are replaced with the equivalent URL escape
sequence '%26', as can be seen in the parser output if you have
selected it; otherwise, ampersands are left in place and any character
sequence in the URL that looks like an entity reference is treated as
such. This primarily affects the treatment of URLs with CGI
parameters, as described in the error section.
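The substitution itself is plain percent-encoding. A minimal sketch of what the option does (my own function, not WebTechs code):

```python
# Replace each '&' in a URL with its escape sequence '%26', as the
# WebTechs option does, so that an '&' separating CGI parameters is
# not mistaken for the start of an entity reference.
def escape_ampersands(url):
    return url.replace("&", "%26")

print(escape_ampersands("/cgi-bin/search?a=1&b=2"))
# → /cgi-bin/search?a=1%26b=2
```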
STYLE element, you may have to use the Microsoft IE 3.0 setting. The
Wilbur DTD has only embryonic support for the STYLE element, as it's a
"consensus" DTD representing the state of the art as of early 1996.
The Cougar setting might also work, but this DTD is still under
development by the W3C and so is subject to change.
Longer answer:
First off, it should be noted that the phrase "strict HTML" is somewhat
misleading. It doesn't mean that your document will be validated more
strictly; WebTechs is strict enough as it is. "Strict HTML" is a
particular well-defined variant of standard HTML, intended to clean up
parts of HTML that are less than aesthetically pleasing from an
SGML perspective. Here's the scoop:
Far back in the forgotten mists of time, there was HTML 1.0. In HTML
1.0, the <P> tag was a paragraph separator; thus, the proper way to
write a body consisting of two paragraphs was:

<BODY>
This is a paragraph.
<P>
This is another one.
</BODY>
Today, in HTML 2.0 and 3.0, the <P> element is a paragraph container.
Under this new model, our example above parses as:

<BODY>-+
       |
       +- This is a paragraph.
       |
       +- <P>-+
       |      |
       |      +- This is another one.
       |      |
       +- </P>+
       |
</BODY>+
Notice what has happened. The second piece of text has been identified
as a paragraph, but the first one hasn't. In SGML, an element that can
contain both "free-floating" text and block elements like <P> is said
to have a mixed content model, and mixed content creates some
unpleasant ambiguities in SGML parsing.

In non-strict HTML, several elements have a mixed content model, among
them <BODY>, <LI>, <DD> and <TD>. This means we will often have
free-floating text whose "purpose" is not specified; in general,
browsers and other HTML user agents are expected to treat such text
the same way they would treat a paragraph.
In strict HTML, no elements have mixed content models. Thus, our original example is illegal under strict HTML; it would have to be rewritten as:
<BODY>
<P>
This is a paragraph.
<P>
This is another one.
</BODY>
which parses out as:
<BODY>-+
       |
       +- <P>-+
       |      |
       |      +- This is a paragraph.
       |      |
       +- </P>+
       |
       +- <P>-+
       |      |
       |      +- This is another one.
       |      |
       +- </P>+
       |
</BODY>+
The problem (a common one in the computer world) is that there are
still lots of legacy documents out there written for HTML 1.0, and
some legacy browsers that still think <P> is a paragraph separator.
Such browsers would probably, for instance, render <LI><P>hmm with the
text on a line below the list bullet. I therefore do not recommend
modifying your pages to conform to strict HTML, at least not yet.
What's happening here? Okay, let's say your browser requests the URL
http://www.cs.duke.edu/~dsb from my server. That's not a valid URL (it
should have a `/' at the end), so my server sends back the following:
HTTP/1.0 302 Found
Date: Friday, 31-Dec-99 23:59:59 GMT
Server: NCSA/1.2
MIME-version: 1.0
Location: http://www.cs.duke.edu:80/~dsb/
Content-type: text/html

<HEAD><TITLE>Document moved</TITLE></HEAD>
<BODY><H1>Document moved</H1>
This document has moved <A HREF="http://www.cs.duke.edu:80/~dsb/">here</A>.<P>
</BODY>
The return code 302 means "This document has moved"; the Location:
header indicates the new location of the document (in this case, it's
the same URL, just with the `/' added). Generally, the browser will
automatically transfer to the new URL; a clever browser could also
automatically correct the URL in its bookmark file, for instance, or
send e-mail to the author of the page from which it got the stale link
(as far as I know, no current browser is smart enough to do this, but
that's a separate matter). Notice that the server also sent a short
note in HTML describing the situation; this is for older browsers that
don't recognize the Location: header.
Unfortunately, WebTechs doesn't recognize the Location: header either;
it sees the short note and tries to validate that. Now, this tiny note
will validate clean under just about any setting, so WebTechs
mistakenly reports "No errors found".
How to handle this? Well, if you select "Show input", then the WebTechs results page will contain the HTML source of the server's note, and that should contain a hyperlink to the correct URL; try WebTechs again with that URL.
Arguably, this is a bug in WebTechs; it should really either automatically pick up the new URL or warn you that the URL is incorrect. The Kinder, Gentler HTML Validator does warn about this.
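The redirect-following behavior described above can be sketched in a few lines. This is an illustrative toy, not any particular browser's or validator's logic:

```python
# Sketch of how a client honors an HTTP redirect: on a 3xx status,
# the Location: header names the new URL to fetch; otherwise the
# client stays with the URL it requested. (WebTechs skips this step
# and validates the short HTML note instead.)
def next_url(status_code, headers, requested_url):
    if 300 <= status_code < 400 and "Location" in headers:
        return headers["Location"]  # transfer to the new location
    return requested_url            # no redirect; proceed as normal

url = next_url(302,
               {"Location": "http://www.cs.duke.edu:80/~dsb/"},
               "http://www.cs.duke.edu/~dsb")
print(url)
# → http://www.cs.duke.edu:80/~dsb/
```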
This can also happen with some server error messages. Say, for
instance, your browser now requests the corrected URL
http://www.cs.duke.edu/~dsb/ from my server, but I forgot to make my
home page world-readable. In that case, my server will send back the
following:
HTTP/1.0 403 Forbidden
Date: Friday, 31-Dec-99 23:59:59 GMT
Server: NCSA/1.2
MIME-version: 1.0
Content-type: text/html

<HEAD><TITLE>403 Forbidden</TITLE></HEAD>
<BODY><H1>403 Forbidden</H1>
Your client does not have permission to get URL /~dsb/index.html from this server.<P>
</BODY>
Again, WebTechs will fail to realize that an error has occurred and will validate the HTML source of the error message. To be fair, though, WebTechs does catch some server errors; it will, for instance, correctly recognize the error code "404 Not Found", which probably means you have a typo in the URL you submitted.
<!DOCTYPE HTML PUBLIC "...">

Delete this line and run your page through WebTechs again. If you're
lucky, you should get a lot fewer errors. For an explanation of what's
happening here, see the discussion of DOCTYPE.
If this doesn't help, then you may be experiencing a cascade failure --- one error that gets WebTechs so confused that it can't make sense of the rest of your page. Try correcting the first few errors and running your page through WebTechs again.
<TABLE BORDER> to indicate the presence of a border) and as a width
specifier (i.e. <TABLE BORDER=5> to specify a five-pixel border).
There's no way to write a DTD that can encompass both of these uses;
WebTechs' Mozilla DTD allows the latter usage and not the former.
<TABLE WIDTH=50> specifies a width of 50 pixels) and relative (i.e.
<TABLE WIDTH=50%> specifies half the page width). The key here is that
without quotes, WebTechs attempts to interpret the attribute value as
a number and chokes on the %. To fix this, just put the attribute
value in quotes, as in <TABLE WIDTH="50%">.
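The underlying rule is that an unquoted SGML attribute value may contain only name characters (roughly letters, digits, '.', and '-'), and '%' is not one of them. A toy check of my own, illustrating the distinction rather than implementing real SGML tokenization:

```python
# An unquoted SGML attribute value must be a simple name/number token
# built from name characters; '%' is not a name character, so
# WIDTH=50% must be written WIDTH="50%" to validate.
def needs_quotes(value):
    return not all(ch.isalnum() or ch in ".-" for ch in value)

print(needs_quotes("50"))    # → False: <TABLE WIDTH=50> is fine unquoted
print(needs_quotes("50%"))   # → True:  <TABLE WIDTH="50%"> needs the quotes
```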
markg@webtechs.com); questions or problems concerning the validation
service itself should go to Mark. This FAQ is maintained separately by
me, Scott Bigham (dsb@cs.duke.edu); questions like "This error isn't
listed; what does it mean?" and "Are you sure this is illegal HTML?"
should go to me.
Sending feedback? Check here first.
Last update 05 Oct 97
dsb@cs.duke.edu