@update gess 4/1/98
||EXPAT CALLBACKS
||Pass a buffer to Expat. If Expat is blocked aBuffer should be null and
aLength should be 0. The result of the call will be stored in
mInternalState. Expat will parse as much of the buffer as it can and store
the rest in its internal buffer.
@param aBuffer the buffer to pass to Expat. May be null.
@param aLength the length of the buffer to pass to Expat (in number of
char16_t's). Must be 0 if aBuffer is null and > 0 if
aBuffer is not null.
@param aIsFinal whether there will definitely not be any more new buffers
passed in to ParseBuffer
@param aConsumed [out] the number of char16_t's that Expat consumed. This
doesn't include the char16_t's that Expat stored in
its buffer but didn't parse yet.
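The parameter contract above can be sketched as a small checked wrapper. This is only an illustration: `FakeParser`, `ArgumentsAreValid` and `ParseBufferSketch` are hypothetical names, no real Expat call is made, and the rule "parse up to the last '>'" is an invented stand-in for whatever Expat actually manages to parse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the parser state; this is not the Expat API.
struct FakeParser {
  bool blocked = false;
  std::size_t parsedUnits = 0;  // total char16_t units parsed so far
};

// Documented contract: aLength must be 0 if aBuffer is null and > 0 if
// aBuffer is not null.
bool ArgumentsAreValid(const char16_t* aBuffer, std::uint32_t aLength) {
  return aBuffer ? aLength > 0 : aLength == 0;
}

// Sketch of the calling convention. The fake parser "parses" everything up
// to and including the last '>' (an assumption made for this example) and
// leaves the tail unparsed, so *aConsumed can be smaller than aLength.
// When aIsFinal is true the whole buffer is treated as parsed.
void ParseBufferSketch(FakeParser& aParser, const char16_t* aBuffer,
                       std::uint32_t aLength, bool aIsFinal,
                       std::uint32_t* aConsumed) {
  assert(ArgumentsAreValid(aBuffer, aLength));
  assert(aConsumed);
  *aConsumed = 0;
  if (aParser.blocked || !aBuffer) {
    return;  // nothing new to parse
  }
  std::uint32_t end = 0;
  for (std::uint32_t i = 0; i < aLength; ++i) {
    if (aBuffer[i] == u'>') {
      end = i + 1;
    }
  }
  if (aIsFinal) {
    end = aLength;
  }
  aParser.parsedUnits += end;
  *aConsumed = end;
}
```

A caller would compare `*aConsumed` with `aLength` to learn how much of the buffer was parsed rather than merely stored.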
||This file contains the list of all HTML tags.
See nsHTMLTags.h for access to the enum values for tags.
It is designed to be used as input to various places that will define the
HTML_TAG macro in useful ways through the magic of C preprocessing.
Additionally, it is consumed by the self-regeneration code in
ElementName.java from which nsHtml5ElementName.cpp/h is translated.
If you edit this list, you need to re-run ElementName.java
self-regeneration and the HTML parser Java to C++ translation.
All entries must be enclosed in the macro HTML_TAG which will have cruel
and unusual things done to it.
It is recommended (but not strictly necessary) to keep all entries
in alphabetical order.
The first argument to HTML_TAG is the tag name. The second argument is the
"creator" method of the form NS_New$TAGNAMEElement, that will be used by
nsHTMLContentSink.cpp to create a content object for a tag of that
type. Use NOTUSED if the particular tag has a non-standard creator.
The third argument is the interface name specified for this element
in the HTML specification. It can be empty if the relevant interface name
The HTML_OTHER macro is for values in the nsHTMLTag enum that are
not strictly tags.
Entries *must* use only lowercase characters.
Don't forget to update /editor/libeditor/HTMLEditUtils.cpp as well.
** Break these invariants and bad things will happen. **
||Declare the enum list using the magic of preprocessing
enum values are "eHTMLTag_foo" (where foo is the tag)
To change the list of tags, see nsHTMLTagList.h
These enum values are used as array indices in various places.
If we change the structure of the enum by adding entries to it or removing
entries from it _directly_, not via nsHTMLTagList.h, don't forget to update
dom/bindings/BindingUtils.cpp and dom/html/nsHTMLContentSink.cpp as well.
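The preprocessing technique described above is commonly known as an X-macro. The following is a minimal sketch of it, not the real tag list: the three entries, the `SKETCH_TAG_LIST` wrapper and the `eSketchTag_` prefix are all made up for illustration, and the list is passed a macro as a parameter instead of being re-included as a header so that the example fits in one file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical three-entry list standing in for the real tag list.
// Arguments: tag name, creator suffix, interface name.
#define SKETCH_TAG_LIST(HTML_TAG) \
  HTML_TAG(a, Anchor, Anchor)     \
  HTML_TAG(body, Body, Body)      \
  HTML_TAG(div, Div, Div)

// First expansion: one enum value per entry, in list order.
enum SketchTag {
  eSketchTag_unknown = 0,
#define TAG_TO_ENUM(tag_, creator_, iface_) eSketchTag_##tag_,
  SKETCH_TAG_LIST(TAG_TO_ENUM)
#undef TAG_TO_ENUM
  eSketchTag_count
};

// Second expansion: a name table indexed by the enum values above.
static const char* const kSketchTagNames[] = {
  "",
#define TAG_TO_NAME(tag_, creator_, iface_) #tag_,
  SKETCH_TAG_LIST(TAG_TO_NAME)
#undef TAG_TO_NAME
};

// Both expansions come from the same list, so they cannot drift apart.
static_assert(sizeof(kSketchTagNames) / sizeof(kSketchTagNames[0]) ==
                  static_cast<std::size_t>(eSketchTag_count),
              "name table must have one slot per enum value");

const char* SketchTagName(SketchTag aTag) {
  assert(aTag < eSketchTag_count);
  return kSketchTagNames[aTag];
}

// Case-sensitive lookup; entries are lowercase, as the list requires.
SketchTag LookupSketchTag(const char* aName) {
  for (int i = 1; i < eSketchTag_count; ++i) {
    if (std::strcmp(kSketchTagNames[i], aName) == 0) {
      return static_cast<SketchTag>(i);
    }
  }
  return eSketchTag_unknown;
}
```

Because the enum value doubles as the array index, editing the enum by hand rather than through the list would silently misalign every table built this way, which is why the warning above matters.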
This is an implementation of the nsITokenizer interface.
This file contains the implementation of a tokenizer to tokenize an HTML
document. It attempts to do so, making tradeoffs between compatibility with
older parsers and the SGML specification. Note that most of the real
"tokenization" takes place in nsHTMLTokens.cpp.
@update gess 4/1/98
This pure virtual interface is used as the "glue" that connects the parsing
process to the content model construction process.
The IContentSink interface is a very lightweight wrapper that represents the
content-sink model building process. There is another one that you may care
about more, which is the IHTMLContentSink interface. (See that file for
@update gess 7/20/98
This interface defines the standard interface for DTDs. Note that this
isn't HTML specific. DTDs have several functions within the parser:
1) To coordinate the consumption of an input stream via the
2) To serve as proxy to represent the containment rules of the
3) To offer autodetection services to the parser (mainly for doc
||This interface should be implemented by any content sink that wants
to get output from expat and do something with it; in other words,
by any sink that handles some sort of XML dialect.
||The fragment sink allows a client to parse a fragment of content, possibly
surrounded by context. Also see nsIParser::ParseFragment().
Note: once you've parsed a fragment, the fragment sink must be re-set on
the parser in order to parse another fragment.
||This interface is OBSOLETE and in the process of being REMOVED.
Do NOT implement!
This file declares the concrete HTMLContentSink class.
This class is used during the parsing process as the
primary interface between the parser and the content model.
After the tokenizer completes, the parser iterates over
the known token list. As the parser identifies valid
elements, it calls the contentsink interface to notify
the content model that a new node or child node is being
created and added to the content model.
The HTMLContentSink interface assumes 4 underlying
containers: HTML, HEAD, BODY and FRAMESET. Before
accessing any of these, the parser will call the appropriate
Open method of nsIHTMLContentSink: OpenHTML, OpenHead, OpenBody, or
OpenFrameSet; likewise, the corresponding Close method will be called when
the parser is done with a given section.
IMPORTANT: The parser may Open each container more than
once! This is due to the irregular nature of HTML files.
For example, it is possible to encounter plain text at
the start of an HTML document (that precedes the HTML tag).
Such text is treated as if it were part of the body.
In such cases, the parser will Open the body, pass the text
node in and then Close the body. The body will likely be
re-Opened later when the actual <BODY> tag has been seen.
Containers within the body are Opened and Closed
using the OpenContainer(...) and CloseContainer(...) calls.
It is assumed that the document or contentSink is
maintaining its state to manage where new content should
be added to the underlying document.
NOTE: OpenHTML() and OpenBody() may get called multiple times
in the same document. That's fine, and it doesn't mean
that we have multiple bodies or HTML's.
NOTE: I haven't figured out how sub-documents (non-frames)
are going to be handled. Stay tuned.
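The rule that a container may be Opened more than once can be sketched with a sink that creates the body on the first Open only and treats later Opens as "resume adding here". `SketchSink` and its members are hypothetical and far simpler than the real sink interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sink that tolerates repeated OpenBody()/CloseBody() pairs.
class SketchSink {
 public:
  void OpenBody() {
    ++mOpenCalls;
    mBodyCreated = true;  // a repeated Open reuses the existing body
    mBodyOpen = true;
  }

  void CloseBody() {
    assert(mBodyOpen);
    mBodyOpen = false;
  }

  // Content may only be added while the body is open.
  void AddText(const std::string& aText) {
    assert(mBodyOpen);
    mBodyText.push_back(aText);
  }

  int BodyCount() const { return mBodyCreated ? 1 : 0; }
  int OpenCalls() const { return mOpenCalls; }
  std::size_t TextNodeCount() const { return mBodyText.size(); }

 private:
  bool mBodyCreated = false;
  bool mBodyOpen = false;
  int mOpenCalls = 0;
  std::vector<std::string> mBodyText;
};
```

With this shape, stray text before the real body tag and the content after it end up in the same single body, even though OpenBody() ran twice.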
||This GECKO-INTERNAL interface is on track to being REMOVED (or refactored
to the point of being near-unrecognizable).
Please DO NOT #include this file in comm-central code, in your XULRunner
app or binary extensions.
Please DO NOT #include this into new files even inside Gecko. It is more
likely than not that #including this header is the wrong thing to do.
@update gess 4/1/98
||The parser can be explicitly interrupted by passing a return value of
NS_ERROR_HTMLPARSER_INTERRUPTED from BuildModel on the DTD. This will cause
the parser to stop processing and allow the application to return to the event
loop. The data which was left at the time of interruption will be processed
the next time OnDataAvailable is called. If the parser has received its final
chunk of data then OnDataAvailable will no longer be called by the networking
module, so the parser will schedule a nsParserContinueEvent which will call
the parser to process the remaining data after returning to the event loop.
If the parser is interrupted while processing the remaining data it will
schedule another nsParserContinueEvent. The processing of data followed by
scheduling of the continue events will proceed until either:
1) All of the remaining data can be processed without interrupting
2) The parser has been cancelled.
This capability is currently used in CNavDTD and nsHTMLContentSink. The
nsHTMLContentSink is notified by CNavDTD when a chunk of tokens is going to be
processed and when each token is processed. The nsHTMLContentSink records
the time when the chunk has started processing and will return
NS_ERROR_HTMLPARSER_INTERRUPTED if the token processing time has exceeded a
threshold called max tokenizing processing time. This allows the content sink
to limit how much data is processed in a single chunk which in turn gates how
much time is spent away from the event loop. Processing smaller chunks of data
also reduces the time spent in subsequent reflows.
This capability is most apparent when loading large documents. If the maximum
token processing time is set small enough the application will remain
responsive during document load.
A side-effect of this capability is that document load is not complete when
the last chunk of data is passed to OnDataAvailable since the parser may have
been interrupted when the last chunk of data arrived. The document is complete
when all of the document has been tokenized and there aren't any pending
nsParserContinueEvents. This can cause problems if the application assumes
that it can monitor the load requests to determine when the document load has
been completed. This is what happens in Mozilla. The document is considered
completely loaded when all of the load requests have been satisfied. To delay
the document load until all of the parsing has been completed the
nsHTMLContentSink adds a dummy parser load request which is not removed until
the nsHTMLContentSink's DidBuildModel is called. The CNavDTD will not call
DidBuildModel until the final chunk of data has been passed to the parser
through OnDataAvailable and there aren't any pending nsParserContinueEvents.
Currently the parser ignores requests to be interrupted during the
calls to manipulate the DOM may fail if the parser was interrupted during the
For more details @see bugzilla bug 76722
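The interrupt-and-continue cycle described above can be sketched as follows. Everything here is hypothetical: `BudgetSink`, `SketchParser` and `RunEventLoop` are illustration names, the event loop is a plain queue of callbacks, and each token costs one tick of a fake clock (the real sink measures elapsed time) so that the behaviour is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

enum class SketchResult { Ok, Interrupted };
using SketchEventLoop = std::deque<std::function<void()>>;

// Hypothetical sink: asks for an interrupt once a chunk has used its budget.
class BudgetSink {
 public:
  explicit BudgetSink(int aMaxTicksPerChunk) : mMaxTicks(aMaxTicksPerChunk) {
    assert(aMaxTicksPerChunk > 0);
  }
  void WillProcessTokens() { mChunkStart = mNow; }
  SketchResult DidProcessAToken() {
    ++mNow;  // fake clock: one tick per token
    return (mNow - mChunkStart >= mMaxTicks) ? SketchResult::Interrupted
                                             : SketchResult::Ok;
  }

 private:
  int mMaxTicks;
  int mNow = 0;
  int mChunkStart = 0;
};

class SketchParser {
 public:
  SketchParser(BudgetSink& aSink, SketchEventLoop& aLoop)
      : mSink(aSink), mLoop(aLoop) {}

  void OnDataAvailable(const std::vector<int>& aTokens, bool aIsFinal) {
    mPending.insert(mPending.end(), aTokens.begin(), aTokens.end());
    mGotFinalChunk = mGotFinalChunk || aIsFinal;
    ResumeParse();
  }

  // Complete only when all data is processed and no continue event is queued.
  bool IsComplete() const {
    return mGotFinalChunk && mPending.empty() && !mContinuePending;
  }
  std::size_t Processed() const { return mProcessed; }
  int ChunksRun() const { return mChunksRun; }

 private:
  void ResumeParse() {
    ++mChunksRun;
    mSink.WillProcessTokens();
    while (!mPending.empty()) {
      mPending.pop_front();
      ++mProcessed;
      if (mSink.DidProcessAToken() == SketchResult::Interrupted) break;
    }
    // After the final chunk no further OnDataAvailable call will arrive,
    // so leftover data needs an explicit continue event.
    if (!mPending.empty() && mGotFinalChunk && !mContinuePending) {
      mContinuePending = true;
      mLoop.push_back([this] {
        mContinuePending = false;
        ResumeParse();
      });
    }
  }

  BudgetSink& mSink;
  SketchEventLoop& mLoop;
  std::deque<int> mPending;
  bool mGotFinalChunk = false;
  bool mContinuePending = false;
  std::size_t mProcessed = 0;
  int mChunksRun = 0;
};

// Drains the fake event loop; events may enqueue further events.
void RunEventLoop(SketchEventLoop& aLoop) {
  while (!aLoop.empty()) {
    std::function<void()> event = std::move(aLoop.front());
    aLoop.pop_front();
    event();
  }
}
```

Note that `IsComplete()` stays false after the final `OnDataAvailable` call returns if a continue event is still queued, which is the side-effect on load completion discussed above.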
This class does two primary jobs:
1) It iterates the tokens provided during the
tokenization process, identifying where elements
begin and end (doing validation and normalization).
2) It controls and coordinates with an instance of
the IContentSink interface, to coordinate the
production of the content model.
The basic operation of this class assumes that an HTML
document is non-normalized. Therefore, we don't process
the document in a normalized way. Don't bother to look
for methods like: doHead() or doBody().
Instead, in order to be backward compatible, we must
scan the set of tokens and perform this basic set of
1) Determine the token type (easy, since the tokens know)
2) Determine the appropriate section of the HTML document
each token belongs in (HTML,HEAD,BODY,FRAMESET).
3) Insert content into our document (via the sink) into
the correct section.
4) In the case of tags that belong in the BODY, we must
ensure that our underlying document state reflects
the appropriate context for our tag.
For example, if we see a <TR>, we must ensure our
document contains a table into which the row can
be placed. This may result in "implicit containers"
created to ensure a well-formed document.
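Step 4 can be illustrated with a context stack that opens any missing ancestors before the requested tag. This is a sketch under strong simplifying assumptions: the two-entry containment table in `RequiredParent` is invented for the example and is nowhere near the real containment rules, and both function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified containment table: returns the parent a
// tag requires, or an empty string if it needs no particular parent.
std::string RequiredParent(const std::string& aTag) {
  if (aTag == "td") return "tr";
  if (aTag == "tr") return "table";
  return "";
}

// Pushes aTag onto the context stack, first opening any implicit
// containers that are missing directly above the current context.
void OpenWithImplicitContainers(std::vector<std::string>& aStack,
                                const std::string& aTag) {
  assert(!aTag.empty());
  const std::string parent = RequiredParent(aTag);
  if (!parent.empty() && (aStack.empty() || aStack.back() != parent)) {
    OpenWithImplicitContainers(aStack, parent);  // create the missing parent
  }
  aStack.push_back(aTag);
}
```

Opening a row directly inside the body therefore yields a table first, while opening a row inside an existing table adds nothing extra.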
||Use this constructor if you want i/o to be based on
a single string you hand in during construction.
@update gess 5/12/98
@param aMode represents the parser mode (nav, other)
@update gess 4/1/98
The scanner is a low-level service class that knows
how to consume characters out of an (internal) stream.
This class also offers a series of utility methods
that most tokenizers want, such as readUntil()
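As an illustration of the kind of utility meant here, the following sketch shows a scanner over an in-memory string with a read-until helper. `SketchScanner` and the exact semantics of its `ReadUntil` (stop before the terminator, or at end of input) are assumptions made for this example and may differ from the real scanner.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical scanner over an in-memory UTF-16 string.
class SketchScanner {
 public:
  explicit SketchScanner(std::u16string aInput) : mInput(std::move(aInput)) {}

  bool AtEnd() const { return mPos >= mInput.size(); }

  // Returns the current character without consuming it.
  char16_t Peek() const {
    assert(!AtEnd());
    return mInput[mPos];
  }

  // Consumes exactly one character.
  void Advance() {
    assert(!AtEnd());
    ++mPos;
  }

  // Consumes and returns the characters up to, but not including, the first
  // occurrence of aTerminator. If the terminator never occurs, the rest of
  // the input is consumed.
  std::u16string ReadUntil(char16_t aTerminator) {
    const std::size_t found = mInput.find(aTerminator, mPos);
    const std::size_t end =
        (found == std::u16string::npos) ? mInput.size() : found;
    std::u16string result = mInput.substr(mPos, end - mPos);
    mPos = end;
    return result;
  }

 private:
  std::u16string mInput;
  std::size_t mPos = 0;
};
```

A tokenizer built on this would alternate `ReadUntil` calls with `Peek`/`Advance` to step over the delimiters it finds.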
||NOTE: nsScannerString (and the other classes defined in this file) are
not related to nsAString or any of the other xpcom/string classes.
nsScannerString is based on the nsSlidingString implementation that used
to live in xpcom/string. Now that nsAString is limited to representing
only single fragment strings, nsSlidingString can no longer be used.
An advantage to this design is that it does not employ any virtual
This file uses SCC-style indenting in deference to the nsSlidingString
code from which this code is derived ;-)