Class HTMLTokenizer

Package com.ms.util

public class HTMLTokenizer
{
  // Fields
  public Hashtable attrs;
  public String tag;
  public String text;
  public static final int TT_BEGIN_TAG;
  public static final int TT_COMMENT;
  public static final int TT_END_TAG;
  public static final int TT_TEXT;
  public int type;

  // Constructors
  public HTMLTokenizer (InputStream isin);

  // Methods
  public boolean hasMoreTokens ();
  public void mark (int readLimit) throws IOException;
  public int nextToken () throws ParseException, IOException;
  public void reset () throws IOException;
  public String toString ();
}

This class parses an HTML version 3.2 document. The parser does not interpret any HTML tags, except for comments and the <PRE> tag.

Constructors

HTMLTokenizer

public HTMLTokenizer (InputStream isin);
Creates an HTMLTokenizer object that reads from the specified input stream.

Parameter Description

isin The input stream to tokenize.

Methods

hasMoreTokens

public boolean hasMoreTokens ();
Indicates whether the HTMLTokenizer object contains more tokens.
Return Value:
Returns true if there are more tokens; otherwise, returns false.

mark

public void mark (int readLimit) throws IOException;
Marks the parser's current position in the input stream.
Return Value:
No return value.

Parameter Description

readLimit The number of bytes that can be read before this mark is invalidated.

Exceptions:
IOException if the tokenized input stream cannot set the requested mark.
See Also: java.io.InputStream.mark

nextToken

public int nextToken () throws ParseException, IOException;
Parses the next token from the input stream. The white space that follows the token and the first character of the next token are consumed.
Return Value:
Returns one of the following token types:
TT_TEXT
TT_BEGIN_TAG
TT_END_TAG
TT_COMMENT

Exceptions:
NoSuchElementException if a null token is received.
ParseException if no tag is found after a less-than (<) symbol or a tag does not have a matching greater-than (>) symbol.
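
Example:
The following is a minimal sketch of the usual read loop: call hasMoreTokens, switch on the value returned by nextToken, and then read the tag or text field that is valid for that token type. Because com.ms.util is not available outside the Microsoft SDK for Java, the sketch uses a hypothetical stand-in class, SketchTokenizer, that only mirrors the documented field and method names. Its constant values, its parsing logic, its lack of attribute and <PRE> handling, and its use of IllegalStateException in place of ParseException are all assumptions, not the behavior of the real class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

class TokenLoopDemo {
    // Walks every token in the input and renders it as a short description.
    static List<String> describe(String html) {
        List<String> out = new ArrayList<>();
        byte[] bytes = html.getBytes(StandardCharsets.ISO_8859_1);
        try {
            SketchTokenizer tok = new SketchTokenizer(new ByteArrayInputStream(bytes));
            while (tok.hasMoreTokens()) {
                switch (tok.nextToken()) {
                    case SketchTokenizer.TT_BEGIN_TAG: out.add("begin " + tok.tag); break;
                    case SketchTokenizer.TT_END_TAG:   out.add("end " + tok.tag); break;
                    case SketchTokenizer.TT_COMMENT:   out.add("comment " + tok.text.trim()); break;
                    default:                           out.add("text " + tok.text);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe("<H1>Title</H1><!-- note -->").forEach(System.out::println);
    }
}

// Hypothetical stand-in, NOT the com.ms.util implementation. It mirrors the
// documented names only; the constant values below are arbitrary.
class SketchTokenizer {
    static final int TT_TEXT = 0;
    static final int TT_BEGIN_TAG = 1;
    static final int TT_END_TAG = 2;
    static final int TT_COMMENT = 3;

    int type;     // last token type read
    String tag;   // set for TT_BEGIN_TAG and TT_END_TAG, without the leading slash
    String text;  // set for TT_TEXT and TT_COMMENT

    private final String src;
    private int pos = 0;

    SketchTokenizer(InputStream in) throws IOException {
        // Buffer the whole stream up front to keep the sketch short.
        StringBuilder sb = new StringBuilder();
        for (int c = in.read(); c >= 0; c = in.read()) {
            sb.append((char) c);
        }
        src = sb.toString();
    }

    boolean hasMoreTokens() {
        return pos < src.length();
    }

    int nextToken() {
        if (!hasMoreTokens()) throw new NoSuchElementException();
        tag = null;
        text = null;
        if (src.startsWith("<!--", pos)) {
            int end = src.indexOf("-->", pos);
            if (end < 0) throw new IllegalStateException("unterminated comment");
            text = src.substring(pos + 4, end);
            pos = end + 3;
            return type = TT_COMMENT;
        }
        if (src.charAt(pos) == '<') {
            int end = src.indexOf('>', pos);
            if (end < 0) throw new IllegalStateException("missing '>'");
            boolean closing = src.charAt(pos + 1) == '/';
            tag = src.substring(pos + (closing ? 2 : 1), end);
            pos = end + 1;
            return type = closing ? TT_END_TAG : TT_BEGIN_TAG;
        }
        int end = src.indexOf('<', pos);
        if (end < 0) end = src.length();
        text = src.substring(pos, end);
        pos = end;
        return type = TT_TEXT;
    }
}
```

With the real class, the same loop would construct an HTMLTokenizer instead and would also need to handle ParseException, which nextToken is declared to throw.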

reset

public void reset () throws IOException;
Resets the input to the last marked position.
Return Value:
No return value.
Exceptions:
IOException if the tokenized input stream cannot be reset to the last marked position.
See Also: java.io.InputStream.reset
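
Example:
The mark and reset methods point to the corresponding java.io.InputStream methods, so the look-ahead pattern they enable can be sketched on a plain stream. The sketch below assumes that the tokenizer's mark and reset follow the same contract as the stream methods; the MarkResetDemo class and its peek, readChars, and demo helpers are hypothetical names used only for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MarkResetDemo {
    // Reads up to n bytes and returns them as a string (one char per byte).
    static String readChars(InputStream in, int n) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int b = in.read();
            if (b < 0) break; // end of stream
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Looks ahead n bytes, then rewinds so the same bytes can be read again.
    static String peek(InputStream in, int n) throws IOException {
        in.mark(n);                 // the mark stays valid for the next n bytes
        String ahead = readChars(in, n);
        in.reset();                 // return to the marked position
        return ahead;
    }

    // Peeks at the first tag, re-reads it, then reads the text that follows.
    static String demo() {
        byte[] html = "<H1>Title</H1>".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream is used because it supports mark and reset.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(html))) {
            String peeked = peek(in, 4);
            String reread = readChars(in, 4);
            String rest = readChars(in, 5);
            return peeked + "|" + reread + "|" + rest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Reading more than readLimit bytes before calling reset may invalidate the mark, so readLimit should be at least as large as the look-ahead.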

toString

public String toString ();
Retrieves a string representation of the HTMLTokenizer object.
Return Value:
Returns a string containing the token type, tag, attributes, and text of the current token.

Fields

attrs

The attributes of a tag. They are valid for these token types: TT_BEGIN_TAG and TT_END_TAG.

tag

The tag.

Comments: For an end tag, the value does not include the leading slash (/) character. This field is valid for these token types: TT_BEGIN_TAG and TT_END_TAG.

text

Plain text. This field is valid for these token types: TT_TEXT and TT_COMMENT.

TT_BEGIN_TAG

A token type representing a beginning tag (for example, <H1>).

TT_COMMENT

A token type representing a comment.

TT_END_TAG

A token type representing an ending tag (for example, </H1>).

TT_TEXT

A token type representing the token text.

type

The last token type read. It can be one of the following: