A raw Unicode character stream is translated into a sequence of Java tokens, using the following three lexical translation steps, which are applied in turn:
A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the Unicode character whose encoding is xxxx. This translation step allows any Java program to be expressed using only ASCII characters.
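The Unicode escape translation can be illustrated with a small sketch. The class UnicodeEscapeSketch and its methods translate and isHex are hypothetical names introduced here, not part of any Java API. The sketch assumes two details of the escape rules beyond what is stated above: a backslash begins an escape only when it is preceded by an even number of contiguous backslashes, and more than one u may follow the backslash. A malformed escape, which a real compiler rejects, is simply copied through unchanged.

```java
class UnicodeEscapeSketch {
    // Hypothetical helper: returns true if the four chars at start are ASCII hex digits.
    static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char h = s.charAt(k);
            boolean ok = (h >= '0' && h <= '9')
                    || (h >= 'a' && h <= 'f')
                    || (h >= 'A' && h <= 'F');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: replaces each eligible escape with the character it encodes.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous raw backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && backslashes % 2 == 0) {
                int j = i + 1;
                // Assumption: one or more 'u' characters may follow the backslash.
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && j + 4 <= raw.length() && isHex(raw, j)) {
                    out.append((char) Integer.parseInt(raw.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashes = 0; // the translated character starts no further escape
                    continue;
                }
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash in each literal denotes one raw backslash in the input.
        System.out.println(translate("int \\u0061 = 1;")); // prints: int a = 1;
    }
}
```

Because the translation is applied before any other lexical step, the escaped form and the plain form of the declaration above yield the same input characters.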
Java always uses the longest possible translation at each step, even if the result does not ultimately make a correct Java program, while another lexical translation would. Thus the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct Java program, even though the tokenization a, -, -, b could be part of a grammatically correct Java program.
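The longest-match rule can be seen in a toy tokenizer. This is a minimal sketch, not the compiler's actual lexer: the class LongestMatchSketch, its tokenize method, and the three-entry operator table are hypothetical and cover only identifiers, white space, and a few minus-based operators.

```java
import java.util.ArrayList;
import java.util.List;

class LongestMatchSketch {
    // Toy operator table, ordered longest first; not the full Java operator set.
    private static final String[] OPERATORS = {"--", "-=", "-"};

    // Hypothetical helper: splits the input into tokens, always taking the longest match.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int j = i + 1;
                while (j < input.length() && Character.isJavaIdentifierPart(input.charAt(j))) {
                    j++;
                }
                tokens.add(input.substring(i, j));
                i = j;
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {
                // The first hit is the longest because the table is ordered longest first.
                if (input.startsWith(op, i)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));  // prints: [a, --, b]
        System.out.println(tokenize("a- -b")); // prints: [a, -, -, b]
    }
}
```

The second call shows that a space is enough to obtain the tokenization a, -, -, b, since the two minus signs are no longer adjacent when the operator is matched.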