Lexical Tokens in Compiler Design



In programming languages, we write code that a compiler translates into machine-understandable code. When we write a program, every word, symbol, and even punctuation mark has a purpose. To learn how the compiler makes sense of a written program, we must understand how "lexical tokens" work.

Lexical tokens are the small, meaningful units of a programming language. They act as the connection between our code and the compiler. Lexical tokens help break everything down into manageable pieces for further processing.

In this chapter, we will cover the basics of lexical tokens and explain their role in programming.

What are Lexical Tokens?

A lexical token is a group of characters that form the basic unit of meaning in a program. We can think of them as words in a sentence. Just as every word in a sentence has a specific role (noun, verb, adjective), every token in a program has a specific purpose.

For example, consider the statement −

int x = 10;

Here, the tokens are −

  • int − A keyword indicating a data type
  • x − An identifier or variable name
  • = − An operator assigning a value
  • 10 − A constant
  • ; − A delimiter marking the end of the statement

Each of these tokens is classified into a category, like keywords, identifiers, operators, and literals.
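As a rough illustration, the statement above can be split into these categories with a small regex-based scanner. This is only a sketch: the token names and patterns below are illustrative, not taken from any real compiler.

```python
import re

# Illustrative token categories for the statement "int x = 10;".
# Order matters: KEYWORD must come before IDENTIFIER so that
# reserved words are not matched as identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|for|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("DELIMITER",  r";"),
    ("SKIP",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for match in TOKEN_RE.finditer(code):
        if match.lastgroup != "SKIP":  # whitespace is discarded, not tokenized
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("int x = 10;"))
# → [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#    ('CONSTANT', '10'), ('DELIMITER', ';')]
```

Each pair in the output is exactly one of the five tokens listed above, paired with its category.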

The Role of Lexical Tokens in Programming

Lexical tokens are produced in the lexical analysis phase of a compiler. This is the very first step in turning source code into something the machine can understand, that is, machine-level code. During this phase, the compiler scans the source code and divides it into tokens. These tokens are then passed on to later stages like syntax analysis and code generation.

Without tokens, the compiler has no idea how to process the code. Imagine trying to read a book where all the words are jumbled together with no spaces or punctuation. That is what a program would look like to a compiler without lexical analysis.

Categories of Lexical Tokens

Lexical tokens can be classified into several types or groups. Let us look at the most common groups with examples −

  • Keywords − Keywords are usually the first tokens we learn. They are reserved words that have a special meaning in a programming language, for example if, else, for, and while. These words are predefined and cannot be used as variable names.
  • Identifiers − Identifiers are names given to variables, functions, or other elements in a program. For example, in int score = 100;, the term score is an identifier.
  • Literals − Literals represent fixed values in your code. These can be numbers, characters, or strings. For instance, 10, "hello", and 'A' are all literals.
  • Operators − Operators are symbols used to perform operations, like +, -, *, and /. They might also include relational operators like == or logical operators like &&.
  • Delimiters − Delimiters are symbols that separate parts of a program. For example, semicolons (;), commas (,), and parentheses (()).
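A minimal sketch of this classification, assuming a small illustrative subset of keywords, operators, and delimiters (a real language would define many more):

```python
# Illustrative subsets only; not a full language definition.
KEYWORDS = {"if", "else", "for", "while", "int", "return"}
OPERATORS = {"+", "-", "*", "/", "=", "==", "&&", "<", ">", "++"}
DELIMITERS = {";", ",", "(", ")", "{", "}"}

def classify(lexeme):
    """Assign a single lexeme to one of the five categories above."""
    if lexeme in KEYWORDS:
        return "keyword"
    if lexeme in OPERATORS:
        return "operator"
    if lexeme in DELIMITERS:
        return "delimiter"
    # Literals here: anything starting with a digit or a quote character.
    if lexeme[0].isdigit() or lexeme[0] in "'\"":
        return "literal"
    return "identifier"

for lx in ["while", "score", "100", "==", ";"]:
    print(lx, "->", classify(lx))
```

Note the order of the checks: keywords are tested before identifiers, which mirrors how a lexer must recognize reserved words so that while is never treated as a variable name.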

Example: Lexical Tokens in Action

Let us go through a specific example to see how lexical tokens work in practice.

Consider the following line of code −

for (i = 1; i < 5.1e3; i++) 
        func1(x);

When broken down, this line contains the following tokens −

Symbol    Description
for       A keyword, indicating the start of a loop.
(         A delimiter, opening the loop's parameters.
i         An identifier, representing a variable.
=         An operator, assigning a value to i.
1         A literal, the starting value of i.
;         A delimiter, separating the loop's parameters.
i         Again, an identifier.
<         A relational operator, checking if i is less than a value.
5.1e3     A literal in scientific notation, representing 5100.
;         Another delimiter.
i++       An increment operator, increasing the value of i by 1.
)         A delimiter, closing the loop's parameters.
func1     An identifier, representing a function name.
(         A delimiter, opening the function's arguments.
x         An identifier, representing the function's input.
)         A delimiter, closing the function's arguments.
;         A delimiter marking the end of the statement.

Here we can see that each token plays a specific role. Together, they form a complete and meaningful statement.
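This breakdown can be reproduced with a sketch tokenizer. The patterns below are illustrative; note also that the sketch emits i and ++ as two separate tokens, which is how most real lexers handle the increment before the parser combines them.

```python
import re

# Sketch tokenizer for the loop line above. The NUMBER pattern accepts
# scientific notation, so 5.1e3 is kept as one token.
SPEC = [
    ("KEYWORD",    r"\b(?:for|if|while)\b"),
    ("NUMBER",     r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"\+\+|==|<=|>=|[=<>+\-*/]"),  # ++ listed before + (longest match)
    ("DELIMITER",  r"[();,]"),
    ("SKIP",       r"\s+"),
]
SCAN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def scan(line):
    return [(m.lastgroup, m.group())
            for m in SCAN.finditer(line) if m.lastgroup != "SKIP"]

for kind, text in scan("for (i = 1; i < 5.1e3; i++) func1(x);"):
    print(kind, text)
```

Running this prints one (category, lexeme) pair per token, matching the table above, with 5.1e3 recognized as a single literal.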

How the Tokens are Processed

Let's understand how the tokens are processed. When the compiler reads a line of code, it does not immediately understand what it means. Instead, it breaks the statement down into its tokens. The process works like this −

  • Scanning the code from left to right.
  • Identifying groups of characters that match known patterns, such as keywords or operators.
  • Assigning each token to a category.

For instance, when the compiler encounters a keyword "for", it matches it with the list of keywords and recognizes it as a loop. Similarly, when it sees i = 1, it identifies = as an assignment operator and 1 as a literal.
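The matching step can be sketched as a left-to-right loop that, at each position, tries every pattern and keeps the longest match. This is the widely used "maximal munch" rule; the pattern set here is a tiny illustrative subset.

```python
import re

# Left-to-right scanning with the longest-match ("maximal munch") rule.
# KEYWORD is listed before IDENT so that on an exact tie (e.g. "for"),
# the keyword wins.
PATTERNS = [
    ("KEYWORD",  re.compile(r"for|if|while")),
    ("IDENT",    re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER",   re.compile(r"\d+")),
    ("OPERATOR", re.compile(r"\+\+|[=+<]")),
    ("SPACE",    re.compile(r"\s+")),
]

def next_token(text, pos):
    best = None
    for kind, pat in PATTERNS:
        m = pat.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    if best is None:
        raise ValueError(f"unrecognized character at position {pos}")
    return best

def scan_all(text):
    pos, out = 0, []
    while pos < len(text):
        kind, lexeme = next_token(text, pos)
        if kind != "SPACE":
            out.append((kind, lexeme))
        pos += len(lexeme)
    return out

print(scan_all("i = 1"))   # = is classified as an operator, 1 as a number
print(scan_all("forty"))   # one identifier, not keyword "for" plus "ty"
```

The second call shows why longest match matters: "forty" begins with the keyword pattern "for", but the identifier match is longer, so the whole word is kept as one identifier.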

Challenges in Tokenizing the Code

Tokenization is one of the most challenging tasks in compiler design. It looks simple, but implementing it comes with several challenges. For example −

  • Ambiguity − In some cases, a sequence of characters could be split into tokens in more than one way. For example, ++ could be read as one increment operator or as two separate + operators; lexers usually resolve this with a longest-match rule.
  • Whitespace and Comments − Whitespace and comments are not tokens. But the compiler needs to ignore them without disrupting the tokenization process.
  • Scientific Notation − Handling complex literals like 5.1e3 requires special rules to ensure the entire sequence is treated as one token.

Despite these challenges, lexical analysis ensures that every part of the program is properly identified and categorized.
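Two of the cases above, skipping whitespace and comments without disturbing tokenization and keeping 5.1e3 as a single token, can be sketched like this (patterns are illustrative):

```python
import re

# COMMENT is listed first so comments are consumed whole, then discarded
# along with whitespace. The NUMBER pattern covers scientific notation.
SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("NUMBER",  r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[=<+]"),
    ("DELIM",   r";"),
    ("SPACE",   r"\s+"),
]
SCAN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC), re.DOTALL)

def tokens(src):
    return [(m.lastgroup, m.group()) for m in SCAN.finditer(src)
            if m.lastgroup not in ("COMMENT", "SPACE")]

print(tokens("x = 5.1e3; // upper bound"))
# → [('IDENT', 'x'), ('OP', '='), ('NUMBER', '5.1e3'), ('DELIM', ';')]
```

The comment never reaches the token stream, and 5.1e3 survives as one literal rather than being split at the dot or the e.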

Importance of Lexical Tokens

Lexical tokens are the foundation of any programming language. Without them, it would be impossible to write or compile programs.

Tokens provide structure, meaning, and context to the code, which makes it understandable for both humans and machines.

Conclusion

Lexical tokens are small and simple units, but they are the backbone of any programming language. In this chapter, we explained the concept of lexical tokens and their role in programming.

We started by exploring what tokens are and their importance in the compilation process. We then looked into their categories such as keywords, identifiers, and operators.

Using the example of a "for" loop, we demonstrated how tokens are broken down and processed by a compiler. At the end, we highlighted some of the challenges involved in tokenization.