Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: March 18, 2024
Programming languages were created to allow developers to write human-readable source code. However, computers work with machine code, which people can hardly write or read. Thus, compilers translate the programming language’s source code to machine code dedicated to a specific machine.
In this article, we’ll analyze the compilation process phases. Then, we’ll see the differences between compilers and interpreters. Finally, we’ll introduce examples of a few compilers of modern programming languages.
As we already mentioned, the compilation process converts high-level source code to a low-level machine code that can be executed by the target machine. Moreover, an essential role of compilers is to inform the developer about errors committed, especially syntax-related ones.
The compilation process consists of several phases:
In this section, we’ll discuss each phase in detail.
The first stage of the compilation process is lexical analysis. During this phase, the compiler splits source code into fragments called lexemes. A lexeme is an abstract unit of a specific language’s lexical system. Let’s analyze a simple example:
String greeting = "hello";
In the above statement, we have five lexemes:
After splitting code into lexemes, a sequence of tokens is created. The sequence of tokens is a final product of lexical analysis. Thus, lexical analysis is also often called tokenization. A token is an object that describes a lexeme. It gives information about the lexeme’s purpose, such as whether it’s a keyword, variable name, or string literal. Moreover, it stores the lexeme’s source code location data.
During syntax analysis, the compiler uses a sequence of tokens generated in the first stage. Tokens are used to create a structure called an abstract syntax tree (AST), which is a tree that represents the logical structure of a program.
In this phase, the compiler checks a grammatic structure of the source code and its syntax correctness. Meanwhile, any syntax errors result in a compilation error, and the compiler informs the developer about them.
In brief, syntax analysis is responsible for two tasks:
In this stage, the compiler uses an abstract syntax tree to detect any semantic errors, for example:
Semantic analysis can be divided into three steps:
To achieve all the above goals, during semantic analysis, the compiler makes a complete traversal of the abstract syntax tree. Semantic analysis finally produces an annotated AST that describes the values of its attributes.
During the compilation process, a compiler can generate one or more intermediate code forms.
“After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target machine.” – Compilers. Principles, Techniques, & Tools. Second Edition. Alfred V. Aho. Columbia University. Monica S. Lam. Stanford University. Ravi Sethi. Avaya.
Intermediate code is machine-independent. Thus, there is no need for unique compilers for every different machine. Besides, optimization techniques are easier to apply to intermediate code than machine code. Intermediate code has two forms of representation:
In the optimization phase, the compiler uses a variety of ways to enhance the efficiency of the code. Certainly, the optimization process should follow three important rules:
Let’s see a few examples of optimization techniques:
Finally, the compiler converts the optimized intermediate code to the machine code dedicated to the target machine. The final code should have the same meaning as source code and be efficient in terms of memory and CPU resource usage. Furthermore, the code generation process must also be efficient.
In the below flowchart, we can see an example of the compilation process of a simple statement.
As we already know, the compiler converts high-level source code to low-level code. Then, the target machine executes low-level code. On the other hand, the interpreter analyzes and executes source code directly. An interpreter usually uses one of several techniques:
Let’s see a brief comparison between an interpreter and a compiler:
| COMPILER: | INTERPRETER: |
|---|---|
| 1. Converts the code but doesn’t execute it. | 1. Executes code directly. |
| 2. Implementing a compiler requires knowledge about the target machine. | 2. No need for knowledge about the target machine, since an interpreter executes the code. |
| 3. Each instruction is translated only once. | 3. The same instruction can be analyzed multiple times. |
| 4. The compiled program is faster to run. | 4. Interpreted programs are slower to run, but take less time to interpret than to compile and run. |
| 5. Consumes more memory due to intermediate code generation. | 5. Usually executes input code directly, thus it consumes less memory. |
| 6. Compiled language examples: Java, C++, Swift, C#. | 6. Interpreted language examples: Ruby, Lisp, PHP, PowerShell. |
In Java, source code is first compiled to the bytecode by the javac compiler. Then, a Java Virtual Machine (JVM) interprets and executes bytecode. So, javac is an excellent example of a compiler that belongs to an interpreter system. Such a system makes Java portable and multiplatform.
Moreover, there are other languages like Kotlin or Scala that are also compiled to bytecode, but these use unique compilers. Thus, the JVM can execute code that was originally written using different technologies.
Mono is a toolset, including a C# programming language compiler, for executing software dedicated to the .NET Platform. It was created to allow .NET applications to run on different platforms. Moreover, an important goal was to give developers working on Linux a better environment and tools for working with the .NET platform.
A compiler converts C# source code into an intermediate bytecode. After that, the virtual machine executes it. Both the C# compiler and virtual machine belong to the Mono toolset.
The GNU Compiler Collection (GCC) is a set of open-source compilers that belong to the GNU Project. These compilers could run on different hardware and software platforms. Therefore, they can generate machine code for various architectures and operating systems.
During compilation, GCC is responsible for processing arguments, calling the compiler for the specific programming language, running the assembler program, and eventually, running a linker to produce an executable binary.
GCC consists of compilers for several programming languages:
In this article, we described a compiler’s role. Further, we went through all phases of the compilation process. Then we discussed differences between a compiler and interpreter. Finally, we mentioned some real-world compiler examples.