Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: February 28, 2025
In this tutorial, we’ll talk about node impurity in decision trees. A decision tree is a greedy algorithm we use for supervised machine learning tasks such as classification and regression.
Firstly, the decision tree nodes are split based on all the variables. During the training phase, the data are passed from a root node to leaves for training. A decision tree uses different algorithms to decide whether to split a node into two or more sub-nodes. The algorithm chooses the partition maximizing the purity of the split (i.e., minimizing the impurity). Informally, impurity is a measure of homogeneity of the labels at the node at hand:
There are different ways to define impurity. In classification tasks, we frequently use the Gini impurity index and Entropy.
Gini Index is related to the misclassification probability of a random sample.
Let’s assume that a dataset contains examples from
classes. Its Gini Index,
, is defined as:
(1)
where is the relative frequency of class
in
, i.e., the probability that a randomly selected object belongs to class
.
As we can observe from the above equation, Gini Index may result in values inside the interval . The minimum value of zero corresponds to a node containing the elements of the same class. In case this occurs, the node is called pure. The maximum value of 0.5 corresponds to the highest impurity of a node.
In this example, we’ll compute the Gini Indices for 3 different cases of a set with 4 balls of two different colors, red and blue:
Ιn statistics, entropy is a measure of information.
Let’s assume that a dataset associated with a node contains examples from
classes. Then, its entropy is:
(2)
where is the relative frequency of class
in
. Entropy takes values from
. As is the case with the Gini index, a node is pure when
takes its minimum value, zero, and impure when it takes its highest value, 1.
Let’s compute the entropies for the same examples as above:
The quality of splitting data is very important during the training phase. When splitting, we choose to partition the data by the attribute that results in the smallest impurity of the new nodes.
We’ll show how to split the data using entropy and the Gini index.
The information gain is the difference between a parent node’s entropy and the weighted sum of its child node entropies.
Let’s assume a dataset with
objects is partitioned into two datasets:
and
of sizes
and
. Then, the split’s Information Gain (
) is:
(3)
In general, if splitting into
subsets
with
objects, respectively, the split’s Information Gain (
) is:
(4)
Let’s consider dataset , which shows if we’re going to play tennis or not based on the weather:
| Day | Outlook | Temperature | Humidity | Wind | Play Tennis? |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Weak | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Strong | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
It contains 9 positive outcomes (Yes) and 5 negatives (No). So, the starting structure is . We want to determine the attribute that offers the highest Information Gain.
Let’s see what happens if we split by Outcome. Outlook has three different values: Sunny, Overcast, and Rain. As we see, the possible splits are and
(
and
stand for entropy and split):
Firstly, we calculate the entropies. They are: and
. Therefore, the Information Gain based on the Outlook,
, is:
(5)
Next, we split the tree based on the Wind feature. It can take two values: Weak and Strong, with the possible splits and
:
The corresponding entropies are: and
. Therefore,
is:
(6)
Following this idea, we calculate the gains for Humidity and Temperature as well.
(7)
(8)
We see that the feature with the maximum Information Gain is Outlook. Thus, we conclude that Outlook is the best attribute to split the data at the tree’s root.
We can split the data by the Gini Index too. Let’s compute the required probabilities:
(9)
Out of the 14 days in the above example, Sunny, Overcast, and Rain occur 5, 4, and 5 times, respectively. Then, we compute the probabilities of a Sunny day and playing tennis or not. Out of the 5 times when Outlook=Sunny, we played tennis on 2 and didn’t play it on 3 days:
Having calculated the required probabilities, we can compute the Gini Index of Sunny:
We follow the same steps for Overcast and Rain:
Therefore, the weighted sum of the above Gini Indices is:
(10)
Similarly, we compute the Gini Index of Temperature, Humidity, and Wind:
(11)
We conclude that Outlook has the lowest Gini Index. Hence, we’ll choose it as the root node.
In this article, we talked about how we can compute the impurity of a node while training a decision tree. In particular, we talked about the Gini Index and entropy as common measures of impurity. By splitting the data to minimize the impurity scores of the resulting nodes, we get a precise tree.