Chi-square, Label Encoding, and One-Hot Encoding

Prasad
4 min read · May 9, 2023

Chi-square

Let’s consider an example to help illustrate the use of the Chi-square test. Suppose we want to investigate whether there is an association between gender and preferred mode of transportation to work (e.g., car, bus, bike, walk).

We randomly select 100 people and record their gender and preferred mode of transportation in a contingency table, which might look like this:

|        | Car | Bus | Bike | Walk |
|--------|-----|-----|------|------|
| Male   | 20  | 15  | 5    | 10   |
| Female | 10  | 20  | 10   | 10   |

To perform the Chi-square test, we first calculate the expected frequency for each cell under the assumption that there is no association between gender and preferred mode of transportation. The expected frequency for a cell is its row total multiplied by its column total, divided by the grand total. For example, the expected frequency for the cell representing males who prefer to bike to work is (50 * 15)/100 = 7.5.

|              | Car | Bus | Bike | Walk | Row Total |
|--------------|-----|-----|------|------|-----------|
| Male         | 20  | 15  | 5    | 10   | 50        |
| Female       | 10  | 20  | 10   | 10   | 50        |
| Column Total | 30  | 35  | 15   | 20   | 100       |

Next, we calculate the Chi-square test statistic by comparing the observed and expected frequencies for each cell. We use the formula:

χ² = Σ((O-E)² / E)

where O is the observed frequency and E is the expected frequency for each cell.

|        | Car | Bus | Bike | Walk |
|--------|-----|-----|------|------|
| Male   | (20-15)²/15 = 1.67 | (15-17.5)²/17.5 = 0.36 | (5-7.5)²/7.5 = 0.83 | (10-10)²/10 = 0.00 |
| Female | (10-15)²/15 = 1.67 | (20-17.5)²/17.5 = 0.36 | (10-7.5)²/7.5 = 0.83 | (10-10)²/10 = 0.00 |
| Total  | 3.33 | 0.71 | 1.67 | 0.00 |

Finally, we compare the calculated χ² value with the critical value from the Chi-square distribution table with (number of rows - 1) * (number of columns - 1) degrees of freedom (df). In this case, df = (2 - 1) * (4 - 1) = 3. Summing the contributions from all cells gives χ² ≈ 5.71. Assuming a significance level of 0.05, the critical value from the table is 7.815. Since our calculated χ² value of 5.71 is less than the critical value, we fail to reject the null hypothesis of no association between gender and preferred mode of transportation.

In other words, we cannot conclude that there is a significant association between gender and preferred mode of transportation based on our data. However, we should note that the sample size is relatively small, and a larger sample might yield different results.
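The arithmetic above can be checked in a few lines of NumPy; the observed counts come straight from the contingency table, and the variable names here are just illustrative:

```python
import numpy as np
from scipy.stats import chi2

# observed counts: rows = (Male, Female), columns = (Car, Bus, Bike, Walk)
observed = np.array([[20, 15, 5, 10],
                     [10, 20, 10, 10]])

# expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# chi-square statistic: sum over cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# critical value at the 0.05 significance level
critical = chi2.ppf(0.95, dof)

print(round(chi2_stat, 2), dof, round(critical, 3))
```

Because the computed statistic stays below the critical value, the script reaches the same conclusion as the hand calculation: we fail to reject the null hypothesis.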

Label encoding

Label encoding is a technique used in machine learning to convert categorical data into numerical data so that it can be used in various algorithms. In label encoding, each category or class of data is assigned a unique numerical value. This is often done when working with non-numerical data, such as in natural language processing or image classification.

For example, suppose we have a dataset with a column called “Color” that contains categorical data such as “Red,” “Green,” and “Blue.” We can use label encoding to convert these categories into numerical values. We might assign “Red” the value 0, “Green” the value 1, and “Blue” the value 2.

Label encoding is a simple and effective way to convert categorical data into numerical form for machine learning algorithms. However, it has limitations. The assigned numbers are arbitrary, yet many algorithms treat them as ordered quantities, so a model may behave as if “Blue” (2) were somehow greater than “Red” (0), when in fact the categories are equally important. For that reason, plain label encoding is best reserved for data with an inherent order or hierarchy, such as education level, where the codes can follow the natural ranking (ordinal encoding); for unordered categories such as color, one-hot encoding is usually the safer choice.

One-hot encoding

One-hot encoding is a technique used in machine learning to convert categorical data into numerical data. In this technique, each category or class of data is represented as a binary vector in which exactly one element is “hot” (represented as 1) and the rest are “cold” (represented as 0). The position of the “hot” element identifies the category.

For example, suppose we have a dataset with a column called “Color” that contains categorical data such as “Red,” “Green,” and “Blue.” We can use one-hot encoding to convert these categories into binary vectors. We might represent “Red” as [1, 0, 0], “Green” as [0, 1, 0], and “Blue” as [0, 0, 1].
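A short NumPy sketch of one-hot encoding for the same “Color” column (pandas’ `get_dummies` or scikit-learn’s `OneHotEncoder` are the usual tools in practice; the fixed category order here is an assumption for illustration):

```python
import numpy as np

colors = ["Red", "Green", "Blue", "Green"]
categories = ["Red", "Green", "Blue"]          # fixed category order

# index of each value within the category list
indices = [categories.index(c) for c in colors]

# rows of the identity matrix are exactly the one-hot vectors
one_hot = np.eye(len(categories), dtype=int)[indices]
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```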

In Python, the whole Chi-square test of independence is a single call to SciPy’s chi2_contingency. Here it is applied to a different 2×3 contingency table:

import numpy as np
from scipy.stats import chi2_contingency

# create a contingency table
table = np.array([[10, 20, 30],
                  [6, 9, 17]])

# perform the chi-square test of independence
stat, p, dof, expected = chi2_contingency(table)

# print the results
print('stat=%.3f, p=%.3f, dof=%d' % (stat, p, dof))
print('expected=\n', expected)

Output:

stat=0.272, p=0.873, dof=2
expected=
[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]

