Understanding Pearson and Cosine Similarity

Prasad
2 min read · Aug 25, 2023

When it comes to measuring relationships within data, two important metrics come into play: Pearson correlation and Cosine similarity. Each of these metrics has distinct properties and is suited for different scenarios. Let’s delve into these concepts with the help of an example.

Consider two sets of data, represented as vectors X and Y. It is entirely possible for the Pearson correlation coefficient between X and Y to be exactly 1 while their Cosine similarity is below 1. You might wonder why these metrics yield different results for the same data. The key lies in how they measure similarity: Pearson correlation subtracts each vector’s mean before comparing, while Cosine similarity compares the raw vectors.

**Pearson Correlation:**

Pearson correlation measures the strength of the linear relationship between two vectors, that is, how well the paired data points fit a straight line. It ranges from -1 to 1. Because each vector is centered on its mean and divided by its standard deviation, Pearson correlation ignores both the scale of the vectors and any constant offset added to them; it depends solely on the linear relationship.
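To make the definition concrete, here is a minimal sketch of Pearson correlation in plain Python. The function name `pearson` and the sample values are illustrative assumptions, not part of the original example. It shows that rescaling and shifting a vector leaves the coefficient at 1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Center each vector by subtracting its own mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Covariance term divided by the product of the centered vector lengths.
    cov = sum(a * b for a, b in zip(dx, dy))
    norm_x = math.sqrt(sum(a * a for a in dx))
    norm_y = math.sqrt(sum(b * b for b in dy))
    return cov / (norm_x * norm_y)

x = [1.0, 2.0, 4.0, 7.0]        # made-up values for illustration
y = [3 * v + 10 for v in x]     # the same data, rescaled and shifted

r = pearson(x, y)
print(round(r, 6))  # 1.0
```

Note that the helper divides by zero if either input is constant, since a constant vector has no variance; a production version would need to handle that case.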

**Cosine Similarity:**

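Cosine similarity, on the other hand, is the cosine of the angle between two vectors. Unlike Pearson correlation, it does not center the data, so adding a constant to every component changes the result. What it does capture is the directional similarity between the vectors, regardless of their magnitude: multiplying a vector by a positive constant leaves the value unchanged. This property makes it particularly useful for scenarios where scale isn’t significant. Cosine similarity is commonly employed in text analysis, where documents are represented as vectors.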
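A short sketch of the same idea in plain Python follows. The helper name `cosine_similarity` and the sample vectors are assumptions chosen for illustration. Scaling a vector keeps the similarity at 1, while adding a constant to every component changes the direction and lowers it.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [1.0, 2.0, 3.0]              # made-up values for illustration
scaled = [10 * a for a in v]     # same direction, ten times the magnitude
shifted = [a + 5 for a in v]     # a constant added to every component

print(round(cosine_similarity(v, scaled), 6))   # 1.0
print(cosine_similarity(v, shifted) < 1.0)      # True
```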

**Example Dataset:**

Imagine we have two vectors representing the number of hours two friends, Alice and Bob, spend on various activities each week. The activities include reading, jogging, and watching TV. Let’s say the vectors are as follows:

```
Alice: [5, 3, 2]
Bob:   [4, 2, 1]
```

Calculating Pearson correlation between these vectors gives a coefficient of exactly 1, because Bob’s hours are Alice’s hours minus one for every activity, which is a perfect linear relationship. This implies that the activities on which Alice spends more time are also the ones on which Bob spends more time.

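However, calculating the Cosine similarity gives about 0.991: the dot product is 28 and the vector lengths are √38 and √21, so the result is 28 / √798. This high value shows that Alice and Bob have very similar patterns of activities, even though the magnitude (total hours) of their activities differs. It falls slightly short of 1 because Bob’s vector is Alice’s shifted by a constant rather than scaled by one, and Cosine similarity, unlike Pearson correlation, does not subtract the mean. Cosine similarity focuses on the direction of the raw vectors, not their magnitude.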
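Both numbers for Alice and Bob can be checked with a few lines of Python. The sketch below uses hypothetical helper names (`cosine_similarity`, `center`, `pearson`) and relies on the standard identity that Pearson correlation equals the Cosine similarity of the mean-centered vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(x):
    """Subtract the mean from every component."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def pearson(x, y):
    """Pearson correlation, written as the cosine of the centered vectors."""
    return cosine_similarity(center(x), center(y))

alice = [5, 3, 2]
bob = [4, 2, 1]

print(round(pearson(alice, bob), 3))            # 1.0
print(round(cosine_similarity(alice, bob), 3))  # 0.991
```

Writing Pearson this way makes the difference between the two metrics visible in one line: the only extra step is the call to `center`.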

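It’s worth noting that the two metrics handle zero values differently, and that a zero is not the same as a missing (null) value. Neither formula is defined for missing entries, so those have to be dropped or filled in before computing either metric. In Pearson correlation, a zero is treated like any other value: it shifts the mean and therefore affects the result. In Cosine similarity, a zero contributes nothing to the dot product or to that vector’s length, so only dimensions where both vectors are non-zero add to the numerator, and dimensions where both are zero have no effect at all.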
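The effect of zeros can be seen by padding two vectors with dimensions where both are zero. This is a minimal sketch with made-up values and hypothetical helper names: the Cosine similarity is unchanged by the padding, while the Pearson correlation moves because the zeros pull both means down.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation: cosine similarity after mean-centering."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])

x = [1, 2, 3]                # made-up values for illustration
y = [2, 1, 3]
x_padded = x + [0, 0]        # two extra dimensions where both vectors are zero
y_padded = y + [0, 0]

cos_before = cosine_similarity(x, y)
cos_after = cosine_similarity(x_padded, y_padded)
r_before = pearson(x, y)
r_after = pearson(x_padded, y_padded)

print(round(cos_before, 3), round(cos_after, 3))  # 0.929 0.929
print(round(r_before, 3), round(r_after, 3))      # 0.5 0.853
```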

In summary, Pearson correlation is often used in statistics to analyze relationships between variables, while Cosine similarity is extensively used in text analysis to measure the similarity between documents. Because Cosine similarity is independent of magnitude, it is particularly suitable for scenarios like text analysis, where a longer document is not necessarily about something different from a shorter one. Whether you’re exploring data correlations or comparing text documents, understanding these metrics can provide valuable insights into the underlying relationships.
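As a rough illustration of the text-analysis case, the sketch below represents each document as a bag of word counts and compares them with Cosine similarity. The whitespace tokenizer, the helper names, and the sample sentences are simplifying assumptions; real pipelines usually add steps such as punctuation handling and term weighting.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return Counter(text.lower().split())

def cosine_similarity_counts(a, b):
    """Cosine similarity between two word-count mappings."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

short_doc = "the cat sat on the mat"
doubled_doc = short_doc + " " + short_doc      # same content, twice the length
other_doc = "stock prices fell sharply today"  # no words in common

same = cosine_similarity_counts(bag_of_words(short_doc), bag_of_words(doubled_doc))
different = cosine_similarity_counts(bag_of_words(short_doc), bag_of_words(other_doc))

print(round(same, 6))  # 1.0
print(different)       # 0.0
```

Doubling the document doubles every word count, which changes the vector’s length but not its direction, so the similarity stays at 1.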

Prasad

I am an open-source enthusiast and Python lover who attempts to find simple explanations for questions and share them with others.