What is a Correlation and how is it calculated?

Everything you need to know about what a correlation is, how correlations are calculated, and how they are used in Kapiche.

Written by Josh Winters
Updated over a week ago

# Correlation measures the strength of the relationship between two variables in the data.

These two variables could be any combination of theme, concept, segment, or query. The strength of the relationship is determined by how likely the two variables are to appear together and is expressed as a number between -1 and 1.

A score of 0 means there is no correlation, -1 means there is a perfect negative correlation (when one variable appears, the other is never present), and 1 means there is a perfect positive correlation (when one variable appears, the other is always present). Between these extremes are varying degrees of relationship. For example, if a theme about 'prices' has a 0.5 correlation with the segment value Country=Australia, the two are considerably more likely to appear together than chance would predict. The inference is that customers from Australia are significantly more likely to talk about prices than the average customer.

Correlation scores range from -1 to 1.

# How are correlations calculated?

To calculate correlation, Kapiche uses a measure based on word co-occurrence: Pointwise Mutual Information (PMI). Measures of this kind are used to identify how semantically similar words are, and are useful in classification applications for reducing feature dimensionality by grouping similar terms into a small number of clusters.

PMI is the logarithm of the ratio between the observed co-occurrence frequency and the frequency expected if the two instances were independent. In simpler terms, it compares the probability that the two instances occur together in a text excerpt, p(a,b), with the product of the probability that the query occurs, p(a), and the probability that the concept or segment occurs, p(b).

Equation One: Pointwise Mutual Information

$$\mathrm{PMI}(a,b) = \log \frac{p(a,b)}{p(a)\,p(b)}$$
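As a rough illustration of the PMI calculation described above, here is a minimal Python sketch. The function name `pmi`, the use of the natural logarithm, and the example probabilities are assumptions made for illustration. They are not taken from Kapiche's implementation.

```python
import math

def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Log of the observed co-occurrence probability over the
    probability expected if a and b were independent."""
    if p_ab == 0:
        # a and b never co-occur, so the log ratio is unbounded below.
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))

# Hypothetical numbers: the query matches 20% of excerpts, the concept 10%,
# and both appear together in 5% of excerpts.
score = pmi(p_ab=0.05, p_a=0.20, p_b=0.10)
print(score)  # positive: the pair co-occurs more often than independence predicts
```

A positive result means the pair appears together more often than chance would predict, a negative result means less often, and a result near zero means the two are roughly independent.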

PMI has a well-known tendency to give excessively high relatedness scores to word pairs that involve low-frequency words. To counter this bias, Kapiche normalises the measure: the PMI score is divided by the negative logarithm of the probability that instance a and instance b appear together in the data collection, -log p(a,b). This also bounds the result between -1 and 1.

Equation Two: Normalised Pointwise Mutual Information

$$\mathrm{NPMI}(a,b) = \frac{\mathrm{PMI}(a,b)}{-\log p(a,b)}$$

Calculating the NPMI allows Kapiche to compare the correlations of all the concepts on the graph on a common scale, because the effect of each individual concept or segment probability is normalised out.
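The normalisation step can be sketched in Python as follows. This uses the standard NPMI definition, in which PMI is divided by -log p(a,b). The function name `npmi` and the handling of the edge cases are assumptions for illustration and may differ from what Kapiche does internally.

```python
import math

def npmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Normalised PMI, bounded between -1 and 1."""
    if p_ab == 0:
        # Never together: the limiting value of NPMI is -1.
        return -1.0
    if p_ab >= 1:
        # Degenerate case: both appear in every excerpt, so -log(1) = 0.
        # Returning 0 here is an arbitrary choice made for this sketch.
        return 0.0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

print(npmi(0.10, 0.10, 0.10))  # always together: approximately 1
print(npmi(0.02, 0.20, 0.10))  # independent: approximately 0
print(npmi(0.00, 0.20, 0.10))  # never together: -1.0
```

Because the result is a ratio of two logarithms, the choice of logarithm base does not affect the NPMI score.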

## Equation Variables

a = The query results

b = The concept or segment being compared to

p(a,b) = Probability that the two instances (query and concept/segment) occur in the same excerpt.

p(a) = Probability that the query appears in a text excerpt.

p(b) = Probability that the concept/segment appears in a text excerpt.

PMI(a,b) = Pointwise Mutual Information of the two instances.

NPMI(a,b) = Normalised Pointwise Mutual Information of the two instances.
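To show how the variables above might be estimated from data, the following sketch counts excerpts in a small made-up dataset. The excerpt labels, the `probabilities` helper, and the counting approach are all hypothetical and only illustrate the definitions.

```python
import math

# Hypothetical data: each excerpt is represented by the set of labels
# (query, concept, or segment values) that matched it.
excerpts = [
    {"prices", "Country=Australia"},
    {"prices", "Country=Australia"},
    {"prices"},
    {"Country=Australia"},
    {"delivery"},
    {"delivery"},
    {"support"},
    {"support"},
]

def probabilities(excerpts, a, b):
    """Estimate p(a), p(b) and p(a,b) as fractions of excerpts."""
    n = len(excerpts)
    count_a = sum(1 for e in excerpts if a in e)
    count_b = sum(1 for e in excerpts if b in e)
    count_ab = sum(1 for e in excerpts if a in e and b in e)
    return count_a / n, count_b / n, count_ab / n

p_a, p_b, p_ab = probabilities(excerpts, "prices", "Country=Australia")

# PMI and NPMI from the estimated probabilities (natural log assumed).
pmi = math.log(p_ab / (p_a * p_b))
npmi = pmi / -math.log(p_ab)
print(p_a, p_b, p_ab)  # 0.375 0.375 0.25
print(round(npmi, 3))  # 0.415
```

In this toy dataset, 'prices' and Country=Australia each appear in three of eight excerpts and together in two, giving a moderate positive correlation.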

# Questions?

If you have any questions (or need some help!) you can get in touch with us any time by hitting the blue chat button to your right.
