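Correlations measure how often a query co-occurs with a concept or segment in the data.
A correlation indicates whether the two co-occur more often than chance alone would explain, that is, how far they are from being independent of each other. The assumption is that words that frequently appear together in the text are associated with each other throughout the data collection.
Correlation scores range from -1 to 1: a score of -1 means the query and the concept or segment never occur together, 0 means they occur together no more often than chance would predict, and 1 means they always occur together.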
How are correlations calculated?
To calculate correlations, Kapiche uses a measure based on word co-occurrence, Pointwise Mutual Information (PMI). This measure indicates how semantically similar words are, and it is useful in classification applications for reducing feature dimensionality by grouping similar terms into a small number of clusters.
PMI is the log of the ratio of the observed co-occurrence frequency to the frequency expected under independence. In simpler terms, it uses the probability that the two instances occur together in a text excerpt (p(a,b)), the probability that the query occurs (p(a)) and the probability that the concept or segment occurs (p(b)).
Equation One: Pointwise Mutual Information
PMI(a,b) = log( p(a,b) / ( p(a) × p(b) ) )
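As a rough illustration of the PMI definition, log( p(a,b) / (p(a) × p(b)) ), the sketch below computes it from three probabilities. The function name `pmi`, the example probabilities and the choice of the natural log are assumptions made for illustration; they are not taken from Kapiche's implementation.

```python
import math


def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Pointwise Mutual Information from three probabilities.

    p_ab: probability that a and b occur in the same excerpt
    p_a:  probability that a occurs in an excerpt
    p_b:  probability that b occurs in an excerpt
    The natural log is an assumption; another base only rescales the score.
    """
    if p_ab == 0:
        # a and b never occur together, so the log is unbounded below.
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))


# Hypothetical probabilities, chosen only to show the behaviour.
# The joint probability equals p(a) * p(b): PMI is approximately 0.
independent = pmi(p_ab=0.06, p_a=0.2, p_b=0.3)

# The pair co-occurs 2.5 times more often than independence predicts: PMI > 0.
associated = pmi(p_ab=0.15, p_a=0.2, p_b=0.3)

print(f"independent: {independent:.3f}, associated: {associated:.3f}")
```

A score near zero means the pair co-occurs about as often as chance predicts, and a positive score means it co-occurs more often than that.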
PMI has a well-known tendency to give excessively high relatedness scores to word pairs that involve low-frequency words. To counter this bias, Kapiche normalises the measure: the PMI score is divided by the negative log of the probability that instance a and instance b appear together in the same excerpt (-log p(a,b)).
Equation Two: Normalised Pointwise Mutual Information
NPMI(a,b) = PMI(a,b) / ( -log p(a,b) )
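The low-frequency bias can be seen with a small sketch. It uses the standard NPMI definition, PMI(a,b) / (-log p(a,b)), and two made-up pairs that each always occur together, one rare and one common. The function names, the probabilities and the handling of the edge cases are assumptions for illustration only, not Kapiche's implementation.

```python
import math


def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """PMI = log( p(a,b) / (p(a) * p(b)) ), using the natural log."""
    return math.log(p_ab / (p_a * p_b))


def npmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Normalised PMI: PMI divided by -log p(a,b), bounded to [-1, 1]."""
    if p_ab == 0:
        # The pair never co-occurs: the limiting value of NPMI is -1.
        return -1.0
    if p_ab == 1:
        # Both appear in every excerpt, so the ratio is 0/0.
        # Returning 0.0 here is an arbitrary choice made for this sketch.
        return 0.0
    return pmi(p_ab, p_a, p_b) / -math.log(p_ab)


# Hypothetical pairs: in both, a and b only ever appear together.
rare = dict(p_ab=0.001, p_a=0.001, p_b=0.001)
common = dict(p_ab=0.1, p_a=0.1, p_b=0.1)

pmi_rare, pmi_common = pmi(**rare), pmi(**common)
npmi_rare, npmi_common = npmi(**rare), npmi(**common)

# PMI ranks the rare pair far above the common one,
# while NPMI scores both at (approximately) 1.
print(f"PMI:  rare={pmi_rare:.2f}  common={pmi_common:.2f}")
print(f"NPMI: rare={npmi_rare:.2f}  common={npmi_common:.2f}")
```

Both pairs are equally strongly associated, yet raw PMI rewards the rare pair simply for being rare; after normalisation the two scores are directly comparable.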
Calculating the NPMI allows Kapiche to compare the correlations of all the concepts on the graph on a common scale, because the effect of each individual concept or segment probability is normalised out.
a = The query results
b = The concept or segment being compared to
p(a,b) = Probability that the two instances (query and concept/segment) occur in the same excerpt.
p(a) = Probability that the query appears in a text excerpt.
p(b) = Probability that the concept/segment appears in a text excerpt.
PMI(a,b) = Pointwise Mutual Information of the two instances.
NPMI(a,b) = Normalised Pointwise Mutual Information of the two instances.
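To tie the definitions above together, the following sketch estimates p(a), p(b) and p(a,b) as simple fractions of excerpts and then computes the NPMI. It is a simplified illustration under stated assumptions: each excerpt is reduced to a set of terms, the query and the concept are each modelled as a single term, and the data and the function name `npmi_from_excerpts` are hypothetical. It does not describe how Kapiche matches queries or segments internally.

```python
import math

# Hypothetical excerpts, each reduced to the set of terms it mentions.
excerpts = [
    {"delivery", "late"},
    {"delivery", "late", "refund"},
    {"delivery", "fast"},
    {"price", "refund"},
    {"price"},
    {"late"},
]


def npmi_from_excerpts(excerpts: list[set[str]], a: str, b: str) -> float:
    """Estimate NPMI(a, b), with probabilities taken as fractions of excerpts."""
    n = len(excerpts)
    p_a = sum(a in e for e in excerpts) / n               # p(a)
    p_b = sum(b in e for e in excerpts) / n               # p(b)
    p_ab = sum(a in e and b in e for e in excerpts) / n   # p(a,b)
    if p_ab == 0:
        return -1.0  # the two never share an excerpt
    if p_ab == 1:
        return 0.0   # 0/0 case; arbitrary choice made for this sketch
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# "delivery" and "late" share 2 of 6 excerpts: a positive score.
score = npmi_from_excerpts(excerpts, "delivery", "late")

# "delivery" and "price" never share an excerpt: the score is -1.
never = npmi_from_excerpts(excerpts, "delivery", "price")

print(f"delivery/late: {score:.3f}, delivery/price: {never:.1f}")
```

Counting excerpts rather than individual word occurrences matches the definitions of p(a), p(b) and p(a,b) above, which are all probabilities of appearing in a text excerpt.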