Document similarity (or distance between documents) is one of the central themes in Information Retrieval. How do humans usually decide whether documents are similar? Usually documents are treated as similar if they are semantically close and describe similar concepts. On the other hand, “similarity” can also be used in the context of duplicate detection. We will review several common approaches.
The text2vec package provides two sets of functions for measuring various distances/similarities in a unified way. All methods are written with special attention to computational performance and memory efficiency.
- sim2(x, y, method) - calculates similarity between each row of matrix x and each row of matrix y using the given method.
- psim2(x, y, method) - calculates parallel similarity between rows of matrix x and the corresponding rows of matrix y using the given method.
- dist2(x, y, method) - calculates distance/dissimilarity between each row of matrix x and each row of matrix y using the given method.
- pdist2(x, y, method) - calculates parallel distance/dissimilarity between rows of matrix x and the corresponding rows of matrix y using the given method.
Methods have the suffix 2 in their names because, in contrast to the base dist() function, they work with two matrices instead of one.
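A minimal sketch with toy matrices illustrates the difference between the all-pairs and the parallel variants (assuming text2vec is loaded):
library(text2vec)
# two toy matrices sharing the same number of columns (the common space)
m1 = matrix(runif(6), nrow = 2)   # 2 "documents"
m2 = matrix(runif(12), nrow = 4)  # 4 "documents"
# sim2(): each row of m1 vs each row of m2 -> a 2 x 4 similarity matrix
dim(sim2(m1, m2, method = "cosine", norm = "l2"))
# psim2(): one value per pair of corresponding rows, so shapes must match
length(psim2(m1, m1, method = "cosine", norm = "l2"))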
The following methods are implemented at the moment (each is covered below):
- Jaccard similarity/distance
- cosine similarity/distance (optionally on tf-idf or LSA-transformed matrices)
- Euclidean distance
As usual we will use the built-in text2vec::movie_review dataset. Let’s clean it up a little bit:
library(stringr)
library(text2vec)
data("movie_review")
# select 500 rows for faster running times
movie_review = movie_review[1:500, ]
prep_fun = function(x) {
# make text lower case
x = str_to_lower(x)
# remove non-alphanumeric symbols
x = str_replace_all(x, "[^[:alnum:]]", " ")
# collapse multiple spaces
str_replace_all(x, "\\s+", " ")
}
movie_review$review_clean = prep_fun(movie_review$review)
Now let’s define two sets of documents on which we will evaluate our similarity measures:
doc_set_1 = movie_review[1:300, ]
it1 = itoken(doc_set_1$review_clean, progressbar = FALSE)
# deliberately take a different number of docs in the second set
doc_set_2 = movie_review[301:500, ]
it2 = itoken(doc_set_2$review_clean, progressbar = FALSE)
We will compare documents in a vector space, so we need to define a common space and project the documents into it. We will use vocabulary-based vectorization for better interpretability:
it = itoken(movie_review$review_clean, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
Jaccard similarity is a simple but intuitive measure of similarity between two sets.
\[J(doc_1, doc_2) = \frac{\lvert doc_1 \cap doc_2 \rvert}{\lvert doc_1 \cup doc_2 \rvert}\] For documents we measure it as the proportion of the number of common words to the number of unique words in both documents. In the field of NLP, Jaccard similarity can be particularly useful for duplicate detection. text2vec, however, provides a generic and efficient implementation which can be used in many other applications.
To calculate Jaccard similarity between 2 sets of documents, the user has to provide a DTM for each of them (the DTMs should be in the same vector space!):
# they will be in the same space because we use same vectorizer
# hash_vectorizer will also work fine
dtm1 = create_dtm(it1, vectorizer)
dim(dtm1)
## [1] 300 2338
dtm2 = create_dtm(it2, vectorizer)
dim(dtm2)
## [1] 200 2338
Once we have a representation of the documents in vector space we are almost done. One thing remains: call sim2():
d1_d2_jac_sim = sim2(dtm1, dtm2, method = "jaccard", norm = "none")
Check result:
dim(d1_d2_jac_sim)
## [1] 300 200
d1_d2_jac_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02142857 . 0.02362205 0.007575758 0.02597403
## 2 0.01219512 . 0.02941176 0.013888889 0.02083333
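As a sanity check, the first entry can be recomputed by hand from the occurrence patterns of the two rows; this hedged sketch should reproduce d1_d2_jac_sim[1, 1]:
a = as.logical(dtm1[1, ])
b = as.logical(dtm2[1, ])
# |intersection| / |union| of the two word sets
sum(a & b) / sum(a | b)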
We can also compute “parallel” similarity - the similarity between corresponding rows of two matrices (the matrices should have identical shapes):
dtm1_2 = dtm1[1:200, ]
dtm2_2 = dtm2[1:200, ]
d1_d2_jac_psim = psim2(dtm1_2, dtm2_2, method = "jaccard", norm = "none")
str(d1_d2_jac_psim)
## Named num [1:200] 0.02143 0 0.00735 0 0.03311 ...
## - attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
We define Jaccard distance or Jaccard dissimilarity as \(1 - similarity(doc_1, doc_2)\). sim2() and psim2() have corresponding companion functions dist2() and pdist2() which compute dissimilarity. Note, however, that in many cases the similarity between two documents is 0; sim2() exploits this, so the result matrix will be sparse. dist2() flips those zeros into ones, which cannot be stored sparsely, so use dist2() on large sparse matrices carefully.
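For completeness, a hedged sketch of the companion call; the values should equal 1 minus the similarities computed above:
d1_d2_jac_dist = dist2(dtm1, dtm2, method = "jaccard", norm = "none")
# compare with 1 - d1_d2_jac_sim[1:2, 1:5]
d1_d2_jac_dist[1:2, 1:5]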
A classical approach from computational linguistics is to measure similarity based on the content overlap between documents. For this we represent documents as bags-of-words, so each document is a sparse vector, and we define the measure of overlap as the angle between the vectors: \[similarity(doc_1, doc_2) = cos(\theta) = \frac{doc_1 \cdot doc_2}{\lVert doc_1 \rVert \lVert doc_2 \rVert}\] By cosine distance/dissimilarity we mean the following: \[distance(doc_1, doc_2) = 1 - similarity(doc_1, doc_2)\] It is important to note, however, that this is not a proper distance metric in a mathematical sense, as it does not have the triangle inequality property and it violates the coincidence axiom.
The calculation of cosine similarity is analogous to Jaccard similarity:
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
Check result:
dim(d1_d2_cos_sim)
## [1] 300 200
d1_d2_cos_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02703999 . 0.05063299 0.009500143 0.02753954
## 2 0.02455143 . 0.06567587 0.034503278 0.04000800
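Again, the formula can be hand-checked against the first entry; this hedged sketch should reproduce d1_d2_cos_sim[1, 1]:
a = as.numeric(dtm1[1, ])
b = as.numeric(dtm2[1, ])
# dot product divided by the product of the L2 norms
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))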
It can be useful to measure similarity not on the vanilla bag-of-words matrix, but on a transformed one. One choice is to apply a tf-idf transformation. First let’s create the tf-idf model:
dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
Calculate similarities between all rows of the dtm_tfidf matrix:
d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 1.000000000 0.007517279 0.02216897 0.02697791 0.01366605
## 2 0.007517279 1.000000000 0.01111845 . .
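The fitted model can also be applied to the two separate DTMs with transform(), which keeps them in the same weighted space; a sketch along the lines of the earlier sim2() call:
dtm1_tfidf = transform(dtm1, tfidf)
dtm2_tfidf = transform(dtm2, tfidf)
d1_d2_tfidf_sim = sim2(dtm1_tfidf, dtm2_tfidf, method = "cosine", norm = "l2")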
Usually tf-idf/bag-of-words matrices contain a lot of noise. Applying an LSA model can help with this problem, so you can achieve better-quality similarities:
lsa = LSA$new(n_topics = 100)
dtm_tfidf_lsa = fit_transform(dtm_tfidf, lsa)
## INFO [2018-12-21 20:44:35] soft_als: iter 001, frobenious norm change 1.719
## INFO [2018-12-21 20:44:35] soft_als: iter 002, frobenious norm change 0.329
## INFO [2018-12-21 20:44:35] soft_als: iter 003, frobenious norm change 0.077
## INFO [2018-12-21 20:44:35] soft_als: iter 004, frobenious norm change 0.028
## INFO [2018-12-21 20:44:35] soft_als: iter 005, frobenious norm change 0.013
## INFO [2018-12-21 20:44:35] soft_als: iter 006, frobenious norm change 0.007
## INFO [2018-12-21 20:44:35] soft_als: iter 007, frobenious norm change 0.004
## INFO [2018-12-21 20:44:35] soft_als: iter 008, frobenious norm change 0.003
## INFO [2018-12-21 20:44:35] soft_als: iter 009, frobenious norm change 0.002
## INFO [2018-12-21 20:44:35] soft_als: iter 010, frobenious norm change 0.001
## INFO [2018-12-21 20:44:35] soft_als: iter 011, frobenious norm change 0.001
## INFO [2018-12-21 20:44:35] soft_impute: converged with tol 0.001000 after 11 iter
## INFO [2018-12-21 20:44:35] running final svd
## INFO [2018-12-21 20:44:35] final rank = 100
Calculate similarities between all rows of the dtm_tfidf_lsa matrix:
d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim[1:2, 1:5]
## 1 2 3 4 5
## 1 1.0000000 0.2155649 0.3245058 0.28449807 0.38486893
## 2 0.2155649 1.0000000 0.2575520 0.01282894 0.02879225
And “parallel” similarities:
x = dtm_tfidf_lsa[1:250, ]
y = dtm_tfidf_lsa[251:500, ]
head(psim2(x = x, y = y, method = "cosine", norm = "l2"))
## 1 2 3 4 5 6
## 0.22419185 0.22964204 0.30510876 0.10639831 0.29816243 0.02997241
Euclidean distance is not as useful in the NLP field as Jaccard or cosine similarity, but it is always worth trying different measures. In text2vec it can be computed only on dense matrices; here is an example:
x = dtm_tfidf_lsa[1:300, ]
y = dtm_tfidf_lsa[1:200, ]
m1 = dist2(x, y, method = "euclidean")
We can also apply different row normalization techniques (the default in the example above was "l2"):
m2 = dist2(x, y, method = "euclidean", norm = "l1")
m3 = dist2(x, y, method = "euclidean", norm = "none")
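To make explicit what norm does, here is a hedged sketch: rows are rescaled before distances are computed, so "l1" is equivalent to dividing each row by the sum of its absolute values:
x_l1 = x / rowSums(abs(x))
y_l1 = y / rowSums(abs(y))
m2_manual = dist2(x_l1, y_l1, method = "euclidean", norm = "none")
# should be ~0 (identical to m2 up to floating point)
max(abs(m2 - m2_manual))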