In this tutorial I will show how to extract phrases from text and how they can be used in downstream tasks. I will use the text8
dataset, which is available for download here. It consists of 100 MB of text from English Wikipedia.
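If you don’t have the file locally, a minimal download sketch might look like this (this assumes the commonly used mattmahoney.net mirror and a home-directory path; adjust both to your setup):
# fetch and unpack text8 if it is not already on disk
if (!file.exists("~/text8")) {
  download.file("http://mattmahoney.net/dc/text8.zip", path.expand("~/text8.zip"))
  unzip(path.expand("~/text8.zip"), exdir = path.expand("~"))
}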
Fitting the model is as easy as:
library(text2vec)
model = Collocations$new(collocation_count_min = 50)
txt = readLines("~/text8")
it = itoken(txt)
model$fit(it, n_iter = 3)
## INFO [2017-07-07 19:02:08] iteration 1 - found 5300 collocations
## INFO [2017-07-07 19:02:24] iteration 2 - found 6778 collocations
## INFO [2017-07-07 19:02:37] iteration 3 - found 6802 collocations
Now let us check what we got. The learned collocations are kept in the collocation_stat field of the model:
model$collocation_stat
## prefix suffix n_i n_j n_ij pmi
## 1: politics_main article_politics 52 50 50 18.143106
## 2: krag_j rgensen 54 50 50 18.088658
## 3: demographics_main article_demographics 54 50 50 18.088658
## 4: economy_main article_economy 55 55 55 18.062186
## 5: merleau ponty 63 65 62 17.880585
## ---
## 6798: produced by 3457 111831 777 5.001542
## 6799: seem to 807 316376 513 5.001146
## 6800: referendum on 300 91250 55 5.001041
## 6801: other types 32433 2518 164 5.000351
## 6802: k n 4472 6942 66 5.000100
## lfmd gensim rank_pmi rank_lfmd rank_gensim
## 1: -18.25627 0.000000 1 138 6638
## 2: -18.31072 0.000000 2 143 6639
## 3: -18.31072 0.000000 3 144 6640
## 4: -18.06219 24880.960331 4 119 85
## 5: -18.06310 46707.003663 5 120 33
## ---
## 6798: -23.64699 29.972812 6798 1266 4121
## 6799: -24.84530 28.904043 6799 1812 4203
## 6800: -31.28831 2.911190 6800 6769 6423
## 6801: -28.13662 22.249316 6801 4162 4772
## 6802: -30.59820 7.758113 6802 6353 6022
The model goes through subsequent tokens and collects some statistics - how frequently one token follows another, the frequencies of individual tokens, etc. Based on these statistics the model calculates several scores: PMI, LFMD (see the paper below), “gensim”. The scores are actually heuristics - the model is unsupervised. For an overview of how different approaches perform, check the Automatic Extraction of Fixed Multiword Expressions paper.
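For intuition about what these scores measure, here is a minimal sketch of the standard PMI definition: it compares how often two tokens actually co-occur with how often they would co-occur if they were independent. This is only an illustration of the idea; the package’s internal computation may differ in details (smoothing, which total count is used for N, etc.).
# pointwise mutual information in log2 space:
# how much more frequent is the pair than expected under independence?
# n_i, n_j - counts of the two individual tokens
# n_ij     - count of the pair (token j immediately following token i)
# N        - total number of tokens in the corpus
pmi_score = function(n_i, n_j, n_ij, N) {
  log2((n_ij / N) / ((n_i / N) * (n_j / N)))
}
# a pair that almost always occurs together gets a large positive score;
# two frequent but unrelated words get a score close to zero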
There are several important parameters in the model. Let’s take a closer look at the constructor:
colloc = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, sep = "_")
vocabulary - (optional) an instance of a text2vec vocabulary. If provided, the model will search for collocations consisting of words from this vocabulary. If not provided, the model will first make one pass over the data and create the vocabulary itself.
collocation_count_min - the model will only consider a set of words a phrase if it observes them together at least collocation_count_min times. For example, if the collocation “new york” was observed fewer than 50 times, the model will treat “new” and “york” as separate words.
pmi_min, gensim_min, lfmd_min - minimal values of the corresponding scores, used for filtering out low-scored collocation candidates.
Generally the model needs to make several passes over the data. As mentioned above, on each pass it collects statistics about adjacent word co-occurrences.
Let’s consider an example. Suppose the words “new” and “york” occur together almost every time either of them appears in the corpus; then “new_york” is a good phrase candidate (it will have high pmi, lfmd, gensim scores). In contrast, if we take a look at the words “it” and “is”, it can happen that “it_is” occurs 500 times, but the words “it” and “is” separately occur 15000 and 17000 times. Intuitively it is very unlikely that “it_is” represents a good phrase. So after each pass over the data we prune phrase candidates by removing co-occurrences with low pmi, lfmd, gensim scores.
As a result, in the end the model will be able to concatenate collocations from tokens. Let’s check how a naive model trained on wikipedia will work:
test_txt = c("i am living in a new apartment in new york city",
"new york is the same as new york city",
"san francisco is very expensive city",
"who claimed that model works?")
it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)
it_phrases = model$transform(it)
it_phrases$nextElem()
## $tokens
## $tokens[[1]]
## [1] "i_am" "living" "in" "a"
## [5] "new" "apartment" "in" "new_york_city"
##
## $tokens[[2]]
## [1] "new_york" "is" "the" "same"
## [5] "as" "new_york_city"
##
## $tokens[[3]]
## [1] "san_francisco" "is" "very" "expensive"
## [5] "city"
##
## $tokens[[4]]
## [1] "who" "claimed_that" "model" "works?"
##
##
## $ids
## [1] "1" "2" "3" "4"
As we can see, the results are pretty impressive but not ideal - we probably do not want to get “claimed_that” as a collocation. One solution is to provide a vocabulary without stopwords to the model constructor. But this won’t solve most of the edge cases. Another solution is to keep track of what the model has learned after each pass over the data. We can fit the model incrementally with the partial_fit() method and prune bad phrases after each iteration.
it = itoken(txt)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 50)
model2 = Collocations$new(vocabulary = v, collocation_count_min = 50, pmi_min = 0)
model2$partial_fit(it)
model2$collocation_stat
## prefix suffix n_i n_j n_ij pmi lfmd
## 1: merleau ponty 63 65 62 1.736649e+01 -17.54900
## 2: limp bizkit 66 57 54 1.728955e+01 -18.02457
## 3: bhagavad gita 51 70 50 1.725409e+01 -18.28208
## 4: krav maga 76 73 70 1.710348e+01 -17.46185
## 5: orl ans 59 81 58 1.704743e+01 -18.06050
## ---
## 10763: when its 20623 29567 55 9.637992e-03 -35.25153
## 10764: from english 72871 11868 78 9.500960e-03 -34.24358
## 10765: e two 11426 192644 198 5.691199e-03 -31.55949
## 10766: s l 116710 5343 56 3.297087e-03 -35.20588
## 10767: zero k 264975 4970 118 6.130957e-05 -33.05854
## gensim rank_pmi rank_lfmd rank_gensim
## 1: 3.270582e+04 1 88 15
## 2: 1.186695e+04 2 117 73
## 3: 0.000000e+00 3 141 10541
## 4: 4.023382e+04 4 82 7
## 5: 1.868318e+04 5 121 49
## ---
## 10763: 9.151845e-02 10763 10747 10338
## 10764: 3.613462e-01 10764 10421 9757
## 10765: 7.504292e-01 10765 7987 8938
## 10766: 1.073880e-01 10766 10742 10303
## 10767: 5.762957e-01 10767 9528 9295
Since we set the PMI score restriction to 0, we got a lot of garbage collocations like “when_its”. Fortunately we can manually prune them and continue training. Let’s filter by some thresholds:
temp = model2$collocation_stat[pmi >= 8 & gensim >= 10 & lfmd >= -25, ]
temp
## prefix suffix n_i n_j n_ij pmi lfmd gensim
## 1: merleau ponty 63 65 62 17.366494 -17.54900 32705.8227
## 2: limp bizkit 66 57 54 17.289548 -18.02457 11866.9452
## 3: krav maga 76 73 70 17.103476 -17.46185 40233.8212
## 4: orl ans 59 81 58 17.047433 -18.06050 18683.1756
## 5: lingua franca 78 55 52 17.045623 -18.37739 5203.1991
## ---
## 1337: difference between 1375 15737 506 8.027850 -20.83005 235.2003
## 1338: working class 2271 3412 181 8.026277 -23.79792 188.6874
## 1339: chemical elements 1944 2723 123 8.018666 -24.92020 153.9135
## 1340: republic ireland 4231 2362 231 8.011118 -23.10927 202.1405
## 1341: rock band 2819 3304 214 8.002446 -23.33851 196.5199
## rank_pmi rank_lfmd rank_gensim
## 1: 1 88 15
## 2: 2 117 73
## 3: 4 82 7
## 4: 5 121 49
## 5: 6 152 150
## ---
## 1337: 1899 474 1182
## 1338: 1900 1392 1324
## 1339: 1904 1887 1461
## 1340: 1907 1111 1279
## 1341: 1916 1201 1295
If this looks reasonable, we can prune the learned collocations:
model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
identical(temp, model2$collocation_stat)
## [1] TRUE
And continue training:
model2$partial_fit(it)
model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
model2$collocation_stat
## prefix suffix n_i n_j n_ij pmi lfmd
## 1: merleau ponty 63 65 62 17.366494 -17.54900
## 2: ifad_ifc ifrcs 64 64 60 17.291136 -17.66357
## 3: limp bizkit 66 57 54 17.289548 -18.02457
## 4: leonardo_da vinci 66 75 66 17.155426 -17.52428
## 5: krav maga 76 73 70 17.103476 -17.46185
## ---
## 1572: ice age 1441 4875 167 8.023908 -23.97717
## 1573: chemical elements 1944 2723 123 8.018666 -24.92020
## 1574: republic ireland 4231 2362 231 8.011118 -23.10927
## 1575: prize_laureate d 428 16581 167 8.009239 -23.99184
## 1576: rock band 2819 3304 214 8.002446 -23.33851
## gensim rank_pmi rank_lfmd rank_gensim
## 1: 32705.8227 2 96 21
## 2: 26730.0171 3 101 33
## 3: 11866.9452 4 133 97
## 4: 35389.4626 6 94 16
## 5: 40233.8212 7 87 10
## ---
## 1572: 182.3503 2032 1571 1534
## 1573: 153.9135 2036 2048 1628
## 1574: 202.1405 2042 1193 1469
## 1575: 180.5055 2044 1577 1544
## 1576: 196.5199 2045 1289 1480
And so on, until we decide to stop the process (for example, when the number of learned phrases does not change between two passes).
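A rough sketch of such a loop, reusing the thresholds from above and stopping once a full pass no longer changes the number of learned phrases (n_prev is just a helper variable introduced here):
# keep making passes until the pruned phrase count stabilizes;
# in practice you may also want to cap the number of passes
n_prev = -1
while (n_prev != nrow(model2$collocation_stat)) {
  n_prev = nrow(model2$collocation_stat)
  model2$partial_fit(it)
  model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
}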
It is pretty interesting that we can extract collocations like “george_washington” or “new_york_city”, but it is even more exciting to use them in downstream tasks. Good examples are topic models (phrases improve interpretability a lot!) and word embeddings.
How do we incorporate them into a model? This is simple - create a vocabulary which contains both words and phrases, and then build a document-term matrix or term-co-occurrence matrix.
In order to do that we need to create an itoken iterator which will concatenate collocations and then just pass it to any function which consumes iterators.
it_phrases = model2$transform(it)
vocabulary_with_phrases = create_vocabulary(it_phrases, stopwords = tokenizers::stopwords("en"))
vocabulary_with_phrases = prune_vocabulary(vocabulary_with_phrases, term_count_min = 10)
vocabulary_with_phrases[startsWith(vocabulary_with_phrases$term, "new_"), ]
## Number of docs: 1
## 33 stopwords: a, an, and, are, as, at ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## term term_count doc_count
## 1: new_york_yankees 85 1
## 2: new_hampshire 183 1
## 3: new_brunswick 188 1
## 4: new_south_wales 204 1
## 5: new_orleans 300 1
## 6: new_jersey 425 1
## 7: new_testament 517 1
## 8: new_zealand 1095 1
## 9: new_york 4884 1
Now we can create a term-co-occurrence matrix which will contain both words and multi-word phrases (make sure you provide the itoken iterator which generates phrases, not plain words):
tcm = create_tcm(it_phrases, vocab_vectorizer(vocabulary_with_phrases))
And train the word embeddings model:
glove = GloVe$new(50, vocabulary = vocabulary_with_phrases, x_max = 50)
wv_main = glove$fit_transform(tcm, 10)
## INFO [2017-07-07 19:05:05] 2017-07-07 19:05:05 - epoch 1, expected cost 0.0305
## INFO [2017-07-07 19:05:08] 2017-07-07 19:05:08 - epoch 2, expected cost 0.0211
## INFO [2017-07-07 19:05:11] 2017-07-07 19:05:11 - epoch 3, expected cost 0.0187
## INFO [2017-07-07 19:05:14] 2017-07-07 19:05:14 - epoch 4, expected cost 0.0173
## INFO [2017-07-07 19:05:18] 2017-07-07 19:05:18 - epoch 5, expected cost 0.0164
## INFO [2017-07-07 19:05:21] 2017-07-07 19:05:21 - epoch 6, expected cost 0.0157
## INFO [2017-07-07 19:05:24] 2017-07-07 19:05:24 - epoch 7, expected cost 0.0153
## INFO [2017-07-07 19:05:28] 2017-07-07 19:05:28 - epoch 8, expected cost 0.0149
## INFO [2017-07-07 19:05:31] 2017-07-07 19:05:31 - epoch 9, expected cost 0.0146
## INFO [2017-07-07 19:05:35] 2017-07-07 19:05:35 - epoch 10, expected cost 0.0143
wv_context = glove$components
wv = wv_main + t(wv_context)
cos_sim = sim2(x = wv, y = wv["new_zealand", , drop = FALSE], method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
## new_zealand australia united_kingdom canada queensland
## 1.0000000 0.8906049 0.7560024 0.7496143 0.7143516
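# a classic analogy check: new_york - usa + france should land near paris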
paris = wv["new_york", , drop = FALSE] -
wv["usa", , drop = FALSE] +
wv["france", , drop = FALSE]
cos_sim = sim2(x = wv, y = paris, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
## france paris london after england
## 0.7630746 0.7031228 0.6930537 0.6638524 0.6628017
Incorporating collocations into topic models is very straightforward - we just need to create a document-term matrix and pass it to the LDA model.
data("movie_review")
prep_fun = function(x) {
stringr::str_replace_all(tolower(x), "[^[:alpha:]]", " ")
}
it = itoken(movie_review$review, preprocessor = prep_fun, tokenizer = word_tokenizer,
ids = movie_review$id, progressbar = FALSE)
it = model2$transform(it)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 10, doc_proportion_min = 0.01)
Let’s check how many of the phrases we’ve learned from wikipedia we can find in the movie_review dataset:
word_count_per_token = sapply(strsplit(v$term, "_", fixed = TRUE), length)
v$term[word_count_per_token > 1]
## [1] "ve_got" "th_century" "anything_else" "takes_place"
## [5] "her_husband" "once_again" "weren_t" "sci_fi"
## [9] "years_ago" "new_york" "looks_like" "rather_than"
## [13] "don_t_know" "wouldn_t" "aren_t" "couldn_t"
## [17] "wasn_t" "i_am" "isn_t" "didn_t"
## [21] "doesn_t" "don_t"
Not many. It seems we may need to learn collocations from the movie_review dataset itself, or use different score thresholds. I leave this exercise for the reader.
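A possible starting point for that exercise might look like the sketch below (the it_mr/model_mr names and the lowered thresholds are just illustrative choices; the movie_review corpus is much smaller than text8, so collocation_count_min is reduced):
# fit a separate collocation model on the movie reviews themselves
it_mr = itoken(movie_review$review, preprocessor = prep_fun, tokenizer = word_tokenizer,
               ids = movie_review$id, progressbar = FALSE)
model_mr = Collocations$new(collocation_count_min = 20, pmi_min = 5)
model_mr$fit(it_mr, n_iter = 2)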
Anyway, now we can create a document-term matrix and run LDA:
N_TOPICS = 20
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
lda = LDA$new(N_TOPICS)
doc_topic = lda$fit_transform(dtm)
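To check whether the learned phrases actually surface in the topics, we can inspect the top words per topic; something along these lines should work (get_top_words() is the text2vec LDA method for this, and the lambda value here is an arbitrary choice):
# show the 10 most relevant terms for each topic;
# lambda < 1 penalizes globally frequent terms (LDAvis-style relevance)
lda$get_top_words(n = 10, topic_number = 1:N_TOPICS, lambda = 0.3)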