In this tutorial I will show how to extract phrases from text and how they can be used in downstream tasks. I will use the text8 dataset, which is available for download here. It consists of 100MB of text from English Wikipedia.

Fitting the model is as easy as:

library(text2vec)
model = Collocations$new(collocation_count_min = 50)
txt = readLines("~/text8")
it = itoken(txt)
model$fit(it, n_iter = 3)
## INFO [2017-07-07 19:02:08] iteration 1 - found 5300 collocations
## INFO [2017-07-07 19:02:24] iteration 2 - found 6778 collocations
## INFO [2017-07-07 19:02:37] iteration 3 - found 6802 collocations

Now let us check what we got. Learned collocations are kept in the collocation_stat field of the model:

model$collocation_stat
##                   prefix               suffix    n_i    n_j n_ij       pmi
##    1:     politics_main     article_politics     52     50   50 18.143106
##    2:            krag_j              rgensen     54     50   50 18.088658
##    3: demographics_main article_demographics     54     50   50 18.088658
##    4:      economy_main      article_economy     55     55   55 18.062186
##    5:           merleau                ponty     63     65   62 17.880585
##   ---
## 6798:          produced                   by   3457 111831  777  5.001542
## 6799:              seem                   to    807 316376  513  5.001146
## 6800:        referendum                   on    300  91250   55  5.001041
## 6801:             other                types  32433   2518  164  5.000351
## 6802:                 k                    n   4472   6942   66  5.000100
##            lfmd       gensim rank_pmi rank_lfmd rank_gensim
##    1: -18.25627     0.000000        1       138        6638
##    2: -18.31072     0.000000        2       143        6639
##    3: -18.31072     0.000000        3       144        6640
##    4: -18.06219 24880.960331        4       119          85
##    5: -18.06310 46707.003663        5       120          33
##   ---
## 6798: -23.64699    29.972812     6798      1266        4121
## 6799: -24.84530    28.904043     6799      1812        4203
## 6800: -31.28831     2.911190     6800      6769        6423
## 6801: -28.13662    22.249316     6801      4162        4772
## 6802: -30.59820     7.758113     6802      6353        6022

# How it works

The model goes through subsequent tokens and collects statistics - how frequently one token follows another, the frequencies of individual tokens, etc. Based on these statistics the model calculates several scores: PMI, LFMD (see the paper below) and “gensim”. The scores are heuristics - the model is unsupervised. For an overview of the performance of different approaches, see the paper Automatic Extraction of Fixed Multiword Expressions.

## Details

There are several important parameters in the model. Let’s take a closer look at the constructor:

colloc = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, sep = "_")
• vocabulary (optional) - an instance of a text2vec vocabulary. If provided, the model will only search for collocations consisting of words from this vocabulary. If not provided, the model will first make one pass over the data and create it.
• collocation_count_min - the model will only consider a set of words a phrase if it observes it at least collocation_count_min times. For example, if the collocation “new_york” was observed fewer than 50 times, the model will treat “new” and “york” as separate words.
• pmi_min, gensim_min, lfmd_min - minimal values of the corresponding scores for filtering out low-scored collocation candidates.
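As a sketch of how these thresholds act together, here is a hypothetical filter in Python. The function name and its defaults only mirror the constructor arguments above; it is not part of text2vec:

```python
def passes_thresholds(n_ij, pmi, gensim, lfmd,
                      collocation_count_min=50, pmi_min=5,
                      gensim_min=0, lfmd_min=float("-inf")):
    # A candidate phrase survives only if its co-occurrence count and
    # every score clear the corresponding minimum.
    return (n_ij >= collocation_count_min and pmi >= pmi_min
            and gensim >= gensim_min and lfmd >= lfmd_min)

passes_thresholds(n_ij=50, pmi=18.1, gensim=0.0, lfmd=-18.3)  # True
passes_thresholds(n_ij=40, pmi=18.1, gensim=0.0, lfmd=-18.3)  # False: too rare
```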

Generally the model needs to make several iterations over the data. As mentioned above, on each pass it collects statistics about adjacent word co-occurrences.

## Example

Let’s consider an example.

1. Suppose on the first pass the model found that the words “new” and “york” occur together as “new_york” 100 times, while “new” and “york” individually occur 150 and 115 times respectively. Intuitively there is a very high chance that “new_york” is a good phrase candidate (and it will have high PMI, LFMD and gensim scores). In contrast, consider the words “it” and “is”: it can happen that “it_is” occurs 500 times, but “it” and “is” separately occur 15000 and 17000 times. Intuitively it is very unlikely that “it_is” represents a good phrase. So after each pass over the data we prune phrase candidates by removing co-occurrences with low PMI, LFMD and gensim scores.
2. Suppose we detected the phrase “new_york” after the first pass. During the second pass the model will scan the tokens, and whenever it finds “new” and “york” in sequence it will concatenate them into “new_york” and treat that as a single token (if any other word follows “new”, the model won’t concatenate them and will keep them as two separate tokens). Now imagine the next token after “new_york” is “city”. The model will again calculate co-occurrence scores as in step 1 and decide whether to keep “new_york_city” as a phrase/collocation or to treat “new_york” and “city” as separate tokens. By repeating this process we can learn long multi-word phrases.
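The intuition from step 1 can be checked numerically. A minimal PMI sketch in Python, using the toy counts above and an assumed corpus size of one million tokens (the corpus size is not given in the example):

```python
import math

def pmi(n_i, n_j, n_ij, n_total):
    # Pointwise mutual information (in bits) of an adjacent pair:
    # log2( p(i,j) / (p(i) * p(j)) )
    return math.log2((n_ij / n_total) / ((n_i / n_total) * (n_j / n_total)))

N = 1_000_000  # assumed corpus size for illustration

# "new york": rarer words that almost always appear together
pmi_new_york = pmi(150, 115, 100, N)      # ~12.5 bits

# "it is": frequent words that co-occur mostly by chance
pmi_it_is = pmi(15_000, 17_000, 500, N)   # ~1.0 bit
```

So “new_york” scores far above “it_is”, which is exactly what the pruning step exploits.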

As a result, in the end the model will be able to concatenate collocations from tokens. Let’s check how a naive model trained on Wikipedia works:

test_txt = c("i am living in a new apartment in new york city",
"new york is the same as new york city",
"san francisco is very expensive city",
"who claimed that model works?")
it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)
it_phrases = model$transform(it)
it_phrases$nextElem()
## $tokens
## $tokens[[1]]
## [1] "i_am"          "living"        "in"            "a"
## [5] "new"           "apartment"     "in"            "new_york_city"
##
## $tokens[[2]]
## [1] "new_york"      "is"            "the"           "same"
## [5] "as"            "new_york_city"
##
## $tokens[[3]]
## [1] "san_francisco" "is"            "very"          "expensive"
## [5] "city"
##
## $tokens[[4]]
## [1] "who"          "claimed_that" "model"        "works?"
##
##
## $ids
## [1] "1" "2" "3" "4"
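Conceptually, transform() performs a greedy merge over the token stream. A minimal Python sketch of that idea (illustrative only, not the actual text2vec implementation; the phrase set is hypothetical):

```python
def merge_phrases(tokens, phrases, sep="_"):
    # Greedily extend the previous output token whenever the joined form
    # is a known collocation; this lets "new_york" + "city" collapse
    # further into "new_york_city".
    out = []
    for tok in tokens:
        if out and out[-1] + sep + tok in phrases:
            out[-1] = out[-1] + sep + tok
        else:
            out.append(tok)
    return out

phrases = {"new_york", "new_york_city", "san_francisco"}
merge_phrases("new york is the same as new york city".split(), phrases)
# -> ['new_york', 'is', 'the', 'same', 'as', 'new_york_city']
```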

As we can see, the results are pretty impressive but not ideal - we probably do not want “claimed_that” as a collocation. One solution is to provide the model constructor with a vocabulary without stopwords. But this won’t solve most of the edge cases. Another solution is to keep track of what the model has learned after each pass over the data. We can fit the model incrementally with the partial_fit() method and prune bad phrases after each iteration.
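The fit-then-prune loop can be sketched end to end on a toy corpus. This is a deliberately simplified Python miniature, not the text2vec implementation: it uses PMI as the only score, tiny thresholds, and whitespace tokenization:

```python
from collections import Counter
import math

def learn_collocations(sentences, n_iter=2, count_min=2, pmi_min=3.0, sep="_"):
    # Each pass: merge already-accepted phrases, count unigrams and
    # adjacent pairs, then keep pairs clearing the count and PMI bars.
    # Repeating passes lets bigrams grow into longer phrases.
    phrases = set()
    for _ in range(n_iter):
        uni, bi = Counter(), Counter()
        for sent in sentences:
            toks = _merge(sent.split(), phrases, sep)
            uni.update(toks)
            bi.update(zip(toks, toks[1:]))
        total = sum(uni.values())
        for (a, b), n_ab in bi.items():
            if n_ab >= count_min:
                pmi = math.log2(n_ab * total / (uni[a] * uni[b]))
                if pmi >= pmi_min:
                    phrases.add(a + sep + b)
    return phrases

def _merge(tokens, phrases, sep):
    # Greedy concatenation of known phrases, as in the transform step.
    out = []
    for tok in tokens:
        if out and out[-1] + sep + tok in phrases:
            out[-1] += sep + tok
        else:
            out.append(tok)
    return out

corpus = (["new york is big"] * 4 +
          ["the city is old", "the park is big",
           "the man is old", "the river is big"])
learn_collocations(corpus)
# -> {'new_york'}
```

Only “new_york” clears the PMI bar; frequent chance pairs like “is_big” are pruned, mirroring the behaviour described above.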

it = itoken(txt)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 50)
model2 = Collocations$new(vocabulary = v, collocation_count_min = 50, pmi_min = 0)
model2$partial_fit(it)
model2$collocation_stat
##          prefix  suffix    n_i    n_j n_ij          pmi      lfmd
##     1: merleau   ponty     63     65   62 1.736649e+01 -17.54900
##     2:    limp  bizkit     66     57   54 1.728955e+01 -18.02457
##     3: bhagavad   gita     51     70   50 1.725409e+01 -18.28208
##     4:    krav    maga     76     73   70 1.710348e+01 -17.46185
##     5:     orl     ans     59     81   58 1.704743e+01 -18.06050
##    ---
## 10763:    when     its  20623  29567   55 9.637992e-03 -35.25153
## 10764:    from english  72871  11868   78 9.500960e-03 -34.24358
## 10765:       e     two  11426 192644  198 5.691199e-03 -31.55949
## 10766:       s       l 116710   5343   56 3.297087e-03 -35.20588
## 10767:    zero       k 264975   4970  118 6.130957e-05 -33.05854
##               gensim rank_pmi rank_lfmd rank_gensim
##     1: 3.270582e+04        1        88          15
##     2: 1.186695e+04        2       117          73
##     3: 0.000000e+00        3       141       10541
##     4: 4.023382e+04        4        82           7
##     5: 1.868318e+04        5       121          49
##    ---
## 10763: 9.151845e-02    10763     10747       10338
## 10764: 3.613462e-01    10764     10421        9757
## 10765: 7.504292e-01    10765      7987        8938
## 10766: 1.073880e-01    10766     10742       10303
## 10767: 5.762957e-01    10767      9528        9295

Since we set the restriction on the PMI score to 0, we got a lot of garbage collocations like “when_its”. Fortunately we can manually prune them and continue training. Let’s filter by some thresholds:

temp = model2$collocation_stat[pmi >= 8 & gensim >= 10 & lfmd >= -25, ]
temp
##           prefix   suffix  n_i   n_j n_ij       pmi      lfmd     gensim
##    1:    merleau    ponty   63    65   62 17.366494 -17.54900 32705.8227
##    2:       limp   bizkit   66    57   54 17.289548 -18.02457 11866.9452
##    3:       krav     maga   76    73   70 17.103476 -17.46185 40233.8212
##    4:        orl      ans   59    81   58 17.047433 -18.06050 18683.1756
##    5:     lingua   franca   78    55   52 17.045623 -18.37739  5203.1991
##   ---
## 1337: difference  between 1375 15737  506  8.027850 -20.83005   235.2003
## 1338:    working    class 2271  3412  181  8.026277 -23.79792   188.6874
## 1339:   chemical elements 1944  2723  123  8.018666 -24.92020   153.9135
## 1340:   republic  ireland 4231  2362  231  8.011118 -23.10927   202.1405
## 1341:       rock     band 2819  3304  214  8.002446 -23.33851   196.5199
##       rank_pmi rank_lfmd rank_gensim
##    1:        1        88          15
##    2:        2       117          73
##    3:        4        82           7
##    4:        5       121          49
##    5:        6       152         150
##   ---
## 1337:     1899       474        1182
## 1338:     1900      1392        1324
## 1339:     1904      1887        1461
## 1340:     1907      1111        1279
## 1341:     1916      1201        1295

If it looks reasonable we can prune learned collocations:

model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
identical(temp, model2$collocation_stat)
## [1] TRUE

And continue training:

model2$partial_fit(it)
model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
model2$collocation_stat
##              prefix   suffix  n_i   n_j n_ij       pmi      lfmd
##    1:      merleau    ponty   63    65   62 17.366494 -17.54900
##    2:     ifad_ifc    ifrcs   64    64   60 17.291136 -17.66357
##    3:         limp   bizkit   66    57   54 17.289548 -18.02457
##    4:  leonardo_da    vinci   66    75   66 17.155426 -17.52428
##    5:         krav     maga   76    73   70 17.103476 -17.46185
##   ---
## 1572:          ice      age 1441  4875  167  8.023908 -23.97717
## 1573:     chemical elements 1944  2723  123  8.018666 -24.92020
## 1574:     republic  ireland 4231  2362  231  8.011118 -23.10927
## 1575: prize_laureate      d  428 16581  167  8.009239 -23.99184
## 1576:         rock     band 2819  3304  214  8.002446 -23.33851
##           gensim rank_pmi rank_lfmd rank_gensim
##    1: 32705.8227        2        96          21
##    2: 26730.0171        3       101          33
##    3: 11866.9452        4       133          97
##    4: 35389.4626        6        94          16
##    5: 40233.8212        7        87          10
##   ---
## 1572:   182.3503     2032      1571        1534
## 1573:   153.9135     2036      2048        1628
## 1574:   202.1405     2042      1193        1469
## 1575:   180.5055     2044      1577        1544
## 1576:   196.5199     2045      1289        1480

And so on, until we decide to stop the process (for example, when the number of learned phrases stays the same between two passes).

# Usage

It is pretty interesting that we can extract collocations like “george_washington” or “new_york_city”, but it is even more exciting to use them in downstream tasks. Good examples are topic models (phrases improve interpretability a lot!) and word embeddings. How do we incorporate them into a model? This is simple - create a vocabulary which contains both words and phrases, and then a document-term matrix or term-co-occurrence matrix. In order to do that we need to create an itoken iterator which concatenates collocations, and then just pass it to any function which consumes iterators.

it_phrases = model2$transform(it)
vocabulary_with_phrases = create_vocabulary(it_phrases, stopwords = tokenizers::stopwords("en"))
vocabulary_with_phrases = prune_vocabulary(vocabulary_with_phrases, term_count_min = 10)
vocabulary_with_phrases[startsWith(vocabulary_with_phrases$term, "new_"), ]
## Number of docs: 1
## 33 stopwords: a, an, and, are, as, at ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
##                term term_count doc_count
## 1: new_york_yankees         85         1
## 2:    new_hampshire        183         1
## 3:    new_brunswick        188         1
## 4:  new_south_wales        204         1
## 5:      new_orleans        300         1
## 6:       new_jersey        425         1
## 7:    new_testament        517         1
## 8:      new_zealand       1095         1
## 9:         new_york       4884         1

## Word embeddings with collocations

Now we can create a term-co-occurrence matrix which will contain both words and multi-word phrases (make sure you provide the itoken iterator which generates phrases, not plain words):

tcm = create_tcm(it_phrases, vocab_vectorizer(vocabulary_with_phrases))

And train a word embeddings model:

glove = GloVe$new(50, vocabulary = vocabulary_with_phrases, x_max = 50)
wv_main = glove$fit_transform(tcm, 10)
## INFO [2017-07-07 19:05:05] 2017-07-07 19:05:05 - epoch 1, expected cost 0.0305
## INFO [2017-07-07 19:05:08] 2017-07-07 19:05:08 - epoch 2, expected cost 0.0211
## INFO [2017-07-07 19:05:11] 2017-07-07 19:05:11 - epoch 3, expected cost 0.0187
## INFO [2017-07-07 19:05:14] 2017-07-07 19:05:14 - epoch 4, expected cost 0.0173
## INFO [2017-07-07 19:05:18] 2017-07-07 19:05:18 - epoch 5, expected cost 0.0164
## INFO [2017-07-07 19:05:21] 2017-07-07 19:05:21 - epoch 6, expected cost 0.0157
## INFO [2017-07-07 19:05:24] 2017-07-07 19:05:24 - epoch 7, expected cost 0.0153
## INFO [2017-07-07 19:05:28] 2017-07-07 19:05:28 - epoch 8, expected cost 0.0149
## INFO [2017-07-07 19:05:31] 2017-07-07 19:05:31 - epoch 9, expected cost 0.0146
## INFO [2017-07-07 19:05:35] 2017-07-07 19:05:35 - epoch 10, expected cost 0.0143
wv_context = glove$components
wv = wv_main + t(wv_context)
cos_sim = sim2(x = wv, y = wv["new_zealand", , drop = FALSE], method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
##    new_zealand      australia united_kingdom         canada     queensland
##      1.0000000      0.8906049      0.7560024      0.7496143      0.7143516
paris = wv["new_york", , drop = FALSE] -
wv["usa", , drop = FALSE] +
wv["france", , drop = FALSE]
cos_sim = sim2(x = wv, y = paris, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
##    france     paris    london     after   england
## 0.7630746 0.7031228 0.6930537 0.6638524 0.6628017
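The similarity and analogy queries above boil down to cosine similarity between row vectors. A small Python sketch with made-up 3-dimensional vectors (the real embeddings are 50-dimensional vectors learned by GloVe; these numbers are invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the L2 norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up embeddings, not the trained GloVe vectors.
wv = {
    "new_york": [1.0, 0.9, 0.1],
    "usa":      [1.0, 0.1, 0.1],
    "france":   [0.1, 0.1, 1.0],
    "paris":    [0.1, 0.9, 1.0],
}
# Analogy: new_york - usa + france ≈ paris
query = [a - b + c for a, b, c in zip(wv["new_york"], wv["usa"], wv["france"])]
max(wv, key=lambda w: cosine(wv[w], query))
# -> 'paris'
```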

## Topic models with collocations

Incorporating collocations into topic models is very straightforward - we just need to create a document-term matrix and pass it to the LDA model.

data("movie_review")
prep_fun = function(x) {
  stringr::str_replace_all(tolower(x), "[^[:alpha:]]", " ")
}
it = itoken(movie_review$review, preprocessor = prep_fun, tokenizer = word_tokenizer, ids = movie_review$id, progressbar = FALSE)
it = model2$transform(it)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 10, doc_proportion_min = 0.01)

Let’s check how many of the phrases we learned from Wikipedia can be found in the movie_review dataset:

word_count_per_token = sapply(strsplit(v$term, "_", T), length)
v$term[word_count_per_token > 1]
##  [1] "ve_got"        "th_century"    "anything_else" "takes_place"
##  [5] "her_husband"   "once_again"    "weren_t"       "sci_fi"
##  [9] "years_ago"     "new_york"      "looks_like"    "rather_than"
## [13] "don_t_know"    "wouldn_t"      "aren_t"        "couldn_t"
## [17] "wasn_t"        "i_am"          "isn_t"         "didn_t"
## [21] "doesn_t"       "don_t"

Not many. It seems we may need to learn collocations from the movie_review dataset itself, or use different score thresholds. I leave this exercise to the reader. Anyway, now we can create a document-term matrix and run LDA:

N_TOPICS = 20
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
lda = LDA$new(N_TOPICS)
doc_topic = lda$fit_transform(dtm)