In this tutorial I will show how to extract phrases from text and how they can be used in downstream tasks. I will use the text8 dataset, which is available for download here. It consists of 100MB of text from English Wikipedia.

Fitting the model is as easy as:

library(text2vec)
model = Collocations$new(collocation_count_min = 50)
txt = readLines("~/text8")
it = itoken(txt)
model$fit(it, n_iter = 3)
## INFO [2017-07-07 19:02:08] iteration 1 - found 5300 collocations
## INFO [2017-07-07 19:02:24] iteration 2 - found 6778 collocations
## INFO [2017-07-07 19:02:37] iteration 3 - found 6802 collocations

Now let us check what we got. Learned collocations are kept in the collocation_stat field of the model:

model$collocation_stat
##                  prefix               suffix   n_i    n_j n_ij       pmi
##    1:     politics_main     article_politics    52     50   50 18.143106
##    2:            krag_j              rgensen    54     50   50 18.088658
##    3: demographics_main article_demographics    54     50   50 18.088658
##    4:      economy_main      article_economy    55     55   55 18.062186
##    5:           merleau                ponty    63     65   62 17.880585
##   ---                                                                   
## 6798:          produced                   by  3457 111831  777  5.001542
## 6799:              seem                   to   807 316376  513  5.001146
## 6800:        referendum                   on   300  91250   55  5.001041
## 6801:             other                types 32433   2518  164  5.000351
## 6802:                 k                    n  4472   6942   66  5.000100
##            lfmd       gensim rank_pmi rank_lfmd rank_gensim
##    1: -18.25627     0.000000        1       138        6638
##    2: -18.31072     0.000000        2       143        6639
##    3: -18.31072     0.000000        3       144        6640
##    4: -18.06219 24880.960331        4       119          85
##    5: -18.06310 46707.003663        5       120          33
##   ---                                                      
## 6798: -23.64699    29.972812     6798      1266        4121
## 6799: -24.84530    28.904043     6799      1812        4203
## 6800: -31.28831     2.911190     6800      6769        6423
## 6801: -28.13662    22.249316     6801      4162        4772
## 6802: -30.59820     7.758113     6802      6353        6022

How it works

The model goes through consecutive tokens and collects statistics - how frequently one token follows another, the frequencies of individual tokens, etc. Based on these statistics the model calculates several scores: PMI, LFMD (see the paper below), and “gensim”. The scores are heuristics - the model is unsupervised. For an overview of how different approaches perform, see the paper Automatic Extraction of Fixed Multiword Expressions.
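To make the scoring concrete, PMI can be sketched as follows (a simplified illustration, not necessarily the exact formula text2vec uses internally; the counts are made up):

```r
# Pointwise mutual information of two adjacent tokens:
# pmi(a, b) = log2( p(a, b) / (p(a) * p(b)) )
pmi_score = function(n_ij, n_i, n_j, n_total) {
  log2((n_ij / n_total) / ((n_i / n_total) * (n_j / n_total)))
}
# Hypothetical counts: "new york" seen 100 times in a 17M-token corpus,
# "new" seen 150 times and "york" 115 times in total
pmi_score(n_ij = 100, n_i = 150, n_j = 115, n_total = 17e6)
## roughly 16.6 - a strong phrase candidate
```

A frequent pair like “it is” would score far lower, because its joint count is small relative to the product of the individual counts.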

Details

There are several important parameters in the model. Let’s take a closer look at the constructor:

colloc = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, sep = "_")
  • vocabulary (optional) - an instance of a text2vec vocabulary. If provided, the model will only search for collocations consisting of words from the vocabulary. If not provided, the model will first make one pass over the data to create it.
  • collocation_count_min - the model will only consider a set of words a phrase if it observes it at least collocation_count_min times. For example, if the collocation “new york” was observed fewer than 50 times, the model will treat “new” and “york” as separate words.
  • pmi_min, gensim_min, lfmd_min - minimal values of the corresponding scores used to filter out low-scored collocation candidates.

Generally the model needs to make several passes over the data. As mentioned above, on each pass it collects statistics about adjacent word co-occurrences.

Example

Let’s consider an example.

  1. Suppose at the first pass the model found that the words “new” and “york” occur together 100 times as “new_york”, while the words “new” and “york” occur 150 and 115 times respectively. Intuitively there is a very high chance that “new_york” is a good phrase candidate (and it will have high PMI, LFMD, and gensim scores). In contrast, if we look at the words “it” and “is”, it can happen that “it_is” occurs 500 times, but the words “it” and “is” separately occur 15000 and 17000 times. Intuitively it is very unlikely that “it_is” is a good phrase. So after each pass over the data we prune phrase candidates by removing co-occurrences with low PMI, LFMD, and gensim scores.
  2. Suppose we have detected the phrase “new_york” after the first pass. During the second pass the model will scan the tokens, and whenever it finds the words “new” and “york” in sequence it will concatenate them into “new_york” and treat it as a single token (if any other word follows “new”, the model won’t concatenate them and will treat them as two separate tokens). Now imagine the next token after “new_york” is “city”. The model will again calculate co-occurrence scores as in step 1 and decide whether to keep “new_york_city” as a phrase/collocation or to treat “new_york” and “city” as separate tokens. By repeating this process we can learn long multi-word phrases.
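The concatenation step in item 2 can be illustrated with a toy function (my own sketch, not text2vec’s actual implementation):

```r
# One pass over tokens: merge adjacent pairs that are already known phrases
concat_pass = function(tokens, phrases, sep = "_") {
  out = character(0)
  i = 1
  while (i <= length(tokens)) {
    candidate = if (i < length(tokens)) paste(tokens[i], tokens[i + 1], sep = sep) else ""
    if (candidate %in% phrases) {
      out = c(out, candidate)
      i = i + 2  # skip both merged tokens
    } else {
      out = c(out, tokens[i])
      i = i + 1
    }
  }
  out
}
concat_pass(c("new", "york", "city", "is", "big"), phrases = "new_york")
## [1] "new_york" "city"     "is"       "big"
```

On a subsequent pass, “new_york” and “city” would themselves form a candidate pair, which is exactly how longer phrases such as “new_york_city” emerge.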

As a result, in the end the model will be able to concatenate collocations from tokens. Let’s check how the naive model trained on Wikipedia works:

test_txt = c("i am living in a new apartment in new york city", 
        "new york is the same as new york city", 
        "san francisco is very expensive city", 
        "who claimed that model works?")
it = itoken(test_txt, n_chunks = 1, progressbar = FALSE)
it_phrases = model$transform(it)
it_phrases$nextElem()
## $tokens
## $tokens[[1]]
## [1] "i_am"          "living"        "in"            "a"            
## [5] "new"           "apartment"     "in"            "new_york_city"
## 
## $tokens[[2]]
## [1] "new_york"      "is"            "the"           "same"         
## [5] "as"            "new_york_city"
## 
## $tokens[[3]]
## [1] "san_francisco" "is"            "very"          "expensive"    
## [5] "city"         
## 
## $tokens[[4]]
## [1] "who"          "claimed_that" "model"        "works?"      
## 
## 
## $ids
## [1] "1" "2" "3" "4"

As we can see, the results are pretty impressive but not ideal - we probably do not want “claimed_that” as a collocation. One solution is to provide a vocabulary without stopwords to the model constructor. But this won’t solve most of the edge cases. Another solution is to keep track of what the model has learned after each pass over the data. We can fit the model incrementally with the partial_fit() method and prune bad phrases after each iteration.

it = itoken(txt)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 50)
model2 = Collocations$new(vocabulary = v, collocation_count_min = 50, pmi_min = 0)
model2$partial_fit(it)
model2$collocation_stat
##          prefix  suffix    n_i    n_j n_ij          pmi      lfmd
##     1:  merleau   ponty     63     65   62 1.736649e+01 -17.54900
##     2:     limp  bizkit     66     57   54 1.728955e+01 -18.02457
##     3: bhagavad    gita     51     70   50 1.725409e+01 -18.28208
##     4:     krav    maga     76     73   70 1.710348e+01 -17.46185
##     5:      orl     ans     59     81   58 1.704743e+01 -18.06050
##    ---                                                           
## 10763:     when     its  20623  29567   55 9.637992e-03 -35.25153
## 10764:     from english  72871  11868   78 9.500960e-03 -34.24358
## 10765:        e     two  11426 192644  198 5.691199e-03 -31.55949
## 10766:        s       l 116710   5343   56 3.297087e-03 -35.20588
## 10767:     zero       k 264975   4970  118 6.130957e-05 -33.05854
##              gensim rank_pmi rank_lfmd rank_gensim
##     1: 3.270582e+04        1        88          15
##     2: 1.186695e+04        2       117          73
##     3: 0.000000e+00        3       141       10541
##     4: 4.023382e+04        4        82           7
##     5: 1.868318e+04        5       121          49
##    ---                                            
## 10763: 9.151845e-02    10763     10747       10338
## 10764: 3.613462e-01    10764     10421        9757
## 10765: 7.504292e-01    10765      7987        8938
## 10766: 1.073880e-01    10766     10742       10303
## 10767: 5.762957e-01    10767      9528        9295

Since we set the PMI threshold to 0, we got a lot of garbage collocations like “when_its”. Fortunately we can manually prune them and continue training. Let’s filter by some thresholds:

temp = model2$collocation_stat[pmi >= 8 & gensim >= 10 & lfmd >= -25, ]
temp
##           prefix   suffix  n_i   n_j n_ij       pmi      lfmd     gensim
##    1:    merleau    ponty   63    65   62 17.366494 -17.54900 32705.8227
##    2:       limp   bizkit   66    57   54 17.289548 -18.02457 11866.9452
##    3:       krav     maga   76    73   70 17.103476 -17.46185 40233.8212
##    4:        orl      ans   59    81   58 17.047433 -18.06050 18683.1756
##    5:     lingua   franca   78    55   52 17.045623 -18.37739  5203.1991
##   ---                                                                   
## 1337: difference  between 1375 15737  506  8.027850 -20.83005   235.2003
## 1338:    working    class 2271  3412  181  8.026277 -23.79792   188.6874
## 1339:   chemical elements 1944  2723  123  8.018666 -24.92020   153.9135
## 1340:   republic  ireland 4231  2362  231  8.011118 -23.10927   202.1405
## 1341:       rock     band 2819  3304  214  8.002446 -23.33851   196.5199
##       rank_pmi rank_lfmd rank_gensim
##    1:        1        88          15
##    2:        2       117          73
##    3:        4        82           7
##    4:        5       121          49
##    5:        6       152         150
##   ---                               
## 1337:     1899       474        1182
## 1338:     1900      1392        1324
## 1339:     1904      1887        1461
## 1340:     1907      1111        1279
## 1341:     1916      1201        1295

If it looks reasonable, we can prune the learned collocations:

model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
identical(temp, model2$collocation_stat)
## [1] TRUE

And continue training:

model2$partial_fit(it)
model2$prune(pmi_min = 8, gensim_min = 10, lfmd_min = -25)
model2$collocation_stat
##               prefix   suffix  n_i   n_j n_ij       pmi      lfmd
##    1:        merleau    ponty   63    65   62 17.366494 -17.54900
##    2:       ifad_ifc    ifrcs   64    64   60 17.291136 -17.66357
##    3:           limp   bizkit   66    57   54 17.289548 -18.02457
##    4:    leonardo_da    vinci   66    75   66 17.155426 -17.52428
##    5:           krav     maga   76    73   70 17.103476 -17.46185
##   ---                                                            
## 1572:            ice      age 1441  4875  167  8.023908 -23.97717
## 1573:       chemical elements 1944  2723  123  8.018666 -24.92020
## 1574:       republic  ireland 4231  2362  231  8.011118 -23.10927
## 1575: prize_laureate        d  428 16581  167  8.009239 -23.99184
## 1576:           rock     band 2819  3304  214  8.002446 -23.33851
##           gensim rank_pmi rank_lfmd rank_gensim
##    1: 32705.8227        2        96          21
##    2: 26730.0171        3       101          33
##    3: 11866.9452        4       133          97
##    4: 35389.4626        6        94          16
##    5: 40233.8212        7        87          10
##   ---                                          
## 1572:   182.3503     2032      1571        1534
## 1573:   153.9135     2036      2048        1628
## 1574:   202.1405     2042      1193        1469
## 1575:   180.5055     2044      1577        1544
## 1576:   196.5199     2045      1289        1480

And so on, until we decide to stop the process (for example, when the number of learned phrases stops changing between passes).

Usage

It is pretty interesting that we can extract collocations like “george_washington” or “new_york_city”, but it is even more exciting to use them in downstream tasks. Good examples are topic models (phrases improve interpretability a lot!) and word embeddings.

How do we incorporate them into a model? It is simple - create a vocabulary which contains both words and phrases, and then a document-term matrix or term-co-occurrence matrix.

In order to do that we need to create an itoken iterator which will concatenate collocations, and then pass it to any function which consumes iterators.

it_phrases = model2$transform(it)
vocabulary_with_phrases = create_vocabulary(it_phrases, stopwords = tokenizers::stopwords("en"))
vocabulary_with_phrases = prune_vocabulary(vocabulary_with_phrases, term_count_min = 10)
vocabulary_with_phrases[startsWith(vocabulary_with_phrases$term, "new_"), ]
## Number of docs: 1 
## 33 stopwords: a, an, and, are, as, at ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##                term term_count doc_count
## 1: new_york_yankees         85         1
## 2:    new_hampshire        183         1
## 3:    new_brunswick        188         1
## 4:  new_south_wales        204         1
## 5:      new_orleans        300         1
## 6:       new_jersey        425         1
## 7:    new_testament        517         1
## 8:      new_zealand       1095         1
## 9:         new_york       4884         1

Word embeddings with collocations

Now we can create a term-co-occurrence matrix which will contain both words and multi-word phrases (make sure you provide the itoken iterator which generates phrases, not plain words):

tcm = create_tcm(it_phrases, vocab_vectorizer(vocabulary_with_phrases))

And train word embeddings model:

glove = GloVe$new(50, vocabulary = vocabulary_with_phrases, x_max = 50)
wv_main = glove$fit_transform(tcm, 10)
## INFO [2017-07-07 19:05:05] 2017-07-07 19:05:05 - epoch 1, expected cost 0.0305
## INFO [2017-07-07 19:05:08] 2017-07-07 19:05:08 - epoch 2, expected cost 0.0211
## INFO [2017-07-07 19:05:11] 2017-07-07 19:05:11 - epoch 3, expected cost 0.0187
## INFO [2017-07-07 19:05:14] 2017-07-07 19:05:14 - epoch 4, expected cost 0.0173
## INFO [2017-07-07 19:05:18] 2017-07-07 19:05:18 - epoch 5, expected cost 0.0164
## INFO [2017-07-07 19:05:21] 2017-07-07 19:05:21 - epoch 6, expected cost 0.0157
## INFO [2017-07-07 19:05:24] 2017-07-07 19:05:24 - epoch 7, expected cost 0.0153
## INFO [2017-07-07 19:05:28] 2017-07-07 19:05:28 - epoch 8, expected cost 0.0149
## INFO [2017-07-07 19:05:31] 2017-07-07 19:05:31 - epoch 9, expected cost 0.0146
## INFO [2017-07-07 19:05:35] 2017-07-07 19:05:35 - epoch 10, expected cost 0.0143
wv_context = glove$components
wv = wv_main + t(wv_context)
cos_sim = sim2(x = wv, y = wv["new_zealand", , drop = FALSE], method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
##    new_zealand      australia united_kingdom         canada     queensland 
##      1.0000000      0.8906049      0.7560024      0.7496143      0.7143516
paris = wv["new_york", , drop = FALSE] - 
  wv["usa", , drop = FALSE] + 
  wv["france", , drop = FALSE]
cos_sim = sim2(x = wv, y = paris, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)
##    france     paris    london     after   england 
## 0.7630746 0.7031228 0.6930537 0.6638524 0.6628017

Topic models with collocations

Incorporating collocations into topic models is very straightforward - we just need to create a document-term matrix and pass it to the LDA model.

data("movie_review")
prep_fun = function(x) {
  stringr::str_replace_all(tolower(x), "[^[:alpha:]]", " ")
}
it = itoken(movie_review$review, preprocessor = prep_fun, tokenizer = word_tokenizer, 
            ids = movie_review$id, progressbar = FALSE)
it = model2$transform(it)
v = create_vocabulary(it, stopwords = tokenizers::stopwords("en"))
v = prune_vocabulary(v, term_count_min = 10, doc_proportion_min = 0.01)

Let’s check how many of the phrases we’ve learned from Wikipedia can be found in the movie_review dataset:

word_count_per_token = sapply(strsplit(v$term, "_", fixed = TRUE), length)
v$term[word_count_per_token > 1]
##  [1] "ve_got"        "th_century"    "anything_else" "takes_place"  
##  [5] "her_husband"   "once_again"    "weren_t"       "sci_fi"       
##  [9] "years_ago"     "new_york"      "looks_like"    "rather_than"  
## [13] "don_t_know"    "wouldn_t"      "aren_t"        "couldn_t"     
## [17] "wasn_t"        "i_am"          "isn_t"         "didn_t"       
## [21] "doesn_t"       "don_t"

Not many. It seems we may need to learn collocations from the movie_review dataset itself, or use different score thresholds. I leave this exercise to the reader.

Anyway, now we can create a document-term matrix and run LDA:

N_TOPICS = 20
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
lda = LDA$new(N_TOPICS)
doc_topic = lda$fit_transform(dtm)
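To sanity-check the topics (and see whether any learned phrases made it into the top words), we can use the LDA model’s get_top_words() method; the call below assumes the text2vec API, with lambda = 1 ranking purely by in-topic probability:

```r
# Top 10 words/phrases for the first 4 topics
lda$get_top_words(n = 10, topic_number = 1:4, lambda = 1)
```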
text2vec is created by Dmitry Selivanov and contributors. © 2016.