API

Goals we aimed to achieve during the development of text2vec:

  • Concise - expose as few functions as possible;
  • Consistent - expose unified interfaces, no need to learn a new interface for each task;
  • Flexible - allow users to easily solve complex tasks;
  • Fast - maximize efficiency per single thread, transparently scale to multiple threads on multicore machines;
  • Memory efficient - use streams and iterators, do not keep data in RAM where possible.

Conceptually we can divide the API into several pieces:

Vectorization

See the Vectorization section for details.

The create_* family of functions, together with vocab_vectorizer() and hash_vectorizer(), creates vocabularies, Document-Term matrices and Term co-occurrence matrices. Simply put, this family of functions is in charge of converting text into numeric form. The main functions are:

  • create_vocabulary();
  • create_dtm();
  • create_tcm();
  • vocab_vectorizer(), hash_vectorizer().
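
A minimal sketch of how these pieces fit together, using the movie_review sample dataset shipped with text2vec (the preprocessing choices here are illustrative, not prescriptive):

```r
library(text2vec)
data("movie_review")  # sample dataset bundled with text2vec

# iterator over tokens: lowercase, then split into words
it = itoken(movie_review$review,
            preprocessor = tolower,
            tokenizer = word_tokenizer,
            ids = movie_review$id)

v = create_vocabulary(it)          # learn the vocabulary
vectorizer = vocab_vectorizer(v)   # map tokens to vocabulary indices
dtm = create_dtm(it, vectorizer)   # sparse Document-Term matrix
```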

I/O handling

All functions from the create_* family take iterators over tokens as input. Good examples of how to create such iterators are:

  • ifiles() for creating an iterator over files. Note that text2vec doesn’t handle I/O itself; users should provide their own reader function (data.table::fread() and functions from the readr package are usually good choices);
  • itoken() for creating an iterator over tokens.

If a user needs a custom source (for example a data stream from an RDBMS), they just need to create a correct iterator over tokens.
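
For example, a file-backed pipeline might look like the following sketch. The file paths and the choice of readr::read_lines as the reader are assumptions for illustration; any reader returning a character vector will do:

```r
library(text2vec)

# hypothetical file paths; any reader returning a character vector works
files = list.files("path/to/texts", full.names = TRUE)
it_files = ifiles(files, reader = readr::read_lines)

# wrap the file iterator into an iterator over tokens
it_tokens = itoken(it_files,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer)

v = create_vocabulary(it_tokens)
```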

Easy parallel processing

text2vec also provides convenient functions for easy parallel processing of text (many of the tasks are embarrassingly parallel).

  • ifiles_parallel() is the same as ifiles() above, but creates a parallel iterator if a parallel backend is registered (for example with registerDoParallel());
  • itoken_parallel() is the same as itoken() above, but also creates a parallel iterator if a parallel backend is registered.

Parallel itoken iterators can be used in the create_dtm() and create_tcm() functions exactly the same way as their sequential counterparts.
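
A minimal sketch, assuming the doParallel backend and the bundled movie_review dataset:

```r
library(text2vec)
library(doParallel)
registerDoParallel(4)  # register a parallel backend with 4 workers

data("movie_review")
it = itoken_parallel(movie_review$review,
                     preprocessor = tolower,
                     tokenizer = word_tokenizer,
                     ids = movie_review$id)

# exactly the same calls as with a sequential iterator
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))
```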

Models

text2vec provides a unified interface for models, inspired by the scikit-learn interface. Models in text2vec are mostly transformers and decompositions: they transform a Document-Term matrix or decompose it into two low-rank matrices.

Models include:

  • Tf-idf reweighting. See Tf-idf in the Vectorization section;
  • Global Vectors (GloVe) word embeddings. See the Word Embeddings section;
  • Latent Semantic Analysis (LSA). See the LSA section;
  • Latent Dirichlet Allocation (LDA). See the LDA section;
  • Collocations. The Collocations model, which can learn phrases from text, stands a bit apart from the others and has a slightly different interface. It takes an itoken iterator as input to its fit method and learns the model. After that the user can pass another itoken iterator to the transform method and receive a new itoken iterator which produces tokens with phrases concatenated into single tokens.
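
A sketch of that Collocations workflow, assuming the movie_review dataset and default phrase-scoring settings (the constructor arguments and n_iter value shown here are illustrative and may differ between package versions):

```r
library(text2vec)
data("movie_review")

it = itoken(movie_review$review,
            preprocessor = tolower,
            tokenizer = word_tokenizer)

cc = Collocations$new(collocation_count_min = 50)
cc$fit(it, n_iter = 2)          # learn phrases from the token iterator
it_phrases = cc$transform(it)   # new iterator: phrases joined into single tokens
v = create_vocabulary(it_phrases)
```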

All text2vec models are mutable! This means that the fit() and fit_transform() methods change the model that was provided as an argument.

Important verbs

All models have a unified interface. The user only needs to remember a few verbs for model manipulation:

  • model$new(...) - create a model object and set up its initial parameters. This is model-specific. For example, for LDA it can be the number of topics \(K\) and the priors \(\alpha\) and \(\eta\);
  • model$partial_fit(x, ...) - partially fits the model to data (for online models);
  • model$fit_transform(x, ...) - fits the model to data and then transforms the data with the fitted model;
  • model$transform(x_new, ...) - transforms new data with the pretrained model.
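
Tf-idf reweighting follows these verbs. A minimal sketch, reusing a DTM built as in the Vectorization sketch above:

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

tfidf = TfIdf$new()                   # model$new(...)
dtm_tfidf = tfidf$fit_transform(dtm)  # fit on dtm, then transform it
# for new documents, reuse the fitted model:
# dtm_new_tfidf = tfidf$transform(dtm_new)
```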

Decomposition models decompose a matrix into two low-rank matrices \(X\) and \(Y\). \(X\) corresponds to item embeddings and \(Y\) corresponds to feature embeddings. For example, for LDA \(X\) will be the document-topic assignments and \(Y\) will be the topic-word assignments. While the fit_transform and transform methods give you \(X\), the second matrix \(Y\) is available as the read-only components field: model$components. Examples of “decomposition” models in text2vec are LDA, LSA and GloVe. Check the documentation of these classes for additional information.
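
A sketch with LSA (the number of topics here is arbitrary, chosen only for illustration):

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

lsa = LSA$new(n_topics = 10)
doc_embeddings  = lsa$fit_transform(dtm)  # X: document (item) embeddings
term_embeddings = lsa$components          # Y: feature (term) embeddings
```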

Distances

See Distances section for details.

The text2vec package provides two sets of functions (pairwise and parallel) for measuring various distances/similarities in a unified way. All methods are written with special attention to computational performance and memory efficiency.

  1. sim2(x, y, method) - calculates the similarity between each row of matrix x and each row of matrix y using the given method.
  2. psim2(x, y, method) - calculates the parallel similarity between rows of matrix x and the corresponding rows of matrix y using the given method.
  3. dist2(x, y, method) - calculates the distance/dissimilarity between each row of matrix x and each row of matrix y using the given method.
  4. pdist2(x, y, method) - calculates the parallel distance/dissimilarity between rows of matrix x and the corresponding rows of matrix y using the given method.

Distances/similarities implemented at the moment:

  • Cosine
  • Jaccard
  • Euclidean
  • Relaxed Word Mover’s Distance
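
For example, a sketch of cosine similarity between documents, continuing with a DTM built as above:

```r
library(text2vec)
data("movie_review")
it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

# cosine similarity between the first 5 documents and all documents
s = sim2(dtm[1:5, ], dtm, method = "cosine", norm = "l2")
dim(s)  # 5 x nrow(dtm)
```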