The goals we aimed to achieve in developing text2vec:
Conceptually, we can divide the API into several pieces:
See the Vectorization section for details.
The create_* family of functions, together with vocab_vectorizer() and hash_vectorizer(), creates vocabularies, document-term matrices and term co-occurrence matrices. Put simply, this family of functions is in charge of converting text into numeric form. The main functions are:
create_vocabulary();
create_dtm();
create_tcm();
vocab_vectorizer(), hash_vectorizer().
All functions from the create_* family take iterators over tokens as input. Good examples of functions for creating such iterators are:
ifiles() - creates an iterator over files. Note that text2vec doesn’t handle I/O, so users should provide their own reader function (data.table::fread() and functions from the readr package are usually good choices);
itoken() - creates an iterator over tokens.
When a user needs a custom source (for example, a data stream from an RDBMS), they just need to create a suitable iterator over tokens.
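The vectorization functions above can be chained as in the following minimal sketch. It assumes a recent text2vec release is installed; the toy corpus and all variable names are made up for illustration, and the argument values (window size, hash size) are arbitrary choices rather than recommended settings.

```r
library(text2vec)

# Hypothetical toy corpus, used only for illustration.
docs <- c(
  "the quick brown fox",
  "the lazy dog",
  "the quick dog jumps"
)

# Iterator over tokens: lowercase each document, then split it into words.
it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = paste0("doc", seq_along(docs)),
             progressbar = FALSE)

# Vocabulary-based vectorization: collect the terms, then map them to columns.
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Document-term matrix: one row per document, one column per vocabulary term.
dtm <- create_dtm(it, vectorizer)

# Term co-occurrence matrix, counting terms within a 2-token window.
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L)

# Hashing-based vectorization: no vocabulary pass, fixed number of columns.
h_vectorizer <- hash_vectorizer(hash_size = 2^10)
dtm_hashed <- create_dtm(it, h_vectorizer)

dim(dtm)
dim(tcm)
dim(dtm_hashed)
```

The same iterator is passed to every create_* call, which is what lets the input come from any source that can be wrapped as an iterator over tokens.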
text2vec also provides convenient functions for easy parallel processing of text (many of the tasks are embarrassingly parallel).
ifiles_parallel() - the same as ifiles() above, but creates a parallel iterator if a parallel backend is registered (for example, with registerDoParallel);
itoken_parallel() - the same as itoken() above, but also creates a parallel iterator if a parallel backend is registered.
Parallel itoken iterators can be used in the create_dtm() and create_tcm() functions in exactly the same way as their sequential counterparts.
text2vec provides a unified interface for models, inspired by the scikit-learn interface. Models in text2vec are mostly transformers and decompositions: they either transform a document-term matrix or decompose it into two low-rank matrices.
Models include:
itoken iterator as input to the fit method and learns the model. After that, the user can pass another itoken iterator to the transform method and receive an itoken iterator which produces tokens with phrases concatenated into a single token.
All text2vec models are mutable! This means that the fit() and fit_transform() methods change the model that was passed as an argument.
All models share a unified interface. The user only needs to remember a few verbs for manipulating models:
model$new(...) - creates a model object and sets up its initial parameters. This is model-specific. For example, for LDA these can be the number of topics \(K\) and the alpha (\(\alpha\)) and eta (\(\eta\)) priors;
model$partial_fit(x, ...) - partially fits the model to data (for online models);
model$fit_transform(x, ...) - fits the model to data and then transforms the data with the fitted model;
model$transform(x_new, ...) - transforms new data with a pretrained model.
Decomposition models decompose a matrix into two low-rank matrices \(X\) and \(Y\). \(X\) corresponds to item embeddings and \(Y\) corresponds to feature embeddings. For example, for LDA \(X\) holds the document-topic assignments and \(Y\) holds the topic-word assignments. While the fit_transform and transform methods give you \(X\), the second matrix, \(Y\), is available as the read-only components field: model$components. Examples of “decomposition” models in text2vec are LDA, LSA and GloVe. Check the documentation of these classes for additional information.
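The model verbs described above can be illustrated with a short sketch. It assumes a recent text2vec release and uses the movie_review dataset bundled with the package; the 150/50 split, the choice of 10 topics and all variable names are arbitrary and only for illustration. TfIdf stands in for a transformer and LSA for a decomposition.

```r
library(text2vec)

# Small slice of the movie_review dataset shipped with text2vec.
data("movie_review")
reviews <- movie_review$review[1:200]
train_idx <- 1:150

make_iterator <- function(txt) {
  itoken(txt, preprocessor = tolower, tokenizer = word_tokenizer,
         progressbar = FALSE)
}
it_train <- make_iterator(reviews[train_idx])
it_new <- make_iterator(reviews[-train_idx])

# Both matrices share the vocabulary learned from the training documents.
vectorizer <- vocab_vectorizer(create_vocabulary(it_train))
dtm_train <- create_dtm(it_train, vectorizer)
dtm_new <- create_dtm(it_new, vectorizer)

# Transformer: fit_transform() learns the weights and mutates `tfidf` in place;
# transform() then reuses the fitted state on new data.
tfidf <- TfIdf$new()
dtm_train_tfidf <- tfidf$fit_transform(dtm_train)
dtm_new_tfidf <- tfidf$transform(dtm_new)

# Decomposition: fit_transform() returns X (document embeddings),
# while Y (term embeddings) is exposed through the components field.
lsa <- LSA$new(n_topics = 10)
doc_emb_train <- lsa$fit_transform(dtm_train_tfidf)
doc_emb_new <- lsa$transform(dtm_new_tfidf)
term_emb <- lsa$components

dim(doc_emb_train)
dim(doc_emb_new)
dim(term_emb)
```

Because models are mutable, tfidf and lsa carry their fitted state after fit_transform(), so no reassignment is needed before calling transform().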
See the Distances section for details.
The text2vec package provides two sets of functions for measuring various distances/similarities in a unified way. All methods are written with special attention to computational performance and memory efficiency.
sim2(x, y, method) - calculates the similarity between each row of matrix x and each row of matrix y using the given method.
psim2(x, y, method) - calculates the parallel similarity between rows of matrix x and the corresponding rows of matrix y using the given method.
dist2(x, y, method) - calculates the distance/dissimilarity between each row of matrix x and each row of matrix y using the given method.
pdist2(x, y, method) - calculates the parallel distance/dissimilarity between rows of matrix x and the corresponding rows of matrix y using the given method.
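The difference between the "each row against each row" functions and their parallel (row-by-row) counterparts can be seen in a small sketch. It assumes a recent text2vec release; the four toy documents and the variable names are invented for illustration, and cosine with L2 normalization is just one of the available method choices.

```r
library(text2vec)

# Hypothetical toy corpus: documents 1-2 and 3-4 overlap heavily.
docs <- c(
  "the quick brown fox",
  "the quick brown dog",
  "a lazy dog sleeps",
  "a lazy cat sleeps"
)
it <- itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

# All-pairs: a 4 x 4 matrix comparing every row of x with every row of y.
sim_all <- sim2(dtm, dtm, method = "cosine", norm = "l2")
dist_all <- dist2(dtm, dtm, method = "cosine", norm = "l2")

# Parallel: row i of x is compared only with row i of y,
# so the result is a vector with one value per row.
x <- dtm[c(1, 3), ]
y <- dtm[c(2, 4), ]
sim_pairs <- psim2(x, y, method = "cosine", norm = "l2")
dist_pairs <- pdist2(x, y, method = "cosine", norm = "l2")

dim(sim_all)
length(sim_pairs)
```

When only matched pairs are needed, the parallel variants avoid building the full row-by-row matrix.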
Distances/similarities implemented at the moment: