--- title: "Text prediction via N-gram Stupid Back-off models" author: - name: Valerio Gherardi email: vgherard@sissa.it date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{sbo} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction The `sbo` package provides utilities for building and evaluating next-word prediction functions based on [Stupid Back-off](https://www.aclweb.org/anthology/D07-1090.pdf) N-gram models in R. In this vignette, I illustrate the main features of `sbo`, including in particular: * the typical workflow for building a text predictor from a given training corpus, and * the evaluation of next-word predictions through a test corpus. ```{r} library(sbo) ``` ## Building text predictors with `sbo` ### Building a text predictor In this and the next section we will employ the `sbo::twitter_train` example dataset[^1], each entry of which consists in a single tweet in English, *e.g.*: ```{r} head(sbo::twitter_train, 3) ``` [^1]: This is a small samples of $5ยท10^4$ entries from the "Tweets" Swiftkey dataset fully available [here](https://www.kaggle.com/crmercado/tweets-blogs-news-swiftkey-dataset-4million). Given the training corpus, the typical workflow for building a text-predictor consists of the following steps: 1. *Preprocessing*. Apply some transformations to the training corpus before $k$-gram extraction. 1. *Sentence tokenization*. Split the training corpus into sentences. 1. *Extract $k$-gram frequencies*. These are the building blocks for any $N$-gram language model. 1. *Train a text predictor*. Build a prediction function $f$, which takes some text input and returns as output a next-word prediction (or more than one, ordered by decreasing probability). Also, implicit in the previous steps is the *choice of a model dictionary*, which can be done a priori, or during the training process. All these steps (including building a dictionary) can be performed in `sbo` as follows: ```{r} p <- sbo_predictor(object = sbo::twitter_train, # preloaded example dataset N = 3, # Train a 3-gram model dict = target ~ 0.75, # cover 75% of training corpus .preprocess = sbo::preprocess, # Preprocessing transformation EOS = ".?!:;", # End-Of-Sentence tokens lambda = 0.4, # Back-off penalization in SBO algorithm L = 3L, # Number of predictions for input filtered = "" # Exclude the token from predictions ) ``` This creates an object `p` of class `sbo_predictor`, which can be used to generate text predictions as follows: ```{r} predict(p, "i love") ``` Let us comment the various arguments in the previous call to `sbo_predictor()`: - `object`. The training corpus used to train the text predictor. - `N`. The order $N$ of the $N$-gram model. - `dict`. This argument specifies the model dictionary. In this case, we build our dictionary directly from the training corpus, using the most frequent words which cover a fraction `target = 0.75` of the corpus. In alternative, one can prune the corpus word set to a fixed vocabulary size, or use a predefined dictionary. - `.preprocess`. The function used in corpus preprocessing. Here we leverage on the minimal `sbo::preprocess`, but this can be in principle any function taking a character input and returning a character output. - `EOS`. 
Before proceeding, let us take a little break and introduce the most important function of `sbo`:

```{r}
set.seed(840)
babble(p)
babble(p)
babble(p)
```

### Out of memory use

The example in the previous Section illustrates how to use a text predictor in interactive mode. If the training process is computationally expensive, one may want to save the text predictor object (i.e. `p` in the example above) out of physical memory (e.g. through `save()`).

For this purpose[^2], `sbo` provides the class `sbo_predtable` ("Stupid Back-Off prediction tables"). These objects are a "raw" equivalent of a text predictor, and can be created with `sbo_predtable()`, which has the same user interface as `sbo_predictor()`. For example, the definition of `p` above would be replaced by:

```{r}
t <- sbo_predtable(object = sbo::twitter_train, # preloaded example dataset
                   N = 3, # Train a 3-gram model
                   dict = target ~ 0.75, # cover 75% of training corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation
                   EOS = ".?!:;", # End-Of-Sentence tokens
                   lambda = 0.4, # Back-off penalization in SBO algorithm
                   L = 3L, # Number of predictions for input
                   filtered = "<UNK>" # Exclude the <UNK> token from predictions
                   )
```

From `t`, one can rapidly recover the corresponding text predictor, using `sbo_predictor()`[^3]:

```{r}
p <- sbo_predictor(t) # This is the same as 'p' created above
```

Objects of class `sbo_predtable` can be safely stored out of memory and loaded in other R sessions:

```{r, eval=FALSE}
save(t, file = "t.rda")
# ... and, in another session:
load("t.rda")
```

[^2]: At the present stage of development, this cannot be done directly for the `sbo_predictor` object created above. Technically, this is because `sbo_predictor` objects are external pointers to a convenient C++ class, optimized for fast `predict()`ions. Such a class instance exists only during a single R session.

[^3]: As this example makes clear, both `sbo_predictor()` and `sbo_predtable()` are S3 generics. The corresponding available methods are listed under `?sbo_predictions`.

### Some details on `sbo_predictor` and `sbo_predtable` class objects

`sbo_predictor` and `sbo_predtable` objects directly store next-word predictions for each $k$-gram prefix ($k = 1,\,2,\dots,\,N-1$) observed in the training corpus, allowing for memory compression and fast query.

Both objects store, through attributes, information about the training process. This can be conveniently obtained through the corresponding `summary()` methods, e.g.

```{r}
summary(p)
```
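
The stored attributes can also be inspected directly; the following is a minimal sketch (only the `dict` attribute is used elsewhere in this vignette, and the remaining attribute names should be regarded as implementation details):

```{r, eval=FALSE}
# Not run: a sketch of direct access to the stored training metadata.
names(attributes(t))   # list the metadata stored with the prediction tables
head(attr(p, "dict"))  # e.g. the model dictionary, sorted by corpus frequency
```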
#### Internal structure of `sbo_predictor` and `sbo_predtable` objects

Here are some details on the current (still under development) implementation of `sbo_predictor` and `sbo_predtable` objects. For clarity, I will refer to the `sbo_predtable` instance `t` created above:

```{r}
head(t[[3]])
```

The first two columns correspond to the word codes[^4] of $2$-gram prefixes observed in the training corpus, and the other columns code the top $L = 3$ predictions for these $2$-grams. When a $2$-gram $w_1 w_2$ is given as input for text prediction, it is first looked up in the prefix columns of `t[[3]]`. If it is not found, $w_2$ is looked up in the prefix column of `t[[2]]`. If this also fails, the prediction is performed without any prefix, that is, we simply predict the `L` most frequent words, stored in:

```{r}
t[[1]]
```

[^4]: Coded with respect to the rank-sorted dictionary `dict` (the codes `0`, `length(dict) + 1` and `length(dict) + 2` correspond to the Begin-Of-Sentence, End-Of-Sentence and Unknown-Word tokens, respectively).

## Evaluating next-word predictions

For convenience, this Section also makes use of:

```{r message=FALSE, warning=FALSE}
library(dplyr) # installed with `sbo`
```

Once we have built our next-word predictor, we may want to directly test its predictions on an independent corpus. For this purpose, `sbo` offers the function `eval_sbo_predictor()`, which samples $N$-grams from a test corpus and compares the predictions from the $(N-1)$-gram prefixes with the true completions.

In more detail, given a character vector `test`, each entry of which represents a single document, `eval_sbo_predictor()` performs the following steps:

1. Sample a single sentence from each entry of `test`, i.e. from each document.
1. Sample a single $N$-gram from each sentence obtained in the previous step.
1. Predict next words from the $(N-1)$-gram prefix.
1. Return all predictions, together with the true word completions.

For a reasonable estimate of prediction accuracy, the various entries of `test` should be independent documents, e.g. single tweets as in the `sbo::twitter_test` example dataset, which we use below to test the previously trained predictor `p`.

```{r}
set.seed(840)
(evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test))
```

As can be seen, `eval_sbo_predictor()` returns a tibble containing the input $(N-1)$-grams, the true completions, the predicted completions and a column indicating whether any of the predictions was correct or not.

We can estimate predictive accuracy as follows (the uncertainty in the estimate is approximated by the binomial formula $\sigma = \sqrt{\frac{p(1-p)}{M}}$, where $M$ is the number of trials):

```{r}
evaluation %>% summarise(accuracy = sum(correct) / n(),
                         uncertainty = sqrt(accuracy * (1 - accuracy) / n())
                         )
```

We may want to exclude from the test $N$-grams ending with the End-Of-Sentence token (here represented by `"<EOS>"`):

```{r}
evaluation %>% # Accuracy for in-sentence predictions
        filter(true != "<EOS>") %>%
        summarise(accuracy = sum(correct) / n(),
                  uncertainty = sqrt(accuracy * (1 - accuracy) / n())
                  )
```
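
The same estimate can also be obtained without `dplyr`; here is a minimal base R equivalent for the full test set, spelling out the binomial formula above:

```{r, eval=FALSE}
# Not run: base R equivalent of the accuracy estimate above.
M <- nrow(evaluation)                               # number of trials
accuracy <- sum(evaluation$correct) / M             # fraction of correct predictions
uncertainty <- sqrt(accuracy * (1 - accuracy) / M)  # binomial standard error
c(accuracy = accuracy, uncertainty = uncertainty)
```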
In trying to reduce the size (in physical memory) of your text predictor, it might be useful to prune the model dictionary. The following command plots a histogram of the dictionary ranks of the correctly predicted words in our test.

```{r, fig.align = "center"}
if (require(ggplot2)) {
        evaluation %>%
                filter(correct, true != "<EOS>") %>%
                select(true) %>%
                transmute(rank = match(true, table = attr(p, "dict"))) %>%
                ggplot(aes(x = rank)) + geom_histogram(binwidth = 25)
}
```

Apparently, the large majority of correct predictions come from the first ~ 300 words of the dictionary, so that if we prune the dictionary, excluding words with rank greater than, *e.g.*, 500, we can reduce the size of our model without seriously affecting its prediction accuracy.

## Other functionalities

In this Section, I briefly survey other functionalities provided by `sbo`. See the corresponding help pages for more details.

### Dictionaries

Dictionaries can be directly built using `sbo_dictionary()`. For example, the command:

```{r}
dict <- sbo_dictionary(corpus = sbo::twitter_train,
                       max_size = 100,
                       target = 0.5,
                       .preprocess = sbo::preprocess,
                       EOS = ".?!:;")
```

constructs a dictionary by applying the more restrictive of the two constraints `max_size = 100` and `target = 0.5`, where `target` denotes the coverage fraction of `corpus`. The arguments `.preprocess` and `EOS` work as described above. The output is an object of class `sbo_dictionary`, which stores, along with a vector of words sorted by decreasing corpus frequency, the original values of `.preprocess` and `EOS`.

### Word coverage

The word coverage fraction of a dictionary can be computed through the generic function `word_coverage()`. This accepts as argument any object containing a dictionary, along with a preprocessing function and a list of End-Of-Sentence characters. This includes all `sbo` main classes: `sbo_dictionary`, `sbo_kgram_freqs`, `sbo_predtable` and `sbo_predictor`. For instance, the following computes the coverage fraction of the dictionary used by the predictor `p`, on the original training corpus:

```{r}
(c <- word_coverage(p, sbo::twitter_train))
```

This can be conveniently summarized with:

```{r}
summary(c)
```

or visualized with:

```{r, fig.align = "center", fig.width=5}
plot(c)
```

### $k$-gram tokenization

$k$-gram frequency tables form the building blocks of *any* $N$-gram based language model. The function `sbo::kgram_freqs()` extracts frequency tables from a training corpus and stores them into an object of class `sbo_kgram_freqs`. For example:

```{r}
f <- kgram_freqs(corpus = sbo::twitter_train,
                 N = 3,
                 dict = target ~ 0.75,
                 .preprocess = sbo::preprocess,
                 EOS = ".?!:;"
                 )
```

stores $k$-gram frequency tables into `f`. This object can itself be used for text prediction:

```{r}
predict(f, "i love")
```

The output contains the full language model information, i.e. the probabilities[^5] for each possible word completion. Compare this with:

```{r}
predict(p, "i love")
```

The extra information contained in `f` comes at a price. In fact, the advantage provided by `sbo_predictor`/`sbo_predtable` objects for simple text prediction is twofold:

1. Memory compression:

```{r}
size_in_MB <- function(x) format(utils::object.size(x), units = "MB")
sapply(list(sbo_predtable = t, kgram_freqs = f), size_in_MB)
```

2. Fast query:

```{r}
chrono_predict <- function(x) system.time(predict(x, "i love"), gcFirst = TRUE)
lapply(list(sbo_predictor = p, kgram_freqs = f), chrono_predict)
```

[^5]: More precisely, these are the normalized (to unity) scores resulting from the Stupid Back-Off smoothing method.
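
Note that, if the frequency tables `f` have already been computed, one presumably does not need to start again from the raw corpus to obtain the compressed objects. The sketch below assumes that `sbo_predictor()` also has a method for `sbo_kgram_freqs` objects (see `?sbo_predictions` for the methods actually available):

```{r, eval=FALSE}
# Not run: a sketch, assuming a `sbo_kgram_freqs` method for sbo_predictor()
# is available (see ?sbo_predictions).
p2 <- sbo_predictor(f, lambda = 0.4, L = 3L, filtered = "<UNK>")
predict(p2, "i love")
```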
### Text preprocessing

Usually, text corpora require preprocessing before word and $k$-gram tokenization can take place. The `.preprocess` argument of `kgram_freqs()` allows for a user-specified preprocessing function. The default is the minimal `sbo::preprocess()`; the optimized `kgram_freqs_fast(erase = x, EOS = y)` is equivalent to `kgram_freqs(.preprocess = function(input) sbo::preprocess(input, erase = x), EOS = y)`, but more efficient.

### Sentence tokenization

Tokenization at the sentence level is required to obtain terminal $k$-grams (i.e. $k$-grams containing Begin-Of-Sentence or End-Of-Sentence tokens). In the training process, sentence tokenization takes place after text preprocessing. End-Of-Sentence (single character) tokens are specified by the `EOS` argument of `kgram_freqs()` and `kgram_freqs_fast()`; empty sentences are always skipped. Also, if the input vector `text` has `length(text) > 1`, its various elements are understood to belong to different sentences.

The process of sentence tokenization can also be performed directly, through `sbo::tokenize_sentences()`, but this step is not required for use with `kgram_freqs*()`.
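
For illustration only, here is a minimal sketch of direct sentence tokenization; the `EOS` argument shown below is an assumption, see `?tokenize_sentences` for the actual interface:

```{r, eval=FALSE}
# Not run: a sketch of direct sentence tokenization
# (the EOS argument is assumed; see ?tokenize_sentences).
tokenize_sentences(sbo::preprocess("i love this package! it is great."),
                   EOS = ".?!:;")
```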