| Title: | Text Prediction via Stupid Back-Off N-Gram Models |
|---|---|
| Description: | Utilities for training and evaluating text predictors based on Stupid Back-Off N-gram models (Brants et al., 2007, <https://www.aclweb.org/anthology/D07-1090/>). |
| Authors: | Valerio Gherardi |
| Maintainer: | Valerio Gherardi <[email protected]> |
| License: | GPL-3 |
| Version: | 0.5.0 |
| Built: | 2024-12-07 03:23:58 UTC |
| Source: | https://github.com/vgherard/sbo |
Coerce objects to sbo_dictionary class.
as_sbo_dictionary(x, ...)

## S3 method for class 'character'
as_sbo_dictionary(x, .preprocess = identity, EOS = "", ...)
x | object to be coerced. |
... | further arguments passed to or from other methods. |
.preprocess | a function for corpus preprocessing. |
EOS | a length one character vector listing all (single character) end-of-sentence tokens. |
This function is an S3 generic for coercing existing objects to sbo_dictionary class objects. Currently, only a method for character vectors is implemented, which is described below.

Character vector input: calling as_sbo_dictionary(x) simply decorates the character vector x with the sbo_dictionary class attribute, and with customizable .preprocess and EOS attributes.

A sbo_dictionary object.
Valerio Gherardi
dict <- as_sbo_dictionary(c("a","b","c"), .preprocess = tolower, EOS = ".")
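As a quick check (a minimal sketch, assuming the sbo package is attached), the class and attributes added by the coercion can be inspected directly:

library(sbo)
dict <- as_sbo_dictionary(c("a", "b", "c"), .preprocess = tolower, EOS = ".")
class(dict)               # should include "sbo_dictionary"
attr(dict, ".preprocess") # the stored preprocessing function
attr(dict, "EOS")         # "."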
Generate random text from a Stupid Back-Off language model.
babble(model, input = NA, n_max = 100L, L = attr(model, "L"))
model | a sbo_predictor object. |
input | a length one character vector. Starting point for babbling! If NA (the default), text generation starts from scratch. |
n_max | a length one integer. Maximum number of words to generate. |
L | a length one integer. Number of next-word suggestions from which to sample (see details). |
This function generates random text from a Stupid Back-Off language model. babble randomly samples one of the top L next-word predictions. Text generation stops when an End-Of-Sentence token is encountered, or when the number of generated words exceeds n_max.

A character vector of length one.
Valerio Gherardi
# Babble!
p <- sbo_predictor(twitter_predtable)
set.seed(840) # Set seed for reproducibility
babble(p)
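An additional sketch (assuming the bundled twitter_predtable example object), illustrating how input, n_max and L shape the generated text:

p <- sbo_predictor(twitter_predtable)
set.seed(840)
babble(p, input = "i love", n_max = 10L) # continue a given prompt, at most 10 words
babble(p, n_max = 5L, L = 1L)            # always pick the single top prediction (greedy)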
Evaluate next-word predictions from a Stupid Back-Off N-gram model on a test corpus.
eval_sbo_predictor(model, test, L = attr(model, "L"))
model | a sbo_predictor object. |
test | a character vector. Perform a single prediction on each entry of this vector (see details). |
L | Maximum number of predictions for each input sentence (maximum allowed is attr(model, "L")). |
This function allows one to assess the quality of Stupid Back-Off model predictions, such as next-word prediction accuracy or the word-rank distribution of correct predictions, by testing directly against a test corpus. For a reasonable estimate of prediction accuracy, the different entries of the test vector should be uncorrelated documents (e.g. separate tweets, as in the twitter_test example dataset).

More in detail, eval_sbo_predictor performs the following operations:

1. Sample a single sentence from each entry of the character vector test.
2. Sample a single $N$-gram from each sentence obtained in the previous step.
3. Predict next words from the $(N-1)$-gram prefix.
4. Return all predictions, together with the true word completions.

A tibble, containing the input $(N-1)$-grams, the true completions, the predicted completions, and a column indicating whether one of the predictions was correct or not.
Valerio Gherardi
# Evaluating next-word predictions from a Stupid Back-off N-gram model
if (suppressMessages(require(dplyr) && require(ggplot2))) {
    p <- sbo_predictor(twitter_predtable)
    set.seed(840) # Set seed for reproducibility
    test <- sample(twitter_test, 500)
    eval <- eval_sbo_predictor(p, test)

    ## Compute three-word accuracies
    eval %>% summarise(accuracy = sum(correct) / n()) # Overall accuracy
    eval %>% # Accuracy for in-sentence predictions
        filter(true != "<EOS>") %>%
        summarise(accuracy = sum(correct) / n())

    ## Make histogram of word-rank distribution for correct predictions
    dict <- attr(twitter_predtable, "dict")
    eval %>%
        filter(correct, true != "<EOS>") %>%
        transmute(rank = match(true, table = dict)) %>%
        ggplot(aes(x = rank)) +
        geom_histogram(binwidth = 30)
}
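Continuing from the example above (a sketch; it assumes the eval tibble computed there), a rough binomial standard error for the overall accuracy estimate can be obtained from the correct column:

acc <- mean(eval$correct)
se  <- sqrt(acc * (1 - acc) / nrow(eval))
c(accuracy = acc, std_error = se)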
Get k-gram frequency tables from a training corpus.
kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

sbo_kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

kgram_freqs_fast(corpus, N, dict, erase = "", lower_case = FALSE, EOS = "")

sbo_kgram_freqs_fast(corpus, N, dict, erase = "", lower_case = FALSE, EOS = "")
corpus | a character vector. The training corpus from which to extract k-gram frequencies. |
N | a length one integer. The maximum order of k-grams for which frequencies are to be extracted. |
dict | either a sbo_dictionary object, a character vector, or a formula (see details). The model dictionary. |
.preprocess | a function to apply before k-gram tokenization. |
EOS | a length one character vector listing all (single character) end-of-sentence tokens. |
erase | a length one character vector. Regular expression matching parts of text to be erased from input. The default removes anything not alphanumeric, white space, apostrophes or punctuation characters (i.e. ".?!:;"). |
lower_case | a length one logical vector. If TRUE, puts everything to lower case. |
These functions extract all k-gram frequency tables from a text corpus, up to a specified k-gram order N. These are the building blocks for training any N-gram model. The functions sbo_kgram_freqs() and sbo_kgram_freqs_fast() are aliases for kgram_freqs() and kgram_freqs_fast(), respectively.

The optimized version kgram_freqs_fast(erase = x, lower_case = y) is equivalent to kgram_freqs(.preprocess = preprocess(erase = x, lower_case = y)), but more efficient (in terms of both speed and memory).

Both kgram_freqs() and kgram_freqs_fast() employ a fixed (user specified) dictionary: any out-of-vocabulary word is effectively replaced by an "unknown word" token. The dictionary is specified through the argument dict, which accepts three types of input: a sbo_dictionary object, a character vector (containing the words of the dictionary) or a formula. In the last case, valid formulas are either max_size ~ V or target ~ f, where V and f represent a dictionary size and a corpus word coverage fraction (of corpus), respectively. This usage of the dict argument allows the model dictionary to be built 'on the fly'.
The return value is a "sbo_kgram_freqs" object, i.e. a list of N tibbles, storing frequency counts for each k-gram observed in the training corpus, for k = 1, 2, ..., N. In these tables, words are represented by integer numbers corresponding to their position in the reference dictionary. The special codes 0, length(dictionary)+1 and length(dictionary)+2 correspond to the "Begin-Of-Sentence", "End-Of-Sentence" and "Unknown word" tokens, respectively.
Furthermore, the returned object has the following attributes:

N: The highest order of N-grams.
dict: The reference dictionary, sorted by word frequency.
.preprocess: The function used for text preprocessing.
EOS: A length one character vector listing all (single character) end-of-sentence tokens employed in k-gram tokenization.
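As an illustration of this integer encoding (a sketch based on the bundled twitter_freqs object, treating it as the list of tibbles documented above; the decode() helper and its token labels are purely illustrative, not part of the package API):

freqs <- twitter_freqs               # example sbo_kgram_freqs object (N = 3)
dict <- attr(freqs, "dict")          # reference dictionary, sorted by frequency
head(freqs[[2]])                     # 2-gram counts, with words stored as integer codes
decode <- function(w) {              # map an integer code back to a readable token
  if (w == 0) return("<BOS>")                  # Begin-Of-Sentence code
  if (w == length(dict) + 1) return("<EOS>")   # End-Of-Sentence code
  if (w == length(dict) + 2) return("<UNK>")   # Unknown word code
  dict[[w]]
}
sapply(c(0, 1, length(dict) + 2), decode)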
The .preprocess argument of kgram_freqs allows the user to apply a custom transformation to the training corpus before k-gram tokenization takes place.

The algorithm for k-gram tokenization considers anything separated by (any number of) white spaces (i.e. " ") as a single word. Sentences are split according to the end-of-sentence (single character) tokens specified by the EOS argument. Additionally, text belonging to different entries of the preprocessed input vector is understood to belong to different sentences.
Nota Bene: it is useful to keep in mind that the function passed through the .preprocess argument also captures its enclosing environment, which is by default the environment in which it was defined. If, for instance, .preprocess was defined in the global environment, and the latter binds heavy objects, the resulting sbo_kgram_freqs will contain bindings to the same objects. If the sbo_kgram_freqs object is stored out of memory and recalled in another R session, these objects will also be reloaded in memory. For this reason, for non-interactive use, it is advisable to avoid preprocessing functions defined in the global environment (for instance, base::identity is preferred to function(x) x).
A sbo_kgram_freqs object, containing the k-gram frequency tables for k = 1, 2, ..., N.
Valerio Gherardi
# Obtain k-gram frequency table from corpus

## Get k-gram frequencies, for k <= N = 3.
## The dictionary is built on the fly, using the most frequent 1000 words.
freqs <- kgram_freqs(corpus = twitter_train, N = 3, dict = max_size ~ 1000,
                     .preprocess = preprocess, EOS = ".?!:;")
freqs

## Using a predefined dictionary
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict,
                          erase = "[^.?!:;'\\w\\s]", lower_case = TRUE,
                          EOS = ".?!:;")
freqs

## 2-grams, no preprocessing, use a dictionary covering 50% of corpus
freqs <- kgram_freqs(corpus = twitter_train, N = 2, dict = target ~ 0.5,
                     EOS = ".?!:;")
freqs

# Obtain k-gram frequency table from corpus
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict)
## Print result
freqs
Plot cumulative corpus coverage fraction of a dictionary.
## S3 method for class 'word_coverage'
plot(
  x,
  include_EOS = FALSE,
  show_limit = TRUE,
  type = "l",
  xlim = c(0, length(x)),
  ylim = c(0, 1),
  xticks = seq(from = 0, to = length(x), by = length(x)/5),
  yticks = seq(from = 0, to = 1, by = 0.25),
  xlab = "Rank",
  ylab = "Covered fraction",
  title = "Cumulative corpus coverage fraction of dictionary",
  subtitle = "default",
  ...
)
x | a word_coverage object. |
include_EOS | length one logical. Should End-Of-Sentence tokens be considered in the computation of coverage fraction? |
show_limit | length one logical. If TRUE, a horizontal line marking the total coverage fraction is added to the plot. |
type | what type of plot should be drawn, as detailed in plot(). |
xlim | length two numeric. Extremes of the x-range. |
ylim | length two numeric. Extremes of the y-range. |
xticks | numeric vector. Position of the x-axis ticks. |
yticks | numeric vector. Position of the y-axis ticks. |
xlab | length one character. The x-axis label. |
ylab | length one character. The y-axis label. |
title | length one character. Plot title. |
subtitle | length one character. Plot subtitle; if "default", prints dictionary length and total covered fraction. |
... | further arguments passed to or from other methods. |
This function generates nice plots of cumulative corpus coverage fractions. The x coordinate in the resulting plot is the word rank in the underlying dictionary; the y coordinate at x is the cumulative coverage fraction for rank <= x.
Valerio Gherardi
c <- word_coverage(twitter_dict, twitter_test)
plot(c)
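A further sketch (assuming the bundled twitter_dict and twitter_test data), using some of the customization arguments listed above:

c <- word_coverage(twitter_dict, twitter_test)
plot(c, include_EOS = TRUE, show_limit = FALSE,
     xticks = seq(from = 0, to = length(c), by = length(c) / 10),
     title = "Coverage of twitter_dict over twitter_test")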
Predictive text based on a Stupid Back-Off N-gram model.
## S3 method for class 'sbo_kgram_freqs'
predict(object, input, lambda = 0.4, ...)
object | a sbo_kgram_freqs object. |
input | a length one character vector, containing the input for next-word prediction. |
lambda | a numeric vector of length one. The back-off penalization in the Stupid Back-Off algorithm. |
... | further arguments passed to or from other methods. |
A tibble containing the next-word probabilities for all words in the dictionary.
Valerio Gherardi
predict(twitter_freqs, "i love")
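A small additional sketch (assuming the bundled twitter_freqs object): lowering lambda penalizes backed-off (lower-order) estimates more heavily, which can change the resulting scores:

predict(twitter_freqs, "i love", lambda = 0.4) # default back-off penalization
predict(twitter_freqs, "i love", lambda = 0.2) # heavier penalization of backed-off scores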
Predictive text based on a Stupid Back-Off N-gram model.
## S3 method for class 'sbo_predictor'
predict(object, input, ...)
object | a sbo_predictor object. |
input | a character vector, containing the input for next-word prediction. |
... | further arguments passed to or from other methods. |
This method returns the top L next-word predictions from a text predictor trained with Stupid Back-Off.

Trying to predict from a sbo_predtable results in an error. Instead, one should load a sbo_predictor object and use this to predict(), as shown in the example below.

A character vector if length(input) == 1, otherwise a character matrix.
Valerio Gherardi
p <- sbo_predictor(twitter_predtable)
x <- predict(p, "i love")
x
x <- predict(p, "you love")
x
# N.B. the top predictions here are x[1], followed by x[2] and x[3].
predict(p, c("i love", "you love")) # Behaviour with length() > 1 input.
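A short sketch of the point made above (assuming the bundled twitter_predtable object): predicting straight from the prediction table fails, while predicting from the corresponding sbo_predictor works:

try(predict(twitter_predtable, "i love")) # error: not a sbo_predictor
p <- sbo_predictor(twitter_predtable)     # load the predictor once...
predict(p, "i love")                      # ...then predict interactively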
A simple text preprocessing utility.
preprocess(input, erase = "[^.?!:;'\\w\\s]", lower_case = TRUE)
input | a character vector. |
erase | a length one character vector. Regular expression matching parts of text to be erased from input. The default removes anything not alphanumeric, white space, apostrophes or punctuation characters (i.e. ".?!:;"). |
lower_case | a length one logical vector. If TRUE, puts everything to lower case. |
a character vector containing the processed output.
Valerio Gherardi
preprocess("Hi @ there! I'm using `sbo`.")
preprocess("Hi @ there! I'm using `sbo`.")
Prune M-gram frequency tables or Stupid Back-Off prediction tables for an M-gram model to a smaller order N.
prune(object, N, ...)

## S3 method for class 'sbo_kgram_freqs'
prune(object, N, ...)

## S3 method for class 'sbo_predtable'
prune(object, N, ...)
object | a sbo_kgram_freqs or a sbo_predtable object. |
N | a length one positive integer. N-gram order of the new object. |
... | further arguments passed to or from other methods. |
This generic function provides a helper to prune M-gram frequency tables or M-gram models, represented by sbo_kgram_freqs and sbo_predtable objects respectively, to objects of a smaller N-gram order, N < M. For k-gram frequency objects, frequency tables for k > N are simply dropped. For sbo_predtable's, the predictions coming from the nested N-gram model are instead retained. In both cases, all other attributes besides k-gram order (such as the corpus preprocessing function, or the lambda penalty in Stupid Back-Off training) are left unchanged.

An object of the same class as the input object.
Valerio Gherardi
# Drop k-gram frequencies for k > 2
freqs <- twitter_freqs
summary(freqs)
freqs <- prune(freqs, N = 2)
summary(freqs)

# Extract a 2-gram model from a larger 3-gram model
pt <- twitter_predtable
summary(pt)
pt <- prune(pt, N = 2)
summary(pt)
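A brief sketch of the second point (assuming the bundled twitter_predtable object): the pruned prediction table still yields a working, lower-order predictor:

pt2 <- prune(twitter_predtable, N = 2)
p2 <- sbo_predictor(pt2) # predictor backed by the nested 2-gram model
predict(p2, "i love")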
Build dictionary from training corpus.
sbo_dictionary(
  corpus,
  max_size = Inf,
  target = 1,
  .preprocess = identity,
  EOS = ""
)

dictionary(
  corpus,
  max_size = Inf,
  target = 1,
  .preprocess = identity,
  EOS = ""
)
corpus | a character vector. The training corpus from which to extract the dictionary. |
max_size | a length one numeric. If less than Inf, only the max_size most frequent words are retained in the dictionary. |
target | a length one numeric between 0 and 1. Target corpus coverage fraction of the dictionary. |
.preprocess | a function for corpus preprocessing. Takes a character vector as input and returns a character vector. |
EOS | a length one character vector listing all (single character) end-of-sentence tokens. |
The function dictionary() is an alias for sbo_dictionary().

This function builds a dictionary using the most frequent words in a training corpus. Two pruning criteria can be applied:

Dictionary size, as implemented by the max_size argument.
Target coverage fraction, as implemented by the target argument.

If both criteria imply non-trivial cuts, the most restrictive criterion applies.
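For instance (a sketch assuming the bundled twitter_train corpus), when both criteria are supplied the resulting dictionary satisfies the stricter of the two:

dict <- sbo_dictionary(twitter_train, max_size = 1000, target = 0.5)
length(dict) # at most 1000 words, fewer if 50% coverage is reached earlier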
The .preprocess argument allows the user to apply a custom transformation to the training corpus before word tokenization. The EOS argument allows one to specify a set of characters to be identified as End-Of-Sentence tokens (and thus not part of words).
The returned object is a sbo_dictionary object, which is a character vector containing words sorted by decreasing corpus frequency. Furthermore, the object stores as attributes the original values of .preprocess and EOS (i.e. the function used in corpus preprocessing and the End-Of-Sentence characters for sentence tokenization).

A sbo_dictionary object.
Valerio Gherardi
# Extract dictionary from `twitter_train` corpus (all words)
dict <- sbo_dictionary(twitter_train)

# Extract dictionary from `twitter_train` corpus (top 1000 words)
dict <- sbo_dictionary(twitter_train, max_size = 1000)

# Extract dictionary from `twitter_train` corpus (coverage target = 50%)
dict <- sbo_dictionary(twitter_train, target = 0.5)
Train a text predictor via Stupid Back-off
sbo_predictor(object, ...)

predictor(object, ...)

## S3 method for class 'character'
sbo_predictor(
  object,
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'sbo_predtable'
sbo_predictor(object, ...)

sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'character'
sbo_predtable(
  object,
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
object | either a character vector or an object inheriting from classes sbo_kgram_freqs or sbo_predtable. The object from which the text predictor is built. |
... | further arguments passed to or from other methods. |
N | a length one integer. Order 'N' of the N-gram model. |
dict | either a sbo_dictionary object, a character vector or a formula. The model dictionary (see kgram_freqs for details). |
.preprocess | a function for corpus preprocessing. For more details see kgram_freqs. |
EOS | a length one character vector. String listing End-Of-Sentence characters. For more details see kgram_freqs. |
lambda | a length one numeric. Penalization in the Stupid Back-off algorithm. |
L | a length one integer. Maximum number of next-word predictions for a given input (top scoring predictions are retained). |
filtered | a character vector. Words to exclude from next-word predictions. The strings '<UNK>' and '<EOS>' are reserved keywords referring to the Unknown-Word and End-Of-Sentence tokens, respectively. |
These functions are generics used to train a text predictor with Stupid Back-Off. The functions predictor() and predtable() are aliases for sbo_predictor() and sbo_predtable(), respectively.

The sbo_predictor data structure carries all information required for prediction in a compact and efficient (upon retrieval) way, by directly storing the top L next-word predictions for each k-gram prefix observed in the training corpus.
The sbo_predictor objects are for interactive use. If the training process is computationally heavy, one can store a "raw" version of the text predictor in a sbo_predtable class object, which can be safely saved out of memory (e.g. with save()). The resulting object can be restored in another R session, and the corresponding sbo_predictor object can be loaded rapidly, again using the generic constructor sbo_predictor() (see example below).
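A sketch of this workflow (mirroring the 'Not run' part of the example below, with an explicit file name added purely for illustration):

## Not run:
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")
save(t, file = "t.rda") # store the raw prediction tables out of memory
# ...later, possibly in a new R session:
load("t.rda")
p <- sbo_predictor(t)   # rebuild the fast predictor from the stored tables
predict(p, "i love")
## End(Not run)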
The returned objects are sbo_predictor and sbo_predtable objects, respectively. The latter contains the Stupid Back-Off prediction tables, storing the next-word predictions for each k-gram prefix observed in the text, whereas the former is an external pointer to an equivalent (but processed) C++ structure.

Both objects have the following attributes:

N: The order of the underlying N-gram model.
dict: The model dictionary.
lambda: The penalization used in the Stupid Back-Off algorithm.
L: The maximum number of next-word predictions for a given text input.
.preprocess: The function used for text preprocessing.
EOS: A length one character vector listing all (single character) end-of-sentence tokens.
A sbo_predictor object for sbo_predictor(), a sbo_predtable object for sbo_predtable().
Valerio Gherardi
# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)

# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)

# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")

# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)

## Not run:
# Save and reload a 'sbo_predtable' object with base::save()
save(t, file = "t.rda")
load("t.rda")
## End(Not run)
Get sentence tokens from text
tokenize_sentences(input, EOS = ".?!:;")
input | a character vector. |
EOS | a length one character vector listing all (single character) end-of-sentence tokens. |
a character vector, each entry of which corresponds to a single sentence.
Valerio Gherardi
tokenize_sentences("Hi there! I'm using `sbo`.")
tokenize_sentences("Hi there! I'm using `sbo`.")
Top 1000 dictionary from Twitter training set
twitter_dict
A character vector. Contains the 1000 most frequent words from the example training set twitter_train, sorted by word rank.
head(twitter_dict, 10)
k-gram frequencies from Twitter training set
twitter_freqs
A sbo_kgram_freqs object. Contains k-gram frequencies from the example training set twitter_train.
Next-word prediction tables from 3-gram model trained on Twitter training set
twitter_predtable
A sbo_predtable object. Contains prediction tables of a 3-gram Stupid Back-Off model trained on the example training set twitter_train.
Twitter test set
twitter_test
A collection of 10'000 Twitter posts in English.
https://www.kaggle.com/crmercado/tweets-blogs-news-swiftkey-dataset-4million
head(twitter_test)
Twitter training set
twitter_train
A collection of 50'000 Twitter posts in English.
https://www.kaggle.com/crmercado/tweets-blogs-news-swiftkey-dataset-4million
See also: twitter_test, twitter_dict, twitter_freqs, twitter_predtable.
head(twitter_train)
Compute total and cumulative corpus coverage fraction of a dictionary.
word_coverage(object, corpus, ...)

## S3 method for class 'sbo_dictionary'
word_coverage(object, corpus, ...)

## S3 method for class 'character'
word_coverage(object, corpus, .preprocess = identity, EOS = "", ...)

## S3 method for class 'sbo_kgram_freqs'
word_coverage(object, corpus, ...)

## S3 method for class 'sbo_predictions'
word_coverage(object, corpus, ...)
object | either a character vector, or an object inheriting from one of the classes sbo_dictionary, sbo_kgram_freqs, sbo_predtable or sbo_predictor. |
corpus | a character vector. |
... | further arguments passed to or from other methods. |
.preprocess | preprocessing function for training corpus. See kgram_freqs for further details. |
EOS | a length one character vector. String containing End-Of-Sentence characters, see kgram_freqs for further details. |
This function computes the corpus coverage fraction of a dictionary, that is the fraction of words appearing in corpus which are contained in the dictionary.

This function is a generic, accepting as object argument any object storing a dictionary, along with a preprocessing function and a list of End-Of-Sentence characters. This includes all the main classes of sbo: sbo_dictionary, sbo_kgram_freqs, sbo_predtable and sbo_predictor. When object is a character vector, the preprocessing function and the End-Of-Sentence characters must be specified explicitly.

The coverage fraction is computed cumulatively, and its dependence on maximal word rank can be explored through plot() (see the examples below).

A word_coverage object.
Valerio Gherardi
c <- word_coverage(twitter_dict, twitter_train)
print(c)
summary(c)
# Plot coverage fraction, including the End-Of-Sentence in word counts.
plot(c, include_EOS = TRUE)
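A final sketch (assuming the bundled twitter_train and twitter_test data), comparing coverage across dictionaries of different sizes:

dict_small <- sbo_dictionary(twitter_train, max_size = 100)
c_small <- word_coverage(dict_small, twitter_test)
summary(c_small) # a smaller dictionary covers a smaller fraction of the corpus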