{hash}This vignette provides a
comparison of {r2r} with the same-purpose CRAN package {hash},
which also offers an implementation of hash tables based on R
environments. We first describe the features offered by both packages,
and then perform some benchmark timing comparisons. The package versions
referred to in this vignette are:
library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.4'
packageVersion("r2r")
#> [1] '0.1.2'Both {r2r} and {hash} hash tables are built
on top of the R built-in environment data structure, and
have thus a similar API. In particular, hash table objects have
reference semantics for both packages. {r2r}
hashtables are S3 class objects, whereas in
{hash} the data structure is implemented as an S4
class.
Hash tables provided by r2r support arbitrary type keys
and values, arbitrary key comparison and hash functions, and have
customizable behaviour (either throw an exception or return a default
value) upon query of a missing key.
In contrast, hash tables in hash currently support only
string keys, with basic identity comparison (the hashing is performed
automatically by the underlying environment objects);
values can be arbitrary R objects. Querying missing keys through
non-vectorized [[-subsetting returns the default value
NULL, whereas queries through vectorized
[-subsetting result in an error. On the other hand,
hash also offers support for inverting hash tables (an
experimental feature at the time of writing).
The table below summarizes the features of the two packages
| Feature | r2r | hash |
|---|---|---|
| Basic data structure | R environment | R environment |
| Arbitrary type keys | X | |
| Arbitrary type values | X | X |
| Arbitrary hash function | X | |
| Arbitrary key comparison function | X | |
| Throw or return default on missing keys | X | |
| Hash table inversion | X |
We will perform our benchmark tests using the CRAN package microbenchmark.
We start by timing the insertion of:
random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (i.e. length one character) keys. We can generate random keys as follows:
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE)
)
set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)We test both the non-vectorized ([[<-) and vectorized
([<-) operators:
microbenchmark(
`r2r_[[<-` = {
for (i in seq_along(keys))
m_r2r[[ keys[[i]] ]] <- values[[i]]
},
`r2r_[<-` = { m_r2r[keys] <- values },
`hash_[[<-` = {
for (i in seq_along(keys))
m_hash[[ keys[[i]] ]] <- values[[i]]
},
`hash_[<-` = m_hash[keys] <- values,
times = 30,
setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[<- 87.75514 146.29892 179.65013 185.0527 214.0657 258.5093 30
#> r2r_[<- 73.00546 103.70894 152.65348 168.0877 181.9906 308.3985 30
#> hash_[[<- 74.37107 112.98790 146.97750 143.9921 171.9236 261.1371 30
#> hash_[<- 40.79819 72.05788 87.26613 91.0152 105.6002 148.6176 30As it is seen, r2r and hash have comparable
performances at the insertion of key-value pairs, with both vectorized
and non-vectorized insertions, hash being somewhat more
efficient in both cases.
We now test key query, again both in non-vectorized and vectorized form:
microbenchmark(
`r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
`r2r_[` = { m_r2r[ keys ] },
`hash_[[` = { for (key in keys) m_hash[[ key ]] },
`hash_[` = { m_hash[ keys ] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[ 98.98932 145.33967 201.37708 201.07437 265.73786 316.52912 30
#> r2r_[ 103.45720 147.21641 186.16901 192.07650 221.37324 272.69486 30
#> hash_[[ 10.83718 13.48205 18.11352 15.98164 23.78367 31.86381 30
#> hash_[ 61.43288 78.10456 113.56426 109.25188 147.13924 198.61892 30For non-vectorized queries, hash is significantly faster
(by one order of magnitude) than r2r. This is likely due to
the fact that the [[ method dispatch is handled natively by
R in hash (i.e. the default [[ method
for environments is used ), whereas r2r
suffers the overhead of S3 method dispatch. This is confirmed by the
result for vectorized queries, which is comparable for the two packages;
notice that here a single (rather than N) S3 method
dispatch occurs in the r2r timed expression.
As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:
set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
`r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
`hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[_bis 75.92730 99.37357 133.33455 128.91944 162.11952 236.66646 30
#> hash_[[_bis 11.65706 12.74223 19.72868 17.05059 26.13338 37.13237 30The results are similar to the ones already commented. Finally, we
test the performances of the two packages in checking the existence of
keys (notice that here has_key refers to
r2r::has_key, whereas has.key is
hash::has.key):
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_has_key 69.23233 90.73365 111.1846 116.8536 127.4700 147.8883 30
#> hash_has_key 213.42948 258.96675 335.4890 326.9527 401.2557 525.3569 30The results are comparable for the two packages, r2r
being slightly more performant in this particular case.
Finally, we test key deletion. In order to handle name collisions, we
will use delete() (which refers to
r2r::delete()) and del() (which refers to
hash::del()).
microbenchmark(
r2r_delete = { for (key in keys) delete(m_r2r, key) },
hash_delete = { for (key in keys) del(key, m_hash) },
hash_vectorized_delete = { del(keys, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> r2r_delete 128.817868 166.006647 227.701078 220.482861 256.545683
#> hash_delete 71.100662 91.946342 128.702465 127.956812 158.976427
#> hash_vectorized_delete 4.782184 5.718351 6.586383 6.240551 6.847719
#> max neval
#> 484.12153 30
#> 210.15627 30
#> 15.25836 30The vectorized version of hash significantly outperforms
the non-vectorized versions (by roughly two orders of magnitude in
speed). Currently, r2r does not support vectorized key
deletion 1.
The two R packages r2r and hash offer hash
table implementations with different advantages and drawbacks.
r2r focuses on flexibility, and has a richer set of
features. hash is more minimal, but offers superior
performance in some important tasks. Finally, as a positive note for
both parties, the two packages share a similar API, making it relatively
easy to switch between the two, according to the particular use case
needs.
This is due to complications introduced by the internal
hash collision handling system of r2r.↩︎