{hash}
This vignette provides a comparison of {r2r} with the same-purpose CRAN
package {hash}, which also offers an implementation of hash tables based
on R environments. We first describe the features offered by both
packages, and then perform some benchmark timing comparisons. The package
versions referred to in this vignette are:
library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.3'
packageVersion("r2r")
#> [1] '0.1.2'
Both {r2r} and {hash} hash tables are built on top of the R built-in
environment data structure, and thus have a similar API. In particular,
hash table objects have reference semantics in both packages. {r2r}
hashtables are S3 class objects, whereas in {hash} the data structure is
implemented as an S4 class.
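The reference semantics mentioned above come from R environments themselves. The following base R sketch uses neither package, and the helper name set_key is made up for illustration. It shows that a modification made to an environment inside a function is visible to the caller, whereas the same operation on an ordinary list is not:

```r
# Hypothetical helper: stores a value under a key in whatever container it is given.
set_key <- function(container, key, value) {
  container[[key]] <- value
  invisible(container)
}

e <- new.env()
l <- list()

set_key(e, "a", 1)  # environments are modified in place
set_key(l, "a", 1)  # lists are copied, so the caller's `l` is unchanged

exists("a", envir = e, inherits = FALSE)  # TRUE
length(l)                                 # 0
```

Any hash table built on an environment would be expected to behave like `e` here rather than like `l`.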
Hash tables provided by r2r support keys and values of arbitrary type,
arbitrary key comparison and hash functions, and have customizable
behaviour (either throw an exception or return a default value) upon
query of a missing key.
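To make the idea of arbitrary keys with custom hash and comparison functions concrete, here is a toy map written in base R only. It is a sketch of the general technique (hash the key to a string, store a bucket of key-value entries under that string, and resolve collisions with a comparison function), not the implementation used by r2r. All names (new_map, map_insert, map_query, coarse_hash) are hypothetical.

```r
# Toy map with arbitrary keys on top of an environment (illustrative only).
new_map <- function(hash_fn, compare_fn = identical) {
  list(env = new.env(), hash_fn = hash_fn, compare_fn = compare_fn)
}

map_insert <- function(m, key, value) {
  h <- m$hash_fn(key)  # string under which the bucket is stored
  bucket <- if (exists(h, envir = m$env, inherits = FALSE)) {
    get(h, envir = m$env)
  } else {
    list()
  }
  # Overwrite the value if an equal key is already in the bucket.
  for (i in seq_along(bucket)) {
    if (m$compare_fn(bucket[[i]]$key, key)) {
      bucket[[i]]$value <- value
      assign(h, bucket, envir = m$env)
      return(invisible(m))
    }
  }
  # Otherwise append a new entry (this is how collisions are kept apart).
  bucket[[length(bucket) + 1L]] <- list(key = key, value = value)
  assign(h, bucket, envir = m$env)
  invisible(m)
}

map_query <- function(m, key, default = NULL) {
  h <- m$hash_fn(key)
  if (!exists(h, envir = m$env, inherits = FALSE)) return(default)
  for (entry in get(h, envir = m$env)) {
    if (m$compare_fn(entry$key, key)) return(entry$value)
  }
  default
}

# A deliberately coarse hash function, chosen to force collisions.
coarse_hash <- function(key) as.character(length(key))

m <- new_map(coarse_hash)
map_insert(m, c(1, 2, 3), "numeric triple")
map_insert(m, c("a", "b", "c"), "character triple")  # same hash "3", different key

map_query(m, c(1, 2, 3))        # "numeric triple"
map_query(m, c("a", "b", "c"))  # "character triple"
map_query(m, 1:3)               # NULL: identical() distinguishes integer from double
```

The last query shows why the choice of comparison function matters: with identical() the integer vector 1:3 is a different key from the double vector c(1, 2, 3).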
In contrast, hash tables in hash currently support only string keys, with
basic identity comparison (the hashing is performed automatically by the
underlying environment objects); values can be arbitrary R objects.
Querying missing keys through non-vectorized [[-subsetting returns the
default value NULL, whereas queries through vectorized [-subsetting
result in an error. On the other hand, hash also offers support for
inverting hash tables (an experimental feature at the time of writing).
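The two missing-key behaviours described for hash have direct analogues in plain R environments. The base R sketch below illustrates them; whether hash implements its [ operator through mget() is an assumption that the text does not confirm.

```r
e <- new.env()
assign("a", 1, envir = e)

e[["missing"]]  # NULL: `[[` on an environment returns NULL for an unbound name

# mget() is a vectorized lookup; by default it signals an error for unbound names.
res <- tryCatch(
  mget(c("a", "missing"), envir = e),
  error = function(err) "error"
)
res  # "error"

# Supplying `ifnotfound` replaces the error with a default value.
out <- mget(c("a", "missing"), envir = e, ifnotfound = list(NA))
out$missing  # NA
```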
The table below summarizes the features of the two packages:
Feature | r2r | hash |
---|---|---|
Basic data structure | R environment | R environment |
Arbitrary type keys | X | |
Arbitrary type values | X | X |
Arbitrary hash function | X | |
Arbitrary key comparison function | X | |
Throw or return default on missing keys | X | |
Hash table inversion | | X |
We will perform our benchmark tests using the CRAN package microbenchmark.
We start by timing the insertion of N random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict ourselves to string keys (i.e. character vectors of length one). We can generate random keys as follows:
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE)
)
set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)
We test both the non-vectorized ([[<-) and vectorized ([<-) operators:
microbenchmark(
`r2r_[[<-` = {
for (i in seq_along(keys))
m_r2r[[ keys[[i]] ]] <- values[[i]]
},
`r2r_[<-` = { m_r2r[keys] <- values },
`hash_[[<-` = {
for (i in seq_along(keys))
m_hash[[ keys[[i]] ]] <- values[[i]]
},
`hash_[<-` = { m_hash[keys] <- values },
times = 30,
setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[<- 97.35690 151.67253 187.59483 193.47574 234.7740 277.7255 30
#> r2r_[<- 73.90571 103.69061 159.41183 165.42399 200.4218 324.2356 30
#> hash_[[<- 73.80975 126.54497 149.60832 139.75092 172.9261 260.1585 30
#> hash_[<- 39.28803 64.79618 90.46477 99.42232 109.8349 141.6459 30
As can be seen, r2r and hash perform comparably at inserting key-value
pairs, with both vectorized and non-vectorized insertion, hash being
somewhat faster in both cases.
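For readers unfamiliar with the distinction, the difference between non-vectorized and vectorized insertion can be sketched with a bare environment in base R. This only illustrates the two access patterns; it does not reproduce how either package implements [[<- or [<-.

```r
keys <- c("k1", "k2", "k3")
values <- c(10, 20, 30)

# Non-vectorized: one assignment per key.
e_loop <- new.env()
for (i in seq_along(keys)) assign(keys[[i]], values[[i]], envir = e_loop)

# Vectorized: build a named list once and copy it into the environment.
e_vec <- new.env()
list2env(setNames(as.list(values), keys), envir = e_vec)

# Both environments end up with the same bindings.
identical(mget(keys, envir = e_loop), mget(keys, envir = e_vec))  # TRUE
```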
We now test key query, again both in non-vectorized and vectorized form:
microbenchmark(
`r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
`r2r_[` = { m_r2r[ keys ] },
`hash_[[` = { for (key in keys) m_hash[[ key ]] },
`hash_[` = { m_hash[ keys ] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[ 88.545799 106.16686 157.7722 150.61003 188.24190 303.41547 30
#> r2r_[ 87.467316 118.76546 154.7393 144.03654 181.63452 298.83506 30
#> hash_[[ 9.947555 10.52080 16.9615 13.24612 21.78511 43.20254 30
#> hash_[ 55.392393 72.29879 91.9143 81.32744 108.42607 140.64172 30
For non-vectorized queries, hash is significantly faster (by one order of
magnitude) than r2r. This is likely because the [[ method dispatch is
handled natively by R in hash (i.e. the default [[ method for
environments is used), whereas r2r suffers the overhead of S3 method
dispatch. This is consistent with the result for vectorized queries,
which is comparable for the two packages; notice that here a single
(rather than N) S3 method dispatch occurs in the r2r timed expression.
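The dispatch argument can be explored in isolation with a base R sketch. The class toy_map and its [[ method are hypothetical and much simpler than anything in r2r; the point is only that every call on the classed object goes through an R-level S3 method, while the plain environment is handled by the built-in [[. No timings are quoted here, since they depend on the machine.

```r
plain <- new.env()
assign("a", 1, envir = plain)

# Hypothetical S3 wrapper around the same environment.
wrapped <- structure(list(env = plain), class = "toy_map")
`[[.toy_map` <- function(x, i) get(i, envir = unclass(x)$env, inherits = FALSE)

# Both lookups return the same value ...
identical(plain[["a"]], wrapped[["a"]])  # TRUE

# ... but the second one pays for S3 dispatch plus an R function call each time.
n <- 1e4
t_plain <- system.time(for (i in seq_len(n)) plain[["a"]])
t_s3 <- system.time(for (i in seq_len(n)) wrapped[["a"]])
print(rbind(plain = t_plain, s3 = t_s3)[, "elapsed"])
```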
As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:
set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
`r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
`hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[_bis 62.837395 77.70263 97.82561 85.30933 117.82900 193.5376 30
#> hash_[[_bis 9.809286 10.58638 13.39062 12.31619 15.48767 19.4628 30
The results are similar to those discussed above. Next, we test the
performance of the two packages in checking the existence of keys
(notice that here has_key refers to r2r::has_key, whereas has.key is
hash::has.key):
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_has_key 59.0308 63.3219 76.82297 73.23542 83.82676 156.3847 30
#> hash_has_key 175.7794 203.7641 230.09715 226.40480 259.97378 283.9682 30
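In this test r2r is roughly three times faster than hash (median of
about 73 ms versus 226 ms). Note, however, that the two timed
expressions iterate over different key sets (mixed_keys for r2r,
new_keys for hash), so the comparison is not strictly like-for-like.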
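When checking key existence directly on an R environment, one detail is easy to get wrong: exists() searches enclosing environments by default. The base R sketch below shows the pitfall; it makes no claim about how has_key or has.key are implemented.

```r
e <- new.env()
assign("a", 1, envir = e)

exists("a", envir = e)                    # TRUE
exists("c", envir = e)                    # TRUE: finds base::c() via the enclosing environments
exists("c", envir = e, inherits = FALSE)  # FALSE: only `e` itself is searched
```

An environment-backed key check therefore needs `inherits = FALSE` (or an equivalent restriction) to avoid false positives for names such as "c" or "sum".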
Finally, we test key deletion. In order to handle name collisions, we
will use delete() (which refers to r2r::delete()) and del() (which
refers to hash::del()).
microbenchmark(
r2r_delete = { for (key in keys) delete(m_r2r, key) },
hash_delete = { for (key in keys) del(key, m_hash) },
hash_vectorized_delete = { del(keys, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> r2r_delete 108.364266 130.743575 154.582793 150.147717 171.749954
#> hash_delete 60.979746 67.823765 86.521171 84.526460 97.832571
#> hash_vectorized_delete 1.860043 2.218743 2.463795 2.428009 2.654858
#> max neval
#> 269.500612 30
#> 158.779385 30
#> 3.173727 30
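The vectorized deletion offered by hash significantly outperforms the
non-vectorized versions: its median time is roughly 35 times smaller
than that of non-vectorized hash deletion and roughly 60 times smaller
than that of r2r deletion. Currently, r2r does not support vectorized
key deletion [1].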
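The gap between looped and vectorized deletion can be illustrated on a bare environment, where rm() accepts a whole character vector of names. This is a base R sketch of the two patterns only; whether hash::del() is built on rm() is an assumption not stated in the text.

```r
keys <- paste0("k", 1:5)

e_loop <- new.env()
e_vec <- new.env()
for (k in keys) {
  assign(k, 0, envir = e_loop)
  assign(k, 0, envir = e_vec)
}

# Non-vectorized: one rm() call per key.
for (k in keys) rm(list = k, envir = e_loop)

# Vectorized: a single rm() call removes all keys at once.
rm(list = keys, envir = e_vec)

length(ls(e_loop))  # 0
length(ls(e_vec))   # 0
```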
The two R packages r2r and hash offer hash table implementations with
different advantages and drawbacks. r2r focuses on flexibility, and has
a richer set of features. hash is more minimal, but offers superior
performance in some important tasks. Finally, as a positive note for
both parties, the two packages share a similar API, making it relatively
easy to switch between the two, depending on the needs of the particular
use case.
[1] This is due to complications introduced by the internal hash
collision handling system of r2r.