Comparison with {hash}

This vignette provides a comparison of {r2r} with the same-purpose CRAN package {hash}, which also offers an implementation of hash tables based on R environments. We first describe the features offered by both packages, and then perform some benchmark timing comparisons. The package versions referred to in this vignette are:

library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.3'
packageVersion("r2r")
#> [1] '0.1.2'

Features

Both {r2r} and {hash} hash tables are built on top of the R built-in environment data structure, and thus have a similar API. In particular, hash table objects have reference semantics in both packages. {r2r} hash tables are S3 class objects, whereas in {hash} the data structure is implemented as an S4 class.
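
The reference semantics mentioned above come from environments themselves. The following base R sketch (it uses neither package, and the helper `put()` is a hypothetical name introduced only for illustration) shows an environment being modified in place through an alias, in contrast with a list:

```r
# Base R only: an environment used as a minimal string-keyed table.
tbl <- new.env(hash = TRUE)

# Hypothetical helper that modifies its argument in place.
put <- function(table, key, value) {
    assign(key, value, envir = table)
    invisible(table)
}

put(tbl, "a", 1)

# No copy is made on assignment: `alias` and `tbl` are the same object.
alias <- tbl
put(alias, "b", 2)

sort(ls(tbl))          # both keys are visible through `tbl`
identical(alias, tbl)  # same environment

# Contrast with a list, which has copy (value) semantics.
lst <- list(a = 1)
lst_copy <- lst
lst_copy$b <- 2
names(lst)             # the original list is unchanged
```

Any function receiving a hash table from either package can therefore modify it without returning it, which is worth keeping in mind when passing tables around.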

Hash tables provided by r2r support arbitrary type keys and values, arbitrary key comparison and hash functions, and have customizable behaviour (either throw an exception or return a default value) upon query of a missing key.
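
A minimal sketch of these features follows. The argument names `default` and `on_missing_key` are written here as recalled from the r2r documentation and should be checked against the installed version; the keys and values are made up for illustration.

```r
library(r2r)

# Keys need not be strings: here a numeric vector is used as a key.
m <- hashmap()
m[[ c(1, 2, 3) ]] <- "a numeric vector key"
res_vec <- m[[ c(1, 2, 3) ]]

# Missing keys can return a default value ...
m_default <- hashmap(default = 0)
res_default <- m_default[["not there"]]

# ... or throw an error, which is caught here for demonstration.
m_throw <- hashmap(on_missing_key = "throw")
res_throw <- tryCatch(
    m_throw[["not there"]],
    error = function(cnd) "error caught"
)
```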

In contrast, hash tables in hash currently support only string keys, with basic identity comparison (the hashing is performed automatically by the underlying environment objects); values can be arbitrary R objects. Querying missing keys through non-vectorized [[-subsetting returns the default value NULL, whereas queries through vectorized [-subsetting result in an error. On the other hand, hash also offers support for inverting hash tables (an experimental feature at the time of writing).
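
The difference between the two kinds of missing-key query can be reproduced with a bare environment. This is a base R analogue only: {hash} is not used, and whether its [ method is implemented in this way is an assumption.

```r
e <- new.env(hash = TRUE)
assign("a", 1, envir = e)

# Non-vectorized lookup of a missing key returns NULL.
single <- e[["zzz"]]

# Vectorized lookup with mget() fails as soon as one key is missing.
multi <- tryCatch(
    mget(c("a", "zzz"), envir = e),
    error = function(cnd) "error"
)
```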

The table below summarizes the features of the two packages:

Features supported by {r2r} and {hash}
Feature	r2r	hash
Basic data structure	R environment	R environment
Arbitrary type keys	X
Arbitrary type values	X	X
Arbitrary hash function	X
Arbitrary key comparison function	X
Throw or return default on missing keys	X
Hash table inversion		X

Performance tests

We will perform our benchmark tests using the CRAN package microbenchmark.

library(microbenchmark)

Key insertion

We start by timing the insertion of:

N <- 1e4

random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (i.e. length one character) keys. We can generate random keys as follows:

chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE)
    )

set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)
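
To get a feeling for how often repetitions are expected, the following back-of-the-envelope sketch computes the size of the key space and the expected number of coinciding pairs among N uniformly drawn keys. The variable names are illustrative, and nothing is claimed here about the keys actually drawn with the seed above.

```r
# Number of distinct five-character keys over the 62-symbol alphabet.
n_chars <- length(c(letters, LETTERS, 0:9))
key_space <- n_chars^5

# Expected number of pairs of equal keys among N independent uniform draws.
N <- 1e4
expected_pairs <- choose(N, 2) / key_space
expected_pairs
```

With a key space this large compared to N, repeated keys are rare, so the benchmark essentially measures insertion of distinct keys.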

We test both the non-vectorized ([[<-) and vectorized ([<-) operators:

microbenchmark(
    `r2r_[[<-` = {
        for (i in seq_along(keys))
            m_r2r[[ keys[[i]] ]] <- values[[i]]
    },
    `r2r_[<-` = { m_r2r[keys] <- values },
    `hash_[[<-` = { 
        for (i in seq_along(keys))
            m_hash[[ keys[[i]] ]] <- values[[i]]
    },
    `hash_[<-` = m_hash[keys] <- values,
    
    times = 30, 
    setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#>       expr      min        lq      mean    median        uq      max neval
#>   r2r_[[<- 77.15060 118.74840 141.26154 134.67349 172.41469 207.0194    30
#>    r2r_[<- 65.01318  72.81439 107.65670 110.74753 132.95282 167.9449    30
#>  hash_[[<- 67.79844  94.15135 111.17117 107.07354 135.81654 167.3922    30
#>   hash_[<- 35.98419  61.81443  68.84952  69.63259  77.61402 110.1348    30

As can be seen, r2r and hash have comparable performance when inserting key-value pairs, for both vectorized and non-vectorized insertions, with hash being somewhat more efficient in both cases.

Key query

We now test key query, again both in non-vectorized and vectorized form:

microbenchmark(
    `r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
    `r2r_[` = { m_r2r[ keys ] },
    `hash_[[` = { for (key in keys) m_hash[[ key ]] },
    `hash_[` = { m_hash[ keys ] },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>     expr       min        lq      mean    median        uq       max neval
#>   r2r_[[ 91.927319 120.55399 154.44732 161.86267 185.25701 225.45467    30
#>    r2r_[ 85.280613  98.54987 132.75034 137.46910 155.10621 193.44932    30
#>  hash_[[  9.921901  10.97999  14.20868  12.40087  17.91414  21.59114    30
#>   hash_[ 56.681264  66.16818  86.83520  85.10289  98.88669 156.46972    30

For non-vectorized queries, hash is significantly faster than r2r, by about one order of magnitude. This is likely because hash relies on the default [[ method for environments, which R handles natively, whereas r2r incurs the overhead of S3 method dispatch. The vectorized results are consistent with this explanation: they are comparable for the two packages, and here the r2r timed expression involves a single S3 method dispatch rather than N.
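
The dispatch overhead hypothesis can be explored in isolation with a toy example. The sketch below uses base R only; the class `toy_table` and its method are hypothetical and much simpler than the actual r2r implementation, so the timings indicate only that an R-level method adds a per-call cost.

```r
# A bare environment: `[[` is handled by R's internal default method.
env <- new.env(hash = TRUE)
assign("a", 1, envir = env)

# A hypothetical S3 wrapper whose `[[` method forwards to the environment.
wrapped <- structure(list(env = env), class = "toy_table")
`[[.toy_table` <- function(x, i) {
    get(i, envir = unclass(x)$env, inherits = FALSE)
}

# Time the same lookup repeated many times through both routes.
n_rep <- 1e5
t_env <- system.time(for (k in seq_len(n_rep)) env[["a"]])
t_s3 <- system.time(for (k in seq_len(n_rep)) wrapped[["a"]])

rbind(environment = t_env, s3_wrapper = t_s3)
```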

As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:

set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
    `r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
    `hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>         expr      min       lq     mean   median        uq       max neval
#>   r2r_[[_bis 65.87390 78.86331 97.91370 97.71733 112.59728 140.82823    30
#>  hash_[[_bis 10.21583 11.43801 15.44436 12.34021  14.87279  46.57487    30

The results are similar to those discussed above. Finally, we test the performance of the two packages in checking the existence of keys (notice that here has_key refers to r2r::has_key, whereas has.key is hash::has.key):

set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
    r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
    hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>          expr       min        lq      mean    median        uq      max neval
#>   r2r_has_key  62.18203  69.42541  88.18813  85.08335  98.78978 133.2706    30
#>  hash_has_key 184.18959 199.30432 249.92223 242.19929 302.40733 354.1190    30

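In this benchmark r2r is faster than hash, by roughly a factor of three in median time. Note, however, that the two timed expressions do not loop over the same keys: the r2r expression uses mixed_keys, whereas the hash expression uses new_keys.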

Key deletion

Finally, we test key deletion. In order to handle name collisions, we will use delete() (which refers to r2r::delete()) and del() (which refers to hash::del()).

microbenchmark(
    r2r_delete = { for (key in keys) delete(m_r2r, key) },
    hash_delete = { for (key in keys) del(key, m_hash) },
    hash_vectorized_delete = { del(keys, m_hash) },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>                    expr        min         lq       mean     median       uq
#>              r2r_delete 115.821059 127.226384 179.257774 178.427584 223.1466
#>             hash_delete  66.979756  78.954774 102.454344  97.385456 115.3295
#>  hash_vectorized_delete   2.181686   2.872363   3.378992   3.429627   3.9547
#>         max neval
#>  269.902141    30
#>  225.925614    30
#>    4.832175    30

The vectorized version of hash significantly outperforms both non-vectorized versions, by a factor of roughly 30 to 50 in median time. Currently, r2r does not support vectorized key deletion¹.
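
The gap between looped and vectorized deletion can also be observed on bare environments. The following base R sketch compares many single-key rm() calls with one vectorized call; `make_env()` and `toy_keys` are hypothetical names, and whether hash::del() works this way internally is an assumption.

```r
# Hypothetical helper: build an environment holding the given keys.
make_env <- function(keys) {
    e <- new.env(hash = TRUE)
    for (k in keys) assign(k, 0, envir = e)
    e
}
toy_keys <- sprintf("key%05d", 1:1e4)

# One rm() call per key.
e_loop <- make_env(toy_keys)
t_loop <- system.time(for (k in toy_keys) rm(list = k, envir = e_loop))

# A single rm() call removing all keys at once.
e_vec <- make_env(toy_keys)
t_vec <- system.time(rm(list = toy_keys, envir = e_vec))

rbind(looped = t_loop, vectorized = t_vec)
```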

Conclusions

The two R packages r2r and hash offer hash table implementations with different advantages and drawbacks. r2r focuses on flexibility, and has a richer set of features. hash is more minimal, but offers superior performance in some important tasks. Finally, on a positive note for both, the two packages share a similar API, which makes it relatively easy to switch between them depending on the needs of a particular use case.

¹ This is due to complications introduced by the internal hash collision handling system of r2r.