---
title: "Comparison with `{hash}`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparison}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{microbenchmark, hash}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette provides a comparison of `{r2r}` with the same-purpose CRAN package [`{hash}`](https://CRAN.R-project.org/package=hash), which also offers an implementation of hash tables based on R environments. We first describe the features offered by both packages, and then perform some benchmark timing comparisons. The package versions referred to in this vignette are:

```{r message=FALSE, warning=FALSE}
library(hash)
library(r2r)
packageVersion("hash")
packageVersion("r2r")
```

## Features

Both `{r2r}` and `{hash}` hash tables are built on top of the R built-in `environment` data structure, and have thus a similar API. In particular, hash table objects have reference semantics for both packages. `{r2r}` `hashtable`s are S3 class objects, whereas in `{hash}` the data structure is implemented as an S4 class.

Hash tables provided by `r2r` support arbitrary type keys and values, arbitrary key comparison and hash functions, and have customizable behaviour (either throw an exception or return a default value) upon query of a missing key.

In contrast, hash tables in `hash` currently support only string keys, with basic identity comparison (the hashing is performed automatically by the underlying `environment` objects); values can be arbitrary R objects. Querying missing keys through non-vectorized `[[`-subsetting returns the default value `NULL`, whereas queries through vectorized `[`-subsetting result in an error. On the other hand, `hash` also offers support for inverting hash tables (an experimental feature at the time of writing).

The table below summarizes the features of the two packages

```{r echo=FALSE}
knitr::kable(
	data.frame(
		Feature = c(
			"Basic data structure",
			"Arbitrary type keys", 
			"Arbitrary type values",
			"Arbitrary hash function",
			"Arbitrary key comparison function",
			"Throw or return default on missing keys",
			"Hash table inversion"
		),
		r2r = c("R environment", "X", "X", "X", "X", "X", ""),
		hash = c("R environment", "", "X", "", "", "", "X")
		),
	align = "c",
	caption = "Features supported by {r2r} and {hash}"
	)
```

## Performance tests

We will perform our benchmark tests using the CRAN package [`microbenchmark`](https://CRAN.R-project.org/package=microbenchmark).

```{r}
library(microbenchmark)
```


#### Key insertion

We start by timing the insertion of:

```{r}
N <- 1e4
```

random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (*i.e.* length one character) keys. We can generate random keys as follows:

```{r}
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
	sample(chars, n, replace = TRUE),
	sample(chars, n, replace = TRUE),
	sample(chars, n, replace = TRUE),
	sample(chars, n, replace = TRUE),
	sample(chars, n, replace = TRUE)
	)

set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)

```

We test both the non-vectorized (`[[<-`) and vectorized (`[<-`) operators:

```{r}
microbenchmark(
	`r2r_[[<-` = {
		for (i in seq_along(keys))
			m_r2r[[ keys[[i]] ]] <- values[[i]]
	},
	`r2r_[<-` = { m_r2r[keys] <- values },
	`hash_[[<-` = { 
		for (i in seq_along(keys))
			m_hash[[ keys[[i]] ]] <- values[[i]]
	},
	`hash_[<-` = m_hash[keys] <- values,
	
	times = 30, 
	setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
```

As it is seen, `r2r` and `hash` have comparable performances at the insertion of 
key-value pairs, with both vectorized and non-vectorized insertions, `hash` 
being somewhat more efficient in both cases.

#### Key query

We now test key query, again both in non-vectorized and vectorized form:

```{r}
microbenchmark(
	`r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
	`r2r_[` = { m_r2r[ keys ] },
	`hash_[[` = { for (key in keys) m_hash[[ key ]] },
	`hash_[` = { m_hash[ keys ] },
	
	times = 30,
	setup = { 
		m_r2r <- hashmap(); m_r2r[keys] <- values
		m_hash <- hash(); m_hash[keys] <- values
	}
)
```

For non-vectorized queries, `hash` is significantly faster (by one order of magnitude) 
than `r2r`. This is likely due to the fact that the `[[` method dispatch is 
handled natively by R in `hash` (*i.e.* the default `[[` method for `environment`s is used
), whereas `r2r` suffers the overhead of S3 method dispatch. This is confirmed by 
the result for vectorized queries, which is comparable for the two packages; notice 
that here a single (rather than `N`) S3 method dispatch occurs in the `r2r` timed 
expression.

As an additional test, we perform the benchmarks for non-vectorized expressions 
with a new set of keys:

```{r}
set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
	`r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
	`hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
	
	times = 30,
	setup = { 
		m_r2r <- hashmap(); m_r2r[keys] <- values
		m_hash <- hash(); m_hash[keys] <- values
	}
)
```

The results are similar to the ones already commented. Finally, we test the 
performances of the two packages in checking the existence of keys (notice that 
here `has_key` refers to `r2r::has_key`, whereas `has.key` is `hash::has.key`):

```{r}
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
	r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
	hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
	
	times = 30,
	setup = { 
		m_r2r <- hashmap(); m_r2r[keys] <- values
		m_hash <- hash(); m_hash[keys] <- values
	}
)
```

The results are comparable for the two packages, `r2r` being slightly more 
performant in this particular case.

#### Key deletion

Finally, we test key deletion. In order to handle name collisions, we will use `delete()` 
(which refers to `r2r::delete()`) and `del()` (which refers to `hash::del()`).

```{r}
microbenchmark(
	r2r_delete = { for (key in keys) delete(m_r2r, key) },
	hash_delete = { for (key in keys) del(key, m_hash) },
	hash_vectorized_delete = { del(keys, m_hash) },
	
	times = 30,
	setup = { 
		m_r2r <- hashmap(); m_r2r[keys] <- values
		m_hash <- hash(); m_hash[keys] <- values
	}
)
```

The vectorized version of `hash` significantly outperforms the non-vectorized versions (by roughly two orders of magnitude in speed). Currently, `r2r` does not support vectorized key deletion [^1].

## Conclusions

The two R packages `r2r` and `hash` offer hash table implementations with different advantages and drawbacks. `r2r` focuses on flexibility, and has a richer set of features. `hash` is more minimal, but offers superior performance in some important tasks. Finally, as a positive note for both parties, the two packages share a similar API, making it relatively easy to switch between the two, according to the particular use case needs.

[^1]: This is due to complications introduced by the internal hash collision handling system of `r2r`.