BiocNeighbors 1.8.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3613 9813 4083 6358 3983 3841 9690 990 3672 9237
## [2,] 2730 898 5300 7456 1705 9050 8029 3410 1236 5852
## [3,] 8347 3664 3221 5832 9552 6442 4651 3990 4528 4338
## [4,] 8960 3145 8699 6751 5387 5205 127 2384 9399 2483
## [5,] 2152 6140 3869 7777 4876 8563 3927 5677 1768 4050
## [6,] 6756 1306 4789 257 2477 3204 8845 7856 9325 2529
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0781144 1.1113344 1.1145079 1.129354 1.1311136 1.1343979 1.1349317
## [2,] 0.9071338 0.9250305 0.9465445 1.024742 1.0320660 1.0788873 1.0926216
## [3,] 1.0713972 1.0809000 1.0839305 1.086237 1.1044838 1.1107026 1.1273242
## [4,] 0.7429730 0.9661263 1.0163893 1.019275 1.0233577 1.0260446 1.0373743
## [5,] 0.8687162 0.9705194 1.0535698 1.066771 1.0937678 1.1508070 1.1616614
## [6,] 0.8059543 0.8987697 0.9082887 0.910625 0.9567227 0.9669381 0.9923332
## [,8] [,9] [,10]
## [1,] 1.138030 1.142646 1.144375
## [2,] 1.102200 1.104858 1.105894
## [3,] 1.144813 1.163753 1.164322
## [4,] 1.043010 1.056140 1.065495
## [5,] 1.165069 1.184794 1.187664
## [6,] 0.992562 1.000768 1.005777
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 331 4323 4123 4205 3873
## [2,] 8202 5315 8944 3853 7496
## [3,] 9098 3327 8742 2371 6678
## [4,] 4017 6387 2827 8607 3822
## [5,] 3041 7477 9651 8744 1238
## [6,] 3275 8183 3086 3916 5454
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9419816 1.0356504 1.0510969 1.0612359 1.0825620
## [2,] 0.9605968 0.9929906 1.0112295 1.0572425 1.0747641
## [3,] 0.8392634 0.9016435 0.9875906 1.0381438 1.0406716
## [4,] 0.7842190 0.7871433 0.8616644 0.9260836 0.9327102
## [5,] 0.8686712 0.9164358 0.9401396 0.9404123 0.9861693
## [6,] 0.9874870 1.0020382 1.0524002 1.0581913 1.0642735
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpoen9na/file66a048a0e4a6.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.8.2 knitr_1.30 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 bookdown_0.21 lattice_0.20-41
## [4] digest_0.6.27 grid_4.0.3 stats4_4.0.3
## [7] magrittr_2.0.1 evaluate_0.14 rlang_0.4.9
## [10] stringi_1.5.3 S4Vectors_0.28.0 Matrix_1.2-18
## [13] rmarkdown_2.5 BiocParallel_1.24.1 tools_4.0.3
## [16] stringr_1.4.0 parallel_4.0.3 xfun_0.19
## [19] yaml_2.2.1 compiler_4.0.3 BiocGenerics_0.36.0
## [22] BiocManager_1.30.10 htmltools_0.5.0