A common C++ API for all types of R matrices

Package: beachmat
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2017-11-25

Introduction

The beachmat package provides a C++ API for handling a variety of R matrix types. The aim is to abstract away the specific type of matrix object when writing C++ extensions, thus simplifying the processing of data stored in those objects. Currently, the API supports double-precision, integer, logical and character matrices. Supported classes include base matrix objects, a number of classes from the Matrix package, and disk-backed matrices from the HDF5Array package.

Linking to the package

Prerequisites

The beachmat package currently has several dependencies:

The compiler should support the C++11 standard.
Rcpp, Rhdf5lib, HDF5Array and DelayedArray should be installed.

Most of the following instructions are ripped from the Rhtslib vignette.

Link to the library

To link successfully to the beachmat library, a package must include both a src/Makevars.win and src/Makevars file.

Note: the contents of src/Makevars.win and src/Makevars are almost identical, but not quite. Be careful of the differences.

Create a src/Makevars.win file with the following lines:

BEACHMAT_LIBS=$(shell echo 'beachmat::pkgconfig("PKG_LIBS")'|\
    "${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS=$(BEACHMAT_LIBS)

… and a src/Makevars file with the following lines:

BEACHMAT_LIBS=`echo 'beachmat::pkgconfig("PKG_LIBS")'|\
    "${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=$(BEACHMAT_LIBS)

The statement for each platfrom modifies the $PKG_LIBS variable. If your package needs to add to the $PKG_LIBS variable, do so by adding to the PKG_LIBS=$(BEACHMAT_LIBS) line. For example:

PKG_LIBS=$(BEACHMAT_LIBS) -L/path/to/foolib -lfoo

The Linux implementation embeds the location of the beachmat library in the package-specific shared object via the compiler flag -Wl,rpath,path, where path is determined by system.file("lib", package="beachmat"). The path determined by system.file() is from .libPaths() and will resolve all symbolic links. This can cause problems, e.g., when the “head” node of a cluster mimicks the cluster node via a symbolic link to the directory in which beachmat is installed. Use the environment variable BEACHMAT_RPATH to resolve this by setting it to the cluster-node accessible path. Similar arguments apply to Rhdf5lib with the environment variable RHDF5LIB_RPATH.

Find headers

In order for the C/C++ compiler to find the beachmat package headers during installation, add the following to the LinkingTo field of the DESCRIPTION file:

LinkingTo: Rcpp, Rhdf5lib, beachmat

In C or C++ code files, use standard techniques, e.g., #include "beachmat/numeric_matrix.h" (see below for more details). Header files are available for perusal at the following location (enter in an R session):

system.file(package="beachmat", "include")

## [1] "/tmp/Rtmppw2J5Q/Rinst71da27de2146/beachmat/include"

Finishing up

You need to tell the build system to use C++11, by modifying the SystemRequirements field of the DESCRIPTION file:

SystemRequirements: C++11

You also need to ensure that Rcpp is initialized when your package is loaded. This requires addition of Rcpp to the Imports field of the DESCRIPTION file:

Imports: Rcpp

… and a corresponding importFrom specification in the NAMESPACE file:

importFrom(Rcpp, sourceCpp)

(The exact function to be imported doesn't matter, as long as the namespace is loaded. Check out the Rcpp documentation for more details.)

HDF5Array, DelayedArray and beachmat itself should be added to the Suggests field, as the API will perform some calls to R functions in those packages to query certain parameters. If you intend to accept instances of Matrix classes, the package should also be listed in the Suggests field, if not already in Imports or Depends:

Suggests: beachmat, HDF5Array, DelayedArray, Matrix

Overview of the API (input)

Creating the matrix pointer

We demonstrate the use of the API for numeric matrices. First, we include the relevant header file:

#include "beachmat/numeric_matrix.h"

A double-precision matrix object dmat is handled in C++ by passing the SEXP struct from .Call to create_numeric_matrix:

std::unique_ptr<beachmat::numeric_matrix> dptr = beachmat::create_numeric_matrix(dmat);

This creates a unique pointer that points to an object of the numeric_matrix virtual class. The exact class depends on the type of matrix in dmat, though the behaviour of the user-level functions are not affected by this detail.

Methods for input matrices

The available methods for this object are:

dptr->get_nrow() returns the number of rows (nrow) as a size_t.
dptr->get_ncol() returns the number of columns (ncol) as a size_t.
dptr->get_col(c, in) takes a Rcpp::Vector::iterator object in and fills it with values at column c. There should be at least nrow accessible elements, i.e., *in and *(in+nrow-1) should be valid entries. No value is returned by this method.
dptr->get_col(c, in, first, last) takes a Rcpp::Vector::iterator object in and fills it with values at column c from row first to last-1. There should be at least last-first accessible elements, i.e., *in and *(in+last-first-1) should be valid entries. No value is returned by this method.
dptr->get_row(r, in) takes an Rcpp::Vector::iterator object in and fills it with values at row r. There should be at least ncol accessible elements, i.e., *in and *(in+ncol-1) should be valid entries. No value is returned by this method.
dptr->get_row(r, in, first, last) takes an Rcpp::Vector::iterator object in and fills it with values at row r from column first to last-1. There should be at least last-first accessible elements, i.e., *in and *(in+last-first-1) should be valid entries. No value is returned by this method.
dptr->get(r, c) returns a double at matrix entry (r, c).
dptr->clone() returns a unique pointer to a numeric_matrix instance of the same type as that pointed to by dptr.
dptr->get_matrix_type() returns a matrix_type value specifying the specific matrix representation that is pointed to by dptr. This is an enumeration type that can be tested against constants like beachmat::SIMPLE or beachmat::SPARSE.
dptr->yield() returns a Rcpp::RObject containing the original R matrix that was used to create dptr.

In all cases, r and c should be non-negative integers (specificaly size_t) in [0, nrow) and [0, ncol) respectively. Zero-based indexing is assumed for both r and c, as is standard for most C/C++ applications. Similar rules apply to first and last, which should be in [0, nrow] for get_col and in [0, ncol] for get_row. Furthermore, last >= first should be true.

If the object X is a Rcpp::NumericVector::iterator instance, matrix entries will be extracted as double-precision values. If it is a Rcpp::IntegerVector::iterator instance, matrix entries will be extracted as integers with implicit conversion. It is also possible to use a Rcpp::LogicalVector::iterator, though this will not behave as expected - see notes below.

Special methods for specific matrix types

There are additional methods that provide some advantages for specific matrix representations:

dptr->get_const_col(c, work, first, last) returns a Rcpp::Vector::const_iterator pointing to first row of column c. The arguments are the same as dptr->get_col with work being equivalent to X. (The data type of work must correspond to that of the matrix - in this case, it should be a Rcpp::NumericVector::iterator.) For simple matrices, this function is more efficient than get_col as it returns the iterator without needing to copy data into work. For other matrices, this function simply calls get_col and returns an iterator to the start of work.
dptr->get_nonzero_col(c, index, values, first, last) takes a Rcpp::IntegerVector::iterator object index and a Rcpp::Vector::iterator object values. For each non-zero entry in column c from rows [first, last), its row index is stored in the memory pointed to by index and its value is stored in values. (Both iterators should point to memory with at least last-first addressable elements.) The return value of the function is the number of non-zero entries stored in this manner. This function is quite efficient for sparse matrices; for all other matrices, get_col is called and zeros are stripped out afterwards.
dptr->get_nonzero_col(r, index, values, first, last) takes a Rcpp::IntegerVector::iterator object index and a Rcpp::Vector::iterator object values. For each non-zero entry in row r from columns [first, last), its column index is stored in the memory pointed to by index and its value is stored in values. (Both iterators should point to memory with at least last-first addressable elements.) The return value of the function is the number of non-zero entries stored in this manner. This function is quite efficient for sparse matrices; for all other matrices, get_row is called and zeros are stripped out afterwards.

Obviously, the get_nonzero_* functions are not available for character matrices.

Other matrix types

Logical, integer and character matrices can be handled by including the following header files:

#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"

The dispatch function changes correspondingly, for logical matrix lmat, integer matrix imat and character matrix cmat:

std::unique_ptr<beachmat::logical_matrix> lptr=beachmat::create_logical_matrix(lmat);
std::unique_ptr<beachmat::integer_matrix> iptr=beachmat::create_integer_matrix(imat);
std::unique_ptr<beachmat::character_matrix> cptr=beachmat::create_character_matrix(cmat);

Equivalent methods are available for each matrix types with appropriate changes in type. For integer and logical matrices, get will return an integer, while X can be an iterator object of a Rcpp::IntegerVector, Rcpp::LogicalVector or Rcpp::NumericVector instance (type conversions are implicitly performed as necessary). For character matrices, X should be of type Rcpp::StringVector::iterator, and get will return a Rcpp::String.

The following matrix classes are supported:

numeric: matrix, dgeMatrix, dgCMatrix, dspMatrix, RleMatrix, HDF5Matrix, DelayedMatrix
integer: matrix, RleMatrix, HDF5Matrix, DelayedMatrix
logical: matrix, lgeMatrix, lgCMatrix, lspMatrix, RleMatrix, HDF5Matrix, DelayedMatrix
character: matrix, RleMatrix, HDF5Matrix, DelayedMatrix

Additional classes can be added on a need-to-use basis. As a general rule, if a matrix-like object can be stored in a SummarizedExperiment class (from the SummarizedExperiment package), the API should be able to handle it. Please contact the maintainers if you have a class that you would like to see supported.

Important developer information

For non-logical matrices, using a Rcpp::LogicalVector::iterator in the get_* methods is not recommended. For numeric_matrix instances, double-to-integer conversion is performed such that values in (-1, 1) are converted to integer 0. This would be interpreted as a logical FALSE, which is incorrect for non-zero double-precision values. For integer_matrix instances, integer values are not coerced to {0, 1} when they are assigned into the Rcpp::LogicalVector. Thus, even though the interpretation is correct, the vector produced will not be equivalent to the result of an as.logical call.
The API is not thread-safe for simultaneous calls to the get methods from different threads. Some methods use cached class members for greater efficiency, and simultaneous calls will cause race conditions. It is the responsibility of the calling function to lock (and unlock) access to a single *_matrix object across threads. Alternatively, the clone method can be called to generate a unique pointer to a new *_matrix instance, which can be used concurrently in another thread. This is fairly cheap as the underlying matrix data are not copied.
When accessing character_matrix data, we do not return raw const char* pointers to the C-style string. Rather, the Rcpp::String class is used as it provides a convenient wrapper around the underlying CHARSXP. This ensures that the string is stored in R's global cache and is suitably protected against garbage collection.
DelayedMatrix objects are automatically realized via the realize method in the DelayedArray package. This uses the same realization backend that was specified in R – call getRealizationBackend() to determine the current backend. If the realized matrix is to be reused, it may be more efficient to perform the realization in R and pass the result to .Call.
The API will happily throw exceptions of the std::exception class, containing an informative error message. These should be caught and handled gracefully by the end-user code, otherwise a segmentation fault will probably occur. See the error-handling mechanism in Rcpp for how to deal with these exceptions.
For numeric matrices, beachmat does not support higher-level matrix operations such as addition, multiplication or various factorizations. Rather, the yield method can be used to obtain the original Rcpp::RObject for input to RcppArmadillo or RcppEigen. This functionality is generally limited to base matrices, though there is also limited support for sparse matrices in these libraries.

Overview of the API (output)

Specifying the output type

Three types of output matrices are supported - simple matrix, *gCMatrix and HDF5Matrix objects. For example, a simple numeric output matrix with nrow rows and ncol columns is created by:

std::unique_ptr<numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, beachmat::SIMPLE_PARAM);

A sparse matrix is similarly created by setting the last argument to beachmat::SPARSE_PARAM, while a HDF5Matrix is constructed by setting beachmat::HDF5_PARAM. These constants are instances of the output_param class that specify the type and parameters of the output matrix to be constructed.

Another option is to allow the function to dynamically choose the output type to match that of an existing matrix. This is useful for automatically choosing an output format that reflects the choice of input format. For example, if data are supplied to a function in a simple matrix, it would be reasonable to expect that the output is similarly small enough to be stored as a simple matrix. On the other hand, if the input is a HDF5Matrix, it may make more sense to return a HDF5Matrix object.

Dynamic choice of output type is performed by using the Rcpp::Robject object containing the input matrix to initialize the output_param object. If I have a matrix object dmat, the output type can be matched to the input type with:

beachmat::output_param oparam(dmat, /* simplify = */ true, /* preserve_zero = */ false);
std::unique_ptr<numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, oparam);

A similar process can be used for a pointer dptr to an existing *_matrix instance:

beachmat::output_param oparam(dptr->get_matrix_type(), /* simplify = */ true, /* preserve_zero = */ false);

The simplify argument indicates whether non-matrix input objects should be “simplified” to a matrix output object. If false, a HDF5Matrix output object will be returned instead. The preserve_zero argument indicates whether a *gCMatrix input should result in a *gCMatrix output when simplify=false (for logical or double-precision data only). Exact zeroes are detected and ignored when filling this matrix.

Methods for output matrices

To put data into the output matrix pointed to by dptr, the following methods are available:

dptr->set_col(c, out) fills column c with elements pointed to by a Rcpp::NumericVector::iterator object (out). There should be at least nrow accessible elements, i.e., *out and *(out+nrow-1) should be valid entries. No value is returned by this method.
dptr->set_col(c, out, first, last) fills column c from row first to last-1, using the elements pointed to by a Rcpp::NumericVector::iterator object (out). There should be at least last-first accessible elements, i.e., *out and *(out+last-first-1) should be valid entries. No value is returned by this method.
dptr->set_row(r, out) fills row r with elements pointed to by a Rcpp::NumericVector::iterator object (out). There should be at least ncol accessible elements, i.e., *out and *(out+ncol-1) should be valid entries. No value is returned by this method.
dptr->set_row(r, out, first, last) fills row r from column first to last-1, with elements pointed to by a Rcpp::NumericVector::iterator object (out). There should be at least last-first accessible elements, i.e., *out and *(out+last-first-1) should be valid entries. No value is returned by this method.
dptr->set(r, c, Y) fills the matrix entry (r, c) with the double Y. No value is returned by this method.
dptr->yield() returns a Rcpp::RObject object containing the matrix data to pass into R.

The allowable ranges of r, c, first and last are the same as previously described. The get_nrow, get_ncol, get_row, get_col, get, get_matrix_type and clone methods are also available and behave as described for numeric_matrix objects.

Other matrix types

Logical, integer and character output matrices are supported by changing the types in the creator function (and its variants):

std::unique_ptr<integer_output> oimat=beachmat::create_integer_output(nrow, ncol);
std::unique_ptr<logical_output> olmat=beachmat::create_logical_output(nrow, ncol);
std::unique_ptr<character_output> ocmat=beachmat::create_character_output(nrow, ncol);

Equivalent methods are available for these matrix types. For integer and logical matrices, X should be of type Rcpp::IntegerVector::iterator and Rcpp::LogicalVector::iterator, respectively, and Y should be an integer. For character matrices, X should be of type Rcpp::StringVector::iterator and Y should be a Rcpp::String object.

Important developer information

Creation of a HDF5Matrix will perform a new getHDF5DumpFile() call with for.use=TRUE to obtain a new file name for storing the HDF5 output. Similarly, the name of the data set is obtained via getHDF5DumpName(). Both names are recorded in the dump log via appendDatasetCreationToHDF5DumpLog(). This mimics the behaviour observed when HDF5Matrix instances are created at the R level. Similarly, the compression level and chunk dimensions will be taken from the appropriate global variables.
By default, the chunk dimensions and compression level for HDF5 output are retrieved using R functions from the HDF5Array package. An alternative is to directly specify the chunk dimensions using oparam.set_chunk_dim(chunk_nr, chunk_nc), where chunk_nr and chunk_nc are the chunk rows and columns respectively. Similarly, the compression level can be set using oparam.set_compression(compress), where compress can range from 0 (contiguous) to 9 (most compression). If specified, these settings will override the default behaviour, but will have no effect for non-HDF5 output.
For consecutive row and column access from a matrix with dimensions nr-by-nc, the optimal chunk dimensions can be specified with oparam.optimize_chunk_dims(nr, nc). beachmat exploits the chunk cache to store all chunks along a row or column, thus avoiding the need to reload data for the next row or column. These chunk settings are designed to minimize the chunk cache size while also reducing the number of disk reads.
HDF5 character output is stored as fixed-width character arrays. As such, the API must know the maximum string length during construction of a character_output instance. This can be set using oparam.set_strlen(strlen) where strlen is the length of a C-style string, not including the null-terminating character. Any attempts to fill the output matrix with strings larger than strlen will result in silent truncation.
The API is not thread-safe, due to (i) the use of cached class members and (ii) the potential for race conditions when writing to the same location on disk/memory. The first issue can be solved by using clone to create *_output copies for use in each thread. However, each copy may still read from and write to the same disk/memory location. It is the responsibility of the calling function to ensure that access is locked and unlocked appropriately across multiple threads. (This may not be necessary if access does not overlap, e.g., writing to different rows.)

Session information

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.6.0 knitr_1.17     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] htmltools_0.3.6 tools_3.4.2     yaml_2.1.14     Rcpp_0.12.14   
##  [9] rmarkdown_1.8   stringi_1.1.6   digest_0.6.12   stringr_1.2.0  
## [13] evaluate_0.10.1