Package: beachmat
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2017-11-25
The beachmat package provides a C++ API for handling a variety of R matrix types.
The aim is to abstract away the specific type of matrix object when writing C++ extensions, thus simplifying the processing of data stored in those objects.
Currently, the API supports double-precision, integer, logical and character matrices.
Supported classes include base matrix
objects, a number of classes from the Matrix package, and disk-backed matrices from the HDF5Array package.
The beachmat package currently has several dependencies:
Most of the following instructions are ripped from the Rhtslib vignette.
To link successfully to the beachmat library, a package must include both a src/Makevars.win
and src/Makevars
file.
Note: the contents of src/Makevars.win
and src/Makevars
are almost identical, but not quite.
Be careful of the differences.
Create a src/Makevars.win
file with the following lines:
BEACHMAT_LIBS=$(shell echo 'beachmat::pkgconfig("PKG_LIBS")'|\
"${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS=$(BEACHMAT_LIBS)
… and a src/Makevars
file with the following lines:
BEACHMAT_LIBS=`echo 'beachmat::pkgconfig("PKG_LIBS")'|\
"${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=$(BEACHMAT_LIBS)
The statement for each platfrom modifies the $PKG_LIBS
variable.
If your package needs to add to the $PKG_LIBS
variable, do so by adding to the PKG_LIBS=$(BEACHMAT_LIBS)
line.
For example:
PKG_LIBS=$(BEACHMAT_LIBS) -L/path/to/foolib -lfoo
The Linux implementation embeds the location of the beachmat library in the package-specific shared object via the compiler flag -Wl,rpath,path
, where path is determined by system.file("lib", package="beachmat")
.
The path determined by system.file()
is from .libPaths()
and will resolve all symbolic links.
This can cause problems, e.g., when the “head” node of a cluster mimicks the cluster node via a symbolic link to the directory in which beachmat is installed.
Use the environment variable BEACHMAT_RPATH
to resolve this by setting it to the cluster-node accessible path.
Similar arguments apply to Rhdf5lib with the environment variable RHDF5LIB_RPATH
.
In order for the C/C++ compiler to find the beachmat package headers during installation, add the following to the LinkingTo
field of the DESCRIPTION
file:
LinkingTo: Rcpp, Rhdf5lib, beachmat
In C or C++ code files, use standard techniques, e.g., #include "beachmat/numeric_matrix.h"
(see below for more details).
Header files are available for perusal at the following location (enter in an R session):
system.file(package="beachmat", "include")
## [1] "/tmp/Rtmppw2J5Q/Rinst71da27de2146/beachmat/include"
You need to tell the build system to use C++11, by modifying the SystemRequirements
field of the DESCRIPTION
file:
SystemRequirements: C++11
You also need to ensure that Rcpp is initialized when your package is loaded.
This requires addition of Rcpp
to the Imports
field of the DESCRIPTION
file:
Imports: Rcpp
… and a corresponding importFrom
specification in the NAMESPACE
file:
importFrom(Rcpp, sourceCpp)
(The exact function to be imported doesn't matter, as long as the namespace is loaded. Check out the Rcpp documentation for more details.)
HDF5Array, DelayedArray and beachmat itself should be added to the Suggests
field, as the API will perform some calls to R functions in those packages to query certain parameters.
If you intend to accept instances of Matrix classes, the package should also be listed in the Suggests
field, if not already in Imports
or Depends
:
Suggests: beachmat, HDF5Array, DelayedArray, Matrix
We demonstrate the use of the API for numeric matrices. First, we include the relevant header file:
#include "beachmat/numeric_matrix.h"
A double-precision matrix object dmat
is handled in C++ by passing the SEXP
struct from .Call
to create_numeric_matrix
:
std::unique_ptr<beachmat::numeric_matrix> dptr = beachmat::create_numeric_matrix(dmat);
This creates a unique pointer that points to an object of the numeric_matrix
virtual class.
The exact class depends on the type of matrix in dmat
, though the behaviour of the user-level functions are not affected by this detail.
The available methods for this object are:
dptr->get_nrow()
returns the number of rows (nrow
) as a size_t
.dptr->get_ncol()
returns the number of columns (ncol
) as a size_t
.dptr->get_col(c, in)
takes a Rcpp::Vector::iterator
object in
and fills it with values at column c
.
There should be at least nrow
accessible elements, i.e., *in
and *(in+nrow-1)
should be valid entries.
No value is returned by this method.dptr->get_col(c, in, first, last)
takes a Rcpp::Vector::iterator
object in
and fills it with values at column c
from row first
to last-1
.
There should be at least last-first
accessible elements, i.e., *in
and *(in+last-first-1)
should be valid entries.
No value is returned by this method.dptr->get_row(r, in)
takes an Rcpp::Vector::iterator
object in
and fills it with values at row r
.
There should be at least ncol
accessible elements, i.e., *in
and *(in+ncol-1)
should be valid entries.
No value is returned by this method.dptr->get_row(r, in, first, last)
takes an Rcpp::Vector::iterator
object in
and fills it with values at row r
from column first
to last-1
.
There should be at least last-first
accessible elements, i.e., *in
and *(in+last-first-1)
should be valid entries.
No value is returned by this method.dptr->get(r, c)
returns a double at matrix entry (r, c)
.dptr->clone()
returns a unique pointer to a numeric_matrix
instance of the same type as that pointed to by dptr
.dptr->get_matrix_type()
returns a matrix_type
value specifying the specific matrix representation that is pointed to by dptr
.
This is an enumeration type that can be tested against constants like beachmat::SIMPLE
or beachmat::SPARSE
.dptr->yield()
returns a Rcpp::RObject
containing the original R matrix that was used to create dptr
.In all cases, r
and c
should be non-negative integers (specificaly size_t
) in [0, nrow)
and [0, ncol)
respectively.
Zero-based indexing is assumed for both r
and c
, as is standard for most C/C++ applications.
Similar rules apply to first
and last
, which should be in [0, nrow]
for get_col
and in [0, ncol]
for get_row
.
Furthermore, last >= first
should be true.
If the object X
is a Rcpp::NumericVector::iterator
instance, matrix entries will be extracted as double-precision values.
If it is a Rcpp::IntegerVector::iterator
instance, matrix entries will be extracted as integers with implicit conversion.
It is also possible to use a Rcpp::LogicalVector::iterator
, though this will not behave as expected - see notes below.
There are additional methods that provide some advantages for specific matrix representations:
dptr->get_const_col(c, work, first, last)
returns a Rcpp::Vector::const_iterator
pointing to first
row of column c
.
The arguments are the same as dptr->get_col
with work
being equivalent to X
.
(The data type of work
must correspond to that of the matrix - in this case, it should be a Rcpp::NumericVector::iterator
.)
For simple matrices, this function is more efficient than get_col
as it returns the iterator without needing to copy data into work
.
For other matrices, this function simply calls get_col
and returns an iterator to the start of work
.dptr->get_nonzero_col(c, index, values, first, last)
takes a Rcpp::IntegerVector::iterator
object index
and a Rcpp::Vector::iterator
object values
.
For each non-zero entry in column c
from rows [first, last)
, its row index is stored in the memory pointed to by index
and its value is stored in values
.
(Both iterators should point to memory with at least last-first
addressable elements.)
The return value of the function is the number of non-zero entries stored in this manner.
This function is quite efficient for sparse matrices; for all other matrices, get_col
is called and zeros are stripped out afterwards.dptr->get_nonzero_col(r, index, values, first, last)
takes a Rcpp::IntegerVector::iterator
object index
and a Rcpp::Vector::iterator
object values
.
For each non-zero entry in row r
from columns [first, last)
, its column index is stored in the memory pointed to by index
and its value is stored in values
.
(Both iterators should point to memory with at least last-first
addressable elements.)
The return value of the function is the number of non-zero entries stored in this manner.
This function is quite efficient for sparse matrices; for all other matrices, get_row
is called and zeros are stripped out afterwards.Obviously, the get_nonzero_*
functions are not available for character matrices.
Logical, integer and character matrices can be handled by including the following header files:
#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"
The dispatch function changes correspondingly, for logical matrix lmat
, integer matrix imat
and character matrix cmat
:
std::unique_ptr<beachmat::logical_matrix> lptr=beachmat::create_logical_matrix(lmat);
std::unique_ptr<beachmat::integer_matrix> iptr=beachmat::create_integer_matrix(imat);
std::unique_ptr<beachmat::character_matrix> cptr=beachmat::create_character_matrix(cmat);
Equivalent methods are available for each matrix types with appropriate changes in type.
For integer and logical matrices, get
will return an integer,
while X
can be an iterator
object of a Rcpp::IntegerVector
, Rcpp::LogicalVector
or Rcpp::NumericVector
instance (type conversions are implicitly performed as necessary).
For character matrices, X
should be of type Rcpp::StringVector::iterator
, and get
will return a Rcpp::String
.
The following matrix classes are supported:
matrix
, dgeMatrix
, dgCMatrix
, dspMatrix
, RleMatrix
, HDF5Matrix
, DelayedMatrix
matrix
, RleMatrix
, HDF5Matrix
, DelayedMatrix
matrix
, lgeMatrix
, lgCMatrix
, lspMatrix
, RleMatrix
, HDF5Matrix
, DelayedMatrix
matrix
, RleMatrix
, HDF5Matrix
, DelayedMatrix
Additional classes can be added on a need-to-use basis.
As a general rule, if a matrix-like object can be stored in a SummarizedExperiment
class (from the SummarizedExperiment package), the API should be able to handle it.
Please contact the maintainers if you have a class that you would like to see supported.
logical
matrices, using a Rcpp::LogicalVector::iterator
in the get_*
methods is not recommended.
For numeric_matrix
instances, double-to-integer conversion is performed such that values in (-1, 1)
are converted to integer 0
.
This would be interpreted as a logical FALSE
, which is incorrect for non-zero double-precision values.
For integer_matrix
instances, integer values are not coerced to {0, 1}
when they are assigned into the Rcpp::LogicalVector
.
Thus, even though the interpretation is correct, the vector produced will not be equivalent to the result of an as.logical
call.get
methods from different threads.
Some methods use cached class members for greater efficiency, and simultaneous calls will cause race conditions.
It is the responsibility of the calling function to lock (and unlock) access to a single *_matrix
object across threads.
Alternatively, the clone
method can be called to generate a unique pointer to a new *_matrix
instance, which can be used concurrently in another thread.
This is fairly cheap as the underlying matrix data are not copied.character_matrix
data, we do not return raw const char*
pointers to the C-style string.
Rather, the Rcpp::String
class is used as it provides a convenient wrapper around the underlying CHARSXP
.
This ensures that the string is stored in R's global cache and is suitably protected against garbage collection. DelayedMatrix
objects are automatically realized via the realize
method in the DelayedArray package.
This uses the same realization backend that was specified in R – call getRealizationBackend()
to determine the current backend.
If the realized matrix is to be reused, it may be more efficient to perform the realization in R and pass the result to .Call
.std::exception
class, containing an informative error message.
These should be caught and handled gracefully by the end-user code, otherwise a segmentation fault will probably occur.
See the error-handling mechanism in Rcpp for how to deal with these exceptions.yield
method can be used to obtain the original Rcpp::RObject
for input to RcppArmadillo or RcppEigen.
This functionality is generally limited to base matrices, though there is also limited support for sparse matrices in these libraries.Three types of output matrices are supported - simple matrix
, *gCMatrix
and HDF5Matrix
objects.
For example, a simple numeric output matrix with nrow
rows and ncol
columns is created by:
std::unique_ptr<numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, beachmat::SIMPLE_PARAM);
A sparse matrix is similarly created by setting the last argument to beachmat::SPARSE_PARAM
,
while a HDF5Matrix
is constructed by setting beachmat::HDF5_PARAM
.
These constants are instances of the output_param
class that specify the type and parameters of the output matrix to be constructed.
Another option is to allow the function to dynamically choose the output type to match that of an existing matrix.
This is useful for automatically choosing an output format that reflects the choice of input format.
For example, if data are supplied to a function in a simple matrix, it would be reasonable to expect that the output is similarly small enough to be stored as a simple matrix.
On the other hand, if the input is a HDF5Matrix
, it may make more sense to return a HDF5Matrix
object.
Dynamic choice of output type is performed by using the Rcpp::Robject
object containing the input matrix to initialize the output_param
object.
If I have a matrix object dmat
, the output type can be matched to the input type with:
beachmat::output_param oparam(dmat, /* simplify = */ true, /* preserve_zero = */ false);
std::unique_ptr<numeric_output> odmat=beachmat::create_numeric_output(nrow, ncol, oparam);
A similar process can be used for a pointer dptr
to an existing *_matrix
instance:
beachmat::output_param oparam(dptr->get_matrix_type(), /* simplify = */ true, /* preserve_zero = */ false);
The simplify
argument indicates whether non-matrix
input objects should be “simplified” to a matrix
output object.
If false
, a HDF5Matrix
output object will be returned instead.
The preserve_zero
argument indicates whether a *gCMatrix
input should result in a *gCMatrix
output when simplify=false
(for logical or double-precision data only).
Exact zeroes are detected and ignored when filling this matrix.
To put data into the output matrix pointed to by dptr
, the following methods are available:
dptr->set_col(c, out)
fills column c
with elements pointed to by a Rcpp::NumericVector::iterator
object (out
).
There should be at least nrow
accessible elements, i.e., *out
and *(out+nrow-1)
should be valid entries.
No value is returned by this method.dptr->set_col(c, out, first, last)
fills column c
from row first
to last-1
, using the elements pointed to by a Rcpp::NumericVector::iterator
object (out
).
There should be at least last-first
accessible elements, i.e., *out
and *(out+last-first-1)
should be valid entries.
No value is returned by this method.dptr->set_row(r, out)
fills row r
with elements pointed to by a Rcpp::NumericVector::iterator
object (out
).
There should be at least ncol
accessible elements, i.e., *out
and *(out+ncol-1)
should be valid entries.
No value is returned by this method.dptr->set_row(r, out, first, last)
fills row r
from column first
to last-1
, with elements pointed to by a Rcpp::NumericVector::iterator
object (out
).
There should be at least last-first
accessible elements, i.e., *out
and *(out+last-first-1)
should be valid entries.
No value is returned by this method.dptr->set(r, c, Y)
fills the matrix entry (r, c)
with the double Y
.
No value is returned by this method.dptr->yield()
returns a Rcpp::RObject
object containing the matrix data to pass into R.The allowable ranges of r
, c
, first
and last
are the same as previously described.
The get_nrow
, get_ncol
, get_row
, get_col
, get
, get_matrix_type
and clone
methods are also available and behave as described for numeric_matrix
objects.
Logical, integer and character output matrices are supported by changing the types in the creator function (and its variants):
std::unique_ptr<integer_output> oimat=beachmat::create_integer_output(nrow, ncol);
std::unique_ptr<logical_output> olmat=beachmat::create_logical_output(nrow, ncol);
std::unique_ptr<character_output> ocmat=beachmat::create_character_output(nrow, ncol);
Equivalent methods are available for these matrix types.
For integer and logical matrices, X
should be of type Rcpp::IntegerVector::iterator
and Rcpp::LogicalVector::iterator
, respectively, and Y
should be an integer.
For character matrices, X
should be of type Rcpp::StringVector::iterator
and Y
should be a Rcpp::String
object.
HDF5Matrix
will perform a new getHDF5DumpFile()
call with for.use=TRUE
to obtain a new file name for storing the HDF5 output.
Similarly, the name of the data set is obtained via getHDF5DumpName()
.
Both names are recorded in the dump log via appendDatasetCreationToHDF5DumpLog()
.
This mimics the behaviour observed when HDF5Matrix
instances are created at the R level.
Similarly, the compression level and chunk dimensions will be taken from the appropriate global variables.oparam.set_chunk_dim(chunk_nr, chunk_nc)
,
where chunk_nr
and chunk_nc
are the chunk rows and columns respectively.
Similarly, the compression level can be set using oparam.set_compression(compress)
, where compress
can range from 0 (contiguous) to 9 (most compression).
If specified, these settings will override the default behaviour, but will have no effect for non-HDF5 output.nr
-by-nc
, the optimal chunk dimensions can be specified with oparam.optimize_chunk_dims(nr, nc)
.
beachmat exploits the chunk cache to store all chunks along a row or column, thus avoiding the need to reload data for the next row or column.
These chunk settings are designed to minimize the chunk cache size while also reducing the number of disk reads.character_output
instance.
This can be set using oparam.set_strlen(strlen)
where strlen
is the length of a C-style string, not including the null-terminating character.
Any attempts to fill the output matrix with strings larger than strlen
will result in silent truncation.clone
to create *_output
copies for use in each thread.
However, each copy may still read from and write to the same disk/memory location.
It is the responsibility of the calling function to ensure that access is locked and unlocked appropriately across multiple threads.
(This may not be necessary if access does not overlap, e.g., writing to different rows.)sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.6.0 knitr_1.17
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.2 backports_1.1.1 magrittr_1.5 rprojroot_1.2
## [5] htmltools_0.3.6 tools_3.4.2 yaml_2.1.14 Rcpp_0.12.14
## [9] rmarkdown_1.8 stringi_1.1.6 digest_0.6.12 stringr_1.2.0
## [13] evaluate_0.10.1