Using Rcpp and C++ to Count Genotypes
I had a matrix (88662 loci x 2060 genotypes) of maize chromosome 1 genotypes, encoded as 0, 1, 2 (e.g. the number of alternate alleles). I needed genotype counts per row, which at first glance is quite easy to solve: just use apply
and table
:
counts <- apply(chr1g, 1, table)
What’s the problem with this approach? First, technical: if a row only has genotypes 0 and 2, we don’t get counts for 1, which makes merging into a matrix later on a total nightmare (evident because apply
is smart enough to return a list). Second, it ignores NA
, which is not good. We can fix this with exclude=NULL
:
counts <- apply(chr1g, 1, function(x) table(x, exclude=NULL))
This doesn’t solve our first problem, and it’s a bit slower. apply(chr1g, 1, table)
took:
user system elapsed
84.638 3.110 87.865
Where as apply(chr1g, 1, function(x) table(x, exclude=NULL))
takes:
user system elapsed
108.718 5.708 114.609
(and no, passing exclude=NULL
to directly to apply
doesn’t make it faster). The way around the first technical issue is to use factors. as.factor
removes dimensions, so we’re out of luck converting the whole matrix at once (plus, this would require a copy in memory and these objects are moderately large). So, we could so something like:
count <- apply(chr1g, 1, function(x) table(factor(x, levels=c(0, 1, 2, NA)), exclude=NULL))})
This works well, but is also not too fast (chromosome 1 is 15% of our dataset):
user system elapsed
95.918 5.746 101.861
Rcpp stands out as a simple solution here: this is very easy to code up (it took me literally five minutes). Looking at the timing first:
user system elapsed
8.427 1.573 10.027
We see we have a clear winner. The whole dataset is 573,392 rows. Overall, this would take (95.918 / 88662)* 573392 = 620.3178
seconds or about 10 minutes to complete on all data. Chances are, I’ll have to run this code a few times as the analysis changes. In contrast, the Rcpp method takes (8.427 / 88662)* 573392 = 54.49882
seconds. That’s under 6 minutes to code and implement a working, faster solution in Rcpp!
The code is quite easy to understand too:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector countGenotypes(IntegerVector x) {
// This method is a specialized version of R's table that counts genotypes
// encoded as 0, 1, 2 in a vector (and also returns NA) always of length 4,
// always as numbers of 0, 1, 2, NA. This allows faster usage with apply, as
// we don't need to convert to factor to get all genotype counts, even if
// none are present.
CharacterVector names = CharacterVector::create("0", "1", "2", "NA");
IntegerVector genocounts(4);
genocounts.fill(0);
for (int i = 0; i < x.length(); i++) {
if (!IntegerVector::is_na(x[i])) {
genocounts[x[i]] += 1;
} else {
genocounts[3] += 1;
}
}
genocounts.attr("names") = names;
return genocounts;
}