<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Vince Buffalo – Vince Buffalo</title>
    <link>https://vincebuffalo.com/</link>
    <description>Recent content on Vince Buffalo</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    
	  <atom:link href="https://vincebuffalo.com/index.xml" rel="self" type="application/rss+xml" />
    
    
      
      
    
    
    <item>
      <title>Why Do Species Get a Thin Slice of π? Revisiting Lewontin&#39;s Paradox of Variation</title>
      <link>https://vincebuffalo.com/blog/why-do-species-get-a-thin-slice-of-pi-revisiting-lewontins-paradox-of-variation/</link>
      <pubDate>Wed, 01 Sep 2021 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/why-do-species-get-a-thin-slice-of-pi-revisiting-lewontins-paradox-of-variation/</guid>
      <description>
        
        
        &lt;p&gt;The Great Obsession of population geneticists, to borrow John Gillespie&amp;rsquo;s
words, is genetic variation. As an evolutionary biologist, it&amp;rsquo;s rather hard to
&lt;em&gt;not&lt;/em&gt; be obsessed with genetic variation, for it&amp;rsquo;s the ultimate source of the
two most striking features of life on earth: the mind-boggling diversity of
species, and adaptations so utterly clever they look as though they were
assembled by a designer. Both life&amp;rsquo;s dizzying diversity and cunning adaptations
are the result of evolutionary processes like natural selection and numerous
historical accidents, overlaid on one another like brush strokes on canvas, to
give us the snapshot of life on earth today.&lt;/p&gt;
&lt;p&gt;Population geneticists&amp;rsquo; obsession with genetic variation is a result of the way
we look at evolution: evolution as the change in the genetic composition of a
population through time. By carefully studying genetic variability, we hope we
have some chance of figuring out &lt;em&gt;which&lt;/em&gt; evolutionary processes played out in
the past. Some biologists dismiss this view as &amp;ldquo;beanbag genetics&amp;rdquo;, since
population geneticists like to reduce a population down to the simplest
components: the frequencies of various genetic variants, or alleles, in that
population. While surely this is an oversimplified view, the upside is that it
is particularly amenable to mathematical treatment. Indeed, evolution
often occurs so slowly that to understand it we need to use mathematics to
figure out what a population looked like long before we were born, or what it
will look like long after we&amp;rsquo;re dead.&lt;/p&gt;
&lt;p&gt;Our field has constructed a rich mathematical theory of evolution over the last
one hundred years, but it was only a half-century ago that we got our first actual
estimates of the genetic variation in a fruitfly species named &lt;em&gt;Drosophila
pseudoobscura&lt;/em&gt;, from the work of Richard Lewontin and Jack Hubby. After a
glimpse of the data, one theory of evolution was slain, and its rival theory
was in disarray. Within a few years, evolutionary biologists were measuring
genetic variation in all kinds of species during the &amp;ldquo;find &amp;rsquo;em and grind &amp;rsquo;em&amp;rdquo;
era, named for the unceremonious way numerous flies, crabs, plants, and
individuals from other species met their fate in the quest to measure
variability.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/lewontin_book.jpg&#34; width=&#34;260&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  Richard Lewontin&#39;s seminal 1974 book, &lt;em&gt;The Genetic Basis of Evolutionary
  Change&lt;/em&gt;. Source: &lt;a href=&#34;https://upload.wikimedia.org/wikipedia/en/e/e2/The_Genetic_Basis_of_Evolutionary_Change.jpg&#34;&gt;Wikipedia&lt;/a&gt;.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;With new data on levels of genetic variability &lt;em&gt;across&lt;/em&gt; species, another theory
was soon on the chopping block. This theory was the &lt;em&gt;neutral theory&lt;/em&gt;, and Dick
Lewontin&amp;rsquo;s 1974 synthesis of the find &amp;rsquo;em and grind &amp;rsquo;em era pointed out a
seeming contradiction between this theory and the new estimates of genetic
variability. He named this &amp;ldquo;The Paradox of Variation&amp;rdquo;, and it&amp;rsquo;s an enduring
riddle in my field of evolutionary genetics. My &lt;a href=&#34;https://elifesciences.org/articles/67509&#34;&gt;recent paper published in
eLife&lt;/a&gt; attempts to make some progress
on this longstanding paradox. I&amp;rsquo;ve written this blog post to introduce a
general audience to these topics and why we study them, and then give an
overview of my paper (you may wish to skip ahead and start &lt;a
href=&#34;#recent-work&#34;&gt;there&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;Lewontin and Hubby&amp;rsquo;s Quest to Measure Genetic Variability&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lewontin-and-hubbys-quest-to-measure-genetic-variability&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lewontin-and-hubbys-quest-to-measure-genetic-variability&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before Lewontin and Hubby&amp;rsquo;s work measuring genetic variability in &lt;em&gt;Drosophila
pseudoobscura&lt;/em&gt;, evolutionary biologists were uncertain of how high or low the
genetic variability was within populations. As I write this, I am surrounded by
two examples of the extremes of this genetic variability spectrum: my &lt;em&gt;Monstera
deliciosa&lt;/em&gt; houseplant (also known as the Swiss cheese plant for its deeply
fenestrated leaves) and the &lt;em&gt;Drosophila&lt;/em&gt; fruit flies that have invaded the
kitchen compost bin I&amp;rsquo;ve procrastinated emptying. While it&amp;rsquo;s incorrect to think
of the monstera plants in people&amp;rsquo;s homes as a &amp;ldquo;population&amp;rdquo; since they do not
interbreed with one another, they serve as a good example of a group of organisms
with low genetic diversity. This is because most monstera houseplants are
clones of one another, created by taking a clipping of one plant, letting it
root and grow, then taking a clipping of this plant, and so on. Unlike parent
and offspring, which are separated by a generation, all monstera plants are
identical siblings; perhaps some mutations lead to small amounts of genetic
variability, but this genetic variability is minuscule compared to what&amp;rsquo;s
created by free mating in a sexually-reproducing species.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/hubby_lewontin.png&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
    The gel electrophoresis of the esterase-5 gene in &lt;em&gt;Drosophila pseudoobscura&lt;/em&gt;.
    Each bar is where a sample&#39;s protein ended up after being drawn through a thick
    gel with electrical current. The different bars represent different samples,
    and their varying positions reflect the different protein variants, which due to
    their different shapes and charges, move through the gel at different speeds.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;By contrast, my compost bin population of &lt;em&gt;Drosophila melanogaster&lt;/em&gt; harbors
massive amounts of genetic diversity. Part of the reason why &lt;em&gt;Drosophila
melanogaster&lt;/em&gt; have such high genetic variability is because their population
sizes are so large. &lt;em&gt;Drosophila melanogaster&lt;/em&gt; originated from equatorial
Africa, yet followed humans around the globe living off our trash heaps and
spoiled fruit. One way to measure genetic diversity is to take two random
chromosomes and count the differences between their DNA sequences, then repeat
this for another two random chromosomes, and so forth. Population geneticists
call this &amp;ldquo;pairwise&amp;rdquo; measure of genetic variability the Greek letter $\pi$ (I
imagine much to the disappointment of mathematicians). For &lt;em&gt;Drosophila&lt;/em&gt;,
$\pi_\text{flies} \approx 1%$, or one difference per 100 DNA basepairs, and
for humans, $\pi_\text{humans} \approx 0.1%$, or one difference per 1,000
basepairs.  Humans have much lower genetic diversity than fruitflies.
Lewontin and Hubby explain the drastic importance of this simple quantity:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;In a sense, a description of the genetic variation in a population is the
fundamental datum of evolutionary studies; and it is necessary to explain the
origin and maintenance of this variation and to predict its evolutionary
consequences. [&amp;hellip;]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thus, one end goal of measuring variability within and across species is to
answer the fundamental question of evolutionary genetics: what evolutionary
processes are compatible with the observed levels of genetic variation?&lt;/p&gt;
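&lt;p&gt;As a brief aside, the pairwise measure $\pi$ described above can be written
down explicitly. This is a sketch under assumed notation that does not appear
in the original text: $n$ sampled chromosomes, each $L$ basepairs long, with
$d_{ij}$ the number of sites at which chromosomes $i$ and $j$ differ. Averaging
over all $n(n-1)/2$ pairs gives:&lt;/p&gt;
\[
\pi = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{d_{ij}}{L}
\]
&lt;p&gt;For example, if two chromosomes differ at 10 sites over a stretch of 1,000
basepairs, that pair contributes $10/1000 = 0.01$, or one difference per 100
basepairs, which is on the order of the fruit fly value quoted above.&lt;/p&gt;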
&lt;h2&gt;The Evolutionary Theories Killed by Measuring Genetic Variation&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-evolutionary-theories-killed-by-measuring-genetic-variation&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-evolutionary-theories-killed-by-measuring-genetic-variation&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;aside&gt;
&lt;sup&gt;1&lt;/sup&gt; This is how Lewontin himself sets up the problem in his
book &lt;em&gt;The Genetic Basis of Evolutionary Change&lt;/em&gt;.
&lt;br/&gt;
&lt;sup&gt;2&lt;/sup&gt;
If this view sounds almost vaguely political, you&amp;rsquo;re in good company. Lewontin,
who never missed an opportunity to place scientific theories in
their historical context, writes:
&amp;ldquo;A basis for racism may also flow from the concept of a wild type, since if
there is genetic type of the species, those who fail to correspond to it must
be less perfect. Platonic notions of type are likely to intrude themselves from
one domain to another, and Dobzhansky (1955) was clearly conscious of this
problem when he attacked the concept of a wild type.&amp;rdquo;
&lt;br/&gt;
&lt;sup&gt;3&lt;/sup&gt; For example, one observation at odds with the balance
theory was that the observed rate of inbreeding depression was far too slow
for most fitness variation to be maintained by overdominance.
&lt;/aside&gt;

&lt;p&gt;Before I introduce Lewontin&amp;rsquo;s Paradox of Variation and the neutral theory, it&amp;rsquo;s
worthwhile to set the stage with the theories of genetic variability that
preceded it.&lt;sup&gt;1&lt;/sup&gt; At the time of Lewontin and Hubby&amp;rsquo;s work, proponents
of the &lt;em&gt;classical theory&lt;/em&gt; believed there wasn&amp;rsquo;t much genetic variation between
individuals.  This was because selection was thought to be extremely powerful:
a new mutation that increased survival or the number of offspring quickly
replaced all alternatives. The genetic composition of a population was more or
less uniform, as it moved from one perfectly adapted state to another under the
steady marching orders of natural selection. What variation did exist came
primarily from harmful mutations that &amp;lsquo;broke&amp;rsquo; organisms from their
archetypal &amp;ldquo;wild-type&amp;rdquo;.&lt;sup&gt;2&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;By measuring genetic variability directly, Lewontin and Hubby showed it
was far too abundant for the classical theory to be correct. The
alternative view, the &lt;em&gt;balance theory&lt;/em&gt;, held that natural selection maintained
vast amounts of genetic variation through a variety of mechanisms; aspects of
this theory were successfully challenged through other experiments and
arguments.&lt;sup&gt;3&lt;/sup&gt; How evolutionary processes, whether random change,
natural selection, or historical accidents (e.g. a population bottleneck), combine
to determine the central quantity of evolutionary genetics, genetic variation,
remained a mystery.&lt;/p&gt;
&lt;h2&gt;The Neutral Theory&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-neutral-theory&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-neutral-theory&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This brings us to the architect of the third view of genetic variation: Motoo
Kimura. Kimura&amp;rsquo;s view was that the majority of genetic variation was
&lt;em&gt;selectively neutral&lt;/em&gt;: it did not impact the survival or number of offspring
individuals had. Free from the marching orders of selection, such neutral
variation lingered in populations, drifting up or down in frequency due to the
vagaries of which chromosomes were passed down to offspring (Mendelian
segregation), and random differences in survival and offspring number that
didn&amp;rsquo;t have a genetic cause. We call this process &lt;em&gt;genetic drift&lt;/em&gt;. Kimura&amp;rsquo;s
neutral theory was in some ways an extension of Sewall Wright&amp;rsquo;s view that
evolution was governed as much, or more, by random chance as it was by natural
selection. While here we&amp;rsquo;re focused on Kimura&amp;rsquo;s work on neutral polymorphism
within a population, much of Kimura&amp;rsquo;s motivation for his neutral theory was to
explain observations in molecular evolution, such as the steady clock-like
ticking of amino acid differences between species.&lt;/p&gt;
&lt;p&gt;Suddenly under neutral theory, the mystery of why there was so much genetic
variation wasn&amp;rsquo;t such a mystery. If the variation was neutral, natural
selection couldn&amp;rsquo;t act on it. Surely, some beneficial mutations would enter the
population &amp;mdash;adaptations of course had a genetic basis&amp;mdash; but these mutations
were rare, and they would quickly replace their alternative and didn&amp;rsquo;t persist
for long in the population. Lewontin called neutral theory the &amp;ldquo;neoclassical
theory&amp;rdquo; because of this similarity to the classical theory.&lt;/p&gt;
&lt;p&gt;Setting Goddess Tyche in charge of evolutionary change has another benefit.  If
we can ignore  natural selection, and thus all the uncertainty about the
&lt;em&gt;strength&lt;/em&gt; and &lt;em&gt;direction&lt;/em&gt; of natural selection, the mathematics of evolution
become much simpler. Under a model where $N$ individuals freely mate, new
variation is created through mutation at rate $\mu$ per generation per basepair
(i.e. there&amp;rsquo;s a $100 \times \mu$% chance that any one basepair mutates in a
given generation), and new variation is lost through random drift at rate
proportional to $1/N$ (i.e. slower in large populations, faster in small
populations). Ultimately, like the level of water in a dam, a balance is met
between new mutations entering and old mutations going extinct in the
population: this is the equilibrium genetic diversity. A bit of mathematics
says that under this toy model, equilibrium genetic diversity should be $\pi
\approx 4N\mu$. In a big population (think of all the fruitflies in the world),
there should be high genetic variability. If population sizes are small (think
of a small island), soon everyone becomes everyone else&amp;rsquo;s cousin, and genetic
diversity is lost.&lt;/p&gt;
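&lt;p&gt;One hedged way to see where the $4N\mu$ comes from is a standard coalescent
argument. It is not spelled out above, and it assumes a diploid population (so
$2N$ chromosome copies) of constant size with random mating. Two randomly chosen
chromosomes trace back to a common ancestor about $2N$ generations ago on
average, and mutations accumulate along both lineages at rate $\mu$ per basepair
per generation, so the expected number of differences per basepair is roughly:&lt;/p&gt;
\[
\pi \approx 2\mu \times 2N = 4N\mu
\]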
&lt;h2&gt;Lewontin&amp;rsquo;s Paradox of Variation&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lewontins-paradox-of-variation&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lewontins-paradox-of-variation&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;So while neutral theory is mathematically alluring, and not at odds with the
high level of genetic variability found in various species, Lewontin noticed a
critical problem. Under Kimura&amp;rsquo;s neutral theory, the heterozygosity at a locus
should depend on the product of mutation rate and population size, $N \mu$,
such that:&lt;sup&gt;4&lt;/sup&gt;&lt;/p&gt;
\[
\pi = \frac{4N\mu}{1 &amp;#43; 4 N \mu}
\]&lt;aside&gt;
&lt;sup&gt;4&lt;/sup&gt; This looks a bit different than the equation $\pi \approx 4N\mu$ above
but that&amp;rsquo;s because $4N\mu$ is usually small, so $1 + 4N\mu \approx 1$.
&lt;/aside&gt;

&lt;p&gt;However, the range of genetic variability across species was surprisingly
narrow. Lewontin lays out why this is a problem visually in his book:&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/lewontin_1974.jpg&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  A figure from Lewontin&#39;s 1974 book &lt;em&gt;The Genetic Basis of Evolutionary Change&lt;/em&gt;
  on the &#34;Paradox of Variation&#34; (p. 209). The curved line is the hypothetical
  heterozygosity (another term for genetic variability) given by \(\pi = 4N\mu /
  (1 + 4N\mu)\). The variation observed in early studies ranges
  from \(\pi \approx 6\%\) to \(\pi \approx 20\%\), suggesting the product \(N\mu\)
  is between 0.015 and 0.06. For a gene mutation rate of \(\mu = 10^{-6}\), this
  suggests that \(N\) ranges from 15,000 to 60,000 --- a minuscule factor of
  four difference.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;As Lewontin explains,&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;The observed range of heterozygosities over all the species [&amp;hellip;] lies in the
sensitive region, between 0.056 and 0.184. This range corresponds to values
of $N\mu$ between 0.015 and 0.057. Since there is no reason to suppose that
mutation rate has been specially adjusted in evolution to be the reciprocal
of population size for higher organisms, we are required to believe that
higher organisms including man, mouse, and Drosophila and the horseshoe crab
all have population sizes within a factor of 4 of each other. [&amp;hellip;] &lt;strong&gt;The
patent absurdity of such a proposition is strong evidence against a
neutralist explanation of observed heterozygosity&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
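&lt;p&gt;To make the arithmetic behind these numbers explicit, the heterozygosity
equation above can be inverted to give $N\mu$ from an observed $\pi$. This is a
worked sketch that uses only the two endpoint values in the quote:&lt;/p&gt;
\[
4N\mu = \frac{\pi}{1 - \pi}
\]
\[
\pi = 0.056: \; N\mu = \frac{1}{4} \cdot \frac{0.056}{0.944} \approx 0.015,
\qquad
\pi = 0.184: \; N\mu = \frac{1}{4} \cdot \frac{0.184}{0.816} \approx 0.056
\]
&lt;p&gt;The second value lands close to the 0.057 quoted by Lewontin, and the ratio
of the two is a little under four, which is where the &amp;ldquo;factor of 4&amp;rdquo; comes from.&lt;/p&gt;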
&lt;p&gt;While Lewontin&amp;rsquo;s estimated range of heterozygosities only included 13 species
(Table 22, p. 117), this problem wasn&amp;rsquo;t just an artifact of small sample size.
Throughout the 1970s, estimating protein heterozygosities with gel
electrophoresis became a cottage industry. In the decade after Lewontin&amp;rsquo;s book,
protein variability estimates were published for over a thousand species,
confirming the narrow range of genetic variability found by Lewontin. Nevo
and colleagues published a survey of estimates of protein heterozygosities for
1,111 different species, finding that heterozygosity ranges only from 0% to
30%:&lt;sup&gt;5&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;5&lt;/sup&gt;
I first heard of this paper in
&lt;a href=&#34;https://youtu.be/y0VjObP1lBA?t=1648&#34;&gt;this nice lecture by Monty Slatkin&lt;/a&gt;
on the history of population genetics. Midway through, Monty shows the
heterozygosity figure from Nevo et al. (1984) to introduce Lewontin&amp;rsquo;s
Paradox. This was the first time I&amp;rsquo;d heard of the Paradox of Variation.
&lt;/aside&gt;

&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/nevo.png&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  Figure 2c from Nevo et al. &lt;em&gt;The Evolutionary Significance of Genetic
  Diversity: Ecological, Demographic and Life History Correlates&lt;/em&gt; (1984).
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;So Lewontin&amp;rsquo;s Paradox of Variation seemed to be a rather serious problem for
Kimura&amp;rsquo;s neutral theory, and soon folks were looking for other evolutionary
processes that were consistent with the observed data.&lt;/p&gt;
&lt;h2&gt;Lewontin&amp;rsquo;s Paradox and the Hitchhiking Model&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lewontins-paradox-and-the-hitchhiking-model&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lewontins-paradox-and-the-hitchhiking-model&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Lewontin wrote his book while on sabbatical in John Maynard Smith&amp;rsquo;s lab at the
University of Sussex. The neutral theory was under attack from multiple angles,
but especially so in Britain, where a culture that treasured naturalism and
adaptive storytelling fostered the conviction that &lt;em&gt;every&lt;/em&gt; genetic variant must
have &lt;em&gt;some&lt;/em&gt; adaptive benefit.&lt;sup&gt;6&lt;/sup&gt; As Maynard Smith described,&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;6&lt;/sup&gt; Marek Kohn&amp;rsquo;s book &lt;a href=&#34;https://www.abebooks.com/Reason-Everything-Marek-Kohn-Faber/30463535450/bd&#34;&gt;&lt;em&gt;A Reason for
Everything&lt;/em&gt;&lt;/a&gt;

is a lovely history of evolutionary biology and the British tradition of
naturalism.
&lt;/aside&gt;

&lt;blockquote&gt;
  &lt;p&gt;The whole tradition of British population biology had
been, if you find a genetic variability, it must have some kind of selective
explanation, and if at first you don&amp;rsquo;t find it, you must try, try, and try
again until you do.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Perhaps in the spirit of this tradition, Maynard Smith and John Haigh tried to
find a selective explanation for the observed narrow range of genetic
diversity. In 1974, they published their masterwork &lt;em&gt;The hitch-hiking effect of
a favourable gene&lt;/em&gt;, which develops a mathematical model of how a selected
variant can decrease its surrounding genetic variability. When a new strongly
selected mutation occurs, it does so on a random chromosome in the population.
As this mutation increases in frequency, either by improving an individual&amp;rsquo;s
odds of survival or number of offspring, it drags along its neighboring
stretches of DNA as it spreads within the population. To understand why this
is, it may be helpful to think of our DNA as a bit like a spool of 8mm film, and
to consider our genes as random frames of the motion picture.  Most
neighboring frames during a scene of the film end up together in the final
production, much like if I pass on a genetic variant I inherited from my
father, there&amp;rsquo;s a good chance that variant&amp;rsquo;s neighbor would also end up in my
child.  However, occasionally the film editor comes along and cuts the scene
short and splices in another scene.  So too can a cell&amp;rsquo;s recombination
machinery cut the stretch of my father&amp;rsquo;s chromosome I&amp;rsquo;m passing on and splice
in my mother&amp;rsquo;s; suddenly her genetic variants and their neighbors go on to the
next generation.  Our genomes, after all, are exactly half of each of our parents&amp;rsquo;
genomes, but a random mosaic of our four grandparents&amp;rsquo; genomes.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/film-splice.png&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  A Griswold Junior Film Splicer splicing together two bits of 16mm film (this still is from 
  &lt;a href=&#34;https://www.youtube.com/watch?v=x7vrs67YDZI&amp;t=281s&#34;&gt;this short film&lt;/a&gt;).
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This is &lt;em&gt;recombination&lt;/em&gt;, and it&amp;rsquo;s a vitally important part of the evolutionary
process. In particular, how often recombination occurs &amp;mdash;how often the scenes
of a film are cut and spliced to different scenes&amp;mdash; varies across organisms
and along the chromosome itself. In regions of high recombination, beneficial
variants are quickly spliced off and away from their neighbors, allowing the
surrounding genetic variation to persist in the population undisturbed. By
contrast, in low recombination regions, a beneficial variant won&amp;rsquo;t be spliced
away from even its furthest neighbors for quite some time, and vast
neighborhoods of genetic diversity are wiped out whenever a beneficial mutation
comes along and takes over the population. It is precisely this process that
Maynard Smith and Haigh thought could be reducing genetic variability along the
genome, and explain the narrow range of genetic diversity seen across species.
As they say,&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;[This investigation] can therefore be regarded as a last ditch attempt to
save the neutral mutation theory by showing that there is another process
which can account for the uniformity of [heterozygosity] between species.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Nearly fifty years of population genetics has indeed shown that hitchhiking can
strongly impact genetic diversity within a species. The textbook example is
from work a member of my dissertation committee, Dave Begun, did as a PhD
student with Chip Aquadro. They showed that there was a striking correlation
between recombination rates and genetic diversity in &lt;em&gt;Drosophila melanogaster&lt;/em&gt;,
which is precisely what one would expect under pervasive hitchhiking. Genetic
diversity is highest in high recombination regions, and lowest in low
recombination regions across the &lt;em&gt;Drosophila melanogaster&lt;/em&gt; genome:&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/begun_aquadro.png&#34; alt=&#34;Correlation between diversity and recombination in Drosophila&#34; /&gt;
&lt;figcaption&gt;
The correlation between pairwise diversity (a measure of genetic variability)
and recombination rate in &lt;em&gt;Drosophila melanogaster&lt;/em&gt;, from &lt;a href=&#34;https://www.nature.com/articles/356519a0&#34;&gt;Begun and Aquadro (1992)&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Later, theoretical work showed that a similar reduction in diversity can happen
if new mutations are &lt;em&gt;selected against&lt;/em&gt;, rather than &lt;em&gt;selected for&lt;/em&gt;. This
process is called &lt;em&gt;background selection&lt;/em&gt;, as it&amp;rsquo;s continually occurring in
the background of the genome. Population geneticists are still debating and
developing new ways to estimate the relative strengths of hitchhiking and
background selection, which collectively we call &lt;em&gt;linked selection&lt;/em&gt;. While
hitchhiking and background selection can reduce genetic diversity, and have
been shown to do so in many species, the central question remained
unanswered: are these selection processes strong enough to constrain genetic
diversity across species to the narrow range observed? Or is there some other
explanation?&lt;/p&gt;
&lt;h2&gt;The Neutralist View on Lewontin&amp;rsquo;s Paradox&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-neutralist-view-on-lewontins-paradox&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-neutralist-view-on-lewontins-paradox&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Kimura and others had a simpler explanation for the narrow range of diversity.
First, we have known since Sewall Wright that what is relevant to the rate of
genetic drift, and thus the level of genetic variability, isn&amp;rsquo;t a species&amp;rsquo;
population &lt;em&gt;census size&lt;/em&gt; (i.e. the total number of individuals in the
population), but rather its &lt;em&gt;effective population size&lt;/em&gt;.  Effective population
size is a bit of an amorphous concept in population genetics, but it&amp;rsquo;s
fundamentally a way to account for the complexities of real populations of
breeding individuals. For many populations, the idealized random mating we
assume in our mathematical models is overly simplistic and ignores the
realities of ecology and demography. Fortunately, we have found that in the
vast majority of cases, we can account for the complexities of real populations
by simply rescaling the population size, $N$, to a new &amp;ldquo;effective population
size&amp;rdquo;, $N_e$.&lt;/p&gt;
&lt;p&gt;Think of it this way: imagine 500 male and 500 female crabs on an island.
These crabs freely and randomly mate. Still, it being an island, after a few
generations, every crab becomes every other crab&amp;rsquo;s cousin &amp;mdash; remember, the
rate at which this happens is proportional to $1/N$. However, now imagine that
a ruthless male despot crab comes along and battles the other males, and
forbids them from mating. While there are still $N=1,000$ individuals on this
island, within a generation &lt;em&gt;every crab&lt;/em&gt; is the despot crab&amp;rsquo;s kin. The rate at
which every crab on the island becomes a cousin is now much faster. Although
this situation seems quite different to the free crab love island, it only
requires adjusting $N$; specifically, we use an effective population size
$N_e$. In this case, it works out that the adjusted $N_e \approx
4$.&lt;sup&gt;7&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;7&lt;/sup&gt;
The equation for effective population size for different numbers of males and females
is $N_e = 4N_m N_f / (N_m + N_f)$ where $N_f$ and $N_m$ are the numbers of females and
males respectively.
&lt;/aside&gt;
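&lt;p&gt;As a quick check of the crab example, the sex-ratio formula in footnote 7 can
be evaluated directly, assuming the despot is the only breeding male ($N_m = 1$)
and all $N_f = 500$ females breed:&lt;/p&gt;
\[
N_e = \frac{4 \times 1 \times 500}{1 + 500} = \frac{2000}{501} \approx 3.99
\]
&lt;p&gt;With free mating ($N_m = N_f = 500$) the same formula returns
$4 \times 500 \times 500 / 1000 = 1,000$, the full census size.&lt;/p&gt;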

&lt;p&gt;Importantly, effective population size also depends on population bottlenecks.
Suppose that, instead of the male despot crab, the island goes 99 generations
undisturbed, but then suffers a severe generation-long bottleneck as seagulls
descend on the island. If only 2 crabs survive, the effective population size is
reduced from $N=1,000$ to $N_e \approx 170$. Since it&amp;rsquo;s effective population sizes, and not
census sizes, that determine levels of genetic variability, one can see how
frequent population bottlenecks could explain Lewontin&amp;rsquo;s Paradox. The
evolutionary history of many species also includes range expansions and
colonization and extinctions &amp;mdash; when a few individuals make it to a new patch of
habitable land, procreate, but ultimately may die. These dynamics also decrease
effective population size and capture the messy reality of many species.&lt;/p&gt;
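&lt;p&gt;The text does not say how the bottleneck figure is obtained. One standard
approximation, assumed here, is that the effective size over $t$ generations is
the harmonic mean of the per-generation sizes $N_1, \dots, N_t$. For 99
generations at 1,000 crabs followed by one generation at 2 crabs:&lt;/p&gt;
\[
N_e \approx \frac{t}{\sum_{i=1}^{t} 1/N_i}
= \frac{100}{99/1000 + 1/2}
= \frac{100}{0.599} \approx 167
\]
&lt;p&gt;This is in line with the value of roughly 170 quoted above, and it shows how
a single bad generation dominates the long-run average.&lt;/p&gt;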
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/hewitt_iceage.png&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
&lt;p&gt;The earth looked very different over the last 20,000 years than it does today.
Numerous ice sheets formed and forced species into refugia, where they may
have diverged into different species or subspecies (for example, the &lt;a
href=&#34;https://en.wikipedia.org/wiki/Hooded_crow&#34;&gt;hooded&lt;/a&gt; and &lt;a
href=&#34;https://en.wikipedia.org/wiki/Carrion_crow&#34;&gt;carrion&lt;/a&gt; crows). This
figure is from &lt;a href=&#34;https://www.nature.com/articles/35016000&#34;&gt;Hewitt
(2000)&lt;/a&gt;, &lt;em&gt;The genetic legacy of the Quaternary ice ages&lt;/em&gt;.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Kimura argued that the only &amp;ldquo;serious problem that remains to be explained&amp;rdquo; with
Lewontin&amp;rsquo;s Paradox was that observed heterozygosities never exceeded 30%.
Considering &lt;em&gt;Drosophila&lt;/em&gt; species, which were at the upper end of the population
size and heterozygosity ranges, Kimura argued that their numbers were severely
depressed by the last continental glaciation. Genetic variability takes a while
to recover from a severe bottleneck, and longer for larger populations (as new
mutations must escape rarity and work up to intermediate frequencies), so
severe, lasting bottlenecks during glaciations are the primary neutralist
hypothesis for Lewontin&amp;rsquo;s Paradox.&lt;/p&gt;
&lt;h2&gt;Recent Work on Lewontin&amp;rsquo;s Paradox&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;recent-work-on-lewontins-paradox&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#recent-work-on-lewontins-paradox&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;span id=&#34;recent-work&#34;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Technological improvements in genome sequencing revolutionized population
genomics, and have provided evidence that diversity within species is shaped by
a mix of past demographic events, such as bottlenecks, as well as by natural
selection. Lewontin&amp;rsquo;s Paradox, however, remains unresolved through the
genomics era. There are a few nice pieces of recent work that set up some
context for my recent paper. The first is &lt;a href=&#34;https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001388&#34;&gt;Ellen Leffler et al.
(2012)&lt;/a&gt;,
which compiled estimates of diversity from genomic data. This was one of the
first papers I read as I grew interested in population genomics, and certainly
inspired me to continue in the field. In this paper, Leffler and colleagues
survey diversity for 167 species, and find an 800-fold difference between the
highest-diversity species (the sea squirt &lt;em&gt;Ciona savignyi&lt;/em&gt;) and the
lowest-diversity species (the wild cat, &lt;em&gt;Lynx lynx&lt;/em&gt;). Overall, they point out
that the narrow range of diversity is still an open and important mystery.&lt;/p&gt;
&lt;p&gt;The second paper is &lt;a href=&#34;https://www.nature.com/articles/nature13685&#34;&gt;J. Romiguier et al.
(2014)&lt;/a&gt;
, which surveyed diversity
and life-history characteristics across 72 species. These authors find that
diversity levels across species are highly correlated with ecological strategy.
In particular they find diversity is highest when species have lots of
offspring they invest little in, and lower when species have few offspring they
invest more in.  Intriguingly, these results suggest ecological processes are
predictive of genetic diversity across species.  While these ecological
correlates are an important piece of the puzzle, it&amp;rsquo;s still uncertain by what
mechanism such ecological processes would act to constrain genetic
diversity.&lt;/p&gt;
&lt;p&gt;More recently, &lt;a href=&#34;https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002112&#34;&gt;Russ Corbett-Detig, et al.
(2015)&lt;/a&gt;

tested whether linked selection could explain Lewontin&amp;rsquo;s
Paradox. Since population sizes are very difficult to estimate for many
species, Russ and colleagues considered how the strength of linked selection
varies with two proxies of population size: range size and body size.
Large-bodied animals (e.g. whales) have smaller population densities than
small-bodied animals (e.g. mice), in part because the energy requirements
needed to sustain life are much higher in larger-bodied animals.
Similarly, we&amp;rsquo;d expect species with larger ranges to have larger population
sizes than species with small ranges. Then, the authors fit a statistical model
to estimate the strength of linked selection across population genomic samples
for 40 species. As a former bioinformatician, I should point out this is a
monumental task worthy of praise. Interestingly, they find that the strength
of linked selection does seem to scale with these proxies of population size.
I read this paper in my first year of graduate school and discussions about
it with my PhD advisor Graham Coop were foundational to my early interest in
Lewontin&amp;rsquo;s Paradox.&lt;/p&gt;
&lt;p&gt;Graham ultimately found a limitation of Corbett-Detig et al. (2015), which he
shared on bioRxiv (&lt;a href=&#34;https://www.biorxiv.org/content/10.1101/042598v1&#34;&gt;Coop
2016&lt;/a&gt;
). While Corbett-Detig
et al. do indeed find that linked selection is &lt;em&gt;stronger&lt;/em&gt; in species with large
population sizes, it still may not be strong &lt;em&gt;enough&lt;/em&gt; to explain why the range
of genetic diversity is so narrow. While linked selection was reducing
diversity some 60%-80% in species with large population sizes, this isn&amp;rsquo;t
enough to explain why there&amp;rsquo;s only an 800-fold difference between fruit flies
and lynx, when there is likely well upwards of a &lt;em&gt;million&lt;/em&gt;-fold difference in their
population sizes. The mystery remains open, and at the end of Graham&amp;rsquo;s article
it seemed a bit more likely that repeated bottlenecks and other non-selective
processes could also be reducing genetic diversity to the narrow range we
observe.&lt;/p&gt;
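&lt;p&gt;To make the size of that gap concrete, here is a back-of-the-envelope version of the argument. It uses only the round numbers quoted above, plus the simplifying assumption that neutral diversity scales linearly with population size. A million-fold difference in population size set against an 800-fold difference in diversity leaves a shortfall of roughly&lt;/p&gt;
&lt;p&gt;\[ \frac{10^{6}}{800} = 1250 \text{-fold}, \]&lt;/p&gt;
&lt;p&gt;whereas a 60% to 80% reduction in diversity only amounts to a&lt;/p&gt;
&lt;p&gt;\[ \frac{1}{1 - 0.6} = 2.5 \quad \text{to} \quad \frac{1}{1 - 0.8} = 5 \text{-fold} \]&lt;/p&gt;
&lt;p&gt;decrease. On these rough numbers, linked selection of that strength closes only a small part of a gap of about three orders of magnitude.&lt;/p&gt;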
&lt;h2&gt;Estimating Census Population Sizes&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;estimating-census-population-sizes&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#estimating-census-population-sizes&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;aside&gt;
&lt;sup&gt;8&lt;/sup&gt; I should point out here that my paper is entitled &lt;em&gt;Quantifying the
relationship between genetic diversity and population size suggests natural
selection cannot explain Lewontin’s Paradox&lt;/em&gt;. eLife has some restrictions on
titles that forced me to change it from the original &lt;a href=&#34;https://www.biorxiv.org/content/10.1101/2021.02.03.429633v1&#34;&gt;bioRxiv&lt;/a&gt;
 title (I think the only
clever title I&amp;rsquo;ve ever come up with): &lt;em&gt;Why do species get a thin slice of π?
Revisiting Lewontin’s Paradox of Variation&lt;/em&gt;.
&lt;/aside&gt;

&lt;p&gt;This brings us to &lt;a href=&#34;https://elifesciences.org/articles/67509&#34;&gt;my recent paper in
eLife&lt;/a&gt;
&lt;sup&gt;8&lt;/sup&gt; that hopefully
adds a few more pieces to this longstanding puzzle. My paper tries to chip away
at this problem from a few angles, which I&amp;rsquo;ll detail below. I also tried to
include a quick review of the history and relevant literature in my article.
Lewontin&amp;rsquo;s Paradox touches on lots of interesting, and often missed, parts of
evolutionary genetics.&lt;/p&gt;
&lt;p&gt;The first component of my paper was to try to actually estimate the population
sizes across various animal species, and quantify the relationship between
genetic diversity and population size. Earlier work by Soulé (1976), Nei and
Graur (1984), and Frankham (1996) looked at this relationship using
early protein heterozygosity data:&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/frankham_soule_nei_graur.png&#34; width=&#34;660&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  (A) Frankham (1996) using the data of Soulé (1976), and (B) Nei and Graur
  (1984). The solid line is neutral expectation if the mutation rate is \(\mu = 10^{-7}\).
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;These papers used population size estimates from the literature, or
back-of-the-envelope calculations. To get a modern look at the relationship
between population size and the genomic estimates of genetic diversity from
previous studies (Leffler et al. 2012, Romiguier et al. 2014, and Corbett-Detig
et al. 2015), I wanted to
find a way to roughly approximate census population sizes for hundreds of
species in an automated way. One approach is to take the product of population
density (i.e. how many individuals there are per square kilometer) and species
range size (i.e. how large an area a species occupies), to get a very crude estimate of
population size. The downside of this approach, though, is that population
densities are unknown for most species too.&lt;/p&gt;
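&lt;p&gt;As a rough illustration of that product, write \(D\) for population density (individuals per square kilometer) and \(A\) for range area (square kilometers). Both symbols and the numbers below are placeholders of mine, not values from the paper:&lt;/p&gt;
&lt;p&gt;\[ N_c \approx D \times A, \qquad \text{e.g.} \quad 10 \ \text{km}^{-2} \times 10^{5} \ \text{km}^{2} = 10^{6} \ \text{individuals}. \]&lt;/p&gt;
&lt;p&gt;Errors of a factor of a few in either term matter little when the census sizes being compared span many orders of magnitude, which is why an estimate this crude can still be useful on a log scale.&lt;/p&gt;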
&lt;p&gt;Luckily, the field of macroecology provides a way out. As mentioned, animals
have energetic needs that scale with their body sizes. Large-bodied animals
require more energy, which they procure by hunting or grazing over a certain area;
competition between individuals for such resources means that large-bodied
animals can only live at lower densities. The striking result is that
population densities are well correlated with body mass. Damuth
(1981, 1987) quantified this; here is a figure from my paper (&lt;a href=&#34;https://elifesciences.org/articles/67509#fig1s1&#34;&gt;Figure 1&amp;ndash;figure
supplement 1&lt;/a&gt;
):&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/damuth.png&#34; width=&#34;560&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  The relationship between population density and body mass from Damuth&#39;s
  1987 data.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Using this relationship, I predict population density using body mass. In
practice, I use body length, since this is much easier to collect and reported
in a more standard way than body mass. Using a statistical routine, I predict
body mass from body length, and population density from body mass.&lt;/p&gt;
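&lt;p&gt;One way to picture this two-step prediction is as a pair of log-linear (allometric) regressions chained together. The coefficient names below are placeholders of mine, and this is a sketch of the general idea rather than the exact model fit in the paper:&lt;/p&gt;
&lt;p&gt;\[ \log_{10} M = a_0 + a_1 \log_{10} \ell, \qquad \log_{10} D = b_0 + b_1 \log_{10} M, \]&lt;/p&gt;
&lt;p&gt;where \(\ell\) is body length, \(M\) is body mass, and \(D\) is population density. Substituting the first relationship into the second gives&lt;/p&gt;
&lt;p&gt;\[ \log_{10} D = (b_0 + b_1 a_0) + a_1 b_1 \log_{10} \ell. \]&lt;/p&gt;
&lt;p&gt;For a sense of scale only: if mass scaled roughly with the cube of length (\(a_1 \approx 3\)) and density with roughly the \(-3/4\) power of mass (\(b_1 \approx -3/4\), close to the slope Damuth reported for mammals), then a ten-fold increase in body length would lower the predicted density by about \(10^{9/4} \approx 180\)-fold.&lt;/p&gt;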
&lt;p&gt;Next, I need to estimate range size. A common approach is to use occurrence
data &amp;mdash;records of the latitude and longitude of where animals were observed&amp;mdash;
to infer the range. Using the &lt;a href=&#34;https://www.gbif.org/&#34;&gt;Global Biodiversity Information
Facility&lt;/a&gt;
 database, I downloaded occurrence data for the
animal species I had genetic diversity and population density estimates for. I
then wrote some R code to automatically infer the ranges from this occurrence
data.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/apis_mellifera.png&#34; width=&#34;660&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  The GBIF occurrence data (red points) and inferred range (green polygons) of
  the common honeybee (&lt;em&gt;Apis mellifera&lt;/em&gt;). This shows some of the limitations of
  this coarse-grained approach: there is some uncertainty about whether some
  regions have sparse observations and the algorithm is properly filling in the
  range, or whether these are aberrant observations in regions where honeybees
  don&#39;t normally live. Still, for Lewontin&#39;s Paradox, we do not need precise
  estimates, but rather a rough look. Compare the range of the honeybee &lt;a href=&#34;images/apis_cerana.png&#34;&gt;to
  that of&lt;/a&gt; &lt;em&gt;Apis cerana&lt;/em&gt;, which lives only in South Asia.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;With the population densities and ranges estimated, I take their product to get
an approximate population size (see &lt;a href=&#34;https://elifesciences.org/articles/67509#fig1&#34;&gt;Figure
1&lt;/a&gt;
 of the paper for a look at
the distribution of these ranges by phylum).  There&amp;rsquo;s quite a bit of validation
I do to ensure the numerous approximations I&amp;rsquo;ve made here are reasonable. For
example, I make sure my census sizes don&amp;rsquo;t lead to predictions of the total
biomass that are unreasonably small or large (see &lt;a href=&#34;https://elifesciences.org/articles/67509#s4&#34;&gt;&lt;em&gt;Population Size
Validation&lt;/em&gt;&lt;/a&gt;
 and &lt;a href=&#34;https://elifesciences.org/articles/67509#table1&#34;&gt;&lt;em&gt;Table
1&lt;/em&gt;&lt;/a&gt;
), since estimates of the
total biomass on earth of certain animal groups are available thanks to the
study of &lt;a href=&#34;https://www.pnas.org/content/115/25/6506&#34;&gt;Bar-On et al. (2018)&lt;/a&gt;
. I
also do lots of other consistency checks in &lt;a href=&#34;https://elifesciences.org/articles/67509#appendix-3&#34;&gt;&lt;em&gt;Appendix
3&lt;/em&gt;&lt;/a&gt;
.&lt;sup&gt;9&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;9&lt;/sup&gt;
I am especially thankful to reviewer Guy Sella, who suspected my early fruitfly population
sizes were a tad too large; indeed, after lots of sleuthing, I found that
macroecologists like Damuth (1987) apply a correction to animals that do not
regulate their body temperature that I did not apply in earlier versions.
&lt;/aside&gt;

&lt;h2&gt;The Relationship Between Genetic Diversity and Population Size&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-relationship-between-genetic-diversity-and-population-size&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-relationship-between-genetic-diversity-and-population-size&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Next, I merged the approximate population sizes with the genetic diversity data
from Leffler et al. (2012), Romiguier et al. (2014), and Corbett-Detig et al.
(2015). These data allow for a nice visualization of Lewontin&amp;rsquo;s Paradox of
Variation, through the relationship between genetic diversity and population
size:&lt;/p&gt;
&lt;figure id=&#34;figure-2&#34;&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/diversity_popsize_full.png&#34; width=&#34;860&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  The relationship between genetic diversity (\(\pi\)) and approximate population
  size (\(N_c\)) for 172 animal species. Note that genetic diversity varies just
  over three orders of magnitude, while census sizes vary over 12 orders of
  magnitude. The gray ribbon indicates the expected neutral diversity for a
  range of mutation rates (\(10^{-9} \lt \mu \lt 10^{-8}\)) were the diversity to
  be determined entirely by census size under the neutral model. Lewontin&#39;s
  Paradox wouldn&#39;t be a paradox if the diversity estimates fell in this thin
  gray area; instead they do not scale with population size, and are mostly
  constrained to within three orders of magnitude. The eLife article has
  numerous supplementary figures relevant to this figure
  &lt;a href=&#34;https://elifesciences.org/articles/67509#fig2&#34;&gt;here&lt;/a&gt;.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;What this relationship tells us is that population size appears to impact
genetic diversity in the way we&amp;rsquo;d expect (there is higher genetic diversity
in species with larger population sizes &amp;ndash; this is shown by the dashed gray
line of best fit in the figure above), but as Lewontin first pointed out,
genetic diversity doesn&amp;rsquo;t increase as fast as we&amp;rsquo;d expect if solely census
population sizes and neutral evolution determined genetic diversity.&lt;/p&gt;
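&lt;p&gt;The neutral expectation can be sketched with the standard infinite-sites approximation for a diploid population, assuming that the effective population size equals the census size. The ribbon in the figure may be drawn with a more refined model, so treat this as an illustration rather than the exact calculation:&lt;/p&gt;
&lt;p&gt;\[ \pi \approx 4 N_c \mu \quad \Longrightarrow \quad \log_{10} \pi \approx \log_{10}(4\mu) + \log_{10} N_c. \]&lt;/p&gt;
&lt;p&gt;On log-log axes this is a line with slope one, so every ten-fold increase in population size should bring a ten-fold increase in diversity. With \(\mu = 10^{-9}\), for example, \(N_c = 10^{4}\) gives \(\pi \approx 4 \times 10^{-5}\) and \(N_c = 10^{6}\) gives \(\pi \approx 4 \times 10^{-3}\). The linear approximation only holds while \(4 N_c \mu\) is well below one. Observed diversity spans about three orders of magnitude across twelve orders of magnitude in census size, which is far shallower than slope one.&lt;/p&gt;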
&lt;h2&gt;Is the Diversity&amp;ndash;Population-Size Relationship Meaningful?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;is-the-diversitypopulation-size-relationship-meaningful&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#is-the-diversitypopulation-size-relationship-meaningful&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/darwin_i_think.png&#34; width=&#34;560&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  Darwin&#39;s sketch of a phylogenetic tree in his &lt;em&gt;First Notebook on
  Transmutation of Species&lt;/em&gt;, 1837. Source: &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Darwin_Tree_1837.png&#34;&gt;Wikimedia commons&lt;/a&gt;.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We might look at the relationship above and conclude that there is a
significant relationship between population sizes and genetic diversity.
Unfortunately, there&amp;rsquo;s a statistical conundrum that crops up whenever we draw
lines of best fit through data points like species. Species are connected by an
underlying &lt;em&gt;phylogenetic tree&lt;/em&gt;; certain species share common ancestors more
recently than others. This was one of Darwin&amp;rsquo;s brilliant ideas, sketched out in
1837 in his notebook (figure above). In a seminal article in 1985, Joe
Felsenstein pointed out that the underlying species tree creates a statistical
problem whenever we want to make comparisons across species (as I&amp;rsquo;m doing
here), and this statistical issue should be accounted for.&lt;/p&gt;
&lt;p&gt;We can understand this statistical issue, known as phylogenetic
non-independence, with a story. Imagine going to a large family reunion, where
there are two sides of the family: all the descendants of your
great-great-grandfather, and all the descendants of his sister, your
great-great-aunt. You look around at all your family members, and observe that
there seems to be a statistical relationship between having freckles and being
tall. Your tallest relatives tend to have the most freckles, including your
great-great-aunt&amp;rsquo;s husband, who towers over you. Other relatives are more
average height, and do not have freckles. If you were to collect data and
quantify this relationship, you may conclude that somehow these two traits are
correlated in a meaningful way; perhaps there&amp;rsquo;s some biological process that
underpins both characteristics.&lt;/p&gt;
&lt;p&gt;However, this ignores something: all the relatives that are tall and have
freckles descend from your great-great-aunt and uncle, and all the
average-height relatives without freckles descend from your great-great-grandfather
(who is shorter and doesn&amp;rsquo;t have freckles). There isn&amp;rsquo;t necessarily a
meaningful relationship between these two traits &amp;mdash; it could just be an
accident of who shares ancestry (and thus the genes that determine traits) with
whom and what traits they happen to have. The family tree here breaks the
statistical assumption of &lt;em&gt;independence&lt;/em&gt;, since everyone&amp;rsquo;s related and thus not
independent of one another.&lt;/p&gt;
&lt;p&gt;This same problem occurs when we compare traits across species too, traits like
genetic diversity and census size. John Gillespie, whom I introduced at the
start of this article, suspected that early pictures of the
diversity&amp;ndash;population-size relationship may be &lt;em&gt;entirely&lt;/em&gt; a spurious artifact
of this shared ancestry. In his 1991 book, he has the following figure:&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/gillespie.png&#34; width=&#34;460&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  John Gillespie suspected that the relationship between population size and
  genetic diversity found by Nei and Graur (1984) could be entirely an artifact
  of two groups of animals: carnivores and fruitflies (1991).
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;
&lt;sup&gt;10&lt;/sup&gt; I should note here that Phylogenetic Comparative Methods were far
outside my area of expertise at the onset of this project, and I&amp;rsquo;m eternally
grateful to a few folks for helping me. &lt;a href=&#34;https://www.uyedalab.com/&#34;&gt;Josef
Uyeda&lt;/a&gt;
, &lt;a href=&#34;https://stfriedman.github.io/&#34;&gt;Sarah
Friedman&lt;/a&gt;
, and &lt;a href=&#34;https://kacorn.github.io/&#34;&gt;Katherine
Corn&lt;/a&gt;
 helped me immensely early on with
understanding and implementing these methods. Finally, &lt;a href=&#34;https://mwpennell.ca/&#34;&gt;Matt
Pennell&lt;/a&gt;
 reviewed my article and provided helpful
feedback on these methods and how best to present them. My paper is much
stronger thanks to their help.
&lt;/aside&gt;

&lt;p&gt;While some previous work has addressed this statistical quandary in clever
ways, it&amp;rsquo;s been limited by the difficulty of building big species phylogenetic
trees. In my paper, I use a tool called
&lt;a href=&#34;https://github.com/phylotastic/datelife&#34;&gt;datelife&lt;/a&gt;
 to build these trees, and
then account for this tree structure using something known as &lt;em&gt;Phylogenetic
Comparative Methods&lt;/em&gt;&lt;sup&gt;10&lt;/sup&gt;. These models account for the way certain
groups of related species may all differ from the line of best fit in a
systematic way, which violates the independence assumption of a standard
regression. Overall, I find
that even accounting for the species tree, the relationship between population
size and diversity is statistically significant. In fact, diversity is also
well predicted by range and body mass (see &lt;a href=&#34;https://elifesciences.org/articles/67509#fig2s3&#34;&gt;Figure 2&amp;ndash;figure
supplement 3&lt;/a&gt;).&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/diversity_pcm.png&#34; width=&#34;560&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  The estimated ancestral population sizes and diversity levels across species (colors), 
  overlaid on the phylogenetic tree of all species for which data was available.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;While the relationship between diversity and population size is statistically
significant after accounting for shared ancestry, my work finds Gillespie&amp;rsquo;s
concerns are substantiated. Arthropods (insects, spiders, and
crustaceans) and vertebrates do indeed form two clusters, as he suspected.
Insects typically have large population sizes and high genetic diversity, while
vertebrates typically have smaller population sizes and lower genetic
diversity. Still, when I look at the diversity&amp;ndash;population-size relationship
within each of these groups, I find it is &lt;a href=&#34;https://elifesciences.org/articles/67509#fig3s1&#34;&gt;still statistically
significant&lt;/a&gt;
.&lt;/p&gt;
&lt;p&gt;There are a few other analyses I did from this macroevolutionary perspective
which I won&amp;rsquo;t discuss here for brevity&amp;rsquo;s sake. Overall, I came away from this
part of the project thinking that considering Lewontin&amp;rsquo;s Paradox from a
macroevolutionary perspective will be critical in resolving this puzzle. In
particular, it may be important to consider how diversity and population size
change along the species tree, and how the speciation and extinction processes
that give rise to the species we see today interact with genetic
diversity and population size.&lt;/p&gt;
&lt;h2&gt;Can Selection Explain Lewontin&amp;rsquo;s Paradox?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;can-selection-explain-lewontins-paradox&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#can-selection-explain-lewontins-paradox&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The final component of my paper is investigating whether selection could
explain the shortfall between the observed genetic diversity levels across
species, and the diversity we would expect under the neutral theory if
effective population sizes were equal to census population sizes (e.g. the gray
ribbon in &lt;a href=&#34;#figure-2&#34;&gt;this figure&lt;/a&gt;). As described earlier, one of
the determinants of how strongly selection can impact genetic diversity is how
much recombination there is. The level of recombination varies across species,
so if I have any hope of roughly predicting how strong selection can get, I need
to know how much recombination there is in each species.&lt;/p&gt;
&lt;p&gt;Luckily, Jessica Stapley and colleagues published &lt;a href=&#34;https://royalsocietypublishing.org/doi/full/10.1098/rstb.2016.0455&#34;&gt;a very nice
survey&lt;/a&gt;
 of
recombination across animal species. Simply put, the last part of this project
wouldn&amp;rsquo;t have worked without this nice dataset, so I am quite grateful for this
work. Using this dataset, I investigated the relationship between recombination
map length (the expected number of times chromosomal breaks occur per
generation) and my population size estimates. I find that species with large
population sizes (such as fruitflies) typically have less recombination
(shorter map lengths) than species with smaller population sizes (such as
humans). I show this in panel (A) of the figure below:&lt;/p&gt;
&lt;figure id=&#34;figure-3&#34;&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/figure_3.png&#34; width=&#34;860&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
&lt;p&gt;(A) The relationship between recombination map length and population sizes.
The triangle points are eusocial/social species like ants that have
adaptively longer map lengths (see &lt;a href=&#34;https://www.nature.com/articles/6800950&#34;&gt;Wilfert et al.,
2007&lt;/a&gt;
).  (B) The relationship
between genetic diversity and population size, with the predicted diversity
under hitchhiking and background selection overlaid over it (blue ribbon).
The ribbon is the diversity for a variety of mutation rates (\(10^{-9} \lt \mu \lt 10^{-8}\)).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;While recombination map lengths are a critical parameter that mediates the
strength of selection, there are a few other parameters we unfortunately do not
have good estimates of for most species. For example, the rate at which new
deleterious mutations flow into a population affects diversity, since if
more harmful mutations have to be purged from the population, any linked genetic
diversity is purged as well, leading to a net loss of genetic
variability. Similarly, we&amp;rsquo;d also need an estimate of how many beneficial
mutations flow into a population each generation, since these also act to
reduce diversity through the hitchhiking effect.&lt;/p&gt;
&lt;p&gt;While we don&amp;rsquo;t have estimates of these key parameters for the majority of
species in my study, we do have very good estimates for the darling species of
evolutionary biology: the fruitfly. In a very nice study by &lt;a href=&#34;https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006130&#34;&gt;Eyal Elyashiv and
colleagues&lt;/a&gt;
,
they developed a sophisticated statistical approach to estimate these key
selection parameters for &lt;em&gt;Drosophila melanogaster&lt;/em&gt;. &lt;em&gt;Drosophila&lt;/em&gt;, with its very
large population sizes, has been known to experience some of the strongest
selection of any species we&amp;rsquo;ve looked at. So while I cannot predict the
strength of selection for each species, I can predict another useful quantity:
if selection were as strong in all species as it is in &lt;em&gt;Drosophila
melanogaster&lt;/em&gt;, would selection sufficiently reduce diversity to recreate the
diversity&amp;ndash;population-size relationship I see in the data? It turns out, the
answer is no (see &lt;a href=&#34;#figure-3&#34;&gt;Figure (B)&lt;/a&gt; above). Essentially,
there is too much recombination in species with moderate population sizes &amp;mdash;
even if we assume they experience absurdly high levels of selection, similar
to what we expect in &lt;em&gt;Drosophila&lt;/em&gt;, the reduction in genetic diversity caused
by selection is severely weakened by the large amount of recombination they
experience. It seems that using our current models of linked selection, there
is not a plausible way that Lewontin&amp;rsquo;s Paradox could be solved by selection.
In this sense, my work extends Graham&amp;rsquo;s work: not only do current estimates
of the strength of selection seem incapable of explaining the narrow range of
diversity, &lt;em&gt;no plausible estimates&lt;/em&gt; seem like they could. Indeed, I even try
to increase all the key selection parameters ten-fold (&lt;a href=&#34;https://elifesciences.org/articles/67509#fig4s3&#34;&gt;Figure 4&amp;ndash;figure
supplement 3&lt;/a&gt;), and &lt;em&gt;still&lt;/em&gt;
fail to see predicted diversity match the observed diversity.&lt;/p&gt;
&lt;h2&gt;So How Do We Solve Lewontin&amp;rsquo;s Paradox?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;so-how-do-we-solve-lewontins-paradox&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#so-how-do-we-solve-lewontins-paradox&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;So where does this leave us? From my perspective, my work strengthens the
arguments against selection as an explanation for Lewontin&amp;rsquo;s Paradox.
Alternative hypotheses, such as the effect of complex demographic histories
like repeated bottlenecks, appear to be a more likely explanation for
Lewontin&amp;rsquo;s Paradox. I also suggest in the discussion that some interaction of
macroevolutionary processes, like speciation and extinction dynamics, could
play a role in the pattern of diversity we see across species today. At the
heart of the problem is that the genetic diversity across species we see today
is the result of multiple overlaid processes occurring at very different
timescales: ecological, evolutionary, and historical. As I argue in the
conclusion, Lewontin&amp;rsquo;s Paradox may not be fully resolved for some time because
the explanation requires synthesis and model building across so many different
disciplines.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/lewontin_chalkboard.jpeg&#34; width=&#34;560&#34; class=&#34;img-responsive&#34;/&gt;
  &lt;figcaption&gt;
  Richard Lewontin at the chalkboard. It looks like he&#39;s explaining
  the interaction---and inseparability---of genotype and environment.
  Source: &lt;a href=&#34;https://whyevolutionistrue.com/2021/07/05/dick-lewontin-1929-2021/&#34;&gt;Why Evolution is True&lt;/a&gt;.
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;
&lt;sup&gt;11&lt;/sup&gt; I should note that it&amp;rsquo;s a funny coincidence that Lewontin
was particularly opposed to putting his name on his students&amp;rsquo; or postdocs&amp;rsquo;
papers unless he had a substantial role, as Jerry Coyne points out in his
article.  I did this work at UO thanks to the encouragement of my advisor &lt;a href=&#34;https://kr-colab.github.io/&#34;&gt;Andy
Kern&lt;/a&gt;
, who, perhaps channeling Lewontin,
supported me during this independent work and let me publish it alone. I owe a
debt of gratitude to Andy for this!
&lt;/aside&gt;

&lt;p&gt;Finally, I should mention that while this paper was in its second round of
reviews, Dick Lewontin passed away. I was deeply saddened by this news, as were
countless colleagues, students, and younger
scientists who were deeply inspired by his work and approach to science.
Lewontin was not just a preeminent scientist, but an activist and outspoken
opponent of the misuse of genetics for racist ends. He practiced science in a
way that was always keenly aware of its larger social context. This is now much
more commonplace than it used to be, in part thanks to him. I suggest reading
his &lt;a href=&#34;https://paperpile.com/shared/AX0LO9&#34;&gt;takedown of Arthur Jensen&amp;rsquo;s racist misuse of
genetics&lt;/a&gt;
, &lt;a href=&#34;https://paperpile.com/shared/damoim&#34;&gt;&lt;em&gt;The Apportionment of Human
Diversity&lt;/em&gt;&lt;/a&gt;
, and &lt;a href=&#34;https://paperpile.com/shared/9bQyGx&#34;&gt;&lt;em&gt;The Analysis of
Variance and The Analysis of Causes&lt;/em&gt;&lt;/a&gt;
.
I&amp;rsquo;ll never forget reading &lt;a href=&#34;https://paperpile.com/shared/vetii2&#34;&gt;Lewontin and Cohen
(1969)&lt;/a&gt;
 when I took Sebastian Schreiber&amp;rsquo;s
population ecology course during graduate school, nor Lewontin&amp;rsquo;s chapter in the
textbook &lt;em&gt;Building a Science of Population Biology&lt;/em&gt; (which I reference at the
end of my article). The obituaries in the &lt;a href=&#34;https://www.nytimes.com/2021/07/07/science/richard-c-lewontin-dead.html&#34;&gt;New York
Times&lt;/a&gt;
,
&lt;a href=&#34;https://www.nature.com/articles/d41586-021-01936-6&#34;&gt;Nature&lt;/a&gt;
, and the &lt;a href=&#34;https://www.smbe.org/smbe/HOME/TabId/37/ArtMID/1395/ArticleID/110/In-Memoriam-SMBE-mourns-the-passing-of-Dr-Richard-Charles-Dick-Lewontin.aspx&#34;&gt;Society
for Molecular Biology and
Evolution&lt;/a&gt;

are well worth reading. Jerry Coyne&amp;rsquo;s stories of the &lt;a href=&#34;https://whyevolutionistrue.com/2021/07/05/dick-lewontin-1929-2021/&#34;&gt;lab culture and life
of
Lewontin&lt;/a&gt;

are particularly fun to read.&lt;sup&gt;11&lt;/sup&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>The Genome-wide Signal of Linked Selection in Temporal Data</title>
      <link>https://vincebuffalo.com/blog/the-genome-wide-signal-of-linked-selection-in-temporal-data/</link>
      <pubDate>Thu, 20 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/the-genome-wide-signal-of-linked-selection-in-temporal-data/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;The last chapter of my dissertation with Graham Coop was recently published in
PNAS
(&lt;a href=&#34;https://www.pnas.org/content/pnas/early/2020/08/07/1919039117.full.pdf&#34;&gt;pdf&lt;/a&gt;
,
&lt;a href=&#34;https://www.biorxiv.org/content/10.1101/798595v3&#34;&gt;bioRxiv&lt;/a&gt;
) last week. In an
effort to communicate my research to a broader audience, I have written two
blog posts on our work. The &lt;a href=&#34;https://vincebuffalo.com/blog/the-problem-of-detecting-polygenic-selection-from-temporal-data/&#34;&gt;first post&lt;/a&gt;
is meant to introduce
the historical context and concepts like linked selection and polygenic
adaptation to a non-scientist, and the second post, below, describes our work
on temporal covariance as a signature of polygenic linked selection and its
application to four evolve-and-resequence data sets.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&#34;https://vincebuffalo.com/blog/the-problem-of-detecting-polygenic-selection-from-temporal-data/&#34;&gt;last
post&lt;/a&gt;
,
I explained two longstanding problems in the field of evolutionary genetics.
The first is detecting adaptation on polygenic traits from temporal genomic
data. Temporal data is gathered by sampling a population through the
generations and sequencing these samples, and provides us with an immense amount
of information about the evolutionary process over short timescales. Yet even
with this amount of data, distinguishing allele frequency changes caused by
polygenic selection from those caused by random genetic drift is a challenge. A population
could be adapting over short timescales &amp;mdash;we might even observe drastic
changes in a trait over a few generations&amp;mdash; yet it could be impossible to see
the signature of such strong selection at the DNA level.&lt;/p&gt;
&lt;p&gt;The second related problem is how we quantify the roles of drift and linked
selection in determining genome-wide allele frequency changes. Since the
debates between Fisher and Ford, and Wright, evolutionary geneticists have
disagreed on the relative roles played by genetic drift and natural selection
in determining allele frequency changes. Since their time, we have learned
selection at a site exerts a strong influence on its linked neighbors, known as
linked selection. This perturbs the frequency trajectories of alleles through
time, and to have a complete view of how populations evolve we need to
understand this process. With temporal data, we could perhaps directly estimate
the effects of linked selection and drift on allele frequency change.&lt;/p&gt;
&lt;h2&gt;The Quantitative Genetics View of Linked Selection and Temporal Covariance&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-quantitative-genetics-view-of-linked-selection-and-temporal-covariance&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-quantitative-genetics-view-of-linked-selection-and-temporal-covariance&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In our first paper (&lt;a href=&#34;https://www.genetics.org/content/213/3/1007&#34;&gt;Buffalo and Coop,
2019&lt;/a&gt;
), Graham and I proposed that
one promising signal that could be used to detect polygenic selection in
temporal genomic data is &lt;em&gt;temporal covariance&lt;/em&gt;. Our work builds on many
excellent papers but three in particular: Robertson (1961), and Santiago and
Caballero (1995, 1998)&lt;sup&gt;1&lt;/sup&gt;.&lt;/p&gt;
&lt;aside&gt;
For a very nice review of this work, and linked selection more broadly,
see &lt;a href=&#34;https://royalsocietypublishing.org/doi/10.1098/rstb.2000.0716&#34;&gt;Barton (2000)&lt;/a&gt;
.
&lt;/aside&gt;

&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/morley_sheep.png&#34; alt=&#34;Morley sheep breeding figure&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 1&lt;/strong&gt;
The first discussion of linked selection in a quantitative genetics context was
F.H.W. Morley discussing selection in Australian sheep for merino wool:
&#34;in a flock exposed to selection, the genetically superior individuals will
tend to be most inbred. As a corollary, selection increases the approach to
homozygosity, not only at loci carrying genes determining the character in
question but at all loci.&#34; (&lt;a href=&#34;https://www.publish.csiro.au/cp/AR9540305&#34;&gt;1954&lt;/a&gt;).
(&lt;em&gt;image source: &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Macarthur_stamp_sheep_1934.jpg&#34;&gt;Wikipedia&lt;/a&gt;&lt;/em&gt;)
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;These papers (and others) establish what I&amp;rsquo;ll refer to as a &lt;em&gt;quantitative
genetics view of linked selection&lt;/em&gt;, and implicitly describe temporal
covariance.  While the classic linked selection work (i.e. hitchhiking and
background selection, which I explain in &lt;a href=&#34;https://vincebuffalo.com/blog/the-problem-of-detecting-polygenic-selection-from-temporal-data/&#34;&gt;the first post&lt;/a&gt;
) describes how a
neutral allele behaves when it is a close neighbor of a new mutation that
affects fitness, the quantitative genetics view of linked selection often seeks
to understand how a neutral allele behaves when it is much more distant
&amp;mdash;perhaps even on a different chromosome&amp;mdash; from the genetic variation that is
selected upon.  Furthermore, this view supposes the genetic variation that
determines fitness is polygenic, and results from &lt;em&gt;standing variation&lt;/em&gt;, meaning
it is present at appreciable frequencies in the population (i.e. not the rare
new mutations of the classic linked selection work).&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/wright_ne.png&#34; alt=&#34;Wright population dynamics figure&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 2&lt;/strong&gt;
Wright&#39;s (&lt;a href=&#34;https://www.jstor.org/stable/2457575?seq=1&#34;&gt;1940&lt;/a&gt;) figure of
extinction and recolonization in a population. Population lineages are
expanding through time from left to right and space (top to bottom), with
groups going extinct and re-establishing. The genetic drift in a population with
such a complex breeding process can be represented by standard population
genetic models with a rescaled &lt;em&gt;effective&lt;/em&gt; population size.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;
&lt;sup&gt;2&lt;/sup&gt;
It&amp;rsquo;s worthwhile to note that there&amp;rsquo;s a rich
history of using the observed allele frequency changes through time to estimate
effective sizes of populations. A very simple estimator is based on the idea
that effective population size $N_e$ is proportional to the reciprocal of the
&lt;em&gt;variance&lt;/em&gt; in allele frequency change over $t$ generations,
$N_e \propto \frac{t}{2 \text{var}(p_t - p_0)}$. With drift acting alone, the variance in allele frequency change serves
as a measure of the rate of drift.
&lt;/aside&gt;

&lt;p&gt;One of Sewall Wright&amp;rsquo;s ingenious ideas was to recognize that many different
breeding structures we might see in nature (e.g. organisms capable of
self-fertilization, or the extinction-recolonization dynamics depicted above)
can still be described by standard models of genetic drift, as long as the
population size is rescaled appropriately. We call this &lt;em&gt;effective population
size&lt;/em&gt;&lt;sup&gt;2&lt;/sup&gt;, and the early work of Robertson first described the long-run
effect that selection on a polygenic trait exerts on a neutral site as a
reduction in effective population size. This, in essence, means selection can
make it seem as though genetic drift is &lt;em&gt;running faster&lt;/em&gt;, since changes are larger per
unit time.  The fact that many types of selection have effects quite similar to
random genetic drift occurring in a smaller population is one reason why the
effects of selection and drift can be hard to distinguish. This is precisely
why Fisher and Ford first had to estimate the population size to test whether
selection caused the decline in frequency of the dark wing color variant: they
needed to determine the magnitude of genetic drift to see if the frequency
changes were too drastic to be caused by drift alone, and were more likely to
be caused by selection.&lt;/p&gt;
&lt;p&gt;While Robertson, and Santiago and Caballero&amp;rsquo;s work expressed the linked
selection effects felt by a neutral allele during polygenic selection as a
reduction in effective population size, their work tells us there is one key
difference between a neutral allele randomly drifting in the population, and a
neutral allele affected by linked selection: with linked selection, the neutral
allele&amp;rsquo;s frequency changes between two generations &lt;em&gt;are correlated&lt;/em&gt; with the
changes between later generations. Imagine following a neutral allele as it
descends down lineages, forward through the generations towards the present. In
one generation, the neutral allele finds itself in a father carrying genes very
well suited for his environment, and as a result, he leaves many offspring.
Each child inherits some fraction of his well-suited genes, and possibly, the
neutral allele as well. Within this family, the neutral allele&amp;rsquo;s frequency
increases. As long as the genes that suited him well in this environment are
also beneficial in his children&amp;rsquo;s generation, they too will leave more
offspring. So too will the neutral allele&amp;rsquo;s frequency also rise in this
generation, as long as it remains associated with a fraction of genes
well-suited for this environment.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/tempcov.png&#34; alt=&#34;Temporal covariance illustration&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 3&lt;/strong&gt;
Whether a neutral allele finds itself on a beneficial background (blue) or
disadvantageous background (orange), the direction of the allele frequency
changes between times will be the same (as long as the neutral allele is still
associated with its background). This creates positive temporal covariance that
we can detect in populations, which can tell us that there is heritable fitness
variation in the population that is perturbing allele frequency changes.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If we were to track this neutral allele&amp;rsquo;s frequency trajectory through time in
the entire population, we would see it rise between the two first consecutive
generations, and then rise again between the next two, as long as it remains
associated with a fraction of the father&amp;rsquo;s beneficial genes. The neutral
allele&amp;rsquo;s frequency change between the first and second generations (mathematically,
$\Delta p_1 = p_2 - p_1$) and its change between the second and third generations ($\Delta p_2 =
p_3 - p_2$) would both be in the same direction while the allele is associated
with this good background, creating temporal covariance in the frequency
trajectory. This same effect happens if the neutral allele were to instead be
associated with a disadvantageous background (see figure above). In contrast,
genetic drift cannot create temporal covariance in neutral allele frequency
trajectories. Another way to say this is that when different chromosomes have
heritable fitness differences, the frequency change of an allele at one time
interval can be &lt;em&gt;predictive&lt;/em&gt; of the changes at later generations, as long as
the allele is still associated with its fitness background, and that background
has the same effect on fitness.&lt;/p&gt;
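&lt;p&gt;To make the definition concrete, here is a minimal sketch in Python with NumPy of how temporal covariances could be computed from a matrix of allele frequencies. The function name &lt;code&gt;temporal_covariance&lt;/code&gt;, the toy data, and the effect sizes are illustrative assumptions, not the code used in the papers, and the toy data is not a proper population genetic simulation.&lt;/p&gt;

```python
import numpy as np


def temporal_covariance(freqs):
    """Covariance matrix of allele frequency changes across time intervals.

    freqs has shape (timepoints, loci). Entry [i, j] of the result is the
    covariance, taken over loci, between delta_p_i and delta_p_j.
    """
    freqs = np.asarray(freqs, dtype=float)
    delta_p = np.diff(freqs, axis=0)  # row i holds p_{i+1} - p_i for every locus
    return np.cov(delta_p)            # np.cov treats each row as one variable


# Toy data (hypothetical): every locus gets independent noise in each
# interval as a stand-in for drift.
rng = np.random.default_rng(1)
n_loci = 5000
noise = rng.normal(0.0, 0.02, size=(3, n_loci))

# Hypothetical linked background: a fixed push of +0.01 or -0.01 per interval.
background = 0.01 * rng.choice([-1.0, 1.0], size=n_loci)

start = np.full((1, n_loci), 0.5)
drift_only = np.vstack([start, 0.5 + np.cumsum(noise, axis=0)])
linked = np.vstack([start, 0.5 + np.cumsum(noise + background, axis=0)])

drift_cov = temporal_covariance(drift_only)
linked_cov = temporal_covariance(linked)
print("drift only, cov(delta_p_1, delta_p_2):", drift_cov[0, 1])
print("linked background, cov(delta_p_1, delta_p_2):", linked_cov[0, 1])
```

&lt;p&gt;In this toy setting the off-diagonal entries sit near zero when only the noise is present, and become positive once each locus carries a persistent push from its background, whichever direction that push goes.&lt;/p&gt;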
&lt;p&gt;In Buffalo and Coop (&lt;a href=&#34;https://www.genetics.org/content/213/3/1007&#34;&gt;2019&lt;/a&gt;
), we
proposed using temporal genomic data to quantify the amount of temporal
covariance in a population, and use it as a means of detecting polygenic
selection over short timescales. We also developed a mathematical theory of what
determines the strength of temporal covariance, and found it is determined by
how much additive genetic variance for fitness there is in a population (a key
quantitative genetics parameter), and how strong the neutral allele&amp;rsquo;s
association is with the genetic fitness variation (which is mediated by the
level of recombination and the strength of initial association, known as
linkage disequilibrium).&lt;/p&gt;
&lt;h2&gt;Detecting Temporal Covariance in Evolve-and-Resequence Studies&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;detecting-temporal-covariance-in-evolve-and-resequence-studies&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#detecting-temporal-covariance-in-evolve-and-resequence-studies&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;With a basic understanding of how temporal covariance functions as a signal of
polygenic selection, let&amp;rsquo;s look at our recent paper, Buffalo and Coop
(&lt;a href=&#34;https://www.pnas.org/content/early/2020/08/07/1919039117/&#34;&gt;2020&lt;/a&gt;
). In this
study, we applied the methods we developed in our &lt;em&gt;Genetics&lt;/em&gt; paper to four
evolve-and-resequence studies. Our work relied entirely on the accessibility of
the data from these previous studies, and we both are grateful for the authors&#39;
support and openness with their data.&lt;/p&gt;
&lt;p&gt;The first study we analyzed was from &lt;a href=&#34;https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000128&#34;&gt;Barghi &lt;em&gt;et al.&lt;/em&gt;
(2019)&lt;/a&gt;
,
an evolve-and-resequence study in &lt;em&gt;Drosophila simulans&lt;/em&gt;. In this experiment
fruit flies were evolved in a hot laboratory environment for 60 generations,
across ten independent replicates. We re-analyzed this well-designed study
using our temporal covariance methods. We found extensive evidence of temporal
covariance through time (Figure 4A below), consistent with the original authors&amp;rsquo;
findings that the populations were adapting to their new warmer environment.
Furthermore, these temporal covariances declined through time, just as we
expected, since the temporal covariances weaken as the associations between
neutral sites and fitness variation decay through time.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/buffalo_coop_fig1.png&#34; alt=&#34;Temporal covariance results from Buffalo and Coop&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 4&lt;/strong&gt;
(A) Temporal covariance in the Barghi &lt;em&gt;et al.&lt;/em&gt; (2019) data set, averaged over
all ten replicate populations. Each line depicts the covariances between some
initial allele frequency change, and later frequency change through time (which
represents the rows of the covariance matrix, &lt;em&gt;right&lt;/em&gt;).
(B) A lower bound on the proportion of total variance in allele frequency
change due to linked selection, \(G\).
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;
&lt;sup&gt;2&lt;/sup&gt;
This is defined as the sum of the covariance terms (only non-zero in
the presence of linked selection) divided by the total variance:
$G = \frac{\sum_{i \ne j} \text{cov}(\Delta p_i, \Delta p_j)}{\text{var}(p_t - p_0)}$.
&lt;/aside&gt;

&lt;p&gt;In Buffalo and Coop
(&lt;a href=&#34;https://www.pnas.org/content/early/2020/08/07/1919039117/&#34;&gt;2019&lt;/a&gt;
), we
proposed that we could use temporal covariances to estimate what proportion of
total variation in allele frequency change is caused by linked
selection&lt;sup&gt;2&lt;/sup&gt;. We called this proportion $G$; if $G = 0$, all the
variance in allele frequency change is caused by genetic drift, whereas if $G =
1$, all the variance in allele frequency change is caused by linked selection. We
calculated $G$ as it accumulates through the generations (and thus call it
$G(t)$ where $t$ represents time), shown in Figure 4B above. Because
samples in Barghi &lt;em&gt;et al.&lt;/em&gt; (2019) were sequenced every ten generations, our $G$
estimate is a lower bound estimate, meaning the actual proportion of variation
in allele frequency change due to linked selection is &lt;em&gt;higher&lt;/em&gt; than our $G$
estimates.  Still, we find that over short timescales, at least 20% of the
variation in allele frequency change is due to linked selection. This provides
us with the first glimpse of how linked selection determines frequency changes
over very short timescales.&lt;/p&gt;
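&lt;p&gt;The calculation of $G$ can be sketched in a few lines of Python with NumPy. This assumes we already have a temporal covariance matrix whose diagonal holds the variance of each interval&amp;rsquo;s frequency change; the function name &lt;code&gt;g_statistic&lt;/code&gt; and the example matrix are made up for illustration. It relies on the identity that the variance of the total change $p_t - p_0$ is the sum of every entry in that matrix.&lt;/p&gt;

```python
import numpy as np


def g_statistic(cov_matrix):
    """Share of var(p_t - p_0) contributed by covariances between intervals."""
    cov_matrix = np.asarray(cov_matrix, dtype=float)
    total_var = cov_matrix.sum()                    # variances plus all covariances
    covariances = total_var - np.trace(cov_matrix)  # off-diagonal terms only
    return covariances / total_var


# Hypothetical covariance matrix for three consecutive intervals.
example = 1e-4 * np.array([
    [4.0, 1.0, 0.5],
    [1.0, 4.0, 1.0],
    [0.5, 1.0, 4.0],
])
drift_only = np.diag([4e-4, 4e-4, 4e-4])  # no covariance between intervals

print("G:", g_statistic(example))
print("G with no covariances:", g_statistic(drift_only))
```

&lt;p&gt;When samples are taken several generations apart, any covariance among the unobserved single-generation changes inside an interval lands on the diagonal and is counted as variance, which is one way to see why such an estimate is a lower bound.&lt;/p&gt;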
&lt;p&gt;Given we find such evidence of linked selection over short timescales, two
questions arise: (1) is this really polygenic selection? and (2) where is
the signal coming from? First, to investigate whether this was indeed polygenic
selection, rather than selection on a few mutations that have large effect, we
calculated temporal covariances along windows in the genome. If our genome-wide
signal of linked selection was driven by a few regions under strong selection,
we should expect to see these regions as outliers. Instead, we see that the
whole distribution of windowed covariances is enriched for positive
covariances, indicating the signal we&amp;rsquo;ve detected is spread across the entire
genome.&lt;/p&gt;
&lt;p&gt;Second, how can we detect a signal of linked polygenic selection when the
effect at each site is so weak? Drift and sampling variance introduce
considerable noise that can swamp the signal of temporal covariance, as well as
create spurious covariances. However, these sources of noise do &lt;em&gt;not&lt;/em&gt; push
frequency changes in a consistent direction across time intervals, whereas linked selection does, leading
to a signal that can be readily distinguished from random drift.&lt;/p&gt;
&lt;h2&gt;Convergent Correlations&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;convergent-correlations&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#convergent-correlations&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;One common study design of evolve-and-resequence experiments is to evolve
multiple &lt;em&gt;independent&lt;/em&gt; populations under the same (or different) environments,
and look for evidence of convergent (or divergent) selection. We hypothesized
that we should be able to detect something analogous to temporal covariance
across replicate populations exposed to the same selective pressures. This is
because replicate populations created from the same founding population will,
by chance, share some of the same fitness variation. Neutral alleles associated
with &lt;em&gt;the same&lt;/em&gt; advantageous genetic backgrounds would then be expected to
increase in frequency through time, in &lt;em&gt;both&lt;/em&gt; replicate populations. This
creates what we call between-replicate covariance, and we can measure this much
like we do temporal covariance.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/barghi_rep.png&#34; alt=&#34;Convergence correlations in Barghi et al. data&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 5&lt;/strong&gt;
A measure of between-replicate covariance, &lt;em&gt;convergence correlations&lt;/em&gt;,
calculated on the ten replicates of the Barghi &lt;em&gt;et al.&lt;/em&gt; data. Each line
represents a row in the correlation matrix pictured on the right, showing how
the convergence correlation, averaged across all pairwise comparisons between
replicates, changes through time.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The Barghi &lt;em&gt;et al.&lt;/em&gt; study evolved ten replicate populations independently,
which provided us with a great data set to see if we could detect
between-replicate covariances. To measure the extent of between-replicate
covariances, we use what we call &lt;em&gt;convergence correlations&lt;/em&gt;, which are simply
the covariance in allele frequency changes between replicates scaled by the
standard deviation in allele frequency change. As with our temporal
covariances, we calculate these for all time intervals, and find that they are
relatively weak and decay quickly through the generations. This tells us that
while early on, the selection occurring across replicates is similar, each
replicate quickly goes its own way (and this confirms a major finding of the
original Barghi &lt;em&gt;et al.&lt;/em&gt; study, that different loci across replicates
contribute to adaptation).&lt;/p&gt;
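&lt;p&gt;As a rough illustration of the statistic described above, the following Python sketch computes a convergence correlation for one pair of replicates over a single time interval. The function name &lt;code&gt;convergence_correlation&lt;/code&gt; and the toy frequency changes (a shared component plus replicate-specific noise) are assumptions made for the example, not the implementation from the paper.&lt;/p&gt;

```python
import numpy as np


def convergence_correlation(delta_a, delta_b):
    """Covariance of frequency changes in two replicates, scaled by their standard deviations."""
    delta_a = np.asarray(delta_a, dtype=float)
    delta_b = np.asarray(delta_b, dtype=float)
    cov_ab = np.cov(delta_a, delta_b)[0, 1]
    return cov_ab / (delta_a.std(ddof=1) * delta_b.std(ddof=1))


# Toy data (hypothetical): replicates A and B share a component of their
# frequency change, while replicate C has only its own noise.
rng = np.random.default_rng(2)
n_loci = 5000
shared = rng.normal(0.0, 0.01, size=n_loci)
rep_a = shared + rng.normal(0.0, 0.02, size=n_loci)
rep_b = shared + rng.normal(0.0, 0.02, size=n_loci)
rep_c = rng.normal(0.0, 0.02, size=n_loci)

r_shared = convergence_correlation(rep_a, rep_b)
r_unshared = convergence_correlation(rep_a, rep_c)
print("A vs B:", r_shared)
print("A vs C:", r_unshared)
```

&lt;p&gt;Scaling by the standard deviations makes the statistic comparable across replicates and studies whose overall amounts of frequency change differ.&lt;/p&gt;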
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/kelly_hughes.png&#34; alt=&#34;Kelly and Hughes convergence correlations&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 6&lt;/strong&gt;
The convergence correlations between each pair of replicates (A, B, and C) of
the Kelly and Hughes (2019) study. The 95% confidence intervals, estimated by a
block bootstrap approach, are shown (but look quite small on this y-axis scale).
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;A benefit of convergence correlations is that unlike temporal covariances, they
only require evolve-and-resequence studies with two timepoints and replicate
populations. This allowed us to re-analyze two other elegant
evolve-and-resequence studies. The first is Kelly and Hughes
(&lt;a href=&#34;https://www.genetics.org/content/211/3/943&#34;&gt;2019&lt;/a&gt;
), which evolved three
replicate populations of &lt;em&gt;Drosophila simulans&lt;/em&gt; to a novel lab environment.
Similar to our re-analysis of the Barghi &lt;em&gt;et al.&lt;/em&gt; study, we calculated
convergence correlations across each pairwise combination of the replicate
populations and found that all these convergence correlations are statistically
significant and stronger than those we see in the Barghi &lt;em&gt;et al.&lt;/em&gt;
study&lt;sup&gt;3&lt;/sup&gt;.  Furthermore, using an approach like $G$ described above, we
found that at least 37% of the total variation in allele frequency change was
&lt;em&gt;shared&lt;/em&gt; between replicates, which is a pretty sizable proportion considering
these lab populations are rather small and strongly affected by genetic drift.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;3&lt;/sup&gt;
While Buffalo and Coop (2019) provides us with a theoretical
understanding of what determines the strength of temporal covariance, we have
yet to work out the theory for what determines the strength of convergence
correlations. We explored this with simulations, and found that the size
of each replicate population and the genetic architecture of fitness both
strongly affected the strength of the convergence correlation (see Section S8.3
and Figure S12 in &lt;a href=&#34;http://pnas.org/content/pnas/suppl/2020/08/07/1919039117.DCSupplemental/pnas.1919039117.sapp.pdf&#34;&gt;the
Appendix&lt;/a&gt;
).
&lt;/aside&gt;

&lt;p&gt;The second study is the Longshanks selection experiment of Castro &lt;em&gt;et al.&lt;/em&gt;
(&lt;a href=&#34;https://elifesciences.org/articles/42014&#34;&gt;2019&lt;/a&gt;
, see also &lt;a href=&#34;https://bmcevolbiol.biomedcentral.com/articles/10.1186/s12862-014-0258-0&#34;&gt;Marchini &lt;em&gt;et al.&lt;/em&gt;
2014&lt;/a&gt;
),
where over twenty generations, two independent replicate population lines of
mice were selected for longer tibiae lengths relative to body size. The study
also has a control line, where mice were bred randomly. Remarkably, over twenty
generations of selection, tibiae length increased about five standard
deviations.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/castro.png&#34; alt=&#34;Longshanks mouse selection experiment results&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 7&lt;/strong&gt;
The convergence correlations between the two Longshanks selection lines (LS1
and LS2) and the control line (Ctrl). The black lines represent 95% confidence
intervals calculated on genome-wide data, and the blue lines represent 95%
confidence intervals calculated on the same data excluding the chromosomes
where Castro &lt;em&gt;et al.&lt;/em&gt; found large-effect loci. The signal that remains after excluding
these chromosomes shows the extent to which selection for longer tibiae
was polygenic.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Since the Longshanks study has a control line, it provided a powerful test of
our convergence correlations: we should expect significant convergence
correlations between the two Longshanks selection lines, but &lt;em&gt;not&lt;/em&gt; between each
Longshanks selection line and the control line, since these do not share a
convergent selection pressure. This is precisely what we find (shown above).
Furthermore, the original Castro &lt;em&gt;et al.&lt;/em&gt; study found two large effect loci,
one on chromosome 5 and the other on chromosome 10. In the original paper, they
show that while the loci on these chromosomes show a signal of convergent
selection, the trait itself is highly polygenic. When we shared our preliminary
results with the authors of Castro &lt;em&gt;et al.&lt;/em&gt;, they wondered about the extent to which
our convergence correlations were driven just by these large-effect loci. We
decided to exclude these large-effect loci by leaving out the entire
chromosomes they reside on (in part, because these loci could have far-reaching
effects on linked variation). We show the convergence correlations sans these
chromosomes above in blue &amp;mdash; they are also statistically significant,
indicating the signal we are detecting is polygenic.&lt;/p&gt;
&lt;p&gt;The presence of the control line also allowed us to do a fun little calculation
to partition the total variation in allele frequency changes into drift, shared
selection, and unique selection components. Since the mating of mice is random
in the control line, the total variance in allele frequency change we see is
due only to genetic drift. Any additional variance in allele frequency change
we see in the two Longshanks selection lines is then caused by selection (due to exactly the
same effect that Robertson 1961 described).  Furthermore, we can estimate the
fraction of variance &lt;em&gt;shared&lt;/em&gt; between the Longshanks selection lines, since
this is just the covariance in frequency changes between Longshanks lines.
Finally, the remaining part of the variance due to selection is that which is
unique to each selection line. We find that at least 32% of the variance in
allele frequency change is due to selection, and of this 32%, 17% is due to
shared selection pressures and the remaining 14% is due to unique selection
pressures or associations unique to a particular replicate
population&lt;sup&gt;3&lt;/sup&gt;.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;3&lt;/sup&gt;
I&amp;rsquo;ve rounded the numbers here, which is why they don&amp;rsquo;t quite add up to 32%.
&lt;/aside&gt;
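&lt;p&gt;The partitioning logic can be written out as a short Python sketch. It follows the reasoning above (control-line variance as the drift baseline, between-line covariance as the shared component, and the remainder as unique), but the function name &lt;code&gt;partition_variance&lt;/code&gt;, the choice to average the variance of the two selection lines, and the toy data are assumptions for illustration; the estimators in the paper differ in their details.&lt;/p&gt;

```python
import numpy as np


def partition_variance(delta_ls1, delta_ls2, delta_ctrl):
    """Split selection-line variance into drift, shared, and unique fractions."""
    ls1 = np.asarray(delta_ls1, dtype=float)
    ls2 = np.asarray(delta_ls2, dtype=float)
    ctrl = np.asarray(delta_ctrl, dtype=float)

    # Assumption: use the average variance of the two selection lines as the total.
    total = 0.5 * (ls1.var(ddof=1) + ls2.var(ddof=1))
    drift = ctrl.var(ddof=1)            # control line: drift only
    shared = np.cov(ls1, ls2)[0, 1]     # covariance between the selection lines
    unique = (total - drift) - shared   # what is left of the selection variance
    return {
        "drift": drift / total,
        "shared": shared / total,
        "unique": unique / total,
    }


# Toy data (hypothetical): drift noise in every line, plus shared and
# line-specific selection components in the two selection lines.
rng = np.random.default_rng(4)
n_loci = 20000
shared_sel = rng.normal(0.0, 0.01, size=n_loci)
ls1 = rng.normal(0.0, 0.02, size=n_loci) + shared_sel + rng.normal(0.0, 0.01, size=n_loci)
ls2 = rng.normal(0.0, 0.02, size=n_loci) + shared_sel + rng.normal(0.0, 0.01, size=n_loci)
ctrl = rng.normal(0.0, 0.02, size=n_loci)

parts = partition_variance(ls1, ls2, ctrl)
print(parts)
```

&lt;p&gt;By construction the three fractions sum to one, so any mismatch in reported percentages comes from rounding rather than from the partition itself.&lt;/p&gt;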

&lt;h2&gt;Shifts in Temporal Covariance&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;shifts-in-temporal-covariance&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#shifts-in-temporal-covariance&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/neg_cov.png&#34; alt=&#34;Negative temporal covariance illustration&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 8&lt;/strong&gt;
In one environment, where a trait is beneficial if it is larger (blue
background), the temporal covariance (gray points) through time is positive. If
the direction of selection changes such that small values of the trait are
beneficial (yellow background), the temporal covariance becomes negative.
Negative observed temporal covariance can tell us about reversals in the
direction of selection dynamics. Figure from Buffalo and Coop (&lt;a href=&#34;https://www.genetics.org/content/213/3/1007&#34;&gt;2019&lt;/a&gt;).
&lt;/figcaption&gt; 
&lt;/figure&gt;
&lt;p&gt;In our &lt;em&gt;Genetics&lt;/em&gt; paper, we suggested that fluctuating selection
could create &lt;em&gt;negative&lt;/em&gt; temporal covariance. The intuition here is that if a
neutral allele rises in frequency on a beneficial genetic background, a change
in the environment that leads this background to become disadvantageous would
cause a decline in frequency. Since in the first generations $\Delta p_t &amp;gt; 0$
and in the later generations $\Delta p_s &amp;lt; 0$, the temporal covariance would be
&lt;em&gt;negative&lt;/em&gt; (see figure above). We confirmed this hunch with simulations in our
&lt;em&gt;Genetics&lt;/em&gt; paper, and were curious if we&amp;rsquo;d see this pattern in real data.&lt;/p&gt;
&lt;p&gt;If you look closely at Figure 4 (A) above, you&amp;rsquo;ll see that at later timepoints,
we do observe negative covariances that are statistically significantly
different from zero. This is consistent with a reversal in the fitness of
certain genetic backgrounds. We wondered to what extent these reversals at
later time intervals were common, but obscured since we were averaging over the
entire genome.&lt;/p&gt;
&lt;p&gt;One thought we had was that perhaps we could look for such reversals happening
in smaller chunks, or windows, of the genome. While this could allow us to
detect subtle, local reversals in selection dynamics, it introduces two
problems. First, the temporal covariance estimates are incredibly noisy, since
we&amp;rsquo;re averaging over fewer sites. Second, we would need a way to estimate what
the distribution of these temporal covariances across windows would look like
under just drift, as a null hypothesis. We devised a simple way to do this by
randomly permuting the allele frequency changes across each window, and looking
at the entire distribution of windowed temporal covariances. The permutation
approach essentially breaks the directionality of temporal covariances caused
by selection, and gives us an estimate of the distribution of temporal
covariances as they would be under just genetic drift, which we could compare
to our observed distribution.  Below, I show the distributions of the windowed
temporal covariances between different time intervals:&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/shift_density.png&#34; alt=&#34;Windowed temporal covariance distributions&#34; /&gt;
&lt;figcaption&gt;
&lt;strong&gt;Figure 9&lt;/strong&gt;
Each figure shows the observed distribution of windowed temporal covariances
(orange) and the sign-permuted null distribution (blue). (A) Windowed temporal
covariances twenty generations apart show that these are positive (the right
shoulder of the orange distribution, compared to the blue null distribution).
(B) Forty generations apart, we see the shoulder has shifted towards the left
side, indicating a reversal in the selection dynamics across the genome.
&lt;/figcaption&gt; 
&lt;/figure&gt;
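&lt;p&gt;The idea behind the permutation null can be sketched in Python as follows. This is a simplified illustration, assuming the sign of each locus&amp;rsquo;s frequency change in one interval is flipped independently at random; the exact permutation scheme in the paper may differ, and the function names, window size, and toy data are all hypothetical.&lt;/p&gt;

```python
import numpy as np


def windowed_covariances(delta_t, delta_s, window_size):
    """Covariance between two sets of frequency changes within consecutive windows of loci."""
    n_windows = len(delta_t) // window_size
    covs = np.empty(n_windows)
    for w in range(n_windows):
        window = slice(w * window_size, (w + 1) * window_size)
        covs[w] = np.cov(delta_t[window], delta_s[window])[0, 1]
    return covs


def sign_permuted_null(delta_t, delta_s, window_size, rng):
    """Windowed covariances after randomly flipping the sign of each locus in delta_t."""
    signs = rng.choice([-1.0, 1.0], size=len(delta_t))
    return windowed_covariances(delta_t * signs, delta_s, window_size)


# Toy data (hypothetical): a persistent background effect plus independent
# noise in each of two time intervals.
rng = np.random.default_rng(3)
n_loci = 20000
background = 0.01 * rng.choice([-1.0, 1.0], size=n_loci)
delta_t = background + rng.normal(0.0, 0.02, size=n_loci)
delta_s = background + rng.normal(0.0, 0.02, size=n_loci)

observed = windowed_covariances(delta_t, delta_s, window_size=200)
null = sign_permuted_null(delta_t, delta_s, window_size=200, rng=rng)
print("mean observed windowed covariance:", observed.mean())
print("mean sign-permuted windowed covariance:", null.mean())
```

&lt;p&gt;Flipping signs keeps the magnitude of every frequency change intact while scrambling its direction, so the permuted distribution reflects the spread we would expect from noise alone and centers near zero.&lt;/p&gt;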
&lt;p&gt;In Figure 9 (A) above, the temporal covariances are separated by two timepoints
(20 generations), and we see an enrichment of positive temporal covariances
across the genome, consistent with the findings mentioned above. In Figure 9
(B) above, the temporal covariances are 40 generations apart and we see an
enrichment of negative covariances, compared to the null distribution in blue.
Our paper and the appendix show more figures illustrating this point. Seeing
the signal of such selection dynamics over short timescales was an unexpected
and interesting finding. Note that the environment in the Barghi &lt;em&gt;et al.&lt;/em&gt;
study was kept constant; there was not any outside factor that was
intentionally changed that could lead to such a reversal in the direction of
selection. This tells us that rather complex selection dynamics could be common
over short evolutionary timescales, and these selection dynamics may go unnoticed in the
long-run despite having a considerable impact on genome-wide frequency changes.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;conclusions&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#conclusions&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Overall, we find that a sizable proportion of allele frequency change, over
short timescales, is due to selection (likely through its effects on linked
sites). Our approach was overall extremely conservative, so the actual effect
could be much greater. As further evolve-and-resequence studies are designed
and conducted, I hope to apply and improve these methods to get better
estimates of the impact of polygenic selection on genome-wide frequency
changes. Overall, I believe our view of evolutionary genetics will be greatly
advanced if we continue to try to understand genomic evolution over short
timescales and try to reconcile these with population genomic studies using
samples from a single contemporary timepoint.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.pnas.org/content/early/2020/08/07/1919039117/&#34;&gt;Our paper&lt;/a&gt;
 contains
much more detail, and many analyses I&amp;rsquo;ve omitted here for brevity. I have left
out our entire section on the extensive simulations we have conducted using
SLiM, which confirm that our measures of temporal covariance, $G$, and
convergence correlations work as intended. Graham and I are both quite grateful
to our reviewers and our editor for recommendations and a review process that
made our paper much stronger.&lt;/p&gt;
&lt;p&gt;Finally, I should point out that this entire project was only possible because
the authors of the original studies conducted careful, well-designed
experiments, and were open with their data. I am extremely grateful for this
openness and I hope our re-analyses bring their excellent papers more readers.
Following their lead, I have made my analyses and the intermediate data I
produced openly available on &lt;a href=&#34;https://github.com/vsbuffalo/cvtk&#34;&gt;Github&lt;/a&gt;
. Each
analysis was conducted in a Jupyter Lab notebook using open source tools, and I
have tried to make my analyses as reproducible as possible. Collectively, our
knowledge of evolution will grow faster if we all embrace the same open science
mindset.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>The Problem of Detecting Polygenic Selection from Temporal Data</title>
      <link>https://vincebuffalo.com/blog/the-problem-of-detecting-polygenic-selection-from-temporal-data/</link>
      <pubDate>Thu, 20 Aug 2020 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/the-problem-of-detecting-polygenic-selection-from-temporal-data/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;The last chapter of my dissertation with Graham Coop was recently published in
PNAS
(&lt;a href=&#34;https://www.pnas.org/content/pnas/early/2020/08/07/1919039117.full.pdf&#34;&gt;pdf&lt;/a&gt;
,
&lt;a href=&#34;https://www.biorxiv.org/content/10.1101/798595v3&#34;&gt;bioRxiv&lt;/a&gt;
) last week. In an
effort to communicate my research to a broader audience, I have written two
blog posts on our work.
The first post, below, is meant to introduce the historical context and
concepts like linked selection and polygenic adaptation to a non-scientist, and
the &lt;a href=&#34;https://vincebuffalo.com/blog/the-genome-wide-signal-of-linked-selection-in-temporal-data/&#34;&gt;second post&lt;/a&gt;
 describes our work on temporal covariance as a signature
of polygenic linked selection and its application to four evolve-and-resequence
data sets.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Natural Selection is Rapid&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;natural-selection-is-rapid&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#natural-selection-is-rapid&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;For nearly seventy years, we have known Charles Darwin was wrong about a key
aspect of natural selection: that it was slow acting. In the first edition of
&lt;em&gt;The Origin of Species&lt;/em&gt;, Darwin wrote of natural selection,&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;We see nothing of these slow changes in progress, until the hand of time has
marked the long lapse of ages, and then so imperfect is our view into long
past geological ages, that we only see that the forms of life are now
different from what they formerly were.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;aside&gt;
&lt;sup&gt;1&lt;/sup&gt; Artificial selection refers to selective breeding done by humans,
usually to select a more desirable trait in a species.
&lt;/aside&gt;

&lt;p&gt;While Darwin knew artificial selection&lt;sup&gt;1&lt;/sup&gt; could rapidly change an
organism&amp;rsquo;s traits, it was not until ecological geneticists E.B. Ford and
Bernard Kettlewell showed rapid changes in moth wing coloration that it was
recognized that natural selection could act over very short timescales. Since then,
evolutionary biologists have shown rapid adaptation is remarkably common in a
variety of species including sticklebacks&lt;sup&gt;2&lt;/sup&gt;, &lt;em&gt;Anolis&lt;/em&gt;
lizards&lt;sup&gt;3&lt;/sup&gt;, soapberry bugs&lt;sup&gt;4&lt;/sup&gt;, guppies&lt;sup&gt;5&lt;/sup&gt;, and field
mustard&lt;sup&gt;6&lt;/sup&gt;.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;2&lt;/sup&gt; &lt;a href=&#34;https://www.pnas.org/content/112/52/E7204&#34;&gt;Lescak &lt;em&gt;et al.&lt;/em&gt; (2015)&lt;/a&gt;
. &lt;br/&gt;
&lt;sup&gt;3&lt;/sup&gt; &lt;a href=&#34;https://science.sciencemag.org/content/346/6208/463.abstract&#34;&gt;Stuart &lt;em&gt;et al.&lt;/em&gt; (2014)&lt;/a&gt;
. &lt;br/&gt;
&lt;sup&gt;4&lt;/sup&gt; &lt;a href=&#34;https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1558-5646.1992.tb00619.x&#34;&gt;Carroll and Boyd (1992)&lt;/a&gt;
. &lt;br/&gt;
&lt;sup&gt;5&lt;/sup&gt; &lt;a href=&#34;https://science.sciencemag.org/content/275/5308/1934.full&#34;&gt;Reznick &lt;em&gt;et al.&lt;/em&gt; (1997)&lt;/a&gt;
. &lt;br/&gt;
&lt;sup&gt;6&lt;/sup&gt; &lt;a href=&#34;https://www.pnas.org/content/104/4/1278&#34;&gt;Franks &lt;em&gt;et al.&lt;/em&gt; (2007)&lt;/a&gt;
.
&lt;/aside&gt;

&lt;h2&gt;Natural Selection versus Drift&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;natural-selection-versus-drift&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#natural-selection-versus-drift&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/ford_fisher.png&#34; alt=&#34;Ford and Fisher moth wing color frequency data&#34; /&gt;
&lt;figcaption&gt;
The frequency trajectory of the medionigra wing color variant in &lt;em&gt;Panaxia
dominula&lt;/em&gt;, scarlet tiger moth, &lt;br/&gt;observed in Oxford by Fisher and Ford (1947).
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In one classic study, E.B. Ford and R.A. Fisher tracked the frequency of a dark
wing color variant in a population of scarlet tiger moths around Oxford for
nearly a decade. The dark wing variant declined quickly in frequency (see
figure above), which they believed was caused by natural selection against this
variant. However, to demonstrate that natural selection drove this change, they
had to rule out another possible explanation: genetic drift. Genetic drift is
another factor that leads the frequency of alleles&lt;sup&gt;7&lt;/sup&gt; in populations
to vary through time, caused by random chance. Some individuals in a population
may get lucky, and encounter an abundance of resources that allow them to leave
more offspring; some may experience random hardships that force them to leave
fewer. Unlike with natural selection, the underlying causes of variation in
survival and family size due to drift are not &lt;em&gt;genetic&lt;/em&gt;, but &lt;em&gt;random&lt;/em&gt;;
consequently they are not inherited by the next generation.  Another major
source of randomness is Mendelian segregation: if you carry two different
copies of a gene (one you inherited from your father, one you inherited from
your mother), the copy you pass to your child is determined essentially by a
biological coin flip.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;7&lt;/sup&gt; The genes we carry come in different alternative variants, which we refer to
as &lt;em&gt;alleles&lt;/em&gt;.
&lt;/aside&gt;

&lt;p&gt;These random changes in allele frequency across families lead to a random
behavior of allele frequencies within a population, which could, purely by
chance, also lead to the decline of a particular wing coloration through time
that Ford observed. To discern natural selection from random drift, Ford
collaborated with R.A. Fisher, who developed a statistical method that used
population size estimates to determine the strength of genetic drift (genetic
drift is the strongest in small populations, where a single family&amp;rsquo;s
reproductive fortunes have a larger proportional impact). Their 1947 paper
argued the change in the frequency of the dark wing variant through time was
caused by natural selection, not genetic drift; this later led to a vitriolic
debate between Ford and Fisher and Sewall Wright, the leading proponent of
genetic drift at the time.&lt;/p&gt;
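&lt;p&gt;To make the link between population size and drift concrete, here is a minimal
Python sketch of neutral drift under a Wright-Fisher-style sampling model. It is
an illustration only, not the statistical method Fisher developed: the function
names, the starting frequency of 0.1, the ten generations, and the two population
sizes are all assumptions chosen for the example.&lt;/p&gt;

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed):
    """Simulate neutral drift at one site in a diploid population.

    Each generation, 2 * pop_size allele copies are sampled at random from
    the previous generation's allele frequency; there is no selection.
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each new copy carries the allele with probability equal to freq.
        count = sum(1 for _ in range(n_copies) if freq > rng.random())
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


def mean_abs_change(pop_size, replicates=100):
    """Average absolute frequency change after 10 generations of pure drift."""
    changes = []
    for rep in range(replicates):
        traj = wright_fisher(pop_size, start_freq=0.1, generations=10, seed=rep)
        changes.append(abs(traj[-1] - traj[0]))
    return sum(changes) / len(changes)


# Hypothetical population sizes, chosen only to contrast small and large.
small = mean_abs_change(pop_size=50)
large = mean_abs_change(pop_size=1000)
print(f"N=50:   mean |change| = {small:.3f}")
print(f"N=1000: mean |change| = {large:.3f}")
```

&lt;p&gt;With these settings the small population should show noticeably larger
frequency swings than the large one, which is why an estimate of population size
is needed before a frequency change can be attributed to selection.&lt;/p&gt;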
&lt;h2&gt;We can study natural selection through its effects on linked sites&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;we-can-study-natural-selection-through-its-effects-on-linked-sites&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#we-can-study-natural-selection-through-its-effects-on-linked-sites&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;After this early work using temporal data, there was a relative lull as efforts
shifted towards detecting selection from single present-day population genetic
samples, until the tremendous growth in DNA sequencing technology over the last
two decades. Now, researchers can directly observe allele frequency changes
through time all across an organism&amp;rsquo;s genome (millions of sites), rather than
at a few sites that affect traits (e.g. darker wing color). This has spurred
the further development of statistical methods that can differentiate allele
frequency changes caused by selection from those caused by random
drift&lt;sup&gt;8&lt;/sup&gt;. Simultaneously, new statistical methods were discovering the
footprints left by natural selection in our DNA and the DNA of many other
species, by the effect selection has on its &lt;em&gt;neighboring sites&lt;/em&gt;.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;8&lt;/sup&gt;&lt;a href=&#34;https://www.genetics.org/content/196/2/509.short&#34;&gt;Feder &lt;em&gt;et al.&lt;/em&gt; (2014)&lt;/a&gt;

&lt;br/&gt;
&lt;a href=&#34;https://www.genetics.org/content/193/3/973.short&#34;&gt;Mathieson and McVean (2013)&lt;/a&gt;

&lt;br/&gt;
&lt;a href=&#34;https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005069&#34;&gt;Terhorst &lt;em&gt;et al.&lt;/em&gt; (2015)&lt;/a&gt;
.
&lt;/aside&gt;

&lt;p&gt;When a new beneficial mutation arises in a population, its comparative
advantage (either through increasing survival or in leaving more offspring)
causes it to quickly rise in frequency. Alleles reside on our chromosomes, and
are linked, or coupled to their neighbors because large contiguous stretches of
our chromosomes are inherited together. A process known as recombination does
shuffle up chromosomes a bit, since each chromosome we inherit from our father
is a patchwork of our paternal grandparents&amp;rsquo; chromosomes (and likewise with our
maternal chromosomes).  Consequently, when beneficial mutations rise in
frequency, they drag along neighboring alleles that happen to be lucky enough
to reside on the same chromosome on which the beneficial mutation arose (we call
this &amp;ldquo;genetic hitchhiking&amp;rdquo;&lt;sup&gt;9&lt;/sup&gt; since these alleles hitch a ride along with the
beneficial mutation). We can detect these events because they wipe out genetic
variation in spots along the genome, since, in essence, everyone derives their
ancestry in this region from the individual in which the beneficial mutation
first arose.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;9&lt;/sup&gt; The first work on this &amp;ldquo;hitchhiking&amp;rdquo; effect was the seminal paper of &lt;a href=&#34;https://www.cambridge.org/core/journals/genetics-research/article/hitchhiking-effect-of-a-favourable-gene/918291A3B62BD50E1AE5C1F22165EF1B&#34;&gt;John Maynard Smith and John Haigh (1974)&lt;/a&gt;
.
&lt;/aside&gt;

&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/nash_coop.png&#34; alt=&#34;Selective sweep in malaria showing drug resistance&#34; /&gt;
&lt;figcaption&gt;
A selective sweep in malaria, &lt;em&gt;Plasmodium falciparum&lt;/em&gt;, conferring drug
resistance. Data from Nash &lt;em&gt;et al.&lt;/em&gt; (2005), figure produced by Graham Coop in
his &lt;a href=&#34;https://github.com/cooplab/popgen-notes/blob/master/release_popgen_notes.pdf&#34;&gt;population genetics notes&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;One remarkable finding of the last three decades was that mutations that are
disadvantageous, or deleterious (that is, they leave fewer offspring or lower
odds of survival) also extinguish genetic variation in a region. This effect is
known as &lt;em&gt;background selection&lt;/em&gt;. Collectively, the hitchhiking effect and
background selection are types of &lt;em&gt;linked selection&lt;/em&gt;, and this is a primary
focus of Graham&amp;rsquo;s and my work. There is a great deal of mathematical theory that
predicts the extent to which, and over how long a stretch of chromosome,
genetic variation is wiped out by genetic hitchhiking and background selection.
From this body of work, we predict that regions of high recombination (i.e. an
allele is very likely to be randomly shuffled off its chromosome background)
experience weaker effects of linked selection on genetic variation, while regions of low
recombination (i.e. an allele is very unlikely to be shuffled off its
chromosome background) are drastically affected by linked selection. These
predictions have been confirmed in numerous studies, perhaps most famously by
Begun and Aquadro (1992) in the fruit fly &lt;em&gt;Drosophila melanogaster&lt;/em&gt;, where they
found a strong correlation between the amount of recombination and the level of
pairwise diversity, a common measure of genetic variability.&lt;/p&gt;
&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/begun_aquadro.png&#34; alt=&#34;Correlation between diversity and recombination in Drosophila&#34; /&gt;
&lt;figcaption&gt;
The correlation between pairwise diversity (a measure of genetic variability)
and recombination rate in &lt;em&gt;Drosophila melanogaster&lt;/em&gt;, from &lt;a href=&#34;https://www.nature.com/articles/356519a0&#34;&gt;Begun and Aquadro (1992)&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Over the last fifty years, we have come to recognize that linked selection
itself introduces a source of randomness, much like genetic drift, into
evolution; we call this &lt;em&gt;genetic draft&lt;/em&gt;. Since new selected mutations (whether
advantageous or injurious) arise on random chromosomes in the population, and
since recombination randomly shuffles these chromosomes through the
generations, an allele&amp;rsquo;s frequency may jiggle about through time not only due
to genetic drift, but by the fitness of whatever background it happens to find
itself on. Our understanding of evolutionary genetics will not be complete
until we can differentiate the randomness of genetic drift from the
randomness of linked selection. This open problem is reminiscent of the same
debates that Fisher and Ford were having over seventy years ago.&lt;/p&gt;
&lt;h2&gt;The Problem of Polygenic Selection&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-problem-of-polygenic-selection&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-problem-of-polygenic-selection&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/height.png&#34; alt=&#34;Distribution of human heights showing continuous trait&#34; /&gt;
&lt;figcaption&gt;
The distribution of human heights, a continuous trait, among 143
University of Connecticut students. Image from &lt;a href=&#34;https://amstat.tandfonline.com/doi/abs/10.1198/00031300265&#34;&gt;Schilling &lt;em&gt;et al.&lt;/em&gt; (2002)&lt;/a&gt;.
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Over the last decade, modern genomic data sets have advanced the field of
quantitative genetics as well. Quantitative genetics studies the nature of
&lt;em&gt;continuous traits&lt;/em&gt;, such as height, including how selection operates on these
traits and how variability for these traits arises in populations. One of the
triumphs of the Modern Synthesis&lt;sup&gt;10&lt;/sup&gt; was finding that the same
Mendelian genetic system that determines discrete traits, like wing coloration,
also determines continuous traits like height. The difference is that while a
single gene may determine wing coloration, hundreds to thousands of genes each
have a small effect on height; such traits are known as &lt;em&gt;polygenic traits&lt;/em&gt;.
These small differences across numerous genes, as well as environmental
factors, add up to determine one&amp;rsquo;s height; across a population, the genetic
differences between individuals lead to a smooth distribution of traits that
looks normally distributed (i.e. having a bell curve shape, such as the
distribution of heights pictured above).&lt;/p&gt;
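&lt;p&gt;The way many small effects add up to a bell-shaped distribution can be sketched
with a short simulation. This is a toy model under stated assumptions, not a
model of real height genetics: the number of sites (500), the equal per-allele
effect, the allele frequency of 0.5, and the size of the environmental noise are
all hypothetical values picked for illustration.&lt;/p&gt;

```python
import random
import statistics


def simulate_trait(n_individuals, n_sites, seed=1):
    """Sum many small, equal additive effects plus environmental noise."""
    rng = random.Random(seed)
    effect = 1.0 / n_sites  # hypothetical per-allele effect size
    traits = []
    for _ in range(n_individuals):
        # Each site carries 0, 1, or 2 copies of the trait-increasing allele,
        # drawn here with an assumed allele frequency of 0.5.
        genetic = sum(
            effect * (rng.getrandbits(1) + rng.getrandbits(1))
            for _ in range(n_sites)
        )
        environment = rng.gauss(0, 0.02)  # assumed environmental noise
        traits.append(genetic + environment)
    return traits


traits = simulate_trait(n_individuals=2000, n_sites=500)
mean = statistics.mean(traits)
sd = statistics.stdev(traits)
# For a bell curve, roughly two thirds of values fall within one sd of the mean.
within_one_sd = sum(1 for t in traits if sd >= abs(t - mean)) / len(traits)
print(f"mean = {mean:.3f}, sd = {sd:.3f}, within one sd = {within_one_sd:.2f}")
```

&lt;p&gt;No single site contributes much here, yet the sum across sites produces a
smooth, approximately normal spread of trait values across individuals.&lt;/p&gt;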
&lt;aside&gt;
&lt;sup&gt;10&lt;/sup&gt; &lt;a href=&#34;https://en.wikipedia.org/wiki/Modern_synthesis_%2820th_century%29&#34;&gt;&lt;em&gt;The Modern Synthesis&lt;/em&gt;&lt;/a&gt;
 was
the synthesis of Charles Darwin and Alfred Russel Wallace&amp;rsquo;s theory of evolution
with Gregor Mendel&amp;rsquo;s genetic theory of inheritance. The mathematical part of
the synthesis was worked out primarily by R.A. Fisher, Sewall Wright, and J.B.S.
Haldane, and numerous other subdisciplines such as paleontology, systematics,
and botany were shown to be consistent with the new synthesis of evolution
and genetics.
&lt;/aside&gt;

&lt;p&gt;Selection also acts on continuous traits, which has enabled humans to
continually increase crop yields, breed cows that produce more milk, and so
forth. Like population genetics, a rich body of mathematical theory can predict
the response to selection and many other aspects of quantitative traits.
However, unlike population genetics theory, the mathematical theory of
quantitative genetics does not concern itself with the allele frequency changes
at each of the sites that determine a trait&amp;rsquo;s value, but rather takes a
macroscopic approach that focuses on a trait&amp;rsquo;s mean and variance, much like the
macroscopic view of the ideal gas law in physics. Bridging this macroscopic
quantitative genetics view with the microscopic population genetics view of
individual genes has proved to be an arduous task for many reasons.&lt;/p&gt;
&lt;p&gt;One key difficulty is that when a polygenic trait like height is selected for,
the effects of selection are distributed across all the hundreds or thousands
of sites that contribute to the trait&amp;rsquo;s value. Consequently, the effect of
selection at any one site is minuscule, and its frequency trajectory through
time will not look like the rapid change in frequency that Fisher and Ford saw
with wing coloration in scarlet tiger moths. Worse, such small changes in
allele frequency can easily be mistaken for the random changes caused by
drift.&lt;/p&gt;
&lt;p&gt;In sum, evolutionary geneticists are left with an interesting quandary. We can
readily observe rapid adaptation in natural populations, as traits change
through time to better suit a new environment. Yet, despite the present
abundance of population genomic data, it can be difficult to detect selection
on polygenic traits from temporal data &lt;em&gt;at the DNA level&lt;/em&gt;, since the allele
frequency changes are minor and hard to differentiate from random genetic
drift. Furthermore, since these allele frequency changes are seemingly
indistinguishable from genetic drift, even with temporal data we are not able
to discern what fraction of allele frequency changes are due to drift, and what
fraction are due to selection. These are the problems my work with Graham Coop
aims to address.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://vincebuffalo.com/blog/the-genome-wide-signal-of-linked-selection-in-temporal-data/&#34;&gt;Continue to Part II&lt;/a&gt;
&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Understanding Snakemake</title>
      <link>https://vincebuffalo.com/blog/understanding-snakemake/</link>
      <pubDate>Wed, 04 Mar 2020 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/understanding-snakemake/</guid>
      <description>
        
        
        &lt;figure&gt;
&lt;img src=&#34;https://vincebuffalo.com/images/snake_small.svg&#34; alt=&#34;Heraldic snake symbol&#34; /&gt;
&lt;figcaption&gt;
Heraldic snake from Flickr (CC Licensed).
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Each day, data scientists, computational biologists, astronomers, and other
folks that spend far too much time in front of a computer screen spend hours
doing somewhat horrible, monotonous tasks. Scientific programming, when done
right, is supposed to &lt;em&gt;prevent us&lt;/em&gt; from doing these monotonous tasks, and this
is certainly true when we compare what we do today to what the tireless
programmers and human computers of the 1950s did: inverting matrices by hand,
writing code to calculate the t-statistic and corresponding p-value, and so
forth. All of these monotonous tasks, thankfully, are implemented now in modern
libraries like BLAS/LAPACK, GNU Scientific Library, numpy, R, eigen, etc.
However, the problem has just shifted: today&amp;rsquo;s monotonous tasks are tidying
messy data, applying a linear model to tens of thousands of data sets, or
assessing prediction accuracy across statistical model parameters using cross
validation.&lt;/p&gt;
&lt;p&gt;Some of these monotonous tasks have been made easier by very clever
abstractions. Consider R&amp;rsquo;s tidyverse, which has simplified numerous monotonous
tasks most R users do by realizing these tasks fit a &lt;strong&gt;pattern&lt;/strong&gt;: import data,
tidy that data, explore the data, and communicate the findings (see Hadley
Wickham&amp;rsquo;s book &lt;a href=&#34;https://r4ds.had.co.nz/explore-intro.html&#34;&gt;R for Data
Science&lt;/a&gt;
).  Similarly, scikit-learn
has slick classes to do the tedious task of &lt;a href=&#34;https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html&#34;&gt;model
selection&lt;/a&gt;
.
Additionally, data projects usually involve lots of repetitive file system
work: downloading datasets, running command line tools to pre-process raw data,
running simulation software, etc. These tasks are often repetitive because,
throughout the course of a project, you&amp;rsquo;ll likely need to &lt;em&gt;re-run the same
steps multiple times&lt;/em&gt;.  On page 9 of my book &lt;em&gt;&lt;a href=&#34;https://www.powells.com/book/bioinformatics-data-skills-reproducible-robust-research-with-open-source-tools-9781449367374&#34;&gt;Bioinformatics Data
Skills&lt;/a&gt;
&lt;/em&gt;
I explain why:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;You will almost certainly have to re-run an analysis more than once, possibly
with new or changed data. This happens frequently because you’ll find a bug,
a collaborator will add or update a file, or you’ll want to try something new
upstream of a step. In all cases, downstream analyses depend on these earlier
results, meaning all steps of an analysis need to be re-run.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What abstraction and accompanying software tools allow us to avoid these humdrum
repetitive tasks we do in the Unix shell? Make and Snakemake.&lt;/p&gt;
&lt;h2&gt;Make&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;make&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#make&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I think the best way to learn Snakemake (our ultimate goal) is to first get a
rough sense of how its predecessor, Make, works. I still use Make for simple
tasks, and you&amp;rsquo;ve already likely used it to compile software at some point.
Make is a software tool and language that&amp;rsquo;s been around since 1976, and is
still widely used. Originally it was designed as a way to automate building
software, but the central problem of compiling software is quite like the
problem described above: some input file will change, and all downstream files
that depend on this file need to be updated by re-running code. Compiling
software is relatively time consuming, so Make&amp;rsquo;s designers took advantage of a
simple idea: if we declare what files depend on other files, we only need to
run the steps downstream of the files that have changed. In computer science
lingo, we write out the code compiling steps as a directed acyclic graph, and
only the paths connected to a changed file need to be re-run. This will become
clear with some examples.&lt;/p&gt;
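&lt;p&gt;As a rough illustration of the graph idea (not how Make itself is
implemented), the Python sketch below stores a small, hypothetical set of file
dependencies and walks the graph to find everything downstream of a changed
file. The file names and the &lt;code&gt;downstream_of&lt;/code&gt; helper are made up for this example.&lt;/p&gt;

```python
from collections import deque

# Hypothetical dependency graph: each target maps to the files it depends on.
DEPENDENCIES = {
    "genome.fa.gz": [],
    "seqlens.tsv": ["genome.fa.gz"],
    "summary.txt": ["seqlens.tsv"],
    "notes.txt": [],
}


def downstream_of(changed, dependencies):
    """Return every target that directly or indirectly depends on `changed`."""
    # Invert the graph: file -> targets that list it as a dependency.
    dependents = {}
    for target, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(target)

    # Breadth-first walk along the inverted edges.
    stale, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for target in dependents.get(current, []):
            if target not in stale:
                stale.add(target)
                queue.append(target)
    return stale


print(sorted(downstream_of("genome.fa.gz", DEPENDENCIES)))
# prints ['seqlens.tsv', 'summary.txt']
print(sorted(downstream_of("notes.txt", DEPENDENCIES)))
# prints []
```

&lt;p&gt;Changing the genome file marks both downstream files as stale, while changing
an unrelated file marks nothing, so no work is repeated unnecessarily.&lt;/p&gt;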
&lt;p&gt;First, Make (the software) looks for a file named &lt;code&gt;Makefile&lt;/code&gt; in a directory when
we run the &lt;code&gt;make&lt;/code&gt; command&lt;sup&gt;1&lt;/sup&gt;. In a Makefile, we specify the &lt;strong&gt;rules&lt;/strong&gt; describing
the steps to run to turn input files into output files.  Specifically, a
Makefile consists of:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;1&lt;/sup&gt;
For clarity, I&amp;rsquo;ll try to be consistent in how I stylize various words when
describing Make and Snakemake.  Make is the name of the software and language,
Makefiles are the files full of code describing what to do, &lt;code&gt;make&lt;/code&gt; is Make&amp;rsquo;s
command line tool, and a file named &lt;code&gt;Makefile&lt;/code&gt; is what the &lt;code&gt;make&lt;/code&gt; tool looks
for in a directory when it&amp;rsquo;s run.
&lt;/aside&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;A &lt;strong&gt;target&lt;/strong&gt;, the thing to build with this rule. This is the output file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;One or more &lt;strong&gt;dependencies&lt;/strong&gt;. These are the files needed by the
rule to create the target.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Commands&lt;/strong&gt;, or the list of Unix commands needed to convert the
dependencies into the target.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The format of a rule is:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-Makefile&#34; data-lang=&#34;Makefile&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;file_target.txt&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dependency_&lt;/span&gt;1.&lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dependency_&lt;/span&gt;2.&lt;span class=&#34;n&#34;&gt;txt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	unix_command dependency_1.txt dependency_2.txt &amp;gt; file_target.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that the command line &lt;em&gt;must&lt;/em&gt; begin with a tab character, not spaces.
Make&amp;rsquo;s a bit grumpy about this. Still, Make is quite clever, in that it
will only re-run a rule if (1) a dependency has been modified more recently than
the target (Make compares file modification times), or (2) the target
does not exist and needs to be generated.&lt;/p&gt;
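&lt;p&gt;The re-run decision can be mimicked in a few lines of Python. This is a
simplified sketch of the idea based on file modification times, not Make&amp;rsquo;s actual
source code; the &lt;code&gt;needs_rebuild&lt;/code&gt; function and the timestamps set with
&lt;code&gt;os.utime&lt;/code&gt; are assumptions made for the demonstration.&lt;/p&gt;

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Mimic a basic staleness check using file modification times."""
    if not os.path.exists(target):
        return True  # the target is missing, so it must be generated
    target_mtime = os.path.getmtime(target)
    # Rebuild if any dependency was modified more recently than the target.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "dependency_1.txt")
    target = os.path.join(tmp, "file_target.txt")
    open(dep, "w").close()

    print(needs_rebuild(target, [dep]))  # prints True: target does not exist

    open(target, "w").close()
    os.utime(dep, (100, 100))     # pretend the dependency is old
    os.utime(target, (200, 200))  # and the target was built after it
    print(needs_rebuild(target, [dep]))  # prints False: target is up to date

    os.utime(dep, (300, 300))     # the dependency is "edited" after the build
    print(needs_rebuild(target, [dep]))  # prints True: dependency is newer
```

&lt;p&gt;Because the check relies only on timestamps, simply touching a dependency is
enough to trigger a rebuild, even if its contents are unchanged.&lt;/p&gt;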
&lt;p&gt;I still use Make for the simplest redundant file tasks: usually downloading
data from the web, and doing some minimal pre-processing. For example, in a
recent project I wanted to download the genome of &lt;em&gt;Drosophila melanogaster&lt;/em&gt; and
create a file containing the lengths of all the sequences using
&lt;a href=&#34;https://github.com/lh3/bioawk&#34;&gt;bioawk&lt;/a&gt;
. Here&amp;rsquo;s what a Makefile running these
steps looks like&lt;sup&gt;2&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;2&lt;/sup&gt; All code for these examples is available in the &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/&#34;&gt;Github
repository for this
tutorial&lt;/a&gt;
.  You can find the
code for this example
&lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-01/Makefile&#34;&gt;here&lt;/a&gt;
.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-Makefile&#34; data-lang=&#34;Makefile&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dmel_BDGP&lt;/span&gt;6.28&lt;span class=&#34;n&#34;&gt;_seqlens&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;tsv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	wget ftp://ftp.ensembl.org/pub/release-99/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Drosophila_melanogaster&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;BDGP&lt;/span&gt;6.28.&lt;span class=&#34;n&#34;&gt;dna&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;toplevel&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;fa&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;gz&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;c1&#34;&gt;# note we need the double dollar signs here, since the $ indicates a variable in Make&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	bioawk -c fastx &lt;span class=&#34;s1&#34;&gt;&amp;#39;{print $$name &amp;#34;\t&amp;#34; length($$seq)}&amp;#39;&lt;/span&gt; Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz &amp;gt; Dmel_BDGP6.28_seqlens.tsv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can give this a try for yourself by copying and pasting this into a file
called &lt;code&gt;Makefile&lt;/code&gt; in an empty directory (or downloading it from
&lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-01/Makefile&#34;&gt;GitHub&lt;/a&gt;
)
and typing:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make all&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This looks for a file named &lt;code&gt;Makefile&lt;/code&gt;, and runs the &lt;code&gt;all&lt;/code&gt; target. Note that
Make is a &lt;a href=&#34;https://en.wikipedia.org/wiki/Declarative_language&#34;&gt;declarative
language&lt;/a&gt;
, meaning it
doesn&amp;rsquo;t execute code from top to bottom like a normal program&amp;rsquo;s control flow.
Instead you declare what needs to be done, and it executes things in the right
order.  Stepping through the Makefile code, there are three rules, which I&amp;rsquo;ll
explain in the order Make works through them:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt;, which is the primary target. It&amp;rsquo;s the starting place; typing &lt;code&gt;make all&lt;/code&gt; tells Make to create &lt;code&gt;all&lt;/code&gt;&amp;rsquo;s dependencies. In this case, there&amp;rsquo;s only one
dependency: &lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now, Make is looking for &lt;code&gt;all&lt;/code&gt;&amp;rsquo;s dependency, &lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;.
Since this file does not exist, Make looks for a rule to create this target. It
finds this rule, but this rule requires the file
&lt;code&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/code&gt;. This file isn&amp;rsquo;t in this
directory yet either, so Make goes looking for a rule to build that.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The second rule (after &lt;code&gt;all&lt;/code&gt;) declares how to generate
&lt;code&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/code&gt;. This target requires
no dependencies, so Make proceeds straight into executing the rule: use
&lt;code&gt;wget&lt;/code&gt; to download a file from the web.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once this file downloads, the dependency for the &lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;
file is available. Make, working backwards, runs this rule now, calling bioawk
to summarize the sequence lengths of this genome. This generates the file
&lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, Make has satisfied the dependency of &lt;code&gt;all&lt;/code&gt;: &lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;.
Since it&amp;rsquo;s got everything it needs, it quits.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
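&lt;p&gt;If it helps to see the walk-through above as code, here is a minimal Python sketch of the
same idea: visit a target&amp;rsquo;s prerequisites first, then the target itself. The &lt;code&gt;rules&lt;/code&gt;
dictionary and the &lt;code&gt;build_order()&lt;/code&gt; helper are my own illustration, not how Make is
implemented, and the sketch assumes none of the files exist yet.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Toy model of the Makefile above: each target maps to its prerequisites.
rules = {
    "all": ["Dmel_BDGP6.28_seqlens.tsv"],
    "Dmel_BDGP6.28_seqlens.tsv": ["Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz"],
    "Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz": [],
}


def build_order(target, rules, done=None):
    """Return targets in the order their recipes would run (prerequisites first)."""
    if done is None:
        done = []
    # Recurse into every prerequisite before handling the target itself.
    for prereq in rules.get(target, []):
        build_order(prereq, rules, done)
    if target not in done:
        done.append(target)
    return done


order = build_order("all", rules)
print(order)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints the genome file first, then the sequence lengths file, and &lt;code&gt;all&lt;/code&gt; last,
which matches the order of the steps above.&lt;/p&gt;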
&lt;p&gt;The real magic is what happens if we delete &lt;code&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/code&gt;, or
change &lt;code&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/code&gt;. &lt;strong&gt;Unlike a bash
script, which will re-run everything start to finish, Make will only re-run
what it needs to build files depending on the changed files&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ rm Dmel_BDGP6.28_seqlens.tsv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make all  &lt;span class=&#34;c1&#34;&gt;# since the genome hasn&amp;#39;t been changed or deleted, only the last rule is run!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bioawk -c fastx &lt;span class=&#34;s1&#34;&gt;&amp;#39;{print $name &amp;#34;\t&amp;#34; length($seq)}&amp;#39;&lt;/span&gt; Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz &amp;gt; Dmel_BDGP6.28_seqlens.tsv&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We can emulate changing an input file in this example by using &lt;code&gt;touch&lt;/code&gt; to
change the timestamp of the
&lt;code&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/code&gt; file:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ touch  Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make all  &lt;span class=&#34;c1&#34;&gt;# runs all downstream steps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bioawk -c fastx &lt;span class=&#34;s1&#34;&gt;&amp;#39;{print $name &amp;#34;\t&amp;#34; length($seq)}&amp;#39;&lt;/span&gt; Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz &amp;gt; Dmel_BDGP6.28_seqlens.tsv&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
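&lt;p&gt;Both examples above come down to comparing file modification times: a target is rebuilt
if it is missing or older than one of its dependencies. Here is a rough Python sketch of
that check. The &lt;code&gt;needs_rebuild()&lt;/code&gt; helper and the short filenames are hypothetical
stand-ins, and real Make handles more cases than this.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    # max() returns the first maximal item, so a timestamp tie counts as up to date.
    newest = max([target] + list(prerequisites), key=os.path.getmtime)
    return newest != target


with tempfile.TemporaryDirectory() as tmp:
    genome = os.path.join(tmp, "genome.fa.gz")   # stand-in for the downloaded genome
    seqlens = os.path.join(tmp, "seqlens.tsv")   # stand-in for the summary file

    open(genome, "w").close()
    os.utime(genome, (1000, 1000))               # set explicit (atime, mtime)
    missing = needs_rebuild(seqlens, [genome])   # target does not exist yet

    open(seqlens, "w").close()
    os.utime(seqlens, (2000, 2000))              # target is newer than its input
    fresh = needs_rebuild(seqlens, [genome])

    os.utime(genome, (3000, 3000))               # same effect as `touch genome.fa.gz`
    touched = needs_rebuild(seqlens, [genome])

print(missing, fresh, touched)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints &lt;code&gt;True False True&lt;/code&gt;: the target is built when missing, left alone when it
is newer than its input, and rebuilt after the input is touched.&lt;/p&gt;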
&lt;p&gt;Make gets a lot, &lt;em&gt;lot&lt;/em&gt; more complicated than this simple example. The language
is rich, but it&amp;rsquo;s a rather tedious language for all but the simplest tasks. As
you use Make more, I&amp;rsquo;d recommend learning more about its &lt;a href=&#34;https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html&#34;&gt;automatic
variables&lt;/a&gt;
,
which allow us to avoid redundantly typing out target and dependency filenames.
The two I use most are &lt;code&gt;$@&lt;/code&gt;, which is a placeholder for the filename of the
target, and &lt;code&gt;$&amp;lt;&lt;/code&gt;, the name of the &lt;em&gt;first&lt;/em&gt; prerequisite. This would simplify our
earlier &lt;code&gt;Makefile&lt;/code&gt; like so&lt;sup&gt;3&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;3&lt;/sup&gt;
The code for this Makefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-02/Makefile&#34;&gt;available here&lt;/a&gt;
.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-Makefile&#34; data-lang=&#34;Makefile&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Dmel_BDGP&lt;/span&gt;6.28&lt;span class=&#34;n&#34;&gt;_seqlens&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;tsv&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  wget ftp://ftp.ensembl.org/pub/release-99/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;Dmel_BDGP6.28_seqlens.tsv&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Drosophila_melanogaster&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;BDGP&lt;/span&gt;6.28.&lt;span class=&#34;n&#34;&gt;dna&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;toplevel&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;fa&lt;/span&gt;.&lt;span class=&#34;n&#34;&gt;gz&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  bioawk -c fastx &lt;span class=&#34;s1&#34;&gt;&amp;#39;{print $$name &amp;#34;\t&amp;#34; length($$seq)}&amp;#39;&lt;/span&gt; $&amp;lt; &amp;gt; &lt;span class=&#34;nv&#34;&gt;$@&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Finally, it&amp;rsquo;s worth mentioning that Make is trivially parallelizable. Since
Makefiles describe the chain of rules needed to create a file, &lt;em&gt;independent
chains can be run across different cores simultaneously&lt;/em&gt;. The example above
is too simple to run steps in parallel, but if we did have independent chains,
we&amp;rsquo;d run them in parallel with:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ make -j &lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;  &lt;span class=&#34;c1&#34;&gt;# run the Makefile on 4 cores&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Make is definitely finicky and old; after all, it first emerged in 1976. I still
use Make for the simplest of data downloading and pre-processing tasks, much
like &lt;a href=&#34;https://bost.ocks.org/mike/make/&#34;&gt;Mike Bostock describes&lt;/a&gt;
.  While there
was a time when I would dig deep into the Make documentation, using Make&amp;rsquo;s
functions to write complicated Makefiles to process hordes of data, now I
prefer Snakemake.&lt;/p&gt;
&lt;h2&gt;Snakemake&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;snakemake&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#snakemake&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Snakemake is a new, Python-based build automation software program. Unlike
Make, which was intended to be used to automate compiling software, Snakemake&amp;rsquo;s
explicit intention is to automate command line data processing tasks, such as
those common in bioinformatics. You can install Snakemake with Conda
(&lt;a href=&#34;https://snakemake.readthedocs.io/en/stable/getting_started/installation.html&#34;&gt;instructions
here&lt;/a&gt;
).
Much like Make, running the command line program &lt;code&gt;snakemake&lt;/code&gt; looks for a
Snakefile, named &lt;code&gt;Snakefile&lt;/code&gt; in the directory. And much like Make, the format
of the Snakefile has rules defined by targets (known in Snakemake as
&lt;strong&gt;outputs&lt;/strong&gt;), dependencies (Snakemake calls these &lt;strong&gt;inputs&lt;/strong&gt;), and commands (and a
lot more is possible here with Snakemake, as we&amp;rsquo;ll see). Let&amp;rsquo;s translate our
earlier &lt;code&gt;Makefile&lt;/code&gt; to a &lt;code&gt;Snakefile&lt;/code&gt;&lt;sup&gt;4&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;4&lt;/sup&gt;
The code for this Snakefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-03/Snakefile&#34;&gt;available here&lt;/a&gt;
.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-Makefile&#34; data-lang=&#34;Makefile&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;rule all&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  input:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;Dmel_BDGP6.28_seqlens.tsv&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;rule genome&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  output: 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  shell: 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	  &lt;span class=&#34;s2&#34;&gt;&amp;#34;wget ftp://ftp.ensembl.org/pub/release-99/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;rule seqlens&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  input:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  output:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;Dmel_BDGP6.28_seqlens.tsv&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  shell:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	  &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;#34;&amp;#34;bioawk -c fastx &amp;#39;{{print &lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$name&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt;&lt;span class=&#34;se&#34;&gt;\t&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34; length(&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$seq&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;)}}&amp;#39; {input} &amp;gt; {output}&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The key changes are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Slightly different rule format, and all rules are named (e.g.  &lt;code&gt;all&lt;/code&gt;, &lt;code&gt;genome&lt;/code&gt;, etc.).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Filenames and shell commands must be quoted (note that Python&amp;rsquo;s triple
quotes can be used to avoid escaping quotes in cases where single and double
quotes are used in the rule).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rules running shell commands are specified in &lt;code&gt;shell&lt;/code&gt; blocks. Snakemake also
supports running Python in &lt;code&gt;run&lt;/code&gt; blocks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rather than awkward special variables like Make&amp;rsquo;s &lt;code&gt;$@&lt;/code&gt; and &lt;code&gt;$&amp;lt;&lt;/code&gt;, Snakemake
uses Python&amp;rsquo;s formatted strings (i.e. the braces in the last line) and clear
names like &lt;code&gt;{input}&lt;/code&gt; and &lt;code&gt;{output}&lt;/code&gt;. However, since braces are now special
in Snakemake, we need to &lt;em&gt;escape&lt;/em&gt; them when using them in our bioawk line; a
literal brace is specified by using two of them, e.g. &lt;code&gt;{{&lt;/code&gt; and &lt;code&gt;}}&lt;/code&gt; (this is
analogous to how we had to escape the &lt;code&gt;$&lt;/code&gt; in Make by using &lt;code&gt;$$&lt;/code&gt;!).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
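&lt;p&gt;The brace escaping in the last point is easy to try in plain Python, since
&lt;code&gt;str.format()&lt;/code&gt; follows the same doubling convention. This is only a sketch:
the command and filenames below are made up for illustration, and Snakemake&amp;rsquo;s own
formatting of &lt;code&gt;shell&lt;/code&gt; strings does more than this one call.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;# Doubled braces survive formatting as literal braces;
# single braces are placeholders that get filled in.
template = "bioawk -c fastx '{{print $name}}' {input} | sort -o {output}"

# Hypothetical filenames, standing in for a rule's input and output.
command = template.format(input="genome.fa.gz", output="names.txt")
print(command)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The printed command is &lt;code&gt;bioawk -c fastx '{print $name}' genome.fa.gz | sort -o names.txt&lt;/code&gt;:
the doubled braces have collapsed to single ones, and the dollar sign passes through untouched.&lt;/p&gt;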
&lt;p&gt;This is then executed much like a Makefile. Before executing it though, let&amp;rsquo;s
do a dry run with &lt;code&gt;snakemake --dryrun&lt;/code&gt; or &lt;code&gt;snakemake -n&lt;/code&gt;. This doesn&amp;rsquo;t execute
any steps; it just shows what Snakemake would do if run&lt;sup&gt;5&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;5&lt;/sup&gt;
Make does have a dry run option too, by the way; it&amp;rsquo;s &lt;code&gt;make --dry-run&lt;/code&gt; or &lt;code&gt;make -n&lt;/code&gt;.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ snakemake --dryrun
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Building DAG of jobs...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Job counts:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	count	&lt;span class=&#34;nb&#34;&gt;jobs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	all
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	genome
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	seqlens
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;Thu Mar  &lt;span class=&#34;m&#34;&gt;5&lt;/span&gt; 11:37:28 2020&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rule genome:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    output: Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    jobid: &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;Thu Mar  &lt;span class=&#34;m&#34;&gt;5&lt;/span&gt; 11:37:28 2020&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rule seqlens:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    input: Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    output: Dmel_BDGP6.28_seqlens.tsv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    jobid: &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;Thu Mar  &lt;span class=&#34;m&#34;&gt;5&lt;/span&gt; 11:37:28 2020&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;localrule all:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    input: Dmel_BDGP6.28_seqlens.tsv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    jobid: &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Job counts:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	count	&lt;span class=&#34;nb&#34;&gt;jobs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	all
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	genome
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	1	seqlens
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;	&lt;span class=&#34;m&#34;&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;This was a dry-run &lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;flag -n&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt;. The order of &lt;span class=&#34;nb&#34;&gt;jobs&lt;/span&gt; does not reflect the order of execution.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, let&amp;rsquo;s execute these steps &amp;ndash; on the command line, enter:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ snakemake&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Unlike Make, Snakemake has really nice progress reporting (I&amp;rsquo;ve omitted this
output above for brevity).&lt;/p&gt;
&lt;h3&gt;Using Expand to Build up all Filenames and Parameter Combinations&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;using-expand-to-build-up-all-filesnames-and-parameter-combinations&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#using-expand-to-build-up-all-filesnames-and-parameter-combinations&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The real strength of Snakemake is how easy it makes applying rules across
multiple files that share a similar filename structure (this is why it is so
important to have a consistent file name scheme!). I&amp;rsquo;ll demonstrate this
incrementally with a few Snakefiles. Because we can program in Python with
Snakemake, we can see what&amp;rsquo;s happening with &lt;code&gt;print()&lt;/code&gt; statements.
First consider this &lt;code&gt;Snakefile&lt;/code&gt;&lt;sup&gt;6&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;6&lt;/sup&gt;
The code for this Snakefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-04/Snakefile&#34;&gt;available here&lt;/a&gt;
.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;chrom_filename&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Drosophila_melanogaster.BDGP6.28.dna.chromosome.&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{chrom}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;.fa.gz&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;chroms&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;2L&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;2R&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;3L&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;3R&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;X&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;4&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;chrom_fa_files&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;expand&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom_filename&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chroms&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom_fa_files&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, rather than downloading the entire &lt;em&gt;Drosophila melanogaster&lt;/em&gt; genome, we&amp;rsquo;re
going to download some individual chromosome sequences&lt;sup&gt;7&lt;/sup&gt;. We exploit
these well-named files, and build all the chromosome sequence filenames that
need to be downloaded &lt;em&gt;programmatically&lt;/em&gt; with Snakemake&amp;rsquo;s powerful &lt;code&gt;expand()&lt;/code&gt;
function. &lt;code&gt;expand()&lt;/code&gt; builds a list of strings by replacing the string
&lt;code&gt;{chrom}&lt;/code&gt; in &lt;code&gt;chrom_filename&lt;/code&gt; with each of the chromosome names in the &lt;code&gt;chroms&lt;/code&gt;
list. We use &lt;code&gt;print()&lt;/code&gt; on the last line to look at the resulting list of filenames.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;7&lt;/sup&gt; Note that the consistent naming of these chromosome sequences
(see the &lt;a href=&#34;#ZgotmplZ&#34;&gt;FTP
page&lt;/a&gt;)
is what makes automating this task possible.
&lt;/aside&gt;
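&lt;p&gt;If &lt;code&gt;expand()&lt;/code&gt; feels opaque, it can help to see what it amounts to for a single list. Here&amp;rsquo;s a minimal sketch in plain Python that should build the same filenames with &lt;code&gt;str.format()&lt;/code&gt;. It only imitates this one-list case and is not how Snakemake itself implements &lt;code&gt;expand()&lt;/code&gt;:&lt;/p&gt;

```python
# Plain-Python imitation of the single-list expand() call above.
# This is an illustrative sketch, not Snakemake's own code.
chrom_filename = "Drosophila_melanogaster.BDGP6.28.dna.chromosome.{chrom}.fa.gz"

chroms = ["2L", "2R", "3L", "3R", "X", "4"]

# Substitute each chromosome name into the {chrom} placeholder.
chrom_fa_files = [chrom_filename.format(chrom=chrom) for chrom in chroms]

for filename in chrom_fa_files:
    print(filename)
```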

&lt;p&gt;While above we used &lt;code&gt;expand()&lt;/code&gt; to build up &lt;code&gt;chrom_fa_files&lt;/code&gt;, populating it with
values from just the &lt;code&gt;chroms&lt;/code&gt; list, it works with more than one input list too,
and generates all combinations (a &lt;a href=&#34;https://en.wikipedia.org/wiki/Cartesian_product&#34;&gt;Cartesian
Product&lt;/a&gt;) of the input values.
This makes &lt;code&gt;expand()&lt;/code&gt; exceedingly powerful because it can be used to build up
all possible parameter combinations for a series of simulations. Consider&lt;sup&gt;8&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;8&lt;/sup&gt;
The code for this Snakefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-05/Snakefile&#34;&gt;available here&lt;/a&gt;.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;Ns&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;selcoefs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linspace&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rbps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linspace&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;nreps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arange&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sim_results_pattern&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;sim_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{N}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;N_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{selcoef}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;s_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{rbp}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;rbp_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{rep}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;rep.tsv&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sim_results&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;expand&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sim_results_pattern&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                     &lt;span class=&#34;n&#34;&gt;N&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Ns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selcoef&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;selcoefs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                                          &lt;span class=&#34;n&#34;&gt;rbp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rbps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nreps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;print&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sim_results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Running this, we see we have a list of all results files for all parameter
combinations:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ snakemake
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_0rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_1rep.tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_2rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_3rep.tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_4rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.001s_1e-08rbp_5rep.tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ... 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_14rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_15rep.tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_16rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_17rep.tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_18rep.tsv&amp;#39;&lt;/span&gt;, &lt;span class=&#34;s1&#34;&gt;&amp;#39;sim_100N_0.1s_1e-07rbp_19rep.tsv&amp;#39;&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Building DAG of jobs...
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Nothing to be &lt;span class=&#34;k&#34;&gt;done&lt;/span&gt;.
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Complete log: /Users/vinceb/projects/snakemake-tutorial/example-05/.snakemake/log/2020-03-05T190027.307441.snakemake.log&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
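&lt;p&gt;To see where these combinations come from, here&amp;rsquo;s a rough plain-Python imitation built on &lt;code&gt;itertools.product()&lt;/code&gt;. The helper name &lt;code&gt;expand_sketch()&lt;/code&gt; is my own (hypothetical), and I&amp;rsquo;ve written the parameter values out by hand rather than computing them with NumPy. Treat it as a sketch of the idea, not Snakemake&amp;rsquo;s implementation:&lt;/p&gt;

```python
from itertools import product


def expand_sketch(pattern, **lists):
    """Fill `pattern` with every combination of the keyword lists.

    Hypothetical helper that imitates the multi-list behavior of expand().
    """
    names = list(lists)
    # Cartesian product of all input lists; the last list varies fastest.
    combos = product(*(lists[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


pattern = "sim_{N}N_{selcoef}s_{rbp}rbp_{rep}rep.tsv"

files = expand_sketch(pattern,
                      N=[100],
                      selcoef=[0.001, 0.01, 0.1],
                      rbp=[1e-08, 1e-07],
                      rep=range(20))

# 1 population size x 3 selection coefficients x 2 recombination rates
# x 20 replicates
print(len(files))
print(files[0])
print(files[-1])
```

&lt;p&gt;The number of files is just the product of the list lengths, so it grows quickly as you add parameters or values.&lt;/p&gt;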
&lt;p&gt;There&amp;rsquo;s nothing special about my filename scheme here, though it is one
that I often use. The filenames could be parsed by downstream programs so the
parameters are known, but I usually prefer to have the simulation software
write a metadata string at the top of the results files (e.g. as a comment line
in a TSV/CSV beginning with &lt;code&gt;#&lt;/code&gt;).&lt;/p&gt;
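&lt;p&gt;As an illustration of the metadata-line idea, here&amp;rsquo;s a small sketch that writes the parameters as a &lt;code&gt;#&lt;/code&gt; comment at the top of a TSV and reads them back. The &lt;code&gt;key=value;key=value&lt;/code&gt; layout and the function names &lt;code&gt;write_results()&lt;/code&gt; and &lt;code&gt;read_params()&lt;/code&gt; are assumptions I made up for this example, and the result rows are placeholder values, not real simulation output:&lt;/p&gt;

```python
import csv
import io


def write_results(handle, params, rows):
    """Write a '#key=value;...' metadata line, then the TSV rows."""
    meta = ";".join(f"{key}={value}" for key, value in params.items())
    handle.write(f"#{meta}\n")
    writer = csv.writer(handle, delimiter="\t")
    writer.writerows(rows)


def read_params(handle):
    """Parse the metadata line back into a dict of strings."""
    first_line = handle.readline().rstrip("\n")
    pairs = first_line.lstrip("#").split(";")
    return dict(pair.split("=") for pair in pairs)


# An in-memory file stands in for a results file on disk.
buffer = io.StringIO()
write_results(buffer,
              {"N": 100, "s": 0.001, "rbp": 1e-08, "rep": 0},
              [["generation", "frequency"], [1, 0.05]])  # placeholder rows

buffer.seek(0)
params = read_params(buffer)
print(params)
```

&lt;p&gt;Note that the values come back as strings, so a downstream script would need to convert them to numbers itself.&lt;/p&gt;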
&lt;h3&gt;Using Wildcards&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;using-wildcards&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#using-wildcards&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;In the previous section, we automatically created a bunch of filenames of
simulation results we want after running all our simulations, representing all
combinations of parameters. Now, we need to write a rule that describes how to
actually run all these simulations and pass the appropriate parameters to the
command line tool responsible for running the simulations.  What&amp;rsquo;s elegant
about Snakemake is that since each file is the result of running a simulation
once with particular parameters, we can write one special general rule that
describes how to generate all the simulation results. The trick to do this is
to use &lt;strong&gt;wildcards&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Understanding wildcards was, for me, the hardest part of understanding
Snakemake. They&amp;rsquo;re just
&lt;a href=&#34;https://en.wikipedia.org/wiki/Magic_%28programming%29&#34;&gt;magic&lt;/a&gt;
 enough to be
confusing, but also really useful. The best way to grok wildcards is to
understand that they match parts of a rule&amp;rsquo;s &lt;strong&gt;output&lt;/strong&gt; file. I think it&amp;rsquo;s
easier to explain this through a simple example, after which we&amp;rsquo;ll continue the
simulation example described above.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a simple example of wildcards&lt;sup&gt;9&lt;/sup&gt;:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;9&lt;/sup&gt;
The code for this Snakefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-06/Snakefile&#34;&gt;available here&lt;/a&gt;.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;results&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;file_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{sample}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;.txt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;all_results&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;expand&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;results&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sample&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rule&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;all_results&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rule&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sims&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;output&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;s2&#34;&gt;&amp;#34;file_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{sample_name}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;.txt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;run&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;output&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;],&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;the sample name is &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;wildcards&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sample_name&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that rather than using a &lt;code&gt;shell&lt;/code&gt; block, we&amp;rsquo;re using a &lt;code&gt;run&lt;/code&gt; block, which
is just pure Python code &amp;ndash; this is a beautiful feature of Snakemake. In this
block, the variable &lt;code&gt;output&lt;/code&gt; is automatically set by Snakemake, and is a list of
all of the rule&amp;rsquo;s output files. However, since we&amp;rsquo;re using wildcards, Snakemake is passing
in the files from &lt;code&gt;all_results&lt;/code&gt; &lt;em&gt;one at a time&lt;/em&gt;, so this list contains just a
single file. We grab the only file in the list, &lt;code&gt;output[0]&lt;/code&gt;, and open it for
writing. In that file, we write the contents of &lt;code&gt;{wildcards.sample_name}&lt;/code&gt;,
which Snakemake also automatically sets for each output filename.&lt;/p&gt;
&lt;p&gt;If this is still unclear, it&amp;rsquo;s important to remember that Snakemake is working
&lt;em&gt;backwards&lt;/em&gt;. The &lt;code&gt;all&lt;/code&gt; target is considered first, and Snakemake looks at this rule&amp;rsquo;s
inputs: the list of files in &lt;code&gt;all_results&lt;/code&gt;. Then, since these files don&amp;rsquo;t
exist, Snakemake looks for a rule to generate them. The &lt;code&gt;sims&lt;/code&gt; rule&amp;rsquo;s output
matches the filenames needed &amp;ndash; &lt;code&gt;&amp;quot;file_{sample_name}.txt&amp;quot;&lt;/code&gt; is treated like
&lt;code&gt;&amp;quot;file_*.txt&amp;quot;&lt;/code&gt; would be by a Unix shell. The difference is that the matched section is
assigned to &lt;code&gt;wildcards.sample_name&lt;/code&gt; and can be used by the rule&amp;rsquo;s &lt;code&gt;shell&lt;/code&gt; or
&lt;code&gt;run&lt;/code&gt; block.&lt;/p&gt;
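&lt;p&gt;One way to picture this matching step is as a regular expression with one capture group per wildcard. Below is a minimal sketch of that idea in plain Python. The function &lt;code&gt;match_wildcards()&lt;/code&gt; is a hypothetical helper of mine, and I&amp;rsquo;m assuming each wildcard matches one or more of any character (which, as far as I know, is Snakemake&amp;rsquo;s default). It is not Snakemake&amp;rsquo;s actual matching code:&lt;/p&gt;

```python
import re


def match_wildcards(pattern, filename):
    """Return a dict of wildcard values if filename fits pattern, else None.

    Hypothetical helper sketching how an output pattern could be matched.
    """
    # Names inside curly braces, e.g. ["sample_name"].
    names = re.findall(r"\{(\w+)\}", pattern)
    # Literal text between wildcards, escaped so "." matches only a dot.
    literals = [re.escape(part) for part in re.split(r"\{\w+\}", pattern)]
    # Each wildcard becomes a capture group matching one or more characters.
    regex = "(.+)".join(literals)
    match = re.fullmatch(regex, filename)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


print(match_wildcards("file_{sample_name}.txt", "file_2.txt"))
print(match_wildcards("file_{sample_name}.txt", "notes.md"))
```

&lt;p&gt;The first call finds a value for &lt;code&gt;sample_name&lt;/code&gt;, much like what ends up in &lt;code&gt;wildcards.sample_name&lt;/code&gt;; the second doesn&amp;rsquo;t fit the pattern at all, so a rule with that output couldn&amp;rsquo;t be used to make the file.&lt;/p&gt;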
&lt;p&gt;With this simple example hopefully making wildcards clearer, let&amp;rsquo;s continue our
simulation example. For this example, I use the population genetics forward
simulation software &lt;a href=&#34;https://messerlab.org/slim/&#34;&gt;SLiM from the Messer Lab&lt;/a&gt;,
but the basic idea extends broadly to bioinformatics and data science tasks.
I&amp;rsquo;m simulating evolution of a stretch of chromosome, where selected mutations
pop up in the population, but only in a small region (emulating a gene) in the
middle of the chromosome. The details of the &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-07/sim.slim&#34;&gt;simulation are on
GitHub&lt;/a&gt;,
and I use the Snakemake file to try different parameters, in this case
selection coefficients and the level of recombination. I also use Snakemake to
generate a lot of independent replicate results. The Snakemake
file&lt;sup&gt;10&lt;/sup&gt; looks like:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;10&lt;/sup&gt;
The code for this Snakefile is &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-07/Snakefile&#34;&gt;available here&lt;/a&gt;.
&lt;/aside&gt;

&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;Ns&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;100&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;selcoefs&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linspace&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rbps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;10&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;**&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;linspace&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;7&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;nreps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;arange&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;40&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sim_results_pattern&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;sim_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{N}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;N_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{selcoef}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;s_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{rbp}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;rbp_&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{rep}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;rep.tsv&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;sim_results&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;expand&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sim_results_pattern&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                     &lt;span class=&#34;n&#34;&gt;N&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Ns&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;selcoef&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;selcoefs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                     &lt;span class=&#34;n&#34;&gt;rbp&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rbps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nreps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rule&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sim_results&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;rule&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sims&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;nb&#34;&gt;input&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;output&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;sim_results_pattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;shell&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;# split across two lines, to make this easier to fit on screen:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;slim -d s=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{wildcards.selcoef}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; -d rbp=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{wildcards.rbp}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;#34;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;s2&#34;&gt;&amp;#34;-d N=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{wildcards.N}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; -d rep=&lt;/span&gt;&lt;span class=&#34;si&#34;&gt;{wildcards.rep}&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; sim.slim&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, Snakemake captures the wildcards and passes them directly into the
command line call to SLiM. We can run this across four cores with:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ snakemake --cores &lt;span class=&#34;m&#34;&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
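&lt;p&gt;To make the wildcard substitution concrete, here is a minimal Python sketch of how a command pattern like the one above gets filled in. It uses plain &lt;code&gt;str.format&lt;/code&gt; with a stand-in &lt;code&gt;wildcards&lt;/code&gt; object rather than Snakemake itself; the &lt;code&gt;fill_command&lt;/code&gt; helper and the parameter values are made up for illustration:&lt;/p&gt;

```python
from types import SimpleNamespace

# The same command template as in the rule's shell directive.
TEMPLATE = ("slim -d s={wildcards.selcoef} -d rbp={wildcards.rbp} "
            "-d N={wildcards.N} -d rep={wildcards.rep} sim.slim")


def fill_command(selcoef, rbp, N, rep):
    """Mimic (not reproduce) Snakemake's wildcard substitution."""
    # Hypothetical stand-in for the wildcards object Snakemake builds
    # after matching an output filename against the rule's pattern.
    wildcards = SimpleNamespace(selcoef=selcoef, rbp=rbp, N=N, rep=rep)
    return TEMPLATE.format(wildcards=wildcards)


# Illustrative values only; they are not taken from the tutorial.
print(fill_command(selcoef="0.1", rbp="1e-8", N="100", rep="0"))
```

&lt;p&gt;Running the real workflow performs this kind of substitution once for every combination of wildcard values requested.&lt;/p&gt;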
&lt;p&gt;This creates a &lt;em&gt;lot&lt;/em&gt; of simulation results. Processing these files isn&amp;rsquo;t within
the scope of this tutorial, but you can see the R script I used to do so &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-07/process_sims.r&#34;&gt;on
GitHub&lt;/a&gt;
.
Additionally, I&amp;rsquo;ve included &lt;a href=&#34;https://github.com/vsbuffalo/snakemake-tutorial/blob/master/example-07/Snakefile_plot&#34;&gt;another Snakemake file for this
example&lt;/a&gt;

showing how Snakemake can also be used to run scripts to make figures, using
the simulation results generated by another part of Snakemake. It&amp;rsquo;s Snakes all
the way down! Our final result is a figure:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/snakemake_sims.png&#34; alt=&#34;SLiM simulation results showing the effect of recurrent sweeps.&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;figcaption&gt;
	SLiM simulation results showing the effects of recurrent sweeps. The results
  are noisy because only 40 simulations were averaged over, and the population
  size is rather small (N = 100). Still, one sees the effect of increasing
  recombination (weak sweep effect) and changing the selection coefficient.
&lt;/figcaption&gt;
&lt;h2&gt;Future&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;future&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#future&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I hope this has convinced you that Snakemake is a powerful tool that should be
in your computational toolbox, and that it is now clearer how some of its more powerful
features (&lt;code&gt;expand()&lt;/code&gt; and wildcards) work. For what it&amp;rsquo;s worth, I
still use Make for small tasks, and will continue to do so.&lt;/p&gt;
&lt;p&gt;While I use Snakemake a fair amount now, I expect this to continue to increase.
Why? Once you become acquainted with Snakemake, you start to see more and
more areas in a project where you can use it (e.g. generating figures, parsing
and collating raw data, running unit tests). Snakemake becomes fun to use because
it prevents the monotony of running the same steps repeatedly in a project. I
think too many years coding up analyses have made me realize that 70% of the
computational work of an analysis or project is the same as every other
project. This shared component of computational work is not intellectually
stimulating, and if some tool or library can serve as a higher-level
abstraction that makes these repetitive tasks easier, it frees up time to work
on intellectually stimulating parts of a project &amp;ndash; the stuff I enjoy more.
Hopefully Snakemake will help you do more monotonous data tasks in less time
with less effort too.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>A Genealogical Look at Shared Ancestry on the X Chromosome</title>
      <link>https://vincebuffalo.com/blog/a-genealogical-look-at-shared-ancestry-on-the-x-chromosome/</link>
      <pubDate>Sun, 03 Apr 2016 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/a-genealogical-look-at-shared-ancestry-on-the-x-chromosome/</guid>
      <description>
        
        
        &lt;style type=&#34;text/css&#34; media=&#34;screen&#34;&gt;
  #mainsvg {
    margin-left: auto;
    margin-right: auto;
    display: block;
  }
  .arc-male {
    fill: #43a2ca;
  }
  .arc-female {
    fill: #de2d26;
  }
  .highlighted {
    fill: #333; 
  }
  .arc-text {
    display: block;
    height: 1em;
    margin-top: 1em;
    font-family: Helvetica;
    color: #333;
    text-align: center;
  }
  #src {
    margin-top: 4em;
    font-family: Helvetica;
    text-align: center;
    color: #333;
  }
  .chrom-female {
   fill: #ddd;
  }
  .chrom-male {
    fill: #bbb;
  }
  .chrom-bg {
    fill: #fff;
  }
  .mum-segment {
    fill: #d4151d;
  }
  .dad-segment {
    fill: #3790be;
  }
&lt;/style&gt;
&lt;div id=&#34;xshared&#34;&gt;&lt;/div&gt;&lt;figcaption&gt;An example of a present-day female&#39;s X
material being broken up across her X ancestors in her X genealogy back through
the generations.&lt;/figcaption&gt;
&lt;p&gt;My article with Steve Mount and Graham Coop, &lt;em&gt;&lt;a href=&#34;https://www.genetics.org/content/204/1/57&#34;&gt;A Genealogical Look at Shared
Ancestry on the X Chromosome &lt;/a&gt;
&lt;/em&gt; has
been published in &lt;em&gt;Genetics&lt;/em&gt;. In the spirit of both outreach and continuing
Graham&amp;rsquo;s terrific series of blog posts&lt;sup&gt;1&lt;/sup&gt; on genetic genealogy, I&amp;rsquo;m
writing about our paper on X chromosome genealogy and recent ancestry. Before
diving into the details of X chromosome ancestry work, I&amp;rsquo;ll review the concepts
of genealogies and ancestry. Then, in the next section, we&amp;rsquo;ll look at how one&amp;rsquo;s
genetic ancestors (the subset of ancestors that you share genetic material
with) vary back through the generations. With these concepts reviewed, we&amp;rsquo;ll
look at the genealogy that includes all of our &lt;em&gt;X ancestors&lt;/em&gt;, which, due to the
special inheritance pattern of the X chromosome, is only a subset of one&amp;rsquo;s
genealogy. The embedded X genealogy has some properties that impact how
segments of DNA are shared between individuals with recent common ancestry
(e.g. 6&lt;sup&gt;th&lt;/sup&gt; degree cousins), which we look at through a simple
probability model. Finally, we&amp;rsquo;ll look at what we can learn about the
relationships of individuals that share sections of their X chromosome due to
sharing a recent common ancestor.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;1&lt;/sup&gt; For example, see Graham&amp;rsquo;s posts &lt;a href=&#34;http://gcbias.org/2013/12/02/how-many-genomic-blocks-do-you-share-with-a-cousin/&#34;&gt;on how many genomic blocks
you share with a
cousin&lt;/a&gt;
,
&lt;a href=&#34;https://gcbias.org/2013/11/11/how-does-your-number-of-genetic-ancestors-grow-back-over-time/&#34;&gt;how your number of genetic ancestors grows back in
time&lt;/a&gt;
,
and &lt;a href=&#34;https://gcbias.org/2013/11/04/how-much-of-your-genome-do-you-inherit-from-a-particular-ancestor/&#34;&gt;how much of your genome is inherited from a particular
ancestor&lt;/a&gt;.
&lt;/aside&gt;

&lt;h2&gt;Genealogies&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;genealogies&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#genealogies&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Each human, as a member of a sexually reproducing species with two sexes, has two parents.
You have two parents, four grandparents, eight great-grandparents, 16
great-great-grandparents, and in general $2^k$ ancestors $k$ generations back
(your great&lt;sup&gt;$(k-2)$&lt;/sup&gt;-grandparents, for $k \ge 2$).
An example genealogy back five generations is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/genealogy.png&#34; alt=&#34;A genealogy back five generations. k generations back, one has 2^k ancestors. Circles indicate females, and squares males. The shaded individual is a present-day female.&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Of course, these genealogical ancestors are not necessarily all &lt;em&gt;distinct&lt;/em&gt;
individuals; as you go further back through the generations, some of these
$2^k$ individuals aren&amp;rsquo;t unique: they&amp;rsquo;re the same person. Intuitively, this
occurs when one&amp;rsquo;s two parents are actually related some number of generations
back. For example, one&amp;rsquo;s two parents could be 9&lt;sup&gt;th&lt;/sup&gt; degree
cousins; if we assume a generation time of about 30 years, this means these
parents shared an ancestor around 270 years ago. This phenomenon is known as
&lt;em&gt;pedigree collapse&lt;/em&gt;, and it&amp;rsquo;s the same thing as inbreeding. The further back
through the generations you go, the more certain it is that pedigree collapse &lt;em&gt;must&lt;/em&gt; have happened: it&amp;rsquo;s
exceedingly unlikely that your 1,048,576 ancestors 20 generations ago were all
distinct.&lt;sup&gt;2&lt;/sup&gt; While pedigree collapse definitely occurs, throughout the
rest of this blog post (and in our paper) we ignore it, as we model ancestry
that&amp;rsquo;s recent enough that pedigree collapse isn&amp;rsquo;t a large problem.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;2&lt;/sup&gt; Some beautiful probability theory by Chang (1999) has
shown that the &lt;strong&gt;most recent common ancestor&lt;/strong&gt; (the most recent ancestor from whom all
current individuals in the population descend) of a population of size &lt;em&gt;N&lt;/em&gt;
lived approximately $T_N = \log_2(N)$ generations ago. Furthermore, rather amazingly, any
individual living $1.77 \log_2(N)$ generations ago who has &lt;em&gt;any&lt;/em&gt; present-day
descendants is (with very high probability) an ancestor of &lt;em&gt;all&lt;/em&gt; modern-day
individuals.
&lt;/aside&gt;
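&lt;p&gt;As a rough numerical illustration of these quantities, the sketch below computes the number of ancestor slots $2^k$ and the two approximate generation counts from Chang (1999). The function names and the population size are hypothetical, and the formulas are asymptotic results for an idealized randomly mating population, so the numbers should be read as orders of magnitude only:&lt;/p&gt;

```python
import math


def genealogical_slots(k):
    """Number of ancestor slots k generations back, ignoring pedigree collapse."""
    return 2 ** k


def chang_thresholds(pop_size):
    """Approximate generation counts from Chang (1999) for a population of size N.

    Returns (log2(N), 1.77 * log2(N)): roughly when the most recent common
    ancestor lived, and roughly when every individual with any present-day
    descendants is an ancestor of everyone.
    """
    t_mrca = math.log2(pop_size)
    return t_mrca, 1.77 * t_mrca


print(genealogical_slots(20))

# Hypothetical population size, chosen only for illustration.
t_mrca, t_all = chang_thresholds(1_000_000)
print(round(t_mrca, 1), round(t_all, 1))
```

&lt;p&gt;Once $2^k$ exceeds the population size, that is once $k$ exceeds $\log_2(N)$, the ancestor slots cannot all be filled by distinct people, which is another way of seeing why pedigree collapse must occur.&lt;/p&gt;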

&lt;h2&gt;Genetic Ancestry&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;genetic-ancestry&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#genetic-ancestry&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Since each of us has two parents, we receive ½ of our autosomal (i.e. not
including the sex chromosomes) genetic material from each parent. We share ½ of
our genome with our mother, and ½ with our father. Since your mother in turn shares ½
of her genetic material with each of her two parents, you share on average ¼ of your genetic material
with each grandparent. In general, on average you&amp;rsquo;ll share
&lt;em&gt;½&lt;sup&gt;k&lt;/sup&gt;&lt;/em&gt; of your genome with an ancestor $k$ generations in the
past. Since the number of crossovers per chromosome is limited, close relatives
are likely to share large contiguous segments of their genetic material; a
beautiful visualization of this is Morgan&amp;rsquo;s 1916 illustration of crossing over:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/morgan-crossover.svg&#34; alt=&#34;Thomas Hunt Morgan&amp;rsquo;s 1916 illustration of crossing over.&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;When we look at how much DNA two relatives share, we see it occurs in large
blocks like the black and white segments above. For example, using
&lt;a href=&#34;https://www.23andme.com/&#34;&gt;23andme&lt;/a&gt;
&amp;rsquo;s Ancestry Tools I can see how much DNA my
grandmother and I share—around 14 Morgans spread across 21 long segments.
Essentially, the fact that on average only one crossover occurs per
chromosome&lt;sup&gt;3&lt;/sup&gt; per generation limits how much the genome is broken up
through the generations. While on average ¼ of my DNA should be identical to my
grandmother&amp;rsquo;s DNA (we say such genetic material is &lt;strong&gt;identical by descent&lt;/strong&gt;)
there&amp;rsquo;s variance around this ¼ because the genome is of finite length and
recombination is limited. In other words, the fraction of my genome that
derives from my grandmother isn&amp;rsquo;t like randomly sampling 6.6 billion marbles
independently (the number of basepairs in a diploid human genome), a quarter of
which are colored red (i.e. come from my grandmother) and the rest white (i.e.
come from my other ancestors). Rather, a more appropriate model is that these
marbles are connected by string that is cut and reattached (much like Morgan
envisioned in his illustration)—leading recent ancestry to be blocky and
segmented.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;3&lt;/sup&gt; At least one crossover per chromosome pair is generally needed to ensure proper
disjunction during meiosis.
&lt;/aside&gt;

&lt;p&gt;Currently, there are computational methods (e.g. Browning and Browning, 2011)
that take polymorphism datasets and, using probabilistic models, identify large
identical by descent (IBD) regions shared between individuals; it&amp;rsquo;s programs
like these that services like 23andme use to infer how far back your relatives
share ancestry with you. So if we wish to take genomic datasets and understand
the large shared segments between relatives due to their shared
ancestry&lt;sup&gt;4&lt;/sup&gt;, we need a more appropriate mathematical model than the
simple model of sampling marbles. Numerous probabilists and statistical
geneticists have tackled this using probability theory and stochastic processes
(Donnelly, 1983; Huff et al., 2011; Thomas et al., 1994). Some of the
mathematical details are rather complex (leading to fun conceptualizations like
&amp;ldquo;a random walk on a hypercube&amp;rdquo;), but the underlying model can be simplified
considerably.&lt;/p&gt;
&lt;aside&gt;
&lt;p&gt;&lt;sup&gt;4&lt;/sup&gt; These shared blocks are due to recent ancestry; over long
periods of time the genome is eventually broken up into pieces that reflect
only very distant ancestry.&lt;/p&gt;
&lt;p&gt;Also, note that due to the exponentially growing number of
genealogical ancestors we have, we all share some recent ancestry. A
particularly elegant empirical demonstration of this fact comes from &lt;a href=&#34;http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001555&#34;&gt;Ralph and
Coop
(2013)&lt;/a&gt;
.&lt;/p&gt;

&lt;/aside&gt;

&lt;p&gt;Each generation, we can imagine that a random number $B$ of crossovers breaks
the 22 human autosomes, creating $B+22$ segments. As in Morgan&amp;rsquo;s original
illustration (above), this leads to complementary gametes, with alternating
paternal and maternal segments (the black and white segments in the rightmost
figure). Mathematically, tracking these alternating segments is a bit tricky, so
we can approximate the process by imagining that each of the segments is passed
on to the next generation with probability ½, a flip of a fair coin. Since we
don&amp;rsquo;t actually know how many breakpoints have occurred, we model them as a
random process. In our case, we use the Poisson distribution&lt;sup&gt;5&lt;/sup&gt; to
assign a probability to the event that some number of breakpoints $B=b$ occurs.
This idea of using the Poisson distribution to model recombination has a long
history in genetics, going back to Haldane (1919). If we then imagine that this
same process occurs in each of the $k$ individuals that connect you and one of your
ancestors in the &lt;em&gt;k&lt;sup&gt;th&lt;/sup&gt;&lt;/em&gt; generation, the total number of
breakpoints is also Poisson distributed, but with a rate that is $k$ times larger.
Then, for a segment to be passed from your ancestor in the
&lt;em&gt;k&lt;sup&gt;th&lt;/sup&gt;&lt;/em&gt; generation to you, it must survive &lt;em&gt;k&lt;/em&gt; independent coin
flips, an event that occurs with probability ½&lt;sup&gt;k&lt;/sup&gt;. By a nice property
of Poisson processes known as &lt;em&gt;Poisson thinning&lt;/em&gt;, this coin-flipping process
can be incorporated directly into the Poisson process by changing its rate.
Then, the expected number of segments $N$ shared between you and your ancestor
in the &lt;em&gt;k&lt;sup&gt;th&lt;/sup&gt;&lt;/em&gt; generation is:&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;5&lt;/sup&gt; The Poisson distribution, among its many other uses, was
&lt;a href=&#34;https://en.wikipedia.org/wiki/Ladislaus_Bortkiewicz&#34;&gt;famously used to model&lt;/a&gt;

the number of fatalities of Prussian soldiers due to horse kicks in the face.
&lt;/aside&gt;

\[\mathbb{E}[N] = \frac{1}{2^k}(22 &amp;#43; 33k)\]&lt;p&gt;where 33 is the total genetic length of the human autosomes in Morgans, a unit
defined as the average number of crossovers that occur per meiosis (and named after
the Morgan who created the figure above). The formula above can give us a good
intuition about what&amp;rsquo;s going on: the number of segments created by recombination
grows linearly with how far back we go ($22 + 33k$), but the survival
probability decreases exponentially (&lt;em&gt;½&lt;sup&gt;k&lt;/sup&gt;&lt;/em&gt;). Using the Poisson
distribution&lt;sup&gt;6&lt;/sup&gt;, we can do more than just find an expression for the
&lt;em&gt;average&lt;/em&gt; number of segments you share with an ancestor; we can also calculate
the probability of sharing zero segments (such that your genealogical ancestor is
not a genetic ancestor) and the distribution of segment lengths.
Additionally, these models can be easily extended to handle the segments shared
between cousins.&lt;/p&gt;
&lt;aside&gt;
&lt;sup&gt;6&lt;/sup&gt; In our paper, we find that a model closely related to
the thinned Poisson process is more accurate. We call this the Poisson-Binomial
model; to keep this blog post simple, I don&amp;rsquo;t discuss it in detail here.
Essentially, it&amp;rsquo;s identical to the Poisson model, except that the number $N$ of
segments that survive out of $b+22$ trials is binomially distributed with success probability
&lt;em&gt;½&lt;sup&gt;k&lt;/sup&gt;&lt;/em&gt;.
&lt;/aside&gt;
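&lt;p&gt;The expectation above is easy to tabulate. The sketch below evaluates it for a range of generations and, as an extra assumption in the spirit of the thinned Poisson approximation, treats $N$ as Poisson distributed with that mean to get the probability of sharing no segments at all. The function names are my own:&lt;/p&gt;

```python
import math

AUTOSOME_COUNT = 22      # number of human autosomes
AUTOSOME_MORGANS = 33    # total autosomal genetic length used in the text


def expected_segments(k):
    """Expected number of autosomal segments shared with an ancestor k generations back."""
    return (AUTOSOME_COUNT + AUTOSOME_MORGANS * k) / 2 ** k


def prob_no_segments(k):
    """P(N = 0), assuming N is Poisson with mean expected_segments(k)."""
    return math.exp(-expected_segments(k))


# Columns: generation, expected segments, probability of sharing nothing.
for k in range(1, 13):
    print(k, round(expected_segments(k), 3), round(prob_no_segments(k), 3))
```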

&lt;p&gt;What&amp;rsquo;s fascinating about this is that you may not share genetic material with
all of your genealogical ancestors. If you play around with the equation above with
different values of $k$, you&amp;rsquo;ll see that around $k=9$ you&amp;rsquo;re expected to share
less than one segment with each of your ancestors 9 generations back. We can visualize
this using an arc diagram, which depicts a present-day individual as the white
half-circle in the center, surrounded by their two parents, four grandparents, and so forth:&lt;/p&gt;
&lt;div id=&#34;auto-family-arc&#34;&gt;&lt;/div&gt;
&lt;div id=&#34;auto-desc&#34; class=&#34;arc-text&#34;&gt;&lt;/div&gt;
&lt;div id=&#34;auto-help&#34; class=&#34;arc-text&#34;&gt;loading...&lt;/div&gt;
&lt;figcaption&gt; An arc diagram of one&#39;s genealogical ancestors and their genetic
contributions to the present-day individual. Female ancestors are colored red,
and male ancestors are colored blue. This visualization uses simulated genetic
ancestry back through the generations, and the red or blue arcs
grow fainter the less genetic material is shared between that ancestor and
you. Completely white arcs are genealogical ancestors that contribute zero
genetic material to you. Hover over an ancestor to highlight it and find how
much genetic material it has contributed to the present-day
individual.&lt;/figcaption&gt;
&lt;p&gt;We see that one&amp;rsquo;s genetic ancestors don&amp;rsquo;t grow as rapidly as one&amp;rsquo;s genealogical
ancestors. There&amp;rsquo;s a lot more to say about this; see Graham&amp;rsquo;s &lt;a href=&#34;https://gcbias.org/2013/11/11/how-does-your-number-of-genetic-ancestors-grow-back-over-time/&#34;&gt;terrific blog
post&lt;/a&gt;

on this topic for more information.&lt;/p&gt;
&lt;h2&gt;X Genealogies&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;x-genealogies&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#x-genealogies&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In our paper, we were curious how these processes would play out on the X
chromosome. The human genome contains 22 autosome pairs and one sex chromosome
pair, giving us 23 pairs (i.e. the 23 from 23andme), plus one mitochondrial
genome. However, unlike the autosomes, the X chromosome undergoes a special
inheritance pattern. Males have only one X chromosome, and a Y chromosome. In
contrast, females have two X chromosomes. Each generation, individuals pass a
haploid set of chromosomes to their offspring, meaning that from each of the 23 pairs
they pass on a single chromosome, generally a recombined mixture of the pair. Since males have two different sex
chromosomes (the X and the Y), these two different chromosomes don&amp;rsquo;t recombine
like the autosomes (except for over a small region called the pseudo-autosomal
region). Instead, the male either passes his X to a daughter or a Y to a son.
Females, having two X chromosomes, do pass a recombined X chromosome to their
son or daughter. Since the X can only recombine over its entire length in
females, we call these female meioses &lt;strong&gt;recombinational meioses&lt;/strong&gt;. Note that
with the autosomes, every meiosis is a recombinational meiosis.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s fascinating is that this different inheritance pattern leads the X
chromosome to have a different genealogy than one&amp;rsquo;s biparental genealogy.
Since males don&amp;rsquo;t pass X chromosomes to their sons, one&amp;rsquo;s X genealogy only
includes a subset of one&amp;rsquo;s total ancestors, and is embedded inside of one&amp;rsquo;s
total genealogy. Below is a genealogy for a present-day female, with her X
genealogy shaded in:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/xtree.png&#34; alt=&#34;An X genealogy going back five generations, with females drawn as circles and males as squares. Shaded individuals are X ancestors, while unshaded individuals are not X ancestors. The numbers indicate the number of recombinational meioses to that ancestor.&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Note the number of X ancestors a present-day female has back through the
generations: 2, 3, 5, 8, etc. This sequence is the famous &lt;a href=&#34;https://en.wikipedia.org/wiki/Fibonacci_number&#34;&gt;Fibonacci
sequence&lt;/a&gt;
 offset by two. Thus,
a present-day female&amp;rsquo;s number of X ancestors $k$ generations back is the $(k+2)$th Fibonacci number,
$\mathcal{F}_{k+2}$ (if the present-day individual is a male, we offset this by
one). This sequence crops up throughout
&lt;a href=&#34;https://en.wikipedia.org/wiki/Fibonacci_number#In_nature&#34;&gt;nature&lt;/a&gt;
 and
&lt;a href=&#34;https://en.wikipedia.org/wiki/Fibonacci_number#Use_in_mathematics&#34;&gt;mathematics&lt;/a&gt;
.&lt;/p&gt;
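&lt;p&gt;The Fibonacci pattern follows directly from the inheritance rule, and a short recursion makes this visible. The sketch below counts X ancestors by applying the rule one generation at a time (a female gets an X from both parents, a male only from his mother) and prints the counts next to the corresponding Fibonacci numbers. The function names are illustrative, and pedigree collapse is ignored, as in the rest of this post:&lt;/p&gt;

```python
def x_ancestors(sex, k):
    """Count X ancestors k generations back for an individual of the given sex.

    A female ("F") inherits an X from both parents; a male ("M") inherits
    his X only from his mother.
    """
    if k == 0:
        return 1
    if sex == "F":
        return x_ancestors("F", k - 1) + x_ancestors("M", k - 1)
    return x_ancestors("F", k - 1)


def fibonacci(n):
    """n-th Fibonacci number, with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Columns: k, female count, F_(k+2), male count, F_(k+1).
for k in range(1, 9):
    print(k, x_ancestors("F", k), fibonacci(k + 2),
          x_ancestors("M", k), fibonacci(k + 1))
```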
&lt;h2&gt;Models for X chromosome recent genetic ancestry&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;models-for-x-chromosome-recent-genetic-ancestry&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#models-for-x-chromosome-recent-genetic-ancestry&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Another feature of X genealogies is that unlike the autosomes, where
chromosomes undergo recombination every generation and in every ancestor, the
number of X recombinational meioses varies by lineage. This is because the number
of females that occur in a lineage to an X ancestor in the 5&lt;sup&gt;th&lt;/sup&gt;
generation varies depending on the lineage. In the leftmost lineage of the
genealogy above, a female occurs each generation. In contrast, the rightmost X
lineage (with all shaded individuals) alternates between male and female
ancestors. Since the X chromosome only undergoes recombination over its entire
length in females, the specific lineage to an X ancestor impacts how quickly
genetic relatedness breaks down. In our paper, we sought to characterize this
lineage-specific rate and see how it affects genetic relatedness.&lt;/p&gt;
&lt;p&gt;Our models are similar to the autosomal models described earlier, except that, given
that we don&amp;rsquo;t know the particular lineage to an X ancestor, we need to average
over the number of possible recombinational meioses that could occur. We found
that the number of lineages to an X ancestor $k$ generations back with $r$
recombinational meioses is:&lt;/p&gt;
\[{ r &amp;#43; 1 \choose k-r}\]&lt;p&gt;We can intuitively understand this by looking at an X genealogy; X genealogies
enumerate every possible way to arrange males and females such that no two
males are adjacent (since fathers don&amp;rsquo;t pass an X to their sons). Thus, the
number of lineages $k$ generations in the past with $r$ females can be
thought of as the number of ways of ordering $r$ red balls and $k-r$ white
balls such that no two white balls are adjacent. The number of ways of ordering
red and white balls this way is given by the binomial coefficient above.&lt;/p&gt;
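&lt;p&gt;The counting argument can be checked by brute force. The sketch below enumerates every sequence of ancestor sexes of length $k$ for a present-day female, keeps those with no two adjacent males, and tallies them by their number of females $r$ alongside the binomial coefficient. The function name and the choice of $k=5$ are only for illustration:&lt;/p&gt;

```python
from collections import Counter
from itertools import product
from math import comb


def lineage_counts(k):
    """Count X lineages k generations back by their number of females r.

    Each lineage is a string of ancestor sexes, one per generation; it is a
    valid X lineage from a present-day female if no two males are adjacent.
    """
    counts = Counter()
    for sexes in product("FM", repeat=k):
        lineage = "".join(sexes)
        if "MM" not in lineage:
            counts[lineage.count("F")] += 1
    return counts


k = 5
counts = lineage_counts(k)
for r in sorted(counts):
    # Brute-force count next to the binomial coefficient from the text.
    print(r, counts[r], comb(r + 1, k - r))
```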
&lt;p&gt;Since one has $\mathcal{F}_{k+2}$ X ancestors $k$ generations back, the
probability of $r$ recombinational meioses occurring is:&lt;/p&gt;
\[P_R(R=r) = \frac{{ r &amp;#43; 1 \choose k-r}}{\mathcal{F}_{k&amp;#43;2}}\]&lt;p&gt;Averaging over this number of recombinational meioses gives us a model for the
number and length of segments shared identically by descent on the X. It turns
out the Poisson thinning approximation described earlier doesn&amp;rsquo;t work as well
as another model we call the Poisson-Binomial model. I won&amp;rsquo;t cover the detailed
derivation here (see the
&lt;a href=&#34;https://www.genetics.org/content/204/1/57&#34;&gt;paper&lt;/a&gt;
 if you&amp;rsquo;re
interested), but we find the distribution of X segment number to be well
approximated by:&lt;/p&gt;
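\[P(N=n \;|\; k, \nu) = \sum_{r=\lfloor k/2 \rfloor}^k \sum_{b=0}^\infty \text{Bin}(N=n \;|\; l=b&amp;#43;1, p=1/2^r) \; \text{Pois}(B=b \;|\; \lambda=\nu r) \; \frac{{r&amp;#43;1 \choose k-r}}{\mathcal{F}_{k&amp;#43;2}} \]&lt;p&gt;Here $\nu$ is the genetic length of the X chromosome in Morgans (playing the role that the 33 Morgans played for the autosomes), and $b+1$ is the number of segments created by $b$ breakpoints on a single chromosome. As with the autosomes, it&amp;rsquo;s possible one&amp;rsquo;s X genealogical ancestors don&amp;rsquo;t
contribute X genetic material to their present-day descendant. For example,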
here is a simulated X genealogy with opacity of an ancestor indicating that
ancestor&amp;rsquo;s genetic contribution to the present-day individual:&lt;/p&gt;
&lt;div id=&#34;x-family-arc&#34;&gt;&lt;/div&gt;
&lt;div id=&#34;x-desc&#34; class=&#34;arc-text&#34;&gt;&lt;/div&gt;
&lt;div id=&#34;x-help&#34; class=&#34;arc-text&#34;&gt;loading...&lt;/div&gt;
&lt;figcaption&gt;An X genealogy depicted as an arc diagram. Red ancestors are
females, blue are males. The opacity indicates the genetic contribution to the
present-day individual. White ancestors are those that make no genetic
contribution to the present-day individual. Gray arcs are genealogical
ancestors that are not X ancestors. Hover over an ancestor to highlight it and
find how much X genetic material it has contributed to the present-day
female.&lt;/figcaption&gt; 
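&lt;p&gt;For readers who want to experiment with the X segment-number distribution given earlier, here is a minimal numerical sketch of that double sum for a present-day female. The X map length used (&lt;code&gt;X_MORGANS = 1.96&lt;/code&gt;) is an assumed value for illustration, the infinite sum over $b$ is truncated at a hypothetical &lt;code&gt;b_max&lt;/code&gt;, and the function names are my own rather than anything from the paper&amp;rsquo;s code:&lt;/p&gt;

```python
from math import comb, exp, factorial

X_MORGANS = 1.96  # assumed X genetic length (nu); substitute your own value


def fibonacci(n):
    """n-th Fibonacci number, with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def poisson_pmf(b, lam):
    """Probability of b breakpoints when the expected number is lam."""
    return exp(-lam) * lam ** b / factorial(b)


def binomial_pmf(n, trials, p):
    """Probability that n of the trials segments survive."""
    if n > trials:
        return 0.0
    return comb(trials, n) * p ** n * (1 - p) ** (trials - n)


def x_segment_pmf(n, k, nu=X_MORGANS, b_max=80):
    """P(N = n | k, nu), truncating the sum over breakpoints at b_max."""
    total = 0.0
    for r in range(k // 2, k + 1):
        weight = comb(r + 1, k - r) / fibonacci(k + 2)  # P(R = r)
        for b in range(b_max + 1):
            total += (binomial_pmf(n, b + 1, 0.5 ** r)
                      * poisson_pmf(b, nu * r) * weight)
    return total


# Probability of sharing 0, 1, 2, or 3 X segments with an X ancestor
# four generations back, under the assumptions above.
for n in range(4):
    print(n, round(x_segment_pmf(n, k=4), 4))
```

&lt;p&gt;The value at $n=0$ is the probability that an X genealogical ancestor contributes no X genetic material at all, which corresponds to the white arcs in the diagram above.&lt;/p&gt;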
&lt;p&gt;To get a sense of how one&amp;rsquo;s X genealogical ancestry grows back in time, we&amp;rsquo;ve
plotted it below (Figure A) compared to one&amp;rsquo;s autosomal ancestry, and the
growth of both one&amp;rsquo;s genetic X and autosomal ancestry back through the
generations. Using probability models we work through in the next section, we
also show (Figure B) the probability of sharing some autosomal genetic ancestry
(&lt;em&gt;P(N&lt;sub&gt;auto&lt;/sub&gt; &amp;gt; 0)&lt;/em&gt;) and X genetic ancestry (&lt;em&gt;P(N&lt;sub&gt;X&lt;/sub&gt;
&amp;gt; 0)&lt;/em&gt;), conditional and unconditional on both being an X genealogical
ancestor (&amp;ldquo;X ancestor&amp;rdquo; and &amp;ldquo;ancestor&amp;rdquo;, respectively).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/num-ancestors.png&#34; alt=&#34;A: a present-day female’s number of genealogical and genetic ancestors, for X chromosomes and autosomes. B: the probability of genealogical and genetic ancestry for a variety of cases&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Similarly, we extend these models to model the number of X chromosome segments
shared between half- and full-cousins and explore other properties of X
cousins. These models get a bit tricky mathematically, as the sex of the
cousins&amp;rsquo; shared ancestor impacts the number of segments shared between cousins,
so we incorporate the probability of the sex of the shared ancestor in our
models (see Section 3 of our
&lt;a href=&#34;https://www.genetics.org/content/204/1/57&#34;&gt;paper&lt;/a&gt;
 for more
details).&lt;/p&gt;
&lt;h2&gt;What recent ancestry on the X can tell us&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-recent-ancestry-on-the-x-can-tell-us&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-recent-ancestry-on-the-x-can-tell-us&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Using genetic data to infer relationships between individuals is an important
topic—it&amp;rsquo;s used by services like 23andme for ancestry finding, in forensics in
assessing DNA-based evidence, and in anthropology and ancient DNA to learn
about the familial relationships among individuals. We were curious what X
chromosome segments shared between cousins could tell us about their
relationship. We found that X chromosome segments can be quite informative
about which of their ancestors they share. This information comes through two
avenues: (1) sharing IBD segments on the X immediately reduces the potential
genealogical ancestors two individuals share, since one&amp;rsquo;s X ancestors are only
a fraction of their possible genealogical ancestors, and (2) the varying number
of females in an X genealogy across lineages, combined with the fact that
recombinational meioses only occur in females, leaves to some extent a
lineage-specific signature of ancestry. We&amp;rsquo;ll talk more about this second point
in this section.&lt;/p&gt;
&lt;p&gt;The X chromosome is relatively short (compared to the autosomes), leading
ancestry signals to decay relatively rapidly. Thus, inferring how far back
cousins share an ancestor is best accomplished by looking at segments
shared on the autosomes rather than the X chromosome, and many methods are
available for this purpose (Durand et al., 2014; Henn et al., 2012; Huff et
al., 2011).  We condition on knowing how many generations back these
half-cousins share a common ancestor using this autosomal signal. Then, we use
&lt;a href=&#34;https://en.wikipedia.org/wiki/Bayes%27_theorem&#34;&gt;Bayes&amp;rsquo; theorem&lt;/a&gt;
 to invert
$P(N=n| R)$ to learn the posterior $P(R | N=n)$, where $R$ is the number of
recombinational meioses (and thus number of females) between two half-cousins
and $N$ is the observed number of X segments shared between the cousins. These
posterior distributions are:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/rm-posterior.png&#34; alt=&#34;The posterior distribution for R across different generations (each panel). Each line is the number of observed segments between X half-cousins. The prior distribution is the gray dashed line.&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;For example, the top right panel shows the posterior distributions for the
number of females in the lineage connecting two 3&lt;sup&gt;rd&lt;/sup&gt; degree
half-cousins. Each line represents the posterior distribution for a specific
number of observed segments shared between these two half-cousins. If these two
3&lt;sup&gt;rd&lt;/sup&gt; degree half-cousins share six segments identically by descent, our
models say that a lineage with three females is the most likely genealogical
configuration. This information is interesting, as these genealogical details
cannot be inferred with the autosomal data alone.&lt;/p&gt;
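&lt;p&gt;The inversion itself is just $P(R=r \mid N=n) = P(N=n \mid R=r)\,P(R=r) / \sum_{k} P(N=n \mid R=k)\,P(R=k)$. A minimal sketch of that normalization step is below. The function name &lt;code&gt;posterior&lt;/code&gt; and its arguments are hypothetical, and the likelihood is passed in as a plain function; the actual likelihood and prior used in the paper are not reproduced here.&lt;/p&gt;

```javascript
// Generic discrete Bayes inversion.
// prior: object mapping each candidate r to P(R = r).
// likelihood: function taking r and returning P(N = n | R = r)
//             for the observed n (assumed to be supplied by the caller).
function posterior(prior, likelihood) {
  var joint = {};
  var total = 0;
  Object.keys(prior).forEach(function (r) {
    joint[r] = prior[r] * likelihood(Number(r));
    total += joint[r];
  });
  var post = {};
  Object.keys(joint).forEach(function (r) {
    post[r] = joint[r] / total; // normalize so the posterior sums to one
  });
  return post;
}
```

&lt;p&gt;With a toy prior of 0.5 on each of $r=1$ and $r=2$, and toy likelihoods of 0.2 and 0.6, the posterior puts 0.25 on $r=1$ and 0.75 on $r=2$. These numbers are purely illustrative.&lt;/p&gt;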
&lt;p&gt;As genomic data sets grow, so will the probability of sampling individuals
that share recent ancestry. With large data sets (e.g. 23andMe&amp;rsquo;s users),
there&amp;rsquo;s potential for recent ancestry on the X to shed some light on the
genealogical relationships connecting us all.&lt;/p&gt;
&lt;!-- requisite JS below --&gt;
&lt;script src=&#34;https://vincebuffalo.com/js/d3.v3.min.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://vincebuffalo.com/js/familyarc.js&#34; type=&#34;text/javascript&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://vincebuffalo.com/js/sharedsegments2.js&#34; type=&#34;text/javascript&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;script type=&#34;text/javascript&#34; charset=&#34;utf-8&#34;&gt;
  
  var human_x = {
    &#39;nancestors&#39;: function(k) {
      return (Math.pow(φ, k+2) - Math.pow(ψ, k+2))/Math.sqrt(5);
    },
    &#39;genlen&#39;: 1.96,
  };

  var single_chrom = {
    &#39;nancestors&#39;: function(k) {
      return Math.pow(2, k);
    },
    &#39;genlen&#39;: 5,
  }


    d3.json(&#34;/js/x.json&#34;, function(data) {
      var config = single_chrom;
      config.genlen = data.genlen;
      if (data.type == &#39;x&#39;) {
        config = human_x;
        //config.tight = true;
      }
      config.animate = true;
      // maxgen: also change in sharedsegments2.js, filter()
      config.maxgen = 4; //d3.max(data.sims[0].map(function(d) { return d.gen; }));
      var drawShared = segmentsTree(config);
      d3.select(&#34;#xshared&#34;)
        .datum(data.sims[0])
        .call(drawShared);
    });
&lt;/script&gt;

      </description>
    </item>
    
    <item>
      <title>Using Rcpp and C&#43;&#43; to Count Genotypes</title>
      <link>https://vincebuffalo.com/blog/rcpp-counting-genotypes/</link>
      <pubDate>Sat, 05 Dec 2015 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/rcpp-counting-genotypes/</guid>
      <description>
        
        
        &lt;p&gt;I had a matrix (88662 loci x 2060 genotypes) of maize chromosome 1 genotypes, encoded as 0, 1, 2 (i.e. the number of alternate alleles). I needed genotype counts per row, which at first glance is quite easy to solve: just use &lt;code&gt;apply&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-r&#34; data-lang=&#34;r&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;counts&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chr1g&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;table&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What&amp;rsquo;s the problem with this approach? First, technical: if a row only has genotypes 0 and 2, we don&amp;rsquo;t get counts for 1, which makes merging into a matrix later on a total nightmare (evident because &lt;code&gt;apply&lt;/code&gt; is smart enough to return a list). Second, it ignores &lt;code&gt;NA&lt;/code&gt;, which is not good. We can fix this with &lt;code&gt;exclude=NULL&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-r&#34; data-lang=&#34;r&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;counts&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chr1g&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kr&#34;&gt;function&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;table&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;exclude&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This doesn&amp;rsquo;t solve our first problem, and it&amp;rsquo;s a bit slower. &lt;code&gt;apply(chr1g, 1, table)&lt;/code&gt; took:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;  user  system elapsed
84.638   3.110  87.865&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Whereas &lt;code&gt;apply(chr1g, 1, function(x) table(x, exclude=NULL))&lt;/code&gt; takes:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;   user  system elapsed
108.718   5.708 114.609&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;(and no, passing &lt;code&gt;exclude=NULL&lt;/code&gt; directly to &lt;code&gt;apply&lt;/code&gt; doesn&amp;rsquo;t make it faster). The way around the first technical issue is to use factors. &lt;code&gt;as.factor&lt;/code&gt; removes dimensions, so we&amp;rsquo;re out of luck converting the whole matrix at once (plus, this would require a copy in memory and these objects are moderately large). So, we could do something like:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-r&#34; data-lang=&#34;r&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;count&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;apply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chr1g&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kr&#34;&gt;function&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;table&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;factor&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;levels&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;NA&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;exclude&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 
hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This works well, but is also not too fast (chromosome 1 is 15% of our dataset):&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;  user  system elapsed
95.918   5.746 101.861&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Rcpp stands out as a simple solution here: this is very easy to code up (it took me literally five minutes). Looking at the timing first:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;  user  system elapsed
 8.427   1.573  10.027&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;We see we have a clear winner. The whole dataset is 573,392 rows. Overall, this would take &lt;code&gt;(95.918 / 88662)* 573392 = 620.3178&lt;/code&gt; seconds or about 10 minutes to complete on all data. Chances are, I&amp;rsquo;ll have to run this code a few times as the analysis changes. In contrast, the Rcpp method takes &lt;code&gt;(8.427 / 88662)* 573392 = 54.49882&lt;/code&gt; seconds. That&amp;rsquo;s under 6 minutes to code and implement a working, faster solution in Rcpp!&lt;/p&gt;
&lt;p&gt;The code is quite easy to understand too:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-cpp&#34; data-lang=&#34;cpp&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;cp&#34;&gt;#include&lt;/span&gt; &lt;span class=&#34;cpf&#34;&gt;&amp;lt;Rcpp.h&amp;gt;&lt;/span&gt;&lt;span class=&#34;cp&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;cp&#34;&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;using&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;namespace&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Rcpp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;// [[Rcpp::export]]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IntegerVector&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;countGenotypes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IntegerVector&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;c1&#34;&gt;// This method is a specialized version of R&amp;#39;s table that counts genotypes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// encoded as 0, 1, 2 in a vector (and also returns NA) always of length 4,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// always as numbers of 0, 1, 2, NA. This allows faster usage with apply, as
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// we don&amp;#39;t need to convert to factor to get all genotype counts, even if
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;c1&#34;&gt;// none are present.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;CharacterVector&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;names&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;CharacterVector&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;create&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;0&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;2&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;NA&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;IntegerVector&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fill&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kt&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;length&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;();&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;++&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;!&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IntegerVector&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;::&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;is_na&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]))&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                    &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;else&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                    &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+=&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;attr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;names&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;names&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;genocounts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;

      </description>
    </item>
    
    <item>
      <title>MD Tags in BAM Files</title>
      <link>https://vincebuffalo.com/blog/md-tags-in-bam-files/</link>
      <pubDate>Fri, 17 Jan 2014 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/md-tags-in-bam-files/</guid>
      <description>
        
        
        &lt;p&gt;I needed to work with the MD tag in BAM/SAM files for a recent project. There&amp;rsquo;s
not too much discussion online about this, so I took some notes as I went
through examples.&lt;/p&gt;
&lt;p&gt;The MD tag is for SNP/indel calling without looking at the
reference. It does this by carrying information about the reference
that the read does not carry, for a particular alignment. A SNP&amp;rsquo;s
alternate base is carried in the read, but without the MD tag or use
of the alignment reference, it&amp;rsquo;s impossible to know what the reference
base was. Thus, this information is carried in the MD tag. A SNP looks
like:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;10A3T0T10&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, there are three SNPs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;11 bases in from the start of the &lt;em&gt;aligned portion of the read&lt;/em&gt;, the reference
has an A and the read has whatever base is at the 11th position
(excluding softclips).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;15 bases in there&amp;rsquo;s a T in the reference.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;16 bases in there&amp;rsquo;s a T in the reference. Note that 0s are used
to indicate positions of neighboring SNPs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Likewise, a reference would be necessary to know the deleted bases
from the reference in an alignment. The MD tag stores this information
too:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;85^A16&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, there are 85 matches, 1 deletion from the reference (the
reference has an A there and the read doesn&amp;rsquo;t), and then there are 16
matches.&lt;/p&gt;
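&lt;p&gt;The two examples above can be decoded mechanically. Below is a minimal sketch of an MD tokenizer; the function name &lt;code&gt;parseMD&lt;/code&gt; and the shape of the returned objects are my own choices for illustration, not part of any library. Positions are 1-based offsets along the reference bases covered by the alignment, which coincide with positions in the aligned part of the read only when the alignment has no insertions or deletions.&lt;/p&gt;

```javascript
// Split an MD tag into match runs, mismatches, and deletions.
// Grammar assumed here: digits, then (base | ^bases) followed by digits, repeated.
function parseMD(md) {
  if (!/^\d+(([A-Za-z]|\^[A-Za-z]+)\d+)*$/.test(md)) {
    throw new Error("Malformed MD tag: " + md);
  }
  var tokenRe = /(\d+)|(\^[A-Za-z]+)|([A-Za-z])/g;
  var tokens = [];
  var refPos = 0; // reference bases consumed so far
  var m;
  while ((m = tokenRe.exec(md)) !== null) {
    if (m[1] !== undefined) {
      // A run of matching bases; a 0 only separates adjacent events.
      var n = parseInt(m[1], 10);
      refPos += n;
      if (n !== 0) {
        tokens.push({ type: "match", length: n });
      }
    } else if (m[2] !== undefined) {
      // Bases present in the reference but deleted from the read.
      var deleted = m[2].slice(1);
      tokens.push({ type: "deletion", refBases: deleted, refStart: refPos + 1 });
      refPos += deleted.length;
    } else {
      // A single mismatch; the letter is the reference base.
      refPos += 1;
      tokens.push({ type: "mismatch", refBase: m[3], refPos: refPos });
    }
  }
  return { tokens: tokens, refLength: refPos };
}
```

&lt;p&gt;Calling &lt;code&gt;parseMD(&#34;10A3T0T10&#34;)&lt;/code&gt; reports mismatches at offsets 11, 15, and 16 over 26 reference bases, and &lt;code&gt;parseMD(&#34;85^A16&#34;)&lt;/code&gt; reports a one-base deletion starting at offset 86 over 102 reference bases.&lt;/p&gt;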
&lt;h2&gt;Example 1: With Insertion&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;example-1-with-insertion&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#example-1-with-insertion&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Note that insertions, since they don&amp;rsquo;t represent a loss of information
about the reference, are not stored in the MD tag. This has some
interesting consequences. Let&amp;rsquo;s look at an example:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;read seq length: 101
CIGAR: 89M1I11M
MD: 100&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Immediately, the surprising part is that the MD tag represents 100
matches to the reference, but the read length is 101 bases and the
CIGAR operations sum to 101 (89 + 1 + 11). This comes back to the core purpose of MD tags:
they only represent information about the read aligned to the
reference. There are 100 bases that align to the reference, and one
insertion that does not.&lt;/p&gt;
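&lt;p&gt;To make the bookkeeping concrete, here is a minimal Python sketch that sums a CIGAR string two ways: over the operations that consume read bases, and over those that consume reference bases (following the operation table in the SAM specification). The function name &lt;code&gt;cigar_lengths&lt;/code&gt; is made up for illustration and is not part of any library.&lt;/p&gt;

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM specification: which operations consume read or reference bases.
READ_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def cigar_lengths(cigar):
    """Return (bases consumed in the read, bases consumed in the reference)."""
    read_len = 0
    ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in READ_OPS:
            read_len += int(count)
        if op in REF_OPS:
            ref_len += int(count)
    return read_len, ref_len


print(cigar_lengths("89M1I11M"))  # (101, 100)
```

&lt;p&gt;The second number, 100, is the one the MD tag agrees with.&lt;/p&gt;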
&lt;h2&gt;Example 2: More Complex Insertion&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;example-2-more-complex-insertion&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#example-2-more-complex-insertion&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;read length: 101
CIGAR: 9M1I91M
MD: 48T42G8
name: HWI-ST222:4:1105:19266:186667#0 // for my reference&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Here, there are SNPs (or errors) in the M (match/mismatch) parts of
the alignment. The total MD length is: 48 + 1 + 42 + 1 + 8 = 100,
which matches our read length &lt;em&gt;minus insertions&lt;/em&gt;. What gets tricky
(and in my opinion slightly annoying) is that the match component of
the MD tag (the numeric parts) spans the insertion, but does not
show it. The CIGAR string tells us our read sequence has an insertion
at the 10th base, yet nothing in the MD tag marks it. As a result, the
mismatch at the 49th aligned base (where the reference is a T according
to the MD tag) is actually the 50th base in the read. This is because MD
ignores insertions and we have a 1 base insertion upstream of the
mismatching T. The same is true of the other mismatch (reference has G):
according to the MD tag, it&amp;rsquo;s 48 + 1 + 42 + 1 = 92 bases in, but it&amp;rsquo;s
actually 93 bases into the read.&lt;/p&gt;
&lt;p&gt;As an aside, running &lt;code&gt;samtools calmd&lt;/code&gt; with &lt;code&gt;-e&lt;/code&gt;, which changes read
bases that match the reference to &lt;code&gt;=&lt;/code&gt;, really helps in seeing these details. Read inspection in IGV
also helps.&lt;/p&gt;
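&lt;p&gt;The off-by-one shift described above can be worked out mechanically. The following Python sketch is only an illustration, assuming well-formed CIGAR and MD strings; the names &lt;code&gt;m_base_read_positions&lt;/code&gt; and &lt;code&gt;mismatch_read_positions&lt;/code&gt; are hypothetical. It first uses the CIGAR string to list the read position of every aligned (M) base, then walks the MD tag and looks up where each mismatch really falls in the read.&lt;/p&gt;

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# An MD token is a run of matches, a deletion (^ plus bases), or one mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def m_base_read_positions(cigar):
    """1-based read position of each base in an M/=/X operation, in order."""
    positions = []
    read_pos = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":
            positions.extend(range(read_pos + 1, read_pos + n + 1))
        if op in "MIS=X":  # operations that consume read bases
            read_pos += n
    return positions


def mismatch_read_positions(cigar, md):
    """Return (reference base, 1-based read position) for each MD mismatch."""
    positions = m_base_read_positions(cigar)
    aligned = 0  # aligned (M) bases consumed so far; deletions consume none
    found = []
    for matches, deleted, mismatch in MD_TOKEN.findall(md):
        if matches:
            aligned += int(matches)
        elif mismatch:
            aligned += 1
            found.append((mismatch, positions[aligned - 1]))
    return found


print(mismatch_read_positions("9M1I91M", "48T42G8"))  # [('T', 50), ('G', 93)]
```

&lt;p&gt;For the read above this reports the T at read position 50 and the G at read position 93, matching the counts worked out by hand.&lt;/p&gt;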
&lt;h2&gt;Example 3: Deletions&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;example-3-deletions&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#example-3-deletions&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Deletions are stored in the MD tag, because these represent a loss of
information with respect to the reference. Let&amp;rsquo;s look at a simple
example:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;read length: 101
CIGAR: 56M1D45M
MD: 56^A45
read name: HWI-ST222:4:2101:12455:194028#0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
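&lt;p&gt;CIGAR string length is: 56 + 1 + 45 = 102. MD length is 56 + 1 + 45
= 102. Both are one more than the read length of 101, because the
deleted base is present in the reference but not in the read. This case
is pretty trivial because deletions are indicated in
both the MD tag and CIGAR string.&lt;/p&gt;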
&lt;h2&gt;Example 4: Insertions and Deletions&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;example-4-insertions-and-deletions&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#example-4-insertions-and-deletions&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s a trickier example:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;read length: 101
CIGAR: 31M1I17M1D37M
MD: 6G4C20G1A5C5A1^C3A15G1G15
read name: HWI-ST222:4:1208:7027:16535#0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
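&lt;p&gt;That&amp;rsquo;s a long one! Let&amp;rsquo;s look at the total lengths of the CIGAR string and
MD tag. CIGAR length: 31 + 1 + 17 + 1 + 37 = 87. MD length: 48 + 1 + 37 = 86
(the 48 and 37 aligned bases on either side of the deleted C), one less
than the CIGAR length because the inserted base appears only in the
CIGAR string. Parsing this is quite tricky.&lt;/p&gt;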
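&lt;p&gt;One sanity check that follows from the examples so far: the MD tag should cover exactly the M and D operations of the CIGAR string, and nothing else. Here is a small Python sketch of that check; the helper names &lt;code&gt;cigar_aligned_length&lt;/code&gt; and &lt;code&gt;md_length&lt;/code&gt; are made up for illustration, and the tags are assumed to be well-formed.&lt;/p&gt;

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def cigar_aligned_length(cigar):
    """Bases in M/=/X and D operations: the part of the CIGAR that MD describes."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XD")


def md_length(md):
    """Reference bases covered by an MD tag: matches, mismatches and deletions."""
    total = 0
    for matches, deleted, mismatch in MD_TOKEN.findall(md):
        if matches:
            total += int(matches)
        elif deleted:
            total += len(deleted)
        else:
            total += 1  # a single mismatched reference base
    return total


cigar = "31M1I17M1D37M"
md = "6G4C20G1A5C5A1^C3A15G1G15"
print(cigar_aligned_length(cigar), md_length(md))  # 86 86
```

&lt;p&gt;The two numbers agree (86 and 86), while the inserted base accounts for the extra 1 in the CIGAR total of 87.&lt;/p&gt;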
&lt;h2&gt;Approaches&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;approaches&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#approaches&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Initially I thought of a single-pass approach. This would almost
surely involve a finite state automaton that tracks four pieces of state:
the current CIGAR token, the current MD token, the read position, and the
reference position. This is quite tricky, so an easier approach is to
rebuild the reference string with the MD tag, and then compare it to the
aligned read (following positions from the CIGAR string). This way only
the MD state or the CIGAR state needs to be kept in focus at any one time.&lt;/p&gt;
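&lt;p&gt;A rough Python sketch of this two-pass idea follows. It is an illustration under stated assumptions rather than a finished implementation: the function names are hypothetical, the read and tags are toy data, and the CIGAR and MD strings are assumed to be well-formed and consistent with each other. The first pass looks only at the CIGAR string, the second only at the MD tag.&lt;/p&gt;

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def aligned_read_bases(read, cigar):
    """Pass 1 (CIGAR only): keep read bases in M/=/X, drop insertions and soft clips."""
    kept = []
    pos = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":
            kept.append(read[pos:pos + n])
        if op in "MIS=X":  # operations that consume read bases
            pos += n
    return "".join(kept)


def rebuild_reference(read, cigar, md):
    """Pass 2 (MD only): turn the aligned read bases back into the reference."""
    aligned = aligned_read_bases(read, cigar)
    ref = []
    pos = 0
    for matches, deleted, mismatch in MD_TOKEN.findall(md):
        if matches:
            n = int(matches)
            ref.append(aligned[pos:pos + n])
            pos += n
        elif deleted:
            ref.append(deleted)  # in the reference, absent from the read
        else:
            ref.append(mismatch)  # reference base; skip the differing read base
            pos += 1
    return "".join(ref)


# Toy alignment: reference ACGTACGTAC; the read has an inserted G,
# one mismatch (C where the reference has A) and one deleted C.
print(rebuild_reference("ACGGTCGTAC", "3M1I2M1D4M", "4A0^C4"))  # ACGTACGTAC
```

&lt;p&gt;The rebuilt string covers only the span of the reference that the read aligns to; comparing it against the read, position by position along the CIGAR string, is then a separate and simpler step.&lt;/p&gt;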

      </description>
    </item>
    
    <item>
      <title>Using Named Pipes and Process Substitution in Bioinformatics</title>
      <link>https://vincebuffalo.com/blog/using-named-pipes-and-process-substitution-in-bioinformatics/</link>
      <pubDate>Thu, 08 Aug 2013 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/using-named-pipes-and-process-substitution-in-bioinformatics/</guid>
      <description>
        
        
        &lt;p&gt;It&amp;rsquo;s hard not to fall in love with Unix as a bioinformatician. In a
&lt;a href=&#34;https://vincebuffalo.com/blog/bioinformatics-and-interface-design/&#34;&gt;past post&lt;/a&gt;

I mentioned how Unix pipes are an extremely elegant way to interface
bioinformatics programs (and do inter-process communication in
general). In exploring other ways of interfacing programs in Unix,
I&amp;rsquo;ve discovered two great but overlooked ways of interfacing programs:
the named pipe and process substitution.&lt;/p&gt;
&lt;h2&gt;Why We Love Pipes and Unix&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;why-we-love-pipes-and-unix&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#why-we-love-pipes-and-unix&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A few weeks ago I stumbled across a great talk by
&lt;a href=&#34;https://twitter.com/garybernhardt&#34;&gt;Gary Bernhardt&lt;/a&gt;
 entitled
&lt;a href=&#34;http://www.confreaks.com/videos/615-cascadiaruby2011-the-unix-chainsaw&#34;&gt;The Unix Chainsaw&lt;/a&gt;
. Bernhardt&amp;rsquo;s
&amp;ldquo;chainsaw&amp;rdquo; analogy is great: people sometimes fear doing work in Unix
because it&amp;rsquo;s a powerful tool, and it&amp;rsquo;s easy to screw up with powerful
tools. I think in the process of grokking Unix it&amp;rsquo;s not uncommon to
ask &amp;ldquo;is this clever and elegant? or completely fucking stupid?&amp;rdquo;. This
is normal, especially if you come from a language like Lisp or Python
(or even C really). Unix is a get-shit-done system. I&amp;rsquo;ve used a
chainsaw, and you&amp;rsquo;re simultaneously amazed at (1) how easily it slices
through a tree, and (2) that you&amp;rsquo;re dumb enough to use this thing
three feet away from your vital organs. This is Unix.&lt;/p&gt;
&lt;p&gt;Bernhardt also has this great slide, and I&amp;rsquo;m convinced there&amp;rsquo;s no
better way to describe how most Unix users feel about pipes
(especially bioinformaticians):&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/pipes.png&#34; alt=&#34;For love of Unix pipes&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Pipes are fantastic. Any two (well-written) programs can talk to each
other in Unix. All of the nastiness and the difficulty of
inter-process communication is solved with one character, &lt;code&gt;|&lt;/code&gt;. Thanks
Doug McIlroy and others. The stream is usually plaintext,
&lt;a href=&#34;http://en.wikipedia.org/wiki/Unix_philosophy#McIlroy:_A_Quarter_Century_of_Unix&#34;&gt;the universal interface&lt;/a&gt;
,
but it doesn&amp;rsquo;t have to be. With pipes, it doesn&amp;rsquo;t matter if your pipe
is tab delimited marketing data, random email text, or 100 million
SNPs. Pipes are a tremendous, beautiful, elegant component of the Unix
chainsaw.&lt;/p&gt;
&lt;p&gt;But elegance alone won&amp;rsquo;t earn a software abstraction the hearts of
thousands of sysadmins, software engineers, and scientists: pipes are
fast. There&amp;rsquo;s little overhead between pipes, and they are certainly a
lot more efficient than writing to and reading from the disk. In a
&lt;a href=&#34;https://vincebuffalo.com/blog/bioinformatics-and-interface-design/&#34;&gt;past article&lt;/a&gt;

I included the classic &lt;a href=&#34;http://samtools.sourceforge.net/&#34;&gt;Samtools&lt;/a&gt;

pipe for SNP calling. It&amp;rsquo;s no coincidence that other excellent SNP
callers like &lt;a href=&#34;https://github.com/ekg/freebayes&#34;&gt;FreeBayes&lt;/a&gt;
 make use of
pipes: pipes scale well to moderately large data and they&amp;rsquo;re just
plumbing. Interfacing programs this way allows us to check
intermediate output for issues, easily rework entire workflows, and
even split off a stream with the aptly named program &lt;a href=&#34;http://en.wikipedia.org/wiki/Tee_%28command%29&#34;&gt;tee&lt;/a&gt;
.&lt;/p&gt;
&lt;h2&gt;Where Pipes Don&amp;rsquo;t Work&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-pipes-dont-work&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-pipes-dont-work&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Unix pipes are great, but they don&amp;rsquo;t work in all situations. The
classic problem is in a situation like this:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ program --in1 in1.txt --in2 in2.txt --out1 out1.txt &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;          --out2 out2.txt &amp;gt; stats.txt 2&amp;gt; diagnostics.stderr&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;My past colleagues at the
&lt;a href=&#34;http://bioinformatics.ucdavis.edu/&#34;&gt;UC Davis Bioinformatics Core&lt;/a&gt;
 and
I wrote a set of tools for processing next-generation sequencing data
and ran into this situation. In keeping with Unix tradition,
each tool was separate. In practice, this was a crucial design decision because
we saw such differences in data quality due to different sequencing
library preparation. Having separate tools working together, in
addition to being more Unix-y, led to more power to spot problems.&lt;/p&gt;
&lt;p&gt;However, one step of our workflow has &lt;em&gt;two&lt;/em&gt; input files and &lt;em&gt;three&lt;/em&gt;
output files due to the nature of our data (paired-end sequencing
data). Additionally, both &lt;code&gt;in1.txt&lt;/code&gt; and &lt;code&gt;in2.txt&lt;/code&gt; were the results of
&lt;em&gt;another program&lt;/em&gt;, and these could be run in &lt;em&gt;parallel&lt;/em&gt; (so
interleaving the pairs makes this harder to run in parallel). The
classic Unix pipe wouldn&amp;rsquo;t work, as we had more than one input file and
more than one output file: our pipe abstraction breaks down. Hacky solutions
like using standard error are way too unpalatable. What to do?&lt;/p&gt;
&lt;h2&gt;Named Pipes&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;named-pipes&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#named-pipes&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;One solution to this problem is to use &lt;strong&gt;named pipes&lt;/strong&gt;. A named pipe,
also known as a FIFO (after First In First Out, a concept in computer
science), is a special sort of file we can create with &lt;code&gt;mkfifo&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ mkfifo fqin
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ls -l fqin
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  prw-r--r--    &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; vinceb  staff          &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; Aug  &lt;span class=&#34;m&#34;&gt;5&lt;/span&gt; 22:50 fqin&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You&amp;rsquo;ll notice that this is indeed a special type of file: &lt;code&gt;p&lt;/code&gt; for
pipe. You interface with these as if they were files (i.e. with Unix
redirection, not pipes), but they behave like pipes:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ &lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;hello, named pipes&amp;#34;&lt;/span&gt; &amp;gt; fqin &lt;span class=&#34;p&#34;&gt;&amp;amp;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;1&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;16430&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ cat &amp;lt; fqin
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;1&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;  + &lt;span class=&#34;m&#34;&gt;16430&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;done&lt;/span&gt;       &lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;hello, named pipes&amp;#34;&lt;/span&gt; &amp;gt; fqin
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;hello, named pipes&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Hopefully you can see the power despite the simple example. Even
though the syntax is similar to shell redirection to a file, &lt;em&gt;we&amp;rsquo;re
not actually writing anything to our disk&lt;/em&gt;. Note too that the &lt;code&gt;[1] + 16430 done&lt;/code&gt; line is printed because we ran the first line as a
background process (to free up a prompt). We could also run the same
command in a different terminal window. To remove the named pipe, we
just use &lt;code&gt;rm&lt;/code&gt;.&lt;/p&gt;
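&lt;p&gt;Named pipes are not tied to the shell. As a minimal sketch, the same round trip can be done from Python with &lt;code&gt;os.mkfifo&lt;/code&gt;, assuming a Unix-like system; the function name &lt;code&gt;fifo_roundtrip&lt;/code&gt; and the use of a temporary directory are choices made for this illustration. A background thread plays the role of the backgrounded &lt;code&gt;echo&lt;/code&gt;, because opening a FIFO for writing blocks until something opens it for reading.&lt;/p&gt;

```python
import os
import tempfile
import threading


def fifo_roundtrip(message):
    """Send a message through a named pipe and return what the reader receives."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo_path = os.path.join(tmpdir, "fqin")
        os.mkfifo(fifo_path)  # same effect as `mkfifo fqin` in the shell

        def writer():
            # Blocks until a reader opens the FIFO, like the backgrounded echo.
            with open(fifo_path, "w") as fifo:
                fifo.write(message)

        thread = threading.Thread(target=writer)
        thread.start()
        with open(fifo_path) as fifo:  # like `cat` reading from fqin
            received = fifo.read()
        thread.join()
        os.remove(fifo_path)  # a named pipe is removed like any other file
    return received


print(fifo_roundtrip("hello, named pipes"))
```

&lt;p&gt;The message goes from the writer to the reader through the pipe itself; the FIFO&amp;rsquo;s entry in the filesystem is only a name for the two sides to meet at.&lt;/p&gt;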
&lt;p&gt;Creating and using two named pipes would prevent IO bottlenecks and
allow us to interface the program creating &lt;code&gt;in1.txt&lt;/code&gt; and &lt;code&gt;in2.txt&lt;/code&gt;
directly with &lt;code&gt;program&lt;/code&gt;, but I wanted something cleaner. For quick
inter-process communication tasks, I really don&amp;rsquo;t want to use &lt;code&gt;mkfifo&lt;/code&gt;
a bunch of times and have to remove each of these named pipes. Luckily
Unix offers an even more elegant way: process substitution.&lt;/p&gt;
&lt;h2&gt;Process Substitution&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;process-substitution&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#process-substitution&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Process substitution&lt;/strong&gt; uses the same mechanism as named pipes, but
does so without the need to actually create a lasting named pipe
through clever shell syntax. These are also appropriately called
&amp;ldquo;anonymous named pipes&amp;rdquo;. Process substitution is implemented in most
modern shells and can be used through the syntax &lt;code&gt;&amp;lt;(command arg1 arg2)&lt;/code&gt;. The shell runs these commands, and passes their output to a
file descriptor, which on Unix systems will be something like
&lt;code&gt;/dev/fd/11&lt;/code&gt;. This file descriptor will then be substituted by your
shell where the call to &lt;code&gt;&amp;lt;()&lt;/code&gt; was. Running a command in parentheses in
a shell invokes a separate subprocess, so multiple uses of &lt;code&gt;&amp;lt;()&lt;/code&gt; are
&lt;em&gt;run in parallel automatically&lt;/em&gt; (scheduling is handled by your OS
here, so you may want to use this cautiously on shared systems where
explicitly setting the number of processes may be
preferable). Additionally, as a subshell, each &lt;code&gt;&amp;lt;()&lt;/code&gt; can include its
&lt;em&gt;own&lt;/em&gt; pipes, so crazy stuff like &lt;code&gt;&amp;lt;(command arg1 | othercommand arg2)&lt;/code&gt;
is possible, and sometimes wise.&lt;/p&gt;
&lt;p&gt;In our simple fake example above, this would look like:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ program --in1 &amp;lt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;makein raw1.txt&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; --in2 &amp;lt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;makein raw2.txt&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;          --out1 out1.txt --out2 out2.txt &amp;gt; stats.txt 2&amp;gt; diagnostics.stderr&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;where &lt;code&gt;makein&lt;/code&gt; is some program that creates the contents of &lt;code&gt;in1.txt&lt;/code&gt; and &lt;code&gt;in2.txt&lt;/code&gt; in
the original example (from &lt;code&gt;raw1.txt&lt;/code&gt; and &lt;code&gt;raw2.txt&lt;/code&gt;) and writes them
to standard out. It&amp;rsquo;s that simple: you&amp;rsquo;re running a process in a
subshell, and its standard out is going to a file descriptor (the
&lt;code&gt;/dev/fd/11&lt;/code&gt; or whatever number it is on your system), and &lt;code&gt;program&lt;/code&gt;
is taking input from that. In fact, if we see this process in htop or
with ps, it looks like:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ ps aux &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep program
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  vince  &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;...&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; program --in1 /dev/fd/63 --in2 /dev/fd/62 --out1 out1.txt --out2 out2.txt &amp;gt; stats.txt 2&amp;gt; diagnostics.stderr&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
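&lt;p&gt;To demystify the &lt;code&gt;/dev/fd/63&lt;/code&gt; path a little, here is a Python sketch that imitates by hand roughly what the shell does for a single &lt;code&gt;&amp;lt;()&lt;/code&gt;. It assumes a Unix-like system that provides &lt;code&gt;/dev/fd&lt;/code&gt;; the function name &lt;code&gt;substitute_process&lt;/code&gt; is made up, and &lt;code&gt;printf&lt;/code&gt; and &lt;code&gt;cat&lt;/code&gt; stand in for &lt;code&gt;makein&lt;/code&gt; and &lt;code&gt;program&lt;/code&gt;. This is not how any particular shell implements it, only the same idea: make a pipe, start the producer writing into it, and hand the consumer a path that names the read end.&lt;/p&gt;

```python
import os
import subprocess


def substitute_process(producer_cmd, consumer_cmd):
    """Run consumer_cmd with one extra argument: a /dev/fd path fed by producer_cmd."""
    read_fd, write_fd = os.pipe()
    producer = subprocess.Popen(producer_cmd, stdout=write_fd)
    os.close(write_fd)  # the parent closes its copy so the consumer can see EOF
    try:
        result = subprocess.run(
            consumer_cmd + ["/dev/fd/%d" % read_fd],
            pass_fds=(read_fd,),  # keep the read end open in the consumer
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        os.close(read_fd)
        producer.wait()
    return result.stdout


print(substitute_process(["printf", "hello from a subprocess"], ["cat"]))
```

&lt;p&gt;The consumer never learns that its &amp;ldquo;file&amp;rdquo; is a pipe; it just opens the path it was given, which is why this works with programs that insist on file name arguments.&lt;/p&gt;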
&lt;p&gt;But suppose you wanted to pass &lt;code&gt;out1.txt&lt;/code&gt; and &lt;code&gt;out2.txt&lt;/code&gt; to gzip to
compress them? Clearly we don&amp;rsquo;t want to write them to disk, and &lt;em&gt;then&lt;/em&gt;
compress them, as this is slow and a waste of system
resources. Luckily process substitution works the other way too,
through &lt;code&gt;&amp;gt;()&lt;/code&gt;. So we could compress in place with:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ program --in1 &amp;lt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;makein raw1.txt&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; --in2 &amp;lt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;makein raw2.txt&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;          --out1 &amp;gt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;gzip &amp;gt; out1.txt.gz&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; --out2 &amp;gt;&lt;span class=&#34;o&#34;&gt;(&lt;/span&gt;gzip &amp;gt; out2.txt.gz&lt;span class=&#34;o&#34;&gt;)&lt;/span&gt; &amp;gt; stats.txt 2&amp;gt; diagnostics.stderr&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Unix never ceases to amaze me in its power. The chainsaw is out and
you&amp;rsquo;re cutting through a giant tree. But power comes with a cost here:
clarity. Debugging this can be difficult. This level of complexity is
like Marmite: I recommend not layering it on too thick at
first. You&amp;rsquo;ll hate it and want to vomit. Admittedly, the nested
inter-process communication syntax is neat but awkward — it&amp;rsquo;s not the
simple, clearly understandable &lt;code&gt;|&lt;/code&gt; that we&amp;rsquo;re used to.&lt;/p&gt;
&lt;h2&gt;Speed&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;speed&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#speed&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;So, is this really faster? Yes, quite. Writing to and reading from the disk
comes at a big price — see
&lt;a href=&#34;http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html&#34;&gt;latency numbers every programmer should know&lt;/a&gt;
. Unfortunately
I am too busy to do extensive benchmarks, but I wrote a slightly
&lt;a href=&#34;https://gist.github.com/vsbuffalo/6181676&#34;&gt;insane read trimming script&lt;/a&gt;

that makes use of process substitution. Use at your own risk, but
we&amp;rsquo;re using it over simple
&lt;a href=&#34;https://github.com/najoshi/sickle&#34;&gt;Sickle&lt;/a&gt;
/&lt;a href=&#34;https://github.com/vsbuffalo/scythe&#34;&gt;Scythe&lt;/a&gt;
/&lt;a href=&#34;https://github.com/vsbuffalo/seqqs&#34;&gt;Seqqs&lt;/a&gt;

combinations. One test uses &lt;code&gt;trim.sh&lt;/code&gt;, the other is a simple shell
script that just runs Scythe in the background (in parallel, combined
with Bash&amp;rsquo;s &lt;code&gt;wait&lt;/code&gt;), writes files to disk, and Sickle processes
these. The benchmark is biased against process substitution, because I
also compress the files via &lt;code&gt;&amp;gt;(gzip &amp;gt; )&lt;/code&gt; in those tests, but don&amp;rsquo;t
compress the others. Despite my conservative benchmark, the difference
is striking:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/ps_benchmarks.png&#34; alt=&#34;Real time difference: process substitution = 55m43.274s, writing to file = 96m5.422s&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Additionally, with the &lt;code&gt;&amp;gt;(gzip &amp;gt; )&lt;/code&gt; bit, our sequences had a
compression ratio of about 3.46% — not bad. With most good tools
handling gzip compression natively (that is, without requiring prior
decompression), and easy in-place compression via process
substitution, there&amp;rsquo;s really no reason not to keep large data
sets compressed. This is especially the case in bioinformatics where
we get decent compression ratios, and our friends &lt;code&gt;less&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, and
&lt;code&gt;grep&lt;/code&gt; have their &lt;code&gt;zless&lt;/code&gt;, &lt;code&gt;gzcat&lt;/code&gt;, and &lt;code&gt;zgrep&lt;/code&gt; analogs.&lt;/p&gt;
&lt;p&gt;Once again, I&amp;rsquo;m astonished at the beauty and power of Unix. As far as
I know, process substitution is not well known: I asked a few sysadmin
friends and they&amp;rsquo;d seen named pipes but not process substitution. But
given Unix&amp;rsquo;s abstraction of files, it&amp;rsquo;s no surprise. Actually Brian
Kernighan waxed poetic about both pipes and Unix files in
&lt;a href=&#34;http://techchannel.att.com/play-video.cfm/2012/2/22/AT&amp;amp;T-Archives-The-UNIX-System&#34;&gt;this classic AT&amp;amp;T 1980s video on Unix&lt;/a&gt;
. Hopefully
younger generations of programmers will continue to discover the
beauty of Unix (and stop re-inventing the wheel, something we&amp;rsquo;ve all
been guilty of). Tools that are designed to work in the Unix
environment can leverage Unix&amp;rsquo;s power and end up with emergent powers.&lt;/p&gt;
&lt;p&gt;If you want more information on process substitution and Unix&amp;rsquo;s named pipes, I suggest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Checking out the &lt;code&gt;tee&lt;/code&gt; example and other examples on Wikipedia&amp;rsquo;s
&lt;a href=&#34;http://en.wikipedia.org/wiki/Process_substitution&#34;&gt;process substitution&lt;/a&gt;

page.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://www.linuxjournal.com/article/2156&#34;&gt;This 1997 article&lt;/a&gt;
 from
Linux Journal on named pipes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Share your experiences/wisdom/comments/critiques; I&amp;rsquo;m on
&lt;a href=&#34;https://twitter.com/vsbuffalo&#34;&gt;Twitter&lt;/a&gt;
.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Bioinformatics and Interface Design</title>
      <link>https://vincebuffalo.com/blog/bioinformatics-and-interface-design/</link>
      <pubDate>Sat, 26 Jan 2013 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/bioinformatics-and-interface-design/</guid>
      <description>
        
        
        &lt;p&gt;Day-to-day bioinformatics involves interfacing and executing many
programs to process data. We end up with some refinement of the data
from which we extract biological meaning through data analysis. Yet given
how much interfacing bioinformatics involves, this process receives
very little thought or design optimization.&lt;/p&gt;
&lt;p&gt;Much more attention is needed on the design of interfaces in
bioinformatics, to improve their ease of use, robustness, and
scalability. Interfacing is a low-level problem that we shouldn&amp;rsquo;t be
wasting time on when there are much better high-level problems out
there.&lt;/p&gt;
&lt;p&gt;More bluntly, interfacing is currently an inconvenient glue that full-time
bioinformaticians waste too many hours on. There is no better
illustration of this than how much time we waste on file
parsing tasks. Parsers are most commonly employed in bioinformatics as
crappy interfaces to non-standard formats. We need better designed
interfaces and cleaner interface patterns to help.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m hardly the only one to complain about this. Fred Ross had this
&lt;a href=&#34;http://madhadron.com/?p=227&#34;&gt;sadly accurate description&lt;/a&gt;
:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Estimating from my own experience and observation of my colleagues,
most bioinformaticists today spend between 90% and 100% of their
time stuck in cruft. Methods are chosen because the file formats are
compatible, not because of any underlying suitability. Second,
algorithms vanish from the field&amp;hellip;. I’m worried about
the number of bioinformaticists who don’t understand the difference
between an O(n) and an O(n^2) algorithm, and don’t realize that it
matters.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He&amp;rsquo;s &lt;a href=&#34;http://madhadron.com/?p=263&#34;&gt;parting with bioinformatics&lt;/a&gt;
,
leaving our field with one less person to fix things. However, if
practices are suboptimal and frustrating now, it&amp;rsquo;s not because people
are unprepared to implement better approaches, it&amp;rsquo;s because they&amp;rsquo;re
content with the status quo, since it does work. But, as I&amp;rsquo;ll argue,
we shouldn&amp;rsquo;t be wasting our time on this, and much more elegant
solutions exist.&lt;/p&gt;
&lt;h2&gt;The Current Interfacing Practice and its Paradigm&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-current-interfacing-practice-and-its-paradigm&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-current-interfacing-practice-and-its-paradigm&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The current practice is characterized by system calls from a scripting
language to execute larger bioinformatics programs, parse any output
files created, and repeat. In my mind, I see this as a paradigm in
which &lt;a href=&#34;http://en.wikipedia.org/wiki/State_%28computer_science%29&#34;&gt;state&lt;/a&gt;

is stored on the file system, and execution is achieved by passing a
stringified set of command-line arguments to a system call.&lt;/p&gt;
&lt;p&gt;This practice and paradigm have been the standard for bioinformatics
for ages. It&amp;rsquo;s clunky and inelegant, but it works for routine
bioinformatics tasks. However, the current practice isn&amp;rsquo;t well suited
for large, embarrassingly parallel tasks that will grow increasingly
common as the number of samples increases in genomics
projects. Portability of these pipelines is usually terrible, and
involves awkward tasks like ensuring all called programs are in a
user&amp;rsquo;s &lt;code&gt;$PATH&lt;/code&gt; (good luck with version differences too). Program state
is stored on the disk, the &lt;a href=&#34;http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html&#34;&gt;slowest
component&lt;/a&gt;

apart from the network. Here&amp;rsquo;s a better look at how to &lt;a href=&#34;http://i.imgur.com/X1Hi1.gif&#34;&gt;really
understand how costly storing state on a disk can be (zoom into this
image)&lt;/a&gt;
.&lt;/p&gt;
&lt;p&gt;Storing state on a slow, highly-mutable, non-concurrent component is
only acceptable if it&amp;rsquo;s too big to store in memory. Bioinformatics
certainly has tasks that produce files too large to store in
memory. However, if a user had a task with little memory overhead, the
complete lack of interfaces other than the command-line to all
aligners, mappers, assemblers, etc. would require the user to write to
the disk. If they&amp;rsquo;re clever, they can invoke a subprocess from a
script and capture standard out, but then they&amp;rsquo;re back to parsing text
as a sloppy interface, rather than handling higher-level models or
objects. This needs to change.&lt;/p&gt;
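&lt;p&gt;To make the subprocess route concrete, here&amp;rsquo;s a minimal Python sketch. The &amp;ldquo;tool&amp;rdquo; is a stand-in: a child Python process printing made-up tab-separated hits, since I don&amp;rsquo;t want to assume any particular aligner&amp;rsquo;s flags or output format. &lt;code&gt;run_and_parse&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import subprocess
import sys

# Stand-in for a command-line tool: a child Python process that prints
# tab-separated records. A real aligner would be invoked the same way,
# with its own (tool-specific) arguments and output format.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "print('read1\\tchr1\\t100'); print('read2\\tchr2\\t250')",
]


def run_and_parse(cmd):
    """Run cmd, capture standard out, and parse each line into a tuple."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    hits = []
    for line in result.stdout.splitlines():
        # The text format is the whole interface: three tab-separated fields.
        name, chrom, pos = line.split("\t")
        hits.append((name, chrom, int(pos)))
    return hits


hits = run_and_parse(FAKE_TOOL)
print(hits)
```

&lt;p&gt;Nothing touches the disk here, but everything the script learns still comes from splitting strings, so a change in the tool&amp;rsquo;s output format breaks the interface.&lt;/p&gt;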
&lt;p&gt;While I&amp;rsquo;m maybe being a little harsh on the current paradigm, I should
say some parts are elegant. Unix piping is extremely elegant, and
avoids any latency due to writing to the disk between execution
steps. Workflows like this example from the samtools mpileup &lt;a href=&#34;http://samtools.sourceforge.net/mpileup.shtml&#34;&gt;man
page&lt;/a&gt;
 are clear and
powerful:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ samtools mpileup -uf ref.fa aln1.bam aln2.bam &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; bcftools view -bvcg - &amp;gt; var.raw.bcf  
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;$ bcftools view var.raw.bcf &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; vcfutils.pl varFilter -D100 &amp;gt; var.flt.vcf  &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The real drawback of the current system occurs when a user wants to
leverage the larger command-line programs for something novel. Most
aligners assume you want to align a few sequences, output the results,
and then stop. A user who wants an aligner to align a few sequences,
and then proceed down different paths depending on the output has the
hassle of writing the sequences to a FASTA file or serializing the
sequences in the FASTA format, invoking a subprocess, and then either
passing it a filename, or (if the tool supports this), passing the
serialized string to the subprocess through standard in. If the
command-line tool has overhead for starting up, this is incurred
during each subprocess call (even if it could be shared and the
overhead amortized across subprocess calls).&lt;/p&gt;
&lt;h2&gt;File Formats and Interfacing&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;file-formats-and-interfacing&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#file-formats-and-interfacing&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;To make matters worse, many bioinformatics formats like FASTA, FASTQ,
and GTF are either ill-defined or implemented differently across
libraries, making them risky interface formats. In contrast, consider
the elegance of Google&amp;rsquo;s &lt;a href=&#34;http://code.google.com/p/protobuf/&#34;&gt;protocol
buffers&lt;/a&gt;
. These allow users to
write their data structures in the protocol buffer &lt;a href=&#34;http://en.wikipedia.org/wiki/Interface_description_language&#34;&gt;interface
description
language&lt;/a&gt;
,
and compile interfaces for C++, Java, and Python. This is the type of
high-level functionality bioinformatics needs to interface incredibly
complex data structures, yet we&amp;rsquo;re still stuck in the text parsing
stone age.&lt;/p&gt;
&lt;h2&gt;Foreign Function Interfaces and Shared Libraries&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;foreign-function-interfaces-and-shared-libraries&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#foreign-function-interfaces-and-shared-libraries&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;One way to avoid unnecessary clunky system calls is through foreign
function interfaces (FFIs) to shared libraries, using thin wrappers in
a user&amp;rsquo;s scripting language of choice. Large bioinformatics programs
like aligners, assemblers, and mappers are most commonly written in C
or C++, which allows code to be compiled as a shared library
relatively painlessly.&lt;/p&gt;
&lt;p&gt;FFIs could solve a few problems for bioinformaticians. First, they
would allow much more code recycling in low-level projects: rather
than writing your own highly-optimized C FASTA/FASTQ parser, you can
just link against a shared library with that routine. Additionally,
that shared library can be separately developed and improved.&lt;/p&gt;
&lt;p&gt;Second, FFIs allow modularity and high-level access to low-level
routines. Genome assemblers are packed to the gills with useful
functions. So are aligners. Yet unless the developer took the time to
separate this out via subcommands or an API (like git or samtools),
you&amp;rsquo;re unlikely to ever be able to access this
functionality. Developers with an eye for better program design can
write higher-level functions that could be utilized through an
FFI. Now, novel bioinformatics tasks that may require some sequence
assembly, or a few parallel calls to an aligner can be tackled without
the system call rubbish, or re-implementing all the low-level
algorithms.&lt;/p&gt;
&lt;p&gt;For higher-level functionality with FFIs and shared libraries,
wrappers work beautifully. Rather than wrapping entire programs
through the command line (as &lt;a href=&#34;http://github.com/biopython/biopython/blob/master/Bio/Align/Applications/_Muscle.py&#34;&gt;BioPython
does&lt;/a&gt;
),
scripting language libraries could interact more directly with
low-level programs. In cases in which the current paradigm just
doesn&amp;rsquo;t fit, we&amp;rsquo;d have the option to avoid it by calling routines
directly. Tools like samtools are very successful because they have a
powerful API that allows programs like
&lt;a href=&#34;http://code.google.com/p/pysam/&#34;&gt;pysam&lt;/a&gt;
 to call their routines.&lt;/p&gt;
&lt;p&gt;Imagine now that you could also load adapter and quality trimmers
as wrappers around shared libraries. Rather than using Unix pipes or bash
scripts to write quality control pipelines, and have every program in
the execution chain read, parse, and then write FASTA formatted files,
it could be done once, using object abstractions of data.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sys&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;biolib&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sickle&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;scythe&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;read&lt;/span&gt; &lt;span class=&#34;ow&#34;&gt;in&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;biolib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_fasta&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;open&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sys&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;argv&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;])):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;read_tmd&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;scythe&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;trim&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seq&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;read&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;qual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;prior&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mf&#34;&gt;0.3&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;biolib&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;write_fasta&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;sickle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;trim&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;read_tmd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seq&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;read_tmd&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;qual&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;qual&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;20&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sys&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stdout&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In this artificial example, reads could then be sent
directly to aligners rather than to standard out. Working with
higher-level models of data (in this case, a read) allows easier
real-time debugging, statistics gathering, and
parallelization. Imagine being able to put this entire block in a
&lt;code&gt;try&lt;/code&gt; statement, and have exceptions handled at a higher level. An
error could invoke a debugger, and a bioinformatician could inspect
the culprit interactively in real-time. This is impossible in the old
paradigm (and we&amp;rsquo;ve all spent ages using streaming tools to track down
such bugs).&lt;/p&gt;
&lt;p&gt;Note that I&amp;rsquo;m not arguing that your average biologist should suddenly
start trying to understand &lt;a href=&#34;http://en.wikipedia.org/wiki/Position-independent_code&#34;&gt;position-independent
code&lt;/a&gt;
 and
compile shared libraries to avoid making system calls in their
scripts. Sometimes a system call is the right tool for the job. But
bioinformatics software developers should reach for a system call not
because it&amp;rsquo;s the only interface, but because it&amp;rsquo;s the best interface
for a particular task. Maybe someday we&amp;rsquo;ll even see thin wrappers
coming packaged &lt;em&gt;with&lt;/em&gt; bioinformatics tools (even if under &lt;code&gt;contrib/&lt;/code&gt;
and written by other developers) — I can dream, right?&lt;/p&gt;
&lt;h2&gt;FFI in Practice&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;ffi-in-practice&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#ffi-in-practice&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s a real FFI example: I needed to assemble many sets of sequences
for a pet project. I really wanted to avoid needlessly writing FASTA
files of these sets of sequences to disk, as this is quite
costly. However, most assemblers were designed solely to be interfaced
through the command-line. The inelegance of writing thousands of
files, making thousands of system calls to an assembler, then reading
and parsing the results was not appealing, so I wrote
&lt;a href=&#34;http://github.com/vsbuffalo/pyfermi&#34;&gt;pyfermi&lt;/a&gt;
, a simple Python
interface to &lt;a href=&#34;http://lh3lh3.users.sourceforge.net/&#34;&gt;Heng Li&amp;rsquo;s&lt;/a&gt;
 &lt;a href=&#34;https://github.com/lh3/fermi&#34;&gt;Fermi
assembler&lt;/a&gt;
 (note that this software is
experimental, so use it with caution). First off, I couldn&amp;rsquo;t have done
this without Heng&amp;rsquo;s excellent assembler and code, so I owe him a debt
of gratitude. The Fermi source has a beautiful example of high-level
functions that can be interfaced with relative ease: see the
&lt;code&gt;fm6_api_*&lt;/code&gt; functions used in
&lt;a href=&#34;https://github.com/lh3/fermi/blob/master/example.c&#34;&gt;example.c&lt;/a&gt;
.&lt;/p&gt;
&lt;p&gt;I wrote a few extra C functions in pyfermi (mostly to deal with &lt;code&gt;void *&lt;/code&gt;, necessary because Python&amp;rsquo;s ctypes can&amp;rsquo;t handle foreign types
AFAIK), and
&lt;a href=&#34;https://github.com/vsbuffalo/pyfermi/blob/master/Makefile&#34;&gt;compiled&lt;/a&gt;

Fermi as a shared library. I was able to do all this in &lt;em&gt;far&lt;/em&gt; less
time than it would have taken me to go down the route of writing
thousands of files, making syscalls, and handling file parsing.&lt;/p&gt;
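&lt;p&gt;For readers who haven&amp;rsquo;t used ctypes, here&amp;rsquo;s a minimal sketch of the general pattern. This is not pyfermi&amp;rsquo;s code: it assumes a Unix-like system and calls the C standard library rather than Fermi, using &lt;code&gt;strlen&lt;/code&gt; for an ordinary call and &lt;code&gt;malloc&lt;/code&gt;/&lt;code&gt;free&lt;/code&gt; to show an opaque &lt;code&gt;void *&lt;/code&gt; handled as &lt;code&gt;c_void_p&lt;/code&gt;.&lt;/p&gt;

```python
import ctypes
import ctypes.util

# Load the C standard library (assumes a Unix-like system). If
# find_library returns None, CDLL(None) exposes the symbols already
# loaded into the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare signatures so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# An opaque pointer (void *) is passed around as c_void_p.
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None

seq = b"ACGTACGT"
length = libc.strlen(seq)   # call straight into C, no subprocess

buf = libc.malloc(16)       # opaque handle; Python never looks inside
got_buffer = buf is not None
libc.free(buf)

print(length, got_buffer)
```

&lt;p&gt;A wrapper around an assembler follows the same shape: load the shared library, declare each function&amp;rsquo;s argument and return types, and pass opaque handles back and forth.&lt;/p&gt;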
&lt;h2&gt;The Importance of Good Tools&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-importance-of-good-tools&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-importance-of-good-tools&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Overall, bioinformaticians need to be more conscious of the design and
interfacing of our tools. I strongly believe tools and methods shape
how we approach and solve problems. A data programmer only trained in
Perl will likely at some point wage a messy war trying to produce
decent statistical graphics. Likewise, a statistician only trained in
t-tests and ANOVA will only see normally distributed data (and apply
every transformation under the sun to force their data into this
shape). I&amp;rsquo;m hardly the first person to argue that this occurs: this
idea is known as the &lt;strong&gt;law of the instrument&lt;/strong&gt;. Borrowing a 1964 quote
from Abraham Kaplan (and
&lt;a href=&#34;http://en.wikipedia.org/wiki/Law_of_the_instrument#History&#34;&gt;Wikipedia&lt;/a&gt;
):&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;I call it the law of the instrument, and it may be formulated as
follows: Give a small boy a hammer, and he will find that everything
he encounters needs pounding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Early in our bioinformatics careers, we&amp;rsquo;re trained in file formats,
writing scripts, and calling large programs via the command-line. This
becomes a hammer, and all problems start looking like nails. However,
the old practice and paradigms are breaking down as the scale of our
data and the complexity of our problems increase. We need new-school
bioinformatics with a focus on the bigger picture, so let&amp;rsquo;s start
talking about how we do it.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>21st Century Science and the Need for Open Data and Open Tools</title>
      <link>https://vincebuffalo.com/blog/21st-century-science-and-open-data/</link>
      <pubDate>Sun, 13 Jan 2013 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/21st-century-science-and-open-data/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;Note: I&amp;rsquo;ve been writing this essay in my head for a few months, but I
felt it needed to be completed and released after the sad loss of Open
Access advocate &lt;a href=&#34;http://www.nytimes.com/2013/01/13/technology/aaron-swartz-internet-activist-dies-at-26.html&#34;&gt;Aaron
Swartz&lt;/a&gt;
,
a hacker and activist I admired.&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;21st Century Science and the Need for Open Data and Open Tools&lt;/h1&gt;&lt;p&gt;Open Science rests upon three core principles: open access, open data,
and open tools. However, &amp;ldquo;open science&amp;rdquo; doesn&amp;rsquo;t really imply the
importance of this philosophy; it&amp;rsquo;s perceived as a nice, but not
entirely necessary trait of scientific projects. Yet, without access,
data, and tools, science is not just not open, but it is not
reproducible.&lt;/p&gt;
&lt;p&gt;This is a new phenomenon too; hundreds of years ago, a scientific
article was sufficient to reproduce an experiment. Methods could be
followed to recollect the data, build or collect the necessary tools,
and follow the procedures to recreate the output. I believe this is
why the Open Access movement has had such a large role in the Open
Science movement. People readily accept that reproducibility is a goal
of science, and one cannot reproduce a scientific finding without
access to scholarly articles.&lt;/p&gt;
&lt;p&gt;However, science is undergoing a transition that makes the scholarly
article alone insufficient to reproduce an experiment. As I&amp;rsquo;ll argue,
Open Science advocates need to quickly start focusing more energy on
open tools and open data.&lt;/p&gt;
&lt;h2&gt;Capital-Intensive Science and the Openness of Data&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;capital-intensive-science-and-the-openness-of-data&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#capital-intensive-science-and-the-openness-of-data&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;With large-scale projects like the LHC, new sequencing centers, and
Mars Curiosity, it&amp;rsquo;s clear that some parts of science are becoming
increasingly capital intensive. From my perspective in the genomics
world, it&amp;rsquo;s even more apparent: small labs are seeking large funds to
carry out sequencing efforts. Staggering levels of capital are
needed for investment in expensive and quickly-developing new
machines, whether sequencer, collider, or GC-MS. Besides their
prohibitive costs, these machines share
another trait: they create &lt;strong&gt;a lot of data&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But the beauty of this new data is that it&amp;rsquo;s practically free to copy
and use. Science labs in the developing world may be decades away from
having state-of-the-art sequencers, or GC-MSs, or colliders, but they
can benefit from the presence of such technology in Japan, the USA, or
Switzerland through access to the data produced, &lt;em&gt;as long as the data
is open&lt;/em&gt;. In essence, the beauty of big data is not only the
information it contains, the complexity it can untangle, or the
insights it can foster. It&amp;rsquo;s that it can be endlessly duplicated and
shared, and that any scarcity is entirely
&lt;a href=&#34;http://en.wikipedia.org/wiki/Artificial_scarcity&#34;&gt;artificial&lt;/a&gt;
. As
long as big data remain open, the information created through
capital-intensive science can be widely used and studied by all.&lt;/p&gt;
&lt;p&gt;Furthermore, the new capital-intensive science could be creating a new
era of impossible reproducibility. Small labs cannot afford to resequence
samples just to recreate data; this is why we see a growth of sequence
repositories. As data becomes larger and begins coming from more
sources, there will be technical problems and cost issues in
maintaining public repositories of open data (such issues have already
played out with the Short Read Archive). Open Science must be ready
for such battles.&lt;/p&gt;
&lt;h2&gt;The Importance of Open Source Software&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-importance-of-open-source-software&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-importance-of-open-source-software&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;While large new scientific machinery illustrates the new
capital-intensive side of science, the growth of bioinformatics, data
science, and scientific programming illustrates the other side:
labor-intensive science with a high degree of specialization. The
science of the past centuries was characterized by a few or a single
scientist carrying out an experiment, analyzing the results, and
publishing them. However, the new scientific challenges we face
require a degree of task specialization that is a break from the
past. The necessity to model and understand more complex relationships
requires not just bench scientists to organize and run experiments,
but statisticians to consult, systems administrators to set up and
maintain large computing infrastructure, and programmers that can
write and maintain large codebases.&lt;/p&gt;
&lt;p&gt;Yet as with big data, the software tools created to analyze data can
be duplicated, reused, and adapted for no cost, as long as they&amp;rsquo;re open
source. Open Science must strongly advocate for the use of open source
tools, and also the development of open source alternatives to
proprietary tools.&lt;/p&gt;
&lt;p&gt;A further challenge persists too: open source software communities are
almost always factionalized. Unlike proprietary software, in which a company
develops, releases, and works on a single product, the open source
community
&lt;a href=&#34;http://biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/&#34;&gt;frequently&lt;/a&gt;

recreates software, rather than extending or improving existing
software. Even worse, while a company has managers and hierarchy to
maintain continuity in functionality, documentation, and quality of a
software product while individual developers may leave, open source
software projects frequently decay as lead developers switch groups or
funding runs short.&lt;/p&gt;
&lt;h2&gt;Adjusting Our Thinking&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;adjusting-our-thinking&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#adjusting-our-thinking&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Open Access is an ongoing battle, but also a point of pride for Open
Science advocates. The success of PLoS and the increasing number of
open access journals are signs the open science movement is gaining
wide support. Yet, the science of the 21st century, characterized by a
capital- and labor-intensive production process, requires further
advocacy in open data and open tools.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>My First Recommendation to New Scientific Coders: Learn Visualization</title>
      <link>https://vincebuffalo.com/blog/my-first-recommendation-to-new-scientific-coders-learn-visualization/</link>
      <pubDate>Wed, 14 Nov 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/my-first-recommendation-to-new-scientific-coders-learn-visualization/</guid>
      <description>
        
        
        &lt;p&gt;Scientists are learning programming at an unprecedented rate. I&amp;rsquo;ve
&lt;a href=&#34;https://vincebuffalo.com/blog/the-beauty-of-bioconductor/&#34;&gt;expressed&lt;/a&gt;

&lt;a href=&#34;http://www.dataists.com/2010/09/careful-statistical-computing-part-1/&#34;&gt;concern&lt;/a&gt;

over the fast-paced growth of computing across the sciences and what
this could mean for reproducibility and incorrect findings in the
sciences. Perhaps the best example that illustrates the severity of
this issue is Coombes and Baggerly&amp;rsquo;s &lt;a href=&#34;http://bioinformatics.mdanderson.org/Supplements/ReproRsch-All/Modified/StarterSet/index.html&#34;&gt;Duke
Saga&lt;/a&gt;
.&lt;/p&gt;
&lt;p&gt;I think a lot about how scientists learn programming and how we can
change this process to yield a better outcome (fewer errors, more
readable and reproducible code). Scientific coders must learn to
program in a particular fashion that &amp;ldquo;stacks the deck&amp;rdquo; to make errors
apparent. On this front, unit tests, following coding standards, and
peer code review get a lot of deserved attention. Yet for some reason,
visualization does not. This is unfortunate; visualization should be
learned to a high degree of competency very early on in a programmer&amp;rsquo;s
career.&lt;/p&gt;
&lt;h2&gt;Problems Look Differently When You Can Visualize Quickly&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;problems-look-differently-when-you-can-visualize-quickly&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#problems-look-differently-when-you-can-visualize-quickly&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Neil deGrasse Tyson has an excellent saying: &amp;ldquo;If you are
scientifically literate the world looks very different to you.&amp;rdquo;
Similarly, if you have the skills to visualize information
&lt;strong&gt;quickly&lt;/strong&gt;, problems start to look different to you. Don&amp;rsquo;t only learn
to visualize, learn to do it so effectively that each time you imagine
a visualization, there&amp;rsquo;s almost no time cost in implementing it.&lt;/p&gt;
&lt;p&gt;Why do I stress being efficient at visualization? If there&amp;rsquo;s no
barrier to a coder making a plot —if a coder doesn&amp;rsquo;t think before each
plot, &amp;ldquo;shit, now I have to remember how to do this&amp;rdquo;— they&amp;rsquo;ll more
readily apply it to everything, and fewer errors will go unnoticed. If
the barrier is high, a coder will hesitate and end up using it less as
a tool.&lt;/p&gt;
&lt;p&gt;Visualization also drops the barrier for quick interpretation of data.
A graphic display of data is often more efficient to interpret than
numbers and tables. Trying to interpret a four-dimensional table
requires a lot of mental cycles and time. Look at the &lt;code&gt;Titanic&lt;/code&gt;
dataset from R:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Age = Child, Survived = No              Age = Adult, Survived = No
       Sex                                     Sex
 Class  Male Female                      Class  Male Female
   1st     0      0                        1st   118      4
   2nd     0      0                        2nd   154     13
   3rd    35     17                        3rd   387     89
   Crew    0      0                        Crew  670      3

Age = Child, Survived = Yes             Age = Adult, Survived = Yes
       Sex                                     Sex
 Class  Male Female                      Class  Male Female
   1st     5      1                        1st    57    140
   2nd    11     13                        2nd    14     80
   3rd    13     14                        3rd    75     76
   Crew    0      0                        Crew  192     20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, consider this &lt;a href=&#34;http://www.d3js.org&#34;&gt;d3.js&lt;/a&gt;
 &lt;a href=&#34;http://www.jasondavies.com/parallel-sets/&#34;&gt;parallel sets
visualization&lt;/a&gt;
 by Jason
Davies:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/parallel-sets-titanic.png&#34; alt=&#34;Parallel sets visualization of Titanic survivors&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Immediately, parallel sets shows us the large numbers that perished in
the Titanic&amp;rsquo;s sinking. Width reveals not only the breakdown of
survivors/non-survivors, but also the composition of the ship&amp;rsquo;s
passengers prior to hitting the iceberg. This additional data could
only be calculated from the table by manually adding across separate
tables, which again incurs a time cost.&lt;/p&gt;
&lt;h2&gt;Smart Visualization Over Stupid Hypothesis Testing&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;smart-visualization-over-stupid-hypothesis-testing&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#smart-visualization-over-stupid-hypothesis-testing&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Imagine you&amp;rsquo;ve just written a nucleotide sequence processing algorithm
and you want to make sure it isn&amp;rsquo;t being confounded by large sequences
or small sequences. Some scientists reach for a hypothesis test. No!
Visualize it. A hypothesis test is inherently
univariate. Visualization is multidimensional. In this case, I would
plot in two dimensions and color by sequence length. Or use density
plots and color by sequence length. Then try a logarithmic scale. Try
coloring by sequence length in continuous and discrete color scales.&lt;/p&gt;
&lt;p&gt;Statisticians are obsessed with confounding variables, but I feel folks
writing data processing scripts are not. The example above is what I
call &lt;strong&gt;color-by-confounder&lt;/strong&gt; (well, possible confounder). If a
variable that should be unrelated to another is forming a colorful
cluster in a scatterplot, visualization (and your pattern-finding
ape-brain) is much more effective than a clumsily applied,
assumption-ridden, old-fashioned,
&lt;a href=&#34;http://polmeth.wustl.edu/media/Paper/gill99.pdf&#34;&gt;philosophically-troubled&lt;/a&gt;

hypothesis test.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s worth mentioning that while our ape-brains are indeed excellent
visual pattern finding machines, they can also be prone to false
positives. Getting in the habit of forming hypotheses about how a
graphic &lt;em&gt;should&lt;/em&gt; look before creating it can help protect us from
&lt;a href=&#34;http://en.wikipedia.org/wiki/Apophenia&#34;&gt;apophenia&lt;/a&gt;
. I&amp;rsquo;ve seen strange
patterns emerge in data that scream, &amp;ldquo;you screwed something up big&amp;rdquo;,
but after heavy thought reveal everything is fine.&lt;/p&gt;
&lt;h2&gt;Build Tools that Support Visual Output&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;build-tools-that-support-visual-output&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#build-tools-that-support-visual-output&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Developers should try to output data in file formats that are very
easily parsed by existing functions or libraries in popular
languages. It&amp;rsquo;s indefensible to make up a file format when your data
can be equally well expressed in an existing one. Most data can be
expressed in tab-delimited, CSV, JSON, or XML. It takes two lines in R
to load any of these file formats with the appropriate library; there
is virtually no barrier to loading and visualizing data from these
formats.&lt;/p&gt;
&lt;p&gt;Recently I had to process some variable-space tabular output from a
popular bioinformatics program. The manual had a footnote saying,
&amp;ldquo;contrary to the shrieks of outrage we occasionally receive about
this, space-delimited files are just as trivial to parse as
tab-delimited files.&amp;rdquo; Considering that the header spanned two rows
(seriously), that fields could contain unquoted spaces, and that the
delimiter could be as short as a single space, this is most definitely not
&lt;strong&gt;safely&lt;/strong&gt; trivial across datasets.&lt;/p&gt;
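&lt;p&gt;To make the ambiguity concrete, here is a minimal shell sketch using a
made-up record (not output from the program in question) whose first field
contains a space. It assumes only a POSIX &lt;code&gt;awk&lt;/code&gt;. Split on tabs, the
record has two fields; split on runs of whitespace, as variable-space output
requires, it appears to have three:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Hypothetical record: a sample name containing a space, then a count.
tab_fields=$(printf 'liver sample\t42\n' | awk -F'\t' '{ print NF }')
space_fields=$(printf 'liver sample   42\n' | awk '{ print NF }')
echo "tab-delimited fields:   $tab_fields"
echo "space-delimited fields: $space_fields"&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing in the space-delimited line says whether the space inside the
name is data or a delimiter, so a parser has to guess.&lt;/p&gt;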
&lt;p&gt;Attempting to use a human-readable format such as
variable-spaced/fixed-width-column formats makes the rather silly
assumption that your data will only be looked at by a human. It&amp;rsquo;s
always easier to make human readable data out of computer-readable
data than to do the opposite. In today&amp;rsquo;s big data age, I&amp;rsquo;m skeptical
humans actually process huge data sets in human-readable file formats
in any way that&amp;rsquo;s both meaningful and not horribly inefficient.&lt;/p&gt;
&lt;h2&gt;Ok, How do I Learn Visualization?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;ok-how-do-i-learn-visualization&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#ok-how-do-i-learn-visualization&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The first requirement to be able to visualize quickly is to know your
tools. R is unequivocally the best tool to learn first, and learn the
best. Buy Hadley Wickham&amp;rsquo;s &lt;a href=&#34;http://ggplot2.org/&#34;&gt;ggplot2&lt;/a&gt;
 book, and
bookmark his website. Learn ggplot2 thoroughly; it scales almost
impossibly well to the complexity of problems you throw at it
(primarily because it&amp;rsquo;s built around an ingenious abstraction).&lt;/p&gt;
&lt;p&gt;Edward Tufte
 also has some excellent
books on visualization worth investing in. &lt;a href=&#34;http://www.amazon.com/The-Visual-Display-Quantitative-Information/dp/0961392142/ref=sr_1_1?ie=UTF8&amp;amp;qid=1352871358&amp;amp;sr=8-1&amp;amp;keywords=edward&amp;#43;tufte&#34;&gt;The Visual Display of
Quantitative
Information&lt;/a&gt;

is probably the best to start with. Other excellent projects worth
being aware of are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://colorbrewer2.org/&#34;&gt;Color Brewer&lt;/a&gt;
 for intelligent color
choices for different problems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://d3js.org&#34;&gt;d3.js&lt;/a&gt;
 is a new-ish JavaScript visualization
framework I am very excited about.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.ggobi.org/&#34;&gt;ggobi&lt;/a&gt;
 is a useful system for
high-dimension visualization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://cran.r-project.org/web/packages/lattice/index.html&#34;&gt;lattice&lt;/a&gt;

was the first graphics package for R I learned, and even though I
use ggplot2 primarily now, it is still useful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.bioconductor.org/packages/2.11/bioc/html/ggbio.html&#34;&gt;ggbio&lt;/a&gt;

if you do bioinformatics. Learning to use genome browsers, track
formats (BED, WIG, GTF), and read visualization programs (such as
&lt;a href=&#34;http://www.broadinstitute.org/igv/&#34;&gt;IGV&lt;/a&gt;
) are also very important
skills.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Generally, the best advice about learning visualization is to be
patient, and not settle for a subpar graphic. Patience and
perfectionism will lead to better graphics and a thorough
understanding of the tools.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;conclusion&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#conclusion&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Visualization is a skill worth investing time in. It&amp;rsquo;s low-hanging
fruit for all programmers. It&amp;rsquo;s also enjoyable. Fundamentally,
developers need to adjust how they think about visualization. It&amp;rsquo;s not
something to brush up on every time a plot is needed for an article or
presentation. It should become part of every professional developer&amp;rsquo;s
workflow, right alongside version control, debugging, and unit tests.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Simple Parallel Bioinformatics Pipelines with find, basename, and xargs</title>
      <link>https://vincebuffalo.com/blog/simple-parallel-bioinformatics-pipelines/</link>
      <pubDate>Mon, 08 Oct 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/simple-parallel-bioinformatics-pipelines/</guid>
      <description>
        
        
        &lt;h1&gt;Simple Parallel Bioinformatics Pipelines with find, basename, and xargs&lt;/h1&gt;&lt;h2&gt;Big-Ass Servers and Data Parallelism&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;big-ass-servers-and-data-parallelism&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#big-ass-servers-and-data-parallelism&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A routine operation in bioinformatics is to process a lot of files on
a so-called &lt;a href=&#34;http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html&#34;&gt;&amp;ldquo;Big-Ass
Server&amp;rdquo;&lt;/a&gt;
. In
most cases, these have to be processed using the same tools, in the
same way, making this a prime example of data parallelism. The unit of
data divided across multiple cores is the file.&lt;/p&gt;
&lt;p&gt;Note that there&amp;rsquo;s very little opportunity for task parallelism in
bioinformatics file processing pipelines. Consider a common task of
most bioinformatics sequencing projects (resequencing, RNA-seq, etc):
taking reads from many samples, running quality diagnostics, and
running adapter and quality trimming. None of these steps can be done
in a task-parallel fashion; they must be pipelined.&lt;/p&gt;
&lt;h2&gt;Find-Basename-Xargs&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;find-basename-xargs&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#find-basename-xargs&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Suppose in this example I have genomic sequencing data of 50 human
individuals. All individuals&amp;rsquo; sequences are in the directory &lt;code&gt;seq/&lt;/code&gt;
and are semantically named in the format of
&lt;code&gt;{lane-number}-{individual-id}.fastq&lt;/code&gt;. In this example, suppose I just
want to run each file through my adapter trimming program
&lt;a href=&#34;https://github.com/vsbuffalo/scythe&#34;&gt;scythe&lt;/a&gt;
 in parallel across multiple cores.&lt;/p&gt;
&lt;p&gt;For such tasks, I frequently use a design pattern I refer to as
&amp;ldquo;find-basename-xargs&amp;rdquo; (FBX hereafter). I doubt this is novel, but I do
refer to this specific design pattern frequently. I&amp;rsquo;ll dissect an
example.&lt;/p&gt;
&lt;p&gt;Our &lt;code&gt;seq/&lt;/code&gt; directory contains many files we wish to trim with Scythe,
in parallel. The first step of FBX is to use the &lt;code&gt;find&lt;/code&gt; command to
find all relevant files. In our case:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;$ find seq -name &amp;#34;*.fastq&amp;#34;
seq/african-1.fastq
seq/african-10.fastq
seq/african-11.fastq
seq/african-12.fastq
# [...]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Next, &lt;code&gt;basename&lt;/code&gt; is used to drop the extension and &lt;code&gt;seq/&lt;/code&gt; directory,
which leaves us with only the identifying string (I think of this as a
key&amp;hellip; you&amp;rsquo;ll see why in a few weeks). Having just the identifying key
allows us to specify the output directory and any file suffixes. We
use &lt;code&gt;basename&lt;/code&gt; with &lt;code&gt;xargs&lt;/code&gt; because we&amp;rsquo;re processing one incoming
line from stdin at a time (&lt;code&gt;-n1&lt;/code&gt; specifies that one argument will be taken
at a time). The command results would look like:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;$ find seq -name &amp;#34;*.fastq&amp;#34; | xargs -n1 -I{} basename {} .fastq
african-1
african-10
african-11
african-12
# [...]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The argument &lt;code&gt;-I{}&lt;/code&gt; specifies the replacement string. It is necessary here because the
file name is not the last positional argument of &lt;code&gt;basename&lt;/code&gt;; the suffix &lt;code&gt;.fastq&lt;/code&gt; is.&lt;/p&gt;
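&lt;p&gt;As a quick check of what &lt;code&gt;basename&lt;/code&gt; does here, the sketch below runs it
on a single path and then repeats the pipeline step on a scratch directory.
The scratch directory, the file names, and the null-delimited variant
(&lt;code&gt;-print0&lt;/code&gt; with &lt;code&gt;xargs -0&lt;/code&gt;, available in the GNU and BSD tools) are
assumptions added for illustration rather than part of the pattern as
described; they keep file names containing spaces intact. Because &lt;code&gt;-I{}&lt;/code&gt;
already runs the command once per input item, the sketch leaves out &lt;code&gt;-n1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Strip the leading directory and the .fastq suffix from one path.
basename seq/african-1.fastq .fastq    # prints: african-1

# Same step on a scratch directory with made-up file names, one of
# which contains a space. Null delimiters keep that name in one piece.
demo=$(mktemp -d)
mkdir "$demo/seq"
touch "$demo/seq/african-1.fastq" "$demo/seq/african 2.fastq"
keys=$(find "$demo/seq" -name "*.fastq" -print0 |
  xargs -0 -I{} basename {} .fastq)
echo "$keys"    # two keys, one per line, in whatever order find returns
rm -r "$demo"&lt;/code&gt;&lt;/pre&gt;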
&lt;p&gt;Finally, we do the processing with &lt;code&gt;xargs&lt;/code&gt;. GNU &lt;code&gt;parallel&lt;/code&gt; also works,
and provides nice additional features. Scythe takes some fixed
arguments (adapter, prior) and file options dependent on the key
(input file, output file). So to run Scythe on each file in parallel
and output the results to the &lt;code&gt;trimmed/&lt;/code&gt; directory, we&amp;rsquo;d use &lt;code&gt;xargs&lt;/code&gt; with
&lt;code&gt;-n1 -P10 -I{}&lt;/code&gt;, where &lt;code&gt;-P10&lt;/code&gt; allows up to ten processes at once. Our exact command would be:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;find seq -name &lt;span class=&#34;s2&#34;&gt;&amp;#34;*.fastq&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -I&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; basename &lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; .fastq &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -P10 -I&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; scythe -a adapters.fasta -p 0.4 -o trimmed/&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt;.fastq seq/&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt;.fastq&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that we re-specify the directories and extensions so Scythe can
find and process the appropriate file.&lt;/p&gt;
&lt;h2&gt;Redirecting Output in Parallel and FBX Chaining&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;redirecting-output-in-parallel-and-xbf-chaining&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#redirecting-output-in-parallel-and-xbf-chaining&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;m a stickler about gathering and analyzing statistics at all
intermediary steps in a bioinformatics processing pipeline. Scythe
will output summary statistics on found adapters in the reads, but it
prints it out to stderr. With hundreds of files being processed,
simply redirecting stderr to a file via &lt;code&gt;2&amp;gt;&lt;/code&gt; is ineffective: every job&amp;rsquo;s messages land in the same file. Since
&lt;code&gt;xargs&lt;/code&gt; is calling the program, there&amp;rsquo;s no direct way to redirect a specific
Scythe call&amp;rsquo;s stderr to a specific file.&lt;/p&gt;
&lt;p&gt;My workaround is to wrap the command call in a bash shell
script. Our shell script would be incredibly simple (a better version
would ensure directories exist):&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;cp&#34;&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;cp&#34;&gt;&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;set&lt;/span&gt; -o nounset
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;set&lt;/span&gt; -e
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;IN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34; started running scythe on file &amp;#39;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#39;...&amp;#34;&lt;/span&gt; 1&amp;gt;&lt;span class=&#34;p&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;scythe -a adapters.fasta -p 0.4 -o trimmed/&lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;.fastq seq/&lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;.fastq 2&amp;gt; stats/&lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;-output.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34; completed scythe on file &amp;#39;&lt;/span&gt;&lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#39;.&amp;#34;&lt;/span&gt; 1&amp;gt;&lt;span class=&#34;p&#34;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;$IN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And we could call it in the same fashion.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;find seq -name &lt;span class=&#34;s2&#34;&gt;&amp;#34;*.fastq&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -I&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; basename &lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; .fastq &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -P10 bash run-scythe.bash&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;A few notes: the script writes its status messages to stderr, not stdout. This allows us to
&lt;em&gt;chain&lt;/em&gt; the FBX pattern. When a file&amp;rsquo;s processing is complete,
execution proceeds to the final echo line, which sends this file&amp;rsquo;s key on stdout to
the next step. This incredibly simple approach allows parallel steps
to be pipelined. If we use Sickle to do quality-based trimming, the
pattern is similar:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;find seq -name &lt;span class=&#34;s2&#34;&gt;&amp;#34;*.fastq&amp;#34;&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -I&lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; basename &lt;span class=&#34;o&#34;&gt;{}&lt;/span&gt; .fastq &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -P10 bash run-scythe.bash &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt;&lt;/span&gt;  xargs -n1 -P10 bash run-sickle.bash&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
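&lt;p&gt;To see the chaining behaviour without installing Scythe or Sickle, here
is a self-contained sketch in which both tools are replaced by a toy
stand-in (a copy plus a message on stderr). The script name
&lt;code&gt;run-step.sh&lt;/code&gt;, its three arguments, the stand-in function, and the scratch
files are all hypothetical; the wrapper also includes the
&lt;code&gt;mkdir -p&lt;/code&gt; that a more careful version would need. Only keys travel down
stdout, while each file&amp;rsquo;s stderr is captured in its own file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Scratch data: three tiny made-up FASTQ files.
work=$(mktemp -d)
cd "$work"
mkdir seq
for key in african-1 african-2 african-3; do
  printf '@read1\nACGT\n+\nIIII\n' &amp;gt; "seq/$key.fastq"
done

# Hypothetical wrapper. Usage: sh run-step.sh SRC_DIR DST_DIR KEY
cat &amp;gt; run-step.sh &amp;lt;&amp;lt;'EOF'
set -eu
SRC=$1
DST=$2
IN=$3
# -p keeps this safe when parallel jobs create the same directory.
mkdir -p "$DST" stats
# Toy stand-in for Scythe or Sickle: copy the reads, report on stderr.
fake_tool() { cp "$1" "$2"; echo "processed $1" &amp;gt;&amp;amp;2; }
fake_tool "$SRC/$IN.fastq" "$DST/$IN.fastq" 2&amp;gt; "stats/$IN-$DST.txt"
echo "$IN"   # stdout carries only the key, for the next step
EOF

# Two chained parallel steps; sort only makes the final listing stable.
find seq -name "*.fastq" | xargs -I{} basename {} .fastq |
  xargs -n1 -P2 sh run-step.sh seq trimmed |
  xargs -n1 -P2 sh run-step.sh trimmed filtered | sort
ls stats&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A key reaches the second step only after the first step has finished
writing that file, so the second step never reads a half-written input.&lt;/p&gt;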

      </description>
    </item>
    
    <item>
      <title>Using Bioconductor to Analyze your 23andme Data</title>
      <link>https://vincebuffalo.com/blog/using-bioconductor-to-analyze-your-23andme-data/</link>
      <pubDate>Mon, 12 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/using-bioconductor-to-analyze-your-23andme-data/</guid>
      <description>
        
        
        &lt;p&gt;Bioconductor is one of the open source projects of which I am most
fond. The documentation is excellent, the community wonderful, the
development fast-paced, and the software &lt;em&gt;very&lt;/em&gt; well written.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a new package in the development branch (due to be released as
2.10 very soon) called &lt;code&gt;gwascat&lt;/code&gt;. &lt;code&gt;gwascat&lt;/code&gt; is a package that serves
as an interface to the &lt;a href=&#34;http://www.genome.gov/&#34;&gt;NHGRI&amp;rsquo;s&lt;/a&gt;
 database of
genome-wide association studies.&lt;/p&gt;
&lt;p&gt;Loading the package with &lt;code&gt;library(gwascat)&lt;/code&gt; creates a &lt;code&gt;GRanges&lt;/code&gt;
instance of SNPs and their diseases. &lt;code&gt;GRanges&lt;/code&gt; is a fundamental data
structure in &lt;code&gt;Bioconductor&lt;/code&gt; (specifically the &lt;code&gt;GenomicRanges&lt;/code&gt; package)
that is designed to hold ranges on genomes efficiently, as well as
metadata about the ranges. In this case, the object &lt;code&gt;gwrngs&lt;/code&gt; holds SNP
ranges (well, locations) and metadata provided by the GWA studies in
NHGRI&amp;rsquo;s database.&lt;/p&gt;
&lt;p&gt;While I really do like 23andme&amp;rsquo;s interface to one&amp;rsquo;s genotype
information and research, the &lt;code&gt;gwascat&lt;/code&gt; package offers some nice data
mining power. I&amp;rsquo;ll briefly introduce it here, and perhaps add
additional details later on.&lt;/p&gt;
&lt;h2&gt;23andme Raw Data&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;23andme-raw-data&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#23andme-raw-data&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When I was considering 23andme, I was ultimately persuaded by the fact
that they release their raw genotype calls to users. Unfortunately,
they do so without SNP call confidence data, but in personal
correspondence a 23andme representative stated:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Data reproducibility of our genotyping platforms is estimated at about
99.9%. Average call rate is about 99%. When samples do not meet
sufficient call rate thresholds, we repeat the analysis, and/or
request a new sample. We do not return data to customers that does not
meet our quality thresholds.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The 99.9% figure sounds like a lot, but considering there are 960,545
SNPs being called, it&amp;rsquo;s not &lt;em&gt;that&lt;/em&gt; high: if 0.1% of calls failed to reproduce, that would be roughly 960 SNPs.&lt;/p&gt;
&lt;p&gt;To retrieve raw data, simply click the &amp;ldquo;Account&amp;rdquo; link at the top of
the page (after you&amp;rsquo;ve signed in) and click &amp;ldquo;Browse Raw Data&amp;rdquo;. There
should be a download link. If you&amp;rsquo;ve never used GPG to encrypt a file,
now is the time to learn; keep your SNP data encrypted.&lt;/p&gt;
&lt;p&gt;The file 23andme provides has four columns: rs ID, chromosome,
position, and genotype.&lt;/p&gt;
&lt;h2&gt;Loading Raw Data into R&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;loading-raw-data-into-r&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#loading-raw-data-into-r&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Use &lt;code&gt;read.table&lt;/code&gt; to load this data in R. It&amp;rsquo;s a lot of data, so
providing this function with information about the type of data can
speed this up quite a bit. Here is the code I used:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwascat&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;read.table&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;data/genome_Vince_Buffalo_Full_20120313162059.txt&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;               &lt;span class=&#34;n&#34;&gt;sep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;\t&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;header&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;FALSE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;               &lt;span class=&#34;n&#34;&gt;colClasses&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;character&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;character&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;numeric&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;character&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;               &lt;span class=&#34;n&#34;&gt;col.names&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;rsid&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;chrom&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;position&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;genotype&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
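&lt;p&gt;To see what these arguments do without touching the real file, here is a
minimal sketch on a few made-up lines. The rs IDs, positions, and genotypes are
invented, and the leading &lt;code&gt;#&lt;/code&gt; comment line is my assumption about how
the raw file starts; &lt;code&gt;read.table&lt;/code&gt; skips such lines by default because
its &lt;code&gt;comment.char&lt;/code&gt; argument defaults to &lt;code&gt;#&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;## Invented example lines standing in for the real raw data file
toy_lines &amp;lt;- c(&amp;#34;# rsid\tchromosome\tposition\tgenotype&amp;#34;,
               &amp;#34;rs0000001\t1\t1000\tAA&amp;#34;,
               &amp;#34;rs0000002\tX\t2000\tCT&amp;#34;,
               &amp;#34;rs0000003\tMT\t300\tG&amp;#34;)
## Same arguments as above, reading from text instead of a file
d_toy &amp;lt;- read.table(text=toy_lines, sep=&amp;#34;\t&amp;#34;, header=FALSE,
                    colClasses=c(&amp;#34;character&amp;#34;, &amp;#34;character&amp;#34;, &amp;#34;numeric&amp;#34;, &amp;#34;character&amp;#34;),
                    col.names=c(&amp;#34;rsid&amp;#34;, &amp;#34;chrom&amp;#34;, &amp;#34;position&amp;#34;, &amp;#34;genotype&amp;#34;))
str(d_toy)            ## check the column types
table(d_toy$genotype) ## quick look at the genotype calls&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Running &lt;code&gt;str&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt; on the real data frame is a
cheap way to catch parsing problems before going any further.&lt;/p&gt;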
&lt;p&gt;You may notice that chromosome has the class &amp;ldquo;character&amp;rdquo; - this is
because there are chromosomes X, Y, and MT (for mitochondrial). For
later plotting purposes, it&amp;rsquo;s good to make this an ordered factor:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;tmp&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;ordered&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;levels&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;seq&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;22&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;X&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Y&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;MT&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;## It&amp;#39;s never a bad idea to check your work&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;stopifnot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;all&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;as.character&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tmp&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;as.character&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
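&lt;p&gt;To illustrate why the ordering matters, here is a small sketch on an
invented vector of chromosome names. As plain character strings they sort
alphabetically, so 10 lands before 2; as an ordered factor they follow the
karyotype order given in &lt;code&gt;levels&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;## Invented chromosome labels, for illustration only
x &amp;lt;- c(&amp;#34;10&amp;#34;, &amp;#34;2&amp;#34;, &amp;#34;X&amp;#34;, &amp;#34;MT&amp;#34;, &amp;#34;1&amp;#34;)
chrom_levels &amp;lt;- c(seq(1, 22), &amp;#34;X&amp;#34;, &amp;#34;Y&amp;#34;, &amp;#34;MT&amp;#34;)
sort(x)                               ## character sort puts 10 before 2
x_ord &amp;lt;- ordered(x, levels=chrom_levels)
sorted_chrom &amp;lt;- as.character(sort(x_ord)) ## follows the order of the levels
sorted_chrom&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Any value missing from &lt;code&gt;levels&lt;/code&gt; silently becomes &lt;code&gt;NA&lt;/code&gt;,
which is exactly the sort of slip the &lt;code&gt;stopifnot&lt;/code&gt; check above is there
to catch.&lt;/p&gt;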
&lt;h2&gt;Where are the SNPs that 23andme Genotypes?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-are-the-snps-23andme-genotypes&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-are-the-snps-23andme-genotypes&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Using &lt;a href=&#34;http://had.co.nz/&#34;&gt;Hadley Wickham&amp;rsquo;s&lt;/a&gt;
 excellent &lt;code&gt;ggplot2&lt;/code&gt;
package, we can look at the distribution of SNPs by chromosome:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggplot2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;ggplot&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_bar&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/23andme_chrom_dist.png&#34; alt=&#34;distribution of SNPs by chromosome&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;This plot tells us more about chromosome length than about SNP
density (X being the exception). We&amp;rsquo;ll take a more detailed look a bit
later.&lt;/p&gt;
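&lt;p&gt;In the meantime, one rough way to separate density from length is to divide
each chromosome&amp;rsquo;s SNP count by its length. The sketch below is only that: it
runs on invented toy data, and it uses the largest genotyped position on each
chromosome as a stand-in for chromosome length, which is my simplification and
will underestimate the true length.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;## Invented toy data: chromosome 1 has twice the SNPs and twice the length
toy &amp;lt;- data.frame(chrom=c(&amp;#34;1&amp;#34;, &amp;#34;1&amp;#34;, &amp;#34;1&amp;#34;, &amp;#34;1&amp;#34;, &amp;#34;2&amp;#34;, &amp;#34;2&amp;#34;),
                  position=c(1e6, 2e6, 3e6, 4e6, 1e6, 2e6))
snp_count &amp;lt;- table(toy$chrom)
## Assumption: largest observed position approximates chromosome length
approx_len_mb &amp;lt;- tapply(toy$position, toy$chrom, max) / 1e6
snps_per_mb &amp;lt;- as.vector(snp_count[names(approx_len_mb)]) / approx_len_mb
snps_per_mb&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this toy case both chromosomes come out at the same density even though
their raw counts differ, which is the distinction the bar chart hides.&lt;/p&gt;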
&lt;p&gt;Another really wonderful aspect of Bioconductor is that the project
isn&amp;rsquo;t just a repository of code: it also stores annotation, full
genomes, and experimental data. Such packaged data is the foundation
of reproducible bioinformatics, as you no longer have to worry about
keeping track of data versions and storing downloaded data
yourself. If you need to work with cutting-edge data from Ensembl or
UCSC tracks, the packages &lt;code&gt;biomaRt&lt;/code&gt; and &lt;code&gt;rtracklayer&lt;/code&gt; work well.&lt;/p&gt;
&lt;h2&gt;A Quick Demonstration of GenomicRanges and Bioconductor Annotation Packages&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;a-quick-demonstration-of-genomicranges-and-bioconductor-annotation-packages&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#a-quick-demonstration-of-genomicranges-and-bioconductor-annotation-packages&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Suppose I want to see if any of my SNPs fall in the APOE gene
region. For this, I&amp;rsquo;ll need transcript annotation data. If I wished to
create a fresh database of exon, gene, transcript, and splicing data,
I could with the &lt;code&gt;GenomicFeatures&lt;/code&gt; package. This package has functions
for building &lt;code&gt;TranscriptDb&lt;/code&gt; objects from the Known Gene track from
UCSC, as well as Ensembl databases. However, I&amp;rsquo;ll just use a
pre-packaged version, &lt;code&gt;TxDb.Hsapiens.UCSC.hg18.knownGene&lt;/code&gt;. I use hg18
rather than hg19 because this is the build that 23andme&amp;rsquo;s coordinates
reference.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TxDb.Hsapiens.UCSC.hg18.knownGene&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;txdb&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;TxDb.Hsapiens.UCSC.hg18.knownGene&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;class&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txdb&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;## do some digging around!&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;TranscriptDb&lt;/code&gt; objects have nice accessor functions for
their components. Behind the scenes, everything is in SQLite and very
efficient (are you seeing why I love Bioconductor?).&lt;/p&gt;
&lt;p&gt;If we look at the transcripts with the &lt;code&gt;transcripts&lt;/code&gt; accessor
function, we see it returns a &lt;code&gt;GRanges&lt;/code&gt; object:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;transcripts&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txdb&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;GRanges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;66803&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;and&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elementMetadata&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          &lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;               &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;strand&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;tx_id&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;tx_name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;             &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;            &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IRanges&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;  &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;integer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;character&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[&lt;/span&gt;  &lt;span class=&#34;m&#34;&gt;1116&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;m&#34;&gt;4121&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;         &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aaa.2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[2]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[&lt;/span&gt;  &lt;span class=&#34;m&#34;&gt;1116&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;m&#34;&gt;4272&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;         &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc009vip.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[3]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;19418&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;m&#34;&gt;20957&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;26&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc009vjg.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[4]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;55425&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;m&#34;&gt;59692&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;28&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc009vjh.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[5]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;58954&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;  &lt;span class=&#34;m&#34;&gt;59871&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;29&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aal.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[6]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[310947&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;310977&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;33&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aaq.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[7]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[311009&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;311086&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;34&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aar.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[8]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[314323&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;314353&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;35&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aas.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;[9]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr1&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;[314354&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;314385&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;m&#34;&gt;36&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc001aat.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;      &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;                  &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;    &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;       &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;         &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66795]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[25318610&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;25368905&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33721&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwl.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66796]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[25318610&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;25368905&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33722&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc010nxm.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66797]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[25586438&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;25607639&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33731&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fws.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66798]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[25739178&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;25740308&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33732&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwt.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66799]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[25949151&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;25949179&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33733&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwu.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66800]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[26012854&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26012887&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33734&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fww.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66801]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[26015033&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26015066&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33735&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwx.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66802]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[26015782&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26015809&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33737&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwy.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;[66803]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chrY&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[26016792&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26016820&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;33738&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc004fwz.1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
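&lt;p&gt;If &lt;code&gt;GRanges&lt;/code&gt; objects are new to you, a toy example may help show
why they suit the question of whether a SNP falls inside a gene. This is a
minimal sketch with invented coordinates and hypothetical gene names, not real
annotation; it builds two small &lt;code&gt;GRanges&lt;/code&gt; objects by hand and uses
&lt;code&gt;findOverlaps&lt;/code&gt; from the &lt;code&gt;GenomicRanges&lt;/code&gt; package to match them.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;library(GenomicRanges)
## Invented gene ranges; gene_name values are hypothetical
toy_genes &amp;lt;- GRanges(seqnames=c(&amp;#34;chr1&amp;#34;, &amp;#34;chr2&amp;#34;),
                     ranges=IRanges(start=c(100, 500), end=c(200, 600)),
                     strand=c(&amp;#34;+&amp;#34;, &amp;#34;-&amp;#34;),
                     gene_name=c(&amp;#34;geneA&amp;#34;, &amp;#34;geneB&amp;#34;))
## Invented SNPs, each a range of width 1
toy_snps &amp;lt;- GRanges(seqnames=c(&amp;#34;chr1&amp;#34;, &amp;#34;chr1&amp;#34;, &amp;#34;chr2&amp;#34;),
                    ranges=IRanges(start=c(150, 300, 150), width=1))
hits &amp;lt;- findOverlaps(toy_snps, toy_genes)
queryHits(hits)   ## which SNPs fall inside a gene range
subjectHits(hits) ## which gene range each of those SNPs hit&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only the first toy SNP lands inside a range here. Both chromosome and
position have to match, which is why the SNP at position 150 on chr2 is not
counted.&lt;/p&gt;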
&lt;p&gt;To interact with the wealth of data behind a &lt;code&gt;TranscriptDb&lt;/code&gt; object, we
often group individual ranges (here, transcripts by gene), leaving us with a
&lt;code&gt;GRangesList&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tx.by.gene&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;transcriptsBy&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;txdb&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;gene&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tx.by.gene&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;n&#34;&gt;GRangesList&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;of&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;length&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;20121&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;n&#34;&gt;GRanges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;and&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elementMetadata&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       &lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;               &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;strand&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;tx_id&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;tx_name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;          &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;            &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IRanges&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;  &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;integer&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;character&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[63549984&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;63556677&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;61027&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc002qsd.2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[2]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[63551644&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;63565932&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;     &lt;span class=&#34;m&#34;&gt;61033&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;uc002qsf.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;10&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;n&#34;&gt;GRanges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;and&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elementMetadata&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       &lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;               &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;strand&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tx_id&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;tx_name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr8&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[18293035&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;18303003&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26503&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;uc003wyw.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[2]&lt;/span&gt;     &lt;span class=&#34;n&#34;&gt;chr8&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[18301794&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;18302666&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;26504&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;uc010lte.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;100&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;n&#34;&gt;GRanges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;and&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elementMetadata&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       &lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;               &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;strand&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;tx_id&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;tx_name&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr20&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[42681577&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;42713790&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;62142&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;uc002xmj.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[2]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr20&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[42681577&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;42713790&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;-&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;62143&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;uc010ggt.1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;20118&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;more&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elements&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Holy &lt;code&gt;GRangesList&lt;/code&gt;, Batman! These are the transcripts grouped by
gene. There are other methods for grouping by CDS and exons (&lt;code&gt;cdsBy&lt;/code&gt;
and &lt;code&gt;exonsBy&lt;/code&gt;).&lt;/p&gt;
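&lt;p&gt;If the grouping idea feels abstract, base R&amp;rsquo;s &lt;code&gt;split()&lt;/code&gt; is a rough
analogue. The sketch below is only an illustration: the &lt;code&gt;tx&lt;/code&gt; data frame and
the &lt;code&gt;tx.by.gene.toy&lt;/code&gt; name are hypothetical, with IDs and coordinates copied
from the output above. It is not how &lt;code&gt;transcriptsBy&lt;/code&gt; works internally.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Toy transcript table. IDs and coordinates are copied from the output
# above; the data frame layout itself is hypothetical.
tx &amp;lt;- data.frame(
  gene_id = c(&amp;#34;1&amp;#34;, &amp;#34;1&amp;#34;, &amp;#34;10&amp;#34;, &amp;#34;10&amp;#34;),
  tx_name = c(&amp;#34;uc002qsd.2&amp;#34;, &amp;#34;uc002qsf.1&amp;#34;, &amp;#34;uc003wyw.1&amp;#34;, &amp;#34;uc010lte.1&amp;#34;),
  start   = c(63549984, 63551644, 18293035, 18301794),
  end     = c(63556677, 63565932, 18303003, 18302666),
  stringsAsFactors = FALSE
)

# split() groups the rows by gene and returns a named list with one
# data frame per gene, much as a GRangesList holds one GRanges per gene.
tx.by.gene.toy &amp;lt;- split(tx, tx$gene_id)

names(tx.by.gene.toy)         # &amp;#34;1&amp;#34; &amp;#34;10&amp;#34; (character names, not numbers)
nrow(tx.by.gene.toy[[&amp;#34;10&amp;#34;]])  # 2 transcripts for gene 10&lt;/code&gt;&lt;/pre&gt;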
&lt;p&gt;The names of the list elements are Entrez gene IDs. We can look up
specific genes with another Bioconductor annotation package,
&lt;code&gt;org.Hs.eg.db&lt;/code&gt;. There are org.* annotation packages for many
organisms. You can forge your own and interact with them with the
&lt;code&gt;AnnotationDbi&lt;/code&gt; package. I&amp;rsquo;m using a development version of this
package that has a slick new SQL-like interface; it will be widely
available with the upcoming Bioconductor 2.10 release.&lt;/p&gt;
&lt;p&gt;Suppose I want to map between gene names and Entrez Gene IDs. The &amp;ldquo;eg&amp;rdquo;
in &lt;code&gt;org.Hs.eg.db&lt;/code&gt; refers to Entrez Gene IDs. Printing the &lt;code&gt;org.Hs.eg.db&lt;/code&gt;
object gives a nice list of information. Let&amp;rsquo;s look for the APOE
gene&amp;rsquo;s Entrez Gene ID.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;org.Hs.eg.db&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;  &lt;span class=&#34;nf&#34;&gt;cols&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;org.Hs.eg.db&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;ENTREZID&amp;#34;&lt;/span&gt;     &lt;span class=&#34;s&#34;&gt;&amp;#34;ACCNUM&amp;#34;&lt;/span&gt;       &lt;span class=&#34;s&#34;&gt;&amp;#34;ALIAS&amp;#34;&lt;/span&gt;        &lt;span class=&#34;s&#34;&gt;&amp;#34;CHR&amp;#34;&lt;/span&gt;          &lt;span class=&#34;s&#34;&gt;&amp;#34;ENZYME&amp;#34;&lt;/span&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[6]&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;GENENAME&amp;#34;&lt;/span&gt;     &lt;span class=&#34;s&#34;&gt;&amp;#34;MAP&amp;#34;&lt;/span&gt;          &lt;span class=&#34;s&#34;&gt;&amp;#34;OMIM&amp;#34;&lt;/span&gt;         &lt;span class=&#34;s&#34;&gt;&amp;#34;PATH&amp;#34;&lt;/span&gt;         &lt;span class=&#34;s&#34;&gt;&amp;#34;PMID&amp;#34;&lt;/span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[11]&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;REFSEQ&amp;#34;&lt;/span&gt;       &lt;span class=&#34;s&#34;&gt;&amp;#34;SYMBOL&amp;#34;&lt;/span&gt;       &lt;span class=&#34;s&#34;&gt;&amp;#34;UNIGENE&amp;#34;&lt;/span&gt;      &lt;span class=&#34;s&#34;&gt;&amp;#34;CHRLOC&amp;#34;&lt;/span&gt;       &lt;span class=&#34;s&#34;&gt;&amp;#34;CHRLOCEND&amp;#34;&lt;/span&gt;   
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[16]&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;PFAM&amp;#34;&lt;/span&gt;         &lt;span class=&#34;s&#34;&gt;&amp;#34;PROSITE&amp;#34;&lt;/span&gt;      &lt;span class=&#34;s&#34;&gt;&amp;#34;ENSEMBL&amp;#34;&lt;/span&gt;      &lt;span class=&#34;s&#34;&gt;&amp;#34;ENSEMBLPROT&amp;#34;&lt;/span&gt;  &lt;span class=&#34;s&#34;&gt;&amp;#34;ENSEMBLTRANS&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;[21]&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;UNIPROT&amp;#34;&lt;/span&gt;      &lt;span class=&#34;s&#34;&gt;&amp;#34;UCSCKG&amp;#34;&lt;/span&gt;       &lt;span class=&#34;s&#34;&gt;&amp;#34;GO&amp;#34;&lt;/span&gt;          &lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;These are the columns we can query. The kinds of keys we can look
records up by are listed by &lt;code&gt;keytypes()&lt;/code&gt;. Putting it all together, we can
extract the Entrez Gene ID:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;select&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;org.Hs.eg.db&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;keys&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;APOE&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cols&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;ENTREZID&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;SYMBOL&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;GENENAME&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;keytype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;SYMBOL&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;SYMBOL&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ENTREZID&lt;/span&gt;         &lt;span class=&#34;n&#34;&gt;GENENAME&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;m&#34;&gt;23200&lt;/span&gt;   &lt;span class=&#34;n&#34;&gt;APOE&lt;/span&gt;      &lt;span class=&#34;m&#34;&gt;348&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;apolipoprotein&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;E&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, we can look for this in our &lt;code&gt;tx.by.gene&lt;/code&gt; &lt;code&gt;GRangesList&lt;/code&gt;. A word of
caution: Entrez Gene IDs are &lt;strong&gt;names&lt;/strong&gt; and thus they need to be quoted
when working with &lt;code&gt;GRangesList&lt;/code&gt; objects from transcript databases.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; tx.by.gene[&amp;#34;348&amp;#34;]
  GRangesList of length 1:
  $348 
  GRanges with 1 range and 2 elementMetadata values:
        seqnames               ranges strand |     tx_id     tx_name
           &amp;lt;Rle&amp;gt;            &amp;lt;IRanges&amp;gt;  &amp;lt;Rle&amp;gt; | &amp;lt;integer&amp;gt; &amp;lt;character&amp;gt;
    [1]    chr19 [50100879, 50104490]      &amp;#43; |     59642  uc002pab.1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If I had used &lt;code&gt;tx.by.gene[348]&lt;/code&gt; the 348th element of the list would have
been returned, not the transcript data for the APOE gene (which has
Entrez Gene ID &amp;ldquo;348&amp;rdquo;).&lt;/p&gt;
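&lt;p&gt;The same gotcha shows up in any named R list, so it is easy to try on a
small scale. This is a toy sketch: the &lt;code&gt;genes&lt;/code&gt; list is hypothetical and only
stands in for &lt;code&gt;tx.by.gene&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# A toy named list keyed by Entrez Gene ID (as character strings).
genes &amp;lt;- list(&amp;#34;1&amp;#34; = &amp;#34;A1BG&amp;#34;, &amp;#34;10&amp;#34; = &amp;#34;NAT2&amp;#34;, &amp;#34;2&amp;#34; = &amp;#34;A2M&amp;#34;)

genes[[&amp;#34;2&amp;#34;]]  # lookup by name: the element named &amp;#34;2&amp;#34;, i.e. &amp;#34;A2M&amp;#34;
genes[[2]]    # lookup by position: the second element, i.e. &amp;#34;NAT2&amp;#34;&lt;/code&gt;&lt;/pre&gt;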
&lt;p&gt;Now, do any SNPs fall in this region? Let&amp;rsquo;s build a &lt;code&gt;GRanges&lt;/code&gt; object
from my genotyping data, and look for overlaps. Before I do, it&amp;rsquo;s
worth mentioning another gotcha about working with bioinformatics
data: chromosome naming schemes. Different databases use all sorts of
schemes, and you should always check them. 23andMe returns just
numbers, X, Y, and MT. Let&amp;rsquo;s rename them to match the
&lt;code&gt;chr&lt;/code&gt;-prefixed names used by the Bioconductor annotation.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# CAREFUL: use levels() to check that you&amp;#39;re making new factor names&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# that correspond to the old ones!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;levels&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;paste&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;chr&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;22&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;X&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Y&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;M&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;sep&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;my.snps&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;with&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;GRanges&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                   &lt;span class=&#34;nf&#34;&gt;IRanges&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;start&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;position&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;width&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                   &lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;c1&#34;&gt;# this goes into metadata&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
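&lt;p&gt;Assigning a whole new vector to &lt;code&gt;levels()&lt;/code&gt; works only if the new names are
in exactly the same order as the existing levels. A more defensive variant,
sketched below under the assumption that the chromosome column is a factor,
builds the new names from the old ones so the order cannot drift. The
&lt;code&gt;chrom&lt;/code&gt; vector here is a made-up stand-in for &lt;code&gt;d$chrom&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Hypothetical stand-in for the 23andMe chromosome column.
chrom &amp;lt;- factor(c(&amp;#34;1&amp;#34;, &amp;#34;2&amp;#34;, &amp;#34;10&amp;#34;, &amp;#34;X&amp;#34;, &amp;#34;MT&amp;#34;))

# By default factor levels are sorted as strings, so &amp;#34;10&amp;#34; comes before &amp;#34;2&amp;#34;.
levels(chrom)

# Derive each new name from its old level: &amp;#34;MT&amp;#34; becomes &amp;#34;M&amp;#34;, and
# everything gets a &amp;#34;chr&amp;#34; prefix. No assumption about level order needed.
levels(chrom) &amp;lt;- paste0(&amp;#34;chr&amp;#34;, sub(&amp;#34;^MT$&amp;#34;, &amp;#34;M&amp;#34;, levels(chrom)))

as.character(chrom)  # &amp;#34;chr1&amp;#34; &amp;#34;chr2&amp;#34; &amp;#34;chr10&amp;#34; &amp;#34;chrX&amp;#34; &amp;#34;chrM&amp;#34;&lt;/code&gt;&lt;/pre&gt;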
&lt;p&gt;Now, let&amp;rsquo;s find overlaps using, well, &lt;code&gt;findOverlaps&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;apoe.i&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;findOverlaps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tx.by.gene[&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;348&amp;#34;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;my.snps&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
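&lt;p&gt;For single-base SNPs, an overlap boils down to &amp;ldquo;same chromosome, and the
position falls inside the range&amp;rdquo;. The base R sketch below spells that out
on a toy table. It is only meant to show the idea, not how
&lt;code&gt;findOverlaps&lt;/code&gt; is implemented; the &lt;code&gt;snps&lt;/code&gt;, &lt;code&gt;gene&lt;/code&gt;, and &lt;code&gt;hits.toy&lt;/code&gt;
names are hypothetical, and only the first two SNP positions and the APOE range
come from this post&amp;rsquo;s output.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Toy SNP table: rows 1 and 2 use positions from this post, rows 3 and 4
# are made up to give non-overlapping cases.
snps &amp;lt;- data.frame(
  chrom    = c(&amp;#34;chr19&amp;#34;, &amp;#34;chr19&amp;#34;, &amp;#34;chr19&amp;#34;, &amp;#34;chr8&amp;#34;),
  position = c(50101007, 50101842, 50200000, 50101007),
  stringsAsFactors = FALSE
)

# The APOE transcript range, as printed by tx.by.gene[&amp;#34;348&amp;#34;].
gene &amp;lt;- list(chrom = &amp;#34;chr19&amp;#34;, start = 50100879, end = 50104490)

# A SNP overlaps when it sits on the same chromosome and its position
# lies within [start, end], both ends inclusive.
hits.toy &amp;lt;- which(snps$chrom == gene$chrom &amp;amp;
                  snps$position &amp;gt;= gene$start &amp;amp;
                  snps$position &amp;lt;= gene$end)

hits.toy  # 1 2: row indices into snps, like the subject indices below&lt;/code&gt;&lt;/pre&gt;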
&lt;p&gt;&lt;code&gt;apoe.i&lt;/code&gt; is an object of class &lt;code&gt;RangesMatching&lt;/code&gt;. Note that had we not
matched chromosome names, Bioconductor would have given us a nice warning that
the sequence names don&amp;rsquo;t match. We could look at the slots of &lt;code&gt;apoe.i&lt;/code&gt;, but
the matches are easier to see with &lt;code&gt;matchMatrix&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;hits&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;matchMatrix&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;apoe.i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;subject&amp;#34;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;hits&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873650&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873651&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873652&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873653&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873654&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873655&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873656&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873657&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873658&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873659&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[11]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873660&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873661&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873662&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873663&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873664&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873665&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873666&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873667&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873668&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873669&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[21]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873670&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873671&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873672&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873673&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873674&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873675&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;873676&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So in our subject, we have 27 hits. Let&amp;rsquo;s dig them up in our SNP
&lt;code&gt;GRanges&lt;/code&gt; object:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;my.snps[hits]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;GRanges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;27&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;and&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;2&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;elementMetadata&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;values&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         &lt;span class=&#34;n&#34;&gt;seqnames&lt;/span&gt;               &lt;span class=&#34;n&#34;&gt;ranges&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;strand&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;        &lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;            &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;IRanges&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;  &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Rle&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;character&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;character&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50101007&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50101007&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;rs440446&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;CG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[2]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50101842&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50101842&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;rs769449&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[3]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102284&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102284&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;rs769450&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;AG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[4]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102751&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102751&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;rs769451&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;TT&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[5]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102874&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102874&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000209&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[6]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102904&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102904&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000208&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[7]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102940&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102940&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000201&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;CC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[8]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50102991&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50102991&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;rs28931576&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;AA&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;n&#34;&gt;[9]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50103697&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50103697&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;rs11542040&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;CC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;     &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;      &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;                  &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;    &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;         &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;         &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[19]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104077&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104077&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000212&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[20]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104118&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104118&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000210&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[21]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104129&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104129&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000213&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;CC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[22]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104154&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104154&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000207&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;TT&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[23]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104177&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104177&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000219&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[24]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104180&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104180&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000218&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[25]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104198&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104198&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000206&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;CC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[26]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104268&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104268&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;i5000204&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;GG&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;[27]&lt;/span&gt;    &lt;span class=&#34;n&#34;&gt;chr19&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[50104333&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;50104333&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;      &lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;   &lt;span class=&#34;o&#34;&gt;|&lt;/span&gt;  &lt;span class=&#34;n&#34;&gt;rs28931579&lt;/span&gt;          &lt;span class=&#34;n&#34;&gt;AA&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, we can verify that these SNPs are in the APOE gene using the UCSC
Genome Browser (and actually pull open a browser to this spot from R
using &lt;code&gt;rtracklayer&lt;/code&gt;, but I&amp;rsquo;ll save that for another time). Be sure to
use hg18/build 36! Note that my genotype information is there.&lt;/p&gt;
&lt;p&gt;The ApoE4 allele is rs429358(C) + rs7412(C). The most common allele,
ApoE3, is rs429358(T) + rs7412(C); I&amp;rsquo;m homozygous for it (e3/e3),
which is a relief. A lot of established research shows that
homozygous ApoE4 (that is, rs429358(C/C) + rs7412(C/C)) leads to a
substantially higher risk of Alzheimer&amp;rsquo;s. According to
&lt;a href=&#34;http://snpedia.com/index.php/ApoE4&#34;&gt;SNPedia&lt;/a&gt;,
James Watson requested
he not learn his genotype at this locus, and Steven Pinker requested
his ApoE data be removed from his PGP10 data.&lt;/p&gt;
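&lt;p&gt;As a minimal sketch of the rule above, the function below labels a pair of genotype calls at rs429358 and rs7412. The name &lt;code&gt;apoe.call&lt;/code&gt; is hypothetical, and it only covers the two homozygous cases described here; any other combination is reported as not covered rather than guessed.&lt;/p&gt;

```r
# Hypothetical helper: label the two homozygous ApoE cases described in the text.
# Genotypes are two-letter strings as in the 23andme data, e.g. "TT" or "CC".
apoe.call = function(rs429358, rs7412) {
  calls = c(rs429358, rs7412)
  if (identical(calls, c("TT", "CC"))) {
    "e3/e3"   # rs429358(T/T) + rs7412(C/C)
  } else if (identical(calls, c("CC", "CC"))) {
    "e4/e4"   # rs429358(C/C) + rs7412(C/C)
  } else {
    "not covered by this sketch"
  }
}

apoe.call("TT", "CC")
apoe.call("CC", "CC")
```

&lt;p&gt;Heterozygous calls are deliberately left out: working out which alleles sit together on the same chromosome takes more than two unphased genotype strings.&lt;/p&gt;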
&lt;h2&gt;Looking for Risk Variants using &lt;code&gt;gwascat&lt;/code&gt;&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;looking-for-risk-variants-using-gwascat&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#looking-for-risk-variants-using-gwascat&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;We can use the metadata provided by &lt;code&gt;gwascat&lt;/code&gt; to further look for
interesting variants in our 23andme data. I would recommend
interpreting this data with caution: summarizing GWAS findings in
a single element metadata data frame is hard, and information is
definitely lost along the way.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;gwrngs&lt;/code&gt; &lt;code&gt;GRanges&lt;/code&gt; object has lots of metadata you should scan
through with &lt;code&gt;elementMetadata(gwrngs)&lt;/code&gt;. The
&lt;code&gt;Strongest.SNP.Risk.Allele&lt;/code&gt; column is useful for seeing which allele each
study reports as the risk allele. First, using the rs ID as a key, let&amp;rsquo;s join our SNP data with the
&lt;code&gt;gwrngs&lt;/code&gt; metadata:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;gwrngs.emd&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;as.data.frame&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;elementMetadata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;merge&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;gwrngs.emd&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;by.x&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;rsid&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;by.y&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;SNPs&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
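&lt;p&gt;It helps to know what &lt;code&gt;merge&lt;/code&gt; does with SNPs that match no catalog row, or several. Here is a small sketch on toy data frames; &lt;code&gt;d.toy&lt;/code&gt; and &lt;code&gt;emd.toy&lt;/code&gt; are made-up stand-ins for &lt;code&gt;d&lt;/code&gt; and &lt;code&gt;gwrngs.emd&lt;/code&gt;, and the rs IDs and traits in them are invented.&lt;/p&gt;

```r
# Toy stand-ins for the real objects; all values below are made up.
d.toy = data.frame(rsid = c("rs1", "rs2", "rs3"),
                   genotype = c("AA", "CT", "GG"),
                   stringsAsFactors = FALSE)
emd.toy = data.frame(SNPs = c("rs2", "rs2", "rs3"),
                     Disease.Trait = c("Trait A", "Trait B", "Trait C"),
                     stringsAsFactors = FALSE)

# merge() does an inner join by default: rs1 has no catalog entry and is
# dropped, while rs2 shows up once per catalog row it matches.
dm.toy = merge(d.toy, emd.toy, by.x = "rsid", by.y = "SNPs")
dm.toy

# all.x = TRUE keeps unmatched SNPs and fills the catalog columns with NA.
dm.all = merge(d.toy, emd.toy, by.x = "rsid", by.y = "SNPs", all.x = TRUE)
dm.all
```

&lt;p&gt;So a single SNP can occupy several rows of the merged data frame, one per study that reports it.&lt;/p&gt;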
&lt;p&gt;We can search for the risk allele in the 23andme genotype data with R
and attach a vector of &lt;code&gt;i.have.risk&lt;/code&gt; to the &lt;code&gt;dm&lt;/code&gt; data frame:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;risk.alleles&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;gsub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;[^\\-]*-([ATCG?])&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;\\1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Strongest.SNP.Risk.Allele&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;i.have.risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;mapply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kr&#34;&gt;function&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;risk&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;n&#34;&gt;risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;%in%&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;unlist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;strsplit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;risk.alleles&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i.have.risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i.have.risk&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
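&lt;p&gt;To see what those two steps do, here is the same logic run on a few toy values. The rs IDs and genotypes are made up, and writing a no-call as &lt;code&gt;--&lt;/code&gt; is my assumption about the encoding.&lt;/p&gt;

```r
# Toy inputs; the rs IDs and genotypes below are made up for illustration.
strongest = c("rs111-A", "rs222-G", "rs333-?", "rs444-T")
genotype  = c("AG",      "CC",      "AA",      "--")

# Same pattern as above: keep only the allele that follows the hyphen.
risk.toy = gsub("[^\\-]*-([ATCG?])", "\\1", strongest)
risk.toy

# TRUE when the risk allele is one of the two genotype characters.
# USE.NAMES = FALSE returns a plain logical vector instead of one named
# by the risk alleles.
has.risk = mapply(function(risk, mine) {
  risk %in% unlist(strsplit(mine, ""))
}, risk.toy, genotype, USE.NAMES = FALSE)
has.risk
```

&lt;p&gt;Only the first SNP comes back &lt;code&gt;TRUE&lt;/code&gt;. An unknown risk allele (&lt;code&gt;?&lt;/code&gt;) or a no-call genotype can never match, so both end up as &lt;code&gt;FALSE&lt;/code&gt; rather than &lt;code&gt;NA&lt;/code&gt;.&lt;/p&gt;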
&lt;p&gt;Now that you have this data frame, you can mine it endlessly. You may
want to sort by &lt;code&gt;Risk.Allele.Frequency&lt;/code&gt; and by whether you carry the
risk allele. Because there are quite a few columns in the element metadata,
it&amp;rsquo;s nice to define a quick-summary subset:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm[dm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i.have.risk&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rel.cols&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;c&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;colnames&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Disease.Trait&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Risk.Allele.Frequency&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                &lt;span class=&#34;s&#34;&gt;&amp;#34;p.Value&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;i.have.risk&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;X95..CI..text.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;[order&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Risk.Allele.Frequency&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rel.cols]&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            &lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chrom&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;position&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Disease.Trait&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Risk.Allele.Frequency&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;m&#34;&gt;2553&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rs2315504&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;chr17&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;36300407&lt;/span&gt;       &lt;span class=&#34;n&#34;&gt;AC&lt;/span&gt;        &lt;span class=&#34;n&#34;&gt;Height&lt;/span&gt;                  &lt;span class=&#34;m&#34;&gt;0.01&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;       &lt;span class=&#34;n&#34;&gt;p.Value&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;i.have.risk&lt;/span&gt;   &lt;span class=&#34;n&#34;&gt;X95..CI..text.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;m&#34;&gt;2553&lt;/span&gt;   &lt;span class=&#34;m&#34;&gt;8e-06&lt;/span&gt;        &lt;span class=&#34;kc&#34;&gt;TRUE&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;[NR]&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cm&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;increase&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a rare variant, but the most important next question is: rare
in whom?&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;[which&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;dm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;rs2315504&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;Initial.Sample.Size&amp;#34;&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;[1]&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;8&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;842&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Korean&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;individuals&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So this clearly doesn&amp;rsquo;t mean much to me. We can use &lt;code&gt;grep&lt;/code&gt; to find
studies that mention &amp;ldquo;European&amp;rdquo;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;head&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;[grep&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;European&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;Initial.Sample.Size&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;rel.cols]&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;m&#34;&gt;30&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;One interesting rs ID that popped up in this list of my data is
rs10166942, which is lightly linked to migraines (from which I
suffer).&lt;/p&gt;
&lt;h2&gt;Making Graphics with &lt;code&gt;ggbio&lt;/code&gt;&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;making-graphics-with-ggbio&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#making-graphics-with-ggbio&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ggbio&lt;/code&gt; is a new-ish (Bioconductor 2.9) package that produces really
nice graphics. Let&amp;rsquo;s plot the locations of all SNPs for which &lt;code&gt;gwascat&lt;/code&gt;
tells me my allele is the &amp;ldquo;risk&amp;rdquo; allele (again, a strange word choice, as
some &amp;ldquo;Disease.Traits&amp;rdquo; are height). &lt;code&gt;gwascat&lt;/code&gt; uses hg19, and &lt;code&gt;ggbio&lt;/code&gt;
doesn&amp;rsquo;t have ideogram cytobanding and chromosome position information
for hg18 bundled with it (yet?) so we&amp;rsquo;ll need to work with that.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ggbio&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;p&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;plotOverview&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hg19IdeogramCyto&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;cytoband&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;FALSE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, let&amp;rsquo;s take the &lt;code&gt;gwrngs&lt;/code&gt; object and subset by my risk
alleles. Notice how the assignment function &lt;code&gt;elementMetadata&amp;lt;-&lt;/code&gt; is
overloaded here:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;elementMetadata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.genotype&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   &lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;genotype&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;match&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;elementMetadata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;SNPs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;d&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;rsid&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;elementMetadata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;$&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;with&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;elementMetadata&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nf&#34;&gt;mapply&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;kr&#34;&gt;function&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;risk&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;mine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;      &lt;span class=&#34;n&#34;&gt;risk&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;%in%&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;unlist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nf&#34;&gt;strsplit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;mine&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;},&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;gsub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;[^\\-]*-([ATCG?])&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;\\1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Strongest.SNP.Risk.Allele&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;my.genotype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now to plot these regions:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;p&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;+&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;geom_hotregion&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;gwrngs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;aes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;color&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;my.risk&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;

      </description>
    </item>
    
    <item>
      <title>Git Notes</title>
      <link>https://vincebuffalo.com/blog/git-notes/</link>
      <pubDate>Sun, 11 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/git-notes/</guid>
      <description>
        
        
        &lt;h1&gt;Git Notes&lt;/h1&gt;&lt;p&gt;I update these notes periodically. I have tried my best to
illustrate common use cases, and the motivation for doing things the
&amp;ldquo;Git&amp;rdquo; way.&lt;/p&gt;
&lt;h2&gt;Example Set Up&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;example-set-up&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#example-set-up&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ll use this setup scenario frequently. In a suitable scratch
directory (e.g. &lt;code&gt;git-sandbox&lt;/code&gt;), make a fake remote:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;mkdir fake-remote
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; fake-remote
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git init --bare
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ..&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, clone it, pretending you are two developers:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone fake-remote jerry-repo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone fake-remote kramer-repo&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Let&amp;rsquo;s assume you&amp;rsquo;re Jerry and Kramer is another programmer in your
group. As Jerry, let&amp;rsquo;s make some changes:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; jerry-repo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;an example file&amp;#34;&lt;/span&gt; &amp;gt; file.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git add file.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git commit -am &lt;span class=&#34;s2&#34;&gt;&amp;#34;initial import&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git push origin master
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ..&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, let&amp;rsquo;s pretend we&amp;rsquo;re Kramer and grab that recent commit:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; kramer-repo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git pull origin master
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ..&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;Git Remote Tracking Branches&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;git-remote-tracking-branches&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#git-remote-tracking-branches&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Git remote tracking branches are similar to local branches (i.e. the
kind you create with &lt;code&gt;git checkout -b branch-name&lt;/code&gt; and see with &lt;code&gt;git branch&lt;/code&gt;). However, you don&amp;rsquo;t work on the remote branch directly; you
work on a local branch that&amp;rsquo;s &lt;em&gt;tracking&lt;/em&gt; this remote branch. For
example, the most common workflow is to track a remote branch, then
push your commits to it or pull commits down from it. Even though it&amp;rsquo;s
a &amp;ldquo;remote&amp;rdquo; tracking branch, the branch is stored locally (this branch
doesn&amp;rsquo;t disappear if you can&amp;rsquo;t connect to the remote).&lt;/p&gt;
&lt;p&gt;Git remote tracking branches always have the format
&lt;code&gt;remote-repo/remote-branch&lt;/code&gt;. After cloning a repository, you can set
a local branch to track a remote branch with the &lt;code&gt;-u&lt;/code&gt; option of &lt;code&gt;git push&lt;/code&gt;, e.g. &lt;code&gt;git push -u origin master&lt;/code&gt;. From now on, you can just use
&lt;code&gt;git push&lt;/code&gt; when on this branch; this branch is &lt;em&gt;tracking&lt;/em&gt; &lt;code&gt;origin&lt;/code&gt;&amp;rsquo;s
&lt;code&gt;master&lt;/code&gt; branch.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git branch&lt;/code&gt; shows local branches; to see remote branches use &lt;code&gt;git branch -r&lt;/code&gt;, and to see &lt;em&gt;all&lt;/em&gt; branches, use &lt;code&gt;git branch -a&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Remote tracking branches also determine what is pulled or pushed
when &lt;code&gt;git pull&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt; are run without an explicit remote repository and
refspec (i.e. a bare &lt;code&gt;git push&lt;/code&gt; rather than &lt;code&gt;git push origin master&lt;/code&gt;).&lt;/p&gt;
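&lt;p&gt;If the current branch is &lt;code&gt;new-feature&lt;/code&gt;, which tracks
&lt;code&gt;origin/new-feature&lt;/code&gt;, then with Git&amp;rsquo;s default settings any branch created from the remote tracking branch (e.g. &lt;code&gt;git checkout -b other origin/new-feature&lt;/code&gt;)
will &lt;em&gt;also&lt;/em&gt; track the remote, unless &lt;code&gt;--no-track&lt;/code&gt; is added. A branch created from the local &lt;code&gt;new-feature&lt;/code&gt; does not pick up this tracking by default.&lt;/p&gt;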
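&lt;p&gt;Here is a minimal sketch of inspecting and setting up tracking in the sandbox from the setup above. The branch names &lt;code&gt;new-feature&lt;/code&gt; and &lt;code&gt;experiment&lt;/code&gt; are made up for illustration, and the comments describe what I&amp;rsquo;d expect with Git&amp;rsquo;s default configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;cd jerry-repo
# list each local branch along with its upstream, if it has one
git branch -vv
# create a local branch (hypothetical name), then publish it and
# record origin/new-feature as its upstream
git checkout -b new-feature
git push -u origin new-feature
# branch from the remote tracking branch, but opt out of tracking it
git checkout -b experiment --no-track origin/new-feature
# new-feature should list an upstream here; experiment should not
git branch -vv
# return to master so the rest of the notes still apply
git checkout master
cd ..&lt;/code&gt;&lt;/pre&gt;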
&lt;h2&gt;Git Fetch and Merge vs. Git Pull&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;git-fetch-and-merge-vs-git-pull&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#git-fetch-and-merge-vs-git-pull&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Recall the above point that remote tracking branches are local, so
(unlike Subversion) they function even when you&amp;rsquo;re not able to connect
to the remote. This gives an example of how elegant Git is: being so
similar to regular branches, Git remote tracking branches can be
merged into local branches. This is precisely what goes on behind the
scenes with &lt;code&gt;git pull&lt;/code&gt;. Here&amp;rsquo;s an example. First, let&amp;rsquo;s set it up such
that a developer in your group, Kramer, made some changes, committed
them, then pushed them to the remote.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# assuming you&amp;#39;re in the right directory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; kramer-repo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;echo&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;kramer adding gibberish&amp;#34;&lt;/span&gt; &amp;gt;&amp;gt; file.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git commit -am &lt;span class=&#34;s2&#34;&gt;&amp;#34;I added some gibberish&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git push
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; ..&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, imagine you (Jerry) have made some commits but want to see what
the status of the remote looks like. &lt;code&gt;git remote show&lt;/code&gt; can be used to
see if the local remote tracking branch is out of date.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;vinceb@poisson$ git remote show origin
* remote origin
  Fetch URL: /Users/vinceb/Desktop/git-sandbox/fake-remote
  Push  URL: /Users/vinceb/Desktop/git-sandbox/fake-remote
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for &amp;#39;git pull&amp;#39;:
    master merges with remote master
  Local ref configured for &amp;#39;git push&amp;#39;:
    master pushes to master (local out of date)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So we are out of date! We can use &lt;code&gt;git fetch&lt;/code&gt; to see these commits before merging
them in. &lt;code&gt;git fetch&lt;/code&gt; updates your remote tracking branch with
the new changes, allowing you to &lt;code&gt;diff&lt;/code&gt; branches just as you would
with two regular branches.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;vinceb@poisson$ git diff master origin/master
diff --git a/file.txt b/file.txt
index 4e850ce..34ceb34 100644
--- a/file.txt
&amp;#43;&amp;#43;&amp;#43; b/file.txt
@@ -1 &amp;#43;1,2 @@
 an example file
&amp;#43;kramer adding gibberish&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git log origin/master master&lt;/code&gt; also works. &lt;code&gt;git log origin/master ^master&lt;/code&gt; shows us just the new commits. We could really explore these
commits by checking out the remote tracking branch (but consider this
to be &amp;ldquo;taking a visit&amp;rdquo;; don&amp;rsquo;t commit anything). Suppose we did, and we
decide we want to merge them with our current branch. For this, we
just use &lt;code&gt;git merge&lt;/code&gt;: remember remote tracking branches are just
branches!&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;vinceb@poisson$ git branch ## always check what branch you&amp;#39;re on!
* master
vinceb@poisson$ git merge origin/master
Updating ee922a9..8c8c240
Fast-forward
 file.txt |    1 &amp;#43;
 1 files changed, 1 insertions(&amp;#43;), 0 deletions(-)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that &lt;code&gt;git pull&lt;/code&gt; basically is &lt;code&gt;git fetch &amp;amp;&amp;amp; git merge&lt;/code&gt;.&lt;/p&gt;
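&lt;p&gt;Spelled out by hand, and assuming the current branch is &lt;code&gt;master&lt;/code&gt; tracking &lt;code&gt;origin/master&lt;/code&gt; as in the sandbox above, the two steps look roughly like this (a sketch, not a full account of everything &lt;code&gt;git pull&lt;/code&gt; does):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;cd jerry-repo
# step 1: update origin/master; the local master is left untouched
git fetch origin
# optional: list the commits on origin/master that master lacks
git log master..origin/master
# step 2: merge the remote tracking branch into the current branch
git merge origin/master
cd ..&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Doing the two steps separately gives you a chance to inspect the incoming commits before they touch your branch. &lt;code&gt;git pull --rebase&lt;/code&gt; replaces the merge step with a rebase.&lt;/p&gt;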
&lt;h2&gt;Great resources&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;great-resources&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#great-resources&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://ftp.newartisans.com/pub/git.from.bottom.up.pdf&#34;&gt;Git from the Bottom Up&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://gitref.org/&#34;&gt;Git Reference&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>The Beauty of Bioconductor</title>
      <link>https://vincebuffalo.com/blog/the-beauty-of-bioconductor/</link>
      <pubDate>Thu, 08 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/the-beauty-of-bioconductor/</guid>
      <description>
        
        
        &lt;p&gt;In talking with bioinformaticians, biologists, and other researchers,
I&amp;rsquo;ve seen some worrying trends in computation in the sciences. I plan
on writing about these extensively in the future, as I believe
computation in the sciences will not scale well to face the huge
wealth of data that upcoming experiments will provide. This is not due to
algorithmic or hardware limitations, but rather to the fact that
scientific programmers simply do not have the same standards and
practices that the software industry does.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/big-data-breakpoint.png&#34; alt=&#34;How do we prevent a big data breaking point?&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Three events are simultaneously occurring that could endanger the
validity of scientific conclusions in the future. First, new
technology is providing the average scientist with more data than ever
before. Genomics is the prime example of this: the average biologist
can now sequence multiple samples simultaneously, whereas this would have been
prohibitively expensive just a few years ago. As metabolomic and
proteomic data are increasingly incorporated into research alongside
genomic data, the work done by bioinformaticians will increase
significantly. More lines of code to write and more data to process
under deadlines will doubtlessly lead to mistakes.&lt;/p&gt;
&lt;p&gt;The second contributing factor is that more researchers are writing code and
analyzing their own data rather than hiring a bioinformatician or statistician. It&amp;rsquo;s
an awesome and commendable occurrence, but sadly academic institutions don&amp;rsquo;t
adequately prepare researchers to code to high standards. Also, in many cases
these researchers learn to program by analyzing their own experimental data,
rather than example or “toy” data. This makes “silent” mistakes (i.e. those
that don’t prompt an error, but lead to incorrect results) impossible to
discover as the actual results are not known.&lt;/p&gt;
&lt;p&gt;The last contributing factor is that there’s not a strong expectation
that coding standards and software engineering practices be upheld in
the sciences. There’s a strong &lt;a href=&#34;http://en.wikipedia.org/wiki/Cowboy_coding#Inexperienced_developers&#34;&gt;cowboy
coding&lt;/a&gt;

culture in scientific programming. In this mindset, the coding is done
when the data is processed, not when the data is processed, the code
documented, the unit tests passed, the code checked into a repository,
etc. The scientific community needs to embrace the idea that proper
data analysis takes time: perhaps as long or longer than gathering
experimental data.&lt;/p&gt;
&lt;p&gt;In future essays I’ll talk more about these issues in depth. This
stuff honestly keeps me (and other people I know) awake at night. I
worry humanity may face a
&lt;a href=&#34;http://en.wikipedia.org/wiki/Thalidomide&#34;&gt;Thalidomide-like&lt;/a&gt;
 event in
the future due to an error in scientific programming.&lt;/p&gt;
&lt;p&gt;However, here I want to commend a project that I feel is underutilized in the
biology and bioinformatics communities: Bioconductor. It’s worthy of praise as
both an example of, and a tool to aid in, excellence in bioinformatics
programming. I’ll focus primarily on its capacity for handling high throughput
sequencing data (even though it handles data from other assays very well too).&lt;/p&gt;
&lt;h2&gt;Where is Bioinformatics?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-is-bioinformatics&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-is-bioinformatics&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Currently, many bioinformatics analyses go something like this:
experimental data is received, and then a bioinformatician downloads
vast amounts of other data relating to the experiment from web
resources such as the UCSC Genome Browser, Ensembl, Phytozome,
etc. Often this includes genome assemblies, transcript sequences, and
annotation data. Then, application code (alignment software,
assemblers, SNP finding tools, etc.) is downloaded and compiled. These
tools are used alongside custom code that combines the downloaded
data with the experimental data, and this produces results that are
interpreted. Intermediate results may be fed into other online tools
and databases like &lt;a href=&#34;http://david.abcc.ncifcrf.gov/&#34;&gt;DAVID&lt;/a&gt;
 or
&lt;a href=&#34;http://www.reactome.org/ReactomeGWT/entrypoint.html&#34;&gt;Reactome&lt;/a&gt;
.&lt;/p&gt;
&lt;p&gt;However, this is a bad model if one wants the analysis to be
reproducible. The common weakness is that web resources can be
unstable. It’s then necessary for the researcher to record software
and data versions manually. Even if the researcher dutifully complies,
outside databases and code repositories may disappear and leave the
project unable to be reproduced. Researchers truly invested in
conducting reproducible research then have to store data and software
versions themselves, which given the scale of genomic data is quite a
burden.&lt;/p&gt;
&lt;p&gt;Thus, three things currently hinder reproducible research in
bioinformatics: the scale of both experimental and other required data
prevents easy self-archival, the fast-paced development of
bioinformatics tools could lead to differing results across versions,
and the overwhelming prevalence of web-based data resources and
applications which are not easily reproducible.&lt;/p&gt;
&lt;h2&gt;The Bioconductor Solution&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-bioconductor-solution&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-bioconductor-solution&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Bioconductor has, in my opinion, the best solution to these
problems. First, Bioconductor stores past versions of its packages
back to their earliest releases. Past experiments can be replicated
using the exact version of software that was used for the actual
analysis.&lt;/p&gt;
&lt;p&gt;Second, Bioconductor stores data as packages. Pre-packaged versioned
data is a cornerstone of reproducible research. For example, suppose I
am working with human RNA-seq data. This requires transcript
annotation data, which could be downloaded from an online resource. To
be reproducible, this requires that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The webhost be up indefinitely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The URL remain stable and point to the exact same resource.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The user provide not only a URL but the version of data/software
downloaded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The external resource provider (i.e. database or application
developer) actually update their versions accordingly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For absolute best practice, it’s also necessary to MD5 checksum the
data and record this value, to verify that any data later gathered from
the same source is exactly the same.&lt;/p&gt;
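&lt;p&gt;As a minimal sketch of that checksum step, base R’s &lt;code&gt;tools::md5sum()&lt;/code&gt; can record a file’s MD5 value and re-check it later. The file name and its one-line contents below are hypothetical stand-ins for a downloaded annotation file:&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Hypothetical stand-in for a downloaded annotation file.
annotation_file &amp;lt;- file.path(tempdir(), &amp;#34;annotation.gtf&amp;#34;)
writeLines(&amp;#34;chr1\texample\tgene\t1\t100&amp;#34;, annotation_file)

# Record the checksum at download time (unname() drops the file-path name).
recorded_md5 &amp;lt;- unname(tools::md5sum(annotation_file))

# Later: recompute and stop if the file on disk is no longer identical.
current_md5 &amp;lt;- unname(tools::md5sum(annotation_file))
stopifnot(identical(recorded_md5, current_md5))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In practice the recorded value would be kept alongside the URL and the download date, so that a later download can be checked against it.&lt;/p&gt;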
&lt;p&gt;In contrast, consider how I would load human transcript data into R
with Bioconductor:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TxDb.Hsapiens.UCSC.hg19.knownGene&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The versioning here is done explicitly in the package name: hg19. I
could also easily record the state of all my Bioconductor packages and
my session with &lt;code&gt;sessionInfo()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; sessionInfo()
  
  R version 2.14.1 (2011-12-22)
  Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
  
  locale:
  [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     
  
  loaded via a namespace (and not attached):
   [1] annotate_1.32.2       AnnotationDbi_1.17.23 Biobase_2.14.0       
   [4] BiocGenerics_0.1.12   DBI_0.2-5             DESeq_1.6.1          
   [7] genefilter_1.36.0     geneplotter_1.32.1    grid_2.14.1          
   [10] IRanges_1.12.6        RColorBrewer_1.0-5    RSQLite_0.11.1       
   [13] splines_2.14.1        survival_2.36-12      xtable_1.7-0  &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
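&lt;p&gt;One way to keep that record with an analysis, sketched here with base R only, is to write the output to a text file next to the results. The output path is a hypothetical choice:&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Hypothetical output path; in a real project this would sit next to the results.
session_file &amp;lt;- file.path(tempdir(), &amp;#34;session-info.txt&amp;#34;)

# capture.output() turns the printed session information into character lines.
writeLines(capture.output(sessionInfo()), session_file)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;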
&lt;p&gt;Entire genomes are also packaged via the &lt;a href=&#34;http://www.bioconductor.org/packages/release/bioc/html/BSgenome.html&#34;&gt;BSgenome
package&lt;/a&gt;

(BS refers here to
&lt;a href=&#34;http://www.bioconductor.org/packages/2.9/bioc/html/Biostrings.html&#34;&gt;Biostrings&lt;/a&gt;
). If
the data in packages is not sufficiently recent, the
&lt;a href=&#34;http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html&#34;&gt;GenomicFeatures&lt;/a&gt;

package provides a programmatic way of downloading, packaging, and
using data from BioMart and UCSC Genome Browser tracks, and provides
functions for saving and loading &lt;code&gt;TranscriptDb&lt;/code&gt; objects from such
resources. Recently Duncan Temple Lang and I were speaking about
reproducible research, and he said “people adopt best practices when
they’re right in front of their face”. Bioconductor’s tools do just
that. Furthermore, Bioconductor has strict coding and documentation
standards (much stricter than CRAN actually), which ensures
user-contributed packages are high quality.&lt;/p&gt;
&lt;h2&gt;Information leakage and statistics at every level&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;information-leakage-and-statistics-at-every-level&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#information-leakage-and-statistics-at-every-level&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When discussing R and Bioconductor with other researchers, it’s easy
to convince them to adopt both for analyzing statistical data - the
data that comes in the very final stages of a bioinformatics
analysis. It’s usually much more difficult to convince them to
consider working with high throughput sequencing data in
Bioconductor. Folks complain that it’s (1) not worth it to process
sequencing data with Bioconductor tools or (2) it’s not fast
enough. I’ll address the second point in a bit; more importantly I
want to emphasize that it’s &lt;strong&gt;absolutely&lt;/strong&gt; worth it to process sequencing
data in Bioconductor.&lt;/p&gt;
&lt;p&gt;In analyzing genomic data, we take very, very, very high dimension
data and try to condense it into biologically meaningful conclusions
without being misleading or getting something wrong. Every step is
about taking dense data and making it understandable: we take sequence
reads and try to assemble them into larger contigs and scaffolds, we
take cDNA reads and try to map them back to genomes to understand
expression, etc. At each step, our tools make heuristic or statistical
choices for us. Pipelines woefully ignore these choices because in
most cases, after a step is completed, a script jumps to the next
step.&lt;/p&gt;
&lt;p&gt;When I think about these steps, I try to assess what I think of as
“informational leakage” in bioinformatics processing. Each step
summarizes something, hopefully in a way without bias or too much
noise. Informational leakage is the information that’s lost between
steps. Catastrophic information leakage occurs when we lose
information that could have indicated whether the data is biased or
incorrect. We can hedge the risk of information leakage when we use
summary statistics between steps that try to capture this leaked
information.&lt;/p&gt;
&lt;p&gt;Consider processing RNA-seq reads. The first step is usually quality
control, i.e. removing adapter sequences and trimming off poor quality
bases. Failing to gather summary statistics before and after each of
these steps leads to potentially catastrophic information
leakage. Suppose that an experiment has control and treatment groups
sequenced on two different lanes (bad experimental design!). If one lane
has systematically lower 3’-end quality than the other, quality
trimming software will trim these bases off and lead one experimental
group to have much shorter sequences than the other. The mapping rates
will differ significantly, as shorter reads may map less uniquely. In
the end, our data is completely confounded not only by the lane (and
bad experimental design), but by our own tools! If these tools are
being run in a pipeline without intermediate summary statistics being
gathered, this catastrophic information leakage will go unnoticed.&lt;/p&gt;
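&lt;p&gt;A toy sketch of the kind of intermediate summary that would catch this, using base R only. The read lengths are made-up numbers chosen to mimic the scenario above, not real data:&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Made-up read lengths: each group was sequenced on its own lane.
reads &amp;lt;- data.frame(
  group = rep(c(&amp;#34;control&amp;#34;, &amp;#34;treatment&amp;#34;), each = 4),
  length_before = rep(100, 8),
  length_after = c(98, 100, 97, 99, 62, 70, 58, 66)
)

# Mean read length per group, before and after quality trimming.
length_summary &amp;lt;- aggregate(
  cbind(length_before, length_after) ~ group,
  data = reads,
  FUN = mean
)
print(length_summary)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both groups start at the same mean length, but after trimming the treatment reads are far shorter. That is exactly the difference a pipeline would otherwise pass silently to the mapping step.&lt;/p&gt;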
&lt;p&gt;Loading sequencing data into R and using Bioconductor’s tools earlier
allows summary statistics to be gathered earlier and more easily (R
is, after all, great for statistics and visualization), which I
strongly believe will decrease the risk of catastrophic information
leakage in genomics data analysis. This is why I wrote
&lt;a href=&#34;http://bioconductor.org/packages/release/bioc/html/qrqc.html&#34;&gt;qrqc&lt;/a&gt;
,
which can provide quick summary statistics on sequencing read
quality. Used before and after the application of quality tools,
&lt;code&gt;qrqc&lt;/code&gt; can provide information not only on the state of the data, but
also the effect of the tools.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/qrqc.png&#34; alt=&#34;sequencing data quality analysis with qrqc&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;With existing Bioconductor packages, many useful statistics can be
gathered on whole reads, BAM mapping results, VCF files, etc.&lt;/p&gt;
&lt;h2&gt;Massive Power&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;massive-power&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#massive-power&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The complaint that R is slow, and couldn’t possibly be used with
sequencing/mapping-level data is unwarranted. In reality, some of
Bioconductor’s core packages for working with high throughput
sequencing data such as &lt;code&gt;Biostrings&lt;/code&gt; and &lt;code&gt;IRanges&lt;/code&gt; (the foundation of
GenomicRanges) are astoundingly fast because most of their backends
are written in C. &lt;code&gt;Biostrings&lt;/code&gt; actually uses external pointers to C
structures and bit patterns to encode biological string data
efficiently.&lt;/p&gt;
&lt;p&gt;In addition to being fast, they’re also clever. &lt;code&gt;Biostrings&lt;/code&gt;
implements an abstraction called a &amp;ldquo;view&amp;rdquo; on an &lt;code&gt;XString&lt;/code&gt; object,
which efficiently represents multiple sections of the same string
object (such as subsequences of interest). While I wouldn’t write a
short read aligner or assembler entirely in R, many bioinformatic
tasks are more than sufficiently fast with Bioconductor tools.&lt;/p&gt;
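&lt;p&gt;A short sketch of views, assuming the &lt;code&gt;Biostrings&lt;/code&gt; package is installed. The sequence and the coordinates are made up for illustration:&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;library(Biostrings)

# A toy sequence; any DNAString would do.
chrom &amp;lt;- DNAString(&amp;#34;ACGTACGTTTGACCAGGATTACA&amp;#34;)

# Two views onto the same underlying sequence, given by start and end positions.
regions &amp;lt;- Views(chrom, start = c(1, 10), end = c(8, 17))

# Views can be queried like ranges, e.g. for their widths.
width(regions)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;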
&lt;h2&gt;Conclusion&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;conclusion&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#conclusion&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In evangelizing Bioconductor, I have two goals. First, I want to
spread awareness that it&amp;rsquo;s the best way to do reproducible
bioinformatics that I think exists today. I want more people to use it
just because I deeply care about the state of science and
reproducibility. Second, I want to build excitement about this project
so that more people will contribute. I believe that far too many high
quality bioinformatics tools are written outside of
Bioconductor. Packaging bioinformatics tools in Bioconductor forces
the developer to adopt strict standards, write clear documentation,
and open up a program to a large, active user base. Any results from a
package&amp;rsquo;s methods can then easily be evaluated using R, CRAN, or
Bioconductor’s existing tools.&lt;/p&gt;
&lt;p&gt;I also believe that large programs (like BLAST, and maybe assemblers)
should provide better interfaces to R, to prevent information leakage
in analysis. I&amp;rsquo;m willing to bet that a large majority of
bioinformatics tools could be outputting more statistics than they
currently do that could be valuable in assessing their
functionality. R interfaces to these bioinformatics tools will
make it drastically easier for biologists and bioinformaticians to
prevent information leakage.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Thoughts on Julia and R</title>
      <link>https://vincebuffalo.com/blog/thoughts-on-julia-and-r/</link>
      <pubDate>Wed, 07 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/thoughts-on-julia-and-r/</guid>
      <description>
        
        
        &lt;h2&gt;Hello, Julia&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;hello-julia&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#hello-julia&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;a href=&#34;http://julialang.org&#34;&gt;Julia&lt;/a&gt;
 is an exciting new technical computing
language. It&amp;rsquo;s still in its infancy, but it&amp;rsquo;s fast (see below), and
already does a lot.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/julia_speed.png&#34; alt=&#34;Comparison of Julia to other languages&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s been some excitement on Twitter about Julia. Excitement
combined with open source often yields development, which then leads
to further excitement, until a mature open source project arises. One
of Julia&amp;rsquo;s explicit goals is to challenge other statistical computing
environments, including R.&lt;/p&gt;
&lt;h2&gt;What&amp;rsquo;s wrong with R?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;whats-wrong-with-r&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#whats-wrong-with-r&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;R is, without a doubt, changing the world. It&amp;rsquo;s being used by industry
giants like Facebook and Google, while also providing academic
researchers in statistics, biology, psychology, and countless other
fields with not only a free and open source statistical environment,
but a huge number of user-contributed packages through CRAN. Now
methods papers in many fields are often accompanied by CRAN or
Bioconductor packages. It&amp;rsquo;s also a brilliant platform for reproducible,
open research, as Bioconductor beautifully illustrates with packaged
and version-controlled genomes, microarray probesets, etc.&lt;/p&gt;
&lt;p&gt;However, R is suffering from growing pains. For example, although there
are now 64-bit versions of R, vector indexing is still limited by
&lt;code&gt;R_len_t&lt;/code&gt; (see definition in &lt;code&gt;src/include/Rinternals.h&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;o&#34;&gt;/*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;type&lt;/span&gt; &lt;span class=&#34;kr&#34;&gt;for&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;length&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;of&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;vectors&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;etc&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;typedef&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;R_len_t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;/*&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;will&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;be&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;long&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;later&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LONG64&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;or&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;ssize_t&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;on&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;Win64&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;*/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;#define R_LEN_T_MAX INT_MAX&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;It appears that one can simply change this to a long and recompile to
increase the longest possible addressable vector, but no. Take a look
at &lt;code&gt;R_euclidean&lt;/code&gt; in &lt;code&gt;library/stats/src/distance.c&lt;/code&gt; for an example of why:
almost all variables for iterating over elements in vectors are
defined as integers and don&amp;rsquo;t use this type. One would have to read
through every function, and every line of code to fix this.&lt;/p&gt;
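&lt;p&gt;The ceiling that &lt;code&gt;INT_MAX&lt;/code&gt; imposes is visible from R itself. A quick check, assuming a platform where a C &lt;code&gt;int&lt;/code&gt; is 32 bits (true of the common ones):&lt;/p&gt;
&lt;div&gt;&lt;pre&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;# Largest value an R integer (a C int underneath) can hold.
max_len &amp;lt;- .Machine$integer.max

# On platforms with a 32-bit int this is 2^31 - 1, a little over two billion.
max_len == 2^31 - 1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;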
&lt;p&gt;&lt;code&gt;R_len_t&lt;/code&gt; is just one example. Another issue is that R has been slow
to adopt new compiler technologies (e.g. JIT, optional type
indications, etc). R almost always gains speed from pushing stuff to C
(the recent bytecode compiler is an exception). This isn&amp;rsquo;t a problem,
but it&amp;rsquo;s a huge limitation to require developers to not only know R,
but also C, and also how to interface the two. More modern languages
(Java, as well as Python and Julia come to mind) spend more time
tracking compiler technology developments and implementing them than R
core does (again, Luke Tierney and the bytecode compiler are
exceptions). It&amp;rsquo;s still sometimes efficient to use C with these
languages (consider &lt;a href=&#34;http://cython.org/&#34;&gt;Cython&lt;/a&gt;
), but developers in
these languages aren&amp;rsquo;t cracking open Kernighan and Ritchie every time
they need to have a &lt;code&gt;for&lt;/code&gt; loop do something quickly.&lt;/p&gt;
&lt;p&gt;Another gripe I have is that R language development is somewhat
closed. Despite a quickly expanding user base, the number of R core
contributors is not increasing. I find it hard to believe this is due
to lack of interest. It seems much more likely this is due to
institutional reasons that need to be changed. The nice thing about
language development is that it&amp;rsquo;s really hard, so opening up R to more
contributors won&amp;rsquo;t likely flood the existing core with bad ideas and
patches. Personally, I would dedicate much more time to profiling, reading
the source, and working on the R language if it were more open.&lt;/p&gt;
&lt;p&gt;The last gripe I have is that R is fragmented. Consider Python:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;re&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;re&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;search&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;sa&#34;&gt;r&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;R-([\d]+)\.([\d]+)&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;R-2.15&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;groups&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, consider R:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;gsub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;R-([\\d]+)\\.([\\d]+)&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;\\1&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;R-2.15&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# or&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;library&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;stringr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;str_match&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;R-2.15&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;s&#34;&gt;&amp;#34;R-([0-9]+)\\.([0-9]+)&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Now, Python also has PyPI&amp;rsquo;s &lt;a href=&#34;http://pypi.python.org/pypi/re2/&#34;&gt;&lt;code&gt;re2&lt;/code&gt;&lt;/a&gt;
,
but most developers are using &lt;code&gt;re&lt;/code&gt;. The motivation behind &lt;code&gt;stringr&lt;/code&gt; is
that R&amp;rsquo;s current family of string processing functions is horribly
inconsistent:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-R&#34; data-lang=&#34;R&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# (my ... to avoid writing all parameters)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;grep&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pattern&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;regexpr&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pattern&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;text&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;gsub&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;pattern&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;replacement&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nf&#34;&gt;strsplit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;split&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;...&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
  &gt;
    &lt;div class=&#34;copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
    &lt;div class=&#34;success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;But rather than deprecate these and move forward, we now have &lt;em&gt;two&lt;/em&gt;
sets of string processing
functions. &lt;a href=&#34;http://github.com/search?langOverride=&amp;amp;language=R&amp;amp;q=str_extract&amp;amp;repo=&amp;amp;start_value=1&amp;amp;type=Code&#34;&gt;Both are being used&lt;/a&gt;
. I&amp;rsquo;m
not saying Hadley Wickham is to blame here; quite the contrary, he&amp;rsquo;s
trying to fix a very annoying problem in the language and should be
commended. I think the community needs to be more open; for example,
before writing a package that processes strings, let&amp;rsquo;s discuss an
implementation plan, deprecating old functions, etc. If not, in the
future R will be highly fragmented, and end up with five different
object orientation systems&amp;hellip; oh, wait.&lt;/p&gt;
&lt;h2&gt;What would it take to &amp;ldquo;challenge&amp;rdquo; R?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-would-it-take-to-challenge-r&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-would-it-take-to-challenge-r&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Contributors to Julia are optimistic they can challenge R based on a
solid foundation of JIT compiling, parallelism, and nice language
semantics. I salute this optimism, but I think we need to
realistically consider what it would take to &amp;ldquo;challenge&amp;rdquo; R.&lt;/p&gt;
&lt;p&gt;First, we would need to build an equal statistical computing
environment. Consider moving all of &lt;code&gt;stats&lt;/code&gt;, &lt;code&gt;MASS&lt;/code&gt;, &lt;code&gt;graphics&lt;/code&gt;,
&lt;code&gt;grid&lt;/code&gt;, etc to Julia. Is Julia sufficiently faster than R &lt;em&gt;will be&lt;/em&gt; in
the time it takes to port these base packages? Remember, R is a moving
target; despite my few earlier gripes, R will evolve and get
faster. Now, consider adding the extremely popular CRAN packages like
&lt;code&gt;ggplot2&lt;/code&gt; and &lt;code&gt;lattice&lt;/code&gt; to Julia. In the time it takes to port these,
is Julia still sufficiently faster than R will be?&lt;/p&gt;
&lt;p&gt;Suppose it is still faster than R. What about after we port the rest
of CRAN, and all of Bioconductor to Julia? My point isn&amp;rsquo;t to say that
it&amp;rsquo;s unimaginable that Julia will surpass R. It&amp;rsquo;s that developers
should really dissect what makes a successful language successful
before they try to challenge it. I don&amp;rsquo;t have a horse in this race; I
would love to see Julia surpass R. But if all developers want is a
fast environment to analyze large data sets using a wealth of methods
and libraries, it may be a lot easier to make R faster than to develop
a new fast language and hope/wait/beg the community to move over.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Julia core developer Jeff Bezanson sent me a very kind
email on March 9th, 2012 about this post. In it, he said the
&amp;ldquo;challenge R&amp;rdquo; statement was made by a community member and is in no
way the mission of the Julia language. He had many kind words to say
about the R language and its statistical functionality.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>The Unbelievable Debate: Some Ramblings on Machine Learning in Science</title>
      <link>https://vincebuffalo.com/blog/the-unbelievable-debate-some-ramblings-on-machine-learning-in-science/</link>
      <pubDate>Sat, 03 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/the-unbelievable-debate-some-ramblings-on-machine-learning-in-science/</guid>
      <description>
        
        
        &lt;p&gt;In between refactoring some &lt;code&gt;qrqc&lt;/code&gt; code this morning and looking at
RNA-seq data, I grabbed some cold brew coffee and caught up on some
missed tweets. Admittedly, my brain glosses over most tweets, but
&lt;a href=&#34;https://twitter.com/#!/drewconway/status/176725770885017600&#34;&gt;this tweet&lt;/a&gt;

from Drew Conway had the right mix of keywords to actually make me
click and read the link:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;The data science debate: domain expertise or machine learning? by
@medriscoll &lt;a href=&#34;http://bit.ly/zr17Z2&#34;&gt;http://bit.ly/zr17Z2&lt;/a&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don&amp;rsquo;t mean for this title to be inflammatory, but I do believe this
debate is a bit unbelievable. Machine learning &lt;em&gt;is&lt;/em&gt; magical; I imagine
that everyone that has studied it goes through a
&lt;a href=&#34;http://en.wikipedia.org/wiki/Hype_cycle&#34;&gt;hype cycle&lt;/a&gt;
-like set of
epiphanies. This is my hype cycle story, and why I believe machine
learners need to calm down, collaborate with domain experts, and
together tackle hard problems.&lt;/p&gt;
&lt;h2&gt;Social Sciences and Machine Learning Caution&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;social-sciences-and-machine-learning-caution&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#social-sciences-and-machine-learning-caution&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Biologists, it&amp;rsquo;s true: I&amp;rsquo;m not one of you. I&amp;rsquo;m a transplant from the
social sciences. Specifically, from political science and economics
(with statistics too), where my interests lie in methodology and
comparative politics.&lt;/p&gt;
&lt;p&gt;In the social sciences, the dimensionality of most problems is small
enough that data mining is (at least in my experience) frowned upon. A
lot of political data is collected by hand, often by undergraduates
toiling away for meager pay as they try to understand some cryptic
event coding protocol. There are some very large &lt;em&gt;p&lt;/em&gt; data sets: The
&lt;a href=&#34;http://globalpolicy.gmu.edu/pitf/&#34;&gt;Political Instability Task Force&amp;rsquo;s&lt;/a&gt;

data set is something I&amp;rsquo;ll keep mentioning. Mining this data with
algorithms &lt;em&gt;looking&lt;/em&gt; for interesting relationships was exactly how I
was taught &lt;em&gt;not&lt;/em&gt; to do political science.&lt;/p&gt;
&lt;p&gt;I recall one story of a candidate giving a job talk mentioning he used
forward step-wise regression to find interesting variables (in a
presumably small &lt;em&gt;p&lt;/em&gt; data set) and three people immediately stood up
and left. I was proud to know statistical learning techniques, yet
avoid them. Political science had flirted with neural
networks to understand massive state failure data sets, but my endless
gripe was that these were &lt;em&gt;predictive&lt;/em&gt;, not &lt;em&gt;causal&lt;/em&gt; models. The
latter required some &lt;em&gt;a priori&lt;/em&gt; testable theory, often derived from an
intimate knowledge of political crisis in a variety of countries. Just
as I thought biologists knew &lt;em&gt;C. elegans&lt;/em&gt; or &lt;em&gt;S. cerevisiae&lt;/em&gt; well
enough to form interesting experiment ideas, political scientists knew
many political crises well enough to form theories and test them on a
larger set of data in a quantitatively rigorous fashion. I also
believed that predictive models of state failure may predict recorded
(even when out-of-sample!) state failures well, but a model backed by a
good theory that fits existing data slightly less well could predict
unseen cases even better
(&lt;a href=&#34;http://www.amazon.com/Predictioneers-Game-Brazen-Self-Interest-Future/dp/1400067871&#34;&gt;Bruce Bueno de Mesquita has an entire wonderful book about game theory being such a model&lt;/a&gt;
).&lt;/p&gt;
&lt;h2&gt;The Machine Learning Awakening&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-machine-learning-awakening&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-machine-learning-awakening&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When I made the jump to analyzing gene expression data, I was
initially astounded at how many algorithms were thrown at it. I had
this vision of the hard sciences having randomization and
experimentation at their disposal to lead to the purest causal
findings. Looking for any differences in 30,000 genes&amp;rsquo; expression
values and then forming hypotheses after seemed backwards. Microarrays
shocked biology with what they revealed about cancer and the cell, but
they probably shocked the methods of experimental biology more. If
your average biologist had a tenuous knowledge of p-values to begin
with, now microarray analysts were throwing around false discovery rates,
empirical Bayesian techniques, Storey&amp;rsquo;s q-value, etc.&lt;/p&gt;
&lt;p&gt;However, as I analyzed more and more sets of data, the initial
reluctance I had about employing machine learning algorithms
disappeared. In hype cycle terms, I was climbing the peak of inflated
expectations. A quote from Michael E. Driscoll&amp;rsquo;s article captures
this excitement:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Claudia Perlich, a three-time winner of the KDD Nuggets competition,
stood up and shared how she had won contests in domains as varied as
&amp;ldquo;breast cancer, movie prediction, and sales performance - and I can
tell you I knew next to nothing about those things when I started.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This optimism is abundant, and not entirely without
justification. Coming from a non-biological background yet thoroughly
understanding machine learning provides an immensely satisfying
feeling of understanding of the cell. Employing all sorts of machine
learning techniques, I could find &amp;ldquo;biologically interesting&amp;rdquo; genes in
data sets and help biologists understand the cell.&lt;/p&gt;
&lt;h2&gt;A Few Epiphanies and a Dip of Disillusionment&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;a-few-epiphanies-and-a-dip-of-disillusionment&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#a-few-epiphanies-and-a-dip-of-disillusionment&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The Hype Cycle&amp;rsquo;s lowest stage is the &amp;ldquo;trough of
disillusionment&amp;rdquo;. Machine learning in biology certainly hasn&amp;rsquo;t had its
trough (and I don&amp;rsquo;t think it will), but it is priming up to have its
&amp;ldquo;slope of enlightenment&amp;rdquo; and &amp;ldquo;plateau of productivity&amp;rdquo;. There will be
future machine learning hype cycles in biology, especially as multiple
heterogeneous data sets need to be simultaneously mined to understand the
cell with the systems approach.&lt;/p&gt;
&lt;p&gt;My personal dip didn&amp;rsquo;t happen because machine learning left me with a
particularly terrible result - it occurred (1) because of an
interaction I had with an experimental biologist and (2) because I realized
how wonderfully complex the cell is.&lt;/p&gt;
&lt;h3&gt;Let&amp;rsquo;s Put That in This&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lets-put-that-in-this&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lets-put-that-in-this&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The first interaction I had was with a graduate student friend of
mine. We were discussing an interesting finding the Korf Lab made:
that some introns lead to increased expression
(&lt;a href=&#34;http://www.frontiersin.org/plant_genetics_and_genomics/10.3389/fpls.2011.00098/full&#34;&gt;paper here&lt;/a&gt;
). Introns
traditionally haven&amp;rsquo;t received the same attention as promoters or enhancers
in regulating gene expression. A member of the Korf lab had previously
mentioned intron-mediated expression in passing to me, and I
immediately started imagining what ways I could look for such an
effect &lt;em&gt;in silico&lt;/em&gt;. As I understood it, &lt;em&gt;in silico&lt;/em&gt; was how it was
first discovered, further increasing my admiration of algorithms
applied to biology. When my friend mentioned it again, the first thing
she said was, &amp;ldquo;well, we just need to take that intron and put it in
something&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I immediately agreed, but I realized something: I hadn&amp;rsquo;t thought of
that simple step the first time I thought about intron-mediated
expression. Machine learning can bring so much wealth in finding
interesting relationships that my mind had glossed over the most
important question in science: whether these relationships were
spurious or causal. This is why my training in the social sciences was
rigidly anti-machine learning: it&amp;rsquo;s far too easy to let our thought
processes about &lt;em&gt;understanding&lt;/em&gt; a complex system be biased by some
spurious relationships machine learning and predictive models can
quickly give us.&lt;/p&gt;
&lt;h3&gt;The Complexity of the Cell&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-complexity-of-the-cell&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-complexity-of-the-cell&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The epiphany was gradual (and still occurring): the cell is
wonderfully complex, or as my mind puts it &amp;ldquo;fucking awesomely
complex&amp;rdquo;. Machine learning applied to gene expression data gives
valuable insights into a complex system, but it&amp;rsquo;s really a messy
snapshot. I think we&amp;rsquo;ll look at current pristine RNA-seq experiments
in twenty years and we&amp;rsquo;ll realize they&amp;rsquo;re giving us an image into
cellular activity that is akin to a photograph from a cheap Soviet-era
camera.&lt;/p&gt;
&lt;p&gt;Measuring gene expression from many cells glosses over interesting
variation in each cell; this is certainly not a new
complaint. However, even a &lt;em&gt;single&lt;/em&gt; cell image is messy: mRNAs that
make their way into gene expression values may have never been
exported from the nucleus, they could have been degraded by the cell,
silenced, undergone post-transcriptional modification, etc. What&amp;rsquo;s
astounding is that these systems are not just complex, but are
amazingly accurate. Cellular data is messy, but the cell certainly
isn&amp;rsquo;t. Development is a prime example of how tightly regulated these
processes are. It&amp;rsquo;s up to us to understand these tightly regulated
systems with the messy images scientific data gives us. Machine
learning is a necessary, but not sufficient tool to help us understand
the cell.&lt;/p&gt;
&lt;p&gt;As an example, genes interact in groups, and many algorithms can gloss
over this detail. If an algorithm tries to find a sparse set of genes
that are biologically interesting to the problem at hand, it may be
indifferent to which it includes from a set of co-expressed genes
(consider the lasso against the elastic net here). If a biologist
reviews these findings, they could easily miss something vastly
important based on machine learning&amp;rsquo;s indifference.&lt;/p&gt;
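&lt;p&gt;For reference, the two penalties differ only by a ridge term. Here is a sketch in standard notation, where $y$ is the response, $X$ is the expression matrix, $\beta$ holds the per-gene coefficients, and the $\lambda$s are tuning parameters (these symbols are my notation for illustration, not something defined above):&lt;/p&gt;

```latex
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
```

&lt;p&gt;With $\lambda_2 = 0$ the elastic net reduces to the lasso. The added squared term makes the objective strictly convex, which tends to give strongly correlated predictors similar coefficients, whereas the lasso tends to keep one gene from a correlated group and drop the rest.&lt;/p&gt;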
&lt;h2&gt;Let&amp;rsquo;s Use Both.&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lets-use-both&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lets-use-both&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;These epiphanies now guide my path through biology and
machine learning. I still love and am infatuated with machine learning
(although, I much prefer the phrase statistical learning). However, if
we wish to understand a complex system, we need to take the approach
that modern biology does: leverage machine learning with &lt;em&gt;a priori&lt;/em&gt;
biological expert knowledge to bootstrap findings. We need to design
experiments that also harness the power of machine learning to help us
&lt;em&gt;understand&lt;/em&gt;, and not just &lt;em&gt;predict&lt;/em&gt; the behavior of complex
systems. Applied machine learners need to realize the power of
experimental data. Chances are if you&amp;rsquo;re finding everything you think
is out there with just machine learning, you&amp;rsquo;re making a mistake or
your problem is too simple.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Elucidating k-mer Contamination with Kullback-Leibler Divergence</title>
      <link>https://vincebuffalo.com/blog/elucidating-kmer-contamination-with-kullback-leibler-divergence/</link>
      <pubDate>Thu, 01 Mar 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/elucidating-kmer-contamination-with-kullback-leibler-divergence/</guid>
      <description>
        
        
        &lt;p&gt;Recently a coworker showed me a FASTQ file from an Illumina HiSeq run
(which will be packaged in the new release of my Bioconductor package
&lt;a href=&#34;http://www.bioconductor.org/packages/release/bioc/html/qrqc.html&#34;&gt;qrqc&lt;/a&gt;
)
that was severely contaminated. Below is the file in &lt;code&gt;less&lt;/code&gt; with a
string highlighted:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/less_contams.png&#34; alt=&#34;A severely contaminated file in less, with many contaminants highlighted&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;Holy contamination, Batman! There are a few approaches to handling
this level of contamination. The program
&lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/19737799&#34;&gt;tagdust&lt;/a&gt;
 will match
contaminated reads and remove them. My program
&lt;a href=&#34;http://github.com/vsbuffalo/scythe&#34;&gt;Scythe&lt;/a&gt;
 is being changed so that it can
match adapter contaminants further in the read using its Bayesian
model. Both programs require &lt;em&gt;a priori&lt;/em&gt; knowledge of possible
contaminant sequences - what if this is a novel sequence contaminant?
In this case, &lt;code&gt;AAGCAGTGGTATCAACGCAGAGT&lt;/code&gt; appears to be a PCR primer
related to DSN normalization that may not have made it into our
adapter files.&lt;/p&gt;
&lt;h2&gt;k-mer Entropy Approaches&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;k-mer-entropy-approaches&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#k-mer-entropy-approaches&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;k-mer approaches are a nice way of searching for such contaminants. I
am currently adding this feature to &lt;code&gt;qrqc&lt;/code&gt;; you can follow development
on the &lt;code&gt;kmer&lt;/code&gt; branch on &lt;a href=&#34;http://github.com/vsbuffalo/qrqc&#34;&gt;Github&lt;/a&gt;
 but
this branch may merge into master and disappear.&lt;/p&gt;
&lt;p&gt;The C functions I&amp;rsquo;ve written use Heng Li&amp;rsquo;s
&lt;a href=&#34;http://attractivechaos.awardspace.com/khash.h.html&#34;&gt;khash.h&lt;/a&gt;
 library
to quickly hash k-mer sequences with their positions in the read. The
end result is a data frame in R of k-mer sequence, position, and
counts in that position.&lt;/p&gt;
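&lt;p&gt;The real implementation is C with &lt;code&gt;khash.h&lt;/code&gt;, but the idea can be sketched in a few lines of plain Python. This is only an illustration: the function name &lt;code&gt;count_kmers_by_position&lt;/code&gt; and the toy reads are made up, and nothing here handles N bases or quality values.&lt;/p&gt;

```python
from collections import Counter


def count_kmers_by_position(reads, k=6):
    """Return a Counter keyed by (position, kmer) with occurrence counts."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; position is 0-based.
        for pos in range(len(read) - k + 1):
            counts[(pos, read[pos:pos + k])] += 1
    return counts


# Toy reads for illustration only, not real sequencing data.
reads = ["AAGCAGTGGT", "AAGCAGTCCA", "TTGCAGTGGT"]
counts = count_kmers_by_position(reads, k=6)

# Flatten to (kmer, position, count) rows, mirroring the columns of the
# data frame described above.
rows = sorted((kmer, pos, n) for (pos, kmer), n in counts.items())
```

&lt;p&gt;Keying on the (position, k-mer) pair is what later lets the counts be summarized either per position or pooled across positions.&lt;/p&gt;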
&lt;p&gt;Looking at raw k-mer counts is somewhat useful, but I&amp;rsquo;ve been
exploring some information theoretical approaches to analyzing this
data. One useful graphic is entropy of k-mers by position:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/kmer_entropy.png&#34; alt=&#34;k-mer entropy increasing by position in read&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;These are 6-mers, so there are 4,096 possible k-mers (excluding N). If
the k-mer distribution were uniform, 12 bits would be needed to encode
each k-mer. This graph illustrates that even at the most random 3&amp;rsquo;-end
of the read, only about 6.5 bits are needed. In the first 20 bases,
the distribution of k-mers is so skewed that the Shannon entropy
is less than four bits.&lt;/p&gt;
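&lt;p&gt;To make the 12-bit ceiling concrete, here is a small Python sketch of the entropy calculation. The helper &lt;code&gt;shannon_entropy_bits&lt;/code&gt; and the skewed toy counts are assumptions for illustration; they are not taken from &lt;code&gt;qrqc&lt;/code&gt; or from the data plotted above.&lt;/p&gt;

```python
import math
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by a dict of counts."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in counts.values()
        if n > 0  # zero counts contribute nothing to the sum
    )


# Upper bound: a uniform distribution over all 4**6 = 4096 possible 6-mers.
uniform = {i: 1 for i in range(4 ** 6)}
max_bits = shannon_entropy_bits(uniform)  # equals log2(4096)

# A skewed toy distribution: one k-mer dominates, as a contaminant would.
skewed = Counter({"AAGCAG": 97, "TTGACC": 1, "GGCATA": 1, "CCTAGA": 1})
skewed_bits = shannon_entropy_bits(skewed)
```

&lt;p&gt;The uniform case gives the 12-bit maximum, while the skewed case comes out at a small fraction of a bit, which is the kind of drop a dominant contaminant produces at a given position.&lt;/p&gt;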
&lt;h2&gt;Kullback-Leibler Divergence Approach&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;kullback-leibler-divergence-approach&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#kullback-leibler-divergence-approach&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;It makes sense biologically that the k-mers don&amp;rsquo;t have uniform
frequency. In the case of read contaminants, the enrichment by
position against an empirical k-mer distribution may be as interesting
as total k-mer enrichment against a random distribution model.&lt;/p&gt;
&lt;p&gt;To assess this, some beta &lt;code&gt;qrqc&lt;/code&gt; code pools k-mer counts across
position to find an empirical k-mer distribution. Then, the k-mer
distribution per position is compared to the pooled distribution using
the
&lt;a href=&#34;http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence&#34;&gt;Kullback-Leibler divergence&lt;/a&gt;
. K-L
divergence is only defined when both distributions sum to 1, the
sample spaces are the same, and $q(i) &amp;gt; 0$ for every $i$ such that $p(i) &amp;gt;
0$.&lt;/p&gt;
&lt;p&gt;In essence, the K-L divergence measures the average number of extra bits
needed to encode data from &lt;em&gt;P&lt;/em&gt; with a code based on the distribution
of &lt;em&gt;Q&lt;/em&gt;. In the k-mer case, &lt;em&gt;Q&lt;/em&gt; is the empirical distribution of k-mers
irrespective of k-mer position and &lt;em&gt;P&lt;/em&gt; is the position-specific
distribution of k-mers. Thus, an enrichment of k-mers at a particular
position would lead to more divergence.&lt;/p&gt;
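&lt;p&gt;In symbols, the quantity is $D(P \| Q) = \sum_i p(i) \log_2 \frac{p(i)}{q(i)}$. The Python sketch below computes it per position against a pooled distribution. It is not the &lt;code&gt;qrqc&lt;/code&gt; code: the function name &lt;code&gt;kl_divergence_bits&lt;/code&gt; and the per-position counts are hypothetical.&lt;/p&gt;

```python
import math
from collections import Counter


def kl_divergence_bits(p_counts, q_counts):
    """D(P || Q) in bits, where P and Q are given as dicts of k-mer counts."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for kmer, n in p_counts.items():
        if n == 0:
            continue  # terms with p(i) = 0 contribute nothing
        p = n / p_total
        q = q_counts[kmer] / q_total  # must be positive wherever p is
        total += p * math.log2(p / q)
    return total


# Toy per-position counts (hypothetical): position 0 is dominated by one k-mer.
by_position = {
    0: Counter({"AAGCAG": 8, "TTGACC": 1, "GGCATA": 1}),
    1: Counter({"AAGCAG": 3, "TTGACC": 4, "GGCATA": 3}),
    2: Counter({"AAGCAG": 3, "TTGACC": 3, "GGCATA": 4}),
}

# Q: pool counts across positions to get the empirical distribution.
pooled = Counter()
for position_counts in by_position.values():
    pooled.update(position_counts)

divergence = {pos: kl_divergence_bits(c, pooled) for pos, c in by_position.items()}
```

&lt;p&gt;Because &lt;em&gt;Q&lt;/em&gt; is pooled from the same counts that make up each &lt;em&gt;P&lt;/em&gt;, every k-mer seen at a position also has a positive pooled frequency, so the divergence is always defined. In this toy example position 0 diverges most from the pooled distribution.&lt;/p&gt;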
&lt;p&gt;A nice feature of &lt;code&gt;ggplot2&lt;/code&gt; is the stacking of the &amp;ldquo;bar&amp;rdquo; geom. Since
K-L divergences are sums, stacking and setting fill color by k-mer
(the terms of the sum) gives us a sense of the total divergence and
each k-mer&amp;rsquo;s effect on the total. There are too many k-mers to plot,
so I have some procedures that find a nice subset. Because this is a
subset, the K-L total (indicated by bar height) &lt;strong&gt;is wrong&lt;/strong&gt;, but the
graphical interpretation is easier. Now, the enrichment by position is
clear:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vincebuffalo.com/images/kl_kmer.png&#34; alt=&#34;Kullback-Leibler divergence of k-mers&#34; data-custom-hook=&#34;true&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;This messy dataset has repeat primer contamination. Note that because
we&amp;rsquo;re plotting a &lt;em&gt;subset&lt;/em&gt; of k-mers, there is negative total K-L (not
mathematically possible) because we&amp;rsquo;re leaving out terms in the sum,
but the meaning still comes through. Also note that there is k-mer
nesting: The first wide peak begins with k-mer &lt;code&gt;TATCAA&lt;/code&gt;, then
&lt;code&gt;ATCAAC&lt;/code&gt;, then &lt;code&gt;TCAACG&lt;/code&gt;, etc. This indicates that we could adjust k
and find the entire repeated k-mer.&lt;/p&gt;
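&lt;p&gt;The negative bar totals come from the individual terms of the sum: a k-mer that is depleted at a position relative to the pooled distribution contributes a negative term, and only the full sum is guaranteed to be non-negative. Here is a minimal Python sketch with made-up probabilities (the names and numbers are illustrative assumptions):&lt;/p&gt;

```python
import math


def kl_terms_bits(p, q):
    """Per-outcome terms p(i) * log2(p(i) / q(i)) of D(P || Q), for dicts of probabilities."""
    return {i: p[i] * math.log2(p[i] / q[i]) for i in p if p[i] > 0}


p = {"AAGCAG": 0.8, "TTGACC": 0.1, "GGCATA": 0.1}  # position-specific (toy)
q = {"AAGCAG": 0.4, "TTGACC": 0.3, "GGCATA": 0.3}  # pooled (toy)

terms = kl_terms_bits(p, q)
full_total = sum(terms.values())                  # sum over every k-mer
subset_total = terms["TTGACC"] + terms["GGCATA"]  # sum over depleted k-mers only
```

&lt;p&gt;Here the full sum is positive, but plotting only the two depleted k-mers would give a negative bar, just as in the figure.&lt;/p&gt;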
&lt;h2&gt;Update&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;update&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#update&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve added faceting of multiple &lt;code&gt;SequenceSummary&lt;/code&gt; objects&amp;rsquo; KL/k-mer
diagnostics. Combined with a random data file, this really illustrates
contamination:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://vincebuffalo.com/images/large_facet_kl.png&#34;&gt;&lt;img src=&#34;https://vincebuffalo.com/images/tiny_facet_kl.png&#34;/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is still in development; follow the code on
&lt;a href=&#34;http://github.com/vsbuffalo/qrqc&#34;&gt;Github&lt;/a&gt;
 and feel free to contact me
and make suggestions.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Please developers, don&#39;t be dicks.</title>
      <link>https://vincebuffalo.com/blog/please-developers-dont-be-dicks/</link>
      <pubDate>Tue, 21 Feb 2012 00:00:00 +0000</pubDate>
      
      <guid>https://vincebuffalo.com/blog/please-developers-dont-be-dicks/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;http://www.flickr.com/photos/mclapics/6121420214/&#34;
title=&#34;Science is great, open it (open science) by mclapics, on
Flickr&#34;&gt;&lt;img
src=&#34;http://farm7.staticflickr.com/6188/6121420214_7f4fe7200a.jpg&#34;
width=&#34;500&#34; height=&#34;333&#34; alt=&#34;Science is great, open it (open
science)&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;Please developers, don&amp;rsquo;t be dicks.&lt;/h1&gt;&lt;p&gt;As the author of a few open source tools, I&amp;rsquo;ve had my fair share of
users seeking help. Emails range from the very useful (bug reports,
patches, etc) to the annoying (&amp;ldquo;can you help guide me through this
process&amp;rdquo;). But never once (that I can remember) have I been a dick
(and yes, I&amp;rsquo;ve wanted to be). It will be tricky to write this without
sounding self-righteous, but I hope to make the case that open source
developers should never be dicks.&lt;/p&gt;
&lt;h2&gt;We&amp;rsquo;ve All Been There (WABT)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;weve-all-been-there-wabt&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#weve-all-been-there-wabt&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The first reason to never be a dick is that We&amp;rsquo;ve All Been There (I&amp;rsquo;m
going to give this the acronym WABT). Even the most voracious and
diligent manual readers can suffer from the &lt;a href=&#34;http://www.perlmonks.org/?node_id=542341&#34;&gt;XY
problem&lt;/a&gt;
. A user comes
asking how to do Y, which they think is the solution to X. However,
it&amp;rsquo;s a bad solution to X and they don&amp;rsquo;t know this. These situations
will always lead to frustration: users waste time explaining Y and
helpers waste time explaining how to do Y, only to realize the user wanted
X. But this is not the user&amp;rsquo;s fault; it just takes programming
practice to realize Y is not the correct way to do X.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve all had these problems in our early stages as developers. Being
a dick in these cases will not help the user grok anything. They&amp;rsquo;re
already frustrated - that&amp;rsquo;s why they&amp;rsquo;re asking for your help. Being a
dick will cause them to get more frustrated and &lt;em&gt;really&lt;/em&gt; not grasp
anything. They&amp;rsquo;re not going to have an &amp;ldquo;ah ha!&amp;rdquo; moment when they&amp;rsquo;re
too busy trying to come up with a witty response to your burn on IRC.&lt;/p&gt;
&lt;h2&gt;PCTM has the same number of letters as RTFM&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;pctm-has-the-same-number-of-letters-as-rtfm&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#pctm-has-the-same-number-of-letters-as-rtfm&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Please Check The Manual (PCTM) has the same number of letters as Read
The Fucking Manual (RTFM). I strongly believe it takes more energy for
a developer to be a dick than to be nice. We&amp;rsquo;ve all had dumb questions
that disrupt our workflow, make us angry, etc. But being a dick back
does not discourage this behavior. Write some boilerplate text for
responding to users&amp;rsquo; questions. Make this a FAQ. Then respond, PCTM
(Please Check the Manual) and send them the link. If they get needy,
tell them open source software doesn&amp;rsquo;t come with a warranty.&lt;/p&gt;
&lt;h2&gt;People remember dicks&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;people-remember-dicks&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#people-remember-dicks&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Someone was once quite rude to me via email (I&amp;rsquo;ll name him Tom). I had
voiced some frustrations with software Tom wrote and he attacked me
for these public comments. Now, as an aside, there&amp;rsquo;s a lot of shitty
software out there, and signals about software quality (even noisy
signals) are &lt;em&gt;very valuable&lt;/em&gt;. Tom on one hand attacked me for saying
something negative about his software, and on the other hand asked me
to help fix it, emphasizing it was open source software. I agree with
this sentiment 100%; however, the email was clearly very angry.&lt;/p&gt;
&lt;p&gt;I told another developer who I&amp;rsquo;ll call Jerry about the encounter, and
he laughed. Apparently, Tom nagged Jerry about portability issues of
Jerry&amp;rsquo;s software years ago. This is evidence of my first point,
WABT. It also shows that developers remember interactions with other
developers &lt;em&gt;really&lt;/em&gt; well. Since then, I&amp;rsquo;ve also heard other
programmers complaining about interactions with Tom. This is all too
bad, as Tom is probably very nice in person and certainly a good
programmer.&lt;/p&gt;
&lt;h2&gt;If you&amp;rsquo;re a dick, you&amp;rsquo;re hurting OSS&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;if-youre-a-dick-youre-hurting-oss&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#if-youre-a-dick-youre-hurting-oss&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;OSS has seen an explosion in recent years. Biologists, ecologists, and
social scientists that never thought they&amp;rsquo;d write code are using R to
analyze data. Folks frustrated by Windows are installing Ubuntu and
asking for help. In the early days of OSS, Usenet, and IRC, it was
an acceptable norm to be a dick. Now, it&amp;rsquo;s not.&lt;/p&gt;
&lt;p&gt;OSS benefits from a large user base, but it will have growing
pains. Being a dick does not alleviate these pains, it makes them
worse. Let&amp;rsquo;s go back to my story about Tom.&lt;/p&gt;
&lt;p&gt;In the second half of Tom&amp;rsquo;s email (after attacking me), he asked me to
help him fix his software. Now, collaboration can be difficult; code
style clashes, merges fail, frustration is common. In a small project,
you&amp;rsquo;re really in bed with your collaborators. Now that Tom has sent me
the signal he&amp;rsquo;s nasty in correspondences, do you think I&amp;rsquo;ll work on
this project with him? Hell no. I&amp;rsquo;d rather fork, fix the problem and
encourage others to use my software. Of course this is bad for OSS;
consider this passage from Eric S. Raymond&amp;rsquo;s &lt;a href=&#34;http://catb.org/jargon/html/F/forked.html&#34;&gt;Jargon
File&lt;/a&gt;
:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;Forking is considered a Bad Thing - not merely because it implies a
lot of wasted effort in the future, but because forks tend to be
accompanied by a great deal of strife and acrimony between the
successor groups over issues of legitimacy, succession, and design
direction. There is serious social pressure against forking.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Tom&amp;rsquo;s actions guarantee I will avoid working on his projects at all
costs. The two other developers, and anyone else we&amp;rsquo;ve told, will
too. In the end, the software loses.&lt;/p&gt;
&lt;h2&gt;Idolize programmers, not their dickishness&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;idolize-programmers-not-their-dickishness&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#idolize-programmers-not-their-dickishness&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Some abrasive programmers are really gifted. Erik Naggum is regarded
as the &lt;a href=&#34;http://en.wikipedia.org/wiki/Erik_Naggum#Controversy&#34;&gt;first Usenet
flamer&lt;/a&gt;. &lt;a href=&#34;http://en.wikipedia.org/wiki/Theo_de_raadt&#34;&gt;Theo
de Raadt&lt;/a&gt; forked NetBSD
into what became OpenBSD partially because of issues with other
developers. Richard Stallman gave an AMA on reddit a year ago, and the
&lt;a href=&#34;http://www.reddit.com/r/gnu/comments/c8rrk/rms_ama/&#34;&gt;most popular
question&lt;/a&gt; (since
deleted) was about a young GNU-lover who was nervous about asking RMS
a question and accidentally referred to the operating system as
&amp;ldquo;Linux&amp;rdquo;, not &amp;ldquo;GNU/Linux&amp;rdquo;; RMS ripped him in half.&lt;/p&gt;
&lt;p&gt;Now, all of these developers have been dickish, and all are well known
because they are gifted visionaries. I&amp;rsquo;m not sure why, but other
developers admire this dickishness. Don&amp;rsquo;t idolize their
dickishness; idolize their skill. There are also overwhelmingly nice
programmers:
&lt;a href=&#34;http://en.wikipedia.org/wiki/John_McCarthy_%28computer_scientist%29&#34;&gt;John McCarthy&lt;/a&gt;,
&lt;a href=&#34;http://en.wikipedia.org/wiki/Donald_Knuth&#34;&gt;Donald Knuth&lt;/a&gt;, and
&lt;a href=&#34;http://en.wikipedia.org/wiki/Alan_Turing&#34;&gt;Alan Turing&lt;/a&gt;, to name a
few. Admire their skill &lt;em&gt;and&lt;/em&gt; their personality.&lt;/p&gt;
&lt;h2&gt;Being a dick hurts science&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;being-a-dick-hurts-science&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#being-a-dick-hurts-science&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;There&amp;rsquo;s been an explosion in the use of open source software in the
sciences. My field, bioinformatics, provides an interesting case
study. There are bioinformaticians like me who write
software. Users are divided into other programmer types (other
bioinformaticians) and biologists (on average, less knowledgeable about
programming). All else equal, biologists and bioinformaticians prefer
free, open source software to costly proprietary software.&lt;/p&gt;
&lt;p&gt;For these reasons, being helpful and nice to scientific users is
really important. For biologists, choosing tools is about getting
analysis done quickly and easily. Rude bioinformaticians will quickly
increase the cost of using OSS tools, which is already high for many
biologists who aren&amp;rsquo;t experienced with Unix tools and
programming. Consequently, science could become less open, something
neither group wants.&lt;/p&gt;

      </description>
    </item>
    
  </channel>
</rss>
