Note: I’ve been writing this essay in my head for a few months, but I felt it needed to be completed and released after the sad loss of Open Access advocate Aaron Swartz, a hacker and activist I admired.

21st Century Science and the Need for Open Data and Open Tools

Open Science rests upon three core principles: open access, open data, and open tools. However, the phrase “open science” doesn’t really convey how essential this philosophy is; openness is perceived as a nice, but not strictly necessary, trait of scientific projects. Yet without open access, open data, and open tools, science is not merely closed; it is not reproducible.

This dependence is a new phenomenon, too: hundreds of years ago, a scientific article was sufficient to reproduce an experiment. Its methods could be followed to recollect the data, build or acquire the necessary tools, and recreate the output. I believe this is why the Open Access movement has played such a large role in the Open Science movement. People readily accept that reproducibility is a goal of science, and one cannot reproduce a scientific finding without access to scholarly articles.

However, science is undergoing a transition that makes the scholarly article alone insufficient to reproduce an experiment. As I’ll argue, Open Science advocates need to quickly start focusing more energy on open tools and open data.

Capital-Intensive Science and the Openness of Data

With large-scale projects like the LHC, new sequencing centers, and the Mars Curiosity rover, it’s clear that some parts of science are becoming increasingly capital intensive. From my perspective in the genomics world, it’s even more apparent: small labs are seeking large grants to carry out sequencing efforts. Staggering levels of capital are needed to invest in expensive and quickly-evolving machines, whether sequencer, collider, or GC-MS. Besides their prohibitive costs, these machines share another trait: they create a lot of data.

But the beauty of this new data is that it’s practically free to copy and use. Science labs in the developing world may be decades away from having state-of-the-art sequencers, GC-MSs, or colliders, but they can benefit from the presence of such technology in Japan, the USA, or Switzerland through access to the data produced, as long as the data is open. In essence, the beauty of big data is not only the information it contains, the complexity it can untangle, or the insights it can foster. It’s that it can be endlessly duplicated and shared, and that any scarcity is entirely artificial. As long as big data remain open, the information created through capital-intensive science can be widely used and studied by all.
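To make this concrete: any lab with an internet connection can already pull open sequence data programmatically. Here is a minimal sketch using Biopython’s Entrez interface; the contact email is a placeholder, and the phiX174 genome (accession NC_001422) is just an arbitrary small public record chosen for illustration.

```python
# Minimal sketch: fetching a public sequence record from NCBI.
# Assumes Biopython is installed (pip install biopython); the email
# and accession below are illustrative placeholders.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI requires a contact address

# Download the phiX174 genome as FASTA and parse it.
handle = Entrez.efetch(db="nucleotide", id="NC_001422",
                       rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print(record.id, len(record.seq))  # accession and sequence length
```

No sequencer, and no capital beyond a commodity computer, is needed to start working with these data.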

Furthermore, the new capital-intensive science could be creating a new era of impossible reproducibility. Small labs cannot afford to resequence samples just to recreate data; this is why we see the growth of sequence repositories. As data becomes larger and begins coming from more sources, there will be technical problems and cost issues in maintaining public repositories of open data (such issues have already played out with the Short Read Archive). Open Science must be ready for such battles.

The Importance of Open Source Software

While large new scientific machinery illustrates the new capital-intensive side of science, the growth of bioinformatics, data science, and scientific programming illustrates the other side: labor-intensive science with a high degree of specialization. The science of past centuries was characterized by a single scientist, or a few, carrying out an experiment, analyzing the results, and publishing them. However, the new scientific challenges we face require a degree of task specialization that is a break from the past. The need to model and understand more complex relationships requires not just bench scientists to organize and run experiments, but statisticians to consult, systems administrators to set up and maintain large computing infrastructure, and programmers who can write and maintain large codebases.

Yet as with big data, the software tools created to analyze data can be duplicated, reused, and adapted at no cost, as long as they are open source. Open Science must strongly advocate for the use of open source tools, and also for the development of open source alternatives to proprietary tools.
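As a toy illustration (entirely hypothetical, not drawn from any particular project), an open analysis function published by one group can be extended by another with no licensing negotiation in between:

```python
# Hypothetical example of open source reuse; all names are invented.

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (as lab A might publish it)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq: str, window: int = 100) -> list:
    """Lab B's extension: GC content over non-overlapping windows."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]

print(gc_content("ATGCGC"))  # 0.666... for this six-base toy sequence
```

Had lab A’s function been locked inside a proprietary package, lab B would have had to license it or reimplement it from scratch.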

A further challenge persists, too: open source software communities are almost always fragmented. Unlike proprietary software, in which a company develops, releases, and works on a single product, the open source community frequently recreates software rather than extending or improving existing software. Even worse, while a company has managers and hierarchy to maintain continuity in the functionality, documentation, and quality of a software product even as individual developers leave, open source software projects frequently decay as lead developers switch groups or funding runs short.

Adjusting Our Thinking

Open Access is an ongoing battle, but also a point of pride for Open Science advocates. The success of PLoS and the increasing number of open access journals are signs that the Open Science movement is gaining wide support. Yet the science of the 21st century, characterized by a capital- and labor-intensive production process, requires further advocacy for open data and open tools.