As a physics student at Princeton University, Assistant Professor Carl Boettiger had no intention of working with computers: he says he was essentially “pulled in by accident.” During his time as an undergraduate, Boettiger gradually broadened his interests from physics, which he believed to be the type of science done solely on a chalkboard, to biophysics and ecology. Today, Boettiger is on the faculty of the Department of Environmental Science, Policy and Management at UC Berkeley, working as a theoretical ecologist in the field of ecoinformatics, the science of information in ecology.
Boettiger was first introduced to the world of data science and computation while pursuing a Ph.D. in Population Biology at UC Davis. Although he drew inspiration from his professors’ and fellow students’ computational knowledge, Boettiger believes it was a Computational Science Graduate Fellowship granted by the Department of Energy (DOE) that truly ignited his interest in data science. This grant did more than simply provide money for his research; it also required him to enroll in basic computing classes that he called “eye-opening.” Through these classes, Boettiger realized he could combine his interest in ecology (and ecological forecasting specifically) with data science. When he first began working in his field, researchers relied on what Boettiger calls “small data”: individual researchers’ data that was rarely shared beyond the scientist’s home university or library. More recently, technologies like satellites, sensor networks, and drones have enabled automated collection of large data sets. Few resources are available to help researchers bridge the gap between these two categories, yet the possibility of relating “big data” sets to individual projects conducted over the past century or two promises ecological forecasters increased accuracy and new predictive capabilities.
In November 2010, Boettiger opened up his first notebook: an easily shared browser-based application capable of storing and executing live code, equations, and visualizations. Today, Boettiger uses Jupyter and RStudio notebooks to show other researchers the data he is currently analyzing, how he is cleaning and relating disparate data sets, and the code he uses for his analyses, giving them the ability to reproduce his research and use his methods in their own analyses. The ability to show and share live data sets and computational methods allows Boettiger and his colleagues to more quickly and easily broaden the scope of diverse yet relatable data sets, building on each other’s work and maintaining a clear, reproducible record of their methods.
In his own work, Boettiger mostly focuses on ecosystem regime changes, which are shifts in ecological communities in response to a warming climate. Boettiger and other scientists use data models to predict these changes and are constantly refining their models to increase the accuracy of predictions. Examining questions of why and how a regime, such as a fishery, could have collapsed, and then utilizing data science to help reach a conclusion, is just one of the many applications of Boettiger’s data models. Developing computational models that can accurately predict and analyze such regime shifts requires Boettiger to bridge the gap between individual researchers’ “small data” and large data sets gathered through new technologies.
Using GitHub repositories to publicly share data and code in executable notebooks advances two of Boettiger’s goals: making data accessible and making it reproducible. Compiling data he has painstakingly gathered, analyzed, cleaned, and normalized, then making it readily available alongside “tools [that] are useful and [ways] to apply them,” opens opportunities for himself and his colleagues to determine how to tackle current, developing, and urgent ecological issues. Curated datasets of these types can take up to 400 hours to create. Therefore, placing such datasets in public repositories can save researchers significant time and effort by enabling them to spend more time on new analysis rather than redoing data preparation steps already performed and documented by others like Boettiger. To verify the integrity of a dataset, Boettiger has a student or postdoc regenerate it using the code shared in notebooks and compare the result to the version published on GitHub. This way Boettiger can prove the reproducibility of his own data processing methods, while preserving the history of his work.
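The regenerate-and-compare check described above can be illustrated with a minimal Python sketch. It is not Boettiger's actual procedure, which the article does not detail; it assumes the regenerated and published datasets are single files on disk, and the function names `file_sha256` and `datasets_match` are hypothetical, chosen only for this illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def datasets_match(regenerated_path, published_path):
    """Return True when the two files are byte-for-byte identical.

    Hypothetical helper: compares a freshly regenerated dataset with
    the published copy by comparing their SHA-256 digests.
    """
    return file_sha256(regenerated_path) == file_sha256(published_path)
```

A byte-level comparison like this is deliberately strict: a change in row order or number formatting would be reported as a mismatch even if the underlying values agree, so a real workflow might instead compare the parsed tables.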
In general, Boettiger believes Jupyter and RStudio notebooks are most beneficial for their ability to be run on a wide variety of computational platforms. To facilitate the “mobility of compute” of his notebooks, Boettiger encourages researchers to run them in Docker containers that he has developed, in which all software libraries required by his notebooks are pre-installed. In this way, researchers need not reproduce laborious software installation procedures that are often difficult or impossible to duplicate on different hardware or different operating systems.
Considering Boettiger had no experience with computing and believed computing to be an inessential part of his scientific work little more than a decade ago, his career is a clear example of how data science and its resources can benefit a researcher. Since 2010, Boettiger has posted over 1,002 notebooks. Going forward, Boettiger believes there is still a lot of work to be done in both the ecoinformatics and data science fields, yet thinks it is essential to take advantage of data science methods to deal with messy, large data, freeing researchers and allowing them to focus on analysis and reproducibility.
For more information on Prof. Boettiger and his research please visit his website. For more details on how Berkeley Research Computing can facilitate access to Jupyter notebooks, containerized computational environments, and computational resources from virtual machines to the cloud to high performance computing, email research-it@berkeley.edu.