Challenges facing a computational biologist in a core facility
Are you struggling to stay productive because of too many projects in vastly different research fields?
I did not know what to expect when I landed a job at a core facility, one that helps researchers at a large public university with anything and everything related to scientific computing. Within months I had accumulated a dozen projects, some in scientific disciplines very different from the one I trained for, and I quickly became unproductive in all but a few of them. The problem was not the amount of work but the time needed to switch between projects, made worse by the fact that I had no well-thought-out system for organizing them.
I have now worked on over a hundred different projects and have identified a few key challenges:
- Almost all projects are intermittent. Each delivered report is followed by days, weeks, or months of silence while the project partner verifies the most exciting discoveries. The challenge is that it takes time to get going again in a project that has been mothballed for a long time, because the details have been forgotten.
- The technical requirements vary widely from project to project in terms of software to be installed, data volume, length of calculations, and complexity of the input data or reports. Since one size does not fit all, solutions have to be designed to be flexible.
- Terminology, and the meaning of specific words, differs drastically between disciplines, making communication challenging. The best solution I have found is to borrow ideas from agile software development and start delivering reports to stakeholders as soon as possible; good reports expose misunderstandings quickly.
- Many project partners face tight deadlines for grants, papers, or conference abstracts, which creates constant pressure to deliver. This is always challenging, but solid solutions to the three previous points take the edge off.
Over the years, I have developed strategies to meet these challenges. Below I share some of the lessons learned, hoping that somebody else will find them helpful.
The four most critical aspects of these projects are:
- Keeping detailed, time-stamped documentation covering each project phase, including short summaries of meetings and of essential communication by e-mail or other means. Done right, this makes picking up the pieces after a long break much quicker than digging through old e-mails and reports.
- Managing data and metadata: keeping track of received, sent, and downloaded data, together with the relevant metadata, is the cornerstone of each project. It still surprises me how often there are inconsistencies in how the data map to the experimental design.
- Managing software and software containers in an automated and documented way. Without proper tools, it is easy to end up with a hot mess in which it becomes unnecessarily hard to explain why results from early reports are no longer present at a later time point. Linking every computational result to the exact versions of the software and databases used goes a long way.
- Developing a standard way of reporting results to minimize miscommunication. I have also found that many project partners find the data processing part too abstract to ascribe much value to it, even though it is almost always the most time-consuming part. Good reports are universally appreciated, so it is worth investing in this part early.
I cover each aspect in more detail below.
Documentation
Documenting each step of a project is the easiest of these aspects to understand, yet the hardest to implement. In my experience, the best documentation system has the following properties:
- It should be quick to open the documentation file and add an entry.
- It should be compatible with git or other version control systems.
- It should allow for the integration of mixed media, such as images, videos, or complex, interactive graphs.
Meeting all these requirements is not trivial. The best solution I have found is to write the documentation in markdown using a text editor; this is fast and works well with version control. I often split the documentation across multiple markdown files: in the simplest case, one file tracks the current state of the project and another the completed tasks, although many projects require far more files. To keep the documentation manageable and meet the third point, I like mkdocs, a Python-based tool that creates a static website from the markdown files. I do not have the space to cover all of its features here, so I will point out a few of my favorites:
- It creates a static set of HTML pages with a dynamic search function that shows suggestions after only a few keystrokes.
- It can monitor changes to the markdown files and update the site automatically.
I prefer the mkdocs-material theme, as the pages are clean with a nice layout. Naturally, I also use mkdocs to write my Medium articles.
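To show how low the barrier to adding an entry can be, here is a minimal sketch of a helper for the first property above. The `docs/journal.md` path and the helper itself are my own illustration, assuming an mkdocs-style project layout; they are not part of mkdocs.

```python
# Minimal, hypothetical helper: append a time-stamped markdown entry to a
# project journal that mkdocs will pick up as part of the documentation site.
from datetime import datetime
from pathlib import Path

JOURNAL = Path("docs/journal.md")  # assumes an mkdocs-style docs/ folder


def add_entry(text: str) -> None:
    """Append a time-stamped entry to the markdown journal."""
    JOURNAL.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with JOURNAL.open("a", encoding="utf-8") as fh:
        fh.write(f"\n## {stamp}\n\n{text}\n")


if __name__ == "__main__":
    add_entry("Meeting with project partner: re-run QC on the latest batch.")
```

Because the journal is plain markdown, the same entry ends up rendered on the mkdocs site, searchable, and versioned in git.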
Managing data and metadata
As the cost of generating data drops, experimental designs become more extensive and complex. It is not uncommon for a single project to sequence the genomes of hundreds of individuals, or for a mass spectrometry project to measure hundreds of samples. Not only are the data volumes high, but there may also be complex relationships among the measured samples, such as one individual being the parent or sibling of another, or blood samples being taken from the same individual at different time points. I always use a proper data manager to build a metadata model that keeps track of individuals, relationships, and other important information, and I explicitly link the measured data to this model, so there is no doubt about who or what a given dataset belongs to. There are many great data management tools, but my favorite is openBIS, for several reasons; here are some important ones:
- A powerful API and a Python package, pybis, for interacting with openBIS programmatically.
- A permissive license and long-term support from ETH Zurich.
- The data is immutable, which means that a dataset cannot change once it is registered.
The key feature, of course, is the ability to query the data manager and copy the needed data to wherever it is required. In most cases, I also copy the results of any calculation back into the data manager and link the derived dataset directly to its input data, which gradually builds up a mesh of input data and several levels of derived data.
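To make the query-and-copy step concrete, here is a minimal pybis sketch. The server URL, credentials, identifiers, and dataset types are placeholders, and keyword arguments can differ between pybis versions, so treat it as an outline rather than a drop-in script.

```python
# Sketch only: placeholder server, credentials, identifiers, and dataset types.
from pybis import Openbis

o = Openbis("https://openbis.example.org")
o.login("my_user", "my_password", save_token=True)

# Query the metadata model: fetch a sample and the datasets attached to it.
sample = o.get_sample("/COHORT_A/PATIENT_017")
raw_datasets = sample.get_datasets()
for ds in raw_datasets:
    print(ds.permId, ds.type)
    ds.download(destination="data/raw")  # copy the data to where it is needed

# Register a derived dataset and link it back to its input data; the exact
# keywords (e.g. 'parents') may vary with your pybis version.
derived = o.new_dataset(
    type="ANALYZED_DATA",
    sample=sample,
    files=["results/qc_report.html"],
    parents=[ds.permId for ds in raw_datasets],
)
derived.save()

o.logout()
```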
Managing software installations
Software is another topic that varies in size and effort between projects. In most projects, no custom software needs to be developed beyond simple scripts; however, most projects depend on a potentially large number of software tools and packages. In my experience, this creates a few challenges:
- We need to keep track of all the software tools used and their versions.
- We need to keep the environment stable as updated core libraries, for example, can cause incompatibility issues.
- We need to be able to move our software installation in case we need more memory or CPU power than is available on our local computer.
Software containers, such as Docker or Singularity, solve this problem by bundling fixed and stable versions of the entire software stack, including the system libraries, on top of the host operating system kernel. People commonly use Dockerfiles or Singularity definition files to build their container images, but I have found this approach a bit limited. Instead, I use Ansible to do the heavy lifting, which immediately gives me access to over 30,000 roles in Ansible Galaxy covering a wide range of topics. I specify roles (essentially installation rules consisting of one or more tasks) and variables that control their behavior. If anything is missing, I add or write another Ansible role and update the image; Ansible's idempotency makes this quick, as only the changes relative to the current state are applied. In addition, I can move the container elsewhere, or convert it to a Singularity container if I have to run on shared infrastructure without root privileges.
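Ansible roles and container recipes are too long to reproduce here, but the version-tracking challenge from the list above can be illustrated with a short, standard-library-only sketch of my own (not part of Ansible, Docker, or Singularity): it writes a manifest of the current Python environment next to the results, so every report can later be tied back to exact versions.

```python
# Sketch: record a time-stamped manifest of the current software environment,
# so results can be linked back to the exact package versions that produced them.
import json
import platform
import sys
from datetime import datetime
from importlib.metadata import distributions

manifest = {
    "created": datetime.now().isoformat(timespec="seconds"),
    "python": sys.version,
    "platform": platform.platform(),
    "packages": sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    ),
}

with open("environment_manifest.json", "w", encoding="utf-8") as fh:
    json.dump(manifest, fh, indent=2)
```

The same idea carries over to container images: generate the manifest inside the container, store it with the report, and register it in the data manager alongside the results.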
Reporting tools
Communicating the results is critical for a successful project, and I have found Jupyter Notebooks to be a great way to accomplish this. It is quick and easy to create a notebook as part of the data analysis. When it is time to send the report, I clean up the notebook and add enough human-readable text in so-called markdown cells for my collaborators to understand what is shown in the cells below. I keep the topic of each notebook narrow and use more than one notebook if needed. If I need to execute a given notebook with different subselections of the data, I use papermill. Finally, I create the full report with Jupyter Book, which lets me add structure to the report and provide background information where needed. If my collaborators have access to the data manager, I add the Jupyter Book to the data manager and link it to all the input data.
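As an illustration of the papermill step, the sketch below runs one parameterized notebook per data subselection; the notebook name and the cohort parameter are placeholders for whatever subsets a real project needs.

```python
# Sketch: execute the same parameterized notebook once per data subselection.
# 'analysis_template.ipynb' needs a cell tagged "parameters" so papermill can
# inject the values below; the cohort names are placeholders.
import papermill as pm

for cohort in ["treated", "control"]:
    pm.execute_notebook(
        "analysis_template.ipynb",
        f"reports/analysis_{cohort}.ipynb",
        parameters={"cohort": cohort},
    )
```

The executed notebooks can then be collected into a single report with `jupyter-book build`, which adds the overall structure and any background sections.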
Conclusion
So there you have it: the four cornerstones of a successful project are documentation, managing data, managing software, and creating reports from the calculated data. The recommendations in this article are based on many years of experience, but I am sure there are many other excellent tools out there, so please share them in the comments; I would love to hear about them and how you use them. As I only had space for a brief overview, I will cover some of these topics in more detail in future articles. If you found a particular topic hard to understand, or of specific interest, let me know in the comments and I will prioritize it. Thanks for staying with me this long; I hope it was worth your time.