How coding opens up scientific data

Algorithm illustration
IMAGE: Spencer Phillips/EMBL-EBI

The increasing importance of code in the biological sciences

Ask most people how scientific discoveries happen and they will probably describe a scientist wearing a white coat, working in a lab with microscopes, pipettes and the odd animal model. While much biological research is still carried out in this kind of environment, coding is becoming just as important to biology as traditional wet-lab science. Bioinformatics allows researchers to analyse vast datasets and track trends or discrepancies in the data in a way that has never been possible before. As datasets grow in size and diversity every day, the tools developed to analyse them are becoming more robust and more useful.

Developing algorithms and application programming interfaces (APIs), which help scientists spot such trends, also plays a key role in bioinformatics research. Here, we explore a few examples of how coding, algorithms and APIs open up scientific data and drive new biological discoveries.

Coding can predict cancer development

Moritz Gerstung is a group leader in cancer data science research at EMBL’s European Bioinformatics Institute (EMBL-EBI). He and his team use code in their work every day to analyse genetic data. Their aim is to understand different types of cancer and, ultimately, help develop improved treatments and cures.

“The human genome is so large – 3 billion base pairs long – it’s 20 times the Encyclopædia Britannica,” explains Gerstung. “As researchers, we need to identify the specific sites in the genome that are mutated in cancer cells. That’s just a tiny fraction of the whole genome – about 0.0002%. We need computational methods to enable us to find these variants; we can’t just point to them manually. And when we ask the more specific question, ‘Are the variants that we’re seeing truly causing cancer?’, it’s code that can help us find the answers by looking at not just one, but many hundreds or thousands of genomes. Comparing genomes and finding regions that are recurrently mutated across many cancers is a good indication that a certain part of the genome is driving cancer development.”
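To put the numbers in the quote in perspective, 0.0002% of 3 billion base pairs is roughly 6,000 positions. The idea of looking for regions that are recurrently mutated across many genomes can be illustrated with a toy sketch. The sample names, coordinates, bin size and the function `recurrent_regions` below are all hypothetical and chosen only for illustration; real analyses rely on statistical models that account for background mutation rates, not on simple counting.

```python
from collections import defaultdict


def recurrent_regions(mutations_by_sample, bin_size=1000, min_samples=2):
    """Count how many samples carry at least one mutation in each genomic bin.

    mutations_by_sample maps a sample name to a list of (chromosome, position)
    tuples. Returns {(chromosome, bin_start): number_of_samples} for bins that
    are mutated in at least `min_samples` different samples.
    """
    samples_per_bin = defaultdict(set)
    for sample, mutations in mutations_by_sample.items():
        for chrom, pos in mutations:
            # Round the position down to the start of its fixed-width bin.
            bin_start = (pos // bin_size) * bin_size
            samples_per_bin[(chrom, bin_start)].add(sample)
    return {
        region: len(samples)
        for region, samples in samples_per_bin.items()
        if len(samples) >= min_samples
    }


# Toy data (invented): three tumour samples with a handful of mutations each.
toy_mutations = {
    "tumour_A": [("chr1", 1500), ("chr2", 40200)],
    "tumour_B": [("chr1", 1720), ("chr3", 900)],
    "tumour_C": [("chr1", 1999), ("chr2", 40950)],
}

hits = recurrent_regions(toy_mutations)
print(hits)
```

In this toy example, the bin starting at position 1000 on chr1 is mutated in all three samples, which is the kind of signal that would prompt a closer look.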

Moritz Gerstung. PHOTO: Mary Todd Bergman/EMBL

Collaborating with the Wellcome Sanger Institute, the University of Cambridge and other international organisations, Gerstung and his team discovered that it’s possible to identify people at high risk of developing acute myeloid leukaemia (AML), an aggressive blood cancer, years before the disease develops.

“What we have shown is that many of the genetic aberrations you find in AML are actually occurring a decade or more before diagnosis, which was previously totally unobserved,” says Gerstung.

By analysing data and a large number of blood samples from the European Prospective Investigation into Cancer and Nutrition (EPIC) study, Gerstung’s team identified a pattern of genetic changes that arises long before AML appears in an individual and that differs from the mutations typically seen in natural ageing.

“We have indications that this genetic lag may also be found in other types of cancers,” says Gerstung, “and this insight may open a window of opportunity for early cancer diagnosis.”

Algorithms show us who we are

It can be a long and complex process to understand the relationships between our genome, environmental factors such as air quality or geographical location, and phenotypic expression – that is, our observable characteristics such as eye colour or height. Until recently, scientists had to come up with a very specific hypothesis to reach a conclusion about how just one environmental factor interacts with genetic variables and impacts our phenotypes.

Oliver Stegle. PHOTO: Jon Mold

Now Oliver Stegle and his group, who study statistical genomics and systems genetics, have developed an algorithm that enables researchers to simultaneously use hundreds of environmental factors to understand genotype–phenotype relationships.

“Now we can analyse everything in one go,” says Stegle, “meaning we can find and identify interplays between genomes, environment and phenotype in a comprehensive manner.”

The algorithm, called the structured linear mixed model (StructLMM), can be applied to human datasets. Among other things, it can provide a finer characterisation of high-risk groups for certain diseases and help identify the most relevant environmental factors.

In the future, this method will offer a more comprehensive way of incorporating environmental influences into genetics studies, and will also increase the number of discoveries of variants whose function depends on environment or lifestyle.

Accessing vital information

EMBL-EBI freely provides more than 200 biological databases to researchers worldwide, and received approximately 58 million data requests per day in 2018. With both the volume of biological data and the number of requests rapidly increasing, algorithms are essential not only for analysing scientific data but also for accessing it.

Youngmi Park, a software engineer and project lead in the web production team at EMBL-EBI, developed EBI Search, a full-text search engine and application programming interface (API). It allows researchers to rapidly access relevant data held in EMBL-EBI databases, either with minimal programming or at the touch of a button. An API is an interface that delivers a request to a source and brings back the requested information, much as a waiter in a restaurant takes your order to the kitchen and brings the food back to your table.
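To show what "minimal programming" might look like, here is a short sketch of building and sending a search request. It assumes the REST base URL `https://www.ebi.ac.uk/ebisearch/ws/rest` and the `query`, `fields`, `size` and `format` parameters as described in the public EBI Search documentation; the domain name `ensembl_gene` and the helper names `build_search_url` and `search` are illustrative choices, so check them against the current documentation before relying on them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URL of the EBI Search REST service (verify against the docs).
EBI_SEARCH_BASE = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_search_url(domain, query, fields=("id",), size=5):
    """Build the URL for a full-text query against one search domain."""
    params = urlencode(
        {
            "query": query,
            "fields": ",".join(fields),
            "size": size,
            "format": "json",
        }
    )
    return f"{EBI_SEARCH_BASE}/{quote(domain)}?{params}"


def search(domain, query, **kwargs):
    """Send the request (the 'order') and return the parsed JSON reply.

    Requires network access; it is not called in this sketch.
    """
    with urlopen(build_search_url(domain, query, **kwargs), timeout=30) as reply:
        return json.load(reply)


# "ensembl_gene" is an assumed domain identifier used for illustration.
search_url = build_search_url("ensembl_gene", "BRCA2")
print(search_url)
```

In the waiter analogy, `build_search_url` writes down the order and `search` carries it to the kitchen and brings back the result.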

“EBI Search equips users with a tool that allows them to search through vast amounts of data,” says Park, who is also one of the four team members dedicated to maintaining and developing the API. “The resources [databases] are then enabled to present the data in a way that is beneficial and customised to their users.”

Youngmi Park. PHOTO: Georgia Hingston/EMBL

Many of EMBL-EBI’s data resources, including Ensembl Genomes and RNAcentral, have integrated Park’s central Search API directly into their systems.

“This is a good strategy for everyone,” says Park. “It allows scientists working on the data resources to spend more time developing other useful tools for their users and to focus on curating data. By applying the EBI Search API to data resources, our scientists can continue focusing on the quality of data rather than the running of software systems, all while users are able to access relevant research data in a single search.”

Not only can EBI Search retrieve requested data, it can also cross-reference information between EMBL-EBI data resources. For example, when a scientist looks up a specific gene in Ensembl – a genome data resource – EBI Search is able to cross-reference that gene with relevant protein sequences, chemical structures, and scientific literature references. Such tools ultimately help to speed up research by giving researchers around the world rapid access to the biological data they need.
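A cross-reference lookup of this kind might be expressed as a single request URL, sketched below. The URL pattern (`/{domain}/entry/{ids}/xref/{target domain}`) is an assumption based on the public EBI Search REST documentation, and the domain names `ensembl_gene` and `uniprot`, as well as the helper `build_xref_url`, are illustrative; verify all of them before use. The identifier ENSG00000139618 is the Ensembl gene ID for human BRCA2.

```python
from urllib.parse import quote

# Assumed base URL of the EBI Search REST service (verify against the docs).
EBI_SEARCH_REST = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_xref_url(source_domain, entry_ids, target_domain):
    """Build a URL asking which entries in target_domain are linked to entry_ids.

    The path layout is an assumption, not a confirmed specification.
    """
    ids = ",".join(quote(entry_id) for entry_id in entry_ids)
    return (
        f"{EBI_SEARCH_REST}/{quote(source_domain)}/entry/{ids}"
        f"/xref/{quote(target_domain)}?format=json"
    )


# From one gene entry to linked protein entries (domain names are assumed).
xref_url = build_xref_url("ensembl_gene", ["ENSG00000139618"], "uniprot")
print(xref_url)
```

Swapping the target domain for a literature or chemistry resource would, under the same assumptions, request the corresponding publications or chemical structures.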