I’m a biochemist without proper training in programming, but I still enjoy writing small Python scripts to automate everyday tasks.
My first encounter with a programming language came right after finishing my bachelor's degree in Vietnam. At that time, my code consisted of short scripts to download DNA sequences from a database and parse the information into text files. It was a magical experience to see the output printed on the terminal line by line.
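Those early scripts were little more than fetching a FASTA file and splitting it into individual records. A minimal sketch of that kind of parsing (the function name and record layout are my own illustration, not the original script):

```python
def parse_fasta(text):
    """Split FASTA-formatted text into {header: sequence} records."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the sequence lines of each record into one string.
    return {h: "".join(seq) for h, seq in records.items()}
```

Each record could then be written to its own text file, one per sequence.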
Ten years later, I was working as a researcher in the INBioPharm project at SINTEF, where I could fully embrace and appreciate the power of coding to automate our data handling and analysis workflow. This was especially important when dealing with a large dataset containing more than 1200 genomes.
One aim of the project was to discover new biosynthetic gene clusters (BGCs) responsible for the synthesis of potential new antibiotics. The data came from an actinobacteria strain collection sequenced at SINTEF. We planned to use genome-mining software to identify the BGCs.
It would not be feasible to upload more than 1200 genomes to the web interface and wait for the public server to perform the analysis. After some troubleshooting and tweaking, we managed to run the analysis on a local computer with a set of around 100 genomes.
It would take too much time to analyze the rest of the genomes with our semi-automatic workflow at the time. The post-analysis process would be even more challenging: we needed an efficient way to look through the approximately 30,000 BGCs and their accompanying metadata.
For that, I went to work at Wageningen University in The Netherlands for three weeks. There I learned to use Python to mine the data, as well as how to work with a High Performance Computing (HPC) platform for parallel computing.
I wrote my first small Python script, a simple code to estimate the completeness of the BGCs. For me, it was an "eye-opening" feeling to see the potential for Python scripts to automate the analysis of a large dataset, both the input and the output.
A few practical lessons I picked up along the way:
1) When running analyses that take a long time, a terminal multiplexer such as tmux is very efficient, since jobs keep running after you disconnect.
2) It is useful to export results in CSV format, so they can be fed automatically into the next analysis in the workflow or imported into a database.
3) The book Automate the Boring Stuff with Python is an entertaining starting point.
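The CSV tip in particular is easy to put into practice with Python's standard csv module. A small sketch, with hypothetical result rows standing in for real BGC predictions:

```python
import csv

# Hypothetical results: one row per predicted BGC.
rows = [
    {"strain": "S001", "bgc_type": "NRPS", "length_kb": 42.5},
    {"strain": "S002", "bgc_type": "PKS-I", "length_kb": 58.1},
]

# Write the results to a CSV file that downstream tools can consume.
with open("bgc_results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["strain", "bgc_type", "length_kb"])
    writer.writeheader()
    writer.writerows(rows)

# The next step in the workflow (or a database import) can read it back.
with open("bgc_results.csv", newline="") as fh:
    loaded = list(csv.DictReader(fh))
```

Note that DictReader returns every field as a string, so numeric columns need to be converted back explicitly.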
When I came back to Trondheim, we used the tools I had helped develop at Wageningen and ran them on the HPC platform at SINTEF.
The analysis of more than 1000 genomes took only a couple of weeks instead of months. We also managed to organize the results in a more efficient way.
A small Python script took care of something as simple as copying and renaming the predicted BGCs to reflect the BGC type and the original strain, which made it much easier to filter the results and perform the downstream analysis.
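That copy-and-rename step can be sketched in a few lines with pathlib and shutil. The directory layout, file extension, and naming scheme below are my assumptions for illustration, not the project's actual conventions:

```python
import shutil
from pathlib import Path

def collect_bgcs(src_dir, dest_dir):
    """Copy predicted BGC files into one folder, renaming each file to
    include the strain (taken from its parent folder) and the BGC type
    (taken from the file stem), e.g. 'S001__nrps.gbk'.

    Assumed layout: src_dir/<strain>/<bgc_type>.gbk
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for bgc_file in Path(src_dir).glob("*/*.gbk"):
        strain = bgc_file.parent.name
        new_name = f"{strain}__{bgc_file.stem}.gbk"
        shutil.copy2(bgc_file, dest / new_name)
    return sorted(p.name for p in dest.iterdir())
```

With every file named `<strain>__<type>.gbk` in a single folder, filtering by strain or BGC type becomes a simple glob pattern.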
The experience of using Python to automate analysis workflows in the INBioPharm project has been very helpful in another, similar project called OXYMOD, where we use bioinformatics to discover new enzymes for the forestry industry.
Even though the tools in the OXYMOD project are different, it has been quite straightforward to adapt the workflow from the INBioPharm project. Through curiosity and trial and error, I have seen the huge potential of applying more advanced approaches to the analysis of biological data.