Sharing scientific code: A short introduction to Git and GitHub

Git and GitHub are tools for tracking and sharing code, and they are becoming increasingly important in the life sciences, including plant science. Unsure what Git and GitHub are, and why there’s a fantastical animal (Octocat) as the header image of this post on a botanical blog? Read on for a conversation with Tjelvar Olsson, a scientific computing manager at the John Innes Centre.

Ian Street: Mentions of GitHub occur more and more amongst scientists. For those not in the know (like me), what is it?

Tjelvar Olsson: GitHub is a cloud-based service for hosting source code of open source software. It’s one of the most popular sites for this purpose.

IS: OK, cool! How is source code hosting relevant to biologists?

TO: Two main reasons. First, biology is becoming increasingly data-driven. Sequencing, bioimaging and other techniques are generating vast amounts of data. Extracting biological insight from that data often requires scripted data analysis and computer programs. The source code of these data analysis tools is therefore directly relevant to biologists. Second, science is all about reproducibility. By giving access to the source code, research becomes more (easily) reproducible.

IS: Yes, reproducibility in science is a huge issue, and quite a thorny subject.

TO: Yes it is. A big problem with scripting and programming is that it is very easy to make mistakes. Sometimes this simply breaks the script so that it no longer works, but sometimes there are fundamental flaws that give misleading results. In the latter case one needs to know the exact code used to produce the results.

IS: How can we access other people’s code?

TO: Well, that’s the first problem GitHub solves – it lets people share their code with the world.

IS: Does that solve the problem of knowing how the results were produced?

TO: Not necessarily. Suppose I have used one of my scripts to analyse some data. That data is then used in a manuscript. However, writing is often laborious and takes time. Meanwhile, I have some clever idea that I want to try out. So I edit my script and experiment a bit. At that point the data generated for the manuscript is out of sync with the source code in my script and if I’m not using version control I’m in big trouble.

IS: So version control is about tracking changes in source code?

TO: That’s right. Because the software industry has been facing these issues with reproducibility for a long time, albeit in a slightly different context, tools have been developed to alleviate the problem. Version control is one of these tools. You can think of it as an infinite “undo” button that allows you to go back to previous versions of your code.
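The "infinite undo" idea can be sketched with a toy model. This is only an illustration of the concept, not how Git stores data internally; the class name `SnapshotHistory` and its methods are hypothetical and chosen for this example.

```python
class SnapshotHistory:
    """Toy model of version control: an append-only list of saved versions."""

    def __init__(self):
        # Each entry is a (message, content) pair, oldest first.
        self._snapshots = []

    def commit(self, content, message):
        """Save a snapshot and return its version number (0, 1, 2, ...)."""
        self._snapshots.append((message, content))
        return len(self._snapshots) - 1

    def checkout(self, version):
        """Return the content exactly as it was at the given version."""
        return self._snapshots[version][1]

    def log(self):
        """Return one line per snapshot, oldest first."""
        return [f"{i}: {msg}" for i, (msg, _) in enumerate(self._snapshots)]


history = SnapshotHistory()
manuscript_version = history.commit("threshold = 0.5\n", "Version used for the manuscript")
history.commit("threshold = 0.7\n", "Experiment with a stricter threshold")

# Later edits never overwrite earlier snapshots, so the manuscript version is still there.
print(history.checkout(manuscript_version))
print("\n".join(history.log()))
```

Because snapshots are only ever appended, experimenting with the script cannot destroy the version that produced the published results.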

IS: Like Google Docs keeping entire histories of a document. So GitHub is a version control tool?

TO: Kind of. The version control tool in question is actually called Git. Git was developed by Linus Torvalds, the person who created Linux, the operating system that most bioinformatics software runs on. Git is one of the most popular version control programs, which means it is used by loads of open source projects and software companies. So if you are a scientist wanting to pick up a valuable transferable skill, it is well worth learning how to use it.

IS: How do Git and GitHub fit together?

TO: Git is a means of keeping your source code under version control in what is called a “repository”, basically a directory with files. GitHub allows you to host your repository online, giving access to the source code and all the snapshots stored in its history, i.e. all the versions of the code.

IS: This sounds a bit complicated. Is it difficult to pick up and use Git/GitHub?

TO: No, it is actually very easy. You only need a handful of commands to get started. Okay, so you need to be able to use the command line, but if you are running and/or creating data analysis scripts that is unlikely to be new to you.
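As a rough sketch of what that handful of commands might look like for a first commit, the example below drives Git from Python. The Git subcommands (`init`, `add`, `commit -m`, `log --oneline`) are standard; the file name `analysis.py`, the commit message, the throwaway identity and the helper names are assumptions made for this illustration. In practice you would type the same commands directly at the command line.

```python
import subprocess
from pathlib import Path


def basic_git_commands(filename, message):
    """Return, in order, the commands for recording a first snapshot of one file."""
    return [
        # Create an empty repository in the current directory.
        ["git", "init"],
        # Stage the file so it is included in the next snapshot.
        ["git", "add", filename],
        # Record the snapshot. The -c options set a placeholder identity for
        # this one command only (an assumption, so the sketch runs anywhere).
        ["git", "-c", "user.name=Example", "-c", "user.email=example@example.org",
         "commit", "-m", message],
        # List the snapshots recorded so far, one line each.
        ["git", "log", "--oneline"],
    ]


def run_first_commit(workdir):
    """Write a tiny script into workdir, commit it, and return the log output.

    Requires Git to be installed; raises CalledProcessError if a command fails.
    """
    workdir = Path(workdir)
    script = workdir / "analysis.py"  # hypothetical analysis script
    script.write_text("print('hello')\n")
    output = ""
    for command in basic_git_commands(script.name, "Add analysis script"):
        result = subprocess.run(command, cwd=workdir, check=True,
                                capture_output=True, text=True)
        output = result.stdout  # keep the output of the last command (git log)
    return output
```

Calling `run_first_commit` on an empty scratch directory should return a single log line ending in the commit message.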

IS: How much does it cost to use Git and GitHub?

TO: Git is open source software so it is free. If you are using a Mac it is probably already installed on your computer. GitHub is also free if you share your code with the world. That’s what this is all about! Hosting your scripts on GitHub is also an ideal way of backing up your code.

IS: Can I back up my data on GitHub as well?

TO: No, you shouldn’t do that. GitHub is about sharing and collaborating on your code. It is not a data repository. In fact, it will not allow you to upload files larger than 100 MB.
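One way to avoid surprises is to look for oversized files before adding them to a repository. The sketch below is a minimal check; the function name `files_over_limit` is hypothetical, and the limit is taken from the 100 MB figure quoted above, so treat the exact value as an assumption and check GitHub's documentation for the current one.

```python
import os
from pathlib import Path

# Assumed limit, based on the 100 MB figure quoted in the interview.
SIZE_LIMIT_BYTES = 100 * 1000 * 1000


def files_over_limit(root, limit_bytes=SIZE_LIMIT_BYTES):
    """Return a sorted list of (relative path, size in bytes) for files larger than limit_bytes."""
    root = Path(root)
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own bookkeeping directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((str(path.relative_to(root)), size))
    return sorted(oversized)


if __name__ == "__main__":
    # Report any file in the current directory tree that exceeds the assumed limit.
    for relative_path, size in files_over_limit("."):
        print(f"{relative_path}: {size} bytes")
```

Anything this reports is probably data rather than code, and belongs in a data repository instead.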

IS: Okay, so GitHub is about source code and not about data. However, you just mentioned that GitHub was about collaboration. Could you expand on that?

TO: Of course. So far we have been talking about reproducibility because it is a big concern for scientists and it is one of the problems that Git solves. However, people who develop software also have the problem of collaborating on code. If two people edit the same file, how can those changes be merged together? It is analogous to the problem faced by the main author of a manuscript trying to incorporate the disparate changes provided by their collaborators. Git provides tools for solving these types of problems, and GitHub provides an easy means for people to collaborate on the same project by giving them access to the Git repository wherever they have an internet connection.
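The merging problem can be illustrated with a deliberately simplified three-way merge. It compares two edited copies against their common starting version, line by line. This is a sketch of the decision rule only: it assumes that edits change lines without adding or removing any, which Git's real merge does not require, and the function name `merge_lines` is hypothetical.

```python
def merge_lines(base, ours, theirs):
    """Three-way merge of three equal-length lists of lines.

    Returns (merged, conflicts), where conflicts holds the indices of lines
    that both sides changed in different ways.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles edits that keep the line count unchanged")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            # Both sides agree: the line is unchanged or both made the same edit.
            merged.append(o)
        elif o == b:
            # Only the other person changed this line, so take their version.
            merged.append(t)
        elif t == b:
            # Only we changed this line, so take our version.
            merged.append(o)
        else:
            # Both changed it differently: keep ours for now and flag it for a human.
            merged.append(o)
            conflicts.append(i)
    return merged, conflicts


base = ["load data", "filter reads", "plot results"]
ours = ["load data", "filter reads by quality", "plot results"]
theirs = ["load data", "filter reads", "plot results as PDF"]

merged, conflicts = merge_lines(base, ours, theirs)
print(merged)
print(conflicts)
```

Here the two edits touch different lines, so both are kept and no conflict is reported. Had both people rewritten the same line differently, its index would appear in `conflicts`, and someone would have to choose by hand.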

IS: This all sounds great. What are some good resources for learning about Git?

TO: There is a good article by John D. Blischak, Emily R. Davenport and Greg Wilson in PLOS Computational Biology called “A Quick Introduction to Version Control with Git and GitHub” (10.1371/journal.pcbi.1004668). There is also a free online Git course on Codecademy (Learn Git). My book, Biologist’s Guide to Computing, has chapters on both Git (Keeping track of your work) and GitHub (Collaborating on projects). Happy coding!

Tjelvar Olsson is a biologist who has drifted into computing.

He did his undergraduate degree and PhD in biochemistry. During his PhD he found himself shifting focus from bench work to computational data analysis. This process was largely guided by trial and error and was quite painful. At the time there were not many resources to help biologists learn about computing.

After his PhD he wanted to learn more about professional software development (and chemistry) and got a job as an application scientist at the Cambridge Crystallographic Data Centre, where he worked on projects to develop software and databases for the pharmaceutical industry. During this time he was fortunate enough to work with great people and learnt lots about coding, programming best practices and software development processes.

After six years the lure of biology and academia became too much and he got a job as a scientific computing lab manager at the John Innes Centre. At the JIC he spends much of his time developing scientific software and automating scientific services.
