Computational reproducibility using Continuous Integration to produce verifiable end-to-end runs of scientific analysis.
This repository presents Continuous Analysis, a process demonstrating computational reproducibility by producing verifiable end-to-end runs of computational research. The process is described in detail in Nature Biotechnology and in a bioRxiv preprint.
Scientific research often involves complex workflows that can be challenging to reproduce. These workflows often have many dependencies. Even when authors release source code and data, readers may have different versions of those dependencies, and such mismatches can cause significant differences in results.
Continuous analysis extends Docker and continuous integration to re-run a scientific analysis after any change to the source code or data. Continuous analysis generates:
Examples and real applications of continuous analysis are available:
We consider 3 configurations of continuous analysis:
Installing a personal local continuous integration service.
Using a full-service continuous integration service requires the least setup time but cannot handle computationally intensive work. Local continuous integration can be used at no cost, and can be configured to take advantage of institutional clusters or GPU computing. Continuous integration in the cloud offers elastic computing resources that can be scaled up or down depending on the computational complexity of your work.
Each of the local and cloud implementations uses a common .drone.yml format. See here for a full example.
# choose the base docker image
image: brettbj/continuous_analysis_base
script:
  # run tests
  # perform analysis

# publish results
publish:
  docker:
    # docker details
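The script section holds the test and analysis commands. CI services typically run these commands in order and stop the build at the first one that fails. The sketch below mimics that behaviour locally; `run_steps` is a hypothetical helper, not part of Drone, and the step commands are placeholders.

```shell
# Hypothetical helper: run each command string in order and stop at the
# first failure, roughly how a CI build treats its script section.
run_steps() {
    for step in "$@"; do
        echo "+ $step"
        # sh -c runs each step in its own shell
        sh -c "$step" || return 1
    done
}

# Placeholder steps standing in for "run tests" and "perform analysis"
run_steps "echo running tests" "echo performing analysis"
```

A failing test step therefore prevents the analysis and publish steps from running on code that does not pass its tests.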
Drone is a CI platform built on container technology. It can be used to run a local or private cluster-based CI service with little setup. Instructions adapted from - http://readme.drone.io/setup/overview/
1.) Install Docker on host machine - Linux, Mac, Windows
2.) Pull the drone image via docker
sudo docker pull drone/drone:0.4
3.) Create a new application at - https://github.com/settings/developers - using your host's IP address as the homepage URL and the same address followed by /authorize/ as the authorization callback URL
Homepage URL: http://YOUR-IP-HERE/
Callback URL: http://YOUR-IP-HERE/authorize/
4.) Add a webhook to notify the continuous integration server of any updates pushed to the repository.
The payload URL should be in the format of your-ip/api/hook/github.com/client-id
5.) Create a configuration file at (/etc/drone/dronerc), filling in the client info -
REMOTE_DRIVER=github
REMOTE_CONFIG=https://github.com?client_id=....&client_secret=....
6.) Create and run your drone container
sudo docker run \
--volume /var/lib/drone:/var/lib/drone \
--volume /var/run/docker.sock:/var/run/docker.sock \
--env-file /etc/drone/dronerc \
--restart=always \
--publish=80:8000 \
--detach=true \
--name=drone \
drone/drone:0.4
7.) Access the Drone control panel - your-ip-address/login and press the GitHub button.
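The configuration file from the steps above can also be written by a small script. This is a minimal sketch: `write_dronerc` is a hypothetical helper, and the client ID and secret are placeholders for the values GitHub shows for your application. It writes to whatever path is passed in, so it can be tried without root access.

```shell
# Hypothetical helper: write a dronerc file.
#   $1 = output path, $2 = GitHub client ID, $3 = GitHub client secret
write_dronerc() {
    dronerc_path=$1
    client_id=$2
    client_secret=$3
    cat > "$dronerc_path" <<EOF
REMOTE_DRIVER=github
REMOTE_CONFIG=https://github.com?client_id=${client_id}&client_secret=${client_secret}
EOF
}

# Try it with placeholder credentials in a temporary file.
# On the host, the target would be /etc/drone/dronerc (usually needs sudo).
example_path=$(mktemp)
write_dronerc "$example_path" my-client-id my-client-secret
cat "$example_path"
rm -f "$example_path"
```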
Shippable is the only example shown that does not rely on the open-source Drone project.
1.) Sign in to Shippable using Github
2.) Select the account owner for your repository in the Subscriptions dropdown, click the enable project button for the desired project.
3.) Add a shippable.yml file to the root of your repository
pre_ci_boot:
  image_name: brettbj/continuous_analysis_base
  image_tag: latest
  pull: true
  options: "-e HOME=/root"
ci:
  - cd /root/src/github.com/greenelab/continuous_analysis
  - nose2 --plugin nose2.plugins.junitxml --junit-xml test
  - mv nose2-junit.xml shippable/testresults/tests.xml
  - coverage run --branch test.py
  - coverage xml -o shippable/codecoverage/coverage.xml test.py
  # run kallisto on a few simple tests
  - cd /kallisto/test
  - kallisto index -i transcripts.idx transcripts.fasta.gz
  - kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz
  - cp -R /kallisto/test/output /root/src/github.com/greenelab/continuous_analysis/shippable/output
  # plot the results from a jupyter notebook
  - cd /root/src/github.com/greenelab/continuous_analysis
  - jupyter nbconvert --to html --execute ./Shippable_Plotting.ipynb
4.) Push completed Results -
  - git config user.email "brettbj@gmail.com"
  - git config user.name "Brett Beaulieu-Jones"
  - git config --global push.default simple
  - git remote set-url origin https://brettbj:$git_publish_key@github.com/greenelab/continuous_analysis.git
  - git checkout master
  - git pull
  - git add shippable/.
  - git commit -a -m "Shippable output [CI SKIP] [SKIP CI] ."
  - git stash
  - git push
post_ci:
  # build the image and push the same image that was just built
  - docker build -t brettbj/continuous_analysis .
  - docker push brettbj/continuous_analysis:latest
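The commit message in the push step carries the `[CI SKIP]` and `[SKIP CI]` markers, presumably so that committing the results does not itself trigger another build. Below is a minimal sketch of such a check, assuming a case-insensitive match on either marker. The helper name `has_skip_marker` is hypothetical, and the actual matching rules are defined by the CI service.

```shell
# Hypothetical helper: succeed if a commit message contains a skip marker.
has_skip_marker() {
    # lower-case the message so the match ignores case
    msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$msg" in
        *"[ci skip]"*|*"[skip ci]"*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_skip_marker "Shippable output [CI SKIP] [SKIP CI] ."; then
    echo "build skipped"
fi
```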
box: brettbj/continuous_analysis_base
build:
  steps:
    - script:
        name: Run Tests + Coverage
        code: |
          nose2 --plugin nose2.plugins.junitxml --junit-xml test
          mkdir wercker
          mv nose2-junit.xml wercker/tests.xml
          coverage run --branch test.py
          coverage xml -o wercker/coverage.xml test.py
    - script:
        name: Run Kallisto
        code: |
          cd /kallisto/test
          kallisto index -i transcripts.idx transcripts.fasta.gz &>-
          kallisto quant -i transcripts.idx -o output -b 100 reads_1.fastq.gz reads_2.fastq.gz &>-
          cp -R /kallisto/test/output /pipeline/source/wercker/output
    - script:
        name: Plot Results
        code: |
          cd /pipeline/source
          jupyter nbconvert --to html --execute ./Wercker_Plotting.ipynb
Create a personal access token in GitHub (Personal settings -> Personal access tokens -> Generate new token -> give it repo access)
Set an environment variable in Wercker so that results can be pushed to GitHub. Go to Settings, then Environment variables. The example uses git_publish_key; the URL in wercker.yml then has the form https://{TOKEN}@github.com/{ACCOUNT}/{REPOSITORY}.git
Push results to GitHub
    - script:
        name: Push Results back to github
        code: |
          git config user.email "brettbj@gmail.com"
          git config user.name "Brett Beaulieu-Jones"
          git config --global push.default simple
          git remote set-url origin https://brettbj:$git_publish_key@github.com/greenelab/continuous_analysis.git
          git checkout master
          git pull
          git add wercker/.
          git commit -a -m "Wercker output [CI SKIP] [SKIP CI] ."
          git stash
          git push
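The remote URL with an embedded token follows the pattern https://{TOKEN}@github.com/{ACCOUNT}/{REPOSITORY}.git given above. As a small sketch, the hypothetical helper `token_remote_url` assembles it from its three parts; the token, account and repository shown are placeholders.

```shell
# Hypothetical helper: build an HTTPS remote URL that embeds a personal
# access token, following https://{TOKEN}@github.com/{ACCOUNT}/{REPOSITORY}.git
#   $1 = token, $2 = account, $3 = repository
token_remote_url() {
    printf 'https://%s@github.com/%s/%s.git\n' "$1" "$2" "$3"
}

# In the CI job the token would come from the environment variable set in
# the service's settings (git_publish_key in the example), e.g.:
#   git remote set-url origin "$(token_remote_url "$git_publish_key" ACCOUNT REPOSITORY)"
token_remote_url "TOKEN" "ACCOUNT" "REPOSITORY"
```

Reading the token from an environment variable keeps it out of the repository, which matters because wercker.yml itself is committed.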
DigitalOcean provides an extremely easy way to start a cloud-based private continuous integration service. Instructions below were adapted from here.
1.) Create a drone droplet, selecting Drone on Ubuntu 14.04 from the applications tab.
2.) Create a new application at - https://github.com/settings/developers - using your host's IP address as the homepage URL and the same address followed by /api/auth/github.com as the authorization callback URL
Homepage URL: http://YOUR-IP-HERE/
Callback URL: http://YOUR-IP-HERE/api/auth/github.com
3.) Take note of the Client ID and Client Secret and log into your new droplet via ssh. You’ll be asked a few questions; for the simplest setup, choose automatic configuration. You’ll be prompted to choose your code repository and enter your Client ID and Client Secret.
4.) You’re ready to start by going to: http://YOUR-IP-HERE/login
Instructions adapted from here.
1.) Log in to AWS and launch an Ubuntu instance from the Amazon-provided AMI (ami-9abea4fb).
2.) SSH into the created instance -
ssh ubuntu@<aws-instance>
3.) Install Docker
4.) Create a new application at - https://github.com/settings/developers - using the EC2 instance's address as the homepage URL and the same address followed by /authorize/ as the authorization callback URL
Homepage URL: http://YOUR-IP-HERE/
Callback URL: http://YOUR-IP-HERE/authorize/
5.) Create a configuration file at (/etc/drone/dronerc), filling in the client info -
REMOTE_DRIVER=github
REMOTE_CONFIG=https://github.com?client_id=....&client_secret=....
6.) Run the Drone Docker container
sudo docker run \
--volume /var/lib/drone:/var/lib/drone \
--volume /var/run/docker.sock:/var/run/docker.sock \
--env-file /etc/drone/dronerc \
--restart=always \
--publish=80:8000 \
--detach=true \
--name=drone \
drone/drone:0.4
7.) Adjust your EC2 security settings to allow inbound traffic. Go to the instance details for your EC2 instance and click on the security group. Choose the Inbound tab, click Edit and add a rule for port 80 with source 0.0.0.0/0 (anywhere).
8.) Go to your EC2 instance's address and you should now be able to log in.
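The inbound rule from the security settings step can also be added from the command line. The sketch below only prints the AWS CLI command so it can be reviewed before running. It assumes the AWS CLI is installed and configured; `open_http_command` is a hypothetical helper and the security group ID is a placeholder.

```shell
# Print (rather than run) the AWS CLI call that opens port 80 to all
# IPv4 addresses for the security group given as $1.
open_http_command() {
    printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 80 --cidr 0.0.0.0/0\n' "$1"
}

# sg-0123456789abcdef0 is a placeholder security group ID
open_http_command sg-0123456789abcdef0
```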
1.) Log in to Google Cloud Platform and create a new VM instance under Compute Engine.
2.) SSH into the new VM instance (this can be done via the browser)
3.) Install Docker
4.) Create a new application at - https://github.com/settings/developers - using the Compute Engine instance's IP address as the homepage URL and the same address followed by /authorize/ as the authorization callback URL
Homepage URL: http://YOUR-IP-HERE/
Callback URL: http://YOUR-IP-HERE/authorize/
5.) Create a configuration file at (/etc/drone/dronerc), filling in the client info -
REMOTE_DRIVER=github
REMOTE_CONFIG=https://github.com?client_id=....&client_secret=....
6.) Run the Drone Docker container
sudo docker run \
--volume /var/lib/drone:/var/lib/drone \
--volume /var/run/docker.sock:/var/run/docker.sock \
--env-file /etc/drone/dronerc \
--restart=always \
--publish=80:8000 \
--detach=true \
--name=drone \
drone/drone:0.4
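The same docker run command is used for the local, AWS and Google Cloud setups. In it, --publish=80:8000 maps port 8000 inside the container to port 80 on the host. As a sketch, the hypothetical helper `drone_run_command` prints the command with a configurable host port, for cases where port 80 is already taken. Nothing is executed.

```shell
# Print the docker run command shown above with a configurable host port
# ($1, default 80).
drone_run_command() {
    host_port=${1:-80}
    cat <<EOF
sudo docker run \\
  --volume /var/lib/drone:/var/lib/drone \\
  --volume /var/run/docker.sock:/var/run/docker.sock \\
  --env-file /etc/drone/dronerc \\
  --restart=always \\
  --publish=${host_port}:8000 \\
  --detach=true \\
  --name=drone \\
  drone/drone:0.4
EOF
}

# Example: expose Drone on host port 8080 instead of 80
drone_run_command 8080
```

If the host port changes, the homepage and callback URLs registered with GitHub would need to include that port as well.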
Please feel free to email me - (brettbe) at med.upenn.edu with any feedback, or raise a GitHub issue with any comments or questions. We also encourage you to send examples of your own usage of continuous analysis to be included in the examples section.
This work is supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4552 to C.S.G. as well as NIH grants AI116794 and LM010098 and the Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health to Jason Moore.