Get it in Git
Publishing analysis code allows others to better understand what you have done, to verify that your analysis does what you claim, and to build on your work.
Use git init, git add, git commit, git remote add, and git push, as discussed in the Software Carpentry Git lesson (a minimal sketch of this workflow follows below).
Include all of the code that you have written for this analysis.
Leave out, for example, temporary copies, old backup versions, files containing secret or confidential information, and automatically generated supporting files.
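As a minimal sketch of that workflow, assuming a hypothetical analysis.py script and a placeholder remote URL:

```bash
# Start tracking the analysis code in Git
git init
git add analysis.py README.md   # file names are placeholders
git commit -m "Add initial analysis code"

# Connect the repository to a remote host and publish it
# (the URL and branch name are placeholders for your own setup)
git remote add origin https://github.com/yourname/analysis.git
git push -u origin main
```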
Structuring your repository
Put code into a specific subdirectory (or several, if there is lots of code).
Keep important metadata, such as a license, citation information, and a README, in the root of the repository.
Keep ancillary data, documentation, and other supporting files in separate subdirectories.
Use git mv to move files, so that Git knows they have moved (see the sketch below).
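For example, moving an existing script into a dedicated code/ subdirectory (a hypothetical layout), letting Git record the move:

```bash
# Create a subdirectory for code and move the script into it,
# so Git records a move rather than a delete plus an add
mkdir code
git mv analysis.py code/analysis.py
git commit -m "Move analysis script into code/"
```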
Documentation and automation
Use a README or similar file to explain the essential steps of running your analysis.
Use a shell script or similar to automate the steps you would take to perform your analysis.
Use command-line arguments or other parameters instead of having to edit lines of code by hand (see the sketch below).
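A minimal sketch of such a driver script, with hypothetical file names, taking the input file as a command-line argument rather than hard-coding it:

```bash
#!/usr/bin/env bash
# run_analysis.sh: run the full analysis on a given input file.
# Usage: ./run_analysis.sh <input-file>
set -euo pipefail

INPUT="$1"

mkdir -p results
python code/analysis.py --input "$INPUT" --output results/output.csv
```

Because the input is passed as an argument, the same script can be rerun on new data without editing any code.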
Jupyter Notebooks and automation
Jupyter Notebooks can be run in a non-linear order, and they store their output as well as their input.
Remove all output from notebooks before committing them to a pure code repository.
Test notebooks from a fresh kernel, or run them from the command line with jupyter nbconvert (see the example below).
Use environment variables to pass arguments into a notebook.
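For example, assuming a hypothetical analysis.ipynb that reads a parameter from an environment variable (e.g. via os.environ in Python):

```bash
# Strip any stored output, then execute the notebook top-to-bottom
# with a fresh kernel, passing a parameter through the environment
jupyter nbconvert --clear-output --inplace analysis.ipynb
ANALYSIS_INPUT=data/measurements.csv \
    jupyter nbconvert --to notebook --execute analysis.ipynb
```

Running a notebook this way catches cells that only worked because of earlier out-of-order execution.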
Data
Use curl to download data automatically (see the example below).
For small amounts of data, where the code exists specifically to analyse only those data, the data and code can be stored and published together.
For large datasets, or where code is used for multiple different datasets, keep the two separate.
Data can frequently be published, if there are no constraints preventing it; if the data are not published, then publishing the analysis code becomes less valuable.
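For example, fetching the input data automatically at the start of the analysis (the URL is a placeholder for wherever the data are published):

```bash
# Download the input data into a local data/ directory
mkdir -p data
curl -L -o data/measurements.csv https://example.org/dataset/measurements.csv
```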
Reproducible software environments
Different versions of packages can give different numerical results, so documenting the environment helps ensure that others get the same results from your work as you do.
pip freeze and conda env export produce plain-text files listing the packages installed in an environment, which can then be used to recreate it (see the examples below).
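A sketch of recording the current environment in either ecosystem:

```bash
# Record exact package versions for a pip-managed environment
pip freeze > requirements.txt

# Or export the full definition of the active conda environment
conda env export > environment.yml
```

Commit the resulting file alongside the analysis code so that the two are versioned together.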
Verifying your analysis
Use pip install -r and conda env create to create a new environment from a definition (see the example below).
Running your full analysis end-to-end in a clean environment will highlight most problems.
Binder services (e.g. MyBinder) will create an environment in the cloud based on your definition.
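For example, recreating the environment from the definitions exported in the previous section, then re-running the analysis end-to-end (the driver script is the hypothetical one sketched earlier):

```bash
# Recreate a pip-managed environment from its recorded definition
pip install -r requirements.txt

# Or recreate a conda environment (its name is read from the file)
conda env create -f environment.yml

# Then run the whole analysis from scratch in the clean environment
./run_analysis.sh data/measurements.csv
```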
Publishing in open science repositories
Source code hosts like GitHub are designed for the active development of software; open science repositories exist to keep versions of record for the longer term.
Services like Zenodo allow you to package a particular commit, archive it, and give it a permanent identifier.
Many services, Zenodo among them, will automatically give you a DOI for each item you archive, including repositories pulled from GitHub.
DOIs for code repositories can be cited in journal articles the same way as any other publication.