This lesson is in the early stages of development (Alpha version)

Publishing your data analysis code: Glossary

Key Points

Get it in Git
  • Publishing analysis code allows others to better understand what you have done, to verify that your analysis does what you claim, and to build on your work.

  • Use git init, git add, git commit, git remote add, and git push, as discussed in the Software Carpentry Git lesson.

  • Include all the code that you have written to use in this analysis.

  • Leave out files such as temporary copies, old backup versions, files containing secret or confidential information, and supporting files generated automatically.

Structuring your repository
  • Put code into a specific subdirectory (or several, if there is lots of code).

  • Keep important metadata, such as a license, citation information, and README in the root of the repository.

  • Keep other ancillary data, documentation, etc. in separate subdirectories.

  • Use git mv to move files and let Git know that they have moved.

Documentation and automation
  • Use a README or similar file to explain the essential steps of running your analysis.

  • Use a shell script or similar to automate the steps you would take to perform your analysis.

  • Use command-line arguments or other parameters instead of having to manually edit lines of code.
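
As a minimal sketch of the last point, assuming the analysis is a Python script, the standard argparse module can replace hand-edited constants with command-line arguments. The parameter names (input, --threshold, --output) and the filename analysis.py are hypothetical, chosen only for illustration.

```python
import argparse


def build_parser():
    """Define the settings that would otherwise be edited by hand in the code."""
    parser = argparse.ArgumentParser(description="Run the analysis on one data file.")
    # Hypothetical parameters, for illustration only.
    parser.add_argument("input", help="path to the data file to analyse")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="cut applied to the data (default: 0.5)")
    parser.add_argument("--output", default="results.txt",
                        help="where to write the results (default: results.txt)")
    return parser


def main(argv=None):
    """Parse arguments (from argv, or from the command line if argv is None)."""
    args = build_parser().parse_args(argv)
    # The real analysis would go here; this sketch only reports its settings.
    print(f"Analysing {args.input} with threshold {args.threshold}; "
          f"writing to {args.output}")
    return args


# Equivalent to running: python analysis.py data.csv --threshold 0.25
args = main(["data.csv", "--threshold", "0.25"])
```

In a real script the last line would typically be a bare call to main(), so that the values come from the command line and the README can document the exact command used for each result.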

Jupyter Notebooks and automation
  • Jupyter Notebooks can be run in a non-linear order, and store their output as well as their input.

  • Remove all output from notebooks before committing to a pure code repository.

  • Test notebooks from a fresh kernel, or run them from the command line with jupyter nbconvert.

  • Use environment variables to pass arguments into a notebook.
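
The points about removing output and passing arguments can be sketched in Python as follows. This assumes the nbformat 4 layout, in which a notebook is a JSON document whose code cells carry "outputs" and "execution_count" fields. The helper names strip_outputs and read_parameter, and the variable name ANALYSIS_THRESHOLD, are hypothetical.

```python
import json
import os


def strip_outputs(notebook):
    """Remove stored outputs and execution counts from a notebook dict (nbformat 4)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def read_parameter(name, default):
    """Read a parameter from an environment variable, falling back to a default."""
    return os.environ.get(name, default)


# A tiny in-memory notebook standing in for a real .ipynb file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=1))

# Inside a notebook cell, a parameter could be picked up like this.
# Environment variables are strings, so convert to the type the analysis needs.
threshold = float(read_parameter("ANALYSIS_THRESHOLD", "0.5"))
print(threshold)
```

From the shell, the variable would be set when the notebook is executed, for example ANALYSIS_THRESHOLD=0.25 jupyter nbconvert --to notebook --execute analysis.ipynb, where analysis.ipynb is a hypothetical filename.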

Data
  • Use curl to download data automatically.

  • For small amounts of data, and code that is specifically to analyse only those data, data and code can be stored and published together.

  • For large datasets, or where code is used for multiple different datasets, keep the two separate.

  • Data can frequently be published, if there are no constraints preventing it. If data are not published, then publishing analysis code becomes less valuable.
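
Where the analysis is driven from Python rather than the shell, the automatic download can also be done with the standard library instead of curl. The following is a sketch of that alternative, not the curl approach itself; the function name fetch is hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, destination):
    """Download url to destination unless the file is already present."""
    destination = Path(destination)
    if destination.exists():
        return destination  # Avoid re-downloading on every run.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A call such as fetch("https://example.org/dataset.csv", "data/dataset.csv") would then sit at the start of the analysis; the URL and path here are placeholders, not a real dataset.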

Reproducible software environments
  • Different versions of packages can give different numerical results. Documenting the environment ensures others can get the same results from your work as you do.

  • pip freeze and conda env export give plain text files that define the packages installed in an environment and can be used to recreate it.
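
To illustrate what such a definition contains, the sketch below uses the standard importlib.metadata module to list the packages visible to the running interpreter as name==version lines. It only approximates the kind of listing pip freeze produces and is not a replacement for it; the function name installed_packages is hypothetical.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' lines for packages visible to this interpreter."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # Skip entries whose metadata is missing a name.
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in installed_packages():
    print(line)
```

Running this inside the environment used for the analysis, and again in a freshly created one, gives a quick way to compare the two.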

Verifying your analysis
  • Use pip install -r and conda env create to create a new environment from a definition.

  • Running your full analysis end-to-end in a clean environment will highlight most problems.

  • Binder services (e.g. MyBinder) will create an environment in the cloud based on your definition.

Publishing in open science repositories
  • Source code hosts like GitHub are designed for active development of software. Open science repositories are designed to keep versions of record for the longer term.

  • Services like Zenodo allow you to package a particular commit, archive it, and give it a permanent identifier.

  • Many services like Zenodo will automatically give you a DOI for any dataset, including repositories pulled from GitHub.

  • DOIs for code repositories can be cited in journal articles the same way as any other publication.

Glossary

FIXME