Sharing data

Publishing research data is an important step in ensuring that your research outputs follow the Findable, Accessible, Interoperable, Reusable (FAIR) principles for scientific data management. Although making your work accessible to others involves costs (mostly effort and time), it considerably increases the reach of your work and thus also benefits your academic career. You can make sharing your work, data, and code substantially easier if you plan for it right from the start of your project. A data management plan, good organization of your data, and good documentation of your actions all make sharing easier, and they are also essential for you to be able to reproduce and understand your own work. In practice, your closest collaborator is often you from 6 months ago!

How to read this document

This document is organized around a series of questions that you can ask yourself when planning to share data. For each question, we point you to good external resources that provide answers.

What is the purpose for sharing the data

All shared data should be well documented, well organized, and follow community-accepted data standards and file formats. But depending on your main purpose for sharing data, you may make different decisions on how and what to share. Generally, there are two options:

1. You are sharing a dataset that you have acquired (e.g. raw data) or that you have processed (derived data). Your goal here is that someone else finds these data useful, reuses them, and gives you credit for your work.
2. You are sharing data as part of an empirical manuscript submission. Your goal here is to back up your findings and figures with the data that support your analyses, and possibly to fulfill the requirements of the journal.

The main difference between these two options is that option 2 is a special case of option 1. That is, to share data in support of a publication, you should follow the same steps and apply the same rigour as when sharing a standalone dataset, but you will also want to provide your data in a way that facilitates reproduction of the analyses described in your publication.

Sharing data in support of a published manuscript

Let’s assume you have just finished the manuscript for an exciting research project (✨ congrats ✨) and now you want to share the data that support your central findings and figures together with the manuscript. This is a great way to ensure that others can follow and re-use your research, which makes it more likely that your work will have an impact. In addition, many journals are starting to require that data be shared with the publication.

Consider sharing derivative data on their own

If your analyses are based on data that you have derived from an existing dataset (e.g. through preprocessing with an image processing pipeline), then you should consider sharing these derived data as a standalone dataset. You can then refer to the shared derived data in your publication and in your publication-specific data sharing.

Preparing data to be shared in support of your analyses should follow the same general principles as sharing a dataset for re-use. In contrast to sharing a standalone dataset, the focus here is on helping a reader reproduce and understand the specific analyses you have described in your publication.

Consider sharing code with data

Even data shared with the best annotations and descriptions are harder to understand and reproduce than data shared together with your analysis code. See this great overview in “The Turing Way”.

Here are some questions to get you started

What analyses should your readers be able to reproduce and have access to?
- all major figures
- the central claims of the publication
- specific pre-processing, inclusion or exclusion steps
- outlier analyses
- etc
What are the steps taken between your raw data and the major outcomes you want to show your readers?
- Here it can be helpful to look at your methods figure
- Alternatively, draw a flow chart on a piece of paper or write it down as hierarchical text
- If you do share code with your paper, each step should ideally map to a specific command or script
- For each step, make a note of the input and output data required
What intermediate data do you want to share?
- Some analysis steps take too long, produce results that are too large, or require too many resources for a reader to reasonably re-run them
- Other times, raw data cannot be shared due to privacy concerns, whereas derivative summary data can be shared without complications
- In these cases it may make sense to just share intermediate data
- Here the flow-chart will come in very handy to inform you on what data are needed to reproduce a given analysis step
Make a data sharing plan
- Based on the previous decisions and information, you should now make a plan on what to share and how
- Start a text document and list each data file you want to share, the name you plan to give it, and the directory you plan to store it in
- This can become the foundation for your README file and your data dictionaries
- This plan will also help you make sure that you are using file names consistently, e.g. across your README file, in your data availability statement, and in your scripts
Can my readers understand the intermediate data files?
- Make sure that intermediate data also have human readable names
- They should also be described in the documentation of your shared data
Do the data really work with my analysis code or descriptions?
- Once you have all of this, you should review it
- If you have code to be shared with the publication and data, make sure it runs and produces the expected outcomes
- Go over your data sharing plan with a colleague to check if it makes sense or things are missing
- Ask someone to review the documents to see if they can follow your analyses given the data and documentation
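
The flow chart described above can also be written down as a small data structure. The following Python sketch is one possible way to do this, not a prescribed method. All step names, script names, and file names are hypothetical placeholders, and the sketch assumes the steps are listed in the order in which they run. It records the inputs and outputs of each step and lists the files a reader would need in order to re-run the analysis from a given step onwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One box in the analysis flow chart (all names are hypothetical)."""
    name: str
    script: str
    inputs: tuple
    outputs: tuple


# Hypothetical pipeline, listed in execution order:
# raw data -> preprocessed data -> statistics -> figure.
PIPELINE = [
    Step("preprocess", "01_preprocess.py",
         ("raw/scans.tsv",), ("derived/clean.tsv",)),
    Step("statistics", "02_statistics.py",
         ("derived/clean.tsv",), ("derived/stats.tsv",)),
    Step("figure_1", "03_figure_1.py",
         ("derived/stats.tsv", "derived/clean.tsv"), ("figures/figure_1.png",)),
]


def files_to_share(pipeline, start_step):
    """Return the files a reader needs to re-run the pipeline from start_step.

    Files written by start_step or a later step are left out, because the
    reader regenerates them by running the scripts.
    """
    names = [step.name for step in pipeline]
    later_steps = pipeline[names.index(start_step):]
    produced = {f for step in later_steps for f in step.outputs}
    needed = {f for step in later_steps for f in step.inputs}
    return sorted(needed - produced)
```

With this hypothetical pipeline, `files_to_share(PIPELINE, "statistics")` returns `["derived/clean.tsv"]`: if the raw data cannot be shared, sharing that one intermediate file is enough for a reader to re-run the statistics and the figure.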

At the end of this process, you should have a directory with description files (README, data dictionaries) and data that you want to share.
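
As a rough illustration, a data sharing plan can be kept in a form that is easy to compare against this directory. The sketch below is only one way to do this. The `PLAN` entries, the file names, and the helper functions `check_plan` and `readme_file_list` are hypothetical and not part of any repository's tooling.

```python
from pathlib import Path

# Hypothetical plan: relative path -> one-line description for the README.
PLAN = {
    "README.md": "Overview of the dataset and how to use it",
    "data/clean.tsv": "Preprocessed measurements, one row per participant",
    "data/clean_dictionary.tsv": "Data dictionary for the columns of clean.tsv",
}


def check_plan(share_dir, plan):
    """Compare the planned files with the files present in share_dir.

    Returns (missing, unplanned): planned files that do not exist yet, and
    existing files that the plan does not mention.
    """
    root = Path(share_dir)
    present = {
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    }
    missing = sorted(set(plan) - present)
    unplanned = sorted(present - set(plan))
    return missing, unplanned


def readme_file_list(plan):
    """Render the plan as a plain-text file list to paste into a README."""
    return "\n".join(
        f"- {path}: {description}" for path, description in sorted(plan.items())
    )
```

Running `check_plan` on the directory before uploading helps catch files that were renamed in the scripts but not in the README, or files that ended up in the directory without being documented.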

What repository should the data be shared on

Once you are clear on what data you want to share and for what purpose, you can make a decision on the data publishing repository you want to use. A data repository primarily serves two purposes:

- to provide the storage space for hosting your data (so you don’t have to pay for or worry about it)
- to mint a Digital Object Identifier (DOI) for your dataset: a permanent record and link that will keep pointing to your dataset and that others can use to access your data. Journals often require a DOI as part of their data accessibility requirements.

The Nature Publishing Group maintains a list of recommended subfield-specific and domain-general data repositories that you can take a look at.

Another great resource is the data publishing guideline of the F1000 publishing platform.

A good domain-general data repository is Zenodo.org. Zenodo is funded by CERN, and apart from points for coolness this also gives a high level of confidence that it will remain accessible long into the future. We recommend Zenodo for basic data sharing needs.

Another great platform for data sharing is the Open Science Framework (OSF), which is maintained by the Center for Open Science.

These data repositories often allow you to reserve a DOI before you publish your data. This can be very helpful if you want to include the DOI in your manuscript submission but still want to make changes to the data you intend to share.
