Submitting to ENA

ENA submission as of 25th of January 2016

The ENA submission system is built around umbrella projects (optional, used for example by larger consortia), studies, experiments, samples, runs and metadata. You can pre-register your study and samples (and link them to an umbrella project if desired) and upload your raw reads at a later stage. The ENA webpage contains "how-tos" on uploading your data, but these assume that the data is stored on your laptop/computer. However, you can also upload directly from the Abel/Cod nodes.

Step 1 - get your ENA account. This is easily done and your account is active immediately. Consider making a group account if your group generates HTS data at a high rate. Also connect your account to a secondary user (group leader, collaborator at UoO etc.) if your contract is temporary. ENA is found here.

Step 2 - register your study in the "New Submission" tab. Here you can do all the steps in one go or pre-register the study. Fill in a short name for the study, a descriptive title (example: RNAseq of Atlantic cod juveniles) and a short abstract. The abstract should describe the study well enough that other researchers can determine whether they need your sequences for their meta-analysis: is it mRNA-selected RNAseq, is it RADseq, is it a metagenomic sample, etc. If you have publications related to the study you are required to attach PubMed IDs during the registration. Note: ENA submissions that have been published are required to be public.

Step 3 - register your sample groups. Sample groups are divisions such as healthy and control. If you only have one category, you will end up with a single sample group.

Step 4 - upload your samples. Check which file formats are accepted first! You can follow the upload "how-to" or upload directly from the Abel/Cod nodes using wput:

/projects/cees/bin/wput/wput-0.6.1/wput -B file_for_upload
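
The command above does not show where the files are sent. As a rough sketch, wput expects the local files followed by an FTP URL that carries the user name and password. The host name webin.ebi.ac.uk, the environment variables ENA_USER and ENA_PASS, and the function name ena_upload are assumptions made for illustration; check the current ENA upload instructions for the actual address. By default the function only prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch: upload files to the ENA FTP area with wput.
# ASSUMPTIONS (not from this page): the FTP host name and the
# ENA_USER / ENA_PASS environment variables.
ENA_FTP_HOST="${ENA_FTP_HOST:-webin.ebi.ac.uk}"   # assumed host name
WPUT="${WPUT:-/projects/cees/bin/wput/wput-0.6.1/wput}"

ena_upload() {
    if [ -z "${ENA_USER:-}" ] || [ -z "${ENA_PASS:-}" ]; then
        echo "set ENA_USER and ENA_PASS first" >&2
        return 1
    fi
    # wput takes the local files first and the target FTP URL last.
    # -B forces binary transfer mode, as in the command above.
    local url="ftp://${ENA_USER}:${ENA_PASS}@${ENA_FTP_HOST}/"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command with the password masked.
        echo "$WPUT -B $* ftp://${ENA_USER}:***@${ENA_FTP_HOST}/"
    else
        "$WPUT" -B "$@" "$url"
    fi
}

# Example (hypothetical file names):
#   ENA_USER=... ENA_PASS=... DRY_RUN=0 ena_upload sample1_R1.fastq.gz
```

Run it from the folder that holds the reads so that no folder structure is created on the ENA FTP server. A password passed in a URL may be visible to other users in the process list, and special characters in it must be URL-escaped.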

Calculate the md5sums on the Abel/Cod nodes using 'md5sum filename' before registering your samples in the next step.
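
With many read files it can be convenient to compute all checksums in one go. The sketch below is one way to do that, assuming a POSIX-like shell and the md5sum command; the function name md5_table and the file names are hypothetical. It prints one line per file with the file name and its checksum separated by a tab, which can be pasted into the sample spreadsheet.

```shell
#!/usr/bin/env bash
# Print "filename<TAB>md5" for every file given as an argument.
md5_table() {
    local f sum
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "not a file: $f" >&2
            return 1
        fi
        # md5sum prints "<checksum>  <name>"; keep only the checksum.
        sum=$(md5sum < "$f" | cut -d' ' -f1)
        printf '%s\t%s\n' "$(basename "$f")" "$sum"
    done
}

# Example (hypothetical file names):
#   md5_table *.fastq.gz > md5_table.tsv
```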

Step 5 - register your samples. If you have more than a handful of samples, use the option of uploading sample spreadsheets (download the templates). Modify the templates according to your samples and the amount of metadata you want to connect to them. The samples must have: title, taxonomic ID (for a new species ENA has to create a new taxonomic identifier for you, which takes a few days), library source (genomic, transcriptomic...), library selection (PCR, oligo-dT, SAGE...), library strategy, design description, library construction protocol (Illumina TruSeq...), instrument model, file type, library layout (single, paired...), insert size, filenames and their md5sums. In addition, fill in any optional fields you have added, such as developmental stage, tissue type, collection site etc.
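
Before uploading a filled-in sample spreadsheet, a quick consistency check can catch rows with missing or shifted cells. The sketch below assumes the sheet has been saved as tab-separated text with a single header line; check_sheet and samples.tsv are hypothetical names, and the check knows nothing about ENA's own validation rules. It flags every empty cell, so optional columns that are deliberately left blank will also be reported.

```shell
#!/usr/bin/env bash
# Report rows whose field count differs from the header,
# and cells that are empty. Exit status is 1 if anything is reported.
check_sheet() {
    awk -F'\t' '
        BEGIN   { bad = 0 }
        NR == 1 { n = NF; next }     # the header defines the column count
        NF != n {
            printf "line %d: %d fields, expected %d\n", NR, NF, n
            bad = 1
            next
        }
        {
            for (i = 1; i <= NF; i++)
                if ($i == "") {
                    printf "line %d: column %d is empty\n", NR, i
                    bad = 1
                }
        }
        END     { exit bad }
    ' "$1"
}

# Example (hypothetical file name):
#   check_sheet samples.tsv && echo "sheet looks consistent"
```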

You will receive emails as you go through the various steps and each of them passes quality control. The final email says that your submission is OK and marks it as either confidential or public, according to the release date you selected in the first step.

Congrats on your submission!


Old submission protocol:

When your sequences are analyzed and the paper is ready, you will most likely be asked to submit your sequences somewhere. We recommend ENA, the European Nucleotide Archive. Here you will find some advice on how to do that with as little pain as possible. Thanks to Mari Espelund for being a test case and for reporting on her experience.

How to proceed:

  • You need a submission account, which you get by applying to ENA. The webpage for doing that is this one. After getting your account, you can proceed with the submission itself.

  • Next, gather all of the sequence files that you will be submitting into one directory. You will need to calculate the md5sum of each file. This is simply a number that is calculated from the contents of the file. If you change a comma or a whitespace, or add an extra line, the number changes. ENA uses it to check that the files they receive are complete and unchanged.

    To calculate your md5sum, you can either download a program for Windows or Mac that does this, or log in to a Linux/Unix machine (for instance titan) and calculate it there. On these machines the command is simply:

    md5sum nameOfYourSequenceFile

    The result should look something like this: d0e02ef5cac9e813839318061ee4edfb

  • ENA has two portals: a test portal and a production portal. Start out by using the test portal. Everything you do there, except the uploaded files, will be deleted after 25 hours. Using the test portal first is highly recommended, so that you can gather all of the details they want without running the risk of submitting something wrong.

    Please note that the window that you work in while submitting is very wide, and in some cases fields that you are supposed to fill out may end up outside of your browser window. Scroll sideways to ensure that you are seeing everything.

    Also note that if you switch panes during submission without having filled in all of the details, you might end up with unrecoverable errors and have to start all over again.

  • In the first pane you are asked to upload your files. This can be done by clicking on the link where it says SRA-FileUpLoader. This opens a small program on your computer. Here you first log in with your login details, and then you select the directory where your sequence files are stored. Make sure they are all marked for upload and press upload. This will take some time - go get a coffee while waiting.
  • When you have filled out everything, you will get to the last pane where you fill in the file names and the md5sums of the files. Here you will see that you can download a spreadsheet with all of the information that you have filled out so far. NOTE: you can use this spreadsheet to go directly to this step when you are working in the production server. 

  • Please note: When you press submit in the test server, YOU HAVE NOT REALLY SUBMITTED! This window just mimics what you will see in the production server. However, the files you uploaded are still uploaded, so there is no need to do this again.
  • Last but not least, log in to the production server, start a new study, upload your spreadsheet and state the names of your files with their md5sums. When you then press submit, you have actually submitted. You should get an email within a few days stating that they are working on your submission, and then, if everything is OK, you will get your accession numbers reasonably fast.
  • When uploading from Abel or the Cod nodes you can use wput together with your ENA username and password. Make sure you are in the correct folder to avoid creating a folder structure on the ENA FTP server:

    /projects/cees/bin/wput/wput-0.6.1/wput -B reads_to_upload

    Congratulations on your submittal!
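
The checksum behaviour described in the list above can be tried out locally before anything is sent to ENA. The following is a minimal sketch, assuming a md5sum that supports the -c option; the function names record_md5 and verify_md5 and the file names are hypothetical. It records the checksums once and later re-checks the files against that record, so any change to a file shows up as a failed check.

```shell
#!/usr/bin/env bash
# Record checksums once, then re-check the files against that record.
# Run both functions from the directory that holds the files.

record_md5() {   # usage: record_md5 checksums.md5 file...
    local out="$1"
    shift
    md5sum "$@" > "$out"
}

verify_md5() {   # usage: verify_md5 checksums.md5
    # Prints "<file>: OK" or "<file>: FAILED" for each recorded file;
    # the exit status is non-zero if any file no longer matches.
    md5sum -c "$1"
}

# Example (hypothetical file names):
#   record_md5 checksums.md5 *.fastq.gz
#   verify_md5 checksums.md5
```

Adding even a single empty line to one of the files after recording makes verify_md5 report that file as FAILED.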