multiple NCBI_accession values per sample_id
@ajgasparrini-23549
United States

What is the relationship between the sample_id and NCBI_accession variables in sampleMetadata? For most samples, there is one NCBI_accession listed per sample_id. For some samples, however, NCBI_accession contains a semicolon-delimited list of accessions. For example, here is what the sample_id and NCBI_accession fields look like for VogtmannE_2016:

> sampleMetadata %>% filter(study_name == "VogtmannE_2016") %>% select(sample_id, NCBI_accession) %>% head
              sample_id                                                                          NCBI_accession
1 MMRS11288076ST-27-0-0 ERR1293500;ERR1293499;ERR1293498;ERR1293497;ERR1293059;ERR1293058;ERR1293057;ERR1293056
2 MMRS11664448ST-27-0-0 ERR1293861;ERR1293860;ERR1293859;ERR1293858;ERR1293420;ERR1293419;ERR1293418;ERR1293417
3 MMRS11932626ST-27-0-0 ERR1293881;ERR1293880;ERR1293879;ERR1293878;ERR1293440;ERR1293439;ERR1293438;ERR1293437
4 MMRS12272136ST-27-0-0 ERR1293877;ERR1293876;ERR1293875;ERR1293874;ERR1293436;ERR1293435;ERR1293434;ERR1293433
5 MMRS14379078ST-27-0-0 ERR1293548;ERR1293547;ERR1293546;ERR1293545;ERR1293107;ERR1293106;ERR1293105;ERR1293104
6 MMRS14602194ST-27-0-0 ERR1293813;ERR1293812;ERR1293811;ERR1293810;ERR1293372;ERR1293371;ERR1293370;ERR1293369


How should I interpret multiple NCBI accessions per sample? Are these multiple sequencing runs of the same library? For bioinformatic analyses, were these samples concatenated?

curatedMetagenomicData • 271 views
@schifferl
New York, NY

The curatedMetagenomicData package provides samples as they are listed in the original manuscripts, keyed by sample_id. Each sample_id is one sample that underwent one or more sequencing runs (potentially many). The NCBI_accession column of sampleMetadata identifies these sequencing runs by their SRA identifiers, and where multiple runs of a single sample exist, they are delimited by semicolons. This is demonstrated in the table below.

|study_name     |sample_id             |NCBI_accession                 |
|:--------------|:---------------------|:------------------------------|
|VogtmannE_2016 |MMRS11288076ST-27-0-0 |ERR1293500;ERR1293499;ERR12... |
|VogtmannE_2016 |MMRS11664448ST-27-0-0 |ERR1293861;ERR1293860;ERR12... |
|VogtmannE_2016 |MMRS11932626ST-27-0-0 |ERR1293881;ERR1293880;ERR12... |
|VogtmannE_2016 |MMRS12272136ST-27-0-0 |ERR1293877;ERR1293876;ERR12... |
|VogtmannE_2016 |MMRS14379078ST-27-0-0 |ERR1293548;ERR1293547;ERR12... |
|VogtmannE_2016 |MMRS14602194ST-27-0-0 |ERR1293813;ERR1293812;ERR12... |


However, I think your question is more about why this happens, and that is related to metagenomics and sequencing in general. Since shotgun sequencing produces short reads that are used to identify a bacterial clade, it is possible to sequence the same sample multiple times (i.e., via multiple runs) to improve read coverage and abundance quantification; I believe that is the case for the VogtmannE_2016 study. In any case, the NCBI_accession numbers are not of much consequence to users; we have simply included them to establish data provenance. They can be ignored unless you want to go back and check our work. Put simply, as you have noticed, there can be a one-to-many relationship between sample_id and NCBI_accession.
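
If you do want to work with the individual runs, here is a minimal sketch in base R of how the one-to-many relationship could be unpacked into one row per run. It uses a small toy data frame (the object name `meta` and the column name `run_accession` are my own choices for illustration; the values are copied from the question above) rather than the real sampleMetadata, so treat it as an illustration and not as part of the package.

```r
# Toy copy of two sampleMetadata rows; values copied from the question above.
# `meta` and `run_accession` are illustrative names, not package objects.
meta <- data.frame(
  sample_id = c("MMRS11288076ST-27-0-0", "MMRS11664448ST-27-0-0"),
  NCBI_accession = c(
    "ERR1293500;ERR1293499;ERR1293498;ERR1293497;ERR1293059;ERR1293058;ERR1293057;ERR1293056",
    "ERR1293861;ERR1293860;ERR1293859;ERR1293858;ERR1293420;ERR1293419;ERR1293418;ERR1293417"
  ),
  stringsAsFactors = FALSE
)

# Split each semicolon-delimited string into a character vector of runs
runs <- strsplit(meta$NCBI_accession, ";", fixed = TRUE)

# Build a long table with one row per (sample_id, run) pair
long <- data.frame(
  sample_id = rep(meta$sample_id, times = lengths(runs)),
  run_accession = unlist(runs),
  stringsAsFactors = FALSE
)

head(long)
```

The same split should work on the real NCBI_accession column, assuming it is a character vector with no missing values; rows with NA would need to be handled separately.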

@levi-waldron-3429
CUNY Graduate School of Public Health a…

And just to add a little bit, I believe this typically happens when a multiplexed library is sequenced across multiple lanes. The authors then upload the demultiplexed fastq files for each sample, one run accession (SRR) per lane. For example, if all the samples from a study were pooled and the pooled library was then sequenced in 4 lanes, each sample would end up with 4 SRRs. In cMD, the fastq files from these multiple SRRs per sample are concatenated.
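
One quick way to see how many runs went into each sample is to count the accessions per sample_id. The sketch below is only an illustration on a toy data frame: the first row is copied from the question above, while the second row (`HYPOTHETICAL_SAMPLE` / `HYPOTHETICAL_RUN`) is a made-up placeholder standing in for a sample with a single run. The column name `n_runs` is also my own.

```r
# Toy metadata: row 1 is copied from the question above; row 2 is a made-up
# placeholder (not a real sample or accession) representing a single-run sample.
meta <- data.frame(
  sample_id = c("MMRS11288076ST-27-0-0", "HYPOTHETICAL_SAMPLE"),
  NCBI_accession = c(
    "ERR1293500;ERR1293499;ERR1293498;ERR1293497;ERR1293059;ERR1293058;ERR1293057;ERR1293056",
    "HYPOTHETICAL_RUN"
  ),
  stringsAsFactors = FALSE
)

# Number of semicolon-delimited accessions (i.e. sequencing runs) per sample
meta$n_runs <- lengths(strsplit(meta$NCBI_accession, ";", fixed = TRUE))

meta[, c("sample_id", "n_runs")]
```

A constant run count across all samples of a study would be consistent with a pooled library sequenced over several lanes, although the count alone does not prove that.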


This makes sense to me. Thanks both for your replies.