Discrepancies between sradb, sraruninfo
langmea ▴ 20

We noticed a large difference in the number of run accessions we get when we attempt to filter for human Illumina RNA-seq data in the SRA versus sradb. We first noticed this by comparing results from queries using sradb and using rentrez:

> ret <- dbGetQuery(sra_con, paste(
      'SELECT sra.taxon_id, sra.run_accession FROM sra',
      'WHERE sra.platform = "ILLUMINA"',
      '  AND sra.library_strategy = "RNA-Seq"',
      '  AND sra.library_source = "TRANSCRIPTOMIC"'))
> nrow(ret)
199100
> entrez_search(db="sra",
                term="\"homo sapiens\"[Organism] \"rna seq\"[Strategy] transcriptomic[Source] illumina[Platform]",
                retmax=0)
Entrez search result with 238656 hits (object contains 0 IDs and no web_history object)
 Search term (as translated):  "Homo sapiens"[Organism] AND "rna seq"[Strategy] A ...

So 199,100 runs from sradb versus 238,656 from Entrez. This was using an sradb database downloaded fresh on or around 5/22.
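Since the SQL above selects taxon_id but does not filter on it, a like-for-like count needs the sradb rows restricted to human first. Here is a minimal sketch of that step on a toy data frame shaped like `ret`. The accessions in `ret_toy` are hypothetical placeholders, and 9606 is the human TaxID as it appears in the SRARunInfo lines further down. It assumes taxon_id is directly comparable to 9606.

```r
# Toy stand-in for `ret`; the real one comes from the dbGetQuery call above.
# The run accessions here are made-up placeholders.
ret_toy <- data.frame(
  taxon_id = c(9606, 10090, 9606, NA),
  run_accession = c("SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"),
  stringsAsFactors = FALSE
)

# Keep only rows annotated as human; which() drops rows with a missing taxon_id.
human <- ret_toy[which(ret_toy$taxon_id == 9606), ]
n_human <- nrow(human)

# Tabulate taxon IDs, including missing ones, to see where the other rows go.
taxon_counts <- table(ret_toy$taxon_id, useNA = "ifany")
print(n_human)
print(taxon_counts)
```

With the real result, the same filter would be applied to `ret` itself before calling nrow().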

We followed up by looking at results from SRARunInfo, pasting in the same search terms as you can see in the entrez_search call above. We looked at some of the accessions that were present in those results but not in the results from sradb. Here are a couple of lines that look like they should have been picked up by our sradb query:

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR1088666,2015-10-28 13:15:30,2016-03-03 09:22:45,179818,107890800,0,600,57,GCF_000001405.25,https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR1088/ERR1088666,ERX1168220,14316381,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,450,0,ILLUMINA,Illumina MiSeq,ERP104024,PRJEB22346,,407322,ERS939075,SAMEA3631926,simple,9606,Homo sapiens,BCR_MV_PBMC6127932-sc-2319898,,,,,,,no,,,,,THE WELLCOME TRUST SANGER INSTITUTE,ERA526402,,public,EBD3F89D1F98C4422C5AD7FA02CECA38,5D0D3A82CEC289067EA6743E34AA2C2F
DRR097153,2018-01-09 15:33:50,2018-01-09 15:46:23,3461697,124621092,0,36,79,,https://sra-download.ncbi.nlm.nih.gov/traces/dra4/DRR/000094/DRR097153,DRX090649,bead_PC9_gefitinib_27_rep,RNA-Seq,PolyA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,DRP003981,PRJDB5904,,429378,DRS057296,SAMD00084300,simple,9606,Homo sapiens,DRS057296,,,,,,,no,,,,,UT-MGS,DRA005928,,public,ADC8A81E3A724C4CCD32078AE255BDC8,1A28D5349B272AEB5969DBE2B432D94F
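For reference, the comparison that surfaced these accessions can be sketched as a set difference between the Run column of the SRARunInfo CSV and the run accessions returned by sradb. This is only an illustration. `missing_runs` is a hypothetical helper, the CSV below is cut down to a few columns, and `SRR0000001` is a made-up accession standing in for a run that both sources contain.

```r
# Hypothetical helper: runs listed in SRARunInfo that sradb did not return.
missing_runs <- function(runinfo, sradb_runs) {
  setdiff(unique(runinfo$Run), sradb_runs)
}

# Cut-down SRARunInfo table; the first two rows echo the runs quoted above,
# the third is a made-up placeholder.
csv_text <- "Run,LibraryStrategy,LibrarySource,Platform,TaxID
ERR1088666,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
DRR097153,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
SRR0000001,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606"
runinfo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Stand-in for ret$run_accession from the sradb query.
sradb_runs <- c("SRR0000001")

missing <- missing_runs(runinfo, sradb_runs)
print(missing)
```

With real data, `runinfo` would be read from the downloaded SRARunInfo file and `sradb_runs` would be `ret$run_accession`. Looking up the missing accessions in the sra table without the platform, strategy and source conditions would then show whether sradb lacks those runs entirely or annotates them differently.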

Do you have any ideas?

Best,

Ben
