Question: Discrepancies between sradb, sraruninfo
18 months ago by langmea10

We noticed a large difference in the number of run accessions we get when we attempt to filter for human Illumina RNA-seq data in the SRA versus sradb. We first noticed this by comparing results from queries using sradb and using rentrez:

> ret <- dbGetQuery(sra_con, paste(
      'SELECT sra.taxon_id, sra.run_accession FROM sra',
      'WHERE sra.platform = "ILLUMINA"',
      '  AND sra.library_strategy = "RNA-Seq"',
      '  AND sra.library_source = "TRANSCRIPTOMIC"'))
> nrow(ret)
199100
> entrez_search(db="sra",
                term="\"homo sapiens\"[Organism] \"rna seq\"[Strategy] transcriptomic[Source] illumina[Platform]",
                retmax=0)
Entrez search result with 238656 hits (object contains 0 IDs and no web_history object)
 Search term (as translated):  "Homo sapiens"[Organism] AND "rna seq"[Strategy] A ...

So that is 199100 from sradb versus 238656 from Entrez. This was using an sradb database downloaded fresh on 5/22 or thereabouts.

We followed up by looking at results from SRARunInfo, pasting in the same search terms you can see in the entrez_search call above. We looked at some of the accessions that were present in those results but not in the results from sradb. Here are a couple of lines that look like they should have been picked up by our sradb query:

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR1088666,2015-10-28 13:15:30,2016-03-03 09:22:45,179818,107890800,0,600,57,GCF_000001405.25,https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR1088/ERR1088666,ERX1168220,14316381,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,450,0,ILLUMINA,Illumina MiSeq,ERP104024,PRJEB22346,,407322,ERS939075,SAMEA3631926,simple,9606,Homo sapiens,BCR_MV_PBMC6127932-sc-2319898,,,,,,,no,,,,,THE WELLCOME TRUST SANGER INSTITUTE,ERA526402,,public,EBD3F89D1F98C4422C5AD7FA02CECA38,5D0D3A82CEC289067EA6743E34AA2C2F
DRR097153,2018-01-09 15:33:50,2018-01-09 15:46:23,3461697,124621092,0,36,79,,https://sra-download.ncbi.nlm.nih.gov/traces/dra4/DRR/000094/DRR097153,DRX090649,bead_PC9_gefitinib_27_rep,RNA-Seq,PolyA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,DRP003981,PRJDB5904,,429378,DRS057296,SAMD00084300,simple,9606,Homo sapiens,DRS057296,,,,,,,no,,,,,UT-MGS,DRA005928,,public,ADC8A81E3A724C4CCD32078AE255BDC8,1A28D5349B272AEB5969DBE2B432D94F
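
For anyone who wants to repeat the comparison, here is a minimal sketch of how the two accession lists can be diffed in R. It is an illustration only: the `ret` data frame is a hypothetical stand-in for the sradb query result above, the SRARunInfo text is trimmed to a few of its real columns, and the `SRR0000001` and `SRR0000002` accessions are made-up placeholders.

```r
# Hypothetical stand-in for the sradb query result (`ret` in the post).
# The SRR accessions are placeholders, not real runs.
ret <- data.frame(
  taxon_id = c(9606L, 9606L),
  run_accession = c("SRR0000001", "SRR0000002"),
  stringsAsFactors = FALSE
)

# A trimmed SRARunInfo export; only a handful of its columns are kept here.
runinfo_csv <- "Run,LibraryStrategy,LibrarySource,Platform,TaxID
ERR1088666,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
DRR097153,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
SRR0000001,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606"
runinfo <- read.csv(text = runinfo_csv, stringsAsFactors = FALSE)

# Runs listed by SRARunInfo but absent from the sradb result.
missing_from_sradb <- setdiff(runinfo$Run, ret$run_accession)

# Runs present in the sradb result but absent from SRARunInfo.
missing_from_runinfo <- setdiff(ret$run_accession, runinfo$Run)

print(missing_from_sradb)
print(missing_from_runinfo)
```

With the real data, `runinfo` would come from `read.csv()` on the downloaded SRARunInfo file, and `ret` from the `dbGetQuery()` call shown earlier.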

Do you have any ideas?

Best,

Ben
