Question: Discrepancies between sradb, sraruninfo
11 months ago by
langmea10
We noticed a large difference in the number of run accessions we get when we attempt to filter for human Illumina RNA-seq data in the SRA versus sradb. We first noticed this by comparing results from queries using sradb and using rentrez:

> ret <- dbGetQuery(sra_con, paste(
'SELECT sra.taxon_id, sra.run_accession FROM sra',
'WHERE sra.platform = "ILLUMINA"',
'  AND sra.library_strategy = "RNA-Seq"',
'  AND sra.library_source = "TRANSCRIPTOMIC"'))
> nrow(ret)
199100

> entrez_search(db="sra",
term="\"homo sapiens\"[Organism] \"rna seq\"[Strategy] transcriptomic[Source] illumina[Platform]",
retmax=0)
Entrez search result with 238656 hits (object contains 0 IDs and no web_history object)
Search term (as translated):  "Homo sapiens"[Organism] AND "rna seq"[Strategy] A ...


So that is 199100 versus 238656. The sradb numbers come from an sradb database downloaded fresh on or around 5/22.

We followed up by looking at results from SRARunInfo, pasting in the same search terms you can see in the entrez_search call above. We looked at some of the accessions that were present in those results but not in the results from sradb. Here are a couple of lines that look like they should have been picked up by our sradb query:

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
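For anyone who wants to reproduce the comparison, a set difference along these lines would list the runs that appear in the RunInfo export but not in the sradb result. This is only a sketch, not our exact script: the accessions and the trimmed-down column set are made up for illustration, and `ret` here is a toy stand-in for the data frame returned by the dbGetQuery call above. In practice `runinfo` would be read from the downloaded RunInfo CSV file instead of an inline string.

```r
# Toy stand-in for the SRA RunInfo CSV export (hypothetical accessions,
# only a handful of the real columns).
runinfo_csv <- "Run,LibraryStrategy,LibrarySource,Platform,TaxID
SRR0000001,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
SRR0000002,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606
SRR0000003,RNA-Seq,TRANSCRIPTOMIC,ILLUMINA,9606"
runinfo <- read.csv(text = runinfo_csv, stringsAsFactors = FALSE)

# Toy stand-in for `ret`, the result of the sradb query.
ret <- data.frame(
  taxon_id = c(9606L, 9606L),
  run_accession = c("SRR0000001", "SRR0000003"),
  stringsAsFactors = FALSE
)

# Runs present in RunInfo but absent from the sradb result.
missing_from_sradb <- setdiff(runinfo$Run, ret$run_accession)

# Pull the full RunInfo rows for those runs so their metadata can be inspected.
missing_rows <- runinfo[runinfo$Run %in% missing_from_sradb, ]
print(missing_rows)
```

Looking at the LibraryStrategy, LibrarySource, Platform and TaxID values of `missing_rows` is how we judged that these runs should have matched the sradb query.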


Do you have any ideas?

Best,

Ben

