We see a large difference in the number of run accessions returned when we filter for human Illumina RNA-seq data in the SRA itself versus in sradb. We first noticed this when comparing the results of a query using sradb with one using rentrez:
> ret <- dbGetQuery(sra_con, paste(
'SELECT sra.taxon_id, sra.run_accession FROM sra',
'WHERE sra.platform = "ILLUMINA"',
' AND sra.library_strategy = "RNA-Seq"',
' AND sra.library_source = "TRANSCRIPTOMIC"'))
> nrow(ret)
199100
> entrez_search(db="sra",
term="\"homo sapiens\"[Organism] \"rna seq\"[Strategy] transcriptomic[Source] illumina[Platform]",
retmax=0)
Entrez search result with 238656 hits (object contains 0 IDs and no web_history object)
Search term (as translated): "Homo sapiens"[Organism] AND "rna seq"[Strategy] A ...
So 199100 versus 238656. This was using a sradb database downloaded fresh on 5/22 or thereabouts.
We followed up by looking at results from SRARunInfo, pasting in the same search terms that you can see in the entrez_search call above. We looked at some of the accessions that were present in those results but not in the results from sradb. Here are a couple of lines that look like they should have been picked up by our sradb query:
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR1088666,2015-10-28 13:15:30,2016-03-03 09:22:45,179818,107890800,0,600,57,GCF_000001405.25,https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR1088/ERR1088666,ERX1168220,14316381,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,450,0,ILLUMINA,Illumina MiSeq,ERP104024,PRJEB22346,,407322,ERS939075,SAMEA3631926,simple,9606,Homo sapiens,BCR_MV_PBMC6127932-sc-2319898,,,,,,,no,,,,,THE WELLCOME TRUST SANGER INSTITUTE,ERA526402,,public,EBD3F89D1F98C4422C5AD7FA02CECA38,5D0D3A82CEC289067EA6743E34AA2C2F
DRR097153,2018-01-09 15:33:50,2018-01-09 15:46:23,3461697,124621092,0,36,79,,https://sra-download.ncbi.nlm.nih.gov/traces/dra4/DRR/000094/DRR097153,DRX090649,bead_PC9_gefitinib_27_rep,RNA-Seq,PolyA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,DRP003981,PRJDB5904,,429378,DRS057296,SAMD00084300,simple,9606,Homo sapiens,DRS057296,,,,,,,no,,,,,UT-MGS,DRA005928,,public,ADC8A81E3A724C4CCD32078AE255BDC8,1A28D5349B272AEB5969DBE2B432D94F
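For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code we ran: the helper name `missing_from_sradb` is hypothetical, the toy data frame stands in for the SRARunInfo CSV export, and `SRR0000001` is a placeholder accession, not a real result.

```r
# Sketch only: return the run accessions that appear in the SRARunInfo
# export but not in the sradb query result.
# `runinfo` is assumed to be a data frame with a `Run` column (as in the
# CSV header above); `sradb_runs` is assumed to be a character vector such
# as ret$run_accession.
missing_from_sradb <- function(runinfo, sradb_runs) {
  setdiff(unique(runinfo$Run), unique(sradb_runs))
}

# Toy stand-in for read.csv(<SRARunInfo export>); SRR0000001 is a placeholder.
runinfo <- data.frame(
  Run = c("ERR1088666", "DRR097153", "SRR0000001"),
  stringsAsFactors = FALSE
)

# Toy stand-in for ret$run_accession from the dbGetQuery call.
sradb_runs <- c("SRR0000001")

missing <- missing_from_sradb(runinfo, sradb_runs)
print(missing)
```

With the real data, `runinfo` would come from reading the downloaded SRARunInfo file and `sradb_runs` from the query result, after restricting both to human (TaxID 9606).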
Do you have any ideas?
Best,
Ben