Help: extract counts from RNA-seq .bam/.bai files
2
1
Entering edit mode
mat149 ▴ 70
@mat149-11450
Last seen 9 hours ago
United States

Hello,

I have an inquiry on a paired-end RNA-seq experiment prepared from zebrafish livers.  I would like to identify differentially expressed transcripts (or genes) between (n = 5) "control" and (n = 5) "lepa" samples.

I am looking for a method that will construct a matrix of counts from (n = 10) .bam files and their corresponding indices (.bai).  The goal in doing this is to place the subsequent count matrix into the edgeR pipeline.  I have tried to do some reading in various vignettes but it is not clear to me on how to accomplish this.  The code I am using is appended below, but I am not sure on what to do next after opening the bam files and the zebrafish txdb object.  I hope that my question is clear; if I failed to explain something please let me know and I can elaborate.  Thanks for looking into this,

- Matt

library(GenomicAlignments)
library(TxDb.Drerio.UCSC.danRer10.refGene)

exbygene <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")

lepa1<-BamFile("C:\\...\\lepa1.bam",index="C:\\...\\lepa1.bai",asMates=TRUE)
lepa2<-BamFile("C:\\...\\lepa2.bam",index="C:\\...\\lepa2.bai",asMates=TRUE)
lepa3<-BamFile("C:\\...\\lepa3.bam",index="C:\\...\\lepa3.bai",asMates=TRUE)
lepa4<-BamFile("C:\\...\\lepa4.bam",index="C:\\...\\lepa4.bai",asMates=TRUE)
lepa5<-BamFile("C:\\...\\lepa5.bam",index="C:\\...\\lepa5.bai",asMates=TRUE)
wt1<-BamFile("C:\\...\\\wt1.bam",index="C:\\...\\wt1.bai",asMates=TRUE)
wt2<-BamFile("C:\\...\\\wt2.bam",index="C:\\...\\wt2.bai",asMates=TRUE)
wt3<-BamFile("C:\\...\\\wt3.bam",index="C:\\...\\wt3.bai",asMates=TRUE)
wt4<-BamFile("C:\\...\\\wt4.bam",index="C:\\...\\wt4.bai",asMates=TRUE)
wt5<-BamFile("C:\\...\\\wt5.bam",index="C:\\...\\wt5.bai",asMates=TRUE)

open.BamFile(lepa1)
open.BamFile(lepa2)
open.BamFile(lepa3)
open.BamFile(lepa4)
open.BamFile(lepa5)
open.BamFile(wt1)
open.BamFile(wt2)
open.BamFile(wt3)
open.BamFile(wt4)
open.BamFile(wt5)

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.12.0   Rsamtools_1.28.0          
 [3] Biostrings_2.44.0          XVector_0.16.0            
 [5] SummarizedExperiment_1.6.1 DelayedArray_0.2.2        
 [7] matrixStats_0.52.2         Biobase_2.36.2            
 [9] GenomicRanges_1.28.1       GenomeInfoDb_1.12.0       
[11] IRanges_2.10.0             S4Vectors_0.14.0          
[13] BiocGenerics_0.22.0        bamsignals_1.8.0          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10            lattice_0.20-35         bitops_1.0-6           
 [4] grid_3.4.0              zlibbioc_1.22.0         Matrix_1.2-10          
 [7] BiocParallel_1.10.1     tools_3.4.0             RCurl_1.95-4.8         
[10] compiler_3.4.0          GenomeInfoDbData_0.99.0

 

counts edger • 1.8k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

Usually I do something like

library(Rsamtools)
library(GenomicAlignments)
bams <- dir("../", "bam$", full.names = TRUE)
bfl <- BamFileList(bams, asMates = TRUE)
ex <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")
ex <- reduce(ex)
SE <- summarizeOverlaps(ex, bfl, singleEnd = FALSE, param = ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE)))

Then assays(SE)[[1]] contains your count matrix. Whether or not you should remove duplicates is up to you. There are arguments for and against and I tend to do so on an ad hoc basis.

ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 17 hours ago
The city by the bay

I'll throw in a word for the featureCounts function in Rsubread.

https://www.bioconductor.org/help/workflows/RnaSeqGeneEdgeRQL/#quantifying-read-counts-for-each-gene

If you can get your TxDb object into a GTF file, it should be pretty easy to get it working.

ADD COMMENT

Login before adding your answer.

Traffic: 528 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6