Search
Question: Help: extract counts from RNA-seq .bam/.bai files
1
gravatar for mat149
16 months ago by
mat14920
mat14920 wrote:

Hello,

I have an inquiry on a paired-end RNA-seq experiment prepared from zebrafish livers.  I would like to identify differentially expressed transcripts (or genes) between (n = 5) "control" and (n = 5) "lepa" samples.

I am looking for a method that will construct a matrix of counts from (n = 10) .bam files and their corresponding indices (.bai).  The goal in doing this is to place the subsequent count matrix into the edgeR pipeline.  I have tried to do some reading in various vignettes but it is not clear to me on how to accomplish this.  The code I am using is appended below, but I am not sure on what to do next after opening the bam files and the zebrafish txdb object.  I hope that my question is clear; if I failed to explain something please let me know and I can elaborate.  Thanks for looking into this,

- Matt

library(GenomicAlignments)
library(TxDb.Drerio.UCSC.danRer10.refGene)

exbygene <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")

lepa1<-BamFile("C:\\...\\lepa1.bam",index="C:\\...\\lepa1.bai",asMates=TRUE)
lepa2<-BamFile("C:\\...\\lepa2.bam",index="C:\\...\\lepa2.bai",asMates=TRUE)
lepa3<-BamFile("C:\\...\\lepa3.bam",index="C:\\...\\lepa3.bai",asMates=TRUE)
lepa4<-BamFile("C:\\...\\lepa4.bam",index="C:\\...\\lepa4.bai",asMates=TRUE)
lepa5<-BamFile("C:\\...\\lepa5.bam",index="C:\\...\\lepa5.bai",asMates=TRUE)
wt1<-BamFile("C:\\...\\\wt1.bam",index="C:\\...\\wt1.bai",asMates=TRUE)
wt2<-BamFile("C:\\...\\\wt2.bam",index="C:\\...\\wt2.bai",asMates=TRUE)
wt3<-BamFile("C:\\...\\\wt3.bam",index="C:\\...\\wt3.bai",asMates=TRUE)
wt4<-BamFile("C:\\...\\\wt4.bam",index="C:\\...\\wt4.bai",asMates=TRUE)
wt5<-BamFile("C:\\...\\\wt5.bam",index="C:\\...\\wt5.bai",asMates=TRUE)

open.BamFile(lepa1)
open.BamFile(lepa2)
open.BamFile(lepa3)
open.BamFile(lepa4)
open.BamFile(lepa5)
open.BamFile(wt1)
open.BamFile(wt2)
open.BamFile(wt3)
open.BamFile(wt4)
open.BamFile(wt5)

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.12.0   Rsamtools_1.28.0          
 [3] Biostrings_2.44.0          XVector_0.16.0            
 [5] SummarizedExperiment_1.6.1 DelayedArray_0.2.2        
 [7] matrixStats_0.52.2         Biobase_2.36.2            
 [9] GenomicRanges_1.28.1       GenomeInfoDb_1.12.0       
[11] IRanges_2.10.0             S4Vectors_0.14.0          
[13] BiocGenerics_0.22.0        bamsignals_1.8.0          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10            lattice_0.20-35         bitops_1.0-6           
 [4] grid_3.4.0              zlibbioc_1.22.0         Matrix_1.2-10          
 [7] BiocParallel_1.10.1     tools_3.4.0             RCurl_1.95-4.8         
[10] compiler_3.4.0          GenomeInfoDbData_0.99.0

 

ADD COMMENTlink modified 16 months ago by Aaron Lun21k • written 16 months ago by mat14920
2
gravatar for James W. MacDonald
16 months ago by
United States
James W. MacDonald47k wrote:

Usually I do something like

library(Rsamtools)
library(GenomicAlignments)
bams <- dir("../", "bam$", full.names = TRUE)
bfl <- BamFileList(bams, asMates = TRUE)
ex <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")
ex <- reduce(ex)
SE <- summarizeOverlaps(ex, bfl, singleEnd = FALSE, param = ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE)))

Then assays(SE)[[1]] contains your count matrix. Whether or not you should remove duplicates is up to you. There are arguments for and against and I tend to do so on an ad hoc basis.

ADD COMMENTlink modified 16 months ago • written 16 months ago by James W. MacDonald47k
1
gravatar for Aaron Lun
16 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

I'll throw in a word for the featureCounts function in Rsubread.

https://www.bioconductor.org/help/workflows/RnaSeqGeneEdgeRQL/#quantifying-read-counts-for-each-gene

If you can get your TxDb object into a GTF file, it should be pretty easy to get it working.

ADD COMMENTlink modified 16 months ago • written 16 months ago by Aaron Lun21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 162 users visited in the last hour