Question

Help: extract counts from RNA-seq .bam/.bai files

1

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 4 weeks ago

United States

Hello,

I have an inquiry on a paired-end RNA-seq experiment prepared from zebrafish livers. I would like to identify differentially expressed transcripts (or genes) between (n = 5) "control" and (n = 5) "lepa" samples.

I am looking for a method that will construct a matrix of counts from (n = 10) .bam files and their corresponding indices (.bai). The goal in doing this is to place the subsequent count matrix into the edgeR pipeline. I have tried to do some reading in various vignettes but it is not clear to me on how to accomplish this. The code I am using is appended below, but I am not sure on what to do next after opening the bam files and the zebrafish txdb object. I hope that my question is clear; if I failed to explain something please let me know and I can elaborate. Thanks for looking into this,

- Matt

library(GenomicAlignments)
library(TxDb.Drerio.UCSC.danRer10.refGene)

exbygene <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")

lepa1<-BamFile("C:\\...\\lepa1.bam",index="C:\\...\\lepa1.bai",asMates=TRUE)
lepa2<-BamFile("C:\\...\\lepa2.bam",index="C:\\...\\lepa2.bai",asMates=TRUE)
lepa3<-BamFile("C:\\...\\lepa3.bam",index="C:\\...\\lepa3.bai",asMates=TRUE)
lepa4<-BamFile("C:\\...\\lepa4.bam",index="C:\\...\\lepa4.bai",asMates=TRUE)
lepa5<-BamFile("C:\\...\\lepa5.bam",index="C:\\...\\lepa5.bai",asMates=TRUE)
wt1<-BamFile("C:\\...\\\wt1.bam",index="C:\\...\\wt1.bai",asMates=TRUE)
wt2<-BamFile("C:\\...\\\wt2.bam",index="C:\\...\\wt2.bai",asMates=TRUE)
wt3<-BamFile("C:\\...\\\wt3.bam",index="C:\\...\\wt3.bai",asMates=TRUE)
wt4<-BamFile("C:\\...\\\wt4.bam",index="C:\\...\\wt4.bai",asMates=TRUE)
wt5<-BamFile("C:\\...\\\wt5.bam",index="C:\\...\\wt5.bai",asMates=TRUE)

open.BamFile(lepa1)
open.BamFile(lepa2)
open.BamFile(lepa3)
open.BamFile(lepa4)
open.BamFile(lepa5)
open.BamFile(wt1)
open.BamFile(wt2)
open.BamFile(wt3)
open.BamFile(wt4)
open.BamFile(wt5)

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] GenomicAlignments_1.12.0 Rsamtools_1.28.0
[3] Biostrings_2.44.0 XVector_0.16.0
[5] SummarizedExperiment_1.6.1 DelayedArray_0.2.2
[7] matrixStats_0.52.2 Biobase_2.36.2
[9] GenomicRanges_1.28.1 GenomeInfoDb_1.12.0
[11] IRanges_2.10.0 S4Vectors_0.14.0
[13] BiocGenerics_0.22.0 bamsignals_1.8.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 lattice_0.20-35 bitops_1.0-6
[4] grid_3.4.0 zlibbioc_1.22.0 Matrix_1.2-10
[7] BiocParallel_1.10.1 tools_3.4.0 RCurl_1.95-4.8
[10] compiler_3.4.0 GenomeInfoDbData_0.99.0

counts edger • 2.0k views

ADD COMMENT • link updated 7.6 years ago by Aaron Lun ★ 28k • written 7.6 years ago by mat149 ▴ 80

score 2 · Answer 1 · 2017-05-12

Usually I do something like

library(Rsamtools)
library(GenomicAlignments)
bams <- dir("../", "bam$", full.names = TRUE)
bfl <- BamFileList(bams, asMates = TRUE)
ex <- exonsBy(TxDb.Drerio.UCSC.danRer10.refGene, "gene")
ex <- reduce(ex)
SE <- summarizeOverlaps(ex, bfl, singleEnd = FALSE, param = ScanBamParam(flag = scanBamFlag(isDuplicate = FALSE)))

Then assays(SE)[[1]] contains your count matrix. Whether or not you should remove duplicates is up to you. There are arguments for and against and I tend to do so on an ad hoc basis.

score 1 · Answer 2 · 2017-05-12

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

I'll throw in a word for the featureCounts function in Rsubread.

https://www.bioconductor.org/help/workflows/RnaSeqGeneEdgeRQL/#quantifying-read-counts-for-each-gene

If you can get your TxDb object into a GTF file, it should be pretty easy to get it working.

ADD COMMENT • link 7.6 years ago Aaron Lun ★ 28k