Import bamfiles for RNAsequencing analysis
4
0
Entering edit mode
Merlin ▴ 10
@merlin-15723
Last seen 2.6 years ago
Vancouver

Hello Everyone,
I'm new to RNAseq world and Rstudio in general and I want to analyze a couple of bamfiles from illumina.
So far, I successfully followed tutorials for the analysis of RNAseq like this:

https://master.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html

Unfortunately, all the tutorials available start by giving packages and descriptions about how to import into R but none of them shows how to import bamfiles not packaged.

I learned about how to  look into bamfile for example by running this commands:

library(Rsamtools)

bam1 <-"test.bam"
bam1 <-scanBam("test.bam")

bam1



so that I can look what's inside but after that I'm stopped because I don't know how to proceed.
I would be super happy to receive some help, just for one step forward.

Thank you very much

P

R rnaseq bam file • 15k views
0
Entering edit mode

Can you say a bit more about what you are trying to do? What exactly do you mean by 'analyze a couple of bamfiles' and also what do you mean by 'import bamfiles not packaged'?

0
Entering edit mode

Hi James,
I want to do differential gene expression analysis, same as described in the link but  using my own bamfiles.
The link provides a package named airway which is easy to import into R following the description but I can not do the same thing with my files.

Thanks

0
Entering edit mode

0
Entering edit mode

0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

To import data from bam files, you usually start R in the directory that contains your bam files and then do something like

library(Rsamtools)
bfl <- BamFileList(dir(".", "bam$"), yieldsize = 2e6) Which is a shorter way of doing something like this ## look for all files in the current directory that end with 'bam' bamfiles <- dir(".", "bam$")

## load all those files into a BamFileList
bfl <- BamFileList(bamfiles, yieldsize = 2e6)

And then you would proceed with the rest of the RNA-Seq workflow.

0
Entering edit mode

After setting the directory where my two bam files are located, I run the lines you gave me and this is the output:

> library(Rsamtools)

> fls <- dir(".", "bam$") > bfl <- BamFileList(bamfiles, yieldsize = 2e6) Error in validObject(.Object) : invalid class “BamFileList” object: the 'listData' slot must be a list containing BamFile objects There is something to adjust in the third line, probably I have to make a list first? thanks ADD REPLY 0 Entering edit mode You are creating a character vector of your bam files (called fls) and then passing something else (called bamfiles) to your call to BamFileList. I don't know what bamfiles are in this context, so I can't really tell you anything. Also note that I misspelled the yieldSize argument. R is case-sensitive, so using yieldsize won't work. ADD REPLY 0 Entering edit mode I adjusted the misspelled of yealdsize to yealdSize and seems that it's ok now: > fls [1] "sample1.bam" "sample2.bam" Can you confirm me that now I have imported the bam files and I can go for the next step? It should be an excel table with detailed informations for each of my samples named comma-separated value (CSV) Thank you ADD REPLY 0 Entering edit mode You haven't imported the bam files. All you have done is created a BamFileList, and now if you use summarizeOverlaps it will automatically import the data from the bam files and count up the number of reads that overlap the exons for each of your genes. Do note that any workflow that is intended for people to run will necessarily use convenient, artificially small data sets. This is true of the workflow you are using as a template. So as a high level overview, you now want to read in the data from your bam files and then count up all the reads that overlap any gene for your organism. So you need to have something that says where all the genes are. The workflow uses an artificial example that you won't use. Instead you would likely use something like the TxDb.Hsapiens.UCSC.hg38.knownGene package, which has information about where all the genes in the hg38 genome are (assuming you are using human data). You would install and load that package, and then call exonsBy, just like in the workflow, but using the TxDb instead. You would then call summarizeOverlaps to get counts of all the reads in your two bam files that overlap any of the genes in the TxDb. ADD REPLY 0 Entering edit mode Hi James, As you told me to do, I downloaded the package TxDb.Hsapiens.UCSC.hg38.knownGene but when I I go to create the txt database with this line txdb <- makeTranscriptDbFromUCSC( genome='hg38', tablename='refGene' ) tx_by_exon <- exonsBy(txdb, 'tx') I have this error: Error: could not find function "makeTranscriptDbFromUCSC" I noticed that many people had this issue with this line command but I still can't find the solution to go ahead. Do you have any suggestion? Thank you -P ADD REPLY 0 Entering edit mode You don't need to make a transcript database, because you just downloaded and installed one! The next step is library(TxDb.Hsapiens.UCSC.hg38.knownGene) tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene) ADD REPLY 0 Entering edit mode Merlin ▴ 10 @merlin-15723 Last seen 2.6 years ago Vancouver Hi, It has been days that I'm struggling to make this line working .. but I can't do by myself. this is what I get > library( "GenomicFeatures" ) > library(TxDb.Hsapiens.UCSC.hg38.knownGene) > tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene) Error in exonsBy(TxDb.Hsapiens, UCSC.hg38.knownGene) : object 'TxDb.Hsapiens' not found  exonBy is under GenomicFeature but it gives me errors, can you please help me with that? Or maybe I can use a different way for counting reads in genes and make the matrice I will appreciate a lot thank you ADD COMMENT 0 Entering edit mode There is a typo in the code. There is a comma between Hsapiens and UCSC, when there should be a period. Hence, the line of code should be: tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene) ADD REPLY 0 Entering edit mode Thank you for the answer,it's working now. In order to use summarizeOverlap I'm using this script : > tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene) > library( "GenomicAlignments" ) > se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union", + singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) Error in summarizeOverlaps(exonsByGene, BamFileList(bamFiles), mode = "Union", : object 'exonsByGene' not found I m not sure why do I get those error, is it because I'm missing packages? Thank you P ADD REPLY 0 Entering edit mode I'm afraid I can't debug your code line by line. It looks to me like you are trying to cut and paste together pre-existing code from different sources, and that is not going to work. In this case, the summarizeOverlaps line refers to a variable named exonsByGene, but in your previous line of code, you called the variable tx_by_exon. If you are inexperienced in programming, I recommend you find some introductory tutorials and go through them to learn the basics of R programming before attempting more complex analyses. ADD REPLY 0 Entering edit mode Merlin ▴ 10 @merlin-15723 Last seen 2.6 years ago Vancouver I messed up the topic, I changed something from the first posts. I will summarize here all my codes: I created fist the bam files list: list_bam <- lapply ( c("SRR1039508_subset.bam", "SRR1039512_subset.bam", "SRR1039516_subset.bam","SRR1039520_subset.bam","SRR1039509_subset.bam", "SRR1039513_subset.bam" , "SRR1039517_subset.bam","SRR1039521_subset.bam" ) , scanBam ) I generated the full paths to the files I want to perform the counting operation on: bamFiles <- file.path("/Users/patriziopanelli/Documents/BamFilesTest",sampleTable$fileName )

bamFiles

Launch this code by protocol

seqinfo( BamFile( bamFiles[1] ) )

I created a GRange list

txdb <-TxDb.Hsapiens.UCSC.hg38.knownGene

grl <- exonsBy(txdb, by="gene")

grl

And now I'm stuck here...

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

library( "GenomicAlignments" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",

singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )

I can assure that it ain't been easy to reach this point, yes I m new to Rstudio and RNAseq analysis but to me every time I go further each step it's a great satisfaction and I don't want to get lost in tons or codes of R ..I'm trying to learn it by using it.

I hope this post can clarify the work that I m doing ..or if there is something wrong, please tell me. My goal for now is to create the count matrice

Thank you

1
Entering edit mode

You seem to be completely at a loss as to what you should be doing. This support site is intended to help people that are stuck with some technical aspect rather than as an ongoing tutorial for those who are completely out of their depth. For people like that, there are two ways to proceed. There are vignettes and workflows, one of which is for DESeq2, and in section 2 there is a very detailed explanation of how to get counts. There is another workflow that uses a different set of tools here. If you can read those documents and start to get somewhere, then good for you. If you cannot get anywhere by reading those documents, then you should strongly consider that this might not be something you should be attempting, and instead find a local statistician with experience who can help you.

0
Entering edit mode
@gordon-smyth
Last seen 6 minutes ago
WEHI, Melbourne, Australia

The simplest and fastest Bioconductor way to get gene counts from aligned BAM files is to use featureCounts(). For example

files <- c("SRR1039508_subset.bam", "SRR1039512_subset.bam",
"SRR1039516_subset.bam", "SRR1039520_subset.bam",
"SRR1039509_subset.bam", "SRR1039513_subset.bam",
"SRR1039517_subset.bam", "SRR1039521_subset.bam")
fc <- featureCounts(files, annot.inbuilt="hg38")

Then the matrix of counts is found in fc$counts. ADD COMMENT 0 Entering edit mode Hi Gordon, I used a different code but at the end the result is the same. Please find the attachment. I also normalized in Rlog the reads by using this codes: #Save the counts and statistics write.table(x=data.frame(counts.75$annotation[,c("GeneID","Length")],counts.75$counts,stringsAsFactors=FALSE),file="counts.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE) write.table(x=data.frame(counts.75$stat,stringsAsFactors=FALSE),file="stat.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)

counts_df.75 <- as.data.frame(counts.75$counts) colnames(counts_df.75) <- rownames(sampleTable1) #Normalization with rlog: library("DESeq2") dds <- DESeqDataSetFromMatrix(counts_df.75, colData=sampleTable1, design = ~1 ) nrow(dds) dds <- dds[ rowSums(counts(dds)) > 1, ] nrow(dds) rld <- rlog(dds, blind = TRUE) colnames(rld) <- sampleTable1$EXTERNAL_ID

I'm looking forward for the next step..
Can you tell me what should I do to get the count matrice?

Thank you