Import bamfiles for RNAsequencing analysis
4
0
Entering edit mode
Merlin ▴ 10
@merlin-15723
Last seen 4.3 years ago
Vancouver

 

Hello Everyone, 
I'm new to RNAseq world and Rstudio in general and I want to analyze a couple of bamfiles from illumina. 
So far, I successfully followed tutorials for the analysis of RNAseq like this:

https://master.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html


Unfortunately, all the tutorials available start by giving packages and descriptions about how to import into R but none of them shows how to import bamfiles not packaged.

I learned about how to  look into bamfile for example by running this commands:

 

library(Rsamtools)

bam1 <-"test.bam"
bam1 <-scanBam("test.bam")

bam1


so that I can look what's inside but after that I'm stopped because I don't know how to proceed. 
I would be super happy to receive some help, just for one step forward.

Thank you very much 

P

 

 

 

 

R rnaseq bam file • 25k views
ADD COMMENT
0
Entering edit mode

Can you say a bit more about what you are trying to do? What exactly do you mean by 'analyze a couple of bamfiles' and also what do you mean by 'import bamfiles not packaged'?

ADD REPLY
0
Entering edit mode

Hi James, 
I want to do differential gene expression analysis, same as described in the link but  using my own bamfiles.
The link provides a package named airway which is easy to import into R following the description but I can not do the same thing with my files.

Thanks 

 

ADD REPLY
0
Entering edit mode

Don't answer your own question unless you are answering your own question. Instead use the ADD REPLY button and type in the box that pops up.
 

ADD REPLY
0
Entering edit mode

ok sorry about that 

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 23 hours ago
United States

To import data from bam files, you usually start R in the directory that contains your bam files and then do something like

library(Rsamtools)
bfl <- BamFileList(dir(".", "bam$"), yieldsize = 2e6)

Which is a shorter way of doing something like this

## look for all files in the current directory that end with 'bam'
bamfiles <- dir(".", "bam$")

## load all those files into a BamFileList
bfl <- BamFileList(bamfiles, yieldsize = 2e6)

And then you would proceed with the rest of the RNA-Seq workflow.

ADD COMMENT
0
Entering edit mode

After setting the directory where my two bam files are located, I run the lines you gave me and this is the output:


> library(Rsamtools)

> fls <- dir(".", "bam$")
> bfl <- BamFileList(bamfiles, yieldsize = 2e6)
Error in validObject(.Object) : 
  invalid class “BamFileList” object: the 'listData' slot must be a list containing BamFile objects

 

There is something to adjust in the third line, probably I have to make a list first? 

thanks 

ADD REPLY
0
Entering edit mode

You are creating a character vector of your bam files (called fls) and then passing something else (called bamfiles) to your call to BamFileList. I don't know what bamfiles are in this context, so I can't really tell you anything.

Also note that I misspelled the yieldSize argument. R is case-sensitive, so using yieldsize won't work.
 

ADD REPLY
0
Entering edit mode

I adjusted the misspelled of yealdsize to yealdSize and seems that it's ok now:

> fls
[1] "sample1.bam"   "sample2.bam"

Can you confirm me that now I have imported the bam files and I can go for the next step? 

It should be an excel table with detailed informations for each of my samples named comma-separated value (CSV) 

Thank you 

ADD REPLY
0
Entering edit mode

You haven't imported the bam files. All you have done is created a BamFileList, and now if you use summarizeOverlaps it will automatically import the data from the bam files and count up the number of reads that overlap the exons for each of your genes.

Do note that any workflow that is intended for people to run will necessarily use convenient, artificially small data sets. This is true of the workflow you are using as a template. So as a high level overview, you now want to read in the data from your bam files and then count up all the reads that overlap any gene for your organism. So you need to have something that says where all the genes are.

The workflow uses an artificial example that you won't use. Instead you would likely use something like the TxDb.Hsapiens.UCSC.hg38.knownGene package, which has information about where all the genes in the hg38 genome are (assuming you are using human data). You would install and load that package, and then call exonsBy, just like in the workflow, but using the TxDb instead. You would then call summarizeOverlaps to get counts of all the reads in your two bam files that overlap any of the genes in the TxDb.

ADD REPLY
0
Entering edit mode

Hi James,
 

As you told me to do, I downloaded the package TxDb.Hsapiens.UCSC.hg38.knownGene  

but when I I go to create the txt database with this line

txdb <- makeTranscriptDbFromUCSC( genome='hg38', tablename='refGene' )
tx_by_exon <- exonsBy(txdb, 'tx')

I have this error:

Error: could not find function "makeTranscriptDbFromUCSC"

 

I noticed that many people had this issue with this line command but I still can't find the solution to go ahead. 
Do you have any suggestion? 

Thank you 

-P

ADD REPLY
0
Entering edit mode

You don't need to make a transcript database, because you just downloaded and installed one! The next step is

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)

 

ADD REPLY
0
Entering edit mode
Merlin ▴ 10
@merlin-15723
Last seen 4.3 years ago
Vancouver

 

 

 

 

 

Hi, 
It has been days that I'm struggling to make this line working .. but I can't do by myself.

this is what I get

 

> library( "GenomicFeatures" )
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)
Error in exonsBy(TxDb.Hsapiens, UCSC.hg38.knownGene) : 
  object 'TxDb.Hsapiens' not found 

exonBy is under GenomicFeature but it gives me errors, can you please help me with that?
 

Or maybe I can use a different way for counting reads in genes and make the matrice 

I will appreciate a lot 

thank you 

 

ADD COMMENT
0
Entering edit mode

There is a typo in the code. There is a comma between Hsapiens and UCSC, when there should be a period. Hence, the line of code should be:

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
ADD REPLY
0
Entering edit mode

 

Thank you for the answer,it's working now. 

In order to use summarizeOverlap I'm using this script :

 

> tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
> library( "GenomicAlignments" )
> se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",
+                          singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )
Error in summarizeOverlaps(exonsByGene, BamFileList(bamFiles), mode = "Union",  : 
  object 'exonsByGene' not found

 

I m not sure why do I get those error, is it because I'm missing packages? 

Thank you 

P

ADD REPLY
0
Entering edit mode

I'm afraid I can't debug your code line by line. It looks to me like you are trying to cut and paste together pre-existing code from different sources, and that is not going to work. In this case, the summarizeOverlaps line refers to a variable named exonsByGene, but in your previous line of code, you called the variable tx_by_exon. If you are inexperienced in programming, I recommend you find some introductory tutorials and go through them to learn the basics of R programming before attempting more complex analyses.

ADD REPLY
0
Entering edit mode
Merlin ▴ 10
@merlin-15723
Last seen 4.3 years ago
Vancouver

 

 

I messed up the topic, I changed something from the first posts. 
I will summarize here all my codes:


I created fist the bam files list:

list_bam <- lapply ( c("SRR1039508_subset.bam", "SRR1039512_subset.bam",

                       "SRR1039516_subset.bam","SRR1039520_subset.bam","SRR1039509_subset.bam",

                       "SRR1039513_subset.bam" , "SRR1039517_subset.bam","SRR1039521_subset.bam"  )   , scanBam )

 

I generated the full paths to the files I want to perform the counting operation on:

bamFiles <- file.path("/Users/patriziopanelli/Documents/BamFilesTest",sampleTable$fileName )

bamFiles

 

Launch this code by protocol

seqinfo( BamFile( bamFiles[1] ) )

 

I created a GRange list

txdb <-TxDb.Hsapiens.UCSC.hg38.knownGene

grl <- exonsBy(txdb, by="gene")

grl

 

 

And now I'm stuck here...

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

library( "GenomicAlignments" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",

                        singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )

 


I can assure that it ain't been easy to reach this point, yes I m new to Rstudio and RNAseq analysis but to me every time I go further each step it's a great satisfaction and I don't want to get lost in tons or codes of R ..I'm trying to learn it by using it. 

I hope this post can clarify the work that I m doing ..or if there is something wrong, please tell me. My goal for now is to create the count matrice 

Thank you 

ADD COMMENT
1
Entering edit mode

Please stop answering your own question with another question! There is a link at the bottom of each post that says either ADD COMMENT or ADD REPLY. If you are not answering your own question, you should be adding a comment or reply.

You seem to be completely at a loss as to what you should be doing. This support site is intended to help people that are stuck with some technical aspect rather than as an ongoing tutorial for those who are completely out of their depth. For people like that, there are two ways to proceed. There are vignettes and workflows, one of which is for DESeq2, and in section 2 there is a very detailed explanation of how to get counts. There is another workflow that uses a different set of tools here. If you can read those documents and start to get somewhere, then good for you. If you cannot get anywhere by reading those documents, then you should strongly consider that this might not be something you should be attempting, and instead find a local statistician with experience who can help you.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

The simplest and fastest Bioconductor way to get gene counts from aligned BAM files is to use featureCounts(). For example

files <- c("SRR1039508_subset.bam", "SRR1039512_subset.bam",
           "SRR1039516_subset.bam", "SRR1039520_subset.bam",
           "SRR1039509_subset.bam", "SRR1039513_subset.bam",
           "SRR1039517_subset.bam", "SRR1039521_subset.bam")
fc <- featureCounts(files, annot.inbuilt="hg38")

Then the matrix of counts is found in fc$counts.

ADD COMMENT
0
Entering edit mode

Hi Gordon,
I used a different code but at the end the result is the same. 
Please find the attachment.
I also normalized in Rlog the reads by using this codes:
 

#Save the counts and statistics
write.table(x=data.frame(counts.75$annotation[,c("GeneID","Length")],counts.75$counts,stringsAsFactors=FALSE),file="counts.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)
write.table(x=data.frame(counts.75$stat,stringsAsFactors=FALSE),file="stat.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)

counts_df.75 <- as.data.frame(counts.75$counts)
colnames(counts_df.75) <- rownames(sampleTable1)

#Normalization with rlog:
library("DESeq2")
dds <- DESeqDataSetFromMatrix(counts_df.75, colData=sampleTable1, design = ~1 )

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

rld <- rlog(dds, blind = TRUE)
colnames(rld) <- sampleTable1$EXTERNAL_ID



I'm looking forward for the next step.. 
Can you tell me what should I do to get the count matrice?

Thank you 

ADD REPLY

Login before adding your answer.

Traffic: 644 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6