Question: Import bamfiles for RNAsequencing analysis
0
gravatar for Merlin
15 months ago by
Merlin 10
Vancouver
Merlin 10 wrote:

 

Hello Everyone, 
I'm new to RNAseq world and Rstudio in general and I want to analyze a couple of bamfiles from illumina. 
So far, I successfully followed tutorials for the analysis of RNAseq like this:

https://master.bioconductor.org/help/course-materials/2016/CSAMA/lab-3-rnaseq/rnaseq_gene_CSAMA2016.html


Unfortunately, all the tutorials available start by giving packages and descriptions about how to import into R but none of them shows how to import bamfiles not packaged.

I learned about how to  look into bamfile for example by running this commands:

 

library(Rsamtools)

bam1 <-"test.bam"
bam1 <-scanBam("test.bam")

bam1


so that I can look what's inside but after that I'm stopped because I don't know how to proceed. 
I would be super happy to receive some help, just for one step forward.

Thank you very much 

P

 

 

 

 

rnaseq R bam file • 760 views
ADD COMMENTlink modified 15 months ago by Gordon Smyth38k • written 15 months ago by Merlin 10

Can you say a bit more about what you are trying to do? What exactly do you mean by 'analyze a couple of bamfiles' and also what do you mean by 'import bamfiles not packaged'?

ADD REPLYlink written 15 months ago by James W. MacDonald50k

Hi James, 
I want to do differential gene expression analysis, same as described in the link but  using my own bamfiles.
The link provides a package named airway which is easy to import into R following the description but I can not do the same thing with my files.

Thanks 

 

ADD REPLYlink written 15 months ago by Merlin 10

Don't answer your own question unless you are answering your own question. Instead use the ADD REPLY button and type in the box that pops up.
 

ADD REPLYlink written 15 months ago by James W. MacDonald50k

ok sorry about that 

ADD REPLYlink written 15 months ago by Merlin 10
Answer: Import bamfiles for RNAsequencing analysis
0
gravatar for James W. MacDonald
15 months ago by
United States
James W. MacDonald50k wrote:

To import data from bam files, you usually start R in the directory that contains your bam files and then do something like

library(Rsamtools)
bfl <- BamFileList(dir(".", "bam$"), yieldsize = 2e6)

Which is a shorter way of doing something like this

## look for all files in the current directory that end with 'bam'
bamfiles <- dir(".", "bam$")

## load all those files into a BamFileList
bfl <- BamFileList(bamfiles, yieldsize = 2e6)

And then you would proceed with the rest of the RNA-Seq workflow.

ADD COMMENTlink written 15 months ago by James W. MacDonald50k

After setting the directory where my two bam files are located, I run the lines you gave me and this is the output:


> library(Rsamtools)

> fls <- dir(".", "bam$")
> bfl <- BamFileList(bamfiles, yieldsize = 2e6)
Error in validObject(.Object) : 
  invalid class “BamFileList” object: the 'listData' slot must be a list containing BamFile objects

 

There is something to adjust in the third line, probably I have to make a list first? 

thanks 

ADD REPLYlink written 15 months ago by Merlin 10

You are creating a character vector of your bam files (called fls) and then passing something else (called bamfiles) to your call to BamFileList. I don't know what bamfiles are in this context, so I can't really tell you anything.

Also note that I misspelled the yieldSize argument. R is case-sensitive, so using yieldsize won't work.
 

ADD REPLYlink written 15 months ago by James W. MacDonald50k

I adjusted the misspelled of yealdsize to yealdSize and seems that it's ok now:

> fls
[1] "sample1.bam"   "sample2.bam"

Can you confirm me that now I have imported the bam files and I can go for the next step? 

It should be an excel table with detailed informations for each of my samples named comma-separated value (CSV) 

Thank you 

ADD REPLYlink written 15 months ago by Merlin 10

You haven't imported the bam files. All you have done is created a BamFileList, and now if you use summarizeOverlaps it will automatically import the data from the bam files and count up the number of reads that overlap the exons for each of your genes.

Do note that any workflow that is intended for people to run will necessarily use convenient, artificially small data sets. This is true of the workflow you are using as a template. So as a high level overview, you now want to read in the data from your bam files and then count up all the reads that overlap any gene for your organism. So you need to have something that says where all the genes are.

The workflow uses an artificial example that you won't use. Instead you would likely use something like the TxDb.Hsapiens.UCSC.hg38.knownGene package, which has information about where all the genes in the hg38 genome are (assuming you are using human data). You would install and load that package, and then call exonsBy, just like in the workflow, but using the TxDb instead. You would then call summarizeOverlaps to get counts of all the reads in your two bam files that overlap any of the genes in the TxDb.

ADD REPLYlink written 15 months ago by James W. MacDonald50k

Hi James,
 

As you told me to do, I downloaded the package TxDb.Hsapiens.UCSC.hg38.knownGene  

but when I I go to create the txt database with this line

txdb <- makeTranscriptDbFromUCSC( genome='hg38', tablename='refGene' )
tx_by_exon <- exonsBy(txdb, 'tx')

I have this error:

Error: could not find function "makeTranscriptDbFromUCSC"

 

I noticed that many people had this issue with this line command but I still can't find the solution to go ahead. 
Do you have any suggestion? 

Thank you 

-P

ADD REPLYlink written 15 months ago by Merlin 10

You don't need to make a transcript database, because you just downloaded and installed one! The next step is

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)

 

ADD REPLYlink written 15 months ago by James W. MacDonald50k
Answer: Import bamfiles for RNAsequencing analysis
0
gravatar for Merlin
15 months ago by
Merlin 10
Vancouver
Merlin 10 wrote:

 

 

 

 

 

Hi, 
It has been days that I'm struggling to make this line working .. but I can't do by myself.

this is what I get

 

> library( "GenomicFeatures" )
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx_by_exon <- exonsBy(TxDb.Hsapiens,UCSC.hg38.knownGene)
Error in exonsBy(TxDb.Hsapiens, UCSC.hg38.knownGene) : 
  object 'TxDb.Hsapiens' not found 

exonBy is under GenomicFeature but it gives me errors, can you please help me with that?
 

Or maybe I can use a different way for counting reads in genes and make the matrice 

I will appreciate a lot 

thank you 

 

ADD COMMENTlink modified 15 months ago by Ryan C. Thompson7.3k • written 15 months ago by Merlin 10

There is a typo in the code. There is a comma between Hsapiens and UCSC, when there should be a period. Hence, the line of code should be:

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
ADD REPLYlink written 15 months ago by Ryan C. Thompson7.3k

 

Thank you for the answer,it's working now. 

In order to use summarizeOverlap I'm using this script :

 

> tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
> library( "GenomicAlignments" )
> se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",
+                          singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )
Error in summarizeOverlaps(exonsByGene, BamFileList(bamFiles), mode = "Union",  : 
  object 'exonsByGene' not found

 

I m not sure why do I get those error, is it because I'm missing packages? 

Thank you 

P

ADD REPLYlink modified 15 months ago • written 15 months ago by Merlin 10

I'm afraid I can't debug your code line by line. It looks to me like you are trying to cut and paste together pre-existing code from different sources, and that is not going to work. In this case, the summarizeOverlaps line refers to a variable named exonsByGene, but in your previous line of code, you called the variable tx_by_exon. If you are inexperienced in programming, I recommend you find some introductory tutorials and go through them to learn the basics of R programming before attempting more complex analyses.

ADD REPLYlink written 15 months ago by Ryan C. Thompson7.3k
Answer: Import bamfiles for RNAsequencing analysis
0
gravatar for Merlin
15 months ago by
Merlin 10
Vancouver
Merlin 10 wrote:

 

 

I messed up the topic, I changed something from the first posts. 
I will summarize here all my codes:


I created fist the bam files list:

list_bam <- lapply ( c("SRR1039508_subset.bam", "SRR1039512_subset.bam",

                       "SRR1039516_subset.bam","SRR1039520_subset.bam","SRR1039509_subset.bam",

                       "SRR1039513_subset.bam" , "SRR1039517_subset.bam","SRR1039521_subset.bam"  )   , scanBam )

 

I generated the full paths to the files I want to perform the counting operation on:

bamFiles <- file.path("/Users/patriziopanelli/Documents/BamFilesTest",sampleTable$fileName )

bamFiles

 

Launch this code by protocol

seqinfo( BamFile( bamFiles[1] ) )

 

I created a GRange list

txdb <-TxDb.Hsapiens.UCSC.hg38.knownGene

grl <- exonsBy(txdb, by="gene")

grl

 

 

And now I'm stuck here...

tx_by_exon <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)

library( "GenomicAlignments" )

se <- summarizeOverlaps( exonsByGene, BamFileList( bamFiles ), mode="Union",

                        singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE )

 


I can assure that it ain't been easy to reach this point, yes I m new to Rstudio and RNAseq analysis but to me every time I go further each step it's a great satisfaction and I don't want to get lost in tons or codes of R ..I'm trying to learn it by using it. 

I hope this post can clarify the work that I m doing ..or if there is something wrong, please tell me. My goal for now is to create the count matrice 

Thank you 

ADD COMMENTlink modified 15 months ago • written 15 months ago by Merlin 10
1

Please stop answering your own question with another question! There is a link at the bottom of each post that says either ADD COMMENT or ADD REPLY. If you are not answering your own question, you should be adding a comment or reply.

You seem to be completely at a loss as to what you should be doing. This support site is intended to help people that are stuck with some technical aspect rather than as an ongoing tutorial for those who are completely out of their depth. For people like that, there are two ways to proceed. There are vignettes and workflows, one of which is for DESeq2, and in section 2 there is a very detailed explanation of how to get counts. There is another workflow that uses a different set of tools here. If you can read those documents and start to get somewhere, then good for you. If you cannot get anywhere by reading those documents, then you should strongly consider that this might not be something you should be attempting, and instead find a local statistician with experience who can help you.

ADD REPLYlink written 15 months ago by James W. MacDonald50k
Answer: Import bamfiles for RNAsequencing analysis
0
gravatar for Gordon Smyth
15 months ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

The simplest and fastest Bioconductor way to get gene counts from aligned BAM files is to use featureCounts(). For example

files <- c("SRR1039508_subset.bam", "SRR1039512_subset.bam",
           "SRR1039516_subset.bam", "SRR1039520_subset.bam",
           "SRR1039509_subset.bam", "SRR1039513_subset.bam",
           "SRR1039517_subset.bam", "SRR1039521_subset.bam")
fc <- featureCounts(files, annot.inbuilt="hg38")

Then the matrix of counts is found in fc$counts.

ADD COMMENTlink modified 15 months ago • written 15 months ago by Gordon Smyth38k

Hi Gordon,
I used a different code but at the end the result is the same. 
Please find the attachment.
I also normalized in Rlog the reads by using this codes:
 

#Save the counts and statistics
write.table(x=data.frame(counts.75$annotation[,c("GeneID","Length")],counts.75$counts,stringsAsFactors=FALSE),file="counts.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)
write.table(x=data.frame(counts.75$stat,stringsAsFactors=FALSE),file="stat.<samples>.txt",quote=FALSE,sep="\t",row.names=FALSE)

counts_df.75 <- as.data.frame(counts.75$counts)
colnames(counts_df.75) <- rownames(sampleTable1)

#Normalization with rlog:
library("DESeq2")
dds <- DESeqDataSetFromMatrix(counts_df.75, colData=sampleTable1, design = ~1 )

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

rld <- rlog(dds, blind = TRUE)
colnames(rld) <- sampleTable1$EXTERNAL_ID



I'm looking forward for the next step.. 
Can you tell me what should I do to get the count matrice?

Thank you 

ADD REPLYlink modified 15 months ago • written 15 months ago by Merlin 10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 152 users visited in the last hour