Question

chromosome name in summarizeOverlaps for RNASeq data

0

Entering edit mode

bioprog • 0

@bioprog-7204

Last seen 10.0 years ago

For a while I used quantile normalization followed by t-test or cuffdiff for analysis of RNASeq. I would like to try DESeq2 given its better normalization method based on the readings. I am using your R package summarizeOverlaps to create count tables as I found it to be the easiest. However, I am encountering couple of issues. First,the ensemble-based gtf file that I used to align all my mouse RNAseq data has the chromosomes named as “1, 2 , 3”… Rather than “chr1 , chr2 ..”. My Testing R code for summarizeOverlaps is as follows:

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

TxDb_mm10=TxDb.Mmusculus.UCSC.mm10.ensGene

saveDb(TxDb_mm10, file="mm10.sqlite”)

mm10<-loadDb("mm10.sqlite”)

exbyGene<-exonsBy(mm10,by="gene”) #I assume this is assigning things by GENES ids (ENSG..)

exbyTx<-exonsBy(mm10,by="tx”) #I assume this is assigning things by Transcript IDS (ENST…)



fls <- list.files(“Path/To/BamFiles", pattern=".accepted_hits.bam", full= TRUE)

grp1<- fls[1]

grp2 <- fls[c(2,3)]

bamLst_grp1<-BamFileList(grp1,yieldSize=100000)

bamLst_grp2<-BamFileList(grp2,yieldSize=100000)

head(seqlevels(TxDb_mm10)) # this shows that the mm10 sqlite I am using has chr1, chr2 .. Format

seqinfo(bamLst_grp1) #this shows that the chromosome names for my testing bam files are 1, 2, 3…

se_grp1<-summarizeOverlaps(exbyGene,bamLst_grp1,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE)

se_grp2<-summarizeOverlaps(exbyGene,bamLst_grp2,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE)

##Saving environment

saveRDS(se_grp1, file="se_grp1.Rdata")

saveRDS(se_grp2, file="se_grp2.Rdata”)

While the script runs nicely (and please, advice me if I should optimize anything as I have basic understanding of each of the commands), when I tried to load the environment I am getting the following error:

Error: bad restore file magic number (file may be corrupted) -- no data loaded

In addition: Warning message:

file ‘se_grp2.Rdata’ has magic number 'X'

  Use of save versions prior to 2 is deprecated

Also, I am not sure if the error is because the chromosome names are incompatible between the reference and bam file. If that’s the case, how can I correct for this?

Lastly, can i add features to the table counts being generated? For instance, information regarding TSS, exact locus, gene biotype/function that are usually in the gtf, how can we add those to the count tables if I successfully generated a table.

summarizeoverlaps RNAseq annotation counts • 3.1k views

ADD COMMENT • link 10.0 years ago bioprog • 0

score 2 · Answer 1 · 2016-01-12

Unless you need to rename an object, you should just use save() rather than saveRDS(). Or if you want to use saveRDS(), then you have to use readRDS() rather than load(). This is of course documented in the help file for saveRDS().

> z <- 1:50
> saveRDS(z, "whatevs.Rdata")
> load("whatevs.Rdata")
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file  whatevs.Rdata  has magic number 'X'
  Use of save versions prior to 2 is deprecated
> zz <- readRDS("whatevs.Rdata")
> identical(z, zz)
[1] TRUE

To learn about sequence levels, see ?seqlevels and ?seqlevelsStyle. To learn about GRanges objects, see ?GRanges-class

score 0 · Answer 2 · 2016-01-12

0

Entering edit mode

bioprog • 0

@bioprog-7204

Last seen 10.0 years ago

Thank you for your reply regarding the save and load of RDS. that fixed it. In regards to the chromosome names, can you please confirm if my understanding is correct:

If I write old <- seqlevels(bamLst_grp1), I get:

[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "3" "4" "5" "6" "7" "8" "9" "MT"
[21] "X" "Y"

Since the library I downloaded has 'chr1, chr2 ...' based on seqlevels(mm1), I wrote the following:

seqlevels(mm10)
new <- c("chr1", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr2", "chr3", "chr4","chr5","chr6","chr7","chr8","chr9", "chrM", "chrX","chrY")

old<-seqlevels(bamLst_grp1)

seqinfo(bamLst_grp1, new2old=old, force=TRUE) <- new #i used force=TRUE as my understanding that is required to enforce this change?

That ran without an error. But is that what you would expect to correct for things? I dont know how to get the DESeq2 script to work yet so at least i want to confirm that this step is the correct one in order to account for any errors that will arise in the DESeq2 script later.

Also, cant i add more features for the count table other than just the geneId or name? Like the function, TSS...

ADD COMMENT • link 10.0 years ago bioprog • 0

0

Entering edit mode

First, you should respond to an answer by clicking on ADD COMMENT and then typing in the box that comes up. You only type in the box under 'Add your answer' if you are actually adding an answer.

That ran without an error. But is that what you would expect to correct for things?

A better question is whether or not that actually 'corrected for things' in your mind. Did you check the seqlevels after doing that? Did you look at the examples for seqlevels? Did you look at the help for seqlevelsStyle? While it is convenient to get help on this support site, the best recourse is to learn how to figure out things for yourself, which is why I pointed you to the help file.

Also, cant i add more features for the count table other than just the geneId or name? Like the function, TSS...

Probably. Here is a pointer. You used summarizeOverlaps() to do something, right? If you look at the help page for that function, you can see this, under the 'Value' section:

Value:

     A RangedSummarizedExperiment object. The  assays  slot holds the
     counts,  rowRanges  holds the annotation from  features .

Which will likely not mean much to you, but there is a help page for RangedSummarizedExperiment objects as well. Note that you sometimes have to add '-class' for objects (as compared to functions) to get the help. That's not true in this situation, but it's something to remember.

You can see the help page by doing ?RangedSummarizedExperiment. Or by reading the vignette for the SummarizedExperiment package.

I am purposefully not answering your questions here, because that won't help you. If you are planning to use R/Bioconductor to analyze data, you need to find out how to answer questions for yourself, as you cannot rely on the kindness of strangers. Figuring out how to figure things out is the best skill you can acquire.

ADD REPLY • link 10.0 years ago James W. MacDonald 68k

0

Entering edit mode

I understand where you are coming from and you are right. I should have tried seqlevels on BamLst after the change before posting. I did now and that didn't seem to fix the problem. I did go over the vignette in summarizedOverlaps but that didn't seem to help me fix the issue. I will look more into it.

Thanks-

ADD REPLY • link 10.0 years ago bioprog • 0