For a while I used quantile normalization followed by t-test or cuffdiff for analysis of RNASeq. I would like to try DESeq2 given its better normalization method based on the readings. I am using your R package summarizeOverlaps to create count tables as I found it to be the easiest. However, I am encountering couple of issues. First,the ensemble-based gtf file that I used to align all my mouse RNAseq data has the chromosomes named as “1, 2 , 3”… Rather than “chr1 , chr2 ..”. My Testing R code for summarizeOverlaps is as follows:
library(TxDb.Mmusculus.UCSC.mm10.ensGene) TxDb_mm10=TxDb.Mmusculus.UCSC.mm10.ensGene saveDb(TxDb_mm10, file="mm10.sqlite”) mm10<-loadDb("mm10.sqlite”) exbyGene<-exonsBy(mm10,by="gene”) #I assume this is assigning things by GENES ids (ENSG..) exbyTx<-exonsBy(mm10,by="tx”) #I assume this is assigning things by Transcript IDS (ENST…) fls <- list.files(“Path/To/BamFiles", pattern=".accepted_hits.bam", full= TRUE) grp1<- fls grp2 <- fls[c(2,3)] bamLst_grp1<-BamFileList(grp1,yieldSize=100000) bamLst_grp2<-BamFileList(grp2,yieldSize=100000) head(seqlevels(TxDb_mm10)) # this shows that the mm10 sqlite I am using has chr1, chr2 .. Format seqinfo(bamLst_grp1) #this shows that the chromosome names for my testing bam files are 1, 2, 3… se_grp1<-summarizeOverlaps(exbyGene,bamLst_grp1,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE) se_grp2<-summarizeOverlaps(exbyGene,bamLst_grp2,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE) ##Saving environment saveRDS(se_grp1, file="se_grp1.Rdata") saveRDS(se_grp2, file="se_grp2.Rdata”)
While the script runs nicely (and please, advice me if I should optimize anything as I have basic understanding of each of the commands), when I tried to load the environment I am getting the following error:
Error: bad restore file magic number (file may be corrupted) -- no data loaded In addition: Warning message: file ‘se_grp2.Rdata’ has magic number 'X' Use of save versions prior to 2 is deprecated
Also, I am not sure if the error is because the chromosome names are incompatible between the reference and bam file. If that’s the case, how can I correct for this?
Lastly, can i add features to the table counts being generated? For instance, information regarding TSS, exact locus, gene biotype/function that are usually in the gtf, how can we add those to the count tables if I successfully generated a table.