Issue during transcript variant analysis using rnaseqDTU-DRIMseq-DEXseq.
1
0
Entering edit mode
Last seen 2.5 years ago

HI, I am analysing differential transcript usage on rna seq data and I am stuck at this point- It seems to me that issue is in the ctsgenco object where a single row contains multiple transcript ids but dont know how to correct it.

> filesgenco <- list.files( pattern = ".txt",full.names = TRUE)
> filesgenco
[1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt"
> names(filesgenco)
NULL
> mycols
condition sample_id
HECON1 control      HECON1
HECON2 control      HECON2
HETRE1 treated      HETRE1
HETRE2 treated      HETRE2
> names(filesgenco) <- mycols$sampleID > filesgenco [1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt" > txigenco <- tximport(filesgenco, type="salmon", txOut=TRUE, + countsFromAbundance="scaledTPM") reading in files with read_tsv 1 2 3 4 > > ctsgenco <- txigenco$counts
> ctsgenco <- ctsgenco[rowSums(ctsgenco) > 0,]
> gtf <- "gencode.v29.annotation.gtf"
> txdfgenco <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
'select()' returned 1:many mapping between keys and columns
> tabgenco <- table(txdfgenco$GENEID) > txdfgenco$ntx <- tabgenco[match(txdfgenco$GENEID, names(tabgenco))] > head(txdfgenco) GENEID TXNAME ntx 1 ENSG00000000003.14 ENST00000612152.4 5 2 ENSG00000000003.14 ENST00000373020.8 5 3 ENSG00000000003.14 ENST00000614008.4 5 4 ENSG00000000003.14 ENST00000496771.5 5 5 ENSG00000000003.14 ENST00000494424.1 5 6 ENSG00000000005.5 ENST00000373031.4 2 > ctsgenco[1:3,1:3] [,1] ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.6308597 ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.0000000 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 28.3130559 [,2] ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.6636265 ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.0825407 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 18.5500264 [,3] ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| 0.00000 ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 0.00000 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 54.07508 > range(colSums(cts)/1e6) [1] 16.26968 26.21900 > range(colSums(ctsgenco)/1e6) [1] 12.64907 17.67470 > all(rownames(ctsgenco) %in% txdfgenco$TXNAME)
[1] FALSE


Any help/suggestion would be appreciated.

rnaseqDTU DRIMseq DEXseq rna-seq • 684 views
0
Entering edit mode
@mikelove
Last seen 11 hours ago
United States

Please post all of your code. See the posting guide for suggestions on how to formulate a post. Also you may want to tag the name of the Bioconductor package you are using, DRIMSeq? DEXSeq?

0
Entering edit mode

Thank you Michael. I have just corrected my post and have added entire code I am using.

0
Entering edit mode

I can see by eye why the rownames in cts are not contained in the TXNAME column of txdf. Can you see it too?

You should chop off the extra characters.

Try using this function:

chop <- function(x) sub("\\|.*","",x)

0
Entering edit mode

The gene symbol and description. Please tell me if I have used the function properly as I am getting following output.

> chop1 <-function(x) {
+     sub("\\|.*","",x)
+ }
> ctsgenco4 <- chop1(ctsgenco)

[1] "0.630859660016231"       "0"                  "28.3130559180533"          "0.0143187459425427"
[5]  "0"                   "0.288752943795375"

0
Entering edit mode

That is not correct. You are applying it to the counts. I suggested to remove the extra characters from the rownames of the counts.

Do you see the problem I am pointing out? You have extra characters on the rownames of the counts, and this doesn't match what's in TXNAME. You may want to collaborate with a bioinformatician to help you with this step, if the manipulation of objects in R is very difficult.

0
Entering edit mode

0
Entering edit mode

I am sorry to bother you again but the dmDSdata output does'nt look like what is there in the vignettes. It shows just 1 genes.

> all(rownames(ctsgenco) %in% txdfgenco$TXNAME) [1] TRUE > txdfgenco <- txdfgenco[match(rownames(ctsgenco),txdfgenco$TXNAME),]

> all(rownames(ctsgenco) == txdfgenco$TXNAME) [1] TRUE > countsgenco <- data.frame(gene_id=txdfgenco$GENEID,
+                      feature_id=txdfgenco\$TXNAME,
+                      ctsgenco)

gene_id        feature_id     HECON1     HECON2   HETRE1     HETRE2

ENST00000456328.2 ENSG00000223972.5 ENST00000456328.2  0.6308597  0.6636265  0.00000 0.08355679

ENST00000450305.2 ENSG00000223972.5 ENST00000450305.2  0.0000000  0.0825407  0.00000 0.00000000

ENST00000488147.1 ENSG00000227232.5 ENST00000488147.1 28.3130559 18.5500264 54.07508 3.89251328

> library(DRIMSeq)
> dgenco <- dmDSdata(counts=countsgenco, samples=samples)
> dgenco

An object of class dmDSdata

with 1 genes and 4 samples

* data accessors: counts(), samples()

> session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────
setting  value
version  R version 3.5.2 (2018-12-20)
os       macOS High Sierra 10.13.6
system   x86_64, darwin15.6.0
ui       RStudio
language (EN)
collate  en_US.UTF-8
ctype    en_US.UTF-8
tz       Asia/Kolkata
date     2019-04-11

0
Entering edit mode

I'm not sure.

But I also can't tell from what you have posted above, what you are providing as countsgenco. You could show for example the dim of this. Are you providing it with the counts for all the genes?

0
Entering edit mode
> dim(countsgenco)
[1] 174318      6


The row names of countgenco is feature_id as you could notice above. Countsgenco is nothing but the counts and ctsgenco is cts; I have just added "genco" acronym.

It might be relevant to ask, since I have run the code beyond this point just to see how things work and got 900 genes after stageR testing. Do i need to run the DEXseq codes as the stageR output has already given me the differential used transcript along with the p-adj values and it appears to work similar to DRIMseq?

0
Entering edit mode

You don't need to run both methods. You can just run one or the other.

If you are having continued difficulty with the DRIMSeq set up above, I'd make a new post with all your code and tag the DRIMSeq package so those authors are notified. (I think if you add a tag to an existing post it doesn't email the package maintainers).

0
Entering edit mode

Thank you for helping me on this issue.