Issue during transcript variant analysis using rnaseqDTU-DRIMseq-DEXseq.
1
0
Entering edit mode
@saddamhusain77-19420
Last seen 5.0 years ago

HI, I am analysing differential transcript usage on rna seq data and I am stuck at this point- It seems to me that issue is in the ctsgenco object where a single row contains multiple transcript ids but dont know how to correct it.

> filesgenco <- list.files( pattern = ".txt",full.names = TRUE)
> filesgenco
[1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt"
> names(filesgenco)
NULL
> mycols
       condition sample_id
HECON1 control      HECON1
HECON2 control      HECON2
HETRE1 treated      HETRE1
HETRE2 treated      HETRE2
> names(filesgenco) <- mycols$sampleID
> filesgenco
[1] "./hesalc1.txt" "./hesalc2.txt" "./hesalt1.txt" "./hesalt2.txt"
> txigenco <- tximport(filesgenco, type="salmon", txOut=TRUE,
+                      countsFromAbundance="scaledTPM")
reading in files with read_tsv
1 2 3 4 
> 
> ctsgenco <- txigenco$counts
> ctsgenco <- ctsgenco[rowSums(ctsgenco) > 0,]
> gtf <- "gencode.v29.annotation.gtf"
> txdfgenco <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
'select()' returned 1:many mapping between keys and columns
> tabgenco <- table(txdfgenco$GENEID)
> txdfgenco$ntx <- tabgenco[match(txdfgenco$GENEID, names(tabgenco))]
> head(txdfgenco)
              GENEID            TXNAME ntx
1 ENSG00000000003.14 ENST00000612152.4   5
2 ENSG00000000003.14 ENST00000373020.8   5
3 ENSG00000000003.14 ENST00000614008.4   5
4 ENSG00000000003.14 ENST00000496771.5   5
5 ENSG00000000003.14 ENST00000494424.1   5
6  ENSG00000000005.5 ENST00000373031.4   2
> ctsgenco[1:3,1:3]
[,1]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|               0.6308597

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|  0.0000000

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|              28.3130559

[,2]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|               0.6636265

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|  0.0825407

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|              18.5500264

[,3]
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|               0.00000

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|  0.00000

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|              54.07508

> range(colSums(cts)/1e6)
[1] 16.26968 26.21900
> range(colSums(ctsgenco)/1e6)
[1] 12.64907 17.67470
> all(rownames(ctsgenco) %in% txdfgenco$TXNAME)
[1] FALSE

Any help/suggestion would be appreciated.

rnaseqDTU DRIMseq DEXseq rna-seq • 1.5k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

Please post all of your code. See the posting guide for suggestions on how to formulate a post. Also you may want to tag the name of the Bioconductor package you are using, DRIMSeq? DEXSeq?

ADD COMMENT
0
Entering edit mode

Thank you Michael. I have just corrected my post and have added entire code I am using.

ADD REPLY
0
Entering edit mode

I can see by eye why the rownames in cts are not contained in the TXNAME column of txdf. Can you see it too?

You should chop off the extra characters.

Try using this function:

chop <- function(x) sub("\\|.*","",x)
ADD REPLY
0
Entering edit mode

The gene symbol and description. Please tell me if I have used the function properly as I am getting following output.

> chop1 <-function(x) {
+     sub("\\|.*","",x)
+ }
> ctsgenco4 <- chop1(ctsgenco)
> head(ctsgenco4)

[1] "0.630859660016231"       "0"                  "28.3130559180533"          "0.0143187459425427"
[5]  "0"                   "0.288752943795375" 
ADD REPLY
0
Entering edit mode

That is not correct. You are applying it to the counts. I suggested to remove the extra characters from the rownames of the counts.

Do you see the problem I am pointing out? You have extra characters on the rownames of the counts, and this doesn't match what's in TXNAME. You may want to collaborate with a bioinformatician to help you with this step, if the manipulation of objects in R is very difficult.

ADD REPLY
0
Entering edit mode

Yeah sure. I will consider your advice. Thank you for your time.

ADD REPLY
0
Entering edit mode

I am sorry to bother you again but the dmDSdata output does'nt look like what is there in the vignettes. It shows just 1 genes.

> all(rownames(ctsgenco) %in% txdfgenco$TXNAME)

[1] TRUE

> txdfgenco <- txdfgenco[match(rownames(ctsgenco),txdfgenco$TXNAME),]

> all(rownames(ctsgenco) == txdfgenco$TXNAME)

[1] TRUE

> countsgenco <- data.frame(gene_id=txdfgenco$GENEID,
+                      feature_id=txdfgenco$TXNAME,
+                      ctsgenco)

> head(countsgenco, 3)

gene_id        feature_id     HECON1     HECON2   HETRE1     HETRE2

ENST00000456328.2 ENSG00000223972.5 ENST00000456328.2  0.6308597  0.6636265  0.00000 0.08355679

ENST00000450305.2 ENSG00000223972.5 ENST00000450305.2  0.0000000  0.0825407  0.00000 0.00000000

ENST00000488147.1 ENSG00000227232.5 ENST00000488147.1 28.3130559 18.5500264 54.07508 3.89251328



> library(DRIMSeq)
> dgenco <- dmDSdata(counts=countsgenco, samples=samples)
> dgenco

An object of class dmDSdata 

with 1 genes and 4 samples

* data accessors: counts(), samples()


> session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       macOS High Sierra 10.13.6   
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Asia/Kolkata                
 date     2019-04-11
ADD REPLY
0
Entering edit mode

I'm not sure.

But I also can't tell from what you have posted above, what you are providing as countsgenco. You could show for example the dim of this. Are you providing it with the counts for all the genes?

ADD REPLY
0
Entering edit mode
> dim(countsgenco)
[1] 174318      6

The row names of countgenco is feature_id as you could notice above. Countsgenco is nothing but the counts and ctsgenco is cts; I have just added "genco" acronym.

It might be relevant to ask, since I have run the code beyond this point just to see how things work and got 900 genes after stageR testing. Do i need to run the DEXseq codes as the stageR output has already given me the differential used transcript along with the p-adj values and it appears to work similar to DRIMseq?

ADD REPLY
0
Entering edit mode

You don't need to run both methods. You can just run one or the other.

If you are having continued difficulty with the DRIMSeq set up above, I'd make a new post with all your code and tag the DRIMSeq package so those authors are notified. (I think if you add a tag to an existing post it doesn't email the package maintainers).

ADD REPLY
0
Entering edit mode

Thank you for helping me on this issue.

ADD REPLY

Login before adding your answer.

Traffic: 516 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6