Search
Question: Count matrix from Kallisto as input for DESeq2
0
gravatar for yjliu1127
13 months ago by
yjliu11270
yjliu11270 wrote:

Hi everyone,

I have several RNAseq datasets. I have one control and one treated sample and each sample have 3 replicates(6 datasets in total). I already got the count data for each gene using Kallisto. Can I use the count table from Kallisto as an input for DESeq2 to get the normalized counts? 

Thanks,

Yong

 

ADD COMMENTlink modified 13 months ago • written 13 months ago by yjliu11270

Hi Michael,

I'm working on B. subtilis 168.

First, I downloaded all the ORF from one B. subtilis website and used that to build a index for Kallisto.

After Kallisto quantification is done (for all 6 samples). I generated the count matrix for R analysis, so I extracted all the counts data from 6 independent files to generate a new txt file in the script. In the end, I ran the script and got the estimated count file for all 6 samples. 

 

Thanks,

Yong

 

ADD REPLYlink modified 13 months ago • written 13 months ago by yjliu11270

I moved this to a comment rather than an "answer" which should be an answer to the original question.

I was confused because I didn't know if you were dealing with splicing (multiple isoforms per gene) or not. If you want to perform analysis on the same level as you did the quantification, yes you can just pass the estimated count matrix to DESeq2 using the DESeqDataSetFromMatrix function.

Regarding how to read in the column data (colData), I'd suggest you make an Excel sheet (and then export to CSV) or a CSV file in a text editor which contains all the information and read that in with read.csv() or the like. This way you can avoid mistakes which can happen using rep().

ADD REPLYlink written 13 months ago by Michael Love14k

Hi Michael,

Sorry to bother you again. I generated a .csv file for my dataset. I typed the following codes in R. 

> countMatrix=read.csv("C:/Users/kingon001/Desktop/counts.csv",header=TRUE,row.names=1)
> head(countMatrix)
      WT6A  WT6B WT6C  KO6A  KO6B KO6C
dnaA  6594  7412 3401  9067  7344 3803
dnaN  4274  5048 3680  7368  7300 6571
yaaA   712   678  531   813   591  433
recF 11282 10414 6007  9485  8185 4954
yaaB  1246  1074  448   898   787  291
gyrB 22499 21507 6943 17273 14169 5334
> coldata=data.frame(row.names=c('WT6A', 'WT6B', 'WT6C','KO6A','KO6B','KO6C'),group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))
Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B",  : 
  row names supplied are of the wrong length.

Is this error related to the column order. I set the 1st column as row names. How can I avoid this error?

Yong

 

 

ADD REPLYlink modified 13 months ago • written 13 months ago by yjliu11270

I'm suggesting that you create a CSV for the phenotypic data: which samples are in which group and which treatment, etc. This will make things easier for you and help to avoid sample labelling mistakes which can cause big errors downstream.

ADD REPLYlink written 13 months ago by Michael Love14k

Thanks!

Yong

ADD REPLYlink modified 13 months ago • written 13 months ago by yjliu11270
0
gravatar for Michael Love
13 months ago by
Michael Love14k
United States
Michael Love14k wrote:
For using kallisto quantification with DESeq2 for gene level analysis you should use the tximport Bioconductor package. See the tximport vignette, it has all the code you need. This pipeline is better than using the count matrix alone, because it controls for changes in average transcript length as an offset. Tximport takes care of all the details.
ADD COMMENTlink modified 13 months ago • written 13 months ago by Michael Love14k

Thanks a lot!  Michael

Yong 

ADD REPLYlink written 13 months ago by yjliu11270

Hi Michael,

I'm new in R and Kallisto. I'm confused in many steps.  

The following is the count table I got from Kallisto. I have one control and one treated bacterial samples(3 replicates/sample) (control group: WT6A, WT6B, WT6C and treatment group: KO6A, KO6B, KO6C). The following is an example of my dataset(Column1 is gene names. Row1 is sample names.).

      WT6A  WT6B WT6C  KO6A  KO6B KO6C
dnaA  6594  7412 3401  9067  7344 3803
dnaN  4274  5048 3680  7368  7300 6571
yaaA   712   678  531   813   591  433
recF 11282 10414 6007  9485  8185 4954
yaaB  1246  1074  448   898   787  291
gyrB 22499 21507 6943 17273 14169 5334

1st question is about '"row names supplied are of the wrong length."

Originally, I followed the suggestion on https://www.biostars.org/p/152033/#152063 to input the count table for DESeq2 analysis. 

After I typed, coldata=data.frame(row.names = c('WT6A', 'WT6B', 'WT6C', 'KO6A', 'KO6B', 'KO6C'), group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))

There is one error: Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B",  : 
  row names supplied are of the wrong length.   Could you please let me know how to resolve this question?

My 2nd question is about using "tximport" in R in Windows8.1 after taking the your advice. This is the instruction I used https://www.bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html. However, I have individual folder which contains the "abundance.tsv" file for each sample from Kallisto. I already extracted the read counts for each gene from all 6 samples and generated the above count file.  

In the instruction, it's said "Transcripts need to be associated with gene IDs for gene-level summarization. If that information is present in the files, we can skip this step. But for kallisto, Salmon and Sailfish, the files only provide the transcript ID. We first make a data.frame called tx2gene with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files."

I only have gene name in the abundance file(no transcript ID) and I already got the counts for each gene. If I want to use "tximport" to control the average transcript length as an offset, can I just skip this step for creating tx2gene data. frame?

 

Thanks a lot for your help!

Yong

 

 

 

ADD REPLYlink written 13 months ago by yjliu11270

I'm going to skip over the first question because this is not following my recommendation of using tximport to simplify the import steps.

"I already extracted the read counts for each gene from all 6 samples and generated the above count file"

Why are you generating a count file? You should just pass the 'txi' object to DESeq2 functions as shown in both software vignettes (DESeq2 and tximport give example code you should follow).

"I only have gene name in the abundance file(no transcript ID)"

Can you explain exactly how you ran kallisto? You should provide kallisto with a FASTA with an entry for every transcript. What organism are you working with and why are you using genes instead of transcripts?

 

ADD REPLYlink written 13 months ago by Michael Love14k

Hi Michael,

I just came across this scenario you asked that my reference only contains CDS from computational annotation. It seems there are many information missing by this reference.
Can I ask for your advice on the kallisto est_counts as the transcripts level only?
Thanks!

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by yifangt10

I don't exactly follow your question. You can use tximport to import quantifications into Bioconductor and then use the downstream package of your choice following the instructions in the vignette.

 

ADD REPLYlink written 10 weeks ago by Michael Love14k

Thanks Michael!
I meant my situation is as "I only have gene name in the abundance file(no transcript ID)" too. The reference is a uncommon organism and the genome sequence is from de novo assembly and genes/CDS are from prediction. So that I do not have every transcript for each gene.
 

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by yifangt10

Oh then set txOut=TRUE

ADD REPLYlink written 9 weeks ago by Michael Love14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 301 users visited in the last hour