Count matrix from Kallisto as input for DESeq2
1
0
Entering edit mode
yjliu1127 • 0
@yjliu1127-11414
Last seen 6.1 years ago

Hi everyone,

I have several RNAseq datasets. I have one control and one treated sample and each sample have 3 replicates(6 datasets in total). I already got the count data for each gene using Kallisto. Can I use the count table from Kallisto as an input for DESeq2 to get the normalized counts?

Thanks,

Yong

rnaseq deseq2 kallisto • 9.5k views
0
Entering edit mode

Hi Michael,

I'm working on B. subtilis 168.

First, I downloaded all the ORF from one B. subtilis website and used that to build a index for Kallisto.

After Kallisto quantification is done (for all 6 samples). I generated the count matrix for R analysis, so I extracted all the counts data from 6 independent files to generate a new txt file in the script. In the end, I ran the script and got the estimated count file for all 6 samples.

Thanks,

Yong

0
Entering edit mode

I moved this to a comment rather than an "answer" which should be an answer to the original question.

I was confused because I didn't know if you were dealing with splicing (multiple isoforms per gene) or not. If you want to perform analysis on the same level as you did the quantification, yes you can just pass the estimated count matrix to DESeq2 using the DESeqDataSetFromMatrix function.

Regarding how to read in the column data (colData), I'd suggest you make an Excel sheet (and then export to CSV) or a CSV file in a text editor which contains all the information and read that in with read.csv() or the like. This way you can avoid mistakes which can happen using rep().

0
Entering edit mode

Hi Michael,

Sorry to bother you again. I generated a .csv file for my dataset. I typed the following codes in R.

WT6A  WT6B WT6C  KO6A  KO6B KO6C
dnaA  6594  7412 3401  9067  7344 3803
dnaN  4274  5048 3680  7368  7300 6571
yaaA   712   678  531   813   591  433
recF 11282 10414 6007  9485  8185 4954
yaaB  1246  1074  448   898   787  291
gyrB 22499 21507 6943 17273 14169 5334
> coldata=data.frame(row.names=c('WT6A', 'WT6B', 'WT6C','KO6A','KO6B','KO6C'),group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))
Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B",  :
row names supplied are of the wrong length.

Is this error related to the column order. I set the 1st column as row names. How can I avoid this error?

Yong

0
Entering edit mode

I'm suggesting that you create a CSV for the phenotypic data: which samples are in which group and which treatment, etc. This will make things easier for you and help to avoid sample labelling mistakes which can cause big errors downstream.

0
Entering edit mode

Thanks!

Yong

1
Entering edit mode
@mikelove
Last seen 1 day ago
United States
For using kallisto quantification with DESeq2 for gene level analysis you should use the tximport Bioconductor package. See the tximport vignette, it has all the code you need. This pipeline is better than using the count matrix alone, because it controls for changes in average transcript length as an offset. Tximport takes care of all the details.
0
Entering edit mode

Thanks a lot!  Michael

Yong

0
Entering edit mode

Hi Michael,

I'm new in R and Kallisto. I'm confused in many steps.

The following is the count table I got from Kallisto. I have one control and one treated bacterial samples(3 replicates/sample) (control group: WT6A, WT6B, WT6C and treatment group: KO6A, KO6B, KO6C). The following is an example of my dataset(Column1 is gene names. Row1 is sample names.).

WT6A  WT6B WT6C  KO6A  KO6B KO6C
dnaA  6594  7412 3401  9067  7344 3803
dnaN  4274  5048 3680  7368  7300 6571
yaaA   712   678  531   813   591  433
recF 11282 10414 6007  9485  8185 4954
yaaB  1246  1074  448   898   787  291
gyrB 22499 21507 6943 17273 14169 5334

1st question is about '"row names supplied are of the wrong length."

Originally, I followed the suggestion on https://www.biostars.org/p/152033/#152063 to input the count table for DESeq2 analysis.

After I typed, coldata=data.frame(row.names = c('WT6A', 'WT6B', 'WT6C', 'KO6A', 'KO6B', 'KO6C'), group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))

There is one error: Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B",  :
row names supplied are of the wrong length.   Could you please let me know how to resolve this question?

My 2nd question is about using "tximport" in R in Windows8.1 after taking the your advice. This is the instruction I used https://www.bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html. However, I have individual folder which contains the "abundance.tsv" file for each sample from Kallisto. I already extracted the read counts for each gene from all 6 samples and generated the above count file.

In the instruction, it's said "Transcripts need to be associated with gene IDs for gene-level summarization. If that information is present in the files, we can skip this step. But for kallisto, Salmon and Sailfish, the files only provide the transcript ID. We first make a data.frame called tx2gene with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files."

I only have gene name in the abundance file(no transcript ID) and I already got the counts for each gene. If I want to use "tximport" to control the average transcript length as an offset, can I just skip this step for creating tx2gene data. frame?

Thanks a lot for your help!

Yong

0
Entering edit mode

I'm going to skip over the first question because this is not following my recommendation of using tximport to simplify the import steps.

"I already extracted the read counts for each gene from all 6 samples and generated the above count file"

Why are you generating a count file? You should just pass the 'txi' object to DESeq2 functions as shown in both software vignettes (DESeq2 and tximport give example code you should follow).

"I only have gene name in the abundance file(no transcript ID)"

Can you explain exactly how you ran kallisto? You should provide kallisto with a FASTA with an entry for every transcript. What organism are you working with and why are you using genes instead of transcripts?

0
Entering edit mode

Hi Michael,

I just came across this scenario you asked that my reference only contains CDS from computational annotation. It seems there are many information missing by this reference.
Thanks!

0
Entering edit mode

I don't exactly follow your question. You can use tximport to import quantifications into Bioconductor and then use the downstream package of your choice following the instructions in the vignette.

0
Entering edit mode

Thanks Michael!
I meant my situation is as "I only have gene name in the abundance file(no transcript ID)" too. The reference is a uncommon organism and the genome sequence is from de novo assembly and genes/CDS are from prediction. So that I do not have every transcript for each gene.

0
Entering edit mode

Oh then set txOut=TRUE