Question

Count matrix from Kallisto as input for DESeq2

0

Entering edit mode

yjliu1127 • 0

@yjliu1127-11414

Last seen 7.6 years ago

Hi everyone,

I have several RNAseq datasets. I have one control and one treated sample and each sample have 3 replicates(6 datasets in total). I already got the count data for each gene using Kallisto. Can I use the count table from Kallisto as an input for DESeq2 to get the normalized counts?

Thanks,

Yong

rnaseq deseq2 kallisto • 14k views

ADD COMMENT • link 7.6 years ago yjliu1127 • 0

0

Entering edit mode

Hi Michael,

I'm working on B. subtilis 168.

First, I downloaded all the ORF from one B. subtilis website and used that to build a index for Kallisto.

After Kallisto quantification is done (for all 6 samples). I generated the count matrix for R analysis, so I extracted all the counts data from 6 independent files to generate a new txt file in the script. In the end, I ran the script and got the estimated count file for all 6 samples.

Thanks,

Yong

ADD REPLY • link 7.6 years ago yjliu1127 • 0

0

Entering edit mode

I moved this to a comment rather than an "answer" which should be an answer to the original question.

I was confused because I didn't know if you were dealing with splicing (multiple isoforms per gene) or not. If you want to perform analysis on the same level as you did the quantification, yes you can just pass the estimated count matrix to DESeq2 using the DESeqDataSetFromMatrix function.

Regarding how to read in the column data (colData), I'd suggest you make an Excel sheet (and then export to CSV) or a CSV file in a text editor which contains all the information and read that in with read.csv() or the like. This way you can avoid mistakes which can happen using rep().

ADD REPLY • link 7.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Sorry to bother you again. I generated a .csv file for my dataset. I typed the following codes in R.

> countMatrix=read.csv("C:/Users/kingon001/Desktop/counts.csv",header=TRUE,row.names=1)
> head(countMatrix)
WT6A WT6B WT6C KO6A KO6B KO6C
dnaA 6594 7412 3401 9067 7344 3803
dnaN 4274 5048 3680 7368 7300 6571
yaaA 712 678 531 813 591 433
recF 11282 10414 6007 9485 8185 4954
yaaB 1246 1074 448 898 787 291
gyrB 22499 21507 6943 17273 14169 5334
> coldata=data.frame(row.names=c('WT6A', 'WT6B', 'WT6C','KO6A','KO6B','KO6C'),group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))
Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B", :
row names supplied are of the wrong length.

Is this error related to the column order. I set the 1st column as row names. How can I avoid this error?

Yong

ADD REPLY • link 7.6 years ago yjliu1127 • 0

0

Entering edit mode

I'm suggesting that you create a CSV for the phenotypic data: which samples are in which group and which treatment, etc. This will make things easier for you and help to avoid sample labelling mistakes which can cause big errors downstream.

ADD REPLY • link 7.6 years ago Michael Love 41k

0

Entering edit mode

Thanks!

Yong

ADD REPLY • link 7.6 years ago yjliu1127 • 0

score 1 · Answer 1 · 2016-09-03

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 2 hours ago

United States

For using kallisto quantification with DESeq2 for gene level analysis you should use the tximport Bioconductor package. See the tximport vignette, it has all the code you need. This pipeline is better than using the count matrix alone, because it controls for changes in average transcript length as an offset. Tximport takes care of all the details.

ADD COMMENT • link 7.6 years ago Michael Love 41k

0

Entering edit mode

Thanks a lot! Michael

Yong

ADD REPLY • link 7.6 years ago yjliu1127 • 0

0

Entering edit mode

Hi Michael,

I'm new in R and Kallisto. I'm confused in many steps.

The following is the count table I got from Kallisto. I have one control and one treated bacterial samples(3 replicates/sample) (control group: WT6A, WT6B, WT6C and treatment group: KO6A, KO6B, KO6C). The following is an example of my dataset(Column1 is gene names. Row1 is sample names.).

WT6A WT6B WT6C KO6A KO6B KO6C
dnaA 6594 7412 3401 9067 7344 3803
dnaN 4274 5048 3680 7368 7300 6571
yaaA 712 678 531 813 591 433
recF 11282 10414 6007 9485 8185 4954
yaaB 1246 1074 448 898 787 291
gyrB 22499 21507 6943 17273 14169 5334

1st question is about '"row names supplied are of the wrong length."

Originally, I followed the suggestion on https://www.biostars.org/p/152033/#152063 to input the count table for DESeq2 analysis.

After I typed, coldata=data.frame(row.names = c('WT6A', 'WT6B', 'WT6C', 'KO6A', 'KO6B', 'KO6C'), group=rep(c("WT6","KO6"),3,each=3),treatment=rep(c("control","treated"),each=3))

There is one error: Error in data.frame(row.names = c("WT6A", "WT6B", "WT6C", "KO6A", "KO6B", :
row names supplied are of the wrong length. Could you please let me know how to resolve this question?

My 2nd question is about using "tximport" in R in Windows8.1 after taking the your advice. This is the instruction I used https://www.bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html. However, I have individual folder which contains the "abundance.tsv" file for each sample from Kallisto. I already extracted the read counts for each gene from all 6 samples and generated the above count file.

In the instruction, it's said "Transcripts need to be associated with gene IDs for gene-level summarization. If that information is present in the files, we can skip this step. But for kallisto, Salmon and Sailfish, the files only provide the transcript ID. We first make a data.frame called tx2gene with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files."

I only have gene name in the abundance file(no transcript ID) and I already got the counts for each gene. If I want to use "tximport" to control the average transcript length as an offset, can I just skip this step for creating tx2gene data. frame?

Thanks a lot for your help!

Yong

ADD REPLY • link 7.6 years ago yjliu1127 • 0

0

Entering edit mode

I'm going to skip over the first question because this is not following my recommendation of using tximport to simplify the import steps.

"I already extracted the read counts for each gene from all 6 samples and generated the above count file"

Why are you generating a count file? You should just pass the 'txi' object to DESeq2 functions as shown in both software vignettes (DESeq2 and tximport give example code you should follow).

"I only have gene name in the abundance file(no transcript ID)"

Can you explain exactly how you ran kallisto? You should provide kallisto with a FASTA with an entry for every transcript. What organism are you working with and why are you using genes instead of transcripts?

ADD REPLY • link 7.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

I just came across this scenario you asked that my reference only contains CDS from computational annotation. It seems there are many information missing by this reference.
Can I ask for your advice on the kallisto est_counts as the transcripts level only?
Thanks!

ADD REPLY • link 6.7 years ago yifangt ▴ 20

0

Entering edit mode

I don't exactly follow your question. You can use tximport to import quantifications into Bioconductor and then use the downstream package of your choice following the instructions in the vignette.

ADD REPLY • link 6.7 years ago Michael Love 41k

0

Entering edit mode

Thanks Michael!
I meant my situation is as "I only have gene name in the abundance file(no transcript ID)" too. The reference is a uncommon organism and the genome sequence is from de novo assembly and genes/CDS are from prediction. So that I do not have every transcript for each gene.