Which input file is used for DGEList in edgeR?
@mohammedtoufiq91-17679
United States

Hi,

I ran the nf-core/rnaseq pipeline with the default star_salmon aligner on a strand-specific dataset. I have a question about the gene counts produced by the Salmon quantification. I am only interested in gene-level counts for downstream analysis, not isoforms. It seems the nf-core/rnaseq pipeline is designed to import "counts_gene_length_scaled" (reference 1, reference 2) via tximport > DESeq2 > size_factors > vst. The pipeline generates a number of files, and I would like to know which of the files listed below is best to use in the edgeR DGEList. Probably "salmon.merged.gene_counts.rds"?

Before using this pipeline, I used to start from the raw gene counts from featureCounts and then use them in edgeR.

salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts.rds
salmon.merged.gene_counts_scaled.rds
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv
salmon_tx2gene.tsv


Thank you,

Toufiq

salmon edgeR tximport nf-core gene_counts
@gordon-smyth
WEHI, Melbourne, Australia

I use and recommend featureCounts. Despite all that has been written on this topic, I still think that direct gene counting is faster and more accurate than getting gene counts from transcript-level estimates. If you want to use the above pipeline though, you could follow the advice from Mike Love given in your Reference 1 link.


Thank you Gordon Smyth

My collaborator asked me to test this pipeline with the edgeR package. We are interested in gene-level analysis only. It seems like salmon.merged.gene_counts.tsv could be a starting point in edgeR.


Does it output Salmon files (directories with quant.sf in them)?

That would be the easiest. The files you list above are already processed and not ideal. The whole point of tximport is to take Salmon output files and prepare count matrices with effective gene length offsets. The gene length offsets account for changes in transcript length as well as biases such as sample-specific variation due to amplification or fragmentation.
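For readers who still have the per-sample quant.sf files, the tximport-to-edgeR route can be sketched roughly as below. It follows the pattern shown in the tximport vignette. The directory layout (`salmon/<sample>/quant.sf`), the sample names, and the assumption that the first two columns of salmon_tx2gene.tsv hold the transcript ID and gene ID are illustrative guesses, not something confirmed in this thread. The helper name `txi_to_dgelist` is also made up for this example.

```r
# Sketch: Salmon quant.sf files -> tximport -> edgeR DGEList with offsets.
# Requires the Bioconductor packages tximport and edgeR.

# Turn a tximport result (a list with $counts and $length matrices) into a
# DGEList that carries gene-length offsets, as in the tximport vignette.
txi_to_dgelist <- function(txi) {
  cts <- txi$counts
  norm_mat <- txi$length
  # Scale lengths so each gene's geometric mean across samples is 1.
  norm_mat <- norm_mat / exp(rowMeans(log(norm_mat)))
  norm_cts <- cts / norm_mat
  # Effective library sizes computed from the length-corrected counts.
  eff_lib <- edgeR::calcNormFactors(norm_cts) * colSums(norm_cts)
  # Combine length and library-size effects into log-scale offsets.
  norm_mat <- log(sweep(norm_mat, 2, eff_lib, "*"))
  y <- edgeR::DGEList(cts)
  edgeR::scaleOffset(y, norm_mat)
}

# Usage (paths and sample names are hypothetical):
# samples <- c("sample1", "sample2", "sample3", "sample4")
# files <- file.path("salmon", samples, "quant.sf")
# names(files) <- samples
# tx2gene <- read.delim("salmon_tx2gene.tsv", header = FALSE)[, 1:2]
# txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene)
# y <- txi_to_dgelist(txi)
# keep <- edgeR::filterByExpr(y)  # pass group= or design= for a real design
# y <- y[keep, ]
```

With the offsets stored in the DGEList, the usual edgeR dispersion estimation and model fitting steps can be run on `y` without a separate normalization call.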


It does output the Salmon files, and it is documented here:

https://nf-co.re/rnaseq/output#pseudo-alignment-and-quantification

The first bullet point is the easiest, and it is a commonly used pipeline for getting Salmon quantification into R/Bioconductor for use with downstream count-based tools.

Alternatively, if you don't have access to the quant.sf files, you would load salmon.merged.gene_counts_length_scaled.tsv and use that as the count matrix input to edgeR.
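A minimal sketch of that fallback is given below. It assumes the TSV has one or more leading non-numeric annotation columns (for example gene_id and gene_name) followed by one numeric column per sample; check the header of your own file first. The helper names `read_gene_counts` and `make_dgelist` are invented for this example. Because the length-scaled counts already account for gene length, no offset matrix is supplied here.

```r
# Sketch: load the length-scaled gene counts table into an edgeR DGEList.
# Assumption: leading non-numeric columns (e.g. gene_id, gene_name) are
# annotation and every remaining column is one sample.

read_gene_counts <- function(path) {
  tab <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
  is_num <- vapply(tab, is.numeric, logical(1))
  counts <- as.matrix(tab[, is_num, drop = FALSE])
  rownames(counts) <- tab[[1]]  # first column assumed to hold gene IDs
  list(counts = counts, genes = tab[, !is_num, drop = FALSE])
}

# Build the DGEList (requires edgeR); `group` is a hypothetical factor
# giving the condition of each sample column.
make_dgelist <- function(path, group = NULL) {
  x <- read_gene_counts(path)
  edgeR::DGEList(counts = x$counts, genes = x$genes, group = group)
}

# Usage:
# y <- make_dgelist("salmon.merged.gene_counts_length_scaled.tsv")
# y <- y[edgeR::filterByExpr(y), , keep.lib.sizes = FALSE]
# y <- edgeR::calcNormFactors(y)
```

The values in this file are generally not integers, which edgeR accepts as count input.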


Michael Love, thank you. Yes, the pipeline generates quant.sf files too; however, those were deleted and only the files listed above were provided. As a workaround, I will use the salmon.merged.gene_counts_length_scaled.tsv file as the input count matrix in R.