Is it advisable to input a count matrix that consists of reads aligned using different algorithms (HT-Seq and Salmon)?
2
0
Entering edit mode
akstks • 0
@ae8cc513
Last seen 14 months ago
Sweden

Hello!

First of all, thank you for the great package and the excellent documentation that supports it, much appreciated!

Sadly, I could not find an answer to my problem, so I wanted to ask here.

I have two different bulk RNA-seq datasets, one obtained from TCGA using the TCGAbiolinks package, and one that I acquired after communicating with the authors. They are both from the same cancer type, but the TCGA consists of patient samples exclusively from later stages and the other exclusively from earlier stages.

Following the TCGAbiolinks vignette, I got a SummarizedExperiment object that is ready to proceed with the DESeq2 workflow. The raw counts are HTSeq quantified.

The second dataset, was given to me as an excel file of raw counts quantified with Salmon so I had to bring it to the appropriate format:

#Assuming all the relevant packages are loaded

raw.salmon <- readxl::read_excel(file.choose())
clin.salmon <- read_excel(file.choose()) #Relevant clinical data to be used as metadata
raw.salmon <- raw.salmon %>% remove_rownames() %>% column_to_rownames(var = 'ensembl ID')
raw.salmon <- raw.salmon %>% mutate_if(is.numeric, round) #Salmon raw counts included decimals, so I rounded them. I don't believe meaningful information is lost.
raw.salmon <- raw.salmon %>% mutate_if(is.numeric, as.double) # This was the type of the counts from the other SummarizedExperiment object from TCGAbiolinks
raw.salmon <- as.matrix(raw.salmon) 
raw.salmon.se <- SummarizedExperiment(raw.salmon)
assayNames(raw.salmon.se) <- 'counts'

raw.salmon.se$Stage <- clin.salmon$Stage

Then I proceeded to "merge" the two together, so I could do DE analysis based on stage. The code is a bit messy here since I had saved the two Summarized experiments and came back to them at a different day so you see me going back and forth to matrices, data.frames etc...

#raw.htseq.se is the SummarizedExperiment from TCGAbiolinks

# Extract the counts into matrices
htseq.matrix <- raw.htseq.se@assays@data@listData[["HTSeq - Counts"]]
salmon.matrix <- raw.salmon.se@assays@data@listData[["counts"]]

#Add rownames and colnames to htseq.matrix to be the same as salmon
colnames(htseq.matrix) <- rownames(colData(htseq.salmon.se))

names.htseq<- rowData(htseq.salmon.se)
rownames(htseq.matrix) <- names.htseq$ensembl_gene_id

#Turn them to data frames for easier manipulation
htseq.matrix <- as.data.frame(htseq.matrix) # I know... data.frames names matrices...
salmon.matrix <- as.data.frame(salmon.matrix)

#Change the rownames into a column as a reference for merging
htseq.matrix <- htseq.matrix %>% rownames_to_column('Ensembl_ID')
salmon.matrix <- salmon.matrix %>% rownames_to_column('Ensembl_ID')

combined.counts <- merge(salmon.matrix, htseq.matrix, by = 'Ensembl_ID')
combined.counts <- combined.counts %>% column_to_rownames('Ensembl_ID')
combined.counts <- combined.counts %>% mutate_if(is.numeric, as.double)
combined.counts <- as.matrix(combined.counts)

#Turn into a SummarizedExperiment for DE analysis with DESeq2
combined.counts.se <- SummarizedExperiment(combined.counts.se)
assayNames(combined.counts.se) <- 'counts'

combined.counts.se$Stage <- all.clin$Stage # This is a table I made that had the appropriate stage metadata for both datasets.

After this, I could turn the combined SummarizedExperiment object into a DESeqDataSet and proceed to DE analysis using design = ~ Stage without a problem and I got plenty of results but, are they correct? Is it appropriate to run such a workflow with raw data that have been called differently? Running a PCA grouping the individuals based on stage (after transforming the counts with vst and blind = TRUE) gave me two big clusters, one for the early stage (orange) and one for the later stage (blue) which also happens to be the two different datasets respectively. Is this whole analysis just a big batch effect?

PCA

Any opinions are more than welcome!

DESeq2 • 927 views
ADD COMMENT
3
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

Is it appropriate to run such a workflow with raw data that have been called differently?

I'd say no. Because the in silico processing and the condition of interest are perfectly confounded, I wouldn't interpret any differences seen as biologically meaningful.

ADD COMMENT
0
Entering edit mode

That is what I was afraid of and wanted to make sure before spending any more time on it. Will have to think about something else then.

Thanks again!

ADD REPLY
0
Entering edit mode
swbarnes2 ★ 1.3k
@swbarnes2-14086
Last seen 46 minutes ago
San Diego

If the confounding wasn't perfect between gene counting software and condition of interest, you could probably add software used as another column in your colData, and add it to your design like you would a batch effect.

ADD COMMENT

Login before adding your answer.

Traffic: 692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6