Selecting individual samples from tximport and DGE matrix design with missing biological replicate
2
0
Entering edit mode
Kent • 0
@5f50f52d
Last seen 13 days ago
United Kingdom

Hi,

I am trying to improve my RNA-Seq DGE analysis practice so would like to see if I have been doing it correctly.

First part is related to tximport. For DESeq2, I use the suggested way:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)


This uses the offset matrix to correct for the length if I am not mistaken. For limma:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
y <- DGEList(txi$counts)  So my first question is, if I want to eliminate some of the samples from the DESeq or DGEList object, should I just select them from the object like: dds <- dds[,sample_i_want_to_keep] # or y <- y[,sample_i_want_to_keep]  Or should I redo the tximport() process? Would it affect the adjustment? My understanding is that TPM adjusts to gene length and library size so it shouldn't but I am not sure. The second part of the question is about DGE design matrix. Let's say I have something like this, where donor is the origin of the cells, and I want to compare conditions NTC vs others and between A, B and C.  donor Condition 1_NTC 1 NTC 1_A 1 A 1_B 1 B 1_C 1 C 2_NTC 2 NTC 2_A 2 A 2_B 2 B 2_C 2 C 3_A 3 A  Due to technical issue 3_NTC, 3_B and 3_C could not be sequenced. I am just wondering when creating the design matrix, should I include donor or not. How will it affect the result when I compare NTC vs non-NTC and between A, B and C, given that for donor 3 there is no NTC, B or C? I assume in both DESeq2 and limma, the answer for this question would be the same since both are GLM-based? Thanks a lot! Hope the questions make sense. DESeq2 limma tximport • 291 views ADD COMMENT 3 Entering edit mode @mikelove Last seen 8 hours ago United States It's fine to just subset the object returned by tximport(). Regarding the dataset from a collaborator, you can make counts from abundance, see this exported function: https://github.com/mikelove/tximport/blob/092f0a3bf2a1cd2430b7e1c2fa437426e746ea4e/R/helper.R#L19 Regarding whether adjusting for transcript length is a concern (something which Gordon Smyth brought up): it is true there are many issues with annotation not being complete, and isoform-level estimation is difficult. However transcript length adjustment is only practically relevant when there is a shift in expression from short to a long isoform, or the reverse. Between those two types of transcripts, estimation is not hard actually. It's easy to tell when a short or a long isoform is being used, relative to the other option. Mis-estimation of isoforms (which does happen, see e.g. my work with Irizarry and the Salmon team on sequence bias) happens most often among isoforms that have similar sequence, which tend to be similar length, and would then have little effect on the transcript length adjustment passed along via offsets in tximport. ADD COMMENT 0 Entering edit mode Thanks Michael Love ! I just reread the tutorial for tximport again and had a relevant question regarding to the scaling. So We don't have to scale to transcript length, but library size is always a must to be normalised to. My question is for limma-voom in the tutorial, you recommended to use "scaledTPM' to normalise the counts to library size, but then a few lines later you redo the normalisation again: y <- calcNormFactors(y) v <- voom(y, design)  Is the calNormFactors() step necessary given that the counts are already normalised? Will it change the value of the original count? ADD REPLY 0 Entering edit mode scaledTPM doesn't correct for library size. For example, on the files in the man page: > txi <- tximport(files, "salmon", countsFromAbundance="scaledTPM", txOut=TRUE) reading in files with read_tsv 1 2 3 4 5 6 > colSums(txi$counts)
sample1  sample2  sample3  sample4  sample5  sample6
22830167 23721985 34481913 23854225 28694460 22602904
>

2
Entering edit mode
@gordon-smyth
Last seen 58 minutes ago
WEHI, Melbourne, Australia

The answer is different for limma than it would be for edgeR or DESeq2 because limma uses normal linear models whereas edgeR and DESeq2 used negative binomial generalized linear models.

With limma, you can model the donor effect using voomLmFit with donor as a block effect, see for example multiple voom runs with duplicateCorrelation and negative values in limma . With edgeR and DESeq2, you would have to either ignore the 3_A sample or remove donor from the model, neither option being very desirable.

0
Entering edit mode

Thank you Gordon. I will wait and see if Michael would drop by and answer the first question. However, I also have another question related to linking tximport() and limma. I have another data set which I only got this txi object from a bioinformatician from another institute (it would be easier to just get the RSD file). It has a similar problem with the design matrix as the above example. It is not adjusted to anything i.e. countsFromAbundance = "no". Is it still okay to use limma? Is there anyway to adjust for the transcript length etc. with the txi object?

0
Entering edit mode

Sorry, I can't answer questions about tximport as I don't use it myself. The code in the tximport documentation for inputing to limma and edgeR was checked by Aaron Lun so I am sure it is correct --- you should not change it or omit any steps. Regarding transcript length, I personally prefer not to adjust for transcript length when doing gene level analyses because I believe that transcript annotation is often poor and transcripts are often inaccurately estimated.