Question

Selecting individual samples from tximport and DGE matrix design with missing biological replicate

0

Entering edit mode

Kent • 0

@5f50f52d

Last seen 5 weeks ago

United Kingdom

Hi,

I am trying to improve my RNA-Seq DGE analysis practice so would like to see if I have been doing it correctly.

First part is related to tximport. For DESeq2, I use the suggested way:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

This uses the offset matrix to correct for the length if I am not mistaken. For limma:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
y <- DGEList(txi$counts)

So my first question is, if I want to eliminate some of the samples from the DESeq or DGEList object, should I just select them from the object like:

dds <- dds[,sample_i_want_to_keep]
# or
y <- y[,sample_i_want_to_keep]

Or should I redo the tximport() process? Would it affect the adjustment? My understanding is that TPM adjusts to gene length and library size so it shouldn't but I am not sure.

The second part of the question is about DGE design matrix. Let's say I have something like this, where donor is the origin of the cells, and I want to compare conditions NTC vs others and between A, B and C.

      donor   Condition
1_NTC     1         NTC
1_A       1           A
1_B       1           B
1_C       1           C
2_NTC     2         NTC
2_A       2           A
2_B       2           B
2_C       2           C
3_A       3           A

Due to technical issue 3_NTC, 3_B and 3_C could not be sequenced. I am just wondering when creating the design matrix, should I include donor or not. How will it affect the result when I compare NTC vs non-NTC and between A, B and C, given that for donor 3 there is no NTC, B or C? I assume in both DESeq2 and limma, the answer for this question would be the same since both are GLM-based?

Thanks a lot! Hope the questions make sense.

DESeq2 limma tximport • 1.4k views

ADD COMMENT • link updated 19 months ago by Gordon Smyth 50k • written 20 months ago by Kent • 0

score 3 · Accepted Answer · 2022-08-31

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 15 hours ago

United States

It's fine to just subset the object returned by tximport().

Regarding the dataset from a collaborator, you can make counts from abundance, see this exported function:

https://github.com/mikelove/tximport/blob/092f0a3bf2a1cd2430b7e1c2fa437426e746ea4e/R/helper.R#L19

Regarding whether adjusting for transcript length is a concern (something which Gordon Smyth brought up): it is true there are many issues with annotation not being complete, and isoform-level estimation is difficult. However transcript length adjustment is only practically relevant when there is a shift in expression from short to a long isoform, or the reverse. Between those two types of transcripts, estimation is not hard actually. It's easy to tell when a short or a long isoform is being used, relative to the other option. Mis-estimation of isoforms (which does happen, see e.g. my work with Irizarry and the Salmon team on sequence bias) happens most often among isoforms that have similar sequence, which tend to be similar length, and would then have little effect on the transcript length adjustment passed along via offsets in tximport.

ADD COMMENT • link 20 months ago Michael Love 41k

0

Entering edit mode

Thanks Michael Love ! I just reread the tutorial for tximport again and had a relevant question regarding to the scaling. So We don't have to scale to transcript length, but library size is always a must to be normalised to. My question is for limma-voom in the tutorial, you recommended to use "scaledTPM' to normalise the counts to library size, but then a few lines later you redo the normalisation again:

y <- calcNormFactors(y)
v <- voom(y, design)

Is the calNormFactors() step necessary given that the counts are already normalised? Will it change the value of the original count?

ADD REPLY • link 20 months ago Kent • 0

0

Entering edit mode

scaledTPM doesn't correct for library size. For example, on the files in the man page:

> txi <- tximport(files, "salmon", countsFromAbundance="scaledTPM", txOut=TRUE)
reading in files with read_tsv
1 2 3 4 5 6
> colSums(txi$counts)
 sample1  sample2  sample3  sample4  sample5  sample6
22830167 23721985 34481913 23854225 28694460 22602904
>

ADD REPLY • link 19 months ago Michael Love 41k

score 2 · Accepted Answer · 2022-08-31

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

The answer is different for limma than it would be for edgeR or DESeq2 because limma uses normal linear models whereas edgeR and DESeq2 used negative binomial generalized linear models.

With limma, you can model the donor effect using voomLmFit with donor as a block effect, see for example multiple voom runs with duplicateCorrelation and negative values in limma . With edgeR and DESeq2, you would have to either ignore the 3_A sample or remove donor from the model, neither option being very desirable.

ADD COMMENT • link 20 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you Gordon. I will wait and see if Michael would drop by and answer the first question. However, I also have another question related to linking tximport() and limma. I have another data set which I only got this txi object from a bioinformatician from another institute (it would be easier to just get the RSD file). It has a similar problem with the design matrix as the above example. It is not adjusted to anything i.e. countsFromAbundance = "no". Is it still okay to use limma? Is there anyway to adjust for the transcript length etc. with the txi object?

ADD REPLY • link 20 months ago Kent • 0

0

Entering edit mode

Sorry, I can't answer questions about tximport as I don't use it myself. The code in the tximport documentation for inputing to limma and edgeR was checked by Aaron Lun so I am sure it is correct --- you should not change it or omit any steps. Regarding transcript length, I personally prefer not to adjust for transcript length when doing gene level analyses because I believe that transcript annotation is often poor and transcripts are often inaccurately estimated.

ADD REPLY • link 20 months ago • updated 19 months ago Gordon Smyth 50k