Is it appropriate to use DESeq2 to analyse MinION data that's been mapped to a transcriptome?
@david-eccles-gringer-7488

More generally, how should we calculate differential expression of genes (or transcripts) based on long-read nanopore data?

Is there some option I can activate to get our long reads to work better with DESeq2 (or any other tool)?

My core problem is that the Log2FC values reported by DESeq2 do not reflect the values I calculate myself by taking counts of genome-mapped reads for a particular transcript and determining an unadjusted fold change. In extreme cases there's a log2 difference of about 4 (i.e. DESeq2 reports a Log2FC of 8, whereas I calculate it as about 4). In addition, zero counts are being inconsistently scaled by the rlog function; here's an example:

       |       Raw counts        |             rlog counts              |      |
       | Sample 1 |   Sample 2   |    Sample 1    |      Sample 2       | L2FC |
Gstk1  |  0  0  0 |   5 36 17  3 | 0.38 0.02 0.26 | 2.36 3.50 3.03 1.50 | 4.69 |
Psmb8  |  0  0  0 |  12 16 64  5 | 1.43 0.92 1.26 | 4.04 3.26 5.19 2.70 | 5.16 |
Sumo3  |  0  0  0 |  14  8  9  4 | 1.74 1.22 1.57 | 4.36 2.78 3.21 2.73 | 5.23 |
Mgp    |  0  0  0 |   1 66 24 16 | 1.72 1.18 1.54 | 2.30 4.98 4.23 4.03 | 5.54 |
Ccl2   |  0  0  0 | 197 37 33  6 | 1.73 1.13 1.53 | 7.68 4.39 4.66 3.09 | 7.67 |

Why do the zero values have different adjusted levels, both for different genes, and within replicates of the same sample for the same gene?

I have the following workflow for processing nanopore data (more details here):

  1. Demultiplex reads by barcode
  2. Orient reads to be relative to transcription direction
  3. Map to transcriptome using LAST*
  4. Convert unique barcode/transcript/direction tuples into a count table (a sketch follows this list)

* I currently use LAST due to its high specificity, but am considering whether minimap2 (or something else) would be a better option
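
For concreteness, here's a minimal sketch of step 4 (the file name and column names are placeholders, not my actual format):

## Hypothetical sketch of step 4: collapse per-read assignments into a
## count table of barcode/transcript/direction tuples
library(dplyr);
library(tidyr);
reads.df <- read.delim("read_assignments.tsv"); ## columns: read, barcode, transcript, dir
count.df <- reads.df %>%
    distinct(read, barcode, transcript, dir) %>% ## each read votes once
    count(barcode, transcript, dir) %>%          ## tally reads per tuple
    pivot_wider(names_from=barcode, values_from=n,
                values_fill=0);                  ## samples (barcodes) as columns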

After this, I get a bit lost, because there doesn't seem to be any established method for analysing long-read data. Long-read data from nanopore sequencing tends to be very specific (e.g. a read count of 1 or 2 is a good indication that the transcript is actually present and expressed), but not very sensitive (i.e. read counts are typically quite low for each transcript). We probably have substantial batch effects, because our sequencing runs are a few months apart, and Oxford Nanopore usually changes at least one of their sample preparation protocol, reagent kit, basecalling software, or flow cell in that time frame.
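
If the sequencing run were recorded in our metadata, one way to soak up some of that would be a batch term in the design formula; a hypothetical sketch using the objects from the script below ('run' and 'condition' are placeholder column names, not from our actual metadata):

## hypothetical: model the sequencing run as a batch effect
library(DESeq2);
dds <- DESeqDataSetFromMatrix(count.mat, meta.df,
                              design = ~ run + condition);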

Is it okay to analyse our data using DESeq2? Is there something specific that I should be doing for low-coverage long-read data (e.g. zero-value inflation)?

A count table and metadata file can be found here and here respectively [note: unpublished / draft data; subject to change without notice].

Edit: After Michael Love's comment about rlog vs VST, I did a VST transformation on the counts, which I'm much more comfortable with. This still produced positive transformed values for zero counts, but the zeros were at least consistently scaled to the same transformed number, which made them easy to filter out. I subtracted the minimum from the transformed values, then rescaled so that the 99th percentiles of the non-zero values matched up, which ended up with something that seemed a reasonable approximation of the raw read counts. Finally, I set all the zero values to -2 (so that they would be rounded to 0 in non-log space):

       |       Raw counts        |      rescaled VST counts      |      |
       | Sample 1 |   Sample 2   |  Sample 1  |     Sample 2     | L2FC |
Gstk1  |  0  0  0 |   5 36 17  3 | -2  -2  -2 | 2.52 3.68 3.13 1.58 | 4.69 |
Psmb8  |  0  0  0 |  12 16 64  5 | -2  -2  -2 | 3.66 2.61 5.22 2.01 | 5.16 |
Sumo3  |  0  0  0 |  14  8  9  4 | -2  -2  -2 | 3.89 1.9  2.37 1.81 | 5.23 |
Mgp    |  0  0  0 |   1 66 24 16 | -2  -2  -2 | 1.18 4.65 3.61 3.36 | 5.54 |
Ccl2   |  0  0  0 | 197 37 33  6 | -2  -2  -2 | 8.81 3.72 4.1  2.18 | 7.67 |
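
In code, the transformation looks roughly like this ('dds' and 'count.mat' come from the script below; the exact rescaling target shown here is illustrative):

## Sketch of the VST rescaling described above
library(DESeq2);
v <- assay(varianceStabilizingTransformation(dds));
v <- v - min(v);            ## shift so the minimum transformed value is zero
zeros <- (count.mat == 0);  ## the VST maps all raw zeros to the same value
## rescale so the 99th percentile of non-zero transformed values matches
## that of the log2 raw counts
v <- v * quantile(log2(count.mat[!zeros]), 0.99) / quantile(v[!zeros], 0.99);
v[zeros] <- -2;             ## 2^-2 = 0.25, which rounds to 0 in non-log space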

Here's a snippet of my current DESeq2 processing script, which processes a merged count table [tdir = transcript + strand direction]:

library(DESeq2); ## DESeqDataSetFromMatrix, DESeq, results, lfcShrink, rlog, assay
library(dplyr);  ## as_tibble, left_join

analysisDate <- format(Sys.Date(), "%Y-%b-%d");
countDate <- "2019-Apr-18";
count.df <- read.csv(sprintf("raw_counts_%s.csv", countDate),
                     stringsAsFactors=FALSE);
## Re-scan the header to avoid read.csv's replacement of special characters with '.'
colnames(count.df) <- scan(sprintf("raw_counts_%s.csv", countDate),
                           what="character", nlines=1,
                           sep=",");
count.cols <- setdiff(colnames(count.df),
                      c("transcript", "Chr", "Strand", "Start", "End",
                        "Description", "Gene", "dir", "tdir"));

...
count.mat <- as.matrix(count.df[,count.cols]);
count.mat[is.na(count.mat)] <- 0;
rownames(count.mat) <- count.df$tdir;
## Only include transcripts with total counts >= 2
count.mat <- count.mat[rowSums(count.mat) >= 2,];
...
meta.df <- read.csv("metadata.csv");
excluded.factors <- "Treatment"; ## Factors to exclude from statistical model
## Identify factors for the statistical model from the metadata file
factorNames <- setdiff(colnames(meta.df),
                       c("SampleID", "Label", "Replicate", "Notes",
                         excluded.factors));
## Convert to differential expression structure
dds <- DESeqDataSetFromMatrix(count.mat, meta.df,
                              as.formula(paste0("~ ",paste(factorNames,collapse=" + "))));
## Run differential expression tests
dds <- DESeq(dds);
...
## Collect up comparisons to make
resultList <- NULL;
for(fi in factorNames){
    ## metasub.df: per-analysis subset of meta.df (defined in code elided above)
    vn <- unique(metasub.df[,fi]);
    vn <- vn[order(-xtfrm(vn))]; ## sort factor levels in descending order
    if(length(vn) == 1){
        next;
    }
    for(fai in seq(1, length(vn)-1)){
        for(fbi in seq(fai+1, length(vn))){
            cat(sprintf("%s: %s vs %s\n", fi, vn[fai], vn[fbi]));
            resultList <- c(resultList, list(c(fi, as.character(vn[fai]), as.character(vn[fbi]))));
        }
    }
}
## Generate base count table
countsub.df <- subset(count.df, tdir %in% rownames(count.mat));
dds.withCounts.tbl <- as_tibble(countsub.df);
## Generate DE results, add to base table
for(rn in resultList){
    print(rn);
    results.df <-
        if(l2FCShrink){ ## logical flag set in code elided above
            as.data.frame(lfcShrink(dds, contrast=rn, type="ashr"));
        } else {
            as.data.frame(results(dds, contrast=rn));
        }
    results.df$log2FoldChange <- round(results.df$log2FoldChange, 2);
    results.df$pvalue <- signif(results.df$pvalue, 3);
    results.df$padj <- signif(results.df$padj, 3);
    rn.label <- paste(rn, collapse="-");
    results.tbl <- as_tibble(results.df[, c("log2FoldChange", "pvalue", "padj")]);
    colnames(results.tbl) <- paste0(c("L2FC.","pval.", "padj."), rn.label);
    results.tbl$tdir <- rownames(results.df);
    dds.withCounts.tbl <-
        left_join(dds.withCounts.tbl, results.tbl, by="tdir");
}
## Replace original count array with normalised log2 count array
dds.counts <- assay(rlog(dds));
dds.withCounts.tbl[,colnames(dds.counts)] <- round(dds.counts,2);
## Exclude any zero/negative total counts after regularised log transform
dds.withCounts.tbl <-
    dds.withCounts.tbl[rowSums(dds.counts, na.rm=TRUE) > 0,];
## Replace count column names with labels from metadata file
colnames(dds.withCounts.tbl)[match(meta.df$SampleID,
                                   colnames(dds.withCounts.tbl))] <-
    as.character(meta.df$Label);

## Write out to a file
write.csv(dds.withCounts.tbl, row.names=FALSE,
          sprintf("DE_normalised_%s.csv", analysisDate));

Comment:

You can try out the pipeline at https://github.com/nanoporetech/pipeline-transcriptome-de which is loosely based on the Swimming Downstream workflow. It uses minimap2 for mapping, salmon for quantification and edgeR quasi-likelihood for differential gene expression. Also, I would not orient the reads in this case as it is not necessary and might just discard some reads.
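
For reference, a minimal sketch of the edgeR quasi-likelihood approach it uses (the 'cts' matrix and 'group' factor are placeholders; this is the generic edgeR QL recipe, not the pipeline's exact code):

## edgeR quasi-likelihood sketch: cts = count matrix, group = condition factor
library(edgeR)
y <- DGEList(counts=cts, group=group)
y <- calcNormFactors(y)          ## TMM normalisation
design <- model.matrix(~group)
y <- estimateDisp(y, design)     ## estimate dispersions
fit <- glmQLFit(y, design)       ## quasi-likelihood fit
qlf <- glmQLFTest(fit, coef=2)   ## test the group effect
topTags(qlf)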

Reply:

Read orientation is necessary for the discovery work we're carrying out.

The most relevant for us is negative strand expression on the mitochondrial genome in unannotated regions. I appreciate that this is mostly not going to be picked up by mapping to a transcriptome, and am on the lookout for alternative ways to do mapping of unannotated regions.

It doesn't make sense to me to discard that information, given that it is a strand-specific sequencing process.

Answer from @mikelove:

I remember a while back you posted an example of a gene where a single sample had a high count, and the posterior LFC from lfcShrink was moderated relative to the ratio-of-average-normalized-counts estimator. This is typical behavior of posterior estimates, and is similar to what goes on with variance/dispersion moderation as well. The posterior effect sizes we report tend to have lower error and provide better ranking than the ratio of averages, according to our 2018 paper (if you use ashr or apeglm, as you do above).
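
As an illustration, here's a sketch of that comparison (it assumes a DESeqDataSet 'dds' already run through DESeq(), with a two-level factor 'condition'; the names and the 0.5 pseudocount are placeholders):

## naive ratio-of-averages LFC vs the shrunken posterior LFC
ncounts <- counts(dds, normalized=TRUE)
grp <- colData(dds)$condition
naiveLFC <- log2(rowMeans(ncounts[, grp == "B"]) + 0.5) -
    log2(rowMeans(ncounts[, grp == "A"]) + 0.5)
shrunkLFC <- lfcShrink(dds, contrast=c("condition","B","A"),
                       type="ashr")$log2FoldChange  ## needs the ashr package
## genes driven by a single high count show the strongest moderation
head(sort(abs(naiveLFC - shrunkLFC), decreasing=TRUE))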

With respect to the long read data, I have been working with groups that use standard RNA-seq approaches. Given the Swimming Downstream workflow results, one group I work with went with edgeR quasi-likelihood, as it performed very well in that benchmark.

I don't have much detail on the quantification aspect of long-read data, but I'm trying to recruit someone who I know does have experience to weigh in here.

Comment:

We have done some investigations into quantification of nanopore data using several different approaches (including Salmon, Wub and featureCounts, based on minimap2 alignments): https://www.biorxiv.org/content/10.1101/574525v1. We noted that in many cases it can be difficult to unambiguously attribute a read to a specific transcript despite the long read length (there's naturally more certainty at the gene level), so I think some caution is warranted on the specificity side as well. We didn't go into much detail on the differential expression side in the paper, but straightforward counting (i.e., each long read contributes one count to the corresponding transcript/gene) would indeed typically result in low counts. We also didn't look specifically at zero-inflation, but I would expect that we are dealing with overall low counts rather than a mixture of high counts and zeros (for a given feature), so in principle I think standard RNA-seq tools should be applicable.

Reply:

Thanks. I understand that in addition to the A-tail anchoring there are isoform annotation issues which mean that there might be incorrect assignment. Would you recommend preprocessing using something like tximport to convert the transcript counts to gene counts?

On second thoughts, I notice that the tximport documentation warns against this for 3'-tagged [cDNA]-Seq data. I get my 3s and 5s easily mixed up, but my quick searching suggests that the 3' end is the end that is polyadenylated in mRNA:

If you have 3’ tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend to use the original counts, e.g. txi$counts as a counts matrix, e.g. providing to DESeqDataSetFromMatrix or to the edgeR or limma functions without calculating an offset and without using countsFromAbundance.

So... should I just add up the counts per gene? But that wouldn't work with the LAST results I have because [I think] it will report both isoforms if one [mapped] isoform is a subsequence of another; gene counts in that case would be inflated. I should be looking for unique barcode/gene/direction tuples, rather than unique barcode/transcript/direction tuples.

Reply:

tximport with default arguments just adds up the counts per gene in the “counts” matrix.
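
A minimal sketch, assuming salmon quantification files ('files', 'tx2gene', 'samples' and 'condition' are placeholders):

library(tximport)
library(DESeq2)
## files: named vector of quant.sf paths; tx2gene: transcript-to-gene table
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
## per the vignette advice quoted above for 3'-tagged data: use the
## gene-level counts directly, with no length offset
dds <- DESeqDataSetFromMatrix(round(txi$counts), colData=samples,
                              design=~condition)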

Reply:

About the rlog: it is not monotonic across the entire matrix, while the VST is. I prefer the VST for this and other reasons, and have moved the documentation in this direction.
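
A sketch of the drop-in change in the script above:

## replacement for assay(rlog(dds)); varianceStabilizingTransformation works
## for any matrix size, whereas vst() fits on a subset of rows and by default
## requires at least ~1000 rows
dds.counts <- assay(varianceStabilizingTransformation(dds))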

Reply:

Thanks for that insight. I used vst for my last differential expression analysis and really liked it (especially after length-based scaling, which I called vstpk). I had thought that rlog was a newer, better approach. I think you've given me the nudge I needed to revisit vst.

Reply:

I've also found this approach: https://github.com/a-slide/NanoCount. Would that be appropriate?

Reply:

It is very similar to what salmon does in alignment mode.
