Specific considerations in RNA-Seq DE analysis with edgeR pipeline for a TCGA dataset starting with HTSeq counts
0
0
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 1 day ago
Germany/Heidelberg/German Cancer Resear…

Dear BioCommunity,

i have downloaded raw HTSeq counts for a provisional cancer TCGA dataset, with the R package TCGAbiolinks. My main goal is to implement DE expression analysis for a two-group comparison with edgeR pipeline, especially with the TREAT approach implementation, to inspect the overlap of the DE genes with a gene signature identified from a different technology, with similar phenotype. In detail, the starting "container" object is the following:

expdat

class: RangedSummarizedExperiment
dim: 57035 521
assays(1): HTSeq - Counts
rownames(57035): ENSG00000000003 ENSG00000000005 ...
ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
original_ensembl_gene_id
colnames(521): TCGA-A6-2677-01A-01R-0821-07
TCGA-CM-6163-01A-11R-1653-07 ...
TCGA-AA-3517-01A-01R-0821-07 TCGA-A6-2682-01A-01R-1410-07
colData names(100): patient barcode ...
subtype_vascular_invasion_present subtype_vital_status

# Thus, my main questions are the following:

# Except the initial part of DGE object construction and gene symbol annotation:

y <- DGEList(counts=assay(expdat),group=colData(expdat)$definition) head(y$samples)
group lib.size
TCGA-A6-2677-01A-01R-0821-07 Primary solid Tumor 26532847
TCGA-CM-6163-01A-11R-1653-07 Primary solid Tumor 49028498
TCGA-AA-3867-01A-01R-1022-07 Primary solid Tumor 18494632
TCGA-AA-A00Z-01A-01R-A002-07 Primary solid Tumor 20979976
TCGA-G4-6297-01A-11R-1723-07 Primary solid Tumor 43469922
TCGA-AA-A02E-01A-01R-A00A-07 Primary solid Tumor 19504106
norm.factors
TCGA-A6-2677-01A-01R-0821-07            1
TCGA-CM-6163-01A-11R-1653-07            1
TCGA-AA-3867-01A-01R-1022-07            1
TCGA-AA-A00Z-01A-01R-A002-07            1
TCGA-G4-6297-01A-11R-1723-07            1
TCGA-AA-A02E-01A-01R-A00A-07            1

y$Symbol <- mapIds(org.Hs.eg.db, rownames(y), + keytype="ENSEMBL",column="SYMBOL") 'select()' returned 1:many mapping between keys and columns y <- y[!is.na(y$Symbol), ]
dim(y)
[1] 25174   521

1) Concering the general approach of filtering non-expressed or very-low expressed genomic features-it is more beneficial to be performed prior to normalization process, as a putatively high fraction of non-expressed genes that could enter the normalization procedure, could affect the whole process? and thus, it is beneficial first to filter, then to normalize ?

2) Based on the notion of filtering: in the support group, as also in many workflows, various approaches are proposed. Based on the comprehensive tutorial/paper "It’s DE-licious", an approach is proposed by identifying a threshold of cpm for a relative number of 10 counts:

cpm(10, mean(y$samples$lib.size))

# in my dataset, the above function returns:

cpm(10, mean(y$samples$lib.size))
[,1]
[1,] 0.2804865

As my downstream DE analysis would be based on two groups: primary cancers and control samples, and the group with the smaller number of samples is the normal group with 41 samples, how i would modify my filtering criterion ? Something like the following ?

expressed <- rowSums (cpm(y) > 0.28) >=40 ?

3) My next crusial concern is about the actual DE comparison:

in detail, the 41 control samples present, are adjucent normal samples (correspond to the same patient) with the other 41 primary cancer samples, which are a subset of the total ~500 cancer samples in the dataset. Thus, a general two group comparison without blocking would be OK, or i have to subset to only the 41 pairs of cancer-control samples (in total 82 samples) ? My notion for this question, is that despite the fact that all the cancer samples are primary colon tumor cancers, there are various subtypes included, and might have a different range in expression regarding the comparison with control samples. A possible approach for this, is to include this information in the design matrix; however, only ~140 cancer samples have this information (many NA values).

4) My last question is about the TREAT implementation with edgeR:

# some above steps...

fit <- glmQLFit(y, design, robust=TRUE)

de <- glmTreat(fit, contrast=con, lfc=N)

topTags(tde)

Usually, a rational number in the number N/log2 threshold, would be a cutoff below which a gene has no biological interest ? for instance log2(1.2) ?

or also could be increased if the number of genes with glmQLFTest is relatively high ?

Please excuse me for any naive questions, but i'm a newbie in rna-seq analysis, and i tried to recap with some basic considerations !!

Efstathios