Specific considerations in RNA-Seq DE analysis with edgeR pipeline for a TCGA dataset starting with HTSeq counts
0
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 14 months ago
Germany/Heidelberg/German Cancer Resear…

Dear BioCommunity,

i have downloaded raw HTSeq counts for a provisional cancer TCGA dataset, with the R package TCGAbiolinks. My main goal is to implement DE expression analysis for a two-group comparison with edgeR pipeline, especially with the TREAT approach implementation, to inspect the overlap of the DE genes with a gene signature identified from a different technology, with similar phenotype. In detail, the starting "container" object is the following:

expdat

class: RangedSummarizedExperiment 
dim: 57035 521 
metadata(0):
assays(1): HTSeq - Counts
rownames(57035): ENSG00000000003 ENSG00000000005 ...
  ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
  original_ensembl_gene_id
colnames(521): TCGA-A6-2677-01A-01R-0821-07
  TCGA-CM-6163-01A-11R-1653-07 ...
  TCGA-AA-3517-01A-01R-0821-07 TCGA-A6-2682-01A-01R-1410-07
colData names(100): patient barcode ...
  subtype_vascular_invasion_present subtype_vital_status

# Thus, my main questions are the following:

# Except the initial part of DGE object construction and gene symbol annotation:

 y <- DGEList(counts=assay(expdat),group=colData(expdat)$definition)

head(y$samples)
                                           group lib.size
TCGA-A6-2677-01A-01R-0821-07 Primary solid Tumor 26532847
TCGA-CM-6163-01A-11R-1653-07 Primary solid Tumor 49028498
TCGA-AA-3867-01A-01R-1022-07 Primary solid Tumor 18494632
TCGA-AA-A00Z-01A-01R-A002-07 Primary solid Tumor 20979976
TCGA-G4-6297-01A-11R-1723-07 Primary solid Tumor 43469922
TCGA-AA-A02E-01A-01R-A00A-07 Primary solid Tumor 19504106
                             norm.factors
TCGA-A6-2677-01A-01R-0821-07            1
TCGA-CM-6163-01A-11R-1653-07            1
TCGA-AA-3867-01A-01R-1022-07            1
TCGA-AA-A00Z-01A-01R-A002-07            1
TCGA-G4-6297-01A-11R-1723-07            1
TCGA-AA-A02E-01A-01R-A00A-07            1

y$Symbol <- mapIds(org.Hs.eg.db, rownames(y),
+ keytype="ENSEMBL",column="SYMBOL")
'select()' returned 1:many mapping between keys and columns
y <- y[!is.na(y$Symbol), ]
dim(y)
[1] 25174   521

1) Concering the general approach of filtering non-expressed or very-low expressed genomic features-it is more beneficial to be performed prior to normalization process, as a putatively high fraction of non-expressed genes that could enter the normalization procedure, could affect the whole process? and thus, it is beneficial first to filter, then to normalize ?

2) Based on the notion of filtering: in the support group, as also in many workflows, various approaches are proposed. Based on the comprehensive tutorial/paper "It’s DE-licious", an approach is proposed by identifying a threshold of cpm for a relative number of 10 counts:

cpm(10, mean(y$samples$lib.size))

# in my dataset, the above function returns:

cpm(10, mean(y$samples$lib.size))
          [,1]
[1,] 0.2804865

As my downstream DE analysis would be based on two groups: primary cancers and control samples, and the group with the smaller number of samples is the normal group with 41 samples, how i would modify my filtering criterion ? Something like the following ?

expressed <- rowSums (cpm(y) > 0.28) >=40 ?

3) My next crusial concern is about the actual DE comparison:

in detail, the 41 control samples present, are adjucent normal samples (correspond to the same patient) with the other 41 primary cancer samples, which are a subset of the total ~500 cancer samples in the dataset. Thus, a general two group comparison without blocking would be OK, or i have to subset to only the 41 pairs of cancer-control samples (in total 82 samples) ? My notion for this question, is that despite the fact that all the cancer samples are primary colon tumor cancers, there are various subtypes included, and might have a different range in expression regarding the comparison with control samples. A possible approach for this, is to include this information in the design matrix; however, only ~140 cancer samples have this information (many NA values).

4) My last question is about the TREAT implementation with edgeR:

# some above steps...

fit <- glmQLFit(y, design, robust=TRUE)

de <- glmTreat(fit, contrast=con, lfc=N)

topTags(tde)

 

Usually, a rational number in the number N/log2 threshold, would be a cutoff below which a gene has no biological interest ? for instance log2(1.2) ?

or also could be increased if the number of genes with glmQLFTest is relatively high ?

 

Please excuse me for any naive questions, but i'm a newbie in rna-seq analysis, and i tried to recap with some basic considerations !!

Thank you in advance,

Efstathios

 

 

edgeR rnaseq DEanalysis TCGA independent filtering • 1.5k views
ADD COMMENT

Login before adding your answer.

Traffic: 487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6