i have downloaded raw HTSeq counts for a provisional cancer TCGA dataset, with the R package TCGAbiolinks. My main goal is to implement DE expression analysis for a two-group comparison with edgeR pipeline, especially with the TREAT approach implementation, to inspect the overlap of the DE genes with a gene signature identified from a different technology, with similar phenotype. In detail, the starting "container" object is the following:
expdat class: RangedSummarizedExperiment dim: 57035 521 metadata(0): assays(1): HTSeq - Counts rownames(57035): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920 rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id colnames(521): TCGA-A6-2677-01A-01R-0821-07 TCGA-CM-6163-01A-11R-1653-07 ... TCGA-AA-3517-01A-01R-0821-07 TCGA-A6-2682-01A-01R-1410-07 colData names(100): patient barcode ... subtype_vascular_invasion_present subtype_vital_status # Thus, my main questions are the following: # Except the initial part of DGE object construction and gene symbol annotation: y <- DGEList(counts=assay(expdat),group=colData(expdat)$definition) head(y$samples) group lib.size TCGA-A6-2677-01A-01R-0821-07 Primary solid Tumor 26532847 TCGA-CM-6163-01A-11R-1653-07 Primary solid Tumor 49028498 TCGA-AA-3867-01A-01R-1022-07 Primary solid Tumor 18494632 TCGA-AA-A00Z-01A-01R-A002-07 Primary solid Tumor 20979976 TCGA-G4-6297-01A-11R-1723-07 Primary solid Tumor 43469922 TCGA-AA-A02E-01A-01R-A00A-07 Primary solid Tumor 19504106 norm.factors TCGA-A6-2677-01A-01R-0821-07 1 TCGA-CM-6163-01A-11R-1653-07 1 TCGA-AA-3867-01A-01R-1022-07 1 TCGA-AA-A00Z-01A-01R-A002-07 1 TCGA-G4-6297-01A-11R-1723-07 1 TCGA-AA-A02E-01A-01R-A00A-07 1 y$Symbol <- mapIds(org.Hs.eg.db, rownames(y), + keytype="ENSEMBL",column="SYMBOL") 'select()' returned 1:many mapping between keys and columns y <- y[!is.na(y$Symbol), ] dim(y)  25174 521
1) Concering the general approach of filtering non-expressed or very-low expressed genomic features-it is more beneficial to be performed prior to normalization process, as a putatively high fraction of non-expressed genes that could enter the normalization procedure, could affect the whole process? and thus, it is beneficial first to filter, then to normalize ?
2) Based on the notion of filtering: in the support group, as also in many workflows, various approaches are proposed. Based on the comprehensive tutorial/paper "It’s DE-licious", an approach is proposed by identifying a threshold of cpm for a relative number of 10 counts:
cpm(10, mean(y$samples$lib.size)) # in my dataset, the above function returns: cpm(10, mean(y$samples$lib.size)) [,1] [1,] 0.2804865
As my downstream DE analysis would be based on two groups: primary cancers and control samples, and the group with the smaller number of samples is the normal group with 41 samples, how i would modify my filtering criterion ? Something like the following ?
expressed <- rowSums (cpm(y) > 0.28) >=40 ?
3) My next crusial concern is about the actual DE comparison:
in detail, the 41 control samples present, are adjucent normal samples (correspond to the same patient) with the other 41 primary cancer samples, which are a subset of the total ~500 cancer samples in the dataset. Thus, a general two group comparison without blocking would be OK, or i have to subset to only the 41 pairs of cancer-control samples (in total 82 samples) ? My notion for this question, is that despite the fact that all the cancer samples are primary colon tumor cancers, there are various subtypes included, and might have a different range in expression regarding the comparison with control samples. A possible approach for this, is to include this information in the design matrix; however, only ~140 cancer samples have this information (many NA values).
4) My last question is about the TREAT implementation with edgeR:
# some above steps... fit <- glmQLFit(y, design, robust=TRUE) de <- glmTreat(fit, contrast=con, lfc=N) topTags(tde)
Usually, a rational number in the number N/log2 threshold, would be a cutoff below which a gene has no biological interest ? for instance log2(1.2) ?
or also could be increased if the number of genes with glmQLFTest is relatively high ?
Please excuse me for any naive questions, but i'm a newbie in rna-seq analysis, and i tried to recap with some basic considerations !!
Thank you in advance,