Search
Question: What's the correct way to normalize HTseq files with TCGAbiolinks
1
11 months ago by
eshelden20
eshelden20 wrote:

I am learning to process data from the Genomic Data Commons using TCGAbiolinks in R, but I have run into an issue normalizing the data using the routines built into TCGAbiolinks. In brief, I checked the results of the analysis by looking at breast cancer data that was divided into estrogen receptor positive and negative groups, then examined expression levels for the estrogen receptor gene. The raw data and normalization using the "cgContent" option both produce reasonable looking results, but normalization using the default values and/or using the "geneLength" option do not. I would like to know if I can rely on normalization using the "cgContent" option as the "right" way to be doing this. My protocol and results are shown below.

To start, I retrieved the breast cancer RNA-seq files using:

query <- TCGAbiolinks::GDCquery(project = c("TCGA-BRCA"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification",sample.type = c("Primary solid Tumor","Solid Tissue Normal"), workflow.type = "HTSeq - Counts")

TCGAbiolinks::GDCdownload(query)

Followed by:

exp.brca<-TCGAbiolinks::GDCprepare(query = query, save = TRUE, save.filename = "brcaExp.rda")

At this point, I separated the normal samples from the tumor samples, and split the tumor samples into estrogen receptor positive and negative groups, using the following:

normalSamples<-exp.brca$barcode[grep("NT",exp.brca$shortLetterCode)]

erPosTumors<-exp.brca$barcode[grep("Positive",exp.brca$subtype_ER.Status)]

erNegTumors<-exp.brca$barcode[grep("Negative",exp.brca$subtype_ER.Status)]

The estrogen receptor gene ID is: ENSG00000091831

I looked at the non-normalized values using this method:

mAssay<-SummarizedExperiment::assay(exp.brca)
ESR1_Norm<- mAssay["ENSG00000091831", normalSamples]
ESR1_Pos_Tumors<- mAssay ["ENSG00000091831", erPosTumors]
ESR1_Neg_Tumors<- mAssay ["ENSG00000091831", erNegTumors]
boxplot(ESR1_Norm, ESR1_Pos_Tumors, ESR1_Neg_Tumors, main = "ESR1", names =  c("ESR1_Norm","ESR1_Pos_Tumors", "ESR1_Neg_Tumors"), las=2, cex.axis=.7)

Here is the result, which shows elevated expression in estrogen receptor positive samples and low expression in estrogen receptor negative samples. This is the result I expected.

I also normalized the data using three different commands:

dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(exp.brca, geneInfo = TCGAbiolinks::geneInfoHT)

dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(exp.brca, geneInfo = TCGAbiolinks::geneInfoHT, method = "geneLength")

and each time plotted the results using:

ESR1_Norm<- dataNorm["ENSG00000091831", normalSamples]
ESR1_Pos_Tumors<- dataNorm ["ENSG00000091831", erPosTumors]
ESR1_Neg_Tumors<- dataNorm ["ENSG00000091831", erNegTumors]
boxplot(ESR1_Norm, ESR1_Pos_Tumors, ESR1_Neg_Tumors, main = "ESR1", names =  c("ESR1_Norm","ESR1_Pos_Tumors", "ESR1_Neg_Tumors"), las=2, cex.axis=.7)

Using the cgContent option gives a result that is in good agreement with my expectations and the raw data:

However, the default command, and using the "geneLength" option both give graphs that are non-informative, and inspection of the data shows that most of the values are zero:

So, can anyone tell me if the cgContent option is the "correct" approach for all circumstances, and/or why the other commands failed?

Thanks,

Eric