Dear BioCommunity,
i recently implemented the R package TCGAbiolinks, to download raw HTSEQ counts for a provinsional cancer TCGA dataset (COAD). My code used was the following:
query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
GDCdownload(query.exp.hg38,files.per.chunk = 50)
exp.hg38 <- GDCprepare(query = query.exp.hg38,
save=TRUE,directory=mypath,save.filename = "exp.coad.htseqcounts.rda")
exp.hg38
class: RangedSummarizedExperiment
dim: 56963 521
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ...
ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
original_ensembl_gene_id
colnames(521): TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-DM-A1D8-01A-11R-A155-07 ...
TCGA-AA-3675-01A-02R-0905-07 TCGA-G4-6323-01A-11R-1723-07
colData names(101): sample patient ...
subtype_vascular_invasion_present subtype_vital_status
test <- TCGAanalyze_Normalization(tabDF=exp.hg38,geneInfo)
I Need about 0 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) :
'names' attribute [2] must be the same length as the vector [0]
Timing stopped at: 0 0 0.04
test2 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent")
I Need about 371 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Step 3 of 4: betweenLaneNormalization ...
Step 4 of 4: .quantileNormalization ...
> dim(test3)
[1] 23238 521
Any ideas about the first above error ? as also why the second approach worked, but the different number of features in the second approach is reduced to more than the half of the 56963 initial number ? From a small search in the function, in mentions in the geneInfo argument:
geneInfo= Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo
But, again how it explains the above issue ? perhaps it has to do something with the EDASeq R package, which is mentioned for implementing the above normalization function ?
Briefly, my main goals here, is to perform downstream DE analysis between cancer and normal samples, as also a suarvival analysis with a specific subset of genes.
Thank you in advance,
Efstathios
Dear Antonio,
thank you for your consideration on this matter and also your anwer in the github group and relative discussion !! please excuse me for also adressing a similar issue here, and not for the reason that your answer did not help me for the first part-but mainly for adressing more a secondary issue about a discrepancy in the number of features returned when using gcContent:
that is, the geneInfoHT object has 20531 genes, but the normalization function with gcContent returned 23238 features (as illustrated in the code chunk above).
Best,
Efstathios