Dear BioCommunity,
i recently implemented the R package TCGAbiolinks, to download raw HTSEQ counts for a provinsional cancer TCGA dataset (COAD). My code used was the following:
query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
GDCdownload(query.exp.hg38,files.per.chunk = 50)
exp.hg38 <- GDCprepare(query = query.exp.hg38,
save=TRUE,directory=mypath,save.filename = "exp.coad.htseqcounts.rda")
exp.hg38
class: RangedSummarizedExperiment
dim: 56963 521
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ...
ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
original_ensembl_gene_id
colnames(521): TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-DM-A1D8-01A-11R-A155-07 ...
TCGA-AA-3675-01A-02R-0905-07 TCGA-G4-6323-01A-11R-1723-07
colData names(101): sample patient ...
subtype_vascular_invasion_present subtype_vital_status
test <- TCGAanalyze_Normalization(tabDF=exp.hg38,geneInfo)
I Need about 0 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) :
'names' attribute [2] must be the same length as the vector [0]
Timing stopped at: 0 0 0.04
test2 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent")
I Need about 371 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Step 3 of 4: betweenLaneNormalization ...
Step 4 of 4: .quantileNormalization ...
> dim(test3)
[1] 23238 521
Any ideas about the first above error ? as also why the second approach worked, but the different number of features in the second approach is reduced to more than the half of the 56963 initial number ? From a small search in the function, in mentions in the geneInfo argument:
geneInfo= Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo
But, again how it explains the above issue ? perhaps it has to do something with the EDASeq R package, which is mentioned for implementing the above normalization function ?
Briefly, my main goals here, is to perform downstream DE analysis between cancer and normal samples, as also a suarvival analysis with a specific subset of genes.
Thank you in advance,
Efstathios

