Error and weird result of TCGAanalyze_Normalization function using the R package TCGAbiolinks for normalization of raw HTSeq counts
2
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear BioCommunity,

i recently implemented the R package TCGAbiolinks, to download raw HTSEQ counts for a provinsional cancer TCGA dataset (COAD). My code used was the following:

query.exp.hg38 <- GDCquery(project = "TCGA-COAD", 
data.category = "Transcriptome Profiling", 
data.type = "Gene Expression Quantification", 
workflow.type = "HTSeq - Counts")

GDCdownload(query.exp.hg38,files.per.chunk = 50)

exp.hg38 <- GDCprepare(query = query.exp.hg38,
save=TRUE,directory=mypath,save.filename = "exp.coad.htseqcounts.rda")

exp.hg38
class: RangedSummarizedExperiment 
dim: 56963 521 
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ...
  ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
  original_ensembl_gene_id
colnames(521): TCGA-3L-AA1B-01A-11R-A37K-07
  TCGA-DM-A1D8-01A-11R-A155-07 ...
  TCGA-AA-3675-01A-02R-0905-07 TCGA-G4-6323-01A-11R-1723-07
colData names(101): sample patient ...
  subtype_vascular_invasion_present subtype_vital_status

 

test <- TCGAanalyze_Normalization(tabDF=exp.hg38,geneInfo)
I Need about  0 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) : 
  'names' attribute [2] must be the same length as the vector [0]
Timing stopped at: 0 0 0.04 

 test2 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent")
I Need about  371 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Step 3 of 4: betweenLaneNormalization ...
Step 4 of 4: .quantileNormalization ...
> dim(test3)
[1] 23238   521

 

Any ideas about the first above error ? as also why the second approach worked,  but the different number  of features in the second approach is reduced to more than the half of the 56963 initial number ? From a small search in the function, in mentions in the geneInfo argument:

geneInfo= Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo

But, again how it explains the above issue ? perhaps it has to do something with the EDASeq R package, which is mentioned for implementing the above normalization function ?

Briefly, my main goals here, is to perform downstream DE analysis between cancer and normal samples, as also a suarvival analysis with a specific subset of genes.

Thank you in advance,

Efstathios

 

 

tcgabiolinks rnaseq normalization EDASeq • 3.5k views
ADD COMMENT
2
Entering edit mode
@antoniocolaprico-14504
Last seen 7.0 years ago
USA/ Florida/ University of Miami Hospi…

Dear @svlachavas (Efstathios) thank you for interest of using TCGAbiolinks for your research. I think that your post it is similar to the one I answered 1 week ago in our TCGAbiolinks github issues here: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/164. I am sorry that I wasn't reading also the bioconductor.support. Was my answer there useful? Thank you davide risso  (Davide) for helping us to find a solution. Again if you are using data from Gene expression aligned against hg38 in that case you can use geneInfoHT object in TCGAanalyze_Normalization as well documented in our vignette 

http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#htseq_data:_downstream_analysis_brca

And thank you for your suggestion, I used the function getGeneLengthAndGCContent from Davide's package EDASeq, and when it possibile I will add a function to update the geneInfoHT in TCGAbiolinks.

 

 

 

 

 

 

 

 

 

ADD COMMENT
0
Entering edit mode

Dear Antonio,

thank you for your consideration on this matter and also your anwer in the github group and relative discussion !! please excuse me for also adressing a similar issue here, and not for the reason that your answer did not help me for the first part-but mainly for adressing more a secondary issue about a discrepancy in the number of features returned when using gcContent:

that is, the geneInfoHT object has 20531 genes, but the normalization function with gcContent returned 23238 features (as illustrated in the code chunk above).

Best,

Efstathios

 

ADD REPLY
1
Entering edit mode
davide risso ▴ 980
@davide-risso-5075
Last seen 8 months ago
University of Padova

Hi,

I am not familiar with the implementation of TCGAbiolinks and with how they use the functions in the EDASeq package, so I'm not fully able to address your problem. Hopefully, some of the TCGAbiolinks authors / maintainers will chip in.

The error seems related to some naming issues. How does your geneInfo object look like? Can you please post here the results of

class(geneInfo)
head(geneInfo)

I'm just guessing, but could it be that the geneInfo object uses a different set of IDs for genes compared to the exp.hg38 object? (e.g. Ensembl vs gene symbols?)

Also, from the help page that you are providing above, it looks like the TCGAbiolinks package only provides the gene length and GC content for 20531 genes. I'm not sure why you end up with 23238 genes. This is a question for the TCGAbiolinks authors.

Best, Davide

ADD COMMENT
0
Entering edit mode

Dear Davide,

thank you for your responce and consideration on this matter !! Firstly, based on your question, the geneInfo argument above, has 2 options to consider:

xx <- TCGAbiolinks::geneInfo
> head(xx)
          geneLength gcContent           chr    
100130426 NA         NA                  ""     
100133144 "2420"     "0.490495867768595" "chr15"
100134869 "3190"     "0.428526645768025" "chr15"
10357     NA         NA                  ""     
10431     NA         NA                  ""     
136542    NA         NA                  ""     
> class(xx)
[1] "matrix"
> dim(xx)
[1] 20531     3
> xx2 <- TCGAbiolinks::geneInfoHT
> dim(xx2)
[1] 23486     2
> head(xx2)
                geneLength gcContent
ENSG00000000003       4535 0.3995590
ENSG00000000005       1610 0.4248447
ENSG00000000419       1207 0.4159072
ENSG00000000457       6883 0.4117391
ENSG00000000460       5967 0.4298643
ENSG00000000938       3474 0.5734024

 head(rownames(exp.hg38)) # input rownames for normalization above
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419"
[4] "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"

Thus, in your opinion with naming issues and gene identifiers ? As also it still concerns me for ending with 23238 genes with the second approach, whereas there are 23486 genes in the xx2 matrix above ? or i misunderstood something here ?

I also found from a small search some similar issues

https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/19

Best,

Efstathios

 

ADD REPLY
1
Entering edit mode

It seems that your input matrix (exp.hg38) has Ensembl gene IDs, so you should use a geneInfo matrix with Ensembl IDs as row names. Note that your code:

test <- TCGAanalyze_Normalization(tabDF=exp.hg38,geneInfo)

means that you are passing an object called "geneInfo" to the function "TCGAanalyze_Normalization". Do you have such object in your environment? How does it look like? That's what I meant when I said that you should copy and paste the value of head(geneInfo).

To be more specific, you need to make sure that the matrix "geneInfo" that you are passing to the function has Ensembl IDs as row names.

ADD REPLY
0
Entering edit mode

Dear davide,

thank you for your updated answer. Firstly, concerning the geneInfo object

head(geneInfo)
          geneLength gcContent           chr    
100130426 NA         NA                  ""     
100133144 "2420"     "0.490495867768595" "chr15"
100134869 "3190"     "0.428526645768025" "chr15"
10357     NA         NA                  ""     
10431     NA         NA                  ""     
136542    NA         NA                  ""     

 

so it seems that this is the default option when implementing the above normalization function, which then returns the above error. So i have to convert the entrez GeneIds of the gene info, to Ensemble gene ids as in the rows of my input counts matrix. Regarding, my second question about the discrepancy in the number of returned features (23238 instead of 23486), do you have any idea about this ? i have also created a thread in the github page issues of the package, but i haven't got an answer, and im not sure about this. 

Finally, as an alternative, which similar function can i implement from your package EDASeq directly, to move on with normalization of my raw counts matrix ?

Best,

Efstathios

ADD REPLY
1
Entering edit mode

I don't know about the discrepancy, sorry. You should contact the TCGAbiolinks maintainers.

You can, of course, use the EDASeq package directly. Simply use the "withinLaneNormalization()" function. You can have a look at the EDASeq vignette for an example on how to use it.

ADD REPLY
0
Entering edit mode

Thanks Davide. I will take a detailed look and perhaps write back or on a different thread for possible related questions.

ADD REPLY
0
Entering edit mode

Dear Davide, please excuse me to return on this post, but unfortunately, a weird error appeared when i tried to use the above function:

gcounts <- assay(exp.hg38)

class(gcounts)
[1] "matrix"

dnorm <- withinLaneNormalization(x=gcounts, which="full")
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘withinLaneNormalization’ for signature ‘"matrix", "missing"

 sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253   
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C                 
[5] LC_TIME=Greek_Greece.1253    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] EDASeq_2.12.0              ShortRead_1.36.0          
 [3] GenomicAlignments_1.14.1   SummarizedExperiment_1.8.0
 [5] DelayedArray_0.4.1         matrixStats_0.52.2        
 [7] Rsamtools_1.30.0           GenomicRanges_1.30.0      
 [9] GenomeInfoDb_1.14.0        Biostrings_2.46.0         
[11] XVector_0.18.0             IRanges_2.12.0            
[13] S4Vectors_0.16.0           BiocParallel_1.12.0       
[15] Biobase_2.38.0             BiocGenerics_0.24.0       
[17] TCGAbiolinks_2.7.4        

Any ideas for the above problem ?

 

ADD REPLY
1
Entering edit mode
*Please*, read the documentation before asking questions on the forum. If you look at the documentation of the `withinLaneNormalization` function you will see that, naturally, it needs the GC-content in addition to the gene-level read counts to carry on the normalization. The error is telling you that the second argument (the GC-content) is missing.
ADD REPLY
0
Entering edit mode

Thank you for your answer Davide and please excuse me for the misinprentation of the above function. I saw also in section 8 of the vignette the function getGeneLengthAndGCContent , from which i probably will retreive the gc content for my ensembl identifiers of the datasets.

ADD REPLY

Login before adding your answer.

Traffic: 588 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6