Using GenomicDataCommons R package to identify cases with both DNA methylation and gene expression data
@richardjacton-12268

I'm trying to get a list of cases for which there is DNA methylation and gene expression data available for both normal and cancer tissue samples using the GenomicDataCommons R package (which I am using for the first time).

qCases <- cases() %>%
    filter( ~ samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") %>%
    filter(~ files.type == 'gene_expression' & files.type == 'methylation_beta_value')
qCases %>% count()

> [1] 0

This returns no results.

Examining the response() of qCases when only the first filter is applied reveals that there are definitely cases with both gene expression and methylation beta value files.

qCases <- cases() %>%
    filter( ~ samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") %>%
    GenomicDataCommons::select('files.type') %>%
    response()

> $results
> files

> copy_number_segment, ----->###gene_expression###<-------, simple_somatic_mutation,
annotated_somatic_mutation, biospecimen_supplement, clinical_supplement, biospecimen_supplement,
biospecimen_supplement, mirna_expression, aligned_reads, clinical_supplement,
aggregated_somatic_mutation, slide_image, simple_somatic_mutation, copy_number_segment,
clinical_supplement, clinical_supplement, biospecimen_supplement, clinical_supplement,
clinical_supplement, clinical_supplement, biospecimen_supplement, copy_number_segment, aligned_reads,
annotated_somatic_mutation, biospecimen_supplement, biospecimen_supplement,
annotated_somatic_mutation, biospecimen_supplement, clinical_supplement,
----->###methylation_beta_value###<-----, masked_somatic_mutation, simple_somatic_mutation,
biospecimen_supplement, slide_image, copy_number_segment, annotated_somatic_mutation,
masked_somatic_mutation, clinical_supplement, aggregated_somatic_mutation,
aggregated_somatic_mutation, aggregated_somatic_mutation, mirna_expression, clinical_supplement,
gene_expression, masked_somatic_mutation, gene_expression, biospecimen_supplement, aligned_reads,
biospecimen_supplement, biospecimen_supplement, aligned_reads, simple_somatic_mutation, masked_somatic_mutation

My guess is that the filter is evaluated against individual file entries, so no single file is both type gene_expression and type methylation_beta_value. Is there a way to filter for cases that have files with a given set of types?
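To make that guess concrete, here is a toy illustration in base R. It does not call the GDC API, and the `files` table, its case IDs, and its contents are all made up; it only shows why a row-wise AND over a single `type` column matches nothing, while a per-case check finds the case that has both types.

```r
# Toy file table: one row per file, tagged with its case (all values are made up).
files <- data.frame(
  case_id = c("case-A", "case-A", "case-B", "case-C", "case-C"),
  type    = c("gene_expression", "methylation_beta_value",
              "gene_expression",
              "methylation_beta_value", "slide_image"),
  stringsAsFactors = FALSE
)

# Row-wise AND: a single file can never have two types at once,
# so this selects zero rows.
per_file <- files[files$type == "gene_expression" &
                  files$type == "methylation_beta_value", ]

# Case-wise check: collect the types seen for each case, then require all of them.
wanted <- c("gene_expression", "methylation_beta_value")
types_by_case <- split(files$type, files$case_id)
has_all <- vapply(types_by_case, function(t) all(wanted %in% t), logical(1))
cases_with_both <- names(has_all)[has_all]
```

Whether the GDC API evaluates nested `files.*` fields in exactly this per-file way is an assumption here, not something the example verifies.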

I've been looking over the examples in the vignette but there don't seem to be any examples of composite queries like the one I'm trying to do. Any assistance would be appreciated!

NB Cross-posted from: https://www.biostars.org/p/344349/

@sean-davis-490

There is a sort of unusual issue with using pipes (%>%) and filter. Unlike in dplyr, applying a second filter clears the first. I should definitely change this behavior, but in the meantime I think what you are looking for is:

qCases <- cases() %>%
    filter( ~ (samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") & (files.type == 'gene_expression' | files.type == 'methylation_beta_value'))
qCases %>% count()

Sean thanks so much for answering. I directed the user here from Biostars: https://www.biostars.org/p/344349/#344978


Thanks Sean and Kevin. I was caught out by the filter() behaviour, but picked up on it during a re-read of the vignette. However, if I have understood the above correctly, this will list all cases with either "Solid Tissue Normal" or "Blood Derived Normal" samples and either "gene_expression" OR "methylation_beta_value" files. The bit I'm having difficulty with is getting cases with "gene_expression" AND "methylation_beta_value" files. Switching '|' for '&' in this part of the expression: '(files.type == "gene_expression" | files.type == "methylation_beta_value")' returns 0 rows, even though there are definitely some samples with both expression and methylation data.


I don't think that the API supports that query. Instead, simply do two separate queries, get the case ids(), and intersect them. Then, perform a third cases() query and supply the ids(). 

> qCases1 <- cases() %>%
+         filter( ~ (samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") & (files.type == 'gene_expression'))
> ids1 = qCases1 %>% ids()
> qCases2 <- cases() %>%
+         filter( ~ (samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") & (files.type == 'methylation_beta_value'))
> ids2 = qCases2 %>% ids()
> length(intersect(ids1,ids2))
[1] 10146
> qCases = cases() %>% filter(~ case_id %in% intersect(ids1,ids2))
> qCases %>% count()
[1] 10146
> 
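The same intersect-the-IDs idea can be extended to any number of file types. The following is only a sketch: `cases_with_all_types`, `fetch_ids`, `fake_index`, and `fetch_fake` are hypothetical names introduced for illustration, and the stand-in data is made up so the logic can be checked without contacting the GDC.

```r
# Intersect the case IDs returned for each requested file type.
# `fetch_ids` is a hypothetical callback: it takes one file type and
# returns a character vector of case IDs.
cases_with_all_types <- function(file_types, fetch_ids) {
  id_sets <- lapply(file_types, fetch_ids)
  Reduce(intersect, id_sets)
}

# Offline stand-in for the GDC queries (IDs are made up).
fake_index <- list(
  gene_expression        = c("c1", "c2", "c3"),
  methylation_beta_value = c("c2", "c3", "c4"),
  mirna_expression       = c("c3", "c5")
)
fetch_fake <- function(file_type) fake_index[[file_type]]

both      <- cases_with_all_types(c("gene_expression", "methylation_beta_value"),
                                  fetch_fake)
all_three <- cases_with_all_types(names(fake_index), fetch_fake)

# With GenomicDataCommons loaded, a callback modelled on the queries above
# might look like this (untested; it assumes the filter formula can see the
# local variable `wanted_type`):
# fetch_gdc <- function(wanted_type) {
#   cases() %>%
#     filter(~ (samples.sample_type == "Solid Tissue Normal" |
#               samples.sample_type == "Blood Derived Normal") &
#              files.type == wanted_type) %>%
#     ids()
# }
```

Passing the query step in as a function keeps the set logic separate from the network calls, so a different condition per query only needs a different callback.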

Thanks again. I was hoping that I was missing something and that the API would support more complex conditional queries, as I have some moderately complex requirements for the subset of samples I'm after.


Thankfully, we have all the power of R at our disposal! 


Just a note that in the most recent devel version (1.5.8) of GenomicDataCommons, filter chaining is now supported. Each filter in the %>% chain is "AND"ed with the previous filters.
