Question: Using GenomicDataCommons R package to identify cases with both DNA methylation and gene expression data
13 months ago, RichardJActon wrote:

I'm trying to get a list of cases for which there is DNA methylation and gene expression data available for both normal and cancer tissue samples using the GenomicDataCommons R package (which I am using for the first time).

qCases <- cases() %>%
filter( ~ samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") %>%
filter(~ files.type == 'gene_expression' & files.type == 'methylation_beta_value')
qCases %>% count()

> [1] 0


Examining the response() of qCases with only the first filter applied shows that there are definitely cases that have both gene expression and methylation beta value files.

qCases <- cases() %>%
    filter( ~ samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") %>%
    GenomicDataCommons::select('files.type') %>%
    response()

> $results
> files

> copy_number_segment, ----->###gene_expression###<-------, simple_somatic_mutation,
annotated_somatic_mutation, biospecimen_supplement, clinical_supplement, biospecimen_supplement,
aggregated_somatic_mutation, slide_image, simple_somatic_mutation, copy_number_segment,
clinical_supplement, clinical_supplement, biospecimen_supplement, clinical_supplement,
annotated_somatic_mutation, biospecimen_supplement, biospecimen_supplement,
annotated_somatic_mutation, biospecimen_supplement, clinical_supplement,
biospecimen_supplement, slide_image, copy_number_segment, annotated_somatic_mutation,
aggregated_somatic_mutation, aggregated_somatic_mutation, mirna_expression, clinical_supplement,


My guess is that the filter is evaluated against individual file entries, and no single file is both type gene_expression and type methylation_beta_value, so nothing matches. Is there a way to filter for cases that have files covering a given set of types?
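
To make that guess concrete, here is a minimal base-R sketch on made-up data. It does not query the GDC, and the `files` data frame, its case IDs, and its column names are hypothetical stand-ins for file records. It only illustrates why a per-file AND can never match, while a per-case check can.

```r
# Hypothetical toy data: one row per file, each file has exactly one type.
files <- data.frame(
  case_id = c("case1", "case1", "case2", "case3", "case3"),
  type    = c("gene_expression", "methylation_beta_value",
              "gene_expression",
              "methylation_beta_value", "slide_image"),
  stringsAsFactors = FALSE
)

# Per-file AND: a single file cannot be two types at once, so this is 0.
n_rowwise <- sum(files$type == "gene_expression" &
                 files$type == "methylation_beta_value")

# Per-case check: does each case own at least one file of every wanted type?
wanted <- c("gene_expression", "methylation_beta_value")
has_all <- vapply(split(files$type, files$case_id),
                  function(types) all(wanted %in% types),
                  logical(1))
cases_with_both <- names(has_all)[has_all]

print(n_rowwise)
print(cases_with_both)
```

In this toy data only case1 has files of both types, which is the kind of per-case set membership I am after.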

I've been looking over the examples in the vignette but there don't seem to be any examples of composite queries like the one I'm trying to do. Any assistance would be appreciated!

NB Cross-posted from: https://www.biostars.org/p/344349/

Answer: Using GenomicDataCommons R package to identify cases with both DNA methylation a
13 months ago, Sean Davis wrote:

There is a somewhat unusual issue when using pipes (%>%) with filter(). Unlike in dplyr, applying a second filter replaces the first rather than adding to it. I should definitely change this behavior, but in the meantime I think what you are looking for is:

qCases <- cases() %>%
    filter( ~ (samples.sample_type == "Solid Tissue Normal" |
               samples.sample_type == "Blood Derived Normal") &
              (files.type == 'gene_expression' |
               files.type == 'methylation_beta_value'))
qCases %>% count()


Sean thanks so much for answering. I directed the user here from Biostars: https://www.biostars.org/p/344349/#344978


Thanks Sean and Kevin. I was caught out by the 'filter()' behaviour, but picked up on it during a re-read of the vignette. However, if I have understood the above correctly, this will list all cases with either "Solid Tissue Normal" or "Blood Derived Normal" samples and either "gene_expression" __or__ "methylation_beta_value" files. The bit I'm having difficulty with is getting cases with "gene_expression" __and__ "methylation_beta_value" files. Switching '|' for '&' in this part of the expression: '(files.type == "gene_expression" | files.type == "methylation_beta_value")' returns 0 rows, even though there are definitely some cases with both expression and methylation data.

I don't think that the API supports that query. Instead, simply run two separate queries, get the case IDs from each with ids(), and intersect them. Then perform a third cases() query and supply the intersected IDs.

> qCases1 <- cases() %>%
+         filter( ~ (samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") & (files.type == 'gene_expression'))
> ids1 = qCases1 %>% ids()
> qCases2 <- cases() %>%
+         filter( ~ (samples.sample_type == "Solid Tissue Normal" | samples.sample_type == "Blood Derived Normal") & (files.type == 'methylation_beta_value'))
> ids2 = qCases2 %>% ids()
> length(intersect(ids1,ids2))
[1] 10146
> qCases = cases() %>% filter(~ case_id %in% intersect(ids1,ids2))
> qCases %>% count()
[1] 10146
> 
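
If more than two file types are needed, the same intersect step can be generalised. The following is a minimal sketch using made-up ID vectors; `ids_by_type` and its contents are hypothetical. In practice each element would come from a separate query piped into ids(), as in the session above.

```r
# Hypothetical case IDs, one character vector per required file type.
# With real data, each vector would be the ids() of its own cases() query.
ids_by_type <- list(
  gene_expression        = c("c1", "c2", "c3"),
  methylation_beta_value = c("c2", "c3", "c4"),
  mirna_expression       = c("c3", "c5")
)

# Fold intersect() over the list: keep only IDs present in every vector.
common_ids <- Reduce(intersect, ids_by_type)

print(common_ids)
```

The resulting vector can then be supplied to a final cases() query with `case_id %in% ...`, exactly as in the two-type example.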

Thanks again. I was hoping that I was missing something and that the API would support more complex conditional queries, as I have some moderately complex requirements for the subset of samples I'm after.

Thankfully, we have all the power of R at our disposal!


Just a note that in the most recent devel version (1.5.8) of GenomicDataCommons, filter chaining is now supported. Each filter in the %>% chain is "AND"ed with the previous filters.