Question: ENCODExplorer doesn't return results which are on the website
3.6 years ago by
Germany
liz.ingsimmons140 wrote:

ENCODExplorer seems like a really neat way to query and download ENCODE data. However, I'm finding that it doesn't return data that exists according to the ENCODE website, e.g.

queryEncode(biosample = "ES-E14", target = "CTCF", file_format = "bed", assay = "ChIP-seq", organism = "Mus musculus", fixed = FALSE)

returns NULL, although here there are three bed files available for download.

encodexplorer • 738 views
modified 3.6 years ago by Mike Smith4.0k • written 3.6 years ago by liz.ingsimmons140

Sorry for the delay in the answer, I was out of town.

About the use of | in a queryEncode call: the function was not designed to handle boolean operators, but I think support for them would be worth adding. I created an Issue and will try to add this for the next release.

About the snapshots: we plan to update the snapshot before the next Bioconductor release, and for every subsequent release. One of the problems we encountered is that there were some changes in the ENCODE database (for instance, the Roadmap Epigenomics datasets are now available). We will have to update some parts of the code to make sure we update all the metadata correctly. Right now, we do not plan on adding snapshots between releases.

Oh, that is interesting that it doesn't actually support the Boolean operators - using | works quite well considering that! I'd strongly support adding official support for this - it's useful to be able to search for multiple cell lines and targets at once, and the alternative is nested for loops which is not very nice :)

I agree that it would make a lot of sense, and it should not be a big change. However, I have to prepare a test suite before adding it officially to the documentation/vignettes. I will also have to solve the ^ problem mentioned by Mike, where the anchor is only applied to the first term.

Answer: ENCODExplorer doesn't return results which are on the website
3.6 years ago by
Mike Smith4.0k
EMBL Heidelberg / de.NBI
Mike Smith4.0k wrote:

The queryEncode() function doesn't query the website directly, but rather looks at a list of experimental data you provide in the df argument. If you don't provide anything here, it uses an internal snapshot of the data. This snapshot appears to be too old to include the bed files, although it does include the other files that are part of the experiment:

> tail(sort(unique(ENCODExplorer::encode_df$experiment[,"date_released"])))
[1] "2014-12-17" "2015-01-08" "2015-02-12" "2015-03-31" "2015-04-14" "2015-05-18"

Sorry to un-accept your answer, but I've found that the example I gave was misleading. The final bed file on that page can't be found by queryEncode in some situations, but it does exist in the up-to-date encode_df I've managed to create. I think the problem may be that some bed files have NA values for some of the columns, including organism, technical_replicate_number, and biological_replicate_number.

Using dplyr to filter for the files I want, I get 14 results.

z <- encode_df$experiment %>%
filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
filter(target %in% c("CTCF-mouse", "Control-mouse")) %>%
filter(file_format %in% c("bed", "bam"))

Using queryEncode like this, I get 19 - those 14, plus one that is a different target that I'm not sure why it matches, and four bigBed files.

x <- queryEncode(df = encode_df, fixed = FALSE,
biosample = "ES-E14|ES-Bruce4",
target = "CTCF|Control",
file_format = "bam|bed")

If I try to create input to queryEncode like this, I get only 12 results, including none of the rows with NA in e.g. biological_replicate_number, but including two bigBed files.

biosamples <- c("ES-E14", "ES-Bruce4")
targets <- c("CTCF","Control")
formats <- c("bam", "bed")

y <- queryEncode(df = encode_df, assay = "ChIP-seq", organism = "Mus musculus",
fixed = FALSE,
biosample = paste(biosamples, collapse = "|"),
target = paste(targets, collapse = "|"),
file_format = paste(formats, collapse = "|"))

I have no idea what is going on here.

Edit: wait, no I do know what it is... some have organism = NA. So this will work!

y2 <- queryEncode(df = encode_df,
fixed = FALSE,
biosample = paste(biosamples, collapse = "|"),
target = paste(targets, collapse = "|"),
file_format = paste(formats, collapse = "|"))
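
The effect of organism = NA can be reproduced without ENCODExplorer. This is only a minimal sketch on a made-up table (the toy data frame and its values are hypothetical), and it assumes each query field is matched with grepl(), as described in the reply below:

```r
# Hypothetical toy metadata; the second row mimics a bed file with organism = NA.
toy <- data.frame(
  file_format = c("bam", "bed", "bed"),
  organism    = c("Mus musculus", NA, "Mus musculus"),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE (not NA) when the value being searched is NA ...
organism_hit <- grepl("Mus musculus", toy$organism, ignore.case = TRUE)
print(organism_hit)            # TRUE FALSE TRUE

# ... so adding an organism filter silently drops the row with organism = NA.
format_hit       <- grepl("bam|bed", toy$file_format, ignore.case = TRUE)
with_organism    <- toy[organism_hit & format_hit, ]
without_organism <- toy[format_hit, ]

print(nrow(with_organism))     # 2
print(nrow(without_organism))  # 3
```

Leaving out any field that contains NA for the files of interest, as in y2 above, avoids losing those rows.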

Cool, I think I understand why you get differing results using your various queries.

Internally queryEncode() uses grepl() to match your search terms, and it sets the argument ignore.case = TRUE. If you use this in your dplyr example, you'll also find the hit with the different target, since it contains the word 'control' and everything else matches.

encode_df$experiment %>%
filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
filter(grepl(pattern = "CTCF|Control", x = target, ignore.case = TRUE)) %>%
filter(file_format %in% c("bed", "bam"))

This works, but actually oversimplifies things a little, since the query is transformed to allow for possible spaces, commas and hyphens.

> ENCODExplorer:::query_transform("bam|bed")
[1] "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"

This also places a '^' to anchor the first letter to the start of the string, but it is only added right at the start of the query, so only the first term is anchored. The second half of the query above will happily match 'bigBed' when case is ignored. If you swap the order of the terms in file_format, you'll lose the bigBed results.

queryEncode(df = encode_df, fixed = FALSE,
biosample = "ES-E14|ES-Bruce4",
target = "CTCF|Control",
file_format = "bed|bam")
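
The anchoring behaviour can be checked in base R without the package. The sketch below uses the transformed pattern printed above for "bam|bed"; the pattern for "bed|bam" is written out by hand on the assumption that query_transform() treats it the same way, and the file_formats vector is a made-up example.

```r
# Made-up set of file formats to test against.
file_formats <- c("bam", "bed", "bigBed", "bigWig")

# Pattern printed by query_transform("bam|bed") earlier in this answer.
bam_first <- "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"
# Assumed equivalent for "bed|bam" (hand-written, not package output).
bed_first <- "^b[ ,-]?e[ ,-]?d[ ,-]?|[ ,-]?b[ ,-]?a[ ,-]?m"

# Only the first alternative is anchored with '^', so the unanchored "bed"
# term matches inside "bigBed" once case is ignored.
hits_bam_first <- file_formats[grepl(bam_first, file_formats, ignore.case = TRUE)]
print(hits_bam_first)  # "bam" "bed" "bigBed"

# With "bed" first, "bed" is anchored and the unanchored "bam" term
# has nothing to match in "bigBed".
hits_bed_first <- file_formats[grepl(bed_first, file_formats, ignore.case = TRUE)]
print(hits_bed_first)  # "bam" "bed"
```

If the order of the terms cannot be controlled, an exact-match filter such as file_format %in% c("bam", "bed") on the returned rows is another way to drop the bigBed hits.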