Question: ENCODExplorer doesn't return results which are on the website
liz.ingsimmons (United Kingdom) wrote, 20 months ago:

ENCODExplorer seems like a really neat way to query and download ENCODE data. However, I'm finding that it doesn't return data that exists according to the ENCODE website, e.g.

queryEncode(biosample = "ES-E14", target = "CTCF", file_format = "bed", assay = "ChIP-seq", organism = "Mus musculus", fixed = FALSE)

returns NULL, although here there are three bed files available for download.

modified 20 months ago by Mike Smith • written 20 months ago by liz.ingsimmons

Sorry for the delay in the answer, I was out of town.

I'd like to add some points to Mike's answer.

About the use of `|` in a queryEncode call: the function was not meant to handle Boolean operators, but I think support for them would be worth adding. I created an issue and will try to add this for the next release.

About the snapshots: we plan to update the snapshot before the next Bioconductor release, and for every subsequent release afterward. One of the problems we encountered is that there were some changes in the ENCODE database (for instance, the Roadmap Epigenomics datasets are now available). We will have to update some parts of the code to make sure we update all the metadata correctly. Right now, we do not plan on adding snapshots between releases.

written 20 months ago by Charles Joly Beauparlant

Oh, that is interesting that it doesn't actually support the Boolean operators - using `|` works quite well considering that! I'd strongly support adding official support for this - it's useful to be able to search for multiple cell lines and targets at once, and the alternative is nested for loops which is not very nice :)

written 20 months ago by liz.ingsimmons

I agree that it would make a lot of sense, and it should not be a big change. But I have to prepare a test suite before adding it officially to the documentation/vignettes. I will also have to solve the problem Mike mentioned, where the ^ anchor is only applied to the first term.
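
One possible way to anchor every term, shown only as a minimal sketch: split the query on `|` and prepend `^` to each alternative. The helper name anchor_each_term() is hypothetical and is not part of ENCODExplorer, and the sketch ignores the space/comma/hyphen handling that the package applies.

```r
# Hypothetical helper (NOT part of ENCODExplorer): anchor each
# "|"-separated alternative to the start of the string.
anchor_each_term <- function(query) {
  terms <- strsplit(query, "|", fixed = TRUE)[[1]]
  paste0("^", terms, collapse = "|")
}

# Made-up list of file formats for illustration.
formats <- c("bam", "bed", "bigBed")

anchored <- anchor_each_term("bam|bed")
anchored_hits <- formats[grepl(anchored, formats, ignore.case = TRUE)]
print(anchored)
print(anchored_hits)
```

With every alternative anchored, 'bigBed' no longer matches regardless of the order of the terms.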

written 20 months ago by Charles Joly Beauparlant
Mike Smith (EMBL Heidelberg / de.NBI) wrote, 20 months ago:

The queryEncode() function doesn't query the website directly, but rather looks at a list of experimental data you provide in the df argument.  If you don't provide anything here, it uses an internal snapshot of the data.  It looks like this may be too old to include the bed files, but does include the other files that are part of the experiment:

> tail(sort(unique(ENCODExplorer::encode_df$experiment[,"date_released"])))
[1] "2014-12-17" "2015-01-08" "2015-02-12" "2015-03-31" "2015-04-14" "2015-05-18"

 

written 20 months ago by Mike Smith

Sorry to un-accept your answer, but I've found that the example I gave was misleading. The final bed file on that page can't be found by queryEncode in some situations -- but does exist in the up-to-date encode_df I've managed to create. I think the problem may be that some bed files have NA values for some of the columns, including organism, technical_replicate_number, and biological_replicate_number.

Using dplyr to filter for the files I want, I get 14 results.

z <- encode_df$experiment %>%
  filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
  filter(target %in% c("CTCF-mouse", "Control-mouse")) %>%
  filter(file_format %in% c("bed", "bam"))

Using queryEncode like this, I get 19: those 14, plus one with a different target (I'm not sure why it matches), and four bigBed files.

x <- queryEncode(df = encode_df, fixed = FALSE,
            biosample = "ES-E14|ES-Bruce4",
            target = "CTCF|Control",
            file_format = "bam|bed")

If I try to create input to queryEncode like this, I get only 12 results, including none of the rows with NA in e.g. biological_replicate_number, but including two bigBed files.

biosamples <- c("ES-E14", "ES-Bruce4")
targets <- c("CTCF","Control")
formats <- c("bam", "bed")

y <- queryEncode(df = encode_df, assay = "ChIP-seq", organism = "Mus musculus",
            fixed = FALSE,
            biosample = paste(biosamples, collapse = "|"),
            target = paste(targets, collapse = "|"),
            file_format = paste(formats, collapse = "|"))

I have no idea what is going on here.

Edit: wait, no I do know what it is... some have organism = NA. So this will work!

y2 <- queryEncode(df = encode_df,
            fixed = FALSE,
            biosample = paste(biosamples, collapse = "|"),
            target = paste(targets, collapse = "|"),
            file_format = paste(formats, collapse = "|"))
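
The effect of the NA values can be reproduced with plain grepl(), which the reply below says queryEncode() uses internally. This is a minimal sketch on a made-up table (the rows are not real ENCODE metadata): grepl() returns FALSE for NA, so any row with a missing organism is dropped as soon as an organism filter is applied.

```r
# Toy metadata table; the values are invented for illustration only.
toy <- data.frame(
  file_format = c("bam", "bed", "bed"),
  organism    = c("Mus musculus", "Mus musculus", NA),
  stringsAsFactors = FALSE
)

# grepl() yields FALSE (not NA) where the input is NA.
organism_match <- grepl("Mus musculus", toy$organism, ignore.case = TRUE)

# Filtering on organism silently drops the row with organism = NA ...
with_organism_filter <- toy[organism_match, ]

# ... while leaving the organism filter out keeps it.
without_organism_filter <- toy[grepl("bed", toy$file_format, ignore.case = TRUE), ]

print(nrow(with_organism_filter))
print(nrow(without_organism_filter))
```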

written 20 months ago (modified 20 months ago) by liz.ingsimmons

Cool, I think I understand why you get differing results using your various queries.  

Internally, queryEncode() uses grepl() to match your search terms, and it sets the argument ignore.case = TRUE.  If you use this in your dplyr example, you'll also find the hit with the different target, since it contains the word 'control' and everything else matches.

encode_df$experiment %>%  
    filter(biosample_name %in% c("ES-E14", "ES-Bruce4")) %>%
    filter(grepl(pattern = "CTCF|Control", x = target, ignore.case = TRUE)) %>%
    filter(file_format %in% c("bed", "bam"))

This works, but actually oversimplifies things a little, since the query is transformed to allow for possible spaces, commas and hyphens.

> ENCODExplorer:::query_transform("bam|bed")
[1] "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"

The transformation also places a '^' at the very start of the query, so only the first alternative is anchored to the start of the string.  The second half of the query above will happily match 'bigBed' when case is ignored.  If you swap the order of the alternatives in file_format, you'll lose the bigBed results.

queryEncode(df = encode_df, fixed = FALSE,
            biosample = "ES-E14|ES-Bruce4",
            target = "CTCF|Control",
            file_format = "bed|bam")
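
The anchoring effect can be checked in isolation with plain grepl(). In this sketch the first pattern is the query_transform() output shown above; the second pattern is an assumption, written by analogy for "bed|bam" rather than copied from the package, and the list of file formats is made up.

```r
# Pattern printed above by ENCODExplorer:::query_transform("bam|bed").
bam_first <- "^b[ ,-]?a[ ,-]?m[ ,-]?|[ ,-]?b[ ,-]?e[ ,-]?d"

# ASSUMPTION: the analogous pattern for "bed|bam", written by hand.
bed_first <- "^b[ ,-]?e[ ,-]?d[ ,-]?|[ ,-]?b[ ,-]?a[ ,-]?m"

# Made-up list of file formats for illustration.
formats <- c("bam", "bed", "bigBed", "bigWig", "fastq")

# Only the first alternative carries the '^' anchor, so the unanchored
# second alternative can match anywhere in the string.
hits_bam_first <- formats[grepl(bam_first, formats, ignore.case = TRUE)]
hits_bed_first <- formats[grepl(bed_first, formats, ignore.case = TRUE)]

print(hits_bam_first)
print(hits_bed_first)
```

'bigBed' shows up only in the first result, because there the unanchored 'bed' alternative matches inside 'bigBed'.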

written 20 months ago by Mike Smith

Thanks, this does seem to be the problem. I wasn't expecting there to be such recent updates to the website, or for the snapshot to be more than six months old.

written 20 months ago by liz.ingsimmons