How to run the annotation tool for GARFIELD v2: garfield_annotate_uk10k.sh
2
0
Entering edit mode
@rodrigoduarte88-16306
Last seen 10 months ago
United States

Hi all,

I think I may have an impossible question here, but if by any chance any of you has used the EMBL-EBI tool 'Garfield' (https://www.ebi.ac.uk/birney-srv/GARFIELD/), could you please advise? Garfield is available as a bunch of shell scripts or as an R package, but I am having an issue specifically with the shell script to create a GARFIELD annotation. Thus, I am aware maybe this is not the best forum to ask for help, but since the authors don't really provide a support channel for this pretty amazing tool, I thought maybe some of you may have come across this issue?

To reiterate, I want to create a functional category for GARFIELD based on a list of SNPs I am interested. The pipeline comes with a script 'garfield_annotate_uk10k.sh' which should do the job, but this is not working. This script is a series of shell commands (and use of a binary file for which code is not available unfortunately). This script will convert an INPUT bed annotation file that informs chr and SNP coordinates, like:

head bed
chr start   end
1   833927  833927
1   1344686 1344686
1   17006932    17006932
1   20498952    20498952
1   21775805    21775805

... to OUTPUT files per chromosome, which look like:

head chr1
833528  0   
833927  1
1344686 0
1344892 0
17006932    1
20498952    0

This output files contains a basal list of SNPs (based on SNP data that comes with the pipeline, on garfield-data.tar.gz), with the second column containing 1/0 to inform if the SNP is present in my input/annotation/bed file, according to the genomic coordinates for that SNP (chr + position).

There are a few errors in the published script (e.g. it's missing "mkdir -p $OUTPUT_DIR" at the beginning; there's a lost "fi" at the end of the script), so the original script runs when I correct these. But it's still producing the wrong output: it's essentially just outputting the SNPs from the "basal list of SNPs" from garfield-data, and not attributing "1"s to the SNPs that are present in my input bed file. Current output looks like this:

head chr1
1 28590 ./garfield-data/tmp.chr1.bed
1 31719 0
1 52152 0
1 52167 0
1 52185 0
1 54392 0
1 54421 0

It's possible my input bed file is not in the right format (unfortunately the authors just say that they require a bed file, but don't say how it should be formatted, so I am using the typical bed file structure (chr start end). It's also possible that there's something wrong with the binary file that comes with the script "garfield_annotate_uk10k_helper". But given the presence of typos in the script, it's possible that there's a mistake in some shell commands which are causing the wrong output to be produced?

I would really appreciate if anyone could have a look at this!

Download files:
garfield-v2.tar.gz
garfield-data.tar.gz (83Gb when uncompressed, sorry!)
Example bed file (not sure this is exactly the format the software takes as input, because authors did not specify in the manual)

Thanks a lot!

garfield awk unix shell embl-ebi • 1.8k views
ADD COMMENT
2
Entering edit mode
@rodrigoduarte88-16306
Last seen 10 months ago
United States

Just in case someone stumbles upon this in the future, too.... I had help from a colleague and created the required annotation for GARFIELD in R using the short script below:

library(data.table)    
dbsnp<-fread("eQTL_positions.bed", skip = 1)

for(chr in 1:22){
  tmp<-fread(paste0("maftssd/chr", chr))
  allVar<-cbind(tmp$V1, 0)
  overlap<-intersect(allVar[,1], dbsnp$V2[which(dbsnp[,1] == chr)])
  allVar[match(overlap, allVar[,1]),2]<-1
  write.table(allVar,paste0("eQTL_annot/chr", chr), col.names = FALSE, row.names = FALSE, sep = "\t", quote = FALSE)
}

Essentially, this matches my list of SNP positions within my custom bed file with the list of SNPs within "garfield-data.tar.gz", based on files available in "maftssd". These are also split by chromosome number, with the first column consisting of SNP position. Final annotation file is generated per chromosome, containing all SNPs in "maftssd", and 0/1s for absence/presence of the SNP in the annotation bed file.

ADD COMMENT
1
Entering edit mode

Thanks so much. I am running into similar problems where all SNPs in my annotation have 0. I think the script does contain many bugs including one with regards to an awk command on line 50. It is really confusing to me. I guess I will just try your script and it might be easier.

ADD REPLY
0
Entering edit mode
bk11 • 0
@bk11-19547
Last seen 3.1 years ago

I am too having not much luck with garfield_annotate_uk10k.sh and R script provided here. I am still seeing all SNPs in my annotation have 0. Have any one overcome this problem?

ADD COMMENT

Login before adding your answer.

Traffic: 793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6