Question

Extracting genomic sequences using agat and bedtools yields different results

0

Entering edit mode

maximilian.beikirch • 0

@c689138e

Last seen 9 days ago

Germany

Hello,

I have a problem that I can just not wrap my head around. I am new to bioinformatics (coming from a computer science background), so I might have overlooked something really obvious.

I have a couple of genes that I need to get the 1500 base pairs upstream and the first 500 bases of (so 2000 bases in total). I then want to find motifs in those sequences. I want to do so using FIMO and MAST from the MEME suite. For this reason, I want my output to be formatted appropriately, as this facilitates downstream processing of the resulting gff3 files.

I obtained the genome and the annotation from here and here, respectively.

As for the tooling in this example, conda install agat bedtools should suffice.

Then, in approach A, I am retrieving the sequences like this:

agat_sp_filter_feature_from_keep_list.pl --gff ../reference_dbs/Homo_sapiens.GRCh38.111.gff3 --keep_list output_human/genes_trafo -a Name -o output_human/output_trafo.gff3

sed -i.old '1s;^;##seqid  source  type    start   end     score   strand  phase   attributes;' output_human/output_trafo_genesonly.gff3

agat_sp_extract_sequences.pl -g output_human/output_trafo.gff3 -t gene -f ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --output output_human/sequences_from_paper_genes.fa
python reformat_fasta.py  -i output_human/sequences_from_paper_genes.fa -o output_human/sequences_from_paper_genes_stripped.fa
agat_sp_extract_sequences.pl -g output_human/output_trafo.gff3 -t gene -f ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --output output_human/sequences_from_paper_genes_1500up.fa --up 1500
python reformat_fasta.py  -i -t 2000 output_human/sequences_from_paper_genes_1500up.fa -o output_human/sequences_from_paper_genes_1500up_500down.fa

where for the purposes of just a minimal example, output_human/genes_trafo just contains

ABCA9
ABLIM3
ABTB2
ACSL1
ADAMDEC1
ADAR
ADM
AHNAK2
AIM2
AKT3

A short check with ensemble (which appears to be non-functional at the moment) suggested that those sequences are correct. The problem is that the header is not formatted properly, so the output from MEME would not contain absolute coordinates, making downstream processing much harder.

When searching for a tool that produces the desired headers, I found that bedtools getfasta produces the correct headers. So, in a second approach, I tried retrieving the sequences using bedtools:

grep "ID=gene" output_human/output_trafo.gff3 > output_human/output_trafo_genesonly.gff3
gff2bed < output_human/output_trafo_genesonly.gff3 > output_human/output_trafo_genesonly.bed

bedtools getfasta -bed output_human/output_trafo_genesonly.bed -fi ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa > output_human/sequences_from_paper_genes_ucsc.fa
bedtools flank -l 1500 -r 0 -i output_human/output_trafo_genesonly.bed -g ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.fai > output_human/flank.bed
bedtools getfasta -bed output_human/flank.bed -fi ../reference_dbs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa > output_human/flank.fa
python3 merge_fasta_linebyline.py -i output_human/flank.fa output_human/sequences_from_paper_genes_ucsc.fa -o output_human/sequences_from_paper_genes_ucsc_1500up.fa
python reformat_fasta.py -t 2000 -i output_human/sequences_from_paper_genes_ucsc_1500up.fa -o output_human/sequences_from_paper_genes_ucsc_1500up500down.fa

But I found that, while the resulting gff3 and bed files are compatible, the files sequences_from_paper_genes.fa (approach A) and sequences_from_paper_genes_ucsc.fa (approach B) are different: WinMerge

The scripts reformat_fasta.py and merge_fasta_linebyline.py should be fine, but I'm providing them here anyway (not my finest pieces of engineering, but they do the job):

## merge_fasta_linebyline.py
@cloup.command()
@cloup.option("--input-fasta", "-i", nargs=2, type=cloup.File("r"), required=True, help="input fasta file")
@cloup.option("--output-fasta", "-o", type=cloup.File("w"), required=True, help="input fasta file")
def main(input_fasta, output_fasta):
    in1, in2 = input_fasta
    for line1, line2 in zip(in1, in2):
        if line1.startswith(">"):
            assert (line2.startswith(">"))
            rgx = r"\>(.*):(\d+)\-(\d+)"
            m1 = re.match(rgx, line1)
            m2 = re.match(rgx, line2)
            assert m1.group(1) == m2.group(1)
            chr = m1.group(1)
            range1 = (int(m1.group(2)), int(m1.group(3)))
            range2 = (int(m2.group(2)), int(m2.group(3)))
            assert (range1[1] == range2[0])
            assert (range1[0] <= range2[1])
            outrange = (range1[0], range2[1])
            output_fasta.write(f">{chr}:{outrange[0]}-{outrange[1]}\n")
        else:
            # remove trailing newline (to avoid splitting the sequence!)
            output_fasta.write(line1.strip()+line2)


## reformat_fasta.py
@cloup.command()
@cloup.option("--input-fasta", "-i", type=cloup.File("r"), required=True, help="input fasta file")#
@cloup.option("--output-fasta", "-o", type=cloup.File("w"), required=True, help="input fasta file")
@cloup.option("--trim-sequences", "-t", type=int, required=False, help="input fasta file")
def main(input_fasta, output_fasta, trim_sequences):
    outline = ""
    for line in input_fasta:
        if line.startswith(">"):
            if outline:
                if trim_sequences:
                    outline = outline[:trim_sequences]
                output_fasta.write(f"{outline}\n")
            outline = ""
            output_fasta.write(f"{line}")
        else:
            outline += line.strip()
    if outline:
        output_fasta.write(f"{outline}")

I'm puzzled that there seems to be no systematic error, as two of the sequences are completely identical, but the rest looks like rubbish to me. Does anybody have an idea what's going on or how I could debug this further? Please feel free to add appropriate tags to the question (I'd suggest agat and bedtools but apparently I lack rights to create those tags)

MAST • 267 views

ADD COMMENT • link updated 5 weeks ago by ATpoint ★ 4.0k • written 5 weeks ago by maximilian.beikirch • 0

1

Entering edit mode

Hi, in the future please post at e.g. biostars.org as this is not related to Bioconductor.

ADD REPLY • link 5 weeks ago ATpoint ★ 4.0k

0

Entering edit mode

The common ones are the genes without UTRs while the different ones are gene having UTRs, so I guess AGAT and bedtools (gff2bed step?) do not handle the same way the UTRs. UTRs are part of the gene, so the upstream region should be upstream of the UTR, could you check a gene that is different manually on Ensembl to see what region is extracted (you can use blat if it is too complicated to check manually)

ADD REPLY • link 5 weeks ago Jacques • 0