I ran genome-wide association studies for wheat quality traits in wheat (Triticum aestivum) which returned a few significant regions of the genome. I would like to extract the genes within those regions using the version 2.1 of the wheat reference genome, using the "rtracklayer". I do not know where to start, I consulted the package documentation and I still don't understand it.
You don't need rtracklayer for this (not directly anyway), but instead you need txdbmaker which you can use to make a TxDb object that contains the gene locations, etc. There are (at least) two annotation services that have gene data for wheat; NCBI and Ensembl. I imagine MSU has one as well, so it depends on what genome you generated your SNP data from. Ideally you stick with the same annotation service because they all do things slightly differently, and trying to map things between any of them is a fool's errand. I generally use NCBI annotations, but they have a (terrible, IMO) habit of naming chromosomes using RefSeq IDs, and I don't know an easy way of changing the chromosomes, so I will show you how to use Ensembl instead.
> library(txdbmaker)
>> txdb <- makeTxDbFromGFF("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz'
Content type 'application/x-gzip' length 24463854 bytes (23.3 MB)
downloaded 23.3 MB
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon", :
68 exons couldn't be linked to a
transcript so were dropped
(showing only the first 6):
seqid start end strand
1 1A 326329197 326329304 +
2 1A 366114101 366114204 +
3 1A 403039760 403039877 -
4 1B 357916232 357916339 -
5 1B 395058787 395058890 +
6 1B 432758094 432758212 +
ID Name
1 <NA> ENSRNA050014038-E1
2 <NA> ENSRNA050013952-E1
3 <NA> ENSRNA050013948-E1
4 <NA> ENSRNA050017309-E1
5 <NA> ENSRNA050019912-E1
6 <NA> ENSRNA050022692-E1
Parent
1 transcript:ENSRNA050014038-T1
2 transcript:ENSRNA050013952-T1
3 transcript:ENSRNA050013948-T1
4 transcript:ENSRNA050017309-T1
5 transcript:ENSRNA050019912-T1
6 transcript:ENSRNA050022692-T1
Parent_type
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
> genes(txdb)
GRanges object with 120676 ranges and 1 metadata column:
seqnames
<Rle>
ENSRNA050007810 3A
ENSRNA050007817 3A
ENSRNA050007820 3A
ENSRNA050007823 3A
ENSRNA050007826 3A
... ...
TraesCSU02G272776 Un
TraesCSU02G272800 Un
TraesCSU02G272807 Un
TraesCSU02G272900 Un
TraesCSU02G272997 Un
ranges
<IRanges>
ENSRNA050007810 8594746-8594817
ENSRNA050007817 8598700-8598771
ENSRNA050007820 19771253-19771326
ENSRNA050007823 27007038-27007108
ENSRNA050007826 32161321-32161392
... ...
TraesCSU02G272776 478410340-478410699
TraesCSU02G272800 478934578-478934940
TraesCSU02G272807 478901536-478902039
TraesCSU02G272900 479850130-479850488
TraesCSU02G272997 480840068-480840121
strand |
<Rle> |
ENSRNA050007810 + |
ENSRNA050007817 + |
ENSRNA050007820 + |
ENSRNA050007823 + |
ENSRNA050007826 + |
... ... .
TraesCSU02G272776 - |
TraesCSU02G272800 + |
TraesCSU02G272807 - |
TraesCSU02G272900 + |
TraesCSU02G272997 + |
gene_id
<character>
ENSRNA050007810 ENSRNA050007810
ENSRNA050007817 ENSRNA050007817
ENSRNA050007820 ENSRNA050007820
ENSRNA050007823 ENSRNA050007823
ENSRNA050007826 ENSRNA050007826
... ...
TraesCSU02G272776 TraesCSU02G272776
TraesCSU02G272800 TraesCSU02G272800
TraesCSU02G272807 TraesCSU02G272807
TraesCSU02G272900 TraesCSU02G272900
TraesCSU02G272997 TraesCSU02G272997
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
Now is where you have to take over. I showed you how to get a GRanges of the gene locations. You can do all sorts of things with these objects and it's well beyond the scope of a support site post to cover them. You can read the vignettes for GenomicFeatures and GenomicRanges and maybe GenomicAlignments.
Is it possible to download the genome annotation file from the IWGSC website, store it in my computer as a GFF.gz file and import it using this function: "makeTxDbFromGFF"?
Hello,
Please disregard the previous post. it's working now, i think. I am able to create the txdb object but I can't see all the other information you listed. I get these warnings:
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
exon. This information was ignored.
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
some transcripts have no "transcript_id" attribute ==> their name
("tx_name" column in the TxDb object) was set to NA
3: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
the transcript names ("tx_name" column in the TxDb object) imported from
the "transcript_id" attribute are not unique
Telling me you can't do something without showing what you tried is not helpful. When posting here, try to put yourself in the shoes of people who are reading your post. If you say 'it does not work', what do you imagine a reader is going to ask first? They will ask 'Well, what did you try? What result did you get?'.
Have you read the vignettes? What you are trying to do is not particularly difficult, but it will require some background knowledge. I cannot impart knowledge to you - you have to do that yourself by reading the available information that I have pointed to.
Thank you very much for the quick response. However, I am unable to reproduce the code you provided. It starts to run, then I get the error below:
"Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz' Content type 'application/x-gzip' length 24463854 bytes (23.3 MB) downloaded 19.4 MB
Error in download.file(fp, filepath2[i]) : download from 'https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz' failed In addition: Warning messages: 1: In download.file(fp, filepath2[i]) : downloaded length 20367748 != reported length 24463854 2: In download.file(fp, filepath2[i]) : URL 'https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/gff3/triticum_aestivum/Triticum_aestivum.IWGSC.60.gff3.gz': Timeout of 60 seconds was reached ""
Is it possible to download the genome annotation file from the IWGSC website, store it in my computer as a GFF.gz file and import it using this function: "makeTxDbFromGFF"?
Hello, Please disregard the previous post. it's working now, i think. I am able to create the txdb object but I can't see all the other information you listed. I get these warnings:
Warning messages: 1: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type exon. This information was ignored. 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in the TxDb object) was set to NA 3: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the "transcript_id" attribute are not unique
Telling me you can't do something without showing what you tried is not helpful. When posting here, try to put yourself in the shoes of people who are reading your post. If you say 'it does not work', what do you imagine a reader is going to ask first? They will ask 'Well, what did you try? What result did you get?'.
Have you read the vignettes? What you are trying to do is not particularly difficult, but it will require some background knowledge. I cannot impart knowledge to you - you have to do that yourself by reading the available information that I have pointed to.