Search
Question: goseq after edgeR
0
9 months ago by

I have followed this tutorial for edgeR. On this side it appears I could run goseq when I have created the following files:

1. Differentially expressed gene file but edgeR does not appear to produce a tabular file with gene names in the first column, and TRUE or FALSE in the last column. How is it possible to generate this kind of file?

2. Gene length file for length bias correction: I have a gff3 file and found (here)[http://seqanswers.com/forums/archive/index.php/t-39797.html] this code

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id)) reducedGTF <- unlist(grl, use.names=T) elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF) calc_length <- function(x) { sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length)) colnames(output) <- c("Length")  • Should the ouput file contain any header and gene id assioated with the length? • import.gff has a argument genome="GRCm38.71" but my genome is a non model species. Should I provide the file path to the assembly? 1. Gene category file: I could obtain a mapping of gene id to gene ontology out of the below interproscan file: sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 198 261 1.4E-15 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021 sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 7 91 1.1E-25 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021  Any advice what would be the best way to generarate the 3 above files for goseq? Thank you in advance. ADD COMMENTlink modified 9 months ago by Aaron Lun20k • written 9 months ago by mictadlo0 0 9 months ago by Aaron Lun20k Cambridge, United Kingdom Aaron Lun20k wrote: Rather than going through some web-based interface, the easiest approach to running a goseq analysis is to just install the goseq package via biocLite and follow the instructions in the user's guide. The instructions in Section 5 for non-native organisms and the case study in Section 6 should answer most of your questions. ADD COMMENTlink modified 9 months ago • written 9 months ago by Aaron Lun20k I definitely would prefer to run goseq directly in R. As suggested, I read section 5 and I have still some questions how to implement the necessary code structures for goseq: 5.1 Length data format: The length data must be formatted as a numeric vector, of the same length as the main named vector specifying gene names/DE genes. Each entry should give the length of the corresponding gene in bp. If length data is unavailable for some genes, that entry should be set to NA. I found the following code:  #!/usr/bin/env Rscript library(GenomicRanges) library(rtracklayer) GTFfile = "some_annotation_file.GTF" GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon") grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl)) elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths) } output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")

• What should I set for genome argument inside import.gff or should I provide the file path to the non model species assembly?

• How to change the above code to satisfy section 5.1?

5.2 Category mapping format: The mapping between category names and genes should be given as a data frame with two columns. One column should contain the gene IDs and the other the name of an associated category. As the mapping between categories and genes is usually many-to-many, this data frame will usually have multiple rows with the same gene name and category name. Alternatively, mappings between genes and categories can be given as a list. The names of list entries should be gene IDs and the entries themselves should be a vector of category names to which the gene ID corresponds.

• I could obtain a mapping of gene id to gene ontology out of the below interproscan file:
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    198    261    1.4E-15    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    7    91    1.1E-25    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021

• What would be the best way to read that above file and put the data in correct structure?

1. The genome= argument is optional, it isn't needed to import a GTF file.
2. See width.
3. See read.table.