Search
Question: goseq after edgeR
0
gravatar for mictadlo
14 days ago by
mictadlo0
mictadlo0 wrote:

I have followed this tutorial for edgeR. On this side it appears I could run goseq when I have created the following files:

  1. Differentially expressed gene file but edgeR does not appear to produce a tabular file with gene names in the first column, and TRUE or FALSE in the last column. How is it possible to generate this kind of file?

  2. Gene length file for length bias correction: I have a gff3 file and found (here)[http://seqanswers.com/forums/archive/index.php/t-39797.html] this code

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
  • Should the ouput file contain any header and gene id assioated with the length?
  • import.gff has a argument genome="GRCm38.71" but my genome is a non model species. Should I provide the file path to the assembly?
  1. Gene category file: I could obtain a mapping of gene id to gene ontology out of the below interproscan file:
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    198    261    1.4E-15    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    7    91    1.1E-25    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021

Any advice what would be the best way to generarate the 3 above files for goseq?

Thank you in advance.

ADD COMMENTlink modified 14 days ago by Aaron Lun16k • written 14 days ago by mictadlo0
0
gravatar for Aaron Lun
14 days ago by
Aaron Lun16k
Cambridge, United Kingdom
Aaron Lun16k wrote:

Rather than going through some web-based interface, the easiest approach to running a goseq analysis is to just install the goseq package via biocLite and follow the instructions in the user's guide. The instructions in Section 5 for non-native organisms and the case study in Section 6 should answer most of your questions.

ADD COMMENTlink modified 14 days ago • written 14 days ago by Aaron Lun16k

I definitely would prefer to run goseq directly in R. As suggested, I read section 5 and I have still some questions how to implement the necessary code structures for goseq:

5.1 Length data format: The length data must be formatted as a numeric vector, of the same length as the main named vector specifying gene names/DE genes. Each entry should give the length of the corresponding gene in bp. If length data is unavailable for some genes, that entry should be set to NA.

I found the following code:


#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
  • What should I set for genome argument inside import.gff or should I provide the file path to the non model species assembly?

  • How to change the above code to satisfy section 5.1?

5.2 Category mapping format: The mapping between category names and genes should be given as a data frame with two columns. One column should contain the gene IDs and the other the name of an associated category. As the mapping between categories and genes is usually many-to-many, this data frame will usually have multiple rows with the same gene name and category name. Alternatively, mappings between genes and categories can be given as a list. The names of list entries should be gene IDs and the entries themselves should be a vector of category names to which the gene ID corresponds.

  • I could obtain a mapping of gene id to gene ontology out of the below interproscan file:
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    198    261    1.4E-15    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
sp0000006-mRNA-1    edf5c2bb6341fe44b3da447099a5b2df    282    Pfam    PF03083    Sugar efflux transporter for intercellular exchange    7    91    1.1E-25    T    02-05-2017    IPR004316    SWEET sugar transporter    GO:0016021
  • What would be the best way to read that above file and put the data in correct structure?

Thank you in advance.

ADD REPLYlink modified 13 days ago • written 13 days ago by mictadlo0
  1. The genome= argument is optional, it isn't needed to import a GTF file.
  2. See width.
  3. See read.table.
ADD REPLYlink modified 13 days ago • written 13 days ago by Aaron Lun16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 276 users visited in the last hour