Hi,
I would like to extract the coordinates for a chromosome. I made a chunk of code, but as I am new to these packages, I am not sure if I follow the right logic here:
library(IRanges) library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene",use.names=TRUE) #For example, only I am interested in intergenic coordinates in chromosome 1 seqlevels(txdb, force=TRUE) = c("chr1") #Creates IRanges ir = IRanges(start=unlist(start(txdb)), end=unlist(end(txdb)), names=names(txdb)) # Reduce overlapping gene positions to continuous range ir.red = reduce(ir) # Collapses ranges of overlapping genes to continuous range. #Identify overlaps among the initial and reduced range data sets overlap = findOverlaps(ir, ir.red)
So, the overlap should contain the intergenic coordinate of these genes. Is this right?
A subquestions:
What does the Txdb object really contain? By this I am mean: I am confused what grouping really means. For
example, if I group by genes how is that different if I group by exons or introns? Are not introns and exons part of
the gene, so is not that the same? Maybe the problem is just fundamental and revolves around the definition of a gene, but I am confused : if we have transcripts, isnt that the same as genes?
The
by
argument totranscriptsBy()
specifies how the output should be grouped. When you specify'gene'
you get back a GRangesList with one entry per gene - again this is ~23,500. For each gene there maybe one or more transcripts and you get the start and end position for each. This is certainly the right direction for what you're trying to do.Alternatively, if you specified
by = 'exon'
you get back a much longer GRangesList (~290,000 entries, that's the total number of exons in the genome). For each of these you get list of the transcripts that exon could be part of. That might be interesting in some situations, but isn't very helpful in identifying intergenic regions.I'm not sure that using IRanges is appropriate for the next step though. By doing this we lose the strand information, which may or may not be important to you. It depends what you're trying to do. In the example I gave I was working with a stranded RNA-seq library and looking to see how many mapped to intergenic regions. Since it was a stranded library I wanted to retain this information - you may not.
Assuming you don't care about strand, I'm also not sure of the logic in the
findOverlaps()
step. This is asking 'what elements inir
overlap with elements inir.red
?'. Since you generatedir.red
by reducingir
the answer will be everything. What you really want to do is find the spaces between elememts ofir.red
. You can do this withgaps(ir.red)
, with the caveat that you will miss the regions between the start of the chromosome and the first gene and the end of the last gene and the chromosome end.Thank you again for answering again.
Well, I can see what you are trying to tell me. But, I many more questions:
1) When I group by a gene, does that mean in the output I have just the genes and nothing else? If yes, then this does not make any sense to me. First, this is a TxDb object, so it has only transcripts. How can I get genes from them? Most genes
have overlapping transcripts, therefore how can we know that a transcript belongs to a certain gene, when again two genes share it! I know we can locate the start of a gene from the TSS, but again, the promoter is around that same point, so should this mean that twogene can share a promoter ( of course this would mean that a promoter is upstream and downstream from the TSS).
2) "For each of these you get list of the transcripts that exon could be part of"
So this means that many transcripts can be a part of the same exon, but does that mean they are part of the same gene?
3) You did not use the funciton gaps in your answer. Why? Would that be better?
4) What is a stranded RNA-seq library? Is that something good or bad? I know about single strand and pair end strategies, so is this connect to this approach of manipulating the RNA-seq as you did?
1) A gene contains a fixed set of exons and as a general statement we can say that an exon is only associated with one gene. However you can create various forms of the same gene by including only some of the exons. It is these sets of exons that are called transcripts. Thus each gene can contain many transcripts and a specifc exon can appear in one, several or all of those transcripts.
The TxDb object contains information on all of these different transcripts. When you group by gene this gets the start and end coordinates for each the the transcripts and then returns them to you sorted in a way that says 'this bunch come from gene A' and 'this bunch come from gene B'.
2) It's not that many transcripts can be part of the same exon, it's the otherway around. An exon can be part of many transcripts, and we would say that all those transcripts come from the same gene.
3)
gaps()
is used in the third stages of my solution, and since we're working with GRanges there (rather than IRanges), it is able to take into account both strands and the start/ends of chromosomes.4) Stranded RNA-seq is a technique that allows you to know which strand of DNA was transcribed to generate the read. Sometimes you can get two different genes encoded on opposite strands and so mapping a read to that position may be ambiguous - you don't know which of the two genes was actually being transcribed. Having a stranded library can be useful as it allows you to determine this.
It's hard to say whether you need to be concerned by this without knowing what you're trying to do.