CSAW: Annotation through clustering with external information
3
0
Entering edit mode
@sergioespeso-gil-6997
Last seen 4.8 years ago
New York

Hi,

I am following different strategies while clustering I don't know if I performed properly the annotation while using clustering with external information. As I need to specify the regions to be annotated I have used the predefined object "promoter" as follows:

require(TxDb.Mmusculus.UCSC.mm9.knownGene)
promoter<-promoters(TxDb.Mmusculus.UCSC.mm9.knownGene)
head(promoter)

And then while annotating:

require(org.Mm.eg.db)
anno<- detailRanges(promoter, txdb=TxDb.Mmusculus.UCSC.mm9.knownGene,
                    orgdb=org.Mm.eg.db, promoter=c(3000,1000), dist=5000)
head(anno$overlap)

Is it ok? I think it is , but I got this message:

Warning message:
In detailRanges(promoter, txdb = TxDb.Mmusculus.UCSC.mm9.knownGene,  :
  strandedness in incoming regions is ignored when overlapping

Thanks!!

Sergio

csaw • 1.5k views
ADD COMMENT
0
Entering edit mode

Well maybe there is no need to specify the distance, and I can do just:

anno.ranges<-detailRanges(promoter,txdb=TxDb.Mmusculus.UCSC.mm9.knownGene, orgdb=org.Mm.eg.db)

But I get the same warning

 

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 9 hours ago
The city by the bay

Okay, you've got a lot of questions here, but let's see what we can do. For your first question, the warning message just means that the strand of the incoming regions is ignored when computing overlaps with annotated features, i.e., the function won't limit overlaps of forward-strand incoming to forward-strand annotation. If you do want this behaviour, you can get it by setting ignore.strand=FALSE in the devel version of csaw. That said, I'm not sure it makes any sense to annotate promoters, given that you already know the gene corresponding to each promoter.

As for the BED file, you're not showing what the error is (if there is any), so I can't be sure what you're doing wrong. From looking at your code, the most obvious mistake is that you should be assigning

test_up$score<- tabpromoter$logFC.up[up]

as test_up is subsetted but tabpromoter is not. There may also be an issue with using the logFC.up count as the score, but that would depend on how GREAT uses the score information.

ADD COMMENT
0
Entering edit mode
@sergioespeso-gil-6997
Last seen 4.8 years ago
New York

I have also problems to try to define a bed file for GO analysis (using GREAT) for up and down regions, as the table that I get has logFC.up and logFC.down I was trying to define something like this... but I am doing something wrong obviously:

up<-tabpromoter$logFC.up>1
require(rtracklayer)
test_up<-promoter[up]
test_up$score<- tabpromoter$logFC.up
names(test_up)<-paste0("region", 1:sum(up))
export(test_up,"tabpromoter_UP.bed")

Help! XD 

Thanks!

Sergio

ADD COMMENT
0
Entering edit mode
@sergioespeso-gil-6997
Last seen 4.8 years ago
New York

Hi Aaron for the answers. Concerning my second question:

tabpromoter is setted as follows:

olap<- findOverlaps(promoter, rowRanges(filtered))
olap
tabpromoter<- combineOverlaps(olap, results.promoter$table)
head(tabpromoter[!is.na(tabpromoter$PValue),])

Concerning the error:

> test_up<-promoter[up]
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript contains NAs

It is maybe easier to explain what I want to do. I want to have the DBS found in promoters for several histone marks. It is true that it doesn't matter to annotate as I am using a package of annotated promoters, good point . The only thing that I need is to find a way to merge the windows on the promoters and sort those that are GAIN or LOSS because the treatment into different tables or bed files.  I have filtered using predefined regions and abundance , as follows:

ab <- aveLogCPM(asDGEList(data))
keep.ab <- ab > 0
keep.reg <- overlapsAny(rowRanges(data), regions)
keep <- keep.ab & keep.reg
filtered.data <- data[keep,]

I know..I have read plenty of times the documentation , but I do not know what it is better if doing "clustering with external information" (6.3) or merging using the tol option (6.4).  I am a bit lost-frustated. Sorry. Need just some hints I guess

ADD COMMENT
1
Entering edit mode

Ah, yes. The error is because NA values are reported in the output table for those promoters that don't have any overlapping windows. This was intended to ensure a 1:1 mapping between promoter and tabpromoter, even for those promoters that are empty. However, it means that up needs to be defined as:

up <- tabpromoter$logFC.up>1 & !is.na(tabpromoter$logFC.up)

This should avoid getting NA values in the subscript. Keep in mind that you also need to subset tabpromoter$logFC.up before assigning it to the score metadata of test_up, as I described in my first post.

Now, as for how you should cluster; your promoter-based approach will report DB in terms of - well, in terms of promoters, obviously. If you were to use mergeWindows (i.e., tol) on the windows, then you would get a bunch of de novo DB regions, that are defined independently of existing annotation. So, what do you want to do with your results? If you're planning to integrate it with RNA-seq data, then gene/promoter-based output would be more appropriate. If you're trying to discover (differential) binding sites, then de novo region-based output would be better. This is really a question that only you can answer.

ADD REPLY
0
Entering edit mode

Ok, thanks Aaron. I will try. What I also need to ask you is how to enter my own predefined regions, I guess I need to read GRanges documentation, right? 

Have a nice day

ADD REPLY
0
Entering edit mode

Well, if you've got your own pre-defined regions, then you can just store them in a GRanges object and use that in place of the promoter object in the various parts of your code, e.g., in findOverlaps. However, it's important to note that the manner in which those regions are defined should be independent of DB in this data set. In particular, you can't define regions based on DB windows, and then use those regions to cluster the same windows. This constitutes "double dipping" into the data and is statistically invalid, as it will result in the loss of type I error and FDR control.

ADD REPLY

Login before adding your answer.

Traffic: 792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6