Hello..I have a list of .CEL files from HTA 2.0 chips. I created an ExpressionSet object by reading them with the R package "oligo". This object has data on 70523 features (genes) and 48 samples. What is best way to annotate these genes so that I can create a "GenomicRanges" object with the following attributes: Chromosome, Transcript_start, Transcript end ?
Sincere thanks for your reply!…But this gives the annotation for 67758 transcripts, while I have the expression data on 70523 transcripts. How can I get the annotation for the remaining transcripts?
which removes all the data where the seqname (chromosome) is missing. It turns out there are 2995 probesets for which there is no chromosome in the annotation file. If you don't have a chromosome, how will you add it to the GRanges object? If we look a bit closer at these probesets, they don't have start/end information either:
And if we go to netaffx to see what these probesets are supposed to measure, we either return no results, or we get redirected to some random page. Try querying for 2824546_st at netaffx, or better yet, one of these PSR probesets:
I have a similar situation. I have read all my 30 CEL file into R and Normalized them using Oligo package. How can I use pd.hta.2.0 package to annotate my data ? I have been searching for codes. From what I have seen some people have tried to build their on library but I am not sure if is that a necessary step to annotate my data?
The short answer is that you can't. If you re-read the code in my previous answer, you will note this line:
> annot <- annot[!annot$seqname %in% c("", NA, "NA"),]which removes all the data where the seqname (chromosome) is missing. It turns out there are 2995 probesets for which there is no chromosome in the annotation file. If you don't have a chromosome, how will you add it to the GRanges object? If we look a bit closer at these probesets, they don't have start/end information either:
> annot.noseqname <- annot[annot$seqname == "", 1:6] > head(annot.noseqname) transcriptclusterid probesetid seqname strand start stop 2824546_st 2824546_st 2824546_st 0 0 2824549_st 2824549_st 2824549_st 0 0 2824551_st 2824551_st 2824551_st 0 0 2824554_st 2824554_st 2824554_st 0 0 2827992_st 2827992_st 2827992_st 0 0 2827995_st 2827995_st 2827995_st 0 0 > table(annot.noseqname$start) 0 2995 > table(annot.noseqname$stop) 0 2995And if we go to netaffx to see what these probesets are supposed to measure, we either return no results, or we get redirected to some random page. Try querying for 2824546_st at netaffx, or better yet, one of these PSR probesets:
> tail(annot.noseqname) transcriptclusterid probesetid PSR6_ssto_hap7002446.hg.1 PSR6_ssto_hap7002446.hg.1 PSR6_ssto_hap7002447.hg.1 PSR6_ssto_hap7002447.hg.1 PSR6_ssto_hap7002454.hg.1 PSR6_ssto_hap7002454.hg.1 PSR6_ssto_hap7002455.hg.1 PSR6_ssto_hap7002455.hg.1 PSR6_ssto_hap7002456.hg.1 PSR6_ssto_hap7002456.hg.1 PSR6_ssto_hap7002459.hg.1 PSR6_ssto_hap7002459.hg.1 seqname strand start stop PSR6_ssto_hap7002446.hg.1 0 0 PSR6_ssto_hap7002447.hg.1 0 0 PSR6_ssto_hap7002454.hg.1 0 0 PSR6_ssto_hap7002455.hg.1 0 0 PSR6_ssto_hap7002456.hg.1 0 0 PSR6_ssto_hap7002459.hg.1 0 0So long story short, you cannot get the position information for all of the probesets, because some of them have no such information.
The 2995 probesets in question are various types of controls, and are not considered main content for gene level analysis.