                    Yoo, Seungyeul
        
    
        @yoo-seungyeul-5323
    Dear all,
I'm working on a DNA methylation microarray dataset. The microarray
design is "pd.feinberg.hg18.me.hx1".
I used the CHARM package to estimate the methylation percentage and
selected the 1000 probes with the largest variance in methylation level
across samples.
The 1000 probes are identified by chromosome coordinates, as follows.
> rnames[1:10]
 [1] "chr1:1707145" "chr1:2148663" "chr1:3133683" "chr1:3180808" "chr1:3294081"
 [6] "chr1:3470900" "chr1:3470969" "chr1:3633816" "chr1:3676205" "chr1:3720637"
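To compare these names with transcript coordinates, the chromosome and the position have to be separated first. A minimal base-R sketch of that step follows; the data frame `probes` and its column names `chrom` and `pos` are illustrative choices, not part of the CHARM output, and only three of the names above are used.

```r
# Example probe names in the same "chr:position" form as rnames above.
rnames <- c("chr1:1707145", "chr1:2148663", "chr1:3133683")

# Split each name at the colon into a chromosome part and a position part.
parts <- strsplit(rnames, ":", fixed = TRUE)

# Collect the pieces into a data frame with one row per probe.
probes <- data.frame(
  chrom = vapply(parts, `[`, character(1), 1),
  pos   = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
probes
```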
Now I want to look at the expression of the genes near these 1000 probes and
at the correlation between gene expression and DNA methylation.
I loaded the human transcript annotation from UCSC and extracted
the features of all transcripts as follows.
hg18KG <- loadFeatures("hg18_UCSC.sqlite")
tbl_tx <- select(hg18KG, keys(hg18KG, "GENEID"),
                 cols = c("GENEID", "TXNAME", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
                 keytype = "GENEID")
> tbl_tx[1:10,]
   GENEID     TXNAME TXCHROM TXSTRAND   TXSTART     TXEND
1       1 uc002qsd.2   chr19        -  63549984  63556677
2       1 uc002qsf.1   chr19        -  63551644  63565932
3      10 uc003wyw.1    chr8        +  18293035  18303003
4      10 uc010lte.1    chr8        +  18301794  18302666
5     100 uc002xmj.1   chr20        -  42681577  42713790
6     100 uc010ggt.1   chr20        -  42681577  42713790
7    1000 uc002kwg.1   chr18        -  23784933  24011189
8   10000 uc001iaa.2    chr1        - 241731689 241733518
9   10000 uc001hzz.1    chr1        - 241718158 242073207
10  10000 uc001iab.1    chr1        - 241733107 242073207
For each of the 1000 probes, I want to find the closest transcript
start position (TXSTART).
But I don't know how to handle strand. The raw data provide no strand
information, but every transcript has a strand (either "+" or "-").
How can I calculate the distance from a probe coordinate to a transcript
start that lies on the "+" or the "-" strand?
Can I just ignore the strand and treat +111111 and -111111 in the same
way? My guess is that they should be treated differently, because the
genome sequence is not symmetric.
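One possible way to bring strand into the calculation is sketched here; it is not a settled answer. It assumes that TXSTART is always the smaller coordinate, as in the "-" rows of the table above, so that the transcription start site is TXSTART on "+" and TXEND on "-". The toy table `tx`, the column `TSS`, and the helper `nearest_tss` are hypothetical names introduced for illustration, and the coordinates are made up.

```r
# Toy transcript table with the same columns as tbl_tx (values are invented
# for illustration and are not taken from hg18).
tx <- data.frame(
  TXNAME   = c("txA", "txB", "txC"),
  TXCHROM  = c("chr1", "chr1", "chr1"),
  TXSTRAND = c("+", "-", "+"),
  TXSTART  = c(1000L, 2000L, 9000L),
  TXEND    = c(1500L, 3000L, 9500L),
  stringsAsFactors = FALSE
)

# Assumption: TXSTART <= TXEND regardless of strand, so the transcription
# start site (TSS) is TXSTART on "+" and TXEND on "-".
tx$TSS <- ifelse(tx$TXSTRAND == "+", tx$TXSTART, tx$TXEND)

# For one probe, find the nearest TSS on the same chromosome and report a
# signed distance measured in the direction of transcription:
# negative = probe is upstream of the TSS, positive = downstream.
nearest_tss <- function(chrom, pos, tx) {
  cand <- tx[tx$TXCHROM == chrom, ]
  if (nrow(cand) == 0) return(NULL)
  i <- which.min(abs(pos - cand$TSS))
  signed <- (pos - cand$TSS[i]) * ifelse(cand$TXSTRAND[i] == "+", 1, -1)
  data.frame(TXNAME = cand$TXNAME[i], TSS = cand$TSS[i], distance = signed,
             stringsAsFactors = FALSE)
}

# A probe at chr1:3100 is closest to the TSS of the "-" strand transcript txB.
hit <- nearest_tss("chr1", 3100L, tx)
hit
```

In this sketch the probe itself has no strand. The strand of the transcript only decides which end counts as the start and which side counts as upstream.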
I have only recently moved into genomics from a different field and have
little experience working with genome sequences, so I apologize if this
is a naive question.
Any comments, even conceptual ones, would be very helpful.
Thank you.
Seungyeul Yoo
Postdoctoral Fellow
Institute of Genomics and Multiscale Biology
Department of Genetics and Genomic Sciences
Mount Sinai School of Medicine
(office) 212-659-6877