Question

Mapping short reads to gene IDs in GPL9115

NS ▴ 60

@ns-7498

I have downloaded the GSE31617 data. Its platform is GPL9115.

In mRNA files, there are three columns:

Tag                      CopyNumber    TPM (Transcript Per Million)
CATGTGGATGGGCTTCTTGTA    2             0.23
CATGGGGCCTTCCAGACCCAC    2             0.23
CATGTGCATTTTCAAGTGGGT    2             0.23
CATGAGCCACCGCGCCCTGTC    2             0.23
CATGGAACTAATTCGCTGACC    2             0.23
CATGCTGCTTCGGCCCCAGCG    2             0.23
CATGGTCCTCACCCAAGCCTA    2             0.23
CATGTCTGTGTGTGGTGAGCA    2             0.23

How can I map each Tag to a RefSeq ID or gene symbol? My goal is to build an expression matrix.

Unfortunately, I did not find any useful information in the GPL9115 record. The SOFT formatted family file(s) on that page contain only columns like the ones above for each of the samples available in GEO.

I would appreciate it if anyone could help me.

rnaseq EST gene Id • 1.5k views

ADD COMMENT • link updated 8.4 years ago by Martin Morgan 25k • written 8.5 years ago by NS ▴ 60

This question has also been posted to Biostars: https://www.biostars.org/p/165111

ADD REPLY • link 8.4 years ago Gordon Smyth 50k

Yeah, but I could not find a specific answer, e.g. R code or something like that.

ADD REPLY • link 8.4 years ago NS ▴ 60

score 0 · Answer 1 · 2015-11-09

I would guess these were generated with a SAGE-seq-like protocol, similar to what was used in this paper.

I worked with this type of data a bit in grad school. You can map these sequences to their host genes in much the same way you would analyze "normal" sequencing data, i.e., align the original reads to the genome and count them as usual (with something like featureCounts). Alternatively, you could align just the tag sequences in the first column of the table you've shown, then combine the resulting gene hits with the counts in the table. Note that several rows of that data file will most likely resolve to the same gene, so you'll have to decide whether to sum them up to the gene level or to analyze at the unique-tag level (I previously rolled up to the gene level).
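The gene-level roll-up described above can be sketched in a few lines. This is only an illustration in plain Python, not code from the thread: the function name `roll_up_to_genes`, the gene labels, and the policy of skipping tags that hit zero or several genes are all assumptions, and the tag-to-gene mapping would come from whatever alignment step you use.

```python
from collections import defaultdict


def roll_up_to_genes(tag_counts, tag_to_genes):
    """Sum tag-level counts into gene-level counts (hypothetical helper).

    tag_counts:   {tag sequence: copy number} for one sample
    tag_to_genes: {tag sequence: set of gene IDs the tag aligned to}

    Assumption: tags hitting no gene or more than one gene are skipped.
    Other policies (e.g. splitting counts) are equally possible.
    """
    gene_counts = defaultdict(int)
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, set())
        if len(genes) == 1:
            (gene,) = genes
            gene_counts[gene] += count
    return dict(gene_counts)


# Tag sequences are taken from the question; the gene labels are made up.
tag_counts = {
    "CATGTGGATGGGCTTCTTGTA": 2,
    "CATGGGGCCTTCCAGACCCAC": 2,
    "CATGTGCATTTTCAAGTGGGT": 2,
}
tag_to_genes = {
    "CATGTGGATGGGCTTCTTGTA": {"geneA"},
    "CATGGGGCCTTCCAGACCCAC": {"geneA"},
    "CATGTGCATTTTCAAGTGGGT": {"geneA", "geneB"},  # ambiguous, skipped
}

gene_counts = roll_up_to_genes(tag_counts, tag_to_genes)
print(gene_counts)
```

Running the same function over every sample and collecting the results by gene would give the expression matrix the question asks for.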

score 0 · Answer 2 · 2015-11-15

Martin Morgan 25k

@martin-morgan-1513

Use Biostrings::matchPDict(), e.g., following the example labelled "A. A SIMPLE EXAMPLE OF EXACT MATCHING" on its help page. But if the 'tpm' column was computed from a different mapping, then your results will not make any sense.
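To show what exact matching of a fixed-width tag dictionary amounts to, here is a minimal plain-Python sketch. It is not the Biostrings API and makes no claim about how matchPDict() is implemented; the function name `match_tags`, the transcript IDs, and the transcript sequences are invented for illustration. It also assumes the tags only need to be searched on the forward strand of the reference sequences.

```python
def match_tags(tags, transcripts):
    """Exact-match fixed-width tags against reference sequences (hypothetical helper).

    tags:        iterable of tag sequences, all of the same width
    transcripts: {transcript ID: sequence}
    Returns {tag: [(transcript ID, 0-based start), ...]}.
    """
    tags = set(tags)
    widths = {len(tag) for tag in tags}
    if len(widths) != 1:
        raise ValueError("all tags must have the same width")
    (width,) = widths

    hits = {tag: [] for tag in tags}
    for tx_id, seq in transcripts.items():
        # Slide a window of the tag width along the sequence and
        # look each window up in the tag dictionary.
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            if window in hits:
                hits[window].append((tx_id, start))
    return hits


# Tags are from the question; the transcripts are made-up toy sequences.
tags = [
    "CATGTGGATGGGCTTCTTGTA",
    "CATGGGGCCTTCCAGACCCAC",
    "CATGTGCATTTTCAAGTGGGT",
]
transcripts = {
    "tx1": "GGGA" + "CATGTGGATGGGCTTCTTGTA" + "TTT",
    "tx2": "ACGTACGTACGT" + "CATGGGGCCTTCCAGACCCAC" + "AAA",
}

hits = match_tags(tags, transcripts)
for tag in tags:
    print(tag, hits[tag])
```

Tags with an empty hit list did not match any reference sequence exactly and would need a separate decision (drop them, or allow mismatches with a proper aligner).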

ADD COMMENT • link 8.4 years ago Martin Morgan 25k

Thank you so much @Martin, especially for mentioning the point about the 'tpm' column. My background is in computer science and I was not aware of it.

ADD REPLY • link 8.4 years ago NS ▴ 60