HapMap gene list
@noxyportgmailcom-4197
Last seen 9.6 years ago
Hi,

I have a problem with the gene list (GFF version 3 file) that HapMap is using (ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2009-02_phaseII+III/gff/refGene_hg18_tests_11Apr07.gff.gz). I tried loading the file into R and selecting all "mRNA" entries, but something seems to go wrong:

> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
> nrow(hapmap)
[1] 171701
> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
> nrow(hapmap2)
[1] 12718
> hapmap[(2210:2220), (1:3)]
       V1     V2             V3
2210 chr1 UCSC_1           mRNA
2211 chr1 UCSC_1 five_prime_UTR
2212 chr1 UCSC_1 five_prime_UTR
2213 chr1 UCSC_1            CDS
2214 chr1 UCSC_1            CDS
2215 chr1 UCSC_1            CDS
2216 chr1 UCSC_1            CDS
2217 chr1 UCSC_1            CDS
2218 chr1 UCSC_1            CDS
2219 chr1 UCSC_1            CDS
2220 chr1 UCSC_1            CDS

Can anyone explain why this could be? It is probably the large descriptive column (V9), but I don't see where it fails.

I have to admit that this is probably not the best way to use this file, but I cannot find any other source (RefSeq, UCSC) that contains the same genomic regions for the genes as annotated in HapMap. Which NCBI 36 build did they use, and where can I download a gene file with chromosome, gene start and stop matching HapMap?

Thanks for your help!
@kasper-daniel-hansen-2979
Last seen 10 months ago
United States
On Wed, Aug 4, 2010 at 1:41 PM, noxyport at gmail.com wrote:
>> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
>> nrow(hapmap2)
> [1] 12718
>> hapmap[(2210:2220), (1:3)]

Here, you want to use hapmap2 and not hapmap.

Kasper
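The point about hapmap versus hapmap2 can be reproduced on a toy table. The sketch below is only an illustration: the data frame `gff` and its values are made up and are not taken from the HapMap file. It shows that indexing the original table still returns every feature type, while the filtered table contains only mRNA rows and keeps the row labels of the original.

```r
# Toy stand-in for the GFF table (hypothetical values, not HapMap data)
gff <- data.frame(
  V1 = "chr1",
  V2 = "UCSC_1",
  V3 = c("mRNA", "five_prime_UTR", "CDS", "CDS", "mRNA", "CDS"),
  stringsAsFactors = FALSE
)

# Keep only the mRNA rows, as in the original post
mrna <- gff[gff$V3 == "mRNA", ]

# Indexing the unfiltered table still shows mixed feature types
print(gff[1:4, ])

# The filtered table holds only mRNA rows; its row labels are
# carried over from the unfiltered table, not renumbered
print(mrna)
print(rownames(mrna))
```

The carried-over row labels would explain why a row such as hapmap2[205,] can print with a label like 2759: the label is the row's position in the unfiltered table.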
You are right! Sorry to bother you with this.

However, there is still something wrong. When I export the file again (write.table), CDS and UTR entries are still included, and when you run:

> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
> nrow(hapmap)
[1] 171701
> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
> nrow(hapmap2)
[1] 12718
> hapmap2[205,]
       V1     V2   V3       V4       V5 V6 V7 V8
2759 chr1 UCSC_1 mRNA 11840109 11841579  .  -  .
V9
2759 ID=NM_002521;Alias=NPPB;Note=natriuretic peptide precursor B preproprotein;summary=This gene is a member of the natriuretic peptide family and encodes a secreted protein which functions as a cardiac hormone. The protein undergoes two cleavage events%2C one within the cell and a second after secretion into the blood. The proteins biological actions include natriuresis%2C diuresis%2C vasorelaxation%2C inhibition of renin and aldosterone secretion%2C and a key role in cardiovascular homeostasis. A high concentration of this protein in the bloodstream is indicative of heart failure. Mutations in this gene have been associated with postmenopausal osteoporosis. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Entrez Gene record to access additional publications.\nchr1\tUCSC_1\tthree_prime_UTR\t11840109\t11840298\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840299\t11840315\t.\t-\t1\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840858\t11841113\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11841346\t11841477\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tfive_prime_UTR\t11841478\t11841579\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tmRNA\t11902712\t11909067\t.\t-\t.\tID=NM_138346;Alias=KIAA2013;Note=hypothetical protein LOC90231\nchr1\tUCSC_1\tthree_prime_UTR\t11902712\t11902958\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11902959\t11902976\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11905280\t11906133\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11907849\t11908881\t.\t-\t0\tParent=NM_138346\nchr1\tUCSC_1\tfive_prime_UTR\t11908882\t11909067\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tmRNA\t11917333\t11958180\t.\t+\t.\tID=NM_000302;Alias=PLOD1;Note=lysyl hydroxylase precursor;summary=Lysyl hydroxylase is a membrane-bound homodimeric protein localized to the cisternae of the endoplasmic reticulum. The enzyme (cofactors iron and ascorbate) catalyzes the hydroxylation of lysyl residues in collagen-like peptides. The resultant hydroxylysyl groups are attachment sites for carbohydrates in col ... (shortened here)

I have no idea where R takes these "\t.*" parts from, but I think they corrupt the whole data frame somehow. Any suggestions?

Thanks

On Wed, Aug 4, 2010 at 7:08 PM, Kasper Daniel Hansen wrote:
> Here, you want to use hapmap2 and not hapmap.
>
> Kasper
The \t is a tab character. You may do better by using the default sep argument rather than by specifying one yourself.

Best,

Jim

On 8/4/10 4:49 PM, noxyport at gmail.com wrote:
> I have no idea where R takes these "\t.*" parts from but I think they
> screw the whole dataframe somehow. Any suggestions?

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
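One possible explanation for the embedded \n and \t sequences in V9, which the thread itself does not confirm, is quote handling: read.table treats both " and ' as quote characters by default, so an apostrophe inside an attribute field can open a quoted string that runs across line breaks until the next apostrophe. The sketch below is a minimal illustration under that assumption. The four records are invented for the example and the temporary file stands in for the real GFF file. It keeps sep = "\t" because the attribute column contains spaces, and switches quoting off with quote = "".

```r
# Four toy GFF-like records (made up for illustration, not HapMap data).
# The attribute column of each mRNA record contains an apostrophe.
records <- c(
  "chr1\tUCSC_1\tmRNA\t100\t900\t.\t-\t.\tID=tx1;Note=the protein's role",
  "chr1\tUCSC_1\tCDS\t200\t300\t.\t-\t0\tParent=tx1",
  "chr1\tUCSC_1\tmRNA\t1000\t1900\t.\t+\t.\tID=tx2;Note=another gene's note",
  "chr1\tUCSC_1\tCDS\t1200\t1300\t.\t+\t0\tParent=tx2"
)
tmp <- tempfile(fileext = ".gff")
writeLines(records, tmp)

# Default quoting: an apostrophe may open a quoted string that
# swallows the following lines into a single field
bad <- read.table(tmp, header = FALSE, sep = "\t",
                  stringsAsFactors = FALSE)

# Quoting and comment handling switched off: one row per input line
good <- read.table(tmp, header = FALSE, sep = "\t", quote = "",
                   comment.char = "", stringsAsFactors = FALSE)

print(nrow(bad))
print(nrow(good))
print(table(good$V3))
```

If the row count with quote = "" comes out higher than with the defaults on the real file, that would suggest lines had been merged into V9, and the mRNA subset should then be taken from the table read without quoting.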
