HapMap gene list

noxyport@gmail.com ▴ 20

@noxyportgmailcom-4197

Last seen 9.6 years ago

Hi,

I have a problem with the gene list (GFF version 3 file) that HapMap uses (ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2009-02_phaseII+III/gff/refGene_hg18_tests_11Apr07.gff.gz). I tried loading the file into R and selecting all "mRNA" entries, but something seems to go wrong:

    > hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep=" ")
    > nrow(hapmap)
    [1] 171701
    > hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
    > nrow(hapmap2)
    [1] 12718
    > hapmap[(2210:2220), (1:3)]
           V1     V2             V3
    2210 chr1 UCSC_1           mRNA
    2211 chr1 UCSC_1 five_prime_UTR
    2212 chr1 UCSC_1 five_prime_UTR
    2213 chr1 UCSC_1            CDS
    2214 chr1 UCSC_1            CDS
    2215 chr1 UCSC_1            CDS
    2216 chr1 UCSC_1            CDS
    2217 chr1 UCSC_1            CDS
    2218 chr1 UCSC_1            CDS
    2219 chr1 UCSC_1            CDS
    2220 chr1 UCSC_1            CDS

Can anyone explain why this could be? The cause is probably the large descriptive column (V9), but I don't see where it fails.

I admit this is probably not the best way to use this file, but I cannot find any other source (RefSeq, UCSC) that contains the same genomic regions for the genes as annotated in HapMap. Which NCBI 36 build did they use, and where can I download a gene file with chromosome, gene start, and gene stop matching HapMap?

Thanks for your help!

GO HapMap • 1.2k views

updated 13.7 years ago by Kasper Daniel Hansen ★ 6.5k • written 13.7 years ago by noxyport@gmail.com ▴ 20

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 10 months ago

United States

On Wed, Aug 4, 2010 at 1:41 PM, noxyport at gmail.com wrote:
> > hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
> > nrow(hapmap2)
> [1] 12718
> > hapmap[(2210:2220), (1:3)]

Here, you want to use hapmap2 and not hapmap: this line indexes the unfiltered data frame, so the UTR and CDS rows are still shown.

Kasper

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
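A small illustration of this point, using a made-up six-row data frame rather than the HapMap file (all values are hypothetical). Filtering creates a new object and leaves hapmap unchanged, so it is hapmap2 that has to be inspected. The sketch also shows that the filtered rows keep their original row names, which is why a row of hapmap2 can be printed with a label larger than its position.

```r
# Hypothetical stand-in for the first three GFF columns (not the real file)
hapmap <- data.frame(
  V1 = "chr1",
  V2 = "UCSC_1",
  V3 = c("mRNA", "five_prime_UTR", "CDS", "CDS", "mRNA", "CDS"),
  stringsAsFactors = FALSE
)

# Keep only the mRNA rows; the original data frame is not modified
hapmap2 <- hapmap[hapmap$V3 == "mRNA", ]

# Inspect the filtered object, not the original one
print(hapmap2)

# Row names still refer to positions in the unfiltered data frame
print(rownames(hapmap2))

# Confirm that only one feature type is left
print(table(hapmap2$V3))
```

If consecutive row numbers are preferred after filtering, rownames(hapmap2) <- NULL resets them.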

13.7 years ago by Kasper Daniel Hansen ★ 6.5k

You are right! Sorry to bother you with this.

However, there is still something wrong. When I write the result back out (write.table), CDS and UTR entries are still included, and when you run:

    > hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep=" ")
    > nrow(hapmap)
    [1] 171701
    > hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
    > nrow(hapmap2)
    [1] 12718
    > hapmap2[205,]
           V1     V2   V3       V4       V5 V6 V7 V8
    2759 chr1 UCSC_1 mRNA 11840109 11841579  .  -  .
    V9
    2759 ID=NM_002521;Alias=NPPB;Note=natriuretic peptide precursor B preproprotein;summary=This gene is a member of the natriuretic peptide family and encodes a secreted protein which functions as a cardiac hormone. The protein undergoes two cleavage events%2C one within the cell and a second after secretion into the blood. The proteins biological actions include natriuresis%2C diuresis%2C vasorelaxation%2C inhibition of renin and aldosterone secretion%2C and a key role in cardiovascular homeostasis. A high concentration of this protein in the bloodstream is indicative of heart failure. Mutations in this gene have been associated with postmenopausal osteoporosis. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Entrez Gene record to access additional publications.\nchr1\tUCSC_1\tthree_prime_UTR\t11840109\t11840298\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840299\t11840315\t.\t-\t1\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840858\t11841113\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11841346\t11841477\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tfive_prime_UTR\t11841478\t11841579\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tmRNA\t11902712\t11909067\t.\t-\t.\tID=NM_138346;Alias=KIAA2013;Note=hypothetical protein LOC90231\nchr1\tUCSC_1\tthree_prime_UTR\t11902712\t11902958\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11902959\t11902976\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11905280\t11906133\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11907849\t11908881\t.\t-\t0\tParent=NM_138346\nchr1\tUCSC_1\tfive_prime_UTR\t11908882\t11909067\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tmRNA\t11917333\t11958180\t.\t+\t.\tID=NM_000302;Alias=PLOD1;Note=lysyl hydroxylase precursor;summary=Lysyl hydroxylase is a membrane-bound homodimeric protein localized to the cisternae of the endoplasmic reticulum. The enzyme (cofactors iron and ascorbate) catalyzes the hydroxylation of lysyl residues in collagen-like peptides. The resultant hydroxylysyl groups are attachment sites for carbohydrates in col ... (shortened here)

I have no idea where R gets these "\t.*" parts from, but I think they mess up the whole data frame somehow. Any suggestions?

Thanks

13.7 years ago by noxyport@gmail.com ▴ 20

The \t is a tab character. You may do better by using the default sep argument rather than by specifying one yourself.

Best,

Jim

On 8/4/10 4:49 PM, noxyport at gmail.com wrote:
> I have no idea where R takes these "\t.*" parts from but I think they
> screw the whole dataframe somehow. Any suggestions?

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
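A minimal sketch of one likely cause, offered as an assumption rather than a diagnosis of the actual file: read.table treats both " and ' as quote characters by default, so an apostrophe inside a free-text annotation can open a quoted string that runs across line breaks until the next apostrophe. The swallowed lines then show up inside V9 with their tabs and newlines printed as \t and \n. The three-line file below is made up for illustration; passing quote = "" turns quoting off, and sep = "\t" is kept here because GFF columns are tab-delimited.

```r
# Hypothetical three-line GFF-like file; lines 1 and 3 contain an apostrophe
gff_lines <- c(
  "chr1\tUCSC_1\tmRNA\t100\t200\t.\t-\t.\tID=NM_A;Note=the protein's role",
  "chr1\tUCSC_1\tCDS\t100\t150\t.\t-\t0\tParent=NM_A",
  "chr1\tUCSC_1\tmRNA\t300\t400\t.\t+\t.\tID=NM_B;Note=the enzyme's role"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

# Default quote set: the text between the two apostrophes is read as one
# quoted string, line breaks included, so lines get merged into one field
merged <- read.table(tmp, header = FALSE, sep = "\t",
                     stringsAsFactors = FALSE)

# Quoting disabled: every line of the file becomes its own row
clean <- read.table(tmp, header = FALSE, sep = "\t", quote = "",
                    stringsAsFactors = FALSE)

print(nrow(merged))   # fewer rows than the file has lines
print(nrow(clean))    # one row per line

# print() shows embedded tabs and newlines as \t and \n escapes
print(merged$V9[1])

# The mRNA filter now selects only genuine mRNA records
mrna <- clean[clean$V3 == "mRNA", ]
print(mrna[, 1:5])
```

On the real file, comparing nrow() of the data frame with length(readLines(...)) of the same file is a quick way to check whether any lines were merged.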

13.7 years ago by James W. MacDonald 65k
