What is the easiest way to make the GC content and length matrix for package cqn?
2
2
Entering edit mode
@matthew-thornton-5564
Last seen 3 months ago
USA, Los Angeles, USC
Hello! I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. >From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) # Load Libraries library(GenomicFeatures) library(GenomicRanges) txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') tr_by_gene <- transcriptsBy(txdb,'gene') library(Rsamtools) r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") # counts reads overlapping the genes. c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) # Create a count table countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) rownames(countTable)=names(tr_by_gene) Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? Thank you. Matt
RNASeq edgeR cqn RNASeq edgeR cqn • 3.6k views
1
Entering edit mode
@harris-a-jaffee-3972
Last seen 8.3 years ago
United States
I would like to know the answer! This is all I have. > width(tr_by_gene) IntegerList of length 25529 [["1"]] 6694 [["10"]] 9969 [["100"]] 32214 [["1000"]] 226516 [["10000"]] 355050 343564 343866 355040 343554 343856 [["100008586"]] 187588 6118 6109 [["100008587"]] 156 156 156 156 156 156 156 156 [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 [["100008589"]] 188093 5066 5077 5070 [["100009613"]] 2915 ... <25519 more elements> Then my guess would be to build the "DNAStringSetList" (see Biostrings) underlying your transcripts, say x, and do sapply(x, letterFrequency, letters="GC") Hopefully there's something better. On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: > Hello! > > I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. > >> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. > > What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) > > # Load Libraries > library(GenomicFeatures) > library(GenomicRanges) > > txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') > > tr_by_gene <- transcriptsBy(txdb,'gene') > > library(Rsamtools) > > r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") > r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") > r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") > r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") > > # counts reads overlapping the genes. > c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) > c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) > c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) > c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) > > # Create a count table > countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) > > rownames(countTable)=names(tr_by_gene) > > Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? > > Thank you. > > Matt > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
0
Entering edit mode
On 04/08/2014 02:33 PM, Harris A. Jaffee wrote: > I would like to know the answer! This is all I have. > >> width(tr_by_gene) > IntegerList of length 25529 > [["1"]] 6694 > [["10"]] 9969 > [["100"]] 32214 > [["1000"]] 226516 > [["10000"]] 355050 343564 343866 355040 343554 343856 > [["100008586"]] 187588 6118 6109 > [["100008587"]] 156 156 156 156 156 156 156 156 > [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 > [["100008589"]] 188093 5066 5077 5070 > [["100009613"]] 2915 > ... > <25519 more elements> > > Then my guess would be to build the "DNAStringSetList" (see Biostrings) > underlying your transcripts, say x, and do > > sapply(x, letterFrequency, letters="GC") > > Hopefully there's something better. Yes, that's the idea. You want to try avoiding the sapply() though, which is generally slow: ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based ## on but, sadly, the reference sequences are named differently). ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's ## easier to work with TranscriptDb objects obtained from UCSC. library(BSgenome) genome <- getBSgenome("GRCh38") ## The following hack would not be needed if hg38 and GRCh38 were ## using the same chromosome names. We'll keep only chr1-22, chrX, ## chrY, chrM, because it's too complicated to safely map and rename ## the other sequences. seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25] seqlevels(genome)[1:25] <- seqlevels(tr_by_gene) genome(tr_by_gene) <- "GRCh38" ## Extract the transcript sequences grouped by gene: trseq_by_gene <- getSeq(genome, tr_by_gene) ## Compute the transcript GC counts grouped by gene: af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE), baseOnly=TRUE) trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene) Then: > trGCcount_by_gene IntegerList of length 24443 [["1"]] 3856 [["10"]] 3697 [["100"]] 17142 [["1000"]] 84322 [["100008586"]] 3156 2539 2537 [["100009613"]] 1578 [["100009676"]] 1463 [["10001"]] 6854 6854 6854 6854 [["10002"]] 2652 4134 [["10003"]] 20205 ... <24433 more elements> Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14. I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but presents them to the user with the UCSC names. Cheers, H. > > On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: > >> Hello! >> >> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. >> >>> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. >> >> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) >> >> # Load Libraries >> library(GenomicFeatures) >> library(GenomicRanges) >> >> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') >> >> tr_by_gene <- transcriptsBy(txdb,'gene') >> >> library(Rsamtools) >> >> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") >> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") >> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") >> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") >> >> # counts reads overlapping the genes. >> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) >> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) >> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) >> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) >> >> # Create a count table >> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) >> >> rownames(countTable)=names(tr_by_gene) >> >> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? >> >> Thank you. >> >> Matt >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
Hi Matt, On 04/08/2014 03:23 PM, Hervé Pagès wrote: > > > On 04/08/2014 02:33 PM, Harris A. Jaffee wrote: >> I would like to know the answer! This is all I have. >> >>> width(tr_by_gene) >> IntegerList of length 25529 >> [["1"]] 6694 >> [["10"]] 9969 >> [["100"]] 32214 >> [["1000"]] 226516 >> [["10000"]] 355050 343564 343866 355040 343554 343856 >> [["100008586"]] 187588 6118 6109 >> [["100008587"]] 156 156 156 156 156 156 156 156 >> [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 >> [["100008589"]] 188093 5066 5077 5070 >> [["100009613"]] 2915 >> ... >> <25519 more elements> >> >> Then my guess would be to build the "DNAStringSetList" (see Biostrings) >> underlying your transcripts, say x, and do >> >> sapply(x, letterFrequency, letters="GC") >> >> Hopefully there's something better. > > Yes, that's the idea. You want to try avoiding the sapply() though, > which is generally slow: > > ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based > ## on but, sadly, the reference sequences are named differently). > ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's > ## easier to work with TranscriptDb objects obtained from UCSC. > library(BSgenome) > genome <- getBSgenome("GRCh38") > > ## The following hack would not be needed if hg38 and GRCh38 were > ## using the same chromosome names. We'll keep only chr1-22, chrX, > ## chrY, chrM, because it's too complicated to safely map and rename > ## the other sequences. > seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25] > seqlevels(genome)[1:25] <- seqlevels(tr_by_gene) > genome(tr_by_gene) <- "GRCh38" Don't know where you stand on this but FWIW I just added a new utility to the new GenomeInfoDb package that makes it relatively easy to rename the sequences in GRCh38 with the names used by UCSC: library(GenomeInfoDb) hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38") hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevels, hg38_chrominfo$NCBI_seqlevels) seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)] so the above hack is not needed anymore. Of course, things should be made even easier, e.g. maybe the user should be able to just do: genome(genome) <- "hg38" But I'll have to leave this for the next devel cycle. HTH, H. > > ## Extract the transcript sequences grouped by gene: > trseq_by_gene <- getSeq(genome, tr_by_gene) > > ## Compute the transcript GC counts grouped by gene: > af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE), > baseOnly=TRUE) > trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene) > > Then: > > > trGCcount_by_gene > IntegerList of length 24443 > [["1"]] 3856 > [["10"]] 3697 > [["100"]] 17142 > [["1000"]] 84322 > [["100008586"]] 3156 2539 2537 > [["100009613"]] 1578 > [["100009676"]] 1463 > [["10001"]] 6854 6854 6854 6854 > [["10002"]] 2652 4134 > [["10003"]] 20205 > ... > <24433 more elements> > > Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14. > I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package > that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but > presents them to the user with the UCSC names. > > Cheers, > H. > >> >> On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: >> >>> Hello! >>> >>> I have some RNASeq data that I am analyzing with edgeR and I would >>> like to use the cqn package to correct for GC bias. I have aligned >>> the data to the UCSC hg38 genome. >>> >>>> From googling I have found the the bedtools 'nuc' command will give >>>> me the GC content with ranges and the length. Providing that I have >>>> a bed file of the hg38 genome. >>> >>> What I need to make sure of is that I am calculating the length and >>> the GC content for the correct intervals. I bin my reads using >>> GenomicFeatures like this (thank you Homer) >>> >>> # Load Libraries >>> library(GenomicFeatures) >>> library(GenomicRanges) >>> >>> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') >>> >>> tr_by_gene <- transcriptsBy(txdb,'gene') >>> >>> library(Rsamtools) >>> >>> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") >>> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") >>> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") >>> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") >>> >>> # counts reads overlapping the genes. >>> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) >>> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) >>> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) >>> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) >>> >>> # Create a count table >>> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, >>> AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) >>> >>> rownames(countTable)=names(tr_by_gene) >>> >>> Is there a way to get lengths and gc content from 'tr_by-gene'? Is >>> there a way to create a bed file from 'tr_by_gene' so that I can then >>> use bedtools nuc? >>> >>> Thank you. >>> >>> Matt >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
@martin-morgan-1513
Last seen 3 days ago
United States