What is the easiest way to make the GC content and length matrix for package cqn?

2

Entering edit mode

Matthew Thornton ▴ 350

@matthew-thornton-5564

Last seen 5 weeks ago

USA, Los Angeles, USC

Hello! I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. >From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) # Load Libraries library(GenomicFeatures) library(GenomicRanges) txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') tr_by_gene <- transcriptsBy(txdb,'gene') library(Rsamtools) r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") # counts reads overlapping the genes. c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) # Create a count table countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) rownames(countTable)=names(tr_by_gene) Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? Thank you. Matt

RNASeq edgeR cqn RNASeq edgeR cqn • 4.1k views

ADD COMMENT • link updated 10.1 years ago by Martin Morgan 25k • written 10.1 years ago by Matthew Thornton ▴ 350

1

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 9.5 years ago

United States

I would like to know the answer! This is all I have. > width(tr_by_gene) IntegerList of length 25529 [["1"]] 6694 [["10"]] 9969 [["100"]] 32214 [["1000"]] 226516 [["10000"]] 355050 343564 343866 355040 343554 343856 [["100008586"]] 187588 6118 6109 [["100008587"]] 156 156 156 156 156 156 156 156 [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 [["100008589"]] 188093 5066 5077 5070 [["100009613"]] 2915 ... <25519 more elements> Then my guess would be to build the "DNAStringSetList" (see Biostrings) underlying your transcripts, say x, and do sapply(x, letterFrequency, letters="GC") Hopefully there's something better. On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: > Hello! > > I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. > >> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. > > What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) > > # Load Libraries > library(GenomicFeatures) > library(GenomicRanges) > > txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') > > tr_by_gene <- transcriptsBy(txdb,'gene') > > library(Rsamtools) > > r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") > r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") > r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") > r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") > > # counts reads overlapping the genes. > c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) > c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) > c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) > c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) > > # Create a count table > countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) > > rownames(countTable)=names(tr_by_gene) > > Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? > > Thank you. > > Matt > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.1 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

On 04/08/2014 02:33 PM, Harris A. Jaffee wrote: > I would like to know the answer! This is all I have. > >> width(tr_by_gene) > IntegerList of length 25529 > [["1"]] 6694 > [["10"]] 9969 > [["100"]] 32214 > [["1000"]] 226516 > [["10000"]] 355050 343564 343866 355040 343554 343856 > [["100008586"]] 187588 6118 6109 > [["100008587"]] 156 156 156 156 156 156 156 156 > [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 > [["100008589"]] 188093 5066 5077 5070 > [["100009613"]] 2915 > ... > <25519 more elements> > > Then my guess would be to build the "DNAStringSetList" (see Biostrings) > underlying your transcripts, say x, and do > > sapply(x, letterFrequency, letters="GC") > > Hopefully there's something better. Yes, that's the idea. You want to try avoiding the sapply() though, which is generally slow: ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based ## on but, sadly, the reference sequences are named differently). ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's ## easier to work with TranscriptDb objects obtained from UCSC. library(BSgenome) genome <- getBSgenome("GRCh38") ## The following hack would not be needed if hg38 and GRCh38 were ## using the same chromosome names. We'll keep only chr1-22, chrX, ## chrY, chrM, because it's too complicated to safely map and rename ## the other sequences. seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25] seqlevels(genome)[1:25] <- seqlevels(tr_by_gene) genome(tr_by_gene) <- "GRCh38" ## Extract the transcript sequences grouped by gene: trseq_by_gene <- getSeq(genome, tr_by_gene) ## Compute the transcript GC counts grouped by gene: af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE), baseOnly=TRUE) trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene) Then: > trGCcount_by_gene IntegerList of length 24443 [["1"]] 3856 [["10"]] 3697 [["100"]] 17142 [["1000"]] 84322 [["100008586"]] 3156 2539 2537 [["100009613"]] 1578 [["100009676"]] 1463 [["10001"]] 6854 6854 6854 6854 [["10002"]] 2652 4134 [["10003"]] 20205 ... <24433 more elements> Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14. I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but presents them to the user with the UCSC names. Cheers, H. > > On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: > >> Hello! >> >> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. >> >>> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. >> >> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) >> >> # Load Libraries >> library(GenomicFeatures) >> library(GenomicRanges) >> >> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') >> >> tr_by_gene <- transcriptsBy(txdb,'gene') >> >> library(Rsamtools) >> >> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") >> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") >> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") >> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") >> >> # counts reads overlapping the genes. >> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) >> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) >> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) >> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) >> >> # Create a count table >> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) >> >> rownames(countTable)=names(tr_by_gene) >> >> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? >> >> Thank you. >> >> Matt >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Matt, On 04/08/2014 03:23 PM, Hervé Pagès wrote: > > > On 04/08/2014 02:33 PM, Harris A. Jaffee wrote: >> I would like to know the answer! This is all I have. >> >>> width(tr_by_gene) >> IntegerList of length 25529 >> [["1"]] 6694 >> [["10"]] 9969 >> [["100"]] 32214 >> [["1000"]] 226516 >> [["10000"]] 355050 343564 343866 355040 343554 343856 >> [["100008586"]] 187588 6118 6109 >> [["100008587"]] 156 156 156 156 156 156 156 156 >> [["100008588"]] 1869 1869 1869 1869 1869 1869 1869 >> [["100008589"]] 188093 5066 5077 5070 >> [["100009613"]] 2915 >> ... >> <25519 more elements> >> >> Then my guess would be to build the "DNAStringSetList" (see Biostrings) >> underlying your transcripts, say x, and do >> >> sapply(x, letterFrequency, letters="GC") >> >> Hopefully there's something better. > > Yes, that's the idea. You want to try avoiding the sapply() though, > which is generally slow: > > ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based > ## on but, sadly, the reference sequences are named differently). > ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's > ## easier to work with TranscriptDb objects obtained from UCSC. > library(BSgenome) > genome <- getBSgenome("GRCh38") > > ## The following hack would not be needed if hg38 and GRCh38 were > ## using the same chromosome names. We'll keep only chr1-22, chrX, > ## chrY, chrM, because it's too complicated to safely map and rename > ## the other sequences. > seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25] > seqlevels(genome)[1:25] <- seqlevels(tr_by_gene) > genome(tr_by_gene) <- "GRCh38" Don't know where you stand on this but FWIW I just added a new utility to the new GenomeInfoDb package that makes it relatively easy to rename the sequences in GRCh38 with the names used by UCSC: library(GenomeInfoDb) hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38") hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevels, hg38_chrominfo$NCBI_seqlevels) seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)] so the above hack is not needed anymore. Of course, things should be made even easier, e.g. maybe the user should be able to just do: genome(genome) <- "hg38" But I'll have to leave this for the next devel cycle. HTH, H. > > ## Extract the transcript sequences grouped by gene: > trseq_by_gene <- getSeq(genome, tr_by_gene) > > ## Compute the transcript GC counts grouped by gene: > af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE), > baseOnly=TRUE) > trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene) > > Then: > > > trGCcount_by_gene > IntegerList of length 24443 > [["1"]] 3856 > [["10"]] 3697 > [["100"]] 17142 > [["1000"]] 84322 > [["100008586"]] 3156 2539 2537 > [["100009613"]] 1578 > [["100009676"]] 1463 > [["10001"]] 6854 6854 6854 6854 > [["10002"]] 2652 4134 > [["10003"]] 20205 > ... > <24433 more elements> > > Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14. > I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package > that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but > presents them to the user with the UCSC names. > > Cheers, > H. > >> >> On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote: >> >>> Hello! >>> >>> I have some RNASeq data that I am analyzing with edgeR and I would >>> like to use the cqn package to correct for GC bias. I have aligned >>> the data to the UCSC hg38 genome. >>> >>>> From googling I have found the the bedtools 'nuc' command will give >>>> me the GC content with ranges and the length. Providing that I have >>>> a bed file of the hg38 genome. >>> >>> What I need to make sure of is that I am calculating the length and >>> the GC content for the correct intervals. I bin my reads using >>> GenomicFeatures like this (thank you Homer) >>> >>> # Load Libraries >>> library(GenomicFeatures) >>> library(GenomicRanges) >>> >>> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') >>> >>> tr_by_gene <- transcriptsBy(txdb,'gene') >>> >>> library(Rsamtools) >>> >>> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") >>> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") >>> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") >>> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") >>> >>> # counts reads overlapping the genes. >>> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) >>> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) >>> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) >>> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) >>> >>> # Create a count table >>> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, >>> AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) >>> >>> rownames(countTable)=names(tr_by_gene) >>> >>> Is there a way to get lengths and gc content from 'tr_by-gene'? Is >>> there a way to create a bed file from 'tr_by_gene' so that I can then >>> use bedtools nuc? >>> >>> Thank you. >>> >>> Matt >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.1 years ago Hervé Pagès 16k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 6 days ago

United States

On 04/08/2014 01:03 PM, Thornton, Matthew wrote: > Hello! > > I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome. > >>From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome. > > What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer) > > # Load Libraries > library(GenomicFeatures) > library(GenomicRanges) > > txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene') > > tr_by_gene <- transcriptsBy(txdb,'gene') > > library(Rsamtools) > > r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam") > r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam") > r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam") > r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam") > > # counts reads overlapping the genes. > c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m) > c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p) > c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m) > c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p) > > # Create a count table > countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE) > For the preceding (I know you didn't ask!) you could represent all of your files files <- c(r_AE89_m="sorted_trimmed_AE89_minus.bam", r_AE89_p="sorted_trimmed_AE89_plus.1.bam", r_AE90_m="sorted_trimmed_AE90_minus.bam", r_AE90_p="sorted_trimmed_AE90_plus.bam") as a 'BamFileList' bam_files <- BamFileList(files) write a 'COUNTER' function that takes features, reads, and possibly other arguments COUNTER <- function(features, reads, ...) ## ignore some arguments countOverlaps(features, reads) And then perform the summary se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER) head(assay(se), 3) head(rowData(se)) Once this is working to your satisfaction, modify the recipe to count each file on a separate thread, limiting memory use by changing the 'yieldSize' (number of records read at any one time) on the BamFileList library(BiocParallel) # count one file per core bam_files <- BamFileList(files, yieldSize=1000000) se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER) Add a param=ScanBamParam() argument to control which reads and possible other information (e.g., mapping quality) that gets read in and hence is available for use in the COUNTER function. The help page ?summarizeOverlaps has some built-in functions that count in different, relevant, ways. Martin > rownames(countTable)=names(tr_by_gene) > > Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc? > > Thank you. > > Matt > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.1 years ago Martin Morgan 25k

Login before adding your answer.