Hi everyone,
I'm trying to obtain a GC content track for the M. tuberculosis genome to plot on IGV.
Below there is the code I have used to do this. However, when calculating the GC content, R blocks without giving an output.
How can I solve the problem?
library(Biostrings)
library(Gviz)
#load FASTA file
genome_seq<-readDNAStringSet("M.tuberculosis_NC000962.3.fasta")
# Calculate genome length
genome_length <- sum(width(genome_seq))
# GC content for a certain window size
window_size <- 10000
gc_content <- numeric(genome_length)
for (i in 1:(genome_length - window_size + 1)) {
window_seq <- subseq(genome_seq, start = i, end = i + window_size - 1)
gc_content[i] <- sum(unlist(strsplit(as.character(window_seq), "")) %in% c("G", "C")) / window_size * 100
}
This has nothing to do with Gviz. You shouldn't pick an unrelated package for questions because you are A.) directly asking that maintainer to answer an unrelated question and B.) not asking the relevant package's maintainer for help. But anyway.
> library(Biostrings)
## I downloaded the relevant FASTQ from NCBI
> z <- readDNAStringSet("c:/Users/jmacdon/Downloads/sequence.fasta")
## get windows
> starts <- seq(1, width(z), 1000)
> ends <- starts + 1000
## last one is too long, so fix
> ends[length(ends)] <- width(z)
## get the alphabet frequency for each 1000 bp window and rbind into a matrix
> zz <- do.call(rbind, lapply(seq(along = starts), function(x) alphabetFrequency(subseq(z, starts[x], ends[x]))))
## obvs
> zzz <- rowSums(zz[,2:3])/rowSums(zz)
> head(zzz)
[1] 0.5884116 0.5764236 0.6073926 0.5844156 0.6473526 0.6213786