Search
Question: Human reference primary assembly and masking using BioStrings and BSgenome
0
12 months ago by
rockyb0
rockyb0 wrote:

Hi,

I'm new to using BioStrings and the BSgenome packages in R, and am having some trouble with my analysis.

What I'm trying to do is to take the latest human genome reference release from the GRC and transform it into something useable for GATK analysis. Essentially, I want to read the FASTA file, remove the alt contigs and patches and mask the PARs in the Y chromosome.

I wanted to ask a couple of questions:

1. I want to remove the alternate contigs as well as the patches on the latest GRC reference (GRCh38.p11) to create a primary reference assembly. I've laid out what I've done but it is quite long-winded - is there a quicker way of doing this? What I've done is read the file as a DNAStringSet, select the sequences to keep, reorder the data and write sequences out.

In addition, is there a way to read the sequence description separate from the sequence ID? I've had to manually separate the strings and then piece them back together so that it makes sense. i.e.

x <- readDNAStringSet("GCA_000001405.26_GRCh38.p11_genomic.fa")
y <- data.frame(seqID=sapply(strsplit(names(x)," "), [, 1),
chromInfo = gsub("^\\S+ ", "", sapply(strsplit(names(x),","), [, 1), perl=T),
assemblyInfo = sapply(strsplit(names(x),","), [, 2),
seqLen = x@ranges@width)

then decide which to keep and concatenate the columns together to make it output like the GATK format. It works, but was wondering whether there was a more efficient method.

2. I want to hard mask the PARs in the Y chromosome (i.e. change the sequence to N's). I've tried using the 'ExtractAt' and 'ReplaceAt' functions, but they haven't worked well for me (e.g. I keep getting errors like:

> extractAt(fgenome, at)
Error in .normarg_at2(at, x) :
some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'
1: In .normarg_at2(at, x) : 'at' names were ignored
2: In .V_recycle(at, x, "at", "'length(x)'") :
'length(x)' is not a multiple of 'length(at)'



I've also tried creating a data frame of the Y chromosome, manually substituted the ACTGs into Ns and then created a DNAStringSet from there. But when I try to replace the Y Chromosome DNAStringSet in the original data set with this StringSet, I am unable to do it...

at <- data.frame(chrom="Y GRCh38.p11:CM000686.2:1:57227415", start=10001, end = 2781479)
temp <- getSeq(fgenome, as(at, "GRanges"))
at <- gsub("A", "N", temp)
at <- gsub("C", "N", at)
at <- gsub("T", "N", at)
at <- gsub("G", "N", at)

replaceAt(fgenome, grep ("Y GRCh38.p11:CM000686.2:1:57227415", fgenome), YPAR1mask)

What I want to mask is the following:

Name           Chr     Start   Stop

 PAR#1 Y 10,001 2,781,479 PAR#2 Y 56,887,903 57,217,415

Any help would be much appreciated.

Thanks!

modified 12 months ago • written 12 months ago by rockyb0
0
12 months ago by
Mike Smith2.8k
EMBL Heidelberg / de.NBI
Mike Smith2.8k wrote:

For the second part, if you already have a MaskedDNAString does the function injectHardMask() do what you're looking for?

0
12 months ago by
rockyb0
rockyb0 wrote:

Hi,

injectHardMask doesn't work as it isn't a maskedDNAString - it's just a DNAString set.

Nonetheless, created a solution to Q2 in the form of a function - see below.

maskGenome <- function(chrname,start,end) {
##start is 1based and start and end are inclusive i.e. [1,5]
maskAt <- IRanges(start, width = end-start+1)
maskString <- paste(replicate((end-start+1), "N"), collapse = "")
}

Thanks for the help!