Human reference primary assembly and masking using BioStrings and BSgenome
Entering edit mode
rockyb • 0
Last seen 4.2 years ago


I'm new to using BioStrings and the BSgenome packages in R, and am having some trouble with my analysis.

What I'm trying to do is to take the latest human genome reference release from the GRC and transform it into something useable for GATK analysis. Essentially, I want to read the FASTA file, remove the alt contigs and patches and mask the PARs in the Y chromosome. 

I wanted to ask a couple of questions:

1. I want to remove the alternate contigs as well as the patches on the latest GRC reference (GRCh38.p11) to create a primary reference assembly. I've laid out what I've done but it is quite long-winded - is there a quicker way of doing this? What I've done is read the file as a DNAStringSet, select the sequences to keep, reorder the data and write sequences out.

In addition, is there a way to read the sequence description separate from the sequence ID? I've had to manually separate the strings and then piece them back together so that it makes sense. i.e.

x <- readDNAStringSet("GCA_000001405.26_GRCh38.p11_genomic.fa")
y <- data.frame(seqID=sapply(strsplit(names(x)," "), `[`, 1), 
                chromInfo = gsub("^\\S+ ", "", sapply(strsplit(names(x),","), `[`, 1), perl=T),
                assemblyInfo = sapply(strsplit(names(x),","), `[`, 2),
                seqHeader = names(x),
                seqLen = x@ranges@width)

then decide which to keep and concatenate the columns together to make it output like the GATK format. It works, but was wondering whether there was a more efficient method.

2. I want to hard mask the PARs in the Y chromosome (i.e. change the sequence to N's). I've tried using the 'ExtractAt' and 'ReplaceAt' functions, but they haven't worked well for me (e.g. I keep getting errors like:

> extractAt(fgenome, at)
Error in .normarg_at2(at, x) : 
  some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'
In addition: Warning messages:
1: In .normarg_at2(at, x) : 'at' names were ignored
2: In .V_recycle(at, x, "at", "'length(x)'") :
  'length(x)' is not a multiple of 'length(at)'

I've also tried creating a data frame of the Y chromosome, manually substituted the ACTGs into Ns and then created a DNAStringSet from there. But when I try to replace the Y Chromosome DNAStringSet in the original data set with this StringSet, I am unable to do it...

at <- data.frame(chrom="Y GRCh38.p11:CM000686.2:1:57227415", start=10001, end = 2781479)
temp <- getSeq(fgenome, as(at, "GRanges"))
at <- gsub("A", "N", temp)
at <- gsub("C", "N", at)
at <- gsub("T", "N", at)
at <- gsub("G", "N", at)
YPAR1mask <- DNAStringSet(at)

replaceAt(fgenome, grep ("Y GRCh38.p11:CM000686.2:1:57227415", fgenome), YPAR1mask)

What I want to mask is the following:

Name           Chr     Start   Stop

PAR#1 Y 10,001 2,781,479
PAR#2 Y 56,887,903 57,217,415

Any help would be much appreciated.


biostrings R bsgenome • 658 views
Entering edit mode
Mike Smith ★ 5.1k
Last seen 1 day ago
EMBL Heidelberg / de.NBI

For the second part, if you already have a MaskedDNAString does the function injectHardMask() do what you're looking for?

Entering edit mode
rockyb • 0
Last seen 4.2 years ago


injectHardMask doesn't work as it isn't a maskedDNAString - it's just a DNAString set.

Nonetheless, created a solution to Q2 in the form of a function - see below.

maskGenome <- function(chrname,start,end) {
  ##start is 1based and start and end are inclusive i.e. [1,5]
  maskAt <- IRanges(start, width = end-start+1)
  maskString <- paste(replicate((end-start+1), "N"), collapse = "")
  tempgenome[chrname] <- replaceAt(tempgenome[chrname], maskAt, value=maskString)

Thanks for the help!


Login before adding your answer.

Traffic: 209 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6