Question

Human reference primary assembly and masking using BioStrings and BSgenome

0

Entering edit mode

rockyb • 0

@rockyb-13333

Last seen 6.9 years ago

Hi,

I'm new to using BioStrings and the BSgenome packages in R, and am having some trouble with my analysis.

What I'm trying to do is to take the latest human genome reference release from the GRC and transform it into something useable for GATK analysis. Essentially, I want to read the FASTA file, remove the alt contigs and patches and mask the PARs in the Y chromosome.

I wanted to ask a couple of questions:

1. I want to remove the alternate contigs as well as the patches on the latest GRC reference (GRCh38.p11) to create a primary reference assembly. I've laid out what I've done but it is quite long-winded - is there a quicker way of doing this? What I've done is read the file as a DNAStringSet, select the sequences to keep, reorder the data and write sequences out.

In addition, is there a way to read the sequence description separate from the sequence ID? I've had to manually separate the strings and then piece them back together so that it makes sense. i.e.

x <- readDNAStringSet("GCA_000001405.26_GRCh38.p11_genomic.fa")
y <- data.frame(seqID=sapply(strsplit(names(x)," "), `[`, 1), 
                chromInfo = gsub("^\\S+ ", "", sapply(strsplit(names(x),","), `[`, 1), perl=T),
                assemblyInfo = sapply(strsplit(names(x),","), `[`, 2),
                seqHeader = names(x),
                seqLen = x@ranges@width)

then decide which to keep and concatenate the columns together to make it output like the GATK format. It works, but was wondering whether there was a more efficient method.

2. I want to hard mask the PARs in the Y chromosome (i.e. change the sequence to N's). I've tried using the 'ExtractAt' and 'ReplaceAt' functions, but they haven't worked well for me (e.g. I keep getting errors like:

> extractAt(fgenome, at)
Error in .normarg_at2(at, x) : 
  some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'
In addition: Warning messages:
1: In .normarg_at2(at, x) : 'at' names were ignored
2: In .V_recycle(at, x, "at", "'length(x)'") :
  'length(x)' is not a multiple of 'length(at)'

I've also tried creating a data frame of the Y chromosome, manually substituted the ACTGs into Ns and then created a DNAStringSet from there. But when I try to replace the Y Chromosome DNAStringSet in the original data set with this StringSet, I am unable to do it...

at <- data.frame(chrom="Y GRCh38.p11:CM000686.2:1:57227415", start=10001, end = 2781479)
temp <- getSeq(fgenome, as(at, "GRanges"))
at <- gsub("A", "N", temp)
at <- gsub("C", "N", at)
at <- gsub("T", "N", at)
at <- gsub("G", "N", at)
YPAR1mask <- DNAStringSet(at)

replaceAt(fgenome, grep ("Y GRCh38.p11:CM000686.2:1:57227415", fgenome), YPAR1mask)

What I want to mask is the following:

Name Chr Start Stop

PAR#1	Y	10,001	2,781,479
PAR#2	Y	56,887,903	57,217,415

Any help would be much appreciated.

Thanks!

biostrings R bsgenome • 1.2k views

ADD COMMENT • link 6.9 years ago • updated 6.8 years ago rockyb • 0

score 0 · Answer 1 · 2017-06-26

0

Entering edit mode

Mike Smith ★ 6.5k

@mike-smith

Last seen 5 hours ago

EMBL Heidelberg

For the second part, if you already have a MaskedDNAString does the function injectHardMask() do what you're looking for?

ADD COMMENT • link 6.9 years ago Mike Smith ★ 6.5k

score 0 · Answer 2 · 2017-06-26

Hi,

injectHardMask doesn't work as it isn't a maskedDNAString - it's just a DNAString set.

Nonetheless, created a solution to Q2 in the form of a function - see below.

maskGenome <- function(chrname,start,end) {
  ##start is 1based and start and end are inclusive i.e. [1,5]
  maskAt <- IRanges(start, width = end-start+1)
  maskString <- paste(replicate((end-start+1), "N"), collapse = "")
  tempgenome[chrname] <- replaceAt(tempgenome[chrname], maskAt, value=maskString)
}

Thanks for the help!