Human reference primary assembly and masking using BioStrings and BSgenome
2
0
Entering edit mode
rockyb • 0
@rockyb-13333
Last seen 6.9 years ago

Hi,

I'm new to using BioStrings and the BSgenome packages in R, and am having some trouble with my analysis.

What I'm trying to do is to take the latest human genome reference release from the GRC and transform it into something useable for GATK analysis. Essentially, I want to read the FASTA file, remove the alt contigs and patches and mask the PARs in the Y chromosome. 

I wanted to ask a couple of questions:

1. I want to remove the alternate contigs as well as the patches on the latest GRC reference (GRCh38.p11) to create a primary reference assembly. I've laid out what I've done but it is quite long-winded - is there a quicker way of doing this? What I've done is read the file as a DNAStringSet, select the sequences to keep, reorder the data and write sequences out.

In addition, is there a way to read the sequence description separate from the sequence ID? I've had to manually separate the strings and then piece them back together so that it makes sense. i.e.

x <- readDNAStringSet("GCA_000001405.26_GRCh38.p11_genomic.fa")
y <- data.frame(seqID=sapply(strsplit(names(x)," "), `[`, 1), 
                chromInfo = gsub("^\\S+ ", "", sapply(strsplit(names(x),","), `[`, 1), perl=T),
                assemblyInfo = sapply(strsplit(names(x),","), `[`, 2),
                seqHeader = names(x),
                seqLen = x@ranges@width)

then decide which to keep and concatenate the columns together to make it output like the GATK format. It works, but was wondering whether there was a more efficient method.

2. I want to hard mask the PARs in the Y chromosome (i.e. change the sequence to N's). I've tried using the 'ExtractAt' and 'ReplaceAt' functions, but they haven't worked well for me (e.g. I keep getting errors like:

> extractAt(fgenome, at)
Error in .normarg_at2(at, x) : 
  some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'
In addition: Warning messages:
1: In .normarg_at2(at, x) : 'at' names were ignored
2: In .V_recycle(at, x, "at", "'length(x)'") :
  'length(x)' is not a multiple of 'length(at)'

I've also tried creating a data frame of the Y chromosome, manually substituted the ACTGs into Ns and then created a DNAStringSet from there. But when I try to replace the Y Chromosome DNAStringSet in the original data set with this StringSet, I am unable to do it...

at <- data.frame(chrom="Y GRCh38.p11:CM000686.2:1:57227415", start=10001, end = 2781479)
temp <- getSeq(fgenome, as(at, "GRanges"))
at <- gsub("A", "N", temp)
at <- gsub("C", "N", at)
at <- gsub("T", "N", at)
at <- gsub("G", "N", at)
YPAR1mask <- DNAStringSet(at)

replaceAt(fgenome, grep ("Y GRCh38.p11:CM000686.2:1:57227415", fgenome), YPAR1mask)

What I want to mask is the following:

Name           Chr     Start   Stop

PAR#1 Y 10,001 2,781,479
PAR#2 Y 56,887,903 57,217,415

Any help would be much appreciated.

Thanks!

biostrings R bsgenome • 1.2k views
ADD COMMENT
0
Entering edit mode
Mike Smith ★ 6.5k
@mike-smith
Last seen 5 hours ago
EMBL Heidelberg

For the second part, if you already have a MaskedDNAString does the function injectHardMask() do what you're looking for?

ADD COMMENT
0
Entering edit mode
rockyb • 0
@rockyb-13333
Last seen 6.9 years ago

Hi,

injectHardMask doesn't work as it isn't a maskedDNAString - it's just a DNAString set.

Nonetheless, created a solution to Q2 in the form of a function - see below.

maskGenome <- function(chrname,start,end) {
  ##start is 1based and start and end are inclusive i.e. [1,5]
  maskAt <- IRanges(start, width = end-start+1)
  maskString <- paste(replicate((end-start+1), "N"), collapse = "")
  tempgenome[chrname] <- replaceAt(tempgenome[chrname], maskAt, value=maskString)
}

Thanks for the help!

ADD COMMENT

Login before adding your answer.

Traffic: 708 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6