Question

fastqCleaner adapter_filter() not trimming 5' primer

hannalberman • 0

I am trying to use the adapter_filter() function in FastqCleaner to remove primer sequences, but I can still find the primer sequences in the output after running adapter_filter(). I am using sequences from this ENA entry (Project: PRJNA377530). The example below uses only the forward and reverse reads from the first sample, in files SRR5314314_1.fastq.gz and SRR5314314_2.fastq.gz.

library(ShortRead)     # readFastq(), sread()
library(FastqCleaner)  # adapter_filter()

fwdRead <- readFastq("~/SRR5314314_1.fastq.gz")
revRead <- readFastq("~/SRR5314314_2.fastq.gz")

# Primers with IUPAC ambiguity codes (R = A/G, Y = C/T)
FWD <- "ACCTGCGGARGGATCA"
REV <- "GAGATCCRTTGYTRAAAGTT"

fwdFilt <- adapter_filter(fwdRead, Lpattern = FWD, anchored = TRUE, fixed = FALSE)
revFilt <- adapter_filter(revRead, Lpattern = REV, anchored = TRUE, fixed = FALSE)

There is no error message, but the primer sequences are not removed from the reads.

ADD COMMENT • link 3.8 years ago hannalberman • 0

Can you share some of the reads that you expect to be filtered (but that are not filtered)? Also, are you aware of the other parameters that may be of importance, such as:

rc.L, Reverse complement Lpattern? default FALSE
rc.R, Reverse complement Rpattern? default FALSE
first, trim first right ('R') or left ('L') side of sequences when both Lpattern and Rpattern are passed
error_rate, Error rate (value in the range [0, 1]). The error rate is the proportion of mismatches allowed between the adapter and the aligned portion of the subject. For a given adapter A, the number of allowed mismatches between each subsequence s of A and the subject is computed as: error_rate * L_s, where L_s is the length of the subsequence s
anchored, Adapter or partial adapter within sequence (anchored = FALSE, default) or only in 3' and 5' terminals? (anchored = TRUE)
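
To make the error_rate rule concrete, here is a small worked example in base R for the forward primer from the question. It only evaluates error_rate * L_s for each subsequence length. The value 0.2 is just an illustrative choice, and rounding the product down to a whole number is an assumption on my part, not something the description above states.

```r
# Worked example of the rule "allowed mismatches = error_rate * L_s".
adapter    <- "ACCTGCGGARGGATCA"  # forward primer from the question (16 bases)
error_rate <- 0.2                 # illustrative value only

# L_s: every possible subsequence length, from 1 base up to the full adapter
L_s <- seq_len(nchar(adapter))

# ASSUMPTION: the product is rounded down to a whole number of mismatches
allowed <- data.frame(L_s = L_s, max_mismatch = floor(error_rate * L_s))
print(allowed)
```

Under these assumptions, short partial matches tolerate no mismatches at all, and the full-length primer tolerates a few. With error_rate = 0 every row is 0, which corresponds to exact matching.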

ADD REPLY • link 3.8 years ago Kevin Blighe ★ 4.0k

The function appears to be working on the 3' primers (Rpattern), so I've only posted the Lpattern case here. The primer sequences are in the forward orientation, so I do not need rc.L = TRUE (I don't need to match the reverse complement). There is no need to set "first" if I'm only searching for one primer pattern: the function tests the length of each pattern, and if the length of Rpattern is 0 it just runs the Lpattern. I tried those parameters anyway, i.e. adapter_filter(fwdRead, Lpattern = FWD, anchored = TRUE, fixed = FALSE, first = "L"), and also first = "R", for the first read. The 5' primers should be at the terminals, but I have tried both anchored = TRUE and anchored = FALSE, and neither has worked for me. I also tried increasing the error rate, but that should not be an issue anyway, since vmatchPattern() with the maximum number of mismatches set to 0 shows how many times I should be able to find the primers in each sample.

I normally would not try to do this in R but I'm going through the dada2 ITS tutorial with a class and trying to avoid compatibility issues for people with Windows.

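It seems the function is trimming a number of bases equal to the primer length from the right instead of the left. In the example below, the read goes from 187 to 171 bases (16 bases, the length of the forward primer), yet the primer is still present at the 5' end. Example of a sequence before running the function: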

sread(fwdRead)[1]

DNAStringSet object of length 6:
    width seq
[1]   187 ACCTGCGGAGGGATCATTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCTTATGTTGCC...CTGTTTTTAGTTGAACTTCTGAGTATAAAAAACAAATAAATCAAAACTTTCAACAATGGATCTC

Example after:

sread(fwdFilt)[1]

DNAStringSet object of length 1:
    width seq
[1]   171 ACCTGCGGAGGGATCATTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCTTATGTTGCC...GCAGGAACCCTAAACTCTGTTTTTAGTTGAACTTCTGAGTATAAAAAACAAATAAATCAAAACT
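
For comparison, here is a minimal base R sketch of what an anchored 5' trim of this primer would be expected to do. It is not how adapter_filter() works internally: it swaps in a plain regular expression, expanding the IUPAC codes into character classes and allowing no mismatches. The helper names primer_to_regex() and trim_5prime() are made up for illustration, and the test read is just the first 36 bases of the read shown above.

```r
# IUPAC nucleotide codes expanded to regular-expression character classes
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: turn a primer with ambiguity codes into a regex
primer_to_regex <- function(primer) {
  bases <- strsplit(toupper(primer), "")[[1]]
  stopifnot(all(bases %in% names(iupac)))
  paste0(iupac[bases], collapse = "")
}

# Hypothetical helper: remove the primer only when it sits at the very
# start of the read (anchored at the 5' end, exact match)
trim_5prime <- function(reads, primer) {
  sub(paste0("^", primer_to_regex(primer)), "", reads)
}

FWD  <- "ACCTGCGGARGGATCA"
read <- "ACCTGCGGAGGGATCATTACCGAGTTTACAACTCCC"  # start of the read shown above

trimmed <- trim_5prime(read, FWD)
print(trimmed)
print(nchar(read) - nchar(trimmed))  # bases removed from the 5' end
```

On real data the same helper could be applied to as.character(sread(fwdRead)). That only gives plain strings to compare against the adapter_filter() output: qualities are not carried along, so it is a diagnostic and not a replacement for the filter.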

ADD REPLY • link 3.8 years ago hannalberman • 0