Question: edgeR:processAmplicons barcode match only 57% when it should be 100% -- I artificially inserted them -- are there other assumptions
0
3.0 years ago by
United States
Anne Deslattes Mays10 wrote:

Hi There,

I inserted all my barcodes into the fastq files myself, so they should have matched 100%, but they did not. Based on a previous thread about the processAmplicons runtime when mismatches are allowed, I turned mismatching off just so I could get a result back. I have two paired-end read files (R1 and R2) from a MiSeq run.

-- Number of Barcodes : 24
-- Number of Hairpins : 213141
Processing reads in /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq.
Number of reads in file /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq: 272472

The input run parameters are:
-- Barcode: start position 1    end position 8  length 8
-- Barcode in reverse read: start position 1    end position 8  length 8
-- Hairpin: start position 1    end position 27         length 27
-- Hairpin sequences need to match at specified positions.
-- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 272472
There are 157149 reads (57.6753 percent) with barcode matches
There are 17 reads (0.0062 percent) with hairpin matches
There are 0 reads (0.0000 percent) with both barcode and hairpin matches

I don't understand why only 57% of the reads have barcode matches, and I don't understand why no reads have both hairpin and barcode matches (I know they should, because I created them).

Thoughts?

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 13.04

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.10.5  limma_3.24.15

modified 2.9 years ago by thomas.leete0 • written 3.0 years ago by Anne Deslattes Mays10
2

You should use more general tags, e.g., edgeR and processAmplicons separately. Otherwise, there's no guarantee that the package developers will see this.

2
3.0 years ago by
Australia
Matthew Ritchie730 wrote:

Hi Anne,

The processAmplicons function assumes that the sequences in your fastq file have a fixed structure - the locations of the sample indexes and shRNAs need to be fairly consistent, with the exact position specified by the user (some wobble is allowed via the allowShifting option, but not too much).
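To make the fixed-position assumption concrete, here is a rough base-R sketch that counts reads whose first 8 bases exactly match a known barcode and tabulates what was found instead. The helper name `count_barcode_matches`, the toy reads and the barcodes are all made up for illustration; this is not how processAmplicons is implemented, only an independent sanity check one could run on a small FASTQ.

```r
# Hypothetical helper (not part of edgeR): count reads whose bases
# start..end exactly match one of the expected barcodes.
count_barcode_matches <- function(fastq_lines, barcodes, start = 1, end = 8) {
  # Assumes a 4-line-per-record FASTQ: the sequence is line 2 of each record.
  seqs <- fastq_lines[seq(2, length(fastq_lines), by = 4)]
  found <- substr(seqs, start, end)
  matched <- found %in% barcodes
  list(n_reads   = length(seqs),
       n_matched = sum(matched),
       unmatched = table(found[!matched]))  # what sits there instead
}

# Toy FASTQ with three records; in the third, the barcode is shifted by one base.
toy <- c("@r1", "ACGTACGTTTTT", "+", "IIIIIIIIIIII",
         "@r2", "TTGGCCAAGGGG", "+", "IIIIIIIIIIII",
         "@r3", "NACGTACGTTTT", "+", "IIIIIIIIIIII")
res <- count_barcode_matches(toy, c("ACGTACGT", "TTGGCCAA"))
print(res$n_matched / res$n_reads)
print(res$unmatched)
```

Looking at the `unmatched` table for real data is usually the quickest way to see whether the barcodes are offset, reverse-complemented, or simply absent in a subset of reads.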

All of the shRNA-seq and CRISPR-Cas9 genetic screens that I've dealt with follow this fixed layout. It sounds like your set-up is quite different (your shRNAs can be in any position within a read according to our offline correspondence) so you'll need to write a custom counting method.
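As a starting point for such a custom counting method, a minimal base-R sketch is given below. It assumes the sample barcode still sits at a fixed position while the hairpin may occur anywhere in the read. The function name `count_anywhere` and the toy sequences are hypothetical, only exact matches are counted, and looping over hairpins like this would be slow for a library of hundreds of thousands of hairpins.

```r
# Hypothetical sketch (not an edgeR function): hairpin-by-barcode counts when
# the barcode is at a fixed position but the hairpin can be anywhere in the read.
count_anywhere <- function(seqs, barcodes, hairpins, bc_start = 1, bc_end = 8) {
  counts <- matrix(0L, nrow = length(hairpins), ncol = length(barcodes),
                   dimnames = list(hairpins, barcodes))
  # Index of the matching barcode for each read, NA if none matches exactly.
  bc <- match(substr(seqs, bc_start, bc_end), barcodes)
  for (h in seq_along(hairpins)) {
    # fixed = TRUE: literal substring search, no regular expressions.
    hit <- grepl(hairpins[h], seqs, fixed = TRUE) & !is.na(bc)
    counts[h, ] <- tabulate(bc[hit], nbins = length(barcodes))
  }
  counts
}

# Toy data: the fourth read has no valid barcode and is therefore not counted.
seqs <- c("ACGTACGTGGGAAACCCTTT",
          "ACGTACGTTTAAACCCGGGG",
          "TTGGCCAACCCGGGTTTAAA",
          "GGGGGGGGAAACCC")
barcodes <- c("ACGTACGT", "TTGGCCAA")
hairpins <- c("AAACCC", "GGGTTT")
counts <- count_anywhere(seqs, barcodes, hairpins)
print(counts)
```

The resulting matrix has hairpins in rows and samples in columns, so it could be passed to `DGEList(counts = counts)` for downstream analysis in edgeR.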

With regard to your comment about not getting 100% matching when you manually add a specific sample index sequence, below is my test (based on a fastq file with 6 sequences) where I have tried to replicate this. I added a fixed sequence at the beginning of each read and recover matching statistics as you would expect (i.e. 100%).

Best wishes,

Matt

> library(edgeR)
Loading required package: limma

> x25 = processAmplicons("testadd5bp.fastq", barcodefile = "SamplesTest.txt", hairpinfile = "HairpinsTest.txt", hairpinStart = 43, hairpinEnd = 61, verbose = TRUE)
 -- Number of Barcodes : 3
 -- Number of Hairpins : 5
Processing reads in testadd5bp.fastq.
 -- Processing 10 million reads
Number of reads in file testadd5bp.fastq : 6

The input run parameters are:
 -- Barcode: start position 1    end position 5  length 5
 -- Hairpin: start position 43   end position 61         length 19
 -- Hairpin sequences need to match at specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 6
There are 6 reads (100.0000 percent) with barcode matches
There are 6 reads (100.0000 percent) with hairpin matches
There are 6 reads (100.0000 percent) with both barcode and hairpin matches

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.11.5  limma_3.24.15
0
2.9 years ago by
United States
thomas.leete0 wrote:

I just noticed that when I lazily used "head" to stitch 0.1% of my libraries together to test a script, something completely confused processAmplicons and I got zero index+shRNA matches. It turned out that I had forgotten to suppress the headers that head prints between files, and this broke the process.
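For anyone hitting the same thing: when given several files, head prints a "==> file <==" banner before each one unless run with -q, and those extra lines shift the 4-line FASTQ records. Below is a rough base-R sketch for building a small test file without banners and for checking the record layout afterwards. Both helper names (`stitch_fastq_heads`, `fastq_layout_ok`) and the toy files are made up for illustration.

```r
# Hypothetical helper: take the first n_records of each FASTQ and write them
# back to back, with nothing in between.
stitch_fastq_heads <- function(files, out, n_records = 1000) {
  chunks <- lapply(files, function(f) readLines(f, n = 4 * n_records))
  writeLines(unlist(chunks), out)
  invisible(out)
}

# Hypothetical check: the line count is a multiple of 4, every 1st line of a
# record starts with "@" and every 3rd line starts with "+".
fastq_layout_ok <- function(path) {
  x <- readLines(path)
  if (length(x) == 0 || length(x) %% 4 != 0) return(FALSE)
  idx <- seq(1, length(x), by = 4)
  all(substr(x[idx], 1, 1) == "@") && all(substr(x[idx + 2], 1, 1) == "+")
}

# Toy files with two records each.
rec <- function(id) c(paste0("@", id), "ACGTACGTAAAA", "+", "IIIIIIIIIIII")
f1 <- tempfile(fileext = ".fastq"); writeLines(c(rec("a1"), rec("a2")), f1)
f2 <- tempfile(fileext = ".fastq"); writeLines(c(rec("b1"), rec("b2")), f2)

good <- stitch_fastq_heads(c(f1, f2), tempfile(fileext = ".fastq"), n_records = 1)
print(fastq_layout_ok(good))

# Simulate what head leaves behind when its banners are not suppressed.
bad <- tempfile(fileext = ".fastq")
writeLines(c("==> a.fastq <==", rec("a1"), "", "==> b.fastq <==", rec("b1")), bad)
print(fastq_layout_ok(bad))
```

The layout check is deliberately crude (quality lines can also start with "@"), but it is enough to catch stray banner lines before running processAmplicons.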