Question

edgeR processAmplicons: barcode matches only 57% when they should be 100% (I inserted the barcodes artificially). Are there other assumptions?

Anne Deslattes Mays ▴ 10

Hi There,

I inserted all of the barcodes into the fastq files myself, so they should have matched 100%, but they did not. Based on a previous thread about the processAmplicons runtime when mismatches are allowed, I essentially turned mismatching off just so I could get a result back. I have paired-end reads (an R1 and an R2 file) that were run on a MiSeq.

-- Number of Barcodes : 24
-- Number of Hairpins : 213141
Processing reads in /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq.
-- Processing 10 million reads
Number of reads in file /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq: 272472

The input run parameters are:
-- Barcode: start position 1    end position 8 length 8
-- Barcode in reverse read: start position 1    end position 8 length 8
-- Hairpin: start position 1    end position 27         length 27
-- Hairpin sequences need to match at specified positions.
-- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 272472
There are 157149 reads (57.6753 percent) with barcode matches
There are 17 reads (0.0062 percent) with hairpin matches
There are 0 reads (0.0000 percent) with both barcode and hairpin matches

I don't understand why only 57% of the reads have barcode matches, or why no reads have both hairpin and barcode matches (I know the matches are there, because I created them).

Thoughts?

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 13.04

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] edgeR_3.10.5 limma_3.24.15

edgeR processamplicons shrna • 2.1k views

ADD COMMENT • link updated 10.1 years ago by thomas.leete • 0 • written 10.3 years ago by Anne Deslattes Mays ▴ 10

You should use more general tags, e.g., edgeR and processAmplicons separately. Otherwise, there's no guarantee that the package developers will see this.

ADD REPLY • link 10.3 years ago Aaron Lun ★ 29k

score 2 · Answer 1 · 2015-10-13

Hi Anne,

The processAmplicons function assumes that the sequences in your fastq file have a fixed structure: the sample indexes and shRNAs need to sit at fairly consistent locations, with the exact positions specified by the user (some wobble is allowed via the allowShifting option, but not too much).
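
To make the fixed-layout assumption concrete, here is a minimal Python sketch of position-based matching. It is not edgeR's implementation; the function names, toy sequences, and coordinates are all hypothetical. A read counts as a match only if the bases at the stated 1-based start and end positions equal a known sequence exactly.

```python
def match_at(read, start, end, known):
    """Return the known sequence found at 1-based positions start..end, or None."""
    segment = read[start - 1:end]  # 1-based inclusive range -> Python slice
    return segment if segment in known else None

# Hypothetical toy data: 5-base barcodes and 6-base hairpins.
barcodes = {"ACGTA", "TTGCA"}
hairpins = {"GGGCCC", "AATTAA"}

reads = [
    "ACGTA" + "NN" + "GGGCCC",    # barcode at 1-5, hairpin at 8-13
    "TTGCA" + "NNN" + "AATTAA",   # hairpin shifted by one base, so it is missed
    "CCCCC" + "NN" + "GGGCCC",    # unknown barcode, known hairpin
]

def summarise(reads, bc_pos, hp_pos):
    """Count barcode, hairpin, and joint matches at fixed positions."""
    counts = {"barcode": 0, "hairpin": 0, "both": 0}
    for read in reads:
        bc = match_at(read, *bc_pos, barcodes)
        hp = match_at(read, *hp_pos, hairpins)
        counts["barcode"] += bc is not None
        counts["hairpin"] += hp is not None
        counts["both"] += bc is not None and hp is not None
    return counts

result = summarise(reads, bc_pos=(1, 5), hp_pos=(8, 13))
print(result)
```

Under this kind of exact positional matching, the barcode and hairpin windows need to describe different parts of the read. In the run posted in the question, both windows start at position 1 (barcode 1 to 8, hairpin 1 to 27), so, if both are looked up in the same read, a read could only satisfy both when the hairpin sequence itself begins with the barcode.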

All of the shRNA-seq and CRISPR-Cas9 genetic screens that I've dealt with follow this fixed layout. It sounds like your set-up is quite different (according to our offline correspondence, your shRNAs can be at any position within a read), so you'll need to write a custom counting method.
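
As a rough idea of what such a custom counting method might look like, the Python sketch below takes the sample from the barcode at the start of the read and then searches the rest of the read for any known hairpin. It assumes fixed-length hairpins and exact matches only; the function name and toy data are hypothetical, and none of this is part of edgeR.

```python
from collections import Counter

def count_hairpins_anywhere(reads, barcodes, hairpins, barcode_len, hairpin_len):
    """Count (barcode, hairpin) pairs, with the hairpin allowed at any offset."""
    counts = Counter()
    for read in reads:
        barcode = read[:barcode_len]
        if barcode not in barcodes:
            continue  # no sample assignment, skip the read
        body = read[barcode_len:]
        # Slide a hairpin-sized window along the rest of the read and
        # look each window up in the hairpin set (exact matches only).
        for offset in range(len(body) - hairpin_len + 1):
            window = body[offset:offset + hairpin_len]
            if window in hairpins:
                counts[(barcode, window)] += 1
                break  # count at most one hairpin per read
    return counts

# Hypothetical toy data.
barcodes = {"ACGT", "TGCA"}
hairpins = {"GGGAAA", "CCCTTT"}
reads = [
    "ACGT" + "GGGAAA" + "TTTT",     # hairpin right after the barcode
    "ACGT" + "TTT" + "GGGAAA",      # same hairpin, shifted by 3 bases
    "TGCA" + "T" + "CCCTTT" + "A",  # other sample, other hairpin
    "AAAA" + "GGGAAA",              # unknown barcode, ignored
    "TGCA" + "TTTTTTTTTT",          # known barcode, no hairpin
]
counts = count_hairpins_anywhere(reads, barcodes, hairpins, 4, 6)
for (barcode, hairpin), n in sorted(counts.items()):
    print(barcode, hairpin, n)
```

Keeping the hairpins in a set means each window is a constant-time lookup, which matters when the library holds hundreds of thousands of hairpins. Mismatch tolerance would need a different approach.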

With regard to your comment about not getting 100% matching when you manually add a specific sample index sequence, below is my test (based on a fastq file with 6 sequences) where I have tried to replicate this. I added a fixed sequence at the beginning of each read and recover matching statistics as you would expect (i.e. 100%).
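
For anyone repeating this kind of test, note that a FASTQ record stores one quality character per base, so a sequence prepended to a read needs a matching prefix on the quality line. The Python sketch below shows one way to do that. The function name and the choice of "I" as a placeholder quality are assumptions for illustration, not a description of how the test file here was built.

```python
def prepend_barcode(fastq_lines, barcode, qual_char="I"):
    """Prepend `barcode` to every read in a list of FASTQ lines.

    Assumes the simple 4-line-per-record layout:
    header, sequence, '+' separator, quality.
    """
    if len(fastq_lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(fastq_lines), 4):
        header, seq, plus, qual = fastq_lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        out.append(header)
        out.append(barcode + seq)
        out.append(plus)
        # Pad the quality string so it stays as long as the sequence.
        out.append(qual_char * len(barcode) + qual)
    return out

# Hypothetical single-record example.
record = ["@read1", "GATTACA", "+", "FFFFFFF"]
tagged = prepend_barcode(record, "ACGTA")
print("\n".join(tagged))
```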

Best wishes,

Matt

> library(edgeR)
Loading required package: limma
> x25 = processAmplicons("testadd5bp.fastq", barcodefile = "SamplesTest.txt", hairpinfile = "HairpinsTest.txt", hairpinStart = 43, hairpinEnd = 61, verbose = TRUE)
 -- Number of Barcodes : 3
 -- Number of Hairpins : 5
Processing reads in testadd5bp.fastq.
 -- Processing 10 million reads
Number of reads in file testadd5bp.fastq : 6

The input run parameters are:
 -- Barcode: start position 1    end position 5  length 5
 -- Hairpin: start position 43   end position 61         length 19
 -- Hairpin sequences need to match at specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 6
There are 6 reads (100.0000 percent) with barcode matches
There are 6 reads (100.0000 percent) with hairpin matches
There are 6 reads (100.0000 percent) with both barcode and hairpin matches

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.11.5  limma_3.24.15

score 0 · Answer 2 · 2015-12-04

thomas.leete • 0

I just noticed that when I lazily used "head" to stitch 0.1% of my libraries together to test a script, something completely confused processAmplicons and I got zero index+shRNA matches. It turned out that I had forgotten to suppress the headers that "head" prints between files, and this broke the process.

ADD COMMENT • link 10.1 years ago thomas.leete • 0

Sounds like a bug in your code rather than ours! This function assumes that the sequence reads are in fastq format; if the format is broken, you may get strange results.

ADD REPLY • link 10.1 years ago Matthew Ritchie ▴ 1000
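
A quick structural check before running processAmplicons can catch this kind of problem. The Python sketch below is an illustration only: it assumes the common 4-line-per-record FASTQ layout, the function name is hypothetical, and it is not part of edgeR. It reports the first record whose header, separator, or quality length looks wrong, which is what stray `==> file <==` lines from "head" would trigger.

```python
def first_fastq_problem(lines):
    """Return (line_number, message) for the first malformed record, else None.

    Assumes the common 4-line-per-record FASTQ layout.
    """
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if len(record) < 4:
            return (i + 1, "truncated record")
        header, seq, plus, qual = record
        if not header.startswith("@"):
            return (i + 1, "header does not start with '@'")
        if not plus.startswith("+"):
            return (i + 3, "separator does not start with '+'")
        if len(seq) != len(qual):
            return (i + 4, "sequence and quality lengths differ")
    return None

good = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"]
# Simulated output of `head` on two files without header suppression.
stitched = ["==> a.fastq <==", "@r1", "ACGT", "+", "IIII",
            "", "==> b.fastq <==", "@r2", "TTGA", "+", "IIII"]

print(first_fastq_problem(good))
print(first_fastq_problem(stitched))
```

With GNU head, the `-q` option suppresses the per-file headers when several files are given, which avoids the problem at the source.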