Question: edgeR:processAmplicons barcode match only 57% when it should be 100% -- I artificially inserted them -- are there other assumptions
2.1 years ago by
United States
Anne Deslattes Mays10 wrote:

Hi There,

I inserted all my barcodes into the fastq files myself, so they should have matched 100%, but they did not. Based on a previous thread about the processAmplicons runtime when mismatches are allowed, I turned mismatching off just so I could see a result. I have paired-end reads (two files, R1 and R2) that were run on a MiSeq.

 -- Number of Barcodes : 24
 -- Number of Hairpins : 213141
Processing reads in /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq.
 -- Processing 10 million reads
Number of reads in file /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R1_001.bc.fastq and /trimmed_fastq/dalal/fastq/Challenge18_S13_L001_R2_001.bc.fastq: 272472

The input run parameters are:
 -- Barcode: start position 1    end position 8  length 8
 -- Barcode in reverse read: start position 1    end position 8  length 8
 -- Hairpin: start position 1    end position 27         length 27
 -- Hairpin sequences need to match at specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 272472
There are 157149 reads (57.6753 percent) with barcode matches
There are 17 reads (0.0062 percent) with hairpin matches
There are 0 reads (0.0000 percent) with both barcode and hairpin matches

 

I don't understand why only 57% of the reads have barcode matches, or why no reads have both hairpin and barcode matches (I know they exist, because I created them).

 

Thoughts?

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 13.04

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.10.5  limma_3.24.15

 


You should use more general tags, e.g., edgeR and processAmplicons separately. Otherwise, there's no guarantee that the package developers will see this.

Comment written 2.1 years ago by Aaron Lun
2.1 years ago by
Australia
Matthew Ritchie660 wrote:

Hi Anne,

The processAmplicons function assumes that the sequences in your fastq file have a fixed structure - the locations of the sample indexes and shRNAs need to be fairly consistent, with the exact position specified by the user (some wobble is allowed via the allowShifting option, but not too much).
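One way to test the fixed-position assumption outside of edgeR is to measure how many reads carry an expected barcode at exactly the stated positions. The sketch below uses base R only. `barcode_match_rate` is a hypothetical helper (not an edgeR function) and the reads are made-up toy records; it assumes a well-formed FASTQ with four lines per record and exact matching.

```r
# Hypothetical helper (not part of edgeR): fraction of reads whose bases
# start..end exactly equal one of the expected barcodes.
barcode_match_rate <- function(fastq_lines, barcodes, start = 1, end = 8) {
  # In a 4-line-per-record FASTQ, sequences are lines 2, 6, 10, ...
  seqs <- fastq_lines[seq_along(fastq_lines) %% 4 == 2]
  mean(substr(seqs, start, end) %in% barcodes)
}

# Toy records, invented for illustration
fq <- c("@r1", "ACGTACGTTTTT", "+", "IIIIIIIIIIII",
        "@r2", "TACGTACGTTTT", "+", "IIIIIIIIIIII")  # barcode shifted by one base

rate_fixed   <- barcode_match_rate(fq, "ACGTACGT", start = 1, end = 8)
rate_shifted <- barcode_match_rate(fq, "ACGTACGT", start = 2, end = 9)
```

On a real file, `fastq_lines` could come from `readLines()` on the fastq. If the rate at the expected positions is well below 1 while a neighbouring window picks up the rest, the barcodes are probably not where the function is told to look.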

All of the shRNA-seq and CRISPR-Cas9 genetic screens that I've dealt with follow this fixed layout. It sounds like your set-up is quite different (your shRNAs can be in any position within a read according to our offline correspondence) so you'll need to write a custom counting method.
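As a rough illustration of what such a custom counting method might look like, here is a minimal base R sketch. It assumes the barcode still sits at a fixed position, the hairpin may occur anywhere in the rest of the read, and only exact matches count. `count_anywhere` and the toy sequences are hypothetical, and the loop over hairpins would likely be slow for a library with hundreds of thousands of hairpins.

```r
# Hypothetical sketch (not an edgeR function): count hairpins that may occur
# anywhere after the barcode, grouped by the barcode at a fixed position.
count_anywhere <- function(seqs, barcodes, hairpins, bc_start = 1, bc_end = 8) {
  # Index of the matching barcode for each read; NA means no barcode match
  sample_idx <- match(substr(seqs, bc_start, bc_end), barcodes)
  # Search only the part of the read that follows the barcode
  inserts <- substring(seqs, bc_end + 1)
  counts <- matrix(0L, nrow = length(hairpins), ncol = length(barcodes),
                   dimnames = list(hairpins, barcodes))
  for (h in seq_along(hairpins)) {
    hit <- grepl(hairpins[h], inserts, fixed = TRUE) & !is.na(sample_idx)
    counts[h, ] <- tabulate(sample_idx[hit], nbins = length(barcodes))
  }
  counts
}

# Toy data, invented for illustration
seqs <- c("AAAACCCCTTGATTACAGG",   # sample 1, hairpin offset by 2
          "AAAACCCCGATTACATTTT",   # sample 1, hairpin right after barcode
          "GGGGTTTTCCGATTACACC",   # sample 2
          "NNNNNNNNGATTACA")       # no valid barcode, not counted
counts <- count_anywhere(seqs,
                         barcodes = c("AAAACCCC", "GGGGTTTT"),
                         hairpins = c("GATTACA", "CCCCCCC"))
```

The result is a hairpin-by-sample matrix of counts. Each read contributes at most one count per hairpin.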

With regard to your comment about not getting 100% matching when you manually add a specific sample index sequence, below is my test (based on a fastq file with 6 sequences) where I have tried to replicate this. I added a fixed sequence at the beginning of each read and recover matching statistics as you would expect (i.e. 100%). 

Best wishes,

Matt

> library(edgeR)
Loading required package: limma
> x25 = processAmplicons("testadd5bp.fastq", barcodefile = "SamplesTest.txt", hairpinfile = "HairpinsTest.txt", hairpinStart = 43, hairpinEnd = 61, verbose = TRUE)
 -- Number of Barcodes : 3
 -- Number of Hairpins : 5
Processing reads in testadd5bp.fastq.
 -- Processing 10 million reads
Number of reads in file testadd5bp.fastq : 6

The input run parameters are:
 -- Barcode: start position 1    end position 5  length 5
 -- Hairpin: start position 43   end position 61         length 19
 -- Hairpin sequences need to match at specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.

Total number of read is 6
There are 6 reads (100.0000 percent) with barcode matches
There are 6 reads (100.0000 percent) with hairpin matches
There are 6 reads (100.0000 percent) with both barcode and hairpin matches

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.11.5  limma_3.24.15
23 months ago by
United States
thomas.leete0 wrote:

I just noticed that when I lazily used "head" to stitch 0.1% of each of my libraries together to test a script, something completely confused processAmplicons and I got zero index+shRNA matches. It turned out that I had forgotten to suppress the headers that "head" prints between files, and this broke the process.


Sounds like a bug with your code rather than ours! This function assumes sequence reads are in fastq format - if this is messed up you might find strange results.
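A quick sanity check before calling processAmplicons can catch this kind of damage. The base R sketch below looks for the `==> file <==` banner lines that "head" prints when given several files, and checks the basic four-lines-per-record framing. `check_fastq_lines` is a hypothetical helper (not part of edgeR), the records are toy data, and the checks are deliberately minimal rather than a full FASTQ validator.

```r
# Hypothetical helper: return a character vector describing obvious
# structural problems in the lines of a FASTQ file (empty if none found).
check_fastq_lines <- function(lines) {
  problems <- character(0)
  if (any(grepl("^==> .* <==$", lines)))
    problems <- c(problems, "banner line of the form '==> file <=='")
  if (length(lines) %% 4 != 0)
    problems <- c(problems, "line count is not a multiple of 4")
  # Record headers should be lines 1, 5, 9, ... and start with '@'
  header <- lines[seq_along(lines) %% 4 == 1]
  if (!all(substr(header, 1, 1) == "@"))
    problems <- c(problems, "a record header does not start with '@'")
  problems
}

# Toy input, invented for illustration
good <- c("@r1", "ACGT", "+", "IIII")
bad  <- c("==> a.fastq <==", "@r1", "ACGT", "+", "IIII", "",
          "==> b.fastq <==", "@r2", "ACGT", "+", "IIII")

problems_good <- check_fastq_lines(good)
problems_bad  <- check_fastq_lines(bad)
```

On a real file the input could come from `readLines()`. With GNU head, the `-q` option should suppress the banners when several files are given.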

Reply written 23 months ago by Matthew Ritchie