edgeR: processAmplicons crashes with large number of hairpins
@thomasleete-7788
Last seen 6.8 years ago
United States

Hi,

I hope someone can help with this. You'll see from my failure to provide the information needed to actually diagnose this (i.e., error logs) that I am an R noob. Please point me to the correct logs and I will happily provide them!

The edgeR function processAmplicons is causing R (RStudio version 0.98.978) to crash. It works with a normal number of reads (20 million) and a small hairpin table (200 hairpins is no problem), but it crashes if I feed it my whole hairpin list of ~120,000. It generally (but not always) reports:

 -- Number of Barcodes : 4
-- Number of Hairpins : 119461

Then it sits indefinitely, eventually reports the following (see below), and restarts. Usually this coincides with me clicking over to another tab in Chrome or the like, but I have let it run uninterrupted for 8+ hours with no progress. 200 hairpins take about a minute to process, with a report at the completion of each 10 million reads. 120,000 hairpins will not process no matter how small I make the data set, and it never reports progress, just hangs after or during reading in the hairpins.

21 May 2015 20:29:15 [rsession-*****] ERROR session hadabend; LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&) /root/rstudio/src/cpp/session/SessionMain.cpp:1694

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
character(0)

other attached packages:
[1] edgeR_3.8.6

loaded via a namespace (and not attached):
[1] Rcpp_0.11.3      base_3.1.1       colorspace_1.2-4 datasets_3.1.1   grDevices_3.1.1
[6] graphics_3.1.1   limma_3.22.7     methods_3.1.1    stats_3.1.1      tools_3.1.1
[11] utils_3.1.1     

edger processamplicons rstudio

How large are the files? Is it possible to post a minimal working example (e.g., on DropBox) that reproduces the problem?

Also, you haven't posted your sessionInfo, but make sure that you're using the latest version of edgeR. I don't think this'll solve your problem, as the hairpin processing code has been stable for several versions, but we should make sure, just in case.


sessionInfo is above.

I made a very small example set to demonstrate the failure.  Please see the .R file in the link below.

In the meanwhile, the person in charge of our RStudio install has suggested I run it on the command line on our cluster. I erroneously believed that our RStudio was set to go through the interactive queue on our scheduler, but apparently it's just running on a couple of nodes or something.

Thanks again.


Running R/3.1.0 on the command line, I get the following error with the big (120,000) hairpin list:

> source("sgrnalib.R")
-- Number of Barcodes : 4
-- Number of Hairpins : 119461

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
2: doTryCatch(return(expr), name, parentenv, handler)
3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
4: tryCatchList(expr, classes, parentenv, handlers)
6: processAmplicons("plasmidpart.fastq", readfile2 = NULL, "Samples1.txt",     "sgrnauniqueAB.txt", barcodeStart = 1, barcodeEnd = 9, hairpinStart = 34,     hairpinEnd = 53, allowShifting = TRUE, shiftingBase = 2,     allowMismatch = TRUE, barcodeMismatchBase = 2, hairpinMismatchBase = 2,     allowShiftedMismatch = TRUE, verbose = TRUE)
7: eval(expr, envir, enclos)
8: eval(ei, envir)
9: withVisible(eval(ei, envir))
10: source("sgrnalib.R")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Aaron Lun ★ 27k
@alun
Last seen 20 hours ago
The city by the bay

Examining the C code indicates that the original author hard-coded a maximum limit of 100,000 hairpins. Your example exceeds this, resulting in out-of-bounds memory access and a segmentation fault. We've rejigged the code to eliminate this limit, so processAmplicons should now work with any number of hairpins in your list. Check out the latest release version (3.10.1) to get the patch. It'll still take a while to run, though.
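For anyone stuck on an edgeR version that still has the 100,000-hairpin limit, one workaround (which Thomas effectively used with his 60k + 60k runs) is to split the hairpin annotation into chunks below the limit and stack the per-chunk count matrices. A rough sketch under the run parameters used in this thread; the helper names and file handling here are hypothetical, not edgeR code:

```r
# Hypothetical workaround for edgeR versions with the 100,000-hairpin limit:
# run processAmplicons() on chunks of the hairpin table, then stack the
# per-chunk count matrices. Helper names and file handling are illustrative.
split_into_chunks <- function(n, chunk_size) {
  # assign items 1..n to consecutive chunks of at most chunk_size each
  split(seq_len(n), ceiling(seq_len(n) / chunk_size))
}

run_in_chunks <- function(fastq, sample_file, hairpins, chunk_size = 90000) {
  chunks <- split_into_chunks(nrow(hairpins), chunk_size)
  counts <- lapply(chunks, function(idx) {
    chunk_file <- tempfile(fileext = ".txt")
    write.table(hairpins[idx, ], chunk_file, sep = "\t",
                quote = FALSE, row.names = FALSE)
    edgeR::processAmplicons(fastq, readfile2 = NULL, sample_file, chunk_file,
                            barcodeStart = 1, barcodeEnd = 9,
                            hairpinStart = 34, hairpinEnd = 53)$counts
  })
  do.call(rbind, counts)  # one row per hairpin, one column per sample
}
```

Each chunk re-reads the fastq, so this trades extra I/O for staying under the limit; it is unnecessary from 3.10.1 onward.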


Thank you, Aaron! It was very nice to wake up to a solution this morning. I should have asked here earlier instead of pulling my hair out for two days.

With basically everyone in my community attempting genome-wide CRISPR with Zhang's 120,000-member library, I imagine that others would run into this problem shortly.

Of course, as you say, this might take a while and turn out not to be the most efficient approach.

Thanks again.

-Thomas


Aaron,

I just now got to try the update (actually 3.10.2), as I had to wait for the powers-that-be to update R on RStudio. Thank you again. I can confirm that it works on a very small (40,000 reads) data set with ~120,000 hairpins, but I have a question.

Should I expect 120,000 hairpins to be extraordinarily slower than 60,000 x 2? Unfortunately, I'm still waiting for (different) powers-that-be to update R on our cluster. Previously, using a relatively small chunk of our cluster (5 cores, 20 GB RAM), it took ~30 minutes to process all 200 million reads if done as 60k + 60k hairpins.

On RStudio, it previously took <10 minutes to process ~15 million reads as 60k + 60k.

Now, using edgeR on a 200-million-read data set, it did _NOT_ get through the first 10-million-read chunk in 90 minutes, and I terminated it. I then fed it a 40,000-read chunk, and it processed that in ~10 minutes. By extrapolation, that is roughly 65-70 hours to process the 16 million reads that previously processed in ~15 minutes.

Is it possible that this is a real thing and not just our RStudio server being taxed at the moment?

As evidence of it being "real", I also noticed that the percentage of matched hairpins has changed. Now I am getting ~99% matches, whereas I was only getting ~80% previously (albeit on a different data chunk of the same lane). Is it possible that you improved the shifted-hairpin algorithm of processAmplicons sometime in the last few versions, and that is why it is taking so much longer now? The version I was using previously was 3.8.6. I have a variable spacer in the sequence, which necessitates allowShifting=TRUE, and I did notice that it didn't previously behave exactly as I expected.

Thanks!

-thomas


Hi Aaron.

The hairpin loading takes less than a minute, and 120k takes about 2x as long as 60k, so that makes sense.

It is the processing step that is taking longer. With verbose on, it reports "Processing 10 million reads", and it is there that it takes a long time, processing at about 4,000 reads per minute. It is not appreciably faster to use 60k hairpins vs. 120k hairpins.

It also just occurred to me that with 3.10.2 I do _NOT_ see the message I saw in 3.8.6:

There are 19 hairpins without exact sequence match.
Re-processing reads in Plasmid_liba1b1a2b2.fastq, considering sequence mismatch

When mapping 120k hairpins onto 40k reads, there will obviously be at least 80k hairpins without an "exact sequence match".

Anyhow, a computational time of 800 hours for 200 million reads is feasible on our cluster, but it is a HUGE change from the last version.

Thanks again for your help. If you have insight into why it's taking longer, great; otherwise, I'm just happy to have the tool to use, even if it is a little slow.

-thomas


Hi Thomas,

I had a quick look at your example, and it appears that very few guides can be matched (0.032% if I don't allow shifting, 0.059% with shifting, and 0.083% if both shifting and mismatches are allowed; see code below).

I guess this is reasonable given that your demo provides 99 sgRNAs out of 120,000, which is about 0.08% of the total library.

For screens I've analysed, this step takes 5-10 minutes depending upon the number of reads. Maybe try re-running with allowMismatch=FALSE. According to system.time on this example, it takes around twice as long to allow both shifting and mismatches.

Cheers,

Matt

> system.time(processAmplicons("plasmidpart.fastq", readfile2=NULL, "Samples1.txt", "tinysgset_gene.txt",
+                              barcodeStart=1, barcodeEnd=9, hairpinStart=34, hairpinEnd=53, verbose=TRUE))
 -- Number of Barcodes : 4
 -- Number of Hairpins : 99
Processing reads in plasmidpart.fastq.
 -- Processing 10 million reads
Number of reads in file plasmidpart.fastq : 37500
The input run parameters are:
 -- Barcode: start position 1    end position 9  length 9
 -- Hairpin: start position 34   end position 53         length 20
 -- Hairpin sequences need to match at specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.
Total number of read is 37500
There are 37178 reads (99.1413 percent) with barcode matches
There are 12 reads (0.0320 percent) with hairpin matches
There are 12 reads (0.0320 percent) with both barcode and hairpin matches
   user  system elapsed
  0.056   0.008   0.330

> system.time(processAmplicons("plasmidpart.fastq", readfile2=NULL, "Samples1.txt", "tinysgset_gene.txt",
+                              barcodeStart=1, barcodeEnd=9, hairpinStart=34, hairpinEnd=53, allowShifting=TRUE, shiftingBase=2,
+                              verbose=TRUE))
 -- Number of Barcodes : 4
 -- Number of Hairpins : 99
Processing reads in plasmidpart.fastq.
 -- Processing 10 million reads
Number of reads in file plasmidpart.fastq : 37500
The input run parameters are:
 -- Barcode: start position 1    end position 9  length 9
 -- Hairpin: start position 34   end position 53         length 20
 -- Allow hairpin sequences to be matched to a shifted position, <= 2 base left or right of the specified positions.
 -- Mismatch in barcode/hairpin sequences not allowed.
Total number of read is 37500
There are 37178 reads (99.1413 percent) with barcode matches
There are 22 reads (0.0587 percent) with hairpin matches
There are 22 reads (0.0587 percent) with both barcode and hairpin matches
   user  system elapsed
  0.060   0.011   0.202

> system.time(processAmplicons("plasmidpart.fastq", readfile2=NULL, "Samples1.txt", "tinysgset_gene.txt",
+                              barcodeStart=1, barcodeEnd=9, hairpinStart=34, hairpinEnd=53, allowShifting=TRUE, shiftingBase=2,
+                              allowMismatch=TRUE, hairpinMismatchBase=2, allowShiftedMismatch=TRUE, verbose=TRUE))
 -- Number of Barcodes : 4
 -- Number of Hairpins : 99
Processing reads in plasmidpart.fastq.
 -- Processing 10 million reads
Number of reads in file plasmidpart.fastq : 37500
The input run parameters are:
 -- Barcode: start position 1    end position 9  length 9
 -- Hairpin: start position 34   end position 53         length 20
 -- Allow hairpin sequences to be matched to a shifted position, <= 2 base left or right of the specified positions.
 -- Allow sequence mismatch, <= 1 base in barcode sequence and <= 2 base in hairpin sequence.
Total number of read is 37500
There are 37347 reads (99.5920 percent) with barcode matches
There are 31 reads (0.0827 percent) with hairpin matches
There are 31 reads (0.0827 percent) with both barcode and hairpin matches
   user  system elapsed
  0.763   0.015   1.175

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.11.5 limma_3.24.4

loaded via a namespace (and not attached):
[1] tools_3.2.0


Out of curiosity -- is the functionality you folks are coding down at the C level perhaps something that could be replaced with the functionality provided by Biostrings?


I guess so. We originally wrote the function entirely in R but found it was a bit slow when dealing with 100-million-plus reads from a fastq, so we switched to C to speed things up.
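For what it's worth, exact matching at a fixed read position can be quick even in base R, since match() on character vectors is essentially a hash lookup. A toy sketch of the idea, using made-up sequences and the positions from this thread (this is not the original edgeR implementation):

```r
# Toy sketch of exact hairpin matching in base R (not the edgeR source):
# pull the fixed hairpin window out of each read, then hash-match the
# windows against the hairpin table in one vectorised match() call.
match_hairpins <- function(read_seqs, hairpin_seqs,
                           hairpin_start = 34, hairpin_end = 53) {
  windows <- substr(read_seqs, hairpin_start, hairpin_end)
  match(windows, hairpin_seqs)  # NA where there is no exact match
}

# made-up 60-base reads carrying a 20-base hairpin at positions 34-53
hairpins <- c(strrep("A", 20), strrep("C", 20))
reads <- paste0(strrep("G", 33), hairpins, strrep("T", 7))
idx <- match_hairpins(reads, hairpins)  # c(1, 2)
```

Mismatch and shifted matching are where the per-read work blows up; exact matching like this stays effectively linear in the number of reads.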


FWIW, much of Biostrings is thin R wrappers around (performant) C code -- just pointing that out as it might be something worth considering.

(I'm assuming that when you say it was previously written in R, you mean "base" R, i.e., not using Biostrings.)


Is there documentation for the Biostrings C-level API? I remember doing something with Biostrings_stubs.c once, but I've since forgotten.


It's a good question, but one probably better asked on bioc-devel (I'd tag Herve so he can chime in, but I don't think we can do that here).

I'm hopeful that you can get 95% of what you need for this task via the interface made available through R, though... but I'll stop chiming in now; I just thought it might save you folks some time, is all.


Hi Aaron,

Hopefully Biostrings exposes enough functionality at the R level to cover most string matching needs, and your problem can be solved efficiently without using the C-level API. I would be happy to provide some guidance if you want to give it a try. As Steve suggested, maybe it's better to use the bioc-devel mailing list to discuss this. We can also discuss this off-list if you prefer.

H.


Matthew,

Maybe I should have started a new thread with the new problem but since I didn't, let me clarify here.

There were two examples in the .R file:

Both used a small .fastq file with ~37,000 reads.

(1) One that "#works" with a ~100-sgRNA guide pool. This is the one that you looked at.

(2) One that "#does not work", which uses a ~120,000-guide pool. This initially caused a segmentation fault in 3.8.6 because the sgRNA library was over 100,000 sequences.

Aaron or a team member fixed the segmentation fault in 3.10.2, which I confirmed; however, processing reads now seems to be taking much longer. I see that you are using 3.11.5, which I didn't know existed, so I'll have to check it out; for all I know, it is already fixed.

You guys are great.  Thank you.

-thomas


Okay -- since my run of 200k shRNAs and only 4-5 million reads is taking >2 hours, I went back and read the thread above, and it appears I should run this with allowMismatch=FALSE. I can imagine we could speed this up if we could parallelize the tasks -- if there is an internal qsort-like step, it would definitely benefit from throwing extra CPUs at it, which I have in my cloud instance. I can't stand waiting, so exposure of any internal loops would be awesome -- then we could parallelize within R. I suppose I could also split out the data and do that. Okay, that's what I will do now...
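On the split-the-data idea: processAmplicons itself runs single-threaded, but nothing stops you from splitting the fastq into chunks beforehand (keeping the line count of each chunk a multiple of 4) and summing the per-chunk count matrices, since counts are additive across reads. A rough sketch, assuming pre-split chunk files and the run parameters from this thread; this is a hypothetical DIY approach, not an official edgeR feature:

```r
# Hypothetical DIY parallelisation of processAmplicons() over pre-split
# fastq chunks; the per-chunk count matrices are simply added together.
library(parallel)

count_chunks_parallel <- function(chunk_files, sample_file, hairpin_file,
                                  cores = 4) {
  counts <- mclapply(chunk_files, function(fq) {
    edgeR::processAmplicons(fq, readfile2 = NULL, sample_file, hairpin_file,
                            barcodeStart = 1, barcodeEnd = 9,
                            hairpinStart = 34, hairpinEnd = 53,
                            allowMismatch = FALSE)$counts
  }, mc.cores = cores)
  Reduce(`+`, counts)  # counts are additive across read chunks
}
```

Each worker loads the full hairpin table, so memory scales with the core count; on a 20 GB node that may cap how many chunks you can run at once.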

But I am wondering whether this thread has terminated and there is a new one?


It's probably best to start a new question, as (1) you're describing a different problem from the segmentation fault that was originally posted, and (2) comments don't get a lot of attention. Check out the posting guide, if you haven't already. Some sentence structure would be nice as well.