Using Polyester to simulate large number of reads from a large transcriptome
3
0
Entering edit mode
Last seen 5.3 years ago

I am using 100 mb transcriptome to simulate 400-800 million reads for a particular fold change eg. 1.2. I require this for 5 replicates. I have the following questions:

1. when i run polyester , its just occupying one core/slot for a particular node on my cluster. Therefore its unable to utilise all the available RAM.
Is polyester multi threaded? Can i make it occupy all the nodes?

2. How do I specify qrsh -l mf=100G,h_vmem=100G to be used on a particular node? lets say its node1.
3. Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : negative length vectors are not allowed
Error in .Call2("Integer_any_missing_or_outside", x, lower, upper, PACKAGE = "S4Vectors") : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137
I am getting these errors. I guess its because R is running out of memory.

polyester rnaseq • 902 views
0
Entering edit mode
jmfu ▴ 10
@jmfu-9174
Last seen 17 months ago
United States

Before I answer how to split the transcriptome and generate reads separately for each piece, I'm going to clear something up for you regarding fold change and the replicates.

When you say you want a fold change of 1.2 for a transcript, this means that for a transcript one sample compared to another, one sample has 1.2x higher expression. So you need 2 samples in order to compare fold change. I take it to mean you want to do this comparison 5 times. So in truth, you need 10 samples (5 normal samples, and 5 samples with the 1.2x fold change)

0
Entering edit mode

yes you are right . for a particular fold change i need 10 samples in total. 5 tumor and 5 normal.

0
Entering edit mode
jmfu ▴ 10
@jmfu-9174
Last seen 17 months ago
United States

Now for the split generation of reads (assume the transcriptome is not too big and can be loaded into memory in one go):

fasta = readDNAStringSet(yourfile)
transcript_count = length(fasta)
# Choose number you want to break simulation up into
chunks = 100
# Assign each transcript to a chunk for processing
assignment = rep(1:chunks, each=ceiling(transcript_count/chunks))[1:transcript_count]
for(i in 1:chunks){
fasta_sub = fasta[assignment==i]
writeXStringSet(small_fasta, 'fasta_sub.fa')
# ~20x coverage
# Here all transcripts will have ~equal FPKM
readspertx = round(20 * width(small_fasta) / 100)
# Specify your fold changes here
fold_changes = cbind(rep(1, sum(assignment==i)),
​rep(1.2, sum(assignment==i)))
# num_reps specifies 5 replicates for both normal and change samples
num_reps=c(5, 5), fold_changes=fold_changes,
}


Now you have all the files in directories simulated_reads_1 ... simulated_reads_n. You just have to merge the files in each directory with the same name (which represent the same sample).

0
Entering edit mode

so if i have 100 mb transcriptome , what assumption do i make about its size? is it too big ? and can it be loaded into memory in one go?

i would like generate 800 million reads for 10 samples(5 tumor and 5 normal). how many chunks do i need to break the simulation into? is practical to simulate so many reads for a large transcriptome using the for loop?

0
Entering edit mode

100 mb transcriptome should be easily loaded into memory in one go. If you break it up into 100 pieces should be fine. As for practicality, it is all a matter of perspective. You can just leave the loop to run, and add in a line to let you know how far along the simulation you are. Insert after the for loop starts this line

print(i); flush.console()
0
Entering edit mode

### I used the script given by you for the latest version of polyester. but its not generating the desired fold changes. i got these output files:

file 1  : head sim_rep_info.txt
rep_id    group    lib_sizes
sample_01    1    1
sample_02    1    1
sample_03    1    1
sample_04    1    1
sample_05    1    1

file 2 : head sim_tx_info.txt
transcriptid    foldchange    DEstatus
uc010tkp.2    1    FALSE
uc001vuz.1    1    FALSE
uc001vwc.3    1    FALSE
uc021rno.1    1    FALSE
uc010tkt.2    1    FALSE
uc010tku.2    1    FALSE
uc010tkv.2    1    FALSE
uc001vwh.1    1    FALSE
uc010tkw.2    1    FALSE

as you can clearly see there is no differential expression. it has some problem with cbind function for fold changes and num_reps=c(5,5).

### However, i used the script with the for loop with the older version of the polyester(v 1.0.2) and it gives me the desired output.

• I didnt use the cbind function for fold changes. I used the following :

fold_changes = rep(1,length(small_fasta))
deInds = sample(1:length(small_fasta),size=128, replace = TRUE) #size specifies the number of transcripts i want to be differentially expressed
fold_changes[deInds] = rep(c(1.2,1/1.2), length(deInds)/2) #this gives equal number of both higher and lower expression ie. 1.2 and 0.8

• I didnt use num_reps=c(5,5). I used the following:

num_reps = 5 #I still got 10 samples.

how did i get 10 samples by using num_reps=5 and how do i differentiate between normal and tumor?

so can you tell me why is the new version of polyester causing these issues?

can i use the older version if that is working for me?

is there a significant difference and improvement from v1.0.2 to the latest version?

0
Entering edit mode
jmfu ▴ 10
@jmfu-9174
Last seen 17 months ago
United States

What is your sessionInfo() when using the newer version of polyester?