Hello, I have a basic question about RNA-seq data preprocessing. I currently have two different RNA-seq datasets. One uses Ensembl gene IDs, while the other uses Ensembl gene IDs with version suffixes. Both are in raw count format, and I want to merge them into a single dataset and batch-correct it with sva::ComBat_seq without losing any information. Is it possible to merge Ensembl gene IDs with and without versions? Or should I use biomaRt's getBM to find the common ones? Thank you for your help!
The obvious answer is to remove the version using gsub("\\..*", "", x), where x is the vector of gene IDs. The fact that the gene IDs differ means the datasets have been processed differently. That is not good; you should process them identically to avoid introducing batch effects. Even if they are processed the same way in silico, be aware that batch correction rests on some assumptions, one being that batch is not nested within the groups you have. I cannot comment further without details.
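As a rough illustration of the version-stripping step, here is a minimal base-R sketch. The matrices `counts_a` and `counts_b`, their gene IDs, and their counts are made up for the example, and `strip_version` is a hypothetical helper wrapping the same pattern as the gsub call above. Note that intersecting IDs keeps only genes present in both datasets, so some information is lost by construction.

```r
# Toy raw-count matrices (genes x samples); IDs and counts are illustrative only.
counts_a <- matrix(c(10, 0, 5, 3, 8, 1), nrow = 3,
                   dimnames = list(c("ENSG00000000003", "ENSG00000000005",
                                     "ENSG00000000419"),
                                   c("a1", "a2")))
counts_b <- matrix(c(7, 2, 4, 9, 6, 0), nrow = 3,
                   dimnames = list(c("ENSG00000000003.15", "ENSG00000000419.12",
                                     "ENSG00000000457.14"),
                                   c("b1", "b2")))

# Hypothetical helper: drop everything from the first dot onwards.
strip_version <- function(ids) sub("\\..*", "", ids)

# Sum rows that collapse to the same unversioned ID, so row names stay unique.
counts_b <- rowsum(counts_b, group = strip_version(rownames(counts_b)))

# Keep only genes present in both datasets, in the same order, then bind samples.
shared <- intersect(rownames(counts_a), rownames(counts_b))
merged <- cbind(counts_a[shared, , drop = FALSE],
                counts_b[shared, , drop = FALSE])

# Batch label per sample, in the column order of `merged`.
batch <- rep(c("A", "B"), times = c(ncol(counts_a), ncol(counts_b)))
```

With real data, `merged` and `batch` would be the inputs to something like `sva::ComBat_seq(merged, batch = batch)`, but matching IDs this way does not undo the differences between a GRCh37 and a GRCh38 annotation.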
One dataset was processed using GRCh38, while the other was processed using GRCh37. Both were sequenced on the same Illumina HiSeq 2500 platform from Homo sapiens samples. Would filtering with something like cpm() to reduce the number of features be a good way to start the analysis?
I do not know the project or its aim, but generally a good approach would be to use the exact same preprocessing pipeline for both. In the lab we cannot always avoid batches; in silico we can. There is no point in using different pipelines.
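On the cpm() filtering question, here is a minimal sketch of what such a filter does, written in base R so it runs without extra packages. The count matrix, the threshold `min_cpm`, and the sample requirement `min_samples` are all assumptions chosen for illustration; in practice edgeR's cpm() and filterByExpr() are the usual tools. Filtering low-count genes is a reasonable first step, but it does not address the difference in genome build.

```r
# Toy raw counts (genes x samples); each column sums to 1e6 so CPM equals the count.
counts <- cbind(s1 = c(999000, 0, 3, 500, 497),
                s2 = c(998901, 0, 0, 600, 499),
                s3 = c(999100, 0, 2, 400, 498),
                s4 = c(999051, 2, 0, 450, 497))
rownames(counts) <- paste0("gene", 1:5)

# Counts per million: scale each sample (column) by its library size.
cpm_mat <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Assumed thresholds: keep genes with CPM above 1 in at least 2 samples.
min_cpm <- 1
min_samples <- 2
keep <- rowSums(cpm_mat > min_cpm) >= min_samples

# Filter the raw counts (not the CPM values), since ComBat_seq expects raw counts.
filtered <- counts[keep, , drop = FALSE]
```

Here gene2 is dropped because it exceeds the threshold in only one sample.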
Save yourself the headache of trying to harmonize them. Get the FASTQ files for both, and realign both to the same genome and annotation. It's probably easier, and way safer.