sample ids not matching when creating SeqVarData object
1
1
Entering edit mode
@stephanie-m-gogarten-5121
Last seen 21 days ago
University of Washington

Question submitted by email:

I’m trying to use the regression function in SeqVarTools but I keep getting error about sample id’s not matching when creating SeqVarData, e.g.

# Get the sample ids and force them to match
sampleIds <- data.frame( sample.id=seqGetData(gds, 'sample.id') )
phenos <- merge(sampleIds, phenos, sort=F)
seqSetFilter(gds, sample.id=phenos$sample.id) phenosDF <- AnnotatedDataFrame(phenos) assocData <- SeqVarData(gds, phenosDF)  It looks like the SeqVarData is ignoring the seqSetFilter command when creating SeqVarData. Is there a way to get around this? seqvartools • 728 views ADD COMMENT 0 Entering edit mode @stephanie-m-gogarten-5121 Last seen 21 days ago University of Washington You can't use the seqSetFilter function to change the order of samples read from the GDS file, or to assign sampleData to only a subset of samples in the GDS file. The AnnotatedDataFrame you use to create a SeqVarData object must contain the same samples, in the same order, as the GDS file. The problem you're having is that merge reorders rows in a data.frame even when sort=FALSE, and drops rows that don't match by default. You can get around this by using the all.x=TRUE argument to merge, and then explicitly reordering afterwards: sample.id <- seqGetData(gds, 'sample.id') sampleIds <- data.frame( sample.id, stringsAsFactors=FALSE ) phenos <- merge(sampleIds, phenos, sort=FALSE, all.x=TRUE) phenos <- phenos[matchsample.id, phenos$sample.id),]
phenosDF <- AnnotatedDataFrame(phenos)
assocData <- SeqVarData(gds, phenosDF)


Alternatively, you can use dplyr::left_join, which never re-orders its first argument:

phenos <- dplyr::left_join(sampleIds, phenos)

0
Entering edit mode

Thanks for the response, however, I guess I wasn't clear about the problem. First, I have verified that the merge isn't the issue, and the sort doesn't reorder the rows in the data.frame and works as I expected. To clarify, the issue is the opposite of what you're suggesting, i.e. there are extra sample ids in the gds that I do not want, and I think the seqFilter to remove these ids is ignored by SeqVarData. So, this proposed solution doesn't work. The reason I think that the seqFilter to remove sample id's is ignored by SeqVarData (similar to some snpgds functions), is because if I do a seqExport after the filter and then reload the new gds, then my code in the parent posting above works fine.

So, my workaround that works fine, is to seqFilter, seqExport, and then seqOpen the new file. I was hoping there was a better way though, because the exporting is incredibly slow.

Thanks again for any help.

1
Entering edit mode

Sorry, I think I was misunderstanding your question. You can use seqSetFilter to select a subset of samples for any analysis; you just have to do it after creating the SeqVarData object. To run regression on a subset of samples, you would do the following:

idsToKeep <- phenos\$sample.id
sampleIds <- data.frame( sample.id=seqGetData(gds, 'sample.id') )
phenos <- dplyr::left_join(sampleIds, phenos)
phenosDF <- AnnotatedDataFrame(phenos)
assocData <- SeqVarData(gds, phenosDF)
seqSetFilter(assocData, sample.id=idsToKeep)
myReg <- regression(assocData, "outcome") # only uses idsToKeep


Any function that works on a SeqVarGDSClass object (what you get from seqOpen) will also work on a SeqVarData object. This is why the AnnotatedDataFrame has to match the GDS exactly (even it requires filling in missing values when you create it, as you do here). Otherwise it would be possible to subsequently set a filter including samples that were not in your original data.frame.

0
Entering edit mode

Ah I see, so I fill in place-holders and then remove them basically.

Thanks a bunch, this is most helpful!