Question

imputing data using pcaMethods with large amounts of missing data in rows

eli-sava • 0

@eli-sava-12767

Last seen 7.0 years ago

Hello. I am starting to use pcaMethods for an environmental application involving concentrations. The columns of my dataset represent stations, which I expect to have spatial structure. The rows are hourly observations from 2005 to 2017. The difficulty is that the monitoring network expanded greatly over the last decade. The last rows, covering 2013-2017, contain data for every column with perhaps 10% missing, which I believe is on the high side for imputation but not unreasonable. The initial rows, on the other hand, have a much higher fraction of missing data: at best 40% of the stations go back to the first year I am considering (2005). Other columns are missing entirely until the corresponding station came online.

Can anyone suggest a good way to proceed in this case? Should I develop the components based on 2013-2017 only? How can I best use pcaMethods to impute the early part of the dataset while keeping those sparse early rows out of the component estimation, which I understand would be far beyond what the method is designed for? I am willing to assume that the spatial structure has been stationary and that 2013-2017 samples the patterns of interest. Thanks.
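To make the question concrete, here is a minimal sketch of the two-step idea described above: estimate the components from the recent block only, then fill the early rows by projecting their observed stations onto those fixed components. It is written in Python with NumPy rather than with pcaMethods, so none of it reflects the package's API. The function names (`fit_loadings`, `impute_rows`) and the synthetic, noise-free, rank-2 data are assumptions made purely for illustration, and the recent block is assumed to be already complete (in practice its roughly 10% gaps would have to be imputed first).

```python
import numpy as np

def fit_loadings(recent, n_pcs):
    """Column means and PCA loadings from a complete (no-NaN) recent block."""
    mean = recent.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(recent - mean, full_matrices=False)
    return mean, vt[:n_pcs].T  # loadings have shape (n_stations, n_pcs)

def impute_rows(early, mean, loadings):
    """Fill NaNs row by row using loadings that stay fixed (stationarity assumption)."""
    filled = early.copy()
    n_pcs = loadings.shape[1]
    for i, row in enumerate(early):
        obs = ~np.isnan(row)
        if obs.sum() < n_pcs:
            continue  # too few observed stations to estimate the scores; leave NaNs
        # Least-squares scores using only the stations observed in this hour.
        scores, *_ = np.linalg.lstsq(loadings[obs], row[obs] - mean[obs], rcond=None)
        recon = mean + loadings @ scores
        filled[i, ~obs] = recon[~obs]  # observed values are left untouched
    return filled

# Hypothetical data: 6 stations driven by 2 shared spatial patterns, no noise.
rng = np.random.default_rng(0)
true_loadings = rng.normal(size=(6, 2))
recent = rng.normal(size=(200, 2)) @ true_loadings.T      # stands in for 2013-2017
early_full = rng.normal(size=(50, 2)) @ true_loadings.T   # truth for the early period
early = early_full.copy()
early[:, 3:] = np.nan                                     # stations 3-5 not yet online

mean, loadings = fit_loadings(recent, n_pcs=2)
filled = impute_rows(early, mean, loadings)
print("max abs imputation error:", np.abs(filled - early_full).max())
```

With real, noisy concentrations the per-row least-squares step becomes unstable when only a few stations are observed relative to the number of components, so the number of components and a minimum number of observed stations per row would both need to be chosen with care.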

pcamethods missing data • 817 views

Posted 7.0 years ago by eli-sava