Introduction
There are various places in the GCM, the GCM Post Processing, and EVA where missing data needs to be handled. This document attempts to describe the current state, the desired state, changes that need to be made in order to achieve the desired state, and anything else related to how the entire EdGCM suite treats missing/invalid/NaN EdGCM data.
In this document, missing or invalid data is sometimes referred to in shorthand as NaN. Note that this is not NaN in the strict floating-point sense. In the GCM, "NaN" is usually an ordinary number used as a sentinel (-999, -1e30, or similar), and whether a given "NaN" is a true NaN or actually a number depends on context.
The flow of EdGCM Missing Data is shown below, and this document follows the same layout. The red link from the Masks to SumAndPD is, I believe, where the majority of the work will take place for the bug fix.
History
Previous versions of this document can be accessed via the Last Change link at the top right of this page.
Related Tickets
- This topic was created with ticket #640
GCM
Missing data in the GCM refers to masks, or locations where data is not valid. For example, SSTs are not valid on a continent. Ocean ice is likewise not valid on a continent, although it is valid at the equator (with a value of 0). One of many unknowns is topography: is it 0 in the ocean, or invalid?
The GCM has 3 types of masks that are applied to data based on input files:
- land
- ocean (= 1-land)
- land ice
and 2 additional types that are applied based on actively calculated arrays:
- ocean ice
- open ocean water
Additional "masks" are necessary for averages related to sea ice or open ocean regions because the sea ice/open ocean values are not inputs, but are calculated by the model during a simulation.
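As a rough illustration of the mask relationships listed above, here is a minimal sketch. The four-cell arrays, the variable names, the -1e30 sentinel, and the assumption that ocean ice is stored as a fraction of each cell's ocean area are all hypothetical and are not taken from the GCM source.

```python
# Hypothetical strip of four grid cells; all values are illustrative only.
MISSING = -1e30                       # assumed sentinel for invalid cells

land = [1.0, 0.0, 0.0, 1.0]           # static, from an input file (1 = land)
ocean_ice = [0.0, 0.5, 0.0, 0.0]      # dynamic, computed during the run
sst = [0.0, -1.5, 20.0, 0.0]          # raw values; land entries are meaningless

# Static mask: ocean is the complement of land (ocean = 1 - land).
ocean = [1.0 - f for f in land]

# Dynamic mask: open water is the part of the ocean not covered by ice.
open_water = [o * (1.0 - i) for o, i in zip(ocean, ocean_ice)]

# Applying a mask: SST is only valid where there is some ocean.
sst_masked = [v if o > 0.0 else MISSING for v, o in zip(sst, ocean)]
```

The static masks can be built once from the input files, while the ice and open water masks have to be rebuilt whenever the model updates the ice array.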
It is assumed, at this point in the specification, that there will not be any GCM code changes, as the GCM handles missing and invalid data locations correctly.
Post Processing
The GCM handles data correctly. Information is lost during the post processing steps.
SumAndPD
Current Behavior
- Converts the GCM outputs into a custom binary format.
- Does not maintain data consistency. This is known because vertical netCDF files have a "missing_value" field (usually -1e30) that works for any given month, but the annual average includes this value in the calculation as if it were valid data.
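The effect of averaging the sentinel as if it were data can be reproduced in a few lines. This is only a sketch of the arithmetic, not SumAndPD code; the monthly values are made up.

```python
MISSING = -1e30  # the missing_value usually found in the vertical files

# Hypothetical cell: valid in 11 months, missing in one.
monthly = [2.0] * 11 + [MISSING]

# Naive annual mean that treats the sentinel as ordinary data.
naive_annual = sum(monthly) / len(monthly)

# The result is a huge negative number that no longer equals MISSING,
# so a reader comparing against missing_value will see it as valid.
looks_valid = naive_annual != MISSING
```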
Desired Behavior
Either
- Any sum that contains a NaN must have a result of NaN
OR
- Any sum that contains a NaN must exclude that value from the calculation. This would provide more cells with data, but would also require us to provide a bin count array to use in conjunction with the data (for example, most cells in DJF have a bin count of 3, but some might have a bin count of 2 if January had some NaNs)
In either case:
- If all NaN data introduced in this step uses the same value, the modifications to MakeNetCDF are simpler (see below)
- NaN should be either a real NaN or -1e30, as the current values of -999 or 0 might be valid numbers for some data
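The two candidate behaviors for a seasonal average can be compared side by side. The following is a minimal sketch for a single cell; the function names and the -1e30 sentinel are assumptions made for illustration, not existing SumAndPD routines.

```python
import math

MISSING = -1e30  # assumed uniform sentinel; a real NaN is also accepted


def is_missing(v):
    """True for either a real NaN or the sentinel value."""
    return math.isnan(v) or v == MISSING


def strict_mean(values):
    """First option: any missing input makes the result missing."""
    if any(is_missing(v) for v in values):
        return MISSING
    return sum(values) / len(values)


def lenient_mean(values):
    """Second option: skip missing inputs and also return the bin count."""
    valid = [v for v in values if not is_missing(v)]
    if not valid:
        return MISSING, 0
    return sum(valid) / len(valid), len(valid)


# Hypothetical DJF values for one cell, with January missing.
djf = [1.0, MISSING, 3.0]
strict_result = strict_mean(djf)          # missing
lenient_result = lenient_mean(djf)        # mean of two months, bin count 2
```

The lenient version has to return the bin count with every value, which is the extra array mentioned above. The strict version needs no extra output.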
Work Plan
- SumAndPD should access the same files (the Z file) that the GCM uses to determine what locations contain valid or invalid data
The stricter definition (result is NaN) should be used to increase data accuracy.
MakeNetCDF
Current Behavior
- Converts the SumAndPD results to netCDF.
Desired Behavior
- Long term behavior should produce CF-1.0 compliant data sets. For this particular issue (missing data) we should work toward that goal.
- Introduce an attribute (either global or per variable) that specifies the missing_value value. If SumAndPD sets all missing data to NaN or -1e30 then this is simplified. If SumAndPD sets the missing data to different values for different variables then this is more complicated.
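To show why a uniform value simplifies this step, here is a small sketch that uses plain dictionaries in place of a real netCDF library. The variable names, the sentinel values, and the function itself are hypothetical.

```python
def missing_value_attrs(sentinels):
    """Build per-variable attributes from the sentinel each variable uses.

    sentinels maps a variable name to the number SumAndPD wrote for
    missing cells. Returns (attrs, shared): shared is the common value
    when every variable agrees, otherwise None.
    Note: real NaN sentinels would need math.isnan() checks, since
    NaN never compares equal to itself.
    """
    attrs = {name: {"missing_value": s} for name, s in sentinels.items()}
    unique = set(sentinels.values())
    shared = unique.pop() if len(unique) == 1 else None
    return attrs, shared


# Simple case: one value everywhere, so it can be declared once.
uniform_attrs, uniform_shared = missing_value_attrs(
    {"sst": -1e30, "topography": -1e30}
)

# Complicated case: each variable must carry its own value.
mixed_attrs, mixed_shared = missing_value_attrs(
    {"sst": -999.0, "topography": -1e30}
)
```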
EVA
Current Behavior
- Graphically handles missing_value data in vertical mode
- Reads missing_value from netCDF files. Does not accept user input
- Sets invalid cells to black color
- Does not handle NaN in lat x lon (ij/map) mode
- Does not handle NaN when computing zonal averages in data window in either map or vertical mode
Desired Behavior
- Do NOT accept user input for invalid data (?)
- Allow users to select color for invalid cells
- Draw cells in that color for any netCDF file that contains a CF-1.0 compliant missing_value field
- When showing data numerically, set missing_value cells to the string "NaN"
- When performing math operations on data with missing_value, the operations should ignore the NaNs, but produce valid output when possible. Note that this is a looser specification than the strict behavior of SumAndPD suggested above.
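The looser EVA behavior for zonal averages and numeric display might look like the sketch below. The function names and the two-decimal number format are assumptions, and the missing_value is expected to come from the netCDF file rather than from the user.

```python
import math


def is_missing(v, missing_value):
    """True for a real NaN or for the file's missing_value."""
    return math.isnan(v) or v == missing_value


def zonal_mean(row, missing_value):
    """Mean over one latitude row, ignoring missing cells.

    Returns a real NaN only when every cell in the row is missing.
    """
    valid = [v for v in row if not is_missing(v, missing_value)]
    return sum(valid) / len(valid) if valid else math.nan


def format_cell(v, missing_value):
    """Text shown in the data window for one cell."""
    return "NaN" if is_missing(v, missing_value) else f"{v:.2f}"


# Hypothetical latitude row with one missing cell.
row = [1.0, -1e30, 3.0]
mean = zonal_mean(row, -1e30)
labels = [format_cell(v, -1e30) for v in row]
```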
Summary
The GCM will remain unchanged. The post-processing routines will flag the data so that anyone working with it, without specialized knowledge of the GCM and without our custom software, can still easily determine which data is valid and which is invalid for any variable. The post-processing routines will also move towards producing CF-1.0 compliant netCDF files. EVA will support CF-1.0 compliant missing_value attributes in netCDF files. EVA will not add any EdGCM/ModelII/GISS specific knowledge or data to perform this task.
