Here you will find contact information and more in-depth descriptions of
our data quality effort and current projects, as well as of the services
we can offer.
Without data quality, data quantity is useless
Consider a classic laboratory experiment designed to measure some physical
quantity repeatedly in order to establish its value empirically. The scientist
undertaking the measurement will take great care in documenting every step
performed in reaching the measured value, knowing that her peers will want
to reproduce and check the published result. The value measured could be
``5.3 W/m^2'' but this in itself has no meaning unless accompanied by
additional information. This additional information, the lab journals with
the detailed description of the experiment set-up, is called meta-data:
in other words, data about the data.
Today we often want data from sources (instruments) operating on a
continuous basis. This is very frequently the case in geophysical measurements,
for example in meteorology, where the statistical evolution of some quantity
over long periods of time, or the trend, is the ultimate objective. On the
one hand, it is clear that we cannot dedicate the same level of attention to
each and every one of the single measurements making up such an extended
time series as we would to a single measurement in the laboratory. On the
other hand, putting an instrument up and leaving it without any further
attention is also a waste of resources. Why? Because the data collected
is basically useless without meta-data.
Most researchers are happy if they can obtain the funding needed simply
to purchase an instrument. If the instrument is to be used for an extended
time-series type measurement, allocation of resources (man-hours) to the
continued task of checking the instrument and the data it's producing is
often given low priority or even neglected. This task, commonly referred
to as data quality control, is vital if the product, i.e. the
measured data, is to have scientific value. An engineer setting up
an instrument could be satisfied seeing that a voltage is delivered to the
out-put terminal, but if the same engineer forgot to remove the cap from
the lens, the error could go undiscovered until more elaborate checks on
the data are made. Routinely performing such checks will tell us whether
the instrument is delivering data of known and reasonable quality.
The Light and Life Laboratory Approach
What resources should be allocated to data quality control and
where should they be invested? It is clear that we want to maximize the return
from the resources actually allocated. We have found that the greater part
of the job is done before the production data collection even starts, and
that this effort greatly influences the level of resources needed during
the production phase itself. Below is an attempt to summarize the key aspects
of the two phases involved:
Development phase
- Clearly state scientific motivations and theoretical basis
- State desired accuracy of measurements, and the goal for the percentage
of data with this known and reasonable quality
- Choose site/measurement location
- Specify instrument capabilities and make/choose instrument(s)
- Calibrate instrument(s)
- Establish instrument (on-site) maintenance routines
- Develop data collection and ingest routines
- Develop automated data quality checks for automated flagging of data
and alerting instrument operators
- Develop data visualization modules for human inspection of large
quantities of data
- Establish routines for regular data quality checking and logging of
events by qualified personnel
- Implement record keeping system for data and meta-data, make this available
to end users
- Establish method for browsing and delivery of data and meta-data to
end users
- Make documentation of all of the above available for end users
Production phase
- Follow established procedures for regular instrument maintenance, making
careful log entries of any events and everything observed/done/not done
- Follow established procedures for regular data quality inspection, making
careful log entries of any events and everything observed/done/not done
- Perform instrument calibration at regular intervals
- Optionally summarize extended periods of production data collection
from a data quality perspective in aggregate reports
The larger part of the resources should normally be spent on the development
phase, but depending on the length of the measurement program and the level
of data quality expected, the production phase could conceivably take up
considerable amounts of resources as well. This is why it is important to
facilitate the production phase tasks as much as possible from the start
of the program.
Next we will discuss some of the tools and techniques that we have developed
in order to maximize the production of data with known and reasonable quality
while at the same time adapting the amount of resources spent on achieving this
to the desired level. We shall use the expression data-stream to
mean the continuous recording of a single physical quantity, whether it is
recorded by one single instrument sensor, produced as an aggregate from the
out-put of several instrument sensors, or over time, by a changing set of
instrument sensors. Note that one instrument can have several data-streams
associated with it. The important part is that we must always be able to
identify, down to the serial number, the actual sensor(s) that produced
the physical quantity we are reporting in our data products.
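One way to keep this traceability is to store, for each data-stream, the periods during which a given set of sensors fed it. The sketch below is only an illustration: the class names, field names, and serial numbers are hypothetical and do not describe our actual data-base layout.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SensorAssignment:
    """One period during which a fixed set of sensors fed the data-stream."""
    start: datetime
    end: datetime            # exclusive
    serial_numbers: tuple    # serial numbers of the contributing sensors


@dataclass
class DataStream:
    """A continuous record of one physical quantity (hypothetical layout)."""
    name: str
    quantity: str
    assignments: list = field(default_factory=list)

    def sensors_at(self, when):
        """Return the serial numbers of the sensors active at time `when`."""
        for a in self.assignments:
            if a.start <= when < a.end:
                return a.serial_numbers
        return ()   # no sensor on record: the sample cannot be traced


# Made-up example: a second sensor joins the data-stream in June 2001.
stream = DataStream("downwelling_shortwave", "W/m^2")
stream.assignments.append(
    SensorAssignment(datetime(2001, 1, 1), datetime(2001, 6, 1), ("SN-1001",)))
stream.assignments.append(
    SensorAssignment(datetime(2001, 6, 1), datetime(2002, 1, 1),
                     ("SN-1001", "SN-2002")))
print(stream.sensors_at(datetime(2001, 7, 15)))
```

An empty result from `sensors_at` signals a gap in the record, which is itself worth flagging.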
Time is Data - Near Real Time Operation
The actual instrument configuration for a specific (time series type)
measurement depends on several factors. The most important of these are
the level of accuracy desired for the measurements, the desired data
availability, and the available resources for implementation. For example,
if it is a stated goal that 90% of the data should be of a certain quality,
then most likely one would have to perform very frequent instrument and
data quality checks. Collecting or checking the data on a monthly basis
would not do, as a month could go by without the discovery of a problem,
and we would already be short of the stated goal. Near real time
operation is hence a crucial part of the production phase. The challenge
lies in facilitating this in such a way that we do not spend too much
of our resources.
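The arithmetic behind this argument can be sketched in a few lines. The 90-day evaluation period and the assumption that each fault goes unnoticed for one full check interval are illustrative choices, not figures from any of our programs, and the function name is hypothetical.

```python
def worst_case_good_fraction(period_days, check_interval_days, n_faults=1):
    """Fraction of good data if each fault goes unnoticed for a full check interval."""
    lost = min(period_days, n_faults * check_interval_days)
    return 1.0 - lost / period_days


# One undetected fault in a 90-day campaign, for three checking schedules.
for interval in (30, 7, 1):
    frac = worst_case_good_fraction(90, interval)
    print(f"check every {interval:2d} days -> at least {frac:.1%} good data")
```

Under these assumptions a single fault with monthly checks can leave only two thirds of the campaign usable, well short of a 90% goal, while daily checks keep the loss to about one percent.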
Instrument Maintenance Routines
Step one in achieving a data quality goal is to implement regular
instrument maintenance routines. The frequency of the various
maintenance tasks should be determined according to their resource
utilization, or yield, in relation to the data quality goal. Doing
a full instrument calibration too often may result in a sub-optimal
data quality level simply because of the increased down-time
when the instrument is not taking data due to the calibration process.
Whenever possible, physical inspection and simple maintenance tasks
should be performed on a daily or weekly basis. The operators
performing these simple tasks must record all their observations and
maintenance actions in an instrument-by-instrument log.
This instrument log must also keep a record of the instrument calibration
history. The operator log should either be a part of a larger meta-data
log or at least be able to interface with it. As an example we can take
the case of an instrument operator reporting that she observed and then
removed something obscuring the sensor of an instrument. Often
such a problem cannot be spotted by either an automatic or
a manual check of the data. If the operator log is not a part of
the meta-data generating process, data that should be flagged as
suspect could be delivered to an end user as being of known and reasonable
quality. Clear connections between the individual instruments (sensors)
and their associated data-streams must be defined to accurately map
operator log entries to data-stream meta-data.
The Light and Life Laboratory has developed tools and standards for
instrument operator procedures and logs. These have largely been
the result of our extended field experience in working with the
DOE Atmospheric Radiation Measurement (ARM) program and our own
on-going environmental monitoring platforms. The operator logs
are kept in a data-base that lets the operators make their
entries in the field. The operator log data-base will also export
meta-data information on a data-stream basis. The log entries are
specified with a start-time and an end-time so that data quality
flags can be produced on level with the parametric data. Other
views and statistics are available through the operator log
data-base as well. These are useful for spotting recurrent problems
with specific instruments and thus help us make decisions about
the allocation of resources.
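The step from log entries with a start-time and an end-time to flags on level with the parametric data can be sketched as below. This is a minimal illustration and not the export code of our operator log data-base; the function name, the flag values ``ok'' and ``suspect'', and the sample log entry are all assumptions.

```python
from datetime import datetime, timedelta


def flags_from_log(sample_times, log_entries):
    """Return one flag per sample: 'suspect' if any log entry covers it, else 'ok'."""
    flags = []
    for t in sample_times:
        covered = any(start <= t <= end for start, end, _note in log_entries)
        flags.append("suspect" if covered else "ok")
    return flags


# Hypothetical log for one data-stream: (start-time, end-time, operator note).
log_entries = [
    (datetime(2001, 3, 1, 9, 0), datetime(2001, 3, 1, 11, 0),
     "frost on dome, removed"),
]
# Hourly samples from 08:00 to 12:00.
sample_times = [datetime(2001, 3, 1, 8, 0) + timedelta(hours=h) for h in range(5)]
flags = flags_from_log(sample_times, log_entries)
print(flags)
```

Both interval ends are treated as inclusive here, so a sample taken exactly at the end-time is still flagged; whether that is the right convention depends on how operators record the end of an event.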
Automated Near Real Time Data Collection and Checks
Production phase data quality checks would consume a prohibitive
amount of resources if the process could not be automated to a large
extent. The most costly form of resource utilization is the use of
actual human labour and this should thus be kept to a minimum.
As stated above, the more we can perform our data quality control
in near real time the better. It follows that near real time
automated data collection and checking is the ultimate goal.
Automated data collection is a relatively easy task today with
our wired society and inexpensive electronics. The Light and Life
Laboratory has expertise available for implementing automated
data collection for practically any situation. Working with
data collection from remote sites like the North Slope of Alaska
(NSA) we have implemented automated data collection modules
that are ``smart'' in that they have handlers for a wide range of
exceptional cases. Examples of such exceptional cases would be
temporary disruption of network services, slow or poor data-lines,
and large data-volumes. Most of these cases are handled by ``smart''
schedulers that will not only adapt to the situation, but also
alert personnel if a problem needs human intervention.
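The general idea of such a scheduler, retrying with increasing delays and alerting a person only when retries are exhausted, can be sketched as follows. This is not our production collection module: `collect_with_retry`, its parameters, and the `fetch` and `alert` callables are hypothetical names chosen for the illustration.

```python
import time


def collect_with_retry(fetch, alert, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds; back off between tries, alert on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError as err:          # network and I/O problems
            if attempt == max_attempts:
                # Retries exhausted: this needs human intervention.
                alert(f"data collection failed after {attempt} attempts: {err}")
                return None
            # Wait 1, 2, 4, ... times base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is passed in so that the waiting behaviour can be replaced, for example when the scheduler itself decides when to try again.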
Automated data quality checking is the most difficult task because
no algorithm we devise can ever be absolutely right. Even a highly
sophisticated data quality checking algorithm will fail to flag
problems that do not degrade the data enough to trigger the
flagging criteria. We can again use the example with the operator
observing something obscuring an instrument sensor, e.g. frost
on a dome or window. While the data will be of questionable
value, the automated quality checking algorithm will fail to
detect this because the instrument is still reporting within
the valid range of operation. Conversely, a data quality checking
algorithm may flag good data as bad if the algorithm is unable
to handle special cases. An example of this could be extreme
weather resulting in measured data points that are out-side a
minimum or maximum value determined by the checking algorithm.
We have found that the set of relatively simple automated checks
outlined below is enough to achieve a high degree of automated
near real time data quality checking, and that additional inspection
by data quality control personnel, either triggered by a notification
generated by automated processes or performed on a regular basis,
is sufficient to catch the majority of the remaining problems either
escaping the automated quality checks or wrongly flagged by them.
Light and Life Laboratory Automated Data Quality Checks
- Min/max/delta checks - check that the measured data are within bounds
determined by instrument characteristics and local climatology.
Sophisticated min/max/delta checking routines will account for
the season and time of day.
- Comparison to theoretical predictions - there is often a
simple algorithm available that will give the absolute maximum
or minimum a physical value can obtain given the time and
location of measurement. Clear sky radiation is an example
of this.
- Comparison with other measures of the same physical quantity.
If more than one instrument or sensor is available that directly
or indirectly measures the same physical quantity we can raise
flags if their difference exceeds some specified value.
See the paragraph about operating multiple instruments in parallel
below.
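The three checks above can be sketched in a few lines each. The functions below are a minimal illustration under stated assumptions, not our production routines: all names are hypothetical, the limits are passed in by the caller, and the theoretical maximum is assumed to have been computed elsewhere for each sample.

```python
def min_max_delta_flags(values, vmin, vmax, max_delta):
    """Flag samples outside [vmin, vmax] or jumping more than max_delta
    from the previous sample."""
    flags = []
    previous = None
    for v in values:
        bad = v < vmin or v > vmax
        if previous is not None and abs(v - previous) > max_delta:
            bad = True
        flags.append(bad)
        previous = v
    return flags


def upper_bound_flags(values, theoretical_max):
    """Flag samples exceeding a per-sample theoretical maximum
    (for example a clear sky estimate supplied by the caller)."""
    return [v > limit for v, limit in zip(values, theoretical_max)]


def disagreement_flags(values_a, values_b, max_difference):
    """Flag samples where two sensors measuring the same quantity disagree."""
    return [abs(a - b) > max_difference for a, b in zip(values_a, values_b)]


# Made-up samples with one spike and one negative value.
values = [100.0, 105.0, 400.0, 110.0, -20.0]
print(min_max_delta_flags(values, vmin=0.0, vmax=1500.0, max_delta=200.0))
```

Note that the delta check in this simple form flags both the spike and the good sample that follows it, since the jump back is just as large. A routine that accounts for season and time of day would look up `vmin`, `vmax`, and `max_delta` per sample instead of using constants.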
Data Visualization - Quick-looks
Data visualization tools are a critical part of the data quality
control process. When data quality personnel manually inspect data
they almost exclusively rely on these visualizations to efficiently
inspect large amounts of data. More in-depth investigations are
usually initiated if a potential data quality problem is observed
in the visualizations. Data visualizations also provide a tool
for browsing large amounts of data when end users are ``shopping''
for good data to use in research projects.
The Light and Life Laboratory approach has always been to
provide visualizations of all data-streams on a near real
time basis, usually daily or on demand, and then to archive
these for future reference. We call these visualization
products quick-looks because they provide a basic
overview of a time-series of data. Depending on the volume of data
channeled through a data quality process, the amount of these
quick-looks can become significant after only a short time. For
our data quality efforts with the ARM NSA sites we now have over
80,000 quick-looks archived. To solve
the problem of managing all these quick-looks and to provide
a convenient interface for quick-look browsing, we have
developed a quick-look data-base with a web-browser interface
for searching through the available images and displaying
them. The quick-look data-base will let data quality operators
attach statements about the data quality and general comments to
the quick-look images. One can later search for the quick-looks
that have problems noted or additional information attached to them.
Here is a schematic figure that illustrates the flow of data (black
lines) and meta-data (red lines) in the production data acquisition
and quality control process.
Instrument Calibration
Regularly calibrating instruments is of course very important, but it
could also be very costly and result in significant loss of data
while the calibration is performed. How often should one calibrate
instruments? It depends on the instrument, but the period between
factory calibrations can be extended by operating more than one
of the same type of instrument at the same location, or by regularly
comparing several instruments measuring the same quantity. This
process is often referred to as instrument inter-calibration.
In addition, simple checks like measuring the ``dark voltage'',
i.e. the out-put when the instrument should measure zero or another
known quantity, can give an idea about drift in the calibration
constants.
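As an illustration of how dark voltage readings could be turned into a drift estimate, the sketch below fits a straight line to a series of readings and compares the slope with a tolerance. The function names, the tolerance, and the readings are hypothetical, and a linear fit is only one of several reasonable ways to look for drift.

```python
def linear_drift(times_days, dark_voltages):
    """Least-squares slope of dark voltage versus time, in volts per day."""
    n = len(times_days)
    mean_t = sum(times_days) / n
    mean_v = sum(dark_voltages) / n
    num = sum((t - mean_t) * (v - mean_v)
              for t, v in zip(times_days, dark_voltages))
    den = sum((t - mean_t) ** 2 for t in times_days)
    return num / den


def needs_recalibration(times_days, dark_voltages, max_drift_per_day):
    """True if the fitted drift exceeds the tolerance in either direction."""
    return abs(linear_drift(times_days, dark_voltages)) > max_drift_per_day


# Made-up readings taken every ten days.
times_days = [0, 10, 20, 30]
dark_voltages = [0.000, 0.010, 0.020, 0.030]
print(needs_recalibration(times_days, dark_voltages, max_drift_per_day=0.0005))
```

The fit needs readings from at least two different times; with a single reading the denominator is zero.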
Keeping Records
It cannot be stressed enough how important it is to make careful
logs of all the meta-data, preferably in a way that makes this
information easily available for end users many years down the
line. What may seem unimportant today could be crucial for a
future investigation that utilizes the data differently than
originally planned. Just as one would never delete any raw
data, one should never delete any meta-data. The meta-data is
useless, and so is the parametric data, if end users cannot
access the meta-data.
Giving it All to The End User
When end users ``shop'' for data they would like to know a lot
about it before they go through the trouble of e.g. down-loading
and processing it. At first they would want to know what is available,
what format it is in, what kind of instrument collected the data,
how it was collected, and so forth. Next they would
like to scan the available data-files for bad or missing data, or
in other cases, for interesting and sought-after features. For this
purpose we should offer a user friendly interface to the meta-data
data-base or the quick-look data-base. ARM has developed an advanced
Meta Data Navigator (MDN) for this purpose simply because of the
large volume of data collected through this program. For smaller
projects the Light and Life Laboratory offers several
different options for browsing meta-data and quick-looks with
commonplace web-browsers. Through these interfaces the end user
can ``shop'' for data and down-load the desired files and
accompanying meta-data.
The data-files available for the end user are usually higher
level products. By this we mean that the actual data in these
files may be the product of several different physical measurements
that have been used as input to an algorithm for producing what
we call a Value Added Product (VAP). Normally there will be
data-quality flags on level with the parametric data in these
data files. An end user can consult the meta-data for the
data-file in question and devise algorithms that use these
data-quality flags to sift out bad or unwanted data points.
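Such a sifting algorithm can be very simple. The sketch below assumes one flag per data point and a hypothetical flag vocabulary of ``ok'', ``suspect'', and ``bad''; the actual flag values and their meaning must be taken from the meta-data of the data-file in question.

```python
def keep_good(values, flags, accepted=("ok",)):
    """Return only the values whose data-quality flag is in `accepted`."""
    return [v for v, f in zip(values, flags) if f in accepted]


# Made-up data points with their flags.
values = [5.3, 5.1, 9.9, 5.2]
flags = ["ok", "ok", "bad", "suspect"]

strict = keep_good(values, flags)                                # only 'ok' points
lenient = keep_good(values, flags, accepted=("ok", "suspect"))   # also 'suspect'
print(strict, lenient)
```

Leaving the choice of accepted flags to the end user reflects the fact that data unsuitable for one investigation may be perfectly usable for another.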
This page is maintained by
Hans Eide,
phone: (201)-216-5557.