Production Data Acquisition and Quality Control

  • In geophysics one often encounters projects involving continuous data acquisition, for example in environmental monitoring.
  • An important part of these projects is to ensure that the data collected over long periods of time are of known and reasonable quality, and that the end-users of the data can readily access the information about factors that may have influenced the data quality over time.
  • Much too often this aspect of continuous data acquisition is neglected because of limited resources.

    The Three Golden Rules of Production Data Acquisition:


  • Contact information

    Visit the Light and Life Laboratory web-site at http://odin.mat.stevens-tech.edu/
    Here you will find contact information and more in-depth descriptions of our data quality effort and current projects, as well as of the services we can offer.



    Without data quality, data quantity is useless

    Consider a classic laboratory experiment designed to measure some physical quantity repeatedly in order to establish its value empirically. The scientist undertaking the measurement will take great care in documenting every step performed in reaching the measured value, knowing that her peers will want to reproduce and check the published result. The value measured could be ``5.3 W/m^2'' but this in itself has no meaning unless accompanied by additional information. This additional information, the lab journals with the detailed description of the experimental set-up, is called meta-data: in other words, data about the data.

    Today we often want data from sources (instruments) operating on a continuous basis. This is very frequently the case in geophysical measurements, for example in meteorology, where the statistical evolution of some quantity over long periods of time, or the trend, is the ultimate objective. On the one hand, it is clear that we cannot dedicate the same level of attention to each and every one of the single measurements making up such an extended time series as we would to a single measurement in the laboratory. On the other hand, putting an instrument up and leaving it without any further attention is also a waste of resources. Why? Because the data collected are basically useless without meta-data.

    Most researchers are happy if they can obtain the funding needed simply to purchase an instrument. If the instrument is to be used for an extended time-series type measurement, allocation of resources (man-hours) to the continued task of checking the instrument and the data it is producing is often given low priority or even neglected. This task, commonly referred to as data quality control, is vital if the product, i.e. the measured data, is to have scientific value. An engineer setting up an instrument could be satisfied seeing that a voltage is delivered to the out-put terminal, but if the same engineer forgot to remove the cap from the lens, the error could go undiscovered until more elaborate checks on the data are made. Routinely performing such checks will tell if the instrument is delivering data of known and reasonable quality.

    The Light and Life Laboratory Approach

    What resources should be allocated to data quality control and where should they be invested? It is clear that we want to maximize the return from the resources actually allocated. We have found that the greater part of the job is done before the production data collection even starts, and that this effort greatly influences the level of resources needed during the production phase itself. Below is an attempt to summarize the key aspects of the two phases involved:

    A larger part of the resources should normally be spent on the development phase, but depending on the length of the measurement program and the level of data quality expected, the production phase could conceivably take up considerable amounts of resources as well. This is why it is important to facilitate the production phase tasks as much as possible from the start of the program.

    Next we will discuss some of the tools and techniques that we have developed in order to maximize the production of data with known and reasonable quality while at the same time adapting the amount of resources spent on achieving this to the desired level. We shall use the expression data-stream to mean the continuous recording of a single physical quantity, whether it is recorded by one single instrument sensor, produced as an aggregate from the out-put of several instrument sensors, or over time, by a changing set of instrument sensors. Note that one instrument can have several data-streams associated with it. The important part is that we must always be able to identify, down to the serial number, the actual sensor(s) that produced the physical quantity we are reporting in our data products.

    Time is Data - Near Real Time Operation

    The actual instrument configuration for a specific (time series type) measurement depends on several factors. The most important of these are the level of accuracy desired for the measurements, the desired data availability, and the available resources for implementation. For example, if it is a stated goal that 90% of the data should be of a certain quality, then most likely one would have to perform very frequent instrument and data quality checks. Collecting or checking the data on a monthly basis would not do, as a month could go by without the discovery of a problem, and we would already be short of the stated goal. Near real time operation is hence a crucial part of the production phase. The challenge lies in facilitating this in such a way that we do not spend too much of our resources.
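
    The arithmetic behind this argument can be sketched as follows. The sketch assumes, purely for illustration, a one-year program and a worst case in which every fault spoils the data for one full check interval. The function name and all numbers are hypothetical and are not taken from any Light and Life Laboratory tool.

```python
def max_tolerated_faults(goal, check_interval_days, program_days=365.0):
    """Worst-case number of faults a program can absorb and still meet its goal.

    Assumes each fault spoils all data for one full check interval,
    i.e. it appears right after a check and is only found at the next one.
    """
    budget_days = (1.0 - goal) * program_days  # days of bad data we can afford
    return int(budget_days // check_interval_days)


# Compare three hypothetical check intervals against a 90% quality goal.
for label, interval in [("monthly", 30.0), ("weekly", 7.0), ("daily", 1.0)]:
    print(label, max_tolerated_faults(0.90, interval))
```

    Under these assumptions a 90% goal leaves a budget of about 36 days of bad data per year, so monthly checks can absorb only a single undetected fault, while weekly checks can absorb five.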

    Instrument Maintenance Routines

    Step one in achieving a data quality goal is to implement regular instrument maintenance routines. The frequency of the various maintenance tasks should be determined according to their resource utilization, or yield, in relation to the data quality goal. Doing a full instrument calibration too often may result in a sub-optimal data quality level simply because of the increased down-time when the instrument is not taking data due to the calibration process. Whenever possible, physical inspection and simple maintenance tasks should be performed on a daily or weekly basis. The operators performing these simple tasks must record all their observations and maintenance actions in an instrument-by-instrument log. This instrument log must also keep a record of the instrument calibration history. The operator log should either be a part of a larger meta-data log or at least be able to interface with it. As an example we can take the case of an instrument operator reporting that she observed and then removed something obscuring the sensor of an instrument. Often such a problem cannot be spotted by either an automatic or a manual check of the data. If the operator log is not a part of the meta-data generating process, data that should be flagged as suspect could be delivered to an end user as being of known and reasonable quality. Clear connections between the individual instruments (sensors) and their associated data-streams must be defined to accurately map operator log entries to data-stream meta-data.
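
    A minimal sketch of such a mapping is given here, assuming that every operator log entry carries the serial number of the sensor it concerns. The class, the data-stream names and the serial numbers are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    """One operator log entry (hypothetical layout)."""
    sensor_serial: str
    start: datetime
    end: datetime
    note: str


# Hypothetical registry: data-stream name -> serial numbers of contributing sensors.
STREAM_SENSORS = {
    "downwelling_shortwave": {"SN-0001"},
    "net_radiation": {"SN-0001", "SN-0002"},
    "air_temperature": {"SN-0003"},
}


def affected_streams(entry, stream_sensors=STREAM_SENSORS):
    """Return the names of all data-streams that use the sensor named in the entry."""
    return sorted(
        name for name, serials in stream_sensors.items()
        if entry.sensor_serial in serials
    )


entry = LogEntry(
    sensor_serial="SN-0001",
    start=datetime(2000, 1, 5, 9, 0),
    end=datetime(2000, 1, 5, 9, 20),
    note="Obstruction on sensor observed and removed",
)
streams = affected_streams(entry)
print(streams)
```

    Because one sensor can feed several data-streams, a single log entry may have to be attached to the meta-data of more than one of them.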

    The Light and Life Laboratory has developed tools and standards for instrument operator procedures and logs. These have largely been the result of our extended field experience in working with the DOE Atmospheric Radiation Measurement (ARM) program and our own on-going environmental monitoring platforms. The operator logs are kept in a data-base that lets the operators make their entries in the field. The operator log data-base will also export meta-data information on a data-stream basis. The log entries are specified with a start-time and an end-time so that data quality flags can be produced on level with the parametric data. Other views and statistics are available through the operator log data-base as well. These are useful for spotting recurrent problems with specific instruments and thus help us make decisions about the allocation of resources.
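
    The step from log entries with a start-time and an end-time to per-sample quality flags can be sketched as below. This is an illustration under assumed conventions (flag 1 means suspect, 0 means no operator remark) and does not reproduce the actual operator log data-base.

```python
from datetime import datetime, timedelta


def flag_samples(sample_times, suspect_intervals):
    """Return one flag per sample: 1 if it falls inside any [start, end] interval, else 0."""
    return [
        int(any(start <= t <= end for start, end in suspect_intervals))
        for t in sample_times
    ]


# Six hypothetical samples, ten minutes apart, starting at 09:00.
t0 = datetime(2000, 1, 5, 9, 0)
times = [t0 + timedelta(minutes=10 * i) for i in range(6)]

# One hypothetical log entry marking 09:15 to 09:35 as suspect.
intervals = [(datetime(2000, 1, 5, 9, 15), datetime(2000, 1, 5, 9, 35))]

flags = flag_samples(times, intervals)
print(flags)
```

    Only the samples at 09:20 and 09:30 fall inside the logged interval, so only those two are flagged.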

    Automated Near Real Time Data Collection and Checks

    Production phase data quality checks would consume a prohibitive amount of resources if the process could not be automated to a large extent. The most costly form of resource utilization is the use of actual human labour and this should thus be kept to a minimum. As stated above, the more we can perform our data quality control in near real time the better. As a result it is obvious that near real time automated data collection and checks are an ultimate goal.

    Automated data collection is a relatively easy task today with our wired society and inexpensive electronics. The Light and Life Laboratory has available expertise for implementing automated data collection for practically any situation. Working with data collection from remote sites like the North Slope of Alaska (NSA) we have implemented automated data collection modules that are ``smart'' in that they have handlers for a wide range of exceptional cases. Examples of such exceptional cases would be temporary disruption of network services, slow or poor data-lines, and large data-volumes. Most of these cases are handled by ``smart'' schedulers that will not only adapt to the situation, but also alert personnel if a problem needs human intervention.
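
    One common way to build this kind of behaviour is to retry a failed transfer with increasing delays and to raise an alert only when the retries are exhausted. The sketch below shows that general pattern; it is an assumption about how such a scheduler could work, not a description of the modules used at the NSA sites, and all names in it are hypothetical.

```python
import time


def collect_with_retry(fetch, alert, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    If every attempt fails, alert() is called with a message and None is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # network errors such as ConnectionError derive from OSError
            if attempt == max_attempts:
                alert(f"collection failed after {attempt} attempts: {exc}")
                return None
            sleep(base_delay_s * 2 ** (attempt - 1))


# Demonstration with a hypothetical data source that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network unreachable")
    return b"raw-data"


delays, alerts = [], []
# Delays are recorded instead of slept so that the demonstration runs instantly.
result = collect_with_retry(flaky_fetch, alerts.append, sleep=delays.append)
print(result, delays, alerts)
```

    Temporary disruptions are absorbed without anyone being notified, and personnel are alerted only when the problem persists.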

    Automated data quality checking is the most difficult task because no algorithm devised can ever be absolutely right. Even a highly sophisticated data quality checking algorithm will fail to flag problems that do not degrade the data enough to trigger the flagging criteria. We can again use the example of the operator observing something obscuring an instrument sensor, e.g. frost on a dome or window. While the data will be of questionable value, the automated quality checking algorithm will fail to detect this because the instrument is still reporting within the valid range of operation. Conversely, a data quality checking algorithm may flag good data as bad if the algorithm is unable to handle special cases. An example of this could be extreme weather resulting in measured data points that are out-side a minimum or maximum value determined by the checking algorithm. We have found that a set of relatively simple automated checks is enough to achieve a high degree of automated near real time data quality checking, and that additional inspection by data quality control personnel, either triggered by a notification generated by automated processes or performed on a regular basis, is sufficient to catch the majority of the remaining problems that either escape the automated quality checks or are wrongly flagged by them.
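
    As an illustration of how simple such a check can be, the sketch below implements a minimum/maximum range check together with a rule for notifying personnel. The flag values, the limits and the 10% notification threshold are assumptions made for this example only.

```python
def range_check(values, vmin, vmax):
    """Return one flag per value: 0 = ok, 1 = missing, 2 = outside [vmin, vmax]."""
    flags = []
    for v in values:
        if v is None:
            flags.append(1)
        elif v < vmin or v > vmax:
            flags.append(2)
        else:
            flags.append(0)
    return flags


def needs_review(flags, max_bad_fraction=0.1):
    """True if the share of flagged samples is high enough to notify personnel."""
    if not flags:
        return False
    bad = sum(1 for f in flags if f != 0)
    return bad / len(flags) > max_bad_fraction


# Hypothetical irradiance samples in W/m^2 with one gap and one spike.
samples = [310.2, 305.7, None, 1500.0, 298.4]
check_flags = range_check(samples, vmin=0.0, vmax=1400.0)
print(check_flags, needs_review(check_flags))
```

    A check of this kind would pass the frost-covered sensor described above, since its readings stay within range, which is why the operator log and manual inspection remain necessary.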

    Data Visualization - Quick-looks

    Data visualization tools are a critical part of the data quality control process. When data quality personnel manually inspect data they almost exclusively rely on these visualizations to efficiently inspect large amounts of data. More in-depth investigations are usually initiated if a potential data quality problem is observed in the visualizations. Data visualizations also provide a tool for browsing large amounts of data when end users are ``shopping'' for good data to use in research projects.

    The Light and Life Laboratory approach has always been to provide visualizations of all data-streams on a near real time basis, usually daily or on demand, and then to archive these for future reference. We call these visualization products quick-looks because they provide a basic overview of a time-series of data. Depending on the volume of data channeled through a data quality process, the number of these quick-looks can become significant after only a short time. For our data quality efforts with the ARM NSA sites we now have over 80,000 quick-looks archived. To solve the problem of managing all these quick-looks and to provide a convenient interface for quick-look browsing, we have developed a quick-look data-base with a web-browser interface for searching through the available images and displaying them. The quick-look data-base will let data quality operators attach statements about the data quality and general comments to the quick-look images. One can later search for the quick-looks that have problems noted or additional information attached to them.

    Here is a schematic figure that illustrates the flow of data (black lines) and meta-data (red lines) in the production data acquisition and quality control process.

    Instrument Calibration

    Regularly calibrating instruments is of course very important, but it could also be very costly and result in significant loss of data while the calibration is performed. How often should one calibrate instruments? It depends on the instrument, but the period between factory calibrations can be extended by operating more than one of the same type of instrument at the same location, or by regularly comparing several instruments measuring the same quantity. This process is often referred to as instrument inter-calibration. In addition, simple checks like measuring the ``dark voltage'', i.e. the out-put when the instrument should measure zero or another known quantity, can give an idea about drift in the calibration constants.
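
    Both kinds of check reduce to very little arithmetic, as the sketch below suggests. It assumes that dark readings and simultaneous readings from two co-located instruments are already available as plain lists; the function names and the sample values are hypothetical.

```python
from statistics import mean


def dark_offset_drift(dark_readings, reference_dark):
    """Mean dark signal minus the dark signal recorded at the last calibration."""
    return mean(dark_readings) - reference_dark


def intercomparison_ratio(readings_a, readings_b):
    """Mean ratio of simultaneous readings from two instruments measuring the same quantity."""
    ratios = [a / b for a, b in zip(readings_a, readings_b) if b != 0]
    if not ratios:
        raise ValueError("no usable pairs of readings")
    return mean(ratios)


# Hypothetical dark voltages (mV) and the value noted at the last calibration.
drift = dark_offset_drift([0.12, 0.15, 0.13], reference_dark=0.10)

# Hypothetical simultaneous readings from two instruments.
ratio = intercomparison_ratio([100.0, 200.0, 300.0], [98.0, 196.0, 294.0])

print(round(drift, 3), round(ratio, 3))
```

    A dark offset that grows over time, or a ratio that moves away from its value at the last calibration, suggests that the calibration constants of at least one instrument are drifting.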

    Keeping Records

    It cannot be stressed enough how important it is to make careful logs of all the meta-data, preferably in a way that makes this information easily available for end users many years down the line. What may seem unimportant today could be crucial for a future investigation that utilizes the data differently than originally planned. Just as one would never delete any raw data, one should never delete any meta-data. The meta-data is useless, and so is the parametric data, if end users cannot access the meta-data.

    Giving it All to The End User

    When end users ``shop'' for data they would like to know a lot about it before they go through the trouble of e.g. down-loading and processing it. At first they would want to know what is available, what format it is in, what kind of instrument collected the data, how it was collected, and so forth. Next they would like to scan the available data-files for bad or missing data or, in other cases, for interesting and sought-after features. For this purpose we should offer a user friendly interface to the meta-data data-base or the quick-look data-base. ARM has developed an advanced Meta Data Navigator (MDN) for this purpose simply because of the large volume of data collected through this program. For smaller projects The Light and Life Laboratory has available several different options for browsing meta-data and quick-looks with commonplace web-browsers. Through these interfaces the end user can ``shop'' for data and down-load the desired files and accompanying meta-data.

    The data-files available for the end user are usually higher level products. By this we mean that the actual data in these files may be the product of several different physical measurements that have been used as input to an algorithm for producing what we call a Value Added Product (VAP). Normally there will be data-quality flags on level with the parametric data in these data files. An end user can consult the meta-data for the data-file in question and devise algorithms that use these data-quality flags to sift out bad or unwanted data points.
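
    An end-user algorithm of this kind can be as short as the sketch below. It assumes one integer flag per data point with 0 meaning good; the real flag definitions must be taken from the meta-data of the data-file in question.

```python
def keep_good(values, flags, accepted=(0,)):
    """Return only the values whose quality flag is in the accepted set."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    return [v for v, f in zip(values, flags) if f in accepted]


# Hypothetical data points with their quality flags (0 = good, 2 = out of range).
values = [310.2, 305.7, 12.4, 298.4]
quality_flags = [0, 0, 2, 0]

good = keep_good(values, quality_flags)
print(good)
```

    Passing a different accepted set lets the user decide how strict the selection should be, for example by also keeping points that are flagged as merely suspect.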

     


     This page is maintained by Hans Eide,
     phone: (201)-216-5557.