Why is my air quality monitor reporting negative values and what can I do about it?

⚠️
This article was written by Dr. Anna Hodshire.

Seeing a negative value in your air quality data can be a surprise. After all, there is no such thing as negative ozone! So why do air quality monitors (both reference and non-reference instruments) report negative values, why might you care, and what can you do about it?

Why do air quality monitors report negative values?

Your air quality monitor, like any other analytical instrument, may report negative values for one of several reasons:

Calibration error and/or drift

If a monitor is out of calibration, its data may show a slight offset but otherwise appear accurate.

If a monitor is out of calibration and all data are negatively offset, a simple offset correction may do the trick!
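A simple offset correction of this kind can be sketched in Python as follows. This is only an illustration: the column name `o3`, the readings, and the 5 ppb offset are hypothetical, and in practice the offset would have to be estimated separately (for example, from a zero check or a colocation).

```python
import pandas as pd

# Hypothetical ozone readings (ppb) from a monitor that reads uniformly low
df = pd.DataFrame({"o3": [-3.0, -1.0, 4.0, 10.0]})

# Assumed offset (ppb), estimated separately; not a value from any spec sheet
offset_ppb = 5.0

# Apply the additive correction to this one column only
df["o3_corrected"] = df["o3"] + offset_ppb
```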

Instrument failure

If a monitor is broken or encounters a temporary glitch, a large negative spike can occur, either in the raw data or in the modeled output. This is especially common when a monitor that is not designed to resist RF interference is placed close to an antenna that emits radio waves, a frequent situation given the prevalence of internet-connected devices.

If your data includes outliers that are negative, it may be due to noise or other artifacts.

Expected uncertainty

All analytical instruments have a measurement uncertainty associated with them, and negative values that fall within that uncertainty should be considered ‘good’. For example, an instrument that has a stated uncertainty of 5 ppb can report -5 ppb and still be within its stated uncertainty range. This method follows the EPA Federal Method Detection Limits guidelines written for reference instrumentation.

🔆
Note that the EPA released updated guidance in 2016 allowing for lower minimum limits for certain monitored species, citing a need to improve the MDLs and zero drift.
It is possible your negative data falls within the expected uncertainty of your monitor. Make sure to check the manual or spec sheet for your device to gain a better understanding of the expected uncertainty range.

Poor model selection and/or a bias in training data

Alongside the rise in low-cost, internet-connected air quality monitors is the proliferation of machine learning (and other statistical models) to compute air quality parameters (e.g., ozone, PM2.5, etc.) from a variety of measured variables. Often, the data used to train the models are collected by colocating a low-cost monitor next to a reference instrument for a period of time. However, unless the complete range of expected parameters is observed during the colocation period (this is very hard to do!), you may end up with bias in your training data that can lead to poorly trained models and skewed data.

The exact reasons for your negative values may depend on the measurement type and manufacturer’s specifications. A monitor that frequently reports negative values that exceed its uncertainty range may have a calibration or instrument failure issue. It is important to assess the magnitude and frequency of negative values before proceeding to alter the dataset.
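One way to assess the magnitude and frequency of negative values is sketched below. The helper `summarize_negatives`, the sample readings, and the 5 ppb uncertainty are all hypothetical; the uncertainty for a real device should come from its manual or spec sheet.

```python
import numpy as np
import pandas as pd


def summarize_negatives(series: pd.Series, uncertainty: float) -> dict:
    """Summarize how often, and how strongly, a series goes negative."""
    valid = series.dropna()
    n_valid = int(valid.size)
    if n_valid == 0:
        return {"n_valid": 0, "frac_negative": float("nan"),
                "frac_beyond_uncertainty": float("nan"), "min_value": float("nan")}
    return {
        "n_valid": n_valid,
        # Share of valid readings below zero
        "frac_negative": float((valid < 0).sum() / n_valid),
        # Share of valid readings more negative than the stated uncertainty
        "frac_beyond_uncertainty": float((valid < -uncertainty).sum() / n_valid),
        # Most negative (lowest) reading
        "min_value": float(valid.min()),
    }


# Hypothetical ozone readings (ppb), including one missing value
o3 = pd.Series([12.0, -2.0, 3.0, -8.0, 0.5, np.nan])
summary = summarize_negatives(o3, uncertainty=5.0)
```

A small `frac_beyond_uncertainty` suggests occasional glitches, while a large one may point toward a calibration or instrument issue.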


What can you do about negative values in your data?

What you choose to do with your negative values should depend on what you’re trying to accomplish and what the root cause of the negative values is. You may decide the negative values are within the expected uncertainty of the air quality monitor (e.g. within the minimum detection limit or another accepted limit) and there is no need to alter the data. However, certain analysis methods require positive values as inputs, such as Positive Matrix Factorization or Non-Negative Matrix Factorization.

If your analysis requires that negative values be removed, there are a few options available.

  1. NaN all values below zero (that is, flag them with a value such as NaN, 'N/A', or -999)
  2. Set all values below zero to zero
  3. Shift all values such that the lowest value is zero

Option 1: NaN all values below zero

Flagging values below zero (or another predetermined value; e.g. the minimum detection limit) is a simple option that is appropriate in many cases (see e.g. South Coast AQMD's 2016 report and Bauerová et al., 2020). “Flagging” in this context refers to setting undesired values to a value (for example, NaN or -999) that your coding routine knows to exclude from your analysis. Directly removing negative values is not recommended: this can misorder your dataset or can remove potentially good data if you were to remove the entire row (other reported concentrations in the row may be good).

This approach is especially appropriate if you have reason to believe the majority (or all) of your negative values are truly “bad” (e.g. instrument failures, other outliers) and should not be included in your statistical analysis.

🖥️
There are many methods to mask data in programming languages or spreadsheet software - we recommend doing a quick search to see which method fits your needs!

In Python, a quick example might look like the following:

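import numpy as np

# df is assumed to be an existing pandas DataFrame of reported concentrations

# Set the columns you would like to NaN
cols = ['col1', 'col2']

# Create a boolean mask that is True wherever a value is below zero
mask = df[cols] < 0

# Set the masked values to NaN, column by column, keeping the rest of each row
df[cols] = df[cols].mask(mask, np.nan)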

Option 2: Set all negative values to zero

Setting negative values to zero is a common approach (see for example Kheris et al., 2022) that can be used when you believe the majority (or all) of your negative values are reasonable and the total number of values should be included in your statistical analysis. However, setting negative values to zero fundamentally changes the distribution of your data (that is, it artificially shifts a portion of your data upwards without shifting the entire dataset, as in option 3 below).
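In Python, zeroing negative values might look like the sketch below, which uses the pandas `clip` method. The column names and readings are hypothetical.

```python
import pandas as pd

# Hypothetical readings with a few negative values
df = pd.DataFrame({
    "pm25": [3.2, -0.4, 0.0, 7.5],
    "o3": [-1.0, 20.0, 15.0, -2.5],
})

# Columns to process
cols = ["pm25", "o3"]

# Raise every value below zero up to zero; other values are unchanged
df[cols] = df[cols].clip(lower=0)
```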

✴️
You may also determine that a combination of options 1 and 2 is appropriate if there are clear periods of instrument issues that can be flagged and masked and clear periods of negative values within the minimum detection limit that can be set to zero.
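A combined approach could be sketched as follows. The `instrument_issue` column (a flag you would build yourself from known problem periods), the readings, and the 2 ppb detection limit are hypothetical assumptions for illustration.

```python
import pandas as pd

# Assumed minimum detection limit (ppb); check your device's documentation
mdl = 2.0

# Hypothetical readings, with one period marked as an instrument issue
df = pd.DataFrame({
    "o3": [5.0, -1.0, -60.0, 0.5, -2.0],
    "instrument_issue": [False, False, True, False, False],
})

# Flag known problem periods, plus anything more negative than the limit
bad = df["instrument_issue"] | (df["o3"] < -mdl)

# Mask flagged values (NaN), then set the remaining negatives to zero
df["o3_clean"] = df["o3"].mask(bad).clip(lower=0)
```

Masked values stay NaN through the `clip` step, so they remain excluded from later statistics.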

Option 3: Shift all values such that the lowest value is zero

Shifting a species' reported values by its most negative value can be appropriate if you have colocated air quality monitors or reference instruments that indicate that your sensor's reported values may be systematically low. Here, shifting the baseline (lowest value) to zero (or another value) may be appropriate.

📘
Section 3.6 of “The Enhanced Air Sensor Guidebook” walks readers through best practices for colocating your monitors with reference instrumentation.

This process is a correction (sometimes referred to incorrectly as a calibration) of your data to the colocated instruments. If your colocated monitors or reference instruments also include negative values, however, this method may not work well when you need to remove negative values from your analysis. Additionally, you may find that more sophisticated techniques than a simple additive shift, such as machine learning models (e.g. Chu et al., 2020; Hagan et al., 2018), do a better job of correcting your monitor's data to other colocated instruments.

Be sure to only shift data one species (column) at a time - adding an offset to your entire dataset (all columns) will lead to incorrect values!
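A per-column shift might be sketched as follows, with each species moved by its own lowest value and only when that value is negative. The column names and readings are hypothetical.

```python
import pandas as pd

# Hypothetical readings for two species
df = pd.DataFrame({
    "o3": [1.0, 2.0, -1.0],
    "no2": [4.0, 6.0, 5.0],
})

# Shift one species (column) at a time, each by its own minimum
for col in ["o3", "no2"]:
    lowest = df[col].min()
    if lowest < 0:
        # Subtracting a negative minimum moves the lowest value to zero
        df[col] = df[col] - lowest
```

Here only `o3` is shifted; `no2` has no negative values and is left untouched.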

What do these methods mean for statistical analyses?

Each of the above approaches will alter computed statistics from the data in different ways that are explored below. Removing data points entirely will alter the statistics in which the total number of data points is used, such as calculating averages. Conversely, shifting the dataset or zeroing out negative values will include those data points in the statistical analysis.

For example, imagine a list that includes a negative concentration: [1, 2, -1]. Let’s find the average of this list using all methods we’ve discussed.

The original average of the unaltered list:
(1+2+-1)/3 = 0.67.

Setting negative values to invalid:
The list becomes [1, 2,  NaN], and the average is (1+2)/2  = 1.5

Zeroing all negative values:
The list becomes [1, 2, 0], and the average is (1+2+0)/3 = 1

Shifting the data:
The list becomes [2, 3, 0], and the average is (2+3+0)/3 = 1.67.
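The arithmetic above can be reproduced with a short NumPy sketch. The variable names are illustrative only.

```python
import numpy as np

values = np.array([1.0, 2.0, -1.0])

# Original average of the unaltered list (about 0.67)
original_mean = values.mean()

# Flag negatives as NaN, then average only the remaining values (1.5)
masked = np.where(values < 0, np.nan, values)
masked_mean = np.nanmean(masked)

# Zero out negatives; all three points still count (1.0)
zeroed_mean = np.clip(values, 0, None).mean()

# Shift everything so the lowest value is zero (about 1.67)
shifted_mean = (values - values.min()).mean()
```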

Your calculated statistics will change with each method, so it is important to think about which method is most appropriate for your use case!

🔆
Other air quality sensor and instrument manufacturers may have other recommended practices for their devices, so do assess their best practices when analyzing their datasets!

Summary

Negative values can (and will) happen in reported air quality data. It is ultimately up to you how you want to treat these negative values, but for QuantAQ air quality sensors we generally recommend masking the negative values or shifting all of the species' reported concentrations by the amount of the lowest value.