Why is my air quality monitor reporting negative values and what can I do about it?
Seeing a negative value in your air quality data can be a surprise. After all, there is no such thing as negative ozone! So why do air quality monitors (both reference and non-reference instruments) report negative values, why might you care, and what can you do about it?
Why do air quality monitors report negative values?
Your air quality monitor, like any other analytical instrument, may report negative values for one of several reasons:
Calibration error and/or drift
If a monitor is out of calibration, its data may show a slight offset but otherwise appear accurate.
If a monitor is broken or encounters a temporary glitch, a large negative spike can appear in either the raw data or the modeled output. This is especially common when a monitor that is not designed to resist RF interference is placed close to an antenna that emits radio waves, which is increasingly likely given the prevalence of internet-connected devices.
All analytical instruments have an associated measurement uncertainty, and negative values that fall within that uncertainty should be considered ‘good’. For example, an instrument that has a stated uncertainty of 5 ppb can report -5 ppb and still be within its stated uncertainty range. This method follows the EPA Federal Method Detection Limits guidelines written for reference instrumentation.
Poor model selection and/or a bias in training data
Alongside the rise in low-cost, internet-connected air quality monitors is the proliferation of machine learning (and other statistical) models that compute air quality parameters (e.g., ozone, PM2.5, etc.) from a variety of measured variables. Often, the data used to train these models is collected by colocating a low-cost monitor next to a reference instrument for a period of time. However, unless the complete range of expected conditions is observed during the colocation period (this is very hard to do!), you may end up with bias in your training data that can lead to poorly trained models and skewed data.
The exact reasons for your negative values may depend on the measurement type and manufacturer’s specifications. A monitor that frequently reports negative values that exceed its uncertainty range may have a calibration or instrument failure issue. It is important to assess the magnitude and frequency of negative values before proceeding to alter the dataset.
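One way to assess the magnitude and frequency of negative values is to count them and compare them against the instrument's stated uncertainty. The following is a minimal sketch, assuming your data is in a pandas Series; the function name `summarize_negatives`, the `uncertainty` argument, and the sample readings are all hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def summarize_negatives(values: pd.Series, uncertainty: float) -> dict:
    """Summarize how often and how strongly a series goes negative.

    `uncertainty` is the instrument's stated uncertainty in the same
    units as `values` (an assumed input; take it from your spec sheet).
    """
    valid = values.dropna()
    negative = valid[valid < 0]
    # Negative values that are more negative than the stated uncertainty
    beyond = negative[negative < -uncertainty]
    return {
        "n_valid": int(valid.size),
        "n_negative": int(negative.size),
        "fraction_negative": negative.size / valid.size if valid.size else float("nan"),
        "most_negative": float(negative.min()) if negative.size else None,
        "n_beyond_uncertainty": int(beyond.size),
    }


# Made-up ozone readings in ppb, purely for illustration
o3 = pd.Series([12.0, 3.5, -2.0, -7.5, 0.0, np.nan, 20.1, -4.9])
summary = summarize_negatives(o3, uncertainty=5.0)
print(summary)
```

A small fraction of negatives that all sit inside the uncertainty range suggests ordinary measurement noise, while frequent values beyond it point toward a calibration or instrument problem.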
What can you do about negative values in your data?
What you choose to do with your negative values should depend on what you’re trying to accomplish and what the root cause of the negative values is. You may decide the negative values are within the expected uncertainty of the air quality monitor (e.g. within the minimum detection limit or another accepted limit) and there is no need to alter the data. However, certain analysis methods require positive values as inputs, such as Positive Matrix Factorization or Non-Negative Matrix Factorization.
If your analysis requires that negative values be removed, there are a few options available.
- Flag all values below zero as invalid (e.g. NaN, 'N/A', -999)
- Set all values below zero to zero
- Shift all values such that the lowest value is zero
Option 1: NaN all values below zero
Flagging values below zero (or another predetermined value; e.g. the minimum detection limit) is a simple option that is appropriate in many cases (see e.g. South Coast AQMD's 2016 report and Bauerová et al., 2020). “Flagging” in this context refers to setting undesired values to a value (for example, NaN or -999) that your coding routine knows to exclude from your analysis. Directly removing negative values is not recommended: doing so can misorder your dataset, and removing an entire row can discard potentially good data (other reported concentrations in the row may be good).
This approach is especially appropriate if you have reason to believe the majority (or all) of your negative values are truly “bad” (e.g. instrument failures, other outliers) and should not be included in your statistical analysis.
In Python, a quick example might look like the following:
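    import numpy as np

    # `df` is assumed to be an existing pandas DataFrame of your data
    # Set the columns you would like to NaN
    cols = ['col1', 'col2']

    # Create a boolean mask marking the rows where col1 is below zero
    mask = df['col1'] < 0.

    # Set the desired columns to NaN wherever the mask is True
    df.loc[mask, cols] = np.nan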
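The snippet above uses a single column to decide which rows to flag. If you would rather flag each pollutant independently, so that a negative value in one column does not invalidate a good value in another, a sketch along these lines may help. The DataFrame `df_example` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two pollutants with occasional negative readings
df_example = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00",
        "2024-01-01 02:00", "2024-01-01 03:00",
    ]),
    "o3": [10.0, -3.0, 5.0, -0.5],
    "no2": [-1.0, 4.0, 2.0, 6.0],
})

cols = ["o3", "no2"]

# Work on a copy so the raw data stays available for comparison
flagged = df_example.copy()

# DataFrame.where keeps values where the condition is True and
# replaces everything else with NaN, column by column
flagged[cols] = flagged[cols].where(flagged[cols] >= 0)

print(flagged)
```

Every row is kept, so the time ordering is preserved, and pandas skips NaN by default when computing statistics such as `mean()`.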
Option 2: Set all negative values to zero
Setting negative values to zero is a common approach (see for example Kheris et al., 2022) that can be used when you believe the majority (or all) of your negative values are reasonable and the total number of values should be included in your statistical analysis. However, setting negative values to zero fundamentally changes the distribution of your data (that is, it artificially shifts a portion of your data upwards without shifting the entire dataset, as in option 3 below).
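In Python, zeroing negative values might look like the following minimal sketch, which uses the pandas `clip` method. The DataFrame `readings` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two pollutants with occasional negative readings
readings = pd.DataFrame({
    "o3": [10.0, -3.0, 5.0, -0.5],
    "no2": [-1.0, 4.0, 2.0, 6.0],
})

cols = ["o3", "no2"]

# Work on a copy so the raw data stays available for comparison
zeroed = readings.copy()

# clip(lower=0) raises every value below zero up to zero and
# leaves all other values untouched
zeroed[cols] = zeroed[cols].clip(lower=0)

# The mean moves upward because only the negative values were changed
print(readings["o3"].mean(), zeroed["o3"].mean())
```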
Option 3: Shift all values such that the lowest value is zero
Shifting a species' reported values by its most negative value can be appropriate if you have colocated air quality monitors or reference instruments that indicate that your sensor’s reported values may be systematically low. Here, shifting the baseline (lowest value) to zero (or another value) may be appropriate.
This process is a correction (sometimes incorrectly called a calibration) of your data to the colocated instruments. If your colocated monitors or reference instruments also report negative values, however, this method may not work well when you need to remove negative values from your analysis. Additionally, you may get a better correction from techniques more sophisticated than a simple additive shift, such as machine learning models (e.g. Chu et al., 2020; Hagan et al., 2018).
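A simple additive shift could be sketched as follows. The helper name `shift_to_zero` and the sample values are hypothetical, and the choice to leave a series untouched when its minimum is already at or above zero is an assumption of this sketch rather than a rule from the text.

```python
import pandas as pd


def shift_to_zero(values: pd.Series) -> pd.Series:
    """Shift a series upward so that its lowest value becomes zero.

    If the lowest value is already >= 0 (or the series has no valid
    data), the series is returned unchanged.
    """
    lowest = values.min()  # NaN values are skipped by default
    if pd.isna(lowest) or lowest >= 0:
        return values.copy()
    # Subtracting a negative number raises every value by the same amount
    return values - lowest


# Made-up ozone readings in ppb, purely for illustration
o3 = pd.Series([10.0, -3.0, 5.0, -0.5])
shifted = shift_to_zero(o3)
print(shifted.tolist())
```

Because every value moves by the same amount, the differences between data points are preserved, but the mean and every other level-dependent statistic shift with them.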
What do these methods mean for statistical analyses?
Each of the above approaches will alter computed statistics from the data in different ways that are explored below. Removing data points entirely will alter the statistics in which the total number of data points is used, such as calculating averages. Conversely, shifting the dataset or zeroing out negative values will include those data points in the statistical analysis.
For example, imagine a list that includes a negative concentration: [1, 2, -1]. Let’s find the average of this list using all methods we’ve discussed.
The original average of the unaltered list:
(1+2+-1)/3 = 0.67.
Setting negative values to invalid:
The list becomes [1, 2, NaN], and the average is (1+2)/2 = 1.5
Zeroing all negative values:
The list becomes [1, 2, 0], and the average is (1+2+0)/3 = 1
Shifting the data:
The list becomes [2, 3, 0], and the average is (2+3+0)/3 = 1.67.
Your calculated statistics will change with each method, so it is important to think about which method is most appropriate for your use case!
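The worked example above can be reproduced with NumPy. This is only an illustrative sketch using the same three-element list; the variable names are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, -1.0])

# Original, unaltered data
original_mean = values.mean()

# Flag negatives as NaN, then average only the remaining valid values
flagged = np.where(values < 0, np.nan, values)
flagged_mean = np.nanmean(flagged)

# Set negatives to zero
zeroed_mean = np.clip(values, 0, None).mean()

# Shift everything so the lowest value is zero
shifted_mean = (values - values.min()).mean()

print(f"original: {original_mean:.2f}")  # original: 0.67
print(f"flagged:  {flagged_mean:.2f}")   # flagged:  1.50
print(f"zeroed:   {zeroed_mean:.2f}")    # zeroed:   1.00
print(f"shifted:  {shifted_mean:.2f}")   # shifted:  1.67
```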
Negative values can (and will) happen in reported air quality data. It is ultimately up to you how you want to treat these negative values, but for QuantAQ air quality sensors we generally recommend masking the negative values or shifting all of the species' reported concentrations by the amount of the lowest value.