Rebasing time series in python
Quite often, you may find yourself with two sets of data - both time series - that do not share the same time index. There are a couple of common use cases where this may arise in air pollution/atmospheric chemistry:
- You have two instruments that collect data at the same frequency (e.g., 1 Hz), but one is slightly ahead of/behind the other.
- You have data sets whose time indexes are at different frequencies (e.g., a reference data set that is available at hourly resolution while you are collecting data at 1-minute frequency).
Both of the above examples require resampling each individual data set to a common time base. Fortunately, there are incredibly useful open-source libraries to help you do this! Here, we will use the pandas
library to demonstrate how you can easily rebase your time series and get on with your analysis.
A bit about pandas
If you aren't already familiar with the pandas library, you should definitely get acquainted. It is an incredible library for cleaning, manipulating, and analyzing tabular data. We will utilize some key pandas functionality throughout this post, namely the "resample" and "interpolate" methods for DataFrames.
Resampling your data
The DataFrame.resample method allows you to conveniently alter the frequency of your time series. Let's take a quick peek at what this means in practice by creating a time series with 1-minute resolution:
import pandas as pd
# Create the index
index = pd.date_range("1/1/2022", periods=10, freq="1min")
# Create a series with the above as an index
s = pd.Series(range(10), index=index)
2022-01-01 00:00:00 0
2022-01-01 00:01:00 1
2022-01-01 00:02:00 2
2022-01-01 00:03:00 3
2022-01-01 00:04:00 4
2022-01-01 00:05:00 5
2022-01-01 00:06:00 6
2022-01-01 00:07:00 7
2022-01-01 00:08:00 8
2022-01-01 00:09:00 9
Freq: T, dtype: int64
Using the resample
method, we can either downsample the data:
s.resample("5min").mean()
2022-01-01 00:00:00 2.0
2022-01-01 00:05:00 7.0
Freq: 5T, dtype: float64
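(A quick aside before we upsample: mean() is only one choice when downsampling. The resampler accepts any standard aggregation, and agg can compute several at once. A minimal sketch using the same series as above:)

```python
import pandas as pd

# same 10-point, 1-minute series as above
s = pd.Series(range(10), index=pd.date_range("1/1/2022", periods=10, freq="1min"))

# compute several aggregations per 5-minute bin in one pass
stats = s.resample("5min").agg(["sum", "max", "count"])
print(stats)
```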
or, we can upsample the data:
s.resample("30s").mean()
2022-01-01 00:00:00 0.0
2022-01-01 00:00:30 NaN
2022-01-01 00:01:00 1.0
2022-01-01 00:01:30 NaN
2022-01-01 00:02:00 2.0
2022-01-01 00:02:30 NaN
2022-01-01 00:03:00 3.0
2022-01-01 00:03:30 NaN
2022-01-01 00:04:00 4.0
2022-01-01 00:04:30 NaN
2022-01-01 00:05:00 5.0
2022-01-01 00:05:30 NaN
2022-01-01 00:06:00 6.0
2022-01-01 00:06:30 NaN
2022-01-01 00:07:00 7.0
2022-01-01 00:07:30 NaN
2022-01-01 00:08:00 8.0
2022-01-01 00:08:30 NaN
2022-01-01 00:09:00 9.0
Freq: 30S, dtype: float64
We can resample up or down to whatever frequency we like! You will notice above that when we upsample the data, the new rows are filled with NaN.
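If NaN-filled rows aren't what you want, interpolation (shown next) is one option; for step-like signals such as instrument status flags or setpoints, forward-filling the last observation is often the better choice. A minimal sketch:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range("1/1/2022", periods=10, freq="1min"))

# carry the last observed value forward into the new 30-second rows
filled = s.resample("30s").ffill()
print(filled.head(4))
```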
If we want to get really fancy (we do!), then we can upsample the data AND interpolate between the original data points at the same time!
s.resample("30s").interpolate("time")
2022-01-01 00:00:00 0.0
2022-01-01 00:00:30 0.5
2022-01-01 00:01:00 1.0
2022-01-01 00:01:30 1.5
2022-01-01 00:02:00 2.0
2022-01-01 00:02:30 2.5
2022-01-01 00:03:00 3.0
2022-01-01 00:03:30 3.5
2022-01-01 00:04:00 4.0
2022-01-01 00:04:30 4.5
2022-01-01 00:05:00 5.0
2022-01-01 00:05:30 5.5
2022-01-01 00:06:00 6.0
2022-01-01 00:06:30 6.5
2022-01-01 00:07:00 7.0
2022-01-01 00:07:30 7.5
2022-01-01 00:08:00 8.0
2022-01-01 00:08:30 8.5
2022-01-01 00:09:00 9.0
Freq: 30S, dtype: float64
While we used a time-based interpolation above, there are a number of options (see the pandas documentation for DataFrame.interpolate), including "polynomial" and "cubicspline":
s.resample("5s").interpolate("cubicspline")[0:5]
2022-01-01 00:00:00 0.000000
2022-01-01 00:00:05 0.083333
2022-01-01 00:00:10 0.166667
2022-01-01 00:00:15 0.250000
2022-01-01 00:00:20 0.333333
Freq: 5S, dtype: float64
Now that we've seen how to resample to a new time base and interpolate between any time interval, we should have the tools needed to get two data sets onto the same time index. While there are a number of ways one could go about doing this, the following strategy tends to work quite well.
Let's create two data sets with different time axes. First, we create a data set with two columns, each containing 100 random integers between 0 and 99, indexed at 1-minute resolution.
import numpy as np
# Create a dataframe with 2 columns at a 1-min freq
df1 = pd.DataFrame(
data=np.random.randint(0, 100, size=(100, 2)),
index=pd.date_range("1/1/2022", periods=100, freq="1min"),
columns=list("AB")
)
A B
2022-01-01 00:00:00 38 95
2022-01-01 00:01:00 48 67
2022-01-01 00:02:00 8 39
2022-01-01 00:03:00 23 22
2022-01-01 00:04:00 94 62
... .. ..
2022-01-01 01:35:00 51 69
2022-01-01 01:36:00 61 96
2022-01-01 01:37:00 78 49
2022-01-01 01:38:00 46 67
2022-01-01 01:39:00 49 65
[100 rows x 2 columns]
Next, we'll create a data set with two columns of 1,000 random floats at a 1Hz frequency - and just for fun, we will go ahead and offset the index by an additional 5 milliseconds (I mean, why not?).
df2 = pd.DataFrame(
data=np.random.random(size=(1000, 2)),
index=pd.date_range("1/1/2022", periods=1000, freq="1s"),
columns=list("CD")
)
# adjust the index
df2.index = df2.index + pd.Timedelta("5ms")
C D
2022-01-01 00:00:00.005 0.497613 0.644631
2022-01-01 00:00:01.005 0.117858 0.160477
2022-01-01 00:00:02.005 0.754900 0.362505
2022-01-01 00:00:03.005 0.933868 0.548447
2022-01-01 00:00:04.005 0.272889 0.209575
... ... ...
2022-01-01 00:16:35.005 0.759741 0.119596
2022-01-01 00:16:36.005 0.412770 0.451684
2022-01-01 00:16:37.005 0.788443 0.860958
2022-01-01 00:16:38.005 0.221932 0.591946
2022-01-01 00:16:39.005 0.101640 0.419326
[1000 rows x 2 columns]
Now that we have our two datasets, it's easy to merge
them into a single dataframe by performing an outer join:
merged = pd.merge(df1, df2, left_index=True,
                  right_index=True, how="outer")
A B C D
2022-01-01 00:00:00.000 38.0 95.0 NaN NaN
2022-01-01 00:00:00.005 NaN NaN 0.497613 0.644631
2022-01-01 00:00:01.005 NaN NaN 0.117858 0.160477
2022-01-01 00:00:02.005 NaN NaN 0.754900 0.362505
2022-01-01 00:00:03.005 NaN NaN 0.933868 0.548447
... ... ... ... ...
2022-01-01 01:35:00.000 51.0 69.0 NaN NaN
2022-01-01 01:36:00.000 61.0 96.0 NaN NaN
2022-01-01 01:37:00.000 78.0 49.0 NaN NaN
2022-01-01 01:38:00.000 46.0 67.0 NaN NaN
2022-01-01 01:39:00.000 49.0 65.0 NaN NaN
[1100 rows x 4 columns]
You will notice that by performing the outer join, we create a row for each point in time that occurs in either of the dataframes; all empty spots are automatically filled with NaN.
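As an aside, when each row in one frame just needs the single nearest row from the other (rather than a fully interpolated union), pd.merge_asof can pair them directly, optionally within a tolerance. A small sketch with toy frames (the names and values here are made up for illustration):

```python
import pandas as pd
import numpy as np

left = pd.DataFrame({"A": range(5)},
                    index=pd.date_range("1/1/2022", periods=5, freq="1min"))
right = pd.DataFrame({"C": np.arange(5) / 10},
                     index=pd.date_range("1/1/2022", periods=5, freq="1min")
                           + pd.Timedelta("5ms"))

# match each left row to the nearest right row, but only within 1 second
nearest = pd.merge_asof(left, right, left_index=True, right_index=True,
                        direction="nearest", tolerance=pd.Timedelta("1s"))
print(nearest)
```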
Next, we want to resample the dataframe to the highest frequency time base:
merged = merged.resample("5ms").mean()
A B C D
2022-01-01 00:00:00.000 38.0 95.0 NaN NaN
2022-01-01 00:00:00.005 NaN NaN 0.497613 0.644631
2022-01-01 00:00:00.010 NaN NaN NaN NaN
2022-01-01 00:00:00.015 NaN NaN NaN NaN
2022-01-01 00:00:00.020 NaN NaN NaN NaN
... ... ... ... ...
2022-01-01 01:38:59.980 NaN NaN NaN NaN
2022-01-01 01:38:59.985 NaN NaN NaN NaN
2022-01-01 01:38:59.990 NaN NaN NaN NaN
2022-01-01 01:38:59.995 NaN NaN NaN NaN
2022-01-01 01:39:00.000 49.0 65.0 NaN NaN
[1188001 rows x 4 columns]
Next, we will interpolate across the entire time axis using a time-based interpolation:
merged = merged.interpolate("time")
A B C D
2022-01-01 00:00:00.000 38.000000 95.000000 NaN NaN
2022-01-01 00:00:00.005 38.000833 94.997667 0.497613 0.644631
2022-01-01 00:00:01.005 38.167500 94.531000 0.117858 0.160477
2022-01-01 00:00:02.005 38.334167 94.064333 0.754900 0.362505
2022-01-01 00:00:03.005 38.500833 93.597667 0.933868 0.548447
... ... ... ... ...
2022-01-01 01:35:00.000 51.000000 69.000000 0.101640 0.419326
2022-01-01 01:36:00.000 61.000000 96.000000 0.101640 0.419326
2022-01-01 01:37:00.000 78.000000 49.000000 0.101640 0.419326
2022-01-01 01:38:00.000 46.000000 67.000000 0.101640 0.419326
2022-01-01 01:39:00.000 49.000000 65.000000 0.101640 0.419326
[1100 rows x 4 columns]
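Notice in the output above that C and D simply hold their last observed value for every timestamp after df2 ends: by default, interpolate fills trailing NaNs forward with the last valid value. If you would rather leave points outside the observed range empty, the limit_area="inside" option restricts filling to gaps that are bracketed by real data. A small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan],
              index=pd.date_range("1/1/2022", periods=4, freq="1min"))

# only fill NaNs surrounded by valid values; the trailing NaN is left alone
inside = s.interpolate("time", limit_area="inside")
print(inside)
```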
You could also combine the last two steps together using a single call:
merged = merged.resample("5ms").interpolate("time")
At this point, we have completed the task! We have a single time axis with all columns. One last step that can be useful is to downsample onto a lower-frequency time base. For example, if we wanted 5-minute resolution data from the 200 Hz (5 ms) data above, we could do the following:
merged.resample("5min").mean()
A B C D
2022-01-01 00:00:00 44.231492 48.583820 0.488557 0.535728
2022-01-01 00:05:00 34.589213 36.806525 0.519729 0.532371
2022-01-01 00:10:00 54.106393 41.346672 0.489820 0.520780
2022-01-01 00:15:00 63.422492 38.333746 0.493292 0.513556
2022-01-01 00:20:00 56.600000 53.600000 0.101640 0.419326
2022-01-01 00:25:00 48.800000 58.400000 0.101640 0.419326
2022-01-01 00:30:00 32.000000 54.200000 0.101640 0.419326
2022-01-01 00:35:00 76.800000 68.000000 0.101640 0.419326
2022-01-01 00:40:00 50.000000 43.000000 0.101640 0.419326
2022-01-01 00:45:00 29.600000 49.600000 0.101640 0.419326
2022-01-01 00:50:00 48.200000 54.200000 0.101640 0.419326
2022-01-01 00:55:00 33.600000 32.800000 0.101640 0.419326
2022-01-01 01:00:00 72.200000 57.800000 0.101640 0.419326
2022-01-01 01:05:00 48.200000 52.600000 0.101640 0.419326
2022-01-01 01:10:00 44.800000 69.800000 0.101640 0.419326
2022-01-01 01:15:00 59.800000 64.200000 0.101640 0.419326
2022-01-01 01:20:00 35.400000 47.200000 0.101640 0.419326
2022-01-01 01:25:00 55.200000 38.000000 0.101640 0.419326
2022-01-01 01:30:00 49.800000 59.200000 0.101640 0.419326
2022-01-01 01:35:00 57.000000 69.200000 0.101640 0.419326
With the above knowledge, you should be able to upsample, downsample, resample, interpolate, and rebase your time series any way you see fit.
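Putting it all together, the whole recipe can be wrapped in a small helper. The sketch below (rebase is a made-up name, not a pandas function) skips the intermediate high-frequency grid: because "time" interpolation weights by actual elapsed time, interpolating directly at the union timestamps gives the same values at those points. It also uses limit_area="inside" so columns are not carried forward past their last real observation:

```python
import pandas as pd
import numpy as np

def rebase(df_a, df_b, freq):
    """Outer-join two time-indexed frames, interpolate in time,
    then average onto a common `freq` grid."""
    merged = pd.merge(df_a, df_b, left_index=True,
                      right_index=True, how="outer")
    merged = merged.interpolate("time", limit_area="inside")
    return merged.resample(freq).mean()

# toy inputs: 1-minute data and 6-second data covering ~10 minutes
df1 = pd.DataFrame({"A": range(10)},
                   index=pd.date_range("1/1/2022", periods=10, freq="1min"))
df2 = pd.DataFrame({"C": np.arange(100, dtype=float)},
                   index=pd.date_range("1/1/2022", periods=100, freq="6s"))

out = rebase(df1, df2, "5min")
print(out)
```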