Rebasing time series in python
Quite often, you may find yourself with two sets of data - both time series - that do not share the same time index. There are a couple of common use cases where this may arise in air pollution/atmospheric chemistry:
- You have two instruments that collect data at the same frequency (e.g., 1 Hz), but one is slightly ahead of/behind the other.
- You have data sets whose time indexes are at different frequencies (e.g., a reference data set that is available at hourly resolution while you are collecting data at 1-minute frequency).
Both of the above examples require resampling each individual data set to a common time base. Fortunately, there are incredibly useful open-source libraries to help you do this! Here, we will use the pandas
library to demonstrate how you can easily rebase your time series and get on with your analysis.
A bit about pandas
If you aren't already familiar with the pandas library, you should definitely get acquainted. It is an incredible library for cleaning, manipulating, and analyzing tabular data. We will utilize some key pandas functionality throughout this post, namely the "resample" and "interpolate" methods for DataFrames.
Resampling your data
The DataFrame.resample method allows you to conveniently alter the frequency of your time series. Let's take a quick peek at what this means in practice by creating a time series with 1-minute resolution:
import pandas as pd
# Create the index
index = pd.date_range("1/1/2022", periods=10, freq="1min")
# Create a series with the above as an index
s = pd.Series(range(10), index=index)
2022-01-01 00:00:00 0
2022-01-01 00:01:00 1
2022-01-01 00:02:00 2
2022-01-01 00:03:00 3
2022-01-01 00:04:00 4
2022-01-01 00:05:00 5
2022-01-01 00:06:00 6
2022-01-01 00:07:00 7
2022-01-01 00:08:00 8
2022-01-01 00:09:00 9
Freq: T, dtype: int64
Using the resample
method, we can either downsample the data:
s.resample("5min").mean()
2022-01-01 00:00:00 2.0
2022-01-01 00:05:00 7.0
Freq: 5T, dtype: float64
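(A quick aside before we upsample: mean() is only one choice when downsampling. The resampler accepts any standard aggregation, and agg can compute several at once. A minimal sketch using the same series as above:)

```python
import pandas as pd

# same 10-point, 1-minute series as above
s = pd.Series(range(10), index=pd.date_range("1/1/2022", periods=10, freq="1min"))

# compute several aggregations per 5-minute bin in one pass
stats = s.resample("5min").agg(["sum", "max", "count"])
print(stats)
```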
or, we can upsample the data:
s.resample("30s").mean()
2022-01-01 00:00:00 0.0
2022-01-01 00:00:30 NaN
2022-01-01 00:01:00 1.0
2022-01-01 00:01:30 NaN
2022-01-01 00:02:00 2.0
2022-01-01 00:02:30 NaN
2022-01-01 00:03:00 3.0
2022-01-01 00:03:30 NaN
2022-01-01 00:04:00 4.0
2022-01-01 00:04:30 NaN
2022-01-01 00:05:00 5.0
2022-01-01 00:05:30 NaN
2022-01-01 00:06:00 6.0
2022-01-01 00:06:30 NaN
2022-01-01 00:07:00 7.0
2022-01-01 00:07:30 NaN
2022-01-01 00:08:00 8.0
2022-01-01 00:08:30 NaN
2022-01-01 00:09:00 9.0
Freq: 30S, dtype: float64
We can resample up or down to whatever frequency we like! You will notice above that when we upsample the data, the new rows are filled with NaN.
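If NaN-filled rows aren't what you want, interpolation (shown next) is one option; for step-like signals such as instrument status flags or setpoints, forward-filling the last observation is often the better choice. A minimal sketch:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range("1/1/2022", periods=10, freq="1min"))

# carry the last observed value forward into the new 30-second rows
filled = s.resample("30s").ffill()
print(filled.head(4))
```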
If we want to get really fancy (we do!), then we can upsample the data AND interpolate between the original data points at the same time!
s.resample("30s").interpolate("time")
2022-01-01 00:00:00 0.0
2022-01-01 00:00:30 0.5
2022-01-01 00:01:00 1.0
2022-01-01 00:01:30 1.5
2022-01-01 00:02:00 2.0
2022-01-01 00:02:30 2.5
2022-01-01 00:03:00 3.0
2022-01-01 00:03:30 3.5
2022-01-01 00:04:00 4.0
2022-01-01 00:04:30 4.5
2022-01-01 00:05:00 5.0
2022-01-01 00:05:30 5.5
2022-01-01 00:06:00 6.0
2022-01-01 00:06:30 6.5
2022-01-01 00:07:00 7.0
2022-01-01 00:07:30 7.5
2022-01-01 00:08:00 8.0
2022-01-01 00:08:30 8.5
2022-01-01 00:09:00 9.0
Freq: 30S, dtype: float64
While we used a time-based interpolation above, there are a number of options (see the pandas documentation for DataFrame.interpolate), including "polynomial" and "cubicspline":
s.resample("5s").interpolate("cubicspline")[0:5]
2022-01-01 00:00:00 0.000000
2022-01-01 00:00:05 0.083333
2022-01-01 00:00:10 0.166667
2022-01-01 00:00:15 0.250000
2022-01-01 00:00:20 0.333333
Freq: 5S, dtype: float64
Now that we've seen how to resample to a new time base and interpolate between any time interval, we should have the tools needed to get two data sets onto the same time index. While there are a number of ways one could go about doing this, the following strategy tends to work quite well.
Let's create two data sets with different time axes. First, we create a data set with two columns, each containing 100 random integers between 0 and 99, indexed at 1-minute resolution.
import numpy as np
# Create a dataframe with 2 columns at a 1-min freq
df1 = pd.DataFrame(
data=np.random.randint(0, 100, size=(100, 2)),
index=pd.date_range("1/1/2022", periods=100, freq="1min"),
columns=list("AB")
)
A B
2022-01-01 00:00:00 38 95
2022-01-01 00:01:00 48 67
2022-01-01 00:02:00 8 39
2022-01-01 00:03:00 23 22
2022-01-01 00:04:00 94 62
... .. ..
2022-01-01 01:35:00 51 69
2022-01-01 01:36:00 61 96
2022-01-01 01:37:00 78 49
2022-01-01 01:38:00 46 67
2022-01-01 01:39:00 49 65
[100 rows x 2 columns]
Next, we'll create a data set with two columns of 1,000 random floats at a 1Hz frequency - and just for fun, we will go ahead and offset the index by an additional 5 milliseconds (I mean, why not?).
df2 = pd.DataFrame(
data=np.random.random(size=(1000, 2)),
index=pd.date_range("1/1/2022", periods=1000, freq="1s"),
columns=list("CD")
)
# adjust the index
df2.index = df2.index + pd.Timedelta("5ms")
C D
2022-01-01 00:00:00.005 0.497613 0.644631
2022-01-01 00:00:01.005 0.117858 0.160477
2022-01-01 00:00:02.005 0.754900 0.362505
2022-01-01 00:00:03.005 0.933868 0.548447
2022-01-01 00:00:04.005 0.272889 0.209575
... ... ...
2022-01-01 00:16:35.005 0.759741 0.119596
2022-01-01 00:16:36.005 0.412770 0.451684
2022-01-01 00:16:37.005 0.788443 0.860958
2022-01-01 00:16:38.005 0.221932 0.591946
2022-01-01 00:16:39.005 0.101640 0.419326
[1000 rows x 2 columns]
Now that we have our two datasets, it's easy to merge
them into a single dataframe by performing an outer join:
merged = pd.merge(df1, df2, left_index=True,
                  right_index=True, how="outer")
A B C D
2022-01-01 00:00:00.000 38.0 95.0 NaN NaN
2022-01-01 00:00:00.005 NaN NaN 0.497613 0.644631
2022-01-01 00:00:01.005 NaN NaN 0.117858 0.160477
2022-01-01 00:00:02.005 NaN NaN 0.754900 0.362505
2022-01-01 00:00:03.005 NaN NaN 0.933868 0.548447
... ... ... ... ...
2022-01-01 01:35:00.000 51.0 69.0 NaN NaN
2022-01-01 01:36:00.000 61.0 96.0 NaN NaN
2022-01-01 01:37:00.000 78.0 49.0 NaN NaN
2022-01-01 01:38:00.000 46.0 67.0 NaN NaN
2022-01-01 01:39:00.000 49.0 65.0 NaN NaN
[1100 rows x 4 columns]
You will notice that by performing the outer join, we create a row for each point in time that occurs in either of the dataframes; all empty spots are automatically filled with NaN.
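As an aside, when each row in one frame just needs the single nearest row from the other (rather than a fully interpolated union), pd.merge_asof can pair them directly, optionally within a tolerance. A small sketch with toy frames (the names and values here are made up for illustration):

```python
import pandas as pd
import numpy as np

left = pd.DataFrame({"A": range(5)},
                    index=pd.date_range("1/1/2022", periods=5, freq="1min"))
right = pd.DataFrame({"C": np.arange(5) / 10},
                     index=pd.date_range("1/1/2022", periods=5, freq="1min")
                           + pd.Timedelta("5ms"))

# match each left row to the nearest right row, but only within 1 second
nearest = pd.merge_asof(left, right, left_index=True, right_index=True,
                        direction="nearest", tolerance=pd.Timedelta("1s"))
print(nearest)
```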
Next, we want to resample the dataframe to the highest frequency time base:
merged = merged.resample("5ms").mean()
A B C D
2022-01-01 00:00:00.000 38.0 95.0 NaN NaN
2022-01-01 00:00:00.005 NaN NaN 0.497613 0.644631
2022-01-01 00:00:00.010 NaN NaN NaN NaN
2022-01-01 00:00:00.015 NaN NaN NaN NaN
2022-01-01 00:00:00.020 NaN NaN NaN NaN
... ... ... ... ...
2022-01-01 01:38:59.980 NaN NaN NaN NaN
2022-01-01 01:38:59.985 NaN NaN NaN NaN
2022-01-01 01:38:59.990 NaN NaN NaN NaN
2022-01-01 01:38:59.995 NaN NaN NaN NaN
2022-01-01 01:39:00.000 49.0 65.0 NaN NaN
[1188001 rows x 4 columns]
Next, we will interpolate across the entire time axis using a time-based interpolation:
merged = merged.interpolate("time")
A B C D
2022-01-01 00:00:00.000 38.000000 95.000000 NaN NaN
2022-01-01 00:00:00.005 38.000833 94.997667 0.497613 0.644631
2022-01-01 00:00:01.005 38.167500 94.531000 0.117858 0.160477
2022-01-01 00:00:02.005 38.334167 94.064333 0.754900 0.362505
2022-01-01 00:00:03.005 38.500833 93.597667 0.933868 0.548447
... ... ... ... ...
2022-01-01 01:35:00.000 51.000000 69.000000 0.101640 0.419326
2022-01-01 01:36:00.000 61.000000 96.000000 0.101640 0.419326
2022-01-01 01:37:00.000 78.000000 49.000000 0.101640 0.419326
2022-01-01 01:38:00.000 46.000000 67.000000 0.101640 0.419326
2022-01-01 01:39:00.000 49.000000 65.000000 0.101640 0.419326
[1100 rows x 4 columns]
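Notice in the output above that C and D simply hold their last observed value for every timestamp after df2 ends: by default, interpolate fills trailing NaNs forward with the last valid value. If you would rather leave points outside the observed range empty, the limit_area="inside" option restricts filling to gaps that are bracketed by real data. A small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan],
              index=pd.date_range("1/1/2022", periods=4, freq="1min"))

# only fill NaNs surrounded by valid values; the trailing NaN is left alone
inside = s.interpolate("time", limit_area="inside")
print(inside)
```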
You could also combine the last two steps together using a single call:
merged = merged.resample("5ms").interpolate("time")
At this point, we have completed the task! We have a single time axis with all columns. One last step that can be useful is to downsample onto a lower-frequency time base. For example, if we wanted 5-minute resolution data from the 200 Hz (5 ms) data above, we could do the following:
merged.resample("5min").mean()
A B C D
2022-01-01 00:00:00 44.231492 48.583820 0.488557 0.535728
2022-01-01 00:05:00 34.589213 36.806525 0.519729 0.532371
2022-01-01 00:10:00 54.106393 41.346672 0.489820 0.520780
2022-01-01 00:15:00 63.422492 38.333746 0.493292 0.513556
2022-01-01 00:20:00 56.600000 53.600000 0.101640 0.419326
2022-01-01 00:25:00 48.800000 58.400000 0.101640 0.419326
2022-01-01 00:30:00 32.000000 54.200000 0.101640 0.419326
2022-01-01 00:35:00 76.800000 68.000000 0.101640 0.419326
2022-01-01 00:40:00 50.000000 43.000000 0.101640 0.419326
2022-01-01 00:45:00 29.600000 49.600000 0.101640 0.419326
2022-01-01 00:50:00 48.200000 54.200000 0.101640 0.419326
2022-01-01 00:55:00 33.600000 32.800000 0.101640 0.419326
2022-01-01 01:00:00 72.200000 57.800000 0.101640 0.419326
2022-01-01 01:05:00 48.200000 52.600000 0.101640 0.419326
2022-01-01 01:10:00 44.800000 69.800000 0.101640 0.419326
2022-01-01 01:15:00 59.800000 64.200000 0.101640 0.419326
2022-01-01 01:20:00 35.400000 47.200000 0.101640 0.419326
2022-01-01 01:25:00 55.200000 38.000000 0.101640 0.419326
2022-01-01 01:30:00 49.800000 59.200000 0.101640 0.419326
2022-01-01 01:35:00 57.000000 69.200000 0.101640 0.419326
With the above knowledge, you should be able to upsample, downsample, resample, interpolate, and rebase your time series any way you see fit.
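Putting it all together, the whole recipe can be wrapped in a small helper. The sketch below (rebase is a made-up name, not a pandas function) skips the intermediate high-frequency grid: because "time" interpolation weights by actual elapsed time, interpolating directly at the union timestamps gives the same values at those points. It also uses limit_area="inside" so columns are not carried forward past their last real observation:

```python
import pandas as pd
import numpy as np

def rebase(df_a, df_b, freq):
    """Outer-join two time-indexed frames, interpolate in time,
    then average onto a common `freq` grid."""
    merged = pd.merge(df_a, df_b, left_index=True,
                      right_index=True, how="outer")
    merged = merged.interpolate("time", limit_area="inside")
    return merged.resample(freq).mean()

# toy inputs: 1-minute data and 6-second data covering ~10 minutes
df1 = pd.DataFrame({"A": range(10)},
                   index=pd.date_range("1/1/2022", periods=10, freq="1min"))
df2 = pd.DataFrame({"C": np.arange(100, dtype=float)},
                   index=pd.date_range("1/1/2022", periods=100, freq="6s"))

out = rebase(df1, df2, "5min")
print(out)
```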