Description
Hello, when n_lags is set, how does it interact with validation_df in fit?
Example:
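import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

# 96 monthly points of random data, 2017-01 through 2024-12
dates = pd.date_range(start="2017-01-01", end="2024-12-01", freq="MS")
values = np.random.rand(len(dates)) * 100
data = pd.DataFrame({"ds": dates, "y": values})
train_data = data.iloc[:-12]
val_data = data.iloc[-12:]  # !!! only 12 rows, fewer than n_lags + n_forecasts
model = NeuralProphet(
    n_forecasts=12,  # the constructor argument is n_forecasts, not n_forecast
    n_lags=12,
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
)
model.fit(train_data, freq="MS", validation_df=val_data)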
With n_lags = 12, validation_df needs at least n_lags + n_forecasts rows (here 12 + 12 = 24) to contain one complete window of lagged inputs plus forecast targets.
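As a rough check of that arithmetic, here is a minimal sketch. It assumes each sample is a sliding window of n_lags input rows followed by n_forecasts target rows. The helper count_windows is hypothetical and not part of NeuralProphet.

```python
def count_windows(n_rows: int, n_lags: int, n_forecasts: int) -> int:
    """Number of complete (lags + targets) windows in a series of n_rows rows.

    Assumption: one sample needs n_lags + n_forecasts consecutive rows,
    and windows slide forward by one row at a time.
    """
    return max(0, n_rows - (n_lags + n_forecasts) + 1)


print(count_windows(12, 12, 12))  # 12-row validation set -> 0
print(count_windows(24, 12, 12))  # 24-row validation set -> 1
```

Under this assumption a 12-row validation_df yields no complete window, while 24 rows yield exactly one.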
Which of these two splits is the correct way to avoid a data leak?
- Option 1:
train_data = data.iloc[:-24]  # hold out n_forecasts + n_lags rows for val_data
val_data = data.iloc[-24:]
- Option 2:
train_data = data.iloc[:-12]
val_data = data.iloc[-24:]
In the second case, for example, the entire year 2023 appears in both train_df and val_df. Within val_df, 2023 serves as the n_lags window: it should only be used as lagged input for predicting the n_forecasts horizon (2024), and should not itself be scored as validation targets during fit.
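To make the overlap concrete, the sketch below compares the two splits using pandas only. It assumes the first n_lags rows of the validation frame act purely as lagged inputs and never as targets. Whether NeuralProphet treats validation_df exactly this way is an assumption here, and split_report is a hypothetical helper.

```python
import pandas as pd

dates = pd.date_range(start="2017-01-01", end="2024-12-01", freq="MS")
data = pd.DataFrame({"ds": dates})

n_lags, n_forecasts = 12, 12


def split_report(train: pd.DataFrame, val: pd.DataFrame):
    """Return the dates shared by both frames and the dates val can score as targets."""
    shared = set(train["ds"]) & set(val["ds"])
    # Assumption: the first n_lags rows of val are inputs only, never targets.
    val_targets = set(val["ds"].iloc[n_lags:])
    return shared, val_targets


# Option 1: fully disjoint split
train_1, val_1 = data.iloc[:-24], data.iloc[-24:]
shared_1, targets_1 = split_report(train_1, val_1)

# Option 2: val reuses the last 12 training rows as lag context
train_2, val_2 = data.iloc[:-12], data.iloc[-24:]
shared_2, targets_2 = split_report(train_2, val_2)

print(len(shared_1), len(shared_2))
# Dates that are both trained on and scored as validation targets in option 2
print(targets_2 & set(train_2["ds"]))
```

Under the stated assumption, option 2 shares the 2023 rows between the two frames, but the validation targets (2024) never appear in the training data. Option 1 avoids any shared rows at the cost of never training on 2023.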