validation_df + n_lags in fit #1684

@ekpog200

Hello. When n_lags is set, how does it interact with validation_df in fit?

Example:
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

# Monthly series, 2017-01 through 2024-12 (96 rows)
dates = pd.date_range(start="2017-01-01", end="2024-12-01", freq="MS")
values = np.random.rand(len(dates)) * 100
data = pd.DataFrame({"ds": dates, "y": values})

train_data = data.iloc[:-12]
val_data = data.iloc[-12:]  # !!! only 12 rows

model = NeuralProphet(
    n_forecasts=12,  # the constructor argument is n_forecasts, not n_forecast
    n_lags=12,
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
)

model.fit(train_data, freq="MS", validation_df=val_data)

With n_lags = 12, validation_df needs at least n_forecasts + n_lags rows (in this case 12 + 12 = 24), so the 12-row val_data above is too short.
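The row-count arithmetic can be sketched as follows. This assumes each sample is a sliding window of n_lags input rows followed by n_forecasts target rows, which is my reading of how autoregressive samples are built and is not checked against the NeuralProphet source. `n_samples` is a hypothetical helper, not a library function.

```python
def n_samples(n_rows: int, n_lags: int, n_forecasts: int) -> int:
    """Number of complete (inputs, targets) windows in a series of n_rows.

    Assumption: one window = n_lags input rows + n_forecasts target rows,
    sliding forward one row at a time.
    """
    return max(0, n_rows - n_lags - n_forecasts + 1)


too_short = n_samples(12, n_lags=12, n_forecasts=12)   # val_data = data.iloc[-12:]
just_enough = n_samples(24, n_lags=12, n_forecasts=12)  # val_data = data.iloc[-24:]
```

Under that assumption, a 12-row validation frame yields no complete window, while a 24-row frame yields exactly one.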

Which of these splits is more correct for avoiding a data leak?

  1. train_data = data.iloc[:-24]  # hold out n_forecasts + n_lags rows for val_data
     val_data = data.iloc[-24:]

  2. train_data = data.iloc[:-12]
     val_data = data.iloc[-24:]

In the second case, for example, all of 2023 is in both train_df and val_df. For val_df, 2023 serves as the n_lags window, which should only be used as input for predicting the n_forecasts steps and should not itself be validated against during fit (where it is already training data).
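One way to compare the two splits is to check which dates can ever appear as a forecast target in each frame, separately from which rows the frames share. This is a pandas-only sketch under the same sliding-window assumption (n_lags input rows followed by n_forecasts target rows). `target_dates` is a hypothetical helper, and the real sample construction inside NeuralProphet may differ.

```python
import pandas as pd

N_LAGS = 12
N_FORECASTS = 12

dates = pd.date_range(start="2017-01-01", end="2024-12-01", freq="MS")
data = pd.DataFrame({"ds": dates, "y": range(len(dates))})


def target_dates(df, n_lags=N_LAGS, n_forecasts=N_FORECASTS):
    """Dates that are a forecast target in at least one window (hypothetical helper)."""
    n_windows = len(df) - n_lags - n_forecasts + 1
    if n_windows < 1:
        return set()
    # The first n_lags rows can only ever be lag inputs; every later row
    # is a target of some window.
    return set(df["ds"].iloc[n_lags:])


# Split 1: disjoint rows
train_1, val_1 = data.iloc[:-24], data.iloc[-24:]
# Split 2: val reuses the last 12 training rows as its lag inputs
train_2, val_2 = data.iloc[:-12], data.iloc[-24:]

rows_shared_1 = set(train_1["ds"]) & set(val_1["ds"])
rows_shared_2 = set(train_2["ds"]) & set(val_2["ds"])
targets_shared_1 = target_dates(train_1) & target_dates(val_1)
targets_shared_2 = target_dates(train_2) & target_dates(val_2)

# Months that are never a training target in split 1 but are in split 2
train_targets_lost_in_1 = target_dates(train_2) - target_dates(train_1)
```

If the assumption holds, split 2 shares the twelve 2023 rows between the frames, but no date is a target in both: 2023 is a training target and only a lag input for validation, whose targets are the 2024 months. Split 1 shares no rows at all, at the cost of never training on 2023 as a target.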
