validation_df + n_lags in fit

Hello. When n_lags is set, how does it interact with validation_df in fit?

Example:
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet

# Monthly series from 2017-01 to 2024-12 (96 rows) with random values
dates = pd.date_range(start="2017-01-01", end="2024-12-01", freq="MS")
values = np.random.rand(len(dates)) * 100
data = pd.DataFrame({"ds": dates, "y": values})

train_data = data.iloc[:-12]
val_data = data.iloc[-12:]  # !!! only 12 rows

model = NeuralProphet(
    n_forecasts=12,
    n_lags=12,
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
)

model.fit(train_data, freq="MS", validation_df=val_data)

When using n_lags = 12, validation_df must contain at least n_forecasts + n_lags rows (in this case 12 + 12 = 24).
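The arithmetic behind this requirement can be sketched as follows. This is a minimal illustration, assuming each sample is a contiguous window of n_lags input rows followed by n_forecasts target rows. The helper count_windows is hypothetical and is not part of NeuralProphet.

```python
def count_windows(n_rows: int, n_lags: int, n_forecasts: int) -> int:
    """Number of complete (lags + targets) windows that fit in n_rows rows.

    Assumption: one window needs n_lags + n_forecasts consecutive rows,
    and windows slide forward one row at a time.
    """
    return max(0, n_rows - (n_lags + n_forecasts) + 1)


# A 12-row validation set cannot hold a single 12 + 12 window.
short_val = count_windows(n_rows=12, n_lags=12, n_forecasts=12)

# A 24-row validation set holds exactly one window.
full_val = count_windows(n_rows=24, n_lags=12, n_forecasts=12)

print(short_val, full_val)
```

Under this assumption, 24 rows is the minimum that yields one validation sample, and every extra row adds one more.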

Which is the more correct way to split the data to avoid data leakage?

1. train_data = data.iloc[:-24]  # leave n_forecasts + n_lags rows for val_data
   val_data = data.iloc[-24:]

2. train_data = data.iloc[:-12]
   val_data = data.iloc[-24:]

In the second case, for example, all of 2023 appears in both train_data and val_data. In val_data, 2023 serves as the n_lags window, which should only be used as input for predicting the n_forecasts horizon and should not itself be validated on during fit.
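One way to inspect the two options is to count which rows are shared between the splits and which validation targets also occur in the training data. The sketch below uses plain Python lists in place of the DataFrame (the slices mirror the iloc calls above) and assumes the same sliding-window layout: n_lags input rows followed by n_forecasts target rows. The helper target_rows is hypothetical, and this is not a statement about NeuralProphet internals.

```python
from datetime import date

# Month starts from 2017-01 through 2024-12 (96 rows), mirroring the "ds" column.
months = [date(y, m, 1) for y in range(2017, 2025) for m in range(1, 13)]

n_lags, n_forecasts = 12, 12


def target_rows(rows, n_lags, n_forecasts):
    """Rows that serve as a forecast target in at least one window (assumed layout)."""
    if len(rows) < n_lags + n_forecasts:
        return []  # no complete window fits
    return rows[n_lags:]  # the first n_lags rows are only ever inputs


splits = {
    "option 1": (months[:-24], months[-24:]),
    "option 2": (months[:-12], months[-24:]),
}

results = {}
for name, (train, val) in splits.items():
    shared = set(train) & set(val)  # rows present in both splits
    val_targets = set(target_rows(val, n_lags, n_forecasts))
    targets_in_train = val_targets & set(train)  # validation targets seen in training
    results[name] = (len(shared), len(targets_in_train))
    print(name, "shared rows:", len(shared),
          "validation targets also in train:", len(targets_in_train))
```

Under these assumptions, option 2 shares the 12 rows of 2023 with the training data, but in both options the validation targets (2024) lie outside train_data. Whether reusing 2023 as validation input counts as leakage depends on how the library builds its validation samples.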

validation_df + n_lags in fit #1684
