fix(finetune): prevent data leakage in CustomKlineDataset normalization #263

Open
JasonOA888 wants to merge 1 commit into shiyu-coder:master from JasonOA888:fix/finetune-csv-data-leakage
@JasonOA888
Bug

CustomKlineDataset.__getitem__ computes np.mean(x) and np.std(x) over the entire sliding window (lookback_window + predict_window + 1 rows), which includes the prediction target period.

This leaks future price statistics (mean and std of the prediction window) into the training features. The model can indirectly infer information about upcoming prices through the normalization parameters.
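
To make the leak concrete, here is a minimal sketch (not the repository code). The helper name `leaky_normalize`, the `eps` term, the per-column `axis=0` reduction, and the toy window sizes are all assumptions for illustration. Two windows share the same lookback rows but have different future rows; with full-window statistics, the normalized lookback features come out different.

```python
import numpy as np

def leaky_normalize(window, eps=1e-5):
    """Hypothetical helper: normalize with stats over the WHOLE window,
    i.e. lookback rows and prediction rows together."""
    mean = np.mean(window, axis=0)
    std = np.std(window, axis=0)
    return (window - mean) / (std + eps)

lookback_window = 4  # toy size, chosen only for this example

# Identical history, two different futures.
past = np.array([[10.0], [11.0], [12.0], [13.0]])
future_up = np.array([[20.0], [21.0]])
future_down = np.array([[5.0], [4.0]])

window_up = np.concatenate([past, future_up])
window_down = np.concatenate([past, future_down])

# Normalized lookback features the model would see as input.
feat_up = leaky_normalize(window_up)[:lookback_window]
feat_down = leaky_normalize(window_down)[:lookback_window]

# The raw history is the same, yet the inputs differ because the
# mean and std were computed with the future rows included.
leak_detected = not np.allclose(feat_up, feat_down)
print("inputs depend on future rows:", leak_detected)
```

Because the input features shift with the future rows, a model can pick up on that shift during training, which is the leak described above.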

The sibling dataset QlibDataset in finetune/dataset.py handles this correctly:

```python
# QlibDataset (correct): statistics come from the lookback rows only
past_x = x[:past_len]
x_mean = np.mean(past_x, axis=0)
x_std  = np.std(past_x, axis=0)
```

Fix

Restrict normalization statistics to the lookback portion only (x[:self.lookback_window]), matching the approach in QlibDataset.

3 lines changed. No API changes.
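
For reviewers, the intended behavior can be sketched roughly as follows. This is an illustration, not the actual diff: the function name `normalize_with_lookback_stats` and the `eps` stabilizer are assumptions, and the real `__getitem__` may apply further steps that are not shown here.

```python
import numpy as np

def normalize_with_lookback_stats(x, lookback_window, eps=1e-5):
    """Hypothetical helper: normalize the full window using mean/std
    computed from the lookback rows only."""
    past_x = x[:lookback_window]          # history visible at prediction time
    x_mean = np.mean(past_x, axis=0)      # per-column mean of the history
    x_std = np.std(past_x, axis=0)        # per-column std of the history
    # The prediction rows are still scaled, but they no longer
    # influence the statistics used for scaling.
    return (x - x_mean) / (x_std + eps)

# Same history, different futures: the normalized history must match.
past = np.arange(8, dtype=float).reshape(4, 2)
window_a = np.concatenate([past, np.full((2, 2), 100.0)])
window_b = np.concatenate([past, np.full((2, 2), -100.0)])

norm_a = normalize_with_lookback_stats(window_a, lookback_window=4)
norm_b = normalize_with_lookback_stats(window_b, lookback_window=4)
print(np.allclose(norm_a[:4], norm_b[:4]))
```

With statistics restricted to the lookback rows, the model inputs are a function of past data alone, whatever happens in the prediction window.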

…t data leakage

CustomKlineDataset was computing mean/std over the entire sliding window
(lookback + predict + 1), leaking future data statistics into training.
QlibDataset in finetune/dataset.py correctly uses only the lookback
portion for this calculation.