Create dataset without additional copies of data (2.6 - Data sampling with a sliding window) #745
Hi @rasbt,

The current implementation is:

```python
import torch
from torch.utils.data import Dataset


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```

What do you think, could we instead use something like the following `GPTDatasetStrided`, which builds the windows as views into a single token tensor so that no additional copies of the data are created?

```python
class GPTDatasetStrided(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        # Tokenize the entire text once and keep it as a single tensor
        token_ids = torch.tensor(
            tokenizer.encode(txt, allowed_special={"<|endoftext|>"}),
            dtype=torch.long
        )
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # unfold produces overlapping windows of length max_length+1 as views
        # into token_ids, so no data is copied
        windows = token_ids.unfold(0, max_length + 1, stride)
        self.input_view = windows[..., :-1]
        self.target_view = windows[..., 1:]

    def __len__(self):
        return self.input_view.size(0)

    def __getitem__(self, idx):
        return self.input_view[idx], self.target_view[idx]
```

Thank you.
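As a quick sanity check (a sketch, assuming the tiktoken GPT-2 tokenizer used in the book and the two class definitions above; the sample text and the `max_length`/`stride` values are just illustrative), both datasets should yield identical (input, target) pairs:

```python
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")
sample_text = "In the heart of the city stood the old library, a relic from a forgotten era. " * 5

# Illustrative hyperparameters
max_length, stride = 8, 4

ds_v1 = GPTDatasetV1(sample_text, tokenizer, max_length=max_length, stride=stride)
ds_strided = GPTDatasetStrided(sample_text, tokenizer, max_length=max_length, stride=stride)

# Both versions use the same window start positions, so they should produce
# the same number of samples with the same contents
assert len(ds_v1) == len(ds_strided)
for i in range(len(ds_v1)):
    x1, y1 = ds_v1[i]
    x2, y2 = ds_strided[i]
    assert torch.equal(x1, x2) and torch.equal(y1, y2)

print(f"Both datasets yield {len(ds_v1)} identical (input, target) pairs.")
```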
Replies: 1 comment 1 reply
Hi @labdmitriy, thanks for the suggestion! I should say that I prioritized code readability (and simplicity in terms of PyTorch commands) over pure efficiency, since there are many PyTorch newcomers reading the book.
I also think the data loader is lightweight enough (and runs in multiple worker processes), so this wouldn't result in an observable speed-up during training.
That being said, I actually really like your observation. It looks very clean and readable! If there's a new edition in 2-3 years, I'll make sure to revisit this. Thanks!
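For context, such a dataset is typically consumed through a `DataLoader` with worker processes, roughly as sketched below (the parameter values are illustrative, not necessarily the ones used in the book, and `raw_text` is assumed to be loaded elsewhere):

```python
from torch.utils.data import DataLoader

# Illustrative settings; the book's actual batch_size/num_workers may differ.
dataset = GPTDatasetV1(raw_text, tokenizer, max_length=256, stride=128)
train_loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    drop_last=True,
    num_workers=2,  # samples are prepared in separate worker processes
)

for input_batch, target_batch in train_loader:
    print(input_batch.shape, target_batch.shape)  # e.g. torch.Size([8, 256]) for both
    break
```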