Create dataset without additional copies of data (2.6 - Data sampling with a sliding window) #745
Hi @rasbt,

The current implementation is:

```python
import torch
from torch.utils.data import Dataset


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```

What do you think, could we instead use something like the following `GPTDatasetStrided`, which builds the windows as views into a single token tensor so that no additional copies of the data are created?

```python
class GPTDatasetStrided(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        # Tokenize the entire text once and keep it as a single tensor
        token_ids = torch.tensor(
            tokenizer.encode(txt, allowed_special={"<|endoftext|>"}),
            dtype=torch.long
        )
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # unfold produces overlapping windows of length max_length+1 as views
        # into token_ids, so no data is copied
        windows = token_ids.unfold(0, max_length + 1, stride)
        self.input_view = windows[..., :-1]
        self.target_view = windows[..., 1:]

    def __len__(self):
        return self.input_view.size(0)

    def __getitem__(self, idx):
        return self.input_view[idx], self.target_view[idx]
```

Thank you.
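As a quick sanity check (a sketch, assuming the tiktoken GPT-2 tokenizer used in the book and the two class definitions above; the sample text and the `max_length`/`stride` values are just illustrative), both datasets should yield identical (input, target) pairs:

```python
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")
sample_text = "In the heart of the city stood the old library, a relic from a forgotten era. " * 5

# Illustrative hyperparameters
max_length, stride = 8, 4

ds_v1 = GPTDatasetV1(sample_text, tokenizer, max_length=max_length, stride=stride)
ds_strided = GPTDatasetStrided(sample_text, tokenizer, max_length=max_length, stride=stride)

# Both versions use the same window start positions, so they should produce
# the same number of samples with the same contents
assert len(ds_v1) == len(ds_strided)
for i in range(len(ds_v1)):
    x1, y1 = ds_v1[i]
    x2, y2 = ds_strided[i]
    assert torch.equal(x1, x2) and torch.equal(y1, y2)

print(f"Both datasets yield {len(ds_v1)} identical (input, target) pairs.")
```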
Replies: 1 comment 1 reply
Hi @labdmitriy, thanks for the suggestion! I should say that I prioritized code readability (and simplicity in terms of PyTorch commands) over pure efficiency, since there are many PyTorch newcomers reading the book.
I also think the data loader is lightweight enough (and runs in multiple worker processes), so this wouldn't result in an observable speed-up during training.
That being said, I actually really like your observation. It looks very clean and readable! If there's a new edition in 2-3 years, I'll make sure to revisit this. Thanks!
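For context, such a dataset is typically consumed through a `DataLoader` with worker processes, roughly as sketched below (the parameter values are illustrative, not necessarily the ones used in the book, and `raw_text` is assumed to be loaded elsewhere):

```python
from torch.utils.data import DataLoader

# Illustrative settings; the book's actual batch_size/num_workers may differ.
dataset = GPTDatasetV1(raw_text, tokenizer, max_length=256, stride=128)
train_loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    drop_last=True,
    num_workers=2,  # samples are prepared in separate worker processes
)

for input_batch, target_batch in train_loader:
    print(input_batch.shape, target_batch.shape)  # e.g. torch.Size([8, 256]) for both
    break
```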