More straightforward custom_collate_function (chapter 7) #683
Replies: 1 comment
-
Thanks for the feedback, and sorry for the late response. This must have gotten lost in my GitHub notifications back then. If I see it correctly, the difference is in how the inputs and targets are built via

```python
# your version
for item in batch:
    padded_item = (
        item.copy() +
        [pad_token_id] * (batch_max_length - len(item))
    )
    inputs = torch.tensor(padded_item)
    # add an extra pad_token_id
    targets = torch.tensor(padded_item[1:] + [pad_token_id])
```

and

```python
# my version
for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id]
    padded = (
        new_item + [pad_token_id] *
        (batch_max_length - len(new_item))
    )
    inputs = torch.tensor(padded[:-1])
    targets = torch.tensor(padded[1:])
```

Looking at it, I think your version is indeed pretty intuitive: it constructs the targets from the inputs and then just appends the padding token. Mine basically adds the padding token first and then removes it from the inputs. Why did I do it this way? I don't recall, but I assume it was the first thing that came to mind. Perhaps it was a bit easier to show that the inputs and targets have the same length but are shifted by one position. Anyway, thanks for the suggestion; in hindsight, I like your version.
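For anyone reading along, here is a self-contained sketch comparing the two variants on a toy batch. The function names (`collate_draft_v1`/`collate_draft_v2`), the `pad_token_id` value (50256, GPT-2's `<|endoftext|>`), and the `torch.stack` batching are assumptions for illustration only, and the book's `-100` masking step is omitted since it would apply identically to both versions:

```python
import torch

pad_token_id = 50256  # assumed: GPT-2's <|endoftext|> token id

def collate_draft_v1(batch):
    # suggested version: pad to the longest sequence, then build targets by
    # shifting the padded inputs left and appending one extra pad token
    batch_max_length = max(len(item) for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        padded_item = (
            item.copy() +
            [pad_token_id] * (batch_max_length - len(item))
        )
        inputs_lst.append(torch.tensor(padded_item))
        targets_lst.append(torch.tensor(padded_item[1:] + [pad_token_id]))
    return torch.stack(inputs_lst), torch.stack(targets_lst)

def collate_draft_v2(batch):
    # book-style version: append one pad token first, pad, then drop the
    # last element for the inputs and the first element for the targets
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs_lst.append(torch.tensor(padded[:-1]))
        targets_lst.append(torch.tensor(padded[1:]))
    return torch.stack(inputs_lst), torch.stack(targets_lst)

batch = [[1, 2, 3], [4, 5], [7]]
in1, tg1 = collate_draft_v1(batch)
in2, tg2 = collate_draft_v2(batch)
# both variants yield identical inputs and targets for this batch
assert torch.equal(in1, in2) and torch.equal(tg1, tg2)
```

Note that both variants produce inputs and targets of the same shape, shifted by one position relative to each other, which is the invariant the collate function needs to preserve.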
-
Hello!
I personally found the `custom_collate_function` introduced in the book a bit confusing. To be more specific, the way we add and then ignore the `pad_token_id` was not really straightforward for me (the idea behind it was clear, but I had to read the code multiple times to understand it), so I modified it slightly and am sharing it here. I validated it by comparing its output against the output of the original collate function introduced in the book for several inputs and got the same results. Was there any specific reason the padding was done that way in the book?
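For context, the "ignoring" step that tripped me up can be sketched in isolation. As far as I can tell, the book's function replaces every padding token in the targets except the first one with `-100`, which PyTorch's `cross_entropy` skips by default (`ignore_index=-100`). A minimal sketch, assuming `pad_token_id = 50256` (GPT-2's `<|endoftext|>`):

```python
import torch

ignore_index = -100
pad_token_id = 50256  # assumed: GPT-2's <|endoftext|> token id

targets = torch.tensor([5, 50256, 50256, 50256])

# find all padding positions and mask out all but the first one
mask = targets == pad_token_id
indices = torch.nonzero(mask).squeeze()
if indices.numel() > 1:
    targets[indices[1:]] = ignore_index

# targets is now tensor([5, 50256, -100, -100]): the first pad token still
# acts as an end-of-text label, the rest are ignored by the loss
```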