ConcatThenSplitIterDataset producing incorrect results #981

@jackopenn

Description

When using ConcatThenSplitIterDataset on a simple dataset, I see padding gaps in the output sequences where I expect tokens, and it does not seem to consume the full dataset. A full end-to-end reproduction is below. I know it's experimental, but this is a super useful feature. :)

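import grain
import sys

# Print the Python and Grain versions used for this reproduction.
print('.'.join(map(str, sys.version_info[:3])))
print(grain.__version__)


# Ten records with token lists of length 10..19; record i holds i copies of i.
dummy_dat = [{"tokens": [i]*i} for i in range(10, 20)]

for i in dummy_dat:
    print(i)
print()

# Wrap the list as an IterDataset and check that it yields the records unchanged.
parent = grain.MapDataset.source(dummy_dat).to_iter_dataset()

for i in parent:
    print(i)
print()

# Concatenate the records and split them into rows of 7 tokens.
ds = (
    grain.experimental.ConcatThenSplitIterDataset(
        parent=parent,
        length_struct={"tokens": 7},
        meta_features=[],
        split_full_length_features=True,
    )
)

# Print the first 21 packed rows.
c = 0
for i in ds:
    print(i)
    c += 1
    if c > 20:
        break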

Output:

3.11.9
0.2.11
{'tokens': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]}
{'tokens': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]}
{'tokens': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]}
{'tokens': [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13]}
{'tokens': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14]}
{'tokens': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]}
{'tokens': [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]}
{'tokens': [17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17]}
{'tokens': [18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18]}
{'tokens': [19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19]}

{'tokens': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]}
{'tokens': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]}
{'tokens': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]}
{'tokens': [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13]}
{'tokens': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14]}
{'tokens': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]}
{'tokens': [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]}
{'tokens': [17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17]}
{'tokens': [18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18]}
{'tokens': [19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19]}

{'tokens': array([10, 10, 10, 10, 10, 10, 10]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([10, 10, 10, 11, 11, 11, 11]), 'tokens_segment_ids': array([1, 1, 1, 2, 2, 2, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 0, 1, 2, 3], dtype=int32)}
{'tokens': array([11, 11, 11, 11, 11, 11, 11]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([12, 12, 12, 12, 12, 12, 12]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([12, 12, 12, 12, 12, 13, 13]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 2, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 0, 1], dtype=int32)}
{'tokens': array([13, 13, 13, 13, 13,  0,  0]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 0, 0], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 0, 0], dtype=int32)}
{'tokens': array([13, 13, 13, 13, 13, 13, 14]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 0], dtype=int32)}
{'tokens': array([14, 14, 14, 14, 14, 14,  0]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 0], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 0], dtype=int32)}
{'tokens': array([14, 14, 14, 14, 14, 14, 14]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([15, 15, 15, 15, 15, 15, 15]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
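
For comparison, here is a minimal pure-Python sketch of the packing one might expect from a concat-then-split step on this data. It is not Grain's implementation: `pack_reference` and `_emit` are hypothetical helpers, and the sketch assumes that segment ids and positions restart in every output row, as the first two rows of the output above suggest.

```python
# Hypothetical reference packer, NOT Grain's implementation.
# Assumption: segment ids and positions restart at the start of each output row.

def _emit(rows, key, length, tokens, segment_ids, positions):
    """Append one row, zero-padding it to `length` if it is only partly filled."""
    pad = length - len(tokens)
    rows.append({
        key: tokens + [0] * pad,
        f"{key}_segment_ids": segment_ids + [0] * pad,
        f"{key}_positions": positions + [0] * pad,
    })


def pack_reference(records, length, key="tokens"):
    """Concatenate every sequence under `key` and cut the stream into rows."""
    rows = []
    tokens, segment_ids, positions = [], [], []
    for record in records:
        # Each new input sequence opens a new segment within the current row.
        segment = segment_ids[-1] + 1 if segment_ids else 1
        position = 0
        for token in record[key]:
            tokens.append(token)
            segment_ids.append(segment)
            positions.append(position)
            position += 1
            if len(tokens) == length:
                _emit(rows, key, length, tokens, segment_ids, positions)
                tokens, segment_ids, positions = [], [], []
                # The rest of this sequence continues as segment 1 of the next row.
                segment, position = 1, 0
    if tokens:
        # Only the final row may be partial, so only it receives padding.
        _emit(rows, key, length, tokens, segment_ids, positions)
    return rows


dummy_dat = [{"tokens": [i] * i} for i in range(10, 20)]
rows = pack_reference(dummy_dat, length=7)

print(len(rows))  # 21
print(rows[1])
print(rows[-1])

# Count non-padding tokens via the segment ids (0 marks padding).
real_tokens = sum(s != 0 for row in rows for s in row["tokens_segment_ids"])
print(real_tokens)  # 145
```

Under these assumptions the ten records (145 tokens) fit in 21 rows, with padding only at the end of the last row. The 21 rows printed above carry 67 non-padding tokens: padding appears mid-stream, and nothing after the first seven 15s is emitted.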

Labels: type:bug
