Labels: type:bug (Something isn't working)
Description
When using ConcatThenSplitIterDataset on a simple dataset, I see gaps in the packed sequences that I expect to be filled, and it also doesn't seem to consume the full dataset. Full end-to-end reproduction below. I know it's experimental, but this is a super useful feature. :)
import grain
import sys

# Print the Python and grain versions for reference.
print('.'.join(map(str, sys.version_info[:3])))
print(grain.__version__)

# Ten examples whose token lists have lengths 10 through 19.
dummy_dat = [{"tokens": [i] * i} for i in range(10, 20)]
for i in dummy_dat:
    print(i)
print()

parent = grain.MapDataset.source(dummy_dat).to_iter_dataset()
for i in parent:
    print(i)
print()

# Pack the examples into rows of 7 tokens, splitting examples across rows.
ds = grain.experimental.ConcatThenSplitIterDataset(
    parent=parent,
    length_struct={"tokens": 7},
    meta_features=[],
    split_full_length_features=True,
)

# Print the first 21 packed rows.
c = 0
for i in ds:
    print(i)
    c += 1
    if c > 20:
        break

Output:

3.11.9
0.2.11
{'tokens': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]}
{'tokens': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]}
{'tokens': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]}
{'tokens': [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13]}
{'tokens': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14]}
{'tokens': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]}
{'tokens': [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]}
{'tokens': [17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17]}
{'tokens': [18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18]}
{'tokens': [19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19]}
{'tokens': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]}
{'tokens': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]}
{'tokens': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]}
{'tokens': [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13]}
{'tokens': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14]}
{'tokens': [15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]}
{'tokens': [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]}
{'tokens': [17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17]}
{'tokens': [18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18]}
{'tokens': [19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19]}
{'tokens': array([10, 10, 10, 10, 10, 10, 10]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([10, 10, 10, 11, 11, 11, 11]), 'tokens_segment_ids': array([1, 1, 1, 2, 2, 2, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 0, 1, 2, 3], dtype=int32)}
{'tokens': array([11, 11, 11, 11, 11, 11, 11]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([12, 12, 12, 12, 12, 12, 12]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([12, 12, 12, 12, 12, 13, 13]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 2, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 0, 1], dtype=int32)}
{'tokens': array([13, 13, 13, 13, 13, 0, 0]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 0, 0], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 0, 0], dtype=int32)}
{'tokens': array([13, 13, 13, 13, 13, 13, 14]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 2], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 0], dtype=int32)}
{'tokens': array([14, 14, 14, 14, 14, 14, 0]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 0], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 0], dtype=int32)}
{'tokens': array([14, 14, 14, 14, 14, 14, 14]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([15, 15, 15, 15, 15, 15, 15]), 'tokens_segment_ids': array([1, 1, 1, 1, 1, 1, 1], dtype=int32), 'tokens_positions': array([0, 1, 2, 3, 4, 5, 6], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
{'tokens': array([0, 0, 0, 0, 0, 0, 0]), 'tokens_segment_ids': array([0, 0, 0, 0, 0, 0, 0], dtype=int32), 'tokens_positions': array([0, 0, 0, 0, 0, 0, 0], dtype=int32)}
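For comparison, here is a minimal pure-Python sketch of the packing I would expect from a concat-then-split step on this data. It is not grain's implementation: `pack_reference` and `count_real_tokens` are hypothetical helpers written only for this illustration. The sketch assumes that segment IDs restart at 1 in every row and that positions restart at 0 for each piece of a split example, which mirrors the rows printed above. With 145 input tokens and rows of length 7, it yields 21 rows with no padding except the last two slots of the final row, whereas the 21 rows printed above contain only 67 non-padding tokens.

```python
def pack_reference(examples, length, pad=0):
    """Hypothetical reference packer (not grain's code).

    Concatenates the "tokens" of all examples, cuts the stream into rows of
    `length`, and pads only the final row.
    """
    def new_row():
        return {"tokens": [], "tokens_segment_ids": [], "tokens_positions": []}

    rows = []
    row = new_row()
    for ex in examples:
        remaining = list(ex["tokens"])
        while remaining:
            # Take as many tokens as still fit in the current row.
            space = length - len(row["tokens"])
            piece, remaining = remaining[:space], remaining[space:]
            # Assumption: segment IDs count up from 1 within each row.
            seg_ids = row["tokens_segment_ids"]
            seg_id = seg_ids[-1] + 1 if seg_ids else 1
            row["tokens"].extend(piece)
            row["tokens_segment_ids"].extend([seg_id] * len(piece))
            # Assumption: positions restart at 0 for every piece.
            row["tokens_positions"].extend(range(len(piece)))
            if len(row["tokens"]) == length:
                rows.append(row)
                row = new_row()
    if row["tokens"]:
        # Pad only the last, partially filled row.
        n_pad = length - len(row["tokens"])
        for key in row:
            row[key].extend([pad] * n_pad)
        rows.append(row)
    return rows


def count_real_tokens(rows):
    """Counts non-padding slots, using segment ID 0 as the padding marker."""
    return sum(sum(1 for s in r["tokens_segment_ids"] if s != 0) for r in rows)


dummy_dat = [{"tokens": [i] * i} for i in range(10, 20)]
expected_rows = pack_reference(dummy_dat, length=7)
for r in expected_rows:
    print(r)
print(len(expected_rows), count_real_tokens(expected_rows))
```

Running `count_real_tokens` over the rows yielded by the real dataset (after converting the arrays to lists) would give a quick check of whether every input token makes it into the output.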