to_sparse failed for Value with ragged_rank > 1 read from parquet file #69

@SamJia

Description

Current behavior

When hb reads nested lists with ragged_rank > 1, the resulting Value cannot be converted to a SparseTensor by hb.data.to_sparse.

For example:
dense_feature is one of the features read by hb.data.ParquetDataset, and to_sparse does not work for it.
image

Moreover, if I swap the order of the two nested_row_splits, to_sparse succeeds.

image

So maybe the order of the nested_row_splits produced when reading the parquet file is incorrect?
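To make the ordering question concrete, here is a minimal pure-Python sketch that checks whether a tuple of row splits is self-consistent when read outermost-first, which is the convention tf.RaggedTensor uses for nested_row_splits. Whether hb.data.to_sparse expects the same convention is an assumption on my side, and `splits_consistent` is a hypothetical helper, not part of HybridBackend. It uses the values and splits from the simple demo under "Code to reproduce".

```python
def splits_consistent(num_values, nested_row_splits):
    """Return True if the splits are valid in outermost-first order.

    Each level must start at 0, be non-decreasing, and end at the number
    of rows of the next (more inner) level; the innermost level must end
    at the number of flat values.
    """
    # Row count that each level has to cover: the next level's row count,
    # and finally the number of flat values.
    sizes = [len(s) - 1 for s in nested_row_splits[1:]] + [num_values]
    for splits, expected_last in zip(nested_row_splits, sizes):
        if splits[0] != 0 or splits[-1] != expected_last:
            return False
        if any(a > b for a, b in zip(splits, splits[1:])):
            return False
    return True


# Splits from the simple demo in this issue (5 flat values).
first = [0, 1, 3, 4, 5]
second = [0, 2, 4]

as_reported = splits_consistent(5, (first, second))  # order that fails
swapped = splits_consistent(5, (second, first))      # order that works
print(as_reported, swapped)
```

Under this reading, only the swapped order passes the check, which would match the observation that to_sparse works after swapping.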

Expected behavior

A Value read from a parquet file can be converted to a SparseTensor.
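For reference, a sketch of the sparse (COO) layout I would expect for the simple demo, computed in plain Python from the splits in outermost-first order. `ragged_to_coo` is a hypothetical helper written only for illustration; it handles exactly two ragged dimensions, and the bounding-box dense shape is an assumption, not necessarily what hb.data.to_sparse returns.

```python
def ragged_to_coo(values, outer_splits, inner_splits):
    """Convert a 2-level ragged layout into COO indices, values and shape."""
    indices = []
    for i in range(len(outer_splits) - 1):
        # Inner rows that belong to outer row i.
        rows = range(outer_splits[i], outer_splits[i + 1])
        for j, row in enumerate(rows):
            start, stop = inner_splits[row], inner_splits[row + 1]
            for k in range(stop - start):
                indices.append((i, j, k))
    if indices:
        # Smallest dense shape that contains every index.
        dense_shape = tuple(max(ix[d] for ix in indices) + 1 for d in range(3))
    else:
        dense_shape = (len(outer_splits) - 1, 0, 0)
    return indices, list(values), dense_shape


indices, vals, dense_shape = ragged_to_coo(
    [1, 2, 3, 4, 5],
    outer_splits=[0, 2, 4],
    inner_splits=[0, 1, 3, 4, 5],
)
print(indices)
print(vals)
print(dense_shape)
```

This corresponds to the ragged value [[[1], [2, 3]], [[4], [5]]].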

System information

  • GPU model and memory: No
  • OS Platform: Ubuntu
  • Docker version: No
  • GCC/CUDA/cuDNN version: 7.4/No/No
  • Python/conda version: 3.6.13/4.13.0
  • TensorFlow/PyTorch version: 1.14.0

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
dataset = hb.data.ParquetDataset("test2.zstd.parquet", batch_size=1)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
sess = tf.Session()
vals = sess.run(next_element)

# One more simple demo:
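import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
# values has 5 elements; the two row splits are passed in the order that fails
val = hb.data.dataframe.DataFrame.Value(values=np.array([1, 2, 3, 4, 5]), nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4])))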
sess = tf.Session()
sess.run(val.to_sparse())

Willing to contribute

Yes

Metadata

Labels

bug: Something isn't working
