# Data Loading

Large-batch training on cloud requires high I/O performance. HybridBackend
supports memory-efficient loading of categorical data.

## 1. Data Frame
Supported logical data types:

| Name                        | Data Structure                                |
| --------------------------- | --------------------------------------------- |
| Scalar                      | `tf.Tensor` / `hb.data.DataFrame.Value`       |
| Fixed-Length List           | `tf.Tensor` / `hb.data.DataFrame.Value`       |
| Variable-Length List        | `tf.SparseTensor` / `hb.data.DataFrame.Value` |
| Variable-Length Nested List | `tf.SparseTensor` / `hb.data.DataFrame.Value` |

Supported physical data types:

| Category | Types                                            |
| -------- | ------------------------------------------------ |
| Integers | `int64` `uint64` `int32` `uint32` `int8` `uint8` |
| Numerics | `float64` `float32` `float16`                    |
| Text     | `string`                                         |
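
Conceptually, a variable-length list column can be stored as a flat values buffer plus row boundaries, which is the idea behind the `tf.RaggedTensor`/`tf.SparseTensor` representations above. A minimal pure-Python sketch of this encoding (illustrative only, not HybridBackend's internal layout):

```python
# Toy illustration: a variable-length list column encoded as a flat values
# buffer plus row splits (the idea behind tf.RaggedTensor / tf.SparseTensor).
rows = [[1, 2, 3], [4], [], [5, 6]]

values = [v for row in rows for v in row]
row_splits = [0]
for row in rows:
  row_splits.append(row_splits[-1] + len(row))

# values == [1, 2, 3, 4, 5, 6]; row_splits == [0, 3, 4, 4, 6]
restored = [values[row_splits[i]:row_splits[i + 1]] for i in range(len(rows))]
assert restored == rows
```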

``` {eval-rst}
.. autoclass:: hybridbackend.tensorflow.data.DataFrame
```

``` python
batch = it.get_next()
...
```

## 3. Deduplication

Some user-associated feature columns, such as a user's profile information or
recent behaviour (recently viewed items), often contain redundant information.
For instance, two records with the same user id share the same data in the
recently-viewed-items feature column. HybridBackend provides a deduplication
mechanism that speeds up data loading and reduces data storage.

### 3.1 Preparing deduplicated training data

Currently, it is the user's responsibility to deduplicate the training data (e.g., in Parquet format).
An example Python script is provided in `hybridbackend/docs/tutorial/ranking/taobao/data/deduplicate.py`.
In general, users shall provide three arguments:

1. `--deduplicated-block-size`: how many rows (records) are involved in each
   deduplication operation. For instance, if deduplication is applied over
   blocks of 1000 rows, each compressed block is restored to 1000 records
   during actual training. Theoretically, a larger deduplication block size
   yields a better deduplication ratio; however, the ratio also depends on the
   distribution of duplicated data.

2. `--user-cols`: a list of feature column names (fields). The first column in
   the list serves as the `key` for deduplication, while the remaining columns
   are the values (targets) to compress. Multiple `--user-cols` lists can be
   specified and are deduplicated independently.

3. `--non-user-cols`: the feature columns excluded from deduplication.

The prepared data contains an additional feature column for each `--user-cols`
list, which stores the inverse index used to restore the deduplicated values
during training.

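The inverse-index idea behind this preparation step can be sketched in pure Python (a toy illustration with hypothetical data, not the actual `deduplicate.py` implementation):

```python
# Toy sketch of block-wise deduplication: within one block, the key and value
# columns keep one entry per unique key, and an extra inverse-index column
# records which unique entry each original row maps to.
def deduplicate_block(keys, values):
  unique_keys, unique_values = [], []
  key_to_idx = {}
  inverse_index = []
  for k, v in zip(keys, values):
    if k not in key_to_idx:
      key_to_idx[k] = len(unique_keys)
      unique_keys.append(k)
      unique_values.append(v)
    inverse_index.append(key_to_idx[k])
  return unique_keys, unique_values, inverse_index

# One block of 5 rows with duplicated per-user features (hypothetical data).
user_ids = [7, 7, 9, 7, 9]
user_feats = [[1, 2], [1, 2], [3], [1, 2], [3]]
uniq_ids, uniq_feats, inv_idx = deduplicate_block(user_ids, user_feats)
# uniq_ids == [7, 9]; uniq_feats == [[1, 2], [3]]; inv_idx == [0, 0, 1, 0, 1]
```

The inverse-index column (`inv_idx` here) is what the additional feature column stores for each `--user-cols` group.
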
### 3.2 Reading and restoring deduplicated data

HybridBackend provides an API to read deduplicated training data prepared as in 3.1.

Example:

``` python
import tensorflow as tf
import hybridbackend.tensorflow as hb

# Define data frame fields.
fields = [
  hb.data.DataFrame.Field('user', tf.int64),  # scalar
  hb.data.DataFrame.Field('user-index', tf.int64),  # scalar
  hb.data.DataFrame.Field('user-feat-0', tf.int64, shape=[32]),  # fixed-length list
  hb.data.DataFrame.Field('user-feat-1', tf.int64, ragged_rank=1),  # variable-length list
  hb.data.DataFrame.Field('item-feat-0', tf.int64, ragged_rank=1)]  # variable-length list

# Read from deduplicated parquet files (deduplicated every 1024 rows)
# by specifying the `key` and `value` feature columns.
ds = hb.data.Dataset.from_parquet(
  '/path/to/f1.parquet',
  fields=fields,
  key_idx_field_names=['user-index'],
  value_field_names=[['user', 'user-feat-0', 'user-feat-1']])
ds = ds.batch(1)
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
```

Here, `key_idx_field_names` is a list of feature columns that contain the
inverse indices of the key feature columns, and `value_field_names` is a list
of feature-column lists, one per key feature column. Multiple `key-value`
deduplications are supported. When `get_next()` is called to obtain the
batched data, the deduplicated values are internally restored to their
original values.

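The restore performed at `get_next()` time can be mimicked in pure Python (a toy illustration with hypothetical column names and data; the real restore happens inside HybridBackend's dataset ops):

```python
# Toy sketch: restore deduplicated value columns in a batch by gathering
# with the inverse-index column. All names and data are hypothetical.
batch = {
    'user-index': [0, 0, 1, 0],           # inverse indices for the 'user' group
    'user': [7, 9],                       # deduplicated key column
    'user-feat-0': [[1, 2], [3]],         # deduplicated value column
    'item-feat-0': [[5], [6], [7], [8]],  # non-deduplicated column, one per row
}

def restore_group(batch, idx_name, value_names):
  idx = batch[idx_name]
  for name in value_names:
    batch[name] = [batch[name][i] for i in idx]

restore_group(batch, 'user-index', ['user', 'user-feat-0'])
# batch['user'] == [7, 7, 9, 7]
# batch['user-feat-0'] == [[1, 2], [1, 2], [3], [1, 2]]
```
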
## 4. Tips

### 4.1 Remove dataset ops in exported saved model

``` python
import tensorflow as tf
...
with tf.Graph().as_default() as predict_graph:
  ...
    outputs=model_outputs)
```

## 5. Benchmark

In a benchmark reading 20k samples from 200 columns of a Parquet file,
`hb.data.Dataset` is about **21.51x faster** than