# Data Loading

Large-batch training on cloud requires high I/O performance. HybridBackend
supports memory-efficient loading of categorical data.

## 1. Data Frame
Supported logical data types:

| Name                        | Data Structure                                |
| --------------------------- | --------------------------------------------- |
| Scalar                      | `tf.Tensor` / `hb.data.DataFrame.Value`       |
| Fixed-Length List           | `tf.Tensor` / `hb.data.DataFrame.Value`       |
| Variable-Length List        | `tf.SparseTensor` / `hb.data.DataFrame.Value` |
| Variable-Length Nested List | `tf.SparseTensor` / `hb.data.DataFrame.Value` |

Supported physical data types:

| Category | Types                                            |
| -------- | ------------------------------------------------ |
| Integers | `int64` `uint64` `int32` `uint32` `int8` `uint8` |
| Numerics | `float64` `float32` `float16`                    |
| Text     | `string`                                         |
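
Conceptually, a variable-length list column can be stored as a flat values buffer plus row boundaries, which is the idea behind the `tf.RaggedTensor`/`tf.SparseTensor` representations above. A minimal pure-Python sketch of this encoding (illustrative only, not HybridBackend's internal layout):

```python
# Toy illustration: a variable-length list column encoded as a flat values
# buffer plus row splits (the idea behind tf.RaggedTensor / tf.SparseTensor).
rows = [[1, 2, 3], [4], [], [5, 6]]

values = [v for row in rows for v in row]
row_splits = [0]
for row in rows:
  row_splits.append(row_splits[-1] + len(row))

# values == [1, 2, 3, 4, 5, 6]; row_splits == [0, 3, 4, 4, 6]
restored = [values[row_splits[i]:row_splits[i + 1]] for i in range(len(rows))]
assert restored == rows
```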

``` {eval-rst}
.. autoclass:: hybridbackend.tensorflow.data.DataFrame
```

``` python
batch = it.get_next()
...
```

## 3. Deduplication

Some user-associated feature columns, such as a user's profile information or
recent behaviour (recently viewed items), often contain redundant information.
For instance, two records with the same user id share the same data in the
recently-viewed-items feature column. HybridBackend provides a deduplication
mechanism that speeds up data loading and reduces data storage.

### 3.1 Preparing deduplicated training data

Currently, it is the user's responsibility to deduplicate the training data (e.g., in Parquet format).
An example Python script is provided in `hybridbackend/docs/tutorial/ranking/taobao/data/deduplicate.py`.
In general, users shall provide three arguments:

1. `--deduplicated-block-size`: how many rows (records) are involved in each
   deduplication operation. For instance, if deduplication is applied over
   blocks of 1000 rows, each compressed block is restored to 1000 records
   during actual training. Theoretically, a larger deduplication block size
   yields a better deduplication ratio; however, the ratio also depends on the
   distribution of duplicated data.

2. `--user-cols`: a list of feature column names (fields). The first column in
   the list serves as the `key` for deduplication, while the remaining columns
   are the values (targets) to compress. Multiple `--user-cols` lists can be
   specified and are deduplicated independently.

3. `--non-user-cols`: the feature columns excluded from deduplication.

The prepared data contains an additional feature column for each `--user-cols`
list, which stores the inverse index used to restore the deduplicated values
during training.

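The inverse-index idea behind this preparation step can be sketched in pure Python (a toy illustration with hypothetical data, not the actual `deduplicate.py` implementation):

```python
# Toy sketch of block-wise deduplication: within one block, the key and value
# columns keep one entry per unique key, and an extra inverse-index column
# records which unique entry each original row maps to.
def deduplicate_block(keys, values):
  unique_keys, unique_values = [], []
  key_to_idx = {}
  inverse_index = []
  for k, v in zip(keys, values):
    if k not in key_to_idx:
      key_to_idx[k] = len(unique_keys)
      unique_keys.append(k)
      unique_values.append(v)
    inverse_index.append(key_to_idx[k])
  return unique_keys, unique_values, inverse_index

# One block of 5 rows with duplicated per-user features (hypothetical data).
user_ids = [7, 7, 9, 7, 9]
user_feats = [[1, 2], [1, 2], [3], [1, 2], [3]]
uniq_ids, uniq_feats, inv_idx = deduplicate_block(user_ids, user_feats)
# uniq_ids == [7, 9]; uniq_feats == [[1, 2], [3]]; inv_idx == [0, 0, 1, 0, 1]
```

The inverse-index column (`inv_idx` here) is what the additional feature column stores for each `--user-cols` group.
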
### 3.2 Reading and restoring deduplicated data

HybridBackend provides an API to read deduplicated training data prepared as in 3.1.

Example:

``` python
import tensorflow as tf
import hybridbackend.tensorflow as hb

# Define data frame fields.
fields = [
  hb.data.DataFrame.Field('user', tf.int64),  # scalar
  hb.data.DataFrame.Field('user-index', tf.int64),  # scalar
  hb.data.DataFrame.Field('user-feat-0', tf.int64, shape=[32]),  # fixed-length list
  hb.data.DataFrame.Field('user-feat-1', tf.int64, ragged_rank=1),  # variable-length list
  hb.data.DataFrame.Field('item-feat-0', tf.int64, ragged_rank=1)]  # variable-length list

# Read from deduplicated parquet files (deduplicated every 1024 rows)
# by specifying the `key` and `value` feature columns.
ds = hb.data.Dataset.from_parquet(
  '/path/to/f1.parquet',
  fields=fields,
  key_idx_field_names=['user-index'],
  value_field_names=[['user', 'user-feat-0', 'user-feat-1']])
ds = ds.batch(1)
ds = ds.prefetch(4)
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()
```

Here, `key_idx_field_names` is a list of feature columns that contain the
inverse indices of the key feature columns, and `value_field_names` is a list
of feature-column lists, one per key feature column. Multiple `key-value`
deduplications are supported. When `get_next()` is called to obtain the
batched data, the deduplicated values are internally restored to their
original values.

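The restore performed at `get_next()` time can be mimicked in pure Python (a toy illustration with hypothetical column names and data; the real restore happens inside HybridBackend's dataset ops):

```python
# Toy sketch: restore deduplicated value columns in a batch by gathering
# with the inverse-index column. All names and data are hypothetical.
batch = {
    'user-index': [0, 0, 1, 0],           # inverse indices for the 'user' group
    'user': [7, 9],                       # deduplicated key column
    'user-feat-0': [[1, 2], [3]],         # deduplicated value column
    'item-feat-0': [[5], [6], [7], [8]],  # non-deduplicated column, one per row
}

def restore_group(batch, idx_name, value_names):
  idx = batch[idx_name]
  for name in value_names:
    batch[name] = [batch[name][i] for i in idx]

restore_group(batch, 'user-index', ['user', 'user-feat-0'])
# batch['user'] == [7, 7, 9, 7]
# batch['user-feat-0'] == [[1, 2], [1, 2], [3], [1, 2]]
```
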
## 4. Tips

### 4.1 Remove dataset ops in exported saved model

``` python
import tensorflow as tf
...
with tf.Graph().as_default() as predict_graph:
  ...
    outputs=model_outputs)
```

## 5. Benchmark

In a benchmark reading 20k samples from 200 columns of a Parquet file,
`hb.data.Dataset` is about **21.51x faster** than