Commit 4e7ae8c

caveness authored and tfx-copybara committed
Add documentation for custom data validation.
PiperOrigin-RevId: 491737047
1 parent 905f2be commit 4e7ae8c

2 files changed: +138 -0 lines changed

g3doc/custom_data_validation.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+# Custom Data Validation
+
+<!--*
+freshness: { owner: 'caveness' reviewed: '2022-11-29' }
+*-->
+
+TFDV supports custom data validation using SQL. You can run custom data
+validation using
+[validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py;l=236;rcl=488721853)
+or
+[custom_validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py;l=535;rcl=488721853).
+Use `validate_statistics` to run standard, schema-based data validation along
+with custom validation. Use `custom_validate_statistics` to run only custom
+validation.
+
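Editorial sketch (not part of the commit): the two entry points described above might be called roughly as follows. This assumes TFDV is installed with SQL-based custom validation available. The wrapper names `run_schema_and_custom` and `run_custom_only` are hypothetical, and the keyword names `custom_validation_config` and `validations` are assumptions about `validation_api.py` that should be checked against the installed version.

```python
def run_schema_and_custom(stats, schema, config):
    """Schema-based validation plus custom validations in one call."""
    # Imported lazily so the sketch can be loaded without TFDV installed.
    from tensorflow_data_validation.api import validation_api

    # Assumption: the CustomValidationConfig is passed through the
    # `custom_validation_config` keyword argument.
    return validation_api.validate_statistics(
        statistics=stats, schema=schema, custom_validation_config=config)


def run_custom_only(stats, config):
    """Custom validations only; no schema is involved."""
    from tensorflow_data_validation.api import validation_api

    # Assumption: the CustomValidationConfig is passed as `validations`.
    return validation_api.custom_validate_statistics(
        statistics=stats, validations=config)
```

Both wrappers are expected to return an `Anomalies` proto, per the description of the results further down in this file.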
+## Configuring Custom Data Validation
+
+Use the
+[CustomValidationConfig](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/anomalies/proto/custom_validation_config.proto)
+to define custom validations to run. For each validation, provide an
+SQL expression, which returns a boolean value. Each SQL expression is run
+against the summary statistics for the specified feature. If the expression
+returns false, TFDV generates a custom anomaly using the provided severity and
+anomaly description.
+
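Editorial sketch (not part of the commit): the rule "a false expression produces an anomaly" can be modeled in plain Python. This is not how TFDV evaluates SQL. Python callables stand in for the SQL expressions, and `ValidationSketch` and `run_validations` are hypothetical names. The statistics and validations are the ones from the single-feature example in the proto comment of this commit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationSketch:
    # Python stand-in for the SQL expression; it must return a bool.
    expression: Callable[[Dict], bool]
    severity: str
    description: str


def run_validations(feature: Dict,
                    validations: List[ValidationSketch]) -> List[Dict]:
    """Return one anomaly reason per validation whose expression is False."""
    reasons = []
    for validation in validations:
        if not validation.expression(feature):
            reasons.append({
                "severity": validation.severity,
                "short_description": validation.description,
            })
    return reasons


# Summary statistics bound to `feature`, as in the proto example.
feature = {"num_stats": {"num_zeros": 5, "max": 25}}
validations = [
    ValidationSketch(lambda f: f["num_stats"]["num_zeros"] < 3,
                     "ERROR", "Feature has too many zeros."),
    ValidationSketch(lambda f: f["num_stats"]["max"] > 10,
                     "ERROR", "Maximum value is too low."),
]

reasons = run_validations(feature, validations)
print(reasons)
```

Only the first validation fails (5 is not less than 3), so a single reason is reported, which matches the `Anomalies` output shown in the proto example.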
+You may configure custom validations that run against individual features or
+feature pairs. For each feature, specify both the dataset (i.e., slice) and the
+feature path to use, though you may leave the dataset name blank if you want to
+validate the default slice (i.e., all examples). For single feature validations,
+the feature statistics are bound to `feature`. For feature pair validations, the
+test feature statistics are bound to `feature_test` and the base feature
+statistics are bound to `feature_base`. See the section below for example
+queries.
+
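Editorial sketch (not part of the commit): the binding rules above, modeled in plain Python. The helper names `lookup` and `pair_expression_holds` are hypothetical, a Python callable stands in for the SQL expression, and the default slice is assumed to be stored under the dataset name "All Examples", as in the single-feature proto example. The statistics are taken from the feature-pair proto example.

```python
from typing import Callable, Dict, Tuple

# Assumption: statistics for the default slice are named "All Examples".
DEFAULT_SLICE = "All Examples"

# Maps (dataset name, feature path) to that feature's summary statistics.
StatsByFeature = Dict[Tuple[str, str], Dict]


def lookup(stats: StatsByFeature, dataset_name: str, feature_path: str) -> Dict:
    """Select a feature's statistics; a blank dataset name means the default slice."""
    return stats[(dataset_name or DEFAULT_SLICE, feature_path)]


def pair_expression_holds(test_stats: StatsByFeature,
                          base_stats: StatsByFeature,
                          dataset_name: str,
                          feature_test_path: str,
                          base_dataset_name: str,
                          feature_base_path: str,
                          expression: Callable[[Dict, Dict], bool]) -> bool:
    """Bind `feature_test` and `feature_base`, then evaluate the expression."""
    feature_test = lookup(test_stats, dataset_name, feature_test_path)
    feature_base = lookup(base_stats, base_dataset_name, feature_base_path)
    return expression(feature_test, feature_base)


test_stats = {("slice_1", "test_feature"): {"num_stats": {"num_zeros": 5, "max": 25}}}
base_stats = {("slice_2", "test_feature"): {"num_stats": {"num_zeros": 1, "max": 1}}}

holds = pair_expression_holds(
    test_stats, base_stats,
    dataset_name="slice_1", feature_test_path="test_feature",
    base_dataset_name="slice_2", feature_base_path="test_feature",
    expression=lambda feature_test, feature_base: (
        feature_test["num_stats"]["num_zeros"]
        < feature_base["num_stats"]["num_zeros"]))
print(holds)
```

Here the expression is false (5 is not less than 1), which is the case in which TFDV would report the configured anomaly.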
+If a custom validation triggers an anomaly, TFDV will return an Anomalies proto
+with the reason(s) for the anomaly. Each reason will have a short description,
+which is user configured, and a description with the query that caused the
+anomaly, the dataset names on which the query was run, and the base feature path
+(if running a feature-pair validation). See the section below for example
+results of custom validation.
+
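Editorial sketch (not part of the commit): the long `description` field can be pictured as a concatenation of the parts listed above. The layout below is inferred only from the two example anomalies in the proto comment of this commit, and `describe_anomaly` is a hypothetical helper, not TFDV code.

```python
from typing import Optional


def describe_anomaly(query: str,
                     test_dataset: str = "",
                     base_dataset: Optional[str] = None,
                     base_path: Optional[str] = None) -> str:
    """Assemble a description in the layout used by the proto examples."""
    parts = [
        "Custom validation triggered anomaly.",
        f"Query: {query}",
        # A blank test dataset name is reported as the default slice.
        f"Test dataset: {test_dataset or 'default slice'}",
    ]
    # The base dataset and base path appear only for feature-pair validations.
    if base_dataset is not None:
        parts.append(f"Base dataset: {base_dataset}")
    if base_path is not None:
        parts.append(f"Base path: {base_path}")
    return " ".join(parts)


single = describe_anomaly("feature.num_stats.num_zeros < 3")
pair = describe_anomaly(
    "feature_test.num_stats.num_zeros < feature_base.num_stats.num_zeros",
    test_dataset="slice_1", base_dataset="slice_2", base_path="test_feature")
print(single)
print(pair)
```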
+See the
+[documentation](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/anomalies/proto/custom_validation_config.proto)
+in the `CustomValidationConfig` proto for example
+configurations.
+
+

tensorflow_data_validation/anomalies/proto/custom_validation_config.proto

Lines changed: 91 additions & 0 deletions
@@ -20,6 +20,97 @@ package tensorflow.data_validation;
 import "tensorflow_metadata/proto/v0/anomalies.proto";
 import "tensorflow_metadata/proto/v0/path.proto";

+// Use this proto to configure custom validations in TFDV.
+// Example usages follow.
+// -----------------------------------------------------------------------------
+// Example Single-Feature Validation
+//   Statistics
+//     datasets {
+//       name: "All Examples"
+//       num_examples: 10
+//       features {
+//         path { step: 'test_feature' }
+//         type: INT
+//         num_stats { num_zeros: 5 max: 25 }
+//       }
+//     }
+//   CustomValidationConfig
+//     feature_validations {
+//       feature_path { step: 'test_feature' }
+//       validations {
+//         sql_expression: 'feature.num_stats.num_zeros < 3'
+//         severity: ERROR
+//         description: 'Feature has too many zeros.'
+//       }
+//       validations {
+//         sql_expression: 'feature.num_stats.max > 10'
+//         severity: ERROR
+//         description: 'Maximum value is too low.'
+//       }
+//     }
+//   Anomalies
+//     anomaly_info {
+//       key: 'test_feature'
+//       value: {
+//         path { step: 'test_feature' }
+//         severity: ERROR
+//         reason {
+//           type: CUSTOM_VALIDATION
+//           short_description: 'Feature has too many zeros.'
+//           description: 'Custom validation triggered anomaly. Query: feature.num_stats.num_zeros < 3 Test dataset: default slice'
+//         }
+//       }
+//     }
+// -----------------------------------------------------------------------------
+// Example Feature Pair Validation
+//   Statistics
+//     Test statistics
+//       datasets {
+//         name: "slice_1"
+//         num_examples: 10
+//         features {
+//           path { step: 'test_feature' }
+//           type: INT
+//           num_stats { num_zeros: 5 max: 25 }
+//         }
+//       }
+//     Base statistics
+//       datasets {
+//         name: "slice_2"
+//         num_examples: 10
+//         features {
+//           path { step: 'test_feature' }
+//           type: INT
+//           num_stats { num_zeros: 1 max: 1 }
+//         }
+//       }
+//   CustomValidationConfig
+//     feature_pair_validations {
+//       dataset_name: 'slice_1'
+//       feature_test_path { step: 'test_feature' }
+//       base_dataset_name: 'slice_2'
+//       feature_base_path { step: 'test_feature' }
+//       validations {
+//         sql_expression: 'feature_test.num_stats.num_zeros < feature_base.num_stats.num_zeros'
+//         severity: ERROR
+//         description: 'Test feature has too many zeros.'
+//       }
+//     }
+//   Anomalies
+//     anomaly_info {
+//       key: 'test_feature'
+//       value: {
+//         path { step: 'test_feature' }
+//         severity: ERROR
+//         reason {
+//           type: CUSTOM_VALIDATION
+//           short_description: 'Test feature has too many zeros.'
+//           description: 'Custom validation triggered anomaly. Query: feature_test.num_stats.num_zeros < feature_base.num_stats.num_zeros Test dataset: slice_1 Base dataset: slice_2 Base path: test_feature'
+//         }
+//       }
+//     }
+// =============================================================================
+
 message Validation {
   // Expression to evaluate. If the expression returns false, the anomaly is
   // returned.
