docs/concepts/dev/inside.md
### Source Code Control
Hopsworks provides source code control support using Git (GitHub, GitLab, or BitBucket). You can securely check out code into your project, and commit and push updates to your source code repository.
### FTI Pipeline Environments
In Hopsworks, a Job is a schedulable program that is allocated compute and memory.
### Orchestration
Airflow comes out of the box with Hopsworks, but you can also use an external Airflow cluster (with the Hopsworks Job operator) if you have one. Airflow can be used to schedule the execution of Jobs, individually or as part of Airflow DAGs.
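As a rough sketch of what this can look like, the DAG below triggers a Hopsworks Job from Airflow using the `hopsworks` Python client inside a `PythonOperator`. The job name `daily-feature-pipeline` is a made-up example, and the exact client calls may differ between library versions.

```python
from datetime import datetime

import hopsworks
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_feature_pipeline():
    # Credentials are read from the environment (e.g. an API key).
    project = hopsworks.login()
    jobs_api = project.get_jobs_api()
    job = jobs_api.get_job("daily-feature-pipeline")  # hypothetical job name
    # Block until the execution finishes so Airflow sees failures.
    job.run(await_termination=True)


with DAG(
    dag_id="feature_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_feature_pipeline",
        python_callable=run_feature_pipeline,
    )
```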
docs/concepts/fs/feature_group/external_fg.md
External feature groups are offline feature groups where their data is stored in an external data source.
In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, RDS, and Kafka.
A feature pipeline is a program that orchestrates the execution of a dataflow graph of data validation, aggregation, dimensionality reduction, transformation, and other feature engineering steps on input data to create and/or update feature data. With HSFS, you can write feature pipelines in different languages as shown in the figure below.
Your feature pipeline needs to connect to some (external) data source to read the data to be processed. Python, Spark, and Flink have connectors to a huge number of different data sources, while SQL feature pipelines are often restricted to a single data source (for example, your connector to Snowflake only runs SQL on Snowflake). SparkSQL, in contrast, can be used over tables that originate in different data sources.
### Data Validation
In order to train and serve models that you can rely on, you need clean, high-quality features. Data validation operations include removing bad data, removing or imputing missing values, and identifying problems such as feature shift. HSFS supports Great Expectations to specify data validation rules that are executed in the client before features are written to the Feature Store. The validation results are collected and shown in Hopsworks.
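A minimal sketch of this kind of client-side validation with Great Expectations (classic, pre-1.0 API); the DataFrame and column names are illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"cc_num": [1111, 2222, 3333], "amount": [12.5, 250.0, -3.0]})

# Wrap the DataFrame so expectations can be run against it directly.
gdf = ge.from_pandas(df)
result = gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
print(result.success)  # False: one amount is negative
```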
### Aggregations
Aggregations are used to summarize large datasets into more concise, signal-rich features. Popular aggregations include count(), sum(), mean(), median(), stddev(), min(), and max(). These aggregations produce a single number (a numerical feature) that captures information about a potentially large dataset. Both numerical and categorical features are often transformed before being used to train or serve models.
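For example, with Pandas (illustrative column names), per-card transaction aggregates can be computed in a few lines:

```python
import pandas as pd

tx = pd.DataFrame({
    "cc_num": [1111, 1111, 2222, 2222, 2222],
    "amount": [12.5, 99.0, 7.5, 250.0, 31.0],
})

# One row per card: count, sum, mean, std, min, max of transaction amounts.
agg = tx.groupby("cc_num")["amount"].agg(["count", "sum", "mean", "std", "min", "max"])
print(agg)
```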
### Dimensionality Reduction
If input data is impractically large or if it has a significant amount of redundancy, it can often be transformed into a reduced set of features with dimensionality reduction (often called feature extraction). Popular dimensionality reduction algorithms include embedding algorithms, PCA, and t-SNE.
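As a small scikit-learn sketch, PCA can reduce 50 raw columns to 8 components (synthetic data for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(1_000, 50))  # synthetic input

pca = PCA(n_components=8)          # keep the 8 strongest components
X_reduced = pca.fit_transform(X)   # shape: (1000, 8)
print(pca.explained_variance_ratio_.sum())  # variance retained
```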
### Transformations
Transformations are covered in more detail in [training/inference pipelines](../feature_view/training_inference_pipelines.md), as transformations typically happen after the feature store. If you store transformed features in feature groups, the feature data is no longer useful for EDA (it is near impossible for data scientists to understand the transformed values). It also makes it impossible for inference pipelines to log untransformed feature values and predictions for an operational model. There is one use case for storing transformed features in feature groups: when you need ultra-low latency when reading precomputed features (and online transformations when reading features add too much latency for your use case). The figure below shows how to include transformations in your feature pipelines.
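For that ultra-low-latency case, a transformation applied inside the feature pipeline might look like this minimal scikit-learn sketch (illustrative data); note that the fitted scaler would also have to be available to the inference pipelines:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

amounts = np.array([[12.5], [250.0], [7.5], [99.0]])  # illustrative feature

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)  # rescaled to zero mean, unit variance
# `scaler` would need to be saved and reused at inference time.
```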
### Feature Engineering in Python

Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
### Feature Engineering in Spark/PySpark
### Feature Engineering in SQL

SQL has grown in popularity for performing heavy lifting in feature pipelines.
You can do aggregation in SQL for data in your data warehouse or database.
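A sketch of such a SQL aggregation, here run through SparkSQL so it works across data sources; the table name and source path are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed source: a parquet dataset of raw transactions.
spark.read.parquet("s3://bucket/transactions").createOrReplaceTempView("transactions")

# Daily per-card aggregates computed in SQL.
daily = spark.sql("""
    SELECT cc_num,
           DATE(ts)    AS tx_date,
           COUNT(*)    AS n_tx,
           SUM(amount) AS total_amount
    FROM transactions
    GROUP BY cc_num, DATE(ts)
""")
```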
### Feature Engineering in Flink
Apache Flink is a powerful and flexible framework for stateful feature computation operations over unbounded and bounded data streams. It is used for feature engineering when you need very fresh features computed in real-time. Flink provides a rich set of operators and functions such as time windows and aggregation operations that can be applied to keyed and/or global window streams. Flink’s stateful operations allow users to maintain and update state across multiple data records or events, which is particularly useful for feature engineering tasks such as sessionization and/or maintaining rolling aggregates over a sliding window of data.
Flink feature engineering pipelines are supported in Java/Scala only.
### Feature Engineering in Beam
Beam feature engineering pipelines are supported in Java/Scala only.
Feature groups can be stored in a low-latency "online" database and/or in low-cost, high-throughput "offline" storage, typically a data lake or data warehouse.
The online store stores only the latest values of features for a feature group.
The offline store stores the historical values of features for a feature group, so it may store much more data than the online store. Offline feature groups are typically used to create training data for models, but also to retrieve data for batch scoring of models.
In most cases, offline data is stored in Hopsworks, but through data sources it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups, or it can be used for reading only by defining an [External Feature Group](external_fg.md).
HSFS supports monitoring, validation, and alerting for features:
- transparently computing statistics over features when writing to a feature group;
- validating data written to feature groups using Great Expectations;
- alerting users when there is a problem writing or updating features.
### Statistics
When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter; see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics are computed over all of the data in the Feature Group.
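A minimal sketch of enabling statistics when creating a feature group; the feature group name and keys are illustrative, and the user guide linked above documents the full set of options:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

fg = fs.create_feature_group(
    name="transactions",  # illustrative feature group name
    version=1,
    primary_key=["cc_num"],
    # Compute descriptive statistics, histograms, and correlations on writes.
    statistics_config={"enabled": True, "histograms": True, "correlations": True},
)
```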
### Data Validation
You can define expectation suites in Great Expectations and associate them with feature groups. When you write to a feature group, the expectations are executed, and you can define a policy on the feature group for what to do if any expectation fails.
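A sketch of attaching an expectation suite to a feature group, assuming the `fg` feature group from the statistics example above and the classic (pre-1.0) Great Expectations API; the suite name and the `"STRICT"` policy value (reject the insert if any expectation fails) are examples and may differ across versions:

```python
from great_expectations.core import ExpectationConfiguration, ExpectationSuite

suite = ExpectationSuite(expectation_suite_name="transactions_suite")
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "cc_num"},
    )
)

# "STRICT" rejects writes with failed expectations; "ALWAYS" writes anyway.
fg.save_expectation_suite(suite, validation_ingestion_policy="STRICT")
```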
### Alerting

HSFS also supports alerts that can be triggered when there are problems in your feature pipelines, for example, when a write fails due to an error or a failed expectation. You can send alerts to different alerting endpoints, such as email or Slack, which can be configured in the Hopsworks UI. For example, you can send a Slack message if features being written to a feature group are missing some input data.
Features are defined as on-demand when their value cannot be precomputed; instead, they need to be computed in real-time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Ensure that the same version of the Python module is installed in both the feature and inference pipelines.
The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or post code) that is computed using longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters, and the same Python function used to compute the zip code in the feature pipeline is used to compute it in the online inference pipeline.
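A sketch of such a shared on-demand feature function; the bucketing logic is a stand-in for a real reverse-geocoding implementation:

```python
def zip_code(latitude: float, longitude: float) -> str:
    """Compute a zip code from coordinates.

    Imported by both the feature pipeline and the online inference pipeline
    so the feature is computed identically in both. The bucketing below is
    stand-in logic; a real implementation would reverse-geocode.
    """
    lat_bucket = int((latitude + 90) * 10)
    lon_bucket = int((longitude + 180) * 10)
    return f"{(lat_bucket * 3600 + lon_bucket) % 100000:05d}"


# Same call in the feature pipeline and in the online application:
print(zip_code(40.7128, -74.0060))
```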
docs/concepts/fs/feature_group/versioning.md
See here for <a href="../../../../concepts/fs/feature_view/versioning/">information on versioning of feature views</a>.
The schema of feature groups is versioned. If you make a breaking change to the schema of a feature group, you need to increment the version of the feature group and then backfill the new feature group (see the sketch after the list below). A breaking schema change is when you:
- drop a column from the schema
- add a new feature without a default value for the new feature
- change how a feature is computed such that, for training models, the data for the old feature is not compatible with the data for the new feature. For example, if you have an embedding as a feature and change the algorithm that computes the embedding, you probably should not mix feature values computed with the old embedding model with feature values computed with the new embedding model.
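The sketch below shows what bumping the version after a breaking change might look like with the HSFS API; the feature group name, schema, and backfill data are illustrative:

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Breaking schema change: create version 2 of the feature group ...
fg_v2 = fs.create_feature_group(
    name="transactions",   # illustrative name
    version=2,             # incremented from version 1
    primary_key=["cc_num"],
)

# ... and backfill it with recomputed feature data under the new schema.
backfill_df = pd.DataFrame({"cc_num": [1111], "amount_mean_30d": [57.2]})
fg_v2.insert(backfill_df)
```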
Data versioning of a feature group involves tracking updates to the feature group so that you can recover the state of the feature group at a given point in time in the past.