
Commit 58386ba

Place a sentence per line
That is, replace ([^0-9])\. ([A-Z]) with $1.\n$2 plus process exceptions.
1 parent 675bb2f commit 58386ba

182 files changed: +3193 -1663 lines changed


docs/concepts/dev/outside.md

Lines changed: 5 additions & 2 deletions
@@ -1,5 +1,8 @@
-You can write programs that use Hopsworks in any [Python, Spark, PySpark, or Flink environment](../../user_guides/integrations/index.md). Hopsworks also running SQL queries to compute features in external data warehouses. The Feature Store can also be queried with SQL.
+You can write programs that use Hopsworks in any [Python, Spark, PySpark, or Flink environment](../../user_guides/integrations/index.md).
+Hopsworks also supports running SQL queries to compute features in external data warehouses.
+The Feature Store can also be queried with SQL.

-There is REST API for Hopsworks that can be used with a valid API key, generated in Hopsworks. However, it is often easier to develop your programs against SDKs available in Python and Java/Scala for HSFS, in Python for HSML, and in Python for the Hopsworks API.
+There is a REST API for Hopsworks that can be used with a valid API key, generated in Hopsworks.
+However, it is often easier to develop your programs against SDKs available in Python and Java/Scala for HSFS, in Python for HSML, and in Python for the Hopsworks API.

<img src="../../../assets/images/concepts/dev/dev-outside.svg">
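To make the SDK route concrete, here is a minimal sketch of connecting from an external Python environment with an API key, using the `hopsworks` SDK's login flow; the host and project name are hypothetical placeholders.

```python
import hopsworks

# Connect from an external environment with an API key generated in Hopsworks.
# The host, project, and key value below are placeholders.
project = hopsworks.login(
    host="my-cluster.hopsworks.ai",
    project="fraud_demo",
    api_key_value="MY_GENERATED_API_KEY",
)
fs = project.get_feature_store()  # entry point to the HSFS SDK
```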

docs/concepts/fs/feature_group/external_fg.md

Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,8 @@
-External feature groups are offline feature groups where their data is stored in an external table. An external table requires a data source, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
+External feature groups are offline feature groups whose data is stored in an external table.
+An external table requires a data source, defined with the Connector API (or, more typically, in the user interface), to enable HSFS to retrieve data from the external table.
+An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data.
+You can also perform SQL operations, including projections, aggregations, and so on.
+The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.

In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, RDS, and Kafka.
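As an illustration, a minimal hsfs-style sketch of defining an external feature group over a Snowflake table; the connector name, table, and columns are hypothetical, and method names may differ between hsfs versions.

```python
import hsfs

connection = hsfs.connection()  # assumes a configured Hopsworks connection
fs = connection.get_feature_store()

# Data source defined earlier via the Connector API or the UI (hypothetical name).
snowflake = fs.get_storage_connector("snowflake_fraud")

# The SQL string is executed on-demand whenever this feature group is read.
external_fg = fs.create_external_feature_group(
    name="transactions_external",
    version=1,
    storage_connector=snowflake,
    query="SELECT tid, cc_num, amount FROM transactions",
    primary_key=["tid"],
)
external_fg.save()
```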

docs/concepts/fs/feature_group/feature_monitoring.md

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ HSFS supports monitoring features on your Feature Group by:
## Scheduled Statistics

-After creating a Feature Group in HSFS, you can setup statistics monitoring to compute statistics over one or more features on a scheduled basis. Statistics are computed on the whole or a subset of feature data (i.e., detection window) already inserted into the Feature Group.
+After creating a Feature Group in HSFS, you can set up statistics monitoring to compute statistics over one or more features on a scheduled basis.
+Statistics are computed on all of the feature data already inserted into the Feature Group, or on a subset of it (i.e., a detection window).
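As a rough illustration, scheduled statistics monitoring with a detection window might be configured as below; the method names, cron format, and window options are assumptions; consult the feature monitoring user guide for the exact API.

```python
# Sketch only: assumes a feature group handle `fg` and an hsfs version with
# feature monitoring support; names and arguments are assumptions.
fg.create_statistics_monitoring(
    name="amount_daily_stats",
    feature_name="amount",             # monitor a single feature
    cron_expression="0 0 12 ? * * *",  # compute statistics daily at noon
).with_detection_window(
    time_offset="1w",                  # statistics over the last week of data
).save()
```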

## Statistics Comparison

docs/concepts/fs/feature_group/feature_pipelines.md

Lines changed: 32 additions & 10 deletions
@@ -1,46 +1,68 @@
-A feature pipeline is a program that orchestrates the execution of a dataflow graph of data validation, aggregation, dimensionality reduction, transformation, and other feature engineering steps on input data to create and/or update feature data. With HSFS, you can write feature pipelines in different languages as shown in the figure below.
+A feature pipeline is a program that orchestrates the execution of a dataflow graph of data validation, aggregation, dimensionality reduction, transformation, and other feature engineering steps on input data to create and/or update feature data.
+With HSFS, you can write feature pipelines in different languages, as shown in the figure below.

<img src="../../../../assets/images/concepts/fs/feature-pipelines.svg">
### Data Sources

-Your feature pipeline needs to connect to some (external) data source to read the data to be processed. Python, Spark, and Flink have connectors to a huge number of different data sources, while SQL feature pipelines are often restricted to a single data source (for example, your connector to SnowFlake only runs SQL on SnowFlake). SparkSQL, in contrast, can be used over tables that originate in different data sources.
+Your feature pipeline needs to connect to some (external) data source to read the data to be processed.
+Python, Spark, and Flink have connectors to a huge number of different data sources, while SQL feature pipelines are often restricted to a single data source (for example, your connector to Snowflake only runs SQL on Snowflake).
+SparkSQL, in contrast, can be used over tables that originate in different data sources.
### Data Validation

-In order to be able to train and serve models that you can rely on, you need clean, high quality features. Data validation operations include removing bad data, removing or imputing missing values, and identifying problems such as feature shift. HSFS supports Great Expectations to specify data validation rules that are executed in the client before features are written to the Feature Store. The validation results are collected and shown in Hopsworks.
+In order to train and serve models that you can rely on, you need clean, high-quality features.
+Data validation operations include removing bad data, removing or imputing missing values, and identifying problems such as feature shift.
+HSFS supports Great Expectations to specify data validation rules that are executed in the client before features are written to the Feature Store.
+The validation results are collected and shown in Hopsworks.
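For instance, a minimal sketch of attaching a Great Expectations suite to a feature group, assuming a feature group handle `fg` and the classic Great Expectations Pandas API:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"tid": [1, 2], "amount": [10.0, 25.0]})

# Build an expectation suite from a sample of the data.
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_be_not_null("tid")
ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
suite = ge_df.get_expectation_suite()

# Attach the suite so expectations run in the client before every write;
# with a STRICT policy, inserts that fail validation are rejected (sketch only).
fg.save_expectation_suite(suite, validation_ingestion_policy="STRICT")
fg.insert(df)
```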

### Aggregations

-Aggregations are used to summarize large datasets into more concise, signal-rich features. Popular aggregations include count(), sum(), mean(), median(), stddev(), min(), and max(). These aggregations produce a single number (a numerical feature) that captures information about a potentially large dataset. Both numerical and categorical features are often transformed before being used to train or serve models.
+Aggregations are used to summarize large datasets into more concise, signal-rich features.
+Popular aggregations include count(), sum(), mean(), median(), stddev(), min(), and max().
+These aggregations produce a single number (a numerical feature) that captures information about a potentially large dataset.
+Both numerical and categorical features are often transformed before being used to train or serve models.
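For example, a minimal Pandas sketch that collapses a transaction log into one signal-rich row per card (column names are illustrative):

```python
import pandas as pd

txns = pd.DataFrame({
    "cc_num": [111, 111, 222, 222, 222],
    "amount": [10.0, 25.0, 7.5, 80.0, 3.2],
})

# One row per card: each aggregate becomes a single numerical feature.
features = (
    txns.groupby("cc_num")["amount"]
        .agg(["count", "sum", "mean", "max"])
        .reset_index()
)
print(features)
```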

### Dimensionality Reduction

-If input data is impractically large or if it has a significant amount of redundancy, it can often be transformed into a reduced set of features with dimensionality reduction (often called feature extraction). Popular dimensionality algorithms include embedding algorithms, PCA, and TSNE.
+If input data is impractically large or has a significant amount of redundancy, it can often be transformed into a reduced set of features with dimensionality reduction (often called feature extraction).
+Popular dimensionality reduction algorithms include embedding algorithms, PCA, and t-SNE.
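As a quick sketch, scikit-learn's PCA compressing 100 partially redundant columns into 10 components (shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))   # 100 partially redundant raw columns

pca = PCA(n_components=10)         # keep the 10 strongest components
X_reduced = pca.fit_transform(X)   # shape: (1000, 10)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```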

### Transformations

-Transformations are covered in more detail in [training/inference pipelines](../feature_view/training_inference_pipelines.md), as transformations typically happen after the feature store. If you store transformed features in feature groups, the feature data is no longer useful for EDA (as it near to impossible for Data Scientists to understand the transformed values). It also makes it impossible for inference pipelines to log untransformed feature values and predictions for an operational model. There is one use case for storing transformed features in feature groups - when you need to have ultra low latency when reading precomputed features (and online transformations when reading features add too much latency for your use case). The figure below shows to include transformations in your feature pipelines.
+Transformations are covered in more detail in [training/inference pipelines](../feature_view/training_inference_pipelines.md), as transformations typically happen after the feature store.
+If you store transformed features in feature groups, the feature data is no longer useful for EDA (as it is near impossible for Data Scientists to understand the transformed values).
+It also makes it impossible for inference pipelines to log untransformed feature values and predictions for an operational model.
+There is one use case for storing transformed features in feature groups: when you need ultra-low latency when reading precomputed features (and online transformations when reading features would add too much latency for your use case).
+The figure below shows how to include transformations in your feature pipelines.

<img src="../../../../assets/images/concepts/fs/feature-pipelines-with-transformations.svg">
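For context, a typical model-specific transformation looks like the scikit-learn sketch below; per the text above, this would normally live in the training/inference pipeline rather than in the feature pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

amounts = np.array([[10.0], [25.0], [7.5], [80.0]])

# Fit on training data only, then reuse the same fitted scaler at inference
# time so training and serving see identically transformed values.
scaler = StandardScaler()
amounts_scaled = scaler.fit_transform(amounts)
```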

### Feature Engineering in Python

-Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), and dimensionality reduction (embeddings, PCA), and transformations (in Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source feature engineering frameworks used for automated feature engineering, such as [featuretools](https://www.featuretools.com/) that supports relational and temporal sources.
+Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch).
+Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
### Feature Engineering in Spark/PySpark

-Spark is popular as a feature engineering framework as it can scale to process larger volumes of data than Python, and provides native support for aggregations, and it supports many of the same data validation (Great Expectations), and dimensionality reduction algorithms (embeddings, PCA) as Python. Spark also has native support for transformations, which are useful for analytical models (batch scoring), but less useful for operational models, where online transformations are required, and Spark environments are less common. Online model serving environments typically only support online transformations in Python.
+Spark is popular as a feature engineering framework because it can scale to process larger volumes of data than Python, provides native support for aggregations, and supports many of the same data validation (Great Expectations) and dimensionality reduction (embeddings, PCA) libraries as Python.
+Spark also has native support for transformations, which are useful for analytical models (batch scoring), but less useful for operational models, where online transformations are required and Spark environments are less common.
+Online model serving environments typically only support online transformations in Python.
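The per-card aggregation from the Pandas sketch above scales out in PySpark, for example (column names again illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

txns = spark.createDataFrame(
    [(111, 10.0), (111, 25.0), (222, 7.5)],
    ["cc_num", "amount"],
)

# Distributed aggregation: one feature row per card.
features = txns.groupBy("cc_num").agg(
    F.count("*").alias("txn_count"),
    F.sum("amount").alias("total_spend"),
)
features.show()
```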

### Feature Engineering in SQL

-SQL has grown in popularity for performing heavy lifting in feature pipelines - computing aggregates on data - when the input data already resides in a data warehouse. Data warehouses also support data validation, for example, through Great Expectations in DBT. However, SQL is not mature as a platform for transformations and dimensionality reductions, where UDFs are applied row-wise.
+SQL has grown in popularity for performing the heavy lifting in feature pipelines (computing aggregates on data) when the input data already resides in a data warehouse.
+Data warehouses also support data validation, for example, through Great Expectations in DBT.
+However, SQL is not mature as a platform for transformations and dimensionality reduction, where UDFs are applied row-wise.

You can do aggregation in SQL for data in your data warehouse or database.
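For example, the same aggregation expressed declaratively in SQL, here run through SparkSQL over a temporary view so the sketch stays in Python (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
txns = spark.createDataFrame(
    [(111, 10.0), (111, 25.0), (222, 7.5)],
    ["cc_num", "amount"],
)
txns.createOrReplaceTempView("transactions")

features = spark.sql("""
    SELECT cc_num,
           COUNT(*)    AS txn_count,
           SUM(amount) AS total_spend
    FROM transactions
    GROUP BY cc_num
""")
features.show()
```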

### Feature Engineering in Flink

-Apache Flink is a powerful and flexible framework for stateful feature computation operations over unbounded and bounded data streams. It is used for feature engineering when you need very fresh features computed in real-time. Flink provides a rich set of operators and functions such as time windows and aggregation operations that can be applied to keyed and/or global window streams. Flink’s stateful operations allow users to maintain and update state across multiple data records or events, which is particularly useful for feature engineering tasks such as sessionization and/or maintaining rolling aggregates over a sliding window of data.
+Apache Flink is a powerful and flexible framework for stateful feature computation over unbounded and bounded data streams.
+It is used for feature engineering when you need very fresh features computed in real-time.
+Flink provides a rich set of operators and functions, such as time windows and aggregation operations, that can be applied to keyed and/or global window streams.
+Flink’s stateful operations allow users to maintain and update state across multiple data records or events, which is particularly useful for feature engineering tasks such as sessionization and/or maintaining rolling aggregates over a sliding window of data.

Flink feature engineering pipelines are supported in Java/Scala only.

Lines changed: 12 additions & 5 deletions
@@ -1,6 +1,10 @@
-As a programmer, you can consider a feature, in machine learning, to be a variable associated with some entity that contains a value that is useful for helping train a model to solve a prediction problem. That is, the feature is just a variable with predictive power for a machine learning problem, or task.
+As a programmer, you can consider a feature, in machine learning, to be a variable associated with some entity that contains a value that is useful for helping train a model to solve a prediction problem.
+That is, a feature is just a variable with predictive power for a machine learning problem, or task.

-A feature group is a table of features, where each feature group has a primary key, and optionally an event_time column (indicating when the features in that row were observed), and a partition key. Collectively, they are referred to as columns. The partition key determines how to layout the feature group rows on disk such that you can efficiently query the data using queries with the partition key. For example, if your partition key is the day and you have hundreds of days worth of data, with a partition key, you can query the day for only a given day or a range of days, and only the data for those days will be read from disk.
+A feature group is a table of features, where each feature group has a primary key, and optionally an event_time column (indicating when the features in that row were observed) and a partition key.
+Collectively, these are referred to as columns.
+The partition key determines how to lay out the feature group rows on disk so that you can efficiently query the data with queries on the partition key.
+For example, if your partition key is the day and you have hundreds of days' worth of data, you can query the data for only a given day or a range of days, and only the data for those days will be read from disk.

<img src="../../../../assets/images/concepts/fs/feature-group-table.png">
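A minimal hsfs-style sketch of declaring such a feature group (the feature group name, keys, and the surrounding `fs` and `df` handles are illustrative):

```python
# Sketch only: assumes a feature store handle `fs` and a DataFrame `df`
# whose columns include the declared keys.
fg = fs.create_feature_group(
    name="card_transactions_agg",
    version=1,
    primary_key=["cc_num"],          # entity identifier
    event_time="transaction_time",   # when the row's features were observed
    partition_key=["day"],           # on-disk layout for efficient range reads
    online_enabled=True,             # also materialize to the online store
    description="Aggregated card transaction features",
)
fg.insert(df)
```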

@@ -12,10 +16,13 @@ Feature groups can be stored in a low-latency "online" database and/or in low co
#### Online Storage

-The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime.
+The online store stores only the latest values of features for a feature group.
+It is used to serve pre-computed features to models at runtime.

#### Offline Storage

-The offline store stores the historical values of features for a feature group so that it may store much more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models.
+The offline store stores the historical values of features for a feature group, so it may hold much more data than the online store.
+Offline feature groups are typically used to create training data for models, but also to retrieve data for batch scoring of models.

-In most cases, offline data is stored in Hopsworks, but through the implementation of data sources, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining [External Feature Group](external_fg.md).
+In most cases, offline data is stored in Hopsworks, but through the implementation of data sources, it can reside in an external file system.
+The externally stored data can be managed by Hopsworks by defining ordinary feature groups, or it can be used for reading only by defining an [External Feature Group](external_fg.md).
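To make the online/offline split concrete, a hedged sketch of how each store is typically read with hsfs (the feature view name and key are illustrative):

```python
# Offline store: historical feature data, e.g. for training or batch scoring.
historical_df = fg.read()

# Online store: latest values per primary key, served at low latency.
# A feature view over the group is the usual access path (hsfs-style calls).
fv = fs.create_feature_view(
    name="card_transactions_view",
    version=1,
    query=fg.select_all(),
)
fv.init_serving()
row = fv.get_feature_vector({"cc_num": 111})
```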

docs/concepts/fs/feature_group/fg_statistics.md

Lines changed: 7 additions & 3 deletions
@@ -6,14 +6,18 @@ HSFS supports monitoring, validation, and alerting for features:
### Statistics

-When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
+When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter; see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details.
+Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
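For instance, the feature group declaration extended with a `statistics_config`; the keys shown are commonly documented ones, but treat them as assumptions for your hsfs version:

```python
# Sketch only: assumes a feature store handle `fs`.
fg = fs.create_feature_group(
    name="card_transactions_agg",
    version=1,
    primary_key=["cc_num"],
    statistics_config={
        "enabled": True,        # recompute statistics on every write
        "histograms": True,     # per-feature value distributions
        "correlations": True,   # pairwise feature correlations
    },
)
```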

### Data Validation

-You can define expectation suites in Great Expectations and associate them with feature groups. When you write to a feature group, the expectations are executed, then you can define a policy on the feature group for what to do if any expectation fails.
+You can define expectation suites in Great Expectations and associate them with feature groups.
+When you write to a feature group, the expectations are executed; you can define a policy on the feature group for what to do if any expectation fails.

<img src="../../../../assets/images/concepts/fs/fg-expectations.svg">

### Alerting

-HSFS also supports alerts, that can be triggered when there are problems in your feature pipelines, for example, when a write fails due to an error or a failed expectation. You can send alerts to different alerting endpoints, such as email or Slack, that can be configured in the Hopsworks UI. For example, you can send a slack message if features being written to a feature group are missing some input data.
+HSFS also supports alerts that can be triggered when there are problems in your feature pipelines, for example, when a write fails due to an error or a failed expectation.
+You can send alerts to different alerting endpoints, such as email or Slack, which can be configured in the Hopsworks UI.
+For example, you can send a Slack message if features being written to a feature group are missing some input data.
