Commit efdbed0

Add markdownlint config and fixes
1 parent 666b0c6 commit efdbed0

172 files changed: +1154 -974 lines changed


.markdownlint.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+ MD041: false
+ MD013: false
+ MD033: false

The new config disables three markdownlint rules repo-wide: MD041 (first line in a file should be a top-level heading), MD013 (line length), and MD033 (no inline HTML).

README.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
  # Documentation landing page

- This is the source of the landing page for https://docs.hopsworks.ai
+ This is the source of the landing page for <https://docs.hopsworks.ai>

  ## Build instructions

@@ -35,7 +35,7 @@ Use mkdocs to build the documentation and serve it locally
  {PY_ENV}/bin/mkdocs serve
  ```

- The documentation should now be available locally on the following URL: http://127.0.0.1:8000/
+ The documentation should now be available locally on the following URL: <http://127.0.0.1:8000/>

  ## Adding new pages

@@ -56,4 +56,4 @@ linkchecker http://127.0.0.1:8000/

  # If ok just kill the server
  kill -9 $SERVER_PID
- ```
+ ```

docs/concepts/dev/inside.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ Hopsworks provides a Jupyter notebook development environment for programs writt

  ### Source Code Control

- Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely check out code into your project and commit and push updates to your source code repository.
+ Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely check out code into your project and commit and push updates to your source code repository.

  ### FTI Pipeline Environments

@@ -31,4 +31,4 @@ In Hopsworks, a Job is a schedulable program that is allocated compute and memor

  ### Orchestration

- Airflow comes out of the box with Hopsworks, but you can also use an external Airflow cluster (with the Hopsworks Job operator) if you have one. Airflow can be used to schedule the execution of Jobs, individually or as part of Airflow DAGs.
+ Airflow comes out of the box with Hopsworks, but you can also use an external Airflow cluster (with the Hopsworks Job operator) if you have one. Airflow can be used to schedule the execution of Jobs, individually or as part of Airflow DAGs.
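
As an aside on how these Jobs are driven programmatically, a minimal sketch using the hopsworks Python client; the login flow and job name are assumptions for illustration:

```python
import hopsworks

# Connect to the cluster; the API key is read from the environment.
project = hopsworks.login()

# Look up an existing Job by name and start an execution, blocking
# until it finishes.
jobs_api = project.get_jobs_api()
job = jobs_api.get_job("feature_pipeline_job")  # hypothetical job name
execution = job.run(await_termination=True)
```

An Airflow DAG, or the Hopsworks Job operator mentioned above, would wrap the same call in a task to add scheduling and retries.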

docs/concepts/fs/feature_group/external_fg.md

Lines changed: 0 additions & 1 deletion
@@ -3,4 +3,3 @@ External feature groups are offline feature groups where their data is stored in
  In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, RDS, and Kafka.

  <img src="../../../../assets/images/concepts/fs/fg-connector-api.svg">
-
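
For context, a sketch of defining a read-only external feature group over one of these sources with hsfs; the connector name, query, and column names are invented for the example:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# A storage connector configured beforehand in Hopsworks (name invented).
snowflake = fs.get_storage_connector("snowflake_connector")

# Map an existing table to a read-only external feature group; the query
# is pushed down to the external source when the data is read.
ext_fg = fs.create_external_feature_group(
    name="customer_profiles",
    version=1,
    query="SELECT * FROM PROFILES",
    storage_connector=snowflake,
    primary_key=["customer_id"],
)
ext_fg.save()
```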

docs/concepts/fs/feature_group/feature_monitoring.md

Lines changed: 0 additions & 2 deletions
@@ -16,5 +16,3 @@ In addition to scheduled statistics, you can enable the comparison of statistics

  !!! info "Feature Monitoring Guide"
      More information can be found in the [Feature monitoring guide](../../../user_guides/fs/feature_monitoring/index.md).
-
-

Lines changed: 10 additions & 6 deletions
@@ -1,28 +1,32 @@
- A feature pipeline is a program that orchestrates the execution of a dataflow graph of data validation, aggregation, dimensionality reduction, transformation, and other feature engineering steps on input data to create and/or update feature data. With HSFS, you can write feature pipelines in different languages as shown in the figure below.
+ A feature pipeline is a program that orchestrates the execution of a dataflow graph of data validation, aggregation, dimensionality reduction, transformation, and other feature engineering steps on input data to create and/or update feature data. With HSFS, you can write feature pipelines in different languages as shown in the figure below.

  <img src="../../../../assets/images/concepts/fs/feature-pipelines.svg">

  ### Data Sources
+
  Your feature pipeline needs to connect to some (external) data source to read the data to be processed. Python, Spark, and Flink have connectors to a huge number of different data sources, while SQL feature pipelines are often restricted to a single data source (for example, your connector to Snowflake only runs SQL on Snowflake). SparkSQL, in contrast, can be used over tables that originate in different data sources.

  ### Data Validation
+
  In order to be able to train and serve models that you can rely on, you need clean, high-quality features. Data validation operations include removing bad data, removing or imputing missing values, and identifying problems such as feature shift. HSFS supports Great Expectations to specify data validation rules that are executed in the client before features are written to the Feature Store. The validation results are collected and shown in Hopsworks.

  ### Aggregations

  Aggregations are used to summarize large datasets into more concise, signal-rich features. Popular aggregations include count(), sum(), mean(), median(), stddev(), min(), and max(). These aggregations produce a single number (a numerical feature) that captures information about a potentially large dataset. Both numerical and categorical features are often transformed before being used to train or serve models.

  ### Dimensionality Reduction
+
  If input data is impractically large or if it has a significant amount of redundancy, it can often be transformed into a reduced set of features with dimensionality reduction (often called feature extraction). Popular dimensionality reduction algorithms include embedding algorithms, PCA, and TSNE.

  ### Transformations
- Transformations are covered in more detail in [training/inference pipelines](../feature_view/training_inference_pipelines.md), as transformations typically happen after the feature store. If you store transformed features in feature groups, the feature data is no longer useful for EDA (as it is near impossible for Data Scientists to understand the transformed values). It also makes it impossible for inference pipelines to log untransformed feature values and predictions for an operational model. There is one use case for storing transformed features in feature groups - when you need ultra low latency when reading precomputed features (and online transformations when reading features add too much latency for your use case). The figure below shows how to include transformations in your feature pipelines.
+
+ Transformations are covered in more detail in [training/inference pipelines](../feature_view/training_inference_pipelines.md), as transformations typically happen after the feature store. If you store transformed features in feature groups, the feature data is no longer useful for EDA (as it is near impossible for Data Scientists to understand the transformed values). It also makes it impossible for inference pipelines to log untransformed feature values and predictions for an operational model. There is one use case for storing transformed features in feature groups - when you need ultra low latency when reading precomputed features (and online transformations when reading features add too much latency for your use case). The figure below shows how to include transformations in your feature pipelines.

  <img src="../../../../assets/images/concepts/fs/feature-pipelines-with-transformations.svg">

  ### Feature Engineering in Python
- Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.

+ Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.

  ### Feature Engineering in Spark/PySpark

@@ -34,12 +38,12 @@ SQL has grown in popularity for performing heavy lifting in feature pipelines -

  You can do aggregation in SQL for data in your data warehouse or database.

-
  ### Feature Engineering in Flink
+
  Apache Flink is a powerful and flexible framework for stateful feature computation over unbounded and bounded data streams. It is used for feature engineering when you need very fresh features computed in real-time. Flink provides a rich set of operators and functions, such as time windows and aggregation operations, that can be applied to keyed and/or global window streams. Flink's stateful operations allow users to maintain and update state across multiple data records or events, which is particularly useful for feature engineering tasks such as sessionization and/or maintaining rolling aggregates over a sliding window of data.

  Flink feature engineering pipelines are supported in Java/Scala only.

-
  ### Feature Engineering in Beam
- Beam feature engineering pipelines are supported in Java/Scala only.
+
+ Beam feature engineering pipelines are supported in Java/Scala only.
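
To make the aggregation step concrete, a minimal Pandas sketch (the customer/transaction columns are invented):

```python
import pandas as pd

# Toy transactions; a real feature pipeline would read these from one
# of the data sources discussed above.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.5, 8.0, 12.5, 99.0],
})

# Collapse many rows per customer into a few signal-rich features.
features = df.groupby("customer_id")["amount"].agg(
    txn_count="count",
    txn_sum="sum",
    txn_mean="mean",
    txn_max="max",
)
print(features)
```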

docs/concepts/fs/feature_group/fg_overview.md

Lines changed: 1 addition & 2 deletions
@@ -4,7 +4,6 @@ A feature group is a table of features, where each feature group has a primary k

  <img src="../../../../assets/images/concepts/fs/feature-group-table.png">

-
  ### Online and offline Storage

  Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse.

@@ -19,4 +18,4 @@

  The offline store stores the historical values of features for a feature group so that it may store much more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models.

- In most cases, offline data is stored in Hopsworks, but through the implementation of data sources, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining an [External Feature Group](external_fg.md).
+ In most cases, offline data is stored in Hopsworks, but through the implementation of data sources, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining an [External Feature Group](external_fg.md).

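
A sketch of how the online/offline choice surfaces in code: in hsfs, a single flag on feature group creation enables the online store (names are illustrative):

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# One flag decides whether the latest feature values are also written
# to the low-latency online store, in addition to the offline store.
fg = fs.create_feature_group(
    name="transactions",  # invented example name
    version=1,
    primary_key=["tid"],
    online_enabled=True,
)
```
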
Lines changed: 3 additions & 6 deletions
@@ -1,22 +1,19 @@
  HSFS supports monitoring, validation, and alerting for features:

- - transparently compute statistics over features on writing to a feature group;
- - validation of data written to feature groups using Great Expectations
- - alerting users when there was a problem writing or updating features.
+ - transparently compute statistics over features on writing to a feature group;
+ - validation of data written to feature groups using Great Expectations
+ - alerting users when there was a problem writing or updating features.

  ### Statistics

  When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter; see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.

-
  ### Data Validation

  You can define expectation suites in Great Expectations and associate them with feature groups. When you write to a feature group, the expectations are executed, and you can define a policy on the feature group for what to do if any expectation fails.

  <img src="../../../../assets/images/concepts/fs/fg-expectations.svg">

-
-
  ### Alerting

  HSFS also supports alerts that can be triggered when there are problems in your feature pipelines, for example, when a write fails due to an error or a failed expectation. You can send alerts to different alerting endpoints, such as email or Slack, that can be configured in the Hopsworks UI. For example, you can send a Slack message if features being written to a feature group are missing some input data.
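
Putting the statistics and validation pieces together, a hedged hsfs/Great Expectations sketch; the feature group name, columns, and suite contents are invented, and the argument names follow common hsfs usage:

```python
import hopsworks
import pandas as pd
import great_expectations as ge

project = hopsworks.login()
fs = project.get_feature_store()

# One rule: the amount column must never be null.
suite = ge.core.ExpectationSuite(expectation_suite_name="transactions_suite")
suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "amount"},
    )
)

# Statistics are recomputed on every write; the suite is evaluated
# before data lands in the feature group.
fg = fs.create_feature_group(
    name="transactions",  # invented example name
    version=1,
    primary_key=["tid"],
    statistics_config={"enabled": True, "histograms": True, "correlations": True},
    expectation_suite=suite,
)
fg.insert(pd.DataFrame({"tid": [1, 2], "amount": [12.5, 99.0]}))
```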

docs/concepts/fs/feature_group/on_demand_feature.md

Lines changed: 1 addition & 2 deletions
@@ -6,7 +6,6 @@ description: On-demand feature computation.

  Features are defined as on-demand when their value cannot be pre-computed beforehand; rather, they need to be computed in real-time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Also ensure that the same version of the Python module is installed in both the feature and inference pipelines.

- The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or post code) that is computed using longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters to the application, and the same Python function used to calculate the zip code in the feature pipeline is used to compute the zip code in the Online Inference pipeline.
+ The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or post code) that is computed using longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters to the application, and the same Python function used to calculate the zip code in the feature pipeline is used to compute the zip code in the Online Inference pipeline.

  <img src="../../../../assets/images/concepts/fs/on-demand-feature.png">
-
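
A hypothetical sketch of such a shared Python function; the module name and the coordinate-to-zip logic are stand-ins, since a real implementation would use a geocoding lookup:

```python
# on_demand_features.py - hypothetical module installed in both the
# feature pipeline and the online inference pipeline.

def zip_code(latitude: float, longitude: float) -> str:
    """Derive a zip code from coordinates; the lookup here is a stub."""
    # Stand-in logic so the function is deterministic and testable; a real
    # implementation would consult a geocoding service or lookup table.
    return f"{int(abs(latitude) * 100) % 100000:05d}"
```

Because both pipelines import the same function, the feature value computed online matches the one used to train the model.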

docs/concepts/fs/feature_group/versioning.md

Lines changed: 3 additions & 4 deletions
@@ -4,9 +4,9 @@ See here for <a href="../../../../concepts/fs/feature_view/versioning/">informat

  The schema of feature groups is versioned. If you make a breaking change to the schema of a feature group, you need to increment the version of the feature group, and then backfill the new feature group. A breaking schema change is when you:

- - drop a column from the schema
- - add a new feature without any default value for the new feature
- - change how a feature is computed, such that, for training models, the data for the old feature is not compatible with the data for the new feature. For example, if you have an embedding as a feature and change the algorithm to compute that embedding, you probably should not mix feature values computed with the old embedding model with feature values computed with the new embedding model.
+ - drop a column from the schema
+ - add a new feature without any default value for the new feature
+ - change how a feature is computed, such that, for training models, the data for the old feature is not compatible with the data for the new feature. For example, if you have an embedding as a feature and change the algorithm to compute that embedding, you probably should not mix feature values computed with the old embedding model with feature values computed with the new embedding model.

  <img src="../../../../assets/images/concepts/fs/schema-versioning.svg">

@@ -15,4 +15,3 @@ The schema of feature groups is versioned. If you make a breaking change to the
  Data Versioning of a feature group involves tracking updates to the feature group, so that you can recover the state of the feature group at a given point-in-time in the past.

  <img src="../../../../assets/images/concepts/fs/data-versioning.svg">
-
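
A sketch of the bump-and-backfill flow for a breaking schema change, using hsfs-style calls; the feature group and column names are invented:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# A breaking change means creating version 2 alongside version 1,
# never mutating version 1 in place.
fg_v1 = fs.get_feature_group(name="transactions", version=1)
fg_v2 = fs.create_feature_group(
    name="transactions",
    version=2,
    primary_key=["tid"],
)

# Backfill version 2, applying the breaking change (here: dropping a
# column, the first kind of breaking change listed above).
df = fg_v1.read()
fg_v2.insert(df.drop(columns=["deprecated_feature"]))  # column name invented
```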
