Commit 675bb2f

Fix

1 parent efdbed0 commit 675bb2f

21 files changed: +168 -169 lines changed

.markdownlint.yaml

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 MD041: false
 MD013: false
 MD033: false
+MD004:
+  style: dash

docs/concepts/dev/inside.md

Lines changed: 27 additions & 15 deletions
@@ -1,34 +1,46 @@
-Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training and inference pipeline) python environments, you can manage your source code with Git, and you can orchestrate jobs with Airflow.
 
-<img src="../../../assets/images/concepts/dev/dev-inside.svg">
+Hopsworks provides a complete self-service development environment for feature engineering and model training.
+You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training, and inference pipeline) Python environments, manage your source code with Git, and orchestrate jobs with Airflow.
+
+<img src="../../../assets/images/concepts/dev/dev-inside.svg" alt="Hopsworks Development Environment" />
 
 ### Jupyter Notebooks
 
-Hopsworks provides a Jupyter notebook development environment for programs written in Python, Spark, Flink, and SparkSQL. You can also develop in your IDE (PyCharm, IntelliJ, etc), test locally, and then run your programs as Jobs in Hopsworks. Jupyter notebooks can also be run as Jobs.
+Hopsworks provides a Jupyter notebook development environment for programs written in Python, Spark, Flink, and SparkSQL.
+You can also develop in your IDE (PyCharm, IntelliJ, etc.), test locally, and then run your programs as Jobs in Hopsworks.
+Jupyter notebooks can also be run as Jobs.
 
 ### Source Code Control
 
-Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely checkout code into your project and commit and push updates to your code to your source code repository.
+Hopsworks provides source code control support using Git (GitHub, GitLab, or BitBucket).
+You can securely check out code into your project, then commit and push updates to your source code repository.
 
 ### FTI Pipeline Environments
 
-Hopsworks postulates that building ML systems following the FTI pipeline architecture is best practice. This architecture consists of three independently developed and operated ML pipelines:
+Hopsworks postulates that building ML systems following the FTI pipeline architecture is best practice.
+This architecture consists of three independently developed and operated ML pipelines:
 
-* Feature pipeline: takes as input raw data that it transforms into features (and labels)
-* Training pipeline: takes as input features (and labels) and outputs a trained model
-* Inference pipeline: takes new feature data and a trained model and makes predictions
+- Feature pipeline: takes as input raw data that it transforms into features (and labels)
+- Training pipeline: takes as input features (and labels) and outputs a trained model
+- Inference pipeline: takes new feature data and a trained model and makes predictions
 
-In order to facilitate the development of these pipelines Hopsworks bundles several python environments containing necessary dependencies. Each of these environments may then also be customized further by cloning it and installing additional dependencies from PyPi, Conda channels, Wheel files, GitHub repos or a custom Dockerfile. Internal compute such as Jobs and Jupyter is run in one of these environments and changes are applied transparently when you install new libraries using our APIs. That is, there is no need to write a Dockerfile, users install libraries directly in one or more of the environments. You can setup custom development and production environments by creating separate projects or creating multiple clones of an environment within the same project.
+To facilitate the development of these pipelines, Hopsworks bundles several Python environments containing the necessary dependencies.
+Each of these environments can be customized further by cloning it and installing additional dependencies from PyPI, Conda channels, Wheel files, GitHub repos, or a custom Dockerfile.
+Internal compute such as Jobs and Jupyter runs in one of these environments, and changes are applied transparently when you install new libraries using our APIs.
+That is, there is no need to write a Dockerfile; users install libraries directly in one or more of the environments.
+You can set up custom development and production environments by creating separate projects or by creating multiple clones of an environment within the same project.
 
 ### Jobs
 
-In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources. You can run a Job in Hopsworks:
+In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources.
+You can run a Job in Hopsworks:
 
-* From the UI
-* Programmatically with the Hopsworks SDK (Python, Java) or REST API
-* From Airflow programs (either inside our outside Hopsworks)
-* From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks))
+- From the UI
+- Programmatically with the Hopsworks SDK (Python, Java) or REST API
+- From Airflow programs (either inside or outside Hopsworks)
+- From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks))
 
 ### Orchestration
 
-Airflow comes out-of-the box with Hopsworks, but you can also use an external Airflow cluster (with the Hopsworks Job operator) if you have one. Airflow can be used to schedule the execution of Jobs, individually or as part of Airflow DAGs.
+Airflow comes out-of-the-box with Hopsworks, but you can also use an external Airflow cluster (with the Hopsworks Job operator) if you have one.
+Airflow can be used to schedule the execution of Jobs, individually or as part of Airflow DAGs.
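
Editor's note on the environments paragraph in this hunk: below is a minimal Python sketch of customizing a bundled environment through the Hopsworks client, with no Dockerfile involved. The accessor and method names (`get_environment_api`, `get_environment`, `install_requirements`) and the environment and file names are assumptions for illustration; verify the exact calls against the Hopsworks API reference.

```python
import hopsworks

# Log in to the Hopsworks cluster; the API key is read from the
# environment or prompted for interactively.
project = hopsworks.login()

# Hypothetical accessor for the project's Python environments.
env_api = project.get_environment_api()

# Fetch one of the bundled FTI environments (name is illustrative) and
# install extra PyPI dependencies into it; the change is applied
# transparently to Jobs and Jupyter sessions using this environment.
env = env_api.get_environment("python-feature-pipeline")
env.install_requirements("Resources/requirements.txt")
```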

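And to make the "Programmatically with the Hopsworks SDK" bullet concrete, a hedged sketch of creating and running a Job from Python; treat `get_jobs_api`, `get_configuration`, and `create_job` as assumed names to verify against the SDK documentation, and the script path as illustrative.

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_jobs_api()  # assumed accessor name

# Start from a default Python job configuration and point it at a script
# already uploaded to the project filesystem (path is illustrative).
config = jobs_api.get_configuration("PYTHON")
config["appPath"] = "/Projects/my_project/Resources/feature_pipeline.py"

job = jobs_api.create_job("feature_pipeline_job", config)

# Run the job and block until the execution terminates; the same job can
# also be scheduled from Airflow or triggered over the REST API.
execution = job.run(await_termination=True)
```
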
docs/concepts/fs/feature_view/offline_api.md

Lines changed: 8 additions & 8 deletions
@@ -1,19 +1,19 @@
 The feature view provides an *Offline API* for
 
-* creating training data
-* creating batch (scoring) data
+- creating training data
+- creating batch (scoring) data
 
 ## Training Data
 
 Training data is created using a feature view. You can create training data as either:
 
-* in-memory Pandas/Polars DataFrames, useful when you have a small amount of training data;
-* materialized training data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).
+- in-memory Pandas/Polars DataFrames, useful when you have a small amount of training data;
+- materialized training data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).
 
 You can apply filters when creating training data from a feature view:
 
-* start-time and end-time, for example, to create the train-set from an earlier time range, and the test-set from a later (unseen) time range;
-* feature value features, for example, only train a model on customers from a particular country.
+- start-time and end-time, for example, to create the train-set from an earlier time range and the test-set from a later (unseen) time range;
+- feature value filters, for example, to only train a model on customers from a particular country.
 
 Note that filters are not applied when retrieving feature vectors using feature views, as we only look up features for a specific entity, like a customer. In this case, the application should know that predictions for this customer should be made on the model trained on customers in the USA, for example.

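For reference against the training-data options and filters described in this hunk, here is a rough HSFS sketch; the feature view name, dates, and exact method signatures (`train_test_split`, `create_train_test_split`) are assumptions to verify against the HSFS documentation.

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
feature_view = fs.get_feature_view("transactions_view", version=1)  # illustrative name

# Option 1: in-memory DataFrames, for small training sets.
X_train, X_test, y_train, y_test = feature_view.train_test_split(test_size=0.2)

# Option 2: materialized training data with time-range filters, so the
# test set covers a later (unseen) range. Dates are illustrative.
td_version, materialization_job = feature_view.create_train_test_split(
    train_start="2024-01-01",
    train_end="2024-06-30",
    test_start="2024-07-01",
    test_end="2024-09-30",
    data_format="csv",
)
```
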
@@ -44,8 +44,8 @@ Test data can also be split into evaluation sets to help evaluate a model for po
 
 Batch data for scoring models is created using a feature view. Similar to training data, you can create batch data as either:
 
-* in-memory Pandas/Polars DataFrames, useful when you have a small amount of data to score;
-* materialized data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet)
+- in-memory Pandas/Polars DataFrames, useful when you have a small amount of data to score;
+- materialized data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).
 
 Batch data requires specification of a `start_time` for the start of the batch scoring data. You can also specify the `end_time` (default is the current date).

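A corresponding sketch for batch scoring data, again with illustrative names and dates; `get_batch_data` with `start_time`/`end_time` follows the behavior described above, but verify the signature in the HSFS documentation.

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
feature_view = fs.get_feature_view("transactions_view", version=1)

# Fetch batch scoring data for a time window; end_time is optional and
# defaults to the current date.
batch_df = feature_view.get_batch_data(
    start_time="2024-10-01",
    end_time="2024-10-07",
)

# predictions = model.predict(batch_df)  # 'model' is a hypothetical trained model
```
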
docs/concepts/fs/feature_view/online_api.md

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ A feature vector is a row of features (without the primary key(s) and event time
 
 It may be the case that for any given feature vector, not all features will come pre-engineered from the feature store. Some features will be provided by the client (or at least the raw data to compute the feature will come from the client). We call these 'passed' features and, similar to precomputed features from the feature store, they can also be transformed by the HSFS client in the method:
 
-* feature_view.get_feature_vector(entry, passed_features={...})
+- feature_view.get_feature_vector(entry, passed_features={...})
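
Since the hunk above names the method directly, a short usage sketch; the feature view name, entity key, and passed feature are illustrative.

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
feature_view = fs.get_feature_view("transactions_view", version=1)

# Look up the precomputed features for one entity by primary key, and
# supply the 'passed' feature value provided by the client at request time.
vector = feature_view.get_feature_vector(
    entry={"customer_id": 42},           # primary key of the entity
    passed_features={"amount": 310.5},   # client-supplied feature
)
```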

docs/concepts/mlops/prediction_services.md

Lines changed: 4 additions & 4 deletions
@@ -2,10 +2,10 @@ A prediction service is an end-to-end analytical or operational machine learning
 
 A prediction service consists of the following components:
 
-* feature pipeline(s),
-* training pipeline,
-* inference pipeline (for either batch predictions or online predictions)
-* a sink for predictions - either a store or a user-interface.
+- feature pipeline(s),
+- training pipeline,
+- inference pipeline (for either batch predictions or online predictions),
+- a sink for predictions - either a store or a user interface.
 
 ## Analytical ML

docs/concepts/projects/cicd.md

Lines changed: 4 additions & 4 deletions
@@ -11,10 +11,10 @@ You can create dev, staging, and prod projects - either on the same cluster, but
 
 Hopsworks supports the versioning of ML assets, including:
 
-* Feature Groups: the version of its schema - breaking schema changes require a new version and backfilling the new version;
-* Feature Views: the version of its schema, and breaking schema changes only require a new version;
-* Models: the version of a model;
-* Deployments: the version of the deployment of a model - a model with the same version can be found in >1 deployment.
+- Feature Groups: the version of its schema - breaking schema changes require a new version and backfilling the new version;
+- Feature Views: the version of its schema, and breaking schema changes only require a new version;
+- Models: the version of a model;
+- Deployments: the version of the deployment of a model - a model with the same version can be found in >1 deployment.
 
 ## Pytest for feature logic and feature pipeline tests

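To illustrate the Feature Group bullet above - a breaking schema change means registering a new version and backfilling it - here is a hedged HSFS sketch with an illustrative name and schema.

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# A breaking schema change: register version 2 of the feature group...
fg_v2 = fs.get_or_create_feature_group(
    name="transactions",
    version=2,  # bumped for the breaking change
    primary_key=["customer_id"],
    event_time="event_ts",
)

# ...and backfill it from a recomputed DataFrame matching the new schema.
backfill_df = pd.DataFrame({
    "customer_id": [1, 2],
    "event_ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "amount_usd": [10.0, 25.5],  # e.g., a renamed/retyped column
})
fg_v2.insert(backfill_df)
```
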
docs/concepts/projects/search.md

Lines changed: 4 additions & 4 deletions
@@ -6,10 +6,10 @@ description: "Documentation on the Hopsworks capabilities to discover machine-le
 
 Hopsworks supports free-text search to discover machine-learning assets:
 
-* features
-* feature groups
-* feature views
-* training data
+- features
+- feature groups
+- feature views
+- training data
 
 You can use the search bar at the top of your project to free-text search for the names or descriptions of any ML asset. You can also search using keywords or tags that are attached to an ML asset.

docs/concepts/projects/storage.md

Lines changed: 7 additions & 7 deletions
@@ -1,12 +1,12 @@
 Every project in Hopsworks has its own private assets:
 
-* a Feature Store (including both Online and Offline Stores)
-* a Filesystem subtree (all directory and files under /Projects/<project_name>/)
-* a Model Registry
-* Model Deployments
-* Kafka topics
-* OpenSearch indexes (including KNN indexes - the vector DB)
-* a Hive Database
+- a Feature Store (including both Online and Offline Stores)
+- a Filesystem subtree (all directories and files under /Projects/<project_name>/)
+- a Model Registry
+- Model Deployments
+- Kafka topics
+- OpenSearch indexes (including KNN indexes - the vector DB)
+- a Hive Database
 
 Access to these assets is controlled using project membership ACLs (access-control lists). Users in a project who have a *Data Owner* role have read/write access to these assets. Users in a project who have a *Data Scientist* role have mostly read-only access to these assets, with the exception of the ability to write to well-known directories (Resources, Jupyter, Logs).

docs/setup_installation/admin/ha-dr/dr.md

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@ The state of the Hopsworks cluster is divided into data and metadata and distrib
 
 The following services contain critical state that should be backed up:
 
-* **RonDB**: as mentioned above, the RonDB is used by Hopsworks to store the cluster metadata as well as the data for the online feature store.
-* **HopsFS**: HopsFS stores the data for the batch feature store as well as checkpoints and logs for feature engineering applications.
+- **RonDB**: as mentioned above, RonDB is used by Hopsworks to store the cluster metadata as well as the data for the online feature store.
+- **HopsFS**: HopsFS stores the data for the batch feature store as well as checkpoints and logs for feature engineering applications.
 
 Backing up service/application metrics and service/application logs is out of the scope of this guide. By default, metrics and logs are rotated after 7 days. Application logs are available on HopsFS when the application has finished and, as such, are backed up with the rest of HopsFS’ data.

docs/setup_installation/admin/ha-dr/ha.md

Lines changed: 4 additions & 4 deletions
@@ -2,13 +2,13 @@
 
 At a high level, a Hopsworks cluster can be divided into 4 groups of nodes. Each node group should be deployed according to the requirements (e.g., 3/5/7 nodes for the head node group) to guarantee the availability of the components.
 
-* **Head nodes**: The head node is responsible for running all the metadata, public API, and user interface services that are required for Hopsworks to provide its functionality. They need to be deployed in an odd number (1, 3, 5) as the head nodes run services like Zookeeper and OpenSearch which enforce consistency through quorum based protocols. The head nodes are also responsible for managing the services running on the remaining group of nodes.
-* **Worker nodes**: The worker node is responsible for executing the feature engineering pipeline code as well as storing the data for the offline feature store (HopsFS). In an on-prem deployment, the data is stored and replicated on the workers’ local hard drives. By default the data is replicated across 3 workers. In a cloud deployment, HopsFS’ data is persisted in a cloud object store (Amazon S3, Azure Blob Storage, Google Cloud Blob Storage) and the HopsFS datanodes are responsible for persisting, retrieving and caching of blocks from the object store.
-* **RonDB Data nodes**:
+- **Head nodes**: Head nodes are responsible for running all the metadata, public API, and user interface services that are required for Hopsworks to provide its functionality. They need to be deployed in an odd number (1, 3, 5), as the head nodes run services like ZooKeeper and OpenSearch, which enforce consistency through quorum-based protocols. The head nodes are also responsible for managing the services running on the remaining groups of nodes.
+- **Worker nodes**: Worker nodes are responsible for executing the feature engineering pipeline code as well as storing the data for the offline feature store (HopsFS). In an on-prem deployment, the data is stored and replicated on the workers’ local hard drives. By default, the data is replicated across 3 workers. In a cloud deployment, HopsFS’ data is persisted in a cloud object store (Amazon S3, Azure Blob Storage, Google Cloud Blob Storage) and the HopsFS datanodes are responsible for persisting, retrieving, and caching blocks from the object store.
+- **RonDB Data nodes**:
 These nodes are responsible for storing the services’ metadata (Hopsworks, HopsFS, Hive Metastore, Airflow) as well as the data for the online feature store.
 For high availability, at least two data nodes should be deployed, and RonDB is typically configured with a replication factor of 2, as it uses synchronous replication with 2-phase commit, not a quorum-based replication protocol.
 More advanced deployment patterns and best practices are covered in the [RonDB documentation](https://docs.rondb.com).
-* **Query brokers**: The query brokers are the entry point for querying the online feature store. They handle authentication, authorization and execution of the requests for online feature data being submitted from the feature store APIs. At least two query brokers should be deployed to achieve high availability. Query brokers are stateless. Additional query brokers should be deployed to handle additional load and clients.
+- **Query brokers**: The query brokers are the entry point for querying the online feature store. They handle authentication, authorization, and execution of the requests for online feature data submitted from the feature store APIs. At least two query brokers should be deployed to achieve high availability. Query brokers are stateless; additional query brokers should be deployed to handle additional load and clients.
 
 Example deployment:
