
Commit 60339c4

[SDP][DEMO] Small changes
1 parent 7d8a0ea commit 60339c4

1 file changed (+166 -125)

docs/declarative-pipelines/index.md

Lines changed: 166 additions & 125 deletions
@@ -94,8 +94,7 @@ Declarative Pipelines supports the following dataset types:
 
Streaming tables can be created with the following (see the sketch after this list):
 
-* [@dp.create_streaming_table](#create_streaming_table)
-* [CREATE STREAMING TABLE](../sql/SparkSqlAstBuilder.md/#visitCreatePipelineDataset)
+* [@dp.create_streaming_table](#create_streaming_table) or [CREATE STREAMING TABLE](../sql/SparkSqlAstBuilder.md/#visitCreatePipelineDataset) (with no flows; flows can be defined later with [@dp.append_flow](#append_flow) or [CREATE FLOW AS INSERT INTO BY NAME](../sql/SparkSqlAstBuilder.md/#visitCreatePipelineInsertIntoFlow))
* [CREATE STREAMING TABLE ... AS](../sql/SparkSqlAstBuilder.md/#visitCreatePipelineDataset)
 
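For illustration, a minimal Python sketch of the first style, with hypothetical table, flow, and source names (assuming the `pyspark.pipelines` module is imported as `dp`):

```python
# Sketch: create a streaming table with no flows, then attach an append flow.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession

spark = SparkSession.active()

dp.create_streaming_table("raw_events")  # hypothetical name; no flows yet

@dp.append_flow(target="raw_events")  # the flow that feeds the table above
def raw_events_from_rate():
    # A built-in, schema-free streaming source, used here for illustration only.
    return spark.readStream.format("rate").load()
```
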
## Spark Connect Only { #spark-connect }
@@ -287,27 +286,47 @@ export SPARK_HOME=/Users/jacek/oss/spark
uv add --editable $SPARK_HOME/python/packaging/client
```
 
+```shell
+uv tree --depth 2
+```
+
+=== "Output"
+
+    ```text
+    Resolved 15 packages in 3ms
+    hello-spark-pipelines v0.1.0
+    └── pyspark-client v4.2.0.dev0
+        ├── googleapis-common-protos v1.72.0
+        ├── grpcio v1.76.0
+        ├── grpcio-status v1.76.0
+        ├── numpy v2.3.4
+        ├── pandas v2.3.3
+        ├── pyarrow v22.0.0
+        └── pyyaml v6.0.3
+    ```
+
```shell
uv pip list
```
 
-??? note "Output"
+=== "Output"
 
    ```text
    Package                  Version     Editable project location
    ------------------------ ----------- ----------------------------------------------
-   googleapis-common-protos 1.70.0
-   grpcio                   1.74.0
-   grpcio-status            1.74.0
-   numpy                    2.3.2
-   pandas                   2.3.1
-   protobuf                 6.31.1
-   pyarrow                  21.0.0
-   pyspark-client           4.1.0.dev0  /Users/jacek/oss/spark/python/packaging/client
+   googleapis-common-protos 1.72.0
+   grpcio                   1.76.0
+   grpcio-status            1.76.0
+   numpy                    2.3.4
+   pandas                   2.3.3
+   protobuf                 6.33.0
+   pyarrow                  22.0.0
+   pyspark-client           4.2.0.dev0  /Users/jacek/oss/spark/python/packaging/client
    python-dateutil          2.9.0.post0
    pytz                     2025.2
-   pyyaml                   6.0.2
+   pyyaml                   6.0.3
    six                      1.17.0
+   typing-extensions        4.15.0
    tzdata                   2025.2
    ```
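
For a quick check that the editable client actually resolves from this environment, a minimal sketch (assuming the virtual environment is active and `pyspark-client` was installed as above):

```python
# Sketch: confirm the editable pyspark-client is importable from the venv.
import pyspark

print(pyspark.__version__)  # expect a dev build, e.g. 4.2.0.dev0
print(pyspark.__file__)     # should resolve into the Spark source tree
```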

@@ -317,7 +336,38 @@ Activate (_source_) the virtual environment (that `uv` helped us create).
source .venv/bin/activate
```
 
-This activation brings all the necessary PySpark modules that have not been released yet and are only available in the source format only (incl. Spark Declarative Pipelines).
+This activation brings in all the necessary Spark Declarative Pipelines Python dependencies (available only in source form) for non-`uv` tools and CLIs, incl. the [Spark Pipelines CLI](#spark-pipelines) itself.
+
+```shell
+$SPARK_HOME/bin/spark-pipelines --help
+```
+
+!!! note ""
+
+    ```text
+    usage: cli.py [-h] {run,dry-run,init} ...
+
+    Pipelines CLI
+
+    positional arguments:
+      {run,dry-run,init}
+        run       Run a pipeline. If no refresh options specified, a
+                  default incremental update is performed.
+        dry-run   Launch a run that just validates the graph and checks
+                  for errors.
+        init      Generate a sample pipeline project, with a spec file and
+                  example transformations.
+
+    options:
+      -h, --help  show this help message and exit
+    ```
+
+??? note "macOS and PYSPARK_PYTHON"
+
+    On macOS, you may want to set the `PYSPARK_PYTHON` environment variable to point at Python >= 3.10.
+
+    ```shell
+    export PYSPARK_PYTHON=python3.14
+    ```
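
For illustration, a minimal check of the interpreter requirement mentioned in the note above (a sketch; `python3.14` is only an example):

```python
# Sketch: verify the active interpreter satisfies Python >= 3.10.
import sys

assert sys.version_info >= (3, 10), f"Python too old: {sys.version}"
print(sys.executable)  # the interpreter the Spark Pipelines CLI would use
```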

## Demo: Python API

@@ -384,59 +434,40 @@ INFO PipelinesHandler: Define pipelines dataset cmd received: define_dataset {
 
## Demo: spark-pipelines CLI
 
-??? warning "Activate Virtual Environment"
+!!! warning "Activate Virtual Environment"
 
    Follow [Demo: Create Virtual Environment for Python Client](#demo-create-virtual-environment-for-python-client) before getting started with this demo.
 
-Run `spark-pipelines --help` to learn the options.
-
-=== "Command Line"
-
-    ```shell
-    $SPARK_HOME/bin/spark-pipelines --help
-    ```
-
-!!! note ""
-
-    ```text
-    usage: cli.py [-h] {run,dry-run,init} ...
-
-    Pipelines CLI
-
-    positional arguments:
-      {run,dry-run,init}
-        run       Run a pipeline. If no refresh options specified, a
-                  default incremental update is performed.
-        dry-run   Launch a run that just validates the graph and checks
-                  for errors.
-        init      Generate a sample pipeline project, including a spec
-                  file and example definitions.
-
-    options:
-      -h, --help  show this help message and exit
-    ```
-
-Execute `spark-pipelines dry-run` to validate the graph and check for errors.
-
-You haven't created a pipeline graph yet (and any exceptions are expected).
-
-=== "Command Line"
-
-    ```shell
-    $SPARK_HOME/bin/spark-pipelines dry-run
-    ```
-
-!!! note ""
-
-    ```console
-    Traceback (most recent call last):
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 382, in <module>
-        main()
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 358, in main
-        spec_path = find_pipeline_spec(Path.cwd())
-                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-      File "/Users/jacek/oss/spark/python/pyspark/pipelines/cli.py", line 101, in find_pipeline_spec
-        raise PySparkException(
-    pyspark.errors.exceptions.base.PySparkException: [PIPELINE_SPEC_FILE_NOT_FOUND] No pipeline.yaml or pipeline.yml file provided in arguments or found in directory `/` or readable ancestor directories.
-    ```
+### 1️⃣ Display Pipelines Help
+
+Run `spark-pipelines --help` to learn the options.
+
+```shell
+$SPARK_HOME/bin/spark-pipelines --help
+```
+
+!!! note ""
+
+    ```text
+    usage: cli.py [-h] {run,dry-run,init} ...
+
+    Pipelines CLI
+
+    positional arguments:
+      {run,dry-run,init}
+        run       Run a pipeline. If no refresh options specified, a
+                  default incremental update is performed.
+        dry-run   Launch a run that just validates the graph and checks
+                  for errors.
+        init      Generate a sample pipeline project, including a spec
+                  file and example definitions.
+
+    options:
+      -h, --help  show this help message and exit
+    ```
+
+### 2️⃣ Create Pipelines Demo Project
+
+You've only created an empty Python project so far (using `uv`).
 
Create a demo `hello-spark-pipelines` pipelines project with a sample `pipeline.yml` and sample transformations (in Python and in SQL).

@@ -446,26 +477,35 @@ mv hello-spark-pipelines/* . && \
rm -rf hello-spark-pipelines
```
 
-```console
-❯ cat pipeline.yml
-
-name: hello-spark-pipelines
-libraries:
-  - glob:
-      include: transformations/**/*.py
-  - glob:
-      include: transformations/**/*.sql
+```shell
+cat pipeline.yml
```
 
-```console
-❯ tree transformations
-transformations
-├── example_python_materialized_view.py
-└── example_sql_materialized_view.sql
+!!! note ""
+
+    ```text
+    name: hello-spark-pipelines
+    storage: storage-root
+    libraries:
+      - glob:
+          include: transformations/**
+    ```
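
The `libraries` glob pulls in every transformation file under `transformations/`. As a rough illustration of what it matches, a sketch using `pathlib` (an approximation of the spec's glob semantics, not the CLI's actual loader):

```python
# Sketch: list the files the 'transformations/**' glob would pick up.
from pathlib import Path

for path in sorted(Path("transformations").rglob("*")):
    if path.suffix in {".py", ".sql"}:  # the Python and SQL transformations
        print(path)
```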

-1 directory, 2 files
+```shell
+tree transformations
```
 
+!!! note ""
+
+    ```text
+    transformations
+    ├── example_python_materialized_view.py
+    └── example_sql_materialized_view.sql
+
+    1 directory, 2 files
+    ```
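
The generated Python transformation defines a materialized view. A rough sketch of its shape (assumptions: the `pyspark.pipelines` import alias `dp` and a trivial `spark.range` body; the actual generated file may differ):

```python
# Sketch of transformations/example_python_materialized_view.py (approximate).
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.active()

@dp.materialized_view
def example_python_materialized_view() -> DataFrame:
    # A batch query whose result the pipeline materializes as a table.
    return spark.range(5)
```
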
!!! warning "Spark Connect Server should be down"
    `spark-pipelines dry-run` starts its own Spark Connect Server on port 15002 (unless started with the `--remote` option).

@@ -482,72 +522,73 @@ transformations
$SPARK_HOME/bin/spark-pipelines --remote sc://localhost dry-run
```
 
-=== "Command Line"
-
-    ```shell
-    $SPARK_HOME/bin/spark-pipelines dry-run
-    ```
-
-!!! note ""
-
-    ```text
-    2025-08-31 12:26:59: Creating dataflow graph...
-    2025-08-31 12:27:00: Dataflow graph created (ID: c11526a6-bffe-4708-8efe-7c146696d43c).
-    2025-08-31 12:27:00: Registering graph elements...
-    2025-08-31 12:27:00: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
-    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-31 12:27:00: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-31 12:27:00: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-31 12:27:00: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-31 12:27:00: Starting run (dry=True, full_refresh=[], full_refresh_all=False, refresh=[])...
-    2025-08-31 10:27:00: Run is COMPLETED.
-    ```
+### 3️⃣ Dry Run Pipelines Project
+
+```shell
+$SPARK_HOME/bin/spark-pipelines dry-run
+```
+
+!!! note ""
+
+    ```text
+    2025-11-08 18:01:45: Creating dataflow graph...
+    2025-11-08 18:01:45: Registering graph elements...
+    2025-11-08 18:01:45: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-11-08 18:01:45: Found 2 files matching glob 'transformations/**/*'
+    2025-11-08 18:01:45: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-11-08 18:01:45: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-11-08 18:01:45: Starting run...
+    2025-11-08 17:01:45: Run is COMPLETED.
+    ```
+
+### 4️⃣ Run Pipelines Project
 
Run the pipeline.
 
-=== "Command Line"
-
-    ```shell
-    $SPARK_HOME/bin/spark-pipelines run
-    ```
-
-!!! note ""
-
-    ```console
-    2025-08-31 12:29:04: Creating dataflow graph...
-    2025-08-31 12:29:04: Dataflow graph created (ID: 3851261d-9d74-416a-8ec6-22a28bee381c).
-    2025-08-31 12:29:04: Registering graph elements...
-    2025-08-31 12:29:04: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
-    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.py'
-    2025-08-31 12:29:04: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
-    2025-08-31 12:29:04: Found 1 files matching glob 'transformations/**/*.sql'
-    2025-08-31 12:29:04: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
-    2025-08-31 12:29:04: Starting run (dry=False, full_refresh=[], full_refresh_all=False, refresh=[])...
-    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
-    2025-08-31 10:29:05: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
-    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
-    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is STARTING.
-    2025-08-31 10:29:05: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
-    2025-08-31 10:29:06: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
-    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
-    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
-    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
-    2025-08-31 10:29:07: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
-    2025-08-31 10:29:09: Run is COMPLETED.
-    ```
+```shell
+$SPARK_HOME/bin/spark-pipelines run
+```
+
+!!! note ""
+
+    ```text
+    2025-11-08 18:02:35: Creating dataflow graph...
+    2025-11-08 18:02:35: Registering graph elements...
+    2025-11-08 18:02:35: Loading definitions. Root directory: '/Users/jacek/sandbox/hello-spark-pipelines'.
+    2025-11-08 18:02:35: Found 2 files matching glob 'transformations/**/*'
+    2025-11-08 18:02:35: Importing /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_python_materialized_view.py...
+    2025-11-08 18:02:35: Registering SQL file /Users/jacek/sandbox/hello-spark-pipelines/transformations/example_sql_materialized_view.sql...
+    2025-11-08 18:02:35: Starting run...
+    2025-11-08 17:02:35: Flow spark_catalog.default.example_python_materialized_view is QUEUED.
+    2025-11-08 17:02:35: Flow spark_catalog.default.example_sql_materialized_view is QUEUED.
+    2025-11-08 17:02:35: Flow spark_catalog.default.example_python_materialized_view is PLANNING.
+    2025-11-08 17:02:35: Flow spark_catalog.default.example_python_materialized_view is STARTING.
+    2025-11-08 17:02:35: Flow spark_catalog.default.example_python_materialized_view is RUNNING.
+    2025-11-08 17:02:37: Flow spark_catalog.default.example_python_materialized_view has COMPLETED.
+    2025-11-08 17:02:37: Flow spark_catalog.default.example_sql_materialized_view is PLANNING.
+    2025-11-08 17:02:37: Flow spark_catalog.default.example_sql_materialized_view is STARTING.
+    2025-11-08 17:02:37: Flow spark_catalog.default.example_sql_materialized_view is RUNNING.
+    2025-11-08 17:02:38: Flow spark_catalog.default.example_sql_materialized_view has COMPLETED.
+    2025-11-08 17:02:39: Run is COMPLETED.
+    ```
 
-```console
-❯ tree spark-warehouse
-spark-warehouse
-├── example_python_materialized_view
-│   ├── _SUCCESS
-│   └── part-00000-75bc5b01-aea2-4d05-a71c-5c04937981bc-c000.snappy.parquet
-└── example_sql_materialized_view
-    ├── _SUCCESS
-    └── part-00000-e1d0d33c-5d9e-43c3-a87d-f5f772d32942-c000.snappy.parquet
-
-3 directories, 4 files
-```
+```shell
+tree spark-warehouse
+```
+
+!!! note ""
+
+    ```text
+    spark-warehouse
+    ├── example_python_materialized_view
+    │   ├── _SUCCESS
+    │   └── part-00000-25786a51-3973-4839-9220-f2411cf9725f-c000.snappy.parquet
+    └── example_sql_materialized_view
+        ├── _SUCCESS
+        └── part-00000-7c8dcf19-8b55-4683-9895-b23ed752e71a-c000.snappy.parquet
+
+    3 directories, 4 files
+    ```
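
To peek at the materialized results, a minimal sketch (assuming a Spark Connect server is reachable at `sc://localhost`, as in the `--remote` variant above, and the run has completed):

```python
# Sketch: read back the two materialized views created by the run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

for table_name in (
    "spark_catalog.default.example_python_materialized_view",
    "spark_catalog.default.example_sql_materialized_view",
):
    spark.table(table_name).show()
```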

## Demo: Scala API
