From e798100f002cbcb0fd92df0445c0a721243728b8 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 01:04:36 +0800
Subject: [PATCH 01/11] Update developer-tools.md
---
developer-tools.md | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/developer-tools.md b/developer-tools.md
index bce821d8c6..7123052190 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -352,17 +352,32 @@ By default, this script will format files that differ from git master. For more
IDE setup
+Make sure you have a clean start before setting up the IDE: a clean git clone of the Spark repo and the latest
+version of the IDE. If something goes wrong, clear the build outputs by `./build/sbt clean` and `./build/mvn clean`,
+clear the m2 cache by `rm -rf ~/.m2/repository/*`, remove the IDE folder such as `.idea`, re-import the project into
+the IDE and try again.
+
IntelliJ
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from `Preferences > Plugins`.
+Due to the complexity of the Spark build, please modify the following global settings of IntelliJ IDEA:
+
+- Go to `Settings -> Build, Execution, Deployment -> Build Tools -> Maven -> Importing`, make sure you
+choose "Detect automatically" for `Generated source folders`, and choose "generate sources" for
+`Phase to be used for folders update`.
+- Go to `Settings -> Build, Execution, Deployment -> Compiler -> Scala Compiler -> Scala Compiler Server`,
+pick a large enough number for `Maximum heap size, MB`, such as "16000".
+
To create a Spark project for IntelliJ:
- Download IntelliJ and install the
Scala plug-in for IntelliJ.
-- Go to `File -> Import Project`, locate the spark source directory, and select "Maven Project".
+- Go to `File -> Import Project`, locate the spark source directory, and select "Maven Project". It's important to
+pick Maven instead if sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
+in `SparkBuild.scala`, and IntelliJ IDEA cannot understand it well.
- In the Import wizard, it's fine to leave settings at their default. However it is usually useful
to enable "Import Maven projects automatically", since changes to the project structure will
automatically update the IntelliJ project.
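The clean-start advice in this patch amounts to a short reset sequence. As a sketch in shell, using only the commands the patch itself names (assumed to run from the Spark repo root):
```
# Reset build state before re-importing the project into the IDE.
./build/sbt clean
./build/mvn clean
rm -rf ~/.m2/repository/*   # clear the local Maven (m2) cache
rm -rf .idea                # remove IntelliJ's per-project folder
```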
From 0113b146f83d3b44e2d863f590e1bd7e9c028346 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 02:48:02 +0800
Subject: [PATCH 02/11] Update README.md
---
README.md | 48 +++++++++++++++++++++++-------------------------
1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/README.md b/README.md
index 7d051f074a..39da091c3a 100644
--- a/README.md
+++ b/README.md
@@ -3,31 +3,29 @@
In this directory you will find text files formatted using Markdown, with an `.md` suffix.
Building the site requires [Ruby 3](https://www.ruby-lang.org), [Jekyll](http://jekyllrb.com/docs), and
-[Rouge](https://github.com/rouge-ruby/rouge).
-The easiest way to install the right version of these tools is using
-[Bundler](https://bundler.io/) and running `bundle install` in this directory.
-
-See also [https://github.com/apache/spark/blob/master/docs/README.md](https://github.com/apache/spark/blob/master/docs/README.md)
-
-A site build will update the directories and files in the `site` directory with the generated files.
-Using Jekyll via `bundle exec jekyll` locks it to the right version.
-So after this you can generate the html website by running `bundle exec jekyll build` in this
-directory. Use the `--watch` flag to have jekyll recompile your files as you save changes.
-
-In addition to generating the site as HTML from the Markdown files, jekyll can serve the site via
-a web server. To build the site and run a web server use the command `bundle exec jekyll serve` which runs
-the web server on port 4000, then visit the site at http://localhost:4000.
-
-Please make sure you always run `bundle exec jekyll build` after testing your changes with
-`bundle exec jekyll serve`, otherwise you end up with broken links in a few places.
-
-## Updating Jekyll version
-
-To update `Jekyll` or any other gem please follow these steps:
-
-1. Update the version in the `Gemfile`
-1. Run `bundle update` which updates the `Gemfile.lock`
-1. Commit both files
+[Rouge](https://github.com/rouge-ruby/rouge). The most reliable way to ensure a compatible environment
+is to use the official Docker build image from the Apache Spark repository.
+
+If you haven't already, clone the [Apache Spark](https://github.com/apache/spark) repository. Navigate to
+the Spark root directory and run the following command to create the builder image:
+```
+docker build \
+ --tag docs-builder:latest \
+ --file dev/spark-test-image/docs/Dockerfile \
+ dev/spark-test-image-util/docs/
+```
+
+Once the image is built, run the container to process the Markdown files. Note: Replace `/path/to/spark-website`
+in the command below with the *absolute path* to your local website directory.
+```
+docker run \
+ -e HOST_UID=$(id -u) \
+ -e HOST_GID=$(id -g) \
+ --mount type=bind,source="/path/to/spark-website",target="/spark-website" \
+ -w /spark-website \
+ docs-builder:latest \
+ /bin/bash -c "sh .github/run-in-container.sh"
+```
## Docs sub-dir
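This patch drops the `bundle exec jekyll serve` preview instructions in favor of the container build. If you still want to eyeball the generated HTML, one option (an editor's suggestion, not part of the patch) is any static file server pointed at the `site` directory, for example:
```
# Quick preview of the generated site at http://localhost:4000.
cd site && python3 -m http.server 4000
```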
From e6690bb2a46a0b6c07321932ad5658854aa2887c Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 02:54:33 +0800
Subject: [PATCH 03/11] Create run-in-container.sh
---
.github/run-in-container.sh | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
create mode 100644 .github/run-in-container.sh
diff --git a/.github/run-in-container.sh b/.github/run-in-container.sh
new file mode 100644
index 0000000000..1ba306d629
--- /dev/null
+++ b/.github/run-in-container.sh
@@ -0,0 +1,35 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# 1. Set env variables. Note: this JAVA_HOME path assumes an arm64 build image.
+export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-arm64
+export PATH=$JAVA_HOME/bin:$PATH
+
+# 2. Install bundler and the gems from the Gemfile.
+gem install bundler -v 2.4.22
+bundle install
+
+# 3. Create a user matching the host UID/GID
+groupadd -g $HOST_GID docuser
+useradd -u $HOST_UID -g $HOST_GID -m docuser
+
+# We need this link to make sure `python3` points to `python3.11`, which has the prerequisite packages installed.
+ln -s "$(which python3.11)" "/usr/local/bin/python3"
+
+# Build the docs as the host-mapped user so output files aren't owned by root.
+rm -rf .jekyll-cache
+su docuser -c "bundle exec jekyll build"
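The HOST_UID/HOST_GID mapping above exists so that the files Jekyll writes into the bind mount belong to the invoking host user rather than root. A quick sanity check after the container exits (a suggestion, not part of the patch):
```
# The numeric owner/group of the generated files should match your own ids.
ls -ln site | head -n 5
id -u; id -g
```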
From 6a3074458005f020f2803c67b5a34a7d7057f320 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 02:55:17 +0800
Subject: [PATCH 04/11] Update developer-tools.html
---
site/developer-tools.html | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/site/developer-tools.html b/site/developer-tools.html
index fa874b50f3..c6570c2982 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -481,18 +481,35 @@ Formatting code
IDE setup
+Make sure you have a clean start before setting up the IDE: a clean git clone of the Spark repo and the latest
+version of the IDE. If something goes wrong, clear the build outputs by ./build/sbt clean and ./build/mvn clean,
+clear the m2 cache by rm -rf ~/.m2/repository/*, remove the IDE folder such as .idea, re-import the project into
+the IDE and try again.
+
IntelliJ
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.
+Due to the complexity of the Spark build, please modify the following global settings of IntelliJ IDEA:
+
+
+ - Go to
Settings -> Build, Execution, Deployment -> Build Tools -> Maven -> Importing, make sure you
+choose “Detect automatically” for Generated source folders, and choose “generate sources” for
+Phase to be used for folders update.
+ - Go to
Settings -> Build, Execution, Deployment -> Compiler -> Scala Compiler -> Scala Compiler Server,
+pick a large enough number for Maximum heap size, MB, such as “16000”.
+
+
To create a Spark project for IntelliJ:
- Download IntelliJ and install the
Scala plug-in for IntelliJ.
- - Go to
File -> Import Project, locate the spark source directory, and select “Maven Project”.
+ - Go to
File -> Import Project, locate the spark source directory, and select “Maven Project”. It’s important to
+pick Maven instead if sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
+in SparkBuild.scala, and IntelliJ IDEA cannot understand it well.
- In the Import wizard, it’s fine to leave settings at their default. However it is usually useful
to enable “Import Maven projects automatically”, since changes to the project structure will
automatically update the IntelliJ project.
From 10aa1622d4d94394a407281d6e45b6d5d6c5b6ea Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 03:38:31 +0800
Subject: [PATCH 05/11] fix
---
developer-tools.md | 2 +-
site/developer-tools.html | 2 +-
site/sitemap.xml | 14 +++++++-------
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/developer-tools.md b/developer-tools.md
index 7123052190..766281be6b 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -376,7 +376,7 @@ To create a Spark project for IntelliJ:
- Download IntelliJ and install the
Scala plug-in for IntelliJ.
- Go to `File -> Import Project`, locate the spark source directory, and select "Maven Project". It's important to
-pick Maven instead if sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
+pick Maven instead of sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
in `SparkBuild.scala`, and IntelliJ IDEA cannot understand it well.
- In the Import wizard, it's fine to leave settings at their default. However it is usually useful
to enable "Import Maven projects automatically", since changes to the project structure will
diff --git a/site/developer-tools.html b/site/developer-tools.html
index c6570c2982..a9c1061a58 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -508,7 +508,7 @@ IntelliJ
- Download IntelliJ and install the
Scala plug-in for IntelliJ.
- Go to
File -> Import Project, locate the spark source directory, and select “Maven Project”. It’s important to
-pick Maven instead if sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
+pick Maven instead of sbt here, as Spark has complicated build logic that is implemented for sbt using Scala code
in SparkBuild.scala, and IntelliJ IDEA cannot understand it well.
- In the Import wizard, it’s fine to leave settings at their default. However it is usually useful
to enable “Import Maven projects automatically”, since changes to the project structure will
diff --git a/site/sitemap.xml b/site/sitemap.xml
index fd71401fa2..e1272626df 100644
--- a/site/sitemap.xml
+++ b/site/sitemap.xml
@@ -1153,23 +1153,23 @@
weekly
- https://spark.apache.org/streaming/
+ https://spark.apache.org/spark-connect/
weekly
- https://spark.apache.org/sql/
+ https://spark.apache.org/pandas-on-spark/
weekly
- https://spark.apache.org/mllib/
+ https://spark.apache.org/graphx/
weekly
- https://spark.apache.org/graphx/
+ https://spark.apache.org/mllib/
weekly
- https://spark.apache.org/screencasts/
+ https://spark.apache.org/streaming/
weekly
@@ -1177,11 +1177,11 @@
weekly
- https://spark.apache.org/pandas-on-spark/
+ https://spark.apache.org/screencasts/
weekly
- https://spark.apache.org/spark-connect/
+ https://spark.apache.org/sql/
weekly
From b4d1e8767b296b5b2d2d9c72d80efcb3463cf9dd Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 05:31:04 +0800
Subject: [PATCH 06/11] better
---
{.github => .dev}/run-in-container.sh | 0
1 file changed, 0 insertions(+), 0 deletions(-)
rename {.github => .dev}/run-in-container.sh (100%)
diff --git a/.github/run-in-container.sh b/.dev/run-in-container.sh
similarity index 100%
rename from .github/run-in-container.sh
rename to .dev/run-in-container.sh
From b99ba714e325106dee307e36e8dfbbf713c84db9 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 05:32:05 +0800
Subject: [PATCH 07/11] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 39da091c3a..5e2bfc118d 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ docker run \
--mount type=bind,source="/path/to/spark-website",target="/spark-website" \
-w /spark-website \
docs-builder:latest \
- /bin/bash -c "sh .github/run-in-container.sh"
+ /bin/bash -c "sh .dev/run-in-container.sh"
```
## Docs sub-dir
From b27c4bd737e39082bd2c2d01e21768fe825fca6d Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 05:41:54 +0800
Subject: [PATCH 08/11] improve
---
.dev/build-docs.sh | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
create mode 100644 .dev/build-docs.sh
diff --git a/.dev/build-docs.sh b/.dev/build-docs.sh
new file mode 100644
index 0000000000..d2b7a41ef8
--- /dev/null
+++ b/.dev/build-docs.sh
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+docker run \
+ -e HOST_UID=$(id -u) \
+ -e HOST_GID=$(id -g) \
+ --mount type=bind,source="/path/to/spark-website",target="/spark-website" \
+ -w /spark-website \
+ docs-builder:latest \
+ /bin/bash -c "sh .dev/run-in-container.sh"
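At this point in the series the mount source is still a literal placeholder, so the script needs a one-time edit before it is usable; the next two patches parameterize this. A hedged usage sketch:
```
# Replace the placeholder with your checkout's absolute path, e.g.:
#   --mount type=bind,source="/home/you/spark-website",target="/spark-website"
# then run:
sh .dev/build-docs.sh
```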
From 5158309ad63c46d625fabafdafdbf3714ba3813e Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 05:46:18 +0800
Subject: [PATCH 09/11] Apply suggestions from code review
---
.dev/build-docs.sh | 2 +-
README.md | 11 +++--------
2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/.dev/build-docs.sh b/.dev/build-docs.sh
index d2b7a41ef8..818f65ded1 100644
--- a/.dev/build-docs.sh
+++ b/.dev/build-docs.sh
@@ -18,7 +18,7 @@
docker run \
-e HOST_UID=$(id -u) \
-e HOST_GID=$(id -g) \
- --mount type=bind,source="/path/to/spark-website",target="/spark-website" \
+ --mount type=bind,source="${SPARK_WEBSITE_PATH}",target="/spark-website" \
-w /spark-website \
docs-builder:latest \
/bin/bash -c "sh .dev/run-in-container.sh"
diff --git a/README.md b/README.md
index 5e2bfc118d..60c50c9637 100644
--- a/README.md
+++ b/README.md
@@ -15,16 +15,11 @@ docker build \
dev/spark-test-image-util/docs/
```
-Once the image is built, run the container to process the Markdown files. Note: Replace `/path/to/spark-website`
+Once the image is built, navigate to the spark-website root directory and run the script
+to process the Markdown files. Note: Replace `/path/to/spark-website`
in the command below with the *absolute path* to your local website directory.
```
-docker run \
- -e HOST_UID=$(id -u) \
- -e HOST_GID=$(id -g) \
- --mount type=bind,source="/path/to/spark-website",target="/spark-website" \
- -w /spark-website \
- docs-builder:latest \
- /bin/bash -c "sh .dev/run-in-container.sh"
+SPARK_WEBSITE_PATH="/path/to/spark-website" sh .dev/build-docs.sh
```
## Docs sub-dir
From fc67fb603cc4416d79d745044369f81670dd0430 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 05:57:07 +0800
Subject: [PATCH 10/11] final
---
.dev/build-docs.sh | 2 +-
README.md | 5 ++---
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/.dev/build-docs.sh b/.dev/build-docs.sh
index 818f65ded1..c4297b4177 100644
--- a/.dev/build-docs.sh
+++ b/.dev/build-docs.sh
@@ -18,7 +18,7 @@
docker run \
-e HOST_UID=$(id -u) \
-e HOST_GID=$(id -g) \
- --mount type=bind,source="${SPARK_WEBSITE_PATH}",target="/spark-website" \
+ --mount type=bind,source="$PWD",target="/spark-website" \
-w /spark-website \
docs-builder:latest \
/bin/bash -c "sh .dev/run-in-container.sh"
diff --git a/README.md b/README.md
index 60c50c9637..2e3e003dc6 100644
--- a/README.md
+++ b/README.md
@@ -15,9 +15,8 @@ docker build \
dev/spark-test-image-util/docs/
```
-Once the image is built, navigate to the spark-website root directory and run the script
-to process the Markdown files. Note: Replace `/path/to/spark-website`
-in the command below with the *absolute path* to your local website directory.
+Once the image is built, navigate to the `spark-website` root directory and run the script, which processes
+the Markdown files in the Docker container.
```
SPARK_WEBSITE_PATH="/path/to/spark-website" sh .dev/build-docs.sh
```
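After this change the mount source is `$PWD`, so the working directory is what matters; the `SPARK_WEBSITE_PATH` prefix still shown in the README is harmless but no longer read by the script. A minimal end-to-end run, assuming the `docs-builder:latest` image from the README step already exists:
```
# Run from the spark-website checkout; the script mounts $PWD into the container.
cd /path/to/spark-website
sh .dev/build-docs.sh
```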
From c8e63e76f9851e08652a524828cd4890b501e91d Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Fri, 26 Dec 2025 11:19:20 +0800
Subject: [PATCH 11/11] address comments
---
developer-tools.md | 12 ++++++------
site/developer-tools.html | 12 ++++++------
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/developer-tools.md b/developer-tools.md
index 766281be6b..0908cef343 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -353,15 +353,15 @@ By default, this script will format files that differ from git master. For more
IDE setup
Make sure you have a clean start before setting up the IDE: a clean git clone of the Spark repo and the latest
-version of the IDE. If something goes wrong, clear the build outputs by `./build/sbt clean` and `./build/mvn clean`,
-clear the m2 cache by `rm -rf ~/.m2/repository/*`, remove the IDE folder such as `.idea`, re-import the project into
-the IDE and try again.
+version of the IDE.
+
+If something goes wrong, clear the build outputs by `./build/sbt clean` and `./build/mvn clean`, clear the m2
+cache by `rm -rf ~/.m2/repository/*`, do a clean re-import of the project into the IDE, and try again.
IntelliJ
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
-use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
-free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from `Preferences > Plugins`.
+use is IntelliJ IDEA. You need to install the JetBrains Scala plugin from `Preferences > Plugins`.
Due to the complexity of the Spark build, please modify the following global settings of IntelliJ IDEA:
@@ -369,7 +369,7 @@ Due to the complexity of Spark build, please modify the following global setting
choose "Detect automatically" for `Generated source folders`, and choose "generate sources" for
`Phase to be used for folders update`.
- Go to `Settings -> Build, Execution, Deployment -> Compiler -> Scala Compiler -> Scala Compiler Server`,
-pick a large enough number for `Maximum heap size, MB`, such as "16000".
+pick a large enough number for `Maximum heap size, MB`, such as "4000".
To create a Spark project for IntelliJ:
diff --git a/site/developer-tools.html b/site/developer-tools.html
index a9c1061a58..574ef1f201 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -482,15 +482,15 @@ Formatting code
IDE setup
Make sure you have a clean start before setting up the IDE: a clean git clone of the Spark repo and the latest
-version of the IDE. If something goes wrong, clear the build outputs by ./build/sbt clean and ./build/mvn clean,
-clear the m2 cache by rm -rf ~/.m2/repository/*, remove the IDE folder such as .idea, re-import the project into
-the IDE and try again.
+version of the IDE.
+
+If something goes wrong, clear the build outputs by ./build/sbt clean and ./build/mvn clean, clear the m2
+cache by rm -rf ~/.m2/repository/*, do a clean re-import of the project into the IDE, and try again.
IntelliJ
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
-use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
-free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.
+use is IntelliJ IDEA. You need to install the JetBrains Scala plugin from Preferences > Plugins.
Due to the complexity of the Spark build, please modify the following global settings of IntelliJ IDEA:
@@ -499,7 +499,7 @@ IntelliJ
choose “Detect automatically” for Generated source folders, and choose “generate sources” for
Phase to be used for folders update.
Go to Settings -> Build, Execution, Deployment -> Compiler -> Scala Compiler -> Scala Compiler Server,
-pick a large enough number for Maximum heap size, MB, such as “16000”.
+pick a large enough number for Maximum heap size, MB, such as “4000”.
To create a Spark project for IntelliJ: