
Conversation

atox120

@atox120 atox120 commented Mar 27, 2021

The first commit contains the major changes.
The latest commit just fixes some typos in the Makefiles and deletes an unnecessary .conf file.
You can pull built versions of these images from: https://hub.docker.com/u/atox120

atox120 added 2 commits March 27, 2021 01:27

Spark-minimal change summary:
The changes focus mainly on updating old packages to newer versions. In summary:
Ubuntu 16.04 -> 20.04
Spark 2.2.0 -> 3.1.1
Hadoop 2.6 -> 2.7
Java 8 -> OpenJDK 11
- Changed the JAVA_HOME directory location in the Dockerfile.
- Removed the line in the Dockerfile that automatically accepts the Oracle user license; this is not needed since we are using OpenJDK.
- apt-get install -y now references openjdk-11-jdk
- Added the apt repository for OpenJDK (a sketch of these Dockerfile changes follows this list).
Jars:
- All .jar packages were updated to the latest versions; a few extra dependencies were also added.
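
As a reference point, here is a minimal sketch of what the Java-related Dockerfile changes described above might look like (the exact base image tag, PPA, and JAVA_HOME path are assumptions, not copied from this PR):

# Sketch only; adjust to the actual Dockerfile.
FROM ubuntu:20.04
# Avoid the interactive tzdata prompt during the JDK install.
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:openjdk-r/ppa && \
    apt-get update && \
    apt-get install -y openjdk-11-jdk
# JAVA_HOME now points at the OpenJDK 11 install (standard location on Ubuntu).
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64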

Added configuration file spark-defaults.conf
- This file just contains the line (sketched below) that loads the org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar package to enable reading Kafka topics.
- The file is added to the $SPARK_HOME/conf folder by uncommenting a line in the Dockerfile.
- At execution (e.g. docker-compose exec spark pyspark) it starts the session and uses Ivy to look up the Maven package. I couldn't get it to read the packages already contained in the jars folder, so it downloads the dependencies and runs that way. Without this, though, you can't read Kafka topics, since the Kafka connector is not bundled with Spark.
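
For reference, the line would look roughly like this (a sketch; it assumes the conf file uses spark.jars.packages to pull the connector, which may differ from the exact line in the PR):

# spark-defaults.conf (sketch): have Spark resolve the Kafka connector via Ivy at session start
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1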

Known Issues:
- The jars don't seem to be loaded onto the classpath by default. We need to either append the --packages argument on the command line when launching pyspark, or use the spark-defaults.conf file configured above to load the Kafka client, even though it is present in the jars directory with the other packages. I was not able to figure out why it isn't picked up from there (see the example after this list).
- The use of Java 11 produces an 'illegal reflective access' warning at runtime. It shows up in the Spark logs but doesn't appear to affect any functionality: https://issues.apache.org/jira/browse/JCLOUDS-1542?jql=text%20~%20%22illegal%20jdk%22
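
For example, the command-line workaround looks roughly like this (a sketch; the service name and package coordinates follow from the notes above):

docker-compose exec spark pyspark \
 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1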

Spark-python change summary:
Anaconda 4.X -> 2020.11
Py4J version -> 0.10.9.2

Known Issues:
- The build takes a long time because conda has to resolve the conda-forge environment for pyarrow, parquet, and arrow, but it does work.
@atox120 atox120 requested a review from mmm March 27, 2021 01:49
@mmm
Member

mmm commented Apr 13, 2021

thanks so much!
review in progress
There are a lot of examples we need to test this with, so I'll just keep a running log of them here
(sorry in advance for the noise)

@atox120
Author

atox120 commented Apr 14, 2021

No problem.
I found another issue regarding the interface with the Hive metastore. Spark 3.1.1 uses Spark SQL Hive metastore version 2.3.7, while the version in the old container is 1.2.1, so currently they don't work together.

You can get around this by passing the configuration settings when you run pyspark, but it's slow since we rely on Maven to download the 100 or so packages.
For instance:

docker-compose exec spark pyspark \
 --conf spark.sql.hive.metastore.version=1.2.1 \
 --conf spark.sql.hive.metastore.jars=maven 

With that passed in, it works. I'm currently working on it for Project 3; these settings could probably be baked into the image, or I could update the default Hive metastore in the mids container. I'm not sure yet, still working on it. I'll put this in a proper pull request once I have what I think is a good solution; a sketch of the baked-in option follows below.
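
If the settings were baked into the image, the spark-defaults.conf entries might look something like this (a sketch only; whether these are the right defaults for the container is exactly the open question above):

# spark-defaults.conf (sketch): pin the Hive metastore client that Spark talks to
spark.sql.hive.metastore.version  1.2.1
spark.sql.hive.metastore.jars     maven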
