
Conversation

atox120

@atox120 atox120 commented Mar 27, 2021

The first commit contains the major changes.
The latest commit just fixes some typos in the Makefiles and deletes an unnecessary .conf file.
You can pull built versions of these images from: https://hub.docker.com/u/atox120

atox120 added 2 commits March 27, 2021 01:27

Spark-minimal change summary:
The changes focus mainly on updating old packages to newer versions. In summary:
Ubuntu 16.04 -> 20.04
Spark 2.2.0 -> 3.1.1
Hadoop 2.6 -> 2.7
Java 8 -> OpenJDK 11
- Changed the JAVA_HOME directory location in the Dockerfile.
- Removed the line in the Dockerfile that automatically accepts the Oracle user license; this is not needed since we are using OpenJDK.
- apt-get install -y now references openjdk-11-jdk
- Added the apt repository for OpenJDK (a sketch of these Dockerfile changes follows this list).
Jars:
- All .jar packages were updated to the latest versions; a few extra dependencies were also added.
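
As a reference point, here is a minimal sketch of what the Java-related Dockerfile changes described above might look like (the exact base image tag, PPA, and JAVA_HOME path are assumptions, not copied from this PR):

# Sketch only; adjust to the actual Dockerfile.
FROM ubuntu:20.04
# Avoid the interactive tzdata prompt during the JDK install.
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:openjdk-r/ppa && \
    apt-get update && \
    apt-get install -y openjdk-11-jdk
# JAVA_HOME now points at the OpenJDK 11 install (standard location on Ubuntu).
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64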

Added configuration file spark-defaults.conf
- This file just contains the line (sketched below) that loads the org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar package to enable reading Kafka topics.
- The file is added to the $SPARK_HOME/conf folder by uncommenting a line in the Dockerfile.
- At execution (e.g. docker-compose exec spark pyspark) it starts the session and uses Ivy to look up the Maven package. I couldn't get it to read the packages already contained in the jars folder, so it downloads the dependencies and runs that way. Without this, though, you can't read Kafka topics, since the Kafka connector is not bundled with Spark.
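
For reference, the line would look roughly like this (a sketch; it assumes the conf file uses spark.jars.packages to pull the connector, which may differ from the exact line in the PR):

# spark-defaults.conf (sketch): have Spark resolve the Kafka connector via Ivy at session start
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1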

Known Issues:
- The jars don't seem to be loaded onto the classpath by default. We need to either append the --packages argument on the command line when launching pyspark, or use the spark-defaults.conf file configured above to load the Kafka client, even though it is present in the jars directory with the other packages. I was not able to figure out why it isn't picked up from there (see the example after this list).
- The use of Java 11 produces an 'illegal reflective access' warning at runtime. It shows up in the Spark logs but doesn't appear to affect any functionality: https://issues.apache.org/jira/browse/JCLOUDS-1542?jql=text%20~%20%22illegal%20jdk%22
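
For example, the command-line workaround looks roughly like this (a sketch; the service name and package coordinates follow from the notes above):

docker-compose exec spark pyspark \
 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1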

Spark-python change summary:
Anaconda 4.X -> 2020.11
Py4J version -> 0.10.9.2

Known Issues:
- The build takes a long time because conda has to resolve the conda-forge environment for pyarrow, parquet, and arrow, but it does work.
@atox120 atox120 requested a review from mmm March 27, 2021 01:49
@mmm
Member

mmm commented Apr 13, 2021

thanks so much!
review in progress
There are a lot of examples we need to test this with, so I'll just keep a running log of them here
(sorry in advance for the noise)

@atox120
Author

atox120 commented Apr 14, 2021

No problem.
I found another issue regarding the interface with the Hive metastore. Spark 3.1.1 uses Spark SQL Hive metastore version 2.3.7, while the version in the old container is 1.2.1, so currently they don't work together.

You can get around this by passing the configuration settings when you run pyspark, but it's slow since we rely on Maven to download the 100 or so packages.
For instance:

docker-compose exec spark pyspark \
 --conf spark.sql.hive.metastore.version=1.2.1 \
 --conf spark.sql.hive.metastore.jars=maven 

With that passed in, it works. I'm currently working on it for Project 3; these settings could probably be baked into the image, or I could update the default Hive metastore in the mids container. I'm not sure yet, still working on it. I'll put this in a proper pull request once I have what I think is a good solution; a sketch of the baked-in option follows below.
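
If the settings were baked into the image, the spark-defaults.conf entries might look something like this (a sketch only; whether these are the right defaults for the container is exactly the open question above):

# spark-defaults.conf (sketch): pin the Hive metastore client that Spark talks to
spark.sql.hive.metastore.version  1.2.1
spark.sql.hive.metastore.jars     maven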
