Python example file's data location does not meet Lambda's expectation #7

@habemusne

I am using the Python example python/ml/kmeans_example.py. This file has a hard-coded path 'data/mllib/sample_kmeans_data.txt'.

When I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py from the driver folder, Spark's log shows java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist.

I was told that the data file location string needs to be consistent between Lambda and Spark. Your Lambda code expects the data file to be somewhere under /tmp/lambda, so I looked at what was actually under /tmp/lambda and found a spark folder. My work-around was to create a temporary /tmp/lambda/spark/data/mllib/ directory on my EC2 instance, copy my data file there, and then point spark.read at that copy. Specifically, I changed line 42 to

    import os
    data_folder = '/home/ec2-user/driver/data/mllib'
    lambda_folder = '/tmp/lambda/spark/data/mllib'
    filename = 'sample_kmeans_data.txt'
    os.system('mkdir -p ' + lambda_folder)
    os.system('cp {}/{} {}/{}'.format(data_folder, filename, lambda_folder, filename))
    dataset = spark.read.format("libsvm").load('{}/{}'.format(lambda_folder, filename))

And then it worked fine.

I suspect that some, perhaps many, of the Python example files have this problem, so it could be a barrier for Python users.
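Since other example files may hard-code relative data paths in the same way, the work-around could be wrapped in a small helper. This is only a sketch under assumptions: the names stage_for_lambda and LAMBDA_ROOT are hypothetical, the /tmp/lambda/spark root is just what I observed on my instance, and it assumes Python 3 (for exist_ok).

```python
import os
import shutil

# Assumption: the Lambda side resolves data files under this root,
# as observed on the reporter's EC2 instance.
LAMBDA_ROOT = '/tmp/lambda/spark'


def stage_for_lambda(relative_path, driver_root, lambda_root=LAMBDA_ROOT):
    """Copy driver_root/relative_path to lambda_root/relative_path.

    Returns the path of the copy, which can be passed to spark.read.
    """
    source = os.path.join(driver_root, relative_path)
    target = os.path.join(lambda_root, relative_path)
    # Create the parent directories if they are missing (like mkdir -p).
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copy(source, target)
    return target
```

In kmeans_example.py this might be used as spark.read.format("libsvm").load(stage_for_lambda('data/mllib/sample_kmeans_data.txt', '/home/ec2-user/driver')).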
