
In my local development environment I can run a PySpark application without any extra configuration. On the server, however, we deploy the app as an executable built with PyInstaller, and PyInstaller does not bundle PySpark’s supporting files into the executable’s _internal folder, so I have to set the path manually.

Here’s a snippet of my PyInstaller spec file for manage.py:

# -*- mode: python ; coding: utf-8 -*-

# Analysis for manage.py
a_manage = Analysis(
    ['manage.py'],
    pathex=['/app/app_name/app_name-backend-dev'],
    # I tried adding .venv/lib/python3.11/site-packages to the pathex, but it didn't work
    binaries=[
        ('/usr/lib/x86_64-linux-gnu/libpython3.11.so.1.0', './_internal/libpython3.11.so.1.0')
    ],
    datas=[],
    hiddenimports=[
        # I tried adding pyspark imports, but it didn't work
    'pyspark', 'pyspark.sql', 'pyspark.sql.session', 'pyspark.sql.functions', 'pyspark.sql.types', 'pyspark.sql.column',
    'app_name2.apps', 'Crypto.Cipher', 'Crypto.Util.Padding', 'snakecase', 'cryptography.fernet',
        'cryptography.hazmat.primitives', 'cryptography.hazmat.primitives.kdf.pbkdf2', 'apscheduler.triggers.cron',
        'apscheduler.schedulers.background', 'apscheduler.events', 'oauth2_provider.contrib.rest_framework',
        'app_name.apps', 'app_name.role_permissions', 'django_filters.rest_framework', 'app_name.urls',
        'app_name.others.constants', 'app_name.models', 'app_name', 'sslserver'
    ],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    noarchive=False,
)

pyz_manage = PYZ(a_manage.pure)

exe_manage = EXE(
    pyz_manage,
    a_manage.scripts,
    [],
    exclude_binaries=True,
    name='manage',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    console=True,
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
)

coll_manage = COLLECT(
    exe_manage,
    a_manage.binaries,
    a_manage.datas,
    strip=False,
    upx=True,
    upx_exclude=[],
    name='manage',
)
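
As an aside, instead of listing pyspark modules by hand, the spec could use PyInstaller's collect_all() hook utility to gather the whole package, including bin/spark-submit and the jars directory. A sketch, assuming PyInstaller 5+ (py4j added because PySpark needs it at runtime):

# Sketch: collect PySpark and py4j wholesale. collect_all() returns
# (datas, binaries, hiddenimports) for a package, and the datas include
# non-Python files such as pyspark/bin/spark-submit and pyspark/jars.
from PyInstaller.utils.hooks import collect_all

pyspark_datas, pyspark_binaries, pyspark_hiddenimports = collect_all('pyspark')
py4j_datas, py4j_binaries, py4j_hiddenimports = collect_all('py4j')

a_manage = Analysis(
    ['manage.py'],
    pathex=['/app/app_name/app_name-backend-dev'],
    binaries=pyspark_binaries + py4j_binaries + [
        ('/usr/lib/x86_64-linux-gnu/libpython3.11.so.1.0', './_internal/libpython3.11.so.1.0')
    ],
    datas=pyspark_datas + py4j_datas,
    hiddenimports=pyspark_hiddenimports + py4j_hiddenimports + [
        # ... the application hidden imports listed above ...
        'app_name.apps', 'app_name2.apps',
    ],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    noarchive=False,
)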

When I try to run the executable, I encounter the following error:

Traceback (most recent call last):
  File "portal/operations/load_data/load_data.py", line 57, in start
  File "portal/pyspark/operations.py", line 498, in get_session
  File "pyspark/sql/session.py", line 497, in getOrCreate
  File "pyspark/context.py", line 515, in getOrCreate
  File "pyspark/context.py", line 201, in __init__
  File "pyspark/context.py", line 436, in _ensure_initialized
  File "pyspark/java_gateway.py", line 97, in launch_gateway
  File "subprocess.py", line 1026, in __init__
  File "subprocess.py", line 1955, in _execute_child
FileNotFoundError: [Errno 2] No such file or directory: '/home/rhythmflow/Desktop/Reconciliation/reconciliation-backend-v3/dist/manage/_internal/./bin/spark-submit'

To work around this, I created a global .venv in my Linux home directory and installed PySpark with pip install pyspark.

I then manually set the SPARK_HOME environment variable:

SPARK_HOME = /home/user_name/.venv/lib/python3.11/site-packages/pyspark

And used it in my code as follows:

SPARK_HOME = env_var("SPARK_HOME")
SparkSession.builder.appName(app_name).config("spark.home", SPARK_HOME).getOrCreate()
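
For reference, a self-contained version of the same call using only the standard library (assuming env_var is just a thin wrapper around os.environ; that helper is not shown here):

import os

from pyspark.sql import SparkSession

# Fail fast if SPARK_HOME was not exported before the process started.
SPARK_HOME = os.environ["SPARK_HOME"]

spark = (
    SparkSession.builder
    .appName("app_name")  # placeholder application name
    .config("spark.home", SPARK_HOME)
    .getOrCreate()
)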

This approach works fine in the development environment, but I want to simplify the process and avoid manually specifying the Spark home path.

Question:

Is there a way to automatically detect the PySpark home path in a PyInstaller executable, so that I don’t have to manually set the SPARK_HOME environment variable?

Edit:

I tried this approach to get the Spark home directory:

import os
import pyspark

SPARK_HOME = os.path.dirname(pyspark.__file__)

However, I encountered the following error; it looks like PySpark is not being bundled into the EXE build/dist at all:

Could not find valid SPARK_HOME while searching ['/home/user_name/Desktop/project_name/app_name/dist', '/home/user_name/Desktop/project_name/app_name/dist/manage/_internal/pyspark/spark-distribution', '/home/user_name/Desktop/project_name/app_name/dist/manage/_internal/pyspark', '/home/user_name/Desktop/project_name/app_name/dist/manage/_internal/pyspark/spark-distribution', '/home/user_name/Desktop/project_name/app_name/dist/manage/_internal/pyspark', '/home/user_name/Desktop/project_name/app_name/dist/manage']
Traceback (most recent call last):
  File "pyspark/find_spark_home.py", line 73, in _find_spark_home
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "wsgiref/handlers.py", line 137, in run
  File "django/contrib/staticfiles/handlers.py", line 80, in __call__
    return self.application(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/wsgi.py", line 124, in __call__
    response = self.get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/base.py", line 140, in get_response
    response = self._middleware_chain(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "corsheaders/middleware.py", line 56, in __call__
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "portal/middleware.py", line 29, in __call__
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "portal/encdec_middleware.py", line 53, in __call__
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "portal/middleware.py", line 115, in __call__
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/utils/deprecation.py", line 129, in __call__
    response = response or self.get_response(request)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/views/decorators/csrf.py", line 65, in _view_wrapper
    return view_func(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "django/views/generic/base.py", line 104, in view
    return self.dispatch(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "rest_framework/views.py", line 506, in dispatch
  File "portal/views.py", line 174, in post
  File "portal/operations/initializer.py", line 39, in initialize
  File "portal/operations/operation_factory.py", line 52, in create
  File "portal/operations/load_data/load_data.py", line 57, in start
  File "portal/pyspark/operations.py", line 496, in get_session
  File "pyspark/sql/session.py", line 497, in getOrCreate
  File "pyspark/context.py", line 515, in getOrCreate
  File "pyspark/context.py", line 201, in __init__
  File "pyspark/context.py", line 436, in _ensure_initialized
  File "pyspark/java_gateway.py", line 60, in launch_gateway
  File "pyspark/find_spark_home.py", line 91, in _find_spark_home
SystemExit: -1

2 Answers


  1. If you need the directory pyspark is installed to, you should be able to do something like the following:

    import os

    import pyspark

    SPARK_HOME = os.path.dirname(pyspark.__file__)
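
    In a frozen onedir build this only helps if PySpark's non-Python files were collected as well; a hedged extension of the same idea (the existence check and error message are my addition, not part of the original answer):

    import os

    import pyspark

    SPARK_HOME = os.path.dirname(pyspark.__file__)

    # pyspark ships its launcher scripts under pyspark/bin; if they were not
    # bundled, java_gateway.py will still fail with FileNotFoundError.
    spark_submit = os.path.join(SPARK_HOME, "bin", "spark-submit")
    if not os.path.exists(spark_submit):
        raise RuntimeError(f"spark-submit not found at {spark_submit}; "
                           "was pyspark collected into the bundle?")

    os.environ["SPARK_HOME"] = SPARK_HOME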
    
  2. The issue arises because PyInstaller packages the application into a standalone executable but does not pick up the PySpark library properly, so the executable cannot find SPARK_HOME or the required PySpark binaries.

    Steps to resolve the issue are:

    1. Bundle PySpark with PyInstaller: add the path to your PySpark library to the pathex argument of the Analysis call:

      a_manage = Analysis(
          ['manage.py'],
          pathex=['/path/to/your/project', '/home/user_name/.venv/lib/python3.11/site-packages'],
          # ... remaining arguments unchanged ...
      )

    2. Include the spark-submit and spark-class launcher scripts in the binaries list. (In a pip install they live under site-packages/pyspark/bin, not under an _internal folder; they must land in pyspark/bin inside the bundle so that the SPARK_HOME set in step 3 can find them. See also the jars caveat after step 3.)

      binaries=[
          ('/home/user_name/.venv/lib/python3.11/site-packages/pyspark/bin/spark-submit', 'pyspark/bin'),
          ('/home/user_name/.venv/lib/python3.11/site-packages/pyspark/bin/spark-class', 'pyspark/bin'),
      ],

    3. Dynamically set SPARK_HOME in code:

    import os
    import pyspark
    
    # Automatically set SPARK_HOME
    SPARK_HOME = os.path.dirname(pyspark.__file__)
    os.environ['SPARK_HOME'] = SPARK_HOME
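
    One more caveat (my addition, not part of the original answer): spark-submit resolves the Spark JARs relative to SPARK_HOME, so the pyspark/jars directory has to land in the bundle as well, for example via datas:

      datas=[
          # Ship the Spark JARs so that SPARK_HOME/jars resolves inside
          # the bundle, next to the launcher scripts from step 2.
          ('/home/user_name/.venv/lib/python3.11/site-packages/pyspark/jars', 'pyspark/jars'),
      ],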
    

    good luck 🙂
