I am building a Celery + Django + Selenium application. I am running Selenium-based browsers in separate processes with the help of Celery. Versions:

celery==5.2.6
redis==3.4.1
selenium-wire==5.1.0
Django==4.0.4
djangorestframework==3.13.1

I found out that after several hours the application generates thousands of zombie processes. I also found out that the problem is related to the Celery Docker container, because after sudo /usr/local/bin/docker-compose -f /data/new_app/docker-compose.yml restart celery I have 0 zombie processes.

My code

from rest_framework.decorators import api_view

@api_view(['POST'])
def periodic_check_all_urls(request): # web-service endpoint
    ...
    check_urls.delay(parsing_results_ids) # call celery task

Celery task code

from celery import shared_task


@shared_task()
def check_urls(parsing_result_ids: List[int]):
    """
    Run the Selenium-based parser.
    The parser extracts data and saves it to the database.
    """
    try:
        logger.info(f"{datetime.now()} Start check_urls")
        parser = Parser() # open selenium browser
        parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
        parser.check_parsing_result(parsing_results)
    except Exception as e:
        full_trace = traceback.format_exc()
    finally:
        if 'parser' in locals():
            parser.stop()

Selenium browser stop function and destructor

class Parser():
    def __init__(self):
        """
        Prepare parser
        """
        if not USE_GUI:
            self.display = Display(visible=0, size=(800, 600))
            self.display.start()
        

        """ Replaced with FireFox
        self.driver = get_chromedriver(proxy_data)
        """
        proxy_data = {
            ...
        }
        self.driver = get_firefox_driver(proxy_data=proxy_data)

    
    def __del__(self):
        self.stop()

    def stop(self):
        try:
            self.driver.quit()
            logger.info("Selenium driver closed")
        except:
            pass
        
        try:
            self.display.stop()
            logger.info("Display stopped")
        except:
            pass

I also tried several settings to limit the Celery task's resources and running time (it did not help with the zombie processes).

My Celery settings in the Django settings.py

# Celery settings (document generation)
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER", "redis://redis:6379/0")
CELERY_RESULT_BACKEND = os.environ.get("CELERY_BROKER", "redis://redis:6379/0")
CELERY_IMPORTS = ("core_app.celery",)
CELERY_TASK_TIME_LIMIT = 10 * 60

My Celery service in docker-compose.yml

celery:
    build: ./project
    command: celery -A core_app worker  --loglevel=info --concurrency=15 --max-memory-per-child=1000000
    volumes:
      - ./project:/usr/src/app
      - ./project/media:/project/media
      - ./project/logs:/project/logs
    env_file:
      - .env
    environment:
    # environment variables declared in the environment section override env_file
      - DJANGO_ALLOWED_HOSTS=localhost 127.0.0.1 [::1]
      - CELERY_BROKER=redis://redis:6379/0
      - CELERY_BACKEND=redis://redis:6379/0
    depends_on:
      - django
      - redis

I read Django/Celery – How to kill a celery task?, but it didn't help.

I also read Celery revoke leaving zombie ffmpeg process, but my task already contains a try/except/finally block.

Example of zombie processes

ps aux | grep 'Z'

root     32448  0.0  0.0      0     0 ?        Z    13:45   0:00 [Utility Process] <defunct>
root     32449  0.0  0.0      0     0 ?        Z    13:09   0:00 [Utility Process] <defunct>
root     32450  0.0  0.0      0     0 ?        Z    11:13   0:00 [sh] <defunct>
root     32451  0.0  0.0      0     0 ?        Z    13:44   0:00 [Utility Process] <defunct>
root     32452  0.0  0.0      0     0 ?        Z    10:12   0:00 [Utility Process] <defunct>
root     32453  0.0  0.0      0     0 ?        Z    09:52   0:00 [sh] <defunct>
root     32454  0.0  0.0      0     0 ?        Z    10:40   0:00 [Utility Process] <defunct>
root     32455  0.0  0.0      0     0 ?        Z    09:52   0:00 [Utility Process] <defunct>
root     32456  0.0  0.0      0     0 ?        Z    10:13   0:00 [sh] <defunct>
root     32457  0.0  0.0      0     0 ?        Z    10:51   0:00 [Utility Process] <defunct>
root     32459  0.0  0.0      0     0 ?        Z    14:01   0:00 [Utility Process] <defunct>
root     32460  0.0  0.0      0     0 ?        Z    13:16   0:00 [Utility Process] <defunct>
root     32461  0.0  0.0      0     0 ?        Z    10:40   0:00 [Utility Process] <defunct>
root     32462  0.0  0.0      0     0 ?        Z    10:12   0:00 [Utility Process] <defunct>

2 Answers


  1. Use time_limit and soft_time_limit

    You have already set CELERY_TASK_TIME_LIMIT, but it can be beneficial to also use soft_time_limit. When the soft limit is reached, Celery raises a SoftTimeLimitExceeded exception inside the task, which you can catch to clean up resources before the task is forcefully terminated at the hard time_limit.

    Here’s how you can set both:

    from celery.exceptions import SoftTimeLimitExceeded
    
    @shared_task(soft_time_limit=600, time_limit=650)
    def check_urls(parsing_result_ids: List[int]):
        try:
            logger.info(f"{datetime.now()} Start check_urls")
            parser = Parser()  # Open selenium browser
            parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
            parser.check_parsing_result(parsing_results)
        except SoftTimeLimitExceeded:
            logger.warning(f"Task exceeded soft time limit, cleaning up resources.")
        except Exception as e:
            full_trace = traceback.format_exc()
            logger.error(f"Error occurred: {full_trace}")
        finally:
            if 'parser' in locals():
                parser.stop()
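
    If you prefer to keep this in the Django settings instead, next to the CELERY_TASK_TIME_LIMIT you already have, the equivalent is roughly the following. This is only a sketch and assumes the CELERY_ settings namespace is wired up the same way as your existing setting:

    # django settings.py
    CELERY_TASK_TIME_LIMIT = 10 * 60       # hard limit: the worker child is killed after 10 minutes
    CELERY_TASK_SOFT_TIME_LIMIT = 9 * 60   # soft limit: SoftTimeLimitExceeded is raised one minute earlier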
    

    Ensure all Selenium processes are cleaned up

    Make sure all subprocesses, including the Selenium driver and the virtual display (when running headless), are stopped correctly. This can involve explicitly killing lingering processes if necessary. For instance:

    import psutil

    class Parser():
        def __init__(self):
            if not USE_GUI:
                self.display = Display(visible=0, size=(800, 600))
                self.display.start()

            proxy_data = {
                ...
            }
            self.driver = get_firefox_driver(proxy_data=proxy_data)

        def stop(self):
            try:
                self.driver.quit()
                logger.info("Selenium driver closed")
            except Exception as e:
                logger.error(f"Error closing driver: {e}")

            try:
                self.display.stop()
                logger.info("Display stopped")
            except Exception as e:
                logger.error(f"Error stopping display: {e}")

            # Clean up any remaining browser/driver subprocesses
            self.cleanup_selenium_processes()

        def cleanup_selenium_processes(self):
            # Look for lingering driver/browser processes; adjust the names
            # to the driver you actually use (e.g. chromedriver/chrome)
            for proc in psutil.process_iter(attrs=['pid', 'name']):
                try:
                    name = (proc.info['name'] or '').lower()
                    if any(target in name for target in ('geckodriver', 'firefox')):
                        logger.info(f"Killing lingering process: {proc.info['pid']}")
                        proc.terminate()
                except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
                    pass
    
    
    

    In summary:

    - Implement soft_time_limit and time_limit for task termination.
    - Ensure that all Selenium resources are released (including the driver and the display).
    - Use psutil to clean up lingering processes.
    - Configure Docker memory limits and restart policies (see the compose sketch below).
    - Use --max-tasks-per-child so workers are restarted automatically.
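
    For the Docker side, here is a minimal sketch of the compose service from the question with a memory limit, a restart policy and --max-tasks-per-child added. The concrete values are placeholders to tune for your workload, and depending on your compose file version mem_limit may need to go under deploy.resources.limits instead:

    celery:
        build: ./project
        command: celery -A core_app worker --loglevel=info --concurrency=15 --max-memory-per-child=1000000 --max-tasks-per-child=50
        mem_limit: 2g                # hard memory cap for the container (placeholder value)
        restart: unless-stopped      # bring the worker container back up if it dies
        # volumes, env_file, environment and depends_on stay the same as in the question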

  2. I’d start by turning the Parser class into a context manager:

    class Parser():
        def __init__(self):
            self.display = Display(visible=0, size=(800, 600))
            self.display.start()
            self.driver = get_firefox_driver(proxy_data={})
    
        def __enter__(self):
            return self  # return the Parser itself so parser.check_parsing_result() can be called in the with block
    
        def __exit__(self, exc_type, exc_val, exc_tb):
            self.kill_driver()
            self.display.stop()
            # handle exceptions here
            # if this returns True, any exceptions will be suppressed
    
        def kill_driver(self):
            self.driver.close()
            self.driver.quit()
    

    If an error is thrown within the with block, Parser.__exit__ will be called before the exception propagates, which gives you the chance to kill the driver and the display before the process exits.

    Note that I removed the bare try/except blocks from your stop method. Swallowing every exception like that is bad practice, because you never see the traceback, which would be quite useful for debugging exactly this kind of problem.

    Now in your task:

    @shared_task()
    def check_urls(parsing_result_ids):
        with Parser() as parser:
            parsing_results = ParsingResult.objects.filter(pk__in=parsing_result_ids).exclude(status__in=["DONE", "FAILED"])
            parser.check_parsing_result(parsing_results)
    

    It’s unlikely Celery is the problem. Using Selenium within a Docker container seems to be the root cause of the zombie processes. See Jimmy Engelbrecht’s answer for further details.

    Jimmy’s solution to the zombie problem:

    def quit_driver_and_reap_children(driver):
        log.debug('Quitting session: %s' % driver.session_id)
        driver.quit()
        try:
            pid = True
            while pid:
                pid = os.waitpid(-1, os.WNOHANG)
                log.debug("Reaped child: %s" % str(pid))
    
                # Wonka's fix: os.waitpid returns (0, 0) when no exited children
                # remain, so break out of the loop instead of spinning forever
                if pid[0] == 0:
                    pid = False
    
        except ChildProcessError:
            pass
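
    As a rough sketch of how that reaping could be combined with the task from the question (the reap_zombie_children helper below is just illustrative, and os.waitpid requires the worker to run on a Unix-like system, which it does inside the Docker container):

    import os

    def reap_zombie_children():
        # collect any exited child processes so they do not linger as zombies
        try:
            while True:
                pid, _status = os.waitpid(-1, os.WNOHANG)
                if pid == 0:  # children still exist, but none have exited yet
                    break
        except ChildProcessError:  # no child processes left at all
            pass

    @shared_task()
    def check_urls(parsing_result_ids):
        try:
            with Parser() as parser:
                parsing_results = ParsingResult.objects.filter(
                    pk__in=parsing_result_ids
                ).exclude(status__in=["DONE", "FAILED"])
                parser.check_parsing_result(parsing_results)
        finally:
            # after the driver and display are gone, reap whatever they left behind
            reap_zombie_children()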
    

    If this solution doesn’t work, please show us the traceback you suppressed in your Parser.stop method.
