
I am running integration tests in Azure Pipelines. I spin up two Docker containers. One container holds my test project and another container has the Postgres database.

When I run docker compose on my local machine, the tests run successfully and take about 6 minutes.

When I run the same Docker containers in the pipeline, the job never finishes and is canceled when it hits the 60-minute limit:

The job running on agent Hosted Agent ran longer than the maximum time of 60 minutes

I do not see any helpful data in the logs.

What tools/logs can I use to diagnose this issue?

It might have to do with RAM or CPU allocation.

Is there a way to run docker stats to see how many resources are allocated to the Docker containers?

Also, I have multiple test projects and I run them in the pipeline one at a time. Some projects succeed with this setup, so the approach works in general; however, when it fails as described, I have no way forward to troubleshoot.

The pipeline:

pool:
  vmImage: ubuntu-latest

stages:
  - stage: Build
    displayName: Docker compose build & up
    jobs:
      - job: Build
        displayName: Build
        steps:
          - script: |
              docker compose build --no-cache
              docker compose up --abort-on-container-exit
            displayName: 'Docker Compose Build & Up'

The docker compose that pipeline calls:

version: "3.8"

services:
  test_service:
    container_name: test_service
    image: test_service_image
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      ASPNETCORE_ENVIRONMENT: Staging
      WAIT_HOSTS: integration_test_db_server:5432
    volumes:
      - ./TestResults:/var/temp
    depends_on:
      - integration_test_db_server
    deploy:
      resources:
        limits:
          memory: 4gb
  
  integration_test_db_server:
    image: postgres
    container_name: db_server
    restart: always
    ports:
      - "2345:5432"
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: db_server

Dockerfile referenced by test_service:

FROM mcr.microsoft.com/dotnet/sdk:6.0

WORKDIR /

COPY . ./

# docker-compose-wait: waits until the hosts listed in WAIT_HOSTS (the Postgres container) accept connections
ADD https://github.com/ufoscout/docker-compose-wait/releases/download/2.9.0/wait /wait
RUN chmod +x /wait

CMD /wait && dotnet test ./src/MyProject/MyProject.Tests.csproj --logger trx --results-directory /var/temp

UPDATE – Jan 3rd 2023:

I was able to reproduce this on my local machine. Because the Microsoft-hosted agent is limited to 2 cores, I applied the same restriction in the docker-compose file.

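The screenshot I had here showed the compose file with that limit in place; roughly, it amounts to adding a cpus limit next to the existing memory limit (the value "2" mirrors the 2-core agent, the rest of the service definition is unchanged):

services:
  test_service:
    # build, environment, volumes and depends_on unchanged from the compose file above
    deploy:
      resources:
        limits:
          cpus: "2"      # mimic the 2-core Microsoft-hosted agent
          memory: 4gb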

This caused a test to run for a very long time (over 8 minutes for one test). At that time, the CPU usage was < 3%.

Running docker stats while the test was running confirmed this (screenshot of the stats output omitted).

So restricting the number of CPU cores results in lower CPU usage? I am confused about what's happening here.
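For completeness, here is a rough sketch of how the same numbers could be captured in the pipeline job itself; this step is not part of the pipeline above, it just samples docker stats in the background while compose runs so the figures show up in the job log:

- script: |
    # sample container CPU/memory every 30 seconds while the tests run
    ( while true; do
        docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
        sleep 30
      done ) &
    STATS_PID=$!
    docker compose up --abort-on-container-exit
    kill $STATS_PID
  displayName: 'Docker Compose Up with resource logging'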

2 Answers


  1. Chosen as BEST ANSWER

    So there was an issue with "thread pool starvation". It didn't happen on my machine because Docker had all 4 CPU cores available. However, once I limited the container to 2 cores, the problem appeared locally and I was able to figure out the underlying cause.

    So, lesson learned: try to reproduce the issue locally, and set the container resources close to the Microsoft-hosted agent specs, in this case a 2-core CPU and 7 GB of RAM.


    Also, if your tests run for a long time and never finish, you can get more information by using the --blame-hang-timeout flag, which puts a time limit on each test:

    dotnet test <your project> --blame-hang-timeout 2min

    After that time limit is exceeded, a hang dump file is generated with diagnostic information. That's how I found the underlying issue.
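    In this containerized setup that means extending the CMD in the Dockerfile, for example as below; the 2min value is only an example, and I'm assuming the hang dump is written under the results directory so that it lands in the mounted /var/temp volume:

    CMD /wait && dotnet test ./src/MyProject/MyProject.Tests.csproj --logger trx --results-directory /var/temp --blame-hang-timeout 2min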


  2. Update on 1/4

    Microsoft-hosted agents have limited performance when running pipelines because of their fixed hardware configuration and network service.

    Microsoft-hosted agents that run Windows and Linux images are provisioned on Azure general purpose virtual machines with a 2 core CPU, 7 GB of RAM, and 14 GB of SSD disk space.

    Agents that run macOS images are provisioned on Mac pros with a 3 core CPU, 14 GB of RAM, and 14 GB of SSD disk space.

    If your pipeline has jobs that require high performance, it's suggested to run the pipeline on a self-hosted agent or a virtual machine scale set (VMSS).
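    If you go that route, pointing the YAML pipeline at a self-hosted or scale-set pool is only a change to the pool block; the pool name below is a placeholder:

    pool:
      name: MySelfHostedPool   # placeholder: name of your self-hosted or scale-set agent pool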

    ================================================================

    Original answer

    I suppose your issue could be related to the Build job timeout setting, which you can check in the pipeline settings.

    By the way, you could look into the documentation on parallel job time limits for more reference.

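    For a YAML pipeline like the one in the question, the corresponding setting is timeoutInMinutes on the job. The value below is only an example; the effective cap still depends on your agent and parallel-job limits:

    jobs:
      - job: Build
        displayName: Build
        timeoutInMinutes: 120   # example: raise the per-job timeout
        steps:
          - script: |
              docker compose build --no-cache
              docker compose up --abort-on-container-exit
            displayName: 'Docker Compose Build & Up'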

    ===============================================================

    First update

    I suppose the duration of the pipeline could be affected by multiple factors, such as network health, data and file transfer speed, or agent machine performance. If your job transfers a large number of individual files, you could use an archive task when uploading to the agent workspace and an extract task when building or testing the project.
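    A rough example with the built-in tasks (the paths below are placeholders, not taken from the question):

    - task: ArchiveFiles@2
      inputs:
        rootFolderOrFile: '$(Build.SourcesDirectory)/TestResults'   # placeholder folder
        includeRootFolder: false
        archiveType: 'zip'
        archiveFile: '$(Build.ArtifactStagingDirectory)/testresults.zip'

    - task: ExtractFiles@1
      inputs:
        archiveFilePatterns: '$(Build.ArtifactStagingDirectory)/testresults.zip'
        destinationFolder: '$(Pipeline.Workspace)/extracted'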
