I am running integration tests in Azure Pipelines. I spin up two Docker containers: one holds my test project, and the other runs the Postgres database.
When I run the docker compose on my local machine, the tests run successfully and take about 6 minutes.
When I run the same containers in the pipeline, the job never finishes; it is canceled because of the 60-minute limit:
The job running on agent Hosted Agent ran longer than the maximum time of 60 minutes
I do not see any helpful data in the logs.
What tools/logs can I use to diagnose this issue?
It might have to do with RAM or CPU allocation.
Is there a way to run docker stats to see how many resources are allocated to the Docker containers?
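For context, a one-shot form of that command looks like this (the format string is just one possible selection of columns from the docker stats documentation):

# one-shot snapshot of per-container CPU and memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"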
Also, I have multiple test projects, and I'm testing them in the pipeline one at a time. Some projects have succeeded with this setup, so the approach works; however, when it fails as described, there is no way forward to troubleshoot.
The pipeline:
pool:
  vmImage: ubuntu-latest

stages:
- stage: Build
  displayName: Docker compose build & up
  jobs:
  - job: Build
    displayName: Build
    steps:
    - script: |
        docker compose build --no-cache
        docker compose up --abort-on-container-exit
      displayName: 'Docker Compose Build & Up'
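One CI-relevant detail worth noting (an addition, not part of the original pipeline): docker compose up can propagate the test container's exit code to the script step via --exit-code-from, so a failed test run also fails the pipeline step:

# --exit-code-from implies --abort-on-container-exit and makes
# `docker compose up` return the exit code of the named service
docker compose up --exit-code-from test_service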
The docker compose file that the pipeline calls:
version: "3.8"
services:
  test_service:
    container_name: test_service
    image: test_service_image
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      ASPNETCORE_ENVIRONMENT: Staging
      WAIT_HOSTS: integration_test_db_server:5432
    volumes:
      - ./TestResults:/var/temp
    depends_on:
      - integration_test_db_server
    deploy:
      resources:
        limits:
          memory: 4gb
  integration_test_db_server:
    image: postgres
    container_name: db_server
    restart: always
    ports:
      - "2345:5432"
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: db_server
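As an aside, a compose-level alternative to the wait binary used in the Dockerfile below is a Postgres healthcheck combined with a depends_on condition; a minimal sketch, with arbitrary interval/retry values:

services:
  integration_test_db_server:
    healthcheck:
      # pg_isready ships with the postgres image
      test: ["CMD-SHELL", "pg_isready -U test -d db_server"]
      interval: 5s
      timeout: 5s
      retries: 10
  test_service:
    depends_on:
      integration_test_db_server:
        condition: service_healthy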
Dockerfile referenced by test_service:
FROM mcr.microsoft.com/dotnet/sdk:6.0
WORKDIR /
COPY . ./
ADD https://github.com/ufoscout/docker-compose-wait/releases/download/2.9.0/wait /wait
RUN chmod +x /wait
CMD /wait && dotnet test ./src/MyProject/MyProject.Tests.csproj --logger trx --results-directory /var/temp
UPDATE – Jan 3rd 2023:
I was able to reproduce this on my local machine. Because the MSFT agent is limited to 2 cores, I applied the same restriction in the docker-compose file (a sketch follows below).
This caused a test to run for a very long time (over 8 minutes for one test), and during that time CPU usage was below 3%.
(Screenshot: docker stats output captured while the test is running.)
So restricting the number of CPU cores causes less CPU usage? I am confused as to what's happening here.
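A sketch of what that restriction can look like in the compose file (the cpus key under deploy.resources.limits is honored by docker compose up in Compose V2; the values mirror the MSFT agent specs cited in the answer below):

deploy:
  resources:
    limits:
      cpus: "2"      # cap the container at 2 CPU cores
      memory: 7gb    # match the hosted agent's RAM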
Answers
So there was an issue with "thread pool starvation". This didn't happen on my machine because Docker had all 4 CPU cores available; once I limited the container to 2 cores, the problem appeared locally and I was able to figure out the underlying cause.
Lesson learned: try to reproduce the issue locally, with container resources set close to the MSFT agent specs. In this case, a 2-core CPU and 7 GB of RAM.
Also, if your tests run for a long time and never finish, you can get more information by using the --blame-hang-timeout flag, which sets a time limit on a test:

dotnet test <your project> --blame-hang-timeout 2min

After that time limit, a "hangdump" file is generated with information on the error. That's how I found out about the underlying issue.
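Applied to the Dockerfile's test command, the invocation might look like this (a sketch; the 2min timeout and mini dump type are example choices, not from the original post):

dotnet test ./src/MyProject/MyProject.Tests.csproj \
    --logger trx --results-directory /var/temp \
    --blame-hang-timeout 2min --blame-hang-dump-type mini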
Update on 1/4
Microsoft-hosted agents have limited performance when running pipelines, due to their fixed hardware configuration and network service.
If your pipeline has a job with high performance requirements, it's suggested to run the pipeline on a self-hosted agent or a VMSS (virtual machine scale set).
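Switching to a self-hosted agent is a one-line change in the pipeline's pool block (the pool name below is hypothetical):

pool:
  name: MySelfHostedPool   # hypothetical name of your self-hosted agent pool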
================================================================
Origin
I suppose that your issue could be related to the Build job timeout setting; you can check it in the job's settings (the original answer included a screenshot).
By the way, you could look into the docs on parallel jobs and their time limits for more reference.
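For reference, the job-level timeoutInMinutes setting is where this limit is configured; note that Microsoft-hosted agents still cap the effective maximum regardless of the value you set (a sketch of the relevant YAML):

jobs:
- job: Build
  timeoutInMinutes: 120   # 0 means "use the maximum the agent/plan allows"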
===============================================================
First update
I suppose that the duration of the pipeline could be affected by multiple factors, such as network health, data and file transfer speed, or agent machine performance. If your task involves transferring a large number of individual files, you could try using an archive task when uploading to the agent workspace and an extract task when building or testing the project, as in the sketch below.
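A minimal sketch using the built-in ArchiveFiles@2 and ExtractFiles@1 tasks (the folder paths are assumptions):

steps:
- task: ArchiveFiles@2
  inputs:
    rootFolderOrFile: '$(Build.SourcesDirectory)/TestResults'   # assumed source folder
    includeRootFolder: false
    archiveType: 'zip'
    archiveFile: '$(Build.ArtifactStagingDirectory)/TestResults.zip'
- task: ExtractFiles@1
  inputs:
    archiveFilePatterns: '$(Build.ArtifactStagingDirectory)/TestResults.zip'
    destinationFolder: '$(Pipeline.Workspace)/TestResults'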