
I’m experiencing difficulties in accessing a ScrapyRT service running on specific ports within a Kubernetes pod. My setup includes a Kubernetes cluster with a pod running a Scrapy application, which uses ScrapyRT to listen for incoming requests on designated ports. These requests are intended to trigger spiders on the corresponding ports.
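
For reference, the requests look roughly like this (the spider name and target URL below are placeholders): a GET to ScrapyRT's crawl.json endpoint on the port belonging to the spider I want to run.

curl "http://scrapy-service:14805/crawl.json?spider_name=my_spider&url=https://example.com/page"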

Despite correctly setting up a Kubernetes service and referencing the Scrapy pod in it, I’m unable to receive any incoming requests to the pod. My understanding is that in Kubernetes networking, a service should be created first, followed by the pod, allowing inter-pod communication and external access through the service. Is this correct?

Below are the relevant configurations:


scrapy-pod Dockerfile:

# Use Ubuntu as the base image
FROM ubuntu:latest

# Avoid prompts from apt
ENV DEBIAN_FRONTEND=noninteractive

# Update package repository and install Python, pip, and other utilities
RUN apt-get update && \
    apt-get install -y curl software-properties-common iputils-ping net-tools dnsutils vim build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*


# Install nvm (Node Version Manager) - EXPRESS
ENV NVM_DIR /usr/local/nvm
ENV NODE_VERSION 16.20.1

RUN mkdir -p $NVM_DIR
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash

# Install Node.js and npm - EXPRESS
RUN . "$NVM_DIR/nvm.sh" && nvm install $NODE_VERSION && nvm alias default $NODE_VERSION && nvm use default

# Add Node and npm to path so the commands are available - EXPRESS
ENV NODE_PATH $NVM_DIR/versions/node/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH

# Install Yarn - EXPRESS
RUN npm install --global yarn

# Set the working directory in the container to /usr/src/app
WORKDIR /usr/src/app

# Copy the current directory contents into the container at /usr/src/app
COPY . .

# Install any needed packages specified in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the start_services.sh script into the container
COPY start_services.sh /start_services.sh

# Make the script executable
RUN chmod +x /start_services.sh


# Install any needed packages specified in package.json using Yarn - EXPRESS
RUN yarn install


# Expose all the necessary ports
EXPOSE 14805 14807 12085 14806 13905 12080 14808 8000


# Define environment variable - EXPRESS
ENV NODE_ENV production

# Run the script when the container starts
CMD ["/start_services.sh"]

start_services.sh:

#!/bin/bash

# Start ScrapyRT instances on different ports
scrapyrt -p 14805 &
scrapyrt -p 14807 &
scrapyrt -p 12085 &
scrapyrt -p 14806 &
scrapyrt -p 13905 &
scrapyrt -p 12080 &
scrapyrt -p 14808 &

# Keep the container running since the ScrapyRT processes are in the background
tail -f /dev/null


service yaml file:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - name: port-14805
      protocol: TCP
      port: 14805
      targetPort: 14805
    - name: port-14807
      protocol: TCP
      port: 14807
      targetPort: 14807
    - name: port-12085
      protocol: TCP
      port: 12085
      targetPort: 12085
    - name: port-14806
      protocol: TCP
      port: 14806
      targetPort: 14806
    - name: port-13905
      protocol: TCP
      port: 13905
      targetPort: 13905
    - name: port-12080
      protocol: TCP
      port: 12080
      targetPort: 12080
    - name: port-14808
      protocol: TCP
      port: 14808
      targetPort: 14808
    - name: port-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP


deployment yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
      - name: scrapy-pod
        image: mydockerhub/privaterepository-scrapy:latest
        imagePullPolicy: Always  
        ports:
        - containerPort: 14805
        - containerPort: 14806
        - containerPort: 14807
        - containerPort: 12085
        - containerPort: 13905
        - containerPort: 12080
        - containerPort: 8000
        envFrom:
        - secretRef:
            name: scrapy-env-secret
        - secretRef:
            name: express-env-secret
      imagePullSecrets:
      - name: my-docker-credentials 


scrapy-pod’s logs in a PowerShell terminal:

> k logs scrapy-deployment-56b9d66858-p59gs -f
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Site starting on 12080
2024-01-09 21:53:27+0000 [-] Site starting on 14808
2024-01-09 21:53:27+0000 [-] Site starting on 14805
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f4cbdf44d60>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fef9b620a00>
2024-01-09 21:53:27+0000 [-] Site starting on 13905
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 14807
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f0892ff4df0>
2024-01-09 21:53:27+0000 [-] Site starting on 14806
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f00d3b99000>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fba9e321180>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f1782514f10>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 12085
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fb2054cd060>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.

Issue:
Despite these configurations, no requests seem to reach the Scrapy pod. The kubectl logs output shows that the ScrapyRT instances start successfully on the specified ports. However, when I send requests from a separate debug pod running a Python Jupyter Notebook, they succeed for other pods but not for the Scrapy pod.

Question:
How can I successfully connect to the Scrapy pod? What might be preventing the requests from reaching it?

Any insights or suggestions would be greatly appreciated.

Repair Attempts And Results

Milind’s Suggestions

  • Verify that the selector field in the service YAML (scrapy-service) matches the labels in the deployment YAML (scrapy-deployment). The labels should be the same to correctly select the pods.
    Yes, the selector field in the service YAML matches the labels in the deployment YAML.
scrapy-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: ClusterIP
scrapy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
      - name: scrapy-pod
...
  • Did you check the logs for any error messages or indications that the requests are being received?
    Yes, I checked the logs, but there is no indication that the requests are being received. Here’s the series of steps I took to check this.

Get all the pods:

> k get po
NAME                                         READY   STATUS    RESTARTS   AGE
express-app-deployment-545f899f88-zq58r      1/1     Running   0          2d8h
jupyter-debug-pod                            1/1     Running   0          31h
scrapy-deployment-56b9d66858-wfhpk           1/1     Running   0          31h

Get all the pods and show their IP:

> k get po -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP             NODE                   NOMINATED NODE   READINESS GATES
express-app-deployment-545f899f88-zq58r      1/1     Running   0          2d8h   10.244.0.191   pool-6snxmm4o8-xd7ds   <none>           <none>
jupyter-debug-pod                            1/1     Running   0          31h    10.244.1.14    pool-6snxmm4o8-xz05i   <none>           <none>
scrapy-deployment-56b9d66858-wfhpk           1/1     Running   0          31h    10.244.1.96    pool-6snxmm4o8-xz05i   <none>           <none>

Check the scrapy-deployment logs:

> k logs scrapy-deployment-56b9d66858-wfhpk -f
2024-01-13 23:55:55+0000 [-] Log opened.
2024-01-13 23:55:55+0000 [-] Site starting on 14805
2024-01-13 23:55:55+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f6b6fe04460>
2024-01-13 23:55:55+0000 [-] Running with reactor: AsyncioSelectorReactor.

Check the services:

> k get svc
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
express-backend-service   ClusterIP   10.245.59.90     <none>        80/TCP      9d
scrapy-service            ClusterIP   10.245.129.89    <none>        14805/TCP   31h

In a separate terminal, I exec into the scrapy pod:

> k exec -it scrapy-deployment-56b9d66858-wfhpk -- /bin/bash
root@scrapy-deployment-56b9d66858-wfhpk:/usr/src/app#

nslookup scrapy-service:

# nslookup scrapy-service
Server:         10.245.0.10
Address:        10.245.0.10#53

Name:   scrapy-service.default.svc.cluster.local
Address: 10.245.129.89

So it SEES scrapy-service, and the resolution comes from 10.245.0.10, an address I hadn’t seen mentioned previously (this is the cluster’s DNS server).

When I curl express-backend-service, it works as expected:

# curl express-backend-service
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>HTTP Video Stream</title>
  </head>
  <body>
    <video id="videoPlayer" width="650" controls muted="muted" autoplay>
      <source src="/video/play" type="video/mp4" />
    </video>
  </body>
</html>

But when I curl scrapy-service, it just hangs and then fails:

# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 130552 ms: Connection timed out

Even when I add the 14805 port, it still fails:

# curl scrapy-service:14805
curl: (7) Failed to connect to scrapy-service port 14805 after 6 ms: Connection refused
  • Did you verify that DNS resolution is working within the cluster and that the name (scrapy-service) can be resolved?

Yes, the scrapy-service is successfully resolving to an internal cluster IP address (10.245.129.89).

  • Did you verify whether there are any firewall rules that might be blocking traffic between pods within the cluster?

I checked my DigitalOcean control panel’s firewall settings and saw that all ports were allowed under Outbound Rules. However, I noticed that I had nothing set up under Inbound Rules. Perhaps this was the issue? I immediately added rules allowing all ports for TCP (All ports/All IPv4/All IPv6), and the same for UDP and ICMP. However, after making the changes, deleting the service and deployment, and then recreating both from scratch, the issue persisted.
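
For completeness, pod-to-pod traffic can also be blocked by Kubernetes NetworkPolicies, independently of the cloud firewall. A quick way to check whether any exist:

# List NetworkPolicies in every namespace; an empty result means
# nothing is filtering pod-to-pod traffic at this layer.
kubectl get networkpolicies --all-namespaces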

  • Did you try ping or telnet to check connectivity between the pod and the cluster?

Yes, I tried that; it failed too.

Here’s the result of telnet:

# telnet scrapy-service.default.svc.cluster.local
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection timed out
root@jupyter-debug-pod:/# telnet scrapy-service.default.svc.cluster.local 14805
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection refused

Here’s the result of ping:

# ping scrapy-service.default.svc.cluster.local
PING scrapy-service.default.svc.cluster.local (10.245.24.22) 56(84) bytes of data.

--- scrapy-service.default.svc.cluster.local ping statistics ---
1295 packets transmitted, 0 received, 100% packet loss, time 1325042ms
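
Two notes on these results: the first telnet attempt specifies no port, so it targets telnet’s default port 23, and a ClusterIP is a virtual address implemented by kube-proxy rules that typically does not answer ICMP, so the 100% ping loss is expected even for a healthy service. The "Connection refused" on 14805 is the informative result: it usually means nothing is listening on that address from outside the pod. One way to check which address the ScrapyRT ports are bound to, from inside the scrapy pod (netstat comes from the net-tools package already installed in the image):

# 127.0.0.1:14805 means only pod-local clients can connect;
# 0.0.0.0:14805 means other pods can reach it through the service.
netstat -tlnp | grep 14805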
  • I can see that scrapy-service is of type ClusterIP, which means it is an internal service. This won’t work if you need external access. Please double-check it, and try changing it to NodePort or LoadBalancer to gain external access.

Ok, I changed scrapy-service.yaml to NodePort like so:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: NodePort

Afterwards, I deleted and recreated the service, then tried curl scrapy-service again:

# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 129976 ms: Connection timed out

This too failed.

  • Lastly, verify that the pod is running.
> k logs scrapy-deployment-56b9d66858-6xs9r -f
2024-01-15 07:33:04+0000 [-] Log opened.
2024-01-15 07:33:04+0000 [-] Site starting on 14805
2024-01-15 07:33:04+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f51f08fce20>
2024-01-15 07:33:04+0000 [-] Running with reactor: AsyncioSelectorReactor.

As you can see above, the pod is running and producing logs.

So now you can see my frustration: after more than a week, I have been unable to solve this. There is another pod, express-app-deployment-545f899f88-zq58r, which does NOT behave like this. It runs an Express.js app on port 8000, and its service, express-backend-service, works as expected.


Answers


  1. A few things to try:

    • Verify that the selector field in the service YAML (scrapy-service) matches the labels in the deployment YAML (scrapy-deployment). The labels should be the same to correctly select the pods.
    • Did you check the logs for any error messages or indications that the requests are being received?
    • Did you verify that DNS resolution is working within the cluster and that the name (scrapy-service) can be resolved?
    • Did you verify whether there are any firewall rules that might be blocking traffic between pods within the cluster?
    • Did you try ping or telnet to check connectivity between the pod and the cluster?
    • I can see that scrapy-service is of type ClusterIP, which means it is an internal service. This won’t work if you need external access. Please double-check it, and try changing it to NodePort or LoadBalancer to gain external access.
    • Lastly, verify that the pod is running.

    Let me know if the above troubleshooting steps were helpful.

  2. My understanding is that in Kubernetes networking, a service should be created first, followed by the pod, allowing inter-pod communication and external access through the service. Is this correct?

    In Kubernetes, the order in which a service and a pod are created is not significant in terms of functionality. Services are designed to be dynamic and to discover matching pods on the fly.
    When a service is created, it continuously monitors for pods that match its selector criteria, regardless of whether those pods were created before or after the service itself.

    A Kubernetes service acts as a stable endpoint for a group of pods that match its selector. It provides a consistent IP address and port(s) through which the pods can be accessed, both internally within the cluster and externally, depending on the service type (ClusterIP, NodePort, LoadBalancer).
    The service makes sure any request to its IP and port is forwarded to one of the pods that match its selector.


    In your configuration (service.yaml), the service is set up with a selector that matches the labels of the Scrapy pod. That means that the service will route traffic to any pod with the label app: scrapy-pod, regardless of when these pods are created.
    So, it is not mandatory for the service to be created before the pods. However, creating the service first can be a good practice, as it makes sure the routing endpoint is available as soon as the pods are up and running.

    Note: The EXPOSE directive in the Dockerfile is more of a documentation feature; it does not actually publish the port. The actual port exposure to the outside world is handled by the Kubernetes service and pod configuration.
    I explained this in "Does "ports" on docker-compose.yml have the same effect as EXPOSE on Dockerfile?"

    The script starts multiple ScrapyRT instances, each on different ports. It ends with tail -f /dev/null to keep the container running since the ScrapyRT processes are sent to the background.
    As commented, it is more reliable to run a single foreground process per container. That improves observability and allows Kubernetes to restart the container if the process fails.

    The service.yaml defines a Kubernetes service with multiple ports, each corresponding to one of the ScrapyRT instances. That setup is fine, but as mentioned, running a single instance per container is more manageable.


    Restructure your deployment to run a single ScrapyRT instance per pod. That will make it easier to diagnose issues.

    You could start with a simple setup: one ScrapyRT instance per pod, running on a standard port like 8000.

    • Deploy this setup with a Kubernetes service and try accessing it from within the cluster.
    • If this works, scale up by increasing the number of replicas in your deployment.
    • If it does not work, focus on networking and Kubernetes service troubleshooting.

    After deploying your service, use kubectl get endpoints scrapy-service to make sure the service has endpoints and is correctly targeting your pods.
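
    A quick sketch of that check (the output is illustrative, using the pod IP seen earlier):

    kubectl get endpoints scrapy-service
    # NAME             ENDPOINTS           AGE
    # scrapy-service   10.244.1.96:14805   31h
    #
    # If ENDPOINTS shows <none>, the selector is not matching any ready
    # pods, and the service has nothing to forward traffic to.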


    To deploy only a single ScrapyRT instance per pod:

    The script can be simplified to start only one instance.

    #!/bin/bash
    
    # Use an environment variable for the port number
    PORT=${SCRAPE_PORT:-14805}  # Default to 14805 if not set
    
    # Start a single ScrapyRT instance on the specified port
    scrapyrt -p $PORT
    

    With this script, you only need one Docker image. The script uses the SCRAPE_PORT environment variable to determine which port to listen on. The Dockerfile remains unchanged, apart from making sure it uses the new start_services.sh.

    # (previous Dockerfile content)
    
    # Copy the start_services.sh script into the container
    COPY start_services.sh /start_services.sh
    
    # Make the script executable
    RUN chmod +x /start_services.sh
    
    # Run the script when the container starts
    CMD ["/start_services.sh"]
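
    One further detail worth verifying, because it produces exactly the "Connection refused" symptom seen above even when the service and selectors are correct: ScrapyRT binds to localhost by default (check scrapyrt --help for your version), so inside a container it may accept connections only from within the pod itself. A sketch of the start script that binds to all interfaces, assuming the -i/--ip option of current ScrapyRT releases:

    #!/bin/bash
    
    # Port comes from the environment, defaulting to 14805
    PORT=${SCRAPE_PORT:-14805}
    
    # Bind to 0.0.0.0 so other pods can reach this instance through the
    # service; exec replaces the shell so SIGTERM from Kubernetes reaches
    # scrapyrt directly instead of being swallowed by bash.
    exec scrapyrt -i 0.0.0.0 -p "$PORT"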
    

    Kustomize allows you to define a base configuration and then customize it for different environments or scenarios. You can define a base service and deployment, and then use Kustomize to create variations for each ScrapyRT instance.

    • Base Service (base-service.yaml):

      apiVersion: v1
      kind: Service
      metadata:
        name: scrapy-service  # Base name
      spec:
        selector:
          app: scrapy-pod
        ports:
          - protocol: TCP
            port: 8000  # That will be overridden
            targetPort: 8000  # That will be overridden
        type: ClusterIP
      
    • Base Deployment (base-deployment.yaml):

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: scrapy-deployment  # Base name
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: scrapy-pod
        template:
          metadata:
            labels:
              app: scrapy-pod
          spec:
            containers:
            - name: scrapy-pod
              image: mydockerhub/privaterepository-scrapy:latest
              ports:
              - containerPort: 8000  # That will be overridden
              env:
              - name: SCRAPE_PORT
                value: "8000"  # That will be overridden
      

    You can then use Kustomize to create overlays for each port. In each overlay, you would override the port, targetPort, and SCRAPE_PORT environment variable to match the desired ScrapyRT instance.

    You would have a directory structure like this:

    .
    ├── base
    │   ├── kustomization.yaml
    │   ├── base-deployment.yaml
    │   └── base-service.yaml
    └── overlays
        ├── 14805
        │   ├── kustomization.yaml
        │   ├── deployment.yaml
        │   └── service.yaml
        ├── 14807
        │   ├── kustomization.yaml
        │   ├── deployment.yaml
        │   └── service.yaml
        └── (other ports)
    

    Example overlay for port 14805 (overlays/14805/kustomization.yaml). Note that the base directory gets its own kustomization.yaml listing base-deployment.yaml and base-service.yaml as resources, so each overlay can pull in the whole base directory:

    resources:
    - ../../base
    
    nameSuffix: "-14805"
    
    patchesStrategicMerge:
    - deployment.yaml
    - service.yaml
    

    You would create deployment.yaml and service.yaml in each overlay directory to set the port number for that specific instance. The patches must reference the base resource names (scrapy-deployment, scrapy-service); the nameSuffix then renames the rendered objects to scrapy-deployment-14805 and scrapy-service-14805. For example, for port 14805:

    # deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: scrapy-deployment
    spec:
      template:
        spec:
          containers:
          - name: scrapy-pod
            env:
            - name: SCRAPE_PORT
              value: "14805"
    
    # service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: scrapy-service
    spec:
      ports:
      - $patch: replace  # replace the base ports list instead of merging into it
      - protocol: TCP
        port: 14805
        targetPort: 14805
    

    To deploy a specific instance, point kubectl apply -k at the desired overlay directory.
    For example:

    kubectl apply -k overlays/14805/
    

    That would apply the base configuration with the modifications defined in the overlays/14805/ directory, effectively deploying the ScrapyRT instance listening on port 14805.
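
    To preview the rendered manifests for an overlay without applying them, kubectl can expand the kustomization directly:

    kubectl kustomize overlays/14805/
    

    This prints the final YAML, with the nameSuffix and port patches applied, to stdout.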
