I’m experiencing difficulties in accessing a ScrapyRT service running on specific ports within a Kubernetes pod. My setup includes a Kubernetes cluster with a pod running a Scrapy application, which uses ScrapyRT to listen for incoming requests on designated ports. These requests are intended to trigger spiders on the corresponding ports.
Despite correctly setting up a Kubernetes service and referencing the Scrapy pod in it, I’m unable to receive any incoming requests to the pod. My understanding is that in Kubernetes networking, a service should be created first, followed by the pod, allowing inter-pod communication and external access through the service. Is this correct?
Below are the relevant configurations:
scrapy-pod Dockerfile:
# Use Ubuntu as the base image
FROM ubuntu:latest
# Avoid prompts from apt
ENV DEBIAN_FRONTEND=noninteractive
# Update package repository and install Python, pip, and other utilities
RUN apt-get update && \
    apt-get install -y curl software-properties-common iputils-ping net-tools dnsutils vim build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Install nvm (Node Version Manager) - EXPRESS
ENV NVM_DIR /usr/local/nvm
ENV NODE_VERSION 16.20.1
RUN mkdir -p $NVM_DIR
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
# Install Node.js and npm - EXPRESS
RUN . "$NVM_DIR/nvm.sh" && nvm install $NODE_VERSION && nvm alias default $NODE_VERSION && nvm use default
# Add Node and npm to path so the commands are available - EXPRESS
ENV NODE_PATH $NVM_DIR/versions/node/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH
# Install Yarn - EXPRESS
RUN npm install --global yarn
# Set the working directory in the container to /usr/src/app
WORKDIR /usr/src/app
# Copy the current directory contents into the container at /usr/src/app
COPY . .
# Install any needed packages specified in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy the start_services.sh script into the container
COPY start_services.sh /start_services.sh
# Make the script executable
RUN chmod +x /start_services.sh
# Install any needed packages specified in package.json using Yarn - EXPRESS
RUN yarn install
# Expose all the necessary ports
EXPOSE 14805 14807 12085 14806 13905 12080 14808 8000
# Define environment variable - EXPRESS
ENV NODE_ENV production
# Run the script when the container starts
CMD ["/start_services.sh"]
start_services.sh:
#!/bin/bash
# Start ScrapyRT instances on different ports
scrapyrt -p 14805 &
scrapyrt -p 14807 &
scrapyrt -p 12085 &
scrapyrt -p 14806 &
scrapyrt -p 13905 &
scrapyrt -p 12080 &
scrapyrt -p 14808 &
# Keep the container running since the ScrapyRT processes are in the background
tail -f /dev/null
service yaml file:
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - name: port-14805
      protocol: TCP
      port: 14805
      targetPort: 14805
    - name: port-14807
      protocol: TCP
      port: 14807
      targetPort: 14807
    - name: port-12085
      protocol: TCP
      port: 12085
      targetPort: 12085
    - name: port-14806
      protocol: TCP
      port: 14806
      targetPort: 14806
    - name: port-13905
      protocol: TCP
      port: 13905
      targetPort: 13905
    - name: port-12080
      protocol: TCP
      port: 12080
      targetPort: 12080
    - name: port-14808
      protocol: TCP
      port: 14808
      targetPort: 14808
    - name: port-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
deployment yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
        - name: scrapy-pod
          image: mydockerhub/privaterepository-scrapy:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 14805
            - containerPort: 14806
            - containerPort: 14807
            - containerPort: 12085
            - containerPort: 13905
            - containerPort: 12080
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: scrapy-env-secret
            - secretRef:
                name: express-env-secret
      imagePullSecrets:
        - name: my-docker-credentials
scrapy-pod’s logs in PowerShell terminal:
> k logs scrapy-deployment-56b9d66858-p59gs -f
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Site starting on 12080
2024-01-09 21:53:27+0000 [-] Site starting on 14808
2024-01-09 21:53:27+0000 [-] Site starting on 14805
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f4cbdf44d60>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fef9b620a00>
2024-01-09 21:53:27+0000 [-] Site starting on 13905
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 14807
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f0892ff4df0>
2024-01-09 21:53:27+0000 [-] Site starting on 14806
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f00d3b99000>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fba9e321180>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f1782514f10>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 12085
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fb2054cd060>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
Issue:
Despite these configurations, no requests seem to reach the Scrapy pod. Logs from kubectl logs show that ScrapyRT instances start successfully on the specified ports. However, when I send requests from a separate debug pod running a Python Jupyter Notebook, they succeed for other pods but not for the Scrapy pod.
Question:
How can I successfully connect to the Scrapy pod? What might be preventing the requests from reaching it?
Any insights or suggestions would be greatly appreciated.
Repair Attempts And Results
Milind’s Suggestions
- Verify that the selector field in the service YAML (scrapy-service) matches the labels in the deployment YAML (scrapy-deployment). The labels should be the same to correctly select the pods.
Yes, the selector field in the service YAML matches the labels in the deployment YAML.
scrapy-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: ClusterIP
scrapy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
        - name: scrapy-pod
          ...
- Did you check the logs to see if there are any error messages or indications that the requests are being received?
Yes, I checked the logs, but I see no indication that the requests are being received. Here’s the series of steps I take to check this.
Get all the pods:
> k get po
NAME READY STATUS RESTARTS AGE
express-app-deployment-545f899f88-zq58r 1/1 Running 0 2d8h
jupyter-debug-pod 1/1 Running 0 31h
scrapy-deployment-56b9d66858-wfhpk 1/1 Running 0 31h
Get all the pods and show their IP:
> k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
express-app-deployment-545f899f88-zq58r 1/1 Running 0 2d8h 10.244.0.191 pool-6snxmm4o8-xd7ds <none> <none>
jupyter-debug-pod 1/1 Running 0 31h 10.244.1.14 pool-6snxmm4o8-xz05i <none> <none>
scrapy-deployment-56b9d66858-wfhpk 1/1 Running 0 31h 10.244.1.96 pool-6snxmm4o8-xz05i <none> <none>
Check the scrapy-deployment logs:
> k logs scrapy-deployment-56b9d66858-wfhpk -f
2024-01-13 23:55:55+0000 [-] Log opened.
2024-01-13 23:55:55+0000 [-] Site starting on 14805
2024-01-13 23:55:55+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f6b6fe04460>
2024-01-13 23:55:55+0000 [-] Running with reactor: AsyncioSelectorReactor.
Check the services:
> k get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
express-backend-service ClusterIP 10.245.59.90 <none> 80/TCP 9d
scrapy-service ClusterIP 10.245.129.89 <none> 14805/TCP 31h
In a separate terminal, I exec into the jupyter-debug-pod:
> k exec -it scrapy-deployment-56b9d66858-wfhpk -- /bin/bash
root@scrapy-deployment-56b9d66858-wfhpk:/usr/src/app#
nslookup scrapy-service:
# nslookup scrapy-service
Server: 10.245.0.10
Address: 10.245.0.10#53
Name: scrapy-service.default.svc.cluster.local
Address: 10.245.129.89
So, it SEES scrapy-service, and it also shows 10.245.0.10 (the cluster DNS server), which I don’t see mentioned previously.
When I curl express-backend-service, it works as expected:
# curl express-backend-service
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>HTTP Video Stream</title>
</head>
<body>
<video id="videoPlayer" width="650" controls muted="muted" autoplay>
<source src="/video/play" type="video/mp4" />
</video>
</body>
</html>
But when I curl scrapy-service, it just hangs and then fails:
# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 130552 ms: Connection timed out
Even when I add the 14805 port, it still fails:
# curl scrapy-service:14805
curl: (7) Failed to connect to scrapy-service port 14805 after 6 ms: Connection refused
- Did you verify that DNS resolution is working within the cluster and that the name (scrapy-service) can be resolved?
Yes, scrapy-service successfully resolves to an internal cluster IP address (10.245.129.89).
- Did you verify whether there are any firewall rules that might be blocking traffic between pods within the cluster?
I checked my DigitalOcean control panel’s firewall settings and saw that all ports were allowed under Outbound Rules. However, I noticed that I had nothing set up under Inbound Rules. Perhaps this was the issue? I immediately set up rules for TCP (all ports, all IPv4, all IPv6) and the same for UDP and ICMP. However, after making the changes, deleting the service and deployment, and then recreating both from scratch, the issue remained.
- Did you try ping or telnet to check connectivity between the pods and the cluster?
Yeah, I tried both; they failed too.
Here’s the result of telnet:
# telnet scrapy-service.default.svc.cluster.local
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection timed out
root@jupyter-debug-pod:/# telnet scrapy-service.default.svc.cluster.local 14805
Trying 10.245.24.22...
telnet: Unable to connect to remote host: Connection refused
Here’s the result of ping:
# ping scrapy-service.default.svc.cluster.local
PING scrapy-service.default.svc.cluster.local (10.245.24.22) 56(84) bytes of data.
--- scrapy-service.default.svc.cluster.local ping statistics ---
1295 packets transmitted, 0 received, 100% packet loss, time 1325042ms
- I can see that the scrapy-service is of type ClusterIP, which means it’s an internal service. This won’t work if you need external access. Please double-check it, and try changing it to NodePort or LoadBalancer to gain external access.
OK, I changed scrapy-service.yaml to NodePort, like so:
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805
  type: NodePort
Afterwards (after deleting and recreating the service), I tried curl scrapy-service again:
# curl scrapy-service
curl: (28) Failed to connect to scrapy-service port 80 after 129976 ms: Connection timed out
This too failed.
- Lastly, verify if the pod is running.
> k logs scrapy-deployment-56b9d66858-6xs9r -f
2024-01-15 07:33:04+0000 [-] Log opened.
2024-01-15 07:33:04+0000 [-] Site starting on 14805
2024-01-15 07:33:04+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f51f08fce20>
2024-01-15 07:33:04+0000 [-] Running with reactor: AsyncioSelectorReactor.
As you can see above, the pod is running and producing logs.
So now you can see my frustration, after over a week of being unable to solve this. There is another pod, express-app-deployment-545f899f88-zq58r, which does NOT behave like this: it runs an Express.js app on port 8000, and its service, express-backend-service, works as expected.
2 Answers
Few things to try –
- Verify that the selector field in the service YAML (scrapy-service) matches the labels in the deployment YAML (scrapy-deployment). The labels should be the same to correctly select the pods.
- The scrapy-service is of type ClusterIP, which means it’s an internal service. This won’t work if you need external access. Please double-check it, and try changing it to NodePort or LoadBalancer to gain external access.
Let me know if the above troubleshooting steps were helpful.
In Kubernetes, the order of creating a service and a pod is not strictly significant in terms of functionality. Services are designed to discover pods dynamically: when a service is created, it continuously monitors for pods that match its selector criteria, regardless of whether those pods were created before or after the service itself.
A Kubernetes service acts as a stable endpoint for a group of pods that match its selector. It provides a consistent IP address and port(s) through which the pods can be accessed, both internally within the cluster and externally, depending on the service type (ClusterIP, NodePort, LoadBalancer). The service makes sure any request to its IP and port is forwarded to one of the pods that match its selector.
In your configuration (service.yaml), the service is set up with a selector that matches the labels of the Scrapy pod. That means the service will route traffic to any pod with the label app: scrapy-pod, regardless of when these pods are created. So, it is not mandatory for the service to be created before the pods. However, creating the service first can be a good practice, as it makes sure the routing endpoint is available as soon as the pods are up and running.
Note: The EXPOSE directive in the Dockerfile is more of a documentation feature; it does not actually publish the port. The actual port exposure to the outside world is handled by the Kubernetes service and pod configuration. I explained this in "Does "ports" on docker-compose.yml have the same effect as EXPOSE on Dockerfile?"
The script starts multiple ScrapyRT instances, each on a different port. It ends with tail -f /dev/null to keep the container running, since the ScrapyRT processes are sent to the background. As commented, it is more reliable to run a single foreground process per container: that improves observability and allows Kubernetes to restart the container if the process fails.
The service.yaml defines a Kubernetes service with multiple ports, each corresponding to one of the ScrapyRT instances. That setup is fine, but as mentioned, running a single instance per container is more manageable. Restructure your deployment to run a single ScrapyRT instance per pod; that will make it easier to diagnose issues.
You could start with a simple setup: one ScrapyRT instance per pod, running on a standard port like 8000.
After deploying your service, use kubectl get endpoints scrapy-service to make sure the service has endpoints and is correctly targeting your pods.
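If the selector is correct, the pod’s IP and port should appear under ENDPOINTS; an empty column means the service matches no pods. Hypothetical output for this setup (the IP and age are illustrative):

> kubectl get endpoints scrapy-service
NAME             ENDPOINTS          AGE
scrapy-service   10.244.1.96:8000   5m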
To deploy only a single ScrapyRT instance per pod, the script can be simplified to start only one instance.
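A minimal sketch of that script. The SCRAPE_PORT variable is described below; the 8000 fallback and the explicit bind address are additions here, since ScrapyRT listens on localhost by default and, without -i 0.0.0.0, the instance is not reachable from other pods:

#!/bin/bash
# start_services.sh: run a single ScrapyRT instance in the foreground.
# -i 0.0.0.0 binds all interfaces; ScrapyRT's default bind is localhost,
# which other pods cannot reach.
# SCRAPE_PORT is expected from the pod spec; 8000 is an assumed fallback.
exec scrapyrt -i 0.0.0.0 -p "${SCRAPE_PORT:-8000}"

Using exec replaces the shell, so scrapyrt runs as the container’s main foreground process; Kubernetes can then restart the container if it crashes, and the tail -f /dev/null workaround is no longer needed.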
With this script, you only need one Docker image. The script will use the SCRAPE_PORT environment variable to determine which port to listen on. The Dockerfile remains unchanged, except for making sure it uses the new start_services.sh.
Kustomize allows you to define a base configuration and then customize it for different environments or scenarios. You can define a base service and deployment, and then use Kustomize to create variations for each ScrapyRT instance.
Base Service (base-service.yaml):
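A minimal sketch, using port 8000 per the simple setup suggested above:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - protocol: TCP
      port: 8000        # port exposed by the service
      targetPort: 8000  # port ScrapyRT listens on inside the pod
  type: ClusterIP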
Base Deployment (base-deployment.yaml):
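A matching sketch, reusing the image, labels, and secret names from your manifests; the SCRAPE_PORT variable feeds the startup script:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
        - name: scrapy-pod
          image: mydockerhub/privaterepository-scrapy:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
          env:
            - name: SCRAPE_PORT   # read by start_services.sh
              value: "8000"
          envFrom:
            - secretRef:
                name: scrapy-env-secret
      imagePullSecrets:
        - name: my-docker-credentials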
You can then use Kustomize to create overlays for each port. In each overlay, you would override the port, targetPort, and SCRAPE_PORT environment variable to match the desired ScrapyRT instance.
You would have a directory structure like this:
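One plausible layout (file names are illustrative; the base directory needs its own kustomization.yaml listing base-service.yaml and base-deployment.yaml as resources):

base/
  kustomization.yaml
  base-service.yaml
  base-deployment.yaml
overlays/
  14805/
    kustomization.yaml
    deployment.yaml
    service.yaml
  14807/
    kustomization.yaml
    deployment.yaml
    service.yaml
  ...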
Example Overlay for port 14805 (overlays/14805/kustomization.yaml):
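A sketch of that file; the nameSuffix keeps each instance’s resources distinct when several overlays are deployed:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patchesStrategicMerge:
  - deployment.yaml
  - service.yaml
nameSuffix: -14805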
You would create deployment.yaml and service.yaml in each overlay directory to specify the port number for that specific instance. For example, for port 14805:
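Hypothetical strategic-merge patches; only the fields that change are listed, and the metadata names must match the base resources so Kustomize knows what to patch:

# overlays/14805/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
spec:
  template:
    spec:
      containers:
        - name: scrapy-pod
          ports:
            - containerPort: 14805
          env:
            - name: SCRAPE_PORT
              value: "14805"

# overlays/14805/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  ports:
    - protocol: TCP
      port: 14805
      targetPort: 14805

One caveat: if several overlays run side by side, the shared app: scrapy-pod selector will match the pods of every instance, so each service would load-balance across all of them; in that case, also vary the pod labels and service selector per overlay.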
To deploy a specific instance, you would run Kustomize and kubectl apply in the directory of the desired overlay. For example:
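Assuming kubectl’s built-in Kustomize support (available since kubectl 1.14):

> kubectl apply -k overlays/14805/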
That would apply the base configuration with the modifications defined in the overlays/14805/ directory, effectively deploying the ScrapyRT instance listening on port 14805.
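Once deployed, you can verify the instance from your debug pod. ScrapyRT serves a /crawl.json endpoint that takes spider_name and url query parameters; the spider name and target URL below are placeholders, and the scrapy-service-14805 name assumes the nameSuffix from the overlay above:

# curl "http://scrapy-service-14805:14805/crawl.json?spider_name=myspider&url=https://example.com"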