
I maintain a service that basically pings sites to check whether they’re online or not. The service itself is really simple: it relies only on the HTTP status code returned by the requested URL, and I ignore the response body completely.

The service works fine for a small list of domains. However, networking becomes an issue as the number of sites to ping grows. I tried a couple of different languages and libraries. My latest implementation uses NodeJS and node-fetch, but I already had versions of it written in Python, PHP, Java, and Golang. From that experience, I now know the language is not what determines the request/response speed. There are differences between languages and libraries, for sure, but the bottleneck is not there.

Today, I think the only way to make the service scale is with multiple clusters in different networks (e.g. VPCs, if we’re talking AWS). I can’t think of a way to deal with networking restrictions using a single instance, or just a few.

So, I’m asking this really broad question: what strategies can I use to overcome networking limitations? I’m looking for both dev and ops answers, but mostly focusing on keeping the infrastructure as light as possible.

2 Answers


  1. One robust way to ping a website (or any TCP service in general) is to send a TCP SYN packet to port 443 (or 80 for plain HTTP) and measure the time until the SYN+ACK response arrives. Tools like hping3 and MTR use this method.

    This method is one of the best because ICMP may be blocked, take a different path, be prioritized differently by routers along the path, or be answered by a completely different host, whereas a TCP SYN exercises the same path the website’s users do. The network load is minimal, since no data is sent in SYN/SYN+ACK packets, only protocol headers (TCP, IP, and lower-level protocol headers).
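A true SYN-only probe needs raw sockets and root privileges (which is how hping3 works). An unprivileged approximation is to time a full TCP connect, which adds only the client’s final ACK on top of the SYN/SYN+ACK round trip the answer describes. A minimal sketch in Python — the `tcp_ping` helper name is illustrative, not from the answer:

```python
import socket
import time

def tcp_ping(host, port=443, timeout=5.0):
    """Measure the time to complete a TCP handshake with host:port.

    Returns elapsed seconds, or None if the connection failed.
    A full connect() covers SYN -> SYN+ACK -> ACK, so it slightly
    overestimates the raw SYN/SYN+ACK round trip hping3 measures,
    but it needs no special privileges.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Because the connection is closed immediately after the handshake, no application data is ever exchanged, keeping the per-site cost close to the SYN-probe minimum.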

  2. The answer of @Maxim Egorushkin is great; TCP SYN scanning is the most efficient approach I can think of. Other tools, such as Masscan, use pcap to send SYN packets from userspace, which avoids the kernel’s TCP connection-management overhead. This approach may do the job with a single instance.

    If you want to use HTTP to make sure the application layer works, send an HTTP HEAD request. It responds with the same headers and status code as GET, but without the body.
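As a sketch of that idea using Python’s standard `http.client` (the `head_status` helper name is mine):

```python
import http.client

def head_status(host, port=None, path="/", timeout=5.0, use_tls=True):
    """Send a HEAD request and return the integer status code.

    HEAD fetches only the status line and headers -- the server sends
    no body -- so the per-site transfer stays small.
    Returns None on connection failure.
    """
    cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = cls(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except OSError:
        return None
    finally:
        conn.close()
```

Note that some servers reject or mishandle HEAD; for those, a GET that discards the body is the fallback.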

    Another potential optimization is DNS: you can host a DNS server locally and update the domain records ahead of time, or use a script to update the hosts file before pinging the sites. This can save several milliseconds and some bandwidth on every probe.
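A lighter variant of the same idea, without running a DNS server or touching the hosts file, is to resolve every domain once up front and connect to the cached IPs during the probe run. A sketch using the stdlib resolver (the `prefetch_dns` name is illustrative):

```python
import socket

def prefetch_dns(hosts):
    """Resolve each hostname once and cache the first IPv4 address.

    Probes can then connect to the cached IPs directly, skipping a
    DNS lookup (and its latency) on every ping. Hostnames that fail
    to resolve are simply omitted from the cache.
    """
    cache = {}
    for host in hosts:
        try:
            infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
            cache[host] = infos[0][4][0]
        except socket.gaierror:
            pass
    return cache
```

When probing over HTTPS by IP, remember to still send the original hostname as the SNI/Host value, or certificate validation and virtual hosting will break.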

    At the development level, you could implement a client that parses only the status code of the HTTP response, saving some CPU time that would otherwise be spent parsing headers.
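For example, a minimal parser might read only the status line and never touch the headers or body (illustrative sketch):

```python
def parse_status_code(raw):
    """Extract only the status code from raw HTTP response bytes.

    Looks at just the status line ("HTTP/1.1 200 OK") and ignores
    everything after the first CRLF, so no time is spent parsing
    headers. Returns None if the bytes don't look like HTTP.
    """
    line, _, _ = raw.partition(b"\r\n")
    parts = line.split(b" ", 2)
    if len(parts) < 2 or not parts[0].startswith(b"HTTP/"):
        return None
    try:
        return int(parts[1])
    except ValueError:
        return None
```

For an availability checker this is enough: once the status line has arrived, the rest of the response can be discarded and the connection closed.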

    It also helps to identify the actual bottleneck first: is it the bandwidth limit? The memory limit? The file descriptor limit?
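The file descriptor limit in particular is easy to rule in or out programmatically, since every open socket consumes one descriptor. A small sketch using Python’s POSIX-only `resource` module (the `fd_headroom` helper is my naming):

```python
import resource

def fd_headroom(planned_connections):
    """Check planned concurrent connections against the soft FD limit.

    Each concurrent socket consumes one file descriptor, so if the
    planned concurrency doesn't fit under the soft limit, connects
    will start failing with EMFILE. Returns (soft_limit, fits).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, planned_connections < soft
```

If it doesn’t fit, raise the soft limit (e.g. `ulimit -n` in the shell, or `resource.setrlimit` up to the hard limit) before scaling up concurrency.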
