I’m new to Kubernetes and to supporting a particular website hosted in Kubernetes. I’m trying to figure out why cert-manager did not renew the certificate in the QA environment a few weeks back.
Looking at the details of various certificate-related resources, the problem seems to be that the challenge failed:
State: invalid, Reason: Error accepting authorization: acme: authorization error for [DOMAIN]: 400 urn:ietf:params:acme:error:connection: Fetching http://[DOMAIN]/.well-known/acme-challenge/[CHALLENGE TOKEN STRING]: Timeout during connect (likely firewall problem)
I assume that error means Let’s Encrypt wasn’t able to access the challenge file at http://[DOMAIN]/.well-known/acme-challenge/[CHALLENGE TOKEN STRING]
(Domain and challenge token string redacted)
I’ve tried connecting to the URL via PowerShell:
PS C:\Users\Simon> Invoke-WebRequest -Uri http://[DOMAIN]/.well-known/acme-challenge/[CHALLENGE TOKEN STRING] -SkipCertificateCheck
and it returns a 200 OK.
However, PowerShell follows redirects automatically. Checking with Wireshark shows that the Nginx web server is performing a 308 Permanent Redirect to https://[DOMAIN]/.well-known/acme-challenge/[CHALLENGE TOKEN STRING]
(same URL but just redirecting HTTP to HTTPS)
I understand that Let’s Encrypt should be able to handle HTTP to HTTPS redirects.
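One way to see the redirect itself, rather than the page at the end of it, is to send a single request that does not follow redirects. The following is a minimal Python sketch, not part of the original troubleshooting; the function names fetch_without_redirects and is_plain_https_upgrade are illustrative, and the URL passed in is whatever challenge URL is being tested.

```python
import http.client
from urllib.parse import urlsplit


def fetch_without_redirects(url, timeout=10):
    """Send one GET request and return (status, Location header).

    http.client never follows redirects, so a 308 is reported as a 308.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_plain_https_upgrade(original_url, location):
    """True if `location` is the same host, path and query as `original_url`, but over HTTPS."""
    if not location:
        return False
    before, after = urlsplit(original_url), urlsplit(location)
    return (
        before.scheme == "http"
        and after.scheme == "https"
        and before.hostname == after.hostname
        and before.path == after.path
        and before.query == after.query
    )
```

Calling fetch_without_redirects on the HTTP challenge URL should return the 308 status and the HTTPS Location that Wireshark showed, and is_plain_https_upgrade confirms that nothing but the scheme changed. Note that this only shows what your own machine sees. The validation request comes from Let's Encrypt's servers, so a connect timeout on their side may not be reproducible from a workstation.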
Given that the URL Let’s Encrypt was trying to reach is accessible from the internet, I’m at a loss as to what the next step in investigating this issue should be. Could anyone offer any advice?
Here is the full output of the kubectl cert-manager plugin, checking the status of the certificate and associated resources:
PS C:\Users\Simon> kubectl cert-manager status certificate -n qa containers-tls-secret
Name: containers-tls-secret
Namespace: qa
Created at: 2020-10-16T08:40:14+13:00
Conditions:
Ready: False, Reason: Expired, Message: Certificate expired on Sun, 14 Mar 2021 17:41:12 UTC
Issuing: False, Reason: Failed, Message: The certificate request has failed to complete and will be retried: Failed to wait for order resource "containers-tls-secret-q2cwr-3223066309" to become ready: order is in "invalid" state:
DNS Names:
- [DOMAIN]
Events:
Type     Reason   Age                 From          Message
----     ------   ----                ----          -------
Normal   Issuing  31s (x236 over 9d)  cert-manager  Renewing certificate as renewal was scheduled at 2021-02-12 17:41:12 +0000 UTC
Normal   Reused   31s (x236 over 9d)  cert-manager  Reusing private key stored in existing Secret resource "containers-tls-secret"
Warning  Failed   31s (x236 over 9d)  cert-manager  The certificate request has failed to complete and will be retried: Failed to wait for order resource "containers-tls-secret-q2cwr-3223066309" to become ready: order is in "invalid" state:
Issuer:
Name: letsencrypt
Kind: ClusterIssuer
Conditions:
Ready: True, Reason: ACMEAccountRegistered, Message: The ACME account was registered with the ACME server
Events: <none>
Secret:
Name: containers-tls-secret
Issuer Country: US
Issuer Organisation: Let's Encrypt
Issuer Common Name: R3
Key Usage: Digital Signature, Key Encipherment
Extended Key Usages: Server Authentication, Client Authentication
Public Key Algorithm: RSA
Signature Algorithm: SHA256-RSA
Subject Key ID: dadf29869b58d05e980c390fdc8783f52369228d
Authority Key ID: 142eb317b75856cbae500940e61faf9d8b14c2c6
Serial Number: 04f7356add94a7909afab94f0847a3457765
Events: <none>
Not Before: 2020-12-15T06:41:12+13:00
Not After: 2021-03-15T06:41:12+13:00
Renewal Time: 2021-02-13T06:41:12+13:00
CertificateRequest:
Name: containers-tls-secret-q2cwr
Namespace: qa
Conditions:
Ready: False, Reason: Failed, Message: Failed to wait for order resource "containers-tls-secret-q2cwr-3223066309" to become ready: order is in "invalid" state:
Events: <none>
Order:
Name: containers-tls-secret-q2cwr-3223066309
State: invalid, Reason:
Authorizations:
URL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/10810339315, Identifier: [DOMAIN], Initial State: pending, Wildcard: false
FailureTime: 2021-02-13T06:41:59+13:00
Challenges:
- Name: containers-tls-secret-q2cwr-3223066309-2302286353, Type: HTTP-01, Token: [CHALLENGE TOKEN STRING], Key: [CHALLENGE TOKEN STRING].8b00cc-ysOWGQ8vtmpOJobWOFa2cEQUe4Sun5NUKCws, State: invalid, Reason: Error accepting authorization: acme: authorization error for [DOMAIN]: 400 urn:ietf:params:acme:error:connection: Fetching http://[DOMAIN]/.well-known/acme-challenge/[CHALLENGE TOKEN STRING]: Timeout during connect (likely firewall problem), Processing: false, Presented: false
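The timestamps in this output can be cross-checked with a little date arithmetic. The sketch below uses only values copied from the output above. The 30-day figure is simply what the dates work out to, not a statement about how this cluster's cert-manager is configured.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the status output above.
not_after = datetime.fromisoformat("2021-03-15T06:41:12+13:00")
renewal_time = datetime.fromisoformat("2021-02-13T06:41:12+13:00")
failure_time = datetime.fromisoformat("2021-02-13T06:41:59+13:00")

# How long before expiry the renewal was scheduled.
renew_before = not_after - renewal_time
# How soon after the scheduled renewal the authorization failed.
seconds_to_failure = (failure_time - renewal_time).total_seconds()
# The expiry converted to UTC, for comparison with the "Expired" condition message.
expiry_utc = not_after.astimezone(timezone.utc)

print(renew_before == timedelta(days=30))  # True
print(seconds_to_failure)                  # 47.0
print(expiry_utc.isoformat())              # 2021-03-14T17:41:12+00:00
```

So the authorization failed 47 seconds after the scheduled renewal time, and every later event in the list still refers to that same invalid order.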
By the way, the invoke-webrequest results show an HTML page was returned:
<!doctype html><html lang="en"><head><meta charset="utf-8"><title>Containers</title><base href="./"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="icon" href="favicon.ico…
Could that be the issue? I don’t know what Let’s Encrypt expects to find at the URL of the HTTP-01 challenge. Is a web page allowed, or is it expecting something different?
EDIT: I now suspect the HTML page returned by invoke-webrequest is not normal, since I understand the file should include the Let’s Encrypt token and a key. Here is the full HTML page:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Wineworks</title>
<base href="./">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="icon" href="favicon.ico">
<link rel="apple-touch-icon-precomposed" href="favicon-152.png">
<meta name="msapplication-TileColor" content="#FFFFFF">
<meta name="msapplication-TileImage" content="favicon-152.png">
<script src="https://secure.aadcdn.microsoftonline-p.com/lib/1.0.16/js/adal.min.js"/>
<link href="styles.025a840d59ecfcfe427e.bundle.css" rel="stylesheet"/>
</head>
<body>
<app-root/>
<script type="text/javascript" src="inline.ce954cfcbe723b5986e6.bundle.js"/>
<script type="text/javascript" src="polyfills.7edc676f7558876c179d.bundle.js"/>
<script type="text/javascript" src="main.da3590aac44ee76e7b3a.bundle.js"/>
</body>
</html>
Any idea what might cause cert-manager to drop the wrong kind of file at the challenge location?
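For reference, RFC 8555 (section 8.3) says the body served at an HTTP-01 challenge URL should be the key authorization: the token, a dot, and the base64url-encoded SHA-256 thumbprint of the ACME account key. That is the same shape as the "Key" value in the challenge status above, so an HTML page is not a valid response. A minimal Python sketch of a shape check follows; the function name and the sample values in the usage note are illustrative, and it checks only the format, not that the thumbprint belongs to the right account.

```python
import re

# An unpadded base64url-encoded SHA-256 digest is always 43 characters long.
THUMBPRINT_RE = re.compile(r"[A-Za-z0-9_-]{43}")


def looks_like_key_authorization(body, token):
    """Check that `body` has the form "<token>.<account key thumbprint>".

    Trailing whitespace is ignored, as the RFC suggests validators should do.
    """
    text = body.rstrip()
    prefix = token + "."
    if not text.startswith(prefix):
        return False
    return THUMBPRINT_RE.fullmatch(text[len(prefix):]) is not None
```

Passing the body returned by the challenge URL together with the token from the Challenge resource should give True while the challenge is being presented. The HTML page shown above would give False.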
2 Answers
In the end I was unable to determine the cause of the certificate renewal failure. However, events on one of the certificate-related resources suggested that previous renewals had worked, so I thought the problem might have been transient or a one-off, and that trying the renewal again might work.
Reading various articles and blog posts it appeared that deleting the CertificateRequest object would prompt cert-manager to create a new one, which should result in a certificate renewal. Also, deleting the CertificateRequest object would automatically delete the associated ACME Order and Challenge objects as well, so it wouldn't be necessary to delete them manually.
Deleting the CertificateRequest object did work: the certificate was renewed successfully. However, it didn't renew straight away. Further reading suggests the renewal may take an hour (I didn't check exactly how long it took, so I can't verify this).
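To delete a CertificateRequest:
kubectl delete certificaterequest <CertificateRequest name> -n <namespace>
For example, for the CertificateRequest shown in the question:
kubectl delete certificaterequest containers-tls-secret-q2cwr -n qa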
If you wish to force an immediate renewal rather than waiting an hour, and you have the kubectl cert-manager plugin installed, run the following command once the CertificateRequest object has been deleted and cert-manager has created a new one:
kubectl cert-manager renew <certificate name> -n <namespace>
For example, to renew certificate my-certificate in namespace qa:
kubectl cert-manager renew my-certificate -n qa
NOTE: The easiest way to install the kubectl cert-manager plugin is via the Krew plugin manager:
kubectl krew install cert-manager
See https://krew.sigs.k8s.io/docs/user-guide/setup/install/ for details of how to install Krew (which is useful for all kubectl plugins, not just cert-manager).
One further thing I found while researching this is that the old certificate secret can sometimes get "stuck", preventing a new secret from being created. You can delete the certificate secret to avoid this problem. For example, for the secret shown in the question:
kubectl delete secret containers-tls-secret -n qa
I assume, however, that without a certificate secret your website will have no certificate, which may prevent browsers from accessing it. So I would only delete the existing secret as a last resort.
Maybe this will help someone in the future. In my case the cause of the problem described above was a misleading wildcard (*) IPv6 (AAAA) DNS record. Let's Encrypt checks both the IPv4 and the IPv6 records.
The solution was therefore to remove the IPv6 record.
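A quick way to see whether a hostname resolves to IPv6 addresses as well as IPv4 ones is to list both. The following is a minimal Python sketch using the standard library; the function names are illustrative. It asks the local resolver, so the result may differ from what Let's Encrypt sees, and some systems hide IPv6 results when the machine has no IPv6 connectivity.

```python
import socket


def group_by_family(addrinfos):
    """Group getaddrinfo() results into sorted IPv4 and IPv6 address lists."""
    grouped = {"IPv4": set(), "IPv6": set()}
    for family, _type, _proto, _canonname, sockaddr in addrinfos:
        if family == socket.AF_INET:
            grouped["IPv4"].add(sockaddr[0])
        elif family == socket.AF_INET6:
            grouped["IPv6"].add(sockaddr[0])
    return {name: sorted(addresses) for name, addresses in grouped.items()}


def resolve_addresses(hostname):
    """Resolve `hostname` with the local resolver and group the results by address family."""
    return group_by_family(socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP))
```

If resolve_addresses returns an unexpected entry under "IPv6" for the certificate's domain, for example one coming from a wildcard record, then the IPv6 address is worth checking as a possible target of the validation request.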