We have a VM in Compute Engine running CentOs 8. This VM is running for a long time and is never restarted. Last night, suddenly we lost connectivity to the VM through internal as well as external IPs. SSH was also not being possible.
On the serial port, we observed below logs:
Apr 29 15:53:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:05:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3120] dhcp4 (eth0): option dhcp_lease_time => '3600'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_name => 'us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_name_servers => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_search => 'us-central1-a.c.<project-name>.internal c.<project-name>.internal google.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option expiry => '1619716266'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option host_name => '<vm-name>.us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option interface_mtu => '1460'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option ip_address => '10.128.0.4'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option next_server => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option ntp_servers => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option requested_broadcast_address => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option requested_domain_name => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_domain_name_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_domain_search => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_host_name => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_interface_mtu => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_ms_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_nis_domain => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_nis_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_ntp_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_rfc3442_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_root_path => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_routers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_subnet_mask => '1'
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=907 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_time_offset => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_wpad => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option rfc3442_classless_static_routes => '10.128.0.1/32 0.0.0.0 0.0.0.0/0 10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option routers => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option subnet_mask => '255.255.255.255'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): state changed extended -> extended
Apr 29 16:11:06 <vm-name> systemd[1]: Starting Network Manager Script Dispatcher Service...
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Apr 29 16:11:06 <vm-name> systemd[1]: Started Network Manager Script Dispatcher Service.
Apr 29 16:15:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:29:30 <vm-name> GCEGuestAgent[1269]: 2021-04-29T16:28:35.1302Z GCEGuestAgent Error main.go:181: Error watching metadata: Get http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=4ac15b8179731d72: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Apr 29 16:29:31 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:29:03.9464Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 16:29:44 <vm-name> google_osconfig_agent[1215]: default
[16281276.978048] google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:11:14 <vm-name> google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 16:42:42 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:03:31 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:41:09.0306Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> NetworkManager[907]: <info> [1619717130.6935] dhcp4 (eth0): option dhcp_lease_time => '3600'
Apr 29 17:27:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:26:00 logging client: context deadline exceeded
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:07:12.7039Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:28:55 <vm-name> NetworkManager[907]: <info> [1619717199.5681] dhcp4 (eth0): option domain_name => 'us-central1-a.c.<project-name>.internal'
Apr 29 17:34:48 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:27:54.8816Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:36:13 logging client: context deadline exceeded
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:55:16 <vm-name> NetworkManager[907]: <info> [1619717287.9321] dhcp4 (eth0): option domain_name_servers => '169.254.169.254'
Apr 29 17:58:36 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:30:47.0872Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: default
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:46:22 logging client: context deadline exceeded
Restarting the machine restored the network. We cannot find anything else in the logs, and the monitoring also doesn’t show anything suspicious. What could have caused this?
2
Answers
This OSConfigAgent Error seems to be the cause of the issue:
According to the GCP documentation:
Google Cloud runs a local metadata server alongside each instance at 169.254.169.254. This server is essential to the operation of the instance, so the instance can access it regardless of any firewall rules that you configure.
Based on that, there must be something interesting to look at within
/var/log/messages
. It is possible that you find something related to the Network adapter.I could infer that this is most likely because of disruption in the network connectivity based on the below log excerpts
From the logs, I notice that :
At 16:11:06, renewal of the DHCP lease happened, with a duration of 3600 seconds. So, the second renewal of the DHCP lease should be around 17:11:06. But from the logs, there was a delay of 15 minutes (approx.) in the renewal which occurred at 17:26:39. And the next renewal occurred at 17:55:07 which is 27 minutes (approx.) later.
The whole networking stack including DHCP renewal was delayed most likely due to CPU or memory overload. When the host machine experiences high CPU load it causes the networking to degrade.
To check if the overload was in CPU or Memory ,
open the cloud console —->Compute Engine —> VM instances —> Click on the three dots of your instance —> View monitoring —>CPU and Memory utilization charts (Check for the time frame when the issue occurred)
Make sure that your instance has enough memory utilization to sustain the load.