Configure Health Probes
Overview
Implementing health probes in your application is a great way for Kubernetes to automate some tasks and improve availability in the event of an error.
Because OSM reconfigures application Pods to redirect all inbound and outbound network traffic through the proxy sidecar, httpGet and tcpSocket health probes invoked by the kubelet will fail because the proxy lacks the required mTLS context. To make httpGet health probes work in the mesh, OSM adds configuration to expose the probe endpoints via the proxy and rewrites the probe definitions for new Pods to reference the proxy-exposed endpoints. All of the functionality of the original probes is still available; OSM simply fronts them with the proxy so the kubelet can communicate with them.
Special configuration is required to support tcpSocket health probes in the mesh. Because OSM redirects all network traffic through Envoy, all ports appear open on the Pod. As a result, all TCP connections routed to Pods with an injected Envoy sidecar appear successful. To make tcpSocket health probes work in the mesh, OSM rewrites them to httpGet probes and adds an iptables command to bypass the Envoy proxy for the endpoint exposed by osm-healthcheck. The osm-healthcheck container is added to the Pod to handle HTTP health probe requests from the kubelet. The handler gets the original TCP port from the request's Original-Tcp-Port header and attempts to open a socket on the specified port. The response status code of the httpGet probe reflects whether the TCP connection succeeded.
| Probe | Path | Port |
|---|---|---|
| Liveness | /osm-liveness-probe | 15901 |
| Readiness | /osm-readiness-probe | 15902 |
| Startup | /osm-startup-probe | 15903 |
| Healthcheck | /osm-healthcheck | 15904 |
For HTTP and tcpSocket probes, both the port and the path are modified. For HTTPS probes, the port is modified but the path is left unchanged.
Only predefined httpGet and tcpSocket probes are modified. If a probe is not defined, one will not be added in its place. exec probes (including those that use grpc_health_probe) are never modified and will continue to function as expected, as long as the probe's command does not access the network outside of localhost.
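The rewrite rules above can be sketched as a small function. Below is a minimal illustration in Go, assuming hypothetical type and function names; it mirrors the liveness rules in the table but is not OSM's actual source:

```go
package main

import "fmt"

// probe holds the fields of an httpGet probe that get rewritten.
// Type and field names here are illustrative, not OSM's source.
type probe struct {
	Scheme string // "HTTP" or "HTTPS"
	Port   int
	Path   string
}

// rewriteLiveness applies the liveness rules described above
// (readiness and startup work the same way with ports 15902 and 15903).
func rewriteLiveness(p probe) probe {
	p.Port = 15901 // the proxy-exposed liveness port
	if p.Scheme == "HTTP" {
		// HTTP probes get both the port and the path rewritten.
		p.Path = "/osm-liveness-probe"
	}
	// HTTPS probes keep their original path; only the port changes.
	return p
}

func main() {
	fmt.Println(rewriteLiveness(probe{Scheme: "HTTP", Port: 14001, Path: "/liveness"}))
	// → {HTTP 15901 /osm-liveness-probe}
	fmt.Println(rewriteLiveness(probe{Scheme: "HTTPS", Port: 14001, Path: "/liveness"}))
	// → {HTTPS 15901 /liveness}
}
```

This matches the HTTP and HTTPS examples shown later: the HTTP probe's path becomes /osm-liveness-probe, while the HTTPS probe's path stays /liveness.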
Examples
The following examples show how OSM handles health probes for Pods in a mesh.
HTTP
Consider a Pod with a container that defines the following livenessProbe:
livenessProbe:
  httpGet:
    path: /liveness
    port: 14001
    scheme: HTTP
When the Pod is created, OSM will modify the probe to the following:
livenessProbe:
  httpGet:
    path: /osm-liveness-probe
    port: 15901
    scheme: HTTP
The Pod's proxy will contain the following Envoy configuration.
An Envoy cluster that maps to the original probe port 14001:
{
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "liveness_cluster",
    "type": "STATIC",
    "connect_timeout": "1s",
    "load_assignment": {
      "cluster_name": "liveness_cluster",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "0.0.0.0",
                    "port_value": 14001
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "last_updated": "2021-03-29T21:02:59.086Z"
}
A listener for the new proxy-exposed HTTP endpoint /osm-liveness-probe on port 15901, which maps to the cluster above:
{
  "listener": {
    "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
    "name": "liveness_listener",
    "address": {
      "socket_address": {
        "address": "0.0.0.0",
        "port_value": 15901
      }
    },
    "filter_chains": [
      {
        "filters": [
          {
            "name": "envoy.filters.network.http_connection_manager",
            "typed_config": {
              "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
              "stat_prefix": "health_probes_http",
              "route_config": {
                "name": "local_route",
                "virtual_hosts": [
                  {
                    "name": "local_service",
                    "domains": [
                      "*"
                    ],
                    "routes": [
                      {
                        "match": {
                          "prefix": "/osm-liveness-probe"
                        },
                        "route": {
                          "cluster": "liveness_cluster",
                          "prefix_rewrite": "/liveness"
                        }
                      }
                    ]
                  }
                ]
              },
              "http_filters": [...],
              "access_log": [...]
            }
          }
        ]
      }
    ]
  },
  "last_updated": "2021-03-29T21:02:59.092Z"
}
HTTPS
Consider a Pod with a container that defines the following livenessProbe:
livenessProbe:
  httpGet:
    path: /liveness
    port: 14001
    scheme: HTTPS
When the Pod is created, OSM will modify the probe to the following:
livenessProbe:
  httpGet:
    path: /liveness
    port: 15901
    scheme: HTTPS
The Pod's proxy will contain the following Envoy configuration.
An Envoy cluster that maps to the original probe port 14001:
{
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "liveness_cluster",
    "type": "STATIC",
    "connect_timeout": "1s",
    "load_assignment": {
      "cluster_name": "liveness_cluster",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "0.0.0.0",
                    "port_value": 14001
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "last_updated": "2021-03-29T21:02:59.086Z"
}
A listener for the new proxy-exposed TCP endpoint on port 15901, which maps to the cluster above:
{
  "listener": {
    "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
    "name": "liveness_listener",
    "address": {
      "socket_address": {
        "address": "0.0.0.0",
        "port_value": 15901
      }
    },
    "filter_chains": [
      {
        "filters": [
          {
            "name": "envoy.filters.network.tcp_proxy",
            "typed_config": {
              "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
              "stat_prefix": "health_probes",
              "cluster": "liveness_cluster",
              "access_log": [...]
            }
          }
        ]
      }
    ]
  },
  "last_updated": "2021-04-07T15:09:22.704Z"
}
tcpSocket
Consider a Pod with a container that defines the following livenessProbe:
livenessProbe:
  tcpSocket:
    port: 14001
When the Pod is created, OSM will modify the probe to the following:
livenessProbe:
  httpGet:
    httpHeaders:
    - name: Original-Tcp-Port
      value: "14001"
    path: /osm-healthcheck
    port: 15904
    scheme: HTTP
Requests to port 15904 bypass the Envoy proxy and are directed to the osm-healthcheck endpoint.
How to verify Pod health in the mesh
Kubernetes will automatically poll the health check endpoints of Pods configured with startup, liveness, and readiness probes.
When a startup probe fails, Kubernetes will generate an event (visible via kubectl describe pod <pod name>) and restart the Pod. The kubectl describe output may look like the following:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17s default-scheduler Successfully assigned bookstore/bookstore-v1-699c79b9dc-5g8zn to osm-control-plane
Normal Pulled 16s kubelet Successfully pulled image "openservicemesh/init:v0.8.0" in 26.5835ms
Normal Created 16s kubelet Created container osm-init
Normal Started 16s kubelet Started container osm-init
Normal Pulling 16s kubelet Pulling image "openservicemesh/init:v0.8.0"
Normal Pulling 15s kubelet Pulling image "envoyproxy/envoy-alpine:v1.17.2"
Normal Pulling 15s kubelet Pulling image "openservicemesh/bookstore:v0.8.0"
Normal Pulled 15s kubelet Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 319.9863ms
Normal Started 15s kubelet Started container bookstore-v1
Normal Created 15s kubelet Created container bookstore-v1
Normal Pulled 14s kubelet Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 755.2666ms
Normal Created 14s kubelet Created container envoy
Normal Started 14s kubelet Started container envoy
Warning Unhealthy 13s kubelet Startup probe failed: Get "http://10.244.0.23:15903/osm-startup-probe": dial tcp 10.244.0.23:15903: connect: connection refused
Warning Unhealthy 3s (x2 over 8s) kubelet Startup probe failed: HTTP probe failed with statuscode: 503
When a liveness probe fails, Kubernetes will generate an event (visible via kubectl describe pod <pod name>) and restart the Pod. The kubectl describe output may look like the following:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 59s default-scheduler Successfully assigned bookstore/bookstore-v1-746977967c-jqjt4 to osm-control-plane
Normal Pulling 58s kubelet Pulling image "openservicemesh/init:v0.8.0"
Normal Created 58s kubelet Created container osm-init
Normal Started 58s kubelet Started container osm-init
Normal Pulled 58s kubelet Successfully pulled image "openservicemesh/init:v0.8.0" in 23.415ms
Normal Pulled 57s kubelet Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 678.1391ms
Normal Pulled 57s kubelet Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 230.3681ms
Normal Created 57s kubelet Created container envoy
Normal Pulling 57s kubelet Pulling image "envoyproxy/envoy-alpine:v1.17.2"
Normal Started 56s kubelet Started container envoy
Normal Pulled 44s kubelet Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 20.6731ms
Normal Created 44s (x2 over 57s) kubelet Created container bookstore-v1
Normal Started 43s (x2 over 57s) kubelet Started container bookstore-v1
Normal Pulling 32s (x3 over 58s) kubelet Pulling image "openservicemesh/bookstore:v0.8.0"
Warning Unhealthy 32s (x6 over 50s) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing 32s (x2 over 44s) kubelet Container bookstore-v1 failed liveness probe, will be restarted
When a readiness probe fails, Kubernetes will generate an event (visible via kubectl describe pod <pod name>) and ensure that no Service traffic is routed to the unhealthy Pod. The kubectl describe output for a Pod with a failing readiness probe may look like the following:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 32s default-scheduler Successfully assigned bookstore/bookstore-v1-5848999cb6-hp6qg to osm-control-plane
Normal Pulling 31s kubelet Pulling image "openservicemesh/init:v0.8.0"
Normal Pulled 31s kubelet Successfully pulled image "openservicemesh/init:v0.8.0" in 19.8726ms
Normal Created 31s kubelet Created container osm-init
Normal Started 31s kubelet Started container osm-init
Normal Created 30s kubelet Created container bookstore-v1
Normal Pulled 30s kubelet Successfully pulled image "openservicemesh/bookstore:v0.8.0" in 314.3628ms
Normal Pulling 30s kubelet Pulling image "openservicemesh/bookstore:v0.8.0"
Normal Started 30s kubelet Started container bookstore-v1
Normal Pulling 30s kubelet Pulling image "envoyproxy/envoy-alpine:v1.17.2"
Normal Pulled 29s kubelet Successfully pulled image "envoyproxy/envoy-alpine:v1.17.2" in 739.3931ms
Normal Created 29s kubelet Created container envoy
Normal Started 29s kubelet Started container envoy
Warning Unhealthy 0s (x3 over 20s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
The Pod's status will also indicate that it is not ready, which can be seen in the kubectl get pod output. For example:
NAME READY STATUS RESTARTS AGE
bookstore-v1-5848999cb6-hp6qg 1/2 Running 0 85s
A Pod's health probes can also be invoked manually by forwarding the Pod's relevant port and using curl or any other HTTP client to issue requests. For example, to verify the liveness probe of the bookstore-v1 demo Pod, get the Pod's name and forward port 15901:
kubectl port-forward -n bookstore deployment/bookstore-v1 15901
Then, in a separate terminal, curl can be used to check the endpoint. The following is an example for a healthy bookstore-v1:
$ curl -i localhost:15901/osm-liveness-probe
HTTP/1.1 200 OK
date: Wed, 31 Mar 2021 16:00:01 GMT
content-length: 1396
content-type: text/html; charset=utf-8
x-envoy-upstream-service-time: 1
server: envoy
<!doctype html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
...
</html>
Known issues
Troubleshooting
If a health probe is consistently failing, perform the following steps to determine the root cause:
- Verify that the httpGet and tcpSocket probes on Pods in the mesh have been modified.
  Startup, liveness, and readiness httpGet probes must be modified by OSM. Ports must be modified to 15901, 15902, and 15903 for liveness, readiness, and startup httpGet probes, respectively. Only HTTP (not HTTPS) probes will have their paths modified, to /osm-liveness-probe, /osm-readiness-probe, or /osm-startup-probe. Also verify that the Pod's Envoy configuration contains a listener for the modified endpoint.
  For tcpSocket probes to work in the mesh, they must be rewritten to httpGet probes. The port must be modified to 15904 for liveness, readiness, and startup probes, and the path must be set to /osm-healthcheck. The HTTP header Original-Tcp-Port must be set to the original port specified in the tcpSocket probe definition. Also verify that the osm-healthcheck container is running, and check its logs for more information. See the examples above for more details.
- Determine whether Kubernetes encountered any other errors while scheduling or starting the Pod.
  Look for recent errors about the unhealthy Pod using kubectl describe. Resolve those errors and verify the Pod's health again.
- Determine whether the Pod encountered a runtime error.
  Look for errors that occurred after the container started by inspecting its logs with kubectl logs. Resolve those errors and verify the Pod's health again.