White-box monitoring
We monitor host resource usage, container runtime state, and the operational data of databases and middleware. These are the infrastructure that supports the business and its services. White-box monitoring exposes their actual internal state, and by watching the metrics we can anticipate problems and optimize away potential sources of uncertainty. For complete monitoring coverage, though, the large amount of white-box application monitoring should be complemented with an appropriate amount of black-box monitoring. Black-box monitoring tests a service's external visibility from the user's point of view; common examples are HTTP and TCP probes that check whether a site or service is reachable and how quickly it responds.
Black-box monitoring
The biggest difference from white-box monitoring is that black-box monitoring is failure-oriented: when a fault occurs, black-box monitoring detects it quickly, while white-box monitoring focuses on proactively discovering or predicting potential problems. A complete monitoring setup should surface potential issues from the white-box side and quickly detect failures that have already happened from the black-box side.
Blackbox Exporter is the official black-box monitoring solution from the Prometheus community; it can probe the network over HTTP, DNS, TCP, ICMP, and SSL.
https://github.com/prometheus/blackbox_exporter
Deploying blackbox_exporter
For detailed configuration, see:
https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
cat > /root/tools/exporter/blackexporter.yaml <<EOF
apiVersion: v1
data:
  config.yml: |
    modules:
      http_2xx:
        prober: http
        http:
          method: GET
          preferred_ip_protocol: "ip4"
      http_post_2xx:
        prober: http
        http:
          method: POST
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
      icmp:
        prober: icmp
        timeout: 3s
        icmp:
          preferred_ip_protocol: "ip4"
      dns_tcp:
        prober: dns
        timeout: 5s
        dns:
          transport_protocol: "tcp"
          preferred_ip_protocol: "ip4"
          query_name: "kubernetes.default.svc.cluster.local"
          query_type: "A"
kind: ConfigMap
metadata:
  name: blackbox-exporter
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    name: blackbox-exporter
    cluster: ali-huabei2-dev
  name: blackbox-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      name: blackbox-exporter
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: blackbox-exporter
        cluster: ali-huabei2-dev
    spec:
      containers:
      - image: prom/blackbox-exporter:v0.16.0
        name: blackbox-exporter
        ports:
        - containerPort: 9115
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        args:
        - --config.file=/etc/blackbox_exporter/config.yml
        - --log.level=info
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: blackbox-exporter
    cluster: ali-huabei2-dev
  name: blackbox-exporter
  namespace: monitoring
spec:
  selector:
    name: blackbox-exporter
  ports:
  - name: http-metrics
    port: 9115
    targetPort: 9115
  type: LoadBalancer
EOF
Apply the file, check the pod and svc, and access the exporter from a browser. If you change the ConfigMap, remember to restart the pod.
kubectl apply -f blackexporter.yaml
kubectl get svc -n monitoring
kubectl get deploy -n monitoring
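Before wiring it into Prometheus, you can hit the exporter's /probe endpoint directly to confirm it works. A quick check, where <SVC-IP> stands in for whatever address your LoadBalancer Service was given:

# probe a site through the http_2xx module; the output should contain probe_success 1
curl "http://<SVC-IP>:9115/probe?module=http_2xx&target=https://www.baidu.com"
# the exporter's own metrics are served at /metrics
curl "http://<SVC-IP>:9115/metrics"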
Defining custom scrape jobs
cat > /root/tools/exporter/prometheus-additional.yaml <<EOF
## check HTTP site availability
- job_name: "blackbox-external-website"
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
  - targets:
    - https://www.example.com # URL to check
    - https://test.example.com
    - https://www.baidu.com
    - http://www.sina.com.cn
    - http://www.liuyalei.top
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115
## check host liveness (ICMP)
- job_name: 'blackbox-node-status'
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
  - targets: ['127.0.0.1','192.168.8.101','192.168.0.0']
    labels:
      instance: 'node_status'
      group: 'node'
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115
## check host port liveness (TCP)
- job_name: 'blackbox-port-status'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
  - targets: ['127.0.0.1:9100','127.0.0.1:9090','192.168.8.102:22']
    labels:
      instance: 'port_status'
      group: 'tcp'
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:9115
EOF
replacement: blackbox-exporter:9115 — this must be set to the address and port of the blackbox-exporter Service!
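The three relabel rules turn each static target into a request against the exporter. Walking one target through them:

# static target:         https://www.baidu.com
# rule 1: __address__    -> __param_target  (becomes the ?target= query parameter)
# rule 2: __param_target -> instance        (keeps a readable instance label on the metrics)
# rule 3: __address__    <- blackbox-exporter:9115  (the host Prometheus actually scrapes)
# resulting scrape URL:
#   http://blackbox-exporter:9115/probe?module=http_2xx&target=https://www.baidu.com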
Create the Secret object for the jobs
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
Once created, the configuration above is base64-encoded and stored as the value of the prometheus-additional.yaml key:
[root@master01 exporter]# kubectl -n monitoring get secrets additional-configs -o yaml
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogJ2t1YmVybmV0ZXMtc2VydmljZS1lbmRwb2ludHMnCiAga3ViZXJuZXRlc19zZF9jb25maWdzOgogIC0gcm9sZTogZW5kcG9pbnRzCiAgcmVsYWJlbF9jb25maWdzOg...
kind: Secret
metadata:
  creationTimestamp: "2020-09-18T14:02:52Z"
  name: additional-configs
  namespace: monitoring
  resourceVersion: "109161"
  selfLink: /api/v1/namespaces/monitoring/secrets/additional-configs
  uid: cd24759a-bac9-4fbe-b744-9d48728a8e96
type: Opaque
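To sanity-check what was stored, decode the key back out of the Secret:

kubectl -n monitoring get secret additional-configs \
  -o jsonpath='{.data.prometheus-additional\.yaml}' | base64 -d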
Declare the extra configuration in the Prometheus resource manifest (prometheus-prometheus.yaml).
Add:
additionalScrapeConfigs:
  name: additional-configs
  key: prometheus-additional.yaml
[root@master01 exporter]# kubectl -n monitoring get prometheus -o yaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"monitoring.coreos.com/v1","kind":"Prometheus",...}
    creationTimestamp: "2020-08-29T20:00:47Z"
    generation: 4
    labels:
      prometheus: k8s
    name: k8s
    namespace: monitoring
    resourceVersion: "108147"
    selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/k8s
    uid: b71abf71-be3c-457e-8cc2-244d0b38612a
  spec:
    additionalScrapeConfigs:
      key: prometheus-additional.yaml
      name: additional-configs
    alerting:
      alertmanagers:
      - name: alertmanager-main
        namespace: monitoring
        port: web
    baseImage: quay.io/prometheus/prometheus
    nodeSelector:
      kubernetes.io/os: linux
    podMonitorSelector: {}
    replicas: 2
    resources:
      requests:
        memory: 400Mi
    ruleSelector:
      matchLabels:
        prometheus: k8s
        role: alert-rules
    securityContext:
      fsGroup: 2000
      runAsNonRoot: true
      runAsUser: 1000
    serviceAccountName: prometheus-k8s
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}
    version: v2.11.0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Once added, update the Prometheus CRD object directly:
kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com "k8s" configured
Reloading the Prometheus configuration
Delete the Secret first, then recreate it so Prometheus reloads; this is needed every time a job is added.
Open Prometheus's Configuration and Targets pages and you will see the blackbox-external-website job defined above (it is a bit slow, so allow some time for it to appear). Querying probe_success shows the probe status.
kubectl delete secrets -n monitoring additional-configs
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
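The prometheus-config-reloader sidecar normally notices the recreated Secret on its own within a minute or two. If you want to force the reload by hand, a sketch, assuming the standard prometheus-k8s Service name from kube-prometheus (the operator starts Prometheus with --web.enable-lifecycle, so the endpoint is available):

kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
curl -X POST http://127.0.0.1:9090/-/reload

Once the targets are up, probe_success == 0 lists the failing probes and probe_duration_seconds shows probe latency.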
Above we completed blackbox monitoring through Prometheus's additional scrape configuration; below we integrate blackbox-exporter with Prometheus through the Operator.
Integrating blackbox with a ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: a6-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
  ### check website availability (curl)
  - interval: 30s        # how often to scrape
    params:
      module:
      - http_2xx         # probe module
      target:
      - https://blog.csdn.net
    path: "/probe"
    port: http-metrics   # endpoint port name
    scheme: http         # scrape scheme
    scrapeTimeout: 30s   # scrape timeout
    relabelings:
    - sourceLabels:
      - __param_target
      targetLabel: target
    - sourceLabels:
      - __param_target
      targetLabel: instance
  ### check host liveness (ping)
  - interval: 30s
    params:
      module:
      - icmp
      target:
      - 192.168.8.11
    path: "/probe"
    port: http-metrics
    scheme: http
    scrapeTimeout: 30s
    relabelings:
    - sourceLabels:
      - __param_target
      targetLabel: target
    - sourceLabels:
      - __param_target
      targetLabel: instance
  ### check port liveness (telnet)
  - interval: 30s
    params:
      module:
      - tcp_connect
      target:
      - 192.168.8.11:22
    path: "/probe"
    port: http-metrics
    scrapeTimeout: 30s
    relabelings:
    - sourceLabels:
      - __param_target
      targetLabel: target
    - sourceLabels:
      - __param_target
      targetLabel: instance
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      name: blackbox-exporter  # must match the labels on the blackbox-exporter Service!!! Remember to label the svc, or nothing will match
A ServiceMonitor must specify a namespaceSelector; to match all namespaces, change it to the following:
namespaceSelector:
  any: true
Prometheus is associated with ServiceMonitors through serviceMonitorNamespaceSelector/serviceMonitorSelector; if the Prometheus object defines selector labels there, the ServiceMonitor must carry matching labels, otherwise the two will not be linked.
[root@master01 exporter]# kubectl -n monitoring get prometheus -o yaml
....
    serviceAccountName: prometheus-k8s
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}
....
Detailed endpoint configuration for ServiceMonitor:
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#endpoint
relabelings rewrites labels and supports regular expressions:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
Known issue:
ServiceMonitor does not support merging probes: to monitor multiple URLs, each one has to be added as its own endpoint. A future release will likely improve this (https://github.com/prometheus-operator/prometheus-operator/issues/2821).
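For reference, that issue was eventually addressed: prometheus-operator v0.43+ ships a dedicated Probe CRD that accepts a list of static targets for one blackbox module. A minimal sketch against the Service deployed above (the object name is hypothetical, and the Prometheus object must also select Probe resources via probeSelector/probeNamespaceSelector):

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: blackbox-websites       # hypothetical name
  namespace: monitoring
spec:
  prober:
    url: blackbox-exporter:9115 # host:port of the exporter Service, no scheme
  module: http_2xx
  targets:
    staticConfig:
      static:
      - https://www.example.com
      - https://test.example.com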
Grafana configuration
The Grafana shipped with the default Prometheus stack has no persistence; modify its Deployment to persist data locally.
# emptyDir: data is lost when the container is destroyed
- emptyDir: {}
  name: grafana-storage
Replace it with a hostPath:
volumes:
- name: grafana-storage
  hostPath:
    path: /tmp/grafana
    type: DirectoryOrCreate
chmod 777 /tmp/grafana
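Keep in mind that hostPath data lives on a single node; on a multi-node cluster the Grafana pod should be pinned to the node that owns /tmp/grafana, for example with a nodeSelector (the node name is a placeholder):

nodeSelector:
  kubernetes.io/hostname: node01  # placeholder: the node holding /tmp/grafana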
Installing plugins
[root@master01 exporter]# kubectl -n monitoring exec -it grafana-5cd56df4cd-jtbl7 bash
nobody@grafana-5cd56df4cd-jtbl7:/usr/share/grafana$ grafana-cli plugins install grafana-piechart-panel
Delete the pod to restart Grafana, then import dashboard 11543 from the Grafana website.
Monitoring domain certificate expiry
- alert: "SSL certificate expiry warning"
  expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 10
  for: 1h
  labels:
    severity: warn
  annotations:
    description: 'The certificate for {{$labels.instance}} expires in {{ printf "%.1f" $value }} days; please renew it as soon as possible'
    summary: "SSL certificate expiry warning"
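In the Operator stack this rule cannot be dropped into prometheus.yml directly; it has to live in a PrometheusRule object whose labels match the ruleSelector shown earlier (prometheus: k8s, role: alert-rules). A minimal wrapper sketch (the object name is arbitrary):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-ssl-rules  # arbitrary name
  namespace: monitoring
  labels:
    prometheus: k8s         # must match the Prometheus object's ruleSelector
    role: alert-rules
spec:
  groups:
  - name: ssl-expiry
    rules:
    - alert: "SSL certificate expiry warning"
      expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 10
      for: 1h
      labels:
        severity: warn
      annotations:
        summary: "SSL certificate expiry warning"
        description: 'The certificate for {{$labels.instance}} expires in {{ printf "%.1f" $value }} days; please renew it as soon as possible'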
References:
https://zhuanlan.zhihu.com/p/103095462
https://www.qikqiak.com/post/prometheus-operator-advance/
https://www.voidking.com/dev-prometheus-operator-blackbox-exporter/