keyfil/草稿/业务指标监控.md
liuxiaohua e74d7af5a4
All checks were successful
Publish to Confluence / confluence (push) Successful in 57s
[2025-04-08] Add Lazada integration
2025-04-08 20:48:40 +08:00

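The following is a complete setup for monitoring a Spring Boot business service's **request success rate, failure rate, and core-endpoint RTT (round-trip time)**. It uses the **open-source stack (Prometheus + Grafana)** to collect, store, and visualize the metrics:
---
### **I. Monitoring Architecture**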
```mermaid
graph TD
A[Spring Boot application] -->|exposes metrics| B(Prometheus)
B -->|stores / serves queries| C(Grafana)
C -->|displays| D[Monitoring dashboard]
```
---
### **II. Steps**
#### **1. Integrate the metrics library (Micrometer)**
Spring Boot supports **Micrometer** natively as its metrics library. Add the dependencies below and configure the application to expose metrics in Prometheus format.
##### **1.1 Add dependencies (pom.xml)**
```xml
<!-- Spring Boot Actuator (core metrics) -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Exposes metrics in Prometheus format -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
##### **1.2 Configure metric exposure (application.yml)**
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus  # expose the Prometheus endpoint
  metrics:
    tags:
      application: ${spring.application.name}  # tag every metric with the application name
```
---
#### **2. Deploy Prometheus to scrape the metrics**
##### **2.1 Prometheus configuration (prometheus.yml)**
```yaml
scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['your-spring-boot-app:8080']  # application address
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__metrics_path__]
        target_label: metrics_path
```
##### **2.2 Start Prometheus**
```bash
docker run -d --name prometheus \
-p 9090:9090 \
-v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
```
---
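#### **3. Collecting the core metrics**
##### **3.1 HTTP request success rate / failure rate**
Micrometer automatically records `http_server_requests_seconds_count` and `http_server_requests_seconds_sum`; the `status` label separates successful requests from failed ones. In the queries below only 5xx responses count as failures, so 4xx responses are treated as successes.
###### **Success rate formula (PromQL)**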
```promql
sum(rate(http_server_requests_seconds_count{application="your-app", status!~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="your-app"}[5m]))
```
###### **Failure rate formula (PromQL)**
```promql
sum(rate(http_server_requests_seconds_count{application="your-app", status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="your-app"}[5m]))
```
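To make the arithmetic behind the two ratios concrete, here is a minimal Python sketch. The counter snapshots are made-up numbers, and `per_second_rates` is a hypothetical helper that only approximates `rate()` as increase divided by window length. The real PromQL function also extrapolates to the window boundaries and handles counter resets.

```python
from typing import Dict, Tuple

# Hypothetical counter snapshots taken 5 minutes (300 s) apart.
# Keys are HTTP status codes, values are cumulative request counts,
# mirroring http_server_requests_seconds_count{status="..."}.
WINDOW_SECONDS = 300
before = {"200": 10_000, "404": 120, "500": 40, "503": 10}
after = {"200": 10_900, "404": 150, "500": 85, "503": 35}


def per_second_rates(start: Dict[str, int], end: Dict[str, int], window: float) -> Dict[str, float]:
    """Approximate rate(): counter increase divided by the window length."""
    return {status: (end[status] - start.get(status, 0)) / window for status in end}


def success_and_failure_ratio(rates: Dict[str, float]) -> Tuple[float, float]:
    """Split total traffic into non-5xx (success) and 5xx (failure) shares."""
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no traffic in the window, so the ratio is undefined")
    failed = sum(r for status, r in rates.items() if status.startswith("5"))
    return (total - failed) / total, failed / total


rates = per_second_rates(before, after, WINDOW_SECONDS)
success, failure = success_and_failure_ratio(rates)
print(f"success={success:.2%} failure={failure:.2%}")
```

With these sample numbers, 70 of the 1000 requests in the window returned 5xx, so the failure ratio is 7% and the success ratio is 93%. The two ratios always sum to 1 because they share the same denominator.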
##### **3.2 Core endpoint RTT (average response time)**
```promql
avg(rate(http_server_requests_seconds_sum{application="your-app", uri="/api/core"}[5m]))
/
avg(rate(http_server_requests_seconds_count{application="your-app", uri="/api/core"}[5m]))
```
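The query divides the growth of total request time by the growth of the request count. The Python sketch below reproduces that calculation on made-up snapshots from two hypothetical instances; the instance names and values are assumptions for illustration only.

```python
from typing import Dict, Tuple

WINDOW_SECONDS = 300  # matches the [5m] range in the query

# Hypothetical cumulative (seconds_sum, seconds_count) pairs for uri="/api/core",
# sampled at the start and end of the window on two made-up instances.
snapshots: Dict[str, Dict[str, Tuple[float, int]]] = {
    "instance-a": {"start": (120.0, 1000), "end": (150.0, 1200)},
    "instance-b": {"start": (80.0, 500), "end": (98.0, 600)},
}


def mean_latency_seconds(data, window: float) -> float:
    """Compute avg(rate(sum)) / avg(rate(count)) across all series."""
    sum_rates = [(s["end"][0] - s["start"][0]) / window for s in data.values()]
    count_rates = [(s["end"][1] - s["start"][1]) / window for s in data.values()]
    avg_sum_rate = sum(sum_rates) / len(sum_rates)
    avg_count_rate = sum(count_rates) / len(count_rates)
    return avg_sum_rate / avg_count_rate


latency = mean_latency_seconds(snapshots, WINDOW_SECONDS)
print(f"mean latency: {latency * 1000:.1f} ms")
```

Here 48 seconds were spent on 300 requests, giving 160 ms. Because both `avg()` calls run over the same set of series, the series count cancels out and the result equals `sum(...) / sum(...)`. It is a request-weighted mean, not the mean of the per-instance averages (150 ms and 180 ms in this sample).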
---
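#### **4. Configure the Grafana dashboard**
##### **4.1 Add the Prometheus data source**
1. Open `http://grafana-server:3000`, log in, and go to **Configuration > Data Sources**.
2. Select **Prometheus** and enter the URL (`http://prometheus:9090`).
##### **4.2 Import a Spring Boot dashboard template**
1. Open the [Grafana dashboard catalog](https://grafana.com/grafana/dashboards/) and search for **"Spring Boot"**.
2. Pick a template (for example ID **11378**) and copy its ID.
3. In Grafana, go to **Create > Import** and enter the ID to complete the import.
##### **4.3 Custom panels for the core metrics**
###### **Success rate / failure rate (percentage gauge)**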
```promql
# Success rate (%)
(
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m])) by (application)
  /
  sum(rate(http_server_requests_seconds_count[5m])) by (application)
) * 100
# Failure rate (%)
(
  sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
  /
  sum(rate(http_server_requests_seconds_count[5m])) by (application)
) * 100
```
###### **Endpoint RTT (line chart)**
```promql
# Average response time (milliseconds)
avg(rate(http_server_requests_seconds_sum{uri="/api/core"}[5m])) by (uri)
/
avg(rate(http_server_requests_seconds_count{uri="/api/core"}[5m])) by (uri)
* 1000
```
---
#### **5. Alerting (optional)**
##### **5.1 Define alert rules in Prometheus**
```yaml
# alert.rules.yml
groups:
  - name: spring-boot-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
          /
          sum(rate(http_server_requests_seconds_count{}[5m])) by (application)
          > 0.05  # error rate above 5%
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected in {{ $labels.application }}"
```
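The `for: 5m` clause means the expression must stay above the threshold continuously before the alert fires; a short spike only puts it into the pending state. The Python sketch below is a simplified model of that behavior, not Prometheus code. The `alert_states` function and the once-per-minute evaluation data are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

THRESHOLD = 0.05   # same 5% threshold as the rule
FOR_SECONDS = 300  # same hold time as `for: 5m`


def alert_states(samples: Iterable[Tuple[int, float]]) -> List[Tuple[int, str]]:
    """Map (timestamp, error_ratio) evaluations to inactive/pending/firing."""
    states = []
    breach_started: Optional[int] = None  # first evaluation of the current breach
    for ts, ratio in samples:
        if ratio > THRESHOLD:
            if breach_started is None:
                breach_started = ts
            held_long_enough = ts - breach_started >= FOR_SECONDS
            state = "firing" if held_long_enough else "pending"
        else:
            breach_started = None  # any dip below the threshold resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical evaluations once per minute: a short spike, then a sustained breach.
evaluations = [(0, 0.01), (60, 0.08), (120, 0.02), (180, 0.09), (240, 0.07),
               (300, 0.06), (360, 0.10), (420, 0.08), (480, 0.07), (540, 0.01)]
states = alert_states(evaluations)
for ts, state in states:
    print(ts, state)
```

In this sample the spike at 60 s goes pending and then clears. The breach that starts at 180 s stays pending until 480 s, when it has lasted the full 300 seconds and switches to firing.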
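##### **5.2 Integrate Alertmanager**
Prometheus loads the rule file only if it is listed under `rule_files` in `prometheus.yml` (and mounted into the container), and it delivers alerts only if `alerting.alertmanagers` points at the Alertmanager address. Configure the notification channels (such as email or Slack) and start Alertmanager: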
```bash
docker run -d --name alertmanager \
-p 9093:9093 \
-v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager
```
---
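### **III. End Result**
- **Dashboard**: global request success rate, failure rate, core-endpoint RTT, JVM memory, CPU usage, and more.
- **Real-time alerts**: a notification fires when the error rate or latency exceeds its threshold (the rule above covers error rate; a latency rule follows the same pattern).
- **Historical trends**: Grafana shows how each metric has fluctuated over time.
---
### **IV. Troubleshooting**
| **Problem** | **Solution** |
|-------------------------|----------------------------------------------|
| Metrics not exposed | Check the `management.endpoints.web.exposure.include` setting |
| Prometheus scrape fails | Check the `targets` address and network connectivity |
| No data in Grafana | Confirm the data source is configured correctly and the PromQL has no syntax errors |
| Abnormal RTT values | Check that the `uri` label matches the core endpoint path |

With the steps above you can quickly set up end-to-end metrics monitoring (collection, storage, dashboards, and alerting) for a Spring Boot service.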
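
When a panel is empty or the RTT looks wrong, it can help to check which `uri` label values the application actually exports. The Python sketch below parses a pasted sample of `/actuator/prometheus` output and totals the request counts per `uri`. The sample text and the `request_counts_by_uri` helper are hypothetical, and the parser is deliberately simplified: it ignores optional timestamps and escaped quotes in label values.

```python
import re
from collections import defaultdict
from typing import Dict

# Greedy match on the label block so that "}" inside values such as
# uri="/api/users/{id}" does not end the match early.
SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output, shaped like the Prometheus text format.
SCRAPE_TEXT = """
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{application="demo",method="GET",status="200",uri="/api/core",} 42.0
http_server_requests_seconds_sum{application="demo",method="GET",status="200",uri="/api/core",} 3.5
http_server_requests_seconds_count{application="demo",method="GET",status="500",uri="/api/core",} 2.0
http_server_requests_seconds_count{application="demo",method="GET",status="200",uri="/api/users/{id}",} 7.0
"""


def request_counts_by_uri(text: str, metric: str = "http_server_requests_seconds_count") -> Dict[str, float]:
    """Sum the sample values of `metric` per uri label value."""
    counts: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        counts[labels.get("uri", "")] += float(match.group("value"))
    return dict(counts)


counts = request_counts_by_uri(SCRAPE_TEXT)
for uri, total in sorted(counts.items()):
    print(uri, total)
```

If the path used in a query (for example `/api/core`) does not appear among the keys, the `uri` matcher in the PromQL is the first thing to check. Templated paths show up with their placeholders, such as `/api/users/{id}`.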