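The following is a complete approach to monitoring the **request success rate, failure rate, and core-endpoint RTT (round-trip time)** of a Spring Boot service. It uses an **open-source stack (Prometheus + Grafana)** for metric collection, storage, and visualization:

---
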
### **I. Monitoring Architecture**

```mermaid
graph TD
    A[Spring Boot application] -->|exposes metrics| B(Prometheus)
    B -->|storage / queries| C(Grafana)
    C -->|displays| D[Monitoring dashboard]
```

---

### **II. Steps**

#### **1. Integrate the metrics library (Micrometer)**

Spring Boot natively supports **Micrometer** as its metrics library. Add the dependencies below and configure the application to expose metrics in Prometheus format.

##### **1.1 Add dependencies (pom.xml)**

```xml
<!-- Spring Boot Actuator (core metrics) -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- Exposes metrics in Prometheus format -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

##### **1.2 Configure metric exposure (application.yml)**

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus  # expose the Prometheus endpoint
  metrics:
    tags:
      application: ${spring.application.name}  # tag every metric with the application name
```
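
Once the application restarts with this configuration, `/actuator/prometheus` should return plain-text samples. As a rough illustration of what to look for, the Python sketch below parses a few lines in the Prometheus text exposition format and sums the request counter per `status` label. The sample payload, the `parse_samples` and `requests_by_status` helpers, and the label values are invented for the example; real output contains many more series.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of what GET /actuator/prometheus might return.
SAMPLE = '''\
# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{application="demo",method="GET",status="200",uri="/api/core"} 42.0
http_server_requests_seconds_sum{application="demo",method="GET",status="200",uri="/api/core"} 3.5
http_server_requests_seconds_count{application="demo",method="GET",status="500",uri="/api/core"} 3.0
'''

# name{labels} value [timestamp]
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) per sample line; comments are skipped."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group('labels') or ''))
        yield match.group('name'), labels, float(match.group('value'))


def requests_by_status(text):
    """Sum http_server_requests_seconds_count across all series, grouped by status."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == 'http_server_requests_seconds_count':
            totals[labels.get('status', 'unknown')] += value
    return dict(totals)


print(requests_by_status(SAMPLE))  # {'200': 42.0, '500': 3.0}
```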

---

#### **2. Deploy Prometheus to scrape the metrics**

##### **2.1 Prometheus configuration (prometheus.yml)**

```yaml
scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['your-spring-boot-app:8080']  # application address
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__metrics_path__]
        target_label: metrics_path
```

##### **2.2 Start Prometheus**

```bash
docker run -d --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```
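
After Prometheus starts, it is worth confirming that the target is actually being scraped. Prometheus reports this on its targets page and through its HTTP API at `/api/v1/targets`. The Python sketch below lists unhealthy targets from such a response; the response body shown is a trimmed, made-up example (a real one carries more fields), and `unhealthy_targets` is a hypothetical helper.

```python
import json

# Hypothetical, trimmed response body of GET http://localhost:9090/api/v1/targets
RESPONSE = '''
{"status": "success",
 "data": {"activeTargets": [
   {"labels": {"job": "spring-boot-apps", "instance": "your-spring-boot-app:8080"},
    "scrapeUrl": "http://your-spring-boot-app:8080/actuator/prometheus",
    "health": "down",
    "lastError": "connection refused"}
 ]}}
'''


def unhealthy_targets(body):
    """Return (scrapeUrl, lastError) for every active target whose health is not 'up'."""
    data = json.loads(body)["data"]
    return [
        (target["scrapeUrl"], target.get("lastError", ""))
        for target in data.get("activeTargets", [])
        if target.get("health") != "up"
    ]


for url, error in unhealthy_targets(RESPONSE):
    print(f"DOWN {url}: {error}")
```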

---

#### **3. Collecting the core metrics**

##### **3.1 HTTP request success rate / failure rate**

Micrometer automatically collects `http_server_requests_seconds_count` and `http_server_requests_seconds_sum`. Success and failure are told apart by the `status` label (the HTTP status code). The queries below treat only 5xx responses as failures, so 4xx responses count as successes.

###### **Success rate formula (PromQL)**

```promql
sum(rate(http_server_requests_seconds_count{application="your-app", status!~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="your-app"}[5m]))
```

###### **Failure rate formula (PromQL)**

```promql
sum(rate(http_server_requests_seconds_count{application="your-app", status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="your-app"}[5m]))
```
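
To make the arithmetic behind these two queries concrete, here is a small Python sketch that mimics `rate()` on two counter samples taken five minutes apart and then forms the same ratios. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the window boundaries. The counter values and the `simple_rate` helper are invented for the example.

```python
WINDOW_SECONDS = 300  # corresponds to the [5m] range in the queries


def simple_rate(first, last, window=WINDOW_SECONDS):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / window


# Hypothetical counter values per status at the start and end of the window.
samples = {
    "200": (10_000, 10_570),
    "404": (200, 215),
    "500": (50, 65),
}

rates = {status: simple_rate(first, last) for status, (first, last) in samples.items()}
total = sum(rates.values())
failed = sum(rate for status, rate in rates.items() if status.startswith("5"))

failure_rate = failed / total
success_rate = (total - failed) / total

print(f"success={success_rate:.3f} failure={failure_rate:.3f}")
```

Note that the 404 responses land on the success side here, just as they do with `status!~"5.."`. If client errors should count as failures, a matcher such as `status=~"[45].."` is one option.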

##### **3.2 Core-endpoint RTT (average response time)**

The result is in seconds.

```promql
sum(rate(http_server_requests_seconds_sum{application="your-app", uri="/api/core"}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="your-app", uri="/api/core"}[5m]))
```
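
The query divides the growth of the total request time by the growth of the request count, which gives the mean latency over the window. A minimal Python sketch of that division follows, with invented sample values; names such as `sum_start` and `count_end` are illustrative, not Micrometer output.

```python
# Hypothetical values of the two counters for uri="/api/core",
# sampled at the start and end of a 5-minute window.
sum_start, sum_end = 120.0, 150.0      # http_server_requests_seconds_sum (seconds)
count_start, count_end = 1_000, 1_200  # http_server_requests_seconds_count

window = 300  # seconds, matching [5m]

rate_sum = (sum_end - sum_start) / window        # seconds of latency accrued per second
rate_count = (count_end - count_start) / window  # requests per second

avg_latency_seconds = rate_sum / rate_count
# 30 s of request time spread over 200 requests is 0.15 s per request.
print(f"{avg_latency_seconds * 1000:.0f} ms")
```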

---

#### **4. Configure the Grafana dashboard**

##### **4.1 Add the Prometheus data source**

1. Open `http://grafana-server:3000`, log in, and go to **Configuration > Data Sources** (menu names vary between Grafana versions).
2. Select **Prometheus** and fill in the URL (for example `http://prometheus:9090`).

##### **4.2 Import a Spring Boot dashboard template**

1. Open the [Grafana dashboard marketplace](https://grafana.com/grafana/dashboards/) and search for **"Spring Boot"**.
2. Pick a template (for example ID: **11378**) and copy its ID.
3. In Grafana, go to **Create > Import** and enter the ID to complete the import.

##### **4.3 Custom panels for the core metrics**

###### **Success rate / failure rate (percentage gauge)**

```promql
# Success rate (%)
(
  sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m])) by (application)
  /
  sum(rate(http_server_requests_seconds_count[5m])) by (application)
) * 100

# Failure rate (%)
(
  sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
  /
  sum(rate(http_server_requests_seconds_count[5m])) by (application)
) * 100
```

###### **Endpoint RTT (line chart)**

```promql
# Average response time (milliseconds)
sum(rate(http_server_requests_seconds_sum{uri="/api/core"}[5m])) by (uri)
/
sum(rate(http_server_requests_seconds_count{uri="/api/core"}[5m])) by (uri)
* 1000
```

---

#### **5. Alerting (optional)**

##### **5.1 Define alert rules in Prometheus**

Prometheus loads this file only if it is listed under `rule_files` in `prometheus.yml` and is readable inside the container.

```yaml
# alert.rules.yml
groups:
  - name: spring-boot-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
          /
          sum(rate(http_server_requests_seconds_count[5m])) by (application)
          > 0.05  # error rate above 5%
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected in {{ $labels.application }}"
```
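
The `for: 5m` clause means the alert fires only after the expression has been true at every evaluation for five minutes; until then it is merely pending. The Python sketch below models that behavior in a simplified way. The one-minute evaluation interval, the error-rate series, and the `alert_states` helper are assumptions for illustration, not Prometheus internals.

```python
THRESHOLD = 0.05    # matches "> 0.05" in the rule
FOR_SECONDS = 300   # matches "for: 5m"
EVAL_INTERVAL = 60  # assumed rule evaluation interval, in seconds


def alert_states(error_rates):
    """Return 'inactive', 'pending' or 'firing' for each evaluation."""
    states = []
    active_since = None  # index of the evaluation at which the condition became true
    for i, rate in enumerate(error_rates):
        if rate > THRESHOLD:
            if active_since is None:
                active_since = i
            held_for = (i - active_since) * EVAL_INTERVAL
            states.append("firing" if held_for >= FOR_SECONDS else "pending")
        else:
            active_since = None  # condition broke, so the timer starts over
            states.append("inactive")
    return states


# Hypothetical error rates, one per evaluation.
rates = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.08, 0.09, 0.10, 0.06]
for rate, state in zip(rates, alert_states(rates)):
    print(f"{rate:.2f} -> {state}")
```

The short spike at the second and third evaluations never fires. That is the purpose of `for`: it filters out brief blips at the cost of a delayed notification.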

##### **5.2 Integrate Alertmanager**

Configure the notification channels (such as email or Slack) and start Alertmanager:

```bash
docker run -d --name alertmanager \
  -p 9093:9093 \
  -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager
```

---

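### **III. End Result**

- **Monitoring dashboard**: shows the global request success rate, failure rate, core-endpoint RTT, JVM memory, CPU usage, and more.
- **Real-time alerts**: a notification is triggered when the error rate or latency exceeds its threshold.
- **Historical trends**: Grafana shows how each metric has fluctuated over time.

---
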
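### **IV. Troubleshooting Common Problems**

| **Problem** | **Solution** |
|-------------------------|----------------------------------------------|
| Metrics not exposed | Check the `management.endpoints.web.exposure.include` setting |
| Prometheus scrape fails | Check the `targets` address and network connectivity |
| No data in Grafana | Confirm the data source is configured correctly and the PromQL has no syntax errors |
| Abnormal RTT values | Check that the `uri` label matches the core endpoint path |

With the steps above, you can quickly set up end-to-end metrics monitoring for a Spring Boot service.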