Spring Boot project: collecting custom Prometheus metrics and querying them in Grafana (method latency quantile metrics)

Author

  1. @author JellyfishMIX - github / blog.jellyfishmix.com
  2. LICENSE LICENSE-2.0

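Notes

  1. There are already many guides online for configuring Prometheus and Grafana, so this article does not repeat them and covers only the custom parts.
  2. So far it covers only the collection and querying of quantile metrics, which are commonly used to monitor method latency.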

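Collecting custom metrics

With only the following dependencies, /actuator/prometheus shows the metrics that Spring Boot Actuator registers, but not the custom metrics registered with the Prometheus Java client as described later in this article.

            <!-- spring-boot-actuator 依赖 -->
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-actuator</artifactId>
                <version>2.7.18</version>
            </dependency>
            <!-- Prometheus registry dependency. Its version has to match the Spring Boot version: this article pairs Spring Boot 2.7 with 1.10.x. When upgrading or downgrading Spring Boot, raise or lower this version by 0.1.0 accordingly. -->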
            <dependency>
                <groupId>io.micrometer</groupId>
                <artifactId>micrometer-registry-prometheus</artifactId>
                <version>1.10.6</version>
            </dependency>

application.properties configuration

Adjust the values as needed.

spring.application.name=spring-boot-explore
server.port=8083
server.servlet.context-path=/explore
# ip:port/actuator/prometheus
management.server.port=9051
management.endpoints.web.exposure.include=*
management.metrics.tags.application=${spring.application.name}

Collecting custom metrics requires additional dependencies:

            <!-- Dependencies for custom Prometheus metrics -->
            <dependency>
                <groupId>io.prometheus</groupId>
                <artifactId>simpleclient</artifactId>
                <version>0.16.0</version>
            </dependency>
            <dependency>
                <groupId>io.prometheus</groupId>
                <artifactId>simpleclient_hotspot</artifactId>
                <version>0.16.0</version>
            </dependency>
            <dependency>
                <groupId>io.prometheus</groupId>
                <artifactId>simpleclient_servlet</artifactId>
                <version>0.16.0</version>
            </dependency>

Metrics collection endpoint

By Prometheus convention, the client has to expose an HTTP endpoint from which the custom metrics can be scraped.

import io.prometheus.client.exporter.MetricsServlet;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

/**
 * @author jellyfishmix
 * @date 2024/9/1 08:03
 */
@Controller
@RequestMapping("/prometheus")
public class PrometheusExportController extends MetricsServlet {

    @RequestMapping("/exportMetric")
    @ResponseBody
    public void exportMetric(HttpServletRequest request, HttpServletResponse response) throws IOException {
        this.doGet(request, response);
    }
}

The exposed endpoint for custom metrics looks like this; the path is whatever you configured:

image-20240901103532161

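Quantile metrics

  1. Of the four Prometheus metric types, summary is the recommended choice unless the scenario is especially performance-sensitive. For details, see:
    1. A simple explanation of summary and histogram metrics https://blog.csdn.net/wtan825/article/details/94616813
    2. Introduction to the four Prometheus metric types https://prometheus.wang/promql/prometheus-metrics-types.html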

Monitoring method latency with a summary

import com.google.common.base.Stopwatch;
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Summary;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/**
 * @author jellyfishmix
 * @date 2024/1/3 23:18
 */
@RequestMapping("/test")
@Controller
public class TestController {
    private static final CollectorRegistry DEFAULT_PROMETHEUS_REGISTRY = CollectorRegistry.defaultRegistry;
    private static final Summary DEMO_SUMMARY = Summary.build()
            .name("TestController_compute_summary_demo")
            .help("demo of summary")
            .labelNames("labelName1", "labelNameB")
            .quantile(0.5, 0.01)
            .quantile(0.90, 0.01)
            .quantile(0.99, 0.01)
            .register(DEFAULT_PROMETHEUS_REGISTRY);

    @RequestMapping("/saySummary")
    @ResponseBody
    public String saySummary() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        simulateInterfaceCall();
        var costMillis = stopwatch.elapsed().toMillis();
        DEMO_SUMMARY.labels("abc", "123").observe(costMillis);
        return "hello summary";
    }

    private static void simulateInterfaceCall() {
        // Simulate an interface call with a random latency of 100 to 999 ms
        int randomDelay = ThreadLocalRandom.current().nextInt(100, 1000);
        try {
            TimeUnit.MILLISECONDS.sleep(randomDelay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
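
The timing logic in saySummary() can also be factored into a small helper. The sketch below is only an illustration built on the JDK: LatencyRecorder and timed are hypothetical names, not part of the Prometheus client, and the DoubleConsumer stands in for something like DEMO_SUMMARY.labels("abc", "123")::observe. One possible advantage of this shape is that the finally block reports the latency even when the measured call throws.

```java
import java.util.function.DoubleConsumer;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of the Prometheus client. */
final class LatencyRecorder {
    private LatencyRecorder() {}

    /**
     * Runs the task and reports its elapsed time in milliseconds to the observer.
     * The finally block reports the latency even when the task throws.
     */
    static <T> T timed(Supplier<T> task, DoubleConsumer observer) {
        long startNanos = System.nanoTime();
        try {
            return task.get();
        } finally {
            observer.accept((System.nanoTime() - startNanos) / 1_000_000.0);
        }
    }

    public static void main(String[] args) {
        // In the controller above, the observer could be
        // DEMO_SUMMARY.labels("abc", "123")::observe (assumption for illustration).
        String result = timed(() -> "hello summary",
                millis -> System.out.println("observed millis: " + millis));
        System.out.println(result);
    }
}
```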
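
The quantile method

  1. This section explains the Summary.build().quantile() method.
  2. For the 0.5 quantile with an allowed error of 0.01, the reported value may be anything between the 0.49 quantile and the 0.51 quantile. A summary keeps samples on the client side, so the larger the allowed error, the fewer samples it has to retain and the more memory it saves.
  3. The other quantiles work the same way.

// 0.5 quantile, allowed error 0.01
.quantile(0.5, 0.01)
// 0.90 quantile, allowed error 0.01
.quantile(0.90, 0.01)
// 0.99 quantile, allowed error 0.01
.quantile(0.99, 0.01)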
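
To make the allowed error concrete, the following sketch computes the band of values that .quantile(0.5, 0.01) would be permitted to report for a known data set. It works on an exact, fully sorted sample and therefore does not reproduce the client's streaming estimator; QuantileErrorBand, exactQuantile and band are hypothetical names, and the 1..1000 ms latencies are made-up data.

```java
import java.util.Arrays;

/** Illustration only: shows the range [q - error, q + error] on exact data. */
final class QuantileErrorBand {
    private QuantileErrorBand() {}

    /** Exact q-quantile of sorted data using the nearest-rank method. */
    static double exactQuantile(double[] sorted, double q) {
        // The small epsilon guards against floating-point noise such as 490.00000000000006.
        int rank = (int) Math.ceil(q * sorted.length - 1e-9);
        int index = Math.min(Math.max(rank, 1), sorted.length) - 1;
        return sorted[index];
    }

    /** Lower and upper values that a summary with quantile(q, error) may report. */
    static double[] band(double[] observations, double q, double error) {
        double[] sorted = observations.clone();
        Arrays.sort(sorted);
        return new double[] {
            exactQuantile(sorted, Math.max(0.0, q - error)),
            exactQuantile(sorted, Math.min(1.0, q + error))
        };
    }

    public static void main(String[] args) {
        double[] latencies = new double[1000];
        for (int i = 0; i < latencies.length; i++) {
            latencies[i] = i + 1; // made-up latencies: 1, 2, ..., 1000 ms
        }
        double[] b = band(latencies, 0.5, 0.01);
        System.out.println(b[0] + " .. " + b[1]); // prints "490.0 .. 510.0"
    }
}
```

For this data set, any reported median between 490 ms and 510 ms would be within the configured error.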

The quantile method is documented in detail in the class Javadoc of io.prometheus.client.Summary. An excerpt:

The Summary class provides different utility methods for observing values, like observe(double), startTimer() and Summary.Timer.observeDuration(), time(Callable), etc.
By default, Summary metrics provide the count and the sum. For example, if you measure latencies of a REST service, the count will tell you how often the REST service was called, and the sum will tell you the total aggregated response time. You can calculate the average response time using a Prometheus query dividing sum / count.
In addition to count and sum, you can configure a Summary to provide quantiles:
  Summary requestLatency = Summary.build()
      .name("requests_latency_seconds")
      .help("Request latency in seconds.")
      .quantile(0.5, 0.01)    // 0.5 quantile (median) with 0.01 allowed error
      .quantile(0.95, 0.005)  // 0.95 quantile with 0.005 allowed error
      // ...
      .register();
  
As an example, a 0.95 quantile of 120ms tells you that 95% of the calls were faster than 120ms, and 5% of the calls were slower than 120ms.
Tracking exact quantiles require a large amount of memory, because all observations need to be stored in a sorted list. Therefore, we allow an error to significantly reduce memory usage.
In the example, the allowed error of 0.005 means that you will not get the exact 0.95 quantile, but anything between the 0.945 quantile and the 0.955 quantile.
Experiments show that the Summary typically needs to keep less than 100 samples to provide that precision, even if you have hundreds of millions of observations.
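
The labelNames method

Summary.build().labelNames() declares the label names of the metric. The call below gives this metric two labels, named labelName1 and labelNameB:

.labelNames("labelName1", "labelNameB")

If label names were declared with Summary.build().labelNames(), calling summary.observe() directly throws a NullPointerException, because observe() delegates to noLabelsChild, which is only available for a metric without labels: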

  // Convenience methods.
  /**
   * Observe the given amount on the summary with no labels.
   * @param amt in most cases amt should be &gt;= 0. Negative values are supported, but you should read
   *            <a href="https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations">
   *            https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations</a> for
   *            implications and alternatives.
   */
  public void observe(double amt) {
    noLabelsChild.observe(amt);
  }

Instead, call summary.labels("abc", "123").observe(). The values passed to labels() are the values for the label names declared when the summary was built, in the same order.

    @RequestMapping("/saySummary")
    @ResponseBody
    public String saySummary() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        simulateInterfaceCall();
        var costMillis = stopwatch.elapsed().toMillis();
        DEMO_SUMMARY.labels("abc", "123").observe(costMillis);
        return "hello summary";
    }
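
The NullPointerException described above can be reproduced with a toy model. The sketch below is a deliberately simplified stand-in written for this article, not the real io.prometheus.client code: ToySummary and its Child class are hypothetical, and the assumption it encodes is that the no-labels child is only created when no label names were declared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a labelled metric; a simplification, not the real client code. */
final class ToySummary {
    /** Holds the observations for one combination of label values. */
    static final class Child {
        final List<Double> values = new ArrayList<>();

        void observe(double amt) {
            values.add(amt);
        }
    }

    private final List<String> labelNames;
    private final Map<List<String>, Child> children = new ConcurrentHashMap<>();
    private final Child noLabelsChild; // stays null when label names were declared

    ToySummary(String... labelNames) {
        this.labelNames = Arrays.asList(labelNames);
        this.noLabelsChild = this.labelNames.isEmpty() ? labels() : null;
    }

    /** Returns the child for the given label values, creating it on first use. */
    Child labels(String... labelValues) {
        if (labelValues.length != labelNames.size()) {
            throw new IllegalArgumentException("Incorrect number of labels.");
        }
        return children.computeIfAbsent(Arrays.asList(labelValues), key -> new Child());
    }

    /** Convenience method that only works for a metric without labels. */
    void observe(double amt) {
        noLabelsChild.observe(amt); // NullPointerException if label names were declared
    }

    public static void main(String[] args) {
        ToySummary labelled = new ToySummary("labelName1", "labelNameB");
        labelled.labels("abc", "123").observe(250.0); // works
        try {
            labelled.observe(250.0);
        } catch (NullPointerException e) {
            System.out.println("observe() without labels failed");
        }
    }
}
```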

Example of the resulting summary quantile metrics

image-20240901103720431

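Grafana view

An example Grafana query is shown below. The correct way to query a quantile, circled in red in the screenshot, is to add the label filter quantile = 0.5 in the metric section, using one of the quantiles configured on the client side. In PromQL this corresponds to a selector such as TestController_compute_summary_demo{quantile="0.5"}.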

Screenshot 2024-09-01 at 11.41.23

Incorrect quantile query: adding quantile under operations is wrong. As the screenshot shows, the value computed through operations differs greatly from the true value.

Screenshot 2024-09-01 at 11.48.24
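
One likely reason for the gap can be sketched in plain Java. The example assumes that the quantile operation generates a PromQL quantile aggregation, which computes a quantile across the current values of all matched series rather than selecting the series the client already calculated. QuantileQueryDemo is a hypothetical name and the three series values are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: label filter versus quantile aggregation over series values. */
final class QuantileQueryDemo {
    private QuantileQueryDemo() {}

    /** Quantile across series values, with linear interpolation between ranks. */
    static double aggregateQuantile(double phi, double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double rank = phi * (sorted.length - 1);
        int lower = (int) Math.floor(rank);
        int upper = Math.min(sorted.length - 1, lower + 1);
        double weight = rank - lower;
        return sorted[lower] * (1 - weight) + sorted[upper] * weight;
    }

    public static void main(String[] args) {
        // Made-up values of the three series exported by the summary, keyed by quantile label.
        Map<String, Double> series = new LinkedHashMap<>();
        series.put("0.5", 300.0);
        series.put("0.9", 800.0);
        series.put("0.99", 950.0);

        // Label filter quantile="0.5": picks the client-side median directly.
        System.out.println(series.get("0.5")); // prints 300.0

        // Aggregation over all three series: the median of {300, 800, 950}.
        double[] all = series.values().stream().mapToDouble(Double::doubleValue).toArray();
        System.out.println(aggregateQuantile(0.5, all)); // prints 800.0
    }
}
```

Under these assumptions the aggregation returns the 0.9 series value instead of the median latency, which matches the kind of discrepancy visible in the screenshot.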

