Introduction to Prometheus

Taking time off need not be scary: something is keeping watch over production.

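The state of monitoring

As the number of servers and services grows, the monitoring system faces the same scaling problem. It becomes especially important for monitoring to serve day-to-day operations well, to help prevent problems, and to locate and resolve them quickly when they occur.

The monitoring approaches in common use today are active monitoring, passive monitoring, and out-of-band (bypass) monitoring, delivered by tools such as Nagios, Cacti, and Zabbix as well as third-party monitoring platforms.

Monitoring should be designed from the perspective of the company's business rather than around a particular technique or a single service. To help resolve problems, it must also drill down to servers, services, containers, and even processes, so that problems are discovered, located, and resolved from the standpoint of the service being provided.

Pain points for users: complex configuration, little room for customization, tight coupling between components, raw data that is hard to read, weak visualization, and incomplete notification messages.

Building a monitoring system depends on adding functional modules over time: new monitoring modules, categories, and parameters are added, then filtered and consolidated until the output is reasonable.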

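Overview

Prometheus is an open-source systems monitoring and alerting toolkit.

It joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project, after Kubernetes.

Prometheus is built around a time series database: put simply, data is tagged with labels and stored along the time dimension.

GitHub repository

Official website

Download the release for your operating system and run it directly with ./prometheus; the web UI is then available at localhost:9090.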

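Features

A multi-dimensional data model (time series are identified by a metric name and a set of key/value pairs)
A flexible query language (PromQL) that takes advantage of this dimensionality
No reliance on distributed storage; each server node works autonomously
Time series are collected with a pull model over HTTP
Pushing time series is supported through an intermediary gateway
Targets are found through service discovery or static configuration
Multiple modes of graphing and dashboard support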
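
To make the pull model concrete, here is a minimal sketch of the target side, assuming only the JDK's built-in HTTP server rather than a Prometheus client library. The class name, the metric name demo_scrapes_total, and the use of an ephemeral port are illustrative assumptions; main() plays the role of the scraper by issuing a single HTTP GET.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

final class PullEndpointSketch {
    // Hypothetical metric: how many scrapes this process has served.
    static final AtomicLong SCRAPES = new AtomicLong();

    // Render a single counter in the Prometheus text exposition format.
    static String render(String name, String help, long value) {
        return "# HELP " + name + " " + help + "\n"
                + "# TYPE " + name + " counter\n"
                + name + " " + value + "\n";
    }

    // Start an HTTP server that answers GET /metrics; port 0 picks a free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render("demo_scrapes_total", "Scrapes served.", SCRAPES.incrementAndGet())
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0);
        URI target = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/metrics");
        // The scraper side: one HTTP GET, which is what Prometheus repeats every scrape interval.
        try (InputStream in = target.toURL().openStream()) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            server.stop(0);
        }
    }
}
```

The target never needs to know where Prometheus lives; it only has to answer the request, which is why adding a target is purely a matter of Prometheus configuration.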

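How it works

At every scrape interval, Prometheus collects data from the running services and stores it in a time series database, which can then be queried with PromQL.
Because the data is stored as time series, when a problem occurs it can be diagnosed by looking at what happened in the relevant time intervals, and the history can also be used to forecast long-term trends in the infrastructure.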

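Components

Prometheus server
- scrapes and stores time series data
Client libraries
- used to instrument application or exporter code (Go, Java, Python, Ruby)
Pushgateway
- a gateway that supports short-lived jobs
Visualization dashboards
- two options, PromDash and Grafana; Grafana is the mainstream choice today
Alertmanager
- a separate component that aggregates, routes, and silences alerts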

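Configuration

Official documentation

# General global settings
global:
  scrape_interval:     15s # Scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Rule files to load; the rules in them are evaluated once every evaluation_interval.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# The targets Prometheus should monitor.
# Targets are grouped into jobs, and a job can find its targets in several ways. The simplest is static_configs, which lists each target statically.
scrape_configs:
  # Job name; must be unique.
  - job_name: 'prometheus'

    # Addresses of the targets to scrape
    static_configs:
    - targets: ['localhost:9090']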

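Alerting

When working with a time series database, we want to process the data and react to the results; alerting is what does this.

Alerting is very common in Grafana, and Prometheus provides a complete alerting system of its own through Alertmanager. Alertmanager is a standalone tool that can be attached to Prometheus and run with a custom configuration. Alerts are defined in configuration files as groups of rules written against metrics; when the data matches a rule, an alert fires and is sent to predefined receivers. As with Grafana, Prometheus alerts can be delivered through email, Slack webhooks, PagerDuty, and custom HTTP targets.

Alerting rules are defined in a separate file that is referenced from prometheus.yml, in the following format: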

$ cat first_rules.yml
groups:
- name: rule1-http_requests_total
  rules:
  - alert: HTTP_REQUEST_TOTAL
    expr: http_requests_total > 100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: Http request total reach limit
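
The effect of the for: 1m clause in the rule above can be sketched as a small state machine. This is a simplified model under stated assumptions, not Prometheus source code: the class name, the State enum, and the use of plain seconds for time are all illustrative. The idea is that the expression must stay true across evaluations for the whole hold duration before the alert moves from pending to firing.

```java
final class AlertForSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold; // the alert condition is: value > threshold
    private final long forSeconds;  // the rule's "for" duration
    private long activeSince = -1;  // time at which the condition first held, -1 if it does not hold

    AlertForSketch(double threshold, long forSeconds) {
        this.threshold = threshold;
        this.forSeconds = forSeconds;
    }

    // One rule evaluation at time nowSeconds with the current sample value.
    State evaluate(long nowSeconds, double value) {
        if (value <= threshold) {
            activeSince = -1; // condition no longer holds: the hold timer resets
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowSeconds;
        }
        return nowSeconds - activeSince >= forSeconds ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertForSketch alert = new AlertForSketch(100, 60); // expr: > 100, for: 1m
        long[] times = {0, 15, 30, 45, 60, 75};
        double[] values = {150, 150, 150, 150, 150, 90};
        for (int i = 0; i < times.length; i++) {
            System.out.println("t=" + times[i] + "s " + alert.evaluate(times[i], values[i]));
        }
    }
}
```

With a 15 second evaluation interval, a value that stays above 100 is pending for the first four evaluations and fires on the fifth; a single evaluation below the threshold sends it back to inactive.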

Note that the Alertmanager address must also be configured in prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

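Exporters

Exporters are responsible for collecting metrics. For custom applications, instrumentation is straightforward: you choose which metrics to expose and how they change over time.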

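Types

node_exporter
- collects machine performance metrics, including basic CPU, memory, disk, and I/O information.
mysqld_exporter
- collects metrics from MySQL database servers.
redis_exporter
- collects metrics from Redis servers.
blackbox_exporter
- the official black-box monitoring solution from the Prometheus community; it probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
cAdvisor
- Google's open-source tool for monitoring running containers.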

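Common templates:

Database templates
- configurations for MongoDB, SQL Server, and MySQL servers
HTTP templates
- configurations for web servers and proxies such as HAProxy, Apache, and Nginx
Unix templates
- monitoring with the node exporter, which covers a full set of system metrics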

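Data model

Prometheus fundamentally stores all data as time series. A metric name together with a set of one or more labels identifies one time series, and a different set of labels identifies a different time series. To support some queries, temporary time series are sometimes generated as well.

Every time series is uniquely identified by its metric name and a set of labels (key=value pairs).
The metric name usually names the thing being measured, for example http_requests_total. It follows naming rules: it may contain letters, digits, underscores, and colons.

Labels distinguish the different dimensions of a time series, for example whether an HTTP request used POST or GET, or which endpoint it hit. The resulting identifier looks like this:

http_requests_total{method="POST",endpoint="/api/tracks"}

Remember that for the metric name http_requests_total, adding or removing a label always creates a new time series.
Queries can then select and aggregate results by combinations of these labels.

In relational database terms, you can think of http_requests_total as the table name, the labels as columns, the timestamp as the primary key, and the value as one more float64 column (Prometheus stores all sample values as float64).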
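
The identity rule described above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions, not how Prometheus stores data on disk: the class SeriesStoreSketch and its methods are hypothetical names, and a series is keyed by its metric name plus its labels sorted by label name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class SeriesStoreSketch {
    // series identity -> (timestamp in ms -> value); every value is a double (float64)
    private final Map<String, TreeMap<Long, Double>> series = new HashMap<>();

    // Canonical identity: metric name plus labels sorted by label name.
    static String seriesId(String metric, Map<String, String> labels) {
        String body = new TreeMap<String, String>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return metric + "{" + body + "}";
    }

    // Append one sample to the series identified by metric name and labels.
    void append(String metric, Map<String, String> labels, long timestampMs, double value) {
        series.computeIfAbsent(seriesId(metric, labels), id -> new TreeMap<>())
                .put(timestampMs, value);
    }

    int seriesCount() {
        return series.size();
    }

    public static void main(String[] args) {
        SeriesStoreSketch store = new SeriesStoreSketch();
        // Same labels in a different order: still the same series.
        store.append("http_requests_total", Map.of("method", "POST", "endpoint", "/api/tracks"), 1000L, 1.0);
        store.append("http_requests_total", Map.of("endpoint", "/api/tracks", "method", "POST"), 2000L, 2.0);
        // A different label value: a new series.
        store.append("http_requests_total", Map.of("method", "GET", "endpoint", "/api/tracks"), 2000L, 5.0);
        System.out.println(store.seriesCount());
    }
}
```

Label order does not matter, but any change to the label set (a new label, a removed label, a different value) produces a different key and therefore a separate series.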

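Naming best practices

Metric types

Counter

A counter only ever increases. Examples are the total number of requests an application has served (http_requests_total) and the CPU time a process has used (process_cpu_seconds_total).

A Counter can only be incremented: inc() adds 1 (inc(amount) adds a non-negative amount), and there is no method to decrease it.

By convention, the names of Counter metrics end in _total.

Counter.build() creates a Counter metric: name() sets the metric name, and labelNames() declares the label dimensions the metric carries. In the preHandle method we read the request path, the method, and the status code of the current request, and call inc() so that the count goes up by 1 on every request.

Counter.build()...register() registers the metric with the default CollectorRegistry, so that its current state is returned when the /metrics endpoint is accessed.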

With the metric io_namespace_http_requests_total we can:

Query the application's total request count
- sum(io_namespace_http_requests_total)
Query the number of HTTP requests per second
- sum(rate(io_namespace_http_requests_total[5m]))
Query the top N URIs by request count
- topk(10, sum(io_namespace_http_requests_total) by (path))
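
What rate() does with a counter can be approximated in a few lines. The sketch below is deliberately simplified and is not the PromQL implementation: it assumes the samples are already restricted to the query window and sorted by strictly increasing timestamp, it treats any drop in value as a counter reset, and it skips the extrapolation to the window boundaries that the real function performs. The class and method names are hypothetical.

```java
final class RateSketch {
    // samples: {timestampSeconds, counterValue} pairs in time order, all inside the window.
    // Returns the average per-second increase, or NaN when fewer than two samples exist.
    static double rate(double[][] samples) {
        if (samples.length < 2) {
            return Double.NaN;
        }
        double increase = 0;
        for (int i = 1; i < samples.length; i++) {
            double previous = samples[i - 1][1];
            double current = samples[i][1];
            // A drop means the counter restarted from zero (for example after a process restart),
            // so the whole current value counts as increase.
            increase += current >= previous ? current - previous : current;
        }
        double seconds = samples[samples.length - 1][0] - samples[0][0];
        return increase / seconds;
    }

    public static void main(String[] args) {
        double[][] steady = {{0, 100}, {60, 160}, {120, 220}};
        double[][] withReset = {{0, 100}, {60, 160}, {120, 20}};
        System.out.println(rate(steady));
        System.out.println(rate(withReset));
    }
}
```

This is why a Counter is the right type for request totals: because the value never decreases on its own, a decrease can safely be read as a restart rather than as negative traffic.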

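Gauge

A gauge can go up and down and reflects the current state of something. When monitoring a host, for example: the amount of memory currently free (node_memory_MemFree) or available (node_memory_MemAvailable), or a container's current CPU or memory usage.

A Gauge object has two main methods, inc() and dec(), which increase and decrease the value. Here we use a Gauge to record the number of HTTP requests currently being processed.

Query the number of HTTP requests the application is currently processing:
- io_namespace_http_inprogress_requests{}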

static final Gauge inprogressRequests = Gauge.build()
            .name("io_namespace_http_inprogress_requests").labelNames("path", "method", "code")
            .help("Inprogress requests.").register();

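Histogram

A histogram comes with bucket ranges and is used to chart distributions.

It is mainly used to record sizes (such as HTTP request bytes) or counts of events within specified distribution ranges (buckets).

Take the request latency metric requests_latency_seconds as an example: suppose we need to count how many HTTP response times fall within each of the buckets {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}.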

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Histogram requestLatencyHistogram = Histogram.build().labelNames("path", "method", "code")
            .name("io_namespace_http_requests_latency_seconds_histogram").help("Request latency in seconds.")
            .register();

    // The interceptor is shared by all requests, so the timer is kept on the
    // request itself rather than in an instance field.
    private static final String TIMER_ATTRIBUTE = "histogramRequestTimer";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... omitted code (defines requestURI, method and status)
        request.setAttribute(TIMER_ATTRIBUTE,
                requestLatencyHistogram.labels(requestURI, method, String.valueOf(status)).startTimer());
        // ... omitted code
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... omitted code
        Histogram.Timer timer = (Histogram.Timer) request.getAttribute(TIMER_ATTRIBUTE);
        if (timer != null) {
            timer.observeDuration(); // records the elapsed seconds into the histogram
        }
        // ... omitted code
    }
}

The Histogram builder creates a Histogram metric. The default buckets are {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}. To override the defaults, use .buckets(double... buckets).

A Histogram automatically creates three kinds of time series:

Total number of events: basename_count

# Meaning: 2 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0

Sum of all observed values: basename_sum

# Meaning: the 2 HTTP requests took 13.107670803000001 seconds in total
io_namespace_http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001

Number of observations falling into each bucket: basename_bucket{le="<inclusive upper bound>"}

# Of the 2 requests, 0 had a response time <= 0.005 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0
# Of the 2 requests, 0 had a response time <= 0.01 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0
# Of the 2 requests, 0 had a response time <= 0.025 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.025",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.05",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.075",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.1",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.25",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.75",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="1.0",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="2.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="5.0",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="7.5",} 2.0
# Of the 2 requests, 2 had a response time <= 10 seconds (buckets are cumulative)
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="10.0",} 2.0
# The +Inf bucket counts every request regardless of response time: 2
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="+Inf",} 2.0
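
The cumulative bucket counts shown above, and the way a quantile can be estimated from them, can be sketched without any client library. This is an illustrative model, not the client's or the server's actual code: the class name and methods are hypothetical, at least one finite bucket bound is assumed, and the quantile estimate uses linear interpolation inside the bucket that contains the target rank, which is the general idea behind histogram_quantile. The two observed values are made up so that both land between 5 and 7.5 seconds, as in the sample output.

```java
import java.util.Arrays;

final class HistogramSketch {
    final double[] upperBounds; // finite bounds in ascending order, followed by +Inf
    final long[] cumulative;    // cumulative[i] = number of observations <= upperBounds[i]
    double sum;                 // the value exposed as basename_sum

    HistogramSketch(double... finiteBounds) {
        upperBounds = Arrays.copyOf(finiteBounds, finiteBounds.length + 1);
        upperBounds[finiteBounds.length] = Double.POSITIVE_INFINITY;
        cumulative = new long[upperBounds.length];
    }

    void observe(double value) {
        sum += value;
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                cumulative[i]++; // every bucket whose bound covers the value is incremented
            }
        }
    }

    // The +Inf bucket holds every observation, so it equals basename_count.
    long count() {
        return cumulative[cumulative.length - 1];
    }

    // Estimate quantile q (0 <= q <= 1) from the bucket counts alone.
    double quantile(double q) {
        if (count() == 0) {
            return Double.NaN;
        }
        double rank = q * count();
        for (int i = 0; i < cumulative.length; i++) {
            long below = i == 0 ? 0 : cumulative[i - 1];
            if (cumulative[i] >= rank && cumulative[i] > below) {
                if (Double.isInfinite(upperBounds[i])) {
                    return upperBounds[i - 1]; // cannot interpolate into the +Inf bucket
                }
                double lower = i == 0 ? 0 : upperBounds[i - 1];
                return lower + (upperBounds[i] - lower) * (rank - below) / (cumulative[i] - below);
            }
        }
        return Double.NaN;
    }

    public static void main(String[] args) {
        HistogramSketch h = new HistogramSketch(
                .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10);
        h.observe(6.5);
        h.observe(6.6);
        System.out.println("count=" + h.count() + " sum=" + h.sum + " median~" + h.quantile(0.5));
    }
}
```

Note that the estimate depends only on the bucket boundaries and counts, so its accuracy is limited by how well the buckets fit the real latency range.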

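Summary

Summary and Histogram are very similar: both record the number of events or their sizes, along with their distribution.

Both provide a count of events (_count) and a sum of the observed values (_sum), so the _count and _sum series of either type can be used to compute the same things, for example the average HTTP response time over the last 5 minutes: rate(basename_sum[5m]) / rate(basename_count[5m]).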
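
The arithmetic behind that expression is a ratio of two increases: how much _sum grew divided by how much _count grew over the same window. A minimal sketch, assuming two readings of each series taken at the start and end of the window (the class and method names are hypothetical):

```java
final class AverageLatencySketch {
    // Average observed value between two readings of basename_sum and basename_count.
    // Returns NaN when no new events happened in between.
    static double average(double sumStart, double countStart, double sumEnd, double countEnd) {
        double events = countEnd - countStart;
        if (events <= 0) {
            return Double.NaN;
        }
        return (sumEnd - sumStart) / events;
    }

    public static void main(String[] args) {
        // 12 requests taking 51.029495508 seconds in total since the start of the window.
        System.out.println(average(0, 0, 51.029495508, 12));
    }
}
```

Because both series are divided over the same window, the per-second factor of rate() cancels out and the result is in seconds per request.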

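Both can also be used to compute statistics about the distribution of samples, such as the median or the 90th percentile. A quantile value q satisfies 0.0 <= q <= 1.0.

The difference is that with a Histogram, quantiles are computed on the server side with the histogram_quantile function, whereas the quantiles of a Summary are defined and computed directly on the client. As a result, for quantile calculations a Summary performs better when queried through PromQL, while a Histogram costs more server resources at query time. Conversely, a Histogram consumes fewer resources on the client.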

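public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Summary requestLatency = Summary.build()
            .name("io_namespace_http_requests_latency_seconds_summary")
            .quantile(0.5, 0.05)   // median, with 5% tolerated error
            .quantile(0.9, 0.01)   // 0.9 quantile, with 1% tolerated error
            .labelNames("path", "method", "code")
            .help("Request latency in seconds.").register();

    // The interceptor is shared by all requests, so the timer is kept on the
    // request itself rather than in an instance field.
    private static final String TIMER_ATTRIBUTE = "summaryRequestTimer";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... omitted code (defines requestURI, method and status)
        request.setAttribute(TIMER_ATTRIBUTE,
                requestLatency.labels(requestURI, method, String.valueOf(status)).startTimer());
        // ... omitted code
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... omitted code
        Summary.Timer timer = (Summary.Timer) request.getAttribute(TIMER_ATTRIBUTE);
        if (timer != null) {
            timer.observeDuration(); // records the elapsed seconds into the summary
        }
        // ... omitted code
    }
}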

Using a Summary automatically creates several time series:

Total number of events

# Meaning: 12 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 12.0

Sum of the observed values

# Meaning: the 12 HTTP requests took 51.029495508 s in total
io_namespace_http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508

Distribution of the observed values

# Meaning: the median response time of the 12 HTTP requests is 3.052404983 s
io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983
# Meaning: the 0.9 quantile (90th percentile) of the response times of the 12 HTTP requests is 8.003261666 s
io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666

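PromQL

PromQL is the built-in query language (it is not SQL, though it plays a similar role), a convenient way to query and retrieve data from Prometheus.

Besides querying data, alerting rules are also written as query expressions. The simplest query is just a metric name:

go_memstats_other_sys_bytes

Results can be filtered by label:

go_memstats_other_sys_bytes{instance="192.168.88.10"}

Four label matching operators are available:

=: label exactly equal to the given string
!=: label not equal to the given string
=~: label matches the regular expression
!~: label does not match the regular expression

Several label matchers can be combined, separated by commas; they are ANDed together: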

http_requests_total{environment=~"staging|testing|development",method!="GET"}
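
The four matchers and their AND combination can be modelled in a few lines. This is a sketch under stated assumptions rather than the PromQL engine: the class and enum names are hypothetical, Java's regex dialect stands in for the RE2 syntax Prometheus uses, a missing label is treated like an empty value, and regular expressions must match the whole label value.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class MatcherSketch {
    enum Op { EQ, NEQ, RE, NRE } // =, !=, =~, !~

    // Test one matcher against the labels of a series.
    static boolean matches(Map<String, String> labels, String name, Op op, String operand) {
        String actual = labels.getOrDefault(name, ""); // a missing label behaves like an empty value
        switch (op) {
            case EQ:
                return actual.equals(operand);
            case NEQ:
                return !actual.equals(operand);
            case RE:
                return Pattern.matches(operand, actual); // whole-value match, i.e. anchored
            case NRE:
                return !Pattern.matches(operand, actual);
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    // The selector from the text: environment=~"staging|testing|development",method!="GET"
    static boolean selected(Map<String, String> labels) {
        return matches(labels, "environment", Op.RE, "staging|testing|development")
                && matches(labels, "method", Op.NEQ, "GET");
    }

    public static void main(String[] args) {
        System.out.println(selected(Map.of("environment", "staging", "method", "POST")));
        System.out.println(selected(Map.of("environment", "production", "method", "POST")));
        System.out.println(selected(Map.of("environment", "testing", "method", "GET")));
    }
}
```

A series is selected only if every matcher in the braces accepts it, which is what "ANDed together" means in practice.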

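A query can also consist of label matchers only:

{instance="192.168.88.10"}

Query results can be processed further:

# Range selection over time (range vector selector)
http_requests_total{job="prometheus"}[5m]

# Time offset
http_requests_total offset 5m

# Sum across all matching series, using their values from 5 minutes ago
sum(http_requests_total{method="GET"} offset 5m)

Binary operators and functions are available too, as are the following aggregation operators:

sum (sum)
min (minimum)
max (maximum)
avg (average)
count (number of elements)
stddev (population standard deviation)
stdvar (population variance)
count_values (number of elements with each distinct value)
bottomk (smallest k elements), topk (largest k elements)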
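
As a rough illustration of how two of these operators combine, the sketch below mimics topk(k, sum(...) by (path)) on an in-memory instant vector. It is a toy model with hypothetical names, not the PromQL evaluator: the vector is a map from a series' labels to its current value, and grouping keeps only the one label named in the by clause.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class AggregationSketch {
    // sum(vector) by (label): group series by one label's value and add up their samples.
    static Map<String, Double> sumBy(Map<Map<String, String>, Double> vector, String label) {
        Map<String, Double> groups = new TreeMap<>();
        for (Map.Entry<Map<String, String>, Double> series : vector.entrySet()) {
            groups.merge(series.getKey().getOrDefault(label, ""), series.getValue(), Double::sum);
        }
        return groups;
    }

    // topk(k, groups): keep the k groups with the largest values, largest first.
    static List<Map.Entry<String, Double>> topK(Map<String, Double> groups, int k) {
        return groups.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> vector = Map.of(
                Map.of("path", "/a", "method", "GET"), 3.0,
                Map.of("path", "/a", "method", "POST"), 2.0,
                Map.of("path", "/b", "method", "GET"), 10.0);
        System.out.println(topK(sumBy(vector, "path"), 1));
    }
}
```

The aggregation collapses the method dimension first, and only then are the per-path totals ranked.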

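Web UI

Alerting rules are listed on the Status -> Rules page, and alerts that have been triggered are shown on the Alerts page.

Alertmanager has to be deployed separately. It receives the alerts sent by Prometheus and dispatches them through the channels specified in its configuration file.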

Example

Let's build a demo that shows Prometheus in action: Spring Boot + Pushgateway + Prometheus.

Spring Boot

Create a Spring Boot project

pom.xml

<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.3.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_pushgateway</artifactId>
  <!-- kept at the same version as simpleclient to avoid mixing client versions -->
  <version>0.3.0</version>
</dependency>

Controller

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.PushGateway;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TestPrometheus {
    // Registered with the default registry when the class is loaded
    private static final Counter REQUEST_COUNTER = Counter.build()
            .name("test_counter_name").help("test_counter_help.").register();

    @GetMapping("/push")
    public String push() throws Exception {
        REQUEST_COUNTER.inc();
        // Pushgateway address
        PushGateway pg = new PushGateway("127.0.0.1:9091");
        // Arguments: collector, job name
        pg.push(REQUEST_COUNTER, "test_push");
        return String.valueOf(REQUEST_COUNTER.get());
    }
}
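
On the wire, a push boils down to an HTTP request whose path carries the job name and whose body is the metrics in text format. The sketch below builds such a request with the JDK's HTTP client (Java 11 or later) and is an illustration under stated assumptions, not the client library's code: the class name is hypothetical, the job name is assumed to be URL-safe, and PUT is used because the Pushgateway replaces the metrics of a group on PUT.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PushSketch {
    // Build the request for pushing one group of metrics under the given job name.
    static HttpRequest buildPush(String gatewayAddress, String job, String metricsText) {
        return HttpRequest.newBuilder(URI.create("http://" + gatewayAddress + "/metrics/job/" + job))
                .header("Content-Type", "text/plain; version=0.0.4")
                .PUT(HttpRequest.BodyPublishers.ofString(metricsText))
                .build();
    }

    public static void main(String[] args) {
        String body = "# TYPE test_counter_name counter\n" + "test_counter_name 1\n";
        HttpRequest request = buildPush("127.0.0.1:9091", "test_push", body);
        System.out.println(request.method() + " " + request.uri());
        // Sending it requires a running Pushgateway, for example:
        // java.net.http.HttpClient.newHttpClient()
        //         .send(request, java.net.http.HttpResponse.BodyHandlers.discarding());
    }
}
```

The job name in the URL is what later shows up as a label on the pushed series when Prometheus scrapes the Pushgateway.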

Pushgateway

Download the release for your operating system
Unpack and run it: ./pushgateway
- default port: 9091
Edit the Prometheus configuration file

### Scrape configuration for the Spring Boot application
  - job_name: 'springboot_prometheus'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['127.0.0.1:8080']
### Scrape configuration for the Pushgateway
  - job_name: 'push-metrics'
    static_configs:
      - targets: ['localhost:9091']

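Start Prometheus

./prometheus

Open http://localhost:8080/push
- Each call to the push endpoint increments the REQUEST_COUNTER counter by 1 and shows the result on the page, so the page displays how many times the endpoint has been called since the application started
Open http://localhost:9090/graph, enter the Counter name test_counter_name, and click Execute. The output below appears: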

Element	Value
test_counter_name{exported_job="test_push",instance="localhost:9091",job="push-metrics"}	1

Where:

exported_job: the Pushgateway job name defined in the code
instance: the instance (the scrape target)
job: the job name configured in Prometheus
Value: the value of the Counter
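
The exported_job label appears because the pushed series already carries a job label that collides with the job label Prometheus attaches to the scrape target. With the default setting (honor_labels not enabled), the scraped label is renamed with an exported_ prefix and the target's label wins. A simplified sketch of that merge, with hypothetical class and method names and ignoring the special handling of empty label values:

```java
import java.util.Map;
import java.util.TreeMap;

final class ExportedLabelSketch {
    // Merge labels found in the scraped data with the labels attached to the target.
    static Map<String, String> merge(Map<String, String> scraped, Map<String, String> target) {
        Map<String, String> merged = new TreeMap<>();
        // A scraped label that collides with a target label is kept under an exported_ prefix.
        scraped.forEach((name, value) ->
                merged.put(target.containsKey(name) ? "exported_" + name : name, value));
        // The target's own labels always take the plain names.
        merged.putAll(target);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> scraped = Map.of("job", "test_push");
        Map<String, String> target = Map.of("job", "push-metrics", "instance", "localhost:9091");
        System.out.println(merge(scraped, target));
    }
}
```

The printed labels match the three shown in the query result: exported_job from the push, and job and instance from the scrape configuration.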

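Comparison

InfluxDB, OpenTSDB, and similar systems are dedicated time series databases rather than complete monitoring and alerting systems; on their own they lack alerting.

Use cases

If applications run in containers scheduled by Kubernetes, the environment is highly automated and dynamic. Traditional monitoring tools are generally host-based and only monitor static services, so they struggle to monitor applications in such a dynamic environment. This is where Prometheus comes in.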


贰白