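Nightingale Operations Guide: Custom Alert Templates
1 Background
The alert template that ships with Nightingale (夜莺) is cluttered, and the important information is hard to pick out at a glance, which slows down troubleshooting.
We need a concise alert template that fits the company and, above all, shows the fault details at a glance so that the fault can be tracked down quickly.
2 Steps
The simplest approach is to send alert notifications through a custom script.
The complete Python script is as follows: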
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Purpose: custom Feishu alert notification template
"""
import datetime
import json
import sys

import requests

# Python 2 only: make UTF-8 the default encoding so that Unicode data does
# not raise encoding errors. Python 3 strings are already Unicode.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')


class Sender(object):
    def __init__(self, payload):
        self.headers = {
            "Content-Type": "application/json"
        }
        event = payload.get('event')
        self.users = event.get('notify_users_obj') or []  # all users to notify
        self.is_recovered = event.get('is_recovered')  # whether the alert has recovered
        self.content = payload.get('tpls').get("feishu", "feishu not found")  # rendered feishu template
        self.hostname = event.get('target_ident') or ""
        self.color = "red"
        self.alert_headers = "夜莺监控异常告警"  # card title for a firing alert
        self.status_text = "触发时值: "  # label for the trigger value
        # Dashboard link for the details button. Only hosts whose name starts
        # with "tps" get one here (the domain is redacted).
        self.monitor_url = None
        if self.hostname.startswith("tps"):
            self.monitor_url = "http://domain(脱敏域名)/dashboards/12?datasource=1&ident={0}".format(self.hostname)

    def send_ifeishu(self, payload):
        # Collect the Feishu bot webhook URLs. The dict only serves to keep
        # the keys unique; the value 1 is a placeholder.
        tokens = {}
        for u in self.users:
            contacts = u.get('contacts')  # contact details of this user
            if not contacts:
                continue
            if contacts.get("ifeishu_rebot_token", ""):
                tokens[contacts.get("ifeishu_rebot_token", "")] = 1

        alert_content = ""
        if "带宽" in self.content:  # bandwidth alerts: show the value in MB
            bandwidth = payload.get('event').get('trigger_value')
            bandwidth_mb = "%.1f" % (float(bandwidth) / 1024 / 1024)
            bind_content = self.status_text + bandwidth_mb + "MB"
            alert_content = "告警对象: " + self.hostname + "\n" + self.content + "\n" + bind_content + "\n" + "主要关注人: <at id=all></at>"
        # Other monitoring items are redacted.
        if self.is_recovered:
            self.alert_headers = "夜莺监控恢复正常"  # card title for a recovered alert
            self.color = "green"
            alert_content = "告警对象: " + self.hostname + "\n" + self.content

        message_body = {
            "msg_type": "interactive",
            "card": {
                "config": {
                    "wide_screen_mode": True
                },
                "elements": [
                    {
                        "tag": "div",
                        "text": {
                            "content": alert_content,
                            "tag": "lark_md"
                        }
                    },
                ],
                "header": {
                    # Card theme color: red, orange, yellow, green, cyan, blue, purple, pink
                    "template": self.color,
                    "title": {
                        "content": self.alert_headers,
                        "tag": "plain_text"
                    }
                }
            }
        }

        # Add a details button for firing alerts, except process ("进程")
        # alerts and hosts without a dashboard link.
        if not self.is_recovered and "进程" not in self.content and self.monitor_url:
            button_elements = {
                "tag": "action",
                "actions": [
                    {
                        "tag": "button",
                        "text": {
                            "tag": "plain_text",
                            "content": "🔎 查看详情"
                        },
                        "type": "primary",
                        "multi_url": {
                            "url": self.monitor_url,
                            "pc_url": "",       # desktop URL
                            "android_url": "",  # Android URL
                            "ios_url": ""       # iOS URL
                        }
                    }
                ]
            }
            message_body["card"]["elements"].append(button_elements)

        # Send the same card to every unique webhook URL.
        for url in tokens:
            response = requests.post(url, headers=self.headers, data=json.dumps(message_body))


if __name__ == '__main__':
    payload = json.load(sys.stdin)
    sender = Sender(payload)
    sender.send_ifeishu(payload)
    # Keep a copy of the payload for debugging.
    with open('payload.json', 'w') as f:
        f.write(json.dumps(payload, indent=4))  # indent=4: each nesting level is indented by 4 spaces
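The bandwidth branch of the script divides the trigger value by 1024 twice and prints one decimal place. A minimal sketch of that conversion as a standalone helper, assuming the raw trigger value is expressed in bytes (the source only shows the division and the "MB" suffix); the name format_bandwidth is hypothetical and not part of the original script:

```python
def format_bandwidth(trigger_value):
    """Convert a raw trigger value (assumed to be in bytes) to a string in MB.

    The value may arrive as a string or a number, so it is passed through
    float() first, as the notification script does.
    """
    return "%.1fMB" % (float(trigger_value) / 1024 / 1024)


# Hypothetical sample values, chosen only for illustration.
print(format_bandwidth("52428800"))  # 50.0MB
print(format_bandwidth(1572864))     # 1.5MB
```

Keeping the conversion in its own function makes it possible to check the formatting without posting anything to Feishu.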
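Note that the notification script defines a method named send_ifeishu, so a matching notification medium and contact field have to be added.
Then edit one of the alert rules and set its notification medium to ifeishu. In the alert receiver group, add a Feishu bot user whose contact field is set to ifeishu_rebot_token: the Feishu bot webhook URL.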
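To illustrate how the ifeishu_rebot_token contact field ends up in the script, here is a minimal sketch that collects the unique webhook URLs from a list shaped like event.notify_users_obj. The function name collect_webhooks and the example.invalid URLs are hypothetical placeholders; only the contacts and ifeishu_rebot_token keys come from the script itself.

```python
def collect_webhooks(users, key="ifeishu_rebot_token"):
    """Return the unique, non-empty webhook URLs in first-seen order."""
    seen = []
    for user in users or []:
        contacts = user.get("contacts") or {}  # users without contacts are skipped
        url = contacts.get(key)
        if url and url not in seen:
            seen.append(url)
    return seen


# Hypothetical receivers: one duplicate and one user without contacts.
users = [
    {"contacts": {"ifeishu_rebot_token": "https://example.invalid/hook/a"}},
    {"contacts": {}},
    {"contacts": {"ifeishu_rebot_token": "https://example.invalid/hook/a"}},
    {"contacts": {"ifeishu_rebot_token": "https://example.invalid/hook/b"}},
]
print(collect_webhooks(users))
```

If several receivers share one bot, the card is sent to that bot only once.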
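Then edit the notification template (feishu), or add a new one (a new template requires changing the Python code above), remove the content you do not need, and save.
# For example, I changed the Feishu notification template to: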
级别状态: {{if .IsRecovered}}恢复正常{{else}}触发告警{{end}}
规则名称: {{.RuleName}}{{if .RuleNote}}{{end}}
{{if .IsRecovered}}恢复时间: {{timeformat .LastEvalTime}}{{else}}触发时间: {{timeformat .TriggerTime}}{{end}}
3 Debugging the Script
To debug the script, first have Nightingale generate a payload. The script for that is as follows:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json


class Sender(object):
    @classmethod
    def send_email(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_wecom(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_dingtalk(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_ifeishu(cls, payload):
        # Dump the payload so it can be replayed against the custom script.
        with open('/tmp/payload.json', 'w') as f:
            f.write(json.dumps(payload, indent=4))

    @classmethod
    def send_mm(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_sms(cls, payload):
        pass

    @classmethod
    def send_voice(cls, payload):
        pass


def main():
    payload = json.load(sys.stdin)
    with open(".payload", 'w') as f:
        f.write(json.dumps(payload, indent=4))
    # Call send_<channel> for every notification channel on the event.
    for ch in payload.get('event').get('notify_channels'):
        send_func_name = "send_{}".format(ch.strip())
        if not hasattr(Sender, send_func_name):
            print("function: {} not found".format(send_func_name))
            continue
        send_func = getattr(Sender, send_func_name)
        send_func(payload)


def hello():
    print("hello nightingale")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    elif sys.argv[1] == "hello":
        hello()
    else:
        print("I am confused")
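Then install this script as the notification script and change the alert rule to something that fires very easily, for example mem_used_percent{ident=~'jde-server.*'} > 1, which fires as soon as memory usage exceeds 1%.
This produces /tmp/payload.json.
Once you have the payload, debug with the command python notify_feishu.py < /tmp/payload.json, where notify_feishu.py is the custom script written above.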
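If waiting for a real alert is inconvenient, a payload can also be written by hand. The sketch below builds one that contains only the fields the two scripts in this post read (event.target_ident, event.is_recovered, event.trigger_value, event.notify_channels, event.notify_users_obj and tpls.feishu). A real Nightingale payload carries more fields than this; the helper name make_sample_payload, the host name, the template text and the webhook URL are all hypothetical.

```python
import json


def make_sample_payload(hostname, rule_text, trigger_value, webhook_url,
                        is_recovered=False):
    """Build a minimal payload with only the fields the scripts read."""
    return {
        "event": {
            "target_ident": hostname,
            "is_recovered": is_recovered,
            "trigger_value": trigger_value,
            "notify_channels": ["ifeishu"],
            "notify_users_obj": [
                {"contacts": {"ifeishu_rebot_token": webhook_url}},
            ],
        },
        "tpls": {"feishu": rule_text},
    }


if __name__ == "__main__":
    sample = make_sample_payload(
        hostname="tps-demo-01",                      # hypothetical host
        rule_text="规则名称: 带宽",                    # hypothetical rendered template
        trigger_value="52428800",                    # hypothetical raw value
        webhook_url="https://example.invalid/hook",  # placeholder, not a real bot
    )
    with open("sample_payload.json", "w") as f:
        json.dump(sample, f, indent=4)
```

Feeding sample_payload.json to the custom script makes it post to the webhook URL, so replace the placeholder with a test bot before running it.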
If you run into problems with custom alert templates, feel free to message the author for help.