An incident management tool that supports alerting across multiple channels with easy custom messaging and on-call integrations. Compatible with any tool that supports webhook alerts, it's designed for modern DevOps teams to respond quickly to production incidents.
📖 Mastering Your Site Reliability Engineering Skills: On-Call in Action.
Features
- 🚨 Multi-channel Alerts: Send incident notifications to Slack, Microsoft Teams, Telegram, and Email (more channels coming!)
- 📝 Custom Templates: Define your own alert messages using Go templates
- 🔧 Easy Configuration: YAML-based configuration with environment variable support
- 📡 REST API: Simple HTTP interface to receive alerts
- 📡 On-call: On-call integrations with AWS Incident Manager
Contributing
We welcome contributions! Please follow these steps:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Support This Project
Help us maintain Versus Incident! Your sponsorship:
🔧 Funds critical infrastructure
🚀 Accelerates new features like Viber/Lark integration, Web UI, and on-call integrations
Getting Started
Table of Contents
- Prerequisites
- Easy Installation with Docker
- Universal Alert Template Support
- Development Custom Templates
- SNS Usage
- On-Call
Prerequisites
- Docker 20.10+ (optional)
- Slack workspace (for Slack notifications)
Easy Installation with Docker
docker run -p 3000:3000 \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN=your_token \
-e SLACK_CHANNEL_ID=your_channel \
ghcr.io/versuscontrol/versus-incident
Versus listens on port 3000 by default and exposes the /api/incidents
endpoint, which you can configure as a webhook URL in your monitoring tools. This endpoint accepts JSON payloads from various monitoring systems and forwards the alerts to your configured notification channels.
Universal Alert Template Support
Our default template automatically handles alerts from multiple sources, including:
- Alertmanager (Prometheus)
- Grafana Alerts
- Sentry
- CloudWatch SNS
- FluentBit
Example: Send an Alertmanager alert
curl -X POST "http://localhost:3000/api/incidents" \
-H "Content-Type: application/json" \
-d '{
"receiver": "webhook-incident",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "PostgresqlDown",
"instance": "postgresql-prod-01",
"severity": "critical"
},
"annotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"startsAt": "2023-10-01T12:34:56.789Z",
"endsAt": "2023-10-01T12:44:56.789Z",
"generatorURL": ""
}
],
"groupLabels": {
"alertname": "PostgresqlDown"
},
"commonLabels": {
"alertname": "PostgresqlDown",
"severity": "critical",
"instance": "postgresql-prod-01"
},
"commonAnnotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"externalURL": ""
}'
Example: Send a Sentry alert
curl -X POST "http://localhost:3000/api/incidents" \
-H "Content-Type: application/json" \
-d '{
"action": "created",
"data": {
"issue": {
"id": "123456",
"title": "Example Issue",
"culprit": "example_function in example_module",
"shortId": "PROJECT-1",
"project": {
"id": "1",
"name": "Example Project",
"slug": "example-project"
},
"metadata": {
"type": "ExampleError",
"value": "This is an example error"
},
"status": "unresolved",
"level": "error",
"firstSeen": "2023-10-01T12:00:00Z",
"lastSeen": "2023-10-01T12:05:00Z",
"count": 5,
"userCount": 3
}
},
"installation": {
"uuid": "installation-uuid"
},
"actor": {
"type": "user",
"id": "789",
"name": "John Doe"
}
}'
Development Custom Templates
Docker
Create a configuration file:
mkdir -p ./config && touch ./config/config.yaml
`config.yaml`:
name: versus
host: 0.0.0.0
port: 3000
alert:
  slack:
    enable: true
    token: ${SLACK_TOKEN}
    channel_id: ${SLACK_CHANNEL_ID}
    template_path: "/app/config/slack_message.tmpl"
  telegram:
    enable: false
  viber:
    enable: false
  msteams:
    enable: false
Configuration Notes
Ensure `template_path` in `config.yaml` matches the container path:
alert:
  slack:
    template_path: "/app/config/slack_message.tmpl" # For containerized env
Slack Template
Create your Slack message template, for example `config/slack_message.tmpl`:
🔥 *Critical Error in {{.ServiceName}}*
❌ Error Details:
```{{.Logs}}```
Owner <@{{.UserID}}> please investigate
Run with volume mount:
docker run -d \
-p 3000:3000 \
-v $(pwd)/config:/app/config \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN=your_slack_token \
-e SLACK_CHANNEL_ID=your_channel_id \
--name versus \
ghcr.io/versuscontrol/versus-incident
To test, simply send an incident to Versus:
curl -X POST http://localhost:3000/api/incidents \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] This is an error log from User Service that we can obtain using Fluent Bit.",
"ServiceName": "order-service",
"UserID": "SLACK_USER_ID"
}'
Response:
{
"status":"Incident created"
}
Understanding Custom Templates with Monitoring Webhooks
When integrating Versus with any monitoring tool that supports webhooks, you need to understand the JSON payload structure that the tool sends to create an effective template. Here's a step-by-step guide:
1. Enable Debug Mode: First, enable `debug_body` in your config to see the exact payload structure:
alert:
  debug_body: true # This will print the incoming payload to the console
2. Capture Sample Payload: Send a test alert to Versus, then review the JSON structure in the logs of your Versus instance.
3. Create Custom Template: Use the JSON structure to build a template that extracts the relevant information.
FluentBit Integration Example
Here's a sample FluentBit configuration to send logs to Versus:
[OUTPUT]
    Name            http
    Match           kube.production.user-service.*
    Host            versus-host
    Port            3000
    URI             /api/incidents
    Format          json
    Header          Content-Type application/json
    Retry_Limit     3
Sample FluentBit JSON Payload:
{
"date": 1746354647.987654321,
"log": "ERROR: Exception occurred while handling request ID: req-55ef8801\nTraceback (most recent call last):\n File \"/app/server.py\", line 215, in handle_request\n user_id = session['user_id']\nKeyError: 'user_id'\n",
"stream": "stderr",
"time": "2025-05-04T17:30:47.987654321Z",
"kubernetes": {
"pod_name": "user-service-6cc8d5f7b5-wxyz9",
"namespace_name": "production",
"pod_id": "f0e9d8c7-b6a5-f4e3-d2c1-b0a9f8e7d6c5",
"labels": {
"app": "user-service",
"tier": "backend",
"environment": "production"
},
"annotations": {
"kubernetes.io/psp": "eks.restricted",
"monitoring.alpha.example.com/scrape": "true"
},
"host": "ip-10-1-2-4.ap-southeast-1.compute.internal",
"container_name": "auth-logic-container",
"docker_id": "f5e4d3c2b1a0f5e4d3c2b1a0f5e4d3c2b1a0f5e4d3c2b1a0f5e4d3c2b1a0f5e4",
"container_hash": "my-docker-hub/user-service@sha256:abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890",
"container_image": "my-docker-hub/user-service:v2.1.0"
}
}
FluentBit Slack Template (`config/slack_message.tmpl`):
🚨 *Error in {{.kubernetes.labels.app}}* 🚨
*Environment:* {{.kubernetes.labels.environment}}
*Pod:* {{.kubernetes.pod_name}}
*Container:* {{.kubernetes.container_name}}
*Error Details:*
```{{.log}}```
*Time:* {{.time}}
*Host:* {{.kubernetes.host}}
<@SLACK_ONCALL_USER_ID> Please investigate!
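The nested fields in this template are resolved the same way Go's text/template walks a map decoded from JSON: each dot segment indexes one level of the map. A minimal, self-contained sketch of that lookup, using a trimmed version of the payload above (illustrative only, not Versus internals):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// A trimmed version of the FluentBit payload shown above.
const payload = `{
  "log": "ERROR: KeyError: 'user_id'",
  "time": "2025-05-04T17:30:47Z",
  "kubernetes": {
    "pod_name": "user-service-6cc8d5f7b5-wxyz9",
    "container_name": "auth-logic-container",
    "host": "ip-10-1-2-4.ap-southeast-1.compute.internal",
    "labels": {"app": "user-service", "environment": "production"}
  }
}`

// renderFluentBit unmarshals the JSON into map[string]interface{} and
// executes a template against it; dot notation walks each nesting level.
func renderFluentBit(tmpl string) (string, error) {
	var data map[string]interface{}
	if err := json.Unmarshal([]byte(payload), &data); err != nil {
		return "", err
	}
	t, err := template.New("alert").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderFluentBit("Error in {{.kubernetes.labels.app}} on pod {{.kubernetes.pod_name}}")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```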
Other Templates
Telegram Template
For Telegram, you can use HTML formatting. Create your Telegram message template, for example `config/telegram_message.tmpl`:
🚨 <b>Critical Error Detected!</b> 🚨
📌 <b>Service:</b> {{.ServiceName}}
⚠️ <b>Error Details:</b>
{{.Logs}}
This template will be parsed with HTML tags when sending the alert to Telegram.
Email Template
Create your email message template, for example `config/email_message.tmpl`:
Subject: Critical Error Alert - {{.ServiceName}}
Critical Error Detected in {{.ServiceName}}
----------------------------------------
Error Details:
{{.Logs}}
Please investigate this issue immediately.
Best regards,
Versus Incident Management System
This template supports both plain text and HTML formatting for email notifications.
Microsoft Teams Template
Create your Teams message template, for example `config/msteams_message.tmpl`:
**Critical Error in {{.ServiceName}}**
**Error Details:**
```{{.Logs}}```
Please investigate immediately
Kubernetes
- Create a secret for Slack:
# Create secret
kubectl create secret generic versus-secrets \
--from-literal=slack_token=$SLACK_TOKEN \
--from-literal=slack_channel_id=$SLACK_CHANNEL_ID
- Create a ConfigMap for the config and template file, for example `versus-config.yaml`:
apiVersion: v1
kind: ConfigMap
metadata:
  name: versus-config
data:
  config.yaml: |
    name: versus
    host: 0.0.0.0
    port: 3000
    alert:
      slack:
        enable: true
        token: ${SLACK_TOKEN}
        channel_id: ${SLACK_CHANNEL_ID}
        template_path: "/app/config/slack_message.tmpl"
      telegram:
        enable: false
  slack_message.tmpl: |
    *Critical Error in {{.ServiceName}}*
    ----------
    Error Details:
    ```
    {{.Logs}}
    ```
    ----------
    Owner <@{{.UserID}}> please investigate
kubectl apply -f versus-config.yaml
- Create `versus-deployment.yaml`:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: versus-incident
spec:
  replicas: 2
  selector:
    matchLabels:
      app: versus-incident
  template:
    metadata:
      labels:
        app: versus-incident
    spec:
      containers:
        - name: versus-incident
          image: ghcr.io/versuscontrol/versus-incident
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
          env:
            - name: SLACK_CHANNEL_ID
              valueFrom:
                secretKeyRef:
                  name: versus-secrets
                  key: slack_channel_id
            - name: SLACK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: versus-secrets
                  key: slack_token
          volumeMounts:
            - name: versus-config
              mountPath: /app/config/config.yaml
              subPath: config.yaml
            - name: versus-config
              mountPath: /app/config/slack_message.tmpl
              subPath: slack_message.tmpl
      volumes:
        - name: versus-config
          configMap:
            name: versus-config
---
apiVersion: v1
kind: Service
metadata:
  name: versus-service
spec:
  selector:
    app: versus-incident
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
- Apply:
kubectl apply -f versus-deployment.yaml
Helm Chart
You can install the Versus Incident Helm chart using OCI registry:
helm install versus-incident oci://ghcr.io/versuscontrol/charts/versus-incident
Install with Custom Values
# Install with custom configuration from a values file
helm install \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident \
-f values.yaml
Upgrading an Existing Installation
# Upgrade an existing installation with the latest version
helm upgrade \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident
# Upgrade with custom values
helm upgrade \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident \
-f values.yaml
SNS Usage
docker run -d \
-p 3000:3000 \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN=your_slack_token \
-e SLACK_CHANNEL_ID=your_channel_id \
-e SNS_ENABLE=true \
-e SNS_TOPIC_ARN=$SNS_TOPIC_ARN \
-e SNS_HTTPS_ENDPOINT_SUBSCRIPTION=https://your-domain.com \
-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY \
--name versus \
ghcr.io/versuscontrol/versus-incident
Send test message using AWS CLI:
aws sns publish \
--topic-arn $SNS_TOPIC_ARN \
--message '{"ServiceName":"test-service","Logs":"[ERROR] Test error","UserID":"U12345"}' \
--region $AWS_REGION
A key real-world application of Amazon SNS involves integrating it with CloudWatch Alarms. This allows CloudWatch to publish messages to an SNS topic when an alarm state changes (e.g., from OK to ALARM), which can then trigger notifications to Slack, Telegram, or Email via Versus Incident with a custom template.
On-Call
Currently, Versus supports on-call integration with AWS Incident Manager. Here is an updated configuration example with the on-call features:
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example # Required for on-call ack

# ... existing alert configurations ...

oncall:
  ### Enable overriding using query parameters
  # /api/incidents?oncall_enable=false => Set to `true` or `false` to enable or disable on-call for a specific alert
  # /api/incidents?oncall_wait_minutes=0 => Set the number of minutes to wait for acknowledgment before triggering on-call. Set to `0` to trigger immediately
  enable: false
  wait_minutes: 3 # If you set it to 0, it means there's no need to check for an acknowledgment, and the on-call will trigger immediately
  aws_incident_manager:
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
Explanation:
The `oncall` section includes:
- `enable`: A boolean to toggle on-call functionality for all incidents (default: `false`).
- `initialized_only`: Initializes the on-call feature but keeps it disabled by default. When set to `true`, on-call is triggered only for requests that explicitly include `?oncall_enable=true` in the URL. This is useful for having on-call ready but not enabled for all alerts.
- `wait_minutes`: Time in minutes to wait for an acknowledgment before escalating (default: `3`). Setting it to `0` triggers the on-call immediately.
- `provider`: Specifies which on-call provider to use ("aws_incident_manager" or "pagerduty").
- `aws_incident_manager`: Configuration for AWS Incident Manager when it is the selected provider, including `response_plan_arn` and `other_response_plan_arns`.
- `pagerduty`: Configuration for PagerDuty when it is the selected provider, including routing keys.

The `redis` section is required when `oncall.enable` or `oncall.initialized_only` is `true`. It configures the Redis instance used for state management and queuing, with settings like host, port, password, and db.
For detailed information on integration, please refer to the document here: On-Call setup with Versus.
Template Syntax Guide
This document explains the template syntax (Go template syntax) used to create custom alert templates in Versus Incident.
Table of Contents
Basic Syntax
Access Data
Access data fields using double curly braces and dot notation, for example, with the data:
{
  "Logs": "[ERROR] This is an error log from User Service that we can obtain using Fluent Bit.",
  "ServiceName": "order-service"
}
Example template:
*Error in {{ .ServiceName }}*
{{ .Logs }}
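Under the hood this is standard Go text/template evaluation. A minimal, self-contained sketch of how such a template renders against the payload (illustrative only, not Versus internals):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render executes a template string against arbitrary data, mirroring
// what Versus does with the fields of an incoming JSON payload.
func render(tmpl string, data interface{}) string {
	t := template.Must(template.New("msg").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	data := map[string]string{
		"ServiceName": "order-service",
		"Logs":        "[ERROR] This is an error log from User Service.",
	}
	fmt.Println(render("*Error in {{ .ServiceName }}*\n{{ .Logs }}", data))
}
```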
Variables
You can declare variables within a template using the {{ $variable := value }} syntax. Once declared, variables can be used throughout the template, for example:
{{ $owner := "Team Alpha" }}
Owner: {{ $owner }}
Output:
Owner: Team Alpha
Pipelines
Pipelines allow you to chain together multiple actions or functions. The result of one action can be passed as input to another, for example:
upper: Converts a string to uppercase.
*{{ .ServiceName | upper }} Failure*
lower: Converts a string to lowercase.
*{{ .ServiceName | lower }} Failure*
title: Converts a string to title case (first letter of each word capitalized).
*{{ .ServiceName | title }} Failure*
default: Provides a default value if the input is empty.
*{{ .ServiceName | default "unknown-service" }} Failure*
slice: Extracts a sub-slice from a slice or string.
{{ .Logs | slice 0 50 }} // First 50 characters
replace: Replaces occurrences of a substring.
{{ .Logs | replace "error" "issue" }}
trimPrefix: Trims a prefix from a string.
{{ .Logs | trimPrefix "prod-" }}
trimSuffix: Trims a suffix from a string.
{{ .Logs | trimSuffix "-service" }}
len: Returns the length of the input.
{{ .Logs | len }} // Length of the message
urlquery: Escapes a string for use in a URL query.
/search?q={{ .Query | urlquery }}
split: Splits a string into an array using a separator.
{{ $parts := split "apple,banana,cherry" "," }}
{{/* Iterate over split results */}}
{{ range $parts }}
{{ . }}
{{ end }}
You can chain multiple pipes together:
{{ .Logs | trim | lower | truncate 50 }}
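Helpers such as upper, lower, trim, truncate, and default are not built into Go's text/template; an engine like Versus registers them through a FuncMap. The sketch below registers a few hand-rolled equivalents (these implementations are assumptions, not the exact ones Versus ships) and shows how a piped value is passed as the function's final argument:

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// Hand-rolled equivalents of a few pipeline helpers. In a pipeline such as
// {{ .Logs | truncate 10 }}, the piped value becomes the LAST argument.
var funcs = template.FuncMap{
	"upper": strings.ToUpper,
	"lower": strings.ToLower,
	"trim":  strings.TrimSpace,
	"truncate": func(n int, s string) string {
		if len(s) > n {
			return s[:n]
		}
		return s
	},
	"default": func(def, val string) string {
		if val == "" {
			return def
		}
		return val
	},
}

func renderPipes(tmpl string, data interface{}) string {
	t := template.Must(template.New("p").Funcs(funcs).Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	data := map[string]string{"ServiceName": "order-service", "Logs": "  [ERROR] disk full  "}
	fmt.Println(renderPipes("*{{ .ServiceName | upper }} Failure*", data))
	fmt.Println(renderPipes("{{ .Logs | trim | lower | truncate 10 }}", data))
}
```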
Control Structures
Conditionals
The templates support conditional logic using if, else, and end keywords.
{{ if .IsCritical }}
🚨 CRITICAL ALERT 🚨
{{ else }}
⚠️ Warning Alert ⚠️
{{ end }}
and:
{{ and .Value1 .Value2 .Value3 }}
or:
{{ or .Value1 .Value2 "default" }}
Best Practices
Error Handling:
{{ if .Error }}
{{ .Details }}
{{ else }}
No error details
{{ end }}
Whitespace Control:
{{- if .Production }} // Remove preceding whitespace
PROD ALERT{{ end -}} // Remove trailing whitespace
Template Comments:
{{/* This is a hidden comment */}}
not: Negates a boolean value:
{{ if not .IsCritical }}
This is not a critical issue.
{{ end }}
eq: Checks if two values are equal:
{{ if eq .Status "critical" }}
🚨 Critical Alert 🚨
{{ end }}
ne: Checks if two values are not equal:
{{ if ne .Env "production" }}
This is not a production environment.
{{ end }}
len with gt: Returns the length of a string, slice, array, or map, which can be compared with gt:
{{ if gt (len .Errors) 0 }}
There are {{ len .Errors }} errors.
{{ end }}
hasPrefix: Checks if a string has a specific prefix:
{{ if .ServiceName | hasPrefix "prod-" }}
Production service!
{{ end }}
hasSuffix: Checks if a string has a specific suffix:
{{ if .ServiceName | hasSuffix "-service" }}
This is a service.
{{ end }}
contains: Checks if a message contains a specific string:
{{ if contains .Logs "error" }}
The message contains error logs.
{{ else }}
The message does NOT contain error.
{{ end }}
Loops
Iterate over slices/arrays with range:
{{ range .ErrorStack }}
- {{ . }}
{{ end }}
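The conditionals and loops above combine naturally in a single template. A small self-contained Go sketch (illustrative only, not Versus code):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderAlert exercises if/eq and range together: pick a severity label,
// then list each entry of an error stack as a bullet.
func renderAlert(data interface{}) string {
	const tmpl = `{{ if eq .Status "critical" }}CRITICAL{{ else }}Warning{{ end }}
{{ range .ErrorStack }}- {{ . }}
{{ end }}`
	t := template.Must(template.New("a").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	out := renderAlert(map[string]interface{}{
		"Status":     "critical",
		"ErrorStack": []string{"db timeout", "retry exhausted"},
	})
	fmt.Print(out)
}
```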
Microsoft Teams Templates
Microsoft Teams templates support Markdown syntax, which is automatically converted to Adaptive Cards when sent to Teams. As of April 2025 (with the retirement of Office 365 Connectors), all Microsoft Teams integrations use Power Automate Workflows.
Supported Markdown Features
Your template can include:
- Headings: Use `#`, `##`, or `###` for different heading levels
- Bold Text: Wrap text with double asterisks (`**bold**`)
- Code Blocks: Use triple backticks to create code blocks
- Lists: Create unordered lists with `-` or `*`, and ordered lists with numbers
- Links: Use `[text](url)` to create clickable links
Automatic Summary and Text Fields
Versus Incident now automatically handles two important fields for Microsoft Teams notifications:
- Summary: The system extracts a summary from your template's first heading (or first line if no heading exists) which appears in Teams notifications.
- Text: A plain text version of your message is automatically generated as a fallback for clients that don't support Adaptive Cards.
You don't need to add these fields manually; the system handles them for you to ensure proper display in Microsoft Teams.
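The extraction rule described above ("first heading, or first line if no heading exists") can be sketched in Go; the actual logic inside Versus may differ, this only illustrates the stated behavior:

```go
package main

import (
	"fmt"
	"strings"
)

// extractSummary returns the text of the first Markdown heading, or the
// first non-empty line when the template has no heading. This is a sketch
// of the documented rule, not the Versus implementation.
func extractSummary(md string) string {
	var firstLine string
	for _, line := range strings.Split(md, "\n") {
		trimmed := strings.TrimSpace(line)
		if trimmed == "" {
			continue
		}
		if strings.HasPrefix(trimmed, "#") {
			return strings.TrimSpace(strings.TrimLeft(trimmed, "#"))
		}
		if firstLine == "" {
			firstLine = trimmed
		}
	}
	return firstLine
}

func main() {
	fmt.Println(extractSummary("# Incident Alert: order-service\ndetails..."))
}
```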
Example Template
Here's a complete example for Microsoft Teams:
# Incident Alert: {{.ServiceName}}
### Error Information
**Time**: {{.Timestamp}}
**Severity**: {{.Severity}}
## Error Details
```{{.Logs}}```
## Action Required
1. Check system status
2. Review logs in monitoring dashboard
3. Escalate to on-call if needed
[View Details](https://your-dashboard/incidents/{{.IncidentID}})
This will be converted to an Adaptive Card with proper formatting in Microsoft Teams, with headings, code blocks, formatted lists, and clickable links.
Configuration
Table of Contents
A sample configuration file is located at `config/config.yaml`:
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example # Required for on-call ack

alert:
  debug_body: true # Default value, will be overridden by DEBUG_BODY env var
  slack:
    enable: false # Default value, will be overridden by SLACK_ENABLE env var
    token: ${SLACK_TOKEN} # From environment
    channel_id: ${SLACK_CHANNEL_ID} # From environment
    template_path: "config/slack_message.tmpl"
    message_properties:
      button_text: "Acknowledge Alert" # Custom text for the acknowledgment button
      button_style: "primary" # Button style: "primary" (default blue), "danger" (red), or empty for default gray
      disable_button: false # Set to true to disable the button, if you want to handle acknowledgment differently
  telegram:
    enable: false # Default value, will be overridden by TELEGRAM_ENABLE env var
    bot_token: ${TELEGRAM_BOT_TOKEN} # From environment
    chat_id: ${TELEGRAM_CHAT_ID} # From environment
    template_path: "config/telegram_message.tmpl"
  viber:
    enable: false # Default value, will be overridden by VIBER_ENABLE env var
    api_type: ${VIBER_API_TYPE} # From environment - "channel" (default) or "bot"
    bot_token: ${VIBER_BOT_TOKEN} # From environment (token for bot or channel)
    # Channel API (recommended for incident management)
    channel_id: ${VIBER_CHANNEL_ID} # From environment (required for channel API)
    # Bot API (for individual user notifications)
    user_id: ${VIBER_USER_ID} # From environment (required for bot API)
    template_path: "config/viber_message.tmpl"
  email:
    enable: false # Default value, will be overridden by EMAIL_ENABLE env var
    smtp_host: ${SMTP_HOST} # From environment
    smtp_port: ${SMTP_PORT} # From environment
    username: ${SMTP_USERNAME} # From environment
    password: ${SMTP_PASSWORD} # From environment
    to: ${EMAIL_TO} # From environment
    subject: ${EMAIL_SUBJECT} # From environment
    template_path: "config/email_message.tmpl"
  msteams:
    enable: false # Default value, will be overridden by MSTEAMS_ENABLE env var
    power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL} # Power Automate HTTP trigger URL (required)
    template_path: "config/msteams_message.tmpl"
    other_power_urls: # Optional: Define additional Power Automate URLs for multiple MS Teams channels
      qc: ${MSTEAMS_OTHER_POWER_URL_QC} # Power Automate URL for QC team
      ops: ${MSTEAMS_OTHER_POWER_URL_OPS} # Power Automate URL for Ops team
      dev: ${MSTEAMS_OTHER_POWER_URL_DEV} # Power Automate URL for Dev team
  lark:
    enable: false # Default value, will be overridden by LARK_ENABLE env var
    webhook_url: ${LARK_WEBHOOK_URL} # Lark webhook URL (required)
    template_path: "config/lark_message.tmpl"
    other_webhook_urls: # Optional: Enable overriding the default webhook URL using query parameters, eg /api/incidents?lark_other_webhook_url=dev
      dev: ${LARK_OTHER_WEBHOOK_URL_DEV}
      prod: ${LARK_OTHER_WEBHOOK_URL_PROD}

queue:
  enable: true
  debug_body: true
  # AWS SNS
  sns:
    enable: false
    https_endpoint_subscription_path: /sns # URI to receive SNS messages, e.g. ${host}:${port}/sns or ${https_endpoint_subscription}/sns
    # Optional: if you want to automatically create an SNS subscription
    https_endpoint_subscription: ${SNS_HTTPS_ENDPOINT_SUBSCRIPTION} # If the user configures an HTTPS endpoint, then an SNS subscription will be automatically created, e.g. https://your-domain.com
    topic_arn: ${SNS_TOPIC_ARN}
  # AWS SQS
  sqs:
    enable: false
    queue_url: ${SQS_QUEUE_URL}
  # GCP Pub Sub
  pubsub:
    enable: false
  # Azure Event Bus
  azbus:
    enable: false

oncall:
  ### Enable overriding using query parameters
  # /api/incidents?oncall_enable=false => Set to `true` or `false` to enable or disable on-call for a specific alert
  # /api/incidents?oncall_wait_minutes=0 => Set the number of minutes to wait for acknowledgment before triggering on-call. Set to `0` to trigger immediately
  initialized_only: true # Initialize on-call feature but don't enable by default; use query param oncall_enable=true to enable for specific requests
  enable: false # Use this to enable or disable on-call for all alerts
  wait_minutes: 3 # If you set it to 0, it means there's no need to check for an acknowledgment, and the on-call will trigger immediately
  provider: aws_incident_manager # Valid values: "aws_incident_manager" or "pagerduty"
  aws_incident_manager: # Used when provider is "aws_incident_manager"
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}
    other_response_plan_arns: # Optional: Enable overriding the default response plan ARN using query parameters, eg /api/incidents?awsim_other_response_plan=prod
      prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD}
      dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV}
      staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING}
  pagerduty: # Used when provider is "pagerduty"
    routing_key: ${PAGERDUTY_ROUTING_KEY} # Integration/Routing key for Events API v2 (REQUIRED)
    other_routing_keys: # Optional: Enable overriding the default routing key using query parameters, eg /api/incidents?pagerduty_other_routing_key=infra
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
Environment Variables
The application relies on several environment variables to configure alerting services. Below is an explanation of each variable:
Common
Variable | Description |
---|---|
DEBUG_BODY | Set to true to print the body of requests sent to Versus Incident. |
Slack Configuration
Variable | Description |
---|---|
SLACK_ENABLE | Set to true to enable Slack notifications. |
SLACK_TOKEN | The authentication token for your Slack bot. |
SLACK_CHANNEL_ID | The ID of the Slack channel where alerts will be sent. Can be overridden per request using the slack_channel_id query parameter. |
Slack also supports interactive acknowledgment buttons that can be configured using the following properties in the `config.yaml` file:
alert:
  slack:
    # ...other slack configuration...
    message_properties:
      button_text: "Acknowledge Alert" # Custom text for the acknowledgment button
      button_style: "primary" # Button style: "primary" (default blue), "danger" (red), or empty for default gray
      disable_button: false # Set to true to disable the button, if you want to handle acknowledgment differently
These properties allow you to:
- Customize the text of the acknowledgment button (`button_text`)
- Change the style of the button (`button_style`): options are "primary" (blue), "danger" (red), or empty for default gray
- Disable the interactive button entirely (`disable_button`) if you want to handle acknowledgment through other means
Telegram Configuration
Variable | Description |
---|---|
TELEGRAM_ENABLE | Set to true to enable Telegram notifications. |
TELEGRAM_BOT_TOKEN | The authentication token for your Telegram bot. |
TELEGRAM_CHAT_ID | The chat ID where alerts will be sent. Can be overridden per request using the telegram_chat_id query parameter. |
Viber Configuration
Viber supports two types of API integrations:
- Channel API (default): Send messages to Viber channels for team notifications
- Bot API: Send messages to individual users for personal notifications
When to use Channel API:
- ✅ Broadcasting to team channels
- ✅ Public incident notifications
- ✅ Automated system alerts
- ✅ Better for most incident management scenarios
- ✅ No individual user setup required
When to use Bot API:
- ✅ Personal notifications to specific users
- ✅ Direct messaging for individual alerts
- ⚠️ Limited to individual users only
- ⚠️ Requires users to interact with bot first
- ⚠️ User IDs can be hard to obtain
Variable | Description |
---|---|
VIBER_ENABLE | Set to true to enable Viber notifications. |
VIBER_BOT_TOKEN | The authentication token for your Viber bot or channel. |
VIBER_API_TYPE | API type: "channel" (default) for team notifications or "bot" for individual messaging. |
VIBER_CHANNEL_ID | The channel ID where alerts will be posted (required for channel API). Can be overridden per request using the viber_channel_id query parameter. |
VIBER_USER_ID | The user ID where alerts will be sent (required for bot API). Can be overridden per request using the viber_user_id query parameter. |
Email Configuration
Variable | Description |
---|---|
EMAIL_ENABLE | Set to true to enable email notifications. |
SMTP_HOST | The SMTP server hostname (e.g., smtp.gmail.com). |
SMTP_PORT | The SMTP server port (e.g., 587 for TLS). |
SMTP_USERNAME | The username/email for SMTP authentication. |
SMTP_PASSWORD | The password or app-specific password for SMTP authentication. |
EMAIL_TO | The recipient email address(es) for incident notifications. Can be multiple addresses separated by commas. Can be overridden per request using the email_to query parameter. |
EMAIL_SUBJECT | The subject line for email notifications. Can be overridden per request using the email_subject query parameter. |
Microsoft Teams Configuration
The Microsoft Teams integration now supports both legacy Office 365 webhooks and modern Power Automate workflows with a single configuration option:
alert:
  msteams:
    enable: true
    power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL}
    template_path: "config/msteams_message.tmpl"
Automatic URL Detection (April 2025 Update)
As of the April 2025 update, Versus Incident automatically detects the type of URL provided in the `power_automate_url` setting:
- Legacy Office 365 Webhook URLs: If the URL contains "webhook.office.com" (e.g., `https://yourcompany.webhook.office.com/...`), the system uses the legacy format with a simple "text" field containing your rendered Markdown.
- Power Automate Workflow URLs: For newer Power Automate HTTP trigger URLs, the system converts your Markdown template to an Adaptive Card with rich formatting features.
This automatic detection provides backward compatibility while supporting newer features, eliminating the need for separate configuration options.
Variable | Description |
---|---|
MSTEAMS_ENABLE | Set to true to enable Microsoft Teams notifications. |
MSTEAMS_POWER_AUTOMATE_URL | The Power Automate HTTP trigger URL for your Teams channel. Automatically works with both Power Automate workflow URLs and legacy Office 365 webhooks. |
MSTEAMS_OTHER_POWER_URL_QC | (Optional) Power Automate URL for the QC team channel. Can be selected per request using the msteams_other_power_url=qc query parameter. |
MSTEAMS_OTHER_POWER_URL_OPS | (Optional) Power Automate URL for the Ops team channel. Can be selected per request using the msteams_other_power_url=ops query parameter. |
MSTEAMS_OTHER_POWER_URL_DEV | (Optional) Power Automate URL for the Dev team channel. Can be selected per request using the msteams_other_power_url=dev query parameter. |
Lark Configuration
Variable | Description |
---|---|
LARK_ENABLE | Set to true to enable Lark notifications. |
LARK_WEBHOOK_URL | The webhook URL for your Lark channel. |
LARK_OTHER_WEBHOOK_URL_DEV | (Optional) Webhook URL for the development team. Can be selected per request using the lark_other_webhook_url=dev query parameter. |
LARK_OTHER_WEBHOOK_URL_PROD | (Optional) Webhook URL for the production team. Can be selected per request using the lark_other_webhook_url=prod query parameter. |
Queue Services Configuration
Variable | Description |
---|---|
SNS_ENABLE | Set to true to enable receiving alert messages from SNS. |
SNS_HTTPS_ENDPOINT_SUBSCRIPTION | This specifies the HTTPS endpoint to which SNS sends messages. When an HTTPS endpoint is configured, an SNS subscription is automatically created. If no endpoint is configured, you must create the SNS subscription manually using the CLI or AWS Console. E.g. https://your-domain.com . |
SNS_TOPIC_ARN | AWS ARN of the SNS topic to subscribe to. |
SQS_ENABLE | Set to true to enable receiving alert messages from AWS SQS. |
SQS_QUEUE_URL | URL of the AWS SQS queue to receive messages from. |
On-Call Configuration
Variable | Description |
---|---|
ONCALL_ENABLE | Set to true to enable on-call functionality for all incidents by default. Can be overridden per request using the oncall_enable query parameter. |
ONCALL_INITIALIZED_ONLY | Set to true to initialize on-call feature but keep it disabled by default. When set to true , on-call is triggered only for requests that explicitly include ?oncall_enable=true in the URL. |
ONCALL_WAIT_MINUTES | Time in minutes to wait for acknowledgment before escalating (default: 3). Can be overridden per request using the oncall_wait_minutes query parameter. |
ONCALL_PROVIDER | Specify the on-call provider to use ("aws_incident_manager" or "pagerduty"). |
AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN | The ARN of the AWS Incident Manager response plan to use for on-call escalations. Required if on-call provider is "aws_incident_manager". |
AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD | (Optional) AWS Incident Manager response plan ARN for production environment. Can be selected per request using the awsim_other_response_plan=prod query parameter. |
AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV | (Optional) AWS Incident Manager response plan ARN for development environment. Can be selected per request using the awsim_other_response_plan=dev query parameter. |
AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING | (Optional) AWS Incident Manager response plan ARN for staging environment. Can be selected per request using the awsim_other_response_plan=staging query parameter. |
PAGERDUTY_ROUTING_KEY | Integration/Routing key for PagerDuty Events API v2. Required if on-call provider is "pagerduty". |
PAGERDUTY_OTHER_ROUTING_KEY_INFRA | (Optional) PagerDuty routing key for the infrastructure team. Can be selected per request using the pagerduty_other_routing_key=infra query parameter. |
PAGERDUTY_OTHER_ROUTING_KEY_APP | (Optional) PagerDuty routing key for the application team. Can be selected per request using the pagerduty_other_routing_key=app query parameter. |
PAGERDUTY_OTHER_ROUTING_KEY_DB | (Optional) PagerDuty routing key for the database team. Can be selected per request using the pagerduty_other_routing_key=db query parameter. |
Enabling On-Call for Specific Incidents with initialized_only
When you have `initialized_only: true` in your configuration (rather than `enable: true`), on-call is only triggered for incidents that explicitly request it. This is useful when:
- You want the on-call feature ready but not active for all alerts
- You need to selectively enable on-call only for high-priority services or incidents
- You want to let your monitoring system decide which alerts should trigger on-call
Example configuration:
oncall:
enable: false
initialized_only: true # feature ready but not active by default
wait_minutes: 3
provider: aws_incident_manager
# ... provider configuration ...
With this configuration, on-call is only triggered when requested via query parameter:
# This alert will send notifications but NOT trigger on-call escalation
curl -X POST "http://localhost:3000/api/incidents" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[WARNING] Non-critical database latency increase.",
"ServiceName": "database-monitoring",
"UserID": "U12345"
}'
# This alert WILL trigger on-call escalation because of the query parameter
curl -X POST "http://localhost:3000/api/incidents?oncall_enable=true" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[CRITICAL] Production database is down.",
"ServiceName": "core-database",
"UserID": "U12345"
}'
Understanding On-Call Modes:
Mode | Configuration | Behavior |
---|---|---|
Disabled | enable: false initialized_only: false | On-call feature is not initialized. No on-call functionality is available. |
Always Enabled | enable: true | On-call is active for all incidents by default. Can be disabled per request with ?oncall_enable=false . |
Opt-In Only | enable: false initialized_only: true | On-call feature is initialized but inactive by default. Must be explicitly enabled per request with ?oncall_enable=true . |
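The decision described in the table above can be sketched as a small shell function. This is a hypothetical illustration of the routing logic, not the actual server code; the function name and arguments are made up for this example:

```shell
# Hypothetical sketch of the on-call decision from the modes table.
# enable / initialized_only mirror the config values; param holds the
# value of the ?oncall_enable query parameter ("" when absent).
should_trigger_oncall() {
  local enable="$1" initialized_only="$2" param="$3"
  if [ "$param" = "false" ]; then echo "no"; return; fi   # per-request opt-out
  if [ "$enable" = "true" ]; then echo "yes"; return; fi  # Always Enabled mode
  if [ "$initialized_only" = "true" ] && [ "$param" = "true" ]; then
    echo "yes"; return                                    # Opt-In Only mode
  fi
  echo "no"                                               # Disabled mode
}

should_trigger_oncall true  false ""      # Always Enabled, no param  -> yes
should_trigger_oncall false true  "true"  # Opt-In Only, opted in     -> yes
should_trigger_oncall false true  ""      # Opt-In Only, no param     -> no
```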
Redis Configuration
Variable | Description |
---|---|
REDIS_HOST | The hostname or IP address of the Redis server. Required if on-call is enabled. |
REDIS_PORT | The port number of the Redis server. Required if on-call is enabled. |
REDIS_PASSWORD | The password for authenticating with the Redis server. Required if on-call is enabled and Redis requires authentication. |
Ensure these environment variables are properly set before running the application.
Dynamic Configuration with Query Parameters
We provide a way to override configuration values using query parameters, allowing you to send alerts to different channels and customize notification behavior on a per-request basis.
Query Parameter | Description |
---|---|
slack_channel_id | The ID of the Slack channel where alerts will be sent. Use: /api/incidents?slack_channel_id=<your_value> . |
telegram_chat_id | The chat ID where Telegram alerts will be sent. Use: /api/incidents?telegram_chat_id=<your_chat_id> . |
viber_channel_id | The channel ID where Viber alerts will be posted (for Channel API). Use: /api/incidents?viber_channel_id=<your_channel_id> . |
viber_user_id | The user ID where Viber alerts will be sent (for Bot API). Use: /api/incidents?viber_user_id=<your_user_id> . |
email_to | Overrides the default recipient email address for email notifications. Use: /api/incidents?email_to=<recipient_email> . |
email_subject | Overrides the default subject line for email notifications. Use: /api/incidents?email_subject=<custom_subject> . |
msteams_other_power_url | Overrides the default Microsoft Teams Power Automate flow by specifying an alternative key (e.g., qc, ops, dev). Use: /api/incidents?msteams_other_power_url=qc . |
lark_other_webhook_url | Overrides the default Lark webhook URL by specifying an alternative key (e.g., dev, prod). Use: /api/incidents?lark_other_webhook_url=dev . |
oncall_enable | Set to true or false to enable or disable on-call for a specific alert. Use: /api/incidents?oncall_enable=false . |
oncall_wait_minutes | Set the number of minutes to wait for acknowledgment before triggering on-call. Set to 0 to trigger immediately. Use: /api/incidents?oncall_wait_minutes=0 . |
awsim_other_response_plan | Overrides the default AWS Incident Manager response plan ARN by specifying an alternative key (e.g., prod, dev, staging). Use: /api/incidents?awsim_other_response_plan=prod . |
pagerduty_other_routing_key | Overrides the default PagerDuty routing key by specifying an alternative key (e.g., infra, app, db). Use: /api/incidents?pagerduty_other_routing_key=infra . |
Examples for Each Query Parameter
Slack Channel Override
To send an alert to a specific Slack channel (e.g., a dedicated channel for database issues):
curl -X POST "http://localhost:3000/api/incidents?slack_channel_id=C01DB2ISSUES" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Database connection pool exhausted.",
"ServiceName": "database-service",
"UserID": "U12345"
}'
Telegram Chat Override
To send an alert to a different Telegram chat (e.g., for network monitoring):
curl -X POST "http://localhost:3000/api/incidents?telegram_chat_id=-1001234567890" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Network latency exceeding thresholds.",
"ServiceName": "network-monitor",
"UserID": "U12345"
}'
Viber Channel Override
To send an alert to a specific Viber channel (recommended for team notifications):
curl -X POST "http://localhost:3000/api/incidents?viber_channel_id=01234567890A=" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Mobile service experiencing high error rates.",
"ServiceName": "mobile-api",
"UserID": "U12345"
}'
Viber User Override
To send an alert to a specific Viber user (for individual notifications):
curl -X POST "http://localhost:3000/api/incidents?viber_user_id=01234567890A=" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Personal alert for mobile service issue.",
"ServiceName": "mobile-api",
"UserID": "U12345"
}'
Email Recipient Override
To send an email alert to a specific recipient with a custom subject:
curl -X POST "http://localhost:3000/api/incidents?email_to=network-team@yourdomain.com&email_subject=Urgent%20Network%20Issue" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Load balancer failing health checks.",
"ServiceName": "load-balancer",
"UserID": "U12345"
}'
Microsoft Teams Channel Override
You can configure multiple Microsoft Teams channels using the `other_power_urls` setting:
alert:
msteams:
enable: true
power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL}
template_path: "config/msteams_message.tmpl"
other_power_urls:
qc: ${MSTEAMS_OTHER_POWER_URL_QC}
ops: ${MSTEAMS_OTHER_POWER_URL_OPS}
dev: ${MSTEAMS_OTHER_POWER_URL_DEV}
Then, to send an alert to the QC team's Microsoft Teams channel:
curl -X POST "http://localhost:3000/api/incidents?msteams_other_power_url=qc" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Quality check failed for latest deployment.",
"ServiceName": "quality-service",
"UserID": "U12345"
}'
Lark Webhook Override
You can configure multiple Lark webhook URLs using the `other_webhook_urls` setting:
alert:
lark:
enable: true
webhook_url: ${LARK_WEBHOOK_URL}
template_path: "config/lark_message.tmpl"
other_webhook_urls:
dev: ${LARK_OTHER_WEBHOOK_URL_DEV}
prod: ${LARK_OTHER_WEBHOOK_URL_PROD}
Then, to send an alert to the development team's Lark channel:
curl -X POST "http://localhost:3000/api/incidents?lark_other_webhook_url=dev" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Development server crash detected.",
"ServiceName": "dev-server",
"UserID": "U12345"
}'
On-Call Controls
To disable on-call escalation for a non-critical alert:
curl -X POST "http://localhost:3000/api/incidents?oncall_enable=false" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[WARNING] This is a minor issue that doesn't require on-call response.",
"ServiceName": "monitoring-service",
"UserID": "U12345"
}'
To trigger on-call immediately without the normal wait period for a critical issue:
curl -X POST "http://localhost:3000/api/incidents?oncall_wait_minutes=0" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[CRITICAL] Payment processing system down.",
"ServiceName": "payment-service",
"UserID": "U12345"
}'
AWS Incident Manager Response Plan Override
You can configure multiple AWS Incident Manager response plans using the `other_response_plan_arns` setting:
oncall:
enable: true
wait_minutes: 3
provider: aws_incident_manager
aws_incident_manager:
response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN} # Default response plan
other_response_plan_arns:
prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD} # Production environment
dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV} # Development environment
staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING} # Staging environment
Then, to use a specific AWS Incident Manager response plan for a production environment issue:
curl -X POST "http://localhost:3000/api/incidents?awsim_other_response_plan=prod" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[CRITICAL] Production database cluster failure.",
"ServiceName": "prod-database",
"UserID": "U12345"
}'
PagerDuty Routing Key Override
You can configure multiple PagerDuty routing keys using the `other_routing_keys` setting:
oncall:
enable: true
wait_minutes: 3
provider: pagerduty
pagerduty:
routing_key: ${PAGERDUTY_ROUTING_KEY} # Default routing key
other_routing_keys:
infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA} # Infrastructure team
app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP} # Application team
db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB} # Database team
Then, to use a specific PagerDuty routing key for the infrastructure team:
curl -X POST "http://localhost:3000/api/incidents?pagerduty_other_routing_key=infra" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[ERROR] Server load balancer failure in us-west-2.",
"ServiceName": "infrastructure",
"UserID": "U12345"
}'
Combining Multiple Parameters
You can combine multiple query parameters to customize exactly how an incident is handled:
curl -X POST "http://localhost:3000/api/incidents?slack_channel_id=C01PROD&telegram_chat_id=-987654321&oncall_enable=true&oncall_wait_minutes=1" \
-H "Content-Type: application/json" \
-d '{
"Logs": "[CRITICAL] Multiple service failures detected in production environment.",
"ServiceName": "core-infrastructure",
"UserID": "U12345",
"Severity": "CRITICAL"
}'
This will:
- Send the alert to a specific Slack channel (`C01PROD`)
- Send the alert to a specific Telegram chat (`-987654321`)
- Enable on-call escalation with a shortened 1-minute wait time
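If you build these URLs in scripts, a small helper keeps the query string assembly in one place. The function below is a hypothetical convenience for your own tooling, not part of Versus:

```shell
# Hypothetical helper: join key=value pairs into a /api/incidents URL.
build_incident_url() {
  local base="$1"; shift
  local qs=""
  local kv
  for kv in "$@"; do
    qs="${qs:+${qs}&}${kv}"
  done
  printf '%s/api/incidents%s\n' "$base" "${qs:+?${qs}}"
}

build_incident_url "http://localhost:3000" \
  slack_channel_id=C01PROD telegram_chat_id=-987654321 \
  oncall_enable=true oncall_wait_minutes=1
# → http://localhost:3000/api/incidents?slack_channel_id=C01PROD&telegram_chat_id=-987654321&oncall_enable=true&oncall_wait_minutes=1
```

You can then pass the result straight to `curl -X POST`.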
Installing Versus Incident with Helm
This guide explains how to deploy Versus Incident using Helm, a package manager for Kubernetes.
Requirements
- Kubernetes 1.19+
- Helm 3.2.0+
- PV provisioner support in the underlying infrastructure (if persistence is required for Redis)
Installing the Chart
You can install the Versus Incident Helm chart using OCI registry:
helm install versus-incident oci://ghcr.io/versuscontrol/charts/versus-incident
Install with Custom Values
# Install with custom configuration from a values file
helm install \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident \
-f values.yaml
Upgrading an Existing Installation
# Upgrade an existing installation with the latest version
helm upgrade \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident
# Upgrade with custom values
helm upgrade \
versus-incident \
oci://ghcr.io/versuscontrol/charts/versus-incident \
-f values.yaml
Configuration
Quick Start Example
Here's a simple example of a custom values file:
# values.yaml
replicaCount: 2
alert:
slack:
enable: true
token: "xoxb-your-slack-token"
channelId: "C12345"
messageProperties:
buttonText: "Acknowledge Alert"
buttonStyle: "primary"
telegram:
enable: false
email:
enable: false
msteams:
enable: false
lark:
enable: false
Important Parameters
Parameter | Description | Default |
---|---|---|
replicaCount | Number of replicas for the deployment | 2 |
config.publicHost | Public URL for acknowledgment links | "" |
alert.slack.enable | Enable Slack notifications | false |
alert.slack.token | Slack bot token | "" |
alert.slack.channelId | Slack channel ID | "" |
alert.telegram.enable | Enable Telegram notifications | false |
alert.email.enable | Enable email notifications | false |
alert.msteams.enable | Enable Microsoft Teams notifications | false |
alert.lark.enable | Enable Lark notifications | false |
oncall.enable | Enable on-call functionality | false |
oncall.provider | On-call provider ("aws_incident_manager" or "pagerduty") | "aws_incident_manager" |
redis.enabled | Enable bundled Redis (required for on-call) | false |
Notification Channel Configuration
Slack
alert:
slack:
enable: true
token: "xoxb-your-slack-token"
channelId: "C12345"
messageProperties:
buttonText: "Acknowledge Alert"
buttonStyle: "primary" # "primary" (blue), "danger" (red), or empty (default gray)
disableButton: false
Telegram
alert:
telegram:
enable: true
botToken: "your-telegram-bot-token"
chatId: "your-telegram-chat-id"
Email
alert:
email:
enable: true
smtpHost: "smtp.example.com"
smtpPort: 587
username: "your-email@example.com"
password: "your-password"
to: "alerts@example.com"
subject: "Incident Alert"
Microsoft Teams
alert:
msteams:
enable: true
powerAutomateUrl: "your-power-automate-flow-url"
otherPowerUrls:
dev: "dev-team-power-automate-url"
ops: "ops-team-power-automate-url"
Lark
alert:
lark:
enable: true
webhookUrl: "your-lark-webhook-url"
otherWebhookUrls:
dev: "dev-team-webhook-url"
prod: "prod-team-webhook-url"
On-Call Configurations
AWS Incident Manager
oncall:
enable: true
waitMinutes: 3
provider: "aws_incident_manager"
awsIncidentManager:
responsePlanArn: "arn:aws:ssm-incidents::111122223333:response-plan/YourPlan"
otherResponsePlanArns:
prod: "arn:aws:ssm-incidents::111122223333:response-plan/ProdPlan"
dev: "arn:aws:ssm-incidents::111122223333:response-plan/DevPlan"
redis:
enabled: true
auth:
enabled: true
password: "your-redis-password"
architecture: standalone
master:
persistence:
enabled: true
size: 8Gi
PagerDuty
oncall:
enable: true
waitMinutes: 5
provider: "pagerduty"
pagerduty:
routingKey: "your-pagerduty-routing-key"
otherRoutingKeys:
infra: "infrastructure-team-routing-key"
app: "application-team-routing-key"
db: "database-team-routing-key"
redis:
enabled: true
auth:
enabled: true
password: "your-redis-password"
architecture: standalone
master:
persistence:
enabled: true
size: 8Gi
Redis Configuration
Redis is required for on-call functionality. The chart can either deploy its own Redis instance or connect to an external one.
External Redis
redis:
enabled: false
externalRedis:
host: "redis.example.com"
port: 6379
password: "your-redis-password"
insecureSkipVerify: false
db: 0
Custom Alert Templates
You can provide custom templates for each notification channel:
templates:
slack: |
*Critical Error in {{.ServiceName}}*
----------
Error Details:
```
{{.Logs}}
```
----------
Owner <@{{.UserID}}> please investigate
telegram: |
π¨ <b>Critical Error Detected!</b> π¨
π <b>Service:</b> {{.ServiceName}}
β οΈ <b>Error Details:</b>
{{.Logs}}
AWS Integrations
Versus Incident can receive alerts from AWS SNS:
AWS SNS
alert:
sns:
enable: true
httpsEndpointSubscriptionPath: "/sns"
Uninstalling the Chart
To uninstall/delete the `versus-incident` deployment:
helm uninstall versus-incident
Additional Resources
Advanced Template Tips
Table of Contents
Multi-Service Template
Handle multiple alerts in one template:
{{ $service := .source | replace "aws." "" | upper }}
π‘ *{{$service}} Alert*
{{ if eq .source "aws.glue" }}
π§ Job: {{.detail.jobName}}
{{ else if eq .source "aws.ec2" }}
π₯ Instance: {{index .detail "instance-id"}}
{{ end }}
π *Details*: {{.detail | toJson}}
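The first line's pipeline turns a source like `aws.glue` into `GLUE`. The shell equivalent below illustrates the same transformation performed by `replace "aws." ""` and `upper`:

```shell
# Shell equivalent of the template pipeline:
#   {{ .source | replace "aws." "" | upper }}
source="aws.glue"
service=$(printf '%s' "$source" | sed 's/^aws\.//' | tr '[:lower:]' '[:upper:]')
echo "$service"   # → GLUE
```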
If a field may not exist in the incoming payload, use the template's `printf` function to handle it safely:
{{ if contains (printf "%v" .source) "aws.glue" }}
π₯ *Glue Job Failed*: {{.detail.jobName}}
β Error:
```{{.detail.errorMessage}}```
{{ else }}
π₯ *Critical Error in {{.ServiceName}}*
β Error Details:
```{{.Logs}}```
Owner <@{{.UserID}}> please investigate
{{ end }}
Conditional Formatting
Highlight critical issues:
{{ if gt .detail.actualValue .detail.threshold }}
π¨ CRITICAL: {{.detail.alarmName}} ({{.detail.actualValue}}%)
{{ else }}
β οΈ WARNING: {{.detail.alarmName}} ({{.detail.actualValue}}%)
{{ end }}
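The same threshold check, written in shell for illustration (the values and metric name are hypothetical):

```shell
# Mirror of the template's {{ if gt .detail.actualValue .detail.threshold }}
actual=85
threshold=80
if [ "$actual" -gt "$threshold" ]; then
  msg="CRITICAL: CPUUtilization (${actual}%)"
else
  msg="WARNING: CPUUtilization (${actual}%)"
fi
echo "$msg"   # → CRITICAL: CPUUtilization (85%)
```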
Best Practices for Custom Templates
- Keep It Simple: Focus on the most critical details for each alert.
- Use Conditional Logic: Tailor messages based on event severity or type.
- Test Your Templates: Use sample SNS messages to validate your templates.
- Document Your Templates: Share templates with your team for consistency.
How to Customize Alert Messages from Alertmanager to Slack and Telegram
Table of Contents
- Configure Alertmanager Webhook
- Launch Versus with Slack/Telegram
- Test
- Advanced: Dynamic Channel Routing
- Troubleshooting Tips
In this guide, you'll learn how to route Prometheus Alertmanager alerts to Slack and Telegram using Versus Incident, while fully customizing alert messages.
Configure Alertmanager Webhook
Update your `alertmanager.yml` to forward alerts to Versus:
route:
receiver: 'versus-incident'
group_wait: 10s
receivers:
- name: 'versus-incident'
webhook_configs:
- url: 'http://versus-host:3000/api/incidents' # Versus API endpoint
send_resolved: false
# Additional settings (if needed):
# http_config:
# tls_config:
# insecure_skip_verify: true # For self-signed certificates
For example, alert rules:
groups:
- name: cluster
rules:
- alert: PostgresqlDown
expr: pg_up == 0
for: 0m
labels:
severity: critical
annotations:
summary: Postgresql down (instance {{ $labels.instance }})
description: "Postgresql instance is down."
Alertmanager sends alerts to the webhook in JSON format. Here's an example of the payload:
{
"receiver": "webhook-incident",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "PostgresqlDown",
"instance": "postgresql-prod-01",
"severity": "critical"
},
"annotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"startsAt": "2023-10-01T12:34:56.789Z",
"endsAt": "2023-10-01T12:44:56.789Z",
"generatorURL": ""
}
],
"groupLabels": {
"alertname": "PostgresqlDown"
},
"commonLabels": {
"alertname": "PostgresqlDown",
"severity": "critical",
"instance": "postgresql-prod-01"
},
"commonAnnotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"externalURL": ""
}
Next, we will deploy Versus Incident and configure it with a custom template to send alerts to both Slack and Telegram for this payload.
Launch Versus with Slack/Telegram
Create a configuration file `config/config.yaml`:
name: versus
host: 0.0.0.0
port: 3000
alert:
slack:
enable: true
token: ${SLACK_TOKEN}
channel_id: ${SLACK_CHANNEL_ID}
template_path: "/app/config/slack_message.tmpl"
telegram:
enable: true
bot_token: ${TELEGRAM_BOT_TOKEN}
chat_id: ${TELEGRAM_CHAT_ID}
template_path: "/app/config/telegram_message.tmpl"
Create Slack and Telegram templates.
`config/slack_message.tmpl`:
π₯ *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
π *Instance*: `{{ .commonLabels.instance }}`
π¨ *Status*: `{{ .status }}`
{{ range .alerts }}
π {{ .annotations.description }}
β° *Firing since*: {{ .startsAt | formatTime }}
{{ end }}
π *Dashboard*: <{{ .externalURL }}|Investigate>
`config/telegram_message.tmpl`:
π© <b>{{ .commonLabels.alertname }}</b>
{{ range .alerts }}
π {{ .startsAt | formatTime }}
{{ .annotations.summary }}
{{ end }}
<pre>
Status: {{ .status }}
Severity: {{ .commonLabels.severity }}
</pre>
Run Versus:
docker run -d -p 3000:3000 \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN=xoxb-your-token \
-e SLACK_CHANNEL_ID=C12345 \
-e TELEGRAM_ENABLE=true \
-e TELEGRAM_BOT_TOKEN=123:ABC \
-e TELEGRAM_CHAT_ID=-456789 \
-v ./config:/app/config \
ghcr.io/versuscontrol/versus-incident
Test
Trigger a test alert using `curl`:
curl -X POST http://localhost:3000/api/incidents \
-H "Content-Type: application/json" \
-d '{
"receiver": "webhook-incident",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "PostgresqlDown",
"instance": "postgresql-prod-01",
"severity": "critical"
},
"annotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"startsAt": "2023-10-01T12:34:56.789Z",
"endsAt": "2023-10-01T12:44:56.789Z",
"generatorURL": ""
}
],
"groupLabels": {
"alertname": "PostgresqlDown"
},
"commonLabels": {
"alertname": "PostgresqlDown",
"severity": "critical",
"instance": "postgresql-prod-01"
},
"commonAnnotations": {
"summary": "Postgresql down (instance postgresql-prod-01)",
"description": "Postgresql instance is down."
},
"externalURL": ""
}'
Final Result:
Advanced: Dynamic Channel Routing
Override Slack channels per alert using query parameters:
POST http://versus-host:3000/api/incidents?slack_channel_id=EMERGENCY-CHANNEL
Troubleshooting Tips
- Enable debug mode: `DEBUG_BODY=true`
- Check Versus logs: `docker logs versus`
If you encounter any issues or have further questions, feel free to reach out!
Configuring Fluent Bit to Send Error Logs to Versus Incident
Table of Contents
- Understand the Log Format
- Configure Fluent Bit Filters
- Configure Fluent Bit Output
- Full Fluent Bit Configuration Example
- Test the Configuration
- Conclusion
Fluent Bit is a lightweight log processor and forwarder that can filter, modify, and forward logs to various destinations. In this tutorial, we will configure Fluent Bit to filter logs containing [ERROR] and send them to the Versus Incident Management System using its REST API.
Understand the Log Format
The log format provided is as follows; you can create a `sample.log` file:
[2023/01/22 09:46:49] [ INFO ] This is info logs 1
[2023/01/22 09:46:49] [ INFO ] This is info logs 2
[2023/01/22 09:46:49] [ INFO ] This is info logs 3
[2023/01/22 09:46:49] [ ERROR ] This is error logs
We are interested in filtering logs that contain `[ ERROR ]`.
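Before wiring the pattern into Fluent Bit, you can verify it against the sample lines directly with `grep`:

```shell
# The same regex the Fluent Bit grep filter uses below; only the
# ERROR line should pass through.
printf '%s\n' \
  '[2023/01/22 09:46:49] [ INFO ] This is info logs 1' \
  '[2023/01/22 09:46:49] [ ERROR ] This is error logs' \
  | grep -E '\[.*ERROR.*\].*'
# → [2023/01/22 09:46:49] [ ERROR ] This is error logs
```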
Configure Fluent Bit Filters
To filter and process logs, we use the `grep` and `modify` filters in Fluent Bit.
Filter Configuration
Add the following configuration to your Fluent Bit configuration file:
# Filter Section - Grep for ERROR logs
[FILTER]
Name grep
Match versus.*
Regex log .*\[.*ERROR.*\].*
# Filter Section - Modify fields
[FILTER]
Name modify
Match versus.*
Rename log Logs
Set ServiceName order-service
Explanation
- Grep Filter: Matches only logs that contain `[ ERROR ]`. The `Regex` field uses a regular expression to identify lines with the `[ ERROR ]` keyword.
- Modify Filter: Adds or modifies fields in the log record. It renames the `log` field to `Logs` and sets the `ServiceName` field used by the default template. You can set whichever fields your template expects.
Default Telegram Template
π¨ <b>Critical Error Detected!</b> π¨
π <b>Service:</b> {{.ServiceName}}
β οΈ <b>Error Details:</b>
{{.Logs}}
Configure Fluent Bit Output
To send filtered logs to the Versus Incident Management System, we use the `http` output plugin.
Output Configuration
Add the following configuration to your Fluent Bit configuration file:
...
# Output Section - Send logs to Versus Incident via HTTP
[OUTPUT]
Name http
Match versus.*
Host localhost
Port 3000
URI /api/incidents
Format json_stream
Explanation
- Name: Specifies the output plugin (`http` in this case).
- Match: Matches all logs processed by the previous filters.
- Host and Port: Specify the host and port of the Versus Incident Management System (default is `localhost:3000`).
- URI: Specifies the endpoint for creating incidents (`/api/incidents`).
- Format: Ensures the payload is sent in JSON Stream format.
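After the filters run, each record that reaches the output is roughly the JSON object sketched below (field names come from the modify filter above; the real record also carries a `date` field added by Fluent Bit):

```shell
# Approximate shape of one record sent by the http output after the
# grep and modify filters have renamed log -> Logs and set ServiceName.
log_line='[2023/01/22 09:46:49] [ ERROR ] This is error logs'
payload=$(printf '{"Logs":"%s","ServiceName":"%s"}' "$log_line" "order-service")
echo "$payload"
# → {"Logs":"[2023/01/22 09:46:49] [ ERROR ] This is error logs","ServiceName":"order-service"}
```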
Full Fluent Bit Configuration Example
Here is the complete Fluent Bit configuration file:
# Input Section
[INPUT]
Name tail
Path sample.log
Tag versus.*
Mem_Buf_Limit 5MB
Skip_Long_Lines On
# Filter Section - Grep for ERROR logs
[FILTER]
Name grep
Match versus.*
Regex log .*\[.*ERROR.*\].*
# Filter Section - Modify fields
[FILTER]
Name modify
Match versus.*
Rename log Logs
Set ServiceName order-service
# Output Section - Send logs to Versus Incident via HTTP
[OUTPUT]
Name http
Match versus.*
Host localhost
Port 3000
URI /api/incidents
Format json_stream
Test the Configuration
Run Versus Incident:
docker run -p 3000:3000 \
-e TELEGRAM_ENABLE=true \
-e TELEGRAM_BOT_TOKEN=your_token \
-e TELEGRAM_CHAT_ID=your_channel \
ghcr.io/versuscontrol/versus-incident
Run Fluent Bit with the configuration file:
fluent-bit -c /path/to/fluent-bit.conf
Check the logs in the Versus Incident Management System. You should see an incident created with the following details:
Raw Request Body: {"date":1738999456.96342,"Logs":"[2023/01/22 09:46:49] [ ERROR ] This is error logs","ServiceName":"order-service"}
2025/02/08 14:24:18 POST /api/incidents 201 127.0.0.1 Fluent-Bit
Conclusion
By following the steps above, you can configure Fluent Bit to filter error logs and send them to the Versus Incident Management System. This integration enables automated incident management, ensuring that critical errors are promptly addressed by your DevOps team.
If you encounter any issues or have further questions, feel free to reach out!
Configuring CloudWatch to send Alert to Versus Incident
Table of Contents
- Create an SNS Topic
- Create a CloudWatch Alarm for RDS CPU
- Versus Incident
- Subscribe Versus to the SNS Topic
- Test the Integration
- Conclusion
In this guide, you'll learn how to set up a CloudWatch alarm that triggers when RDS CPU usage exceeds 80% and sends an alert to Slack and Telegram.
Prerequisites
- AWS account with access to RDS, CloudWatch, and SNS
- A running RDS instance (replace my-rds-instance with your instance ID)
- Slack and Telegram API tokens
Steps
1. Create SNS Topic and Subscription.
2. Create CloudWatch Alarm.
3. Deploy Versus Incident with Slack and Telegram configurations.
4. Subscribe Versus to the SNS Topic.
Create an SNS Topic
Create an SNS topic to route CloudWatch Alarms to Versus:
aws sns create-topic --name RDS-CPU-Alarm-Topic
Create a CloudWatch Alarm for RDS CPU
Set up an alarm to trigger when RDS CPU exceeds 80% for 5 minutes.
aws cloudwatch put-metric-alarm \
--alarm-name "RDS_CPU_High" \
--alarm-description "RDS CPU utilization over 80%" \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=my-rds-instance \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:RDS-CPU-Alarm-Topic
Explanation:
- `--namespace AWS/RDS`: Specifies RDS metrics.
- `--metric-name CPUUtilization`: Tracks CPU usage.
- `--dimensions`: Identifies your RDS instance.
- `--alarm-actions`: The SNS topic ARN where alerts are sent.
Versus Incident
Next, we will deploy Versus Incident and configure it with a custom template to send alerts to both Slack and Telegram. Enable SNS support in `config/config.yaml`:
name: versus
host: 0.0.0.0
port: 3000
alert:
debug_body: true
slack:
enable: true
token: ${SLACK_TOKEN}
channel_id: ${SLACK_CHANNEL_ID}
template_path: "/app/config/slack_message.tmpl"
telegram:
enable: true
bot_token: ${TELEGRAM_BOT_TOKEN}
chat_id: ${TELEGRAM_CHAT_ID}
template_path: "/app/config/telegram_message.tmpl"
queue:
enable: true
sns:
enable: true
https_endpoint_subscription_path: /sns
When your RDS_CPU_High alarm triggers, SNS will send a notification to your HTTP endpoint. The message will be a JSON object wrapped in an SNS envelope. Here's an example of what the JSON payload of the Message field might look like:
{
"AlarmName": "RDS_CPU_High",
"AlarmDescription": "RDS CPU utilization over 80%",
"AWSAccountId": "123456789012",
"NewStateValue": "ALARM",
"NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints was greater than the threshold (80.0). The most recent datapoint: 85.3.",
"StateChangeTime": "2025-03-17T12:34:56.789Z",
"Region": "US East (N. Virginia)",
"OldStateValue": "OK",
"Trigger": {
"MetricName": "CPUUtilization",
"Namespace": "AWS/RDS",
"StatisticType": "Statistic",
"Statistic": "AVERAGE",
"Unit": "Percent",
"Period": 300,
"EvaluationPeriods": 1,
"ComparisonOperator": "GreaterThanThreshold",
"Threshold": 80.0,
"TreatMissingData": "missing",
"Dimensions": [
{
"Name": "DBInstanceIdentifier",
"Value": "my-rds-instance"
}
]
}
}
Create Slack and Telegram templates, e.g. `config/slack_message.tmpl`:
*π¨ CloudWatch Alarm: {{.AlarmName}}*
----------
Description: {{.AlarmDescription}}
Current State: {{.NewStateValue}}
Timestamp: {{.StateChangeTime}}
----------
Owner <@${USERID}>: Investigate immediately!
`config/telegram_message.tmpl`:
π¨ <b>{{.AlarmName}}</b>
π <b>Status:</b> {{.NewStateValue}}
β οΈ <b>Description:</b> {{.AlarmDescription}}
π <b>Time:</b> {{.StateChangeTime}}
Deploy with Docker:
docker run -d \
-p 3000:3000 \
-v $(pwd)/config:/app/config \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN=your_slack_token \
-e SLACK_CHANNEL_ID=your_channel_id \
-e TELEGRAM_ENABLE=true \
-e TELEGRAM_BOT_TOKEN=your_token \
-e TELEGRAM_CHAT_ID=your_channel \
--name versus \
ghcr.io/versuscontrol/versus-incident
Versus Incident is running and accessible at:
http://localhost:3000/sns
For testing purposes, we can use ngrok to expose the Versus instance running on localhost to the internet.
ngrok http 3000 --url your-versus-https-url.ngrok-free.app
This URL is available to anyone on the internet.
Subscribe Versus to the SNS Topic
Subscribe Versus's /sns endpoint to the topic, replacing the endpoint with your deployment URL:
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:RDS-CPU-Alarm-Topic \
--protocol https \
--notification-endpoint https://your-versus-https-url.ngrok-free.app/sns
Test the Integration
- Simulate high CPU load on your RDS instance (e.g., run intensive queries).
- Check the CloudWatch console to confirm the alarm triggers.
- Verify Versus Incident receives the SNS payload and sends alerts to Slack and Telegram.
Conclusion
By integrating CloudWatch Alarms with Versus Incident via SNS, you centralize alert management and ensure critical infrastructure issues are promptly routed to Slack, Telegram, or Email.
If you encounter any issues or have further questions, feel free to reach out!
How to Configure Sentry to Send Alerts to MS Teams
Table of Contents
- Set Up Microsoft Teams Integration (2025 Update)
- Deploy Versus Incident with MS Teams Enabled
- Configure Sentry with Integration Webhooks
- Test the Integration
- Conclusion
This guide will show you how to route Sentry alerts through Versus Incident to Microsoft Teams, enabling your team to respond to application issues quickly and efficiently.
Prerequisites
- Microsoft Teams channel with Power Automate or webhook permissions
- Sentry account with project owner permissions
Set Up Microsoft Teams Integration (2025 Update)
Microsoft has announced the retirement of Office 365 Connectors (including Incoming Webhooks) by the end of 2025. Versus Incident supports both the legacy webhook method and the new Power Automate Workflows method. We recommend using Power Automate Workflows for all new deployments.
Option 1: Set Up a Power Automate Workflow (Recommended)
Follow these steps to create a Power Automate workflow to receive alerts in Microsoft Teams:
- Sign in to Power Automate
- Click Create and select Instant cloud flow
- Name your flow (e.g., "Versus Incident Alerts")
- Select When a HTTP request is received as the trigger and click Create
- In the HTTP trigger, you'll see a generated HTTP POST URL. Copy this URL - you'll need it later
- Click + New step and search for "Teams"
- Select Post a message in a chat or channel (under Microsoft Teams)
- Configure the action:
- Choose Channel as the Post as option
- Select your Team and Channel
- For the Message field, add:
@{triggerBody()?['messageText']}
- Click Save to save your flow
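The workflow above reads the alert text from a `messageText` field, so the HTTP body Versus posts to the trigger URL looks roughly like this (the rendered content is illustrative; Versus fills it from your template):

```json
{
  "messageText": "🚨 Sentry Alert: Example Issue\nProject: Example Project\nPlease investigate this issue immediately."
}
```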
Option 2: Set Up an MS Teams Webhook (Legacy Method)
For backward compatibility, Versus still supports the traditional webhook method (being retired by end of 2025):
- Open MS Teams and go to the channel where you want alerts to appear.
- Click the three dots (…) next to the channel name and select Connectors.
- Find Incoming Webhook, click Add, then Add again in the popup.
- Name your webhook (e.g., Sentry Alerts) and optionally upload an image.
- Click Create, then copy the generated webhook URL. Save this URL; you'll need it later.
Deploy Versus Incident with MS Teams Enabled
Next, configure Versus Incident to forward alerts to MS Teams. Create a directory for your configuration files:
mkdir -p ./config
Create `config/config.yaml` with the following content for Power Automate (recommended):
name: versus
host: 0.0.0.0
port: 3000
alert:
debug_body: true
msteams:
enable: false # Default value, will be overridden by MSTEAMS_ENABLE env var
power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL} # Power Automate HTTP trigger URL
template_path: "config/msteams_message.tmpl"
Before writing the template, it helps to know the payload shape. Here is the JSON format of a Sentry Integration Webhook payload that Versus will receive:
{
"action": "created",
"data": {
"issue": {
"id": "123456",
"title": "Example Issue",
"culprit": "example_function in example_module",
"shortId": "PROJECT-1",
"project": {
"id": "1",
"name": "Example Project",
"slug": "example-project"
},
"metadata": {
"type": "ExampleError",
"value": "This is an example error"
},
"status": "unresolved",
"level": "error",
"firstSeen": "2023-10-01T12:00:00Z",
"lastSeen": "2023-10-01T12:05:00Z",
"count": 5,
"userCount": 3
}
},
"installation": {
"uuid": "installation-uuid"
},
"actor": {
"type": "user",
"id": "789",
"name": "John Doe"
}
}
Now, create a rich MS Teams template in config/msteams_message.tmpl
:
**🚨 Sentry Alert: {{.data.issue.title}}**
**Project**: {{.data.issue.project.name}}
**Issue URL**: {{.data.issue.url}}
Please investigate this issue immediately.
This template uses Markdown to format the alert in MS Teams, pulling data from the Sentry webhook payload (e.g., `{{.data.issue.title}}`).
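Go templates also support conditionals, so you can vary the message by severity. An illustrative variant, using only fields present in the sample payload above:

```
{{ if eq .data.issue.level "error" }}**🚨 Sentry Error: {{.data.issue.title}}**{{ else }}**⚠️ Sentry Alert: {{.data.issue.title}}**{{ end }}
**Project**: {{.data.issue.project.name}}
**Occurrences**: {{.data.issue.count}} events affecting {{.data.issue.userCount}} users
Please investigate this issue immediately.
```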
Note about MS Teams notifications (April 2025): The system will automatically extract "Sentry Alert: {{.data.issue.title}}" as the summary for Microsoft Teams notifications, and generate a plain text version as a fallback. You don't need to add these fields manually - Versus Incident handles this to ensure proper display in Microsoft Teams.
Run Versus Incident using Docker, mounting your configuration files and setting the MS Teams Power Automate URL as an environment variable:
docker run -d \
-p 3000:3000 \
-v $(pwd)/config:/app/config \
-e MSTEAMS_ENABLE=true \
-e MSTEAMS_POWER_AUTOMATE_URL="your_power_automate_url" \
--name versus \
ghcr.io/versuscontrol/versus-incident
Replace `your_power_automate_url` with the URL you copied from Power Automate. The Versus Incident API endpoint for receiving alerts is now available at:
http://localhost:3000/api/incidents
Configure Sentry with Integration Webhooks
Versus Incident is specifically designed to work with Sentry Integration Webhooks - a feature that allows Sentry to send detailed issue data to external services when specific events occur. Here's how to set it up:
- Log in to your Sentry account and navigate to your project.
- Go to Settings > Integrations > Webhook.
- Click on Install (or Configure if already installed).
- Enter a name for your webhook (e.g., "Versus Incident").
- For the webhook URL, enter:
  - If Versus is running locally: `http://localhost:3000/api/incidents`
  - If deployed elsewhere: `https://your-versus-domain.com/api/incidents`
- Under Alerts, make sure Issue Alerts is checked.
- Under Services, check Issue to receive issue-related events.
- Click Save Changes.
Create Alert Rules with the Webhook Integration
Next, create alert rules that will use this webhook:
- Go to Alerts in the sidebar and click Create Alert Rule.
- Define the conditions for your alert, such as:
- When: "A new issue is created"
- Filter: (Optional) Add filters like "error level is fatal"
- Under Actions, select Send a notification via a webhook.
- Select the webhook you created earlier.
- Save the alert rule.
Sentry will now send standardized Integration webhook payloads to Versus Incident whenever the alert conditions are met. These payloads contain comprehensive issue details including stack traces, error information, and project metadata that Versus Incident can parse and format for MS Teams.
Test the Integration
To confirm everything works, simulate a Sentry alert using curl:
curl -X POST http://localhost:3000/api/incidents \
-H "Content-Type: application/json" \
-d '{
"action": "created",
"data": {
"issue": {
"id": "123456",
"title": "Example Issue",
"culprit": "example_function in example_module",
"shortId": "PROJECT-1",
"project": {
"id": "1",
"name": "Example Project",
"slug": "example-project"
},
"metadata": {
"type": "ExampleError",
"value": "This is an example error"
},
"status": "unresolved",
"level": "error",
"firstSeen": "2023-10-01T12:00:00Z",
"lastSeen": "2023-10-01T12:05:00Z",
"count": 5,
"userCount": 3
}
},
"installation": {
"uuid": "installation-uuid"
},
"actor": {
"type": "user",
"id": "789",
"name": "John Doe"
}
}'
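With the template above, this payload should produce an MS Teams message roughly like the following (the sample payload has no `data.issue.url` field, so the Issue URL line will render empty or as a placeholder for this test; real Sentry payloads typically include it):

```
🚨 Sentry Alert: Example Issue
Project: Example Project
Issue URL:
Please investigate this issue immediately.
```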
Alternatively, trigger a real error in your Sentry-monitored application and verify the alert appears in MS Teams.
Conclusion
By connecting Sentry to MS Teams via Versus Incident, you've created a streamlined alerting system that keeps your team informed of critical issues in real-time. The Sentry Integration Webhook provides rich, detailed information about each issue, and Versus Incident's flexible templating system allows you to present this information in a clear, actionable format for your team.
Configure Kibana to Send Alerts to Slack and Telegram
Table of Contents
- Prerequisites
- Step 1: Set Up Slack and Telegram Bots
- Step 2: Deploy Versus Incident with Slack and Telegram Enabled
- Step 3: Configure Kibana Alerts with a Webhook
- Step 4: Test the Integration
- Conclusion
Kibana, part of the Elastic Stack, provides powerful monitoring and alerting capabilities for your applications and infrastructure. However, its native notification options are limited.
In this guide, we'll walk through setting up Kibana to send alerts to Versus Incident, which will then forward them to Slack and Telegram using custom templates.
Prerequisites
- A running Elastic Stack (Elasticsearch and Kibana) instance with alerting enabled (Kibana 7.13+ required for the Alerting feature).
- A Slack workspace with permissions to create a bot and obtain a token.
- A Telegram account with a bot created via BotFather and a chat ID for your target group or channel.
- Docker installed (optional, for easy Versus Incident deployment).
Step 1: Set Up Slack and Telegram Bots
Slack Bot
- Visit api.slack.com/apps and click Create New App.
- Name your app (e.g., "Kibana Alerts") and select your Slack workspace.
- Under Bot Users, add a bot (e.g., "KibanaBot") and enable it.
- Go to OAuth & Permissions and add the `chat:write` scope under Scopes.
- Install the app to your workspace and copy the Bot User OAuth Token (starts with `xoxb-`). Save it securely.
- Invite the bot to your Slack channel by typing `/invite @KibanaBot` in the channel, and note the channel ID (right-click the channel, copy the link, and extract the ID).
Telegram Bot
- Open Telegram and search for BotFather.
- Start a chat and type `/newbot`. Follow the prompts to name your bot (e.g., "KibanaAlertBot").
- BotFather will provide a Bot Token (e.g., `123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11`). Save it securely.
- Create a group or channel in Telegram, add your bot, and get the Chat ID:
  - Send a message to the group/channel via the bot.
  - Use `https://api.telegram.org/bot<YourBotToken>/getUpdates` in a browser to retrieve the `chat.id` (e.g., `-123456789`).
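The getUpdates response is JSON, and the chat ID you need is at `result[].message.chat.id`. A trimmed, illustrative response:

```json
{
  "ok": true,
  "result": [
    {
      "update_id": 123456789,
      "message": {
        "message_id": 1,
        "chat": {
          "id": -123456789,
          "title": "Kibana Alerts",
          "type": "group"
        },
        "text": "hello"
      }
    }
  ]
}
```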
Step 2: Deploy Versus Incident with Slack and Telegram Enabled
Versus Incident acts as a bridge between Kibana and your notification channels. We'll configure it to handle both Slack and Telegram alerts.
Create Configuration Files
- Create a directory for configuration:
mkdir -p ./config
- Create `config/config.yaml` with the following content:
name: versus
host: 0.0.0.0
port: 3000
alert:
slack:
enable: true
token: ${SLACK_TOKEN}
channel_id: ${SLACK_CHANNEL_ID}
template_path: "/app/config/slack_message.tmpl"
telegram:
enable: true
bot_token: ${TELEGRAM_BOT_TOKEN}
chat_id: ${TELEGRAM_CHAT_ID}
template_path: "/app/config/telegram_message.tmpl"
- Create a Slack template at `config/slack_message.tmpl`:
🚨 *Kibana Alert: {{.name}}*
**Message**: {{.message}}
**Status**: {{.status}}
**Kibana URL**: <{{.kibanaUrl}}|View in Kibana>
Please investigate this issue.
- Create a Telegram template at `config/telegram_message.tmpl` (using HTML formatting):
🚨 <b>Kibana Alert: {{.name}}</b>
<b>Message</b>: {{.message}}
<b>Status</b>: {{.status}}
<b>Kibana URL</b>: <a href="{{.kibanaUrl}}">View in Kibana</a>
Please investigate this issue.
Run Versus Incident with Docker
Deploy Versus Incident with the configuration and environment variables:
docker run -d \
-p 3000:3000 \
-v $(pwd)/config:/app/config \
-e SLACK_ENABLE=true \
-e SLACK_TOKEN="your_slack_bot_token" \
-e SLACK_CHANNEL_ID="your_slack_channel_id" \
-e TELEGRAM_ENABLE=true \
-e TELEGRAM_BOT_TOKEN="your_telegram_bot_token" \
-e TELEGRAM_CHAT_ID="your_telegram_chat_id" \
--name versus \
ghcr.io/versuscontrol/versus-incident
- Replace `your_slack_bot_token` and `your_slack_channel_id` with your Slack values.
- Replace `your_telegram_bot_token` and `your_telegram_chat_id` with your Telegram values.
The Versus Incident API endpoint is now available at `http://localhost:3000/api/incidents`.
Step 3: Configure Kibana Alerts with a Webhook
Kibana's Alerting feature allows you to send notifications via webhooks. We'll configure it to send alerts to Versus Incident.
- Log in to Kibana and go to Stack Management > Alerts and Insights > Rules.
- Click Create Rule.
- Define your rule:
- Name: e.g., "High CPU Alert".
- Connector: Select an index or data view to monitor (e.g., system metrics).
- Condition: Set a condition, such as "CPU usage > 80% over the last 5 minutes".
- Check every: 1 minute (or your preferred interval).
- Add an Action:
- Action Type: Select Webhook.
- URL: `http://localhost:3000/api/incidents` (or your deployed Versus URL, e.g., `https://your-versus-domain.com/api/incidents`).
- Method: POST.
- Headers: Add `Content-Type: application/json`.
- Body: Use this JSON template to match Versus Incident's expected fields:
{
  "name": "{{rule.name}}",
  "message": "{{context.message}}",
  "status": "{{alert.state}}",
  "kibanaUrl": "{{kibanaBaseUrl}}/app/management/insightsAndAlerting/rules/{{rule.id}}"
}
- Save the rule.
Kibana will now send a JSON payload to Versus Incident whenever the alert condition is met.
Step 4: Test the Integration
Simulate a Kibana alert using `curl` to test the setup:
curl -X POST http://localhost:3000/api/incidents \
-H "Content-Type: application/json" \
-d '{
"name": "High CPU Alert",
"message": "CPU usage exceeded 80% on server-01",
"status": "active",
"kibanaUrl": "https://your-kibana-instance.com/app/management/insightsAndAlerting/rules/12345"
}'
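If everything is wired up, the Slack template above should render this test payload roughly as:

```
🚨 *Kibana Alert: High CPU Alert*
**Message**: CPU usage exceeded 80% on server-01
**Status**: active
**Kibana URL**: <https://your-kibana-instance.com/app/management/insightsAndAlerting/rules/12345|View in Kibana>
Please investigate this issue.
```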
Alternatively, trigger a real alert in Kibana (e.g., by simulating high CPU usage in your monitored system) and confirm the notifications appear in both Slack and Telegram.
Conclusion
By integrating Kibana with Versus Incident, you can send alerts to Slack and Telegram with customized, actionable messages that enhance your team's incident response. This setup is flexible and scalable: Versus Incident also supports additional channels like Microsoft Teams and Email, as well as on-call integrations like AWS Incident Manager.
If you encounter any issues or have further questions, feel free to reach out!
On Call
This document provides a step-by-step guide to integrating Versus Incident with on-call solutions. We currently support AWS Incident Manager and PagerDuty, with plans to support more tools in the future.
Before diving into how Versus integrates with on-call systems, let's start with the basics. You need to understand the on-call platforms we support:
Understanding AWS Incident Manager On-Call
Understanding PagerDuty On-Call
Understanding AWS Incident Manager On-Call
AWS Incident Manager On-Call is a service that helps organizations manage and respond to incidents quickly and effectively; it is part of AWS Systems Manager. This document explains the key parts of AWS Incident Manager On-Call (contacts, escalation plans, runbooks, and response plans) in a simple and clear way.
Key Components of AWS Incident Manager On-Call
AWS Incident Manager On-Call relies on four main pieces: contacts, escalation plans, runbooks, and response plans. Let's break them down one by one.
1. Contacts
Contacts are the people who get notified when an incident happens. These could be:
- On-call engineers (the ones on duty to fix things).
- Experts who know specific systems.
- Managers or anyone else who needs to stay in the loop.
Each contact has contact methods: ways to reach them, like:
- SMS (text messages).
- Email.
- Voice calls.
Example: Imagine Natsu is an on-call engineer. His contact info might include:
- SMS: +84 3127 12 567
- Email: natsu@devopsvn.tech
If an incident occurs, AWS Incident Manager can send him a text and an email to let him know he's needed.
2. Escalation Plans
An escalation plan is a set of rules that decides who gets notified, and in what order, if an incident isn't handled quickly. It's like a backup plan to make sure someone responds, even if the first person is unavailable.
You can set it up to:
- Notify people simultaneously (all at once).
- Notify people sequentially (one after another, with a timeout between each).
Example: Suppose you have three engineers: Natsu, Zeref, and Igneel. Your escalation plan might say:
- Stage 1: Notify Natsu.
- Stage 2: If Natsu doesn't respond in 5 minutes, notify Zeref.
- Stage 3: If Zeref doesn't respond in another 5 minutes, notify Igneel.
This way, the incident doesn't get stuck waiting for one person; it keeps moving until someone takes action.
3. Runbooks (Optional)
Runbooks are like instruction manuals that AWS can follow automatically to fix an incident. They're built in AWS Systems Manager Automation and contain steps to solve common problems without needing a human to step in.
Runbooks can:
- Restart a crashed service.
- Add more resources (like extra servers) if something's overloaded.
- Run checks to figure out what's wrong.
Example: Let's say your web server stops working. A runbook called "WebServerRestart" could:
- Automatically detect the issue.
- Restart the server in seconds.
This saves time by fixing the problem before an engineer even picks up their phone.
4. Response Plans
A response plan is the master plan that pulls everything together. It tells AWS Incident Manager:
- Which contacts to notify.
- Which escalation plan to follow.
- Which runbooks to run.
It can have multiple stages, each with its own actions and time limits, to handle an incident step-by-step.
Example: For a critical incident (like a web application going offline), a response plan might look like this:
- Stage 1: Run the "WebServerRestart" runbook and notify Natsu.
- Stage 2: If the issue isn't fixed in 5 minutes, notify Zeref (via the escalation plan).
- Stage 3: If it's still not resolved in 10 minutes, alert the manager.
This ensures both automation and people work together to fix the problem.
Next, we provide a step-by-step guide to integrating Versus with AWS Incident Manager for on-call.
How to Integrate AWS Incident Manager On-Call
Table of Contents
- Prerequisites
- Setting Up AWS Incident Manager for On-Call
- Define IAM Role for Versus
- Deploy Versus Incident
- Alert Rules
- Alert Manager Routing Configuration
- Testing the Integration
- Conclusion
This document provides a step-by-step guide to integrating Versus Incident with AWS Incident Manager for on-call management. The integration enables automated escalation of alerts to on-call teams when incidents are not acknowledged within a specified time.
We'll cover configuring Prometheus Alert Manager to send alerts to Versus, setting up AWS Incident Manager, deploying Versus, and testing the integration with a practical example.
Prerequisites
Before you begin, ensure you have:
- An AWS account with access to AWS Incident Manager.
- Versus Incident deployed (instructions provided later).
- Prometheus Alert Manager set up to monitor your systems.
Setting Up AWS Incident Manager for On-Call
AWS Incident Manager requires configuring several components to manage on-call workflows. Let's configure a practical example using six contacts, two teams, and a two-stage response plan. Use the AWS Console to set these up.
Contacts
Contacts are individuals who will be notified during an incident.
- In the AWS Console, navigate to Systems Manager > Incident Manager > Contacts.
- Click Create contact.
- For each contact:
- Enter a Name (e.g., "Natsu Dragneel").
- Add Contact methods (e.g., SMS: +1-555-123-4567, Email: natsu@devopsvn.tech).
- Save the contact.
Repeat to create 6 contacts (e.g., Natsu, Zeref, Igneel, Gray, Gajeel, Laxus).
Escalation Plan
An escalation plan defines the order in which contacts are engaged.
- Go to Incident Manager > Escalation plans > Create escalation plan.
- Name it (e.g., "TeamA_Escalation").
- Add contacts (e.g., Natsu, Zeref, and Igneel) and set them to engage simultaneously or sequentially.
- Save the plan.
- Create a second plan (e.g., "TeamB_Escalation") for Gray, Gajeel, and Laxus.
RunBook (Optional)
RunBooks automate incident resolution steps. For this guide, weβll skip RunBook creation, but you can define one in AWS Systems Manager Automation if needed.
Response Plan
A response plan ties contacts and escalation plans into a structured response.
- Go to Incident Manager > Response plans > Create response plan.
- Name it (e.g., "CriticalIncidentResponse").
- Attach the escalation plans created earlier to define two stages:
- Stage 1: Engage "TeamA_Escalation" (Natsu, Zeref, and Igneel) with a 5-minute timeout.
- Stage 2: If unacknowledged, engage "TeamB_Escalation" (Gray, Gajeel, and Laxus).
- Save the plan and note its ARN (e.g., `arn:aws:ssm-incidents::111122223333:response-plan/CriticalIncidentResponse`).
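If you prefer infrastructure-as-code over the console, the same response plan can be sketched as input to `aws ssm-incidents create-response-plan --cli-input-json` (field names follow the CreateResponsePlan API; the ARNs and values here are illustrative):

```json
{
  "name": "CriticalIncidentResponse",
  "displayName": "Critical Incident Response",
  "incidentTemplate": {
    "title": "Critical incident",
    "impact": 1
  },
  "engagements": [
    "arn:aws:ssm-contacts:us-east-1:111122223333:contact/teama_escalation",
    "arn:aws:ssm-contacts:us-east-1:111122223333:contact/teamb_escalation"
  ]
}
```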
Define IAM Role for Versus
Versus needs permissions to interact with AWS Incident Manager.
- In the AWS Console, go to IAM > Roles > Create role.
- Choose AWS service as the trusted entity and select EC2 (or your deployment type, e.g., ECS).
- Attach a custom policy with these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm-incidents:StartIncident",
"ssm-incidents:GetResponsePlan"
],
"Resource": "*"
}
]
}
- Name the role (e.g., "VersusIncidentRole") and create it.
- Note the Role ARN (e.g., `arn:aws:iam::111122223333:role/VersusIncidentRole`).
Deploy Versus Incident
Deploy Versus using Docker or Kubernetes. For a Docker deployment, first create a directory for your configuration files:
mkdir -p ./config
Create `config/config.yaml` with the following content:
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example
alert:
debug_body: true
slack:
enable: true
token: ${SLACK_TOKEN}
channel_id: ${SLACK_CHANNEL_ID}
template_path: "config/slack_message.tmpl"
message_properties:
button_text: "Acknowledge Alert" # Custom text for the acknowledgment button
button_style: "primary" # Button style: "primary" (blue), "danger" (red), or empty for default gray
disable_button: false # Set to true to disable the button if you want to handle acknowledgment differently
oncall:
enable: true
wait_minutes: 3
aws_incident_manager:
response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}
redis: # Required for on-call functionality
insecure_skip_verify: true # dev only
host: ${REDIS_HOST}
port: ${REDIS_PORT}
password: ${REDIS_PASSWORD}
db: 0
Create a Slack template `config/slack_message.tmpl`:
🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
📍 *Instance*: `{{ .commonLabels.instance }}`
🚨 *Status*: `{{ .status }}`
{{ range .alerts }}
📝 {{ .annotations.description }}
{{ end }}
Slack Acknowledgment Button (Default)
By default, Versus automatically adds an interactive acknowledgment button to Slack notifications when on-call is enabled, allowing users to acknowledge alerts directly from Slack. You can customize the button appearance via the `message_properties` section of your `config.yaml`.
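For instance, a red button with different wording (the keys mirror the `message_properties` block shown in the config above; the values here are illustrative):

```yaml
alert:
  slack:
    message_properties:
      button_text: "I'm on it"  # text shown on the acknowledgment button
      button_style: "danger"    # "primary" (blue), "danger" (red), or empty for default gray
      disable_button: false     # true removes the button entirely
```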
ACK URL Generation
- When an incident is created (e.g., via a POST to `/api/incidents`), Versus generates an acknowledgment URL if on-call is enabled.
- The URL is constructed using the `public_host` value, typically in the format `https://your-host.example/api/incidents/ack/<incident-id>`.
- This URL is injected into the button.
Manual Acknowledgment Handling
If you prefer to handle acknowledgments manually or want to disable the default button (by setting `disable_button: true`), you can add the acknowledgment URL directly in your template. Here's an example of including a clickable link in your Slack template:
🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
📍 *Instance*: `{{ .commonLabels.instance }}`
🚨 *Status*: `{{ .status }}`
{{ range .alerts }}
📝 {{ .annotations.description }}
{{ end }}
{{ if .AckURL }}
----------
<{{.AckURL}}|Click here to acknowledge>
{{ end }}
The conditional `{{ if .AckURL }}` ensures the link only appears if the acknowledgment URL is available (i.e., when on-call is enabled).
Create the `docker-compose.yml` file:
version: '3.8'
services:
versus:
image: ghcr.io/versuscontrol/versus-incident
ports:
- "3000:3000"
environment:
- SLACK_TOKEN=your_slack_token
- SLACK_CHANNEL_ID=your_channel_id
- AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN=arn:aws:ssm-incidents::111122223333:response-plan/CriticalIncidentResponse
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_PASSWORD=your_redis_password
depends_on:
- redis
redis:
image: redis:6.2-alpine
command: redis-server --requirepass your_redis_password
ports:
- "6379:6379"
volumes:
- redis_data:/data
volumes:
redis_data:
Note: If using AWS credentials, add `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, or attach the IAM role to your deployment environment.
Run Docker Compose:
docker-compose up -d
Alert Rules
Create a `prometheus.yml` file to define a metric and alerting rule:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'server'
static_configs:
- targets: ['localhost:9090']
rule_files:
- 'alert_rules.yml'
Create `alert_rules.yml` to define an alert:
groups:
- name: rate
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status="500"}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected in {{ $labels.service }}"
description: "{{ $labels.service }} has an error rate above 0.1 for 5 minutes."
- alert: HighErrorRate
expr: rate(http_requests_total{status="500"}[5m]) > 0.5
for: 2m
labels:
severity: critical
annotations:
summary: "Very high error rate detected in {{ $labels.service }}"
description: "{{ $labels.service }} has an error rate above 0.5 for 2 minutes."
- alert: HighErrorRate
expr: rate(http_requests_total{status="500"}[5m]) > 0.8
for: 1m
labels:
severity: urgent
annotations:
summary: "Extremely high error rate detected in {{ $labels.service }}"
description: "{{ $labels.service }} has an error rate above 0.8 for 1 minute."
Alert Manager Routing Configuration
Configure Alert Manager to route alerts to Versus with different behaviors.
Send Alert Only (No On-Call)
receivers:
- name: 'versus-no-oncall'
webhook_configs:
- url: 'http://versus-service:3000/api/incidents?oncall_enable=false'
send_resolved: false
route:
receiver: 'versus-no-oncall'
group_by: ['alertname', 'service']
routes:
- match:
severity: warning
receiver: 'versus-no-oncall'
Send Alert with Acknowledgment Wait
receivers:
- name: 'versus-with-ack'
webhook_configs:
- url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=5'
send_resolved: false
route:
routes:
- match:
severity: critical
receiver: 'versus-with-ack'
This waits 5 minutes for acknowledgment; if no one clicks the acknowledgment link that Versus sends to Slack, the AWS Incident Manager response plan is triggered.
Send Alert with Immediate On-Call Trigger
receivers:
- name: 'versus-immediate'
webhook_configs:
- url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=0'
send_resolved: false
route:
routes:
- match:
severity: urgent
receiver: 'versus-immediate'
This triggers the response plan immediately without waiting.
Testing the Integration
- Trigger an Alert: Simulate a critical alert in Prometheus to match the Alert Manager rule.
- Verify Versus: Check that Versus receives the alert and sends it to configured channels (e.g., Slack).
- Check Escalation:
- Wait 5 minutes without acknowledging the alert.
- In Incident Manager > Incidents, verify that an incident starts and Team A is engaged.
- After 5 more minutes, confirm Team B is engaged.
- Immediate Trigger Test: Send an urgent alert and confirm the response plan triggers instantly.
Conclusion
Youβve now integrated Versus Incident with AWS Incident Manager for on-call management! Alerts from Prometheus Alert Manager can trigger notifications via Versus, with escalations handled by AWS Incident Manager based on your response plan. Adjust configurations as needed for your environment.
If you encounter any issues or have further questions, feel free to reach out!
How to Integrate AWS Incident Manager On-Call (Advanced)
Table of Contents
- Prerequisites
- Advanced On-Call Management with AWS Incident Manager
- Creating On-Call Schedules
- Understanding AWS Incident Manager Rotations
- Multi-Level Escalation Workflows
- Advanced Versus Incident Configuration
- AWS IAM Role Configuration for Critical-Only Approach
- Advanced Incident Routing Rules
- Dynamic Configuration with Query Parameters
- Monitoring and Analytics
- Testing and Validation
- Testing the Critical-Only Approach
- Conclusion
This document is an advanced guide to integrating Versus Incident with AWS Incident Manager for sophisticated on-call management. While the basic integration guide covers the essential setup, this guide focuses on implementing complex on-call rotations, schedules, and workflows.
Prerequisites
Before proceeding with this advanced guide, ensure you have:
- Completed the basic AWS Incident Manager integration
- An AWS account with administrative access
- Versus Incident deployed and functioning with basic integrations
- Prometheus Alert Manager configured and sending alerts to Versus
- Multiple teams requiring on-call management with different rotation patterns
Advanced On-Call Management with AWS Incident Manager
AWS Incident Manager offers advanced capabilities for managing on-call schedules, beyond the basic escalation plans covered in the introductory guide. These include:
- On-Call Schedules: Calendar-based rotations of on-call responsibilities
- Rotation Patterns: Daily, weekly, or custom rotation patterns for teams
- Time Zone Management: Support for global teams across different time zones
- Override Capabilities: Handling vacations, sick leave, and special events
Let's configure an advanced on-call system with two teams (Platform and Application) that have different rotation schedules and escalation workflows.
Creating On-Call Schedules
AWS Incident Manager allows you to create on-call schedules that automatically rotate responsibilities among team members. Here's how to set up comprehensive schedules:
-
Create Team-Specific Contact Groups:
- In the AWS Console, navigate to Systems Manager > Incident Manager > Contacts
- Click Create contact group
- For the Platform team:
- Name it "Platform-Team"
- Add 4-6 team member contacts created previously
- Save the group
- Repeat for the Application team
-
Create Schedule Rotations:
- Go to Incident Manager > Schedules
- Click Create schedule
- Configure the Platform team rotation:
- Name: "Platform-Rotation"
- Description: "24/7 support rotation for platform infrastructure"
- Time zone: Select your primary operations time zone
- Rotation settings:
- Rotation pattern: Weekly (Each person is on call for 1 week)
- Start date/time: Choose when the first rotation begins
- Handoff time: Typically 09:00 AM local time
- Recurrence: Recurring every 1 week
- Add all platform engineers to the rotation sequence
- Save the schedule
-
Create Application Team Schedule With Daily Rotation:
- Create another schedule named "App-Rotation"
- Configure for daily rotation instead of weekly
- Set business hours coverage (8 AM - 6 PM)
- Add application team members
- Save the schedule
You now have two separate rotation schedules that will automatically change the primary on-call contact based on the defined patterns.
Understanding AWS Incident Manager Rotations
AWS Incident Manager rotations provide a powerful way to manage on-call responsibilities. Here's a deeper explanation of how they work:
-
Rotation Sequence Management:
- Engineers are added to the rotation in a specific sequence
- Each engineer takes their turn as the primary on-call responder based on the configured rotation pattern
- AWS automatically tracks the current position in the rotation and advances it according to the schedule
-
Shift Transition Process:
- At the configured handoff time (e.g., 9:00 AM), AWS Incident Manager automatically transitions on-call responsibilities
- The system sends notifications to both the outgoing and incoming on-call engineers
- The previous on-call engineer remains responsible until the handoff is complete
- Any incidents created during the handoff window are assigned to the new on-call engineer
-
Handling Availability Exceptions:
- AWS Incident Manager allows you to create Overrides for planned absences like vacations or holidays
- To create an override:
- Navigate to the schedule
- Click "Create override"
- Select the time period and replacement contact
- Save the override
- During the override period, notifications are sent to the replacement contact instead of the regularly scheduled engineer
-
Multiple Rotation Layers:
- You can create primary, secondary, and tertiary rotation schedules
- These can be combined into escalation plans where notification fails over from primary to secondary
- Different rotations can have different time periods (e.g., primary rotates weekly, secondary rotates monthly)
- This adds redundancy to your on-call system and spreads the on-call burden appropriately
-
Managing Time Zones and Global Teams:
- AWS Incident Manager handles time zone differences automatically
- You can configure a "Follow-the-Sun" rotation where engineers in different time zones cover different parts of the day
- The handoff times are adjusted based on the configured time zone of the schedule
-
Rotation Visualization:
- The AWS Console provides a calendar view that shows who is on-call at any given time
- This helps teams plan their schedules and understand upcoming on-call responsibilities
- The calendar view accounts for overrides and exceptions
Multi-Level Escalation Workflows
Build advanced escalation workflows that incorporate your on-call schedules:
Create Advanced Escalation Plans:
- Go to Incident Manager > Escalation plans
- Click Create escalation plan
- Name it "Platform-Tiered-Escalation"
- Add escalation stages:
- Stage 1: Current on-call from "Platform-Rotation" (wait 5 minutes)
- Stage 2: Secondary on-call + Team Lead (wait 5 minutes)
- Stage 3: Engineering Manager + Director (wait 10 minutes)
- Stage 4: CTO/VP Engineering
Configure Severity-Based Escalation: Create an escalation plan specifically for critical alerts:
- Critical: Immediate engagement of primary on-call, with fast escalation (2-minute acknowledgment)
- Note: Non-critical alerts don't trigger on-call processes
Create Enhanced Response Plans:
- Go to Incident Manager > Response plans
- Create separate response plans aligned with different services and severity levels
- For example, "Critical-Platform-Outage" with:
- Associated escalation plan: "Platform-Tiered-Escalation"
- Automatic engagement of specific chat channels
- Pre-defined runbooks for common failure scenarios
- Integration with status page updates
These advanced escalation workflows ensure that the right people are engaged at the right time, without unnecessary escalation for routine issues.
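The staged timing above is easy to reason about as a simple schedule. The following sketch (illustrative only, not Incident Manager's implementation) computes when each stage of the "Platform-Tiered-Escalation" plan would be engaged if nobody acknowledges:

```python
from datetime import datetime, timedelta

# (who is engaged, minutes to wait before the next stage; None = final stage)
STAGES = [
    ("primary on-call (Platform-Rotation)", 5),
    ("secondary on-call + team lead", 5),
    ("engineering manager + director", 10),
    ("CTO / VP Engineering", None),
]

def escalation_timeline(incident_start, stages=STAGES):
    """Return (engage_time, who) for each stage, assuming no acknowledgment."""
    timeline, t = [], incident_start
    for who, wait_minutes in stages:
        timeline.append((t, who))
        if wait_minutes is None:
            break
        t += timedelta(minutes=wait_minutes)
    return timeline

timeline = escalation_timeline(datetime(2024, 1, 1, 0, 0))
```

With these stage delays, the final stage is engaged 20 minutes after the incident starts; an acknowledgment at any stage stops the clock.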
Advanced Versus Incident Configuration
Configure Versus Incident for advanced integration with AWS Incident Manager:
```yaml
name: versus
host: 0.0.0.0
port: 3000
public_host: https://versus.example.com # Required for acknowledgment URLs

alert:
  debug_body: true # Useful for troubleshooting

  slack:
    enable: true
    token: ${SLACK_TOKEN}
    channel_id: ${SLACK_CHANNEL_ID}
    template_path: "config/slack_message.tmpl"
    message_properties:
      button_text: "Acknowledge Incident"
      button_style: "primary"
      disable_button: false

oncall:
  initialized_only: true # Initialize on-call but keep it disabled by default
  enable: false # Not needed when initialized_only is true
  wait_minutes: 2 # Wait 2 minutes before escalating critical alerts
  provider: aws_incident_manager # Specify AWS Incident Manager as the on-call provider

  # AWS Incident Manager response plan for critical alerts only
  aws_incident_manager:
    response_plan_arn: "arn:aws:ssm-incidents::123456789012:response-plan/PlatformCriticalPlan"
    # Optional: Configure multiple response plans for different environments or teams
    other_response_plan_arns:
      app: "arn:aws:ssm-incidents::123456789012:response-plan/AppCriticalPlan"

redis: # Required for on-call functionality
  insecure_skip_verify: false # production setting
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
```
This configuration allows Versus to:
- Use AWS response plans for critical alerts only
- Set a 2-minute wait time before escalation for critical alerts
- Ensure non-critical alerts don't trigger on-call processes
Understanding the `initialized_only` Setting
The `initialized_only: true` setting is a powerful feature that allows you to:
Initialize the on-call system but keep it disabled by default: The on-call infrastructure is set up and ready to use, but won't automatically trigger for any alerts.
Enable on-call selectively using query parameters: Only alerts that explicitly include `?oncall_enable=true` in their webhook URL will trigger the on-call workflow.
Implement a critical-only approach: Combined with Alert Manager routing rules, you can ensure only critical alerts with the right query parameters trigger on-call.
This approach provides several advantages:
- Greater control over which alerts can page on-call engineers
- Ability to test the on-call system without changing configuration
- Flexibility to adjust which services can trigger on-call without redeploying
- Protection against accidental on-call notifications during configuration changes
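The decision logic this implies can be summarized in a short sketch. This is a simplified illustration of the documented behavior, not the actual Versus source code; the function name and config shape are hypothetical:

```python
def should_trigger_oncall(config, query_params):
    """Decide whether an incoming alert may start the on-call flow.

    With initialized_only true, only an explicit oncall_enable=true
    query parameter activates on-call for that alert.
    """
    oncall = config.get("oncall", {})
    param = query_params.get("oncall_enable")
    if param is not None:
        # An explicit query parameter wins either way
        return param.lower() == "true"
    if oncall.get("initialized_only"):
        # Infrastructure is ready, but disabled by default
        return False
    return bool(oncall.get("enable"))

cfg = {"oncall": {"initialized_only": True, "enable": False}}
```

Because the default answer is "no", accidental paging requires both a routing mistake and an explicit `oncall_enable=true` parameter.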
Enhanced Slack Template
Create an enhanced Slack template (`config/slack_message.tmpl`) that provides more context:
```
🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
📊 *System*: `{{ .commonLabels.system }}`
🖥️ *Instance*: `{{ .commonLabels.instance }}`
🚨 *Status*: `{{ .status }}`
⏱️ *Detected*: `{{ .startsAt | date "Jan 02, 2006 15:04:05 MST" }}`

{{ range .alerts }}
📝 *Description*: {{ .annotations.description }}
{{ if .annotations.runbook }}📖 *Runbook*: {{ .annotations.runbook }}{{ end }}
{{ if .annotations.dashboard }}📊 *Dashboard*: {{ .annotations.dashboard }}{{ end }}
{{ end }}

{{ if .AckURL }}
⚠️ *Auto-escalation*: This alert will escalate {{ if eq .commonLabels.severity "critical" }}in 2 minutes{{ end }} if not acknowledged.
{{ end }}
```
This template provides additional context and clear timing expectations for responders.
AWS IAM Role Configuration for Critical-Only Approach
For the critical-only on-call approach, you need to configure appropriate IAM permissions. This role allows Versus Incident to interact with AWS Incident Manager but only for critical incidents:
- Create a Dedicated IAM Policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm-incidents:StartIncident",
        "ssm-incidents:GetResponsePlan",
        "ssm-incidents:ListResponsePlans",
        "ssm-incidents:TagResource"
      ],
      "Resource": [
        "arn:aws:ssm-incidents:*:*:response-plan/*Critical*",
        "arn:aws:ssm-incidents:*:*:incident/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm-contacts:GetContact",
        "ssm-contacts:ListContacts"
      ],
      "Resource": "*"
    }
  ]
}
```
This policy:
- Restricts incident creation to response plans containing "Critical" in the name
- Provides access to contacts for notification purposes
- Allows tagging of incidents for better organization
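IAM resource patterns use shell-style wildcards, so you can sanity-check which response plans the policy would permit with standard glob matching. This is a rough local check (IAM's evaluation has more rules than plain globbing), and `policy_allows` is a hypothetical helper name:

```python
from fnmatch import fnmatchcase

# The Resource patterns from the policy above
ALLOWED_RESOURCES = [
    "arn:aws:ssm-incidents:*:*:response-plan/*Critical*",
    "arn:aws:ssm-incidents:*:*:incident/*",
]

def policy_allows(resource_arn, patterns=ALLOWED_RESOURCES):
    """Approximate whether an ARN matches the policy's Resource patterns."""
    return any(fnmatchcase(resource_arn, p) for p in patterns)
```

A response plan named `PlatformCriticalPlan` matches `*Critical*`, while a plan without "Critical" in its name would be denied.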
- Create an IAM Role:
- Create a new role with the appropriate trust relationship for your deployment environment (EC2, ECS, or Lambda)
- Attach the policy created above
- Note the Role ARN (e.g., `arn:aws:iam::111122223333:role/VersusIncidentCriticalRole`)
- Configure AWS Credentials:
  - If using environment variables, set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from a user with the ability to assume this role
  - Alternatively, use EC2 instance profiles or AWS service roles for containerized deployments
This IAM configuration ensures that even if a non-critical incident tries to invoke the response plan, it will fail due to IAM permissions, providing an additional layer of enforcement for your critical-only on-call policy.
Advanced Incident Routing Rules
Configure Alert Manager with advanced routing based on services, teams, and severity to work with the `initialized_only` setting:
```yaml
receivers:
  - name: 'versus-normal'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents'
        send_resolved: true

  - name: 'versus-critical'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?oncall_enable=true'
        send_resolved: true

  - name: 'versus-app-normal'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents'
        send_resolved: true

  - name: 'versus-app-critical'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?oncall_enable=true&awsim_other_response_plan=app'
        send_resolved: true

  - name: 'versus-business-hours'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents'
        send_resolved: true

route:
  receiver: 'versus-normal' # Default receiver
  group_by: ['alertname', 'service', 'severity']
  routes:
    # Time-based routing
    - match_re:
        timeperiod: "business-hours"
      receiver: 'versus-business-hours'
    # Team and severity based routing - on-call only for critical
    - match:
        team: platform
        severity: critical
      receiver: 'versus-critical'
    - match:
        team: platform
        severity: high
      receiver: 'versus-normal'
    - match:
        team: application
        severity: critical
      receiver: 'versus-app-critical'
    - match:
        team: application
        severity: high
      receiver: 'versus-app-normal'
```
This configuration ensures that:
- On-call is completely disabled by default (even for critical alerts)
- Only alerts explicitly configured to trigger on-call will do so
- You have granular control over which alerts and severity levels can page your team
- You can easily test alert routing without risk of accidental paging
Dynamic Configuration with Query Parameters
Versus Incident supports dynamic configuration through query parameters, which is especially powerful for managing on-call behavior when using `initialized_only: true`. These parameters can be added to your Alert Manager webhook URLs to override default settings on a per-alert basis:
| Query Parameter | Description | Example |
|---|---|---|
| `oncall_enable` | Enable or disable on-call for a specific alert | `?oncall_enable=true` |
| `oncall_wait_minutes` | Override the default wait time before escalation | `?oncall_wait_minutes=5` |
| `awsim_other_response_plan` | Use an alternative response plan defined in your configuration | `?awsim_other_response_plan=app` |
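The merge of these per-URL overrides onto the configured defaults can be sketched as follows. This is an illustrative model of the documented behavior, not the actual Versus implementation; the default values mirror the example configuration above:

```python
from urllib.parse import urlparse, parse_qs

DEFAULTS = {
    "oncall_enable": False,           # initialized_only keeps on-call off
    "oncall_wait_minutes": 2,         # configured wait before escalation
    "awsim_other_response_plan": None # use the primary response plan
}

def effective_oncall_settings(webhook_url, defaults=DEFAULTS):
    """Merge query-parameter overrides from a webhook URL onto defaults."""
    qs = parse_qs(urlparse(webhook_url).query)
    settings = dict(defaults)
    if "oncall_enable" in qs:
        settings["oncall_enable"] = qs["oncall_enable"][0].lower() == "true"
    if "oncall_wait_minutes" in qs:
        settings["oncall_wait_minutes"] = int(qs["oncall_wait_minutes"][0])
    if "awsim_other_response_plan" in qs:
        settings["awsim_other_response_plan"] = qs["awsim_other_response_plan"][0]
    return settings

s = effective_oncall_settings(
    "http://versus-service:3000/api/incidents?oncall_enable=true&oncall_wait_minutes=0")
```

A URL with no query parameters keeps the safe defaults, so only explicitly opted-in receivers can page anyone.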
Example Alert Manager Configurations:
```yaml
# Immediately trigger on-call for database failures (no wait)
- name: 'versus-db-critical'
  webhook_configs:
    - url: 'http://versus-service:3000/api/incidents?oncall_enable=true&oncall_wait_minutes=0'
      send_resolved: true

# Use a custom response plan for network issues
- name: 'versus-network-critical'
  webhook_configs:
    - url: 'http://versus-service:3000/api/incidents?oncall_enable=true&awsim_other_response_plan=network'
      send_resolved: true
```
This flexibility allows you to fine-tune your incident response workflow based on the specific needs of different services and alert types while maintaining the critical-only approach to on-call escalation.
Monitoring and Analytics
Implement metrics and reporting for your incident response process:
Create CloudWatch Dashboards:
- Track incident frequency by service
- Monitor Mean Time to Acknowledge (MTTA)
- Monitor Mean Time to Resolve (MTTR)
- Track escalation frequency
- Visualize on-call burden distribution
Set Up Regular Reporting:
- Configure automatic weekly reports of on-call activity
- Track key metrics over time:
- Number of incidents by severity
- Acknowledge time by team, rotation, and individual
- Resolution time
- False positive rate
Implement Continuous Improvement:
- Review metrics regularly with teams
- Identify top sources of incidents
- Track improvement initiatives
- Use AWS Incident Manager's post-incident analysis feature
These analytics help identify patterns, reduce false positives, and enable teams to address systemic issues.
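MTTA and MTTR are simple averages over incident timestamps. A minimal sketch, assuming a hypothetical incident record shape (e.g. parsed from CloudWatch or incident exports):

```python
from datetime import datetime, timedelta
from statistics import mean

def mtta_mttr(incidents):
    """Compute Mean Time to Acknowledge / Resolve, in minutes.

    Each incident is a dict with 'created', 'acknowledged' and
    'resolved' datetimes (hypothetical shape for illustration).
    """
    ack = [(i["acknowledged"] - i["created"]).total_seconds() / 60 for i in incidents]
    res = [(i["resolved"] - i["created"]).total_seconds() / 60 for i in incidents]
    return mean(ack), mean(res)

t0 = datetime(2024, 1, 1, 0, 0)
sample = [
    {"created": t0, "acknowledged": t0 + timedelta(minutes=5),
     "resolved": t0 + timedelta(minutes=30)},
    {"created": t0, "acknowledged": t0 + timedelta(minutes=15),
     "resolved": t0 + timedelta(minutes=90)},
]
mtta, mttr = mtta_mttr(sample)
```

Tracking these two numbers per team and per rotation is usually enough to spot both alert fatigue and uneven on-call burden.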
Testing and Validation
Thoroughly test your advanced on-call workflows:
Schedule Test Scenarios:
- During handoff periods between rotations
- At different times of day
- With different alert severities
- During planned override periods
Document Results:
- Track actual response times
- Identify any notification failures
- Ensure ChatBot integration works correctly
- Validate metrics collection
Conduct Regular Fire Drills:
- Schedule monthly unannounced test incidents
- Rotate scenarios to test different aspects of the system
- Include post-drill reviews and improvement plans
Testing the Critical-Only Approach
You need to verify both that on-call is triggered with the right parameters and that it doesn't trigger by default:
Test Default Behavior (Should NOT Trigger On-Call):
```bash
# Send a critical alert WITHOUT oncall_enable parameter - should NOT trigger on-call
curl -X POST "http://versus-service:3000/api/incidents" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[CRITICAL] This is a critical alert that should not trigger on-call",
    "ServiceName": "test-service",
    "Severity": "critical"
  }'
```
Verify that:
- The alert appears in your notification channels (Slack, etc.)
- No AWS Incident Manager incident is created
- No on-call team is notified
Test Explicit On-Call Activation:
```bash
# Send a critical alert WITH oncall_enable=true - should trigger on-call after wait period
curl -X POST "http://versus-service:3000/api/incidents?oncall_enable=true" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[CRITICAL] This is a critical alert that SHOULD trigger on-call",
    "ServiceName": "test-service",
    "Severity": "critical"
  }'
```
Verify that:
- The alert appears in your notification channels with an acknowledgment button
- If not acknowledged within the wait period, an AWS Incident Manager incident is created
- The appropriate on-call team is notified
Test Immediate On-Call Activation:
```bash
# Send a critical alert with immediate on-call activation
curl -X POST "http://versus-service:3000/api/incidents?oncall_enable=true&oncall_wait_minutes=0" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[CRITICAL] This is a critical alert that should trigger on-call IMMEDIATELY",
    "ServiceName": "test-service",
    "Severity": "critical"
  }'
```
Verify that:
- An AWS Incident Manager incident is created immediately
- The on-call team is notified without waiting for acknowledgment
Test Response Plan Override:
```bash
# Use a specific response plan
curl -X POST "http://versus-service:3000/api/incidents?oncall_enable=true&awsim_other_response_plan=platform" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[CRITICAL] Platform issue requiring specific team",
    "ServiceName": "platform-service",
    "Severity": "critical"
  }'
```
Verify that:
- The correct response plan is used (check in AWS Incident Manager)
- The appropriate platform team is engaged
Conclusion
By implementing this advanced on-call management system with AWS Incident Manager and Versus Incident, you've created an incident response workflow that:
- Automatically rotates on-call responsibilities among team members
- Only triggers on-call for critical alerts with explicit activation, preventing alert fatigue
- Routes incidents to the appropriate teams based on service and time
- Escalates critical incidents according to well-defined patterns
- Facilitates real-time collaboration during incidents
- Provides analytics for continuous improvement
This system ensures that critical incidents receive appropriate attention without unnecessary escalation for routine issues. Non-critical alerts remain visible in notification channels but don't trigger the on-call escalation process.
Regularly review and refine your configurations as your organization and systems evolve. Solicit feedback from on-call engineers to identify pain points and improvement opportunities. Consider gathering metrics on the effectiveness of your approach, adjusting severity thresholds and query parameters as needed.
If you encounter any challenges or have questions about advanced configurations, refer to the AWS Incident Manager documentation or reach out to the Versus Incident community for support.
Understanding PagerDuty On-Call
Table of Contents
- Key Components of PagerDuty On-Call
- 1. Services
- 2. Escalation Policies
- 3. Schedules
- 4. Integrations
- The PagerDuty Incident Lifecycle
- Key Benefits of PagerDuty for On-Call Management
PagerDuty is a popular incident management platform that provides robust on-call scheduling, alerting, and escalation capabilities. This document explains the key components of PagerDuty's on-call system (services, escalation policies, schedules, and integrations) in a simple and clear way.
Key Components of PagerDuty On-Call
PagerDuty's on-call system relies on four main components: services, escalation policies, schedules, and integrations. Let's explore each one in detail.
1. Services
Services in PagerDuty represent the applications, components, or systems that you monitor. Each service:
- Has a unique name and description
- Is associated with an escalation policy
- Can be integrated with monitoring tools
- Contains a set of alert/incident settings
When an incident is triggered, it's associated with a specific service, which determines how the incident is handled and who is notified.
Example: A "Payment Processing API" service might be set up to:
- Alert the backend team when it experiences errors
- Have high urgency for all incidents
- Auto-resolve incidents after 24 hours if fixed
2. Escalation Policies
Escalation policies define who gets notified about an incident and in what order. They ensure that incidents are addressed even if the first responder isn't available.
An escalation policy typically includes:
- One or more escalation levels with designated responders
- Time delays between escalation levels
- Options to repeat the escalation process if no one responds
Example: For the "Payment API" service, an escalation policy might:
- Level 1: Notify the on-call engineer on the primary schedule
- Level 2: If no response in 15 minutes, notify the secondary on-call engineer
- Level 3: If still no response in 10 minutes, notify the engineering manager
3. Schedules
Schedules determine who is on-call at any given time. They allow teams to:
- Define rotation patterns (daily, weekly, custom)
- Set up multiple layers of coverage
- Handle time zone differences
- Plan for holidays and time off
PagerDuty's schedules are highly flexible and can accommodate complex team structures and rotation patterns.
Example: A "Backend Team Primary" schedule might rotate three engineers weekly, with handoffs occurring every Monday at 9 AM local time. A separate "Backend Team Secondary" schedule might rotate a different group of engineers as backup.
4. Integrations
Integrations connect PagerDuty to your monitoring tools, allowing alerts to be automatically converted into PagerDuty incidents. PagerDuty offers hundreds of integrations with popular monitoring systems.
For custom systems or tools without direct integrations, PagerDuty provides:
- Events API (V2) - A simple API for sending alerts to PagerDuty
- Webhooks - For receiving data about PagerDuty incidents in your other systems
Example: A company might integrate:
- Prometheus Alert Manager with their "Infrastructure" service
- Application error tracking with their "Application Errors" service
- Custom business logic monitors with their "Business Metrics" service
The PagerDuty Incident Lifecycle
When an incident is created in PagerDuty:
- Trigger: An alert comes in from an integrated monitoring system or API call
- Notification: PagerDuty notifies the appropriate on-call person based on the escalation policy
- Acknowledgment: The responder acknowledges the incident, letting others know they're working on it
- Resolution: After fixing the issue, the responder resolves the incident
- Post-Mortem: Teams can analyze what happened and how to prevent similar issues
This structured approach ensures that incidents are handled efficiently and consistently.
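The lifecycle above is a small state machine: an incident moves from triggered to acknowledged to resolved, and invalid transitions are rejected. A minimal sketch (illustrative, not PagerDuty's API):

```python
# Valid (state, action) -> next state transitions in the incident lifecycle
TRANSITIONS = {
    ("triggered", "acknowledge"): "acknowledged",
    ("triggered", "resolve"): "resolved",
    ("acknowledged", "resolve"): "resolved",
}

def next_state(state, action):
    """Advance an incident through the trigger -> acknowledge -> resolve lifecycle."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an incident in state {state!r}")
```

Modeling it this way makes clear why escalation stops at acknowledgment: once the incident leaves the "triggered" state, the notification chain no longer advances.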
Key Benefits of PagerDuty for On-Call Management
- Reliability: Ensures critical alerts never go unnoticed with multiple notification methods and escalation paths
- Flexibility: Supports complex team structures and rotation patterns
- Reduced Alert Fatigue: Intelligent grouping and routing of alerts to the right people
- Comprehensive Visibility: Dashboards and reports to track incident metrics and on-call load
- Integration Ecosystem: Works with virtually any monitoring or alerting system
Next, we provide a step-by-step guide to integrating Versus with PagerDuty for on-call.
How to Integrate with PagerDuty
Table of Contents
- Prerequisites
- Setting Up PagerDuty for On-Call
- Deploy Versus Incident
- Alert Manager Routing Configuration
- Override the PagerDuty Routing Key per Alert
- Testing the Integration
- How It Works Under the Hood
- Conclusion
This document provides a step-by-step guide to integrate Versus Incident with PagerDuty for on-call management. The integration enables automated escalation of alerts to on-call teams when incidents are not acknowledged within a specified time.
We'll cover setting up PagerDuty, configuring the integration with Versus, deploying Versus, and testing the integration with practical examples.
Prerequisites
Before you begin, ensure you have:
- A PagerDuty account (you can start with a free trial if needed)
- Versus Incident deployed (instructions provided later)
- Prometheus Alert Manager set up to monitor your systems
Setting Up PagerDuty for On-Call
Let's configure a practical example in PagerDuty with services, schedules, and escalation policies.
1. Create Users in PagerDuty
First, we need to set up the users who will be part of the on-call rotation:
- Log in to your PagerDuty account
- Navigate to People > Users > Add User
- For each user, enter:
- Name (e.g., "Natsu Dragneel")
- Email address
- Role (User)
- Time Zone
- PagerDuty will send an email invitation to each user
- Users should complete their profiles by:
- Setting up notification rules (SMS, email, push notifications)
- Downloading the PagerDuty mobile app
- Setting contact details
Repeat to create multiple users (e.g., Natsu, Zeref, Igneel, Gray, Gajeel, Laxus).
2. Create On-Call Schedules
Now, let's create schedules for who is on-call and when:
- Navigate to People > Schedules > Create Schedule
- Name the schedule (e.g., "Team A Primary")
- Set up the rotation:
- Choose a rotation type (Weekly is common)
- Add users to the rotation (e.g., Natsu, Zeref, Igneel)
- Set handoff time (e.g., Mondays at 9:00 AM)
- Set time zone
- Save the schedule
- Create a second schedule (e.g., "Team B Secondary") for other team members
3. Create Escalation Policies
An escalation policy defines who gets notified when an incident occurs:
- Navigate to Configuration > Escalation Policies > New Escalation Policy
- Name the policy (e.g., "Critical Incident Response")
- Add escalation rules:
- Level 1: Select the "Team A Primary" schedule with a 5-minute timeout
- Level 2: Select the "Team B Secondary" schedule
- Optionally, add a Level 3 to notify a manager
- Save the policy
4. Create a PagerDuty Service
A service is what receives incidents from monitoring systems:
- Navigate to Configuration > Services > New Service
- Name the service (e.g., "Versus Incident Integration")
- Select "Events API V2" as the integration type
- Select the escalation policy you created in step 3
- Configure incident settings (Auto-resolve, urgency, etc.)
- Save the service
5. Get the Integration Key
After creating the service, you'll need the integration key (also called routing key):
- Navigate to Configuration > Services
- Click on your newly created service
- Go to the Integrations tab
- Find the "Events API V2" integration
- Copy the Integration Key (it looks something like: `12345678abcdef0123456789abcdef0`)
- Keep this key secure - you'll need it for Versus configuration
Deploy Versus Incident
Now let's deploy Versus with PagerDuty integration. You can use Docker or Kubernetes.
Docker Deployment
Create a directory for your configuration files:
```bash
mkdir -p ./config
```
Create `config/config.yaml` with the following content:
```yaml
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example

alert:
  debug_body: true

  slack:
    enable: true
    token: ${SLACK_TOKEN}
    channel_id: ${SLACK_CHANNEL_ID}
    template_path: "config/slack_message.tmpl"

oncall:
  enable: true
  wait_minutes: 3
  provider: pagerduty

  pagerduty:
    routing_key: ${PAGERDUTY_ROUTING_KEY} # The Integration Key from step 5
    other_routing_keys:
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
```
Create a Slack template in `config/slack_message.tmpl`:
```
🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
🖥️ *Instance*: `{{ .commonLabels.instance }}`
🚨 *Status*: `{{ .status }}`

{{ range .alerts }}
📝 {{ .annotations.description }}
{{ end }}
```
Slack Acknowledgment Button (Default)
By default, Versus automatically adds an interactive acknowledgment button to Slack notifications when on-call is enabled. This allows users to acknowledge alerts directly from Slack. You can customize the button appearance in your `config.yaml` via the slack `message_properties` settings (button text, button style, or disabling the button entirely).
ACK URL Generation
- When an incident is created (e.g., via a POST to `/api/incidents`), Versus generates an acknowledgment URL if on-call is enabled.
- The URL is constructed using the `public_host` value, typically in the format: `https://your-host.example/api/incidents/ack/<incident-id>`.
- This URL is injected into the button.
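The URL construction is straightforward string assembly. A minimal sketch of the documented format (`ack_url` is a hypothetical helper name, not the Versus function):

```python
def ack_url(public_host, incident_id):
    """Build an acknowledgment URL in the documented format:
    <public_host>/api/incidents/ack/<incident-id>."""
    return f"{public_host.rstrip('/')}/api/incidents/ack/{incident_id}"
```

Note that `public_host` must be reachable by the people clicking the button, which is why it is a required setting when on-call is enabled.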
Manual Acknowledgment Handling
If you prefer to handle acknowledgments manually or want to disable the default button (by setting `disable_button: true`), you can add the acknowledgment URL directly in your template. Here's an example of including a clickable link in your Slack template:
```
🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*
🖥️ *Instance*: `{{ .commonLabels.instance }}`
🚨 *Status*: `{{ .status }}`

{{ range .alerts }}
📝 {{ .annotations.description }}
{{ end }}

{{ if .AckURL }}
----------
<{{.AckURL}}|Click here to acknowledge>
{{ end }}
```
The conditional `{{ if .AckURL }}` ensures the link only appears if the acknowledgment URL is available (i.e., when on-call is enabled).
Create the `docker-compose.yml` file:
```yaml
version: '3.8'

services:
  versus:
    image: ghcr.io/versuscontrol/versus-incident
    ports:
      - "3000:3000"
    environment:
      - SLACK_TOKEN=your_slack_token
      - SLACK_CHANNEL_ID=your_channel_id
      - PAGERDUTY_ROUTING_KEY=your_pagerduty_integration_key
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - REDIS_PASSWORD=your_redis_password
    depends_on:
      - redis

  redis:
    image: redis:6.2-alpine
    command: redis-server --requirepass your_redis_password
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```
Run Docker Compose:
```bash
docker-compose up -d
```
Alert Manager Routing Configuration
Now, let's configure Alert Manager to route alerts to Versus with different behaviors:
Send Alert Only (No On-Call)
```yaml
receivers:
  - name: 'versus-no-oncall'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?oncall_enable=false'
        send_resolved: false

route:
  receiver: 'versus-no-oncall'
  group_by: ['alertname', 'service']
  routes:
    - match:
        severity: warning
      receiver: 'versus-no-oncall'
```
Send Alert with Acknowledgment Wait
```yaml
receivers:
  - name: 'versus-with-ack'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=5'
        send_resolved: false

route:
  routes:
    - match:
        severity: critical
      receiver: 'versus-with-ack'
```
This configuration waits 5 minutes for acknowledgment before triggering PagerDuty if the user doesn't click the ACK link in Slack.
Send Alert with Immediate On-Call Trigger
```yaml
receivers:
  - name: 'versus-immediate'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=0'
        send_resolved: false

route:
  routes:
    - match:
        severity: urgent
      receiver: 'versus-immediate'
```
This will trigger PagerDuty immediately without waiting.
Override the PagerDuty Routing Key per Alert
You can configure Alert Manager to use different PagerDuty services for specific alerts by using named routing keys instead of exposing sensitive routing keys directly in URLs:
```yaml
receivers:
  - name: 'versus-pagerduty-infra'
    webhook_configs:
      - url: 'http://versus-service:3000/api/incidents?pagerduty_other_routing_key=infra'
        send_resolved: false

route:
  routes:
    - match:
        team: infrastructure
      receiver: 'versus-pagerduty-infra'
```
This routes infrastructure team alerts to a different PagerDuty service using the named routing key "infra", which is securely mapped to the actual integration key in your configuration file:
```yaml
oncall:
  provider: pagerduty
  pagerduty:
    routing_key: ${PAGERDUTY_ROUTING_KEY}
    other_routing_keys:
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}
```
This approach keeps your sensitive PagerDuty integration keys secure by never exposing them in URLs or logs.
Testing the Integration
Let's test the complete workflow:
Trigger an Alert:
- Simulate a critical alert in Prometheus to match the Alert Manager rule.
Verify Versus:
- Check that Versus receives the alert and sends it to Slack.
- You should see a message with an acknowledgment link.
Check Escalation:
- Option 1: Click the ACK link to acknowledge the incident - PagerDuty should not be notified.
- Option 2: Wait for the acknowledgment timeout (e.g., 5 minutes) without clicking the link.
- In PagerDuty, verify that an incident is created and the on-call person is notified.
- Confirm that escalation happens according to your policy if the incident remains unacknowledged.
Immediate Trigger Test:
- Send an urgent alert and confirm that PagerDuty is triggered instantly.
How It Works Under the Hood
When Versus integrates with PagerDuty, the following occurs:
- Versus receives an alert from Alert Manager
- If on-call is enabled and the acknowledgment period passes without an ACK, Versus:
- Constructs a PagerDuty Events API v2 payload
- Sends a "trigger" event to PagerDuty with your routing key
- Includes incident details as custom properties
The PagerDuty service processes this event according to your escalation policy, notifying the appropriate on-call personnel.
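A "trigger" event is a small JSON document POSTed to PagerDuty's Events API v2 endpoint (`https://events.pagerduty.com/v2/enqueue`). The following sketch shows the general payload shape per the Events API v2; the exact fields Versus includes may differ, and `trigger_event` is a hypothetical helper:

```python
def trigger_event(routing_key, summary, source, severity="critical", details=None):
    """Build a minimal PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,        # the service's Integration Key
        "event_action": "trigger",         # "acknowledge"/"resolve" also exist
        "payload": {
            "summary": summary,            # short human-readable description
            "source": source,              # affected host or service
            "severity": severity,          # critical | error | warning | info
            "custom_details": details or {},
        },
    }

event = trigger_event(
    "12345678abcdef0123456789abcdef0",
    "[CRITICAL] Disk full on web-01",
    "web-01",
    details={"Logs": "disk usage at 98%"},
)
```

PagerDuty deduplicates and routes the event based on the routing key, then applies the escalation policy attached to that service.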
Conclusion
You've now integrated Versus Incident with PagerDuty for on-call management! Alerts from Prometheus Alert Manager can trigger notifications via Versus, with escalations handled by PagerDuty based on your escalation policy.
This integration provides:
- A delay period for engineers to acknowledge incidents before escalation
- Slack notifications with one-click acknowledgment
- Structured escalation through PagerDuty's robust notification system
- Multiple notification channels to ensure critical alerts reach the right people
Adjust configurations as needed for your environment and incident response processes. If you encounter any issues or have further questions, feel free to reach out!
Migrating to v1.2.0
Table of Contents
- What's New in v1.2.0
- Key Changes
- Configuration Changes
- Migration Steps
- Testing the Migration
- Additional Notes
This guide provides instructions for migrating from v1.1.5 to v1.2.0.
What's New in v1.2.0
Version 1.2.0 introduces enhanced Microsoft Teams integration using Power Automate, allowing you to send incident alerts directly to Microsoft Teams channels with more formatting options and better delivery reliability.
Key Changes
The main change in v1.2.0 is the Microsoft Teams integration architecture:
Legacy webhook URLs replaced with Power Automate: Instead of using the legacy Office 365 webhook URLs, Versus Incident now integrates with Microsoft Teams through Power Automate HTTP triggers, which provide more flexibility and reliability.
Configuration property names updated:
- `webhook_url` → `power_automate_url`
- `other_webhook_url` → `other_power_urls`
Environment variable names updated:
- `MSTEAMS_WEBHOOK_URL` → `MSTEAMS_POWER_AUTOMATE_URL`
- `MSTEAMS_OTHER_WEBHOOK_URL_*` → `MSTEAMS_OTHER_POWER_URL_*`
API query parameter updated:
- `msteams_other_webhook_url` → `msteams_other_power_url`
Configuration Changes
Here's a side-by-side comparison of the Microsoft Teams configuration in v1.1.5 vs v1.2.0:
v1.1.5 (Before)
```yaml
alert:
  # ... other alert configurations ...

  msteams:
    enable: false # Default value, will be overridden by MSTEAMS_ENABLE env var
    webhook_url: ${MSTEAMS_WEBHOOK_URL}
    template_path: "config/msteams_message.tmpl"
    other_webhook_url: # Optional: Define additional webhook URLs
      qc: ${MSTEAMS_OTHER_WEBHOOK_URL_QC}
      ops: ${MSTEAMS_OTHER_WEBHOOK_URL_OPS}
      dev: ${MSTEAMS_OTHER_WEBHOOK_URL_DEV}
```
v1.2.0 (After)
```yaml
alert:
  # ... other alert configurations ...

  msteams:
    enable: false # Default value, will be overridden by MSTEAMS_ENABLE env var
    power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL} # Power Automate HTTP trigger URL
    template_path: "config/msteams_message.tmpl"
    other_power_urls: # Optional: Enable overriding the default Power Automate flow
      qc: ${MSTEAMS_OTHER_POWER_URL_QC}
      ops: ${MSTEAMS_OTHER_POWER_URL_OPS}
      dev: ${MSTEAMS_OTHER_POWER_URL_DEV}
```
Migration Steps
1. Update Your Configuration File
Replace the Microsoft Teams section in your `config.yaml` file:
```yaml
msteams:
  enable: false # Set to true to enable, or use MSTEAMS_ENABLE env var
  power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL} # Power Automate HTTP trigger URL
  template_path: "config/msteams_message.tmpl"
  other_power_urls: # Optional: Enable overriding the default Power Automate flow
    qc: ${MSTEAMS_OTHER_POWER_URL_QC}
    ops: ${MSTEAMS_OTHER_POWER_URL_OPS}
    dev: ${MSTEAMS_OTHER_POWER_URL_DEV}
```
2. Update Your Environment Variables
If you're using environment variables, update them:

```bash
# Old (v1.1.5)
MSTEAMS_WEBHOOK_URL=https://...
MSTEAMS_OTHER_WEBHOOK_URL_QC=https://...

# New (v1.2.0)
MSTEAMS_POWER_AUTOMATE_URL=https://...
MSTEAMS_OTHER_POWER_URL_QC=https://...
```
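The rename can also be checked automatically before you restart the service. The sketch below is a hypothetical convenience helper (it is not part of Versus Incident, and `warn_deprecated_msteams_vars` is a name we made up); it scans the environment for the deprecated v1.1.5 variable names and prints their v1.2.0 replacements:

```shell
# Flag deprecated v1.1.5 Teams variables still present in the environment.
# warn_deprecated_msteams_vars is a hypothetical helper, not a Versus Incident command.
warn_deprecated_msteams_vars() {
  if [ -n "${MSTEAMS_WEBHOOK_URL:-}" ]; then
    echo "MSTEAMS_WEBHOOK_URL is deprecated; set MSTEAMS_POWER_AUTOMATE_URL instead"
  fi
  # Team-specific variables all share the MSTEAMS_OTHER_WEBHOOK_URL_ prefix.
  for var in $(env | grep -o '^MSTEAMS_OTHER_WEBHOOK_URL_[A-Z0-9_]*' || true); do
    new=$(echo "$var" | sed 's/WEBHOOK_URL/POWER_URL/')
    echo "$var is deprecated; set $new instead"
  done
  return 0
}

warn_deprecated_msteams_vars
```

Running this in the shell that launches Versus Incident prints one warning per leftover variable and nothing when the migration is complete.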
3. Setting up Power Automate for Microsoft Teams
To set up Microsoft Teams integration with Power Automate:

1. Create a new Power Automate flow:
   - Sign in to Power Automate
   - Click on "Create" → "Instant cloud flow"
   - Select "When a HTTP request is received" as the trigger
2. Configure the HTTP trigger:
   - The HTTP POST URL will be generated automatically after you save the flow
   - For the Request Body JSON Schema, you can use:
     ```json
     {
       "type": "object",
       "properties": {
         "message": {
           "type": "string"
         }
       }
     }
     ```
3. Add a "Post message in a chat or channel" action:
   - Click "+ New step"
   - Search for "Teams" and select "Post message in a chat or channel"
   - Configure the Teams channel where you want to post messages
   - In the Message field, use:
     ```
     @{triggerBody()?['message']}
     ```
4. Save your flow and copy the HTTP POST URL:
   - After saving, go back to the HTTP trigger step to see the generated URL
   - Copy this URL and use it for your `MSTEAMS_POWER_AUTOMATE_URL` environment variable or directly in your configuration file
4. Update Your API Calls
If you're making direct API calls that use the Teams integration, update your query parameters:

Old (v1.1.5):

```
POST /api/incidents?msteams_other_webhook_url=qc
```

New (v1.2.0):

```
POST /api/incidents?msteams_other_power_url=qc
```
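A complete request with the new parameter might look like the sketch below. The host is a placeholder, `qc` must match a key defined under `other_power_urls` in your `config.yaml`, and `build_incident_url` is a hypothetical helper we add only to make the URL construction explicit:

```shell
# build_incident_url is a hypothetical helper, not part of Versus Incident.
build_incident_url() {
  host=$1  # base URL of your Versus Incident deployment (placeholder)
  team=$2  # key under other_power_urls in config.yaml
  printf '%s/api/incidents?msteams_other_power_url=%s' "$host" "$team"
}

url=$(build_incident_url "http://your-versus-host:3000" "qc")
echo "$url"

# Send the alert (run against a real deployment):
# curl -X POST "$url" \
#   -H "Content-Type: application/json" \
#   -d '{"service_name": "qc-service", "logs": "[ERROR] QC pipeline failed"}'
```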
5. Update Your Microsoft Teams Templates (Optional)
The template syntax remains the same, but you might want to review your templates to ensure they work correctly with the new integration. Here's a sample template for reference:

````
# Critical Error in {{.ServiceName}}
**Error Details:**
```{{.Logs}}```
Please investigate immediately
````
Testing the Migration
After updating your configuration, test the Microsoft Teams integration to ensure it's working correctly:

```bash
curl -X POST http://your-versus-incident-server:3000/api/incidents \
  -H "Content-Type: application/json" \
  -d '{"service_name": "Test Service", "logs": "This is a test incident alert for Microsoft Teams integration"}'
```
Additional Notes
- The older Microsoft Teams integration using webhook URLs still works after upgrading to v1.2.0; just move the URL from the old `webhook_url` property to the new `power_automate_url` property
- If you experience any issues with message delivery to Microsoft Teams, check your Power Automate flow run history to debug potential issues
- For organizations with multiple teams or departments, consider setting up separate Power Automate flows for each team and configuring them with the `other_power_urls` property
Migration Guide to v1.3.0
Table of Contents
- Key Changes in v1.3.0
- How to Migrate from v1.2.0
- Complete Configuration Example
- Upgrading from v1.2.0
This guide explains the changes introduced in Versus Incident v1.3.0 and how to update your configuration to take advantage of the new features.
Key Changes in v1.3.0
Version 1.3.0 introduces enhanced on-call management capabilities and configuration options, with a focus on flexibility and team-specific routing.
1. New Provider Configuration (Major Change from v1.2.0)
A significant change in v1.3.0 is the introduction of the `provider` property in the on-call configuration, which allows you to explicitly specify which on-call service to use:

```yaml
oncall:
  enable: false
  wait_minutes: 3
  provider: aws_incident_manager # NEW in v1.3.0: Explicitly select "aws_incident_manager" or "pagerduty"
```
This change enables Versus Incident to support multiple on-call providers simultaneously. In v1.2.0, there was no provider selection mechanism, as AWS Incident Manager was the only supported provider.
2. PagerDuty Integration (New in v1.3.0)
Version 1.3.0 introduces PagerDuty as a new on-call provider with comprehensive configuration options:
```yaml
oncall:
  provider: pagerduty # Select PagerDuty as your provider
  pagerduty: # New configuration section in v1.3.0
    routing_key: ${PAGERDUTY_ROUTING_KEY} # Integration/Routing key for Events API v2
    other_routing_keys: # Optional team-specific routing keys
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}
```
The PagerDuty integration supports:
- Default routing key for general alerts
- Team-specific routing keys via the `other_routing_keys` configuration
- Dynamic routing using the `pagerduty_other_routing_key` query parameter
Example API call to target the infrastructure team:
```bash
curl -X POST "http://your-versus-host:3000/api/incidents?pagerduty_other_routing_key=infra" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[ERROR] Load balancer failure.",
    "ServiceName": "lb-service",
    "UserID": "U12345"
  }'
```
3. AWS Incident Manager Environment-Specific Response Plans (New in v1.3.0)
Version 1.3.0 enhances AWS Incident Manager integration with support for environment-specific response plans:
```yaml
oncall:
  provider: aws_incident_manager
  aws_incident_manager:
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN} # Default response plan
    other_response_plan_arns: # New in v1.3.0
      prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD}
      dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV}
      staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING}
```
This feature allows you to:
- Configure multiple response plans for different environments
- Dynamically select the appropriate response plan using the `awsim_other_response_plan` query parameter
- Use a more flexible named-environment approach for response plan selection
Example API call to use the production environment's response plan:
```bash
curl -X POST "http://your-versus-host:3000/api/incidents?awsim_other_response_plan=prod" \
  -H "Content-Type: application/json" \
  -d '{
    "Logs": "[ERROR] Production database failure.",
    "ServiceName": "prod-db-service",
    "UserID": "U12345"
  }'
```
How to Migrate from v1.2.0
If you're upgrading from v1.2.0, update your on-call configuration to include the `provider` property.
Complete Configuration Example
Replace your existing on-call configuration with the new structure:
```yaml
oncall:
  enable: false # Set to true to enable on-call functionality
  wait_minutes: 3 # Time to wait for acknowledgment before escalating
  provider: aws_incident_manager # or "pagerduty"

  aws_incident_manager: # Used when provider is "aws_incident_manager"
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}
    other_response_plan_arns: # NEW in v1.3.0: Optional environment-specific response plan ARNs
      prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD}
      dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV}
      staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING}

  pagerduty: # Used when provider is "pagerduty"
    routing_key: ${PAGERDUTY_ROUTING_KEY}
    other_routing_keys: # Optional team-specific routing keys
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}

redis: # Required for on-call functionality
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
```
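The `${VAR}` references in the configuration above are resolved from environment variables. A matching set of exports might look like the sketch below; every value is a placeholder you must replace with your own ARNs, routing key, and Redis credentials:

```shell
# Placeholder values for the variables referenced by the on-call configuration.
# Response plan ARNs come from AWS Incident Manager; the routing key from a
# PagerDuty Events API v2 integration.
export AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN="arn:aws:ssm-incidents::111111111111:response-plan/default"
export AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD="arn:aws:ssm-incidents::111111111111:response-plan/prod"
export PAGERDUTY_ROUTING_KEY="your-events-api-v2-routing-key"
export REDIS_HOST="localhost"
export REDIS_PORT="6379"
export REDIS_PASSWORD="changeme"
```

Only the variables for the provider you selected (plus the Redis ones) need real values; the rest can be left unset.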
Upgrading from v1.2.0
1. Update your Versus Incident deployment to v1.3.0:

   ```bash
   # Docker
   docker pull ghcr.io/versuscontrol/versus-incident:v1.3.0

   # Or update your Kubernetes deployment to use the new image
   ```

2. Update your configuration as described above, ensuring that Redis is properly configured if you're using on-call features.

3. Restart your Versus Incident service to apply the changes.
For any issues with the migration, please open an issue on GitHub.