Introduction
Long ago I wrote a blog post describing the usefulness of PagerDuty and the service it provides. Having used this service for over four years on multiple teams, it is still on my list of must-have tools. In the past I’ve always configured PagerDuty manually: inviting users and hand-crafting escalation policies and schedules. It wasn’t overly laborious, but it was still a manual process.
I recently switched to a new team that uses both PagerDuty and Terraform. As there were lots of PagerDuty changes to make, I opted to take the plunge and automate our PagerDuty configuration. In this post I will walk through a sample Terraform configuration that sets up PagerDuty for a fictional team. I’ve also created a sample GitHub repo with a fully functional Terraform configuration.
PagerDuty Overview
PagerDuty is a wildly successful service that enables incident response and resolution. In short, it can act as an Operations hub and aggregate alerts from disparate applications or environments and route them to the correct teams or individuals. It provides an API, integrations (Slack, DataDog, AWS), reporting and a slew of other features. It’s not free but it is definitely best in class.
PagerDuty allows you to create teams, users, schedules and services so that when an incident occurs, it escalates and notifies a human by email, SMS, push notification or phone call that something is broken and needs their attention.
Terraform Overview
Brought to you by the same folks who made Packer, Vagrant and Consul, Terraform is a tool for building, changing and versioning infrastructure or other resources. Using its own configuration language, HCL (HashiCorp Configuration Language), Terraform allows you to treat your AWS instances, DataDog Monitors or Heroku applications as codified resources that can be versioned, shared, reviewed and reused. If you like to build and automate, Terraform has over 60 providers to choose from.
As I discussed in the introduction, we will be using the PagerDuty provider in Terraform to create our teams, users, services, schedules and escalation policies.
Fictional Team
In our example, a fictional team, The Transformers, is made up of seven individuals. There are two sub-teams (Autobots and Decepticons); each sub-team has three developers, and all of them report to one leader. Each sub-team is responsible for one application/service and is required to provide continuous on-call coverage to ensure any customer-impacting incidents or failures are handled appropriately.
Setup
- PagerDuty: If you are going to create PagerDuty resources via the API, you will need an API key with Full Access.
- Terraform: Installation for Terraform is pretty straightforward. The download page has the latest and former releases available for many platforms. As HashiCorp tends to do frequent releases and I work on multiple projects with varying Terraform versions, I opt to use tfenv to manage my versions.
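Before any resources can be created, the PagerDuty provider has to be handed that API key. Here is a minimal sketch of how that could look. The file name and the variable name pagerduty_token are my own placeholders, not something taken from the sample repo, and I am assuming the key is supplied at run time (for example through the TF_VAR_pagerduty_token environment variable) rather than committed to version control.

```hcl
/* provider.tf (hypothetical file name) */

# Assumption: the value is provided outside the repo, e.g. by exporting
# TF_VAR_pagerduty_token before running Terraform.
variable "pagerduty_token" {
  description = "PagerDuty API key with Full Access"
}

provider "pagerduty" {
  token = "${var.pagerduty_token}"
}
```

Keeping the key in a variable means the same configuration can be pointed at a different PagerDuty account without editing any .tf files.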
Teams
The PagerDuty team is a collection of users. In our example, each user/Transformer will be a member of the Transformers team and of their own sub-team (Autobots or Decepticons).
/* teams.tf */
resource "pagerduty_team" "transformers" {
  name = "Transformers"
}

resource "pagerduty_team" "autobots" {
  name = "Autobots"
}

resource "pagerduty_team" "decepticons" {
  name = "Decepticons"
}
Users
The PagerDuty users are defined in two separate files: autobots.tf and decepticons.tf. They could easily have been in one file, but I chose to split them up.
Each user has several attributes, but only the name and email address are required. Once the user is created, an email invitation is sent to them to join your PagerDuty organization.
/* autobots.tf */
resource "pagerduty_user" "grimlock" {
  name      = "Grimlock"
  email     = "[email protected]"
  color     = "white"
  role      = "user"
  job_title = "Dinosaur"
  teams     = ["${pagerduty_team.autobots.id}", "${pagerduty_team.transformers.id}"]
}
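The team leader is the one user with a different role. As a sketch, this is roughly what his definition could look like. The name, color, role and job title match what Terraform reports for pagerduty_user.optimus in the plan output further down. The email address is a placeholder, and the team membership is my assumption, since the plan only shows that value as computed.

```hcl
resource "pagerduty_user" "optimus" {
  name      = "Optimus Prime"
  email     = "optimus@example.com" # placeholder address, not from the original post
  color     = "dark-blue"
  role      = "admin"
  job_title = "Leader"

  # Assumption: the leader belongs only to the parent team.
  teams = ["${pagerduty_team.transformers.id}"]
}
```

Which file this lives in is a matter of taste; it could sit in either of the two user files or in one of its own.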
Schedules
The schedules for both sub-teams are defined in one file. In our sample schedule, each on-call rotation lasts seven days (604800 seconds), as defined by the rotation_turn_length_seconds attribute. The order of the users defines the order of the rotation. The Terraform documentation has some good schedule examples as well.
/* schedules.tf */
resource "pagerduty_schedule" "autobots-schedule" {
  name      = "On-call - Autobots"
  time_zone = "America/New_York"

  layer {
    name                         = "Layer 1"
    rotation_turn_length_seconds = 604800 # 7 days
    start                        = "2017-06-01T12:00:00-04:00"
    rotation_virtual_start       = "2017-06-01T12:00:00-04:00"

    # The order of this list is the order of the rotation.
    users = ["${pagerduty_user.bumblebee.id}", "${pagerduty_user.cliffjumper.id}", "${pagerduty_user.grimlock.id}"]
  }
}
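The Decepticons schedule follows the same pattern. The sketch below takes its name, time zone and layer settings from the plan output later in this post. The rotation order of the three users is my assumption, because the plan shows that list only as computed.

```hcl
resource "pagerduty_schedule" "decepticons-schedule" {
  name      = "On-call - Decepticons"
  time_zone = "America/New_York"

  layer {
    name                         = "Layer 1"
    rotation_turn_length_seconds = 604800 # 7 days
    start                        = "2017-06-01T12:00:00-04:00"
    rotation_virtual_start       = "2017-06-01T12:00:00-04:00"

    # Assumed order: the original post does not show who goes first.
    users = ["${pagerduty_user.megatron.id}", "${pagerduty_user.soundwave.id}", "${pagerduty_user.starscream.id}"]
  }
}
```

Reordering the users list is all it takes to change who is on call first.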
Escalation Policies
The escalation policy resource defines the order of notification when an incident occurs. In our example, each sub-team has its own escalation policy. The Autobot or Decepticon on call gets fifteen minutes to acknowledge the incident; otherwise it escalates to Optimus Prime. If he fails to acknowledge the incident within another fifteen minutes, PagerDuty restarts the escalation process from the first rule. The num_loops attribute defines how many times this escalation loop will repeat.
/* escalation-policies.tf */
resource "pagerduty_escalation_policy" "autobots-esc-policy" {
  name      = "Autobots Policy"
  num_loops = 5

  # First, notify whoever is on call for the Autobots.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "schedule_reference"
      id   = "${pagerduty_schedule.autobots-schedule.id}"
    }
  }

  # If nobody acknowledges within 15 minutes, escalate to Optimus Prime.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = "${pagerduty_user.optimus.id}"
    }
  }
}
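The Decepticons get a policy of the same shape. The values in this sketch are read off the plan output later in this post; only the layout and comments are mine, so treat it as an illustration rather than the exact contents of the repo.

```hcl
resource "pagerduty_escalation_policy" "decepticons-esc-policy" {
  name      = "Decepticons Policy"
  num_loops = 5

  # First, notify whoever is on call for the Decepticons.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "schedule_reference"
      id   = "${pagerduty_schedule.decepticons-schedule.id}"
    }
  }

  # Both sub-teams escalate to the same leader.
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = "${pagerduty_user.optimus.id}"
    }
  }
}
```

Because the first rule targets a schedule rather than a user, changes to the rotation never require touching the policy itself.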
Services
All of our services are defined in the services.tf file. Each service is associated with the appropriate escalation policy for the sub-team and specifies the urgency rule for its incidents. With the constant rule type used here, every incident on a service gets the same urgency (high or low), which can be very useful for triaging when multiple incidents are firing simultaneously. Check out this PagerDuty knowledge base article for more information.
/* services.tf */
resource "pagerduty_service" "energon-v1" {
  name                    = "Energon Service v1"
  auto_resolve_timeout    = 14400 # seconds (4 hours)
  acknowledgement_timeout = 1800  # seconds (30 minutes)
  escalation_policy       = "${pagerduty_escalation_policy.autobots-esc-policy.id}"

  incident_urgency_rule {
    type    = "constant"
    urgency = "high"
  }
}
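The second service belongs to the Decepticons and points at their escalation policy. As with the other companion resources, this sketch is reconstructed from the plan output later in this post, so the attribute values are the ones Terraform reports for pagerduty_service.space-bridge-v1.

```hcl
resource "pagerduty_service" "space-bridge-v1" {
  name                    = "Space Bridge Service v1"
  auto_resolve_timeout    = 14400 # seconds (4 hours)
  acknowledgement_timeout = 1800  # seconds (30 minutes)
  escalation_policy       = "${pagerduty_escalation_policy.decepticons-esc-policy.id}"

  incident_urgency_rule {
    type    = "constant"
    urgency = "high"
  }
}
```

The escalation_policy reference is the only thing that ties a service to a sub-team, so handing a service to a different team is a one-line change.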
Planning
Now that we have our PagerDuty configuration defined, let’s run terraform plan to see what would happen. As you can see, all of our teams, users, services, schedules and escalation policies will be created.
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
...
+ pagerduty_escalation_policy.autobots-esc-policy
description: "Managed by Terraform"
name: "Autobots Policy"
num_loops: "5"
rule.#: "2"
rule.0.escalation_delay_in_minutes: "15"
rule.0.id: "<computed>"
rule.0.target.#: "1"
rule.0.target.0.id: "${pagerduty_schedule.autobots-schedule.id}"
rule.0.target.0.type: "schedule_reference"
rule.1.escalation_delay_in_minutes: "15"
rule.1.id: "<computed>"
rule.1.target.#: "1"
rule.1.target.0.id: "${pagerduty_user.optimus.id}"
rule.1.target.0.type: "user_reference"
+ pagerduty_escalation_policy.decepticons-esc-policy
description: "Managed by Terraform"
name: "Decepticons Policy"
num_loops: "5"
rule.#: "2"
rule.0.escalation_delay_in_minutes: "15"
rule.0.id: "<computed>"
rule.0.target.#: "1"
rule.0.target.0.id: "${pagerduty_schedule.decepticons-schedule.id}"
rule.0.target.0.type: "schedule_reference"
rule.1.escalation_delay_in_minutes: "15"
rule.1.id: "<computed>"
rule.1.target.#: "1"
rule.1.target.0.id: "${pagerduty_user.optimus.id}"
rule.1.target.0.type: "user_reference"
+ pagerduty_schedule.autobots-schedule
description: "Managed by Terraform"
layer.#: "1"
layer.0.id: "<computed>"
layer.0.name: "Layer 1"
layer.0.rotation_turn_length_seconds: "604800"
layer.0.rotation_virtual_start: "2017-06-01T12:00:00-04:00"
layer.0.start: "2017-06-01T12:00:00-04:00"
layer.0.users.#: "<computed>"
name: "On-call - Autobots"
time_zone: "America/New_York"
+ pagerduty_schedule.decepticons-schedule
description: "Managed by Terraform"
layer.#: "1"
layer.0.id: "<computed>"
layer.0.name: "Layer 1"
layer.0.rotation_turn_length_seconds: "604800"
layer.0.rotation_virtual_start: "2017-06-01T12:00:00-04:00"
layer.0.start: "2017-06-01T12:00:00-04:00"
layer.0.users.#: "<computed>"
name: "On-call - Decepticons"
time_zone: "America/New_York"
+ pagerduty_service.energon-v1
acknowledgement_timeout: "1800"
auto_resolve_timeout: "14400"
created_at: "<computed>"
description: "Managed by Terraform"
escalation_policy: "${pagerduty_escalation_policy.autobots-esc-policy.id}"
incident_urgency_rule.#: "1"
incident_urgency_rule.0.type: "constant"
incident_urgency_rule.0.urgency: "high"
last_incident_timestamp: "<computed>"
name: "Energon Service v1"
status: "<computed>"
+ pagerduty_service.space-bridge-v1
acknowledgement_timeout: "1800"
auto_resolve_timeout: "14400"
created_at: "<computed>"
description: "Managed by Terraform"
escalation_policy: "${pagerduty_escalation_policy.decepticons-esc-policy.id}"
incident_urgency_rule.#: "1"
incident_urgency_rule.0.type: "constant"
incident_urgency_rule.0.urgency: "high"
last_incident_timestamp: "<computed>"
name: "Space Bridge Service v1"
status: "<computed>"
+ pagerduty_team.autobots
description: "Managed by Terraform"
name: "Autobots"
+ pagerduty_team.decepticons
description: "Managed by Terraform"
name: "Decepticons"
+ pagerduty_team.transformers
description: "Managed by Terraform"
name: "Transformers"
+ pagerduty_user.bumblebee
avatar_url: "<computed>"
color: "yellow"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Soldier"
name: "Bumblebee"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.cliffjumper
avatar_url: "<computed>"
color: "red"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Soldier"
name: "Cliffjumper"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.grimlock
avatar_url: "<computed>"
color: "white"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Dinosaur"
name: "Grimlock"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.megatron
avatar_url: "<computed>"
color: "black"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Evil Genius"
name: "Megatron"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.optimus
avatar_url: "<computed>"
color: "dark-blue"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Leader"
name: "Optimus Prime"
role: "admin"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.soundwave
avatar_url: "<computed>"
color: "purple"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Soldier"
name: "Soundwave"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
+ pagerduty_user.starscream
avatar_url: "<computed>"
color: "blue"
description: "Managed by Terraform"
email: "[email protected]"
html_url: "<computed>"
invitation_sent: "<computed>"
job_title: "Soldier"
name: "Starscream"
role: "user"
teams.#: "<computed>"
time_zone: "<computed>"
Plan: 16 to add, 0 to change, 0 to destroy.
Summary
Hopefully this walkthrough of PagerDuty and Terraform sheds some light on how to automate your configuration. Versioning your PagerDuty configuration not only provides a backup, it also helps distribute knowledge throughout the rest of your team.
Matthew