Blog-Analytics_100x385

Chef Reporting API and Resource Updates

This post originally appeared on jtimberman’s Code Blog.

Have you ever wanted to find a list of nodes that updated a specific resource in a period of time? Such as “show me all the nodes in production that had an application service restart in the last hour”? Or, “which nodes have updated their apt cache recently?” For example,

[code]% knife report resource ‘role:supermarket-app AND chef_environment:supermarket-prod’ execute ‘apt-get update’
execute[apt-get update] changed in the following runs:
app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z
app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z
app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z
app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z
app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z
Use ‘knife runs show’ with the run UUID to get more info
[/code]

I have released a new knife plugin to do that, but first some background.

At CHEF, we run the community’s cookbook site, Supermarket. We monitor the systems that run the site with Sensu. The current infrastructure runs instances on Amazon Web Services EC2, with an Elastic Load Balancer (ELB) in front of them. As a corrective action for a Supermarket outage, CHEF’s operations team added a new check for elevated HTTP 500 responses from the application servers behind the ELB. One thing we found was that when Supermarket was deployed, and the unicorn server restarted, we would see elevated 500’s, but the site often wouldn’t actually be impacted.

The Sensu check is run from a “relay” node. That is, it isn’t run on the application servers or the Sensu server – it’s run out of band since it’s for the ELB. One might imagine we could have similar checks for other services that aren’t run on “managed nodes,” but that’s neither here nor there. The issue is that we get an alert message that looks like this:

[code]Sensu Alerts ALERT – [i-d1dfd5d9/check-elb-backend-500] – CheckELBMetrics CRITICAL: supermarket-elb; Sum of HTTPCode_Backend_5XX is 2538.0. (expected lower than 30.0); (HTTPCode_Backend_5XX ‘2538.0’ within 300 seconds between 2014-08-19 13:33:36 +0000 to 2014-08-19 13:38:36 +0000) [Playbook].[/code]

The first part, [i-d1dfd5d9/check-elb-backend-500] is the node name and the check that alerted. The node name here is the monitoring relay that runs the check, not the actual node or nodes where Supermarket was deployed and restarted. This is where Chef Reporting comes into play. In Chef Reporting, we can view information about recent Chef client runs, which gives us a graph like this.

If we go look at the reports in the Chef Manage console, we can drill down to something like this.

This shows that unicorn was restarted in this run. That’s great, but if I’m getting this alert at a time when I’m not particularly coherent (e.g, 2AM), I want a command in a playbook that I can run to get more information quickly without having to log into the webui and click around imprecisely. CHEF publishes a knife-reporting gem that has a couple handy sub-commands to retrieve this run data. For example, we can list runs.

[code]% knife runs list
node_name: i-3022aa3b
run_id: 9eccd8f6-876b-4a57-87ac-0b3e7b7ef1e7
start_time: 2014-08-21T17:03:56Z
status: started

node_name: i-a09424a8
run_id: f2b7871a-149b-4fd3-abdc-d74a838d719a
start_time: 2014-08-21T17:00:23Z
status: success
[/code]

Or, we can display a specific run.

[code]% knife runs show eecb04fb-11df-438a-8e81-dd610eb66616
run_detail:
data:
end_time: 2014-08-20T17:50:12Z
node_name: i-9f22aa94
run_id: eecb04fb-11df-438a-8e81-dd610eb66616
run_list: ["role[base]","role[supermarket-app]"]
start_time: 2014-08-20T17:45:37Z
status: success
total_res_count: 261
updated_res_count: 17
run_resources:
cookbook_name: supermarket
cookbook_version: 2.7.2
duration: 209
final_state:
enabled: false
running: true
id: unicorn
initial_state:
enabled: false
running: true
name: unicorn
result: restart
type: service
uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/eecb04fb-11df-438a-8e81-dd610eb66616/15[/code]

This is handy, but a little limited. What if I want to display only the runs containing the service[unicorn] resource?

That’s where my knife-report-resource plugin helps. At first, it was very much specific to finding unicorn restarts on Supermarket app servers. However, I wanted to make it more general purpose as I think people would want to be able to find when arbitrary resources were updated. This is how it works:

  1. Query the Chef Server for a particular set of nodes. For example, 'role:supermarket-app AND chef_environment:supermarket-prod'.
  2. Get all the Chef client runs for a specified time period up until the current time. By default, it starts from one hour ago, but we can pass an ISO8601 timestamp.
  3. Iterate over all the runs looking for runs by the nodes that were returned by the search query, gathering the specified resource type and name.
  4. Display some nice output with the node’s FQDN, the run’s UUID, and a timestamp.

From the earlier example:

[code]% knife report resource ‘role:supermarket-app AND chef_environment:supermarket-prod’ execute ‘apt-get update’
execute[apt-get update] changed in the following runs:
app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z
app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z
app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z
app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z
app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z
Use `knife runs show` with the run UUID to get more info[/code]

Then, we can drill down further into one of these runs with the knife-reporting plugin.

[code]% knife runs show 2230cf30-6d95-4e43-be18-211137eaf802
run_detail:
data:
end_time: 2014-10-07T14:07:03Z
node_name: i-d7fed0df
run_id: 2230cf30-6d95-4e43-be18-211137eaf802
run_list: ["role[base]","role[supermarket-app]"]
start_time: 2014-10-07T14:03:59Z
status: success
total_res_count: 271
updated_res_count: 12
run_resources:
cookbook_name: chef-client
cookbook_version: 3.6.0
duration: 99
final_state:
enabled: true
running: false
id: chef-client
initial_state:
enabled: true
running: true
name: chef-client
result: enable
type: runit_service
uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/0

cookbook_name: supermarket
cookbook_version: 2.11.0
duration: 8506
final_state:
id: apt-get update
initial_state:
name: apt-get update
result: run
type: execute
uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/5[/code]

Hopefully you find this plugin useful! It is a RubyGem, and is available on RubyGems.org, and the source is available on GitHub.

Posted in:

Joshua Timberman

Joshua Timberman is a Code Cleric at CHEF, where he Cures Technical Debt Wounds for 1d8+5 lines of code, casts Protection from Yaks, and otherwise helps continuously improve internal technical process.