Monday, 20 April 2015

Debugging TripleO Heat templates

Lately, I've been spending increasing amounts of time working with TripleO heat templates, and I've noticed some recurring patterns in my debugging workflow which I thought might be worth sharing.

For the uninitiated, TripleO is an OpenStack deployment project which aims to deploy and manage OpenStack using standard OpenStack APIs.  In practice, this means using Nova and Ironic for baremetal node provisioning, and Heat to orchestrate the deployment and configuration of the nodes.

The TripleO heat templates, unlike most of the heat examples, are pretty complex.  They make extensive use of many "advanced" features, such as nested stacks, provider resources configured via the environment, and many software config resources.

This makes TripleO a fairly daunting prospect for those wishing to modify and/or debug the templates.

Fortunately the TripleO templates, although large, contain many repeated patterns and a good level of abstraction and modularity.  Combined with some recently added heat interfaces, they become much less daunting, as I'll demonstrate in the worked example below:

Step 1: Create the Stack


So, step 1 when deploying OpenStack via TripleO is to do a "heat stack-create".  Whether you create the heat stack directly via python-heatclient (which is what the TripleO "devtest" script does), or indirectly via some other interface such as tuskar-ui, the end result is the same - a heat stack is created (called "overcloud" by default):

$ heat stack-create -e /home/shardy/tripleo/overcloud-env.json \
    -e /home/shardy/tripleo/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
    -t 360 -f /home/shardy/tripleo/tripleo-heat-templates/overcloud-without-mergepy.yaml \
    -P ExtraConfig= overcloud



+--------------------------------------+------------+--------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        |
+--------------------------------------+------------+--------------------+----------------------+
| e4cfc4a8-d9e9-4033-8556-5ebca84c1455 | overcloud  | CREATE_IN_PROGRESS | 2015-04-20T11:05:53Z |
+--------------------------------------+------------+--------------------+----------------------+
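While the create is in progress (or after it completes), heat event-list is a quick way to watch what heat is doing - each resource state transition shows up as an event, so this is often the first hint of trouble:

$ heat event-list overcloud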


Step 2: Oh No - CREATE_FAILED!

OK, it happens - sometimes you have a fault in your environment, a bug in your templates, or you just get bitten by a regression in one of the projects used to deploy your overcloud.

Unfortunately that modularity I just mentioned leads to some additional complexity when debugging - the tree of resources created by heat is actually grouped into nearly 40 nested stacks in my environment!  (The exact number depends on how many nodes you're deploying.)

You can see them all, including which ones failed, with heat stack-list, using the --show-nested option and your choice of either grep "FAILED" or the -f filter option to python-heatclient:

$ heat stack-list --show-nested -f "status=FAILED"
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+
| id                                   | stack_name                                                                                               | stack_status  | creation_time        | parent                               |
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+
| e4cfc4a8-d9e9-4033-8556-5ebca84c1455 | overcloud                                                                                                | CREATE_FAILED | 2015-04-20T11:05:53Z | None                                 |
| 36f3ef93-872f-460b-bd6a-14a89569d5a7 | overcloud-ControllerNodesPostDeployment-rl67kiqu7pbp                                                     | CREATE_FAILED | 2015-04-20T11:09:18Z | e4cfc4a8-d9e9-4033-8556-5ebca84c1455 |
| 28d1fd38-85ba-442b-9e57-859731349e94 | overcloud-ControllerNodesPostDeployment-rl67kiqu7pbp-ControllerDeploymentLoadBalancer_Step1-tnsuslbx5hu7 | CREATE_FAILED | 2015-04-20T11:09:20Z | 36f3ef93-872f-460b-bd6a-14a89569d5a7 |
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+
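If your python-heatclient is too old to have the -f option, the grep approach gives the same result:

$ heat stack-list --show-nested | grep FAILED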


Here, we can derive some useful information by looking at the stack names.  Note that in all cases we can disregard the randomly generated suffix on the stack names (heat adds it internally for nested stack resources).

  • overcloud is the top-level stack, the parent at the top of the tree.  This is defined by the overcloud-without-mergepy.yaml template which we passed to heat stack-create.
  • ControllerNodesPostDeployment-rl67kiqu7pbp is the nested stack which handles post-deployment configuration of all Controller nodes.  This is the ControllerNodesPostDeployment resource, defined by the overcloud resource registry as the implementation of the OS::TripleO::ControllerPostDeployment type, which is a provider resource alias for this template when using the puppet implementation (see the registry snippet just after this list).
  • The final (verbosely named!) stack maps to the ControllerDeploymentLoadBalancer_Step1 resource in controller-post-puppet.yaml.
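For reference, the mapping in question lives in overcloud-resource-registry-puppet.yaml and looks something like this (trimmed - the exact paths vary between tripleo-heat-templates releases):

resource_registry:
  OS::TripleO::ControllerPostDeployment: puppet/controller-post-puppet.yaml

Pointing this alias at a different template via an extra environment file is how you'd swap in an alternative implementation.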
All of this is a long-winded way of saying that something went wrong applying a puppet manifest, via an OS::Heat::StructuredDeployments resource (ControllerDeploymentLoadBalancer_Step1) - anything with "Deployment" in the name failing is highly likely to mean the same thing.


Armed with this information, we can proceed to figure out why :)


Step 3: Resource Introspection

So we now know which nested stack failed, but not which resource, or why.

There are a couple of ways to find this out: you can either use the steps outlined in my previous post about nested resource introspection, or (if you're lazy like me) you can use the heat resource-list --nested-depth option to save some time:

$ heat resource-list --nested-depth 5 overcloud | grep FAILED
| ControllerNodesPostDeployment               | 36f3ef93-872f-460b-bd6a-14a89569d5a7          | OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   | 2015-04-20T11:05:53Z |                                        |
| ControllerDeploymentLoadBalancer_Step1      | 28d1fd38-85ba-442b-9e57-859731349e94          | OS::Heat::StructuredDeployments                   | CREATE_FAILED   | 2015-04-20T11:09:19Z | ControllerNodesPostDeployment          |
| 0                                           | 980137bc-21b1-460c-9d4a-488cb5611a6c          | OS::Heat::StructuredDeployment                    | CREATE_FAILED   | 2015-04-20T11:09:20Z | ControllerDeploymentLoadBalancer_Step1 |

Here, we can see several things:
  • ControllerDeploymentLoadBalancer_Step1 has failed; it's an OS::Heat::StructuredDeployments resource.  StructuredDeployments (plural) resources apply a heat StructuredConfig/SoftwareConfig to a group of servers.
  • There's a "0" resource, which is an OS::Heat::StructuredDeployment (singular) type.  The parent resource (last column) of this is ControllerDeploymentLoadBalancer_Step1.  This is because a SoftwareDeployments resource creates a nested stack containing one (sequentially named) SoftwareDeployment per server - in this case, one per Controller node in the OS::Heat::ResourceGroup defined as "Controller" in the overcloud-without-mergepy template.  You can see this grouping directly, as shown below.
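Listing the resources of the deployments stack demonstrates this (we use the resource ID as a stack ID - more on why that works in a moment).  With a single controller there's just the one numbered entry (output trimmed):

$ heat resource-list 28d1fd38-85ba-442b-9e57-859731349e94
| 0 | 980137bc-21b1-460c-9d4a-488cb5611a6c | OS::Heat::StructuredDeployment | CREATE_FAILED | 2015-04-20T11:09:20Z |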

Now, we can do a resource-show to find out the reason for the failure.  Here, we use the ID of ControllerDeploymentLoadBalancer_Step1 as the stack ID, because all nested stack resources set their physical resource ID to the ID of the stack they create:


$ heat resource-show 28d1fd38-85ba-442b-9e57-859731349e94 0 | grep resource_status_reason
| resource_status_reason | Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6 |



So, to summarize what we've discovered so far:

  • A SoftwareDeployment (in this case a puppet run) failed on Controller node 0.
  • The thing it was running exited with status code 6 (if that came via puppet's --detailed-exitcodes, it means changes were applied but some resources failed).

The next step is to look at the logs to work out why.


Step 4: Debugging the failure

When a Heat SoftwareDeployment resource is triggered, it runs something on the node (e.g. applying a puppet manifest), then signals either success or failure back to Heat.  Fortunately, in recent versions of Heat, there is an API which exposes this information, in a more verbose form than the resource-show failure reason above:

To access it, you need the ID of the deployment (e.g. 980137bc-21b1-460c-9d4a-488cb5611a6c, from the heat resource-list above):



$ heat deployment-show 980137bc-21b1-460c-9d4a-488cb5611a6c
{
  "status": "FAILED",
  "server_id": "6a025200-b20e-47df-ae4c-97a54499b586",
  "config_id": "b924d133-42d7-48ab-b2c9-7311de3b3ca4",
  "output_values": {
    "deploy_stdout": "<stdout of command>,
    "deploy_stderr": "<stderr of command>",
    "deploy_status_code": 6
  },
  "creation_time": "2015-04-20T11:09:20Z",
  "updated_time": "2015-04-20T11:10:02Z",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6",
  "id": "980137bc-21b1-460c-9d4a-488cb5611a6c"
}

I've not included the full stderr/stdout because it's pretty long, but it's basically the same information that you get from SSHing onto the node and looking at the logs.
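As a further time-saver, newer versions of python-heatclient also have a deployment-output-show subcommand, which prints a single output without the JSON wrapping (check your client version supports it):

$ heat deployment-output-show 980137bc-21b1-460c-9d4a-488cb5611a6c deploy_stderr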

If you still want to look at the logs on the node itself, you can use "nova show" with the "server_id" above to get the IP of the node, then SSH in and investigate further:
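For example - note that the login user and addresses are environment-specific (heat-admin and the 192.0.2.0/24 ctlplane range are typical for a devtest setup):

$ nova show 6a025200-b20e-47df-ae4c-97a54499b586 | grep network
$ ssh heat-admin@192.0.2.6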

In Summary...

So those paying attention will have spotted that this all really boils down to two steps:


  1. Use heat resource-list with the --nested-depth option to find the failing resource.  The one you want is the one which isn't the parent_resource of any other and is in a FAILED state (a small helper for this is sketched below).
  2. Investigate what caused the failure.  For failing SoftwareDeployment resources, heat deployment-show is a useful time-saver which avoids always needing to log on to the node.
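If you find yourself repeating step 1 a lot, a trivial shell helper saves some typing.  This is just a sketch - it assumes python-heatclient is installed, your credentials are sourced, and a default stack name of "overcloud":

# List every FAILED resource in the tree; the entries with a
# parent_resource are the ones closest to the root cause.
failed_resources() {
  heat resource-list --nested-depth 5 "${1:-overcloud}" | grep FAILED
}

Then "failed_resources" (or "failed_resources my-other-stack") gives you the step 1 listing in one shot.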

Hopefully this somewhat demystifies the debugging of TripleO templates, and other large Heat deployments which use similar techniques such as nested stacks and SoftwareConfig resources!



