Steve Hardy

TripleO Containerized deployments, debugging basics

2018-06-04T10:09:00.001-07:00

Since the Pike release, TripleO has supported deployments with OpenStack services running in containers. Currently we use docker to run images based on those maintained by the Kolla project.

We already have some tips and tricks for container deployment debugging in tripleo-docs, but below are some more notes on my typical debug workflows.

Config generation debugging overview

In the TripleO container architecture, we still use Puppet to generate configuration files and do some bootstrapping, but it is run inside a container via a script, docker-puppet.py.

Config generation happens at the start of the deployment (step 1), and the configuration files are generated for all services (regardless of which step they are started in).

The input file used is /var/lib/docker-puppet/docker-puppet.json, but you can also filter this (e.g. via cut/paste or jq as shown below) to enable debugging for specific services. This is helpful when you need to iterate on debugging a config generation issue for just one service.

[root@overcloud-controller-0 docker-puppet]# jq '[.[]|select(.config_volume | contains("heat"))]' /var/lib/docker-puppet/docker-puppet.json | tee /tmp/heat_docker_puppet.json
{
  "puppet_tags": "heat_config,file,concat,file_line",
  "config_volume": "heat_api",
  "step_config": "include ::tripleo::profile::base::heat::api\n",
  "config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo"
}
{
  "puppet_tags": "heat_config,file,concat,file_line",
  "config_volume": "heat_api_cfn",
  "step_config": "include ::tripleo::profile::base::heat::api_cfn\n",
  "config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api-cfn:current-tripleo"
}
{
  "puppet_tags": "heat_config,file,concat,file_line",
  "config_volume": "heat",
  "step_config": "include ::tripleo::profile::base::heat::engine\n\ninclude ::tripleo::profile::base::database::mysql::client",
  "config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo"
}
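
If jq isn't to hand, the same filtering can be sketched in a few lines of Python. This is only an illustration of the substring match on config_volume; the function name and the sample entries (including the "nova" one) are my own placeholders, not anything shipped with TripleO.

```python
import json


def filter_by_config_volume(entries, needle):
    """Return the entries whose config_volume contains `needle` (substring match)."""
    return [entry for entry in entries if needle in entry.get("config_volume", "")]


# Hypothetical sample data, shaped like the docker-puppet.json entries above.
SAMPLE = [
    {"config_volume": "heat_api", "puppet_tags": "heat_config,file,concat,file_line"},
    {"config_volume": "heat", "puppet_tags": "heat_config,file,concat,file_line"},
    {"config_volume": "nova", "puppet_tags": "nova_config"},
]

if __name__ == "__main__":
    # Prints a JSON list containing only the two heat entries.
    print(json.dumps(filter_by_config_volume(SAMPLE, "heat"), indent=2))
```

On a real node you would load /var/lib/docker-puppet/docker-puppet.json with json.load and write the filtered list to a file such as /tmp/heat_docker_puppet.json.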

Then we can run the config generation, if necessary changing the tags (or puppet modules, which are consumed from the host filesystem e.g /etc/puppet/modules) until the desired output is achieved:


[root@overcloud-controller-0 docker-puppet]# export NET_HOST='true'
[root@overcloud-controller-0 docker-puppet]# export DEBUG='true'
[root@overcloud-controller-0 docker-puppet]# export PROCESS_COUNT=1
[root@overcloud-controller-0 docker-puppet]# export CONFIG=/tmp/heat_docker_puppet.json
[root@overcloud-controller-0 docker-puppet]# python /var/lib/docker-puppet/docker-puppet.py
2018-02-09 16:13:16,978 INFO: 102305 -- Running docker-puppet
2018-02-09 16:13:16,978 DEBUG: 102305 -- CONFIG: /tmp/heat_docker_puppet.json
2018-02-09 16:13:16,978 DEBUG: 102305 -- config_volume heat_api
2018-02-09 16:13:16,978 DEBUG: 102305 -- puppet_tags heat_config,file,concat,file_line
2018-02-09 16:13:16,978 DEBUG: 102305 -- manifest include ::tripleo::profile::base::heat::api
2018-02-09 16:13:16,978 DEBUG: 102305 -- config_image 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
...

When the config generation is completed, configuration files are written out to /var/lib/config-data/heat.

We then compare timestamps against the /var/lib/config-data/heat/heat.*origin_of_time file (touched for each service before we run the config-generating containers), so that only those files modified or created by puppet are copied to /var/lib/config-data/puppet-generated/heat.
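
To make the timestamp comparison concrete, here is a minimal Python sketch of the idea: touch a marker, let something write files, then keep only the files whose modification time is later than the marker. It is not the code docker-puppet.py actually uses; the function name and the file names in the demo are assumptions for illustration.

```python
import os
import tempfile


def files_newer_than(marker, root):
    """Return paths under `root` (relative to it) modified after `marker`."""
    marker_mtime = os.path.getmtime(marker)
    newer = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Strictly newer, so the marker itself is never selected.
            if os.path.getmtime(path) > marker_mtime:
                newer.append(os.path.relpath(path, root))
    return sorted(newer)


def demo():
    """Build a throwaway tree with one file older and one newer than the marker."""
    with tempfile.TemporaryDirectory() as root:
        marker = os.path.join(root, "demo.origin_of_time")
        old = os.path.join(root, "packaged.conf")
        new = os.path.join(root, "generated.conf")
        for path in (marker, old, new):
            with open(path, "w") as handle:
                handle.write("")
        # Set explicit mtimes so the result does not depend on timing.
        os.utime(old, (1000, 1000))
        os.utime(marker, (2000, 2000))
        os.utime(new, (3000, 3000))
        return files_newer_than(marker, root)
```

Calling demo() returns only generated.conf, which is the behaviour we want: files that shipped in the image are left out, and only what puppet touched is copied.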

Note that we also calculate a checksum for each service (see /var/lib/config-data/puppet-generated/*.md5sum), which means we can detect when the configuration changes - when this happens we need paunch to restart the containers, even though the image did not change.

This checksum is added to the /var/lib/tripleo-config/hashed-docker-container-startup-config-step_*.json files by docker-puppet.py, and these files are later used by paunch to decide if a container should be restarted (see below).
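
The checksum idea can be sketched as below. This is an assumption-laden illustration (an md5 over relative paths and file contents in sorted order), not necessarily how docker-puppet.py computes its *.md5sum files, but it shows why any change to a generated file yields a new hash.

```python
import hashlib
import os
import tempfile


def config_checksum(root):
    """md5 over the relative path and content of every file under `root`."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


def demo():
    """Return checksums before a change, with no change, and after a change."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "heat.conf")
        with open(path, "w") as handle:
            handle.write("[DEFAULT]\ndebug = False\n")
        before = config_checksum(root)
        unchanged = config_checksum(root)
        with open(path, "w") as handle:
            handle.write("[DEFAULT]\ndebug = True\n")
        after = config_checksum(root)
        return before, unchanged, after
```

The first two values returned by demo() are equal and the third differs, which is the signal used to decide that a container needs restarting even though its image is the same.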

Runtime debugging, paunch 101

Paunch is a tool that orchestrates launching containers for each step, and performing any bootstrapping tasks not handled via docker-puppet.py.

Its input is json: the /var/lib/tripleo-config/docker-container-startup-config-step_*.json files, which are created based on the enabled services (the content is derived directly from the service templates in tripleo-heat-templates).

These json files are then modified via docker-puppet.py (as mentioned above) to add a TRIPLEO_CONFIG_HASH value to the container environment - these modified files are written with a different name, see /var/lib/tripleo-config/hashed-docker-container-startup-config-step_*.json

Note that this environment variable isn't used by the container directly; it is used as a salt to trigger restarting containers when the configuration files in the mounted config volumes have changed.
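
Here is a small Python sketch of the "salt" idea: the hash is injected into the container's environment list, so a changed config produces a changed container definition. The helper names are mine, and comparing whole definitions is a simplification of whatever paunch really does.

```python
import copy

HASH_PREFIX = "TRIPLEO_CONFIG_HASH="


def with_config_hash(startup_config, container, config_hash):
    """Return a copy of `startup_config` with the hash set for `container`."""
    updated = copy.deepcopy(startup_config)
    env = updated[container].setdefault("environment", [])
    # Drop any previous hash so there is only ever one entry.
    env[:] = [item for item in env if not item.startswith(HASH_PREFIX)]
    env.append(HASH_PREFIX + config_hash)
    return updated


def definition_changed(old, new, container):
    """Simplified restart check: has the container definition changed at all?"""
    return old.get(container) != new.get(container)


# Trimmed-down definition based on the heat_engine example in this post.
BASE = {
    "heat_engine": {
        "image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo",
        "environment": ["KOLLA_CONFIG_STRATEGY=COPY_ALWAYS"],
    }
}
```

With this, two definitions built from different hashes compare as changed, while rebuilding with the same hash compares as unchanged, so nothing would be restarted needlessly.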

As in the docker-puppet case it's possible to filter the json file with jq and debug e.g mounted volumes or other configuration changes directly.

It's also possible to test configuration changes by manually modifying /var/lib/config-data/puppet-generated/ then either restarting the container via docker restart, or by modifying TRIPLEO_CONFIG_HASH then re-running paunch.

Note that paunch will kill any containers tagged for a particular step that are no longer present in the json you pass it. For example, with --config-id tripleo_step4 --managed-by tripleo-Controller, any container started for this step by a previous paunch apply will be killed if it has been removed from your json during testing. This is a feature which enables changes to the enabled services on update to your overcloud, but it's worth bearing in mind when testing as described here.
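
To spell out that warning, the effect is essentially a set difference between what is already running for a config-id and what your (possibly filtered) json still defines. The sketch below is my own illustration of that logic, not paunch code.

```python
def containers_to_remove(running, desired):
    """Names running under a config-id that the new json no longer defines."""
    return sorted(set(running) - set(desired))


# Hypothetical example: two containers were started for this step earlier,
# but the filtered json used for testing only defines heat_engine.
RUNNING = ["heat_engine", "heat_api"]
FILTERED_JSON = {"heat_engine": {"image": "example-image"}}
```

Here containers_to_remove(RUNNING, FILTERED_JSON) gives ["heat_api"], i.e. the container you did not mean to touch is the one that goes away.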


[root@overcloud-controller-0]# cd /var/lib/tripleo-config/
[root@overcloud-controller-0 tripleo-config]# jq '{"heat_engine": .heat_engine}' hashed-docker-container-startup-config-step_4.json | tee /tmp/heat_startup_config.json
{
  "heat_engine": {
    "healthcheck": {
      "test": "/openstack/healthcheck"
    },
    "image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo",
    "environment": [
      "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS",
      "TRIPLEO_CONFIG_HASH=14617e6728f5f919b16c74f1e98d0264"
    ],
    "volumes": [
      "/etc/hosts:/etc/hosts:ro",
      "/etc/localtime:/etc/localtime:ro",
      "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
      "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
      "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
      "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro",
      "/dev/log:/dev/log",
      "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro",
      "/etc/puppet:/etc/puppet:ro",
      "/var/log/containers/heat:/var/log/heat",
      "/var/lib/kolla/config_files/heat_engine.json:/var/lib/kolla/config_files/config.json:ro",
      "/var/lib/config-data/puppet-generated/heat/:/var/lib/kolla/config_files/src:ro"
    ],
    "net": "host",
    "privileged": false,
    "restart": "always"
  }
}
[root@overcloud-controller-0 tripleo-config]#  paunch --debug apply --file /tmp/heat_startup_config.json --config-id tripleo_step4 --managed-by tripleo-Controller
stdout: dd60546daddd06753da445fd973e52411d0a9031c8758f4bebc6e094823a8b45

stderr: 
[root@overcloud-controller-0 tripleo-config]# docker ps | grep heat
dd60546daddd        192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo          "kolla_start"            9 seconds ago       Up 9 seconds (health: starting)                       heat_engine

Containerized services, logging

There are a couple of ways to access the container logs:

On the host filesystem, the container logs are persisted under /var/log/containers/<service>
Via the docker daemon, using docker logs <container id or name>

It is also often useful to use docker inspect <container id or name> to verify the container configuration, e.g the image in use and the mounted volumes etc.
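
The docker inspect output is a json list, so it is easy to boil down to the bits you usually care about. Below is a hedged Python sketch that pulls out the image and the mounts; the Config.Image and Mounts[].Source/Destination fields are standard in docker inspect output, while the sample document is hand-written for the example rather than captured from a real node.

```python
import json


def summarize_inspect(inspect_json):
    """Return name, image and sorted 'source:destination' mounts per container."""
    summaries = []
    for container in json.loads(inspect_json):
        mounts = [
            "{}:{}".format(mount["Source"], mount["Destination"])
            for mount in container.get("Mounts", [])
        ]
        summaries.append({
            "name": container.get("Name", "").lstrip("/"),
            "image": container["Config"]["Image"],
            "mounts": sorted(mounts),
        })
    return summaries


# Hand-written, heavily trimmed sample in the shape of `docker inspect` output.
SAMPLE = json.dumps([{
    "Name": "/heat_engine",
    "Config": {"Image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo"},
    "Mounts": [
        {"Source": "/var/log/containers/heat", "Destination": "/var/log/heat"},
        {"Source": "/etc/hosts", "Destination": "/etc/hosts"},
    ],
}])
```

On a node you could feed it the real thing, e.g. the output of subprocess.check_output(["docker", "inspect", "heat_engine"]).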

Debugging containers directly

Sometimes logs are not enough to debug problems, and in this case you must interact with the container directly to diagnose the issue.

When a container is running (rather than stuck in a restart loop), you can attach a shell to it via docker exec:


[root@openstack-controller-0 ~]# docker exec -ti heat_engine /bin/bash
()[heat@openstack-controller-0 /]$ ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_start
      5 ?        Ss     1:50 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
     25 ?        S      3:05 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
     26 ?        S      3:06 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
     27 ?        S      3:06 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
     28 ?        S      3:05 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
   2936 ?        Ss     0:00 /bin/bash
   2946 ?        R+     0:00 ps ax

That's all for today. For more information please refer to tripleo-docs, or feel free to ask questions in #tripleo on Freenode!

Debugging TripleO revisited - Heat, Ansible & Puppet

2018-02-09T09:04:00.000-08:00

Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail.

In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments. I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.

In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.

We'll start by looking at the deploy workflow as a whole and some heat interfaces for diagnosing the nature of the failure, then we'll look at how to debug directly via Ansible and Puppet. In a future post I'll also cover the basics of debugging containerized deployments.

The TripleO deploy workflow, overview

A typical TripleO deployment consists of several discrete phases, which are run in order:

Provisioning of the nodes

A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud)
Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
The nodes are provisioned by Ironic

This first phase is the provisioning workflow. It is complete when the nodes are reported ACTIVE by nova (i.e. the nodes are provisioned with an OS and running).

Host preparation

The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity):

The node networking is configured, via the os-net-config tool
We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)

Service deployment, step-by-step configuration

The final step is to deploy the services, either on the baremetal host or in containers. This consists of several tasks run in a specific order:

We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
We run "docker-puppet.py" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
We run docker-puppet.py again (with a different configuration, and only on one node, the "bootstrap host"). This does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.

Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "docker-puppet.py" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).
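
As a rough mental model, the ordering described above can be written as a loop. This Python sketch only lists the actions in order; the action labels are my own wording and the default of five steps just follows the example values in the text.

```python
def deployment_actions(last_step=5):
    """Return (step, action) tuples in the order described above."""
    actions = []
    for step in range(1, last_step + 1):
        actions.append((step, "puppet host configuration"))
        if step == 1:
            # Config generation covers all services, so it runs only once.
            actions.append((step, "docker-puppet.py config generation"))
        actions.append((step, "paunch starts containers for this step"))
        actions.append((step, "docker-puppet.py bootstrap tasks (bootstrap host only)"))
    return actions


if __name__ == "__main__":
    for step, action in deployment_actions():
        print("step {}: {}".format(step, action))
```

When something fails, knowing which of these actions you were in for a given step narrows down where to look.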

Below is a diagram which illustrates this step-by-step deployment workflow:

TripleO Service configuration workflow

The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.

Debugging first steps - what failed?

Heat Stack create failed.

Ok something failed during your TripleO deployment, it happens to all of us sometimes! The next step is to understand the root-cause.

My starting point after this is always to run:

openstack stack failures list --long <stackname>

(undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0: 
  resource_type: OS::Heat::StructuredDeployment 
  physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106 
  status: CREATE_FAILED 
  status_reason: | 
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2 
  deploy_stdout: | 
     
    PLAY [localhost] ***************************************************************  
     
    ... 
     
    TASK [Run puppet host configuration for step 1] ********************************  
    ok: [localhost] 
     
    TASK [debug] *******************************************************************  
    fatal: [localhost]: FAILED! => { 
        "changed": false,  
        "failed_when_result": true,  
        "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [ 
            "Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",  
            "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain" 
        ] 
    } 
          to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry 
     
    PLAY RECAP ********************************************************************* 
    localhost                  : ok=18   changed=12   unreachable=0    failed=1

We can tell several things from the output (which has been edited above for brevity). Firstly, the name of the failing resource:

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0

The error was on one of the Controllers (ControllerDeployment)
The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
The failure was during the first step (Step1.0)
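
If you look at a lot of these, the resource name can be picked apart mechanically. The Python sketch below assumes names of exactly the form shown above; the regular expression and the field names are mine, and treating the trailing ".0" as an index is my reading rather than something heat documents under that name.

```python
import re

RESOURCE_RE = re.compile(
    r"^(?P<stack>[^.]+)\.(?P<phase>[^.]+)\."
    r"(?P<role>\w+?)Deployment_Step(?P<step>\d+)\.(?P<index>\d+)$"
)


def parse_failed_resource(name):
    """Split a failing deployment resource name into its parts."""
    match = RESOURCE_RE.match(name)
    if match is None:
        raise ValueError("unrecognised resource name: " + name)
    parts = match.groupdict()
    parts["step"] = int(parts["step"])
    parts["index"] = int(parts["index"])
    return parts


if __name__ == "__main__":
    print(parse_failed_resource(
        "overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0"))
```

For the failure above this gives the stack (overcloud), the phase (AllNodesDeploySteps), the role (Controller) and the step (1). Names that don't follow this pattern raise ValueError rather than being guessed at.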

Then we see more clues in the deploy_stdout: ansible failed running the task which runs puppet on the host, and it looks like a problem with the puppet code.

With a little more digging we can see which node exactly this failure relates to, e.g. we copy the SoftwareDeployment ID (the physical_resource_id) from the output above, then run:

(undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
29b3c254-5270-42ae-8150-9fc3f67d3d89
(undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
| 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0  | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | oooq_control |

Ok so puppet failed while running via ansible on overcloud-controller-0.

Debugging via Ansible directly

Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.

Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.

(undercloud) [stack@undercloud ~]$ openstack overcloud config download
The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
 (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
 common_deploy_steps_tasks.yaml    external_post_deploy_steps_tasks.yaml  templates
 Compute                           global_vars.yaml                       update_steps_playbook.yaml
 Controller                        group_vars                             update_steps_tasks.yaml
 deploy_steps_playbook.yaml        post_upgrade_steps_playbook.yaml       upgrade_steps_playbook.yaml
 external_deploy_steps_tasks.yaml  post_upgrade_steps_tasks.yaml          upgrade_steps_tasks.yaml

Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps. This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).

We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
 ...
TASK [Run puppet host configuration for step 1] ********************************************************************
ok: [192.168.24.6]

TASK [debug] *******************************************************************************************************
fatal: [192.168.24.6]: FAILED! => {
    "changed": false, 
    "failed_when_result": true, 
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
        "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend", 
        "exception: connect failed", 
        "Warning: Undefined variable '::deploy_config_name'; ", 
        "   (file & line not available)", 
        "Warning: Undefined variable 'deploy_config_name'; ", 
        "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
    ]
}

NO MORE HOSTS LEFT *************************************************************************************************
 to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry

PLAY RECAP *********************************************************************************************************
192.168.24.6               : ok=56   changed=2    unreachable=0    failed=1

Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node. We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).

If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.

Debugging via Puppet directly

Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet):

Firstly we log on to the node, and look at the files in the /var/lib/tripleo-config directory.

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@192.168.24.6
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Last login: Fri Feb  9 14:30:02 2018 from gateway
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
[heat-admin@overcloud-controller-0 tripleo-config]$ ls
docker-container-startup-config-step_1.json  docker-container-startup-config-step_4.json  puppet_step_config.pp
docker-container-startup-config-step_2.json  docker-container-startup-config-step_5.json
docker-container-startup-config-step_3.json  docker-container-startup-config-step_6.json

The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host.

We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value. This will be at the same value as the failing step, but it can also be useful sometimes to manually modify this for development testing of different steps for a particular service.

[root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
1
[root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json 
{"step": 1}
[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
...
Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain
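
If you do want to change the step value for testing, the file shown above is just a one-key json document, so it can be rewritten by hand or with a couple of lines of Python. This sketch assumes the file only ever contains the step key (as in the output above); the helper names are mine, and the demo writes to a temporary file rather than the real path.

```python
import json
import os
import tempfile


def set_step(path, step):
    """Overwrite `path` with a json document holding only the step value."""
    with open(path, "w") as handle:
        json.dump({"step": step}, handle)


def get_step(path):
    """Read the step value back from `path`."""
    with open(path) as handle:
        return json.load(handle)["step"]


def demo():
    """Round-trip a step value through a throwaway file."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "config_step.json")
        set_step(path, 3)
        with open(path) as handle:
            return get_step(path), handle.read()
```

On a node the path would be /etc/puppet/hieradata/config_step.json (as root), after which the hiera command above should report the new value.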

Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181. I look at the file, fix the problem (ugeas should be augeas), then re-run puppet apply to confirm the fix.

Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.

That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.

OpenStack Days UK

2017-09-27T04:17:00.002-07:00

Yesterday I attended the OpenStack Days UK event, held in London. It was a very good day and there were a number of interesting talks, and it provided a great opportunity to chat with folks about OpenStack.

I gave a talk, titled "Deploying OpenStack at scale, with TripleO, Ansible and Containers", where I gave an update of the recent rework in the TripleO project to make more use of Ansible and enable containerized deployments.

I'm planning some future blog posts with more detail on this topic, but for now here's a copy of the slide deck I used, also available on github.

OpenStack Summit - TripleO Project Onboarding

2017-05-11T05:28:00.000-07:00

We've been having a productive week here in Boston at the OpenStack Summit, and one of the sessions I was involved in was a TripleO project Onboarding session.

The project onboarding sessions are a new idea for this summit, and provide the opportunity for new or potential contributors (and/or users/operators) to talk with the existing project developers and get tips on how to get started as well as ask any questions and discuss ideas/issues.

The TripleO session went well, and I'm very happy to report it was well attended and we had some good discussions. The session was informal with an emphasis on questions and some live demos/examples, but we did also use a few slides which provide an overview and some context for those new to the project.

Here are the slides used (also on my github), unfortunately I can't share the Q+A aspects of the session as it wasn't recorded, but I hope the slides will prove useful - we can be found in #tripleo on Freenode if anyone has questions about the slides or getting started with TripleO in general.

Developing Mistral workflows for TripleO

2017-03-03T02:51:00.000-08:00

During the Newton/Ocata development cycles, TripleO made changes to the architecture so we make use of Mistral (the OpenStack workflow API project) to drive workflows required to deploy your OpenStack cloud.

Prior to this change we had workflow defined inside python-tripleoclient, and most API calls were made directly to Heat. This worked OK but there was too much "business logic" inside the client, which doesn't work well if non-python clients (such as tripleo-ui) want to interact with TripleO.

To solve this problem, a number of mistral workflows and custom actions have been implemented, which are available via the Mistral API on the undercloud. This can be considered the primary "TripleO API" for driving all deployment tasks now.

Here's a diagram showing how it fits together:

Overview of Mistral integration in TripleO

Mistral workflows and actions

There are two primary interfaces to mistral: workflows, which are a yaml definition of a process or series of tasks, and actions, which are a concrete definition of how to do a specific task (such as call some OpenStack API).

Workflows and actions can be defined directly via the mistral API, or via a wrapper called a workbook. Mistral actions are also defined via a python plugin interface, which TripleO uses to run some tasks such as running jinja2 on tripleo-heat-templates prior to calling Heat to orchestrate the deployment.

Mistral workflows, in detail

Here I'm going to show how to view and interact with the mistral workflows used by TripleO directly, which is useful to understand what TripleO is doing "under the hood" during a deployment, and also for debugging/development.

First we view the workbooks loaded into Mistral - these contain the TripleO specific workflows and are defined in tripleo-common.

[stack@undercloud ~]$ . stackrc 
[stack@undercloud ~]$ mistral workbook-list
+----------------------------+--------+---------------------+------------+
| Name                       | Tags   | Created at          | Updated at |
+----------------------------+--------+---------------------+------------+
| tripleo.deployment.v1      | <none> | 2017-02-27 17:59:04 | None       |
| tripleo.package_update.v1  | <none> | 2017-02-27 17:59:06 | None       |
| tripleo.plan_management.v1 | <none> | 2017-02-27 17:59:09 | None       |
| tripleo.scale.v1           | <none> | 2017-02-27 17:59:11 | None       |
| tripleo.stack.v1           | <none> | 2017-02-27 17:59:13 | None       |
| tripleo.validations.v1     | <none> | 2017-02-27 17:59:15 | None       |
| tripleo.baremetal.v1       | <none> | 2017-02-28 19:26:33 | None       |
+----------------------------+--------+---------------------+------------+

The name of the workbook constitutes a namespace for the workflows it contains, so we can view the related workflows using grep (I also grep for tag_node to reduce the number of matches).

[stack@undercloud ~]$ mistral workflow-list | grep "tripleo.baremetal.v1" | grep tag_node
| 75d2566c-13d9-4aa3-b18d-8e8fc0dd2119 | tripleo.baremetal.v1.tag_nodes                            | 660c5ec71ce043c1a43d3529e7065a9d | <none> | tag_node_uuids, untag_nod... | 2017-02-28 19:26:33 | None       |
| 7a4220cc-f323-44a4-bb0b-5824377af249 | tripleo.baremetal.v1.tag_node                             | 660c5ec71ce043c1a43d3529e7065a9d | <none> | node_uuid, role=None, que... | 2017-02-28 19:26:33 | None       |
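
The double grep above is really just a prefix match on the workbook namespace plus a substring match. As a small Python sketch (the function name is mine, and the last two names in the sample list are made up to show what gets filtered out):

```python
def workflows_in_workbook(names, workbook, contains=""):
    """Names in the `workbook` namespace that also contain `contains`."""
    prefix = workbook + "."
    return [name for name in names if name.startswith(prefix) and contains in name]


WORKFLOW_NAMES = [
    "tripleo.baremetal.v1.tag_nodes",
    "tripleo.baremetal.v1.tag_node",
    "tripleo.baremetal.v1.example_other",  # hypothetical
    "tripleo.scale.v1.example_tag_node",   # hypothetical
]
```

Calling workflows_in_workbook(WORKFLOW_NAMES, "tripleo.baremetal.v1", "tag_node") keeps only the first two names, matching the grep output above.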

When you know the name of a workflow, you can inspect the required inputs and run it directly via a mistral execution. In this case we're running the tripleo.baremetal.v1.tag_node workflow, which modifies the profile assigned in the ironic node capabilities (see tripleo-docs for more information about manual tagging of nodes).

[stack@undercloud ~]$ mistral workflow-get tripleo.baremetal.v1.tag_node
+------------+------------------------------------------+
| Field      | Value                                    |
+------------+------------------------------------------+
| ID         | 7a4220cc-f323-44a4-bb0b-5824377af249     |
| Name       | tripleo.baremetal.v1.tag_node            |
| Project ID | 660c5ec71ce043c1a43d3529e7065a9d         |
| Tags       | <none>                                   |
| Input      | node_uuid, role=None, queue_name=tripleo |
| Created at | 2017-02-28 19:26:33                      |
| Updated at | None                                     |
+------------+------------------------------------------+
[stack@undercloud ~]$ ironic node-list
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name      | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| 30182cb9-eba9-4335-b6b4-d74fe2581102 | control-0 | None          | power off   | available          | False       |
| 19fd7ea7-b4a0-4ae9-a06a-2f3d44f739e9 | compute-0 | None          | power off   | available          | False       |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
[stack@undercloud ~]$ mistral execution-create tripleo.baremetal.v1.tag_node '{"node_uuid": "30182cb9-eba9-4335-b6b4-d74fe2581102", "role": "test"}'
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | 6a141065-ad6e-4477-b1a8-c178e6fcadcb |
| Workflow ID       | 7a4220cc-f323-44a4-bb0b-5824377af249 |
| Workflow name     | tripleo.baremetal.v1.tag_node        |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-03-03 09:53:10                  |
| Updated at        | 2017-03-03 09:53:10                  |
+-------------------+--------------------------------------+

At this point the mistral workflow is running, and it'll either succeed or fail, and also create some output (which in the TripleO model is sometimes returned to the Ux via a Zaqar queue). We can view the status, and the outputs (truncated for brevity):

[stack@undercloud ~]$ mistral execution-list | grep  6a141065-ad6e-4477-b1a8-c178e6fcadcb
| 6a141065-ad6e-4477-b1a8-c178e6fcadcb | 7a4220cc-f323-44a4-bb0b-5824377af249 | tripleo.baremetal.v1.tag_node                           |                        | <none>                               | SUCCESS | None       | 2017-03-03 09:53:10 | 2017-03-03 09:53:11 |
[stack@undercloud ~]$ mistral execution-get-output 6a141065-ad6e-4477-b1a8-c178e6fcadcb
{
    "status": "SUCCESS", 
    "message": {
...

So that's it - we ran a mistral workflow, it succeeded and we looked at the output. Now we can see the result by looking at the node in Ironic, it worked! :)

[stack@undercloud ~]$ ironic node-show 30182cb9-eba9-4335-b6b4-d74fe2581102 | grep profile
|                        | u'cpus': u'2', u'capabilities': u'profile:test,cpu_hugepages:true,boot_o |

Mistral workflows, create your own!

Here I'll show how to develop your own custom workflows (which isn't something we necessarily expect operators to do, but is now part of many developers' workflow during feature development for TripleO).

First, we create a simple yaml definition of the workflow, as defined in the v2 Mistral DSL - this example lists all available ironic nodes, then finds those which match the "test" profile we assigned in the example above:

This example uses the mistral built-in "ironic" action, which is basically a pass-through action exposing the python-ironicclient interfaces. Similar actions exist for the majority of OpenStack python clients, so this is a pretty flexible interface.
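As a rough illustration of the matching step, here is a minimal Python sketch (it is not the Mistral DSL workflow itself). It assumes each node is a plain dict with a `uuid` and a comma-separated `capabilities` string of the form shown in the `ironic node-show` output above; the function names and the second node's capabilities are made up for the example.

```python
def parse_capabilities(capabilities):
    """Turn 'profile:test,cpu_hugepages:true' into a dict."""
    result = {}
    for item in capabilities.split(","):
        key, sep, value = item.partition(":")
        if sep:  # skip malformed entries that have no ':'
            result[key.strip()] = value.strip()
    return result


def nodes_with_profile(nodes, profile):
    """Return the UUIDs of nodes whose 'profile' capability matches."""
    return [
        node["uuid"]
        for node in nodes
        if parse_capabilities(node["capabilities"]).get("profile") == profile
    ]


# Hypothetical node data; only the first capabilities string is based on
# the node-show output earlier in the post.
nodes = [
    {"uuid": "30182cb9-eba9-4335-b6b4-d74fe2581102",
     "capabilities": "profile:test,cpu_hugepages:true"},
    {"uuid": "19fd7ea7-b4a0-4ae9-a06a-2f3d44f739e9",
     "capabilities": "cpu_hugepages:true"},
]

matching_nodes = nodes_with_profile(nodes, "test")
print(matching_nodes)
```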

Now we can upload the workflow (not wrapped in a workbook this time, so we use workflow-create), run it via execution-create, then look at the outputs - we can see that the matching_nodes output matches the ID of the node we tagged in the example above - success! :)

[stack@undercloud tripleo-common]$ mistral workflow-create shtest.yaml 
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
| ID                                   | Name                    | Project ID                       | Tags   | Input        | Created at          | Updated at |
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
| 2b8f2bea-f3dd-42f0-ad16-79987c75df4d | test_nodes_with_profile | 660c5ec71ce043c1a43d3529e7065a9d | <none> | profile=test | 2017-03-03 10:18:48 | None       |
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
[stack@undercloud tripleo-common]$ mistral execution-create test_nodes_with_profile
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | 2392ed1c-96b4-4787-9d11-0f3069e9a7e5 |
| Workflow ID       | 2b8f2bea-f3dd-42f0-ad16-79987c75df4d |
| Workflow name     | test_nodes_with_profile              |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-03-03 10:19:30                  |
| Updated at        | 2017-03-03 10:19:30                  |
+-------------------+--------------------------------------+
[stack@undercloud tripleo-common]$ mistral execution-list | grep  2392ed1c-96b4-4787-9d11-0f3069e9a7e5
| 2392ed1c-96b4-4787-9d11-0f3069e9a7e5 | 2b8f2bea-f3dd-42f0-ad16-79987c75df4d | test_nodes_with_profile                                 |                        | <none>                               | SUCCESS | None       | 2017-03-03 10:19:30 | 2017-03-03 10:19:31 |
[stack@undercloud tripleo-common]$ mistral execution-get-output 2392ed1c-96b4-4787-9d11-0f3069e9a7e5
{
    "matching_nodes": [
        "30182cb9-eba9-4335-b6b4-d74fe2581102"
    ], 
    "available_nodes": [
        "30182cb9-eba9-4335-b6b4-d74fe2581102", 
        "19fd7ea7-b4a0-4ae9-a06a-2f3d44f739e9"
    ]
}

Using this basic example, you can see how to develop workflows which can then easily be copied into the tripleo-common workbooks, and integrated into the TripleO deployment workflow.

In a future post, I'll dig into the use of custom actions, and how to develop/debug those.

TripleO composable/custom roles

2016-10-10T02:04:00.000-07:00

This is a follow-up to my previous post outlining the new composable services interfaces, which covered the basics of the new-for-Newton composable services model.

The final piece of the composability model we've been developing this cycle is the ability to deploy user-defined custom roles, in addition to (or even instead of) the built in TripleO roles (where a role is a group of servers, e.g "Controller", which runs some combination of services).

What follows is an overview of this new functionality, the primary interfaces, and some usage examples and a summary of future planned work.

Fully Composable/Custom Roles

As described in previous posts, TripleO has for a long time provided a fixed architecture with 5 roles (where "roles" means groups of nodes): Controller, Compute, BlockStorage, CephStorage and ObjectStorage.

This architecture has been sufficient to enable standardized deployments, but it's not very flexible. With the addition of the composable-services model, moving services around between these roles becomes much easier, but many operators want to go further, and have full control of service placement on any arbitrary roles.

Now that the custom-roles feature has been implemented, this is possible, and operators can define arbitrary role types to enable fully composable deployments. When combined with composable services, this represents a huge step forward for TripleO flexibility! :)

Usage examples

To deploy with additional custom roles (or to remove/rename the default roles), a new option has been added to the python-tripleoclient “overcloud deploy” interface, so you simply need to copy the default roles_data.yaml, modify it to suit your requirements (for example by moving services between roles, or adding a new role), then do a deployment referencing the modified roles_data.yaml file:

cp /usr/share/openstack-tripleo-heat-templates/roles_data.yaml my_roles_data.yaml
<modify my_roles_data.yaml>
openstack overcloud deploy --templates -r my_roles_data.yaml

Alternatively you can copy the entire tripleo-heat-templates tree (or use a git checkout):

cp -r /usr/share/openstack-tripleo-heat-templates my-tripleo-heat-templates
<modify my-tripleo-heat-templates/roles_data.yaml>
openstack overcloud deploy --templates my-tripleo-heat-templates

Both approaches are essentially equivalent: the -r option simply overwrites the default roles_data.yaml during creation of the plan data (stored in swift on the undercloud), but it's slightly more convenient if you want to use the default packaged tripleo-heat-templates instead of constantly rebasing a copied tree.

So, let's say you wanted to deploy one additional node, only running the OS::TripleO::Services::Ntp composable service. You'd copy roles_data.yaml, and append a list entry like this:

- name: NtpRole
  CountDefault: 1
  ServicesDefault:
    - OS::TripleO::Services::Ntp

(Note that in practice you'll probably also want some of the common services deployed on all roles, such as OS::TripleO::Services::Kernel, OS::TripleO::Services::TripleoPackages, OS::TripleO::Services::TripleoFirewall and OS::TripleO::Services::VipHosts)

Nice, so how does it work?

The main change made to enable custom roles is a pre-deployment templating step which runs Jinja2. We define a roles_data.yaml file (which can be overridden by the user), which contains a list of role names, and optionally some additional data related to default parameter values (such as the default services deployed on the role, and the default count in the group).

The roles_data.yaml definitions look like this:

- name: Controller
  CountDefault: 1
  ServicesDefault:
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephMon
    - OS::TripleO::Services::CinderApi
    - ...

The format is simply a yaml list of maps, with a mandatory “name” key in each map, and a number of optional FooDefault keys which set the parameter defaults for the role (as a convenience so the user won't have to specify it via an environment file during the overcloud deployment).
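To make the format concrete, here is a small Python sketch that checks a roles list of this shape and collects the optional FooDefault keys per role. The roles are written as Python literals rather than loaded from YAML to keep the example dependency-free, and `check_roles` is a name invented for this illustration, not part of TripleO.

```python
def check_roles(roles):
    """Validate a roles_data-style list and return {role name: defaults}."""
    defaults = {}
    for role in roles:
        if "name" not in role:
            raise ValueError("every role needs a 'name' key: %r" % (role,))
        # Collect the optional FooDefault keys for this role.
        defaults[role["name"]] = {
            key: value for key, value in role.items()
            if key.endswith("Default")
        }
    return defaults


roles = [
    {"name": "Controller",
     "CountDefault": 1,
     "ServicesDefault": ["OS::TripleO::Services::CACerts",
                         "OS::TripleO::Services::CephMon",
                         "OS::TripleO::Services::CinderApi"]},
    {"name": "NtpRole",
     "CountDefault": 1,
     "ServicesDefault": ["OS::TripleO::Services::Ntp"]},
]

role_defaults = check_roles(roles)
print(sorted(role_defaults))
```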

A custom mistral action is used to run Jinja2 when creating or updating a “deployment plan” (which is a combination of some heat templates stored in swift, and a mistral environment containing user parameters) – and this basically consumes the roles_data.yaml list of required roles, and outputs a rendered tree of Heat templates ready to deploy your overcloud.

Custom Roles, overview

There are two types of Jinja2 templates which are rendered differently, distinguished by the file extension/suffix:

foo.j2.yaml

This will pass in the contents of the roles_data.yaml list, so the template can iterate over each role in the list. The resulting file in the plan swift container will be named foo.yaml.
Here's an example of the syntax used for j2 templating inside these files:

enabled_services:
  list_join:
    - ','
{% for role in roles %}
    - {get_attr: [{{role.name}}ServiceChain, role_data, service_names]}
{% endfor %}

This example is from overcloud.j2.yaml. It does a jinja2 loop appending service_names for every role's *ServiceChain resource (which are also dynamically generated via a similar loop), and the result is then processed on deployment via the heat list_join function.
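To show what the loop expands to, here is a plain-Python stand-in for the Jinja2 rendering step (it does not use Jinja2, and `render_enabled_services` is a hypothetical helper). Given a list of role names, it emits one get_attr line per role, which is roughly what the rendered overcloud.yaml would contain for this snippet.

```python
def render_enabled_services(role_names):
    """Emulate the j2 'for role in roles' loop with ordinary string formatting."""
    lines = ["enabled_services:", "  list_join:", "    - ','"]
    for name in role_names:
        lines.append(
            "    - {get_attr: [%sServiceChain, role_data, service_names]}" % name
        )
    return "\n".join(lines)


rendered = render_enabled_services(["Controller", "Compute", "NtpRole"])
print(rendered)
```

Adding a role to the list adds exactly one more line to the rendered output, which is why no template changes are needed when a new role is defined.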

foo.role.j2.yaml

This will generate a file per-role, where only the name of the role is passed in during the templating step, with the resulting files being called rolename-foo.yaml. (Note that if you have a role which requires a special template, it is possible to disable this file generation by adding the path to the j2_excludes.yaml file.)

Here's an example of the syntax used in these files (taken from the role.role.j2.yaml file, which is our new definition of server for a generic role):

resources:
  {{role}}:
    type: OS::TripleO::Server
    metadata:
      os-collect-config:
        command: {get_param: ConfigCommand}
    properties:
      image: {get_param: {{role}}Image}
As you can see, this simply allows use of a {{role}} placeholder, which is then substituted with the role name when rendering each file (one file per role defined in the roles_data.yaml list).
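The per-role rendering can be sketched in a few lines of Python. This is a simplified stand-in that uses plain string replacement instead of Jinja2; `render_per_role` is a hypothetical helper, and the output filename casing is an assumption based on the rolename-foo.yaml pattern described above.

```python
SUFFIX = ".role.j2.yaml"

ROLE_TEMPLATE = """\
resources:
  {{role}}:
    type: OS::TripleO::Server
    properties:
      image: {get_param: {{role}}Image}
"""


def render_per_role(template_name, template_text, role_names):
    """Return {output filename: rendered text}, one entry per role."""
    if not template_name.endswith(SUFFIX):
        raise ValueError("not a per-role template: %s" % template_name)
    base = template_name[:-len(SUFFIX)]  # 'role.role.j2.yaml' -> 'role'
    rendered = {}
    for name in role_names:
        # Assumed naming: <rolename>-<foo>.yaml, role name used as-is.
        filename = "%s-%s.yaml" % (name, base)
        rendered[filename] = template_text.replace("{{role}}", name)
    return rendered


files = render_per_role("role.role.j2.yaml", ROLE_TEMPLATE,
                        ["Controller", "NtpRole"])
print(files["NtpRole-role.yaml"])
```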

Debugging/Development tips

When making changes to roles_data.yaml, and particularly when making changes to the *.j2.yaml files in tripleo-heat-templates, it's often helpful to view the rendered templates before any overcloud deployment is attempted.

This is possible via use of the “openstack overcloud plan create” interface (which doesn't yet support the -r option above, so you have to copy or git clone the tree), combined with swiftclient:

openstack overcloud plan create overcloud --templates my_tripleo_heat_templates
mkdir tmp_templates && pushd tmp_templates
swift download overcloud

This will download the full tree of rendered files from the swift container (named “overcloud” due to the name passed to plan create), so you can e.g view the rendered overcloud.yaml that's generated by combining the overcloud.j2.yaml template with the roles_data.yaml file.

If you make a mistake in your *.j2.yaml file, the jinja2 error should be returned via the plan create command, but it can also be useful to tail -f /var/log/mistral/mistral-server.log for additional information during development (this shows the output logged from running jinja2 via the custom mistral action plugin).

Limitations/future work

These new interfaces allow for much greater deployment flexibility and choice, but there are a few remaining issues which will be addressed in future development cycles:

All services managed by pacemaker are still tied to the Controller role. Thanks to the implementation of a more lightweight HA architecture during the Newton cycle, the list of services managed by pacemaker is considerably reduced, but there are still a number of services (primarily DB & RPC services) which are, and until the composable-ha blueprint is completed (hopefully during Ocata), these services cannot be moved to a non-Controller role.
Custom isolated networks cannot be defined. Since arbitrary role types can now be defined, there may be a requirement to define arbitrary additional networks for network-isolation, but right now this is not possible.
roles_data.yaml must be copied. As in the examples above, it's necessary to copy either roles_data.yaml or the entire tripleo-heat-templates tree, which means that if the packaged roles_data.yaml changes (such as to add new services to the built-in roles), you must merge these changes in with your custom roles_data. In future we may add a convenience interface which makes it easier to e.g add a new role without having to care about the default role definitions.
No model for dependencies between services. Currently, ensuring the right combination of services is deployed on specific roles is left to the operator; there's no validation of incompatible or inter-dependent services, but this may be addressed in a future release.

Complex data transformations with nested Heat intrinsic functions

2016-09-01T02:31:00.000-07:00

Disclaimer: what follows is either pretty neat, or pure evil, depending on your viewpoint ;) But it's based on a real use-case and it works, so I'm posting this to document the approach, why it's needed, and hopefully stimulate some discussion around optimizations leading to an improved/simplified implementation in the future.

The requirement

In TripleO we have a requirement to enable composition of different services onto different roles (groups of physical nodes). To configure the services, we need input data which combines knowledge of the enabled services, which nodes/role they're running on, and which overlay network each service is bound to.

To do this, we need to input several pieces of data:

1. A list of the OpenStack services enabled for a particular deployment. Expressed as a heat parameter, it looks something like this:

EnabledServices:
    type: comma_delimited_list
    default:
      - heat_api
      - heat_engine
      - nova_api
      - neutron_api
      - glance_api
      - ceph_mon

2. A mapping of service names to one of several isolated overlay networks, such as "internal_api", "external" or "storage":

ServiceNetMap:
    type: json
    default:
      heat_api_network: internal_api
      nova_api_network: internal_api
      neutron_api_network: internal_api
      glance_api_network: storage
      ceph_mon_network: storage

3. A mapping of the network names to the actual IP address (either a single VIP pointing to a loadbalancer, or a list of the IPs bound to that network for all nodes running the service):

NetIpMap:
    type: json
    default:
      internal_api: 192.168.1.12
      storage: 192.168.1.13

The implementation, step by step

Dynamically generate an initial mapping for all enabled services

Here we can use a nice pattern which combines the heat repeat function with map_merge:

map_merge:
    repeat:
      template:
        SERVICE_ip: SERVICE_network
      for_each:
         SERVICE: {get_param: EnabledServices}

Step1: repeat dynamically generates lists (including lists of maps as in this case), so we use it to generate a list of maps for every service in the EnabledServices list with a placeholder for the network, e.g:

- heat_api_ip: heat_api_network
- heat_engine_ip: heat_engine_network
- nova_api_ip: nova_api_network
- neutron_api_ip: neutron_api_network
- glance_api_ip: glance_api_network
- ceph_mon_ip: ceph_mon_network

Step2: map_merge combines this list of single-key maps into one big map covering all EnabledServices:

heat_api_ip: heat_api_network
heat_engine_ip: heat_engine_network
nova_api_ip: nova_api_network
neutron_api_ip: neutron_api_network
glance_api_ip: glance_api_network
ceph_mon_ip: ceph_mon_network

Filter any values we don't want

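After the placeholders are substituted (the full template below does this with two nested map_replace calls, first against ServiceNetMap, then against NetIpMap), we are left with a value we don't want - heat_engine is like many non-api services in that it's not bound to any network, it only talks to rabbitmq, so we don't have any entry in ServiceNetMap for it and its placeholder is never replaced.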

We can therefore remove any entries which remain in the mapping using the yaql heat function, which is an interface to run yaql queries inside a heat template.

It has to be said yaql is very powerful, but the docs are pretty sparse (but improving), so I tend to read the unit tests instead of the docs for usage examples.

yaql:
    expression: dict($.data.map.items().where(isString($[1]) and not $[1].endsWith("_network")))
    data:
      map:
        heat_api_ip: 192.168.1.12
        heat_engine_ip: heat_engine_network
        nova_api_ip: 192.168.1.12
        neutron_api_ip: 192.168.1.12
        glance_api_ip: 192.168.1.13
        ceph_mon_ip: 192.168.1.13

Step5: yaql filters out every map entry whose value is a string ending with "_network", which gives:

heat_api_ip: 192.168.1.12
nova_api_ip: 192.168.1.12
neutron_api_ip: 192.168.1.12
glance_api_ip: 192.168.1.13
ceph_mon_ip: 192.168.1.13

So, that's it - we've now transformed two input maps and a list into a dynamically generated mapping based on the list items! :)
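For readers who find it easier to follow the data flow in an imperative language, here is a Python sketch of the same transformation. It only mimics the behaviour of the heat functions as described in this post (repeat plus map_merge, two map_replace passes using only the "values" key, and the yaql filter); the helper names are invented for the illustration.

```python
def repeat_and_merge(services):
    """repeat + map_merge: one SERVICE_ip -> SERVICE_network entry per service."""
    return {"%s_ip" % s: "%s_network" % s for s in services}


def map_replace_values(mapping, values):
    """map_replace with only 'values': swap any value found as a key in 'values'."""
    return {key: values.get(value, value) for key, value in mapping.items()}


def drop_placeholders(mapping):
    """The yaql filter: keep string values that do not end with '_network'."""
    return {key: value for key, value in mapping.items()
            if isinstance(value, str) and not value.endswith("_network")}


enabled_services = ["heat_api", "heat_engine", "nova_api",
                    "neutron_api", "glance_api", "ceph_mon"]
service_net_map = {
    "heat_api_network": "internal_api",
    "nova_api_network": "internal_api",
    "neutron_api_network": "internal_api",
    "glance_api_network": "storage",
    "ceph_mon_network": "storage",
}
net_ip_map = {"internal_api": "192.168.1.12", "storage": "192.168.1.13"}

placeholders = repeat_and_merge(enabled_services)
networks = map_replace_values(placeholders, service_net_map)
addresses = map_replace_values(networks, net_ip_map)
service_ip_map = drop_placeholders(addresses)
print(service_ip_map)
```

Note that heat_engine_ip survives both replacement passes unchanged (there is no heat_engine_network key in ServiceNetMap), which is exactly why the final filter is needed.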

Implementation, completed

Pulling all of the above together, here's a full example (you'll need a Newton Heat environment to run this). It combines all the steps described above into one big combination of nested intrinsic functions:

Edit - also available on github

heat_template_version: 2016-10-14

description: >
  Example of nested heat functions

parameters:
  NetIpMap:
    type: json
    default:
      internal_api: 192.168.1.12
      storage: 192.168.1.13

  EnabledServices:
    type: comma_delimited_list
    default:
      - heat_api
      - nova_api
      - neutron_api
      - glance_api
      - ceph_mon

  ServiceNetMap:
    type: json
    default:
      heat_api_network: internal_api
      nova_api_network: internal_api
      neutron_api_network: internal_api
      glance_api_network: storage
      ceph_mon_network: storage

outputs:
  service_ip_map:
    description: Mapping of service names to IP address for the assigned network
    value:
      yaql:
        expression: dict($.data.map.items().where(isString($[1]) and not $[1].endsWith("_network")))
        data:
          map:
            map_replace:
              - map_replace:
                  - map_merge:
                      repeat:
                        template:
                          SERVICE_ip: SERVICE_network
                        for_each:
                          SERVICE: {get_param: EnabledServices}
                  - values: {get_param: ServiceNetMap}
              - values: {get_param: NetIpMap}

TripleO Deploy Artifacts (and puppet development workflow)

2016-08-12T15:20:00.001-07:00

For a while now, TripleO has supported a "DeployArtifacts" interface, aimed at making it easier to deploy modified/additional files on your overcloud, without the overhead of frequently rebuilding images.

This started out as a way to enable faster iteration on puppet module development (the puppet modules are by default stored inside the images deployed by TripleO, and generally you'll want to do development in a git checkout on the undercloud node), but it is actually a generic interface that can be used for a variety of deployment time customizations.

Ok, how do I use it?

Let's start with a couple of usage examples, making use of some helper scripts that are maintained in the tripleo-common repo (in future similar helper interfaces may be added to the TripleO CLI/UI, but right now this is more targeted at developers and advanced operator usage).

First clone the tripleo-common repo (you can skip this step if you're running a packaged version which already contains the following scripts):

[stack@instack ~]$ git clone https://git.openstack.org/openstack/tripleo-common

There are two scripts of interest: a generic script that can be used to deploy any kind of file (aka artifact), tripleo-common/scripts/upload-swift-artifacts, and a slightly modified version which optimizes the flow for deploying directories containing puppet modules, tripleo-common/scripts/upload-puppet-modules.

To make using these easier, I append this to my .bashrc:

export PATH="$PATH:/home/stack/tripleo-common/scripts"

Example 1 - Deploy Artifacts "Hello World"

So, let's start with a really simple example. First let's create a tarball containing a single /tmp/hello file:

[stack@instack ~]$ mkdir tmp
[stack@instack ~]$ echo "hello" > tmp/hello
[stack@instack ~]$ tar -cvzf hello.tgz tmp
tmp/
tmp/hello

Now, we simply run the upload-swift-artifacts script, accepting all the default options other than to pass a reference to hello.tgz:

[stack@instack ~]$ upload-swift-artifacts -f hello.tgz
Creating heat environment file: /home/stack/.tripleo/environments/deployment-artifacts.yaml
Uploading file to swift: hello.tgz
hello.tgz
Upload complete.

There are currently only two supported file types:

A tarball (will be unpacked from / on all nodes)
An RPM file (will be installed on all nodes)
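One way to tell these two file types apart is by their leading magic bytes. The Python sketch below is an illustration only: the deployment script itself is a shell script and may detect the type differently (for example with the `file` utility), and the sketch assumes the tarball is gzip-compressed like hello.tgz above. `artifact_type` is a hypothetical helper.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"          # first bytes of a gzip stream (e.g. a .tgz)
RPM_MAGIC = b"\xed\xab\xee\xdb"   # first bytes of an RPM package


def artifact_type(path):
    """Classify a downloaded artifact as 'tarball', 'rpm' or 'unknown'."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if header.startswith(RPM_MAGIC):
        return "rpm"
    if header.startswith(GZIP_MAGIC):
        return "tarball"
    return "unknown"


# Small self-check using throwaway files (none of these are real artifacts).
with tempfile.TemporaryDirectory() as workdir:
    tgz_path = os.path.join(workdir, "hello.tgz")
    with gzip.open(tgz_path, "wb") as handle:
        handle.write(b"gzip data standing in for a tarball")
    rpm_path = os.path.join(workdir, "fake.rpm")
    with open(rpm_path, "wb") as handle:
        handle.write(RPM_MAGIC + b"\x00" * 4)
    txt_path = os.path.join(workdir, "notes.txt")
    with open(txt_path, "w") as handle:
        handle.write("plain text")
    detected = [artifact_type(p) for p in (tgz_path, rpm_path, txt_path)]

print(detected)
```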

Taking a look inside the environment file the script generated, we can see it's using the DeployArtifactURLs parameter, and passing a single URL (the parameter accepts a list of URLs). This happens to be a swift tempurl, created by the upload-swift-artifacts script but it could be any URL accessible to the overcloud nodes at deployment time.

[stack@instack ~]$ cat /home/stack/.tripleo/environments/deployment-artifacts.yaml
# Heat environment to deploy artifacts via Swift Temp URL(s)
parameter_defaults:
  DeployArtifactURLs:
    - 'http://192.0.2.1:8080/v1/AUTH_e9bcd2a11af94c319b164eba73c59a28/overcloud/hello.tgz?temp_url_sig=96ae277d85c3ee38dd61234b8c99351e64c8bd45&temp_url_expires=1502273853'

This environment file is automatically generated by the upload-swift-artifacts script, and put into the special ~/.tripleo/environments directory. This directory is read by tripleoclient and any environment files included here are always included automatically (no need for any -e options), but you can also pass a --environment option to upload-swift-artifacts if you prefer some different output location (e.g so it can be explicitly included in your overcloud deploy command).

To test this example, you simply do an overcloud deployment. No additional arguments are needed if you use the default .tripleo/environments/deployment-artifacts.yaml environment path:

[stack@instack ~]$ openstack overcloud deploy --templates

Then check on one of the nodes for the expected file (note the tarball is unpacked from / in the filesystem):

[root@overcloud-controller-0 ~]# cat /tmp/hello
hello

Note that the deploy artifact files are written to all roles; currently there is no way to deploy e.g only to Controller nodes. We might consider an enhancement that allows role specific artifact URL parameters in future should folks require it.

Hopefully despite the very simple example you can see that this is a very flexible interface - you can deploy a tarball containing anything, e.g even configuration files such as policy.json files to the nodes.

Note that you have to be careful though - most service configuration files are managed by puppet, so if you attempt using the deploy artifacts interface to overwrite puppet managed files it will not work - puppet runs after deploy artifacts are created (this is deliberate, as you will see in the next example) so you must use puppet hieradata to influence any configuration managed by puppet. (In the case of policy.json files, there is a puppet module that handles this, but currently TripleO does not use it - this may change in future though).

Example 2 - Puppet development workflow

There is coupling between tripleo-heat-templates and the puppet modules it interfaces with (and in particular with the puppet profiles that exist in puppet-tripleo, as discussed in my composable services tutorial recently), so a common pattern for a developer is:

1. Modify some puppet code
2. Modify tripleo-heat-templates to match the new/modified puppet profile
3. Deploy an overcloud
4. *OH NO* it doesn't work!
5. Debug the issue (hint, "openstack stack failures list overcloud" is a super-useful new heatclient command which helps a lot here, as it surfaces the puppet error in most cases)
6. Make coffee; goto (1) :)

Traditionally for TripleO deployments all puppet modules (including the puppet-tripleo profiles) have been built into the image we deploy (stored in Glance on the undercloud), so one missing step above is getting the modified puppet code into the image. There are a few options:

Rebuild the image every time (this is really slow)
Use virt-customize or virt-copy-in to copy some modifications into the image, then update the image in glance (this is faster, but it still means you must redeploy the nodes every time and it's easy to lose track of what modifications have been made).
Use DeployArtifactURLs to update the puppet modules on the fly during the deployment!

This last use-case is actually what prompted implementation of the DeployArtifacts interface (thanks Dan!), and I'll show how it works below:

First, we clone one or more puppet modules to a local directory - note the name of the repo e.g "puppet-tripleo" does not match the name of the deployed directory (on the nodes it's /etc/puppet/modules/tripleo), so you have to clone it to the "tripleo" directory.

mkdir puppet-modules
cd puppet-modules
git clone https://git.openstack.org/openstack/puppet-tripleo tripleo

Now you can make whatever edits are needed, or pull in code that's under review (or just do nothing if you want to deploy the latest trunk of a given module). When you're ready, run the upload-puppet-modules script:

upload-puppet-modules -d puppet-modules

This works a little bit differently to the previous upload-swift-artifacts script: it takes the directory and creates a tarball using the tar --transform option, which rewrites the prefix from /somewhere/puppet-modules to /etc/puppet/modules.
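The effect of the prefix rewrite can be reproduced with Python's tarfile module, where the `arcname` argument plays the role of tar's --transform. This is only a sketch of the idea, not what upload-puppet-modules actually runs; `build_modules_tarball` and the placeholder manifest are made up for the example.

```python
import os
import tarfile
import tempfile


def build_modules_tarball(modules_dir, tarball_path,
                          prefix="etc/puppet/modules"):
    """Archive modules_dir so its contents unpack under <prefix>/ from /."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname replaces the on-disk directory path inside the archive.
        tar.add(modules_dir, arcname=prefix)


with tempfile.TemporaryDirectory() as workdir:
    modules_dir = os.path.join(workdir, "puppet-modules")
    manifests = os.path.join(modules_dir, "tripleo", "manifests")
    os.makedirs(manifests)
    with open(os.path.join(manifests, "init.pp"), "w") as handle:
        handle.write("# placeholder manifest\n")

    tarball = os.path.join(workdir, "puppet-modules.tar.gz")
    build_modules_tarball(modules_dir, tarball)
    with tarfile.open(tarball) as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

Because the tarball is unpacked from / on the nodes, a member named etc/puppet/modules/tripleo/... lands in /etc/puppet/modules/tripleo/..., which is where the deployed modules live.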

The process after we create the tarball is exactly the same - we upload it to swift, get a tempurl, and create a heat environment file which references the location of the tarball. On deployment, the updated puppet modules will be untarred and this always happens before puppet runs, which makes the debug workflow above much faster, nice!

NOTE: There is one gotcha here - upload-puppet-modules creates a differently named environment file ($HOME/.tripleo/environments/puppet-modules-url.yaml) to upload-swift-artifacts by default, and their content is conflicting - if both environment files exist, one will be ignored, because when the environments are merged one DeployArtifactURLs value overwrites the other. (This is something we can probably improve in future when this heat feature lands, but right now the only option is to stick to one script or the other, or accept manually merging the environment files to append rather than overwrite the DeployArtifactURLs parameter.)
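The manual merge amounts to concatenating the two URL lists rather than letting one replace the other. Here is a minimal Python sketch of that idea, with the environment files represented as already-parsed dicts; `merge_artifact_urls` is a hypothetical helper and the URLs are placeholders, not real tempurls.

```python
def merge_artifact_urls(*environments):
    """Combine DeployArtifactURLs from several environments into one list."""
    merged = []
    for env in environments:
        urls = env.get("parameter_defaults", {}).get("DeployArtifactURLs", [])
        for url in urls:
            if url not in merged:  # keep order, skip exact duplicates
                merged.append(url)
    return {"parameter_defaults": {"DeployArtifactURLs": merged}}


# Placeholder URLs standing in for the two generated environment files.
deployment_artifacts = {"parameter_defaults": {
    "DeployArtifactURLs": ["http://192.0.2.1:8080/v1/AUTH_example/overcloud/hello.tgz"]}}
puppet_modules_url = {"parameter_defaults": {
    "DeployArtifactURLs": ["http://192.0.2.1:8080/v1/AUTH_example/overcloud/puppet-modules.tar.gz"]}}

combined = merge_artifact_urls(deployment_artifacts, puppet_modules_url)
print(combined["parameter_defaults"]["DeployArtifactURLs"])
```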

So how does it work?

Deploy Artifacts Overview

So, it's actually pretty simple, as illustrated in the diagram above

A tarball is created containing the files you want to deploy to the nodes
This tarball is uploaded to swift on the undercloud
A Swift tempurl is created, so the tarball can be accessed using a signed URL (no credentials needed in the nodes to access)
A Heat environment passes the Swift tempurl to a nested stack "deploy-artifacts.yaml", which defines a DeployArtifactURLs parameter (which is a list)
deploy-artifacts.yaml defines a Heat SoftwareConfig resource, which references a shell script that can download files from a list of URLs, check the file type and do something (e.g in the case of a tarball, untar it!)
The deploy-artifacts SoftwareConfig is deployed inside the per-role "PostDeploy" template, which is where we perform the puppet steps (5 deployment passes which apply puppet in a series of steps).
We use the heat depends_on directive to ensure that the DeployArtifacts deployment (ControllerArtifactsDeploy in the case of the Controller role) always runs before any of the puppet steps.
This pattern is replicated for all roles (not just the Controller as in the diagram above)

As you can see, there are a few steps to the process, but it's pretty simple and it leverages the exact same Heat SoftwareDeployment patterns we use throughout TripleO to deploy scripts (and apply puppet manifests, etc).
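For the curious, the signed URL in the tempurl step can be sketched in Python as follows. This assumes the HMAC-SHA1 signing scheme commonly used by Swift's TempURL middleware (the 40-character temp_url_sig in the earlier example is consistent with that); the helper scripts get the URL from the swift tooling rather than computing it like this, and the account, key and `make_temp_url` name here are invented.

```python
import hmac
from hashlib import sha1


def make_temp_url(host, path, key, expires, method="GET"):
    """Build a signed URL in the temp_url_sig/temp_url_expires style."""
    # The signature covers the HTTP method, the expiry time and the object path.
    hmac_body = "%s\n%d\n%s" % (method, expires, path)
    signature = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return "%s%s?temp_url_sig=%s&temp_url_expires=%d" % (
        host, path, signature, expires)


# Hypothetical account, key and expiry, for illustration only.
url = make_temp_url(
    host="http://192.0.2.1:8080",
    path="/v1/AUTH_example/overcloud/hello.tgz",
    key="not-a-real-key",
    expires=1502273853,
)
print(url)
```

Since the signature is derived from a secret key held on the undercloud, the nodes can fetch the object with nothing more than the URL.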

TripleO Composable Services 101

2016-08-05T07:02:00.000-07:00

Over the Newton cycle, we've been working very hard on a major refactor of our heat templates and puppet manifests, such that a much more granular and flexible "Composable Services" pattern is followed throughout our implementation.

It's been a lot of work, but it's been a frequently requested feature for some time, so I'm excited to be in a position to say it's complete for Newton (kudos to everyone involved in making that happen!) :)

This post aims to provide an introduction to this work, an overview of how it works under the hood, some simple usage examples and a roadmap for some related follow-on work.

Why Composable Services?

It probably helps to start with some historical context here. As described in previous posts TripleO has provided a fixed architecture with 5 roles (where "roles" means groups of nodes) Controller, Compute, BlockStorage, CephStorage and ObjectStorage.

To configure each of these roles, we used puppet, and we had a large manifest per role, with some relatively inflexible assumptions about which services would run on each role.

This worked OK, but many users have been requesting more flexibility, such as:

Ability to easily disable services they don't need
Allow service placement choice, such as co-locating the Ceph OSD service with nova-compute services to reduce the required hardware footprint (so-called "hyperconverged" deployments)
Make it easier to integrate new services and integrate third-party pieces (get closer to a strongly defined "plugin" interface)

The pre-newton Tripleo architecture, one manifest and heat template per role.

So, how does it work?

So, basically we've made two fundamental changes to our interfaces:

Each service, e.g "nova-api" is now defined by an individual heat template. The interfaces for these are standardized so all services must implement a basic subset of input parameters and output values.
Every service defines a small puppet "profile", which is a puppet manifest fragment that defines configuring that service. Again a standard interface is used, in particular a "step" variable is passed to each puppet profile, so you can choose which step configuration occurs in (we apply configuration in a series of six steps so the author of the profile can choose when a service is configured relative to other services).

This is the basis of the TripleO "service plugin" interface, and it should enable *much* easier integration of new services, and hopefully provide a more accessible interface to new contributors.

Inside the TripleO templates, we made use of a new-for-mitaka Heat ResourceChain interface to compose a deployment of multiple services. Basically a ResourceChain is a group of resources that may have different types, but conform to the same interfaces, which is what we need to combine a bunch of service templates that all have some standard interfaces.

Here's an illustration of how it works - essentially you define an input parameter which is a list of services, e.g OS::TripleO::Services::NovaApi, which then maps to the heat template for that service, e.g puppet/services/nova-api.yaml, via the resource_registry interface discussed in previous posts.

For Newton, each role has a ServiceChain that combines the chosen services for that role.
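Conceptually, the lookup from service names to templates behaves like a dictionary lookup over the resource_registry. The Python sketch below illustrates just that idea and is not how Heat implements ResourceChain; `resolve_service_templates` is a hypothetical helper, and only the nova-api.yaml path comes from the text above (the keystone path is assumed by analogy).

```python
def resolve_service_templates(services, resource_registry):
    """Map each service name in a role's list to its heat template path."""
    missing = [name for name in services if name not in resource_registry]
    if missing:
        raise KeyError("no template registered for: %s" % ", ".join(missing))
    return [resource_registry[name] for name in services]


resource_registry = {
    "OS::TripleO::Services::NovaApi": "puppet/services/nova-api.yaml",
    # Assumed path, following the same naming pattern.
    "OS::TripleO::Services::Keystone": "puppet/services/keystone.yaml",
}

controller_services = [
    "OS::TripleO::Services::Keystone",
    "OS::TripleO::Services::NovaApi",
]

templates = resolve_service_templates(controller_services, resource_registry)
print(templates)
```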

If you'd like more information on the implementation details, I'd encourage you to check out the developer documentation where we're starting to document these interfaces in more detail.

Ok, how do I use it?

Here I'm going to focus on usage of the feature vs developing new services (which is pretty well covered in the aforementioned developer docs), and hopefully illustrate why this is an important step forward that improves operator deployment choices.

Scenario 1 - All in one minimal deployment

Let's say for a moment that you're a keystone developer and you want a shorter debug cycle and/or are resource constrained. With the new interfaces, it's become very easy to deploy a minimal subset of services on a single node:

First you create an environment file that overrides the default ControllerServices list (which at the time of writing contains about 50 services!) so it only includes OS::TripleO::Services::Keystone and the services keystone depends on. We also set ComputeCount to zero as we don't need any compute nodes.

$ cat keystone-only.yaml
parameter_defaults:
  ControllerServices:
   - OS::TripleO::Services::Keystone
   - OS::TripleO::Services::RabbitMQ
   - OS::TripleO::Services::HAproxy
   - OS::TripleO::Services::MySQL
  ComputeCount: 0

(Note that in some environments it may also be necessary to include OS::TripleO::Services::Pacemaker.)

You can then deploy your single node keystone-only environment:

openstack overcloud deploy --templates -e keystone-only.yaml

When this completes, you'll see the following message, and you can source the overcloudrc and get a token to prove the deployed keystone is working:

...
Overcloud Endpoint: http://192.0.2.15:5000/v2.0
Overcloud Deployed
[stack@instack ~]$ . overcloudrc
[stack@instack ~]$ openstack token issue
+------------+----------------------------------+
| Field      | Value                            |
+------------+----------------------------------+
| expires    | 2016-08-05 10:16:16+00:00        |
| id         | 976d5fcf9f744a5a9cf840e83d825560 |
| project_id | 99e92ae58d1f4147a5d7eda0af516060 |
| user_id    | 29fe578e45b24406ba6c5fd0baaeaa9c |
+------------+----------------------------------+

We can see by looking at the undercloud nova (don't forget to source the stackrc after interacting with the overcloud above!) that there is one controller node:

[stack@instack ~]$ . stackrc
[stack@instack ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+--------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks           |
+--------------------------------------+------------------------+--------+------------+-------------+--------------------+
| d5155616-d2a6-4cee-a6d1-37bb83fccfe0 | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.0.2.7 |
+--------------------------------------+------------------------+--------+------------+-------------+--------------------+

Scenario 2 - "hyperconverged" ceph deployment

In this case, we want to move the Ceph OSD services, which normally run on the CephStorage role, and instead have them run on the Compute role.

To do this, we first look at the default values for the ComputeServices and CephStorageServices parameters in overcloud.yaml (as in the example above for the Controller role, these lists define the services to be deployed on the Compute and CephStorage roles respectively):

ComputeServices:
    default:
      - OS::TripleO::Services::CephClient
      - OS::TripleO::Services::CephExternal
      - OS::TripleO::Services::Timezone
      - OS::TripleO::Services::Ntp
      - OS::TripleO::Services::Snmp
      - OS::TripleO::Services::NovaCompute
      - OS::TripleO::Services::NovaLibvirt
      - OS::TripleO::Services::Kernel
      - OS::TripleO::Services::ComputeNeutronCorePlugin
      - OS::TripleO::Services::ComputeNeutronOvsAgent
      - OS::TripleO::Services::ComputeCeilometerAgent

CephStorageServices:
    default:
      - OS::TripleO::Services::CephOSD
      - OS::TripleO::Services::Kernel
      - OS::TripleO::Services::Ntp
      - OS::TripleO::Services::Timezone

Our aim is to deploy one Compute node, running both the standard compute services, and the OS::TripleO::Services::CephOSD service (the other services are clearly common to both roles). We also don't need the OS::TripleO::Services::CephExternal service defined in ComputeServices, because we won't be referencing any external ceph cluster, which gives us this:

$ cat ceph_osd_on_compute.yaml
parameter_defaults:
  ComputeServices:
      - OS::TripleO::Services::CephClient
      - OS::TripleO::Services::CephOSD
      - OS::TripleO::Services::Timezone
      - OS::TripleO::Services::Ntp
      - OS::TripleO::Services::Snmp
      - OS::TripleO::Services::NovaCompute
      - OS::TripleO::Services::NovaLibvirt
      - OS::TripleO::Services::Kernel
      - OS::TripleO::Services::ComputeNeutronCorePlugin
      - OS::TripleO::Services::ComputeNeutronOvsAgent
      - OS::TripleO::Services::ComputeCeilometerAgent

That is all that's required to enable a hyperconverged ceph deployment! :)
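
If you'd rather derive the override list than edit it by hand, something like the following sketch would do it. The default list is copied from the ComputeServices excerpt above; the helper names are mine, and the YAML is written out by hand only to avoid a dependency.

```python
# Sketch: derive the hyperconverged ComputeServices list from the defaults.
# Helper names are hypothetical; the default list is copied from overcloud.yaml above.
PREFIX = "OS::TripleO::Services::"

DEFAULT_COMPUTE_SERVICES = [PREFIX + name for name in (
    "CephClient", "CephExternal", "Timezone", "Ntp", "Snmp",
    "NovaCompute", "NovaLibvirt", "Kernel",
    "ComputeNeutronCorePlugin", "ComputeNeutronOvsAgent",
    "ComputeCeilometerAgent",
)]


def hyperconverged_compute_services(defaults):
    """Swap CephExternal for CephOSD, keeping everything else in order."""
    return [PREFIX + "CephOSD" if name == PREFIX + "CephExternal" else name
            for name in defaults]


def render_environment(services):
    """Render a heat environment that overrides ComputeServices."""
    lines = ["parameter_defaults:", "  ComputeServices:"]
    lines += ["    - " + name for name in services]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_environment(
        hyperconverged_compute_services(DEFAULT_COMPUTE_SERVICES)))
```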

Since the default count for CephStorage is zero, we can then deploy like this:

[stack@instack ~]$ openstack overcloud deploy --templates /tmp/tripleo-heat-templates -e ceph_osd_on_compute.yaml -e /tmp/tripleo-heat-templates/environments/storage-environment.yaml

Here we can see I'm specifying a non-default location /tmp/tripleo-heat-templates for the template tree (this defaults to /usr/share/openstack-tripleo-heat-templates), passing the ceph_osd_on_compute.yaml environment to enable the OSD service on the Compute role, and finally passing the storage-environment.yaml that configures things so they are backed by Ceph.

Logging onto the compute node after deployment we see this:

[root@overcloud-novacompute-0 ~]# ps ax | grep ceph
17437 ?        Ss     0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph -f
17438 ?        Sl     0:00 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph -f

So, it worked, and we have the OSD service running on the Compute role! :)

Similar patterns to those described above can be used to achieve various deployment topologies which were not previously possible (an all-in-one deployment including nova-compute on a single node for example, as is done in one of our CI jobs now).

Future Work

Hopefully by now you can see that these new interfaces provide a much cleaner abstraction for services, and a lot more operator flexibility regarding their placement. However for some environments this is not enough, and completely new roles may be needed. We're working towards enabling that via the custom-roles blueprint, which will hopefully land for Newton.

Another related piece of work is enabling more flexible environment merging inside Heat. This will mean there is less need to specify the full list of Services as described above, and instead we'll be able to build up a list of services based on multiple environment files (which are then merged appending to the final list).

TripleO partial stack updates

2016-06-09T04:59:00.000-07:00

Recently I was asked if it's possible to do a partial update of a TripleO overcloud - the answer is yes, so I thought I'd write a post showing how to do it. Much of what follows is basically an update on my old post on nested resource introspection (some interfaces have changed a bit since I wrote that), combined with an introduction to heat PATCH updates.

Partial update?! Why?

So, the first question is why would you do this - TripleO heat templates are designed to enforce a consistent state for your entire OpenStack deployment, so in most cases you really should update the entire overcloud, and not mess with the underlying nested stacks directly.

However, for some development usage, this creates a long feedback loop - you change something (perhaps one line in a puppet manifest or heat template), then have to wait several minutes for Heat to walk the entire tree of nested stacks, puppet to run all steps on all nodes, etc.

So, while you would probably never do this in production (seriously, please don't!), it can be a useful technique for developers seeking a quicker hack-then-test cycle, and also when attempting to isolate root-causes for some subset of overcloud stack update behavior.

Ok, with that disclaimer clearly stated, here's how you do it:

Step 1 - Find the nested stack to update

Let's take a specific example - I want to update only the ControllerNodesPostDeployment resource which is defined in overcloud.yaml - this is a resource that maps to a nested stack that uses the cluster configuration interfaces I described in this previous post to apply puppet in a series of steps to all controller nodes.

Here's our overcloud (some CLI output removed for brevity):

$ heat stack-list
| 01c51e7e-ad2f-41d3-b056-3c4c84395114 | overcloud | CREATE_COMPLETE | 2016-06-08T18:07:00 | None |

Here's the ControllerNodesPostDeployment resource:

$ heat resource-list overcloud | grep ControllerNodesPost
| ControllerNodesPostDeployment | e67fff24-8089-4cf8-adf4-9c6064bf01d6 | OS::TripleO::ControllerPostDeployment | CREATE_COMPLETE | 2016-06-08T18:07:00 |

e67fff24-8089-4cf8-adf4-9c6064bf01d6 is the resource ID of ControllerNodesPostDeployment, which is a nested stack - you can confirm this via:

$ heat stack-list -n | grep "^| e67fff24-8089-4cf8-adf4-9c6064bf01d6"
| e67fff24-8089-4cf8-adf4-9c6064bf01d6 | overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 | UPDATE_COMPLETE | 2016-06-08T18:10:34 | 2016-06-09T08:52:45 | 01c51e7e-ad2f-41d3-b056-3c4c84395114 |

Note here the first column is the stack ID, and the last is the parent stack ID (e.g "overcloud" above).

overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 is the name of the stack that implements ControllerNodesPostDeployment - we can refer to it by either that name or the ID (e67fff24-8089-4cf8-adf4-9c6064bf01d6).
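
If you want to script this lookup, here's a minimal sketch that splits one row of the `heat stack-list -n` table into named fields. The column names are my own labels for the six columns shown above, not something the CLI guarantees.

```python
# Sketch: pull the ID, name and parent out of one "heat stack-list -n" row.
# The column labels are assumptions based on the six columns shown above.
COLUMNS = ("id", "stack_name", "stack_status",
           "creation_time", "updated_time", "parent")

ROW = ("| e67fff24-8089-4cf8-adf4-9c6064bf01d6 "
       "| overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 "
       "| UPDATE_COMPLETE | 2016-06-08T18:10:34 | 2016-06-09T08:52:45 "
       "| 01c51e7e-ad2f-41d3-b056-3c4c84395114 |")


def parse_stack_row(row):
    """Split a '|'-delimited table row into a dict keyed by COLUMNS."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    return dict(zip(COLUMNS, cells))


if __name__ == "__main__":
    stack = parse_stack_row(ROW)
    print(stack["stack_name"], "is a child of", stack["parent"])
```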

Step 2 - Basic update of the stack

Step 3 - Update of the stack with modifications

So, those paying attention may have noticed that 30 seconds is too fast for puppet to run on all the controller nodes, and it is - the reason being that we did a no-op update, and so Heat detects that no inputs have changed, thus it doesn't cause puppet to re-run.

To work around this, and enable puppet to re-assert state on every overcloud update, we have an identifier in the nested stack that is normally updated to a value that changes every update (it includes a timestamp when updates are triggered via python-tripleoclient vs heatclient directly).

We can emulate this behavior in our patch update, and force puppet to re-run through all the deployment steps - let's first look at the NodeConfigIdentifiers parameter value:

$ heat stack-show overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 | grep NodeConfigIdentifiers
"NodeConfigIdentifiers": "{u'deployment_identifier': u'1465409217', u'controller_config': {u'0': u'os-apply-config deployment bb67a1d5-f0a5-48ec-9883-1f2ae578a8bd completed,Root CA cert injection not enabled.,TLS not enabled.,None,'}, u'allnodes_extra': u'none'}"

Here we can see various data, including a deployment_identifier, which is the timestamp-derived unique identifier normally passed via python-tripleoclient.
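
The parameter value is a Python dict repr (note the u'' prefixes), so if you want to pick it apart in a script, `ast.literal_eval` can read it safely. A small sketch, with the controller_config text abbreviated from the output above:

```python
# Sketch: decode the NodeConfigIdentifiers string shown by heat stack-show.
# RAW is abbreviated from the output above; the structure is unchanged.
import ast

RAW = ("{u'deployment_identifier': u'1465409217', "
       "u'controller_config': {u'0': u'os-apply-config deployment completed'}, "
       "u'allnodes_extra': u'none'}")


def parse_identifiers(raw):
    """Turn the dict repr into a real dict without using eval()."""
    return ast.literal_eval(raw)


if __name__ == "__main__":
    identifiers = parse_identifiers(RAW)
    print(identifiers["deployment_identifier"])
```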

We could update just that field, but the content of this mapping isn't important, only that it changes (this data is not currently consumed by puppet on update, it's just used to trigger the SoftwareDeployment to re-apply the config due to an input value changing).

So we can create an environment file that looks like this (note this must use parameters, not parameter_defaults, so that it overrides the value passed from the parent stack) - any value can be used, but you must change it each update if you want the SoftwareDeployment resources to be re-applied to the nodes.

$ cat update_env.yaml
parameters:
  NodeConfigIdentifiers: 123
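
Since the value has to change on every update, one option is to generate the environment from the current time. This is just a sketch of that idea; the function name is mine, and using a timestamp is a choice I'm making here to mimic the deployment_identifier, any changing value would do.

```python
# Sketch: render update_env.yaml content with a value that changes each run.
# Using the current time is an assumption; any value that differs works.
import time


def render_update_env(identifier=None):
    """Return environment text overriding NodeConfigIdentifiers."""
    if identifier is None:
        identifier = int(time.time())
    return "parameters:\n  NodeConfigIdentifiers: {}\n".format(identifier)


if __name__ == "__main__":
    # Redirect this to update_env.yaml before each stack-update.
    print(render_update_env(), end="")
```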

Then we can trigger another PATCH update including this data:

heat stack-update -x overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 -e update_env.yaml

This time I'm using the new openstack stack event list --follow approach to monitor progress (if you don't have this, you can repeat the marker event-list approach described above):

$ openstack stack event list --follow
2016-06-09 08:52:46 [overcloud-ControllerNodesPostDeployment-smy5ygz2lc26]: UPDATE_IN_PROGRESS Stack UPDATE started
2016-06-09 08:52:54 [ControllerPuppetConfig]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:52:54 [ControllerArtifactsConfig]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:52:56 [ControllerPuppetConfig]: UPDATE_COMPLETE state changed
2016-06-09 08:52:56 [ControllerArtifactsConfig]: UPDATE_COMPLETE state changed
2016-06-09 08:52:56 [ControllerArtifactsDeploy]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:52:58 [ControllerArtifactsDeploy]: UPDATE_COMPLETE state changed
2016-06-09 08:52:58 [ControllerLoadBalancerDeployment_Step1]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:53:32 [ControllerLoadBalancerDeployment_Step1]: UPDATE_COMPLETE state changed
2016-06-09 08:53:32 [ControllerServicesBaseDeployment_Step2]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:54:00 [ControllerServicesBaseDeployment_Step2]: UPDATE_COMPLETE state changed
2016-06-09 08:54:00 [ControllerOvercloudServicesDeployment_Step3]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:54:57 [ControllerOvercloudServicesDeployment_Step3]: UPDATE_COMPLETE state changed
2016-06-09 08:54:57 [ControllerOvercloudServicesDeployment_Step4]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:56:14 [ControllerOvercloudServicesDeployment_Step4]: UPDATE_COMPLETE state changed
2016-06-09 08:56:14 [ControllerOvercloudServicesDeployment_Step5]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:57:16 [ControllerOvercloudServicesDeployment_Step5]: UPDATE_COMPLETE state changed
2016-06-09 08:57:16 [ExtraConfig]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:57:17 [ExtraConfig]: UPDATE_COMPLETE state changed
2016-06-09 08:57:26 [overcloud-ControllerNodesPostDeployment-smy5ygz2lc26]: UPDATE_COMPLETE Stack UPDATE completed successfully

So, here we can see the update of the stack took a little longer (around 5 minutes in my environment), and if you were to check the os-collect-config logs on each controller node, you would see puppet re-applying on each node, for every step defined in the template.
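
If you want to see where the time goes, the event list is easy to post-process. Here's a rough sketch that pairs each resource's IN_PROGRESS and COMPLETE events and reports the elapsed seconds; the sample lines are a subset of the output above, and the line format is assumed to match it.

```python
# Sketch: per-resource durations from "openstack stack event list" output.
# Assumes the "<date> <time> [<name>]: <STATUS> ..." format shown above.
import re
from datetime import datetime

EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]+)\]: (\w+)")

EVENTS = """\
2016-06-09 08:52:46 [overcloud-ControllerNodesPostDeployment-smy5ygz2lc26]: UPDATE_IN_PROGRESS Stack UPDATE started
2016-06-09 08:52:58 [ControllerLoadBalancerDeployment_Step1]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:53:32 [ControllerLoadBalancerDeployment_Step1]: UPDATE_COMPLETE state changed
2016-06-09 08:54:57 [ControllerOvercloudServicesDeployment_Step4]: UPDATE_IN_PROGRESS state changed
2016-06-09 08:56:14 [ControllerOvercloudServicesDeployment_Step4]: UPDATE_COMPLETE state changed
2016-06-09 08:57:26 [overcloud-ControllerNodesPostDeployment-smy5ygz2lc26]: UPDATE_COMPLETE Stack UPDATE completed successfully
""".splitlines()


def resource_durations(lines):
    """Map resource name to seconds between IN_PROGRESS and COMPLETE."""
    started, durations = {}, {}
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        name, status = match.group(2), match.group(3)
        if status.endswith("_IN_PROGRESS"):
            started[name] = when
        elif status.endswith("_COMPLETE") and name in started:
            durations[name] = int((when - started.pop(name)).total_seconds())
    return durations


if __name__ == "__main__":
    for name, seconds in resource_durations(EVENTS).items():
        print("{:>4}s  {}".format(seconds, name))
```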

This approach can be extended if you want to e.g test changes to the stack template (or files it references such as puppet manifests or scripts), you would do something like:

$ cp -r /usr/share/openstack-tripleo-heat-templates .
$ cd openstack-tripleo-heat-templates/
$ heat stack-update -x overcloud-ControllerNodesPostDeployment-smy5ygz2lc26 -e update_env.yaml -f puppet/controller-post.yaml

Note that if you want to do a final update of the entire overcloud, you would need to point to this copied tree (assuming you want to maintain any changes), e.g

$ openstack overcloud deploy --templates /path/to/copy/openstack-tripleo-heat-templates

TripleO Heat templates Part 3 - Cluster configuration, introduction/primer

2015-05-17T01:06:00.003-07:00

In my previous two posts I covered an overview of TripleO template roles and groups, and specifics of how initial deployment of a node happens. Today I'm planning to introduce the next step of the deployment process - taking the deployed groups of nodes, and configuring them to work together as clusters running the various OpenStack services encapsulated by each role.

This post will provide an introduction to the patterns and Heat features used to configure the groups of nodes, then in the next instalment I'll dig into the specifics of exactly what configuration takes place in the TripleO heat templates.

Recap - the deployed group of servers

So, we're continuing from where we got to at the end of the last post - we've deployed a ResourceGroup containing several OS::TripleO::Controller resources, which in turn have deployed a nova server, and done some initial configuration of it.

What comes next is configuring the whole group, or cluster, to work together, e.g configuring the OpenStack services running on the controller.

Group/Cluster configuration with Heat

Similar to the SoftwareDeployment (singular) resources described in my previous post, Heat supports applying a SoftwareConfig to a group of servers via the SoftwareDeployments and StructuredDeployments (plural) resources. The function of both is basically the same, one works with a SoftwareConfig resource and the other with a StructuredConfig resource.

Typically (in TripleO at least) StructuredDeployments resources are used combined with a ResourceGroup containing some servers. You pass a list of servers to configure (provided via an attribute from the OS::Heat::ResourceGroup resource), and a reference to a StructuredConfig resource.

The StructuredConfig resource defines the configuration to apply to each server, and the StructuredDeployments resource then internally creates a series of StructuredDeployment (singular) resources, one per server.

When all of the deployment (singular) resources complete, the deployments (plural) resource goes CREATE_COMPLETE - if any of the nested deployment resources fail, the deployments resource will go into a FAILED state.
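
As a mental model (and only that, this is not Heat's code), the status roll-up from the per-server deployments to the group can be sketched like this:

```python
# Simplified model of how the plural resource's status follows its members.
# Not Heat's implementation; it only mirrors the behaviour described above.
def group_status(member_statuses, action="CREATE"):
    """Roll per-server deployment statuses up into one group status."""
    if any(status.endswith("_FAILED") for status in member_statuses):
        return action + "_FAILED"
    if all(status.endswith("_COMPLETE") for status in member_statuses):
        return action + "_COMPLETE"
    return action + "_IN_PROGRESS"


if __name__ == "__main__":
    print(group_status(["CREATE_COMPLETE", "CREATE_IN_PROGRESS"]))
    print(group_status(["CREATE_COMPLETE", "CREATE_FAILED"]))
```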

Debugging groups of deployments

You may notice that the StructuredDeployments resource above looks a lot like the ResourceGroup containing the OS::TripleO::Controller resources - this is no coincidence, internally heat actually creates a ResourceGroup containing the StructuredDeployment resources.

This is a useful fact to remember when debugging, because it means you can use the techniques I've previously described to inspect the individual Deployment resources created by the StructuredDeployments resource, e.g so you can use heat deployment-show <id> to help diagnose a problem with a failing deployment inside the StructuredDeployments group (which is often quicker and more convenient than SSHing onto the failing node and trawling the logs).

For example, here's a simple bash script which dumps out details about all of the Deployment resources in an overcloud, obviously you can add in a "grep FAILED" here if you just want to see details about failing deployments:

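#!/bin/bash
# Dump details for every Software/Structured Deployment resource in the overcloud.
while read -r line
do
  # Rows are "|"-separated; field 1 is empty because each row starts with "|".
  deployment_name=$(echo "$line" | cut -d"|" -f2)
  deployment_id=$(echo "$line" | cut -d"|" -f3)
  parent_name=$(echo "$line" | cut -d"|" -f7)
  echo "deployment=$deployment_name ($deployment_id) parent $parent_name"
  # Left unquoted on purpose so the padding spaces around the ID are dropped.
  heat deployment-show $deployment_id
  echo "---"
done < <(heat resource-list --nested-depth 5 overcloud | grep "OS::Heat::\(Software\|Structured\)Deployment ")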

We should probably add a python-heatclient feature which automates this lookup (particularly for failing deployments), but right now that is one way to do it.
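
In the meantime, here's a sketch of the same filtering in Python, operating on captured `heat resource-list --nested-depth 5 overcloud` output. The sample rows and IDs are invented for illustration, and the column order (name, ID, type, status, updated time, parent stack) is an assumption that matches the cut fields used in the bash script.

```python
# Sketch: select (Software|Structured)Deployment rows, optionally by status.
# Sample rows are invented; the column order is an assumption (see lead-in).
import re

# The trailing space keeps the plural "...Deployments" group resources out.
DEPLOYMENT_TYPE = re.compile(r"OS::Heat::(Software|Structured)Deployment ")

SAMPLE = [
    "| ControllerDeployment | aaaa | OS::Heat::StructuredDeployment | CREATE_COMPLETE | 2016-06-08T18:07:00 | overcloud-Controller-0 |",
    "| ControllerBootstrap | bbbb | OS::Heat::StructuredDeployments | CREATE_FAILED | 2016-06-08T18:07:00 | overcloud |",
    "| NetworkDeployment | cccc | OS::Heat::SoftwareDeployment | CREATE_FAILED | 2016-06-08T18:07:00 | overcloud-Controller-0 |",
]


def deployment_rows(lines, status=None):
    """Return dicts for singular deployment rows, filtered by status text."""
    rows = []
    for line in lines:
        if not DEPLOYMENT_TYPE.search(line):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if status and status not in cells[3]:
            continue
        rows.append({"name": cells[0], "id": cells[1], "type": cells[2],
                     "status": cells[3], "parent": cells[5]})
    return rows


if __name__ == "__main__":
    for row in deployment_rows(SAMPLE, status="FAILED"):
        print("heat deployment-show", row["id"], "#", row["name"])
```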

Until next time..!

So here we've covered the basics of how Heat can be used to configure groups of servers, and we've illustrated how that pattern is applied in the TripleO templates.

The TripleO templates use this technique for all roles, to do multiple configuration passes during the deployment - in the next post I'll cover specifics of how this works in detail, but for now you can check out the TripleO heat templates and hopefully see this pattern for yourself. Note that it's combined with provider resource abstractions as previously discussed, which as we will see makes for a nicely abstracted approach to cluster configuration which is pretty easy to modify, extend, or plug in alternative implementations.

TripleO Heat templates Part 2 - Node initial deployment & config

2015-05-13T09:02:00.000-07:00

In my previous post "TripleO Heat templates Part 1 - roles and groups", I provided an overview of the various TripleO roles, the way the role implementation is abstracted via provider resources, and how they are grouped and scaled via OS::Heat::ResourceGroup.

In this post, I'm aiming to dig into the next level of template implementation, specifically how a role is implemented behind the provider resource alias used in the top-level template.

I'm only going to cover one role type for now OS::TripleO::Controller, because the patterns described are directly applicable to all other role types. I'm also going to focus on the puppet based implementation (because that's what I'm most familiar with), but again most concepts apply to the element/container/etc based implementations too.

Throughout this post, I'll be referring to templates in the tripleo-heat-templates repo, so if you haven't already, now might be a good time to clone that so you can follow along looking at the templates themselves.

Recap - the controller group definition

So, as described in my previous post, the top-level TripleO heat template defines an OS::Heat::ResourceGroup called "Controller", which contains a group of OS::TripleO::Controller resources.

This OS::TripleO::Controller resource type is mapped to another heat template via the resource registry in the heat environment, like this:

resource_registry:
  OS::TripleO::SoftwareDeployment: OS::Heat::StructuredDeployment
  OS::TripleO::Controller: puppet/controller-puppet.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: net-config-bridge.yaml
  OS::TripleO::ControllerConfig: puppet/controller-config.yaml
  OS::TripleO::NodeUserData: firstboot/userdata_default.yaml

For clarity, I've removed the mappings not related to the controller here, and I've also not shown the resources related to configuring the cluster after initial deployment via the ResourceGroup (that will be covered in the next installment! :)

I'm going to take these pieces step by step to show how the first part of the deployment flow works, starting with building one OS::TripleO::Controller.

Initial deployment flow, step by step

Creating an OS::TripleO::Controller resource creates a heat nested stack, using the template defined in the resource_registry.

The deployment flow will be familiar to anyone who has tried out Heat SoftwareConfig resources, as I covered in a previous post:

The deployment sequence looks like this:

An OS::TripleO::NodeUserData resource is created, by default this does nothing, but it provides a hook where deployers can easily plug in site specific "firstboot" configuration, e.g some special cloud-config to pass to cloud-init, or some script to run (more on this in a future post).
We create an OS::Nova::Server resource (confusingly called "Controller", the same as the ResourceGroup in the parent template..), using the flavor and size passed in to the template via parameters. Typically the "baremetal" flavor will be used, configured so the deployment happens via Ironic to enable deployment to baremetal servers.
An OS::TripleO::SoftwareDeployment is created, which applies an OS::TripleO::Net::SoftwareConfig SoftwareConfig resource to the server - as indicated by the names, these abstractions configure the network on the node, using the exact same method described in the primer on SoftwareConfig resources - the resources are named differently to enable abstractions which cleanly support different network configurations (and in future topologies), e.g in the resource_registry above we'll be applying the config defined in net-config-bridge.yaml.
Last but not least, we use another OS::TripleO::SoftwareDeployment to apply ControllerConfig, which simply passes a large map of data to os-apply-config, which is then stored as hieradata (to be consumed later by puppet when it's configuring the services on the deployed cluster of nodes).

Phew, is that all?

Well, as the eagle-eyed amongst you will have spotted, it's not - but it is the end of the initial deployment phase prior to configuring the cluster.

We've deployed a node
Optionally performed some "firstboot" configuration
Configured the network on the node
Performed some preliminary configuration of the services on the node

This means when your ResourceGroup of OS::TripleO::Controller nodes goes to CREATE_COMPLETE state, you have a bunch of active, partially configured (but basically useless) controller nodes.

The next step is to perform a series of post-deployment configuration passes on the whole ResourceGroup, or in other words configure the cluster of controllers so you have a group of fully functional OpenStack controller nodes - more on this in my next post! :)

TripleO Heat templates Part 1 - Roles and Groups

2015-05-07T11:17:00.001-07:00

This is the start of a series of posts aiming to de-construct the TripleO heat templates, explaining the abstractions that exist, and the heat features which enable them.

If you're not already a little familiar with ResourceGroups, "Provider Resources" used for template composition, and SoftwareConfig resources, it's probably not a bad idea to check out my previous posts on those topics, as well as our user guide and other documentation - TripleO makes heavy use of all of these features.

Overcloud "Roles"

TripleO typically refers to the deployed OpenStack cloud as the "overcloud", because the tools used to perform that deployment mirror those in the deployed cloud - e.g a small OpenStack is used to bootstrap and manage a bigger one (normally the small OpenStack is called either a "seed" or "undercloud", depending on your environment).

The definition of what is deployed in your overcloud exists in a number of Heat templates, with the top-level one defining a number of groups of different node types, or "roles".

Controller: Contains API services, e.g Keystone, Neutron, Heat, Glance, Ceilometer, Horizon, and the API parts of Nova, Cinder & Swift. It can also optionally host the storage parts for Cinder, Swift and Ceph if these are not deployed separately (see below).
Compute: Contains the Nova Hypervisor components
BlockStorage: Contains the Cinder storage components (if not hosted on the Controller(s)).
ObjectStorage: Contains the Swift storage components (if not hosted on the Controller(s)).
CephStorage: Contains the Ceph storage components (if not hosted on the Controller(s)).

Roles & resource types

Each of the roles (or node types) is mapped to a type defined in the resource_registry in the environment passed to Heat.

So, for example, the "Controller" role is defined in the heat template as a type OS::TripleO::Controller, and similar aliases exist for all the other roles.

The resource registry maps this type alias to another heat template, which implements whatever is required to deploy one node with that role.

So to create a node type "OS::TripleO::Controller" Heat may create a stack based on the template in "puppet/controller-puppet.yaml", or some other implementation based on whatever mapping exists in the resource_registry.
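
Here's a toy version of that lookup, to show why swapping implementations is just a matter of changing the mapping. It's a sketch of the idea rather than Heat's actual resolution code; the registry entries are taken from the TripleO environment discussed in this series.

```python
# Toy model of resource_registry resolution; not Heat's implementation.
REGISTRY = {
    "OS::TripleO::Controller": "puppet/controller-puppet.yaml",
    "OS::TripleO::SoftwareDeployment": "OS::Heat::StructuredDeployment",
}


def resolve_type(type_name, registry):
    """Follow registry mappings until a template path or native type is reached."""
    seen = set()
    while type_name in registry and type_name not in seen:
        seen.add(type_name)  # guard against accidental mapping loops
        type_name = registry[type_name]
    return type_name


if __name__ == "__main__":
    for alias in ("OS::TripleO::Controller", "OS::Nova::Server"):
        print(alias, "->", resolve_type(alias, REGISTRY))
```

Types with no entry in the registry (e.g OS::Nova::Server) simply resolve to themselves.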

This makes it very easy if you want to plug in some alternate implementation, while maintaining the top-level template interfaces and deployment topology. For example, work is currently in progress on an alternate implementation using docker containers, as an alternative to the existing puppet and element based implementations.

Roles & ResourceGroups

Each of these roles may be independently scaled - because they are defined in an OS::Heat::ResourceGroup. The minimum you can deploy is one "Controller" and one "Compute" node (some roles may be deployed with zero nodes in the group).

Here's an example of what that looks like in the top level "overcloud-without-mergepy" template (this is the name of the main template TripleO uses to deploy OpenStack, the "without-mergepy" part is historical and refers to an older, now deprecated, implementation):

Controller:
    type: OS::Heat::ResourceGroup
    properties:
      count: {get_param: ControllerCount}
      resource_def:
        type: OS::TripleO::Controller
        properties:
          AdminPassword: {get_param: AdminPassword}
          AdminToken: {get_param: AdminToken}
          ...

Here, you can see we've defined a group of OS::TripleO::Controller resources in an OS::Heat::ResourceGroup, and the number of nodes deployed is controlled via a template parameter, "ControllerCount", and similarly a number of template parameters are referenced to provide input properties to enable configuration of the deployed controller node (I've abbreviated the full list of properties).

This pattern is repeated for all roles, so building a specified number of nodes for a particular role (or adding/removing them via a stack-update), is as simple as passing a different number into Heat as a stack parameter :)
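
Conceptually, the group just stamps out `count` copies of the resource definition. A simplified sketch of that expansion follows; it is a model, not Heat's code, and the parameter value is a placeholder.

```python
# Simplified model of ResourceGroup expansion; not Heat's implementation.
def expand_group(count, resource_def):
    """Return one copy of resource_def per member, keyed by index."""
    return {str(index): dict(resource_def) for index in range(count)}


if __name__ == "__main__":
    params = {"ControllerCount": 3}  # placeholder stack parameter value
    members = expand_group(params["ControllerCount"],
                           {"type": "OS::TripleO::Controller"})
    for name, definition in sorted(members.items()):
        print(name, definition["type"])
```

Scaling up or down is then only a matter of passing a different count on the next stack-update.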

That's all, folks

That's all for today - hopefully it provides an overview of the top-level interfaces provided by the TripleO Heat templates, and illustrates the following:

There are clearly defined node "roles", containing the various parts of your OpenStack deployment
The patterns used to define and implement these roles are repeated, which helps understand the templates despite them being fairly large.
The implementation is modular, and abstractions exist which make implementing different "back end" implementations relatively simple.
Deployments can be easily scaled due to using Heat's ResourceGroup functionality.

In future instalments I'll dig further into the individual node implementations, ways to easily plug in site-specific additional configuration, and ways in which you can control and validate the deployments performed via TripleO.

Heat SoftwareConfig resources - primer/overview.

2015-05-05T09:15:00.000-07:00

In this post, I'm going to provide an overview of Heat's Software Configuration resources, as a preface to digging in more detail into the structure of TripleO heat templates, which leverage SoftwareConfig functionality to install and configure the deployed OpenStack cloud.

Heat has supported SoftwareConfig and SoftwareDeployment resources since the Icehouse release, in an effort to provide flexible and non-opinionated abstractions which enable integration with existing software configuration tools and scripts.

The key concepts and some examples are described in our user guide, but what follows is more of a primer, to provide necessary context before we get in to decomposing the TripleO heat templates. If what follows looks a little scary, check out my introductory (very simple) screencast, which was recorded for the Heat Beyond the Basics session at the Paris OpenStack summit.

Heat SoftwareConfig resources

There are three resources necessary in a typical software configuration scenario:

1. An OS::Heat::SoftwareConfig resource - this encapsulates the config to be applied, e.g a script, puppet manifest, or any other config definition format you care to use. This is just a wrapper for the config to apply, optionally parameterized with input values, it doesn't actually configure anything.
2. An OS::Heat::SoftwareDeployment resource - this is the thing which actually applies the config from (1) - when it moves to an IN_PROGRESS state, it makes the config available to the specified server. By default, the deployment will stay in the IN_PROGRESS state until a signal is received via the Heat API, notifying the service of success (or failure..) applying the config.
3. An OS::Nova::Server resource - this is the instance (or physical server in the case of TripleO deploying via Nova and Ironic) being configured, it must contain some tools to support SoftwareConfig, as discussed below, and define the user_data_format property to enable SoftwareConfig.

There are also OS::Heat::StructuredConfig and OS::Heat::StructuredDeployment resources, which are basically identical to SoftwareConfig/SoftwareDeployment resources, except the config is defined as a map, rather than a string (which is useful for some tools such as os-apply-config, which consumes a map of config data, rather than a string such as is used e.g applying a script or a puppet manifest). So I'll refer only to SoftwareConfig/SoftwareDeployment from here on, but all the concepts are applicable to both.

SoftwareDeployment flow

The server being configured requires some agents in order to collect and process the configuration made available via the deployment. Typically, this works as follows:

os-collect-config polls the Heat API for updated resource metadata associated with the OS::Nova::Server resource
When metadata is updated, os-refresh-config runs, and triggers an element called heat-config.
heat-config then uses the "group" property defined in the SoftwareConfig properties to process applying the configuration via a hook script. Heat provides (via the heat-templates repo) a variety of ready-made hook scripts for some popular tools, but it's simple to write your own if needed, and it involves no changes to Heat, only a script inside the image you deploy.
On completion, the hook uses "heat-config-notify" to send a signal back to Heat, which includes the return code of the thing we ran, along with stdout/stderr from the tool invoked by the hook script. This information is then made available via "heat deployment-show" via the Heat API.
If the signal notified success, we move the SoftwareDeployment to COMPLETE state, e.g CREATE_COMPLETE. If it failed, we move it to e.g CREATE_FAILED state.
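
The last two steps of this flow can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not Heat's internal code: the function name deployment_state and its arguments are hypothetical, and it assumes a zero return code means success and anything else means failure.

```python
# Illustrative sketch only: deployment_state is a hypothetical helper,
# not part of Heat. It assumes the hook signals the tool's return code
# and that 0 means success.
def deployment_state(action, deploy_status_code):
    """Map the return code signalled by the hook to a resource state."""
    outcome = "COMPLETE" if deploy_status_code == 0 else "FAILED"
    return f"{action}_{outcome}"


print(deployment_state("CREATE", 0))
print(deployment_state("CREATE", 6))
```

Until a signal arrives, the deployment would simply stay in the IN_PROGRESS state, which is why a missing or broken agent in the image typically shows up as a deployment that never completes.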

SoftwareDeployment HOT template definition

We have example templates for many popular tools, but one thing to emphasize is that the template definition is not dependent on the tool doing the configuration - the coupling between the config to apply and the tool applying it only happens inside the instance via the heat-config hook logic.

Looking at the puppet example step by step:

1. Define the SoftwareConfig resource

config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: puppet
      inputs:
      - name: foo
        default: aninput
      - name: bar
      outputs:
      - name: result
      config:
        get_file: config-scripts/example-puppet-manifest.pp

Here, we can see the following:

We define the SoftwareConfig, which references a puppet manifest via get_file.
We parameterize applying the manifest by providing some input values, each of which can specify a default value.
The "group" is specified as "puppet", which will enable heat-config to correctly apply the manifest using the heat-config-puppet hook.
We specify an output - this means inside the manifest we can reference the special "heat_outputs_path" input, and write a file containing a result related to this output.

A SoftwareConfig, when created, goes to CREATE_COMPLETE immediately, there is no dependency on applying the config to a server, thus a config once defined may be applied to multiple servers (potentially with different input parameters).

2. Define the Server resource

server:
    type: OS::Nova::Server
    properties:
      image: animage
      flavor: m1.small
      user_data_format: SOFTWARE_CONFIG

Here, there are some things to note:

The image provided must contain the tools previously discussed (os-collect-config, os-refresh-config, heat-config and whatever hooks you need for the "group" properties you want to specify)
user_data_format must be set to "SOFTWARE_CONFIG"
user_data may still be specified - this is useful where you want to combine first-boot configuration (e.g via cloud-init) with subsequent application deployment via SoftwareConfig. A recent TripleO patch illustrates this approach.

Optionally you can also specify the transport used for polling metadata via software_config_transport - by default heat will poll via the heat-api-cfn API, but you can also poll via the native heat-api, or in Kilo Heat, via Swift.

3. Define the SoftwareDeployment resource

deployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      config:
        get_resource: config
      server:
        get_resource: server
      input_values:
        foo: abc
        bar: xyz
      actions:
      - CREATE

I've put the SoftwareDeployment resource last on purpose - to highlight the normal deployment flow:

heat creates the SoftwareConfig, then the Server, then the Deployment
The Deployment applies the config, depending on both the SoftwareConfig and Server resources (implicitly via the get_resource references).
The Deployment resource status depends on the status of the signal sent back to heat after applying the config (via the heat-config hook).

Note the "actions" property is optional (it defaults to applying the config on both CREATE and UPDATE actions), but it can be used to specify that, for example, you only want configuration to happen on create and not update (as above), or perhaps that you only want to run some cleanup on DELETE, for example unregistering the server from some external service.

input_values is used to provide the values for the parameters defined in the SoftwareConfig resource - thus it's perfectly fine for multiple Deployment resources to reference the same Config resource (potentially with different input_values and/or actions specified), but each deployment references exactly one Config (e.g a SoftwareDeployment cannot apply multiple SoftwareConfig resources, only one).
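
To make the relationship between the SoftwareConfig inputs and the Deployment input_values concrete, here is a simplified Python model. It is a sketch under stated assumptions rather than Heat's implementation: resolve_inputs is a hypothetical helper, and it assumes an input with no value and no default simply resolves to None.

```python
# Simplified model, not Heat's code: resolve_inputs is hypothetical.
def resolve_inputs(config_inputs, input_values):
    """Combine SoftwareConfig input definitions with Deployment input_values."""
    resolved = {}
    for spec in config_inputs:
        name = spec["name"]
        if name in input_values:
            # A value given by the Deployment wins over the default.
            resolved[name] = input_values[name]
        else:
            # Fall back to the default from the SoftwareConfig (None if unset).
            resolved[name] = spec.get("default")
    return resolved


# The inputs from the SoftwareConfig example above.
config_inputs = [{"name": "foo", "default": "aninput"}, {"name": "bar"}]

# Two deployments sharing one config, with different input_values.
print(resolve_inputs(config_inputs, {"foo": "abc", "bar": "xyz"}))
print(resolve_inputs(config_inputs, {"bar": "xyz"}))
```

The second call shows the default for "foo" being used when a deployment does not override it.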

Dealing with dependencies

A common pattern is doing a series of configuration steps, for example configure a database, then some application that uses the database (and requires it to be installed and configured).

There are a couple of ways to handle this:

SoftwareDeployment resources have a "name" property, which can influence the sort-order so that, for example, heat-config will apply "config1" before "config2".
Template directive "depends_on" can be used to specify an explicit dependency between two (or more) SoftwareDeployment resources (Note: the dependency is between the *deployment* resources, not the SoftwareConfig resources!)

My preference is to use "depends_on" in most cases - it provides the most explicit control of deployment serialization, and makes the ordering very clear from the template level.
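
The ordering that "depends_on" gives you is a topological sort of the deployment resources. The sketch below uses Python's standard graphlib module (Python 3.9+) to show the idea. The resource names are hypothetical, and this models the concept only, it is not how Heat computes its dependency graph internally.

```python
from graphlib import TopologicalSorter

# Hypothetical deployment resources, each mapped to the list of
# deployments named in its depends_on.
depends_on = {
    "deploy_database": [],
    "deploy_application": ["deploy_database"],
    "deploy_monitoring": ["deploy_application"],
}

# static_order() yields each resource only after everything it depends on.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Because the dependencies form a simple chain here, there is only one valid order: database, then application, then monitoring.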

Conclusion and further resources

Hopefully that provides a reasonable overview of heat SoftwareConfig capabilities. For further information, please see the heat documentation; the user guide, template guide and example templates are a good starting point.

If you're feeling brave, you can also dive into the TripleO heat templates, which make extensive use of SoftwareConfig/StructuredConfig and SoftwareDeployment/StructuredDeployment resources. More on TripleO specifically in a future post! :)

Debugging TripleO Heat templates

2015-04-20T09:16:00.000-07:00

Lately, I've been spending increasing amounts of time working with TripleO heat templates, and have noticed some recurring aspects of my workflow whilst debugging them which I thought may be worth sharing.

For the uninitiated, TripleO is an OpenStack deployment project, which aims to deploy and manage OpenStack using standard OpenStack APIs. In practice, this means using Nova and Ironic for baremetal node provisioning, and Heat to orchestrate the deployment and configuration of the nodes.

The TripleO heat templates, unlike most of the heat examples, are pretty complex. They make extensive use of many "advanced" features, such as nested stacks, using provider resources via the environment and also many software config resources.

This makes TripleO a fairly daunting target to those wishing to modify and/or debug the TripleO templates.

Fortunately TripleO templates, although large, have many repeated patterns, and good levels of abstraction and modularity. Combined with some recently added heat interfaces, it becomes rapidly less daunting, as I will demonstrate in the worked example below:

Step 1: Create the Stack

So, step 1 when deploying OpenStack via TripleO is to do a "heat stack-create". Whether you create the heat stack directly via python-heatclient (which is what the TripleO "devtest" script calls), or indirectly via some other interface such as tuskar-ui, the end result is the same - a heat stack is created (by default it's called "overcloud"):

$ heat stack-create -e /home/shardy/tripleo/overcloud-env.json -e /home/shardy/tripleo/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -t 360 -f /home/shardy/tripleo/tripleo-heat-templates/overcloud-without-mergepy.yaml -P ExtraConfig= overcloud

+--------------------------------------+------------+--------------------+----------------------+
| id | stack_name | stack_status | creation_time |
+--------------------------------------+------------+--------------------+----------------------+
| e4cfc4a8-d9e9-4033-8556-5ebca84c1455 | overcloud | CREATE_IN_PROGRESS | 2015-04-20T11:05:53Z |
+--------------------------------------+------------+--------------------+----------------------+

Step 2: Oh No - CREATE_FAILED!

Ok, it happens - sometimes you have a fault in your environment, a bug in your templates, or just get bitten by a regression in one of the projects used to deploy your overcloud.

Unfortunately that modularity I just mentioned in the templates leads to a level of additional complexity when debugging - the tree of resources created by heat is actually grouped into nearly 40 nested stacks in my environment! (This number is dependent on the number of nodes you're deploying.)

You can see them all, including which one failed, with heat stack-list, using the --show-nested option, and your choice of either grep "FAILED" or the -f filter option to python-heatclient:

$ heat stack-list --show-nested -f "status=FAILED"
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+
| id                                   | stack_name                                                                                               | stack_status | creation_time        | parent                               |
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+
| e4cfc4a8-d9e9-4033-8556-5ebca84c1455 | overcloud                                                                                                | CREATE_FAILED | 2015-04-20T11:05:53Z | None                                 |
| 36f3ef93-872f-460b-bd6a-14a89569d5a7 | overcloud-ControllerNodesPostDeployment-rl67kiqu7pbp                                                     | CREATE_FAILED | 2015-04-20T11:09:18Z | e4cfc4a8-d9e9-4033-8556-5ebca84c1455 |
| 28d1fd38-85ba-442b-9e57-859731349e94 | overcloud-ControllerNodesPostDeployment-rl67kiqu7pbp-ControllerDeploymentLoadBalancer_Step1-tnsuslbx5hu7 | CREATE_FAILED | 2015-04-20T11:09:20Z | 36f3ef93-872f-460b-bd6a-14a89569d5a7 |
+--------------------------------------+----------------------------------------------------------------------------------------------------------+---------------+----------------------+--------------------------------------+

Here, we can derive some useful information by looking at the stack names. Note that in all cases we can disregard the randomly generated suffix on the stack names (heat adds it internally for nested stack resources).

overcloud is the top-level stack, the parent at the top of the tree. This is defined by the overcloud-without-mergepy.yaml template which we passed to heat stack-create.
ControllerNodesPostDeployment-rl67kiqu7pbp is the nested stack which handles post-deployment configuration of all Controller nodes. This is the ControllerNodesPostDeployment resource, defined by the overcloud resource registry as the implementation of the OS::TripleO::ControllerPostDeployment type, which is a provider resource alias for this template when using the puppet implementation.
The final (verbosely named!) stack maps to the ControllerDeploymentLoadBalancer_Step1 resource in controller-post-puppet.yaml.
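
If you want to do this mapping mechanically, a small Python sketch along these lines can strip the random suffixes and recover the chain of resource names. The helper resource_path is hypothetical, and it makes a loud assumption: neither the top-level stack name nor any resource name contains a hyphen. That holds for the stack names shown above, but not in general.

```python
# Hypothetical helper. Assumes names look like
# <top-level stack>-<resource>-<suffix>-<resource>-<suffix>...
# and that no stack or resource name itself contains a hyphen.
def resource_path(nested_stack_name):
    """Return the top-level stack name followed by the nested resource names."""
    parts = nested_stack_name.split("-")
    top_level, rest = parts[0], parts[1:]
    # rest alternates: resource name, random suffix, resource name, ...
    return [top_level] + rest[0::2]


name = ("overcloud-ControllerNodesPostDeployment-rl67kiqu7pbp-"
        "ControllerDeploymentLoadBalancer_Step1-tnsuslbx5hu7")
print(" -> ".join(resource_path(name)))
```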

All of this is a long-winded way of saying that something went wrong applying a puppet manifest, via an OS::Heat::StructuredDeployments resource (ControllerDeploymentLoadBalancer_Step1) - anything with "Deployment" in the name failing is highly likely to mean the same thing.

Armed with this information, we can proceed to figure out why :)

Step 3: Resource Introspection

So we now know which nested stack failed, but not which resource, or why.

There's a couple of ways to find this out, you can either use the steps outlined in my previous post about nested resource introspection, or (if you're lazy like me), you can use the heat resource-list --nested-depth option to save some time:

$ heat resource-list --nested-depth 5 overcloud | grep FAILED
| ControllerNodesPostDeployment               | 36f3ef93-872f-460b-bd6a-14a89569d5a7          | OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   | 2015-04-20T11:05:53Z |                                        |
| ControllerDeploymentLoadBalancer_Step1      | 28d1fd38-85ba-442b-9e57-859731349e94          | OS::Heat::StructuredDeployments                   | CREATE_FAILED   | 2015-04-20T11:09:19Z | ControllerNodesPostDeployment          |
| 0                                           | 980137bc-21b1-460c-9d4a-488cb5611a6c          | OS::Heat::StructuredDeployment                    | CREATE_FAILED   | 2015-04-20T11:09:20Z | ControllerDeploymentLoadBalancer_Step1 |

Here, we can see several things:

ControllerDeploymentLoadBalancer_Step1 has failed, it's an OS::Heat::StructuredDeployments resource. StructuredDeployments (plural) resources apply a heat StructuredConfig/SoftwareConfig to a group of servers.
There's a "0" resource, which is an OS::Heat::StructuredDeployment (singular) type. The parent resource (last column) of this is ControllerDeploymentLoadBalancer_Step1. This is because a SoftwareDeployments resource creates a nested stack with a (sequentially named) SoftwareDeployment per server (in this case, one per Controller node in the OS::Heat::ResourceGroup defined as "Controller" in the overcloud-without-mergepy template)

Now, we can do a resource-show to find out the reason for the failure. Here, we use the ID of ControllerDeploymentLoadBalancer_Step1 as the stack ID, because all nested stack resources set the ID as that of the stack they create:

$ heat resource-show 28d1fd38-85ba-442b-9e57-859731349e94 0 | grep resource_status_reason
| resource_status_reason | Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6

So, to summarize what we've discovered so far:

A SoftwareDeployment (in this case a puppet run) failed on Controller node 0
The thing it was running exited with status code 6.

The next step is to look at the logs to work out why..

Step 4: Debugging the failure

When a Heat SoftwareDeployment resource is triggered, it runs something on the node (e.g applying a puppet manifest), then signals either success or failure back to Heat. Fortunately, in recent versions of Heat, there is an API which exposes this information (in a more verbose way than the resource-show output above with the reason for failure).

To access it, you need the ID of the deployment (e.g 980137bc-21b1-460c-9d4a-488cb5611a6c) from the heat resource-list above:

$ heat deployment-show 980137bc-21b1-460c-9d4a-488cb5611a6c
{
"status": "FAILED",
"server_id": "6a025200-b20e-47df-ae4c-97a54499b586",
"config_id": "b924d133-42d7-48ab-b2c9-7311de3b3ca4",
"output_values": {
    "deploy_stdout": "<stdout of command>",
    "deploy_stderr": "<stderr of command>",
    "deploy_status_code": 6
},
"creation_time": "2015-04-20T11:09:20Z",
"updated_time": "2015-04-20T11:10:02Z",
"input_values": {},
"action": "CREATE",
"status_reason": "deploy_status_code : Deployment exited with non-zero status code: 6",
"id": "980137bc-21b1-460c-9d4a-488cb5611a6c"
}

I've not included the full stderr/stdout because it's pretty long, but it's basically the same information that you get from SSHing onto the node and looking at the logs.

If you still want to do that, you can use "nova show" with the "server_id" above to get the IP of the node, SSH in and do further investigations.
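
Since the deployment-show output is JSON, it's easy to pull out just the interesting fields with a few lines of Python. This is a minimal sketch: the summarize helper is hypothetical, and the sample document is a trimmed copy of the output above with the same placeholder stdout/stderr values.

```python
import json

# Trimmed copy of the deployment-show output above (placeholders kept).
raw = """{
  "status": "FAILED",
  "server_id": "6a025200-b20e-47df-ae4c-97a54499b586",
  "output_values": {
    "deploy_stdout": "<stdout of command>",
    "deploy_stderr": "<stderr of command>",
    "deploy_status_code": 6
  }
}"""


def summarize(deployment_json):
    """Hypothetical helper: pick out status, server and exit code."""
    deployment = json.loads(deployment_json)
    outputs = deployment.get("output_values", {})
    return {
        "status": deployment["status"],
        "server_id": deployment["server_id"],
        "exit_code": outputs.get("deploy_status_code"),
        "stderr": outputs.get("deploy_stderr", ""),
    }


summary = summarize(raw)
print(summary["status"], "exit code", summary["exit_code"], "on", summary["server_id"])
```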

In Summary...

So those paying attention will have spotted that this all really boils down to two steps:

Use heat resource-list with the --nested-depth option to get the failing resource. The one you want is the one which isn't the parent_resource to any other and is in a FAILED state.
Investigate what caused the failure. For failing SoftwareDeployment resources, heat deployment-show is a useful time-saver which avoids always needing to log on to the node.
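
The first of these steps is easy to automate. Here is a small Python sketch of the "FAILED and not a parent of anything else that failed" rule, using the three rows from the resource-list output earlier. The failing_leaves helper and the tuple layout are hypothetical, and it assumes resource names are unique in the listing, which may not hold for large stacks (e.g several groups each containing a "0" member).

```python
# Rows as (resource_name, resource_status, parent_resource), taken from
# the resource-list output earlier in this post.
rows = [
    ("ControllerNodesPostDeployment", "CREATE_FAILED", ""),
    ("ControllerDeploymentLoadBalancer_Step1", "CREATE_FAILED",
     "ControllerNodesPostDeployment"),
    ("0", "CREATE_FAILED", "ControllerDeploymentLoadBalancer_Step1"),
]


def failing_leaves(rows):
    """Return failed resources which are not the parent of another failure."""
    failed = [row for row in rows if row[1].endswith("FAILED")]
    parents = {parent for _, _, parent in failed if parent}
    return [name for name, _, _ in failed if name not in parents]


print(failing_leaves(rows))
```

Here the only leaf is the "0" StructuredDeployment, which is the resource to pass to heat deployment-show (by its ID).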

Hopefully this somewhat demystifies the debugging of TripleO templates, and other large Heat deployments which use similar techniques such as nested stacks and SoftwareConfig resources!

Using Heat ResourceGroup resources

2014-09-11T03:48:00.001-07:00

This has come up a few times recently, so I wanted to share a howto showing how (and how not) to use the group resources in Heat, e.g OS::Heat::ResourceGroup and OS::Heat::AutoScalingGroup.

The key thing to remember when dealing with these resources is that they can multiply any number of resources (expressed as a heat stack), not just individual resources. This is a very cool feature when you get your head around it! :)

Let's go through a worked example, where we use ResourceGroup to create 5 identical servers, each with a cinder volume of the same size attached.

Resource group basics

To create one server with a volume attached, you define the server, a volume, and a volume attachment resource, like this:

Now, let's say you need 5 (or 500) of these identical servers with an attached volume. What you do *not* want to do is create three groups of resources (Server, Volume and VolumeAttachment), and somehow try to connect them all together. This is an anti-pattern which will cause you much pain and frustration! :)

Instead, you need to use ResourceGroup to scale out the combination of resources. Fortunately, Heat makes this very easy to do. Let's say you call the template above creating one server with attached volume server_with_volume.yaml, you can create 5 identical nested stacks, each containing one server, volume and volume attachment like this:

Note: currently templates referencing nested stack templates can only be launched via python-heatclient (not the Horizon dashboard, a known issue we're working on resolving).

Simply do heat stack-create my_group -f server_with_volume_group.yaml and Heat will create 5 identical servers, attached to 5 identical volumes!
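
Conceptually, the group just stamps out one nested stack per member from a single definition. The Python sketch below models that idea only, it is not Heat's implementation: expand_group is a hypothetical helper, and it assumes members are named by their index ("0", "1", ...).

```python
# Conceptual model only: expand_group is hypothetical, and naming the
# members by index is an assumption made for illustration.
def expand_group(count, resource_def):
    """Return one copy of the resource definition per group member."""
    return {str(index): dict(resource_def) for index in range(count)}


members = expand_group(5, {"type": "server_with_volume.yaml"})
for member_name, definition in members.items():
    print(member_name, definition["type"])
```

Each member is a whole nested stack (server, volume and attachment together), which is why the three resources never need to be wired up across separate groups.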

A more complete example related to the fragments above is available here.

Resource groups and provider resources

What's that you say? You don't like the nested stack reference hard-coded template name? No problem! :) You can also make use of the environment to define a provider resource type alias.

Then specify the type alias instead of the template name in the ResourceGroup definition:

This can be launched like this:

heat stack-create my_group2 -f server_with_volume_group.yaml -e env_server_with_volume.yaml

The example will work exactly as before, but different versions of My::Server::WithVolume can now easily be substituted. For example, if you need a staging workflow where the resource alias is reused across a large number of templates, different versions of the nested template can be specified by changing it in one place (the environment).

That is all, for more information, please see the examples in the heat-templates, and this new example which shows how to attach several identical volumes to one server.

Heat auth model updates - part 2 "Stack Domain Users"

2014-04-16T03:14:00.000-07:00

As promised, here's the second part of my updates on the Heat auth model, following on from part 1 describing our use of Keystone trusts.

This post will cover details of the recently implemented instance-users blueprint, which makes use of keystone domains to contain users related to credentials which are deployed inside instances created by heat. If you just want to know how the new stuff works, you can skip to the last sections :)

So...why does heat create users at all?

Let's start with a bit of context. Heat has historically needed to do some or all of the following:

Provide metadata to agents inside instances, which poll for changes and apply the configuration expressed in the metadata to the instance.
Signal completion of some action, typically configuration of software on a VM after it is booted (because nova moves the state of a VM to "Active" as soon as it spawns it, not when heat has fully configured it)
Provide application level status or metrics from inside the instance, e.g to allow AutoScaling actions to be performed in response to some measure of performance or quality of service.

Heat provides APIs which enable all of these things, but all of those APIs require some sort of authentication, e.g credentials so whatever agent is running on the instance is able to access them. So credentials must be deployed inside the instance, e.g here's how things work if you're using the heat-cfntools agents:

heat-cfntools agents data-flow with CFN-compatible API's

The heat-cfntools agents use signed requests, which requires an ec2 keypair created via keystone, which is then used to sign requests to the heat cloudformation and cloudwatch compatible API's, which are authenticated by heat via signature validation (which uses the keystone ec2tokens extension).

The problem is, ec2 keypairs are associated with a user. And we don't want to deploy credentials directly related to the stack owner, otherwise any compromise of the (implicitly untrusted) instance could result in a cascading compromise where an attacker could take control of anything the stack-owning user has permission to access.

I've used cfntools/ec2tokens as an example, but the same issue exists if you use any credential available via keystone (token, username/password) which can be used to authenticate with the heat APIs.

So we need separation/isolation of the credentials deployed in the instance, such that we can limit the access allowed to the minimum necessary to make heat work. Our first attempt at this did the following:

Create a new user in keystone, in the same project as the stack owner (either explicitly in the template via User and AccessKey resources, or for some resources such as WaitConditionHandle and ScalingPolicy we do it internally to obtain an ec2 keypair for generation of a pre-signed URL)
Add the "heat stack user" to a special role, default "heat_stack_user" (configurable via the heat_stack_user_role in heat.conf)
Limit the API surface accessible to the "heat_stack_user" via policy.json, with the expectation that access to other services will be restricted in a similar way, or denied completely via network separation/firewall rules.

This approach is flawed, and led to this long-standing bug. There are multiple problems:

It requires the user creating the stack to have permissions to create users in keystone, which typically requires administrative roles.
It doesn't provide complete separation - even with the policy rules, it's possible a compromised stack could abuse the credentials (for example obtaining metadata for some other stack created by the user in the same project)
It clutters the user list for the project with spurious (from the user/operator perspective) users who aren't "real" users, the users are a heat implementation detail, and we're exposing them to the end user.

Hmm, that sounds bad, what's the alternative?

Well, we've been considering that for quite some time ;) multiple solutions were discussed:

Delegating a subset of user roles via trusts (rejected because token expiry is not optional, and separation from the stack owner is desired, e.g we don't really want to delegate or impersonate them from the instance, we just need an identity which can be verified as related to the stack)
Rolling our own auth mechanism based on some random "token" (some folks were in favour of this, but I'm opposed to it, I think we should stick to orchestration and leverage or improve what's in keystone instead of taking on the burden and security risk of maintaining our own auth scheme)
Using the keystone OAuth extension to use OAuth keypairs and signed requests. (This was rejected due to lack of keystoneclient support, e.g client API and auth middleware, maybe we'll revisit enabling this as an option in some future release).
Isolating the in-instance users by creating them in a completely separate heat-specific keystone domain. This idea was first suggested by Adam Young, and is what we ended up implementing for Icehouse.

"Stack Domain Users", the details..

The new approach is, effectively, an optimisation of the existing implementation. We encapsulate all stack-defined users (i.e users created as a result of things contained in a heat template) in a separate domain, which is created specifically to contain things related only to heat stacks. A user is created which is the "domain admin", and heat uses that user to manage the lifecycle of the users in the "stack user domain".

There are two aspects of this I'll discuss below, firstly what deployers need to do to enable stack domain users in Heat (Icehouse or later), and secondly what actually happens when you create a stack, and how it addresses the previously identified problems:

When deploying heat:

A special keystone domain is created, e.g one called "heat" and the ID is set in the "stack_user_domain" option in heat.conf
A user with sufficient permissions to create/delete projects and users in the "heat" domain is created, e.g in devstack a user called "heat_domain_admin" is created, and given the admin role on the heat domain.
The username/password for the domain admin user is set in heat.conf (stack_domain_admin and stack_domain_admin_password). This user administers "stack domain users" on behalf of stack owners, so they no longer need to be admins themselves, and the risk of this escalation path is limited because the heat_domain_admin is only given administrative permission for the "heat" domain.

This is all done automatically for you when using recent devstack, but if you're deploying via some other method, you need to use python-openstackclient (which is the only CLI interface to v3 keystone) to create the domain and user:

Create the domain:

$OS_TOKEN refers to a token, e.g the service admin token or some other valid token for a user with sufficient roles to create users and domains.

$KS_ENDPOINT_V3 refers to the v3 keystone endpoint, e.g http://<keystone>:5000/v3 where <keystone> is the IP address or resolvable name for the keystone service.

openstack --os-token $OS_TOKEN --os-url=$KS_ENDPOINT_V3 --os-identity-api-version=3 domain create heat --description "Owns users and projects created by heat"

The domain ID is returned by this command, and is referred to as $HEAT_DOMAIN_ID below.

Create the user:

openstack --os-token $OS_TOKEN --os-url=$KS_ENDPOINT_V3 --os-identity-api-version=3 user create --password $PASSWORD --domain $HEAT_DOMAIN_ID heat_domain_admin --description "Manages users and projects created by heat"

The user ID is returned by this command and is referred to as $DOMAIN_ADMIN_ID below.

Make the user a domain admin:

openstack --os-token $OS_TOKEN --os-url=$KS_ENDPOINT_V3 --os-identity-api-version=3 role add --user $DOMAIN_ADMIN_ID --domain $HEAT_DOMAIN_ID admin

Then you need to add the domain ID, username and password from these steps to heat.conf:


stack_domain_admin_password = <password>
stack_domain_admin = heat_domain_admin
stack_user_domain = <domain id returned from domain create above>

When a user creates a stack:

We create a new "stack domain project" in the "heat" domain, if the stack contains any resources which require creation of a "stack domain user"
For any resources which require a user, we create the user in the "stack domain project", which is associated with the heat stack in the heat database, but is completely separate and unrelated (from an authentication perspective) to the stack owner's project
The users created in the stack domain are still assigned the heat_stack_user role, so as before the API surface they can access is limited via policy.json
When API requests are processed, we do an internal lookup, and allow stack details for a given stack to be retrieved from the database for both the stack owner's project (the default API path to the stack), and also the "stack domain project", subject to the policy.json restrictions.

To clarify that last point, that means there are now two paths which can result in retrieval of the same data via the heat API, e.g for resource-metadata:

GET v1/{stack_owner_project_id}/stacks/{stack_name}/{stack_id}/resources/{resource_name}/metadata

GET v1/{stack_domain_project_id}/stacks/{stack_name}/{stack_id}/resources/{resource_name}/metadata

The stack owner would use the former (e.g via "heat resource-metadata {stack_name} {resource_name}"), and any agents in the instance will use the latter.
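
To make the two paths concrete, here's a small Python sketch which builds both from the same stack details. The path template is the one shown above; the metadata_paths helper and all the ID values are hypothetical placeholders.

```python
# Path template as shown above; only the project ID differs between
# the two callers.
METADATA_PATH = ("v1/{project_id}/stacks/{stack_name}/{stack_id}"
                 "/resources/{resource_name}/metadata")


def metadata_paths(stack_owner_project_id, stack_domain_project_id,
                   stack_name, stack_id, resource_name):
    """Hypothetical helper returning both routes to the same metadata."""
    common = {
        "stack_name": stack_name,
        "stack_id": stack_id,
        "resource_name": resource_name,
    }
    return {
        "stack_owner": METADATA_PATH.format(
            project_id=stack_owner_project_id, **common),
        "in_instance_agent": METADATA_PATH.format(
            project_id=stack_domain_project_id, **common),
    }


# All IDs below are made-up placeholders.
paths = metadata_paths("owner-project", "domain-project",
                       "mystack", "1234", "server")
for caller, path in paths.items():
    print(caller, "GET", path)
```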

This solves all of the problems identified previously:

The stack owner no longer requires admin roles, because the heat_domain_admin user administers stack domain users
There is complete separation: the users created in the stack domain project cannot access any resources other than those explicitly allowed by heat, so any attempt to access other stacks, or any other resource owned by the stack-owner, will fail.
The list of users in the stack-owner project is unaffected, because we've created a completely different project in another domain.

Hopefully that provides a fairly clear picture of the new feature, and how it works - it should be transparent to users but I'm hoping this information may be useful to deployers when adopting the functionality for Icehouse.

The main gap still to be investigated is how we handle situations where keystone is backed by a read-only directory (e.g LDAP). My expectation is that it can be solved via the keystone capability to have different identity drivers per domain, so you could, for example, have domains containing human users backed by LDAP, and the heat domain backed by SQL. My understanding is that there are outstanding issues to be solved for Juno in keystone, but I will post a future update when I've had time to do some testing and figure out what works.

That is all, respect if you managed to read it all! ;)

Heat auth model updates - part 1 Trusts

2014-04-09T09:13:00.000-07:00

Over the last few months I've spent a lot of my time looking at ways to rework the heat auth model, in an attempt to solve two long-standing issues:

Requirement to pass a password when creating a stack which may perform deferred orchestration actions (for example AutoScaling adjustments)
Requirement for users to have administrative roles when creating certain types of resource.

So, fixes to these issues have been happening (in Havana and Icehouse respectively), but discussions with various folks indicate significant confusion re differentiating the two changes, probably because I've not got around to writing up the documentation yet (it's in progress, honest!) ;)

In an attempt to clear up the confusion, and provide some documentation ahead of the upcoming Icehouse Heat release, I'm planning to cover each feature in this and a subsequent post - below is a discussion of the "Requirement to pass a password" problem, and the method used to solve it.

What? Passwords? Don't we pass tokens?

Well, yes mostly we do. However the problem with tokens is they expire, and we have no way of knowing how long a stack may exist for, so we can't store user tokens to do deferred operations after the initial creation of the heat stack (not that it's a good idea from a security perspective either..)

So in previous versions of heat, we've required the user to pass a password (yes, even if they are passing us a token), which we'd then encrypt and store in the heat database, such that we can then obtain a token to act on behalf of the user and do whatever deferred operations are required during the lifetime of the stack. It's not a nice design, but when it was implemented, Trusts did not exist in Keystone so there was no viable alternative. Here's exactly what happens:

User requests stack creation, providing a token and username/password (python-heatclient or Horizon normally requests the token for you)
If the stack contains any resources marked as requiring deferred operations heat will fail validation checks if no username/password is provided
The username/password are encrypted and stored in the heat DB
Stack creation is completed
At some later stage we retrieve the credentials and request another token on behalf of the user, the token is not limited in scope and provides access to all roles of the stack owner.

Clearly this is suboptimal, and is the reason for this strange additional password box in horizon:

You already entered your password, right?!

Happily, after discussions with Adam Young, Trusts were implemented during Grizzly and Heat integrated with the functionality during the Havana cycle. I get the impression not that many people have yet adopted it, so I'm hoping we can move towards making the new trusts based method the default, which has already happened for devstack quite recently.

Keystone Trusts 101

So, in describing the solution to Heat storing passwords, I will be referring to Keystone Trusts, because that is the method used to implement the solution. There's quite a bit of good information out there, including the Keystone Wiki, Adam Young's blog and the API documentation, but here's a quick summary of terminology which should be sufficient to understand how we're using trusts in Heat:

Trusts are a keystone extension which provides a method to enable delegation, and optionally impersonation, via keystone. The key terminology is trustor (the user delegating) and trustee (the user being delegated to).

To create a trust, the trustor (in this case the user creating the heat stack) provides keystone with the following information:

The ID of the trustee (who you want to delegate to, in this case the heat service user)
The roles to be delegated (configurable via the heat configuration file, but it needs to contain whatever roles are required to perform the deferred operations on the user's behalf, e.g. launching a nova instance in response to an AutoScaling event)
Whether to enable impersonation

Keystone then provides a trust_id, which can be consumed by the trustee (and only the trustee) to obtain a trust scoped token. This token is limited in scope such that the trustee has limited access to those roles delegated, along with effective impersonation of the trustor user, if it was selected when creating the trust.
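To make those three inputs a bit more concrete, here is a minimal Python sketch of the kind of request body a trustor might send when creating a trust through Keystone's v3 OS-TRUST extension. The helper name build_trust_request and all of the ID values are hypothetical placeholders. The field names reflect my understanding of the Keystone API rather than anything Heat-specific, and project_id is my own addition (the list above doesn't mention it, but I'm assuming delegated roles are scoped to a project), so please check the details against the Keystone API documentation.

```python
import json


def build_trust_request(trustor_id, trustee_id, project_id, role_names,
                        impersonation=True):
    """Build a JSON-serializable body for creating a trust.

    Hypothetical helper: it only assembles a dict, it does not talk
    to Keystone.
    """
    return {
        "trust": {
            # The user delegating (the stack owner).
            "trustor_user_id": trustor_id,
            # The user being delegated to (the heat service user).
            "trustee_user_id": trustee_id,
            # Assumption: roles are delegated within one project.
            "project_id": project_id,
            # The roles to delegate, referenced by name.
            "roles": [{"name": name} for name in role_names],
            # Whether the trustee may impersonate the trustor.
            "impersonation": impersonation,
        }
    }


body = build_trust_request(
    trustor_id="STACK_OWNER_USER_ID",        # placeholder
    trustee_id="HEAT_SERVICE_USER_ID",       # placeholder
    project_id="STACK_OWNER_PROJECT_ID",     # placeholder
    role_names=["heat_stack_owner"],
)
print(json.dumps(body, indent=2))
```

The response to a request like this would carry the trust_id described above, which is the only thing the trustee needs to keep.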

Phew! Ok so how did you fix it?

Basically we now do the following:

User creates a stack via an API request (only the token is required)
Heat uses the token to create a trust between the stack owner (trustor) and the heat service user (trustee), delegating a special role (or roles) as defined in the trusts_delegated_roles list in the heat configuration file. By default heat sets this to "heat_stack_owner", so this role must exist and the user creating the stack must have this role assigned in the project in which they are creating the stack. Deployers may modify this list to reflect local RBAC policy, e.g. to ensure the heat process can only access those services expected while impersonating a stack owner.
Heat stores the trust id in the heat DB (still encrypted, although in theory it doesn't need to be since it's useless to anyone other than the trustee, i.e. the heat service user)
When a deferred operation is required, Heat retrieves the trust id, and requests a trust scoped token which enables the service user to impersonate the stack owner for the duration of the deferred operation, e.g to launch some nova instances on behalf of the stack owner in response to an AutoScaling event.

The advantages of this approach are hopefully clear, but to clarify:

It's better for users, we no longer require a password and can provide full functionality when provided with just a token (like all other OpenStack services... and we can kill the Horizon password box, yay!)
It's more secure, as we no longer store any credentials or other data which could be used by an attacker - the trust_id can only be consumed by the trustee (the heat service user).
It provides much more granular control of what can be done by heat in deferred operations, e.g if the stack owner has administrative roles, there's no need to delegate them to Heat, just the subset required.

I'd encourage everyone to switch to using this feature. Enabling it is simple: first update your heat.conf file to have the following lines:

deferred_auth_method=trusts
trusts_delegated_roles=heat_stack_owner

Hopefully this will soon become the default from Juno for Heat.

Then ensure all users creating heat stacks have the "heat_stack_owner" role (or whatever roles you want them to delegate to the heat service user based on your local RBAC policies).
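If you want to sanity-check a heat.conf before restarting the services, something like the following Python sketch may help. It is only an illustration: the helper name trusts_settings is hypothetical, I'm assuming both options live in the [DEFAULT] section, and I'm assuming trusts_delegated_roles is a comma-separated list. It parses a sample string with the standard library rather than reading your real file.

```python
import configparser

# Sample heat.conf content. Assumption: both options are in [DEFAULT].
SAMPLE_HEAT_CONF = """
[DEFAULT]
deferred_auth_method=trusts
trusts_delegated_roles=heat_stack_owner
"""


def trusts_settings(conf_text):
    """Return (trusts_enabled, delegated_roles) parsed from heat.conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    method = parser.get("DEFAULT", "deferred_auth_method", fallback="")
    raw_roles = parser.get("DEFAULT", "trusts_delegated_roles", fallback="")
    # Assumption: multiple roles are separated by commas.
    roles = [role.strip() for role in raw_roles.split(",") if role.strip()]
    return method == "trusts", roles


enabled, roles = trusts_settings(SAMPLE_HEAT_CONF)
print("trusts enabled:", enabled)
print("delegated roles:", roles)
```

Every role it reports needs to exist in keystone and be assigned to the users creating stacks, as described above.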

That is all, more coming soon on "stack domain users" which is new for Icehouse and resolves the second problem mentioned at the start of this post! :)

2013-10-02T03:51:00.000-07:00

Heat Providers/Environments 101

I've recently been experimenting with some cool new features we've added to Heat over the Havana cycle, testing things out in preparation for the Havana Release.

One potentially very powerful new abstraction is the Provider Resource method of defining nested stack resources. Combined with the new environments capability to map template resource names to non-default implementations, it provides a very flexible way for both users and those deploying Heat to define custom resources based on Heat templates.

Firstly let me clarify what nested stacks are, following on from my previous post, and why we decided to provide this native interface to the functionality, rather than something similar to the existing Cloudformation compatible resource interface.

So nested stack resources enable you to specify a URL of another heat template, which will be used to create another Heat stack, owned by the stack which defines the AWS::CloudFormation::Stack. This provides a way to implement composed Heat templates, and to reuse logically related template snippets.

The interface looks like this (using the new native HOT template syntax):

resources:
  nested:
    type: AWS::CloudFormation::Stack
    properties:
      TemplateURL: http://somewhere/something.yaml

There are several disadvantages to this interface:

Hard-coded URLs in your template
You have to have the nested templates accessible via a web-server somewhere
Hard to transparently substitute different versions of the nested implementation (without sedding! ;)
Passing Parameters is kind-of awkward (pass a nested map of parameter values)
Does not provide a way to define resources based on stack templates.

So we noticed something: the interface to stack templates is essentially very similar to the interface to resources. Stack parameters map to resource properties, and stack outputs map to resource attributes. Provider resources leverage this symmetry to provide a more flexible (and arguably easier to use) native interface to nested stack resources.

Ok, so how does it work? Probably the easiest explanation is a simple worked example.

Say you define a simple template, like this example in our heat-templates repo, and you want to reuse it as a nested stack via the Providers functionality. You simply need to do this:

Define an environment

The environment is used to map resource names to templates, and optionally can be used to define common stack parameters, so that they don't need to be passed every time you create a stack.

A user may pass an environment when creating (or updating) a stack, and a global environment may also be specified by the deployer (default location is /etc/heat/environment.d), using the same syntax. The environment may override existing resources.

The resource_registry section is used to map resource names to template files:

resource_registry:
  My::WP::Server: https://raw.github.com/openstack/heat-templates/master/hot/F18/WordPress_Native.yaml

Or you can refer to a local file (local to the user running python-heatclient, which reads the file and attaches the content to the Heat API call creating the stack in the "files" parameter of the request):

resource_registry:
  My::WP::Server: file:///home/shardy/git/heat-templates/hot/F18/WordPress_Native.yaml

There are also some other possibilities, for example aliasing one resource name to another, which are described in our documentation.

Create Stack

Now you can simply create a stack template which references My::WP::Server:

# cat minimal_test.yaml
heat_template_version: 2013-05-23
description: >
  Heat WordPress template, demonstrating Provider Resource.
parameters:
  user_key:
    type: string
    description: Name of a KeyPair to enable SSH access to the instance
resources:
  wordpress_instance:
    type: My::WP::Server
    properties:
      key_name: {get_param: user_key}

With an environment file:

# cat env_minimal.yaml
resource_registry:
  My::WP::Server: file:///home/shardy/git/heat-templates/hot/F18/WordPress_Native.yaml

And then create the stack:

# heat stack-create test_stack1 --template-file=./minimal_test.yaml --environment-file=./env_minimal.yaml --parameters="user_key=${USER}_key"

Optionally you could also specify the key via the environment too:

# cat env_key.yaml
parameters:
  user_key: userkey
resource_registry:
  My::WP::Server: file:///home/shardy/git/heat-templates/hot/F18/WordPress_Native.yaml

heat stack-create test_stack2 --template-file=./minimal_test.yaml --environment-file=./env_key.yaml

This would create the nested template with a key_name parameter of "userkey"
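To illustrate the idea (not Heat's actual implementation), here is a small Python toy model of what happens conceptually: the resource type is looked up in the resource_registry, and the resource's properties become the nested stack's parameters. The function names resolve_value and plan_provider_resources are hypothetical, the template and environment are hand-written dicts mirroring the YAML files above, and only the get_param function is handled.

```python
def resolve_value(value, stack_params):
    """Resolve a {get_param: name} reference; return other values unchanged."""
    if isinstance(value, dict) and set(value) == {"get_param"}:
        return stack_params[value["get_param"]]
    return value


def plan_provider_resources(template, environment):
    """Return {resource_name: (template_url, nested_parameters)}.

    Only resources whose type appears in resource_registry are included.
    """
    registry = environment.get("resource_registry", {})
    stack_params = environment.get("parameters", {})
    plan = {}
    for name, resource in template["resources"].items():
        url = registry.get(resource["type"])
        if url is None:
            continue  # not a provider resource in this toy model
        # Resource properties map onto nested stack parameters.
        nested_params = {
            prop: resolve_value(val, stack_params)
            for prop, val in resource.get("properties", {}).items()
        }
        plan[name] = (url, nested_params)
    return plan


# Dict equivalents of minimal_test.yaml and env_key.yaml.
template = {
    "resources": {
        "wordpress_instance": {
            "type": "My::WP::Server",
            "properties": {"key_name": {"get_param": "user_key"}},
        }
    }
}
environment = {
    "parameters": {"user_key": "userkey"},
    "resource_registry": {
        "My::WP::Server": "file:///home/shardy/git/heat-templates/"
                          "hot/F18/WordPress_Native.yaml",
    },
}

plan = plan_provider_resources(template, environment)
print(plan)
```

Swapping in a different implementation of My::WP::Server then only means changing the registry entry, with no edits to the template itself.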

So hopefully that provides an overview of this new feature. For more info please see our documentation. We're also planning to add some Provider/Environment examples to our example template repository soon.

Finally, kudos to Angus Salkeld, who implemented the majority of this functionality, thanks Angus! :)

2013-08-05T10:34:00.002-07:00

Heat Nested Resource Introspection

The following topic has come up a couple of times lately on IRC, so I thought I'd put down some details describing $subject, in a more permanent place :)

Nested Stack Resources, Primer/Overview

So, Heat has a really powerful feature, which is the ability to nest stack definitions, such that one top-level stack definition may recursively define one or more nested stacks.

There are two ways to define a nested stack:

Explicitly reference a nested stack template in the parent template (via our implementation of the AWS::CloudFormation::Stack resource type, see this example template)
Create a new resource type, which internally defines a nested stack (an example of this is our simple loadbalancer resource, our implementation of the AWS::ElasticLoadBalancing::LoadBalancer resource)

There is actually a third way (Provider templates), but that's a bleeding-edge feature so I'm not considering it in this post.

In both cases, what Heat creates internally is a real stack, referenced by the parent stack via a unique ID. Since the Heat API allows you to request details for a specific stack using a stack UUID, that means you can use the heat API introspection operations to access information about the nested stack in the exact same way as you do for the top level stack.

Worked Example

If I create a stack, I can use various introspection operations, either via the Heat ReST API, or more conveniently via the "heat" CLI tool provided by python-heatclient (which uses the Heat ReST API):

> heat list
+--------------------------------------+------------+---------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        |
+--------------------------------------+------------+---------------+----------------------+
| faaca636-ed2f-44d9-b228-909c35b37215 | as123      | CREATE_FAILED | 2013-08-05T09:29:49Z |
+--------------------------------------+------------+---------------+----------------------+

I can use the heat introspection operations using either the stack_name (which Heat requires to be unique per tenant) or the unique id interchangeably:

E.g

heat stack-show as123

provides the exact same information as

heat stack-show faaca636-ed2f-44d9-b228-909c35b37215

If the stack contains a resource based on a nested stack (or a directly defined nested stack), we can look up the stack ID like this:

heat resource-list as123
+--------------------------+-----------------------------------------+-----------------+-------------
| logical_resource_id | resource_type | resource_status | updated_time
+--------------------------+-----------------------------------------+-----------------+-------------
+--------------------------+-----------------------------------------+-----------------+--------

Here we can see we have a resource "ElasticLoadBalancer", which is of type "AWS::ElasticLoadBalancing::LoadBalancer", which as I mentioned earlier is defined internally via a nested stack, whose ID we can access via the heat resource-show option, which gives details of the specified resource:

> heat resource-show as123 ElasticLoadBalancer
+------------------------+--------------------------------
| Property | Value
+------------------------+--------------------------------
| description | |
| links | http://localhost:8004/v1/1938f0707fe04b58b0053040d4a0fe06/stacks/as123/faaca636-ed2f-44d9-b228-909c35b37215/resources/ElasticLoadBalancer |
| | http://localhost:8004/v1/1938f0707fe04b58b0053040d4a0fe06/stacks/as123/faaca636-ed2f-44d9-b228-909c35b37215 |
| logical_resource_id | ElasticLoadBalancer |
| physical_resource_id | 60a1ee88-61fe-4bfb-a020-5837e35a42c9 |
| required_by | WebServerGroup |
| resource_status | CREATE_FAILED |
| resource_status_reason | Error: Resource create failed: WaitConditionTimeout: 0 of 1 received |
| resource_type | AWS::ElasticLoadBalancing::LoadBalancer |
| updated_time | 2013-08-05T09:40:49Z |
+------------------------+--------------------------------

Aha! physical_resource_id provides a UUID for the resource, which just so happens to be the UUID of the underlying nested stack ;)

So you can use that UUID to do introspection operations on the nested stack, e.g:

heat resource-list 60a1ee88-61fe-4bfb-a020-5837e35a42c9
+---------------------+------------------------------------------+-----------------+----------------------+
+---------------------+------------------------------------------+-----------------+----------------------+
+---------------------+------------------------------------------+-----------------+----------------------+

So we can see that the reason the stack failed was the nested stack WaitCondition resource failed (which we already knew from the top level status string, but hopefully you get the point ;)
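If you find yourself doing this drill-down repeatedly, the walk from parent to nested stack is easy to automate. The Python sketch below shows the idea using an in-memory dict as a stand-in for the output of "heat resource-list" on each stack, so it does not call the Heat API at all. The function name find_failed is hypothetical, and so are the nested resource name "WaitCondition", its status and its physical_resource_id; only the ElasticLoadBalancer values are taken from the output above.

```python
TOP_STACK_ID = "faaca636-ed2f-44d9-b228-909c35b37215"
NESTED_STACK_ID = "60a1ee88-61fe-4bfb-a020-5837e35a42c9"

# Stand-in for "heat resource-list <stack id>" results, keyed by stack id.
STACKS = {
    TOP_STACK_ID: [
        {
            "name": "ElasticLoadBalancer",
            "status": "CREATE_FAILED",
            # For nested stack resources this is the nested stack's UUID.
            "physical_resource_id": NESTED_STACK_ID,
        },
    ],
    NESTED_STACK_ID: [
        {
            "name": "WaitCondition",              # hypothetical name
            "status": "CREATE_FAILED",            # hypothetical status
            "physical_resource_id": "not-a-stack",  # hypothetical id
        },
    ],
}


def find_failed(stack_id, stacks, path=()):
    """Yield paths of failed leaf resources, descending into nested stacks."""
    for res in stacks[stack_id]:
        here = path + (res["name"],)
        nested_id = res["physical_resource_id"]
        if nested_id in stacks:
            # The physical_resource_id is a nested stack UUID: recurse.
            yield from find_failed(nested_id, stacks, here)
        elif res["status"].endswith("_FAILED"):
            yield here


failed_paths = list(find_failed(TOP_STACK_ID, STACKS))
for failed_path in failed_paths:
    print(" -> ".join(failed_path))
```

In a real script the STACKS lookup would be replaced by calls to the Heat API for each stack ID.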

2013-07-29T04:38:00.000-07:00

Roadmap for Heat Havana (part 2)

So with havana2 workload and holidays delaying this follow-up post it's probably a bit late to really call this a roadmap, but what follows is a status update and some further details on what we're working on delivering (or have delivered) for Heat's Havana cycle:

Ceilometer Integration

Some great work has been going on adding alarming features to ceilometer, and recently some patches have been landing integrating Heat with this alarming capability. This should allow us to move away from maintaining a metric store and alarming functionality inside heat, which will provide many benefits:

Align with one openstack metric/alarm solution
Some alarms can use existing hypervisor-level metrics instead of in-instance agent
Allow extensible alarm resources via Provider templates
Removal of heat-engine periodic evaluation tasks (which will allow easier engine scale-out)

Heat (Grizzly) metric collection mechanism

The diagram above illustrates how the metric collection works in grizzly heat - all metric data is collected via a "cfn-push-stats" agent (typically via a cron job defined in the stack template), which requires credentials (a keystone ec2-keypair) to be deployed inside the instance. The metric data is stored in the heat-engine database, and a periodic task evaluates the currently stored data against the alarm thresholds defined in the template. All in all, a crude (but simple) mechanism which has proven sufficient for initial Heat development purposes in the absence of ceilometer metric/alarm functionality.

The Havana Heat metric collection mechanism will look different, introducing a dependency on the ceilometer service, which can provide access to the hypervisor level statistics, avoiding the in-instance aspect of the method described above for many metric types:

Heat (Havana) metric collection/alarms via Ceilometer

We are also planning to support a compatibility mode (probably for one release cycle) which will allow existing templates using cfn-push-stats to work with the new Ceilometer based alarm mechanism:

This should allow existing users of the Heat metric/alarm features time to migrate to the new metric collection method, and also give us time to work out if a Ceilometer tool or agent will be developed which can replace cfn-push-stats (or if cfn-push-stats can be reworked to direct metric data to a Ceilometer API equivalent of PutMetricData). The exact way forward here is still under discussion.

Keystone Trusts Integration

Work is in progress to integrate with the Keystone explicit impersonation "Trusts" feature which was added as a v3 API extension for grizzly. The initial focus will be to remove the requirement to store encrypted credentials in the Heat DB (which are used for post-create stack actions, for example AutoScaling adjustments). Instead we will create a trust token with the minimum possible roles to perform these actions.

A second thread of this work is to provide an alternative to creating actual keystone users related to the User, AccessKey and WaitConditionHandle resources. Because these resources depend on creating an ec2-keypair, we need a way to create a keypair from a trust token, which has been proposed as a new keystone feature, but not yet implemented. As such it's not yet clear if we'll be able to complete this second step in the Havana time-frame, but we're looking into it! :)

HOT Progress

Work has been progressing well in delivering the abstractions related to the new HOT DSL, in particular the work related to Provider resources and Environments is now largely complete, the initial "hello world" HOT parser implementation has been completed, and work is under-way completing the various additional blueprints required to enable more complex templates to be expressed. It's a huge piece of work, but all those involved are doing a great job pushing things in the right direction.

And Much More...

There is much more that I've not covered here (more stack update improvements, more neutron fixes and functionality, heat standalone mode, converting InstanceGroups to nested stacks, event persistence, to name a few), but that's all I have time for today - hopefully the info above provides some useful context and detail!

2013-06-20T11:59:00.000-07:00

Roadmap for Heat Havana (part 1)

It's been quite a while now since the design summit in Portland, and I've been meaning to write some details of the features we discussed at the summit, and in particular those which have appeared now on our plan for Heat's havana development cycle.

What follows are some highlights of what we're working on, or expect to be working on over the next weeks/months. However I'll start with the disclaimer that this plan is a moving target, particularly since we're seeing an increasing number of new contributors whose planned contributions may not yet be captured on the plan, so please keep an eye on the plan in Launchpad for the latest details on what we're aiming to deliver - in other words, this may all change, but here-goes anyway! ;)

Concurrent Resource Scheduling

Inside heat, we create a dependency graph of resources, which we use to determine ordering of all operations (for example create, update, delete) within a stack. For grizzly, the order in which these actions happen is determined via topological sorting followed by performing each action in series.

Clearly this is far from ideal when you have large numbers of non-dependent operations (for example creating a large number of grouped instances), so work has been under-way to improve this situation and perform stack operations in parallel where possible. The initial focus of this work is resource creation, but the plan is to eventually perform as many stack operations as possible concurrently, making use of the new task scheduler that has been developed, which uses coroutine based task scheduling.
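The difference between the two approaches can be sketched in a few lines of Python. To be clear, this is not Heat's scheduler: Heat's new task scheduler is coroutine based, whereas this illustration uses the standard library's graphlib module (Python 3.9+) and only computes which resources could be handled together, without running anything concurrently. The resource names and the function names serial_order and parallel_batches are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each resource maps to the resources it depends on.
DEPENDENCIES = {
    "network": set(),
    "volume": set(),
    "server_a": {"network"},
    "server_b": {"network"},
    "attachment": {"server_a", "volume"},
}


def serial_order(deps):
    """One valid create order, one resource at a time (the serial approach)."""
    return list(TopologicalSorter(deps).static_order())


def parallel_batches(deps):
    """Group resources into batches whose dependencies are all satisfied.

    Everything within one batch could, in principle, be created in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        # Mark the whole batch as finished so dependents become ready.
        sorter.done(*ready)
    return batches


order = serial_order(DEPENDENCIES)
batches = parallel_batches(DEPENDENCIES)
print("serial:", order)
print("batches:", batches)
```

With five resources the serial approach takes five steps, while the batched view needs only three rounds, and the gap grows quickly for stacks with many independent instances.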

Related to this work, we've been discussing ideas with the wider community around requirements for workflow and task scheduling in Heat, so that hopefully we can figure out a way to use a common solution across projects with these sorts of requirements.

Stack Suspend/Resume

This feature is aimed at allowing coordinated suspend/resume of a group of resources (including nested resources), such that you can take either an entire stack, or individual resources offline and then resume them (quickly) at some later time.

The idea is to provide access to some of the underlying capabilities provided by the nova admin actions API, but to use the dependency information we have in Heat to do things in the correct order with respect to the stack definition.

We will also handle non-instance suspend operations, for example disabling defined alarms, such that suspend/resume can be performed in a non-destructive way.

We may also provide access to other actions in future, so considerable effort has gone into refactoring such that this should be possible with less effort and duplication.

Native Template Language (Heat Orchestration Template aka "HOT")

So, this ended up being *the* hot topic at the design summit in Portland, we got the message, loud and clear, that there are a lot of users, and potential users, who would like to see an openstack-native (non CFN-compatible) template language develop.

There are two threads to this work - firstly defining the missing logical abstractions (ie what cannot be adequately expressed via the current heat logical model), and secondly the syntax itself. Most of these efforts are captured as dependencies of this umbrella blueprint, and there is the syntax specific "HOT hello world" effort.

This is a large and complex piece of work, and the progress made so far has been good. In particular, aspects of the Environments and Providers abstractions have recently been landing, which will enable the future work to progress. (I'll post again next week with further details on these aspects)

To be continued...

That's all I have time for today, but hopefully provides a taste of what we've got in the pipeline for havana - there's more, much more (ceilometer integration, keystone trusts, native resource types, engine scale-out, concurrent updates, rolling updates, etc etc!), but I'll have to cover those another time! :)

Steve Hardy

TripleO Containerized deployments, debugging basics

Containerized deployments, debugging basics

Config generation debugging overview

Runtime debugging, paunch 101

Containerized services, logging

Debugging containers directly

Debugging TripleO revisited - Heat, Ansible & Puppet

The TripleO deploy workflow, overview

Provisioning of the nodes

Host preparation

Service deployment, step-by-step configuration

Debugging first steps - what failed?

Debugging via Ansible directly

Debugging via Puppet directly

OpenStack Days UK

OpenStack Days UK

OpenStack Summit - TripleO Project Onboarding

Developing Mistral workflows for TripleO

Mistral workflows and actions

Mistral workflows, in detail

Mistral workflows, create your own!

TripleO composable/custom roles

Fully Composable/Custom Roles

Usage examples

Nice, so how does it work?

foo.j2.yaml

foo.role.j2.yaml

Debugging/Development tips

Limitations/future work

Complex data transformations with nested Heat intrinsic functions

The requirement

The implementation, step by step

Dynamically generate an initial mapping for all enabled services

Substitute placeholder for the actual network/IP

Filter any values we don't want

Implementation, completed

TripleO Deploy Artifacts (and puppet development workflow)

Ok, how do I use it?

Example 1 - Deploy Artifacts "Hello World"

Example 2 - Puppet development workflow

So how does it work?

TripleO Composable Services 101

Why Composable Services?

So, how does it work?

Ok, how do I use it?

Scenario 1 - All in one minimal deployment

Scenario 2 - "hyperconverged" ceph deployment

Future Work

TripleO partial stack updates

Partial update?! Why?

Step 1 - Find the nested stack to update

Step 2 - Basic update of the stack

Step 3 - Update of the stack with modifications

TripleO Heat templates Part 3 - Cluster configuration, introduction/primer

Recap - the deployed group of servers

Group/Cluster configuration with Heat

Debugging groups of deployments

Until next time..!

TripleO Heat templates Part 2 - Node initial deployment & config

Recap - the controller group definition

Initial deployment flow, step by step

Phew, is that all?

TripleO Heat templates Part 1 - Roles and Groups

Overcloud "Roles"

Roles & resource types

Roles & ResourceGroups

That's all, folks

Heat SoftwareConfig resources - primer/overview.

Heat SoftwareConfig resources

SoftwareDeployment flow

SoftwareDeployment HOT template definition

1. Define the SoftwareConfig resource

2. Define the Server resource

3. Define the SoftwareDeployment resource

Dealing with dependencies

Conclusion and further resources

Debugging TripleO Heat templates

Step 1: Create the Stack