In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments. I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.
In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.
We'll start by looking at the deploy workflow as a whole and some heat interfaces for diagnosing the nature of the failure, then we'll look at how to debug directly via Ansible and Puppet. In a future post I'll also cover the basics of debugging containerized deployments.
The TripleO deploy workflow, overview
A typical TripleO deployment consists of several discrete phases, which are run in order:
Provisioning of the nodes
- A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud)
- Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
- Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
- Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
- The nodes are provisioned by Ironic
This first phase is the provisioning workflow. It is complete once the nodes are reported ACTIVE by nova (i.e. the nodes are provisioned with an OS and running).
Host preparation
The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity):
- The node networking is configured, via the os-net-config tool
- We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
- We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)
Service deployment, step-by-step configuration
The final step is to deploy the services, either on the baremetal host or in containers. This consists of several tasks run in a specific order:
- We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
- We run "docker-puppet.py" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
- We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
- We run docker-puppet.py again (with a different configuration, only on one node the "bootstrap host"), this does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.
Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "docker-puppet.py" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).
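The ordering described above can be sketched in a few lines of Python. This is only an illustration of the sequence as described in this post, not code taken from TripleO; the function name and the task descriptions are made up for the example.

```python
from typing import List


def deployment_task_order(max_step: int = 5) -> List[str]:
    """Return the per-step task order described above (illustrative only)."""
    tasks: List[str] = []
    for step in range(1, max_step + 1):
        # Puppet runs on the baremetal host at every step.
        tasks.append(f"step {step}: puppet host configuration")
        # Config generation happens once, at step 1, for all services.
        if step == 1:
            tasks.append(f"step {step}: docker-puppet.py config generation")
        # Containers enabled for this step are started via paunch.
        tasks.append(f"step {step}: paunch starts containers")
        # Puppet-driven bootstrap tasks run on the bootstrap host only.
        tasks.append(f"step {step}: docker-puppet.py bootstrap tasks")
    return tasks


if __name__ == "__main__":
    for task in deployment_task_order():
        print(task)
```

Knowing where a failing step sits in this sequence helps narrow down which tool (puppet on the host, docker-puppet.py, or paunch) to look at first.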
Below is a diagram which illustrates this step-by-step deployment workflow:
[Figure: TripleO Service configuration workflow]
The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.
Debugging first steps - what failed?
Heat Stack create failed.
Ok something failed during your TripleO deployment, it happens to all of us sometimes! The next step is to understand the root-cause.
My starting point after this is always to run:
openstack stack failures list --long <stackname>
(undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    PLAY [localhost] ***************************************************************
    ...
    TASK [Run puppet host configuration for step 1] ********************************
    ok: [localhost]
    TASK [debug] *******************************************************************
    fatal: [localhost]: FAILED! => {
        "changed": false,
        "failed_when_result": true,
        "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
            "Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",
            "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
        ]
    }
    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry
    PLAY RECAP *********************************************************************
    localhost : ok=18 changed=12 unreachable=0 failed=1
We can tell several things from the output (which has been edited above for brevity). Firstly, the name of the failing resource:
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0
- The error was on one of the Controllers (ControllerDeployment)
- The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
- The failure was during the first step (Step1.0)
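If you find yourself reading these names often, the same breakdown can be done mechanically. The following is a minimal sketch that assumes the name always has the shape seen in the output above (stack, phase, then role plus "Deployment_Step" and a number); the class and function names are hypothetical, and treating the trailing number as the index of the node within the role is my assumption, based on the "resources[0]" in the status_reason.

```python
import re
from typing import NamedTuple


class FailedResource(NamedTuple):
    stack: str
    phase: str
    role: str
    step: int
    index: int  # assumed: position of the node within the role's group


# Matches e.g. overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0
_RESOURCE_RE = re.compile(
    r"(?P<stack>[^.]+)\.(?P<phase>[^.]+)\."
    r"(?P<role>\w+?)Deployment_Step(?P<step>\d+)\.(?P<index>\d+)"
)


def parse_failed_resource(name: str) -> FailedResource:
    """Split a failing deploy-step resource name into its parts."""
    match = _RESOURCE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised resource name: {name!r}")
    return FailedResource(
        stack=match["stack"],
        phase=match["phase"],
        role=match["role"],
        step=int(match["step"]),
        index=int(match["index"]),
    )


if __name__ == "__main__":
    print(parse_failed_resource(
        "overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0"
    ))
```

Names from other phases of the deployment will not match this pattern, in which case the function raises ValueError rather than guessing.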
With a little more digging we can see which node exactly this failure relates to, e.g we copy the SoftwareDeployment ID from the output above, then run:
(undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
29b3c254-5270-42ae-8150-9fc3f67d3d89
(undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
| 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | oooq_control |
Ok so puppet failed while running via ansible on overcloud-controller-0.
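When the output is long, it can help to pull out just the puppet errors and where they point. Here is a rough sketch that assumes the error lines look like the one in the output above ("Error: ... at <file>:<line>:<column> on node <name>"); other puppet error formats will be returned without a location. The function name is hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional, Union

# Shape assumed from the error shown above; not a full puppet log grammar.
_LOCATED_ERROR_RE = re.compile(
    r"^Error: (?P<message>.*) at (?P<path>/\S+):(?P<line>\d+):(?P<column>\d+)"
    r" on node (?P<node>\S+)$"
)

ErrorInfo = Dict[str, Optional[Union[str, int]]]


def find_puppet_errors(lines: Iterable[str]) -> List[ErrorInfo]:
    """Return one dict per line starting with 'Error:'."""
    errors: List[ErrorInfo] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("Error:"):
            continue
        match = _LOCATED_ERROR_RE.match(line)
        if match:
            errors.append({
                "message": match["message"],
                "path": match["path"],
                "line": int(match["line"]),
                "column": int(match["column"]),
                "node": match["node"],
            })
        else:
            # Keep errors without a recognisable location as well.
            errors.append({
                "message": line[len("Error:"):].strip(),
                "path": None, "line": None, "column": None, "node": None,
            })
    return errors
```

Fed the stdout lines from the failure above, this would report docker.pp, line 181, on overcloud-controller-0.localdomain, which is where we end up looking later in this post.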
Debugging via Ansible directly
Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.
Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.
(undercloud) [stack@undercloud ~]$ openstack overcloud config download
The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
common_deploy_steps_tasks.yaml    external_post_deploy_steps_tasks.yaml  templates
Compute                           global_vars.yaml                       update_steps_playbook.yaml
Controller                        group_vars                             update_steps_tasks.yaml
deploy_steps_playbook.yaml        post_upgrade_steps_playbook.yaml       upgrade_steps_playbook.yaml
external_deploy_steps_tasks.yaml  post_upgrade_steps_tasks.yaml          upgrade_steps_tasks.yaml
Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps. This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).
We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:
(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
...
TASK [Run puppet host configuration for step 1] ********************************************************************
ok: [192.168.24.6]
TASK [debug] *******************************************************************************************************
fatal: [192.168.24.6]: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
        "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
        "exception: connect failed",
        "Warning: Undefined variable '::deploy_config_name'; ",
        " (file & line not available)",
        "Warning: Undefined variable 'deploy_config_name'; ",
        "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
    ]
}
NO MORE HOSTS LEFT *************************************************************************************************
to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry
PLAY RECAP *********************************************************************************************************
192.168.24.6 : ok=56 changed=2 unreachable=0 failed=1
Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node. We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).
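If you re-run the playbook a lot while debugging, a small helper to assemble the command line can save some typing. This is just a sketch: the function name is hypothetical, it only builds the argument list using the standard ansible-playbook options (-i, --limit, --tags, -v), and any tag names you pass are up to you (check tripleo-heat-templates for what is actually supported).

```python
from typing import List, Optional, Sequence


def ansible_playbook_command(
    playbook: str,
    inventory: str,
    limit: Optional[str] = None,
    tags: Optional[Sequence[str]] = None,
    verbosity: int = 0,
) -> List[str]:
    """Build an ansible-playbook argument list without running it."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        # Restrict the run to one host (or host pattern).
        cmd += ["--limit", limit]
    if tags:
        # Only run tasks carrying one of these tags.
        cmd += ["--tags", ",".join(tags)]
    if verbosity > 0:
        # -v, -vv, -vvv ... for increasing verbosity.
        cmd.append("-" + "v" * verbosity)
    return cmd


if __name__ == "__main__":
    print(" ".join(ansible_playbook_command(
        "deploy_steps_playbook.yaml",
        "/usr/bin/tripleo-ansible-inventory",
        limit="overcloud-controller-0",
        verbosity=2,
    )))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the config download directory, or simply printed and pasted into the shell.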
If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.
Debugging via Puppet directly
Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet).
First we log on to the node, and look at the files in the /var/lib/tripleo-config directory.
(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@192.168.24.6
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Last login: Fri Feb 9 14:30:02 2018 from gateway
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
[heat-admin@overcloud-controller-0 tripleo-config]$ ls
docker-container-startup-config-step_1.json  docker-container-startup-config-step_4.json  puppet_step_config.pp
docker-container-startup-config-step_2.json  docker-container-startup-config-step_5.json
docker-container-startup-config-step_3.json  docker-container-startup-config-step_6.json
The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host.
We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value, this will be at the same value as the failing step, but it can also be useful sometimes to manually modify this for development testing of different steps for a particular service.
[root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
1
[root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json
{"step": 1}[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
...
Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain
Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181. I look at the file, fix the problem (ugeas should be augeas), then re-run puppet apply to confirm the fix.
Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.
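As mentioned earlier, it can sometimes be useful to change the step value for development testing. One way to do that, assuming the step is read from the config_step.json hieradata file shown above and that editing it by hand is acceptable on your test node, is a small helper like the following. The function names are hypothetical and this is only a sketch; a later deployment run will set the value again.

```python
import json
from pathlib import Path
from typing import Union


def with_step(config_text: str, step: int) -> str:
    """Return the JSON text with its "step" key set to the given value."""
    data = json.loads(config_text)
    data["step"] = step
    return json.dumps(data)


def set_config_step(path: Union[str, Path], step: int) -> None:
    """Rewrite the step value in a hieradata JSON file in place."""
    config_path = Path(path)
    config_path.write_text(with_step(config_path.read_text(), step))


if __name__ == "__main__":
    # Prints the modified text only; nothing on disk is touched here.
    print(with_step('{"step": 1}', 3))
```

On the node you would call set_config_step("/etc/puppet/hieradata/config_step.json", 3) as root, confirm the new value with the hiera command shown earlier, and then re-run puppet apply.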
That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.