Upgrading an OpenStack cloud has become a challenging task, which requires choosing the right approach, careful planning and precise execution to minimize the downtime of the cloud environment. Because of such complexity, cloud operators prefer to skip one or more releases before doing an upgrade.
In this OpenStack Tutorial we discuss different aspects of OpenStack upgrades, identify the major pitfalls when upgrading OpenStack and provide solutions and best practices to avoid these pitfalls.
Update vs. Upgrade
First of all, we should draw a strict distinction between updating and upgrading OpenStack. In this OpenStack tutorial, updating means applying bug fixes and security fixes to the OpenStack components and the underlying operating system. Such fixes are usually considered safe for in-place updates, because they do not introduce new functionality and are therefore unlikely to cause regressions.
Upgrading, by contrast, means moving to a new stable OpenStack release. An OpenStack cloud consists of a number of distributed software components that collaborate with each other to deliver the required cloud services. At first glance, such components, including operating system dependencies, must all be upgraded at the same time, which makes the upgrade even more complex. The good news is that the OpenStack community aims to keep the component APIs compatible, so an old API version is usually kept and supported for some time. However, the old API can be marked as deprecated and removed from newer releases.
Planning an OpenStack upgrade
There are several important steps we recommend for planning any OpenStack upgrade:
- Read the OpenStack release notes thoroughly to identify potential incompatibilities between releases.
- Choose the proper method for OpenStack upgrade (see below).
- Prepare a plan to roll back a failed upgrade.
- Prepare a plan for data backups, at minimum, with backups of configuration files and databases.
- Determine the acceptable downtime for the cloud, as defined by the SLAs for specific services. If any downtime or data loss is expected, notify your users about the service interruption in advance.
- Test the upgrade method using a test environment similar to the production one.
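The configuration backup from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a complete backup tool: the function name `backup_config_dirs` is hypothetical, and the directories you pass in (for example /etc/nova or /etc/keystone on a typical package-based install) are assumptions about your layout.

```python
import tarfile
import time
from pathlib import Path


def backup_config_dirs(config_dirs, backup_dir):
    """Archive each existing directory into one timestamped .tar.gz file.

    Directories that do not exist are skipped silently.
    Returns the path of the archive that was written.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = backup_dir / f"openstack-config-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        for directory in map(Path, config_dirs):
            if directory.is_dir():
                # Stored under the directory's own name, e.g. "nova/nova.conf".
                archive.add(directory, arcname=directory.name)
    return archive_path
```

Keep the backup directory outside the directories being archived, and store the resulting archive away from the node that is being upgraded.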
Methods for OpenStack upgrade
- Parallel cloud: Deploy a separate OpenStack cloud and migrate all the resources from the old cloud to the upgraded one. This is the simplest and least intrusive method. Also it has the simplest rollback procedure. However, it requires extensive hardware resources and leads to lengthy downtime.
- Rolling upgrade: Upgrade each component on each server one by one until the whole cloud runs the new release. There are two variants:
  - In-place upgrade: This method requires shutting down each service for the upgrade, which causes some downtime, though less than the parallel cloud method.
  - Side-by-side upgrade: Since OpenStack Icehouse, the controllers are decoupled from the compute nodes, so you can upgrade them independently. With this method, you deploy an upgraded controller, transfer all the data from the old controller to the new one, and seamlessly replace the old controller with the new one. The old controller is left untouched, so a rollback should be simple. To achieve zero downtime, you should have more than one controller in HA mode.
Upgrade pitfalls and solutions
Manual upgrades are prone to failure
Upgrades commonly fail when many repetitive tasks must be completed by hand. Your cloud consists of many nodes, and each node runs a number of services. The services on each node collaborate with other services, and because of this complexity manual upgrades are not an option.
Solution: Use automation for the upgrade. There are many configuration management tools you can use, such as Ansible, Chef and Puppet.
Upgrade of the production cloud can fail
By nature, an OpenStack cloud contains custom settings, and the standard upgrade procedure usually does not honor the custom settings in the configuration files. You should assume that the upgrade of the cloud will fail, so you need to verify the upgrade on a test cloud that is similar to the production one. The test cloud can be smaller than the production one, but it should have the same architecture and configuration.
It is very important to have proper automation implemented in your organization. Both the deployment (for the old release) and the upgrade procedures should be automated, and both should be under configuration management control. You should be able to trace each custom setting back to the original requirement. Before upgrading the production cloud, verify the upgrade procedure and the corresponding automation with the following standard approach:
1. Deploy a test cloud using the same automation scripts that you used to deploy the production cloud.
2. Apply the upgrade scripts to the test cloud.
3. If the upgrade fails, make the necessary fixes to the upgrade scripts and repeat the procedure from step 1.
4. If the upgrade completes successfully, verify the test cloud.
5. If the verification fails, make the necessary fixes to the upgrade scripts and repeat the procedure from step 1.
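The procedure above is a loop, which can be sketched as follows. This is only an outline under stated assumptions: `rehearse_upgrade` and its four callables are hypothetical hooks into your own automation, and the attempt limit is an arbitrary safeguard added for the sketch.

```python
def rehearse_upgrade(deploy, upgrade, verify, fix_scripts, max_attempts=5):
    """Repeat deploy -> upgrade -> verify on a test cloud until it passes.

    All four callables are hypothetical hooks into your own automation.
    `upgrade` and `verify` must return True on success.
    Returns the number of the attempt that succeeded, or None if none did.
    """
    for attempt in range(1, max_attempts + 1):
        deploy()                      # fresh test cloud from the production scripts
        if not upgrade():             # apply the upgrade scripts
            fix_scripts("upgrade")    # upgrade failed: fix, then start over
            continue
        if verify():                  # verify the upgraded test cloud
            return attempt
        fix_scripts("verification")   # verification failed: fix, then start over
    return None
```

Redeploying from scratch on every attempt is deliberate: it ensures that the upgrade scripts are always tested against a cloud that matches what the production deployment scripts produce.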
You can use OpenStack Rally for automated cloud verification. Rally verification scenarios may include the standard ones and custom scenarios, which are specific for the cloud under test.
The cloud’s performance will degrade
Each OpenStack release introduces new features and brings new bugs, but more importantly, it may require a new hardware configuration. A new OpenStack release might require additional or faster CPUs, more memory and more disk space. This has been true for several OpenStack releases, including Liberty. Community efforts to optimize OpenStack may eventually reduce these requirements, but for now you should expect the performance of your cloud to degrade after an upgrade.
To proactively identify and solve such performance issues, you need to benchmark and profile both clouds: the old one and the new one. You should be able to identify any performance degradation and add resources for the OpenStack services under high load. You can use OpenStack Rally for automated cloud benchmarking and profiling.
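Once you have benchmark timings for the old and the new cloud, comparing them is simple arithmetic. The sketch below assumes you have already reduced each run to a mapping of scenario name to mean duration in seconds. The function name, the input format and the 10% tolerance are illustrative assumptions, not Rally's output format.

```python
def find_regressions(before, after, tolerance=0.10):
    """Return {scenario: relative slowdown} for scenarios that got slower.

    `before` and `after` map a scenario name to a mean duration in seconds.
    A scenario is reported when it is more than `tolerance` (10% by default)
    slower after the upgrade. Scenarios missing from either run are skipped.
    """
    regressions = {}
    for scenario, old in before.items():
        new = after.get(scenario)
        if new is None or old <= 0:
            continue
        slowdown = (new - old) / old
        if slowdown > tolerance:
            regressions[scenario] = round(slowdown, 3)
    return regressions
```

For example, a scenario that took 10 seconds before the upgrade and 13 seconds after it is reported with a slowdown of 0.3, that is, 30%.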
Unclean shutdown of the services may lead to an inconsistent state of the cloud
Before it stops, a service should notify the message queue to stop sending it new requests and complete all the requests it has already received. You should therefore shut down OpenStack services gracefully and give them enough time to complete all active requests and report their unavailability to the message queue. Shut down one service at a time, upgrade it, start it, and then do the same for the next one.
Upgrading the services in the wrong order may break the cloud
You can easily break the cloud by upgrading the services in the wrong order. The following order is recommended:
- Upgrade OpenStack Identity (Keystone)
- Upgrade the OpenStack Image service (Glance)
- Upgrade OpenStack Compute (Nova)
- Upgrade OpenStack Networking (Neutron)
- Upgrade OpenStack Block Storage (Cinder)
- Upgrade the OpenStack dashboard (Horizon)
- Upgrade the OpenStack Orchestration (Heat)
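The order above, combined with the advice to handle one service at a time, can be encoded so that automation never upgrades services out of sequence. This is a minimal sketch: the function names and the `stop`, `upgrade` and `start` hooks are hypothetical, and placing unlisted services last is an arbitrary choice made for this sketch.

```python
# Recommended order from the list above, by project name.
UPGRADE_ORDER = ["keystone", "glance", "nova", "neutron", "cinder", "horizon", "heat"]


def plan_upgrade(installed):
    """Sort the installed services into the recommended upgrade order.

    Services that are not in UPGRADE_ORDER are appended at the end in
    alphabetical order (an assumption of this sketch, not a recommendation).
    """
    known = [name for name in UPGRADE_ORDER if name in installed]
    unknown = sorted(set(installed) - set(UPGRADE_ORDER))
    return known + unknown


def upgrade_one_at_a_time(installed, stop, upgrade, start):
    """Stop, upgrade and restart each service before touching the next."""
    for service in plan_upgrade(installed):
        stop(service)      # hypothetical hook: graceful shutdown
        upgrade(service)   # hypothetical hook: package and database upgrade
        start(service)     # hypothetical hook: start and health check
```

With `installed = ["nova", "keystone"]`, Keystone is always processed completely before Nova, regardless of the order in which the names were passed in.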
Upgrade will fail due to old or missing system dependencies
A new OpenStack release introduces new system dependencies and requires upgraded versions of existing ones. An upgraded OpenStack service will fail to start, or will terminate with a runtime failure, if some of its system dependencies are not installed or upgraded.
When upgrading the OpenStack services, make sure that all the dependencies are also upgraded properly. Usually this implies that all of the OpenStack components are installed from packages (deb or rpm) with correctly defined and tested dependencies. Even in this case, depending on the specific configuration, upgrading the packages can break some services. If the package manager (yum or apt-get) asks whether to replace your configuration files, reject the changes. Instead, review and change the configuration files yourself, then restart the services manually.
Database downgrades are not supported
Most of the OpenStack services support database migrations. That means that each service will try to upgrade its database during startup. Usually the automated upgrade is well tested for the stable OpenStack release and can be used safely (it can be disabled in favor of a manual upgrade, if necessary). However, starting from Kilo, database downgrades are not supported. Thus, the only reliable way to roll back a database is to restore it from a backup.
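Because a backup is the only rollback path, it is worth scripting the dump that precedes every upgrade. The sketch below only builds the command line and does not run it. It assumes a MySQL or MariaDB backend and that credentials come from an option file; the function name and the example database names are illustrative.

```python
def build_dump_command(databases, output_file):
    """Build a mysqldump argument list for the given service databases.

    Assumes a MySQL/MariaDB backend; credentials are expected to come from
    an option file rather than from the command line.
    """
    if not databases:
        raise ValueError("at least one database name is required")
    return [
        "mysqldump",
        "--single-transaction",           # consistent snapshot for InnoDB tables
        f"--result-file={output_file}",   # write the dump to this file
        "--databases", *databases,        # include CREATE DATABASE statements
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` immediately before the service that owns the database is upgraded.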
Configuration files will not be upgraded automatically
Each OpenStack release introduces changes to the configuration files. Options can be removed, renamed and moved to other sections. New options can be added with the default values that can break your cloud. Read the release notes thoroughly to identify such changes and apply them to your configuration files. For example:
- In Juno, the ‘identity_uri’ option should be used in the ‘[keystone_authtoken]’ section instead of ‘auth_host’, ‘auth_port’, and ‘auth_protocol’ for all of the services.
- In Kilo, when using libvirt 1.2.2, live snapshots are disabled by default. Deployers can set ‘workarounds.disable_libvirt_livesnapshot=False’ in nova.conf to enable live snapshot support.
- In Liberty, setting ‘force_config_drive=always’ in nova.conf is deprecated; use True/False boolean values instead.
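Checks like the ones in these examples are easy to automate once you have extracted them from the release notes. The following sketch scans the text of a configuration file for the Juno and Liberty examples above using Python's standard `configparser`. The function name and the rule tables are illustrative, and placing `force_config_drive` in the `[DEFAULT]` section is an assumption about your nova.conf.

```python
import configparser

# Options that should no longer be used, keyed by (section, option).
DEPRECATED_OPTIONS = {
    ("keystone_authtoken", "auth_host"): "use identity_uri (Juno)",
    ("keystone_authtoken", "auth_port"): "use identity_uri (Juno)",
    ("keystone_authtoken", "auth_protocol"): "use identity_uri (Juno)",
}
# Values that should no longer be used, keyed by (section, option, value).
DEPRECATED_VALUES = {
    ("DEFAULT", "force_config_drive", "always"): "use True or False (Liberty)",
}


def find_outdated_settings(config_text):
    """Return a list of human-readable findings for one configuration file."""
    # interpolation=None: "%" in values is literal; strict=False: allow repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    findings = []
    for (section, option), hint in DEPRECATED_OPTIONS.items():
        if parser.has_option(section, option):
            findings.append(f"[{section}] {option}: {hint}")
    for (section, option, value), hint in DEPRECATED_VALUES.items():
        if parser.has_option(section, option):
            if parser.get(section, option).strip().lower() == value:
                findings.append(f"[{section}] {option}={value}: {hint}")
    return findings
```

Running this against every service configuration file on the test cloud gives a list of settings to fix before the upgrade. Extend the two tables as you read the release notes for your target release.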
Upgrade will fail due to new, deprecated or removed API
If you have custom scripts or other software that use the OpenStack API, be prepared for them to break after the upgrade, because a new OpenStack release may introduce a new API version and mark the old version as deprecated.
In the worst case, the old API is removed from the release altogether. Read the release notes thoroughly to identify such changes and adapt your scripts and software accordingly. For example:
- In Kilo, the EC2 API support in Nova has been deprecated and is planned to be removed in a future release.
- In Liberty, the Load Balancer as a Service (LBaaS) V1 API is marked as deprecated and is planned to be removed in a future release. Going forward, the LBaaS V2 API should be used.
Upgrade will fail due to deprecated or removed features, plugins and drivers
If you are using a vendor-specific feature, plugin or driver, be prepared for a failed upgrade, because it may be deprecated or even removed in a new OpenStack release. Read the release notes thoroughly to identify such changes and apply them to your cloud. For example:
- In Kilo, XML support in Keystone has been removed.
- In Kilo and Liberty, many monolithic vendor-specific plugins have been removed from Neutron.
Upgrade will fail due to architectural changes
In some cases your cloud may depend on a specific architectural feature of the old OpenStack release, which is changed or deprecated in a new release. Read the release notes thoroughly to identify such changes and apply them to your cloud. For example:
- Use Python 3 instead of Python 2.6
- Use the PyMySQL database driver instead of MySQL-Python
- Use the unified ‘openstack’ client instead of ‘keystone’, ‘glance’, etc.
- In Liberty, Ceilometer alarming is deprecated in favor of Aodh
- In Kilo and Liberty releases, the Keystone project deprecates eventlet in favor of a separate web server with WSGI extensions
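For the move to the unified ‘openstack’ client listed above, a quick scan of your custom shell scripts shows where the per-project clients are still in use. This is a rough sketch: the function name is hypothetical, the client list extends the two names mentioned above with other per-project clients as an assumption, and only commands at the start of a line (optionally after `sudo`) are detected.

```python
import re

# Per-project command-line clients to look for (extend as needed).
LEGACY_CLIENTS = ("keystone", "glance", "nova", "neutron", "cinder", "heat")

# Matches a legacy client at the start of a line, optionally preceded by sudo.
_PATTERN = re.compile(
    r"^\s*(?:sudo\s+)?(" + "|".join(LEGACY_CLIENTS) + r")\s+\S"
)


def find_legacy_client_calls(script_text):
    """Return (line_number, client) for each line that invokes a legacy client."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = _PATTERN.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Commands inside pipelines, subshells or variables are not detected, so treat the result as a starting point for a manual review rather than a complete inventory.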
The future of OpenStack upgrades
To help solve challenges related to upgrading OpenStack, the OpenStack community has adopted a Big Tent approach for new releases. With the Big Tent model, operators will be able to select the preferred components and their version, and then add or upgrade modules incrementally with little or no downtime.