Upgrading RabbitMQ
You can only upgrade to RabbitMQ 4.0 from RabbitMQ 3.13.x.
Moreover, all stable feature flags have to be enabled before the upgrade. The upgrade will fail if you miss this step.
Note that the khepri_db feature flag must not be enabled in 3.13.x because it was experimental and unsupported. If a 3.13.x node or cluster has khepri_db enabled, upgrading to 4.x is not possible. In this case, the solution is to use a blue-green deployment to migrate to RabbitMQ 4.x.
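For example, the feature flag state (including khepri_db) can be inspected and all stable flags enabled with the CLI before the upgrade. This is a minimal bash sketch; on Windows, use the .bat equivalents of the same commands:
# list feature flags and their current state on the node
rabbitmqctl list_feature_flags name state
# enable all stable feature flags ("all" is expected to skip experimental flags such as khepri_db)
rabbitmqctl enable_feature_flag all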
Upgrade Strategies
There are three major upgrade strategies that can be used with RabbitMQ. Below you'll find a brief overview of all of them. Each strategy has a dedicated page with more detailed information.
Rolling (in-place) Upgrade
This is the recommended upgrade strategy.
A rolling upgrade (also referred to as an in-place upgrade) is an upgrade process where nodes are upgraded one by one. Refer to the rolling upgrade guide page for more details, but here are the main steps (a command-level sketch follows the list):
- Investigate if the current and target versions have a rolling upgrade path
- check version upgradability
- check Erlang version requirements
- check the release notes
- verify that all stable feature flags are enabled
- Check that the node or cluster is in a good state before upgrading it
- no alarms are in effect
- no ongoing queue or stream replica sync operations
- the system is otherwise under a reasonable load
- For each node
- stop the node
- upgrade RabbitMQ and, if applicable, Erlang
- start the node
- watch monitoring and health check data to assess the health and recovery of the upgraded node and cluster
- Once all nodes are upgraded, enable stable feature flags introduced in the new version
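As a command-level sketch of the per-node steps above (the apt and systemctl commands are platform-specific examples, not part of RabbitMQ; substitute your own package and service management tooling):
# verify that stopping this node will not cost any quorum queue, stream or internal component its online quorum
rabbitmq-diagnostics check_if_node_is_quorum_critical

# stop the node (this also stops the Erlang VM)
rabbitmqctl shutdown

# upgrade RabbitMQ and, if applicable, Erlang, then start the node again
# (example commands for a Debian-based system)
sudo apt-get install --only-upgrade rabbitmq-server
sudo systemctl start rabbitmq-server

# verify that the upgraded node is running and healthy before moving on to the next one
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms

# after ALL nodes have been upgraded, enable the stable feature flags introduced in the new version
rabbitmqctl enable_feature_flag all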
Blue-Green Deployment
This upgrade strategy is the safest option. It is recommended for environments where a rolling upgrade is not an option for any reason, or where extra safety is particularly important.
The blue-green deployment strategy offers the benefit of making the upgrade process safer at the cost of temporarily increasing the infrastructure footprint. The safety aspect comes from the fact that the operator can abort an upgrade by switching applications back to the old cluster.
A blue-green upgrade usually involves the following steps performed by a deployment tool or manually by an operator. Refer to the blue-green deployment guide for more details about these steps:
- Deploy a new cluster with the desired version
- Synchronize metadata between the old and the new cluster (unless applications can declare their own metadata)
- Set up federation
- Switch consumers to the new cluster
- Drain messages
- Switch publishers to the new cluster
- Decommission the old cluster
There's also a simplified version of the blue-green strategy that can be used if some downtime is acceptable:
- Deploy a new cluster with the target version
- Stop the applications
- Synchronize metadata between the old and new clusters (for example, via definitions export/import, as sketched after this list)
- Move all messages from the old cluster to the new one (e.g. using Shovel)
- Reconfigure applications to use the new cluster
- Start publishers and consumers
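As a sketch of the metadata synchronization step, cluster definitions (users, vhosts, queues, exchanges, bindings, policies) can be exported from the old cluster and imported into the new one. The file path below is only an example:
# on any node of the old (blue) cluster
rabbitmqctl export_definitions /tmp/definitions.json

# on any node of the new (green) cluster
rabbitmqctl import_definitions /tmp/definitions.json
Messages are not part of definitions; they are drained via federation or moved with a tool such as Shovel, as described in the steps above.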
Grow-then-Shrink Upgrade
This upgrade strategy changes replica identities, can result in massive unnecessary data transfers between nodes, and is only safe if important precautions are taken. It is therefore strongly recommended against.
A grow-then-shrink upgrade usually involves the following steps. Consider a three-node cluster with nodes A, B, and C:
- Investigate if the current and target versions can be clustered together
- check version upgradability; if a rolling upgrade between the old and new version is not supported, that also means that these two versions cannot coexist in a single cluster
- check Erlang version requirements
- check the release notes
- Add a new node, node D, to the cluster (note: you may need to start node D with some feature flags disabled for it to be able to join the cluster)
- Place a new replica of every quorum queue and every stream on the new node
- Check that the node or cluster is in a good state
- no alarms are in effect
- no ongoing queue or stream replica sync operations
- the system is otherwise under a reasonable load
- Remove node A from the cluster using rabbitmqctl forget_cluster_node
- Repeat the steps above for the other nodes; in a 3-node cluster example, the cluster should now consist of nodes D, E and F
- Enable stable feature flags introduced in the new version
Multiple nodes can be added and removed at a time.
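As a sketch of the replica transfer for a newly added node D, followed by the removal of node A (node, vhost and stream names are placeholders; exact stream handling depends on the RabbitMQ version):
# add a quorum queue replica on node D for every quorum queue that does not yet have one there
rabbitmq-queues grow rabbit@D all

# streams may need to be grown individually, for example:
rabbitmq-streams add_replica --vhost / my-stream rabbit@D

# after stopping node A, remove it from the cluster
rabbitmqctl forget_cluster_node rabbit@A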
RabbitMQ Version Upgradability
You can only upgrade to RabbitMQ 4.0 from RabbitMQ 3.13.x.
Don't forget to enable all stable feature flags while still on 3.13, before attempting an upgrade to RabbitMQ 4.0, or the upgrade will fail.
If you are not on RabbitMQ 3.13 yet, refer to the table below to understand your upgrade path.
Release Series Upgradeability
The following shows the supported upgrade paths.
From | To | Notes |
---|---|---|
3.13.x | 4.0.x | All stable feature flags must be enabled before the upgrade |
3.12.x | 3.13.x | |
3.11.18 | 3.12.x | All feature flags must be enabled before the upgrade |
3.10.x | 3.11.x | All feature flags must be enabled before the upgrade |
3.9.x | 3.10.x | |
3.8.x | 3.9.x | |
3.7.18 | 3.8.x | |
RabbitMQ 3.13 included experimental support for Khepri. However, major changes had to be introduced since then, leading to incompatibilities between Khepri support in 3.13 and 4.0. Therefore, RabbitMQ 3.13 with Khepri enabled cannot be upgraded to 4.0. Blue-Green Deployment can still be used in this situation, since technically it is not an upgrade, but rather a migration to a fresh cluster.
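To confirm which series a node is currently running before picking an upgrade path from the table above, the node's reported version can be checked (a minimal sketch; server_version is available in reasonably recent CLI versions):
# display the server version of the target node
rabbitmq-diagnostics server_version
# the full node status also includes the RabbitMQ version
rabbitmqctl status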
Erlang Version Requirements
Please refer to the Erlang Version Requirements guide to learn the minimum required and maximum supported version of Erlang for a given RabbitMQ version.
It's generally recommended to use the latest Erlang version supported by the target RabbitMQ version.
We recommend that you upgrade Erlang together with RabbitMQ.
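For example, the Erlang version used by a running node can be checked with:
# display the Erlang/OTP version the target node is running on
rabbitmq-diagnostics erlang_version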
Plugin Compatibility Between Versions
Plugins included in the RabbitMQ distribution are guaranteed to be compatible with the version they are distributed with. If community plugins are used, they need to be verified separately.
Management Plugin Upgrades
The RabbitMQ management plugin comes with a Web application that runs in the browser.
After upgrading a cluster, it is highly recommended to clear browser cache, local storage, session storage and cookies for the domain(s) used to access the management UI. Otherwise, you may experience JavaScript errors.
Upgrade Considerations
Changes in System Resource Usage
During and after the upgrade, connections and queues will be balanced differently between the nodes: as nodes go down, connections will be reestablished on the remaining nodes and queue leaders will be reelected. It is important to make sure your cluster can sustain the workload while some, usually one, node is down for the upgrade. Performing the upgrade during low traffic hours is recommended.
Additionally, different versions of RabbitMQ can have different resource usage. That should be taken into account before upgrading: make sure there's enough capacity to run the workload with the new version. Always consult with the release notes of all versions between the one currently deployed and the target one in order to find out about changes which could impact your workload and resource usage.
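The following diagnostics can help assess resource headroom before and during the upgrade (a sketch; what constitutes acceptable headroom is workload-specific):
# exits with a non-zero code if any resource alarms (memory, disk space) are in effect on the node
rabbitmq-diagnostics check_local_alarms

# display a per-category breakdown of the node's memory footprint
rabbitmq-diagnostics memory_breakdown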
Upgrading a Single Node Installation
There are no fundamental differences between upgrading a single node installation compared to upgrading a multi-node cluster.
Upgrading Development Environments
Single node deployments are often local development or test environments. In such cases, if the messages stored in RabbitMQ are not important, it may be easier to simply delete everything in the data directory and start a fresh node of the new version. Effectively, it's no longer an upgrade but a fresh installation of the new version.
Please note that this process will delete all data in your RabbitMQ (definitions and messages), but this is usually not a problem in a development/test environment. The definitions can be preserved using export/import. The benefit of this approach is that you can easily jump from any version to any other version without worrying about compatibility and feature flags.
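A sketch of such a reset for a local development node, assuming losing all messages is acceptable (the data directory path is only an example and varies by platform and installation method):
# optionally preserve definitions (users, vhosts, queues, exchanges, bindings, policies)
rabbitmqctl export_definitions /tmp/definitions.json

# stop the node, wipe its data directory, then install and start the new version
rabbitmqctl shutdown
rm -rf /var/lib/rabbitmq/mnesia

# after the new version is running, optionally restore the definitions
rabbitmqctl import_definitions /tmp/definitions.json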
Downgrades
RabbitMQ does not officially support downgrades - they are not tested and should not be relied upon. Users who want extra safety can use blue-green deployment approach, which allows switching back to the old environment.
Having said that, downgrades technically work between some versions, especially if they only differ by a patch release. It is not guaranteed however: there have been patch releases that could not be downgraded even to the immediately preceding patch release.
Backup
It's strongly advised to back up the node's data directory before upgrading.
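A minimal sketch of a cold backup taken before the upgrade (the data directory path is only an example; see the Backup guide for details):
# stop the node so the on-disk data is in a consistent state
rabbitmqctl shutdown

# archive the node's data directory
tar -czf rabbitmq-backup-$(date +%F).tar.gz /var/lib/rabbitmq/mnesia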
When to Restart Nodes
Multiple components and features depend on the availability of a quorum of nodes. In the most common case of a 3-node cluster, this means that 2 nodes should always be available during the upgrade.
RabbitMQ provides a health check command that will fail if any quorum queues, streams or other internal components on the target node would lose their online quorum, should that node be shut down:
- bash
- PowerShell
# exits with a non-zero code if any of the internal components, quorum queues or stream queues
# will lose online quorum should the target node be shut down;
# additionally, it will print which components and/or queues are affected
rabbitmq-diagnostics check_if_node_is_quorum_critical
# exits with a non-zero code if any of the internal components, quorum queues or stream queues
# will lose online quorum should the target node be shut down;
# additionally, it will print which components and/or queues are affected
rabbitmq-diagnostics.bat check_if_node_is_quorum_critical
For example, consider a three node cluster with nodes A, B, and C and some quorum queues. If node B is currently down, this check will fail if executed against node A or C, because if A or C went down, there would only be one node running (and therefore, there would be no quorum). When node B comes back online, the same check would succeed.
When automating the upgrade process, you can use the rabbitmq-upgrade await_online_quorum_plus_one command to block the node shutdown process until there are enough nodes running to maintain quorum. Note that some deployment options already incorporate this check: for example, when running RabbitMQ on Kubernetes using the Cluster Operator, it is already part of the preStop hook.
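For example, an automated shutdown sequence may combine the two commands like this (a sketch; orchestration details are deployment-specific):
# block until stopping this node would leave all quorum-based components with an online quorum plus one member
rabbitmq-upgrade await_online_quorum_plus_one

# then stop the node
rabbitmqctl shutdown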
Rebalancing Queue Leaders
If either the rolling or grow-then-shrink upgrade strategy is used, queue leaders will not be evenly distributed between the nodes after the upgrade. Rebalancing of queue and stream leaders helps spread the load across all cluster nodes.
To rebalance all queue and stream leader replicas, run:
- bash
- PowerShell
rabbitmq-queues rebalance all
rabbitmq-queues.bat rebalance all
Full-Stop Upgrades
There is no need to stop all nodes in a cluster to perform an upgrade.
Maintenance Mode
Maintenance mode is a special node operation mode that can be useful during upgrades. The mode is explicitly turned on and off by the operator using the CLI commands covered below.
When a node is in maintenance mode, it will not be available for serving client traffic and will try to transfer as many of its responsibilities as practically possible and safe.
Currently this involves the following steps:
- Suspend all client connection listeners (no new client connections will be accepted)
- Close all existing client connections: applications are expected to reconnect to other nodes and recover
- Transfer primary replicas of all quorum queues hosted on the target node, and prevent them from participating in the subsequently triggered Raft elections
- Mark the node as down for maintenance
- At this point, a node shutdown will be least disruptive as the node has already transferred most of its responsibilities
A node in maintenance mode will not be considered for new primary queue replica placement, regardless of queue type and whether the queue type supports replication.
This feature is expected to evolve based on the feedback from RabbitMQ operators, users, and RabbitMQ core team's own experience with it.
A node in maintenance mode is expected to be shut down, upgraded or reconfigured, and restarted in a short time window (say, 5-30 minutes). Nodes are not expected to be running in this mode permanently or for long periods of time.
Enabling Maintenance Mode
To put a node into maintenance mode, use rabbitmq-upgrade drain:
- bash
- PowerShell
rabbitmq-upgrade drain
rabbitmq-upgrade.bat drain
Disabling Maintenance Mode
A restart takes the node out of maintenance mode automatically.
A node in maintenance mode can be revived, that is, brought back into its regular operational state, using rabbitmq-upgrade revive:
- bash
- PowerShell
rabbitmq-upgrade revive
rabbitmq-upgrade.bat revive
The command exists to roll back (to the extent possible) the effects of the drain command. It is only necessary to run it if you decide not to restart the node as planned; it is not necessary to revive a node after it has been restarted or upgraded, because the restart automatically takes the node out of maintenance mode.
Checking Maintenance Status
You can check whether any of the nodes in the cluster is in the maintenance mode
by running rabbitmqctl cluster_status
. You can also check the status of a specific
node by running rabbitmqctl status
.
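For example (the node name is a placeholder; recent releases include a maintenance status section in the cluster_status output):
# cluster-wide view, including which nodes are under maintenance
rabbitmqctl cluster_status

# status of a specific node
rabbitmqctl status --node rabbit@target-host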
Handling Node Restarts in Applications
To reduce or eliminate downtime, applications (both producers and consumers) should be able to cope with a server-initiated connection close. Some client libraries offer automatic connection recovery to help with this:
- Java client
- .NET client
- Bunny (Ruby)
In most client libraries there is a way to react to a connection closure; for example, the Java client exposes shutdown listeners and the .NET client exposes a connection shutdown event.
The recovery procedure for many applications follows the same steps:
- Reconnect
- Re-open channels
- Restore channel settings (e.g. the basic.qos setting, publisher confirms)
- Recover topology
Topology recovery includes the following actions, performed for every channel:
- Re-declare exchanges declared by the application
- Re-declare queues
- Recover bindings (both queue and exchange-to-exchange ones)
- Recover consumers
This algorithm covers the majority of use cases and is what the aforementioned automatic recovery feature implements.
During a rolling upgrade when a node is stopped, clients connected to this node will be disconnected using a server-sent connection.close method and should reconnect to a different node.
This can be achieved by using a load balancer or proxy in front of the cluster, or by specifying multiple server hosts if the client library supports this feature. Many client libraries support host lists; for example, the Java and .NET clients accept a list of endpoints (hostnames or addresses) when opening a connection.