A regular Deployment resource in Kubernetes provides us with 2 deployment strategies that we can specify in the .spec.strategy.type field – RollingUpdate (the default option) and Recreate – and that’s basically everything we’re able to use in Kubernetes by default. This can be enough for some scenarios, especially if we just want to get things done and set up a Minimum Viable Product as fast as possible.
What if we need a much more sophisticated deployment method? There are countless deployment strategies:
- Blue-Green,
- Canary,
- Big Bang,
- Feature Toggle, and so on…
Obviously, we can also use hybrids of those methods, so there is much more to explore than is provided by default in K8s. But how can we leverage those deployment strategies without the need for writing complex Bash scripts, without complex configuration of Load Balancer and multiple environments (or K8s clusters) in our cloud, and finally without the need for very complicated routing configuration of our K8s Ingress?
There is a way simpler solution for that, which is Kubernetes-friendly and will allow us to use various mature deployment mechanisms with even the simplest K8s+cloud setup you can imagine – the name of this tool is Argo Rollouts!
First, a reminder of the basics
Before explaining what Argo Rollouts is and what it can give us with the Canary strategy, let’s first recall how a regular Kubernetes Deployment (with the default RollingUpdate strategy) behaves on an update.
Let’s take the below Deployment definition as an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: my-image:v1
          ports:
            - containerPort: 80
  strategy: # Field added for better clarity
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 10%
This Deployment uses the default RollingUpdate strategy type, but I explicitly defined the .spec.strategy field (with maxSurge and maxUnavailable tightened from their 25% defaults to 10%) for better clarity.
The process
When we create this Deployment and then update its image to a new version, it will simultaneously start spinning up new pods and removing the pods with the older image version in a sequence of batches; the whole process will look like this:
- Initially, we have 10 pods, all with an image in the v1 version.
- We update the image tag to v2, so the Rolling Update is triggered and proceeds based on the defined maxSurge and maxUnavailable.
- A new ReplicaSet is created that will be spinning up pods with a v2 image.
- Simultaneously – the new ReplicaSet creates 2 pods with v2 image, and the old ReplicaSet terminates its 1 old pod. New ReplicaSet can only create 2 pods at that point because of maxSurge set to 10% (of .spec.replicas count), so we will have 11 replicas in total (9 old replicas in Running state and 2 new replicas with ContainerCreating state). The old ReplicaSet (at this point) can terminate only a single pod because the maxUnavailable is set to 10% (at least 9 replicas need to be in the Running state).
- Right after the replicas from the new ReplicaSet turn from ContainerCreating to Running state, the old ReplicaSet will terminate 2 of its replicas, and at the same time the new ReplicaSet will create another 2 replicas, so we will have 11 replicas in total (7 old replicas in Running state, 2 new replicas in Running state, and 2 new replicas in ContainerCreating state).
- Then, all of the subsequent batches are similar to the previous step, until we reach 100% of pods in the desired v2 (new) version and 0% of pods in the previous v1 (old) version.
- After performing the last patch, the old ReplicaSet is kept or removed based on your .spec.revisionHistoryLimit field (with the default config, it will stay, though it won’t have any replicas).
Gradual updates
This is a very handy functionality, especially compared to obsolete deployment methods from the pre-Kubernetes era. With Rolling Updates, we have no downtime. Instead, we have gradual updates where a progressively larger percentage of our pods is replaced by the newer version. We can even set maxUnavailable to 0 so we won’t lose any capacity (during the deployment of a new version, we will always have at least the same number of running pods as before the start of the update process).
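The zero-capacity-loss variant mentioned above could be sketched like this (a fragment only; the 25% surge value is just an illustrative choice, not a recommendation):

```yaml
# No old pod is terminated before its replacement is Running
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # allow up to 25% extra pods during the update
    maxUnavailable: 0    # never drop below the desired replica count
```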
That’s fine, but what if we need a much more sophisticated deployment strategy that would allow us to use much smarter logic in our deployment process?
Now, we can finally get into Argo Rollouts!
Argo Rollouts and Canary Deployment
Argo Rollouts is an open-source tool that provides a Kubernetes controller and a set of CRDs (e.g., a Rollout) for advanced deployment capabilities. Similarly to Argo CD, it comes from the Argo project. In this blog post, I’m focusing only on the most standard usage of the Canary Deployment strategy, but you can also leverage Blue-Green deployment and many more features of Canary Deployment, so feel free to refer to the official documentation after (or while) reading this blog post.
With Canary Deployment, you have way more control over the deployment process of a new version compared to a regular Rolling Update. In the Rolling Update, the deployment process is straightforward and continuous – we just gradually replace old replicas with new ones at the same pace. This is not the case in Canary Deployment.
Canary Deployment and miners’ canaries
The name “Canary Deployment” comes from the past practice of coal miners, who used canary birds as an early warning system for harmful gases like carbon monoxide (CO) and methane (CH4). Canary birds alerted the miners to danger before they could recognize it themselves. Similar to coal miners, software engineers want to make sure that a new area (app version) is safe and can be used on a larger scale.
Instead of just deploying a new version without any control once the deployment process is already happening, we can leverage Canary Deployment to initially spin up only a couple of pods in the new version, so only some of the users will use it. Then, if everything is successfully validated (tests passed and users are satisfied with the change), we can fully update to the new version (and possibly implement some additional useful steps during the update process).
You can, e.g., initially replace 10% of your pods with the new version and route 10% of your production traffic to it. Then we can have time waits or a manual gate and, in the meantime, run automated tests that look for issues with the new version. Then, if the tests are successful, the right amount of time passes, or the manual gate is approved, we scale to, e.g., 30% of pods (and routed traffic) on the new version and 70% on the old version, then wait for some time, then scale to half of the pods on the new version, and finally scale to 100% of pods on the new version.
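The progression described above could be sketched as a Rollout steps fragment (the weights and durations here are illustrative assumptions, not a recommendation):

```yaml
# Hypothetical canary progression: 10% -> 30% -> 50% -> 100%
strategy:
  canary:
    steps:
      - setWeight: 10          # 10% of pods on the new version
      - pause: {}              # manual gate: wait for a promotion
      - setWeight: 30
      - pause: {duration: 1h}  # time wait while automated tests run
      - setWeight: 50
      - pause: {duration: 1h}
      # implicit final step: setWeight: 100
```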
This is only one of the infinite number of possible setups that you can configure with the Canary Deployment approach and Argo Rollouts. Argo Rollouts provides an enormous number of features to satisfy your deployment process needs. You can, e.g., manipulate the amount of traffic with the usage of 2 K8s Services and a K8s Ingress controller, for better isolation of versions and traffic weights independent of the number of pods (e.g., 10% of pods in the new version but only 5% of traffic routed to this set of pods).
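Such a setup could be sketched like this (the Service and Ingress names are made-up placeholders; it assumes an NGINX Ingress controller and two pre-existing Services):

```yaml
# Decouple traffic weight from pod count using two Services
# and an NGINX Ingress managed by Argo Rollouts
strategy:
  canary:
    canaryService: nginx-canary-svc   # Service selecting canary pods
    stableService: nginx-stable-svc   # Service selecting stable pods
    trafficRouting:
      nginx:
        stableIngress: nginx-ingress  # existing Ingress for the stable Service
    steps:
      - setCanaryScale:
          replicas: 1                 # e.g. 10% of pods on the new version
      - setWeight: 5                  # but only 5% of traffic routed to them
      - pause: {}
```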
What to choose?
As you can see, Canary Deployment is a much more capable deployment strategy than a Rolling Update. I’m not saying it’s better or worse. Rarely is something in engineering simply better or worse. If one of your apps running in K8s doesn’t need Canary Deployment because, e.g., it is relatively simple, has very frequent and minor updates, and doesn’t need comprehensive validation on each update to a new version, then Canary Deployment would be an absolute overengineering, and you should stay with a default Rolling Update.
However, if your app can benefit from the Canary Deployment approach, you should definitely consider implementing it, especially with Argo Rollouts.
If you’re a DevOps Engineer or anybody interested in modern cloud-native technologies and containerization, then there’s a high chance you have already heard about Argo CD. If so, that’s good, because it will help you get an idea of what Argo Rollouts is. Both of those tools serve very different purposes, but in terms of how they fundamentally operate, they are very similar – both are open-source Golang code that we can install (along with a dedicated CLI tool) in order to get a new Kubernetes controller and a bunch of CRDs for deployment-related purposes in our K8s cluster.
Argo Rollouts core concept – a Rollout
The most important concept in Argo Rollouts is a CRD called “Rollout.” This “new” object is not as new as it may seem, because a Rollout is basically a regular K8s Deployment with a bunch of useful deployment capabilities built on top of it.
Let’s see an example definition of a Rollout object:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: nginx-rollout
  labels:
    app: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: my-image:v1
          ports:
            - containerPort: 80
  strategy:
    canary:
      maxSurge: '10%'
      maxUnavailable: '10%'
      steps:
        - setWeight: 30
        - pause:
            duration: 30m
        - setWeight: 40
        - pause:
            duration: 1h
        - pause: {}
        # We don't need to explicitly specify the below line because it's the default behavior
        # - setWeight: 100
Rollout and Deployment
As you can see, the Rollout definition is almost the same as the Deployment, but with only 3 differences:
- .apiVersion: – we need to use argoproj.io/v1alpha1 instead of apps/v1,
- .kind: – we need to use Rollout instead of Deployment,
- .spec.strategy: – instead of specifying the deployment strategy in the .spec.strategy.type field and then optionally configuring the Rolling Update options under .spec.strategy.rollingUpdate, we can choose between canary and blueGreen and specify the configuration options under those new fields.
So basically, everything is the same as in a deployment, with the only major difference being the deployment strategy configuration. That’s great news for everybody who doesn’t have time to fight with writing some fancy CRD manifest file from scratch, especially if you want to use this custom resource as a replacement for a Deployment, which is undisputedly one of the most crucial objects in K8s clusters.
With Argo Rollouts, you can just install Argo Rollouts (we will get to this in a moment), change .apiVersion, .kind, and add, e.g., .spec.strategy.canary field with {} as the value, and that’s it! You don’t even need to specify anything in the .spec.strategy.canary field because if you don’t specify anything in this field, then your Rollout will behave exactly like a normal Deployment.
But obviously, you for sure want to leverage the features that Rollout provides if you decide to install Argo Rollouts, so don’t leave this field empty 😉
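The minimal conversion described above could be sketched like this (the app name is a made-up placeholder):

```yaml
apiVersion: argoproj.io/v1alpha1  # was: apps/v1
kind: Rollout                     # was: Deployment
metadata:
  name: my-app
spec:
  # ...replicas, selector, and template copied unchanged from the Deployment...
  strategy:
    canary: {}  # empty config: behaves exactly like a regular RollingUpdate
```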
.spec.strategy field
Now let’s explain what is happening under the .spec.strategy field in the example that I showed you here:
...
strategy:
  canary:
    maxSurge: '10%'
    maxUnavailable: '10%'
    steps:
      - setWeight: 30
      - pause:
          duration: 30m
      - setWeight: 40
      - pause:
          duration: 1h
      - pause: {}
First, we specify that we want to use the Canary Deployment strategy; then we optionally specify maxSurge and maxUnavailable, which work exactly the same as in a regular Deployment – (in this case) maxSurge ensures that there will never be more than 11 replicas (in total) in the Running or ContainerCreating state, and maxUnavailable ensures that there are always at least 9 replicas running.
Then, we specify our deployment steps that define the Rollout’s behavior when updating to a new version.
Updating to a new version
Here’s a step-by-step explanation of what will happen when we update the image to a new version:
- Initially, we have 10 replicas in the single ReplicaSet with the image of v1 tag.
- We update the pod image version to v2.
- The first step is performed – 1 replica is terminated from the old ReplicaSet (with revision:1) and simultaneously a new ReplicaSet (with revision:2) is created with 2 replicas in ContainerCreating state, so we will have 9 running replicas (all from the old ReplicaSet) and 11 replicas in Running or ContainerCreating state (in total) – this is exactly what we expect from our config of the maxUnavailable and maxSurge fields.
- Right after one of the replicas from a new ReplicaSet turns from ContainerCreating to Running state, a new replica in a new ReplicaSet is created, and at the same time, another replica from the old ReplicaSet is terminated, so we will have 9 running replicas (1 from the new ReplicaSet and 8 from the old ReplicaSet) and 11 replicas in Running or ContainerCreating state (in total).
- Right after another replica from the new ReplicaSet turns from ContainerCreating to Running state, another replica from the old ReplicaSet is terminated, so we will have 9 running replicas (2 from the new ReplicaSet and 7 from the old ReplicaSet) and 11 replicas in Running or ContainerCreating state (in total).
- Right after another replica from the new ReplicaSet turns from ContainerCreating to Running state, another replica from the old ReplicaSet is terminated, so we will have 10 running replicas (3 from the new ReplicaSet and 7 from the old ReplicaSet) and 10 replicas in Running state, so the first step ended – we have 30% of running replicas from a new ReplicaSet and 30% of traffic is routed to those new replicas.
- The second step is performed – the Rollout waits for 30 minutes (no changes in the number of replicas).
- Third step is performed – 1 replica is terminated from an old ReplicaSet, and at the same time, 1 replica is created in the new ReplicaSet (and is in ContainerCreating state).
- When a replica from a new ReplicaSet turns from ContainerCreating to Running state, then the third step ends because we have 40% of running replicas from a new ReplicaSet, and 40% of traffic is routed to those new replicas.
- Fourth step is performed – the Rollout waits for 1 hour (no changes in the number of replicas).
- The fifth step is performed – the Rollout waits for a promotion (a manual approval). This is an important step for implementing a manual gate in our deployment process. It is the last step defined in this manifest, so it is the last one before updating our Rollout to 100% of new replicas. Now, we should go to our application interface, probably perform some tests, and make sure that we really want to update to the new version. If everything looks fine, then we can promote the Rollout, e.g., by using the command kubectl argo rollouts promote nginx-rollout (in a moment, I will show you how to install this command).
- Finally, the last step is performed – the Rollout updates to 100% of replicas from the new ReplicaSet and scales down all of the replicas from the old ReplicaSet (obviously with respect to the maxSurge and maxUnavailable fields). This step will be performed regardless of whether we specify it or not.
The setup process
That was a simple example of a Rollout usage. Now, let’s go through the setup process so you can test this example on your own cluster!
First, run those 2 commands in order to install the Argo Rollouts Controller and CRDs:
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
Now you can already deploy your first Rollout instance, but it’s definitely a good idea to install a few more things first.
First, I highly recommend you install the Argo Rollouts plugin for kubectl which will allow you to promote the Rollouts, visualize the deployment update process, view Argo Rollouts Dashboard, and overall work more efficiently with Rollout objects. Moreover, you may want to install shell auto-completion for Argo Rollouts.
After installing all of the needed software you probably should play with Argo Rollouts by yourself. You can use an example that I already showed here, the example that I will show you in a moment, or look for some examples available on the internet (official GitHub examples can be a good starting point).
I recommend keeping at least 2 terminal windows open at the same time – one where you will execute commands like kubectl apply, and a second one where you will keep the kubectl argo rollouts get rollout nginx-rollout --watch command running, so you will be aware of everything that is happening with your Rollout (how it progresses through the deployment).
Automated tests integrated into the deployment process
Now, let’s get into a much more interesting example that will show you one of the most crucial benefits of Argo Rollouts – automated tests that are triggered during the deployment process. Manual gates, time waits, and the flexibility of setting the number of pods updated at each step are all very useful functionalities, but not as game-changing as the side of Argo Rollouts that we will cover right now.
Argo Rollouts allows us to define really comprehensive tests based, e.g., on metrics from your monitoring solution (such as Prometheus) that will test a new version (revision) of your Rollout and, based on the test results, decide whether it should continue with the deployment process or roll back to the previous revision – all fully automated!
Two manifests
Let’s see an example with 2 manifests: one with a Rollout and a second one with a new resource – an AnalysisTemplate. An AnalysisTemplate defines how to perform a canary analysis: the metrics to query, their frequency, and the values that are considered successful or failed.
rollout.yml (with .spec field simplified for better readability):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  # ...
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2 # Delay starting analysis run until setWeight: 40%
        args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}
analysis-template.yml:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      # NOTE: Prometheus queries return results in the form of a vector.
      # So it is common to access the index 0 of the returned array to obtain the value
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))
In the Rollout definition, the only new things are the reference to the AnalysisTemplate that will be used for our Rollout and the explicitly specified Service FQDN passed as an argument. The AnalysisTemplate definition, on the other hand, is something completely new. In analysis-template.yml, we specify the configuration of the AnalysisTemplate, like:
- the interval,
- success condition,
- failure limit,
- and the provider config.
Note that in order to use this example, you need to run Prometheus.
Our AnalysisTemplate will start its testing once the Rollout reaches the step with index 2 (setWeight: 40, per startingStep: 2) and will execute the Prometheus query (PromQL expression) every 5 minutes to check whether the success rate is at least 95%. If the measurement fails more often than the failureLimit allows, the new revision (ReplicaSet) will be rolled back (scaled back to 0%), the previous revision will be scaled out to 100% again, and the whole Rollout will stay in the Degraded state (until we update the Rollout again).
That’s amazing functionality – absolutely zero manual effort, and our deployment is performed automatically with continuous tests. If something is wrong, we will just go back to the previous version (and almost definitely the working version)! Imagine how much you can do with those tests.
You can look for:
- any metrics (latency, success rate, etc.),
- any logs,
- and results from your app.
The possibilities are practically endless.
Prometheus metrics themselves can provide you with very useful information about potential problems with a new version of your app, but you are not limited to Prometheus – you can use Datadog, New Relic, AWS CloudWatch, and even set up custom K8s Jobs or configure an HTTP request that will look for specific measurements!
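As a sketch of the non-Prometheus route, an AnalysisTemplate can run a plain Kubernetes Job whose exit status decides the analysis result (the names, image, and endpoint below are made-up assumptions for illustration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test  # hypothetical template name
spec:
  metrics:
    - name: smoke-test
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: test
                    image: curlimages/curl
                    # The analysis succeeds only if the Job exits 0,
                    # i.e. the canary endpoint answers with HTTP 2xx
                    command: ["curl", "-fsS", "http://my-app-canary-svc/healthz"]
```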
The AnalysisTemplate is the true power of Argo Rollouts, and I’m close to saying it’s the most useful feature of this tool, so if you have already decided that you want to implement Argo Rollouts because of some of its other functionality, then I highly recommend you jump into the rabbit hole of Analysis in Argo Rollouts.
Argo Rollouts and Argo CD
There is one thing left that is definitely worth mentioning – how do Argo Rollouts and Argo CD work together when used on the same K8s workloads?
Both of those tools come from the same Argo project, so as you can expect, they integrate very well. Each of them can be used without the other (as a standalone tool), but in most modern K8s setups you want to implement Argo CD, and then, depending on your desired deployment strategy, you can implement Argo Rollouts too.
Of course, everything seems to be clear in case of successful deployment – there is an update to the pod image in the remote repository -> Argo CD notices that and triggers synchronization -> Argo Rollout performs an update (that is successful) -> we have a new version running in the cluster.
Are we in danger of an endless loop?
How about the situation when the AnalysisTemplate fails, and the deployment process is aborted by the Rollout because the new image version has a bug? Will we end up in an endless loop where Argo CD is continuously trying to sync the state from the remote repo, and at the same time, Argo Rollouts is failing over and over again? Fortunately, that won’t happen. Argo CD is aware of the Degraded state of Rollouts, and it won’t take any further action if this state occurs; instead of doing an override, it will simply show an Out of Sync status.
That’s great, but what should we do in a situation like that? There are at least a few ways to handle this.
I believe that if you’re already using the GitOps approach in your SDLC, then you should stick with the values of this philosophy and use the git revert command to revert the commit that caused the issue in the latest version. After doing a push, your git repo will reflect the desired state, and Argo CD will notice that change and automatically update the cluster, so you will end up with a repository and K8s cluster that are again in sync and a Rollout that is stable and healthy (though with the previous version).
Eventually, you may want to simply push a new change with an issue fix instead of using git revert, but this action obviously assumes that you know how to fix the issue, you already fixed it, and have a new Docker container ready to be used. Often when your Rollout fails during the deployment process, first you want to roll back to the previous version (using git revert) in order to reduce the app disruption as fast as you can, and then try to actually fix the issue and try to deploy again.
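The revert flow can be sketched locally like this (a stand-in for a real GitOps repo; the file name and image tag are assumptions for illustration):

```shell
# Simulate a GitOps repo with a bad image bump, then revert it
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email "dev@example.com"
git config user.name "dev"
echo "image: my-image:v1" > deployment.yaml
git add deployment.yaml
git commit -qm "deploy v1"
echo "image: my-image:v2" > deployment.yaml
git add deployment.yaml
git commit -qm "bump image to v2"
# The v2 rollout failed its analysis - revert the bump so that
# Argo CD syncs the cluster back to the previous (working) version
git revert --no-edit HEAD
cat deployment.yaml  # back to v1
```

After pushing such a revert commit, Argo CD would pick up the restored state, and the Rollout would redeploy the old version as a new, healthy revision.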

Summary
Argo Rollouts can be an absolute game-changer if you need to implement a more advanced deployment strategy (like Canary) instead of relying on the default Rolling Update in Kubernetes Deployment. Furthermore, keep in mind that probably the most crucial advantages of Argo Rollouts are its automated tests and rollbacks on the update, so don’t forget about the power of those features.
Moreover, remember that implementing this new tool in your cluster shouldn’t be too complicated. The documentation is clear and contains a couple of good examples, and the DevOps community has already created many great guides on Argo Rollouts. You also shouldn’t have a problem finding the right GitHub issue or Stack Overflow question if you encounter a problem while working with Argo Rollouts.
Last but not least – always remember to analyze and really make sure that you actually need such a tool. Don’t try to implement something that you don’t actually need – overengineering is one of the greatest traps for every engineer (not only DevOps engineers), so always try to question your requirements instead of creating or accepting work that doesn’t bring the actual value (business value or some other sort).
Nevertheless, if you see real benefits in Canary deployment for your scenario then Argo Rollouts is waiting for you!
***
If you are interested in the tools used in IT, be sure to also take a look at other articles by our experts 🙂