Airflow vs Prefect vs Temporal vs Kestra vs Windmill
We compared Airflow, Prefect, Temporal, Kestra and Windmill with the following use cases:
- One flow composed of 40 lightweight tasks.
- One flow composed of 10 long-running tasks.
For additional insights about this study, refer to our blog post.
We chose to compute Fibonacci numbers as a simple task that can easily be run with all five orchestrators. Given that Airflow has first-class support for Python, we used Python for all of them. The function in charge of computing the Fibonacci numbers was very naive:
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)
After some testing, we chose to compute fibo(10) for the lightweight tasks (taking around 10ms in our setup), and fibo(33) for what we called "long-running" tasks (taking at least a few hundred milliseconds, as seen in the results).
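As a rough illustration of how the cost of the two inputs could be compared, the sketch below times the naive function with `time.perf_counter`. The helper name `time_call` is ours and is not part of the benchmark code. It measures only the bare function call, so the numbers will not match the task durations in the tables, which also include each orchestrator's own overhead and depend on the machine and Python version.

```python
import time
from typing import Callable, Tuple


def fibo(n: int) -> int:
    # Same naive recursion as the benchmark function.
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)


def time_call(func: Callable[[int], int], arg: int) -> Tuple[int, float]:
    """Return (result, elapsed seconds) for a single call of func(arg)."""
    start = time.perf_counter()
    result = func(arg)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # 10 is the "lightweight" input, 33 the "long-running" one.
    for n in (10, 33):
        value, seconds = time_call(fibo, n)
        print(f"fibo({n}) = {value} in {seconds * 1000:.3f} ms")
```

Because the recursion is exponential in `n`, moving from 10 to 33 changes the run time by several orders of magnitude, which is what makes the same function usable for both workloads.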
On the infrastructure side, we went simple and used the docker-compose.yml recommended in the documentation of each orchestrator. We deployed the orchestrators on AWS m4.large instances.
Airflow setup
We set up Airflow version 2.7.3 using the docker-compose.yaml referenced in Airflow's official documentation.
The DAG was the following:
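from datetime import datetime

from airflow import DAG
from airflow.decorators import task

# fibo() is the naive recursive function shown earlier
ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

with DAG(
    dag_id="bench_{}".format(ITER),
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["benchmark"],
) as dag:
    for i in range(ITER):
        @task(task_id=f"task_{i}")
        def task_module():
            return fibo(FIBO_N)
        fibo_task = task_module()
        # Chain the tasks so that they run one after the other
        if i > 0:
            previous_task >> fibo_task
        previous_task = fibo_task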
Results
For 10 long-running tasks run sequentially:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 4.347 | 6.910 |
task_01 | 7.315 | 9.690 | 16.387 |
task_02 | 16.545 | 18.361 | 20.077 |
task_03 | 20.130 | 21.785 | 23.487 |
task_04 | 23.869 | 25.319 | 27.463 |
task_05 | 28.061 | 29.665 | 32.354 |
task_06 | 33.210 | 34.996 | 37.498 |
task_07 | 38.378 | 39.938 | 41.754 |
task_08 | 42.366 | 43.933 | 45.887 |
task_09 | 46.281 | 50.179 | 54.668 |
For 40 lightweight tasks run sequentially:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 4.335 | 4.752 |
task_01 | 6.236 | 8.710 | 8.923 |
task_02 | 9.792 | 11.117 | 11.320 |
task_03 | 12.157 | 13.513 | 13.733 |
task_04 | 13.804 | 15.413 | 15.622 |
task_05 | 16.201 | 17.587 | 17.849 |
task_06 | 18.902 | 20.227 | 20.432 |
task_07 | 21.262 | 22.691 | 22.958 |
task_08 | 24.015 | 25.349 | 25.558 |
task_09 | 26.368 | 28.158 | 28.635 |
task_10 | 29.361 | 31.035 | 31.357 |
task_11 | 31.861 | 36.245 | 37.062 |
task_12 | 38.868 | 42.180 | 42.388 |
task_13 | 42.641 | 44.027 | 44.280 |
task_14 | 45.321 | 46.676 | 46.877 |
task_15 | 47.676 | 49.073 | 49.298 |
task_16 | 50.432 | 51.786 | 51.999 |
task_17 | 52.415 | 53.852 | 54.051 |
task_18 | 54.155 | 55.564 | 55.771 |
task_19 | 56.575 | 58.346 | 58.781 |
task_20 | 59.254 | 60.999 | 61.355 |
task_21 | 62.071 | 63.671 | 64.079 |
task_22 | 64.366 | 66.011 | 66.442 |
task_23 | 67.061 | 68.619 | 68.866 |
task_24 | 69.601 | 71.842 | 72.303 |
task_25 | 73.373 | 77.495 | 78.212 |
task_26 | 78.428 | 79.896 | 80.134 |
task_27 | 81.199 | 82.495 | 82.741 |
task_28 | 83.665 | 84.958 | 85.153 |
task_29 | 85.205 | 86.561 | 86.766 |
task_30 | 87.690 | 89.357 | 89.778 |
task_31 | 90.419 | 91.970 | 92.282 |
task_32 | 93.024 | 94.610 | 95.031 |
task_33 | 95.636 | 97.495 | 97.745 |
task_34 | 98.857 | 100.626 | 100.877 |
task_35 | 101.926 | 103.271 | 103.477 |
task_36 | 103.915 | 105.523 | 105.875 |
task_37 | 105.996 | 107.412 | 107.622 |
task_38 | 108.409 | 112.610 | 113.214 |
task_39 | 114.054 | 115.998 | 116.221 |
Prefect setup
We set up Prefect version 2.14.4. We wrote our own simple docker compose since we couldn't find a recommended one in Prefect's documentation. We chose to use PostgreSQL as the database, as it is the recommended option for production use cases.
version: '3.8'
services:
  postgres:
    image: postgres:14
    restart: unless-stopped
    volumes:
      - db_data:/var/lib/postgresql/data
    expose:
      - 5432
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: prefect
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres']
      interval: 10s
      timeout: 5s
      retries: 5
  prefect-server:
    image: prefecthq/prefect:2-latest
    command:
      - prefect
      - server
      - start
    ports:
      - 4200:4200
    depends_on:
      postgres:
        condition: service_started
    volumes:
      - ${PWD}/prefect:/root/.prefect
      - ${PWD}/flows:/flows
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://postgres:changeme@postgres:5432/prefect
      PREFECT_LOGGING_SERVER_LEVEL: INFO
      PREFECT_API_URL: http://localhost:4200/api
volumes:
  db_data: null
The flow was defined using the following Python file.
from prefect import flow, task

ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

@task
def fibo_task():
    return fibo(FIBO_N)

@flow(name="bench_{}".format(ITER))
def benchmark_flow():
    for i in range(ITER):
        fibo_task()

if __name__ == "__main__":
    benchmark_flow.serve(name="bench_{}".format(ITER))
Results
For 10 long-running tasks:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 1.270 | 2.629 |
task_01 | 2.673 | 2.703 | 4.059 |
task_02 | 4.095 | 4.121 | 5.475 |
task_03 | 5.508 | 5.534 | 6.916 |
task_04 | 6.951 | 6.979 | 8.337 |
task_05 | 8.373 | 8.401 | 9.816 |
task_06 | 9.849 | 9.874 | 11.253 |
task_07 | 11.287 | 11.313 | 12.675 |
task_08 | 12.710 | 12.737 | 14.070 |
task_09 | 14.102 | 14.129 | 15.489 |
For 40 lightweight tasks run sequentially:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 1.213 | 1.257 |
task_01 | 1.294 | 1.321 | 1.362 |
task_02 | 1.394 | 1.423 | 1.463 |
task_03 | 1.496 | 1.522 | 1.558 |
task_04 | 1.587 | 1.612 | 1.647 |
task_05 | 1.676 | 1.700 | 1.738 |
task_06 | 1.767 | 1.791 | 1.828 |
task_07 | 1.858 | 1.882 | 1.943 |
task_08 | 1.974 | 1.998 | 2.037 |
task_09 | 2.068 | 2.093 | 2.131 |
task_10 | 2.162 | 2.188 | 2.228 |
task_11 | 2.260 | 2.292 | 2.330 |
task_12 | 2.359 | 2.382 | 2.420 |
task_13 | 2.449 | 2.476 | 2.517 |
task_14 | 2.548 | 2.573 | 2.612 |
task_15 | 2.640 | 2.670 | 2.713 |
task_16 | 2.742 | 2.765 | 2.800 |
task_17 | 2.828 | 2.851 | 2.886 |
task_18 | 2.916 | 2.940 | 2.975 |
task_19 | 3.004 | 3.028 | 3.066 |
task_20 | 3.095 | 3.119 | 3.156 |
task_21 | 3.187 | 3.211 | 3.247 |
task_22 | 3.276 | 3.299 | 3.335 |
task_23 | 3.364 | 3.389 | 3.427 |
task_24 | 3.462 | 3.489 | 3.528 |
task_25 | 3.557 | 3.579 | 3.613 |
task_26 | 3.641 | 3.664 | 3.699 |
task_27 | 3.726 | 3.751 | 3.788 |
task_28 | 3.817 | 3.839 | 3.873 |
task_29 | 3.900 | 3.921 | 4.004 |
task_30 | 4.033 | 4.059 | 4.094 |
task_31 | 4.123 | 4.151 | 4.185 |
task_32 | 4.211 | 4.234 | 4.267 |
task_33 | 4.293 | 4.315 | 4.349 |
task_34 | 4.377 | 4.404 | 4.442 |
task_35 | 4.470 | 4.492 | 4.526 |
task_36 | 4.555 | 4.577 | 4.611 |
task_37 | 4.638 | 4.661 | 4.696 |
task_38 | 4.726 | 4.749 | 4.784 |
task_39 | 4.814 | 4.838 | 4.872 |
Temporal setup
We set up Temporal version 2.19.0 using the docker-compose.yml from the official GitHub repository.
The flow was defined using the following Python file. We executed it on the EC2 instance, using Python 3.10.12.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker

# fibo() is the naive recursive function shown earlier
ITER = 10  # respectively 40
FIBO_N = 33  # respectively 10

@activity.defn
async def fibo_activity(n: int) -> int:
    return fibo(n)

@workflow.defn
class BenchWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Run the activities one after the other
        for i in range(ITER):
            await workflow.execute_activity(
                fibo_activity,
                FIBO_N,
                activity_id="task_{}".format(i),
                start_to_close_timeout=timedelta(seconds=60),
            )

async def main():
    client = await Client.connect("localhost:7233")
    flow_name = "bench-{}".format(ITER)
    # The worker runs in the same process as the client starting the workflow
    async with Worker(
        client,
        task_queue=flow_name,
        workflows=[BenchWorkflow],
        activities=[fibo_activity],
    ):
        await client.execute_workflow(
            BenchWorkflow.run,
            id=flow_name,
            task_queue=flow_name,
        )

if __name__ == "__main__":
    asyncio.run(main())
Results
For 10 long-running tasks:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.012 | 1.357 |
task_01 | 1.380 | 1.388 | 2.697 |
task_02 | 2.720 | 2.729 | 4.034 |
task_03 | 4.056 | 4.065 | 5.371 |
task_04 | 5.394 | 5.403 | 6.711 |
task_05 | 6.733 | 6.742 | 8.050 |
task_06 | 8.074 | 8.083 | 9.388 |
task_07 | 9.411 | 9.420 | 10.739 |
task_08 | 10.762 | 10.773 | 12.086 |
task_09 | 12.111 | 12.120 | 13.434 |
For 40 lightweight tasks run sequentially:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.009 | 0.016 |
task_01 | 0.034 | 0.044 | 0.052 |
task_02 | 0.072 | 0.079 | 0.087 |
task_03 | 0.107 | 0.116 | 0.124 |
task_04 | 0.144 | 0.153 | 0.161 |
task_05 | 0.180 | 0.189 | 0.197 |
task_06 | 0.218 | 0.227 | 0.235 |
task_07 | 0.256 | 0.265 | 0.273 |
task_08 | 0.296 | 0.305 | 0.312 |
task_09 | 0.332 | 0.340 | 0.348 |
task_10 | 0.367 | 0.376 | 0.383 |
task_11 | 0.403 | 0.412 | 0.420 |
task_12 | 0.440 | 0.449 | 0.457 |
task_13 | 0.486 | 0.498 | 0.507 |
task_14 | 0.527 | 0.536 | 0.545 |
task_15 | 0.565 | 0.574 | 0.583 |
task_16 | 0.622 | 0.660 | 0.669 |
task_17 | 0.721 | 0.759 | 0.768 |
task_18 | 0.820 | 0.859 | 0.867 |
task_19 | 0.920 | 0.959 | 0.967 |
task_20 | 1.020 | 1.059 | 1.069 |
task_21 | 1.122 | 1.159 | 1.167 |
task_22 | 1.221 | 1.259 | 1.268 |
task_23 | 1.321 | 1.360 | 1.368 |
task_24 | 1.421 | 1.460 | 1.468 |
task_25 | 1.521 | 1.560 | 1.568 |
task_26 | 1.622 | 1.660 | 1.669 |
task_27 | 1.721 | 1.759 | 1.767 |
task_28 | 1.822 | 1.859 | 1.867 |
task_29 | 1.921 | 1.960 | 1.969 |
task_30 | 2.021 | 2.059 | 2.067 |
task_31 | 2.121 | 2.160 | 2.168 |
task_32 | 2.220 | 2.260 | 2.269 |
task_33 | 2.322 | 2.359 | 2.368 |
task_34 | 2.427 | 2.459 | 2.467 |
task_35 | 2.522 | 2.559 | 2.568 |
task_36 | 2.621 | 2.659 | 2.668 |
task_37 | 2.721 | 2.759 | 2.768 |
task_38 | 2.820 | 2.859 | 2.867 |
task_39 | 2.921 | 2.959 | 2.967 |
Kestra setup
We set up Kestra version v0.19.0 using the docker-compose.yml from their official documentation. We made some adjustments to it to have a setup similar to the other orchestrators.
The flow we used to run the benchmarks is the following:
id: benchmark
namespace: company.team
inputs:
  - id: n
    type: INT
  - id: iters
    type: INT
tasks:
  - id: getIterations
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    env:
      iters: "{{ inputs.iters }}"
    script: |
      import os
      from kestra import Kestra
      iters = int(os.getenv("iters"))
      iterations = list(range(0, iters))
      Kestra.outputs({'iterations': iterations})
  - id: processIterations
    type: io.kestra.plugin.core.flow.ForEach
    values: '{{ outputs.getIterations.vars.iterations }}'
    concurrencyLimit: 1
    tasks:
      - id: python
        type: io.kestra.plugin.scripts.python.Script
        containerImage: python:slim
        env:
          N: "{{ inputs.n }}"
        script: |
          import os
          def fibo(n: int):
              if n <= 1:
                  return n
              else:
                  return fibo(n - 1) + fibo(n - 2)
          n = int(os.getenv("N"))
          print(str(fibo(n)))
We executed it once with n=33, iters=10 and once with n=10 and iters=40. Note that we set the concurrency limit to 1, meaning all tasks run sequentially on one worker. Furthermore, no extra Python dependencies had to be installed during the execution of those flows.
Results
For 10 long-running tasks:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.047 | 1.348 |
task_01 | 1.348 | 1.402 | 2.683 |
task_02 | 2.683 | 2.730 | 4.050 |
task_03 | 4.050 | 4.115 | 5.424 |
task_04 | 5.424 | 5.471 | 6.835 |
task_05 | 6.835 | 6.895 | 8.208 |
task_06 | 8.208 | 8.254 | 9.620 |
task_07 | 9.620 | 9.678 | 10.986 |
task_08 | 10.986 | 11.030 | 12.367 |
task_09 | 12.367 | 12.415 | 13.702 |
For 40 lightweight tasks run sequentially:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.062 | 0.832 |
task_01 | 0.832 | 0.888 | 1.479 |
task_02 | 1.479 | 1.524 | 2.137 |
task_03 | 2.137 | 2.195 | 2.840 |
task_04 | 2.840 | 2.885 | 3.502 |
task_05 | 3.502 | 3.561 | 4.177 |
task_06 | 4.177 | 4.222 | 4.846 |
task_07 | 4.846 | 4.902 | 5.504 |
task_08 | 5.504 | 5.552 | 6.243 |
task_09 | 6.243 | 6.325 | 6.935 |
task_10 | 6.935 | 7.005 | 7.612 |
task_11 | 7.612 | 7.668 | 8.278 |
task_12 | 8.278 | 8.323 | 8.960 |
task_13 | 8.960 | 9.011 | 9.629 |
task_14 | 9.629 | 9.090 | 10.290 |
task_15 | 10.290 | 10.385 | 11.055 |
task_16 | 11.055 | 11.123 | 11.807 |
task_17 | 11.807 | 11.863 | 12.515 |
task_18 | 12.515 | 12.578 | 13.153 |
task_19 | 13.153 | 13.224 | 13.869 |
task_20 | 13.869 | 13.939 | 14.614 |
task_21 | 14.614 | 14.685 | 15.323 |
task_22 | 15.323 | 15.388 | 16.022 |
task_23 | 16.022 | 16.089 | 16.772 |
task_24 | 16.772 | 16.831 | 17.453 |
task_25 | 17.453 | 17.508 | 18.120 |
task_26 | 18.120 | 18.177 | 18.780 |
task_27 | 18.780 | 18.831 | 19.457 |
task_28 | 19.457 | 19.525 | 20.149 |
task_29 | 20.149 | 20.210 | 20.843 |
task_30 | 20.843 | 20.895 | 21.507 |
task_31 | 21.507 | 21.572 | 22.206 |
task_32 | 22.206 | 22.267 | 22.904 |
task_33 | 22.904 | 22.982 | 23.597 |
task_34 | 23.597 | 23.667 | 24.282 |
task_35 | 24.282 | 24.343 | 25.010 |
task_36 | 25.010 | 25.072 | 25.694 |
task_37 | 25.694 | 25.752 | 26.368 |
task_38 | 26.368 | 26.440 | 27.075 |
task_39 | 27.075 | 27.132 | 27.762 |
Windmill setup
We set up Windmill version 1.204.1 using the docker-compose.yml from the official GitHub repository. We made some adjustments to it to have a setup similar to the other orchestrators. We set the number of workers to one and removed the native workers since they would have been useless.
We executed the Windmill benchmarks in both "normal" and "dedicated worker" mode. To implement the 2 flows in Windmill, we first created a script simply computing the Fibonacci numbers:
# Windmill script: `u/benchmarkuser/fibo_script`
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

def main(
    n: int,
):
    return fibo(n)
We then used this script in a simple flow composed of a for-loop sequentially executing the script. The YAML representation of the flow is as follows:
summary: Fibonacci benchmark flow
description: Flow running 10 (resp. 40) times Fibonacci of 33 (resp. 10)
value:
  modules:
    - id: a
      value:
        type: forloopflow
        modules:
          - id: b
            value:
              path: u/admin/fibo_script
              type: script
              input_transforms:
                n:
                  type: static
                  value: 33 # respectively 10
        iterator:
          expr: Array(10) # respectively 40
          type: javascript
        parallel: false
        skip_failures: true
schema:
  '$schema': https://json-schema.org/draft/2020-12/schema
  properties: {}
  required: []
  type: object
Results
For 10 long-running tasks in normal mode:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.789 |
task_01 | 0.847 | 0.852 | 1.630 |
task_02 | 1.691 | 1.695 | 2.516 |
task_03 | 2.575 | 2.579 | 3.349 |
task_04 | 3.409 | 3.412 | 4.179 |
task_05 | 4.237 | 4.241 | 5.008 |
task_06 | 5.066 | 5.070 | 5.852 |
task_07 | 5.912 | 5.915 | 6.685 |
task_08 | 6.743 | 6.747 | 7.519 |
task_09 | 7.578 | 7.582 | 8.351 |
For 40 lightweight tasks run sequentially in normal mode:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.052 |
task_01 | 0.111 | 0.115 | 0.163 |
task_02 | 0.220 | 0.224 | 0.272 |
task_03 | 0.330 | 0.334 | 0.382 |
task_04 | 0.440 | 0.443 | 0.490 |
task_05 | 0.547 | 0.551 | 0.598 |
task_06 | 0.655 | 0.659 | 0.706 |
task_07 | 0.763 | 0.767 | 0.813 |
task_08 | 0.872 | 0.875 | 0.925 |
task_09 | 0.982 | 0.987 | 1.036 |
task_10 | 1.093 | 1.097 | 1.144 |
task_11 | 1.202 | 1.205 | 1.252 |
task_12 | 1.313 | 1.317 | 1.373 |
task_13 | 1.432 | 1.436 | 1.488 |
task_14 | 1.545 | 1.548 | 1.595 |
task_15 | 1.656 | 1.659 | 1.704 |
task_16 | 1.762 | 1.766 | 1.812 |
task_17 | 1.869 | 1.873 | 1.920 |
task_18 | 1.978 | 1.982 | 2.029 |
task_19 | 2.087 | 2.091 | 2.141 |
task_20 | 2.198 | 2.201 | 2.251 |
task_21 | 2.310 | 2.313 | 2.360 |
task_22 | 2.417 | 2.420 | 2.466 |
task_23 | 2.524 | 2.528 | 2.574 |
task_24 | 2.631 | 2.634 | 2.680 |
task_25 | 2.739 | 2.743 | 2.789 |
task_26 | 2.846 | 2.851 | 2.897 |
task_27 | 2.954 | 2.958 | 3.005 |
task_28 | 3.063 | 3.066 | 3.112 |
task_29 | 3.168 | 3.172 | 3.218 |
task_30 | 3.275 | 3.279 | 3.326 |
task_31 | 3.383 | 3.386 | 3.432 |
task_32 | 3.489 | 3.493 | 3.539 |
task_33 | 3.596 | 3.600 | 3.646 |
task_34 | 3.704 | 3.707 | 3.753 |
task_35 | 3.812 | 3.815 | 3.863 |
task_36 | 3.920 | 3.923 | 3.972 |
task_37 | 4.030 | 4.034 | 4.083 |
task_38 | 4.140 | 4.143 | 4.190 |
task_39 | 4.248 | 4.252 | 4.300 |
In dedicated worker mode, we obtained the following results. For 10 long-running tasks:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.004 | 0.738 |
task_01 | 0.802 | 0.809 | 1.543 |
task_02 | 1.601 | 1.605 | 2.334 |
task_03 | 2.392 | 2.396 | 3.124 |
task_04 | 3.187 | 3.191 | 3.945 |
task_05 | 3.980 | 3.985 | 4.744 |
task_06 | 4.771 | 4.774 | 5.506 |
task_07 | 5.561 | 5.565 | 6.291 |
task_08 | 6.350 | 6.354 | 7.082 |
task_09 | 7.136 | 7.140 | 7.885 |
And for the 40 lightweight tasks:
Task | Created at | Started at | Completed at |
---|---|---|---|
task_00 | 0.000 | 0.003 | 0.005 |
task_01 | 0.062 | 0.065 | 0.067 |
task_02 | 0.123 | 0.126 | 0.128 |
task_03 | 0.184 | 0.188 | 0.190 |
task_04 | 0.247 | 0.251 | 0.253 |
task_05 | 0.310 | 0.314 | 0.316 |
task_06 | 0.372 | 0.376 | 0.378 |
task_07 | 0.434 | 0.437 | 0.439 |
task_08 | 0.496 | 0.500 | 0.502 |
task_09 | 0.559 | 0.563 | 0.565 |
task_10 | 0.622 | 0.625 | 0.627 |
task_11 | 0.684 | 0.687 | 0.689 |
task_12 | 0.746 | 0.750 | 0.752 |
task_13 | 0.809 | 0.813 | 0.815 |
task_14 | 0.873 | 0.877 | 0.879 |
task_15 | 0.934 | 0.938 | 0.940 |
task_16 | 0.997 | 1.000 | 1.002 |
task_17 | 1.059 | 1.062 | 1.064 |
task_18 | 1.120 | 1.124 | 1.128 |
task_19 | 1.182 | 1.186 | 1.188 |
task_20 | 1.244 | 1.248 | 1.250 |
task_21 | 1.306 | 1.309 | 1.311 |
task_22 | 1.368 | 1.371 | 1.373 |
task_23 | 1.429 | 1.432 | 1.434 |
task_24 | 1.491 | 1.494 | 1.496 |
task_25 | 1.552 | 1.555 | 1.557 |
task_26 | 1.614 | 1.618 | 1.620 |
task_27 | 1.677 | 1.681 | 1.683 |
task_28 | 1.740 | 1.744 | 1.746 |
task_29 | 1.802 | 1.806 | 1.808 |
task_30 | 1.864 | 1.867 | 1.869 |
task_31 | 1.926 | 1.930 | 1.932 |
task_32 | 1.988 | 1.992 | 1.994 |
task_33 | 2.050 | 2.054 | 2.056 |
task_34 | 2.112 | 2.116 | 2.118 |
task_35 | 2.174 | 2.178 | 2.181 |
task_36 | 2.237 | 2.240 | 2.242 |
task_37 | 2.300 | 2.303 | 2.305 |
task_38 | 2.362 | 2.366 | 2.368 |
task_39 | 2.424 | 2.427 | 2.429 |
Comparisons
We can exclude Airflow and Kestra from the previous chart:
At a macro level, Airflow took 54.668s to execute the 10 long-running tasks, whereas Prefect took 15.489s, Temporal 13.434s, Kestra 14.15s and Windmill 8.351s in normal mode (7.885s in dedicated worker mode).
The same can be observed for the 40 lightweight tasks, where Airflow took a total of 116.221s, Prefect 4.872s, Temporal 2.967s, Kestra 29.80s and Windmill 4.300s in normal mode (2.429s in dedicated worker mode).
By far, Airflow is the slowest. Temporal and Prefect are faster, but not as fast as Windmill. Kestra performs slowly for the 40 lightweight tasks, whereas it performs similarly to Temporal and Prefect for the 10 long-running tasks. For the 40 lightweight tasks, Windmill in normal mode was equivalent to Prefect and slower than Temporal. This can be explained by the fact that the way Temporal works is closer to the way Windmill works in dedicated worker mode: Windmill in normal mode does a cold start for each task, and when the tasks are numerous and lightweight, most of the execution time ends up being taken by the cold start. In dedicated worker mode, however, Windmill's behavior is closer to Temporal's, and the performance is similar, with a slight advantage for Windmill.
We can dig a little deeper and compare the orchestrators across three categories:
- Execution time: the time it takes for the orchestrator to execute the task once it has been assigned to an executor
- Assignment time: the time it takes for a task to be assigned to an executor once it has been created in the queue
- Transition time: the time it takes to create the following task once a task is finished
After looking at the macro numbers above, it's interesting to compare the time spent in each of the above categories, relative to the total time the orchestrator took to execute the flow.
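To make the three categories concrete, here is a minimal sketch of how the shares could be derived from the per-task timestamps listed in the result tables. The function name `breakdown` and the tuple layout are assumptions made for illustration, not the script used for the study. The sample rows are the first three Temporal long-running tasks.

```python
from typing import Dict, List, Tuple

# (created_at, started_at, completed_at), in seconds since the flow started
TaskTimes = Tuple[float, float, float]


def breakdown(tasks: List[TaskTimes]) -> Dict[str, float]:
    """Split the total flow duration into assignment, execution and transition."""
    # Assignment: time between a task being created and being started.
    assignment = sum(started - created for created, started, _ in tasks)
    # Execution: time between a task being started and being completed.
    execution = sum(completed - started for _, started, completed in tasks)
    # Transition: gap between one task completing and the next one being created.
    transition = sum(nxt[0] - cur[2] for cur, nxt in zip(tasks, tasks[1:]))
    total = tasks[-1][2] - tasks[0][0]
    return {
        "total": total,
        "assignment_pct": 100 * assignment / total,
        "execution_pct": 100 * execution / total,
        "transition_pct": 100 * transition / total,
    }


sample = [
    (0.000, 0.012, 1.357),
    (1.380, 1.388, 2.697),
    (2.720, 2.729, 4.034),
]
result = breakdown(sample)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

With this definition the three durations add up to the total flow duration, so the three percentages always sum to 100%, up to rounding.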
For the 10 long-running tasks flow, we see the following:
Metric | Airflow | Prefect | Temporal | Kestra | Windmill Normal | Windmill Dedicated Worker |
---|---|---|---|---|---|---|
Total duration (in seconds) | 54.668 | 15.489 | 13.434 | 14.15 | 8.351 | 7.885 |
Assignment | 40.36% | 9.77% | 0.71% | 3.65% | 0.47% | 0.55% |
Execution | 51.72% | 88.18% | 97.74% | 93.15% | 93.17% | 93.46% |
Transition | 7.93% | 2.05% | 1.55% | 3.20% | 6.36% | 6.00% |
The proportion of time spent in execution is large here since each task takes a long time to run. Airflow and Prefect spend a lot of time assigning tasks compared to the others. Looking at the actual numbers, both Prefect and Airflow spend a lot of time assigning the first tasks, but after that the assignment duration decreases; Airflow remains relatively slow though, while Prefect reaches decent performance. The same can be observed with the 40-task workflow below. Kestra's assignment and transition durations are somewhere in the middle, and it spends most of its time in the execution phase. Temporal and Windmill in normal mode are pretty similar. Windmill in dedicated worker mode is incredibly fast at executing the jobs, at the cost of spending a little more time on transitions, but overall it is the fastest.
If we look at the 40 lightweight tasks flow, we have:
Metric | Airflow | Prefect | Temporal | Kestra | Windmill Normal | Windmill Dedicated Worker |
---|---|---|---|---|---|---|
Total duration (in seconds) | 116.221 | 4.872 | 2.967 | 29.80 | 4.300 | 2.429 |
Assignment | 64.63% | 44.62% | 35.58% | 8.28% | 3.42% | 5.89% |
Execution | 10.77% | 31.73% | 11.26% | 85.35% | 44.19% | 3.42% |
Transition | 24.60% | 23.65% | 53.16% | 6.38% | 52.40% | 90.70% |
Here we see that Windmill in normal mode spends a greater portion of its time executing the tasks, which can be explained by the fact that Windmill runs a "cold start" for each task submitted to the worker. However, it's by far the fastest at assigning tasks to executors. As observed above, Windmill in dedicated worker mode is lightning fast at executing the tasks, but takes more time transitioning from one task to the next.
Conclusion
Airflow is the slowest in all categories, followed by Prefect and Kestra. If you're looking for a high-performance job orchestrator, they do not seem to be the best option. Temporal and Windmill perform better and are closer to each other, but in both cases Windmill performs better, either in normal mode or in dedicated worker mode. If you're looking for a job orchestrator for various long-running tasks, Windmill in normal mode will be the most performant solution, optimizing the duration of each task knowing that transitions and assignments will remain a small portion of the overall workload. To run lightweight tasks at a very fast pace, Windmill in dedicated worker mode should be your preferred choice, provided that the tasks are similar. It is lightning fast at execution and assignment.
Appendix: Scaling Windmill
We performed those benchmarks with a single worker, assuming the capacity to process jobs would scale linearly with the number of workers deployed on the stack. We haven't verified this assumption for Airflow, Prefect, Kestra and Temporal, but we scaled Windmill up to 100 virtual workers to verify it, and the conclusion is that it scales pretty linearly.
For this test, we deployed the same docker compose as above on an AWS m4.xlarge instance (4 vCPU, 16 GB of memory) and, to virtually increase the number of workers, we used the NUM_WORKERS environment variable Windmill accepts. Note that it is not strictly equivalent to adding real hardware to the stack, but until we reach the maximum capacity of the instance, both in terms of CPU and memory, we can assume it's a good approximation.
The other change we had to make was to bump max_connections to 1000 on PostgreSQL: as we add more and more workers, each worker needs to connect to the database, so we need to increase the maximum number of connections PostgreSQL allows.
The job we ran was a simple job sleeping for 100ms, which is a good average duration for a job running on an orchestrator.
import time

def main():
    time.sleep(0.1)
Finally, we ran it in Windmill's dedicated worker mode, and we used a specific endpoint to "bulk-create" the jobs before any worker could start pulling them from the queue. For this test to be representative, we had to measure the performance of Windmill processing a large number of jobs (10,000 in this case), and we quickly realised that the time it took to insert the jobs one by one in the queue was non-negligible and was affecting the measured performance of the workers.
The results are the following:
Number of workers | Throughput (jobs/sec) batch of 10K jobs |
---|---|
2 | 19.9 |
6 | 59.8 |
10 | 99.6 |
20 | 198 |
30 | 298 |
40 | 391 |
50 | 496 |
60 | 591 |
70 | 693 |
80 | 786 |
90 | 887 |
100 | 981 |
This shows that Windmill scales linearly with the number of workers (at least up to 100 workers). We can also notice that the throughput is close to optimal: given that the job takes 100ms to execute, each worker can process at most 10 jobs per second, so N workers processing jobs in parallel can't go above N*10 jobs per second, and Windmill is pretty close.
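The upper bound for N workers running 100ms jobs, N / 0.1s = 10 × N jobs per second, can be checked against the table with a few lines of arithmetic. This is a sketch only: the function and variable names are ours, the throughput figures are copied from the table above, and the ideal rate assumes each worker handles exactly one job at a time with zero overhead between jobs.

```python
JOB_DURATION_S = 0.1  # each job sleeps for 100 ms

# (number of workers, measured throughput in jobs/sec), copied from the table
measured = [
    (2, 19.9), (6, 59.8), (10, 99.6), (20, 198), (30, 298), (40, 391),
    (50, 496), (60, 591), (70, 693), (80, 786), (90, 887), (100, 981),
]


def ideal_throughput(workers: int, job_duration_s: float = JOB_DURATION_S) -> float:
    """Upper bound in jobs/sec if every worker is busy 100% of the time."""
    return workers / job_duration_s


def efficiency(workers: int, throughput: float) -> float:
    """Measured throughput as a fraction of the ideal upper bound."""
    return throughput / ideal_throughput(workers)


for workers, throughput in measured:
    print(
        f"{workers:>3} workers: {throughput:>6} jobs/s, "
        f"{efficiency(workers, throughput):.1%} of ideal"
    )
```

Every row of the table stays within a few percent of this bound, which is what the linear-scaling claim amounts to.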