
Airflow vs Prefect vs Temporal vs Kestra vs Windmill


We compared Airflow, Prefect, Temporal, Kestra and Windmill on the following use cases:

  • One flow composed of 40 lightweight tasks.
  • One flow composed of 10 long-running tasks.
More context

For additional insights about this study, refer to our blog post.

We chose to compute Fibonacci numbers as a simple task that can easily be run on all five orchestrators. Given that Airflow has first-class support for Python, we used Python for all of them. The function in charge of computing the Fibonacci numbers was deliberately naive:

```python
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)
```

After some testing, we chose to compute fibo(10) for the lightweight tasks (taking around 10ms in our setup), and fibo(33) for what we called "long-running" tasks (taking at least a few hundred milliseconds, as seen in the results).
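As a quick sanity check, the two task sizes can be timed locally with nothing but the standard library (a sketch; absolute numbers will vary with hardware and Python version):

```python
import time

def fibo(n: int):
    if n <= 1:
        return n
    return fibo(n - 1) + fibo(n - 2)

# Time the two workloads used in the benchmark: fibo(10) and fibo(33)
for n in (10, 33):
    start = time.perf_counter()
    result = fibo(n)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"fibo({n}) = {result} in {elapsed_ms:.1f}ms")
```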

On the infrastructure side, we kept it simple and used the docker-compose.yml recommended in the documentation of each orchestrator. We deployed the orchestrators on AWS m4.large instances.

Airflow setup

We set up Airflow version 2.7.3 using the docker-compose.yaml referenced in Airflow's official documentation.

The DAG was the following:

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

with DAG(
    dag_id="bench_{}".format(ITER),
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["benchmark"],
) as dag:
    for i in range(ITER):
        @task(task_id=f"task_{i}")
        def task_module():
            return fibo(FIBO_N)

        fibo_task = task_module()

        if i > 0:
            previous_task >> fibo_task
        previous_task = fibo_task
```

Results

For 10 long running tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 4.347 | 6.910 |
| task_01 | 7.315 | 9.690 | 16.387 |
| task_02 | 16.545 | 18.361 | 20.077 |
| task_03 | 20.130 | 21.785 | 23.487 |
| task_04 | 23.869 | 25.319 | 27.463 |
| task_05 | 28.061 | 29.665 | 32.354 |
| task_06 | 33.210 | 34.996 | 37.498 |
| task_07 | 38.378 | 39.938 | 41.754 |
| task_08 | 42.366 | 43.933 | 45.887 |
| task_09 | 46.281 | 50.179 | 54.668 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 4.335 | 4.752 |
| task_01 | 6.236 | 8.710 | 8.923 |
| task_02 | 9.792 | 11.117 | 11.320 |
| task_03 | 12.157 | 13.513 | 13.733 |
| task_04 | 13.804 | 15.413 | 15.622 |
| task_05 | 16.201 | 17.587 | 17.849 |
| task_06 | 18.902 | 20.227 | 20.432 |
| task_07 | 21.262 | 22.691 | 22.958 |
| task_08 | 24.015 | 25.349 | 25.558 |
| task_09 | 26.368 | 28.158 | 28.635 |
| task_10 | 29.361 | 31.035 | 31.357 |
| task_11 | 31.861 | 36.245 | 37.062 |
| task_12 | 38.868 | 42.180 | 42.388 |
| task_13 | 42.641 | 44.027 | 44.280 |
| task_14 | 45.321 | 46.676 | 46.877 |
| task_15 | 47.676 | 49.073 | 49.298 |
| task_16 | 50.432 | 51.786 | 51.999 |
| task_17 | 52.415 | 53.852 | 54.051 |
| task_18 | 54.155 | 55.564 | 55.771 |
| task_19 | 56.575 | 58.346 | 58.781 |
| task_20 | 59.254 | 60.999 | 61.355 |
| task_21 | 62.071 | 63.671 | 64.079 |
| task_22 | 64.366 | 66.011 | 66.442 |
| task_23 | 67.061 | 68.619 | 68.866 |
| task_24 | 69.601 | 71.842 | 72.303 |
| task_25 | 73.373 | 77.495 | 78.212 |
| task_26 | 78.428 | 79.896 | 80.134 |
| task_27 | 81.199 | 82.495 | 82.741 |
| task_28 | 83.665 | 84.958 | 85.153 |
| task_29 | 85.205 | 86.561 | 86.766 |
| task_30 | 87.690 | 89.357 | 89.778 |
| task_31 | 90.419 | 91.970 | 92.282 |
| task_32 | 93.024 | 94.610 | 95.031 |
| task_33 | 95.636 | 97.495 | 97.745 |
| task_34 | 98.857 | 100.626 | 100.877 |
| task_35 | 101.926 | 103.271 | 103.477 |
| task_36 | 103.915 | 105.523 | 105.875 |
| task_37 | 105.996 | 107.412 | 107.622 |
| task_38 | 108.409 | 112.610 | 113.214 |
| task_39 | 114.054 | 115.998 | 116.221 |

Prefect setup

We set up Prefect version 2.14.4. We wrote our own simple docker-compose since we couldn't find a recommended one in Prefect's documentation. We chose PostgreSQL as the database, as it is the recommended option for production use cases.

```yaml
version: '3.8'

services:
  postgres:
    image: postgres:14
    restart: unless-stopped
    volumes:
      - db_data:/var/lib/postgresql/data
    expose:
      - 5432
    environment:
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: prefect
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres']
      interval: 10s
      timeout: 5s
      retries: 5

  prefect-server:
    image: prefecthq/prefect:2-latest
    command:
      - prefect
      - server
      - start
    ports:
      - 4200:4200
    depends_on:
      postgres:
        condition: service_started
    volumes:
      - ${PWD}/prefect:/root/.prefect
      - ${PWD}/flows:/flows
    environment:
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://postgres:changeme@postgres:5432/prefect
      PREFECT_LOGGING_SERVER_LEVEL: INFO
      PREFECT_API_URL: http://localhost:4200/api

volumes:
  db_data: null
```

The flow was defined using the following Python file.

```python
from prefect import flow, task

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

@task
def fibo_task():
    return fibo(FIBO_N)

@flow(name="bench_{}".format(ITER))
def benchmark_flow():
    for i in range(ITER):
        fibo_task()

if __name__ == "__main__":
    benchmark_flow.serve(name="bench_{}".format(ITER))
```

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 1.270 | 2.629 |
| task_01 | 2.673 | 2.703 | 4.059 |
| task_02 | 4.095 | 4.121 | 5.475 |
| task_03 | 5.508 | 5.534 | 6.916 |
| task_04 | 6.951 | 6.979 | 8.337 |
| task_05 | 8.373 | 8.401 | 9.816 |
| task_06 | 9.849 | 9.874 | 11.253 |
| task_07 | 11.287 | 11.313 | 12.675 |
| task_08 | 12.710 | 12.737 | 14.070 |
| task_09 | 14.102 | 14.129 | 15.489 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 1.213 | 1.257 |
| task_01 | 1.294 | 1.321 | 1.362 |
| task_02 | 1.394 | 1.423 | 1.463 |
| task_03 | 1.496 | 1.522 | 1.558 |
| task_04 | 1.587 | 1.612 | 1.647 |
| task_05 | 1.676 | 1.700 | 1.738 |
| task_06 | 1.767 | 1.791 | 1.828 |
| task_07 | 1.858 | 1.882 | 1.943 |
| task_08 | 1.974 | 1.998 | 2.037 |
| task_09 | 2.068 | 2.093 | 2.131 |
| task_10 | 2.162 | 2.188 | 2.228 |
| task_11 | 2.260 | 2.292 | 2.330 |
| task_12 | 2.359 | 2.382 | 2.420 |
| task_13 | 2.449 | 2.476 | 2.517 |
| task_14 | 2.548 | 2.573 | 2.612 |
| task_15 | 2.640 | 2.670 | 2.713 |
| task_16 | 2.742 | 2.765 | 2.800 |
| task_17 | 2.828 | 2.851 | 2.886 |
| task_18 | 2.916 | 2.940 | 2.975 |
| task_19 | 3.004 | 3.028 | 3.066 |
| task_20 | 3.095 | 3.119 | 3.156 |
| task_21 | 3.187 | 3.211 | 3.247 |
| task_22 | 3.276 | 3.299 | 3.335 |
| task_23 | 3.364 | 3.389 | 3.427 |
| task_24 | 3.462 | 3.489 | 3.528 |
| task_25 | 3.557 | 3.579 | 3.613 |
| task_26 | 3.641 | 3.664 | 3.699 |
| task_27 | 3.726 | 3.751 | 3.788 |
| task_28 | 3.817 | 3.839 | 3.873 |
| task_29 | 3.900 | 3.921 | 4.004 |
| task_30 | 4.033 | 4.059 | 4.094 |
| task_31 | 4.123 | 4.151 | 4.185 |
| task_32 | 4.211 | 4.234 | 4.267 |
| task_33 | 4.293 | 4.315 | 4.349 |
| task_34 | 4.377 | 4.404 | 4.442 |
| task_35 | 4.470 | 4.492 | 4.526 |
| task_36 | 4.555 | 4.577 | 4.611 |
| task_37 | 4.638 | 4.661 | 4.696 |
| task_38 | 4.726 | 4.749 | 4.784 |
| task_39 | 4.814 | 4.838 | 4.872 |

Temporal setup

We set up Temporal version 2.19.0 using the docker-compose.yml from the official GitHub repository.

The flow was defined using the following Python file. We executed it on the EC2 instance, using Python 3.10.12.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker

ITER = 10    # respectively 40
FIBO_N = 33  # respectively 10

# fibo as defined above

@activity.defn
async def fibo_activity(n: int) -> int:
    return fibo(n)

@workflow.defn
class BenchWorkflow:
    @workflow.run
    async def run(self) -> None:
        for i in range(ITER):
            await workflow.execute_activity(
                fibo_activity,
                FIBO_N,
                activity_id="task_{}".format(i),
                start_to_close_timeout=timedelta(seconds=60),
            )

async def main():
    client = await Client.connect("localhost:7233")
    flow_name = "bench-{}".format(ITER)
    async with Worker(
        client,
        task_queue=flow_name,
        workflows=[BenchWorkflow],
        activities=[fibo_activity],
    ):
        await client.execute_workflow(
            BenchWorkflow.run,
            id=flow_name,
            task_queue=flow_name,
        )

if __name__ == "__main__":
    asyncio.run(main())
```

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.012 | 1.357 |
| task_01 | 1.380 | 1.388 | 2.697 |
| task_02 | 2.720 | 2.729 | 4.034 |
| task_03 | 4.056 | 4.065 | 5.371 |
| task_04 | 5.394 | 5.403 | 6.711 |
| task_05 | 6.733 | 6.742 | 8.050 |
| task_06 | 8.074 | 8.083 | 9.388 |
| task_07 | 9.411 | 9.420 | 10.739 |
| task_08 | 10.762 | 10.773 | 12.086 |
| task_09 | 12.111 | 12.120 | 13.434 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.009 | 0.016 |
| task_01 | 0.034 | 0.044 | 0.052 |
| task_02 | 0.072 | 0.079 | 0.087 |
| task_03 | 0.107 | 0.116 | 0.124 |
| task_04 | 0.144 | 0.153 | 0.161 |
| task_05 | 0.180 | 0.189 | 0.197 |
| task_06 | 0.218 | 0.227 | 0.235 |
| task_07 | 0.256 | 0.265 | 0.273 |
| task_08 | 0.296 | 0.305 | 0.312 |
| task_09 | 0.332 | 0.340 | 0.348 |
| task_10 | 0.367 | 0.376 | 0.383 |
| task_11 | 0.403 | 0.412 | 0.420 |
| task_12 | 0.440 | 0.449 | 0.457 |
| task_13 | 0.486 | 0.498 | 0.507 |
| task_14 | 0.527 | 0.536 | 0.545 |
| task_15 | 0.565 | 0.574 | 0.583 |
| task_16 | 0.622 | 0.660 | 0.669 |
| task_17 | 0.721 | 0.759 | 0.768 |
| task_18 | 0.820 | 0.859 | 0.867 |
| task_19 | 0.920 | 0.959 | 0.967 |
| task_20 | 1.020 | 1.059 | 1.069 |
| task_21 | 1.122 | 1.159 | 1.167 |
| task_22 | 1.221 | 1.259 | 1.268 |
| task_23 | 1.321 | 1.360 | 1.368 |
| task_24 | 1.421 | 1.460 | 1.468 |
| task_25 | 1.521 | 1.560 | 1.568 |
| task_26 | 1.622 | 1.660 | 1.669 |
| task_27 | 1.721 | 1.759 | 1.767 |
| task_28 | 1.822 | 1.859 | 1.867 |
| task_29 | 1.921 | 1.960 | 1.969 |
| task_30 | 2.021 | 2.059 | 2.067 |
| task_31 | 2.121 | 2.160 | 2.168 |
| task_32 | 2.220 | 2.260 | 2.269 |
| task_33 | 2.322 | 2.359 | 2.368 |
| task_34 | 2.427 | 2.459 | 2.467 |
| task_35 | 2.522 | 2.559 | 2.568 |
| task_36 | 2.621 | 2.659 | 2.668 |
| task_37 | 2.721 | 2.759 | 2.768 |
| task_38 | 2.820 | 2.859 | 2.867 |
| task_39 | 2.921 | 2.959 | 2.967 |

Kestra setup

We set up Kestra version 0.19.0 using the docker-compose.yml from the official documentation. We made some adjustments to it to get a setup similar to the other orchestrators.

The flow we used to run the benchmarks is the following:

```yaml
id: benchmark
namespace: company.team
inputs:
  - id: n
    type: INT
  - id: iters
    type: INT
tasks:
  - id: getIterations
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    env:
      iters: "{{ inputs.iters }}"
    script: |
      import os
      from kestra import Kestra
      iters = int(os.getenv("iters"))
      iterations = list(range(0, iters))
      Kestra.outputs({'iterations': iterations})

  - id: processIterations
    type: io.kestra.plugin.core.flow.ForEach
    values: '{{ outputs.getIterations.vars.iterations }}'
    concurrencyLimit: 1
    tasks:
      - id: python
        type: io.kestra.plugin.scripts.python.Script
        containerImage: python:slim
        env:
          N: "{{ inputs.n }}"
        script: |
          import os

          def fibo(n: int):
              if n <= 1:
                  return n
              else:
                  return fibo(n - 1) + fibo(n - 2)

          n = int(os.getenv("N"))
          print(str(fibo(n)))
```

We executed it once with n=33, iters=10 and once with n=10, iters=40. Note that we set the concurrency limit to 1, meaning all tasks run sequentially on one worker. Furthermore, no extra Python dependencies had to be installed during the execution of those flows.

Results

For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.047 | 1.348 |
| task_01 | 1.348 | 1.402 | 2.683 |
| task_02 | 2.683 | 2.730 | 4.050 |
| task_03 | 4.050 | 4.115 | 5.424 |
| task_04 | 5.424 | 5.471 | 6.835 |
| task_05 | 6.835 | 6.895 | 8.208 |
| task_06 | 8.208 | 8.254 | 9.620 |
| task_07 | 9.620 | 9.678 | 10.986 |
| task_08 | 10.986 | 11.030 | 12.367 |
| task_09 | 12.367 | 12.415 | 13.702 |

For 40 lightweight tasks run sequentially:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.062 | 0.832 |
| task_01 | 0.832 | 0.888 | 1.479 |
| task_02 | 1.479 | 1.524 | 2.137 |
| task_03 | 2.137 | 2.195 | 2.840 |
| task_04 | 2.840 | 2.885 | 3.502 |
| task_05 | 3.502 | 3.561 | 4.177 |
| task_06 | 4.177 | 4.222 | 4.846 |
| task_07 | 4.846 | 4.902 | 5.504 |
| task_08 | 5.504 | 5.552 | 6.243 |
| task_09 | 6.243 | 6.325 | 6.935 |
| task_10 | 6.935 | 7.005 | 7.612 |
| task_11 | 7.612 | 7.668 | 8.278 |
| task_12 | 8.278 | 8.323 | 8.960 |
| task_13 | 8.960 | 9.011 | 9.629 |
| task_14 | 9.629 | 9.090 | 10.290 |
| task_15 | 10.290 | 10.385 | 11.055 |
| task_16 | 11.055 | 11.123 | 11.807 |
| task_17 | 11.807 | 11.863 | 12.515 |
| task_18 | 12.515 | 12.578 | 13.153 |
| task_19 | 13.153 | 13.224 | 13.869 |
| task_20 | 13.869 | 13.939 | 14.614 |
| task_21 | 14.614 | 14.685 | 15.323 |
| task_22 | 15.323 | 15.388 | 16.022 |
| task_23 | 16.022 | 16.089 | 16.772 |
| task_24 | 16.772 | 16.831 | 17.453 |
| task_25 | 17.453 | 17.508 | 18.120 |
| task_26 | 18.120 | 18.177 | 18.780 |
| task_27 | 18.780 | 18.831 | 19.457 |
| task_28 | 19.457 | 19.525 | 20.149 |
| task_29 | 20.149 | 20.210 | 20.843 |
| task_30 | 20.843 | 20.895 | 21.507 |
| task_31 | 21.507 | 21.572 | 22.206 |
| task_32 | 22.206 | 22.267 | 22.904 |
| task_33 | 22.904 | 22.982 | 23.597 |
| task_34 | 23.597 | 23.667 | 24.282 |
| task_35 | 24.282 | 24.343 | 25.010 |
| task_36 | 25.010 | 25.072 | 25.694 |
| task_37 | 25.694 | 25.752 | 26.368 |
| task_38 | 26.368 | 26.440 | 27.075 |
| task_39 | 27.075 | 27.132 | 27.762 |

Windmill setup

We set up Windmill version 1.204.1 using the docker-compose.yml from the official GitHub repository. We made some adjustments to it to get a setup similar to the other orchestrators. We set the number of workers to one and removed the native workers since they would have been of no use.

We executed the Windmill benchmarks in both "normal" and "dedicated worker" mode. To implement the two flows in Windmill, we first created a script simply computing a Fibonacci number:

```python
# WINDMILL script: `u/benchmarkuser/fibo_script`
def fibo(n: int):
    if n <= 1:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

def main(
    n: int,
):
    return fibo(n)
```

And then we used this script in a simple flow composed of a for-loop sequentially executing the script. The YAML representation of the flow is as follows:

```yaml
summary: Fibonacci benchmark flow
description: Flow running 10 (resp. 40) times Fibonacci of 33 (resp. 10)
value:
  modules:
    - id: a
      value:
        type: forloopflow
        modules:
          - id: b
            value:
              path: u/admin/fibo_script
              type: script
              input_transforms:
                n:
                  type: static
                  value: 33 # respectively 10
        iterator:
          expr: Array(10) # respectively 40
          type: javascript
        parallel: false
        skip_failures: true
schema:
  '$schema': https://json-schema.org/draft/2020-12/schema
  properties: {}
  required: []
  type: object
```

Results

For 10 long running tasks in normal mode:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.004 | 0.789 |
| task_01 | 0.847 | 0.852 | 1.630 |
| task_02 | 1.691 | 1.695 | 2.516 |
| task_03 | 2.575 | 2.579 | 3.349 |
| task_04 | 3.409 | 3.412 | 4.179 |
| task_05 | 4.237 | 4.241 | 5.008 |
| task_06 | 5.066 | 5.070 | 5.852 |
| task_07 | 5.912 | 5.915 | 6.685 |
| task_08 | 6.743 | 6.747 | 7.519 |
| task_09 | 7.578 | 7.582 | 8.351 |

For 40 lightweight tasks run sequentially in normal mode:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.004 | 0.052 |
| task_01 | 0.111 | 0.115 | 0.163 |
| task_02 | 0.220 | 0.224 | 0.272 |
| task_03 | 0.330 | 0.334 | 0.382 |
| task_04 | 0.440 | 0.443 | 0.490 |
| task_05 | 0.547 | 0.551 | 0.598 |
| task_06 | 0.655 | 0.659 | 0.706 |
| task_07 | 0.763 | 0.767 | 0.813 |
| task_08 | 0.872 | 0.875 | 0.925 |
| task_09 | 0.982 | 0.987 | 1.036 |
| task_10 | 1.093 | 1.097 | 1.144 |
| task_11 | 1.202 | 1.205 | 1.252 |
| task_12 | 1.313 | 1.317 | 1.373 |
| task_13 | 1.432 | 1.436 | 1.488 |
| task_14 | 1.545 | 1.548 | 1.595 |
| task_15 | 1.656 | 1.659 | 1.704 |
| task_16 | 1.762 | 1.766 | 1.812 |
| task_17 | 1.869 | 1.873 | 1.920 |
| task_18 | 1.978 | 1.982 | 2.029 |
| task_19 | 2.087 | 2.091 | 2.141 |
| task_20 | 2.198 | 2.201 | 2.251 |
| task_21 | 2.310 | 2.313 | 2.360 |
| task_22 | 2.417 | 2.420 | 2.466 |
| task_23 | 2.524 | 2.528 | 2.574 |
| task_24 | 2.631 | 2.634 | 2.680 |
| task_25 | 2.739 | 2.743 | 2.789 |
| task_26 | 2.846 | 2.851 | 2.897 |
| task_27 | 2.954 | 2.958 | 3.005 |
| task_28 | 3.063 | 3.066 | 3.112 |
| task_29 | 3.168 | 3.172 | 3.218 |
| task_30 | 3.275 | 3.279 | 3.326 |
| task_31 | 3.383 | 3.386 | 3.432 |
| task_32 | 3.489 | 3.493 | 3.539 |
| task_33 | 3.596 | 3.600 | 3.646 |
| task_34 | 3.704 | 3.707 | 3.753 |
| task_35 | 3.812 | 3.815 | 3.863 |
| task_36 | 3.920 | 3.923 | 3.972 |
| task_37 | 4.030 | 4.034 | 4.083 |
| task_38 | 4.140 | 4.143 | 4.190 |
| task_39 | 4.248 | 4.252 | 4.300 |

In dedicated worker mode, we obtained the following results. For 10 long running tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.004 | 0.738 |
| task_01 | 0.802 | 0.809 | 1.543 |
| task_02 | 1.601 | 1.605 | 2.334 |
| task_03 | 2.392 | 2.396 | 3.124 |
| task_04 | 3.187 | 3.191 | 3.945 |
| task_05 | 3.980 | 3.985 | 4.744 |
| task_06 | 4.771 | 4.774 | 5.506 |
| task_07 | 5.561 | 5.565 | 6.291 |
| task_08 | 6.350 | 6.354 | 7.082 |
| task_09 | 7.136 | 7.140 | 7.885 |

And for the 40 lightweight tasks:

| Task | Created at | Started at | Completed at |
|---|---|---|---|
| task_00 | 0.000 | 0.003 | 0.005 |
| task_01 | 0.062 | 0.065 | 0.067 |
| task_02 | 0.123 | 0.126 | 0.128 |
| task_03 | 0.184 | 0.188 | 0.190 |
| task_04 | 0.247 | 0.251 | 0.253 |
| task_05 | 0.310 | 0.314 | 0.316 |
| task_06 | 0.372 | 0.376 | 0.378 |
| task_07 | 0.434 | 0.437 | 0.439 |
| task_08 | 0.496 | 0.500 | 0.502 |
| task_09 | 0.559 | 0.563 | 0.565 |
| task_10 | 0.622 | 0.625 | 0.627 |
| task_11 | 0.684 | 0.687 | 0.689 |
| task_12 | 0.746 | 0.750 | 0.752 |
| task_13 | 0.809 | 0.813 | 0.815 |
| task_14 | 0.873 | 0.877 | 0.879 |
| task_15 | 0.934 | 0.938 | 0.940 |
| task_16 | 0.997 | 1.000 | 1.002 |
| task_17 | 1.059 | 1.062 | 1.064 |
| task_18 | 1.120 | 1.124 | 1.128 |
| task_19 | 1.182 | 1.186 | 1.188 |
| task_20 | 1.244 | 1.248 | 1.250 |
| task_21 | 1.306 | 1.309 | 1.311 |
| task_22 | 1.368 | 1.371 | 1.373 |
| task_23 | 1.429 | 1.432 | 1.434 |
| task_24 | 1.491 | 1.494 | 1.496 |
| task_25 | 1.552 | 1.555 | 1.557 |
| task_26 | 1.614 | 1.618 | 1.620 |
| task_27 | 1.677 | 1.681 | 1.683 |
| task_28 | 1.740 | 1.744 | 1.746 |
| task_29 | 1.802 | 1.806 | 1.808 |
| task_30 | 1.864 | 1.867 | 1.869 |
| task_31 | 1.926 | 1.930 | 1.932 |
| task_32 | 1.988 | 1.992 | 1.994 |
| task_33 | 2.050 | 2.054 | 2.056 |
| task_34 | 2.112 | 2.116 | 2.118 |
| task_35 | 2.174 | 2.178 | 2.181 |
| task_36 | 2.237 | 2.240 | 2.242 |
| task_37 | 2.300 | 2.303 | 2.305 |
| task_38 | 2.362 | 2.366 | 2.368 |
| task_39 | 2.424 | 2.427 | 2.429 |

Comparisons


At a macro level, Airflow took 54.668s to execute the 10 long-running tasks, where Prefect took 15.489s, Temporal 13.434s, Kestra 14.15s and Windmill 8.351s in normal mode (7.885s in dedicated worker mode).

The same can be observed for the 40 lightweight tasks, where Airflow took a total of 116.221s, Prefect 4.872s, Temporal 2.967s, Kestra 29.80s and Windmill 4.300s in normal mode (2.429s in dedicated worker mode).

By far, Airflow is the slowest. Temporal and Prefect are faster, but not as fast as Windmill. Kestra performs slowly on the 40 lightweight tasks, whereas it performs similarly to Temporal and Prefect on the 10 long-running tasks. For the 40 lightweight tasks, Windmill in normal mode was equivalent to Prefect and slightly slower than Temporal. This can be explained by the fact that the way Temporal works is closer to the way Windmill works in dedicated worker mode: Windmill in normal mode does a cold start for each task, and when the tasks are numerous and lightweight, most of the execution time ends up being spent in the cold start. In dedicated worker mode however, Windmill's behavior is closer to Temporal's, and we can see that performance is similar, with a slight advantage for Windmill.

We can also dig a little deeper and compare the orchestrators across three categories:

  • Execution time: the time it takes for the orchestrator to execute a task once it has been assigned to an executor.
  • Assignment time: the time it takes for a task to be assigned to an executor once it has been created in the queue.
  • Transition time: the time it takes to create the next task once the previous one has finished.

After looking at the macro numbers above, it's interesting to compare the time spent in each of the above categories, relative to the total time the orchestrator took to execute the flow.
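To make these categories concrete, here is a small sketch (our own helper, not code from any of the orchestrators) that derives the three proportions from the per-task (created, started, completed) timestamps shown in the tables above:

```python
def categorize(rows):
    """Split a flow's duration into assignment, execution and transition
    percentages, given (created, started, completed) tuples per task."""
    # assignment: from creation in the queue to pickup by an executor
    assignment = sum(started - created for created, started, _ in rows)
    # execution: from pickup to completion
    execution = sum(completed - started for _, started, completed in rows)
    # transition: gap between one task completing and the next being created
    transition = sum(rows[i + 1][0] - rows[i][2] for i in range(len(rows) - 1))
    total = rows[-1][2] - rows[0][0]
    return {name: round(100 * part / total, 2)
            for name, part in (("assignment", assignment),
                               ("execution", execution),
                               ("transition", transition))}
```

Feeding it the ten rows of any of the tables above reproduces the corresponding column of the comparison tables below (up to rounding).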

For the 10 long running tasks flow, we see the following:

| | Airflow | Prefect | Temporal | Kestra | Windmill (normal) | Windmill (dedicated worker) |
|---|---|---|---|---|---|---|
| Total duration (in seconds) | 54.668 | 15.489 | 13.434 | 14.15 | 8.351 | 7.885 |
| Assignment | 40.36% | 9.77% | 0.71% | 3.65% | 0.47% | 0.55% |
| Execution | 51.72% | 88.18% | 97.74% | 93.15% | 93.17% | 93.46% |
| Transition | 7.93% | 2.05% | 1.55% | 3.20% | 6.36% | 6.00% |

The proportion of time spent in execution is important here since each task takes a long time to run. We see that Airflow and Prefect spend a lot of time assigning the tasks compared to the others (looking at the actual numbers, both Prefect and Airflow spend a long time assigning the first tasks, after which the assignment duration decreases. Kestra's assignment and transition durations are somewhere in the middle, and it spends most of its time in the execution phase. Airflow remains relatively slow though, while Prefect reaches decent performance. The exact same can be observed in the 40-task workflow below). Temporal and Windmill in normal mode are pretty similar. Windmill in dedicated worker mode is incredibly fast at executing the jobs, at the cost of spending a little more time on transitions, but overall it is the fastest.

If we look at the 40 lightweight tasks flow, we have:

| | Airflow | Prefect | Temporal | Kestra | Windmill (normal) | Windmill (dedicated worker) |
|---|---|---|---|---|---|---|
| Total duration (in seconds) | 116.221 | 4.872 | 2.967 | 29.80 | 4.300 | 2.429 |
| Assignment | 64.63% | 44.62% | 35.58% | 8.28% | 3.42% | 5.89% |
| Execution | 10.77% | 31.73% | 11.26% | 85.35% | 44.19% | 3.42% |
| Transition | 24.60% | 23.65% | 53.16% | 6.38% | 52.40% | 90.70% |

Here we see that Windmill spends a greater portion of time executing the tasks, which can be explained by the fact that Windmill runs a cold start for each task submitted to the worker. However, it is by far the fastest at assigning tasks to executors. As observed above, Windmill in dedicated worker mode is lightning fast at executing the tasks, but takes more time transitioning from one task to the next.

Conclusion

Airflow is the slowest in all categories, followed by Prefect and Kestra. If you're looking for a high-performance job orchestrator, they do not seem to be the best option. Temporal and Windmill perform better and are closer to each other, but in both cases Windmill comes out ahead, whether in normal or dedicated worker mode. If you're looking for a job orchestrator for various long-running tasks, Windmill in normal mode will be the most performant solution, optimizing the duration of each task while transitions and assignments remain a small portion of the overall workload. To run lightweight tasks at a very fast pace, Windmill in dedicated worker mode should be your preferred choice, provided that the tasks are similar. It is lightning fast at execution and assignment.

Appendix: Scaling Windmill

We performed those benchmarks with a single worker, assuming the capacity to process jobs would scale linearly with the number of workers deployed on the stack. We haven't verified this assumption for Airflow, Prefect, Kestra and Temporal, but we scaled Windmill up to 100 virtual workers to verify it. The conclusion is that it scales pretty linearly.

For this test, we deployed the same docker-compose as above on an AWS m4.xlarge instance (4 vCPUs, 16 GB of memory) and, to virtually increase the number of workers, we used the NUM_WORKERS environment variable Windmill accepts. Note that this is not strictly equivalent to adding real hardware to the stack, but until we reach the maximum capacity of the instance, both in terms of CPU and memory, we can assume it is a good approximation. The other change we had to make was to bump max_connections to 1000 on PostgreSQL: as we add more and more workers, each worker needs its own connection to the database, so we had to increase the maximum number of connections PostgreSQL allows.

The job we ran was a simple job sleeping for 100ms, which is a good average duration for a job running on an orchestrator.

```python
import time

def main():
    time.sleep(0.1)
```

Finally, we ran it in Windmill's dedicated worker mode, and we used a specific endpoint to bulk-create the jobs before any worker could start pulling them from the queue. For this test to be representative, we had to measure the performance of Windmill processing a large number of jobs (10,000 in this case), and we quickly realised that the time taken to insert the jobs one by one into the queue was non-negligible and was affecting the measured performance of the workers.

The results are the following:

| Number of workers | Throughput (jobs/sec), batch of 10K jobs |
|---|---|
| 2 | 19.9 |
| 6 | 59.8 |
| 10 | 99.6 |
| 20 | 198 |
| 30 | 298 |
| 40 | 391 |
| 50 | 496 |
| 60 | 591 |
| 70 | 693 |
| 80 | 786 |
| 90 | 887 |
| 100 | 981 |

This proves that Windmill scales linearly with the number of workers (at least up to 100 workers). We can also notice that the throughput is close to optimal: given that each job takes 100ms to execute, N workers processing jobs in parallel cannot exceed 10 × N jobs per second, and Windmill gets pretty close to that ceiling.
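The claim can be checked directly from the table above: a single worker running 100ms jobs can do at most 10 jobs per second, so N workers are capped at 10 × N jobs/sec. A small sketch computing the efficiency against that ceiling:

```python
# Throughput numbers copied from the table above; the theoretical ceiling
# for N workers running 100ms jobs is 10 * N jobs per second.
results = {2: 19.9, 6: 59.8, 10: 99.6, 20: 198, 30: 298, 40: 391,
           50: 496, 60: 591, 70: 693, 80: 786, 90: 887, 100: 981}

for workers, throughput in results.items():
    efficiency = throughput / (10 * workers)
    print(f"{workers:>3} workers: {efficiency:.1%} of the theoretical maximum")
```

Every row comes out above 95% of the theoretical maximum, which is what we mean by "pretty close" to linear scaling.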