Mixing reliability with Celery for delicious async tasks

Oct 17, 10:40 AM to 11:05 AM EDT

About This Talk

Celery is essential for asynchronous processing in Django backends. In multiple Django projects, we used it far beyond the classic use case of sending emails without blocking HTTP responses. Celery helped us aggregate data, fill caches, run ETL workflows, parallelize heavy workloads, sync with external services, set up periodic background jobs, and much more.

But as with any distributed system, running Celery reliably in production is challenging. Because of the many issues we’ve seen with Celery, we often considered replacing it with another task queue, but we never found another library with the features Celery offers. So we had to learn to work around its shortcomings and pitfalls. Over years of running it in multiple Django projects, we faced and solved several reliability problems. We remediated concurrency hazards. We dealt with tasks lost at multiple edges of the architecture. We read tons of docs, articles, and issues to properly tweak settings. We fixed weird serialization bugs after version upgrades. We found out what kind of monitoring is really needed.

In this talk, you will learn how to configure, use, and monitor Celery successfully in production. Celery performs well in simple contexts, and because of that it can induce a false sense of safety that breaks down as usage picks up and flows become more complex. Understanding the many ways it can fail as projects grow will help developers prepare in advance.

Outline:

  • [2 minutes] Common concurrency issues
  • [5 minutes] Recommended settings
    • What Broker and Result Backend to use
    • What happens when using others
    • Serialization: pickle or not?
    • Thresholds and limits
    • Timeouts and expires
  • [5 minutes] How tasks can be lost and how Celery tries to solve that
    • ACKS_LATE, idempotency, and retries
    • Why that task again? Visibility timeout, prefetches, and automatic redelivery
    • Dead worker process, lost task
    • You need atomicity too
  • [5 minutes] Don’t use Celery canvas workflows: you need DB-level state
  • [2 minutes] Multiple queues and workers will save you from complex incidents
  • [2 minutes] The only monitoring you can trust: probe tasks
  • [2 minutes] Graceful shutdowns: Celery and Continuous Deployment
  • [3 minutes] Questions
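
As a rough illustration of the settings named in the outline, the sketch below shows a celeryconfig-style Python module. The setting names are real Celery options, but every value, the `reports.tasks.*` route, and the `slow` queue name are assumptions chosen for illustration, not the recommendations given in the talk.

```python
# celeryconfig.py: illustrative values only, not the talk's recommendations.

# Serialization: accept JSON only, so workers never unpickle untrusted payloads.
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]

# Lost tasks: acknowledge a message after the task finishes instead of before
# it starts, and requeue it if the worker process running it dies abruptly.
# Both assume idempotent tasks, because a task may then run more than once.
task_acks_late = True
task_reject_on_worker_lost = True

# Prefetch: each worker process reserves one message at a time, so a busy
# worker does not hold back tasks that an idle worker could run.
worker_prefetch_multiplier = 1

# Timeouts, in seconds: the soft limit raises SoftTimeLimitExceeded inside the
# task; the hard limit terminates the process executing it.
task_soft_time_limit = 5 * 60
task_time_limit = 6 * 60

# Expires: how long stored task results are kept, in seconds.
result_expires = 60 * 60

# Visibility timeout (Redis and SQS transports): an unacknowledged message is
# redelivered after this many seconds, so it should exceed the longest task.
broker_transport_options = {"visibility_timeout": 60 * 60}

# Multiple queues: route slow tasks away from the default queue.
# "reports.tasks.*" and "slow" are hypothetical names.
task_default_queue = "default"
task_routes = {"reports.tasks.*": {"queue": "slow"}}
```

A module like this can be loaded with `app.config_from_object("celeryconfig")`. In a Django project that loads settings with `namespace="CELERY"`, the same options would be written in `settings.py` as `CELERY_TASK_ACKS_LATE`, `CELERY_WORKER_PREFETCH_MULTIPLIER`, and so on.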

Presenters

    Flávio Juvenal (he/him)

    I’m a software engineer from Brazil and Chief Scientist at Vinta Software (www.vinta.com.br). I’ve been building web products with Python and Django for the last 13 years. I love drinking medium and light roast coffee and visiting museums around the world. Recently I got into retrogaming, and I’ve been trying to fix and play SNES and Genesis consoles.