Apache Airflow
Sumit Maheshwari Qubole
Bangalore Big Data Meetup @ LinkedIn
27 Aug 2016
Agenda
● Workflows
● Problem statement
● Options
● Airflow
  ○ Anatomy
  ○ Sample DAG
  ○ Architecture
  ○ Demo
● Experiences
Workflows?
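[Figure: workflow diagrams, from a simple chain of tasks A, B, C to larger graphs over tasks A through H, up to n tasks; edges not recoverable]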
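A workflow of this kind can be modeled as a directed acyclic graph in which each task lists the tasks it depends on. The sketch below is only an illustration using Python's standard `graphlib` module (Python 3.9+). The task names and edges are hypothetical, since the edges in the slide's diagrams are not recoverable.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
# These edges are an assumption, not the graph drawn on the slide.
workflow = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
print(order)
```

Any order it returns starts with A and ends with D. B and C have no dependency on each other, so a workflow engine is free to run them in parallel.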
Background
Qubole was looking for a complete workflow solution. We already had a simple
(sequential) workflow and a very stable scheduler in-house.
Options were:
1. Extend in-house workflow to full-fledged workflow
2. Oozie
3. Pinball
4. Luigi
5. Briefly
6. Airflow
In House
Pros:
● Full control
● Faster bug fixing
● Prioritised Qubole-related features
Cons:
● Ever-growing list of features
● Much longer dev & QA cycles
● Difficult to keep pace with latest trends
Oozie
Pros:
● Used by thousands of companies
● Web APIs, Java APIs, CLI and HTML support
● Oldest among all
Oozie
Cons:
● XML
● Significant effort to manage; frequent OOM (out-of-memory) errors
● Difficult to customise
Pinball
Pros:
● Pythonic way of defining DAGs.
● Extensible and horizontally scalable.
● Pinterest is already using Pinball to submit commands to Qubole.
Cons:
● Complex to understand.
● "pip install" was broken.
● Lack of community interest.
Luigi
Pros:
● Pythonic way to write DAGs
● Pretty stable
● Huge community
● Built-in support for Hadoop
Luigi
Cons:
● Workflows have to be scheduled externally
● Minimal UI
● State persistence via files
● No built-in monitoring or alerting
Briefly
Pros: Very small codebase to understand and modify. Built-in support for Qubole.
Cons: Too naive for production use.
Airflow
● Python code base
● Callable events
● Trigger rules
● XComs
● Cool UI & rich CLI
● Queues & pools
● Zombie cleanup
● Growing community
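To make "trigger rules" concrete: a trigger rule decides whether a task may start based on the states of its upstream tasks. The rule names below mirror ones Airflow documents (`all_success`, `all_failed`, `all_done`, `one_success`, `one_failed`). The function itself, `should_run`, is a hypothetical and much simplified model written for this illustration, not Airflow's implementation.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may start, given its upstream task states.

    States are plain strings: "success", "failed", or "running".
    This is a simplified model for illustration only.
    """
    finished = [s for s in upstream_states if s in ("success", "failed")]
    if trigger_rule == "all_success":
        return all(s == "success" for s in upstream_states)
    if trigger_rule == "all_failed":
        return all(s == "failed" for s in upstream_states)
    if trigger_rule == "all_done":
        # Every upstream task has finished, regardless of outcome.
        return len(finished) == len(upstream_states)
    if trigger_rule == "one_success":
        return any(s == "success" for s in upstream_states)
    if trigger_rule == "one_failed":
        return any(s == "failed" for s in upstream_states)
    raise ValueError(f"unknown trigger rule: {trigger_rule}")


# A cleanup task might use "all_done" so that it runs even after failures.
print(should_run("all_done", ["success", "failed"]))
```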
Anatomy
● The job definitions, in Python code.
● A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.
● A web application to explore your DAG definitions, their dependencies, progress, metadata and logs.
● A metadata repository that Airflow uses to keep track of task and job statuses and other persistent information.
● An array of workers, running the jobs' task instances in a distributed fashion.
● Scheduler processes that fire up the task instances that are ready to run.
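The way these pieces interact can be sketched as a toy loop: a dict stands in for the metadata repository, and a scheduler function repeatedly picks the tasks whose upstream tasks have succeeded and hands them to an `execute` callable that plays the role of a worker. All names here (`ready_tasks`, `run_workflow`, the task names) are hypothetical, and the real scheduler is far more involved.

```python
def ready_tasks(deps, states):
    """Tasks not yet started whose upstream tasks have all succeeded."""
    return sorted(
        task
        for task, upstream in deps.items()
        if states[task] == "queued"
        and all(states[u] == "success" for u in upstream)
    )


def run_workflow(deps, execute):
    """Run tasks in dependency order; return final states and run history."""
    # This dict stands in for the metadata repository.
    states = {task: "queued" for task in deps}
    history = []
    while True:
        batch = ready_tasks(deps, states)
        if not batch:
            break  # nothing left that is allowed to run
        for task in batch:  # a real deployment would hand these to workers
            states[task] = "success" if execute(task) else "failed"
            history.append(task)
    return states, history


deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
states, history = run_workflow(deps, execute=lambda task: True)
print(history)
```

If `execute` returns False for a task, everything downstream of it simply stays queued in this model, because its upstream never reaches the success state.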
Sample DAG
Demo
Airflow: Some facts
Small code base: ~20k lines of Python code.
Born at Airbnb, open sourced in June 2015, and recently moved to the Apache Incubator.
Under active development, some numbers:
a. ~1.5-year-old project, 3400 commits, 177 contributors, 20+ commits per week
b. Companies using Airflow: Airbnb, Agari, Lyft, Wepay, Easytaxi, Qubole and many others
c. 1000+ closed PRs
Airflow: Architecture
Airflow comes with four built-in execution modes:
● Sequential
● Local
● Celery
● Mesos
It is also very easy to add your own execution mode.
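The idea of pluggable execution modes can be illustrated with a small interface: the scheduling logic only needs something that accepts tasks and runs them, so a new mode is just another implementation. The classes below (`Executor`, `ToySequentialExecutor`, `RecordingExecutor`) are hypothetical names for this sketch and do not reproduce Airflow's own executor base class.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical minimal interface for an execution mode."""

    @abstractmethod
    def run(self, tasks):
        """Run a list of zero-argument callables; return results in order."""


class ToySequentialExecutor(Executor):
    """Runs one task at a time, in the order given."""

    def run(self, tasks):
        return [task() for task in tasks]


class RecordingExecutor(Executor):
    """A custom mode: wraps another executor and records each batch size."""

    def __init__(self, inner):
        self.inner = inner
        self.batches = []

    def run(self, tasks):
        self.batches.append(len(tasks))
        return self.inner.run(tasks)


executor = RecordingExecutor(ToySequentialExecutor())
results = executor.run([lambda: 1, lambda: 2])
print(results, executor.batches)
```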
Sequential
● Default mode
● Minimum setup; works with SQLite as well
● Processes 1 task at a time
● Good for demo purposes only
Local Executor
● Spawned by scheduler processes
● Vertically scalable
● Production grade
● Doesn't need a broker etc.
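"Vertically scalable" here means more parallelism on a single machine, with no message broker involved. A rough sketch of that idea, assuming a thread pool as a stand-in for the local workers (the function name `run_locally` and the `parallelism` parameter are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def run_locally(tasks, parallelism=4):
    """Run zero-argument callables in parallel on this machine.

    Results come back in submission order. A thread pool is used only
    as a simple stand-in for local worker processes.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# Bind i at definition time so each task squares its own value.
squares = run_locally([lambda i=i: i * i for i in range(5)])
print(squares)
```

Raising `parallelism` is the only scaling knob in this model, which is why this mode grows with the size of one machine rather than with the number of machines.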
Celery Executor
● Vertically and horizontally scalable
● Can be monitored (via Flower)
● Supports pools and queues
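A pool, in this sense, caps how many tasks from a group may run at the same time, no matter how many workers are available. The sketch below models that with a semaphore. `SlotPool` and `work` are hypothetical names, and this is not how Airflow or Celery implement pools.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SlotPool:
    """Limits concurrent executions to a fixed number of slots."""

    def __init__(self, slots):
        self._slots = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest concurrency observed so far

    def run(self, fn):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn()
            finally:
                with self._lock:
                    self.active -= 1


def work(i):
    time.sleep(0.01)  # pretend to do something slow
    return i


pool = SlotPool(slots=2)
# Eight workers are available, but the pool admits only two tasks at once.
with ThreadPoolExecutor(max_workers=8) as workers:
    results = list(workers.map(lambda i: pool.run(lambda: work(i)), range(10)))
print(results, pool.peak)
```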
Experiences
Key aspects considered while productionizing Airflow at Qubole:
● Availability
● Reliability
● Security
● Usability
Thank You !
gitter - @msumit
msumit@apache.org
PS: Qubole is hiring, ping me :)
