
Along with Hooks, Operators are key to extending Airflow's capabilities while keeping your DAGs clean and simple.

An operator is an object that embodies an operation utilizing one or more hooks, typically to transfer data from one hook to another.

In Airflow, operators allow for the generation of certain types of tasks that become nodes in the DAG when instantiated (see the sketch after the list below). All operators derive from BaseOperator and inherit many attributes and methods that way. There are three main types of operators:

  • Action - These operators perform an action or tell another system to perform an action.
  • Transfer - These operators move data from one system to another.
  • Sensors - This type of operator will keep running until a certain criterion is met.
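
Here is a minimal sketch of how instantiated operators become tasks in a DAG. It assumes the BashOperator and PythonOperator that ship with Airflow; the DAG id, schedule, and callable are illustrative, and import paths may differ slightly between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def _print_hello():
    print("hello")


dag = DAG(
    dag_id="example_operators",        # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

# Each instantiated operator becomes a task, i.e. a node in the DAG.
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)
greet = PythonOperator(task_id="greet", python_callable=_print_hello, dag=dag)

say_hello >> greet  # greet runs downstream of say_hello
```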

BaseOperator

All operators are derived from BaseOperator and acquire much of their functionality through inheritance. Because BaseOperator is the Lucy of operators (the common ancestor from which they all descend), we've devoted this section of documentation to understanding its parameters. These parameters are building blocks that can be leveraged in your DAGs; a short example follows the list below.

Parameters

  • task_id (string) - A meaningful id that serves as a unique identifier for each task.
  • owner (string) - The owner of the task. A good practice is to use the Unix username here.
  • retries (int) - The number of retries that should be performed before failing the task.
  • retry_delay (timedelta) - How long the delay between retries should be.
  • retry_exponential_backoff (bool) - When set to True, allows progressively longer delays between retries using exponential backoff. Note that the delay will be converted into seconds.
  • max_retry_delay (timedelta) - The maximum delay interval between retries.
  • start_date (datetime) - This determines the execution_date for the first task instance.
    • Tip: The best practice is to have the start_date rounded to your DAG’s schedule_interval. So, daily jobs should have start_date with time 00:00:00 and hourly jobs should have their start_date at 00:00 of a specific hour.
  • end_date (datetime) - If specified, the scheduler won’t go beyond this date.
  • depends_on_past (bool) - When set to ‘True’, task instances run sequentially and only if the previous instance of the same task succeeded. The task instance for the start_date is allowed to run despite this.
  • wait_for_downstream (bool) - When set to ‘True’, an instance of task B will wait for tasks immediately downstream of the previous instance of task B to finish successfully before it runs. This is useful when different instances of a task alter the same asset and that asset is used by further downstream tasks.
    • Note: depends_on_past is automatically ‘True’ whenever wait_for_downstream is used.
  • queue (str) - Points the executor to the right queue to target for a job.
  • dag (DAG) - A reference to the DAG the task is attached to.
  • priority_weight (int) - Priority weight of this task against other tasks. This allows the executor to trigger higher priority tasks before others when there is a backlog.
  • pool (str) - The slot pool this task should run in. Slot pools are a way to limit concurrency for certain tasks.
  • sla (datetime.timedelta) - This is the time by which the job is expected to succeed.
    • Note: This represents the timedelta after the period is closed. The scheduler pays special attention to jobs with an SLA and sends alert emails for SLA misses.
  • execution_timeout (datetime.timedelta) - Maximum time allowed for the execution of this task instance.
  • on_failure_callback (callable) - A function to be called when a task instance of this task fails. A context dictionary is passed as a single parameter to this function.
  • on_retry_callback (callable) - Much like the on_failure_callback except this is executed when retries occur.
  • on_success_callback (callable) - Much like the on_failure_callback except this is executed when the task is successful.
  • trigger_rule (str) - Defines the rule by which dependencies are applied for the task to get triggered.
    • Options for trigger_rule are: { all_success | all_failed | all_done | one_success | one_failed | dummy} and defaults to all_success.
  • resources (dict) - A map of resource parameter names to their values.
  • run_as_user (str) - Unix username to impersonate while running the task.
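
The following sketch shows several of these BaseOperator parameters set on a single task. The values, the callback function, and the owner name are purely illustrative, and the import paths assume a classic (1.x-style) Airflow layout.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def _notify_failure(context):
    # Receives the context dictionary described for on_failure_callback.
    print("Task failed: %s" % context["task_instance_key_str"])


dag = DAG(
    dag_id="base_operator_params",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),    # rounded to the schedule_interval
    schedule_interval="@daily",
)

load = BashOperator(
    task_id="load_data",
    bash_command="echo load",
    owner="etl",                        # Unix username by convention
    retries=3,
    retry_delay=timedelta(minutes=5),
    retry_exponential_backoff=True,
    max_retry_delay=timedelta(minutes=30),
    depends_on_past=True,
    execution_timeout=timedelta(hours=1),
    sla=timedelta(hours=2),             # expected to succeed within 2h of period close
    priority_weight=10,
    on_failure_callback=_notify_failure,
    trigger_rule="all_success",         # the default trigger rule
    dag=dag,
)
```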

BaseSensorOperator

All sensors are derived from BaseSensorOperator and inherit timeout and poke_interval in addition to the BaseOperator attributes. A short sensor example follows the parameter list below.

Parameters

  • soft_fail (bool) - Set to True to mark the task as SKIPPED on failure.
  • poke_interval (int) - Time in seconds that the job should wait in between each try.
  • timeout (int) - Time in seconds before the task times out and fails.
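
Here is a minimal sketch of a sensor using poke_interval, timeout, and soft_fail. It assumes the FileSensor bundled with Airflow; the file path and DAG id are hypothetical, and the import path may differ by Airflow version.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor

dag = DAG(
    dag_id="sensor_example",            # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

wait_for_file = FileSensor(
    task_id="wait_for_file",
    filepath="/tmp/data_ready.flag",    # hypothetical trigger file
    poke_interval=60,                   # check every 60 seconds
    timeout=60 * 60,                    # stop poking after an hour
    soft_fail=True,                     # mark SKIPPED instead of FAILED on timeout
    dag=dag,
)
```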

Thanks to the Apache Airflow project & community, there is a large base of operators already available for use.