Orchestration: Oozie
Oozie orchestrates and schedules Hadoop jobs in the ecosystem.
•Apache Oozie is a scheduler and workflow engine that fits well in large production environments
•It is a server-based workflow engine
•Oozie can run workflow jobs with MapReduce and Pig action nodes
•Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Architecture
Oozie Workflow Nodes
Control Flow
•Start/end/kill
•Decision
•Fork/join
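The control flow nodes listed above can be sketched in a single workflow definition. This is a minimal illustration, not a production workflow: the schema version (uri:oozie:workflow:0.5), the node names, the inputDir property, and the /tmp/sample paths are all assumptions chosen for the example, and the two HDFS (fs) actions exist only to give the fork something to run in parallel.

```xml
<workflow-app name="control-flow-sample" xmlns="uri:oozie:workflow:0.5">
    <!-- start: entry point of the workflow -->
    <start to="check-input"/>

    <!-- decision: branch on whether the (hypothetical) inputDir exists -->
    <decision name="check-input">
        <switch>
            <case to="split">${fs:exists(inputDir)}</case>
            <default to="fail"/>
        </switch>
    </decision>

    <!-- fork: run both paths concurrently -->
    <fork name="split">
        <path start="make-dir-a"/>
        <path start="make-dir-b"/>
    </fork>

    <action name="make-dir-a">
        <fs>
            <mkdir path="${nameNode}/tmp/sample/a"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <action name="make-dir-b">
        <fs>
            <mkdir path="${nameNode}/tmp/sample/b"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- join: wait for every forked path before continuing -->
    <join name="merge" to="end"/>

    <!-- kill: terminate the workflow with an error message -->
    <kill name="fail">
        <message>Workflow failed at ${wf:lastErrorNode()}</message>
    </kill>

    <end name="end"/>
</workflow-app>
```

Every path started by a fork must reach the same join, which is why both actions transition to merge on success.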
Actions
•MapReduce
•Pig
•HDFS
•Sub-workflow
•Java: runs custom Java code
•To run Oozie workflows, two files are needed:
1. workflow.xml (stored in HDFS)
•It contains the structure of the workflow.
2. job.properties (stored on the local file system)
•It contains the configuration properties.
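As an illustration of what workflow.xml might contain, here is a minimal sketch with one MapReduce action. The schema version, the node names, and the inputDir and outputDir properties are assumptions made for this example; ${jobTracker} and ${nameNode} are expected to be resolved from job.properties. No mapper or reducer class is configured, so this sketch relies on Hadoop's defaults and would need those properties added for a real job.

```xml
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-step"/>

    <!-- Single MapReduce action; values in ${...} come from job.properties -->
    <action name="mr-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- inputDir and outputDir are hypothetical properties -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Reached only if the action reports an error -->
    <kill name="fail">
        <message>MapReduce step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>
</workflow-app>
```

Each action declares both an ok and an error transition, so the graph always has a defined next node whether the job succeeds or fails.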
Oozie Server
•The Oozie server is designed to work with either MRv1 or YARN; it cannot work with both simultaneously.
•The choice is configured with the CATALINA_BASE variable in /etc/oozie/conf/oozie-env.sh
Hadoop 1
•CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20
Hadoop 2
•CATALINA_BASE=/usr/lib/oozie/oozie-server
Oozie Sample Workflow
nameNode=Address of NameNode
jobTracker=Address of JobTracker
oozie.libpath=Path containing related jars
oozie.wf.application.path=Path containing workflow.xml
Oozie Coordinator
•An Oozie Coordinator is a collection of predicates (conditional statements based on time frequency and data availability) and actions (e.g. Hadoop MapReduce jobs, Hadoop file system operations, Hadoop Streaming, Pig, Java, and Oozie sub-workflows).
•Actions are recurrent workflow jobs, invoked each time the predicate evaluates to true.
<coordinator-app name="Name of workflow" frequency="frequency in minutes" start="Start Time" end="End Time" timezone="Time Zone" xmlns="uri:oozie:coordinator:0.1">
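A filled-in coordinator definition might look like the sketch below. All concrete values are assumptions for illustration: the names, the dates, the /data/logs path, and the workflowAppPath property are hypothetical. The frequency of 1440 minutes runs the workflow once a day, and the dataset with its input event adds a data-availability predicate, so each run also waits for that day's input directory.

```xml
<coordinator-app name="daily-sample-coord" frequency="1440"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <!-- Data-availability predicate: one dataset instance per day -->
    <datasets>
        <dataset name="logs" frequency="1440"
                 initial-instance="2024-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="logs">
            <!-- Wait for the dataset instance matching the current run -->
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

    <!-- Action: the workflow launched each time the predicates hold -->
    <action>
        <workflow>
            <!-- workflowAppPath is a hypothetical property: HDFS directory holding workflow.xml -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

For a purely time-based coordinator, the datasets and input-events elements can be dropped, leaving only the action.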