It's 2019 already (or should I say 9102?) — are you still using properties files for configuration? Not anymore.
Azkaban Flow 2.0 uses YAML for job configuration:
The uploaded project folder can contain YAML config files for multiple flows.
Flow YAML File
The flow file is defined as follows:
- Each flow has its own YAML file.
- The flow name is taken from the YAML file name, e.g. my-flow-name.flow.
- It contains all the nodes of the execution DAG.
- Each node can be either a job or a flow (an embedded flow).
- Each node can have the following properties: name, type, config, dependsOn and nodes.
- Node dependencies are specified by listing the parent nodes in the dependsOn list.
- It can also contain other flow-level config settings.
- Everything that used to live in properties files is migrated into the YAML file.
Demo: acceptance-test.flow
---
config:
  user.to.proxy: azktest
  param.hadoopOutData: /tmp/wordcounthadoopout
  param.inData: /tmp/wordcountpigin
  param.outData: /tmp/wordcountpigout

# This section defines the list of jobs
# A node can be a job or a flow
# In this example, all nodes are jobs
nodes:
  # Job definition
  # The job definition is like a YAMLified version of properties file
  # with one major difference. All custom properties are now clubbed together
  # in a config section in the definition.
  # The first line describes the name of the job
  - name: AZTest
    type: noop
    # The dependsOn section contains the list of parent nodes the current
    # node depends on
    dependsOn:
      - hadoopWC1
      - NoOpTest1
      - hive2
      - java1
      - jobCommand2
  - name: pigWordCount1
    type: pig
    # The config section contains custom arguments or parameters which are
    # required by the job
    config:
      pig.script: src/main/pig/wordCountText.pig
  - name: hadoopWC1
    type: hadoopJava
    dependsOn:
      - pigWordCount1
    config:
      classpath: ./*
      force.output.overwrite: true
      input.path: ${param.inData}
      job.class: com.linkedin.wordcount.WordCount
      main.args: ${param.inData} ${param.hadoopOutData}
      output.path: ${param.hadoopOutData}
  - name: hive1
    type: hive
    config:
      hive.script: src/main/hive/showdb.q
  - name: NoOpTest1
    type: noop
  - name: hive2
    type: hive
    dependsOn:
      - hive1
    config:
      hive.script: src/main/hive/showTables.sql
  - name: java1
    type: javaprocess
    config:
      Xms: 96M
      java.class: com.linkedin.foo.HelloJavaProcessJob
  - name: jobCommand1
    type: command
    config:
      command: echo "hello world from job_command_1"
  - name: jobCommand2
    type: command
    dependsOn:
      - jobCommand1
    config:
      command: echo "hello world from job_command_2"
The packaged zip file is then structured as follows:
project_root
├── sample_project.project
├── flow1.flow
├── flow2.flow
├── ...
├── flown.flow
├── lib
│   ├── ...
│   └── paranamer-2.4.1.jar
└── src
    └── main
        ├── hive
        │   └── query.q
        └── pig
            └── pig1.pig
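The sample_project.project file at the project root is what tells Azkaban this is a Flow 2.0 project. According to the Azkaban Flow 2.0 docs it only needs to declare the flow version (the sample_project name here is just this example's project name):

azkaban-flow-version: 2.0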
On top of all this, Flow 2.0 also supports conditional flows:
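Here is a minimal sketch following the conditional-workflow syntax in the Azkaban docs; the job names and the param1 output parameter are made up for illustration. JobA emits a runtime output parameter, and JobB only runs when the condition on that parameter evaluates to true:

nodes:
  - name: JobA
    type: command
    config:
      # Assume this script writes {"param1": "1"} into $JOB_OUTPUT_PROP_FILE
      command: bash ./write_to_props.sh
  - name: JobB
    type: command
    dependsOn:
      - JobA
    # JobB runs only if JobA's runtime output parameter param1 equals 1
    condition: ${JobA:param1} == 1
    config:
      command: echo "JobB runs because param1 == 1"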
As far as I know, there are currently two ways to pass parameters between Azkaban jobs.
- Modify the source code to provide the EL expressions you want, such as ${yesterday-ymd} (recommended, but a bit more work).
- Use the python props.py > $JOB_OUTPUT_PROP_FILE approach: write a JSON object into the file whose path Azkaban exposes through the $JOB_OUTPUT_PROP_FILE environment variable. Parameters passed this way only propagate one level down the dependency chain.
For example, if init_step writes to $JOB_OUTPUT_PROP_FILE, then in the chain init_step -> second_step -> last_step, second_step can read the parameters output by init_step, but last_step cannot (see the sketch below).
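A minimal sketch of the second approach in Flow 2.0 YAML, reusing the props.py idea from above; init_step, second_step, last_step and yesterday_ymd are hypothetical names:

nodes:
  - name: init_step
    type: command
    config:
      # Assume props.py prints a JSON object such as {"yesterday_ymd": "2019-08-01"};
      # redirecting it into $JOB_OUTPUT_PROP_FILE hands those keys to Azkaban,
      # which passes them on to init_step's direct children.
      command: python props.py > $JOB_OUTPUT_PROP_FILE
  - name: second_step
    type: command
    dependsOn:
      - init_step
    config:
      # Direct child of init_step: ${yesterday_ymd} is resolved here.
      command: echo "processing data for ${yesterday_ymd}"
  - name: last_step
    type: command
    dependsOn:
      - second_step
    config:
      # Two levels away from init_step, so ${yesterday_ymd} would NOT resolve here.
      command: echo "done"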