| tags: [ ApacheBeam BigData ] categories: [ Development ]
Learn Apache Beam
1. Pipeline 框架
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
// or
public interface MyOptions extends PipelineOptions { ... }
PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<...> = pipeline.apply(...PTransform...);
pipeline.run().waitUntilFinish();
2. Transform
- data source -> Read transform -> Transform -> Write transform -> data sink
- Core Beam transforms
- ParDo: 类似于 map,可以产生零个、一个、多个元素
- GroupByKey: 类似于 shuffle,在同一个 PCollection 里做 group by
- CoGroupByKey: 类似于 join,作用于多个有相同 Key 类型的 PCollection
- Combine: 类似于 reduce
- Flatten: 类似于 merge,合并多个同类型的 PCollection
- Partition: 类似于 split,将一个 PCollection 切分成多个同类型的 PCollection
3. Schema
@DefaultSchema(JavaBeanSchema.class)
public class A {
public String getUserId();
@Nullable public String getUserName();
@SchemaCreate
public A(String userId, String userName) { ... }
}
@DefaultSchema(JavaFieldSchema.class)
public class A {
public String userId;
@Nullable public String userName;
}
@SchemaCreate
, @SchemaIgnore
, @SchemaFieldName
可选。
4. Windowing + Triggers
- 默认所有元素属于单个全局窗口,后来的事件会被废弃,并且所有元素到齐后才会触发下游处理,所以对于无界集合,需要指定窗口函数或者指定触发器。
5. Java packages 一览
https://beam.apache.org/releases/javadoc/2.20.0/index.html
- org.apache.beam.runners.XXX
- org.apache.beam.sdk: 定义 Pipeline 类
- org.apache.beam.sdk.coders: 定义各种 Coder
- org.apache.beam.sdk.extensions.jackson: JSON 转换
- org.apache.beam.sdk.extensions.joinlibrary: FullOuterJoin, InnerJoin, LeftOuterJoin, RightOuterJoin 的实现
- org.apache.beam.sdk.extensions.protobuf: Protobuf 转换
- org.apache.beam.sdk.extensions.{sketching, zetasketch}: 近似统计
- org.apache.beam.sdk.extensions.sorter: 对大数据量的排序
- org.apache.beam.sdk.extensions.sql: Beam SQL
- org.apache.beam.sdk.io: 各种 Read / Write transform
- org.apache.beam.sdk.metrics: 监控指标
- org.apache.beam.sdk.schemas: PCollection 里数据格式的描述
- org.apache.beam.sdk.schemas.transforms: 基于 schema 的 transform
- org.apache.bean.sdk.state: 状态管理
- org.apache.beam.sdk.transforms: 标准内置 transform
- org.apache.beam.sdk.values: Pipeline 处理的数据结构