1. Pipeline 框架

PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();

// or
public interface MyOptions extends PipelineOptions { ... }
PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args)
                                                .withValidation()
                                                .as(MyOptions.class);

Pipeline pipeline = Pipeline.create(options);

PCollection<...> = pipeline.apply(...PTransform...);

pipeline.run().waitUntilFinish();

2. Transform

  1. data source -> Read transform -> Transform -> Write transform -> data sink
  2. Core Beam transforms
    • ParDo: 类似于 map,可以产生零个、一个、多个元素
    • GroupByKey: 类似于 shuffle,在同一个 PCollection 里做 group by
    • CoGroupByKey: 类似于 join,作用于多个有相同 Key 类型的 PCollection
    • Combine: 类似于 reduce
    • Flatten: 类似于 merge,合并多个同类型的 PCollection
    • Partition: 类似于 split,将一个 PCollection 切分成多个同类型的 PCollection

3. Schema

@DefaultSchema(JavaBeanSchema.class)
public class A {
  public String getUserId();
  @Nullable public String getUserName();
  
  @SchemaCreate
  public A(String userId, String userName) { ... }
}

@DefaultSchema(JavaFieldSchema.class)
public class A {
  public String userId;
  @Nullable public String userName;
}

@SchemaCreate, @SchemaIgnore, @SchemaFieldName可选。

4. Windowing + Triggers

  1. 默认所有元素属于单个全局窗口,后来的事件会被废弃,并且所有元素到齐后才会触发下游处理,所以对于无界集合,需要指定窗口函数或者指定触发器。

5. Java packages 一览

https://beam.apache.org/releases/javadoc/2.20.0/index.html

  • org.apache.beam.runners.XXX
  • org.apache.beam.sdk: 定义 Pipeline 类
  • org.apache.beam.sdk.coders: 定义各种 Coder
  • org.apache.beam.sdk.extensions.jackson: JSON 转换
  • org.apache.beam.sdk.extensions.joinlibrary: FullOuterJoin, InnerJoin, LeftOuterJoin, RightOuterJoin 的实现
  • org.apache.beam.sdk.extensions.protobuf: Protobuf 转换
  • org.apache.beam.sdk.extensions.{sketching, zetasketch}: 近似统计
  • org.apache.beam.sdk.extensions.sorter: 对大数据量的排序
  • org.apache.beam.sdk.extensions.sql: Beam SQL
  • org.apache.beam.sdk.io: 各种 Read / Write transform
  • org.apache.beam.sdk.metrics: 监控指标
  • org.apache.beam.sdk.schemas: PCollection 里数据格式的描述
    • org.apache.beam.sdk.schemas.transforms: 基于 schema 的 transform
  • org.apache.bean.sdk.state: 状态管理
  • org.apache.beam.sdk.transforms: 标准内置 transform
  • org.apache.beam.sdk.values: Pipeline 处理的数据结构