A look at storm trident's operations
Published: 2019-06-19



This article takes a look at storm trident's operations.

function, filter, projection

Function

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Function.java

public interface Function extends EachOperation {
    /**
     * Performs the function logic on an individual tuple and emits 0 or more tuples.
     *
     * @param tuple The incoming tuple
     * @param collector A collector instance that can be used to emit tuples
     */
    void execute(TridentTuple tuple, TridentCollector collector);
}
  • Function defines a single execute method; the fields it emits are appended to the input tuple (see the sketch below)
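
As a usage sketch (not part of the storm source above), here is a Function along the lines of the Split example from the Trident documentation; BaseFunction is storm's convenience base class, and the field names are assumptions:

import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // each emitted word produces one output tuple: the input fields plus "word"
        for (String word : tuple.getString(0).split(" ")) {
            if (word.length() > 0) {
                collector.emit(new Values(word));
            }
        }
    }
}

It would be attached with something like stream.each(new Fields("sentence"), new Split(), new Fields("word")).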

Filter

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Filter.java

public interface Filter extends EachOperation {
    /**
     * Determines if a tuple should be filtered out of a stream
     *
     * @param tuple the tuple being evaluated
     * @return `false` to drop the tuple, `true` to keep the tuple
     */
    boolean isKeep(TridentTuple tuple);
}
  • Filter provides an isKeep method that decides whether a tuple stays in the stream (`true` keeps it, `false` drops it); see the sketch below
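
A minimal Filter sketch (the class name KeepEven and the field position are hypothetical), keeping only tuples whose first field is even:

import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;

public class KeepEven extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) {
        // true keeps the tuple in the stream, false drops it
        return tuple.getLong(0) % 2 == 0;
    }
}

It would be attached with something like stream.each(new Fields("n"), new KeepEven()).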

projection

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * Filters out fields from a stream, resulting in a Stream containing only the fields specified by `keepFields`.
 *
 * For example, if you had a Stream `mystream` containing the fields `["a", "b", "c", "d"]`, calling
 *
 * ```java
 * mystream.project(new Fields("b", "d"))
 * ```
 *
 * would produce a stream containing only the fields `["b", "d"]`.
 *
 * @param keepFields The fields in the Stream to keep
 * @return
 */
public Stream project(Fields keepFields) {
    projectionValidation(keepFields);
    return _topology.addSourcedNode(this, new ProcessorNode(_topology.getUniqueStreamId(), _name, keepFields, new Fields(), new ProjectedProcessor(keepFields)));
}
  • project delegates to a ProjectedProcessor to carry out the projection; a usage sketch follows
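
A usage sketch matching the javadoc above, assuming mystream has the fields ["a", "b", "c", "d"]:

// the resulting stream contains only the fields ["b", "d"]
Stream projected = mystream.project(new Fields("b", "d"));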

repartitioning operations

partition

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * @param partitioner
 * @return
 */
public Stream partition(CustomStreamGrouping partitioner) {
    return partition(Grouping.custom_serialized(Utils.javaSerialize(partitioner)));
}
  • partition accepts a CustomStreamGrouping

partitionBy

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * @param fields
 * @return
 */
public Stream partitionBy(Fields fields) {
    projectionValidation(fields);
    return partition(Grouping.fields(fields.toList()));
}
  • partitionBy uses Grouping.fields; see the usage sketch below
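
A usage sketch (hypothetical field name): tuples with equal values of the chosen fields are routed to the same target partition, without the grouped-stream semantics that groupBy adds:

// all tuples with the same "word" value end up in the same partition
stream.partitionBy(new Fields("word"));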

identityPartition

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * @return
 */
public Stream identityPartition() {
    return partition(new IdentityGrouping());
}
  • identityPartition uses IdentityGrouping

shuffle

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * Use random round robin algorithm to evenly redistribute tuples across all target partitions
 *
 * @return
 */
public Stream shuffle() {
    return partition(Grouping.shuffle(new NullStruct()));
}
  • shuffle uses Grouping.shuffle

localOrShuffle

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * Use random round robin algorithm to evenly redistribute tuples across all target partitions, with a preference
 * for local tasks.
 *
 * @return
 */
public Stream localOrShuffle() {
    return partition(Grouping.local_or_shuffle(new NullStruct()));
}
  • localOrShuffle uses Grouping.local_or_shuffle

global

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * All tuples are sent to the same partition. The same partition is chosen for all batches in the stream.
 * @return
 */
public Stream global() {
    // use this instead of storm's built in one so that we can specify a singleemitbatchtopartition
    // without knowledge of storm's internals
    return partition(new GlobalGrouping());
}
  • global uses GlobalGrouping

batchGlobal

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * All tuples in the batch are sent to the same partition. Different batches in the stream may go to different
 * partitions.
 *
 * @return
 */
public Stream batchGlobal() {
    // the first field is the batch id
    return partition(new IndexHashGrouping(0));
}
  • batchGlobal uses IndexHashGrouping keyed on the batch id (field 0), i.e. the repartitioning happens at the granularity of a whole batch

broadcast

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Repartitioning Operation
 *
 * Every tuple is replicated to all target partitions. This can be useful during DRPC – for example, if you need to do
 * a stateQuery on every partition of data.
 *
 * @return
 */
public Stream broadcast() {
    return partition(Grouping.all(new NullStruct()));
}
  • broadcast uses Grouping.all

groupBy

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

/**
 * ## Grouping Operation
 *
 * @param fields
 * @return
 */
public GroupedStream groupBy(Fields fields) {
    projectionValidation(fields);
    return new GroupedStream(this, fields);
}
  • groupBy returns a GroupedStream; see the usage sketch below
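
A usage sketch (the "word" field is hypothetical; Count is storm's built-in CombinerAggregator from org.apache.storm.trident.operation.builtin): on a GroupedStream, an aggregation runs per group within each batch:

// counts tuples per distinct "word" within each batch
stream.groupBy(new Fields("word"))
      .aggregate(new Count(), new Fields("count"));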

aggregators

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/Stream.java

//partition aggregate
public Stream partitionAggregate(Aggregator agg, Fields functionFields) {
    return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(CombinerAggregator agg, Fields functionFields) {
    return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return chainedAgg()
           .partitionAggregate(inputFields, agg, functionFields)
           .chainEnd();
}

public Stream partitionAggregate(ReducerAggregator agg, Fields functionFields) {
    return partitionAggregate(null, agg, functionFields);
}

public Stream partitionAggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return chainedAgg()
           .partitionAggregate(inputFields, agg, functionFields)
           .chainEnd();
}

//aggregate
public Stream aggregate(Fields inputFields, Aggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return chainedAgg()
           .aggregate(inputFields, agg, functionFields)
           .chainEnd();
}

public Stream aggregate(Fields inputFields, CombinerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return chainedAgg()
           .aggregate(inputFields, agg, functionFields)
           .chainEnd();
}

public Stream aggregate(Fields inputFields, ReducerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return chainedAgg()
            .aggregate(inputFields, agg, functionFields)
            .chainEnd();
}

//persistent aggregate
public TridentState persistentAggregate(StateFactory stateFactory, CombinerAggregator agg, Fields functionFields) {
    return persistentAggregate(new StateSpec(stateFactory), agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, CombinerAggregator agg, Fields functionFields) {
    return persistentAggregate(spec, null, agg, functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, CombinerAggregator agg, Fields functionFields) {
    return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, Fields inputFields, CombinerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    // replaces normal aggregation here with a global grouping because it needs to be consistent across batches
    return new ChainedAggregatorDeclarer(this, new GlobalAggScheme())
            .aggregate(inputFields, agg, functionFields)
            .chainEnd()
            .partitionPersist(spec, functionFields, new CombinerAggStateUpdater(agg), functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, ReducerAggregator agg, Fields functionFields) {
    return persistentAggregate(new StateSpec(stateFactory), agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, ReducerAggregator agg, Fields functionFields) {
    return persistentAggregate(spec, null, agg, functionFields);
}

public TridentState persistentAggregate(StateFactory stateFactory, Fields inputFields, ReducerAggregator agg, Fields functionFields) {
    return persistentAggregate(new StateSpec(stateFactory), inputFields, agg, functionFields);
}

public TridentState persistentAggregate(StateSpec spec, Fields inputFields, ReducerAggregator agg, Fields functionFields) {
    projectionValidation(inputFields);
    return global().partitionPersist(spec, inputFields, new ReducerAggStateUpdater(agg), functionFields);
}
  • trident's aggregators fall into three categories: partitionAggregate, aggregate, and persistentAggregate; aggregator operations replace the stream's output fields rather than appending to them
  • partitionAggregate operates at the granularity of each partition, not the whole batch
  • aggregate operates at the granularity of a batch: for each batch it first merges the batch's tuples from all partitions into a single partition (a global operation) and then aggregates that batch. Three kinds of aggregator can be passed in: Aggregator, CombinerAggregator, and ReducerAggregator. Calling stream.aggregate amounts to a global aggregation: with an Aggregator or ReducerAggregator the stream first repartitions all tuples into a single partition and then aggregates, whereas with a CombinerAggregator trident optimizes by first doing a partial aggregation on each partition, then merging into a single partition, and finally aggregating there, which saves network transfer relative to Aggregator or ReducerAggregator
  • persistentAggregate aggregates the tuples of all batches in the stream and stores the result in state; see the sketch below
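
As a sketch of persistentAggregate, here is the word-count pipeline from the Trident documentation (MemoryMapState and Count are storm built-ins; the spout, stream name, and field names are taken from that example, and Split is the Function sketched earlier):

TridentState wordCounts = topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // persists a running count per word across all batches
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));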

Aggregator

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/Aggregator.java

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T val, TridentTuple tuple, TridentCollector collector);
    void complete(T val, TridentCollector collector);
}
  • Aggregator first calls init for initialization; the value it returns is then passed into the aggregate and complete methods
  • aggregate is executed once for each tuple in the batch partition; after every tuple in the batch partition has been aggregated, complete is executed
  • Suppose a custom Aggregator does summation. For the batch of tuples [4], [7], [8]: init yields 0; for [4], val=0 and 0+4=4; for [7], val=4 and 4+7=11; for [8], val=11 and 11+8=19. The batch then ends and complete runs with val=19, at which point the collector can be used to emit the result (a sketch follows below)
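
A minimal sketch of the summing Aggregator walked through above (the class name SumAggregator and the input field position are made up; BaseAggregator is storm's convenience base class):

import org.apache.storm.trident.operation.BaseAggregator;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class SumAggregator extends BaseAggregator<SumAggregator.State> {

    static class State {
        long sum = 0; // running total for the current batch partition
    }

    @Override
    public State init(Object batchId, TridentCollector collector) {
        return new State(); // called once per batch partition, before any tuple
    }

    @Override
    public void aggregate(State state, TridentTuple tuple, TridentCollector collector) {
        state.sum += tuple.getLong(0); // called once per tuple
    }

    @Override
    public void complete(State state, TridentCollector collector) {
        collector.emit(new Values(state.sum)); // called once after the last tuple
    }
}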

CombinerAggregator

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/CombinerAggregator.java

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}
  • For each tuple it receives, CombinerAggregator calls init to extract a value from the tuple, then calls combine to fold that value into the previous combine result (or the value of zero if there is none); if the partition contains no tuples, the value of zero is returned
  • Suppose combine is addition and zero returns 0. For the batch of tuples [4], [7], [8]: init yields 4, 7, and 8 respectively; for [4] there is no previous combine result, so val1=0, val2=4, and the combine result is 4; for [7], val1=4 and val2=7 giving 11; for [8], val1=11 and val2=8 giving 19
  • CombinerAggregator incurs comparatively little network overhead, so it performs better than the other two kinds of aggregator; storm's built-in Count, sketched below, is a typical instance
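
For reference, storm's built-in Count in org.apache.storm.trident.operation.builtin follows exactly this pattern (reproduced here from memory, so treat it as a sketch):

public class Count implements CombinerAggregator<Long> {
    @Override
    public Long init(TridentTuple tuple) {
        return 1L; // each tuple contributes 1
    }

    @Override
    public Long combine(Long val1, Long val2) {
        return val1 + val2; // fold two partial counts together
    }

    @Override
    public Long zero() {
        return 0L; // returned when a partition has no tuples
    }
}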

ReducerAggregator

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/operation/ReducerAggregator.java

public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
  • When computing over a batch of tuples, ReducerAggregator first calls init once to obtain the initial value, then runs reduce for each tuple, where curr is the result of the previous reduce (or the init value if there is none)
  • Suppose reduce is addition and init returns 0. For the batch of tuples [4], [7], [8]: for [4], init is 0, so curr=0 and 0+4=4; for [7], curr=4 and 4+7=11; for [8], curr=11 and 11+8=19 (a sketch follows below)
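
A minimal summing ReducerAggregator matching the walkthrough above (the class name Sum and the field position are assumptions):

import org.apache.storm.trident.operation.ReducerAggregator;
import org.apache.storm.trident.tuple.TridentTuple;

public class Sum implements ReducerAggregator<Long> {
    @Override
    public Long init() {
        return 0L; // initial value of curr
    }

    @Override
    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + tuple.getLong(0); // fold each tuple into the running total
    }
}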

topology stream operations

join

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields) {
    return join(streams, joinFields, outFields, JoinType.INNER);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type) {
    return join(streams, joinFields, outFields, repeat(streams.size(), type));
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed) {
    return join(streams, joinFields, outFields, mixed, JoinOutFieldsMode.COMPACT);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinOutFieldsMode mode) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mode);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinOutFieldsMode mode) {
    return join(streams, joinFields, outFields, JoinType.INNER, mode);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, JoinType type, JoinOutFieldsMode mode) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, type, mode);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, JoinType type, JoinOutFieldsMode mode) {
    return join(streams, joinFields, outFields, repeat(streams.size(), type), mode);
}

public Stream join(Stream s1, Fields joinFields1, Stream s2, Fields joinFields2, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) {
    return join(Arrays.asList(s1, s2), Arrays.asList(joinFields1, joinFields2), outFields, mixed, mode);
}

public Stream join(List<Stream> streams, List<Fields> joinFields, Fields outFields, List<JoinType> mixed, JoinOutFieldsMode mode) {
    switch (mode) {
        case COMPACT:
            return multiReduce(strippedInputFields(streams, joinFields),
                    groupedStreams(streams, joinFields),
                    new JoinerMultiReducer(mixed, joinFields.get(0).size(), strippedInputFields(streams, joinFields)),
                    outFields);
        case PRESERVE:
            return multiReduce(strippedInputFields(streams, joinFields),
                    groupedStreams(streams, joinFields),
                    new PreservingFieldsOrderJoinerMultiReducer(mixed, joinFields.get(0).size(),
                            getAllOutputFields(streams), joinFields, strippedInputFields(streams, joinFields)),
                    outFields);
        default:
            throw new IllegalArgumentException("Unsupported out-fields mode: " + mode);
    }
}
  • join ultimately calls multiReduce; for the COMPACT mode the GroupedMultiReducer used is JoinerMultiReducer, and for the PRESERVE mode it is PreservingFieldsOrderJoinerMultiReducer. A usage sketch follows.
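
A usage sketch following the shape of the example in the Trident documentation: given a hypothetical stream1 with fields ["key", "val1", "val2"] and stream2 with fields ["x", "val1"], an inner join on key/x can be declared as:

// output fields: the join field first ("key"), then the non-join fields of
// stream1 renamed to "a", "b", then the non-join fields of stream2 renamed to "c"
topology.join(stream1, new Fields("key"),
              stream2, new Fields("x"),
              new Fields("key", "a", "b", "c"));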

merge

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java

public Stream merge(Fields outputFields, Stream... streams) {
    return merge(outputFields, Arrays.asList(streams));
}

public Stream merge(Stream... streams) {
    return merge(Arrays.asList(streams));
}

public Stream merge(List<Stream> streams) {
    return merge(streams.get(0).getOutputFields(), streams);
}

public Stream merge(Fields outputFields, List<Stream> streams) {
    return multiReduce(streams, new IdentityMultiReducer(), outputFields);
}
  • merge likewise ends up in multiReduce, with IdentityMultiReducer as the MultiReducer; see the usage sketch below
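
A usage sketch (the stream variables are hypothetical): merge funnels several streams with compatible schemas into one, taking the output field names from the first stream unless outputFields is given explicitly:

// merged carries the output fields of stream1
Stream merged = topology.merge(stream1, stream2, stream3);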

multiReduce

storm-core-1.2.2-sources.jar!/org/apache/storm/trident/TridentTopology.java

public Stream multiReduce(Stream s1, Stream s2, MultiReducer function, Fields outputFields) {
    return multiReduce(Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(Fields inputFields1, Stream s1, Fields inputFields2, Stream s2, MultiReducer function, Fields outputFields) {
    return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(GroupedStream s1, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) {
    return multiReduce(Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(Fields inputFields1, GroupedStream s1, Fields inputFields2, GroupedStream s2, GroupedMultiReducer function, Fields outputFields) {
    return multiReduce(Arrays.asList(inputFields1, inputFields2), Arrays.asList(s1, s2), function, outputFields);
}

public Stream multiReduce(List<Stream> streams, MultiReducer function, Fields outputFields) {
    return multiReduce(getAllOutputFields(streams), streams, function, outputFields);
}

public Stream multiReduce(List<GroupedStream> streams, GroupedMultiReducer function, Fields outputFields) {
    return multiReduce(getAllOutputFields(streams), streams, function, outputFields);
}

public Stream multiReduce(List<Fields> inputFields, List<GroupedStream> groupedStreams, GroupedMultiReducer function, Fields outputFields) {
    List<Fields> fullInputFields = new ArrayList<>();
    List<Stream> streams = new ArrayList<>();
    List<Fields> fullGroupFields = new ArrayList<>();
    for(int i=0; i<groupedStreams.size(); i++) {
        GroupedStream gs = groupedStreams.get(i);
        Fields groupFields = gs.getGroupFields();
        fullGroupFields.add(groupFields);
        streams.add(gs.toStream().partitionBy(groupFields));
        fullInputFields.add(TridentUtils.fieldsUnion(groupFields, inputFields.get(i)));
    }
    return multiReduce(fullInputFields, streams, new GroupedMultiReducerExecutor(function, fullGroupFields, inputFields), outputFields);
}

public Stream multiReduce(List<Fields> inputFields, List<Stream> streams, MultiReducer function, Fields outputFields) {
    List<String> names = new ArrayList<>();
    for(Stream s: streams) {
        if(s._name!=null) {
            names.add(s._name);
        }
    }
    Node n = new ProcessorNode(getUniqueStreamId(), Utils.join(names, "-"), outputFields, outputFields, new MultiReducerProcessor(inputFields, function));
    return addSourcedNode(streams, n);
}
  • multiReduce takes a MultiReducer parameter; join and merge both funnel into multiReduce, but they pass different MultiReducer implementations

summary

  • trident's operations fall into a few categories: basic function, filter, and projection operations; repartitioning operations, which are essentially groupings; aggregate operations, comprising aggregate, partitionAggregate, and persistentAggregate; and topology-level stream operations such as join and merge
  • For function, any emitted fields are appended to the original tuple; filter is used to drop tuples from a stream; projection is used to extract a subset of fields
  • The repartitioning operations include Grouping.local_or_shuffle, Grouping.shuffle, Grouping.all, GlobalGrouping, CustomStreamGrouping, IdentityGrouping, IndexHashGrouping, and so on; a partition operation can be understood as assigning the input tuples to tasks, or equivalently as grouping the stream
  • For aggregation, the plain aggregate operation accepts three kinds of interfaces: Aggregator, CombinerAggregator, and ReducerAggregator. Aggregator is the most general: it extends the Operation interface and its methods receive a collector, which CombinerAggregator and ReducerAggregator do not. Unlike Aggregator and ReducerAggregator, when stream.aggregate is called with a CombinerAggregator, trident first performs a partial aggregation on each partition and then merges into a single partition for the final aggregation, which saves network transfer; chaining a CombinerAggregator with a non-CombinerAggregator, however, forfeits this optimization. partitionAggregate operates at the partition level, while persistentAggregate operates across all batches of the whole stream and persists the result in state
  • The stream join and merge operations are both implemented on top of multiReduce, just with different MultiReducer values. join matches tuples on the specified fields (the field names need not be identical) and a JoinType of INNER or OUTER can be chosen; note that join operates within the spout's small batches. merge simply funnels the tuples of several streams into one.


Reposted from: https://my.oschina.net/go4it/blog/2615110
