docs/backend/ingestion.md (+2 -2)
@@ -4,7 +4,7 @@ Bullet operates on a generic data container that it understands. In order to get
!!! note "If you are trying to set up Bullet..."
-The rest of this page gives more information about the Record container and how to depend on it in code directly. If you are setting up Bullet, the Record is already included by default with the Bullet artifact. You can head on over to [setting up the Storm topology](setup-storm.md#installation) to build the piece that gets your data into the Record container.
+The rest of this page gives more information about the Record container and how to depend on it in code directly. If you are setting up Bullet, the Record is already included by default with the Bullet artifact. You can head on over to [setting up the Storm topology](storm-setup.md#installation) to build the piece that gets your data into the Record container.
## Bullet Record
@@ -31,7 +31,7 @@ With these types, it is unlikely you would have data that cannot be represented
## Installing the Record directly
-Generally, you depend on the Bullet artifact for your Stream Processor when you plug in the piece that gets your data into the Stream processor. The Bullet artifact already brings in the Bullet Record container as well. See the usage for the [Storm](setup-storm.md#installation).
+Generally, you depend on the Bullet Core artifact for your Stream Processor when you plug in the piece that gets your data into the Stream processor. The Bullet Core artifact already brings in the Bullet Record container as well. See the [Storm](storm-setup.md#installation) usage for an example.
However, if you need it, the artifacts are available through JCenter so you can depend on them in code directly. You will need to add the repository. Below is a Maven example:
docs/backend/storm-architecture.md (+10 -18)
@@ -2,23 +2,15 @@
This section describes how the [Backend architecture](../index.md#backend) is implemented in Storm.
-## Storm DRPC
-
-Bullet on [Storm](https://storm.apache.org/) is built using [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html). DRPC or Distributed Remote Procedure Call, is built into Storm and consists of a set of servers that are part of the Storm cluster. When a Storm topology that uses DRPC is launched, it registers a spout with a unique name (the procedure in the Distributed Remote Procedure Call) with the DRPC infrastructure. The DRPC Servers expose a REST endpoint where data can be POSTed to or a GET request can be made with this unique name. The DRPC infrastructure then sends the request (a query in Bullet) through the spout(s) to the topology that registered that name (Bullet). The result from topology is sent back to the client. We picked Storm to implement Bullet on first not only because it was the most popular Streaming framework at Yahoo but also since DRPC provides us a nice and simple way to handle getting queries into Bullet and sending responses back.
-
-!!! note "Thrift and DRPC servers"
-
-    DRPC also exposes a [Thrift](http://thrift.apache.org) endpoint but the Bullet Web Service uses REST for simplicity. When you launch your Bullet Storm topology, you can POST Bullet queries to a DRPC server directly with the function name that you specified in the Bullet configuration. This is a quick way to check if your topology is up and running!
-
## Topology
For Bullet on Storm, the Storm topology implements the backend piece from the full [Architecture](../index.md#architecture). The topology is implemented with the standard Storm spout and bolt components:
-(topology diagram image)
+(topology diagram image)
-The components in [Architecture](../index.md#architecture) have direct counterparts here. The DRPC servers, the DRPC spouts, the Prepare Request bolts comprise the Request Processor. The Filter bolts and your plugin for your source of Data make up the Data Processor. The Join bolt and the Return Results bolt make up the Combiner.
+The components in [Architecture](../index.md#architecture) have direct counterparts here. The Query spouts reading from the PubSub layer using plugged-in PubSub consumers make up the Request Processor. The Filter bolts and your plugin for your source of data (generally a spout but could be a topology) make up the Data Processor. The Join bolt and the Result bolt make up the Combiner.
-The red colored lines are the path for the queries that come in through Storm DRPC and the blue is for the data from your data source. The pattern on the lines denote how the data (Storm tuples) is moved to the next component. Dashed indicates a broadcast (sent to all instances of the component), dotted indicates a key grouping (sent to a particular instance based on hashing on a particular field), and solid indicates a shuffle (randomly sent to an instance).
+The red colored lines are the path for the queries that come in through the PubSub and the blue is for the data from your data source. The pattern on the lines denotes how the data (Storm tuples) is moved to the next component. Dashed indicates a broadcast (sent to all instances of the component), dotted indicates a key grouping (sent to a particular instance based on hashing on a particular field), and solid indicates a shuffle (randomly sent to an instance).
!!! note "What's a Ticker?"
@@ -39,21 +31,21 @@ Bullet can accept arbitrary sources of data as long as they can be read from Sto
| Option 2 | Saves a persistence layer | Ties your topology to Bullet directly, making it affected by Storm Backpressure etc |
| Option 2 | You can add bolts to do more processing on your data before sending it to Bullet | Increases the complexity of the topology |
-Your data is then emitted to the Filter bolt. If you have no queries in your system, the Filter Bolt will promptly drop all Bullet Records and do absolutely nothing. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and if it matches, it is processed by the query. Each query type can choose to emit matched records in micro-batches. By default, ```RAW``` or ```LIMIT``` queries do not micro-batch. Each matched record up to the maximum for the query is emitted at once at the Filter bolt. Queries that aggregate, on the other hand, keep the query around till its duration is up and emit the local result. This is because these queries *cannot* return till they see all the data in your time window anyway because some late arriving data may update an existing aggregate.
+Your data is then emitted to the Filter bolt. If you have no queries in your system, the Filter bolt will promptly drop all Bullet Records and do absolutely nothing. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and if it matches, it is processed by the query. Each query type can choose to emit matched records in micro-batches. By default, ```RAW``` or ```LIMIT``` queries do not micro-batch. Each matched record up to the maximum for the query is emitted at once at the Filter bolt. Queries that aggregate, on the other hand, keep the query around till its duration is up and emit the local result. This is because these queries *cannot* return till they see all the data in your time window anyway, since some late-arriving data may update an existing aggregate. When the upcoming incremental results feature lands, queries will periodically (at a configurable interval) emit their intermediate results for combining in the Join bolt.
!!! note "Why support micro-batching?"
-```RAW``` queries do not micro-batch by default, which makes Bullet really snappy when running those queries. As soon as your maximum record limit is reached, the query immediately returns. You can use a setting in [bullet_defaults.yaml](https://github.com/yahoo/bullet-storm/blob/master/src/main/resources/bullet_defaults.yaml) to turn on batching if you like. At some point in the future, micro-batching will let Bullet provide incremental results - partial results arrive over the duration of the query. Bullet can emit intermediate aggregations as they are all [additive](#combining).
+```RAW``` queries do not micro-batch by default, which makes Bullet really snappy when running those queries. As soon as your maximum record limit is reached, the query immediately returns. You can use a setting in [bullet_defaults.yaml](https://github.com/yahoo/bullet-storm/blob/master/src/main/resources/bullet_defaults.yaml) to turn on batching if you like. In the near future, micro-batching will let Bullet provide incremental results - partial results arrive over the duration of the query. Bullet can emit intermediate aggregations as they are all [additive](#combining).
### Request processing
-Storm DRPC handles receiving REST requests for the whole topology. The DRPC spouts fetch these requests (DRPC knows the request is for the Bullet topology using the unique function name set when launching the topology) and shuffle them to the Prepare Request bolts. The request also contains information about how to return the response back to the DRPC servers. The Prepare Request bolts generate unique identifiers for each request (a Bullet query) and broadcasts them to every Filter bolt. Since every Filter bolt has a copy of every query, the shuffled data from the source of data can be compared against the query no matter which particular Filter bolt it ends up at. Each Filter bolt has access to the unique query id and is able to key group by the id to the Join bolt with the intermediate results for the query.
+The Query spouts fetch Bullet queries through the PubSub layer using its plugged-in Subscribers. The queries received through the PubSub also contain information about the query such as its unique identifier and potentially other metadata. The Query spouts broadcast the query body to every Filter bolt. Since every Filter bolt has a copy of every query, the shuffled data from the source of data can be compared against the query no matter which particular Filter bolt it ends up at. Each Filter bolt has access to the unique query id and is able to key group by the id to the Join bolt with the intermediate results for the query.
-The Prepare Request bolt also key groups the query and the return information to the Join bolts. This means that the query will be assigned to one and only one Join bolt.
+The Query spout also key groups the query and additional query metadata to the Join bolts. This means that the query and the metadata will end up at exactly one Join bolt.
### Combining
-Since the data from the Prepare Request bolt (a query and a piece of return information for the query) and the data from all Filter bolts (intermediate results) is key grouped by the unique query id, only one particular Join bolt receives both the query and all the intermediate results for a particular query. The Join bolt can then combine all the intermediate results and produce a final result. This final result is joined (hence the name) with the return information for the query and is shuffled to the Return Results bolt. This bolt then uses the return information to send the results back to a DRPC server, which then returns it back to the requester.
+Since the data from the Query spout (query and metadata) and the data from all Filter bolts (intermediate results) is key grouped by the unique query id, only one particular Join bolt receives both the query and the intermediate results for a particular query. The Join bolt can then combine the intermediate results and produce a final result. This result is joined (hence the name) with the metadata for the query and is shuffled to the Result bolts. These bolts then use the Publisher from the plugged-in PubSub layer, using the metadata if needed, to send the results back through the PubSub layer to the requester.
!!! note "Combining and operations"
@@ -65,9 +57,9 @@ Since the data from the Prepare Request bolt (a query and a piece of return info
The topology set up this way scales horizontally and has some nice properties:
* If you want to scale for processing more data but the same amount of queries, you only need to scale the components that read your data (the spout reading the data or your custom topology) and the Filter bolts.
-* If you want to scale for more queries but the same amount of data, you generally need to scale up the Filter Bolts. If you only have a few DRPC servers in your Storm cluster, you may also need to add more to support more simultaneous DRPC requests. We have [found that](performance.md#conclusion_3) each server gives us about ~250 simultaneous queries. Finally, if you need it, you should scale the DRPC spouts, Prepare Request bolts, Join bolts and Return Results bolts. These components generally have low parallelisms compared to your data processing components since the data volume is generally much higher than your query volume.
+* If you want to scale for more queries but the same amount of data, you generally need to scale up the Filter bolts. If you need to, you can also scale the Query spouts, Join bolts, and Result bolts. You should ensure that your PubSub layer (if you're using the Storm DRPC PubSub layer, this is the number of DRPC servers in your Storm cluster) can handle the volume of queries and results being sent through it. These components typically have low parallelisms compared to your data processing components since the data volume is much higher than the query volume, so scaling them is usually not needed.
-See [Scaling for more Queries](performance.md#test-7-scaling-for-more-queries) and [Scaling for more Data](performance.md#test-6-scaling-for-more-data) for more details.
+See [Scaling for more Queries](storm-performance.md#test-7-scaling-for-more-queries) and [Scaling for more Data](storm-performance.md#test-6-scaling-for-more-data) for more details.
docs/backend/storm-performance.md (+4 -2)
@@ -23,6 +23,8 @@ You should be familiar with [Storm](http://storm.apache.org), [Kafka](http://kaf
All tests run here were using [Bullet-Storm 0.4.2](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.4.2) and [Bullet-Storm 0.4.3](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.4.3). We are working with just the Storm piece without going through the Web Service or the UI. The DRPC REST endpoint provided by Storm lets us do just that.
+This particular version of Bullet on Storm was **prior to the architecture shift** to a PubSub layer, but it is equivalent to using the Storm DRPC PubSub layer on a newer version of Bullet on Storm. Conceptually, you can replace the DRPC spout and PrepareRequest bolt with the Query spout, and the ReturnResults bolt with the Result bolt. The DRPC-based PubSub layer just uses these spout and bolt implementations underneath for its Publishers and Subscribers, so the parallelisms and CPU utilizations should map 1-1.
Using the pluggable metrics interface in Bullet on Storm, we captured worker-level metrics such as CPU time, heap usage, and GC times and types, and sent them to an in-house monitoring service for time-slicing and graphing. The figures shown below use this service.
See [0.3.0](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.3.0) for how to plug in your own metrics collection.
@@ -319,7 +321,7 @@ We notice that the GC times have improved a lot (down to ~12K ms from ~35K ms in
(performance graph image)
-<div class="one-text-numeri-table"></div>
+<div class="one-text-numeric-table"></div>
| Simultaneous Queries | Average CPU (ms) | Average Result size |
@@ -339,7 +341,7 @@ With this change in heap usage, we could get to **```735```** of these queries s
!!! note "735 is a hard limit then?"
-    This is what our Storm cluster's configuration limits us to. There is an async implementation for DRPC that we could eventually switch to. Also, an alternative to DRPC - such as using a Pub/Sub queue like Kafka to deliver queries and retrieve results from Bullet - may be required anyway to implement Bullet on other Stream Processors.
+    This is what our Storm cluster's configuration and our usage of Storm DRPC limit us to. There is an async implementation for DRPC that we could eventually switch to. If we used another PubSub implementation like Kafka, we would be able to bypass this limit.