
Commit 2258616

Adding UI usage documentation. Fixing typos and cleaning up
1 parent 10c5407 commit 2258616

18 files changed: +432 additions, −263 deletions

docs/backend/faq.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/backend/setup-storm.md

Lines changed: 6 additions & 6 deletions
@@ -14,12 +14,12 @@ Bullet is configured at run-time using settings defined in a file. Settings not

 ## Installation

-To use Bullet, you need to implement a way to read from your data source and convert your data into Bullet Records (bullet-record is a transitive dependency for Bullet and can be found [in Bintray](https://bintray.com/yahoo/maven/bullet-record/view). You have two options in how to get your data into Bullet:
+To use Bullet, you need to implement a way to read from your data source and convert your data into Bullet Records (bullet-record is a transitive dependency for Bullet and can be found [in JCenter](ingestion.md#installing-the-record-directly)). You have two options for how to get your data into Bullet:

 1. You can implement a Spout that reads from your data source and emits Bullet Records. This spout must have a constructor that takes a List of Strings.
-2. You can pipe your existing Storm topology directly into Bullet. In other words, you convert the data you wish to be queryable through Bullet into Bullet Records from a bolt in your topology.
+2. You can pipe your existing Storm topology directly into Bullet. In other words, you convert the data you wish to be query-able through Bullet into Bullet Records from a bolt in your topology.

-Option 2 *directly* couples your topology to Bullet and as such, you would need to watch out for things like backpressure etc.
+Option 2 *directly* couples your topology to Bullet and, as such, you would need to watch out for things like back-pressure.

 You need a JVM-based project that implements one of the two options above. You include the Bullet artifact and Storm dependencies in your pom.xml or other dependency management system. The artifacts are available through JCenter, so you will need to add the repository.
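For a concrete sense of Option 1 above, here is a minimal sketch of such a spout. This is not Bullet's official example: the class name, the emitted field name, and the `BulletRecord` setter are assumptions made for illustration; check your bullet-record version for the actual API.

```java
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.yahoo.bullet.record.BulletRecord;

// For Storm versions less than 1.0, use the backtype.storm packages instead.
public class ExampleRecordSpout extends BaseRichSpout {
    private final List<String> args;
    private transient SpoutOutputCollector collector;

    // Bullet requires this: a constructor that takes a List of Strings.
    public ExampleRecordSpout(List<String> args) {
        this.args = args;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // Connect to your data source (Kafka, etc.) here, using this.args if needed.
    }

    @Override
    public void nextTuple() {
        // Read one event from your source and convert it into a Bullet Record.
        BulletRecord record = new BulletRecord();
        record.setString("someField", "someValue"); // assumed setter; see bullet-record
        collector.emit(new Values(record));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record")); // field name is illustrative
    }
}
```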

@@ -100,9 +100,9 @@ The jar artifact can be downloaded directly from [JCenter](http://jcenter.bintra

 You can also add ```<classifier>sources</classifier>``` or ```<classifier>javadoc</classifier>``` if you want the source or javadoc and ```<type>test-jar</type>``` for the test classes as with bullet-storm.

-Also, since storm-metrics and the Resource Aware Scheduler are not in Storm versions less than 1.0, there are changes in the Bullet settings. The settings that set the CPU and memory loads do not exist (so config file does not specify them). The setting to enable the topology scheduler are no longer present (you can still override these settings if you run a custom version of Storm by passing it to the storm jar command. [See below](#launch).) You can take a look the settings file on the storm-0.10 branch in the Git repo.
+Also, since storm-metrics and the Resource Aware Scheduler are not in Storm versions less than 1.0, there are changes in the Bullet settings. The settings that set the CPU and memory loads do not exist (so the config file does not specify them). The settings to enable the topology scheduler are no longer present (you can still override these settings if you run a custom version of Storm by passing them to the storm jar command; [see below](#launch)). You can take a look at the settings file on the storm-0.10 branch in the Git repo.

-If for some reason, you are running a version of Storm less than 1.0 that has the RAS backported to it and you wish to set the CPU and other settings, you will your own main class that mirrors the master branch of the main class but with backtype.storm packages instead.
+If, for some reason, you are running a version of Storm less than 1.0 that has the RAS back-ported to it and you wish to set the CPU and other settings, you will need your own main class that mirrors the main class on the master branch but with backtype.storm packages instead.

 ## Launch

@@ -123,7 +123,7 @@ storm jar your-fat-jar-with-dependencies.jar \
     -c topology.max.spout.pending=10000
 ```

-You can pass other arguments to Storm using the -c argument. The example above uses 64 ackers, which is the parallelism of the Filter Bolt. Storm DRPC follows the principle of leaving retries to the DRPC user (in our case, the Bullet web service). As a result, most of the DRPC components do not follow any at least once guarantees. However, you can enable at least once for the hop from your topology (or spout) to the Filter Bolt. This is why this example uses the parallelism of the Filter Bolt as the number of ackers since that is exactly the number of acker tasks we would need. Ackers are lightweight so you need not have the same number of tasks as Filter Bolts but you can tweak it accordingly. The example above also sets max spout pending to control how fast the spout emits. You could use the backpressure mechanisms in Storm in addition or in lieu of as you choose. We have found that max spout pending gives a much more predictable way of throttling our spouts during catch up or data spikes.
+You can pass other arguments to Storm using the -c argument. The example above uses 64 ackers, which is the parallelism of the Filter Bolt. Storm DRPC follows the principle of leaving retries to the DRPC user (in our case, the Bullet web service). As a result, most of the DRPC components do not provide any at-least-once guarantees. However, you can enable at-least-once for the hop from your topology (or spout) to the Filter Bolt. This is why this example uses the parallelism of the Filter Bolt as the number of ackers, since that is exactly the number of acker tasks we would need (not accounting for the DRPCSpout to the PrepareRequest Bolt acking). Ackers are lightweight, so you need not have the same number of tasks as Filter Bolts, but you can tweak it accordingly. The example above also sets max spout pending to control how fast the spout emits. You could use the back-pressure mechanisms in Storm in addition to or in lieu of it, as you choose. We have found that max spout pending gives a much more predictable way of throttling our spouts during catch-up or data spikes.

 !!! note "Main Class Arguments"
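If you control the main class instead, the same overrides can be set programmatically with Storm's standard `Config` keys. A small sketch; the values mirror the example above:

```java
import org.apache.storm.Config;

public class LaunchSettings {
    public static void main(String[] args) {
        Config config = new Config();
        // Equivalent to passing -c topology.acker.executors=64 to storm jar;
        // matches the Filter Bolt parallelism in the example above.
        config.setNumAckers(64);
        // Equivalent to -c topology.max.spout.pending=10000; throttles the
        // spout during catch-up or data spikes.
        config.setMaxSpoutPending(10000);
        // Pass this Config to StormSubmitter.submitTopology(...) with your topology.
    }
}
```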

docs/backend/storm-architecture.md

Lines changed: 14 additions & 12 deletions
@@ -4,7 +4,7 @@ This section describes how the [Backend architecture](../index.md#backend) is im

 ## Storm DRPC

-Bullet on [Storm](https://storm.apache.org/) is built using [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html). DRPC or Distributed Remote Procedure Call, is built into Storm and consist of a set of servers that are part of the Storm cluster. When a Storm topology that uses DRPC is launched, it registers a spout with a unique name (the procedure in the Distributed Remote Procedure Call) with the DRPC infrastructure. The DRPC Servers expose a REST endpoint where data can be POSTed to or a GET request can be made with this unique name. The DRPC infrastructure then sends the request (a query in Bullet) through the spout(s) to the topology (Bullet). The result from Bullet is sent back to the client. We picked Storm to implement Bullet on first not only because it was the most popular Streaming framework at Yahoo but also since DRPC provides us a nice and simple way to handle getting queries into Bullet and sending responses back.
+Bullet on [Storm](https://storm.apache.org/) is built using [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html). DRPC, or Distributed Remote Procedure Call, is built into Storm and consists of a set of servers that are part of the Storm cluster. When a Storm topology that uses DRPC is launched, it registers a spout with a unique name (the procedure in the Distributed Remote Procedure Call) with the DRPC infrastructure. The DRPC Servers expose a REST endpoint where data can be POSTed or a GET request can be made with this unique name. The DRPC infrastructure then sends the request (a query in Bullet) through the spout(s) to the topology that registered that name (Bullet). The result from the topology is sent back to the client. We picked Storm to implement Bullet on first not only because it was the most popular streaming framework at Yahoo but also because DRPC provides a nice and simple way to get queries into Bullet and send responses back.

 !!! note "Thrift and DRPC servers"
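From the client's perspective, the request path above boils down to a call against a DRPC server. A minimal sketch using Storm's `DRPCClient`; the host and the function name `bullet-query` are placeholders, since the function name is whatever was registered at topology launch:

```java
import java.util.Map;

import org.apache.storm.utils.DRPCClient;
import org.apache.storm.utils.Utils;

public class BulletQueryClient {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();
        // 3772 is Storm's default DRPC port; the host is a placeholder.
        DRPCClient client = new DRPCClient(conf, "drpc.example.com", 3772);
        // The argument is the Bullet query; the returned String is the result.
        String result = client.execute("bullet-query", "{}");
        System.out.println(result);
    }
}
```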

@@ -22,36 +22,38 @@ The red colored lines are the path for the queries that come in through Storm DR

 !!! note "What's a Ticker?"

-    The Ticker component is attached to the Filter and Join Bolts produce Storm tuples at predefined intervals. This is a Storm feature (and is configurable when you launch the Bullet topology). These tuples, called tick tuples, behave like a CPU clock cycles for Bullet. Bullet performs all its system related activities on a tick. This includes purging stale queries, emitting left over data for queries, etc. We could have gone the route of having asynchronous threads that do the same thing but this was a far more simpler and elegant solution. The downside is that Bullet is as fast or as slow as its tick period, which can only go as low at 1 s in Storm.
+    The Ticker component attached to the Filter and Join Bolts produces Storm tuples at predefined intervals. This is a Storm feature (and is configurable when you launch the Bullet topology). These tuples, called tick tuples, behave like CPU clock cycles for Bullet. Bullet performs all its system-related activities on a tick. This includes purging stale queries, emitting leftover data for queries, etc. We could have gone the route of having asynchronous threads that do the same thing, but this was a far simpler solution. The downside is that Bullet is as fast or as slow as its tick period, which can only go as low as 1 s in Storm. In practice, this means that your window is longer by a tick, and you can accommodate that in your query if you wish.

-    For example, when the final data is emitted from the Filter bolts when the query has expired, the Join bolt receiving it waits for 3 (this is configurable) ticks after *its query* expires to collect all the last intermediate results from the Filter bolts. If the tick period is set as high as 5 s, this means that a query will take 3 * 15 or 15 s to get back after its expiry! Setting it to 1 s, makes it 1 * 3 s. By changing the number of ticks that the Join bolt waits for and the tick period, you can get to any integral delay >= 1 s.
+    As a practical example of how Bullet uses ticks: when the final data is emitted from the Filter bolts after the query has expired, the Join bolt receiving it waits for 3 (this is configurable) ticks after *its query* expires to collect all the last intermediate results from the Filter bolts. If the tick period is set as high as 5 s, this means that a query will take 3 * 5 or 15 s to get back after its expiry! Setting it to 1 s makes it 3 * 1 or 3 s. By changing the number of ticks that the Join bolt waits for and the tick period, you can get to any integral delay >= 1 s.
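The tick mechanism in the note above is standard Storm: a bolt requests tick tuples through its component configuration. A sketch, with the one-second frequency reflecting the minimum discussed above:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public abstract class TickingBolt extends BaseRichBolt {
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        // Ask Storm to deliver a tick tuple to this bolt every second.
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 1);
        return conf;
    }

    // Tick tuples arrive from a special system component on a system stream.
    protected boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }
}
```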
 ### Data processing

 Bullet can accept arbitrary sources of data as long as they can be read from Storm. You can either:

-1. Write a Storm spout that reads your data from whereever it is (Kafka, etc) and [converts it to Bullet Records](ingestion.md). See [Quick Start](../quick-start.md#storm-topology) for an example.
+1. Write a Storm spout that reads your data from wherever it is (Kafka, etc.) and [converts it to Bullet Records](ingestion.md). See [Quick Start](../quick-start.md#storm-topology) for an example.
 2. Hook up an existing topology that is doing something else directly to Bullet. You will still write and hook up a component that converts your data into Bullet Records in your existing topology.

-Option 2 is nice if you do not want to introduce a persistence layer between your existing Streaming pipeline and Bullet. For example, if you just want periodically look at some data within your topology, you could filter them, convert them into Bullet Records and send it into Bullet. You could also sample data. The downside of Option 2 is that you will directly couple your topology with Bullet leaving your topology to be affected by Bullet through Storm features like backpressure (if you are on Storm 1.0) etc. You could also go with Option 2 if you need something more complex than just a spout from Option 1. For example, you may want to process your data in some fashion before emitting to Bullet.
+Option 2 is nice if you do not want to introduce a persistence layer between your existing streaming pipeline and Bullet. For example, if you just want to periodically look at some data within your topology, you could filter it, convert it into Bullet Records and send it into Bullet. You could also sample data. The downside of Option 2 is that you directly couple your topology with Bullet, leaving your topology to be affected by Bullet through Storm features like back-pressure (if you are on Storm 1.0). You could also go with Option 2 if you need something more complex than just a spout from Option 1. For example, you may want to process your data in some fashion before emitting to Bullet.
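For Option 2 above, the converting component is typically a bolt at the edge of your existing topology. A sketch; again, the field names and the `BulletRecord` setter are assumptions made for illustration:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import com.yahoo.bullet.record.BulletRecord;

public class RecordConverterBolt extends BaseRichBolt {
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Convert your topology's tuple into a Bullet Record.
        BulletRecord record = new BulletRecord();
        record.setString("event", tuple.getString(0)); // assumed setter and schema
        // Anchoring the emit enables the at-least-once hop to the Filter Bolt.
        collector.emit(tuple, new Values(record));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record")); // field name is illustrative
    }
}
```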

-Your data is then emitted to the Filter bolt which promptly drops all Bullet Records and does absolutely nothing if you have no queries in your system. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and if it matches, it is processed by
-the query. Each query can choose to emit matched records in micro-batches. For example, queries that collect raw records (a LIMIT operation) do not micro-batch at all. Every matched record (up to the maximum for the query) is emitted. Queries that aggregate, on the other hand, keep the query around till its
-duration is up and emit the local result.
+Your data is then emitted to the Filter bolt, which promptly drops all Bullet Records and does absolutely nothing if you have no queries in your system. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and, if it matches, it is processed by the query. Each query can choose to emit matched records in micro-batches. For example, queries that collect raw records (a LIMIT operation) do not micro-batch at all. Every matched record (up to the maximum for the query) is emitted. Queries that aggregate, on the other hand, keep the query around till its duration is up and emit the local result.
+
+!!! note "To micro-batch or not to micro-batch?"
+
+    ```RAW``` queries micro-batch by size 1, which makes Bullet really snappy when running those queries. As soon as your maximum record limit is reached, the query immediately returns. On the other hand, the other queries do not micro-batch at all. ```GROUP``` and other aggregate queries *cannot* return till they see all the data in your time window because some late-arriving data may update an existing aggregate. So, these other queries have to wait for the entire query duration anyway. Once the queries have timed out, we have to rely on the ticks to get all the intermediate results over to the combiner to merge. Micro-batches are still useful here because we can still emit intermediate aggregations (they are [additive](#combining)) and relieve memory pressure by periodically purging intermediate results. In practice though, Bullet queries are generally short-lived, so this isn't as needed as it may seem at first glance. Depending on whether others (you) find it necessary, we may decide to implement micro-batching for query types besides ```RAW```.
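To illustrate the ```RAW``` policy in the note above, a toy sketch (an invented class, not Bullet's internals): with a micro-batch size of 1, every match is emitted immediately and the query can return as soon as the limit is hit, instead of waiting out its duration.

```java
// Toy model of the RAW emission policy described above; not Bullet's code.
public class RawQuery {
    private final int limit;
    private int matched = 0;

    public RawQuery(int limit) {
        this.limit = limit;
    }

    // Micro-batch of size 1: each match is emitted as soon as it is seen.
    // Returns false once the record limit is reached - the query is done
    // and can return well before its duration is up.
    public boolean consume(Object record) {
        matched++; // emit(record) would happen right here
        return matched < limit;
    }
}
```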
 ### Request processing

-Storm DRPC handles receiving REST requests for the whole topology. The DRPC spouts fetch these requests (DRPC knows the request is for the Bullet topology using the unique function name set when launching the topology) and shuffle them to the Prepare Request bolts. The request also contains information about how to return the response back to the DRPC servers. The Prepare Request bolts generate unique identifiers for each request (a Bullet query) and broadcasts them to every Filter bolt. Since every Filter bolt has a copy of every query, the shuffled data from the source of data will match or not match the query no matter which particular Filter bolt it ends up at. Each Filter bolt has access to the unique query id and is able to key group by the id to the Join bolt with the intermediate results for the query.
+Storm DRPC handles receiving REST requests for the whole topology. The DRPC spouts fetch these requests (DRPC knows the request is for the Bullet topology from the unique function name set when launching the topology) and shuffle them to the Prepare Request bolts. The request also contains information about how to return the response back to the DRPC servers. The Prepare Request bolts generate a unique identifier for each request (a Bullet query) and broadcast it to every Filter bolt. Since every Filter bolt has a copy of every query, the shuffled data from the source of data can be compared against the query no matter which particular Filter bolt it ends up at. Each Filter bolt has access to the unique query id and is able to key group by the id to the Join bolt with the intermediate results for the query.

-The Prepare Request bolt also key groups the query and the return information to the Join bolts.
+The Prepare Request bolt also key groups the query and the return information to the Join bolts. This means that only *one* Join bolt ever receives a given query.
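In Storm terms, the routing described above maps onto the built-in stream groupings. A sketch of the wiring; the component names, parallelisms, and the "id" field are illustrative, and the spouts and real bolts are omitted:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class GroupingSketch {
    // Stand-in bolt so the sketch compiles; Bullet's real bolts differ.
    static class StubBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) { }
        @Override public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "payload"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // Queries are broadcast from Prepare Request to every Filter bolt,
        // while the data stream is shuffled across them.
        builder.setBolt("filter", new StubBolt(), 64)
               .allGrouping("prepare")
               .shuffleGrouping("data");
        // The query (plus return info) and all intermediate results are key
        // grouped on the query id: exactly one Join bolt sees a given query.
        builder.setBolt("join", new StubBolt(), 8)
               .fieldsGrouping("prepare", new Fields("id"))
               .fieldsGrouping("filter", new Fields("id"));
    }
}
```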
 ### Combining

-Since the data from the Prepare Request bolt (a query and a piece of return information for the query) and the data from all Filter bolts (intermediate results) is key grouped by the unique query id, only one particular Join bolt receives both the query and all the intermediate results for a particular query. The Join bolt can then combine all the intermediate results and produce a final result. This final result is joined (hence the name) with the return information for the query and is shuffled to the Return Results bolt. This bolt then uses the return information to send the results back to a DRPC server, who then returns it back to the requestee.
+Since the data from the Prepare Request bolt (a query and a piece of return information for the query) and the data from all Filter bolts (intermediate results) are key grouped by the unique query id, only one particular Join bolt receives both the query and all the intermediate results for a particular query. The Join bolt can then combine all the intermediate results and produce a final result. This final result is joined (hence the name) with the return information for the query and is shuffled to the Return Results bolt. This bolt then uses the return information to send the results back to a DRPC server, which then returns it to the requester.

 !!! note "Combining and operations"

-    In order to be able to combine intermediate results and process data in any order, all aggreations that Bullet does need to be associative and have an identity. In other words, they need to be [Monoids](https://en.wikipedia.org/wiki/Monoid). Luckily for us, the [DataSketches](http://datasketches.github.io) that we use are monoids (actually are commutative monoids). Sketches be unioned and thus all the aggregations we support - SUM, COUNT, MIN, MAX, AVG, COUNT DISTINCTS, DISTINCT - are monoidal. (AVG is monoidal if you store a SUM and a COUNT instead).
+    In order to be able to combine intermediate results and process data in any order, all aggregations that Bullet does need to be associative and have an identity. In other words, they need to be [Monoids](https://en.wikipedia.org/wiki/Monoid). Luckily for us, the [DataSketches](http://datasketches.github.io) that we use are monoids (commutative monoids, in fact). Sketches can be unioned, and thus all the aggregations we support - SUM, COUNT, MIN, MAX, AVG, COUNT DISTINCTS, DISTINCT - are monoidal. (AVG is monoidal if you store a SUM and a COUNT instead.)

 ## Scalability
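As a tiny worked example of the last point in the note (AVG becomes a monoid once you store a SUM and a COUNT), here is a sketch, not Bullet's implementation:

```java
// AVG as a monoid: the identity is (sum = 0, count = 0) and combine is
// pairwise addition, which is associative, so partial results from any
// number of Filter bolts can be merged in any order.
public final class AvgMonoid {
    private final double sum;
    private final long count;

    public AvgMonoid(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    public static AvgMonoid identity() {
        return new AvgMonoid(0.0, 0L);
    }

    public AvgMonoid combine(AvgMonoid other) {
        return new AvgMonoid(sum + other.sum, count + other.count);
    }

    public double average() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        AvgMonoid a = new AvgMonoid(10.0, 2); // partial result from one Filter bolt
        AvgMonoid b = new AvgMonoid(5.0, 3);  // partial result from another
        System.out.println(a.combine(b).average()); // 3.0 = (10 + 5) / (2 + 3)
    }
}
```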

docs/css/extra.css

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,10 @@
     font-size: 100%;
 }

+video {
+    width: 100%;
+}
+
 @media (min-width: 1650px) {
     body > .container {
         width: 1400px;
