Commit 15de293

docs: updates on tutorial folder (#3754)
* update on tutorial folder * Update tutorial_sql_2.md * update for links
1 parent 9f0d3fc commit 15de293

File tree

5 files changed

+23
-19
lines changed

docs/en/tutorial/index.rst

Lines changed: 6 additions & 4 deletions
@@ -5,12 +5,14 @@ Tutorials
55
.. toctree::
66
:maxdepth: 1
77

8-
standalone_vs_cluster
9-
modes
8+
data_import_guide
109
tutorial_sql_1
1110
tutorial_sql_2
12-
data_import
1311
openmldbspark_distribution
12+
data_import
13+
data_export
1414
autofe
15-
common_architecture
15+
standalone_vs_cluster
16+
standalone_use
17+
app_arch
1618
online_offline_sync

docs/en/tutorial/tutorial_sql_1.md

Lines changed: 10 additions & 9 deletions
@@ -1,7 +1,7 @@
11
# SQL for Feature Extraction (Part 1)
22

33

4-
## 1. The Feature Engineering of Machine Learning
4+
## 1. Feature Engineering for Machine Learning
55

66
A real-world machine learning application generally includes two main processes, namely **Feature Engineering** and **Machine Learning Model** (hereinafter referred to as **Model**). We must know a lot about the model, from the classic logistic regression and decision tree models to the deep learning models, we all focus on how to develop high-quality models. We may pay less attention to feature engineering.
77
However, as the saying goes, data and features determine the upper limit of machine learning, while models and algorithms only approach this limit. It can be seen that we have long agreed on the importance of Feature Engineering.
@@ -59,7 +59,7 @@ For example, the following user transaction table (hereinafter referred as data
5959
| trans_type | STRING | Transaction Type |
6060
| province | STRING | Province |
6161
| city | STRING | City |
62-
| label | BOOL | Sample label, true\|false |
62+
| label | BOOL | Sample label, `true` or `false` |
6363

6464
In addition to the primary table, there may also be tables storing relevant auxiliary information in the database, which can be combined with the primary table through the JOIN operation. These tables are called **Secondary Tables** (note that there may be multiple secondary tables). For example, we can have a secondary table storing the merchants' history flow. In the process of feature engineering, more valuable information can be obtained by combining the primary and secondary tables. The feature engineering over multiple tables will be introduced in detail in the [next part](tutorial_sql_2.md) of this series.
6565

@@ -143,39 +143,40 @@ Important parameters include:
143143
- The lower bound time must be `>=` the upper bound time.
144144
- The lower bound row must follow the upper bound row.
145145

146+
For more features, please refer to the [documentation](../openmldb_sql/dql/WHERE_CLAUSE.md).
146147

147148
#### Example
148149

149150
For the transaction table T1 shown above, we define two `ROWS_RANGE` windows and two `ROWS` windows. The windows of each row are grouped by user ID (`uid`) and sorted by transaction time (`trans_time`). The following figure shows the result of grouping and sorting.
150151

151152
![img](images/table_t1.png)
152153

153-
Note that the following window definitions are not completed SQL. We will add aggregate functions later to complete runnable SQL.
154+
Note that the following window definitions are not complete SQL statements. We will add aggregate functions to make them runnable. (See [3.3.2](332-step-2constructfeaturesbasedontimewindow))
154155

155-
- w1d: the window within the most recent day
156+
**w1d: the window within the most recent day**
156157
This window covers the user's most recent day: it contains the rows from one day before the current row up to the current row.
157158
```sql
158159
window w1d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
159160
```
160161

161162
The `w1d` window shown in the above figure is for the partition `id=9`, and the `w1d` window contains three rows (`id=6`, `id=8`, `id=9`). These three rows fall in the time window [2022-02-07 12:00:00, 2022-02-08 12:00:00] .
162163

163-
- w1d_10d: the window from 1 day ago to the last 10 days
164+
**w1d_10d: the window from 1 day ago to the last 10 days**
164165
```sql
165166
window w1d_10d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING)
166167
```
167168

168169
The window `w1d_10d` for the partition `id=9` contains three rows, which are `id=1`, `id=3` and `id=4`. These three rows fall in the time window of [2022-01-29 12:00:00, 2022-02-07 12:00:00]
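
To check which rows the two `ROWS_RANGE` windows above select, their behaviour can be mimicked outside the database. The following Python sketch is an illustration only, not OpenMLDB's implementation. The helper `rows_range_window` and the sample rows are hypothetical, and the sketch assumes both bounds are inclusive, as the closed time intervals in the text suggest.

```python
from datetime import datetime, timedelta


def rows_range_window(rows, current, lower, upper=timedelta(0)):
    """Select rows of the current row's partition whose time lies in
    [current_time - lower, current_time - upper] (both ends inclusive)."""
    start = current["trans_time"] - lower
    end = current["trans_time"] - upper
    return [
        r for r in rows
        if r["uid"] == current["uid"] and start <= r["trans_time"] <= end
    ]


# Hypothetical sample data; these are NOT the rows of table t1 in the figure.
rows = [
    {"id": 1, "uid": "u1", "trans_time": datetime(2022, 2, 1, 12)},
    {"id": 2, "uid": "u1", "trans_time": datetime(2022, 2, 7, 18)},
    {"id": 3, "uid": "u2", "trans_time": datetime(2022, 2, 8, 9)},
    {"id": 4, "uid": "u1", "trans_time": datetime(2022, 2, 8, 12)},
]
current = rows[3]

# w1d: ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW
w1d_ids = [r["id"] for r in rows_range_window(rows, current, timedelta(days=1))]

# w1d_10d: ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING
w1d_10d_ids = [
    r["id"]
    for r in rows_range_window(rows, current, timedelta(days=10), timedelta(days=1))
]
```

With this sample data, `w1d_ids` holds the rows of user `u1` from the last day (including the current row), and `w1d_10d_ids` holds the rows of the same user that are between 1 and 10 days old.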
169170

170-
- w0_1: the window contains the last 0 ~ 1 rows
171+
**w0_1: the window contains the last 0 ~ 1 rows**
171172
The window contains the last 0 ~ 1 rows, including the previous line and the current line.
172173
```sql
173174
window w0_1 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
174175
```
175176

176177
The window `w0_1` for the partition `id=10` contains 2 rows, which are `id=7` and `id=10`.
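
A `ROWS` window counts rows instead of measuring time. The Python sketch below mimics that behaviour under stated assumptions. The helper `rows_window` and the sample rows are hypothetical, timestamps are plain integers for brevity, and the code is not how OpenMLDB evaluates windows internally.

```python
def rows_window(rows, current, lower, upper=0):
    """Select the rows that are between `lower` and `upper` positions before
    the current row within its partition, ordered by trans_time."""
    partition = sorted(
        (r for r in rows if r["uid"] == current["uid"]),
        key=lambda r: r["trans_time"],
    )
    # Position of the current row inside its sorted partition.
    i = next(k for k, r in enumerate(partition) if r["id"] == current["id"])
    start = max(0, i - lower)
    end = max(0, i - upper + 1)  # clamp so an empty window stays empty
    return partition[start:end]


# Hypothetical sample data; integer timestamps stand in for trans_time.
rows = [
    {"id": 1, "uid": "u1", "trans_time": 10},
    {"id": 2, "uid": "u1", "trans_time": 20},
    {"id": 5, "uid": "u2", "trans_time": 25},
    {"id": 3, "uid": "u1", "trans_time": 30},
    {"id": 4, "uid": "u1", "trans_time": 40},
]

# ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, evaluated for row id=3
w0_1_ids = [r["id"] for r in rows_window(rows, rows[3], lower=1)]

# ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING, evaluated for row id=4
w2_10_ids = [r["id"] for r in rows_window(rows, rows[4], lower=10, upper=2)]
```

The same helper covers a window whose upper bound is not the current row: passing `upper=2` skips the current row and the one before it.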
177178

178-
- w2_10: the window contains the last 2 ~ 10 rows
179+
**w2_10: the window contains the last 2 ~ 10 rows**
179180

180181
```sql
181182
window w2_10 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING)
@@ -304,7 +305,7 @@ window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PREC
304305

305306
We make frequency statistics for a given column as we may need to know the type of the highest frequency, the proportion of the type with the largest number, etc., in each category.
306307

307-
`top1_ratio`: Find out the type with the largest number and compute the proportion of its number in the window.
308+
**`top1_ratio`**: Find out the type with the largest number and compute the proportion of its number in the window.
308309

309310
The following SQL uses `top1_ratio` to find out the city with the most transactions in the last 30 days and compute the proportion of the number of transactions of the city to the total number of transactions in t1.
310311
```sql
@@ -314,7 +315,7 @@ FROM t1
314315
window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PRECEDING AND CURRENT ROW);
315316
```
316317

317-
`topn_frequency(col, top_n)`: Find the `top_n` categories with the highest frequency in the window
318+
**`topn_frequency(col, top_n)`**: Find the `top_n` categories with the highest frequency in the window
318319

319320
The following SQL uses `topn_frequency` to find out the top 2 cities with the highest number of transactions in the last 30 days in t1.
320321
```sql

docs/en/tutorial/tutorial_sql_2.md

Lines changed: 4 additions & 4 deletions
@@ -63,7 +63,7 @@ As shown below, left table `LAST JOIN` right table with `ORDER BY` and right tab
6363

6464
## 3. Multi-Row Aggregation over Multiple Tables
6565

66-
For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../reference/sql/dql/WINDOW_CLAUSE.md#window-union) syntax.
66+
For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) syntax.
6767
WINDOW UNION supports combining multiple pieces of data from the secondary table to form a window on secondary table.
6868
Based on the time window, it is convenient to construct the multi-row aggregation feature of the secondary table.
6969
Similarly, two steps need to be completed to construct the multi-row aggregation feature of the secondary table:
@@ -122,10 +122,10 @@ Among them, necessary elements include:
122122
- Lower bound time must be >= upper bound time
123123
- The row number of the lower bound must be <= the row number of the upper bound
124124
- `INSTANCE_NOT_IN_WINDOW`: It indicates that except for the current row, other data in the main table will not enter the window.
125-
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../reference/sql/dql/WINDOW_CLAUSE.md).
125+
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../openmldb_sql/sql/dql/WINDOW_CLAUSE.md).
126126
```
127127

128-
### Example
128+
#### Example
129129

130130
Let's see the usage of WINDOW UNION through specific examples.
131131

@@ -166,7 +166,7 @@ PARTITION BY mid ORDER BY purchase_time
166166
ROWS_RANGE BETWEEN 10d PRECEDING AND 1 PRECEDING INSTANCE_NOT_IN_WINDOW)
167167
```
168168

169-
## 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
169+
### 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
170170

171171
Apply the multi-row aggregation function on the created window to construct aggregation features on multi-rows of secondary table, so that the number of rows finally generated is the same as that of the main table.
172172
For example, we can construct features from the secondary table like: the total retail sales of merchants in the last 10 days `w10d_merchant_purchase_amt_sum` and the total consumption times of the merchant in the last 10 days `w10d_merchant_purchase_count`.
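
The effect of aggregating over a WINDOW UNION window can be sketched in plain Python. This is a simplified illustration, not OpenMLDB's implementation. The function `window_union_features`, the column `purchase_amt`, and the sample rows are assumptions. The window is approximated as all secondary-table rows of the same merchant in the 10 days strictly before the main-table row, with no main-table rows inside it (as with `INSTANCE_NOT_IN_WINDOW`).

```python
from datetime import datetime, timedelta


def window_union_features(main_rows, secondary_rows, span=timedelta(days=10)):
    """Produce one feature row per main-table row by aggregating the
    secondary-table rows of the same merchant in the preceding `span`."""
    features = []
    for m in main_rows:
        start = m["trans_time"] - span
        window = [
            s for s in secondary_rows
            if s["mid"] == m["mid"] and start <= s["purchase_time"] < m["trans_time"]
        ]
        features.append({
            "id": m["id"],
            "w10d_merchant_purchase_amt_sum": sum(s["purchase_amt"] for s in window),
            "w10d_merchant_purchase_count": len(window),
        })
    return features


# Hypothetical sample data for a main table and a secondary table.
main_rows = [
    {"id": 1, "mid": "m1", "trans_time": datetime(2022, 2, 10, 12)},
    {"id": 2, "mid": "m2", "trans_time": datetime(2022, 2, 10, 12)},
]
secondary_rows = [
    {"mid": "m1", "purchase_time": datetime(2022, 1, 20, 12), "purchase_amt": 30.0},
    {"mid": "m1", "purchase_time": datetime(2022, 2, 5, 12), "purchase_amt": 100.0},
    {"mid": "m1", "purchase_time": datetime(2022, 2, 9, 12), "purchase_amt": 50.0},
    {"mid": "m2", "purchase_time": datetime(2022, 2, 10, 12), "purchase_amt": 70.0},
]

features = window_union_features(main_rows, secondary_rows)
```

The result has exactly one entry per main-table row, which matches the statement above that the output has the same number of rows as the main table.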

docs/zh/tutorial/tutorial_sql_1.md

Lines changed: 2 additions & 1 deletion
@@ -144,6 +144,7 @@ window window_name as (PARTITION BY partition_col ORDER BY order_col ROWS_RANGE
144144
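- In OpenMLDB, the row count of the lower bound must be <= the row count of the upper bound
145145

146146
For more syntax and features, see the [OpenMLDB Window Reference Manual](../openmldb_sql/dql/WHERE_CLAUSE.md)
147+
147148
#### Example
148149
For the transaction table t1 shown above, we define two time windows and two row-count windows. The window of each sample row is grouped by user ID (`uid`) and sorted by transaction time (`trans_time`). The figure below shows the data after grouping and sorting.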
149150
![img](images/table_t1.jpg)
@@ -240,7 +241,7 @@ xxx_cate(col, cate) over w
240241
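- Parameter `col`: the column to aggregate.
241242
- Parameter `cate`: the grouping column.
242243

243-
The currently supported aggregate functions with the _cate suffix are: `count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`
244+
The currently supported aggregate functions with the `_cate` suffix are: `count_cate`, `sum_cate`, `avg_cate`, `max_cate`, `min_cate`
244245

245246
Related examples are as follows: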
246247

docs/zh/tutorial/tutorial_sql_2.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ SELECT * FROM s1 LAST JOIN s2 ORDER BY s2.std_ts ON s1.col1 = s2.col1;
6464

6565
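## 3. Multi-Row Aggregation Features over Secondary Tables
6666

67-
For the secondary-table join scenario, OpenMLDB extends the standard WINDOW syntax and adds the [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#windowunion) feature, which supports joining multiple rows from the secondary table to form a secondary-table window. On top of this window, multi-row aggregation features of the secondary table can be built conveniently. Likewise, constructing these features takes two steps:
67+
For the secondary-table join scenario, OpenMLDB extends the standard WINDOW syntax and adds the [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) feature, which supports joining multiple rows from the secondary table to form a secondary-table window. On top of this window, multi-row aggregation features of the secondary table can be built conveniently. Likewise, constructing these features takes two steps:
6868

6969
- Step 1: Define the secondary-table join window.
7070
- Step 2: Construct multi-row aggregation features of the secondary table over that window.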
