docs/en/tutorial/tutorial_sql_1.md
Lines changed: 10 additions & 9 deletions
@@ -1,7 +1,7 @@
1
1
# SQL for Feature Extraction (Part 1)
2
2
3
3
4
-
## 1. The Feature Engineering of Machine Learning
4
+
## 1. Feature Engineering for Machine Learning
5
5
6
6
A real-world machine learning application generally includes two main processes, namely **Feature Engineering** and **Machine Learning Model** (hereinafter referred to as **Model**). Most of us know a lot about models: from classic logistic regression and decision trees to deep learning, we focus on how to develop high-quality models. We may pay less attention to feature engineering.
7
7
However, as the saying goes, data and features determine the upper limit of machine learning, while models and algorithms only approach this limit. It can be seen that we have long agreed on the importance of Feature Engineering.
@@ -59,7 +59,7 @@ For example, the following user transaction table (hereinafter referred as data
59
59
| trans_type | STRING | Transaction Type |
60
60
| province | STRING | Province |
61
61
| city | STRING | City |
62
-
| label | BOOL | Sample label, true\|false |
62
+
| label | BOOL | Sample label, `true` or `false` |
63
63
64
64
In addition to the primary table, there may also be tables storing relevant auxiliary information in the database, which can be combined with the primary table through the JOIN operation. These tables are called **Secondary Tables** (note that there may be multiple secondary tables). For example, we can have a secondary table storing the merchants' history flow. In the process of feature engineering, more valuable information can be obtained by combining the primary and secondary tables. The feature engineering over multiple tables will be introduced in detail in the [next part](tutorial_sql_2.md) of this series.
65
65
@@ -143,39 +143,40 @@ Important parameters include:
143
143
- The lower bound time must be `>=` the upper bound time.
144
144
- The lower bound row must follow the upper bound row.
145
145
146
+
For more features, please refer to the [documentation](../openmldb_sql/dql/WHERE_CLAUSE.md).
146
147
147
148
#### Example
148
149
149
150
For the transaction table T1 shown above, we define two `ROWS_RANGE` windows and two `ROWS` windows. The windows of each row are grouped by user ID (`uid`) and sorted by transaction time (`trans_time`). The following figure shows the result of grouping and sorting.
150
151
151
152

152
153
153
-
Note that the following window definitions are not completed SQL. We will add aggregate functions later to complete runnable SQL.
154
+
Note that the following window definitions are not complete SQL statements. We will add aggregate functions to make them runnable. (See [3.3.2](332-step-2constructfeaturesbasedontimewindow))
154
155
155
-
-w1d: the window within the most recent day
156
+
**w1d: the window within the most recent day**
156
157
This window contains the user's rows from one day before the current row up to and including the current row.
157
158
```sql
158
159
window w1d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
159
160
```
160
161
161
162
The `w1d` window shown in the above figure is for the partition `id=9`, and the `w1d` window contains three rows (`id=6`, `id=8`, `id=9`). These three rows fall in the time window [2022-02-07 12:00:00, 2022-02-08 12:00:00].
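To make the `ROWS_RANGE` semantics concrete, here is a minimal Python sketch of how rows could be selected for a time-based window. It is an illustration only, not OpenMLDB's implementation: the function name `rows_range_window` and the sample timestamps are hypothetical, and only the window boundaries (both ends inclusive) follow the text.

```python
from datetime import datetime, timedelta

def rows_range_window(rows, current_id, lower, upper=timedelta(0)):
    """Select rows of the current row's partition whose trans_time lies in
    [current - lower, current - upper], both ends inclusive."""
    if lower < upper:
        raise ValueError("the lower bound must reach at least as far back as the upper bound")
    current = next(r for r in rows if r["id"] == current_id)
    start = current["trans_time"] - lower
    end = current["trans_time"] - upper
    selected = [
        r for r in rows
        if r["uid"] == current["uid"] and start <= r["trans_time"] <= end
    ]
    return sorted(selected, key=lambda r: r["trans_time"])

# Hypothetical sample rows; the timestamps are assumptions, not the figure's data.
rows = [
    {"id": 1, "uid": "a", "trans_time": datetime(2022, 1, 30, 10, 0, 0)},
    {"id": 3, "uid": "a", "trans_time": datetime(2022, 2, 2, 0, 0, 0)},
    {"id": 4, "uid": "a", "trans_time": datetime(2022, 2, 6, 9, 0, 0)},
    {"id": 5, "uid": "b", "trans_time": datetime(2022, 2, 8, 10, 0, 0)},
    {"id": 6, "uid": "a", "trans_time": datetime(2022, 2, 7, 18, 0, 0)},
    {"id": 8, "uid": "a", "trans_time": datetime(2022, 2, 8, 8, 0, 0)},
    {"id": 9, "uid": "a", "trans_time": datetime(2022, 2, 8, 12, 0, 0)},
]

# ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW
w1d_ids = [r["id"] for r in rows_range_window(rows, 9, timedelta(days=1))]
# ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING
w1d_10d_ids = [
    r["id"] for r in rows_range_window(rows, 9, timedelta(days=10), timedelta(days=1))
]
print(w1d_ids, w1d_10d_ids)
```

With this sample data, the row `id=5` is excluded because it belongs to a different `uid`, which mirrors the effect of `PARTITION BY uid`.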
162
163
163
-
-w1d_10d: the window from 1 day ago to the last 10 days
164
+
**w1d_10d: the window from 1 day ago to the last 10 days**
164
165
```sql
165
166
window w1d_10d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 10d PRECEDING AND 1d PRECEDING)
166
167
```
167
168
168
169
The window `w1d_10d` for the partition `id=9` contains three rows, which are `id=1`, `id=3` and `id=4`. These three rows fall in the time window of [2022-01-29 12:00:00, 2022-02-07 12:00:00].
169
170
170
-
-w0_1: the window contains the last 0 ~ 1 rows
171
+
**w0_1: the window contains the last 0 ~ 1 rows**
171
172
The window contains the last 0 ~ 1 rows, that is, the previous row and the current row.
172
173
```sql
173
174
window w0_1 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
174
175
```
175
176
176
177
The window `w0_1` for the partition `id=10` contains 2 rows, which are `id=7` and `id=10`.
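A `ROWS` window counts positions instead of time. The following Python sketch shows one way to read that rule; the helper `rows_window` and the sample rows are hypothetical and are not part of OpenMLDB.

```python
def rows_window(rows, current_id, lower, upper=0):
    """Return the rows located between `lower` and `upper` positions before the
    current row (both inclusive) in its partition, ordered by trans_time."""
    if lower < upper:
        raise ValueError("the lower bound must reach at least as far back as the upper bound")
    current = next(r for r in rows if r["id"] == current_id)
    partition = sorted(
        (r for r in rows if r["uid"] == current["uid"]),
        key=lambda r: r["trans_time"],
    )
    pos = partition.index(current)
    start = max(0, pos - lower)   # clamp at the start of the partition
    end = pos - upper             # inclusive end position
    return partition[start:end + 1] if end >= 0 else []

# Hypothetical sample rows; trans_time is simplified to an integer.
rows = [
    {"id": 2, "uid": "b", "trans_time": 1},
    {"id": 5, "uid": "b", "trans_time": 2},
    {"id": 7, "uid": "b", "trans_time": 3},
    {"id": 9, "uid": "a", "trans_time": 4},
    {"id": 10, "uid": "b", "trans_time": 4},
]

# ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
w0_1_ids = [r["id"] for r in rows_window(rows, 10, lower=1)]
# ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING
w2_10_ids = [r["id"] for r in rows_window(rows, 10, lower=10, upper=2)]
print(w0_1_ids, w2_10_ids)
```

When fewer rows exist than the lower bound asks for, the sketch simply returns the rows that are available.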
177
178
178
-
-w2_10: the window contains the last 2 ~ 10 rows
179
+
**w2_10: the window contains the last 2 ~ 10 rows**
179
180
180
181
```sql
181
182
window w2_10 as (PARTITION BY uid ORDER BY trans_time ROWS BETWEEN 10 PRECEDING AND 2 PRECEDING)
@@ -304,7 +305,7 @@ window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PREC
304
305
305
306
We compute frequency statistics for a given column because we may need to know, for each category, the most frequent type, the proportion of the most frequent type, and so on.
306
307
307
-
`top1_ratio`: Find out the type with the largest number and compute the proportion of its number in the window.
308
+
**`top1_ratio`**: Find out the type with the largest number and compute the proportion of its number in the window.
308
309
309
310
The following SQL uses `top1_ratio` to find out the city with the most transactions in the last 30 days and compute the proportion of the number of transactions of the city to the total number of transactions in t1.
310
311
```sql
@@ -314,7 +315,7 @@ FROM t1
314
315
window w30d as (PARTITION BY uid ORDER BY trans_time ROWS_RANGE BETWEEN 30d PRECEDING AND CURRENT ROW);
315
316
```
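As a rough illustration of what `top1_ratio` computes over the values in a window, consider the Python sketch below. It only models the idea described above (count of the most frequent value divided by the number of values); the empty-window result and the sample city names are assumptions, and NULL handling is not modeled.

```python
from collections import Counter

def top1_ratio(values):
    """Share of the most frequent value among all values in the window."""
    if not values:
        return 0.0  # assumption made for this sketch only
    counts = Counter(values)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(values)

# Hypothetical `city` values of the rows inside one user's 30-day window.
cities = ["Shanghai", "Beijing", "Shanghai", "Hangzhou"]
ratio = top1_ratio(cities)
print(ratio)
```

Here "Shanghai" appears in 2 of the 4 rows, so the ratio is 0.5.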
316
317
317
-
`topn_frequency(col, top_n)`: Find the `top_n` categories with the highest frequency in the window
318
+
**`topn_frequency(col, top_n)`**: Find the `top_n` categories with the highest frequency in the window
318
319
319
320
The following SQL uses `topn_frequency` to find out the top 2 cities with the highest number of transactions in the last 30 days in t1.
docs/en/tutorial/tutorial_sql_2.md
Lines changed: 4 additions & 4 deletions
@@ -63,7 +63,7 @@ As shown below, left table `LAST JOIN` right table with `ORDER BY` and right tab
63
63
64
64
## 3. Multi-Row Aggregation over Multiple Tables
65
65
66
-
For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../reference/sql/dql/WINDOW_CLAUSE.md#window-union) syntax.
66
+
For aggregation over multiple tables, OpenMLDB extends the standard WINDOW syntax and adds [WINDOW UNION](../openmldb_sql/dql/WINDOW_CLAUSE.md#1-window--union) syntax.
67
67
WINDOW UNION supports combining multiple rows from the secondary table to form a window on the secondary table.
68
68
Based on the time window, it is convenient to construct the multi-row aggregation feature of the secondary table.
69
69
Similarly, two steps need to be completed to construct the multi-row aggregation feature of the secondary table:
@@ -122,10 +122,10 @@ Among them, necessary elements include:
122
122
- Lower bound time must be `>=` upper bound time
123
123
- The row number of the lower bound must be `<=` the row number of the upper bound
124
124
- `INSTANCE_NOT_IN_WINDOW`: It indicates that except for the current row, other data in the main table will not enter the window.
125
-
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../reference/sql/dql/WINDOW_CLAUSE.md).
125
+
- For more syntax and features, please refer to [OpenMLDB WINDOW UNION Reference Manual](../openmldb_sql/sql/dql/WINDOW_CLAUSE.md).
126
126
```
127
127
128
-
### Example
128
+
#### Example
129
129
130
130
Let's see the usage of WINDOW UNION through specific examples.
131
131
@@ -166,7 +166,7 @@ PARTITION BY mid ORDER BY purchase_time
166
166
ROWS_RANGE BETWEEN 10d PRECEDING AND 1 PRECEDING INSTANCE_NOT_IN_WINDOW)
167
167
```
168
168
169
-
## 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
169
+
### 3.2 Step 2: Build Multi-Row Aggregation Feature of Sub Table
170
170
171
171
Apply a multi-row aggregation function on the created window to construct aggregation features over multiple rows of the secondary table, so that the number of rows finally generated is the same as that of the main table.
172
172
For example, we can construct features from the secondary table like: the total retail sales of merchants in the last 10 days `w10d_merchant_purchase_amt_sum` and the total consumption times of the merchant in the last 10 days `w10d_merchant_purchase_count`.