| 1 | +--- |
| 2 | +globs: |
| 3 | +description: Crucial guidelines to build a dlt rest api source |
| 4 | +alwaysApply: true |
| 5 | +--- |
| 6 | +## Prerequisites to writing a source |
| 7 | + |
| 8 | +1. VERY IMPORTANT. When writing a new source, you should have an example available in the rest_api_pipeline.py file. |
| 9 | +Use this example, or the GitHub REST API source example from dlt's REST API documentation, for the general structure of the code. If you do not see the rest_api_pipeline.py file, ask the user to add it. |
| 10 | +2. Recall the OpenAPI spec. For each API, you will need to work out the same information that an OpenAPI spec contains. |
| 11 | +3. In particular: |
| 12 | +- API base url |
| 13 | +- type of authentication |
| 14 | +- list of endpoints with method GET (you can read data for those) |
| 15 | +4. You will also need to work out additional information that is required for successful data extraction: |
| 16 | +- type of pagination |
| 17 | +- whether data from an endpoint can be loaded incrementally |
| 18 | +- how to unwrap the end user data from a response |
| 19 | +- write disposition of the endpoint: append, replace, merge |
| 20 | +- in case of merge, the primary key, which may be compound |
| 21 | +5. Some endpoints take data from other endpoints. For example, in the GitHub REST API source example from dlt's documentation, the `comments` endpoint needs a `post id` to get the list of comments for a particular post. You will need to work out such connections. |
| 22 | +6. **ASK USER IF YOU MISS CRUCIAL INFORMATION** Make sure the user has provided enough information to work out the above. The most common sources are: |
| 23 | +- an OpenAPI spec (file or link) |
| 24 | +- any other API definition, for example Airbyte low-code YAML |
| 25 | +- source code in Python, Java, or C# of such a connector or API client |
| 26 | +- documentation of the API or endpoint |
| 27 | +7. If you find more than 10 endpoints and have no instructions on which ones to add to the source, ask the user. |
| 28 | +8. Make sure you use the right pagination and use exactly the arguments that are available in the pagination guide. Do not try to guess anything. Remember that there are many paginator types and each is configured differently. |
| 29 | +9. When creating the pipeline instance, add `progress="log"` as a parameter: `pipeline = dlt.pipeline(..., progress="log")` |
| 30 | +10. When fixing a bug report, focus on a single cause at a time, e.g. incremental loading, pagination, authentication, or wrong dict fields. |
| 31 | +11. You should have references for paginator types, authenticator types, and a general reference for the REST API in your context. **DO NOT GUESS. DO NOT INVENT CODE. YOU SHOULD HAVE DOCUMENTATION FOR EVERYTHING YOU NEED. IF NOT - ASK USER** |
| 32 | + |
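The parent/child connection described in prerequisite 5 could be sketched as follows. This is a minimal illustration, not a verified configuration: the resource names, the `post_id` placeholder, and the paths are hypothetical, and the `resolve` parameter form should be checked against the REST API reference in your context before use.

```python
# Hypothetical resources: "post_comments" needs an id from each "posts" row.
resources = [
    {
        "name": "posts",
        "endpoint": {"path": "posts"},
    },
    {
        "name": "post_comments",
        "endpoint": {
            # {post_id} is filled from the parent resource, one request per parent row
            "path": "posts/{post_id}/comments",
            "params": {
                "post_id": {
                    "type": "resolve",    # value comes from another resource
                    "resource": "posts",  # name of the parent resource
                    "field": "id",        # field of the parent row to use
                },
            },
        },
    },
]
```

The list would go under the `"resources"` key of the REST API config; the parent resource must be declared in the same config so that the reference can be resolved.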
| 33 | + |
| 34 | +## Look for Required Client Settings |
| 35 | +When scanning docs or legacy code, first extract the API-level configuration including: |
| 36 | + |
| 37 | +Base URL: |
| 38 | +• The API's base URL (e.g. "https://api.pipedrive.com/"). |
| 39 | + |
| 40 | +Authentication: |
| 41 | +• The type of authentication used (commonly "api_key" or "bearer"). |
| 42 | +• The name/key (e.g. "api_token") and its placement (usually in the query). |
| 43 | +• Use secrets (e.g. dlt.secrets["api_token"]) to keep credentials secure. |
| 44 | + |
| 45 | +Headers (optional): |
| 46 | +• Check if any custom headers are required. |
| 47 | + |
| 48 | +## Authentication Methods |
| 49 | +Configure the appropriate authentication method: |
| 50 | + |
| 51 | +API Key Authentication: |
| 52 | +```python |
| 53 | +"auth": { |
| 54 | + "type": "api_key", |
| 55 | + "name": "api_key", |
| 56 | + "api_key": dlt.secrets["api_key"], |
| 57 | + "location": "query" # or "header" |
| 58 | +} |
| 59 | +``` |
| 60 | + |
| 61 | +Bearer Token Authentication: |
| 62 | +```python |
| 63 | +"auth": { |
| 64 | + "type": "bearer", |
| 65 | + "token": dlt.secrets["bearer_token"] |
| 66 | +} |
| 67 | +``` |
| 68 | + |
| 69 | +Basic Authentication: |
| 70 | +```python |
| 71 | +"auth": { |
| 72 | + "type": "http_basic", # dlt names the basic auth type "http_basic" |
| 73 | + "username": dlt.secrets["username"], |
| 74 | + "password": dlt.secrets["password"] |
| 75 | +} |
| 76 | +``` |
| 77 | + |
| 78 | +OAuth2 Authentication: |
| 79 | +```python |
| 80 | +"auth": { |
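| 81 | + "type": "oauth2_client_credentials", # client credentials flow |
| 82 | + "access_token_url": "https://auth.example.com/oauth/token", |
| 83 | + "client_id": dlt.secrets["client_id"], |
| 84 | + "client_secret": dlt.secrets["client_secret"], |
| 85 | + "access_token_request_data": {"scope": "read write"} # extra fields sent with the token request |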
| 86 | +} |
| 87 | +``` |
| 88 | + |
| 89 | +## Find the right pagination type |
| 90 | +These are the available paginator types that can be used in the `paginator` field of `endpoint`: |
| 91 | + |
| 92 | +* `json_link`: The link to the next page is in the body (JSON) of the response |
| 93 | +* `header_link`: The links to the next page are in the response headers |
| 94 | +* `offset`: The pagination is based on an offset parameter, with the total items count either in the response body or explicitly provided |
| 95 | +* `page_number`: The pagination is based on a page number parameter, with the total pages count either in the response body or explicitly provided |
| 96 | +* `cursor`: The pagination is based on a cursor parameter, with the value of the cursor in the response body (JSON) |
| 97 | +* `single_page`: The response will be interpreted as a single-page response, ignoring possible pagination metadata |
| 98 | + |
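As a rough orientation, the paginator types listed above might be configured as in the sketch below. The JSON paths (`paging.next`, `total`, `total_pages`, `cursors.next`) and query parameter names are assumptions for a made-up API; take the real values from the API documentation and the exact argument names from the pagination guide.

```python
# One illustrative "paginator" dict per type; all paths and parameter names are placeholders.
paginator_examples = {
    "json_link": {"type": "json_link", "next_url_path": "paging.next"},
    "header_link": {"type": "header_link"},
    "offset": {
        "type": "offset",
        "limit": 100,
        "offset_param": "offset",
        "limit_param": "limit",
        "total_path": "total",
    },
    "page_number": {
        "type": "page_number",
        "base_page": 1,
        "page_param": "page",
        "total_path": "total_pages",
    },
    "cursor": {
        "type": "cursor",
        "cursor_path": "cursors.next",
        "cursor_param": "cursor",
    },
    "single_page": {"type": "single_page"},
}
```

Each value is meant to be placed under an endpoint's `"paginator"` key, so different endpoints in the same source can use different entries.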
| 99 | + |
| 100 | +## Different Paginations per Endpoint are possible |
| 101 | +When analyzing the API documentation, carefully check for multiple pagination strategies: |
| 102 | + |
| 103 | +• Different Endpoint Types: |
| 104 | + - Some endpoints might use cursor-based pagination |
| 105 | + - Others might use offset-based pagination |
| 106 | + - Some might use page-based pagination |
| 107 | + - Some might use link-based pagination |
| 108 | + |
| 109 | +• Documentation Analysis: |
| 110 | + - Look for sections describing different pagination methods |
| 111 | + - Check if certain endpoints have special pagination requirements |
| 112 | + - Verify if pagination parameters differ between endpoints |
| 113 | + - Look for examples showing different pagination patterns |
| 114 | + |
| 115 | +• Implementation Strategy: |
| 116 | + - Configure pagination at the endpoint level rather than globally |
| 117 | + - Use the appropriate paginator type for each endpoint |
| 118 | + - Document which endpoints use which pagination strategy |
| 119 | + - Test pagination separately for each endpoint type |
| 120 | + |
| 121 | +## Select the right data from the response |
| 122 | +In each endpoint, the interesting data (typically an array of objects) may be wrapped |
| 123 | +differently. You can unwrap this data with `data_selector`, a JSONPath expression evaluated against the response body. |
| 124 | + |
| 125 | +Data Selection Patterns: |
| 126 | +```python |
| 127 | +"endpoint": { # use ONE data_selector per endpoint; alternatives are commented out |
| 128 | + "data_selector": "data.items.*", # Basic array selection |
| 129 | + # "data_selector": "data.*.items", # Nested array selection |
| 130 | + # "data_selector": "data.{id,name,created_at}", # Field selection |
| 131 | +} |
| 132 | +``` |
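To make the idea of unwrapping concrete, here is a small pure-Python illustration of what a selector such as `data.items.*` picks out of a response. The `select` helper and the sample response are hypothetical and only mimic simple dotted paths with `*`; dlt evaluates real JSONPath expressions, so this is not its implementation.

```python
def select(payload, selector):
    """Walk a dot-separated selector; `*` fans out over list items or dict values."""
    nodes = [payload]
    for key in selector.split("."):
        next_nodes = []
        for node in nodes:
            if key == "*":
                if isinstance(node, list):
                    next_nodes.extend(node)
                elif isinstance(node, dict):
                    next_nodes.extend(node.values())
            elif isinstance(node, dict) and key in node:
                next_nodes.append(node[key])
        nodes = next_nodes
    return nodes


# Made-up response: the rows of interest are wrapped in data.items
response = {"data": {"items": [{"id": 1}, {"id": 2}]}, "meta": {"total": 2}}
rows = select(response, "data.items.*")
```

Here `rows` holds the two item objects, while the `meta` wrapper is left behind; that is the effect `data_selector` is meant to achieve for each endpoint.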
| 133 | + |
| 134 | +## Resource Defaults & Endpoint Details |
| 135 | +Ensure that the default settings applied across all resources are clearly delineated: |
| 136 | + |
| 137 | +Defaults: |
| 138 | +• Specify the default primary key (e.g., "id"). |
| 139 | +• Define the write disposition (e.g., "merge"). |
| 140 | +• Include common endpoint parameters (for example, a default limit value like 50). |
| 141 | + |
| 142 | +Resource-Specific Configurations: |
| 143 | +• For each resource, extract the endpoint path, method, and any additional query parameters. |
| 144 | +• If incremental loading is supported, include the minimal incremental configuration (using fields like "start_param", "cursor_path", and "initial_value"), but try to keep it within the REST API config portion. |
| 145 | + |
| 146 | +## Incremental Loading Configuration |
| 147 | +Configure incremental loading for efficient data extraction. Your task is to get only new data from |
| 148 | +the endpoint. |
| 149 | + |
| 150 | +Typically, you will identify a query parameter that returns only items newer than a certain date: |
| 151 | + |
| 152 | +```py |
| 153 | +{ |
| 154 | + "path": "posts", |
| 155 | + "data_selector": "results", |
| 156 | + "params": { |
| 157 | + "created_since": "{incremental.start_value}", # Uses cursor value in query parameter |
| 158 | + }, |
| 159 | + "incremental": { |
| 160 | + "cursor_path": "created_at", |
| 161 | + "initial_value": "2024-01-25T00:00:00Z", |
| 162 | + }, |
| 163 | +} |
| 164 | +``` |
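The mechanics behind the configuration above can be modelled roughly as below. This is a simplified, hypothetical model rather than dlt's code: it assumes the cursor values are ISO 8601 strings in one consistent format (so string comparison matches time order) and that the newest value seen becomes the start value of the next run.

```python
def build_params(template, start_value):
    """Substitute the stored cursor value into the query parameter template."""
    return {
        name: value.replace("{incremental.start_value}", start_value)
        for name, value in template.items()
    }


def next_start_value(rows, cursor_path, previous_value):
    """Return the highest cursor value seen so far, never moving backwards."""
    seen = [row[cursor_path] for row in rows if cursor_path in row]
    return max(seen + [previous_value])


initial_value = "2024-01-25T00:00:00Z"
params = build_params({"created_since": "{incremental.start_value}"}, initial_value)

# Made-up page of results returned for the first run
rows = [
    {"id": 1, "created_at": "2024-01-30T08:00:00Z"},
    {"id": 2, "created_at": "2024-02-01T12:30:00Z"},
]
new_start = next_start_value(rows, "created_at", initial_value)
```

On the first run the request carries `created_since=2024-01-25T00:00:00Z`; on the following run the stored value would be the newest `created_at` from `rows`, so only later items are requested.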
| 165 | + |
| 166 | + |
| 167 | +## End to end example |
| 168 | +Below is an annotated template that illustrates how your output should look. Use it as a reference to guide your extraction: |
| 169 | + |
| 170 | +```python |
| 171 | +import dlt |
| 172 | +from dlt.sources.rest_api import rest_api_source |
| 173 | + |
| 174 | +# Build the REST API config with offset-based pagination and incremental loading |
| 175 | +source = rest_api_source({ |
| 176 | + "client": { |
| 177 | + "base_url": "https://api.pipedrive.com/", # Extract this from the docs/legacy code |
| 178 | + "auth": { |
| 179 | + "type": "api_key", # Use the documented auth type |
| 180 | + "name": "api_token", |
| 181 | + "api_key": dlt.secrets["api_token"], # Replace with secure token reference |
| 182 | + "location": "query" # Typically a query parameter for API keys |
| 183 | + } |
| 184 | + }, |
| 185 | + "resource_defaults": { |
| 186 | + "primary_key": "id", # Default primary key for resources |
| 187 | + "write_disposition": "merge", # Default write mode |
| 188 | + "endpoint": { |
| 189 | + "params": { |
| 190 | + "limit": 50 # Default query parameter for pagination size |
| 191 | + } |
| 192 | + } |
| 193 | + }, |
| 194 | + "resources": [ |
| 195 | + { |
| 196 | + "name": "deals", # Example resource name extracted from code or docs |
| 197 | + "endpoint": { |
| 198 | + "path": "v1/recents", # Endpoint path to be appended to base_url |
| 199 | + "method": "GET", # HTTP method (default is GET) |
| 200 | + "params": { |
| 201 | + "items": "deal", |
| 202 | + "since_timestamp": "{incremental.start_value}" |
| 203 | + }, |
| 204 | + "data_selector": "data.*", # JSONPath to extract the actual data |
| 205 | + "paginator": { # Endpoint-specific paginator |
| 206 | + "type": "offset", |
| 207 | + "offset": 0, |
| 208 | + "limit": 100 |
| 209 | + }, |
| 210 | + "incremental": { # Optional incremental configuration |
| 211 | + "cursor_path": "update_time", |
| 212 | + "initial_value": "2023-01-01 00:00:00" |
| 213 | + } |
| 214 | + } |
| 215 | + } |
| 216 | + ] |
| 217 | +}) |
| 218 | + |
| 219 | +if __name__ == "__main__": |
| 220 | + pipeline = dlt.pipeline( |
| 221 | + pipeline_name="pipedrive_rest", |
| 222 | + destination="duckdb", |
| 223 | + dataset_name="pipedrive_data", progress="log" # progress="log" as required in the prerequisites |
| 224 | + ) |
| 225 | + pipeline.run(source) |
| 226 | +``` |
| 227 | + |
| 228 | +## How to Apply This Rule |
| 229 | +Extraction: |
| 230 | +• Search both the REST API docs and any legacy pipeline code for all mentions of "cursor" or "pagination". |
| 231 | +• Identify the exact keys and JSONPath expressions needed for the cursor field. |
| 232 | +• Look for authentication requirements and rate limiting information. |
| 233 | +• Identify any dependent resources and their relationships. |
| 234 | +• Check for multiple pagination strategies across different endpoints. |
| 235 | + |
| 236 | +Configuration Building: |
| 237 | +• Assemble the configuration in a dictionary that mirrors the structure in the example. |
| 238 | +• Ensure that each section (client, resource defaults, resources) is as declarative as possible. |
| 239 | +• Implement proper state management and incremental loading where applicable. |
| 240 | +• Configure rate limiting based on API requirements. |
| 241 | +• Configure pagination at the endpoint level when multiple strategies exist. |
| 242 | + |
| 243 | +Verification: |
| 244 | +• Double-check that the configuration uses the REST API config keys correctly. |
| 245 | +• Verify that no extraneous Python code is introduced. |
| 246 | +• Test the configuration with mock responses. |
| 247 | +• Verify rate limiting and error handling. |
| 248 | +• Test pagination separately for each endpoint type. |
| 249 | + |
| 250 | +Customization: |
| 251 | +• Allow for adjustments (like modifying the "initial_value") where incremental loading is desired. |
| 252 | +• Customize rate limiting parameters based on API requirements. |
| 253 | +• Adjust batch sizes and pagination parameters as needed. |
| 254 | +• Implement custom error handling and retry logic where necessary. |
| 255 | +• Handle different pagination strategies appropriately. |
| 256 | + |