Description
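2026-01-04 fe enable_file_cache和disable_file_cache对缓存读写操作影响分析.pdf (attachment; title in English: "FE: analysis of how enable_file_cache and disable_file_cache affect cache read and write operations")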
Search before asking
- I had searched in the issues and found no similar issues.
Description
| Operation Type | Table Type | Effect of enable_file_cache | Effect of disable_file_cache |
|---|---|---|---|
| Write (Load) | Internal Table | None (ignored) | Controls write-to-cache (True = do not write to cache; False = write to cache) |
| Write (Load) | External Table | None (ignored; writes directly to remote storage) | None (ignored; writes directly to remote storage) |
| Read (Query) | External Table | On/off switch (True = use cache; False = read directly from remote) | Disposable flag (used when cache is enabled; determines if data is "one-time use") |
| Read (Query) | Internal Table | Ineffective (always uses cache if globally enabled in BE) | Eviction policy control (True = uses disposable queue [TTL/LRU]; False = uses normal queue) |
Solution
Suggested Optimization Solution
Provide enable_file_cache_olap_tables and enable_file_cache_external_catalogs as separate control switches for internal tables and data lake tables respectively.
• enable_file_cache_olap_tables: File cache switch for internal tables in storage-compute separation deployment mode (cloud_mode). Caching is enabled by default.
  • If set to false, read operations enter the file cache's disposable queue (DISPOSABLE), and write operations bypass the file cache, writing only to remote storage.
• enable_file_cache_external_catalogs: File cache switch for data lake tables. Caching is disabled by default, in which case both read and write operations bypass the cache.
  • If set to true, read operations populate the file cache normally. Doris currently does not support writing to the cache when writing data to data lake tables (to be discussed separately).
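To make the proposed semantics concrete, here is a minimal Python sketch of how the two switches could map to a cache action. It is only an illustration, not Doris code: `CacheAction` and `resolve_cache_action` are hypothetical names, and the case where enable_file_cache_olap_tables is true (reads and writes both use the normal cache path) is an assumption inferred from "caching enabled by default".

```python
from enum import Enum


class CacheAction(Enum):
    """Hypothetical outcomes, named for this sketch only."""
    NORMAL = "normal"          # use / populate the normal cache queue
    DISPOSABLE = "disposable"  # read through the disposable queue
    BYPASS = "bypass"          # skip the file cache, go to remote storage


def resolve_cache_action(table_kind, operation,
                         enable_file_cache_olap_tables=True,
                         enable_file_cache_external_catalogs=False):
    """Map (table kind, operation, switches) to a cache action.

    table_kind: "internal" or "external"; operation: "read" or "write".
    Defaults mirror the proposal: internal on, external off.
    """
    if table_kind not in ("internal", "external"):
        raise ValueError(f"unknown table kind: {table_kind!r}")
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation!r}")

    if table_kind == "internal":
        if enable_file_cache_olap_tables:
            # Assumption: with the switch on, reads and writes use the cache.
            return CacheAction.NORMAL
        # Switch off: reads go to the disposable queue, writes bypass the cache.
        if operation == "read":
            return CacheAction.DISPOSABLE
        return CacheAction.BYPASS

    # External (data lake) tables: writes never go through the cache today.
    if operation == "write":
        return CacheAction.BYPASS
    if enable_file_cache_external_catalogs:
        return CacheAction.NORMAL
    return CacheAction.BYPASS
```

With the default values, an internal table read resolves to `CacheAction.NORMAL` and an external table read resolves to `CacheAction.BYPASS`, which matches the defaults described above.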
Why Two Switches?
This is due to the different characteristics of internal tables and data lake tables:
• Data Volume Difference: Internal tables hold relatively small data volumes compared to data lake tables such as Hive, Iceberg, and Paimon. With a properly sized cache, they can keep most query hotspots cached. Data lake tables, by contrast, generally hold at least one or two orders of magnitude more data than the cache space, so caching can only be done on demand.
• Different Caching Purposes: For internal tables, caching is only available in storage-compute separation deployment mode. The purpose is to provide performance comparable to storage-compute integrated deployment mode, supporting business migration from storage-compute integrated/private deployment to storage-compute separation deployment, which also helps promote SelectDB's paid cloud services. For data lake tables, the goal is to accelerate performance as much as possible, but it's not expected to match internal table performance yet.
Due to these different characteristics, data lake table queries should default to disabling file cache to avoid reading large amounts of data that would pollute existing cache hotspots. For storage-compute separated internal tables, file cache should be enabled by default to provide high-performance internal table query services.
If only a single global session-level switch variable is provided, the following problems exist:
• A session that runs both storage-compute separated internal table queries and data lake queries has to toggle the cache switch frequently, which is inefficient.
• A single query that accesses both internal tables and data lake tables cannot be handled, because one switch cannot be both enabled and disabled at the same time.
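The mixed-query problem can be shown with a small Python sketch. The table names `orders` and `hive_logs` and both helper functions are hypothetical; the sketch only models whether each scan in a query uses the file cache.

```python
def plan_with_single_switch(scans, enable_file_cache):
    """One session switch: every scan gets the same cache setting."""
    return {name: enable_file_cache for name, _kind in scans}


def plan_with_two_switches(scans, olap_on=True, external_on=False):
    """Two switches: each scan takes the setting for its table kind."""
    return {
        name: (olap_on if kind == "internal" else external_on)
        for name, kind in scans
    }


# A single query joining an internal table with a data lake table.
scans = [("orders", "internal"), ("hive_logs", "external")]

# Desired behavior: cache the internal table, skip the cache for the lake table.
desired = {"orders": True, "hive_logs": False}

# Neither value of the single switch produces the desired plan.
single_plans = [plan_with_single_switch(scans, v) for v in (True, False)]
assert desired not in single_plans

# The two-switch defaults produce it directly.
assert plan_with_two_switches(scans) == desired
```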
In summary, it is necessary to provide enable_file_cache_olap_tables and enable_file_cache_external_catalogs as separate switches, so that cache behavior can be controlled independently for internal tables and data lake tables.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct