Commit 5f83f3a
[IMP] snippets.convert_html_columns: a batch processing story
TLDR: RTFM
Once upon a time, in a countryside farm in Belgium...
At first, the upgrade of databases was straightforward. But, as time
passed, the size of the databases grew, and some CPU-intensive
computations took so much time that a solution needed to be found.
Fortunately, the Python standard library has the perfect module for this
task: `concurrent.futures`.
Then, Python 3.10 appeared, and `ProcessPoolExecutor` started to hang
occasionally for no apparent reason. Soon, our hero found out he wasn't
the only one to suffer from this issue[^1].
Unfortunately, the proposed solution looked like overkill. Still, it
revealed that the issue had already been known[^2] for a few years.
Although an official patch wasn't ready to be committed, the discussion
about its legitimacy[^3] led our hero to a nicer solution.
By default, `ProcessPoolExecutor.map` submits elements one by one to the
pool. This is pretty inefficient when there are a lot of elements to
process. This can be changed by using a large value for the *chunksize*
argument.
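
A minimal sketch of the idea (the function, data, and chunk size below are
illustrative placeholders, not the actual upgrade-util code):

```python
import concurrent.futures

def convert(element):
    # stand-in for the CPU-intensive per-element conversion
    return element.upper()

def main():
    elements = [f"row-{i}" for i in range(100_000)]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Without chunksize, map() sends elements to the workers one at a
        # time; with a large chunksize they are shipped in batches, which
        # drastically reduces the inter-process communication overhead.
        results = list(executor.map(convert, elements, chunksize=1000))
    print(len(results))

if __name__ == "__main__":  # required when the start method is "spawn"
    main()
```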
Who would have thought that a bigger chunk size would solve a
performance issue?
As always, the answer was in the documentation[^4].
[^1]: https://stackoverflow.com/questions/74633896/processpoolexecutor-using-map-hang-on-large-load
[^2]: python/cpython#74028
[^3]: python/cpython#114975 (review)
[^4]: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
closes #94
Signed-off-by: Nicolas Seinlet (nse) <nse@odoo.com>
1 file changed, 1 insertion(+), 1 deletion(-)