This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Description
Much of the analysis done on this dataset uses dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.
Unfortunately, extension arrays can't currently be serialized (see pandas-dev/pandas#20612).
fletcher (https://github.com/xhochy/fletcher) is an extension array that adds string processing functionality.
Using some standard tasks for this dataset, such as collecting all domains or counting the number of script domains per location domain, compare the performance of Spark, Dask, and Dask with extension arrays.
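To pin down what the two tasks compute, here is a minimal pure-Python sketch that could serve as a correctness baseline for the Spark and Dask versions. It is not a benchmark itself. It assumes each record is a `(location, script_url)` pair of URLs, and it treats the URL hostname as the "domain"; both the field names and that definition of a domain are assumptions for illustration, as are the function names and sample rows.

```python
from collections import defaultdict
from urllib.parse import urlparse


def hostname(url):
    """Return the lower-cased hostname of a URL, or None if it has none."""
    return urlparse(url).hostname


def collect_domains(rows):
    """Task 1: every distinct hostname seen in either field."""
    domains = set()
    for location, script_url in rows:
        for url in (location, script_url):
            host = hostname(url)
            if host is not None:
                domains.add(host)
    return domains


def script_domains_per_location(rows):
    """Task 2: number of distinct script hostnames per location hostname."""
    script_hosts = defaultdict(set)
    for location, script_url in rows:
        loc_host = hostname(location)
        script_host = hostname(script_url)
        if loc_host is not None and script_host is not None:
            script_hosts[loc_host].add(script_host)
    return {loc: len(hosts) for loc, hosts in script_hosts.items()}


# Hypothetical sample rows: (location, script_url). Not taken from the dataset.
SAMPLE_ROWS = [
    ("https://example.com/", "https://cdn.example.net/a.js"),
    ("https://example.com/about", "https://ads.example.org/t.js"),
    ("https://example.com/", "https://cdn.example.net/b.js"),
    ("https://news.example.org/", "https://cdn.example.net/a.js"),
]

if __name__ == "__main__":
    print(sorted(collect_domains(SAMPLE_ROWS)))
    print(script_domains_per_location(SAMPLE_ROWS))
```

Running the same two functions on a small sample and comparing against each framework's output is one way to check that the implementations being timed actually agree before comparing their speed.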