How to perform deduplication in a cluster environment?

I have set up a cluster environment and want to run a deduplication task on a large dataset (1 TB, stored locally), but I am not sure how the data should be loaded. Should I put all of the data on the supervisor node and load it from there? Or should I split the data evenly across the nodes and then run xorbits.init(address="http://supervisor_ip:web_port") on the supervisor node so that the data on every node is loaded for deduplication? Any guidance would be appreciated, thank you.
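To make the second option concrete, here is a minimal pure-Python sketch of deduplicating data that starts out split across nodes. It is not Xorbits code and makes no claim about how Xorbits behaves internally; the names `partition_of`, `shuffle`, `dedup_partition`, `distributed_dedup`, and `node_shards` are all hypothetical. The idea it illustrates is that duplicates sitting on different nodes are only removed if identical records are first routed to the same partition, which is why a stable hash is used instead of Python's per-process `hash()`.

```python
import hashlib
from collections import defaultdict


def partition_of(record: str, num_partitions: int) -> int:
    """Map a record to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def shuffle(shards, num_partitions):
    """Route every record from every node-local shard to its target partition."""
    partitions = defaultdict(list)
    for shard in shards:
        for record in shard:
            partitions[partition_of(record, num_partitions)].append(record)
    return partitions


def dedup_partition(records):
    """Drop duplicates inside one partition, keeping first-seen order."""
    return list(dict.fromkeys(records))


def distributed_dedup(shards, num_partitions):
    """Shuffle by hash, deduplicate each partition, then concatenate the results."""
    partitions = shuffle(shards, num_partitions)
    result = []
    for pid in sorted(partitions):
        result.extend(dedup_partition(partitions[pid]))
    return result


# Hypothetical data: three nodes, each holding a slice with cross-node duplicates.
node_shards = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
unique_records = distributed_dedup(node_shards, num_partitions=3)
print(sorted(unique_records))
```

Under this model the initial placement of the data affects only how much has to move over the network, not the correctness of the result, provided every worker can read the shard it is assigned.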


How to perform deduplication in a cluster environment? #759
