Open
Labels
enhancement (New feature or request)
Description
Test Dask on one script first, e.g., pca.py.
Dask: a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can process larger-than-memory datasets by splitting them into chunks, and it can distribute the computation across multiple cores or even multiple machines.
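import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# load data with dask (data_path is the path to the input CSV);
# dd.read_csv does not accept index_col, so set the index afterwards
ddata = dd.read_csv(data_path)
ddata = ddata.set_index(ddata.columns[0])
# convert to dask array; lengths=True computes the chunk sizes
data_array = ddata.to_dask_array(lengths=True)
# standardize data (note: scikit-learn estimators work on in-memory
# NumPy arrays, so the dask array is materialized at this step)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)
# PCA transformation, keeping all components
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)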
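One possible way to test the Dask route before touching pca.py is to check it against a plain NumPy run on small synthetic data. The sketch below is not the code from pca.py. It swaps scikit-learn's StandardScaler and PCA for plain array arithmetic and `da.linalg.svd`, so that every step stays chunked, and it compares only the singular values, which determine the explained variance. The helper names `standardize` and `singular_values_dask`, the array shape, and the chunk size are all assumptions made for illustration.

```python
import numpy as np
import dask.array as da


def standardize(x):
    """Center each column and scale it to unit variance (population std, ddof=0).

    Works on both NumPy and dask arrays because it only uses array arithmetic.
    """
    return (x - x.mean(axis=0)) / x.std(axis=0)


def singular_values_dask(x_scaled):
    """Singular values of a tall-and-skinny dask array.

    da.linalg.svd expects the array to be chunked along rows only
    (a single chunk across the columns).
    """
    _, s, _ = da.linalg.svd(x_scaled)
    return s.compute()


# Synthetic stand-in for the real dataset (hypothetical shape).
rng = np.random.default_rng(42)
x_np = rng.normal(size=(1000, 5))

# Same data as a dask array, split into four row blocks.
x_da = da.from_array(x_np, chunks=(250, 5))

# Dask path: lazy standardization, then a chunked SVD.
s_dask = singular_values_dask(standardize(x_da))

# NumPy reference path: everything in memory.
s_numpy = np.linalg.svd(standardize(x_np), compute_uv=False)

# Explained variance ratio per component, derived from the singular values.
explained_ratio_dask = s_dask**2 / np.sum(s_dask**2)
explained_ratio_numpy = s_numpy**2 / np.sum(s_numpy**2)

print("max abs difference:", np.max(np.abs(s_dask - s_numpy)))
```

If the two paths agree on a small case like this, the same comparison can be repeated with the real input file and whatever estimator pca.py ends up using.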