
improve data loading speed with Dask or NumPy #37

@sreichl

Description


Test it, e.g., for pca.py.

Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.

import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data with dask; dask's read_csv does not support the index_col
# keyword, so read first and set the index afterwards
ddata = dd.read_csv(data_path)
ddata = ddata.set_index(ddata.columns[0])

# convert to dask array (lengths=True computes the chunk sizes, which
# downstream array operations need)
data_array = ddata.to_dask_array(lengths=True)

# standardize data; note that scikit-learn's estimators materialize the
# dask array in memory, so for true out-of-core processing the drop-in
# replacements in dask-ml (dask_ml.preprocessing.StandardScaler,
# dask_ml.decomposition.PCA) can be used instead
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)
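The issue title also mentions NumPy: when the data is available as a binary .npy file, memory-mapping lets downstream code read only the slices it touches instead of loading the whole array up front. A minimal sketch, assuming the data has been saved once with np.save (the file name data.npy and the array shape are illustrative):

```python
import numpy as np

# Illustrative array; in practice this would be the dataset written
# once ahead of time with np.save ("data.npy" is a hypothetical name).
arr = np.random.rand(1000, 50).astype(np.float32)
np.save("data.npy", arr)

# Memory-map the file: the data stays on disk and only the slices that
# are actually accessed get read into RAM.
mmap_data = np.load("data.npy", mmap_mode="r")

# Work on a chunk without materializing the full array.
chunk_mean = mmap_data[:100].mean(axis=0)
```

This avoids the CSV parsing cost entirely on repeated loads, at the price of a one-time conversion to .npy.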

Labels: enhancement (New feature or request)