tutorial of multidplyr #154

@wbvguo

Dear multidplyr developer,

Thank you for maintaining this package. Is there a more detailed tutorial for it, beyond the documentation page https://multidplyr.tidyverse.org/articles/multidplyr.html?

For example, it took me a while to figure out the correct usage of mutate after partitioning the data:

# create data
set.seed(123)  # For reproducibility

num_groups = 5000
num_grp_obs= 10

df <- data.frame(
  id = 1:(num_groups*num_grp_obs),  # parentheses needed: `:` binds tighter than `*`
  group = rep(seq(num_groups), each = num_grp_obs),
  x = rnorm(num_groups*num_grp_obs),
  y = rnorm(num_groups*num_grp_obs)
)

df$x[c(5, 15)] <- NA # Introduce some NA values


# parallel setting
library(dplyr)  # %>%, group_by(), mutate(), collect()
library(tidyr)  # nest()
library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, c("dplyr"))


# partition
x_part = df %>% group_by(group) %>% nest() %>% partition(cluster) 
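For readers following along, a quick local look at what `nest()` produces may help before comparing the two `mutate()` calls. This is a minimal sketch on a made-up `toy` data frame (not the data from this report), and it runs without a cluster. It shows that each group's `x` and `y` end up inside a list-column named `data`, so they are not top-level columns of the nested table.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 3 groups with 4 observations each
toy <- data.frame(
  group = rep(1:3, each = 4),
  x = c(1, 2, 3, 4, 2, 4, 6, 8, 1, 3, 5, 7),
  y = c(2, 4, 6, 8, 1, 2, 3, 4, 7, 5, 3, 1)
)

nested <- toy %>% group_by(group) %>% nest()

nrow(nested)             # one row per group
names(nested)            # the grouping column plus the list-column `data`
class(nested$data)       # `data` is a list of tibbles
names(nested$data[[1]])  # x and y live inside each nested tibble
```

Because `x` and `y` are only reachable through `data`, any per-group model fit has to iterate over that list-column.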

This does not work:

x = x_part %>% mutate(fit = lm(y~x, data = .)) %>% collect()

Error in cluster_call():
! Remote computation failed in worker 1
Caused by error:
ℹ In argument: fit = lm(y ~ x, data = .).
ℹ In group 1: group = 1.
Caused by error:
! Native call to processx_connection_write_bytes failed
Caused by error:
! Invalid connection object @processx-connection.c:960 (processx_c_connection_write_bytes)
Run rlang::last_trace() to see where the error occurred.

This works:

x = x_part %>% mutate(fit = purrr::map(data, ~lm(y~x, data = .))) %>% collect()
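One possible reading of the difference: in the failing call, `.` is the left-hand side of the pipe, i.e. the whole partitioned object, not one group's data, whereas `purrr::map()` iterates over the `data` list-column and hands each nested tibble to `lm()`. The sketch below is an assumption-laden illustration, not an official multidplyr recipe. It uses a small made-up data frame (`small`), two workers, and reduces each fit to a single numeric slope on the worker so that only plain columns are collected.

```r
library(dplyr)
library(tidyr)
library(multidplyr)

set.seed(1)
# Hypothetical small data set: 4 groups with 10 observations each
small <- data.frame(
  group = rep(1:4, each = 10),
  x = rnorm(40),
  y = rnorm(40)
)

cluster <- new_cluster(2)
cluster_library(cluster, "dplyr")

slopes <- small %>%
  group_by(group) %>%
  nest() %>%
  partition(cluster) %>%
  # Map over the list-column; .x is one group's nested tibble
  mutate(slope = purrr::map_dbl(data, ~ coef(lm(y ~ x, data = .x))[["x"]])) %>%
  select(group, slope) %>%
  collect() %>%
  arrange(group)

# Same computation without the cluster, for comparison
local_slopes <- small %>%
  group_by(group) %>%
  summarise(slope = coef(lm(y ~ x))[["x"]])
```

Returning a numeric summary instead of the full `lm` object keeps the collected result small; whether that trade-off suits a given analysis depends on what is needed from each fit.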

Thanks!
