Slow runtime on slurm cluster #172
-
I am running 120 daemons on a Slurm cluster. I have observed very unusual runtimes when

    library(mirai)

    # Launch 120 daemons, each submitted as its own Slurm job via sbatch.
    daemons(
      n = 120,
      url = host_url(ws = TRUE),
      remote = remote_config(
        command = "sbatch",
        args = c("--mem 512", "-n 1", "--wrap", "."),
        rscript = file.path(R.home("bin"), "Rscript"),
        quote = TRUE
      )
    )

    # Each task times a trivial call that goes through mlr3misc.
    m = mirai_map(seq(120), function(i) {
      timestamp = Sys.time()
      mlr3misc::map(seq(10), function(j) j)
      difftime(Sys.time(), timestamp, units = "secs")
    })

    system.time({
      collect_mirai(m)
    })
    #    user  system elapsed
    #   0.006   0.006 165.554

Most of the mirais are resolved after a few milliseconds. A few mirais need seconds to finish, and the last one takes 165 seconds. If we use

    # Same task, but with base R lapply(), so no package has to be loaded.
    m = mirai_map(seq(120), function(i) {
      timestamp = Sys.time()
      lapply(seq(10), function(j) j)
      difftime(Sys.time(), timestamp, units = "secs")
    })

    system.time({
      collect_mirai(m)
    })
    #    user  system elapsed
    #   0.001   0.001   0.000

I suspect that the shared file system of the cluster is overloaded when 120 mirais try to load a package. When I add |
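
One way to check the file-system suspicion might be to time the package load separately from the work itself. The following is only a sketch: `time_load_and_work` is a hypothetical helper name, and the local dry run uses the base package `tools` as a stand-in, since the package of interest here would be `mlr3misc`.

```r
# Hypothetical helper: measure how long loading a package takes versus
# how long the actual (trivial) work takes, both in seconds.
time_load_and_work = function(i, pkg) {
  t0 = Sys.time()
  loadNamespace(pkg)  # the first call in a fresh process reads the package from disk
  t1 = Sys.time()
  lapply(seq(10), function(j) j)
  t2 = Sys.time()
  c(
    load = as.numeric(difftime(t1, t0, units = "secs")),
    work = as.numeric(difftime(t2, t1, units = "secs"))
  )
}

# Local dry run without a cluster; "tools" is only a stand-in package.
timings = do.call(rbind, lapply(seq(3), time_load_and_work, pkg = "tools"))
print(timings)

# On the cluster (not run here), assuming the `.args` argument of mirai_map():
# m = mirai::mirai_map(seq(120), time_load_and_work, .args = list(pkg = "mlr3misc"))
# timings = do.call(rbind, mirai::collect_mirai(m))
```

If the `load` column accounts for the long runtimes while `work` stays in the millisecond range, that would support the idea that package loading from the shared file system is the bottleneck.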
-
I tried a bit more. Calling |
-
I tested it again on another, much larger cluster, where you can easily use 100,000 CPU hours per day. I was able to run I talked to the developer of |
-
I'm closing this, as from a design perspective, mirai is not likely to be where this is solved. Conceptually I understand the desire for you to introduce some "jitter" in your operations, but as mirai doesn't perform code introspection, it isn't well-placed to tell when that would be required.
I don't think this falls under the remit of mirai, as it's a general tool used in lots of different situations.
It would logically be better implemented where this is known to be an issue, e.g. in cluster managers if this is common on large cluster systems. If it depends entirely on the filesystem, then perhaps it falls to user code to include a random sleep at the top of the code.
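
A random sleep at the top of the user code could look roughly like the sketch below. The names `jittered_task` and `max_jitter` are hypothetical, and the upper bound is an assumption that would need tuning for the file system in question.

```r
# Hypothetical task: sleep for a random delay before doing anything else,
# so that the daemons do not all hit the shared file system at the same moment.
jittered_task = function(i, max_jitter) {
  jitter = stats::runif(1, min = 0, max = max_jitter)
  Sys.sleep(jitter)

  # The actual work starts here; any package loading would happen after the delay.
  timestamp = Sys.time()
  lapply(seq(10), function(j) j)
  elapsed = as.numeric(difftime(Sys.time(), timestamp, units = "secs"))

  list(jitter = jitter, elapsed = elapsed)
}

# Local dry run with a small bound (0.1 seconds) and no cluster.
out = lapply(seq(3), jittered_task, max_jitter = 0.1)

# On the cluster (not run here), assuming the `.args` argument of mirai_map():
# m = mirai::mirai_map(seq(120), jittered_task, .args = list(max_jitter = 30))
```

The delay is drawn inside the task, so each daemon picks its own value independently. The trade-off is that every task pays up to `max_jitter` seconds of extra latency, whether or not the file system is actually under load.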