Commit 527825a

Add more explanation to some section as per comments
1 parent 80515d4 commit 527825a

1 file changed

+46
-7
lines changed

docs/src/developer/plan.rst

Lines changed: 46 additions & 7 deletions
@@ -16,16 +16,27 @@ The code in :py:func:`orion.core.cli.main` will parse the command line
1616
arguments and route to :py:func:`orion.core.cli.hunt.main`.
1717

1818
The command line arguments are passed to
19-
:py:func:`orion.core.io.experiment_builder.build_from_args`. This will
20-
massage the parsed command line arguments and merge that configuration
21-
with the config file and the defaults with various helpers from
22-
:py:mod:`orion.core.io.resolve_config` to build the final
23-
configuration. The result is eventually handled off to
19+
:py:func:`orion.core.io.experiment_builder.build_from_args`, which
20+
does some setup and hands over the arguments to
21+
:py:func:`orion.core.io.experiment_builder.build`. This will hand over
22+
the configuration to
23+
:py:func:`orion.core.io.experiment_builder.consolidate_config` which
24+
will look up the experiment in the configured storage to see if it's
25+
already there and merge the loaded configuration with the provided one
26+
with various helpers from :py:mod:`orion.core.io.resolve_config` to
27+
build the final configuration. The result is eventually handed off to
2428
:py:func:`orion.core.io.experiment_builder.create_experiment` to
2529
create an :py:class:`orion.core.worker.experiment.Experiment` and set
2630
its properties.
2731

28-
The created experiments finds its way back to
32+
If the experiment is new, meaning it has no storage id, then it will
33+
attempt to save it to storage, which may conflict in case another
34+
instance of ``orion hunt`` is doing the same thing. The storage is
35+
responsible for reporting conflicts and
36+
:py:func:`orion.core.io.experiment_builder.build` is called again
37+
recursively in that case to retry the whole operation.
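
The retry-on-conflict behaviour described above can be sketched as follows. This is a minimal illustration and not Orion's actual code: ``InMemoryStorage``, ``DuplicateKeyError``, the ``max_retries`` parameter and the merge rule (provided values override stored ones) are all assumptions made for the example.

```python
class DuplicateKeyError(Exception):
    """Hypothetical error raised by the storage when an experiment already exists."""


class InMemoryStorage:
    """Hypothetical stand-in for the configured storage."""

    def __init__(self):
        self.experiments = {}

    def fetch(self, name):
        # Return the stored configuration, or None if the experiment is unknown.
        return self.experiments.get(name)

    def create(self, config):
        # The storage is the component that detects and reports conflicts.
        if config["name"] in self.experiments:
            raise DuplicateKeyError(config["name"])
        stored = dict(config, id=len(self.experiments) + 1)
        self.experiments[config["name"]] = stored
        return stored


def build(storage, config, max_retries=3):
    """Look up the experiment, merge or create it, and retry on conflict."""
    stored = storage.fetch(config["name"])
    if stored is not None:
        # Assumed merge rule: provided values win, the storage id is kept.
        return {**stored, **config, "id": stored["id"]}
    try:
        # New experiment (no storage id yet): try to save it.
        return storage.create(config)
    except DuplicateKeyError:
        # Another instance saved it first; redo the whole operation.
        if max_retries <= 0:
            raise
        return build(storage, config, max_retries - 1)
```

On the retry, the lookup finds the experiment that the competing instance saved, so the second pass takes the merge branch instead of trying to create it again.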
38+
39+
The created experiment finds its way back to
2940
:py:func:`orion.core.cli.hunt.main` and is handed off to
3041
:py:func:`orion.core.cli.hunt.workon` along with some more
3142
configuration for the workers.
@@ -60,7 +71,7 @@ This will first check if any trials are available in the storage using
6071
:py:meth:`orion.core.worker.experiment.Experiment.reserve_trial`.
6172

6273
If none are available, it will produce new trials using
63-
:py:meth:`orion.core.worker.producer.Producer.produce()` which loads
74+
:py:meth:`orion.core.worker.producer.Producer.produce` which loads
6475
the state of the algorithm from the storage, runs it to suggest new
6576
:py:class:`orion.core.worker.trial.Trial` and saves both the new
6677
trials and the new algorithm state to the storage. This is protected
@@ -95,3 +106,31 @@ the count of broken trials if they did not finish successfully.
95106

96107
Finally we monitor the total amount of time spent waiting for trials
97108
to finish.
109+
110+
111+
Stopping criteria
112+
~~~~~~~~~~~~~~~~~
113+
114+
There are multiple criteria that are monitored to stop the
115+
experiment.
116+
117+
The first obvious one is the configured maximum number of trials to
118+
run. If this is reached, then we stop running more. This is checked at
119+
the beginning of the loop with
120+
:py:attr:`orion.client.runner.Runner.is_running`.
121+
122+
The experiment can also stop if too many trials fail, either because
123+
they fail to start, they crashed, were killed (like by an external job
124+
scheduler) or they take too much time to complete. This is checked in
125+
:py:meth:`orion.client.runner.Runner.gather` with
126+
:py:attr:`orion.client.runner.Runner.is_broken`.
127+
128+
If one of the workers returns an unexpected result the experiment is
129+
also stop immediately because it is assume that something is wrong
130+
with either the code or the configuration and spending more time
131+
computing stuff will not fix it. This is also checked for in
132+
:py:meth:`orion.client.runner.Runner.gather`.
133+
134+
Finally, if the loop spends too much time waiting and nothing happens
135+
the experiment is considered stalled and will also stop. This is
136+
checked at the end of :py:meth:`orion.client.runner.Runner.run`.
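
To make the order of these checks concrete, here is a simplified sketch of such a loop. ``SketchRunner`` is hypothetical and is not Orion's ``Runner``: the only names taken from the text above are ``is_running``, ``is_broken``, ``gather`` and ``run``. Trial outcomes are fed in as plain strings, and idle time is counted in loop iterations rather than seconds to keep the example deterministic.

```python
class SketchRunner:
    """Hypothetical, simplified runner illustrating the four stopping criteria."""

    def __init__(self, outcomes, max_trials, max_broken, max_idle_steps):
        # Each outcome is "completed", "broken", or anything else (unexpected).
        self.outcomes = iter(outcomes)
        self.max_trials = max_trials
        self.max_broken = max_broken
        self.max_idle_steps = max_idle_steps
        self.completed = 0
        self.broken = 0
        self.idle_steps = 0

    @property
    def is_running(self):
        # Criterion 1: stop once the maximum number of trials is reached.
        return self.completed < self.max_trials

    @property
    def is_broken(self):
        # Criterion 2: stop once too many trials have failed.
        return self.broken >= self.max_broken

    def gather(self):
        """Collect one result; return a stop reason or None to keep going."""
        outcome = next(self.outcomes, None)
        if outcome is None:
            # Nothing finished during this iteration.
            self.idle_steps += 1
            return None
        self.idle_steps = 0
        if outcome == "completed":
            self.completed += 1
            return None
        if outcome == "broken":
            self.broken += 1
            return "broken" if self.is_broken else None
        # Criterion 3: an unexpected result stops the experiment immediately.
        return "unexpected"

    def run(self):
        while self.is_running:  # checked at the beginning of the loop
            reason = self.gather()
            if reason is not None:
                return reason
            # Criterion 4: checked at the end of the loop body.
            if self.idle_steps >= self.max_idle_steps:
                return "stalled"
        return "max_trials"
```

For example, ``SketchRunner(["completed"], max_trials=5, max_broken=2, max_idle_steps=3).run()`` completes one trial, then sees three idle iterations in a row and reports the experiment as stalled.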
