@@ -16,16 +16,27 @@ The code in :py:func:`orion.core.cli.main` will parse the command line
 arguments and route to :py:func:`orion.core.cli.hunt.main`.
 
 The command line arguments are passed to
-:py:func:`orion.core.io.experiment_builder.build_from_args`. This will
-massage the parsed command line arguments and merge that configuration
-with the config file and the defaults with various helpers from
-:py:mod:`orion.core.io.resolve_config` to build the final
-configuration. The result is eventually handled off to
+:py:func:`orion.core.io.experiment_builder.build_from_args`, which
+does some setup and hands over the arguments to
+:py:func:`orion.core.io.experiment_builder.build`. This will hand over
+the configuration to
+:py:func:`orion.core.io.experiment_builder.consolidate_config` which
+will look up the experiment in the configured storage to see if it's
+already there and merge the loaded configuration with the provided one
+with various helpers from :py:mod:`orion.core.io.resolve_config` to
+build the final configuration. The result is eventually handed off to
 :py:func:`orion.core.io.experiment_builder.create_experiment` to
 create an :py:class:`orion.core.worker.experiment.Experiment` and set
 its properties.
 
-The created experiments finds its way back to
+If the experiment is new, meaning it has no storage id, then it will
+attempt to save it to storage, which may conflict in case another
+instance of ``orion hunt`` is doing the same thing. The storage is
+responsible for reporting conflicts and
+:py:func:`orion.core.io.experiment_builder.build` is called again
+recursively in that case to retry the whole operation.
+
+The created experiment finds its way back to
 :py:func:`orion.core.cli.hunt.main` and is handed off to
 :py:func:`orion.core.cli.hunt.workon` along with some more
 configuration for the workers.
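
The retry-on-conflict behaviour described in the added paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: ``InMemoryStorage``, ``DuplicateKeyError``, the ``_before_save`` hook and this ``build`` signature are hypothetical stand-ins, not the actual ``orion.core.io.experiment_builder`` API, and the merge order (provided values override stored ones) is also an assumption.

```python
class DuplicateKeyError(Exception):
    """Hypothetical error raised by the storage on a conflicting write."""


class InMemoryStorage:
    """Hypothetical stand-in for a storage backend (not the real storage API)."""

    def __init__(self):
        self.experiments = {}

    def fetch(self, name):
        # Returns the stored configuration, or None if the experiment is unknown.
        return self.experiments.get(name)

    def create(self, name, config):
        # The storage, not the caller, is responsible for reporting conflicts.
        if name in self.experiments:
            raise DuplicateKeyError(name)
        self.experiments[name] = dict(config, id=len(self.experiments) + 1)
        return self.experiments[name]


def build(storage, name, config, _before_save=None):
    """Return the experiment configuration, creating it in storage if needed."""
    existing = storage.fetch(name)
    if existing is not None:
        # Already registered: merge the loaded configuration with the provided one.
        merged = dict(existing)
        merged.update(config)
        return merged
    if _before_save is not None:
        _before_save()  # hook used only to simulate a concurrent worker
    try:
        # New experiment (no storage id yet): try to save it.
        return storage.create(name, config)
    except DuplicateKeyError:
        # Someone else registered it first: retry the whole operation.
        return build(storage, name, config)
```

With this sketch, a second worker that loses the race ends up on the first branch during the retry and picks up the experiment that was stored by the winner, instead of failing.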
@@ -60,7 +71,7 @@ This will first check if any trials are available in the storage using
 :py:meth:`orion.core.worker.experiment.Experiment.reserve_trial`.
 
 If none are available, it will produce new trials using
-:py:meth:`orion.core.worker.producer.Producer.produce()` which loads
+:py:meth:`orion.core.worker.producer.Producer.produce` which loads
 the state of the algorithm from the storage, runs it to suggest new
 :py:class:`orion.core.worker.trial.Trial` and saves both the new
 trials and the new algorithm state to the storage. This is protected
@@ -95,3 +106,31 @@ the count of broken trials if they did not finish successfully.
 
 Finally we monitor the total amount of time spent waiting for trials
 to finish.
+
+
+Stopping criteria
+~~~~~~~~~~~~~~~~~
+
+There are multiple criteria that are monitored to stop the
+experiment.
+
+The first obvious one is the configured maximum number of trials to
+run. If this is reached, then we stop running more. This is checked at
+the beginning of the loop with
+:py:attr:`orion.client.runner.Runner.is_running`.
+
+The experiment can also stop if too many trials fail, either because
+they fail to start, crash, are killed (for example by an external job
+scheduler) or take too much time to complete. This is checked in
+:py:meth:`orion.client.runner.Runner.gather` with
+:py:attr:`orion.client.runner.Runner.is_broken`.
+
+If one of the workers returns an unexpected result, the experiment is
+also stopped immediately because it is assumed that something is wrong
+with either the code or the configuration and spending more time
+computing will not fix it. This is also checked for in
+:py:meth:`orion.client.runner.Runner.gather`.
+
+Finally, if the loop spends too much time waiting and nothing happens,
+the experiment is considered stalled and will also stop. This is
+checked at the end of :py:meth:`orion.client.runner.Runner.run`.
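
To make the four stopping criteria above concrete, here is a minimal toy loop. It is a sketch only: ``ToyRunner``, its constructor parameters, the exception classes and the use of an idle-step counter in place of wall-clock waiting time are all assumptions made for illustration. Only the names ``is_running``, ``is_broken``, ``gather`` and ``run`` mirror the attributes mentioned in the text, and their real implementations may differ.

```python
class BrokenExperiment(Exception):
    """Hypothetical error: too many trials failed."""


class InvalidResult(Exception):
    """Hypothetical error: a worker returned an unexpected result."""


class ToyRunner:
    """Simplified illustration of the stopping criteria, not the real Runner."""

    def __init__(self, outcomes, max_trials, max_broken, max_idle_steps):
        # Scripted outcomes: "ok", "broken", None (nothing finished), or anything
        # else, which is treated as an unexpected result.
        self.outcomes = iter(outcomes)
        self.max_trials = max_trials
        self.max_broken = max_broken
        self.max_idle_steps = max_idle_steps
        self.completed = 0
        self.broken = 0
        self.idle_steps = 0

    @property
    def is_running(self):
        # Criterion 1: stop once the maximum number of trials is reached.
        return self.completed < self.max_trials

    @property
    def is_broken(self):
        # Criterion 2: too many failed trials (threshold semantics assumed).
        return self.broken >= self.max_broken

    def gather(self):
        """Collect one scripted outcome; return True if something finished."""
        outcome = next(self.outcomes, None)
        if outcome is None:
            return False
        if outcome == "ok":
            self.completed += 1
        elif outcome == "broken":
            self.broken += 1
            if self.is_broken:
                raise BrokenExperiment(f"{self.broken} trials failed")
        else:
            # Criterion 3: an unexpected result stops the experiment at once.
            raise InvalidResult(outcome)
        return True

    def run(self):
        while self.is_running:
            if self.gather():
                self.idle_steps = 0
            else:
                self.idle_steps += 1
            # Criterion 4: checked at the end of each iteration.
            if self.idle_steps >= self.max_idle_steps:
                return "stalled"
        return "completed"
```

For example, ``ToyRunner(["ok", None, None], max_trials=5, max_broken=3, max_idle_steps=2).run()`` ends as stalled, because two consecutive iterations pass without any trial finishing.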