Description
Context
Scheduler runopts were originally designed to be simple runtime configurations passed on to the scheduler (e.g. which cluster to submit the job to).
runopts were meant to be extra configs that are not inherent to the definition of the application itself but rather associated with its runtime. For instance, run_as_user or cluster_id are runtime settings of an application that are not canonically part of describing what an app is, whereas the entrypoint (e.g. binary name) and (in the case of AI/HPC) the type of resource (host type) that the application needs are an integral part of describing the application.
This is why runopts were only allowed to be one of a restricted set of types defined by torchx.specs.CfgVal:
CfgVal = Union[str, int, float, bool, List[str], Dict[str, str], None]
Motivation
It turns out the type restriction is too stringent for expressing more advanced options as runopts.
Scheduler developers could have parameterized complex runopts so that each option is exposed as one of the CfgVal types. Take, for example, passing the ulimit option to the aws_batch_scheduler.
A ulimit has three fields (see the ulimit specs in the AWS docs):
- name: the type of ulimit, which can be "core", "cpu", "data", "fsize", "locks", etc.
- softLimit: the soft limit value for the ulimit type
- hardLimit: the hard limit value for the ulimit type
A natural way to express ulimit as a runopt is to make it a dataclass:
from dataclasses import dataclass
from typing import Literal

@dataclass
class Ulimit:
    name: Literal["core", "cpu", "data", "fsize", "locks",...]
    soft_limit: int
    hard_limit: int
However, due to the restriction on CfgVal types, it is not possible to register such a runopt directly as:
def _run_opts(self) -> runopts:
    opts = runopts()
    opts.add(
        name="ulimit",
        type_=list[Ulimit] | None,  # not a CfgVal, so this is rejected today
        default=None,
        help="ulimits for the container"
    )
    return opts
Instead, developers resort to encoding these struct-like options as strings. See: #1127
def _run_opts(self) -> runopts:
    opts = runopts()
    opts.add(
        name="ulimit",
        type_=str,
        help="ulimits for the container. Format: {type},{softLimit},{hardLimit} (multiple ulimits separated by `;`)"
    )
    return opts
Restricted runopt types are easy to parse from CLI arguments. When used programmatically, however, they create an awkward UX with two specific problems:
- The user needs to encode complex option types as strings just to pass them as scheduler_args
- Type information is lost (ulimit is a str, and one has to read the documentation to understand what format and values it can take)
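To make the two problems concrete, here is a minimal sketch of what a programmatic caller ends up writing by hand. The helpers `encode_ulimits` and `decode_ulimits` are hypothetical and not part of torchx; only the `{type},{softLimit},{hardLimit}` format with `;` as the separator is taken from the help text above.

```python
from typing import List, Tuple

# Hypothetical helpers, not part of torchx. Each ulimit is a plain tuple of
# (name, soft_limit, hard_limit) because the runopt itself is only a str.
UlimitTuple = Tuple[str, int, int]


def encode_ulimits(ulimits: List[UlimitTuple]) -> str:
    """Hand-encode ulimits as '{type},{softLimit},{hardLimit}' joined by ';'."""
    return ";".join(f"{name},{soft},{hard}" for name, soft, hard in ulimits)


def decode_ulimits(value: str) -> List[UlimitTuple]:
    """Reverse of encode_ulimits; the scheduler has to do this on its side."""
    decoded = []
    for item in value.split(";"):
        name, soft, hard = item.split(",")
        decoded.append((name, int(soft), int(hard)))
    return decoded


# The caller only ever sees a str, so nothing checks the field order or types.
cfg = {"ulimit": encode_ulimits([("core", 0, 0), ("fsize", 1024, 2048)])}
```

Both sides have to agree on the string format out of band, and a type checker cannot catch a swapped soft and hard limit or a misspelled ulimit name.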
If we think about the root reasons why we limited runopt types to CfgVal, they were:
1. To motivate the scheduler developer to be very intentional about the scheduler options they expose (rather than just exposing all of the underlying scheduler's configs as runopts)
2. To make runopts easy to use from the CLI (where one has to parse strings)
Reason 1 has already been violated, so there is no point in fighting what is already done. Rather, we should just make it easy and natural.
Reason 2 matters less now: since we are seeing increasing programmatic usage of torchx, we don't want to sacrifice programmatic UX in exchange for easier CLI integration.
Solution
Create a class StructuredRunOpt that can be subclassed by dataclasses representing more complex options.
StructuredRunOpt standardizes serialization to and deserialization from a str (for CLI-friendliness).
For the ulimit option example above, it could be implemented as:
@dataclass
class Ulimit(StructuredRunOpt):
    name: Literal["core", "cpu", "data", "fsize", "locks",...]
    soft_limit: int
    hard_limit: int

    def template(self) -> str:
        return "{name},{soft_limit},{hard_limit}"
An illustrative implementation of StructuredRunOpt would be:
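# use parse (https://pypi.org/project/parse/)
# to get the parameters from the repr given the template
import abc
from dataclasses import asdict

import parse

class StructuredRunOpt(abc.ABC):
    @abc.abstractmethod
    def template(self) -> str:
        ...

    def __repr__(self) -> str:
        return self.template().format(**asdict(self))

    @classmethod
    def from_repr(cls, repr: str):
        # template() reads no instance state, so an uninitialized instance is enough
        tmpl = cls.__new__(cls).template()
        # note: untyped fields such as {soft_limit} are parsed as str;
        # use {soft_limit:d} in the template to get an int
        result = parse.parse(tmpl, repr)
        return cls(**result.named)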
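The round trip can also be sketched using only the standard library. This is a variation on the illustrative implementation, not the proposed API, and it makes several assumptions: `re` and `string.Formatter` stand in for the `parse` package, `template` is a classmethod so no uninitialized instance is needed, `name` is typed as plain `str` instead of the `Literal[...]` above so that each field type can be called as a converter, and the subclass uses `@dataclass(repr=False)` because the default `@dataclass` generates its own `__repr__` that would replace the template-based one.

```python
import abc
import re
import string
from dataclasses import asdict, dataclass, fields


class StructuredRunOpt(abc.ABC):
    """Sketch of a base class whose string form follows a format template."""

    @classmethod
    @abc.abstractmethod
    def template(cls) -> str:
        ...

    def __repr__(self) -> str:
        return self.template().format(**asdict(self))

    @classmethod
    def from_repr(cls, value: str):
        # Translate "{name},{soft_limit}" into a regex with one named group
        # per template field, keeping the literal text in between.
        pattern = ""
        for literal, field_name, _spec, _conv in string.Formatter().parse(cls.template()):
            pattern += re.escape(literal)
            if field_name is not None:
                pattern += f"(?P<{field_name}>.+?)"
        match = re.fullmatch(pattern, value)
        if match is None:
            raise ValueError(f"{value!r} does not match template {cls.template()!r}")
        # Assumption: every field type is callable on a str (str, int, float).
        kwargs = {f.name: f.type(match.group(f.name)) for f in fields(cls)}
        return cls(**kwargs)


@dataclass(repr=False)  # keep the template-based __repr__ from the base class
class Ulimit(StructuredRunOpt):
    name: str  # simplified from the Literal[...] annotation for this sketch
    soft_limit: int
    hard_limit: int

    @classmethod
    def template(cls) -> str:
        return "{name},{soft_limit},{hard_limit}"


# CLI side: a str such as "core,0,1024" becomes a typed object.
ulimit = Ulimit.from_repr("core,0,1024")
# Programmatic side: construct the object directly and serialize when needed.
encoded = repr(Ulimit(name="fsize", soft_limit=1024, hard_limit=2048))
```

With this shape, programmatic callers work with `Ulimit` objects and keep type information, while the CLI can still accept the string form.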
Alternatives
A slightly better way to expose the ulimit runopt would have been to use a dict[str, str] (one of the allowed CfgVal types) or to create a runopt for each ulimit type:
opts = runopts()
for ulimit_type in ["core", "cpu", "data", ...]:
    opts.add(
        name=f"ulimit_{ulimit_type}softLimit",
        type_=int,
        help=f"ulimit {ulimit_type} softLimit value"
    )
    opts.add(
        name=f"ulimit_{ulimit_type}hardLimit",
        type_=int,
        help=f"ulimit {ulimit_type} hardLimit value"
    )
While this solves the typing issue, it is very unnatural to use and creates option bloat.