Support Structured runopt types #1132

@kiukchung

Context

Scheduler runopts were originally designed to be simple runtime configurations passed onto the scheduler (e.g. which cluster to submit the job to).

runopts were meant to be extra configs that are not inherent to the definition of the application itself but rather associated with its runtime. For instance, run_as_user and cluster_id are runtime settings of an application that are not canonically part of describing what an app is, whereas the entrypoint (e.g. binary name) and (in the case of AI/HPC) the type of resource (host type) that the application needs are an integral part of describing the application.

This is the reason why runopts were only allowed to be one of a restricted set of types defined by torchx.specs.CfgVal:

CfgVal = Union[str, int, float, bool, List[str], Dict[str, str], None]

Motivation

Turns out the type restriction is too stringent for expressing runopts for more advanced options.

Scheduler developers could have parameterized complex runopts so that each option is exposed as one of the CfgVal types. Take, for example, passing the ulimit option to the aws_batch_scheduler.

Ulimit has three fields (see ulimit specs in aws docs):

  • name: the type of ulimit; can be "core", "cpu", "data", "fsize", "locks", etc.
  • softLimit: the soft limit value for the ulimit type
  • hardLimit: the hard limit value for the ulimit type

A natural way to express ulimit as a runopt is to make it a dataclass:

from dataclasses import dataclass
from typing import Literal

@dataclass
class Ulimit:
  name: Literal["core", "cpu", "data", "fsize", "locks",...]
  soft_limit: int
  hard_limit: int

However, due to the restriction in CfgVal types, it is not directly possible to register a runopt as:

def _run_opts(self) -> runopts:
  opts = runopts()
  opts.add(
    name="ulimit",
    type_=list[Ulimit] | None,  # not a CfgVal, so this is currently rejected
    default=None,
    help="ulimits for the container"
  )
  return opts

Instead developers resort to encoding these struct-like options as strings. See: #1127

def _run_opts(self) -> runopts:
  opts = runopts()
  opts.add(
    name="ulimit",
    type_=str,
    help="ulimits for the container. Format: {type},{softLimit},{hardLimit} (multiple ulimits separated by `;`)"
  )
  return opts
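
To make the cost of the string encoding concrete, here is a minimal sketch of the parsing a scheduler would need for the `{type},{softLimit},{hardLimit}` format described in the help text. `ParsedUlimit` and `parse_ulimits` are hypothetical names for illustration and are not part of torchx; the example limit values are made up.

```python
from typing import NamedTuple


class ParsedUlimit(NamedTuple):
    # Hypothetical container for one decoded entry; not a torchx type.
    name: str
    soft_limit: int
    hard_limit: int


def parse_ulimits(value: str) -> list[ParsedUlimit]:
    """Decode `{type},{softLimit},{hardLimit}` entries separated by `;`."""
    ulimits = []
    # Skip empty entries so a trailing `;` is tolerated.
    for entry in filter(None, (e.strip() for e in value.split(";"))):
        parts = [p.strip() for p in entry.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 'type,softLimit,hardLimit', got: {entry!r}")
        name, soft, hard = parts
        ulimits.append(ParsedUlimit(name, int(soft), int(hard)))
    return ulimits


# A programmatic caller has to hand-build the string that the scheduler re-parses.
cfg = {"ulimit": "core,0,0;fsize,1024,4096"}
parsed = parse_ulimits(cfg["ulimit"])
```

Nothing in the `str` type tells the caller about the field order or the `;` separator, so every scheduler that takes this route has to document and validate its own mini-format.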

When used from the CLI, restricted runopt types are easy to parse from CLI arguments. When used programmatically, however, they create an awkward UX with two specific problems:

  1. The user needs to encode complex option types as strings just to pass them as scheduler_args
  2. Type information is lost (ulimit is a str, and one has to read the documentation to understand what format and values it can take)

If we think about the root reason why we limited runopt types to CfgVal, it was to:

  1. Motivate the scheduler developer to be very intentional about the scheduler options they expose (rather than just exposing all of the underlying scheduler's configs as runopts)
  2. Make it easy to use from the CLI (where one has to parse strings)

#1 has already been violated, so there is no point in fighting what is already done. Rather, we should just make it easy and natural.

#2: since we are seeing increasing programmatic usage of torchx, we don't want to sacrifice programmatic UX in exchange for easier CLI integration.

Solution

Create a class StructuredRunOpt that can be subclassed by dataclasses representing more complex options.

StructuredRunOpt standardizes serde into a str (for CLI-friendliness).

For the ulimit option example above it could be implemented as:

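# repr=False: otherwise the dataclass-generated __repr__ would shadow
# the one inherited from StructuredRunOpt
@dataclass(repr=False)
class Ulimit(StructuredRunOpt):
  name: Literal["core", "cpu", "data", "fsize", "locks",...]
  soft_limit: int
  hard_limit: int

  def template(self) -> str:
    # ":d" makes parse return ints for these fields instead of strs
    return "{name},{soft_limit:d},{hard_limit:d}"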

An illustrative implementation of StructuredRunOpt would be:

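# use parse (https://pypi.org/project/parse/)
# to get the parameters from the repr given the template
import abc
from dataclasses import asdict

import parse

class StructuredRunOpt(abc.ABC):

  @abc.abstractmethod
  def template(self) -> str:
    ...

  def __repr__(self) -> str:
    return self.template().format(**asdict(self))

  @classmethod
  def from_repr(cls, repr: str):
    # template() reads no fields, so an uninitialized instance is enough
    tmpl = cls.__new__(cls).template()
    result = parse.parse(tmpl, repr)
    if result is None:
      raise ValueError(f"`{repr}` does not match template `{tmpl}`")
    # untyped template fields (e.g. "{name}") come back as str
    return cls(**result.named)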
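
The round trip can be sketched end to end without the third-party dependency. The version below is an assumption-laden stand-in, not the proposed implementation: it swaps `parse` for the stdlib `string.Formatter` and `re`, simplifies `name` to a plain `str`, and coerces captured values using the dataclass field annotations. It also passes `repr=False` to `@dataclass`, because the dataclass-generated `__repr__` would otherwise replace the inherited one.

```python
import abc
import re
import string
from dataclasses import asdict, dataclass, fields


class StructuredRunOpt(abc.ABC):
    """Sketch only: stdlib stand-in for the `parse`-based version."""

    @abc.abstractmethod
    def template(self) -> str:
        ...

    def __repr__(self) -> str:
        return self.template().format(**asdict(self))

    @classmethod
    def from_repr(cls, repr_: str):
        # template() reads no fields, so an uninitialized instance is enough.
        tmpl = cls.__new__(cls).template()
        # Turn "{name},{soft_limit}" into a regex with one named group per field.
        pattern = ""
        for literal, field_name, _spec, _conv in string.Formatter().parse(tmpl):
            pattern += re.escape(literal)
            if field_name:
                pattern += f"(?P<{field_name}>.+?)"
        match = re.fullmatch(pattern, repr_)
        if match is None:
            raise ValueError(f"{repr_!r} does not match template {tmpl!r}")
        # Coerce the captured strings using the dataclass field annotations
        # (works here because the annotations are the callables str and int).
        types = {f.name: f.type for f in fields(cls)}
        return cls(**{k: types[k](v) for k, v in match.groupdict().items()})


@dataclass(repr=False)  # keep the inherited __repr__ instead of the generated one
class Ulimit(StructuredRunOpt):
    name: str  # simplified from the Literal[...] annotation for this sketch
    soft_limit: int
    hard_limit: int

    def template(self) -> str:
        return "{name},{soft_limit},{hard_limit}"


ulimit = Ulimit(name="core", soft_limit=0, hard_limit=1024)
encoded = repr(ulimit)              # CLI-friendly string form
decoded = Ulimit.from_repr(encoded) # typed object for programmatic use
```

Programmatic callers would pass `Ulimit(...)` objects directly, while the CLI path would go through `from_repr`, so both entry points share one definition of the format.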

Alternatives

A slightly better way to expose the ulimit runopt would have been to use a dict[str, str] (one of the allowed CfgVal types) or to create a runopt for each ulimit type:

opts = runopts()
for ulimit_type in ["core", "cpu", "data", ...]:
  opts.add(
    name=f"ulimit_{ulimit_type}_softLimit",
    type_=int,
    help=f"ulimit {ulimit_type} softLimit value"
  )
  opts.add(
    name=f"ulimit_{ulimit_type}_hardLimit",
    type_=int,
    help=f"ulimit {ulimit_type} hardLimit value"
  )

While this solves the typing issue, it is very unnatural to use and creates option bloat.
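
For comparison, the dict[str, str] variant could look roughly like the sketch below. The `"soft:hard"` value encoding and the `ulimits_from_dict` helper are assumptions made for illustration; the issue does not specify how the dict values would be laid out.

```python
from typing import Dict, List, Tuple


def ulimits_from_dict(value: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Decode {ulimit_type: "soft:hard"}; this value encoding is hypothetical."""
    decoded = []
    for name, limits in value.items():
        soft, sep, hard = limits.partition(":")
        if not sep:
            raise ValueError(f"expected 'soft:hard' for {name!r}, got: {limits!r}")
        decoded.append((name, int(soft), int(hard)))
    return decoded


# The dict keys carry the ulimit type, but the values are still encoded strings.
cfg = {"ulimit": {"core": "0:0", "fsize": "1024:4096"}}
decoded = ulimits_from_dict(cfg["ulimit"])
```

The keys become self-describing, yet the soft and hard limits still travel as strings, so the loss of type information is only partly addressed.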
