PTP/designs/rm new

Overview

This document describes a new resource manager architecture intended to simplify the development of new resource managers. For a detailed description of the overall PTP architecture, refer to the PTP 2.x Design Document.

In the existing architecture, a resource manager comprises two parts: a Java component that implements the client side of the Resource Manager Proxy Protocol, and an external component (usually written in C or Python) that implements the server side of the protocol and interacts with the runtime system on the target machine. In the new architecture, the resource manager is a single Java component that interacts with the runtime system via the command-line interface.

Architecture

The resource manager is a Java component that interacts with a (potentially remote) runtime system via the command-line interface. The resource manager issues commands to perform activities and attempts to interpret the results of each command in a meaningful manner. Since some commands will be long running, the resource manager guarantees that they are terminated when it shuts down (e.g. on exit from Eclipse).
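
The shutdown guarantee can be illustrated with a minimal sketch. The class CommandSession and its methods are hypothetical names chosen for this example and are not part of PTP. The sketch also assumes that commands run as local processes; a remote runtime system would need a remote equivalent of ProcessBuilder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of PTP: remembers every command started
// during a resource manager session so that none outlives the session.
final class CommandSession implements AutoCloseable {
    private final List<Process> running = new CopyOnWriteArrayList<>();

    // Start a command and track it until it exits. The caller is expected
    // to consume the command's output so it cannot block on a full pipe.
    Process start(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).start();
            running.add(process);
            // Stop tracking commands that finish on their own.
            process.onExit().thenRun(() -> running.remove(process));
            return process;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int runningCount() {
        return running.size();
    }

    // Call on shutdown: request termination first, then force it for any
    // command that is still alive after a short grace period.
    @Override
    public void close() {
        for (Process process : running) {
            process.destroy();
        }
        for (Process process : running) {
            try {
                if (!process.waitFor(2, TimeUnit.SECONDS)) {
                    process.destroyForcibly();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                process.destroyForcibly();
            }
        }
        running.clear();
    }
}
```

One way to tie this to shutdown is to call close() from the code path that stops the resource manager, or to register it with Runtime.getRuntime().addShutdownHook(new Thread(session::close)) as a last resort.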

Discovery

When a resource manager is first started, it attempts to discover information about the target system by issuing a discover command. This command will attempt to discover information such as:

The number of racks (machines) and nodes
Attributes about the hardware
Job queues and resource requirements
User definable parameters

The discover command is expected to run only once, when the resource manager is first started.

The output format for the discover command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.
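
Since the XML-based format is still under development, the following is only a sketch of how discover output might be interpreted. The element and attribute names (machine, node, name) and the class DiscoverOutput are assumptions made for this example and do not describe the actual format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical parser for an ASSUMED discover output layout such as:
//   <machine name="m0"><node name="n0"/><node name="n1"/></machine>
final class DiscoverOutput {
    // Return the name attribute of every <node> element, in document order.
    static List<String> nodeNames(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = document.getElementsByTagName("node");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            // Malformed output is reported to the caller rather than ignored.
            throw new IllegalArgumentException("Unparseable discover output", e);
        }
    }
}
```

A real implementation would also extract hardware attributes, queues and user-definable parameters, and would build model elements rather than a list of names.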

Monitoring

Once discovery has completed, the resource manager will begin monitoring the status of the system and jobs using a monitor command. Two types of monitor commands will be supported:

Periodic monitoring, where the command is issued on a regular basis
Continuous monitoring, where the command is issued once and continues to run for the life of the resource manager session
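
The difference between the two monitoring types can be sketched as follows. The class Monitor and its method names are hypothetical, the monitor command is assumed to write one update per output line, and commands are assumed to run as local processes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch, not part of PTP.
final class Monitor {
    // Deliver each line of monitor output to the handler until end of stream.
    static void pump(Reader output, Consumer<String> onUpdate) {
        try (BufferedReader reader = new BufferedReader(output)) {
            String line;
            while ((line = reader.readLine()) != null) {
                onUpdate.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Issue the monitor command once and deliver its output until it exits.
    static void runOnce(List<String> command, Consumer<String> onUpdate) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            pump(new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8), onUpdate);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Periodic monitoring: reissue the command, pausing periodMillis between
    // runs. Note that an exception thrown by a run cancels all later runs.
    static ScheduledFuture<?> periodic(ScheduledExecutorService scheduler, List<String> command,
                                       Consumer<String> onUpdate, long periodMillis) {
        return scheduler.scheduleWithFixedDelay(
                () -> runOnce(command, onUpdate), 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Continuous monitoring: issue the command once and stream its output
    // on a daemon thread for as long as the command keeps running.
    static Thread continuous(List<String> command, Consumer<String> onUpdate) {
        Thread thread = new Thread(() -> runOnce(command, onUpdate), "rm-monitor");
        thread.setDaemon(true);
        thread.start();
        return thread;
    }
}
```

In this sketch both types share the same output handling and differ only in how often the command is started. A fuller version would also keep a handle on the continuous command's process so that it can be terminated when the resource manager shuts down.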

The monitor command will provide model update information, such as:

Status changes to machines and nodes
Status changes to queues, jobs and processes
Other attribute changes to model elements
New model elements (such as new nodes coming on line)

The output format for the monitor command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.

Job Control

Once discovery has completed, the resource manager is ready to begin launching and controlling jobs.

Submission

Jobs are submitted for launch using the submit command. The resource manager supplies the following information to this command (as arguments):

Resource requirements for the job launch (e.g. number of processes, host list)
The executable path and arguments for the job
The working directory
Environment variables
A debug flag, indicating that the job should be launched under the control of a debugger
Debugger-specific arguments
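
As an illustration of how the items above could be passed as arguments, here is a minimal sketch. The class SubmitArguments, the command name and every flag (--nprocs, --cwd, --env, --debug, --debugger-arg) are hypothetical; a real submit command defines its own options.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of PTP: all flag names are assumptions.
final class SubmitArguments {
    static List<String> build(String submitCommand, int processes, String workingDir,
                              Map<String, String> environment, boolean debug,
                              List<String> debuggerArgs, String executable,
                              List<String> programArgs) {
        List<String> args = new ArrayList<>();
        args.add(submitCommand);
        // Resource requirements (only the process count in this sketch).
        args.add("--nprocs");
        args.add(Integer.toString(processes));
        args.add("--cwd");
        args.add(workingDir);
        // Sort the variables so the generated command line is reproducible.
        for (Map.Entry<String, String> entry : new TreeMap<>(environment).entrySet()) {
            args.add("--env");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        // Debugger-specific arguments only apply to a debug launch.
        if (debug) {
            args.add("--debug");
            for (String debuggerArg : debuggerArgs) {
                args.add("--debugger-arg");
                args.add(debuggerArg);
            }
        }
        // Separate the submit options from the job's own command line.
        args.add("--");
        args.add(executable);
        args.addAll(programArgs);
        return args;
    }
}
```

Keeping the arguments as a list, rather than a single string, avoids quoting problems when the list is later handed to something like ProcessBuilder.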

The submit command is expected to persist for the life of the job. For interactive execution, it would persist for the life of an mpirun command. For batch execution, it would persist until job execution terminates (which may be a long time after job submission).

The submit command may provide additional information to PTP about the status of model elements (e.g. process status), and is responsible for forwarding stdin/stdout/stderr between the application and PTP.

The output format for the submit command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.

Termination

A termination command is provided for systems that require an explicit command to terminate job execution. For example, a job scheduler may require a command to terminate a job and/or remove it from a queue. For systems that do not require a termination command, job termination is assumed to occur when the submit command exits (or is killed).
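
The choice between the two termination paths can be sketched as below. The class JobTermination is hypothetical, and the convention that an empty command list means "no termination command is configured" is an assumption of this example.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not part of PTP.
final class JobTermination {
    // Returns true when a termination command was issued, and false when the
    // submit command was killed instead. An empty terminateCommand means the
    // system does not require a termination command (assumed convention).
    static boolean terminate(List<String> terminateCommand,
                             Consumer<List<String>> issueCommand,
                             Runnable killSubmitCommand) {
        if (!terminateCommand.isEmpty()) {
            issueCommand.accept(terminateCommand);
            return true;
        }
        killSubmitCommand.run();
        return false;
    }
}
```

With a local submit process, the last argument could simply be submitProcess::destroy, where submitProcess is the java.lang.Process started for the submit command.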

Other Commands

Some resource managers will provide additional functionality, such as manipulating jobs in queues, configuring parameters, etc. These activities will be supported through command extensions that will allow additional commands to be issued by PTP as required.

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
