This document describes a new resource manager architecture intended to simplify the development of new resource managers. For a detailed description of the overall PTP architecture, refer to the PTP 2.x Design Document.
In the existing architecture, a resource manager comprises two parts: a Java component that implements the client side of the Resource Manager Proxy Protocol, and an external component (usually written in C or Python) that implements the server side of the protocol and interacts with the runtime system on the target machine. In the new architecture, the resource manager comprises a single Java component that interacts with the runtime system via the command-line interface.
The resource manager is a Java component that interacts with a (potentially remote) runtime system via the command-line interface. The resource manager issues commands to perform activities and attempts to interpret the command results in a meaningful manner. Since some commands will be long running, the resource manager guarantees that they are terminated when it is shut down (e.g. on exit from Eclipse).
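The shutdown guarantee described above could be met by tracking every command the resource manager starts. The following is a minimal sketch under stated assumptions, not PTP's actual implementation: the class name `CommandTracker` and its methods are hypothetical, and only the standard `java.lang.Process` and `ProcessBuilder` APIs are used.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper that remembers commands started by a resource manager. */
final class CommandTracker {
    // Thread-safe set, since commands may be started from several threads.
    private final Set<Process> running = ConcurrentHashMap.newKeySet();

    /** Starts a command-line command and tracks the resulting process. */
    Process start(List<String> command) throws IOException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout for simpler parsing
                .start();
        track(process);
        return process;
    }

    /** Tracks a process that was started elsewhere. */
    void track(Process process) {
        running.add(process);
    }

    /** Number of processes currently tracked. */
    int runningCount() {
        return running.size();
    }

    /** Requests termination of every tracked command and returns how many were signalled. */
    int shutdown() {
        int signalled = 0;
        for (Process process : running) {
            process.destroy(); // harmless if the process has already exited
            signalled++;
        }
        running.clear();
        return signalled;
    }
}
```

One possible way to wire this in would be to call `shutdown()` from a JVM shutdown hook, for example `Runtime.getRuntime().addShutdownHook(new Thread(tracker::shutdown))`, or from whatever code stops the resource manager.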
When a resource manager is first started, it attempts to discover information about the target system by issuing a discover command. This command will attempt to discover information such as:
- The number of racks (machines) and nodes
- Attributes about the hardware
- Job queues and resource requirements
- User definable parameters
The discover command is expected to run only once, when the resource manager is first started.
The output format for the discover command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.
Once discovery has completed, the resource manager will begin monitoring the status of the system and jobs using a monitor command. Two types of monitor commands will be supported:
- Periodic monitoring, where the command is issued on a regular basis
- Continuous monitoring, where the command is issued once and continues to run for the life of the resource manager session
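The difference between the two monitoring modes can be sketched with the standard `java.util.concurrent` scheduler. This is an illustration only: `MonitorScheduler`, its `Mode` enum, and the `monitorCommand` callback are hypothetical names, and the callback stands in for whatever code actually issues the monitor command and reads its output.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler illustrating periodic versus continuous monitoring. */
final class MonitorScheduler implements AutoCloseable {
    enum Mode { PERIODIC, CONTINUOUS }

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Starts monitoring. In PERIODIC mode the callback is re-run after each
     * completion with a delay of periodMillis (which must be positive). In
     * CONTINUOUS mode the callback is run exactly once and is expected to
     * block for the life of the session; periodMillis is ignored.
     */
    void start(Mode mode, Runnable monitorCommand, long periodMillis) {
        if (mode == Mode.PERIODIC) {
            executor.scheduleWithFixedDelay(
                    monitorCommand, 0, periodMillis, TimeUnit.MILLISECONDS);
        } else {
            executor.execute(monitorCommand);
        }
    }

    /** Stops monitoring and interrupts a callback that is still running. */
    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

A fixed delay rather than a fixed rate is chosen here so that a slow monitor command never overlaps with its next invocation; that is a design assumption of this sketch, not something the architecture prescribes.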
The monitor command will provide model update information, such as:
- Status changes to machines and nodes
- Status changes to queues, jobs and processes
- Other attribute changes to model elements
- New model elements (such as new nodes coming on line)
The output format for the monitor command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.
Once discovery has completed, the resource manager is ready to begin launching and controlling jobs.
Jobs are submitted for launch using the submit command. The resource manager supplies the following information to this command (as arguments):
- Resource requirements for the job launch (e.g. number of processes, host list)
- The executable path and arguments for the job
- The working directory
- Environment variables
- A debug flag, indicating that the job should be launched under the control of a debugger
- Debugger-specific arguments
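As an illustration of how the items above might be turned into command-line arguments, the sketch below assembles an argument vector for a submit command. Everything about the command line itself is an assumption made for the example: the flag names (`--nprocs`, `--cwd`, `--env`, `--debug`, `--debug-arg`) and the class `SubmitCommandBuilder` are hypothetical, since the actual arguments depend on the target system. Only the number of processes is shown as a resource requirement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical builder for a submit command line; all flag names are invented. */
final class SubmitCommandBuilder {
    static List<String> build(String submitCommand, int numProcs, String executable,
                              List<String> programArgs, String workingDir,
                              Map<String, String> environment, boolean debug,
                              List<String> debuggerArgs) {
        List<String> argv = new ArrayList<>();
        argv.add(submitCommand);

        // Resource requirements (only the process count in this sketch).
        argv.add("--nprocs");
        argv.add(Integer.toString(numProcs));

        // Working directory.
        argv.add("--cwd");
        argv.add(workingDir);

        // Environment variables, one NAME=VALUE pair per flag.
        for (Map.Entry<String, String> entry : environment.entrySet()) {
            argv.add("--env");
            argv.add(entry.getKey() + "=" + entry.getValue());
        }

        // Debug flag and debugger-specific arguments, only when debugging.
        if (debug) {
            argv.add("--debug");
            for (String arg : debuggerArgs) {
                argv.add("--debug-arg");
                argv.add(arg);
            }
        }

        // Executable path and its arguments come last, after a separator.
        argv.add("--");
        argv.add(executable);
        argv.addAll(programArgs);
        return argv;
    }
}
```

Building a list of separate arguments, rather than one concatenated string, avoids quoting problems when paths or values contain spaces; the list can be passed directly to `ProcessBuilder`.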
The submit command is expected to persist for the life of the job. For interactive execution, it would persist for the life of an mpirun command. For batch execution, it would persist until job execution terminates (which may be a long time after job submission).
The submit command may provide additional information to PTP about the status of model elements (e.g. process status), and is responsible for forwarding stdin/stdout/stderr between the application and PTP.
The output format for the submit command is currently system dependent, but an XML-based format is under development and is expected to be used for new resource managers.
A separate termination command is provided for systems that require one to terminate job execution. For example, a job scheduler may require a command to terminate a job and/or remove it from a queue. For systems that do not require a termination command, job termination will be assumed to occur when the submit command exits (or is killed).
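The two termination paths described above can be sketched as a single decision: run the system's termination command if one is configured, otherwise kill the submit command. The names below (`JobTerminator`, the `launcher` callback, and the convention of appending the job identifier as the last argument) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper choosing between a termination command and killing submit. */
final class JobTerminator {
    // Base argv of the termination command, or null if the system needs none.
    private final List<String> terminateCommand;
    // Callback that actually issues a command line (assumed to exist elsewhere).
    private final Consumer<List<String>> launcher;

    JobTerminator(List<String> terminateCommand, Consumer<List<String>> launcher) {
        this.terminateCommand = terminateCommand;
        this.launcher = launcher;
    }

    /** Terminates the job identified by jobId, whose submit command is submitProcess. */
    void terminate(String jobId, Process submitProcess) {
        if (terminateCommand != null) {
            // The system requires an explicit command, e.g. to remove a queued job.
            List<String> argv = new ArrayList<>(terminateCommand);
            argv.add(jobId); // assumption: job id is passed as the final argument
            launcher.accept(argv);
        } else {
            // No termination command: the job ends when the submit command is killed.
            submitProcess.destroy();
        }
    }
}
```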
Some resource managers will provide additional functionality, such as manipulating jobs in queues or configuring parameters. These activities will be supported through command extensions that allow PTP to issue additional commands as required.