PTP/designs/4.0/rm proxy

Overview

This is a preliminary design for the PTP Resource Management proxy communication protocol. This protocol is used for communication between the Resource Manager System in Eclipse and a lightweight proxy agent running on a target system. The primary purposes of the protocol are system monitoring, process launch, and process control.

Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].

Resource Manager System

The Resource Manager System (RMS) is an Eclipse plugin that manages interaction with arbitrary resource managers. In this context, a resource manager is anything that provides program launch and monitoring services on a target system. Typically, a resource manager will be a job scheduler (e.g. LSF, LoadLeveler, PBS, etc.) running on a large multi-user system. Other types of resource managers include the Open Runtime Environment (ORTE), which is part of the OpenMPI distribution, and the MPICH2 runtime system. The RMS is responsible for populating an internal model in Eclipse that provides a cached representation of the system and program state. Various user interface views are available to inspect and interact with this model. Details of the RMS are provided in a separate document, PTP/designs/rms.

Proxy Agent

The RMS communicates with proxy agents to gather information about the state of a target system. The proxy agent may be located on either a local or remote machine.

Proxy Agent Launch

The RMS is responsible for launching the proxy agent. On a local machine, this just involves executing a local process. To launch on a remote machine, the RMS must use an authenticated command service, such as ssh. Current plans are to utilize the Remote System Explorer (RSE) system to provide this remote proxy launch capability.

Proxy Session

One instance of a communication channel between the RMS and a proxy agent is known as a session. A session supports communication with only a single proxy agent at a time. The mechanism used to effect communication between the RMS and a proxy agent is not defined in this document, but can be any bi-directional communications channel (e.g. TCP/IP sockets).

Proxy Protocol

The communication protocol used between the RMS and the proxy agent is a simple text-based asynchronous command/event protocol. The RMS sends one or more commands to the proxy agent, which in turn will generate events that are returned to the RMS. One command may generate multiple events, but an event is always associated with a particular command. A command is not completed until either a corresponding ERROR or OK event is received.

Transaction IDs

Transaction IDs (TIDs) are numbers that are used to match commands and events. Since one command may generate multiple events, TIDs are essential in order to determine which command generated an event. This means that every event MUST have a TID that matches a corresponding command.

TIDs are unique only among uncompleted commands, not necessarily across the whole session.

Proxy agents SHOULD assume that a particular TID MAY be reused.

Proxy agents SHOULD NOT assume anything about the numbering or ordering of TIDs.

Any events received with an invalid TID (i.e. with no corresponding command) SHALL be discarded.
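The TID rules above can be illustrated with a minimal RMS-side sketch. The `TransactionTable` class and its methods are hypothetical names, not part of PTP; the sketch assumes TIDs are small positive integers and hands a TID out again as soon as its command completes, which the rules above permit.

```python
from typing import Dict, Optional


class TransactionTable:
    """Hypothetical RMS-side bookkeeping of uncompleted commands, keyed by TID."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}  # TID -> name of the uncompleted command

    def allocate(self, command: str) -> int:
        """Assign the smallest TID not held by an uncompleted command."""
        tid = 1
        while tid in self._pending:
            tid += 1
        self._pending[tid] = command
        return tid

    def handle_event(self, tid: int, event: str) -> Optional[str]:
        """Return the command that produced the event, or None if it must be discarded."""
        command = self._pending.get(tid)
        if command is None:
            return None  # no matching uncompleted command: discard the event
        if event in ("OK", "ERROR"):
            del self._pending[tid]  # OK or ERROR completes the command, freeing its TID
        return command
```

Because a completed TID may be handed out again immediately, a proxy agent that kept long-lived state keyed on TIDs would misattribute later commands, which is why proxy agents should assume reuse.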

Element IDs

Element IDs are key to communication between the RMS and the proxy agent. Every element in the model (queue, node, job, etc.) has a unique ID, which is generated by the proxy agent. To ensure that IDs are unique, the RMS supplies a base ID as part of the initialization sequence. This base ID is guaranteed to be unique, and is used by the proxy agent to "uniquify" its generated IDs. The base ID is itself an element ID, and so cannot be used by the proxy as the ID of one of its elements. It is possible to re-use an element ID, provided that the model element has been removed from the model. However, this is not recommended.

A proxy agent has two choices for the type of IDs it generates: numeric or string. The RMS does not care which is used, and numeric and string IDs can be mixed. Each type has advantages and disadvantages.

Numeric Element IDs

Numeric IDs are typically used when the proxy agent is dealing with large numbers of objects (e.g. processes, nodes, etc.). They are more efficient to transmit, because the compact Range Set Notation can be used. When using numeric IDs, the proxy agent should try to generate "blocks" of consecutive IDs to make the range sets more efficient. A proxy agent might, for example, reserve one range of IDs for nodes, a different range for processes, and so on.

Numeric IDs are computed using the base ID in such a way that the ID value will be unique. For example, one such computation would be to add the base ID to a monotonically increasing series of integers starting at 1.
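A minimal sketch of the base-plus-offset computation, assuming the base ID arrives as an integer. `NumericIdAllocator` is a hypothetical helper, and allocating consecutive blocks is one way (not the only way) to keep range sets short.

```python
class NumericIdAllocator:
    """Hypothetical helper: element IDs are base_id + n for n = 1, 2, 3, ..."""

    def __init__(self, base_id: int) -> None:
        self._base_id = base_id
        self._next = 1  # offsets start at 1, so the base ID itself is never issued

    def allocate_block(self, count: int) -> range:
        """Reserve `count` consecutive IDs, e.g. for all the nodes of a machine."""
        if count < 1:
            raise ValueError("count must be positive")
        start = self._base_id + self._next
        self._next += count
        return range(start, start + count)

    def allocate(self) -> int:
        """Reserve a single ID."""
        return self.allocate_block(1)[0]
```

With a base ID of 100, `allocate_block(4)` yields 101 to 104, which form one run and could be sent as the single range 101-104.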

String Element IDs

String IDs are easy to generate, but are less efficient because they cannot be expressed using the Range Set Notation. These IDs are generated by creating a unique string and appending it to the base ID. The new ID is then guaranteed to be unique across the model.
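A sketch of string ID generation, assuming the base ID is handled as a string and that an underscore separator is acceptable (the text above does not specify one). `StringIdAllocator` and the `kind` label are hypothetical.

```python
import itertools

SEPARATOR = "_"  # assumption: the protocol does not fix a separator; it must not contain digits


class StringIdAllocator:
    """Hypothetical helper: appends a locally unique suffix to the base ID."""

    def __init__(self, base_id: str) -> None:
        self._base_id = base_id
        self._counter = itertools.count(1)

    def allocate(self, kind: str) -> str:
        # `kind` only makes IDs readable. The trailing counter never repeats and
        # contains no separator, so it alone keeps every generated ID distinct.
        return f"{self._base_id}{SEPARATOR}{kind}{SEPARATOR}{next(self._counter)}"
```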

Range Set Notation

This is a compact representation of sequences of numeric (unsigned integer) values. A range of consecutive values is represented by the first and last values in the range, separated by a dash character (hex 2D); a single value may appear on its own. Ranges are grouped by separating them with commas (hex 2C). Spaces or other characters are not permitted in a range set.

For example, the sequence:

1,2,3,4,5,6,7,34,35,36,37,38,41,55,56,57

would be represented as:

1-7,34-38,41,55-57

Ranges in the set need not be in sorted order, and can be overlapping.

The range set:

1-10,4-12,3

represents the 12 consecutive numbers starting at 1.
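The following sketch encodes and decodes range sets as described above. The function names are hypothetical. Rejecting descending ranges such as 5-3, and treating an empty string as the empty set, are assumptions, since the text only says that ranges may be unsorted and may overlap.

```python
import re
from typing import Iterable, List, Set

# One range: digits, optionally followed by a dash and more digits (no spaces allowed).
_RANGE = re.compile(r"([0-9]+)(?:-([0-9]+))?")


def encode_range_set(values: Iterable[int]) -> str:
    """Encode unsigned integers as a sorted, non-overlapping range set."""
    ordered = sorted(set(values))
    parts: List[str] = []
    i = 0
    while i < len(ordered):
        start = ordered[i]
        # Extend the run while the next value is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == ordered[i] + 1:
            i += 1
        end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def decode_range_set(text: str) -> Set[int]:
    """Decode a range set whose ranges may be unsorted and overlapping."""
    values: Set[int] = set()
    if not text:
        return values
    for part in text.split(","):
        match = _RANGE.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed range: {part!r}")
        lo = int(match.group(1))
        hi = int(match.group(2)) if match.group(2) is not None else lo
        if lo > hi:
            raise ValueError(f"descending range: {part!r}")
        values.update(range(lo, hi + 1))
    return values
```

For the example sequence above, `encode_range_set` returns 1-7,34-38,41,55-57, and `decode_range_set("1-10,4-12,3")` yields the 12 values 1 through 12.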

Message Format

Commands and events consist of sequences of ASCII characters formatted into a message. A message is transmitted in the following format:

MESSAGE ⇒ LENGTH " " COMMAND_OR_EVENT

LENGTH and COMMAND_OR_EVENT are separated by a space (hex 20). The LENGTH is the length of the COMMAND_OR_EVENT portion of the message including the space. COMMAND_OR_EVENT is the actual text of the command or event.

COMMAND_OR_EVENT ⇒ COMMAND | EVENT

The COMMAND_OR_EVENT portion of the message contains a header part followed by a sequence of arguments separated by spaces. Each argument is a string formatted using the String Format described below. The command format is described in more detail in the Commands section. The event format is described in more detail in the Events section.
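A sketch of message framing over a byte stream. The text above does not say how LENGTH is encoded, nor exactly which space "including the space" refers to, so this sketch assumes LENGTH is 8 hexadecimal digits (like the String Format lengths described below) and makes counting the separating space a `count_separator` option. The function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: LENGTH is written as 8 hex digits; the protocol text does not fix its width.
LENGTH_DIGITS = 8


def frame_message(body: str, count_separator: bool = False) -> bytes:
    """Wrap a COMMAND_OR_EVENT string as LENGTH " " COMMAND_OR_EVENT."""
    payload = body.encode("ascii")
    length = len(payload) + (1 if count_separator else 0)
    return f"{length:0{LENGTH_DIGITS}X} ".encode("ascii") + payload


def unframe_messages(buffer: bytes, count_separator: bool = False) -> Tuple[List[str], bytes]:
    """Split received bytes into complete messages; return (messages, leftover bytes)."""
    messages: List[str] = []
    pos = 0
    while True:
        # LENGTH contains no spaces, so the first space is the separator.
        space = buffer.find(b" ", pos)
        if space < 0:
            break  # LENGTH field not fully received yet
        length = int(buffer[pos:space].decode("ascii"), 16)
        end = space + 1 + length - (1 if count_separator else 0)
        if end > len(buffer):
            break  # body not fully received yet; keep the partial message
        messages.append(buffer[space + 1:end].decode("ascii"))
        pos = end
    return messages, buffer[pos:]
```

The leftover bytes would be prepended to the next read from the channel, so a message split across reads is only decoded once it is complete.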

String Format

Strings are transmitted using the following format:

STRING ⇒ LENGTH ":" CHARACTERS

where LENGTH is a fixed length 8 digit hexadecimal representation of the length of the string, ':' is a colon character (hex 3A), and CHARACTERS are the actual ASCII characters in the string. No string terminating character (e.g. NULL) is ever transmitted.

For example, the string "A String" would be formatted as:

00000008:A String

A zero length string would be formatted as:

00000000:
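A sketch of STRING encoding and of splitting a space-separated sequence of STRINGs, such as a command body. Because each STRING carries its own length, its characters may themselves contain spaces or colons. Writing hex digits in uppercase is an assumption for string lengths, chosen to match command IDs such as 000A elsewhere in this document; the function names are hypothetical.

```python
from typing import List

LEN_DIGITS = 8  # LENGTH is a fixed-width, 8-digit hexadecimal number


def encode_string(s: str) -> str:
    """STRING => LENGTH ":" CHARACTERS, with no terminating character."""
    s.encode("ascii")  # raises UnicodeEncodeError: only ASCII may be transmitted
    return f"{len(s):0{LEN_DIGITS}X}:{s}"


def decode_strings(body: str) -> List[str]:
    """Decode STRINGs separated by single spaces."""
    strings: List[str] = []
    pos = 0
    while pos < len(body):
        if strings:  # every STRING after the first is preceded by one space
            if body[pos] != " ":
                raise ValueError(f"expected space at offset {pos}")
            pos += 1
        header = body[pos:pos + LEN_DIGITS + 1]
        if len(header) != LEN_DIGITS + 1 or header[-1] != ":":
            raise ValueError(f"malformed STRING header at offset {pos}")
        length = int(header[:-1], 16)
        start = pos + LEN_DIGITS + 1
        if start + length > len(body):
            raise ValueError("STRING runs past the end of the body")
        strings.append(body[start:start + length])
        pos = start + length
    return strings
```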

Attributes

Attributes make the data sent between the RMS and proxy agent self-describing. Each attribute is composed of two parts: an attribute definition ID and the actual data. A unique attribute definition ID is assigned to each attribute definition. Attribute definitions are either pre-defined by the RMS or generated by the proxy agent during the DISCOVERY phase.

Attribute Definition

An attribute definition contains meta-data about the attribute. This meta-data includes:

ID: The attribute definition ID.

TYPE: The type of the attribute. Currently supported types are ARRAY, BOOLEAN, DATE, DOUBLE, ENUMERATED, INTEGER, and STRING.

NAME: The short name of the attribute. This is the name that is displayed in UI property views.

DESCRIPTION: A description of the attribute that is displayed when more information about the attribute is requested (e.g. tooltip popups.)

DEFAULT: The default value of the attribute.

The attribute definition is primarily used for data validation and displaying attributes in the user interface. There is currently no support for proxy agents to utilize the attribute definition information, although there is nothing to stop a proxy agent from validating against its own attribute definition information.

Attribute Value

An attribute value is always a key/value pair with an equals character (hex 3D) separating the key and value. The key is the attribute definition ID and the value is a string representation of the actual value of the attribute. All attributes support conversion to/from strings. It is assumed that once an attribute value is placed on the wire, it has been validated against the corresponding attribute definition.

An example attribute value is:

machineState=ALERT

In this example, the attribute definition ID is "machineState" and its value is "ALERT". Because "machineState" is an enumerated attribute type, ALERT must be a legal value, and the string "ALERT" can be converted to the actual enumerated value.

Like any other argument, the complete attribute value string is always converted to a STRING before being transmitted.

For example, the attribute value:

progArgs=-a 2 -b 4

would actually be transmitted as:

00000012:progArgs=-a 2 -b 4
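A sketch of building and parsing attribute value arguments. It splits at the first equals character only, assuming attribute definition IDs never contain one (the text does not state this), so that values such as progArgs may contain spaces or further equals characters. The function names are hypothetical.

```python
from typing import Tuple


def format_attribute(key: str, value: str) -> str:
    """Build key=value and wrap it in the STRING format for transmission."""
    if "=" in key:
        # Assumption: IDs without '=' keep the first '=' an unambiguous separator.
        raise ValueError("attribute definition IDs must not contain '='")
    text = f"{key}={value}"
    return f"{len(text):08X}:{text}"


def parse_attribute(arg: str) -> Tuple[str, str]:
    """Split an already-decoded attribute argument at the first '=' only."""
    key, sep, value = arg.partition("=")
    if not sep:
        raise ValueError(f"not an attribute value: {arg!r}")
    return key, value
```

The length 00000012 is hexadecimal: "progArgs=-a 2 -b 4" is 18 characters long.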

Protocol Phases

The proxy protocol is divided into a number of phases. A phase determines the legal commands that can be sent to the proxy agent. During a particular phase, illegal commands SHOULD be discarded. Note: this may be changed to SHOULD generate an ERROR event.

Phases follow a strict ordering. A transition from one phase to the next occurs when an OK event is received in response to a phase initiation command, which is the command that must be sent to initiate a particular phase. Once a phase has been initiated, any legal command for that phase may be sent. The phase ordering is defined as follows:

INITIALIZE -> DISCOVERY -> {NORMAL -> SUSPENDED}

The phases are defined in more detail in the following sections.

INITIALIZE

This is the first phase, and is used to initiate a communication session between the RMS and proxy agent, and agree on any protocol parameters that apply to this session.

Phase initiation command:

INIT

Legal commands:

none

DISCOVERY

The discovery phase is used to allow the proxy agent to inform the RMS of any dynamic property information. This information currently consists of attribute definitions and filter definitions which are described in more detail below.

Phase initiation command:

MODEL_DEF

Legal commands:

none

NORMAL

The normal phase is entered once the initialize and discovery phases are completed. This is the normal command/event processing phase.

Phase initiation command:

START_EVENTS

Legal commands:

SUBMIT_JOB
TERMINATE_JOB
MOVE_JOB
CHANGE_JOB
LIST_FILTERS
SET_FILTERS
STOP_EVENTS
QUIT

SUSPENDED

The suspended phase is used when the RMS needs to prevent the proxy agent from sending additional events.

Phase initiation command:

STOP_EVENTS

Legal commands:

START_EVENTS
QUIT
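The phase rules can be summarized as lookup tables. This sketch only answers whether a command may be sent in a given phase. It assumes the next phase's initiation command is accepted even where a phase lists no legal commands (otherwise INITIALIZE and DISCOVERY could never be left), and it leaves the timing of phase transitions to the caller. All names are hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet, Optional


class Phase(Enum):
    INITIALIZE = "INITIALIZE"
    DISCOVERY = "DISCOVERY"
    NORMAL = "NORMAL"
    SUSPENDED = "SUSPENDED"


# Phase initiation command for each phase, as listed in the phase descriptions.
INITIATION: Dict[str, Phase] = {
    "INIT": Phase.INITIALIZE,
    "MODEL_DEF": Phase.DISCOVERY,
    "START_EVENTS": Phase.NORMAL,
    "STOP_EVENTS": Phase.SUSPENDED,
}

# Commands listed as legal once each phase has been initiated.
LEGAL: Dict[Phase, FrozenSet[str]] = {
    Phase.INITIALIZE: frozenset(),
    Phase.DISCOVERY: frozenset(),
    Phase.NORMAL: frozenset({"SUBMIT_JOB", "TERMINATE_JOB", "MOVE_JOB", "CHANGE_JOB",
                             "LIST_FILTERS", "SET_FILTERS", "STOP_EVENTS", "QUIT"}),
    Phase.SUSPENDED: frozenset({"START_EVENTS", "QUIT"}),
}

# Assumption: the initiation command of the following phase is always accepted.
NEXT: Dict[Optional[Phase], str] = {
    None: "INIT",  # None means no INIT has been sent yet
    Phase.INITIALIZE: "MODEL_DEF",
    Phase.DISCOVERY: "START_EVENTS",
}


def is_legal(phase: Optional[Phase], command: str) -> bool:
    """Return True if `command` may be sent while in `phase`."""
    if NEXT.get(phase) == command:
        return True
    return phase is not None and command in LEGAL[phase]
```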

Phase Example

The following is a simple example of phase transitions. Commands go from left to right, and events from right to left. The TID is shown in parentheses after each command or event name.

-- initialize phase --
INIT(1)         ->
                <- OK(1)
-- discovery phase --
MODEL_DEF(2)    ->
                <- ATTR_DEF(2)
                <- ATTR_DEF(2)
                <- OK(2)
-- normal phase --
START_EVENTS(3) ->
                <- NEW_MACHINE(3)
                <- NEW_NODE(3)
STOP_EVENTS(4)  ->
                <- OK(3)
-- suspended phase --
START_EVENTS(5) ->
                <- OK(4)
-- normal phase --
                <- NEW_QUEUE(5)
QUIT(6)         ->
                <- OK(5)
                <- SHUTDOWN(6)

Note that the suspended phase is not entered until the OK event corresponding to the START_EVENTS command (tid 3) is received. Similarly, the second normal phase is not entered until the OK event corresponding to the STOP_EVENTS command (tid 4) is received.

Commands

Commands are formatted as simple ASCII text strings. A proxy command consists of a header and a body, separated by a space (hex 20), as follows:

COMMAND ⇒ COMMAND_HEADER " " COMMAND_BODY

The command header consists of three fixed length hexadecimal numbers separated by colons (hex 3A), and is therefore itself of fixed length (22 characters). The format of the header is:

COMMAND_HEADER ⇒ COMMAND_ID ":" TID ":" NUM_ARGS

where

COMMAND_ID is a 4 digit hexadecimal number containing the command to be performed

TID is an 8 digit hexadecimal number containing the transaction ID assigned to this command

NUM_ARGS is an 8 digit hexadecimal number containing the number of space separated elements in the command body

The command body consists of NUM_ARGS strings separated by spaces.

COMMAND_BODY ⇒ STRING { " " STRING }
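A sketch of command encoding following the grammar above. Omitting the trailing space when NUM_ARGS is zero is an assumption, chosen to match the QUIT format shown below; uppercase hex follows the 000A command ID of SET_FILTERS. `encode_command` is a hypothetical name.

```python
from typing import Sequence


def encode_string(s: str) -> str:
    """STRING format: 8 hex digits of length, a colon, then the characters."""
    return f"{len(s):08X}:{s}"


def encode_command(command_id: int, tid: int, args: Sequence[str]) -> str:
    """Build COMMAND_HEADER " " COMMAND_BODY, where the body holds NUM_ARGS STRINGs."""
    if not 0 <= command_id <= 0xFFFF:
        raise ValueError("COMMAND_ID must fit in 4 hex digits")
    if not 0 <= tid <= 0xFFFFFFFF:
        raise ValueError("TID must fit in 8 hex digits")
    header = f"{command_id:04X}:{tid:08X}:{len(args):08X}"
    if not args:
        return header  # assumption: no trailing space when NUM_ARGS is 0 (as in QUIT)
    return header + " " + " ".join(encode_string(a) for a in args)
```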

The following sections describe the currently defined commands.

QUIT

Message Format

QUIT_COMMAND ⇒ "0000:TID:00000000"

Description: Terminate the proxy agent. This command will cause the proxy agent to terminate as soon as possible.

Events: SHUTDOWN

INIT

Message Format

INIT_COMMAND ⇒ "0001:TID:00000002" " " VERSION " " BASE_ID

Description: Initialize proxy communication. After this command has been received, the proxy is ready to receive and process other commands from the RMS. Initialization data may be passed on the command line when the proxy is run.

Arguments

VERSION is the wire protocol version number.

BASE_ID is the base ID used by the proxy agent when allocating new element IDs.

Events: OK, ERROR
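On the proxy side, INIT handling could start by parsing the fixed-length header and the two STRING arguments. This is a minimal sketch; the function names, and the version string 2.0 and base ID 42 used below, are hypothetical.

```python
from typing import List, Tuple

HEADER_LEN = 22  # "CCCC:TTTTTTTT:NNNNNNNN"


def parse_command(text: str) -> Tuple[int, int, List[str]]:
    """Split a command into (COMMAND_ID, TID, argument strings)."""
    fields = text[:HEADER_LEN].split(":")
    if [len(f) for f in fields] != [4, 8, 8]:
        raise ValueError("malformed command header")
    command_id, tid, num_args = (int(f, 16) for f in fields)
    if num_args and text[HEADER_LEN:HEADER_LEN + 1] != " ":
        raise ValueError("missing space between header and body")
    body = text[HEADER_LEN + 1:]
    args: List[str] = []
    pos = 0
    for i in range(num_args):
        if i:
            if body[pos:pos + 1] != " ":
                raise ValueError(f"expected space before argument {i}")
            pos += 1
        length = int(body[pos:pos + 8], 16)
        start = pos + 9  # skip the 8 length digits and the colon
        if body[pos + 8:start] != ":" or start + length > len(body):
            raise ValueError(f"malformed argument {i}")
        args.append(body[start:start + length])
        pos = start + length
    return command_id, tid, args


def handle_init(text: str) -> Tuple[int, str, str]:
    """Validate an INIT command and return (tid, version, base_id)."""
    command_id, tid, args = parse_command(text)
    if command_id != 0x0001 or len(args) != 2:
        raise ValueError("not a well-formed INIT command")
    version, base_id = args
    return tid, version, base_id
```

For example, `handle_init("0001:00000001:00000002 00000003:2.0 00000002:42")` returns `(1, "2.0", "42")`, after which the proxy would reply with OK(1) and derive its element IDs from base ID 42.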

MODEL_DEF

Message Format

MODEL_DEF_COMMAND ⇒ "0002:TID:00000000"

Description: Start the proxy discovery phase. The proxy agent responds with a series of ATTR_DEF and FILTER_DEF events. Attributes (see Attributes) are meta-data describing data from the proxy agent that the RMS is expected to receive and possibly display in the UI.

Arguments

none

Events: ATTR_DEF, FILTER_DEF, OK, ERROR

START_EVENTS

Message Format

START_EVENTS_COMMAND ⇒ "0003:TID:00000000"

Description: Initiate normal event processing phase.

Arguments

none

Events: CHANGE_JOB, CHANGE_MACHINE, CHANGE_NODE, CHANGE_PROCESS, CHANGE_QUEUE, NEW_JOB, NEW_MACHINE, NEW_NODE, NEW_PROCESS, NEW_QUEUE, REMOVE_ALL, REMOVE_JOB, REMOVE_MACHINE, REMOVE_NODE, REMOVE_PROCESS, REMOVE_QUEUE, OK, ERROR

STOP_EVENTS

Message Format

STOP_EVENTS_COMMAND ⇒ "0004:TID:00000000"

Description: Suspend normal event processing phase.

Arguments

none

Events: OK, ERROR

SUBMIT_JOB

Message Format

SUBMIT_JOB_COMMAND ⇒ "0005:TID:NUM_ARGS" { " " JOB_ATTR }

Description: Submit a job to the resource manager for execution. The job submission ID is an RMS generated ID that is used to match the newly created job model element with the job submission. The proxy agent MUST include this attribute when the corresponding NEW_JOB event is sent to the RMS. Once this event has been transmitted, the job submission ID can be discarded.

Arguments

JOB_ATTR is a proxy specific job submission attribute. At least one of these attributes MUST be an attribute containing the job submission ID for the job.

Events: OK, ERROR
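A sketch of an RMS building a SUBMIT_JOB command. The text does not name the attribute that carries the job submission ID, so `jobSubId` below is a hypothetical placeholder; the other attributes are proxy specific, and the function name is also hypothetical.

```python
from typing import Dict


def encode_string(s: str) -> str:
    """STRING format: 8 hex digits of length, a colon, then the characters."""
    return f"{len(s):08X}:{s}"


# Hypothetical: the attribute definition ID that carries the job submission ID.
SUB_ID_KEY = "jobSubId"


def submit_job_command(tid: int, sub_id: str, attrs: Dict[str, str]) -> str:
    """SUBMIT_JOB: "0005:TID:NUM_ARGS" followed by one STRING per job attribute."""
    if SUB_ID_KEY in attrs:
        raise ValueError("pass the submission ID via sub_id, not attrs")
    # The submission ID attribute is mandatory, so it is always sent first here.
    all_attrs = {SUB_ID_KEY: sub_id, **attrs}
    args = [encode_string(f"{key}={value}") for key, value in all_attrs.items()]
    return f"0005:{tid:08X}:{len(args):08X} " + " ".join(args)
```

The proxy agent would later copy the same submission ID attribute into the NEW_JOB event, letting the RMS match the new job element to this submission.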

TERMINATE_JOB

Message Format

TERMINATE_JOB_COMMAND ⇒ "0006:TID:00000001" " " JOB_ID_ATTR

Description: Request the termination of an existing job. The meaning of 'termination' depends on the state of the job.

Arguments

JOB_ID_ATTR is a jobId attribute containing the model element ID of the job.

Events: OK, ERROR
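A sketch of TERMINATE_JOB encoding using the jobId attribute named above; the function name and the example element ID 105 are hypothetical.

```python
def encode_string(s: str) -> str:
    """STRING format: 8 hex digits of length, a colon, then the characters."""
    return f"{len(s):08X}:{s}"


def terminate_job_command(tid: int, job_id: str) -> str:
    """Build "0006:TID:00000001" " " JOB_ID_ATTR, where JOB_ID_ATTR is jobId=<element ID>."""
    return f"0006:{tid:08X}:00000001 " + encode_string(f"jobId={job_id}")
```

For example, `terminate_job_command(12, "105")` produces `0006:0000000C:00000001 00000009:jobId=105`.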

MOVE_JOB

Message Format

MOVE_JOB_COMMAND ⇒ "0007:TID:00000002" " " JOB_ID_ATTR " " QUEUE_ID_ATTR

Description: Not yet implemented. This command is intended to allow jobs to be moved between queues.

Arguments

JOB_ID_ATTR is a jobId attribute containing the model element ID of the job.

QUEUE_ID_ATTR is a queueId attribute containing the model element ID of the destination queue.

Events: OK, ERROR

CHANGE_JOB

Message Format

CHANGE_JOB_COMMAND ⇒ "0008:TID:NUM_ARGS" " " JOB_ID_ATTR { " " ATTR }

Description: Not yet implemented. This command is intended to allow a job's status to be changed (e.g. to place a hold on a job).

Arguments

JOB_ID_ATTR is a jobId attribute containing the model element ID of the job to change.

ATTR is a proxy agent specific job attribute to change.

Events: OK, ERROR

LIST_FILTERS

Message Format

LIST_FILTERS_COMMAND ⇒ "0009:TID:00000000"

Description: Not yet implemented. This command lists the filters that are currently enabled in the proxy agent.

Arguments

none

Events: OK, ERROR

SET_FILTERS

Message Format

SET_FILTERS_COMMAND ⇒ "000A:TID:NUM_ARGS" { " " ATTR }

Description: Not yet implemented. This command sets the filters in the proxy agent.

Arguments

ATTR is a filter attribute to set.

Events: OK, ERROR

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
