PTP/designs/rm view
Overview
This is a preliminary design for the PTP Resource Management system. It covers the first-phase product, which is limited in scope to viewing the state of the resource manager: the machines, jobs, queues, and nodes that are under the resource manager's control. The classes currently in PTP will implement these new interfaces in addition to their previously implemented interfaces.
File:Ptp resource manager view.pdf
Requirements
The Resource Management system is the Eclipse/PTP interface into a host's resource manager. The Resource Manager (RM) is responsible for determining the layout of the system represented by the host's resource manager (HRM). This includes determining which machines and nodes constitute the physical layout of the system, and their status. The RM is also responsible for determining the dynamic structure of the HRM, i.e. what Queues are available for the resource manager, and what Jobs are queued and running within those Queues. This dynamic structure also comprises Node allocation and Process information for each Job. Examples of HRMs are Torque, LSF, ORTE, and SLURM. Each of these will have a corresponding Eclipse/PTP RM.
The Resource Management system encompasses not only the interface into a host's resource manager, but also the specification of the environment with which to build and launch parallel programs. This environment may include aspects of building and running parallel jobs such as compiler settings and paths, e.g. LD_LIBRARY_PATH and include paths. This environment may be effected via module files.
The RM will consist of two parts. The first part will reside within the local Eclipse/PTP session. It will contain the interface presented to the user and maintain the local structures necessary to represent the physical and dynamic structure of the HRM. The second part will reside on the host, and be more intimately related to the HRM. This second portion will form a (usually remote) proxy to the HRM. The local RM will forward requests to the HRM and receive asynchronous responses back from the HRM.
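As a rough illustration of the asynchronous exchange described above, the local part could track each outstanding request by an id and complete it when the proxy's response arrives. The sketch below is only one possible shape under stated assumptions: `PendingRequests`, its method names, and the `String` payload are hypothetical and not part of the PTP design. It relies on the standard `java.util.concurrent.CompletableFuture`.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical local-side bookkeeping for requests forwarded to the proxy. */
final class PendingRequests {
    private final AtomicInteger nextId = new AtomicInteger();
    private final Map<Integer, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Registers a reply holder and returns the id to send along with the request. */
    int register(CompletableFuture<String> reply) {
        int id = nextId.incrementAndGet();
        pending.put(id, reply);
        return id;
    }

    /** Called when a response arrives; returns false if the id is not outstanding. */
    boolean complete(int id, String payload) {
        CompletableFuture<String> reply = pending.remove(id);
        return reply != null && reply.complete(payload);
    }

    /** Number of requests still waiting for a response. */
    int outstanding() {
        return pending.size();
    }
}
```

A caller would attach its continuation to the future (for example with `thenAccept`) before sending the request, so that the response can be handled whenever it arrives.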
In its role of specifying parallel launch environments, the RM may be sensitive to changes in the HRM's version. In order to shield the Eclipse/PTP system from changes in HRM versions, the local interface for the RM's portion of the launch configuration should consist of a set of typed attributes, to be filled by the user, determined by querying the proxy to the HRM. These typed attributes may include memory or time resource allocation limits, or anything else particular to the selected RM's HRM.
Other requirements:
- Terminated jobs persist
- Support for disconnect/reconnect to proxy
- No synchronization issues between model and proxy (i.e. they never get out of sync)
- Ability to register listeners on model objects (e.g. to detect when a job exits)
- If stdout capture is supported by the resource manager:
  - Ability to display stdout while connected to the proxy and the job is running
  - Preserve stdout while disconnected
  - Entire run's stdout preserved on terminated jobs
- Efficiently refer to objects in fixed sets
  - For communication between Eclipse and proxy
  - e.g. nodes in a machine, procs in a job
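One way to refer efficiently to members of a fixed set (such as nodes in a machine) is to identify each member by its index and to send compact index ranges between Eclipse and the proxy. The following is a minimal sketch of that idea, not the design's chosen encoding: `IndexSetCodec` and the `0-3,7` text format are assumptions made for illustration, built on the standard `java.util.BitSet`.

```java
import java.util.BitSet;

/** Hypothetical helper: compact range encoding of element indices in a fixed set. */
final class IndexSetCodec {
    private IndexSetCodec() {}

    /** Encodes the set bits as comma-separated ranges, e.g. {0,1,2,3,7} becomes "0-3,7". */
    static String encode(BitSet indices) {
        StringBuilder sb = new StringBuilder();
        int start = indices.nextSetBit(0);
        while (start >= 0) {
            int end = indices.nextClearBit(start) - 1; // last index of this run
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(start);
            if (end > start) {
                sb.append('-').append(end);
            }
            start = indices.nextSetBit(end + 1);
        }
        return sb.toString();
    }

    /** Decodes text produced by encode back into a BitSet. */
    static BitSet decode(String text) {
        BitSet indices = new BitSet();
        if (text.isEmpty()) {
            return indices;
        }
        for (String part : text.split(",")) {
            int dash = part.indexOf('-');
            if (dash < 0) {
                indices.set(Integer.parseInt(part));
            } else {
                int from = Integer.parseInt(part.substring(0, dash));
                int to = Integer.parseInt(part.substring(dash + 1));
                indices.set(from, to + 1); // upper bound of BitSet.set is exclusive
            }
        }
        return indices;
    }
}
```

With this encoding, a message about a contiguous block of nodes stays short regardless of how many nodes the block contains.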
Package rm.core
For interfaces and abstract classes, the responsibilities and collaborations refer to concrete objects that are implementations of the interface or abstract class.
Interface: IRMResourceManager
- Responsibilities
- Proxy used to connect to the ResourceManagerHost's actual resource manager (ARM).
- Retrieve lists of machines, nodes, jobs, processes, and queues from the ARM.
- Notify registered objects that the lists have changed, either in composition or in their elements' attributes, due to changes propagated from the ARM
- Collaborations
- IRMResourceManagerHost
- IRMResourceManagerListener
- RMNodesChangedEvent
- RMJobsChangedEvent
- RMQueuesChangedEvent
- RMMachinesChangedEvent
- RMStructureChangedEvent
- IRMMachine, IRMNode, IRMJob, IRMQueue
Abstract Class: ResourceManagerFactory
- Responsibilities
- Subclasses of the class are to create and load instances of IRMResourceManager
- Dispose of any resources acquired by factory objects
- Collaborations
- IRMResourceManager implementations
Class: ResourceManagerHost
- Responsibilities
- Determine which remote (or local) host's resource manager to proxy
- Determine which resource manager on the host to proxy
- Provide host's status
- Collaborations
- RMStatus
Interface: IRMMachine
- Responsibilities
- Provide the status information, i.e. attributes, for the ARM's associated machine
- Set and provide specific attributes for a given attribute description
- List all nodes associated with ARM's machine
- Provide machine's status
- Collaborations
- IAttribute
- IAttrDesc
- RMStatus
Interface: IRMQueue
- Responsibilities
- Provide the status information, i.e. attributes, for the ARM's associated queue
- Set and provide specific attributes for a given attribute description
- List all nodes that may have jobs dispatched from this queue
- Provide queue's status
- Collaborations
- IAttribute
- IAttrDesc
- RMStatus
Interface: IRMNode
- Responsibilities
- Provide the status information, i.e. attributes, for the ARM's associated node
- Set and provide specific attributes for a given attribute description
- List all jobs associated with ARM's node
- List all queues that can run jobs on this node
- Provide node's status
- Collaborations
- IAttribute
- IAttrDesc
- RMStatus
Interface: IRMJob
- Responsibilities
- Provide the status information, i.e. attributes, for the ARM's associated job
- Set and provide specific attributes for a given attribute description
- List all processes associated with ARM's job
- Provide job's status
- Collaborations
- IAttribute
- IAttrDesc
- RMJobStatus
Interface: IRMProcess
- Responsibilities
- Provide the status information, i.e. attributes, for the ARM's associated process
- Set and provide specific attributes for a given attribute description
- Provide node on which the process runs
- Collaborations
- IAttribute
- IAttrDesc
Enumeration: RMStatus
- Responsibilities
- Provide consistent labeling of element status
- OK: element is up and able to accept jobs, etc.
- DOWN: element is down; the reason must be provided in other attributes
- UNAVAILABLE: element is unable to accept jobs, etc.; the reason must be provided in other attributes
- ALLOCATED_OTHER: element is up but unable to accept jobs due to allocations by other users
- UNKNOWN: the status is unknown
- Collaborations
- ResourceManagerHost, IRMMachine, IRMNode, IRMQueue
Enumeration: RMJobStatus
- Responsibilities
- Provide consistent labeling of job status
- PENDING: job is pending in queue
- RUNNING: job is running normally
- SUSPENDED: job is suspended; the reason must be provided in other attributes
- DONE: job has completed normally
- EXIT: job has completed abnormally; the reason must be provided in other attributes
- UNKNOWN: job status is unknown
- Collaborations
- IRMJob
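The two enumerations map naturally onto Java `enum` types. The sketch below uses the constant names listed above; the helper methods `canAcceptJobs` and `isTerminated` are assumptions added for illustration and are derived only from the descriptions of each constant.

```java
/** Status of a host, machine, node, or queue, using the labels listed above. */
enum RMStatus {
    OK, DOWN, UNAVAILABLE, ALLOCATED_OTHER, UNKNOWN;

    /** Assumed helper: per the descriptions, only OK elements can accept jobs. */
    boolean canAcceptJobs() {
        return this == OK;
    }
}

/** Status of a job, using the labels listed above. */
enum RMJobStatus {
    PENDING, RUNNING, SUSPENDED, DONE, EXIT, UNKNOWN;

    /** Assumed helper: DONE and EXIT both mean the job has completed. */
    boolean isTerminated() {
        return this == DONE || this == EXIT;
    }
}
```

Because the reason for DOWN, UNAVAILABLE, SUSPENDED, and EXIT lives in other attributes, the enums themselves stay free of any explanatory fields.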
Package rm.events
Interface: IRMResourceManagerListener
- Responsibilities
- Registration site for Observer pattern to allow objects to be notified of changes in the IRMResourceManager's state
- Collaborations
- IRMResourceManager
- RMStructureChangedEvent, RMMachinesChangedEvent, RMNodesChangedEvent, RMJobsChangedEvent, RMQueuesChangedEvent
Abstract Class: ResourceManagerEvent
- Responsibilities
- Determine the type of change in the IRMResourceManager's state
- The type can be ADDED, MODIFIED, or REMOVED
- Collaborations
- IRMResourceManager
- RMStructureChangedEvent, RMNodesChangedEvent, RMJobsChangedEvent, RMQueuesChangedEvent, RMMachinesChangedEvent
Class: RMStructureChangedEvent
- Superclass
- ResourceManagerEvent
- Responsibilities
- Event created when the ARM has had major structure changes (table columns may need to be recreated)
- Collaborations
- none
Class: RMNodesChangedEvent
- Superclass
- ResourceManagerEvent
- Responsibilities
- Event created when the ARM has added, modified, or removed nodes
- Collaborations
- IRMNode
Class: RMJobsChangedEvent
- Superclass
- ResourceManagerEvent
- Responsibilities
- Event created when the ARM has added, modified, or removed jobs
- Collaborations
- IRMJob
Class: RMMachinesChangedEvent
- Superclass
- ResourceManagerEvent
- Responsibilities
- Event created when the ARM has added, modified, or removed machines
- Collaborations
- IRMMachine
Class: RMQueuesChangedEvent
- Superclass
- ResourceManagerEvent
- Responsibilities
- Event created when the ARM has added, modified, or removed queues
- Collaborations
- IRMQueue
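The event classes and listener interface above follow the Observer pattern. The sketch below shows one possible shape for a single event kind (nodes). The class and interface names come from this page, but the method names, the nested `Type` enum, the `RMListenerList` helper, and the use of node names in place of IRMNode objects are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Base event carrying the kind of change: ADDED, MODIFIED, or REMOVED. */
abstract class ResourceManagerEvent {
    enum Type { ADDED, MODIFIED, REMOVED }

    private final Type type;

    protected ResourceManagerEvent(Type type) {
        this.type = type;
    }

    Type getType() {
        return type;
    }
}

/** Nodes event; plain node names stand in for IRMNode objects in this sketch. */
final class RMNodesChangedEvent extends ResourceManagerEvent {
    private final List<String> nodeNames;

    RMNodesChangedEvent(Type type, List<String> nodeNames) {
        super(type);
        this.nodeNames = Collections.unmodifiableList(new ArrayList<>(nodeNames));
    }

    List<String> getNodeNames() {
        return nodeNames;
    }
}

/** Listener with a single assumed callback; the real interface would cover every event kind. */
interface IRMResourceManagerListener {
    void nodesChanged(RMNodesChangedEvent event);
}

/** Hypothetical helper that a resource manager could use to manage its listeners. */
final class RMListenerList {
    private final List<IRMResourceManagerListener> listeners = new CopyOnWriteArrayList<>();

    void add(IRMResourceManagerListener listener) {
        listeners.add(listener);
    }

    void remove(IRMResourceManagerListener listener) {
        listeners.remove(listener);
    }

    void fireNodesChanged(RMNodesChangedEvent event) {
        for (IRMResourceManagerListener listener : listeners) {
            listener.nodesChanged(event);
        }
    }
}
```

`CopyOnWriteArrayList` is used here so that a listener may unregister itself from within a callback without disturbing the iteration; whether that trade-off suits the real model is a separate design decision.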
Package rm.attributes
Interface: IAttribute
- Responsibilities
- Maintain the relationship between an attribute's value and its description
- Specify a strict weak ordering of itself relative to other attributes
- Provide a string representation of the attribute
- Collaborations
- IAttrDesc
Interface: IAttrDesc
- Responsibilities
- Provide a string description of the attribute
- Provide a name of the attribute
- Know the actual type of the attribute
- Create new attributes of the correct type
- Collaborations
- IAttribute
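The relationship between IAttribute and IAttrDesc can be sketched as a pair of interfaces in which the description acts as a factory for typed attributes. This is only an illustration under stated assumptions: the method names, the use of generics, and the `IntAttrDesc` class are hypothetical and are not taken from the PTP sources.

```java
/** Description of an attribute: name, text description, value type, and factory method. */
interface IAttrDesc<T extends Comparable<T>> {
    String getName();
    String getDescription();
    Class<T> getType();
    /** Creates a new attribute of the correct type from its string form. */
    IAttribute<T> create(String text);
}

/** A value tied to its description and ordered relative to other attributes of the same type. */
interface IAttribute<T extends Comparable<T>> extends Comparable<IAttribute<T>> {
    IAttrDesc<T> getDescription();
    T getValue();
    String getStringRep();
}

/** Hypothetical description for integer-valued attributes. */
final class IntAttrDesc implements IAttrDesc<Integer> {
    private final String name;
    private final String description;

    IntAttrDesc(String name, String description) {
        this.name = name;
        this.description = description;
    }

    public String getName() { return name; }
    public String getDescription() { return description; }
    public Class<Integer> getType() { return Integer.class; }

    public IAttribute<Integer> create(String text) {
        final int value = Integer.parseInt(text.trim());
        final IAttrDesc<Integer> desc = this;
        return new IAttribute<Integer>() {
            public IAttrDesc<Integer> getDescription() { return desc; }
            public Integer getValue() { return value; }
            public String getStringRep() { return Integer.toString(value); }
            public int compareTo(IAttribute<Integer> other) {
                return Integer.compare(value, other.getValue());
            }
        };
    }
}
```

Ordering attributes by their typed value rather than by their string form means that, for example, 4 sorts before 16 in a table column.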