PTP/designs/new sdm

Overview

This document describes the changes to the scalable debug manager that are proposed for the PTP 2.1 release. The major goal for this release is to remove the dependency on MPI (and a specific MPI implementation), plus provide a number of other improvements that will make the SDM more generic and portable. These changes are required to support the new resource manager model that interacts with parallel systems using direct command-line execution rather than (or in addition to) the proxy agents.

Requirements

Remove OpenMPI startup dependency

The current version of the SDM requires OpenMPI support for startup. The SDM is an MPI program that debugs MPI application programs. When starting an application program under debugger control, the SDM forks a copy of the backend debugger, which in turn forks an application process. A modified environment is passed to the application to enable it to communicate correctly with the OpenMPI runtime system. The environment values are obtained by querying the OpenMPI runtime through internal interfaces which will not be supported in OpenMPI versions later than 1.2.x. Also, the MPI specification states that forking may have undefined behavior. This requirement will ensure that the SDM can be started by any runtime system.

Remove MPI communication

The SDM is currently an MPI program. This introduces problems when starting the debugger and the application being debugged, as described above, but also causes the debugger to have a dependency on an MPI runtime system for communication. There are scenarios where it should be possible to debug a parallel non-MPI application without requiring an MPI installation. This requirement will remove the MPI dependency from the SDM.

Allow communication infrastructure to be pluggable

The SDM relies on only a small number of communication primitives for operation. In order to take advantage of scalable communication infrastructure, such as that proposed by the STCI project, the SDM should be capable of utilizing whatever communication primitives are available on the host system. This requirement will provide the SDM with a pluggable communications interface, and a default implementation that supports socket-based TCP/IP communication.

Remove protocol dependency

Currently the SDM infrastructure has full knowledge of the protocol that it is managing. Since the SDM is already separated into frontend/backend operation, it should be possible to use the SDM to route any protocol to/from the backend components without specific knowledge of the protocol. In addition, it should be possible to separate the routing and functional components of the protocol in order to support different routing policies. This requirement will provide an abstraction layer for both the routing and functional components of the protocol.

Support for I/O Forwarding

The current SDM implementation does not support forwarding of I/O streams (stdin/stdout/stderr) between the debugger front-end and the application. In particular, output generated from the application program is lost when it is being controlled by the debugger, and there is no mechanism for supplying terminal input to the application. This requirement will add a generic mechanism for I/O forwarding.

High Level Design

Startup

The communications network comprises a master process and a number of server processes. To debug an N process application, N+1 debugger processes are started (1 master and N server processes). SDM startup occurs in two phases: the master process is started first, then the server processes are started. It is assumed that a debugger front-end (PTP or some other tool) is responsible for starting the SDM processes using whatever process launching mechanism is available on the target system. For example, the front-end might use an MPI runtime system (such as the "mpirun" command, or equivalent), or a launch command (such as the XCPU "xrx" command), to perform the launch. If neither of these alternatives is available, the front-end will need to provide its own launching mechanism using whatever is available (e.g. ssh).

Master Process Launch

The master process could be located anywhere, but since it needs to be able to communicate with both the debugger front-end and the server processes, it will normally be launched on the system login node (the location specified in the resource manager configuration). When it is launched, the master process is passed a command-line argument specifying an arbitrary connection string. The master process then calls sdm_init() and passes this connection string as a parameter, and then calls sdm_connect(SDM_FRONTEND) to establish the connection. If the call to sdm_connect() is successful, the master will repeatedly call sdm_progress() to process incoming messages from the front-end.

The default implementation of sdm_init() requires a connection string that specifies the TCP/IP address of the front-end and the port number to connect to. It will attempt to connect to this address a predetermined number of times, returning an error if the connection is unsuccessful.
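The master launch sequence described above (initialize, connect to the front-end, then loop on progress) can be sketched as follows. This is a minimal illustration only: the sketch_comm_ops table, its field names, the fake backend, and the convention that the progress function returns nonzero to stop are all assumptions made for the example and are not part of the proposed API, in which sdm_progress() returns void.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of pluggable communication operations. */
struct sketch_comm_ops {
    int  (*init)(int argc, char *argv[]);
    int  (*connect_frontend)(void);
    int  (*progress)(void);   /* assumption: 0 = keep going, nonzero = stop */
    void (*finalize)(void);
};

/* Master control flow: init, connect to the front-end, then progress. */
int sketch_master_main(const struct sketch_comm_ops *ops, int argc, char *argv[])
{
    assert(ops != NULL);

    if (ops->init(argc, argv) < 0)
        return -1;
    if (ops->connect_frontend() < 0) {
        ops->finalize();
        return -1;
    }
    while (ops->progress() == 0) {
        /* each call processes any pending front-end messages */
    }
    ops->finalize();
    return 0;
}

/* A fake backend, used only to exercise the loop above. */
int sketch_fake_calls = 0;

static int fake_init(int argc, char *argv[])
{
    (void)argv;
    return argc > 1 ? 0 : -1;   /* require a connection string argument */
}
static int  fake_connect(void)  { return 0; }
static int  fake_progress(void) { return ++sketch_fake_calls >= 3; }
static void fake_finalize(void) { }

const struct sketch_comm_ops sketch_fake_ops = {
    fake_init, fake_connect, fake_progress, fake_finalize
};
```

Keeping the operations behind a table of function pointers is one way the same master loop could run over sockets, MPI, or another transport without change.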

Server Process Launch

The server processes will be controlling the application, so they need to be located on the same nodes as the application processes. It is assumed that the server processes will be launched by the same runtime system that is used to launch normal applications (typically MPI). Server processes are passed an arbitrary connection string as a command-line argument. Each server process will call sdm_init() and pass this connection string as a parameter. If the call to sdm_init() is successful, the server can obtain an ID that is guaranteed to be unique across all server processes by calling sdm_get_my_id(). The server will then repeatedly call sdm_progress() to process incoming messages from the parent servers.

In the default implementation of sdm_init() the connection string will comprise a random non-privileged TCP/IP port number. Each server will attempt to bind to this port number. If the port number is in use, then the server will increment the port number and try to bind again. This will be repeated until the server finds an available port number. Once the server processes are started, they will wait for incoming connections on the port. The unique ID will be obtained via an environment variable that is passed to the server process by the runtime system.
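The bind-and-increment behavior described above might look like the following sketch using POSIX sockets. The function name sketch_bind_from and the max_tries bound are assumptions added for the example; the text above describes retrying until a free port is found and does not specify a limit.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a listening TCP socket to the first free port
 * at or above start_port. Returns the socket descriptor and stores the
 * port actually bound in *bound_port, or returns -1 if none was found.
 */
int sketch_bind_from(int start_port, int max_tries, int *bound_port)
{
    assert(bound_port != NULL);

    for (int i = 0; i < max_tries; i++) {
        int port = start_port + i;
        if (port > 65535)
            break;

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons((unsigned short)port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
            listen(fd, 8) == 0) {
            *bound_port = port;
            return fd;
        }
        close(fd);   /* port in use (or other error): try the next one */
    }
    return -1;
}
```

Because several servers may share a node, the port a server ends up on can differ from the one in the connection string, which is why the parent also increments the port when connecting.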

Initialization

As soon as it starts, the master connects to the front-end. This serves to notify the front-end that the debugger is ready for initialization. The initialization process then proceeds as follows:

The front-end waits until it has received notification from the runtime that all N server processes have started.
The front-end builds a routing table. The routing table consists of N entries that are indexed by the server ID (not the Unix PID). Each entry consists of the address of the host on which the server process is located.
The first command sent by the front-end to the master is a global initialization command that supplies the routing table to all processes.
The master process establishes connections to its immediate children by calling sdm_connect().
The master process then sends the initialization command to the children by calling sdm_send().
The master's child server processes will receive a connection followed by an initialization message.
The server processes will establish connections to their immediate children by calling sdm_connect() and then forward the initialization message by calling sdm_send().
This process will repeat until all servers have been initialized.
Each server process will generate an acknowledgment event and pass it back to its parent by calling sdm_send().
The front-end will eventually receive an aggregated event indicating all servers are initialized, or that some kind of error occurred (for example, a server could not be contacted).
Assuming the front-end initialization was successful, the front-end will finish the initialization phase.

In the default implementation, the routing layer will compute each process's location in a binomial tree. The ID of each child process will then be used to index the routing table in order to find the IP address of the node on which the child is located. A connection will be attempted on the port number that was previously supplied to the server processes. If unsuccessful, the port number will be incremented and another connection will be attempted after a delay period. This will continue for a predetermined number of times, or until a successful connection is established.
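One common way to lay out a binomial tree is shown in the sketch below. It assumes tree ranks 0..size-1 with rank 0 as the root (the master); how server IDs map onto ranks (for example, server ID k at rank k+1) is an assumption of this example and is not specified by the design. The function names are hypothetical.

```c
#include <assert.h>

/* Parent of a rank: clear its lowest set bit. The root (rank 0) has none. */
int sketch_parent(int rank)
{
    assert(rank >= 0);
    if (rank == 0)
        return -1;
    return rank & (rank - 1);
}

/*
 * Children of a rank in a tree of `size` ranks: rank + m for every power
 * of two m below the rank's lowest set bit (below the tree span for the
 * root). Writes at most max_out children to out and returns the count.
 */
int sketch_children(int rank, int size, int *out, int max_out)
{
    int limit;
    int count = 0;

    assert(rank >= 0 && rank < size);

    if (rank == 0) {
        limit = 1;
        while (limit < size)
            limit <<= 1;          /* smallest power of two >= size */
    } else {
        limit = rank & -rank;     /* lowest set bit of rank */
    }

    for (int mask = limit >> 1; mask > 0; mask >>= 1) {
        int child = rank + mask;
        if (child < size && count < max_out)
            out[count++] = child;
    }
    return count;
}
```

With 8 ranks, the root's children are 4, 2 and 1, rank 4's children are 6 and 5, and rank 6's only child is 7, so a message reaches every rank in at most three hops.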

Operation

Once the front-end has received confirmation that the servers have been successfully started, it will enter normal operating mode. This mode starts by notifying the debug servers about the application they will be debugging, and then continuously processes commands initiated by user interaction, and events that are generated as a result of the commands.

Front-end debugger operation consists of the following:

The PDI method startDebugger() is called to pass the application executable name and arguments to the server processes. Depending on the startup mode, this will also indicate that the debugger should attach to existing application processes. If attach mode is enabled, the command will also include information necessary for the server processes to attach to the application processes (e.g. address and PID).
If the "Stop at main() on startup" option is selected, setFunctionBreakpoint() will be called to insert a breakpoint in main(), then start() will be called to start the application execution.
- The event generated as a result of the breakpoint being reached will be sent back to the front-end to indicate that operation can continue.
The debugger will now forward commands and process events normally.

The SDM master and server processes repeatedly call sdm_progress() in order to progress messages via callbacks. When a message is received, the callback sdm_recv_callback() is invoked. The callback is passed the message as a parameter. The source of the message can be obtained by querying the message.

When a debugger command is received from the parent, it is immediately forwarded to the children using sdm_send(). When an event is received from a child, sdm_aggregate() is called to perform event aggregation, then sdm_send() to send the aggregated event to the parent.

Server processes also call sdm_process_message() each time they receive a command. This will perform the command locally if the process is in the set of processes specified by the command header. The resulting event is passed to sdm_aggregate() to be included in the event aggregation for the operation.

In the default implementation, sdm_progress() will check socket file descriptors for any available data. Data will be collected into complete messages and passed to the callback functions. The sdm_send() function will send messages on the file descriptors corresponding to the IDs in the message destination. The sdm_process_message() function will implement the proxy debug protocol and call stub handler routines in the backend debug interface. The sdm_aggregate() function will use the existing SDM hashing function for aggregation.
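The idea of hash-based event aggregation can be illustrated with the sketch below: events with identical bodies are merged into one group, and the group records which servers reported them. This is not the existing SDM hashing function; the FNV-1a hash, the structure names, the 64-server limit, and the fixed-size buffers are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_GROUPS 16

struct sketch_group {
    uint32_t hash;       /* hash of the event body */
    uint64_t sources;    /* bit i set => server i reported this event */
    char     body[64];   /* copy of the event body */
};

struct sketch_aggregator {
    struct sketch_group groups[SKETCH_MAX_GROUPS];
    int ngroups;
};

/* FNV-1a, used here only as a stand-in hash function. */
static uint32_t sketch_hash(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Add an event from `source`. Identical events are merged into a single
 * group. Returns the group index, or -1 if the event cannot be stored.
 */
int sketch_aggregate_add(struct sketch_aggregator *agg, int source,
                         const char *body)
{
    assert(agg != NULL && body != NULL);
    assert(source >= 0 && source < 64);

    uint32_t h = sketch_hash(body);
    for (int i = 0; i < agg->ngroups; i++) {
        struct sketch_group *g = &agg->groups[i];
        if (g->hash == h && strcmp(g->body, body) == 0) {
            g->sources |= (uint64_t)1 << source;
            return i;
        }
    }

    if (agg->ngroups == SKETCH_MAX_GROUPS ||
        strlen(body) >= sizeof(agg->groups[0].body))
        return -1;

    struct sketch_group *g = &agg->groups[agg->ngroups];
    g->hash = h;
    g->sources = (uint64_t)1 << source;
    strcpy(g->body, body);
    return agg->ngroups++;
}
```

Comparing the hash first makes the common case (many servers reporting the same event) cheap, while the string comparison guards against hash collisions.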

I/O Forwarding

I/O forwarding will be supported using an out-of-band mechanism. The assumption will be that I/O forwarding is supported by the underlying runtime system for a normal application process.

There are two I/O forwarding scenarios:

Application launched under debugger control. In this case the I/O for each application process will be managed by the servers. For stdout/stderr, a server will obtain output from the application process using whatever mechanism is provided by the backend debug engine. The server will then forward this output to its corresponding output stream. This output will be sent via the runtime OOB mechanism to the resource manager, which will pass the output to the UI. Input data will be sent from the UI to the resource manager, which will pass the data to the servers using the runtime system. The servers will then forward the input to the application process using the backend debug engine.
Debugger attaches to application. In this case the I/O should already be forwarded by the runtime system OOB mechanism. The servers will know that they attached to existing processes and will not try to initiate any I/O forwarding operations.

The GDB/MI interface used by the default backend implementation does not provide a mechanism for I/O forwarding. Instead the backend must create a pseudo terminal and pass the slave side of the pseudo terminal to GDB using the -tty argument. The master side of the pseudo terminal can then be used to send and receive the target process stdio. A minor modification to the existing backend implementation will be required to forward this I/O to the SDM stdio streams.
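Creating the pseudo terminal could be done with the standard POSIX calls, as in the sketch below. The function name sketch_open_pty is hypothetical, and how the backend actually launches GDB is not shown; the returned slave device name is what would be passed to GDB with the -tty argument.

```c
#define _XOPEN_SOURCE 600   /* for posix_openpt(), grantpt(), ptsname() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a pseudo terminal. Returns the master side
 * descriptor and copies the slave device name into slave_name, or
 * returns -1 on failure.
 */
int sketch_open_pty(char *slave_name, size_t len)
{
    assert(slave_name != NULL && len > 0);

    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;

    /* Set slave ownership/permissions and unlock it before use. */
    if (grantpt(master) != 0 || unlockpt(master) != 0) {
        close(master);
        return -1;
    }

    const char *name = ptsname(master);
    if (name == NULL || strlen(name) >= len) {
        close(master);
        return -1;
    }
    strcpy(slave_name, name);
    return master;
}
```

The backend would then read target output from, and write target input to, the returned master descriptor.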

Low Level Items

Implementation Changes

It is suggested that the implementation proceed in the phases described in the following sections.

Phase 1

This phase will comprise implementation of the API (described below) and refactoring the existing SDM to use the new API. This will consist of:

Replacing the runtime_* functions in src/client/startup.c, src/client/client_svr.c, and src/server/server.c with the new APIs.
Changing the explicit routing operations in src/client/client_svr.c to use the new routing API.
Refactoring the MPI specific code in src/utils/runtime.c to implement the new APIs.
Adding the I/O forwarding API calls and providing a null implementation.

At this point the SDM can be regression tested to ensure that it is still functioning correctly.

Phase 2

This phase will be used to implement the protocol changes and the associated regression testing.

Phase 3

This phase will be used to develop a new implementation of the APIs that removes the MPI dependencies completely. This will consist of providing:

An implementation of sdm_init() that uses the connection string to specify the TCP/IP address of the front-end and port number, and that attempts to connect to this address a predetermined number of times, returning an error if the connection is unsuccessful.
An implementation of sdm_init() that uses the connection string to specify a non-privileged TCP/IP port number, and that attempts to bind to this port number. If the port number is in use, then the port number will be incremented by one and an attempt to bind will be made again. This will be repeated until an available port number is found. The unique ID will be obtained via an environment variable passed during startup.
An implementation of the routing layer that computes each process's location in a binomial tree.
An implementation of sdm_connect() that attempts to connect to the port number that was previously supplied. If unsuccessful, the port number will be incremented and another connection will be attempted after a delay period. This will continue for a predetermined number of times, or until a successful connection is established.
An implementation of sdm_progress() that will check socket file descriptors for any available data. Data will be collected into complete messages and passed to the callback functions.
An implementation of sdm_send() that sends messages on the file descriptors corresponding to the IDs in the message destination.
An implementation of sdm_process_message() that implements the proxy debug protocol and calls stub handler routines in the backend debug interface.
An implementation of sdm_aggregate() that uses the existing SDM hashing function for aggregation.
Modification of the front-end to supply the routing information.

At this point the SDM can be regression tested to ensure that it is still functioning correctly.

Phase 4

This phase will provide an implementation of the I/O forwarding functions and regression testing to ensure that the SDM still functions correctly.

Proposed API

Data Types

sdm_idset: Defines an arbitrary set of processes. An sdm_idset is used to represent the source(s) or destination(s) of a message. The implementation will be efficient for large numbers of processes (e.g. O(1M)).

sdm_message: A message that is being routed by the SDM infrastructure. A message contains routing information (that is accessible as an sdm_idset) and an opaque message body. Messages can have multiple sources and destinations. A message with multiple sources means that an identical message was sent by each source, and these have been aggregated into a single message.

sdm_aggregate: Defines an arbitrary aggregation element. This element is included in all messages. For downstream messages, it can contain information needed to initialize the aggregation process, such as a timeout value. For upstream messages, it can contain information necessary to perform the message aggregation, such as a hash.

Startup/Initialization

int sdm_init([in] int argc, char *argv[]): Initialize the SDM services. The argc and argv parameters provide information about how to establish a connection with the front-end or master process if this is being called from the master or a server process respectively. Returns -1 if initialization fails. In the default implementation, the master will specify the TCP/IP address of the front-end and port number to connect to. Servers will specify a random non-privileged TCP/IP port number. Each server will attempt to bind to this port number. If the port number is in use, then the server will increment the port number and try to bind again. This will be repeated until the server finds an available port number.

int sdm_connect([in] sdm_idset ids): Establishes communication with other SDM processes using the supplied ID information. If ids is SDM_FRONTEND this function will attempt to connect to the front-end, otherwise the IDs will be assumed to be children. Note that the number and location of children is established by the routing layer. Returns -1 if connection fails. In the default implementation, a connection will be attempted to each child using the port number that was previously supplied to sdm_init(). If unsuccessful, the port number will be incremented and another connection will be attempted after a delay period. This will continue for a predetermined number of times, or until a successful connection is established. The master will attempt to connect to the address and port supplied to the sdm_init() function a predetermined number of times, returning an error if the connection is unsuccessful.

void sdm_finalize(void): Finalize the SDM services. Called when the SDM is about to exit to perform cleanup. No other SDM functions can be used after this call.

void sdm_progress(void): Global progress routine. This must be called periodically to ensure that the SDM services progress.

Communication Primitives

int sdm_message_init([in] int argc, char *argv[]): Initialize the SDM message services. NOTE: this should not be called directly, only sdm_init() should be used. The argc and argv parameters provide message layer-specific information. Returns -1 if initialization fails.

void sdm_message_finalize(void): Finalize the SDM message services. NOTE: this should not be called directly, only sdm_finalize() should be used.

void sdm_message_progress(void): Progress the SDM message services. NOTE: this should not be called directly, only sdm_progress() should be used.

sdm_message sdm_message_new([in] char *buf, [in] int len): Create a new message with payload buf and length len.

void sdm_message_free([in] sdm_message msg): Free all resources associated with a message, including the payload.

sdm_idset sdm_message_get_destination([in] sdm_message msg): Return the destination of this message.

sdm_idset sdm_message_get_source([in] sdm_message msg): Return the source of this message.

sdm_aggregate sdm_message_get_aggregate([in] sdm_message msg): Return the aggregate information for this message.

int sdm_message_get_payload([in] sdm_message msg, [out] char **buf, [out] int *len): Gets the payload contained in a message. On return, buf points to the payload and len contains its length in bytes.

int sdm_message_send([in] sdm_message msg): Send the message pointed to by msg to one or more destinations. This is an asynchronous operation. The operation is completed when the message has been sent to all destinations, and completion will be indicated by invoking the send callback set with sdm_message_set_send_callback().

int sdm_message_payload_deliver([in] sdm_message msg): Deliver the payload contained in the message msg.

void sdm_message_set_send_callback([in] sdm_message msg, [in] void (*callback)([in] sdm_message msg)): Set the send callback function for this message.

void sdm_message_set_recv_callback([in] void (*callback)(sdm_message msg)): Set the receive callback function. The callback will be invoked each time a message is received.

void sdm_message_set_payload_callback([in] void (*callback)(char *buf, int len)): Set the payload callback function. The callback function will be invoked when sdm_message_payload_deliver() is called.

void (*sdm_send_callback)([in] sdm_message msg): This callback function is invoked when a message has been successfully sent.

void (*sdm_recv_callback)([in] sdm_message msg): This callback function is invoked when a message arrives. The parameter msg will contain the message.

void (*sdm_payload_callback)([in] char *buf, [in] int len): This callback function is invoked when a payload is delivered.
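The relationship between a message, its payload, and the payload callback can be sketched as follows. The sketch_ names are hypothetical stand-ins for the primitives listed above, the message here carries only a payload (no routing or aggregation data), and copying the payload on creation is an assumption about ownership that the API description does not settle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified message: payload only. */
struct sketch_message {
    char *buf;
    int   len;
};

typedef void (*sketch_payload_cb)(char *buf, int len);

static sketch_payload_cb sketch_cb = NULL;

/* Create a message holding a private copy of the payload. */
struct sketch_message *sketch_message_new(const char *buf, int len)
{
    assert(buf != NULL && len >= 0);

    struct sketch_message *msg = malloc(sizeof(*msg));
    if (msg == NULL)
        return NULL;
    msg->buf = malloc(len > 0 ? (size_t)len : 1);
    if (msg->buf == NULL) {
        free(msg);
        return NULL;
    }
    memcpy(msg->buf, buf, (size_t)len);
    msg->len = len;
    return msg;
}

/* Free the message, including its payload. */
void sketch_message_free(struct sketch_message *msg)
{
    if (msg != NULL) {
        free(msg->buf);
        free(msg);
    }
}

void sketch_set_payload_callback(sketch_payload_cb cb)
{
    sketch_cb = cb;
}

/* Hand the payload to the registered callback; -1 if none is set. */
int sketch_payload_deliver(struct sketch_message *msg)
{
    assert(msg != NULL);
    if (sketch_cb == NULL)
        return -1;
    sketch_cb(msg->buf, msg->len);
    return 0;
}

/* Example callback: remember what was delivered. */
int  sketch_last_len = -1;
char sketch_last_first = '\0';

void sketch_record_payload(char *buf, int len)
{
    sketch_last_len = len;
    sketch_last_first = len > 0 ? buf[0] : '\0';
}
```

Separating delivery from receipt in this way lets the routing code forward a message without looking at the payload, and lets only the processes addressed by the message interpret it.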

Set Operations

sdm_idset sdm_set_new(void): Create a new, empty, set.

void sdm_set_free([in] sdm_idset set): Free resources associated with a set.

sdm_idset sdm_set_clear([in] sdm_idset set): Clear all elements from the set. Returns the empty set.

int sdm_set_size([in] sdm_idset set): Return the number of elements in the set.

sdm_idset sdm_set_add_element([in] sdm_idset set, [in] sdm_id id): Add an element to the set. Returns the set containing the element.

sdm_idset sdm_set_remove_element([in] sdm_idset set, [in] sdm_id id): Remove an element from the set. Returns the set with the element removed.

sdm_idset sdm_set_add_all([in] sdm_idset set, [in] sdm_id id): Add all elements in the range [0, id] to the set. Returns the set containing the elements.

int sdm_set_is_subset([in] sdm_idset set1, [in] sdm_idset set2): Return true if set1 is a subset of set2.

int sdm_set_is_empty([in] sdm_idset set): Return true if set is empty.

int sdm_set_compare([in] sdm_idset set1, [in] sdm_idset set2): Compare set1 with set2. Returns true if they are different.

sdm_idset sdm_set_union([in] sdm_idset set1, [in] sdm_idset set2): Compute the union of set1 and set2 and store the result in set1. Returns set1.

sdm_idset sdm_set_intersect([in] sdm_idset set1, [in] sdm_idset set2): Compute the intersection of set1 and set2 and store the result in set1. Returns set1.

sdm_idset sdm_set_diff([in] sdm_idset set1, [in] sdm_idset set2): Compute the difference between set1 and set2 and store the result in set1. Returns set1.

int sdm_set_contains([in] sdm_idset set, [in] sdm_id id): Returns true if set contains id.

sdm_id sdm_set_max([in] sdm_idset set): Returns the largest element that can be stored in the set.

sdm_id sdm_set_first([in] sdm_idset set): Initialize and return the first element in the set iterator.

sdm_id sdm_set_next([in] sdm_idset set): Return the next element in the set iterator.

int sdm_set_done([in] sdm_idset set): Check if there are more elements in the iterator.

int sdm_set_serialize([in] sdm_idset set, [in] char *buf, [out] char **end): Create a serialized representation of the set. The result is placed in the buffer pointed to by buf. If end is not null, it will contain a pointer to the first location after the end of the conversion. Returns the number of characters in the conversion.

int sdm_set_serialized_length([in] sdm_idset set): Returns an upper bound on the number of characters required for the conversion. This can be used to pre-allocate a buffer for the conversion.

int sdm_set_deserialize([in] sdm_idset set, [in] char *str, [out] char **end): Convert a serialized representation into a set. The str argument points to a buffer containing the serialized set. If non-null, the end of the conversion will be returned in end. Returns the number of characters in the conversion.
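A bitmap is one simple representation that supports the set operations above. The sketch below is illustrative only: the sketch_ names, the fixed capacity of 256 IDs, the use of int for IDs, and the convention that the iterator returns -1 when exhausted (in place of a separate done check) are assumptions, not the proposed implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_IDS 256
#define SKETCH_WORDS   (SKETCH_MAX_IDS / 64)

struct sketch_idset {
    uint64_t bits[SKETCH_WORDS];   /* bit i set => ID i is in the set */
    int      cursor;               /* iterator position */
};

void sketch_set_clear(struct sketch_idset *s)
{
    memset(s->bits, 0, sizeof(s->bits));
    s->cursor = 0;
}

void sketch_set_add(struct sketch_idset *s, int id)
{
    assert(id >= 0 && id < SKETCH_MAX_IDS);
    s->bits[id / 64] |= (uint64_t)1 << (id % 64);
}

int sketch_set_contains(const struct sketch_idset *s, int id)
{
    if (id < 0 || id >= SKETCH_MAX_IDS)
        return 0;
    return (int)((s->bits[id / 64] >> (id % 64)) & 1u);
}

int sketch_set_size(const struct sketch_idset *s)
{
    int count = 0;
    for (int w = 0; w < SKETCH_WORDS; w++)
        for (uint64_t v = s->bits[w]; v != 0; v &= v - 1)
            count++;               /* clears one set bit per iteration */
    return count;
}

/* Union and intersection store their result in a, as in the API above. */
void sketch_set_union(struct sketch_idset *a, const struct sketch_idset *b)
{
    for (int w = 0; w < SKETCH_WORDS; w++)
        a->bits[w] |= b->bits[w];
}

void sketch_set_intersect(struct sketch_idset *a, const struct sketch_idset *b)
{
    for (int w = 0; w < SKETCH_WORDS; w++)
        a->bits[w] &= b->bits[w];
}

/* Iterator: returns the next member in ascending order, or -1 when done. */
int sketch_set_next(struct sketch_idset *s)
{
    while (s->cursor < SKETCH_MAX_IDS) {
        int id = s->cursor++;
        if (sketch_set_contains(s, id))
            return id;
    }
    return -1;
}

int sketch_set_first(struct sketch_idset *s)
{
    s->cursor = 0;
    return sketch_set_next(s);
}
```

Union and intersection cost one machine operation per 64 IDs, so even a set sized for around a million processes needs only about 16K word operations.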

Routing

int sdm_route_init([in] int argc, char *argv[]): Initialize the SDM routing services. NOTE: this should not be called directly, only sdm_init() should be used. The argc and argv parameters provide routing layer-specific information. Returns -1 if initialization fails.

void sdm_route_finalize(void): Finalize the SDM routing services. NOTE: this should not be called directly, only sdm_finalize() should be used.

sdm_id sdm_route_get_parent(void): Find the ID of the parent of this process. The special value SDM_MASTER ID indicates this is the master process.

sdm_id sdm_route_get_id(void): Find the ID of the current process.

void sdm_route_set_id([in] sdm_id id): Set the ID of the current process.

int sdm_route_get_size(void): Find the number of entries in the routing table.

void sdm_route_set_size([in] int size): Set the number of entries in the routing table.

sdm_idset sdm_route_get_route([in] sdm_idset dest): Given a destination set, returns a set containing nearest neighbors.

sdm_idset sdm_route_reachable([in] sdm_idset dest): Given a destination set, returns a set containing all reachable destinations.
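For a binomial tree, the nearest-neighbor lookup could work as in the sketch below: a child is a next hop if any destination lies in the subtree rooted at that child. The sketch assumes ranks 0..size-1 with rank 0 as the root, at most 64 ranks so that a set fits in a 64-bit mask, and a hypothetical function name; none of these details come from the design.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given this process's rank and a destination set (bit r set => rank r
 * is a destination), return the set of immediate children that a
 * message must be sent to.
 */
uint64_t sketch_route(int rank, int size, uint64_t dest)
{
    uint64_t next_hops = 0;
    int limit;

    assert(size > 0 && size <= 64);
    assert(rank >= 0 && rank < size);

    if (rank == 0) {
        limit = 1;
        while (limit < size)
            limit <<= 1;          /* smallest power of two >= size */
    } else {
        limit = rank & -rank;     /* lowest set bit of rank */
    }

    for (int mask = limit >> 1; mask > 0; mask >>= 1) {
        int child = rank + mask;
        if (child >= size)
            continue;

        /* The subtree rooted at child covers ranks [child, child + mask). */
        uint64_t subtree = 0;
        for (int r = child; r < child + mask && r < size; r++)
            subtree |= (uint64_t)1 << r;

        if ((dest & subtree) != 0)
            next_hops |= (uint64_t)1 << child;
    }
    return next_hops;
}
```

In an 8-rank tree, a message from the root to ranks 3 and 5 is sent only to children 2 and 4; child 1 is skipped because no destination lies beneath it.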

Aggregation

int sdm_aggregate_init([in] int argc, char *argv[]): Initialize the SDM aggregation services. NOTE: this should not be called directly, only sdm_init() should be used. The argc and argv parameters provide aggregation layer-specific information. Returns -1 if initialization fails.

void sdm_aggregate_finalize(void): Finalize the SDM aggregation services. NOTE: this should not be called directly, only sdm_finalize() should be used.

void sdm_aggregate_progress(void): Progress the SDM aggregation services. NOTE: this should not be called directly, only sdm_progress() should be used.

sdm_aggregate sdm_aggregate_new(void): Create a new aggregation element.

void sdm_aggregate_free([in] sdm_aggregate a): Free the resources associated with an aggregation element.

void sdm_aggregate_set_value([in] sdm_aggregate a, [in] int type, ...): Supply aggregation-specific values to the aggregation service.

void sdm_aggregate_set_completion_callback([in] int (*callback)(sdm_message msg)): Set the aggregation completion callback. This callback is invoked when an upstream aggregation is completed.

int sdm_aggregate([in] sdm_message msg, [in] unsigned int flags): Aggregates the message specified by msg with msg_aggregate. The flags parameter is used to specify the direction of the aggregation. Currently supported flags are SDM_AGGREGATE_UPSTREAM and SDM_AGGREGATE_DOWNSTREAM.

int sdm_aggregate_serialize([in] sdm_aggregate a, [in] char *buf, [out] char **end): Create a serialized representation of the aggregation. The result is placed in the buffer pointed to by buf. If end is not null, it will contain a pointer to the first location after the end of the conversion. Returns the number of characters in the conversion.

int sdm_aggregate_serialized_length([in] sdm_aggregate a): Returns an upper bound on the number of characters required for the conversion. This can be used to pre-allocate a buffer for the conversion.

int sdm_aggregate_deserialize([in] sdm_aggregate a, [in] char *str, [out] char **end): Convert a serialized representation into an aggregation. The str argument points to a buffer containing the serialized aggregation. If non-null, the end of the conversion will be returned in end. Returns the number of characters in the conversion.

I/O Forwarding

int sdm_iof_set_stdin_handler(int fd_in, int fd_out, int (*stdin_handler)(int, int)): Set the handler that will manage standard input to the backend. fd_in is the file descriptor on which stdin will be available. fd_out is the file descriptor to which the backend will be expecting the input to be sent. The handler will be called when there is data available on fd_in.

int (*stdin_handler)(int fd_in, int fd_out): Handle standard input. The normal behavior is to read from file descriptor fd_in and send to file descriptor fd_out.

int sdm_iof_set_stdout_handler(int fd_in, int fd_out, int (*stdout_handler)(int, int)): Set the handler that will manage standard output from the backend. fd_in is the file descriptor to which the standard output will be sent by the backend. fd_out is the file descriptor to which stdout should be sent. The handler will be called when there is data available on fd_in.

int (*stdout_handler)(int fd_in, int fd_out): Handle standard output. The normal behavior is to read from file descriptor fd_in and send to file descriptor fd_out.

int sdm_iof_set_stderr_handler(int fd_in, int fd_out, int (*stderr_handler)(int, int)): Set the handler that will manage standard error from the backend. fd_in is the file descriptor to which the standard error will be sent by the backend. fd_out is the file descriptor to which stderr should be sent. The handler will be called when there is data available on fd_in.

int (*stderr_handler)(int fd_in, int fd_out): Handle standard error output. The normal behavior is to read from file descriptor fd_in and send to file descriptor fd_out.
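A handler implementing the "normal behavior" described above (read from fd_in, write to fd_out) might look like the following sketch. The function name and the return convention (bytes forwarded, 0 on end of file, -1 on error) are assumptions, since the API does not define what the handlers return.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical handler: copy whatever is currently readable on fd_in to
 * fd_out. Returns the number of bytes forwarded, 0 on end of file, or
 * -1 on error.
 */
int sketch_forward_handler(int fd_in, int fd_out)
{
    char buf[4096];
    ssize_t n;

    assert(fd_in >= 0 && fd_out >= 0);

    do {
        n = read(fd_in, buf, sizeof(buf));
    } while (n < 0 && errno == EINTR);   /* retry if interrupted */

    if (n <= 0)
        return (int)n;

    /* write() may accept fewer bytes than requested, so loop. */
    ssize_t off = 0;
    while (off < n) {
        ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += w;
    }
    return (int)n;
}
```

The same function has the signature expected for the stdin, stdout and stderr handlers, since each one copies from one descriptor to another.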

Protocol Changes

The changes to the proxy protocol will be described in a separate document.

External Interfaces

None.

Compatibility

The new SDM design changes the existing debug protocol. Because of this and the other proposed changes, the new SDM will not be backwards compatible with the PTP 2.0 SDM.

The new debug protocol will form a reference implementation for the new protocol abstraction layer. External interfaces and the connection setup protocol do not change. The interface between the SDM and the debug backend will remain backward compatible with the current SDM.

Packaging

The SDM will continue to be packaged as a source-only Eclipse plugin as part of PTP distribution. The plugin can be downloaded using the Eclipse update manager, or directly from the PTP download site as a gzipped tar file or zip file.

Installation and Configuration

The SDM executable must be installed in a location that is accessible to the launch mechanism employed by the frontend. This may mean copying the executable to nodes of a cluster, for example. Since this is highly dependent on the target architecture, PTP does not provide any additional installation assistance.

Configuration of the SDM is managed through the front-end. All configurable options are supplied by the front-end during debugger initialization.

Testing

Testing of the SDM will be undertaken as per the PTP 2.x test plan.

Dependencies

Hardware

Development/testing can largely be carried out on a local machine by emulating a multi-node cluster. However this is not adequate for some aspects of the debugger design, particularly the startup and communication. These design considerations require access to real systems with a variety of architectures and software stacks. In addition, development and testing for specialized hardware architectures will require access to these systems.

Software

Other than the PTP installation pre-requisites, there are no other software dependencies.
