Difference between revisions of "PTP/designs/new sdm"

Revision as of 13:58, 12 May 2008

Overview

This document describes the changes to the scalable debug manager that are proposed for the PTP 2.1 release. It should be read in conjunction with the Scalable Debug Manager design document.

The major goals for the SDM 2.1 release are:

Remove dependency on OpenMPI for debugger startup
Remove dependency on MPI communication primitives
Allow communication infrastructure to be pluggable
Clean separation of protocol specific and protocol independent components
Support for I/O forwarding

Startup

The communications network comprises a master process and a number of server processes. To debug an N-process application, N+1 debugger processes are started (one master and N servers). SDM startup occurs in two phases: the master process is started first, then the server processes. When debugging an application using PTP, the resource manager is responsible for coordinating this startup.

Master Process Launch

The master process could be located anywhere, but since it needs to be able to communicate with both the debugger front-end and the server processes, it will normally be launched on the system login node (the location specified in the resource manager configuration). When it is launched, the master process is passed arguments specifying an arbitrary connection string. The master process then calls sdm_master_init() and passes this connection string as a parameter. If the call to sdm_master_init() is successful, the master will repeatedly call sdm_master_progress() to process incoming messages from the front-end.

The default implementation of sdm_master_init() requires a connection string that specifies the TCP/IP address and port number of the front-end. It will attempt to connect to this address a predetermined number of times, returning an error if no attempt succeeds.
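As an illustration of how such a connection string might be handled, the sketch below splits a string into a host and a port. The "host:port" layout and the helper name parse_connection_info() are assumptions made for this example; the design only states that the string carries the front-end address and port number.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split "host:port" into its parts.
 * The "host:port" layout is an assumption, not part of the design.
 * Returns 0 on success, -1 on malformed input. */
int parse_connection_info(const char *info, char *host, size_t host_len, int *port)
{
    assert(info != NULL && host != NULL && port != NULL);

    const char *colon = strrchr(info, ':');
    if (colon == NULL || colon == info)
        return -1;                  /* no separator, or empty host */

    size_t n = (size_t)(colon - info);
    if (n >= host_len)
        return -1;                  /* host does not fit in the caller's buffer */

    char *end;
    long p = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || p < 1 || p > 65535)
        return -1;                  /* missing port, trailing junk, or out of range */

    memcpy(host, info, n);
    host[n] = '\0';
    *port = (int)p;
    return 0;
}
```

A caller would pass the parsed host and port to its connection routine and retry a bounded number of times before reporting an error.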

Server Process Launch

The server processes control the application, so they need to be located on the same nodes as the application processes. It is assumed that the server processes will be launched by the same runtime system that is used to launch normal applications (typically MPI). Server processes are passed an arbitrary connection string as an argument. Each server process will call sdm_server_init() and pass this connection string as a parameter. If the call to sdm_server_init() is successful, it will return an ID in the interval [0, N-1] that is guaranteed to be unique across all server processes. The server will then repeatedly call sdm_progress() to process incoming messages from its parent.

In the default implementation of sdm_server_init() the connection string will comprise a random non-privileged TCP/IP port number. Each server will attempt to bind to this port number. If the port number is in use, then the server will increment the port number and try to bind again. This will be repeated until the server finds an available port number. Once the server processes are started, they will wait for incoming connections on the port. The unique ID will be obtained via an environment variable that is passed to the server process by the runtime system.
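The bind-and-increment behaviour described above could look roughly like the following sketch, which uses POSIX sockets. The helper name bind_first_free_port() and the listen backlog are assumptions for illustration only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a listening TCP socket to the first free port at
 * or above start_port. Returns the socket descriptor and stores the chosen
 * port in *bound_port, or returns -1 on failure. */
int bind_first_free_port(int start_port, int *bound_port)
{
    assert(bound_port != NULL && start_port > 0);

    for (int port = start_port; port <= 65535; port++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons((unsigned short)port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
            listen(fd, 8) == 0) {
            *bound_port = port;
            return fd;              /* caller waits for connections on fd */
        }

        int err = errno;
        close(fd);
        if (err != EADDRINUSE)
            return -1;              /* unexpected error: give up */
        /* port is taken: fall through and try the next one */
    }
    return -1;
}
```

Because several servers may share a node, each one ends up on a different port, which is why the connecting side also has to probe successive port numbers.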

Initialization

As soon as it starts, the master connects to the front-end. This serves to notify the front-end that the debugger is ready for initialization. The initialization process then proceeds as follows:

The front-end waits until it has received notification from the runtime that all N server processes have started.
The front-end queries the model to build a routing table. The routing table consists of N entries that are indexed by the server ID (not the Unix PID). Each entry consists of the address of the host on which the server process is located.
The first command sent by the front-end to the master is a global initialization command that supplies the routing table to all processes.
The master process establishes connections to its immediate children by calling sdm_connect_children().
The master process then sends the initialization command to the children by calling sdm_send_to_children().
The master's child server processes will receive a connection followed by an initialization message.
The server processes will establish connections to their immediate children by calling sdm_connect_children() and then forward the initialization message by calling sdm_send_to_children().
This process will repeat until all servers have been initialized.
Each server process will generate an acknowledgment event and pass it back to its parent by calling sdm_send_to_parent().
The front-end will eventually receive an aggregated event indicating either that all servers are initialized or that an error occurred (for example, a server could not be contacted).
Assuming the front-end initialization was successful, the front-end will finish the initialization phase.

In the default implementation, sdm_connect_children() will compute each process's location in a binomial tree. The ID of each child process will then be used to index the routing table in order to find the IP address of the node on which the child is located. A connection will be attempted on the port number that was previously supplied to the server processes. If unsuccessful, the port number will be incremented and another connection will be attempted after a delay period. This will continue for a predetermined number of times, or until a successful connection is established.

Operation

Once the front-end has received confirmation that the servers have been successfully started, it will enter normal operating mode. This mode starts by notifying the debug servers about the application they will be debugging, and then continuously processes commands initiated by user interaction, and events that are generated as a result of the commands.

Front-end debugger operation consists of the following:

The PDI method startDebugger() is called to pass the application executable name and arguments to the server processes. Depending on the startup mode, this will also indicate that the debugger should attach to existing application processes. If attach mode is enabled, the command will also include information necessary for the server processes to attach to the application processes (e.g. address and PID).
If the "Stop at main() on startup" option is selected, setFunctionBreakpoint() will be called to insert a breakpoint in main(), then start() will be called to start the application execution.
- The event generated as a result of the breakpoint being reached will be sent back to the front-end to indicate that operation can continue.
The debugger will now forward commands and process events normally.

The SDM master and server processes repeatedly call sdm_progress() in order to progress messages via callbacks. When a message is received, one of the callbacks sdm_parent_callback() or sdm_child_callback() is invoked. Both callbacks are passed the message as a parameter, and the child callback takes an additional argument specifying which child the message was received from.

When a debugger command is received from the parent, it is immediately forwarded to the children using sdm_send_to_children(). When an event is received from a child, sdm_aggregate() is called to perform event aggregation, then sdm_send_to_parent() to send the aggregated event to the parent.

Server processes also call sdm_process_command() each time they receive a command. This will perform the command locally if the process is in the set of processes specified by the command header. The resulting event is passed to sdm_aggregate() to be included in the event aggregation for the operation.
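The forward-and-maybe-execute decision can be sketched as follows. The 64-bit bitset is only a stand-in for sdm_collection, whose real representation is not specified here, and the names id_set, command_actions(), ACT_EXECUTE and ACT_FORWARD are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for sdm_collection: one bit per server ID (0..63). */
typedef uint64_t id_set;

id_set id_set_add(id_set set, int id)
{
    assert(id >= 0 && id < 64);
    return set | ((id_set)1 << id);
}

int id_set_contains(id_set set, int id)
{
    assert(id >= 0 && id < 64);
    return (int)((set >> id) & 1u);
}

enum { ACT_EXECUTE = 1, ACT_FORWARD = 2 };

/* Decide what a server does with an incoming command: forward it whenever
 * the server has children, and execute it locally only when the server's
 * own ID is in the command's target set. */
int command_actions(id_set targets, int my_id, int has_children)
{
    int actions = 0;
    if (has_children)
        actions |= ACT_FORWARD;
    if (id_set_contains(targets, my_id))
        actions |= ACT_EXECUTE;
    return actions;
}
```

A server that is not a target still forwards the command, since one of its descendants may be in the target set.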

In the default implementation, sdm_progress() will check socket file descriptors for any available data. Data will be collected into complete messages and passed to the callback functions. The sdm_send_to_children() and sdm_send_to_parent() functions will send messages on the file descriptors corresponding to the children and parent connections respectively. The sdm_process_command() function will implement the proxy debug protocol and call stub handler routines in the backend debug interface. The sdm_aggregate() function will use the existing SDM hashing function for aggregation.
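One way the descriptor check inside sdm_progress() might be written is with a non-blocking poll(), as in the sketch below. The helper name read_first_ready() and the MAX_FDS limit are assumptions; assembling the bytes into complete messages is left out because the wire format is not described here.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 64

/* Hypothetical helper: without blocking, find the first descriptor that has
 * data waiting and read from it. On return *which holds the index of that
 * descriptor, or -1 if none was readable. Returns the number of bytes read,
 * 0 if nothing was ready (or the peer closed), or -1 on error. */
ssize_t read_first_ready(const int *fds, int nfds, char *buf, size_t len, int *which)
{
    assert(fds != NULL && buf != NULL && which != NULL);
    assert(nfds >= 0 && nfds <= MAX_FDS);

    struct pollfd pfds[MAX_FDS];
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    *which = -1;
    if (poll(pfds, (nfds_t)nfds, 0) < 0)    /* timeout 0: return immediately */
        return -1;

    for (int i = 0; i < nfds; i++) {
        if (pfds[i].revents & POLLIN) {
            *which = i;
            return read(fds[i], buf, len);
        }
    }
    return 0;
}
```

Because the call never blocks, the enclosing loop can interleave communication progress with other work, which matches the requirement that sdm_progress() be called regularly.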

I/O Forwarding

I/O forwarding will be supported using an out-of-band mechanism. The assumption will be that I/O forwarding is supported by the underlying runtime system for a normal application process.

There are two I/O forwarding scenarios:

Application launched under debugger control. In this case the I/O for each application process will be managed by the servers. For stdout/stderr, a server will obtain output from the application process using whatever mechanism is provided by the backend debug engine. The server will then forward this output to its corresponding output stream. This output will be sent via the runtime OOB mechanism to the resource manager, which will pass the output to the UI. Input data will be sent from the UI to the resource manager, which will pass the data to the servers using the runtime system. The servers will then forward the input to the application process using the backend debug engine.
Debugger attaches to application. In this case the I/O should already be forwarded by the runtime system OOB mechanism. The servers will know that they attached to existing processes and will not try to initiate any I/O forwarding operations.

The GDB/MI interface used by the default backend implementation does not provide a mechanism for I/O forwarding. Instead the backend must create a pseudo terminal and pass the slave side of the pseudo terminal to GDB using the -tty argument. The master side of the pseudo terminal can then be used to send and receive the target process stdio. A minor modification to the existing backend implementation will be required to forward this I/O to the SDM stdio streams.
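Creating the pseudo terminal could be done with the standard POSIX calls, as in this sketch. The helper name open_pty_master() is hypothetical, and how the slave path is then handed to GDB is not shown.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open a pseudo terminal and report the path of its
 * slave device. Returns the master descriptor, or -1 on failure. */
int open_pty_master(char *slave_path, size_t path_len)
{
    assert(slave_path != NULL && path_len > 0);

    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;

    /* Fix up ownership of the slave and unlock it before first use. */
    if (grantpt(master) != 0 || unlockpt(master) != 0) {
        close(master);
        return -1;
    }

    const char *name = ptsname(master);
    if (name == NULL || strlen(name) >= path_len) {
        close(master);
        return -1;
    }
    strcpy(slave_path, name);
    return master;
}
```

The backend would pass the returned slave path to GDB and then read from and write to the master descriptor to exchange stdio with the target process.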

Proposed APIs

Startup/Initialization

int sdm_master_init([in] char *connection_info): Initialize the master process. The connection_info parameter provides information about how to establish a connection with the front-end. In the default implementation, connection_info specifies the TCP/IP address and port number of the front-end. The master will attempt to connect to this address a predetermined number of times, returning an error if no attempt succeeds.

int sdm_server_init([in] char *connection_info, [out] sdm_id *id): Initialize the server process. The connection_info parameter provides information about how to establish a connection with the master process. If successful, the id parameter will contain a unique ID number in the interval [0, N-1], where N is the number of server processes launched. In the default implementation, connection_info will comprise a random non-privileged TCP/IP port number. Each server will attempt to bind to this port number. If the port number is in use, then the server will increment the port number and try to bind again. This will be repeated until the server finds an available port number. Once the server processes are started, they will wait for incoming connections on the port. The unique ID will be obtained via an environment variable that is passed to the server process by the runtime system.

int sdm_connect([in] sdm_collection child_ids): Establishes communication with children of the master or server process. The number and location of the children are established by the routing layer and are passed in using the child_ids parameter. In the default implementation, a connection will be attempted to each child using the port number that was previously supplied to sdm_server_init(). If unsuccessful, the port number will be incremented and another connection will be attempted after a delay period. This will continue for a predetermined number of times, or until a successful connection is established.

Communication Primitives

int sdm_message_new([in] sdm_collection dest_ids, [out] sdm_message *msg): Create a new empty message that is addressed to one or more destinations specified by dest_ids.

int sdm_message_add_destination([in] sdm_collection dest_ids, [in] sdm_message msg): Add the destinations specified by dest_ids to the message.

int sdm_message_get_destination([in] sdm_message msg, [out] sdm_collection *dest_ids): Return the destination of this message.

int sdm_message_add_data([in] sdm_message msg, [in] char *buf, [in] int len): Adds data to the message. The data is pointed to by the parameter buf and is length len bytes. Once a buffer is added, it should not be modified until the message has been successfully sent.

int sdm_message_get_data([in] sdm_message msg, [out] char **buf, [out] int *len): Gets the data contained in a message. On return, buf points to the data and len holds its length in bytes.

int sdm_message_free([in] sdm_message msg): Free all resources associated with a message. Note that this will not free the data buffer.

int sdm_in_collection([in] sdm_id id, [in] sdm_collection ids): Check if the ID specified by id is included in the collection ids.

int sdm_send([in] sdm_message msg, [in] void *data): Send the message specified by msg to one or more destinations. This is an asynchronous operation. The operation is complete when the message has been sent to all destinations, which is indicated by a call to the sdm_send_callback function. The parameter data will be passed to the callback function.

int (*sdm_send_callback)([in] void *data): This callback function is invoked when a message has been successfully sent. The parameter data can be used to distinguish between different send operations.

int (*sdm_recv_callback)([in] sdm_message msg): This callback function is invoked when a message arrives. The parameter msg will contain the message.

int sdm_progress(void): Progress communication. This function must be called regularly to progress communication operations.
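The buffer ownership rules stated for sdm_message_add_data() and sdm_message_free() can be illustrated with a small stand-in type. The demo_message structure and its functions are hypothetical and exist only for this example; the real sdm_message type is not defined in this document.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message layout, for illustration only. */
typedef struct {
    unsigned long dest_mask;  /* assumed: one bit per destination ID */
    const char   *buf;        /* borrowed from the caller, never owned */
    int           len;
} demo_message;

demo_message *demo_message_new(unsigned long dest_mask)
{
    demo_message *m = calloc(1, sizeof *m);
    if (m != NULL)
        m->dest_mask = dest_mask;
    return m;
}

void demo_message_add_data(demo_message *m, const char *buf, int len)
{
    assert(m != NULL && buf != NULL && len >= 0);
    m->buf = buf;   /* pointer copy only: caller keeps buf unchanged until sent */
    m->len = len;
}

void demo_message_free(demo_message *m)
{
    free(m);        /* releases the message only; the data buffer stays with the caller */
}
```

Keeping the buffer with the caller avoids a copy per message, at the cost of requiring the caller to hold the buffer stable until the send completion callback has fired.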

Routing/Aggregation

int sdm_get_parent([out] sdm_id *parent_id): Find the ID of the parent of this process. A value of SDM_MASTER indicates this is the master process.

int sdm_get_children([out] sdm_collection *children): Find the IDs of the children of this process.

int sdm_init_aggregate([in] sdm_message msg): Initialize the message msg for aggregation. A message must be initialized for aggregation before it can be used in the sdm_aggregate() function.

int sdm_aggregate([in] sdm_message msg, [in] sdm_message msg_aggregate): Aggregates the message specified by msg with msg_aggregate.
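The following sketch shows one plausible shape for hash-based aggregation, assuming that identical events from different servers are merged into a single entry that records which servers reported it. That merging rule, the agg_table type, and the use of 32-bit FNV-1a in place of the existing SDM hashing function are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_EVENTS 16

typedef struct {
    uint32_t    hash;     /* hash of the event text */
    const char *text;     /* borrowed event payload */
    uint64_t    sources;  /* bit i set = server i reported this event */
} agg_entry;

typedef struct {
    agg_entry entries[MAX_EVENTS];
    int       count;
} agg_table;

/* 32-bit FNV-1a, used here only as a stand-in for the SDM hash function. */
uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Merge one event into the table: identical events share an entry and only
 * extend its source set. Returns 0 on success, -1 if the table is full. */
int agg_add(agg_table *t, const char *text, int source_id)
{
    assert(t != NULL && text != NULL && source_id >= 0 && source_id < 64);

    uint32_t h = fnv1a(text);
    for (int i = 0; i < t->count; i++) {
        agg_entry *e = &t->entries[i];
        /* Compare the text as well, so a hash collision cannot merge
         * two different events. */
        if (e->hash == h && strcmp(e->text, text) == 0) {
            e->sources |= (uint64_t)1 << source_id;
            return 0;
        }
    }
    if (t->count == MAX_EVENTS)
        return -1;

    agg_entry *e = &t->entries[t->count++];
    e->hash = h;
    e->text = text;
    e->sources = (uint64_t)1 << source_id;
    return 0;
}
```

Under this scheme, a breakpoint hit reported by many servers travels up the tree as one event plus a set of IDs rather than as one event per server.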

Protocol Operations

int sdm_process_command()

