DBus zerocopy

Been thinking about this for a while...

If I have two applications that know they are going to exchange large amounts of memory over the bus, for example a scanning application, then it could make sense to have them use a zero-copy mechanism for doing so, if both applications are local to the bus.


A connection between two processes on the same machine is assumed to be local in the sense the implementation requires. That leaves four scenarios:

  1.       Machine A
             App1 -----+
                       +--- d-bus server
             App2 -----+
    There's no problem doing zero-copy transmissions.
  2.                Machine A                 Machine B
           App1 --------- d-bus server --------- App2
    Zero-copy transmission can be done from App1 to the server, but the server has to re-marshal the message before sending it to App2.
  3.       Machine A          Machine B
             App1 -----+
                       +--- d-bus server
             App2 -----+
    It should be possible to do zero-copy transmission between App1 and App2, but it is impractical to detect this situation and to create a reliable scheme for zero-copy transmissions. In practice, it is unlikely that this situation will occur.
  4.       Machine A          Machine B          Machine C
             App1 --------- d-bus server --------- App2
    No zero-copy transmission possible.

public API

An application that is going to transfer a large amount of memory will, instead of calling dbus_message_iter_open_container, call dbus_message_iter_open_array_fixed, which has the following prototype:

dbus_bool_t dbus_message_iter_open_array_fixed(DBusMessageIter *iter,
                                               const char      *contained_signature,
                                               int              n_elements,
                                               DBusMessageIter *sub);

The function works similarly to dbus_message_iter_open_container, with the additional restrictions that it only creates arrays and that the application must write exactly n_elements elements using the sub-iterator.

Additionally, if (and only if) the contained_signature is "y" (but see below), the application can call a function dbus_message_iter_get_memory on the resulting sub-iterator, with the following prototype:

unsigned char *dbus_message_iter_get_memory(DBusMessageIter *sub);

which returns a pointer to the memory the dbus library pre-allocated for the fixed length array, which the application can then fill. It returns NULL if the inner type is invalid or any other error occurs.

Maybe all other types that are marshalled without padding in between them should be allowed for direct access as well. Types that are marshalled with padding within the array should not be allowed for direct access so that the application using this pointer need not be aware of padding rules (it has to be aware of them in the sense that anything that needs padding is not usable, but this is not a source of errors if the d-bus API checks this before returning a pointer).
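The check described above could look roughly like the sketch below. The name zc_type_allows_direct_access is hypothetical and not part of any existing API. It relies on the D-Bus wire format rule that every fixed-size basic type is aligned to its own size, so consecutive array elements of such a type are packed without gaps. Structs, by contrast, are 8-byte aligned and may contain padding. libdbus already offers a similar test in dbus_type_is_fixed.

```c
#include <stdbool.h>

/* Hypothetical helper: true if an array whose elements have this D-Bus type
 * code is packed, i.e. element alignment equals element size, so a raw
 * pointer to the first element can be handed to the application. */
bool zc_type_allows_direct_access(char type_code)
{
    switch (type_code) {
    case 'y':            /* BYTE,            1 byte  */
    case 'n': case 'q':  /* INT16 / UINT16,  2 bytes */
    case 'b':            /* BOOLEAN,         4 bytes */
    case 'i': case 'u':  /* INT32 / UINT32,  4 bytes */
    case 'x': case 't':  /* INT64 / UINT64,  8 bytes */
    case 'd':            /* DOUBLE,          8 bytes */
        return true;
    default:
        /* Strings, variants, structs, dict entries, nested arrays, ... */
        return false;
    }
}
```

With such a check in place, dbus_message_iter_get_memory would return NULL whenever it yields false, so the application never sees padding.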

For receiving, the same is done, and the dbus_message_iter_get_memory call is available as well. Note that it must be made available for any kind of array, even one that was not sent via shared memory, in order to make this transparent to applications.

suggested implementation/specification

The following suggestion assumes that POSIX shared memory is available and that file descriptors can be passed via the local transport. Both these assumptions are true for Linux where connections are via unix sockets, and all implementations capable of connecting via unix sockets must be able to receive shared memory data.

Internally, the function dbus_message_iter_open_array_fixed works differently depending on the transport. If the destination is remote, it simply allocates enough memory for all the array elements including padding, and adds it to the message.

If the connection is local, it instead allocates a shared memory segment using shm_open and resizes it to the requested size. Then, instead of passing the data directly, the file descriptor referring to the shared memory is passed over the socket in an SCM_RIGHTS control message that accompanies the marshalled message itself. Inside the marshalled message, the special type 77 (ASCII 'M') indicates that a shared memory segment is passed. Type 77 has semantics similar to the variant type, so you can have a signature of May, Ma(tt) or even Ma(dstunt), except that no data is marshalled (the receiver obtains the size by calling fstat on the descriptor). Note that shared memory segments can neither be nested nor used inside arrays. If multiple segments need to be passed, their descriptors are passed in the SCM_RIGHTS ancillary data in the same order in which they occur in the message signature. A maximum of 10 descriptors may be passed in each message (this number is arbitrary, but keep in mind the usual fd limit of 1024).
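To make the descriptor passing concrete, here is a minimal sketch of the socket side only, assuming a connected unix socket. The helper names (zc_send_with_fd, zc_recv_with_fd, zc_segment_size) are made up for illustration and pass a single descriptor. A real implementation would pass up to the per-message limit and check for truncated control data.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send `len` bytes of marshalled message data together
 * with one descriptor in an SCM_RIGHTS control message. Returns 0 or -1. */
int zc_send_with_fd(int sock, const void *data, size_t len, int fd)
{
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    union {
        struct cmsghdr align;                 /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)len ? 0 : -1;
}

/* Hypothetical helper: receive up to `cap` bytes and store the byte count in
 * *len. Returns the descriptor that came along, or -1 if there was none. */
int zc_recv_with_fd(int sock, void *data, size_t cap, ssize_t *len)
{
    struct iovec iov = { .iov_base = data, .iov_len = cap };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    *len = recvmsg(sock, &msg, 0);
    if (*len < 0)
        return -1;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            int fd;
            memcpy(&fd, CMSG_DATA(c), sizeof fd);
            return fd;
        }
    }
    return -1;
}

/* The segment size is not marshalled; the receiver asks the kernel. */
off_t zc_segment_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

The receiver gets a new descriptor number that refers to the same underlying object, so it can mmap it and read what the sender wrote without the payload ever crossing the socket.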

The rationale for using a new type M with semantics like the variant type is that then the full data can be passed in shared memory if the size of the message is known when it is constructed by simply wrapping the original signature in a struct and a shared memory indicator like this: M(<original signature>). Should the message need re-marshalling in the bus (scenario 2), the struct should be removed as well so the original message results (which itself may consist of a struct, of course). It should be possible (but optional) for the application to tell the d-bus library what the expected message size will be, so that the library can decide whether to pass the message data in shared memory or not and preallocate memory appropriately. In that case, it may be an error if that size is exceeded (but the shared memory segment size may also be increased if that is possible).
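The wrapping and unwrapping of the signature can be sketched as plain string handling. Both function names are hypothetical, and the 'M' type code is the one proposed on this page, not an existing D-Bus type. The only outside fact used is the D-Bus limit of 255 bytes per signature.

```c
#include <stdio.h>
#include <string.h>

#define ZC_MAX_SIGNATURE 255  /* maximum signature length in D-Bus */

/* Hypothetical helper: "ay" -> "M(ay)". Returns 0 on success, -1 if the
 * result does not fit in `out` or exceeds the signature length limit. */
int zc_wrap_signature(const char *orig, char *out, size_t cap)
{
    int n = snprintf(out, cap, "M(%s)", orig);
    return (n >= 0 && (size_t)n < cap && n <= ZC_MAX_SIGNATURE) ? 0 : -1;
}

/* Hypothetical helper: "M(ay)" -> "ay", as the bus would do before
 * re-marshalling for a remote peer. Returns -1 if the signature is not a
 * single M(...) wrapper around the whole message. */
int zc_unwrap_signature(const char *wrapped, char *out, size_t cap)
{
    size_t len = strlen(wrapped);
    if (len < 4 || wrapped[0] != 'M' || wrapped[1] != '(' ||
        wrapped[len - 1] != ')')
        return -1;

    /* The '(' at index 1 must be closed by the very last character. */
    int depth = 0;
    for (size_t i = 1; i < len; i++) {
        if (wrapped[i] == '(')
            depth++;
        else if (wrapped[i] == ')' && --depth == 0 && i != len - 1)
            return -1;
    }
    if (depth != 0)
        return -1;

    size_t inner = len - 3;           /* drop "M(" and ")" */
    if (inner >= cap)
        return -1;
    memcpy(out, wrapped + 2, inner);
    out[inner] = '\0';
    return 0;
}
```

For example, wrapping "a(tt)" gives "M(a(tt))", and unwrapping that gives back "a(tt)".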

Note that this whole scheme requires that messages are read from the local transport using recvmsg and sent using sendmsg instead of recv and write or writev.

In order to guarantee that nothing is left behind, the sending process shall shm_unlink the shared memory segment immediately after opening it. The file descriptor is closed and the mapping is given up when the message it is associated with is destroyed (both of these also happen automatically if the process terminates abnormally). Since the segment is unlinked immediately, its name is not important, but care should be taken that a new segment is allocated every time one is needed.
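A minimal sketch of the create-then-unlink step, assuming POSIX shared memory is available. The function name and the naming scheme (pid plus a counter) are assumptions made for illustration. O_EXCL ensures a fresh segment is created each time instead of reusing an existing one. On older glibc versions this needs linking with -lrt.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a fresh shared memory segment of `size` bytes
 * that has no name left in the system. Returns a descriptor or -1. */
int zc_create_segment(size_t size)
{
    static unsigned counter;   /* not thread-safe; fine for a sketch */
    char name[64];

    for (int attempt = 0; attempt < 16; attempt++) {
        snprintf(name, sizeof name, "/zc-dbus-%ld-%u",
                 (long)getpid(), counter++);

        /* O_EXCL: fail instead of silently reusing an existing segment. */
        int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            if (errno == EEXIST)
                continue;      /* name taken, try the next one */
            return -1;
        }

        /* Remove the name right away. The object itself lives on until the
         * last descriptor and mapping referring to it are gone. */
        shm_unlink(name);

        if (ftruncate(fd, (off_t)size) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }
    return -1;
}
```

The returned descriptor can be mapped with mmap by the sender and handed to the peer over the socket. Closing it and unmapping the memory is all the cleanup that is needed.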

If the client is connected to the d-bus daemon via a local socket, but the recipient of the message is not, the d-bus daemon must appropriately re-marshal the message upon receiving it, and replace any shared memory segments with the inline data. In the other direction, it may optionally replace array transmissions with shared memory transmissions where possible, but this is not required.

alternative implementation

Instead of passing file descriptors, it would also be possible to use SysV shared memory and pass its key inside the message. This has the advantage that it could cover case 3 from above (if the situation can be detected properly), but the disadvantage of leaving the transmitted data readable for everyone on the system, since one cannot know in general which permissions suffice, and the bus may be remote or running without enough privileges to do arbitration. This security issue prompted me to come up with the above implementation instead.

implementations on other systems

On other systems, for example Windows, no local connections like unix sockets are currently supported. When this changes, the implementers should explore whether such a zero-copy extension is possible on that system.