\section{Implementation of the Flow Control for ESP}

The flow control scheme developed in section \ref{fc-design} had to be
implemented on top of the existing implementation of the ethernet
streaming protocol (ESP) by \cite{mreinhardt}. ESPs implementation
provides applications the usual socket programming interface, just
like TCP. This means every data transmission is started by an
application making a call to \fname{sendmsg} or its related functions
and ends with the receiving application's call to \fname{recvmsg} (or
its related).

\begin{figure}
  \centering
  \input{./graphics/data_flow.pdf_t}
  \caption[Data flow from sending to receiving process]{Stations of
    data flow from the sending to the receiving process. The curved
    arrow indicates the data transmission over the network.}
  \label{fig:data-flow}
\end{figure}

As ESP is a streaming protocol providing reliable data transmission,
the first step to be made before data can be transmitted is to set up
a virtual channel (called ``connection'' in the following) between the
two sockets. This is accomplished using a 3-way handshake quite
similar to what TCP does and explained in detail in \cite{mreinhardt}.
Calling the functions \fname{sendmsg} or \fname{recvmsg} is sensible
only for sockets which have entered the connected state.

As shown in figure~\ref{fig:data-flow}, there are several stations the
data has to pass before it's finally written to the buffer space
provided by the receiving process:

\begin{enumerate}
\item First the message has to be assembled by the sending process.
  While this sounds trivial, this step can contain several non-trivial
  operations. For example, byte-order conversions could be necessary
  or the data might be padded out with dummy bytes to allow aligned
  (and thus faster) access to individual pieces, if the architecture
  profits by (or even requires) doing so. There is no point in
  transmitting these dummy bytes over the network, so they should not
  be included with the final message. These tasks of compositing the
  final message to be sent are often performed by an intermediate
  piece of software like the various implementations of the MPI
  (message passing library; \cite{mpi-programming})
  standard\footnote{At the time of this writing, only the OpenMPI
    \cite{ompi, mreinhardt} implementation has support for the ESP
    protocol layer.}.
\item After the message has been composed and is ready for
  transmission, it's handed to the protocol layer which resides within
  kernel-space. This is where the ESP protocol steps in. The message
  is split into MSS-sized packets, equipped with a transport header,
  stored on the write queue and possibly the first packets are already
  sent out over the network. This step is explained in detail in
  section~\ref{sendmsg-handler}.
\item Next step is the actual transmission of the assembled packets
  over the network, including retransmissions if necessary. Here is
  the field of activity for the flow control and congestion avoidance
  algorithms.
\item The packets arriving at the networking device of the receiving
  host are handed out to the protocol handler registered with the
  kernel for the specific ethernet packet type. This handler decides
  if a socket exists which might be interested in the incoming packet
  by looking at the source and destination port numbers and adresses
  \cite{mreinhardt}. If it finds one, the packet is given to this
  socket for processing. Packets containing useful payload are stored
  on the socket-private receive queue until they are fetched by the
  receiving process.
\item Finally, the receiving process will do one or more call(s) to
  the \fname{recvmsg()} function. Here the socket buffers are taken
  off the receive queue and copied over to user-space. This step is
  repeated until the full message is stored in the
  application-provided buffer.  Possibly the data has to be unpacked
  by intermediate layers to be ready for further processing by the
  actual application, as mentioned in step~1.
\end{enumerate}

The steps 2 -- 5 are carried out by (or involve activities from) the
protocol layer, that is ESP. The practical aspects of these tasks,
including some optimizations applied, are to be explained in this
chapter. For easier reading there is a general survey of the state
information which is maintained by the ESP protocol layer in
section~\ref{state-information} on
page~\pageref{state-information}\,ff. This should be used as a
reference, if the meaning of some of the identifier names is not
obvious to the reader.

\subsection{Design of the ESP Output Engine}

The output engine is responsible for actually emitting packets into
the network. Two kinds of packets are to be differentiated here:
packets carrying payload (``data packets'') and packets used for
status updates only (``empty packets''). 

Data packets are only created upon user request (see
\ref{sendmsg-handler}) and are always stored on the write queue until
they are acknowledget by the receiver. They are sent out whenever the
congestion avoidance strategy decides that it is valid to do so.
Events which may trigger the sending of data, roughly ordered by the
probability of their occurrence, are:

\begin{itemize}
\item The receipt of an acknowledgement (ACK).
\item The user calls the \fname{sendmsg} handler.
\item The receipt of a retransmission request (RRQ).
\item The expiration of the retransmission timer.
\end{itemize}

Empty packets (packets with a length of $0$ recorded in the ESP
header) are used frequently for the exchange of state information, but
they never enter the write queue. These packets instead are
constructed and emitted on the fly. If a retransmission of such an
packet is necessary, it's simply constructed and emitted again. This
procedure does introduce some negligible overhead, but it helps to
simplify and streamline the implementation a lot. As these packets are
very small\footnote{Depending on what to count, the size of these
  frames is 39\,bytes (storage space required when buffered) or
  64\,bytes (the minimum ethernet frame duration, which includes
  padding).}, it is not neccessary to limit the emission of these
packets by the congestion avoidance. If there was a noteworthy share
of bandwidth used by these empty packets, it would rather indicate a
bug in the implementation.

\begin{figure}
  \centering
  \input{./graphics/output-engine.pdf_t}
  \caption[The ESP output engine]{Structure of the ESP output engine.
    An arrow stands for a possible function call. The only function
    which actually emits packets to the network is esp\_skb\_xmit.}
  \label{fig:output-engine}
\end{figure}

Figure~\ref{fig:output-engine} gives an overview of the various
function calling paths leading to a packet being emitted into the
network, where the upper frame contains functions which emit packets
taken off the write queue, and the lower frame deals with empty (i.e.
flags-only) packets. The emission of acknowledgements is not included
in this figure, because it uses a special-cased implementation as
explained in section~\ref{ack-queue}. This is, because if a socket
wants to send out an ACK, it might actually trigger the sending of an
ACK which was enqued by another socket, as outlined in section
\ref{handling-multiple-senders}. The \fname{esp\_skb\_xmit()} function
can not handle this situation, the a special implementation was
needed.

\subsection{Receiving Data from User-space}
\label{sendmsg-handler}

There is a single point where data is handed from user-space to the
ESP protocol layer for transmission, which is implemented by the
function \fname{sock\_esp\_sendmsg()}. This function had to be
extended in several ways in order to implement the proposed flow
control scheme.

The first part of this function tests if the preconditions for sending
data on the given socket are met, namely if the socket really is in a
connected state, the shutdown of the connection has not already been
announced to the receiver, and finally if the device this socket
operates on has not been shut down inbetween. If all of these
conditions are met, it is valid to send out data on this socket. As
explained in section~\ref{requesting-slots}, any transmission has to
be announced to the receiver by the means of a TXS-flagged packet to
be sent as the very first of any transmission. To ensure this, the
next step is to test the per-socket variable \fname{trans\_announced}.

If the value stored there is non-zero, the current call to
\fname{sock\_esp\_sendmsg()} is part of an already announced
transmission, which was interrupted by a signal sent to the user-space
process or the lack of write queue space and a subsequent timeout
while waiting for more write space via the kernel function
\fname{sk\_stream\_wait\_memory()}. In either case, the transmission
is already announced and the \fname{sendmsg()} handler can just go on
enqueing data onto the write queue.

But if the value is zero, a new transmission is about to start, and
there are two ways to deal with this situation. The first way is to
minute that the next packet to be constructed will have to carry the
TXS flag, and to record this packet's sequence number as the new value
for \fname{announce\_seq}. This procedure initiates a new
transmission.

Still, there is a more elegant solution, which can be used if the
socket's \fname{send\_head} is not \NULL: As already mentioned, the
send head always points to the next socket buffer to be sent out {\em
  for the first time}. If this variable actually points to a socket
buffer, no matter what the position of this specific buffer is on the
write queue, it is sure there is data on the queue which has not been
sent out before. Consequently, because \fname{trans\_announced} was
zero, the last packet on the write queue will have the TXF flag set,
which ought to indicate the end of transmission to the receiver. Now
it is possible to just clear the TXF flag on this last packet. Doing
so effectively merges these two consecutive transmissions into a
single, larger one. This procedure has two positive effects: it lowers
the processing overhead on the receiver side and allows the sender to
stick with using the full congestion window, skipping the waiting for
the first ACK, which would be required upon transmission announcement
oherwise.

The rest of the function is mainly a big loop which slices the
application-provided buffer into MSS-sized packets and enqueues them
to the tail of the write queue. Thereby it sets the \fname{send\_head}
if it was \NULL\/ and takes care to set the TXF flag on the last
packet, as well as clearing \fname{trans\_announced} in this case.
Additionally, in every pass a call to \fname{esp\_push\_one\_frame()}
is made. This aims for minimizing the end-to-end latency by
overlapping the receipt of data from userspace and sending the data
out over the network.

Additionally, the \fname{sendmsg()} handler function has been reworked
so it can gather the data to output from multiple buffers. This
feature, exposed to user-space by the \fname{writev()} (``write a
vector'') function, is the primary way the Open MPI library makes use
of the ESP protocol. Having a \fname{sendmsg()} which is capable of
doinng vector I/O can significantly reduce the number user-
kernel-space transitions needed to carry out certain collective MPI
operations, namely from the scatter / gather group of operations.

\subsection{Sending Data to the Receiver}

As can be seen in figure~\ref{fig:output-engine}, the regular calling
stack which leads to data packets being sent out to the receiver
always leads through the function \fname{esp\_push\_one\_frame()},
which shall be explained here along with its supporting functions.

All data being sent out comes off the write queue, whose structure is
shown in figure~\ref{fig:write_queue_scheme}. The write queue is
organized as a double linked list, with the list head (pointers to the
first and the last socket buffer on the queue) stored in the struct
\fname{sk\_write\_queue}, along with some auxiliary information used
to maintain the list and provide the ability to use it from
concurrently running contexts safely by providing a locking mechanism.

\begin{figure}
  \centering
  \input{./graphics/write_queue.pdf_t}
  \caption[The ESP write queue]{The ESP write queue. Arrows indicate
    pointers; for the write queue, the pointers to the previous
    element are omitted for clarity. The last SKB on the queue is the
    oldest, whose \fname{pkt\_seq} equals \fname{una\_seq}; the send
    head points to the next packet to be emitted for the first time.}
  \label{fig:write_queue_scheme}
\end{figure}

The first operation this function performs is to check if the socket's
\fname{sk\_send\_head} pointer is not \NULL. If it is \NULL, the
function returns immediately because there is no {\em new} data on
queue to be sent. Otherwise, the congestion avoidance is asked if it
is valid to send out the packet found.

This is done by calling \fname{esp\_cwnd\_test()}, which performs the
following: First it checks if the last TXS-flagged packet was already
acknowledged by the receiver. If this is true (because
\fname{una\_seq} is after \fname{announce\_seq}), this qualifies for
using the full congestion window. In this the case the packet may be
emitted if its sequence number is before \fname{una\_seq} plus
\fname{esp\_burst\_length}. Otherwise the socket is currently waiting
for the first ACK from the receiver, and only packets with a sequence
number up to (including) \fname{announce\_seq} plus
\fname{esp\_initial\_ack\_burst\_length} may be emitted. If the
congestion avoidance finally decides that its valid to emit the given
packet, it is handed to the \fname{esp\_skb\_xmit()} function and
emitted as explained in \cite{mreinhardt}.

A special case this function takes care of is the following: Suppose
the application uses non-blocking I/O and tries to send a message
which exceeds the size of the send buffer. In this case the
\fname{sendmsg()} handler will copy over as much of the message to the
send buffer as possible and return to user-space. As the message
transmission goes on, some buffer space will become free again, which
the application {\em might} fill up. But if it decides not to do so
(which is a perfectly valid thing to do), eventually the last packet
in the send buffer will be sent out, which will not have the TXF flag
set. Because of this, the receiver will not know the sending socket
ran out of data and the retransmission request timer will try to
recover from an apparent packet loss, which did not occur. To avoid
this, the \fname{esp\_push\_one\_frame()} function will set the TXF
flag on this very last packet found in the send buffer prior to its
emission.

\subsection{Managing the ACK queue}
\label{ack-queue}

The ACK queue is the data structure where odd\footnote{Odd in the
  sense that sending them out would be at the risk of causing network
  congestion.} ACKs can be stored until it's time for them to be
transmitted to the sending host, where they will open the send window
and allow for more data to be transfered. The occasions at which an
ACK has to be sent out are:
\begin{itemize}
\item Received data was successfully inserted into the receive queue.
\item Packet loss was detected because of a jump in packet sequence
  numbers.
\item A TXS-flagged packet was received.
\item A memory pressure situation was recently lifted because the
  application read some of the data which was buffered there.
\end{itemize}

\subsubsection{Enqueuing an ACK}

The task of enqueuing an ACK onto the receive queue is performed by
the function \fname{esp\_enqueue\_ack()}. This function takes two
parameters: The socket which wants to enqueue an ACK and a parameter
to indicate additional flags (apart from the mandatory ACK flag),
which the socket wishes to be set on the packet that will be
constructed.

The first action this function performs is to check if the device the
given socket operates on is still up, as there is no point in
enqueuing an ACK for a shut down networking device. If this check
passes, the packet carrying the ACK (and maybe additional flags which
were handed in as the mentioned function parameter) is constructed by
calling \fname{esp\_skb\_alloc\_sk()} which returns a socket buffer
(SKB) containing the needed network and protocol layer headers, but no
data section. This socket is now charged to the socket, which is
important to avoid memory being leaked and for the later
identification of the originating socket of this ACK as well.
Additionally, the socket's \fname{ack\_seq} is set to the current
value of \fname{recv\_seq} to account for this ACK being sent out
sooner or later.

The function \fname{esp\_enqueue\_ack()} is called from numerous
places without prior checking the value of \fname{esp\_recv\_count}.
Because of this, there is chance that the enqueueing of the ACK in
fact is not necessary at all, as there is currently only one socket
receiving data. This is checked here, and if there is only one
receiver, the ACK will be sent out immediately on the fast path, which
skips manipulating the ACK queue.

If this optimization is not possible because of a multiple-sender
situation, the next step is to grab a lock on the ACK queue to allow
its safe manipulation. Now the just created packet is added to the
tail of the ACK queue, thus making it grow. This growing is desired
whenever the size of the queue is less than the current value of
\fname{esp\_recv\_count}, as outlined in section
\ref{handling-multiple-senders}. If this is the case the function is
done for now, releases its lock on the ACK queue and returns.

If the size of queue is greater than or equal to the count of sending
hosts, it is time to send out an acknowledgment that was enqueued by
another socket. Therefore the packet at the head is dequeued and the
lock on the queue is released.  Now the function holds a reference to
a SKB which contains an ACK for some other socket waiting for data.
The last thing to do prior to sending out this ACK is to start the
originating socket's retransmission request timer, in order to give
this socket a chance to notice if anything goes wrong, either with the
transport of this ACK or the data frames the sender will emit in
return when he processes this ACK.

The strategy of always anqueuing at the tail of the ACK queue and
always dequeuing from the head has two positive effects: First it
allows for $O(1)$ access, which is important as access to this queue
happens very often, and performance has to be taken in consideration
to avoid introducing a new bottleneck. Secondly, it ensures that the
available bandwidth is evenly distributed among the competitive
sockets.

\subsection{State Information used by the Flow Control}
\label{state-information}

There is several state information which has to be maintained by the
ESP protocol layer. Some of it is private to the individual socket
allocated, some is shared across all sockets.

\subsubsection{Per-Socket Information}

The per-socket information is stored either within the \fname{struct
  sock} as defined by the Linux kernel, or in its ESP-specific
extension \fname{struct esp\_opt}. These two stuctures are always
allocated and used together and shall not be distinguished here.

\begin{itemize}
\item \fname{sk\_send\_head}: A pointer to the next packet on the
  write queue to be sent out if congestion avoidance allows it. If
  this pointer is \NULL, all packets on the write queue have been sent
  out at least once.
\item \fname{send\_seq}: The sequence number to be used for the next
  data packet constructed; this also is the sequence number of the
  packet at head of the send queue plus one.
\item \fname{recv\_seq}: The sequence number of the next data packet
  expected from the sender.
\item \fname{ack\_seq}: The least sequence number to acknowledge next.
  All packets with a sequence number less than \fname{ack\_seq} have
  been acknowledged at least once.
\item \fname{una\_seq}: The sequence number of the first sent but not
  yet acknowledged packet; therefore the sequence number of the packet
  at the tail of the write queue.
\item \fname{trans\_announced}: Non-zero if there is a currently
  running transmission from this socket announced to the receiver,
  zero otherwise.
\item \fname{recv\_announced}: Non-zero if the sender has announced a
  transmission to this socket, zero otherwise.
\item \fname{announce\_seq}: The sequence number of the last data
  packet sent carrying the TXS flag; if this packet is already
  acknowledged, the full congestion window may be used.
\item \fname{rcv\_mem\_pressure}: Non-zero if this socket is currently
  short of buffer space for the receive queue (and thus under ``memory
  pressure''); zero otherwise.
\end{itemize}

\subsubsection{Global Information}

The global information is shared between all ESP sockets, and its
storage space is allocated upon loading of the ESP kernel module.
Special care has to be taken to either only use atomic operations on
these structures or to use the locking mechanisms as provided by the
Linux kernel to guard critical sections.

\begin{itemize}
\item \fname{esp\_recv\_count}: The number of sockets which have a
  incoming transmission running, as detected by having received a TXS
  flagged packet.
\item \fname{esp\_ack\_queue}: The queue where the acknowledgements
  for the received data may be parked if necessary to avoid congesting
  the network.
\item \fname{esp\_burst\_length}: The congestion window used during
  bulk transfers.
\item \fname{esp\_initial\_ack\_burst\_length}: The congestion window
  used while waiting for the first ACK from the receiver.
\item \fname{esp\_packets\_to\_ack}: The least number of successfully
  enqueued data packets needed to trigger the sending of an ACK.
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: 

