\section{Flow Control in a Cluster Environment}

\subsection{Evaluation of Flow Control Schemes for TCP/IP}
\label{tcp-fc-eval}

The key component of flow control for TCP is the size $w$ of the
congestion window. This value determines the amount of data the
sending host may push into the network without having received an
acknowledgement; such data is considered ``in flight''. The window
size is adjusted on the following occasions \cite{ca-and-control}:

\begin{itemize}
\item The receipt of an ACK: $w \leftarrow w + \frac{a}{w}$
\item The detection of packet loss: $w \leftarrow w - bw$
\item The receipt of an ACK during slow-start: $w \leftarrow w + c$
\end{itemize}

Because the different implementations of TCP/IP need to remain
compatible, many aspects of the flow control schemes outlined here can
be expressed in terms of concrete values for the parameters $a$, $b$
and $c$, which are defined in units of the maximum segment size (MSS)
of the link used.

The primary goal of a congestion avoidance algorithm is to choose a
window size $w$ such that neither the network between the
communicating hosts becomes congested nor the receiving host becomes
overloaded and has to drop packets because it runs out of receive
buffer space. TCP uses two values, each of which gives an upper bound
on the size of the congestion window:
%
\begin{itemize}
\item The send buffer size of the sending host and
\item the receive buffer size of the receiving host.
\end{itemize}

The congestion window for TCP is always less than or equal to these
values, as shown in figure~\ref{fig:snd_rcv_scheme}. Because the
sending host cannot know the available buffer space on the
receiver's side, this value is communicated to the sender using a TCP
header field with every ACK packet\footnote{Although this field is
  {\em always} part of the TCP header, it may only be evaluated if the
  ACK flag is set on the received packet.}.

\begin{figure}
  \centering
  \input{./graphics/snd_rcv_scheme.pdf_t}
  \caption[Dependence of congestion, receive and send window
  sizes]{The dependence of the congestion window on the send
    window and the receive window. This figure illustrates how the
    maximum congestion window size for TCP is limited by the send and
    the receive window sizes at the time $S$ receives the ACK (dotted
    arrow) from $R$.}
  \label{fig:snd_rcv_scheme}
\end{figure}

\subsubsection{TCP Reno}
\label{tcp-reno}

Reno is the base flow control algorithm for most modern TCP/IP
implementations. It consists of the following key features: slow
start, fast retransmit and fast recovery.

For TCP Reno the parameters for the equations in
section~\ref{tcp-fc-eval} are $a=1$, $b=0.5$ and $c=1$. The initial
size of the send window is equal to the maximum segment size of the
connection used. In other words, TCP is initially allowed to send one
(maximum-sized) packet and then waits for the incoming
acknowledgement(s). The congestion avoidance algorithm subsequently
decides whether the send window may be enlarged.
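
As a small worked illustration (the starting value $w = 10$ is an
arbitrary assumption made for this example), Reno's parameters $a = 1$
and $b = 0.5$ yield outside of slow start, in units of the MSS,
\begin{eqnarray*}
  w & \leftarrow & 10 + \frac{1}{10} = 10.1
  \enspace \text{(ACK received)} \\
  w & \leftarrow & 10 - 0.5 \cdot 10 = 5
  \enspace \text{(packet loss detected)}
\end{eqnarray*}
Roughly $w$ acknowledgements are therefore needed to grow the window
by one segment, whereas a single loss removes half of it.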

Slow start is the phase in which Reno probes for the bandwidth
available to the communicating hosts. During this phase, the send
window grows by one segment for every incoming acknowledgement, which
doubles it once per round-trip time, until the expected
acknowledgements for the packets sent fail to arrive. This event is
taken as an indication of packet loss and ends the slow start phase.
The congestion window is then halved and the regular data transmission
phase begins.

In this second phase the growth of the congestion window is slower,
though still constantly present. Because of this, the congestion
window will eventually reach a size where the link is congested again
and packets get lost. When this packet loss is detected, the
congestion window is halved again, and the cycle repeats. Because of
this scheme of operation, TCP Reno oscillates around the optimal size
of the send window instead of settling on it.
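
The cost of this oscillation can be estimated roughly under two
simplifying assumptions, which are made only for this sketch: loss
occurs exactly when the window reaches the optimal size
$w_\text{opt}$, and the window grows linearly between two losses. The
window then moves between $w_\text{opt}/2$ and $w_\text{opt}$, so its
mean value is
\[
\bar{w} = \frac{1}{2} \left( \frac{w_\text{opt}}{2} + w_\text{opt}
\right) = \frac{3}{4} \, w_\text{opt} \enspace ,
\]
i.e. in this idealised model about a quarter of the usable window
remains unused on average.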

Fast retransmit and fast recovery as defined in \cite{rfc-2001} are an
extension to the original Reno flow control, which may be used to
detect the loss of packets faster. Following this extension, the
receiver informs the sender about an intermediate packet being lost by
acknowledging the last packet successfully received in sequence again
for every out-of-order packet arriving. The sender notices these
duplicate ACKs, and after receiving the third of them it retransmits
the lost packet and thus avoids having to wait for a timeout to
expire. Fast recovery avoids falling back to slow start upon the
detected packet loss. This is accomplished by observing that each of
the incoming duplicate ACKs indicates that a packet was successfully
received, though out of order. So, for each of these duplicate ACKs
the send window is raised again.

One of the main drawbacks of Reno when used in high-performance
networks is that it interprets every packet loss as the result of
network congestion. From this point of view it is only consistent to
shrink the congestion window whenever packet loss is detected. Reno
{\em needs} packet loss to detect the optimal congestion window.

\subsubsection{High Speed TCP}
\label{hstcp-eval}

High Speed TCP \cite{hstcp} is an approach to improve the bandwidth
utilization of TCP mainly by modifying the TCP response function
\cite{ca-and-control} so that packet loss is tolerated by the
congestion avoidance up to a certain level, without backing off the
window size. This is accomplished by turning the determining
parameters $a$ and $b$ into functions of the current size of the send
window $w$, which are found to be
\begin{eqnarray*}
  a(w) & = & \frac {w^2 \cdot 2.0 \cdot b(w) \cdot P} {2.0 - b(w)}\\
  b(w) & = & (B - 0.5) \frac {\log(w) - \log(W)} {\log(W_1) - \log(W)}
  + 0.5 \\
  a(w) & = & 1 \enspace \text{for} \enspace w \le W\\
  b(w) & = & 0.5 \enspace \text{for} \enspace w \le W\\
  B    & = & b(W_1) \\
\end{eqnarray*}
%
This allows the shape of the response function (which determines the
window size) to be tuned to be more aggressive than the original
response function once the window size $w$ has grown beyond $W$, by
choosing appropriate values for $W_1$ and $P$, where $W_1$ is the
desired size of the send window for a packet loss rate of $P$.
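
As a numerical illustration, assume the values $W = 38$, $W_1 =
83000$, $P = 10^{-7}$ and $B = 0.1$. These values are not given in the
text above; they are understood to be the defaults suggested in the
High Speed TCP specification and serve here only as an example.
Evaluating the functions at $w = W_1$ gives
\begin{eqnarray*}
  b(W_1) & = & (0.1 - 0.5) \cdot \frac {\log(W_1) - \log(W)}
  {\log(W_1) - \log(W)} + 0.5 = 0.1 \\
  a(W_1) & = & \frac {83000^2 \cdot 2.0 \cdot 0.1 \cdot 10^{-7}}
  {2.0 - 0.1} \approx 72.5
\end{eqnarray*}
Under these assumptions the window grows by about 72 segments per
round-trip time at $w = W_1$ (compared to one segment for Reno), and a
detected loss removes only 10\,\% of the window (compared to 50\,\%
for Reno).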

The benefit of this modification is that High Speed TCP has a better
tolerance for packet loss, and accepts a certain loss rate as being
unavoidable in a computer network. Still, High Speed TCP oscillates
around the optimal congestion window size, as Reno does.

\subsubsection{TCP Vegas}
\label{tcp-vegas}

TCP Vegas was developed by Lawrence S. Brakmo et al. \cite{tcp-vegas}
and introduces several changes to the sender side of a TCP connection.
These changes lead to the remarkable result of improving the
throughput by 40--70\,\% and lowering the number of retransmissions to
between one-half and one-fifth of those of Reno. A brief overview of
how this is achieved is given here.

Vegas uses a new mechanism to decide when to retransmit, which uses
the receipt of an ACK in certain situations as a trigger to check
whether a timeout should happen. This allows necessary retransmissions
to be sent out earlier. Secondly, Vegas uses a modified window sizing
when congestion is detected. Basically, if the detected packet loss
concerns only packets which were sent out {\em before} the last shrink
of the congestion window, this is not taken as an indication to shrink
the congestion window again. The reason is that the old window might
have been too large, while the new window may well be sized
appropriately. Vegas also incorporates a spike-suppression facility
which aims to distribute the emitted data packets more evenly over
time. Lastly, Vegas eliminates Reno's need for packet loss to estimate
the possible throughput (and thus the congestion window size $w$).
Instead, this is done by comparing the actual throughput to the
expected throughput, which is:
\[
\text{Expected throughput} = \frac{w}{\text{baseRTT}}
\]
If the actual throughput is close to the expected throughput, the
window size is increased; if it falls short of the expected throughput
by too much, the window size is decreased. Two threshold values on
this difference define a range in which the window size is left
unchanged, which prevents it from oscillating.
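
A numerical sketch may illustrate this comparison. All values are
assumptions chosen for this example: $w = 30$ packets,
$\text{baseRTT} = 10$\,ms, a currently measured round-trip time of
12\,ms, and an upper threshold of 3 packets.
\begin{eqnarray*}
  \text{Expected throughput} & = & \frac{30}{0.010\,\text{s}} = 3000
  \enspace \text{packets/s} \\
  \text{Actual throughput} & = & \frac{30}{0.012\,\text{s}} = 2500
  \enspace \text{packets/s} \\
  \text{Difference} \cdot \text{baseRTT} & = & 500 \cdot 0.010 = 5
  \enspace \text{packets}
\end{eqnarray*}
The difference corresponds to about five packets queued inside the
network. As this exceeds the assumed upper threshold of three packets,
the Vegas scheme as published would reduce $w$ in this situation,
without any packet having been lost.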

The original Vegas implementation had an issue when rerouting led to a
new route with a larger propagation delay, as this could not be
detected and thus Vegas' value for baseRTT was not updated. This has
been solved recently, though \cite{vegas-improved}. All these
advantages make it desirable for Vegas to become the default TCP
implementation on the internet soon.

\subsubsection{Conclusion}

The basic approach of all flow control schemes mentioned in this
section is to constantly try to go faster and to back off whenever
packet loss is detected. This is a reasonable thing to do in
large-scale networks like the internet, where the available bandwidth
may change all the time.

In a cluster environment, several of the challenges TCP tries to
address with this strategy simply do not exist or are limited to
well-known situations. Problems arise especially if multiple senders
try to send to a single receiver. In this situation, if one of the
senders uses a send window size which is too large and causes
congestion, it is quite possible that as a result a packet of {\em
  another} sender gets lost. Consequently, the sender whose packet was
lost will back off, while the sender that caused the congestion will
keep increasing its congestion window. This situation can be seen as a
set of dependent control loops (one loop for each sender) which
``fight'' for the available bandwidth, and it can take considerable
time for this system to reach equilibrium.

\subsection{Design of a Flow Control Scheme suitable for Clusters}
\label{fc-design}

The most important goals for the flow control scheme to be developed
are:

\begin{itemize}
\item Reliable data transfer
\item Reach high speeds without requiring unrealistically low packet
  loss rates
\item Preserve ESP's low latency and processing overhead
\item Keep the header as small as possible for maximum efficiency
\end{itemize}

As described in section~\ref{cluster-simplifications}, several
assumptions about the available bandwidth can be made in a cluster
environment which do not hold true for large-scale networks. The flow
control scheme developed in this section tries to exploit this
knowledge in order to achieve better results than TCP does, aiming
especially to reduce packet loss significantly in multiple-sender
situations.

Foremost, the bandwidth is constant over time. This eliminates the
need to probe for it over and over again by trying to send ``some
more'' data as TCP does. Using a constant value as the size of the
congestion window $w$ (chosen large enough to saturate the link), and
using it whenever possible, seems like a good starting point for a
flow control scheme in a cluster. It has the potential to keep up with
TCP/IP flow control schemes, no matter how well the heuristic
controlling their probing process is chosen. There is no
congestion-induced packet loss if the number of packets in flight with
a common destination host, $P_\text{cd}$, is limited so that
\begin{equation}
\label{eqn:packets-in-flight}
P_\text{cd} \le
\left\lceil\frac{S_\text{buff}}{p_\text{max}}\right\rceil \enspace ,
\end{equation}
with $S_\text{buff}$ being the per-port buffer size on the switch and
$p_\text{max}$ being the maximum packet size of the ethernet.

\subsubsection{Basic Principles}
\label{basics}

The basic idea for resolving the fight for bandwidth that TCP senders
engage in during a multiple-sender situation is to explicitly organize
the time-division multiplexing (TDM) which inevitably occurs on the
wire when two or more hosts concurrently send data to the same
destination host. Each of the competing senders is given a time slice
it may use to send out some packets carrying data.

The task of managing the TDM is assigned to the receiving host of the
multiple-sender constellation. This host is naturally involved in each
of the competing transmissions, and therefore has the ability to
detect (and manage) a multiple-sender situation. The fundamental
principles used to achieve this aim are the following:

\begin{enumerate}
\item The window size $w$ is a constant value which is common to all
  hosts within the ethernet's broadcast domain (i.e. the cluster). It
  must be chosen large enough to allow saturating the link.
\item A sending host may emit data into the network as long as the
  congestion window allows it. The sender does not make any
  assumptions about the round-trip time. In particular, it does not
  retransmit\footnote{An exception to this rule is explained in
    section~\ref{requesting-slots}.} unless it is asked to do so.
  Incoming acknowledgements may open the congestion window, allowing
  some more data to be emitted.
\item A receiving host takes care not to send out too many
  acknowledgements to the sending host(s) to prevent the per-port
  buffer on the switch from overflowing. When the receiver detects
  packet loss it must request a retransmission of the lost data from
  the sender.
\end{enumerate}

\subsubsection{Handling of Receiver Congestion}

By following the rules defined in section~\ref{basics}, developing a
scheme for handling receiver congestion is straightforward. As
mentioned in section~\ref{recv-congestion}, the receiving host can
easily detect receiver congestion by comparing the amount of memory
currently used for the receive buffers to the fixed quota which is set
on this value.

If the quota does not allow for storing another burst of $w$
full-sized packets, the receiver does not acknowledge the receipt of
data to the sending host. As a result the sender will assume the data
is still in flight, and by obeying the congestion window size it will
stop sending new data. This gives the application on the receiver side
the chance to consume the data currently in the receive buffer,
thereby making room for new data. Once there is sufficient free space,
the receiver acknowledges the receipt of the data, which opens the
congestion window on the sender side and allows the transfer of
additional data.

There are two drawbacks which arise from this way of handling receiver
congestion. First, the utilization of the receive buffer is not
optimal, as on average $\lceil w / 2 \rceil \cdot p_\text{max}$ bytes
will remain unused there when receiver congestion is entered. It is
worth mentioning that this memory is not wasted; it just is not used
even though the value set by the user (or the protocol default) for
the socket's receive buffer size would allow it. This memory is never
allocated, so choosing a slightly larger quota for the receive buffer
fully compensates for this effect.
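
To give an impression of the magnitude, assume $w = 16$ and
$p_\text{max} = 1500$\,bytes together with a receive buffer quota of
256\,kbyte ($262144$\,bytes); all three values are chosen for
illustration only. The average unused share then is
\[
\left\lceil \frac{16}{2} \right\rceil \cdot 1500\,\text{bytes} =
12000\,\text{bytes} \enspace ,
\]
which is roughly 4.6\,\% of the assumed quota.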

On the sender side the situation is a little worse. The sender has to
store every sent packet in the send queue and may drop the packets
there only once their receipt has been acknowledged. By not sending an
ACK frame the receiving host leaves the sender uncertain about the
successful transmission of the data, which will therefore have to
remain queued. This prevents some memory from being deallocated which
could have been freed if the sender had known about the successful
transmission.

\subsubsection{Handling Retransmissions}

Packet loss is a problem even the ideal congestion avoidance algorithm
(in the sense of never forcing a store-and-forward device to drop a
packet nor overloading the processing capacity of the receiver) will
have to cope with, as an ethernet with a packet corruption rate of 0
has yet to be developed.

As proposed, the usual retransmission scheme which TCP utilizes shall
not be used here. Because the sender does not make any assumptions
about the round-trip time, it would not know when to schedule a
retransmission timer anyway.

Therefore, instead of letting the sender decide when to retransmit by
using a timeout value or by inferring from duplicate ACKs that some
data was lost, as practised by the fast-retransmit algorithm
\cite{rfc-2001} of the TCP/IP protocol, a new flag that packets may
carry is introduced. This new flag is called the retransmission
request (RRQ) flag. An RRQ-flagged frame must always have the ACK flag
set as well. The acknowledging aspect of this frame tells the sender
up to which packet the data was received successfully, and the RRQ
instructs it to retransmit as many frames, counting from the first
still unacknowledged frame, as the size of the send window allows.

There are two occasions which result in an RRQ being sent. One is the
receipt of a frame with a packet sequence number which is higher than
the next expected sequence number. The sequence number of the next
expected data frame is recorded with the socket and monotonically
increased\footnote{In fact there is a wrap-around. So adding $1$ to
  $2^{16}-1$ leads to the next sequence number expected being 0.} by
one for every data frame successfully enqueued onto the receive queue.
As there are no loops in the network, packet reordering does not have
to be taken into account here, and a jump in packet sequence numbers
really means that the packets in between have been lost. This is
sufficient for detecting most packet losses, but not all of them.
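
One possible way to express this check in the presence of the
wrap-around is sketched below. The half-range convention used to
separate ``newer'' from ``older'' sequence numbers is an assumption of
this sketch and not taken from the protocol description. With
$s_\text{exp}$ being the next expected and $s_\text{recv}$ the
received sequence number, let
\[
d = (s_\text{recv} - s_\text{exp}) \bmod 2^{16} \enspace .
\]
Then $d = 0$ denotes the expected frame, while $0 < d < 2^{15}$
denotes a jump with $d$ frames missing. For example, $s_\text{exp} =
65535$ and $s_\text{recv} = 1$ give $d = (1 - 65535) \bmod 65536 = 2$,
i.e. the two frames numbered 65535 and 0 are missing.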

The situation in which this does not work is when a contiguous run of
frames at the end of the bulk transmission is lost, including the
worst case where every single frame sent in response to the last ACK
was lost. This situation is handled by a timeout, which is given by
the round-trip time of the network.

The receiver also has to know when the sender is willing to send more
data (and thus when to have an RRQ timeout scheduled) and, conversely,
when the sender has finished the current transfer and this timer
should be stopped. This problem is handled by two more flags, as
explained in the next section.

\subsubsection{Requesting Transmission Slots}
\label{requesting-slots}

The problem that still remains is how the receiver can decide whether
there is still data to be expected from the sender. Just having an
open connection in no way means that there currently is data to be
exchanged. It is perfectly legal to set up a connection between two
peers and close it afterwards without exchanging a single byte of
payload over this connection. But the receiver needs to know for which
sockets to have the RRQ timer running. If the timer were simply
running for every socket in a connected state, the protocol would not
scale well in the presence of many connections.

The solution chosen to overcome this was to introduce two more flags.
These are called ``start of transmission'' (TXS) and ``transmission
finished'' (TXF). The sender has to set the TXS flag on the first
packet of a message to be transmitted and the TXF flag on the last
packet of this message. Messages of a length less than or equal to the
maximum segment size of the network, which fit into a single packet,
should have both the TXS and the TXF flag set. By evaluating these
flags the receiver can easily decide when to start or stop the
retransmission request timer.
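
As an illustration, assume a maximum segment size of 1489\,bytes (the
value is an assumption for this example). A message of $m$ bytes is
then split into $\lceil m / 1489 \rceil$ packets:
\begin{eqnarray*}
  m = 4000\,\text{bytes} & : & \left\lceil \frac{4000}{1489}
  \right\rceil = 3 \enspace \text{packets, flagged TXS, (none), TXF}
  \\
  m = 1000\,\text{bytes} & : & \left\lceil \frac{1000}{1489}
  \right\rceil = 1 \enspace \text{packet, flagged TXS and TXF}
\end{eqnarray*}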

There is one pitfall in this way of handling the scheduling of the RRQ
timer, which becomes apparent if the packet carrying the TXS flag is
lost. As the receiver in this case never gets to know that there is
data pending to be transmitted, it will not schedule the
retransmission request timer, causing the transmission to linger
forever. To overcome this the sender keeps retransmitting the packet
carrying the TXS flag until its successful receipt is acknowledged.

A similar problem arises for the packet carrying the TXF flag. When
this packet has successfully arrived at the receiver, the receiver
sends an ACK frame to the sender. This final ACK can even be sent
without taking care of network or receiver congestion, as it will not
trigger any new data to be sent. It just allows the sender to clear
its send queue. But if this final ACK is lost, this cleanup will never
happen. That is why a retransmission timer is started on the sender
side which keeps sending this TXF-flagged packet until the
acknowledgement for it is received.

\subsubsection{Deciding when to send an Acknowledgement}
\label{when-to-send-ack}

The receiver has to send out acknowledgements to the sender to open
its congestion window and thus allow more data packets to be sent out.
Additionally the sender may remove all acknowledged packets from its
write queue to allow the sending application to put more data in
there, if it still has some part of the message pending.

Obviously there is some leeway in choosing $r_\text{ACK}$, the number
of data frames received per acknowledgement sent. The lower bound is
to send an ACK for each and every data frame received. This would work
as expected but adds significant processing overhead \cite{mreinhardt}
on the sending as well as on the receiving side, which seems
unnecessary. The upper bound is given by the window size $w$. If
$r_\text{ACK}$ were greater than $w$, this would in practice mean that
no ACKs are sent at all. Any progress in data transmission would rely
solely on the RRQ timer to kick in and send out the ACK, resulting in
a very jerky data transmission.

The upper bound has to be lowered even further to account for the time
it takes for the ACK to be constructed, sent out, transferred over the
network and finally received and processed by the sender. It is
desirable that this time overlaps with the time it takes for the last
frames of data within the current congestion window to be emitted by
the sending host. This gives a new upper bound of
%
\begin{equation}
\label{eqn:packets-to-ack}
r_\text{ACK} \le w - \left\lceil\frac{T_\text{sat}}{2}\right\rceil
\enspace ,
\end{equation}
%
where $T_\text{sat}$ is the minimum sending rate needed to saturate
the link, measured in MSS-sized packets per round-trip time (PPR). In
practice this value should be lowered even further to keep the
ethernet device sending in case there is a small hiccup (e.g. because
the kernel decides to swap some pages out or similar) preventing the
ACK from being processed immediately. In general, it's advisable to
choose $r_\text{ACK}$ as large as possible (respecting the above
limitations) in order to minimize the processing overhead.
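
For example, with the assumed values $w = 16$ and $T_\text{sat} = 6$
(both chosen for illustration only), the bound evaluates to
\[
r_\text{ACK} \le 16 - \left\lceil \frac{6}{2} \right\rceil = 13
\enspace ,
\]
so an acknowledgement would be sent at the latest after every 13th
data frame received. For an odd value such as $T_\text{sat} = 7$ the
ceiling rounds up, giving $r_\text{ACK} \le 16 - 4 = 12$.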

\subsubsection{Handling multiple Senders}
\label{handling-multiple-senders}

The task of handling situations when more than one host wants to send
data to a common receiving host is handled completely by the receiver.
As mentioned in section~\ref{basics}, the receiver has to take care
not to emit too many acknowledgements to prevent network as well as
receiver congestion. The other aim that should be achieved is to keep
the link saturated.

When there is only a single sender these requirements are fulfilled by
simply choosing $w$ large enough while sending out ACKs just in time
(see equation \eqref{eqn:packets-to-ack}), so the sender keeps
emitting packets without pause until the full message is transferred.
When receiver congestion threatens, the receiver may simply stop
sending ACKs and the sender will pause until the next ACK opens the
congestion window again. This scheme gives a nice yet simple way to
avoid congestion by exploiting the simplifications a cluster
environment allows, but it is still focused on a single connection
(i.e. a pair of connected sockets).

In the presence of multiple senders, their values for $w$ add up. With
$n$ being the number of senders, it is conceivable that $n \cdot w$,
which is the upper bound on the number of packets in flight with a
common destination host $P_\text{cd}$, violates the rule of
equation~\eqref{eqn:packets-in-flight}. Broadening the view to all
sockets currently receiving data on the current host leads to a
surprisingly simple way to get around this. All that is needed is an
instance which holds back the excess acknowledgements to prevent too
many senders from emitting packets addressed to the common receiver.
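
A numerical example may illustrate the problem. Assume $n = 4$
senders, $w = 16$, a per-port buffer of $S_\text{buff} =
60000$\,bytes and $p_\text{max} = 1500$\,bytes; all of these values
are hypothetical. Then
\[
n \cdot w = 4 \cdot 16 = 64 \quad > \quad \left\lceil
\frac{60000}{1500} \right\rceil = 40 \enspace ,
\]
so four senders with fully opened windows could exceed the limit of
equation~\eqref{eqn:packets-in-flight}, whereas two senders ($2 \cdot
16 = 32 \le 40$) could not.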

All this instance has to know is the number of currently active
incoming transmissions $n_\text{recv}$, which can easily be
incremented when a TXS-flagged packet is received and decremented upon
the receipt of a TXF-flagged packet. Now whenever a socket determines
that it should send out an ACK because $r_\text{ACK}$ packets have
been received, this ACK is handed to this mediating instance, which
has to make sure that at most $n_\text{recv} - 1$ ACKs are held back
at any time.\footnote{The number of ACKs held back may be less than
  $n_\text{recv} - 1$ because of sockets already having incremented
  the receiver count but not yet received $r_\text{ACK}$ packets.}

In order to be fair to the sending hosts it is advisable to use a
first-in first-out scheme on the ACKs held back. But this can be
changed to any order, for example to give preference to sockets with a
higher priority in further extensions, if desired.
Figure~\ref{fig:2snd_rcv_scheme} illustrates the exchange of packets
and acknowledgements for a situation with two senders and one
receiver.

\begin{figure}
  \centering
  \input{./graphics/2snd_rcv_scheme.pdf_t}
  \caption[Handling a multiple-sender transmission]{Schematic diagram
    of a multiple-sender transmission where the receiver services the
    senders in a first-in first-out fashion. $S_i$ are sending hosts,
    $R$ is the receiver. Dotted lines are acknowledgements for
    received data, solid lines indicate data transmission.}
  \label{fig:2snd_rcv_scheme}
\end{figure}

\subsubsection{Additional Differences to TCP}

The meaning of the sequence numbers used by TCP and ESP differs
slightly as a result of exploiting another simplification a cluster
environment offers: While the sequence numbers of TCP in fact count
bytes, for ESP they simply count packets of an arbitrary size, which
is limited by the maximum segment size possible on the network.

TCP's need for the finer granularity of the sequence numbers arises
from the possibility of fragmentation \cite{original-tcp}. For TCP, an
$n$-byte packet with a sequence number of $s$ could at worst be
fragmented into $n$ 1-byte packets by any intermediate router. Each of
these 1-byte packets still needs an individual sequence number in the
range $[s, \enspace s+n-1]$ to allow the reconstruction of the
original message in case these packets arrive out of order at the
receiving host. Therefore, after an $n$-byte packet with sequence
number $s$ has been sent out, the next packet must carry the sequence
number $s + n$ to avoid collisions between sequence numbers. As
fragmentation does not occur in a switched ethernet (which resides in
layer~2 according to the OSI reference model \cite{osi-model}), just
numbering the packets is sufficient for ESP.

The unit of the window size needs to be a multiple of the unit of the
sequence numbers because the send window is always defined {\em
  relative} to the current position in the data stream and there is no
such thing as fractional sequence numbers. For TCP the unit of the
window size was originally chosen to be 1\,byte. Additionally, the
free space in the receive buffer, which the receiver communicates to
the sender in every acknowledgement, is carried in a 16-bit field.
This gives an upper bound on the size of the congestion window of only
$2^{16}\,\text{bytes} \approx 65$\,kbyte. Furthermore, the maximum
window size must be less than or equal to the maximum sequence number
possible. This prevents two packets with the same sequence number from
being in flight at the same time; such packets would be
indistinguishable and might be mixed up by the receiver if packet
reordering occurs.

Because the sender may emit data frames only if the receipt of
previous frames is acknowledged, the throughput of a connection is
limited by $w_\text{max} / \text{RTT}$ (with $w_\text{max}$ being the
maximum congestion window size possible and RTT being the round-trip
time).  Therefore a small maximum window size directly influences the
maximum possible throughput, especially on long-delay links.
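
To illustrate the effect, take $w_\text{max} = 2^{16}$\,bytes $=
524288$\,bits and two assumed round-trip times, 100\,ms for a
long-delay link and 100\,$\mu$s for a cluster network (both values are
chosen for illustration only):
\begin{eqnarray*}
  \frac{524288\,\text{bits}}{100\,\text{ms}} & \approx & 5.2
  \enspace \text{Mbit/s} \\
  \frac{524288\,\text{bits}}{100\,\mu\text{s}} & \approx & 5.2
  \enspace \text{Gbit/s}
\end{eqnarray*}
The same window size that throttles a long-delay link to a few
megabits per second would thus not be the limiting factor on a short
cluster link of up to several gigabits per second.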

This limitation became apparent very soon and was lifted by the TCP
window scale option as defined in \cite{tcp-window-scale}. The window
scale option uses a previously unused part of the TCP header to agree
on a scaling factor (being a power of two), which is applied to the
window size. This factor is negotiated during the connection
handshake, and allows TCP to use window sizes up to
$2^{30}\,\text{bytes} = 1$\,Gbyte by trading granularity for range.
This leaves plenty of margin to saturate a link with both a high
throughput and long round-trip times.

For ESP, the unit of the window size was chosen to be equal to the
maximum segment size of the ethernet it runs atop. Assuming an MTU of
1500\,bytes this gives 1489\,bytes (1500\,bytes $-$ 11\,bytes for the
ESP header). ESP also uses a 16-bit field in its header for
sequencing, so that the maximum amount of data in flight is limited to
$2^{16} \cdot 1489\,\text{bytes} \approx 93$\,Mbytes, which is enough
to saturate even very fast links, especially because of the short
round-trip times experienced in a cluster environment.
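
Written out, the product is $2^{16} \cdot 1489\,\text{bytes} =
97583104\,\text{bytes}$, which corresponds to about 93\,Mbytes when
counted in units of $2^{20}$\,bytes. Assuming a round-trip time of
100\,$\mu$s (a value chosen for illustration only), the resulting
bound on the throughput is
\[
\frac{97583104 \cdot 8\,\text{bits}}{100\,\mu\text{s}} \approx 7.8
\cdot 10^{12} \enspace \text{bit/s} \enspace ,
\]
which is several orders of magnitude above the rate of a gigabit
link.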

Another simplification compared to TCP concerns the push flag (PSH).
The purpose of this flag is to discourage intermediate routers from
collecting several small packets and merging them into a bigger one.
While this behavior is mostly favourable, it might not be appropriate
under certain circumstances. For example, dialog-based applications
like shell access or web browsing would show unacceptable response
times. These applications typically send out a small message which
acts as a request (e.g. the HTTP GET request as defined in
\cite{rfc-2616}) and then wait for a response. It does not make sense
for an intermediate router to wait for additional packets from the
client to append to the packet containing the request, as the client
for its part is already waiting for the server's response. Therefore,
a facility is needed to flag these packets for immediate delivery,
which is the PSH flag. Because ethernet does not know the concept of
segmentation (and, conversely, reassembly), there is no need for a
push flag or a similar flag in ESP.

\subsubsection{Conclusion}

Freed from the need for compatibility with existing flow control
schemes, as required for the modifications to TCP mentioned in
section~\ref{tcp-fc-eval}, a new scheme of flow control could be
developed which is truly different. It has the potential to utilize
the bandwidth available in a cluster environment very well, and it is
free of heuristics trying to measure values which are known a priori
in such a tightly coupled network.

In addition, the proposed flow control scheme needs very little
per-packet header space. Only three flags were introduced, which means
only three bits are needed for the purposes of congestion avoidance
and flow control. As the flags field of the original ESP
implementation still had some bits to spare, the size of the header in
fact did not grow at all. With a header size of 11\,bytes for ESP and
an MTU of 1500\,bytes\footnote{If all stations within a broadcast
  domain in an ethernet support this protocol extension, jumbo frames
  may be used. Jumbo frames are typically sized 9--16\,kB. However,
  choosing a frame size larger than 12\,kB does not seem advisable
  because the 32-bit CRC used by ethernet loses its effectiveness
  there. This is especially true for ESP as it does not perform
  additional checksumming.}, the header occupies only 0.7\,\% of a
packet, while TCP/IP uses 2.7\,\% (20\,bytes for the IP header plus
20\,bytes for the TCP header).
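
These percentages follow directly from the header sizes. The second
pair of values below additionally assumes a jumbo frame MTU of
9000\,bytes, which is a hypothetical setting used only for comparison:
\begin{eqnarray*}
  \text{MTU } 1500\,\text{bytes:} & & \frac{11}{1500} \approx
  0.73\,\% \enspace \text{(ESP)}, \quad \frac{40}{1500} \approx
  2.67\,\% \enspace \text{(TCP/IP)} \\
  \text{MTU } 9000\,\text{bytes:} & & \frac{11}{9000} \approx
  0.12\,\% \enspace \text{(ESP)}, \quad \frac{40}{9000} \approx
  0.44\,\% \enspace \text{(TCP/IP)}
\end{eqnarray*}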

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: 

