\section{Congestion Avoidance}

\subsection{Different Types of Congestion}

There are two kinds of congestion which can occur in a computer
network. These are receiver congestion and network congestion, which
will be explained in more detail below, observing the specifics of a
cluster environment.

\subsubsection{Receiver Congestion}
\label{recv-congestion}
Receiver congestion occurs if the recipient of data is unable to
process it at the same speed the data arrives at it's network
interface. The networking protocol stores the data received into
socket buffers which are a stopover before the user-space application
which initiated the communication is ready to consume the data
received by calling \fname{recvmsg} or an equivalent. If this
application does expensive computation on the data or experiences
shortage in I/O performance, the networking protocol may run out of
socket buffers.  When there are no more socket buffers available it
will be forced to drop incoming packets, which are lost then.

Receiver congestion is very easy to detect by the software
implementing the network protocol because it knows when it runs out of
socket buffers for storing the packets coming in from the network.
More specifically it even knows exactly how many buffer space it has
in spare at any time.

\subsubsection{Network Congestion}
\label{network-congestion}

\begin{figure}
  \centering
  \input{./graphics/congestion/switch-and-peers.pdf_t}
  \caption[Topology of a cluster environment]{Topology of a cluster
    environment. Each of the peers is directly and equally connected
    to its port on the switch. The ports on their part are connected
    to the switch's backplane. Thick lines are part of data
    transmission paths, and line width is proportional to available
    bandwidth.}
  \label{cluster-topology}
\end{figure}

Basically, network congestion means that the network can not bear the
amount of data currently transmitted between communication hosts
\cite{rfc-2914}. Network congestion can happen wherever in a network a
store-and-forward approch is used to mediate between a section $N_1$
offering a bandwidth $C_1$ and a section $N_2$ offering bandwidth
$C_2$ with $C_2 > C_1$. Store-and-forward means that there is a buffer
where the packets for $N_2$ are stored at rate $C_1$, before they are
forwarded to their destination.

As long as this buffer is not empty, the data is sent out to $N_2$ at
rate $C_2$ and finally removed from the buffer. A lower bound on the
size $S_\text{buff}$ of this buffer for a network with a maximum
packet size of $p_\text{max}$ bytes is given by the equation:

\begin{equation}
\label{eq:min-buffer}
S_\text{buff} \ge \frac{C_1}{C_2} \cdot p_\text{max}
\end{equation}

A typical size for this buffer in ethernet switches is around 128\,kB.
Whenever this buffer is full and an additional packet arrives from
$N_1$ this packet has to be dropped and is lost. That is what commonly
is referred to as network congestion.

In a cluster environment this can happen for data leaving the
backplane and entering the per-peer port buffer\footnote{Actually
  today's switches know another mode of operation called cut-through
  switching. Rather than storing and forwarding full packets the
  switch only waits for the packet's header to arrive, so it knows the
  destination address. As soon as the responsible port is known, data
  is forwarded there. This mode offers lower latency but it is prone
  to collisions. It is disabled when collisions become too frequent,
  e.g. more than one peer tries to send data to a single receiving
  peer.}, but it becomes apparent only if the bandwidth actually
\textit{utilized} in $N_1$ is greater than $C_2$. Because all peers
are connected to the switch with equal upstream bandwidth, two or more
peers have to send data to a common third peer at full speed to
generate congestion. Then data for the receiving peer comes in at a
rate several times higher than it can be forwarded to it's
destination. The per-port buffer of the switch will fill up, and
eventually the switch will be forced to start dropping packets.

Recapulating, in a typical cluster environment, where all peers are
interconnected by one or more switches using a backplane capable of
offering full bisection bandwidth, the chances for network congestion
to occure are rather limited to certain well-known situations.

\subsection{Preventing Network Congestion}

Network congestion can be prevented by limiting the rate at which data
is emitted into the network by the involved hosts. The natural way of
doing so would be to obey a limit of the form
$$
\frac{\text{data}}{t} \le C_{\text{limit}}
$$

Unfortunately, this leads the problem that $t$ is surprisingly hard to
measure. It is virtually impossible to have clocks with sufficient low
granularity under Linux these days, as with any general purpose (i.e.
non-realtime) operating system. A typical granularity at which tasks
can be scheduled in these environments is 10\,ms or worse. A gigabit
ethernet device would have sent out about 1.28\,MBytes of data in this
time frame, which is far too much as a base for implementing a
reasonable flow control.

Observing the fact that transmission speed at physical layer in the
OSI reference model \cite{osi-model} is constant for ethernet
networking devices (as for almost every networking device) leads to
the insight, that limiting the amount of data which is ``in flight''
on the network is sufficient for preventing network congestion. The
term in flight refers to data which has left the networking device of
the sending peer but has not yet arrived completely at the networking
device of the receiving peer.  Being pessimistic it has to be assumed
that this data is currently stored in a buffer threatened by overflow
as outlined in section~\ref{network-congestion}.

\subsection{Simplifications in a Cluster Environment}
\label{cluster-simplifications}

Typically, congestion avoidance on a large-scale network as the
internet is a task at the mercy of many unknown quantities. The
available bandwidth may change at any time, new routes may appear,
previosly used routes may disappear \cite{tcp-congestion}. In order to
develop a congestion avoidance scheme suitable for clusters, the
following simplifications will be exploited in
section~\ref{fc-design}:

\begin{itemize}
\item The bandwidth available between two peers chosen at random from
  the set of peers in the cluster is the same for any two peers.
\item This bandwidth available between two peers is constant over
  time, given those two peers do not communicate with other peers.
\item The bandwidth available to a group of peers is completely
  independent of the bandwidth consumed by any other group of peers as
  long as there is no communication between these two groups. This is
  a direct conclusion of the claim for a network offering full
  bisection bandwidth.
\item There is only one route between two peers at any time. (Even if
  physically loops exist in the topology, techniques like the
  spanning-tree protocol \cite{stp-explained} are exerted by the
  switches to logically eliminate them.) This means that packets are
  neither duplicated nor reordered.
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% IspellDict: "english"
%%% End: 
