* Low latency diagnostic tools
From: Christoph Lameter @ 2009-08-05 21:10 UTC (permalink / raw)
  To: netdev

I am starting a collection of tools / tips for low latency networking.

lldiag-0.12 is available from
http://www.kernel.org/pub/linux/kernel/people/christoph/lldiag

Corrections and additional tools or references to additional material
welcome.

README:


This tarball contains a series of test programs that have proven
useful for investigating latency issues on networks and Linux systems.

The tools fall roughly into three groups: those dealing with networking,
those for scheduling, and those for cpu cache issues.


Scheduling related tools:
-------------------------

latencytest	Basic tool to measure the impact of scheduling activity.
		Continually samples the TSC and reports statistics on how
		OS scheduling perturbed the sampling loop (see the sketch
		after this list).

latencystat	Query the Linux scheduling counters of a running process.
		This allows observing how the scheduler treats a
		running process.
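
A minimal sketch of the TSC-sampling idea behind latencytest (this is
not the actual lldiag source; it assumes an x86 cpu with a constant-rate
TSC and a compiler that provides __rdtsc()):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>			/* __rdtsc() */

int main(void)
{
	enum { SAMPLES = 1000000 };	/* arbitrary sample count */
	uint64_t last = __rdtsc();
	uint64_t sum = 0, max = 0;

	for (int i = 0; i < SAMPLES; i++) {
		uint64_t now = __rdtsc();
		uint64_t gap = now - last;

		/* A large gap means the loop was interrupted or scheduled out */
		if (gap > max)
			max = gap;
		sum += gap;
		last = now;
	}
	printf("avg gap %llu cycles, max gap %llu cycles\n",
	       (unsigned long long)(sum / SAMPLES),
	       (unsigned long long)max);
	return 0;
}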


Cpu cache related tools
-----------------------

trashcache	Clears all cpu caches. Run this before a test
		to avoid caching effects or to observe the worst-case
		cache situation for latency-critical code (see the
		sketch below).
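
A rough sketch of one way to get a comparable effect (not the trashcache
source; the buffer size is an assumption and must exceed the last-level
cache of the machine under test):

#include <stddef.h>
#include <stdlib.h>

#define FLUSH_BYTES	(64UL * 1024 * 1024)	/* assumed to exceed the LLC */

/* Walk a large buffer so that previously cached application data is
 * evicted. This does not literally invalidate every cache, but it
 * approximates a cold-cache starting point for a latency test. */
void flush_cpu_caches(void)
{
	static volatile char *buf;
	size_t i;

	if (!buf)
		buf = malloc(FLUSH_BYTES);
	if (!buf)
		return;
	for (i = 0; i < FLUSH_BYTES; i += 64)	/* one write per cache line */
		buf[i]++;
}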


Network related tools
---------------------

udpping		Measure ping-pong times for UDP between two hosts
		(mostly used for unicast; see the sketch after this list).

mcast		Generate and analyze multicast traffic on a mesh
		of senders and receivers. mcast is designed to create
		multicast loads that allow one to explore the multicast
		limitations of a network infrastructure. It can generate
		large volumes of multicast traffic at high rates.

mcasttest	Simple multicast latency test with a single
		multicast group between two machines.
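
A minimal sketch of the ping-pong idea behind udpping (not the actual
tool; the port number is an arbitrary example and the peer is assumed
to echo every datagram back):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(7777) };	/* example port */
	char buf[64] = "ping";
	struct timespec t0, t1;
	int i, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || argc < 2 ||
	    inet_pton(AF_INET, argv[1], &peer.sin_addr) != 1)
		return 1;

	for (i = 0; i < 1000; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		sendto(fd, buf, sizeof(buf), 0,
		       (struct sockaddr *)&peer, sizeof(peer));
		recv(fd, buf, sizeof(buf), 0);	/* wait for the echo */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("round trip %ld ns\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
	}
	close(fd);
	return 0;
}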


Libraries:
----------

ll.*		Low latency library. Provides timestamp handling and
		cpu cache detection for an application.



Linux configuration for large numbers of multicast groups
---------------------------------------------------------

/proc/sys/net/core/optmem_max

		Required for multicast metadata storage.
		-ENOBUFS will result if this is too low.

/proc/sys/net/ipv4/igmp_max_memberships

		Limit on the number of MC groups that a single
		socket can join. If more MC groups are joined,
		-ENOBUFS will result.

/proc/sys/net/ipv4/neigh/default/gc_thresh*

		These settings are often too low for heavy
		multicast usage. Each MC group counts as a neighbor.
		Heavy MC use can result in thrashing of the neighbor
		cache. If usage reaches gc_thresh3 then again
		-ENOBUFS will be returned by some system calls.
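
To see where these limits bite, here is a hedged sketch that joins many
multicast groups on one socket until a join fails (the 239.1.x.y group
range and the group count are arbitrary examples):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	int i, fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct ip_mreq mreq;

	if (fd < 0)
		return 1;
	memset(&mreq, 0, sizeof(mreq));
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);

	for (i = 0; i < 5000; i++) {
		/* 239.1.x.y: administratively scoped example addresses */
		mreq.imr_multiaddr.s_addr = htonl(0xEF010000u + (unsigned)i);

		if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
			       &mreq, sizeof(mreq)) < 0) {
			/* typically ENOBUFS once one of the limits is hit */
			printf("join #%d failed: %s\n", i, strerror(errno));
			break;
		}
	}
	return 0;
}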


Reducing network latency
------------------------

Most NICs delay receive interrupts (interrupt coalescing), which adds
latency. ethtool can be used to switch this off, e.g.:

ethtool -C eth0 rx-usecs 0
ethtool -C eth0 rx-frames 1

WARNING: This may cause high interrupt and network processing load
and may limit the throughput of the NIC. Higher values reduce the
frequency of NIC interrupts and batch transfers from the NIC.

By default Linux sends UDP packets immediately, so each sendto()
results in a NIC interaction. To reduce send overhead, multiple
sendto() calls can be coalesced into a single NIC interaction by
setting the MSG_MORE flag whenever it is known that more data will
follow. This creates larger packets, which reduces the load on the
network infrastructure.
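
A hedged sketch of this coalescing (the two-part record is just an
illustration; MSG_MORE makes the kernel hold the data until a send
without the flag completes the datagram):

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Send a header and a body as one UDP datagram using MSG_MORE. */
static void send_record(int fd, const struct sockaddr_in *dst,
			const char *hdr, const char *body)
{
	/* More data follows: nothing goes out on the wire yet */
	sendto(fd, hdr, strlen(hdr), MSG_MORE,
	       (const struct sockaddr *)dst, sizeof(*dst));

	/* No MSG_MORE: the coalesced datagram is sent now */
	sendto(fd, body, strlen(body), 0,
	       (const struct sockaddr *)dst, sizeof(*dst));
}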


Configuring receive and send buffer sizes to reduce packet loss
---------------------------------------------------------------

In general, large receive buffer sizes are recommended in order to
avoid packet loss when receiving data. The smaller the receive buffer,
the less time the application has to pick up data from the socket
before packets are dropped.

For the send side the requirements are the opposite, due to the broken
flow control behavior of the Linux network stack (observed at least
in 2.6.22 - 2.6.30). Packets are accounted against the SO_SNDBUF limit,
and sendto() and friends block a process once more than SO_SNDBUF
bytes are queued on the socket. In theory this should result in the
application being blocked so that the NIC can send at full speed.

However, this is usually defeated by the device drivers. These have
a fixed TX ring size and throw away packets that are pushed to the
driver once the TX ring is full. A fast cpu can lose huge numbers of
packets simply by sending at a rate that the device cannot sustain.

Outbound blocking only works if the SO_SNDBUF limit is lower than
the TX ring size. If SO_SNDBUF is bigger than the TX ring, the kernel
will keep forwarding packets to the network device, which queues them
until the TX ring is full. Any additional packets after that are
tossed by the device driver. It is therefore recommended to configure
the send buffer sizes as small as possible to avoid this problem.
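
A sketch of the buffer sizing recommended here (the byte counts are
placeholders, not tuned values; the right send buffer size depends on
the TX ring of the NIC in use):

#include <sys/socket.h>

/* Large receive buffer, small send buffer. The kernel doubles the
 * values passed in to account for bookkeeping overhead. */
static int tune_udp_buffers(int fd)
{
	int rcvbuf = 4 * 1024 * 1024;	/* placeholder: generous receive side */
	int sndbuf = 64 * 1024;		/* placeholder: keep below TX ring capacity */

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	return 0;
}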

(Some device drivers --including the IPoIB layer-- behave in
a moronic way by queuing a few early packets and then throwing
away the rest until the packets queued first have been sent.
This means outdated data will be sent on the network. The NIC should
toss the oldest packets instead. Best would be not to drop at all until
the limit established by the user through SO_SNDBUF is reached.)

August 5, 2009
	Christoph Lameter <cl@linux-foundation.org>




* Re: Low latency diagnostic tools
From: Mark Smith @ 2009-08-06  0:17 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Hi Christoph,

On Wed, 5 Aug 2009 17:10:09 -0400 (EDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> I am starting a collection of tools / tips for low latency networking.
> 
> lldiag-0.12 is available from
> http://www.kernel.org/pub/linux/kernel/people/christoph/lldiag
> 
> Corrections and additional tools or references to additional material
> welcome.
> 

This implementation of One Way Active Measurement Protocol might be of
interest:

http://www.internet2.edu/performance/owamp/

Some of the performance tuning parts of the README below would also be
useful in the Net area of the Linux Foundation wiki. Possibly the
"Testing" section could be changed to "Testing and Measurement":

http://www.linuxfoundation.org/en/Net:Main_Page

Regards,
Mark.

> [full quote of the README snipped]


* Re: Low latency diagnostic tools
From: Christoph Lameter @ 2009-08-06 14:19 UTC (permalink / raw)
  To: Mark Smith; +Cc: netdev

On Thu, 6 Aug 2009, Mark Smith wrote:

> This implementation of One Way Active Measurement Protocol might be of
> interest:
>
> http://www.internet2.edu/performance/owamp/

It needs an accurate time source, so it won't be of general use unless
everyone adopts PTP or the like. Latencies today are way below 100
microseconds, which is beyond the accuracy provided by NTP.

> Some of the performance tuning parts of the README below would also be
> useful in the Net area of the Linux Foundation wiki. Possibly the
> "Testing" section could be changed to "Testing and Measurement"
>
> http://www.linuxfoundation.org/en/Net:Main_Page

Ok. I will look at that.

