All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8 net-next-2.6] udp: optimisations
@ 2009-11-08 20:16 Eric Dumazet
  2009-11-09  4:55 ` David Miller
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2009-11-08 20:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linux Netdev List, Lucian Adrian Grijincu, Octavian Purdila

This patch series address UDP scalability problems, we failed to solve in 2007
(commit 6aaf47fa48d3c44 INET : IPV4 UDP lookups converted to a 2 pass algo)
we had to revert a bit later.

One of the problem of UDP is its use of a single hash table, with
a key based on local port value only. When many IP addresses are used,
it is possible to have a chain with very large number N of sockets,
lookup time being N/2 in average.

Size of hash table has no effect on this, since all sockets are
chained in one particular slot.

It seems Lucian Adrian Grijincu & Octavian Purdila from IXIACOM have 
real workloads hitting hard this problem and posted a preliminary
patch/RFC, using a second hash, but based on (local address).

I took part of Lucian ideas and my previous patches, and cooked
a clean upgrade path.

With following patches, we might handle 1.000.000+ udp sockets
in linux without major slowdown, and no penalty for common cases.

Thanks

[PATCH 1/8] udp: add a counter into udp_hslot

Adds a counter in udp_hslot to keep an accurate count
of sockets present in chain.

This will permit to upcoming UDP lookup algo to chose
the shortest chain when secondary hash is added.

[PATCH 2/8] udp: split sk_hash into two u16 hashes

nion sk_hash with two u16 hashes for udp (no extra memory taken)

One 16 bits hash on (local port) value (the previous udp 'hash')

One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.

Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)

[PATCH 3/8] udp: secondary hash on (local port, local address)

Extends udp_table to contain a secondary hash table.

socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node

udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)

udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)

Note : No special lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.

[PATCH 4/8] ipv4: udp: optimize unicast RX path

We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.

If too many sockets are listed, we take a look at the secondary
(port, address) hash chain.

We choose the shortest chain and proceed with a RCU lookup on the elected chain.

But, if we chose (port, address) chain, and fail to find a socket on given address,
 we must try another lookup on (port, INADDR_ANY) chain to find sockets not bound
to a particular IP.

-> No extra cost for typical setups, where the first lookup will probabbly
be performed.

RCU lookups everywhere, we dont acquire spinlock.

[PATCH 5/8] ipv6: udp: optimize unicast RX path

Same algo than patch 4, but for ipv6


[PATCH 6/8] ipv4: udp: Optimise multicast reception

UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.

Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.

It's also a base for a future RCU conversion of multicast recption.


[PATCH 7/8] ipv6: udp: Optimise multicast reception

Same optimisation, but for ipv6

[PATCH 8/8] udp: multicast RX should increment SNMP/sk_drops counter in allocation failures

When skb_clone() fails, we should increment sk_drops and SNMP counters.
This fix is not urgent and better done after previous patches.

-------------------------------------------------------------------------------------

Furthers patches could be :

udp: bind() optimisations
udp: multicast uses of secondary hash 
udp: multicast path uses RCU  

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/8 net-next-2.6] udp: optimisations
  2009-11-08 20:16 [PATCH 0/8 net-next-2.6] udp: optimisations Eric Dumazet
@ 2009-11-09  4:55 ` David Miller
  2009-11-09  5:37   ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: David Miller @ 2009-11-09  4:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, lgrijincu, opurdila

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 08 Nov 2009 21:16:56 +0100

> This patch series address UDP scalability problems, we failed to solve in 2007
> (commit 6aaf47fa48d3c44 INET : IPV4 UDP lookups converted to a 2 pass algo)
> we had to revert a bit later.

Looks great, all applied, thanks Eric.

I would even go so far as to say that the cutoff to the second hash
table should be even lower than 10, like maybe 4 or 5.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/8 net-next-2.6] udp: optimisations
  2009-11-09  4:55 ` David Miller
@ 2009-11-09  5:37   ` Eric Dumazet
  2009-11-09 15:38     ` Octavian Purdila
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2009-11-09  5:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, lgrijincu, opurdila

David Miller a écrit :

> 
> Looks great, all applied, thanks Eric.

Thanks David, I'll make the remaining patches too.

> 
> I would even go so far as to say that the cutoff to the second hash
> table should be even lower than 10, like maybe 4 or 5.

Probably, but we want to avoid the secondary way if possible,
as this path might have to traverse two different chains.

(total of three cache line accesses to only take a look at chains head/count)

Maybe we can change the heuristic to take into account this like that :

if (hslot->count > 4) {
	...
	if (hslot->count < hslot2->count * 2)
		goto begin_primary_hash_lookup;

This is tuning, and needs benchmarking.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/8 net-next-2.6] udp: optimisations
  2009-11-09  5:37   ` Eric Dumazet
@ 2009-11-09 15:38     ` Octavian Purdila
  0 siblings, 0 replies; 4+ messages in thread
From: Octavian Purdila @ 2009-11-09 15:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, lgrijincu

On Monday 09 November 2009 07:37:24 you wrote:
> David Miller a écrit :
> > Looks great, all applied, thanks Eric.
> 
> Thanks David, I'll make the remaining patches too.
> 
> > I would even go so far as to say that the cutoff to the second hash
> > table should be even lower than 10, like maybe 4 or 5.
> 
> Probably, but we want to avoid the secondary way if possible,
> as this path might have to traverse two different chains.
> 
> (total of three cache line accesses to only take a look at chains
>  head/count)
> 
> Maybe we can change the heuristic to take into account this like that :
> 
> if (hslot->count > 4) {
> 	...
> 	if (hslot->count < hslot2->count * 2)
> 		goto begin_primary_hash_lookup;
> 
> This is tuning, and needs benchmarking.
> 

Eric, thanks a lot !

Lucian is currently testing it on our setup and once we iron out some IPv6 
issues we are seeing he will follow up with some numbers.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2009-11-09 15:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-08 20:16 [PATCH 0/8 net-next-2.6] udp: optimisations Eric Dumazet
2009-11-09  4:55 ` David Miller
2009-11-09  5:37   ` Eric Dumazet
2009-11-09 15:38     ` Octavian Purdila

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.