* many sockets, slow sendto
@ 2007-03-06 15:08 Zaccomer Lajos
  2007-03-06 18:20 ` Baruch Even
  2007-03-06 19:23 ` Andi Kleen
  0 siblings, 2 replies; 23+ messages in thread
From: Zaccomer Lajos @ 2007-03-06 15:08 UTC (permalink / raw)
  To: netdev

Hi,



I'm playing around with a simulation in which many thousands of IP
addresses (on interface aliases) are used to send/receive TCP/UDP
packets. I noticed that the time of send/sendto increased linearly
with the number of file descriptors, and I found it rather strange. I
searched for some related topics, but I couldn't find any (I might
have chosen the wrong keywords).

Though I'm a real kernel newbie, I made up my mind to follow what is
happening in the deep. I used the printk and do_gettimeofday
functions to print timestamps at certain points of the execution path
to identify the bottleneck. Each socket is bound to a dedicated IP
address and every socket uses the very same port number, with the
help of the SO_REUSEADDR socket option. I use the 2.6.19-r5
gentoo-sources kernel on my Compaq Evo N1020v laptop.
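To make the setup concrete, each simulated user's socket is created
roughly like this (a simplified sketch of the UDP case only; the
helper name and the lack of error reporting are just for illustration,
the real test code differs):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One socket per simulated user: bound to its own alias IP address,
 * but to the same UDP port as every other user, which is why
 * SO_REUSEADDR is set. */
static int make_user_socket(const char *alias_ip, unsigned short port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);                   /* same port for every user */
	inet_pton(AF_INET, alias_ip, &addr.sin_addr);  /* dedicated alias IP */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}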



So far, I could get to the following point:

Function/macro:                            Location:
sys_send                                   net/socket.c
sys_sendto                                 net/socket.c
sock_sendmsg                               net/socket.c
__sock_sendmsg                             net/socket.c
(*socket_sendmsg) = inet_sendmsg           net/ipv4/af_inet.c
(*sendmsg) = udp_sendmsg                   net/ipv4/udp.c
udp_push_pending_frames                    net/ipv4/udp.c
ip_push_pending_frames                     net/ipv4/ip_output.c
NF_HOOK                                    include/linux/netfilter.h
ip_output                                  net/ipv4/ip_output.c
ip_finish_output                           net/ipv4/ip_output.c
ip_finish_output2                          net/ipv4/ip_output.c
dst->neighbour->output = dev_queue_xmit    net/core/dev.c
rcu_read_unlock_bh                         include/linux/rcupdate.h
local_bh_enable                            kernel/softirq.c



At this point, I felt really helpless, and I don't even know if I'm
on the right track with this investigation. So, can someone please
help me with advice or links pointing to the info I need? Let me know
if you need any further input.

I expect that sending a packet should not depend on the number of
file descriptors (sockets); it should be a more or less constant-time
operation. Can you confirm this, or is there something in the kernel
or elsewhere preventing it?



Any hints are welcome. Thanks a lot!

/Zacco 






* Re: many sockets, slow sendto
  2007-03-06 15:08 many sockets, slow sendto Zaccomer Lajos
@ 2007-03-06 18:20 ` Baruch Even
  2007-03-19 23:10   ` Zacco
  2007-03-06 19:23 ` Andi Kleen
  1 sibling, 1 reply; 23+ messages in thread
From: Baruch Even @ 2007-03-06 18:20 UTC (permalink / raw)
  To: Zaccomer Lajos; +Cc: netdev

* Zaccomer Lajos <zacco@fw.hu> [070306 17:39]:
> Hi,
> 
> 
> 
> I'm playing around with a simulation, in which many thousands of IP
> 
> addresses (on interface aliases) are used to send/receive TCP/UDP
> 
> packets. I noticed that the time of send/sendto increased linearly
> 
> with the number of file descriptors, and I found it rather strange.

To better understand the reason for this problem you should first use
oprofile to profile the kernel. This will show you the hot spots, i.e.
where the kernel (or userspace) spends most of its time.

Baruch


* Re: many sockets, slow sendto
  2007-03-06 15:08 many sockets, slow sendto Zaccomer Lajos
  2007-03-06 18:20 ` Baruch Even
@ 2007-03-06 19:23 ` Andi Kleen
  2007-03-06 21:28   ` Zacco
  1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2007-03-06 19:23 UTC (permalink / raw)
  To: Zaccomer Lajos; +Cc: netdev

Zaccomer Lajos <zacco@fw.hu> writes:

> I'm playing around with a simulation, in which many thousands of IP
> 
> addresses (on interface aliases) are used to send/receive TCP/UDP

Something seems to be wrong with your emailer. It adds an empty
line between each real line.

> 
> packets. I noticed that the time of send/sendto increased linearly
> 
> with the number of file descriptors, and I found it rather strange.

Yes that is strange. I would suggest you use oprofile to identify which
parts of the kernel use the CPU time with many descriptors.

-Andi


* Re: many sockets, slow sendto
  2007-03-06 19:23 ` Andi Kleen
@ 2007-03-06 21:28   ` Zacco
  0 siblings, 0 replies; 23+ messages in thread
From: Zacco @ 2007-03-06 21:28 UTC (permalink / raw)
  To: Andi Kleen, baruch; +Cc: netdev

Hi,

Thx a lot for the advice, I'll give it a try.
And sorry for the stupid webmail, I will not use it again.

Zacco


Andi Kleen wrote:
> Zaccomer Lajos <zacco@fw.hu> writes:
>
>   
>> I'm playing around with a simulation, in which many thousands of IP
>>
>> addresses (on interface aliases) are used to send/receive TCP/UDP
>>     
>
> Something seems to be wrong with your emailer. It adds an empty
> line between each real line.
>
>   
>> packets. I noticed that the time of send/sendto increased linearly
>>
>> with the number of file descriptors, and I found it rather strange.
>>     
>
> Yes that is strange. I would suggest you use oprofile to identify which
> parts of the kernel use the CPU time with many descriptors.
>
> -Andi
>
>   



* Re: many sockets, slow sendto
  2007-03-06 18:20 ` Baruch Even
@ 2007-03-19 23:10   ` Zacco
  2007-03-19 23:16     ` David Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Zacco @ 2007-03-19 23:10 UTC (permalink / raw)
  To: Baruch Even; +Cc: netdev

Hi Baruch and all,

As you recommended, I used oprofile and it turned out that most of the 
time was spent in the __udp4_lib_lookup function. There is a UDP hash 
table, and sockets are looked up based on the 7 LSBs of the destination 
port number. So what happened is now quite obvious: I had many thousands 
of sockets, all with the same destination port, thus linked into the same 
slot of this hash table. I tried using different ports and it was much 
faster then.
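To make that concrete, the slot selection in these kernels works
roughly like this (a simplified sketch; the helper name is only for
illustration, but UDP_HTABLE_SIZE really is 128 here):

#define UDP_HTABLE_SIZE 128

/* The chain is picked from the destination port alone: port & 127
 * keeps only the 7 low-order bits, so every socket bound to the same
 * port lands in the same chain, which __udp4_lib_lookup() then has to
 * walk linearly past thousands of entries. */
static inline unsigned int udp_hash_slot(unsigned short dport)
{
	return dport & (UDP_HTABLE_SIZE - 1);
}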

As the hash table seems to be related to the proc fs, I tried to compile 
a kernel without it. I could start that kernel only with the 
init=/bin/bash option, otherwise many things did not work. My program 
stopped right after sending out the first packet, so the proc fs must be 
essential to the receive operation (but it might have been something 
else, too). If it is so, do you think I can remove the proc fs 
dependency so that I can still use the regular socket functions (i.e. 
socket, connect, listen, send, recv etc.)?

Now I also understand why this receive-related function is called right 
from dev_queue_xmit (where some softirq processing also happens): both 
the send and receive parts resided on the same host. As soon as I 
ran the test on two hosts, separating the send and receive sides, the 
considerable send delay simply disappeared. Could you help me 
understand why? I expected the delay to appear on the receiving side, 
but it didn't.

The connection setup part (socket+connect), however, still increased 
linearly with the number of sockets. Hopefully, I can send you the 
oprofile result of this part too, but in case you already have any idea 
of how I can get rid of this dependency, please let me know!

thx a lot: Zacco



* Re: many sockets, slow sendto
  2007-03-19 23:10   ` Zacco
@ 2007-03-19 23:16     ` David Miller
  2007-03-20 21:59       ` Zacco
  0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2007-03-19 23:16 UTC (permalink / raw)
  To: zacco; +Cc: baruch, netdev

From: Zacco <zacco@fw.hu>
Date: Tue, 20 Mar 2007 00:10:19 +0100

> As you recommended, I used oprofile and it turned out that the 
> __udp4_lib_lookup function spent most of the time. There is a udp hash 
> table and the sockets are sought based on the 7 LSBs of the destination 
> port number. So what happened is now quite obvious: I had many thousands 
> of sockets, all with the same destination port, thus linked in the same 
> slot of this hash table. I tried using different ports and it
> was much faster then.

There isn't much we can do here.  I bet your destination address
is unchanging just like your destination ports.

UDP apps can and do bind to specific destination addresses and
ports, but the source side is usually wild-carded.

Are both the source address and port fully specified for your
sockets?  Maybe we can do something if that's the case...


* Re: many sockets, slow sendto
  2007-03-19 23:16     ` David Miller
@ 2007-03-20 21:59       ` Zacco
  2007-03-20 22:48         ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Zacco @ 2007-03-20 21:59 UTC (permalink / raw)
  To: David Miller; +Cc: baruch, netdev

Hi,

David Miller wrote:
> From: Zacco <zacco@fw.hu>
> Date: Tue, 20 Mar 2007 00:10:19 +0100
>
>   
>> As you recommended, I used oprofile and it turned out that the 
>> __udp4_lib_lookup function spent most of the time. There is a udp hash 
>> table and the sockets are sought based on the 7 LSBs of the destination 
>> port number. So what happened is now quite obvious: I had many thousands 
>> of sockets, all with the same destination port, thus linked in the same 
>> slot of this hash table. I tried using different ports and it
>> was much faster then.
>>     
>
> There isn't much we can do here.  I bet your destination address
> is unchanging just like your destination ports.
>   
As I'm simulating independent users on one host, each user has a 
different IP address, but each with the same port. So unlike the port, 
the address is changing; basically it's a huge class A range.

> UDP apps can and do bind to specific destination addresses and
> ports, but the source side is usually wild-carded.
>   
Right, usually it is, but in my case the source addresses are also 
bound, otherwise the source address would be the primary address of the 
physical interface; however, I need to simulate users as if they were on 
separate hosts.
> Are both the source address and port fully specified for your
> sockets?  Maybe we can do something using if that's the case...
>   
You made me curious.  :)  What do you have in mind?

thx: Zacco


* Re: many sockets, slow sendto
  2007-03-20 21:59       ` Zacco
@ 2007-03-20 22:48         ` Eric Dumazet
  2007-03-21 21:53           ` Zacco
  2007-03-21 22:12           ` Eric Dumazet
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-03-20 22:48 UTC (permalink / raw)
  To: Zacco; +Cc: David Miller, baruch, netdev

Zacco wrote:
> Hi,
> 
> David Miller wrote:
>> From: Zacco <zacco@fw.hu>
>> Date: Tue, 20 Mar 2007 00:10:19 +0100
>>
>>  
>>> As you recommended, I used oprofile and it turned out that the 
>>> __udp4_lib_lookup function spent most of the time. There is a udp 
>>> hash table and the sockets are sought based on the 7 LSBs of the 
>>> destination port number. So what happened is now quite obvious: I had 
>>> many thousands of sockets, all with the same destination port, thus 
>>> linked in the same slot of this hash table. I tried using different 
>>> ports and it
>>> was much faster then.
>>>     
>>
>> There isn't much we can do here.  I bet your destination address
>> is unchanging just like your destination ports.
>>   
> As I'm simulating independent users on one host, each user has a 
> different IP address, but each with the same port. So unlike the port, 
> the address is changing, basically it's a huge A-class range.
> 
>> UDP apps can and do bind to specific destination addresses and
>> ports, but the source side is usually wild-carded.
>>   
> Right, usually it is, but in my case the source addresses are also 
> bound, otherwise the source address would be the primary address of the 
> physical interface; however, I need to simulate users as if they were on 
> separate hosts.
>> Are both the source address and port fully specified for your
>> sockets?  Maybe we can do something using if that's the case...
>>   
> You made me curious.  :)  What do you have in mind?

Currently, udp_hash[UDP_HTABLE_SIZE] uses a hash function based on the dport 
number only.

In your case, as you use a single port value, all sockets are in a single slot 
of this hash table: to find the right socket, __udp4_lib_lookup() has to search 
a list with thousands of elements. Not that good, is it? :(

As udp_hash is protected by a single rw_lock, I guess we could convert the 
hash table to an RB-tree, with the key being (dport, daddr).

At lookup time, we could do:

1) a full lookup with (dport, daddr)
2) if not found, a lookup with the wildcard (dport, 0)

I don't know if this is OK, because I don't know if it is possible to have 
several UDP sockets with the same (dport, daddr).

It would be more scalable. But still, the rw_lock is not very SMP friendly...
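In rough pseudo-C, the idea would be something like this (only a
sketch: hash_port_and_addr() is one possible hash, walk_chain() just
stands for scanning one hash chain, neither exists in the kernel as
such):

static inline unsigned int hash_port_and_addr(unsigned short port,
					      unsigned int addr)
{
	addr ^= addr >> 16;
	addr ^= addr >> 8;
	return port ^ addr;
}

struct sock *two_pass_lookup(unsigned int daddr, unsigned short dport)
{
	unsigned int hash = hash_port_and_addr(dport, daddr);
	unsigned int hashwild = hash_port_and_addr(dport, 0);
	struct sock *sk;

	/* pass 1: sockets bound to the specific destination address */
	sk = walk_chain(&udp_hash[hash & (UDP_HTABLE_SIZE - 1)], daddr, dport);
	/* pass 2: wildcard-bound sockets, only if pass 1 found nothing */
	if (!sk && hash != hashwild)
		sk = walk_chain(&udp_hash[hashwild & (UDP_HTABLE_SIZE - 1)],
				daddr, dport);
	return sk;
}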




* Re: many sockets, slow sendto
  2007-03-20 22:48         ` Eric Dumazet
@ 2007-03-21 21:53           ` Zacco
  2007-03-21 22:24             ` Eric Dumazet
                               ` (2 more replies)
  2007-03-21 22:12           ` Eric Dumazet
  1 sibling, 3 replies; 23+ messages in thread
From: Zacco @ 2007-03-21 21:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, baruch, netdev

Eric Dumazet wrote:
>
> Currently, udp_hash[UDP_HTABLE_SIZE] is using a hash function based on 
> dport number only.
>
> In your case, as you use a single port value, all sockets are in a 
> single slot of this hash table :
> To find the good socket, __udp4_lib_lookup() has to search in a list 
> with thousands of elements. Not that good, isnt it ? :(
>
So, my worry is confirmed then. But how could that delay disappear when 
splitting the sender and receiver onto distinct hosts? Even in that case 
the right socket must be found somehow.
> As udp_hash is protected by a single rw_lock, I guess we could convert 
> the hash table to a RB-tree, with a key being : (dport, daddr)
>
Actually, the source address would be more important in my case, as my 
clients (each with a different IP address) want to connect to the same 
server, i.e. to the same address and port.
I think the current design is fair enough for server implementations 
and for regular clients. But even though my application is not typical, 
as far as I know (though it may become more relevant with the fast 
performance growth of regular PCs), the design should be general enough 
to cope with special circumstances like mine. My initial idea was to 
somehow include the complete socket pair, i.e. source address:port and 
destination address:port, keeping in mind that it should work for both 
IPv4 and IPv6. Maybe it's overkill, I don't know.
> At lookup time, we could do :
>
> 1) A full lookup with (dport, daddr)
> 2) if not found, a lookup with wildcard : (dport, 0)
>
> I dont know if this is OK, because I dont know if it is possible to 
> have several UDP sockets with the same (dport, daddr)
Definitely, it is.
>
> It would be more scalable. But still the rw_lock is not very SMP 
> friendly...
>
Do you think there is interest in such a modification? If so, how could 
we go on with it?

thx: Zacco


* Re: many sockets, slow sendto
  2007-03-20 22:48         ` Eric Dumazet
  2007-03-21 21:53           ` Zacco
@ 2007-03-21 22:12           ` Eric Dumazet
  2007-03-22  1:15             ` David Miller
                               ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-03-21 22:12 UTC (permalink / raw)
  To: Zacco; +Cc: David Miller, baruch, netdev

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

Eric Dumazet wrote:
> Currently, udp_hash[UDP_HTABLE_SIZE] is using a hash function based on 
> dport number only.
> 
> In your case, as you use a single port value, all sockets are in a 
> single slot of this hash table :
> To find the good socket, __udp4_lib_lookup() has to search in a list 
> with thousands of elements. Not that good, isnt it ? :(

In case you want to try, here is a patch that could help you :)

[PATCH] INET : IPV4 UDP lookups converted to a 2 pass algo

Some people want to have many UDP sockets, bound to a single port but to many 
different addresses. We currently hash all those sockets into a single chain. 
Processing of incoming packets is very expensive, because the whole chain must 
be examined to find the best match.

In this patch I chose to hash UDP sockets with a hash function that takes into 
account both their port number and address. This has a drawback: we need two 
lookups, one with the given address and one with a wildcard (null) address.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: udp_2pass_lookups.patch --]
[-- Type: text/plain, Size: 7621 bytes --]

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 71b0b60..27437e7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -114,14 +114,33 @@ DEFINE_RWLOCK(udp_hash_lock);
 
 static int udp_port_rover;
 
-static inline int __udp_lib_lport_inuse(__u16 num, struct hlist_head udptable[])
+/*
+ * Note about this hash function :
+ * Typical use is probably daddr = 0, only dport is going to vary hash
+ */
+static inline unsigned int hash_port_and_addr(__u16 port, __be32 addr)
+{
+	addr ^= addr >> 16;
+	addr ^= addr >> 8;
+	return port ^ addr;
+}
+
+static inline int __udp_lib_port_inuse(unsigned int hash, int port,
+	__be32 daddr, struct hlist_head udptable[])
 {
 	struct sock *sk;
 	struct hlist_node *node;
+	struct inet_sock *inet;
 
-	sk_for_each(sk, node, &udptable[num & (UDP_HTABLE_SIZE - 1)])
-		if (sk->sk_hash == num)
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
+		if (sk->sk_hash != hash)
+			continue;
+		inet = inet_sk(sk);
+		if (inet->num != port)
+			continue;
+		if (inet->rcv_saddr == daddr)
 			return 1;
+	}
 	return 0;
 }
 
@@ -142,6 +161,7 @@ int __udp_lib_get_port(struct sock *sk, 
 	struct hlist_node *node;
 	struct hlist_head *head;
 	struct sock *sk2;
+	unsigned int hash;
 	int    error = 1;
 
 	write_lock_bh(&udp_hash_lock);
@@ -156,7 +176,9 @@ int __udp_lib_get_port(struct sock *sk, 
 		for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
 			int size;
 
-			head = &udptable[result & (UDP_HTABLE_SIZE - 1)];
+			hash = hash_port_and_addr(result,
+					inet_sk(sk)->rcv_saddr);
+			head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 			if (hlist_empty(head)) {
 				if (result > sysctl_local_port_range[1])
 					result = sysctl_local_port_range[0] +
@@ -181,7 +203,10 @@ int __udp_lib_get_port(struct sock *sk, 
 				result = sysctl_local_port_range[0]
 					+ ((result - sysctl_local_port_range[0]) &
 					   (UDP_HTABLE_SIZE - 1));
-			if (! __udp_lib_lport_inuse(result, udptable))
+			hash = hash_port_and_addr(result,
+					inet_sk(sk)->rcv_saddr);
+			if (! __udp_lib_port_inuse(hash, result,
+				inet_sk(sk)->rcv_saddr, udptable))
 				break;
 		}
 		if (i >= (1 << 16) / UDP_HTABLE_SIZE)
@@ -189,11 +214,13 @@ int __udp_lib_get_port(struct sock *sk, 
 gotit:
 		*port_rover = snum = result;
 	} else {
-		head = &udptable[snum & (UDP_HTABLE_SIZE - 1)];
+		hash = hash_port_and_addr(snum, inet_sk(sk)->rcv_saddr);
+		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 
 		sk_for_each(sk2, node, head)
-			if (sk2->sk_hash == snum                             &&
+			if (sk2->sk_hash == hash                             &&
 			    sk2 != sk                                        &&
+			    inet_sk(sk2)->num == snum	                     &&
 			    (!sk2->sk_reuse        || !sk->sk_reuse)         &&
 			    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
 			     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
@@ -201,9 +228,9 @@ gotit:
 				goto fail;
 	}
 	inet_sk(sk)->num = snum;
-	sk->sk_hash = snum;
+	sk->sk_hash = hash;
 	if (sk_unhashed(sk)) {
-		head = &udptable[snum & (UDP_HTABLE_SIZE - 1)];
+		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 		sk_add_node(sk, head);
 		sock_prot_inc_use(sk->sk_prot);
 	}
@@ -242,63 +269,78 @@ static struct sock *__udp4_lib_lookup(__
 {
 	struct sock *sk, *result = NULL;
 	struct hlist_node *node;
-	unsigned short hnum = ntohs(dport);
-	int badness = -1;
+	unsigned int hash, hashwild;
+	int score, best = -1;
+
+	hash = hash_port_and_addr(ntohs(dport), daddr);
+	hashwild = hash_port_and_addr(ntohs(dport), 0);
 
 	read_lock(&udp_hash_lock);
-	sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
+
+lookup:
+
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		if (sk->sk_hash == hnum && !ipv6_only_sock(sk)) {
-			int score = (sk->sk_family == PF_INET ? 1 : 0);
-			if (inet->rcv_saddr) {
-				if (inet->rcv_saddr != daddr)
-					continue;
-				score+=2;
-			}
-			if (inet->daddr) {
-				if (inet->daddr != saddr)
-					continue;
-				score+=2;
-			}
-			if (inet->dport) {
-				if (inet->dport != sport)
-					continue;
-				score+=2;
-			}
-			if (sk->sk_bound_dev_if) {
-				if (sk->sk_bound_dev_if != dif)
-					continue;
-				score+=2;
-			}
-			if (score == 9) {
-				result = sk;
-				break;
-			} else if (score > badness) {
-				result = sk;
-				badness = score;
-			}
+		if (sk->sk_hash != hash || ipv6_only_sock(sk) ||
+			inet->num != dport)
+			continue;
+
+		score = (sk->sk_family == PF_INET ? 1 : 0);
+		if (inet->rcv_saddr) {
+			if (inet->rcv_saddr != daddr)
+				continue;
+			score+=2;
+		}
+		if (inet->daddr) {
+			if (inet->daddr != saddr)
+				continue;
+			score+=2;
+		}
+		if (inet->dport) {
+			if (inet->dport != sport)
+				continue;
+			score+=2;
+		}
+		if (sk->sk_bound_dev_if) {
+			if (sk->sk_bound_dev_if != dif)
+				continue;
+			score+=2;
+		}
+		if (score == 9) {
+			result = sk;
+			goto found;
+		} else if (score > best) {
+			result = sk;
+			best = score;
 		}
 	}
+
+	if (hash != hashwild) {
+		hash = hashwild;
+		goto lookup;
+	}
+found:
 	if (result)
 		sock_hold(result);
 	read_unlock(&udp_hash_lock);
 	return result;
 }
 
-static inline struct sock *udp_v4_mcast_next(struct sock *sk,
-					     __be16 loc_port, __be32 loc_addr,
-					     __be16 rmt_port, __be32 rmt_addr,
-					     int dif)
+static inline struct sock *udp_v4_mcast_next(
+			struct sock *sk,
+			unsigned int hnum, __be16 loc_port, __be32 loc_addr,
+			__be16 rmt_port, __be32 rmt_addr,
+			int dif)
 {
 	struct hlist_node *node;
 	struct sock *s = sk;
-	unsigned short hnum = ntohs(loc_port);
 
 	sk_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (s->sk_hash != hnum					||
+		    inet->num != loc_port				||
 		    (inet->daddr && inet->daddr != rmt_addr)		||
 		    (inet->dport != rmt_port && inet->dport)		||
 		    (inet->rcv_saddr && inet->rcv_saddr != loc_addr)	||
@@ -1128,29 +1170,44 @@ static int __udp4_lib_mcast_deliver(stru
 				    __be32 saddr, __be32 daddr,
 				    struct hlist_head udptable[])
 {
-	struct sock *sk;
+	struct sock *sk, *skw, *sknext;
 	int dif;
+	unsigned int hash = hash_port_and_addr(ntohs(uh->dest), daddr);
+	unsigned int hashwild = hash_port_and_addr(ntohs(uh->dest), 0);
 
-	read_lock(&udp_hash_lock);
-	sk = sk_head(&udptable[ntohs(uh->dest) & (UDP_HTABLE_SIZE - 1)]);
 	dif = skb->dev->ifindex;
-	sk = udp_v4_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
-	if (sk) {
-		struct sock *sknext = NULL;
 
+	read_lock(&udp_hash_lock);
+
+	sk = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+	skw = sk_head(&udptable[hashwild & (UDP_HTABLE_SIZE - 1)]);
+
+	sk = udp_v4_mcast_next(sk, hash, uh->dest, daddr, uh->source, saddr, dif);
+	if (!sk) {
+		hash = hashwild;
+		sk = udp_v4_mcast_next(skw, hash, uh->dest, daddr, uh->source,
+			saddr, dif);
+	}
+	if (sk) {
 		do {
 			struct sk_buff *skb1 = skb;
-
-			sknext = udp_v4_mcast_next(sk_next(sk), uh->dest, daddr,
-						   uh->source, saddr, dif);
+			sknext = udp_v4_mcast_next(sk_next(sk), hash, uh->dest,
+				daddr, uh->source, saddr, dif);
+			if (!sknext && hash != hashwild) {
+				hash = hashwild;
+				sknext = udp_v4_mcast_next(skw, hash, uh->dest,
+					daddr, uh->source, saddr, dif);
+			}
 			if (sknext)
 				skb1 = skb_clone(skb, GFP_ATOMIC);
 
 			if (skb1) {
 				int ret = udp_queue_rcv_skb(sk, skb1);
 				if (ret > 0)
-					/* we should probably re-process instead
-					 * of dropping packets here. */
+					/*
+					 * we should probably re-process
+					 * instead of dropping packets here.
+					 */
 					kfree_skb(skb1);
 			}
 			sk = sknext;


* Re: many sockets, slow sendto
  2007-03-21 21:53           ` Zacco
@ 2007-03-21 22:24             ` Eric Dumazet
  2007-03-21 23:26             ` Eric Dumazet
  2007-03-22  1:14             ` David Miller
  2 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-03-21 22:24 UTC (permalink / raw)
  To: Zacco; +Cc: David Miller, baruch, netdev

Zacco wrote:
> Actually, the source address would be more important in my case, as my 
> clients (each with different IP address) wants to connect to the same 
> server, i.e. to the same address and port.

I don't understand why you need many sockets then.

A single socket should be enough.

> I think, the current design is fair enough for server implementations 
> and for regular clients. But even though my application is not tipical, 
> as far as I know (but it can be important with the fast performance 
> growth of regular PCs), the make-up should be general enough to cope 
> with special circumstances, like mine. My initial idea was to somehow 
> include the complete socket pair, i.e. source address:port and 
> destination address:port, keeping in mind that it should work for both 
> IPv4 and IPv6. Maybe it's an overkill, I don't know.

Could you send me a copy of your application source, or detailed specs, 
because I am confused right now...




* Re: many sockets, slow sendto
  2007-03-21 21:53           ` Zacco
  2007-03-21 22:24             ` Eric Dumazet
@ 2007-03-21 23:26             ` Eric Dumazet
  2007-03-22  1:14             ` David Miller
  2 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-03-21 23:26 UTC (permalink / raw)
  To: Zacco; +Cc: David Miller, baruch, netdev

Zacco wrote:
> So, my worry is confirmed then. But how could that delay disappear when 
> splitting the sender and receiver on distinct hosts? Even in that case 
> the good socket must be found somehow.


When the receiver and sender are on the same machine, sendto() passes the 
packet to loopback and enters the receiving side. With that many sockets, the 
time to go through all of them may be 100 us. So your sendto() seems to be 
slow, but the slow part is the receiver.
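A rough way to see this from userspace is to time individual sendto()
calls (just a sketch; link with -lrt on older glibc): if the per-call
cost grows with the number of local sockets, the receive-side lookup
is being charged to the sender.

#include <stddef.h>
#include <time.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the wall-clock cost of one sendto() in microseconds. */
static double timed_sendto_us(int fd, const void *buf, size_t len,
			      const struct sockaddr_in *dst)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	sendto(fd, buf, len, 0, (const struct sockaddr *)dst, sizeof(*dst));
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1e6 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e3;
}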

If you use two machines, the sender might send XX.XXX frames per second (full 
speed), but the receiver might handle only 5% of them and drop 95%.

This is all speculation, since you didn't give us the exact setup you use.



* Re: many sockets, slow sendto
  2007-03-21 21:53           ` Zacco
  2007-03-21 22:24             ` Eric Dumazet
  2007-03-21 23:26             ` Eric Dumazet
@ 2007-03-22  1:14             ` David Miller
  2 siblings, 0 replies; 23+ messages in thread
From: David Miller @ 2007-03-22  1:14 UTC (permalink / raw)
  To: zacco; +Cc: dada1, baruch, netdev

From: Zacco <zacco@fw.hu>
Date: Wed, 21 Mar 2007 22:53:13 +0100

> Do you think there is interest in such a modification? If so, how could 
> we go on with it?

The best thing you can do is hash on both saddr/sport.  In order
to handle the saddr==0 case the socket lookup has to try two
lookups, one with the packet's saddr, and one with saddr zero.
If the first lookup hits, we use that, since a precise match
should take precedence over a wildcard saddr; otherwise we use the
result of the second lookup.

I'm not very inclined to hack on this, so anyone else is welcome
to.

FWIW, pretty much every other networking stack only hashes on sport
for UDP, just like Linux.


* Re: many sockets, slow sendto
  2007-03-21 22:12           ` Eric Dumazet
@ 2007-03-22  1:15             ` David Miller
  2007-03-22  6:43               ` Eric Dumazet
  2007-03-29 19:24             ` Zacco
  2007-04-30  7:26             ` David Miller
  2 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2007-03-22  1:15 UTC (permalink / raw)
  To: dada1; +Cc: zacco, baruch, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 21 Mar 2007 23:12:40 +0100

> I chose in this patch to hash UDP sockets with a hash function that take into 
> account both their port number and address : This has a drawback because we 
> need two lookups : one with a given address, one with a wildcard (null) address.

Thanks for doing this work Eric, I'll review this when I get home
tomorrow night or Friday.


* Re: many sockets, slow sendto
  2007-03-22  1:15             ` David Miller
@ 2007-03-22  6:43               ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-03-22  6:43 UTC (permalink / raw)
  To: David Miller; +Cc: zacco, baruch, netdev

On Wed, 21 Mar 2007 18:15:10 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 21 Mar 2007 23:12:40 +0100
> 
> > I chose in this patch to hash UDP sockets with a hash function that take into 
> > account both their port number and address : This has a drawback because we 
> > need two lookups : one with a given address, one with a wildcard (null) address.
> 
> Thanks for doing this work Eric, I'll review this when I get home
> tomorrow night or Friday.
> 

You're welcome :)

I knew you were busy with this new wii game^Wprogram

:)


* Re: many sockets, slow sendto
  2007-03-21 22:12           ` Eric Dumazet
  2007-03-22  1:15             ` David Miller
@ 2007-03-29 19:24             ` Zacco
  2007-03-30 13:10               ` Eric Dumazet
  2007-04-30  7:26             ` David Miller
  2 siblings, 1 reply; 23+ messages in thread
From: Zacco @ 2007-03-29 19:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, baruch, netdev

Hi,

Thanks for the patch. I almost dare not confess that I don't know which 
kernel version to apply it to. I tried 3 different ones (2.6.19-r5-gentoo, 
2.6.20.1 and 2.6.21-rc4), but in the best case at least two hunks 
failed. Nevertheless, I applied the patches manually. In each case, UDP 
stopped working. I guess you checked the patch and it worked for you. I 
don't think I made a mistake in the manual copy, and it seems unlikely 
that your patch interfered with other parallel changes in the kernel - 
but I'm just guessing ...
I think I'd better send you the spec and code, as you suggested, so that 
we first have a common understanding of the issue. I must have failed to 
get my point across. I'm removing irrelevant stuff, and I'll send it to 
you as soon as I can (sorry for my long delays).

thx a lot,
Zacco

Eric Dumazet wrote:
> Eric Dumazet a écrit :
>> Currently, udp_hash[UDP_HTABLE_SIZE] is using a hash function based 
>> on dport number only.
>>
>> In your case, as you use a single port value, all sockets are in a 
>> single slot of this hash table :
>> To find the good socket, __udp4_lib_lookup() has to search in a list 
>> with thousands of elements. Not that good, isnt it ? :(
>
> In case you want to try, here is a patch that could help you :)
>
> [PATCH] INET : IPV4 UDP lookups converted to a 2 pass algo
>
> Some people want to have many UDP sockets, binded to a single port but 
> many different addresses. We currently hash all those sockets into a 
> single chain. Processing of incoming packets is very expensive, 
> because the whole chain must be examined to find the best match.
>
> I chose in this patch to hash UDP sockets with a hash function that 
> take into account both their port number and address : This has a 
> drawback because we need two lookups : one with a given address, one 
> with a wildcard (null) address.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>



* Re: many sockets, slow sendto
  2007-03-29 19:24             ` Zacco
@ 2007-03-30 13:10               ` Eric Dumazet
  2007-04-22 12:34                 ` Zacco
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2007-03-30 13:10 UTC (permalink / raw)
  To: Zacco; +Cc: David Miller, baruch, netdev

On Thu, 29 Mar 2007 21:24:31 +0200
Zacco <zacco@fw.hu> wrote:

> Hi,
> 
> Thanks for the patch. I almost dare not confess that I don't know which 
> version to apply to. I tried 3 different ones (2.6.19-r5-gentoo, 
> 2.6.20.1 and 2.6.21-rc4), but in the best case at least two hunks 
> failed. Nevertheless, I applied the patches manually. In each case, UDP 
> stopped working. I guess, you checked the patch and worked. I don't 
> think I made a mistake in the manual copy, and it seems unlikely that 
> your patch interfered with other parallel changes in the kernel - but, 
> I'm just guessing ...
> I think, I'd better send you the spec and code, as you suggested that 
> first we have a common understanding of the issue. I must have failed in 
> passing the point. I'm removing irrelevant stuff, and I send it to you 
> as soon as I can (sorry for my long delays).
> 
> thx a lot,
> Zacco

Hum, please find a (working) patch against linux-2.6.21-rc5

(first patch was against net-2.6.22 git tree and had one bug)

Hope this helps

--- linux-2.6.21-rc5/net/ipv4/udp.c
+++ linux-2.6.21-rc5-ed/net/ipv4/udp.c
@@ -114,14 +114,33 @@ DEFINE_RWLOCK(udp_hash_lock);
 
 static int udp_port_rover;
 
-static inline int __udp_lib_lport_inuse(__u16 num, struct hlist_head udptable[])
+/*
+ * Note about this hash function :
+ * Typical use is probably daddr = 0, only dport is going to vary hash
+ */
+static inline unsigned int hash_port_and_addr(__u16 port, __be32 addr)
+{
+	addr ^= addr >> 16;
+	addr ^= addr >> 8;
+	return port ^ addr;
+}
+
+static inline int __udp_lib_port_inuse(unsigned int hash, int port,
+	__be32 daddr, struct hlist_head udptable[])
 {
 	struct sock *sk;
 	struct hlist_node *node;
+	struct inet_sock *inet;
 
-	sk_for_each(sk, node, &udptable[num & (UDP_HTABLE_SIZE - 1)])
-		if (sk->sk_hash == num)
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
+		if (sk->sk_hash != hash)
+			continue;
+		inet = inet_sk(sk);
+		if (inet->num != port)
+			continue;
+		if (inet->rcv_saddr == daddr)
 			return 1;
+	}
 	return 0;
 }
 
@@ -142,6 +161,7 @@ int __udp_lib_get_port(struct sock *sk, 
 	struct hlist_node *node;
 	struct hlist_head *head;
 	struct sock *sk2;
+	unsigned int hash;
 	int    error = 1;
 
 	write_lock_bh(&udp_hash_lock);
@@ -156,7 +176,9 @@ int __udp_lib_get_port(struct sock *sk, 
 		for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
 			int size;
 
-			head = &udptable[result & (UDP_HTABLE_SIZE - 1)];
+			hash = hash_port_and_addr(result,
+					inet_sk(sk)->rcv_saddr);
+			head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 			if (hlist_empty(head)) {
 				if (result > sysctl_local_port_range[1])
 					result = sysctl_local_port_range[0] +
@@ -180,7 +202,10 @@ int __udp_lib_get_port(struct sock *sk, 
 				result = sysctl_local_port_range[0]
 					+ ((result - sysctl_local_port_range[0]) &
 					   (UDP_HTABLE_SIZE - 1));
-			if (! __udp_lib_lport_inuse(result, udptable))
+			hash = hash_port_and_addr(result,
+					inet_sk(sk)->rcv_saddr);
+			if (! __udp_lib_port_inuse(hash, result,
+				inet_sk(sk)->rcv_saddr, udptable))
 				break;
 		}
 		if (i >= (1 << 16) / UDP_HTABLE_SIZE)
@@ -188,11 +213,13 @@ int __udp_lib_get_port(struct sock *sk, 
 gotit:
 		*port_rover = snum = result;
 	} else {
-		head = &udptable[snum & (UDP_HTABLE_SIZE - 1)];
+		hash = hash_port_and_addr(snum, inet_sk(sk)->rcv_saddr);
+		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 
 		sk_for_each(sk2, node, head)
-			if (sk2->sk_hash == snum                             &&
+			if (sk2->sk_hash == hash                             &&
 			    sk2 != sk                                        &&
+			    inet_sk(sk2)->num == snum	                     &&
 			    (!sk2->sk_reuse        || !sk->sk_reuse)         &&
 			    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
 			     || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
@@ -200,9 +227,9 @@ gotit:
 				goto fail;
 	}
 	inet_sk(sk)->num = snum;
-	sk->sk_hash = snum;
+	sk->sk_hash = hash;
 	if (sk_unhashed(sk)) {
-		head = &udptable[snum & (UDP_HTABLE_SIZE - 1)];
+		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 		sk_add_node(sk, head);
 		sock_prot_inc_use(sk->sk_prot);
 	}
@@ -241,63 +268,76 @@ static struct sock *__udp4_lib_lookup(__
 {
 	struct sock *sk, *result = NULL;
 	struct hlist_node *node;
-	unsigned short hnum = ntohs(dport);
-	int badness = -1;
+	unsigned int hash, hashwild;
+	int score, best = -1, hport = ntohs(dport);
+
+ 	hash = hash_port_and_addr(hport, daddr);
+ 	hashwild = hash_port_and_addr(hport, 0);
 
 	read_lock(&udp_hash_lock);
-	sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
+
+lookup:
+
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		if (sk->sk_hash == hnum && !ipv6_only_sock(sk)) {
-			int score = (sk->sk_family == PF_INET ? 1 : 0);
-			if (inet->rcv_saddr) {
-				if (inet->rcv_saddr != daddr)
-					continue;
-				score+=2;
-			}
-			if (inet->daddr) {
-				if (inet->daddr != saddr)
-					continue;
-				score+=2;
-			}
-			if (inet->dport) {
-				if (inet->dport != sport)
-					continue;
-				score+=2;
-			}
-			if (sk->sk_bound_dev_if) {
-				if (sk->sk_bound_dev_if != dif)
-					continue;
-				score+=2;
-			}
-			if(score == 9) {
-				result = sk;
-				break;
-			} else if(score > badness) {
-				result = sk;
-				badness = score;
-			}
+		if (sk->sk_hash != hash || ipv6_only_sock(sk) ||
+			inet->num != hport)
+			continue;
+
+		score = (sk->sk_family == PF_INET ? 1 : 0);
+		if (inet->rcv_saddr) {
+			if (inet->rcv_saddr != daddr)
+				continue;
+			score+=2;
+		}
+		if (inet->daddr) {
+			if (inet->daddr != saddr)
+				continue;
+			score+=2;
+		}
+		if (inet->dport) {
+			if (inet->dport != sport)
+				continue;
+			score+=2;
+		}
+		if (sk->sk_bound_dev_if) {
+			if (sk->sk_bound_dev_if != dif)
+				continue;
+			score+=2;
 		}
+		if(score == 9) {
+			result = sk;
+			goto found;
+		} else if(score > best) {
+			result = sk;
+			best = score;
+		}
+	}
+	if (hash != hashwild) {
+		hash = hashwild;
+		goto lookup;
 	}
+found:
 	if (result)
 		sock_hold(result);
 	read_unlock(&udp_hash_lock);
 	return result;
 }
 
-static inline struct sock *udp_v4_mcast_next(struct sock *sk,
-					     __be16 loc_port, __be32 loc_addr,
+static inline struct sock *udp_v4_mcast_next(struct sock *sk, unsigned int hnum,
+					     int hport, __be32 loc_addr,
 					     __be16 rmt_port, __be32 rmt_addr,
 					     int dif)
 {
 	struct hlist_node *node;
 	struct sock *s = sk;
-	unsigned short hnum = ntohs(loc_port);
 
 	sk_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
 
 		if (s->sk_hash != hnum					||
+		    inet->num != hport					||
 		    (inet->daddr && inet->daddr != rmt_addr)		||
 		    (inet->dport != rmt_port && inet->dport)		||
 		    (inet->rcv_saddr && inet->rcv_saddr != loc_addr)	||
@@ -1128,25 +1168,39 @@ static int __udp4_lib_mcast_deliver(stru
 				    __be32 saddr, __be32 daddr,
 				    struct hlist_head udptable[])
 {
-	struct sock *sk;
+	struct sock *sk, *skw, *sknext;
 	int dif;
+	int hport = ntohs(uh->dest);
+	unsigned int hash = hash_port_and_addr(hport, daddr);
+	unsigned int hashwild = hash_port_and_addr(hport, 0);
 
-	read_lock(&udp_hash_lock);
-	sk = sk_head(&udptable[ntohs(uh->dest) & (UDP_HTABLE_SIZE - 1)]);
 	dif = skb->dev->ifindex;
-	sk = udp_v4_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
-	if (sk) {
-		struct sock *sknext = NULL;
 
+	read_lock(&udp_hash_lock);
+
+	sk = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+	skw = sk_head(&udptable[hashwild & (UDP_HTABLE_SIZE - 1)]);
+
+	sk = udp_v4_mcast_next(sk, hash, hport, daddr, uh->source, saddr, dif);
+	if (!sk) {
+		hash = hashwild;
+		sk = udp_v4_mcast_next(skw, hash, hport, daddr, uh->source,
+			saddr, dif);
+	}
+	if (sk) {
 		do {
 			struct sk_buff *skb1 = skb;
-
-			sknext = udp_v4_mcast_next(sk_next(sk), uh->dest, daddr,
-						   uh->source, saddr, dif);
-			if(sknext)
+			sknext = udp_v4_mcast_next(sk_next(sk), hash, hport,
+						daddr, uh->source, saddr, dif);
+			if (!sknext && hash != hashwild) {
+				hash = hashwild;
+				sknext = udp_v4_mcast_next(skw, hash, hport,
+					daddr, uh->source, saddr, dif);
+			}
+			if (sknext)
 				skb1 = skb_clone(skb, GFP_ATOMIC);
 
-			if(skb1) {
+			if (skb1) {
 				int ret = udp_queue_rcv_skb(sk, skb1);
 				if (ret > 0)
 					/* we should probably re-process instead
@@ -1228,7 +1282,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, 
 		return __udp4_lib_mcast_deliver(skb, uh, saddr, daddr, udptable);
 
 	sk = __udp4_lib_lookup(saddr, uh->source, daddr, uh->dest,
-			       skb->dev->ifindex, udptable        );
+			       skb->dev->ifindex, udptable);
 
 	if (sk != NULL) {
 		int ret = udp_queue_rcv_skb(sk, skb);
-- 
Eric Dumazet <dada1@cosmosbay.com>



* Re: many sockets, slow sendto
  2007-03-30 13:10               ` Eric Dumazet
@ 2007-04-22 12:34                 ` Zacco
  0 siblings, 0 replies; 23+ messages in thread
From: Zacco @ 2007-04-22 12:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, baruch, netdev

Hi Eric and all,

Thanks a lot. However, some hunks still refused to apply, but the manual
patching worked. I tested it, and the results are more than satisfactory.
Not only did the reception part become faster, but also the bind calls. The
gain is especially significant when increasing the hash table size. I
applied it to an older kernel with similar success. Basically, my
problem is solved.

I have just one question: why do you check the address in
__udp_lib_port_inuse? I think it is enough to check the port, as the
name of the function suggests, but I may be wrong. I removed it and the
patch still works fine, but for me this is not a proof, as the check may
only matter in some special case I don't exercise.

As for your confusion: you're right. My idea to include the complete
socket pair in the hash function was really unnecessary, even in my
example. Thx for revealing that!

I applied another modification to the patch: I call the hash function
from __udp_lib_port_inuse instead of passing the hash value in. This way,
I could reduce the number of modifications, so the patch is easier to
apply manually to whatever kernel version.

Thx: Zacco


Eric Dumazet wrote:
> On Thu, 29 Mar 2007 21:24:31 +0200
> Zacco <zacco@fw.hu> wrote:
>
>   
>> Hi,
>>
>> Thanks for the patch. I almost dare not confess that I don't know which 
>> version to apply to. I tried 3 different ones (2.6.19-r5-gentoo, 
>> 2.6.20.1 and 2.6.21-rc4), but in the best case at least two hunks 
>> failed. Nevertheless, I applied the patches manually. In each case, UDP 
>> stopped working. I guess, you checked the patch and worked. I don't 
>> think I made a mistake in the manual copy, and it seems unlikely that 
>> your patch interfered with other parallel changes in the kernel - but, 
>> I'm just guessing ...
>> I think, I'd better send you the spec and code, as you suggested that 
>> first we have a common understanding of the issue. I must have failed in 
>> passing the point. I'm removing irrelevant stuff, and I send it to you 
>> as soon as I can (sorry for my long delays).
>>
>> thx a lot,
>> Zacco
>>     
>
> Hum, please find a (working) patch against linux-2.6.21-rc5
>
> (first patch was against net-2.6.22 git tree and had one bug)
>
> ...


* Re: many sockets, slow sendto
  2007-03-21 22:12           ` Eric Dumazet
  2007-03-22  1:15             ` David Miller
  2007-03-29 19:24             ` Zacco
@ 2007-04-30  7:26             ` David Miller
  2007-04-30 10:56               ` YOSHIFUJI Hideaki / 吉藤英明
  2 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2007-04-30  7:26 UTC (permalink / raw)
  To: dada1; +Cc: zacco, baruch, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 21 Mar 2007 23:12:40 +0100

> [PATCH] INET : IPV4 UDP lookups converted to a 2 pass algo
> 
> Some people want to have many UDP sockets, binded to a single port but many 
> different addresses. We currently hash all those sockets into a single chain. 
> Processing of incoming packets is very expensive, because the whole chain must 
> be examined to find the best match.
> 
> I chose in this patch to hash UDP sockets with a hash function that take into 
> account both their port number and address : This has a drawback because we 
> need two lookups : one with a given address, one with a wildcard (null) address.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Eric, I've applied this, thanks again.

Could I trouble you to cook up an ipv6 version of this patch?

Thanks a lot.


* Re: many sockets, slow sendto
  2007-04-30  7:26             ` David Miller
@ 2007-04-30 10:56               ` YOSHIFUJI Hideaki / 吉藤英明
  2007-04-30 12:47                 ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2007-04-30 10:56 UTC (permalink / raw)
  To: davem, dada1; +Cc: zacco, baruch, netdev, yoshfuji

In article <20070430.002643.68156452.davem@davemloft.net> (at Mon, 30 Apr 2007 00:26:43 -0700 (PDT)), David Miller <davem@davemloft.net> says:

> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> Eric, I've applied this, thanks again.
> 
> Could I trouble you to cook up an ipv6 version of this patch?

Here's my tentative version.  Not tested.
Dave, Eric, could you double-check this, please?

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

---
diff --git a/include/net/udp.h b/include/net/udp.h
index 98755eb..2c06017 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -120,8 +120,12 @@ static inline void udp_lib_close(struct sock *sk, long timeout)
 
 
 /* net/ipv4/udp.c */
+extern unsigned int udp_hash_port_and_rcvaddr(__u16 port,
+					      const struct sock *sk);
 extern int	udp_get_port(struct sock *sk, unsigned short snum,
-			     int (*saddr_cmp)(const struct sock *, const struct sock *));
+			     int (*saddr_cmp)(const struct sock *, const struct sock *),
+			     unsigned int (*hash_port_rcvaddr)(__u16 port,
+				     			       const struct sock *sk));
 extern void	udp_err(struct sk_buff *, u32);
 
 extern int	udp_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1449707..9d4293d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -125,6 +125,12 @@ static inline unsigned int hash_port_and_addr(__u16 port, __be32 addr)
 	return port ^ addr;
 }
 
+unsigned int udp4_hash_port_and_rcvaddr(__u16 port,
+					const struct sock *sk)
+{
+	return hash_port_and_addr(port, inet_sk(sk)->rcv_saddr);
+}
+
 static inline int __udp_lib_port_inuse(unsigned int hash, int port,
 	__be32 daddr, struct hlist_head udptable[])
 {
@@ -156,7 +162,9 @@ static inline int __udp_lib_port_inuse(unsigned int hash, int port,
 int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		       struct hlist_head udptable[], int *port_rover,
 		       int (*saddr_comp)(const struct sock *sk1,
-					 const struct sock *sk2 )    )
+					 const struct sock *sk2),
+		       unsigned int (*hash_port_rcvaddr)(__u16 port,
+							 const struct sock *sk))
 {
 	struct hlist_node *node;
 	struct hlist_head *head;
@@ -176,8 +184,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
 			int size;
 
-			hash = hash_port_and_addr(result,
-					inet_sk(sk)->rcv_saddr);
+			hash = hash_port_rcvaddr(result, sk);
 			head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 			if (hlist_empty(head)) {
 				if (result > sysctl_local_port_range[1])
@@ -203,8 +210,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 				result = sysctl_local_port_range[0]
 					+ ((result - sysctl_local_port_range[0]) &
 					   (UDP_HTABLE_SIZE - 1));
-			hash = hash_port_and_addr(result,
-					inet_sk(sk)->rcv_saddr);
+			hash = hash_port_rcvaddr(result, sk);
 			if (! __udp_lib_port_inuse(hash, result,
 				inet_sk(sk)->rcv_saddr, udptable))
 				break;
@@ -214,7 +220,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 gotit:
 		*port_rover = snum = result;
 	} else {
-		hash = hash_port_and_addr(snum, inet_sk(sk)->rcv_saddr);
+		hash = hash_port_rcvaddr(snum, sk);
 		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 
 		sk_for_each(sk2, node, head)
@@ -241,9 +247,11 @@ fail:
 }
 
 int udp_get_port(struct sock *sk, unsigned short snum,
-			int (*scmp)(const struct sock *, const struct sock *))
+			int (*scmp)(const struct sock *, const struct sock *),
+			unsigned int (*uhash)(u16 port, const struct sock *))
 {
-	return  __udp_lib_get_port(sk, snum, udp_hash, &udp_port_rover, scmp);
+	return  __udp_lib_get_port(sk, snum, udp_hash, &udp_port_rover,
+				   scmp, uhash);
 }
 
 int ipv4_rcv_saddr_equal(const struct sock *sk1, const struct sock *sk2)
@@ -257,7 +265,8 @@ int ipv4_rcv_saddr_equal(const struct sock *sk1, const struct sock *sk2)
 
 static inline int udp_v4_get_port(struct sock *sk, unsigned short snum)
 {
-	return udp_get_port(sk, snum, ipv4_rcv_saddr_equal);
+	return udp_get_port(sk, snum, ipv4_rcv_saddr_equal,
+			    udp4_hash_port_and_rcvaddr);
 }
 
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
@@ -328,8 +337,8 @@ found:
 }
 
 static inline struct sock *udp_v4_mcast_next(
-			struct sock *sk,
-			unsigned int hnum, __be16 loc_port, __be32 loc_addr,
+			struct sock *sk, unsigned int hnum,
+			__be16 loc_port, __be32 loc_addr,
 			__be16 rmt_port, __be32 rmt_addr,
 			int dif)
 {
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 820a477..d7216c8 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -10,7 +10,8 @@ extern void 	__udp4_lib_err(struct sk_buff *, u32, struct hlist_head []);
 
 extern int	__udp_lib_get_port(struct sock *sk, unsigned short snum,
 				   struct hlist_head udptable[], int *port_rover,
-				   int (*)(const struct sock*,const struct sock*));
+				   int (*)(const struct sock*,const struct sock*),
+				   unsigned int (*)(__u16, const struct sock*));
 extern int	ipv4_rcv_saddr_equal(const struct sock *, const struct sock *);
 
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b083c09..fa566c5 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -52,56 +52,92 @@
 
 DEFINE_SNMP_STAT(struct udp_mib, udp_stats_in6) __read_mostly;
 
+static inline unsigned int udp6_hash_port_and_addr(__u16 port,
+						  const struct in6_addr *addr)
+{
+	u32 hash = 0;
+	if (addr) {
+		hash = (__force u32) addr->s6_addr32[0] ^
+		       (__force u32) addr->s6_addr32[1] ^
+		       (__force u32) addr->s6_addr32[2] ^
+		       (__force u32) addr->s6_addr32[3];
+		hash ^= hash >> 16;
+		hash ^= hash >> 8;
+	}
+	return port ^ hash;
+}
+
+unsigned int udp6_hash_port_and_rcvaddr(__u16 port,
+					const struct sock *sk)
+{
+	return udp6_hash_port_and_addr(port, &inet6_sk(sk)->rcv_saddr);
+}
+
 static inline int udp_v6_get_port(struct sock *sk, unsigned short snum)
 {
-	return udp_get_port(sk, snum, ipv6_rcv_saddr_equal);
+	return udp_get_port(sk, snum,
+			    ipv6_rcv_saddr_equal,
+			    udp6_hash_port_and_rcvaddr);
 }
 
 static struct sock *__udp6_lib_lookup(struct in6_addr *saddr, __be16 sport,
 				      struct in6_addr *daddr, __be16 dport,
 				      int dif, struct hlist_head udptable[])
 {
-	struct sock *sk, *result = NULL;
+	struct sock *sk = NULL, *result = NULL;
 	struct hlist_node *node;
-	unsigned short hnum = ntohs(dport);
-	int badness = -1;
+	unsigned hash, hashwild;
+	int score, best = -1;
+
+	hash = udp6_hash_port_and_addr(ntohs(dport), saddr);
+	hashwild = udp6_hash_port_and_addr(ntohs(dport), NULL);
 
 	read_lock(&udp_hash_lock);
-	sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
+lookup:
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
 		struct inet_sock *inet = inet_sk(sk);
+		struct ipv6_pinfo *np = inet6_sk(sk);
 
-		if (sk->sk_hash == hnum && sk->sk_family == PF_INET6) {
-			struct ipv6_pinfo *np = inet6_sk(sk);
-			int score = 0;
-			if (inet->dport) {
-				if (inet->dport != sport)
-					continue;
-				score++;
-			}
-			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
-					continue;
-				score++;
-			}
-			if (!ipv6_addr_any(&np->daddr)) {
-				if (!ipv6_addr_equal(&np->daddr, saddr))
-					continue;
-				score++;
-			}
-			if (sk->sk_bound_dev_if) {
-				if (sk->sk_bound_dev_if != dif)
-					continue;
-				score++;
-			}
-			if (score == 4) {
-				result = sk;
-				break;
-			} else if (score > badness) {
-				result = sk;
-				badness = score;
-			}
+		if (sk->sk_hash != hash || sk->sk_family != PF_INET6 ||
+		    inet->num != dport)
+			continue;
+
+		score = 0;
+
+		if (inet->dport) {
+			if (inet->dport != sport)
+				continue;
+			score++;
+		}
+		if (!ipv6_addr_any(&np->rcv_saddr)) {
+			if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
+				continue;
+			score++;
+		}
+		if (!ipv6_addr_any(&np->daddr)) {
+			if (!ipv6_addr_equal(&np->daddr, saddr))
+				continue;
+			score++;
 		}
+		if (sk->sk_bound_dev_if) {
+			if (sk->sk_bound_dev_if != dif)
+				continue;
+			score++;
+		}
+		if (score == 4) {
+			result = sk;
+			goto found;
+		} else if (score > best) {
+			result = sk;
+			best = score;
+		}
+	}
+
+	if (hash != hashwild) {
+		hash = hashwild;
+		goto lookup;
 	}
+found:
 	if (result)
 		sock_hold(result);
 	read_unlock(&udp_hash_lock);
@@ -302,38 +338,41 @@ drop:
 }
 
 static struct sock *udp_v6_mcast_next(struct sock *sk,
+				      unsigned int hnum,
 				      __be16 loc_port, struct in6_addr *loc_addr,
 				      __be16 rmt_port, struct in6_addr *rmt_addr,
 				      int dif)
 {
 	struct hlist_node *node;
 	struct sock *s = sk;
-	unsigned short num = ntohs(loc_port);
 
 	sk_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
+		struct ipv6_pinfo *np = inet6_sk(s);
 
-		if (s->sk_hash == num && s->sk_family == PF_INET6) {
-			struct ipv6_pinfo *np = inet6_sk(s);
-			if (inet->dport) {
-				if (inet->dport != rmt_port)
-					continue;
-			}
-			if (!ipv6_addr_any(&np->daddr) &&
-			    !ipv6_addr_equal(&np->daddr, rmt_addr))
-				continue;
+		if (s->sk_hash != hnum || s->sk_family != PF_INET6 ||
+		    inet->num != loc_port)
+			continue;
 
-			if (s->sk_bound_dev_if && s->sk_bound_dev_if != dif)
+		if (inet->dport) {
+			if (inet->dport != rmt_port)
 				continue;
+		}
+		if (!ipv6_addr_any(&np->daddr) &&
+		    !ipv6_addr_equal(&np->daddr, rmt_addr))
+			continue;
 
-			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (!ipv6_addr_equal(&np->rcv_saddr, loc_addr))
-					continue;
-			}
-			if (!inet6_mc_check(s, loc_addr, rmt_addr))
+		if (s->sk_bound_dev_if && s->sk_bound_dev_if != dif)
+			continue;
+
+		if (!ipv6_addr_any(&np->rcv_saddr)) {
+			if (!ipv6_addr_equal(&np->rcv_saddr, loc_addr))
 				continue;
-			return s;
 		}
+
+		if (!inet6_mc_check(s, loc_addr, rmt_addr))
+			continue;
+		return s;
 	}
 	return NULL;
 }
@@ -348,20 +387,42 @@ static int __udp6_lib_mcast_deliver(struct sk_buff *skb, struct in6_addr *saddr,
 	struct sock *sk, *sk2;
 	const struct udphdr *uh = udp_hdr(skb);
 	int dif;
+	int hport = ntohs(uh->dest);
+	unsigned int hash = udp6_hash_port_and_addr(ntohs(uh->dest), daddr);
+	unsigned int hashwild = udp6_hash_port_and_addr(ntohs(uh->dest), NULL);
 
-	read_lock(&udp_hash_lock);
-	sk = sk_head(&udptable[ntohs(uh->dest) & (UDP_HTABLE_SIZE - 1)]);
 	dif = inet6_iif(skb);
-	sk = udp_v6_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
+
+	read_lock(&udp_hash_lock);
+redo:
+	sk = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+	sk = udp_v6_mcast_next(sk, hash, uh->dest, daddr, uh->source, saddr, dif);
 	if (!sk) {
+		if (hash != hashwild) {
+			hash = hashwild;
+			goto redo;
+		}
 		kfree_skb(skb);
 		goto out;
 	}
 
 	sk2 = sk;
-	while ((sk2 = udp_v6_mcast_next(sk_next(sk2), uh->dest, daddr,
-					uh->source, saddr, dif))) {
-		struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC);
+	while(1) {
+		struct sk_buff *buff;
+
+		sk2 = udp_v6_mcast_next(sk_next(sk2), hash, hport, daddr,
+					uh->source, saddr, dif);
+		if (!sk2) {
+			if (hash == hashwild)
+				break;
+			hash = hashwild;
+			sk2 = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+			sk2 = udp_v6_mcast_next(sk2, hash, uh->dest, daddr, uh->source, saddr, dif);
+			if (!sk2)
+				break;
+		}
+
+		buff = skb_clone(skb, GFP_ATOMIC);
 		if (buff)
 			udpv6_queue_rcv_skb(sk2, buff);
 	}

-- 
YOSHIFUJI Hideaki @ USAGI Project  <yoshfuji@linux-ipv6.org>
GPG-FP  : 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: many sockets, slow sendto
  2007-04-30 10:56               ` YOSHIFUJI Hideaki / 吉藤英明
@ 2007-04-30 12:47                 ` Eric Dumazet
  2007-04-30 19:43                   ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2007-04-30 12:47 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: davem, zacco, baruch, netdev

On Mon, 30 Apr 2007 19:56:04 +0900 (JST)
YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> wrote:

> In article <20070430.002643.68156452.davem@davemloft.net> (at Mon, 30 Apr 2007 00:26:43 -0700 (PDT)), David Miller <davem@davemloft.net> says:
> 
> > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> > 
> > Eric, I've applied this, thanks again.
> > 
> > Could I trouble you to cook up an ipv6 version of this patch?
> 
> Here's my tentative version.  Not tested.
> Dave, Eric, could you double-check this, please?

Hi Yoshifuji

Your patch seems good, thank you.

I have two minor comments about udp6_hash_port_and_addr().

The compiler knows that the const 'struct in6_addr *addr' argument is NULL when we call udp6_hash_port_and_addr(..., NULL), but not when we pass saddr: udp6_hash_port_and_addr(port, saddr).

So, to avoid a conditional branch, you could call udp6_hash_port_and_addr() only once, with a non-NULL addr, and use a separate helper for the wildcard hash, as sketched below.

Also, I am not sure we need to use all 128 bits of the IPv6 address; maybe the 64 low-order bits are enough?

static inline unsigned int udp6_hash_port_and_addr(__u16 port,
						  const struct in6_addr *addr)
{
	/* fold the low 64 bits of the address into the port */
	u32 hash = (__force u32) addr->s6_addr32[2] ^
		   (__force u32) addr->s6_addr32[3];
	hash ^= hash >> 16;
	hash ^= hash >> 8;

	return port ^ hash;
}

static inline unsigned int udp6_hash_wildcard(__u16 port,
					      const struct in6_addr *addr)
{
	/* wildcard sockets hash on the port alone; addr is unused */
	return port;
}

...


	hash = udp6_hash_port_and_addr(ntohs(dport), saddr);
	hashwild = udp6_hash_wildcard(ntohs(dport), saddr);

Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: many sockets, slow sendto
  2007-04-30 12:47                 ` Eric Dumazet
@ 2007-04-30 19:43                   ` YOSHIFUJI Hideaki / 吉藤英明
  2007-04-30 19:59                     ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2007-04-30 19:43 UTC (permalink / raw)
  To: dada1, davem; +Cc: zacco, baruch, netdev, yoshfuji

In article <20070430144715.b0c03c83.dada1@cosmosbay.com> (at Mon, 30 Apr 2007 14:47:15 +0200), Eric Dumazet <dada1@cosmosbay.com> says:

> Also, I am not sure we need to use all 128 bits of IPV6 address, maybe the 64 low order bits are enough ?

Well, maybe, but in IPv6, auto-configured addresses on an interface share
the same 64-bit LSBs (the interface identifier), so hashing only the low
half would not spread them out.  I'd keep it as-is for now.
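To make that concrete, here is a minimal user-space sketch (hypothetical, not part of the patch): two SLAAC-style addresses from different prefixes share the interface identifier, so a hash over only the low 64 bits puts them in the same bucket, while folding in all 128 bits does not.

/*
 * Hypothetical illustration only, not part of the patch.  Two
 * addresses that differ only in their /64 prefix (as SLAAC addresses
 * on one interface typically do) collide if we hash only the low
 * 64 bits, but not if we fold in all 128 bits.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static uint32_t addr_word(const struct in6_addr *a, int i)
{
	uint32_t w;
	memcpy(&w, &a->s6_addr[4 * i], sizeof(w));	/* i-th 32-bit word */
	return w;
}

static unsigned int hash_low64(const struct in6_addr *a)
{
	uint32_t h = addr_word(a, 2) ^ addr_word(a, 3);
	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

static unsigned int hash_128(const struct in6_addr *a)
{
	uint32_t h = addr_word(a, 0) ^ addr_word(a, 1) ^
		     addr_word(a, 2) ^ addr_word(a, 3);
	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

int main(void)
{
	struct in6_addr a, b;

	/* same interface identifier, different prefixes */
	inet_pton(AF_INET6, "2001:db8:1::211:22ff:fe33:4455", &a);
	inet_pton(AF_INET6, "2001:db8:2::211:22ff:fe33:4455", &b);

	printf("low-64 hash:  %u vs %u\n", hash_low64(&a), hash_low64(&b));
	printf("128-bit hash: %u vs %u\n", hash_128(&a), hash_128(&b));
	return 0;
}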

Here's take 2, mainly fixing the UDP-Lite side.

Regards,

----
[IPV6]: Convert UDP(-Lite) to new 2-pass algorithms.

Some inputs from Eric Dumazet <dada1@cosmosbay.com>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

--- 
diff --git a/include/net/udp.h b/include/net/udp.h
index 98755eb..2c06017 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -120,8 +120,12 @@ static inline void udp_lib_close(struct sock *sk, long timeout)
 
 
 /* net/ipv4/udp.c */
+extern unsigned int udp_hash_port_and_rcvaddr(__u16 port,
+					      const struct sock *sk);
 extern int	udp_get_port(struct sock *sk, unsigned short snum,
-			     int (*saddr_cmp)(const struct sock *, const struct sock *));
+			     int (*saddr_cmp)(const struct sock *, const struct sock *),
+			     unsigned int (*hash_port_rcvaddr)(__u16 port,
+				     			       const struct sock *sk));
 extern void	udp_err(struct sk_buff *, u32);
 
 extern int	udp_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/include/net/udplite.h b/include/net/udplite.h
index 635b0ea..6da0d41 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -120,5 +120,6 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 
 extern void	udplite4_register(void);
 extern int 	udplite_get_port(struct sock *sk, unsigned short snum,
-			int (*scmp)(const struct sock *, const struct sock *));
+			int (*scmp)(const struct sock *, const struct sock *),
+			unsigned int (*uhash)(__u16, const struct sock *));
 #endif	/* _UDPLITE_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1449707..9d4293d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -125,6 +125,12 @@ static inline unsigned int hash_port_and_addr(__u16 port, __be32 addr)
 	return port ^ addr;
 }
 
+unsigned int udp4_hash_port_and_rcvaddr(__u16 port,
+					const struct sock *sk)
+{
+	return hash_port_and_addr(port, inet_sk(sk)->rcv_saddr);
+}
+
 static inline int __udp_lib_port_inuse(unsigned int hash, int port,
 	__be32 daddr, struct hlist_head udptable[])
 {
@@ -156,7 +162,9 @@ static inline int __udp_lib_port_inuse(unsigned int hash, int port,
 int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		       struct hlist_head udptable[], int *port_rover,
 		       int (*saddr_comp)(const struct sock *sk1,
-					 const struct sock *sk2 )    )
+					 const struct sock *sk2),
+		       unsigned int (*hash_port_rcvaddr)(__u16 port,
+							 const struct sock *sk))
 {
 	struct hlist_node *node;
 	struct hlist_head *head;
@@ -176,8 +184,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		for (i = 0; i < UDP_HTABLE_SIZE; i++, result++) {
 			int size;
 
-			hash = hash_port_and_addr(result,
-					inet_sk(sk)->rcv_saddr);
+			hash = hash_port_rcvaddr(result, sk);
 			head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 			if (hlist_empty(head)) {
 				if (result > sysctl_local_port_range[1])
@@ -203,8 +210,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 				result = sysctl_local_port_range[0]
 					+ ((result - sysctl_local_port_range[0]) &
 					   (UDP_HTABLE_SIZE - 1));
-			hash = hash_port_and_addr(result,
-					inet_sk(sk)->rcv_saddr);
+			hash = hash_port_rcvaddr(result, sk);
 			if (! __udp_lib_port_inuse(hash, result,
 				inet_sk(sk)->rcv_saddr, udptable))
 				break;
@@ -214,7 +220,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 gotit:
 		*port_rover = snum = result;
 	} else {
-		hash = hash_port_and_addr(snum, inet_sk(sk)->rcv_saddr);
+		hash = hash_port_rcvaddr(snum, sk);
 		head = &udptable[hash & (UDP_HTABLE_SIZE - 1)];
 
 		sk_for_each(sk2, node, head)
@@ -241,9 +247,11 @@ fail:
 }
 
 int udp_get_port(struct sock *sk, unsigned short snum,
-			int (*scmp)(const struct sock *, const struct sock *))
+			int (*scmp)(const struct sock *, const struct sock *),
+			unsigned int (*uhash)(u16 port, const struct sock *))
 {
-	return  __udp_lib_get_port(sk, snum, udp_hash, &udp_port_rover, scmp);
+	return  __udp_lib_get_port(sk, snum, udp_hash, &udp_port_rover,
+				   scmp, uhash);
 }
 
 int ipv4_rcv_saddr_equal(const struct sock *sk1, const struct sock *sk2)
@@ -257,7 +265,8 @@ int ipv4_rcv_saddr_equal(const struct sock *sk1, const struct sock *sk2)
 
 static inline int udp_v4_get_port(struct sock *sk, unsigned short snum)
 {
-	return udp_get_port(sk, snum, ipv4_rcv_saddr_equal);
+	return udp_get_port(sk, snum, ipv4_rcv_saddr_equal,
+			    udp4_hash_port_and_rcvaddr);
 }
 
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
@@ -328,8 +337,8 @@ found:
 }
 
 static inline struct sock *udp_v4_mcast_next(
-			struct sock *sk,
-			unsigned int hnum, __be16 loc_port, __be32 loc_addr,
+			struct sock *sk, unsigned int hnum,
+			__be16 loc_port, __be32 loc_addr,
 			__be16 rmt_port, __be32 rmt_addr,
 			int dif)
 {
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 820a477..d7216c8 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -10,7 +10,8 @@ extern void 	__udp4_lib_err(struct sk_buff *, u32, struct hlist_head []);
 
 extern int	__udp_lib_get_port(struct sock *sk, unsigned short snum,
 				   struct hlist_head udptable[], int *port_rover,
-				   int (*)(const struct sock*,const struct sock*));
+				   int (*)(const struct sock*,const struct sock*),
+				   unsigned int (*)(__u16, const struct sock*));
 extern int	ipv4_rcv_saddr_equal(const struct sock *, const struct sock *);
 
 
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index f34fd68..4c4e0fd 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -18,15 +18,22 @@ DEFINE_SNMP_STAT(struct udp_mib, udplite_statistics)	__read_mostly;
 struct hlist_head 	udplite_hash[UDP_HTABLE_SIZE];
 static int		udplite_port_rover;
 
+extern unsigned int udp4_hash_port_and_rcvaddr(__u16 port,
+					       const struct sock *sk);
+
 int udplite_get_port(struct sock *sk, unsigned short p,
-		     int (*c)(const struct sock *, const struct sock *))
+		     int (*c)(const struct sock *, const struct sock *),
+		     unsigned int (*h)(__u16, const struct sock *))
 {
-	return  __udp_lib_get_port(sk, p, udplite_hash, &udplite_port_rover, c);
+	return  __udp_lib_get_port(sk, p, udplite_hash, &udplite_port_rover,
+				   c, h);
 }
 
 static int udplite_v4_get_port(struct sock *sk, unsigned short snum)
 {
-	return udplite_get_port(sk, snum, ipv4_rcv_saddr_equal);
+	return udplite_get_port(sk, snum,
+				ipv4_rcv_saddr_equal,
+				udp4_hash_port_and_rcvaddr);
 }
 
 static int udplite_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b083c09..1d05a69 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -52,56 +52,95 @@
 
 DEFINE_SNMP_STAT(struct udp_mib, udp_stats_in6) __read_mostly;
 
+static inline unsigned int udp6_hash_port(__u16 port)
+{
+	return port;
+}
+
+static inline unsigned int udp6_hash_port_and_addr(__u16 port,
+						   const struct in6_addr *addr)
+{
+	u32 hash = 0;
+	hash = ((__force u32) addr->s6_addr32[0]) ^
+	       ((__force u32) addr->s6_addr32[1]) ^
+	       ((__force u32) addr->s6_addr32[2]) ^
+	       ((__force u32) addr->s6_addr32[3]);
+	hash ^= hash >> 16;
+	hash ^= hash >> 8;
+	return udp6_hash_port(port) ^ hash;
+}
+
+unsigned int udp6_hash_port_and_rcvaddr(__u16 port,
+					const struct sock *sk)
+{
+	return udp6_hash_port_and_addr(port, &inet6_sk(sk)->rcv_saddr);
+}
+
 static inline int udp_v6_get_port(struct sock *sk, unsigned short snum)
 {
-	return udp_get_port(sk, snum, ipv6_rcv_saddr_equal);
+	return udp_get_port(sk, snum,
+			    ipv6_rcv_saddr_equal,
+			    udp6_hash_port_and_rcvaddr);
 }
 
 static struct sock *__udp6_lib_lookup(struct in6_addr *saddr, __be16 sport,
 				      struct in6_addr *daddr, __be16 dport,
 				      int dif, struct hlist_head udptable[])
 {
-	struct sock *sk, *result = NULL;
+	struct sock *sk = NULL, *result = NULL;
 	struct hlist_node *node;
-	unsigned short hnum = ntohs(dport);
-	int badness = -1;
+	unsigned hash, hashwild;
+	int score, best = -1;
+
+	hash = udp6_hash_port_and_addr(ntohs(dport), saddr);
+	hashwild = udp6_hash_port(ntohs(dport));
 
 	read_lock(&udp_hash_lock);
-	sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
+lookup:
+	sk_for_each(sk, node, &udptable[hash & (UDP_HTABLE_SIZE - 1)]) {
 		struct inet_sock *inet = inet_sk(sk);
+		struct ipv6_pinfo *np = inet6_sk(sk);
 
-		if (sk->sk_hash == hnum && sk->sk_family == PF_INET6) {
-			struct ipv6_pinfo *np = inet6_sk(sk);
-			int score = 0;
-			if (inet->dport) {
-				if (inet->dport != sport)
-					continue;
-				score++;
-			}
-			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
-					continue;
-				score++;
-			}
-			if (!ipv6_addr_any(&np->daddr)) {
-				if (!ipv6_addr_equal(&np->daddr, saddr))
-					continue;
-				score++;
-			}
-			if (sk->sk_bound_dev_if) {
-				if (sk->sk_bound_dev_if != dif)
-					continue;
-				score++;
-			}
-			if (score == 4) {
-				result = sk;
-				break;
-			} else if (score > badness) {
-				result = sk;
-				badness = score;
-			}
+		if (sk->sk_hash != hash || sk->sk_family != PF_INET6 ||
+		    inet->num != dport)
+			continue;
+
+		score = 0;
+
+		if (inet->dport) {
+			if (inet->dport != sport)
+				continue;
+			score++;
 		}
+		if (!ipv6_addr_any(&np->rcv_saddr)) {
+			if (!ipv6_addr_equal(&np->rcv_saddr, daddr))
+				continue;
+			score++;
+		}
+		if (!ipv6_addr_any(&np->daddr)) {
+			if (!ipv6_addr_equal(&np->daddr, saddr))
+				continue;
+			score++;
+		}
+		if (sk->sk_bound_dev_if) {
+			if (sk->sk_bound_dev_if != dif)
+				continue;
+			score++;
+		}
+		if (score == 4) {
+			result = sk;
+			goto found;
+		} else if (score > best) {
+			result = sk;
+			best = score;
+		}
+	}
+
+	if (hash != hashwild) {
+		hash = hashwild;
+		goto lookup;
 	}
+found:
 	if (result)
 		sock_hold(result);
 	read_unlock(&udp_hash_lock);
@@ -302,38 +341,41 @@ drop:
 }
 
 static struct sock *udp_v6_mcast_next(struct sock *sk,
+				      unsigned int hnum,
 				      __be16 loc_port, struct in6_addr *loc_addr,
 				      __be16 rmt_port, struct in6_addr *rmt_addr,
 				      int dif)
 {
 	struct hlist_node *node;
 	struct sock *s = sk;
-	unsigned short num = ntohs(loc_port);
 
 	sk_for_each_from(s, node) {
 		struct inet_sock *inet = inet_sk(s);
+		struct ipv6_pinfo *np = inet6_sk(s);
 
-		if (s->sk_hash == num && s->sk_family == PF_INET6) {
-			struct ipv6_pinfo *np = inet6_sk(s);
-			if (inet->dport) {
-				if (inet->dport != rmt_port)
-					continue;
-			}
-			if (!ipv6_addr_any(&np->daddr) &&
-			    !ipv6_addr_equal(&np->daddr, rmt_addr))
-				continue;
+		if (s->sk_hash != hnum || s->sk_family != PF_INET6 ||
+		    inet->num != loc_port)
+			continue;
 
-			if (s->sk_bound_dev_if && s->sk_bound_dev_if != dif)
+		if (inet->dport) {
+			if (inet->dport != rmt_port)
 				continue;
+		}
+		if (!ipv6_addr_any(&np->daddr) &&
+		    !ipv6_addr_equal(&np->daddr, rmt_addr))
+			continue;
 
-			if (!ipv6_addr_any(&np->rcv_saddr)) {
-				if (!ipv6_addr_equal(&np->rcv_saddr, loc_addr))
-					continue;
-			}
-			if (!inet6_mc_check(s, loc_addr, rmt_addr))
+		if (s->sk_bound_dev_if && s->sk_bound_dev_if != dif)
+			continue;
+
+		if (!ipv6_addr_any(&np->rcv_saddr)) {
+			if (!ipv6_addr_equal(&np->rcv_saddr, loc_addr))
 				continue;
-			return s;
 		}
+
+		if (!inet6_mc_check(s, loc_addr, rmt_addr))
+			continue;
+		return s;
 	}
 	return NULL;
 }
@@ -348,20 +390,42 @@ static int __udp6_lib_mcast_deliver(struct sk_buff *skb, struct in6_addr *saddr,
 	struct sock *sk, *sk2;
 	const struct udphdr *uh = udp_hdr(skb);
 	int dif;
+	int hport = ntohs(uh->dest);
+	unsigned int hash = udp6_hash_port_and_addr(ntohs(uh->dest), daddr);
+	unsigned int hashwild = udp6_hash_port(ntohs(uh->dest));
 
-	read_lock(&udp_hash_lock);
-	sk = sk_head(&udptable[ntohs(uh->dest) & (UDP_HTABLE_SIZE - 1)]);
 	dif = inet6_iif(skb);
-	sk = udp_v6_mcast_next(sk, uh->dest, daddr, uh->source, saddr, dif);
+
+	read_lock(&udp_hash_lock);
+redo:
+	sk = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+	sk = udp_v6_mcast_next(sk, hash, uh->dest, daddr, uh->source, saddr, dif);
 	if (!sk) {
+		if (hash != hashwild) {
+			hash = hashwild;
+			goto redo;
+		}
 		kfree_skb(skb);
 		goto out;
 	}
 
 	sk2 = sk;
-	while ((sk2 = udp_v6_mcast_next(sk_next(sk2), uh->dest, daddr,
-					uh->source, saddr, dif))) {
-		struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC);
+	while(1) {
+		struct sk_buff *buff;
+
+		sk2 = udp_v6_mcast_next(sk_next(sk2), hash, hport, daddr,
+					uh->source, saddr, dif);
+		if (!sk2) {
+			if (hash == hashwild)
+				break;
+			hash = hashwild;
+			sk2 = sk_head(&udptable[hash & (UDP_HTABLE_SIZE - 1)]);
+			sk2 = udp_v6_mcast_next(sk2, hash, uh->dest, daddr, uh->source, saddr, dif);
+			if (!sk2)
+				break;
+		}
+
+		buff = skb_clone(skb, GFP_ATOMIC);
 		if (buff)
 			udpv6_queue_rcv_skb(sk2, buff);
 	}
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index f54016a..797d76d 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -17,6 +17,9 @@
 
 DEFINE_SNMP_STAT(struct udp_mib, udplite_stats_in6) __read_mostly;
 
+extern unsigned int udp6_hash_port_and_rcvaddr(__u16 port,
+					       const struct sock *sk);
+
 static int udplitev6_rcv(struct sk_buff **pskb)
 {
 	return __udp6_lib_rcv(pskb, udplite_hash, IPPROTO_UDPLITE);
@@ -37,7 +40,9 @@ static struct inet6_protocol udplitev6_protocol = {
 
 static int udplite_v6_get_port(struct sock *sk, unsigned short snum)
 {
-	return udplite_get_port(sk, snum, ipv6_rcv_saddr_equal);
+	return udplite_get_port(sk, snum,
+				ipv6_rcv_saddr_equal,
+				udp6_hash_port_and_rcvaddr);
 }
 
 struct proto udplitev6_prot = {


--yoshfuji

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: many sockets, slow sendto
  2007-04-30 19:43                   ` YOSHIFUJI Hideaki / 吉藤英明
@ 2007-04-30 19:59                     ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2007-04-30 19:59 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: davem, zacco, baruch, netdev

YOSHIFUJI Hideaki / 吉藤英明 wrote:
> In article <20070430144715.b0c03c83.dada1@cosmosbay.com> (at Mon, 30 Apr 2007 14:47:15 +0200), Eric Dumazet <dada1@cosmosbay.com> says:
> 
>> Also, I am not sure we need to use all 128 bits of IPV6 address, maybe the 64 low order bits are enough ?
> 
> Well, maybe, but in IPv6, auto-configured addresses on an interface have
> the same 64-bit LSBs.  So, I'd keep as-is so far.
> 

Hmm... then maybe ipv6_addr_hash() in include/net/addrconf.h should be
changed as well:

static __inline__ u8 ipv6_addr_hash(const struct in6_addr *addr)
{
         __u32 word;

         /*
          * We perform the hash function over the last 64 bits of the address
          * This will include the IEEE address token on links that support it.
          */

         word = (__force u32)(addr->s6_addr32[2] ^ addr->s6_addr32[3]);
         word ^= (word >> 16);
         word ^= (word >> 8);

         return ((word ^ (word >> 4)) & 0x0f);
}
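
For illustration only (no such change was posted in this thread), a variant that folds in all 128 bits, in the same style as udp6_hash_port_and_addr(), might look like the following; the name ipv6_addr_hash_full is hypothetical:

/* Hypothetical sketch, not a submitted patch: fold all 128 bits of
 * the address instead of only the last 64. */
static __inline__ u8 ipv6_addr_hash_full(const struct in6_addr *addr)
{
	__u32 word;

	word = (__force u32)(addr->s6_addr32[0] ^ addr->s6_addr32[1] ^
			     addr->s6_addr32[2] ^ addr->s6_addr32[3]);
	word ^= (word >> 16);
	word ^= (word >> 8);

	return ((word ^ (word >> 4)) & 0x0f);
}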



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2007-04-30 19:59 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-06 15:08 many sockets, slow sendto Zaccomer Lajos
2007-03-06 18:20 ` Baruch Even
2007-03-19 23:10   ` Zacco
2007-03-19 23:16     ` David Miller
2007-03-20 21:59       ` Zacco
2007-03-20 22:48         ` Eric Dumazet
2007-03-21 21:53           ` Zacco
2007-03-21 22:24             ` Eric Dumazet
2007-03-21 23:26             ` Eric Dumazet
2007-03-22  1:14             ` David Miller
2007-03-21 22:12           ` Eric Dumazet
2007-03-22  1:15             ` David Miller
2007-03-22  6:43               ` Eric Dumazet
2007-03-29 19:24             ` Zacco
2007-03-30 13:10               ` Eric Dumazet
2007-04-22 12:34                 ` Zacco
2007-04-30  7:26             ` David Miller
2007-04-30 10:56               ` YOSHIFUJI Hideaki / 吉藤英明
2007-04-30 12:47                 ` Eric Dumazet
2007-04-30 19:43                   ` YOSHIFUJI Hideaki / 吉藤英明
2007-04-30 19:59                     ` Eric Dumazet
2007-03-06 19:23 ` Andi Kleen
2007-03-06 21:28   ` Zacco
