All of lore.kernel.org
 help / color / mirror / Atom feed
* 2.6.34: Problem with UDP traffic on lo + poll(?)
@ 2010-09-06 17:11 Krzysztof Oledzki
  2010-09-06 19:42 ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Oledzki @ 2010-09-06 17:11 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3359 bytes --]

Hello,

For the last two days I have been trying to track a starange problem I 
bumped into after upgrading my kernel from 2.6.31.12 to 2.6.34.6.

The problem is that several times a day, nagios logs that plugins are not 
able to resolve DNS hostnames of monitored hosts. The DNS service is 
provided locally by the host itself so all traffic is handled over a 
loopback interface. The host handles rather moderate traffic - ~1000pps 
and ~30 DNS requests per second. This DNS service is also provided to 
other hosts that are also running 2.6.34.6 and are connected over a 
Ethernet network, but the problem exists only locally.

After a long investigation I found that I'm able to reproduce this problem 
by adding: "*.t IN A 127.0.0.1" to the "lan" zone and using the following 
script:

--- cut here ---
a=0
while strace -o /tmp/s.log.1 -s 1024  /usr/lib64/nagios/plugins/check_icmp -H $a.t.lan ; do
  date
  sleep 0.1
  a=$((a+1))
done
-- cut here ---

Strace shows that the problem is in receiving responses from the 
nameserver:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.130.53")}, 28) = 0
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
close(4)                                = 0

However, tcpdump attached to lo shows that both the request and 
the response are properly delivered:

03:00:47.181529 IP (tos 0x0, ttl 64, id 47869, offset 0, flags [DF], proto UDP (17), length 56)
     192.168.130.53.41083 > 192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
03:00:47.181585 IP (tos 0x0, ttl 64, id 29563, offset 0, flags [none], proto UDP (17), length 112)
     192.168.130.53.53 > 192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)
--
03:00:52.186465 IP (tos 0x0, ttl 64, id 47870, offset 0, flags [DF], proto UDP (17), length 56)
     192.168.130.53.41083 > 192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
03:00:52.186580 IP (tos 0x0, ttl 64, id 29576, offset 0, flags [none], proto UDP (17), length 112)
     192.168.130.53.53 > 192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)

03:00:57.298221 IP (tos 0x0, ttl 64, id 57985, offset 0, flags [DF], proto UDP (17), length 60)
     192.168.130.53.39370 > 192.168.130.53.53: 145+ A? 1817.t.lan.lan. (32)
03:00:57.298300 IP (tos 0x0, ttl 64, id 29584, offset 0, flags [none], proto UDP (17), length 116)
     192.168.130.53.53 > 192.168.130.53.39370: 145 NXDomain* 0/1/0 (88)

In most cases it takes from 2m to 15m to trigger this error and so far I 
have not been able to reproduce it on my lab environment. Downgrading 
the kernel back to 2.6.31 cures the issue.

I have a very short service window so bisecting is nearly impossible. 
During the next few days I should be able to find if this problem was 
introduced in 2.6.32 or 2.6.33, but if you have clues what to check first 
or idea about some smart debug patches, I will be very grateful.

Best regards,

 			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 17:11 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Oledzki
@ 2010-09-06 19:42 ` Eric Dumazet
  2010-09-06 19:55   ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-06 19:42 UTC (permalink / raw)
  To: Krzysztof Oledzki; +Cc: netdev

Le lundi 06 septembre 2010 à 19:11 +0200, Krzysztof Oledzki a écrit :
> Hello,
> 
> For the last two days I have been trying to track a starange problem I 
> bumped into after upgrading my kernel from 2.6.31.12 to 2.6.34.6.
> 
> The problem is that several times a day, nagios logs that plugins are not 
> able to resolve DNS hostnames of monitored hosts. The DNS service is 
> provided locally by the host itself so all traffic is handled over a 
> loopback interface. The host handles rather moderate traffic - ~1000pps 
> and ~30 DNS requests per second. This DNS service is also provided to 
> other hosts that are also running 2.6.34.6 and are connected over a 
> Ethernet network, but the problem exists only locally.
> 
> After a long investigation I found that I'm able to reproduce this problem 
> by adding: "*.t IN A 127.0.0.1" to the "lan" zone and using the following 
> script:
> 
> --- cut here ---
> a=0
> while strace -o /tmp/s.log.1 -s 1024  /usr/lib64/nagios/plugins/check_icmp -H $a.t.lan ; do
>   date
>   sleep 0.1
>   a=$((a+1))
> done
> -- cut here ---
> 
> Strace shows that the problem is in receiving responses from the 
> nameserver:
> 
> socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
> connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.130.53")}, 28) = 0
> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
> sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
> poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
> sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
> poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
> close(4)                                = 0
> 
> However, tcpdump attached to lo shows that both the request and 
> the response are properly delivered:
> 
> 03:00:47.181529 IP (tos 0x0, ttl 64, id 47869, offset 0, flags [DF], proto UDP (17), length 56)
>      192.168.130.53.41083 > 192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
> 03:00:47.181585 IP (tos 0x0, ttl 64, id 29563, offset 0, flags [none], proto UDP (17), length 112)
>      192.168.130.53.53 > 192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)
> --
> 03:00:52.186465 IP (tos 0x0, ttl 64, id 47870, offset 0, flags [DF], proto UDP (17), length 56)
>      192.168.130.53.41083 > 192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
> 03:00:52.186580 IP (tos 0x0, ttl 64, id 29576, offset 0, flags [none], proto UDP (17), length 112)
>      192.168.130.53.53 > 192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)
> 
> 03:00:57.298221 IP (tos 0x0, ttl 64, id 57985, offset 0, flags [DF], proto UDP (17), length 60)
>      192.168.130.53.39370 > 192.168.130.53.53: 145+ A? 1817.t.lan.lan. (32)
> 03:00:57.298300 IP (tos 0x0, ttl 64, id 29584, offset 0, flags [none], proto UDP (17), length 116)
>      192.168.130.53.53 > 192.168.130.53.39370: 145 NXDomain* 0/1/0 (88)
> 
> In most cases it takes from 2m to 15m to trigger this error and so far I 
> have not been able to reproduce it on my lab environment. Downgrading 
> the kernel back to 2.6.31 cures the issue.
> 
> I have a very short service window so bisecting is nearly impossible. 
> During the next few days I should be able to find if this problem was 
> introduced in 2.6.32 or 2.6.33, but if you have clues what to check first 
> or idea about some smart debug patches, I will be very grateful.

Do you have iptables and conntracking loaded ?

Maybe frame is droped by firewall on this particular port (39370 )




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 19:42 ` Eric Dumazet
@ 2010-09-06 19:55   ` Krzysztof Olędzki
  2010-09-06 20:29     ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-06 19:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

On 2010-09-06 21:42, Eric Dumazet wrote:
> Le lundi 06 septembre 2010 à 19:11 +0200, Krzysztof Oledzki a écrit :
>> Hello,
>>
>> For the last two days I have been trying to track a starange problem I
>> bumped into after upgrading my kernel from 2.6.31.12 to 2.6.34.6.
>>
>> The problem is that several times a day, nagios logs that plugins are not
>> able to resolve DNS hostnames of monitored hosts. The DNS service is
>> provided locally by the host itself so all traffic is handled over a
>> loopback interface. The host handles rather moderate traffic - ~1000pps
>> and ~30 DNS requests per second. This DNS service is also provided to
>> other hosts that are also running 2.6.34.6 and are connected over a
>> Ethernet network, but the problem exists only locally.
>>
>> After a long investigation I found that I'm able to reproduce this problem
>> by adding: "*.t IN A 127.0.0.1" to the "lan" zone and using the following
>> script:
>>
>> --- cut here ---
>> a=0
>> while strace -o /tmp/s.log.1 -s 1024  /usr/lib64/nagios/plugins/check_icmp -H $a.t.lan ; do
>>    date
>>    sleep 0.1
>>    a=$((a+1))
>> done
>> -- cut here ---
>>
>> Strace shows that the problem is in receiving responses from the
>> nameserver:
>>
>> socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
>> connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.130.53")}, 28) = 0
>> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
>> sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
>> poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
>> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
>> sendto(4, "\333b\1\0\0\1\0\0\0\0\0\0\0041817\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
>> poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
>> close(4)                                = 0
>>
>> However, tcpdump attached to lo shows that both the request and
>> the response are properly delivered:
>>
>> 03:00:47.181529 IP (tos 0x0, ttl 64, id 47869, offset 0, flags [DF], proto UDP (17), length 56)
>>       192.168.130.53.41083>  192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
>> 03:00:47.181585 IP (tos 0x0, ttl 64, id 29563, offset 0, flags [none], proto UDP (17), length 112)
>>       192.168.130.53.53>  192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)
>> --
>> 03:00:52.186465 IP (tos 0x0, ttl 64, id 47870, offset 0, flags [DF], proto UDP (17), length 56)
>>       192.168.130.53.41083>  192.168.130.53.53: 56162+ A? 1817.t.lan. (28)
>> 03:00:52.186580 IP (tos 0x0, ttl 64, id 29576, offset 0, flags [none], proto UDP (17), length 112)
>>       192.168.130.53.53>  192.168.130.53.41083: 56162* 1/1/1 1817.t.lan. A 127.0.0.1 (84)
>>
>> 03:00:57.298221 IP (tos 0x0, ttl 64, id 57985, offset 0, flags [DF], proto UDP (17), length 60)
>>       192.168.130.53.39370>  192.168.130.53.53: 145+ A? 1817.t.lan.lan. (32)
>> 03:00:57.298300 IP (tos 0x0, ttl 64, id 29584, offset 0, flags [none], proto UDP (17), length 116)
>>       192.168.130.53.53>  192.168.130.53.39370: 145 NXDomain* 0/1/0 (88)
>>
>> In most cases it takes from 2m to 15m to trigger this error and so far I
>> have not been able to reproduce it on my lab environment. Downgrading
>> the kernel back to 2.6.31 cures the issue.
>>
>> I have a very short service window so bisecting is nearly impossible.
>> During the next few days I should be able to find if this problem was
>> introduced in 2.6.32 or 2.6.33, but if you have clues what to check first
>> or idea about some smart debug patches, I will be very grateful.
>
> Do you have iptables and conntracking loaded ?

2 x Yes:

# iptables-save
# Generated by iptables-save v1.4.9.1 on Mon Sep  6 21:52:24 2010
*raw
:PREROUTING ACCEPT [3061956837:831568193533]
:OUTPUT ACCEPT [2101642993:2332130961342]
COMMIT
# Completed on Mon Sep  6 21:52:24 2010
# Generated by iptables-save v1.4.9.1 on Mon Sep  6 21:52:24 2010
*nat
:PREROUTING ACCEPT [7763503:512346498]
:OUTPUT ACCEPT [83026014:5350796588]
:POSTROUTING ACCEPT [83026014:5350796588]
-A PREROUTING -d 192.168.126.31/32 -j RETURN
-A PREROUTING -i bond0 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8088
COMMIT
# Completed on Mon Sep  6 21:52:24 2010
# Generated by iptables-save v1.4.9.1 on Mon Sep  6 21:52:24 2010
*mangle
:PREROUTING ACCEPT [3061956837:831568193533]
:INPUT ACCEPT [3061952879:831567826577]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [2101642993:2332130961342]
:POSTROUTING ACCEPT [2101704532:2332145225884]
COMMIT
# Completed on Mon Sep  6 21:52:24 2010
# Generated by iptables-save v1.4.9.1 on Mon Sep  6 21:52:24 2010
*filter
:INPUT ACCEPT [86498522:24274295687]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [42134743:136784108555]
COMMIT
# Completed on Mon Sep  6 21:52:24 2010

# sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 375
net.netfilter.nf_conntrack_max = 65536


> Maybe frame is droped by firewall on this particular port (39370 )

Yes, conntrack is one of possibilities. However, this problem only 
manifests on 2.6.34 and never on 2.6.31 where iptables and conntrack 
configurations are identically. And of course, each time it is a 
different port.

Please also note that this problem only exists when communication is 
handled over a loopback interface - I'm not able to trigger this from a 
remote host even if I run the test on two hosts (local & remote) 
simultaneously.

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 19:55   ` Krzysztof Olędzki
@ 2010-09-06 20:29     ` Eric Dumazet
  2010-09-06 20:44       ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-06 20:29 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: netdev

Le lundi 06 septembre 2010 à 21:55 +0200, Krzysztof Olędzki a écrit :

> Yes, conntrack is one of possibilities. However, this problem only 
> manifests on 2.6.34 and never on 2.6.31 where iptables and conntrack 
> configurations are identically. And of course, each time it is a 
> different port.
> 
> Please also note that this problem only exists when communication is 
> handled over a loopback interface - I'm not able to trigger this from a 
> remote host even if I run the test on two hosts (local & remote) 
> simultaneously.
> 

No particular error shown in "netstat -s" ?

port randomization on UDP changed in the past, and conntracking changed
a bit too ;)






^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 20:29     ` Eric Dumazet
@ 2010-09-06 20:44       ` Krzysztof Olędzki
  2010-09-06 20:48         ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-06 20:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 2010-09-06 22:29, Eric Dumazet wrote:
> Le lundi 06 septembre 2010 à 21:55 +0200, Krzysztof Olędzki a écrit :
>
>> Yes, conntrack is one of possibilities. However, this problem only
>> manifests on 2.6.34 and never on 2.6.31 where iptables and conntrack
>> configurations are identically. And of course, each time it is a
>> different port.
>>
>> Please also note that this problem only exists when communication is
>> handled over a loopback interface - I'm not able to trigger this from a
>> remote host even if I run the test on two hosts (local&  remote)
>> simultaneously.
>>
>
> No particular error shown in "netstat -s" ?

No... :(

Udp:
     8542243 packets received
     489605 packets to unknown port received.
     1 packet receive errors
     4254527 packets sent
     RcvbufErrors: 1

> port randomization on UDP changed in the past, and conntracking changed
> a bit too ;)

I know but AFAIR all important changs were alredy included in 2.6.31. 
And again: there is no problem in quering DNS from a remote host:
  [client 2.6.24.6] <-ethernet-> [server 2.6.34.6]

BTW: I have been able to reproduce this problem on a different, less 
critical host after upgrading its kernel to 2.6.34.6. Unfortunately I'm 
still not able to do in on my lab environment. :( Anyway, I'll try to 
catch "conntrack -E" output and see what conntrack thinks about such 
packets.

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 20:44       ` Krzysztof Olędzki
@ 2010-09-06 20:48         ` Krzysztof Olędzki
  2010-09-07 15:37           ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-06 20:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 2010-09-06 22:44, Krzysztof Olędzki wrote:
> On 2010-09-06 22:29, Eric Dumazet wrote:
>> Le lundi 06 septembre 2010 à 21:55 +0200, Krzysztof Olędzki a écrit :
>>
>>> Yes, conntrack is one of possibilities. However, this problem only
>>> manifests on 2.6.34 and never on 2.6.31 where iptables and conntrack
>>> configurations are identically. And of course, each time it is a
>>> different port.
>>>
>>> Please also note that this problem only exists when communication is
>>> handled over a loopback interface - I'm not able to trigger this from a
>>> remote host even if I run the test on two hosts (local&   remote)
>>> simultaneously.
>>>
>>
>> No particular error shown in "netstat -s" ?
> 
> No... :(
> 
> Udp:
>       8542243 packets received
>       489605 packets to unknown port received.
>       1 packet receive errors
>       4254527 packets sent
>       RcvbufErrors: 1
> 
>> port randomization on UDP changed in the past, and conntracking changed
>> a bit too ;)
> 
> I know but AFAIR all important changs were alredy included in 2.6.31.
> And again: there is no problem in quering DNS from a remote host:
>    [client 2.6.24.6]<-ethernet->  [server 2.6.34.6]
> 
> BTW: I have been able to reproduce this problem on a different, less
> critical host after upgrading its kernel to 2.6.34.6. Unfortunately I'm
> still not able to do in on my lab environment. :( Anyway, I'll try to
> catch "conntrack -E" output and see what conntrack thinks about such
> packets.

OK, got it:

*strace (1682.t.lan):
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.130.53")}, 28) = 0
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
sendto(4, "Gz\1\0\0\1\0\0\0\0\0\0\0041683\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])

* tcpdump:
1283805361.395859 IP (tos 0x0, ttl 64, id 47011, offset 0, flags [DF], proto UDP (17), length 56)
    192.168.130.53.49279 > 192.168.130.53.53: 27611+ A? 1682.t.lan. (28)
1283805361.395933 IP (tos 0x0, ttl 64, id 10738, offset 0, flags [none], proto UDP (17), length 112)
    192.168.130.53.53 > 192.168.130.53.49279: 27611* 1/1/1 1682.t.lan. A 127.0.0.1 (84)

* conntrack:
[1283805361.395862]         [NEW] ipv4     2 udp      17 30 src=192.168.130.53 dst=192.168.130.53 sport=49279 dport=53 [UNREPLIED] src=192.168.130.53 dst=192.168.130.53 sport=53 dport=49279 id=3423125776
[1283805361.395939]      [UPDATE] ipv4     2 udp      17 30 src=192.168.130.53 dst=192.168.130.53 sport=49279 dport=53 src=192.168.130.53 dst=192.168.130.53 sport=53 dport=49279 id=3423125776


Pozdrawiam,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-06 20:48         ` Krzysztof Olędzki
@ 2010-09-07 15:37           ` Krzysztof Olędzki
  2010-09-07 16:36             ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-07 15:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 2010-09-06 22:48, Krzysztof Olędzki wrote:
> On 2010-09-06 22:44, Krzysztof Olędzki wrote:
>> On 2010-09-06 22:29, Eric Dumazet wrote:
>>> Le lundi 06 septembre 2010 à 21:55 +0200, Krzysztof Olędzki a écrit :
>>>
>>>> Yes, conntrack is one of possibilities. However, this problem only
>>>> manifests on 2.6.34 and never on 2.6.31 where iptables and conntrack
>>>> configurations are identically. And of course, each time it is a
>>>> different port.
>>>>
>>>> Please also note that this problem only exists when communication is
>>>> handled over a loopback interface - I'm not able to trigger this from a
>>>> remote host even if I run the test on two hosts (local&    remote)
>>>> simultaneously.
>>>>
>>>
>>> No particular error shown in "netstat -s" ?
>>
>> No... :(
>>
>> Udp:
>>        8542243 packets received
>>        489605 packets to unknown port received.
>>        1 packet receive errors
>>        4254527 packets sent
>>        RcvbufErrors: 1
>>
>>> port randomization on UDP changed in the past, and conntracking changed
>>> a bit too ;)
>>
>> I know but AFAIR all important changs were alredy included in 2.6.31.
>> And again: there is no problem in quering DNS from a remote host:
>>     [client 2.6.24.6]<-ethernet->   [server 2.6.34.6]
>>
>> BTW: I have been able to reproduce this problem on a different, less
>> critical host after upgrading its kernel to 2.6.34.6. Unfortunately I'm
>> still not able to do in on my lab environment. :( Anyway, I'll try to
>> catch "conntrack -E" output and see what conntrack thinks about such
>> packets.
>
> OK, got it:
>
> *strace (1682.t.lan):
> socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
> connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.130.53")}, 28) = 0
> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
> sendto(4, "Gz\1\0\0\1\0\0\0\0\0\0\0041683\1t\3lan\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
> poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
>
> * tcpdump:
> 1283805361.395859 IP (tos 0x0, ttl 64, id 47011, offset 0, flags [DF], proto UDP (17), length 56)
>      192.168.130.53.49279>  192.168.130.53.53: 27611+ A? 1682.t.lan. (28)
> 1283805361.395933 IP (tos 0x0, ttl 64, id 10738, offset 0, flags [none], proto UDP (17), length 112)
>      192.168.130.53.53>  192.168.130.53.49279: 27611* 1/1/1 1682.t.lan. A 127.0.0.1 (84)
>
> * conntrack:
> [1283805361.395862]         [NEW] ipv4     2 udp      17 30 src=192.168.130.53 dst=192.168.130.53 sport=49279 dport=53 [UNREPLIED] src=192.168.130.53 dst=192.168.130.53 sport=53 dport=49279 id=3423125776
> [1283805361.395939]      [UPDATE] ipv4     2 udp      17 30 src=192.168.130.53 dst=192.168.130.53 sport=49279 dport=53 src=192.168.130.53 dst=192.168.130.53 sport=53 dport=49279 id=3423125776

So far I have found that:

2.6.31.7/2.6.31.12: OK
2.6.32.21: OK
2.6.33-rc1: bad
2.6.33-rc5: bad
2.6.33.7: bad
2.6.34.4/2.6.34.6: bad

It looks like the bug must have been introduced in 2.6.33-rc1. There are 
8904 commits between 2.6.32 and 2.6.33-rc1, so with ~15 more reboots I 
should be able to point the problematic commit. I hope. ;)


Best regards,

			Krzysztof Olędzki


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 15:37           ` Krzysztof Olędzki
@ 2010-09-07 16:36             ` Eric Dumazet
  2010-09-07 19:20               ` Krzysztof Olędzki
  2010-09-07 19:26               ` Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-07 16:36 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: netdev

Le mardi 07 septembre 2010 à 17:37 +0200, Krzysztof Olędzki a écrit :

> So far I have found that:
> 
> 2.6.31.7/2.6.31.12: OK
> 2.6.32.21: OK
> 2.6.33-rc1: bad
> 2.6.33-rc5: bad
> 2.6.33.7: bad
> 2.6.34.4/2.6.34.6: bad
> 
> It looks like the bug must have been introduced in 2.6.33-rc1. There are 
> 8904 commits between 2.6.32 and 2.6.33-rc1, so with ~15 more reboots I 
> should be able to point the problematic commit. I hope. ;)
> 
> 

Hmm, I have a pretty good idea of what the problem is, and will post a
fix soon ;)

Thanks !




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 16:36             ` Eric Dumazet
@ 2010-09-07 19:20               ` Krzysztof Olędzki
  2010-09-07 19:26               ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-07 19:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On 2010-09-07 18:36, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 17:37 +0200, Krzysztof Olędzki a écrit :
>
>> So far I have found that:
>>
>> 2.6.31.7/2.6.31.12: OK
>> 2.6.32.21: OK
>> 2.6.33-rc1: bad
>> 2.6.33-rc5: bad
>> 2.6.33.7: bad
>> 2.6.34.4/2.6.34.6: bad
>>
>> It looks like the bug must have been introduced in 2.6.33-rc1. There are
>> 8904 commits between 2.6.32 and 2.6.33-rc1, so with ~15 more reboots I
>> should be able to point the problematic commit. I hope. ;)
>>
>>
>
> Hmm, I have a pretty good idea of what the problem is, and will post a
> fix soon ;)
>
> Thanks !

Great! So I'm waiting for a patch to test! :)

Best regards,

				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 16:36             ` Eric Dumazet
  2010-09-07 19:20               ` Krzysztof Olędzki
@ 2010-09-07 19:26               ` Eric Dumazet
  2010-09-07 19:59                 ` David Miller
  2010-09-07 21:28                 ` 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Olędzki
  1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-07 19:26 UTC (permalink / raw)
  To: Krzysztof Olędzki, David Miller; +Cc: netdev

Le mardi 07 septembre 2010 à 18:36 +0200, Eric Dumazet a écrit :

> Hmm, I have a pretty good idea of what the problem is, and will post a
> fix soon ;)

David, if you feel this is too invasive for stable, we can make UDP
rehash the socket in case we dont want to change ip4_datagram_connect()


[PATCH] inet: Dont set inet_rcv_saddr in connect()

So the problem is that the sequence :

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)

connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
sin_addr=inet_addr("1.2.3.4")}, 28)

1) Does an implicit inet_autobind()
  (using an INADDR_ANY address, and selecting a random port).

2) Then does an ip4_datagram_connect() to specify the address/port of
remote end point.

Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
INADDR_ANY to IP source address, given the current route to remote end
point). Only the first connect() on the socket does this. Following ones
dont change the (possibly wrong) source address.

This breaks the secondary UDP hash, based on (ADDRESS, port), that was
computed by inet_autobind().

This also potentially breaks multiple connect() to change remote
endpoints, because old source address might be non usable for packets to
new destination.

If route happens to change, then we should automatically change our
source address too, at next sendmsg() call, and UDP code deals with this
just fine.

If an application needs to specify a precise source address, it must use
bind() system call. connect() man page only refers to remote address,
not local one.

Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/datagram.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..8a17241 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -60,10 +60,19 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		ip_rt_put(rt);
 		return -EACCES;
 	}
+/*
+ * Should connect() change inet_rcv_saddr ?
+ * It should not IMHO, because we want to specify the peer to which
+ * datagrams are to be sent, regardless of our source address that might
+ * change in the future, after a route change.
+ * To specify our source address, bind() is the right API.
+ */
+#if 0
 	if (!inet->inet_saddr)
 		inet->inet_saddr = rt->rt_src;	/* Update source address */
 	if (!inet->inet_rcv_saddr)
 		inet->inet_rcv_saddr = rt->rt_src;
+#endif
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 19:26               ` Eric Dumazet
@ 2010-09-07 19:59                 ` David Miller
  2010-09-07 21:35                   ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
  2010-09-07 21:28                 ` 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Olędzki
  1 sibling, 1 reply; 34+ messages in thread
From: David Miller @ 2010-09-07 19:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ole, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Sep 2010 21:26:09 +0200

>  	}
> +/*
> + * Should connect() change inet_rcv_saddr ?
> + * It should not IMHO, because we want to specify the peer to which
> + * datagrams are to be sent, regardless of our source address that might
> + * change in the future, after a route change.
> + * To specify our source address, bind() is the right API.
> + */
> +#if 0
>  	if (!inet->inet_saddr)
>  		inet->inet_saddr = rt->rt_src;	/* Update source address */
>  	if (!inet->inet_rcv_saddr)
>  		inet->inet_rcv_saddr = rt->rt_src;
> +#endif
>  	inet->inet_daddr = rt->rt_dst;
>  	inet->inet_dport = usin->sin_port;
>  	sk->sk_state = TCP_ESTABLISHED;

Eric, please just delete the code block instead of leaving it
there inside of an #if 0 block.

If there is information conveyed by the unused code, add that
information to the nice comment you're adding :-)

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 19:26               ` Eric Dumazet
  2010-09-07 19:59                 ` David Miller
@ 2010-09-07 21:28                 ` Krzysztof Olędzki
  2010-09-07 21:39                   ` Eric Dumazet
  1 sibling, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-07 21:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 2010-09-07 21:26, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 18:36 +0200, Eric Dumazet a écrit :
>
>> Hmm, I have a pretty good idea of what the problem is, and will post a
>> fix soon ;)
>
> David, if you feel this is too invasive for stable, we can make UDP
> rehash the socket in case we dont want to change ip4_datagram_connect()
>
>
> [PATCH] inet: Dont set inet_rcv_saddr in connect()
>
> So the problem is that the sequence :
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)
>
> connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
> sin_addr=inet_addr("1.2.3.4")}, 28)
>
> 1) Does an implicit inet_autobind()
>    (using an INADDR_ANY address, and selecting a random port).
>
> 2) Then does an ip4_datagram_connect() to specify the address/port of
> remote end point.
>
> Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
> INADDR_ANY to IP source address, given the current route to remote end
> point). Only the first connect() on the socket does this. Following ones
> dont change the (possibly wrong) source address.
>
> This breaks the secondary UDP hash, based on (ADDRESS, port), that was
> computed by inet_autobind().
>
> This also potentially breaks multiple connect() to change remote
> endpoints, because old source address might be non usable for packets to
> new destination.
>
> If route happens to change, then we should automatically change our
> source address too, at next sendmsg() call, and UDP code deals with this
> just fine.
>
> If an application needs to specify a precise source address, it must use
> bind() system call. connect() man page only refers to remote address,
> not local one.
>
> Reported-by: Krzysztof Olędzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>   net/ipv4/datagram.c |    9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index f055094..8a17241 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -60,10 +60,19 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>   		ip_rt_put(rt);
>   		return -EACCES;
>   	}
> +/*
> + * Should connect() change inet_rcv_saddr ?
> + * It should not IMHO, because we want to specify the peer to which
> + * datagrams are to be sent, regardless of our source address that might
> + * change in the future, after a route change.
> + * To specify our source address, bind() is the right API.
> + */
> +#if 0
>   	if (!inet->inet_saddr)
>   		inet->inet_saddr = rt->rt_src;	/* Update source address */
>   	if (!inet->inet_rcv_saddr)
>   		inet->inet_rcv_saddr = rt->rt_src;
> +#endif
>   	inet->inet_daddr = rt->rt_dst;
>   	inet->inet_dport = usin->sin_port;
>   	sk->sk_state = TCP_ESTABLISHED;

With the above patch I'm no longer able to reproduce the problem. Thanks!

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

BTW: why it takes so long to trigger this bug and it is only possible 
over a loopback interface?

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-07 19:59                 ` David Miller
@ 2010-09-07 21:35                   ` Eric Dumazet
  2010-09-07 21:52                     ` Krzysztof Olędzki
  2010-09-08  2:34                     ` Brian Haley
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-07 21:35 UTC (permalink / raw)
  To: David Miller; +Cc: ole, netdev

Le mardi 07 septembre 2010 à 12:59 -0700, David Miller a écrit :

> Eric, please just delete the code block instead of leaving it
> there inside of an #if 0 block.
> 
> If there is information conveyed by the unused code, add that
> information to the nice comment you're adding :-)
> 

Indeed ;)

[PATCH] inet: dont set inet_rcv_saddr in connect()

The following sequence :

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)

connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
sin_addr=inet_addr("1.2.3.4")}, 28)

1) Does an implicit inet_autobind()
  (using an INADDR_ANY address, and selecting a random port).

2) Does an ip4_datagram_connect() to specify the address/port of
remote end point.

Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
INADDR_ANY to IP source address, given the current route to remote end
point). Only the first connect() on the socket does this. Following ones
dont change the (possibly wrong) source address.

This breaks the secondary UDP hash, based on (ADDRESS, port), that was
computed by inet_autobind(). If more than 10 sockets are linked in
primary hash chain, we can drop incoming packets because searches are
done in wrong secondary hash chain.

This also potentially breaks multiple connect() to change remote
endpoints, because old source address might be non usable for packets to
new destination.

If route happens to change, then we should automatically change our
source address too, at next sendmsg() call, and UDP code deals with this
just fine.

If an application needs to specify a precise source address, it must use
bind() system call. connect() man page only refers to remote address,
not local one.

Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/datagram.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..656dc73 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -60,10 +60,13 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		ip_rt_put(rt);
 		return -EACCES;
 	}
-	if (!inet->inet_saddr)
-		inet->inet_saddr = rt->rt_src;	/* Update source address */
-	if (!inet->inet_rcv_saddr)
-		inet->inet_rcv_saddr = rt->rt_src;
+	/*
+	 * Should connect() change inet_rcv_saddr / inet_saddr ?
+	 * It should not, because we want to specify the peer to which
+	 * datagrams are to be sent, regardless of our source address that
+	 * might change in the future, after a route change.
+	 * To specify our source address, bind() is the right API.
+	 */
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 21:28                 ` 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Olędzki
@ 2010-09-07 21:39                   ` Eric Dumazet
  2010-09-07 21:51                     ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-07 21:39 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: David Miller, netdev

Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :

> With the above patch I'm no longer able to reproduce the problem. Thanks!
> 
> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> 

Thanks a lot !

> BTW: why it takes so long to trigger this bug and it is only possible 
> over a loopback interface?

Its a bit tricky : You need at least 10 sockets linked in a particular
hash chain.

To check this, you can :

cat /proc/net/udp

maybe you have many sockets on port 123 or 53 ?

And about loopback, I have no idea... I am pretty sure I can trigger the
bug with other interfaces.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 21:39                   ` Eric Dumazet
@ 2010-09-07 21:51                     ` Krzysztof Olędzki
  2010-09-08  4:12                       ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-07 21:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 2010-09-07 23:39, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :
> 
>> With the above patch I'm no longer able to reproduce the problem. Thanks!
>>
>> Tested-by: Krzysztof Piotr Oledzki<ole@ans.pl>
>>
> 
> Thanks a lot !
> 
>> BTW: why it takes so long to trigger this bug and it is only possible
>> over a loopback interface?
> 
> Its a bit tricky : You need at least 10 sockets linked in a particular
> hash chain.
> 
> To check this, you can :
> 
> cat /proc/net/udp
> 
> maybe you have many sockets on port 123 or 53 ?

On one affected host I have 3+7 and on the other, also affacted one, I have 3+6:

root@sowa:~# egrep -cw '(53|123):' /proc/net/udp
10
root@sowa:~# egrep -w '(53|123):' /proc/net/udp
   53: 3582A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084654 2 ffff8800cc012700 0
   53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084652 2 ffff8800cc010900 0
  123: D683A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4911 2 ffff88012de96400 0
  123: 7B85A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4910 2 ffff88012de96100 0
  123: 8982A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4909 2 ffff88012de95e00 0
  123: 7B82A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4908 2 ffff88012de95b00 0
  123: 3582A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4907 2 ffff88012de95800 0
  123: 1F7EA8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4906 2 ffff88012de95500 0
  123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4905 2 ffff88012de95200 0
  123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4899 2 ffff88012de94c00 0

root@cmyk:~# egrep -cw '(53|123):' /proc/net/udp
9
root@cmyk:~# egrep -w '(53|123):' /proc/net/udp
   53: A2CD1253:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3774 2 eeaa9840 0
   53: 0100A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3772 2 eeaa9a40 0
   53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3770 2 eeaa9c40 0
  123: A6CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4693 2 eeacdc80 0
  123: A5CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4692 2 eeaa9640 0
  123: A2CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4691 2 ed201700 0
  123: 0100A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4690 2 eeacd880 0
  123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4689 2 ec649740 0
  123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4683 2 ed201d00 0

But how 123 is related to 53?

> And about loopback, I have no idea... I am pretty sure I can trigger the
> bug with other interfaces.

OK. Probably it is because my other hosts have only a single IP and only
the problematic ones have both DNS server and multiple IP (many sockets).

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-07 21:35                   ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
@ 2010-09-07 21:52                     ` Krzysztof Olędzki
  2010-09-08  2:16                       ` David Miller
  2010-09-08  2:34                     ` Brian Haley
  1 sibling, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-07 21:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 2010-09-07 23:35, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 12:59 -0700, David Miller a écrit :
>
>> Eric, please just delete the code block instead of leaving it
>> there inside of an #if 0 block.
>>
>> If there is information conveyed by the unused code, add that
>> information to the nice comment you're adding :-)
>>
>
> Indeed ;)
>
> [PATCH] inet: dont set inet_rcv_saddr in connect()
>
> The following sequence :
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)
>
> connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
> sin_addr=inet_addr("1.2.3.4")}, 28)
>
> 1) Does an implicit inet_autobind()
>    (using an INADDR_ANY address, and selecting a random port).
>
> 2) Does an ip4_datagram_connect() to specify the address/port of
> remote end point.
>
> Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
> INADDR_ANY to IP source address, given the current route to remote end
> point). Only the first connect() on the socket does this. Following ones
> dont change the (possibly wrong) source address.
>
> This breaks the secondary UDP hash, based on (ADDRESS, port), that was
> computed by inet_autobind(). If more than 10 sockets are linked in
> primary hash chain, we can drop incoming packets because searches are
> done in wrong secondary hash chain.
>
> This also potentially breaks multiple connect() to change remote
> endpoints, because old source address might be non usable for packets to
> new destination.
>
> If route happens to change, then we should automatically change our
> source address too, at next sendmsg() call, and UDP code deals with this
> just fine.
>
> If an application needs to specify a precise source address, it must use
> bind() system call. connect() man page only refers to remote address,
> not local one.
>
> Reported-by: Krzysztof Olędzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-07 21:52                     ` Krzysztof Olędzki
@ 2010-09-08  2:16                       ` David Miller
  2010-09-08  4:13                         ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2010-09-08  2:16 UTC (permalink / raw)
  To: ole; +Cc: eric.dumazet, netdev

From: Krzysztof Olędzki <ole@ans.pl>
Date: Tue, 07 Sep 2010 23:52:02 +0200

>> [PATCH] inet: dont set inet_rcv_saddr in connect()
 ...
>> Reported-by: Krzysztof Olędzki <ole@ans.pl>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

Eric, ipv6 has the same problem so I fixed it there too.

Here is the final patch I put into net-2.6, thanks!

--------------------
inet: dont set inet_rcv_saddr in connect()

The following sequence :

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)

connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
sin_addr=inet_addr("1.2.3.4")}, 28)

1) Does an implicit inet_autobind()
  (using an INADDR_ANY address, and selecting a random port).

2) Does an ip4_datagram_connect() to specify the address/port of
remote end point.

Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
INADDR_ANY to IP source address, given the current route to remote end
point). Only the first connect() on the socket does this. Following ones
dont change the (possibly wrong) source address.

This breaks the secondary UDP hash, based on (ADDRESS, port), that was
computed by inet_autobind(). If more than 10 sockets are linked in
primary hash chain, we can drop incoming packets because searches are
done in wrong secondary hash chain.

This also potentially breaks multiple connect() to change remote
endpoints, because old source address might be non usable for packets to
new destination.

If route happens to change, then we should automatically change our
source address too, at next sendmsg() call, and UDP code deals with this
just fine.

If an application needs to specify a precise source address, it must use
bind() system call. connect() man page only refers to remote address,
not local one.

[ IPV6 has the same problem so fix it there too. -DaveM ]

Reported-by: Krzysztof Olędzki <ole@ans.pl>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/datagram.c |   11 +++++++----
 net/ipv6/datagram.c |   31 +++++++++++++++----------------
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..a59492c 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -60,10 +60,13 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		ip_rt_put(rt);
 		return -EACCES;
 	}
-	if (!inet->inet_saddr)
-		inet->inet_saddr = rt->rt_src;	/* Update source address */
-	if (!inet->inet_rcv_saddr)
-		inet->inet_rcv_saddr = rt->rt_src;
+	/*
+	 * Should connect() change inet_rcv_saddr / inet_saddr ?
+	 * It should not, because we want to specify the peer to which
+	 * datagrams are to be sent, regardless of our source address that
+	 * might change in the future, after a route change.
+	 * To specify our source address, bind() is the right API.
+	 */
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 7d929a2..9c26556 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -102,13 +102,14 @@ ipv4_connected:
 
 		ipv6_addr_set_v4mapped(inet->inet_daddr, &np->daddr);
 
-		if (ipv6_addr_any(&np->saddr))
-			ipv6_addr_set_v4mapped(inet->inet_saddr, &np->saddr);
-
-		if (ipv6_addr_any(&np->rcv_saddr))
-			ipv6_addr_set_v4mapped(inet->inet_rcv_saddr,
-					       &np->rcv_saddr);
-
+		/*
+		 * Should connect() change np->rcv_saddr / np->saddr ?
+		 * It should not, because we want to specify the
+		 * peer to which datagrams are to be sent, regardless
+		 * of our source address that might change in the
+		 * future, after a route change.  To specify our
+		 * source address, bind() is the right API.
+		 */
 		goto out;
 	}
 
@@ -173,15 +174,13 @@ ipv4_connected:
 			goto out;
 	}
 
-	/* source address lookup done in ip6_dst_lookup */
-
-	if (ipv6_addr_any(&np->saddr))
-		ipv6_addr_copy(&np->saddr, &fl.fl6_src);
-
-	if (ipv6_addr_any(&np->rcv_saddr)) {
-		ipv6_addr_copy(&np->rcv_saddr, &fl.fl6_src);
-		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-	}
+	/*
+	 * Should connect() change np->rcv_saddr / np->saddr ?
+	 * It should not, because we want to specify the peer to which
+	 * datagrams are to be sent, regardless of our source address that
+	 * might change in the future, after a route change.
+	 * To specify our source address, bind() is the right API.
+	 */
 
 	ip6_dst_store(sk, dst,
 		      ipv6_addr_equal(&fl.fl6_dst, &np->daddr) ?
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-07 21:35                   ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
  2010-09-07 21:52                     ` Krzysztof Olędzki
@ 2010-09-08  2:34                     ` Brian Haley
  2010-09-08  3:34                       ` David Miller
  2010-09-08  4:57                       ` Eric Dumazet
  1 sibling, 2 replies; 34+ messages in thread
From: Brian Haley @ 2010-09-08  2:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, ole, netdev

On 09/07/2010 05:35 PM, Eric Dumazet wrote:
> Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
> INADDR_ANY to IP source address, given the current route to remote end
> point). Only the first connect() on the socket does this. Following ones
> dont change the (possibly wrong) source address.
> 
> This breaks the secondary UDP hash, based on (ADDRESS, port), that was
> computed by inet_autobind(). If more than 10 sockets are linked in
> primary hash chain, we can drop incoming packets because searches are
> done in wrong secondary hash chain.

I can't confirm this since I don't have 10 sockets in my primary hash
chain at the moment...

> This also potentially breaks multiple connect() to change remote
> endpoints, because old source address might be non usable for packets to
> new destination.
> 
> If route happens to change, then we should automatically change our
> source address too, at next sendmsg() call, and UDP code deals with this
> just fine.
> 
> If an application needs to specify a precise source address, it must use
> bind() system call. connect() man page only refers to remote address,
> not local one.

Is this really the right thing to do?  Linux has been doing this forever,
right?  Just like BSD has done it forever.  The way I've always "cleared"
a local address is to set the address family to AF_UNSPEC on the next
connect() call, as mentioned on the man page.  I just want to make sure
we're not changing something just to work around a broken application,
sendto()/sendmsg() work perfect in this case by not setting the local address.

BTW, it seems as though the reason this might only happen sometimes is
that if the first connect() is to 127.0.0.1, you won't be able to then
try and connect to say, 192.168.1.1.  If you first connect() to a remote
address things will probably just work.

My possibly wrong $.02

-Brian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  2:34                     ` Brian Haley
@ 2010-09-08  3:34                       ` David Miller
  2010-09-08  4:42                         ` Eric Dumazet
  2010-09-08  4:57                       ` Eric Dumazet
  1 sibling, 1 reply; 34+ messages in thread
From: David Miller @ 2010-09-08  3:34 UTC (permalink / raw)
  To: brian.haley; +Cc: eric.dumazet, ole, netdev

From: Brian Haley <brian.haley@hp.com>
Date: Tue, 07 Sep 2010 22:34:59 -0400

> Is this really the right thing to do?  Linux has been doing this forever,
> right?  Just like BSD has done it forever.

Indeed, I checked for this in Stevens volume 2 when I reviewed
Eric's patch, the logic is identical.

> The way I've always "cleared" a local address is to set the address
> family to AF_UNSPEC on the next connect() call, as mentioned on the
> man page.  I just want to make sure we're not changing something
> just to work around a broken application, sendto()/sendmsg() work
> perfect in this case by not setting the local address.
> 
> BTW, it seems as though the reason this might only happen sometimes is
> that if the first connect() is to 127.0.0.1, you won't be able to then
> try and connect to say, 192.168.1.1.  If you first connect() to a remote
> address things will probably just work.

Actually I have enough doubts about this too.  So I'm going to hold
off on the patch until we sort this out.

As Eric stated, sendmsg() on raw and UDP sockets pick the source
address based upon the bound or looked up route anyways.  But still
I think something still might end up being not-kosher with Eric's
change.

More so, I suspect, the issue is simply that the ordering of events
is simply inconvenient for the secondary hash implementation :-)
The port bind happens before the source address is bound, and that
is what screws everything up.

In this sense leaving the source addresses at INADDR_ANY is just a
workaround for the secondary hash table :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: 2.6.34: Problem with UDP traffic on lo + poll(?)
  2010-09-07 21:51                     ` Krzysztof Olędzki
@ 2010-09-08  4:12                       ` Eric Dumazet
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08  4:12 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: David Miller, netdev

Le mardi 07 septembre 2010 à 23:51 +0200, Krzysztof Olędzki a écrit :
> On 2010-09-07 23:39, Eric Dumazet wrote:
> > Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :
> > 
> >> With the above patch I'm no longer able to reproduce the problem. Thanks!
> >>
> >> Tested-by: Krzysztof Piotr Oledzki<ole@ans.pl>
> >>
> > 
> > Thanks a lot !
> > 
> >> BTW: why it takes so long to trigger this bug and it is only possible
> >> over a loopback interface?
> > 
> > Its a bit tricky : You need at least 10 sockets linked in a particular
> > hash chain.
> > 
> > To check this, you can :
> > 
> > cat /proc/net/udp
> > 
> > maybe you have many sockets on port 123 or 53 ?
> 
> On one affected host I have 3+7 and on the other, also affacted one, I have 3+6:
> 
> root@sowa:~# egrep -cw '(53|123):' /proc/net/udp
> 10
> root@sowa:~# egrep -w '(53|123):' /proc/net/udp
>    53: 3582A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084654 2 ffff8800cc012700 0
>    53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084652 2 ffff8800cc010900 0
>   123: D683A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4911 2 ffff88012de96400 0
>   123: 7B85A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4910 2 ffff88012de96100 0
>   123: 8982A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4909 2 ffff88012de95e00 0
>   123: 7B82A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4908 2 ffff88012de95b00 0
>   123: 3582A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4907 2 ffff88012de95800 0
>   123: 1F7EA8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4906 2 ffff88012de95500 0
>   123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4905 2 ffff88012de95200 0
>   123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4899 2 ffff88012de94c00 0
> 
> But how 123 is related to 53?
> 

I was mentioning 123 or 53, as probable suspects :)


When a socket is created, and connect() called, autobind() chooses a
source port X for this socket. 

if ((X % udp_hash_size) == 123), socket is inserted in hash chain number
123.

Bug then triggers, because when a packet is received for this socket, we
find a slot with more than 10 sockets -> Search is done on secondary
chain Z2, where we dont find the socket since its rcv_addr changed after
we inserted it (in chain Y2). Packet is dropped (as seen in netstat -s)


> > And about loopback, I have no idea... I am pretty sure I can trigger the
> > bug with other interfaces.
> 
> OK. Probably it is because my other hosts have only a single IP and only
> the problematic ones have both DNS server and multiple IP (many sockets).
> 

Yes



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  2:16                       ` David Miller
@ 2010-09-08  4:13                         ` Eric Dumazet
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08  4:13 UTC (permalink / raw)
  To: David Miller; +Cc: ole, netdev

Le mardi 07 septembre 2010 à 19:16 -0700, David Miller a écrit :

> Eric, ipv6 has the same problem so I fixed it there too.
> 
> Here is the final patch I put into net-2.6, thanks!

Thanks David for taking care of ipv6 too :)



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  3:34                       ` David Miller
@ 2010-09-08  4:42                         ` Eric Dumazet
  2010-09-08  5:51                           ` David Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08  4:42 UTC (permalink / raw)
  To: David Miller; +Cc: brian.haley, ole, netdev

Le mardi 07 septembre 2010 à 20:34 -0700, David Miller a écrit :
> From: Brian Haley <brian.haley@hp.com>
> Date: Tue, 07 Sep 2010 22:34:59 -0400
> 
> > Is this really the right thing to do?  Linux has been doing this forever,
> > right?  Just like BSD has done it forever.
> 
> Indeed, I checked for this in Stevens volume 2 when I reviewed
> Eric's patch, the logic is identical.
> 
> > The way I've always "cleared" a local address is to set the address
> > family to AF_UNSPEC on the next connect() call, as mentioned on the
> > man page.  I just want to make sure we're not changing something
> > just to work around a broken application, sendto()/sendmsg() work
> > perfect in this case by not setting the local address.
> > 

Problem is AF_UNSPEC always clears the remote address (as stated in
manual), and sometimes local one (as not stated)

if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
	inet_reset_saddr(sk);

This is the workaround that was coded years ago in Linux to undo the
local addr setting ;)


Following program produces this output :

local addr=0x7f000001 sin_port=37877
after connect(AF_UNSPEC) local addr=0x0 sin_port=0
local addr=0x7f000001 sin_port=37877
after connect(AF_UNSPEC) local addr=0x7f000001 sin_port=0



#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	int fd;
	struct sockaddr_in target, me;
	socklen_t len;

	if ((fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) {
		perror("socket");
		return 1;
	}
	memset(&target, 0, sizeof(target));
	target.sin_family = AF_INET;
	target.sin_port = 123;
	target.sin_addr.s_addr = htonl(0x7f000001);
	if (connect(fd, (const struct sockaddr *)&target, sizeof(target)) == -1) {
		perror("connect");
		close(fd);
		return 2;
	}
	len = sizeof(me);
	if (getsockname(fd, (struct sockaddr *)&me, &len)== -1) {
		perror("getsockname");
		close(fd);
		return 3;
	}
	printf("local addr=0x%x sin_port=%u\n",
		ntohl(me.sin_addr.s_addr), ntohs(me.sin_port));
	memset(&target, 0, sizeof(target));
	target.sin_family = AF_UNSPEC;
	if (connect(fd, (const struct sockaddr *)&target, sizeof(target)) == -1) {
		perror("connect AF_UNSPEC");
		close(fd);
		return 4;
	}	
	len = sizeof(me);
	if (getsockname(fd, (struct sockaddr *)&me, &len)== -1) {
		perror("getsockname 2");
		close(fd);
		return 3;
	}
	printf("after connect(AF_UNSPEC) local addr=0x%x sin_port=%u\n",
		ntohl(me.sin_addr.s_addr), ntohs(me.sin_port));
	memset(&target, 0, sizeof(target));
	target.sin_family = AF_INET;
	target.sin_addr.s_addr = htonl(0x7f000001);
	if (bind(fd, (const struct sockaddr *)&target, sizeof(target)) == -1) {
		perror("bind");
		close(fd);
		return 2;
	}
	len = sizeof(me);
	if (getsockname(fd, (struct sockaddr *)&me, &len)== -1) {
		perror("getsockname");
		close(fd);
		return 3;
	}
	printf("local addr=0x%x sin_port=%u\n",
		ntohl(me.sin_addr.s_addr), ntohs(me.sin_port));

	memset(&target, 0, sizeof(target));
	target.sin_family = AF_UNSPEC;
	if (connect(fd, (const struct sockaddr *)&target, sizeof(target)) == -1) {
		perror("connect AF_UNSPEC");
		close(fd);
		return 4;
	}	

	len = sizeof(me);
	if (getsockname(fd, (struct sockaddr *)&me, &len)== -1) {
		perror("getsockname 2");
		close(fd);
		return 3;
	}
	printf("after connect(AF_UNSPEC) local addr=0x%x sin_port=%u\n",
		ntohl(me.sin_addr.s_addr), ntohs(me.sin_port));
	return 0;
}




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  2:34                     ` Brian Haley
  2010-09-08  3:34                       ` David Miller
@ 2010-09-08  4:57                       ` Eric Dumazet
  2010-09-08  5:36                         ` David Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08  4:57 UTC (permalink / raw)
  To: Brian Haley; +Cc: David Miller, ole, netdev

Le mardi 07 septembre 2010 à 22:34 -0400, Brian Haley a écrit :

> Is this really the right thing to do?  Linux has been doing this forever,
> right?  Just like BSD has done it forever.  The way I've always "cleared"
> a local address is to set the address family to AF_UNSPEC on the next
> connect() call, as mentioned on the man page.  I just want to make sure
> we're not changing something just to work around a broken application,
> sendto()/sendmsg() work perfect in this case by not setting the local address.
> 
> BTW, it seems as though the reason this might only happen sometimes is
> that if the first connect() is to 127.0.0.1, you won't be able to then
> try and connect to say, 192.168.1.1.  If you first connect() to a remote
> address things will probably just work.


I believe we have the following choice :

1) connect(AF_UNIX) sets the remote address/port
   bind() sets the local port (and optionally address)
   connect(AF_UNSPEC) clears remote addess/port,
                      let local address/port unchanged

2) Correct UDP hashing, when local address changes from 0 to x.y.z.t
   (cost : two locks taken), and a possible packet drop during the
operation.

   Document that connect() also sets local address, and that before
doing a second connect() to change remote address, its mandatory to
first issue a connect(AF_UNSPEC) to clear local address (if not locked
by a prior bind() call)




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  4:57                       ` Eric Dumazet
@ 2010-09-08  5:36                         ` David Miller
  2010-09-08  5:52                           ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2010-09-08  5:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brian.haley, ole, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 08 Sep 2010 06:57:37 +0200

>    Document that connect() also sets local address, and that before
> doing a second connect() to change remote address, its mandatory to
> first issue a connect(AF_UNSPEC) to clear local address (if not locked
> by a prior bind() call)

For connectionless sockets, the application may connect() as many
times as it wishes to change the remote address.  The local address
remains set if it were set before such a re-associating connect().

It need only issue a connect(AF_UNSPEC) to make the socket have no
remote association, and as you state this operation will also wipe out
any local address settings not created by a bind() call.

And nicely our man pages are very clear about this :-) as is BSD and
Steven's volume 2.

This has been legal for decades, so we have to keep working this way.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  4:42                         ` Eric Dumazet
@ 2010-09-08  5:51                           ` David Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2010-09-08  5:51 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brian.haley, ole, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 08 Sep 2010 06:42:45 +0200

> Problem is AF_UNSPEC always clears the remote address (as stated in
> manual), and sometimes local one (as not stated)
> 
> if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
> 	inet_reset_saddr(sk);
> 
> This is the workaround that was coded years ago in Linux to undo the
> local addr setting ;)

Ok, I mis-remembered about AF_UNSPEC, I thought BSD did it too, it
does not.

But we really offer the most flexibility here because we offer this.

In BSD case, connect() on an already connect()'d UDP socket
unconditionally zaps the local address setting (because, as Stevens
states, that local address "might" have been created by a previous
connect()).

This forces the application to re-bind() the local address if it wants
to use something other than what connect() is going to choose.

We leave the local settings alone for normal connect() on already
connect()'d sockets, and provide connect(AF_UNSPEC) to explicitly kill
connect() created local and remote associations.

This allows bind() settings to survive through multiple connect()
calls if the user wants, a facility BSD does not provide.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  5:36                         ` David Miller
@ 2010-09-08  5:52                           ` Eric Dumazet
  2010-09-08 10:10                             ` [PATCH] udp: add rehash on connect() Eric Dumazet
  2010-09-08 14:27                             ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08  5:52 UTC (permalink / raw)
  To: David Miller; +Cc: brian.haley, ole, netdev

Le mardi 07 septembre 2010 à 22:36 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 08 Sep 2010 06:57:37 +0200
> 
> >    Document that connect() also sets local address, and that before
> > doing a second connect() to change remote address, its mandatory to
> > first issue a connect(AF_UNSPEC) to clear local address (if not locked
> > by a prior bind() call)
> 
> For connectionless sockets, the application may connect() as many
> times as it wishes to change the remote address.  The local address
> remains set if it were set before such a re-associating connect().
> 
> It need only issue a connect(AF_UNSPEC) to make the socket have no
> remote association, and as you state this operation will also wipe out
> any local address settings not created by a bind() call.
> 
> And nicely our man pages are very clear about this :-) as is BSD and
> Steven's volume 2.
> 
> This has been legal for decades, so we have to keep working this way.

Yes, its also buggy, if 2nd remote address is not reachable on same interface.
Even if we try a connect(AF_UNSPEC), the local address stay as is :

after bind(port 5555) local addr=0x0:5555 
after connect(123) local addr=0x7f000001:5555 remote addr=0x7f000001:123 
Could not connect, errno=22
after connect(AF_UNSPEC) local addr=0x7f000001:5555 
connect: Invalid argument

I'll work on UDP fix anyway.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] udp: add rehash on connect()
  2010-09-08  5:52                           ` Eric Dumazet
@ 2010-09-08 10:10                             ` Eric Dumazet
  2010-09-08 15:06                               ` Krzysztof Olędzki
  2010-09-08 15:08                               ` [PATCH v2] " Eric Dumazet
  2010-09-08 14:27                             ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
  1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08 10:10 UTC (permalink / raw)
  To: David Miller; +Cc: brian.haley, ole, netdev

commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).

Problem is that following sequence :

fd = socket(...)
connect(fd, &remote, ...)

not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)

Sequence is :
 - autobind() : choose a random local port, insert socket in hash tables
              [while local address is INADDR_ANY]
 - connect() : set remote address and port, change local address to IP
              given by a route lookup.

When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.

One solution to this problem is to rehash datagram socket if needed.

We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.

This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.

Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |    1 +
 include/net/udp.h   |    1 +
 net/ipv4/datagram.c |    5 ++++-
 net/ipv4/udp.c      |   37 +++++++++++++++++++++++++++++++++++++
 net/ipv6/datagram.c |    7 ++++++-
 net/ipv6/udp.c      |   10 ++++++++++
 6 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ac53bfb..adab9dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -752,6 +752,7 @@ struct proto {
 	/* Keeping track of sk's, looking them up, and port selection methods. */
 	void			(*hash)(struct sock *sk);
 	void			(*unhash)(struct sock *sk);
+	void			(*rehash)(struct sock *sk);
 	int			(*get_port)(struct sock *sk, unsigned short snum);
 
 	/* Keeping track of sockets in use */
diff --git a/include/net/udp.h b/include/net/udp.h
index 7abdf30..a184d34 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -151,6 +151,7 @@ static inline void udp_lib_hash(struct sock *sk)
 }
 
 extern void udp_lib_unhash(struct sock *sk);
+extern void udp_lib_rehash(struct sock *sk, u16 new_hash);
 
 static inline void udp_lib_close(struct sock *sk, long timeout)
 {
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..721a8a3 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -62,8 +62,11 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	if (!inet->inet_saddr)
 		inet->inet_saddr = rt->rt_src;	/* Update source address */
-	if (!inet->inet_rcv_saddr)
+	if (!inet->inet_rcv_saddr) {
 		inet->inet_rcv_saddr = rt->rt_src;
+		if (sk->sk_prot->rehash)
+			sk->sk_prot->rehash(sk);
+	}
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 32e0bef..c871758 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1260,6 +1260,42 @@ void udp_lib_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_lib_unhash);
 
+/*
+ * inet_rcv_saddr was changed, we must rehash secondary hash
+ */
+void udp_lib_rehash(struct sock *sk, u16 newhash)
+{
+	if (sk_hashed(sk)) {
+		struct udp_table *udptable = sk->sk_prot->h.udp_table;
+		struct udp_hslot *hslot2, *nhslot2;
+
+		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
+		nhslot2 = udp_hashslot2(udptable, newhash);
+		udp_sk(sk)->udp_portaddr_hash = newhash;
+		if (hslot2 != nhslot2) {
+			spin_lock_bh(&hslot2->lock);
+			hlist_nulls_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
+			hslot2->count--;
+			spin_unlock_bh(&hslot2->lock);
+
+			spin_lock_bh(&nhslot2->lock);
+			hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &nhslot2->head);
+			nhslot2->count++;
+			spin_unlock_bh(&nhslot2->lock);
+		}
+	}
+}
+EXPORT_SYMBOL(udp_lib_rehash);
+
+static void udp_v4_rehash(struct sock *sk)
+{
+	u16 new_hash = udp4_portaddr_hash(sock_net(sk),
+					  inet_sk(sk)->inet_rcv_saddr,
+					  inet_sk(sk)->inet_num);
+	udp_lib_rehash(sk, new_hash);
+}
+
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
@@ -1843,6 +1879,7 @@ struct proto udp_prot = {
 	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
+	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
 	.memory_allocated  = &udp_memory_allocated,
 	.sysctl_mem	   = sysctl_udp_mem,
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 7d929a2..cb61d9a 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -105,9 +105,12 @@ ipv4_connected:
 		if (ipv6_addr_any(&np->saddr))
 			ipv6_addr_set_v4mapped(inet->inet_saddr, &np->saddr);
 
-		if (ipv6_addr_any(&np->rcv_saddr))
+		if (ipv6_addr_any(&np->rcv_saddr)) {
 			ipv6_addr_set_v4mapped(inet->inet_rcv_saddr,
 					       &np->rcv_saddr);
+			if (sk->sk_prot->rehash)
+				sk->sk_prot->rehash(sk);
+			}
 
 		goto out;
 	}
@@ -181,6 +184,8 @@ ipv4_connected:
 	if (ipv6_addr_any(&np->rcv_saddr)) {
 		ipv6_addr_copy(&np->rcv_saddr, &fl.fl6_src);
 		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+		if (sk->sk_prot->rehash)
+			sk->sk_prot->rehash(sk);
 	}
 
 	ip6_dst_store(sk, dst,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1dd1aff..5acb356 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -111,6 +111,15 @@ int udp_v6_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, ipv6_rcv_saddr_equal, hash2_nulladdr);
 }
 
+static void udp_v6_rehash(struct sock *sk)
+{
+	u16 new_hash = udp6_portaddr_hash(sock_net(sk),
+					  &inet6_sk(sk)->rcv_saddr,
+					  inet_sk(sk)->inet_num);
+
+	udp_lib_rehash(sk, new_hash);
+}
+
 static inline int compute_score(struct sock *sk, struct net *net,
 				unsigned short hnum,
 				struct in6_addr *saddr, __be16 sport,
@@ -1447,6 +1456,7 @@ struct proto udpv6_prot = {
 	.backlog_rcv	   = udpv6_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
+	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
 	.memory_allocated  = &udp_memory_allocated,
 	.sysctl_mem	   = sysctl_udp_mem,



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] inet: dont set inet_rcv_saddr in connect()
  2010-09-08  5:52                           ` Eric Dumazet
  2010-09-08 10:10                             ` [PATCH] udp: add rehash on connect() Eric Dumazet
@ 2010-09-08 14:27                             ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08 14:27 UTC (permalink / raw)
  To: David Miller; +Cc: brian.haley, ole, netdev

Le mercredi 08 septembre 2010 à 07:52 +0200, Eric Dumazet a écrit :
> Le mardi 07 septembre 2010 à 22:36 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 08 Sep 2010 06:57:37 +0200
> > 
> > >    Document that connect() also sets local address, and that before
> > > doing a second connect() to change remote address, its mandatory to
> > > first issue a connect(AF_UNSPEC) to clear local address (if not locked
> > > by a prior bind() call)
> > 
> > For connectionless sockets, the application may connect() as many
> > times as it wishes to change the remote address.  The local address
> > remains set if it were set before such a re-associating connect().
> > 
> > It need only issue a connect(AF_UNSPEC) to make the socket have no
> > remote association, and as you state this operation will also wipe out
> > any local address settings not created by a bind() call.
> > 
> > And nicely our man pages are very clear about this :-) as is BSD and
> > Steven's volume 2.
> > 
> > This has been legal for decades, so we have to keep working this way.
> 
> Yes, its also buggy, if 2nd remote address is not reachable on same interface.
> Even if we try a connect(AF_UNSPEC), the local address stay as is :
> 
> after bind(port 5555) local addr=0x0:5555 
> after connect(123) local addr=0x7f000001:5555 remote addr=0x7f000001:123 
> Could not connect, errno=22
> after connect(AF_UNSPEC) local addr=0x7f000001:5555 
> connect: Invalid argument
> 

I run the program on FreeBSD 8.1, and this OS
does change the source address at connect() time.
It also change it each time connect() is called, not only once.

fd = socket()
connect(fd, "127.0.0.1:ntp")
system("netstat -p udp | grep udp")
connect(fd, "192.168.20.110:ntp")
system("netstat -p udp | grep udp")

->

udp4  0  0 127.0.0.1.35974       127.0.0.1.ntp

udp4  0  0 192.168.20.80.35974   192.168.20.110.ntp

while on Linux we refuse the second connect() -> EINVAL
(Because no route can be found from 127.0.0.1 to 192.168.20.110)




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] udp: add rehash on connect()
  2010-09-08 10:10                             ` [PATCH] udp: add rehash on connect() Eric Dumazet
@ 2010-09-08 15:06                               ` Krzysztof Olędzki
  2010-09-08 15:17                                 ` Eric Dumazet
  2010-09-08 15:08                               ` [PATCH v2] " Eric Dumazet
  1 sibling, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-08 15:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, brian.haley, netdev

On 2010-09-08 12:10, Eric Dumazet wrote:
> commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
> added a secondary hash on UDP, hashed on (local addr, local port).
>
> Problem is that following sequence :
>
> fd = socket(...)
> connect(fd,&remote, ...)
>
> not only selects remote end point (address and port), but also sets
> local address, while UDP stack stored in secondary hash table the socket
> while its local address was INADDR_ANY (or ipv6 equivalent)
>
> Sequence is :
>   - autobind() : choose a random local port, insert socket in hash tables
>                [while local address is INADDR_ANY]
>   - connect() : set remote address and port, change local address to IP
>                given by a route lookup.
>
> When an incoming UDP frame comes, if more than 10 sockets are found in
> primary hash table, we switch to secondary table, and fail to find
> socket because its local address changed.
>
> One solution to this problem is to rehash datagram socket if needed.
>
> We add a new rehash(struct socket *) method in "struct proto", and
> implement this method for UDP v4&  v6, using a common helper.
>
> This rehashing only takes care of secondary hash table, since primary
> hash (based on local port only) is not changed.
>
> Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks, I'll test this patch this evening.


Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] udp: add rehash on connect()
  2010-09-08 10:10                             ` [PATCH] udp: add rehash on connect() Eric Dumazet
  2010-09-08 15:06                               ` Krzysztof Olędzki
@ 2010-09-08 15:08                               ` Eric Dumazet
  2010-09-08 16:52                                 ` Krzysztof Olędzki
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08 15:08 UTC (permalink / raw)
  To: David Miller; +Cc: brian.haley, ole, netdev

Hmm... We should lock primary hash table too in udp_lib_rehash(), to
prevent another thread to insert another socket on same tuple (local
addr, local port) while doing our move.

Thanks

[PATCH v2] udp: add rehash on connect()

commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).

Problem is that following sequence :

fd = socket(...)
connect(fd, &remote, ...)

not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)

Sequence is :
 - autobind() : choose a random local port, insert socket in hash tables
              [while local address is INADDR_ANY]
 - connect() : set remote address and port, change local address to IP
              given by a route lookup.

When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.

One solution to this problem is to rehash datagram socket if needed.

We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.

This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.

Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
V2: must lock primary hash table too

 include/net/sock.h  |    1 
 include/net/udp.h   |    1 
 net/ipv4/datagram.c |    5 +++-
 net/ipv4/udp.c      |   44 ++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/datagram.c |    7 +++++-
 net/ipv6/udp.c      |   10 +++++++++
 6 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ac53bfb..adab9dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -752,6 +752,7 @@ struct proto {
 	/* Keeping track of sk's, looking them up, and port selection methods. */
 	void			(*hash)(struct sock *sk);
 	void			(*unhash)(struct sock *sk);
+	void			(*rehash)(struct sock *sk);
 	int			(*get_port)(struct sock *sk, unsigned short snum);
 
 	/* Keeping track of sockets in use */
diff --git a/include/net/udp.h b/include/net/udp.h
index 7abdf30..a184d34 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -151,6 +151,7 @@ static inline void udp_lib_hash(struct sock *sk)
 }
 
 extern void udp_lib_unhash(struct sock *sk);
+extern void udp_lib_rehash(struct sock *sk, u16 new_hash);
 
 static inline void udp_lib_close(struct sock *sk, long timeout)
 {
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..721a8a3 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -62,8 +62,11 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	if (!inet->inet_saddr)
 		inet->inet_saddr = rt->rt_src;	/* Update source address */
-	if (!inet->inet_rcv_saddr)
+	if (!inet->inet_rcv_saddr) {
 		inet->inet_rcv_saddr = rt->rt_src;
+		if (sk->sk_prot->rehash)
+			sk->sk_prot->rehash(sk);
+	}
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 32e0bef..fb23c2e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1260,6 +1260,49 @@ void udp_lib_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL(udp_lib_unhash);
 
+/*
+ * inet_rcv_saddr was changed, we must rehash secondary hash
+ */
+void udp_lib_rehash(struct sock *sk, u16 newhash)
+{
+	if (sk_hashed(sk)) {
+		struct udp_table *udptable = sk->sk_prot->h.udp_table;
+		struct udp_hslot *hslot, *hslot2, *nhslot2;
+
+		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
+		nhslot2 = udp_hashslot2(udptable, newhash);
+		udp_sk(sk)->udp_portaddr_hash = newhash;
+		if (hslot2 != nhslot2) {
+			hslot = udp_hashslot(udptable, sock_net(sk),
+					     udp_sk(sk)->udp_port_hash);
+			/* we must lock primary chain too */
+			spin_lock_bh(&hslot->lock);
+
+			spin_lock(&hslot2->lock);
+			hlist_nulls_del_init_rcu(&udp_sk(sk)->udp_portaddr_node);
+			hslot2->count--;
+			spin_unlock(&hslot2->lock);
+
+			spin_lock(&nhslot2->lock);
+			hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &nhslot2->head);
+			nhslot2->count++;
+			spin_unlock(&nhslot2->lock);
+
+			spin_unlock_bh(&hslot->lock);
+		}
+	}
+}
+EXPORT_SYMBOL(udp_lib_rehash);
+
+static void udp_v4_rehash(struct sock *sk)
+{
+	u16 new_hash = udp4_portaddr_hash(sock_net(sk),
+					  inet_sk(sk)->inet_rcv_saddr,
+					  inet_sk(sk)->inet_num);
+	udp_lib_rehash(sk, new_hash);
+}
+
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
@@ -1843,6 +1886,7 @@ struct proto udp_prot = {
 	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
+	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
 	.memory_allocated  = &udp_memory_allocated,
 	.sysctl_mem	   = sysctl_udp_mem,
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 7d929a2..cb61d9a 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -105,9 +105,12 @@ ipv4_connected:
 		if (ipv6_addr_any(&np->saddr))
 			ipv6_addr_set_v4mapped(inet->inet_saddr, &np->saddr);
 
-		if (ipv6_addr_any(&np->rcv_saddr))
+		if (ipv6_addr_any(&np->rcv_saddr)) {
 			ipv6_addr_set_v4mapped(inet->inet_rcv_saddr,
 					       &np->rcv_saddr);
+			if (sk->sk_prot->rehash)
+				sk->sk_prot->rehash(sk);
+			}
 
 		goto out;
 	}
@@ -181,6 +184,8 @@ ipv4_connected:
 	if (ipv6_addr_any(&np->rcv_saddr)) {
 		ipv6_addr_copy(&np->rcv_saddr, &fl.fl6_src);
 		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+		if (sk->sk_prot->rehash)
+			sk->sk_prot->rehash(sk);
 	}
 
 	ip6_dst_store(sk, dst,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1dd1aff..5acb356 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -111,6 +111,15 @@ int udp_v6_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, ipv6_rcv_saddr_equal, hash2_nulladdr);
 }
 
+static void udp_v6_rehash(struct sock *sk)
+{
+	u16 new_hash = udp6_portaddr_hash(sock_net(sk),
+					  &inet6_sk(sk)->rcv_saddr,
+					  inet_sk(sk)->inet_num);
+
+	udp_lib_rehash(sk, new_hash);
+}
+
 static inline int compute_score(struct sock *sk, struct net *net,
 				unsigned short hnum,
 				struct in6_addr *saddr, __be16 sport,
@@ -1447,6 +1456,7 @@ struct proto udpv6_prot = {
 	.backlog_rcv	   = udpv6_queue_rcv_skb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
+	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
 	.memory_allocated  = &udp_memory_allocated,
 	.sysctl_mem	   = sysctl_udp_mem,



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] udp: add rehash on connect()
  2010-09-08 15:06                               ` Krzysztof Olędzki
@ 2010-09-08 15:17                                 ` Eric Dumazet
  2010-09-08 15:29                                   ` Krzysztof Olędzki
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-09-08 15:17 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: David Miller, brian.haley, netdev

Le mercredi 08 septembre 2010 à 17:06 +0200, Krzysztof Olędzki a écrit :

> Thanks, I'll test this patch this evening.

Take the v2 ;)

I tested it on my dev machine and it worked for me.

Thanks



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] udp: add rehash on connect()
  2010-09-08 15:17                                 ` Eric Dumazet
@ 2010-09-08 15:29                                   ` Krzysztof Olędzki
  0 siblings, 0 replies; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-08 15:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, brian.haley, netdev

On 2010-09-08 17:17, Eric Dumazet wrote:
> Le mercredi 08 septembre 2010 à 17:06 +0200, Krzysztof Olędzki a écrit :
>
>> Thanks, I'll test this patch this evening.
>
> Take the v2 ;)
>
> I tested it on my dev machine and it worked for me.

Sure. :)

Best regards,

				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] udp: add rehash on connect()
  2010-09-08 15:08                               ` [PATCH v2] " Eric Dumazet
@ 2010-09-08 16:52                                 ` Krzysztof Olędzki
  2010-09-09  4:39                                   ` David Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Krzysztof Olędzki @ 2010-09-08 16:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, brian.haley, netdev

On 2010-09-08 17:08, Eric Dumazet wrote:
> Hmm... We should lock primary hash table too in udp_lib_rehash(), to
> prevent another thread to insert another socket on same tuple (local
> addr, local port) while doing our move.
>
> Thanks
>
> [PATCH v2] udp: add rehash on connect()
>
> commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
> added a secondary hash on UDP, hashed on (local addr, local port).
>
> Problem is that following sequence :
>
> fd = socket(...)
> connect(fd,&remote, ...)
>
> not only selects remote end point (address and port), but also sets
> local address, while UDP stack stored in secondary hash table the socket
> while its local address was INADDR_ANY (or ipv6 equivalent)
>
> Sequence is :
>   - autobind() : choose a random local port, insert socket in hash tables
>                [while local address is INADDR_ANY]
>   - connect() : set remote address and port, change local address to IP
>                given by a route lookup.
>
> When an incoming UDP frame comes, if more than 10 sockets are found in
> primary hash table, we switch to secondary table, and fail to find
> socket because its local address changed.
>
> One solution to this problem is to rehash datagram socket if needed.
>
> We add a new rehash(struct socket *) method in "struct proto", and
> implement this method for UDP v4&  v6, using a common helper.
>
> This rehashing only takes care of secondary hash table, since primary
> hash (based on local port only) is not changed.
>
> Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Works like a charm, thank you.

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] udp: add rehash on connect()
  2010-09-08 16:52                                 ` Krzysztof Olędzki
@ 2010-09-09  4:39                                   ` David Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2010-09-09  4:39 UTC (permalink / raw)
  To: ole; +Cc: eric.dumazet, brian.haley, netdev

From: Krzysztof Olędzki <ole@ans.pl>
Date: Wed, 08 Sep 2010 18:52:10 +0200

> On 2010-09-08 17:08, Eric Dumazet wrote:
>> [PATCH v2] udp: add rehash on connect()
 ...
>> Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Works like a charm, thank you.
> 
> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

Applied, thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2010-09-09  4:39 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-06 17:11 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Oledzki
2010-09-06 19:42 ` Eric Dumazet
2010-09-06 19:55   ` Krzysztof Olędzki
2010-09-06 20:29     ` Eric Dumazet
2010-09-06 20:44       ` Krzysztof Olędzki
2010-09-06 20:48         ` Krzysztof Olędzki
2010-09-07 15:37           ` Krzysztof Olędzki
2010-09-07 16:36             ` Eric Dumazet
2010-09-07 19:20               ` Krzysztof Olędzki
2010-09-07 19:26               ` Eric Dumazet
2010-09-07 19:59                 ` David Miller
2010-09-07 21:35                   ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
2010-09-07 21:52                     ` Krzysztof Olędzki
2010-09-08  2:16                       ` David Miller
2010-09-08  4:13                         ` Eric Dumazet
2010-09-08  2:34                     ` Brian Haley
2010-09-08  3:34                       ` David Miller
2010-09-08  4:42                         ` Eric Dumazet
2010-09-08  5:51                           ` David Miller
2010-09-08  4:57                       ` Eric Dumazet
2010-09-08  5:36                         ` David Miller
2010-09-08  5:52                           ` Eric Dumazet
2010-09-08 10:10                             ` [PATCH] udp: add rehash on connect() Eric Dumazet
2010-09-08 15:06                               ` Krzysztof Olędzki
2010-09-08 15:17                                 ` Eric Dumazet
2010-09-08 15:29                                   ` Krzysztof Olędzki
2010-09-08 15:08                               ` [PATCH v2] " Eric Dumazet
2010-09-08 16:52                                 ` Krzysztof Olędzki
2010-09-09  4:39                                   ` David Miller
2010-09-08 14:27                             ` [PATCH] inet: dont set inet_rcv_saddr in connect() Eric Dumazet
2010-09-07 21:28                 ` 2.6.34: Problem with UDP traffic on lo + poll(?) Krzysztof Olędzki
2010-09-07 21:39                   ` Eric Dumazet
2010-09-07 21:51                     ` Krzysztof Olędzki
2010-09-08  4:12                       ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.