* Warning in net/ipv4/af_inet.c:154
@ 2010-05-25 11:58 Anton Blanchard
  2010-05-25 15:27 ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-05-25 11:58 UTC (permalink / raw)
  To: netdev


Hi,

A build from last night (2.6.34-git9) is hitting the following WARN_ON on
a ppc64 box:

Badness at net/ipv4/af_inet.c:154

NIP [c00000000062b034] .inet_sock_destruct+0x1b0/0x1f0
LR [c0000000005b376c] .__sk_free+0x58/0x154
Call Trace:
[c000000000088c0c] .__local_bh_disable+0xdc/0xfc (unreliable)
[c0000000005b376c] .__sk_free+0x58/0x154
[c0000000005b3978] .sk_free+0x4c/0x60
[c0000000005b3ab4] .sk_common_release+0x128/0x148
[c00000000061da44] .udp_lib_close+0x28/0x40
[c00000000062a990] .inet_release+0xd0/0xf8
[c0000000005adbf8] .sock_release+0x60/0xec
[c0000000005adcd0] .sock_close+0x4c/0x6c
[c000000000186554] .__fput+0x184/0x270
[c00000000018668c] .fput+0x4c/0x60
[c000000000182248] .filp_close+0xc8/0xf4
[c0000000000826d4] .put_files_struct+0xc8/0x158
[c0000000000827d4] .exit_files+0x70/0x8c
[c0000000000848cc] .do_exit+0x2c8/0x82c
[c000000000084efc] .do_group_exit+0xcc/0x100
[c000000000084f5c] .SyS_exit_group+0x2c/0x48
[c000000000008554] syscall_exit+0x0/0x40

Which is:

        WARN_ON(sk->sk_forward_alloc);
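
(For context, the neighbouring checks in inet_sock_destruct() are roughly:

	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
	WARN_ON(sk->sk_wmem_queued);
	WARN_ON(sk->sk_forward_alloc);

and it is the sk_forward_alloc one at line 154 that fires here.)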

Anton


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-25 11:58 Warning in net/ipv4/af_inet.c:154 Anton Blanchard
@ 2010-05-25 15:27 ` Eric Dumazet
  2010-05-26  3:19   ` Anton Blanchard
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-05-25 15:27 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: netdev

On Tue, 25 May 2010 at 21:58 +1000, Anton Blanchard wrote:
> Hi,
> 
> A build from last night (2.6.34-git9) is hitting the following WARN_ON on
> a ppc64 box:
> 
> Badness at net/ipv4/af_inet.c:154
> 
> NIP [c00000000062b034] .inet_sock_destruct+0x1b0/0x1f0
> LR [c0000000005b376c] .__sk_free+0x58/0x154
> Call Trace:
> [c000000000088c0c] .__local_bh_disable+0xdc/0xfc (unreliable)
> [c0000000005b376c] .__sk_free+0x58/0x154
> [c0000000005b3978] .sk_free+0x4c/0x60
> [c0000000005b3ab4] .sk_common_release+0x128/0x148
> [c00000000061da44] .udp_lib_close+0x28/0x40
> [c00000000062a990] .inet_release+0xd0/0xf8
> [c0000000005adbf8] .sock_release+0x60/0xec
> [c0000000005adcd0] .sock_close+0x4c/0x6c
> [c000000000186554] .__fput+0x184/0x270
> [c00000000018668c] .fput+0x4c/0x60
> [c000000000182248] .filp_close+0xc8/0xf4
> [c0000000000826d4] .put_files_struct+0xc8/0x158
> [c0000000000827d4] .exit_files+0x70/0x8c
> [c0000000000848cc] .do_exit+0x2c8/0x82c
> [c000000000084efc] .do_group_exit+0xcc/0x100
> [c000000000084f5c] .SyS_exit_group+0x2c/0x48
> [c000000000008554] syscall_exit+0x0/0x40
> 
> Which is:
> 
>         WARN_ON(sk->sk_forward_alloc);
> 

Yes, the infamous one :)

Is it reproducible? What kind of workload is it?
What is the NIC involved?





* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-25 15:27 ` Eric Dumazet
@ 2010-05-26  3:19   ` Anton Blanchard
  2010-05-26  5:18     ` Eric Dumazet
  2010-05-26  7:56     ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Anton Blanchard @ 2010-05-26  3:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


Hi,

> > Which is:
> > 
> >         WARN_ON(sk->sk_forward_alloc);
> > 
> 
> Yes, the infamous one :)
> 
> Is it reproducible? What kind of workload is it?
> What is the NIC involved?

It was running sysbench against a postgresql database over localhost. In
each case I checked, sk_forward_alloc was less than one page.

I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
Since it isn't an atomic variable I went looking for a lock somewhere in
the call chain (first thought was the socket lock). I couldn't find
anything, but I could easily be missing something.
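
For reference, those helpers boil down to plain, unlocked read-modify-write
on sk_forward_alloc (roughly, from include/net/sock.h):

	static inline void sk_mem_charge(struct sock *sk, int size)
	{
		if (!sk_has_account(sk))
			return;
		sk->sk_forward_alloc -= size;
	}

	static inline void sk_mem_uncharge(struct sock *sk, int size)
	{
		if (!sk_has_account(sk))
			return;
		sk->sk_forward_alloc += size;
	}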

Anton


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-26  3:19   ` Anton Blanchard
@ 2010-05-26  5:18     ` Eric Dumazet
  2010-05-26  7:56     ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-05-26  5:18 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: netdev

On Wed, 26 May 2010 at 13:19 +1000, Anton Blanchard wrote:
> Hi,
> 
> > > Which is:
> > > 
> > >         WARN_ON(sk->sk_forward_alloc);
> > > 
> > 
> > Yes, the infamous one :)
> > 
> > Is it reproducible? What kind of workload is it?
> > What is the NIC involved?
> 
> It was running sysbench against a postgresql database over localhost. In
> each case I checked, sk_forward_alloc was less than one page.

OK. I am a bit surprised that postgresql uses UDP.

> 
> I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
> Since it isn't an atomic variable I went looking for a lock somewhere in
> the call chain (first thought was the socket lock). I couldn't find
> anything, but I could easily be missing something.
> 

The UDP path does indeed use the socket lock to protect the non-atomic
updates of sk_forward_alloc.

Check skb_free_datagram_locked() for example: sk_mem_reclaim_partial()
is called inside the lock_sock_bh/unlock_sock_bh protected section, and
*after* the uncharge done by skb_orphan(). This function was changed
recently, so maybe your platform hit some bug somewhere.

In the receive path, we hold the socket lock while calling
sock_queue_rcv_skb().
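
For reference, the lock_sock_bh()/unlock_sock_bh() pair mentioned above
currently boils down to roughly this (a sketch of the helpers as they stand,
before any fix):

	static inline void lock_sock_bh(struct sock *sk)
	{
		/* BHs off, sk_lock.slock held; sk_lock.owned is never examined */
		spin_lock_bh(&sk->sk_lock.slock);
	}

	static inline void unlock_sock_bh(struct sock *sk)
	{
		spin_unlock_bh(&sk->sk_lock.slock);
	}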





* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-26  3:19   ` Anton Blanchard
  2010-05-26  5:18     ` Eric Dumazet
@ 2010-05-26  7:56     ` David Miller
  2010-05-26 10:12       ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2010-05-26  7:56 UTC (permalink / raw)
  To: anton; +Cc: eric.dumazet, netdev

From: Anton Blanchard <anton@samba.org>
Date: Wed, 26 May 2010 13:19:43 +1000

> I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
> Since it isn't an atomic variable I went looking for a lock somewhere in
> the call chain (first thought was the socket lock). I couldn't find
> anything, but I could easily be missing something.

We take the lock properly for all of the sock_queue_rcv_skb() cases,
but this rule isn't followed for sock_queue_err_skb().

Eric, look at even things like skb_tstamp_tx().  Nothing locks the
socket in those cases, yet we dip down into sock_queue_err_skb() and
thus invoke skb_set_owner_r which goes into sk_mem_charge() and does
the non-atomic update on ->sk_forward_alloc.

I am sure there are other cases with this problem involving
sock_queue_err_skb()...  ip_icmp_error() (via __udp4_lib_err()),
ipv6_icmp_error(), etc.
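
For reference, skb_set_owner_r() is roughly:

	static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
	{
		skb_orphan(skb);
		skb->sk = sk;
		skb->destructor = sock_rfree;
		atomic_add(skb->truesize, &sk->sk_rmem_alloc);
		/* non-atomic: sk->sk_forward_alloc -= skb->truesize */
		sk_mem_charge(sk, skb->truesize);
	}

so every sock_queue_err_skb() call made without the socket lock is an
unlocked writer to sk_forward_alloc.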


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-26  7:56     ` David Miller
@ 2010-05-26 10:12       ` Eric Dumazet
  2010-05-27  3:56         ` Anton Blanchard
  2010-05-29  7:21         ` Warning in net/ipv4/af_inet.c:154 David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-05-26 10:12 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev

On Wed, 26 May 2010 at 00:56 -0700, David Miller wrote:
> From: Anton Blanchard <anton@samba.org>
> Date: Wed, 26 May 2010 13:19:43 +1000
> 
> > I notice we update sk_forward_alloc in sk_mem_charge and sk_mem_uncharge.
> > Since it isn't an atomic variable I went looking for a lock somewhere in
> > the call chain (first thought was the socket lock). I couldn't find
> > anything, but I could easily be missing something.
> 
> We take the lock properly for all of the sock_queue_rcv_skb() cases,
> but this rule isn't followed for sock_queue_err_skb().
> 
> Eric, look at even things like skb_tstamp_tx().  Nothing locks the
> socket in those cases, yet we dip down into sock_queue_err_skb() and
> thus invoke skb_set_owner_r which goes into sk_mem_charge() and does
> the non-atomic update on ->sk_forward_alloc.
> 
> I am sure there are other cases with this problem involving
> sock_queue_err_skb()...  ip_icmp_error() (via __udp4_lib_err()),
> ipv6_icmp_error(), etc.

All these points have indeed been problematic for a loooong time, so this
is stable material.

You are 100% right, David. Maybe we should add a check when changing
sk_forward_alloc to test whether the socket is locked (a lockdep-only test),
but that's for 2.6.36 :)
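
Something along these lines, purely illustrative (the helper name and the
exact test are made up for this sketch, nothing like it is in the tree
today):

	static inline void sk_forward_alloc_add(struct sock *sk, int val)
	{
		/* debug check: neither the 'owned' side nor the slock side
		 * of the socket lock is held -> unprotected update
		 */
		WARN_ON_ONCE(!sock_owned_by_user(sk) &&
			     !spin_is_locked(&sk->sk_lock.slock));
		sk->sk_forward_alloc += val;
	}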

The RAW path is not impacted (yet).

Thanks

[PATCH] net: fix sk_forward_alloc corruptions

As David found out, sock_queue_err_skb() should be called with the socket
lock held, or we risk sk_forward_alloc corruption, since we use non-atomic
operations to update this field.

This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
(BH already disabled)

1) skb_tstamp_tx() 
2) Before calling ip_icmp_error(), in __udp4_lib_err() 
3) Before calling ipv6_icmp_error(), in __udp6_lib_err()

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/skbuff.c |    4 ++++
 net/ipv4/udp.c    |    2 ++
 net/ipv6/udp.c    |    6 ++++--
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c543dd2..439e3b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2991,7 +2991,11 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+
+	bh_lock_sock(sk);
 	err = sock_queue_err_skb(sk, skb);
+	bh_unlock_sock(sk);
+
 	if (err)
 		kfree_skb(skb);
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9de6a69..1d70ff0 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -634,7 +634,9 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
 			goto out;
 	} else {
+		bh_lock_sock(sk);
 		ip_icmp_error(sk, skb, err, uh->dest, info, (u8 *)(uh+1));
+		bh_unlock_sock(sk);
 	}
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3d7a2c0..f441365 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -465,9 +465,11 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (sk->sk_state != TCP_ESTABLISHED && !np->recverr)
 		goto out;
 
-	if (np->recverr)
+	if (np->recverr) {
+		bh_lock_sock(sk);
 		ipv6_icmp_error(sk, skb, err, uh->dest, ntohl(info), (u8 *)(uh+1));
-
+		bh_unlock_sock(sk);
+	}
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
 out:




* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-26 10:12       ` Eric Dumazet
@ 2010-05-27  3:56         ` Anton Blanchard
  2010-05-27  4:06           ` David Miller
  2010-05-27  4:18           ` Eric Dumazet
  2010-05-29  7:21         ` Warning in net/ipv4/af_inet.c:154 David Miller
  1 sibling, 2 replies; 19+ messages in thread
From: Anton Blanchard @ 2010-05-27  3:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

 
Hi Eric,

> You are 100% right David, maybe we should add a test when changing
> sk_forward_alloc to test if socket is locked (lockdep only test), but
> that's for 2.6.36 :)

Thanks for the patch, unfortunately I can still hit the WARN_ON. I'm somewhat
confused by the two stage locking in the socket lock (ie sk_lock.slock and
sk_lock.owned).

What state should the socket lock be in for serialising updates of
sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
unlocked, sk_lock.owned = 1:

NIP [c0000000005b4ad0] .sock_queue_rcv_skb
LR [c0000000005b4acc] .sock_queue_rcv_skb
Call Trace:
[c0000000005f9fcc] .ip_queue_rcv_skb
[c00000000061d604] .__udp_queue_rcv_skb
[c0000000005b1a38] .release_sock
[c0000000006205f0] .udp_sendmsg
[c0000000006290d4] .inet_sendmsg
[c0000000005abfb4] .sock_sendmsg
[c0000000005ae9dc] .SyS_sendto
[c0000000005ab6c0] .SyS_send

And other times we update it with sk_lock.slock = locked, sk_lock.owned = 0:

NIP [c0000000005b2b6c] .sock_rfree
LR [c0000000005b2b68] .sock_rfree
Call Trace:
[c0000000005bca10] .skb_free_datagram_locked
[c00000000061fe88] .udp_recvmsg
[c0000000006285e8] .inet_recvmsg
[c0000000005abe0c] .sock_recvmsg
[c0000000005ae358] .SyS_recvfrom

I see we sometimes take sk_lock.slock then check the owned field, but we
aren't doing that all the time.

Anton


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-27  3:56         ` Anton Blanchard
@ 2010-05-27  4:06           ` David Miller
  2010-05-27  4:21             ` Eric Dumazet
  2010-05-27  4:18           ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2010-05-27  4:06 UTC (permalink / raw)
  To: anton; +Cc: eric.dumazet, netdev

From: Anton Blanchard <anton@samba.org>
Date: Thu, 27 May 2010 13:56:17 +1000

> I'm somewhat confused by the two stage locking in the socket lock
> (ie sk_lock.slock and sk_lock.owned).
> 
> What state should the socket lock be in for serialising updates of
> sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> unlocked, sk_lock.owned = 1:

If sk_lock.owned=1, the user has grabbed an exclusive sleepable lock on the
socket; it does this via something approximating:

retry:
	spin_lock(&sk_lock.slock);
	was_locked = sk_lock.owned;
	if (!was_locked)
		sk_lock.owned = 1;
	spin_unlock(&sk_lock.slock);
	if (was_locked) {
		sleep_on(condition(sk_lock.owned));
		goto retry;
	}

This allows user context code to sleep with exclusive access to the
socket.

Code that cannot sleep takes the spinlock, and queues the work if the
owner field is non-zero.  Otherwise, it keeps the spinlock held and does
the work.

In either case, socket modifications are thus done completely protected
from other contexts.
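
The non-sleeping (softirq) side therefore follows roughly this pattern, as
UDP's receive path does (a sketch, not the exact code):

	bh_lock_sock(sk);		/* spin_lock(&sk->sk_lock.slock) */
	if (!sock_owned_by_user(sk))
		rc = __udp_queue_rcv_skb(sk, skb);	/* not owned: do the work now */
	else
		sk_add_backlog(sk, skb);	/* owned: defer to release_sock() */
	bh_unlock_sock(sk);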




* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-27  3:56         ` Anton Blanchard
  2010-05-27  4:06           ` David Miller
@ 2010-05-27  4:18           ` Eric Dumazet
  2010-05-27  4:21             ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-05-27  4:18 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: David Miller, netdev

On Thu, 27 May 2010 at 13:56 +1000, Anton Blanchard wrote:
> Hi Eric,
> 
> > You are 100% right David, maybe we should add a test when changing
> > sk_forward_alloc to test if socket is locked (lockdep only test), but
> > that's for 2.6.36 :)
> 
> Thanks for the patch, unfortunately I can still hit the WARN_ON. I'm somewhat
> confused by the two stage locking in the socket lock (ie sk_lock.slock and
> sk_lock.owned).
> 
> What state should the socket lock be in for serialising updates of
> sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> unlocked, sk_lock.owned = 1:
> 
> NIP [c0000000005b4ad0] .sock_queue_rcv_skb
> LR [c0000000005b4acc] .sock_queue_rcv_skb
> Call Trace:
> [c0000000005f9fcc] .ip_queue_rcv_skb
> [c00000000061d604] .__udp_queue_rcv_skb
> [c0000000005b1a38] .release_sock
> [c0000000006205f0] .udp_sendmsg
> [c0000000006290d4] .inet_sendmsg
> [c0000000005abfb4] .sock_sendmsg
> [c0000000005ae9dc] .SyS_sendto
> [c0000000005ab6c0] .SyS_send
> 
> And other times we update it with sk_lock.slock = locked, sk_lock.owned = 0:
> 
> NIP [c0000000005b2b6c] .sock_rfree
> LR [c0000000005b2b68] .sock_rfree
> Call Trace:
> [c0000000005bca10] .skb_free_datagram_locked
> [c00000000061fe88] .udp_recvmsg
> [c0000000006285e8] .inet_recvmsg
> [c0000000005abe0c] .sock_recvmsg
> [c0000000005ae358] .SyS_recvfrom
> 
> I see we sometimes take sk_lock.slock then check the owned field, but we
> aren't doing that all the time.
> 

The old rule was:

A process context used
lock + test_and_set_or_sleep + unlock (sk_lock.slock = unlocked,
sk_lock.owned = 1)

The softirq handler used:
(sk_lock.slock = locked, sk_lock.owned = 0)

I added a shortcut, but it seems wrong as is.

Process context now does:

lock only (sk_lock.slock = locked, sk_lock.owned = ???)

So I should add a test on owned. If it is set (by another thread), we should
take the slow path.

Thanks!




* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-27  4:18           ` Eric Dumazet
@ 2010-05-27  4:21             ` David Miller
  2010-05-27  5:06               ` [PATCH] net: fix lock_sock_bh/unlock_sock_bh Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2010-05-27  4:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 May 2010 06:18:46 +0200

> Process context :
> 
> lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> 
> So I should add a test on owned. If set (by another thread), we should take the slow path.

Indeed, what you're doing now is broken.

If owned is non-zero when you take the spinlock you have to queue your
work, just like how we toss packets into the socket backlog, which is
processed by release_sock(), when this happens.


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-27  4:06           ` David Miller
@ 2010-05-27  4:21             ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-05-27  4:21 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev

On Wed, 26 May 2010 at 21:06 -0700, David Miller wrote:
> From: Anton Blanchard <anton@samba.org>
> Date: Thu, 27 May 2010 13:56:17 +1000
> 
> > I'm somewhat confused by the two stage locking in the socket lock
> > (ie sk_lock.slock and sk_lock.owned).
> > 
> > What state should the socket lock be in for serialising updates of
> > sk_forward_alloc? In some cases we appear to update it with sk_lock.slock =
> > unlocked, sk_lock.owned = 1:
> 
> If sk_lock.owned=1, the user has grabbed an exclusive sleepable lock on the
> socket; it does this via something approximating:
> 
> retry:
> 	spin_lock(&sk_lock.slock);
> 	was_locked = sk_lock.owned;
> 	if (!was_locked)
> 		sk_lock.owned = 1;
> 	spin_unlock(&sk_lock.slock);
> 	if (was_locked) {
> 		sleep_on(condition(sk_lock.owned));
> 		goto retry;
> 	}
> 
> This allows user context code to sleep with exclusive access to the
> socket.
> 
> Code that cannot sleep takes the spinlock, and queues the work if the
> owner field is non-zero.  Otherwise, it keeps the spinlock held and does
> the work.
> 
> In either case, socket modifications are thus done completely protected
> from other contexts.
> 
> 

Yes, but not in the case where one user context uses the 'slow locking' and
another uses the 'fast locking'.

I am working on a patch now.






* [PATCH] net: fix lock_sock_bh/unlock_sock_bh
  2010-05-27  4:21             ` David Miller
@ 2010-05-27  5:06               ` Eric Dumazet
  2010-05-27  5:20                 ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-05-27  5:06 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev

On Wed, 26 May 2010 at 21:21 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 27 May 2010 06:18:46 +0200
> 
> > Process context :
> > 
> > lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> > 
> > So I should add a test on owned. If set (by another thread), we should take the slow path.
> 
> Indeed, what you're doing now is broken.
> 
> If owned is non-zero when you take the spinlock you have to queue your
> work, just like how we toss packets into the socket backlog, which is
> processed by release_sock(), when this happens.

Here is the patch I am going to test today.

Anton, could you please test it too, just to be sure?

Thanks!


[PATCH] net: fix lock_sock_bh/unlock_sock_bh

This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.

After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   20 ++++++++++++++------
 net/core/datagram.c |    6 ++++--
 net/core/sock.c     |   32 ++++++++++++++++++++++++++++++++
 net/ipv4/udp.c      |   14 ++++++++------
 net/ipv6/udp.c      |    5 +++--
 5 files changed, 61 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d2a71b0..ca241ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
-static inline void lock_sock_bh(struct sock *sk)
+extern bool lock_sock_fast(struct sock *sk);
+/**
+ * unlock_sock_fast - complement of lock_sock_fast
+ * @sk: socket
+ * @slow: slow mode
+ *
+ * fast unlock socket for user context.
+ * If slow mode is on, we call regular release_sock()
+ */
+static inline void unlock_sock_fast(struct sock *sk, bool slow)
 {
-	spin_lock_bh(&sk->sk_lock.slock);
+	if (slow)
+		release_sock(sk);
+	else
+		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static inline void unlock_sock_bh(struct sock *sk)
-{
-	spin_unlock_bh(&sk->sk_lock.slock);
-}
 
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e009753..f5b6f43 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	bool slow;
+
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
 
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
 	__kfree_skb(skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 37fe9b6..7ab6398 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2007,6 +2007,38 @@ void release_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(release_sock);
 
+/**
+ * lock_sock_fast - fast version of lock_sock
+ * @sk: socket
+ *
+ * This version should be used for very small section, where process wont block
+ * return false if fast path is taken
+ *   sk_lock.slock locked, owned = 0, BH disabled
+ * return true if slow path is taken
+ *   sk_lock.slock unlocked, owned = 1, BH enabled
+ */
+bool lock_sock_fast(struct sock *sk)
+{
+	might_sleep();
+	spin_lock_bh(&sk->sk_lock.slock);
+
+	if (!sk->sk_lock.owned)
+		/*
+		 * Note : We must disable BH or risk a deadlock
+		 */
+		return false;
+
+	__lock_sock(sk);
+	sk->sk_lock.owned = 1;
+	spin_unlock(&sk->sk_lock.slock);
+	/*
+	 * The sk_lock has mutex_lock() semantics here:
+	 */
+	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
+	return true;
+}
+EXPORT_SYMBOL(lock_sock_fast);
+
 int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
 {
 	struct timeval tv;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9de6a69..b9d0d40 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1063,10 +1063,11 @@ static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock_bh(sk);
+		bool slow = lock_sock_fast(sk);
+
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		unlock_sock_bh(sk);
+		unlock_sock_fast(sk, slow);
 	}
 	return res;
 }
@@ -1123,6 +1124,7 @@ int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int peeked;
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
+	bool slow;
 
 	/*
 	 *	Check any passed addresses
@@ -1197,10 +1199,10 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1625,9 +1627,9 @@ int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock_bh(sk);
+	bool slow = lock_sock_fast(sk);
 	udp_flush_pending_frames(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3d7a2c0..87be586 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -328,6 +328,7 @@ int udpv6_recvmsg(struct kiocb *iocb, struct sock *sk,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	int is_udp4;
+	bool slow;
 
 	if (addr_len)
 		*addr_len=sizeof(struct sockaddr_in6);
@@ -424,7 +425,7 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +434,7 @@ csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;




* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
  2010-05-27  5:06               ` [PATCH] net: fix lock_sock_bh/unlock_sock_bh Eric Dumazet
@ 2010-05-27  5:20                 ` Eric Dumazet
  2010-05-27  5:23                   ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-05-27  5:20 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev

On Thu, 27 May 2010 at 07:06 +0200, Eric Dumazet wrote:
> On Wed, 26 May 2010 at 21:21 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 27 May 2010 06:18:46 +0200
> > 
> > > Process context :
> > > 
> > > lock only (sk_lock.slock = locked, sk_lock.owned = ???)
> > > 
> > > So I should add a test on owned. If set (by another thread), we should take the slow path.
> > 
> > Indeed, what you're doing now is broken.
> > 
> > If owned is non-zero when you take the spinlock you have to queue your
> > work, just like how we toss packets into the socket backlog, which is
> > processed by release_sock(), when this happens.
> 
> Here is the patch I am going to test today.
> 
> Anton, could you please test it too, just to be sure?
> 
> Thanks!
> 
> 
> [PATCH] net: fix lock_sock_bh/unlock_sock_bh
> 
> This new sock lock primitive was introduced to speedup some user context
> socket manipulation. But it is unsafe to protect two threads, one using
> regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> 
> This patch changes lock_sock_bh to be careful against 'owned' state.
> If owned is found to be set, we must take the slow path.
> lock_sock_bh() now returns a boolean to say if the slow path was taken,
> and this boolean is used at unlock_sock_bh time to call the appropriate
> unlock function.
> 
> After this change, BH are either disabled or enabled during the
> lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/sock.h  |   20 ++++++++++++++------
>  net/core/datagram.c |    6 ++++--
>  net/core/sock.c     |   32 ++++++++++++++++++++++++++++++++
>  net/ipv4/udp.c      |   14 ++++++++------
>  net/ipv6/udp.c      |    5 +++--
>  5 files changed, 61 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d2a71b0..ca241ea 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
>  				SINGLE_DEPTH_NESTING)
>  #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
>  
> -static inline void lock_sock_bh(struct sock *sk)
> +extern bool lock_sock_fast(struct sock *sk);
> +/**
> + * unlock_sock_fast - complement of lock_sock_fast
> + * @sk: socket
> + * @slow: slow mode
> + *
> + * fast unlock socket for user context.
> + * If slow mode is on, we call regular release_sock()
> + */
> +static inline void unlock_sock_fast(struct sock *sk, bool slow)
>  {
> -	spin_lock_bh(&sk->sk_lock.slock);
> +	if (slow)
> +		release_sock(sk);
> +	else
> +		spin_unlock_bh(&sk->sk_lock.slock);
>  }
>  
> -static inline void unlock_sock_bh(struct sock *sk)
> -{
> -	spin_unlock_bh(&sk->sk_lock.slock);
> -}
>  
>  extern struct sock		*sk_alloc(struct net *net, int family,
>  					  gfp_t priority,
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index e009753..f5b6f43 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
>  
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
> +	bool slow;
> +
>  	if (likely(atomic_read(&skb->users) == 1))
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>  		return;
>  
> -	lock_sock_bh(sk);
> +	slow = lock_sock_fast(sk);
>  	skb_orphan(skb);
>  	sk_mem_reclaim_partial(sk);
> -	unlock_sock_bh(sk);
> +	unlock_sock_fast(sk, slow);
>  
>  	/* skb is now orphaned, can be freed outside of locked section */
>  	__kfree_skb(skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 37fe9b6..7ab6398 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2007,6 +2007,38 @@ void release_sock(struct sock *sk)
>  }
>  EXPORT_SYMBOL(release_sock);
>  
> +/**
> + * lock_sock_fast - fast version of lock_sock
> + * @sk: socket
> + *
> + * This version should be used for very small section, where process wont block
> + * return false if fast path is taken
> + *   sk_lock.slock locked, owned = 0, BH disabled
> + * return true if slow path is taken
> + *   sk_lock.slock unlocked, owned = 1, BH enabled
> + */
> +bool lock_sock_fast(struct sock *sk)
> +{
> +	might_sleep();
> +	spin_lock_bh(&sk->sk_lock.slock);
> +
> +	if (!sk->sk_lock.owned)
> +		/*
> +		 * Note : We must disable BH or risk a deadlock
> +		 */
> +		return false;
> +
> +	__lock_sock(sk);
> +	sk->sk_lock.owned = 1;
> +	spin_unlock(&sk->sk_lock.slock);
> +	/*
> +	 * The sk_lock has mutex_lock() semantics here:
> +	 */
> +	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);

Oops, I missed a local_bh_enable() here.

> +	return true;
> +}
> +EXPORT_SYMBOL(lock_sock_fast);
> +
>  int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
>  {
>  	struct timeval tv;

Here is v2 of the patch, with local_bh_enable() added in lock_sock_fast():

[PATCH v2] net: fix lock_sock_bh/unlock_sock_bh

This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.

After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   20 ++++++++++++++------
 net/core/datagram.c |    6 ++++--
 net/core/sock.c     |   33 +++++++++++++++++++++++++++++++++
 net/ipv4/udp.c      |   14 ++++++++------
 net/ipv6/udp.c      |    5 +++--
 5 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d2a71b0..ca241ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,15 +1026,23 @@ extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
-static inline void lock_sock_bh(struct sock *sk)
+extern bool lock_sock_fast(struct sock *sk);
+/**
+ * unlock_sock_fast - complement of lock_sock_fast
+ * @sk: socket
+ * @slow: slow mode
+ *
+ * fast unlock socket for user context.
+ * If slow mode is on, we call regular release_sock()
+ */
+static inline void unlock_sock_fast(struct sock *sk, bool slow)
 {
-	spin_lock_bh(&sk->sk_lock.slock);
+	if (slow)
+		release_sock(sk);
+	else
+		spin_unlock_bh(&sk->sk_lock.slock);
 }
 
-static inline void unlock_sock_bh(struct sock *sk)
-{
-	spin_unlock_bh(&sk->sk_lock.slock);
-}
 
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e009753..f5b6f43 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,15 +229,17 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
+	bool slow;
+
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
 
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	skb_orphan(skb);
 	sk_mem_reclaim_partial(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	/* skb is now orphaned, can be freed outside of locked section */
 	__kfree_skb(skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 37fe9b6..2cf7f9f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2007,6 +2007,39 @@ void release_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(release_sock);
 
+/**
+ * lock_sock_fast - fast version of lock_sock
+ * @sk: socket
+ *
+ * This version should be used for very small section, where process wont block
+ * return false if fast path is taken
+ *   sk_lock.slock locked, owned = 0, BH disabled
+ * return true if slow path is taken
+ *   sk_lock.slock unlocked, owned = 1, BH enabled
+ */
+bool lock_sock_fast(struct sock *sk)
+{
+	might_sleep();
+	spin_lock_bh(&sk->sk_lock.slock);
+
+	if (!sk->sk_lock.owned)
+		/*
+		 * Note : We must disable BH
+		 */
+		return false;
+
+	__lock_sock(sk);
+	sk->sk_lock.owned = 1;
+	spin_unlock(&sk->sk_lock.slock);
+	/*
+	 * The sk_lock has mutex_lock() semantics here:
+	 */
+	mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
+	local_bh_enable();
+	return true;
+}
+EXPORT_SYMBOL(lock_sock_fast);
+
 int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
 {
 	struct timeval tv;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9de6a69..b9d0d40 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1063,10 +1063,11 @@ static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock_bh(sk);
+		bool slow = lock_sock_fast(sk);
+
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		unlock_sock_bh(sk);
+		unlock_sock_fast(sk, slow);
 	}
 	return res;
 }
@@ -1123,6 +1124,7 @@ int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int peeked;
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
+	bool slow;
 
 	/*
 	 *	Check any passed addresses
@@ -1197,10 +1199,10 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1625,9 +1627,9 @@ int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock_bh(sk);
+	bool slow = lock_sock_fast(sk);
 	udp_flush_pending_frames(sk);
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3d7a2c0..87be586 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -328,6 +328,7 @@ int udpv6_recvmsg(struct kiocb *iocb, struct sock *sk,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	int is_udp4;
+	bool slow;
 
 	if (addr_len)
 		*addr_len=sizeof(struct sockaddr_in6);
@@ -424,7 +425,7 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock_bh(sk);
+	slow = lock_sock_fast(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +434,7 @@ csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_bh(sk);
+	unlock_sock_fast(sk, slow);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;




* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
  2010-05-27  5:20                 ` Eric Dumazet
@ 2010-05-27  5:23                   ` David Miller
  2010-05-27  6:09                     ` Anton Blanchard
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2010-05-27  5:23 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 May 2010 07:20:18 +0200

> [PATCH v2] net: fix lock_sock_bh/unlock_sock_bh
> 
> This new sock lock primitive was introduced to speedup some user context
> socket manipulation. But it is unsafe to protect two threads, one using
> regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> 
> This patch changes lock_sock_bh to be careful against 'owned' state.
> If owned is found to be set, we must take the slow path.
> lock_sock_bh() now returns a boolean to say if the slow path was taken,
> and this boolean is used at unlock_sock_bh time to call the appropriate
> unlock function.
> 
> After this change, BH are either disabled or enabled during the
> lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, I'll wait for positive testing from Anton before applying
this.


* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
  2010-05-27  5:23                   ` David Miller
@ 2010-05-27  6:09                     ` Anton Blanchard
  2010-05-27  7:29                       ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-05-27  6:09 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev


> > [PATCH v2] net: fix lock_sock_bh/unlock_sock_bh
> > 
> > This new sock lock primitive was introduced to speedup some user context
> > socket manipulation. But it is unsafe to protect two threads, one using
> > regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
> > 
> > This patch changes lock_sock_bh to be careful against 'owned' state.
> > If owned is found to be set, we must take the slow path.
> > lock_sock_bh() now returns a boolean to say if the slow path was taken,
> > and this boolean is used at unlock_sock_bh time to call the appropriate
> > unlock function.
> > 
> > After this change, BH are either disabled or enabled during the
> > lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
> > so we rename these functions to lock_sock_fast()/unlock_sock_fast().
> > 
> > Reported-by: Anton Blanchard <anton@samba.org>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good, I'll wait for positive testing from Anton before applying
> this.

Thanks guys, this fixed it.

Tested-by: Anton Blanchard <anton@samba.org>

Anton


* Re: [PATCH] net: fix lock_sock_bh/unlock_sock_bh
  2010-05-27  6:09                     ` Anton Blanchard
@ 2010-05-27  7:29                       ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2010-05-27  7:29 UTC (permalink / raw)
  To: anton; +Cc: eric.dumazet, netdev

From: Anton Blanchard <anton@samba.org>
Date: Thu, 27 May 2010 16:09:23 +1000

> 
>> > [PATCH v2] net: fix lock_sock_bh/unlock_sock_bh
>> > 
>> > This new sock lock primitive was introduced to speedup some user context
>> > socket manipulation. But it is unsafe to protect two threads, one using
>> > regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
>> > 
>> > This patch changes lock_sock_bh to be careful against 'owned' state.
>> > If owned is found to be set, we must take the slow path.
>> > lock_sock_bh() now returns a boolean to say if the slow path was taken,
>> > and this boolean is used at unlock_sock_bh time to call the appropriate
>> > unlock function.
>> > 
>> > After this change, BH are either disabled or enabled during the
>> > lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
>> > so we rename these functions to lock_sock_fast()/unlock_sock_fast().
>> > 
>> > Reported-by: Anton Blanchard <anton@samba.org>
>> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> 
>> Looks good, I'll wait for positive testing from Anton before applying
>> this.
> 
> Thanks guys, this fixed it.
> 
> Tested-by: Anton Blanchard <anton@samba.org>

Thanks for testing Anton, I'll apply this now.


* Re: Warning in net/ipv4/af_inet.c:154
  2010-05-26 10:12       ` Eric Dumazet
  2010-05-27  3:56         ` Anton Blanchard
@ 2010-05-29  7:21         ` David Miller
  2010-05-31 16:02           ` [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2010-05-29  7:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 26 May 2010 12:12:56 +0200

> [PATCH] net: fix sk_forward_alloc corruptions
> 
> As David found out, sock_queue_err_skb() should be called with the socket
> lock held, or we risk sk_forward_alloc corruption, since we use non-atomic
> operations to update this field.
> 
> This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
> (BH already disabled)
> 
> 1) skb_tstamp_tx() 
> 2) Before calling ip_icmp_error(), in __udp4_lib_err() 
> 3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
> 
> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

This wasn't the direct cause of Anton's problems but is
a serious legitimate bug.

So, applied, thanks!


* [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc
  2010-05-29  7:21         ` Warning in net/ipv4/af_inet.c:154 David Miller
@ 2010-05-31 16:02           ` Eric Dumazet
  2010-06-01  6:44             ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-05-31 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: anton, netdev

On Sat, 29 May 2010 at 00:21 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 26 May 2010 12:12:56 +0200
> 
> > [PATCH] net: fix sk_forward_alloc corruptions
> > 
> > As David found out, sock_queue_err_skb() should be called with the socket
> > lock held, or we risk sk_forward_alloc corruption, since we use non-atomic
> > operations to update this field.
> > 
> > This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
> > (BH already disabled)
> > 
> > 1) skb_tstamp_tx() 
> > 2) Before calling ip_icmp_error(), in __udp4_lib_err() 
> > 3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
> > 
> > Reported-by: Anton Blanchard <anton@samba.org>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> This wasn't the direct cause of Anton's problems but is
> a serious legitimate bug.
> 
> So, applied, thanks!

That's embarrassing...

I believe there is still a problem with sock_queue_err_skb(), sorry
Dave :(

There is also a problem in ip_recv_error(): it is not called with the socket
locked, yet the skb is freed there -> potential corruption.

If the current socket is 'owned' by a user thread, then we can still corrupt
sk_forward_alloc, even if we use bh_lock_sock().

I don't think we need another backlog for such a case; maybe we could
account for skb->truesize in sk_rmem_alloc (this is atomic), and not
account via sk_mem_charge()?

Another possibility would be to store the backlog function pointer in the
skb, so that the backlog is generic (normal packets and error packets
handled in the same backlog queue), instead of using a protocol-provided
pointer. But that's more complex and error prone.

Thanks

[PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc

Correct sk_forward_alloc handling for error_queue would need to use a
backlog of frames that softirq handler could not deliver because socket
is owned by user thread. Or extend backlog processing to be able to
process normal and error packets.

Another possibility is to not use mem charge for error queue, this is
what I implemented in this patch.

Note: this reverts commit 29030374
(net: fix sk_forward_alloc corruptions), since we dont need to lock
socket anymore.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h |   15 +--------------
 net/core/skbuff.c  |   30 ++++++++++++++++++++++++++++--
 net/ipv4/udp.c     |    6 ++----
 net/ipv6/udp.c     |    6 ++----
 4 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ca241ea..731150d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1524,20 +1524,7 @@ extern void sk_stop_timer(struct sock *sk, struct timer_list* timer);
 
 extern int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
-static inline int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
-{
-	/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
-	   number of warnings when compiling with -W --ANK
-	 */
-	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
-	    (unsigned)sk->sk_rcvbuf)
-		return -ENOMEM;
-	skb_set_owner_r(skb, sk);
-	skb_queue_tail(&sk->sk_error_queue, skb);
-	if (!sock_flag(sk, SOCK_DEAD))
-		sk->sk_data_ready(sk, skb->len);
-	return 0;
-}
+extern int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb);
 
 /*
  *	Recover an error report and clear atomically
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e7ac09..9f07e74 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2965,6 +2965,34 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
 }
 EXPORT_SYMBOL_GPL(skb_cow_data);
 
+static void sock_rmem_free(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+}
+
+/*
+ * Note: We dont mem charge error packets (no sk_forward_alloc changes)
+ */
+int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
+	    (unsigned)sk->sk_rcvbuf)
+		return -ENOMEM;
+
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = sock_rmem_free;
+	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+
+	skb_queue_tail(&sk->sk_error_queue, skb);
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_data_ready(sk, skb->len);
+	return 0;
+}
+EXPORT_SYMBOL(sock_queue_err_skb);
+
 void skb_tstamp_tx(struct sk_buff *orig_skb,
 		struct skb_shared_hwtstamps *hwtstamps)
 {
@@ -2997,9 +3025,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 
-	bh_lock_sock(sk);
 	err = sock_queue_err_skb(sk, skb);
-	bh_unlock_sock(sk);
 
 	if (err)
 		kfree_skb(skb);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 50678f9..eec4ff4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -633,11 +633,9 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 	if (!inet->recverr) {
 		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
 			goto out;
-	} else {
-		bh_lock_sock(sk);
+	} else
 		ip_icmp_error(sk, skb, err, uh->dest, info, (u8 *)(uh+1));
-		bh_unlock_sock(sk);
-	}
+
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
 out:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3048f90..87be586 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -466,11 +466,9 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (sk->sk_state != TCP_ESTABLISHED && !np->recverr)
 		goto out;
 
-	if (np->recverr) {
-		bh_lock_sock(sk);
+	if (np->recverr)
 		ipv6_icmp_error(sk, skb, err, uh->dest, ntohl(info), (u8 *)(uh+1));
-		bh_unlock_sock(sk);
-	}
+
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
 out:




* Re: [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc
  2010-05-31 16:02           ` [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc Eric Dumazet
@ 2010-06-01  6:44             ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2010-06-01  6:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: anton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 31 May 2010 18:02:46 +0200

> There is also a problem in ip_recv_error(): it is not called with the socket
> locked, yet the skb is freed there -> potential corruption.
> 
> If the current socket is 'owned' by a user thread, then we can still corrupt
> sk_forward_alloc, even if we use bh_lock_sock().
> 
> I don't think we need another backlog for such a case; maybe we could
> account for skb->truesize in sk_rmem_alloc (this is atomic), and not
> account via sk_mem_charge()?

That sounds fine to me.

> [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc
> 
> Correct sk_forward_alloc handling for error_queue would need to use a
> backlog of frames that softirq handler could not deliver because socket
> is owned by user thread. Or extend backlog processing to be able to
> process normal and error packets.
> 
> Another possibility is to not use mem charge for error queue, this is
> what I implemented in this patch.
> 
> Note: this reverts commit 29030374
> (net: fix sk_forward_alloc corruptions), since we dont need to lock
> socket anymore.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, applied, thanks Eric.


Thread overview: 19+ messages
2010-05-25 11:58 Warning in net/ipv4/af_inet.c:154 Anton Blanchard
2010-05-25 15:27 ` Eric Dumazet
2010-05-26  3:19   ` Anton Blanchard
2010-05-26  5:18     ` Eric Dumazet
2010-05-26  7:56     ` David Miller
2010-05-26 10:12       ` Eric Dumazet
2010-05-27  3:56         ` Anton Blanchard
2010-05-27  4:06           ` David Miller
2010-05-27  4:21             ` Eric Dumazet
2010-05-27  4:18           ` Eric Dumazet
2010-05-27  4:21             ` David Miller
2010-05-27  5:06               ` [PATCH] net: fix lock_sock_bh/unlock_sock_bh Eric Dumazet
2010-05-27  5:20                 ` Eric Dumazet
2010-05-27  5:23                   ` David Miller
2010-05-27  6:09                     ` Anton Blanchard
2010-05-27  7:29                       ` David Miller
2010-05-29  7:21         ` Warning in net/ipv4/af_inet.c:154 David Miller
2010-05-31 16:02           ` [PATCH] net: sock_queue_err_skb() dont mess with sk_forward_alloc Eric Dumazet
2010-06-01  6:44             ` David Miller
