* [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
@ 2015-10-26 18:55 Ani Sinha
  2015-10-26 20:06 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 14+ messages in thread
From: Ani Sinha @ 2015-10-26 18:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, David S. Miller,
	netfilter-devel, netfilter, netdev, coreteam, Florian Westphal,
	Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	Cyrill Gorcunov, Ani Sinha, Zefan Li, Eric Dumazet

netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using it. Then this conntrack is released, and another
thread creates a conntrack at the same address with an equal tuple.
After this, a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created.
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case a BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirmed bit too. I think
that's right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()
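
To make the required discipline concrete, here is a simplified
userspace model of the reader side (an illustrative sketch; struct
entry, get_ref(), put_ref() and the one-slot table are made-up names,
not the kernel's API): take the reference first with an inc-not-zero,
then re-validate the key and the confirmed state, and drop the
reference on any mismatch.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
	atomic_int use;		/* refcount; 0 means freed/reusable    */
	unsigned long key;	/* stands in for the tuple + zone      */
	atomic_bool confirmed;	/* set only after full initialization  */
};

static struct entry slot;	/* trivial stand-in for the RCU hash   */

/* Take a reference iff the object is still live (use > 0). */
static bool get_ref(struct entry *e)
{
	int old = atomic_load(&e->use);

	while (old != 0)
		if (atomic_compare_exchange_weak(&e->use, &old, old + 1))
			return true;
	return false;			/* raced with the final put */
}

static void put_ref(struct entry *e)
{
	atomic_fetch_sub(&e->use, 1);
}

/* Reader side: reference first, validate second. */
static struct entry *find_get(unsigned long key)
{
	struct entry *e = &slot;	/* in the kernel: RCU hash walk */

	if (!get_ref(e))
		return NULL;		/* freed; the real code rescans */
	/* The slot may now hold a different or half-built object. */
	if (e->key != key || !atomic_load(&e->confirmed)) {
		put_ref(e);
		return NULL;		/* the real code retries        */
	}
	return e;
}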

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug that is reproduced on a few nodes.
In our case, one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ani Sinha <ani@arista.com>
---
 net/netfilter/nf_conntrack_core.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+			const struct nf_conntrack_tuple *tuple,
+			u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */
+	return nf_ct_tuple_equal(tuple, &h->tuple) &&
+		nf_ct_zone(ct) == zone &&
+		nf_ct_is_confirmed(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -330,8 +345,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
-		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
+		if (nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -378,8 +392,7 @@ begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
-				     nf_ct_zone(ct) != zone)) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
-- 
1.8.1.4



* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-26 18:55 [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream Ani Sinha
@ 2015-10-26 20:06 ` Pablo Neira Ayuso
  2015-10-28  6:36   ` Neal P. Murphy
                     ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Pablo Neira Ayuso @ 2015-10-26 20:06 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	netdev, coreteam, Florian Westphal, Jozsef Kadlecsik,
	Cyrill Gorcunov, Zefan Li, Eric Dumazet, Neal P. Murphy

Hi,

On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Please, no need to Cc everyone here. Please, submit your Netfilter
patches to netfilter-devel@vger.kernel.org.

Moreover, it would be great if the subject includes something
descriptive on what you need, for this I'd suggest:

[PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

I'm including Neal P. Murphy, he said he would help testing these
backports, getting a Tested-by: tag usually speeds up things too.

The burden is usually huge here; the easier you make it for us, the better.
Then we can review and, if no major concerns, I can submit this to
-stable.

Let me know if you have any other questions,
Thanks.


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-26 20:06 ` Pablo Neira Ayuso
@ 2015-10-28  6:36   ` Neal P. Murphy
       [not found]   ` <17709_1446014232_t9S6b5a3017652_20151028023650.7b76098f@playground>
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Neal P. Murphy @ 2015-10-28  6:36 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Pablo Neira Ayuso, netfilter-devel, netfilter, netdev

On Mon, 26 Oct 2015 21:06:33 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Hi,
> 
> On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> 
> Please, no need to Cc everyone here. Please, submit your Netfilter
> patches to netfilter-devel@vger.kernel.org.
> 
> Moreover, it would be great if the subject includes something
> descriptive on what you need, for this I'd suggest:
> 
> [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> 
> I'm including Neal P. Murphy, he said he would help testing these
> backports, getting a Tested-by: tag usually speeds up things too.

I hammered it a couple nights ago. First test was 5000 processes on 6 SMP CPUs opening and closing a port on a 'remote' host using the usual random source ports. Only got up to 32000 conntracks. The generator was a 64-bit Smoothwall KVM without the patch. The traffic passed through a 32-bit Smoothwall KVM with the patch. The target was on the VM host. No problems encountered. I suspect I didn't come close to triggering the original problem. Second test was a couple thousand processes all using the same source IP and port and dest IP and port. Still no problems. But these were perl scripts (and they used lots of RAM); perhaps a short C program would let me run more.
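
A short C program along those lines might look like this -- an
untested sketch; the target address, port and worker count below are
placeholders to be tuned:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS	512		/* forked workers                  */
#define TARGET	"192.0.2.1"	/* placeholder target address      */
#define PORT	80

/* Open and close connections forever; each connect()/close() pair
 * churns a conntrack entry. Run until interrupted. */
static void hammer(void)
{
	struct sockaddr_in sa;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(PORT);
	inet_pton(AF_INET, TARGET, &sa.sin_addr);

	for (;;) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;
		connect(fd, (struct sockaddr *)&sa, sizeof(sa));
		close(fd);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < NPROCS; i++)
		if (fork() == 0)
			hammer();
	while (wait(NULL) > 0)
		;
	return 0;
}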

Any ideas on how I might test it more brutally?

N


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
       [not found]   ` <17709_1446014232_t9S6b5a3017652_20151028023650.7b76098f@playground>
@ 2015-10-29  6:40     ` Neal P. Murphy
  2015-10-30  0:01       ` Ani Sinha
  0 siblings, 1 reply; 14+ messages in thread
From: Neal P. Murphy @ 2015-10-29  6:40 UTC (permalink / raw)
  To: Ani Sinha; +Cc: Pablo Neira Ayuso, netfilter-devel, netfilter, netdev

On Wed, 28 Oct 2015 02:36:50 -0400
"Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> wrote:

> On Mon, 26 Oct 2015 21:06:33 +0100
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > Hi,
> > 
> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> > 
> > Please, no need to Cc everyone here. Please, submit your Netfilter
> > patches to netfilter-devel@vger.kernel.org.
> > 
> > Moreover, it would be great if the subject includes something
> > descriptive on what you need, for this I'd suggest:
> > 
> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> > 
> > I'm including Neal P. Murphy, he said he would help testing these
> > backports, getting a Tested-by: tag usually speeds up things too.
> 

I've probably done about as much seat-of-the-pants testing as I can. All opening/closing the same destination IP/port.

Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 2100MHz.

Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent across PURPLE (to bypass NAT and firewall).

Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 with these patches), 3GiB RAM and minimal swap.

In the first set of tests, generator 1's traffic passed through Generator 2 as a NATting firewall, to the host's web server. In the second set of tests, generator 2's traffic went through NAT to the host's web server.

The load tests:
  - 2500 processes using 2500 addresses and random src ports
  - 2500 processes using 2500 addresses and the same src port
  - 2500 processes using the same src address and port

I also tested using stock NF timeouts and using 1 second timeouts.

Bandwidth used got as high as 16Mb/s for some tests.

Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on the test and the timeouts.

I did not reproduce the problem these patches solve. But more importantly, I saw no problems at all. Each time I terminated a test, RAM usage returned to about that of post-boot; so there were no apparent memory leaks. No kernel messages and no netfilter messages appeared during the tests.

If I have time, I suppose I could run another set of tests: 2500 source processes using 2500 addresses times 200 ports to connect to 2500 addresses times 200 ports on a destination system. Each process opens 200 sockets, then closes them. And repeats ad infinitum. But I might have to be clever since I can't run 500 000 processes; but I could run 20 VMs; that would get it down to about 12 000 processes per VM. And I might have to figure out how to allow processes on the destination system to open hundreds or thousands of sockets.

N


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-29  6:40     ` Neal P. Murphy
@ 2015-10-30  0:01       ` Ani Sinha
  2015-10-30  1:21         ` Neal P. Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Ani Sinha @ 2015-10-30  0:01 UTC (permalink / raw)
  To: Neal P. Murphy; +Cc: Pablo Neira Ayuso, netfilter-devel, netfilter, netdev

On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
<neal.p.murphy@alum.wpi.edu> wrote:
> On Wed, 28 Oct 2015 02:36:50 -0400
> "Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> wrote:
>
>> On Mon, 26 Oct 2015 21:06:33 +0100
>> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>
>> > Hi,
>> >
>> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
>> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >
>> > Please, no need to Cc everyone here. Please, submit your Netfilter
>> > patches to netfilter-devel@vger.kernel.org.
>> >
>> > Moreover, it would be great if the subject includes something
>> > descriptive on what you need, for this I'd suggest:
>> >
>> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >
>> > I'm including Neal P. Murphy, he said he would help testing these
>> > backports, getting a Tested-by: tag usually speeds up things too.
>>
>
> I've probably done about as much seat-of-the-pants testing as I can. All opening/closing the same destination IP/port.
>
> Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 2100MHz.
>
> Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent across PURPLE (to bypass NAT and firewall).
>
> Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 with these patches), 3GiB RAM and minimal swap.
>
> In the first set of tests, generator 1's traffic passed through Generator 2 as a NATting firewall, to the host's web server. In the second set of tests, generator 2's traffic went through NAT to the host's web server.
>
> The load tests:
>   - 2500 processes using 2500 addresses and random src ports
>   - 2500 processes using 2500 addresses and the same src port
>   - 2500 processes using the same src address and port
>
> I also tested using stock NF timeouts and using 1 second timeouts.
>
> Bandwidth used got as high as 16Mb/s for some tests.
>
> Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on the test and the timeouts.
>
> I did not reproduce the problem these patches solve. But more importantly, I saw no problems at all. Each time I terminated a test, RAM usage returned to about that of post-boot; so there were no apparent memory leaks. No kernel messages and no netfilter messages appeared during the tests.
>
> If I have time, I suppose I could run another set of tests: 2500 source processes using 2500 addresses times 200 ports to connect to 2500 addresses times 200 ports on a destination system. Each process opens 200 sockets, then closes them. And repeats ad infinitum. But I might have to be clever since I can't run 500 000 processes; but I could run 20 VMs; that would get it down to about 12 000 processes per VM. And I might have to figure out how to allow processes on the destination system to open hundreds or thousands of sockets.

Should I resend the patch with a Tested-by: tag?


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-30  0:01       ` Ani Sinha
@ 2015-10-30  1:21         ` Neal P. Murphy
  2015-10-30 15:37           ` Ani Sinha
  0 siblings, 1 reply; 14+ messages in thread
From: Neal P. Murphy @ 2015-10-30  1:21 UTC (permalink / raw)
  To: Ani Sinha, netfilter; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev

On Thu, 29 Oct 2015 17:01:24 -0700
Ani Sinha <ani@arista.com> wrote:

> On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
> <neal.p.murphy@alum.wpi.edu> wrote:
> > On Wed, 28 Oct 2015 02:36:50 -0400
> > "Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> wrote:
> >
> >> On Mon, 26 Oct 2015 21:06:33 +0100
> >> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >>
> >> > Hi,
> >> >
> >> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> >> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >
> >> > Please, no need to Cc everyone here. Please, submit your Netfilter
> >> > patches to netfilter-devel@vger.kernel.org.
> >> >
> >> > Moreover, it would be great if the subject includes something
> >> > descriptive on what you need, for this I'd suggest:
> >> >
> >> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >
> >> > I'm including Neal P. Murphy, he said he would help testing these
> >> > backports, getting a Tested-by: tag usually speeds up things too.
> >>
> >
> > I've probably done about as much seat-of-the-pants testing as I can. All opening/closing the same destination IP/port.
> >
> > Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 2100MHz.
> >
> > Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent across PURPLE (to bypass NAT and firewall).
> >
> > Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 with these patches), 3GiB RAM and minimal swap.
> >
> > In the first set of tests, generator 1's traffic passed through Generator 2 as a NATting firewall, to the host's web server. In the second set of tests, generator 2's traffic went through NAT to the host's web server.
> >
> > The load tests:
> >   - 2500 processes using 2500 addresses and random src ports
> >   - 2500 processes using 2500 addresses and the same src port
> >   - 2500 processes using the same src address and port
> >
> > I also tested using stock NF timeouts and using 1 second timeouts.
> >
> > Bandwidth used got as high as 16Mb/s for some tests.
> >
> > Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on the test and the timeouts.
> >
> > I did not reproduce the problem these patches solve. But more importantly, I saw no problems at all. Each time I terminated a test, RAM usage returned to about that of post-boot; so there were no apparent memory leaks. No kernel messages and no netfilter messages appeared during the tests.
> >
> > If I have time, I suppose I could run another set of tests: 2500 source processes using 2500 addresses times 200 ports to connect to 2500 addresses times 200 ports on a destination system. Each process opens 200 sockets, then closes them. And repeats ad infinitum. But I might have to be clever since I can't run 500 000 processes; but I could run 20 VMs; that would get it down to about 12 000 processes per VM. And I might have to figure out how to allow processes on the destination system to open hundreds or thousands of sockets.
> 
> Should I resend the patch with a Tested-by: tag?

... Oh, wait. Not yet. The dawn just broke over ol' Marblehead here. I only tested TCP; I need to hammer UDP, too.

Can I set the timeouts to zero? Or is one as low as I can go?

N


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-30  1:21         ` Neal P. Murphy
@ 2015-10-30 15:37           ` Ani Sinha
  0 siblings, 0 replies; 14+ messages in thread
From: Ani Sinha @ 2015-10-30 15:37 UTC (permalink / raw)
  To: Neal P. Murphy; +Cc: netfilter, Pablo Neira Ayuso, netfilter-devel, netdev

On Thu, Oct 29, 2015 at 6:21 PM, Neal P. Murphy
<neal.p.murphy@alum.wpi.edu> wrote:
> On Thu, 29 Oct 2015 17:01:24 -0700
> Ani Sinha <ani@arista.com> wrote:
>
>> On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
>> <neal.p.murphy@alum.wpi.edu> wrote:
>> > On Wed, 28 Oct 2015 02:36:50 -0400
>> > "Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> wrote:
>> >
>> >> On Mon, 26 Oct 2015 21:06:33 +0100
>> >> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
>> >> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >> >
>> >> > Please, no need to Cc everyone here. Please, submit your Netfilter
>> >> > patches to netfilter-devel@vger.kernel.org.
>> >> >
>> >> > Moreover, it would be great if the subject includes something
>> >> > descriptive on what you need, for this I'd suggest:
>> >> >
>> >> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >> >
>> >> > I'm including Neal P. Murphy, he said he would help testing these
>> >> > backports, getting a Tested-by: tag usually speeds up things too.
>> >>
>> >
>> > I've probably done about as much seat-of-the-pants testing as I can. All opening/closing the same destination IP/port.
>> >
>> > Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 2100MHz.
>> >
>> > Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent across PURPLE (to bypass NAT and firewall).
>> >
>> > Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 with these patches), 3GiB RAM and minimal swap.
>> >
>> > In the first set of tests, generator 1's traffic passed through Generator 2 as a NATting firewall, to the host's web server. In the second set of tests, generator 2's traffic went through NAT to the host's web server.
>> >
>> > The load tests:
>> >   - 2500 processes using 2500 addresses and random src ports
>> >   - 2500 processes using 2500 addresses and the same src port
>> >   - 2500 processes using the same src address and port
>> >
>> > I also tested using stock NF timeouts and using 1 second timeouts.
>> >
>> > Bandwidth used got as high as 16Mb/s for some tests.
>> >
>> > Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on the test and the timeouts.
>> >
>> > I did not reproduce the problem these patches solve. But more importantly, I saw no problems at all. Each time I terminated a test, RAM usage returned to about that of post-boot; so there were no apparent memory leaks. No kernel messages and no netfilter messages appeared during the tests.
>> >
>> > If I have time, I suppose I could run another set of tests: 2500 source processes using 2500 addresses times 200 ports to connect to 2500 addresses times 200 ports on a destination system. Each process opens 200 sockets, then closes them. And repeats ad infinitum. But I might have to be clever since I can't run 500 000 processes; but I could run 20 VMs; that would get it down to about 12 000 processes per VM. And I might have to figure out how to allow processes on the destination system to open hundreds or thousands of sockets.
>>
>> Should I resend the patch with a Tested-by: tag?
>
> ... Oh, wait. Not yet. The dawn just broke over ol' Marblehead here. I only tested TCP; I need to hammer UDP, too.
>
> Can I set the timeouts to zero? Or is one as low as I can go?

I don't see any assertion or check against 0-second timeouts. You can
try; your conntrack entries will be constantly flushed.

>
> N


* [PATCH  -stable 3.4,backport] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-26 20:06 ` Pablo Neira Ayuso
  2015-10-28  6:36   ` Neal P. Murphy
       [not found]   ` <17709_1446014232_t9S6b5a3017652_20151028023650.7b76098f@playground>
@ 2015-11-02 19:11   ` Ani Sinha
  2015-11-02 19:47   ` Ani Sinha
  2015-11-04 23:46   ` [PATCH 1/1] " Ani Sinha
  4 siblings, 0 replies; 14+ messages in thread
From: Ani Sinha @ 2015-11-02 19:11 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Ani Sinha, David S. Miller, netfilter-devel, netdev,
	Florian Westphal, Zefan Li, Neal P. Murphy

netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using it. Then this conntrack is released, and another
thread creates a conntrack at the same address with an equal tuple.
After this, a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created.
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case a BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirmed bit too. I think
that's right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug that is reproduced on a few nodes.
In our case, one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Signed-off-by: Ani Sinha <ani@arista.com>
Tested-by: Neal P. Murphy <neal.p.murphy@alum.wpi.edu>
---
 net/netfilter/nf_conntrack_core.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+			const struct nf_conntrack_tuple *tuple,
+			u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */
+	return nf_ct_tuple_equal(tuple, &h->tuple) &&
+		nf_ct_zone(ct) == zone &&
+		nf_ct_is_confirmed(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -330,8 +345,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
-		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
+		if (nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -378,8 +392,7 @@ begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
-				     nf_ct_zone(ct) != zone)) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
-- 
1.8.1.4


* [PATCH -stable 3.4,backport] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-26 20:06 ` Pablo Neira Ayuso
                     ` (2 preceding siblings ...)
  2015-11-02 19:11   ` [PATCH -stable 3.4,backport] " Ani Sinha
@ 2015-11-02 19:47   ` Ani Sinha
  2015-11-04 23:46   ` [PATCH 1/1] " Ani Sinha
  4 siblings, 0 replies; 14+ messages in thread
From: Ani Sinha @ 2015-11-02 19:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Ani Sinha, David S. Miller, netfilter-devel, netdev,
	Florian Westphal, Zefan Li, Neal P. Murphy

netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt

With this patch, the conntrack refcount is initially set to zero and
is bumped once the conntrack is added to any of the lists, so we
fulfill Eric's golden rule that all released objects always have a
refcount of zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with a non-zero ref-counter, because it can race with
nf_conntrack_find_get().

The conntrack slab is created with SLAB_DESTROY_BY_RCU. A non-zero
ref-counter says that the conntrack is in use. So when we release
a conntrack with a non-zero counter, we break this assumption.

CPU1                                    CPU2
____nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */
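
A simplified userspace model of Eric's golden rule follows -- an
illustrative sketch only; the names are made up, and C11 atomics stand
in for the kernel's atomic_t, smp_wmb() and SLAB_DESTROY_BY_RCU
machinery. The refcount stays zero until the object is fully
initialized, so a concurrent inc-not-zero can never pin a half-built
object, and only objects with a zero refcount are ever freed.

#include <stdatomic.h>
#include <stdlib.h>

struct entry {
	atomic_int use;		/* 0 until published; freed only at 0 */
	unsigned long key;
};

static struct entry *entry_alloc(unsigned long key)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->key = key;
	atomic_init(&e->use, 0);	/* invisible to inc-not-zero */
	return e;
}

static void entry_free(struct entry *e)
{
	/* the golden rule: released objects have a zero refcount */
	if (atomic_load(&e->use) != 0)
		abort();
	free(e);
}

static void entry_publish(struct entry *e)
{
	/* One reference for the table, one for the caller. The
	 * release store orders it after all initialization, like
	 * the kernel's smp_wmb() + atomic_set() pair. */
	atomic_store_explicit(&e->use, 2, memory_order_release);
	/* hash_insert_rcu(e) would follow here in the real code */
}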

After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get", another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>]  [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255]  [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255]  [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5

I have reused the original title of the RFC patch that Andrey posted,
and most of the original patch description.

Signed-off-by: Ani Sinha <ani@arista.com>
Tested-by: "Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> 
---
 net/netfilter/nf_conntrack_core.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a171b2..9a46908 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -441,7 +441,9 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 			goto out;
 
 	add_timer(&ct->timeout);
-	nf_conntrack_get(&ct->ct_general);
+	smp_wmb();
+	/* The caller holds a reference to this object */
+	atomic_set(&ct->ct_general.use, 2);
 	__nf_conntrack_hash_insert(ct, hash, repl_hash);
 	NF_CT_STAT_INC(net, insert);
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -732,11 +734,10 @@ __nf_conntrack_alloc(struct net *net, u16 zone,
 		nf_ct_zone->id = zone;
 	}
 #endif
-	/*
-	 * changes to lookup keys must be done before setting refcnt to 1
+	/* Because we use RCU lookups, we set ct_general.use to zero before
+	 * this is inserted in any list.
 	 */
-	smp_wmb();
-	atomic_set(&ct->ct_general.use, 1);
+	atomic_set(&ct->ct_general.use, 0);
 	return ct;
 
 #ifdef CONFIG_NF_CONNTRACK_ZONES
@@ -759,6 +760,10 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 void nf_conntrack_free(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
+	/* A freed object has refcnt == 0, that's
+	 * the golden rule for SLAB_DESTROY_BY_RCU
+	 */
+	NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
 
 	nf_ct_ext_destroy(ct);
 	atomic_dec(&net->ct.count);
@@ -846,6 +851,9 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 		NF_CT_STAT_INC(net, new);
 	}
 
+	/* Now it is inserted into the unconfirmed list, bump refcount */
+	nf_conntrack_get(&ct->ct_general);
+
 	/* Overload tuple linked list to put us in unconfirmed list. */
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 		       &net->ct.unconfirmed);
-- 
1.8.1.4




* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-26 20:06 ` Pablo Neira Ayuso
                     ` (3 preceding siblings ...)
  2015-11-02 19:47   ` Ani Sinha
@ 2015-11-04 23:46   ` Ani Sinha
  2015-11-06 11:09     ` Pablo Neira Ayuso
  4 siblings, 1 reply; 14+ messages in thread
From: Ani Sinha @ 2015-11-04 23:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: David S. Miller, netfilter-devel, netdev, Florian Westphal,
	Zefan Li, Neal P. Murphy

(removed a bunch of people from CC list)

On Mon, Oct 26, 2015 at 1:06 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> Then we can review and, if no major concerns, I can submit this to
> -stable.

Now that Neal has sufficiently tested the patches, is it OK to apply
to -stable or do you guys want me to do anything more?


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-11-04 23:46   ` [PATCH 1/1] " Ani Sinha
@ 2015-11-06 11:09     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 14+ messages in thread
From: Pablo Neira Ayuso @ 2015-11-06 11:09 UTC (permalink / raw)
  To: Ani Sinha
  Cc: David S. Miller, netfilter-devel, netdev, Florian Westphal,
	Zefan Li, Neal P. Murphy

On Wed, Nov 04, 2015 at 03:46:54PM -0800, Ani Sinha wrote:
> (removed a bunch of people from CC list)
> 
> On Mon, Oct 26, 2015 at 1:06 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > Then we can review and, if no major concerns, I can submit this to
> > -stable.
> 
> Now that Neal has sufficiently tested the patches, is it OK to apply
> to -stable or do you guys want me to do anything more?

I'll be passing this up to -stable ASAP.

Thanks.


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
@ 2015-11-02 20:48 Ani Sinha
  0 siblings, 0 replies; 14+ messages in thread
From: Ani Sinha @ 2015-11-02 20:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel, netfilter, netdev, Ani Sinha

> On Thu, Oct 29, 2015 at 6:21 PM, Neal P. Murphy
> <neal.p.murphy@alum.wpi.edu> wrote:
> > On Thu, 29 Oct 2015 17:01:24 -0700
> > Ani Sinha <ani@arista.com> wrote:
> >
> >> On Wed, Oct 28, 2015 at 11:40 PM, Neal P. Murphy
> >> <neal.p.murphy@alum.wpi.edu> wrote:
> >> > On Wed, 28 Oct 2015 02:36:50 -0400
> >> > "Neal P. Murphy" <neal.p.murphy@alum.wpi.edu> wrote:
> >> >
> >> >> On Mon, 26 Oct 2015 21:06:33 +0100
> >> >> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > On Mon, Oct 26, 2015 at 11:55:39AM -0700, Ani Sinha wrote:
> >> >> > > netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >> >
> >> >> > Please, no need to Cc everyone here. Please, submit your Netfilter
> >> >> > patches to netfilter-devel@vger.kernel.org.
> >> >> >
> >> >> > Moreover, it would be great if the subject includes something
> >> >> > descriptive on what you need, for this I'd suggest:
> >> >> >
> >> >> > [PATCH -stable 3.4,backport] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >> >
> >> >> > I'm including Neal P. Murphy, he said he would help testing these
> >> >> > backports, getting a Tested-by: tag usually speeds up things too.
> >> >>
> >> >
> >> > I've probably done about as much seat-of-the-pants testing as I can. All opening/closing the same destination IP/port.
> >> >
> >> > Host: Debian Jessie, 8-core Vishera 8350 at 4.4 GHz, 16GiB RAM at (I think) 2100MHz.
> >> >
> >> > Traffic generator 1: 6-CPU KVM running 64-bit Smoothwall Express 3.1 (linux 3.4.109 without these patches), with 8GiB RAM and 9GiB swap. Packets sent across PURPLE (to bypass NAT and firewall).
> >> >
> >> > Traffic generator 2: 32-bit KVM running Smoothwall Express 3.1 (linux 3.4.110 with these patches), 3GiB RAM and minimal swap.
> >> >
> >> > In the first set of tests, generator 1's traffic passed through Generator 2 as a NATting firewall, to the host's web server. In the second set of tests, generator 2's traffic went through NAT to the host's web server.
> >> >
> >> > The load tests:
> >> >   - 2500 processes using 2500 addresses and random src ports
> >> >   - 2500 processes using 2500 addresses and the same src port
> >> >   - 2500 processes using the same src address and port
> >> >
> >> > I also tested using stock NF timeouts and using 1 second timeouts.
> >> >
> >> > Bandwidth used got as high as 16Mb/s for some tests.
> >> >
> >> > Conntracks got up to 200 000 or so or bounced between 1 and 2, depending on the test and the timeouts.
> >> >
> >> > I did not reproduce the problem these patches solve. But more importantly, I saw no problems at all. Each time I terminated a test, RAM usage returned to about that of post-boot; so there were no apparent memory leaks. No kernel messages and no netfilter messages appeared during the tests.
> >> >
> >> > If I have time, I suppose I could run another set of tests: 2500 source processes using 2500 addresses times 200 ports to connect to 2500 addresses times 200 ports on a destination system. Each process opens 200 sockets, then closes them. And repeats ad infinitum. But I might have to be clever since I can't run 500 000 processes; but I could run 20 VMs; that would get it down to about 12 000 processes per VM. And I might have to figure out how to allow processes on the destination system to open hundreds or thousands of sockets.
> >>
> >> Should I resend the patch with a Tested-by: tag?
> >
> > ... Oh, wait. Not yet. The dawn just broke over ol' Marblehead here. I only tested TCP; I need to hammer UDP, too.
> >
> > Can I set the timeouts to zero? Or is one as low as I can go?
>
> Any progress with testing ?

I applied the 'hammer' through a firewall with the patch. I used TCP,
UDP and ICMP.

I don't know if the patch fixes the problem. But I'm reasonably sure
that it did not break normal operations.

To test a different problem I fixed (a memory leak in my 64-bit
counter patch for xt_ACCOUNT), I tested 60,000 addresses (most of a
/16) through the firewall. Again, no troubles.

I observed only two odd things, both likely completely unrelated to
your patch. First, when I started the TCP test and then added the UDP
test, only TCP would come through; if I stopped and restarted the TCP
test, only UDP would come through. I suspect this is due to buffering.
It's just a behaviour I haven't encountered since I started using
Linux many years ago (around '98). Second, when I started the test,
the firewall would lose contact with the upstream firewall's apcupsd
daemon; again, this is likely due to the nature of the test: it likely
floods the input and output queues.


I'd say you can probably resend with Tested-by.

Neal


* Re: [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
  2015-10-24 18:27 Ani Sinha
@ 2015-10-24 18:31 ` Ani Sinha
  0 siblings, 0 replies; 14+ messages in thread
From: Ani Sinha @ 2015-10-24 18:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, David S. Miller,
	netfilter-devel, coreteam, Ani Sinha, netdev, Eric Dumazet,
	Florian Westphal

Please refer to the thread "linux 3.4.43 : kernel crash at
__nf_conntrack_confirm" on netdev for context.

thanks

On Sat, Oct 24, 2015 at 11:27 AM, Ani Sinha <ani@arista.com> wrote:
> netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>
> Let's look at destroy_conntrack:
>
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
>         kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
>
> net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
>
> The hash is protected by rcu, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but at this moment a few readers
> can still be using it. Then this conntrack is released, and another
> thread creates a conntrack at the same address with an equal tuple.
> After this, a reader starts to validate the conntrack:
> * It's not dying, because a new conntrack was created.
> * nf_ct_tuple_equal() returns true.
>
> But this conntrack is not initialized yet, so it cannot be used by two
> threads concurrently. In this case a BUG_ON may be triggered from
> nf_nat_setup_info().
>
> Florian Westphal suggested checking the confirmed bit too. I think
> that's right.
>
> task 1                  task 2                  task 3
>                         nf_conntrack_find_get
>                          ____nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
>                                                 __nf_conntrack_alloc
>                                                  kmem_cache_alloc
>                                                  memset(&ct->tuplehash[IP_CT_DIR_MAX],
>                          if (nf_ct_is_dying(ct))
>                          if (!nf_ct_tuple_equal()
>
> I'm not sure that I have ever seen this race condition in real life.
> Currently we are investigating a bug that is reproduced on a few nodes.
> In our case, one conntrack is initialized from a few tasks concurrently;
> we don't have any other explanation for this.
>
> <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
> ...
> <4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
> ...
> <4>[46267.085549] Call Trace:
> <4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
> <4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
> <4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
> <4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
> <4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
> <4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
> <4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
> <4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
> <4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
> <4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
> <4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
> <4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
> <4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
> <4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
> <4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
> <4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
> <4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
> <4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
> <4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
> <4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
> <4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
> <4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
> <4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
> <4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
> <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
> <1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590
>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> Signed-off-by: Ani Sinha <ani@arista.com>
> ---
>  net/netfilter/nf_conntrack_core.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 9a46908..fd0f7a3 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
>         nf_ct_put(ct);
>  }
>
> +static inline bool
> +nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
> +                       const struct nf_conntrack_tuple *tuple,
> +                       u16 zone)
> +{
> +       struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
> +
> +       /* A conntrack can be recreated with the equal tuple,
> +        * so we need to check that the conntrack is confirmed
> +        */
> +       return nf_ct_tuple_equal(tuple, &h->tuple) &&
> +               nf_ct_zone(ct) == zone &&
> +               nf_ct_is_confirmed(ct);
> +}
> +
>  /*
>   * Warning :
>   * - Caller must take a reference on returned object
> @@ -330,8 +345,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
>         local_bh_disable();
>  begin:
>         hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
> -               if (nf_ct_tuple_equal(tuple, &h->tuple) &&
> -                   nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
> +               if (nf_ct_key_equal(h, tuple, zone)) {
>                         NF_CT_STAT_INC(net, found);
>                         local_bh_enable();
>                         return h;
> @@ -378,8 +392,7 @@ begin:
>                              !atomic_inc_not_zero(&ct->ct_general.use)))
>                         h = NULL;
>                 else {
> -                       if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> -                                    nf_ct_zone(ct) != zone)) {
> +                       if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
>                                 nf_ct_put(ct);
>                                 goto begin;
>                         }
> --
> 1.8.1.4
>


* [PATCH 1/1] commit c6825c0976fa7893692e0e43b09740b419b23c09 upstream.
@ 2015-10-24 18:27 Ani Sinha
  2015-10-24 18:31 ` Ani Sinha
  0 siblings, 1 reply; 14+ messages in thread
From: Ani Sinha @ 2015-10-24 18:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, David S. Miller,
	netfilter-devel, coreteam, ani, netdev, Eric Dumazet,
	Florian Westphal, Pablo Neira Ayuso

netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using it. Then this conntrack is released, and another
thread creates a conntrack at the same address with an equal tuple.
After this, a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created.
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it cannot be used by two
threads concurrently. In this case a BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested checking the confirmed bit too. I think
that's right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug that is reproduced on a few nodes.
In our case, one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ani Sinha <ani@arista.com>
---
 net/netfilter/nf_conntrack_core.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9a46908..fd0f7a3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -309,6 +309,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_put(ct);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+			const struct nf_conntrack_tuple *tuple,
+			u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */
+	return nf_ct_tuple_equal(tuple, &h->tuple) &&
+		nf_ct_zone(ct) == zone &&
+		nf_ct_is_confirmed(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -330,8 +345,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
-		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
+		if (nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -378,8 +392,7 @@ begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
-				     nf_ct_zone(ct) != zone)) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
-- 
1.8.1.4

