linux-kernel.vger.kernel.org archive mirror
* [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
@ 2014-01-06 15:54 Andrey Vagin
  2014-01-06 17:02 ` Florian Westphal
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Vagin @ 2014-01-06 15:54 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netfilter, coreteam, netdev, linux-kernel, vvs, Andrey Vagin,
	Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, Cyrill Gorcunov

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

The hash is protected by RCU, so readers look up conntracks without
taking locks.
A conntrack is removed from the hash, but at that moment some readers
may still be using it, so if we call kmem_cache_free() now, those
readers will read a released object.

Below is a trickier race condition involving three tasks.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))

In this case task 2 will not notice that it is using the wrong
conntrack.

I'm not sure that I have ever seen this race condition in real life.
We are currently investigating a bug that reproduces on a few nodes.
In our case one conntrack appears to be initialized by several tasks
concurrently; we don't have any other explanation for it.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/net/netfilter/nf_conntrack.h |  2 ++
 net/netfilter/nf_conntrack_core.c    | 11 +++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 01ea6ee..492e857 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -76,6 +76,8 @@ struct nf_conn {
            plus 1 for any connection(s) we are `master' for */
 	struct nf_conntrack ct_general;
 
+	struct rcu_head rcu;
+
 	spinlock_t lock;
 
 	/* XXX should I move this to the tail ? - Y.K */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 43549eb..40e0d61 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -198,6 +198,14 @@ clean_from_lists(struct nf_conn *ct)
 	nf_ct_remove_expectations(ct);
 }
 
+static void nf_conntrack_free_rcu(struct rcu_head *head)
+{
+	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
+
+	pr_debug("destroy_conntrack: returning ct=%p to slab\n", ct);
+	nf_conntrack_free(ct);
+}
+
 static void
 destroy_conntrack(struct nf_conntrack *nfct)
 {
@@ -236,8 +244,7 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	if (ct->master)
 		nf_ct_put(ct->master);
 
-	pr_debug("destroy_conntrack: returning ct=%p to slab\n", ct);
-	nf_conntrack_free(ct);
+	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 
 static void nf_ct_delete_from_lists(struct nf_conn *ct)
-- 
1.8.4.2



* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 15:54 [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback Andrey Vagin
@ 2014-01-06 17:02 ` Florian Westphal
  2014-01-06 17:21   ` Cyrill Gorcunov
  2014-01-06 20:54   ` Andrew Vagin
  0 siblings, 2 replies; 9+ messages in thread
From: Florian Westphal @ 2014-01-06 17:02 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: netfilter-devel, netfilter, coreteam, netdev, linux-kernel, vvs,
	Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, Cyrill Gorcunov

Andrey Vagin <avagin@openvz.org> wrote:
> Lets look at destroy_conntrack:
> 
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
> 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> 
> The hash is protected by rcu, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but in this moment a few readers
> still can use the conntrack, so if we call kmem_cache_free now, all
> readers will read released object.
> 
> Bellow you can find more tricky race condition of three tasks.
> 
> task 1			task 2			task 3
> 			nf_conntrack_find_get
> 			 ____nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
> 						__nf_conntrack_alloc
> 						 kmem_cache_alloc
> 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> 			 if (nf_ct_is_dying(ct))
> 
> In this case the task 2 will not understand, that it uses a wrong
> conntrack.

Can you elaborate?
Yes, nf_ct_is_dying(ct) might be called for the wrong conntrack.

But in case we _think_ that it's the right one, we call
nf_ct_tuple_equal() to verify we indeed found the right one:

       h = ____nf_conntrack_find(net, zone, tuple, hash);
       if (h) { // might be released right now, but page won't go away (SLAB_DESTROY_BY_RCU)
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (unlikely(nf_ct_is_dying(ct) ||
                             !atomic_inc_not_zero(&ct->ct_general.use)))
			// which means we should hit this path (0 ref).
                        h = NULL;
                else {
			// otherwise, it cannot go away from under us, since
			// we own a reference now.
                        if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
                                     nf_ct_zone(ct) != zone)) {
			// if we get here, the entry got recycled on other cpu
			// for a different tuple, we can bail out and drop
			// the reference safely and re-try the lookup
                                nf_ct_put(ct);
                                goto begin;
                        }
                }
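
For readers following along, the validation order described in the comments above can be sketched in user-space C11. This is only an illustration under stated assumptions, not the kernel code: `struct entry`, `get_unless_zero()`, `put_ref()` and `validate_candidate()` are hypothetical names, the tuple and zone are reduced to plain integers, and the dying-bit test is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nf_conn: a refcount plus the lookup key. */
struct entry {
    atomic_int use;   /* 0 means the object is free or being freed */
    int tuple;        /* stand-in for the conntrack tuple */
    int zone;
};

/* Take a reference only if the count is non-zero (the role that
 * atomic_inc_not_zero() plays in the snippet above). */
static bool get_unless_zero(atomic_int *use)
{
    int old = atomic_load(use);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(use, &old, old + 1))
            return true;
    }
    return false;
}

static void put_ref(struct entry *e)
{
    atomic_fetch_sub(&e->use, 1);
}

enum lookup_result { LOOKUP_MISS, LOOKUP_RETRY, LOOKUP_HIT };

/* Validate a candidate found without locks: pin it first, then check
 * that it still describes the flow we were looking for. */
static enum lookup_result validate_candidate(struct entry *e, int tuple, int zone)
{
    if (!get_unless_zero(&e->use))
        return LOOKUP_MISS;        /* zero references: treat as not found */

    if (e->tuple != tuple || e->zone != zone) {
        put_ref(e);                /* recycled for another flow */
        return LOOKUP_RETRY;       /* caller restarts the lookup */
    }
    return LOOKUP_HIT;             /* we now own a reference */
}
```

The point of the ordering is that the key is compared only after a reference is held, so the object cannot be recycled between the comparison and its use.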


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 17:02 ` Florian Westphal
@ 2014-01-06 17:21   ` Cyrill Gorcunov
  2014-01-06 18:09     ` Cyrill Gorcunov
  2014-01-06 21:23     ` Florian Westphal
  2014-01-06 20:54   ` Andrew Vagin
  1 sibling, 2 replies; 9+ messages in thread
From: Cyrill Gorcunov @ 2014-01-06 17:21 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Vagin, netfilter-devel, netfilter, coreteam, netdev,
	linux-kernel, vvs, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, David S. Miller

On Mon, Jan 06, 2014 at 06:02:35PM +0100, Florian Westphal wrote:
> 
> Can you elaborate?
> Yes, nf_ct_is_dying(ct) might be called for the wrong conntrack.
> 
> But, in case we _think_ that its the right one we call
> nf_ct_tuple_equal() to verify we indeed found the right one:
> 
>        h = ____nf_conntrack_find(net, zone, tuple, hash);
>        if (h) { // might be released right now, but page won't go away (SLAB_BY_RCU)
>                 ct = nf_ct_tuplehash_to_ctrack(h);
>                 if (unlikely(nf_ct_is_dying(ct) ||
>                              !atomic_inc_not_zero(&ct->ct_general.use)))
> 			// which means we should hit this path (0 ref).
>                         h = NULL;
>                 else {
> 			// otherwise, it cannot go away from under us, since
> 			// we own a reference now.
>                         if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
>                                      nf_ct_zone(ct) != zone)) {
> 			// if we get here, the entry got recycled on other cpu
> 			// for a different tuple, we can bail out and drop
> 			// the reference safely and re-try the lookup
>                                 nf_ct_put(ct);
>                                 goto begin;
>                         }
>                 }

I think the tuple may match if

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                         if (nf_ct_is_dying(ct))

						data is not yet cleaned

                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],

No? Or is there something obvious I'm missing?


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 17:21   ` Cyrill Gorcunov
@ 2014-01-06 18:09     ` Cyrill Gorcunov
  2014-01-06 21:23     ` Florian Westphal
  1 sibling, 0 replies; 9+ messages in thread
From: Cyrill Gorcunov @ 2014-01-06 18:09 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Vagin, netfilter-devel, netfilter, coreteam, netdev,
	linux-kernel, vvs, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, David S. Miller

On Mon, Jan 06, 2014 at 09:21:30PM +0400, Cyrill Gorcunov wrote:
> 
> No? Or there something obvious I'm missing?

Drop my assumption, it can't happen (in other words, either the dying bit
is set, or it is clean, but then the tuple can't match).


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 17:02 ` Florian Westphal
  2014-01-06 17:21   ` Cyrill Gorcunov
@ 2014-01-06 20:54   ` Andrew Vagin
  2014-01-06 21:53     ` Florian Westphal
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Vagin @ 2014-01-06 20:54 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Vagin, netfilter-devel, netfilter, coreteam, netdev,
	linux-kernel, vvs, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, David S. Miller, Cyrill Gorcunov

On Mon, Jan 06, 2014 at 06:02:35PM +0100, Florian Westphal wrote:
> Andrey Vagin <avagin@openvz.org> wrote:
> > Lets look at destroy_conntrack:
> > 
> > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > ...
> > nf_conntrack_free(ct)
> > 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> > 
> > The hash is protected by rcu, so readers look up conntracks without
> > locks.
> > A conntrack is removed from the hash, but in this moment a few readers
> > still can use the conntrack, so if we call kmem_cache_free now, all
> > readers will read released object.
> > 
> > Bellow you can find more tricky race condition of three tasks.
> > 
> > task 1			task 2			task 3
> > 			nf_conntrack_find_get
> > 			 ____nf_conntrack_find
> > destroy_conntrack
> >  hlist_nulls_del_rcu
> >  nf_conntrack_free
> >  kmem_cache_free
> > 						__nf_conntrack_alloc
> > 						 kmem_cache_alloc
> > 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> > 			 if (nf_ct_is_dying(ct))
> > 
> > In this case the task 2 will not understand, that it uses a wrong
> > conntrack.
> 
> Can you elaborate?
> Yes, nf_ct_is_dying(ct) might be called for the wrong conntrack.
> 
> But, in case we _think_ that its the right one we call
> nf_ct_tuple_equal() to verify we indeed found the right one:

Ok. task3 creates a new conntrack and nf_ct_tuple_equal() returns true for
it. That looks possible. In this case we have two threads sharing one
uninitialized conntrack. That is really bad, because the code assumes that
a conntrack cannot be initialized by two threads concurrently. For
example, a BUG can be triggered from nf_nat_setup_info():

BUG_ON(nf_nat_initialized(ct, maniptype));


> 
>        h = ____nf_conntrack_find(net, zone, tuple, hash);
>        if (h) { // might be released right now, but page won't go away (SLAB_BY_RCU)

I did not take SLAB_DESTROY_BY_RCU into account. Thank you. But it doesn't
mean that we don't have a race condition here. It explains why we don't
hit the really bad situation where a completely wrong conntrack is
used. We always use the "right" conntrack, but sometimes it is
uninitialized, and that is the problem.

The race window is tiny, because we usually check that the conntrack is not
initialized and only then initialize it. We don't hold
any locks at that point.

Task2					| Task3
if (!nf_nat_initialized(ct))		|
					| if (!nf_nat_initialized(ct))
 alloc_null_binding			|
					|  alloc_null_binding
  nf_nat_setup_info			|
   ct->status |= IPS_SRC_NAT_DONE	|
					|   nf_nat_setup_info
					|    BUG_ON(nf_nat_initialized(ct));
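
The interleaving above is a check-then-act race. A minimal user-space sketch in C11 of why two callers can both pass the check follows. It is not kernel code and not a proposed fix: `SKETCH_NAT_DONE` and all three helpers are hypothetical, and `claim_init()` is shown only to contrast a separate test and set with a single atomic test-and-set.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NAT_DONE 0x1u  /* hypothetical stand-in for IPS_SRC_NAT_DONE */

/* Racy shape: the test and the set are two separate steps, so two
 * callers can both see the bit clear before either one sets it. */
static bool racy_needs_init(atomic_uint *status)
{
    return (atomic_load(status) & SKETCH_NAT_DONE) == 0;
}

static void racy_mark_done(atomic_uint *status)
{
    atomic_fetch_or(status, SKETCH_NAT_DONE);
}

/* Atomic shape, for contrast: the test and the set are one operation,
 * so exactly one caller observes the bit clear. */
static bool claim_init(atomic_uint *status)
{
    unsigned int old = atomic_fetch_or(status, SKETCH_NAT_DONE);

    return (old & SKETCH_NAT_DONE) == 0;
}
```

With the racy shape, Task2 and Task3 can both get true from `racy_needs_init()`; once Task2 has called `racy_mark_done()`, an assertion in Task3 that the bit is still clear fails, which is the shape of the BUG_ON() in the diagram.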

>                 ct = nf_ct_tuplehash_to_ctrack(h);
>                 if (unlikely(nf_ct_is_dying(ct) ||
>                              !atomic_inc_not_zero(&ct->ct_general.use)))
> 			// which means we should hit this path (0 ref).
>                         h = NULL;
>                 else {
> 			// otherwise, it cannot go away from under us, since
> 			// we own a reference now.
>                         if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
>                                      nf_ct_zone(ct) != zone)) {
> 			// if we get here, the entry got recycled on other cpu
> 			// for a different tuple, we can bail out and drop
> 			// the reference safely and re-try the lookup
>                                 nf_ct_put(ct);
>                                 goto begin;
>                         }
>                 }

Thanks,
Andrey


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 17:21   ` Cyrill Gorcunov
  2014-01-06 18:09     ` Cyrill Gorcunov
@ 2014-01-06 21:23     ` Florian Westphal
  2014-01-06 21:44       ` Cyrill Gorcunov
  1 sibling, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2014-01-06 21:23 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Florian Westphal, Andrey Vagin, netfilter-devel, netfilter,
	coreteam, netdev, linux-kernel, vvs, Pablo Neira Ayuso,
	Patrick McHardy, Jozsef Kadlecsik, David S. Miller

Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Mon, Jan 06, 2014 at 06:02:35PM +0100, Florian Westphal wrote:
> > 
> > Can you elaborate?
> > Yes, nf_ct_is_dying(ct) might be called for the wrong conntrack.
> > 
> > But, in case we _think_ that its the right one we call
> > nf_ct_tuple_equal() to verify we indeed found the right one:
> > 
> >        h = ____nf_conntrack_find(net, zone, tuple, hash);
> >        if (h) { // might be released right now, but page won't go away (SLAB_BY_RCU)
> >                 ct = nf_ct_tuplehash_to_ctrack(h);
> >                 if (unlikely(nf_ct_is_dying(ct) ||
> >                              !atomic_inc_not_zero(&ct->ct_general.use)))
> > 			// which means we should hit this path (0 ref).
> >                         h = NULL;
> >                 else {
> > 			// otherwise, it cannot go away from under us, since
> > 			// we own a reference now.
> >                         if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> >                                      nf_ct_zone(ct) != zone)) {
> > 			// if we get here, the entry got recycled on other cpu
> > 			// for a different tuple, we can bail out and drop
> > 			// the reference safely and re-try the lookup
> >                                 nf_ct_put(ct);
> >                                 goto begin;
> >                         }
> >                 }
> 
> I think tuple may match if
> 
> task 1                  task 2                  task 3
>                         nf_conntrack_find_get
>                          ____nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
>                                                 __nf_conntrack_alloc
>                                                  kmem_cache_alloc
>                          if (nf_ct_is_dying(ct))
> 
> 						data is not yet cleaned
> 
>                                                  memset(&ct->tuplehash[IP_CT_DIR_MAX],
> 
> No? Or there something obvious I'm missing?

IMHO this isn't obvious at all :-)

But, in the example above, the atomic_inc_not_zero() should fail
until after __nf_conntrack_alloc() re-inits the refcount to 1.

The mb there should make sure ____nf_conntrack_find() doesn't
find an outdated tuple before this.
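
The ordering argument can be sketched in user-space C11 as follows. This is an illustration under assumptions, not the kernel's allocation path: `struct entry`, `get_unless_zero()` and `recycle()` are hypothetical, and C11 release/acquire ordering stands in for the kernel's memory barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a recycled conntrack object. */
struct entry {
    atomic_int use;   /* stays 0 while the object is free */
    int tuple;        /* stand-in for the conntrack tuple */
};

/* Reader side: fails for as long as the refcount is 0. */
static bool get_unless_zero(atomic_int *use)
{
    int old = atomic_load_explicit(use, memory_order_relaxed);

    while (old != 0) {
        if (atomic_compare_exchange_weak_explicit(use, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}

/* Allocation side: rewrite the fields while the refcount is still 0,
 * then publish the object with a release store of 1. A reader whose
 * get_unless_zero() succeeds is therefore ordered after the tuple write. */
static void recycle(struct entry *e, int new_tuple)
{
    /* Assumed precondition: e->use is 0, so no reader can pin e here. */
    e->tuple = new_tuple;
    atomic_store_explicit(&e->use, 1, memory_order_release);
}
```

In this sketch a reader that still holds a stale pointer cannot take a reference during the rewrite, and once it can, it sees the new tuple rather than the old one.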


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 21:23     ` Florian Westphal
@ 2014-01-06 21:44       ` Cyrill Gorcunov
  0 siblings, 0 replies; 9+ messages in thread
From: Cyrill Gorcunov @ 2014-01-06 21:44 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Andrey Vagin, netfilter-devel, netfilter, coreteam, netdev,
	linux-kernel, vvs, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, David S. Miller

On Mon, Jan 06, 2014 at 10:23:26PM +0100, Florian Westphal wrote:
> > 
> > No? Or there something obvious I'm missing?
> 
> IMHO this isn't obvious at all :-)
> 
> But, in the example above, the atomic_inc_not_zero() should fail
> until after __nf_conntrack_alloc() re-inits the refcount to 1.
> 
> The mb there should make sure ____nf_conntrack_find() doesn't
> find an outdated tuple before this.

Yeah, thanks!


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 20:54   ` Andrew Vagin
@ 2014-01-06 21:53     ` Florian Westphal
  2014-01-07 10:39       ` Andrey Wagin
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2014-01-06 21:53 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Florian Westphal, Andrey Vagin, netfilter-devel, netfilter,
	coreteam, netdev, linux-kernel, vvs, Pablo Neira Ayuso,
	Patrick McHardy, Jozsef Kadlecsik, David S. Miller,
	Cyrill Gorcunov

Andrew Vagin <avagin@gmail.com> wrote:
> On Mon, Jan 06, 2014 at 06:02:35PM +0100, Florian Westphal wrote:
> > Andrey Vagin <avagin@openvz.org> wrote:
> > > Lets look at destroy_conntrack:
> > > 
> > > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > > ...
> > > nf_conntrack_free(ct)
> > > 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> > > 
> > > The hash is protected by rcu, so readers look up conntracks without
> > > locks.
> > > A conntrack is removed from the hash, but in this moment a few readers
> > > still can use the conntrack, so if we call kmem_cache_free now, all
> > > readers will read released object.
> > > 
> > > Bellow you can find more tricky race condition of three tasks.
> > > 
> > > task 1			task 2			task 3
> > > 			nf_conntrack_find_get
> > > 			 ____nf_conntrack_find
> > > destroy_conntrack
> > >  hlist_nulls_del_rcu
> > >  nf_conntrack_free
> > >  kmem_cache_free
> > > 						__nf_conntrack_alloc
> > > 						 kmem_cache_alloc
> > > 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> > > 			 if (nf_ct_is_dying(ct))
> > > 
> > > In this case the task 2 will not understand, that it uses a wrong
> > > conntrack.
> > 
> > Can you elaborate?
> > Yes, nf_ct_is_dying(ct) might be called for the wrong conntrack.
> > 
> > But, in case we _think_ that its the right one we call
> > nf_ct_tuple_equal() to verify we indeed found the right one:
> 
> Ok. task3 creates a new contrack and nf_ct_tuple_equal() returns true on
> it. Looks like it's possible.

IFF we're recycling the exact same tuple (i.e., flow was destroyed/terminated
AND has been re-created in identical fashion on another cpu)
AND it is not yet confirmed (i.e. it's no longer in the hash table but on the
unconfirmed list) then, yes, I think you're right.

> unitialized contrack. It's really bad, because the code supposes that
> conntrack can not be initialized in two threads concurrently. For
> example BUG can be triggered from nf_nat_setup_info():
> 
> BUG_ON(nf_nat_initialized(ct, maniptype));

Right, since a new conntrack entry is not supposed to be in the hash
table.

> >                 ct = nf_ct_tuplehash_to_ctrack(h);
> >                 if (unlikely(nf_ct_is_dying(ct) ||
> >                              !atomic_inc_not_zero(&ct->ct_general.use)))
> > 			// which means we should hit this path (0 ref).
> >                         h = NULL;
> >                 else {
> > 			// otherwise, it cannot go away from under us, since
> > 			// we own a reference now.
> >                         if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> >                                      nf_ct_zone(ct) != zone)) {

Perhaps this needs additional !nf_ct_is_confirmed()?

It would cover your case (found a recycled element with the same tuple that
has been put on the unconfirmed list on another cpu: refcnt already set to 1,
ct->tuple set, extensions possibly not yet fully initialised).
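
A rough user-space sketch of what the extra test would add to the retry condition, assuming the caller already holds a reference. `struct entry`, `SKETCH_CONFIRMED` and `must_retry()` are hypothetical names used for illustration only, not the eventual patch.

```c
#include <stdbool.h>

#define SKETCH_CONFIRMED 0x1u  /* hypothetical stand-in for the confirmed bit */

/* Hypothetical stand-in for the fields the lookup re-checks. */
struct entry {
    unsigned int status;
    int tuple;
    int zone;
};

/* True when the lookup must drop its reference and start over: the
 * entry was recycled for another flow, or it matches but has not been
 * confirmed yet (it is still being set up on another cpu). */
static bool must_retry(const struct entry *e, int tuple, int zone)
{
    return e->tuple != tuple ||
           e->zone != zone ||
           !(e->status & SKETCH_CONFIRMED);
}
```

The third condition is the only new part: a matching tuple alone no longer counts as a hit while the entry is unconfirmed.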

Regards,
Florian


* Re: [PATCH] netfilter: nf_conntrack: release conntrack from rcu callback
  2014-01-06 21:53     ` Florian Westphal
@ 2014-01-07 10:39       ` Andrey Wagin
  0 siblings, 0 replies; 9+ messages in thread
From: Andrey Wagin @ 2014-01-07 10:39 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, netfilter, coreteam, netdev, LKML, vvs,
	Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, Cyrill Gorcunov

Hi Florian,

2014/1/7 Florian Westphal <fw@strlen.de>:
> Andrew Vagin <avagin@gmail.com> wrote:

>> >                 ct = nf_ct_tuplehash_to_ctrack(h);
>> >                 if (unlikely(nf_ct_is_dying(ct) ||
>> >                              !atomic_inc_not_zero(&ct->ct_general.use)))
>> >                     // which means we should hit this path (0 ref).
>> >                         h = NULL;
>> >                 else {
>> >                     // otherwise, it cannot go away from under us, since
>> >                     // we own a reference now.
>> >                         if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
>> >                                      nf_ct_zone(ct) != zone)) {
>
> Perhaps this needs additional !nf_ct_is_confirmed()?

Yes, I think it should help. Thank you for the comments. I have resent this patch as:
[PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

>
> It would cover your case (found a recycled element that has been put on
> the unconfirmed list (refcnt already set to 1, ct->tuple is set) on another cpu,
> extensions possibly not yet fully initialised), and the same tuple).
>
> Regards,
> Florian

Thanks,
Andrey

