netdev.vger.kernel.org archive mirror
* linux 3.4.43 : kernel crash at __nf_conntrack_confirm
@ 2015-10-07 19:57 Ani Sinha
  2015-10-18  2:34 ` Ani Sinha
  0 siblings, 1 reply; 11+ messages in thread
From: Ani Sinha @ 2015-10-07 19:57 UTC (permalink / raw)
  To: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

Hi guys :

We encountered a kernel crash in the conntrack code on one of our boxes
running the 3.4.43 kernel. We are using dnsmasq as a proxy to relay
our DNS requests to the real DNS server. We verified that the
conntrack tables were not full. Running conntrack -L around the time
of the crash showed more than 2100 entries for dnsmasq.

Looking upstream, I see a couple of patches which fix race conditions
around the use of the conntrack hash table with RCU (lock-free read)
primitives:

commit c6825c0976fa7893692e0e43b09740b419b23c09
Author: Andrey Vagin <avagin@openvz.org>
Date:   Wed Jan 29 19:34:14 2014 +0100
     netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

and a followup patch :

commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Mon Feb 3 20:01:53 2014 +0100
        netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt


We are trying to reproduce the crash again but it is very rare.
Meanwhile, I have two questions:

- Do you guys think the race condition described in the above two
patches has anything to do with the crash I mention below?
- If the answer to the above is no, have you had any other
reports of a similar crash, or any idea what could be going on?

We are still investigating and I will update this thread if I can get
additional info.

Thanks
Ani

<1>[10618591.817967] BUG: unable to handle kernel NULL pointer
dereference at           (null)
<1>[10618591.914483] IP: [<ffffffffa007b3f7>]
__nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
<4>[10618592.012027] PGD 5aa67067 PUD 5b4f4067 PMD 0
<4>[10618592.012035] Oops: 0002 [#1] PREEMPT SMP
<4>[10618592.012041] CPU 1
<4>[10618592.012043] Modules linked in: xt_comment sch_prio fpdma(PO)
msr nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6table_mangle
nf_conntrack_ipv4 nf_defrag_ipv4 xt_LOG xt_limit xt_hl xt_state
ipt_REJECT xt_multiport xt_tcpudp iptable_mangle kbfd(O) 8021q garp
stp llc tun nf_conntrack_tftp iptable_raw iptable_filter ip_tables
xt_NOTRACK nf_conntrack xt_mark ip6table_raw ip6table_filter
ip6_tables x_tables k10temp hwmon amd64_edac_mod scd(O) microcode
kvm_amd kvm
<4>[10618592.012092]
<4>[10618592.012096] Pid: 5586, comm: dnsmasq Tainted: P           O 3.4.43 #1
<4>[10618592.012102] RIP: 0010:[<ffffffffa007b3f7>]
[<ffffffffa007b3f7>] __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
<4>[10618592.012112] RSP: 0018:ffff88005aa1fb98  EFLAGS: 00010202
<4>[10618592.012116] RAX: 0000000000002769 RBX: ffff880063d58658 RCX:
000000001cc74948
<4>[10618592.012120] RDX: 0000000000000000 RSI: ffff88010cd80000 RDI:
0000000000004000
<4>[10618592.012123] RBP: ffff88005aa1fbc8 R08: 00000000872541be R09:
000000007aa31682
<4>[10618592.012127] R10: ffff880063d586d8 R11: ffff88005aa1fb68 R12:
ffffffff81648180
<4>[10618592.012130] R13: 00000000000017ef R14: 000000000000bf78 R15:
0000000000009da0
<4>[10618592.012135] FS:  0000000000000000(0000)
GS:ffff88013fb00000(0063) knlGS:00000000f74126d0
<4>[10618592.012139] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[10618592.012142] CR2: 0000000000000000 CR3: 000000005b412000 CR4:
00000000000007e0
<4>[10618592.012146] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
<4>[10618592.012149] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
<4>[10618592.012154] Process dnsmasq (pid: 5586, threadinfo
ffff88005aa1e000, task ffff8800727d6050)
<4>[10618592.012156] Stack:
<4>[10618592.012159]  0000000000000000 ffff8800889050c0
ffff8800889050c0 ffff880063d58658
<4>[10618592.012166]  0000000000000004 0000000000000002
ffff88005aa1fc38 ffffffffa00e3c54
<4>[10618592.012172]  0000000000000004 0000000000000000
ffff88005aa1fc38 ffffffffa0078168
<4>[10618592.012179] Call Trace:
<4>[10618592.012186] [<ffffffffa00e3c54>] ipv4_confirm+0x17e/0x1a5
[nf_conntrack_ipv4]
<4>[10618592.012192] [<ffffffffa0078168>] ?
iptable_mangle_hook+0xfa/0x116 [iptable_mangle]
<4>[10618592.012199] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
<4>[10618592.012205] [<ffffffff8131900f>] nf_iterate+0x43/0x78
<4>[10618592.012210] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
<4>[10618592.012214] [<ffffffff813191a1>] nf_hook_slow+0x6e/0x106
<4>[10618592.012219] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
<4>[10618592.012224] [<ffffffff813222e8>] ? dst_output+0x0/0x11
<4>[10618592.012229] [<ffffffff81324ef0>] ip_output+0x83/0x97
<4>[10618592.012234] [<ffffffff813240a3>] ? __ip_local_out+0x9c/0x9e
<4>[10618592.012239] [<ffffffff813240c9>] ip_local_out+0x24/0x28
<4>[10618592.012244] [<ffffffff8132462f>] ip_queue_xmit+0x2e4/0x322
<4>[10618592.012249] [<ffffffff81336f97>] tcp_transmit_skb+0x766/0x7a7
<4>[10618592.012254] [<ffffffff81337345>] tcp_send_active_reset+0xd8/0x104
<4>[10618592.012258] [<ffffffff8132b8c6>] tcp_close+0x101/0x335
<4>[10618592.012264] [<ffffffff8134b8f2>] inet_release+0x7b/0x82
<4>[10618592.012269] [<ffffffff812ea36e>] sock_release+0x1a/0x72
<4>[10618592.012273] [<ffffffff812ea3e8>] sock_close+0x22/0x26
<4>[10618592.012278] [<ffffffff810aad2d>] fput+0x117/0x1f8
<4>[10618592.012283] [<ffffffff810a7ce2>] filp_close+0x6d/0x78
<4>[10618592.012288] [<ffffffff810a7d7b>] sys_close+0x8e/0xc8
<4>[10618592.012293] [<ffffffff813dcacb>] cstar_dispatch+0x7/0x1e
<4>[10618592.012296] Code: 31 d2 0f b6 d2 85 d2 0f 85 61 01 00 00 48
8b 00 a8 01 75 0d 8b 53 68 3b 50 10 75 94 e9 6a ff ff ff 48 8b 43 20
48 8b 53 28 a8 01 <48> 89 02 75 04 48 89 50 08 49 bd 00 02 20 00 00
00 ad de 48 8d
<1>[10618592.012355] RIP  [<ffffffffa007b3f7>]
__nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
<4>[10618592.110942]  RSP <ffff88005aa1fb98>
<4>[10618592.110944] CR2: 0000000000000000


The crash happened here in this code :

static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
{
        struct hlist_nulls_node *next = n->next;
        struct hlist_nulls_node **pprev = n->pprev;
        *pprev = next;
    1ac1:       48 89 02                mov    %rax,(%rdx)  <==== CRASH
        if (!is_a_nulls(next))
    1ac4:       75 04                   jne    1aca <nf_ct_delete_from_lists+0x62>
                next->pprev = pprev;
    1ac6:       48 89 50 08             mov    %rdx,0x8(%rax)
 * hlist_nulls_for_each_entry().
 */

The faulting instruction is *pprev = next, and the pprev pointer (RDX) is NULL.
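
To make the failure mode concrete, here is a small user-space model of the unlink step shown above. It is only a sketch: struct nulls_node, make_nulls(), add_head(), node_unhashed() and checked_del() are illustrative names, not the kernel's hlist_nulls API, and resetting pprev to NULL after the unlink is a choice of this model. The point is that the first store in the unlink goes through pprev, so a node whose pprev is NULL faults exactly as in the oops unless the caller checks first.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space model of a nulls-list node (illustrative, not the kernel header). */
struct nulls_node {
    struct nulls_node *next;    /* real pointer, or an odd-valued end marker */
    struct nulls_node **pprev;  /* address of the pointer that points at us */
};

/* An end-of-chain "nulls" marker: bit 0 set, a small value in the upper bits. */
struct nulls_node *make_nulls(unsigned long value)
{
    return (struct nulls_node *)(uintptr_t)((value << 1) | 1UL);
}

int is_a_nulls(const struct nulls_node *p)
{
    return (int)((uintptr_t)p & 1UL);
}

/* In this model a node with pprev == NULL is on no list. */
int node_unhashed(const struct nulls_node *n)
{
    return n->pprev == NULL;
}

/* Insert n at the head of the chain rooted at *head. */
void add_head(struct nulls_node *n, struct nulls_node **head)
{
    struct nulls_node *first = *head;

    n->next = first;
    n->pprev = head;
    *head = n;
    if (!is_a_nulls(first))
        first->pprev = &n->next;
}

/*
 * Same two stores as the unlink above, but refusing to touch a node that
 * is not linked. Returns 0 on success, -1 if the node was not on a list.
 */
int checked_del(struct nulls_node *n)
{
    struct nulls_node *next = n->next;
    struct nulls_node **pprev = n->pprev;

    if (node_unhashed(n))
        return -1;
    *pprev = next;              /* this is the store that faults if pprev is NULL */
    if (!is_a_nulls(next))
        next->pprev = pprev;
    n->pprev = NULL;            /* model choice: mark the node as unlinked */
    return 0;
}
```

Calling checked_del() twice on the same node returns 0 and then -1, whereas an unchecked second unlink would write through a NULL pprev.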


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-07 19:57 linux 3.4.43 : kernel crash at __nf_conntrack_confirm Ani Sinha
@ 2015-10-18  2:34 ` Ani Sinha
  2015-10-18  8:07   ` Florian Westphal
  0 siblings, 1 reply; 11+ messages in thread
From: Ani Sinha @ 2015-10-18  2:34 UTC (permalink / raw)
  To: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

Hi guys,

Coming back to this crash, I see something interesting in the
conntrack code in linux 3.4.109 (a supported kernel version). I see
that the hash table manipulations are protected by a spinlock. Also
lookups/reads are protected by RCU. However allocation and
deallocation of conntrack objects happen outside of both the locks.
It seems to me that a conntrack object can be deallocated and a new
object can be allocated and initialized within the same RCU grace
period, while the hash table is being read. It looks like a bug to me.
Do you guys have any thoughts on this? Situations like the one I
described can result in the crash I sent below.

thanks
ani

On Wed, Oct 7, 2015 at 12:57 PM, Ani Sinha <ani@arista.com> wrote:
> Hi guys :
>
> We encountered a kernel crash on one of our boxes running 3.4.43
> kernel in the conntrack code. We are using dnsmasq as a proxy to relay
> our dns requests to the real dns server. We verified that the
> conntrack tables were not full. running conntrack -L around the time
> of the crash showed that it had more than 2100 entries for dnsmasq.
>
> Looking upstream, I see a couple of patches which fixes race condition
> around the use of the conntrack hash table with RCU (lock free read)
> primitives :
>
> commit c6825c0976fa7893692e0e43b09740b419b23c09
> Author: Andrey Vagin <avagin@openvz.org>
> Date:   Wed Jan 29 19:34:14 2014 +0100
>      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>
> and a followup patch :
>
> commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> Author: Pablo Neira Ayuso <pablo@netfilter.org>
> Date:   Mon Feb 3 20:01:53 2014 +0100
>         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
>
>
> We are trying to reproduce the crash again but it is very rare.
> Meanwhile, I have two questions:
>
> - Do you guys think the race condition described in the above two
> patches have anything to do with the crash I mention below?
> - If answer to the above is a NO, then have you guys have any other
> reports of a similar crash or any idea what could be going on?
>
> We are still investigating and I will update this thread if I can get
> additional info.
>
> Thanks
> Ani
>
> <1>[10618591.817967] BUG: unable to handle kernel NULL pointer
> dereference at           (null)
> <1>[10618591.914483] IP: [<ffffffffa007b3f7>]
> __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
> <4>[10618592.012027] PGD 5aa67067 PUD 5b4f4067 PMD 0
> <4>[10618592.012035] Oops: 0002 [#1] PREEMPT SMP
> <4>[10618592.012041] CPU 1
> <4>[10618592.012043] Modules linked in: xt_comment sch_prio fpdma(PO)
> msr nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6table_mangle
> nf_conntrack_ipv4
> nf_defr
> ag_ipv4 xt_LOG xt_limit xt_hl xt_state ipt_REJECT xt_multiport
> xt_tcpudp iptable_mangle kbfd(O) 8021q garp stp llc tun
> nf_conntrack_tftp iptable_raw
> iptable_fil
> ter ip_tables xt_NOTRACK nf_conntrack xt_mark ip6table_raw
> ip6table_filter ip6_tables x_tables k10temp hwmon amd64_edac_mod
> scd(O) microcode kvm_amd kvm
> <4>[10618592.012092]
> <4>[10618592.012096] Pid: 5586, comm: dnsmasq Tainted: P           O 3.4.43 #1
> <4>[10618592.012102] RIP: 0010:[<ffffffffa007b3f7>]
> [<ffffffffa007b3f7>] __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
> <4>[10618592.012112] RSP: 0018:ffff88005aa1fb98  EFLAGS: 00010202
> <4>[10618592.012116] RAX: 0000000000002769 RBX: ffff880063d58658 RCX:
> 000000001cc74948
> <4>[10618592.012120] RDX: 0000000000000000 RSI: ffff88010cd80000 RDI:
> 0000000000004000
> <4>[10618592.012123] RBP: ffff88005aa1fbc8 R08: 00000000872541be R09:
> 000000007aa31682
> <4>[10618592.012127] R10: ffff880063d586d8 R11: ffff88005aa1fb68 R12:
> ffffffff81648180
> <4>[10618592.012130] R13: 00000000000017ef R14: 000000000000bf78 R15:
> 0000000000009da0
> <4>[10618592.012135] FS:  0000000000000000(0000)
> GS:ffff88013fb00000(0063) knlGS:00000000f74126d0
> <4>[10618592.012139] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> <4>[10618592.012142] CR2: 0000000000000000 CR3: 000000005b412000 CR4:
> 00000000000007e0
> <4>[10618592.012146] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> <4>[10618592.012149] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> <4>[10618592.012154] Process dnsmasq (pid: 5586, threadinfo
> ffff88005aa1e000, task ffff8800727d6050)
> <4>[10618592.012156] Stack:
> <4>[10618592.012159]  0000000000000000 ffff8800889050c0
> ffff8800889050c0 ffff880063d58658
> <4>[10618592.012166]  0000000000000004 0000000000000002
> ffff88005aa1fc38 ffffffffa00e3c54
> <4>[10618592.012172]  0000000000000004 0000000000000000
> ffff88005aa1fc38 ffffffffa0078168
> <4>[10618592.012179] Call Trace:
> <4>[10618592.012186] [<ffffffffa00e3c54>] ipv4_confirm+0x17e/0x1a5
> [nf_conntrack_ipv4]
> <4>[10618592.012192] [<ffffffffa0078168>] ?
> iptable_mangle_hook+0xfa/0x116 [iptable_mangle]
> <4>[10618592.012199] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
> <4>[10618592.012205] [<ffffffff8131900f>] nf_iterate+0x43/0x78
> <4>[10618592.012210] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
> <4>[10618592.012214] [<ffffffff813191a1>] nf_hook_slow+0x6e/0x106
> <4>[10618592.012219] [<ffffffff81324afe>] ? ip_finish_output+0x0/0x36f
> <4>[10618592.012224] [<ffffffff813222e8>] ? dst_output+0x0/0x11
> <4>[10618592.012229] [<ffffffff81324ef0>] ip_output+0x83/0x97
> <4>[10618592.012234] [<ffffffff813240a3>] ? __ip_local_out+0x9c/0x9e
> <4>[10618592.012239] [<ffffffff813240c9>] ip_local_out+0x24/0x28
> <4>[10618592.012244] [<ffffffff8132462f>] ip_queue_xmit+0x2e4/0x322
> <4>[10618592.012249] [<ffffffff81336f97>] tcp_transmit_skb+0x766/0x7a7
> <4>[10618592.012254] [<ffffffff81337345>] tcp_send_active_reset+0xd8/0x104
> <4>[10618592.012258] [<ffffffff8132b8c6>] tcp_close+0x101/0x335
> <4>[10618592.012264] [<ffffffff8134b8f2>] inet_release+0x7b/0x82
> <4>[10618592.012269] [<ffffffff812ea36e>] sock_release+0x1a/0x72
> <4>[10618592.012273] [<ffffffff812ea3e8>] sock_close+0x22/0x26
> <4>[10618592.012278] [<ffffffff810aad2d>] fput+0x117/0x1f8
> <4>[10618592.012283] [<ffffffff810a7ce2>] filp_close+0x6d/0x78
> <4>[10618592.012288] [<ffffffff810a7d7b>] sys_close+0x8e/0xc8
> <4>[10618592.012293] [<ffffffff813dcacb>] cstar_dispatch+0x7/0x1e
> <4>[10618592.012296] Code: 31 d2 0f b6 d2 85 d2 0f 85 61 01 00 00 48
> 8b 00 a8 01 75 0d 8b 53 68 3b 50 10 75 94 e9 6a ff ff ff 48 8b 43 20
> 48 8b 53 28 a8 01
> <48>
>  89 02 75 04 48 89 50 08 49 bd 00 02 20 00 00 00 ad de 48 8d
> <1>[10618592.012355] RIP  [<ffffffffa007b3f7>]
> __nf_conntrack_confirm+0x1fb/0x36c [nf_conntrack]
> <4>[10618592.110942]  RSP <ffff88005aa1fb98>
> <4>[10618592.110944] CR2: 0000000000000000
>
>
> The crash happened here in this code :
>
> static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
> {
>        struct hlist_nulls_node *next = n->next;
>         struct hlist_nulls_node **pprev = n->pprev;
>                                                    *pprev = next;
>          1ac1:       48 89 02                mov    %rax,(%rdx)  <==== CRASH
>         if (!is_a_nulls(next))
>     1ac4:       75 04                   jne    1aca
> <nf_ct_delete_from_lists+0x62>
> next->pprev = pprev;
>
> 1ac6:       48 89 50 08             mov    %rdx,0x8(%rax)
> * hlist_nulls_for_each_entry().
> */
>
> The instruction is *prev = next and pprev pointer is NULL (RDX)


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-18  2:34 ` Ani Sinha
@ 2015-10-18  8:07   ` Florian Westphal
       [not found]     ` <CAOxq_8NLLFyNCSDJ68+VjxFGpNSex8ShdhGFNBHK29g_+UBW6g@mail.gmail.com>
  2015-10-21 19:35     ` Ani Sinha
  0 siblings, 2 replies; 11+ messages in thread
From: Florian Westphal @ 2015-10-18  8:07 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

Ani Sinha <ani@arista.com> wrote:
> Coming back to this crash, I see something interesting in the
> conntrack code in linux 3.4.109 (a supported kernel version). I see
> that the hash table manipulations are protected by a spinlock. Also
> lookups/reads are protected by RCU. However allocation and
> deallocation of conntrack objects happen outside of both the locks.
> It seems to me that a conntrack object can be deallocated and a new
> object can be allocated and initialized within the same RCU grace
> period, while the hash table is being read.

Yes.  We need to use SLAB_DESTROY_BY_RCU instead of kfree_rcu because
there could be hundreds of thousands of alloc/free pairs within a short
time period.

> It looks like a bug to me.

No, as long as readers detect object reuse.

> > Looking upstream, I see a couple of patches which fixes race condition
> > around the use of the conntrack hash table with RCU (lock free read)
> > primitives :
> >
> > commit c6825c0976fa7893692e0e43b09740b419b23c09
> > Author: Andrey Vagin <avagin@openvz.org>
> > Date:   Wed Jan 29 19:34:14 2014 +0100
> >      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >
> > and a followup patch :
> >
> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> > Author: Pablo Neira Ayuso <pablo@netfilter.org>
> > Date:   Mon Feb 3 20:01:53 2014 +0100
> >         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
> >

These for instance fix such bugs.
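
As a rough illustration of what "readers detect object reuse" means, the sketch below models the reader side in user space with C11 atomics. All names (struct entry, get_unless_zero(), key_match(), find_get()) are hypothetical stand-ins, and the single int key stands in for the tuple and zone comparison; this is not the kernel code. The idea is that a reader takes a reference only if the refcount is non-zero, and then compares the key again, because the memory may have been freed and reused for a different connection in between.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a conntrack entry in type-stable (reusable) memory. */
struct entry {
    atomic_int refcnt;  /* 0 means the object is free or being freed */
    int key;            /* stands in for tuple + zone */
    bool confirmed;     /* stands in for the CONFIRMED status bit */
};

/* Take a reference only if the object is still live (refcnt != 0). */
bool get_unless_zero(struct entry *e)
{
    int old = atomic_load(&e->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

void put_ref(struct entry *e)
{
    atomic_fetch_sub(&e->refcnt, 1);
}

/* The reader's full match: same key and already confirmed. */
bool key_match(const struct entry *e, int key)
{
    return e->key == key && e->confirmed;
}

/* Returns the candidate with a reference held, or NULL if it must be skipped. */
struct entry *find_get(struct entry *candidate, int key)
{
    if (!key_match(candidate, key))
        return NULL;
    if (!get_unless_zero(candidate))
        return NULL;    /* object was already on its way to being freed */
    /*
     * The slot may have been freed and reused between the first check and
     * taking the reference, so the key is compared again while holding it.
     */
    if (!key_match(candidate, key)) {
        put_ref(candidate);
        return NULL;
    }
    return candidate;
}
```

In a real lookup a NULL here would mean "skip this entry or restart the walk" rather than "not found".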


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
       [not found]     ` <CAOxq_8NLLFyNCSDJ68+VjxFGpNSex8ShdhGFNBHK29g_+UBW6g@mail.gmail.com>
@ 2015-10-18 21:12       ` Ani Sinha
  2015-10-18 21:40         ` Florian Westphal
  0 siblings, 1 reply; 11+ messages in thread
From: Ani Sinha @ 2015-10-18 21:12 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Florian Westphal, Patrick McHardy, David S. Miller,
	netfilter-devel, netfilter, coreteam, netdev



> 
> On Sun, Oct 18, 2015 at 1:07 AM, Florian Westphal <fw@strlen.de> wrote:
> > Ani Sinha <ani@arista.com> wrote:
> >> Coming back to this crash, I see something interesting in the
> >> conntrack code in linux 3.4.109 (a supported kernel version). I see
> >> that the hash table manipulations are protected by a spinlock. Also
> >> lookups/reads are protected by RCU. However allocation and
> >> deallocation of conntrack objects happen outside of both the locks.
> >> It seems to me that a conntrack object can be deallocated and a new
> >> object can be allocated and initialized within the same RCU grace
> >> period, while the hash table is being read.
> >
> > Yes.  We need to use SLAB_DESTROY_BY_RCU instead of kfree_rcu because
> > there could be hundreds of thousands of alloc/free pairs within a short
> > time period.
> >
> >> It looks like a bug to me.
> >
> > No, as long as readers detect object reuse.
 
Right.
 
> >
> >> > Looking upstream, I see a couple of patches which fixes race condition
> >> > around the use of the conntrack hash table with RCU (lock free read)
> >> > primitives :
> >> >
> >> > commit c6825c0976fa7893692e0e43b09740b419b23c09
> >> > Author: Andrey Vagin <avagin@openvz.org>
> >> > Date:   Wed Jan 29 19:34:14 2014 +0100
> >> >      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >
> >> > and a followup patch :
> >> >
> >> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> >> > Author: Pablo Neira Ayuso <pablo@netfilter.org>
> >> > Date:   Mon Feb 3 20:01:53 2014 +0100
> >> >         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
> >> >
> >
> > These for instance fix such bugs.
> 
Indeed. So it seems to me that we have run into another such case.
In patch c6825c0976fa7893692, I see we have added an additional check
(along with comparing tuple and zone) to verify that the conntrack is
confirmed.
 
+       return nf_ct_tuple_equal(tuple, &h->tuple) &&
+               nf_ct_zone(ct) == zone &&
+               nf_ct_is_confirmed(ct);
 
 
This is necessary since it's possible that a conntrack can be recreated with the same zone.
Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this routine _is_ responsible
for confirming the conntrack. We cannot use the same logic here. 
 
Should I send a patch along the lines of :
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 71935fc..6ff4088 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -535,6 +535,12 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 
+	/* we might be racing against a case where the conntrack was deleted 
+	   and a new conntrack was initialized with the exact same zone. We
+	   need to make sure that the conntrack node is in the hashtable */
+	if (hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
+	  goto out;
+
 	/* Remove from unconfirmed list */
 	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 




* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-18 21:12       ` Ani Sinha
@ 2015-10-18 21:40         ` Florian Westphal
  2015-10-19 20:22           ` Ani Sinha
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2015-10-18 21:40 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Florian Westphal, Patrick McHardy, David S. Miller,
	netfilter-devel, netfilter, coreteam, netdev

Ani Sinha <ani@arista.com> wrote:
> Indeed. So it seems to me that we have run into one another such case.
> In patch c6825c0976fa7893692, I see we have added an additional check (along with comparing tuple and zone) to verify that if the conntrack is confirmed.
>  
> +       return nf_ct_tuple_equal(tuple, &h->tuple) &&
> +               nf_ct_zone(ct) == zone &&
> +               nf_ct_is_confirmed(ct);
> 
> This is necessary since it's possible that a conntrack can be recreated with the same zone.
> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this routine _is_ responsible
> for confirming the conntrack. We cannot use the same logic here.

Hmm, why?

I don't understand why we need to change __nf_conntrack_confirm(), can
you elaborate?

At __nf_conntrack_confirm call time, only one cpu can see this nfct entry.
Other cpus on the read-side can see it due to object re-use, but at least
one of the following tests should fail:

1. different tuples
2. different zones
3. CONFIRMED not set

So they would skip the entry and restart the lookup (nulls value mismatch).
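
The restart on a nulls value mismatch can be sketched as follows. This is a single-threaded user-space model with made-up names (struct node, nulls_marker(), walk_bucket()), assuming each chain ends in an odd-valued marker that encodes its bucket number. If a reader that started in one bucket reaches an end marker carrying a different bucket number, an entry it walked over was moved to another chain, and the walk has to be restarted.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative chain node: next is a real pointer or an odd-valued end marker. */
struct node {
    struct node *next;
    int key;
};

enum walk_result { WALK_FOUND, WALK_MISS, WALK_RESTART };

/* End-of-chain marker that encodes which bucket the chain belongs to. */
struct node *nulls_marker(unsigned long bucket)
{
    return (struct node *)(uintptr_t)((bucket << 1) | 1UL);
}

int is_nulls(const struct node *p)
{
    return (int)((uintptr_t)p & 1UL);
}

unsigned long nulls_value(const struct node *p)
{
    return (unsigned long)((uintptr_t)p >> 1);
}

/*
 * Walk one chain starting at first. Reaching an end marker for a different
 * bucket means we drifted onto another chain, so the caller must restart.
 */
enum walk_result walk_bucket(struct node *first, unsigned long bucket, int key)
{
    struct node *pos;

    for (pos = first; !is_nulls(pos); pos = pos->next) {
        if (pos->key == key)
            return WALK_FOUND;
    }
    return nulls_value(pos) == bucket ? WALK_MISS : WALK_RESTART;
}
```

A caller would loop on WALK_RESTART, re-reading the bucket head each time.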

> Should I send a patch along the lines of :
>  
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 71935fc..6ff4088 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -535,6 +535,12 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>  		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
>  			goto out;
>  
> +	/* we might be racing against a case where the conntrack was deleted 
> +	   and a new conntrack was initialized with the exact same zone. We
> +	   need to make sure that the conntrack node is in the hashtable */

?

The conntrack is NOT in the hashtable at this point.  It's not even on
the unconfirmed list since we already removed it in preparation for
hashtable insertion.

> +	if (hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
> +	  goto out;

That would be a bug, how can ->nfct be confirmed twice?

If you're talking about IPS_CONFIRMED getting set -- that should be
harmless.  In some theoretical condition we could indeed observe this
nfct on another cpu, just before we actually insert this but this does
not cause a problem on the read-side since the conntrack matches the
tuple exactly and all extensions have been initialized.

And if we create two conntracks with identical tuples on different CPUs
which is possible regardless of RCU this will be detected during
confirm step (we search ht for a colliding tuple).

So, if there is a problem please describe in more detail, I don't see
anything wrong so far.


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-18 21:40         ` Florian Westphal
@ 2015-10-19 20:22           ` Ani Sinha
  2015-10-19 20:33             ` Florian Westphal
  0 siblings, 1 reply; 11+ messages in thread
From: Ani Sinha @ 2015-10-19 20:22 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

On Sun, Oct 18, 2015 at 2:40 PM, Florian Westphal <fw@strlen.de> wrote:
> Ani Sinha <ani@arista.com> wrote:
>> Indeed. So it seems to me that we have run into one another such case.
>> In patch c6825c0976fa7893692, I see we have added an additional check (along with comparing tuple and zone) to verify that if the conntrack is confirmed.
>>
>> +       return nf_ct_tuple_equal(tuple, &h->tuple) &&
>> +               nf_ct_zone(ct) == zone &&
>> +               nf_ct_is_confirmed(ct);
>>
>> This is necessary since it's possible that a conntrack can be recreated with the same zone.
>> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this routine _is_ responsible
>> for confirming the conntrack. We cannot use the same logic here.
>
> Hmm, why?
>
> I don't understand why we need to change __nf_conntrack_confirm(), can
> you elaborate?

ok, let's take a step back. The fundamental question I am trying to
answer is whether it is possible for another thread to
deallocate and then reallocate and initialize the conntrack object
while __nf_conntrack_confirm() is running concurrently. The crash
I reported seems to indicate that this can happen.

However, in the current 3.4 release (and the image which generated the
crash), we do not have the patch

e53376bef2cd97d3e3f61fdc6

applied. This patch bumps the refcount before adding the conntrack
entry to the unconfirmed list.

+ /* Now it is inserted into the unconfirmed list, bump refcount */
+ nf_conntrack_get(&ct->ct_general);

and if we assume the invariant that nf_conntrack_free() is never
called when refcount is !=0, then this would seem to indicate that the
above patch should fix the crash I mentioned in the thread.

One curious hunk is:

+ /* A freed object has refcnt == 0, that's
+ * the golden rule for SLAB_DESTROY_BY_RCU
+ */
+ NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
+

First, this assertion at best logs a warning when it fails.
Second, if the assertion fails, at some point we will hit a
kernel crash like the one I mentioned. So this assertion effectively
does nothing other than perhaps help in debugging. Third, the very
fact that this assertion was added seems to indicate that there might
be cases where we can free a conntrack object with a non-zero refcount.

Does all this make sense?
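
The refcount argument above can be modelled in user space as follows. This is a sketch under assumptions: struct obj, obj_get(), obj_put(), obj_free() and the WARN_ASSERT macro are hypothetical, and the macro only imitates the "log and continue" behaviour described for the assertion. The object is freed only when the last reference is dropped, so a reference held on behalf of the list keeps it alive, while the warn-only check records a violation without preventing it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical refcounted object standing in for a conntrack entry. */
struct obj {
    atomic_int use;  /* reference count */
    bool freed;      /* set by obj_free(), stands in for returning to the slab */
};

/* Number of failed warn-only checks seen so far. */
int warnings;

/* Warn-only check: logs and counts a violation, then carries on. */
#define WARN_ASSERT(cond)                                        \
    do {                                                         \
        if (!(cond)) {                                           \
            warnings++;                                          \
            fprintf(stderr, "assertion failed: %s\n", #cond);    \
        }                                                        \
    } while (0)

/* Must only be reached once the count has dropped to zero. */
void obj_free(struct obj *o)
{
    WARN_ASSERT(atomic_load(&o->use) == 0);
    o->freed = true;  /* the free still happens even if the check failed */
}

/* Take an extra reference, e.g. on behalf of a list the object sits on. */
void obj_get(struct obj *o)
{
    atomic_fetch_add(&o->use, 1);
}

/* Drop a reference; whoever drops the last one frees the object. */
void obj_put(struct obj *o)
{
    if (atomic_fetch_sub(&o->use, 1) == 1)
        obj_free(o);
}
```

With the extra obj_get() for the list, dropping the packet's reference leaves the object alive; calling obj_free() directly on an object that still has users only bumps the warning counter.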


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-19 20:22           ` Ani Sinha
@ 2015-10-19 20:33             ` Florian Westphal
  2015-10-19 22:13               ` Ani Sinha
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2015-10-19 20:33 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Florian Westphal, Patrick McHardy, David S. Miller,
	netfilter-devel, netfilter, coreteam, netdev

Ani Sinha <ani@arista.com> wrote:
> On Sun, Oct 18, 2015 at 2:40 PM, Florian Westphal <fw@strlen.de> wrote:
> > Ani Sinha <ani@arista.com> wrote:
> >> Indeed. So it seems to me that we have run into one another such case.
> >> In patch c6825c0976fa7893692, I see we have added an additional check (along with comparing tuple and zone) to verify that if the conntrack is confirmed.
> >>
> >> +       return nf_ct_tuple_equal(tuple, &h->tuple) &&
> >> +               nf_ct_zone(ct) == zone &&
> >> +               nf_ct_is_confirmed(ct);
> >>
> >> This is necessary since it's possible that a conntrack can be recreated with the same zone.
> >> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this routine _is_ responsible
> >> for confirming the conntrack. We cannot use the same logic here.
> >
> > Hmm, why?
> >
> > I don't understand why we need to change __nf_conntrack_confirm(), can
> > you elaborate?
> 
> ok, let's take a step back. The fundamental question I am trying to
> find answer to is that whether it is possible for another thread to
> deallocate and then reallocate and initialize the conntrack object
> while running concurrently during __nf_conntrack_confirm() .

Not unless something is broken.

> crash), we do not have the patch
> 
> e53376bef2cd97d3e3f61fdc6
> 
> applied. This patch bumps the refcount before adding the connrack
> entry into the unconfirmed list.

Yes, that patch fixes such a bug.

> + /* Now it is inserted into the unconfirmed list, bump refcount */
> + nf_conntrack_get(&ct->ct_general);
> 
> and if we assume the invariant that nf_conntrack_free() is never
> called when refcount is !=0, then this would seem to indicate that the
> above patch should fix the crash I mentioned in the thread.

nf_conntrack_free must only be invoked after refcount becomes zero, right.

> One curious piece of hunk is :
> 
> + /* A freed object has refcnt == 0, that's
> + * the golden rule for SLAB_DESTROY_BY_RCU
> + */
> + NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
> +
> First, this assertion only puts a warning log at best when it fails.
> Second, if this assertion is false, at some point we will get into a
> kernel crash as the one I mentioned. So this assertion effectively
> does nothing other than perhaps help in debugging.

Right.


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-19 20:33             ` Florian Westphal
@ 2015-10-19 22:13               ` Ani Sinha
  0 siblings, 0 replies; 11+ messages in thread
From: Ani Sinha @ 2015-10-19 22:13 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

On Mon, Oct 19, 2015 at 1:33 PM, Florian Westphal <fw@strlen.de> wrote:
> Ani Sinha <ani@arista.com> wrote:
>> On Sun, Oct 18, 2015 at 2:40 PM, Florian Westphal <fw@strlen.de> wrote:
>> > Ani Sinha <ani@arista.com> wrote:
>> >> Indeed. So it seems to me that we have run into one another such case.
>> >> In patch c6825c0976fa7893692, I see we have added an additional check (along with comparing tuple and zone) to verify that if the conntrack is confirmed.
>> >>
>> >> +       return nf_ct_tuple_equal(tuple, &h->tuple) &&
>> >> +               nf_ct_zone(ct) == zone &&
>> >> +               nf_ct_is_confirmed(ct);
>> >>
>> >> This is necessary since it's possible that a conntrack can be recreated with the same zone.
>> >> Unfortunately, we leave a hole open in __nf_conntrack_confirm() because this routine _is_ responsible
>> >> for confirming the conntrack. We cannot use the same logic here.
>> >
>> > Hmm, why?
>> >
>> > I don't understand why we need to change __nf_conntrack_confirm(), can
>> > you elaborate?
>>
>> ok, let's take a step back. The fundamental question I am trying to
>> find answer to is that whether it is possible for another thread to
>> deallocate and then reallocate and initialize the conntrack object
>> while running concurrently during __nf_conntrack_confirm() .
>
> Not unless something is broken.

With or without e53376bef2cd97d3e3f61fdc6 ?

>
>> crash), we do not have the patch
>>
>> e53376bef2cd97d3e3f61fdc6
>>
>> applied. This patch bumps the refcount before adding the connrack
>> entry into the unconfirmed list.
>
> Yes, that patch fixes such bug.
>
>> + /* Now it is inserted into the unconfirmed list, bump refcount */
>> + nf_conntrack_get(&ct->ct_general);
>>
>> and if we assume the invariant that nf_conntrack_free() is never
>> called when refcount is !=0, then this would seem to indicate that the
>> above patch should fix the crash I mentioned in the thread.
>
> nf_conntrack_free must only be invoked after refcount becomes zero, right.
>
>> One curious piece of hunk is :
>>
>> + /* A freed object has refcnt == 0, that's
>> + * the golden rule for SLAB_DESTROY_BY_RCU
>> + */
>> + NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 0);
>> +
>> First, this assertion only puts a warning log at best when it fails.
>> Second, if this assertion is false, at some point we will get into a
>> kernel crash as the one I mentioned. So this assertion effectively
>> does nothing other than perhaps help in debugging.
>
> Right.


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-18  8:07   ` Florian Westphal
       [not found]     ` <CAOxq_8NLLFyNCSDJ68+VjxFGpNSex8ShdhGFNBHK29g_+UBW6g@mail.gmail.com>
@ 2015-10-21 19:35     ` Ani Sinha
  2015-10-21 21:19       ` Florian Westphal
  1 sibling, 1 reply; 11+ messages in thread
From: Ani Sinha @ 2015-10-21 19:35 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

On Sun, Oct 18, 2015 at 1:07 AM, Florian Westphal <fw@strlen.de> wrote:
> Ani Sinha <ani@arista.com> wrote:
>> Coming back to this crash, I see something interesting in the
>> conntrack code in linux 3.4.109 (a supported kernel version). I see
>> that the hash table manipulations are protected by a spinlock. Also
>> lookups/reads are protected by RCU. However allocation and
>> deallocation of conntrack objects happen outside of both the locks.
>> It seems to me that a conntrack object can be deallocated and a new
>> object can be allocated and initialized within the same RCU grace
>> period, while the hash table is being read.
>
> Yes.  We need to use SLAB_DESTROY_BY_RCU instead of kfree_rcu because
> there could be hundreds of thousands of alloc/free pairs within a short
> time period.
>
>> It looks like a bug to me.
>
> No, as long as readers detect object reuse.
>
>> > Looking upstream, I see a couple of patches which fix a race condition
>> > around the use of the conntrack hash table with RCU (lock-free read)
>> > primitives :
>> >
>> > commit c6825c0976fa7893692e0e43b09740b419b23c09
>> > Author: Andrey Vagin <avagin@openvz.org>
>> > Date:   Wed Jan 29 19:34:14 2014 +0100
>> >      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >
>> > and a followup patch :
>> >
>> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
>> > Author: Pablo Neira Ayuso <pablo@netfilter.org>
>> > Date:   Mon Feb 3 20:01:53 2014 +0100
>> >         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
>> >
>
> These for instance fix such bugs.

Since neither of these patches was backported to the 3.4 series, and
since we now have evidence of a crash that points to the issues the
patches fix, should we consider backporting the above patches to 3.4?
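
[Illustration of "readers detect object reuse" from the reply quoted above.
With SLAB_DESTROY_BY_RCU, memory found during a lock-free lookup may already
have been freed and reallocated for a different connection, so a reader first
takes a reference only if the count is non-zero and then compares the key
again. This is a single-threaded userspace sketch under those assumptions, not
the kernel implementation: struct entry, entry_init, entry_get_unless_zero and
entry_get_checked are hypothetical names, and an integer key stands in for the
conntrack tuple.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical hash table entry: a refcount plus an integer lookup key. */
struct entry {
	atomic_int use;  /* 0 means the object is free or being freed */
	atomic_int key;  /* stand-in for the conntrack tuple */
};

static void entry_init(struct entry *e, int key, int use)
{
	atomic_init(&e->key, key);
	atomic_init(&e->use, use);
}

/* Userspace analogue of atomic_inc_not_zero(): take a reference unless the count is 0. */
static bool entry_get_unless_zero(struct entry *e)
{
	int old = atomic_load(&e->use);

	while (old != 0) {
		/* On failure, old is reloaded with the current value and we retry. */
		if (atomic_compare_exchange_weak(&e->use, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Called on a candidate that matched "key" during the lock-free walk.
 * Returns true with a reference held only if the entry is still live
 * and still carries the key we looked up.
 */
static bool entry_get_checked(struct entry *e, int key)
{
	if (!entry_get_unless_zero(e))
		return false;            /* object was already on its way to being freed */

	if (atomic_load(&e->key) != key) {
		/* Object was recycled for another key: drop our reference. */
		atomic_fetch_sub(&e->use, 1);
		return false;
	}
	return true;
}
```

[When entry_get_checked() fails, a real reader would typically restart the
lookup rather than use the candidate. In this sketch the failed path simply
drops the reference; a real implementation would also have to free the object
if that drop took the count to zero.]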


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-21 19:35     ` Ani Sinha
@ 2015-10-21 21:19       ` Florian Westphal
  2015-10-21 21:26         ` Ani Sinha
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2015-10-21 21:19 UTC (permalink / raw)
  To: Ani Sinha
  Cc: Florian Westphal, Patrick McHardy, David S. Miller,
	netfilter-devel, netfilter, coreteam, netdev

Ani Sinha <ani@arista.com> wrote:
> >> > commit c6825c0976fa7893692e0e43b09740b419b23c09
> >> > Author: Andrey Vagin <avagin@openvz.org>
> >> > Date:   Wed Jan 29 19:34:14 2014 +0100
> >> >      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
> >> >
> >> > and a followup patch :
> >> >
> >> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> >> > Author: Pablo Neira Ayuso <pablo@netfilter.org>
> >> > Date:   Mon Feb 3 20:01:53 2014 +0100
> >> >         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
> >> >
> >
> > These for instance fix such bugs.
> 
> Since neither of these patches was backported to the 3.4 series, and
> since we now have evidence of a crash that points to the issues the
> patches fix, should we consider backporting the above patches to 3.4?

Yes.


* Re: linux 3.4.43 : kernel crash at __nf_conntrack_confirm
  2015-10-21 21:19       ` Florian Westphal
@ 2015-10-21 21:26         ` Ani Sinha
  0 siblings, 0 replies; 11+ messages in thread
From: Ani Sinha @ 2015-10-21 21:26 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Patrick McHardy, David S. Miller, netfilter-devel, netfilter,
	coreteam, netdev

On Wed, Oct 21, 2015 at 2:19 PM, Florian Westphal <fw@strlen.de> wrote:
> Ani Sinha <ani@arista.com> wrote:
>> >> > commit c6825c0976fa7893692e0e43b09740b419b23c09
>> >> > Author: Andrey Vagin <avagin@openvz.org>
>> >> > Date:   Wed Jan 29 19:34:14 2014 +0100
>> >> >      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
>> >> >
>> >> > and a followup patch :
>> >> >
>> >> > commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
>> >> > Author: Pablo Neira Ayuso <pablo@netfilter.org>
>> >> > Date:   Mon Feb 3 20:01:53 2014 +0100
>> >> >         netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
>> >> >
>> >
>> > These for instance fix such bugs.
>>
>> Since neither of these patches was backported to the 3.4 series, and
>> since we now have evidence of a crash that points to the issues the
>> patches fix, should we consider backporting the above patches to 3.4?
>
> Yes.

Ok cool. I will send out backport patches for 3.4 corresponding to
both the above patches.


end of thread, other threads:[~2015-10-21 21:26 UTC | newest]

Thread overview: 11+ messages
2015-10-07 19:57 linux 3.4.43 : kernel crash at __nf_conntrack_confirm Ani Sinha
2015-10-18  2:34 ` Ani Sinha
2015-10-18  8:07   ` Florian Westphal
     [not found]     ` <CAOxq_8NLLFyNCSDJ68+VjxFGpNSex8ShdhGFNBHK29g_+UBW6g@mail.gmail.com>
2015-10-18 21:12       ` Ani Sinha
2015-10-18 21:40         ` Florian Westphal
2015-10-19 20:22           ` Ani Sinha
2015-10-19 20:33             ` Florian Westphal
2015-10-19 22:13               ` Ani Sinha
2015-10-21 19:35     ` Ani Sinha
2015-10-21 21:19       ` Florian Westphal
2015-10-21 21:26         ` Ani Sinha
