All of lore.kernel.org
 help / color / mirror / Atom feed
* slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
@ 2016-10-09 21:31 Linus Torvalds
  2016-10-10  0:51 ` Florian Westphal
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2016-10-09 21:31 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Jens Axboe, Ted Ts'o,
	Christoph Lameter, David Miller, Pablo Neira Ayuso, Aaron Conole
  Cc: Linux Kernel Mailing List, linux-fsdevel, Network Development, NetFilter

On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I don't think I can bisect it, but I'll try to narrow it down
> a *bit* at least.
>
> Not doing any more pulls on this unstable base, I've been puttering
> around in trying to clean up some stupid printk logging issues
> instead.

So I finally got a oops with slub debugging enabled. It doesn't really
narrow things down, though, it kind of extends on the possible
suspects. Now adding David Miller and Pablo, because it looks like it
may be netfilter that does something bad and corrupts memory.

Of course, maybe this is another symptom, and not the root cause for
my troubles, but it does look like it might be getting closer to the
cause... In particular, now it very much looks like a use-after-free
in the netfilter code, which *could* explain my original symptom with
later allocation users oopsing randomly.

Without further ado, here's the new oops:

   general protection fault: 0000 [#1] SMP
   CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted 4.8.0-11288-gb66484cd7470 #1
   Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
   Workqueue: netns cleanup_net
   task: ffff91935e001fc0 task.stack: ffffb4e2c213c000
   RIP: nf_unregister_net_hook+0x5f/0x190
   RSP: 0000:ffffb4e2c213fd40  EFLAGS: 00010202
   RAX: 6b6b6b6b6b6b6b6b RBX: ffff91933c4ab968 RCX: 0000000000000002
   RDX: 0000000000000002 RSI: ffffffffc0642280 RDI: ffffffff91cf9820
   RBP: ffffb4e2c213fd58 R08: ffff91933c4a86c8 R09: 0000000000000025
   R10: 00000000000000cc R11: ffff91935dd22000 R12: ffffffffc0642280
   R13: ffff91934cc0ea80 R14: ffffffff91cf97e0 R15: 00000000ffffffff
   FS:  0000000000000000(0000) GS:ffff919376dc0000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000003e7c000 CR3: 00000003fdb62000 CR4: 00000000003406e0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
     netfilter_net_exit+0x2f/0x60
     ops_exit_list.isra.4+0x38/0x60
     cleanup_net+0x1ba/0x2a0
     process_one_work+0x1f1/0x480
     worker_thread+0x48/0x4d0
     ? process_one_work+0x480/0x480
     ? process_one_work+0x480/0x480
     kthread+0xd9/0xf0
     ? kthread_park+0x60/0x60
     ret_from_fork+0x22/0x30
   Code: 0f b6 ca 48 8d 84 c8 00 01 00 00 49 8b 5c c5 00 48 85 db 0f
84 cb 00 00 00 4c 3b 63 40 48 8b 03 0f 84 e9 00 00 00 48 85 c0 74 26
<4c> 3b 60 40 75 08 e9 ef 00 00 00 48 89 d8 48 8b 18 48 85 db 0f
   RIP  [<ffffffff916bae8f>] nf_unregister_net_hook+0x5f/0x190

and note the value in %rax: 6b is POISON_FREE, so it very much looks
like it's a pointer loaded from a free'd allocation.

The code disassembles to

   0: 0f b6 ca             movzbl %dl,%ecx
   3: 48 8d 84 c8 00 01 00 lea    0x100(%rax,%rcx,8),%rax
   a: 00
   b: 49 8b 5c c5 00       mov    0x0(%r13,%rax,8),%rbx
  10: 48 85 db             test   %rbx,%rbx
  13: 0f 84 cb 00 00 00     je     0xe4
  19: 4c 3b 63 40           cmp    0x40(%rbx),%r12
  1d: 48 8b 03             mov    (%rbx),%rax
  20: 0f 84 e9 00 00 00     je     0x10f
  26: 48 85 c0             test   %rax,%rax
  29: 74 26                 je     0x51
  2b:* 4c 3b 60 40           cmp    0x40(%rax),%r12 <-- trapping instruction
  2f: 75 08                 jne    0x39
  31: e9 ef 00 00 00       jmpq   0x125
  36: 48 89 d8             mov    %rbx,%rax
  39: 48 8b 18             mov    (%rax),%rbx
  3c: 48 85 db             test   %rbx,%rbx

and that oopsing instruction seems to be the compare of
"hooks_entry->orig_ops" from hooks_entry in this expression:

        if (hooks_entry && hooks_entry->orig_ops == reg) {

so hooks_entry() is bogus. It was gotten from

        hooks_entry = nf_hook_entry_head(net, reg);

but that's as far as I dug. And yes, I do have
CONFIG_NETFILTER_INGRESS=y in case that matters.

And all this code has changed pretty radically in commit e3b37f11e6e4
("netfilter: replace list_head with single linked list"), and there
was clearly already something wrong with that code, with commit
5119e4381a90 ("netfilter: Fix potential null pointer dereference")
adding the test against NULL. But I suspect that only hid the "oops,
it's actually not NULL, it loaded some uninitialized value" problem.

Over to the networking guys.. Ideas?

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-09 21:31 slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
@ 2016-10-10  0:51 ` Florian Westphal
  2016-10-10  1:35   ` Aaron Conole
  0 siblings, 1 reply; 26+ messages in thread
From: Florian Westphal @ 2016-10-10  0:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Jens Axboe, Ted Ts'o,
	Christoph Lameter, David Miller, Pablo Neira Ayuso, Aaron Conole,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, I don't think I can bisect it, but I'll try to narrow it down
> > a *bit* at least.
> >
> > Not doing any more pulls on this unstable base, I've been puttering
> > around in trying to clean up some stupid printk logging issues
> > instead.
> 
> So I finally got a oops with slub debugging enabled. It doesn't really
> narrow things down, though, it kind of extends on the possible
> suspects. Now adding David Miller and Pablo, because it looks like it
> may be netfilter that does something bad and corrupts memory.

Quite possible, the netns interactions are not nice :-/

> Without further ado, here's the new oops:
> 
>    general protection fault: 0000 [#1] SMP
>    CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted 4.8.0-11288-gb66484cd7470 #1
>    Hardware name: System manufacturer System Product Name/Z170-K, BIOS
..
>    Call Trace:
>      netfilter_net_exit+0x2f/0x60
>      ops_exit_list.isra.4+0x38/0x60
>      cleanup_net+0x1ba/0x2a0
>      process_one_work+0x1f1/0x480
>      worker_thread+0x48/0x4d0
>      ? process_one_work+0x480/0x480

..

> like it's a pointer loaded from a free'd allocation.
> 
> The code disassembles to
> 
>    0: 0f b6 ca             movzbl %dl,%ecx
>    3: 48 8d 84 c8 00 01 00 lea    0x100(%rax,%rcx,8),%rax
>    a: 00
>    b: 49 8b 5c c5 00       mov    0x0(%r13,%rax,8),%rbx
>   10: 48 85 db             test   %rbx,%rbx
>   13: 0f 84 cb 00 00 00     je     0xe4
>   19: 4c 3b 63 40           cmp    0x40(%rbx),%r12
>   1d: 48 8b 03             mov    (%rbx),%rax
>   20: 0f 84 e9 00 00 00     je     0x10f
>   26: 48 85 c0             test   %rax,%rax
>   29: 74 26                 je     0x51
>   2b:* 4c 3b 60 40           cmp    0x40(%rax),%r12 <-- trapping instruction
>   2f: 75 08                 jne    0x39
>   31: e9 ef 00 00 00       jmpq   0x125
>   36: 48 89 d8             mov    %rbx,%rax
>   39: 48 8b 18             mov    (%rax),%rbx
>   3c: 48 85 db             test   %rbx,%rbx
> 
> and that oopsing instruction seems to be the compare of
> "hooks_entry->orig_ops" from hooks_entry in this expression:
> 
>         if (hooks_entry && hooks_entry->orig_ops == reg) {
> 
> so hooks_entry() is bogus. It was gotten from
> 
>         hooks_entry = nf_hook_entry_head(net, reg);
> 
> but that's as far as I dug. And yes, I do have
> CONFIG_NETFILTER_INGRESS=y in case that matters.
> 
> And all this code has changed pretty radically in commit e3b37f11e6e4
> ("netfilter: replace list_head with single linked list"), and there
> was clearly already something wrong with that code, with commit
> 5119e4381a90 ("netfilter: Fix potential null pointer dereference")
> adding the test against NULL. But I suspect that only hid the "oops,
> it's actually not NULL, it loaded some uninitialized value" problem.
> 
> Over to the networking guys.. Ideas?

Sorry, not off the top of my head.
Pablo is currently travelling back home from netdev 1.2 in Tokyo,
I can help starting Wednesday when I am back.

One shot in the dark (not even compile tested; wonder if we can end up
zapping bogus hook ...)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb..fd6a2ce 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -189,6 +189,9 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 
 unlock:
        mutex_unlock(&nf_hook_mutex);
+
+       WARN_ON(hooks_entry && hooks_entry->orig_ops != reg);
+
        if (!hooks_entry) {
                WARN(1, "nf_unregister_net_hook: hook not found!\n");
                return;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10  0:51 ` Florian Westphal
@ 2016-10-10  1:35   ` Aaron Conole
  2016-10-10  2:49     ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Aaron Conole @ 2016-10-10  1:35 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Jens Axboe, Ted Ts'o,
	Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

Florian Westphal <fw@strlen.de> writes:

> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Sun, Oct 9, 2016 at 12:11 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > Anyway, I don't think I can bisect it, but I'll try to narrow it down
>> > a *bit* at least.
>> >
>> > Not doing any more pulls on this unstable base, I've been puttering
>> > around in trying to clean up some stupid printk logging issues
>> > instead.
>> 
>> So I finally got a oops with slub debugging enabled. It doesn't really
>> narrow things down, though, it kind of extends on the possible
>> suspects. Now adding David Miller and Pablo, because it looks like it
>> may be netfilter that does something bad and corrupts memory.
>
> Quite possible, the netns interactions are not nice :-/
>
>> Without further ado, here's the new oops:
>> 
>>    general protection fault: 0000 [#1] SMP
>>    CPU: 7 PID: 169 Comm: kworker/u16:7 Not tainted
>> 4.8.0-11288-gb66484cd7470 #1
>>    Hardware name: System manufacturer System Product Name/Z170-K, BIOS
> ..
>>    Call Trace:
>>      netfilter_net_exit+0x2f/0x60
>>      ops_exit_list.isra.4+0x38/0x60
>>      cleanup_net+0x1ba/0x2a0
>>      process_one_work+0x1f1/0x480
>>      worker_thread+0x48/0x4d0
>>      ? process_one_work+0x480/0x480
>
> ..
>
>> like it's a pointer loaded from a free'd allocation.
>> 
>> The code disassembles to
>> 
>>    0: 0f b6 ca             movzbl %dl,%ecx
>>    3: 48 8d 84 c8 00 01 00 lea    0x100(%rax,%rcx,8),%rax
>>    a: 00
>>    b: 49 8b 5c c5 00       mov    0x0(%r13,%rax,8),%rbx
>>   10: 48 85 db             test   %rbx,%rbx
>>   13: 0f 84 cb 00 00 00     je     0xe4
>>   19: 4c 3b 63 40           cmp    0x40(%rbx),%r12
>>   1d: 48 8b 03             mov    (%rbx),%rax
>>   20: 0f 84 e9 00 00 00     je     0x10f
>>   26: 48 85 c0             test   %rax,%rax
>>   29: 74 26                 je     0x51
>>   2b:* 4c 3b 60 40           cmp    0x40(%rax),%r12 <-- trapping instruction
>>   2f: 75 08                 jne    0x39
>>   31: e9 ef 00 00 00       jmpq   0x125
>>   36: 48 89 d8             mov    %rbx,%rax
>>   39: 48 8b 18             mov    (%rax),%rbx
>>   3c: 48 85 db             test   %rbx,%rbx
>> 
>> and that oopsing instruction seems to be the compare of
>> "hooks_entry->orig_ops" from hooks_entry in this expression:
>> 
>>         if (hooks_entry && hooks_entry->orig_ops == reg) {
>> 
>> so hooks_entry() is bogus. It was gotten from
>> 
>>         hooks_entry = nf_hook_entry_head(net, reg);
>> 
>> but that's as far as I dug. And yes, I do have
>> CONFIG_NETFILTER_INGRESS=y in case that matters.
>> 
>> And all this code has changed pretty radically in commit e3b37f11e6e4
>> ("netfilter: replace list_head with single linked list"), and there
>> was clearly already something wrong with that code, with commit
>> 5119e4381a90 ("netfilter: Fix potential null pointer dereference")
>> adding the test against NULL. But I suspect that only hid the "oops,
>> it's actually not NULL, it loaded some uninitialized value" problem.
>> 
>> Over to the networking guys.. Ideas?
>
> Sorry, not off the top of my head.
> Pablo is currently travelling back home from netdev 1.2 in Tokyo,
> I can help starting Wednesday when I am back.
>
> One shot in the dark (not even compile tested; wonder if we can end up
> zapping bogus hook ...)
>

I was just about to build and test something similar:

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb..e84103f 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -189,7 +189,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 
 unlock:
        mutex_unlock(&nf_hook_mutex);
-       if (!hooks_entry) {
+       if (!hooks_entry || hooks_entry->orig_ops != reg) {
                WARN(1, "nf_unregister_net_hook: hook not found!\n");
                return;
        }

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10  1:35   ` Aaron Conole
@ 2016-10-10  2:49     ` Linus Torvalds
  2016-10-10  3:41       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2016-10-10  2:49 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

On Sun, Oct 9, 2016 at 6:35 PM, Aaron Conole <aconole@redhat.com> wrote:
>
> I was just about to build and test something similar:

So I haven't actually tested that one, but looking at the code, it
really looks very bogus. In fact, that code just looks like crap. It
does *not* do a proper "remove singly linked list entry". It's exactly
the kind of code that I rail against, and that people should never
write.

Any code that can't even traverse a linked list is not worth looking at.

There is one *correct* way to remove an entry from a singly linked
list, and it looks like this:

    struct entry **pp, *p;

    pp = &head;
    while ((p = *pp) != NULL) {
        if (right_entry(p)) {
            *pp = p->next;
            break;
        }
        pp = &p->next;
    }

and that's it. Nothing else. The above code exits the loop with "p"
containing the entry that was removed, or NULL if nothing was. It
can't get any simpler than that, but more importantly, anything more
complicated than that is WRONG.

Seriously, nothing else is acceptable. In particular, any linked list
traversal that makes a special case of the first entry or the last
entry should not be allowed to exist. Note how there is not a single
special case in the above correct code. It JustWorks(tm).

That nf_unregister_net_hook() code has all the signs of exactly that
kind of broken list-handling code: special-casing the head of the
loop, and having the loop condition test both current and that odd
"next to next" pointer etc. It's all very very wrong.

So I really see two options:

 - do that singly-linked list traversal right (and I'm serious:
nothing but the code above can ever be right)

 - don't make up your own list handling code at all, and use the
standard linux list code.

So either e3b37f11e6e4 needs to be reverted, or it needs to be taught
to use real list handling.  If the code doesn't want to use the
regular list.h (either the doubly linked one, or the hlist one), it
needs to at least learn to do list removal right.

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10  2:49     ` Linus Torvalds
@ 2016-10-10  3:41       ` Linus Torvalds
  2016-10-10  3:57         ` slab corruption with current -git David Miller
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-10-10  3:41 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]

On Sun, Oct 9, 2016 at 7:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There is one *correct* way to remove an entry from a singly linked
> list, and it looks like this:
>
>     struct entry **pp, *p;
>
>     pp = &head;
>     while ((p = *pp) != NULL) {
>         if (right_entry(p)) {
>             *pp = p->next;
>             break;
>         }
>         pp = &p->next;
>     }
>
> and that's it. Nothing else.

This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.

I repeat: it's ENTIRELY UNTESTED. I just converted the insertion and
deletion to the proper pattern, but I could easily have gotten the
insertion priority test the wrong way around entirely, for example. Or
it could simply have some other completely broken bug in it. It
compiles for me, but that's all I actually checked.

Note that the "correct way" of doing list operations also almost
inevitably is the shortest way by far, since it gets rid of all the
special cases. So the patch looks nice. It gets rid of the magic
"nf_set_hooks_head()" thing too, because once you do list following
right, the head is no different from any other pointer in the list.

So the patch stats look good:

 net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)

but again, it's entirely *entirely* untested. Please consider this
just a "this is generally how list insert/delete operations should be
done, avoiding special cases for the first entry".

ALSO NOTE! The code assumes that the "nf_hook_mutex" locking only
protects the actual *lists*, and that the address to the list can be
looked up without holding the lock. That's generally how things are
done, and it simplifies error handling (because you can do the "there
is no such list at all" test before you do anything else. But again, I
don't actually know the code, and if there is something that actually
expands the number of lists etc that depends on that mutex, then the
list head lookup may need to be inside the lock too.

           Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 5008 bytes --]

 net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb64046..814258641fcc 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -65,49 +65,24 @@ static DEFINE_MUTEX(nf_hook_mutex);
 #define nf_entry_dereference(e) \
 	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
 
-static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
-						const struct nf_hook_ops *reg)
+static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hook_head = NULL;
-
 	if (reg->pf != NFPROTO_NETDEV)
-		hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
-						 [reg->hooknum]);
-	else if (reg->hooknum == NF_NETDEV_INGRESS) {
+		return net->nf.hooks[reg->pf]+reg->hooknum;
+
 #ifdef CONFIG_NETFILTER_INGRESS
+	if (reg->hooknum == NF_NETDEV_INGRESS) {
 		if (reg->dev && dev_net(reg->dev) == net)
-			hook_head =
-				nf_entry_dereference(
-					reg->dev->nf_hooks_ingress);
-#endif
+			return &reg->dev->nf_hooks_ingress;
 	}
-	return hook_head;
-}
-
-/* must hold nf_hook_mutex */
-static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
-			      struct nf_hook_entry *entry)
-{
-	switch (reg->pf) {
-	case NFPROTO_NETDEV:
-#ifdef CONFIG_NETFILTER_INGRESS
-		/* We already checked in nf_register_net_hook() that this is
-		 * used from ingress.
-		 */
-		rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
 #endif
-		break;
-	default:
-		rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
-				   entry);
-		break;
-	}
+	return NULL;
 }
 
 int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
-	struct nf_hook_entry *entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *entry, *p;
 
 	if (reg->pf == NFPROTO_NETDEV) {
 #ifndef CONFIG_NETFILTER_INGRESS
@@ -119,6 +94,10 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 			return -EINVAL;
 	}
 
+	pp = nf_hook_entry_head(net, reg);
+	if (!pp)
+		return -EINVAL;
+
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
 		return -ENOMEM;
@@ -128,26 +107,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	entry->next	= NULL;
 
 	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-
-	if (hooks_entry && hooks_entry->orig_ops->priority > reg->priority) {
-		/* This is the case where we need to insert at the head */
-		entry->next = hooks_entry;
-		hooks_entry = NULL;
-	}
-
-	while (hooks_entry &&
-		reg->priority >= hooks_entry->orig_ops->priority &&
-		nf_entry_dereference(hooks_entry->next)) {
-		hooks_entry = nf_entry_dereference(hooks_entry->next);
-	}
 
-	if (hooks_entry) {
-		entry->next = nf_entry_dereference(hooks_entry->next);
-		rcu_assign_pointer(hooks_entry->next, entry);
-	} else {
-		nf_set_hooks_head(net, reg, entry);
+	/* Find the spot in the list */
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (reg->priority < p->orig_ops->priority)
+			break;
+		pp = &p->next;
 	}
+	rcu_assign_pointer(entry->next, p);
+	rcu_assign_pointer(*pp, p);
 
 	mutex_unlock(&nf_hook_mutex);
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -163,33 +131,23 @@ EXPORT_SYMBOL(nf_register_net_hook);
 
 void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *p;
 
-	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-	if (hooks_entry && hooks_entry->orig_ops == reg) {
-		nf_set_hooks_head(net, reg,
-				  nf_entry_dereference(hooks_entry->next));
-		goto unlock;
-	}
-	while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
-		struct nf_hook_entry *next =
-			nf_entry_dereference(hooks_entry->next);
-		struct nf_hook_entry *nnext;
+	pp = nf_hook_entry_head(net, reg);
+	if (WARN_ON_ONCE(!pp))
+		return;
 
-		if (next->orig_ops != reg) {
-			hooks_entry = next;
-			continue;
+	mutex_lock(&nf_hook_mutex);
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (p->orig_ops == reg) {
+			rcu_assign_pointer(*pp, p->next);
+			break;
 		}
-		nnext = nf_entry_dereference(next->next);
-		rcu_assign_pointer(hooks_entry->next, nnext);
-		hooks_entry = next;
-		break;
+		pp = &p->next;
 	}
-
-unlock:
 	mutex_unlock(&nf_hook_mutex);
-	if (!hooks_entry) {
+	if (!p) {
 		WARN(1, "nf_unregister_net_hook: hook not found!\n");
 		return;
 	}
@@ -201,10 +159,10 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
 	synchronize_net();
-	nf_queue_nf_hook_drop(net, hooks_entry);
+	nf_queue_nf_hook_drop(net, p);
 	/* other cpu might still process nfqueue verdict that used reg */
 	synchronize_net();
-	kfree(hooks_entry);
+	kfree(p);
 }
 EXPORT_SYMBOL(nf_unregister_net_hook);
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-10  3:41       ` Linus Torvalds
@ 2016-10-10  3:57         ` David Miller
  2016-10-10  8:24           ` David Miller
  2016-10-10 13:49         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Aaron Conole
  2016-10-11  5:39         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
  2 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2016-10-10  3:57 UTC (permalink / raw)
  To: torvalds
  Cc: aconole, fw, viro, akpm, axboe, tytso, cl, pablo, linux-kernel,
	linux-fsdevel, netdev, netfilter-devel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, 9 Oct 2016 20:41:17 -0700

> Note that the "correct way" of doing list operations also almost
> inevitably is the shortest way by far, since it gets rid of all the
> special cases. So the patch looks nice. It gets rid of the magic
> "nf_set_hooks_head()" thing too, because once you do list following
> right, the head is no different from any other pointer in the list.

Perhaps we should have some "slist" primitives added to
include/linux/list.h but since the comparison differs for each user I
guess it's hard to abstract in a way that's generic and inlines
properly.

I'll start taking a look at your patch and this stuff as well, thanks
Linus.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-10  3:57         ` slab corruption with current -git David Miller
@ 2016-10-10  8:24           ` David Miller
  2016-10-10 16:15             ` Linus Torvalds
  2016-10-11 13:17             ` Michal Kubecek
  0 siblings, 2 replies; 26+ messages in thread
From: David Miller @ 2016-10-10  8:24 UTC (permalink / raw)
  To: torvalds
  Cc: aconole, fw, viro, akpm, axboe, tytso, cl, pablo, linux-kernel,
	linux-fsdevel, netdev, netfilter-devel

From: David Miller <davem@davemloft.net>
Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sun, 9 Oct 2016 20:41:17 -0700
> 
>> Note that the "correct way" of doing list operations also almost
>> inevitably is the shortest way by far, since it gets rid of all the
>> special cases. So the patch looks nice. It gets rid of the magic
>> "nf_set_hooks_head()" thing too, because once you do list following
>> right, the head is no different from any other pointer in the list.
> 
> Perhaps we should have some "slist" primitives added to
> include/linux/list.h but since the comparison differs for each user I
> guess it's hard to abstract in a way that's generic and inlines
> properly.
> 
> I'll start taking a look at your patch and this stuff as well, thanks
> Linus.

So I've been reviewing this patch and it looks fine, but I also want
to figure out what is actually causing the OOPS and I can't spot it
yet.

One possible way to see that oops is to free the head entry of the
chain without unlinking it.  The next unregister will dereference a
POISON pointer.

Actually...

The POISON value comes not from a hook entry, but from the array of
pointers in the per-netns datastructure.

This means that the netns is possibly getting freed up before we
unregister the netfilter hooks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10  3:41       ` Linus Torvalds
  2016-10-10  3:57         ` slab corruption with current -git David Miller
@ 2016-10-10 13:49         ` Aaron Conole
  2016-10-10 16:28           ` Linus Torvalds
  2016-10-11  5:39         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
  2 siblings, 1 reply; 26+ messages in thread
From: Aaron Conole @ 2016-10-10 13:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Oct 9, 2016 at 7:49 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> There is one *correct* way to remove an entry from a singly linked
>> list, and it looks like this:
>>
>>     struct entry **pp, *p;
>>
>>     pp = &head;
>>     while ((p = *pp) != NULL) {
>>         if (right_entry(p)) {
>>             *pp = p->next;
>>             break;
>>         }
>>         pp = &p->next;
>>     }
>>
>> and that's it. Nothing else.

Sorry, I should have done that.

> This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.
>
> I repeat: it's ENTIRELY UNTESTED. I just converted the insertion and
> deletion to the proper pattern, but I could easily have gotten the
> insertion priority test the wrong way around entirely, for example. Or
> it could simply have some other completely broken bug in it. It
> compiles for me, but that's all I actually checked.

Okay, I'm looking it over.  Sorry for the mess.

> Note that the "correct way" of doing list operations also almost
> inevitably is the shortest way by far, since it gets rid of all the
> special cases. So the patch looks nice. It gets rid of the magic
> "nf_set_hooks_head()" thing too, because once you do list following
> right, the head is no different from any other pointer in the list.
>
> So the patch stats look good:
>
>  net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
>  1 file changed, 33 insertions(+), 75 deletions(-)
>
> but again, it's entirely *entirely* untested. Please consider this
> just a "this is generally how list insert/delete operations should be
> done, avoiding special cases for the first entry".

I'll review it, and test it.  Can you tell me what steps you took to
reproduce the oops?  I'll enable slab debugging and try to reproduce
without and with this patch (and I'll also look into David's recent
email as well).  Are you simply creating and removing network
namespaces (I did test that, but I should have done a better job)?

> ALSO NOTE! The code assumes that the "nf_hook_mutex" locking only
> protects the actual *lists*, and that the address to the list can be
> looked up without holding the lock. That's generally how things are
> done, and it simplifies error handling (because you can do the "there
> is no such list at all" test before you do anything else. But again, I
> don't actually know the code, and if there is something that actually
> expands the number of lists etc that depends on that mutex, then the
> list head lookup may need to be inside the lock too.

That should be correct, the nf_hook_mutex is only for protecting the
lists.

>            Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-10  8:24           ` David Miller
@ 2016-10-10 16:15             ` Linus Torvalds
  2016-10-11 13:17             ` Michal Kubecek
  1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-10-10 16:15 UTC (permalink / raw)
  To: David Miller
  Cc: Aaron Conole, Florian Westphal, Al Viro, Andrew Morton,
	Jens Axboe, Theodore Ts'o, Christoph Lameter,
	Pablo Neira Ayuso, Linux Kernel Mailing List, linux-fsdevel,
	Network Development, NetFilter

On Mon, Oct 10, 2016 at 1:24 AM, David Miller <davem@davemloft.net> wrote:
>
> So I've been reviewing this patch and it looks fine, but I also want
> to figure out what is actually causing the OOPS and I can't spot it
> yet.

Yeah, I'm not actually sure the old linked list implementation is
buggy - it might just be ugly. I tried to follow the old code, and I
couldn't.

So the patch I sent out was a combination of "that's not how you
should do singly linked lists" and "those special cases make me
worry". In particular, the old code really ended up doing odd things
in the "can't find entry" case, because it would exit the loop with a
non-NULL 'entry' just because the next entry was NULL..

> One possible way to see that oops is to free the head entry of the
> chain without unlinking it.  The next unregister will dereference a
> POISON pointer.
>
> Actually...
>
> The POISON value comes not from a hook entry, but from the array of
> pointers in the per-netns datastructure.
>
> This means that the netns is possibly getting freed up before we
> unregister the netfilter hooks.

That is certainly one possible explanation for it, yes. However, I
didn't think that part had changed, had it?

The other thing I find a bit odd in that new single-linked list code is this:

 - nf_hook_slow():
        ...
        RCU_INIT_POINTER(state->hook_entries, entry);

which makes me worried..  It copies the head entry of the list, and
maybe it will then (later) end up being used stale. I don't know. But
it looks a bit iffy.

              Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10 13:49         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Aaron Conole
@ 2016-10-10 16:28           ` Linus Torvalds
  2016-10-10 19:05             ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2016-10-10 16:28 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

On Mon, Oct 10, 2016 at 6:49 AM, Aaron Conole <aconole@redhat.com> wrote:
>
> Okay, I'm looking it over.  Sorry for the mess.

So as I already answered to Dave, I'm not actually sure that this was
the buggy code, or that my patch would make any difference at all.

I never got a good reproducer for the bug: I spent much of the weekend
rebooting, because it seems to happen only just after a reboot, as I
log in and start my usual thing.

I initially blamed some off filesystem or block layer issue ("Oh, it
only happens with a cold cache"), partly because the initial
non-poisoned slub oopses happened in filesystem code.

But I now think it's netfilter, and I *think* that what triggers it is
something like the bluetooth subsystem giving up or something. What I
do when I log into a new session tends to be to go to the kernel
subdirectory in one or two terminals, and fire up chrome to read
email. And the problem either happened within half a minute of me
doing that, or it never happens at all.

Which is why I ended up rebooting a *lot*. Just running the kernel
never triggered it.

(It took me some time to figure that out, which is basically why I did
almost no pull requests the whole weekend)

The journal entries for that invalid kernel access is somewhat suggestive:

  Oct 09 13:24:03 i7 dbus-daemon[1030]: [system] Failed to activate
service 'org.bluez': timed out

  Oct 09 13:24:09 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-hostnamed comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'

  Oct 09 13:24:09 i7 kernel: general protection fault: 0000 [#1] SMP

so it happened just as *some* network setup thing was finishing off (I
don't think it was systemd-hostnamed itself that necessarily matters,
but clearly something was finishing up as the netfilter problem
occurred.

> I'll review it, and test it.  Can you tell me what steps you took to
> reproduce the oops?

See above: I can't actually really "reproduce" it. It's probably
highly timing-dependent, and it is not unlikely that it's also very
much about specific setup. I'm running plain Fedora 24, I boot up, log
in, start two or three terminals, fire up chrome, and ...

So far I've seen the problem maybe 5-6 times, but a couple of those
were just silent hangs (I may have rebooted too quickly for things to
hit the disk, or the oops may just have killed the machine too hard).
Two I got the oops inside slub code, and I only have one successful
slub poisoning oops from netfilter.

(Part of the reason I only have one is that once I got that, I stopped
rebooting, and instead started looking at the netfilter code and
started to do some merge window pulls again because I felt that this
is *probably* the core reason, and I cant' afford to not do pulls
during the merge window for _too_ long).

              Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10 16:28           ` Linus Torvalds
@ 2016-10-10 19:05             ` Linus Torvalds
  2016-10-10 19:18               ` Aaron Conole
  2016-10-11  0:30               ` slab corruption with current -git David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-10-10 19:05 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

On Mon, Oct 10, 2016 at 9:28 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So as I already answered to Dave, I'm not actually sure that this was
> the buggy code, or that my patch would make any difference at all.

My patch does seem to fix things, and in fact the warning about "hook
not found" now triggers.

So I think the bug really was that the singly-linked list handling
code did not correctly handle the case of not finding the entry, and
then freed (incorrectly) the last one that wasn't actually unlinked.

In fact, I get quite a few warnings (56 total) about 30 seconds after
logging in:

[   54.213170] WARNING: CPU: 1 PID: 111 at net/netfilter/core.c:151
nf_unregister_net_hook+0x8e/0x170
... repeat 54 times ...
[   54.445520] WARNING: CPU: 7 PID: 111 at net/netfilter/core.c:151
nf_unregister_net_hook+0x8e/0x170

and looking in the journal, the first one is (again) immediately
preceded by that systemd-hostnamed service stopping:

  Oct 10 11:45:47 i7 audit[1546]: USER_LOGIN
  ...
  Oct 10 11:46:11 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=fprintd comm="systemd" exe="/usr/lib/systemd/systemd"
hostname=? addr=? terminal=? res=success'
  Oct 10 11:46:13 i7 pulseaudio[1697]: [pulseaudio] bluez5-util.c:
GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did
not receive a reply. Possible causes include: the remote application
did not send a reply, the message bus security policy blocked the
reply, the reply timeout expir
  Oct 10 11:46:13 i7 dbus-daemon[1003]: [system] Failed to activate
service 'org.bluez': timed out
  Oct 10 11:46:20 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-hostnamed comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'
  Oct 10 11:46:20 i7 kernel: ------------[ cut here ]------------
  Oct 10 11:46:20 i7 kernel: WARNING: CPU: 1 PID: 111 at
net/netfilter/core.c:151 nf_unregister_net_hook+0x8e/0x170

so I do think it's something to do with some network startup service
thing (perhaps dhcp, perhaps chrome, who knows) as I do my initial
login.

David - I think that also explains what was wrong with the old code.
In the old code, this loop:

        while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {

would exit with "hooks_entry" pointing to the last list entry (because
->next was NULL). Nothing was ever unlinked in the loop itself,
because it never actually found a matching entry, but then after the
loop it would free that last entry because it *thought* that was the
match.

My list rewrite fixes that.

Anyway, I'm assuming it will come to me from the networking tree after
more testing by the maintainers. You can add my

  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

to the patch, though.

David, if you want me to just commit that thing directly, I can
obviously do so, but I do think somebody should look at

 (a) that I actually got the priority list ordering right on the insertion side
 (b) what it is that makes it try to unregister that hook that isn't
on the list in the first place

but on the whole I consider this issue explained and solved. I'll
continue to run with my patch on my machine (just not committed).

              Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10 19:05             ` Linus Torvalds
@ 2016-10-10 19:18               ` Aaron Conole
  2016-10-11  0:30               ` slab corruption with current -git David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: Aaron Conole @ 2016-10-10 19:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Oct 10, 2016 at 9:28 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So as I already answered to Dave, I'm not actually sure that this was
>> the buggy code, or that my patch would make any difference at all.
>
> My patch does seem to fix things, and in fact the warning about "hook
> not found" now triggers.
>
> So I think the bug really was that the singly-linked list handling
> code did not correctly handle the case of not finding the entry, and
> then freed (incorrectly) the last one that wasn't actually unlinked.
>
> In fact, I get quite a few warnings (56 total) about 30 seconds after
> logging in:
>
> [   54.213170] WARNING: CPU: 1 PID: 111 at net/netfilter/core.c:151
> nf_unregister_net_hook+0x8e/0x170
> ... repeat 54 times ...
> [   54.445520] WARNING: CPU: 7 PID: 111 at net/netfilter/core.c:151
> nf_unregister_net_hook+0x8e/0x170
>
> and looking in the journal, the first one is (again) immediately
> preceded by that systemd-hostnamed service stopping:
>
>   Oct 10 11:45:47 i7 audit[1546]: USER_LOGIN
>   ...
>   Oct 10 11:46:11 i7 audit[1]: SERVICE_STOP pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=fprintd comm="systemd" exe="/usr/lib/systemd/systemd"
> hostname=? addr=? terminal=? res=success'
>   Oct 10 11:46:13 i7 pulseaudio[1697]: [pulseaudio] bluez5-util.c:
> GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did
> not receive a reply. Possible causes include: the remote application
> did not send a reply, the message bus security policy blocked the
> reply, the reply timeout expir
>   Oct 10 11:46:13 i7 dbus-daemon[1003]: [system] Failed to activate
> service 'org.bluez': timed out
>   Oct 10 11:46:20 i7 audit[1]: SERVICE_STOP pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
> msg='unit=systemd-hostnamed comm="systemd"
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
> res=success'
>   Oct 10 11:46:20 i7 kernel: ------------[ cut here ]------------
>   Oct 10 11:46:20 i7 kernel: WARNING: CPU: 1 PID: 111 at
> net/netfilter/core.c:151 nf_unregister_net_hook+0x8e/0x170
>
> so I do think it's something to do with some network startup service
> thing (perhaps dhcp, perhaps chrome, who knows) as I do my initial
> login.
>
> David - I think that also explains what was wrong with the old code.
> In the old code, this loop:
>
>         while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
>
> would exit with "hooks_entry" pointing to the last list entry (because
> ->next was NULL). Nothing was ever unlinked in the loop itself,
> because it never actually found a matching entry, but then after the
> loop it would free that last entry because it *thought* that was the
> match.
>
> My list rewrite fixes that.
>
> Anyway, I'm assuming it will come to me from the networking tree after
> more testing by the maintainers. You can add my
>
>   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> to the patch, though.
>
> David, if you want me to just commit that thing directly, I can
> obviously do so, but I do think somebody should look at
>
>  (a) that I actually got the priority list ordering right on the
>  insertion side

It looks correct.

Reviewed-by: Aaron Conole <aconole@bytheb.org>

>  (b) what it is that makes it try to unregister that hook that isn't
> on the list in the first place

This is a still problem, I think.  I wasn't able to reproduce the issue
on a fedora-23 VM.  My fedora 24 bare-metal system does trigger this,
though.  Not sure what changed in userspace/kernel interaction side (not
an excuse, but just an observation).

> but on the whole I consider this issue explained and solved. I'll
> continue to run with my patch on my machine (just not committed).

Okay.  Very sorry for this, again.

>               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-10 19:05             ` Linus Torvalds
  2016-10-10 19:18               ` Aaron Conole
@ 2016-10-11  0:30               ` David Miller
  2016-10-11  0:54                 ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: David Miller @ 2016-10-11  0:30 UTC (permalink / raw)
  To: torvalds
  Cc: aconole, fw, viro, akpm, axboe, tytso, cl, pablo, linux-kernel,
	linux-fsdevel, netdev, netfilter-devel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 10 Oct 2016 12:05:17 -0700

> David - I think that also explains what was wrong with the old code.
> In the old code, this loop:
> 
>         while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
> 
> would exit with "hooks_entry" pointing to the last list entry (because
> ->next was NULL). Nothing was ever unlinked in the loop itself,
> because it never actually found a matching entry, but then after the
> loop it would free that last entry because it *thought* that was the
> match.

It only does this when the ops don't match, but yes it can happen.

Linus can you add some extra info to that:

		WARN(1, "nf_unregister_net_hook: hook not found!\n");

diagnostic, such as the reg->pf and reg->hooknum values?

That might help track down why this is happening in the first
place.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-11  0:30               ` slab corruption with current -git David Miller
@ 2016-10-11  0:54                 ` Linus Torvalds
  0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-10-11  0:54 UTC (permalink / raw)
  To: David Miller
  Cc: Aaron Conole, Florian Westphal, Al Viro, Andrew Morton,
	Jens Axboe, Theodore Ts'o, Christoph Lameter,
	Pablo Neira Ayuso, Linux Kernel Mailing List, linux-fsdevel,
	Network Development, NetFilter

On Mon, Oct 10, 2016 at 5:30 PM, David Miller <davem@davemloft.net> wrote:
>
> Linus can you add some extra info to that:

Sure. I made it a WARN_ON_ONCE(), but then always just printed the
pf/hooknum. It's all over the map:

 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=2
 reg->pf=2 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=2
 reg->pf=10 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=2
 reg->pf=7 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=1
 reg->pf=2 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=4
 reg->pf=7 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=2
 reg->pf=2 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=2
 reg->pf=10 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=2
 reg->pf=7 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=0
 reg->pf=2 and reg->hooknum=3
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=4
 reg->pf=2 and reg->hooknum=1
 reg->pf=2 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=0
 reg->pf=10 and reg->hooknum=3
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=4
 reg->pf=10 and reg->hooknum=1
 reg->pf=10 and reg->hooknum=1
 reg->pf=7 and reg->hooknum=3
 reg->pf=7 and reg->hooknum=4
 reg->pf=7 and reg->hooknum=0

and putting that through "sort -n" and "uniq -c", I get:

      4  reg->pf=10 and reg->hooknum=0
      4  reg->pf=10 and reg->hooknum=1
      2  reg->pf=10 and reg->hooknum=2
      6  reg->pf=10 and reg->hooknum=3
      6  reg->pf=10 and reg->hooknum=4
      4  reg->pf=2 and reg->hooknum=0
      4  reg->pf=2 and reg->hooknum=1
      2  reg->pf=2 and reg->hooknum=2
      6  reg->pf=2 and reg->hooknum=3
      6  reg->pf=2 and reg->hooknum=4
      2  reg->pf=7 and reg->hooknum=0
      2  reg->pf=7 and reg->hooknum=1
      2  reg->pf=7 and reg->hooknum=2
      4  reg->pf=7 and reg->hooknum=3

which doesn't look much better. But clearly there's a lot of those
"try to unregister stuff that you can't even find".

Maybe it tells you something.

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-10  3:41       ` Linus Torvalds
  2016-10-10  3:57         ` slab corruption with current -git David Miller
  2016-10-10 13:49         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Aaron Conole
@ 2016-10-11  5:39         ` Linus Torvalds
  2016-10-11  5:47           ` Linus Torvalds
  2 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2016-10-11  5:39 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

On Sun, Oct 9, 2016 at 8:41 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> This COMPLETELY UNTESTED patch tries to fix the nf_hook_entry code to do this.
>
> I repeat: it's ENTIRELY UNTESTED.

Gaah.

That patch was subtle garbage.

The "add to list" thing did this:

        rcu_assign_pointer(entry->next, p);
        rcu_assign_pointer(*pp, p);

which is not so subtly broken - that second assignment just assigns
"p" to "*pp", but that was what *pp already contained. Too much
cut-and-paste.

That also explains why I then get the NOT FOUND case, because the add
never actually worked.

It *should* be

        rcu_assign_pointer(entry->next, p);
        rcu_assign_pointer(*pp, entry);

and then the warnings about "not found" are gone.

Duh.

I guess I will have to double-check that the slub corruption is gone
still with that fixed.

Anyway, new version of the patch (just that one line changed) attached.

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 5012 bytes --]

 net/netfilter/core.c | 108 ++++++++++++++++-----------------------------------
 1 file changed, 33 insertions(+), 75 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c9d90eb64046..fcb5d1df11e9 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -65,49 +65,24 @@ static DEFINE_MUTEX(nf_hook_mutex);
 #define nf_entry_dereference(e) \
 	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
 
-static struct nf_hook_entry *nf_hook_entry_head(struct net *net,
-						const struct nf_hook_ops *reg)
+static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hook_head = NULL;
-
 	if (reg->pf != NFPROTO_NETDEV)
-		hook_head = nf_entry_dereference(net->nf.hooks[reg->pf]
-						 [reg->hooknum]);
-	else if (reg->hooknum == NF_NETDEV_INGRESS) {
+		return net->nf.hooks[reg->pf]+reg->hooknum;
+
 #ifdef CONFIG_NETFILTER_INGRESS
+	if (reg->hooknum == NF_NETDEV_INGRESS) {
 		if (reg->dev && dev_net(reg->dev) == net)
-			hook_head =
-				nf_entry_dereference(
-					reg->dev->nf_hooks_ingress);
-#endif
+			return &reg->dev->nf_hooks_ingress;
 	}
-	return hook_head;
-}
-
-/* must hold nf_hook_mutex */
-static void nf_set_hooks_head(struct net *net, const struct nf_hook_ops *reg,
-			      struct nf_hook_entry *entry)
-{
-	switch (reg->pf) {
-	case NFPROTO_NETDEV:
-#ifdef CONFIG_NETFILTER_INGRESS
-		/* We already checked in nf_register_net_hook() that this is
-		 * used from ingress.
-		 */
-		rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
 #endif
-		break;
-	default:
-		rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
-				   entry);
-		break;
-	}
+	return NULL;
 }
 
 int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
-	struct nf_hook_entry *entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *entry, *p;
 
 	if (reg->pf == NFPROTO_NETDEV) {
 #ifndef CONFIG_NETFILTER_INGRESS
@@ -119,6 +94,10 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 			return -EINVAL;
 	}
 
+	pp = nf_hook_entry_head(net, reg);
+	if (!pp)
+		return -EINVAL;
+
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
 		return -ENOMEM;
@@ -128,26 +107,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	entry->next	= NULL;
 
 	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-
-	if (hooks_entry && hooks_entry->orig_ops->priority > reg->priority) {
-		/* This is the case where we need to insert at the head */
-		entry->next = hooks_entry;
-		hooks_entry = NULL;
-	}
-
-	while (hooks_entry &&
-		reg->priority >= hooks_entry->orig_ops->priority &&
-		nf_entry_dereference(hooks_entry->next)) {
-		hooks_entry = nf_entry_dereference(hooks_entry->next);
-	}
 
-	if (hooks_entry) {
-		entry->next = nf_entry_dereference(hooks_entry->next);
-		rcu_assign_pointer(hooks_entry->next, entry);
-	} else {
-		nf_set_hooks_head(net, reg, entry);
+	/* Find the spot in the list */
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (reg->priority < p->orig_ops->priority)
+			break;
+		pp = &p->next;
 	}
+	rcu_assign_pointer(entry->next, p);
+	rcu_assign_pointer(*pp, entry);
 
 	mutex_unlock(&nf_hook_mutex);
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -163,33 +131,23 @@ EXPORT_SYMBOL(nf_register_net_hook);
 
 void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 {
-	struct nf_hook_entry *hooks_entry;
+	struct nf_hook_entry __rcu **pp;
+	struct nf_hook_entry *p;
 
-	mutex_lock(&nf_hook_mutex);
-	hooks_entry = nf_hook_entry_head(net, reg);
-	if (hooks_entry && hooks_entry->orig_ops == reg) {
-		nf_set_hooks_head(net, reg,
-				  nf_entry_dereference(hooks_entry->next));
-		goto unlock;
-	}
-	while (hooks_entry && nf_entry_dereference(hooks_entry->next)) {
-		struct nf_hook_entry *next =
-			nf_entry_dereference(hooks_entry->next);
-		struct nf_hook_entry *nnext;
+	pp = nf_hook_entry_head(net, reg);
+	if (WARN_ON_ONCE(!pp))
+		return;
 
-		if (next->orig_ops != reg) {
-			hooks_entry = next;
-			continue;
+	mutex_lock(&nf_hook_mutex);
+	while ((p = nf_entry_dereference(*pp)) != NULL) {
+		if (p->orig_ops == reg) {
+			rcu_assign_pointer(*pp, p->next);
+			break;
 		}
-		nnext = nf_entry_dereference(next->next);
-		rcu_assign_pointer(hooks_entry->next, nnext);
-		hooks_entry = next;
-		break;
+		pp = &p->next;
 	}
-
-unlock:
 	mutex_unlock(&nf_hook_mutex);
-	if (!hooks_entry) {
+	if (!p) {
 		WARN(1, "nf_unregister_net_hook: hook not found!\n");
 		return;
 	}
@@ -201,10 +159,10 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
 #endif
 	synchronize_net();
-	nf_queue_nf_hook_drop(net, hooks_entry);
+	nf_queue_nf_hook_drop(net, p);
 	/* other cpu might still process nfqueue verdict that used reg */
 	synchronize_net();
-	kfree(hooks_entry);
+	kfree(p);
 }
 EXPORT_SYMBOL(nf_unregister_net_hook);
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))
  2016-10-11  5:39         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
@ 2016-10-11  5:47           ` Linus Torvalds
  2016-10-11  8:57             ` slab corruption with current -git David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2016-10-11  5:47 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Florian Westphal, Al Viro, Andrew Morton, Jens Axboe,
	Ted Ts'o, Christoph Lameter, David Miller, Pablo Neira Ayuso,
	Linux Kernel Mailing List, linux-fsdevel, Network Development,
	NetFilter

On Mon, Oct 10, 2016 at 10:39 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I guess I will have to double-check that the slub corruption is gone
> still with that fixed.

So I'm not getting any warnings now from SLUB debugging. So the
original bug seems to not have re-surfaced, and the registration bug
is gone, so now the unregistration doesn't warn about anything either.

But I only rebooted three times.

           Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-11  5:47           ` Linus Torvalds
@ 2016-10-11  8:57             ` David Miller
  2016-10-13  6:02               ` Markus Trippelsdorf
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2016-10-11  8:57 UTC (permalink / raw)
  To: torvalds
  Cc: aconole, fw, viro, akpm, axboe, tytso, cl, pablo, linux-kernel,
	linux-fsdevel, netdev, netfilter-devel

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 10 Oct 2016 22:47:50 -0700

> On Mon, Oct 10, 2016 at 10:39 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I guess I will have to double-check that the slub corruption is gone
>> still with that fixed.
> 
> So I'm not getting any warnings now from SLUB debugging. So the
> original bug seems to not have re-surfaced, and the registration bug
> is gone, so now the unregistration doesn't warn about anything either.
> 
> But I only rebooted three times.

Looks good to me, I applied it to my tree with your signoff and will
send you a pull request right now.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-10  8:24           ` David Miller
  2016-10-10 16:15             ` Linus Torvalds
@ 2016-10-11 13:17             ` Michal Kubecek
  2016-10-11 13:55               ` Aaron Conole
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Kubecek @ 2016-10-11 13:17 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, aconole, fw, viro, akpm, axboe, tytso, cl, pablo,
	linux-kernel, linux-fsdevel, netdev, netfilter-devel

On Mon, Oct 10, 2016 at 04:24:01AM -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)
> 
> This means that the netns is possibly getting freed up before we
> unregister the netfilter hooks.

Sounds a bit like the issue discussed here:

  https://marc.info/?l=netfilter-devel&m=146980917627262&w=2

Could it be (partly) the same race condition?

Michal Kubecek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-11 13:17             ` Michal Kubecek
@ 2016-10-11 13:55               ` Aaron Conole
  0 siblings, 0 replies; 26+ messages in thread
From: Aaron Conole @ 2016-10-11 13:55 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: David Miller, torvalds, fw, viro, akpm, axboe, tytso, cl, pablo,
	linux-kernel, linux-fsdevel, netdev, netfilter-devel

Michal Kubecek <mkubecek@suse.cz> writes:

> On Mon, Oct 10, 2016 at 04:24:01AM -0400, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sun, 09 Oct 2016 23:57:45 -0400 (EDT)
>> 
>> This means that the netns is possibly getting freed up before we
>> unregister the netfilter hooks.
>
> Sounds a bit like the issue discussed here:
>
>   https://marc.info/?l=netfilter-devel&m=146980917627262&w=2
>
> Could it be (partly) the same race condition?

It looks like it's possible.  It appears that there could be a
long-standing race between these.  I'll look into it more carefully, and
discuss with Pablo and Florian when they're situated from netdev
conference.

-Aaron

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-11  8:57             ` slab corruption with current -git David Miller
@ 2016-10-13  6:02               ` Markus Trippelsdorf
  2016-10-13  6:06                 ` Markus Trippelsdorf
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Trippelsdorf @ 2016-10-13  6:02 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, aconole, fw, viro, akpm, axboe, tytso, cl, pablo,
	linux-kernel, linux-fsdevel, netdev, netfilter-devel

On 2016.10.11 at 04:57 -0400, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Mon, 10 Oct 2016 22:47:50 -0700
> 
> > On Mon, Oct 10, 2016 at 10:39 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> I guess I will have to double-check that the slub corruption is gone
> >> still with that fixed.
> > 
> > So I'm not getting any warnings now from SLUB debugging. So the
> > original bug seems to not have re-surfaced, and the registration bug
> > is gone, so now the unregistration doesn't warn about anything either.
> > 
> > But I only rebooted three times.
> 
> Looks good to me, I applied it to my tree with your signoff and will
> send you a pull request right now.

I'm still seeing:

nf_conntrack version 0.5.0 (4096 buckets, 16384 max)
ctnetlink v0.93: registering with nfnetlink.
ip_tables: (C) 2000-2006 Netfilter Core Team
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
 u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
 ^
RIP: 0010:[<ffffffff8166e561>]  [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
RSP: 0018:ffffc9000000bcc0  EFLAGS: 00010286
RAX: ffff88001e5af9c0 RBX: ffff88001e605480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001e5b0a20
RBP: ffffc9000000bcd8 R08: 000000001fd0e000 R09: 0000000000000000
R10: ffff88001e5b09c0 R11: 0000000000000067 R12: ffff88001e5af9c0
R13: ffffffff81c5c0c8 R14: 0000000000000003 R15: ffff88001e605480
FS:  0000000000000000(0000) GS:ffff88001fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88001f45ca18 CR3: 0000000001c07000 CR4: 00000000000006f0
 [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160
 [<ffffffff8166eaaf>] nf_register_net_hooks+0x3f/0xa0
 [<ffffffff816d6715>] ipt_register_table+0xe5/0x110
 [<ffffffff816d6815>] iptable_filter_table_init.part.1+0x55/0x80
 [<ffffffff816d688b>] iptable_filter_net_init+0x2b/0x30
 [<ffffffff8163edd7>] ops_init+0x47/0x150
 [<ffffffff8163f0c6>] register_pernet_operations+0xd6/0x170
 [<ffffffff8163fb77>] register_pernet_subsys+0x27/0x40
 [<ffffffff81cb9de3>] iptable_filter_init+0x33/0x4b
 [<ffffffff81c8bef0>] do_one_initcall+0x8b/0x113
 [<ffffffff81c8c091>] kernel_init_freeable+0x119/0x1a1
 [<ffffffff816efd09>] kernel_init+0x9/0x100
 [<ffffffff816f4e12>] ret_from_fork+0x22/0x30
 [<ffffffffffffffff>] 0xffffffffffffffff
NET: Registered protocol family 17
9pnet: Installing 9P2000 support


-- 
Markus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-13  6:02               ` Markus Trippelsdorf
@ 2016-10-13  6:06                 ` Markus Trippelsdorf
       [not found]                   ` <CA+55aFwsUR4-YmOYgJOOO4a2e48M4_tk7YhAo4s5KZQQxUjpZw@mail.gmail.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Markus Trippelsdorf @ 2016-10-13  6:06 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, aconole, fw, viro, akpm, axboe, tytso, cl, pablo,
	linux-kernel, linux-fsdevel, netdev, netfilter-devel

On 2016.10.13 at 08:02 +0200, Markus Trippelsdorf wrote:
> On 2016.10.11 at 04:57 -0400, David Miller wrote:
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > Date: Mon, 10 Oct 2016 22:47:50 -0700
> > 
> > > On Mon, Oct 10, 2016 at 10:39 PM, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > >>
> > >> I guess I will have to double-check that the slub corruption is gone
> > >> still with that fixed.
> > > 
> > > So I'm not getting any warnings now from SLUB debugging. So the
> > > original bug seems to not have re-surfaced, and the registration bug
> > > is gone, so now the unregistration doesn't warn about anything either.
> > > 
> > > But I only rebooted three times.
> > 
> > Looks good to me, I applied it to my tree with your signoff and will
> > send you a pull request right now.
> 
> I'm still seeing:
> 
> nf_conntrack version 0.5.0 (4096 buckets, 16384 max)
> ctnetlink v0.93: registering with nfnetlink.
> ip_tables: (C) 2000-2006 Netfilter Core Team
> WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001e605480)
> 4055601e0088ffff000000000000000090686d81ffffffff0000000000000000
>  u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
>  ^
> RIP: 0010:[<ffffffff8166e561>]  [<ffffffff8166e561>] nf_register_net_hook+0x51/0x160

This is nf_register_net_hook at net/netfilter/core.c:106

-- 
Markus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
       [not found]                   ` <CA+55aFwsUR4-YmOYgJOOO4a2e48M4_tk7YhAo4s5KZQQxUjpZw@mail.gmail.com>
@ 2016-10-13  6:27                       ` Markus Trippelsdorf
  0 siblings, 0 replies; 26+ messages in thread
From: Markus Trippelsdorf @ 2016-10-13  6:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Jens Axboe, linux-kernel, aconole,
	David Miller, pablo, linux-fsdevel, viro, netfilter-devel,
	netdev, akpm, fw, tytso

On 2016.10.12 at 23:18 -0700, Linus Torvalds wrote:
> On Oct 12, 2016 23:07, "Markus Trippelsdorf" <markus@trippelsdorf.de> wrote:
> >
> > This is nf_register_net_hook at net/netfilter/core.c:106
>
> The "*regs" access?

Yeah.

105         entry->orig_ops = reg;
106         entry->ops      = *reg;
107         entry->next     = NULL;

--
Markus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
@ 2016-10-13  6:27                       ` Markus Trippelsdorf
  0 siblings, 0 replies; 26+ messages in thread
From: Markus Trippelsdorf @ 2016-10-13  6:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Jens Axboe, linux-kernel, aconole,
	David Miller, pablo, linux-fsdevel, viro, netfilter-devel,
	netdev, akpm, fw, tytso

On 2016.10.12 at 23:18 -0700, Linus Torvalds wrote:
> On Oct 12, 2016 23:07, "Markus Trippelsdorf" <markus@trippelsdorf.de> wrote:
> >
> > This is nf_register_net_hook at net/netfilter/core.c:106
>
> The "*regs" access?

Yeah.

105         entry->orig_ops = reg;
106         entry->ops      = *reg;
107         entry->next     = NULL;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-13  6:27                       ` Markus Trippelsdorf
  (?)
@ 2016-10-13 19:49                       ` Linus Torvalds
  2016-10-13 20:43                         ` Florian Westphal
  2016-10-13 21:32                         ` Al Viro
  -1 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2016-10-13 19:49 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Christoph Lameter, Jens Axboe, Linux Kernel Mailing List,
	Aaron Conole, David Miller, Pablo Neira Ayuso, linux-fsdevel,
	Al Viro, NetFilter, Network Development, Andrew Morton,
	Florian Westphal, Theodore Ts'o

On Wed, Oct 12, 2016 at 11:27 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
>
> Yeah.
>
> 105         entry->orig_ops = reg;
> 106         entry->ops      = *reg;
> 107         entry->next     = NULL;

So ipt_register_table() does:

        ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));

and then nf_register_net_hooks() just does

        for (i = 0; i < n; i++) {
                err = nf_register_net_hook(net, &reg[i]);

so if the *reg is uninitialized, it means that it's the 'ops[]' array
that isn't actually really valid in "valid_hooks". Odd. They should
all be initialized by xt_hook_ops_alloc(), no?

That said, xt_hook_ops_alloc() itself is odd. Lookie here, this is the
loop that initializes things:

        for (i = 0, hooknum = 0; i < num_hooks && hook_mask != 0;
             hook_mask >>= 1, ++hooknum) {

and it makes no sense to me how that tests *both* "i < num_hools" and
"hook_mask != 0".

Why? Because

    num_hooks = hweight32(hook_mask);

so it's entirely redundant. num_hooks is already how many bits are on
in hook_mask, so that test is just duplicating the same thing twice
("have we done less than that number of bits" and "do we have any bits
less").

I don't know. There's something odd going on. Regardless, thsi is a
different problem from the nf_register_net_hook() list handling, so
I'll leave it to the networking people. David?

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-13 19:49                       ` Linus Torvalds
@ 2016-10-13 20:43                         ` Florian Westphal
  2016-10-13 21:32                         ` Al Viro
  1 sibling, 0 replies; 26+ messages in thread
From: Florian Westphal @ 2016-10-13 20:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Markus Trippelsdorf, Christoph Lameter, Jens Axboe,
	Linux Kernel Mailing List, Aaron Conole, David Miller,
	Pablo Neira Ayuso, linux-fsdevel, Al Viro, NetFilter,
	Network Development, Andrew Morton, Florian Westphal,
	Theodore Ts'o

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, Oct 12, 2016 at 11:27 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> >
> > Yeah.
> >
> > 105         entry->orig_ops = reg;
> > 106         entry->ops      = *reg;
> > 107         entry->next     = NULL;
> 
> So ipt_register_table() does:
> 
>         ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));
> 
> and then nf_register_net_hooks() just does
> 
>         for (i = 0; i < n; i++) {
>                 err = nf_register_net_hook(net, &reg[i]);
> 
> so if the *reg is uninitialized, it means that it's the 'ops[]' array
> that isn't actually really valid in "valid_hooks". Odd. They should
> all be initialized by xt_hook_ops_alloc(), no?

Its only partially initialized.  Looking at Markus' splat
its complaining about first 16 bytes (list_head), whose contents are indeed
undefined when it gets copied to entry->ops.

For the time being this seems like the most simple "fix", until we
disentangle the hook description (which should be const) from run-time
allocated data.

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index e0aa7c1d0224..fc4977456c30 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1513,7 +1513,7 @@ xt_hook_ops_alloc(const struct xt_table *table,
		nf_hookfn *fn)
        if (!num_hooks)
		return ERR_PTR(-EINVAL);
 
-       ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
+       ops = kcalloc(num_hooks, sizeof(*ops), GFP_KERNEL);
        if (ops == NULL)
		return ERR_PTR(-ENOMEM);

I'll pass such a patch to Pablo.

> That said, xt_hook_ops_alloc() itself is odd. Lookie here, this is the
> loop that initializes things:
> 
>         for (i = 0, hooknum = 0; i < num_hooks && hook_mask != 0;
>              hook_mask >>= 1, ++hooknum) {
> 
> and it makes no sense to me how that tests *both* "i < num_hools" and
> "hook_mask != 0".

Right, one of these is enough.

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: slab corruption with current -git
  2016-10-13 19:49                       ` Linus Torvalds
  2016-10-13 20:43                         ` Florian Westphal
@ 2016-10-13 21:32                         ` Al Viro
  1 sibling, 0 replies; 26+ messages in thread
From: Al Viro @ 2016-10-13 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Markus Trippelsdorf, Christoph Lameter, Jens Axboe,
	Linux Kernel Mailing List, Aaron Conole, David Miller,
	Pablo Neira Ayuso, linux-fsdevel, NetFilter, Network Development,
	Andrew Morton, Florian Westphal, Theodore Ts'o

On Thu, Oct 13, 2016 at 12:49:33PM -0700, Linus Torvalds wrote:

> That said, xt_hook_ops_alloc() itself is odd. Lookie here, this is the
> loop that initializes things:
> 
>         for (i = 0, hooknum = 0; i < num_hooks && hook_mask != 0;
>              hook_mask >>= 1, ++hooknum) {
> 
> and it makes no sense to me how that tests *both* "i < num_hools" and
> "hook_mask != 0".
> 
> Why? Because
> 
>     num_hooks = hweight32(hook_mask);
> 
> so it's entirely redundant. num_hooks is already how many bits are on
> in hook_mask, so that test is just duplicating the same thing twice
> ("have we done less than that number of bits" and "do we have any bits
> less").
> 
> I don't know. There's something odd going on. Regardless, thsi is a
> different problem from the nf_register_net_hook() list handling, so
> I'll leave it to the networking people. David?

Hey, I remember looking through that stuff. <checks>  There it is, in
a thread started by Krause Randomness(tm)...  Short version: nf_hook_ops
is a mess - it's embedded into different objects, with different subsets
of fields used depending on the containing object and I would seriously
suggest moving some of those into those containing objects.

--------------------------------------------------------------------------
On Thu, Sep 01, 2016 at 08:10:44AM -0500, Eric Sandeen wrote:
> On 8/4/16 8:57 AM, Al Viro wrote:
> 
> > Don't feed the troll.  On all paths leading to that place we have
> >         result->name = kname;
> >         len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
> > or
> >                 result->name = kname;
> >                 len = strncpy_from_user(kname, filename, PATH_MAX);
> > with failure exits taken if strncpy_from_user() returns an error, which means
> > that the damn thing has already been copied into.
> > 
> > FWIW, it looks a lot like buggered kmemcheck; as usual, he can't be bothered
> > to mention which kernel version would it be (let alone how to reproduce it
> > on the kernel in question), but IIRC davej had run into some instrumentation
> > breakage lately.
> 
> The original report is in https://bugzilla.kernel.org/show_bug.cgi?id=120651
> if anyone is interested in it.

	What the hell does that one have to getname_flags(), other than having
attracted the same... something on the edge of failing the Turing Test?

	FWIW, looking at the netfilter one...  That's nf_register_net_hook()
hitting
        entry->ops      = *reg;
with reg pointing to something uninitialized (according to kmemcheck, that is,
and presuming that it's not an instrumentation bug).  With the callchain
in report, it came (all in the same assumptions) from
	nf_register_net_hooks(net, ops, hweight32(table->valid_hooks))
with hweight32(table->valid_hooks) being greater than the amount of
initialized entries in ops[] (call site in ipt_register_table()).

	This "ops" ought to be net/ipv4/netfilter/iptable_filter.c:filter_ops,
allocated by
        filter_ops = xt_hook_ops_alloc(&packet_filter, iptable_filter_hook);
in iptable_filter_init().  "table" is &packet_filter and its contents ought
to be unchanged, so ->valid_hooks in there is FILTER_VALID_HOOKS, i.e.
((1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD) | (1 << NF_INET_LOCAL_OUT)).

	Which is to say, filter_ops[] had fewer than 3 initialized elements
when it got to the call of iptable_filter_table_init()...  Since filter_ops
hadn't been NULL, the xt_hook_ops_alloc() call above must've already been
done.  Said xt_hook_ops_alloc() should've allocated a 3-element array and
hooked through all of it, so it's not a wholesale uninitialized element, it's
uninitialized parts of one...

	What gets initialized is ->hook, ->pf, ->hooknum and ->priority.
Let's figure out the offsets:
	0: list (two pointers, i.e. 16 bytes)
	0x10: hook (8)
	0x18: dev (8)
	0x20: priv (8)
	0x28: pf (1)
	0x29: padding (3)
	0x2c: hooknum (4)
	0x30: priority (4)
	0x34: padding (8)
	
OK...  The address of the damn thing is apparently ffff880037b4bd80 and
we see complaint about the accesses at offsets 0, 0x18, 8, 0x20 and then
the same pattern with 0x38 and 0x70 added (i.e. the same fields in the next
two elements of the same array).  Then there are similar complaints, but
with a different call chain (iptable_mangle instead of iptable_filter).

These offsets are ->list, ->dev and ->priv, and those are exactly the ones
not initialized by xt_hook_ops_alloc().  Looking at the nf_register_net_hook(),
we have
        list_add_rcu(&entry->ops.list, elem->list.prev);
a bit further down the road.  ->dev and ->priv are left uninitialized (and
very likely - unused).

I would say it's a false positive.  struct nf_hook_ops is embedded into a
bunch of different objects, with different subsets of fields getting used.
IMO it's a bad idea (in particular, I really wonder if ->list would've
been better off moved into (some of) the containing suckers), but it's
not a bug per se, just a design choice asking for trouble.  One way of
getting kmemcheck off your back would be to switch xt_hook_ops_alloc() from
	ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
to
	ops = kcalloc(num_hooks, sizeof(*ops), GFP_KERNEL);
which might have some merits beyond making kmemcheck STFU...
--------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-10-13 21:33 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-09 21:31 slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
2016-10-10  0:51 ` Florian Westphal
2016-10-10  1:35   ` Aaron Conole
2016-10-10  2:49     ` Linus Torvalds
2016-10-10  3:41       ` Linus Torvalds
2016-10-10  3:57         ` slab corruption with current -git David Miller
2016-10-10  8:24           ` David Miller
2016-10-10 16:15             ` Linus Torvalds
2016-10-11 13:17             ` Michal Kubecek
2016-10-11 13:55               ` Aaron Conole
2016-10-10 13:49         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Aaron Conole
2016-10-10 16:28           ` Linus Torvalds
2016-10-10 19:05             ` Linus Torvalds
2016-10-10 19:18               ` Aaron Conole
2016-10-11  0:30               ` slab corruption with current -git David Miller
2016-10-11  0:54                 ` Linus Torvalds
2016-10-11  5:39         ` slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice)) Linus Torvalds
2016-10-11  5:47           ` Linus Torvalds
2016-10-11  8:57             ` slab corruption with current -git David Miller
2016-10-13  6:02               ` Markus Trippelsdorf
2016-10-13  6:06                 ` Markus Trippelsdorf
     [not found]                   ` <CA+55aFwsUR4-YmOYgJOOO4a2e48M4_tk7YhAo4s5KZQQxUjpZw@mail.gmail.com>
2016-10-13  6:27                     ` Markus Trippelsdorf
2016-10-13  6:27                       ` Markus Trippelsdorf
2016-10-13 19:49                       ` Linus Torvalds
2016-10-13 20:43                         ` Florian Westphal
2016-10-13 21:32                         ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.