From: Linus Torvalds
Date: Fri, 21 Apr 2017 10:42:48 -0700
Subject: Re: net/core: BUG in unregister_netdevice_many
To: Andrey Konovalov, Eric Dumazet
Cc: "David S. Miller", Alexey Kuznetsov, James Morris,
    Hideaki YOSHIFUJI, Patrick McHardy, netdev, LKML, Alexander Duyck,
    David Ahern, Daniel Borkmann, tcharding, Jiri Pirko,
    stephen hemminger, Dmitry Vyukov, Kostya Serebryany, syzkaller
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 21, 2017 at 10:25 AM, Linus Torvalds wrote:
>
> I'm assuming that the real cause is simply that "dev->reg_state" ends
> up being NETREG_UNREGISTERING or something. Maybe the BUG_ON() could
> be just removed, and replaced by the previous warning about
> NETREG_UNINITIALIZED.
>
> Something like the attached (TOTALLY UNTESTED) patch.

.. might as well test it.

That patch doesn't fix the problem, but it does show that yes, it was
NETREG_UNREGISTERING:

    unregister_netdevice: device pim6reg/ffff962dc4606000 was not registered (2)

but then immediately afterwards we get

    general protection fault: 0000 [#1] SMP
    Workqueue: netns cleanup_net
    RIP: 0010:dev_shutdown+0xe/0xc0
    Call Trace:
     rollback_registered_many+0x2a5/0x440
     unregister_netdevice_many+0x1e/0xb0
     default_device_exit_batch+0x145/0x170

which is due to a

    mov    0x388(%rdi),%eax

where %rdi is 0xdead000000000090.

That is at the very beginning of dev_shutdown, it's "dev" itself that
has that value, so it comes from (_another_) invocation of
rollback_registered_many(), when it does that

    list_for_each_entry(dev, head, unreg_list) {

so it seems to be a case of another "list_del() leaves list in bad
state", and it was the added test for "dev->reg_state !=
NETREG_REGISTERED" that did that

    list_del(&dev->unreg_list);

and left random contents in the unreg_list.

So that "handle error case" was almost certainly just buggy too.

And the bug seems to be that we're trying to unregister a netdevice
that has already been unregistered.

Over to Eric and networking people. This oops is user-triggerable, and
leaves the machine in a bad state (the original BUG_ON() and the new
GP fault both happen while holding the RTNL, so networking is not
healthy afterwards).

             Linus
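
For reference, the 0xdead000000000090 value is what you would expect
from following a list_head that list_del() has already poisoned. The
userspace sketch below is not kernel code. It assumes a 64-bit build,
the x86-64 poison values as I understand them for kernels of that era
(LIST_POISON1 = 0x100 + 0xdead000000000000), and a made-up
"struct fake_netdev" whose unreg_list sits at offset 0x70. That offset
is inferred purely from the arithmetic, not checked against the real
struct net_device layout.

```c
/* Userspace sketch only: stripped-down copies of the kernel list helpers. */
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Assumed poison values (x86-64, CONFIG_ILLEGAL_POINTER_VALUE default). */
#define POISON_POINTER_DELTA 0xdead000000000000ULL
#define LIST_POISON1 ((struct list_head *)(uintptr_t)(0x100 + POISON_POINTER_DELTA))
#define LIST_POISON2 ((struct list_head *)(uintptr_t)(0x200 + POISON_POINTER_DELTA))

/* Hypothetical stand-in for struct net_device; the 0x70 offset is a guess. */
struct fake_netdev {
	char pad[0x70];
	struct list_head unreg_list;
	int reg_state;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink the entry, then poison its own pointers like the kernel does. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/*
 * Returns the "dev" address a list_for_each_entry()-style walk would
 * compute if it stepped through the stale ->next of a deleted entry.
 * container_of() is just "member pointer minus member offset".
 */
uint64_t dev_seen_after_list_del(void)
{
	struct list_head head;
	struct fake_netdev dev;

	list_init(&head);
	list_add_tail(&dev.unreg_list, &head);
	list_del(&dev.unreg_list);

	return (uint64_t)(uintptr_t)dev.unreg_list.next
		- offsetof(struct fake_netdev, unreg_list);
}
```

Under those assumptions dev_seen_after_list_del() returns
0xdead000000000090, which matches the %rdi in the GP fault. It says
nothing about which code path ends up walking the stale entry; that
part is still for the networking people to track down.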