From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1161946AbdDUSav (ORCPT ); Fri, 21 Apr 2017 14:30:51 -0400
Received: from mail-wm0-f44.google.com ([74.125.82.44]:37477 "EHLO
	mail-wm0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1161287AbdDUSan (ORCPT ); Fri, 21 Apr 2017 14:30:43 -0400
Subject: Re: net/core: BUG in unregister_netdevice_many
To: Linus Torvalds, Andrey Konovalov, Eric Dumazet
References:
Cc: "David S. Miller", Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, LKML, Alexander Duyck,
	David Ahern, Daniel Borkmann, tcharding, Jiri Pirko,
	stephen hemminger, Dmitry Vyukov, Kostya Serebryany, syzkaller
From: Nikolay Aleksandrov
Message-ID: <2e044c75-c0e5-f5d5-1d94-7609d7af86fa@cumulusnetworks.com>
Date: Fri, 21 Apr 2017 21:30:30 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 21/04/17 20:42, Linus Torvalds wrote:
> On Fri, Apr 21, 2017 at 10:25 AM, Linus Torvalds wrote:
>>
>> I'm assuming that the real cause is simply that "dev->reg_state" ends
>> up being NETREG_UNREGISTERING or something. Maybe the BUG_ON() could
>> be just removed, and replaced by the previous warning about
>> NETREG_UNINITIALIZED.
>>
>> Something like the attached (TOTALLY UNTESTED) patch.
>
> .. might as well test it.
>
> That patch doesn't fix the problem, but it does show that yes, it was
> NETREG_UNREGISTERING:
>
>   unregister_netdevice: device pim6reg/ffff962dc4606000 was not registered (2)
>
> but then immediately afterwards we get
>
>   general protection fault: 0000 [#1] SMP
>   Workqueue: netns cleanup_net
>   RIP: 0010:dev_shutdown+0xe/0xc0
>   Call Trace:
>     rollback_registered_many+0x2a5/0x440
>     unregister_netdevice_many+0x1e/0xb0
>     default_device_exit_batch+0x145/0x170
>
> which is due to a
>
>     mov 0x388(%rdi),%eax
>
> where %rdi is 0xdead000000000090. That is at the very beginning of
> dev_shutdown, it's "dev" itself that has that value, so it comes from
> (_another_) invocation of rollback_registered_many(), when it does
> that
>
>     list_for_each_entry(dev, head, unreg_list) {
>
> so it seems to be a case of another "list_del() leaves list in bad
> state", and it was the added test for "dev->reg_state !=
> NETREG_REGISTERED" that did that
>
>     list_del(&dev->unreg_list);
>
> and left random contents in the unreg_list.
>
> So that "handle error case" was almost certainly just buggy too.
>
> And the bug seems to be that we're trying to unregister a netdevice
> that has already been unregistered.
>
> Over to Eric and networking people. This oops is user-triggerable, and
> leaves the machine in a bad state (the original BUG_ON() and the new
> GP fault both happen while holding the RTNL, so networking is not
> healthy afterwards.
>
>               Linus
>

Right, I've already posted a patch for ip6mr that should fix the issue.
CCed you and LKML just now.

Thanks,
 Nik