From: Linus Torvalds
Date: Mon, 3 Oct 2016 21:00:55 -0700
Subject: BUG_ON() in workingset_node_shadows_dec() triggers
To: Johannes Weiner, Andrew Morton
Cc: Antonio SJ Musumeci, Miklos Szeredi, Linux Kernel Mailing List, stable
X-Mailing-List: linux-kernel@vger.kernel.org

I'm really sorry I applied that last series from Andrew just before doing the 4.8 release, because they cause problems, and now it is in 4.8 (and that buggy crap is marked for stable too).

In particular, I just got this

    kernel BUG at ./include/linux/swap.h:276

and the end result was a dead kernel.

The bug that commit 22f2ac51b6d64 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()") purports to have fixed has apparently been there since 3.15, but the fix is clearly worse than the bug it tried to fix, since that original bug has never killed my machine!

I should have reacted to the damn added BUG_ON() lines. I suspect I will have to finally just remove the idiotic BUG_ON() concept once and for all, because there is NO F*CKING EXCUSE to knowingly kill the kernel.

Why the hell was that not a *warning*?

Yes, I'm grumpy. This went in very late in the release candidates, and I had higher expectations of things coming in through Andrew.
Adding random BUG_ON()'s to code that clearly hasn't had sufficient testing is *not* acceptable, and it's definitely not acceptable to send that to me after rc8 unless it has gotten a *lot* of testing, which it clearly must not have had.

Adding stable to the cc too to warn about this. The full report is

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
    RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
    RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
    R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
    R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
    FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
      __add_to_page_cache_locked+0x12e/0x270
      add_to_page_cache_lru+0x4e/0xe0
      mpage_readpages+0x112/0x1d0
      blkdev_readpages+0x1d/0x20
      __do_page_cache_readahead+0x1ad/0x290
      force_page_cache_readahead+0xaa/0x100
      page_cache_sync_readahead+0x3f/0x50
      generic_file_read_iter+0x5af/0x740
      blkdev_read_iter+0x35/0x40
      __vfs_read+0xe1/0x130
      vfs_read+0x96/0x130
      SyS_read+0x55/0xc0
      entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

and I hope somebody can see what is going wrong in there.

The reason the machine *dies* from that thing is that we end up then immediately having a

    BUG: unable to handle kernel paging request at ffffffffb70bdaa8
    IP: blk_flush_plug_list+0x8b/0x250
    Call Trace:
      schedule+0x61/0x80
      do_exit+0x8c8/0xae0
      rewind_stack_do_exit+0x17/0x20

and then a

    Fixing recursive fault but reboot is needed!

and the machine will never recover.

People who add random assert statements that kill machines should damn well not be let near the VM layer.

Johannes? Please make this your first priority. And in the meantime I will make that VM_BUG_ON() be a VM_WARN_ON_ONCE().

And dammit, if anybody else feels that they had done "debugging messages with BUG_ON()", I would suggest you

 (a) rethink your approach to programming

 (b) send me patches to remove the crap entirely, or make them real *DEBUGGING* messages, not "kill the whole machine" messages.

I've ranted against people using BUG_ON() for debugging in the past. Why the f*ck does this still happen?

And Andrew - please stop taking those kinds of patches! Lookie here:

    https://lwn.net/Articles/13183/

so excuse me for being upset that people still do this shit almost 15 years later.

                 Linus
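
For readers following the VM_BUG_ON() versus VM_WARN_ON_ONCE() point above, here is a minimal user-space sketch of the "warn once and carry on" behaviour. It is only an illustration under stated assumptions: warn_on_once(), struct node, node_shadows_dec() and warnings_printed are hypothetical names invented for this example. It is not the kernel's macro implementation and not the actual change made to include/linux/swap.h.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings so the behaviour can be checked (demo only). */
static int warnings_printed;

/*
 * Hypothetical stand-in for a WARN_ON_ONCE-style check: report the first
 * time the condition is true, stay silent afterwards, and always return
 * the condition so the caller can recover instead of dying.
 */
static bool warn_on_once(bool cond, bool *warned, const char *expr,
                         const char *file, int line)
{
    if (cond && !*warned) {
        *warned = true;
        warnings_printed++;
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    }
    return cond;
}

/* Hypothetical counter, loosely modelled on a per-node shadow count. */
struct node {
    unsigned int shadows;
};

/*
 * Decrement that refuses to underflow. A BUG_ON-style assert would abort
 * here; this version warns once, leaves the counter alone, and returns
 * false so the caller keeps running.
 */
static bool node_shadows_dec(struct node *n)
{
    static bool warned; /* one flag for this call site */

    if (warn_on_once(n->shadows == 0, &warned, "n->shadows == 0",
                     __FILE__, __LINE__))
        return false;

    n->shadows--;
    return true;
}
```

Calling node_shadows_dec() on a node whose count is already zero prints a single warning to stderr no matter how often it is repeated, and the counter is never driven below zero. The design choice is the one the message argues for: the unexpected state stays visible in the log, but it does not take the whole machine down with it.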