From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752671AbdHICja (ORCPT );
	Tue, 8 Aug 2017 22:39:30 -0400
Received: from mail-qt0-f194.google.com ([209.85.216.194]:32919 "EHLO
	mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752466AbdHICj2 (ORCPT );
	Tue, 8 Aug 2017 22:39:28 -0400
MIME-Version: 1.0
In-Reply-To: <20170809014459.GB7693@jaegeuk-macbookpro.roam.corp.google.com>
References: <20170805155241.GA94821@jaegeuk-macbookpro.roam.corp.google.com>
	<20170808010150.4155-1-bradleybolen@gmail.com>
	<20170808162122.GA14689@cmpxchg.org>
	<20170808165601.GA7693@jaegeuk-macbookpro.roam.corp.google.com>
	<20170808173704.GA22887@cmpxchg.org>
	<20170808200849.GA1104@cmpxchg.org>
	<20170809014459.GB7693@jaegeuk-macbookpro.roam.corp.google.com>
From: Brad Bolen 
Date: Tue, 8 Aug 2017 22:39:27 -0400
Message-ID: 
Subject: Re: kernel panic on null pointer on page->mem_cgroup
To: Jaegeuk Kim 
Cc: Johannes Weiner , linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

Yes, the BUG_ON(!page_count(page)) fired for me as well.

On Tue, Aug 8, 2017 at 9:44 PM, Jaegeuk Kim wrote:
> On 08/08, Johannes Weiner wrote:
>> On Tue, Aug 08, 2017 at 03:13:42PM -0400, Brad Bolen wrote:
>> > On Tue, Aug 8, 2017 at 1:37 PM, Johannes Weiner wrote:
>> > > On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
>> > >> On 08/08, Johannes Weiner wrote:
>> > >> > Hi Jaegeuk and Bradley,
>> > >> >
>> > >> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
>> > >> > > I am getting a very similar error on v4.11 with an arm64 board.
>> > >> > >
>> > >> > > I, too, see page->mem_cgroup checked to make sure that it is not
>> > >> > > NULL and then several instructions later it is NULL.
>> > >> > > It does appear
>> > >> > > that someone is changing that member without taking the lock. In my
>> > >> > > setup, I see
>> > >> > >
>> > >> > > crash> bt
>> > >> > > PID: 72  TASK: e1f48640  CPU: 0  COMMAND: "mmcqd/1"
>> > >> > >  #0 [] (__crash_kexec) from []
>> > >> > >  #1 [] (panic) from []
>> > >> > >  #2 [] (svcerr_panic) from []
>> > >> > >  #3 [] (_SvcErr_) from []
>> > >> > >  #4 [] (die) from []
>> > >> > >  #5 [] (__do_kernel_fault) from []
>> > >> > >  #6 [] (do_page_fault) from []
>> > >> > >  #7 [] (do_DataAbort) from []
>> > >> > >     pc : []    lr : []    psr: a0000193
>> > >> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
>> > >> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
>> > >> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
>> > >> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
>> > >> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
>> > >> > >  #8 [] (__dabt_svc) from []
>> > >> > >  #9 [] (test_clear_page_writeback) from []
>> > >> > > #10 [] (end_page_writeback) from []
>> > >> > > #11 [] (end_swap_bio_write) from []
>> > >> > > #12 [] (bio_endio) from []
>> > >> > > #13 [] (dec_pending) from []
>> > >> > > #14 [] (clone_endio) from []
>> > >> > > #15 [] (bio_endio) from []
>> > >> > > #16 [] (crypt_dec_pending [dm_crypt]) from []
>> > >> > > #17 [] (crypt_endio [dm_crypt]) from []
>> > >> > > #18 [] (bio_endio) from []
>> > >> > > #19 [] (blk_update_request) from []
>> > >> > > #20 [] (blk_update_bidi_request) from []
>> > >> > > #21 [] (blk_end_bidi_request) from []
>> > >> > > #22 [] (blk_end_request) from []
>> > >> > > #23 [] (mmc_blk_issue_rw_rq) from []
>> > >> > > #24 [] (mmc_blk_issue_rq) from []
>> > >> > > #25 [] (mmc_queue_thread) from []
>> > >> > > #26 [] (kthread) from []
>> > >> > > crash> sym c0112540
>> > >> > > c0112540 (T) test_clear_page_writeback+512
>> > >> > > /kernel-source/include/linux/memcontrol.h: 518
>> > >> > >
>> > >> > > crash> bt 35
>> > >> > > PID: 35  TASK: e1d45dc0  CPU: 1  COMMAND: "kswapd0"
>> > >> > >  #0 [] (__schedule) from []
>> > >> > >  #1 [] (schedule) from []
>> > >> > >  #2 [] (schedule_timeout) from []
>> > >> > >  #3 [] (io_schedule_timeout) from []
>> > >> > >  #4 [] (mempool_alloc) from []
>> > >> > >  #5 [] (bio_alloc_bioset) from []
>> > >> > >  #6 [] (get_swap_bio) from []
>> > >> > >  #7 [] (__swap_writepage) from []
>> > >> > >  #8 [] (swap_writepage) from []
>> > >> > >  #9 [] (shmem_writepage) from []
>> > >> > > #10 [] (shrink_page_list) from []
>> > >> > > #11 [] (shrink_inactive_list) from []
>> > >> > > #12 [] (shrink_node_memcg) from []
>> > >> > > #13 [] (shrink_node) from []
>> > >> > > #14 [] (kswapd) from []
>> > >> > > #15 [] (kthread) from []
>> > >> > >
>> > >> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
>> > >> > > page lock when it sets mem_cgroup to NULL. I am not familiar with the
>> > >> > > mm code so I do not know if this is on purpose or not. There is a
>> > >> > > comment in uncharge_list() that makes me believe that the crashing
>> > >> > > code should not have been running:
>> > >> > > 	/*
>> > >> > > 	 * Nobody should be changing or seriously looking at
>> > >> > > 	 * page->mem_cgroup at this point, we have fully
>> > >> > > 	 * exclusive access to the page.
>> > >> > > 	 */
>> > >> > > However, I am new to looking at this area of the kernel so I am not
>> > >> > > sure.
>> > >> >
>> > >> > The lock is for pages that are actively being used, whereas the free
>> > >> > path requires the page refcount to be 0; nobody else should be having
>> > >> > access to the page at that time.
>> > >>
>> > >> Given various trials for a while, using __mod_memcg_state() instead of
>> > >> mod_memcg_state() seems to somehow make the panic go away. Might it be
>> > >> caused by kernel preemption?
>> > >
>> > > That's puzzling. Is that reliably the case?
>> > > Because on x86-64,
>> > > __this_cpu_add() and this_cpu_add() should result in the same code:
>> > >
>> > > #define raw_cpu_add_8(pcp, val)		percpu_add_op((pcp), val)
>> > > #define this_cpu_add_8(pcp, val)	percpu_add_op((pcp), val)
>> > >
>> > > which boils down to single instructions - incq, decq, addq - and so
>> > > never needs explicit disabling of scheduler or interrupt preemption.
>> > >
>> > >> > > I was able to create a reproducible scenario by using a udelay to
>> > >> > > increase the time between the if (page->mem_cgroup) check and the
>> > >> > > later dereference of it to widen the race window. I then mounted an
>> > >> > > empty ext4 partition and ran the following no more than twice before
>> > >> > > it crashed:
>> > >> > > 	dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
>> > >> >
>> > >> > Thanks, that's useful. I'm going to try to reproduce this also.
>> > >> >
>> > >> > There is a
>> > >> >
>> > >> > 	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
>> > >> >
>> > >> > inside uncharge_list() that verifies that there shouldn't in fact be
>> > >> > any pages ending writeback when they get into that function. Can you
>> > >> > build your kernel with CONFIG_DEBUG_VM to enable that test?
>> > >>
>> > >> I'll test this as well. ;)
>> > >
>> > > Thanks. I'm trying various udelays in between the NULL-check and the
>> > > dereference, but I cannot seem to make it happen with the ext4-dd on
>> > > my local machine.
>> >
>> > I am using a fairly fat udelay of 100.
>> >
>> > I turned on CONFIG_DEBUG_VM but it didn't bug for me, although the
>> > original problem still reproduced.
>>
>> Ok, that sounds like a page refcounting bug then.
>>
>> > I can insert some tracing to try to get more information, but
>> > unfortunately I am stuck with v4.11 at the moment and can't try
>> > v4.13-rc4 on this setup.
>>
>> Does the below trigger with your reproducer? (Still doesn't for me.)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3914e3dd6168..c052b085c5dd 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -599,6 +599,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
>>  	struct mem_cgroup_per_node *pn;
>>
>>  	__mod_node_page_state(page_pgdat(page), idx, val);
>> +	VM_BUG_ON_PAGE(!page_count(page), page);
>>  	if (mem_cgroup_disabled() || !page->mem_cgroup)
>>  		return;
>>  	__mod_memcg_state(page->mem_cgroup, idx, val);
>> @@ -612,6 +613,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>>  	struct mem_cgroup_per_node *pn;
>>
>>  	mod_node_page_state(page_pgdat(page), idx, val);
>> +	VM_BUG_ON_PAGE(!page_count(page), page);
>
> It seems DEBUG_VM may hide the problem in my case, so I turned it off and
> replaced it with BUG_ON(!page_count(page)). Then, I exactly hit this.
>
> Thanks,
>
>>  	if (mem_cgroup_disabled() || !page->mem_cgroup)
>>  		return;
>>  	mod_memcg_state(page->mem_cgroup, idx, val);
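The window the thread keeps circling can be sketched in a small user-space model. All struct and function names below are illustrative, not kernel code: test_clear_page_writeback() effectively performs two loads of page->mem_cgroup (one for the NULL check, one for the dereference), while uncharge_list() clears the field without the page lock because it assumes page_count() == 0, i.e. exclusive access. The model only shows the shape of the window; per the discussion above, the actual bug is a writeback completion arriving for a page whose refcount has already dropped to zero, not the read pattern itself.

```c
#include <stddef.h>

/* Illustrative user-space model of the race discussed in this
 * thread; names are made up, not kernel identifiers. */
struct memcg_model { long stat; };
struct page_model {
	int refcount;
	struct memcg_model *mem_cgroup;
};

/* The racy shape: check, window, use.  In the reproducer a
 * udelay() between the two loads widens the window. */
long stat_of(struct page_model *page)
{
	if (page->mem_cgroup)                   /* load #1: non-NULL here  */
		return page->mem_cgroup->stat;  /* load #2: may be NULL by now */
	return -1;
}

/* What the uncharge path does once it believes it has exclusive
 * access; its stated invariant is that the refcount is zero. */
void uncharge_model(struct page_model *page)
{
	/* kernel: VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page); */
	page->mem_cgroup = NULL;
}
```

Run sequentially, the two functions are harmless; the crash requires uncharge_model() to execute between the two loads in stat_of(), which is exactly what the udelay-based reproducer arranges.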
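On the __mod_memcg_state() vs mod_memcg_state() observation: the point made above is that on x86-64 both expand to a single add instruction, so there is no preemption window either way. On an architecture where the unguarded variant is a plain load/add/store sequence, preemption in the middle can lose an update. A deterministic user-space sketch of that lost update, with the interleaving forced by hand rather than left to a scheduler (function name is ours):

```c
/* Hand-interleaved model of why a non-atomic per-CPU
 * read-modify-write needs preemption disabled when it is not a
 * single instruction.  Two "CPUs" each increment the counter
 * once; the forced interleaving makes one increment vanish. */
long lost_update_demo(void)
{
	long counter = 0;

	long a = counter;   /* CPU A: load (reads 0)                    */
	long b = counter;   /* CPU B: load (preempts A, also reads 0)   */
	counter = b + 1;    /* CPU B: store (writes 1)                  */
	counter = a + 1;    /* CPU A: store (writes 1; B's increment is
			     * lost -- the correct total would be 2)    */
	return counter;
}
```

This illustrates the preemption hypothesis only; as noted above, on x86-64 the generated code gives no such window, which is why the __mod/mod difference mattering at all was puzzling.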