linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* kernel panic on null pointer on page->mem_cgroup
@ 2017-08-05 15:52 Jaegeuk Kim
       [not found] ` <20170808010150.4155-1-bradleybolen@gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Jaegeuk Kim @ 2017-08-05 15:52 UTC (permalink / raw)
  To: hannes; +Cc: Linux Kernel Mailing List, Linux F2FS Dev Mailing List, linux-mm

Hi Johannes,

Can I ask your help about the below panic which is annoying me recently.
I'm currently testing xfstests with 4.13-rc2, and have hit the below panic
very randomly.

[ 3722.366490] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
[ 3722.378815] IP: test_clear_page_writeback+0x12e/0x2c0
[ 3722.384931] PGD 3fb77067 
[ 3722.384932] P4D 3fb77067 
[ 3722.389222] PUD 1302f067 
[ 3722.392676] PMD 0 
[ 3722.407447] 
[ 3722.416459] Oops: 0000 [#1] SMP
[ 3722.424191] Modules linked in: quota_v2 quota_tree dm_snapshot dm_bufio dm_flakey f2fs(O) ppdev joydev input_leds serio_raw snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd parport_pc soundcore mac_hid i2c_piix4 parport ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci psmouse libahci e1000 pata_acpi video [last unloaded: scsi_debug]
[ 3722.494822] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O    4.13.0-rc2+ #7
[ 3722.509659] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 3722.523018] task: ffff8e3abe32bc00 task.stack: ffffab1e801f0000
[ 3722.534108] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
[ 3722.547281] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
[ 3722.561761] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
[ 3722.595343] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
[ 3722.615108] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
[ 3722.674717] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
[ 3722.691916] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
[ 3722.736393] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
[ 3722.797553] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3722.852623] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
[ 3722.896451] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3722.950847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3722.965578] Call Trace:
[ 3722.971710]  <IRQ>
[ 3722.976306]  end_page_writeback+0x47/0x70
[ 3722.983252]  f2fs_write_end_io+0x76/0x180 [f2fs]
[ 3723.012721]  bio_endio+0x9f/0x120
[ 3723.035764]  blk_update_request+0xa8/0x2f0
[ 3723.064621]  scsi_end_request+0x39/0x1d0
[ 3723.086994]  scsi_io_completion+0x211/0x690
[ 3723.116553]  scsi_finish_command+0xd9/0x120
[ 3723.143690]  scsi_softirq_done+0x127/0x150
[ 3723.170070]  __blk_mq_complete_request_remote+0x13/0x20
[ 3723.199780]  flush_smp_call_function_queue+0x56/0x110
[ 3723.233148]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 3723.255267]  smp_call_function_single_interrupt+0x27/0x40
[ 3723.285327]  call_function_single_interrupt+0x89/0x90
[ 3723.309718] RIP: 0010:native_safe_halt+0x6/0x10


(gdb) l *(test_clear_page_writeback+0x12e)
0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
614		mod_node_page_state(page_pgdat(page), idx, val);
615		if (mem_cgroup_disabled() || !page->mem_cgroup)
616			return;
617		mod_memcg_state(page->mem_cgroup, idx, val);
618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
619		this_cpu_add(pn->lruvec_stat->count[idx], val);
620	}
621	
622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
623							gfp_t gfp_mask,

So first, without your below patch, I've confirmed that there is no problem.

   commit 00f3ca2c2d6635d ("mm: memcontrol: per-lruvec stats infrastructure")

Second, what I've figured out so far is page->mem_cgroup is already checked
above, but after that line, it just becomes NULL. Is it possible somebody can
take it away without locking the page?

Could you please shed a light on this?
Or, is there a patch to fix this already?

Thanks,

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
       [not found] ` <20170808010150.4155-1-bradleybolen@gmail.com>
@ 2017-08-08 16:21   ` Johannes Weiner
  2017-08-08 16:56     ` Jaegeuk Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2017-08-08 16:21 UTC (permalink / raw)
  To: Bradley Bolen; +Cc: linux-mm, jaegeuk, cgroups, linux-kernel

Hi Jaegeuk and Bradley,

On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
> I am getting a very similar error on v4.11 with an arm64 board.
> 
> I, too, also see page->mem_cgroup checked to make sure that it is not
> NULL and then several instructions later it is NULL.  It does appear
> that someone is changing that member without taking the lock.  In my
> setup, I see
> 
> crash> bt
> PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
>  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
>  #1 [<c0101080>] (panic) from [<c028cd6c>]
>  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
>  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
>  #4 [<c001474c>] (die) from [<c00241f8>]
>  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
>  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
>  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
>     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
>     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
>     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
>     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
>     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
>     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
>  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
>  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
> #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
> #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
> #12 [<c0261460>] (bio_endio) from [<c042c800>]
> #13 [<c042c800>] (dec_pending) from [<c042e648>]
> #14 [<c042e648>] (clone_endio) from [<c0261460>]
> #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
> #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
> #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
> #18 [<c0261460>] (bio_endio) from [<c0269e34>]
> #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
> #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
> #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
> #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
> #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
> #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
> #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
> #26 [<c0048890>] (kthread) from [<c0010388>]
> crash> sym c0112540
> c0112540 (T) test_clear_page_writeback+512
>  /kernel-source/include/linux/memcontrol.h: 518
> 
> crash> bt 35
> PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
>  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
>  #1 [<c0559edc>] (schedule) from [<c055e54c>]
>  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
>  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
>  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
>  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
>  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
>  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
>  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
>  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
> #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
> #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
> #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
> #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
> #14 [<c011d7dc>] (kswapd) from [<c0048890>]
> #15 [<c0048890>] (kthread) from [<c0010388>]
> 
> It appears that uncharge_list() in mm/memcontrol.c is not taking the
> page lock when it sets mem_cgroup to NULL.  I am not familiar with the
> mm code so I do not know if this is on purpose or not.  There is a
> comment in uncharge_list that makes me believe that the crashing code
> should not have been running:
> /*
>  * Nobody should be changing or seriously looking at
>  * page->mem_cgroup at this point, we have fully
>  * exclusive access to the page.
>  */
> However, I am new to looking at this area of the kernel so I am not
> sure.

The lock is for pages that are actively being used, whereas the free
path requires the page refcount to be 0; nobody else should be having
access to the page at that time.

> I was able to create a reproducible scenario by using a udelay to
> increase the time between the if (page->mem_cgroup) check and the later
> dereference of it to increase the race window.  I then mounted an empty
> ext4 partition and ran the following no more than twice before it
> crashed.
> dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100

Thanks, that's useful. I'm going to try to reproduce this also.

There is a

	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);

inside uncharge_list() that verifies that there shouldn't in fact be
any pages ending writeback when they get into that function. Can you
build your kernel with CONFIG_DEBUG_VM to enable that test?

Thanks

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-08 16:21   ` Johannes Weiner
@ 2017-08-08 16:56     ` Jaegeuk Kim
  2017-08-08 17:37       ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Jaegeuk Kim @ 2017-08-08 16:56 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Bradley Bolen, linux-mm, cgroups, linux-kernel

On 08/08, Johannes Weiner wrote:
> Hi Jaegeuk and Bradley,
> 
> On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
> > I am getting a very similar error on v4.11 with an arm64 board.
> > 
> > I, too, also see page->mem_cgroup checked to make sure that it is not
> > NULL and then several instructions later it is NULL.  It does appear
> > that someone is changing that member without taking the lock.  In my
> > setup, I see
> > 
> > crash> bt
> > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
> >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
> >  #1 [<c0101080>] (panic) from [<c028cd6c>]
> >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
> >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
> >  #4 [<c001474c>] (die) from [<c00241f8>]
> >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
> >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
> >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
> >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
> >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
> >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
> >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
> >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
> >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
> >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
> >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
> > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
> > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
> > #12 [<c0261460>] (bio_endio) from [<c042c800>]
> > #13 [<c042c800>] (dec_pending) from [<c042e648>]
> > #14 [<c042e648>] (clone_endio) from [<c0261460>]
> > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
> > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
> > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
> > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
> > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
> > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
> > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
> > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
> > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
> > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
> > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
> > #26 [<c0048890>] (kthread) from [<c0010388>]
> > crash> sym c0112540
> > c0112540 (T) test_clear_page_writeback+512
> >  /kernel-source/include/linux/memcontrol.h: 518
> > 
> > crash> bt 35
> > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
> >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
> >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
> >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
> >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
> >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
> >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
> >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
> >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
> >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
> >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
> > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
> > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
> > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
> > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
> > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
> > #15 [<c0048890>] (kthread) from [<c0010388>]
> > 
> > It appears that uncharge_list() in mm/memcontrol.c is not taking the
> > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
> > mm code so I do not know if this is on purpose or not.  There is a
> > comment in uncharge_list that makes me believe that the crashing code
> > should not have been running:
> > /*
> >  * Nobody should be changing or seriously looking at
> >  * page->mem_cgroup at this point, we have fully
> >  * exclusive access to the page.
> >  */
> > However, I am new to looking at this area of the kernel so I am not
> > sure.
> 
> The lock is for pages that are actively being used, whereas the free
> path requires the page refcount to be 0; nobody else should be having
> access to the page at that time.

Given various trials for a while, using __mod_memcg_state() instead of
mod_memcg_state() ssems somehow blowing the panic away. It might be caused
by kernel preemption?

> 
> > I was able to create a reproducible scenario by using a udelay to
> > increase the time between the if (page->mem_cgroup) check and the later
> > dereference of it to increase the race window.  I then mounted an empty
> > ext4 partition and ran the following no more than twice before it
> > crashed.
> > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
> 
> Thanks, that's useful. I'm going to try to reproduce this also.
> 
> There is a
> 
> 	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> 
> inside uncharge_list() that verifies that there shouldn't in fact be
> any pages ending writeback when they get into that function. Can you
> build your kernel with CONFIG_DEBUG_VM to enable that test?

I'll test this as well. ;)

Thanks,

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-08 16:56     ` Jaegeuk Kim
@ 2017-08-08 17:37       ` Johannes Weiner
  2017-08-08 19:13         ` Brad Bolen
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2017-08-08 17:37 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: Bradley Bolen, linux-mm, cgroups, linux-kernel

On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
> On 08/08, Johannes Weiner wrote:
> > Hi Jaegeuk and Bradley,
> > 
> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
> > > I am getting a very similar error on v4.11 with an arm64 board.
> > > 
> > > I, too, also see page->mem_cgroup checked to make sure that it is not
> > > NULL and then several instructions later it is NULL.  It does appear
> > > that someone is changing that member without taking the lock.  In my
> > > setup, I see
> > > 
> > > crash> bt
> > > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
> > >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
> > >  #1 [<c0101080>] (panic) from [<c028cd6c>]
> > >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
> > >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
> > >  #4 [<c001474c>] (die) from [<c00241f8>]
> > >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
> > >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
> > >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
> > >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
> > >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
> > >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
> > > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
> > > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
> > > #12 [<c0261460>] (bio_endio) from [<c042c800>]
> > > #13 [<c042c800>] (dec_pending) from [<c042e648>]
> > > #14 [<c042e648>] (clone_endio) from [<c0261460>]
> > > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
> > > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
> > > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
> > > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
> > > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
> > > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
> > > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
> > > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
> > > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
> > > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
> > > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
> > > #26 [<c0048890>] (kthread) from [<c0010388>]
> > > crash> sym c0112540
> > > c0112540 (T) test_clear_page_writeback+512
> > >  /kernel-source/include/linux/memcontrol.h: 518
> > > 
> > > crash> bt 35
> > > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
> > >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
> > >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
> > >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
> > >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
> > >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
> > >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
> > >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
> > >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
> > >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
> > >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
> > > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
> > > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
> > > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
> > > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
> > > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
> > > #15 [<c0048890>] (kthread) from [<c0010388>]
> > > 
> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
> > > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
> > > mm code so I do not know if this is on purpose or not.  There is a
> > > comment in uncharge_list that makes me believe that the crashing code
> > > should not have been running:
> > > /*
> > >  * Nobody should be changing or seriously looking at
> > >  * page->mem_cgroup at this point, we have fully
> > >  * exclusive access to the page.
> > >  */
> > > However, I am new to looking at this area of the kernel so I am not
> > > sure.
> > 
> > The lock is for pages that are actively being used, whereas the free
> > path requires the page refcount to be 0; nobody else should be having
> > access to the page at that time.
> 
> Given various trials for a while, using __mod_memcg_state() instead of
> mod_memcg_state() ssems somehow blowing the panic away. It might be caused
> by kernel preemption?

That's puzzling. Is that reliably the case? Because on x86-64,
__this_cpu_add and this_cpu_add should result in the same code:

#define raw_cpu_add_8(pcp, val)                 percpu_add_op((pcp), val)
#define this_cpu_add_8(pcp, val)                percpu_add_op((pcp), val)

which boils down to single instructions - incq, decq, addq - and so
never needs explicit disabling of scheduler or interrupt preemption.

> > > I was able to create a reproducible scenario by using a udelay to
> > > increase the time between the if (page->mem_cgroup) check and the later
> > > dereference of it to increase the race window.  I then mounted an empty
> > > ext4 partition and ran the following no more than twice before it
> > > crashed.
> > > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
> > 
> > Thanks, that's useful. I'm going to try to reproduce this also.
> > 
> > There is a
> > 
> > 	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> > 
> > inside uncharge_list() that verifies that there shouldn't in fact be
> > any pages ending writeback when they get into that function. Can you
> > build your kernel with CONFIG_DEBUG_VM to enable that test?
> 
> I'll test this as well. ;)

Thanks. I'm trying various udelays in between the NULL-check and the
dereference, but I cannot seem to make it happen with the ext4-dd on
my local machine.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-08 17:37       ` Johannes Weiner
@ 2017-08-08 19:13         ` Brad Bolen
  2017-08-08 20:08           ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Brad Bolen @ 2017-08-08 19:13 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Jaegeuk Kim, linux-mm, cgroups, linux-kernel

On Tue, Aug 8, 2017 at 1:37 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
>> On 08/08, Johannes Weiner wrote:
>> > Hi Jaegeuk and Bradley,
>> >
>> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
>> > > I am getting a very similar error on v4.11 with an arm64 board.
>> > >
>> > > I, too, also see page->mem_cgroup checked to make sure that it is not
>> > > NULL and then several instructions later it is NULL.  It does appear
>> > > that someone is changing that member without taking the lock.  In my
>> > > setup, I see
>> > >
>> > > crash> bt
>> > > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
>> > >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
>> > >  #1 [<c0101080>] (panic) from [<c028cd6c>]
>> > >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
>> > >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
>> > >  #4 [<c001474c>] (die) from [<c00241f8>]
>> > >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
>> > >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
>> > >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
>> > >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
>> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
>> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
>> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
>> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
>> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
>> > >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
>> > >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
>> > > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
>> > > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
>> > > #12 [<c0261460>] (bio_endio) from [<c042c800>]
>> > > #13 [<c042c800>] (dec_pending) from [<c042e648>]
>> > > #14 [<c042e648>] (clone_endio) from [<c0261460>]
>> > > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
>> > > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
>> > > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
>> > > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
>> > > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
>> > > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
>> > > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
>> > > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
>> > > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
>> > > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
>> > > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
>> > > #26 [<c0048890>] (kthread) from [<c0010388>]
>> > > crash> sym c0112540
>> > > c0112540 (T) test_clear_page_writeback+512
>> > >  /kernel-source/include/linux/memcontrol.h: 518
>> > >
>> > > crash> bt 35
>> > > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
>> > >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
>> > >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
>> > >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
>> > >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
>> > >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
>> > >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
>> > >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
>> > >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
>> > >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
>> > >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
>> > > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
>> > > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
>> > > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
>> > > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
>> > > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
>> > > #15 [<c0048890>] (kthread) from [<c0010388>]
>> > >
>> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
>> > > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
>> > > mm code so I do not know if this is on purpose or not.  There is a
>> > > comment in uncharge_list that makes me believe that the crashing code
>> > > should not have been running:
>> > > /*
>> > >  * Nobody should be changing or seriously looking at
>> > >  * page->mem_cgroup at this point, we have fully
>> > >  * exclusive access to the page.
>> > >  */
>> > > However, I am new to looking at this area of the kernel so I am not
>> > > sure.
>> >
>> > The lock is for pages that are actively being used, whereas the free
>> > path requires the page refcount to be 0; nobody else should be having
>> > access to the page at that time.
>>
>> Given various trials for a while, using __mod_memcg_state() instead of
>> mod_memcg_state() ssems somehow blowing the panic away. It might be caused
>> by kernel preemption?
>
> That's puzzling. Is that reliably the case? Because on x86-64,
> __this_cpu_add and this_cpu_add should result in the same code:
>
> #define raw_cpu_add_8(pcp, val)                 percpu_add_op((pcp), val)
> #define this_cpu_add_8(pcp, val)                percpu_add_op((pcp), val)
>
> which boils down to single instructions - incq, decq, addq - and so
> never needs explicit disabling of scheduler or interrupt preemption.
>
>> > > I was able to create a reproducible scenario by using a udelay to
>> > > increase the time between the if (page->mem_cgroup) check and the later
>> > > dereference of it to increase the race window.  I then mounted an empty
>> > > ext4 partition and ran the following no more than twice before it
>> > > crashed.
>> > > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
>> >
>> > Thanks, that's useful. I'm going to try to reproduce this also.
>> >
>> > There is a
>> >
>> >     VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
>> >
>> > inside uncharge_list() that verifies that there shouldn't in fact be
>> > any pages ending writeback when they get into that function. Can you
>> > build your kernel with CONFIG_DEBUG_VM to enable that test?
>>
>> I'll test this as well. ;)
>
> Thanks. I'm trying various udelays in between the NULL-check and the
> dereference, but I cannot seem to make it happen with the ext4-dd on
> my local machine.
I am using a fairly fat udelay of 100.

I turned on CONFIG_DEBUG_VM but it didn't bug for me, although the original
problem still reproduced.

I can insert some tracing to try to get some more information, but unfortunately
I am stuck with v4.11 at the moment and can't try v4.13-rc4 on this setup.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-08 19:13         ` Brad Bolen
@ 2017-08-08 20:08           ` Johannes Weiner
  2017-08-09  1:44             ` Jaegeuk Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2017-08-08 20:08 UTC (permalink / raw)
  To: Brad Bolen; +Cc: Jaegeuk Kim, linux-mm, cgroups, linux-kernel

On Tue, Aug 08, 2017 at 03:13:42PM -0400, Brad Bolen wrote:
> On Tue, Aug 8, 2017 at 1:37 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
> >> On 08/08, Johannes Weiner wrote:
> >> > Hi Jaegeuk and Bradley,
> >> >
> >> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
> >> > > I am getting a very similar error on v4.11 with an arm64 board.
> >> > >
> >> > > I, too, also see page->mem_cgroup checked to make sure that it is not
> >> > > NULL and then several instructions later it is NULL.  It does appear
> >> > > that someone is changing that member without taking the lock.  In my
> >> > > setup, I see
> >> > >
> >> > > crash> bt
> >> > > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
> >> > >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
> >> > >  #1 [<c0101080>] (panic) from [<c028cd6c>]
> >> > >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
> >> > >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
> >> > >  #4 [<c001474c>] (die) from [<c00241f8>]
> >> > >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
> >> > >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
> >> > >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
> >> > >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
> >> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
> >> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
> >> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
> >> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
> >> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
> >> > >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
> >> > >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
> >> > > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
> >> > > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
> >> > > #12 [<c0261460>] (bio_endio) from [<c042c800>]
> >> > > #13 [<c042c800>] (dec_pending) from [<c042e648>]
> >> > > #14 [<c042e648>] (clone_endio) from [<c0261460>]
> >> > > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
> >> > > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
> >> > > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
> >> > > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
> >> > > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
> >> > > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
> >> > > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
> >> > > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
> >> > > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
> >> > > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
> >> > > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
> >> > > #26 [<c0048890>] (kthread) from [<c0010388>]
> >> > > crash> sym c0112540
> >> > > c0112540 (T) test_clear_page_writeback+512
> >> > >  /kernel-source/include/linux/memcontrol.h: 518
> >> > >
> >> > > crash> bt 35
> >> > > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
> >> > >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
> >> > >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
> >> > >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
> >> > >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
> >> > >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
> >> > >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
> >> > >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
> >> > >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
> >> > >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
> >> > >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
> >> > > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
> >> > > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
> >> > > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
> >> > > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
> >> > > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
> >> > > #15 [<c0048890>] (kthread) from [<c0010388>]
> >> > >
> >> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
> >> > > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
> >> > > mm code so I do not know if this is on purpose or not.  There is a
> >> > > comment in uncharge_list that makes me believe that the crashing code
> >> > > should not have been running:
> >> > > /*
> >> > >  * Nobody should be changing or seriously looking at
> >> > >  * page->mem_cgroup at this point, we have fully
> >> > >  * exclusive access to the page.
> >> > >  */
> >> > > However, I am new to looking at this area of the kernel so I am not
> >> > > sure.
> >> >
> >> > The lock is for pages that are actively being used, whereas the free
> >> > path requires the page refcount to be 0; nobody else should be having
> >> > access to the page at that time.
> >>
> >> Given various trials for a while, using __mod_memcg_state() instead of
> >> mod_memcg_state() ssems somehow blowing the panic away. It might be caused
> >> by kernel preemption?
> >
> > That's puzzling. Is that reliably the case? Because on x86-64,
> > __this_cpu_add and this_cpu_add should result in the same code:
> >
> > #define raw_cpu_add_8(pcp, val)                 percpu_add_op((pcp), val)
> > #define this_cpu_add_8(pcp, val)                percpu_add_op((pcp), val)
> >
> > which boils down to single instructions - incq, decq, addq - and so
> > never needs explicit disabling of scheduler or interrupt preemption.
> >
> >> > > I was able to create a reproducible scenario by using a udelay to
> >> > > increase the time between the if (page->mem_cgroup) check and the later
> >> > > dereference of it to increase the race window.  I then mounted an empty
> >> > > ext4 partition and ran the following no more than twice before it
> >> > > crashed.
> >> > > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
> >> >
> >> > Thanks, that's useful. I'm going to try to reproduce this also.
> >> >
> >> > There is a
> >> >
> >> >     VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> >> >
> >> > inside uncharge_list() that verifies that there shouldn't in fact be
> >> > any pages ending writeback when they get into that function. Can you
> >> > build your kernel with CONFIG_DEBUG_VM to enable that test?
> >>
> >> I'll test this as well. ;)
> >
> > Thanks. I'm trying various udelays in between the NULL-check and the
> > dereference, but I cannot seem to make it happen with the ext4-dd on
> > my local machine.
> I am using a fairly fat udelay of 100.
> 
> I turned on CONFIG_DEBUG_VM but it didn't bug for me, although the original
> problem still reproduced.

Ok that sounds like a page refcounting bug then.

> I can insert some tracing to try to get some more information, but unfortunately
> I am stuck with v4.11 at the moment and can't try v4.13-rc4 on this setup.

Does the below trigger with your reproducer? (Still doesn't for me).

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3dd6168..c052b085c5dd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -599,6 +599,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
 	struct mem_cgroup_per_node *pn;
 
 	__mod_node_page_state(page_pgdat(page), idx, val);
+	VM_BUG_ON_PAGE(!page_count(page), page);
 	if (mem_cgroup_disabled() || !page->mem_cgroup)
 		return;
 	__mod_memcg_state(page->mem_cgroup, idx, val);
@@ -612,6 +613,7 @@ static inline void mod_lruvec_page_state(struct page *page,
 	struct mem_cgroup_per_node *pn;
 
 	mod_node_page_state(page_pgdat(page), idx, val);
+	VM_BUG_ON_PAGE(!page_count(page), page);
 	if (mem_cgroup_disabled() || !page->mem_cgroup)
 		return;
 	mod_memcg_state(page->mem_cgroup, idx, val);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-08 20:08           ` Johannes Weiner
@ 2017-08-09  1:44             ` Jaegeuk Kim
  2017-08-09  2:39               ` Brad Bolen
  0 siblings, 1 reply; 14+ messages in thread
From: Jaegeuk Kim @ 2017-08-09  1:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Brad Bolen, linux-mm, cgroups, linux-kernel

On 08/08, Johannes Weiner wrote:
> On Tue, Aug 08, 2017 at 03:13:42PM -0400, Brad Bolen wrote:
> > On Tue, Aug 8, 2017 at 1:37 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
> > >> On 08/08, Johannes Weiner wrote:
> > >> > Hi Jaegeuk and Bradley,
> > >> >
> > >> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
> > >> > > I am getting a very similar error on v4.11 with an arm64 board.
> > >> > >
> > >> > > I, too, also see page->mem_cgroup checked to make sure that it is not
> > >> > > NULL and then several instructions later it is NULL.  It does appear
> > >> > > that someone is changing that member without taking the lock.  In my
> > >> > > setup, I see
> > >> > >
> > >> > > crash> bt
> > >> > > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
> > >> > >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
> > >> > >  #1 [<c0101080>] (panic) from [<c028cd6c>]
> > >> > >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
> > >> > >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
> > >> > >  #4 [<c001474c>] (die) from [<c00241f8>]
> > >> > >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
> > >> > >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
> > >> > >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
> > >> > >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
> > >> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
> > >> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
> > >> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
> > >> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
> > >> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
> > >> > >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
> > >> > >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
> > >> > > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
> > >> > > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
> > >> > > #12 [<c0261460>] (bio_endio) from [<c042c800>]
> > >> > > #13 [<c042c800>] (dec_pending) from [<c042e648>]
> > >> > > #14 [<c042e648>] (clone_endio) from [<c0261460>]
> > >> > > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
> > >> > > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
> > >> > > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
> > >> > > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
> > >> > > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
> > >> > > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
> > >> > > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
> > >> > > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
> > >> > > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
> > >> > > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
> > >> > > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
> > >> > > #26 [<c0048890>] (kthread) from [<c0010388>]
> > >> > > crash> sym c0112540
> > >> > > c0112540 (T) test_clear_page_writeback+512
> > >> > >  /kernel-source/include/linux/memcontrol.h: 518
> > >> > >
> > >> > > crash> bt 35
> > >> > > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
> > >> > >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
> > >> > >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
> > >> > >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
> > >> > >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
> > >> > >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
> > >> > >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
> > >> > >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
> > >> > >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
> > >> > >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
> > >> > >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
> > >> > > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
> > >> > > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
> > >> > > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
> > >> > > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
> > >> > > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
> > >> > > #15 [<c0048890>] (kthread) from [<c0010388>]
> > >> > >
> > >> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
> > >> > > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
> > >> > > mm code so I do not know if this is on purpose or not.  There is a
> > >> > > comment in uncharge_list that makes me believe that the crashing code
> > >> > > should not have been running:
> > >> > > /*
> > >> > >  * Nobody should be changing or seriously looking at
> > >> > >  * page->mem_cgroup at this point, we have fully
> > >> > >  * exclusive access to the page.
> > >> > >  */
> > >> > > However, I am new to looking at this area of the kernel so I am not
> > >> > > sure.
> > >> >
> > >> > The lock is for pages that are actively being used, whereas the free
> > >> > path requires the page refcount to be 0; nobody else should be having
> > >> > access to the page at that time.
> > >>
> > >> Given various trials for a while, using __mod_memcg_state() instead of
> > >> mod_memcg_state() ssems somehow blowing the panic away. It might be caused
> > >> by kernel preemption?
> > >
> > > That's puzzling. Is that reliably the case? Because on x86-64,
> > > __this_cpu_add and this_cpu_add should result in the same code:
> > >
> > > #define raw_cpu_add_8(pcp, val)                 percpu_add_op((pcp), val)
> > > #define this_cpu_add_8(pcp, val)                percpu_add_op((pcp), val)
> > >
> > > which boils down to single instructions - incq, decq, addq - and so
> > > never needs explicit disabling of scheduler or interrupt preemption.
> > >
> > >> > > I was able to create a reproducible scenario by using a udelay to
> > >> > > increase the time between the if (page->mem_cgroup) check and the later
> > >> > > dereference of it to increase the race window.  I then mounted an empty
> > >> > > ext4 partition and ran the following no more than twice before it
> > >> > > crashed.
> > >> > > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
> > >> >
> > >> > Thanks, that's useful. I'm going to try to reproduce this also.
> > >> >
> > >> > There is a
> > >> >
> > >> >     VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
> > >> >
> > >> > inside uncharge_list() that verifies that there shouldn't in fact be
> > >> > any pages ending writeback when they get into that function. Can you
> > >> > build your kernel with CONFIG_DEBUG_VM to enable that test?
> > >>
> > >> I'll test this as well. ;)
> > >
> > > Thanks. I'm trying various udelays in between the NULL-check and the
> > > dereference, but I cannot seem to make it happen with the ext4-dd on
> > > my local machine.
> > I am using a fairly fat udelay of 100.
> > 
> > I turned on CONFIG_DEBUG_VM but it didn't bug for me, although the original
> > problem still reproduced.
> 
> Ok that sounds like a page refcounting bug then.
> 
> > I can insert some tracing to try to get some more information, but unfortunately
> > I am stuck with v4.11 at the moment and can't try v4.13-rc4 on this setup.
> 
> Does the below trigger with your reproducer? (Still doesn't for me).
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3914e3dd6168..c052b085c5dd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -599,6 +599,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
>  	struct mem_cgroup_per_node *pn;
>  
>  	__mod_node_page_state(page_pgdat(page), idx, val);
> +	VM_BUG_ON_PAGE(!page_count(page), page);
>  	if (mem_cgroup_disabled() || !page->mem_cgroup)
>  		return;
>  	__mod_memcg_state(page->mem_cgroup, idx, val);
> @@ -612,6 +613,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>  	struct mem_cgroup_per_node *pn;
>  
>  	mod_node_page_state(page_pgdat(page), idx, val);
> +	VM_BUG_ON_PAGE(!page_count(page), page);

It seems DEBUG_VM may hide the problem in my case, so I turned it off and
replaced it with BUG_ON(!page_count(page)). Then, I exactly hit this.

Thanks,

>  	if (mem_cgroup_disabled() || !page->mem_cgroup)
>  		return;
>  	mod_memcg_state(page->mem_cgroup, idx, val);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-09  1:44             ` Jaegeuk Kim
@ 2017-08-09  2:39               ` Brad Bolen
  2017-08-09 18:38                 ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Brad Bolen @ 2017-08-09  2:39 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: Johannes Weiner, linux-mm, cgroups, linux-kernel

Yes, the BUG_ON(!page_count(page)) fired for me as well.

On Tue, Aug 8, 2017 at 9:44 PM, Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> On 08/08, Johannes Weiner wrote:
>> On Tue, Aug 08, 2017 at 03:13:42PM -0400, Brad Bolen wrote:
>> > On Tue, Aug 8, 2017 at 1:37 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > > On Tue, Aug 08, 2017 at 09:56:01AM -0700, Jaegeuk Kim wrote:
>> > >> On 08/08, Johannes Weiner wrote:
>> > >> > Hi Jaegeuk and Bradley,
>> > >> >
>> > >> > On Mon, Aug 07, 2017 at 09:01:50PM -0400, Bradley Bolen wrote:
>> > >> > > I am getting a very similar error on v4.11 with an arm64 board.
>> > >> > >
>> > >> > > I, too, also see page->mem_cgroup checked to make sure that it is not
>> > >> > > NULL and then several instructions later it is NULL.  It does appear
>> > >> > > that someone is changing that member without taking the lock.  In my
>> > >> > > setup, I see
>> > >> > >
>> > >> > > crash> bt
>> > >> > > PID: 72     TASK: e1f48640  CPU: 0   COMMAND: "mmcqd/1"
>> > >> > >  #0 [<c00ad35c>] (__crash_kexec) from [<c0101080>]
>> > >> > >  #1 [<c0101080>] (panic) from [<c028cd6c>]
>> > >> > >  #2 [<c028cd6c>] (svcerr_panic) from [<c028cdc4>]
>> > >> > >  #3 [<c028cdc4>] (_SvcErr_) from [<c001474c>]
>> > >> > >  #4 [<c001474c>] (die) from [<c00241f8>]
>> > >> > >  #5 [<c00241f8>] (__do_kernel_fault) from [<c0560600>]
>> > >> > >  #6 [<c0560600>] (do_page_fault) from [<c00092e8>]
>> > >> > >  #7 [<c00092e8>] (do_DataAbort) from [<c055f9f0>]
>> > >> > >     pc : [<c0112540>]    lr : [<c0112518>]    psr: a0000193
>> > >> > >     sp : c1a19cc8  ip : 00000000  fp : c1a19d04
>> > >> > >     r10: 0006ae29  r9 : 00000000  r8 : dfbf1800
>> > >> > >     r7 : dfbf1800  r6 : 00000001  r5 : f3c1107c  r4 : e2fb6424
>> > >> > >     r3 : 00000000  r2 : 00040228  r1 : 221e3000  r0 : a0000113
>> > >> > >     Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
>> > >> > >  #8 [<c055f9f0>] (__dabt_svc) from [<c0112518>]
>> > >> > >  #9 [<c0112540>] (test_clear_page_writeback) from [<c01046d4>]
>> > >> > > #10 [<c01046d4>] (end_page_writeback) from [<c0149bcc>]
>> > >> > > #11 [<c0149bcc>] (end_swap_bio_write) from [<c0261460>]
>> > >> > > #12 [<c0261460>] (bio_endio) from [<c042c800>]
>> > >> > > #13 [<c042c800>] (dec_pending) from [<c042e648>]
>> > >> > > #14 [<c042e648>] (clone_endio) from [<c0261460>]
>> > >> > > #15 [<c0261460>] (bio_endio) from [<bf60aa00>]
>> > >> > > #16 [<bf60aa00>] (crypt_dec_pending [dm_crypt]) from [<bf60c1e8>]
>> > >> > > #17 [<bf60c1e8>] (crypt_endio [dm_crypt]) from [<c0261460>]
>> > >> > > #18 [<c0261460>] (bio_endio) from [<c0269e34>]
>> > >> > > #19 [<c0269e34>] (blk_update_request) from [<c026a058>]
>> > >> > > #20 [<c026a058>] (blk_update_bidi_request) from [<c026a444>]
>> > >> > > #21 [<c026a444>] (blk_end_bidi_request) from [<c026a494>]
>> > >> > > #22 [<c026a494>] (blk_end_request) from [<c0458dbc>]
>> > >> > > #23 [<c0458dbc>] (mmc_blk_issue_rw_rq) from [<c0459e24>]
>> > >> > > #24 [<c0459e24>] (mmc_blk_issue_rq) from [<c045a018>]
>> > >> > > #25 [<c045a018>] (mmc_queue_thread) from [<c0048890>]
>> > >> > > #26 [<c0048890>] (kthread) from [<c0010388>]
>> > >> > > crash> sym c0112540
>> > >> > > c0112540 (T) test_clear_page_writeback+512
>> > >> > >  /kernel-source/include/linux/memcontrol.h: 518
>> > >> > >
>> > >> > > crash> bt 35
>> > >> > > PID: 35     TASK: e1d45dc0  CPU: 1   COMMAND: "kswapd0"
>> > >> > >  #0 [<c0559ab8>] (__schedule) from [<c0559edc>]
>> > >> > >  #1 [<c0559edc>] (schedule) from [<c055e54c>]
>> > >> > >  #2 [<c055e54c>] (schedule_timeout) from [<c055a3a4>]
>> > >> > >  #3 [<c055a3a4>] (io_schedule_timeout) from [<c0106cb0>]
>> > >> > >  #4 [<c0106cb0>] (mempool_alloc) from [<c0261668>]
>> > >> > >  #5 [<c0261668>] (bio_alloc_bioset) from [<c0149d68>]
>> > >> > >  #6 [<c0149d68>] (get_swap_bio) from [<c014a280>]
>> > >> > >  #7 [<c014a280>] (__swap_writepage) from [<c014a3bc>]
>> > >> > >  #8 [<c014a3bc>] (swap_writepage) from [<c011e5c8>]
>> > >> > >  #9 [<c011e5c8>] (shmem_writepage) from [<c011a9b8>]
>> > >> > > #10 [<c011a9b8>] (shrink_page_list) from [<c011b528>]
>> > >> > > #11 [<c011b528>] (shrink_inactive_list) from [<c011c160>]
>> > >> > > #12 [<c011c160>] (shrink_node_memcg) from [<c011c400>]
>> > >> > > #13 [<c011c400>] (shrink_node) from [<c011d7dc>]
>> > >> > > #14 [<c011d7dc>] (kswapd) from [<c0048890>]
>> > >> > > #15 [<c0048890>] (kthread) from [<c0010388>]
>> > >> > >
>> > >> > > It appears that uncharge_list() in mm/memcontrol.c is not taking the
>> > >> > > page lock when it sets mem_cgroup to NULL.  I am not familiar with the
>> > >> > > mm code so I do not know if this is on purpose or not.  There is a
>> > >> > > comment in uncharge_list that makes me believe that the crashing code
>> > >> > > should not have been running:
>> > >> > > /*
>> > >> > >  * Nobody should be changing or seriously looking at
>> > >> > >  * page->mem_cgroup at this point, we have fully
>> > >> > >  * exclusive access to the page.
>> > >> > >  */
>> > >> > > However, I am new to looking at this area of the kernel so I am not
>> > >> > > sure.
>> > >> >
>> > >> > The lock is for pages that are actively being used, whereas the free
>> > >> > path requires the page refcount to be 0; nobody else should be having
>> > >> > access to the page at that time.
>> > >>
>> > >> Given various trials for a while, using __mod_memcg_state() instead of
>> > >> mod_memcg_state() ssems somehow blowing the panic away. It might be caused
>> > >> by kernel preemption?
>> > >
>> > > That's puzzling. Is that reliably the case? Because on x86-64,
>> > > __this_cpu_add and this_cpu_add should result in the same code:
>> > >
>> > > #define raw_cpu_add_8(pcp, val)                 percpu_add_op((pcp), val)
>> > > #define this_cpu_add_8(pcp, val)                percpu_add_op((pcp), val)
>> > >
>> > > which boils down to single instructions - incq, decq, addq - and so
>> > > never needs explicit disabling of scheduler or interrupt preemption.
>> > >
>> > >> > > I was able to create a reproducible scenario by using a udelay to
>> > >> > > increase the time between the if (page->mem_cgroup) check and the later
>> > >> > > dereference of it to increase the race window.  I then mounted an empty
>> > >> > > ext4 partition and ran the following no more than twice before it
>> > >> > > crashed.
>> > >> > > dd if=/dev/zero of=/tmp/ext4disk/test bs=1M count=100
>> > >> >
>> > >> > Thanks, that's useful. I'm going to try to reproduce this also.
>> > >> >
>> > >> > There is a
>> > >> >
>> > >> >     VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
>> > >> >
>> > >> > inside uncharge_list() that verifies that there shouldn't in fact be
>> > >> > any pages ending writeback when they get into that function. Can you
>> > >> > build your kernel with CONFIG_DEBUG_VM to enable that test?
>> > >>
>> > >> I'll test this as well. ;)
>> > >
>> > > Thanks. I'm trying various udelays in between the NULL-check and the
>> > > dereference, but I cannot seem to make it happen with the ext4-dd on
>> > > my local machine.
>> > I am using a fairly fat udelay of 100.
>> >
>> > I turned on CONFIG_DEBUG_VM but it didn't bug for me, although the original
>> > problem still reproduced.
>>
>> Ok that sounds like a page refcounting bug then.
>>
>> > I can insert some tracing to try to get some more information, but unfortunately
>> > I am stuck with v4.11 at the moment and can't try v4.13-rc4 on this setup.
>>
>> Does the below trigger with your reproducer? (Still doesn't for me).
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3914e3dd6168..c052b085c5dd 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -599,6 +599,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
>>       struct mem_cgroup_per_node *pn;
>>
>>       __mod_node_page_state(page_pgdat(page), idx, val);
>> +     VM_BUG_ON_PAGE(!page_count(page), page);
>>       if (mem_cgroup_disabled() || !page->mem_cgroup)
>>               return;
>>       __mod_memcg_state(page->mem_cgroup, idx, val);
>> @@ -612,6 +613,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>>       struct mem_cgroup_per_node *pn;
>>
>>       mod_node_page_state(page_pgdat(page), idx, val);
>> +     VM_BUG_ON_PAGE(!page_count(page), page);
>
> It seems DEBUG_VM may hide the problem in my case, so I turned it off and
> replaced it with BUG_ON(!page_count(page)). Then, I exactly hit this.
>
> Thanks,
>
>>       if (mem_cgroup_disabled() || !page->mem_cgroup)
>>               return;
>>       mod_memcg_state(page->mem_cgroup, idx, val);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-09  2:39               ` Brad Bolen
@ 2017-08-09 18:38                 ` Johannes Weiner
  2017-08-10 11:56                   ` Michal Hocko
                                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Johannes Weiner @ 2017-08-09 18:38 UTC (permalink / raw)
  To: Brad Bolen
  Cc: Jaegeuk Kim, Andrew Morton, Michal Hocko, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Tue, Aug 08, 2017 at 10:39:27PM -0400, Brad Bolen wrote:
> Yes, the BUG_ON(!page_count(page)) fired for me as well.

Brad, Jaegeuk, does the following patch address this problem?

---

>From cf0060892eb70bccbc8cedeac0a5756c8f7b975e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 9 Aug 2017 12:06:03 -0400
Subject: [PATCH] mm: memcontrol: fix NULL pointer crash in
 test_clear_page_writeback()

Jaegeuk and Brad report a NULL pointer crash when writeback ending
tries to update the memcg stats:

[] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
[] IP: test_clear_page_writeback+0x12e/0x2c0
[...]
[] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
[] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
[] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
[] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
[] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
[] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
[] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
[] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
[] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
[] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[] Call Trace:
[]  <IRQ>
[]  end_page_writeback+0x47/0x70
[]  f2fs_write_end_io+0x76/0x180 [f2fs]
[]  bio_endio+0x9f/0x120
[]  blk_update_request+0xa8/0x2f0
[]  scsi_end_request+0x39/0x1d0
[]  scsi_io_completion+0x211/0x690
[]  scsi_finish_command+0xd9/0x120
[]  scsi_softirq_done+0x127/0x150
[]  __blk_mq_complete_request_remote+0x13/0x20
[]  flush_smp_call_function_queue+0x56/0x110
[]  generic_smp_call_function_single_interrupt+0x13/0x30
[]  smp_call_function_single_interrupt+0x27/0x40
[]  call_function_single_interrupt+0x89/0x90
[] RIP: 0010:native_safe_halt+0x6/0x10

(gdb) l *(test_clear_page_writeback+0x12e)
0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
614		mod_node_page_state(page_pgdat(page), idx, val);
615		if (mem_cgroup_disabled() || !page->mem_cgroup)
616			return;
617		mod_memcg_state(page->mem_cgroup, idx, val);
618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
619		this_cpu_add(pn->lruvec_stat->count[idx], val);
620	}
621
622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
623							gfp_t gfp_mask,

The issue is that writeback doesn't hold a page reference and the page
might get freed after PG_writeback is cleared (and the mapping is
unlocked) in test_clear_page_writeback(). The stat functions looking
up the page's node or zone are safe, as those attributes are static
across allocation and free cycles. But page->mem_cgroup is not, and it
will get cleared if we race with truncation or migration.

It appears this race window has been around for a while, but less
likely to trigger when the memcg stats were updated first thing after
PG_writeback is cleared. Recent changes reshuffled this code to update
the global node stats before the memcg ones, though, stretching the
race window out to an extent where people can reproduce the problem.

Update test_clear_page_writeback() to look up and pin page->mem_cgroup
before clearing PG_writeback, then not use that pointer afterward. It
is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
but leaves the pageref-holding callsites that aren't affected alone.

Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reported-by: Bradley Bolen <bradleybolen@gmail.com>
Cc: <stable@vger.kernel.org> # 4.6+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 10 ++++++++--
 mm/memcontrol.c            | 43 +++++++++++++++++++++++++++++++------------
 mm/page-writeback.c        | 15 ++++++++++++---
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3dd6168..9b15a4bcfa77 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -484,7 +484,8 @@ bool mem_cgroup_oom_synchronize(bool wait);
 extern int do_swap_account;
 #endif
 
-void lock_page_memcg(struct page *page);
+struct mem_cgroup *lock_page_memcg(struct page *page);
+void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
 
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
@@ -809,7 +810,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void lock_page_memcg(struct page *page)
+static inline struct mem_cgroup *lock_page_memcg(struct page *page)
+{
+	return NULL;
+}
+
+static inline void __unlock_page_memcg(struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04d73ab..e09741af816f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1611,9 +1611,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
  * @page: the page
  *
  * This function protects unlocked LRU pages from being moved to
- * another cgroup and stabilizes their page->mem_cgroup binding.
+ * another cgroup.
+ *
+ * It ensures lifetime of the returned memcg. Caller is responsible
+ * for the lifetime of the page; __unlock_page_memcg() is available
+ * when @page might get freed inside the locked section.
  */
-void lock_page_memcg(struct page *page)
+struct mem_cgroup *lock_page_memcg(struct page *page)
 {
 	struct mem_cgroup *memcg;
 	unsigned long flags;
@@ -1622,18 +1626,24 @@ void lock_page_memcg(struct page *page)
 	 * The RCU lock is held throughout the transaction.  The fast
 	 * path can get away without acquiring the memcg->move_lock
 	 * because page moving starts with an RCU grace period.
-	 */
+	 *
+	 * The RCU lock also protects the memcg from being freed when
+	 * the page state that is going to change is the only thing
+	 * preventing the page itself from being freed. E.g. writeback
+	 * doesn't hold a page reference and relies on PG_writeback to
+	 * keep off truncation, migration and so forth.
+         */
 	rcu_read_lock();
 
 	if (mem_cgroup_disabled())
-		return;
+		return NULL;
 again:
 	memcg = page->mem_cgroup;
 	if (unlikely(!memcg))
-		return;
+		return NULL;
 
 	if (atomic_read(&memcg->moving_account) <= 0)
-		return;
+		return memcg;
 
 	spin_lock_irqsave(&memcg->move_lock, flags);
 	if (memcg != page->mem_cgroup) {
@@ -1649,18 +1659,18 @@ void lock_page_memcg(struct page *page)
 	memcg->move_lock_task = current;
 	memcg->move_lock_flags = flags;
 
-	return;
+	return memcg;
 }
 EXPORT_SYMBOL(lock_page_memcg);
 
 /**
- * unlock_page_memcg - unlock a page->mem_cgroup binding
- * @page: the page
+ * __unlock_page_memcg - unlock and unpin a memcg
+ * @memcg: the memcg
+ *
+ * Unlock and unpin a memcg returned by lock_page_memcg().
  */
-void unlock_page_memcg(struct page *page)
+void __unlock_page_memcg(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup *memcg = page->mem_cgroup;
-
 	if (memcg && memcg->move_lock_task == current) {
 		unsigned long flags = memcg->move_lock_flags;
 
@@ -1672,6 +1682,15 @@ void unlock_page_memcg(struct page *page)
 
 	rcu_read_unlock();
 }
+
+/**
+ * unlock_page_memcg - unlock a page->mem_cgroup binding
+ * @page: the page
+ */
+void unlock_page_memcg(struct page *page)
+{
+	__unlock_page_memcg(page->mem_cgroup);
+}
 EXPORT_SYMBOL(unlock_page_memcg);
 
 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 96e93b214d31..bf050ab025b7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2724,9 +2724,12 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
 int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 	int ret;
 
-	lock_page_memcg(page);
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -2754,12 +2757,18 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
+	/*
+	 * NOTE: Page might be free now! Writeback doesn't hold a page
+	 * reference on its own, it relies on truncation to wait for
+	 * the clearing of PG_writeback. The below can only access
+	 * page state that is static across allocation cycles.
+	 */
 	if (ret) {
-		dec_lruvec_page_state(page, NR_WRITEBACK);
+		dec_lruvec_state(lruvec, NR_WRITEBACK);
 		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
-	unlock_page_memcg(page);
+	__unlock_page_memcg(memcg);
 	return ret;
 }
 
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-09 18:38                 ` Johannes Weiner
@ 2017-08-10 11:56                   ` Michal Hocko
  2017-08-21 13:02                     ` Johannes Weiner
  2017-08-11  1:30                   ` Brad Bolen
  2017-08-11 21:37                   ` Jaegeuk Kim
  2 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2017-08-10 11:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Brad Bolen, Jaegeuk Kim, Andrew Morton, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Wed 09-08-17 14:38:25, Johannes Weiner wrote:
> On Tue, Aug 08, 2017 at 10:39:27PM -0400, Brad Bolen wrote:
> > Yes, the BUG_ON(!page_count(page)) fired for me as well.
> 
> Brad, Jaegeuk, does the following patch address this problem?
> 
> ---
> 
> >From cf0060892eb70bccbc8cedeac0a5756c8f7b975e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 9 Aug 2017 12:06:03 -0400
> Subject: [PATCH] mm: memcontrol: fix NULL pointer crash in
>  test_clear_page_writeback()
> 
> Jaegeuk and Brad report a NULL pointer crash when writeback ending
> tries to update the memcg stats:
> 
> [] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
> [] IP: test_clear_page_writeback+0x12e/0x2c0
> [...]
> [] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
> [] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
> [] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
> [] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
> [] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
> [] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
> [] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
> [] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [] Call Trace:
> []  <IRQ>
> []  end_page_writeback+0x47/0x70
> []  f2fs_write_end_io+0x76/0x180 [f2fs]
> []  bio_endio+0x9f/0x120
> []  blk_update_request+0xa8/0x2f0
> []  scsi_end_request+0x39/0x1d0
> []  scsi_io_completion+0x211/0x690
> []  scsi_finish_command+0xd9/0x120
> []  scsi_softirq_done+0x127/0x150
> []  __blk_mq_complete_request_remote+0x13/0x20
> []  flush_smp_call_function_queue+0x56/0x110
> []  generic_smp_call_function_single_interrupt+0x13/0x30
> []  smp_call_function_single_interrupt+0x27/0x40
> []  call_function_single_interrupt+0x89/0x90
> [] RIP: 0010:native_safe_halt+0x6/0x10
> 
> (gdb) l *(test_clear_page_writeback+0x12e)
> 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
> 614		mod_node_page_state(page_pgdat(page), idx, val);
> 615		if (mem_cgroup_disabled() || !page->mem_cgroup)
> 616			return;
> 617		mod_memcg_state(page->mem_cgroup, idx, val);
> 618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
> 619		this_cpu_add(pn->lruvec_stat->count[idx], val);
> 620	}
> 621
> 622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> 623							gfp_t gfp_mask,
> 
> The issue is that writeback doesn't hold a page reference and the page
> might get freed after PG_writeback is cleared (and the mapping is
> unlocked) in test_clear_page_writeback(). The stat functions looking
> up the page's node or zone are safe, as those attributes are static
> across allocation and free cycles. But page->mem_cgroup is not, and it
> will get cleared if we race with truncation or migration.

Is there anything that prevents us from holding a reference on a page
under writeback?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-09 18:38                 ` Johannes Weiner
  2017-08-10 11:56                   ` Michal Hocko
@ 2017-08-11  1:30                   ` Brad Bolen
  2017-08-11 21:37                   ` Jaegeuk Kim
  2 siblings, 0 replies; 14+ messages in thread
From: Brad Bolen @ 2017-08-11  1:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jaegeuk Kim, Andrew Morton, Michal Hocko, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Johannes,

Yes, the patch (slightly modified to apply for 4.11) does make my problem
go away.  Thank you for driving this to a solution.

Brad

On Wed, Aug 9, 2017 at 2:38 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Aug 08, 2017 at 10:39:27PM -0400, Brad Bolen wrote:
>> Yes, the BUG_ON(!page_count(page)) fired for me as well.
>
> Brad, Jaegeuk, does the following patch address this problem?
>
> ---
>
> From cf0060892eb70bccbc8cedeac0a5756c8f7b975e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 9 Aug 2017 12:06:03 -0400
> Subject: [PATCH] mm: memcontrol: fix NULL pointer crash in
>  test_clear_page_writeback()
>
> Jaegeuk and Brad report a NULL pointer crash when writeback ending
> tries to update the memcg stats:
>
> [] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
> [] IP: test_clear_page_writeback+0x12e/0x2c0
> [...]
> [] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
> [] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
> [] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
> [] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
> [] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
> [] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
> [] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
> [] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [] Call Trace:
> []  <IRQ>
> []  end_page_writeback+0x47/0x70
> []  f2fs_write_end_io+0x76/0x180 [f2fs]
> []  bio_endio+0x9f/0x120
> []  blk_update_request+0xa8/0x2f0
> []  scsi_end_request+0x39/0x1d0
> []  scsi_io_completion+0x211/0x690
> []  scsi_finish_command+0xd9/0x120
> []  scsi_softirq_done+0x127/0x150
> []  __blk_mq_complete_request_remote+0x13/0x20
> []  flush_smp_call_function_queue+0x56/0x110
> []  generic_smp_call_function_single_interrupt+0x13/0x30
> []  smp_call_function_single_interrupt+0x27/0x40
> []  call_function_single_interrupt+0x89/0x90
> [] RIP: 0010:native_safe_halt+0x6/0x10
>
> (gdb) l *(test_clear_page_writeback+0x12e)
> 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
> 614             mod_node_page_state(page_pgdat(page), idx, val);
> 615             if (mem_cgroup_disabled() || !page->mem_cgroup)
> 616                     return;
> 617             mod_memcg_state(page->mem_cgroup, idx, val);
> 618             pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
> 619             this_cpu_add(pn->lruvec_stat->count[idx], val);
> 620     }
> 621
> 622     unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> 623                                                     gfp_t gfp_mask,
>
> The issue is that writeback doesn't hold a page reference and the page
> might get freed after PG_writeback is cleared (and the mapping is
> unlocked) in test_clear_page_writeback(). The stat functions looking
> up the page's node or zone are safe, as those attributes are static
> across allocation and free cycles. But page->mem_cgroup is not, and it
> will get cleared if we race with truncation or migration.
>
> It appears this race window has been around for a while, but less
> likely to trigger when the memcg stats were updated first thing after
> PG_writeback is cleared. Recent changes reshuffled this code to update
> the global node stats before the memcg ones, though, stretching the
> race window out to an extent where people can reproduce the problem.
>
> Update test_clear_page_writeback() to look up and pin page->mem_cgroup
> before clearing PG_writeback, then not use that pointer afterward. It
> is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> but leaves the pageref-holding callsites that aren't affected alone.
>
> Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
> Reported-by: Bradley Bolen <bradleybolen@gmail.com>
> Cc: <stable@vger.kernel.org> # 4.6+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h | 10 ++++++++--
>  mm/memcontrol.c            | 43 +++++++++++++++++++++++++++++++------------
>  mm/page-writeback.c        | 15 ++++++++++++---
>  3 files changed, 51 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3914e3dd6168..9b15a4bcfa77 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -484,7 +484,8 @@ bool mem_cgroup_oom_synchronize(bool wait);
>  extern int do_swap_account;
>  #endif
>
> -void lock_page_memcg(struct page *page);
> +struct mem_cgroup *lock_page_memcg(struct page *page);
> +void __unlock_page_memcg(struct mem_cgroup *memcg);
>  void unlock_page_memcg(struct page *page);
>
>  static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
> @@ -809,7 +810,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void lock_page_memcg(struct page *page)
> +static inline struct mem_cgroup *lock_page_memcg(struct page *page)
> +{
> +       return NULL;
> +}
> +
> +static inline void __unlock_page_memcg(struct mem_cgroup *memcg)
>  {
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3df3c04d73ab..e09741af816f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1611,9 +1611,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
>   * @page: the page
>   *
>   * This function protects unlocked LRU pages from being moved to
> - * another cgroup and stabilizes their page->mem_cgroup binding.
> + * another cgroup.
> + *
> + * It ensures lifetime of the returned memcg. Caller is responsible
> + * for the lifetime of the page; __unlock_page_memcg() is available
> + * when @page might get freed inside the locked section.
>   */
> -void lock_page_memcg(struct page *page)
> +struct mem_cgroup *lock_page_memcg(struct page *page)
>  {
>         struct mem_cgroup *memcg;
>         unsigned long flags;
> @@ -1622,18 +1626,24 @@ void lock_page_memcg(struct page *page)
>          * The RCU lock is held throughout the transaction.  The fast
>          * path can get away without acquiring the memcg->move_lock
>          * because page moving starts with an RCU grace period.
> -        */
> +        *
> +        * The RCU lock also protects the memcg from being freed when
> +        * the page state that is going to change is the only thing
> +        * preventing the page itself from being freed. E.g. writeback
> +        * doesn't hold a page reference and relies on PG_writeback to
> +        * keep off truncation, migration and so forth.
> +         */
>         rcu_read_lock();
>
>         if (mem_cgroup_disabled())
> -               return;
> +               return NULL;
>  again:
>         memcg = page->mem_cgroup;
>         if (unlikely(!memcg))
> -               return;
> +               return NULL;
>
>         if (atomic_read(&memcg->moving_account) <= 0)
> -               return;
> +               return memcg;
>
>         spin_lock_irqsave(&memcg->move_lock, flags);
>         if (memcg != page->mem_cgroup) {
> @@ -1649,18 +1659,18 @@ void lock_page_memcg(struct page *page)
>         memcg->move_lock_task = current;
>         memcg->move_lock_flags = flags;
>
> -       return;
> +       return memcg;
>  }
>  EXPORT_SYMBOL(lock_page_memcg);
>
>  /**
> - * unlock_page_memcg - unlock a page->mem_cgroup binding
> - * @page: the page
> + * __unlock_page_memcg - unlock and unpin a memcg
> + * @memcg: the memcg
> + *
> + * Unlock and unpin a memcg returned by lock_page_memcg().
>   */
> -void unlock_page_memcg(struct page *page)
> +void __unlock_page_memcg(struct mem_cgroup *memcg)
>  {
> -       struct mem_cgroup *memcg = page->mem_cgroup;
> -
>         if (memcg && memcg->move_lock_task == current) {
>                 unsigned long flags = memcg->move_lock_flags;
>
> @@ -1672,6 +1682,15 @@ void unlock_page_memcg(struct page *page)
>
>         rcu_read_unlock();
>  }
> +
> +/**
> + * unlock_page_memcg - unlock a page->mem_cgroup binding
> + * @page: the page
> + */
> +void unlock_page_memcg(struct page *page)
> +{
> +       __unlock_page_memcg(page->mem_cgroup);
> +}
>  EXPORT_SYMBOL(unlock_page_memcg);
>
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 96e93b214d31..bf050ab025b7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2724,9 +2724,12 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
>  int test_clear_page_writeback(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
>         int ret;
>
> -       lock_page_memcg(page);
> +       memcg = lock_page_memcg(page);
> +       lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>         if (mapping && mapping_use_writeback_tags(mapping)) {
>                 struct inode *inode = mapping->host;
>                 struct backing_dev_info *bdi = inode_to_bdi(inode);
> @@ -2754,12 +2757,18 @@ int test_clear_page_writeback(struct page *page)
>         } else {
>                 ret = TestClearPageWriteback(page);
>         }
> +       /*
> +        * NOTE: Page might be free now! Writeback doesn't hold a page
> +        * reference on its own, it relies on truncation to wait for
> +        * the clearing of PG_writeback. The below can only access
> +        * page state that is static across allocation cycles.
> +        */
>         if (ret) {
> -               dec_lruvec_page_state(page, NR_WRITEBACK);
> +               dec_lruvec_state(lruvec, NR_WRITEBACK);
>                 dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>                 inc_node_page_state(page, NR_WRITTEN);
>         }
> -       unlock_page_memcg(page);
> +       __unlock_page_memcg(memcg);
>         return ret;
>  }
>
> --
> 2.13.3
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-09 18:38                 ` Johannes Weiner
  2017-08-10 11:56                   ` Michal Hocko
  2017-08-11  1:30                   ` Brad Bolen
@ 2017-08-11 21:37                   ` Jaegeuk Kim
  2 siblings, 0 replies; 14+ messages in thread
From: Jaegeuk Kim @ 2017-08-11 21:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Brad Bolen, Andrew Morton, Michal Hocko, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

Hi Johannes,

On 08/09, Johannes Weiner wrote:
> On Tue, Aug 08, 2017 at 10:39:27PM -0400, Brad Bolen wrote:
> > Yes, the BUG_ON(!page_count(page)) fired for me as well.
> 
> Brad, Jaegeuk, does the following patch address this problem?

I also confirmed that this patch addresses the problem.

Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>

Thanks,

> 
> ---
> 
> >From cf0060892eb70bccbc8cedeac0a5756c8f7b975e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 9 Aug 2017 12:06:03 -0400
> Subject: [PATCH] mm: memcontrol: fix NULL pointer crash in
>  test_clear_page_writeback()
> 
> Jaegeuk and Brad report a NULL pointer crash when writeback ending
> tries to update the memcg stats:
> 
> [] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
> [] IP: test_clear_page_writeback+0x12e/0x2c0
> [...]
> [] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
> [] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
> [] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
> [] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
> [] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
> [] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
> [] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
> [] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
> [] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [] Call Trace:
> []  <IRQ>
> []  end_page_writeback+0x47/0x70
> []  f2fs_write_end_io+0x76/0x180 [f2fs]
> []  bio_endio+0x9f/0x120
> []  blk_update_request+0xa8/0x2f0
> []  scsi_end_request+0x39/0x1d0
> []  scsi_io_completion+0x211/0x690
> []  scsi_finish_command+0xd9/0x120
> []  scsi_softirq_done+0x127/0x150
> []  __blk_mq_complete_request_remote+0x13/0x20
> []  flush_smp_call_function_queue+0x56/0x110
> []  generic_smp_call_function_single_interrupt+0x13/0x30
> []  smp_call_function_single_interrupt+0x27/0x40
> []  call_function_single_interrupt+0x89/0x90
> [] RIP: 0010:native_safe_halt+0x6/0x10
> 
> (gdb) l *(test_clear_page_writeback+0x12e)
> 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
> 614		mod_node_page_state(page_pgdat(page), idx, val);
> 615		if (mem_cgroup_disabled() || !page->mem_cgroup)
> 616			return;
> 617		mod_memcg_state(page->mem_cgroup, idx, val);
> 618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
> 619		this_cpu_add(pn->lruvec_stat->count[idx], val);
> 620	}
> 621
> 622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> 623							gfp_t gfp_mask,
> 
> The issue is that writeback doesn't hold a page reference and the page
> might get freed after PG_writeback is cleared (and the mapping is
> unlocked) in test_clear_page_writeback(). The stat functions looking
> up the page's node or zone are safe, as those attributes are static
> across allocation and free cycles. But page->mem_cgroup is not, and it
> will get cleared if we race with truncation or migration.
> 
> It appears this race window has been around for a while, but less
> likely to trigger when the memcg stats were updated first thing after
> PG_writeback is cleared. Recent changes reshuffled this code to update
> the global node stats before the memcg ones, though, stretching the
> race window out to an extent where people can reproduce the problem.
> 
> Update test_clear_page_writeback() to look up and pin page->mem_cgroup
> before clearing PG_writeback, then not use that pointer afterward. It
> is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> but leaves the pageref-holding callsites that aren't affected alone.
> 
> Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
> Reported-by: Bradley Bolen <bradleybolen@gmail.com>
> Cc: <stable@vger.kernel.org> # 4.6+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h | 10 ++++++++--
>  mm/memcontrol.c            | 43 +++++++++++++++++++++++++++++++------------
>  mm/page-writeback.c        | 15 ++++++++++++---
>  3 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3914e3dd6168..9b15a4bcfa77 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -484,7 +484,8 @@ bool mem_cgroup_oom_synchronize(bool wait);
>  extern int do_swap_account;
>  #endif
>  
> -void lock_page_memcg(struct page *page);
> +struct mem_cgroup *lock_page_memcg(struct page *page);
> +void __unlock_page_memcg(struct mem_cgroup *memcg);
>  void unlock_page_memcg(struct page *page);
>  
>  static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
> @@ -809,7 +810,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void lock_page_memcg(struct page *page)
> +static inline struct mem_cgroup *lock_page_memcg(struct page *page)
> +{
> +	return NULL;
> +}
> +
> +static inline void __unlock_page_memcg(struct mem_cgroup *memcg)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3df3c04d73ab..e09741af816f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1611,9 +1611,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
>   * @page: the page
>   *
>   * This function protects unlocked LRU pages from being moved to
> - * another cgroup and stabilizes their page->mem_cgroup binding.
> + * another cgroup.
> + *
> + * It ensures lifetime of the returned memcg. Caller is responsible
> + * for the lifetime of the page; __unlock_page_memcg() is available
> + * when @page might get freed inside the locked section.
>   */
> -void lock_page_memcg(struct page *page)
> +struct mem_cgroup *lock_page_memcg(struct page *page)
>  {
>  	struct mem_cgroup *memcg;
>  	unsigned long flags;
> @@ -1622,18 +1626,24 @@ void lock_page_memcg(struct page *page)
>  	 * The RCU lock is held throughout the transaction.  The fast
>  	 * path can get away without acquiring the memcg->move_lock
>  	 * because page moving starts with an RCU grace period.
> -	 */
> +	 *
> +	 * The RCU lock also protects the memcg from being freed when
> +	 * the page state that is going to change is the only thing
> +	 * preventing the page itself from being freed. E.g. writeback
> +	 * doesn't hold a page reference and relies on PG_writeback to
> +	 * keep off truncation, migration and so forth.
> +         */
>  	rcu_read_lock();
>  
>  	if (mem_cgroup_disabled())
> -		return;
> +		return NULL;
>  again:
>  	memcg = page->mem_cgroup;
>  	if (unlikely(!memcg))
> -		return;
> +		return NULL;
>  
>  	if (atomic_read(&memcg->moving_account) <= 0)
> -		return;
> +		return memcg;
>  
>  	spin_lock_irqsave(&memcg->move_lock, flags);
>  	if (memcg != page->mem_cgroup) {
> @@ -1649,18 +1659,18 @@ void lock_page_memcg(struct page *page)
>  	memcg->move_lock_task = current;
>  	memcg->move_lock_flags = flags;
>  
> -	return;
> +	return memcg;
>  }
>  EXPORT_SYMBOL(lock_page_memcg);
>  
>  /**
> - * unlock_page_memcg - unlock a page->mem_cgroup binding
> - * @page: the page
> + * __unlock_page_memcg - unlock and unpin a memcg
> + * @memcg: the memcg
> + *
> + * Unlock and unpin a memcg returned by lock_page_memcg().
>   */
> -void unlock_page_memcg(struct page *page)
> +void __unlock_page_memcg(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup *memcg = page->mem_cgroup;
> -
>  	if (memcg && memcg->move_lock_task == current) {
>  		unsigned long flags = memcg->move_lock_flags;
>  
> @@ -1672,6 +1682,15 @@ void unlock_page_memcg(struct page *page)
>  
>  	rcu_read_unlock();
>  }
> +
> +/**
> + * unlock_page_memcg - unlock a page->mem_cgroup binding
> + * @page: the page
> + */
> +void unlock_page_memcg(struct page *page)
> +{
> +	__unlock_page_memcg(page->mem_cgroup);
> +}
>  EXPORT_SYMBOL(unlock_page_memcg);
>  
>  /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 96e93b214d31..bf050ab025b7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2724,9 +2724,12 @@ EXPORT_SYMBOL(clear_page_dirty_for_io);
>  int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
>  	int ret;
>  
> -	lock_page_memcg(page);
> +	memcg = lock_page_memcg(page);
> +	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  	if (mapping && mapping_use_writeback_tags(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct backing_dev_info *bdi = inode_to_bdi(inode);
> @@ -2754,12 +2757,18 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> +	/*
> +	 * NOTE: Page might be free now! Writeback doesn't hold a page
> +	 * reference on its own, it relies on truncation to wait for
> +	 * the clearing of PG_writeback. The below can only access
> +	 * page state that is static across allocation cycles.
> +	 */
>  	if (ret) {
> -		dec_lruvec_page_state(page, NR_WRITEBACK);
> +		dec_lruvec_state(lruvec, NR_WRITEBACK);
>  		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  		inc_node_page_state(page, NR_WRITTEN);
>  	}
> -	unlock_page_memcg(page);
> +	__unlock_page_memcg(memcg);
>  	return ret;
>  }
>  
> -- 
> 2.13.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-10 11:56                   ` Michal Hocko
@ 2017-08-21 13:02                     ` Johannes Weiner
  2017-08-21 13:23                       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2017-08-21 13:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Brad Bolen, Jaegeuk Kim, Andrew Morton, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Thu, Aug 10, 2017 at 01:56:05PM +0200, Michal Hocko wrote:
> On Wed 09-08-17 14:38:25, Johannes Weiner wrote:
> > The issue is that writeback doesn't hold a page reference and the page
> > might get freed after PG_writeback is cleared (and the mapping is
> > unlocked) in test_clear_page_writeback(). The stat functions looking
> > up the page's node or zone are safe, as those attributes are static
> > across allocation and free cycles. But page->mem_cgroup is not, and it
> > will get cleared if we race with truncation or migration.
> 
> Is there anything that prevents us from holding a reference on a page
> under writeback?

Hm, I'm hesitant to add redundant life-time management to the page
there just for memcg, which is not always configured in.

Pinning the memcg instead is slightly more complex, but IMO has the
complexity in a preferrable place.

Would you agree?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: kernel panic on null pointer on page->mem_cgroup
  2017-08-21 13:02                     ` Johannes Weiner
@ 2017-08-21 13:23                       ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2017-08-21 13:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Brad Bolen, Jaegeuk Kim, Andrew Morton, Vladimir Davydov,
	linux-mm, cgroups, linux-kernel

On Mon 21-08-17 09:02:18, Johannes Weiner wrote:
> On Thu, Aug 10, 2017 at 01:56:05PM +0200, Michal Hocko wrote:
> > On Wed 09-08-17 14:38:25, Johannes Weiner wrote:
> > > The issue is that writeback doesn't hold a page reference and the page
> > > might get freed after PG_writeback is cleared (and the mapping is
> > > unlocked) in test_clear_page_writeback(). The stat functions looking
> > > up the page's node or zone are safe, as those attributes are static
> > > across allocation and free cycles. But page->mem_cgroup is not, and it
> > > will get cleared if we race with truncation or migration.
> > 
> > Is there anything that prevents us from holding a reference on a page
> > under writeback?
> 
> Hm, I'm hesitant to add redundant life-time management to the page
> there just for memcg, which is not always configured in.
> 
> Pinning the memcg instead is slightly more complex, but IMO has the
> complexity in a preferrable place.

If that is the single place that needs such a special handling and it is
very likely to stay that way then the additional complexity is probably
justified. I am just worried that this is really subtle and history
tells us that such a code usually kicks us back later.
 
> Would you agree?

Well, I was not objecting to the patch. It seems correct I am just
worried a robust fix would be preferable. And a clear object life time
sounds like a more robust thing to do.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-08-21 13:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-05 15:52 kernel panic on null pointer on page->mem_cgroup Jaegeuk Kim
     [not found] ` <20170808010150.4155-1-bradleybolen@gmail.com>
2017-08-08 16:21   ` Johannes Weiner
2017-08-08 16:56     ` Jaegeuk Kim
2017-08-08 17:37       ` Johannes Weiner
2017-08-08 19:13         ` Brad Bolen
2017-08-08 20:08           ` Johannes Weiner
2017-08-09  1:44             ` Jaegeuk Kim
2017-08-09  2:39               ` Brad Bolen
2017-08-09 18:38                 ` Johannes Weiner
2017-08-10 11:56                   ` Michal Hocko
2017-08-21 13:02                     ` Johannes Weiner
2017-08-21 13:23                       ` Michal Hocko
2017-08-11  1:30                   ` Brad Bolen
2017-08-11 21:37                   ` Jaegeuk Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).