kernel panic on null pointer on page->mem_cgroup

* kernel panic on null pointer on page->mem_cgroup
@ 2017-08-05 15:52 Jaegeuk Kim
From: Jaegeuk Kim @ 2017-08-05 15:52 UTC
  To: hannes; +Cc: Linux Kernel Mailing List, Linux F2FS Dev Mailing List, linux-mm

Hi Johannes,

Could I ask for your help with the panic below, which has been bothering me
recently? I'm currently running xfstests on 4.13-rc2 and hit it at random;
I don't have a reliable reproducer.

[ 3722.366490] BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
[ 3722.378815] IP: test_clear_page_writeback+0x12e/0x2c0
[ 3722.384931] PGD 3fb77067 
[ 3722.384932] P4D 3fb77067 
[ 3722.389222] PUD 1302f067 
[ 3722.392676] PMD 0 
[ 3722.407447] 
[ 3722.416459] Oops: 0000 [#1] SMP
[ 3722.424191] Modules linked in: quota_v2 quota_tree dm_snapshot dm_bufio dm_flakey f2fs(O) ppdev joydev input_leds serio_raw snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd parport_pc soundcore mac_hid i2c_piix4 parport ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci psmouse libahci e1000 pata_acpi video [last unloaded: scsi_debug]
[ 3722.494822] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O    4.13.0-rc2+ #7
[ 3722.509659] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 3722.523018] task: ffff8e3abe32bc00 task.stack: ffffab1e801f0000
[ 3722.534108] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
[ 3722.547281] RSP: 0018:ffff8e3abfd03d78 EFLAGS: 00010046
[ 3722.561761] RAX: 0000000000000000 RBX: ffffdb59c03f8900 RCX: ffffffffffffffe8
[ 3722.595343] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff8e3abffeb000
[ 3722.615108] RBP: ffff8e3abfd03da8 R08: 0000000000020059 R09: 00000000fffffffc
[ 3722.674717] R10: 0000000000000000 R11: 0000000000020048 R12: ffff8e3a8c39f668
[ 3722.691916] R13: 0000000000000001 R14: ffff8e3a8c39f680 R15: 0000000000000000
[ 3722.736393] FS:  0000000000000000(0000) GS:ffff8e3abfd00000(0000) knlGS:0000000000000000
[ 3722.797553] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3722.852623] CR2: 00000000000003b0 CR3: 000000002c5e1000 CR4: 00000000000406e0
[ 3722.896451] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3722.950847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3722.965578] Call Trace:
[ 3722.971710]  <IRQ>
[ 3722.976306]  end_page_writeback+0x47/0x70
[ 3722.983252]  f2fs_write_end_io+0x76/0x180 [f2fs]
[ 3723.012721]  bio_endio+0x9f/0x120
[ 3723.035764]  blk_update_request+0xa8/0x2f0
[ 3723.064621]  scsi_end_request+0x39/0x1d0
[ 3723.086994]  scsi_io_completion+0x211/0x690
[ 3723.116553]  scsi_finish_command+0xd9/0x120
[ 3723.143690]  scsi_softirq_done+0x127/0x150
[ 3723.170070]  __blk_mq_complete_request_remote+0x13/0x20
[ 3723.199780]  flush_smp_call_function_queue+0x56/0x110
[ 3723.233148]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 3723.255267]  smp_call_function_single_interrupt+0x27/0x40
[ 3723.285327]  call_function_single_interrupt+0x89/0x90
[ 3723.309718] RIP: 0010:native_safe_halt+0x6/0x10

(gdb) l *(test_clear_page_writeback+0x12e)
0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
614		mod_node_page_state(page_pgdat(page), idx, val);
615		if (mem_cgroup_disabled() || !page->mem_cgroup)
616			return;
617		mod_memcg_state(page->mem_cgroup, idx, val);
618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
619		this_cpu_add(pn->lruvec_stat->count[idx], val);
620	}
621	
622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
623							gfp_t gfp_mask,

First, I've confirmed that the problem does not occur without your patch below:

   commit 00f3ca2c2d6635d ("mm: memcontrol: per-lruvec stats infrastructure")

Second, what I've figured out so far is that page->mem_cgroup passes the NULL
check at line 615, but by the time it is dereferenced a few lines later it has
become NULL. Is it possible that somebody clears it without locking the page?
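
To make the window I suspect concrete, here is a minimal user-space sketch.
It is not kernel code: struct memcg_like, struct page_like, clear_memcg() and
the inject_race flag are all made up for illustration, and I have not
confirmed that this is what actually happens. It only shows how a pointer
that is checked and then loaded again can turn NULL in between, while a
single load into a local cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real kernel layout. */
struct memcg_like { long stat; };
struct page_like { struct memcg_like *volatile memcg; };

/* Simulates another context clearing the pointer (assumed, not confirmed). */
static void clear_memcg(struct page_like *p)
{
	p->memcg = NULL;
}

/*
 * Same shape as the listing above: check p->memcg, then load it again.
 * Returns 1 on update, 0 if skipped, -1 where the real code would oops.
 */
static int update_recheck(struct page_like *p, long val, int inject_race)
{
	if (!p->memcg)
		return 0;
	if (inject_race)
		clear_memcg(p);	/* window between the check and the use */
	struct memcg_like *m = p->memcg;	/* second load of the pointer */
	if (!m)
		return -1;	/* a NULL dereference in the real code */
	m->stat += val;
	return 1;
}

/*
 * Load the pointer once, then check and use the same local copy.
 * This only avoids the NULL dereference; it does nothing to keep the
 * pointed-to object alive, which would still need a lock or a reference.
 */
static int update_snapshot(struct page_like *p, long val, int inject_race)
{
	struct memcg_like *m = p->memcg;	/* single load */
	if (!m)
		return 0;
	if (inject_race)
		clear_memcg(p);	/* same window, but m is unaffected */
	m->stat += val;
	return 1;
}

/* Walks through the three cases and returns the final counter value. */
int run_demo(void)
{
	struct memcg_like cg = { 0 };
	struct page_like pg = { &cg };

	assert(update_recheck(&pg, 1, 0) == 1);		/* no race: updated */
	assert(update_recheck(&pg, 1, 1) == -1);	/* race: would oops */
	pg.memcg = &cg;					/* restore for next case */
	assert(update_snapshot(&pg, 1, 1) == 1);	/* race: still updated */
	return (int)cg.stat;
}
```

Calling run_demo() exercises all three cases. Again, this is only an
illustration of the check-then-use pattern, not a proposed fix.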

Could you please shed some light on this?
Or is there already a patch that fixes it?

Thanks,
