* hugepage related lockdep trace. @ 2013-07-17 15:32 ` Dave Jones 0 siblings, 0 replies; 40+ messages in thread From: Dave Jones @ 2013-07-17 15:32 UTC (permalink / raw) To: Linux Kernel; +Cc: linux-mm [128095.470960] ================================= [128095.471315] [ INFO: inconsistent lock state ] [128095.471660] 3.11.0-rc1+ #9 Not tainted [128095.472156] --------------------------------- [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 [128095.475128] {RECLAIM_FS-ON-W} state was registered at: [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 [128095.480138] [<c15b927c>] error_code+0x6c/0x74 [128095.480138] irq event stamp: 3136917 [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa [128095.480138] other info that might help us debug this: [128095.480138] Possible unsafe locking scenario: [128095.480138] CPU0 [128095.480138] ---- [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] <Interrupt> [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] *** DEADLOCK *** [128095.480138] no locks held by kswapd0/49. [128095.480138] stack backtrace: [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb [128095.480138] Call Trace: [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e [128095.480138] [<c112e531>] kswapd+0x517/0xa75 [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 [128095.480138] [<c10661ff>] kthread+0xa8/0xaa [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 ^ permalink raw reply [flat|nested] 40+ messages in thread
* hugepage related lockdep trace. @ 2013-07-17 15:32 ` Dave Jones 0 siblings, 0 replies; 40+ messages in thread From: Dave Jones @ 2013-07-17 15:32 UTC (permalink / raw) To: Linux Kernel; +Cc: linux-mm [128095.470960] ================================= [128095.471315] [ INFO: inconsistent lock state ] [128095.471660] 3.11.0-rc1+ #9 Not tainted [128095.472156] --------------------------------- [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 [128095.475128] {RECLAIM_FS-ON-W} state was registered at: [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 [128095.480138] [<c15b927c>] error_code+0x6c/0x74 [128095.480138] irq event stamp: 3136917 [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa [128095.480138] other info that might help us debug this: [128095.480138] Possible unsafe locking scenario: [128095.480138] CPU0 [128095.480138] ---- [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] <Interrupt> [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] *** DEADLOCK *** [128095.480138] no locks held by kswapd0/49. [128095.480138] stack backtrace: [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb [128095.480138] Call Trace: [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e [128095.480138] [<c112e531>] kswapd+0x517/0xa75 [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 [128095.480138] [<c10661ff>] kthread+0xa8/0xaa [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-17 15:32 ` Dave Jones @ 2013-07-18 0:09 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-18 0:09 UTC (permalink / raw) To: Dave Jones, Linux Kernel, linux-mm Cc: Rik van Riel, Aneesh Kumar K.V, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Ccing people get_maintainer says. On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > [128095.470960] ================================= > [128095.471315] [ INFO: inconsistent lock state ] > [128095.471660] 3.11.0-rc1+ #9 Not tainted > [128095.472156] --------------------------------- > [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > [128095.480138] irq event stamp: 3136917 > [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > [128095.480138] > other info that might help us debug this: > [128095.480138] Possible unsafe locking scenario: > > [128095.480138] CPU0 > [128095.480138] ---- > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] <Interrupt> > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] > *** DEADLOCK *** > > [128095.480138] no locks held by kswapd0/49. > [128095.480138] > stack backtrace: > [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > [128095.480138] Call Trace: > [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 IMHO, it's a false positive because i_mmap_mutex was held by kswapd while one in the middle of fault path could be never on kswapd context. It seems lockdep for reclaim-over-fs isn't enough smart to identify between background and direct reclaim. Wait for other's opinion. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-18 0:09 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-18 0:09 UTC (permalink / raw) To: Dave Jones, Linux Kernel, linux-mm Cc: Rik van Riel, Aneesh Kumar K.V, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Ccing people get_maintainer says. On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > [128095.470960] ================================= > [128095.471315] [ INFO: inconsistent lock state ] > [128095.471660] 3.11.0-rc1+ #9 Not tainted > [128095.472156] --------------------------------- > [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > [128095.480138] irq event stamp: 3136917 > [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > [128095.480138] > other info that might help us debug this: > [128095.480138] Possible unsafe locking scenario: > > [128095.480138] CPU0 > [128095.480138] ---- > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] <Interrupt> > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] > *** DEADLOCK *** > > [128095.480138] no locks held by kswapd0/49. > [128095.480138] > stack backtrace: > [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > [128095.480138] Call Trace: > [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 IMHO, it's a false positive because i_mmap_mutex was held by kswapd while one in the middle of fault path could be never on kswapd context. It seems lockdep for reclaim-over-fs isn't enough smart to identify between background and direct reclaim. Wait for other's opinion. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-18 0:09 ` Minchan Kim @ 2013-07-18 17:42 ` Aneesh Kumar K.V -1 siblings, 0 replies; 40+ messages in thread From: Aneesh Kumar K.V @ 2013-07-18 17:42 UTC (permalink / raw) To: Minchan Kim, Dave Jones, Linux Kernel, linux-mm Cc: Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Minchan Kim <minchan@kernel.org> writes: > Ccing people get_maintainer says. > > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: >> [128095.470960] ================================= >> [128095.471315] [ INFO: inconsistent lock state ] >> [128095.471660] 3.11.0-rc1+ #9 Not tainted >> [128095.472156] --------------------------------- >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: >> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: >> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 >> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc >> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 >> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 >> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 >> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b >> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 >> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 >> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a >> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 >> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c >> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 >> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 >> [128095.480138] irq event stamp: 3136917 >> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 >> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa >> [128095.480138] >> other info that might help us debug this: >> [128095.480138] Possible unsafe locking scenario: >> >> [128095.480138] CPU0 >> [128095.480138] ---- >> [128095.480138] lock(&mapping->i_mmap_mutex); >> [128095.480138] <Interrupt> >> [128095.480138] lock(&mapping->i_mmap_mutex); >> [128095.480138] >> *** DEADLOCK *** >> >> [128095.480138] no locks held by kswapd0/49. >> [128095.480138] >> stack backtrace: >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 >> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 >> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 >> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 >> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb >> [128095.480138] Call Trace: >> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 >> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 >> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 >> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 >> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 >> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 >> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e >> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e >> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 >> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] >> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb >> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce >> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e >> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 >> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 >> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 >> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > IMHO, it's a false positive because i_mmap_mutex was held by kswapd > while one in the middle of fault path could be never on kswapd context. > > It seems lockdep for reclaim-over-fs isn't enough smart to identify > between background and direct reclaim. > > Wait for other's opinion. Is that reasoning correct ?. We may not deadlock because hugetlb pages cannot be reclaimed. So the fault path in hugetlb won't end up reclaiming pages from same inode. But the report is correct right ? Looking at the hugetlb code we have in huge_pmd_share out: pte = (pte_t *)pmd_alloc(mm, pud, addr); mutex_unlock(&mapping->i_mmap_mutex); return pte; I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise that pmd_alloc can result in a reclaim which can call shrink_page_list ? Something like ? diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 83aff0a..2cb1be3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - pte = (pte_t *)pmd_alloc(mm, pud, addr); mutex_unlock(&mapping->i_mmap_mutex); + pte = (pte_t *)pmd_alloc(mm, pud, addr); return pte; } -aneesh ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-18 17:42 ` Aneesh Kumar K.V 0 siblings, 0 replies; 40+ messages in thread From: Aneesh Kumar K.V @ 2013-07-18 17:42 UTC (permalink / raw) To: Minchan Kim, Dave Jones, Linux Kernel, linux-mm Cc: Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Minchan Kim <minchan@kernel.org> writes: > Ccing people get_maintainer says. > > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: >> [128095.470960] ================================= >> [128095.471315] [ INFO: inconsistent lock state ] >> [128095.471660] 3.11.0-rc1+ #9 Not tainted >> [128095.472156] --------------------------------- >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: >> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: >> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 >> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc >> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 >> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 >> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 >> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b >> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 >> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 >> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a >> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 >> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c >> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 >> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 >> [128095.480138] irq event stamp: 3136917 >> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 >> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa >> [128095.480138] >> other info that might help us debug this: >> [128095.480138] Possible unsafe locking scenario: >> >> [128095.480138] CPU0 >> [128095.480138] ---- >> [128095.480138] lock(&mapping->i_mmap_mutex); >> [128095.480138] <Interrupt> >> [128095.480138] lock(&mapping->i_mmap_mutex); >> [128095.480138] >> *** DEADLOCK *** >> >> [128095.480138] no locks held by kswapd0/49. >> [128095.480138] >> stack backtrace: >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 >> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 >> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 >> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 >> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb >> [128095.480138] Call Trace: >> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 >> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 >> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 >> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 >> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 >> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 >> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e >> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e >> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 >> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] >> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb >> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce >> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e >> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 >> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 >> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 >> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > IMHO, it's a false positive because i_mmap_mutex was held by kswapd > while one in the middle of fault path could be never on kswapd context. > > It seems lockdep for reclaim-over-fs isn't enough smart to identify > between background and direct reclaim. > > Wait for other's opinion. Is that reasoning correct ?. We may not deadlock because hugetlb pages cannot be reclaimed. So the fault path in hugetlb won't end up reclaiming pages from same inode. But the report is correct right ? Looking at the hugetlb code we have in huge_pmd_share out: pte = (pte_t *)pmd_alloc(mm, pud, addr); mutex_unlock(&mapping->i_mmap_mutex); return pte; I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise that pmd_alloc can result in a reclaim which can call shrink_page_list ? Something like ? diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 83aff0a..2cb1be3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - pte = (pte_t *)pmd_alloc(mm, pud, addr); mutex_unlock(&mapping->i_mmap_mutex); + pte = (pte_t *)pmd_alloc(mm, pud, addr); return pte; } -aneesh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-18 17:42 ` Aneesh Kumar K.V @ 2013-07-19 0:13 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-19 0:13 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > Minchan Kim <minchan@kernel.org> writes: > > > Ccing people get_maintainer says. > > > > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > >> [128095.470960] ================================= > >> [128095.471315] [ INFO: inconsistent lock state ] > >> [128095.471660] 3.11.0-rc1+ #9 Not tainted > >> [128095.472156] --------------------------------- > >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > >> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > >> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > >> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > >> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > >> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > >> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > >> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > >> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > >> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > >> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > >> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > >> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > >> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > >> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > >> [128095.480138] irq event stamp: 3136917 > >> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > >> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > >> [128095.480138] > >> other info that might help us debug this: > >> [128095.480138] Possible unsafe locking scenario: > >> > >> [128095.480138] CPU0 > >> [128095.480138] ---- > >> [128095.480138] lock(&mapping->i_mmap_mutex); > >> [128095.480138] <Interrupt> > >> [128095.480138] lock(&mapping->i_mmap_mutex); > >> [128095.480138] > >> *** DEADLOCK *** > >> > >> [128095.480138] no locks held by kswapd0/49. > >> [128095.480138] > >> stack backtrace: > >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > >> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > >> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > >> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > >> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > >> [128095.480138] Call Trace: > >> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > >> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > >> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > >> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > >> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > >> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > >> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > >> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > >> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > >> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > >> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > >> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > >> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > >> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > >> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > >> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > >> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > > > IMHO, it's a false positive because i_mmap_mutex was held by kswapd > > while one in the middle of fault path could be never on kswapd context. > > > > It seems lockdep for reclaim-over-fs isn't enough smart to identify > > between background and direct reclaim. > > > > Wait for other's opinion. > > Is that reasoning correct ?. We may not deadlock because hugetlb pages > cannot be reclaimed. So the fault path in hugetlb won't end up > reclaiming pages from same inode. But the report is correct right ? > > > Looking at the hugetlb code we have in huge_pmd_share > > out: > pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > return pte; > > I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > that pmd_alloc can result in a reclaim which can call shrink_page_list ? True. Sorry for that I didn't review the code carefully and I was very paranoid in reclaim-over-fs due to internal works. :( > > Something like ? > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 83aff0a..2cb1be3 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > put_page(virt_to_page(spte)); > spin_unlock(&mm->page_table_lock); > out: > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > return pte; I am blind on hugetlb but not sure it doesn't break eb48c071. Michal? > } > > -aneesh > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-19 0:13 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-19 0:13 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > Minchan Kim <minchan@kernel.org> writes: > > > Ccing people get_maintainer says. > > > > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > >> [128095.470960] ================================= > >> [128095.471315] [ INFO: inconsistent lock state ] > >> [128095.471660] 3.11.0-rc1+ #9 Not tainted > >> [128095.472156] --------------------------------- > >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > >> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > >> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > >> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > >> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > >> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > >> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > >> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > >> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > >> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > >> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > >> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > >> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > >> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > >> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > >> [128095.480138] irq event stamp: 3136917 > >> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > >> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > >> [128095.480138] > >> other info that might help us debug this: > >> [128095.480138] Possible unsafe locking scenario: > >> > >> [128095.480138] CPU0 > >> [128095.480138] ---- > >> [128095.480138] lock(&mapping->i_mmap_mutex); > >> [128095.480138] <Interrupt> > >> [128095.480138] lock(&mapping->i_mmap_mutex); > >> [128095.480138] > >> *** DEADLOCK *** > >> > >> [128095.480138] no locks held by kswapd0/49. > >> [128095.480138] > >> stack backtrace: > >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > >> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > >> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > >> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > >> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > >> [128095.480138] Call Trace: > >> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > >> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > >> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > >> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > >> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > >> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > >> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > >> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > >> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > >> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > >> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > >> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > >> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > >> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > >> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > >> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > >> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > > > IMHO, it's a false positive because i_mmap_mutex was held by kswapd > > while one in the middle of fault path could be never on kswapd context. > > > > It seems lockdep for reclaim-over-fs isn't enough smart to identify > > between background and direct reclaim. > > > > Wait for other's opinion. > > Is that reasoning correct ?. We may not deadlock because hugetlb pages > cannot be reclaimed. So the fault path in hugetlb won't end up > reclaiming pages from same inode. But the report is correct right ? > > > Looking at the hugetlb code we have in huge_pmd_share > > out: > pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > return pte; > > I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > that pmd_alloc can result in a reclaim which can call shrink_page_list ? True. Sorry for that I didn't review the code carefully and I was very paranoid in reclaim-over-fs due to internal works. :( > > Something like ? > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 83aff0a..2cb1be3 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > put_page(virt_to_page(spte)); > spin_unlock(&mm->page_table_lock); > out: > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > return pte; I am blind on hugetlb but not sure it doesn't break eb48c071. Michal? > } > > -aneesh > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-19 0:13 ` Minchan Kim @ 2013-07-23 7:24 ` Hush Bensen -1 siblings, 0 replies; 40+ messages in thread From: Hush Bensen @ 2013-07-23 7:24 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On 07/18/2013 06:13 PM, Minchan Kim wrote: > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: >> Minchan Kim <minchan@kernel.org> writes: >> >>> Ccing people get_maintainer says. >>> >>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: >>>> [128095.470960] ================================= >>>> [128095.471315] [ INFO: inconsistent lock state ] >>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted >>>> [128095.472156] --------------------------------- >>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. >>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: >>>> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 >>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: >>>> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 >>>> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc >>>> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 >>>> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 >>>> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 >>>> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b >>>> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 >>>> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 >>>> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a >>>> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 >>>> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c >>>> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 >>>> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 >>>> [128095.480138] irq event stamp: 3136917 >>>> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 >>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 >>>> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f >>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa >>>> [128095.480138] >>>> other info that might help us debug this: >>>> [128095.480138] Possible unsafe locking scenario: >>>> >>>> [128095.480138] CPU0 >>>> [128095.480138] ---- >>>> [128095.480138] lock(&mapping->i_mmap_mutex); >>>> [128095.480138] <Interrupt> >>>> [128095.480138] lock(&mapping->i_mmap_mutex); >>>> [128095.480138] >>>> *** DEADLOCK *** >>>> >>>> [128095.480138] no locks held by kswapd0/49. >>>> [128095.480138] >>>> stack backtrace: >>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 >>>> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 >>>> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 >>>> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 >>>> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb >>>> [128095.480138] Call Trace: >>>> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 >>>> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 >>>> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 >>>> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 >>>> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 >>>> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 >>>> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e >>>> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e >>>> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 >>>> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] >>>> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 >>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >>>> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb >>>> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce >>>> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e >>>> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 >>>> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 >>>> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa >>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >>>> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 >>>> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 >>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >>> while one in the middle of fault path could be never on kswapd context. >>> >>> It seems lockdep for reclaim-over-fs isn't enough smart to identify >>> between background and direct reclaim. >>> >>> Wait for other's opinion. >> Is that reasoning correct ?. We may not deadlock because hugetlb pages >> cannot be reclaimed. So the fault path in hugetlb won't end up >> reclaiming pages from same inode. But the report is correct right ? >> >> >> Looking at the hugetlb code we have in huge_pmd_share >> >> out: >> pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> return pte; >> >> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise >> that pmd_alloc can result in a reclaim which can call shrink_page_list ? > True. Sorry for that I didn't review the code carefully and I was very paranoid > in reclaim-over-fs due to internal works. :( Could you explain more about reclaim-over-fs stuff? > >> Something like ? >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 83aff0a..2cb1be3 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) >> put_page(virt_to_page(spte)); >> spin_unlock(&mm->page_table_lock); >> out: >> - pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> + pte = (pte_t *)pmd_alloc(mm, pud, addr); >> return pte; > I am blind on hugetlb but not sure it doesn't break eb48c071. > Michal? > > >> } >> >> -aneesh >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-23 7:24 ` Hush Bensen 0 siblings, 0 replies; 40+ messages in thread From: Hush Bensen @ 2013-07-23 7:24 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On 07/18/2013 06:13 PM, Minchan Kim wrote: > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: >> Minchan Kim <minchan@kernel.org> writes: >> >>> Ccing people get_maintainer says. >>> >>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: >>>> [128095.470960] ================================= >>>> [128095.471315] [ INFO: inconsistent lock state ] >>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted >>>> [128095.472156] --------------------------------- >>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. >>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: >>>> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 >>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at: >>>> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 >>>> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc >>>> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 >>>> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 >>>> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 >>>> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b >>>> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 >>>> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 >>>> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a >>>> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 >>>> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c >>>> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 >>>> [128095.480138] [<c15b927c>] error_code+0x6c/0x74 >>>> [128095.480138] irq event stamp: 3136917 >>>> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 >>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 >>>> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f >>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa >>>> [128095.480138] >>>> other info that might help us debug this: >>>> [128095.480138] Possible unsafe locking scenario: >>>> >>>> [128095.480138] CPU0 >>>> [128095.480138] ---- >>>> [128095.480138] lock(&mapping->i_mmap_mutex); >>>> [128095.480138] <Interrupt> >>>> [128095.480138] lock(&mapping->i_mmap_mutex); >>>> [128095.480138] >>>> *** DEADLOCK *** >>>> >>>> [128095.480138] no locks held by kswapd0/49. >>>> [128095.480138] >>>> stack backtrace: >>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 >>>> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 >>>> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 >>>> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 >>>> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb >>>> [128095.480138] Call Trace: >>>> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 >>>> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 >>>> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 >>>> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 >>>> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 >>>> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 >>>> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e >>>> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 >>>> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e >>>> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 >>>> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] >>>> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 >>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >>>> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb >>>> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce >>>> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e >>>> [128095.480138] [<c112e531>] kswapd+0x517/0xa75 >>>> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 >>>> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa >>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd >>>> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 >>>> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 >>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >>> while one in the middle of fault path could be never on kswapd context. >>> >>> It seems lockdep for reclaim-over-fs isn't enough smart to identify >>> between background and direct reclaim. >>> >>> Wait for other's opinion. >> Is that reasoning correct ?. We may not deadlock because hugetlb pages >> cannot be reclaimed. So the fault path in hugetlb won't end up >> reclaiming pages from same inode. But the report is correct right ? >> >> >> Looking at the hugetlb code we have in huge_pmd_share >> >> out: >> pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> return pte; >> >> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise >> that pmd_alloc can result in a reclaim which can call shrink_page_list ? > True. Sorry for that I didn't review the code carefully and I was very paranoid > in reclaim-over-fs due to internal works. :( Could you explain more about reclaim-over-fs stuff? > >> Something like ? >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 83aff0a..2cb1be3 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) >> put_page(virt_to_page(spte)); >> spin_unlock(&mm->page_table_lock); >> out: >> - pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> + pte = (pte_t *)pmd_alloc(mm, pud, addr); >> return pte; > I am blind on hugetlb but not sure it doesn't break eb48c071. > Michal? > > >> } >> >> -aneesh >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-23 7:24 ` Hush Bensen @ 2013-07-24 2:57 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-24 2:57 UTC (permalink / raw) To: Hush Bensen Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote: > On 07/18/2013 06:13 PM, Minchan Kim wrote: > >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > >>Minchan Kim <minchan@kernel.org> writes: > >> > >>>Ccing people get_maintainer says. > >>> > >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > >>>>[128095.470960] ================================= > >>>>[128095.471315] [ INFO: inconsistent lock state ] > >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted > >>>>[128095.472156] --------------------------------- > >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > >>>>[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at: > >>>>[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > >>>>[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > >>>>[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > >>>>[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > >>>>[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > >>>>[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > >>>>[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > >>>>[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > >>>>[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > >>>>[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > >>>>[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > >>>>[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > >>>>[128095.480138] [<c15b927c>] error_code+0x6c/0x74 > >>>>[128095.480138] irq event stamp: 3136917 > >>>>[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > >>>>[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > >>>>[128095.480138] > >>>>other info that might help us debug this: > >>>>[128095.480138] Possible unsafe locking scenario: > >>>> > >>>>[128095.480138] CPU0 > >>>>[128095.480138] ---- > >>>>[128095.480138] lock(&mapping->i_mmap_mutex); > >>>>[128095.480138] <Interrupt> > >>>>[128095.480138] lock(&mapping->i_mmap_mutex); > >>>>[128095.480138] > >>>> *** DEADLOCK *** > >>>> > >>>>[128095.480138] no locks held by kswapd0/49. > >>>>[128095.480138] > >>>>stack backtrace: > >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > >>>>[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > >>>>[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > >>>>[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > >>>>[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > >>>>[128095.480138] Call Trace: > >>>>[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > >>>>[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > >>>>[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > >>>>[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > >>>>[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > >>>>[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > >>>>[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > >>>>[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > >>>>[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > >>>>[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >>>>[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > >>>>[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > >>>>[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > >>>>[128095.480138] [<c112e531>] kswapd+0x517/0xa75 > >>>>[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > >>>>[128095.480138] [<c10661ff>] kthread+0xa8/0xaa > >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >>>>[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > >>>>[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd > >>>while one in the middle of fault path could be never on kswapd context. > >>> > >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify > >>>between background and direct reclaim. > >>> > >>>Wait for other's opinion. > >>Is that reasoning correct ?. We may not deadlock because hugetlb pages > >>cannot be reclaimed. So the fault path in hugetlb won't end up > >>reclaiming pages from same inode. But the report is correct right ? > >> > >> > >>Looking at the hugetlb code we have in huge_pmd_share > >> > >>out: > >> pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> mutex_unlock(&mapping->i_mmap_mutex); > >> return pte; > >> > >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > >>that pmd_alloc can result in a reclaim which can call shrink_page_list ? > >True. Sorry for that I didn't review the code carefully and I was very paranoid > >in reclaim-over-fs due to internal works. :( > > Could you explain more about reclaim-over-fs stuff? It's lockdep stuff to catch reclaim deadlock, which could happen following as fox example, fs_write lock(fs_lock); alloc memory reclaim path pageout fs_write lock(fs_lock) > > > > >>Something like ? > >> > >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >>index 83aff0a..2cb1be3 100644 > >>--- a/mm/hugetlb.c > >>+++ b/mm/hugetlb.c > >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > >> put_page(virt_to_page(spte)); > >> spin_unlock(&mm->page_table_lock); > >> out: > >>- pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> mutex_unlock(&mapping->i_mmap_mutex); > >>+ pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> return pte; > >I am blind on hugetlb but not sure it doesn't break eb48c071. > >Michal? > > > > > >> } > >>-aneesh > >> > >> > >> > >>-- > >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>the body of a message to majordomo@vger.kernel.org > >>More majordomo info at http://vger.kernel.org/majordomo-info.html > >>Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-24 2:57 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-24 2:57 UTC (permalink / raw) To: Hush Bensen Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote: > On 07/18/2013 06:13 PM, Minchan Kim wrote: > >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > >>Minchan Kim <minchan@kernel.org> writes: > >> > >>>Ccing people get_maintainer says. > >>> > >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote: > >>>>[128095.470960] ================================= > >>>>[128095.471315] [ INFO: inconsistent lock state ] > >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted > >>>>[128095.472156] --------------------------------- > >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > >>>>[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at: > >>>>[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > >>>>[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > >>>>[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > >>>>[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > >>>>[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > >>>>[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > >>>>[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > >>>>[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > >>>>[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > >>>>[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > >>>>[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > >>>>[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > >>>>[128095.480138] [<c15b927c>] error_code+0x6c/0x74 > >>>>[128095.480138] irq event stamp: 3136917 > >>>>[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > >>>>[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > >>>>[128095.480138] > >>>>other info that might help us debug this: > >>>>[128095.480138] Possible unsafe locking scenario: > >>>> > >>>>[128095.480138] CPU0 > >>>>[128095.480138] ---- > >>>>[128095.480138] lock(&mapping->i_mmap_mutex); > >>>>[128095.480138] <Interrupt> > >>>>[128095.480138] lock(&mapping->i_mmap_mutex); > >>>>[128095.480138] > >>>> *** DEADLOCK *** > >>>> > >>>>[128095.480138] no locks held by kswapd0/49. > >>>>[128095.480138] > >>>>stack backtrace: > >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > >>>>[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > >>>>[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > >>>>[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > >>>>[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > >>>>[128095.480138] Call Trace: > >>>>[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > >>>>[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > >>>>[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > >>>>[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > >>>>[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > >>>>[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > >>>>[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > >>>>[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > >>>>[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > >>>>[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > >>>>[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >>>>[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > >>>>[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > >>>>[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > >>>>[128095.480138] [<c112e531>] kswapd+0x517/0xa75 > >>>>[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > >>>>[128095.480138] [<c10661ff>] kthread+0xa8/0xaa > >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > >>>>[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > >>>>[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd > >>>while one in the middle of fault path could be never on kswapd context. > >>> > >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify > >>>between background and direct reclaim. > >>> > >>>Wait for other's opinion. > >>Is that reasoning correct ?. We may not deadlock because hugetlb pages > >>cannot be reclaimed. So the fault path in hugetlb won't end up > >>reclaiming pages from same inode. But the report is correct right ? > >> > >> > >>Looking at the hugetlb code we have in huge_pmd_share > >> > >>out: > >> pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> mutex_unlock(&mapping->i_mmap_mutex); > >> return pte; > >> > >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > >>that pmd_alloc can result in a reclaim which can call shrink_page_list ? > >True. Sorry for that I didn't review the code carefully and I was very paranoid > >in reclaim-over-fs due to internal works. :( > > Could you explain more about reclaim-over-fs stuff? It's lockdep stuff to catch reclaim deadlock, which could happen following as fox example, fs_write lock(fs_lock); alloc memory reclaim path pageout fs_write lock(fs_lock) > > > > >>Something like ? > >> > >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >>index 83aff0a..2cb1be3 100644 > >>--- a/mm/hugetlb.c > >>+++ b/mm/hugetlb.c > >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > >> put_page(virt_to_page(spte)); > >> spin_unlock(&mm->page_table_lock); > >> out: > >>- pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> mutex_unlock(&mapping->i_mmap_mutex); > >>+ pte = (pte_t *)pmd_alloc(mm, pud, addr); > >> return pte; > >I am blind on hugetlb but not sure it doesn't break eb48c071. > >Michal? > > > > > >> } > >>-aneesh > >> > >> > >> > >>-- > >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >>the body of a message to majordomo@vger.kernel.org > >>More majordomo info at http://vger.kernel.org/majordomo-info.html > >>Please read the FAQ at http://www.tux.org/lkml/ > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-19 0:13 ` Minchan Kim @ 2013-07-23 14:01 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-23 14:01 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Fri 19-07-13 09:13:03, Minchan Kim wrote: > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: [...] > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 83aff0a..2cb1be3 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > put_page(virt_to_page(spte)); > > spin_unlock(&mm->page_table_lock); > > out: > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > mutex_unlock(&mapping->i_mmap_mutex); > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > return pte; > > I am blind on hugetlb but not sure it doesn't break eb48c071. > Michal? Well, it is some time since I debugged the race and all the details vanished in the meantime. But this part of the changelog suggests that this indeed breaks the fix: " This patch addresses the issue by moving pmd_alloc into huge_pmd_share which guarantees that the shared pud is populated in the same critical section as pmd. This also means that huge_pte_offset test in huge_pmd_share is serialized correctly now which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together. " Besides that I fail to see how moving pmd_alloc down changes anything. Even if pmd_alloc triggered reclaim then we cannot trip over the same i_mmap_mutex as hugetlb pages are not reclaimable because they are not on the LRU. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-23 14:01 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-23 14:01 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Fri 19-07-13 09:13:03, Minchan Kim wrote: > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: [...] > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 83aff0a..2cb1be3 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > put_page(virt_to_page(spte)); > > spin_unlock(&mm->page_table_lock); > > out: > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > mutex_unlock(&mapping->i_mmap_mutex); > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > return pte; > > I am blind on hugetlb but not sure it doesn't break eb48c071. > Michal? Well, it is some time since I debugged the race and all the details vanished in the meantime. But this part of the changelog suggests that this indeed breaks the fix: " This patch addresses the issue by moving pmd_alloc into huge_pmd_share which guarantees that the shared pud is populated in the same critical section as pmd. This also means that huge_pte_offset test in huge_pmd_share is serialized correctly now which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together. " Besides that I fail to see how moving pmd_alloc down changes anything. Even if pmd_alloc triggered reclaim then we cannot trip over the same i_mmap_mutex as hugetlb pages are not reclaimable because they are not on the LRU. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-23 14:01 ` Michal Hocko @ 2013-07-24 2:44 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-24 2:44 UTC (permalink / raw) To: Michal Hocko Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > [...] > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 83aff0a..2cb1be3 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > put_page(virt_to_page(spte)); > > > spin_unlock(&mm->page_table_lock); > > > out: > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > mutex_unlock(&mapping->i_mmap_mutex); > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > return pte; > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > Michal? > > Well, it is some time since I debugged the race and all the details > vanished in the meantime. But this part of the changelog suggests that > this indeed breaks the fix: > " > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > which guarantees that the shared pud is populated in the same critical > section as pmd. This also means that huge_pte_offset test in > huge_pmd_share is serialized correctly now which in turn means that the > success of the sharing will be higher as the racing tasks see the pud > and pmd populated together. > " > > Besides that I fail to see how moving pmd_alloc down changes anything. > Even if pmd_alloc triggered reclaim then we cannot trip over the same > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > on the LRU. I thought we could map some part of binary with normal page and other part of the one with MAP_HUGETLB so that a address space could have both normal page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. Then, above lockdep warning is totally false positive. Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it to fix eb48c071 so how about this if we couldn't have a better idea? diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 83aff0a..e7c3a15 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3240,7 +3240,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (!vma_shareable(vma, addr)) return (pte_t *)pmd_alloc(mm, pud, addr); - mutex_lock(&mapping->i_mmap_mutex); + /* + * It annotates to shut lockdep's warning up casued by i_mmap_mutex + * Below pmd_alloc try to allocate memory with GFP_KERNEL while + * holding i_mmap_mutex so that it could enter direct reclaim path + * that rmap try to hold i_mmap_mutex again. But it's no problem + * for hugetlb because pages on hugetlb never could live in LRU so + * it's false positive. I hope someone fixes it with avoiding pmd_alloc + * with holding i_mmap_mutex rather than nesting annotation. + */ + mutex_lock_nested(&mapping->i_mmap_mutex, SINGLE_DEPTH_NESTING); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-24 2:44 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-24 2:44 UTC (permalink / raw) To: Michal Hocko Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > [...] > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 83aff0a..2cb1be3 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > put_page(virt_to_page(spte)); > > > spin_unlock(&mm->page_table_lock); > > > out: > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > mutex_unlock(&mapping->i_mmap_mutex); > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > return pte; > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > Michal? > > Well, it is some time since I debugged the race and all the details > vanished in the meantime. But this part of the changelog suggests that > this indeed breaks the fix: > " > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > which guarantees that the shared pud is populated in the same critical > section as pmd. This also means that huge_pte_offset test in > huge_pmd_share is serialized correctly now which in turn means that the > success of the sharing will be higher as the racing tasks see the pud > and pmd populated together. > " > > Besides that I fail to see how moving pmd_alloc down changes anything. > Even if pmd_alloc triggered reclaim then we cannot trip over the same > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > on the LRU. I thought we could map some part of binary with normal page and other part of the one with MAP_HUGETLB so that a address space could have both normal page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. Then, above lockdep warning is totally false positive. Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it to fix eb48c071 so how about this if we couldn't have a better idea? diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 83aff0a..e7c3a15 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3240,7 +3240,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (!vma_shareable(vma, addr)) return (pte_t *)pmd_alloc(mm, pud, addr); - mutex_lock(&mapping->i_mmap_mutex); + /* + * It annotates to shut lockdep's warning up casued by i_mmap_mutex + * Below pmd_alloc try to allocate memory with GFP_KERNEL while + * holding i_mmap_mutex so that it could enter direct reclaim path + * that rmap try to hold i_mmap_mutex again. But it's no problem + * for hugetlb because pages on hugetlb never could live in LRU so + * it's false positive. I hope someone fixes it with avoiding pmd_alloc + * with holding i_mmap_mutex rather than nesting annotation. + */ + mutex_lock_nested(&mapping->i_mmap_mutex, SINGLE_DEPTH_NESTING); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-24 2:44 ` Minchan Kim @ 2013-07-25 13:30 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-25 13:30 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Wed 24-07-13 11:44:28, Minchan Kim wrote: > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > [...] > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > index 83aff0a..2cb1be3 100644 > > > > --- a/mm/hugetlb.c > > > > +++ b/mm/hugetlb.c > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > put_page(virt_to_page(spte)); > > > > spin_unlock(&mm->page_table_lock); > > > > out: > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > return pte; > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > Michal? > > > > Well, it is some time since I debugged the race and all the details > > vanished in the meantime. But this part of the changelog suggests that > > this indeed breaks the fix: > > " > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > which guarantees that the shared pud is populated in the same critical > > section as pmd. This also means that huge_pte_offset test in > > huge_pmd_share is serialized correctly now which in turn means that the > > success of the sharing will be higher as the racing tasks see the pud > > and pmd populated together. > > " > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > on the LRU. > > I thought we could map some part of binary with normal page and other part > of the one with MAP_HUGETLB so that a address space could have both normal > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > Then, above lockdep warning is totally false positive. > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > to fix eb48c071 so how about this if we couldn't have a better idea? Shouldn't we rather use a hugetlb specific lock_class_key. I am not familiar with lockdep much but something like bellow should do the trick? diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a3f868a..40a61f6 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, return inode; } +struct lock_class_key hugetlbfs_i_mmap_mutex_key; + static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev) @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-25 13:30 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-25 13:30 UTC (permalink / raw) To: Minchan Kim Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Wed 24-07-13 11:44:28, Minchan Kim wrote: > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > [...] > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > index 83aff0a..2cb1be3 100644 > > > > --- a/mm/hugetlb.c > > > > +++ b/mm/hugetlb.c > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > put_page(virt_to_page(spte)); > > > > spin_unlock(&mm->page_table_lock); > > > > out: > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > return pte; > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > Michal? > > > > Well, it is some time since I debugged the race and all the details > > vanished in the meantime. But this part of the changelog suggests that > > this indeed breaks the fix: > > " > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > which guarantees that the shared pud is populated in the same critical > > section as pmd. This also means that huge_pte_offset test in > > huge_pmd_share is serialized correctly now which in turn means that the > > success of the sharing will be higher as the racing tasks see the pud > > and pmd populated together. > > " > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > on the LRU. > > I thought we could map some part of binary with normal page and other part > of the one with MAP_HUGETLB so that a address space could have both normal > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > Then, above lockdep warning is totally false positive. > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > to fix eb48c071 so how about this if we couldn't have a better idea? Shouldn't we rather use a hugetlb specific lock_class_key. I am not familiar with lockdep much but something like bellow should do the trick? diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a3f868a..40a61f6 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, return inode; } +struct lock_class_key hugetlbfs_i_mmap_mutex_key; + static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev) @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-25 13:30 ` Michal Hocko @ 2013-07-29 8:24 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-29 8:24 UTC (permalink / raw) To: Michal Hocko Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Hi Michal, On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote: > On Wed 24-07-13 11:44:28, Minchan Kim wrote: > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > > [...] > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > index 83aff0a..2cb1be3 100644 > > > > > --- a/mm/hugetlb.c > > > > > +++ b/mm/hugetlb.c > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > > put_page(virt_to_page(spte)); > > > > > spin_unlock(&mm->page_table_lock); > > > > > out: > > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > return pte; > > > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > > Michal? > > > > > > Well, it is some time since I debugged the race and all the details > > > vanished in the meantime. But this part of the changelog suggests that > > > this indeed breaks the fix: > > > " > > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > > which guarantees that the shared pud is populated in the same critical > > > section as pmd. This also means that huge_pte_offset test in > > > huge_pmd_share is serialized correctly now which in turn means that the > > > success of the sharing will be higher as the racing tasks see the pud > > > and pmd populated together. > > > " > > > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > > on the LRU. > > > > I thought we could map some part of binary with normal page and other part > > of the one with MAP_HUGETLB so that a address space could have both normal > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > > Then, above lockdep warning is totally false positive. > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > > to fix eb48c071 so how about this if we couldn't have a better idea? > > Shouldn't we rather use a hugetlb specific lock_class_key. I am not > familiar with lockdep much but something like bellow should do the > trick? Looks good to me. Could you resend it with formal patch with Ccing Peter for Dave to confirm it? Below just a nitpick. > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a3f868a..40a61f6 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, > return inode; > } > > +struct lock_class_key hugetlbfs_i_mmap_mutex_key; Let's add comment why we need this special something. How about this? /* * Now, reclaim path never hold hugetlbfs_inode->i_mmap_mutex while it could * hold normal inode->i_mmap_mutex so this annoation avoids a lockdep splat. */ static struct lock_class_key hugetlbctx_mutex; > + > static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct inode *dir, > umode_t mode, dev_t dev) > @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct hugetlbfs_inode_info *info; > inode->i_ino = get_next_ino(); > inode_init_owner(inode, dir, mode); > + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key); > inode->i_mapping->a_ops = &hugetlbfs_aops; > inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; > inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-29 8:24 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-29 8:24 UTC (permalink / raw) To: Michal Hocko Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton Hi Michal, On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote: > On Wed 24-07-13 11:44:28, Minchan Kim wrote: > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > > [...] > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > index 83aff0a..2cb1be3 100644 > > > > > --- a/mm/hugetlb.c > > > > > +++ b/mm/hugetlb.c > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > > put_page(virt_to_page(spte)); > > > > > spin_unlock(&mm->page_table_lock); > > > > > out: > > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > return pte; > > > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > > Michal? > > > > > > Well, it is some time since I debugged the race and all the details > > > vanished in the meantime. But this part of the changelog suggests that > > > this indeed breaks the fix: > > > " > > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > > which guarantees that the shared pud is populated in the same critical > > > section as pmd. This also means that huge_pte_offset test in > > > huge_pmd_share is serialized correctly now which in turn means that the > > > success of the sharing will be higher as the racing tasks see the pud > > > and pmd populated together. > > > " > > > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > > on the LRU. > > > > I thought we could map some part of binary with normal page and other part > > of the one with MAP_HUGETLB so that a address space could have both normal > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > > Then, above lockdep warning is totally false positive. > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > > to fix eb48c071 so how about this if we couldn't have a better idea? > > Shouldn't we rather use a hugetlb specific lock_class_key. I am not > familiar with lockdep much but something like bellow should do the > trick? Looks good to me. Could you resend it with formal patch with Ccing Peter for Dave to confirm it? Below just a nitpick. > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a3f868a..40a61f6 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, > return inode; > } > > +struct lock_class_key hugetlbfs_i_mmap_mutex_key; Let's add comment why we need this special something. How about this? /* * Now, reclaim path never hold hugetlbfs_inode->i_mmap_mutex while it could * hold normal inode->i_mmap_mutex so this annoation avoids a lockdep splat. */ static struct lock_class_key hugetlbctx_mutex; > + > static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct inode *dir, > umode_t mode, dev_t dev) > @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct hugetlbfs_inode_info *info; > inode->i_ino = get_next_ino(); > inode_init_owner(inode, dir, mode); > + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key); > inode->i_mapping->a_ops = &hugetlbfs_aops; > inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; > inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-29 8:24 ` Minchan Kim @ 2013-07-29 14:53 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-29 14:53 UTC (permalink / raw) To: Minchan Kim, Peter Zijlstra Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon 29-07-13 17:24:53, Minchan Kim wrote: > Hi Michal, > > On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote: > > On Wed 24-07-13 11:44:28, Minchan Kim wrote: > > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > > > [...] > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 83aff0a..2cb1be3 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > > > put_page(virt_to_page(spte)); > > > > > > spin_unlock(&mm->page_table_lock); > > > > > > out: > > > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > > return pte; > > > > > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > > > Michal? > > > > > > > > Well, it is some time since I debugged the race and all the details > > > > vanished in the meantime. But this part of the changelog suggests that > > > > this indeed breaks the fix: > > > > " > > > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > > > which guarantees that the shared pud is populated in the same critical > > > > section as pmd. This also means that huge_pte_offset test in > > > > huge_pmd_share is serialized correctly now which in turn means that the > > > > success of the sharing will be higher as the racing tasks see the pud > > > > and pmd populated together. > > > > " > > > > > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > > > on the LRU. > > > > > > I thought we could map some part of binary with normal page and other part > > > of the one with MAP_HUGETLB so that a address space could have both normal > > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > > > Then, above lockdep warning is totally false positive. > > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > > > to fix eb48c071 so how about this if we couldn't have a better idea? > > > > Shouldn't we rather use a hugetlb specific lock_class_key. I am not > > familiar with lockdep much but something like bellow should do the > > trick? > > Looks good to me. > Could you resend it with formal patch with Ccing Peter for Dave to confirm it? > Below just a nitpick. I would have posted it already but I have to confess I am really not familiar with lockdep and what is the good way to fix such a false positive. Peter, for you context the lockdep splat has been reported here: https://lkml.org/lkml/2013/7/17/381 Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING https://lkml.org/lkml/2013/7/23/812 my idea was to use a separate class key for hugetlb as it is quite special in many ways: https://lkml.org/lkml/2013/7/25/277 What is the preferred way of fixing such an issue? Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-29 14:53 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-29 14:53 UTC (permalink / raw) To: Minchan Kim, Peter Zijlstra Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon 29-07-13 17:24:53, Minchan Kim wrote: > Hi Michal, > > On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote: > > On Wed 24-07-13 11:44:28, Minchan Kim wrote: > > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote: > > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote: > > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote: > > > > [...] > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > > > index 83aff0a..2cb1be3 100644 > > > > > > --- a/mm/hugetlb.c > > > > > > +++ b/mm/hugetlb.c > > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > > > > > > put_page(virt_to_page(spte)); > > > > > > spin_unlock(&mm->page_table_lock); > > > > > > out: > > > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > > mutex_unlock(&mapping->i_mmap_mutex); > > > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > > > > > > return pte; > > > > > > > > > > I am blind on hugetlb but not sure it doesn't break eb48c071. > > > > > Michal? > > > > > > > > Well, it is some time since I debugged the race and all the details > > > > vanished in the meantime. But this part of the changelog suggests that > > > > this indeed breaks the fix: > > > > " > > > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share > > > > which guarantees that the shared pud is populated in the same critical > > > > section as pmd. This also means that huge_pte_offset test in > > > > huge_pmd_share is serialized correctly now which in turn means that the > > > > success of the sharing will be higher as the racing tasks see the pud > > > > and pmd populated together. > > > > " > > > > > > > > Besides that I fail to see how moving pmd_alloc down changes anything. > > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same > > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not > > > > on the LRU. > > > > > > I thought we could map some part of binary with normal page and other part > > > of the one with MAP_HUGETLB so that a address space could have both normal > > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU. > > > Then, above lockdep warning is totally false positive. > > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it > > > to fix eb48c071 so how about this if we couldn't have a better idea? > > > > Shouldn't we rather use a hugetlb specific lock_class_key. I am not > > familiar with lockdep much but something like bellow should do the > > trick? > > Looks good to me. > Could you resend it with formal patch with Ccing Peter for Dave to confirm it? > Below just a nitpick. I would have posted it already but I have to confess I am really not familiar with lockdep and what is the good way to fix such a false positive. Peter, for you context the lockdep splat has been reported here: https://lkml.org/lkml/2013/7/17/381 Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING https://lkml.org/lkml/2013/7/23/812 my idea was to use a separate class key for hugetlb as it is quite special in many ways: https://lkml.org/lkml/2013/7/25/277 What is the preferred way of fixing such an issue? Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-29 14:53 ` Michal Hocko @ 2013-07-29 15:20 ` Peter Zijlstra -1 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2013-07-29 15:20 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote: > Peter, for you context the lockdep splat has been reported > here: https://lkml.org/lkml/2013/7/17/381 > > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING > https://lkml.org/lkml/2013/7/23/812 > > my idea was to use a separate class key for hugetlb as it is quite > special in many ways: > https://lkml.org/lkml/2013/7/25/277 > > What is the preferred way of fixing such an issue? The class is the safer annotation. That said; it is a rather horrible issue any which way. This PMD sharing is very unique to hugetlbfs (also is that really worth the effort these days?) and it will make it impossible to make hugetlbfs swappable. The other solution is to make the pmd allocation GFP_NOFS. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-29 15:20 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2013-07-29 15:20 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote: > Peter, for you context the lockdep splat has been reported > here: https://lkml.org/lkml/2013/7/17/381 > > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING > https://lkml.org/lkml/2013/7/23/812 > > my idea was to use a separate class key for hugetlb as it is quite > special in many ways: > https://lkml.org/lkml/2013/7/25/277 > > What is the preferred way of fixing such an issue? The class is the safer annotation. That said; it is a rather horrible issue any which way. This PMD sharing is very unique to hugetlbfs (also is that really worth the effort these days?) and it will make it impossible to make hugetlbfs swappable. The other solution is to make the pmd allocation GFP_NOFS. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-29 15:20 ` Peter Zijlstra @ 2013-07-30 14:29 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 14:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon 29-07-13 17:20:01, Peter Zijlstra wrote: > On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote: > > Peter, for you context the lockdep splat has been reported > > here: https://lkml.org/lkml/2013/7/17/381 > > > > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING > > https://lkml.org/lkml/2013/7/23/812 > > > > my idea was to use a separate class key for hugetlb as it is quite > > special in many ways: > > https://lkml.org/lkml/2013/7/25/277 > > > > What is the preferred way of fixing such an issue? > > The class is the safer annotation. OK, I will use the class then. It should prevent other false positives AFAIU. > That said; it is a rather horrible issue any which way. This PMD sharing > is very unique to hugetlbfs (also is that really worth the effort these > days?) and it will make it impossible to make hugetlbfs swappable. No idea. > The other solution is to make the pmd allocation GFP_NOFS. That would be just papering over the lockdep limitation. So I would rather stick with something lockdep specific. I will cook up a patch. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-30 14:29 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 14:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton On Mon 29-07-13 17:20:01, Peter Zijlstra wrote: > On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote: > > Peter, for you context the lockdep splat has been reported > > here: https://lkml.org/lkml/2013/7/17/381 > > > > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING > > https://lkml.org/lkml/2013/7/23/812 > > > > my idea was to use a separate class key for hugetlb as it is quite > > special in many ways: > > https://lkml.org/lkml/2013/7/25/277 > > > > What is the preferred way of fixing such an issue? > > The class is the safer annotation. OK, I will use the class then. It should prevent other false positives AFAIU. > That said; it is a rather horrible issue any which way. This PMD sharing > is very unique to hugetlbfs (also is that really worth the effort these > days?) and it will make it impossible to make hugetlbfs swappable. No idea. > The other solution is to make the pmd allocation GFP_NOFS. That would be just papering over the lockdep limitation. So I would rather stick with something lockdep specific. I will cook up a patch. Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH] hugetlb: fix lockdep splat caused by pmd sharing 2013-07-30 14:29 ` Michal Hocko @ 2013-07-30 14:46 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 14:46 UTC (permalink / raw) To: Andrew Morton Cc: Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Peter Zijlstra, linux-kernel, linux-mm Dave has reported the following lockdep splat: [128095.470960] ================================= [128095.471315] [ INFO: inconsistent lock state ] [128095.471660] 3.11.0-rc1+ #9 Not tainted [128095.472156] --------------------------------- [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 [128095.475128] {RECLAIM_FS-ON-W} state was registered at: [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 [128095.480138] [<c15b927c>] error_code+0x6c/0x74 [128095.480138] irq event stamp: 3136917 [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa [128095.480138] other info that might help us debug this: [128095.480138] Possible unsafe locking scenario: [128095.480138] CPU0 [128095.480138] ---- [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] <Interrupt> [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] *** DEADLOCK *** [128095.480138] no locks held by kswapd0/49. [128095.480138] stack backtrace: [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb [128095.480138] Call Trace: [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e [128095.480138] [<c112e531>] kswapd+0x517/0xa75 [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 [128095.480138] [<c10661ff>] kthread+0xa8/0xaa [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 which is a false positive caused by hugetlb pmd sharing code which allocates a new pmd from withing mappint->i_mmap_mutex. If this allocation causes reclaim then the lockdep detector complains that we might self-deadlock. This is not correct though, because hugetlb pages are not reclaimable so their mapping will be never touched from the reclaim path. The patch tells lockup detector that hugetlb i_mmap_mutex is special by assigning it a separate lockdep class so it won't report possible deadlocks on unrelated mappings. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- fs/hugetlbfs/inode.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a3f868a..230533d 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, return inode; } +/* + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. + */ +struct lock_class_key hugetlbfs_i_mmap_mutex_key; + static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev) @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, + &hugetlbfs_i_mmap_mutex_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH] hugetlb: fix lockdep splat caused by pmd sharing @ 2013-07-30 14:46 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 14:46 UTC (permalink / raw) To: Andrew Morton Cc: Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Peter Zijlstra, linux-kernel, linux-mm Dave has reported the following lockdep splat: [128095.470960] ================================= [128095.471315] [ INFO: inconsistent lock state ] [128095.471660] 3.11.0-rc1+ #9 Not tainted [128095.472156] --------------------------------- [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 [128095.475128] {RECLAIM_FS-ON-W} state was registered at: [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 [128095.480138] [<c15b927c>] error_code+0x6c/0x74 [128095.480138] irq event stamp: 3136917 [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa [128095.480138] other info that might help us debug this: [128095.480138] Possible unsafe locking scenario: [128095.480138] CPU0 [128095.480138] ---- [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] <Interrupt> [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] *** DEADLOCK *** [128095.480138] no locks held by kswapd0/49. [128095.480138] stack backtrace: [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb [128095.480138] Call Trace: [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e [128095.480138] [<c112e531>] kswapd+0x517/0xa75 [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 [128095.480138] [<c10661ff>] kthread+0xa8/0xaa [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 which is a false positive caused by hugetlb pmd sharing code which allocates a new pmd from withing mappint->i_mmap_mutex. If this allocation causes reclaim then the lockdep detector complains that we might self-deadlock. This is not correct though, because hugetlb pages are not reclaimable so their mapping will be never touched from the reclaim path. The patch tells lockup detector that hugetlb i_mmap_mutex is special by assigning it a separate lockdep class so it won't report possible deadlocks on unrelated mappings. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- fs/hugetlbfs/inode.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a3f868a..230533d 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, return inode; } +/* + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. + */ +struct lock_class_key hugetlbfs_i_mmap_mutex_key; + static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev) @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, + &hugetlbfs_i_mmap_mutex_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -- 1.8.3.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing 2013-07-30 14:46 ` Michal Hocko @ 2013-07-30 14:58 ` Peter Zijlstra -1 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2013-07-30 14:58 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: > which is a false positive caused by hugetlb pmd sharing code which > allocates a new pmd from withing mappint->i_mmap_mutex. If this > allocation causes reclaim then the lockdep detector complains that we > might self-deadlock. > > This is not correct though, because hugetlb pages are not reclaimable so > their mapping will be never touched from the reclaim path. > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > by assigning it a separate lockdep class so it won't report possible > deadlocks on unrelated mappings. > > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > fs/hugetlbfs/inode.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a3f868a..230533d 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, > return inode; > } > > +/* > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. How about something like: /* * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never * be taken from reclaim -- unlike regular filesystems. This needs an * annotation because huge_pmd_share() does an allocation under * i_mmap_mutex. */ It clarifies the exact conditions and makes easier to verify the validity of the annotation. > + */ > +struct lock_class_key hugetlbfs_i_mmap_mutex_key; > + > static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct inode *dir, > umode_t mode, dev_t dev) > @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct hugetlbfs_inode_info *info; > inode->i_ino = get_next_ino(); > inode_init_owner(inode, dir, mode); > + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, > + &hugetlbfs_i_mmap_mutex_key); > inode->i_mapping->a_ops = &hugetlbfs_aops; > inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; > inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; > -- > 1.8.3.2 > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing @ 2013-07-30 14:58 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2013-07-30 14:58 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: > which is a false positive caused by hugetlb pmd sharing code which > allocates a new pmd from withing mappint->i_mmap_mutex. If this > allocation causes reclaim then the lockdep detector complains that we > might self-deadlock. > > This is not correct though, because hugetlb pages are not reclaimable so > their mapping will be never touched from the reclaim path. > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > by assigning it a separate lockdep class so it won't report possible > deadlocks on unrelated mappings. > > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > fs/hugetlbfs/inode.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a3f868a..230533d 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, > return inode; > } > > +/* > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. How about something like: /* * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never * be taken from reclaim -- unlike regular filesystems. This needs an * annotation because huge_pmd_share() does an allocation under * i_mmap_mutex. */ It clarifies the exact conditions and makes easier to verify the validity of the annotation. > + */ > +struct lock_class_key hugetlbfs_i_mmap_mutex_key; > + > static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct inode *dir, > umode_t mode, dev_t dev) > @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, > struct hugetlbfs_inode_info *info; > inode->i_ino = get_next_ino(); > inode_init_owner(inode, dir, mode); > + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, > + &hugetlbfs_i_mmap_mutex_key); > inode->i_mapping->a_ops = &hugetlbfs_aops; > inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; > inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; > -- > 1.8.3.2 > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing 2013-07-30 14:58 ` Peter Zijlstra @ 2013-07-30 15:23 ` Michal Hocko -1 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 15:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue 30-07-13 16:58:34, Peter Zijlstra wrote: > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: [...] > > +/* > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. > > How about something like: > > /* > * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never > * be taken from reclaim -- unlike regular filesystems. This needs an > * annotation because huge_pmd_share() does an allocation under > * i_mmap_mutex. > */ > > It clarifies the exact conditions and makes easier to verify the > validity of the annotation. Yes, looks much better. Thanks! --- >From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Tue, 30 Jul 2013 17:22:14 +0200 Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing Dave has reported the following lockdep splat: [128095.470960] ================================= [128095.471315] [ INFO: inconsistent lock state ] [128095.471660] 3.11.0-rc1+ #9 Not tainted [128095.472156] --------------------------------- [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 [128095.475128] {RECLAIM_FS-ON-W} state was registered at: [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 [128095.480138] [<c15b927c>] error_code+0x6c/0x74 [128095.480138] irq event stamp: 3136917 [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa [128095.480138] other info that might help us debug this: [128095.480138] Possible unsafe locking scenario: [128095.480138] CPU0 [128095.480138] ---- [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] <Interrupt> [128095.480138] lock(&mapping->i_mmap_mutex); [128095.480138] *** DEADLOCK *** [128095.480138] no locks held by kswapd0/49. [128095.480138] stack backtrace: [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb [128095.480138] Call Trace: [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e [128095.480138] [<c112e531>] kswapd+0x517/0xa75 [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 [128095.480138] [<c10661ff>] kthread+0xa8/0xaa [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 which is a false positive caused by hugetlb pmd sharing code which allocates a new pmd from withing mappint->i_mmap_mutex. If this allocation causes reclaim then the lockdep detector complains that we might self-deadlock. This is not correct though, because hugetlb pages are not reclaimable so their mapping will be never touched from the reclaim path. The patch tells lockup detector that hugetlb i_mmap_mutex is special by assigning it a separate lockdep class so it won't report possible deadlocks on unrelated mappings. [peterz@infradead.org: comment for annotation] Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- fs/hugetlbfs/inode.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a3f868a..3442397 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -463,6 +463,14 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, return inode; } +/* + * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never + * be taken from reclaim -- unlike regular filesystems. This needs an + * annotation because huge_pmd_share() does an allocation under + * i_mmap_mutex. + */ +struct lock_class_key hugetlbfs_i_mmap_mutex_key; + static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t dev) @@ -474,6 +482,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, + &hugetlbfs_i_mmap_mutex_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; -- 1.8.3.2 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing @ 2013-07-30 15:23 ` Michal Hocko 0 siblings, 0 replies; 40+ messages in thread From: Michal Hocko @ 2013-07-30 15:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue 30-07-13 16:58:34, Peter Zijlstra wrote: > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: [...] > > +/* > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. > > How about something like: > > /* > * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never > * be taken from reclaim -- unlike regular filesystems. This needs an > * annotation because huge_pmd_share() does an allocation under > * i_mmap_mutex. > */ > > It clarifies the exact conditions and makes easier to verify the > validity of the annotation. Yes, looks much better. Thanks! --- ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing 2013-07-30 15:23 ` Michal Hocko @ 2013-07-30 23:35 ` Minchan Kim -1 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-30 23:35 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Andrew Morton, Dave Jones, Aneesh Kumar K.V, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue, Jul 30, 2013 at 05:23:33PM +0200, Michal Hocko wrote: > On Tue 30-07-13 16:58:34, Peter Zijlstra wrote: > > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: > [...] > > > +/* > > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. > > > > How about something like: > > > > /* > > * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never > > * be taken from reclaim -- unlike regular filesystems. This needs an > > * annotation because huge_pmd_share() does an allocation under > > * i_mmap_mutex. > > */ > > > > It clarifies the exact conditions and makes easier to verify the > > validity of the annotation. > > Yes, looks much better. Thanks! > --- > >From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Tue, 30 Jul 2013 17:22:14 +0200 > Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing > > Dave has reported the following lockdep splat: > [128095.470960] ================================= > [128095.471315] [ INFO: inconsistent lock state ] > [128095.471660] 3.11.0-rc1+ #9 Not tainted > [128095.472156] --------------------------------- > [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > [128095.480138] irq event stamp: 3136917 > [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > [128095.480138] > other info that might help us debug this: > [128095.480138] Possible unsafe locking scenario: > [128095.480138] CPU0 > [128095.480138] ---- > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] <Interrupt> > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] > *** DEADLOCK *** > [128095.480138] no locks held by kswapd0/49. > [128095.480138] > stack backtrace: > [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > [128095.480138] Call Trace: > [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > which is a false positive caused by hugetlb pmd sharing code which > allocates a new pmd from withing mappint->i_mmap_mutex. If this > allocation causes reclaim then the lockdep detector complains that we > might self-deadlock. > > This is not correct though, because hugetlb pages are not reclaimable so > their mapping will be never touched from the reclaim path. > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > by assigning it a separate lockdep class so it won't report possible > deadlocks on unrelated mappings. > > [peterz@infradead.org: comment for annotation] > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan@kernel.org> Thanks, Michal! Only remained thing is Dave's testing. - Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing @ 2013-07-30 23:35 ` Minchan Kim 0 siblings, 0 replies; 40+ messages in thread From: Minchan Kim @ 2013-07-30 23:35 UTC (permalink / raw) To: Michal Hocko Cc: Peter Zijlstra, Andrew Morton, Dave Jones, Aneesh Kumar K.V, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Tue, Jul 30, 2013 at 05:23:33PM +0200, Michal Hocko wrote: > On Tue 30-07-13 16:58:34, Peter Zijlstra wrote: > > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote: > [...] > > > +/* > > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could > > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat. > > > > How about something like: > > > > /* > > * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never > > * be taken from reclaim -- unlike regular filesystems. This needs an > > * annotation because huge_pmd_share() does an allocation under > > * i_mmap_mutex. > > */ > > > > It clarifies the exact conditions and makes easier to verify the > > validity of the annotation. > > Yes, looks much better. Thanks! > --- > >From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Tue, 30 Jul 2013 17:22:14 +0200 > Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing > > Dave has reported the following lockdep splat: > [128095.470960] ================================= > [128095.471315] [ INFO: inconsistent lock state ] > [128095.471660] 3.11.0-rc1+ #9 Not tainted > [128095.472156] --------------------------------- > [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: > [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 > [128095.475128] {RECLAIM_FS-ON-W} state was registered at: > [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7 > [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc > [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6 > [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31 > [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14 > [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b > [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283 > [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71 > [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a > [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299 > [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c > [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16 > [128095.480138] [<c15b927c>] error_code+0x6c/0x74 > [128095.480138] irq event stamp: 3136917 > [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50 > [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78 > [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f > [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa > [128095.480138] > other info that might help us debug this: > [128095.480138] Possible unsafe locking scenario: > [128095.480138] CPU0 > [128095.480138] ---- > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] <Interrupt> > [128095.480138] lock(&mapping->i_mmap_mutex); > [128095.480138] > *** DEADLOCK *** > [128095.480138] no locks held by kswapd0/49. > [128095.480138] > stack backtrace: > [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 > [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 > [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845 > [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001 > [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb > [128095.480138] Call Trace: > [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79 > [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3 > [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261 > [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109 > [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2 > [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130 > [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e > [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3 > [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e > [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3 > [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0] > [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947 > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb > [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce > [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e > [128095.480138] [<c112e531>] kswapd+0x517/0xa75 > [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280 > [128095.480138] [<c10661ff>] kthread+0xa8/0xaa > [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd > [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28 > [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63 > > which is a false positive caused by hugetlb pmd sharing code which > allocates a new pmd from withing mappint->i_mmap_mutex. If this > allocation causes reclaim then the lockdep detector complains that we > might self-deadlock. > > This is not correct though, because hugetlb pages are not reclaimable so > their mapping will be never touched from the reclaim path. > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > by assigning it a separate lockdep class so it won't report possible > deadlocks on unrelated mappings. > > [peterz@infradead.org: comment for annotation] > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan@kernel.org> Thanks, Michal! Only remained thing is Dave's testing. - Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing 2013-07-30 23:35 ` Minchan Kim @ 2013-07-30 23:37 ` Dave Jones -1 siblings, 0 replies; 40+ messages in thread From: Dave Jones @ 2013-07-30 23:37 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Peter Zijlstra, Andrew Morton, Aneesh Kumar K.V, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Wed, Jul 31, 2013 at 08:35:30AM +0900, Minchan Kim wrote: > > which is a false positive caused by hugetlb pmd sharing code which > > allocates a new pmd from withing mappint->i_mmap_mutex. If this > > allocation causes reclaim then the lockdep detector complains that we > > might self-deadlock. > > > > This is not correct though, because hugetlb pages are not reclaimable so > > their mapping will be never touched from the reclaim path. > > > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > > by assigning it a separate lockdep class so it won't report possible > > deadlocks on unrelated mappings. > > > > [peterz@infradead.org: comment for annotation] > > Reported-by: Dave Jones <davej@redhat.com> > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > Reviewed-by: Minchan Kim <minchan@kernel.org> > > Thanks, Michal! > Only remained thing is Dave's testing. I've added it to my builds, all is quiet so far.. Dave ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing @ 2013-07-30 23:37 ` Dave Jones 0 siblings, 0 replies; 40+ messages in thread From: Dave Jones @ 2013-07-30 23:37 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Peter Zijlstra, Andrew Morton, Aneesh Kumar K.V, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel, linux-mm On Wed, Jul 31, 2013 at 08:35:30AM +0900, Minchan Kim wrote: > > which is a false positive caused by hugetlb pmd sharing code which > > allocates a new pmd from withing mappint->i_mmap_mutex. If this > > allocation causes reclaim then the lockdep detector complains that we > > might self-deadlock. > > > > This is not correct though, because hugetlb pages are not reclaimable so > > their mapping will be never touched from the reclaim path. > > > > The patch tells lockup detector that hugetlb i_mmap_mutex is special > > by assigning it a separate lockdep class so it won't report possible > > deadlocks on unrelated mappings. > > > > [peterz@infradead.org: comment for annotation] > > Reported-by: Dave Jones <davej@redhat.com> > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > Reviewed-by: Minchan Kim <minchan@kernel.org> > > Thanks, Michal! > Only remained thing is Dave's testing. I've added it to my builds, all is quiet so far.. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-18 17:42 ` Aneesh Kumar K.V @ 2013-07-19 2:08 ` Hillf Danton -1 siblings, 0 replies; 40+ messages in thread From: Hillf Danton @ 2013-07-19 2:08 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> wrote: > Minchan Kim <minchan@kernel.org> writes: >> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >> while one in the middle of fault path could be never on kswapd context. >> >> It seems lockdep for reclaim-over-fs isn't enough smart to identify >> between background and direct reclaim. >> >> Wait for other's opinion. > > Is that reasoning correct ?. We may not deadlock because hugetlb pages > cannot be reclaimed. So the fault path in hugetlb won't end up > reclaiming pages from same inode. But the report is correct right ? > > > Looking at the hugetlb code we have in huge_pmd_share > > out: > pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > return pte; > > I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > that pmd_alloc can result in a reclaim which can call shrink_page_list ? > Hm, can huge pages be reclaimed, say by kswapd currently? > Something like ? > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 83aff0a..2cb1be3 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > put_page(virt_to_page(spte)); > spin_unlock(&mm->page_table_lock); > out: > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > return pte; > } > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-19 2:08 ` Hillf Danton 0 siblings, 0 replies; 40+ messages in thread From: Hillf Danton @ 2013-07-19 2:08 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> wrote: > Minchan Kim <minchan@kernel.org> writes: >> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >> while one in the middle of fault path could be never on kswapd context. >> >> It seems lockdep for reclaim-over-fs isn't enough smart to identify >> between background and direct reclaim. >> >> Wait for other's opinion. > > Is that reasoning correct ?. We may not deadlock because hugetlb pages > cannot be reclaimed. So the fault path in hugetlb won't end up > reclaiming pages from same inode. But the report is correct right ? > > > Looking at the hugetlb code we have in huge_pmd_share > > out: > pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > return pte; > > I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise > that pmd_alloc can result in a reclaim which can call shrink_page_list ? > Hm, can huge pages be reclaimed, say by kswapd currently? > Something like ? > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 83aff0a..2cb1be3 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > put_page(virt_to_page(spte)); > spin_unlock(&mm->page_table_lock); > out: > - pte = (pte_t *)pmd_alloc(mm, pud, addr); > mutex_unlock(&mapping->i_mmap_mutex); > + pte = (pte_t *)pmd_alloc(mm, pud, addr); > return pte; > } > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. 2013-07-19 2:08 ` Hillf Danton @ 2013-07-19 3:20 ` Aneesh Kumar K.V -1 siblings, 0 replies; 40+ messages in thread From: Aneesh Kumar K.V @ 2013-07-19 3:20 UTC (permalink / raw) To: Hillf Danton Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton Hillf Danton <dhillf@gmail.com> writes: > On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V > <aneesh.kumar@linux.vnet.ibm.com> wrote: >> Minchan Kim <minchan@kernel.org> writes: >>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >>> while one in the middle of fault path could be never on kswapd context. >>> >>> It seems lockdep for reclaim-over-fs isn't enough smart to identify >>> between background and direct reclaim. >>> >>> Wait for other's opinion. >> >> Is that reasoning correct ?. We may not deadlock because hugetlb pages >> cannot be reclaimed. So the fault path in hugetlb won't end up >> reclaiming pages from same inode. But the report is correct right ? >> >> >> Looking at the hugetlb code we have in huge_pmd_share >> >> out: >> pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> return pte; >> >> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise >> that pmd_alloc can result in a reclaim which can call shrink_page_list ? >> > Hm, can huge pages be reclaimed, say by kswapd currently? No we don't reclaim hugetlb pages. -aneesh ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: hugepage related lockdep trace. @ 2013-07-19 3:20 ` Aneesh Kumar K.V 0 siblings, 0 replies; 40+ messages in thread From: Aneesh Kumar K.V @ 2013-07-19 3:20 UTC (permalink / raw) To: Hillf Danton Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton Hillf Danton <dhillf@gmail.com> writes: > On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V > <aneesh.kumar@linux.vnet.ibm.com> wrote: >> Minchan Kim <minchan@kernel.org> writes: >>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd >>> while one in the middle of fault path could be never on kswapd context. >>> >>> It seems lockdep for reclaim-over-fs isn't enough smart to identify >>> between background and direct reclaim. >>> >>> Wait for other's opinion. >> >> Is that reasoning correct ?. We may not deadlock because hugetlb pages >> cannot be reclaimed. So the fault path in hugetlb won't end up >> reclaiming pages from same inode. But the report is correct right ? >> >> >> Looking at the hugetlb code we have in huge_pmd_share >> >> out: >> pte = (pte_t *)pmd_alloc(mm, pud, addr); >> mutex_unlock(&mapping->i_mmap_mutex); >> return pte; >> >> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise >> that pmd_alloc can result in a reclaim which can call shrink_page_list ? >> > Hm, can huge pages be reclaimed, say by kswapd currently? No we don't reclaim hugetlb pages. -aneesh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2013-07-30 23:50 UTC | newest] Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-07-17 15:32 hugepage related lockdep trace Dave Jones 2013-07-17 15:32 ` Dave Jones 2013-07-18 0:09 ` Minchan Kim 2013-07-18 0:09 ` Minchan Kim 2013-07-18 17:42 ` Aneesh Kumar K.V 2013-07-18 17:42 ` Aneesh Kumar K.V 2013-07-19 0:13 ` Minchan Kim 2013-07-19 0:13 ` Minchan Kim 2013-07-23 7:24 ` Hush Bensen 2013-07-23 7:24 ` Hush Bensen 2013-07-24 2:57 ` Minchan Kim 2013-07-24 2:57 ` Minchan Kim 2013-07-23 14:01 ` Michal Hocko 2013-07-23 14:01 ` Michal Hocko 2013-07-24 2:44 ` Minchan Kim 2013-07-24 2:44 ` Minchan Kim 2013-07-25 13:30 ` Michal Hocko 2013-07-25 13:30 ` Michal Hocko 2013-07-29 8:24 ` Minchan Kim 2013-07-29 8:24 ` Minchan Kim 2013-07-29 14:53 ` Michal Hocko 2013-07-29 14:53 ` Michal Hocko 2013-07-29 15:20 ` Peter Zijlstra 2013-07-29 15:20 ` Peter Zijlstra 2013-07-30 14:29 ` Michal Hocko 2013-07-30 14:29 ` Michal Hocko 2013-07-30 14:46 ` [PATCH] hugetlb: fix lockdep splat caused by pmd sharing Michal Hocko 2013-07-30 14:46 ` Michal Hocko 2013-07-30 14:58 ` Peter Zijlstra 2013-07-30 14:58 ` Peter Zijlstra 2013-07-30 15:23 ` Michal Hocko 2013-07-30 15:23 ` Michal Hocko 2013-07-30 23:35 ` Minchan Kim 2013-07-30 23:35 ` Minchan Kim 2013-07-30 23:37 ` Dave Jones 2013-07-30 23:37 ` Dave Jones 2013-07-19 2:08 ` hugepage related lockdep trace Hillf Danton 2013-07-19 2:08 ` Hillf Danton 2013-07-19 3:20 ` Aneesh Kumar K.V 2013-07-19 3:20 ` Aneesh Kumar K.V
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.