All of lore.kernel.org
 help / color / mirror / Atom feed
* hugepage related lockdep trace.
@ 2013-07-17 15:32 ` Dave Jones
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Jones @ 2013-07-17 15:32 UTC (permalink / raw)
  To: Linux Kernel; +Cc: linux-mm

[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138] 
other info that might help us debug this:
[128095.480138]  Possible unsafe locking scenario:

[128095.480138]        CPU0
[128095.480138]        ----
[128095.480138]   lock(&mapping->i_mmap_mutex);
[128095.480138]   <Interrupt>
[128095.480138]     lock(&mapping->i_mmap_mutex);
[128095.480138] 
 *** DEADLOCK ***

[128095.480138] no locks held by kswapd0/49.
[128095.480138] 
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63


^ permalink raw reply	[flat|nested] 40+ messages in thread

* hugepage related lockdep trace.
@ 2013-07-17 15:32 ` Dave Jones
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Jones @ 2013-07-17 15:32 UTC (permalink / raw)
  To: Linux Kernel; +Cc: linux-mm

[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138] 
other info that might help us debug this:
[128095.480138]  Possible unsafe locking scenario:

[128095.480138]        CPU0
[128095.480138]        ----
[128095.480138]   lock(&mapping->i_mmap_mutex);
[128095.480138]   <Interrupt>
[128095.480138]     lock(&mapping->i_mmap_mutex);
[128095.480138] 
 *** DEADLOCK ***

[128095.480138] no locks held by kswapd0/49.
[128095.480138] 
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-17 15:32 ` Dave Jones
@ 2013-07-18  0:09   ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-18  0:09 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, linux-mm
  Cc: Rik van Riel, Aneesh Kumar K.V, Michal Hocko, KAMEZAWA Hiroyuki,
	Hillf Danton, Andrew Morton

Ccing people get_maintainer says.

On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> [128095.470960] =================================
> [128095.471315] [ INFO: inconsistent lock state ]
> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> [128095.472156] ---------------------------------
> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> [128095.480138] irq event stamp: 3136917
> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> [128095.480138] 
> other info that might help us debug this:
> [128095.480138]  Possible unsafe locking scenario:
> 
> [128095.480138]        CPU0
> [128095.480138]        ----
> [128095.480138]   lock(&mapping->i_mmap_mutex);
> [128095.480138]   <Interrupt>
> [128095.480138]     lock(&mapping->i_mmap_mutex);
> [128095.480138] 
>  *** DEADLOCK ***
> 
> [128095.480138] no locks held by kswapd0/49.
> [128095.480138] 
> stack backtrace:
> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> [128095.480138] Call Trace:
> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

IMHO, it's a false positive because i_mmap_mutex was held by kswapd
while one in the middle of fault path could be never on kswapd context.

It seems lockdep for reclaim-over-fs isn't enough smart to identify
between background and direct reclaim.

Wait for other's opinion.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-18  0:09   ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-18  0:09 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, linux-mm
  Cc: Rik van Riel, Aneesh Kumar K.V, Michal Hocko, KAMEZAWA Hiroyuki,
	Hillf Danton, Andrew Morton

Ccing people get_maintainer says.

On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> [128095.470960] =================================
> [128095.471315] [ INFO: inconsistent lock state ]
> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> [128095.472156] ---------------------------------
> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> [128095.480138] irq event stamp: 3136917
> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> [128095.480138] 
> other info that might help us debug this:
> [128095.480138]  Possible unsafe locking scenario:
> 
> [128095.480138]        CPU0
> [128095.480138]        ----
> [128095.480138]   lock(&mapping->i_mmap_mutex);
> [128095.480138]   <Interrupt>
> [128095.480138]     lock(&mapping->i_mmap_mutex);
> [128095.480138] 
>  *** DEADLOCK ***
> 
> [128095.480138] no locks held by kswapd0/49.
> [128095.480138] 
> stack backtrace:
> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> [128095.480138] Call Trace:
> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

IMHO, it's a false positive because i_mmap_mutex was held by kswapd
while one in the middle of fault path could be never on kswapd context.

It seems lockdep for reclaim-over-fs isn't enough smart to identify
between background and direct reclaim.

Wait for other's opinion.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-18  0:09   ` Minchan Kim
@ 2013-07-18 17:42     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 40+ messages in thread
From: Aneesh Kumar K.V @ 2013-07-18 17:42 UTC (permalink / raw)
  To: Minchan Kim, Dave Jones, Linux Kernel, linux-mm
  Cc: Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

Minchan Kim <minchan@kernel.org> writes:

> Ccing people get_maintainer says.
>
> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>> [128095.470960] =================================
>> [128095.471315] [ INFO: inconsistent lock state ]
>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>> [128095.472156] ---------------------------------
>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>> [128095.480138] irq event stamp: 3136917
>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>> [128095.480138] 
>> other info that might help us debug this:
>> [128095.480138]  Possible unsafe locking scenario:
>> 
>> [128095.480138]        CPU0
>> [128095.480138]        ----
>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>> [128095.480138]   <Interrupt>
>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>> [128095.480138] 
>>  *** DEADLOCK ***
>> 
>> [128095.480138] no locks held by kswapd0/49.
>> [128095.480138] 
>> stack backtrace:
>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>> [128095.480138] Call Trace:
>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>
> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> while one in the middle of fault path could be never on kswapd context.
>
> It seems lockdep for reclaim-over-fs isn't enough smart to identify
> between background and direct reclaim.
>
> Wait for other's opinion.

Is that reasoning correct ?. We may not deadlock because hugetlb pages
cannot be reclaimed. So the fault path in hugetlb won't end up
reclaiming pages from same inode. But the report is correct right ?


Looking at the hugetlb code we have in huge_pmd_share

out:
	pte = (pte_t *)pmd_alloc(mm, pud, addr);
	mutex_unlock(&mapping->i_mmap_mutex);
	return pte;

I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
that pmd_alloc can result in a reclaim which can call shrink_page_list ?

Something like  ?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..2cb1be3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	return pte;
 }
 
-aneesh




^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-18 17:42     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 40+ messages in thread
From: Aneesh Kumar K.V @ 2013-07-18 17:42 UTC (permalink / raw)
  To: Minchan Kim, Dave Jones, Linux Kernel, linux-mm
  Cc: Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

Minchan Kim <minchan@kernel.org> writes:

> Ccing people get_maintainer says.
>
> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>> [128095.470960] =================================
>> [128095.471315] [ INFO: inconsistent lock state ]
>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>> [128095.472156] ---------------------------------
>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>> [128095.480138] irq event stamp: 3136917
>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>> [128095.480138] 
>> other info that might help us debug this:
>> [128095.480138]  Possible unsafe locking scenario:
>> 
>> [128095.480138]        CPU0
>> [128095.480138]        ----
>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>> [128095.480138]   <Interrupt>
>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>> [128095.480138] 
>>  *** DEADLOCK ***
>> 
>> [128095.480138] no locks held by kswapd0/49.
>> [128095.480138] 
>> stack backtrace:
>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>> [128095.480138] Call Trace:
>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>
> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> while one in the middle of fault path could be never on kswapd context.
>
> It seems lockdep for reclaim-over-fs isn't enough smart to identify
> between background and direct reclaim.
>
> Wait for other's opinion.

Is that reasoning correct ?. We may not deadlock because hugetlb pages
cannot be reclaimed. So the fault path in hugetlb won't end up
reclaiming pages from same inode. But the report is correct right ?


Looking at the hugetlb code we have in huge_pmd_share

out:
	pte = (pte_t *)pmd_alloc(mm, pud, addr);
	mutex_unlock(&mapping->i_mmap_mutex);
	return pte;

I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
that pmd_alloc can result in a reclaim which can call shrink_page_list ?

Something like  ?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..2cb1be3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	return pte;
 }
 
-aneesh



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-18 17:42     ` Aneesh Kumar K.V
@ 2013-07-19  0:13       ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-19  0:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko,
	KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Ccing people get_maintainer says.
> >
> > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >> [128095.470960] =================================
> >> [128095.471315] [ INFO: inconsistent lock state ]
> >> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> >> [128095.472156] ---------------------------------
> >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> >> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> >> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> >> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> >> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> >> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> >> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> >> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> >> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> >> [128095.480138] irq event stamp: 3136917
> >> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >> [128095.480138] 
> >> other info that might help us debug this:
> >> [128095.480138]  Possible unsafe locking scenario:
> >> 
> >> [128095.480138]        CPU0
> >> [128095.480138]        ----
> >> [128095.480138]   lock(&mapping->i_mmap_mutex);
> >> [128095.480138]   <Interrupt>
> >> [128095.480138]     lock(&mapping->i_mmap_mutex);
> >> [128095.480138] 
> >>  *** DEADLOCK ***
> >> 
> >> [128095.480138] no locks held by kswapd0/49.
> >> [128095.480138] 
> >> stack backtrace:
> >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> >> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >> [128095.480138] Call Trace:
> >> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> >> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> >> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> >> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> >> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> >> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> >
> > IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> > while one in the middle of fault path could be never on kswapd context.
> >
> > It seems lockdep for reclaim-over-fs isn't enough smart to identify
> > between background and direct reclaim.
> >
> > Wait for other's opinion.
> 
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
> 
> 
> Looking at the hugetlb code we have in huge_pmd_share
> 
> out:
> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> 	mutex_unlock(&mapping->i_mmap_mutex);
> 	return pte;
> 
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?

True. Sorry for that I didn't review the code carefully and I was very paranoid
in reclaim-over-fs due to internal works. :(

> 
> Something like  ?
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  		put_page(virt_to_page(spte));
>  	spin_unlock(&mm->page_table_lock);
>  out:
> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>  	mutex_unlock(&mapping->i_mmap_mutex);
> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>  	return pte;

I am blind on hugetlb but not sure it doesn't break eb48c071.
Michal?


>  }
>  
> -aneesh
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-19  0:13       ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-19  0:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Dave Jones, Linux Kernel, linux-mm, Rik van Riel, Michal Hocko,
	KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Ccing people get_maintainer says.
> >
> > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >> [128095.470960] =================================
> >> [128095.471315] [ INFO: inconsistent lock state ]
> >> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> >> [128095.472156] ---------------------------------
> >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> >> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> >> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> >> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> >> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> >> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> >> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> >> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> >> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> >> [128095.480138] irq event stamp: 3136917
> >> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >> [128095.480138] 
> >> other info that might help us debug this:
> >> [128095.480138]  Possible unsafe locking scenario:
> >> 
> >> [128095.480138]        CPU0
> >> [128095.480138]        ----
> >> [128095.480138]   lock(&mapping->i_mmap_mutex);
> >> [128095.480138]   <Interrupt>
> >> [128095.480138]     lock(&mapping->i_mmap_mutex);
> >> [128095.480138] 
> >>  *** DEADLOCK ***
> >> 
> >> [128095.480138] no locks held by kswapd0/49.
> >> [128095.480138] 
> >> stack backtrace:
> >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> >> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >> [128095.480138] Call Trace:
> >> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> >> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> >> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> >> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> >> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> >> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> >
> > IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> > while one in the middle of fault path could be never on kswapd context.
> >
> > It seems lockdep for reclaim-over-fs isn't enough smart to identify
> > between background and direct reclaim.
> >
> > Wait for other's opinion.
> 
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
> 
> 
> Looking at the hugetlb code we have in huge_pmd_share
> 
> out:
> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> 	mutex_unlock(&mapping->i_mmap_mutex);
> 	return pte;
> 
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?

True. Sorry for that I didn't review the code carefully and I was very paranoid
in reclaim-over-fs due to internal works. :(

> 
> Something like  ?
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  		put_page(virt_to_page(spte));
>  	spin_unlock(&mm->page_table_lock);
>  out:
> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>  	mutex_unlock(&mapping->i_mmap_mutex);
> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>  	return pte;

I am blind on hugetlb but not sure it doesn't break eb48c071.
Michal?


>  }
>  
> -aneesh
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-18 17:42     ` Aneesh Kumar K.V
@ 2013-07-19  2:08       ` Hillf Danton
  -1 siblings, 0 replies; 40+ messages in thread
From: Hillf Danton @ 2013-07-19  2:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel,
	Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton

On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Minchan Kim <minchan@kernel.org> writes:
>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>> while one in the middle of fault path could be never on kswapd context.
>>
>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>> between background and direct reclaim.
>>
>> Wait for other's opinion.
>
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
>
>
> Looking at the hugetlb code we have in huge_pmd_share
>
> out:
>         pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         mutex_unlock(&mapping->i_mmap_mutex);
>         return pte;
>
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>
Hm, can huge pages be reclaimed, say by kswapd currently?

> Something like  ?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>                 put_page(virt_to_page(spte));
>         spin_unlock(&mm->page_table_lock);
>  out:
> -       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         mutex_unlock(&mapping->i_mmap_mutex);
> +       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         return pte;
>  }
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-19  2:08       ` Hillf Danton
  0 siblings, 0 replies; 40+ messages in thread
From: Hillf Danton @ 2013-07-19  2:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel,
	Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton

On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Minchan Kim <minchan@kernel.org> writes:
>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>> while one in the middle of fault path could be never on kswapd context.
>>
>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>> between background and direct reclaim.
>>
>> Wait for other's opinion.
>
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
>
>
> Looking at the hugetlb code we have in huge_pmd_share
>
> out:
>         pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         mutex_unlock(&mapping->i_mmap_mutex);
>         return pte;
>
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>
Hm, can huge pages be reclaimed, say by kswapd currently?

> Something like  ?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>                 put_page(virt_to_page(spte));
>         spin_unlock(&mm->page_table_lock);
>  out:
> -       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         mutex_unlock(&mapping->i_mmap_mutex);
> +       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>         return pte;
>  }
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-19  2:08       ` Hillf Danton
@ 2013-07-19  3:20         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 40+ messages in thread
From: Aneesh Kumar K.V @ 2013-07-19  3:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel,
	Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton

Hillf Danton <dhillf@gmail.com> writes:

> On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>>
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>>         pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>         mutex_unlock(&mapping->i_mmap_mutex);
>>         return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>>
> Hm, can huge pages be reclaimed, say by kswapd currently?

No we don't reclaim hugetlb pages.

-aneesh


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-19  3:20         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 40+ messages in thread
From: Aneesh Kumar K.V @ 2013-07-19  3:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Minchan Kim, Dave Jones, Linux Kernel, linux-mm, Rik van Riel,
	Michal Hocko, KAMEZAWA Hiroyuki, Andrew Morton

Hillf Danton <dhillf@gmail.com> writes:

> On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>>
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>>         pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>         mutex_unlock(&mapping->i_mmap_mutex);
>>         return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>>
> Hm, can huge pages be reclaimed, say by kswapd currently?

No we don't reclaim hugetlb pages.

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-19  0:13       ` Minchan Kim
@ 2013-07-23  7:24         ` Hush Bensen
  -1 siblings, 0 replies; 40+ messages in thread
From: Hush Bensen @ 2013-07-23  7:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>>>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>>>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>>>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>>>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>>>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>>>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>>>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138]  Possible unsafe locking scenario:
>>>>
>>>> [128095.480138]        CPU0
>>>> [128095.480138]        ----
>>>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]   <Interrupt>
>>>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>>   *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>>>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>>>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>>>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>>>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>>>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> 	mutex_unlock(&mapping->i_mmap_mutex);
>> 	return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like  ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>   		put_page(virt_to_page(spte));
>>   	spin_unlock(&mm->page_table_lock);
>>   out:
>> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	mutex_unlock(&mapping->i_mmap_mutex);
>> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>>   }
>>   
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-23  7:24         ` Hush Bensen
  0 siblings, 0 replies; 40+ messages in thread
From: Hush Bensen @ 2013-07-23  7:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
>>>> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
>>>> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
>>>> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
>>>> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
>>>> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
>>>> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
>>>> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138]  Possible unsafe locking scenario:
>>>>
>>>> [128095.480138]        CPU0
>>>> [128095.480138]        ----
>>>> [128095.480138]   lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]   <Interrupt>
>>>> [128095.480138]     lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>>   *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
>>>> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
>>>> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
>>>> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
>>>> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
>>>> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
>>>> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> 	mutex_unlock(&mapping->i_mmap_mutex);
>> 	return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like  ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>   		put_page(virt_to_page(spte));
>>   	spin_unlock(&mm->page_table_lock);
>>   out:
>> -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	mutex_unlock(&mapping->i_mmap_mutex);
>> +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
>>   	return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>>   }
>>   
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-19  0:13       ` Minchan Kim
@ 2013-07-23 14:01         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-23 14:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..2cb1be3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  		put_page(virt_to_page(spte));
> >  	spin_unlock(&mm->page_table_lock);
> >  out:
> > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >  	mutex_unlock(&mapping->i_mmap_mutex);
> > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >  	return pte;
> 
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?

Well, it is some time since I debugged the race and all the details
vanished in the meantime. But this part of the changelog suggests that
this indeed breaks the fix:
"
    This patch addresses the issue by moving pmd_alloc into huge_pmd_share
    which guarantees that the shared pud is populated in the same critical
    section as pmd.  This also means that huge_pte_offset test in
    huge_pmd_share is serialized correctly now which in turn means that the
    success of the sharing will be higher as the racing tasks see the pud
    and pmd populated together.
"

Besides that I fail to see how moving pmd_alloc down changes anything.
Even if pmd_alloc triggered reclaim then we cannot trip over the same
i_mmap_mutex as hugetlb pages are not reclaimable because they are not
on the LRU.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-23 14:01         ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-23 14:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..2cb1be3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  		put_page(virt_to_page(spte));
> >  	spin_unlock(&mm->page_table_lock);
> >  out:
> > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >  	mutex_unlock(&mapping->i_mmap_mutex);
> > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >  	return pte;
> 
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?

Well, it is some time since I debugged the race and all the details
vanished in the meantime. But this part of the changelog suggests that
this indeed breaks the fix:
"
    This patch addresses the issue by moving pmd_alloc into huge_pmd_share
    which guarantees that the shared pud is populated in the same critical
    section as pmd.  This also means that huge_pte_offset test in
    huge_pmd_share is serialized correctly now which in turn means that the
    success of the sharing will be higher as the racing tasks see the pud
    and pmd populated together.
"

Besides that I fail to see how moving pmd_alloc down changes anything.
Even if pmd_alloc triggered reclaim then we cannot trip over the same
i_mmap_mutex as hugetlb pages are not reclaimable because they are not
on the LRU.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-23 14:01         ` Michal Hocko
@ 2013-07-24  2:44           ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-24  2:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> [...]
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..2cb1be3 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > >  		put_page(virt_to_page(spte));
> > >  	spin_unlock(&mm->page_table_lock);
> > >  out:
> > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > >  	return pte;
> > 
> > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > Michal?
> 
> Well, it is some time since I debugged the race and all the details
> vanished in the meantime. But this part of the changelog suggests that
> this indeed breaks the fix:
> "
>     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
>     which guarantees that the shared pud is populated in the same critical
>     section as pmd.  This also means that huge_pte_offset test in
>     huge_pmd_share is serialized correctly now which in turn means that the
>     success of the sharing will be higher as the racing tasks see the pud
>     and pmd populated together.
> "
> 
> Besides that I fail to see how moving pmd_alloc down changes anything.
> Even if pmd_alloc triggered reclaim then we cannot trip over the same
> i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> on the LRU.

I thought we could map some part of binary with normal page and other part
of the one with MAP_HUGETLB so that a address space could have both normal
page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
Then, above lockdep warning is totally false positive.
Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
to fix eb48c071 so how about this if we couldn't have a better idea?


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..e7c3a15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3240,7 +3240,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	/*
+	 * It annotates to shut lockdep's warning up casued by i_mmap_mutex
+	 * Below pmd_alloc try to allocate memory with GFP_KERNEL while
+	 * holding i_mmap_mutex so that it could enter direct reclaim path
+	 * that rmap try to hold i_mmap_mutex again. But it's no problem
+	 * for hugetlb because pages on hugetlb never could live in LRU so
+	 * it's false positive. I hope someone fixes it with avoiding pmd_alloc
+        * with holding i_mmap_mutex rather than nesting annotation.
+	 */
+	mutex_lock_nested(&mapping->i_mmap_mutex, SINGLE_DEPTH_NESTING);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;

> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-24  2:44           ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-24  2:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> [...]
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..2cb1be3 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > >  		put_page(virt_to_page(spte));
> > >  	spin_unlock(&mm->page_table_lock);
> > >  out:
> > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > >  	return pte;
> > 
> > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > Michal?
> 
> Well, it is some time since I debugged the race and all the details
> vanished in the meantime. But this part of the changelog suggests that
> this indeed breaks the fix:
> "
>     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
>     which guarantees that the shared pud is populated in the same critical
>     section as pmd.  This also means that huge_pte_offset test in
>     huge_pmd_share is serialized correctly now which in turn means that the
>     success of the sharing will be higher as the racing tasks see the pud
>     and pmd populated together.
> "
> 
> Besides that I fail to see how moving pmd_alloc down changes anything.
> Even if pmd_alloc triggered reclaim then we cannot trip over the same
> i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> on the LRU.

I thought we could map some part of binary with normal page and other part
of the one with MAP_HUGETLB so that a address space could have both normal
page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
Then, above lockdep warning is totally false positive.
Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
to fix eb48c071 so how about this if we couldn't have a better idea?


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..e7c3a15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3240,7 +3240,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	/*
+	 * It annotates to shut lockdep's warning up casued by i_mmap_mutex
+	 * Below pmd_alloc try to allocate memory with GFP_KERNEL while
+	 * holding i_mmap_mutex so that it could enter direct reclaim path
+	 * that rmap try to hold i_mmap_mutex again. But it's no problem
+	 * for hugetlb because pages on hugetlb never could live in LRU so
+	 * it's false positive. I hope someone fixes it with avoiding pmd_alloc
+        * with holding i_mmap_mutex rather than nesting annotation.
+	 */
+	mutex_lock_nested(&mapping->i_mmap_mutex, SINGLE_DEPTH_NESTING);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;

> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-23  7:24         ` Hush Bensen
@ 2013-07-24  2:57           ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-24  2:57 UTC (permalink / raw)
  To: Hush Bensen
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote:
> On 07/18/2013 06:13 PM, Minchan Kim wrote:
> >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> >>Minchan Kim <minchan@kernel.org> writes:
> >>
> >>>Ccing people get_maintainer says.
> >>>
> >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >>>>[128095.470960] =================================
> >>>>[128095.471315] [ INFO: inconsistent lock state ]
> >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted
> >>>>[128095.472156] ---------------------------------
> >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >>>>[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >>>>[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> >>>>[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >>>>[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >>>>[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> >>>>[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> >>>>[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >>>>[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> >>>>[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >>>>[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> >>>>[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> >>>>[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> >>>>[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> >>>>[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> >>>>[128095.480138] irq event stamp: 3136917
> >>>>[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >>>>[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >>>>[128095.480138]
> >>>>other info that might help us debug this:
> >>>>[128095.480138]  Possible unsafe locking scenario:
> >>>>
> >>>>[128095.480138]        CPU0
> >>>>[128095.480138]        ----
> >>>>[128095.480138]   lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]   <Interrupt>
> >>>>[128095.480138]     lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]
> >>>>  *** DEADLOCK ***
> >>>>
> >>>>[128095.480138] no locks held by kswapd0/49.
> >>>>[128095.480138]
> >>>>stack backtrace:
> >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >>>>[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> >>>>[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >>>>[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >>>>[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >>>>[128095.480138] Call Trace:
> >>>>[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> >>>>[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >>>>[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> >>>>[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >>>>[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >>>>[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >>>>[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >>>>[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >>>>[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >>>>[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >>>>[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >>>>[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> >>>>[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> >>>>[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >>>>[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >>>>[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> >>>while one in the middle of fault path could be never on kswapd context.
> >>>
> >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify
> >>>between background and direct reclaim.
> >>>
> >>>Wait for other's opinion.
> >>Is that reasoning correct ?. We may not deadlock because hugetlb pages
> >>cannot be reclaimed. So the fault path in hugetlb won't end up
> >>reclaiming pages from same inode. But the report is correct right ?
> >>
> >>
> >>Looking at the hugetlb code we have in huge_pmd_share
> >>
> >>out:
> >>	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>	mutex_unlock(&mapping->i_mmap_mutex);
> >>	return pte;
> >>
> >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> >>that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> >True. Sorry for that I didn't review the code carefully and I was very paranoid
> >in reclaim-over-fs due to internal works. :(
> 
> Could you explain more about reclaim-over-fs stuff?

It's lockdep stuff to catch reclaim deadlock, which could happen following as
fox example,

fs_write
        lock(fs_lock);
        alloc memory
                reclaim path
                        pageout
                                fs_write
                                        lock(fs_lock)

                

> 
> >
> >>Something like  ?
> >>
> >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>index 83aff0a..2cb1be3 100644
> >>--- a/mm/hugetlb.c
> >>+++ b/mm/hugetlb.c
> >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >>  		put_page(virt_to_page(spte));
> >>  	spin_unlock(&mm->page_table_lock);
> >>  out:
> >>-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	mutex_unlock(&mapping->i_mmap_mutex);
> >>+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	return pte;
> >I am blind on hugetlb but not sure it doesn't break eb48c071.
> >Michal?
> >
> >
> >>  }
> >>-aneesh
> >>
> >>
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-24  2:57           ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-24  2:57 UTC (permalink / raw)
  To: Hush Bensen
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, Michal Hocko, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote:
> On 07/18/2013 06:13 PM, Minchan Kim wrote:
> >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> >>Minchan Kim <minchan@kernel.org> writes:
> >>
> >>>Ccing people get_maintainer says.
> >>>
> >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >>>>[128095.470960] =================================
> >>>>[128095.471315] [ INFO: inconsistent lock state ]
> >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted
> >>>>[128095.472156] ---------------------------------
> >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >>>>[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >>>>[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> >>>>[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >>>>[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >>>>[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> >>>>[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> >>>>[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >>>>[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> >>>>[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >>>>[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> >>>>[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> >>>>[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> >>>>[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> >>>>[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> >>>>[128095.480138] irq event stamp: 3136917
> >>>>[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >>>>[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >>>>[128095.480138]
> >>>>other info that might help us debug this:
> >>>>[128095.480138]  Possible unsafe locking scenario:
> >>>>
> >>>>[128095.480138]        CPU0
> >>>>[128095.480138]        ----
> >>>>[128095.480138]   lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]   <Interrupt>
> >>>>[128095.480138]     lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]
> >>>>  *** DEADLOCK ***
> >>>>
> >>>>[128095.480138] no locks held by kswapd0/49.
> >>>>[128095.480138]
> >>>>stack backtrace:
> >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >>>>[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> >>>>[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >>>>[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >>>>[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >>>>[128095.480138] Call Trace:
> >>>>[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> >>>>[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >>>>[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> >>>>[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >>>>[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >>>>[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >>>>[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >>>>[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >>>>[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >>>>[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >>>>[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >>>>[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> >>>>[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> >>>>[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >>>>[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >>>>[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> >>>while one in the middle of fault path could be never on kswapd context.
> >>>
> >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify
> >>>between background and direct reclaim.
> >>>
> >>>Wait for other's opinion.
> >>Is that reasoning correct ?. We may not deadlock because hugetlb pages
> >>cannot be reclaimed. So the fault path in hugetlb won't end up
> >>reclaiming pages from same inode. But the report is correct right ?
> >>
> >>
> >>Looking at the hugetlb code we have in huge_pmd_share
> >>
> >>out:
> >>	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>	mutex_unlock(&mapping->i_mmap_mutex);
> >>	return pte;
> >>
> >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> >>that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> >True. Sorry for that I didn't review the code carefully and I was very paranoid
> >in reclaim-over-fs due to internal works. :(
> 
> Could you explain more about reclaim-over-fs stuff?

It's lockdep stuff to catch reclaim deadlock, which could happen following as
fox example,

fs_write
        lock(fs_lock);
        alloc memory
                reclaim path
                        pageout
                                fs_write
                                        lock(fs_lock)

                

> 
> >
> >>Something like  ?
> >>
> >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>index 83aff0a..2cb1be3 100644
> >>--- a/mm/hugetlb.c
> >>+++ b/mm/hugetlb.c
> >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >>  		put_page(virt_to_page(spte));
> >>  	spin_unlock(&mm->page_table_lock);
> >>  out:
> >>-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	mutex_unlock(&mapping->i_mmap_mutex);
> >>+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	return pte;
> >I am blind on hugetlb but not sure it doesn't break eb48c071.
> >Michal?
> >
> >
> >>  }
> >>-aneesh
> >>
> >>
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-24  2:44           ` Minchan Kim
@ 2013-07-25 13:30             ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-25 13:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > [...]
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index 83aff0a..2cb1be3 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > >  		put_page(virt_to_page(spte));
> > > >  	spin_unlock(&mm->page_table_lock);
> > > >  out:
> > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > >  	return pte;
> > > 
> > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > Michal?
> > 
> > Well, it is some time since I debugged the race and all the details
> > vanished in the meantime. But this part of the changelog suggests that
> > this indeed breaks the fix:
> > "
> >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> >     which guarantees that the shared pud is populated in the same critical
> >     section as pmd.  This also means that huge_pte_offset test in
> >     huge_pmd_share is serialized correctly now which in turn means that the
> >     success of the sharing will be higher as the racing tasks see the pud
> >     and pmd populated together.
> > "
> > 
> > Besides that I fail to see how moving pmd_alloc down changes anything.
> > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > on the LRU.
> 
> I thought we could map some part of binary with normal page and other part
> of the one with MAP_HUGETLB so that a address space could have both normal
> page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> Then, above lockdep warning is totally false positive.
> Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> to fix eb48c071 so how about this if we couldn't have a better idea?

Shouldn't we rather use a hugetlb specific lock_class_key. I am not
familiar with lockdep much but something like bellow should do the
trick?

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..40a61f6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-25 13:30             ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-25 13:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > [...]
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index 83aff0a..2cb1be3 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > >  		put_page(virt_to_page(spte));
> > > >  	spin_unlock(&mm->page_table_lock);
> > > >  out:
> > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > >  	return pte;
> > > 
> > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > Michal?
> > 
> > Well, it is some time since I debugged the race and all the details
> > vanished in the meantime. But this part of the changelog suggests that
> > this indeed breaks the fix:
> > "
> >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> >     which guarantees that the shared pud is populated in the same critical
> >     section as pmd.  This also means that huge_pte_offset test in
> >     huge_pmd_share is serialized correctly now which in turn means that the
> >     success of the sharing will be higher as the racing tasks see the pud
> >     and pmd populated together.
> > "
> > 
> > Besides that I fail to see how moving pmd_alloc down changes anything.
> > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > on the LRU.
> 
> I thought we could map some part of binary with normal page and other part
> of the one with MAP_HUGETLB so that a address space could have both normal
> page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> Then, above lockdep warning is totally false positive.
> Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> to fix eb48c071 so how about this if we couldn't have a better idea?

Shouldn't we rather use a hugetlb specific lock_class_key. I am not
familiar with lockdep much but something like bellow should do the
trick?

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..40a61f6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-25 13:30             ` Michal Hocko
@ 2013-07-29  8:24               ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-29  8:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

Hi Michal,

On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > [...]
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 83aff0a..2cb1be3 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > >  		put_page(virt_to_page(spte));
> > > > >  	spin_unlock(&mm->page_table_lock);
> > > > >  out:
> > > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > >  	return pte;
> > > > 
> > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > Michal?
> > > 
> > > Well, it is some time since I debugged the race and all the details
> > > vanished in the meantime. But this part of the changelog suggests that
> > > this indeed breaks the fix:
> > > "
> > >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > >     which guarantees that the shared pud is populated in the same critical
> > >     section as pmd.  This also means that huge_pte_offset test in
> > >     huge_pmd_share is serialized correctly now which in turn means that the
> > >     success of the sharing will be higher as the racing tasks see the pud
> > >     and pmd populated together.
> > > "
> > > 
> > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > on the LRU.
> > 
> > I thought we could map some part of binary with normal page and other part
> > of the one with MAP_HUGETLB so that a address space could have both normal
> > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > Then, above lockdep warning is totally false positive.
> > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > to fix eb48c071 so how about this if we couldn't have a better idea?
> 
> Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> familiar with lockdep much but something like bellow should do the
> trick?

Looks good to me.
Could you resend it with formal patch with Ccing Peter for Dave to confirm it?
Below just a nitpick.

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..40a61f6 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
>  	return inode;
>  }
>  
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;

Let's add comment why we need this special something.
How about this?

/*
 * Now, reclaim path never hold hugetlbfs_inode->i_mmap_mutex while it could
 * hold normal inode->i_mmap_mutex so this annoation avoids a lockdep splat.
 */
static struct lock_class_key hugetlbctx_mutex;


> +
>  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					struct inode *dir,
>  					umode_t mode, dev_t dev)
> @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		struct hugetlbfs_inode_info *info;
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(inode, dir, mode);
> +		lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-29  8:24               ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-29  8:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

Hi Michal,

On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > [...]
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 83aff0a..2cb1be3 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > >  		put_page(virt_to_page(spte));
> > > > >  	spin_unlock(&mm->page_table_lock);
> > > > >  out:
> > > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > >  	return pte;
> > > > 
> > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > Michal?
> > > 
> > > Well, it is some time since I debugged the race and all the details
> > > vanished in the meantime. But this part of the changelog suggests that
> > > this indeed breaks the fix:
> > > "
> > >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > >     which guarantees that the shared pud is populated in the same critical
> > >     section as pmd.  This also means that huge_pte_offset test in
> > >     huge_pmd_share is serialized correctly now which in turn means that the
> > >     success of the sharing will be higher as the racing tasks see the pud
> > >     and pmd populated together.
> > > "
> > > 
> > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > on the LRU.
> > 
> > I thought we could map some part of binary with normal page and other part
> > of the one with MAP_HUGETLB so that a address space could have both normal
> > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > Then, above lockdep warning is totally false positive.
> > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > to fix eb48c071 so how about this if we couldn't have a better idea?
> 
> Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> familiar with lockdep much but something like bellow should do the
> trick?

Looks good to me.
Could you resend it with formal patch with Ccing Peter for Dave to confirm it?
Below just a nitpick.

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..40a61f6 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
>  	return inode;
>  }
>  
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;

Let's add comment why we need this special something.
How about this?

/*
 * Now, reclaim path never hold hugetlbfs_inode->i_mmap_mutex while it could
 * hold normal inode->i_mmap_mutex so this annoation avoids a lockdep splat.
 */
static struct lock_class_key hugetlbctx_mutex;


> +
>  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					struct inode *dir,
>  					umode_t mode, dev_t dev)
> @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		struct hugetlbfs_inode_info *info;
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(inode, dir, mode);
> +		lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-29  8:24               ` Minchan Kim
@ 2013-07-29 14:53                 ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-29 14:53 UTC (permalink / raw)
  To: Minchan Kim, Peter Zijlstra
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Mon 29-07-13 17:24:53, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> > On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > > [...]
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 83aff0a..2cb1be3 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > > >  		put_page(virt_to_page(spte));
> > > > > >  	spin_unlock(&mm->page_table_lock);
> > > > > >  out:
> > > > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > >  	return pte;
> > > > > 
> > > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > > Michal?
> > > > 
> > > > Well, it is some time since I debugged the race and all the details
> > > > vanished in the meantime. But this part of the changelog suggests that
> > > > this indeed breaks the fix:
> > > > "
> > > >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > > >     which guarantees that the shared pud is populated in the same critical
> > > >     section as pmd.  This also means that huge_pte_offset test in
> > > >     huge_pmd_share is serialized correctly now which in turn means that the
> > > >     success of the sharing will be higher as the racing tasks see the pud
> > > >     and pmd populated together.
> > > > "
> > > > 
> > > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > > on the LRU.
> > > 
> > > I thought we could map some part of binary with normal page and other part
> > > of the one with MAP_HUGETLB so that a address space could have both normal
> > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > > Then, above lockdep warning is totally false positive.
> > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > > to fix eb48c071 so how about this if we couldn't have a better idea?
> > 
> > Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> > familiar with lockdep much but something like bellow should do the
> > trick?
> 
> Looks good to me.
> Could you resend it with formal patch with Ccing Peter for Dave to confirm it?
> Below just a nitpick.

I would have posted it already but I have to confess I am really not
familiar with lockdep and what is the good way to fix such a false
positive.

Peter, for you context the lockdep splat has been reported
here: https://lkml.org/lkml/2013/7/17/381

Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
https://lkml.org/lkml/2013/7/23/812

my idea was to use a separate class key for hugetlb as it is quite
special in many ways:
https://lkml.org/lkml/2013/7/25/277

What is the preferred way of fixing such an issue?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-29 14:53                 ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-29 14:53 UTC (permalink / raw)
  To: Minchan Kim, Peter Zijlstra
  Cc: Aneesh Kumar K.V, Dave Jones, Linux Kernel, linux-mm,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, Andrew Morton

On Mon 29-07-13 17:24:53, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> > On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > > [...]
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 83aff0a..2cb1be3 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > > >  		put_page(virt_to_page(spte));
> > > > > >  	spin_unlock(&mm->page_table_lock);
> > > > > >  out:
> > > > > > -	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > >  	mutex_unlock(&mapping->i_mmap_mutex);
> > > > > > +	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > >  	return pte;
> > > > > 
> > > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > > Michal?
> > > > 
> > > > Well, it is some time since I debugged the race and all the details
> > > > vanished in the meantime. But this part of the changelog suggests that
> > > > this indeed breaks the fix:
> > > > "
> > > >     This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > > >     which guarantees that the shared pud is populated in the same critical
> > > >     section as pmd.  This also means that huge_pte_offset test in
> > > >     huge_pmd_share is serialized correctly now which in turn means that the
> > > >     success of the sharing will be higher as the racing tasks see the pud
> > > >     and pmd populated together.
> > > > "
> > > > 
> > > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > > on the LRU.
> > > 
> > > I thought we could map some part of binary with normal page and other part
> > > of the one with MAP_HUGETLB so that a address space could have both normal
> > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > > Then, above lockdep warning is totally false positive.
> > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > > to fix eb48c071 so how about this if we couldn't have a better idea?
> > 
> > Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> > familiar with lockdep much but something like bellow should do the
> > trick?
> 
> Looks good to me.
> Could you resend it with formal patch with Ccing Peter for Dave to confirm it?
> Below just a nitpick.

I would have posted it already but I have to confess I am really not
familiar with lockdep and what is the good way to fix such a false
positive.

Peter, for you context the lockdep splat has been reported
here: https://lkml.org/lkml/2013/7/17/381

Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
https://lkml.org/lkml/2013/7/23/812

my idea was to use a separate class key for hugetlb as it is quite
special in many ways:
https://lkml.org/lkml/2013/7/25/277

What is the preferred way of fixing such an issue?

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-29 14:53                 ` Michal Hocko
@ 2013-07-29 15:20                   ` Peter Zijlstra
  -1 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2013-07-29 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel,
	linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> Peter, for you context the lockdep splat has been reported
> here: https://lkml.org/lkml/2013/7/17/381
> 
> Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> https://lkml.org/lkml/2013/7/23/812
> 
> my idea was to use a separate class key for hugetlb as it is quite
> special in many ways:
> https://lkml.org/lkml/2013/7/25/277
> 
> What is the preferred way of fixing such an issue?

The class is the safer annotation.

That said; it is a rather horrible issue any which way. This PMD sharing
is very unique to hugetlbfs (also is that really worth the effort these
days?) and it will make it impossible to make hugetlbfs swappable.

The other solution is to make the pmd allocation GFP_NOFS.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-29 15:20                   ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2013-07-29 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel,
	linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> Peter, for you context the lockdep splat has been reported
> here: https://lkml.org/lkml/2013/7/17/381
> 
> Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> https://lkml.org/lkml/2013/7/23/812
> 
> my idea was to use a separate class key for hugetlb as it is quite
> special in many ways:
> https://lkml.org/lkml/2013/7/25/277
> 
> What is the preferred way of fixing such an issue?

The class is the safer annotation.

That said; it is a rather horrible issue any which way. This PMD sharing
is very unique to hugetlbfs (also is that really worth the effort these
days?) and it will make it impossible to make hugetlbfs swappable.

The other solution is to make the pmd allocation GFP_NOFS.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
  2013-07-29 15:20                   ` Peter Zijlstra
@ 2013-07-30 14:29                     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel,
	linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Mon 29-07-13 17:20:01, Peter Zijlstra wrote:
> On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> > Peter, for you context the lockdep splat has been reported
> > here: https://lkml.org/lkml/2013/7/17/381
> > 
> > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> > https://lkml.org/lkml/2013/7/23/812
> > 
> > my idea was to use a separate class key for hugetlb as it is quite
> > special in many ways:
> > https://lkml.org/lkml/2013/7/25/277
> > 
> > What is the preferred way of fixing such an issue?
> 
> The class is the safer annotation.

OK, I will use the class then. It should prevent other false positives
AFAIU.

> That said; it is a rather horrible issue any which way. This PMD sharing
> is very unique to hugetlbfs (also is that really worth the effort these
> days?) and it will make it impossible to make hugetlbfs swappable.

No idea.

> The other solution is to make the pmd allocation GFP_NOFS.

That would be just papering over the lockdep limitation. So I would
rather stick with something lockdep specific.

I will cook up a patch.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: hugepage related lockdep trace.
@ 2013-07-30 14:29                     ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Minchan Kim, Aneesh Kumar K.V, Dave Jones, Linux Kernel,
	linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton,
	Andrew Morton

On Mon 29-07-13 17:20:01, Peter Zijlstra wrote:
> On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> > Peter, for you context the lockdep splat has been reported
> > here: https://lkml.org/lkml/2013/7/17/381
> > 
> > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> > https://lkml.org/lkml/2013/7/23/812
> > 
> > my idea was to use a separate class key for hugetlb as it is quite
> > special in many ways:
> > https://lkml.org/lkml/2013/7/25/277
> > 
> > What is the preferred way of fixing such an issue?
> 
> The class is the safer annotation.

OK, I will use the class then. It should prevent other false positives
AFAIU.

> That said; it is a rather horrible issue any which way. This PMD sharing
> is very unique to hugetlbfs (also is that really worth the effort these
> days?) and it will make it impossible to make hugetlbfs swappable.

No idea.

> The other solution is to make the pmd allocation GFP_NOFS.

That would be just papering over the lockdep limitation. So I would
rather stick with something lockdep specific.

I will cook up a patch.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
  2013-07-30 14:29                     ` Michal Hocko
@ 2013-07-30 14:46                       ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel,
	KAMEZAWA Hiroyuki, Hillf Danton, Peter Zijlstra, linux-kernel,
	linux-mm

Dave has reported the following lockdep splat:
[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138]  Possible unsafe locking scenario:
[128095.480138]        CPU0
[128095.480138]        ----
[128095.480138]   lock(&mapping->i_mmap_mutex);
[128095.480138]   <Interrupt>
[128095.480138]     lock(&mapping->i_mmap_mutex);
[128095.480138]
 *** DEADLOCK ***
[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mappint->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.

The patch tells lockup detector that hugetlb i_mmap_mutex is special
by assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/hugetlbfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..230533d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
+ * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
+ */
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
+				&hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
@ 2013-07-30 14:46                       ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Aneesh Kumar K.V, Minchan Kim, Rik van Riel,
	KAMEZAWA Hiroyuki, Hillf Danton, Peter Zijlstra, linux-kernel,
	linux-mm

Dave has reported the following lockdep splat:
[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138]  Possible unsafe locking scenario:
[128095.480138]        CPU0
[128095.480138]        ----
[128095.480138]   lock(&mapping->i_mmap_mutex);
[128095.480138]   <Interrupt>
[128095.480138]     lock(&mapping->i_mmap_mutex);
[128095.480138]
 *** DEADLOCK ***
[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mappint->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.

The patch tells lockup detector that hugetlb i_mmap_mutex is special
by assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/hugetlbfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..230533d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
+ * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
+ */
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
+				&hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-- 
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
  2013-07-30 14:46                       ` Michal Hocko
@ 2013-07-30 14:58                         ` Peter Zijlstra
  -1 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2013-07-30 14:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> which is a false positive caused by hugetlb pmd sharing code which
> allocates a new pmd from withing mappint->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
> 
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will be never touched from the reclaim path.
> 
> The patch tells lockup detector that hugetlb i_mmap_mutex is special
> by assigning it a separate lockdep class so it won't report possible
> deadlocks on unrelated mappings.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  fs/hugetlbfs/inode.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..230533d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
>  	return inode;
>  }
>  
> +/*
> + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.

How about something like:

/*
 * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
 * be taken from reclaim -- unlike regular filesystems. This needs an
 * annotation because huge_pmd_share() does an allocation under
 * i_mmap_mutex.
 */

It clarifies the exact conditions and makes easier to verify the
validity of the annotation.

> + */
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;
> +
>  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					struct inode *dir,
>  					umode_t mode, dev_t dev)
> @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		struct hugetlbfs_inode_info *info;
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(inode, dir, mode);
> +		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
> +				&hugetlbfs_i_mmap_mutex_key);
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -- 
> 1.8.3.2
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
@ 2013-07-30 14:58                         ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2013-07-30 14:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> which is a false positive caused by hugetlb pmd sharing code which
> allocates a new pmd from withing mappint->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
> 
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will be never touched from the reclaim path.
> 
> The patch tells lockup detector that hugetlb i_mmap_mutex is special
> by assigning it a separate lockdep class so it won't report possible
> deadlocks on unrelated mappings.
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  fs/hugetlbfs/inode.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..230533d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
>  	return inode;
>  }
>  
> +/*
> + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.

How about something like:

/*
 * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
 * be taken from reclaim -- unlike regular filesystems. This needs an
 * annotation because huge_pmd_share() does an allocation under
 * i_mmap_mutex.
 */

It clarifies the exact conditions and makes easier to verify the
validity of the annotation.

> + */
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;
> +
>  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  					struct inode *dir,
>  					umode_t mode, dev_t dev)
> @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
>  		struct hugetlbfs_inode_info *info;
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(inode, dir, mode);
> +		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
> +				&hugetlbfs_i_mmap_mutex_key);
>  		inode->i_mapping->a_ops = &hugetlbfs_aops;
>  		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> -- 
> 1.8.3.2
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
  2013-07-30 14:58                         ` Peter Zijlstra
@ 2013-07-30 15:23                           ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
[...]
> > +/*
> > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
> 
> How about something like:
> 
> /*
>  * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
>  * be taken from reclaim -- unlike regular filesystems. This needs an
>  * annotation because huge_pmd_share() does an allocation under
>  * i_mmap_mutex.
>  */
> 
> It clarifies the exact conditions and makes easier to verify the
> validity of the annotation.

Yes, looks much better. Thanks!
---
>From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 30 Jul 2013 17:22:14 +0200
Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

Dave has reported the following lockdep splat:
[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138]  Possible unsafe locking scenario:
[128095.480138]        CPU0
[128095.480138]        ----
[128095.480138]   lock(&mapping->i_mmap_mutex);
[128095.480138]   <Interrupt>
[128095.480138]     lock(&mapping->i_mmap_mutex);
[128095.480138]
 *** DEADLOCK ***
[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63

which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mappint->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.

The patch tells lockup detector that hugetlb i_mmap_mutex is special
by assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.

[peterz@infradead.org: comment for annotation]
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/hugetlbfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..3442397 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,14 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * be taken from reclaim -- unlike regular filesystems. This needs an
+ * annotation because huge_pmd_share() does an allocation under
+ * i_mmap_mutex.
+ */
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +482,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
+				&hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-- 
1.8.3.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
@ 2013-07-30 15:23                           ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2013-07-30 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Dave Jones, Aneesh Kumar K.V, Minchan Kim,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
[...]
> > +/*
> > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
> 
> How about something like:
> 
> /*
>  * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
>  * be taken from reclaim -- unlike regular filesystems. This needs an
>  * annotation because huge_pmd_share() does an allocation under
>  * i_mmap_mutex.
>  */
> 
> It clarifies the exact conditions and makes easier to verify the
> validity of the annotation.

Yes, looks much better. Thanks!
---

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
  2013-07-30 15:23                           ` Michal Hocko
@ 2013-07-30 23:35                             ` Minchan Kim
  -1 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-30 23:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Andrew Morton, Dave Jones, Aneesh Kumar K.V,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue, Jul 30, 2013 at 05:23:33PM +0200, Michal Hocko wrote:
> On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> [...]
> > > +/*
> > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
> > 
> > How about something like:
> > 
> > /*
> >  * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
> >  * be taken from reclaim -- unlike regular filesystems. This needs an
> >  * annotation because huge_pmd_share() does an allocation under
> >  * i_mmap_mutex.
> >  */
> > 
> > It clarifies the exact conditions and makes easier to verify the
> > validity of the annotation.
> 
> Yes, looks much better. Thanks!
> ---
> >From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Tue, 30 Jul 2013 17:22:14 +0200
> Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
> 
> Dave has reported the following lockdep splat:
> [128095.470960] =================================
> [128095.471315] [ INFO: inconsistent lock state ]
> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> [128095.472156] ---------------------------------
> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> [128095.480138] irq event stamp: 3136917
> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> [128095.480138]
> other info that might help us debug this:
> [128095.480138]  Possible unsafe locking scenario:
> [128095.480138]        CPU0
> [128095.480138]        ----
> [128095.480138]   lock(&mapping->i_mmap_mutex);
> [128095.480138]   <Interrupt>
> [128095.480138]     lock(&mapping->i_mmap_mutex);
> [128095.480138]
>  *** DEADLOCK ***
> [128095.480138] no locks held by kswapd0/49.
> [128095.480138]
> stack backtrace:
> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> [128095.480138] Call Trace:
> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> 
> which is a false positive caused by hugetlb pmd sharing code which
> allocates a new pmd from withing mappint->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
> 
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will be never touched from the reclaim path.
> 
> The patch tells lockup detector that hugetlb i_mmap_mutex is special
> by assigning it a separate lockdep class so it won't report possible
> deadlocks on unrelated mappings.
> 
> [peterz@infradead.org: comment for annotation]
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Thanks, Michal!
Only remained thing is Dave's testing.

- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
@ 2013-07-30 23:35                             ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2013-07-30 23:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, Andrew Morton, Dave Jones, Aneesh Kumar K.V,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Tue, Jul 30, 2013 at 05:23:33PM +0200, Michal Hocko wrote:
> On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> [...]
> > > +/*
> > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
> > 
> > How about something like:
> > 
> > /*
> >  * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
> >  * be taken from reclaim -- unlike regular filesystems. This needs an
> >  * annotation because huge_pmd_share() does an allocation under
> >  * i_mmap_mutex.
> >  */
> > 
> > It clarifies the exact conditions and makes easier to verify the
> > validity of the annotation.
> 
> Yes, looks much better. Thanks!
> ---
> >From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Tue, 30 Jul 2013 17:22:14 +0200
> Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
> 
> Dave has reported the following lockdep splat:
> [128095.470960] =================================
> [128095.471315] [ INFO: inconsistent lock state ]
> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> [128095.472156] ---------------------------------
> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> [128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> [128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> [128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> [128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> [128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> [128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> [128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> [128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> [128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> [128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> [128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> [128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> [128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> [128095.480138] irq event stamp: 3136917
> [128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> [128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> [128095.480138]
> other info that might help us debug this:
> [128095.480138]  Possible unsafe locking scenario:
> [128095.480138]        CPU0
> [128095.480138]        ----
> [128095.480138]   lock(&mapping->i_mmap_mutex);
> [128095.480138]   <Interrupt>
> [128095.480138]     lock(&mapping->i_mmap_mutex);
> [128095.480138]
>  *** DEADLOCK ***
> [128095.480138] no locks held by kswapd0/49.
> [128095.480138]
> stack backtrace:
> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> [128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> [128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> [128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> [128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> [128095.480138] Call Trace:
> [128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> [128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> [128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> [128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> [128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> [128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> [128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> [128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> [128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> [128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> [128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> [128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> [128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> [128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> [128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> [128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> [128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> [128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> 
> which is a false positive caused by hugetlb pmd sharing code which
> allocates a new pmd from withing mappint->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
> 
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will be never touched from the reclaim path.
> 
> The patch tells lockup detector that hugetlb i_mmap_mutex is special
> by assigning it a separate lockdep class so it won't report possible
> deadlocks on unrelated mappings.
> 
> [peterz@infradead.org: comment for annotation]
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Thanks, Michal!
Only remained thing is Dave's testing.

- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
  2013-07-30 23:35                             ` Minchan Kim
@ 2013-07-30 23:37                               ` Dave Jones
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Jones @ 2013-07-30 23:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Peter Zijlstra, Andrew Morton, Aneesh Kumar K.V,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Wed, Jul 31, 2013 at 08:35:30AM +0900, Minchan Kim wrote:
 > > which is a false positive caused by hugetlb pmd sharing code which
 > > allocates a new pmd from withing mappint->i_mmap_mutex. If this
 > > allocation causes reclaim then the lockdep detector complains that we
 > > might self-deadlock.
 > > 
 > > This is not correct though, because hugetlb pages are not reclaimable so
 > > their mapping will be never touched from the reclaim path.
 > > 
 > > The patch tells lockup detector that hugetlb i_mmap_mutex is special
 > > by assigning it a separate lockdep class so it won't report possible
 > > deadlocks on unrelated mappings.
 > > 
 > > [peterz@infradead.org: comment for annotation]
 > > Reported-by: Dave Jones <davej@redhat.com>
 > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
 > Reviewed-by: Minchan Kim <minchan@kernel.org>
 > 
 > Thanks, Michal!
 > Only remained thing is Dave's testing.

I've added it to my builds, all is quiet so far..

	Dave






^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
@ 2013-07-30 23:37                               ` Dave Jones
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Jones @ 2013-07-30 23:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Peter Zijlstra, Andrew Morton, Aneesh Kumar K.V,
	Rik van Riel, KAMEZAWA Hiroyuki, Hillf Danton, linux-kernel,
	linux-mm

On Wed, Jul 31, 2013 at 08:35:30AM +0900, Minchan Kim wrote:
 > > which is a false positive caused by hugetlb pmd sharing code which
 > > allocates a new pmd from withing mappint->i_mmap_mutex. If this
 > > allocation causes reclaim then the lockdep detector complains that we
 > > might self-deadlock.
 > > 
 > > This is not correct though, because hugetlb pages are not reclaimable so
 > > their mapping will be never touched from the reclaim path.
 > > 
 > > The patch tells lockup detector that hugetlb i_mmap_mutex is special
 > > by assigning it a separate lockdep class so it won't report possible
 > > deadlocks on unrelated mappings.
 > > 
 > > [peterz@infradead.org: comment for annotation]
 > > Reported-by: Dave Jones <davej@redhat.com>
 > > Signed-off-by: Michal Hocko <mhocko@suse.cz>
 > Reviewed-by: Minchan Kim <minchan@kernel.org>
 > 
 > Thanks, Michal!
 > Only remained thing is Dave's testing.

I've added it to my builds, all is quiet so far..

	Dave





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2013-07-30 23:50 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-17 15:32 hugepage related lockdep trace Dave Jones
2013-07-17 15:32 ` Dave Jones
2013-07-18  0:09 ` Minchan Kim
2013-07-18  0:09   ` Minchan Kim
2013-07-18 17:42   ` Aneesh Kumar K.V
2013-07-18 17:42     ` Aneesh Kumar K.V
2013-07-19  0:13     ` Minchan Kim
2013-07-19  0:13       ` Minchan Kim
2013-07-23  7:24       ` Hush Bensen
2013-07-23  7:24         ` Hush Bensen
2013-07-24  2:57         ` Minchan Kim
2013-07-24  2:57           ` Minchan Kim
2013-07-23 14:01       ` Michal Hocko
2013-07-23 14:01         ` Michal Hocko
2013-07-24  2:44         ` Minchan Kim
2013-07-24  2:44           ` Minchan Kim
2013-07-25 13:30           ` Michal Hocko
2013-07-25 13:30             ` Michal Hocko
2013-07-29  8:24             ` Minchan Kim
2013-07-29  8:24               ` Minchan Kim
2013-07-29 14:53               ` Michal Hocko
2013-07-29 14:53                 ` Michal Hocko
2013-07-29 15:20                 ` Peter Zijlstra
2013-07-29 15:20                   ` Peter Zijlstra
2013-07-30 14:29                   ` Michal Hocko
2013-07-30 14:29                     ` Michal Hocko
2013-07-30 14:46                     ` [PATCH] hugetlb: fix lockdep splat caused by pmd sharing Michal Hocko
2013-07-30 14:46                       ` Michal Hocko
2013-07-30 14:58                       ` Peter Zijlstra
2013-07-30 14:58                         ` Peter Zijlstra
2013-07-30 15:23                         ` Michal Hocko
2013-07-30 15:23                           ` Michal Hocko
2013-07-30 23:35                           ` Minchan Kim
2013-07-30 23:35                             ` Minchan Kim
2013-07-30 23:37                             ` Dave Jones
2013-07-30 23:37                               ` Dave Jones
2013-07-19  2:08     ` hugepage related lockdep trace Hillf Danton
2013-07-19  2:08       ` Hillf Danton
2013-07-19  3:20       ` Aneesh Kumar K.V
2013-07-19  3:20         ` Aneesh Kumar K.V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.