* [4.15-rc9] fs_reclaim lockdep trace
@ 2018-01-24  1:36 ` Dave Jones
From: Dave Jones @ 2018-01-24  1:36 UTC (permalink / raw)
  To: Linux Kernel; +Cc: linux-mm

Just triggered this on a server I was rsync'ing to.


============================================
WARNING: possible recursive locking detected
4.15.0-rc9-backup-debug+ #1 Not tainted
--------------------------------------------
sshd/24800 is trying to acquire lock:
 (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

but task is already holding lock:
 (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(fs_reclaim);
  lock(fs_reclaim);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by sshd/24800:
 #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
 #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

stack backtrace:
CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
Call Trace:
 dump_stack+0xbc/0x13f
 ? _atomic_dec_and_lock+0x101/0x101
 ? fs_reclaim_acquire.part.102+0x5/0x30
 ? print_lock+0x54/0x68
 __lock_acquire+0xa09/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mutex_destroy+0x120/0x120
 ? hlock_class+0xa0/0xa0
 ? kernel_text_address+0x5c/0x90
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? graph_lock+0x8d/0x100
 ? check_noncircular+0x20/0x20
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? print_irqtrace_events+0x110/0x110
 ? active_load_balance_cpu_stop+0x7b0/0x7b0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mark_lock+0x1b1/0xa00
 ? lock_acquire+0x12e/0x350
 lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.102+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? set_next_entity+0x20e/0x10d0
 ? mark_lock+0x1b1/0xa00
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? alloc_extent_state+0xa7/0x410
 fs_reclaim_acquire.part.102+0x29/0x30
 ? fs_reclaim_acquire.part.102+0x5/0x30
 kmem_cache_alloc+0x3d/0x2c0
 ? rb_erase+0xe63/0x1240
 alloc_extent_state+0xa7/0x410
 ? lock_extent_buffer_for_io+0x3f0/0x3f0
 ? find_held_lock+0x6d/0xd0
 ? test_range_bit+0x197/0x210
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? iotree_fs_info+0x30/0x30
 __clear_extent_bit+0x3ea/0x570
 ? clear_state_bit+0x270/0x270
 ? count_range_bits+0x2f0/0x2f0
 ? lock_acquire+0x350/0x350
 ? rb_prev+0x21/0x90
 try_release_extent_mapping+0x21a/0x260
 __btrfs_releasepage+0xb0/0x1c0
 ? btrfs_submit_direct+0xca0/0xca0
 ? check_new_page_bad+0x1f0/0x1f0
 ? match_held_lock+0xa5/0x440
 ? debug_show_all_locks+0x2f0/0x2f0
 btrfs_releasepage+0x161/0x170
 ? __btrfs_releasepage+0x1c0/0x1c0
 ? page_rmapping+0xd0/0xd0
 ? rmap_walk+0x100/0x100
 try_to_release_page+0x162/0x1c0
 ? generic_file_write_iter+0x3c0/0x3c0
 ? page_evictable+0xcc/0x110
 ? lookup_address_in_pgd+0x107/0x190
 shrink_page_list+0x1d5a/0x2fb0
 ? putback_lru_page+0x3f0/0x3f0
 ? save_trace+0x1e0/0x1e0
 ? _lookup_address_cpa.isra.13+0x40/0x60
 ? debug_show_all_locks+0x2f0/0x2f0
 ? kmem_cache_free+0x8c/0x280
 ? free_extent_state+0x1c8/0x3b0
 ? mark_lock+0x1b1/0xa00
 ? page_rmapping+0xd0/0xd0
 ? print_irqtrace_events+0x110/0x110
 ? shrink_node_memcg.constprop.88+0x4c9/0x5e0
 ? shrink_node+0x12d/0x260
 ? try_to_free_pages+0x418/0xaf0
 ? __alloc_pages_slowpath+0x976/0x1790
 ? __alloc_pages_nodemask+0x52c/0x5c0
 ? delete_node+0x28d/0x5c0
 ? find_held_lock+0x6d/0xd0
 ? free_pcppages_bulk+0x381/0x570
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? __lock_is_held+0x51/0xc0
 ? _raw_spin_unlock+0x24/0x30
 ? free_pcppages_bulk+0x381/0x570
 ? mark_lock+0x1b1/0xa00
 ? free_compound_page+0x30/0x30
 ? print_irqtrace_events+0x110/0x110
 ? __kernel_map_pages+0x2c9/0x310
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? __delete_from_page_cache+0x2e7/0x4e0
 ? save_trace+0x1e0/0x1e0
 ? __add_to_page_cache_locked+0x680/0x680
 ? find_held_lock+0x6d/0xd0
 ? __list_add_valid+0x29/0xa0
 ? free_unref_page_commit+0x198/0x270
 ? drain_local_pages_wq+0x20/0x20
 ? stop_critical_timings+0x210/0x210
 ? mark_lock+0x1b1/0xa00
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? __lock_acquire+0x616/0x2040
 ? mark_lock+0x1b1/0xa00
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? __phys_addr_symbol+0x23/0x40
 ? __change_page_attr_set_clr+0xe86/0x1640
 ? __btrfs_releasepage+0x1c0/0x1c0
 ? mark_lock+0x1b1/0xa00
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? mark_lock+0x1b1/0xa00
 ? __lock_acquire+0x616/0x2040
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? swiotlb_free_coherent+0x60/0x60
 ? __phys_addr+0x32/0x80
 ? igb_xmit_frame_ring+0xad7/0x1890
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __orc_find+0x6b/0xc0
 ? unwind_next_frame+0x407/0xa20
 ? __save_stack_trace+0x5e/0x100
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __lock_acquire+0x616/0x2040
 ? debug_lockdep_rcu_enabled.part.37+0x16/0x30
 ? is_ftrace_trampoline+0x112/0x190
 ? ftrace_profile_pages_init+0x130/0x130
 ? unwind_next_frame+0x407/0xa20
 ? rcu_is_watching+0x88/0xd0
 ? unwind_get_return_address_ptr+0x50/0x50
 ? kernel_text_address+0x5c/0x90
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? __list_add_valid+0x29/0xa0
 ? add_lock_to_list.isra.26+0x1d0/0x21f
 ? print_lockdep_cache.isra.29+0xd8/0xd8
 ? save_trace+0x106/0x1e0
 ? __isolate_lru_page+0x2dc/0x3c0
 ? remove_mapping+0x1b0/0x1b0
 ? match_held_lock+0xa5/0x440
 ? __lock_acquire+0x616/0x2040
 ? __mod_zone_page_state+0x1a/0x70
 ? isolate_lru_pages.isra.83+0x888/0xae0
 ? __isolate_lru_page+0x3c0/0x3c0
 ? check_usage+0x174/0x790
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? check_usage_forwards+0x2b0/0x2b0
 ? class_equal+0x11/0x20
 ? __bfs+0xed/0x430
 ? __phys_addr_symbol+0x23/0x40
 ? mutex_destroy+0x120/0x120
 ? match_held_lock+0x8d/0x440
 ? hlock_class+0xa0/0xa0
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? lock_acquire+0x350/0x350
 ? __zone_watermark_ok+0xd8/0x280
 ? graph_lock+0x8d/0x100
 ? check_noncircular+0x20/0x20
 ? find_held_lock+0x6d/0xd0
 ? shrink_inactive_list+0x3b4/0x940
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? stop_critical_timings+0x210/0x210
 ? mark_held_locks+0x6e/0x90
 ? _raw_spin_unlock_irq+0x29/0x40
 shrink_inactive_list+0x451/0x940
 ? save_trace+0x180/0x1e0
 ? putback_inactive_pages+0x9f0/0x9f0
 ? dev_queue_xmit_nit+0x548/0x660
 ? __kernel_map_pages+0x2c9/0x310
 ? set_pages_rw+0xe0/0xe0
 ? get_page_from_freelist+0x1ea5/0x2ca0
 ? match_held_lock+0x8d/0x440
 ? blk_start_plug+0x17d/0x1e0
 ? kblockd_schedule_delayed_work_on+0x20/0x20
 ? print_irqtrace_events+0x110/0x110
 ? cpumask_next+0x1d/0x20
 ? zone_reclaimable_pages+0x25b/0x470
 ? mark_held_locks+0x6e/0x90
 ? __remove_mapping+0x4e0/0x4e0
 shrink_node_memcg.constprop.88+0x4c9/0x5e0
 ? __delayacct_freepages_start+0x28/0x40
 ? lock_acquire+0x311/0x350
 ? shrink_active_list+0x9c0/0x9c0
 ? stop_critical_timings+0x210/0x210
 ? allow_direct_reclaim.part.82+0xea/0x220
 ? mark_held_locks+0x6e/0x90
 ? ktime_get+0x1f0/0x3e0
 ? shrink_node+0x12d/0x260
 shrink_node+0x12d/0x260
 ? shrink_node_memcg.constprop.88+0x5e0/0x5e0
 ? __lock_is_held+0x51/0xc0
 try_to_free_pages+0x418/0xaf0
 ? shrink_node+0x260/0x260
 ? lock_acquire+0x12e/0x350
 ? lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.102+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? rcu_note_context_switch+0x520/0x520
 ? wake_all_kswapds+0x10a/0x150
 __alloc_pages_slowpath+0x976/0x1790
 ? __zone_watermark_ok+0x280/0x280
 ? warn_alloc+0x250/0x250
 ? __lock_acquire+0x616/0x2040
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? match_held_lock+0xa5/0x440
 ? stack_access_ok+0x35/0x80
 ? save_trace+0x1e0/0x1e0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __lock_acquire+0x616/0x2040
 ? match_held_lock+0xa5/0x440
 ? find_held_lock+0x6d/0xd0
 ? __lock_is_held+0x51/0xc0
 ? rcu_note_context_switch+0x520/0x520
 ? perf_trace_sched_switch+0x560/0x560
 ? __might_sleep+0x58/0xe0
 __alloc_pages_nodemask+0x52c/0x5c0
 ? gfp_pfmemalloc_allowed+0xc0/0xc0
 ? kernel_text_address+0x5c/0x90
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? memcmp+0x45/0x70
 ? match_held_lock+0x8d/0x440
 ? depot_save_stack+0x12e/0x480
 ? match_held_lock+0xa5/0x440
 ? stop_critical_timings+0x210/0x210
 ? sk_stream_alloc_skb+0xb8/0x340
 ? mark_held_locks+0x6e/0x90
 ? new_slab+0x2f3/0x3f0
 new_slab+0x374/0x3f0
 ___slab_alloc.constprop.81+0x47e/0x5a0
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __slab_alloc.constprop.80+0x32/0x60
 __slab_alloc.constprop.80+0x32/0x60
 ? __alloc_skb+0xee/0x390
 __kmalloc_track_caller+0x267/0x310
 __kmalloc_reserve.isra.40+0x29/0x80
 __alloc_skb+0xee/0x390
 ? __skb_splice_bits+0x3e0/0x3e0
 ? ip6_mtu+0x1d9/0x290
 ? ip6_link_failure+0x3c0/0x3c0
 ? tcp_current_mss+0x1d8/0x2f0
 ? tcp_sync_mss+0x520/0x520
 sk_stream_alloc_skb+0xb8/0x340
 ? tcp_ioctl+0x280/0x280
 tcp_sendmsg_locked+0x8e6/0x1d30
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? tcp_set_state+0x450/0x450
 ? debug_show_all_locks+0x2f0/0x2f0
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? find_held_lock+0x6d/0xd0
 ? lock_acquire+0x12e/0x350
 ? lock_acquire+0x12e/0x350
 ? tcp_sendmsg+0x19/0x40
 ? lockdep_rcu_suspicious+0x100/0x100
 ? do_raw_spin_trylock+0x100/0x100
 ? stop_critical_timings+0x210/0x210
 ? mark_held_locks+0x6e/0x90
 ? __local_bh_enable_ip+0x94/0x100
 ? lock_sock_nested+0x51/0xb0
 tcp_sendmsg+0x27/0x40
 inet_sendmsg+0xd0/0x310
 ? inet_recvmsg+0x360/0x360
 ? match_held_lock+0x8d/0x440
 ? inet_recvmsg+0x360/0x360
 sock_write_iter+0x17a/0x240
 ? sock_ioctl+0x290/0x290
 ? find_held_lock+0x6d/0xd0
 __vfs_write+0x2ab/0x380
 ? kernel_read+0xa0/0xa0
 ? __context_tracking_exit.part.4+0xe7/0x290
 ? lock_acquire+0x350/0x350
 ? __fdget_pos+0x7f/0x110
 ? __fdget_raw+0x10/0x10
 vfs_write+0xfb/0x260
 SyS_write+0xb6/0x140
 ? SyS_read+0x140/0x140
 ? SyS_clock_settime+0x120/0x120
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0x110/0xc05
 ? SyS_read+0x140/0x140
 do_syscall_64+0x1e5/0xc05
 ? syscall_return_slowpath+0x5b0/0x5b0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? get_vtime_delta+0x15/0xf0
 ? get_vtime_delta+0x8b/0xf0
 ? vtime_user_enter+0x7f/0x90
 ? __context_tracking_enter+0x13c/0x2b0
 ? __context_tracking_enter+0x13c/0x2b0
 ? context_tracking_exit.part.5+0x40/0x40
 ? rcu_is_watching+0x88/0xd0
 ? time_hardirqs_on+0x220/0x220
 ? prepare_exit_to_usermode+0x1d0/0x2a0
 ? enter_from_user_mode+0x30/0x30
 ? entry_SYSCALL_64_after_hwframe+0x18/0x2e
 ? trace_hardirqs_off_caller+0xc2/0x110
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f26d47d1974
RSP: 002b:00007ffd62e2f548 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 00007f26d47d1974
RDX: 0000000000000024 RSI: 000055a0bc9a6220 RDI: 0000000000000003
RBP: 000055a0bc984370 R08: 0000000000000000 R09: 00007ffd62fb9080
R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
R13: 000055a0bc311ab0 R14: 0000000000000003 R15: 00007ffd62e2f5cf
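
The call chain above boils down to: the outer GFP_KERNEL allocation from
sk_stream_alloc_skb() ends up in direct reclaim with the fs_reclaim
annotation held, reclaim calls btrfs_releasepage(), and alloc_extent_state()
then allocates again and runs through fs_reclaim_acquire() a second time.
A minimal sketch of that shape (illustrative only, not the actual btrfs or
mm code; the demo_* names are made up):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/skbuff.h>

/*
 * 1) The send path allocates with GFP_KERNEL and falls into direct
 *    reclaim; while reclaiming, the fs_reclaim annotation is held.
 * 2) Direct reclaim calls back into the filesystem via ->releasepage().
 * 3) The callback allocates again with a mask that still allows fs
 *    reclaim, so fs_reclaim is acquired a second time -> this splat.
 */
static int demo_releasepage(struct page *page, gfp_t gfp)       /* step 2 */
{
        void *state = kmalloc(64, gfp);         /* step 3: recursive acquire */

        if (!state)
                return 0;
        kfree(state);
        return 1;
}

static void demo_send_path(void)
{
        struct sk_buff *skb = alloc_skb(2048, GFP_KERNEL);      /* step 1 */

        kfree_skb(skb);
}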


* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-24  1:36 ` Dave Jones
@ 2018-01-27 22:24   ` Dave Jones
From: Dave Jones @ 2018-01-27 22:24 UTC (permalink / raw)
  To: Linux Kernel, linux-mm; +Cc: netdev, Linus Torvalds

On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
 > Just triggered this on a server I was rsync'ing to.

Actually, I can trigger this really easily, even with an rsync from one
disk to another.  Though that also smells a little like networking in
the traces. Maybe netdev has ideas.

 
The first instance:

 > ============================================
 > WARNING: possible recursive locking detected
 > 4.15.0-rc9-backup-debug+ #1 Not tainted
 > --------------------------------------------
 > sshd/24800 is trying to acquire lock:
 >  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > but task is already holding lock:
 >  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > other info that might help us debug this:
 >  Possible unsafe locking scenario:
 > 
 >        CPU0
 >        ----
 >   lock(fs_reclaim);
 >   lock(fs_reclaim);
 > 
 >  *** DEADLOCK ***
 > 
 >  May be due to missing lock nesting notation
 > 
 > 2 locks held by sshd/24800:
 >  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
 >  #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
 > 
 > stack backtrace:
 > CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
 > Call Trace:
 >  dump_stack+0xbc/0x13f
 >  ? _atomic_dec_and_lock+0x101/0x101
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  ? print_lock+0x54/0x68
 >  __lock_acquire+0xa09/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? mutex_destroy+0x120/0x120
 >  ? hlock_class+0xa0/0xa0
 >  ? kernel_text_address+0x5c/0x90
 >  ? __kernel_text_address+0xe/0x30
 >  ? unwind_get_return_address+0x2f/0x50
 >  ? __save_stack_trace+0x92/0x100
 >  ? graph_lock+0x8d/0x100
 >  ? check_noncircular+0x20/0x20
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? active_load_balance_cpu_stop+0x7b0/0x7b0
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? mark_lock+0x1b1/0xa00
 >  ? lock_acquire+0x12e/0x350
 >  lock_acquire+0x12e/0x350
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  ? lockdep_rcu_suspicious+0x100/0x100
 >  ? set_next_entity+0x20e/0x10d0
 >  ? mark_lock+0x1b1/0xa00
 >  ? match_held_lock+0x8d/0x440
 >  ? mark_lock+0x1b1/0xa00
 >  ? save_trace+0x1e0/0x1e0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? alloc_extent_state+0xa7/0x410
 >  fs_reclaim_acquire.part.102+0x29/0x30
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  kmem_cache_alloc+0x3d/0x2c0
 >  ? rb_erase+0xe63/0x1240
 >  alloc_extent_state+0xa7/0x410
 >  ? lock_extent_buffer_for_io+0x3f0/0x3f0
 >  ? find_held_lock+0x6d/0xd0
 >  ? test_range_bit+0x197/0x210
 >  ? lock_acquire+0x350/0x350
 >  ? do_raw_spin_unlock+0x147/0x220
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? iotree_fs_info+0x30/0x30
 >  __clear_extent_bit+0x3ea/0x570
 >  ? clear_state_bit+0x270/0x270
 >  ? count_range_bits+0x2f0/0x2f0
 >  ? lock_acquire+0x350/0x350
 >  ? rb_prev+0x21/0x90
 >  try_release_extent_mapping+0x21a/0x260
 >  __btrfs_releasepage+0xb0/0x1c0
 >  ? btrfs_submit_direct+0xca0/0xca0
 >  ? check_new_page_bad+0x1f0/0x1f0
 >  ? match_held_lock+0xa5/0x440
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  btrfs_releasepage+0x161/0x170
 >  ? __btrfs_releasepage+0x1c0/0x1c0
 >  ? page_rmapping+0xd0/0xd0
 >  ? rmap_walk+0x100/0x100
 >  try_to_release_page+0x162/0x1c0
 >  ? generic_file_write_iter+0x3c0/0x3c0
 >  ? page_evictable+0xcc/0x110
 >  ? lookup_address_in_pgd+0x107/0x190
 >  shrink_page_list+0x1d5a/0x2fb0
 >  ? putback_lru_page+0x3f0/0x3f0
 >  ? save_trace+0x1e0/0x1e0
 >  ? _lookup_address_cpa.isra.13+0x40/0x60
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? kmem_cache_free+0x8c/0x280
 >  ? free_extent_state+0x1c8/0x3b0
 >  ? mark_lock+0x1b1/0xa00
 >  ? page_rmapping+0xd0/0xd0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? shrink_node_memcg.constprop.88+0x4c9/0x5e0
 >  ? shrink_node+0x12d/0x260
 >  ? try_to_free_pages+0x418/0xaf0
 >  ? __alloc_pages_slowpath+0x976/0x1790
 >  ? __alloc_pages_nodemask+0x52c/0x5c0
 >  ? delete_node+0x28d/0x5c0
 >  ? find_held_lock+0x6d/0xd0
 >  ? free_pcppages_bulk+0x381/0x570
 >  ? lock_acquire+0x350/0x350
 >  ? do_raw_spin_unlock+0x147/0x220
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? __lock_is_held+0x51/0xc0
 >  ? _raw_spin_unlock+0x24/0x30
 >  ? free_pcppages_bulk+0x381/0x570
 >  ? mark_lock+0x1b1/0xa00
 >  ? free_compound_page+0x30/0x30
 >  ? print_irqtrace_events+0x110/0x110
 >  ? __kernel_map_pages+0x2c9/0x310
 >  ? mark_lock+0x1b1/0xa00
 >  ? print_irqtrace_events+0x110/0x110
 >  ? __delete_from_page_cache+0x2e7/0x4e0
 >  ? save_trace+0x1e0/0x1e0
 >  ? __add_to_page_cache_locked+0x680/0x680
 >  ? find_held_lock+0x6d/0xd0
 >  ? __list_add_valid+0x29/0xa0
 >  ? free_unref_page_commit+0x198/0x270
 >  ? drain_local_pages_wq+0x20/0x20
 >  ? stop_critical_timings+0x210/0x210
 >  ? mark_lock+0x1b1/0xa00
 >  ? mark_lock+0x1b1/0xa00
 >  ? print_irqtrace_events+0x110/0x110
 >  ? __lock_acquire+0x616/0x2040
 >  ? mark_lock+0x1b1/0xa00
 >  ? mark_lock+0x1b1/0xa00
 >  ? print_irqtrace_events+0x110/0x110
 >  ? __phys_addr_symbol+0x23/0x40
 >  ? __change_page_attr_set_clr+0xe86/0x1640
 >  ? __btrfs_releasepage+0x1c0/0x1c0
 >  ? mark_lock+0x1b1/0xa00
 >  ? mark_lock+0x1b1/0xa00
 >  ? print_irqtrace_events+0x110/0x110
 >  ? mark_lock+0x1b1/0xa00
 >  ? __lock_acquire+0x616/0x2040
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? swiotlb_free_coherent+0x60/0x60
 >  ? __phys_addr+0x32/0x80
 >  ? igb_xmit_frame_ring+0xad7/0x1890
 >  ? stack_access_ok+0x35/0x80
 >  ? deref_stack_reg+0xa1/0xe0
 >  ? __read_once_size_nocheck.constprop.6+0x10/0x10
 >  ? __orc_find+0x6b/0xc0
 >  ? unwind_next_frame+0x407/0xa20
 >  ? __save_stack_trace+0x5e/0x100
 >  ? stack_access_ok+0x35/0x80
 >  ? deref_stack_reg+0xa1/0xe0
 >  ? __read_once_size_nocheck.constprop.6+0x10/0x10
 >  ? __lock_acquire+0x616/0x2040
 >  ? debug_lockdep_rcu_enabled.part.37+0x16/0x30
 >  ? is_ftrace_trampoline+0x112/0x190
 >  ? ftrace_profile_pages_init+0x130/0x130
 >  ? unwind_next_frame+0x407/0xa20
 >  ? rcu_is_watching+0x88/0xd0
 >  ? unwind_get_return_address_ptr+0x50/0x50
 >  ? kernel_text_address+0x5c/0x90
 >  ? __kernel_text_address+0xe/0x30
 >  ? unwind_get_return_address+0x2f/0x50
 >  ? __save_stack_trace+0x92/0x100
 >  ? __list_add_valid+0x29/0xa0
 >  ? add_lock_to_list.isra.26+0x1d0/0x21f
 >  ? print_lockdep_cache.isra.29+0xd8/0xd8
 >  ? save_trace+0x106/0x1e0
 >  ? __isolate_lru_page+0x2dc/0x3c0
 >  ? remove_mapping+0x1b0/0x1b0
 >  ? match_held_lock+0xa5/0x440
 >  ? __lock_acquire+0x616/0x2040
 >  ? __mod_zone_page_state+0x1a/0x70
 >  ? isolate_lru_pages.isra.83+0x888/0xae0
 >  ? __isolate_lru_page+0x3c0/0x3c0
 >  ? check_usage+0x174/0x790
 >  ? mark_lock+0x1b1/0xa00
 >  ? print_irqtrace_events+0x110/0x110
 >  ? check_usage_forwards+0x2b0/0x2b0
 >  ? class_equal+0x11/0x20
 >  ? __bfs+0xed/0x430
 >  ? __phys_addr_symbol+0x23/0x40
 >  ? mutex_destroy+0x120/0x120
 >  ? match_held_lock+0x8d/0x440
 >  ? hlock_class+0xa0/0xa0
 >  ? mark_lock+0x1b1/0xa00
 >  ? save_trace+0x1e0/0x1e0
 >  ? print_irqtrace_events+0x110/0x110
 >  ? lock_acquire+0x350/0x350
 >  ? __zone_watermark_ok+0xd8/0x280
 >  ? graph_lock+0x8d/0x100
 >  ? check_noncircular+0x20/0x20
 >  ? find_held_lock+0x6d/0xd0
 >  ? shrink_inactive_list+0x3b4/0x940
 >  ? lock_acquire+0x350/0x350
 >  ? do_raw_spin_unlock+0x147/0x220
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? stop_critical_timings+0x210/0x210
 >  ? mark_held_locks+0x6e/0x90
 >  ? _raw_spin_unlock_irq+0x29/0x40
 >  shrink_inactive_list+0x451/0x940
 >  ? save_trace+0x180/0x1e0
 >  ? putback_inactive_pages+0x9f0/0x9f0
 >  ? dev_queue_xmit_nit+0x548/0x660
 >  ? __kernel_map_pages+0x2c9/0x310
 >  ? set_pages_rw+0xe0/0xe0
 >  ? get_page_from_freelist+0x1ea5/0x2ca0
 >  ? match_held_lock+0x8d/0x440
 >  ? blk_start_plug+0x17d/0x1e0
 >  ? kblockd_schedule_delayed_work_on+0x20/0x20
 >  ? print_irqtrace_events+0x110/0x110
 >  ? cpumask_next+0x1d/0x20
 >  ? zone_reclaimable_pages+0x25b/0x470
 >  ? mark_held_locks+0x6e/0x90
 >  ? __remove_mapping+0x4e0/0x4e0
 >  shrink_node_memcg.constprop.88+0x4c9/0x5e0
 >  ? __delayacct_freepages_start+0x28/0x40
 >  ? lock_acquire+0x311/0x350
 >  ? shrink_active_list+0x9c0/0x9c0
 >  ? stop_critical_timings+0x210/0x210
 >  ? allow_direct_reclaim.part.82+0xea/0x220
 >  ? mark_held_locks+0x6e/0x90
 >  ? ktime_get+0x1f0/0x3e0
 >  ? shrink_node+0x12d/0x260
 >  shrink_node+0x12d/0x260
 >  ? shrink_node_memcg.constprop.88+0x5e0/0x5e0
 >  ? __lock_is_held+0x51/0xc0
 >  try_to_free_pages+0x418/0xaf0
 >  ? shrink_node+0x260/0x260
 >  ? lock_acquire+0x12e/0x350
 >  ? lock_acquire+0x12e/0x350
 >  ? fs_reclaim_acquire.part.102+0x5/0x30
 >  ? lockdep_rcu_suspicious+0x100/0x100
 >  ? rcu_note_context_switch+0x520/0x520
 >  ? wake_all_kswapds+0x10a/0x150
 >  __alloc_pages_slowpath+0x976/0x1790
 >  ? __zone_watermark_ok+0x280/0x280
 >  ? warn_alloc+0x250/0x250
 >  ? __lock_acquire+0x616/0x2040
 >  ? match_held_lock+0x8d/0x440
 >  ? save_trace+0x1e0/0x1e0
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? match_held_lock+0xa5/0x440
 >  ? stack_access_ok+0x35/0x80
 >  ? save_trace+0x1e0/0x1e0
 >  ? __read_once_size_nocheck.constprop.6+0x10/0x10
 >  ? __lock_acquire+0x616/0x2040
 >  ? match_held_lock+0xa5/0x440
 >  ? find_held_lock+0x6d/0xd0
 >  ? __lock_is_held+0x51/0xc0
 >  ? rcu_note_context_switch+0x520/0x520
 >  ? perf_trace_sched_switch+0x560/0x560
 >  ? __might_sleep+0x58/0xe0
 >  __alloc_pages_nodemask+0x52c/0x5c0
 >  ? gfp_pfmemalloc_allowed+0xc0/0xc0
 >  ? kernel_text_address+0x5c/0x90
 >  ? __kernel_text_address+0xe/0x30
 >  ? unwind_get_return_address+0x2f/0x50
 >  ? memcmp+0x45/0x70
 >  ? match_held_lock+0x8d/0x440
 >  ? depot_save_stack+0x12e/0x480
 >  ? match_held_lock+0xa5/0x440
 >  ? stop_critical_timings+0x210/0x210
 >  ? sk_stream_alloc_skb+0xb8/0x340
 >  ? mark_held_locks+0x6e/0x90
 >  ? new_slab+0x2f3/0x3f0
 >  new_slab+0x374/0x3f0
 >  ___slab_alloc.constprop.81+0x47e/0x5a0
 >  ? __alloc_skb+0xee/0x390
 >  ? __alloc_skb+0xee/0x390
 >  ? __alloc_skb+0xee/0x390
 >  ? __slab_alloc.constprop.80+0x32/0x60
 >  __slab_alloc.constprop.80+0x32/0x60
 >  ? __alloc_skb+0xee/0x390
 >  __kmalloc_track_caller+0x267/0x310
 >  __kmalloc_reserve.isra.40+0x29/0x80
 >  __alloc_skb+0xee/0x390
 >  ? __skb_splice_bits+0x3e0/0x3e0
 >  ? ip6_mtu+0x1d9/0x290
 >  ? ip6_link_failure+0x3c0/0x3c0
 >  ? tcp_current_mss+0x1d8/0x2f0
 >  ? tcp_sync_mss+0x520/0x520
 >  sk_stream_alloc_skb+0xb8/0x340
 >  ? tcp_ioctl+0x280/0x280
 >  tcp_sendmsg_locked+0x8e6/0x1d30
 >  ? match_held_lock+0x8d/0x440
 >  ? mark_lock+0x1b1/0xa00
 >  ? tcp_set_state+0x450/0x450
 >  ? debug_show_all_locks+0x2f0/0x2f0
 >  ? match_held_lock+0x8d/0x440
 >  ? save_trace+0x1e0/0x1e0
 >  ? find_held_lock+0x6d/0xd0
 >  ? lock_acquire+0x12e/0x350
 >  ? lock_acquire+0x12e/0x350
 >  ? tcp_sendmsg+0x19/0x40
 >  ? lockdep_rcu_suspicious+0x100/0x100
 >  ? do_raw_spin_trylock+0x100/0x100
 >  ? stop_critical_timings+0x210/0x210
 >  ? mark_held_locks+0x6e/0x90
 >  ? __local_bh_enable_ip+0x94/0x100
 >  ? lock_sock_nested+0x51/0xb0
 >  tcp_sendmsg+0x27/0x40
 >  inet_sendmsg+0xd0/0x310
 >  ? inet_recvmsg+0x360/0x360
 >  ? match_held_lock+0x8d/0x440
 >  ? inet_recvmsg+0x360/0x360
 >  sock_write_iter+0x17a/0x240
 >  ? sock_ioctl+0x290/0x290
 >  ? find_held_lock+0x6d/0xd0
 >  __vfs_write+0x2ab/0x380
 >  ? kernel_read+0xa0/0xa0
 >  ? __context_tracking_exit.part.4+0xe7/0x290
 >  ? lock_acquire+0x350/0x350
 >  ? __fdget_pos+0x7f/0x110
 >  ? __fdget_raw+0x10/0x10
 >  vfs_write+0xfb/0x260
 >  SyS_write+0xb6/0x140
 >  ? SyS_read+0x140/0x140
 >  ? SyS_clock_settime+0x120/0x120
 >  ? mark_held_locks+0x1c/0x90
 >  ? do_syscall_64+0x110/0xc05
 >  ? SyS_read+0x140/0x140
 >  do_syscall_64+0x1e5/0xc05
 >  ? syscall_return_slowpath+0x5b0/0x5b0
 >  ? lock_acquire+0x350/0x350
 >  ? lockdep_rcu_suspicious+0x100/0x100
 >  ? get_vtime_delta+0x15/0xf0
 >  ? get_vtime_delta+0x8b/0xf0
 >  ? vtime_user_enter+0x7f/0x90
 >  ? __context_tracking_enter+0x13c/0x2b0
 >  ? __context_tracking_enter+0x13c/0x2b0
 >  ? context_tracking_exit.part.5+0x40/0x40
 >  ? rcu_is_watching+0x88/0xd0
 >  ? time_hardirqs_on+0x220/0x220
 >  ? prepare_exit_to_usermode+0x1d0/0x2a0
 >  ? enter_from_user_mode+0x30/0x30
 >  ? entry_SYSCALL_64_after_hwframe+0x18/0x2e
 >  ? trace_hardirqs_off_caller+0xc2/0x110
 >  ? trace_hardirqs_off_thunk+0x1a/0x1c
 >  entry_SYSCALL64_slow_path+0x25/0x25
 

And now I can hit this one, from snmpd:



============================================
WARNING: possible recursive locking detected
4.15.0-rc9-backup-debug+ #7 Not tainted
--------------------------------------------
snmpd/892 is trying to acquire lock:
 (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

but task is already holding lock:
 (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(fs_reclaim);
  lock(fs_reclaim);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by snmpd/892:
 #0:  (rtnl_mutex){+.+.}, at: [<00000000dcd3ba2f>] netlink_dump+0x89/0x520
 #1:  (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

stack backtrace:
CPU: 5 PID: 892 Comm: snmpd Not tainted 4.15.0-rc9-backup-debug+ #7
Call Trace:
 dump_stack+0xbc/0x13f
 ? _atomic_dec_and_lock+0x101/0x101
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? print_lock+0x54/0x68
 __lock_acquire+0xa09/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? __save_stack_trace+0x92/0x100
 ? __list_add_valid+0x29/0xa0
 ? add_lock_to_list.isra.26+0x1d0/0x21f
 ? print_lockdep_cache.isra.29+0xd8/0xd8
 ? save_trace+0x106/0x1e0
 ? graph_lock+0x100/0x100
 ? graph_lock+0x8d/0x100
 ? check_noncircular+0x20/0x20
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? lock_acquire+0x12e/0x350
 lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? alloc_extent_state+0xa7/0x410
 fs_reclaim_acquire.part.101+0x29/0x30
 ? fs_reclaim_acquire.part.101+0x5/0x30
 kmem_cache_alloc+0x3d/0x2c0
 alloc_extent_state+0xa7/0x410
 ? lock_extent_buffer_for_io+0x3f0/0x3f0
 ? find_held_lock+0x6d/0xd0
 ? test_range_bit+0x197/0x210
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? iotree_fs_info+0x30/0x30
 __clear_extent_bit+0x3ea/0x570
 ? clear_state_bit+0x270/0x270
 ? count_range_bits+0x2f0/0x2f0
 try_release_extent_mapping+0x21a/0x260
 __btrfs_releasepage+0xb0/0x1c0
 ? btrfs_submit_direct+0xca0/0xca0
 ? check_usage+0x257/0x790
 ? match_held_lock+0xa5/0x440
 ? print_irqtrace_events+0x110/0x110
 btrfs_releasepage+0x161/0x170
 ? __btrfs_releasepage+0x1c0/0x1c0
 ? page_rmapping+0xd0/0xd0
 ? rmap_walk+0x100/0x100
 try_to_release_page+0x162/0x1c0
 ? generic_file_write_iter+0x3c0/0x3c0
 ? page_evictable+0xcc/0x110
 ? debug_show_all_locks+0x2f0/0x2f0
 shrink_page_list+0x1d5a/0x2fb0
 ? putback_lru_page+0x3f0/0x3f0
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? update_curr+0xd0/0x670
 ? __lock_is_held+0x71/0xc0
 ? update_cfs_group+0x86/0x290
 ? __list_add_valid+0x29/0xa0
 ? account_entity_dequeue+0x230/0x230
 ? nohz_balance_exit_idle.part.92+0x60/0x60
 ? __update_load_avg_se.isra.28+0x352/0x360
 ? __update_load_avg_se.isra.28+0x1f4/0x360
 ? __accumulate_pelt_segments+0x47/0xd0
 ? __enqueue_entity+0x93/0xc0
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? check_preempt_wakeup+0x410/0x410
 ? mark_lock+0x1b1/0xa00
 ? __kernel_map_pages+0x2c9/0x310
 ? set_pages_rw+0xe0/0xe0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? mark_held_locks+0x6e/0x90
 ? trace_hardirqs_on_caller+0x187/0x260
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? mark_held_locks+0x6e/0x90
 ? mark_lock+0x1b1/0xa00
 ? kasan_unpoison_shadow+0x30/0x40
 ? print_irqtrace_events+0x110/0x110
 ? mutex_destroy+0x120/0x120
 ? hlock_class+0xa0/0xa0
 ? __kernel_map_pages+0x2c9/0x310
 ? __lock_acquire+0x616/0x2040
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? debug_show_all_locks+0x2f0/0x2f0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mark_held_locks+0x6e/0x90
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? check_usage+0x174/0x790
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? mark_lock+0x1b1/0xa00
 ? usage_match+0x1e/0x30
 ? print_irqtrace_events+0x110/0x110
 ? __bfs+0x2a2/0x430
 ? noop_count+0x20/0x20
 ? __lock_acquire+0x616/0x2040
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __orc_find+0x6b/0xc0
 ? unwind_next_frame+0x39c/0x9d0
 ? __save_stack_trace+0x5e/0x100
 ? save_trace+0xbd/0x1e0
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? debug_lockdep_rcu_enabled.part.37+0x16/0x30
 ? ftrace_ops_trampoline+0x111/0x190
 ? ftrace_profile_pages_init+0x130/0x130
 ? unwind_next_frame+0x39c/0x9d0
 ? rcu_is_watching+0x88/0xd0
 ? rcu_nmi_exit+0x130/0x130
 ? is_ftrace_trampoline+0x5/0x10
 ? kernel_text_address+0x5c/0x90
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? __list_add_valid+0x29/0xa0
 ? add_lock_to_list.isra.26+0x1d0/0x21f
 ? print_lockdep_cache.isra.29+0xd8/0xd8
 ? save_trace+0x106/0x1e0
 ? __isolate_lru_page+0x2dc/0x3c0
 ? remove_mapping+0x1b0/0x1b0
 ? match_held_lock+0xa5/0x440
 ? __lock_acquire+0x616/0x2040
 ? __mod_zone_page_state+0x1a/0x70
 ? isolate_lru_pages.isra.79+0x888/0xae0
 ? __isolate_lru_page+0x3c0/0x3c0
 ? noop_count+0x20/0x20
 ? hlock_class+0xa0/0xa0
 ? print_irqtrace_events+0x110/0x110
 ? check_usage+0x174/0x790
 ? mark_lock+0x1b1/0xa00
 ? rb_next+0x90/0x90
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? class_equal+0x11/0x20
 ? __bfs+0xed/0x430
 ? __kernel_map_pages+0x2c9/0x310
 ? mutex_destroy+0x120/0x120
 ? find_held_lock+0x6d/0xd0
 ? shrink_inactive_list+0x3b4/0x940
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? stop_critical_timings+0x210/0x210
 ? mark_held_locks+0x6e/0x90
 ? _raw_spin_unlock_irq+0x29/0x40
 shrink_inactive_list+0x451/0x940
 ? putback_inactive_pages+0x9f0/0x9f0
 ? isolate_migratepages_range+0x120/0x120
 ? mark_lock+0x1b1/0xa00
 ? update_curr+0x2d9/0x670
 ? compact_zone+0x1d3/0x14c0
 ? blk_start_plug+0x17d/0x1e0
 ? kblockd_schedule_delayed_work_on+0x20/0x20
 ? save_trace+0x1e0/0x1e0
 ? update_curr+0x15c/0x670
 ? active_load_balance_cpu_stop+0x7b0/0x7b0
 ? compaction_suitable+0x350/0x350
 ? update_cfs_group+0x23a/0x290
 ? match_held_lock+0x8d/0x440
 shrink_node_memcg.constprop.84+0x4c9/0x5e0
 ? shrink_active_list+0x9c0/0x9c0
 ? __delayacct_freepages_start+0x28/0x40
 ? lockdep_rcu_suspicious+0x100/0x100
 ? shrink_node+0x1c2/0x510
 shrink_node+0x1c2/0x510
 ? trace_hardirqs_on_caller+0x187/0x260
 ? shrink_node_memcg.constprop.84+0x5e0/0x5e0
 ? getnstimeofday64+0x20/0x20
 ? allow_direct_reclaim.part.78+0x220/0x220
 ? mark_lock+0x1b1/0xa00
 ? mark_lock+0x1b1/0xa00
 ? __lock_is_held+0x51/0xc0
 try_to_free_pages+0x425/0xb90
 ? shrink_node+0x510/0x510
 ? try_to_compact_pages+0x1f4/0x6b0
 ? compaction_zonelist_suitable+0x2f0/0x2f0
 ? lock_acquire+0x12e/0x350
 ? lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? rcu_note_context_switch+0x520/0x520
 ? wake_all_kswapds+0x10a/0x150
 __alloc_pages_slowpath+0x955/0x1a00
 ? __lock_acquire+0x616/0x2040
 ? warn_alloc+0x250/0x250
 ? __lock_acquire+0x616/0x2040
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? match_held_lock+0xa5/0x440
 ? save_trace+0x1e0/0x1e0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __orc_find+0x6b/0xc0
 ? match_held_lock+0xa5/0x440
 ? find_held_lock+0x6d/0xd0
 ? __lock_is_held+0x51/0xc0
 ? rcu_note_context_switch+0x520/0x520
 ? perf_trace_sched_switch+0x560/0x560
 ? __might_sleep+0x58/0xe0
 __alloc_pages_nodemask+0x52c/0x5c0
 ? gfp_pfmemalloc_allowed+0xc0/0xc0
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? match_held_lock+0x8d/0x440
 ? depot_save_stack+0x12e/0x480
 ? match_held_lock+0xa5/0x440
 ? stop_critical_timings+0x210/0x210
 ? __netlink_dump_start+0x201/0x280
 ? mark_held_locks+0x6e/0x90
 ? new_slab+0x2f3/0x3f0
 new_slab+0x374/0x3f0
 ___slab_alloc.constprop.81+0x47e/0x5a0
 ? __lock_is_held+0x51/0xc0
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __slab_alloc.constprop.80+0x32/0x60
 __slab_alloc.constprop.80+0x32/0x60
 ? __alloc_skb+0xee/0x390
 __kmalloc_track_caller+0x267/0x310
 __kmalloc_reserve.isra.40+0x29/0x80
 __alloc_skb+0xee/0x390
 ? __skb_splice_bits+0x3e0/0x3e0
 ? netlink_connect+0x1d0/0x1d0
 ? __netlink_dump_start+0x1f9/0x280
 ? __mutex_unlock_slowpath+0x121/0x460
 ? wait_for_completion_killable_timeout+0x450/0x450
 ? find_held_lock+0x6d/0xd0
 netlink_dump+0x2e1/0x520
 ? refcount_inc_not_zero+0x74/0x110
 ? __nlmsg_put+0xb0/0xb0
 ? rcu_is_watching+0x88/0xd0
 __netlink_dump_start+0x201/0x280
 ? inet6_dump_ifmcaddr+0x10/0x10
 rtnetlink_rcv_msg+0x6d6/0xa90
 ? validate_linkmsg+0x540/0x540
 ? inet6_dump_ifmcaddr+0x10/0x10
 ? find_held_lock+0x6d/0xd0
 ? netlink_lookup.isra.42+0x428/0x730
 ? lock_acquire+0x350/0x350
 ? find_held_lock+0x6d/0xd0
 ? inet6_dump_ifmcaddr+0x10/0x10
 ? netlink_deliver_tap+0x124/0x5c0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? netlink_lookup.isra.42+0x447/0x730
 ? rcu_is_watching+0x88/0xd0
 ? netlink_connect+0x1d0/0x1d0
 ? netlink_deliver_tap+0x143/0x5c0
 ? __might_fault+0x7d/0xe0
 ? iov_iter_advance+0x176/0x7d0
 ? netlink_getname+0x150/0x150
 netlink_rcv_skb+0xb6/0x1d0
 ? validate_linkmsg+0x540/0x540
 ? netlink_ack+0x4a0/0x4a0
 ? netlink_trim+0xda/0x1b0
 netlink_unicast+0x298/0x320
 ? netlink_detachskb+0x30/0x30
 ? __fget+0x410/0x410
 netlink_sendmsg+0x57e/0x630
 ? netlink_broadcast_filtered+0x8f0/0x8f0
 ? netlink_broadcast_filtered+0x8f0/0x8f0
 SYSC_sendto+0x296/0x320
 ? SYSC_connect+0x200/0x200
 ? __context_tracking_exit.part.4+0xe7/0x290
 ? cyc2ns_read_end+0x10/0x10
 ? lockdep_rcu_suspicious+0x100/0x100
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.4+0x223/0x290
 ? stop_critical_timings+0x210/0x210
 ? SyS_socket+0xd6/0x120
 ? sock_create_kern+0x10/0x10
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0x110/0xc05
 ? SyS_getpeername+0x10/0x10
 do_syscall_64+0x1e5/0xc05
 ? syscall_return_slowpath+0x5b0/0x5b0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? get_vtime_delta+0x15/0xf0
 ? get_vtime_delta+0x8b/0xf0
 ? vtime_user_enter+0x7f/0x90
 ? __context_tracking_enter+0x13c/0x2b0
 ? __context_tracking_enter+0x13c/0x2b0
 ? context_tracking_exit.part.5+0x40/0x40
 ? do_page_fault+0xb0/0x4d0
 ? rcu_is_watching+0x88/0xd0
 ? vmalloc_sync_all+0x20/0x20
 ? time_hardirqs_on+0x220/0x220
 ? prepare_exit_to_usermode+0x1d0/0x2a0
 ? enter_from_user_mode+0x30/0x30
 ? entry_SYSCALL_64_after_hwframe+0x18/0x2e
 ? trace_hardirqs_off_caller+0xc2/0x110
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f204299f54d
RSP: 002b:00007ffc49024fd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f204299f54d
RDX: 0000000000000018 RSI: 00007ffc49025010 RDI: 0000000000000012
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000012
R13: 00007ffc49029550 R14: 000055e31307a250 R15: 00007ffc49029530
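
Both traces come down to the same annotation: fs_reclaim is not a real
lock, it's a lockdep map that is "acquired" around allocations that may
sleep and are allowed to recurse into the filesystem, and around the
reclaim work itself.  Roughly like this (simplified sketch; the real code
in mm/page_alloc.c has extra exemptions such as PF_MEMALLOC tasks,
__GFP_NOLOCKDEP and current_gfp_context(), and the demo_* names below are
made up):

#include <linux/gfp.h>
#include <linux/lockdep.h>

static struct lockdep_map __demo_fs_reclaim_map =
        STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__demo_fs_reclaim_map);

static bool demo_need_fs_reclaim(gfp_t gfp_mask)
{
        /* only sleeping allocations that may recurse into the filesystem */
        return (gfp_mask & __GFP_DIRECT_RECLAIM) && (gfp_mask & __GFP_FS);
}

void demo_fs_reclaim_acquire(gfp_t gfp_mask)
{
        if (demo_need_fs_reclaim(gfp_mask))
                lock_map_acquire(&__demo_fs_reclaim_map);
}

void demo_fs_reclaim_release(gfp_t gfp_mask)
{
        if (demo_need_fs_reclaim(gfp_mask))
                lock_map_release(&__demo_fs_reclaim_map);
}

Since every such allocation maps onto the single fs_reclaim class, the
sshd/tcp_sendmsg path and the snmpd/netlink_dump path end up producing the
same-looking report once btrfs_releasepage() allocates from inside reclaim.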

 >  ? __context_tracking_enter+0x13c/0x2b0
 >  ? context_tracking_exit.part.5+0x40/0x40
 >  ? rcu_is_watching+0x88/0xd0
 >  ? time_hardirqs_on+0x220/0x220
 >  ? prepare_exit_to_usermode+0x1d0/0x2a0
 >  ? enter_from_user_mode+0x30/0x30
 >  ? entry_SYSCALL_64_after_hwframe+0x18/0x2e
 >  ? trace_hardirqs_off_caller+0xc2/0x110
 >  ? trace_hardirqs_off_thunk+0x1a/0x1c
 >  entry_SYSCALL64_slow_path+0x25/0x25
 

 And now I can hit..



============================================
WARNING: possible recursive locking detected
4.15.0-rc9-backup-debug+ #7 Not tainted
--------------------------------------------
snmpd/892 is trying to acquire lock:
 (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

but task is already holding lock:
 (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(fs_reclaim);
  lock(fs_reclaim);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by snmpd/892:
 #0:  (rtnl_mutex){+.+.}, at: [<00000000dcd3ba2f>] netlink_dump+0x89/0x520
 #1:  (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30

stack backtrace:
CPU: 5 PID: 892 Comm: snmpd Not tainted 4.15.0-rc9-backup-debug+ #7
Call Trace:
 dump_stack+0xbc/0x13f
 ? _atomic_dec_and_lock+0x101/0x101
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? print_lock+0x54/0x68
 __lock_acquire+0xa09/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? __save_stack_trace+0x92/0x100
 ? __list_add_valid+0x29/0xa0
 ? add_lock_to_list.isra.26+0x1d0/0x21f
 ? print_lockdep_cache.isra.29+0xd8/0xd8
 ? save_trace+0x106/0x1e0
 ? graph_lock+0x100/0x100
 ? graph_lock+0x8d/0x100
 ? check_noncircular+0x20/0x20
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? lock_acquire+0x12e/0x350
 lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? alloc_extent_state+0xa7/0x410
 fs_reclaim_acquire.part.101+0x29/0x30
 ? fs_reclaim_acquire.part.101+0x5/0x30
 kmem_cache_alloc+0x3d/0x2c0
 alloc_extent_state+0xa7/0x410
 ? lock_extent_buffer_for_io+0x3f0/0x3f0
 ? find_held_lock+0x6d/0xd0
 ? test_range_bit+0x197/0x210
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? iotree_fs_info+0x30/0x30
 __clear_extent_bit+0x3ea/0x570
 ? clear_state_bit+0x270/0x270
 ? count_range_bits+0x2f0/0x2f0
 try_release_extent_mapping+0x21a/0x260
 __btrfs_releasepage+0xb0/0x1c0
 ? btrfs_submit_direct+0xca0/0xca0
 ? check_usage+0x257/0x790
 ? match_held_lock+0xa5/0x440
 ? print_irqtrace_events+0x110/0x110
 btrfs_releasepage+0x161/0x170
 ? __btrfs_releasepage+0x1c0/0x1c0
 ? page_rmapping+0xd0/0xd0
 ? rmap_walk+0x100/0x100
 try_to_release_page+0x162/0x1c0
 ? generic_file_write_iter+0x3c0/0x3c0
 ? page_evictable+0xcc/0x110
 ? debug_show_all_locks+0x2f0/0x2f0
 shrink_page_list+0x1d5a/0x2fb0
 ? putback_lru_page+0x3f0/0x3f0
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? update_curr+0xd0/0x670
 ? __lock_is_held+0x71/0xc0
 ? update_cfs_group+0x86/0x290
 ? __list_add_valid+0x29/0xa0
 ? account_entity_dequeue+0x230/0x230
 ? nohz_balance_exit_idle.part.92+0x60/0x60
 ? __update_load_avg_se.isra.28+0x352/0x360
 ? __update_load_avg_se.isra.28+0x1f4/0x360
 ? __accumulate_pelt_segments+0x47/0xd0
 ? __enqueue_entity+0x93/0xc0
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? check_preempt_wakeup+0x410/0x410
 ? mark_lock+0x1b1/0xa00
 ? __kernel_map_pages+0x2c9/0x310
 ? set_pages_rw+0xe0/0xe0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? mark_held_locks+0x6e/0x90
 ? trace_hardirqs_on_caller+0x187/0x260
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? mark_held_locks+0x6e/0x90
 ? mark_lock+0x1b1/0xa00
 ? kasan_unpoison_shadow+0x30/0x40
 ? print_irqtrace_events+0x110/0x110
 ? mutex_destroy+0x120/0x120
 ? hlock_class+0xa0/0xa0
 ? __kernel_map_pages+0x2c9/0x310
 ? __lock_acquire+0x616/0x2040
 ? __lock_acquire+0x616/0x2040
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? debug_show_all_locks+0x2f0/0x2f0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? mark_held_locks+0x6e/0x90
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? check_usage+0x174/0x790
 ? mark_lock+0x1b1/0xa00
 ? print_irqtrace_events+0x110/0x110
 ? mark_lock+0x1b1/0xa00
 ? usage_match+0x1e/0x30
 ? print_irqtrace_events+0x110/0x110
 ? __bfs+0x2a2/0x430
 ? noop_count+0x20/0x20
 ? __lock_acquire+0x616/0x2040
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __orc_find+0x6b/0xc0
 ? unwind_next_frame+0x39c/0x9d0
 ? __save_stack_trace+0x5e/0x100
 ? save_trace+0xbd/0x1e0
 ? stack_access_ok+0x35/0x80
 ? deref_stack_reg+0xa1/0xe0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? debug_lockdep_rcu_enabled.part.37+0x16/0x30
 ? ftrace_ops_trampoline+0x111/0x190
 ? ftrace_profile_pages_init+0x130/0x130
 ? unwind_next_frame+0x39c/0x9d0
 ? rcu_is_watching+0x88/0xd0
 ? rcu_nmi_exit+0x130/0x130
 ? is_ftrace_trampoline+0x5/0x10
 ? kernel_text_address+0x5c/0x90
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? __list_add_valid+0x29/0xa0
 ? add_lock_to_list.isra.26+0x1d0/0x21f
 ? print_lockdep_cache.isra.29+0xd8/0xd8
 ? save_trace+0x106/0x1e0
 ? __isolate_lru_page+0x2dc/0x3c0
 ? remove_mapping+0x1b0/0x1b0
 ? match_held_lock+0xa5/0x440
 ? __lock_acquire+0x616/0x2040
 ? __mod_zone_page_state+0x1a/0x70
 ? isolate_lru_pages.isra.79+0x888/0xae0
 ? __isolate_lru_page+0x3c0/0x3c0
 ? noop_count+0x20/0x20
 ? hlock_class+0xa0/0xa0
 ? print_irqtrace_events+0x110/0x110
 ? check_usage+0x174/0x790
 ? mark_lock+0x1b1/0xa00
 ? rb_next+0x90/0x90
 ? match_held_lock+0x8d/0x440
 ? mark_lock+0x1b1/0xa00
 ? save_trace+0x1e0/0x1e0
 ? print_irqtrace_events+0x110/0x110
 ? class_equal+0x11/0x20
 ? __bfs+0xed/0x430
 ? __kernel_map_pages+0x2c9/0x310
 ? mutex_destroy+0x120/0x120
 ? find_held_lock+0x6d/0xd0
 ? shrink_inactive_list+0x3b4/0x940
 ? lock_acquire+0x350/0x350
 ? do_raw_spin_unlock+0x147/0x220
 ? do_raw_spin_trylock+0x100/0x100
 ? stop_critical_timings+0x210/0x210
 ? mark_held_locks+0x6e/0x90
 ? _raw_spin_unlock_irq+0x29/0x40
 shrink_inactive_list+0x451/0x940
 ? putback_inactive_pages+0x9f0/0x9f0
 ? isolate_migratepages_range+0x120/0x120
 ? mark_lock+0x1b1/0xa00
 ? update_curr+0x2d9/0x670
 ? compact_zone+0x1d3/0x14c0
 ? blk_start_plug+0x17d/0x1e0
 ? kblockd_schedule_delayed_work_on+0x20/0x20
 ? save_trace+0x1e0/0x1e0
 ? update_curr+0x15c/0x670
 ? active_load_balance_cpu_stop+0x7b0/0x7b0
 ? compaction_suitable+0x350/0x350
 ? update_cfs_group+0x23a/0x290
 ? match_held_lock+0x8d/0x440
 shrink_node_memcg.constprop.84+0x4c9/0x5e0
 ? shrink_active_list+0x9c0/0x9c0
 ? __delayacct_freepages_start+0x28/0x40
 ? lockdep_rcu_suspicious+0x100/0x100
 ? shrink_node+0x1c2/0x510
 shrink_node+0x1c2/0x510
 ? trace_hardirqs_on_caller+0x187/0x260
 ? shrink_node_memcg.constprop.84+0x5e0/0x5e0
 ? getnstimeofday64+0x20/0x20
 ? allow_direct_reclaim.part.78+0x220/0x220
 ? mark_lock+0x1b1/0xa00
 ? mark_lock+0x1b1/0xa00
 ? __lock_is_held+0x51/0xc0
 try_to_free_pages+0x425/0xb90
 ? shrink_node+0x510/0x510
 ? try_to_compact_pages+0x1f4/0x6b0
 ? compaction_zonelist_suitable+0x2f0/0x2f0
 ? lock_acquire+0x12e/0x350
 ? lock_acquire+0x12e/0x350
 ? fs_reclaim_acquire.part.101+0x5/0x30
 ? lockdep_rcu_suspicious+0x100/0x100
 ? rcu_note_context_switch+0x520/0x520
 ? wake_all_kswapds+0x10a/0x150
 __alloc_pages_slowpath+0x955/0x1a00
 ? __lock_acquire+0x616/0x2040
 ? warn_alloc+0x250/0x250
 ? __lock_acquire+0x616/0x2040
 ? match_held_lock+0x8d/0x440
 ? save_trace+0x1e0/0x1e0
 ? debug_show_all_locks+0x2f0/0x2f0
 ? match_held_lock+0xa5/0x440
 ? save_trace+0x1e0/0x1e0
 ? __read_once_size_nocheck.constprop.6+0x10/0x10
 ? __orc_find+0x6b/0xc0
 ? match_held_lock+0xa5/0x440
 ? find_held_lock+0x6d/0xd0
 ? __lock_is_held+0x51/0xc0
 ? rcu_note_context_switch+0x520/0x520
 ? perf_trace_sched_switch+0x560/0x560
 ? __might_sleep+0x58/0xe0
 __alloc_pages_nodemask+0x52c/0x5c0
 ? gfp_pfmemalloc_allowed+0xc0/0xc0
 ? __kernel_text_address+0xe/0x30
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? match_held_lock+0x8d/0x440
 ? depot_save_stack+0x12e/0x480
 ? match_held_lock+0xa5/0x440
 ? stop_critical_timings+0x210/0x210
 ? __netlink_dump_start+0x201/0x280
 ? mark_held_locks+0x6e/0x90
 ? new_slab+0x2f3/0x3f0
 new_slab+0x374/0x3f0
 ___slab_alloc.constprop.81+0x47e/0x5a0
 ? __lock_is_held+0x51/0xc0
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __alloc_skb+0xee/0x390
 ? __slab_alloc.constprop.80+0x32/0x60
 __slab_alloc.constprop.80+0x32/0x60
 ? __alloc_skb+0xee/0x390
 __kmalloc_track_caller+0x267/0x310
 __kmalloc_reserve.isra.40+0x29/0x80
 __alloc_skb+0xee/0x390
 ? __skb_splice_bits+0x3e0/0x3e0
 ? netlink_connect+0x1d0/0x1d0
 ? __netlink_dump_start+0x1f9/0x280
 ? __mutex_unlock_slowpath+0x121/0x460
 ? wait_for_completion_killable_timeout+0x450/0x450
 ? find_held_lock+0x6d/0xd0
 netlink_dump+0x2e1/0x520
 ? refcount_inc_not_zero+0x74/0x110
 ? __nlmsg_put+0xb0/0xb0
 ? rcu_is_watching+0x88/0xd0
 __netlink_dump_start+0x201/0x280
 ? inet6_dump_ifmcaddr+0x10/0x10
 rtnetlink_rcv_msg+0x6d6/0xa90
 ? validate_linkmsg+0x540/0x540
 ? inet6_dump_ifmcaddr+0x10/0x10
 ? find_held_lock+0x6d/0xd0
 ? netlink_lookup.isra.42+0x428/0x730
 ? lock_acquire+0x350/0x350
 ? find_held_lock+0x6d/0xd0
 ? inet6_dump_ifmcaddr+0x10/0x10
 ? netlink_deliver_tap+0x124/0x5c0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? netlink_lookup.isra.42+0x447/0x730
 ? rcu_is_watching+0x88/0xd0
 ? netlink_connect+0x1d0/0x1d0
 ? netlink_deliver_tap+0x143/0x5c0
 ? __might_fault+0x7d/0xe0
 ? iov_iter_advance+0x176/0x7d0
 ? netlink_getname+0x150/0x150
 netlink_rcv_skb+0xb6/0x1d0
 ? validate_linkmsg+0x540/0x540
 ? netlink_ack+0x4a0/0x4a0
 ? netlink_trim+0xda/0x1b0
 netlink_unicast+0x298/0x320
 ? netlink_detachskb+0x30/0x30
 ? __fget+0x410/0x410
 netlink_sendmsg+0x57e/0x630
 ? netlink_broadcast_filtered+0x8f0/0x8f0
 ? netlink_broadcast_filtered+0x8f0/0x8f0
 SYSC_sendto+0x296/0x320
 ? SYSC_connect+0x200/0x200
 ? __context_tracking_exit.part.4+0xe7/0x290
 ? cyc2ns_read_end+0x10/0x10
 ? lockdep_rcu_suspicious+0x100/0x100
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.4+0x223/0x290
 ? stop_critical_timings+0x210/0x210
 ? SyS_socket+0xd6/0x120
 ? sock_create_kern+0x10/0x10
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0x110/0xc05
 ? SyS_getpeername+0x10/0x10
 do_syscall_64+0x1e5/0xc05
 ? syscall_return_slowpath+0x5b0/0x5b0
 ? lock_acquire+0x350/0x350
 ? lockdep_rcu_suspicious+0x100/0x100
 ? get_vtime_delta+0x15/0xf0
 ? get_vtime_delta+0x8b/0xf0
 ? vtime_user_enter+0x7f/0x90
 ? __context_tracking_enter+0x13c/0x2b0
 ? __context_tracking_enter+0x13c/0x2b0
 ? context_tracking_exit.part.5+0x40/0x40
 ? do_page_fault+0xb0/0x4d0
 ? rcu_is_watching+0x88/0xd0
 ? vmalloc_sync_all+0x20/0x20
 ? time_hardirqs_on+0x220/0x220
 ? prepare_exit_to_usermode+0x1d0/0x2a0
 ? enter_from_user_mode+0x30/0x30
 ? entry_SYSCALL_64_after_hwframe+0x18/0x2e
 ? trace_hardirqs_off_caller+0xc2/0x110
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f204299f54d
RSP: 002b:00007ffc49024fd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f204299f54d
RDX: 0000000000000018 RSI: 00007ffc49025010 RDI: 0000000000000012
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000012
R13: 00007ffc49029550 R14: 000055e31307a250 R15: 00007ffc49029530

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-27 22:24   ` Dave Jones
@ 2018-01-27 22:43     ` Linus Torvalds
  -1 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2018-01-27 22:43 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, linux-mm, Network Development, Peter Zijlstra

On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones <davej@codemonkey.org.uk> wrote:
> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>  > Just triggered this on a server I was rsync'ing to.
>
> Actually, I can trigger this really easily, even with an rsync from one
> disk to another.  Though that also smells a little like networking in
> the traces. Maybe netdev has ideas.

Is this new to 4.15? Or is it just that you're testing something new?

If it's new and easy to repro, can you just bisect it? And if it isn't
new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
ok)?

Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
but it was rewritten for 4.14.. I'm wondering if that remodeling ended
up triggering something.

Adding PeterZ to the participants list in case he has ideas. I'm not
seeing what would be the problem in that call chain from hell.

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-27 22:43     ` Linus Torvalds
@ 2018-01-28  1:16       ` Tetsuo Handa
  -1 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-01-28  1:16 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones, Peter Zijlstra
  Cc: Linux Kernel, linux-mm, Network Development

Linus Torvalds wrote:
> On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones <davej@codemonkey.org.uk> wrote:
>> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>>  > Just triggered this on a server I was rsync'ing to.
>>
>> Actually, I can trigger this really easily, even with an rsync from one
>> disk to another.  Though that also smells a little like networking in
>> the traces. Maybe netdev has ideas.
> 
> Is this new to 4.15? Or is it just that you're testing something new?
> 
> If it's new and easy to repro, can you just bisect it? And if it isn't
> new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
> ok)?
> 
> Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
> but it was rewritten for 4.14.. I'm wondering if that remodeling ended
> up triggering something.

--- linux-4.13.16/mm/page_alloc.c
+++ linux-4.14.15/mm/page_alloc.c
@@ -3527,53 +3519,12 @@
 			return true;
 	}
 	return false;
 }
 #endif /* CONFIG_COMPACTION */
 
-#ifdef CONFIG_LOCKDEP
-struct lockdep_map __fs_reclaim_map =
-	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-
-static bool __need_fs_reclaim(gfp_t gfp_mask)
-{
-	gfp_mask = current_gfp_context(gfp_mask);
-
-	/* no reclaim without waiting on it */
-	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
-		return false;
-
-	/* this guy won't enter reclaim */
-	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
-		return false;
-
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
-	if (gfp_mask & __GFP_NOLOCKDEP)
-		return false;
-
-	return true;
-}
-
-void fs_reclaim_acquire(gfp_t gfp_mask)
-{
-	if (__need_fs_reclaim(gfp_mask))
-		lock_map_acquire(&__fs_reclaim_map);
-}
-EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
-
-void fs_reclaim_release(gfp_t gfp_mask)
-{
-	if (__need_fs_reclaim(gfp_mask))
-		lock_map_release(&__fs_reclaim_map);
-}
-EXPORT_SYMBOL_GPL(fs_reclaim_release);
-#endif
-
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 					const struct alloc_context *ac)
 {
 	struct reclaim_state reclaim_state;
@@ -3582,21 +3533,21 @@
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	noreclaim_flag = memalloc_noreclaim_save();
-	fs_reclaim_acquire(gfp_mask);
+	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
 	current->reclaim_state = &reclaim_state;
 
 	progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
 								ac->nodemask);
 
 	current->reclaim_state = NULL;
-	fs_reclaim_release(gfp_mask);
+	lockdep_clear_current_reclaim_state();
 	memalloc_noreclaim_restore(noreclaim_flag);
 
 	cond_resched();
 
 	return progress;
 }

> 
> Adding PeterZ to the participants list in case he has ideas. I'm not
> seeing what would be the problem in that call chain from hell.
> 
>                Linus

Dave Jones wrote:
> ============================================
> WARNING: possible recursive locking detected
> 4.15.0-rc9-backup-debug+ #1 Not tainted
> --------------------------------------------
> sshd/24800 is trying to acquire lock:
>  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
> but task is already holding lock:
>  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(fs_reclaim);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by sshd/24800:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
>  #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
> stack backtrace:
> CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
> Call Trace:
>  dump_stack+0xbc/0x13f
>  __lock_acquire+0xa09/0x2040
>  lock_acquire+0x12e/0x350
>  fs_reclaim_acquire.part.102+0x29/0x30
>  kmem_cache_alloc+0x3d/0x2c0
>  alloc_extent_state+0xa7/0x410
>  __clear_extent_bit+0x3ea/0x570
>  try_release_extent_mapping+0x21a/0x260
>  __btrfs_releasepage+0xb0/0x1c0
>  btrfs_releasepage+0x161/0x170
>  try_to_release_page+0x162/0x1c0
>  shrink_page_list+0x1d5a/0x2fb0
>  shrink_inactive_list+0x451/0x940
>  shrink_node_memcg.constprop.88+0x4c9/0x5e0
>  shrink_node+0x12d/0x260
>  try_to_free_pages+0x418/0xaf0
>  __alloc_pages_slowpath+0x976/0x1790
>  __alloc_pages_nodemask+0x52c/0x5c0
>  new_slab+0x374/0x3f0
>  ___slab_alloc.constprop.81+0x47e/0x5a0
>  __slab_alloc.constprop.80+0x32/0x60
>  __kmalloc_track_caller+0x267/0x310
>  __kmalloc_reserve.isra.40+0x29/0x80
>  __alloc_skb+0xee/0x390
>  sk_stream_alloc_skb+0xb8/0x340
>  tcp_sendmsg_locked+0x8e6/0x1d30
>  tcp_sendmsg+0x27/0x40
>  inet_sendmsg+0xd0/0x310
>  sock_write_iter+0x17a/0x240
>  __vfs_write+0x2ab/0x380
>  vfs_write+0xfb/0x260
>  SyS_write+0xb6/0x140
>  do_syscall_64+0x1e5/0xc05
>  entry_SYSCALL64_slow_path+0x25/0x25

> ============================================
> WARNING: possible recursive locking detected
> 4.15.0-rc9-backup-debug+ #7 Not tainted
> --------------------------------------------
> snmpd/892 is trying to acquire lock:
>  (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30
> 
> but task is already holding lock:
>  (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(fs_reclaim);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by snmpd/892:
>  #0:  (rtnl_mutex){+.+.}, at: [<00000000dcd3ba2f>] netlink_dump+0x89/0x520
>  #1:  (fs_reclaim){+.+.}, at: [<0000000002e4c185>] fs_reclaim_acquire.part.101+0x5/0x30
> 
> stack backtrace:
> CPU: 5 PID: 892 Comm: snmpd Not tainted 4.15.0-rc9-backup-debug+ #7
> Call Trace:
>  dump_stack+0xbc/0x13f
>  __lock_acquire+0xa09/0x2040
>  lock_acquire+0x12e/0x350
>  fs_reclaim_acquire.part.101+0x29/0x30
>  kmem_cache_alloc+0x3d/0x2c0
>  alloc_extent_state+0xa7/0x410
>  __clear_extent_bit+0x3ea/0x570
>  try_release_extent_mapping+0x21a/0x260
>  __btrfs_releasepage+0xb0/0x1c0
>  btrfs_releasepage+0x161/0x170
>  try_to_release_page+0x162/0x1c0
>  shrink_page_list+0x1d5a/0x2fb0
>  shrink_inactive_list+0x451/0x940
>  shrink_node_memcg.constprop.84+0x4c9/0x5e0
>  shrink_node+0x1c2/0x510
>  try_to_free_pages+0x425/0xb90
>  __alloc_pages_slowpath+0x955/0x1a00
>  __alloc_pages_nodemask+0x52c/0x5c0
>  new_slab+0x374/0x3f0
>  ___slab_alloc.constprop.81+0x47e/0x5a0
>  __slab_alloc.constprop.80+0x32/0x60
>  __kmalloc_track_caller+0x267/0x310
>  __kmalloc_reserve.isra.40+0x29/0x80
>  __alloc_skb+0xee/0x390
>  netlink_dump+0x2e1/0x520
>  __netlink_dump_start+0x201/0x280
>  rtnetlink_rcv_msg+0x6d6/0xa90
>  netlink_rcv_skb+0xb6/0x1d0
>  netlink_unicast+0x298/0x320
>  netlink_sendmsg+0x57e/0x630
>  SYSC_sendto+0x296/0x320
>  do_syscall_64+0x1e5/0xc05
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f204299f54d
> RSP: 002b:00007ffc49024fd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f204299f54d
> RDX: 0000000000000018 RSI: 00007ffc49025010 RDI: 0000000000000012
> RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000012
> R13: 00007ffc49029550 R14: 000055e31307a250 R15: 00007ffc49029530

Both traces are identical, and no fs locks are held, right? Therefore,
doing a GFP_KERNEL allocation should be safe (as long as the PF_MEMALLOC
safeguard prevents infinite recursion), shouldn't it?
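
To illustrate the safeguard I am referring to, here is a minimal userspace
sketch (not kernel code; the flag value and helper names are made up) of how
the PF_MEMALLOC check in __alloc_pages_slowpath() keeps direct reclaim from
recursing into itself:

----------
#include <stdbool.h>
#include <stdio.h>

/* made-up stand-ins for the real task flag and allocator entry points */
#define PF_MEMALLOC 0x1u

static unsigned int current_flags;

static bool alloc_page(void);

/* models direct reclaim: reclaim itself may allocate (e.g. a shrinker) */
static void direct_reclaim(void)
{
	current_flags |= PF_MEMALLOC;
	alloc_page();			/* nested allocation from reclaim */
	current_flags &= ~PF_MEMALLOC;
}

/* models __alloc_pages_slowpath(): no direct reclaim under PF_MEMALLOC */
static bool alloc_page(void)
{
	if (current_flags & PF_MEMALLOC) {
		printf("PF_MEMALLOC set -> skip reclaim, no recursion\n");
		return false;		/* the "goto nopage" case */
	}
	printf("entering direct reclaim\n");
	direct_reclaim();
	return true;
}

int main(void)
{
	alloc_page();
	return 0;
}
----------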

Then, I think that "git bisect" should reach commit d92a8cfcb37ecd13
("locking/lockdep: Rework FS_RECLAIM annotation").

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-28  1:16       ` Tetsuo Handa
@ 2018-01-28  4:25         ` Tetsuo Handa
  -1 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-01-28  4:25 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones, Peter Zijlstra
  Cc: Linux Kernel, linux-mm, Network Development

On 2018/01/28 10:16, Tetsuo Handa wrote:
> Linus Torvalds wrote:
>> On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones <davej@codemonkey.org.uk> wrote:
>>> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>>>  > Just triggered this on a server I was rsync'ing to.
>>>
>>> Actually, I can trigger this really easily, even with an rsync from one
>>> disk to another.  Though that also smells a little like networking in
>>> the traces. Maybe netdev has ideas.
>>
>> Is this new to 4.15? Or is it just that you're testing something new?
>>
>> If it's new and easy to repro, can you just bisect it? And if it isn't
>> new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
>> ok)?
>>
>> Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
>> but it was rewritten for 4.14.. I'm wondering if that remodeling ended
>> up triggering something.
> 
> --- linux-4.13.16/mm/page_alloc.c
> +++ linux-4.14.15/mm/page_alloc.c

Oops. This output was inverted.

> @@ -3527,53 +3519,12 @@
>  			return true;
>  	}
>  	return false;
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -#ifdef CONFIG_LOCKDEP
> -struct lockdep_map __fs_reclaim_map =
> -	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> -
> -static bool __need_fs_reclaim(gfp_t gfp_mask)
> -{
> -	gfp_mask = current_gfp_context(gfp_mask);
> -
> -	/* no reclaim without waiting on it */
> -	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> -		return false;
> -
> -	/* this guy won't enter reclaim */
> -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> -		return false;

Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | __GFP_NOWARN
to gfp_mask, __need_fs_reclaim() is failing to return false here.

But why check __GFP_NOMEMALLOC here? __alloc_pages_slowpath() skips direct
reclaim if !(gfp_mask & __GFP_DIRECT_RECLAIM) or (current->flags & PF_MEMALLOC),
doesn't it?

----------
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
(...snipped...)
        /* Caller is not willing to reclaim, we can't balance anything */
        if (!can_direct_reclaim)
                goto nopage;

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;
(...snipped...)
}
----------
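
To make the effect of that __GFP_NOMEMALLOC test concrete, here is a
standalone sketch (bit values are made up and other tests such as
__GFP_NOLOCKDEP are omitted; the logic follows the 4.14+ __need_fs_reclaim()
shown in the diff above), evaluated for a mask that carries __GFP_NOMEMALLOC
the way the skb allocation in the trace does:

----------
#include <stdbool.h>
#include <stdio.h>

/* made-up bit values; only the logic of the check matters */
#define __GFP_DIRECT_RECLAIM	0x1u
#define __GFP_FS		0x2u
#define __GFP_NOMEMALLOC	0x4u
#define PF_MEMALLOC		0x8u

/* simplified __need_fs_reclaim(), as in the 4.14+ code quoted earlier */
static bool need_fs_reclaim(unsigned int task_flags, unsigned int gfp)
{
	if (!(gfp & __GFP_DIRECT_RECLAIM))
		return false;
	/* "this guy won't enter reclaim" */
	if ((task_flags & PF_MEMALLOC) && !(gfp & __GFP_NOMEMALLOC))
		return false;
	if (!(gfp & __GFP_FS))
		return false;
	return true;
}

int main(void)
{
	/* GFP_KERNEL-like mask plus the __GFP_NOMEMALLOC added by __alloc_skb() */
	unsigned int gfp = __GFP_DIRECT_RECLAIM | __GFP_FS | __GFP_NOMEMALLOC;

	/* outside reclaim: annotating the allocation is expected */
	printf("PF_MEMALLOC clear: %d\n", need_fs_reclaim(0, gfp));

	/* inside direct reclaim: PF_MEMALLOC is set, but __GFP_NOMEMALLOC
	 * defeats the bail-out, so the fs_reclaim annotation is taken again
	 * and lockdep reports the recursion shown above */
	printf("PF_MEMALLOC set:   %d\n", need_fs_reclaim(PF_MEMALLOC, gfp));
	return 0;
}
----------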

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-28  4:25         ` Tetsuo Handa
@ 2018-01-28  5:55           ` Tetsuo Handa
  -1 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-01-28  5:55 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones, Peter Zijlstra, Nick Piggin
  Cc: Linux Kernel, linux-mm, Network Development

Dave, would you try below patch?



>From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 28 Jan 2018 14:17:14 +0900
Subject: [PATCH] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  ============================================
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  --------------------------------------------
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(fs_reclaim);
    lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

Since no fs locks are held, doing GFP_KERNEL allocation should be safe
as long as there is PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
          goto nopage;

) which prevents infinite recursion.

This warning seems to be caused by commit d92a8cfcb37ecd13
("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
location of

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
          return false;

check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
failing to return false despite PF_MEMALLOC context (and resulted in
lockdep warning).

Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
__GFP_NOMEMALLOC might make sense. But since this safeguard was added by
commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
return false.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688..7804b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 		return false;
 
 	/* this guy won't enter reclaim */
-	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (current->flags & PF_MEMALLOC)
 		return false;
 
 	/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1
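
As a quick sanity check of what the hunk above changes, here is a standalone
model (bit values made up, the other __need_fs_reclaim() tests omitted) of
the old and new PF_MEMALLOC test, for an allocation that carries
__GFP_NOMEMALLOC while direct reclaim is running:

----------
#include <stdbool.h>
#include <stdio.h>

/* made-up values; only the boolean logic of the hunk above matters */
#define __GFP_NOMEMALLOC 0x1u
#define PF_MEMALLOC      0x2u

/* old test: skip the annotation only if PF_MEMALLOC is set and the
 * mask does not carry __GFP_NOMEMALLOC */
static bool old_skip(unsigned int task_flags, unsigned int gfp)
{
	return (task_flags & PF_MEMALLOC) && !(gfp & __GFP_NOMEMALLOC);
}

/* new test: skip the annotation whenever PF_MEMALLOC is set */
static bool new_skip(unsigned int task_flags, unsigned int gfp)
{
	return task_flags & PF_MEMALLOC;
}

int main(void)
{
	/* the situation from the trace: inside direct reclaim (PF_MEMALLOC),
	 * allocating with a mask that includes __GFP_NOMEMALLOC */
	unsigned int tf = PF_MEMALLOC, gfp = __GFP_NOMEMALLOC;

	printf("old check skips fs_reclaim annotation: %d\n", old_skip(tf, gfp));
	printf("new check skips fs_reclaim annotation: %d\n", new_skip(tf, gfp));
	return 0;
}
----------

The old check does not skip the annotation in that situation, the new one
does, so the in-reclaim acquisition that lockdep was complaining about is no
longer annotated.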

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-28  5:55           ` Tetsuo Handa
@ 2018-01-29  2:43             ` Dave Jones
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Jones @ 2018-01-29  2:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Linus Torvalds, Peter Zijlstra, Nick Piggin, Linux Kernel,
	linux-mm, Network Development

On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
 > Dave, would you try below patch?
 > 
 > >From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001
 > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
 > Date: Sun, 28 Jan 2018 14:17:14 +0900
 > Subject: [PATCH] lockdep: Fix fs_reclaim warning.


Seems to suppress the warning for me.

Tested-by: Dave Jones <davej@codemonkey.org.uk>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-28  5:55           ` Tetsuo Handa
@ 2018-01-29 10:27             ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-01-29 10:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Linus Torvalds, Dave Jones, Nick Piggin, Linux Kernel, linux-mm,
	Network Development, mhocko

On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> This warning seems to be caused by commit d92a8cfcb37ecd13
> ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> location of
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>           return false;
> 
> check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)").

I'm not entirely sure I get what you mean here. How did I move it? It was
part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
mark the lock as held.

The new code has it in fs_reclaim_acquire/release to the same effect, if
__GFP_NOMEMALLOC, we'll not acquire/release the lock.


> Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> failing to return false despite PF_MEMALLOC context (and resulted in
> lockdep warning).

But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
That's what the name says.

> Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> return false.

This does not in fact explain what's going on; it just points to
'random' patches.

Are you talking about this:

+       /* Avoid recursion of direct reclaim */
+       if (p->flags & PF_MEMALLOC)
+               goto nopage;

bit?

> Reported-by: Dave Jones <davej@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c9688..7804b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>  		return false;
>  
>  	/* this guy won't enter reclaim */
> -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> +	if (current->flags & PF_MEMALLOC)
>  		return false;

I'm _really_ uncomfortable doing that. Esp. without a solid explanation
of how this really can't possibly lead to trouble. Which the above semi
incoherent rambling is not.

Your backtrace shows the btrfs shrinker doing an allocation, that's the
exact kind of thing we need to be extremely careful with.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-29 10:27             ` Peter Zijlstra
@ 2018-01-29 11:47               ` Tetsuo Handa
  0 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-01-29 11:47 UTC (permalink / raw)
  To: peterz; +Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko

Peter Zijlstra wrote:
> On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > This warning seems to be caused by commit d92a8cfcb37ecd13
> > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > location of
> > 
> >   /* this guy won't enter reclaim */
> >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> >           return false;
> > 
> > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > (__GFP_NOFS)").
> 
> I'm not entirely sure I get what you mean here. How did I move it? It was
> part of lockdep_trace_alloc(): if __GFP_NOMEMALLOC was set, it would not
> mark the lock as held.

d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
fs_reclaim_acquire(), and removed current->lockdep_recursion handling.

----------
# git show d92a8cfcb37ecd13 | grep recursion
-# define INIT_LOCKDEP                          .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,
+# define INIT_LOCKDEP                          .lockdep_recursion = 0,
        unsigned int                    lockdep_recursion;
-       if (unlikely(current->lockdep_recursion))
-       current->lockdep_recursion = 1;
-       current->lockdep_recursion = 0;
-        * context checking code. This tests GFP_FS recursion (a lock taken
----------

> 
> The new code has it in fs_reclaim_acquire/release to the same effect: if
> __GFP_NOMEMALLOC is set, we'll not acquire/release the lock.

Excuse me, but I don't quite follow.
We currently do acquire/release __fs_reclaim_map even if __GFP_NOMEMALLOC is set.

----------
+static bool __need_fs_reclaim(gfp_t gfp_mask)
+{
(...snipped...)
+       /* this guy won't enter reclaim */
+       if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+               return false;
(...snipped...)
+}
----------

> 
> 
> > Since __kmalloc_reserve() from __alloc_skb() adds
> > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > failing to return false despite PF_MEMALLOC context (and resulted in
> > lockdep warning).
> 
> But that's correct, right? __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> That's what the name says.

__GFP_NOMEMALLOC negates PF_MEMALLOC with regard to which watermark the
allocation request should use.

----------
static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
        if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                return 0;
        if (gfp_mask & __GFP_MEMALLOC)
                return ALLOC_NO_WATERMARKS;
        if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                return ALLOC_NO_WATERMARKS;
        if (!in_interrupt()) {
                if (current->flags & PF_MEMALLOC)
                        return ALLOC_NO_WATERMARKS;
                else if (oom_reserves_allowed(current))
                        return ALLOC_OOM;
        }

        return 0;
}
----------

But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

----------
        /* Attempt with potentially adjusted zonelist and alloc_flags */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        /* Caller is not willing to reclaim, we can't balance anything */
        if (!can_direct_reclaim)
                goto nopage;

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;

        /* Try direct compaction and then allocating */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        compact_priority, &compact_result);
        if (page)
                goto got_pg;

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;
----------

Then, how can fs_reclaim contribute to deadlock?

> 
> > Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> > __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> > commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> > allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> > Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> > return false.
> 
> This does not in fact explain what's going on; it just points to
> 'random' patches.
> 
> Are you talking about this:
> 
> +       /* Avoid recursion of direct reclaim */
> +       if (p->flags & PF_MEMALLOC)
> +               goto nopage;
> 
> bit?

Yes.

> 
> > Reported-by: Dave Jones <davej@codemonkey.org.uk>
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Nick Piggin <npiggin@gmail.com>
> > ---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 76c9688..7804b0e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
> >  		return false;
> >  
> >  	/* this guy won't enter reclaim */
> > -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > +	if (current->flags & PF_MEMALLOC)
> >  		return false;
> 
> I'm _really_ uncomfortable doing that, especially without a solid explanation
> of how this really can't possibly lead to trouble. Which the above
> semi-incoherent rambling is not.
> 
> Your backtrace shows the btrfs shrinker doing an allocation; that's the
> exact kind of thing we need to be extremely careful with.
> 

If btrfs were already holding some lock (and thus __GFP_FS were not safe),
that lock would have to be listed in

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

wouldn't it? But sk_lock-AF_INET6 is not an FS lock, and fs_reclaim does not
actually lock anything.
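
For reference, fs_reclaim is only a lockdep annotation object, declared in
mm/page_alloc.c roughly as follows (quoted from memory, details may differ):

----------
#ifdef CONFIG_LOCKDEP
static struct lockdep_map __fs_reclaim_map =
        STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
#endif
----------

lock_map_acquire()/lock_map_release() on it only feed lockdep's dependency
graph; no task ever blocks on it.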

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-29 11:47               ` Tetsuo Handa
@ 2018-01-29 13:55                 ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2018-01-29 13:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko

On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote:
> Peter Zijlstra wrote:
> > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > > This warning seems to be caused by commit d92a8cfcb37ecd13
> > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > > location of
> > > 
> > >   /* this guy won't enter reclaim */
> > >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > >           return false;
> > > 
> > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > > (__GFP_NOFS)").
> > 
> > I'm not entirely sure I get what you mean here. How did I move it? It was
> > part of lockdep_trace_alloc(): if __GFP_NOMEMALLOC was set, it would not
> > mark the lock as held.
> 
> d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
> fs_reclaim_acquire(), and removed current->lockdep_recursion handling.
> 
> ----------
> # git show d92a8cfcb37ecd13 | grep recursion
> -# define INIT_LOCKDEP                          .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,
> +# define INIT_LOCKDEP                          .lockdep_recursion = 0,
>         unsigned int                    lockdep_recursion;
> -       if (unlikely(current->lockdep_recursion))
> -       current->lockdep_recursion = 1;
> -       current->lockdep_recursion = 0;
> -        * context checking code. This tests GFP_FS recursion (a lock taken
> ----------

That should not matter at all. The only case that guard matters for is
lockdep itself calling back into lockdep again, and that is not something
that happens here.
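
For context, the lockdep_recursion handling that went away is nothing more
than the guard below (paraphrased from kernel/locking/lockdep.c, from
memory, so details may differ):

----------
void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
                  int trylock, int read, int check,
                  struct lockdep_map *nest_lock, unsigned long ip)
{
        unsigned long flags;

        /* lockdep re-entered from within lockdep itself: bail out */
        if (unlikely(current->lockdep_recursion))
                return;

        raw_local_irq_save(flags);
        check_flags(flags);

        current->lockdep_recursion = 1;
        __lock_acquire(lock, subclass, trylock, read, check,
                       irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
        current->lockdep_recursion = 0;
        raw_local_irq_restore(flags);
}
----------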

> > The new code has it in fs_reclaim_acquire/release to the same effect: if
> > __GFP_NOMEMALLOC is set, we'll not acquire/release the lock.
> 
> Excuse me, but I don't quite follow.
> We currently do acquire/release __fs_reclaim_map even if __GFP_NOMEMALLOC is set.

Right, I got the case inverted; same difference though. Before we'd do
mark_held_lock(), now we acquire/release under the same conditions.

> > > Since __kmalloc_reserve() from __alloc_skb() adds
> > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > > failing to return false despite PF_MEMALLOC context (and resulted in
> > > lockdep warning).
> > 
> > But that's correct, right? __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> > That's what the name says.
> 
> __GFP_NOMEMALLOC negates PF_MEMALLOC with regard to which watermark the
> allocation request should use.

Right.

> But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

Ah indeed.

> Then, how can fs_reclaim contribute to deadlock?

Not sure it can. But if we're going to allow this, it needs to come with
a clear description of why, not a few clues to a puzzle.

Now, even if it's not strictly a deadlock, there is something to be said
for flagging GFP_FS allocs that lead to nested GFP_FS allocs; do we ever
want to allow that?
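
For reference, the scoped interface we already have for saying "no nested
__GFP_FS allocations in this region" is memalloc_nofs_save()/restore(),
paraphrased below from include/linux/sched/mm.h; the helper at the end is
hypothetical and only illustrates the intended usage:

----------
static inline unsigned int memalloc_nofs_save(void)
{
        unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

        current->flags |= PF_MEMALLOC_NOFS;
        return flags;
}

static inline void memalloc_nofs_restore(unsigned int flags)
{
        current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* hypothetical filesystem helper, for illustration only */
static void fs_do_transaction_work(void)
{
        unsigned int nofs = memalloc_nofs_save();

        /* allocations in here are implicitly treated as GFP_NOFS */

        memalloc_nofs_restore(nofs);
}
----------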

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [4.15-rc9] fs_reclaim lockdep trace
  2018-01-29 13:55                 ` Peter Zijlstra
@ 2018-02-01 11:36                   ` Tetsuo Handa
  0 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-02-01 11:36 UTC (permalink / raw)
  To: peterz
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko,
	linux-btrfs

Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote:
> > Peter Zijlstra wrote:
> > > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > > > This warning seems to be caused by commit d92a8cfcb37ecd13
> > > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > > > location of
> > > > 
> > > >   /* this guy won't enter reclaim */
> > > >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > > >           return false;
> > > > 
> > > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > > > (__GFP_NOFS)").
> > > 
> > > I'm not entirely sure I get what you mean here. How did I move it? It was
> > > part of lockdep_trace_alloc(): if __GFP_NOMEMALLOC was set, it would not
> > > mark the lock as held.
> > 
> > d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
> > fs_reclaim_acquire(), and removed current->lockdep_recursion handling.
> > 
> > ----------
> > # git show d92a8cfcb37ecd13 | grep recursion
> > -# define INIT_LOCKDEP                          .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,
> > +# define INIT_LOCKDEP                          .lockdep_recursion = 0,
> >         unsigned int                    lockdep_recursion;
> > -       if (unlikely(current->lockdep_recursion))
> > -       current->lockdep_recursion = 1;
> > -       current->lockdep_recursion = 0;
> > -        * context checking code. This tests GFP_FS recursion (a lock taken
> > ----------
> 
> That should not matter at all. The only case that guard matters for is
> lockdep itself calling back into lockdep again, and that is not something
> that happens here.
> 
> > > The new code has it in fs_reclaim_acquire/release to the same effect: if
> > > __GFP_NOMEMALLOC is set, we'll not acquire/release the lock.
> > 
> > Excuse me, but I don't quite follow.
> > We currently do acquire/release __fs_reclaim_map even if __GFP_NOMEMALLOC is set.
> 
> Right, I got the case inverted; same difference though. Before we'd do
> mark_held_lock(), now we acquire/release under the same conditions.
> 
> > > > Since __kmalloc_reserve() from __alloc_skb() adds
> > > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > > > failing to return false despite PF_MEMALLOC context (and resulted in
> > > > lockdep warning).
> > > 
> > > But that's correct, right? __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> > > That's what the name says.
> > 
> > __GFP_NOMEMALLOC negates PF_MEMALLOC with regard to which watermark the
> > allocation request should use.
> 
> Right.
> 
> > But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.
> 
> Ah indeed.
> 
> > Then, how can fs_reclaim contribute to deadlock?
> 
> Not sure it can. But if we're going to allow this, it needs to come with
> a clear description of why, not a few clues to a puzzle.
> 

Let's decode Dave's report.

----------
stack backtrace:
CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
Call Trace:
 dump_stack+0xbc/0x13f
 __lock_acquire+0xa09/0x2040
 lock_acquire+0x12e/0x350
 fs_reclaim_acquire.part.102+0x29/0x30
 kmem_cache_alloc+0x3d/0x2c0
 alloc_extent_state+0xa7/0x410
 __clear_extent_bit+0x3ea/0x570
 try_release_extent_mapping+0x21a/0x260
 __btrfs_releasepage+0xb0/0x1c0
 btrfs_releasepage+0x161/0x170
 try_to_release_page+0x162/0x1c0
 shrink_page_list+0x1d5a/0x2fb0
 shrink_inactive_list+0x451/0x940
 shrink_node_memcg.constprop.88+0x4c9/0x5e0
 shrink_node+0x12d/0x260
 try_to_free_pages+0x418/0xaf0
 __alloc_pages_slowpath+0x976/0x1790
 __alloc_pages_nodemask+0x52c/0x5c0
 new_slab+0x374/0x3f0
 ___slab_alloc.constprop.81+0x47e/0x5a0
 __slab_alloc.constprop.80+0x32/0x60
 __kmalloc_track_caller+0x267/0x310
 __kmalloc_reserve.isra.40+0x29/0x80
 __alloc_skb+0xee/0x390
 sk_stream_alloc_skb+0xb8/0x340
----------

struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp, bool force_schedule) {
  skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp) = { // gfp == GFP_KERNEL
    static inline struct sk_buff *alloc_skb_fclone(unsigned int size, gfp_t priority) { // priority == GFP_KERNEL
      return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE) = {
        data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc) = { // gfp_mask == GFP_KERNEL
          obj = kmalloc_node_track_caller(size, flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node) = { // flags == GFP_KERNEL
            __kmalloc_node_track_caller(size, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN, node) = {
              void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, int node, unsigned long caller) { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                ret = slab_alloc_node(s, gfpflags, node, caller) = { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                  static __always_inline void *slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr) { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                    s = slab_pre_alloc_hook(s, gfpflags) = { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                      static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                        fs_reclaim_acquire(flags) = { // flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                          void fs_reclaim_acquire(gfp_t gfp_mask) { // gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                            if (__need_fs_reclaim(gfp_mask)) // true due to gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC
                              lock_map_acquire(&__fs_reclaim_map); // acquires __fs_reclaim_map
                          }
                        }
                      }
                      fs_reclaim_release(flags); // releases __fs_reclaim_map
                    }
                    object = __slab_alloc(s, gfpflags, node, addr, c) = { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                      p = ___slab_alloc(s, gfpflags, node, addr, c) = { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                        freelist = new_slab_objects(s, gfpflags, node, &c) = {
                          page = new_slab(s, flags, node) = { // flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                            return allocate_slab(s, flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node) = {
                              page = alloc_slab_page(s, alloc_gfp, node, oo) = { // alloc_gfp == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                page = alloc_pages(flags, order) { // flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                  return alloc_pages_current(gfp_mask, order) = { //gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                    page = __alloc_pages_nodemask(gfp, order, policy_node(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)) = { // gfp == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                      page = __alloc_pages_slowpath(alloc_mask, order, &ac) = { // alloc_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, &did_some_progress) = { // gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                          *did_some_progress = __perform_reclaim(gfp_mask, order, ac) = { // gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                            noreclaim_flag = memalloc_noreclaim_save(); // Sets PF_MEMALLOC
                                            fs_reclaim_acquire(flags) = { // flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                              void fs_reclaim_acquire(gfp_t gfp_mask) { // gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                if (__need_fs_reclaim(gfp_mask)) // true due to gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC
                                                  lock_map_acquire(&__fs_reclaim_map); // acquires __fs_reclaim_map
                                              }
                                            }
                                            progress = try_to_free_pages(ac->zonelist, order, gfp_mask, ac->nodemask) = {
                                              nr_reclaimed = do_try_to_free_pages(zonelist, &sc) = {
                                                shrink_zones(zonelist, sc) = {
                                                  shrink_node(zone->zone_pgdat, sc) = {
                                                    shrink_node_memcg(pgdat, memcg, sc, &lru_pages) = {
                                                      nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc) = {
                                                        return shrink_inactive_list(nr_to_scan, lruvec, sc, lru) = {
                                                          nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0, &stat, false) = {
                                                            if (!try_to_release_page(page, sc->gfp_mask))
                                                              goto activate_locked = {
                                                                return mapping->a_ops->releasepage(page, gfp_mask) = {
                                                                  static int btrfs_releasepage(struct page *page, gfp_t gfp_flags) { // gfp_flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                    return __btrfs_releasepage(page, gfp_flags) = {
                                                                      ret = try_release_extent_mapping(map, tree, page, gfp_flags) = {
                                                                        return try_release_extent_state(map, tree, page, mask) = { // mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                          ret = clear_extent_bit(tree, start, end, ~(EXTENT_LOCKED | EXTENT_NODATASUM), 0, 0, NULL, mask) = {
                                                                            return __clear_extent_bit(tree, start, end, bits, wake, delete, cached, mask, NULL) = {
                                                                              prealloc = alloc_extent_state(mask) = {
                                                                                state = kmem_cache_alloc(extent_state_cache, mask) = {
                                                                                  void *ret = slab_alloc(s, gfpflags, _RET_IP_) = { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                                    return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr) = {
                                                                                      s = slab_pre_alloc_hook(s, gfpflags) = {
                                                                                        static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) { // gfpflags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                                          fs_reclaim_acquire(flags) = { // flags == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                                            void fs_reclaim_acquire(gfp_t gfp_mask) { // gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN
                                                                                              if (__need_fs_reclaim(gfp_mask)) // true due to gfp_mask == GFP_KERNEL | __GFP_NOMEMALLOC despite PF_MEMALLOC
                                                                                                lock_map_acquire(&__fs_reclaim_map); // acquires __fs_reclaim_map nestedly and lockdep complains
                                                                                            }
                                                                                          }
                                                                                        }
                                                                                        fs_reclaim_release(flags); // releases __fs_reclaim_map
                                                                                      }
                                                                                    }
                                                                                  }
                                                                                }
                                                                              }
                                                                            }
                                                                          }
                                                                        }
                                                                      }
                                                                    }
                                                                  }
                                                                }
                                                              }
                                                          }
                                                        }
                                                      }
                                                    }
                                                  }
                                                }
                                              }
                                            }
                                          }
                                        }
                                      }
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                     }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

That is, all of the reclaim code simply propagates the __GFP_NOMEMALLOC that kmalloc_reserve() added,
and even though a memory allocation issued from the try_to_free_pages() path won't do direct reclaim
because PF_MEMALLOC is set, fs_reclaim_acquire() called from slab_pre_alloc_hook() on that path fails
to see that this allocation will not do direct reclaim (because of the

	/* this guy won't enter reclaim */
	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
		return false;

check in __need_fs_reclaim()).

In the end, nested GFP_FS allocations cannot occur (whatever GFP flags are passed),
because such an allocation will not do direct reclaim while PF_MEMALLOC is set.

> Now, even if it's not strictly a deadlock, there is something to be said
> for flagging GFP_FS allocs that lead to nested GFP_FS allocs; do we ever
> want to allow that?

Since PF_MEMALLOC negates __GFP_DIRECT_RECLAIM, propagating the unmodified GFP flags
(as above) is safe as long as the dependency stays within the current thread.
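
A minimal stand-alone model of that check (the flag values below are
arbitrary, not the kernel's) shows what the proposed one-liner changes for
the fs_reclaim_acquire() calls decoded above, which run with PF_MEMALLOC set
and gfp == GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN:

----------
/* Model of __need_fs_reclaim(); compile and run as a normal userspace program. */
#include <stdbool.h>
#include <stdio.h>

#define __GFP_DIRECT_RECLAIM    0x1u
#define __GFP_FS                0x2u
#define __GFP_NOMEMALLOC        0x4u
#define PF_MEMALLOC             0x1u

static bool need_fs_reclaim(unsigned int gfp, unsigned int pflags, bool patched)
{
        if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS))
                return false;

        /* the "this guy won't enter reclaim" check, before/after the patch */
        if (patched) {
                if (pflags & PF_MEMALLOC)
                        return false;
        } else {
                if ((pflags & PF_MEMALLOC) && !(gfp & __GFP_NOMEMALLOC))
                        return false;
        }
        return true;
}

int main(void)
{
        const unsigned int gfp = __GFP_DIRECT_RECLAIM | __GFP_FS | __GFP_NOMEMALLOC;

        /* unpatched: the annotation is taken again under PF_MEMALLOC -> warning */
        printf("unpatched: %d\n", need_fs_reclaim(gfp, PF_MEMALLOC, false));
        /* patched: PF_MEMALLOC alone is enough to skip the annotation */
        printf("patched:   %d\n", need_fs_reclaim(gfp, PF_MEMALLOC, true));
        return 0;
}
----------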

So, how to fix this?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2] lockdep: Fix fs_reclaim warning.
  2018-02-01 11:36                   ` Tetsuo Handa
@ 2018-02-08 11:43                     ` Tetsuo Handa
  0 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-02-08 11:43 UTC (permalink / raw)
  To: peterz
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko,
	linux-btrfs

From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 8 Feb 2018 10:35:35 +0900
Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  ============================================
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  --------------------------------------------
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(fs_reclaim);
    lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
lockdep_clear_current_reclaim_state() in __perform_reclaim() and
lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and the whole reclaim path simply
propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
tries to grab the 'fake' lock again while __perform_reclaim() already
holds the 'fake' lock.

The

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
          return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because a PF_MEMALLOC thread won't
enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
("page allocator: calculate the alloc_flags for allocation only once")
added the PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
          goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check and
allowing __need_fs_reclaim() to return false.

Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ce..19fb76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 		return false;
 
 	/* this guy won't enter reclaim */
-	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (current->flags & PF_MEMALLOC)
 		return false;
 
 	/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1
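
The effect of the one-line change above can be modeled with a small standalone sketch
(the flag values and the task struct below are illustrative stand-ins, not the real
kernel definitions; only the two predicates mirror the before/after logic of
__need_fs_reclaim()):

  #include <stdbool.h>
  #include <stdio.h>

  #define PF_MEMALLOC       0x1u  /* illustrative value only */
  #define __GFP_FS          0x2u  /* illustrative value only */
  #define __GFP_NOMEMALLOC  0x4u  /* illustrative value only */

  struct task { unsigned int flags; };

  /* Old test: a PF_MEMALLOC thread is still annotated whenever the caller
   * passed __GFP_NOMEMALLOC (as __kmalloc_reserve() always does). */
  static bool need_fs_reclaim_old(const struct task *cur, unsigned int gfp_mask)
  {
          if ((cur->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                  return false;
          return gfp_mask & __GFP_FS;
  }

  /* Patched test: PF_MEMALLOC alone skips the annotation, matching the
   * direct-reclaim recursion safeguard in __alloc_pages_slowpath(). */
  static bool need_fs_reclaim_new(const struct task *cur, unsigned int gfp_mask)
  {
          if (cur->flags & PF_MEMALLOC)
                  return false;
          return gfp_mask & __GFP_FS;
  }

  int main(void)
  {
          struct task reclaiming = { .flags = PF_MEMALLOC };
          /* GFP_FS-capable mask with __GFP_NOMEMALLOC propagated from the
           * original skb allocation, as seen on the btrfs_releasepage() path. */
          unsigned int gfp = __GFP_FS | __GFP_NOMEMALLOC;

          printf("old test annotates again: %d (recursive fs_reclaim report)\n",
                 need_fs_reclaim_old(&reclaiming, gfp));
          printf("new test annotates again: %d\n",
                 need_fs_reclaim_new(&reclaiming, gfp));
          return 0;
  }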

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2] lockdep: Fix fs_reclaim warning.
  2018-02-08 11:43                     ` Tetsuo Handa
@ 2018-02-12 12:08                       ` Nikolay Borisov
  -1 siblings, 0 replies; 35+ messages in thread
From: Nikolay Borisov @ 2018-02-12 12:08 UTC (permalink / raw)
  To: Tetsuo Handa, peterz
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko,
	linux-btrfs



On  8.02.2018 13:43, Tetsuo Handa wrote:
> From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   ============================================
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   --------------------------------------------
>   sshd/24800 is trying to acquire lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>    Possible unsafe locking scenario:
> 
>          CPU0
>          ----
>     lock(fs_reclaim);
>     lock(fs_reclaim);
> 
>    *** DEADLOCK ***
> 
>    May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>    #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
>    #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>    dump_stack+0xbc/0x13f
>    __lock_acquire+0xa09/0x2040
>    lock_acquire+0x12e/0x350
>    fs_reclaim_acquire.part.102+0x29/0x30
>    kmem_cache_alloc+0x3d/0x2c0
>    alloc_extent_state+0xa7/0x410
>    __clear_extent_bit+0x3ea/0x570
>    try_release_extent_mapping+0x21a/0x260
>    __btrfs_releasepage+0xb0/0x1c0
>    btrfs_releasepage+0x161/0x170
>    try_to_release_page+0x162/0x1c0
>    shrink_page_list+0x1d5a/0x2fb0
>    shrink_inactive_list+0x451/0x940
>    shrink_node_memcg.constprop.88+0x4c9/0x5e0
>    shrink_node+0x12d/0x260
>    try_to_free_pages+0x418/0xaf0
>    __alloc_pages_slowpath+0x976/0x1790
>    __alloc_pages_nodemask+0x52c/0x5c0
>    new_slab+0x374/0x3f0
>    ___slab_alloc.constprop.81+0x47e/0x5a0
>    __slab_alloc.constprop.80+0x32/0x60
>    __kmalloc_track_caller+0x267/0x310
>    __kmalloc_reserve.isra.40+0x29/0x80
>    __alloc_skb+0xee/0x390
>    sk_stream_alloc_skb+0xb8/0x340
>    tcp_sendmsg_locked+0x8e6/0x1d30
>    tcp_sendmsg+0x27/0x40
>    inet_sendmsg+0xd0/0x310
>    sock_write_iter+0x17a/0x240
>    __vfs_write+0x2ab/0x380
>    vfs_write+0xfb/0x260
>    SyS_write+0xb6/0x140
>    do_syscall_64+0x1e5/0xc05
>    entry_SYSCALL64_slow_path+0x25/0x25
> 

I think I've hit another incarnation of that one. The call stack is:
http://paste.opensuse.org/3f22d013

The cleaned-up call stack of all the ? entries looks like:

__lock_acquire+0x2d8a/0x4b70
lock_acquire+0x110/0x330
kmem_cache_alloc+0x29/0x2c0
__clear_extent_bit+0x488/0x800
try_release_extent_mapping+0x288/0x3c0
__btrfs_releasepage+0x6c/0x140
shrink_page_list+0x227e/0x3110
shrink_inactive_list+0x414/0xdb0
shrink_node_memcg+0x7c8/0x1250
shrink_node+0x2ae/0xb50
do_try_to_free_pages+0x2b1/0xe20
try_to_free_pages+0x205/0x570
 __alloc_pages_nodemask+0xb91/0x2160
new_slab+0x27a/0x4e0
___slab_alloc+0x355/0x610
 __slab_alloc+0x4c/0xa0
kmem_cache_alloc+0x22d/0x2c0
mempool_alloc+0xe1/0x280
bio_alloc_bioset+0x1d7/0x830
ext4_mpage_readpages+0x99f/0x1000 <-
__do_page_cache_readahead+0x4be/0x840
filemap_fault+0x8c8/0xfc0
ext4_filemap_fault+0x7d/0xb0
__do_fault+0x7a/0x150
__handle_mm_fault+0x1542/0x29d0
__do_page_fault+0x557/0xa30
async_page_fault+0x4c/0x60


There is no fs stacking going on here, and this is on 4.15-rc9.


> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>           return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>           goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> __need_fs_reclaim() to return false.
> 
> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81e18ce..19fb76b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>  		return false;
>  
>  	/* this guy won't enter reclaim */
> -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> +	if (current->flags & PF_MEMALLOC)
>  		return false;
>  
>  	/* We're only interested __GFP_FS allocations for now */
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2] lockdep: Fix fs_reclaim warning.
  2018-02-12 12:08                       ` Nikolay Borisov
@ 2018-02-12 13:46                         ` Tetsuo Handa
  -1 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-02-12 13:46 UTC (permalink / raw)
  To: nborisov, peterz
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko,
	linux-btrfs

Nikolay Borisov wrote:
> I think I've hit another incarnation of that one. The call stack is:
> http://paste.opensuse.org/3f22d013
> 
> The cleaned-up call stack of all the ? entries looks like:
> 
> __lock_acquire+0x2d8a/0x4b70
> lock_acquire+0x110/0x330
> kmem_cache_alloc+0x29/0x2c0
> __clear_extent_bit+0x488/0x800
> try_release_extent_mapping+0x288/0x3c0
> __btrfs_releasepage+0x6c/0x140
> shrink_page_list+0x227e/0x3110
> shrink_inactive_list+0x414/0xdb0
> shrink_node_memcg+0x7c8/0x1250
> shrink_node+0x2ae/0xb50
> do_try_to_free_pages+0x2b1/0xe20
> try_to_free_pages+0x205/0x570
>  __alloc_pages_nodemask+0xb91/0x2160
> new_slab+0x27a/0x4e0
> ___slab_alloc+0x355/0x610
>  __slab_alloc+0x4c/0xa0
> kmem_cache_alloc+0x22d/0x2c0
> mempool_alloc+0xe1/0x280

Yes, that is because mempool_alloc() adds __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask
(see the short sketch at the end of this message).

	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
	gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
	gfp_mask |= __GFP_NOWARN;       /* failures are OK */

> bio_alloc_bioset+0x1d7/0x830
> ext4_mpage_readpages+0x99f/0x1000 <-
> __do_page_cache_readahead+0x4be/0x840
> filemap_fault+0x8c8/0xfc0
> ext4_filemap_fault+0x7d/0xb0
> __do_fault+0x7a/0x150
> __handle_mm_fault+0x1542/0x29d0
> __do_page_fault+0x557/0xa30
> async_page_fault+0x4c/0x60
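
With the old __need_fs_reclaim() test, a gfp_mask built this way makes a PF_MEMALLOC
thread annotate fs_reclaim again, exactly as in the skb case. Here is a tiny standalone
sketch of just that flag adjustment (the flag values are illustrative stand-ins, not the
real kernel definitions):

  #include <stdio.h>

  #define __GFP_FS          0x1u  /* illustrative value only */
  #define __GFP_NOMEMALLOC  0x2u  /* illustrative value only */
  #define __GFP_NORETRY     0x4u  /* illustrative value only */
  #define __GFP_NOWARN      0x8u  /* illustrative value only */

  /* Mirrors the adjustment mempool_alloc() applies to its caller's mask. */
  static unsigned int mempool_adjust(unsigned int gfp_mask)
  {
          gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
          gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
          gfp_mask |= __GFP_NOWARN;       /* failures are OK */
          return gfp_mask;
  }

  int main(void)
  {
          /* A GFP_FS-capable caller, e.g. the bio allocation in the trace above. */
          unsigned int gfp = mempool_adjust(__GFP_FS);

          printf("__GFP_NOMEMALLOC propagated into nested allocations: %s\n",
                 (gfp & __GFP_NOMEMALLOC) ? "yes" : "no");
          return 0;
  }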

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2] lockdep: Fix fs_reclaim warning.
  2018-02-08 11:43                     ` Tetsuo Handa
@ 2018-02-19 11:52                       ` Tetsuo Handa
  -1 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-02-19 11:52 UTC (permalink / raw)
  To: peterz
  Cc: torvalds, davej, npiggin, linux-kernel, linux-mm, netdev, mhocko,
	linux-btrfs

Peter, are you OK with this patch?

Tetsuo Handa wrote:
> From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   ============================================
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   --------------------------------------------
>   sshd/24800 is trying to acquire lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>    Possible unsafe locking scenario:
> 
>          CPU0
>          ----
>     lock(fs_reclaim);
>     lock(fs_reclaim);
> 
>    *** DEADLOCK ***
> 
>    May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>    #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
>    #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>    dump_stack+0xbc/0x13f
>    __lock_acquire+0xa09/0x2040
>    lock_acquire+0x12e/0x350
>    fs_reclaim_acquire.part.102+0x29/0x30
>    kmem_cache_alloc+0x3d/0x2c0
>    alloc_extent_state+0xa7/0x410
>    __clear_extent_bit+0x3ea/0x570
>    try_release_extent_mapping+0x21a/0x260
>    __btrfs_releasepage+0xb0/0x1c0
>    btrfs_releasepage+0x161/0x170
>    try_to_release_page+0x162/0x1c0
>    shrink_page_list+0x1d5a/0x2fb0
>    shrink_inactive_list+0x451/0x940
>    shrink_node_memcg.constprop.88+0x4c9/0x5e0
>    shrink_node+0x12d/0x260
>    try_to_free_pages+0x418/0xaf0
>    __alloc_pages_slowpath+0x976/0x1790
>    __alloc_pages_nodemask+0x52c/0x5c0
>    new_slab+0x374/0x3f0
>    ___slab_alloc.constprop.81+0x47e/0x5a0
>    __slab_alloc.constprop.80+0x32/0x60
>    __kmalloc_track_caller+0x267/0x310
>    __kmalloc_reserve.isra.40+0x29/0x80
>    __alloc_skb+0xee/0x390
>    sk_stream_alloc_skb+0xb8/0x340
>    tcp_sendmsg_locked+0x8e6/0x1d30
>    tcp_sendmsg+0x27/0x40
>    inet_sendmsg+0xd0/0x310
>    sock_write_iter+0x17a/0x240
>    __vfs_write+0x2ab/0x380
>    vfs_write+0xfb/0x260
>    SyS_write+0xb6/0x140
>    do_syscall_64+0x1e5/0xc05
>    entry_SYSCALL64_slow_path+0x25/0x25
> 
> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>           return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>           goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> __need_fs_reclaim() to return false.
> 
> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81e18ce..19fb76b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>  		return false;
>  
>  	/* this guy won't enter reclaim */
> -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> +	if (current->flags & PF_MEMALLOC)
>  		return false;
>  
>  	/* We're only interested __GFP_FS allocations for now */
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 (RESEND)] lockdep: Fix fs_reclaim warning.
  2018-02-08 11:43                     ` Tetsuo Handa
                                       ` (2 preceding siblings ...)
  (?)
@ 2018-02-27 21:50                     ` Tetsuo Handa
  2018-03-07 21:44                       ` Tetsuo Handa
  2018-03-07 23:33                       ` Andrew Morton
  -1 siblings, 2 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-02-27 21:50 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel

From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 8 Feb 2018 10:35:35 +0900
Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  ============================================
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  --------------------------------------------
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(fs_reclaim);
    lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
lockdep_clear_current_reclaim_state() in __perform_reclaim() and
lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and the whole reclaim path simply
propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
tries to grab the 'fake' lock again while __perform_reclaim() already
holds the 'fake' lock.

The

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
          return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because a PF_MEMALLOC thread won't
enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
("page allocator: calculate the alloc_flags for allocation only once")
added the PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
          goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check and
allowing __need_fs_reclaim() to return false.

Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ce..19fb76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 		return false;
 
 	/* this guy won't enter reclaim */
-	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (current->flags & PF_MEMALLOC)
 		return false;
 
 	/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 (RESEND)] lockdep: Fix fs_reclaim warning.
  2018-02-27 21:50                     ` [PATCH v2 (RESEND)] " Tetsuo Handa
@ 2018-03-07 21:44                       ` Tetsuo Handa
  2018-03-07 23:33                       ` Andrew Morton
  1 sibling, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-03-07 21:44 UTC (permalink / raw)
  To: akpm; +Cc: peterz, mingo, linux-kernel

I assumed this patch would go to mainline via the locking tree, but neither
Peter nor Ingo is responding. Andrew, can you pick up this patch?

Tetsuo Handa wrote:
> From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   ============================================
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   --------------------------------------------
>   sshd/24800 is trying to acquire lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>    (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>    Possible unsafe locking scenario:
> 
>          CPU0
>          ----
>     lock(fs_reclaim);
>     lock(fs_reclaim);
> 
>    *** DEADLOCK ***
> 
>    May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>    #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
>    #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>    dump_stack+0xbc/0x13f
>    __lock_acquire+0xa09/0x2040
>    lock_acquire+0x12e/0x350
>    fs_reclaim_acquire.part.102+0x29/0x30
>    kmem_cache_alloc+0x3d/0x2c0
>    alloc_extent_state+0xa7/0x410
>    __clear_extent_bit+0x3ea/0x570
>    try_release_extent_mapping+0x21a/0x260
>    __btrfs_releasepage+0xb0/0x1c0
>    btrfs_releasepage+0x161/0x170
>    try_to_release_page+0x162/0x1c0
>    shrink_page_list+0x1d5a/0x2fb0
>    shrink_inactive_list+0x451/0x940
>    shrink_node_memcg.constprop.88+0x4c9/0x5e0
>    shrink_node+0x12d/0x260
>    try_to_free_pages+0x418/0xaf0
>    __alloc_pages_slowpath+0x976/0x1790
>    __alloc_pages_nodemask+0x52c/0x5c0
>    new_slab+0x374/0x3f0
>    ___slab_alloc.constprop.81+0x47e/0x5a0
>    __slab_alloc.constprop.80+0x32/0x60
>    __kmalloc_track_caller+0x267/0x310
>    __kmalloc_reserve.isra.40+0x29/0x80
>    __alloc_skb+0xee/0x390
>    sk_stream_alloc_skb+0xb8/0x340
>    tcp_sendmsg_locked+0x8e6/0x1d30
>    tcp_sendmsg+0x27/0x40
>    inet_sendmsg+0xd0/0x310
>    sock_write_iter+0x17a/0x240
>    __vfs_write+0x2ab/0x380
>    vfs_write+0xfb/0x260
>    SyS_write+0xb6/0x140
>    do_syscall_64+0x1e5/0xc05
>    entry_SYSCALL64_slow_path+0x25/0x25
> 
> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>           return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>           goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> __need_fs_reclaim() to return false.
> 
> Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81e18ce..19fb76b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>  		return false;
>  
>  	/* this guy won't enter reclaim */
> -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> +	if (current->flags & PF_MEMALLOC)
>  		return false;
>  
>  	/* We're only interested __GFP_FS allocations for now */
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 (RESEND)] lockdep: Fix fs_reclaim warning.
  2018-02-27 21:50                     ` [PATCH v2 (RESEND)] " Tetsuo Handa
  2018-03-07 21:44                       ` Tetsuo Handa
@ 2018-03-07 23:33                       ` Andrew Morton
  2018-03-08 15:30                         ` Tetsuo Handa
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2018-03-07 23:33 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: peterz, mingo, linux-kernel

On Wed, 28 Feb 2018 06:50:02 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> 
> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.

That's quite an audit trail.

Shouldn't we be doing a cc:stable here?  If so, which patch do we
identify as being fixed, with "Fixes:"?  d92a8cfcb37ecd13, I assume?

I'd never even noticed fs_reclaim_acquire() and friends before.  I do
wish they had "lockdep" in their names, and a comment to explain what
they do and why they exist.
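
For context, fs_reclaim_acquire()/fs_reclaim_release() live in mm/page_alloc.c and are
only a few lines around v4.15. Roughly (a simplified paraphrase, not a verbatim copy of
the kernel source), they gate a global 'fake' lockdep map on __need_fs_reclaim():

  static struct lockdep_map __fs_reclaim_map =
          STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

  void fs_reclaim_acquire(gfp_t gfp_mask)
  {
          if (__need_fs_reclaim(gfp_mask))
                  lock_map_acquire(&__fs_reclaim_map);
  }

  void fs_reclaim_release(gfp_t gfp_mask)
  {
          if (__need_fs_reclaim(gfp_mask))
                  lock_map_release(&__fs_reclaim_map);
  }

__perform_reclaim() brackets direct reclaim with the same acquire/release pair, which is
how the nested annotation discussed above turns into a recursive-lock report.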

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 (RESEND)] lockdep: Fix fs_reclaim warning.
  2018-03-07 23:33                       ` Andrew Morton
@ 2018-03-08 15:30                         ` Tetsuo Handa
  0 siblings, 0 replies; 35+ messages in thread
From: Tetsuo Handa @ 2018-03-08 15:30 UTC (permalink / raw)
  To: akpm; +Cc: peterz, mingo, linux-kernel

Andrew Morton wrote:
> On Wed, 28 Feb 2018 06:50:02 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > 
> > This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> > FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> > lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> > lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> > fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> > propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> > is trying to grab the 'fake' lock again when __perform_reclaim() already
> > grabbed the 'fake' lock.
> 
> That's quite an audit trail.
> 
> Shouldn't we be doing a cc:stable here?  If so, which patch do we
> identify as being fixed, with "Fixes:"?  d92a8cfcb37ecd13, I assume?

Yes please, if you think this patch qualifies for backport.

The test has been outdated since v2.6.31, but only v4.14+ seems to trigger this warning.
Thus, I think it is OK to add:

  Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
  Cc: <stable@vger.kernel.org> # v4.14+

> 
> I'd never even noticed fs_reclaim_acquire() and friends before.  I do
> wish they had "lockdep" in their names, and a comment to explain what
> they do and why they exist.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-03-08 15:31 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-24  1:36 [4.15-rc9] fs_reclaim lockdep trace Dave Jones
2018-01-24  1:36 ` Dave Jones
2018-01-27 22:24 ` Dave Jones
2018-01-27 22:24   ` Dave Jones
2018-01-27 22:43   ` Linus Torvalds
2018-01-27 22:43     ` Linus Torvalds
2018-01-28  1:16     ` Tetsuo Handa
2018-01-28  1:16       ` Tetsuo Handa
2018-01-28  4:25       ` Tetsuo Handa
2018-01-28  4:25         ` Tetsuo Handa
2018-01-28  5:55         ` Tetsuo Handa
2018-01-28  5:55           ` Tetsuo Handa
2018-01-28  5:55           ` Tetsuo Handa
2018-01-29  2:43           ` Dave Jones
2018-01-29  2:43             ` Dave Jones
2018-01-29 10:27           ` Peter Zijlstra
2018-01-29 10:27             ` Peter Zijlstra
2018-01-29 11:47             ` Tetsuo Handa
2018-01-29 11:47               ` Tetsuo Handa
2018-01-29 13:55               ` Peter Zijlstra
2018-01-29 13:55                 ` Peter Zijlstra
2018-02-01 11:36                 ` Tetsuo Handa
2018-02-01 11:36                   ` Tetsuo Handa
2018-02-08 11:43                   ` [PATCH v2] lockdep: Fix fs_reclaim warning Tetsuo Handa
2018-02-08 11:43                     ` Tetsuo Handa
2018-02-12 12:08                     ` Nikolay Borisov
2018-02-12 12:08                       ` Nikolay Borisov
2018-02-12 13:46                       ` Tetsuo Handa
2018-02-12 13:46                         ` Tetsuo Handa
2018-02-19 11:52                     ` Tetsuo Handa
2018-02-19 11:52                       ` Tetsuo Handa
2018-02-27 21:50                     ` [PATCH v2 (RESEND)] " Tetsuo Handa
2018-03-07 21:44                       ` Tetsuo Handa
2018-03-07 23:33                       ` Andrew Morton
2018-03-08 15:30                         ` Tetsuo Handa
