* lockdep warning: fs_reclaim_acquire vs tcp_sendpage
@ 2022-10-19 7:51 Daniel Wagner
2022-10-19 9:03 ` Sagi Grimberg
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Wagner @ 2022-10-19 7:51 UTC
To: Sagi Grimberg; +Cc: linux-nvme
Hi Sagi,
While working on something else I got the lockdep splat below. As this
is a dirty tree and not the latest and greatest, it might be a false alarm.
I haven't really looked into it yet; this is just to let you know that
there might be something going on.
Cheers,
Daniel
======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc2+ #25 Tainted: G W
------------------------------------------------------
kswapd0/92 is trying to acquire lock:
ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
but task is already holding lock:
ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x11e/0x160
kmem_cache_alloc_node+0x44/0x530
__alloc_skb+0x158/0x230
tcp_send_active_reset+0x7e/0x730
tcp_disconnect+0x1272/0x1ae0
__tcp_close+0x707/0xd90
tcp_close+0x26/0x80
inet_release+0xfa/0x220
sock_release+0x85/0x1a0
nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
kernfs_fop_write_iter+0x356/0x530
vfs_write+0x4e8/0xce0
ksys_write+0xfd/0x1d0
do_syscall_64+0x58/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
__lock_acquire+0x2a0c/0x5690
lock_acquire+0x18e/0x4f0
lock_sock_nested+0x37/0xc0
tcp_sendpage+0x23/0xa0
inet_sendpage+0xad/0x120
kernel_sendpage+0x156/0x440
nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
__blk_mq_try_issue_directly+0x452/0x660
blk_mq_plug_issue_direct.constprop.0+0x207/0x700
blk_mq_flush_plug_list+0x6f5/0xc70
__blk_flush_plug+0x264/0x410
blk_finish_plug+0x4b/0xa0
shrink_lruvec+0x1263/0x1ea0
shrink_node+0x736/0x1a80
balance_pgdat+0x740/0x10d0
kswapd+0x5f2/0xaf0
kthread+0x256/0x2f0
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sk_lock-AF_INET-NVME);
                               lock(fs_reclaim);
  lock(sk_lock-AF_INET-NVME);
*** DEADLOCK ***
3 locks held by kswapd0/92:
#0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
#1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
#2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
stack backtrace:
CPU: 7 PID: 92 Comm: kswapd0 Tainted: G W 6.0.0-rc2+ #25 910779b354c48f37d01f55ab57fbca0c616a47fd
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
check_noncircular+0x26e/0x320
? lock_chain_count+0x20/0x20
? print_circular_bug+0x1e0/0x1e0
? kvm_sched_clock_read+0x14/0x40
? sched_clock_cpu+0x69/0x240
? __bfs+0x317/0x6f0
? usage_match+0x110/0x110
? lockdep_lock+0xbe/0x1c0
? call_rcu_zapped+0xc0/0xc0
__lock_acquire+0x2a0c/0x5690
? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
? lock_chain_count+0x20/0x20
lock_acquire+0x18e/0x4f0
? tcp_sendpage+0x23/0xa0
? lock_downgrade+0x6c0/0x6c0
? __lock_acquire+0xd3f/0x5690
lock_sock_nested+0x37/0xc0
? tcp_sendpage+0x23/0xa0
tcp_sendpage+0x23/0xa0
inet_sendpage+0xad/0x120
kernel_sendpage+0x156/0x440
nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
? lock_downgrade+0x6c0/0x6c0
? lock_release+0x6cd/0xd30
? nvme_tcp_state_change+0x150/0x150 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
? mutex_trylock+0x204/0x330
? nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
? ww_mutex_unlock+0x270/0x270
nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
? kvm_sched_clock_read+0x14/0x40
__blk_mq_try_issue_directly+0x452/0x660
? __blk_mq_get_driver_tag+0x980/0x980
? lock_downgrade+0x6c0/0x6c0
blk_mq_plug_issue_direct.constprop.0+0x207/0x700
? __mem_cgroup_uncharge+0x140/0x140
blk_mq_flush_plug_list+0x6f5/0xc70
? blk_mq_flush_plug_list+0x6b3/0xc70
? blk_mq_insert_requests+0x450/0x450
__blk_flush_plug+0x264/0x410
? memset+0x1f/0x40
? __mem_cgroup_uncharge_list+0x84/0x150
? blk_start_plug_nr_ios+0x280/0x280
blk_finish_plug+0x4b/0xa0
shrink_lruvec+0x1263/0x1ea0
? reclaim_throttle+0x790/0x790
? sched_clock_cpu+0x69/0x240
? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
? lock_is_held_type+0xa9/0x120
? mem_cgroup_iter+0x2b2/0x780
shrink_node+0x736/0x1a80
balance_pgdat+0x740/0x10d0
? shrink_node+0x1a80/0x1a80
? lock_is_held_type+0xa9/0x120
? find_held_lock+0x34/0x120
? lock_is_held_type+0xa9/0x120
? reacquire_held_locks+0x4f0/0x4f0
kswapd+0x5f2/0xaf0
? balance_pgdat+0x10d0/0x10d0
? destroy_sched_domains_rcu+0x60/0x60
? trace_hardirqs_on+0x2d/0x110
? __kthread_parkme+0x83/0x140
? balance_pgdat+0x10d0/0x10d0
kthread+0x256/0x2f0
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19 9:03 UTC
To: Daniel Wagner; +Cc: linux-nvme
> Hi Sagi,
Thanks for reporting.
> While working on something else I got the lockdep splat below. As this
> is a dirty tree and not the latest and greatest, it might be a false alarm.
>
> I haven't really looked into it yet; this is just to let you know that
> there might be something going on.
I haven't seen anything similar to this one yet.
>
> Cheers,
> Daniel
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.0.0-rc2+ #25 Tainted: G W
> ------------------------------------------------------
> kswapd0/92 is trying to acquire lock:
> ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
>
> but task is already holding lock:
> ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (fs_reclaim){+.+.}-{0:0}:
> fs_reclaim_acquire+0x11e/0x160
> kmem_cache_alloc_node+0x44/0x530
> __alloc_skb+0x158/0x230
> tcp_send_active_reset+0x7e/0x730
> tcp_disconnect+0x1272/0x1ae0
Here tcp_disconnect() passes gfp_any() down to alloc_skb(), which
overrides the socket allocation flags.
> __tcp_close+0x707/0xd90
> tcp_close+0x26/0x80
> inet_release+0xfa/0x220
> sock_release+0x85/0x1a0
> nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
> nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
> nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
> kernfs_fop_write_iter+0x356/0x530
> vfs_write+0x4e8/0xce0
> ksys_write+0xfd/0x1d0
> do_syscall_64+0x58/0x80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
> __lock_acquire+0x2a0c/0x5690
> lock_acquire+0x18e/0x4f0
> lock_sock_nested+0x37/0xc0
> tcp_sendpage+0x23/0xa0
> inet_sendpage+0xad/0x120
> kernel_sendpage+0x156/0x440
> nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
> nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
> __blk_mq_try_issue_directly+0x452/0x660
> blk_mq_plug_issue_direct.constprop.0+0x207/0x700
> blk_mq_flush_plug_list+0x6f5/0xc70
> __blk_flush_plug+0x264/0x410
> blk_finish_plug+0x4b/0xa0
> shrink_lruvec+0x1263/0x1ea0
> shrink_node+0x736/0x1a80
> balance_pgdat+0x740/0x10d0
> kswapd+0x5f2/0xaf0
> kthread+0x256/0x2f0
> ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(fs_reclaim);
>                                lock(sk_lock-AF_INET-NVME);
>                                lock(fs_reclaim);
>   lock(sk_lock-AF_INET-NVME);
Indeed. I see the issue.
kswapd is trying to swap out pages, but if someone were to delete
the controller (like in this case), sock_release -> tcp_disconnect
will alloc skb that may need to reclaim pages.
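As a rough illustration of why lockdep complains here, the two orderings can be modelled in plain userspace C. This is only a sketch: the names lock_class, dep, reachable() and acquire_while_holding() are invented for the example and are not kernel APIs, and real lockdep tracks far more state (contexts, usage bits, chains). The delete path takes fs_reclaim while sk_lock is held; the reclaim path then wants sk_lock while fs_reclaim is held, which closes the cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for this model only. */
enum lock_class { FS_RECLAIM, SK_LOCK, NR_CLASSES };

/* dep[a][b] is true once lock b has been taken while lock a was held. */
static bool dep[NR_CLASSES][NR_CLASSES];

static void reset_deps(void)
{
	memset(dep, 0, sizeof(dep));
}

/* Depth-first search: is there a recorded chain from -> ... -> to? */
static bool reachable(enum lock_class from, enum lock_class to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] && reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired was taken while held was held".
 * Returns true if this new ordering closes a cycle with earlier ones.
 */
static bool acquire_while_holding(enum lock_class held, enum lock_class acquired)
{
	bool seen[NR_CLASSES] = { false };
	bool circular;

	assert(held != acquired); /* recursive locking is not modelled */
	circular = reachable(acquired, held, seen);
	dep[held][acquired] = true;
	return circular;
}
```

Calling acquire_while_holding(SK_LOCK, FS_RECLAIM) first (the controller delete path) and then acquire_while_holding(FS_RECLAIM, SK_LOCK) (the reclaim path) makes the second call report a cycle, which corresponds to the CPU0/CPU1 scenario in the splat.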
Two questions, the stack trace suggests that you are not using
nvme-mpath? is that the case?
Given that we fail all inflight requests before we free the socket,
I don't expect this to be truly circular...
I'm assuming that we'll need the below similar to nbd/iscsi:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4f5dcfe5357f..c5bea92560bd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1141,6 +1141,7 @@ static int nvme_tcp_try_send_ddgst(struct
nvme_tcp_request *req)
static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_request *req;
+ unsigned int noreclaim_flag;
int ret = 1;
if (!queue->request) {
@@ -1150,12 +1151,13 @@ static int nvme_tcp_try_send(struct
nvme_tcp_queue *queue)
}
req = queue->request;
+ noreclaim_flag = memalloc_noreclaim_save();
if (req->state == NVME_TCP_SEND_CMD_PDU) {
ret = nvme_tcp_try_send_cmd_pdu(req);
if (ret <= 0)
goto done;
if (!nvme_tcp_has_inline_data(req))
- return ret;
+ goto out;
}
if (req->state == NVME_TCP_SEND_H2C_PDU) {
@@ -1181,6 +1183,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue
*queue)
nvme_tcp_fail_request(queue->request);
nvme_tcp_done_send_req(queue);
}
+out:
+ memalloc_noreclaim_restore(noreclaim_flag);
return ret;
}
--
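The save/restore pair used in the patch above follows a scoped-flag pattern: set a per-task bit, remember whether it was already set, and put the old state back on every exit path. The userspace model below is a sketch under that assumption; task_flags, MODEL_PF_MEMALLOC, noreclaim_save(), noreclaim_restore() and send_model() are stand-ins invented for illustration, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PF_MEMALLOC 0x800u	/* stand-in bit, for this model only */

static unsigned int task_flags;		/* stand-in for current->flags */

/* Set the bit and return its previous state so callers can nest. */
static unsigned int noreclaim_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC;

	task_flags |= MODEL_PF_MEMALLOC;
	return old;
}

/* Put the bit back to whatever it was before the matching save. */
static void noreclaim_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC) | old;
}

static bool in_noreclaim_scope(void)
{
	return (task_flags & MODEL_PF_MEMALLOC) != 0;
}

/*
 * Mirrors the shape of the patched send path: once the flag is saved,
 * every exit goes through the "out" label so the restore always runs.
 */
static int send_model(bool stop_after_pdu)
{
	unsigned int saved = noreclaim_save();
	int ret = 1;

	assert(in_noreclaim_scope());
	if (stop_after_pdu)
		goto out;	/* was a bare "return ret" before the patch */
	ret = 2;		/* pretend the data phase ran as well */
out:
	noreclaim_restore(saved);
	return ret;
}
```

Because the restore writes back the saved bit rather than clearing it unconditionally, an outer scope that already had the flag set keeps it after the inner scope ends.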
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19 9:37 UTC
To: Sagi Grimberg; +Cc: linux-nvme
> > Possible unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(fs_reclaim);
> >                                lock(sk_lock-AF_INET-NVME);
> >                                lock(fs_reclaim);
> >   lock(sk_lock-AF_INET-NVME);
>
> Indeed. I see the issue.
> kswapd is trying to swap out pages, but if someone were to delete
> the controller (like in this case), sock_release -> tcp_disconnect
> will alloc skb that may need to reclaim pages.
>
> Two questions, the stack trace suggests that you are not using
> nvme-mpath? is that the case?
This is with a multipath setup. The fio settings are pushing the limits
of the VM (memory size), hence the kswapd process kicking in.
> Given that we fail all inflight requests before we free the socket,
> I don't expect this to be truly circular...
>
> I'm assuming that we'll need the below similar to nbd/iscsi:
Let me try this.
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19 11:35 UTC
To: Sagi Grimberg; +Cc: linux-nvme
On Wed, Oct 19, 2022 at 11:37:13AM +0200, Daniel Wagner wrote:
> > > Possible unsafe locking scenario:
> > >
> > >        CPU0                    CPU1
> > >        ----                    ----
> > >   lock(fs_reclaim);
> > >                                lock(sk_lock-AF_INET-NVME);
> > >                                lock(fs_reclaim);
> > >   lock(sk_lock-AF_INET-NVME);
> >
> > Indeed. I see the issue.
> > kswapd is trying to swap out pages, but if someone were to delete
> > the controller (like in this case), sock_release -> tcp_disconnect
> > will alloc skb that may need to reclaim pages.
> >
> > Two questions, the stack trace suggests that you are not using
> > nvme-mpath? is that the case?
>
> This is with a multipath setup. The fio settings are pushing the limits
> of the VM (memory size), hence the kswapd process kicking in.
>
> > Given that we fail all inflight requests before we free the socket,
> > I don't expect this to be truly circular...
> >
> > I'm assuming that we'll need the below similar to nbd/iscsi:
>
> Let me try this.
Still able to trigger it, though I have figured out how to
reproduce it:
VM 4M memory, 8 vCPUs
nvme target with at least 2 namespaces
ns 1: fio read/write
ns 2: swap space
1) nvme connect-all
2) nvme disconnect-all
3) nvme connect-all
4) swapon /dev/nvme0n4
5) fio --rw=rw --name=test --filename=/dev/nvme1n1 --size=1G --direct=1 \
--iodepth=32 --blocksize_range=4k-4M --numjobs=32 \
--group_reporting --runtime=2m --time_based
======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc2+ #27 Tainted: G W
------------------------------------------------------
fio/1749 is trying to acquire lock:
ffff888120b38140 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
but task is already holding lock:
ffffffff93695b20 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x6a3/0x22f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x11e/0x160
kmem_cache_alloc_node+0x44/0x530
__alloc_skb+0x158/0x230
tcp_send_active_reset+0x7e/0x730
tcp_disconnect+0x1272/0x1ae0
__tcp_close+0x707/0xd90
tcp_close+0x26/0x80
inet_release+0xfa/0x220
sock_release+0x85/0x1a0
nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
kernfs_fop_write_iter+0x356/0x530
vfs_write+0x4e8/0xce0
ksys_write+0xfd/0x1d0
do_syscall_64+0x58/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
__lock_acquire+0x2a0c/0x5690
lock_acquire+0x18e/0x4f0
lock_sock_nested+0x37/0xc0
tcp_sendpage+0x23/0xa0
inet_sendpage+0xad/0x120
kernel_sendpage+0x156/0x440
nvme_tcp_try_send+0x59e/0x27a0 [nvme_tcp]
nvme_tcp_queue_rq+0xf5e/0x1870 [nvme_tcp]
__blk_mq_try_issue_directly+0x452/0x660
blk_mq_plug_issue_direct.constprop.0+0x207/0x700
blk_mq_flush_plug_list+0x6f5/0xc70
__blk_flush_plug+0x264/0x410
blk_finish_plug+0x4b/0xa0
shrink_lruvec+0x1263/0x1ea0
shrink_node+0x736/0x1a80
do_try_to_free_pages+0x2ba/0x15e0
try_to_free_pages+0x20b/0x580
__alloc_pages_slowpath.constprop.0+0x744/0x22f0
__alloc_pages+0x42a/0x500
__folio_alloc+0x17/0x50
vma_alloc_folio+0xbd/0x4d0
__handle_mm_fault+0x1170/0x2380
handle_mm_fault+0x1d6/0x710
do_user_addr_fault+0x320/0xdc0
exc_page_fault+0x61/0xf0
asm_exc_page_fault+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sk_lock-AF_INET-NVME);
                               lock(fs_reclaim);
  lock(sk_lock-AF_INET-NVME);
*** DEADLOCK ***
4 locks held by fio/1749:
#0: ffff8881251f62b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x1e3/0xdc0
#1: ffffffff93695b20 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x6a3/0x22f0
#2: ffff8881087cb0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
#3: ffff888124e543d0 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xec1/0x1870 [nvme_tcp]
stack backtrace:
CPU: 0 PID: 1749 Comm: fio Tainted: G W 6.0.0-rc2+ #27 f927f62e1062089b9e698ced355fcf5ecf276cb2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
check_noncircular+0x26e/0x320
? print_circular_bug+0x1e0/0x1e0
? kvm_sched_clock_read+0x14/0x40
? sched_clock_cpu+0x69/0x240
? lockdep_lock+0x18a/0x1c0
? call_rcu_zapped+0xc0/0xc0
__lock_acquire+0x2a0c/0x5690
? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
? lock_chain_count+0x20/0x20
? mark_lock+0x101/0x1650
lock_acquire+0x18e/0x4f0
? tcp_sendpage+0x23/0xa0
? sched_clock_cpu+0x69/0x240
? lock_downgrade+0x6c0/0x6c0
? __lock_acquire+0xd3f/0x5690
lock_sock_nested+0x37/0xc0
? tcp_sendpage+0x23/0xa0
tcp_sendpage+0x23/0xa0
inet_sendpage+0xad/0x120
kernel_sendpage+0x156/0x440
nvme_tcp_try_send+0x59e/0x27a0 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
? lock_downgrade+0x6c0/0x6c0
? lock_release+0x6cd/0xd30
? nvme_tcp_state_change+0x150/0x150 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
? mutex_trylock+0x204/0x330
? nvme_tcp_queue_rq+0xec1/0x1870 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
? ww_mutex_unlock+0x270/0x270
nvme_tcp_queue_rq+0xf5e/0x1870 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
__blk_mq_try_issue_directly+0x452/0x660
? __blk_mq_get_driver_tag+0x980/0x980
? lock_downgrade+0x6c0/0x6c0
blk_mq_plug_issue_direct.constprop.0+0x207/0x700
blk_mq_flush_plug_list+0x6f5/0xc70
? blk_mq_flush_plug_list+0x6b3/0xc70
? set_next_task_stop+0x1c0/0x1c0
? blk_mq_insert_requests+0x450/0x450
? lock_release+0x6cd/0xd30
__blk_flush_plug+0x264/0x410
? memset+0x1f/0x40
? __mem_cgroup_uncharge_list+0x84/0x150
? __mem_cgroup_uncharge+0x140/0x140
? blk_start_plug_nr_ios+0x280/0x280
blk_finish_plug+0x4b/0xa0
shrink_lruvec+0x1263/0x1ea0
? reclaim_throttle+0x790/0x790
? sched_clock_cpu+0x69/0x240
? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
? lock_is_held_type+0xa9/0x120
? mem_cgroup_iter+0x2b2/0x780
shrink_node+0x736/0x1a80
do_try_to_free_pages+0x2ba/0x15e0
? __node_reclaim+0x7c0/0x7c0
? lock_is_held_type+0xa9/0x120
? lock_is_held_type+0xa9/0x120
try_to_free_pages+0x20b/0x580
? reclaim_pages+0x5b0/0x5b0
? psi_task_change+0x2f0/0x2f0
__alloc_pages_slowpath.constprop.0+0x744/0x22f0
? get_page_from_freelist+0x3bf/0x3920
? warn_alloc+0x190/0x190
? io_schedule_timeout+0x160/0x160
? __zone_watermark_ok+0x420/0x420
? preempt_schedule_common+0x44/0x70
? __cond_resched+0x1c/0x30
? prepare_alloc_pages.constprop.0+0x150/0x4c0
? lock_chain_count+0x20/0x20
__alloc_pages+0x42a/0x500
? __alloc_pages_slowpath.constprop.0+0x22f0/0x22f0
? set_next_task_stop+0x1c0/0x1c0
__folio_alloc+0x17/0x50
vma_alloc_folio+0xbd/0x4d0
? sched_clock_cpu+0x69/0x240
__handle_mm_fault+0x1170/0x2380
? copy_page_range+0x2ae0/0x2ae0
? lockdep_hardirqs_on_prepare+0x27b/0x3f0
? count_memcg_events.constprop.0+0x40/0x50
handle_mm_fault+0x1d6/0x710
do_user_addr_fault+0x320/0xdc0
exc_page_fault+0x61/0xf0
asm_exc_page_fault+0x22/0x30
RIP: 0033:0x55d6818eee0e
Code: 48 89 54 24 18 48 01 c2 48 89 54 24 20 48 8d 14 80 48 89 54 24 28 48 39 f1 74 38 90 66 41 0f 6f 01 66 41 0f 6f 49 10 4c 89 c8 <0f> 11 01 0f 11 49 10 48 8b 10 48 83 c0 08 48 0f af d7 48 89 50 f8
RSP: 002b:00007ffdc1100e30 EFLAGS: 00010206
RAX: 00007ffdc1100e40 RBX: 81b4c40bf7ec8b20 RCX: 00007f16bb429000
RDX: 5709bcafa91b77a0 RSI: 00007f16bb6d4000 RDI: 61c8864680b583eb
RBP: 0000000000000013 R08: 00007ffdc1100e60 R09: 00007ffdc1100e40
R10: 0000000000400000 R11: 0000000000000246 R12: 0000000000400000
R13: 0000000000000001 R14: 000055d681d52540 R15: 00007f16bb2d4000
</TASK>
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19 13:09 UTC
To: Daniel Wagner; +Cc: linux-nvme
On 10/19/22 14:35, Daniel Wagner wrote:
> On Wed, Oct 19, 2022 at 11:37:13AM +0200, Daniel Wagner wrote:
>>>> Possible unsafe locking scenario:
>>>>
>>>>        CPU0                    CPU1
>>>>        ----                    ----
>>>>   lock(fs_reclaim);
>>>>                                lock(sk_lock-AF_INET-NVME);
>>>>                                lock(fs_reclaim);
>>>>   lock(sk_lock-AF_INET-NVME);
>>>
>>> Indeed. I see the issue.
>>> kswapd is trying to swap out pages, but if someone were to delete
>>> the controller (like in this case), sock_release -> tcp_disconnect
>>> will alloc skb that may need to reclaim pages.
>>>
>>> Two questions, the stack trace suggests that you are not using
>>> nvme-mpath? is that the case?
>>
>> This is with a multipath setup. The fio settings are pushing the limits
>> of the VM (memory size), hence the kswapd process kicking in.
>>
>>> Given that we fail all inflight requests before we free the socket,
>>> I don't expect this to be truly circular...
>>>
>>> I'm assuming that we'll need the below similar to nbd/iscsi:
>>
>> Let me try this.
>
> Still able to trigger it, though I have figured out how to
> reproduce it:
>
> VM 4M memory, 8 vCPUs
That's small...
What is vm.min_free_kbytes (via sysctl)?
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19 16:01 UTC
To: Sagi Grimberg; +Cc: linux-nvme
On Wed, Oct 19, 2022 at 04:09:39PM +0300, Sagi Grimberg wrote:
> > Still able to trigger it, though I have figured out how to
> > reproduce it:
> >
> > VM 4M memory, 8 vCPUs
>
> That's small...
Just a test VM. But I think this is actually the key to reproducing the
lockdep splat. The fio command is eating up a lot of RAM (I guess any
other memory hog would do the job as well) and forces the mm subsystem
to use swap.
> What is vm.min_free_kbytes (via sysctl)?
vm.min_free_kbytes = 67584
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19 17:43 UTC
To: Daniel Wagner; +Cc: linux-nvme
>>> Still able to trigger it, though I have figured out how to
>>> reproduce it:
>>>
>>> VM 4M memory, 8 vCPUs
>>
>> That's small...
>
> Just a test VM. But I think this is actually the key to reproducing the
> lockdep splat. The fio command is eating up a lot of RAM (I guess any
> other memory hog would do the job as well) and forces the mm subsystem
> to use swap.
Is that 4MB of memory, or 4GB?
>
>> What is vm.min_free_kbytes (via sysctl)?
>
> vm.min_free_kbytes = 67584
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-20 8:10 UTC
To: Sagi Grimberg; +Cc: linux-nvme
On Wed, Oct 19, 2022 at 08:43:43PM +0300, Sagi Grimberg wrote:
>
> > > > Still able to trigger it, though I have figured out how to
> > > > reproduce it:
> > > >
> > > > VM 4M memory, 8 vCPUs
> > >
> > > That's small...
> >
> > Just a test VM. But I think this is actually the key to reproducing the
> > lockdep splat. The fio command is eating up a lot of RAM (I guess any
> > other memory hog would do the job as well) and forces the mm subsystem
> > to use swap.
>
> Is that 4MB of memory, or 4GB?
Ah sorry... it is 4GB indeed.
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-20 9:57 UTC
To: Daniel Wagner; +Cc: linux-nvme
>>>>> Still able to trigger it, though I have figured out how to
>>>>> reproduce it:
>>>>>
>>>>> VM 4M memory, 8 vCPUs
>>>>
>>>> That's small...
>>>
>>> Just a test VM. But I think this is actually the key to reproducing the
>>> lockdep splat. The fio command is eating up a lot of RAM (I guess any
>>> other memory hog would do the job as well) and forces the mm subsystem
>>> to use swap.
>>
>> Is that 4MB of memory, or 4GB?
>
> Ah sorry... it is 4GB indeed.
Just for the experiment, can you try with this change:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c5bea92560bd..d814be5dca1e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1519,7 +1519,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl
*nctrl, int qid)
* close. This is done to prevent stale data from being sent should
* the network connection be restored before TCP times out.
*/
- sock_no_linger(queue->sock->sk);
+ //sock_no_linger(queue->sock->sk);
if (so_priority > 0)
sock_set_priority(queue->sock->sk, so_priority);
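For readers less familiar with the option toggled by this experiment: to the best of my understanding, sock_no_linger() is the in-kernel counterpart of enabling SO_LINGER with a zero timeout, which makes close abort the connection instead of shutting it down gracefully (the tcp_disconnect -> tcp_send_active_reset path seen in the splat). The userspace sketch below only shows how that option is set and read back on an ordinary TCP socket; the helper names open_tcp_socket_zero_linger() and read_linger() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP socket with linger enabled and a zero timeout.
 * Returns the file descriptor, or -1 on failure.
 */
static int open_tcp_socket_zero_linger(void)
{
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read the current SO_LINGER setting back; returns 0 on success. */
static int read_linger(int fd, struct linger *out)
{
	socklen_t len = sizeof(*out);

	assert(out != NULL);
	return getsockopt(fd, SOL_SOCKET, SO_LINGER, out, &len);
}
```

With linger left at its default (off), close returns immediately and the stack finishes the normal shutdown in the background, which is presumably why commenting out the call changes which path the socket release takes.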
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-20 14:16 UTC
To: Sagi Grimberg; +Cc: linux-nvme
On Thu, Oct 20, 2022 at 12:57:11PM +0300, Sagi Grimberg wrote:
> Just for the experiment, can you try with this change:
Good call, this seems to do the trick. The splat is gone with it.
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-20 16:20 UTC
To: Daniel Wagner; +Cc: linux-nvme
>> Just for the experiment, can you try with this change:
>
> Good call, this seems to do the trick. The splat is gone with it.
OK, it doesn't say much, because it is just one of many conditions
that can make a socket release allocate an skb and send a TCP
RST, which can happen under memory pressure.
It's also not a great option to set a minimum linger of 1, which means
that if the controller is not accessible, we can block for 1 second
per queue, which is awful.
Does this change also make the issue go away?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c5bea92560bd..5bae8914c861 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1300,6 +1300,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl
*nctrl, int qid)
struct page *page;
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+ unsigned int noreclaim_flag;
if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
return;
@@ -1312,7 +1313,11 @@ static void nvme_tcp_free_queue(struct nvme_ctrl
*nctrl, int qid)
__page_frag_cache_drain(page,
queue->pf_cache.pagecnt_bias);
queue->pf_cache.va = NULL;
}
+
+ noreclaim_flag = memalloc_noreclaim_save();
sock_release(queue->sock);
+ memalloc_noreclaim_restore(noreclaim_flag);
+
kfree(queue->pdu);
mutex_destroy(&queue->send_mutex);
mutex_destroy(&queue->queue_lock);
--
* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-21 10:11 UTC
To: Sagi Grimberg; +Cc: linux-nvme
On Thu, Oct 20, 2022 at 07:20:13PM +0300, Sagi Grimberg wrote:
>
> > > Just for the experiment, can you try with this change:
> >
> > Good call, this seems to do the trick. The splat is gone with it.
>
> OK, it doesn't say much, because it is just one of many conditions
> that can make a socket release allocate an skb and send a TCP
> RST, which can happen under memory pressure.
>
> It's also not a great option to set a minimum linger of 1, which means
> that if the controller is not accessible, we can block for 1 second
> per queue, which is awful.
>
> Does this change also make the issue go away?
Yes, with this patch alone the lockdep splat is gone.