* lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19  7:51 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

Hi Sagi,

While working on something else I got the lockdep splat below. As this
is a dirty tree and not the latest and greatest, it might be a false alarm.

I haven't really looked into it yet; this is just to let you know that
there might be something going on.

Cheers,
Daniel

 ======================================================
 WARNING: possible circular locking dependency detected
 6.0.0-rc2+ #25 Tainted: G        W         
 ------------------------------------------------------
 kswapd0/92 is trying to acquire lock:
 ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
 
 but task is already holding lock:
 ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
 
 which lock already depends on the new lock.

 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (fs_reclaim){+.+.}-{0:0}:
        fs_reclaim_acquire+0x11e/0x160
        kmem_cache_alloc_node+0x44/0x530
        __alloc_skb+0x158/0x230
        tcp_send_active_reset+0x7e/0x730
        tcp_disconnect+0x1272/0x1ae0
        __tcp_close+0x707/0xd90
        tcp_close+0x26/0x80
        inet_release+0xfa/0x220
        sock_release+0x85/0x1a0
        nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
        nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
        nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
        kernfs_fop_write_iter+0x356/0x530
        vfs_write+0x4e8/0xce0
        ksys_write+0xfd/0x1d0
        do_syscall_64+0x58/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
 
 -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
        __lock_acquire+0x2a0c/0x5690
        lock_acquire+0x18e/0x4f0
        lock_sock_nested+0x37/0xc0
        tcp_sendpage+0x23/0xa0
        inet_sendpage+0xad/0x120
        kernel_sendpage+0x156/0x440
        nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
        nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
        __blk_mq_try_issue_directly+0x452/0x660
        blk_mq_plug_issue_direct.constprop.0+0x207/0x700
        blk_mq_flush_plug_list+0x6f5/0xc70
        __blk_flush_plug+0x264/0x410
        blk_finish_plug+0x4b/0xa0
        shrink_lruvec+0x1263/0x1ea0
        shrink_node+0x736/0x1a80
        balance_pgdat+0x740/0x10d0
        kswapd+0x5f2/0xaf0
        kthread+0x256/0x2f0
        ret_from_fork+0x1f/0x30
 
 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(fs_reclaim);
                                lock(sk_lock-AF_INET-NVME);
                                lock(fs_reclaim);
   lock(sk_lock-AF_INET-NVME);
 
  *** DEADLOCK ***

 3 locks held by kswapd0/92:
  #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
  #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
  #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
 
 stack backtrace:
 CPU: 7 PID: 92 Comm: kswapd0 Tainted: G        W          6.0.0-rc2+ #25 910779b354c48f37d01f55ab57fbca0c616a47fd
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x77
  check_noncircular+0x26e/0x320
  ? lock_chain_count+0x20/0x20
  ? print_circular_bug+0x1e0/0x1e0
  ? kvm_sched_clock_read+0x14/0x40
  ? sched_clock_cpu+0x69/0x240
  ? __bfs+0x317/0x6f0
  ? usage_match+0x110/0x110
  ? lockdep_lock+0xbe/0x1c0
  ? call_rcu_zapped+0xc0/0xc0
  __lock_acquire+0x2a0c/0x5690
  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
  ? lock_chain_count+0x20/0x20
  lock_acquire+0x18e/0x4f0
  ? tcp_sendpage+0x23/0xa0
  ? lock_downgrade+0x6c0/0x6c0
  ? __lock_acquire+0xd3f/0x5690
  lock_sock_nested+0x37/0xc0
  ? tcp_sendpage+0x23/0xa0
  tcp_sendpage+0x23/0xa0
  inet_sendpage+0xad/0x120
  kernel_sendpage+0x156/0x440
  nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
  ? lock_downgrade+0x6c0/0x6c0
  ? lock_release+0x6cd/0xd30
  ? nvme_tcp_state_change+0x150/0x150 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
  ? mutex_trylock+0x204/0x330
  ? nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
  ? ww_mutex_unlock+0x270/0x270
  nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp 9175a0e5b6247ff4e2c0da5432ec9d6d589fc288]
  ? kvm_sched_clock_read+0x14/0x40
  __blk_mq_try_issue_directly+0x452/0x660
  ? __blk_mq_get_driver_tag+0x980/0x980
  ? lock_downgrade+0x6c0/0x6c0
  blk_mq_plug_issue_direct.constprop.0+0x207/0x700
  ? __mem_cgroup_uncharge+0x140/0x140
  blk_mq_flush_plug_list+0x6f5/0xc70
  ? blk_mq_flush_plug_list+0x6b3/0xc70
  ? blk_mq_insert_requests+0x450/0x450
  __blk_flush_plug+0x264/0x410
  ? memset+0x1f/0x40
  ? __mem_cgroup_uncharge_list+0x84/0x150
  ? blk_start_plug_nr_ios+0x280/0x280
  blk_finish_plug+0x4b/0xa0
  shrink_lruvec+0x1263/0x1ea0
  ? reclaim_throttle+0x790/0x790
  ? sched_clock_cpu+0x69/0x240
  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
  ? lock_is_held_type+0xa9/0x120
  ? mem_cgroup_iter+0x2b2/0x780
  shrink_node+0x736/0x1a80
  balance_pgdat+0x740/0x10d0
  ? shrink_node+0x1a80/0x1a80
  ? lock_is_held_type+0xa9/0x120
  ? find_held_lock+0x34/0x120
  ? lock_is_held_type+0xa9/0x120
  ? reacquire_held_locks+0x4f0/0x4f0
  kswapd+0x5f2/0xaf0
  ? balance_pgdat+0x10d0/0x10d0
  ? destroy_sched_domains_rcu+0x60/0x60
  ? trace_hardirqs_on+0x2d/0x110
  ? __kthread_parkme+0x83/0x140
  ? balance_pgdat+0x10d0/0x10d0
  kthread+0x256/0x2f0
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19  9:03 UTC
  To: Daniel Wagner; +Cc: linux-nvme


> Hi Sagi,

Thanks for reporting.

> While working on something else I got the lockdep splat below. As this
> is a dirty tree and not the latest and greatest, it might be a false alarm.
>
> I haven't really looked into it yet; this is just to let you know that
> there might be something going on.

I didn't see anything similar to this one yet.

> 
> Cheers,
> Daniel
> 
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   6.0.0-rc2+ #25 Tainted: G        W
>   ------------------------------------------------------
>   kswapd0/92 is trying to acquire lock:
>   ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
>   
>   but task is already holding lock:
>   ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
>   
>   which lock already depends on the new lock.
> 
>   
>   the existing dependency chain (in reverse order) is:
>   
>   -> #1 (fs_reclaim){+.+.}-{0:0}:
>          fs_reclaim_acquire+0x11e/0x160
>          kmem_cache_alloc_node+0x44/0x530
>          __alloc_skb+0x158/0x230
>          tcp_send_active_reset+0x7e/0x730
>          tcp_disconnect+0x1272/0x1ae0

Here tcp_disconnect passes gfp_any() down to alloc_skb, which
overrides the socket allocation flags.
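
For reference, gfp_any() is roughly the helper below (a sketch from
memory, not copied from this exact tree), so in process context the RST
skb ends up allocated with GFP_KERNEL rather than the socket's
sk_allocation, and that is what lets lockdep see fs_reclaim taken on
this path:
--
/*
 * GFP_ATOMIC only when called from softirq context, GFP_KERNEL
 * otherwise; GFP_KERNEL includes __GFP_DIRECT_RECLAIM, hence the
 * fs_reclaim dependency.
 */
static inline gfp_t gfp_any(void)
{
        return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
--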

>          __tcp_close+0x707/0xd90
>          tcp_close+0x26/0x80
>          inet_release+0xfa/0x220
>          sock_release+0x85/0x1a0
>          nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
>          nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
>          nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
>          kernfs_fop_write_iter+0x356/0x530
>          vfs_write+0x4e8/0xce0
>          ksys_write+0xfd/0x1d0
>          do_syscall_64+0x58/0x80
>          entry_SYSCALL_64_after_hwframe+0x63/0xcd
>   
>   -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
>          __lock_acquire+0x2a0c/0x5690
>          lock_acquire+0x18e/0x4f0
>          lock_sock_nested+0x37/0xc0
>          tcp_sendpage+0x23/0xa0
>          inet_sendpage+0xad/0x120
>          kernel_sendpage+0x156/0x440
>          nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
>          nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
>          __blk_mq_try_issue_directly+0x452/0x660
>          blk_mq_plug_issue_direct.constprop.0+0x207/0x700
>          blk_mq_flush_plug_list+0x6f5/0xc70
>          __blk_flush_plug+0x264/0x410
>          blk_finish_plug+0x4b/0xa0
>          shrink_lruvec+0x1263/0x1ea0
>          shrink_node+0x736/0x1a80
>          balance_pgdat+0x740/0x10d0
>          kswapd+0x5f2/0xaf0
>          kthread+0x256/0x2f0
>          ret_from_fork+0x1f/0x30
>   
>   other info that might help us debug this:
> 
>    Possible unsafe locking scenario:
> 
>          CPU0                    CPU1
>          ----                    ----
>     lock(fs_reclaim);
>                                  lock(sk_lock-AF_INET-NVME);
>                                  lock(fs_reclaim);
>     lock(sk_lock-AF_INET-NVME);

Indeed. I see the issue.
kswapd is trying to swap out pages, but if someone were to delete
the controller (like in this case), sock_release -> tcp_disconnect
will alloc skb that may need to reclaim pages.

Two questions, the stack trace suggests that you are not using
nvme-mpath? is that the case?

Given that we fail all inflight requests before we free the socket,
I don't expect for this to be truly circular...

I'm assuming that we'll need the below similar to nbd/iscsi:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4f5dcfe5357f..c5bea92560bd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1141,6 +1141,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
  static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
  {
         struct nvme_tcp_request *req;
+       unsigned int noreclaim_flag;
         int ret = 1;

         if (!queue->request) {
@@ -1150,12 +1151,13 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
         }
         req = queue->request;

+       noreclaim_flag = memalloc_noreclaim_save();
         if (req->state == NVME_TCP_SEND_CMD_PDU) {
                 ret = nvme_tcp_try_send_cmd_pdu(req);
                 if (ret <= 0)
                         goto done;
                 if (!nvme_tcp_has_inline_data(req))
-                       return ret;
+                       goto out;
         }

         if (req->state == NVME_TCP_SEND_H2C_PDU) {
@@ -1181,6 +1183,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
                 nvme_tcp_fail_request(queue->request);
                 nvme_tcp_done_send_req(queue);
         }
+out:
+       memalloc_noreclaim_restore(noreclaim_flag);
         return ret;
  }
--
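
For anyone trying this: memalloc_noreclaim_save()/restore() just scope
PF_MEMALLOC on the current task, so allocations made inside the window
skip direct reclaim instead of recursing into it. Roughly (a sketch of
the <linux/sched/mm.h> helpers, not copied from this tree):
--
static inline unsigned int memalloc_noreclaim_save(void)
{
        /* remember whether PF_MEMALLOC was already set, then set it */
        unsigned int flags = current->flags & PF_MEMALLOC;

        current->flags |= PF_MEMALLOC;
        return flags;
}

static inline void memalloc_noreclaim_restore(unsigned int flags)
{
        /* drop PF_MEMALLOC again unless it was set before the save */
        current->flags = (current->flags & ~PF_MEMALLOC) | flags;
}
--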


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19  9:37 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

> >    Possible unsafe locking scenario:
> > 
> >          CPU0                    CPU1
> >          ----                    ----
> >     lock(fs_reclaim);
> >                                  lock(sk_lock-AF_INET-NVME);
> >                                  lock(fs_reclaim);
> >     lock(sk_lock-AF_INET-NVME);
> 
> Indeed. I see the issue.
> kswapd is trying to swap out pages, but if someone were to delete
> the controller (like in this case), sock_release -> tcp_disconnect
> will alloc skb that may need to reclaim pages.
> 
> Two questions, the stack trace suggests that you are not using
> nvme-mpath? is that the case?

This is with a multipath setup. The fio settings are pushing the limits
of the VM (memory size) hence the kswap process kicking in.

> Given that we fail all inflight requests before we free the socket,
> I don't expect for this to be truly circular...
> 
> I'm assuming that we'll need the below similar to nbd/iscsi:

Let me try this.


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19 11:35 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Oct 19, 2022 at 11:37:13AM +0200, Daniel Wagner wrote:
> > >    Possible unsafe locking scenario:
> > > 
> > >          CPU0                    CPU1
> > >          ----                    ----
> > >     lock(fs_reclaim);
> > >                                  lock(sk_lock-AF_INET-NVME);
> > >                                  lock(fs_reclaim);
> > >     lock(sk_lock-AF_INET-NVME);
> > 
> > Indeed. I see the issue.
> > kswapd is trying to swap out pages, but if someone were to delete
> > the controller (like in this case), sock_release -> tcp_disconnect
> > will alloc skb that may need to reclaim pages.
> > 
> > Two questions, the stack trace suggests that you are not using
> > nvme-mpath? is that the case?
> 
> This is with a multipath setup. The fio settings are pushing the limits
> of the VM (memory size) hence the kswap process kicking in.
> 
> > Given that we fail all inflight requests before we free the socket,
> > I don't expect for this to be truly circular...
> > 
> > I'm assuming that we'll need the below similar to nbd/iscsi:
> 
> Let me try this.

Still able to trigger though I figured out how I am able to
reproduce it:

 VM 4M memory, 8 vCPUs
 nvme target with at least 2 namespaces
 ns 1: fio read/write
 ns 2: swap space

 1) nvme connect-all
 2) nvme disconnect-all
 3) nvme connect-all
 4) swapon /dev/nvme0n4
 5) fio --rw=rw --name=test --filename=/dev/nvme1n1 --size=1G --direct=1 \
        --iodepth=32 --blocksize_range=4k-4M --numjobs=32 \
        --group_reporting --runtime=2m --time_based

 ======================================================
 WARNING: possible circular locking dependency detected
 6.0.0-rc2+ #27 Tainted: G        W         
 ------------------------------------------------------
 fio/1749 is trying to acquire lock:
 ffff888120b38140 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
 
 but task is already holding lock:
 ffffffff93695b20 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x6a3/0x22f0
 
 which lock already depends on the new lock.

 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (fs_reclaim){+.+.}-{0:0}:
        fs_reclaim_acquire+0x11e/0x160
        kmem_cache_alloc_node+0x44/0x530
        __alloc_skb+0x158/0x230
        tcp_send_active_reset+0x7e/0x730
        tcp_disconnect+0x1272/0x1ae0
        __tcp_close+0x707/0xd90
        tcp_close+0x26/0x80
        inet_release+0xfa/0x220
        sock_release+0x85/0x1a0
        nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
        nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
        nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
        kernfs_fop_write_iter+0x356/0x530
        vfs_write+0x4e8/0xce0
        ksys_write+0xfd/0x1d0
        do_syscall_64+0x58/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
 
 -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
        __lock_acquire+0x2a0c/0x5690
        lock_acquire+0x18e/0x4f0
        lock_sock_nested+0x37/0xc0
        tcp_sendpage+0x23/0xa0
        inet_sendpage+0xad/0x120
        kernel_sendpage+0x156/0x440
        nvme_tcp_try_send+0x59e/0x27a0 [nvme_tcp]
        nvme_tcp_queue_rq+0xf5e/0x1870 [nvme_tcp]
        __blk_mq_try_issue_directly+0x452/0x660
        blk_mq_plug_issue_direct.constprop.0+0x207/0x700
        blk_mq_flush_plug_list+0x6f5/0xc70
        __blk_flush_plug+0x264/0x410
        blk_finish_plug+0x4b/0xa0
        shrink_lruvec+0x1263/0x1ea0
        shrink_node+0x736/0x1a80
        do_try_to_free_pages+0x2ba/0x15e0
        try_to_free_pages+0x20b/0x580
        __alloc_pages_slowpath.constprop.0+0x744/0x22f0
        __alloc_pages+0x42a/0x500
        __folio_alloc+0x17/0x50
        vma_alloc_folio+0xbd/0x4d0
        __handle_mm_fault+0x1170/0x2380
        handle_mm_fault+0x1d6/0x710
        do_user_addr_fault+0x320/0xdc0
        exc_page_fault+0x61/0xf0
        asm_exc_page_fault+0x22/0x30
 
 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(fs_reclaim);
                                lock(sk_lock-AF_INET-NVME);
                                lock(fs_reclaim);
   lock(sk_lock-AF_INET-NVME);
 
  *** DEADLOCK ***

 4 locks held by fio/1749:
  #0: ffff8881251f62b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x1e3/0xdc0
  #1: ffffffff93695b20 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x6a3/0x22f0
  #2: ffff8881087cb0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
  #3: ffff888124e543d0 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xec1/0x1870 [nvme_tcp]
 
 stack backtrace:
 CPU: 0 PID: 1749 Comm: fio Tainted: G        W          6.0.0-rc2+ #27 f927f62e1062089b9e698ced355fcf5ecf276cb2
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x77
  check_noncircular+0x26e/0x320
  ? print_circular_bug+0x1e0/0x1e0
  ? kvm_sched_clock_read+0x14/0x40
  ? sched_clock_cpu+0x69/0x240
  ? lockdep_lock+0x18a/0x1c0
  ? call_rcu_zapped+0xc0/0xc0
  __lock_acquire+0x2a0c/0x5690
  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
  ? lock_chain_count+0x20/0x20
  ? mark_lock+0x101/0x1650
  lock_acquire+0x18e/0x4f0
  ? tcp_sendpage+0x23/0xa0
  ? sched_clock_cpu+0x69/0x240
  ? lock_downgrade+0x6c0/0x6c0
  ? __lock_acquire+0xd3f/0x5690
  lock_sock_nested+0x37/0xc0
  ? tcp_sendpage+0x23/0xa0
  tcp_sendpage+0x23/0xa0
  inet_sendpage+0xad/0x120
  kernel_sendpage+0x156/0x440
  nvme_tcp_try_send+0x59e/0x27a0 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
  ? lock_downgrade+0x6c0/0x6c0
  ? lock_release+0x6cd/0xd30
  ? nvme_tcp_state_change+0x150/0x150 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
  ? mutex_trylock+0x204/0x330
  ? nvme_tcp_queue_rq+0xec1/0x1870 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
  ? ww_mutex_unlock+0x270/0x270
  nvme_tcp_queue_rq+0xf5e/0x1870 [nvme_tcp 154cb4fe55d74667e1ca60e2a90f260935f9e2bd]
  __blk_mq_try_issue_directly+0x452/0x660
  ? __blk_mq_get_driver_tag+0x980/0x980
  ? lock_downgrade+0x6c0/0x6c0
  blk_mq_plug_issue_direct.constprop.0+0x207/0x700
  blk_mq_flush_plug_list+0x6f5/0xc70
  ? blk_mq_flush_plug_list+0x6b3/0xc70
  ? set_next_task_stop+0x1c0/0x1c0
  ? blk_mq_insert_requests+0x450/0x450
  ? lock_release+0x6cd/0xd30
  __blk_flush_plug+0x264/0x410
  ? memset+0x1f/0x40
  ? __mem_cgroup_uncharge_list+0x84/0x150
  ? __mem_cgroup_uncharge+0x140/0x140
  ? blk_start_plug_nr_ios+0x280/0x280
  blk_finish_plug+0x4b/0xa0
  shrink_lruvec+0x1263/0x1ea0
  ? reclaim_throttle+0x790/0x790
  ? sched_clock_cpu+0x69/0x240
  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
  ? lock_is_held_type+0xa9/0x120
  ? mem_cgroup_iter+0x2b2/0x780
  shrink_node+0x736/0x1a80
  do_try_to_free_pages+0x2ba/0x15e0
  ? __node_reclaim+0x7c0/0x7c0
  ? lock_is_held_type+0xa9/0x120
  ? lock_is_held_type+0xa9/0x120
  try_to_free_pages+0x20b/0x580
  ? reclaim_pages+0x5b0/0x5b0
  ? psi_task_change+0x2f0/0x2f0
  __alloc_pages_slowpath.constprop.0+0x744/0x22f0
  ? get_page_from_freelist+0x3bf/0x3920
  ? warn_alloc+0x190/0x190
  ? io_schedule_timeout+0x160/0x160
  ? __zone_watermark_ok+0x420/0x420
  ? preempt_schedule_common+0x44/0x70
  ? __cond_resched+0x1c/0x30
  ? prepare_alloc_pages.constprop.0+0x150/0x4c0
  ? lock_chain_count+0x20/0x20
  __alloc_pages+0x42a/0x500
  ? __alloc_pages_slowpath.constprop.0+0x22f0/0x22f0
  ? set_next_task_stop+0x1c0/0x1c0
  __folio_alloc+0x17/0x50
  vma_alloc_folio+0xbd/0x4d0
  ? sched_clock_cpu+0x69/0x240
  __handle_mm_fault+0x1170/0x2380
  ? copy_page_range+0x2ae0/0x2ae0
  ? lockdep_hardirqs_on_prepare+0x27b/0x3f0
  ? count_memcg_events.constprop.0+0x40/0x50
  handle_mm_fault+0x1d6/0x710
  do_user_addr_fault+0x320/0xdc0
  exc_page_fault+0x61/0xf0
  asm_exc_page_fault+0x22/0x30
 RIP: 0033:0x55d6818eee0e
 Code: 48 89 54 24 18 48 01 c2 48 89 54 24 20 48 8d 14 80 48 89 54 24 28 48 39 f1 74 38 90 66 41 0f 6f 01 66 41 0f 6f 49 10 4c 89 c8 <0f> 11 01 0f 11 49 10 48 8b 10 48 83 c0 08 48 0f af d7 48 89 50 f8
 RSP: 002b:00007ffdc1100e30 EFLAGS: 00010206
 RAX: 00007ffdc1100e40 RBX: 81b4c40bf7ec8b20 RCX: 00007f16bb429000
 RDX: 5709bcafa91b77a0 RSI: 00007f16bb6d4000 RDI: 61c8864680b583eb
 RBP: 0000000000000013 R08: 00007ffdc1100e60 R09: 00007ffdc1100e40
 R10: 0000000000400000 R11: 0000000000000246 R12: 0000000000400000
 R13: 0000000000000001 R14: 000055d681d52540 R15: 00007f16bb2d4000
  </TASK>


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19 13:09 UTC
  To: Daniel Wagner; +Cc: linux-nvme



On 10/19/22 14:35, Daniel Wagner wrote:
> On Wed, Oct 19, 2022 at 11:37:13AM +0200, Daniel Wagner wrote:
>>>>     Possible unsafe locking scenario:
>>>>
>>>>           CPU0                    CPU1
>>>>           ----                    ----
>>>>      lock(fs_reclaim);
>>>>                                   lock(sk_lock-AF_INET-NVME);
>>>>                                   lock(fs_reclaim);
>>>>      lock(sk_lock-AF_INET-NVME);
>>>
>>> Indeed. I see the issue.
>>> kswapd is trying to swap out pages, but if someone were to delete
>>> the controller (like in this case), sock_release -> tcp_disconnect
>>> will alloc skb that may need to reclaim pages.
>>>
>>> Two questions, the stack trace suggests that you are not using
>>> nvme-mpath? is that the case?
>>
>> This is with a multipath setup. The fio settings are pushing the limits
>> of the VM (memory size) hence the kswap process kicking in.
>>
>>> Given that we fail all inflight requests before we free the socket,
>>> I don't expect for this to be truly circular...
>>>
>>> I'm assuming that we'll need the below similar to nbd/iscsi:
>>
>> Let me try this.
> 
> Still able to trigger though I figured out how I am able to
> reproduce it:
> 
>   VM 4M memory, 8 vCPUs

thats small...

What is vm.min_free_kbytes (via sysctl)?


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-19 16:01 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Oct 19, 2022 at 04:09:39PM +0300, Sagi Grimberg wrote:
> > Still able to trigger though I figured out how I am able to
> > reproduce it:
> > 
> >   VM 4M memory, 8 vCPUs
> 
> thats small...

Just a test VM. But I think this is actually the key to reproduce the
lockdep splat. The fio command is eating up a lot of ram (I guess any
other memory hog would do the job as well) and forces the mm subsystem
to use the swap.

> What is vm.min_free_kbytes (via sysctl)?

vm.min_free_kbytes = 67584


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-19 17:43 UTC
  To: Daniel Wagner; +Cc: linux-nvme


>>> Still able to trigger though I figured out how I am able to
>>> reproduce it:
>>>
>>>    VM 4M memory, 8 vCPUs
>>
>> thats small...
> 
> Just a test VM. But I think this is actually the key to reproduce the
> lockdep splat. The fio command is eating up a lot of ram (I guess any
> other memory hog would do the job as well) and forces the mm subsystem
> to use the swap.

Is that 4MB of memory? or 4GB?

> 
>> What is vm.min_free_kbytes (via sysctl)?
> 
> vm.min_free_kbytes = 67584


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-20  8:10 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Oct 19, 2022 at 08:43:43PM +0300, Sagi Grimberg wrote:
> 
> > > > Still able to trigger though I figured out how I am able to
> > > > reproduce it:
> > > > 
> > > >    VM 4M memory, 8 vCPUs
> > > 
> > > thats small...
> > 
> > Just a test VM. But I think this is actually the key to reproduce the
> > lockdep splat. The fio command is eating up a lot of ram (I guess any
> > other memory hog would do the job as well) and forces the mm subsystem
> > to use the swap.
> 
> Is that 4MB of memory? or 4GB?

Ah sorry... it is 4GB indeed.


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-20  9:57 UTC
  To: Daniel Wagner; +Cc: linux-nvme


>>>>> Still able to trigger though I figured out how I am able to
>>>>> reproduce it:
>>>>>
>>>>>     VM 4M memory, 8 vCPUs
>>>>
>>>> thats small...
>>>
>>> Just a test VM. But I think this is actually the key to reproduce the
>>> lockdep splat. The fio command is eating up a lot of ram (I guess any
>>> other memory hog would do the job as well) and forces the mm subsystem
>>> to use the swap.
>>
>> Is that 4MB of memory? or 4GB?
> 
> Ah sorry... it is 4GB indeed.

Just for the experiment, can you try with this change:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c5bea92560bd..d814be5dca1e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1519,7 +1519,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
          * close. This is done to prevent stale data from being sent should
          * the network connection be restored before TCP times out.
          */
-       sock_no_linger(queue->sock->sk);
+       //sock_no_linger(queue->sock->sk);

         if (so_priority > 0)
                 sock_set_priority(queue->sock->sk, so_priority);
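
If commenting this out makes the splat go away, it would fit the trace:
sock_no_linger() arms SO_LINGER with a zero timeout, so as far as I can
tell __tcp_close() then goes through tcp_disconnect() and sends a RST,
and that RST needs a freshly allocated skb (the tcp_send_active_reset ->
__alloc_skb leg in the original trace). The helper is roughly (a sketch,
not copied from this tree):
--
void sock_no_linger(struct sock *sk)
{
        lock_sock(sk);
        /*
         * Zero linger time: close() aborts the connection with a RST
         * instead of the normal FIN teardown.
         */
        sk->sk_lingertime = 0;
        sock_set_flag(sk, SOCK_LINGER);
        release_sock(sk);
}
--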


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-20 14:16 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Oct 20, 2022 at 12:57:11PM +0300, Sagi Grimberg wrote:
> Just for the experiment, can you try with this change:

Good call, this seems to do the trick. The splat is gone with it.


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Sagi Grimberg @ 2022-10-20 16:20 UTC
  To: Daniel Wagner; +Cc: linux-nvme


>> Just for the experiment, can you try with this change:
> 
> Good call, this seems to do the trick. The splat is gone with it.

 OK, it doesn't say much because it is just one of many conditions
 that can make a socket release allocate an skb and send a TCP
 RST, which can happen under memory pressure.

It's also not a great option to set a minimum linger of 1, which means
that if the controller is not accessible, we can block for 1 second
per queue, which is awful.

Does this change also make the issue go away?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c5bea92560bd..5bae8914c861 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1300,6 +1300,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
         struct page *page;
         struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
         struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+       unsigned int noreclaim_flag;

         if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
                 return;
@@ -1312,7 +1313,11 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
                 __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
                 queue->pf_cache.va = NULL;
         }
+
+       noreclaim_flag = memalloc_noreclaim_save();
         sock_release(queue->sock);
+       memalloc_noreclaim_restore(noreclaim_flag);
+
         kfree(queue->pdu);
         mutex_destroy(&queue->send_mutex);
         mutex_destroy(&queue->queue_lock);
--


* Re: lockdep warning: fs_reclaim_acquire vs tcp_sendpage
From: Daniel Wagner @ 2022-10-21 10:11 UTC
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Oct 20, 2022 at 07:20:13PM +0300, Sagi Grimberg wrote:
> 
> > > Just for the experiment, can you try with this change:
> > 
> > Good call, this seems to do the trick. The splat is gone with it.
> 
> OK, it doesn't say much because it is just one of many conditions
> that can make a socket release allocate an skb and send a TCP
> RST, which can happen under memory pressure.
> 
> It's also not a great option to set a minimum linger of 1, which means
> that if the controller is not accessible, we can block for 1 second
> per queue, which is awful.
> 
> Does this change also make the issue go away?

Yes, with this patch alone the lockdep splat is gone.

