* Some tasks hit hung_task timeouts, possibly due to the RBD driver
@ 2020-01-29  7:24 Satoru Takeuchi
  2020-01-29 11:07 ` Ilya Dryomov
  0 siblings, 1 reply; 4+ messages in thread
From: Satoru Takeuchi @ 2020-01-29  7:24 UTC (permalink / raw)
  To: ceph-devel

Hi,

In the Rook(*1)/Ceph community, some users have encountered hung tasks in XFS
that appear to be caused by the RBD driver.

Although we have not yet reproduced this problem on the newest kernel, could
anyone give us a hint about it, if possible?

*1) A Ceph orchestrator for Kubernetes

- A question about this problem in linux-xfs ML
  https://marc.info/?l=linux-xfs&m=158009629016068&w=2

- The reply to the above-mentioned question from Dave Chinner.
  https://marc.info/?l=linux-xfs&m=158018349207976&w=2

I have copied and pasted the discussion from the above-mentioned mails below.

My question:
```
Under certain Ceph workloads, many processes hit hung_task timeouts. We found
that there are two kinds of affected processes:

a) processes under very high CPU load
b) processes that hit a hung task in XFS

In addition, a user captured the following two kernel traces.

The backtrace of a (b) process, captured with `hung_task_panic=1`:

```
[51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
[51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
[51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[51717.039426] kworker/2:1     D    0  5938      2 0x80000000
[51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
[51717.039472] Call Trace:
[51717.039478]  __schedule+0x24e/0x880
[51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
[51717.039506]  schedule+0x2c/0x80
[51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
[51717.039533]  ? wake_up_q+0x80/0x80
[51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
[51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
[51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
[51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
[51717.039626]  process_one_work+0x1de/0x420
[51717.039627]  worker_thread+0x32/0x410
[51717.039628]  kthread+0x121/0x140
[51717.039630]  ? process_one_work+0x420/0x420
[51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
[51717.039633]  ret_from_fork+0x35/0x40
```
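
For reference, a minimal sketch of the hung-task detector settings used to
capture the trace above (the 120-second timeout is the kernel default, and
`hung_task_panic` can also be passed as a boot parameter):

```
# Report tasks stuck in uninterruptible (D) sleep for more than 120 seconds,
# and panic when one is detected (this is what hung_task_panic=1 does).
sysctl -w kernel.hung_task_timeout_secs=120
sysctl -w kernel.hung_task_panic=1

# On demand, dump all currently blocked tasks via sysrq:
echo w > /proc/sysrq-trigger
```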

The backtrace of a (b) process, obtained with `sudo cat /proc/<PID of a D
process>/stack`:

```
[<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
[<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
[<0>] xfs_trans_commit+0x10/0x20 [xfs]
[<0>] xfs_sync_sb+0x6d/0x80 [xfs]
[<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
[<0>] xfs_unmountfs+0xe7/0x200 [xfs]
[<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
[<0>] generic_shutdown_super+0x72/0x120
[<0>] kill_block_super+0x2c/0x80
[<0>] deactivate_locked_super+0x48/0x80
[<0>] deactivate_super+0x40/0x60
[<0>] cleanup_mnt+0x3f/0x80
[<0>] __cleanup_mnt+0x12/0x20
[<0>] task_work_run+0x9d/0xc0
[<0>] exit_to_usermode_loop+0xc0/0xd0
[<0>] do_syscall_64+0x121/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
```
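
To check whether other tasks are stuck at the same spot, the same /proc
interface can be looped over every D-state task (a rough sketch using only
standard tools; root is needed for /proc/<pid>/stack):

```
# Dump the kernel stack of every task currently in uninterruptible sleep.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
    sudo cat /proc/$pid/stack
done
```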

Here are the results of my investigation:

- I couldn't find any commit related to this problem, either in upstream
  master or in XFS's for-next branch.
- I couldn't find any discussion related to the above-mentioned backtrace
  on the linux-xfs ML.
- There seems to be a problem in XFS's transaction commit. In both traces,
  the (b) processes hang in _xfs_log_force_lsn+0x20e/0x350 [xfs]. This
  offset corresponds to one of the following two xlog_wait() calls.

  https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366
  https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3387

  These processes voluntarily gave up the CPU at the following line.

  https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549

  They should then have been woken up by another process, but unfortunately
  that never happened. (One way to narrow this down is sketched right below.)
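
Assuming debugfs is mounted, this is a rough way to check whether the log
I/O ever completed, i.e. whether the rbd device still has requests in
flight (rbd0 stands in for whichever device is affected):

```
# Block-layer view: requests currently in flight on the rbd device.
cat /sys/block/rbd0/inflight
# libceph view: OSD requests still outstanding for this client (debugfs).
sudo cat /sys/kernel/debug/ceph/*/osdc
```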

Test environment:
- kernel: 4.15.0-<x>-generic
- filesystem: XFS # No one has reported this problem with other filesystems yet

Related discussions:
- Rook issue:
  https://github.com/rook/rook/issues/3132
- Ceph issue:
  https://tracker.ceph.com/issues/40068
```

Dave's answer:
```
> Under some workload in Ceph, many processes got to hang_task. We found
> that there
> are two kinds of processes.
>
> a) In very high CPU load
> b) Encountered hang_task in the XFS
>
> In addition,a user got the following two kernel traces.
>
> A (b) process's backtrace with `hung_task_panic=1`.
>
> ```
> [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu

Kinda old, and not an upstream LTS kernel, right?

> [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]

Filesystem is on a Ceph RBD device.

> [51717.039472] Call Trace:
> [51717.039478]  __schedule+0x24e/0x880
> [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> [51717.039506]  schedule+0x2c/0x80
> [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> [51717.039533]  ? wake_up_q+0x80/0x80
> [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> [51717.039626]  process_one_work+0x1de/0x420
> [51717.039627]  worker_thread+0x32/0x410
> [51717.039628]  kthread+0x121/0x140
> [51717.039630]  ? process_one_work+0x420/0x420
> [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> [51717.039633]  ret_from_fork+0x35/0x40

That's waiting for log IO completion.

> ```
>
> A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> process>/stack`
>
> ```
> [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> [<0>] generic_shutdown_super+0x72/0x120
> [<0>] kill_block_super+0x2c/0x80
> [<0>] deactivate_locked_super+0x48/0x80
> [<0>] deactivate_super+0x40/0x60
> [<0>] cleanup_mnt+0x3f/0x80
> [<0>] __cleanup_mnt+0x12/0x20
> [<0>] task_work_run+0x9d/0xc0
> [<0>] exit_to_usermode_loop+0xc0/0xd0
> [<0>] do_syscall_64+0x121/0x130
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff

And this is the last reference to the filesystem being dropped and
it waiting for log IO completion.

So, the filesystem has been unmounted, and it's waiting for journal
IO on the device to complete.  I wonder if a wakeup was missed
somewhere?

Did the system stop/tear down /dev/rbd0 prematurely?

> Related discussions:
> - Issue of Rook:
>   https://github.com/rook/rook/issues/3132
> - Issue of Ceph
>   https://tracker.ceph.com/issues/40068

These point to Ceph RBDs failing to respond under high load and
tasks hanging because they are waiting on IO. That's exactly the
symptoms you are reporting here. That points to it being a Ceph RBD
issue to me, especially the reports where rbd devices report no IO
load but the ceph back end is at 100% disk utilisation doing
-something-.
```

Thanks,
Satoru


* Re: Some tasks hit hung_task timeouts, possibly due to the RBD driver
  2020-01-29  7:24 Some tasks hit hung_task timeouts, possibly due to the RBD driver Satoru Takeuchi
@ 2020-01-29 11:07 ` Ilya Dryomov
  2020-01-30 14:54   ` Ilya Dryomov
  0 siblings, 1 reply; 4+ messages in thread
From: Ilya Dryomov @ 2020-01-29 11:07 UTC (permalink / raw)
  To: Satoru Takeuchi; +Cc: Ceph Development

On Wed, Jan 29, 2020 at 8:25 AM Satoru Takeuchi
<satoru.takeuchi@gmail.com> wrote:
>
> Hi,
>
> In Rook(*1)/Ceph community, some users encountered hang_task in XFS and it
> would be caused by RBD driver.
>
> Although we've not reproduced this problem in the newest kernel, could anyone
> give us any hint about this problem, if possible?
>
> *1) An Ceph orchestration in Kubernetes
>
> - A question about this problem in linux-xfs ML
>   https://marc.info/?l=linux-xfs&m=158009629016068&w=2
>
> - The reply to the above-mentioned question from Dave Chinner.
>   https://marc.info/?l=linux-xfs&m=158018349207976&w=2
>
> I copied-and-pasted the discussion of above-mentioned mails here.
>
> my question:
> ```
> Under some workload in Ceph, many processes got to hang_task. We found
> that there are two kinds of processes.
>
> a) In very high CPU load
> b) Encountered hang_task in the XFS
>
> In addition,a user got the following two kernel traces.
>
> A (b) process's backtrace with `hung_task_panic=1`.
>
> ```
> [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
> [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
> [51717.039472] Call Trace:
> [51717.039478]  __schedule+0x24e/0x880
> [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> [51717.039506]  schedule+0x2c/0x80
> [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> [51717.039533]  ? wake_up_q+0x80/0x80
> [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> [51717.039626]  process_one_work+0x1de/0x420
> [51717.039627]  worker_thread+0x32/0x410
> [51717.039628]  kthread+0x121/0x140
> [51717.039630]  ? process_one_work+0x420/0x420
> [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> [51717.039633]  ret_from_fork+0x35/0x40
> ```
>
> A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> process>/stack`
>
> ```
> [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> [<0>] generic_shutdown_super+0x72/0x120
> [<0>] kill_block_super+0x2c/0x80
> [<0>] deactivate_locked_super+0x48/0x80
> [<0>] deactivate_super+0x40/0x60
> [<0>] cleanup_mnt+0x3f/0x80
> [<0>] __cleanup_mnt+0x12/0x20
> [<0>] task_work_run+0x9d/0xc0
> [<0>] exit_to_usermode_loop+0xc0/0xd0
> [<0>] do_syscall_64+0x121/0x130
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
> ```
>
> Here is the result of my investigation:
>
> - I couldn't find any commit that would be related to this problem,
> both in the upstream
>   master and XFS's for-next
> - I couldn't find any discussions that would be related to the
> above-mentioned backtrace
>   in linux-xfs ML
> - There would be a problem in the transaction commit of XFS. In both
> of two traces,
>   (b) processes hung in _xfs_log_force_lsn+0x20e/0x350 [xfs]. This
> code is one of
>   the following two xlog_wait().
>
>   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366
>   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3387
>
>   These processes released CPU voluntarily in the following line.
>
>   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549
>
>   These two processes should be woken by the other process after that.
>   However, unfortunately, it didn't happen.
>
> Test environment:
> - kernel: 4.15.0-<x>-generic
> - XFS # Anyone hasn't reported this problem with other filesystems yet.
>
> Related discussions:
> - Issue of Rook:
>   https://github.com/rook/rook/issues/3132
> - Issue of Ceph
>   https://tracker.ceph.com/issues/40068
> ```
>
> Dave's answer:
> ```
> > Under some workload in Ceph, many processes got to hang_task. We found
> > that there
> > are two kinds of processes.
> >
> > a) In very high CPU load
> > b) Encountered hang_task in the XFS
> >
> > In addition,a user got the following two kernel traces.
> >
> > A (b) process's backtrace with `hung_task_panic=1`.
> >
> > ```
> > [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> > [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
>
> Kinda old, and not an upstream LTS kernel, right?
>
> > [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> > [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
>
> Filesystem is on a Ceph RBD device.
>
> > [51717.039472] Call Trace:
> > [51717.039478]  __schedule+0x24e/0x880
> > [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> > [51717.039506]  schedule+0x2c/0x80
> > [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > [51717.039533]  ? wake_up_q+0x80/0x80
> > [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> > [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> > [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> > [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> > [51717.039626]  process_one_work+0x1de/0x420
> > [51717.039627]  worker_thread+0x32/0x410
> > [51717.039628]  kthread+0x121/0x140
> > [51717.039630]  ? process_one_work+0x420/0x420
> > [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [51717.039633]  ret_from_fork+0x35/0x40
>
> That's waiting for log IO completion.
>
> > ```
> >
> > A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> > process>/stack`
> >
> > ```
> > [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> > [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> > [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> > [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> > [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> > [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> > [<0>] generic_shutdown_super+0x72/0x120
> > [<0>] kill_block_super+0x2c/0x80
> > [<0>] deactivate_locked_super+0x48/0x80
> > [<0>] deactivate_super+0x40/0x60
> > [<0>] cleanup_mnt+0x3f/0x80
> > [<0>] __cleanup_mnt+0x12/0x20
> > [<0>] task_work_run+0x9d/0xc0
> > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > [<0>] do_syscall_64+0x121/0x130
> > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [<0>] 0xffffffffffffffff
>
> And this is the last reference to the filesystem being dropped and
> it waiting for log IO completion.
>
> So, the filesystem has been unmounted, and it's waiting for journal
> IO on the device to complete.  I wonder if a wakeup was missed
> somewhere?
>
> Did the system stop/tear down /dev/rbd0 prematurely?

Hi Satoru,

I would ask the same question.  Typically this is seen when the network
is shut off before the filesystem on top of rbd is fully unmounted.
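
In other words, the teardown has to happen in roughly this order (a sketch
with a hypothetical mount point):

```
# 1. Unmount first -- this is what forces the remaining XFS log I/O to rbd.
umount /var/lib/myapp
# 2. Only then unmap the device and let the node/network go away.
rbd unmap /dev/rbd0
```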

>
> > Related discussions:
> > - Issue of Rook:
> >   https://github.com/rook/rook/issues/3132
> > - Issue of Ceph
> >   https://tracker.ceph.com/issues/40068
>
> These point to Ceph RBDs failing to respond under high load and
> tasks hanging because they are waiting on IO. That's exactly the
> symptoms you are reporting here. That points to it being a Ceph RBD
> issue to me, especially the reports where rbd devices report no IO
> load but the ceph back end is at 100% disk utilisation doing
> -something-.

I skimmed through https://github.com/rook/rook/issues/3132 and this
particular stack trace came from someone who was upgrading their
cluster, draining node after node and saw a umount process get stuck.
This happens outside of k8s as well, but it is easier to hit with k8s
because of different volume plugins and sometimes multiple layers of
SDN involved.  One thing to do is to make sure that the filesystem on
top of rbd gets mounted with the _netdev option.  It has no effect on the
kernel, but serves as a cue to systemd to order the unmount before
network teardown and some container engines had trouble propagating it
between their bind mounts in the past.  Of course, if you or k8s
somehow shut the network down before umount runs, _netdev won't help.
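
A minimal example of what that looks like (the mount point is hypothetical;
the _netdev option is the only relevant part):

```
# /etc/fstab: _netdev makes systemd order the unmount before network teardown.
/dev/rbd0  /var/lib/myapp  xfs  defaults,_netdev  0  0

# Equivalent one-off mount:
mount -t xfs -o _netdev /dev/rbd0 /var/lib/myapp
```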

When dealing with suspected kernel issues, try to collect the entire
dmesg, not just a couple of stack traces.  In this case we would have
likely seen the kernel attempting to reconnect to the cluster over the
dead network because XFS has not finished unmounting:

  libceph: connect 10.19.115.5:6789 error -101
  libceph: mon1 10.19.115.5:6789 connect error
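
Something as simple as the following is usually enough (a sketch):

```
# Save the full kernel log for the current boot, not just the hung-task splat.
dmesg -T > dmesg.txt            # or: journalctl -k -b > dmesg.txt

# While a umount is stuck, watch for libceph/rbd/xfs messages live:
dmesg -wT | grep -E 'libceph|rbd|xfs'
```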


The rest of the issues in that ticket appear to have been resolved by
switching from XFS to ext4 on top of rbd.  This is crucial when mapping
rbd devices on the OSD nodes (i.e. co-locating the kernel client with
the OSDs) because XFS is a lot more aggressive about memory reclaim and
in doing that it is _much_ more likely to recurse back on itself, which
results in a deadlock.  This co-location + XFS deadlock is not specific
to rbd or ceph, here is someone who hit it with nbd:

  https://www.spinics.net/lists/linux-xfs/msg26261.html

There is no reliable way to prevent it in current kernels.  This is why
rook and ceph-csi default to ext4.
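
A quick way to check whether a node is in that co-located situation (a
sketch; assumes the rbd CLI is installed):

```
# Kernel rbd mappings present on this node ...
rbd showmapped
# ... together with OSD daemons running on the same node means co-location.
pgrep -a ceph-osd
```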

Thanks,

                Ilya


* Re: Some tasks hit hung_task timeouts, possibly due to the RBD driver
  2020-01-29 11:07 ` Ilya Dryomov
@ 2020-01-30 14:54   ` Ilya Dryomov
  2020-01-30 22:31     ` Satoru Takeuchi
  0 siblings, 1 reply; 4+ messages in thread
From: Ilya Dryomov @ 2020-01-30 14:54 UTC (permalink / raw)
  To: Satoru Takeuchi; +Cc: Ceph Development

On Wed, Jan 29, 2020 at 12:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Wed, Jan 29, 2020 at 8:25 AM Satoru Takeuchi
> <satoru.takeuchi@gmail.com> wrote:
> >
> > Hi,
> >
> > In Rook(*1)/Ceph community, some users encountered hang_task in XFS and it
> > would be caused by RBD driver.
> >
> > Although we've not reproduced this problem in the newest kernel, could anyone
> > give us any hint about this problem, if possible?
> >
> > *1) An Ceph orchestration in Kubernetes
> >
> > - A question about this problem in linux-xfs ML
> >   https://marc.info/?l=linux-xfs&m=158009629016068&w=2
> >
> > - The reply to the above-mentioned question from Dave Chinner.
> >   https://marc.info/?l=linux-xfs&m=158018349207976&w=2
> >
> > I copied-and-pasted the discussion of above-mentioned mails here.
> >
> > my question:
> > ```
> > Under some workload in Ceph, many processes got to hang_task. We found
> > that there are two kinds of processes.
> >
> > a) In very high CPU load
> > b) Encountered hang_task in the XFS
> >
> > In addition,a user got the following two kernel traces.
> >
> > A (b) process's backtrace with `hung_task_panic=1`.
> >
> > ```
> > [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> > [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
> > [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> > [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
> > [51717.039472] Call Trace:
> > [51717.039478]  __schedule+0x24e/0x880
> > [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> > [51717.039506]  schedule+0x2c/0x80
> > [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > [51717.039533]  ? wake_up_q+0x80/0x80
> > [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> > [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> > [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> > [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> > [51717.039626]  process_one_work+0x1de/0x420
> > [51717.039627]  worker_thread+0x32/0x410
> > [51717.039628]  kthread+0x121/0x140
> > [51717.039630]  ? process_one_work+0x420/0x420
> > [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [51717.039633]  ret_from_fork+0x35/0x40
> > ```
> >
> > A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> > process>/stack`
> >
> > ```
> > [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> > [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> > [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> > [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> > [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> > [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> > [<0>] generic_shutdown_super+0x72/0x120
> > [<0>] kill_block_super+0x2c/0x80
> > [<0>] deactivate_locked_super+0x48/0x80
> > [<0>] deactivate_super+0x40/0x60
> > [<0>] cleanup_mnt+0x3f/0x80
> > [<0>] __cleanup_mnt+0x12/0x20
> > [<0>] task_work_run+0x9d/0xc0
> > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > [<0>] do_syscall_64+0x121/0x130
> > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [<0>] 0xffffffffffffffff
> > ```
> >
> > Here is the result of my investigation:
> >
> > - I couldn't find any commit that would be related to this problem,
> > both in the upstream
> >   master and XFS's for-next
> > - I couldn't find any discussions that would be related to the
> > above-mentioned backtrace
> >   in linux-xfs ML
> > - There would be a problem in the transaction commit of XFS. In both
> > of two traces,
> >   (b) processes hung in _xfs_log_force_lsn+0x20e/0x350 [xfs]. This
> > code is one of
> >   the following two xlog_wait().
> >
> >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366
> >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3387
> >
> >   These processes released CPU voluntarily in the following line.
> >
> >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549
> >
> >   These two processes should be woken by the other process after that.
> >   However, unfortunately, it didn't happen.
> >
> > Test environment:
> > - kernel: 4.15.0-<x>-generic
> > - XFS # Anyone hasn't reported this problem with other filesystems yet.
> >
> > Related discussions:
> > - Issue of Rook:
> >   https://github.com/rook/rook/issues/3132
> > - Issue of Ceph
> >   https://tracker.ceph.com/issues/40068
> > ```
> >
> > Dave's answer:
> > ```
> > > Under some workload in Ceph, many processes got to hang_task. We found
> > > that there
> > > are two kinds of processes.
> > >
> > > a) In very high CPU load
> > > b) Encountered hang_task in the XFS
> > >
> > > In addition,a user got the following two kernel traces.
> > >
> > > A (b) process's backtrace with `hung_task_panic=1`.
> > >
> > > ```
> > > [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> > > [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
> >
> > Kinda old, and not an upstream LTS kernel, right?
> >
> > > [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> > > [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
> >
> > Filesystem is on a Ceph RBD device.
> >
> > > [51717.039472] Call Trace:
> > > [51717.039478]  __schedule+0x24e/0x880
> > > [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> > > [51717.039506]  schedule+0x2c/0x80
> > > [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > [51717.039533]  ? wake_up_q+0x80/0x80
> > > [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> > > [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> > > [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> > > [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> > > [51717.039626]  process_one_work+0x1de/0x420
> > > [51717.039627]  worker_thread+0x32/0x410
> > > [51717.039628]  kthread+0x121/0x140
> > > [51717.039630]  ? process_one_work+0x420/0x420
> > > [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [51717.039633]  ret_from_fork+0x35/0x40
> >
> > That's waiting for log IO completion.
> >
> > > ```
> > >
> > > A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> > > process>/stack`
> > >
> > > ```
> > > [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> > > [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> > > [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> > > [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> > > [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> > > [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> > > [<0>] generic_shutdown_super+0x72/0x120
> > > [<0>] kill_block_super+0x2c/0x80
> > > [<0>] deactivate_locked_super+0x48/0x80
> > > [<0>] deactivate_super+0x40/0x60
> > > [<0>] cleanup_mnt+0x3f/0x80
> > > [<0>] __cleanup_mnt+0x12/0x20
> > > [<0>] task_work_run+0x9d/0xc0
> > > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > > [<0>] do_syscall_64+0x121/0x130
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [<0>] 0xffffffffffffffff
> >
> > And this is the last reference to the filesystem being dropped and
> > it waiting for log IO completion.
> >
> > So, the filesystem has been unmounted, and it's waiting for journal
> > IO on the device to complete.  I wonder if a wakeup was missed
> > somewhere?
> >
> > Did the system stop/tear down /dev/rbd0 prematurely?
>
> Hi Satoru,
>
> I would ask the same question.  Typically this is seen when the network
> is shut off before the filesystem on top of rbd is fully unmounted.
>
> >
> > > Related discussions:
> > > - Issue of Rook:
> > >   https://github.com/rook/rook/issues/3132
> > > - Issue of Ceph
> > >   https://tracker.ceph.com/issues/40068
> >
> > These point to Ceph RBDs failing to respond under high load and
> > tasks hanging because they are waiting on IO. That's exactly the
> > symptoms you are reporting here. That points to it being a Ceph RBD
> > issue to me, especially the reports where rbd devices report no IO
> > load but the ceph back end is at 100% disk utilisation doing
> > -something-.
>
> I skimmed through https://github.com/rook/rook/issues/3132 and this
> particular stack trace came from someone who was upgrading their
> cluster, draining node after node and saw a umount process get stuck.
> This happens outside of k8s as well, but it is easier to hit with k8s
> because of different volume plugins and sometimes multiple layers of
> SDN involved.  One thing to do is to make sure that the filesystem on
> top of rbd gets mounted with _netdev option.  It has no effect on the
> kernel, but serves as a cue to systemd to order the unmount before
> network teardown and some container engines had trouble propagating it
> between their bind mounts in the past.  Of course, if you or k8s
> somehow shut the network down before umount runs, _netdev won't help.
>
> When dealing with suspected kernel issues, try to collect the entire
> dmesg, not just a couple of stack traces.  In this case we would have
> likely seen the kernel attempting to reconnect to the cluster over the
> dead network because XFS has not finished unmounting:
>
>   libceph: connect 10.19.115.5:6789 error -101
>   libceph: mon1 10.19.115.5:6789 connect error
>
>
> The rest of the issues in that ticket appear to have been resolved by
> switching from XFS to ext4 on top of rbd.  This is crucial when mapping
> rbd devices on the OSD nodes (i.e. co-locating the kernel client with
> the OSDs) because XFS is a lot more aggressive about memory reclaim and
> in doing that it is _much_ more likely to recurse back on itself, which
> results in a deadlock.  This co-location + XFS deadlock is not specific
> to rbd or ceph, here is someone who hit it with nbd:
>
>   https://www.spinics.net/lists/linux-xfs/msg26261.html
>
> There is no reliable way to prevent it in current kernels.  This is why
> rook and ceph-csi default to ext4.

This patch has just been merged for the 5.6 kernel.  Once it's out and ceph
daemons are updated to make use of PR_SET_IO_FLUSHER, XFS should become
usable in co-located setups.

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d19f1c8e1937baf74e1962aae9f90fa3aeab463

Thanks,

                Ilya


* Re: Some tasks hit hung_task timeouts, possibly due to the RBD driver
  2020-01-30 14:54   ` Ilya Dryomov
@ 2020-01-30 22:31     ` Satoru Takeuchi
  0 siblings, 0 replies; 4+ messages in thread
From: Satoru Takeuchi @ 2020-01-30 22:31 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hi Ilya,

> This patch has just merged for 5.6 kernel.  Once it's out and ceph
> daemons are updated to make use of PR_SET_IO_FLUSHER, XFS should become
> usable in co-located setups.
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d19f1c8e1937baf74e1962aae9f90fa3aeab463

Thank you very much for the detailed explanation!
I'll pass this information on to the Rook/Ceph folks.

Regards,
Satoru

On Thu, Jan 30, 2020 at 11:54 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Wed, Jan 29, 2020 at 12:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Wed, Jan 29, 2020 at 8:25 AM Satoru Takeuchi
> > <satoru.takeuchi@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > In Rook(*1)/Ceph community, some users encountered hang_task in XFS and it
> > > would be caused by RBD driver.
> > >
> > > Although we've not reproduced this problem in the newest kernel, could anyone
> > > give us any hint about this problem, if possible?
> > >
> > > *1) An Ceph orchestration in Kubernetes
> > >
> > > - A question about this problem in linux-xfs ML
> > >   https://marc.info/?l=linux-xfs&m=158009629016068&w=2
> > >
> > > - The reply to the above-mentioned question from Dave Chinner.
> > >   https://marc.info/?l=linux-xfs&m=158018349207976&w=2
> > >
> > > I copied-and-pasted the discussion of above-mentioned mails here.
> > >
> > > my question:
> > > ```
> > > Under some workload in Ceph, many processes got to hang_task. We found
> > > that there are two kinds of processes.
> > >
> > > a) In very high CPU load
> > > b) Encountered hang_task in the XFS
> > >
> > > In addition,a user got the following two kernel traces.
> > >
> > > A (b) process's backtrace with `hung_task_panic=1`.
> > >
> > > ```
> > > [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> > > [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
> > > [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> > > [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
> > > [51717.039472] Call Trace:
> > > [51717.039478]  __schedule+0x24e/0x880
> > > [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> > > [51717.039506]  schedule+0x2c/0x80
> > > [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > [51717.039533]  ? wake_up_q+0x80/0x80
> > > [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> > > [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> > > [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> > > [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> > > [51717.039626]  process_one_work+0x1de/0x420
> > > [51717.039627]  worker_thread+0x32/0x410
> > > [51717.039628]  kthread+0x121/0x140
> > > [51717.039630]  ? process_one_work+0x420/0x420
> > > [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [51717.039633]  ret_from_fork+0x35/0x40
> > > ```
> > >
> > > A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> > > process>/stack`
> > >
> > > ```
> > > [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> > > [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> > > [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> > > [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> > > [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> > > [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> > > [<0>] generic_shutdown_super+0x72/0x120
> > > [<0>] kill_block_super+0x2c/0x80
> > > [<0>] deactivate_locked_super+0x48/0x80
> > > [<0>] deactivate_super+0x40/0x60
> > > [<0>] cleanup_mnt+0x3f/0x80
> > > [<0>] __cleanup_mnt+0x12/0x20
> > > [<0>] task_work_run+0x9d/0xc0
> > > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > > [<0>] do_syscall_64+0x121/0x130
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [<0>] 0xffffffffffffffff
> > > ```
> > >
> > > Here is the result of my investigation:
> > >
> > > - I couldn't find any commit that would be related to this problem,
> > > both in the upstream
> > >   master and XFS's for-next
> > > - I couldn't find any discussions that would be related to the
> > > above-mentioned backtrace
> > >   in linux-xfs ML
> > > - There would be a problem in the transaction commit of XFS. In both
> > > of two traces,
> > >   (b) processes hung in _xfs_log_force_lsn+0x20e/0x350 [xfs]. This
> > > code is one of
> > >   the following two xlog_wait().
> > >
> > >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3366
> > >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log.c#L3387
> > >
> > >   These processes released CPU voluntarily in the following line.
> > >
> > >   https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_log_priv.h#L549
> > >
> > >   These two processes should be woken by the other process after that.
> > >   However, unfortunately, it didn't happen.
> > >
> > > Test environment:
> > > - kernel: 4.15.0-<x>-generic
> > > - XFS # Anyone hasn't reported this problem with other filesystems yet.
> > >
> > > Related discussions:
> > > - Issue of Rook:
> > >   https://github.com/rook/rook/issues/3132
> > > - Issue of Ceph
> > >   https://tracker.ceph.com/issues/40068
> > > ```
> > >
> > > Dave's answer:
> > > ```
> > > > Under some workload in Ceph, many processes got to hang_task. We found
> > > > that there
> > > > are two kinds of processes.
> > > >
> > > > a) In very high CPU load
> > > > b) Encountered hang_task in the XFS
> > > >
> > > > In addition,a user got the following two kernel traces.
> > > >
> > > > A (b) process's backtrace with `hung_task_panic=1`.
> > > >
> > > > ```
> > > > [51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
> > > > [51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
> > >
> > > Kinda old, and not an upstream LTS kernel, right?
> > >
> > > > [51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [51717.039426] kworker/2:1     D    0  5938      2 0x80000000
> > > > [51717.039471] Workqueue: xfs-sync/rbd0 xfs_log_worker [xfs]
> > >
> > > Filesystem is on a Ceph RBD device.
> > >
> > > > [51717.039472] Call Trace:
> > > > [51717.039478]  __schedule+0x24e/0x880
> > > > [51717.039504]  ? xlog_sync+0x2d5/0x3c0 [xfs]
> > > > [51717.039506]  schedule+0x2c/0x80
> > > > [51717.039530]  _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > > [51717.039533]  ? wake_up_q+0x80/0x80
> > > > [51717.039556]  __xfs_trans_commit+0x20b/0x280 [xfs]
> > > > [51717.039577]  xfs_trans_commit+0x10/0x20 [xfs]
> > > > [51717.039600]  xfs_sync_sb+0x6d/0x80 [xfs]
> > > > [51717.039623]  xfs_log_worker+0xe7/0x100 [xfs]
> > > > [51717.039626]  process_one_work+0x1de/0x420
> > > > [51717.039627]  worker_thread+0x32/0x410
> > > > [51717.039628]  kthread+0x121/0x140
> > > > [51717.039630]  ? process_one_work+0x420/0x420
> > > > [51717.039631]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > [51717.039633]  ret_from_fork+0x35/0x40
> > >
> > > That's waiting for log IO completion.
> > >
> > > > ```
> > > >
> > > > A (b) process's backtrace that is got by `sudo cat /proc/<PID of a D
> > > > process>/stack`
> > > >
> > > > ```
> > > > [<0>] _xfs_log_force_lsn+0x20e/0x350 [xfs]
> > > > [<0>] __xfs_trans_commit+0x20b/0x280 [xfs]
> > > > [<0>] xfs_trans_commit+0x10/0x20 [xfs]
> > > > [<0>] xfs_sync_sb+0x6d/0x80 [xfs]
> > > > [<0>] xfs_log_sbcount+0x4b/0x60 [xfs]
> > > > [<0>] xfs_unmountfs+0xe7/0x200 [xfs]
> > > > [<0>] xfs_fs_put_super+0x3e/0xb0 [xfs]
> > > > [<0>] generic_shutdown_super+0x72/0x120
> > > > [<0>] kill_block_super+0x2c/0x80
> > > > [<0>] deactivate_locked_super+0x48/0x80
> > > > [<0>] deactivate_super+0x40/0x60
> > > > [<0>] cleanup_mnt+0x3f/0x80
> > > > [<0>] __cleanup_mnt+0x12/0x20
> > > > [<0>] task_work_run+0x9d/0xc0
> > > > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > > > [<0>] do_syscall_64+0x121/0x130
> > > > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > > [<0>] 0xffffffffffffffff
> > >
> > > And this is the last reference to the filesystem being dropped and
> > > it waiting for log IO completion.
> > >
> > > So, the filesystem has been unmounted, and it's waiting for journal
> > > IO on the device to complete.  I wonder if a wakeup was missed
> > > somewhere?
> > >
> > > Did the system stop/tear down /dev/rbd0 prematurely?
> >
> > Hi Satoru,
> >
> > I would ask the same question.  Typically this is seen when the network
> > is shut off before the filesystem on top of rbd is fully unmounted.
> >
> > >
> > > > Related discussions:
> > > > - Issue of Rook:
> > > >   https://github.com/rook/rook/issues/3132
> > > > - Issue of Ceph
> > > >   https://tracker.ceph.com/issues/40068
> > >
> > > These point to Ceph RBDs failing to respond under high load and
> > > tasks hanging because they are waiting on IO. That's exactly the
> > > symptoms you are reporting here. That points to it being a Ceph RBD
> > > issue to me, especially the reports where rbd devices report no IO
> > > load but the ceph back end is at 100% disk utilisation doing
> > > -something-.
> >
> > I skimmed through https://github.com/rook/rook/issues/3132 and this
> > particular stack trace came from someone who was upgrading their
> > cluster, draining node after node and saw a umount process get stuck.
> > This happens outside of k8s as well, but it is easier to hit with k8s
> > because of different volume plugins and sometimes multiple layers of
> > SDN involved.  One thing to do is to make sure that the filesystem on
> > top of rbd gets mounted with _netdev option.  It has no effect on the
> > kernel, but serves as a cue to systemd to order the unmount before
> > network teardown and some container engines had trouble propagating it
> > between their bind mounts in the past.  Of course, if you or k8s
> > somehow shut the network down before umount runs, _netdev won't help.
> >
> > When dealing with suspected kernel issues, try to collect the entire
> > dmesg, not just a couple of stack traces.  In this case we would have
> > likely seen the kernel attempting to reconnect to the cluster over the
> > dead network because XFS has not finished unmounting:
> >
> >   libceph: connect 10.19.115.5:6789 error -101
> >   libceph: mon1 10.19.115.5:6789 connect error
> >
> >
> > The rest of the issues in that ticket appear to have been resolved by
> > switching from XFS to ext4 on top of rbd.  This is crucial when mapping
> > rbd devices on the OSD nodes (i.e. co-locating the kernel client with
> > the OSDs) because XFS is a lot more aggressive about memory reclaim and
> > in doing that it is _much_ more likely to recurse back on itself, which
> > results in a deadlock.  This co-location + XFS deadlock is not specific
> > to rbd or ceph, here is someone who hit it with nbd:
> >
> >   https://www.spinics.net/lists/linux-xfs/msg26261.html
> >
> > There is no reliable way to prevent it in current kernels.  This is why
> > rook and ceph-csi default to ext4.
>
> This patch has just merged for 5.6 kernel.  Once it's out and ceph
> daemons are updated to make use of PR_SET_IO_FLUSHER, XFS should become
> usable in co-located setups.
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d19f1c8e1937baf74e1962aae9f90fa3aeab463
>
> Thanks,
>
>                 Ilya
