All of lore.kernel.org
 help / color / mirror / Atom feed
* [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
@ 2018-08-30 14:32 bugzilla-daemon
  2018-08-30 14:34 ` [Bug 200981] " bugzilla-daemon
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-08-30 14:32 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

            Bug ID: 200981
           Summary: hypervisor fs hangs at heavy write activity on VM
                    (kvm, qcow2 image) having a reflink disk copy
           Product: File System
           Version: 2.5
    Kernel Version: 4.18.5
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: XFS
          Assignee: filesystem_xfs@kernel-bugs.kernel.org
          Reporter: git.user@gmail.com
        Regression: No

Created attachment 278203
  --> https://bugzilla.kernel.org/attachment.cgi?id=278203&action=edit
dmesg

kernel: vanilla 4.18.5
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

More or less reproducible for me using next sequence:

- on host:
  create LV of appropriate size (20g in my case)
  mkfs.xfs -m reflink=1 /dev/data/LV
  mount /dev/data/LV /mnt/
  run kvm VM with qcow2 image (/mnt/disk)    

- inside vm:
  sysbench --test=fileio --file-total-size=9G prepare

- on host:
  cp --reflink=always disk disk.b

- inside vm: 
  sysbench --test=fileio --file-total-size=9G --file-test-mode=seqwr
--max-time=6000 --max-requests=0 --threads=16 run

Some time after i/o on /dev/data/LV fall to zero and fs become completely
unavailable and then I see a bunch of records:

[ 2580.058205] INFO: task worker:6343 blocked for more than 120 seconds.
[ 2580.064719]       Not tainted 4.18.5 #1
[ 2580.068614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 2580.076496] worker          D    0  6343      1 0x00000000
[ 2580.082034] Call Trace:
[ 2580.084532]  ? __schedule+0x386/0xc50
[ 2580.088248]  ? xlog_grant_head_wait+0xa3/0x3a0
[ 2580.092741]  schedule+0x2f/0x90
[ 2580.095932]  xlog_grant_head_wait+0x53/0x3a0
[ 2580.100256]  xlog_grant_head_check+0xb3/0x160
[ 2580.104662]  xfs_log_reserve+0x108/0x3f0
[ 2580.108682]  xfs_trans_reserve+0x1b4/0x2b0
[ 2580.112948]  xfs_trans_alloc+0xbe/0x220
[ 2580.116952]  xfs_vn_update_time+0xcb/0x2b0
[ 2580.121220]  ? current_time+0x4d/0x90
[ 2580.125047]  file_update_time+0xe0/0x120
[ 2580.129139]  xfs_file_aio_write_checks+0x14f/0x2d0
[ 2580.134099]  xfs_file_dio_aio_write+0xcc/0x420
[ 2580.138715]  xfs_file_write_iter+0x7b/0xa0
[ 2580.142978]  do_iter_readv_writev+0x139/0x190
[ 2580.147502]  do_iter_write+0x7f/0x1c0
[ 2580.151329]  vfs_writev+0x98/0x110
[ 2580.154907]  ? lock_acquire+0x8e/0x230
[ 2580.158823]  ? __fget+0x5/0x200
[ 2580.162131]  ? do_pwritev+0x9c/0xe0
[ 2580.165782]  ? __fget_light+0x51/0x60
[ 2580.169614]  do_pwritev+0x9c/0xe0
[ 2580.173095]  do_syscall_64+0x5a/0x190
[ 2580.176922]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2580.182138] RIP: 0033:0x7fe1937b784a
[ 2580.185836] Code: Bad RIP value.
[ 2580.189239] RSP: 002b:00007fe05e1f5850 EFLAGS: 00000246 ORIG_RAX:
0000000000000128
[ 2580.197058] RAX: ffffffffffffffda RBX: 0000000000000014 RCX:
00007fe1937b784a
[ 2580.204361] RDX: 000000000000001e RSI: 0000564bf126f200 RDI:
0000000000000014
[ 2580.211660] RBP: 0000564bf126f200 R08: 0000000000000000 R09:
0000000000000000
[ 2580.218968] R10: 00000000dccf0000 R11: 0000000000000246 R12:
000000000000001e
[ 2580.226265] R13: 00000000dccf0000 R14: 0000564bf13312a0 R15:
00007fe05e9f67a0

Full dmesg attached

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
@ 2018-08-30 14:34 ` bugzilla-daemon
  2018-08-31  1:26 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-08-30 14:34 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #1 from Alexander Y.  Fomichev (git.user@gmail.com) ---
Created attachment 278205
  --> https://bugzilla.kernel.org/attachment.cgi?id=278205&action=edit
config

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
  2018-08-30 14:34 ` [Bug 200981] " bugzilla-daemon
@ 2018-08-31  1:26 ` bugzilla-daemon
  2018-08-31  7:39 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-08-31  1:26 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #2 from Dave Chinner (david@fromorbit.com) ---
On Thu, Aug 30, 2018 at 02:32:35PM +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=200981
> 
> kernel: vanilla 4.18.5
> gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> More or less reproducible for me using next sequence:
> 
> - on host:
>   create LV of appropriate size (20g in my case)
>   mkfs.xfs -m reflink=1 /dev/data/LV
>   mount /dev/data/LV /mnt/
>   run kvm VM with qcow2 image (/mnt/disk)    
> 
> - inside vm:
>   sysbench --test=fileio --file-total-size=9G prepare
> 
> - on host:
>   cp --reflink=always disk disk.b
> 
> - inside vm: 
>   sysbench --test=fileio --file-total-size=9G --file-test-mode=seqwr
> --max-time=6000 --max-requests=0 --threads=16 run
> 
> Some time after i/o on /dev/data/LV fall to zero and fs become completely
> unavailable and then I see a bunch of records:

The first error is this:

[ 2212.046108] ================================================
[ 2212.051809] WARNING: lock held when returning to user space!
[ 2212.057511] 4.18.5 #1 Not tainted
[ 2212.060864] ------------------------------------------------
[ 2212.066564] worker/6123 is leaving the kernel with locks still held!
[ 2212.072961] 1 lock held by worker/6123:
[ 2212.076835]  #0: 000000009eab4f1b (sb_internal#2){.+.+}, at:
xfs_trans_alloc+0x17c/0x220

Which happens 5 minutes before the hung processes start being
reported. Looks like something has gone wrong and an error path has
leaked a transaction.

Can you see if commit dcbd44f79986 ("xfs: fix transaction leak on
remote attr set/remove failure") addresses the problem you are
seeing?

-Dave.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
  2018-08-30 14:34 ` [Bug 200981] " bugzilla-daemon
  2018-08-31  1:26 ` bugzilla-daemon
@ 2018-08-31  7:39 ` bugzilla-daemon
  2018-09-01  5:01 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-08-31  7:39 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #3 from Alexander Y.  Fomichev (git.user@gmail.com) ---
> Can you see if commit dcbd44f79986 ("xfs: fix transaction leak on
remote attr set/remove failure") addresses the problem you are
seeing?

I've just tried but see no difference.
Warning still here and vm become blocked some time after.
( It worth noting I've CONFIG_LOCKDEP turned on )

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (2 preceding siblings ...)
  2018-08-31  7:39 ` bugzilla-daemon
@ 2018-09-01  5:01 ` bugzilla-daemon
  2018-09-06  5:36 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-01  5:01 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

Dave Chinner (david@fromorbit.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |david@fromorbit.com

--- Comment #4 from Dave Chinner (david@fromorbit.com) ---
Ok, so the obvious fix doesn't help.

Any idea why lockdep is not giving you the full stack trace of where the lock
that was leaked to userspace was first accessed? Without that, we have no idea
which of the 70-odd calls to xfs_trans_alloc() in the XFS code the lockdep
warning refers to. At minimum, we need to know what syscall is triggering the
warning so we know what code to look at.

Also, can you reproduce this on older kernels? If not, could you bisect it?

-Dave.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (3 preceding siblings ...)
  2018-09-01  5:01 ` bugzilla-daemon
@ 2018-09-06  5:36 ` bugzilla-daemon
  2018-09-06  5:37 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-06  5:36 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #5 from Alexander Y.  Fomichev (git.user@gmail.com) ---
Created attachment 278345
  --> https://bugzilla.kernel.org/attachment.cgi?id=278345&action=edit
dmesg showing all locks held in the system

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (4 preceding siblings ...)
  2018-09-06  5:36 ` bugzilla-daemon
@ 2018-09-06  5:37 ` bugzilla-daemon
  2018-09-06  5:39 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-06  5:37 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #6 from Alexander Y.  Fomichev (git.user@gmail.com) ---
Created attachment 278347
  --> https://bugzilla.kernel.org/attachment.cgi?id=278347&action=edit
missing sb_end_intwrite: proof of concept

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (5 preceding siblings ...)
  2018-09-06  5:37 ` bugzilla-daemon
@ 2018-09-06  5:39 ` bugzilla-daemon
  2018-09-06 14:25 ` bugzilla-daemon
  2018-09-06 16:52 ` bugzilla-daemon
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-06  5:39 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #7 from Alexander Y.  Fomichev (git.user@gmail.com) ---
Ok, it looks like a warning is a result of missing sb_end_intwrite in
xfs_reflink_allocate_cow, namely in a sequence xfs_trans_alloc -> goto retry ->
goto convert. With proof-of-concept patch (attached) warning is gone but
deadlock is not. Though now I have a list of held locks (dmesg attached).
Did you mean taintad kernel  will not show locks held?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (6 preceding siblings ...)
  2018-09-06  5:39 ` bugzilla-daemon
@ 2018-09-06 14:25 ` bugzilla-daemon
  2018-09-06 16:52 ` bugzilla-daemon
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-06 14:25 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #8 from Dave Chinner (david@fromorbit.com) ---
(In reply to Alexander Y.  Fomichev from comment #7)
> Ok, it looks like a warning is a result of missing y in
> xfs_reflink_allocate_cow, namely in a sequence xfs_trans_alloc -> goto retry
> -> goto convert.

Yup, I see the issue. Nice work finding this!

> With proof-of-concept patch (attached) warning is gone but
> deadlock is not.

The code is not missing a sb_end_intwrite() call - it's missing a
xfs_trans_cancel() call. That's why it still deadlocks even though you shut up
the warning. I'll have a closer look in the morning to see if there's any other
paths that can leak the transaction and write a patch to fix it.

Thanks again for tracking it down, Alexander :)

Cheers,

Dave.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug 200981] hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy
  2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
                   ` (7 preceding siblings ...)
  2018-09-06 14:25 ` bugzilla-daemon
@ 2018-09-06 16:52 ` bugzilla-daemon
  8 siblings, 0 replies; 10+ messages in thread
From: bugzilla-daemon @ 2018-09-06 16:52 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=200981

--- Comment #9 from Alexander Y.  Fomichev (git.user@gmail.com) ---
I confirm. With xfs_trans_cancel() lock is gone and works like a charm. Thanks.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-09-06 21:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-30 14:32 [Bug 200981] New: hypervisor fs hangs at heavy write activity on VM (kvm, qcow2 image) having a reflink disk copy bugzilla-daemon
2018-08-30 14:34 ` [Bug 200981] " bugzilla-daemon
2018-08-31  1:26 ` bugzilla-daemon
2018-08-31  7:39 ` bugzilla-daemon
2018-09-01  5:01 ` bugzilla-daemon
2018-09-06  5:36 ` bugzilla-daemon
2018-09-06  5:37 ` bugzilla-daemon
2018-09-06  5:39 ` bugzilla-daemon
2018-09-06 14:25 ` bugzilla-daemon
2018-09-06 16:52 ` bugzilla-daemon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.