* [Cluster-devel] gfs2 hang in xfstests generic/361
@ 2021-07-26 14:00 Christoph Hellwig
  2021-07-26 14:49 ` Bob Peterson
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-07-26 14:00 UTC (permalink / raw)
  To: cluster-devel.redhat.com

I noticed this hang while testing the iomap_iter series on gfs2,
but it also reproduces on 5.14-rc3.  This is running locally with
"-O -p lock_nolock" (a rough local.config sketch follows after the log):

generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
[ 1481.188746] gfs2: fsid=vdc: Trying to join cluster "lock_nolock", "vdc"
[ 1481.190270] gfs2: fsid=vdc: Now mounting FS (format 1801)...
[ 1481.199160] gfs2: fsid=vdc.0: journal 0 mapped with 1 extents in 0ms
[ 1481.202347] gfs2: fsid=vdc.0: jid=0, already locked for use
[ 1481.203368] gfs2: fsid=vdc.0: jid=0: Looking at journal...
[ 1481.269554] gfs2: fsid=vdc.0: jid=0: Journal head lookup took 66ms
[ 1481.270730] gfs2: fsid=vdc.0: jid=0: Done
[ 1481.271802] gfs2: fsid=vdc.0: first mount done, others may mount
[ 1481.371768] loop0: detected capacity change from 0 to 2097152
[ 1482.799919] gfs2: fsid=loop0: Trying to join cluster "lock_nolock", "loop0"
[ 1482.803425] gfs2: fsid=loop0: Now mounting FS (format 1801)...
[ 1482.815215] gfs2: fsid=loop0.0: journal 0 mapped with 1 extents in 0ms
[ 1482.819376] gfs2: fsid=loop0.0: jid=0, already locked for use
[ 1482.822181] gfs2: fsid=loop0.0: jid=0: Looking at journal...
[ 1483.005770] gfs2: fsid=loop0.0: jid=0: Journal head lookup took 183ms
[ 1483.007077] gfs2: fsid=loop0.0: jid=0: Done
[ 1483.007898] gfs2: fsid=loop0.0: first mount done, others may mount
[ 1491.612990] loop: Write error at byte offset 468144128, length 4096.
[ 1491.615612] blk_update_request: I/O error, dev loop0, sector 913152 op 0x1:(WRITE) fl0
[ 1491.617029] loop: Write error at byte offset 468844544, length 4096.
[ 1491.623654] blk_update_request: I/O error, dev loop0, sector 915712 op 0x1:(WRITE) fl0
[ 1491.625416] loop: Write error at byte offset 469622784, length 4096.
[ 1491.630162] loop0: writeback error on inode 16708, offset 398209024, sector 917224
[ 1491.631682] blk_update_request: I/O error, dev loop0, sector 917232 op 0x1:(WRITE) fl0
[ 1491.633448] loop: Write error at byte offset 470933504, length 4096.
[ 1491.641790] blk_update_request: I/O error, dev loop0, sector 919792 op 0x1:(WRITE) fl0
[ 1491.642736] loop: Write error at byte offset 471711744, length 4096.
[ 1491.647210] loop0: writeback error on inode 16708, offset 400293888, sector 921304
[ 1491.648540] blk_update_request: I/O error, dev loop0, sector 921312 op 0x1:(WRITE) fl0
[ 1491.649673] loop: Write error at byte offset 473022464, length 4096.
[ 1491.655308] blk_update_request: I/O error, dev loop0, sector 923872 op 0x1:(WRITE) fl0
[ 1491.656176] loop: Write error at byte offset 473800704, length 4096.
[ 1491.658274] loop0: writeback error on inode 16708, offset 402378752, sector 925384
[ 1491.659257] blk_update_request: I/O error, dev loop0, sector 925392 op 0x1:(WRITE) fl0
[ 1491.660052] loop: Write error at byte offset 475111424, length 4096.
[ 1491.663825] blk_update_request: I/O error, dev loop0, sector 927952 op 0x1:(WRITE) fl0
[ 1491.663872] loop: Write error at byte offset 475889664, length 4096.
[ 1491.666004] loop0: writeback error on inode 16708, offset 404463616, sector 929464
[ 1491.666880] blk_update_request: I/O error, dev loop0, sector 929472 op 0x1:(WRITE) fl0
[ 1491.667670] loop: Write error at byte offset 477200384, length 4096.
[ 1491.671450] blk_update_request: I/O error, dev loop0, sector 932032 op 0x1:(WRITE) fl0
[ 1491.674066] loop0: writeback error on inode 16708, offset 406548480, sector 933544
[ 1491.674488] loop0: writeback error on inode 16708, offset 408633344, sector 937624
[ 1491.676163] loop0: writeback error on inode 16708, offset 410718208, sector 941704
[ 1491.678042] loop0: writeback error on inode 16708, offset 412803072, sector 945784
[ 1491.679677] loop0: writeback error on inode 16708, offset 414887936, sector 949864
[ 1491.681342] loop0: writeback error on inode 16708, offset 416972800, sector 953944
[ 1491.713431] Buffer I/O error on dev loop0, logical block 17192, lost async page write
[ 1491.716453] Buffer I/O error on dev loop0, logical block 17219, lost async page write
[ 1491.717967] Buffer I/O error on dev loop0, logical block 17729, lost async page write
[ 1491.719021] Buffer I/O error on dev loop0, logical block 77399, lost async page write
[ 1491.719333] Buffer I/O error on dev loop0, logical block 18239, lost async page write
[ 1491.720966] Buffer I/O error on dev loop0, logical block 77933, lost async page write
[ 1491.722209] Buffer I/O error on dev loop0, logical block 18749, lost async page write
[ 1491.723602] Buffer I/O error on dev loop0, logical block 78443, lost async page write
[ 1491.724989] Buffer I/O error on dev loop0, logical block 19259, lost async page write
[ 1491.726488] Buffer I/O error on dev loop0, logical block 78953, lost async page write
[ 1491.744459] gfs2: fsid=loop0.0: fatal: I/O error
[ 1491.744459]   block = 17192
[ 1491.744459]   function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 323
[ 1491.747491] gfs2: fsid=loop0.0: fatal: I/O error(s)
[ 1491.748477] gfs2: fsid=loop0.0: about to withdraw this file system
[ 1491.752284] 
[ 1491.752587] =============================
[ 1491.753403] [ BUG: Invalid wait context ]
[ 1491.754122] 5.14.0-rc2+ #47 Not tainted
[ 1491.754860] -----------------------------
[ 1491.755563] kworker/2:1H/1975 is trying to lock:
[ 1491.756370] ffff8881048d0888 (&wq->mutex){+.+.}-{3:3}, at: flush_workqueue+0xc9/0x5f0
[ 1491.757736] other info that might help us debug this:
[ 1491.758622] context-{4:4}
[ 1491.759087] 4 locks held by kworker/2:1H/1975:
[ 1491.759863]  #0: ffff888101717b38 ((wq_completion)glock_workqueue){+.+.}-{0:0}, at: p0
[ 1491.761623]  #1: ffffc900040dfe78
((work_completion)(&(&gl->gl_work)->work)){+.+.}-{00
[ 1491.763528]  #2: ffff88811ce6b000 (&sdp->sd_log_flush_lock){++++}-{3:3}, at: gfs2_log0
[ 1491.765284]  #3: ffff88811ce6ae28 (&sdp->sd_log_lock){+.+.}-{2:2}, at: gfs2_flush_rev0
[ 1491.767064] stack backtrace:
[ 1491.767629] CPU: 2 PID: 1975 Comm: kworker/2:1H Not tainted 5.14.0-rc2+ #47
[ 1491.769000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
[ 1491.770641] Workqueue: glock_workqueue glock_work_func
[ 1491.771635] Call Trace:
[ 1491.772101]  dump_stack_lvl+0x45/0x59
[ 1491.772777]  __lock_acquire.cold+0x2a2/0x2be
[ 1491.773529]  ? save_trace+0x3e/0x380
[ 1491.774160]  lock_acquire+0xc9/0x2f0
[ 1491.774815]  ? flush_workqueue+0xc9/0x5f0
[ 1491.775521]  __mutex_lock+0x75/0x870
[ 1491.776151]  ? flush_workqueue+0xc9/0x5f0
[ 1491.776856]  ? flush_workqueue+0xc9/0x5f0
[ 1491.777560]  ? lock_release+0x13c/0x2e0
[ 1491.778234]  flush_workqueue+0xc9/0x5f0
[ 1491.779012]  gfs2_make_fs_ro+0x2b/0x2b0
[ 1491.779687]  gfs2_withdraw.cold+0x16f/0x4bd
[ 1491.780424]  ? gfs2_freeze_lock+0x24/0x60
[ 1491.781129]  gfs2_ail1_empty+0x305/0x310
[ 1491.781821]  gfs2_flush_revokes+0x29/0x40
[ 1491.782526]  revoke_lo_before_commit+0x12/0x1c0
[ 1491.783324]  gfs2_log_flush+0x337/0xb00
[ 1491.784001]  inode_go_sync+0x8e/0x200
[ 1491.784663]  do_xmote+0xd2/0x380
[ 1491.785268]  glock_work_func+0x57/0x130
[ 1491.785944]  process_one_work+0x237/0x560
[ 1491.786660]  worker_thread+0x4d/0x3b0
[ 1491.787307]  ? process_one_work+0x560/0x560
[ 1491.788042]  kthread+0x13f/0x170
[ 1491.788617]  ? set_kthread_struct+0x40/0x40
[ 1491.789352]  ret_from_fork+0x22/0x30
[ 1491.790084] BUG: scheduling while atomic: kworker/2:1H/1975/0x00000002
[ 1491.791316] INFO: lockdep is turned off.
[ 1491.792119] Modules linked in:
[ 1491.792734] Preemption disabled at:
[ 1491.792735] [<0000000000000000>] 0x0
[ 1491.793971] CPU: 2 PID: 1975 Comm: kworker/2:1H Not tainted 5.14.0-rc2+ #47
[ 1491.795160] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
[ 1491.796618] Workqueue: glock_workqueue glock_work_func
[ 1491.797531] Call Trace:
[ 1491.797968]  dump_stack_lvl+0x45/0x59
[ 1491.798621]  __schedule_bug.cold+0x9c/0xad
[ 1491.799340]  __schedule+0x8e2/0xbb0
[ 1491.799956]  ? lock_release+0x201/0x2e0
[ 1491.800633]  schedule+0x48/0xf0
[ 1491.801241]  schedule_timeout+0xaf/0x140
[ 1491.802045]  ? trace_hardirqs_on+0x1b/0xe0
[ 1491.802840]  ? destroy_timer_on_stack+0x10/0x10
[ 1491.803658]  gfs2_make_fs_ro+0x249/0x2b0
[ 1491.804417]  ? do_wait_intr_irq+0xa0/0xa0
[ 1491.805141]  gfs2_withdraw.cold+0x16f/0x4bd
[ 1491.805885]  ? gfs2_freeze_lock+0x24/0x60
[ 1491.806685]  gfs2_ail1_empty+0x305/0x310
[ 1491.807495]  gfs2_flush_revokes+0x29/0x40
[ 1491.808281]  revoke_lo_before_commit+0x12/0x1c0
[ 1491.809198]  gfs2_log_flush+0x337/0xb00
[ 1491.809976]  inode_go_sync+0x8e/0x200
[ 1491.810716]  do_xmote+0xd2/0x380
[ 1491.811357]  glock_work_func+0x57/0x130
[ 1491.812116]  process_one_work+0x237/0x560
[ 1491.812882]  worker_thread+0x4d/0x3b0
[ 1491.813612]  ? process_one_work+0x560/0x560
[ 1491.814342]  kthread+0x13f/0x170
[ 1491.814918]  ? set_kthread_struct+0x40/0x40
[ 1491.815647]  ret_from_fork+0x22/0x30

[ 1496.964925] gfs2: fsid=loop0.0: Journal recovery skipped for 0 until next mount.
[ 1496.970088] gfs2: fsid=loop0.0: Glock dequeues delayed: 0
[ 1496.971948] gfs2: fsid=loop0.0: File system withdrawn
[ 1496.973558] CPU: 2 PID: 1975 Comm: kworker/2:1H Tainted: G        W 5.14.0-rc7
[ 1496.976692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
[ 1496.979972] Workqueue: glock_workqueue glock_work_func
[ 1496.981980] Call Trace:
[ 1496.983015]  dump_stack_lvl+0x45/0x59
[ 1496.984674]  gfs2_withdraw.cold+0xd3/0x4bd
[ 1496.986840]  gfs2_ail1_empty+0x305/0x310
[ 1496.988888]  gfs2_flush_revokes+0x29/0x40
[ 1496.990914]  revoke_lo_before_commit+0x12/0x1c0
[ 1496.992965]  gfs2_log_flush+0x337/0xb00
[ 1496.995501]  inode_go_sync+0x8e/0x200
[ 1496.998030]  do_xmote+0xd2/0x380
[ 1496.999681]  glock_work_func+0x57/0x130
[ 1497.001480]  process_one_work+0x237/0x560
[ 1497.003512]  worker_thread+0x4d/0x3b0
[ 1497.005249]  ? process_one_work+0x560/0x560
[ 1497.007208]  kthread+0x13f/0x170
[ 1497.008761]  ? set_kthread_struct+0x40/0x40
[ 1497.010758]  ret_from_fork+0x22/0x30
[ 1502.086199] gfs2: fsid=loop0.0: warning: assertion "gfs2_log_is_empty(sdp)" failed at6
[ 1502.093118] CPU: 1 PID: 31026 Comm: mount Tainted: G        W 5.14.0-rc2+ #47
[ 1502.096754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
[ 1502.100580] Call Trace:
[ 1502.101744]  dump_stack_lvl+0x45/0x59
[ 1502.103470]  gfs2_assert_warn_i.cold+0x2e/0x77
[ 1502.105537]  gfs2_make_fs_ro+0x2a3/0x2b0
[ 1502.107381]  ? do_wait_intr_irq+0xa0/0xa0
[ 1502.109256]  gfs2_reconfigure+0x290/0x3c0
[ 1502.111144]  ? gfs2_freeze_lock+0x24/0x60
[ 1502.113008]  reconfigure_super+0x9f/0x220
[ 1502.114909]  path_mount+0x96a/0xb40
[ 1502.116549]  __x64_sys_mount+0xfe/0x140
[ 1502.118358]  do_syscall_64+0x3b/0x90
[ 1502.120101]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1502.122652] RIP: 0033:0x7f591889dfea
[ 1502.124426] Code: 48 8b 0d a9 0e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 08
[ 1502.133077] RSP: 002b:00007ffde8ccc818 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1502.136618] RAX: ffffffffffffffda RBX: 000055d76690c970 RCX: 00007f591889dfea
[ 1502.139919] RDX: 000055d76690cb80 RSI: 000055d766910280 RDI: 000055d7669133d0
[ 1502.143309] RBP: 00007f5918beb1c4 R08: 000055d76690cba0 R09: 00007f5918960130
[ 1502.146999] R10: 0000000000200021 R11: 0000000000000246 R12: 0000000000000000
[ 1502.151191] R13: 0000000000200021 R14: 000055d7669133d0 R15: 000055d76690cb80
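
For reference, the "-O -p lock_nolock" setup maps onto an xfstests
local.config roughly like this (a sketch only; the device paths and
mount points below are placeholders, not the exact ones used):

export FSTYP=gfs2
export MKFS_OPTIONS="-O -p lock_nolock"
export TEST_DEV=/dev/vdc
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdd
export SCRATCH_MNT=/mnt/scratch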






* [Cluster-devel] gfs2 hang in xfstests generic/361
  2021-07-26 14:00 [Cluster-devel] gfs2 hang in xfstests generic/361 Christoph Hellwig
@ 2021-07-26 14:49 ` Bob Peterson
  2021-07-26 17:11 ` Bob Peterson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Bob Peterson @ 2021-07-26 14:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3.  This is running locally with
> "-O -p lock_nolock":
> 
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.744459] gfs2: fsid=loop0.0: fatal: I/O error
> [ 1491.744459]   block = 17192
> [ 1491.744459]   function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 323
> [ 1491.747491] gfs2: fsid=loop0.0: fatal: I/O error(s)
> [ 1491.748477] gfs2: fsid=loop0.0: about to withdraw this file system
> [ 1491.752284]
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:
> [ 1491.756370] ffff8881048d0888 (&wq->mutex){+.+.}-{3:3}, at: flush_workqueue+0xc9/0x5f0
> [ 1491.757736] other info that might help us debug this:
> [ 1491.758622] context-{4:4}
> [ 1491.759087] 4 locks held by kworker/2:1H/1975:
> [ 1491.759863]  #0: ffff888101717b38 ((wq_completion)glock_workqueue){+.+.}-{0:0}, at: p0
> [ 1491.761623]  #1: ffffc900040dfe78
> ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{00
> [ 1491.763528]  #2: ffff88811ce6b000 (&sdp->sd_log_flush_lock){++++}-{3:3}, at: gfs2_log0
> [ 1491.765284]  #3: ffff88811ce6ae28 (&sdp->sd_log_lock){+.+.}-{2:2}, at: gfs2_flush_rev0
> [ 1491.767064] stack backtrace:
> [ 1491.767629] CPU: 2 PID: 1975 Comm: kworker/2:1H Not tainted 5.14.0-rc2+ #47
> [ 1491.769000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
> [ 1491.770641] Workqueue: glock_workqueue glock_work_func
> [ 1491.771635] Call Trace:
> [ 1491.772101]  dump_stack_lvl+0x45/0x59
> [ 1491.772777]  __lock_acquire.cold+0x2a2/0x2be
> [ 1491.773529]  ? save_trace+0x3e/0x380
> [ 1491.774160]  lock_acquire+0xc9/0x2f0
> [ 1491.774815]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.775521]  __mutex_lock+0x75/0x870
> [ 1491.776151]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.776856]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.777560]  ? lock_release+0x13c/0x2e0
> [ 1491.778234]  flush_workqueue+0xc9/0x5f0
> [ 1491.779012]  gfs2_make_fs_ro+0x2b/0x2b0
> [ 1491.779687]  gfs2_withdraw.cold+0x16f/0x4bd
> [ 1491.780424]  ? gfs2_freeze_lock+0x24/0x60
> [ 1491.781129]  gfs2_ail1_empty+0x305/0x310
> [ 1491.781821]  gfs2_flush_revokes+0x29/0x40
> [ 1491.782526]  revoke_lo_before_commit+0x12/0x1c0
> [ 1491.783324]  gfs2_log_flush+0x337/0xb00
> [ 1491.784001]  inode_go_sync+0x8e/0x200
> [ 1491.784663]  do_xmote+0xd2/0x380
> [ 1491.785268]  glock_work_func+0x57/0x130
> [ 1491.785944]  process_one_work+0x237/0x560

Hi Christoph,

Thanks. I've run generic/361 many times on many recent branches and I've
never seen this before. For example, this is from last Friday:

generic/361 8s ...  13s

Still, I can see what's going on and it's not a recent problem. This is
basically a problem with our withdraw sequence from February 2020
(patch 601ef0d52e96). I'll try to fix it as soon as I get a chance.

Regards,

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361
  2021-07-26 14:00 [Cluster-devel] gfs2 hang in xfstests generic/361 Christoph Hellwig
  2021-07-26 14:49 ` Bob Peterson
@ 2021-07-26 17:11 ` Bob Peterson
  2021-07-26 17:35 ` Bob Peterson
  2021-07-26 18:50 ` [Cluster-devel] gfs2 hang in xfstests generic/361 (v3) Bob Peterson
  3 siblings, 0 replies; 10+ messages in thread
From: Bob Peterson @ 2021-07-26 17:11 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3.  This is running locally with
> "-O -p lock_nolock":
> 
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:
> [ 1491.756370] ffff8881048d0888 (&wq->mutex){+.+.}-{3:3}, at: flush_workqueue+0xc9/0x5f0

Hi Christoph,

Okay, I can recreate the failure you hit. The problem is that a withdraw
is happening from the glock workqueue, which is not right, since the
withdraw sequence itself waits for that workqueue to be flushed out.
(The withdraw itself is expected in this case, due to the IO errors.)

A patch like the following should hopefully fix the deadlock and/or BUG
by deferring the withdraw until the gfs2_logd daemon or something else
catches the error the next time it runs its loop. This is what is
supposed to happen in other similar situations we've had in the past.
Basically, we ignore the withdraw until it can be handled from a more
appropriate context.

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 42c15cfc0821..c138850b1f27 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -364,10 +364,8 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp, int max_revokes)
 	ret = list_empty(&sdp->sd_ail1_list);
 	spin_unlock(&sdp->sd_ail_lock);
 
-	if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags)) {
+	if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
 		gfs2_lm(sdp, "fatal: I/O error(s)\n");
-		gfs2_withdraw(sdp);
-	}
 
 	return ret;
 }

Can you try something like that and see if it handles the withdraw more
sanely? Thanks.
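
To make the intent concrete: with the withdraw call removed above, the
error would then be picked up later from a safer context, roughly along
these lines (a conceptual sketch only, not the actual gfs2_logd code;
only SDF_WITHDRAWING and gfs2_withdraw() are taken from the patch):

/*
 * Sketch of a daemon loop running in normal process context.  Here it
 * is safe for gfs2_withdraw() to flush workqueues and sleep, unlike in
 * gfs2_ail1_empty(), which is reached under sd_log_lock from the glock
 * workqueue (as in the lockdep splat above).
 */
static int logd_loop_sketch(void *data)
{
	struct gfs2_sbd *sdp = data;

	while (!kthread_should_stop()) {
		/* ... normal journal flushing work ... */

		/* Deferred withdraw: no spinlocks held here. */
		if (test_bit(SDF_WITHDRAWING, &sdp->sd_flags))
			gfs2_withdraw(sdp);

		/* ... sleep until the next wakeup ... */
	}
	return 0;
}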

Regards,

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361
  2021-07-26 14:00 [Cluster-devel] gfs2 hang in xfstests generic/361 Christoph Hellwig
  2021-07-26 14:49 ` Bob Peterson
  2021-07-26 17:11 ` Bob Peterson
@ 2021-07-26 17:35 ` Bob Peterson
  2021-07-26 18:50 ` [Cluster-devel] gfs2 hang in xfstests generic/361 (v3) Bob Peterson
  3 siblings, 0 replies; 10+ messages in thread
From: Bob Peterson @ 2021-07-26 17:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3.  This is running locally with
> "-O -p lock_nolock":
> 
(snip)
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:

Hi again Christoph,

Apparently the reason I haven't been seeing this problem myself is
that I've been testing code in an experimental branch in linux-gfs2,
and one of my other patches already fixed the problem. I just pushed my
experimental branch to linux-gfs2/for-next.next4 and the patch that
fixes it is 631e950b2eeb ("gfs2: fix deadlock in gfs2_ail1_empty
withdraw"). Perhaps you can cherry-pick that patch (and others if 
warranted) into your test branch and see if it behaves itself?

Regards,

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-26 14:00 [Cluster-devel] gfs2 hang in xfstests generic/361 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2021-07-26 17:35 ` Bob Peterson
@ 2021-07-26 18:50 ` Bob Peterson
  2021-07-27  7:05   ` Christoph Hellwig
  3 siblings, 1 reply; 10+ messages in thread
From: Bob Peterson @ 2021-07-26 18:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3.  This is running locally with
> "-O -p lock_nolock":
> 
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.752284]
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:
(snip)

FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
The patch I mentioned last time had problems, so this is a simplified
middle-ground between that patch and the one I suggested earlier.
Patch is a89a427e0afe.

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-26 18:50 ` [Cluster-devel] gfs2 hang in xfstests generic/361 (v3) Bob Peterson
@ 2021-07-27  7:05   ` Christoph Hellwig
  2021-07-27 14:05     ` Bob Peterson
  2021-07-27 16:41     ` Bob Peterson
  0 siblings, 2 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-07-27  7:05 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Jul 26, 2021 at 01:50:11PM -0500, Bob Peterson wrote:
> FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
> The patch I mentioned last time had problems, so this is a simplified
> middle-ground between that patch and the one I suggested earlier.
> Patch is a89a427e0afe.

With just that patch cherry picked the test still hangs.

The entire linux-gfs2/for-next.next4 completes a quick group xfstests
run, but with a lot of failures:

Failures: generic/079 generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/244 generic/294 generic/306 generic/347 generic/379 generic/380 generic/382 generic/383 generic/384 generic/385 generic/386 generic/400 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/594 generic/600 generic/601 generic/603
Failed 30 of 514 tests




* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-27  7:05   ` Christoph Hellwig
@ 2021-07-27 14:05     ` Bob Peterson
  2021-07-27 14:27       ` Andreas Gruenbacher
  2021-07-27 16:41     ` Bob Peterson
  1 sibling, 1 reply; 10+ messages in thread
From: Bob Peterson @ 2021-07-27 14:05 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/27/21 2:05 AM, Christoph Hellwig wrote:
> On Mon, Jul 26, 2021 at 01:50:11PM -0500, Bob Peterson wrote:
>> FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
>> The patch I mentioned last time had problems, so this is a simplified
>> middle-ground between that patch and the one I suggested earlier.
>> Patch is a89a427e0afe.
> 
> With just that patch cherry picked the test still hangs.
> 
> The entire linux-gfs2/for-next.next4 completes a quick group xfstests
> run, but with a lot of failures:
> 
> Failures: generic/079 generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/244 generic/294 generic/306 generic/347 generic/379 generic/380 generic/382 generic/383 generic/384 generic/385 generic/386 generic/400 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/594 generic/600 generic/601 generic/603
> Failed 30 of 514 tests
> 

Hi Christoph,

Thanks for testing gfs2.
Andreas and I are fully aware of these failures. When I do a full run 
(which I do almost every day), I get these same 9 failures that you also 
reported:

generic/294
generic/306
generic/441
generic/452
generic/488
generic/545
generic/594
generic/619
generic/631

I've not done a detailed analysis of these 9 lately, but in the past 
I've looked at individual failures and they're usually because gfs2 does 
a "file system withdraw" due to IO errors writing to its journal (and 
such--as we want and expect) whereas most file systems don't do that. So 
these are kind of expected failures. Perhaps I should go back and check 
them again.

The others you reported are exempted from my testing (I've got them 
in my excludes file) because in most cases the problem is gfs2's lack of 
support for the usrquota / grpquota / prjquota mount options:

generic/082 - Lack of support for -o usrquota / grpquota / prjquota
generic/219
generic/230
generic/235
generic/244
generic/379
generic/380
generic/382
generic/383
generic/384
generic/385
generic/386
generic/400
generic/566
generic/587
generic/600
generic/601
generic/603

That leaves us:

generic/079 [expunged] - Incorrect permission check (doesn't match the 
one in notify_change)
generic/092 [expunged] - Lack of support for unwritten extents
generic/094 [expunged] - Lack of support for unwritten extents
generic/103       [expunged] - Command having problems: xfs_io -i -fc 
falloc 0 66574524k /mnt/scratch/spc
(I haven't taken the time to analyze this yet)
Inconsistent filesystem when running out of space on dm-thin
(I haven't taken the time to analyze this yet either)

So we know we've still got some work to do here, but we've got higher 
priorities at the moment. They're still on our to-do list.

Regards,

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-27 14:05     ` Bob Peterson
@ 2021-07-27 14:27       ` Andreas Gruenbacher
  0 siblings, 0 replies; 10+ messages in thread
From: Andreas Gruenbacher @ 2021-07-27 14:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Jul 27, 2021 at 4:05 PM Bob Peterson <rpeterso@redhat.com> wrote:
> On 7/27/21 2:05 AM, Christoph Hellwig wrote:
> > On Mon, Jul 26, 2021 at 01:50:11PM -0500, Bob Peterson wrote:
> >> FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
> >> The patch I mentioned last time had problems, so this is a simplified
> >> middle-ground between that patch and the one I suggested earlier.
> >> Patch is a89a427e0afe.
> >
> > With just that patch cherry picked the test still hangs.
> >
> > The entire linux-gfs2/for-next.next4 completes a quick group xfstests
> > run, but with a lot of failures:
> >
> > Failures: generic/079 generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/244 generic/294 generic/306 generic/347 generic/379 generic/380 generic/382 generic/383 generic/384 generic/385 generic/386 generic/400 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/594 generic/600 generic/601 generic/603
> > Failed 30 of 514 tests
> >
>
> Hi Christoph,
>
> Thanks for testing gfs2.
> Andreas and I are fully aware of these failures. When I do a full run
> (which I do almost every day), I get these same 9 failures that you also
> reported:
>
> generic/294
> generic/306
> generic/441
> generic/452
> generic/488
> generic/545
> generic/594
> generic/619
> generic/631
>
> I've not done a detailed analysis of these 9 lately, but in the past
> I've looked at individual failures and they're usually because gfs2 does
> a "file system withdraw" due to IO errors writing to its journal (and
> such--as we want and expect) whereas most file systems don't do that. So
> these are kind of expected failures. Perhaps I should go back and check
> them again.
>
> The others you reported are exempted from my testing (I've got them
> in my excludes file) because in most cases the problem is gfs2's lack of
> support for the usrquota / grpquota / prjquota mount options:
>
> generic/082 - Lack of support for -o usrquota / grpquota / prjquota
> generic/219
> generic/230
> generic/235
> generic/244
> generic/379
> generic/380
> generic/382
> generic/383
> generic/384
> generic/385
> generic/386
> generic/400
> generic/566
> generic/587
> generic/600
> generic/601
> generic/603
>
> That leaves us:
>
> generic/079 [expunged] - Incorrect permission check (doesn't match the
> one in notify_change)

I've tried fixing that one a long time ago:

https://listman.redhat.com/archives/cluster-devel/2018-February/msg00042.html

> generic/092 [expunged] - Lack of support for unwritten extents
> generic/094 [expunged] - Lack of support for unwritten extents
> generic/103       [expunged] - Command having problems: xfs_io -i -fc
> falloc 0 66574524k /mnt/scratch/spc
> (I haven't taken the time to analyze this yet)
> Inconsistent filesystem when running out of space on dm-thin
> (I haven't taken the time to analyze this yet either)
>
> So we know we've still got some work to do here, but we've got higher
> priorities at the moment. They're still on our to-do list.
>
> Regards,
>
> Bob Peterson

Andreas





* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-27  7:05   ` Christoph Hellwig
  2021-07-27 14:05     ` Bob Peterson
@ 2021-07-27 16:41     ` Bob Peterson
  2021-07-28  5:39       ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Bob Peterson @ 2021-07-27 16:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 7/27/21 2:05 AM, Christoph Hellwig wrote:
> On Mon, Jul 26, 2021 at 01:50:11PM -0500, Bob Peterson wrote:
>> FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
>> The patch I mentioned last time had problems, so this is a simplified
>> middle-ground between that patch and the one I suggested earlier.
>> Patch is a89a427e0afe.
> 
> With just that patch cherry picked the test still hangs.
> 
> The entire linux-gfs2/for-next.next4 completes a quick group xfstests
> run, but with a lot of failures:
> 
> Failures: generic/079 generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/244 generic/294 generic/306 generic/347 generic/379 generic/380 generic/382 generic/383 generic/384 generic/385 generic/386 generic/400 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/594 generic/600 generic/601 generic/603
> Failed 30 of 514 tests
> 
Hi Christoph,

Can you please pull my latest linux-gfs2/for-next.next4 branch and try 
your full run again? I added a patch to not treat the usrquota / 
grpquota / prjquota mount options as errors, and I added a new patch 
from Andreas for generic/079 and it all seems to work okay for me
(although I haven't done a full run yet). TIA.

Regards,

Bob Peterson




* [Cluster-devel] gfs2 hang in xfstests generic/361 (v3)
  2021-07-27 16:41     ` Bob Peterson
@ 2021-07-28  5:39       ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2021-07-28  5:39 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Jul 27, 2021 at 11:41:03AM -0500, Bob Peterson wrote:
> On 7/27/21 2:05 AM, Christoph Hellwig wrote:
>> On Mon, Jul 26, 2021 at 01:50:11PM -0500, Bob Peterson wrote:
>>> FYI: I just pushed a replacement patch to linux-gfs2/for-next.next4.
>>> The patch I mentioned last time had problems, so this is a simplified
>>> middle-ground between that patch and the one I suggested earlier.
>>> Patch is a89a427e0afe.
>>
>> With just that patch cherry picked the test still hangs.
>>
>> The entire linux-gfs2/for-next.next4 completes a quick group xfstests
>> run, but with a lot of failures:
>>
>> Failures: generic/079 generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/244 generic/294 generic/306 generic/347 generic/379 generic/380 generic/382 generic/383 generic/384 generic/385 generic/386 generic/400 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/594 generic/600 generic/601 generic/603
>> Failed 30 of 514 tests
>>
> Hi Christoph,
>
> Can you please pull my latest linux-gfs2/for-next.next4 branch and try your 
> full run again? I added a patch to not treat the usrquota / grpquota / 
> prjquota mount options as errors, and I added a new patch from Andreas for 
> generic/079 and it all seems to work okay for me
> (although I haven't done a full run yet). TIA.

Failures: generic/082 generic/092 generic/094 generic/103 generic/219 generic/230 generic/235 generic/294 generic/306 generic/347 generic/371 generic/382 generic/441 generic/452 generic/488 generic/545 generic/566 generic/587 generic/600 generic/601
Failed 20 of 514 tests




