From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mark Syms
Date: Mon, 8 Oct 2018 13:36:30 +0100
Subject: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock
In-Reply-To: <1539002191-40831-1-git-send-email-mark.syms@citrix.com>
References: <1539002191-40831-1-git-send-email-mark.syms@citrix.com>
Message-ID: <1539002191-40831-2-git-send-email-mark.syms@citrix.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

During a VM stress test we encountered a system lockup, and kern.log contained:

kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
kernel: [21389.462749]       Tainted: G           O    4.4.0+10 #1
kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [21389.462783] python          D ffff88019628bc90     0 15480      1 0x00000000
kernel: [21389.462790]  ffff88019628bc90 ffff880198f11c00 ffff88005a509c00 ffff88019628c000
kernel: [21389.462795]  ffffc90040226000 ffff88019628bd80 fffffffffffffe58 ffff8801818da418
kernel: [21389.462799]  ffff88019628bca8 ffffffff815a1cd4 ffff8801818da5c0 ffff88019628bd68
kernel: [21389.462803] Call Trace:
kernel: [21389.462815]  [] schedule+0x64/0x80
kernel: [21389.462877]  [] find_insert_glock+0x4a4/0x530 [gfs2]
kernel: [21389.462891]  [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
kernel: [21389.462903]  [] gfs2_glock_get+0x3d/0x330 [gfs2]
kernel: [21389.462928]  [] do_flock+0xf2/0x210 [gfs2]
kernel: [21389.462933]  [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
kernel: [21389.462938]  [] ? cp_new_stat+0x10b/0x120
kernel: [21389.462943]  [] gfs2_flock+0x78/0xa0 [gfs2]
kernel: [21389.462946]  [] SyS_flock+0x129/0x170
kernel: [21389.462948]  [] entry_SYSCALL_64_fastpath+0x12/0x71

On examination of the code it was determined that this code path is taken only
if the selected glock is marked as dead. The supposition, therefore, is that by
the time schedule() was called the glock had already been cleaned up, so
nothing remained to wake the schedule. Instead of calling schedule(), call
schedule_timeout(HZ) so that we at least get a chance to re-evaluate.

On repeating the stress test, the printk message was seen once in the logs
across four servers, with no further occurrences and no stuck-task log
entries. This indicates that when the timeout occurred, the repeated lookup
did not find the same glock entry; since we had not been woken, without the
timeout we would never have been woken.

Signed-off-by: Mark Syms
Signed-off-by: Tim Smith
---
 fs/gfs2/glock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 4614ee2..0a59a01 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -758,7 +758,8 @@ static struct gfs2_glock *find_insert_glock(struct lm_lockname *name,
 	}
 	if (gl && !lockref_get_not_dead(&gl->gl_lockref)) {
 		rcu_read_unlock();
-		schedule();
+		if (schedule_timeout(HZ) == 0)
+			printk(KERN_INFO "find_insert_glock schedule timed out\n");
 		goto again;
 	}
 out:
-- 
1.8.3.1