From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx2.suse.de ([195.135.220.15]:44080 "EHLO mx1.suse.de"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1728891AbeJARC7 (ORCPT ); Mon, 1 Oct 2018 13:02:59 -0400
Date: Mon, 1 Oct 2018 12:25:44 +0200
From: Jan Kara
To: Nigel Banks
Cc: Jan Kara, Amir Goldstein, linux-fsdevel@vger.kernel.org
Subject: Re: Deadlock in fsnotify for
Message-ID: <20181001102544.GF3913@quack2.suse.cz>
References: <20180927162412.GA12883@quack2.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Hello,

On Fri 28-09-18 09:35:38, Nigel Banks wrote:
> I've attached the kern.log as you instructed, please let me know if there is
> any more information I can provide.

Thanks for the traces. So all processes but one hang like you've described -
i.e.:

 schedule+0x2c/0x80
 schedule_timeout+0x1cf/0x350
 ? sched_clock+0x9/0x10
 ? sched_clock+0x9/0x10
 ? sched_clock_cpu+0x11/0xb0
 wait_for_completion+0xba/0x140
 ? wake_up_q+0x80/0x80
 flush_work+0x126/0x1e0
 ? worker_detach_from_pool+0xa0/0xa0
 flush_delayed_work+0x3f/0x50
 fsnotify_wait_marks_destroyed+0x15/0x20
 fsnotify_destroy_group+0x48/0xd0
 inotify_release+0x1e/0x50
 __fput+0xea/0x220
 ____fput+0xe/0x10
 task_work_run+0x9d/0xc0

They all wait for worker thread to destroy marks. That is hung like:

 schedule+0x2c/0x80
 schedule_timeout+0x1cf/0x350
 ? select_idle_sibling+0x262/0x410
 ? __enqueue_entity+0x5c/0x60
 ? enqueue_entity+0x10e/0x6b0
 wait_for_completion+0xba/0x140
 ? wake_up_q+0x80/0x80
 __synchronize_srcu.part.13+0x85/0xb0
 ? trace_raw_output_rcu_utilization+0x50/0x50
 ? ttwu_do_activate+0x77/0x80
 synchronize_srcu+0x66/0xe0
 ? synchronize_srcu+0x66/0xe0
 fsnotify_mark_destroy_workfn+0x7b/0xe0
 process_one_work+0x1de/0x410
 worker_thread+0x228/0x410
 kthread+0x121/0x140

So it waits for SRCU period to end.
From the traces it is not clear who prevents the SRCU period from finishing.
Did you include all tasks from the trace in the attached file? If yes, I
have no good idea what could be holding the SRCU. Since your kernel is 4.15
based which is relatively old and has some patches applied on top, I'd
suggest you either try a newer kernel (e.g. you should be able to install
4.16 or 4.17 relatively easily) or report this to Ubuntu bugzilla. Thanks.

                                                                Honza

> Cheers,
>
> Nigel
>
> From: Jan Kara
> To: Nigel Banks
> Cc: jack@suse.cz, linux-fsdevel@vger.kernel.org, Amir Goldstein
> Date: 09/27/2018 05:24 PM
> Subject: Re: Deadlock in fsnotify for
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> Hello,
>
> [added to CC other relevant mails]
>
> On Thu 27-09-18 16:44:53, Nigel Banks wrote:
> > Sorry to trouble you, but from looking through the git history of
> > linux/fs/notify you seem to be the best person to contact.
> >
> > I've encountered a hard-to-reproduce situation that happens on our CI
> > servers, in which it becomes impossible to release any inotify file
> > descriptors. We're currently running Ubuntu 18.04 (Kernel 4.15) using
> > ext4 fs, and our code is running in docker containers (overlay2) if that
> > makes a difference.
> >
> > Essentially we're running a number of concurrent tests which internally
> > use inotify to monitor some directories. This all works fine and they
> > clean up after themselves, but after several days there will be a
> > deadlock in the kernel code (sys stack below):
> >
> > [<0>] flush_work+0x126/0x1e0
> > [<0>] flush_delayed_work+0x3f/0x50
> > [<0>] fsnotify_wait_marks_destroyed+0x15/0x20
> > [<0>] fsnotify_destroy_group+0x48/0xd0
> > [<0>] inotify_release+0x1e/0x50
> > [<0>] __fput+0xea/0x220
> > [<0>] ____fput+0xe/0x10
> > [<0>] task_work_run+0x9d/0xc0
> > [<0>] exit_to_usermode_loop+0xc0/0xd0
> > [<0>] do_syscall_64+0x115/0x130
> > [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [<0>] 0xffffffffffffffff
>
> Hum, I don't remember seeing any deadlock like this. When a system hangs
> like this, can you please do:
>
> echo w >/proc/sysrq-trigger
>
> and send me the output of 'dmesg' command after that. In that output we
> should see all hung tasks (including kernel threads) and their traces and
> hopefully it will tell us more.
>
> > Once a process gets stuck in this uninterruptible sleep it will never
> > wake. At this point the system is still usable, we're able to create more
> > inotify instances and receive messages for them, but we are not able to
> > close any of them. So eventually we run out of handles and the system
> > becomes unstable, not to mention we can't run any more tests on the
> > machine at this point, and a reboot is required.
>
> Yes, this is expected. It looks like some deadlock in the fsnotify
> subsystem.
>
> > From my research, it looks like lxc project has also encountered this
> > issue: https://github.com/lxc/lxc/issues/2456. Like them, we also didn't
> > experience this behaviour with our previous set-up, Ubuntu 16.04
> > (Kernel 4.4).
> >
> > I had a look through the bug lists and through the commit history for
> > linux/fs/notify and could not find this issue listed anywhere.
> >
> > I've attempted to write a small C program using pthreads and the inotify
> > sys-calls, but was unable to create a program that could reproduce this
> > issue.
>
> Thanks for the report.
>
>                                                               Honza
> --
> Jan Kara
> SUSE Labs, CR
>
-- 
Jan Kara
SUSE Labs, CR
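
The two trace shapes discussed in this thread can be told apart
mechanically when going through a long sysrq-w dump. Below is a minimal
sketch in Python, assuming the frame format shown in the traces above
("symbol+0xoff/0xlen", optionally prefixed by "?" for an unreliable frame
or by "[<0>]"). The helper names reliable_frames and classify and the
category labels are hypothetical and only cover the two shapes seen here;
they are not part of any kernel tooling.

```python
import re

# One stack frame per line, e.g. "flush_work+0x126/0x1e0",
# "? wake_up_q+0x80/0x80" or "[<0>] __fput+0xea/0x220".
FRAME_RE = re.compile(
    r"^(?:\[<[0-9a-f]+>\]\s*)?"          # optional "[<0>]" address prefix
    r"(\?\s*)?"                          # optional "?" = unreliable frame
    r"([A-Za-z_][\w.]*)"                 # symbol name, may contain ".part.N"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+$"        # +offset/length
)


def reliable_frames(trace_text):
    """Return the symbol names of all reliable frames, innermost first."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and not match.group(1):  # skip "?" frames
            frames.append(match.group(2))
    return frames


def classify(trace_text):
    """Label a trace with one of the two shapes from this thread (hypothetical labels)."""
    frames = reliable_frames(trace_text)
    if "fsnotify_mark_destroy_workfn" in frames and "synchronize_srcu" in frames:
        return "worker waiting for SRCU period"
    if "fsnotify_wait_marks_destroyed" in frames:
        return "closer waiting for mark-destroy worker"
    return "other"


# Shortened copies of the two traces quoted in the mail above.
CLOSER_TRACE = """
 wait_for_completion+0xba/0x140
 ? wake_up_q+0x80/0x80
 flush_work+0x126/0x1e0
 flush_delayed_work+0x3f/0x50
 fsnotify_wait_marks_destroyed+0x15/0x20
 fsnotify_destroy_group+0x48/0xd0
 inotify_release+0x1e/0x50
"""

WORKER_TRACE = """
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.13+0x85/0xb0
 synchronize_srcu+0x66/0xe0
 ? synchronize_srcu+0x66/0xe0
 fsnotify_mark_destroy_workfn+0x7b/0xe0
 process_one_work+0x1de/0x410
"""

if __name__ == "__main__":
    for name, trace in (("closer", CLOSER_TRACE), ("worker", WORKER_TRACE)):
        print(name, "->", classify(trace))
```

Any task that ends up as "other" in a full dump is a candidate for whoever
is holding the SRCU period open, which is the piece the traces so far do
not show.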