From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail-oi0-f65.google.com ([209.85.218.65]:46780 "EHLO mail-oi0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2388076AbeGWOGs (ORCPT ); Mon, 23 Jul 2018 10:06:48 -0400
Received: by mail-oi0-f65.google.com with SMTP id y207-v6so917424oie.13 for ; Mon, 23 Jul 2018 06:05:39 -0700 (PDT)
MIME-Version: 1.0
References: <000000000000bc17b60571a60434@google.com>
From: Miklos Szeredi
Date: Mon, 23 Jul 2018 15:05:38 +0200
Subject: Re: INFO: task hung in fuse_reverse_inval_entry
To: Dmitry Vyukov
Cc: linux-fsdevel, LKML, syzkaller-bugs, syzbot
Content-Type: text/plain; charset="UTF-8"
Sender: linux-fsdevel-owner@vger.kernel.org

On Mon, Jul 23, 2018 at 2:46 PM, Dmitry Vyukov wrote:
> On Mon, Jul 23, 2018 at 2:33 PM, Miklos Szeredi wrote:
>>>>> On Mon, Jul 23, 2018 at 9:59 AM, syzbot
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following crash on:
>>>>>>
>>>>>> HEAD commit: d72e90f33aa4 Linux 4.18-rc6
>>>>>> git tree: upstream
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1324f794400000
>>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=68af3495408deac5
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bb6d800770577a083f8c
>>>>>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=11564d1c400000
>>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=16fc570c400000
>>>>>
>>>>>
>>>>> Hi fuse maintainers,
>>>>>
>>>>> We are seeing a bunch of such deadlocks in fuse on syzbot. As far as I
>>>>> understand this is mostly working-as-intended (parts about deadlocks
>>>>> in Documentation/filesystems/fuse.txt). The intended way to resolve
>>>>> this is aborting connections via fusectl, right?
>>>>
>>>> Yes. Alternative is with "umount -f".
>>>>
>>>>> The doc says "Under
>>>>> the fuse control filesystem each connection has a directory named by a
>>>>> unique number". The question is: if I start a process and this process
>>>>> can mount fuse, how do I kill it? I mean: totally and certainly get
>>>>> rid of it right away? How do I find these unique numbers for the
>>>>> mounts it created?
>>>>
>>>> It is the device number found in st_dev for the mount. Other than
>>>> doing stat(2) it is possible to find out the device number by reading
>>>> /proc/$PID/mountinfo (third field).
>>>
>>> Thanks. I will try to figure out fusectl connection numbers and see if
>>> it's possible to integrate aborting into syzkaller.
>>>
>>>>> Taking into account that there is usually no
>>>>> operator attached to each server, I wonder if kernel could somehow
>>>>> auto-abort fuse on kill?
>>>>
>>>> Depends on what the fuse server is sleeping on. If it's trying to
>>>> acquire an inode lock (e.g. unlink(2)), which is classical way to
>>>> deadlock a fuse filesystem, then it will go into an uninterruptible
>>>> sleep. There's no way in which that process can be killed except to
>>>> force a release of the offending lock, which can only be done by
>>>> aborting the request that is being performed while holding that lock.
>>>
>>> I understand that it is not killed today, but I am asking if we can
>>> make it killable. It's all code that we can change, and if a human
>>> operator can do it, it can be done pure programmatically on kill too,
>>> right?
>>
>> Hmm, you mean if a process is in an uninterruptible sleep trying to
>> acquire a lock on a fuse filesystem and is killed, then the fuse
>> filesystem should be aborted?
>>
>> Even if we'd manage to implement that, it's a large backward
>> incompatibility risk.
>>
>> I don't argue that it can be done, but I would definitely argue *if*
>> it should be done.
>
>
> I understand that we should abort only if we are sure that it's
> actually deadlocked and there is no other way.
> So if fuse-user process is blocked on fuse lock, then we probably
> should do nothing. However, if the fuse-server is killed, then perhaps
> we could abort the connection at that point. Namely, if a process that
> has a fuse fd open is killed and it is the only process that shared
> this fd, then we could abort the connection on arrival of the kill
> signal (rather than wait until all its threads finish and then start
> closing all fd's, this is where we get the deadlock -- some of its
> threads won't finish). I don't know if such synchronous kill hook is
> available, though. If several processes shared the same fuse fd, then
> we could close the fd in each process on SIGKILL arrival, then when
> all of these processes are killed, fuse fd will be closed and we can
> abort the connection, which will un-deadlock all of these processes.
> Does this look any reasonable?

Biggest conceptual problem: your definition of fuse-server is weak.
Take the following example: process A is holding the fuse device fd
and is forwarding requests and replies to/from process B via a pipe.
So basically A is just a proxy that does nothing interesting, the
"real" server is B. But according to your definition B is not a
server, only A is.

And this is just a simple example, parts of the server might be on
different machines, etc... It's impossible to automatically detect if
a process is acting as a fuse server or not.

We could let the fuse server itself notify the kernel that it's a fuse
server. That might help in the cases where the deadlock is
accidental, but obviously not in the case when done by a malicious
agent.

I'm not sure it's worth the effort. Also I have no idea how the
respective maintainers would take the idea of "kill hooks"... It
would probably be a lot of work for little gain.

Thanks,
Miklos
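
For reference, the lookup discussed earlier in the thread (take the third field of /proc/$PID/mountinfo and use it to find the matching fusectl directory) might be sketched roughly as below. This is only an illustration, not what syzkaller does. The names FuseMount, parse_fuse_mounts and abort_path are hypothetical. It assumes fusectl is mounted at its conventional location /sys/fs/fuse/connections, that the mounts of interest have type "fuse" or "fuse.<subtype>" on an anonymous device (major 0), and that the connection directory is then named by the minor number.

```python
from typing import List, NamedTuple

# Conventional fusectl mount point; an assumption, it may differ on a given system.
FUSECTL_ROOT = "/sys/fs/fuse/connections"


class FuseMount(NamedTuple):
    """Hypothetical record for one fuse mount found in mountinfo."""
    major: int
    minor: int
    mount_point: str
    fstype: str


def parse_fuse_mounts(mountinfo_text: str) -> List[FuseMount]:
    """Extract fuse mounts from the text of a /proc/<pid>/mountinfo file."""
    mounts = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # A lone "-" separates the optional fields from the filesystem type.
        if "-" not in fields:
            continue
        sep = fields.index("-")
        if sep < 6 or sep + 1 >= len(fields):
            continue
        fstype = fields[sep + 1]
        # Skip fusectl itself and fuseblk; keep "fuse" and "fuse.<subtype>".
        if fstype != "fuse" and not fstype.startswith("fuse."):
            continue
        # Third field is the device number as "major:minor".
        major, minor = fields[2].split(":")
        mounts.append(FuseMount(int(major), int(minor), fields[4], fstype))
    return mounts


def abort_path(mount: FuseMount) -> str:
    """Path of the fusectl abort file, assuming the directory is named by the minor."""
    return f"{FUSECTL_ROOT}/{mount.minor}/abort"
```

In use, one would read /proc/<pid>/mountinfo for the process in question, pass the text to parse_fuse_mounts, and write anything to each abort_path to abort that connection. Both steps typically need sufficient privileges, and neither helps if the mount lives in a namespace the caller cannot see.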