From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E9F93C433EF for ; Tue, 12 Jul 2022 14:35:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233620AbiGLOfm (ORCPT ); Tue, 12 Jul 2022 10:35:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51384 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233499AbiGLOfJ (ORCPT ); Tue, 12 Jul 2022 10:35:09 -0400 Received: from out01.mta.xmission.com (out01.mta.xmission.com [166.70.13.231]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 150F4B31F5; Tue, 12 Jul 2022 07:34:59 -0700 (PDT) Received: from in02.mta.xmission.com ([166.70.13.52]:57756) by out01.mta.xmission.com with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1oBGyX-00B32U-U5; Tue, 12 Jul 2022 08:34:57 -0600 Received: from ip68-227-174-4.om.om.cox.net ([68.227.174.4]:46192 helo=email.froward.int.ebiederm.org.xmission.com) by in02.mta.xmission.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1oBGyW-00Dghw-Qt; Tue, 12 Jul 2022 08:34:57 -0600 From: "Eric W. Biederman" To: Tycho Andersen Cc: Miklos Szeredi , Christian Brauner , fuse-devel , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org References: <877d4jbabb.fsf@email.froward.int.ebiederm.org> <87zghf6yhe.fsf@email.froward.int.ebiederm.org> Date: Tue, 12 Jul 2022 09:34:50 -0500 In-Reply-To: (Tycho Andersen's message of "Tue, 12 Jul 2022 07:43:51 -0600") Message-ID: <87sfn62yd1.fsf@email.froward.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1oBGyW-00Dghw-Qt;;;mid=<87sfn62yd1.fsf@email.froward.int.ebiederm.org>;;;hst=in02.mta.xmission.com;;;ip=68.227.174.4;;;frm=ebiederm@xmission.com;;;spf=softfail X-XM-AID: U2FsdGVkX19GFtaSVDwPoTYZnHwdRxtgbEun6enaSZ4= X-SA-Exim-Connect-IP: 68.227.174.4 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: Re: strange interaction between fuse + pidns X-SA-Exim-Version: 4.2.1 (built Sat, 08 Feb 2020 21:53:50 +0000) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Tycho Andersen writes: > On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote: >> Tycho Andersen writes: >> It is not different enough to change the semantics. What I am aiming >> for is having a dedicated flag indicating a task will exit, that >> fatal_signal_pending can check. And I intend to make that flag one way >> so that once it is set it will never be cleared. > > Ok - how far out is that? I'd like to try to convince Miklos to land > the fuse part of this fix now, but without the "look at shared signals > too" patch, that fix is useless. I'm not married to my patch, but I > would like to get this fixed somehow soon. My point is that we need to figure out why you need the look at shared signals. If I can get everything reviewed my changes will be in the next merge window (it unfortunately always takes longer to get the code reviewed than I would like). However when my changes land does not matter. What you are trying to solve is orthogonal of my on-going work. The problem is that looking at shared signals is fundamentally broken. A case in point is that kernel threads can have a pending SIGKILL that is not a fatal signal. As kernel threads are allowed to ignore or even handle SIGKILL. If you want to change fatal_signal_pending to include PF_EXITING I would need to double check the implications but I think that would work, and would not have the problems including the shared pending state of SIGKILL. >> The other thing I have played with that might be relevant was removing >> the explicit wait in zap_pid_ns_processes and simply not allowing wait >> to reap the pid namespace init until all it's children had been reaped. >> Essentially how we deal with the thread group leader for ordinary >> processes. Does that sound like it might help in the fuse case? > > No, the problem is that the wait code doesn't know to look in the > right place, so waiting later still won't help. I was suggesting to modify the kernel so that zap_pid_ns_processes would not wait for the zapped processes. Instead I was proposing that delay_group_leader called from wait_consider_task would simply refuse to allow the init process of a pid namespace to be reaped until every other process of that pid namespace had exited. You can prototype how that would affect the deadlock by simply removing the waiting from zap_pid_ns_processes. I suggest that simply because that has the potential to remove some of the strange pid namespace cases. I don't understand the problematic interaction between pid namespace shutdown and the fuse daemon, so I am merely suggesting a possibility that I know can simplify pid namespace shutdown. Something like: diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..d22a30b0b0cf 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -207,47 +207,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) read_unlock(&tasklist_lock); rcu_read_unlock(); - /* - * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD. - * kernel_wait4() will also block until our children traced from the - * parent namespace are detached and become EXIT_DEAD. - */ - do { - clear_thread_flag(TIF_SIGPENDING); - rc = kernel_wait4(-1, NULL, __WALL, NULL); - } while (rc != -ECHILD); - - /* - * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE - * process whose parents processes are outside of the pid - * namespace. Such processes are created with setns()+fork(). - * - * If those EXIT_ZOMBIE processes are not reaped by their - * parents before their parents exit, they will be reparented - * to pid_ns->child_reaper. Thus pidns->child_reaper needs to - * stay valid until they all go away. - * - * The code relies on the pid_ns->child_reaper ignoring - * SIGCHILD to cause those EXIT_ZOMBIE processes to be - * autoreaped if reparented. - * - * Semantically it is also desirable to wait for EXIT_ZOMBIE - * processes before allowing the child_reaper to be reaped, as - * that gives the invariant that when the init process of a - * pid namespace is reaped all of the processes in the pid - * namespace are gone. - * - * Once all of the other tasks are gone from the pid_namespace - * free_pid() will awaken this task. - */ - for (;;) { - set_current_state(TASK_INTERRUPTIBLE); - if (pid_ns->pid_allocated == init_pids) - break; - schedule(); - } - __set_current_state(TASK_RUNNING); - if (pid_ns->reboot) current->signal->group_exit_code = pid_ns->reboot; Eric