From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932669Ab2EVAQy (ORCPT ); Mon, 21 May 2012 20:16:54 -0400 Received: from out01.mta.xmission.com ([166.70.13.231]:57746 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932093Ab2EVAQw (ORCPT ); Mon, 21 May 2012 20:16:52 -0400 From: ebiederm@xmission.com (Eric W. Biederman) To: Oleg Nesterov Cc: Andrew Morton , LKML , Pavel Emelyanov , Cyrill Gorcunov , Louis Rilling , Mike Galbraith In-Reply-To: <20120521124414.GA20391@redhat.com> (Oleg Nesterov's message of "Mon, 21 May 2012 14:44:14 +0200") References: <20120501134214.f6b44f4a.akpm@linux-foundation.org> <87havs7rvv.fsf_-_@xmission.com> <8762c87rrd.fsf_-_@xmission.com> <20120516183920.GA19975@redhat.com> <878vgrsv7q.fsf@xmission.com> <20120517170015.GA12436@redhat.com> <87d3628oqa.fsf@xmission.com> <20120518123911.GA417@redhat.com> <87zk95kper.fsf@xmission.com> <20120521124414.GA20391@redhat.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) Date: Mon, 21 May 2012 18:16:37 -0600 Message-ID: <87ipfp5au2.fsf@xmission.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-XM-SPF: eid=;;;mid=;;;hst=in02.mta.xmission.com;;;ip=208.38.5.102;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX19uOFW6bk/DgG1o/XYfk4ZPc9ABRmZBLzM= X-SA-Exim-Connect-IP: 208.38.5.102 X-SA-Exim-Mail-From: ebiederm@xmission.com X-Spam-Report: * -1.0 ALL_TRUSTED Passed through trusted hosts only via SMTP * 1.2 SARE_LWSHORTT BODY: SARE_LWSHORTT * 0.0 TVD_RCVD_IP TVD_RCVD_IP * 1.5 XMNoVowels Alpha-numberic number with no vowels * 0.1 XMSubLong Long Subject * 0.0 T_TM2_M_HEADER_IN_MSG BODY: T_TM2_M_HEADER_IN_MSG * 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% * [score: 0.4725] * -0.0 DCC_CHECK_NEGATIVE Not listed in DCC * [sa03 1397; Body=1 Fuz1=1 Fuz2=1] * 0.0 T_TooManySym_01 4+ unique symbols in subject X-Spam-DCC: XMission; sa03 1397; Body=1 Fuz1=1 Fuz2=1 X-Spam-Combo: **;Oleg Nesterov X-Spam-Relay-Country: Subject: Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped. X-Spam-Flag: No X-SA-Exim-Version: 4.2.1 (built Fri, 06 Aug 2010 16:31:04 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oleg Nesterov writes: > On 05/18, Eric W. Biederman wrote: >> >> Oleg Nesterov writes: >> >> >> I think there is something very compelling about your solution, >> >> we do need my bit about making the init process ignore SIGCHLD >> >> so all of init's children self reap. >> > >> > Not sure I understand. This can work with or without 3/3 which >> > changes zap_pid_ns_processes() to ignore SIGCHLD. And just in >> > case, I think 3/3 is fine. >> >> The only issue I see is that without 3/3 we might have processes that >> on one wait(2)s for and so will never have release_task called on. >> >> We do have the wait loop > > Yes, and we need this loop anyway, even if SIGCHLD is ignored. > It is possible that we already have a EXIT_ZOMBIE child(s) when > zap_pid_ns_processes(). > >> but I think there is a race possible there. > > Hmm. I do not see any race, but perhaps I missed something. > I think we can trust -ECHILD, or do_wait() is buggy. Think about it some more you are right. For some reason I had forgotten that without WNOHANG we don't block forever until a child exits. > Hmm. But there is another (off-topic) problem, security_task_wait() > can return an error if there are some security policy problems... > OK, this shouldn't happen I hope. Agreed. We might be able to address that problem but that is indeed another issue. >> > And once again, this wait_event() + __wake_up_parent() is very >> > simple and straightforward, we can cleanup this code later if >> > needed. >> >> Yes, and it doesn't when you do an UNINTERRUPTIBLE sleep with >> an INTERRUPTIBLE wake up unless I misread the code. > > Yes. so we need wait_event_interruptible() or __unhash_process() > should use __wake_up_sync_key(wait_chldexit). > >> > Yes. This is the known oddity. We always notify the tracer if the >> > leader exits, even if !thread_group_empty(). But after that the >> > tracer can't detach, and it can't do do_wait(WEXITED). >> > >> > The problem is not that we can't "fix" this. Just any discussed >> > fix adds the subtle/incompatible user-visible change. >> >> Yes and that is nasty. > > Agreed. ptrace API is nasty ;) > >> and moving detach_pid so we don't have to be super careful about >> where we call task_active_pid_ns. > > Yes, I was thinking about this change too, > >> --- a/kernel/pid_namespace.c >> +++ b/kernel/pid_namespace.c >> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) >> rc = sys_wait4(-1, NULL, __WALL, NULL); >> } while (rc != -ECHILD); >> >> + read_lock(&tasklist_lock); >> + for (;;) { >> + __set_current_state(TASK_INTERRUPTIBLE); >> + if (list_empty(¤t->children)) >> + break; >> + read_unlock(&tasklist_lock); >> + schedule(); > > OK, but then it makes sense to add clear_thread_flag(TIF_SIGPENDING) > before schedule, to avoid the busy-wait loop (like the sys_wait4 loop > does). Or simply use TASK_UNINTERRUPTIBLE, I do not think it is that > important to "fool" /proc/loadavg. But I am fine either way. It can get darn strange when you hold a thread in stopped with ptrace and your load mysteriously jumps. But we already have this problem with de_thread and people aren't yelling so shrug. So at a practical level Idon't think it is fooling /proc/loadavg but at this point if we want more accuraccy from /proc/loadavg we need to fix the computation and distinguish short term disk sleeps from other uninterruptible sleeps and thus fix how /proc/loadavg is computed, rather than hacking around with code like this. > Maybe you can also add "ifdef CONFIG_PID_NS" into __unhash_process(), > but this is minor too. An #ifdef just leads to weird build failures that in weird rare configurations. If we can hide it all away in a header fine, but putting a bare #ifdef in the core of the code simply as a performance optimization is ugly and a a major testing challenge. Keeping track of all of the flying pieces with this patch has been tricky enough as it is. Eric