From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758798AbbKSQt5 (ORCPT <rfc822;w@1wt.eu>);
	Thu, 19 Nov 2015 11:49:57 -0500
Received: from mx1.redhat.com ([209.132.183.28]:44261 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756882AbbKSQtz (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 19 Nov 2015 11:49:55 -0500
Message-ID: <564DFDAF.3000402@redhat.com>
Date: Thu, 19 Nov 2015 16:49:51 +0000
From: Pedro Alves <palves@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Oleg Nesterov <oleg@redhat.com>, Tejun Heo <tj@kernel.org>
CC: Jan Kratochvil <jan.kratochvil@redhat.com>,
        Andrey Ryabinin <aryabinin@virtuozzo.com>,
        Roland McGrath <roland@hack.frob.com>,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: ptrace() hangs on attempt to seize/attach stopped & frozen task
References: <5640B7F2.70406@virtuozzo.com> <20151109185506.GA22744@redhat.com> <20151109180207.GA28507@mtj.duckdns.org> <20151110202017.GA2976@redhat.com> <20151116184516.GJ18894@mtj.duckdns.org> <20151117193419.GA9993@redhat.com>
In-Reply-To: <20151117193419.GA9993@redhat.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/17/2015 07:34 PM, Oleg Nesterov wrote:
> On 11/16, Tejun Heo wrote:

>>> And perhaps we can simply remove this logic? I forgot why do we hide this
>>> STOPPED -> RUNNING -> TRACED transition from the attaching thread. But the
>>> vague feeling tells me that we discussed this before and perhaps it was me
>>> who suggested to avoid the user-visible change when you introduced this
>>> transition...
>>
>> Heh, it was too long ago for me to remember much. :)
> 
> Same here...
> 
>>> Anyway, now I do not understand why do we want to hide it. Lets consider
>>> the following "test-case",
>>>
>>> 	void test(int pid)
>>> 	{
>>> 		kill(pid, SIGSTOP);
>>> 		waitpid(pid, NULL, WSTOPPED);
>>>
>>> 		ptrace(PTRACE_ATTACH-OR-PTRACE_SEIZE, pid, 0,0);
>>>
>>> 		assert(ptrace(PTRACE_DETACH, pid, 0,0) == 0);
>>> 	}
>>>
>>> Yes, it will fail if we remove JOBCTL_TRAPPING. But it can equally fail
>>> if SIGCONT comes before ATTACH, so perhaps we do not really care?
>>>
>>> Jan, Pedro, do you think the patch below can break gdb somehow? With this
>>> patch you can never assume that waitpid(WNOHANG) or ptrace(WHATEVER) will
>>> succeed right after PTRACE_ATTACH/PTRACE_SEIZE, even if you know that the
>>> tracee was TASK_STOPPED before attach.

Not sure, because I don't think I fully understand that proposed change.

Both GDB and gdbserver have special processing for attaching to already-stopped
processes.  (and neither use PTRACE_SEIZE yet.)

Here's the gdbserver version:

 https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/gdbserver/linux-low.c;h=41ab510fa4ac5654f101f08efb68e26b5bc5dbd7;hb=HEAD#l903

Copied here for convenience:

 907 linux_attach_lwp (ptid_t ptid)
 908 {
 909   struct lwp_info *new_lwp;
 910   int lwpid = ptid_get_lwp (ptid);
 911
 912   if (ptrace (PTRACE_ATTACH, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0)
 913       != 0)
 914     return errno;
 915
 916   new_lwp = add_lwp (ptid);
 917
 918   /* We need to wait for SIGSTOP before being able to make the next
 919      ptrace call on this LWP.  */
 920   new_lwp->must_set_ptrace_flags = 1;
 921
 922   if (linux_proc_pid_is_stopped (lwpid))
 923     {
 924       if (debug_threads)
 925         debug_printf ("Attached to a stopped process\n");
 926
 927       /* The process is definitely stopped.  It is in a job control
 928          stop, unless the kernel predates the TASK_STOPPED /
 929          TASK_TRACED distinction, in which case it might be in a
 930          ptrace stop.  Make sure it is in a ptrace stop; from there we
 931          can kill it, signal it, et cetera.
 932
 933          First make sure there is a pending SIGSTOP.  Since we are
 934          already attached, the process can not transition from stopped
 935          to running without a PTRACE_CONT; so we know this signal will
 936          go into the queue.  The SIGSTOP generated by PTRACE_ATTACH is
 937          probably already in the queue (unless this kernel is old
 938          enough to use TASK_STOPPED for ptrace stops); but since
 939          SIGSTOP is not an RT signal, it can only be queued once.  */
 940       kill_lwp (lwpid, SIGSTOP);
 941
 942       /* Finally, resume the stopped process.  This will deliver the
 943          SIGSTOP (or a higher priority signal, just like normal
 944          PTRACE_ATTACH), which we'll catch later on.  */
 945       ptrace (PTRACE_CONT, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0);
 946     }
 947
 948   /* The next time we wait for this LWP we'll see a SIGSTOP as PTRACE_ATTACH
 949      brings it to a halt.
 950

linux_proc_pid_is_stopped checks whether the state in /proc/pid/status is "T (stopped)".

Here's the equivalent in gdb:

  https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-nat.c;h=841ec3949c37438dfba924d8db6b37ffc416dd29;hb=HEAD#l974

This queuing of a SIGSTOP + PTRACE_CONT was necessary because
otherwise when gdb attaches to a job stopped process, gdb would hang in the waitpid
after PTRACE_ATTACH, waiting for the initial SIGSTOP which would never arrive.

If the proposed change makes it so that a new intermediate state can be observed
right after PTRACE_ATTACH, and so linux_proc_pid_is_stopped can return false,
then there's potential for breakage.  But maybe not, if we're sure that
that when that happens, waitpid returns for the initial
PTRACE_ATTACH-induced SIGSTOP.

Thanks,
Pedro Alves