Subject: Re: [PATCH v4] pidns: introduce syscall translate_pid
From: "prakash.sangappa"
Reply-To: prakash.sangappa@oracle.com
To: Jann Horn
Cc: Andy Lutomirski, Nagarathnam Muthusamy, Andrew Morton, Konstantin Khlebnikov, Oleg Nesterov, Linux API, "linux-kernel@vger.kernel.org", Serge Hallyn, "Eric W. Biederman", Eugene Syromiatnikov
Date: Wed, 1 Nov 2017 17:38:10 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/01/2017 10:43 AM, Jann Horn wrote:
> On Tue, Oct 17, 2017 at 5:38 PM, Prakash Sangappa wrote:
>>
>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote:
>>>>
>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>
>>>>>
>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:
>>>>>>
>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>
>>>>>>>>>> This syscall converts pid from source pid-ns into pid in target
>>>>>>>>>> pid-ns.
>>>>>>>>>> If pid is unreachable from target pid-ns it returns zero.
>>>>>>>>>>
>>>>>>>>>> Pid-namespaces are referred file descriptors opened to proc files
>>>>>>>>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
>>>>>>>>>> argument refers to current pid namespace, same as file
>>>>>>>>>> /proc/self/ns/pid.
>>>>>>>>>>
>>>>>>>>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
>>>>>>>>>> backward translation requires scanning all tasks. Also pids could
>>>>>>>>>> be translated by sending them through unix socket between
>>>>>>>>>> namespaces, this method is slow and insecure because other side is
>>>>>>>>>> exposed inside pid namespace.
>>>>>>> Andrew asked why we might need this.
>>>>>>>
>>>>>>> Such conversion is required for interaction between processes across
>>>>>>> pid-namespaces.
>>>>>>> For example to identify process in container by pid file looking from
>>>>>>> outside.
>>>>>>>
>>>>>>> Two years ago I've solved this in project of mine with monstrous code
>>>>>>> which forks couple times just to convert pid, lucky for me performance
>>>>>>> wasn't important.
>>>>>> That's a single user who needed this a single time, and found a
>>>>>> userspace-based solution anyway. This is not exactly compelling!
>>>>>>
>>>>>> Is there a stronger case to be made? How does this change benefit our
>>>>>> users? Sell it to us!
>>>>> Oracle database is planning to use pid namespace for sandboxing database
>>>>> instances and they need an API similar to translate_pid to effectively
>>>>> translate process IDs from other pid namespaces. Prakash (cced in mail)
>>>>> can provide more details on this usecase.
>>>>
>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces
>>>> and needs a direct method of converting pids of processes in the pid
>>>> namespace hierarchy. In this use case multiple nested PID namespaces
>>>> will be used.
>>>> The currently available mechanism are not very efficient for this use
>>>> case. For ex. as Konstantin described, using /proc//status would require
>>>> the application to scan all the pid's status files to determine the pid
>>>> of given process in a child namespace.
>>>>
>>>> Use of SCM_CREDENTIALS's socket message is another way, which would
>>>> require every process starting inside a pid namespace to send this
>>>> message and the receiving process in the target namespace would have to
>>>> save the converted pid and reference it. This mechanism becomes
>>>> cumbersome especially if the application has to deal with multiple
>>>> nested pid namespaces. Also, the Database needs to be able to convert a
>>>> thread's global pid(gettid()). Passing the thread's pid(gettid()) in
>>>> SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.
>>>>
>>>> So having a direct method, like the API that Konstantin is proposing,
>>>> will work best for the Database since pid of a process in any of the
>>>> nested pid namespaces can be converted as and when required. I think
>>>> with the proposed API, the application should be able to convert pid of
>>>> a process or tid(gettid()) of a thread as well.
>>>>
>>> Can you explain what Oracle's database is planning to do with this
>>> information?
>>
>> Database uses the PID to programmatically find out if the process/thread
>> is alive(kill 0) also send signals to the processes requesting it to dump
>> status/debug information and kill the processes in case of a shutdown
>> abort of the instance.
> But if kill(pid, 0) returns 0, that doesn't tell you anything, right?
> It could be that the process you're trying to check is still alive, but
> it could also be that it has died, ns_last_pid has wrapped around, and
> the PID is now being reused by another process, right?

That is true.
The database checks the process start time by reading the /proc//stat
file to verify that it is the correct process.

>
> Wouldn't it be more reliable to open("/proc/self", O_RDONLY)
> (or /proc/thread-self) in the process you want to monitor, then send
> the resulting file descriptor to the monitoring process with SCM_RIGHTS?
> Then something like this should work for checking whether the process
> is still alive without relying on PIDs at all:
>
> int retval = faccessat(child_proc_self_fd, "stat", F_OK, 0);
> if (retval == 0) {
>   /* process still exists */
> } else if (retval == -1 && errno == ESRCH) {
>   /* process is gone */
> } else {
>   err(1, "unexpected fstatat result");
> }

Yes, but there will be a large number of processes to monitor and only a
few monitoring processes. Each monitored process would have to open
/proc/self and send the fd to every monitoring process. In the database
case there is one fixed monitoring process, but the other monitoring
processes can exit and new ones can be started.
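
For illustration, the kill(pid, 0) plus start-time check mentioned above
could look roughly like the sketch below. This is only a sketch under some
assumptions, not the database's actual code: the helper names
read_starttime() and is_same_process() are made up for this example, the
pid is assumed to be valid in the pid namespace of the mounted /proc, and
the start time is taken from field 22 (starttime, in clock ticks since
boot) of the per-process stat file as documented in proc(5).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid() for callers */

/* Hypothetical helper: read field 22 (starttime) of the process stat file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
static int read_starttime(pid_t pid, unsigned long long *out)
{
    char path[64], buf[1024];
    char *p;
    size_t n;
    int field;
    FILE *f;

    assert(out != NULL);
    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* comm (field 2) may contain spaces or ')': resume after the last ')'. */
    p = strrchr(buf, ')');
    if (!p)
        return -1;
    p++;    /* now at the space in front of field 3 (state) */

    /* Advance to the space in front of field 22. */
    for (field = 3; field < 22; field++) {
        p = strchr(p + 1, ' ');
        if (!p)
            return -1;
    }
    return sscanf(p, "%llu", out) == 1 ? 0 : -1;
}

/* Hypothetical helper: returns 1 if pid exists and still has the start
 * time recorded earlier, 0 if it is gone or the pid has been reused. */
static int is_same_process(pid_t pid, unsigned long long expected_start)
{
    unsigned long long start;

    /* EPERM means the process exists but we may not signal it. */
    if (kill(pid, 0) == -1 && errno == ESRCH)
        return 0;
    if (read_starttime(pid, &start) != 0)
        return 0;
    return start == expected_start;
}
```

A monitor would call read_starttime() once when it first learns about a
process, remember the value, and later call is_same_process() with it. The
check is still racy (the process can exit right after it returns), so it
only narrows the pid reuse window; it does not close it the way a held
/proc fd would.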