From mboxrd@z Thu Jan 1 00:00:00 1970
References: <132107F4-F56B-4D6E-9E00-A6F7C092E6BD@amacapital.net>
 <20190331211041.vht7dnqg4e4bilr2@brauner.io>
 <18C7FCB9-2CBA-4237-94BB-9C4395A2106B@amacapital.net>
 <20190401114059.7gdsvcqyoz2o5bbz@yavin>
 <20190401194214.4rbvmwogpke3cjcx@brauner.io>
From: Jonathan Kowalski
Date: Mon, 1 Apr 2019 22:58:04 +0100
Subject: Re: [PATCH v2 0/5] pid: add pidfd_open()
To: Linus Torvalds
Cc: Christian Brauner, Jann Horn, Daniel Colascione, Aleksa Sarai,
 Andy Lutomirski, Andrew Lutomirski, David Howells, "Serge E. Hallyn",
 Linux API, Linux List Kernel Mailing, Arnd Bergmann, "Eric W. Biederman",
 Konstantin Khlebnikov, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
 Michael Kerrisk-manpages, "Dmitry V.
 Levin", Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, Al Viro,
 Joel Fernandes
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 1, 2019 at 10:35 PM Linus Torvalds wrote:
>
> On Mon, Apr 1, 2019 at 12:42 PM Christian Brauner wrote:
> >
> > From what I gather from this thread we are still best of with using fds
> > to /proc/ as pidfds. Linus, do you agree or have I misunderstood?

You mention the race between learning the PID, the PID being recycled,
and pidfd_open() getting the wrong reference. This race exists with the
/proc model too. How do you propose to address it?

>
> That does seem to be the most flexible option.
>
> > Yes, we can have an internal mount option to restrict access to various
> > parts of procfs from such pidfds
>
> I suspect you'd find that other parties might want such a "restrict
> proc" mount option too, so I don't think it needs to be anything
> internal.
>
> But it would be pretty much independent of the pidfd issue, of course.
>
> > One thing is that we also need something to disable access to the
> > "/proc//net". One option could be to give the files in "net/" an
> > ->open-handler which checks that our file->f_path.mnt is not one of our
> > special clone() mounts and if it is refuse the open.
>
> I would expect that that would be part of the "restrict proc" mount
> options, no?

I was wondering whether some part of /proc could be split out into a
procpidfs, possibly with shared code. With the new mount API, you could
then configure a superblock with restricted views, per task, by virtue of
mount options. This only gives you the view as inside /proc/, and you
might not be able to lift restrictions, depending on your privileges in
the owning userns of the task.
Now, this would mean we keep the notion of anon inode fds as pidfds. On
supported systems, you could configure an instance and pass the pidfd
descriptor at the fsconfig stage (David Howells demonstrated similar
support for passing user and net namespace descriptors during superblock
configuration), along with the pidns descriptor. Without mounting the fs,
the mount fd can then be used for metadata access by passing it to
openat() and friends, and can possibly be passed on to others.

This is more granular than restricting access over the entire /proc
instance, and it can be adjusted per process. Since you need the pidfd's
pidns descriptor, you cannot easily cross namespace boundaries with just a
single pidfd. You can also create many differently restricted variants of
a single task's proc directory. The pidfd API and /proc access can be
connected while still being kept conceptually separate.

This would wind up being much more powerful (restricting /proc as a whole
is of course also fine, in addition to this). There are some details to
work out on when and how someone should be able to do this, but is
something like this up for discussion?

>
> > Basically, if you have a system without CONFIG_PROC_FS it makes sense
> > that clone gives back an anon inode file descriptor as pidfds because
> > you can still signal threads in a race-free way. But it doesn't make a
> > lot of sense to have pidfd_open() in this scenario because you can't
> > really do anything with that pidfd apart from sending signals.
>
> Well, people might want that.
>
> But realistically, everybody enables /proc support anyway. Even if you
> don't actually fully *mount* it in some restricted area, the support
> is pretty much always there in any kernel config.
>
> But yes, in general I agree that that also most likely means that a
> separate system call for "open_pidfd()" isn't worth it.
>
> Because if the main objection to /proc is that it exposes too much,
> then I think a much better option is to see how to sanely restrict the
> "too much" parts.
>
> Because I think there might be a lot of people who want a restricted
> /proc, in various container models etc.
>
> Linus
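
For readers following the thread, here is a minimal Python sketch of the "/proc directory fd as pidfd" model under discussion, and of where the PID-recycling race sits. It assumes a Linux system with procfs mounted at /proc. The helper names (open_proc_dirfd, read_via_dirfd, demo) are made up for this illustration and do not come from the patch set. The sketch opens /proc/<pid> as a directory fd, reads a file relative to it with openat semantics, and then shows that the fd stops working once the task has been reaped, rather than following a reused PID.

```python
import os
import subprocess
import sys


def open_proc_dirfd(pid):
    """Open /proc/<pid> as a directory fd (the "pidfd" in the /proc model)."""
    return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)


def read_via_dirfd(dirfd, name):
    """openat(dirfd, name) and read it; raises OSError once the task is gone."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)
    try:
        return os.read(fd, 65536).decode(errors="replace")
    finally:
        os.close(fd)


def demo():
    # A child that has not been waited on keeps its PID reserved (it becomes
    # a zombie on exit), so opening /proc/<pid> here cannot hit a recycled
    # PID. For an arbitrary PID learned some other way, the window between
    # learning the number and opening the directory is the race in question.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    dirfd = open_proc_dirfd(child.pid)
    try:
        alive_stat = read_via_dirfd(dirfd, "stat")
        child.kill()
        child.wait()  # reap: from here on the PID number may be reused
        try:
            read_via_dirfd(dirfd, "stat")
            stale_ok = True
        except OSError:
            # The fd stays bound to the dead task and does not follow reuse.
            stale_ok = False
        return child.pid, alive_stat, stale_ok
    finally:
        os.close(dirfd)


if __name__ == "__main__":
    pid, alive_stat, stale_ok = demo()
    print("pid field while alive:", alive_stat.split()[0], "expected:", pid)
    print("stale dirfd still readable after reaping:", stale_ok)
```

Once the directory fd is open, it is a stable reference to that one task. What neither this model nor a lookup by PID number can protect is the step before it, which is the point of the question above.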