Date: Wed, 20 Mar 2019 00:10:23 +0100
From: Christian Brauner
To: Daniel Colascione
Cc: Joel Fernandes, Suren Baghdasaryan, Steven Rostedt, Sultan Alsawaf,
    Tim Murray, Michal Hocko, Greg Kroah-Hartman, Arve Hjønnevåg,
    Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra, LKML,
    "open list:ANDROID DRIVERS", linux-mm, kernel-team, Oleg Nesterov,
    Andy Lutomirski, "Serge E. Hallyn", Kees Cook
Subject: Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
Message-ID: <20190319231020.tdcttojlbmx57gke@brauner.io>

On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner wrote:
> > So I dislike the idea of allocating new inodes from the procfs super
> > block. I would like to avoid pinning the whole pidfd concept exclusively
> > to proc. The idea is that the pidfd API will be useable through procfs
> > via open("/proc/") because that is what users expect and really
> > wanted to have for a long time. So it makes sense to have this working.
> > But it should really be useable without it. That's why translate_pid()
> > and pidfd_clone() are on the table. What I'm saying is, once the pidfd
> > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > though that's crazy - and still be able to use pidfds. This is also a
> > point akpm asked about when I did the pidfd_send_signal work.
>
> I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> crazy idea that I was discussing with Joel the other day is to just
> make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> system call that returned, out of thin air and independent of the
> mount table, a procfs root directory file descriptor for the caller's
> PID namespace and suitable for use with openat(2).

Even if this works I'm pretty sure that Al and a lot of others will not
be happy about this. A syscall to get an fd to /proc? That's not going
to happen and I don't see the need for a separate syscall just for that.
(I do see the point of making CONFIG_PROCFS=y the default btw.)

Inode allocation from the procfs mount for the file descriptors Joel
wants is not correct. They're not really procfs file descriptors so this
is a nack. We can't just hook into proc that way.

>
> C'mon: /proc is used by everyone today and almost every program breaks
> if it's not around. The string "/proc" is already de facto kernel ABI.
> Let's just drop the pretense of /proc being optional and bake it into
> the kernel proper, then give programs a way to get to /proc that isn't
> tied to any particular mount configuration. This way, we don't need a
> translate_pid(), since callers can just use procfs to do the same
> thing. (That is, if I understand correctly what translate_pid does.)

I'm not sure what you think translate_pid() is doing since you're not
saying what you think it does.
Examples from the old patchset:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2?

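To make the return-value conventions in the examples above concrete,
here is a rough user-space toy model. It is not kernel code and not the
patchset's implementation: the toy_* names, the per-level pid array and
the -1 "no such task" return are all assumptions made up for
illustration, and namespaces are passed as explicit pointers instead of
the "-1 means our own namespace" shorthand.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4 /* arbitrary nesting limit for this sketch */

/* Toy namespace: only knows its parent and its depth (root is level 0). */
struct toy_pidns {
	const struct toy_pidns *parent;
	int level;
};

/* Toy task: one pid number per level, from the root down to its own ns. */
struct toy_task {
	const struct toy_pidns *ns;
	int nr[TOY_MAX_LEVEL];
};

/* Pid of task t as seen from ns, or 0 if t is not visible from ns. */
static int toy_pid_in_ns(const struct toy_task *t, const struct toy_pidns *ns)
{
	const struct toy_pidns *cur = t->ns;

	assert(ns->level >= 0 && ns->level < TOY_MAX_LEVEL);
	/* Walk up from the task's namespace to the depth of ns. */
	while (cur && cur->level > ns->level)
		cur = cur->parent;
	return cur == ns ? t->nr[ns->level] : 0;
}

/*
 * Hypothetical stand-in for translate_pid(pid, source, target): find the
 * task that has number pid in source and return its number in target.
 * Returns 0 if the task is not visible from target, and -1 if no task
 * has that pid in source (an assumption of this sketch).
 */
static int toy_translate_pid(const struct toy_task *tasks, size_t ntasks,
			     int pid, const struct toy_pidns *source,
			     const struct toy_pidns *target)
{
	size_t i;

	if (pid <= 0)
		return -1;
	for (i = 0; i < ntasks; i++) {
		if (toy_pid_in_ns(&tasks[i], source) == pid)
			return toy_pid_in_ns(&tasks[i], target);
	}
	return -1;
}
```

With one root namespace and one child namespace, translating pid 1 of
the child into the root gives whatever number the root assigned to the
child's init (> 0, "inside"), translating pid 1 of the root into the
child gives 0 ("outside"), and translating pid 1 of a namespace into
itself gives 1 ("equal").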
Allowing this syscall to yield pidfds as proper regular file fds and
also taking pidfds as argument is an excellent way to kill a few
problems at once:
- cheap pid namespace introspection
- creates a bridge between the "old" pid-based api and the "new" pidfd api
- allows us to get proper non-directory file descriptors for any pids we
  like

The additional advantage is that people are already happy to add this
syscall so simply extending it and routing it through the pidfd tree or
Eric's tree is reasonable. (It should probably grow a flag argument. I
need to start prototyping this.)

>
> We still need a pidfd_clone() for atomicity reasons, but that's a
> separate story. My goal is to be able to write a library that

Yes, on my todo list and I have a ported patch based on prior work
rotting somewhere on my git server.

> transparently creates and manages a helper child process even in a
> "hostile" process environment in which some other uncoordinated thread
> is constantly doing a waitpid(-1) (e.g., the JVM).
>
> > So instead of going through proc we should probably do what David has
> > been doing in the mount API and come to rely on anon_inode. So
> > something like:
> >
> > fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags);
> >
> > and stash information such as pid namespace etc. in a pidfd struct or
> > something that we then can stash in file->private_data of the new file.
> > This also lets us avoid all this open coding done here.
> > Another advantage is that anon_inodes is its own kernel-internal
> > filesystem.
>
> Sure. That works too.

Great.
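
For a feel of what an anon_inode-backed descriptor looks like from user
space, the sketch below inspects an eventfd(2) descriptor, which is one
existing interface backed by the same kernel-internal anon inode
mechanism. It is only an illustration and says nothing about how pidfds
would be implemented. The helper names fd_link_target() and
fd_is_anon_inode() are made up here, and the check assumes Linux with
/proc mounted. /proc is needed only for this inspection, not for using
the descriptor.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the /proc/self/fd link target of fd into buf (NUL-terminated).
 * Returns 0 on success, -1 on failure. Hypothetical helper.
 */
static int fd_link_target(int fd, char *buf, size_t len)
{
	char path[64];
	ssize_t n;

	assert(buf != NULL && len > 0);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}

/* Returns 1 if fd refers to an anon inode, 0 if not, -1 on error. */
static int fd_is_anon_inode(int fd)
{
	static const char prefix[] = "anon_inode:";
	char target[128];

	if (fd_link_target(fd, target, sizeof(target)) < 0)
		return -1;
	return strncmp(target, prefix, strlen(prefix)) == 0;
}
```

Calling fd_is_anon_inode() on a descriptor returned by
eventfd(0, EFD_CLOEXEC) should report an anon inode: the descriptor is
not tied to any mounted filesystem, which is the property the quoted
anon_inode_getfd() suggestion relies on.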