From: ebiederm@xmission.com (Eric W. Biederman)
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
    LKML <linux-kernel@vger.kernel.org>,
    Kernel Hardening <kernel-hardening@lists.openwall.com>,
    Linux API <linux-api@vger.kernel.org>,
    Linux FS Devel <linux-fsdevel@vger.kernel.org>,
    Linux Security Module <linux-security-module@vger.kernel.org>,
    Akinobu Mita <akinobu.mita@gmail.com>,
    Alexey Dobriyan <adobriyan@gmail.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Andy Lutomirski <luto@kernel.org>,
    Daniel Micay <danielmicay@gmail.com>,
    Djalal Harouni <tixxdz@gmail.com>,
    "Dmitry V . Levin" <ldv@altlinux.org>,
    Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
    Ingo Molnar <mingo@kernel.org>,
    "J . Bruce Fields" <bfields@fieldses.org>,
    Jeff Layton <jlayton@poochiereds.net>,
    Jonathan Corbet <corbet@lwn.net>,
    Kees Cook <keescook@chromium.org>,
    Oleg Nesterov <oleg@redhat.com>,
    Solar Designer <solar@openwall.com>
Subject: Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
Date: Thu, 13 Feb 2020 21:48:48 -0600
Message-ID: <87tv3tde1r.fsf@x220.int.ebiederm.org>
In-Reply-To: <20200213055527.GS23230@ZenIV.linux.org.uk> (Al Viro's message of "Thu, 13 Feb 2020 05:55:27 +0000")

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:
>
>> I think I have an alternate idea that could work.  Add some extra code
>> into proc_task_readdir, that would look for dentries that no longer
>> point to tasks and d_invalidate them.  With the same logic probably
>> being called from a few more places as well like proc_pid_readdir,
>> proc_task_lookup, and proc_pid_lookup.
>>
>> We could even optimize it and have a process died flag we set in the
>> superblock.
>>
>> That would batch up the freeing work until the next time someone
>> reads from proc in a way that would create more dentries.  So it would
>> prevent dentries from reaped zombies from growing without bound.
>>
>> Hmm.  Given the existence of proc_fill_cache it would really be a good
>> idea if readdir and lookup performed some of the freeing work as well.
>> As on readdir we always populate the dcache for all of the directory
>> entries.
>
> First of all, that won't do a damn thing when nobody is accessing
> given superblock.  What's more, readdir in root of that procfs instance
> is not enough - you need it in task/ of group leader.

It should give a rough bound on the number of stale dentries a
superblock can have.  The same basic concept has been used very
successfully in many incremental garbage collectors.  In those, malloc
(or the equivalent) does a finite amount of garbage collection work to
roughly balance out the amount of memory allocated.  I am proposing
something similar for proc instances.

Further, if no one is accessing a superblock we don't have a problem
either.

> What I don't understand is the insistence on getting those dentries
> via dcache lookups.  _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:

No insistence from this side.  I was not seeing
atomic_inc_not_zero(sb->s_active) from rcu context as an option
earlier.  But it is an option.

>       * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,
> BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
> the inode; many callers do)
>       * have ->d_release() remove from it (protecting per-task_struct lock
> nested outside of all ->d_lock)
>       * on exit:
>       lock the (per-task_struct) list
>       while list is non-empty
>               pick the first dentry
>               remove from the list
>               sb = dentry->d_sb
>               try to bump sb->s_active (if non-zero, that is).
>               if failed
>                       continue // move on to the next one - nothing to do here
>               grab ->d_lock
>               res = handle_it(dentry, &temp_list)
>               drop ->d_lock
>               unlock the list
>               if (!list_empty(&temp_list))
>                       shrink_dentry_list(&temp_list)
>               if (res)
>                       d_invalidate(dentry)
>                       dput(dentry)
>               deactivate_super(sb)
>               lock the list
>       unlock the list
>
> handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
>       if ->d_count is negative // unlikely
>               return 0;
>       if ->d_count is positive,
>               increment ->d_count
>               return 1;
>       // OK, it's still alive, but ->d_count is 0
>       __d_drop        // equivalent of d_invalidate in this case
>       if not on a shrink list // otherwise it's not our headache
>               if on lru list
>                       d_lru_del
>               d_shrink_add dentry to temp_list
>       return 0;
>
> And yeah, that'll dirty ->s_active for each procfs superblock that
> has dentry for our process present in dcache.  On exit()...

I would thread the whole thing through the proc_inode instead of coming
up with a new allocation per dentry so an extra memory allocation isn't
needed.  We already have i_dentry.  So going from the vfs_inode to the
dentry is trivial.

But truthfully I don't like proc_flush_task.

The problem is that proc_flush_task is a layering violation and magic
code that pretty much no one understands.  We have some very weird
cases where dput or d_invalidate wound up triggering ext3 code.  It has
been fixed for a long time now, but it was crazy weird unexpected
stuff.

Al, your logic above just feels very clever, and like many pieces of the
kernel have to know how other pieces of the kernel work.  If we can find
something stupid and simple that also solves the problem I would be much
happier.  Then anyone could understand and fix it if something goes
wrong.

Eric
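The handle_it() pseudocode quoted above boils down to a three-way decision on ->d_count. The following is only a userspace sketch of that decision, not kernel code: struct toy_dentry, its fields, and toy_handle_it() are hypothetical stand-ins, the locking (->d_lock, the per-task list lock) is left out entirely, and the real helpers (__d_drop, d_lru_del, d_shrink_add) are modeled as plain flag updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few dentry fields the pseudocode touches. */
struct toy_dentry {
    int count;                 /* models ->d_count: <0 dying, 0 unused, >0 in use */
    bool hashed;               /* still reachable by lookup */
    bool on_lru;               /* sitting on the LRU list */
    bool on_shrink_list;       /* already queued for disposal by someone */
    struct toy_dentry *next;   /* link on the caller's temporary disposal list */
};

/*
 * Models handle_it(): returns 1 when the caller still has to invalidate
 * and then drop the reference taken here, 0 when there is nothing more
 * for the caller to do with this dentry itself.
 */
static int toy_handle_it(struct toy_dentry *d, struct toy_dentry **temp_list)
{
    assert(d != NULL && temp_list != NULL);

    if (d->count < 0)          /* already being torn down */
        return 0;

    if (d->count > 0) {        /* in use: pin it for the caller */
        d->count++;
        return 1;
    }

    /* Alive but unreferenced: unhash it (models __d_drop). */
    d->hashed = false;
    if (!d->on_shrink_list) {  /* otherwise someone else will dispose of it */
        d->on_lru = false;     /* models d_lru_del, a no-op if not on the LRU */
        d->on_shrink_list = true;
        d->next = *temp_list;  /* models d_shrink_add onto temp_list */
        *temp_list = d;
    }
    return 0;
}
```

In this model an in-use dentry comes back pinned (return value 1), while an idle one is unhashed and queued on the temporary list for the caller to shrink afterwards.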
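The alternative floated in the quoted text, a "process died" flag in the superblock with the cleanup deferred to the next readdir or lookup, can be sketched in the same spirit. This is a userspace model under stated assumptions, not a proposed patch: struct toy_sb, struct toy_entry, toy_task_exit() and toy_sweep_stale() are hypothetical names, and dropping a stale dentry is modeled as clearing a flag where the kernel would call d_invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cached /proc/<pid> entry. */
struct toy_entry {
    int pid;
    bool task_alive;   /* does the entry still point at a live task? */
    bool cached;       /* is the entry still in the (toy) dcache? */
};

/* Hypothetical model of one procfs superblock. */
struct toy_sb {
    bool task_died;    /* the "process died" flag from the proposal */
    struct toy_entry *entries;
    size_t nr_entries;
};

/* Exit path: do no dcache work at all, only record that cleanup is pending. */
static void toy_task_exit(struct toy_sb *sb, int pid)
{
    assert(sb != NULL);
    for (size_t i = 0; i < sb->nr_entries; i++) {
        if (sb->entries[i].pid == pid)
            sb->entries[i].task_alive = false;
    }
    sb->task_died = true;
}

/*
 * Called from the paths that would create new entries (readdir, lookup).
 * Drops every cached entry whose task is gone and returns how many were
 * dropped. The flag makes the common no-exit case a single test.
 */
static size_t toy_sweep_stale(struct toy_sb *sb)
{
    size_t dropped = 0;

    assert(sb != NULL);
    if (!sb->task_died)
        return 0;
    sb->task_died = false;

    for (size_t i = 0; i < sb->nr_entries; i++) {
        struct toy_entry *e = &sb->entries[i];
        if (e->cached && !e->task_alive) {
            e->cached = false;   /* stands in for d_invalidate() */
            dropped++;
        }
    }
    return dropped;
}
```

The point of the model is the trade-off discussed in the thread: stale entries linger until the next access to that superblock, but exit() itself stays cheap and the sweep runs only where new entries would be created anyway.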