From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1A29FECAAD8 for ; Sat, 17 Sep 2022 00:11:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2A8EA940007; Fri, 16 Sep 2022 20:11:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 22FF38D0001; Fri, 16 Sep 2022 20:11:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0ABFE940007; Fri, 16 Sep 2022 20:11:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id E9EEC8D0001 for ; Fri, 16 Sep 2022 20:11:21 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id B1ED2402CD for ; Sat, 17 Sep 2022 00:11:21 +0000 (UTC) X-FDA: 79919647962.11.A9F3D7C Received: from mail-pj1-f42.google.com (mail-pj1-f42.google.com [209.85.216.42]) by imf11.hostedemail.com (Postfix) with ESMTP id 53E064009D for ; Sat, 17 Sep 2022 00:11:21 +0000 (UTC) Received: by mail-pj1-f42.google.com with SMTP id bu5-20020a17090aee4500b00202e9ca2182so2653897pjb.0 for ; Fri, 16 Sep 2022 17:11:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date; bh=9Hg3WExcsAga0Qhwnt0s+/w+FgE+Fu4QOFI0odH5wLw=; b=acmLYJB90z6yu5lwsEGqpEGUuyqXxuAyLieS/i0RxF4QVKRLADMm6JnV+XBMbU9EAA elQkBbIz+sad/CGZx42A8rsnunF+Xok/Oqr7xqReyHCAX/3ws4pKA49/ldpiPL6yTaRg pzUxrW7n2Qf4rg8ztazly0c1K+osu8H2CEbhg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date; bh=9Hg3WExcsAga0Qhwnt0s+/w+FgE+Fu4QOFI0odH5wLw=; b=CXHYjgSwwK/sCSpPnStNwqDjX1S/Ro+p6BI/ulmtSAaVaQ/nHL/Mb4yKY/s9e2iDfM 9ZZVFk/754MkqNKinDXVYiN7veCwxjmsd0C/gbHf9bPw35fWMayYXj6XbswadfnasN5m tuJF3v719UvLe6TrwyxCPSKOP98rT1szrN05ejws7hEbbn+KSYcvaVtvKZatKscXgKc2 KRX8Ns4JKrvhdMNt+tB8m7R2Uf6Rldj3cHpbPuZ/LJ1E0KCP27MnBcnVG/lo2akyUicj qLnjvz0aN5rad7b7TYNcQW3u995MUCG2iO3g9cvXIRaGW8F3ZRqbK1mtUaeLJ9mqfUEy p8aA== X-Gm-Message-State: ACrzQf1InEcK14PXeG1BNYxNbDFw0C4CMJ17U4Fk4sHAQxtA+CTiLls0 xqbaoQvmWdkq7hhe38mWJ6IQsQ== X-Google-Smtp-Source: AMsMyM7yb8IkuA2CaVot0t/bvM6dtkqaItFJcFVr0dj47HvZl5Zkqje37Lxyoo/sQLSBjvgXpodG4Q== X-Received: by 2002:a17:902:e40c:b0:176:9543:883 with SMTP id m12-20020a170902e40c00b0017695430883mr2207448ple.169.1663373479992; Fri, 16 Sep 2022 17:11:19 -0700 (PDT) Received: from www.outflux.net (smtp.outflux.net. [198.145.64.163]) by smtp.gmail.com with ESMTPSA id u126-20020a626084000000b005385e2e86eesm15525980pfb.18.2022.09.16.17.11.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 16 Sep 2022 17:11:19 -0700 (PDT) Date: Fri, 16 Sep 2022 17:11:18 -0700 From: Kees Cook To: Josh Triplett Cc: Peter Zijlstra , Eric Biederman , Alexander Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm Message-ID: <202209161637.9EDAF6B18@keescook> References: <5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org> <202209160727.5FC78B735@keescook> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1663373481; a=rsa-sha256; cv=none; b=RtAcxEJBzmTGfV7MhavxYtLk5xmaErtLtps5jnpRH52/YCYlHHtFk8VUHSz/obzUVYkZQD guZJ0yxwQtPYFHgAcXRE7LpMMi65XA6r19OIc1baOluIG4Dv3iwppEKqpcZiiiWG6yNhDx qNoaUC8N+/gwVDG8YhT4bBXcbTza/zU= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=chromium.org header.s=google header.b=acmLYJB9; dmarc=pass (policy=none) header.from=chromium.org; spf=pass (imf11.hostedemail.com: domain of keescook@chromium.org designates 209.85.216.42 as permitted sender) smtp.mailfrom=keescook@chromium.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1663373481; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=9Hg3WExcsAga0Qhwnt0s+/w+FgE+Fu4QOFI0odH5wLw=; b=kqtH5qd+7ZMfrZjUkVFTl3Sz9Zk3VLmZyefI/jjHoyVpK9z4LkRolvt2cDOnJbvmjmBX7E oBO7Ieq8rHab9nhH2deAh3CySgpi2p44V8CZ+AcVZlYall9T5H17OyN6mhYYkgFH9rz0Yl wQ7devEveHy+yHUQMY2CfGuFCV9kFPc= X-Stat-Signature: r9zkrsg31aw5uxm9uszsk4icxobs3qrx X-Rspamd-Queue-Id: 53E064009D Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=chromium.org header.s=google header.b=acmLYJB9; dmarc=pass (policy=none) header.from=chromium.org; spf=pass (imf11.hostedemail.com: domain of keescook@chromium.org designates 209.85.216.42 as permitted sender) smtp.mailfrom=keescook@chromium.org X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1663373481-742088 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: [Hi Peter, apologies for dumping you into the middle of this thread. I've got a question about sched_exec() below...] On Fri, Sep 16, 2022 at 09:13:44PM +0100, Josh Triplett wrote: > musl does the same thing, as do python and perl (likely via execvp or > posix_spawnp). As does gcc when it executes `as`. And I've seen more > than a few programs hand-implement a PATH search the same way. Seems > worth optimizing for. Yeah, it does seem like a simple way to eliminate needless work, though I'd really like to see some kind of perf count of "in a given kernel build, how many execve() system calls fail due to path search vs succeed", just to get a better sense of the scale of the problem. I don't like the idea of penalizing the _succeeding_ case, though, which happens if we do the path walk twice. So, I went and refactoring the setup order, moving the do_open_execat() up into alloc_bprm() instead of where it was in bprm_exec(). The result makes it so it is, as you observed, before the mm creation and generally expensive argument copying. The difference to your patch seems to only be the allocation of the file table entry, but avoids the double lookup, so I'm hoping the result is actually even faster. This cleanup is actually quite satisfying organizationally too -- the fd and filename were passed around rather oddly. The interaction with sched_exec() should be no worse (the file is opened before it in either case), but in reading that function, it talks about taking the opportunity to move the process to another CPU (IIUC) since, paraphrasing, "it is at its lowest memory/cache size." But I wonder if there is an existing accidental pessimistic result in that the process stack has already been allocated. I am only passingly familiar with how tasks get moved around under NUMA -- is the scheduler going to move this process onto a different NUMA node and now it will be forced to have the userspace process stack on one node and the program text and heap on another? Or is that totally lost in the noise? More specifically, I was wondering if processes would benefit from having sched_exec() moved before the mm creation? Regardless, here's a very lightly tested patch. Can you take this for a spin and check your benchmark? Thanks! -Kees diff --git a/fs/exec.c b/fs/exec.c index 9a5ca7b82bfc..5534301d67ca 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -898,6 +898,10 @@ EXPORT_SYMBOL(transfer_args_to_stack); #endif /* CONFIG_MMU */ +/* + * On success, callers must call do_close_execat() on the returned + * struct file. + */ static struct file *do_open_execat(int fd, struct filename *name, int flags) { struct file *file; @@ -945,6 +949,16 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) return ERR_PTR(err); } +/** + * open_exec - Open a path name for execution + * + * @name: path name to open with the intent of executing it. + * + * Returns ERR_PTR on failure or allocated struct file on success. + * + * As this is a wrapper for the internal do_open_execat(), callers + * must call allow_write_access() before fput() on release. + */ struct file *open_exec(const char *name) { struct filename *filename = getname_kernel(name); @@ -1485,6 +1499,15 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) return -ENOMEM; } +/* Matches do_open_execat() */ +static void do_close_execat(struct file *file) +{ + if (!file) + return; + allow_write_access(file); + fput(file); +} + static void free_bprm(struct linux_binprm *bprm) { if (bprm->mm) { @@ -1496,10 +1519,7 @@ static void free_bprm(struct linux_binprm *bprm) mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } - if (bprm->file) { - allow_write_access(bprm->file); - fput(bprm->file); - } + do_close_execat(bprm->file); if (bprm->executable) fput(bprm->executable); /* If a binfmt changed the interp, free it. */ @@ -1509,12 +1529,26 @@ static void free_bprm(struct linux_binprm *bprm) kfree(bprm); } -static struct linux_binprm *alloc_bprm(int fd, struct filename *filename) +static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, + int flags) { - struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); - int retval = -ENOMEM; - if (!bprm) + struct linux_binprm *bprm; + struct file *file; + int retval; + + file = do_open_execat(fd, filename, flags); + if (IS_ERR(file)) { + retval = PTR_ERR(file); goto out; + } + + retval = -ENOMEM; + bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); + if (!bprm) { + do_close_execat(file); + goto out; + } + bprm->file = file; if (fd == AT_FDCWD || filename->name[0] == '/') { bprm->filename = filename->name; @@ -1531,6 +1565,18 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename) } bprm->interp = bprm->filename; + /* + * Record that a name derived from an O_CLOEXEC fd will be + * inaccessible after exec. This allows the code in exec to + * choose to fail when the executable is not mmaped into the + * interpreter and an open file descriptor is not passed to + * the interpreter. This makes for a better user experience + * than having the interpreter start and then immediately fail + * when it finds the executable is inaccessible. + */ + if (bprm->fdpath && get_close_on_exec(fd)) + bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; + retval = bprm_mm_init(bprm); if (retval) goto out_free; @@ -1803,10 +1849,8 @@ static int exec_binprm(struct linux_binprm *bprm) /* * sys_execve() executes a new program. */ -static int bprm_execve(struct linux_binprm *bprm, - int fd, struct filename *filename, int flags) +static int bprm_execve(struct linux_binprm *bprm) { - struct file *file; int retval; retval = prepare_bprm_creds(bprm); @@ -1816,26 +1860,8 @@ static int bprm_execve(struct linux_binprm *bprm, check_unsafe_exec(bprm); current->in_execve = 1; - file = do_open_execat(fd, filename, flags); - retval = PTR_ERR(file); - if (IS_ERR(file)) - goto out_unmark; - sched_exec(); - bprm->file = file; - /* - * Record that a name derived from an O_CLOEXEC fd will be - * inaccessible after exec. This allows the code in exec to - * choose to fail when the executable is not mmaped into the - * interpreter and an open file descriptor is not passed to - * the interpreter. This makes for a better user experience - * than having the interpreter start and then immediately fail - * when it finds the executable is inaccessible. - */ - if (bprm->fdpath && get_close_on_exec(fd)) - bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; - /* Set the unchanging part of bprm->cred */ retval = security_bprm_creds_for_exec(bprm); if (retval) @@ -1863,7 +1889,6 @@ static int bprm_execve(struct linux_binprm *bprm, if (bprm->point_of_no_return && !fatal_signal_pending(current)) force_fatal_sig(SIGSEGV); -out_unmark: current->fs->in_exec = 0; current->in_execve = 0; @@ -1897,7 +1922,7 @@ static int do_execveat_common(int fd, struct filename *filename, * further execve() calls fail. */ current->flags &= ~PF_NPROC_EXCEEDED; - bprm = alloc_bprm(fd, filename); + bprm = alloc_bprm(fd, filename, flags); if (IS_ERR(bprm)) { retval = PTR_ERR(bprm); goto out_ret; @@ -1946,7 +1971,7 @@ static int do_execveat_common(int fd, struct filename *filename, bprm->argc = 1; } - retval = bprm_execve(bprm, fd, filename, flags); + retval = bprm_execve(bprm); out_free: free_bprm(bprm); @@ -1971,7 +1996,7 @@ int kernel_execve(const char *kernel_filename, if (IS_ERR(filename)) return PTR_ERR(filename); - bprm = alloc_bprm(fd, filename); + bprm = alloc_bprm(fd, filename, 0); if (IS_ERR(bprm)) { retval = PTR_ERR(bprm); goto out_ret; @@ -2006,7 +2031,7 @@ int kernel_execve(const char *kernel_filename, if (retval < 0) goto out_free; - retval = bprm_execve(bprm, fd, filename, 0); + retval = bprm_execve(bprm); out_free: free_bprm(bprm); out_ret: -- Kees Cook