From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 40C47C4167D for ; Thu, 9 Nov 2023 12:21:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232078AbjKIMVJ (ORCPT ); Thu, 9 Nov 2023 07:21:09 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42060 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229450AbjKIMVH (ORCPT ); Thu, 9 Nov 2023 07:21:07 -0500 Received: from mail-oo1-xc30.google.com (mail-oo1-xc30.google.com [IPv6:2607:f8b0:4864:20::c30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C42CF2702; Thu, 9 Nov 2023 04:21:05 -0800 (PST) Received: by mail-oo1-xc30.google.com with SMTP id 006d021491bc7-5849fc56c62so430384eaf.3; Thu, 09 Nov 2023 04:21:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1699532465; x=1700137265; darn=vger.kernel.org; h=cc:to:subject:message-id:date:from:references:in-reply-to :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=Cf06u8ZRup48FiodSvGNoaJfH2Amt2aPrTV0aUl6vho=; b=bh9+RiTbrzxgHshpCGlnT/nMw3MWTOAeiew0g/iCh381q+PV7R6USTyEBvK+RED17U 4qHNCL2eKZ3z8mZ+1yXF3p2ENJHVq1ciMfECrJvXeicCNtQgm/BrH72cyCX/DMMwHsfz 51QfPCcTXwOL1HyNTZnOCtVOpo47x8Kn794t/5MMpx52Nfm1XKRoy1dUp45FtPpdSC8t A6xSDmjlyy1UAmQWEXgbXHFBqrNLu3OZP+1x2DBl/DzIiWcmMATLtt1D2A74gYDKZ8vU L7WyDksCm7oM8W9fuXNVRoEeAMJadleOgaz+syOjzMv3boVO1JlRQ3sH7TwbH62b7K1F tB3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699532465; x=1700137265; h=cc:to:subject:message-id:date:from:references:in-reply-to :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=Cf06u8ZRup48FiodSvGNoaJfH2Amt2aPrTV0aUl6vho=; b=A4hYRF8XMFIxUtYZZ8sG2yAkecPm/ccbOtTiGMdfXsdqWGN5tzKukeEAh3lLVH05Ip c1y7cUFyFVZrL6hTLXD5ktCiofTM+OadDr30klm1BCdjn0v5n3+L+CE/tDJUDHtCE5xC orNHPYBRdpg6RG+zVrZxH93BGuLMwfzmzMdZweEwLJPuh/+tWnoOgVvdGEd+vMNRQmj8 Lg7ZrqKJNyXttK6mVqnEC3UOQGT/LC4T8dvhHxn3tUZ5G3ATeOVl82vuKL7aP2AZqDrS uuXRw8PEu4CvMCIh+qy5jKkmiPbr8KCT6AsIK59H2ychkSxLU7VJjAQCibSKSUD0H4+T oU+A== X-Gm-Message-State: AOJu0YwDRxfAiZi4mBoV7VyObQpqly5j9gXv+o0a7FR1kES3Lzy32z7m ZsPGJBzStNCcx25EwIThUzywe+diV7LLpkj8DBw= X-Google-Smtp-Source: AGHT+IHSCIfAvxAsoOhCmuC/wS5IjQ8yNasMvbesR4csTwSDuItjtE1BiSf3ph5q+6Ne9SxAA+PV13TTUBc5f1h52dg= X-Received: by 2002:a4a:d35a:0:b0:581:ec87:edc0 with SMTP id d26-20020a4ad35a000000b00581ec87edc0mr4683338oos.9.1699532464870; Thu, 09 Nov 2023 04:21:04 -0800 (PST) MIME-Version: 1.0 Received: by 2002:ac9:6696:0:b0:4f0:1250:dd51 with HTTP; Thu, 9 Nov 2023 04:21:04 -0800 (PST) In-Reply-To: <87msvnwzim.fsf@email.froward.int.ebiederm.org> References: <5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org> <202311071228.27D22C00@keescook> <20231107205151.qkwlw7aarjvkyrqs@f> <202311071445.53E5D72C@keescook> <202311081129.9E1EC8D34@keescook> <87msvnwzim.fsf@email.froward.int.ebiederm.org> From: Mateusz Guzik Date: Thu, 9 Nov 2023 13:21:04 +0100 Message-ID: Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm To: "Eric W. Biederman" , Peter Zijlstra Cc: Kees Cook , Josh Triplett , Alexander Viro , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/9/23, Eric W. Biederman wrote: > Mateusz Guzik writes: >> sched_exec causes migration only for only few % of execs in the bench, >> but when it does happen there is tons of overhead elsewhere. >> >> I expect real programs which get past execve will be prone to >> migrating anyway, regardless of what sched_exec is doing. >> >> That is to say, while sched_exec buggering off here would be nice, I >> think for real-world wins the thing to investigate is the overhead >> which comes from migration to begin with. > > I have a vague memory that the idea is that there is a point during exec > when it should be much less expensive than normal to allow migration > between cpus because all of the old state has gone away. > > Assuming that is the rationale, if we are getting lock contention > then either there is a global lock in there, or there is the potential > to pick a less expensive location within exec. > Given the commit below I think the term "migration cost" is overloaded here. By migration cost in my previous mail I meant the immediate cost (stop_one_cpu and so on), but also the aftermath -- for example tlb flushes on another CPU when tearing down your now-defunct mm after you switched. For testing purposes I verified commenting out sched_exec and not using taskset still gives me about 9.5k ops/s. I 100% agree should the task be moved between NUMA domains, it makes sense to do it when it has the smallest footprint. I don't know what the original patch did, the current code just picks a CPU and migrates to it, regardless of NUMA considerations. I will note that the goal would still be achieved by comparing domains and doing nothing if they match. I think this would be nice to fix, but it is definitely not a big deal. I guess the question is to Peter Zijlstra if this sounds reasonable. > Just to confirm my memory I dug a little deeper and I found the original > commit that added sched_exec (in tglx's git tree of the bit keeper > history). > > commit f01419fd6d4e5b32fef19d206bc3550cc04567a9 > Author: Martin J. Bligh > Date: Wed Jan 15 19:46:10 2003 -0800 > > [PATCH] (2/3) Initial load balancing > > Patch from Michael Hohnbaum > > This adds a hook, sched_balance_exec(), to the exec code, to make it > place the exec'ed task on the least loaded queue. We have less state > to move at exec time than fork time, so this is the cheapest point > to cross-node migrate. Experience in Dynix/PTX and testing on Linux > has confirmed that this is the cheapest time to move tasks between > nodes. > > It also macro-wraps changes to nr_running, to allow us to keep track of > per-node nr_running as well. Again, no impact on non-NUMA machines. > > > Eric > > -- Mateusz Guzik