From: Barry Song <21cnbao@gmail.com>
Date: Thu, 2 Sep 2021 10:56:20 +1200
Subject: Re: Is it possible to implement the per-node page cache for programs/libraries?
To: Linus Torvalds
Cc: Al Viro, Shijie Huang, Andrew Morton, Linux-MM, "Song Bao Hua (Barry Song)", Linux Kernel Mailing List, Frank Wang
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 2, 2021 at 5:31 AM Linus Torvalds wrote:
>
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds wrote:
> >
> > But what you could do, if you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
>
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
>
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
>
> But I suspect that this is mainly only useful for long-running big
> processes (not least due to that node binding thing), so I question
> the need for that kind of excitement.

In Linux server scenarios, it is quite common to have big long-running
processes constantly running on one machine, for example web servers,
databases, etc. Such a process can span a couple of NUMA nodes, using
all CPUs in a server to achieve maximum throughput.

SGI/HPE has a NUMA tool with the command "dplace" that helps deploy
processes with replicated text, in either libraries or the binary
(a.out) [1]:

    dplace [-e] [-c cpu_numbers] [-s skip_count] [-n process_name] \
           [-x skip_mask] [-r [l|b|t]] [-o log_file] [-v 1|2] \
           command [command-args]

    The dplace command accepts the following options:
    ...
    -r: Specifies that text should be replicated on the node or nodes
        where the application is running. In some cases, replication
        will improve performance by reducing the need to make offnode
        memory references for code. The replication option applies to
        all programs placed by the dplace command. See the dplace man
        page for additional information on text replication. The
        replication options are a string of one or more of the
        following characters:

        l - Replicate library text
        b - Replicate binary (a.out) text
        t - Thread round-robin option

On the other hand, it would also be interesting to investigate whether
kernel text replication can help improve performance. MIPS does have
REPLICATE_KTEXT support in the kernel:

    config REPLICATE_KTEXT
        bool "Kernel text replication support"
        depends on SGI_IP27
        select MAPPED_KERNEL
        help
          Say Y here to enable replicating the kernel text across
          multiple nodes in a NUMA cluster.
          This trades memory for speed.

Not quite sure how it will benefit X86 and ARM64 though it seems
concurrent-rt has some solution and benchmark data in RedHawk Linux[2].

[1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
[2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf

>
> Linus

Thanks
Barry
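
A rough userspace sketch of the MAP_PRIVATE COW behaviour that the quoted
suggestion relies on. This is only an illustration under assumptions: the
suggestion in the thread is a kernel-side mechanism driven by NUMA access
counters, whereas here the process itself forces the COW faults. The helper
name cow_touch is hypothetical, and the node that each private copy lands on
is left to the task's memory policy (normally the local node). Program text is
usually mapped without write permission, so this does not apply as-is to
executables or libraries.

```python
import mmap
import os
import tempfile


def cow_touch(path):
    """Map `path` privately and force a COW fault on every page.

    Hypothetical helper, not taken from the thread.
    Returns (mapping, pages_touched).
    """
    with open(path, "rb") as f:
        # ACCESS_COPY is Python's spelling of a writable MAP_PRIVATE mapping:
        # writes go to private copies and never reach the underlying file.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    pages = 0
    for off in range(0, len(m), mmap.PAGESIZE):
        # Writing back the byte that is already there is enough to take the
        # write fault, so the kernel swaps the shared page-cache page for a
        # private copy allocated under the task's memory policy.
        m[off] = m[off]
        pages += 1
    return m, pages


if __name__ == "__main__":
    # Demo on a throwaway three-page file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (3 * mmap.PAGESIZE))
    mapping, touched = cow_touch(tmp.name)
    print(touched, "pages now have private copies")
    mapping.close()
    os.unlink(tmp.name)
```

The contents seen through the mapping are unchanged and the file on disk is
never modified; only the backing pages differ, which is the property that
makes per-process replication of a MAP_PRIVATE mapping safe.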