Re: Is it possible to implement the per-node page cache for programs/libraries?

From: Barry Song <21cnbao@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Shijie Huang <shijie@amperemail.onmicrosoft.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Frank Wang <zwang@amperecomputing.com>
Subject: Re: Is it possible to implement the per-node page cache for programs/libraries?
Date: Thu, 2 Sep 2021 10:56:20 +1200
Message-ID: <CAGsJ_4yLrGv2izZ2z4QWnBbDOhEjHygHDFBthfFqW0XEkMP-ag@mail.gmail.com>
In-Reply-To: <CAHk-=wjAPEs3HRGswJ-AE1R048j2MBsBtMfg3GOsaFykHoeKsg@mail.gmail.com>

On Thu, Sep 2, 2021 at 5:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But what you could do, if  you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
>
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
>
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
>
> But I suspect that this is mainly only useful for long-running big
> processes (not least due to that node binding thing), so I question
> the need for that kind of excitement.
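
To make the "over-eager COW" idea above concrete, here is a minimal
userspace sketch, not the kernel-side mechanism being proposed. It assumes
a writable MAP_PRIVATE mapping of an ordinary temporary file, and the
helper names force_private_copies() and eager_cow_demo() are made up for
illustration. Writing back the byte that is already in each page is enough
to take the COW fault, so the task gets its own copy of every page. Under
the default memory policy that copy would normally be allocated on the
faulting CPU's node, but the sketch does not verify placement. Real
program text is mapped read-only, so the kernel would have to trigger the
copy itself rather than rely on a write from userspace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Force a private (COW) copy of every page in a writable MAP_PRIVATE
 * mapping by writing back the byte that is already there.
 * Returns the number of pages touched.
 */
static size_t force_private_copies(unsigned char *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    assert(page > 0);
    for (size_t off = 0; off < len; off += (size_t)page) {
        volatile unsigned char *p = addr + off;
        *p = *p;        /* write fault: kernel allocates a private copy */
        touched++;
    }
    return touched;
}

/*
 * Self-check on a 4-page temporary file (hypothetical path under /tmp).
 * Returns 0 if the mapping keeps its contents after the forced COW and
 * a later write to the mapping does not reach the backing file.
 */
static int eager_cow_demo(void)
{
    char path[] = "/tmp/eager-cow-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 4 * (size_t)page;
    unsigned char *buf, *map;
    unsigned char first = 0;
    int fd, ok;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    buf = malloc(len);
    if (!buf) {
        close(fd);
        return -1;
    }
    memset(buf, 0x5a, len);
    if (write(fd, buf, len) != (ssize_t)len) {
        free(buf);
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        free(buf);
        close(fd);
        return -1;
    }

    /* Every page is now private to this process, contents unchanged. */
    ok = force_private_copies(map, len) == 4 && memcmp(map, buf, len) == 0;

    /* A write to the private copy must not be visible in the file. */
    map[0] = 0x00;
    if (pread(fd, &first, 1, 0) != 1 || first != 0x5a)
        ok = 0;

    munmap(map, len);
    free(buf);
    close(fd);
    return ok ? 0 : -1;
}
```

Calling eager_cow_demo() from a small main() and checking for a return
value of 0 is enough to try it. Note that this replicates pages per
process; sharing one copy per node between processes is the "more
exciting" case mentioned above and is not covered here.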

In Linux server scenarios, it is quite common to have big, long-running
processes constantly running on one machine, for example web servers and
databases. A process of this kind can span several NUMA nodes and use all
the CPUs in a server to achieve maximum throughput.

The SGI/HPE NUMA tools include a command, "dplace", that helps deploy
processes with replicated text for libraries, the binary (a.out), or both [1]:

dplace [-e] [-c cpu_numbers] [-s skip_count] [-n process_name] \
             [-x skip_mask] [-r [l|b|t]] [-o log_file] [-v 1|2] \
             command [command-args]

The dplace command accepts the following options:
...
-r: Specifies that text should be replicated on the node or nodes where
    the application is running. In some cases, replication will improve
    performance by reducing the need to make offnode memory references
    for code. The replication option applies to all programs placed by
    the dplace command. See the dplace man page for additional
    information on text replication. The replication options are a
    string of one or more of the following characters:
    l - Replicate library text
    b - Replicate binary (a.out) text
    t - Thread round-robin option

On the other hand, it would also be interesting to investigate whether
kernel text replication can help improve performance. MIPS does have
REPLICATE_KTEXT support in the kernel:
config REPLICATE_KTEXT
	bool "Kernel text replication support"
	depends on SGI_IP27
	select MAPPED_KERNEL
	help
	  Say Y here to enable replicating the kernel text across multiple
	  nodes in a NUMA cluster.  This trades memory for speed.

I am not quite sure how much this would benefit x86 and ARM64, though it
seems concurrent-rt has a solution and some benchmark data for RedHawk
Linux [2].

[1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
[2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf

>
>                 Linus

Thanks
Barry