linux-kernel.vger.kernel.org archive mirror
* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  3:07 Is it possible to implement the per-node page cache for programs/libraries? Shijie Huang
@ 2021-09-01  2:09 ` Barry Song
  2021-09-01  3:25 ` Matthew Wilcox
  2021-09-01  4:55 ` Al Viro
  2 siblings, 0 replies; 24+ messages in thread
From: Barry Song @ 2021-09-01  2:09 UTC (permalink / raw)
  To: Shijie Huang
  Cc: Linus Torvalds, viro, Andrew Morton, linux-mm, Barry Song, LKML,
	Frank Wang

On Wed, Sep 1, 2021 at 11:09 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
> Hi Everyone,
>
>      On NUMA systems, we only have one page cache for each file. For
> programs and shared libraries, remote access is slower than local access.
>
> So, is it possible to implement the per-node page cache for
> programs/libraries?

As far as I know, this is a very interesting topic, and we do have some
"solutions" for it.
The MIPS kernel supports kernel text replication:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/mips/sgi-ip27/Kconfig
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/mips/sgi-ip27/ip27-klnuma.c

config REPLICATE_KTEXT
	bool "Kernel text replication support"
	depends on SGI_IP27
	select MAPPED_KERNEL
	help
	  Say Y here to enable replicating the kernel text across multiple
	  nodes in a NUMA cluster.  This trades memory for speed.

For x86, RedHawk Linux (https://www.concurrent-rt.com/solutions/linux/) supports
kernel text replication.
Here are some benchmarks:
https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf

For userspace, dplace from SGI can help replicate text:
https://www.spec.org/cpu2006/flags/SGI-platform.html

-r bl: specifies that text should be replicated on the NUMA node or
nodes where the process is running.
'b' indicates that binary (a.out) text should be replicated;
'l' indicates that library text should be replicated.

But all of the above, except MIPS ktext replication, are out of tree.

Please count me in if you have any solution or pending patches.
I am interested in this topic.

>
>
>     We can do it like this:
>
>          1.) Add a new system call to make specific files NUMA-aware,
> such as:
>
>                     set_numa_aware("/usr/lib/libc.so", enable);
>
>              After the system call, the page cache of libc.so has the
> flag "NUMA_ENABLED" set.
>
>
>          2.) When a new process tries to set up the MMU page table for
> libc.so, it checks whether NUMA_ENABLED is set. If it is set, the kernel
> gives it a page which is bound to the process's NUMA node.
>
>               This way, we can eliminate remote accesses for
> programs/shared libraries.
>
>
> Is this proposal ok?  Or do you have a better idea?
>
>
> Thanks
>
> Huang Shijie

Thanks
barry

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Is it possible to implement the per-node page cache for programs/libraries?
@ 2021-09-01  3:07 Shijie Huang
  2021-09-01  2:09 ` Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Shijie Huang @ 2021-09-01  3:07 UTC (permalink / raw)
  To: torvalds, viro, akpm, linux-mm, song.bao.hua; +Cc: linux-kernel, Frank Wang

Hi Everyone,

     On NUMA systems, we only have one page cache for each file. For
programs and shared libraries, remote access is slower than local access.

So, is it possible to implement the per-node page cache for 
programs/libraries?


    We can do it like this:

         1.) Add a new system call to make specific files NUMA-aware,
such as:

                    set_numa_aware("/usr/lib/libc.so", enable);

             After the system call, the page cache of libc.so has the
flag "NUMA_ENABLED" set.


         2.) When a new process tries to set up the MMU page table for
libc.so, it checks whether NUMA_ENABLED is set. If it is set, the kernel
gives it a page which is bound to the process's NUMA node.

              This way, we can eliminate remote accesses for
programs/shared libraries.
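
The two steps can be pictured with a toy user-space model. This is only a sketch, not kernel code and not taken from any posted patch: the struct, the toy_page_for_node() helper and the constants are hypothetical, and only the NUMA_ENABLED flag name comes from the proposal. malloc() stands in for "allocate a page on the faulting task's node".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NR_NODES  4		/* assumed node count */
#define TOY_PAGE_SIZE 4096	/* assumed page size */

/* Toy stand-in for one page of a file in the page cache (hypothetical). */
struct toy_file_page {
	int numa_enabled;			/* models the proposed NUMA_ENABLED flag */
	unsigned char *master;			/* the single copy that exists today */
	unsigned char *replica[TOY_NR_NODES];	/* per-node copies, created lazily */
};

/*
 * Return the copy that a task running on 'node' should map.
 * Without the flag every node gets the master copy; with it, each node
 * gets its own replica, created on first use. Returns NULL if the
 * replica cannot be allocated.
 */
unsigned char *toy_page_for_node(struct toy_file_page *pg, int node)
{
	assert(node >= 0 && node < TOY_NR_NODES);

	if (!pg->numa_enabled)
		return pg->master;

	if (!pg->replica[node]) {
		unsigned char *copy = malloc(TOY_PAGE_SIZE);

		if (!copy)
			return NULL;
		memcpy(copy, pg->master, TOY_PAGE_SIZE);
		pg->replica[node] = copy;
	}
	return pg->replica[node];
}
```

The model deliberately ignores coherency: if the master copy ever changed, every replica would have to be invalidated or updated as well.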


Is this proposal ok?  Or do you have a better idea?


Thanks

Huang Shijie







* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  3:07 Is it possible to implement the per-node page cache for programs/libraries? Shijie Huang
  2021-09-01  2:09 ` Barry Song
@ 2021-09-01  3:25 ` Matthew Wilcox
  2021-09-01 13:30   ` Huang Shijie
  2021-09-02  3:25   ` Nicholas Piggin
  2021-09-01  4:55 ` Al Viro
  2 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2021-09-01  3:25 UTC (permalink / raw)
  To: Shijie Huang
  Cc: torvalds, viro, akpm, linux-mm, song.bao.hua, linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
>     On NUMA systems, we only have one page cache for each file. For
> programs and shared libraries, remote access is slower than local access.
> 
> So, is it possible to implement the per-node page cache for
> programs/libraries?

At this point, we have no way to support text replication within a
process.  So what you're suggesting (if implemented) would work for
processes which limit themselves to a single node.  That is, if you
have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
which only works on node 0 or only works on node 1 will get text on the
appropriate node.
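
A minimal sketch of the single-node case, assuming exactly the topology in the example above (CPUs 0-3 on node 0, CPUs 4-7 on node 1). The helper names are made up for illustration. The raw sched_setaffinity system call is used so that the sketch does not depend on glibc's cpu_set_t macros.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for syscall() under strict -std= modes */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed topology from the example: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
#define CPUS_PER_NODE 4
#define NR_NODES      2

/* Map a CPU number to its node under the assumed topology. */
int cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < CPUS_PER_NODE * NR_NODES);
	return cpu / CPUS_PER_NODE;
}

/* Build a bitmask with one bit set for every CPU that belongs to 'node'. */
unsigned long node_cpumask(int node)
{
	unsigned long mask = 0;

	assert(node >= 0 && node < NR_NODES);
	for (int cpu = 0; cpu < CPUS_PER_NODE * NR_NODES; cpu++)
		if (cpu_to_node(cpu) == node)
			mask |= 1UL << cpu;
	return mask;
}

/*
 * Restrict the calling thread to the CPUs of one node.
 * Returns 0 on success, -1 with errno set on error (for example when
 * the machine does not have those CPUs).
 */
int bind_self_to_node(int node)
{
	unsigned long mask = node_cpumask(node);

	return (int)syscall(SYS_sched_setaffinity, 0, sizeof(mask), &mask);
}
```

Binding only decides where the process runs. Pages that are already in the page cache stay on whichever node they were first read into, which is why the text placement still needs separate handling.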

If there's a process which runs on both nodes 0 and 1, there's no support
for per-node PGDs.  So it will get a mix of pages from nodes 0 and 1,
and that doesn't necessarily seem like a big win.  I haven't yet dived
into how hard it would be to make mm->pgd a per-node allocation.

I have been thinking about this a bit; one of our internal performance
teams flagged the potential performance win to me a few months ago.
I don't have a concrete design for text replication yet; there have been
various attempts over the years, but none were particularly compelling.

By the way, the degree of performance win varies between different CPUs,
but it's measurable on all the systems we've tested on (from three
different vendors).

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  3:07 Is it possible to implement the per-node page cache for programs/libraries? Shijie Huang
  2021-09-01  2:09 ` Barry Song
  2021-09-01  3:25 ` Matthew Wilcox
@ 2021-09-01  4:55 ` Al Viro
  2021-09-01 13:10   ` Huang Shijie
  2021-09-01 17:24   ` Linus Torvalds
  2 siblings, 2 replies; 24+ messages in thread
From: Al Viro @ 2021-09-01  4:55 UTC (permalink / raw)
  To: Shijie Huang
  Cc: torvalds, akpm, linux-mm, song.bao.hua, linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> Hi Everyone,
> 
>     On NUMA systems, we only have one page cache for each file. For
> programs and shared libraries, remote access is slower than local access.
> 
> So, is it possible to implement the per-node page cache for
> programs/libraries?

What do you mean, per-node page cache?  Multiple pages for the same
area of file?  That'd be bloody awful on coherency...

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 14:25     ` Huang Shijie
@ 2021-09-01 11:32       ` Matthew Wilcox
  2021-09-01 23:58       ` Matthew Wilcox
  1 sibling, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2021-09-01 11:32 UTC (permalink / raw)
  To: Huang Shijie
  Cc: Shijie Huang, torvalds, viro, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > >     On NUMA systems, we only have one page cache for each file. For
> > > > programs and shared libraries, remote access is slower than local access.
> > > > 
> > > > So, is it possible to implement the per-node page cache for
> > > > programs/libraries?
> > > 
> > > At this point, we have no way to support text replication within a
> > > process.  So what you're suggesting (if implemented) would work for
> > 
> > I created a glibc patch which can do the text replication within a process.
> The "text replication" means the shared libraries, not program itself.

Is it really worthwhile to do only the shared libraries?

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  4:55 ` Al Viro
@ 2021-09-01 13:10   ` Huang Shijie
  2021-09-01 17:24   ` Linus Torvalds
  1 sibling, 0 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-01 13:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Shijie Huang, torvalds, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 04:55:01AM +0000, Al Viro wrote:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > Hi Everyone,
> > 
> >     On NUMA systems, we only have one page cache for each file. For
> > programs and shared libraries, remote access is slower than local access.
> > 
> > So, is it possible to implement the per-node page cache for
> > programs/libraries?
> 
> What do you mean, per-node page cache?  Multiple pages for the same
> area of file?  That'd be bloody awful on coherency...
Yes, a per-NUMA-node page cache.

We can limit it to programs and shared libraries, which are mostly read-only
and do not need coherency.

Thanks
Huang Shijie

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  3:25 ` Matthew Wilcox
@ 2021-09-01 13:30   ` Huang Shijie
  2021-09-01 14:25     ` Huang Shijie
  2021-09-02  3:25   ` Nicholas Piggin
  1 sibling, 1 reply; 24+ messages in thread
From: Huang Shijie @ 2021-09-01 13:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shijie Huang, torvalds, viro, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> >     On NUMA systems, we only have one page cache for each file. For
> > programs and shared libraries, remote access is slower than local access.
> > 
> > So, is it possible to implement the per-node page cache for
> > programs/libraries?
> 
> At this point, we have no way to support text replication within a
> process.  So what you're suggesting (if implemented) would work for

I created a glibc patch which can do the text replication within a process.
I will send it to the glibc maintainers later.
(It seems glibc does not use patches to maintain the code.)


> processes which limit themselves to a single node.  That is, if you
> have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
> which only works on node 0 or only works on node 1 will get text on the
> appropriate node.
> 
> If there's a process which runs on both nodes 0 and 1, there's no support
> for per-node PGDs.  So it will get a mix of pages from nodes 0 and 1,
I think we do not need per-node PGDs.

One PGD per process is okay with me.

> and that doesn't necessarily seem like a big win.  I haven't yet dived
> into how hard it would be to make mm->pgd a per-node allocation.
> 
> I have been thinking about this a bit; one of our internal performance
> teams flagged the potential performance win to me a few months ago.
> I don't have a concrete design for text replication yet; there have been
> various attempts over the years, but none were particularly compelling.
> 
> By the way, the degree of performance win varies between different CPUs,
> but it's measurable on all the systems we've tested on (from three
> different vendors).
Thank you for sharing this.

Thanks
Huang Shijie

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 13:30   ` Huang Shijie
@ 2021-09-01 14:25     ` Huang Shijie
  2021-09-01 11:32       ` Matthew Wilcox
  2021-09-01 23:58       ` Matthew Wilcox
  0 siblings, 2 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-01 14:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shijie Huang, torvalds, viro, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > >     On NUMA systems, we only have one page cache for each file. For
> > > programs and shared libraries, remote access is slower than local access.
> > > 
> > > So, is it possible to implement the per-node page cache for
> > > programs/libraries?
> > 
> > At this point, we have no way to support text replication within a
> > process.  So what you're suggesting (if implemented) would work for
> 
> I created a glibc patch which can do the text replication within a process.
By "text replication" I mean the shared libraries, not the program itself.

Thanks
Huang Shijie

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  4:55 ` Al Viro
  2021-09-01 13:10   ` Huang Shijie
@ 2021-09-01 17:24   ` Linus Torvalds
  2021-09-01 17:29     ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2021-09-01 17:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Shijie Huang, Andrew Morton, Linux-MM, Song Bao Hua (Barry Song),
	Linux Kernel Mailing List, Frank Wang

On Tue, Aug 31, 2021 at 9:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What do you mean, per-node page cache?  Multiple pages for the same
> area of file?  That'd be bloody awful on coherency...

You absolutely don't want to actually duplicate it in the cache.

But what you could do, if  you wanted to, would be to catch the
situation where you have lots of expensive NUMA accesses either using
our VM infrastructure or performance counters, and when the mapping is
a MAP_PRIVATE you just do a COW fault on them.

Honestly, I suspect it only makes sense when you have already bound
your process to one particular NUMA node.

Sounds entirely doable, and has absolutely nothing to do with the page
cache. It would literally just be an "over-eager COW fault triggered
by NUMA access counters".
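
The effect can be approximated by hand from user space. The sketch below is an assumption-laden illustration, not the kernel mechanism described above: nothing here is triggered by access counters, and both function names are made up. It maps a file MAP_PRIVATE and writable, then writes every page's first byte back to itself, so that each page takes a COW fault and the process ends up with private copies.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file MAP_PRIVATE and writable; returns NULL on failure. */
void *map_file_private(const char *path, size_t *len)
{
	struct stat st;
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}
	*len = (size_t)st.st_size;
	p = mmap(NULL, *len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after close */
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Store back the first byte of every page in [addr, addr + len).
 * On a MAP_PRIVATE mapping the store takes a COW fault, so the page is
 * copied without changing its contents. Returns the number of pages
 * touched.
 */
size_t force_private_copies(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile unsigned char *p = addr;
	size_t touched = 0;

	assert(page > 0);
	for (size_t off = 0; off < len; off += (size_t)page) {
		p[off] = p[off];	/* read, then write the same value */
		touched++;
	}
	return touched;
}
```

Under the default local-allocation policy the copies normally land on the node where the faulting thread is running, so this is mostly useful after the process has been bound to a node. Real text mappings are not writable, so an actual implementation would have to do the copy in the kernel or change the mapping's protection first.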

             Linus

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 17:24   ` Linus Torvalds
@ 2021-09-01 17:29     ` Linus Torvalds
  2021-09-01 22:56       ` Barry Song
  2021-09-02 10:08       ` Huang Shijie
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2021-09-01 17:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Shijie Huang, Andrew Morton, Linux-MM, Song Bao Hua (Barry Song),
	Linux Kernel Mailing List, Frank Wang

On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But what you could do, if  you wanted to, would be to catch the
> situation where you have lots of expensive NUMA accesses either using
> our VM infrastructure or performance counters, and when the mapping is
> a MAP_PRIVATE you just do a COW fault on them.
>
> Sounds entirely doable, and has absolutely nothing to do with the page
> cache. It would literally just be an "over-eager COW fault triggered
> by NUMA access counters".

Note how it would work perfectly fine for anonymous mappings too. Just
to reinforce the point that this has nothing to do with any page cache
issues.

Of course, if you want to actually then *share* pages within a node
(rather than replicate them for each process), that gets more
exciting.

But I suspect that this is mainly only useful for long-running big
processes (not least due to that node binding thing), so I question
the need for that kind of excitement.

                Linus

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 17:29     ` Linus Torvalds
@ 2021-09-01 22:56       ` Barry Song
  2021-09-02 10:12         ` Huang Shijie
  2021-09-02 10:08       ` Huang Shijie
  1 sibling, 1 reply; 24+ messages in thread
From: Barry Song @ 2021-09-01 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Shijie Huang, Andrew Morton, Linux-MM,
	Song Bao Hua (Barry Song),
	Linux Kernel Mailing List, Frank Wang

On Thu, Sep 2, 2021 at 5:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But what you could do, if  you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
>
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
>
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
>
> But I suspect that this is mainly only useful for long-running big
> processes (not least due to that node binding thing), so I question
> the need for that kind of excitement.

In Linux server scenarios, it is quite common to have big long-running
processes constantly running on one machine, for example web servers and
databases. Such a process can span several NUMA nodes and use all CPUs
in a server to achieve maximum throughput.

SGI/HPE has a numatool with command "dplace" to help deploy processes
with replicated text in either libraries or binary (a.out) [1]:

dplace [-e] [-c cpu_numbers] [-s skip_count] [-n process_name] \
             [-x skip_mask] [-r [l|b|t]] [-o log_file] [-v 1|2] \
             command [command-args]

The dplace command accepts the following options:
...
-r: Specifies that text should be replicated on the node or nodes where the
application is running. In some cases, replication will improve performance
by reducing the need to make offnode memory references for code. The
replication option applies to all programs placed by the dplace command.
See the dplace man page for additional information on text replication.
The replication options are a string of one or more of the following
characters:
l - Replicate library text
b - Replicate binary (a.out) text
t - Thread round-robin option

On the other hand, it would also be interesting to investigate whether
kernel text replication can help improve performance. MIPS does have
REPLICATE_KTEXT support in the kernel:
config REPLICATE_KTEXT
	bool "Kernel text replication support"
	depends on SGI_IP27
	select MAPPED_KERNEL
	help
	  Say Y here to enable replicating the kernel text across multiple
	  nodes in a NUMA cluster.  This trades memory for speed.

I am not quite sure how it would benefit x86 and ARM64, though it seems
concurrent-rt has a solution and benchmark data for RedHawk Linux [2].

[1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
[2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf

>
>                 Linus

Thanks
Barry

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 14:25     ` Huang Shijie
  2021-09-01 11:32       ` Matthew Wilcox
@ 2021-09-01 23:58       ` Matthew Wilcox
  2021-09-02  0:15         ` Barry Song
  2021-09-02 10:16         ` Huang Shijie
  1 sibling, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2021-09-01 23:58 UTC (permalink / raw)
  To: Huang Shijie
  Cc: Shijie Huang, torvalds, viro, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > >     On NUMA systems, we only have one page cache for each file. For
> > > > programs and shared libraries, remote access is slower than local access.
> > > > 
> > > > So, is it possible to implement the per-node page cache for
> > > > programs/libraries?
> > > 
> > > At this point, we have no way to support text replication within a
> > > process.  So what you're suggesting (if implemented) would work for
> > 
> > I created a glibc patch which can do the text replication within a process.
> The "text replication" means the shared libraries, not program itself.

Thinking about it some more, if you're ok with it only being shared
libraries, you can do this:

for i in `seq 0 3`; do \
	cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
		/lib/x86_64-linux-gnu/libc.so.6.numa$i; \
done

Reflinked files don't share page cache, so you can do this all in
userspace with no kernel changes.
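
As a sketch of how a per-node copy could be selected, the helpers below find the node of the calling thread and build the matching file name, following the ".numa<N>" naming used in the loop above. Both function names are hypothetical. How the dynamic loader would be steered to that file is left open here, and the reflink copies themselves require a filesystem that supports reflinks.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for syscall() under strict -std= modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel which NUMA node the calling thread is on; -1 on error. */
int current_node(void)
{
	unsigned int cpu = 0, node = 0;

	if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
		return -1;
	return (int)node;
}

/*
 * Write "<base>.numa<node>" into buf.
 * Returns 0 on success, -1 if the result did not fit.
 */
int numa_copy_path(char *buf, size_t size, const char *base, int node)
{
	int n;

	assert(node >= 0);
	n = snprintf(buf, size, "%s.numa%d", base, node);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The node reported by current_node() is only a snapshot: unless the process is bound to that node, the scheduler may move it elsewhere after the path has been chosen.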

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 23:58       ` Matthew Wilcox
@ 2021-09-02  0:15         ` Barry Song
  2021-09-02  1:13           ` Linus Torvalds
  2021-09-02 10:16         ` Huang Shijie
  1 sibling, 1 reply; 24+ messages in thread
From: Barry Song @ 2021-09-02  0:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Huang Shijie, Shijie Huang, Linus Torvalds, Al Viro,
	Andrew Morton, Linux-MM, Barry Song, LKML, Frank Wang

On Thu, Sep 2, 2021 at 12:00 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > >     On NUMA systems, we only have one page cache for each file. For
> > > > > programs and shared libraries, remote access is slower than local access.
> > > > >
> > > > > So, is it possible to implement the per-node page cache for
> > > > > programs/libraries?
> > > >
> > > > At this point, we have no way to support text replication within a
> > > > process.  So what you're suggesting (if implemented) would work for
> > >
> > > I created a glibc patch which can do the text replication within a process.
> > The "text replication" means the shared libraries, not program itself.
>
> Thinking about it some more, if you're ok with it only being shared
> libraries, you can do this:
>
> for i in `seq 0 3`; do \
>         cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
>                 /lib/x86_64-linux-gnu/libc.so.6.numa$i; \
> done
>
> Reflinked files don't share page cache, so you can do this all in
> userspace with no kernel changes.

I am not quite sure I catch your point. Suppose we are running MySQL on a
machine with 128 cores (4 NUMA nodes, 32 cores in each node): how will the
reflink help the single mysql process use its local libc copy?

Thanks
Barry

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-02  0:15         ` Barry Song
@ 2021-09-02  1:13           ` Linus Torvalds
  0 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2021-09-02  1:13 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, Huang Shijie, Shijie Huang, Al Viro,
	Andrew Morton, Linux-MM, Barry Song, LKML, Frank Wang

On Wed, Sep 1, 2021 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> In case we are running mysql on a machine with 128 cores
> (4numa, 32cores in each numa), how will the reflink help the only
> mysql process to leverage its local libc copy?

That's a fundamentally harder problem anyway, and for the foreseeable
future you should expect the answer to that to be "Not a way in hell".

Because it's not about "local libc copies" at that point any more,
it's about "a single process only has a single page table".

So a single process will have a particular virtual address mapped to
*one* physical page. And no, it doesn't matter how many threads you
have. What makes them threads - not processes - is that they share the
same VM image.

So the only way you will have local NUMA copies is if you
 (a) run multiple processes
 (b) bind each process to a particular NUMA node
 (c) do something special to then have per-node mappings

That "(c)" is what is up for discussion, whether it be with various
user mode hacks, or the "NUMA COW" thing, or whatever.

But (a) and (b) are basically required.
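
Requirement (a) can be sketched as a launcher that forks one worker process per node. This is an illustration only: spawn_per_node() and node_worker() are hypothetical names, and the worker is a placeholder where (b) and (c) would go.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*node_fn)(int node);

/*
 * Placeholder worker (hypothetical). A real one would bind itself to
 * 'node', set up its per-node mappings and then do the actual work.
 */
int node_worker(int node)
{
	(void)node;
	return 0;
}

/*
 * Fork one process per node and run fn(node) in each child.
 * Returns the number of children that exited successfully.
 */
int spawn_per_node(int nr_nodes, node_fn fn)
{
	int started = 0;
	int ok = 0;

	assert(nr_nodes > 0);
	for (int node = 0; node < nr_nodes; node++) {
		pid_t pid = fork();

		if (pid < 0)
			break;		/* stop launching, reap what we have */
		if (pid == 0)
			_exit(fn(node) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
		started++;
	}
	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

Each child has its own page table, which is what allows the same virtual address to point at a different physical page on each node. Threads created inside one child would again share that child's single mapping.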

               Linus

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01  3:25 ` Matthew Wilcox
  2021-09-01 13:30   ` Huang Shijie
@ 2021-09-02  3:25   ` Nicholas Piggin
  2021-09-02 10:17     ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Nicholas Piggin @ 2021-09-02  3:25 UTC (permalink / raw)
  To: Shijie Huang, Matthew Wilcox
  Cc: akpm, linux-kernel, linux-mm, song.bao.hua, torvalds, viro, Frank Wang

Excerpts from Matthew Wilcox's message of September 1, 2021 1:25 pm:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
>>     On NUMA systems, we only have one page cache for each file. For
>> programs and shared libraries, remote access is slower than local access.
>> 
>> So, is it possible to implement the per-node page cache for
>> programs/libraries?
> 
> At this point, we have no way to support text replication within a
> process.  So what you're suggesting (if implemented) would work for
> processes which limit themselves to a single node.  That is, if you
> have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
> which only works on node 0 or only works on node 1 will get text on the
> appropriate node.
> 
> If there's a process which runs on both nodes 0 and 1, there's no support
> for per-node PGDs.  So it will get a mix of pages from nodes 0 and 1,
> and that doesn't necessarily seem like a big win.  I haven't yet dived
> into how hard it would be to make mm->pgd a per-node allocation.
> 
> I have been thinking about this a bit; one of our internal performance
> teams flagged the potential performance win to me a few months ago.
> I don't have a concrete design for text replication yet; there have been
> various attempts over the years, but none were particularly compelling.

What was not compelling about it?

https://lists.openwall.net/linux-kernel/2007/07/27/112

What are the other attempts?

Thanks,
Nick

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 17:29     ` Linus Torvalds
  2021-09-01 22:56       ` Barry Song
@ 2021-09-02 10:08       ` Huang Shijie
  1 sibling, 0 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-02 10:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Shijie Huang, Andrew Morton, Linux-MM,
	Song Bao Hua (Barry Song),
	Linux Kernel Mailing List, Frank Wang

Hi Linus,
On Wed, Sep 01, 2021 at 10:29:01AM -0700, Linus Torvalds wrote:
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But what you could do, if  you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
Yes. You are right, we can use COW. :)

Actually we have _TWO_ levels at which to optimize NUMA remote access:
   1.) the page cache, which is independent of any process.
   2.) the process address space (page table).

   For 2.), we can use the over-eager COW:
        2.1) I have finished a user-space patch for glibc which uses "over-eager COW"
	   to do the text replication on NUMA.
        2.2) Also a kernel patch which uses "over-eager COW" to do the replication
           for the program itself on NUMA. (We may split this out into another topic.)
> 
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
> 
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
Do we really need to change the page cache?
          The 2.1) above may produce one copy of the shared library pages for each process, e.g. for glibc.so.
          Even within NUMA node 0 we may run two identical processes, so it now produces two copies of glibc.so.
	  If we run 5 identical processes on NUMA node 0, it will produce five copies of glibc.so.

	  But if we have a per-node page cache for glibc.so, we can do it like this:
	  (1) disable the "over-eager COW" in the process.
	  (2) give the per-node page cache's pages to the different processes on the _SAME_ NUMA node,
	      so all the processes on the same NUMA node share a single copy of each page.
          (3) Processes on other NUMA nodes use the pages belonging to their own node.

	  This way, we can save many pages and provide faster access on NUMA.
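
The page-count argument can be written down as a small calculation. This is a sketch with made-up function names: given the node each process runs on, it counts how many copies of a library exist under per-process COW versus under a per-node page cache.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64	/* assumed upper bound on node numbers */

/* Per-process COW: every process ends up with its own private copy. */
size_t copies_with_private_cow(const int *proc_node, size_t nr_procs)
{
	(void)proc_node;	/* placement does not matter here */
	return nr_procs;
}

/* Per-node page cache: one copy per node that runs at least one process. */
size_t copies_with_per_node_cache(const int *proc_node, size_t nr_procs)
{
	unsigned char seen[MAX_NODES] = { 0 };
	size_t copies = 0;

	for (size_t i = 0; i < nr_procs; i++) {
		int node = proc_node[i];

		assert(node >= 0 && node < MAX_NODES);
		if (!seen[node]) {
			seen[node] = 1;
			copies++;
		}
	}
	return copies;
}
```

For five processes that all run on node 0, the first function gives five copies and the second gives one, which is the saving described above.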

Thanks
Huang Shijie

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 22:56       ` Barry Song
@ 2021-09-02 10:12         ` Huang Shijie
  0 siblings, 0 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-02 10:12 UTC (permalink / raw)
  To: Barry Song
  Cc: Linus Torvalds, Al Viro, Shijie Huang, Andrew Morton, Linux-MM,
	Song Bao Hua (Barry Song),
	Linux Kernel Mailing List, Frank Wang

On Thu, Sep 02, 2021 at 10:56:20AM +1200, Barry Song wrote:
> Not quite sure how it will benefit X86 and ARM64 though it seems concurrent-rt
> has some solution and benchmark data in RedHawk Linux[2].
> 
> [1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
> [2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf
Thanks for sharing this.

Thanks
Huang Shijie

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-01 23:58       ` Matthew Wilcox
  2021-09-02  0:15         ` Barry Song
@ 2021-09-02 10:16         ` Huang Shijie
  1 sibling, 0 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-02 10:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shijie Huang, torvalds, viro, akpm, linux-mm, song.bao.hua,
	linux-kernel, Frank Wang

On Thu, Sep 02, 2021 at 12:58:02AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > >     On NUMA systems, we only have one page cache for each file. For
> > > > > programs and shared libraries, remote access is slower than local access.
> > > > > 
> > > > > So, is it possible to implement the per-node page cache for
> > > > > programs/libraries?
> > > > 
> > > > At this point, we have no way to support text replication within a
> > > > process.  So what you're suggesting (if implemented) would work for
> > > 
> > > I created a glibc patch which can do the text replication within a process.
> > The "text replication" means the shared libraries, not program itself.
> 
> Thinking about it some more, if you're ok with it only being shared
> libraries, you can do this:
> 
> for i in `seq 0 3`; do \
> 	cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
> 		/lib/x86_64-linux-gnu/libc.so.6.numa$i; \
> done
> 
> Reflinked files don't share page cache, so you can do this all in
> userspace with no kernel changes.
This is not graceful enough :)
And customers may not accept it.

For the shared libraries, it is better to change glibc/ld.so.
For the program itself, it is better to change the Linux kernel.

Thanks
Huang Shijie
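
A minimal sketch of how a user-space launcher might choose among the
reflinked copies quoted above, assuming the `.numa$i` suffix from the shell
loop. The helper names (parse_cpulist, node_of_cpu, per_node_copy) are
hypothetical and only illustrate the idea; the sysfs directory
/sys/devices/system/node is where Linux normally exposes per-node CPU lists.
How the chosen copy then gets loaded (a glibc/ld.so change, a wrapper script,
and so on) is outside the sketch.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-3,8,10-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_of_cpu(cpu, sysfs_root="/sys/devices/system/node"):
    """Return the NUMA node whose cpulist contains `cpu`, or 0 if none does."""
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        with open(path) as f:
            if cpu in parse_cpulist(f.read()):
                node_dir = os.path.basename(os.path.dirname(path))
                return int(node_dir[len("node"):])
    return 0


def per_node_copy(base, node):
    """Name of the per-node copy (hypothetical helper, `.numa<N>` suffix assumed)."""
    return f"{base}.numa{node}"
```

On Linux, per_node_copy(base, node_of_cpu(os.sched_getcpu())) would give the
copy for the CPU the caller is currently on. The answer can go stale if the
task migrates, so a real user would probably pin the process to a node first.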

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-02  3:25   ` Nicholas Piggin
@ 2021-09-02 10:17     ` Matthew Wilcox
  2021-09-03  7:10       ` Nicholas Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2021-09-02 10:17 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Shijie Huang, akpm, linux-kernel, linux-mm, song.bao.hua,
	torvalds, viro, Frank Wang

On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
> > I have been thinking about this a bit; one of our internal performance
> > teams flagged the potential performance win to me a few months ago.
> > I don't have a concrete design for text replication yet; there have been
> > various attempts over the years, but none were particularly compelling.
> 
> What was not compelling about it?

It wasn't merged, so clearly it wasn't compelling enough?

> https://lists.openwall.net/linux-kernel/2007/07/27/112
> 
> What are the other attempts?

I found one from Dave Hansen in 2003:

https://lwn.net/Articles/45082/

I think somebody else may have posted a different one, but I don't
remember now.

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-02 10:17     ` Matthew Wilcox
@ 2021-09-03  7:10       ` Nicholas Piggin
  2021-09-03 19:01         ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Nicholas Piggin @ 2021-09-03  7:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-kernel, linux-mm, Shijie Huang, song.bao.hua,
	torvalds, viro, Frank Wang

Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
> On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
>> > I have been thinking about this a bit; one of our internal performance
>> > teams flagged the potential performance win to me a few months ago.
>> > I don't have a concrete design for text replication yet; there have been
>> > various attempts over the years, but none were particularly compelling.
>> 
>> What was not compelling about it?
> 
> It wasn't merged, so clearly it wasn't compelling enough?

Ha ha. It sounded like you had some reasons you didn't find it 
particularly compelling :P

> 
>> https://lists.openwall.net/linux-kernel/2007/07/27/112
>> 
>> What are the other attempts?
> 
> I found one from Dave Hansen in 2003:
> 
> https://lwn.net/Articles/45082/
> 

Huh, interesting. I'd be surprised if I didn't see it go by at the time.

Thanks,
Nick

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-03  7:10       ` Nicholas Piggin
@ 2021-09-03 19:01         ` Matthew Wilcox
  2021-09-03 19:08           ` Linus Torvalds
  2021-09-03 23:42           ` Nicholas Piggin
  0 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2021-09-03 19:01 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: akpm, linux-kernel, linux-mm, Shijie Huang, song.bao.hua,
	torvalds, viro, Frank Wang

On Fri, Sep 03, 2021 at 05:10:31PM +1000, Nicholas Piggin wrote:
> Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
> > On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
> >> > I have been thinking about this a bit; one of our internal performance
> >> > teams flagged the potential performance win to me a few months ago.
> >> > I don't have a concrete design for text replication yet; there have been
> >> > various attempts over the years, but none were particularly compelling.
> >> 
> >> What was not compelling about it?
> > 
> > It wasn't merged, so clearly it wasn't compelling enough?
> 
> Ha ha. It sounded like you had some reasons you didn't find it 
> particularly compelling :P

I haven't studied it in detail, but it seems to me that your patch (from
2007!) chooses whether to store pages or pcache_desc pointers in i_pages.
Was there a reason you chose to do it that way instead of having per-node
i_mapping pointers?  (And which way would you choose to do it now, given
the infrastructure we have now?)
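
To make the first option concrete, here is a toy user-space model of a
page-cache slot that holds either a plain page or a descriptor with per-node
replicas. Only the name pcache_desc and the "page or descriptor in i_pages"
idea come from the message above; the classes, fields and the
collapse-on-write policy are assumptions for illustration and are not taken
from the 2007 patch.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    node: int    # NUMA node whose memory backs this page
    data: bytes


@dataclass
class PcacheDesc:
    """Hypothetical descriptor: one master page plus read-only per-node replicas."""
    master: Page
    replicas: dict = field(default_factory=dict)  # node id -> Page


class ToyMapping:
    """Stand-in for a file's page-cache index (the role i_pages plays)."""

    def __init__(self):
        self.slots = {}  # page index -> Page or PcacheDesc

    def add(self, index, page):
        self.slots[index] = page

    def read(self, index, node):
        """Return a page local to `node`, replicating on the first remote read."""
        entry = self.slots[index]
        if isinstance(entry, Page):
            if entry.node == node:
                return entry
            entry = PcacheDesc(master=entry)  # promote the slot to a descriptor
            self.slots[index] = entry
        if entry.master.node == node:
            return entry.master
        if node not in entry.replicas:
            entry.replicas[node] = Page(node, entry.master.data)
        return entry.replicas[node]

    def write(self, index, data):
        """Collapse the slot back to a single page before modifying it."""
        entry = self.slots[index]
        if isinstance(entry, PcacheDesc):
            entry = entry.master  # replicas are dropped here
            self.slots[index] = entry
        entry.data = data
        return entry
```

Because there is still one slot per file offset, a writer has a single place
to find and drop every replica, which is what keeps this variant coherent in
the toy model.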

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-03 19:01         ` Matthew Wilcox
@ 2021-09-03 19:08           ` Linus Torvalds
  2021-09-06  9:56             ` Huang Shijie
  2021-09-03 23:42           ` Nicholas Piggin
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2021-09-03 19:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nicholas Piggin, Andrew Morton, Linux Kernel Mailing List,
	Linux-MM, Shijie Huang, Song Bao Hua (Barry Song),
	Al Viro, Frank Wang

On Fri, Sep 3, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Was there a reason you chose to do it that way instead of having per-node
> i_mapping pointers?

You can't have per-node i_mapping pointers without huge coherence issues.

If you don't care about coherence, that's fine - but that has to be a
user-space decision (ie "I will just replicate this file").

You can't just have the kernel decide "I'll map this set of pages on
this node, and that other set of pages on that other node", in case
there's MAP_SHARED things going on.

Anyway, I think very fundamentally this is one of those things where
99.9% of all people don't care, and DO NOT WANT the complexity.

And the 0.1% that _does_ care really could and should do this in user
space, because they know they care.

Asking the kernel to do complex things in critical core functions for
something that is very very rare and irrelevant to most people, and
that can and should just be done in user space for the people who care
is the wrong approach.

Because the question here really should be "is this truly important,
and does this need kernel help because user space simply cannot do it
itself".

And the answer is a fairly simple "no".

            Linus
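
The coherence problem described above can be shown with a toy model. This is
not kernel code: PerNodeCache and its methods are hypothetical names, and the
model simply assumes that each node maps its own private copy of the same
file page, as fully separate per-node mappings would.

```python
class PerNodeCache:
    """Toy model: each NUMA node keeps its own private copy of a file page."""

    def __init__(self, nodes, initial):
        self.copies = {n: bytearray(initial) for n in nodes}

    def shared_write(self, node, offset, value):
        # A MAP_SHARED store only touches the copy mapped on the writer's node.
        self.copies[node][offset] = value

    def read(self, node, offset):
        return self.copies[node][offset]

    def coherent(self):
        """True when every node sees identical contents."""
        views = [bytes(c) for c in self.copies.values()]
        return all(v == views[0] for v in views)
```

After a shared_write() on one node, coherent() turns false until something
propagates or invalidates the other copies. That extra machinery is the
complexity the message argues against putting in core kernel paths.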

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-03 19:01         ` Matthew Wilcox
  2021-09-03 19:08           ` Linus Torvalds
@ 2021-09-03 23:42           ` Nicholas Piggin
  1 sibling, 0 replies; 24+ messages in thread
From: Nicholas Piggin @ 2021-09-03 23:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-kernel, linux-mm, Shijie Huang, song.bao.hua,
	torvalds, viro, Frank Wang

Excerpts from Matthew Wilcox's message of September 4, 2021 5:01 am:
> On Fri, Sep 03, 2021 at 05:10:31PM +1000, Nicholas Piggin wrote:
>> Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
>> > On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
>> >> > I have been thinking about this a bit; one of our internal performance
>> >> > teams flagged the potential performance win to me a few months ago.
>> >> > I don't have a concrete design for text replication yet; there have been
>> >> > various attempts over the years, but none were particularly compelling.
>> >> 
>> >> What was not compelling about it?
>> > 
>> > It wasn't merged, so clearly it wasn't compelling enough?
>> 
>> Ha ha. It sounded like you had some reasons you didn't find it 
>> particularly compelling :P
> 
> I haven't studied it in detail, but it seems to me that your patch (from
> 2007!) chooses whether to store pages or pcache_desc pointers in i_pages.
> Was there a reason you chose to do it that way instead of having per-node
> i_mapping pointers?

What Linus said. The patch was obviously mechanism only, and more
heuristics would need to be done (in that case you could have per-inode
hints or whatever).

> (And which way would you choose to do it now, given
> the infrastructure we have now?)

I'm not aware of anything new that would change it fundamentally.

Thanks,
Nick

* Re: Is it possible to implement the per-node page cache for programs/libraries?
  2021-09-03 19:08           ` Linus Torvalds
@ 2021-09-06  9:56             ` Huang Shijie
  0 siblings, 0 replies; 24+ messages in thread
From: Huang Shijie @ 2021-09-06  9:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Nicholas Piggin, Andrew Morton,
	Linux Kernel Mailing List, Linux-MM, Shijie Huang,
	Song Bao Hua (Barry Song),
	Al Viro, Frank Wang

Hi Linus,
On Fri, Sep 03, 2021 at 12:08:03PM -0700, Linus Torvalds wrote:
> On Fri, Sep 3, 2021 at 12:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Was there a reason you chose to do it that way instead of having per-node
> > i_mapping pointers?
> 
> You can't have per-node i_mapping pointers without huge coherence issues.
> 
> If you don't care about coherence, that's fine - but that has to be a
> user-space decision (ie "I will just replicate this file").
> 
> You can't just have the kernel decide "I'll map this set of pages on
> this node, and that other set of pages on that other node", in case
> there's MAP_SHARED things going on.
> 
> Anyway, I think very fundamentally this is one of those things where
> 99.9% of all people don't care, and DO NOT WANT the complexity.
> 
> And the 0.1% that _does_ care really could and should do this in user
> space, because they know they care.
> 
> Asking the kernel to do complex things in critical core functions for
> something that is very very rare and irrelevant to most people, and
> that can and should just be done in user space for the people who care
> is the wrong approach.
> 
> Because the question here really should be "is this truly important,
> and does this need kernel help because user space simply cannot do it
> itself".
> 
> And the answer is a fairly simple "no".
Okay.

Thanks for confirming this.

Thanks
Huang Shijie

Thread overview: 24+ messages
2021-09-01  3:07 Is it possible to implement the per-node page cache for programs/libraries? Shijie Huang
2021-09-01  2:09 ` Barry Song
2021-09-01  3:25 ` Matthew Wilcox
2021-09-01 13:30   ` Huang Shijie
2021-09-01 14:25     ` Huang Shijie
2021-09-01 11:32       ` Matthew Wilcox
2021-09-01 23:58       ` Matthew Wilcox
2021-09-02  0:15         ` Barry Song
2021-09-02  1:13           ` Linus Torvalds
2021-09-02 10:16         ` Huang Shijie
2021-09-02  3:25   ` Nicholas Piggin
2021-09-02 10:17     ` Matthew Wilcox
2021-09-03  7:10       ` Nicholas Piggin
2021-09-03 19:01         ` Matthew Wilcox
2021-09-03 19:08           ` Linus Torvalds
2021-09-06  9:56             ` Huang Shijie
2021-09-03 23:42           ` Nicholas Piggin
2021-09-01  4:55 ` Al Viro
2021-09-01 13:10   ` Huang Shijie
2021-09-01 17:24   ` Linus Torvalds
2021-09-01 17:29     ` Linus Torvalds
2021-09-01 22:56       ` Barry Song
2021-09-02 10:12         ` Huang Shijie
2021-09-02 10:08       ` Huang Shijie
