Re: Is it possible to implement the per-node page cache for programs/libraries?

From: Huang Shijie <shijie@os.amperecomputing.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Shijie Huang <shijie@amperemail.onmicrosoft.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Frank Wang <zwang@amperecomputing.com>
Subject: Re: Is it possible to implement the per-node page cache for programs/libraries?
Date: Thu, 2 Sep 2021 10:08:06 +0000
Message-ID: <YTCihsPZL0HtO2lp@hsj>
In-Reply-To: <CAHk-=wjAPEs3HRGswJ-AE1R048j2MBsBtMfg3GOsaFykHoeKsg@mail.gmail.com>

Hi Linus,
On Wed, Sep 01, 2021 at 10:29:01AM -0700, Linus Torvalds wrote:
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But what you could do, if  you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
Yes. You are right, we can use COW. :)

Actually, there are _TWO_ levels at which we can optimize NUMA remote access:
   1.) the page cache, which is independent of any process.
   2.) the process address space (page table).

   For 2.), we can use the over-eager COW:
        2.1) I have finished a user-space patch for glibc which uses "over-eager COW" to
           replicate the text in NUMA.
        2.2) There is also a kernel patch which uses "over-eager COW" to replicate
           the program itself in NUMA. (We may split that off into another topic.)
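
A minimal user-space sketch of the "over-eager COW" idea in 2.1), to make the
mechanism concrete. It is not the glibc patch or the kernel patch mentioned
above. The helper names eager_cow_range() and map_private_copy() are
hypothetical. The sketch assumes a writable MAP_PRIVATE file mapping and the
default memory policy, under which a freshly faulted page is normally
allocated on the node of the CPU that takes the fault.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: force a private copy of every page in
 * [addr, addr + len) by writing the first byte of each page back to
 * itself. The range must belong to a writable MAP_PRIVATE mapping.
 * Returns the number of pages touched.
 */
static size_t eager_cow_range(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;
    size_t touched = 0;

    assert(page > 0);
    for (size_t off = 0; off < len; off += (size_t)page) {
        /* volatile load + store of the same value: triggers the COW fault
         * without changing the contents. */
        p[off] = p[off];
        touched++;
    }
    return touched;
}

/*
 * Hypothetical helper: map a file MAP_PRIVATE and eagerly COW all of its
 * pages. Returns the mapping (or MAP_FAILED) and stores its length in
 * *len_out. The file itself is never modified, because the mapping is
 * private.
 */
static void *map_private_copy(const char *path, size_t *len_out)
{
    struct stat st;
    void *addr;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return MAP_FAILED;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }
    addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE, fd, 0);
    close(fd);              /* the mapping stays valid after close() */
    if (addr == MAP_FAILED)
        return MAP_FAILED;

    eager_cow_range(addr, (size_t)st.st_size);
    *len_out = (size_t)st.st_size;
    return addr;
}
```

For real program or library text, the range would first have to be made
writable with mprotect() and then restored to read/execute afterwards. That
step is left out of the sketch.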
> 
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
> 
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
Do we really need to change the page cache?
          The 2.1) above may produce a private copy of the shared library pages for each process, e.g. for glibc.so.
          Even within NUMA node 0, we may run two instances of the same program, which now produces two copies of glibc.so.
          If we run 5 instances of the same program in NUMA node 0, it will produce five copies of glibc.so.

	  But if we have a per-node page cache for glibc.so, we can do it like this:
	  (1) Disable the "over-eager COW" in the process.
	  (2) Share the per-node page cache's pages among the different processes in the _SAME_ NUMA node,
	      so all the processes in the same NUMA node use a single copy of each page.
	  (3) Processes in other NUMA nodes use the pages belonging to their own node.

	  In this way, we can save many pages and provide faster access in NUMA.
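
The page-count argument above can be sketched as simple arithmetic. The
function names below are hypothetical, and the model is deliberately crude:
it counts copies of a single library page, assuming every process takes its
own COW copy in the first case and every node with at least one process
keeps exactly one shared copy in the second.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copies of one library page when every process gets its own
 * "over-eager COW" copy: one per process, regardless of node.
 */
static size_t copies_per_process(const size_t *procs_on_node, size_t nr_nodes)
{
    size_t total = 0;

    assert(procs_on_node != NULL);
    for (size_t n = 0; n < nr_nodes; n++)
        total += procs_on_node[n];
    return total;
}

/*
 * Copies of one library page with a per-node page cache: one per node
 * that runs at least one process using the library.
 */
static size_t copies_per_node(const size_t *procs_on_node, size_t nr_nodes)
{
    size_t total = 0;

    assert(procs_on_node != NULL);
    for (size_t n = 0; n < nr_nodes; n++)
        if (procs_on_node[n] > 0)
            total++;
    return total;
}
```

With five processes on node 0 and none elsewhere, the first function gives 5
copies and the second gives 1, which matches the glibc.so example above.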

Thanks
Huang Shijie