On Mon, 2016-08-01 at 12:33 -0400, James Bottomley wrote:
> On Mon, 2016-08-01 at 09:11 -0700, Dave Hansen wrote:
> > On 08/01/2016 09:06 AM, James Bottomley wrote:
> > > > With persistent memory devices you might actually run out of
> > > > CPU capacity while performing basic page aging before you
> > > > saturate the storage device (which is why Andi Kleen has been
> > > > suggesting to replace LRU reclaim with random replacement for
> > > > these devices). So storage device saturation might not be the
> > > > final answer to this problem.
> > >
> > > We really wouldn't want this.  All cloud jobs seem to have memory
> > > they allocate but rarely use, so we want the properties of the
> > > LRU list to get this on swap so we can re-use the memory pages
> > > for something else.  A random replacement algorithm would play
> > > havoc with that.
> >
> > I don't want to put words in Andi's mouth, but what we want isn't
> > necessarily something that is random, but it's something that uses
> > less CPU to swap out a given page.
>
> OK, if it's more deterministic, I'll wait to see the proposal.
>
> > All the LRU scanning is expensive and doesn't scale particularly
> > well, and there are some situations where we should be willing to
> > give up some of the precision of the current LRU in order to
> > increase the throughput of reclaim in general.
>
> Would some type of hinting mechanism work (say via madvise)?

I suspect that might introduce overhead in other ways.

> I suppose another question is do we still want all of this to be
> page based?  We moved to extents in filesystems a while ago,
> wouldn't some extent based LRU mechanism be cheaper ...
> unfortunately it means something has to try to come up with an idea
> of what an extent means (I suspect it would be a bunch of virtually
> contiguous pages which have the same expected LRU properties, but
> I'm thinking from the application centric viewpoint).

On sufficiently fast swap, we could just swap 2MB pages, or whatever
size THP is on the architecture in question, in and out of memory.

Working with blocks 512x the size of a 4kB page might be enough of a
scalability gain to match the faster IO speeds of new storage.

-- 
All Rights Reversed.