linux-mm.kvack.org archive mirror
* Re: Balloon pressuring page cache
       [not found]   ` <CAJuQAmoaK0Swytu2Os_SQRfG5_LqiCPaDa9yatatm9MtfncNTQ@mail.gmail.com>
@ 2020-01-30 15:02     ` David Hildenbrand
  2020-01-30 15:20       ` Michael S. Tsirkin
                         ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-01-30 15:02 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: virtualization, Wei Wang, David Rientjes, linux-mm, Michal Hocko,
	Michael S. Tsirkin

On 29.01.20 20:11, Tyler Sanderson wrote:
> 
> 
> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> <mailto:david@redhat.com>> wrote:
> 
>     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>     > A primary advantage of virtio balloon over other memory reclaim
>     > mechanisms is that it can pressure the guest's page cache into
>     shrinking.
>     >
>     > However, since the balloon driver changed to using the shrinker API
>     >
>     <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>     > use case has become a bit more tricky. I'm wondering what the intended
>     > device implementation is.
>     >
>     > When inflating the balloon against page cache (i.e. no free memory
>     > remains) vmscan.c will both shrink page cache, but also invoke the
>     > shrinkers -- including the balloon's shrinker. So the balloon driver
>     > allocates memory which requires reclaim, vmscan gets this memory by
>     > shrinking the balloon, and then the driver adds the memory back to the
>     > balloon. Basically a busy no-op.
>     >
>     > If file IO is ongoing during this balloon inflation then the page
>     cache
>     > could be growing which further puts "back pressure" on the balloon
>     > trying to inflate. In testing I've seen periods of > 45 seconds where
>     > balloon inflation makes no net forward progress.
>     >
>     > This wasn't a problem before the change to the shrinker API since
>     forced
>     > balloon deflation only occurred via the OOM notifier callback
>     which was
>     > invoked only after the page cache had depleted.
>     >
>     > Is this new busy behavior working as intended?
> 
>     Please note that the shrinker will only be registered in case we have
>     VIRTIO_BALLOON_F_DEFLATE_ON_OOM - (which is AFAIK very rare) - to
>     implement automatic balloon deflation when the guest is under memory
>     pressure.
> 
> 
>     Are you actually experiencing issues with that or did you just stumble
>     over the code?
> 
> 
> We have a use case that is encountering this (and that registers
> DEFLATE_ON_OOM). We can work around this, but it does seem inefficient.
> I understand there were good reasons for moving away from the OOM
> notifier callback, but I'm wondering if the balloon driver could specify
> a "nice" level to the shrinker API that would cause it to be reclaimed
> from only as a last resort?
>  

Cc-ing linux-mm, Michal and Michael.

Just wondering, what does your workaround look like?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread
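The feedback loop Tyler describes (the inflation allocation enters reclaim, reclaim calls the balloon's own shrinker, and the driver puts the reclaimed page straight back into the balloon) can be sketched as a toy userspace model. Everything here is illustrative; these are not the actual drivers/virtio/virtio_balloon.c structures or names:

```c
#include <assert.h>

/* Toy model of the busy no-op described above. With no free pages left,
 * every page "allocated" for inflation is obtained via reclaim, and
 * reclaim (here) is satisfied entirely by the balloon's own shrinker,
 * so the balloon makes no net forward progress. */

static long balloon_pages;          /* pages currently in the balloon */

/* DEFLATE_ON_OOM shrinker: reclaim takes pages back out of the balloon. */
static long shrinker_scan(long nr_to_reclaim)
{
    long freed = nr_to_reclaim < balloon_pages ? nr_to_reclaim
                                               : balloon_pages;
    balloon_pages -= freed;         /* deflate */
    return freed;
}

/* One inflation step when no free memory remains: the allocation enters
 * direct reclaim, which shrinks the balloon by one page, and the driver
 * immediately adds that page back to the balloon. */
static void inflate_one_page(void)
{
    shrinker_scan(1);
    balloon_pages += 1;
}
```

Running inflate_one_page() any number of times leaves balloon_pages unchanged: heavy deflate-queue activity, zero net inflation, matching the observed behavior.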

* Re: Balloon pressuring page cache
  2020-01-30 15:02     ` Balloon pressuring page cache David Hildenbrand
@ 2020-01-30 15:20       ` Michael S. Tsirkin
  2020-01-30 15:23         ` David Hildenbrand
  2020-01-30 15:31       ` Wang, Wei W
  2020-01-30 22:46       ` Tyler Sanderson
  2 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-01-30 15:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Tyler Sanderson, virtualization, Wei Wang, David Rientjes,
	linux-mm, Michal Hocko

On Thu, Jan 30, 2020 at 04:02:34PM +0100, David Hildenbrand wrote:
> On 29.01.20 20:11, Tyler Sanderson wrote:
> > 
> > 
> > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> > <mailto:david@redhat.com>> wrote:
> > 
> >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >     > A primary advantage of virtio balloon over other memory reclaim
> >     > mechanisms is that it can pressure the guest's page cache into
> >     shrinking.
> >     >
> >     > However, since the balloon driver changed to using the shrinker API
> >     >
> >     <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> >     > use case has become a bit more tricky. I'm wondering what the intended
> >     > device implementation is.
> >     >
> >     > When inflating the balloon against page cache (i.e. no free memory
> >     > remains) vmscan.c will both shrink page cache, but also invoke the
> >     > shrinkers -- including the balloon's shrinker. So the balloon driver
> >     > allocates memory which requires reclaim, vmscan gets this memory by
> >     > shrinking the balloon, and then the driver adds the memory back to the
> >     > balloon. Basically a busy no-op.
> >     >
> >     > If file IO is ongoing during this balloon inflation then the page
> >     cache
> >     > could be growing which further puts "back pressure" on the balloon
> >     > trying to inflate. In testing I've seen periods of > 45 seconds where
> >     > balloon inflation makes no net forward progress.
> >     >
> >     > This wasn't a problem before the change to the shrinker API since
> >     forced
> >     > balloon deflation only occurred via the OOM notifier callback
> >     which was
> >     > invoked only after the page cache had depleted.
> >     >
> >     > Is this new busy behavior working as intended?
> > 
> >     Please note that the shrinker will only be registered in case we have
> >     VIRTIO_BALLOON_F_DEFLATE_ON_OOM - (which is AFAIK very rare) - to
> >     implement automatic balloon deflation when the guest is under memory
> >     pressure.
> > 
> > 
> >     Are you actually experiencing issues with that or did you just stumble
> >     over the code?
> > 
> > 
> > We have a use case that is encountering this (and that registers
> > DEFLATE_ON_OOM). We can work around this, but it does seem inefficient.
> > I understand there were good reasons for moving away from the OOM
> > notifier callback, but I'm wondering if the balloon driver could specify
> > a "nice" level to the shrinker API that would cause it to be reclaimed
> > from only as a last resort?
> >  
> 
> Cc-ing linux-mm, Michal and Michael.


Interesting.  VIRTIO_BALLOON_F_DEFLATE_ON_OOM is really
underspecified in a bunch of ways.

I'll wait to see what Michal says from the Linux POV.





> Just wondering, what does your workaround look like?
> 
> -- 
> Thanks,
> 
> David / dhildenb




* Re: Balloon pressuring page cache
  2020-01-30 15:20       ` Michael S. Tsirkin
@ 2020-01-30 15:23         ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-01-30 15:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tyler Sanderson, virtualization, Wei Wang, David Rientjes,
	linux-mm, Michal Hocko

On 30.01.20 16:20, Michael S. Tsirkin wrote:
> On Thu, Jan 30, 2020 at 04:02:34PM +0100, David Hildenbrand wrote:
>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>
>>>
>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>> <mailto:david@redhat.com>> wrote:
>>>
>>>     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>     > A primary advantage of virtio balloon over other memory reclaim
>>>     > mechanisms is that it can pressure the guest's page cache into
>>>     shrinking.
>>>     >
>>>     > However, since the balloon driver changed to using the shrinker API
>>>     >
>>>     <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>     > use case has become a bit more tricky. I'm wondering what the intended
>>>     > device implementation is.
>>>     >
>>>     > When inflating the balloon against page cache (i.e. no free memory
>>>     > remains) vmscan.c will both shrink page cache, but also invoke the
>>>     > shrinkers -- including the balloon's shrinker. So the balloon driver
>>>     > allocates memory which requires reclaim, vmscan gets this memory by
>>>     > shrinking the balloon, and then the driver adds the memory back to the
>>>     > balloon. Basically a busy no-op.
>>>     >
>>>     > If file IO is ongoing during this balloon inflation then the page
>>>     cache
>>>     > could be growing which further puts "back pressure" on the balloon
>>>     > trying to inflate. In testing I've seen periods of > 45 seconds where
>>>     > balloon inflation makes no net forward progress.
>>>     >
>>>     > This wasn't a problem before the change to the shrinker API since
>>>     forced
>>>     > balloon deflation only occurred via the OOM notifier callback
>>>     which was
>>>     > invoked only after the page cache had depleted.
>>>     >
>>>     > Is this new busy behavior working as intended?
>>>
>>>     Please note that the shrinker will only be registered in case we have
>>>     VIRTIO_BALLOON_F_DEFLATE_ON_OOM - (which is AFAIK very rare) - to
>>>     implement automatic balloon deflation when the guest is under memory
>>>     pressure.
>>>
>>>
>>>     Are you actually experiencing issues with that or did you just stumble
>>>     over the code?
>>>
>>>
>>> We have a use case that is encountering this (and that registers
>>> DEFLATE_ON_OOM). We can work around this, but it does seem inefficient.
>>> I understand there were good reasons for moving away from the OOM
>>> notifier callback, but I'm wondering if the balloon driver could specify
>>> a "nice" level to the shrinker API that would cause it to be reclaimed
>>> from only as a last resort?
>>>  
>>
>> Cc-ing linux-mm, Michal and Michael.
> 
> 
> Interesting.  VIRTIO_BALLOON_F_DEFLATE_ON_OOM is really
> underspecified in a bunch of ways.
> 
> I'll wait to see what Michal says from the Linux POV.


Just wondering, does using the shrinker imply that a

echo 3 > /proc/sys/vm/drop_caches

will deflate the whole balloon? If yes, then that's *really* not desired.

-- 
Thanks,

David / dhildenb



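David's concern can be illustrated with a toy model of the drop_caches path: writing 2 or 3 to /proc/sys/vm/drop_caches ends in drop_slab(), which keeps calling every registered shrinker while progress is being made. If the balloon registers a shrinker that gives up balloon pages, a plain cache drop empties the balloon. The structure below is illustrative, not the kernel's actual code:

```c
#include <assert.h>

/* Toy model: a drop_slab()-like loop versus a balloon shrinker. */

static long balloon_pages = 1000;

/* The balloon's shrinker scan callback (simplified): hand back up to
 * nr_to_scan balloon pages and report how many were freed. */
static long balloon_shrink_scan(long nr_to_scan)
{
    long freed = nr_to_scan < balloon_pages ? nr_to_scan : balloon_pages;
    balloon_pages -= freed;
    return freed;
}

/* drop_slab()-like loop: keep scanning while shrinkers report progress,
 * so the balloon is drained completely. */
static void drop_slab_model(void)
{
    long freed;

    do {
        freed = balloon_shrink_scan(128);
    } while (freed > 0);
}
```

In this model a single cache drop deflates the whole balloon, which is exactly the undesired behavior David is asking about.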

* RE: Balloon pressuring page cache
  2020-01-30 15:02     ` Balloon pressuring page cache David Hildenbrand
  2020-01-30 15:20       ` Michael S. Tsirkin
@ 2020-01-30 15:31       ` Wang, Wei W
  2020-01-30 19:59         ` Tyler Sanderson
  2020-01-30 22:46       ` Tyler Sanderson
  2 siblings, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-01-30 15:31 UTC (permalink / raw)
  To: David Hildenbrand, Tyler Sanderson
  Cc: virtualization, David Rientjes, linux-mm, Michal Hocko,
	Michael S. Tsirkin

On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> On 29.01.20 20:11, Tyler Sanderson wrote:
> >
> >
> > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> > <mailto:david@redhat.com>> wrote:
> >
> >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >     > A primary advantage of virtio balloon over other memory reclaim
> >     > mechanisms is that it can pressure the guest's page cache into
> >     shrinking.
> >     >
> >     > However, since the balloon driver changed to using the shrinker API
> >     >
> >
> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> >     > use case has become a bit more tricky. I'm wondering what the
> intended
> >     > device implementation is.
> >     >
> >     > When inflating the balloon against page cache (i.e. no free memory
> >     > remains) vmscan.c will both shrink page cache, but also invoke the
> >     > shrinkers -- including the balloon's shrinker. So the balloon driver
> >     > allocates memory which requires reclaim, vmscan gets this memory
> by
> >     > shrinking the balloon, and then the driver adds the memory back to
> the
> >     > balloon. Basically a busy no-op.

Per my understanding, the balloon allocation won’t invoke the shrinker, as __GFP_DIRECT_RECLAIM isn't set, no?


> >     >
> >     > If file IO is ongoing during this balloon inflation then the page
> >     cache
> >     > could be growing which further puts "back pressure" on the balloon
> >     > trying to inflate. In testing I've seen periods of > 45 seconds where
> >     > balloon inflation makes no net forward progress.

I think this is intentional (but could be improved). Inflation does not stop when the allocation fails; it simply sleeps for a while and resumes, repeating until there is memory to inflate.
That's why you see no inflation progress for a long time under memory pressure.


Best,
Wei


* Re: Balloon pressuring page cache
  2020-01-30 15:31       ` Wang, Wei W
@ 2020-01-30 19:59         ` Tyler Sanderson
  2020-02-03 13:11           ` Michael S. Tsirkin
  0 siblings, 1 reply; 59+ messages in thread
From: Tyler Sanderson @ 2020-01-30 19:59 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: David Hildenbrand, virtualization, David Rientjes, linux-mm,
	Michal Hocko, Michael S. Tsirkin


On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:

> On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> > On 29.01.20 20:11, Tyler Sanderson wrote:
> > >
> > >
> > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> > > <mailto:david@redhat.com>> wrote:
> > >
> > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> > >     > A primary advantage of virtio balloon over other memory reclaim
> > >     > mechanisms is that it can pressure the guest's page cache into
> > >     shrinking.
> > >     >
> > >     > However, since the balloon driver changed to using the shrinker
> API
> > >     >
> > >
> > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> > >     > use case has become a bit more tricky. I'm wondering what the
> > intended
> > >     > device implementation is.
> > >     >
> > >     > When inflating the balloon against page cache (i.e. no free
> memory
> > >     > remains) vmscan.c will both shrink page cache, but also invoke
> the
> > >     > shrinkers -- including the balloon's shrinker. So the balloon
> driver
> > >     > allocates memory which requires reclaim, vmscan gets this memory
> > by
> > >     > shrinking the balloon, and then the driver adds the memory back
> to
> > the
> > >     > balloon. Basically a busy no-op.
>
> Per my understanding, the balloon allocation won’t invoke shrinker as
> __GFP_DIRECT_RECLAIM isn't set, no?
>
I could be wrong about the mechanism, but the device sees lots of activity
on the deflate queue. The balloon is being shrunk. And this only starts
once all free memory is depleted and we're inflating into page cache.

>
>
> > >     >
> > >     > If file IO is ongoing during this balloon inflation then the page
> > >     cache
> > >     > could be growing which further puts "back pressure" on the
> balloon
> > >     > trying to inflate. In testing I've seen periods of > 45 seconds
> where
> > >     > balloon inflation makes no net forward progress.
>
> I think this is intentional (but could be improved). As inflation does not
> stop when the allocation fails (it simply sleeps for a while and resumes..
> repeat till there are memory to inflate)
> That's why you see no inflation progress for long time under memory
> pressure.
>
As noted above the deflate queue is active, so it's not just memory
allocation failures.


>
>
> Best,
> Wei
>


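For reference on the mechanism Wei and Tyler are debating: at the time, the inflation path allocated via balloon_page_alloc() in mm/balloon_compaction.c with GFP_HIGHUSER_MOVABLE plus a few modifier flags, and GFP_HIGHUSER_MOVABLE does permit direct reclaim (tempered by __GFP_NORETRY). The sketch below reconstructs that mask; the bit values are made up for the demonstration, and only which flags end up set matters:

```c
#include <assert.h>

/* Illustrative reconstruction of the balloon inflation gfp mask.
 * Bit values are invented for this sketch; the set/unset relationships
 * mirror the kernel's flag composition. */
#define GFP_DIRECT_RECLAIM  0x001u
#define GFP_KSWAPD_RECLAIM  0x002u
#define GFP_NORETRY         0x004u
#define GFP_NOMEMALLOC      0x008u
#define GFP_NOWARN          0x010u
#define GFP_HIGHMEM         0x020u
#define GFP_MOVABLE         0x040u

#define GFP_RECLAIM          (GFP_DIRECT_RECLAIM | GFP_KSWAPD_RECLAIM)
/* GFP_USER, and therefore GFP_HIGHUSER_MOVABLE, includes GFP_RECLAIM. */
#define GFP_HIGHUSER_MOVABLE (GFP_RECLAIM | GFP_HIGHMEM | GFP_MOVABLE)

/* Mask used for inflation, as in balloon_page_alloc() at the time. */
static unsigned int balloon_alloc_gfp(void)
{
    return GFP_HIGHUSER_MOVABLE | GFP_NOMEMALLOC |
           GFP_NORETRY | GFP_NOWARN;
}
```

Since direct reclaim is allowed, an inflation allocation under memory pressure can enter vmscan and call back into the balloon's own shrinker, which is consistent with the deflate-queue activity Tyler observes.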

* Re: Balloon pressuring page cache
  2020-01-30 15:02     ` Balloon pressuring page cache David Hildenbrand
  2020-01-30 15:20       ` Michael S. Tsirkin
  2020-01-30 15:31       ` Wang, Wei W
@ 2020-01-30 22:46       ` Tyler Sanderson
  2 siblings, 0 replies; 59+ messages in thread
From: Tyler Sanderson @ 2020-01-30 22:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: virtualization, Wei Wang, David Rientjes, linux-mm, Michal Hocko,
	Michael S. Tsirkin


On Thu, Jan 30, 2020 at 7:03 AM David Hildenbrand <david@redhat.com> wrote:

> On 29.01.20 20:11, Tyler Sanderson wrote:
> >
> >
> > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> > <mailto:david@redhat.com>> wrote:
> >
> >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >     > A primary advantage of virtio balloon over other memory reclaim
> >     > mechanisms is that it can pressure the guest's page cache into
> >     shrinking.
> >     >
> >     > However, since the balloon driver changed to using the shrinker API
> >     >
> >     <
> https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9
> > this
> >     > use case has become a bit more tricky. I'm wondering what the
> intended
> >     > device implementation is.
> >     >
> >     > When inflating the balloon against page cache (i.e. no free memory
> >     > remains) vmscan.c will both shrink page cache, but also invoke the
> >     > shrinkers -- including the balloon's shrinker. So the balloon
> driver
> >     > allocates memory which requires reclaim, vmscan gets this memory by
> >     > shrinking the balloon, and then the driver adds the memory back to
> the
> >     > balloon. Basically a busy no-op.
> >     >
> >     > If file IO is ongoing during this balloon inflation then the page
> >     cache
> >     > could be growing which further puts "back pressure" on the balloon
> >     > trying to inflate. In testing I've seen periods of > 45 seconds
> where
> >     > balloon inflation makes no net forward progress.
> >     >
> >     > This wasn't a problem before the change to the shrinker API since
> >     forced
> >     > balloon deflation only occurred via the OOM notifier callback
> >     which was
> >     > invoked only after the page cache had depleted.
> >     >
> >     > Is this new busy behavior working as intended?
> >
> >     Please note that the shrinker will only be registered in case we have
> >     VIRTIO_BALLOON_F_DEFLATE_ON_OOM - (which is AFAIK very rare) - to
> >     implement automatic balloon deflation when the guest is under memory
> >     pressure.
> >
> >
> >     Are you actually experiencing issues with that or did you just
> stumble
> >     over the code?
> >
> >
> > We have a use case that is encountering this (and that registers
> > DEFLATE_ON_OOM). We can work around this, but it does seem inefficient.
> > I understand there were good reasons for moving away from the OOM
> > notifier callback, but I'm wondering if the balloon driver could specify
> > a "nice" level to the shrinker API that would cause it to be reclaimed
> > from only as a last resort?
> >
>
> Cc-ing linux-mm, Michal and Michael.
>
> Just wondering, what does your workaround look like?
>
The workaround is to monitor the memory statistics reported on the stats
queue. Keep inflating (inefficiently) -- despite the activity on the
deflate queue -- until the memory statistics indicate the guest is actually
low on available memory.
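The polling workaround described above can be sketched as a toy control loop. The stats read, the target, and the per-step net progress below are stand-ins for illustration, not real driver or device code:

```c
#include <assert.h>

/* Hypothetical guest state for the simulation. */
struct guest_stats {
    long available_kb;
};

/* Stand-in for reading an "available memory" stat off the balloon
 * stats queue (e.g. something like VIRTIO_BALLOON_S_AVAIL). */
static long read_available_kb(const struct guest_stats *s)
{
    return s->available_kb;
}

/* One inflate round: even if much of it is immediately deflated back,
 * assume a small net amount of memory is eventually reclaimed. */
static void inflate_step(struct guest_stats *s, long net_kb)
{
    s->available_kb -= net_kb;
}

/* The workaround: ignore deflate-queue churn and keep inflating until
 * the guest's reported available memory reaches the target, rather than
 * trusting the device's view of the balloon size. Returns 0 on success,
 * -1 if the iteration budget runs out. */
static int inflate_until(struct guest_stats *s, long target_kb, int max_iters)
{
    for (int i = 0; i < max_iters; i++) {
        if (read_available_kb(s) <= target_kb)
            return 0;
        inflate_step(s, 4);     /* net 4 KiB per round in this toy model */
    }
    return -1;
}
```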

-- 
> Thanks,
>
> David / dhildenb
>
>



* Re: Balloon pressuring page cache
  2020-01-30 19:59         ` Tyler Sanderson
@ 2020-02-03 13:11           ` Michael S. Tsirkin
  2020-02-03 16:18             ` Alexander Duyck
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-03 13:11 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: Wang, Wei W, David Hildenbrand, virtualization, David Rientjes,
	linux-mm, Michal Hocko, alexander.h.duyck

On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> 
> 
> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> 
>     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>     > On 29.01.20 20:11, Tyler Sanderson wrote:
>     > >
>     > >
>     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>     > > <mailto:david@redhat.com>> wrote:
>     > >
>     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>     > >     > A primary advantage of virtio balloon over other memory reclaim
>     > >     > mechanisms is that it can pressure the guest's page cache into
>     > >     shrinking.
>     > >     >
>     > >     > However, since the balloon driver changed to using the shrinker
>     API
>     > >     >
>     > >
>     > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>     > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>     > >     > use case has become a bit more tricky. I'm wondering what the
>     > intended
>     > >     > device implementation is.
>     > >     >
>     > >     > When inflating the balloon against page cache (i.e. no free
>     memory
>     > >     > remains) vmscan.c will both shrink page cache, but also invoke
>     the
>     > >     > shrinkers -- including the balloon's shrinker. So the balloon
>     driver
>     > >     > allocates memory which requires reclaim, vmscan gets this memory
>     > by
>     > >     > shrinking the balloon, and then the driver adds the memory back
>     to
>     > the
>     > >     > balloon. Basically a busy no-op.
> 
>     Per my understanding, the balloon allocation won’t invoke shrinker as
>     __GFP_DIRECT_RECLAIM isn't set, no?
> 
> I could be wrong about the mechanism, but the device sees lots of activity on
> the deflate queue. The balloon is being shrunk. And this only starts once all
> free memory is depleted and we're inflating into page cache.

So given this looks like a regression, maybe we should revert the
patch in question, 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker").
Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT the
shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST, which isn't nice
at all.

So it looks like all this rework introduced more issues than it
addressed ...

I also CC Alex Duyck for an opinion on this.
Alex, what do you use to put pressure on page cache?


> 
> 
>     > >     >
>     > >     > If file IO is ongoing during this balloon inflation then the page
>     > >     cache
>     > >     > could be growing which further puts "back pressure" on the
>     balloon
>     > >     > trying to inflate. In testing I've seen periods of > 45 seconds
>     where
>     > >     > balloon inflation makes no net forward progress.
> 
>     I think this is intentional (but could be improved). As inflation does not
>     stop when the allocation fails (it simply sleeps for a while and resumes..
>     repeat till there are memory to inflate)
>     That's why you see no inflation progress for long time under memory
>     pressure.
> 
> As noted above the deflate queue is active, so it's not just memory allocation
> failures.
>  
> 
> 
> 
>     Best,
>     Wei
> 




* Re: Balloon pressuring page cache
  2020-02-03 13:11           ` Michael S. Tsirkin
@ 2020-02-03 16:18             ` Alexander Duyck
  2020-02-03 16:34               ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2020-02-03 16:18 UTC (permalink / raw)
  To: Michael S. Tsirkin, Tyler Sanderson
  Cc: Wang, Wei W, David Hildenbrand, virtualization, David Rientjes,
	linux-mm, Michal Hocko

On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> > 
> > On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> > 
> >     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> >     > On 29.01.20 20:11, Tyler Sanderson wrote:
> >     > >
> >     > >
> >     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> >     > > <mailto:david@redhat.com>> wrote:
> >     > >
> >     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >     > >     > A primary advantage of virtio balloon over other memory reclaim
> >     > >     > mechanisms is that it can pressure the guest's page cache into
> >     > >     shrinking.
> >     > >     >
> >     > >     > However, since the balloon driver changed to using the shrinker
> >     API
> >     > >     >
> >     > >
> >     > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> >     > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> >     > >     > use case has become a bit more tricky. I'm wondering what the
> >     > intended
> >     > >     > device implementation is.
> >     > >     >
> >     > >     > When inflating the balloon against page cache (i.e. no free
> >     memory
> >     > >     > remains) vmscan.c will both shrink page cache, but also invoke
> >     the
> >     > >     > shrinkers -- including the balloon's shrinker. So the balloon
> >     driver
> >     > >     > allocates memory which requires reclaim, vmscan gets this memory
> >     > by
> >     > >     > shrinking the balloon, and then the driver adds the memory back
> >     to
> >     > the
> >     > >     > balloon. Basically a busy no-op.
> > 
> >     Per my understanding, the balloon allocation won’t invoke shrinker as
> >     __GFP_DIRECT_RECLAIM isn't set, no?
> > 
> > I could be wrong about the mechanism, but the device sees lots of activity on
> > the deflate queue. The balloon is being shrunk. And this only starts once all
> > free memory is depleted and we're inflating into page cache.
> 
> So given this looks like a regression, maybe we should revert the
> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
> at all.
> 
> So it looks like all this rework introduced more issues than it
> addressed ...
> 
> I also CC Alex Duyck for an opinion on this.
> Alex, what do you use to put pressure on page cache?

I would say reverting probably makes sense. I'm not sure there is much
value to having a shrinker running deflation when you are actively trying
to increase the balloon. It would make more sense to wait until you are
actually about to start hitting oom.

As far as putting pressure on the page cache goes, I don't have anything
actively doing it in my current test environment. I was keeping things
simple and just resorting to using "echo 3 > /proc/sys/vm/drop_caches"
when I needed to flush out page cache entries. I was planning to work on
the page cache pressure piece as a next step.

Did you see the thread recently in response to my v16.1 patch set? We
actually had a similar discussion there about how best to exert pressure
on the page cache. One thing I realized is that there are already several
similar efforts ongoing, so my hope is to discuss this at LSF/MM, decide
on a way to consolidate the existing work into something that works for
most, and get that submitted over the next year. A link to that
discussion can be found here:
https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/

Thanks.

- Alex




* Re: Balloon pressuring page cache
  2020-02-03 16:18             ` Alexander Duyck
@ 2020-02-03 16:34               ` David Hildenbrand
  2020-02-03 17:03                 ` Michael S. Tsirkin
  2020-02-03 22:50                 ` Nadav Amit
  0 siblings, 2 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-03 16:34 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin, Tyler Sanderson
  Cc: Wang, Wei W, virtualization, David Rientjes, linux-mm, Michal Hocko

On 03.02.20 17:18, Alexander Duyck wrote:
> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>
>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>>>
>>>     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>     > On 29.01.20 20:11, Tyler Sanderson wrote:
>>>     > >
>>>     > >
>>>     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>>     > > <mailto:david@redhat.com>> wrote:
>>>     > >
>>>     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>     > >     > A primary advantage of virtio balloon over other memory reclaim
>>>     > >     > mechanisms is that it can pressure the guest's page cache into
>>>     > >     shrinking.
>>>     > >     >
>>>     > >     > However, since the balloon driver changed to using the shrinker
>>>     API
>>>     > >     >
>>>     > >
>>>     > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>     > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>     > >     > use case has become a bit more tricky. I'm wondering what the
>>>     > intended
>>>     > >     > device implementation is.
>>>     > >     >
>>>     > >     > When inflating the balloon against page cache (i.e. no free
>>>     memory
>>>     > >     > remains) vmscan.c will both shrink page cache, but also invoke
>>>     the
>>>     > >     > shrinkers -- including the balloon's shrinker. So the balloon
>>>     driver
>>>     > >     > allocates memory which requires reclaim, vmscan gets this memory
>>>     > by
>>>     > >     > shrinking the balloon, and then the driver adds the memory back
>>>     to
>>>     > the
>>>     > >     > balloon. Basically a busy no-op.
>>>
>>>     Per my understanding, the balloon allocation won’t invoke shrinker as
>>>     __GFP_DIRECT_RECLAIM isn't set, no?
>>>
>>> I could be wrong about the mechanism, but the device sees lots of activity on
>>> the deflate queue. The balloon is being shrunk. And this only starts once all
>>> free memory is depleted and we're inflating into page cache.
>>
>> So given this looks like a regression, maybe we should revert the
>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>> at all.
>>
>> So it looks like all this rework introduced more issues than it
>> addressed ...
>>
>> I also CC Alex Duyck for an opinion on this.
>> Alex, what do you use to put pressure on page cache?
> 
> I would say reverting probably makes sense. I'm not sure there is much
> value to having a shrinker running deflation when you are actively trying
> to increase the balloon. It would make more sense to wait until you are
> actually about to start hitting oom.

I think the shrinker makes sense for the free page hinting feature
(everything on free_page_list).

So instead of only reverting, I think we should split it up and always
register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
notifier (as before) for VIRTIO_BALLOON_F_DEFLATE_ON_OOM.

(Of course, adapting what is being done in the shrinker and in the OOM
notifier)

-- 
Thanks,

David / dhildenb




* Re: Balloon pressuring page cache
  2020-02-03 16:34               ` David Hildenbrand
@ 2020-02-03 17:03                 ` Michael S. Tsirkin
  2020-02-03 20:32                   ` Tyler Sanderson
  2020-02-03 22:50                 ` Nadav Amit
  1 sibling, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-03 17:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Tyler Sanderson, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Mon, Feb 03, 2020 at 05:34:20PM +0100, David Hildenbrand wrote:
> On 03.02.20 17:18, Alexander Duyck wrote:
> > On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
> >> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> >>>
> >>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> >>>
> >>>     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> >>>     > On 29.01.20 20:11, Tyler Sanderson wrote:
> >>>     > >
> >>>     > >
> >>>     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> >>>     > > <mailto:david@redhat.com>> wrote:
> >>>     > >
> >>>     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >>>     > >     > A primary advantage of virtio balloon over other memory reclaim
> >>>     > >     > mechanisms is that it can pressure the guest's page cache into
> >>>     > >     shrinking.
> >>>     > >     >
> >>>     > >     > However, since the balloon driver changed to using the shrinker
> >>>     API
> >>>     > >     >
> >>>     > >
> >>>     > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> >>>     > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> >>>     > >     > use case has become a bit more tricky. I'm wondering what the
> >>>     > intended
> >>>     > >     > device implementation is.
> >>>     > >     >
> >>>     > >     > When inflating the balloon against page cache (i.e. no free
> >>>     memory
> >>>     > >     > remains) vmscan.c will both shrink page cache, but also invoke
> >>>     the
> >>>     > >     > shrinkers -- including the balloon's shrinker. So the balloon
> >>>     driver
> >>>     > >     > allocates memory which requires reclaim, vmscan gets this memory
> >>>     > by
> >>>     > >     > shrinking the balloon, and then the driver adds the memory back
> >>>     to
> >>>     > the
> >>>     > >     > balloon. Basically a busy no-op.
> >>>
> >>>     Per my understanding, the balloon allocation won’t invoke shrinker as
> >>>     __GFP_DIRECT_RECLAIM isn't set, no?
> >>>
> >>> I could be wrong about the mechanism, but the device sees lots of activity on
> >>> the deflate queue. The balloon is being shrunk. And this only starts once all
> >>> free memory is depleted and we're inflating into page cache.
> >>
> >> So given this looks like a regression, maybe we should revert the
> >> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> >> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
> >> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
> >> at all.
> >>
> >> So it looks like all this rework introduced more issues than it
> >> addressed ...
> >>
> >> I also CC Alex Duyck for an opinion on this.
> >> Alex, what do you use to put pressure on page cache?
> > 
> > I would say reverting probably makes sense. I'm not sure there is much
> > value to having a shrinker running deflation when you are actively trying
> > to increase the balloon. It would make more sense to wait until you are
> > actually about to start hitting oom.
> 
> I think the shrinker makes sense for free page hinting feature
> (everything on free_page_list).
> 
> So instead of only reverting, I think we should split it up and always
> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.

OK ... I guess that means we need to fix the shrinker to take
VIRTIO_BALLOON_F_MUST_TELL_HOST into account correctly.
Hosts ignore the feature at the moment, but it is fragile for the
shrinker to do what it does and ignore used buffers.

> (Of course, adapting what is being done in the shrinker and in the OOM
> notifier)
> 
> -- 
> Thanks,
> 
> David / dhildenb




* Re: Balloon pressuring page cache
  2020-02-03 17:03                 ` Michael S. Tsirkin
@ 2020-02-03 20:32                   ` Tyler Sanderson
  2020-02-03 21:22                     ` Alexander Duyck
                                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-03 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 5748 bytes --]

There were apparently good reasons for moving away from the OOM notifier
callback:
https://lkml.org/lkml/2018/7/12/314
https://lkml.org/lkml/2018/8/2/322

In particular the OOM notifier is worse than the shrinker because:

   1. It is last-resort, which means the system has already gone through
   heroics to prevent OOM. Those heroic reclaim efforts are expensive and
   impact application performance.
   2. It lacks understanding of NUMA or other OOM constraints.
   3. It has a higher potential for bugs due to the subtlety of the
   callback context.

Given the above, I think the shrinker API certainly makes the most sense
_if_ the balloon size is static. In that case memory should be reclaimed
from the balloon early and proportionally to balloon size, which the
shrinker API achieves.

However, if the balloon is inflating and intentionally causing memory
pressure then this results in the inefficiency pointed out earlier.

If the balloon is inflating but not causing memory pressure then there is
no problem with either API.

This suggests another route: rather than cause memory pressure to shrink
the page cache, the balloon could issue the equivalent of "echo 3 >
/proc/sys/vm/drop_caches".
Of course ideally, we want to be more fine grained than "drop everything".
We really want an API that says "drop everything that hasn't been accessed
in the last 5 minutes".

This would eliminate the need for the balloon to cause memory pressure at
all which avoids the inefficiency in question. Furthermore, this pairs
nicely with the FREE_PAGE_HINT feature.
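For illustration, the guest-side half of that idea can be sketched as a
small helper (hypothetical: the real drop_caches interface takes 1 = page
cache, 2 = slab, 3 = both; the path parameter exists only so the sketch can
be exercised against a scratch file without root):

```c
#include <stdio.h>

/* Ask the kernel to drop clean caches by writing `level` to the given
 * control file (normally /proc/sys/vm/drop_caches). Returns 0 on
 * success, -1 on failure. */
int request_cache_drop(const char *ctl_path, int level)
{
    FILE *f = fopen(ctl_path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%d\n", level) > 0;
    fclose(f);
    return ok ? 0 : -1;
}
```

A fine-grained "drop only what is cold" variant would need a new kernel API;
nothing like it exists in drop_caches today.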


On Mon, Feb 3, 2020 at 9:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Feb 03, 2020 at 05:34:20PM +0100, David Hildenbrand wrote:
> > On 03.02.20 17:18, Alexander Duyck wrote:
> > > On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
> > >> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> > >>>
> > >>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com>
> wrote:
> > >>>
> > >>>     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> > >>>     > On 29.01.20 20:11, Tyler Sanderson wrote:
> > >>>     > >
> > >>>     > >
> > >>>     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <
> david@redhat.com
> > >>>     > > <mailto:david@redhat.com>> wrote:
> > >>>     > >
> > >>>     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization
> wrote:
> > >>>     > >     > A primary advantage of virtio balloon over other
> memory reclaim
> > >>>     > >     > mechanisms is that it can pressure the guest's page
> cache into
> > >>>     > >     shrinking.
> > >>>     > >     >
> > >>>     > >     > However, since the balloon driver changed to using the
> shrinker
> > >>>     API
> > >>>     > >     >
> > >>>     > >
> > >>>     > <
> https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> > >>>     > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> > >>>     > >     > use case has become a bit more tricky. I'm wondering
> what the
> > >>>     > intended
> > >>>     > >     > device implementation is.
> > >>>     > >     >
> > >>>     > >     > When inflating the balloon against page cache (i.e. no
> free
> > >>>     memory
> > >>>     > >     > remains) vmscan.c will both shrink page cache, but
> also invoke
> > >>>     the
> > >>>     > >     > shrinkers -- including the balloon's shrinker. So the
> balloon
> > >>>     driver
> > >>>     > >     > allocates memory which requires reclaim, vmscan gets
> this memory
> > >>>     > by
> > >>>     > >     > shrinking the balloon, and then the driver adds the
> memory back
> > >>>     to
> > >>>     > the
> > >>>     > >     > balloon. Basically a busy no-op.
> > >>>
> > >>>     Per my understanding, the balloon allocation won’t invoke
> shrinker as
> > >>>     __GFP_DIRECT_RECLAIM isn't set, no?
> > >>>
> > >>> I could be wrong about the mechanism, but the device sees lots of
> activity on
> > >>> the deflate queue. The balloon is being shrunk. And this only starts
> once all
> > >>> free memory is depleted and we're inflating into page cache.
> > >>
> > >> So given this looks like a regression, maybe we should revert the
> > >> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier
> with shrinker")
> > >> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
> > >> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
> > >> at all.
> > >>
> > >> So it looks like all this rework introduced more issues than it
> > >> addressed ...
> > >>
> > >> I also CC Alex Duyck for an opinion on this.
> > >> Alex, what do you use to put pressure on page cache?
> > >
> > > I would say reverting probably makes sense. I'm not sure there is much
> > > value to having a shrinker running deflation when you are actively
> trying
> > > to increase the balloon. It would make more sense to wait until you are
> > > actually about to start hitting oom.
> >
> > I think the shrinker makes sense for free page hinting feature
> > (everything on free_page_list).
> >
> > So instead of only reverting, I think we should split it up and always
> > register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> > notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>
> OK ... I guess that means we need to fix shrinker to take
> VIRTIO_BALLOON_F_MUST_TELL_HOST into account correctly.
> Hosts ignore it at the moment but it's a fragile thing
> to do what it does and ignore used buffers.
>
> > (Of course, adapting what is being done in the shrinker and in the OOM
> > notifier)
> >
> > --
> > Thanks,
> >
> > David / dhildenb
>
>



* Re: Balloon pressuring page cache
  2020-02-03 20:32                   ` Tyler Sanderson
@ 2020-02-03 21:22                     ` Alexander Duyck
  2020-02-03 23:16                       ` Tyler Sanderson
  2020-02-04  5:45                     ` Michael S. Tsirkin
  2020-02-04  8:29                     ` David Hildenbrand
  2 siblings, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2020-02-03 21:22 UTC (permalink / raw)
  To: Tyler Sanderson, Michael S. Tsirkin
  Cc: David Hildenbrand, Wang, Wei W, virtualization, David Rientjes,
	linux-mm, Michal Hocko

On Mon, 2020-02-03 at 12:32 -0800, Tyler Sanderson wrote:
> There were apparently good reasons for moving away from OOM notifier
> callback:
> https://lkml.org/lkml/2018/7/12/314
> https://lkml.org/lkml/2018/8/2/322
> 
> In particular the OOM notifier is worse than the shrinker because:
> It is last-resort, which means the system has already gone through
> heroics to prevent OOM. Those heroic reclaim efforts are expensive and
> impact application performance.
> It lacks understanding of NUMA or other OOM constraints.
> It has a higher potential for bugs due to the subtlety of the callback
> context.
> Given the above, I think the shrinker API certainly makes the most sense
> _if_ the balloon size is static. In that case memory should be reclaimed
> from the balloon early and proportionally to balloon size, which the
> shrinker API achieves.

The problem is the shrinker doesn't have any concept of tiering or
priority. I suspect the reason for using the OOM notification is that in
practice it should be the last thing we are pulling memory out of, with
things like page cache and slab caches being first. Once we have pages
that are leaked out of the balloon by the shrinker it will trigger the
balloon wanting to reinflate. Ideally if the shrinker is running we
shouldn't be able to reinflate the balloon, and if we are reinflating the
balloon we shouldn't need to run the shrinker. The fact that we can do
both at the same time is problematic.
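The inflate/deflate feedback being described can be modeled with a toy
simulation (all numbers and the reclaim policy here are invented for
illustration; real vmscan interleaves page cache reclaim and shrinker
calls): once free memory is gone, every page the driver allocates for
inflation is produced by the balloon's own shrinker, so the balloon makes
no net progress and the page cache is never pressured.

```c
#include <assert.h>

struct sim {
    long free_pages;
    long page_cache;
    long balloon;
};

/* Toy page allocation for balloon inflation: take a free page if one
 * exists; otherwise "reclaim" runs all shrinkers, including the
 * balloon's, which leaks a page back out of the balloon. */
static long alloc_page_for_balloon(struct sim *s)
{
    if (s->free_pages > 0) {
        s->free_pages--;
        return 1;
    }
    if (s->balloon > 0)
        s->balloon--;      /* balloon shrinker deflates: busy no-op */
    else if (s->page_cache > 0)
        s->page_cache--;   /* only then does the page cache shrink */
    else
        return 0;
    return 1;
}

/* Try to inflate by `pages`; returns the net change in balloon size. */
long inflate(struct sim *s, long pages)
{
    long start = s->balloon;
    for (long i = 0; i < pages; i++)
        if (alloc_page_for_balloon(s))
            s->balloon++;
    return s->balloon - start;
}
```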

> However, if the balloon is inflating and intentionally causing memory
> pressure then this results in the inefficiency pointed out earlier.
> 
> If the balloon is inflating but not causing memory pressure then there
> is no problem with either API.

The entire point of the balloon is to cause memory pressure. Otherwise
essentially all we are really doing is hinting since the guest doesn't
need the memory and isn't going to use it any time soon.

> This suggests another route: rather than cause memory pressure to shrink
> the page cache, the balloon could issue the equivalent of "echo 3 >
> /proc/sys/vm/drop_caches".
> Of course ideally, we want to be more fine grained than "drop
> everything". We really want an API that says "drop everything that
> hasn't been accessed in the last 5 minutes".
> 
> This would eliminate the need for the balloon to cause memory pressure
> at all which avoids the inefficiency in question. Furthermore, this
> pairs nicely with the FREE_PAGE_HINT feature.

Something similar was brought up in the discussion we had about this in my
patch set. The problem is that a value like "5 minutes" implies we are
going to need to track some extra state somewhere to determine it.

An alternative is to essentially just slowly shrink memory for the guest.
We had some discussion about this in another thread, and the following
code example was brought up as a way to go about doing that:
https://github.com/Conan-Kudo/omv-kernel-rc/blob/master/0154-sysctl-vm-Fine-grained-cache-shrinking.patch

The idea is you essentially just slowly bleed the memory from the guest by
specifying some amount of MB of cache to be freed on some regular
interval.
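
That scheme can be sketched as a toy loop (hypothetical: the linked patch
exposes the knobs via sysctl; the function and parameter names here are
invented, and "MB freed per tick" stands in for whatever units the real
interface would use):

```c
#include <assert.h>

/* Toy model of periodic fine-grained cache shrinking: each tick frees
 * up to `mb_per_tick` MB of reclaimable cache, never going negative.
 * Returns the cache size remaining after `ticks` intervals. */
long bleed_cache(long cache_mb, long mb_per_tick, int ticks)
{
    for (int i = 0; i < ticks; i++) {
        long freed = cache_mb < mb_per_tick ? cache_mb : mb_per_tick;
        cache_mb -= freed;
    }
    return cache_mb;
}
```

The appeal is that the guest sheds cache gradually, with no allocation
pressure and therefore no shrinker feedback loop.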

Thanks.

- Alex




* Re: Balloon pressuring page cache
  2020-02-03 16:34               ` David Hildenbrand
  2020-02-03 17:03                 ` Michael S. Tsirkin
@ 2020-02-03 22:50                 ` Nadav Amit
  2020-02-04  8:35                   ` David Hildenbrand
  2020-02-05  7:35                   ` Nadav Amit
  1 sibling, 2 replies; 59+ messages in thread
From: Nadav Amit @ 2020-02-03 22:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Michael S. Tsirkin, Tyler Sanderson, Wang,
	Wei W, virtualization, David Rientjes, linux-mm, Michal Hocko

> On Feb 3, 2020, at 8:34 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 03.02.20 17:18, Alexander Duyck wrote:
>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>>>> 
>>>>    On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>>>>> <mailto:david@redhat.com>> wrote:
>>>>>> 
>>>>>>    On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>>>>> A primary advantage of virtio balloon over other memory reclaim
>>>>>>> mechanisms is that it can pressure the guest's page cache into
>>>>>>    shrinking.
>>>>>>> However, since the balloon driver changed to using the shrinker
>>>>    API
>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>>>>> use case has become a bit more tricky. I'm wondering what the
>>>>> intended
>>>>>>> device implementation is.
>>>>>>> 
>>>>>>> When inflating the balloon against page cache (i.e. no free
>>>>    memory
>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
>>>>    the
>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
>>>>    driver
>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
>>>>> by
>>>>>>> shrinking the balloon, and then the driver adds the memory back
>>>>    to
>>>>> the
>>>>>>> balloon. Basically a busy no-op.
>>>> 
>>>>    Per my understanding, the balloon allocation won’t invoke shrinker as
>>>>    __GFP_DIRECT_RECLAIM isn't set, no?
>>>> 
>>>> I could be wrong about the mechanism, but the device sees lots of activity on
>>>> the deflate queue. The balloon is being shrunk. And this only starts once all
>>>> free memory is depleted and we're inflating into page cache.
>>> 
>>> So given this looks like a regression, maybe we should revert the
>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>> at all.
>>> 
>>> So it looks like all this rework introduced more issues than it
>>> addressed ...
>>> 
>>> I also CC Alex Duyck for an opinion on this.
>>> Alex, what do you use to put pressure on page cache?
>> 
>> I would say reverting probably makes sense. I'm not sure there is much
>> value to having a shrinker running deflation when you are actively trying
>> to increase the balloon. It would make more sense to wait until you are
>> actually about to start hitting oom.
> 
> I think the shrinker makes sense for free page hinting feature
> (everything on free_page_list).
> 
> So instead of only reverting, I think we should split it up and always
> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
> (Of course, adapting what is being done in the shrinker and in the OOM
> notifier)

David,

Please keep me posted. I decided to adapt the same solution as the virtio
balloon for the VMware balloon. If the verdict is that this is damaging and
the OOM notifier should be used instead, I will submit patches to move to
OOM notifier as well.

Regards,
Nadav



* Re: Balloon pressuring page cache
  2020-02-03 21:22                     ` Alexander Duyck
@ 2020-02-03 23:16                       ` Tyler Sanderson
  2020-02-04  0:10                         ` Alexander Duyck
  0 siblings, 1 reply; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-03 23:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, David Hildenbrand, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 4548 bytes --]

On Mon, Feb 3, 2020 at 1:22 PM Alexander Duyck <
alexander.h.duyck@linux.intel.com> wrote:

> On Mon, 2020-02-03 at 12:32 -0800, Tyler Sanderson wrote:
> > There were apparently good reasons for moving away from OOM notifier
> > callback:
> > https://lkml.org/lkml/2018/7/12/314
> > https://lkml.org/lkml/2018/8/2/322
> >
> > In particular the OOM notifier is worse than the shrinker because:
> > It is last-resort, which means the system has already gone through
> > heroics to prevent OOM. Those heroic reclaim efforts are expensive and
> > impact application performance.
> > It lacks understanding of NUMA or other OOM constraints.
> > It has a higher potential for bugs due to the subtlety of the callback
> > context.
> > Given the above, I think the shrinker API certainly makes the most sense
> > _if_ the balloon size is static. In that case memory should be reclaimed
> > from the balloon early and proportionally to balloon size, which the
> > shrinker API achieves.
>
> The problem is the shrinker doesn't have any concept of tiering or
> priority. I suspect the reason for using the OOM notification is because in
> practice it should be the last thing we are pulling memory out of with
> things like page cache and slab caches being first. Once we have pages
> that are leaked out of the balloon by the shrinker it will trigger the
> balloon wanting to reinflate.

Deciding whether to trade IO performance (page cache) for memory-usage
efficiency (balloon) seems use-case dependent.
Deciding when to re-inflate is a similar policy choice.

If the balloon's shrinker priority is hard-coded to "last-resort" then
there would be no way to implement a policy where page cache growth could
shrink the balloon.
The current balloon implementation allows the host to implement this policy
and tune the tradeoff between balloon and page cache.


> Ideally if the shrinker is running we
> shouldn't be able to reinflate the balloon, and if we are reinflating the
> balloon we shouldn't need to run the shrinker. The fact that we can do
> both at the same time is problematic.
>
I agree that this is inefficient.


>
> > However, if the balloon is inflating and intentionally causing memory
> > pressure then this results in the inefficiency pointed out earlier.
> >
> > If the balloon is inflating but not causing memory pressure then there
> > is no problem with either API.
>
> The entire point of the balloon is to cause memory pressure. Otherwise
> essentially all we are really doing is hinting since the guest doesn't
> need the memory and isn't going to use it any time soon.
>
Causing memory pressure is just a mechanism to achieve increased reclaim.
If there was a better mechanism (like the fine-grained-cache-shrinking one
discussed below) then I think the balloon device would be perfectly
justified in using that instead (and maybe "balloon" becomes a misnomer. Oh
well).


>
> > This suggests another route: rather than cause memory pressure to shrink
> > the page cache, the balloon could issue the equivalent of "echo 3 >
> > /proc/sys/vm/drop_caches".
> > Of course ideally, we want to be more fine grained than "drop
> > everything". We really want an API that says "drop everything that
> > hasn't been accessed in the last 5 minutes".
> >
> > This would eliminate the need for the balloon to cause memory pressure
> > at all which avoids the inefficiency in question. Furthermore, this
> > pairs nicely with the FREE_PAGE_HINT feature.
>
> Something similar was brought up in the discussion we had about this in my
> patch set. The problem is, by trying to use a value like "5 minutes" it
> implies that we are going to need to track some extra state somewhere to
> determine that value.
>
> An alternative is to essentially just slowly shrink memory for the guest.
> We had some discussion about this in another thread, and the following
> code example was brought up as a way to go about doing that:
>
> https://github.com/Conan-Kudo/omv-kernel-rc/blob/master/0154-sysctl-vm-Fine-grained-cache-shrinking.patch
>
> The idea is you essentially just slowly bleed the memory from the guest by
> specifying some amount of MB of cache to be freed on some regular
> interval.
>
Makes sense. Whatever API is settled on, I'd just propose that we allow the
host to invoke it via the balloon device since the host has a host-global
view of memory and can make decisions that an individual guest cannot.

Alex, what is the status of your fine-grained-cache-shrinking patch? It
seems like a really good idea.


> Thanks.
>
> - Alex
>
>



* Re: Balloon pressuring page cache
  2020-02-03 23:16                       ` Tyler Sanderson
@ 2020-02-04  0:10                         ` Alexander Duyck
  0 siblings, 0 replies; 59+ messages in thread
From: Alexander Duyck @ 2020-02-04  0:10 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: Michael S. Tsirkin, David Hildenbrand, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Mon, 2020-02-03 at 15:16 -0800, Tyler Sanderson wrote:
> 
> 
> On Mon, Feb 3, 2020 at 1:22 PM Alexander Duyck <alexander.h.duyck@linux.intel.com> wrote:
> > On Mon, 2020-02-03 at 12:32 -0800, Tyler Sanderson wrote:
> > > There were apparently good reasons for moving away from OOM notifier
> > > callback:
> > > https://lkml.org/lkml/2018/7/12/314
> > > https://lkml.org/lkml/2018/8/2/322
> > > 
> > > In particular the OOM notifier is worse than the shrinker because:
> > > It is last-resort, which means the system has already gone through
> > > heroics to prevent OOM. Those heroic reclaim efforts are expensive and
> > > impact application performance.
> > > It lacks understanding of NUMA or other OOM constraints.
> > > It has a higher potential for bugs due to the subtlety of the callback
> > > context.
> > > Given the above, I think the shrinker API certainly makes the most sense
> > > _if_ the balloon size is static. In that case memory should be reclaimed
> > > from the balloon early and proportionally to balloon size, which the
> > > shrinker API achieves.
> > 
> > The problem is the shrinker doesn't have any concept of tiering or
> > priority. I suspect the reason for using the OOM notification is because in
> > practice it should be the last thing we are pulling memory out of with
> > things like page cache and slab caches being first. Once we have pages
> > that are leaked out of the balloon by the shrinker it will trigger the
> > balloon wanting to reinflate.
> 
> Deciding whether to trade IO performance (page cache) for memory-usage
> efficiency (balloon) seems use-case dependent.
> Deciding when to re-inflate is a similar policy choice.
> 
> If the balloon's shrinker priority is hard-coded to "last-resort" then
> there would be no way to implement a policy where page cache growth
> could shrink the balloon.
> The current balloon implementation allows the host to implement this
> policy and tune the tradeoff between balloon and page cache.

I'm not saying the priority has to be hard coded, but ideally we should be
able to have some sort of priority so we can make that kind of decision.
The general idea is that if the shrinker is running we shouldn't be trying
to inflate the balloon, and inflating the balloon shouldn't end up
triggering the shrinker on ourselves.

If our preference is to deflate the balloon rather than maintain page
cache then we really should have some way for us to avoid triggering
flushing of the page cache if we are inflating the balloon. What we need
is a clear definition for what we are willing to flush to fill the balloon
and what we are not and then we can use that to decide when we stop
inflating and start deflating based on memory pressure.

> > Ideally if the shrinker is running we
> > shouldn't be able to reinflate the balloon, and if we are reinflating the
> > balloon we shouldn't need to run the shrinker. The fact that we can do
> > both at the same time is problematic.
> 
> I agree that this is inefficient.
>  
> > > However, if the balloon is inflating and intentionally causing memory
> > > pressure then this results in the inefficiency pointed out earlier.
> > > 
> > > If the balloon is inflating but not causing memory pressure then there
> > > is no problem with either API.
> > 
> > The entire point of the balloon is to cause memory pressure. Otherwise
> > essentially all we are really doing is hinting since the guest doesn't
> > need the memory and isn't going to use it any time soon.
> 
> Causing memory pressure is just a mechanism to achieve increased
> reclaim. If there was a better mechanism (like the fine-grained-cache-
> shrinking one discussed below) then I think the balloon device would be
> perfectly justified in using that instead (and maybe "balloon" becomes a
> misnomer. Oh well).

Agreed. There have been a number of threads over the last month or two on
how to go about doing proactive reclaim, and a number of proposed
solutions going around. I am hoping this is one of the topics we can go
over at LSF/MM.

> > > This suggests another route: rather than cause memory pressure to shrink
> > > the page cache, the balloon could issue the equivalent of "echo 3 >
> > > /proc/sys/vm/drop_caches".
> > > Of course ideally, we want to be more fine grained than "drop
> > > everything". We really want an API that says "drop everything that
> > > hasn't been accessed in the last 5 minutes".
> > > 
> > > This would eliminate the need for the balloon to cause memory pressure
> > > at all which avoids the inefficiency in question. Furthermore, this
> > > pairs nicely with the FREE_PAGE_HINT feature.
> > 
> > Something similar was brought up in the discussion we had about this in my
> > patch set. The problem is, by trying to use a value like "5 minutes" it
> > implies that we are going to need to track some extra state somewhere to
> > determine that value.
> > 
> > An alternative is to essentially just slowly shrink memory for the guest.
> > We had some discussion about this in another thread, and the following
> > code example was brought up as a way to go about doing that:
> > https://github.com/Conan-Kudo/omv-kernel-rc/blob/master/0154-sysctl-vm-Fine-grained-cache-shrinking.patch
> > 
> > The idea is you essentially just slowly bleed the memory from the guest by
> > specifying some amount of MB of cache to be freed on some regular
> > interval.
> 
> Makes sense. Whatever API is settled on, I'd just propose that we allow
> the host to invoke it via the balloon device since the host has a host-
> global view of memory and can make decisions that an individual guest
> cannot.
> 
> Alex, what is the status of your fine-grained-cache-shrinking patch? It
> seems like a really good idea.

Well the patch I mentioned above is something that someone else was
working on back in 2017. I have been working on the page reporting piece
for the last year or so trying to get that sorted before I get to the
matter of dealing with how to go about triggering reclaim. I figure I can
work on reclaim while getting the QEMU bits accepted.

You might take a look at the discussion we had on my free page reporting
interface. I think we ended up with something like 3 or 4 different ways
to approach reclaim being proposed. The discussion can be found here:
https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/

Thanks.

- Alex




* Re: Balloon pressuring page cache
  2020-02-03 20:32                   ` Tyler Sanderson
  2020-02-03 21:22                     ` Alexander Duyck
@ 2020-02-04  5:45                     ` Michael S. Tsirkin
  2020-02-04  8:29                     ` David Hildenbrand
  2 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-04  5:45 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: David Hildenbrand, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Mon, Feb 03, 2020 at 12:32:05PM -0800, Tyler Sanderson wrote:
> There were apparently good reasons for moving away from OOM notifier callback:
> https://lkml.org/lkml/2018/7/12/314
> https://lkml.org/lkml/2018/8/2/322
> 
> In particular the OOM notifier is worse than the shrinker because:
> 
>  1. It is last-resort, which means the system has already gone through heroics
>     to prevent OOM. Those heroic reclaim efforts are expensive and impact
>     application performance.
>  2. It lacks understanding of NUMA or other OOM constraints.
>  3. It has a higher potential for bugs due to the subtlety of the callback
>     context.
> 
> Given the above, I think the shrinker API certainly makes the most sense _if_
> the balloon size is static. In that case memory should be reclaimed from the
> balloon early and proportionally to balloon size, which the shrinker API
> achieves.

OK that sounds like VIRTIO_BALLOON_F_FREE_PAGE_HINT then.

> However, if the balloon is inflating and intentionally causing memory pressure
> then this results in the inefficiency pointed out earlier.

And that sounds like VIRTIO_BALLOON_F_DEFLATE_ON_OOM.

> If the balloon is inflating but not causing memory pressure then there is no
> problem with either API.
> 
> This suggests another route: rather than cause memory pressure to shrink the
> page cache, the balloon could issue the equivalent of "echo 3 > /proc/sys/vm/
> drop_caches".
> Of course ideally, we want to be more fine grained than "drop everything". We
> really want an API that says "drop everything that hasn't been accessed in the
> last 5 minutes".
> 
> This would eliminate the need for the balloon to cause memory pressure at all
> which avoids the inefficiency in question. Furthermore, this pairs nicely with
> the FREE_PAGE_HINT feature.

Well we still do have a regression. So we probably should revert
for now, and separately look for better solutions.



> 
> On Mon, Feb 3, 2020 at 9:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Feb 03, 2020 at 05:34:20PM +0100, David Hildenbrand wrote:
>     > On 03.02.20 17:18, Alexander Duyck wrote:
>     > > On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>     > >> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>     > >>>
>     > >>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com>
>     wrote:
>     > >>>
>     > >>>     On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>     > >>>     > On 29.01.20 20:11, Tyler Sanderson wrote:
>     > >>>     > >
>     > >>>     > >
>     > >>>     > > On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <
>     david@redhat.com
>     > >>>     > > <mailto:david@redhat.com>> wrote:
>     > >>>     > >
>     > >>>     > >     On 29.01.20 01:22, Tyler Sanderson via Virtualization
>     wrote:
>     > >>>     > >     > A primary advantage of virtio balloon over other memory
>     reclaim
>     > >>>     > >     > mechanisms is that it can pressure the guest's page
>     cache into
>     > >>>     > >     shrinking.
>     > >>>     > >     >
>     > >>>     > >     > However, since the balloon driver changed to using the
>     shrinker
>     > >>>     API
>     > >>>     > >     >
>     > >>>     > >
>     > >>>     > <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>     > >>>     > >     > use case has become a bit more tricky. I'm wondering
>     what the
>     > >>>     > intended
>     > >>>     > >     > device implementation is.
>     > >>>     > >     >
>     > >>>     > >     > When inflating the balloon against page cache (i.e. no
>     free
>     > >>>     memory
>     > >>>     > >     > remains) vmscan.c will both shrink page cache, but also
>     invoke
>     > >>>     the
>     > >>>     > >     > shrinkers -- including the balloon's shrinker. So the
>     balloon
>     > >>>     driver
>     > >>>     > >     > allocates memory which requires reclaim, vmscan gets
>     this memory
>     > >>>     > by
>     > >>>     > >     > shrinking the balloon, and then the driver adds the
>     memory back
>     > >>>     to
>     > >>>     > the
>     > >>>     > >     > balloon. Basically a busy no-op.
>     > >>>
>     > >>>     Per my understanding, the balloon allocation won’t invoke
>     shrinker as
>     > >>>     __GFP_DIRECT_RECLAIM isn't set, no?
>     > >>>
>     > >>> I could be wrong about the mechanism, but the device sees lots of
>     activity on
>     > >>> the deflate queue. The balloon is being shrunk. And this only starts
>     once all
>     > >>> free memory is depleted and we're inflating into page cache.
>     > >>
>     > >> So given this looks like a regression, maybe we should revert the
>     > >> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier
>     with shrinker")
>     > >> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>     > >> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>     > >> at all.
>     > >>
>     > >> So it looks like all this rework introduced more issues than it
>     > >> addressed ...
>     > >>
>     > >> I also CC Alex Duyck for an opinion on this.
>     > >> Alex, what do you use to put pressure on page cache?
>     > >
>     > > I would say reverting probably makes sense. I'm not sure there is much
>     > > value to having a shrinker running deflation when you are actively
>     trying
>     > > to increase the balloon. It would make more sense to wait until you are
>     > > actually about to start hitting oom.
>     >
>     > I think the shrinker makes sense for free page hinting feature
>     > (everything on free_page_list).
>     >
>     > So instead of only reverting, I think we should split it up and always
>     > register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>     > notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
>     OK ... I guess that means we need to fix shrinker to take
>     VIRTIO_BALLOON_F_MUST_TELL_HOST into account correctly.
>     Hosts ignore it at the moment, but it's fragile for the code to
>     do what it does while ignoring used buffers.
> 
>     > (Of course, adapting what is being done in the shrinker and in the OOM
>     > notifier)
>     >
>     > --
>     > Thanks,
>     >
>     > David / dhildenb
> 
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-03 20:32                   ` Tyler Sanderson
  2020-02-03 21:22                     ` Alexander Duyck
  2020-02-04  5:45                     ` Michael S. Tsirkin
@ 2020-02-04  8:29                     ` David Hildenbrand
  2020-02-04 18:52                       ` Tyler Sanderson
  2 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04  8:29 UTC (permalink / raw)
  To: Tyler Sanderson, Michael S. Tsirkin
  Cc: Alexander Duyck, Wang, Wei W, virtualization, David Rientjes,
	linux-mm, Michal Hocko

On 03.02.20 21:32, Tyler Sanderson wrote:
> There were apparently good reasons for moving away from OOM notifier
> callback:
> https://lkml.org/lkml/2018/7/12/314
> https://lkml.org/lkml/2018/8/2/322
> 
> In particular the OOM notifier is worse than the shrinker because:

The issue is that DEFLATE_ON_OOM is under-specified.

> 
>  1. It is last-resort, which means the system has already gone through
>     heroics to prevent OOM. Those heroic reclaim efforts are expensive
>     and impact application performance.

That's *exactly* what "deflate on OOM" suggests.

Assume you are using virtio-balloon for some weird way of memory
hotunplug (which is what some people do) and you want to minimize the
footprint of your guest. Then you really only want to give the guest
more memory (or rather, let it take back memory automatically in this
case) in case it really needs more memory. It should try to reclaim first.

Under-specified.


>  2. It lacks understanding of NUMA or other OOM constraints.

Ballooning in general lacks the understanding of NUMA.

>  3. It has a higher potential for bugs due to the subtlety of the
>     callback context.

While that is a valid point, it doesn't explain why existing
functionality was changed.

Personally, I think DEFLATE_ON_OOM should never have been introduced (at
least not in this form).


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-03 22:50                 ` Nadav Amit
@ 2020-02-04  8:35                   ` David Hildenbrand
  2020-02-04  8:40                     ` Michael S. Tsirkin
  2020-02-05  7:35                   ` Nadav Amit
  1 sibling, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04  8:35 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Alexander Duyck, Michael S. Tsirkin, Tyler Sanderson, Wang,
	Wei W, virtualization, David Rientjes, linux-mm, Michal Hocko

>>> I would say reverting probably makes sense. I'm not sure there is much
>>> value to having a shrinker running deflation when you are actively trying
>>> to increase the balloon. It would make more sense to wait until you are
>>> actually about to start hitting oom.
>>
>> I think the shrinker makes sense for free page hinting feature
>> (everything on free_page_list).
>>
>> So instead of only reverting, I think we should split it up and always
>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.

s/VIRTIO_BALLOON_F_MUST_TELL_HOST/VIRTIO_BALLOON_F_DEFLATE_ON_OOM/

:)

>>
>> (Of course, adapting what is being done in the shrinker and in the OOM
>> notifier)
> 
> David,
> 
> Please keep me posted. I decided to adapt the same solution as the virtio
> balloon for the VMware balloon. If the verdict is that this is damaging and
> the OOM notifier should be used instead, I will submit patches to move to
> OOM notifier as well.

Will do. It all sounds sub-optimal to me at this point ... but I prefer
the old variant where a simple "drop_slab()" won't deflate the balloon.
That looks broken to me.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04  8:35                   ` David Hildenbrand
@ 2020-02-04  8:40                     ` Michael S. Tsirkin
  2020-02-04  8:48                       ` David Hildenbrand
  2020-02-04 14:30                       ` David Hildenbrand
  0 siblings, 2 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-04  8:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Tue, Feb 04, 2020 at 09:35:21AM +0100, David Hildenbrand wrote:
> >>> I would say reverting probably makes sense. I'm not sure there is much
> >>> value to having a shrinker running deflation when you are actively trying
> >>> to increase the balloon. It would make more sense to wait until you are
> >>> actually about to start hitting oom.
> >>
> >> I think the shrinker makes sense for free page hinting feature
> >> (everything on free_page_list).
> >>
> >> So instead of only reverting, I think we should split it up and always
> >> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> >> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
> s/VIRTIO_BALLOON_F_MUST_TELL_HOST/VIRTIO_BALLOON_F_DEFLATE_ON_OOM/
> 
> :)

Well VIRTIO_BALLOON_F_MUST_TELL_HOST is also broken by shrinker
with VIRTIO_BALLOON_F_FREE_PAGE_HINT as that code adds buffers
but does not wait for them to be used even with VIRTIO_BALLOON_F_MUST_TELL_HOST.
We never noticed because QEMU does not advertise
VIRTIO_BALLOON_F_MUST_TELL_HOST.


> >>
> >> (Of course, adapting what is being done in the shrinker and in the OOM
> >> notifier)
> > 
> > David,
> > 
> > Please keep me posted. I decided to adapt the same solution as the virtio
> > balloon for the VMware balloon. If the verdict is that this is damaging and
> > the OOM notifier should be used instead, I will submit patches to move to
> > OOM notifier as well.
> 
> Will do. It all sounds sub-optimal to me at this point ... but I prefer
> the old variant where a simple "drop_slab()" won't deflate the balloon.
> That looks broken to me.

Okay. Could you post a patch?

> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04  8:40                     ` Michael S. Tsirkin
@ 2020-02-04  8:48                       ` David Hildenbrand
  2020-02-04 14:30                       ` David Hildenbrand
  1 sibling, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04  8:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On 04.02.20 09:40, Michael S. Tsirkin wrote:
> On Tue, Feb 04, 2020 at 09:35:21AM +0100, David Hildenbrand wrote:
>>>>> I would say reverting probably makes sense. I'm not sure there is much
>>>>> value to having a shrinker running deflation when you are actively trying
>>>>> to increase the balloon. It would make more sense to wait until you are
>>>>> actually about to start hitting oom.
>>>>
>>>> I think the shrinker makes sense for free page hinting feature
>>>> (everything on free_page_list).
>>>>
>>>> So instead of only reverting, I think we should split it up and always
>>>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>>>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>>
>> s/VIRTIO_BALLOON_F_MUST_TELL_HOST/VIRTIO_BALLOON_F_DEFLATE_ON_OOM/
>>
>> :)
> 
> Well VIRTIO_BALLOON_F_MUST_TELL_HOST is also broken by shrinker
> with VIRTIO_BALLOON_F_FREE_PAGE_HINT as that code adds buffers
> but does not wait for them to be used even with VIRTIO_BALLOON_F_MUST_TELL_HOST.
> We never noticed because QEMU does not advertise
> VIRTIO_BALLOON_F_MUST_TELL_HOST.

Will try to figure out how to best undo this mess :)

> 
> 
>>>>
>>>> (Of course, adapting what is being done in the shrinker and in the OOM
>>>> notifier)
>>>
>>> David,
>>>
>>> Please keep me posted. I decided to adapt the same solution as the virtio
>>> balloon for the VMware balloon. If the verdict is that this is damaging and
>>> the OOM notifier should be used instead, I will submit patches to move to
>>> OOM notifier as well.
>>
>> Will do. It all sounds sub-optimal to me at this point ... but I prefer
>> the old variant where a simple "drop_slab()" won't deflate the balloon.
>> That looks broken to me.
> 
> Okay. Could you post a patch?

I can give it a shot.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04  8:40                     ` Michael S. Tsirkin
  2020-02-04  8:48                       ` David Hildenbrand
@ 2020-02-04 14:30                       ` David Hildenbrand
  2020-02-04 16:50                         ` Michael S. Tsirkin
  2020-02-05  6:49                         ` Wang, Wei W
  1 sibling, 2 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04 14:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On 04.02.20 09:40, Michael S. Tsirkin wrote:
> On Tue, Feb 04, 2020 at 09:35:21AM +0100, David Hildenbrand wrote:
>>>>> I would say reverting probably makes sense. I'm not sure there is much
>>>>> value to having a shrinker running deflation when you are actively trying
>>>>> to increase the balloon. It would make more sense to wait until you are
>>>>> actually about to start hitting oom.
>>>>
>>>> I think the shrinker makes sense for free page hinting feature
>>>> (everything on free_page_list).
>>>>
>>>> So instead of only reverting, I think we should split it up and always
>>>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>>>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>>
>> s/VIRTIO_BALLOON_F_MUST_TELL_HOST/VIRTIO_BALLOON_F_DEFLATE_ON_OOM/
>>
>> :)
> 
> Well VIRTIO_BALLOON_F_MUST_TELL_HOST is also broken by shrinker
> with VIRTIO_BALLOON_F_FREE_PAGE_HINT as that code adds buffers
> but does not wait for them to be used even with VIRTIO_BALLOON_F_MUST_TELL_HOST.
> > We never noticed because QEMU does not advertise
> VIRTIO_BALLOON_F_MUST_TELL_HOST.

So, I am trying to understand how the code is intended to work, but I
am afraid I am missing something (or, to rephrase: I think I found a BUG :)
and there is a lack of proper documentation about this feature).

a) We allocate pages and add them to the list as long as we are told to do so.
   We send these pages to the host one by one.
b) We free all pages once we get a STOP signal. Until then, we keep pages allocated.
c) When called via the shrinker, we want to free pages from the list, even
though the hypervisor did not notify us to do so.


Issue 1: When we unload the balloon driver in the guest in an unlucky event,
we won't free the pages. We are missing something like (if I am not wrong):

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b1d2068fa2bd..e2b0925e1e83 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -929,6 +929,10 @@ static void remove_common(struct virtio_balloon *vb)
                leak_balloon(vb, vb->num_pages);
        update_balloon_size(vb);
 
+       /* There might be free pages that are being reported: release them. */
+       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+               return_free_pages_to_mm(vb, ULONG_MAX);
+
        /* Now we reset the device so we can clean up the queues. */
        vb->vdev->config->reset(vb->vdev);
 

Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be
that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. I assume this means
(-ENOCLUE) that we have to wait until the hypervisor notifies us via the STOP? Or
for which event do we have to wait? Because there is no way to *tell host* here
that we want to reuse a page. The hypervisor will *tell us* when we can reuse pages.

For the shrinker it is simple: Don't use the shrinker with
VIRTIO_BALLOON_F_MUST_TELL_HOST :) . But to fix Issue 1, we *would* have to wait
until we get a STOP signal. That is not really possible because it might
take an infinite amount of time.

Michael, any clue on which event we have to wait with
VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
VIRTIO_BALLOON_F_MUST_TELL_HOST applies to VIRTIO_BALLOON_F_FREE_PAGE_HINT and
we'd better document that. It introduces complexity with no clear benefit.

-- 
Thanks,

David / dhildenb



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 14:30                       ` David Hildenbrand
@ 2020-02-04 16:50                         ` Michael S. Tsirkin
  2020-02-04 16:56                           ` David Hildenbrand
  2020-02-05  6:52                           ` Wang, Wei W
  2020-02-05  6:49                         ` Wang, Wei W
  1 sibling, 2 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-04 16:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Tue, Feb 04, 2020 at 03:30:19PM +0100, David Hildenbrand wrote:
> On 04.02.20 09:40, Michael S. Tsirkin wrote:
> > On Tue, Feb 04, 2020 at 09:35:21AM +0100, David Hildenbrand wrote:
> >>>>> I would say reverting probably makes sense. I'm not sure there is much
> >>>>> value to having a shrinker running deflation when you are actively trying
> >>>>> to increase the balloon. It would make more sense to wait until you are
> >>>>> actually about to start hitting oom.
> >>>>
> >>>> I think the shrinker makes sense for free page hinting feature
> >>>> (everything on free_page_list).
> >>>>
> >>>> So instead of only reverting, I think we should split it up and always
> >>>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> >>>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> >>
> >> s/VIRTIO_BALLOON_F_MUST_TELL_HOST/VIRTIO_BALLOON_F_DEFLATE_ON_OOM/
> >>
> >> :)
> > 
> > Well VIRTIO_BALLOON_F_MUST_TELL_HOST is also broken by shrinker
> > with VIRTIO_BALLOON_F_FREE_PAGE_HINT as that code adds buffers
> > but does not wait for them to be used even with VIRTIO_BALLOON_F_MUST_TELL_HOST.
> > > We never noticed because QEMU does not advertise
> > VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
> So, I am trying to understand how the code is intended to work, but I
> am afraid I am missing something (or, to rephrase: I think I found a BUG :)
> and there is a lack of proper documentation about this feature).
> 
> a) We allocate pages and add them to the list as long as we are told to do so.
>    We send these pages to the host one by one.
> b) We free all pages once we get a STOP signal. Until then, we keep pages allocated.
> c) When called via the shrinker, we want to free pages from the list, even
> though the hypervisor did not notify us to do so.
> 
> 
> Issue 1: When we unload the balloon driver in the guest in an unlucky event,
> we won't free the pages. We are missing something like (if I am not wrong):
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b1d2068fa2bd..e2b0925e1e83 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -929,6 +929,10 @@ static void remove_common(struct virtio_balloon *vb)
>                 leak_balloon(vb, vb->num_pages);
>         update_balloon_size(vb);
>  
> +       /* There might be free pages that are being reported: release them. */
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> +               return_free_pages_to_mm(vb, ULONG_MAX);
> +
>         /* Now we reset the device so we can clean up the queues. */
>         vb->vdev->config->reset(vb->vdev);


Indeed.

> 
> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be
> that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. I assume this means
> (-ENOCLUE) that we have to wait until the hypervisor notifies us via the STOP? Or
> for which event do we have to wait? Because there is no way to *tell host* here
> that we want to reuse a page. The hypervisor will *tell us* when we can reuse pages.
> For the shrinker it is simple: Don't use the shrinker with
> VIRTIO_BALLOON_F_MUST_TELL_HOST :) . But to fix Issue 1, we *would* have to wait
> until we get a STOP signal. That is not really possible because it might
> take an infinite amount of time.
> 
> Michael, any clue on which event we have to wait with
> VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
> VIRTIO_BALLOON_F_MUST_TELL_HOST applies to VIRTIO_BALLOON_F_FREE_PAGE_HINT and
> we'd better document that. It introduces complexity with no clear benefit.

I meant that we must wait for the host to see the hint, which is
signalled by the host using the buffer.
VIRTIO_BALLOON_F_MUST_TELL_HOST and we need a separate new flag for
that. Then current code won't be broken (yay!) but we need to
document another flag that's pretty similar.

> 
> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 16:50                         ` Michael S. Tsirkin
@ 2020-02-04 16:56                           ` David Hildenbrand
  2020-02-04 20:33                             ` Michael S. Tsirkin
  2020-02-05  6:52                           ` Wang, Wei W
  1 sibling, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04 16:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

[...]

>>
>> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be
>> that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. I assume this means
>> (-ENOCLUE) that we have to wait until the hypervisor notifies us via the STOP? Or
>> for which event do we have to wait? Because there is no way to *tell host* here
>> that we want to reuse a page. The hypervisor will *tell us* when we can reuse pages.
>> For the shrinker it is simple: Don't use the shrinker with
>> VIRTIO_BALLOON_F_MUST_TELL_HOST :) . But to fix Issue 1, we *would* have to wait
>> until we get a STOP signal. That is not really possible because it might
>> take an infinite amount of time.
>>
>> Michael, any clue on which event we have to wait with
>> VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
>> VIRTIO_BALLOON_F_MUST_TELL_HOST applies to VIRTIO_BALLOON_F_FREE_PAGE_HINT and
>> we'd better document that. It introduces complexity with no clear benefit.
> 
> I meant that we must wait for host to see the hint. Signalled via using
> the buffer.  But maybe that's too far in the meaning from
> VIRTIO_BALLOON_F_MUST_TELL_HOST and we need a separate new flag for

Yes, that's what I think.

> that. Then current code won't be broken (yay!) but we need to
> document another flag that's pretty similar.

I mean, do we need a flag at all as long as there is no user?
Introducing and documenting a flag that nobody uses does not sound
like work I will enjoy :)

We can simply document "VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply
to FREE_PAGE_HINTING" and "with FREE_PAGE_HINTING, the guest can reuse
pages any time, without waiting for a response/ack from the hypervisor".

Thoughts?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04  8:29                     ` David Hildenbrand
@ 2020-02-04 18:52                       ` Tyler Sanderson
  2020-02-04 18:56                         ` Michael S. Tsirkin
  2020-02-04 19:17                         ` David Hildenbrand
  0 siblings, 2 replies; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-04 18:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit

[-- Attachment #1: Type: text/plain, Size: 1988 bytes --]

On Tue, Feb 4, 2020 at 12:29 AM David Hildenbrand <david@redhat.com> wrote:

> On 03.02.20 21:32, Tyler Sanderson wrote:
> > There were apparently good reasons for moving away from OOM notifier
> > callback:
> > https://lkml.org/lkml/2018/7/12/314
> > https://lkml.org/lkml/2018/8/2/322
> >
> > In particular the OOM notifier is worse than the shrinker because:
>
> The issue is that DEFLATE_ON_OOM is under-specified.
>
> >
> >  1. It is last-resort, which means the system has already gone through
> >     heroics to prevent OOM. Those heroic reclaim efforts are expensive
> >     and impact application performance.
>
> That's *exactly* what "deflate on OOM" suggests.
>

It seems there are some use cases where "deflate on OOM" is desired and
others where "deflate on pressure" is desired.
This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that registers
the shrinker, and reverting DEFLATE_ON_OOM to use the OOM notifier callback.

This lets users configure the balloon for their use case.


>
> Assume you are using virtio-balloon for some weird way of memory
> hotunplug (which is what some people do) and you want to minimize the
> footprint of your guest. Then you really only want to give the guest
> more memory (or rather, let it take back memory automatically in this
> case) in case it really needs more memory. It should try to reclaim first.
>
> Under-specified.
>
>
> >  2. It lacks understanding of NUMA or other OOM constraints.
>
> Ballooning in general lacks the understanding of NUMA.
>
> >  3. It has a higher potential for bugs due to the subtlety of the
> >     callback context.
>
> While that is a valid point, it doesn't explain why existing
> functionality is changed.
>
> Personally, I think DEFLATE_ON_OOM should never have been introduced (at
> least not in this form).
>
I'm actually not sure how you would safely do memory overcommit without
DEFLATE_ON_OOM. So I think it unlocks a huge use case.


>
>
> --
> Thanks,
>
> David / dhildenb
>
>

[-- Attachment #2: Type: text/html, Size: 3065 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 18:52                       ` Tyler Sanderson
@ 2020-02-04 18:56                         ` Michael S. Tsirkin
  2020-02-04 19:17                         ` David Hildenbrand
  1 sibling, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-04 18:56 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: David Hildenbrand, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit

On Tue, Feb 04, 2020 at 10:52:42AM -0800, Tyler Sanderson wrote:
> 
> 
> On Tue, Feb 4, 2020 at 12:29 AM David Hildenbrand <david@redhat.com> wrote:
> 
>     On 03.02.20 21:32, Tyler Sanderson wrote:
>     > There were apparently good reasons for moving away from OOM notifier
>     > callback:
>     > https://lkml.org/lkml/2018/7/12/314
>     > https://lkml.org/lkml/2018/8/2/322
>     >
>     > In particular the OOM notifier is worse than the shrinker because:
> 
>     The issue is that DEFLATE_ON_OOM is under-specified.
> 
>     >
>     >  1. It is last-resort, which means the system has already gone through
>     >     heroics to prevent OOM. Those heroic reclaim efforts are expensive
>     >     and impact application performance.
> 
>     That's *exactly* what "deflate on OOM" suggests.
> 
> 
> It seems there are some use cases where "deflate on OOM" is desired and others
> where "deflate on pressure" is desired.
> This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that registers the
> shrinker, and reverting DEFLATE_ON_OOM to use the OOM notifier callback.
> 
> This lets users configure the balloon for their use case.

Right. Let's not repeat past mistakes and let's try to specify this
new one properly though :)

> 
> 
>     Assume you are using virtio-balloon for some weird way of memory
>     hotunplug (which is what some people do) and you want to minimize the
>     footprint of your guest. Then you really only want to give the guest
>     more memory (or rather, let it take back memory automatically in this
>     case) in case it really needs more memory. It should try to reclaim first.
> 
>     Under-specified.
> 
> 
>     >  2. It lacks understanding of NUMA or other OOM constraints.
> 
>     Ballooning in general lacks the understanding of NUMA.
> 
>     >  3. It has a higher potential for bugs due to the subtlety of the
>     >     callback context.
> 
>     While that is a valid point, it doesn't explain why existing
>     functionality is changed.
> 
>     Personally, I think DEFLATE_ON_OOM should never have been introduced (at
>     least not in this form).
> 
> I'm actually not sure how you would safely do memory overcommit without
> DEFLATE_ON_OOM. So I think it unlocks a huge use case.
>  
> 
> 
> 
>     --
>     Thanks,
> 
>     David / dhildenb
> 
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 18:52                       ` Tyler Sanderson
  2020-02-04 18:56                         ` Michael S. Tsirkin
@ 2020-02-04 19:17                         ` David Hildenbrand
  2020-02-04 23:58                           ` Tyler Sanderson
  1 sibling, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-04 19:17 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: Michael S. Tsirkin, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit

On 04.02.20 19:52, Tyler Sanderson wrote:
> 
> 
> On Tue, Feb 4, 2020 at 12:29 AM David Hildenbrand <david@redhat.com
> <mailto:david@redhat.com>> wrote:
> 
>     On 03.02.20 21:32, Tyler Sanderson wrote:
>     > There were apparently good reasons for moving away from OOM notifier
>     > callback:
>     > https://lkml.org/lkml/2018/7/12/314
>     > https://lkml.org/lkml/2018/8/2/322
>     >
>     > In particular the OOM notifier is worse than the shrinker because:
> 
>     The issue is that DEFLATE_ON_OOM is under-specified.
> 
>     >
>     >  1. It is last-resort, which means the system has already gone through
>     >     heroics to prevent OOM. Those heroic reclaim efforts are expensive
>     >     and impact application performance.
> 
>     That's *exactly* what "deflate on OOM" suggests.
> 
> 
> It seems there are some use cases where "deflate on OOM" is desired and
> others where "deflate on pressure" is desired.
> This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
> registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
> notifier callback.
> 
> This lets users configure the balloon for their use case.

You want the old behavior back, so why should we introduce a new one? Or
am I missing something? (you did want us to revert to old handling, no?)

I consider virtio-balloon to this very day a big hack. And I don't see
it getting better with new config knobs. Having that said, the
technologies that are candidates to replace it (free page reporting,
taming the guest page cache, etc.) are still not ready - so we'll have
to stick with it for now :( .

> 
> I'm actually not sure how you would safely do memory overcommit without
> DEFLATE_ON_OOM. So I think it unlocks a huge use case.

Using better suited technologies that are not ready yet (well, some form
of free page reporting is available under IBM z already but in a
proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only makes
it less likely that your guest will crash; it does not mean you can safely
squeeze the last bit of memory out of your guest VM.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 16:56                           ` David Hildenbrand
@ 2020-02-04 20:33                             ` Michael S. Tsirkin
  2020-02-05  8:31                               ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-04 20:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko,
	virtio-dev

On Tue, Feb 04, 2020 at 05:56:22PM +0100, David Hildenbrand wrote:
> [...]
> 
> >>
> >> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be
> >> that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. I assume this means
> >> (-ENOCLUE) that we have to wait until the hypervisor notifies us via the STOP? Or
> >> for which event do we have to wait? Because there is no way to *tell host* here
> >> that we want to reuse a page. The hypervisor will *tell us* when we can reuse pages.
> >> For the shrinker it is simple: Don't use the shrinker with
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST :) . But to fix Issue 1, we *would* have to wait
> >> until we get a STOP signal. That is not really possible because it might
> >> take an infinite amount of time.
> >>
> >> Michael, any clue on which event we have to wait with
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST applies to VIRTIO_BALLOON_F_FREE_PAGE_HINT and
> >> we'd better document that. It introduces complexity with no clear benefit.
> > 
> > I meant that we must wait for host to see the hint. Signalled via using
> > the buffer.  But maybe that's too far in the meaning from
> > VIRTIO_BALLOON_F_MUST_TELL_HOST and we need a separate new flag for
> 
> Yes, that's what I think.
> 
> > that. Then current code won't be broken (yay!) but we need to
> > document another flag that's pretty similar.
> 
> I mean, do we need a flag at all as long as there is no user?
> Introducing a flag and documenting it if nobody uses it does not sound
> like a work I will enjoy :)

It's not the user. It's the non-orthogonality that I find inelegant.

Let me try to formulate the issue, forgive me for thinking aloud
(and I Cc'd virtio-dev since we are talking spec things here):

The annoying thing is that with Alex's VIRTIO_BALLOON_F_REPORTING
host does depend on guest not touching memory before host uses it.
So functionally VIRTIO_BALLOON_F_FREE_PAGE_HINT and
VIRTIO_BALLOON_F_REPORTING really are supposed to do
exactly the same thing, with the differences being
- VIRTIO_BALLOON_F_FREE_PAGE_HINT comes upon host's request.
  VIRTIO_BALLOON_F_REPORTING is initiated by guest.
- VIRTIO_BALLOON_F_FREE_PAGE_HINT does not always wait for
  host to use the hint before touching the page.
  Well it almost always does, but there's an exception in the
  shrinker which tries to stop reporting as quickly as possible
  in the case of a slow host.
  VIRTIO_BALLOON_F_REPORTING always does.
  This means host can blow the page away when it sees the hint.

Now the point is that with VIRTIO_BALLOON_F_REPORTING
I think you really must wait for host to use the hint.
But with VIRTIO_BALLOON_F_FREE_PAGE_HINT it depends
on how host uses it. Something to think about,
I'm not sure what is the best thing to do here.


> We can simply document "VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply
> to FREE_PAGE_HINTING" and "with FREE_PAGE_HINTING, the guest can reuse
> pages any time, without waiting for a response/ack from the hypervisor".
> 
> Thoughts?
> 
> -- 
> Thanks,
> 
> David / dhildenb




* Re: Balloon pressuring page cache
  2020-02-04 19:17                         ` David Hildenbrand
@ 2020-02-04 23:58                           ` Tyler Sanderson
  2020-02-05  0:15                             ` Tyler Sanderson
  2020-02-05  6:57                             ` Michael S. Tsirkin
  0 siblings, 2 replies; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-04 23:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit


On Tue, Feb 4, 2020 at 11:17 AM David Hildenbrand <david@redhat.com> wrote:

> On 04.02.20 19:52, Tyler Sanderson wrote:
> >
> >
> > On Tue, Feb 4, 2020 at 12:29 AM David Hildenbrand <david@redhat.com
> > <mailto:david@redhat.com>> wrote:
> >
> >     On 03.02.20 21:32, Tyler Sanderson wrote:
> >     > There were apparently good reasons for moving away from OOM
> notifier
> >     > callback:
> >     > https://lkml.org/lkml/2018/7/12/314
> >     > https://lkml.org/lkml/2018/8/2/322
> >     >
> >     > In particular the OOM notifier is worse than the shrinker because:
> >
> >     The issue is that DEFLATE_ON_OOM is under-specified.
> >
> >     >
> >     >  1. It is last-resort, which means the system has already gone
> through
> >     >     heroics to prevent OOM. Those heroic reclaim efforts are
> expensive
> >     >     and impact application performance.
> >
> >     That's *exactly* what "deflate on OOM" suggests.
> >
> >
> > It seems there are some use cases where "deflate on OOM" is desired and
> > others where "deflate on pressure" is desired.
> > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
> > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
> > notifier callback.
> >
> > This lets users configure the balloon for their use case.
>
> You want the old behavior back, so why should we introduce a new one? Or
> am I missing something? (you did want us to revert to old handling, no?)
>
Reverting actually doesn't help me, because this has been the behavior since
Linux 4.19, which is already widely in use. So my device implementation
needs to handle the shrinker behavior anyway. I started this conversation
to ask what the intended device implementation was.

I think there are reasonable device implementations that would prefer the
shrinker behavior (it turns out that mine doesn't).
For example, an implementation that slowly inflates the balloon for the
purpose of memory overcommit. It might leave the balloon inflated and
expect any memory pressure (including page cache usage) to deflate the
balloon as a way to dynamically right-size the balloon.

Two reasons I didn't go with the above implementation:
1. I need to support guests before Linux 4.19 which don't have the shrinker
behavior.
2. Memory in the balloon does not appear as "available" in /proc/meminfo
even though it is freeable. This is confusing to users, but isn't a deal
breaker.

If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
support then that would resolve reason #1 (ideally we would backport the
bit to 4.19).

In any case, the shrinker behavior when pressuring page cache is more of an
inefficiency than a bug. It's not clear to me that it necessitates
reverting. If there were/are reasons to be on the shrinker interface then I
think those carry similar weight as the problem itself.


>
> I consider virtio-balloon to this very day a big hack. And I don't see
> it getting better with new config knobs. Having that said, the
> technologies that are candidates to replace it (free page reporting,
> taming the guest page cache, etc.) are still not ready - so we'll have
> to stick with it for now :( .
>
> >
> > I'm actually not sure how you would safely do memory overcommit without
> > DEFLATE_ON_OOM. So I think it unlocks a huge use case.
>
> Using better suited technologies that are not ready yet (well, some form
> of free page reporting is available under IBM z already but in a
> proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only makes
> it less likely to crash your guest, but not that you are safe to squeeze
> the last bit out of your guest VM.
>
Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any
problems in testing but I'd really like to know about the dangers.
Is there a difference in safety between the OOM notifier callback and the
shrinker API?


>
> --
> Thanks,
>
> David / dhildenb
>
>


* Re: Balloon pressuring page cache
  2020-02-04 23:58                           ` Tyler Sanderson
@ 2020-02-05  0:15                             ` Tyler Sanderson
  2020-02-05  6:57                             ` Michael S. Tsirkin
  1 sibling, 0 replies; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-05  0:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit


On Tue, Feb 4, 2020 at 3:58 PM Tyler Sanderson <tysand@google.com> wrote:

>
>
> On Tue, Feb 4, 2020 at 11:17 AM David Hildenbrand <david@redhat.com>
> wrote:
>
>> On 04.02.20 19:52, Tyler Sanderson wrote:
>> >
>> >
>> > On Tue, Feb 4, 2020 at 12:29 AM David Hildenbrand <david@redhat.com
>> > <mailto:david@redhat.com>> wrote:
>> >
>> >     On 03.02.20 21:32, Tyler Sanderson wrote:
>> >     > There were apparently good reasons for moving away from OOM
>> notifier
>> >     > callback:
>> >     > https://lkml.org/lkml/2018/7/12/314
>> >     > https://lkml.org/lkml/2018/8/2/322
>> >     >
>> >     > In particular the OOM notifier is worse than the shrinker because:
>> >
>> >     The issue is that DEFLATE_ON_OOM is under-specified.
>> >
>> >     >
>> >     >  1. It is last-resort, which means the system has already gone
>> through
>> >     >     heroics to prevent OOM. Those heroic reclaim efforts are
>> expensive
>> >     >     and impact application performance.
>> >
>> >     That's *exactly* what "deflate on OOM" suggests.
>> >
>> >
>> > It seems there are some use cases where "deflate on OOM" is desired and
>> > others where "deflate on pressure" is desired.
>> > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
>> > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
>> > notifier callback.
>> >
>> > This lets users configure the balloon for their use case.
>>
>> You want the old behavior back, so why should we introduce a new one? Or
>> am I missing something? (you did want us to revert to old handling, no?)
>>
> Reverting actually doesn't help me because this has been the behavior
> since Linux 4.19 which is already widely in use. So my device
> implementation needs to handle the shrinker behavior anyways. I started
> this conversation to ask what the intended device implementation was.
>
I should clarify: reverting _would_ improve guest performance under my
implementation. So I guess I'm in favor. But I think we should consider
reasonable alternative implementations. I think this suggests adding a new
feature bit to allow device implementations to choose.


> I think there are reasonable device implementations that would prefer the
> shrinker behavior (it turns out that mine doesn't).
> For example, an implementation that slowly inflates the balloon for the
> purpose of memory overcommit. It might leave the balloon inflated and
> expect any memory pressure (including page cache usage) to deflate the
> balloon as a way to dynamically right-size the balloon.
>
> Two reasons I didn't go with the above implementation:
> 1. I need to support guests before Linux 4.19 which don't have the
> shrinker behavior.
> 2. Memory in the balloon does not appear as "available" in /proc/meminfo
> even though it is freeable. This is confusing to users, but isn't a deal
> breaker.
>
> If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
> support then that would resolve reason #1 (ideally we would backport the
> bit to 4.19).
>
> In any case, the shrinker behavior when pressuring page cache is more of
> an inefficiency than a bug. It's not clear to me that it necessitates
> reverting. If there were/are reasons to be on the shrinker interface then I
> think those carry similar weight as the problem itself.
>
>
>>
>> I consider virtio-balloon to this very day a big hack. And I don't see
>> it getting better with new config knobs. Having that said, the
>> technologies that are candidates to replace it (free page reporting,
>> taming the guest page cache, etc.) are still not ready - so we'll have
>> to stick with it for now :( .
>>
>> >
>> > I'm actually not sure how you would safely do memory overcommit without
>> > DEFLATE_ON_OOM. So I think it unlocks a huge use case.
>>
>> Using better suited technologies that are not ready yet (well, some form
>> of free page reporting is available under IBM z already but in a
>> proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only makes
>> it less likely to crash your guest, but not that you are safe to squeeze
>> the last bit out of your guest VM.
>>
> Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any
> problems in testing but I'd really like to know about the dangers.
> Is there a difference in safety between the OOM notifier callback and the
> shrinker API?
>
>
>>
>> --
>> Thanks,
>>
>> David / dhildenb
>>
>>


* RE: Balloon pressuring page cache
  2020-02-04 14:30                       ` David Hildenbrand
  2020-02-04 16:50                         ` Michael S. Tsirkin
@ 2020-02-05  6:49                         ` Wang, Wei W
  2020-02-05  8:19                           ` David Hildenbrand
  1 sibling, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  6:49 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Tuesday, February 4, 2020 10:30 PM, David Hildenbrand wrote:
> So, I am trying to understand how the code is intended to work, but I am
> afraid I am missing something (or to rephrase: I think I found a BUG :) and
> there is a lack of proper documentation about this feature).
> 
> a) We allocate pages and add them to the list as long as we are told to do
> so.
>    We send these pages to the host one by one.
> b) We free all pages once we get a STOP signal. Until then, we keep pages
> allocated.

Yes. The reporting ends either when the host sends the guest a STOP cmd, or when
the guest fails to allocate a page (meaning that all the possible free pages are taken already).

> c) When called via the shrinker, we want to free pages from the list, even
> though the hypervisor did not notify us to do so.
> 
> 
> Issue 1: When we unload the balloon driver in the guest in an unlucky event,
> we won't free the pages. We are missing something like (if I am not wrong):
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b1d2068fa2bd..e2b0925e1e83 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -929,6 +929,10 @@ static void remove_common(struct virtio_balloon *vb)
>                 leak_balloon(vb, vb->num_pages);
>         update_balloon_size(vb);
> 
> +       /* There might be free pages that are being reported: release them. */
> +       if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> +               return_free_pages_to_mm(vb, ULONG_MAX);
> +
>         /* Now we reset the device so we can clean up the queues. */
>         vb->vdev->config->reset(vb->vdev);


Right, thanks!

> 
> 
> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be that
> we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. 

I don't think it is an issue here.
MUST_TELL_HOST is for the ballooning pages, where pages are offered to the host to _USE_.
For a free page hint, as the name already suggests, it's just a _HINT_, so in whatever use case,
the host should not take the page to use. So the guest doesn't need to tell the host and wait.

Back to the implementation of virtio_balloon_shrinker_scan, where I don't see an issue so far:
shrink_free_pages just returns pages to the mm without waiting for an ack from the host;
shrink_balloon_pages goes through leak_balloon, which tells the host before releasing the balloon pages.

Best,
Wei



* RE: Balloon pressuring page cache
  2020-02-04 16:50                         ` Michael S. Tsirkin
  2020-02-04 16:56                           ` David Hildenbrand
@ 2020-02-05  6:52                           ` Wang, Wei W
  2020-02-05  7:05                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  6:52 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Hildenbrand
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 12:50 AM, Michael S. Tsirkin wrote:
> > Michael, any clue on which event we have to wait with
> > VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
> > VIRTIO_BALLOON_F_MUST_TELL_HOST applies to
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT and we'd better document that. It
> introduces complexity with no clear benefit.
> 
> I meant that we must wait for host to see the hint.

Why?

Best,
Wei



* Re: Balloon pressuring page cache
  2020-02-04 23:58                           ` Tyler Sanderson
  2020-02-05  0:15                             ` Tyler Sanderson
@ 2020-02-05  6:57                             ` Michael S. Tsirkin
  2020-02-05 19:01                               ` Tyler Sanderson
  1 sibling, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-05  6:57 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: David Hildenbrand, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit

On Tue, Feb 04, 2020 at 03:58:51PM -0800, Tyler Sanderson wrote:
>     >     >
>     >     >  1. It is last-resort, which means the system has already gone through
>     >     >     heroics to prevent OOM. Those heroic reclaim efforts are expensive
>     >     >     and impact application performance.
>     >
>     >     That's *exactly* what "deflate on OOM" suggests.
>     >
>     >
>     > It seems there are some use cases where "deflate on OOM" is desired and
>     > others where "deflate on pressure" is desired.
>     > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
>     > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
>     > notifier callback.
>     >
>     > This lets users configure the balloon for their use case.
> 
>     You want the old behavior back, so why should we introduce a new one? Or
>     am I missing something? (you did want us to revert to old handling, no?)
> 
> Reverting actually doesn't help me because this has been the behavior since
> Linux 4.19 which is already widely in use. So my device implementation needs to
> handle the shrinker behavior anyways. I started this conversation to ask what
> the intended device implementation was.
> 
> I think there are reasonable device implementations that would prefer the
> shrinker behavior (it turns out that mine doesn't).
> For example, an implementation that slowly inflates the balloon for the purpose
> of memory overcommit. It might leave the balloon inflated and expect any memory
> pressure (including page cache usage) to deflate the balloon as a way to
> dynamically right-size the balloon.

So just to make sure we understand, what exactly does your
implementation do?


> Two reasons I didn't go with the above implementation:
> 1. I need to support guests before Linux 4.19 which don't have the shrinker
> behavior.
> 2. Memory in the balloon does not appear as "available" in /proc/meminfo even
> though it is freeable. This is confusing to users, but isn't a deal breaker.
> 
> If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
> support then that would resolve reason #1 (ideally we would backport the bit to
> 4.19).

We could declare the lack of page cache pressure with DEFLATE_ON_OOM a
regression and backport the revert, but not, I think, the new
DEFLATE_ON_PRESSURE.


> In any case, the shrinker behavior when pressuring page cache is more of an
> inefficiency than a bug. It's not clear to me that it necessitates reverting.
> If there were/are reasons to be on the shrinker interface then I think those
> carry similar weight as the problem itself.
>  
> 
> 
>     I consider virtio-balloon to this very day a big hack. And I don't see
>     it getting better with new config knobs. Having that said, the
>     technologies that are candidates to replace it (free page reporting,
>     taming the guest page cache, etc.) are still not ready - so we'll have
>     to stick with it for now :( .
> 
>     >
>     > I'm actually not sure how you would safely do memory overcommit without
>     > DEFLATE_ON_OOM. So I think it unlocks a huge use case.
> 
>     Using better suited technologies that are not ready yet (well, some form
>     of free page reporting is available under IBM z already but in a
>     proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only makes
>     it less likely to crash your guest, but not that you are safe to squeeze
>     the last bit out of your guest VM.
> 
> Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any problems
> in testing but I'd really like to know about the dangers.
> Is there a difference in safety between the OOM notifier callback and the
> shrinker API?

It's not about dangers as such. It's just that when Linux hits OOM,
all kinds of error paths get hit, latent bugs start triggering, and
latency goes up drastically.


> 
> 
>     --
>     Thanks,
> 
>     David / dhildenb
> 
> 




* Re: Balloon pressuring page cache
  2020-02-05  6:52                           ` Wang, Wei W
@ 2020-02-05  7:05                             ` Michael S. Tsirkin
  2020-02-05  8:50                               ` Wang, Wei W
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-05  7:05 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: David Hildenbrand, Nadav Amit, Alexander Duyck, Tyler Sanderson,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Wed, Feb 05, 2020 at 06:52:34AM +0000, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 12:50 AM, Michael S. Tsirkin wrote:
> > > Michael, any clue on which event we have to wait with
> > > VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
> > > VIRTIO_BALLOON_F_MUST_TELL_HOST applies to
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT and we'd better document that. It
> > introduces complexity with no clear benefit.
> > 
> > I meant that we must wait for host to see the hint.
> 
> Why?
> 
> Best,
> Wei

Well, if we did, the hint would be reliable, allowing the host to
immediately drop any pages it gets in the hint. Originally I wanted to
speed up hinting by never waiting for the host, but that does not seem
to be what was implemented: the only place we don't wait is the
shrinker, and it seems a waste that we introduced complexity in the
host without getting any real benefit out of it.

VIRTIO_BALLOON_F_MUST_TELL_HOST doesn't really apply to hinting
right now, so we could have used it to mean "hints must wait for host
to use buffers". I'm afraid it's already a wasted opportunity at
this point, reusing it isn't worth the compatibility headaches.

-- 
MST




* Re: Balloon pressuring page cache
  2020-02-03 22:50                 ` Nadav Amit
  2020-02-04  8:35                   ` David Hildenbrand
@ 2020-02-05  7:35                   ` Nadav Amit
  2020-02-05  8:19                     ` David Hildenbrand
  1 sibling, 1 reply; 59+ messages in thread
From: Nadav Amit @ 2020-02-05  7:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Michael S. Tsirkin, Tyler Sanderson, Wang,
	Wei W, virtualization, David Rientjes, linux-mm, Michal Hocko

> On Feb 3, 2020, at 2:50 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand <david@redhat.com> wrote:
>> 
>> On 03.02.20 17:18, Alexander Duyck wrote:
>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>>>>> 
>>>>>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>>>>>> <mailto:david@redhat.com>> wrote:
>>>>>>> 
>>>>>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>>>>>> A primary advantage of virtio balloon over other memory reclaim
>>>>>>>> mechanisms is that it can pressure the guest's page cache into
>>>>>>>   shrinking.
>>>>>>>> However, since the balloon driver changed to using the shrinker
>>>>>   API
>>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>>>>>> use case has become a bit more tricky. I'm wondering what the
>>>>>> intended
>>>>>>>> device implementation is.
>>>>>>>> 
>>>>>>>> When inflating the balloon against page cache (i.e. no free
>>>>>   memory
>>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
>>>>>   the
>>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
>>>>>   driver
>>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
>>>>>> by
>>>>>>>> shrinking the balloon, and then the driver adds the memory back
>>>>>   to
>>>>>> the
>>>>>>>> balloon. Basically a busy no-op.
>>>>> 
>>>>>   Per my understanding, the balloon allocation won’t invoke shrinker as
>>>>>   __GFP_DIRECT_RECLAIM isn't set, no?
>>>>> 
>>>>> I could be wrong about the mechanism, but the device sees lots of activity on
>>>>> the deflate queue. The balloon is being shrunk. And this only starts once all
>>>>> free memory is depleted and we're inflating into page cache.
>>>> 
>>>> So given this looks like a regression, maybe we should revert the
>>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
>>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>>> at all.
>>>> 
>>>> So it looks like all this rework introduced more issues than it
>>>> addressed ...
>>>> 
>>>> I also CC Alex Duyck for an opinion on this.
>>>> Alex, what do you use to put pressure on page cache?
>>> 
>>> I would say reverting probably makes sense. I'm not sure there is much
>>> value to having a shrinker running deflation when you are actively trying
>>> to increase the balloon. It would make more sense to wait until you are
>>> actually about to start hitting oom.
>> 
>> I think the shrinker makes sense for free page hinting feature
>> (everything on free_page_list).
>> 
>> So instead of only reverting, I think we should split it up and always
>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>> 
>> (Of course, adapting what is being done in the shrinker and in the OOM
>> notifier)
> 
> David,
> 
> Please keep me posted. I decided to adapt the same solution as the virtio
> balloon for the VMware balloon. If the verdict is that this is damaging and
> the OOM notifier should be used instead, I will submit patches to move to
> OOM notifier as well.

Adding some information for the record (if someone googles this thread):

In the VMware balloon driver, the shrinker is disabled by default since we
encountered a performance degradation in testing. I tried to avoid rapid
inflation/shrinker-deflation cycles by adding a timeout, but apparently it
did not help in avoiding the performance regression.

So there is no such issue in the VMware balloon driver, unless someone
intentionally enables the shrinker through a module parameter.



* Re: Balloon pressuring page cache
  2020-02-05  6:49                         ` Wang, Wei W
@ 2020-02-05  8:19                           ` David Hildenbrand
  2020-02-05  8:54                             ` Wang, Wei W
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  8:19 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

>> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be that
>> we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. 
> 
> I don't think it is an issue here.
> MUST_TELL_HOST is for the ballooning pages, where pages are offered to host to _USE_.
> For free page hint, as the name already suggests, it's just a _HINT_ , so in whatever use case,
> the host should not take the page to use. So the guest doesn't need to tell host and wait.

Yes, I agree with you. Yet, I am thinking about one
(unlikely? impossible?) scenario. Can you refresh my brain why that
cannot happen (IOW, why we don't have to wait for the host to process
the request)?

1. Guest allocates a page and sends it to the host.
2. Shrinker gets active and releases that page again.
3. Some user in the guest allocates and modifies that page. After that,
it is done using that page for the next hour.
4. The host processes the request and clears the bit in the dirty bitmap.
5. The guest is being migrated by the host. The modified page is not
being migrated.

-- 
Thanks,

David / dhildenb




* Re: Balloon pressuring page cache
  2020-02-05  7:35                   ` Nadav Amit
@ 2020-02-05  8:19                     ` David Hildenbrand
  2020-02-05 10:27                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  8:19 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Alexander Duyck, Michael S. Tsirkin, Tyler Sanderson, Wang,
	Wei W, virtualization, David Rientjes, linux-mm, Michal Hocko

On 05.02.20 08:35, Nadav Amit wrote:
>> On Feb 3, 2020, at 2:50 PM, Nadav Amit <namit@vmware.com> wrote:
>>
>>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 03.02.20 17:18, Alexander Duyck wrote:
>>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>>>>>>
>>>>>>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>>>>>>> <mailto:david@redhat.com>> wrote:
>>>>>>>>
>>>>>>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>>>>>>> A primary advantage of virtio balloon over other memory reclaim
>>>>>>>>> mechanisms is that it can pressure the guest's page cache into
>>>>>>>>   shrinking.
>>>>>>>>> However, since the balloon driver changed to using the shrinker
>>>>>>   API
>>>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>>>>>>> use case has become a bit more tricky. I'm wondering what the
>>>>>>> intended
>>>>>>>>> device implementation is.
>>>>>>>>>
>>>>>>>>> When inflating the balloon against page cache (i.e. no free
>>>>>>   memory
>>>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
>>>>>>   the
>>>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
>>>>>>   driver
>>>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
>>>>>>> by
>>>>>>>>> shrinking the balloon, and then the driver adds the memory back
>>>>>>   to
>>>>>>> the
>>>>>>>>> balloon. Basically a busy no-op.
>>>>>>
>>>>>>   Per my understanding, the balloon allocation won’t invoke shrinker as
>>>>>>   __GFP_DIRECT_RECLAIM isn't set, no?
>>>>>>
>>>>>> I could be wrong about the mechanism, but the device sees lots of activity on
>>>>>> the deflate queue. The balloon is being shrunk. And this only starts once all
>>>>>> free memory is depleted and we're inflating into page cache.
>>>>>
>>>>> So given this looks like a regression, maybe we should revert the
>>>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
>>>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>>>> at all.
>>>>>
>>>>> So it looks like all this rework introduced more issues than it
>>>>> addressed ...
>>>>>
>>>>> I also CC Alex Duyck for an opinion on this.
>>>>> Alex, what do you use to put pressure on page cache?
>>>>
>>>> I would say reverting probably makes sense. I'm not sure there is much
>>>> value to having a shrinker running deflation when you are actively trying
>>>> to increase the balloon. It would make more sense to wait until you are
>>>> actually about to start hitting oom.
>>>
>>> I think the shrinker makes sense for free page hinting feature
>>> (everything on free_page_list).
>>>
>>> So instead of only reverting, I think we should split it up and always
>>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>>>
>>> (Of course, adapting what is being done in the shrinker and in the OOM
>>> notifier)
>>
>> David,
>>
>> Please keep me posted. I decided to adapt the same solution as the virtio
>> balloon for the VMware balloon. If the verdict is that this is damaging and
>> the OOM notifier should be used instead, I will submit patches to move to
>> OOM notifier as well.
> 
> Adding some information for the record (if someone googles this thread):
> 
> In the VMware balloon driver, the shrinker is disabled by default since we
> encountered a performance degradation in testing. I tried to avoid rapid
> inflation/shrinker-deflation cycles by adding a timeout, but apparently it
> did not help in avoiding the performance regression.

Thanks for that info. To me that sounds like the shrinker is the wrong
approach to "auto-deflation". It's not just "some slab cache".
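The inflate/reclaim feedback loop described earlier in the thread can be made concrete with a toy, self-contained model (all names here are invented for illustration; this is not the real virtio_balloon code). With no free memory, every allocation "reclaims" a page from the balloon itself, so inflation makes no net progress:

```c
#include <assert.h>

/* Toy model of the inflate/reclaim feedback loop (hypothetical names,
 * not the actual virtio_balloon driver). */
struct toy_balloon {
	int pages;		/* pages currently in the balloon */
};

static int toy_free_pages;	/* free pages in the "guest" */

/* vmscan invoking the balloon shrinker: deflate up to nr pages. */
static int toy_shrink_balloon(struct toy_balloon *b, int nr)
{
	int n = nr < b->pages ? nr : b->pages;

	b->pages -= n;
	toy_free_pages += n;
	return n;
}

/* Allocate one page; if nothing is free, "reclaim" by shrinking the
 * balloon, which is exactly the busy no-op described in the thread. */
static int toy_alloc_page(struct toy_balloon *b)
{
	if (toy_free_pages == 0 && toy_shrink_balloon(b, 1) == 0)
		return 0;	/* genuinely out of memory */
	toy_free_pages--;
	return 1;
}

/* Try to inflate towards target; give up after max_tries allocations. */
static int toy_inflate(struct toy_balloon *b, int target, int max_tries)
{
	while (b->pages < target && max_tries-- > 0)
		if (toy_alloc_page(b))
			b->pages++;
	return b->pages;
}
```

Starting from a balloon of 4 pages and zero free pages, toy_inflate(&b, 8, 1000) still returns 4: each allocation deflates one page that inflation immediately puts back.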


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-04 20:33                             ` Michael S. Tsirkin
@ 2020-02-05  8:31                               ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  8:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko,
	virtio-dev

On 04.02.20 21:33, Michael S. Tsirkin wrote:
> On Tue, Feb 04, 2020 at 05:56:22PM +0100, David Hildenbrand wrote:
>> [...]
>>
>>>>
>>>> Issue 2: When called via the shrinker, (but also to fix Issue 1), it could be
>>>> that we do have VIRTIO_BALLOON_F_MUST_TELL_HOST. I assume this means
>>>> (-ENOCLUE) that we have to wait until the hypervisor notifies us via the STOP? Or
>>>> for which event do we have to wait? Because there is no way to *tell host* here
>>>> that we want to reuse a page. The hypervisor will *tell us* when we can reuse pages.
>>>> For the shrinker it is simple: Don't use the shrinker with
>>>> VIRTIO_BALLOON_F_MUST_TELL_HOST :) . But to fix Issue 1, we *would* have to wait
>>>> until we get a STOP signal. That is not really possible because it might
>>>> take an infinite amount of time.
>>>>
>>>> Michael, any clue on which event we have to wait with
>>>> VIRTIO_BALLOON_F_MUST_TELL_HOST? IMHO, I don't think
>>>> VIRTIO_BALLOON_F_MUST_TELL_HOST applies to VIRTIO_BALLOON_F_FREE_PAGE_HINT and
>>>> we'd better document that. It introduces complexity with no clear benefit.
>>>
>>> I meant that we must wait for host to see the hint. Signalled via using
>>> the buffer.  But maybe that's too far in the meaning from
>>> VIRTIO_BALLOON_F_MUST_TELL_HOST and we need a separate new flag for
>>
>> Yes, that's what I think.
>>
>>> that. Then current code won't be broken (yay!) but we need to
>>> document another flag that's pretty similar.
>>
>> I mean, do we need a flag at all as long as there is no user?
>> Introducing a flag and documenting it if nobody uses it does not sound
>> like a work I will enjoy :)
> 
> It's not the user. It's the non-orthogonality that I find inelegant.
> 
> Let me try to formulate the issue, forgive me for thinking aloud
> (and I Cc'd virtio-dev since we are talking spec things here):
> 
> The annoying thing is that with Alex's VIRTIO_BALLOON_F_REPORTING
> host does depend on guest not touching memory before host uses it.
> So functionally VIRTIO_BALLOON_F_FREE_PAGE_HINT and
> VIRTIO_BALLOON_F_REPORTING really are supposed to do
> exactly the same thing, with the differences being
> - VIRTIO_BALLOON_F_FREE_PAGE_HINT comes upon host's request.
>   VIRTIO_BALLOON_F_REPORTING is initiated by guest.
> - VIRTIO_BALLOON_F_FREE_PAGE_HINT does not always wait for
>   host to use the hint before touching the page.
>   Well it almost always does, but there's an exception in the
>   shrinker which tries to stop reporting as quickly as possible
>   in the case of a slow host.
>   VIRTIO_BALLOON_F_REPORTING always does.
>   This means host can blow the page away when it sees the hint.
> 
> Now the point is that with VIRTIO_BALLOON_F_REPORTING
> I think you really must wait for host to use the hint.
> But with VIRTIO_BALLOON_F_FREE_PAGE_HINT it depends
> on how host uses it. Something to think about,
> I'm not sure what is the best thing to do here.


I think VIRTIO_BALLOON_F_FREE_PAGE_HINT is really the special case and
shall be left alone (not messed with VIRTIO_BALLOON_F_MUST_TELL_HOST).
Initiated by the host, complicated protocol and semantics, guest can
reuse pages any time it wants ("hint").

VIRTIO_BALLOON_F_REPORTING is *basically* ordinary inflation on
steroids (be able to report a size for each page and multiple pages in
one go) BUT, we can currently *never* have
VIRTIO_BALLOON_F_MUST_TELL_HOST semantics - there is no deflation.

We could rename VIRTIO_BALLOON_F_REPORTING to something like
VIRTIO_BALLOON_F_SIZE and make it obey
VIRTIO_BALLOON_F_MUST_TELL_HOST (meaning, there would have to be a
deflate queue as well!) - but that contradicts the real needs.
VIRTIO_BALLOON_F_REPORTING combined with
VIRTIO_BALLOON_F_MUST_TELL_HOST would not be usable by Linux for free
page reporting.

Well, as QEMU never sets VIRTIO_BALLOON_F_MUST_TELL_HOST we would be
fine. Alexander would have to add an inflate+deflate queue and make his
feature depend on !VIRTIO_BALLOON_F_MUST_TELL_HOST.

Is that the consistency you're looking for? Alexander, thoughts?
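As a sketch of that dependency (the feature bit numbers match the virtio-balloon spec, but the helpers are hypothetical, not actual QEMU or kernel code), free page reporting would only be enabled when VIRTIO_BALLOON_F_MUST_TELL_HOST is not offered:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers from the virtio-balloon specification. */
#define VIRTIO_BALLOON_F_MUST_TELL_HOST  0
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT  3
#define VIRTIO_BALLOON_F_REPORTING       5

static bool has_feature(uint64_t features, unsigned int bit)
{
	return features & (1ULL << bit);
}

/* Hypothetical negotiation rule: reporting lets the host discard pages
 * as soon as it sees them, which is incompatible with must-tell-host
 * semantics as long as reporting has no deflate path. */
static bool can_enable_reporting(uint64_t host_features)
{
	return has_feature(host_features, VIRTIO_BALLOON_F_REPORTING) &&
	       !has_feature(host_features, VIRTIO_BALLOON_F_MUST_TELL_HOST);
}
```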

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  7:05                             ` Michael S. Tsirkin
@ 2020-02-05  8:50                               ` Wang, Wei W
  0 siblings, 0 replies; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  8:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Nadav Amit, Alexander Duyck, Tyler Sanderson,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 3:05 PM, Michael S. Tsirkin wrote:
> 
> Well if we did the hint would be reliable, allowing host to immediately drop
> any pages it gets in the hint. 

"drop" - you mean the host unmapping the page from the guest? I think that's not allowed for hints.

> Originally I wanted to speed up hinting by never
> waiting for host, but that does not seem to be what was
> implemented: the only place we don't wait is the shrinker

Didn't get this one. For FREE_PAGE_HINT, the hints are always sent to host without
an ack from host about whether it has read the hint or not. (please see get_free_page_and_send)

> and it seems a
> waste that we introduced complexity to host without getting any real
> benefit out of it.
> 
> VIRTIO_BALLOON_F_MUST_TELL_HOST doesn't really apply to hinting right
> now, 

There is no need I think, as host isn't allowed to use or unmap the hint page.

Best,
Wei



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  8:19                           ` David Hildenbrand
@ 2020-02-05  8:54                             ` Wang, Wei W
  2020-02-05  8:56                               ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  8:54 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 4:19 PM, David Hildenbrand wrote:
> Yes, I agree with you. Yet, I am thinking about one
> (unlikely?impossible?) scenario. Can you refresh my brain why that cannot
> happen (IOW, why we don't have to wait for the host to process the
> request)?
> 
> 1. Guest allocates a page and sends it to the host.
> 2. Shrinker gets active and releases that page again.
> 3. Some user in the guest allocates and modifies that page. After that, it is
> done using that page for the next hour.
> 4. The host processes the request and clears the bit in the dirty bitmap.
> 5. The guest is being migrated by the host. The modified page is not being
> migrated.

Whenever the guest modifies a page during migration, it will be captured by the
dirty logging and the hypervisor will send the dirtied page in the following round.

Just more thoughts to clarify the difference. I think it's all about the page ownership.
For VIRTIO_BALLOON_F_FREE_PAGE_HINT, the guest always owns the page,
so host should not use or unmap the page.
For VIRTIO_BALLOON_F_REPORTING or the legacy balloon inflation,
guest intends to transfer the ownership of the underlying physical page to the host,
which is why host and guest need to synchronize on whether the "ownership" transfer has completed.
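That ownership distinction can be captured in a tiny illustrative model (none of these types or helpers exist in the real driver; they just encode the rules stated above):

```c
#include <assert.h>
#include <stdbool.h>

enum page_owner { OWNER_GUEST, OWNER_HOST };

struct toy_page {
	enum page_owner owner;
};

/* FREE_PAGE_HINT: the guest merely hints; ownership never moves, so
 * the host may not use or unmap the page. */
static void hint_page(struct toy_page *p)
{
	p->owner = OWNER_GUEST;
}

/* Legacy inflation (or REPORTING): the guest gives the page away. */
static void inflate_page(struct toy_page *p)
{
	p->owner = OWNER_HOST;
}

/* With MUST_TELL_HOST semantics, the guest may only reuse the page
 * after the host has acknowledged the deflate request. */
static void host_ack_deflate(struct toy_page *p)
{
	p->owner = OWNER_GUEST;
}

static bool guest_may_touch(const struct toy_page *p)
{
	return p->owner == OWNER_GUEST;
}
```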

Best,
Wei


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  8:54                             ` Wang, Wei W
@ 2020-02-05  8:56                               ` David Hildenbrand
  2020-02-05  9:00                                 ` Wang, Wei W
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  8:56 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On 05.02.20 09:54, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 4:19 PM, David Hildenbrand wrote:
>> Yes, I agree with you. Yet, I am thinking about one
>> (unlikely?impossible?) scenario. Can you refresh my brain why that cannot
>> happen (IOW, why we don't have to wait for the host to process the
>> request)?
>>
>> 1. Guest allocates a page and sends it to the host.
>> 2. Shrinker gets active and releases that page again.
>> 3. Some user in the guest allocates and modifies that page. After that, it is
>> done using that page for the next hour.
>> 4. The host processes the request and clears the bit in the dirty bitmap.
>> 5. The guest is being migrated by the host. The modified page is not being
>> migrated.
> 
> Whenever the guest modifies a page during migration, it will be captured by the
> dirty logging and the hypervisor will send the dirtied page in the following round.

Please explain why the steps I outlined don't apply esp. in the last
round. Your general statement does not explain why this race can't happen.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  8:56                               ` David Hildenbrand
@ 2020-02-05  9:00                                 ` Wang, Wei W
  2020-02-05  9:05                                   ` David Hildenbrand
  2020-02-05 18:43                                   ` Tyler Sanderson
  0 siblings, 2 replies; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  9:00 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 4:57 PM, David Hildenbrand wrote:
> >> Yes, I agree with you. Yet, I am thinking about one
> >> (unlikely?impossible?) scenario. Can you refresh my brain why that
> >> cannot happen (IOW, why we don't have to wait for the host to process
> >> the request)?
> >>
> >> 1. Guest allocates a page and sends it to the host.
> >> 2. Shrinker gets active and releases that page again.
> >> 3. Some user in the guest allocates and modifies that page. After
> >> that, it is done using that page for the next hour.
> >> 4. The host processes the request and clears the bit in the dirty bitmap.
> >> 5. The guest is being migrated by the host. The modified page is not
> >> being migrated.
> >
> > Whenever the guest modifies a page during migration, it will be
> > captured by the dirty logging and the hypervisor will send the dirtied
> page in the following round.
> 
> Please explain why the steps I outlined don't apply esp. in the last round.
> Your general statement does not explain why this race can't happen.
> 

The guest is stopped in the last round, thus no page will be modified at that time.

Best,
Wei


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:00                                 ` Wang, Wei W
@ 2020-02-05  9:05                                   ` David Hildenbrand
  2020-02-05  9:19                                     ` Wang, Wei W
  2020-02-05 18:43                                   ` Tyler Sanderson
  1 sibling, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  9:05 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On 05.02.20 10:00, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 4:57 PM, David Hildenbrand wrote:
>>>> Yes, I agree with you. Yet, I am thinking about one
>>>> (unlikely?impossible?) scenario. Can you refresh my brain why that
>>>> cannot happen (IOW, why we don't have to wait for the host to process
>>>> the request)?
>>>>
>>>> 1. Guest allocates a page and sends it to the host.
>>>> 2. Shrinker gets active and releases that page again.
>>>> 3. Some user in the guest allocates and modifies that page. After
>>>> that, it is done using that page for the next hour.
>>>> 4. The host processes the request and clears the bit in the dirty bitmap.
>>>> 5. The guest is being migrated by the host. The modified page is not
>>>> being migrated.
>>>
>>> Whenever the guest modifies a page during migration, it will be
>>> captured by the dirty logging and the hypervisor will send the dirtied
>> page in the following round.
>>
>> Please explain why the steps I outlined don't apply esp. in the last round.
>> Your general statement does not explain why this race can't happen.
>>
> 
> The guest is stopped in the last round, thus no page will be modified at that time.

No, that does not answer my question. Because then, obviously the guest
can't do any hinting in the last round. I think I am missing something
important :)

1. Guest allocates a page and sends it to the host.
2. Shrinker gets active and releases that page again.
3. Some user in the guest allocates and modifies that page. The dirty
bit is set in the hypervisor.
4. The host processes the request and clears the bit in the dirty bitmap.
5. The guest is stopped and the last set of dirty pages is migrated. The
modified page is not being migrated (because not marked dirty).

Something between 3. and 4. has to guarantee that the page will still be
migrated. What guarantees that?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  9:05                                   ` David Hildenbrand
@ 2020-02-05  9:19                                     ` Wang, Wei W
  2020-02-05  9:22                                       ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  9:19 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 5:06 PM, David Hildenbrand wrote:

> 
> No, that does not answer my question. Because then, obviously the guest
> can't do any hinting in the last round. I think I am missing something
> important :)

No problem, probably need more details here:

QEMU has a dirty bitmap which indicates all the dirty pages from the previous round.
KVM has a dirty bitmap which records what pages are modified in this round.
When a new round starts, QEMU syncs the bitmap from KVM (this round always
sends the pages dirtied from the previous round).

> 1. Guest allocates a page and sends it to the host.
> 2. Shrinker gets active and releases that page again.
> 3. Some user in the guest allocates and modifies that page. The dirty bit is
> set in the hypervisor.

The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.

> 4. The host processes the request and clears the bit in the dirty bitmap.

This clears the bit from the QEMU bitmap, and this page will not be sent in this round.

> 5. The guest is stopped and the last set of dirty pages is migrated. The
> modified page is not being migrated (because not marked dirty).

When QEMU starts the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
Then the modified page gets sent.
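The two-bitmap interaction can be replayed in a small schematic model (hypothetical helpers, not the real KVM/QEMU API) to show why the modified page is still sent:

```c
#include <assert.h>
#include <stdint.h>

/* One uint64_t per bitmap is enough for a 64-page toy guest. */
struct dirty_track {
	uint64_t kvm;	/* set as the guest writes; drained each round */
	uint64_t qemu;	/* what QEMU still has to send this round      */
};

/* Guest write: recorded by KVM for the *next* sync. */
static void guest_write(struct dirty_track *d, int page)
{
	d->kvm |= 1ULL << page;
}

/* Free page hint processed by QEMU: clears the *current* round only. */
static void process_hint(struct dirty_track *d, int page)
{
	d->qemu &= ~(1ULL << page);
}

/* Start of a new round: QEMU pulls and resets KVM's bitmap. */
static void sync_round(struct dirty_track *d)
{
	d->qemu |= d->kvm;
	d->kvm = 0;
}

static int must_migrate(const struct dirty_track *d, int page)
{
	return (d->qemu >> page) & 1;
}
```

Replaying the five steps: the guest's modification (step 3) lands in the KVM bitmap, the hint (step 4) only clears the QEMU bitmap, and the last-round sync (step 5) restores the bit, so the modified page is migrated after all.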

Best,
Wei



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:19                                     ` Wang, Wei W
@ 2020-02-05  9:22                                       ` David Hildenbrand
  2020-02-05  9:35                                         ` Wang, Wei W
  2020-02-05  9:35                                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  9:22 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

>> 1. Guest allocates a page and sends it to the host.
>> 2. Shrinker gets active and releases that page again.
>> 3. Some user in the guest allocates and modifies that page. The dirty bit is
>> set in the hypervisor.
> 
> The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.
> 
>> 4. The host processes the request and clears the bit in the dirty bitmap.
> 
> This clears the bit from the QEMU bitmap, and this page will not be sent in this round.
> 
>> 5. The guest is stopped and the last set of dirty pages is migrated. The
>> modified page is not being migrated (because not marked dirty).
> 
> When QEMU start the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
> Then the modified page gets sent.

So, if you run a TCG guest and use it with free page reporting, the race
is possible? So the correctness depends on two dirty bitmaps in the
hypervisor and how they interact. wow this is fragile.

Thanks for the info :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  9:22                                       ` David Hildenbrand
@ 2020-02-05  9:35                                         ` Wang, Wei W
  2020-02-05  9:37                                           ` David Hildenbrand
  2020-02-05  9:35                                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  9:35 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 5:23 PM, David Hildenbrand wrote:
> So, if you run a TCG guest and use it with free page reporting, the race is
> possible? So the correctness depends on two dirty bitmaps in the hypervisor
> and how they interact. wow this is fragile.
>

Not sure how TCG tracks the dirty bits. But in whatever implementation, the hypervisor should have
already dealt with the race between the current round and the previous round dirty recording.
(the race isn't brought by this feature essentially)

Best,
Wei


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:22                                       ` David Hildenbrand
  2020-02-05  9:35                                         ` Wang, Wei W
@ 2020-02-05  9:35                                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-05  9:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wang, Wei W, Nadav Amit, Alexander Duyck, Tyler Sanderson,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Wed, Feb 05, 2020 at 10:22:34AM +0100, David Hildenbrand wrote:
> >> 1. Guest allocates a page and sends it to the host.
> >> 2. Shrinker gets active and releases that page again.
> >> 3. Some user in the guest allocates and modifies that page. The dirty bit is
> >> set in the hypervisor.
> > 
> > The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.
> > 
> >> 4. The host processes the request and clears the bit in the dirty bitmap.
> > 
> > This clears the bit from the QEMU bitmap, and this page will not be sent in this round.
> > 
> >> 5. The guest is stopped and the last set of dirty pages is migrated. The
> >> modified page is not being migrated (because not marked dirty).
> > 
> > When QEMU start the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
> > Then the modified page gets sent.
> 
> So, if you run a TCG guest and use it with free page reporting, the race
> is possible?

I'd have to look at the implementation but the basic idea is not
kvm specific. The idea is that hypervisor can detect that 3 happened
after 1, by means of creating a copy of the dirty bitmap
when request is sent to the guest.


> So the correctness depends on two dirty bitmaps in the
> hypervisor and how they interact. wow this is fragile.
> 
> Thanks for the info :)

It's pretty fragile, and the annoying part is we do not
actually benefit from this at all since it all only triggers
in the shrinker corner case.

The original idea was that we can send any hint to hypervisor and reuse
the page immediately without waiting for hint to be seen.  That seemed
worth having, as a means to minimize impact of hinting.
Then we dropped that and switched to OOM, and there not having
to wait also seemed like a worthwhile thing.
In the end we switched to shrinker where we can wait
if we like, and many guests never even hit the shrinker so we
have sacrificed robustness for nothing.

If we go back to OOM then at least it's justified ..

> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:35                                         ` Wang, Wei W
@ 2020-02-05  9:37                                           ` David Hildenbrand
  2020-02-05  9:49                                             ` Wang, Wei W
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  9:37 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On 05.02.20 10:35, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 5:23 PM, David Hildenbrand wrote:
>> So, if you run a TCG guest and use it with free page reporting, the race is
>> possible? So the correctness depends on two dirty bitmaps in the hypervisor
>> and how they interact. wow this is fragile.
>>
> 
> Not sure how TCG tracks the dirty bits. But in whatever implementation, the hypervisor should have

There is only a single bitmap for that purpose. (well, the one where KVM
syncs to)

> already dealt with the race between the current round and the previous round dirty recording.
> (the race isn't brought by this feature essentially)

It is guaranteed to work reliably without this feature as you only clear
what *has been migrated*, not what your guest thinks should not be
migrated at one point and decides differently at another point. The race
is brought forward by this feature.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05  9:37                                           ` David Hildenbrand
@ 2020-02-05  9:49                                             ` Wang, Wei W
  2020-02-05  9:58                                               ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Wang, Wei W @ 2020-02-05  9:49 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote:
> >
> > Not sure how TCG tracks the dirty bits. But in whatever
> > implementation, the hypervisor should have
> 
> There is only a single bitmap for that purpose. (well, the one where KVM
> syncs to)
> 
> > already dealt with the race between the current round and the previous
> round dirty recording.
> > (the race isn't brought by this feature essentially)
> 
> It is guaranteed to work reliably without this feature as you only clear what
> *has been migrated*, 

Not "clear what has been migrated" (that skips nothing..)
Anyway, it's a hint used for optimization.

Best,
Wei


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:49                                             ` Wang, Wei W
@ 2020-02-05  9:58                                               ` David Hildenbrand
  2020-02-05 10:25                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05  9:58 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, virtualization,
	David Rientjes, linux-mm, Michal Hocko

On 05.02.20 10:49, Wang, Wei W wrote:
> On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote:
>>>
>>> Not sure how TCG tracks the dirty bits. But in whatever
>>> implementation, the hypervisor should have
>>
>> There is only a single bitmap for that purpose. (well, the one where KVM
>> syncs to)
>>
>>> already dealt with the race between the current round and the previous
>> round dirty recording.
>>> (the race isn't brought by this feature essentially)
>>
>> It is guaranteed to work reliably without this feature as you only clear what
>> *has been migrated*, 
> 
> Not "clear what has been migrated" (that skips nothing..)
> Anyway, it's a hint used for optimization.

Yes, an optimization that might easily lead to data corruption when the
two bitmaps are either not in place or don't play along in that specific
way (and I suspect this is the case under TCG).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:58                                               ` David Hildenbrand
@ 2020-02-05 10:25                                                 ` Michael S. Tsirkin
  2020-02-05 10:42                                                   ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-05 10:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wang, Wei W, Nadav Amit, Alexander Duyck, Tyler Sanderson,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Wed, Feb 05, 2020 at 10:58:14AM +0100, David Hildenbrand wrote:
> On 05.02.20 10:49, Wang, Wei W wrote:
> > On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote:
> >>>
> >>> Not sure how TCG tracks the dirty bits. But in whatever
> >>> implementation, the hypervisor should have
> >>
> >> There is only a single bitmap for that purpose. (well, the one where KVM
> >> syncs to)
> >>
> >>> already dealt with the race between the current round and the previous
> >> round dirty recording.
> >>> (the race isn't brought by this feature essentially)
> >>
> >> It is guaranteed to work reliably without this feature as you only clear what
> >> *has been migrated*, 
> > 
> > Not "clear what has been migrated" (that skips nothing..)
> > Anyway, it's a hint used for optimization.
> 
> Yes, an optimization that might easily lead to data corruption when the
> two bitmaps are either not in place or don't play along in that specific
> way (and I suspect this is the case under TCG).

So I checked and TCG has two copies too.
Each block has bmap used for migration and also dirty_memory
where pages are marked dirty. See cpu_physical_memory_sync_dirty_bitmap.

So from QEMU's POV, there is a callback that tells the balloon when it's
safe to request hints. As that affects the bitmap, it must not happen in
parallel with dirty bitmap handling. Sounds like a reasonable
limitation.

The hint can be useful outside migration, but in its current form
needs to then be non-destructive.
E.g. I can imagine userspace calling MADV_SOFT_OFFLINE on the hinted
memory.

Again a flag that tells the guest it should wait until the hint is used
could be a reasonable expansion. If we stick to the shrinker
it's actually implementable easily. With an OOM notifier - I'm not so
sure ...

And a big part of the problem is that after all this time the page
hinting interfaces are still undocumented. Quite sad really :(

> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  8:19                     ` David Hildenbrand
@ 2020-02-05 10:27                       ` Michael S. Tsirkin
  2020-02-05 10:43                         ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2020-02-05 10:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On Wed, Feb 05, 2020 at 09:19:58AM +0100, David Hildenbrand wrote:
> On 05.02.20 08:35, Nadav Amit wrote:
> >> On Feb 3, 2020, at 2:50 PM, Nadav Amit <namit@vmware.com> wrote:
> >>
> >>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 03.02.20 17:18, Alexander Duyck wrote:
> >>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
> >>>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> >>>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
> >>>>>>
> >>>>>>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> >>>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
> >>>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
> >>>>>>>> <mailto:david@redhat.com>> wrote:
> >>>>>>>>
> >>>>>>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
> >>>>>>>>> A primary advantage of virtio balloon over other memory reclaim
> >>>>>>>>> mechanisms is that it can pressure the guest's page cache into
> >>>>>>>>   shrinking.
> >>>>>>>>> However, since the balloon driver changed to using the shrinker
> >>>>>>   API
> >>>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
> >>>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
> >>>>>>>>> use case has become a bit more tricky. I'm wondering what the
> >>>>>>> intended
> >>>>>>>>> device implementation is.
> >>>>>>>>>
> >>>>>>>>> When inflating the balloon against page cache (i.e. no free
> >>>>>>   memory
> >>>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
> >>>>>>   the
> >>>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
> >>>>>>   driver
> >>>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
> >>>>>>> by
> >>>>>>>>> shrinking the balloon, and then the driver adds the memory back
> >>>>>>   to
> >>>>>>> the
> >>>>>>>>> balloon. Basically a busy no-op.
> >>>>>>
> >>>>>>   Per my understanding, the balloon allocation won’t invoke shrinker as
> >>>>>>   __GFP_DIRECT_RECLAIM isn't set, no?
> >>>>>>
> >>>>>> I could be wrong about the mechanism, but the device sees lots of activity on
> >>>>>> the deflate queue. The balloon is being shrunk. And this only starts once all
> >>>>>> free memory is depleted and we're inflating into page cache.
> >>>>>
> >>>>> So given this looks like a regression, maybe we should revert the
> >>>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> >>>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
> >>>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
> >>>>> at all.
> >>>>>
> >>>>> So it looks like all this rework introduced more issues than it
> >>>>> addressed ...
> >>>>>
> >>>>> I also CC Alex Duyck for an opinion on this.
> >>>>> Alex, what do you use to put pressure on page cache?
> >>>>
> >>>> I would say reverting probably makes sense. I'm not sure there is much
> >>>> value to having a shrinker running deflation when you are actively trying
> >>>> to increase the balloon. It would make more sense to wait until you are
> >>>> actually about to start hitting oom.
> >>>
> >>> I think the shrinker makes sense for free page hinting feature
> >>> (everything on free_page_list).
> >>>
> >>> So instead of only reverting, I think we should split it up and always
> >>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> >>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> >>>
> >>> (Of course, adapting what is being done in the shrinker and in the OOM
> >>> notifier)
> >>
> >> David,
> >>
> >> Please keep me posted. I decided to adapt the same solution as the virtio
> >> balloon for the VMware balloon. If the verdict is that this is damaging and
> >> the OOM notifier should be used instead, I will submit patches to move to
> >> OOM notifier as well.
> > 
> > Adding some information for the record (if someone googles this thread):
> > 
> > In the VMware balloon driver, the shrinker is disabled by default since we
> > encountered a performance degradation in testing. I tried to avoid rapid
> > inflation/shrinker-deflation cycles by adding a timeout, but apparently it
> > did not help in avoiding the performance regression.
> 
> Thanks for that info. To me that sounds like the shrinker is the wrong
> approach to "auto-deflation". It's not just "some slab cache".

So as you pointed out yourself deflate on oom is really under-specified.

I would be very happy if you could take a stab at documenting what's
expected from guest and how it could be used.
Please copy the virtio TC when you do this as this is spec stuff.


> 
> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05 10:25                                                 ` Michael S. Tsirkin
@ 2020-02-05 10:42                                                   ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05 10:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wang, Wei W, Nadav Amit, Alexander Duyck, Tyler Sanderson,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On 05.02.20 11:25, Michael S. Tsirkin wrote:
> On Wed, Feb 05, 2020 at 10:58:14AM +0100, David Hildenbrand wrote:
>> On 05.02.20 10:49, Wang, Wei W wrote:
>>> On Wednesday, February 5, 2020 5:37 PM, David Hildenbrand wrote:
>>>>>
>>>>> Not sure how TCG tracks the dirty bits. But In whatever
>>>>> implementation, the hypervisor should have
>>>>
>>>> There is only a single bitmap for that purpose. (well, the one where KVM
>>>> syncs to)
>>>>
>>>>> already dealt with the race between he current round and the previous
>>>> round dirty recording.
>>>>> (the race isn't brought by this feature essentially)
>>>>
>>>> It is guaranteed to work reliably without this feature as you only clear what
>>>> *has been migrated*, 
>>>
>>> Not "clear what has been migrated" (that skips nothing..)
>>> Anyway, it's a hint used for optimization.
>>
>> Yes, an optimization that might easily lead to data corruption when the
>> two bitmaps are either not in place or don't play along in that specific
>> way (and I suspect this is the case under TCG).
> 
> So I checked and TCG has two copies too.
> Each block has bmap used for migration and also dirty_memory
> where pages are marked dirty. See cpu_physical_memory_sync_dirty_bitmap.
qemu_guest_free_page_hint() works on block->bmap.
cpu_physical_memory_set_dirty_range() works on ram_list.dirty_memory[i].

So you are right - sorry for the false alarm and thanks for verifying :)
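For the record, the two-bitmap scheme can be modeled in a few lines (illustrative Python, not QEMU code; names loosely follow the QEMU functions mentioned above). The hint clears only the per-block migration bitmap, guest writes land in the separate dirty log, and the sync ORs the log back in, so a page dirtied after being hinted is still migrated:

```python
# Toy model of the two-bitmap scheme: the hint clears only the migration
# bitmap, writes are logged separately, and the sync merges the log back in.

NPAGES = 8
dirty_memory = [False] * NPAGES   # global dirty log (where writes are marked)
bmap = [True] * NPAGES            # migration bitmap (all pending in round 1)

def set_dirty(page):              # ~ cpu_physical_memory_set_dirty_range()
    dirty_memory[page] = True

def free_page_hint(page):         # ~ qemu_guest_free_page_hint()
    bmap[page] = False            # skip the page in the current round

def sync_dirty_bitmap():          # ~ cpu_physical_memory_sync_dirty_bitmap()
    for i in range(NPAGES):
        if dirty_memory[i]:
            bmap[i] = True
            dirty_memory[i] = False

free_page_hint(3)                 # page 3 hinted as free ...
set_dirty(3)                      # ... then reused and written by the guest
sync_dirty_bitmap()
assert bmap[3]                    # page 3 is migrated again despite the hint
```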

[...]

> 
> Again a flag that tells guest it should wait until used
> could be a reasonable expansion. If we stick to the shrinker
> it's actually implementable easily. With an OOM notifier - I'm not so
> sure ...

See my other mail. I think we should keep handling just as is and not
overcomplicate things (especially in our implementation, as you noted).
Instead, maybe abstract the reporting feature.

> 
> And a big part of the problem is that after all this time the page
> hinting interfaces are still undocumented. Quite sad really :(

Yes, that was the source of my confusion ... the double-bitmap thingy is
non-obvious. And anybody who wants to implement that interface in a
hypervisor has to be aware that the race I explained has to be avoided
using e.g., two bitmaps and the sync.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05 10:27                       ` Michael S. Tsirkin
@ 2020-02-05 10:43                         ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-05 10:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nadav Amit, Alexander Duyck, Tyler Sanderson, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko

On 05.02.20 11:27, Michael S. Tsirkin wrote:
> On Wed, Feb 05, 2020 at 09:19:58AM +0100, David Hildenbrand wrote:
>> On 05.02.20 08:35, Nadav Amit wrote:
>>>> On Feb 3, 2020, at 2:50 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>
>>>>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 03.02.20 17:18, Alexander Duyck wrote:
>>>>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>>>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W <wei.w.wang@intel.com> wrote:
>>>>>>>>
>>>>>>>>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>>>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>>>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand <david@redhat.com
>>>>>>>>>> <mailto:david@redhat.com>> wrote:
>>>>>>>>>>
>>>>>>>>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>>>>>>>>> A primary advantage of virtio balloon over other memory reclaim
>>>>>>>>>>> mechanisms is that it can pressure the guest's page cache into
>>>>>>>>>>   shrinking.
>>>>>>>>>>> However, since the balloon driver changed to using the shrinker
>>>>>>>>   API
>>>>>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>>>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>>>>>>>>> use case has become a bit more tricky. I'm wondering what the
>>>>>>>>> intended
>>>>>>>>>>> device implementation is.
>>>>>>>>>>>
>>>>>>>>>>> When inflating the balloon against page cache (i.e. no free
>>>>>>>>   memory
>>>>>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
>>>>>>>>   the
>>>>>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
>>>>>>>>   driver
>>>>>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
>>>>>>>>> by
>>>>>>>>>>> shrinking the balloon, and then the driver adds the memory back
>>>>>>>>   to
>>>>>>>>> the
>>>>>>>>>>> balloon. Basically a busy no-op.
>>>>>>>>
>>>>>>>>   Per my understanding, the balloon allocation won’t invoke shrinker as
>>>>>>>>   __GFP_DIRECT_RECLAIM isn't set, no?
>>>>>>>>
>>>>>>>> I could be wrong about the mechanism, but the device sees lots of activity on
>>>>>>>> the deflate queue. The balloon is being shrunk. And this only starts once all
>>>>>>>> free memory is depleted and we're inflating into page cache.
>>>>>>>
>>>>>>> So given this looks like a regression, maybe we should revert the
>>>>>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
>>>>>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>>>>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>>>>>> at all.
>>>>>>>
>>>>>>> So it looks like all this rework introduced more issues than it
>>>>>>> addressed ...
>>>>>>>
>>>>>>> I also CC Alex Duyck for an opinion on this.
>>>>>>> Alex, what do you use to put pressure on page cache?
>>>>>>
>>>>>> I would say reverting probably makes sense. I'm not sure there is much
>>>>>> value to having a shrinker running deflation when you are actively trying
>>>>>> to increase the balloon. It would make more sense to wait until you are
>>>>>> actually about to start hitting oom.
>>>>>
>>>>> I think the shrinker makes sense for free page hinting feature
>>>>> (everything on free_page_list).
>>>>>
>>>>> So instead of only reverting, I think we should split it up and always
>>>>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>>>>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>>>>>
>>>>> (Of course, adapting what is being done in the shrinker and in the OOM
>>>>> notifier)
>>>>
>>>> David,
>>>>
>>>> Please keep me posted. I decided to adapt the same solution as the virtio
>>>> balloon for the VMware balloon. If the verdict is that this is damaging and
>>>> the OOM notifier should be used instead, I will submit patches to move to
>>>> OOM notifier as well.
>>>
>>> Adding some information for the record (if someone googles this thread):
>>>
>>> In the VMware balloon driver, the shrinker is disabled by default since we
>>> encountered a performance degradation in testing. I tried to avoid rapid
>>> inflation/shrinker-deflation cycles by adding a timeout, but apparently it
>>> did not help in avoiding the performance regression.
>>
>> Thanks for that info. To me that sounds like the shrinker is the wrong
>> approach to "auto-deflation". It's not just "some slab cache".
> 
> So as you pointed out yourself deflate on oom is really under-specified.
> 
> I would be very happy if you could take a stab at documenting what's
> expected from guest and how it could be used.
> Please copy the virtio TC when you do this as this is spec stuff.

So, I'll first get the code back into the desired state, so at least we
have an agreement of how it should be, and then follow up with a spec
update.

Might take some time, though (plenty of stuff to do).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  9:00                                 ` Wang, Wei W
  2020-02-05  9:05                                   ` David Hildenbrand
@ 2020-02-05 18:43                                   ` Tyler Sanderson
  2020-02-06  9:30                                     ` Wang, Wei W
  1 sibling, 1 reply; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-05 18:43 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: David Hildenbrand, Michael S. Tsirkin, Nadav Amit,
	Alexander Duyck, virtualization, David Rientjes, linux-mm,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

On Wed, Feb 5, 2020 at 1:00 AM Wang, Wei W <wei.w.wang@intel.com> wrote:

> On Wednesday, February 5, 2020 4:57 PM, David Hildenbrand wrote:
> > >> Yes, I agree with you. Yet, I am thinking about one
> > >> (unlikely?impossible?) scenario. Can you refresh my brain why that
> > >> cannot happen (IOW, why we don't have to wait for the host to process
> > >> the request)?
> > >>
> > >> 1. Guest allocates a page and sends it to the host.
> > >> 2. Shrinker gets active and releases that page again.
> > >> 3. Some user in the guest allocates and modifies that page. After
> > >> that, it is done using that page for the next hour.
> > >> 4. The host processes the request and clears the bit in the dirty
> bitmap.
> > >> 5. The guest is being migrated by the host. The modified page is not
> > >> being migrated.
> > >
> > > Whenever the guest modifies a page during migration, it will be
> > > captured by the dirty logging and the hypervisor will send the dirtied
> the
> > page in the following round.
> >
> > Please explain why the steps I outlined don't apply esp. in the last
> round.
> > Your general statement does not explain why this race can't happen.
> >
>
> The guest is stopped in the last round, thus no page will be modified at
> that time.
>

Isn't the hint only useful during the *first* round?
After the first round if a page becomes free then we need to update the
copy at the migration destination, so freeing a page that previously had
contents should mark it dirty.


>
> Best,
> Wei
>

[-- Attachment #2: Type: text/html, Size: 2227 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05  6:57                             ` Michael S. Tsirkin
@ 2020-02-05 19:01                               ` Tyler Sanderson
  2020-02-05 19:22                                 ` Alexander Duyck
  0 siblings, 1 reply; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-05 19:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Alexander Duyck, Wang, Wei W, virtualization,
	David Rientjes, linux-mm, Michal Hocko, namit

[-- Attachment #1: Type: text/plain, Size: 5224 bytes --]

On Tue, Feb 4, 2020 at 10:57 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Tue, Feb 04, 2020 at 03:58:51PM -0800, Tyler Sanderson wrote:
> >     >     >
> >     >     >  1. It is last-resort, which means the system has already
> gone     through
> >     >     >     heroics to prevent OOM. Those heroic reclaim efforts
> are     expensive
> >     >     >     and impact application performance.
> >     >
> >     >     That's *exactly* what "deflate on OOM" suggests.
> >     >
> >     >
> >     > It seems there are some use cases where "deflate on OOM" is
> desired and
> >     > others where "deflate on pressure" is desired.
> >     > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
> >     > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
> >     > notifier callback.
> >     >
> >     > This lets users configure the balloon for their use case.
> >
> >     You want the old behavior back, so why should we introduce a new
> one? Or
> >     am I missing something? (you did want us to revert to old handling,
> no?)
> >
> > Reverting actually doesn't help me because this has been the behavior
> since
> > Linux 4.19 which is already widely in use. So my device implementation
> needs to
> > handle the shrinker behavior anyways. I started this conversation to ask
> what
> > the intended device implementation was.
> >
> > I think there are reasonable device implementations that would prefer the
> > shrinker behavior (it turns out that mine doesn't).
> > For example, an implementation that slowly inflates the balloon for the
> purpose
> > of memory overcommit. It might leave the balloon inflated and expect any
> memory
> > pressure (including page cache usage) to deflate the balloon as a way to
> > dynamically right-size the balloon.
>
> So just to make sure we understand, what exactly does your
> implementation do?
>
My implementation is for the purposes of opportunistic memory overcommit.
We always want to give balloon memory back to the guest rather than causing
an OOM, so we use DEFLATE_ON_OOM.
We leave the balloon at size 0 while monitoring memory statistics reported
on the stats queue. When we see there is an opportunity for significant
savings then we inflate the balloon to a desired size (possibly including
pressuring the page cache), and then immediately deflate back to size 0.
The host pages backing the guest pages are unbacked during the inflation
process, so the memory footprint of the guest is smaller after this
inflate/deflate cycle.
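As a rough sketch of that policy (hypothetical names and sizing heuristic; a real device implementation would drive the virtio inflate/deflate queues and unback the host pages, e.g. via MADV_DONTNEED):

```python
# Hypothetical host-side controller for the inflate/then-deflate cycle
# described above. The sizing heuristic is made up for illustration.

def overcommit_cycle(guest_free, guest_cache, backed_pages):
    # Pick a balloon target from the stats-queue numbers: all free memory
    # plus some of the reclaimable page cache.
    target = guest_free + guest_cache // 2
    # Inflate to 'target': the guest hands pages over and the host unbacks
    # them, shrinking the guest's footprint.
    backed_pages -= target
    # Immediately deflate back to size 0: the pages return to the guest's
    # free list but stay unbacked on the host until touched again.
    return backed_pages

# 100 free pages plus half of 50 cache pages reclaimed: footprint drops by 125.
assert overcommit_cycle(guest_free=100, guest_cache=50, backed_pages=400) == 275
```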


>
> > Two reasons I didn't go with the above implementation:
> > 1. I need to support guests before Linux 4.19 which don't have the
> shrinker
> > behavior.
> > 2. Memory in the balloon does not appear as "available" in /proc/meminfo
> even
> > though it is freeable. This is confusing to users, but isn't a deal
> breaker.
> >
> > If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
> > support then that would resolve reason #1 (ideally we would backport the
> bit to
> > 4.19).
>
> We could declare lack of pagecache pressure with DEFLATE_ON_OOM a
> regression and backport the revert but not I think the new
> DEFLATE_ON_PRESSURE.
>
To be clear, the page cache can still be pressured. When the balloon driver
allocates memory and causes reclaim, some of that memory comes from the
balloon (bad) but some of that comes from the page cache (good).


>
>
> > In any case, the shrinker behavior when pressuring page cache is more of
> an
> > inefficiency than a bug. It's not clear to me that it necessitates
> reverting.
> > If there were/are reasons to be on the shrinker interface then I think
> those
> > carry similar weight as the problem itself.
> >
> >
> >
> >     I consider virtio-balloon to this very day a big hack. And I don't
> see
> >     it getting better with new config knobs. Having that said, the
> >     technologies that are candidates to replace it (free page reporting,
> >     taming the guest page cache, etc.) are still not ready - so we'll
> have
> >     to stick with it for now :( .
> >
> >     >
> >     > I'm actually not sure how you would safely do memory overcommit
> without
> >     > DEFLATE_ON_OOM. So I think it unlocks a huge use case.
> >
> >     Using better suited technologies that are not ready yet (well, some
> form
> >     of free page reporting is available under IBM z already but in a
> >     proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only
> makes
> >     it less likely to crash your guest, but not that you are safe to
> squeeze
> >     the last bit out of your guest VM.
> >
> > Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any
> problems
> > in testing but I'd really like to know about the dangers.
> > Is there a difference in safety between the OOM notifier callback and the
> > shrinker API?
>
> It's not about dangers as such. It's just that when Linux hits OOM
> all kinds of error paths are being hit, latent bugs start triggering,
> latency goes up drastically.

Doesn't this suggest that the shrinker is preferable to the OOM notifier in
the case that we're actually OOMing (with DEFLATE_ON_OOM)?
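The distinction being argued over can be modeled roughly as follows (a simplification, not kernel code): a shrinker is invoked on ordinary direct-reclaim passes, while the OOM notifier only runs once reclaim has otherwise failed, i.e. as a true last resort:

```python
# Simplified model of the two trigger points for balloon deflation.

def reclaim(page_cache, balloon, use_shrinker):
    """One reclaim pass; returns (pages_freed, page_cache, balloon)."""
    freed = 0
    if page_cache > 0:               # cheap cache reclaim happens first
        page_cache -= 1
        freed += 1
    if use_shrinker and balloon > 0:
        balloon -= 1                 # shrinker deflates even while cache remains
        freed += 1
    if freed == 0 and balloon > 0:
        balloon -= 1                 # OOM notifier: only when nothing else helped
        freed += 1
    return freed, page_cache, balloon

# OOM-notifier style: the balloon is untouched while cache is reclaimable.
assert reclaim(page_cache=5, balloon=3, use_shrinker=False) == (1, 4, 3)
# Shrinker style: the balloon deflates alongside the cache.
assert reclaim(page_cache=5, balloon=3, use_shrinker=True) == (2, 4, 2)
```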


>
> >
> >
> >     --
> >     Thanks,
> >
> >     David / dhildenb
> >
> >
>
>

[-- Attachment #2: Type: text/html, Size: 6798 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05 19:01                               ` Tyler Sanderson
@ 2020-02-05 19:22                                 ` Alexander Duyck
  2020-02-05 21:44                                   ` Tyler Sanderson
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2020-02-05 19:22 UTC (permalink / raw)
  To: Tyler Sanderson, Michael S. Tsirkin
  Cc: David Hildenbrand, Wang, Wei W, virtualization, David Rientjes,
	linux-mm, Michal Hocko, namit

[-- Attachment #1: Type: text/plain, Size: 7606 bytes --]

On Wed, 2020-02-05 at 11:01 -0800, Tyler Sanderson wrote:
> On Tue, Feb 4, 2020 at 10:57 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Feb 04, 2020 at 03:58:51PM -0800, Tyler Sanderson wrote:
> > >     >     >
> > >     >     >  1. It is last-resort, which means the system has already gone     through
> > >     >     >     heroics to prevent OOM. Those heroic reclaim efforts are     expensive
> > >     >     >     and impact application performance.
> > >     >
> > >     >     That's *exactly* what "deflate on OOM" suggests.
> > >     >
> > >     >
> > >     > It seems there are some use cases where "deflate on OOM" is desired and
> > >     > others where "deflate on pressure" is desired.
> > >     > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
> > >     > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
> > >     > notifier callback.
> > >     >
> > >     > This lets users configure the balloon for their use case.
> > >
> > >     You want the old behavior back, so why should we introduce a new one? Or
> > >     am I missing something? (you did want us to revert to old handling, no?)
> > >
> > > Reverting actually doesn't help me because this has been the behavior since
> > > Linux 4.19 which is already widely in use. So my device implementation needs to
> > > handle the shrinker behavior anyways. I started this conversation to ask what
> > > the intended device implementation was.
> > >
> > > I think there are reasonable device implementations that would prefer the
> > > shrinker behavior (it turns out that mine doesn't).
> > > For example, an implementation that slowly inflates the balloon for the purpose
> > > of memory overcommit. It might leave the balloon inflated and expect any memory
> > > pressure (including page cache usage) to deflate the balloon as a way to
> > > dynamically right-size the balloon.
> > 
> > So just to make sure we understand, what exactly does your
> > implementation do?
> My implementation is for the purposes of opportunistic memory overcommit. We always want to give balloon memory back to the guest rather than causing an OOM, so we use DEFLATE_ON_OOM.
> We leave the balloon at size 0 while monitoring memory statistics reported on the stats queue. When we see there is an opportunity for significant savings then we inflate the balloon to a desired size (possibly including pressuring the page cache), and then immediately deflate back to size 0.
> The host pages backing the guest pages are unbacked during the inflation process, so the memory footprint of the guest is smaller after this inflate/deflate cycle.

This sounds a lot like free page reporting, except I haven't decided on the best way to exert the pressure yet.

You might want to take a look at my patch set here:
https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/

Instead of inflating a balloon all it is doing is identifying what pages are currently free and have not already been reported to the host and reports those via the balloon driver. The advantage is that we can do the reporting without causing any sort of OOM errors in most cases since we are just pulling and reporting a small set of pages at a time.
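The batching idea can be sketched as follows (illustrative only; the actual patch set works against the page allocator's free lists, not a counter): report free pages a small batch at a time and stop at a watermark, so reporting itself never pushes the guest toward OOM.

```python
# Rough sketch of watermark-bounded free page reporting.

def report_free_pages(unreported_free, batch=4, min_free=8):
    """Report unreported free pages in batches, keeping min_free in reserve."""
    reported = 0
    while unreported_free - reported > min_free:
        n = min(batch, unreported_free - reported - min_free)
        reported += n
    return reported

assert report_free_pages(unreported_free=18) == 10  # 8 pages stay in reserve
assert report_free_pages(unreported_free=8) == 0    # nothing above watermark
```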

> > 
> > > Two reasons I didn't go with the above implementation:
> > > 1. I need to support guests before Linux 4.19 which don't have the shrinker
> > > behavior.
> > > 2. Memory in the balloon does not appear as "available" in /proc/meminfo even
> > > though it is freeable. This is confusing to users, but isn't a deal breaker.
> > > 
> > > If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
> > > support then that would resolve reason #1 (ideally we would backport the bit to
> > > 4.19).
> > 
> > We could declare lack of pagecache pressure with DEFLATE_ON_OOM a
> > regression and backport the revert but not I think the new
> > DEFLATE_ON_PRESSURE.
> To be clear, the page cache can still be pressured. When the balloon driver allocates memory and causes reclaim, some of that memory comes from the balloon (bad) but some of that comes from the page cache (good).

I think the issue is that you aren't able to maintain the page cache pressure because your balloon is deflating as well, which in turn relieves the pressure. Ideally we would want some way of putting pressure on the page cache without stressing memory to the point of encountering OOM. I suspect that is why the balloon driver performs its allocations the way it does: it stops when it cannot fulfill an allocation and is willing to wait on other threads to trigger the reclaim.

> > 
> > > In any case, the shrinker behavior when pressuring page cache is more of an
> > > inefficiency than a bug. It's not clear to me that it necessitates reverting.
> > > If there were/are reasons to be on the shrinker interface then I think those
> > > carry similar weight as the problem itself.
> > > 
> > >     I consider virtio-balloon to this very day a big hack. And I don't see
> > >     it getting better with new config knobs. Having that said, the
> > >     technologies that are candidates to replace it (free page reporting,
> > >     taming the guest page cache, etc.) are still not ready - so we'll have
> > >     to stick with it for now :( .
> > > 
> > >     >
> > >     > I'm actually not sure how you would safely do memory overcommit without
> > >     > DEFLATE_ON_OOM. So I think it unlocks a huge use case.
> > > 
> > >     Using better suited technologies that are not ready yet (well, some form
> > >     of free page reporting is available under IBM z already but in a
> > >     proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only makes
> > >     it less likely to crash your guest, but not that you are safe to squeeze
> > >     the last bit out of your guest VM.
> > > 
> > > Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any problems
> > > in testing but I'd really like to know about the dangers.
> > > Is there a difference in safety between the OOM notifier callback and the
> > > shrinker API?
> > 
> > It's not about dangers as such. It's just that when Linux hits OOM
> > all kinds of error paths are being hit, latent bugs start triggering,
> > latency goes up drastically.
> Doesn't this suggest that the shrinker is preferable to the OOM notifier in the case that we're actually OOMing (with DEFLATE_ON_OOM)?

I think it all depends on the use case. For the use case you describe going to the shrinker might be preferable as you are wanting to exert just a light bit of pressure to start some page cache reclaim. However if you are wanting to make the deflation a last resort sort of thing then I would think the OOM would make more sense.

At a minimum I would think that the code needs to be reworked so that you either have the balloon inflating or deflating, not both at the same time. I think that is really what is at the heart of the issue for the current shrinker based approach since you can end up with the balloon driver essentially cycling pages as it is allocating them and freeing them at the same time.
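The page-cycling feedback loop described here can be modeled in a few lines (a toy simulation, not driver code): with no free memory, every inflate attempt triggers reclaim, reclaim calls the balloon's own shrinker to free a page, and the driver immediately re-inflates it.

```python
# Toy simulation of concurrent inflate/shrinker-deflate making no progress.

def inflate_step(balloon, free_pages):
    if free_pages == 0:
        balloon -= 1        # reclaim deflates the balloon itself ...
        free_pages += 1
    balloon += 1            # ... and the driver puts the page straight back
    free_pages -= 1
    return balloon, free_pages

balloon, free_pages = 10, 0
for _ in range(100):        # 100 inflate attempts under memory pressure
    balloon, free_pages = inflate_step(balloon, free_pages)
assert balloon == 10        # no net progress: pages just cycled
```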

[-- Attachment #2: Type: text/html, Size: 9535 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05 19:22                                 ` Alexander Duyck
@ 2020-02-05 21:44                                   ` Tyler Sanderson
  2020-02-06 11:00                                     ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Tyler Sanderson @ 2020-02-05 21:44 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, David Hildenbrand, Wang, Wei W,
	virtualization, David Rientjes, linux-mm, Michal Hocko, namit

[-- Attachment #1: Type: text/plain, Size: 8569 bytes --]

On Wed, Feb 5, 2020 at 11:22 AM Alexander Duyck <
alexander.h.duyck@linux.intel.com> wrote:

> On Wed, 2020-02-05 at 11:01 -0800, Tyler Sanderson wrote:
>
>
>
> On Tue, Feb 4, 2020 at 10:57 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Feb 04, 2020 at 03:58:51PM -0800, Tyler Sanderson wrote:
> >     >     >
> >     >     >  1. It is last-resort, which means the system has already
> gone     through
> >     >     >     heroics to prevent OOM. Those heroic reclaim efforts
> are     expensive
> >     >     >     and impact application performance.
> >     >
> >     >     That's *exactly* what "deflate on OOM" suggests.
> >     >
> >     >
> >     > It seems there are some use cases where "deflate on OOM" is
> desired and
> >     > others where "deflate on pressure" is desired.
> >     > This suggests adding a new feature bit "DEFLATE_ON_PRESSURE" that
> >     > registers the shrinker, and reverting DEFLATE_ON_OOM to use the OOM
> >     > notifier callback.
> >     >
> >     > This lets users configure the balloon for their use case.
> >
> >     You want the old behavior back, so why should we introduce a new
> one? Or
> >     am I missing something? (you did want us to revert to old handling,
> no?)
> >
> > Reverting actually doesn't help me because this has been the behavior
> since
> > Linux 4.19 which is already widely in use. So my device implementation
> needs to
> > handle the shrinker behavior anyways. I started this conversation to ask
> what
> > the intended device implementation was.
> >
> > I think there are reasonable device implementations that would prefer the
> > shrinker behavior (it turns out that mine doesn't).
> > For example, an implementation that slowly inflates the balloon for the
> purpose
> > of memory overcommit. It might leave the balloon inflated and expect any
> memory
> > pressure (including page cache usage) to deflate the balloon as a way to
> > dynamically right-size the balloon.
>
> So just to make sure we understand, what exactly does your
> implementation do?
>
> My implementation is for the purposes of opportunistic memory overcommit.
> We always want to give balloon memory back to the guest rather than causing
> an OOM, so we use DEFLATE_ON_OOM.
> We leave the balloon at size 0 while monitoring memory statistics reported
> on the stats queue. When we see there is an opportunity for significant
> savings then we inflate the balloon to a desired size (possibly including
> pressuring the page cache), and then immediately deflate back to size 0.
> The host pages backing the guest pages are unbacked during the inflation
> process, so the memory footprint of the guest is smaller after this
> inflate/deflate cycle.
>
>
> This sounds a lot like free page reporting, except I haven't decided on
> the best way to exert the pressure yet.
>
As you mention below, the advantage of free page reporting is that it
doesn't trigger the OOM path. So I'd strongly advocate that the
corresponding mechanism to shrink page cache should also not trigger the
OOM path. That suggests something like the drop_caches API we talked
about earlier in the thread.
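For reference, the guest-side knob being discussed already exists as the
drop_caches sysctl. A host-directed guest agent could be as small as the
following sketch (purely illustrative; `drop_caches` here is a hypothetical
helper name, writing the real sysctl requires root in the guest, and only
clean, unreferenced entries are dropped, so this never walks the OOM path):

```python
def drop_caches(level=1, path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop clean, unreferenced cache entries.

    level 1 drops the page cache, level 2 drops reclaimable slab
    objects (dentries and inodes), level 3 drops both. Only clean
    entries are affected, so no dirty data is lost and no OOM path
    is entered.
    """
    if level not in (1, 2, 3):
        raise ValueError("drop_caches accepts 1, 2 or 3")
    # Writing the real sysctl path requires root inside the guest.
    with open(path, "w") as f:
        f.write(str(level))
```

The `path` parameter exists only so the helper can be exercised against a
scratch file without privileges.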

>
> You might want to take a look at my patch set here:
>
> https://lore.kernel.org/lkml/20200122173040.6142.39116.stgit@localhost.localdomain/
>
Yes, I'm strongly in favor of your patch set's goals.


>
> Instead of inflating a balloon all it is doing is identifying what pages
> are currently free and have not already been reported to the host and
> reports those via the balloon driver. The advantage is that we can do the
> reporting without causing any sort of OOM errors in most cases since we are
> just pulling and reporting a small set of pages at a time.
>
>
>
> > Two reasons I didn't go with the above implementation:
> > 1. I need to support guests before Linux 4.19 which don't have the
> >    shrinker behavior.
> > 2. Memory in the balloon does not appear as "available" in /proc/meminfo
> >    even though it is freeable. This is confusing to users, but isn't a
> >    deal breaker.
> >
> > If we added a DEFLATE_ON_PRESSURE feature bit that indicated shrinker API
> > support then that would resolve reason #1 (ideally we would backport the
> > bit to 4.19).
>
> We could declare lack of pagecache pressure with DEFLATE_ON_OOM a
> regression and backport the revert, but not, I think, the new
> DEFLATE_ON_PRESSURE.
>
> To be clear, the page cache can still be pressured. When the balloon
> driver allocates memory and causes reclaim, some of that memory comes from
> the balloon (bad) but some of that comes from the page cache (good).
>
>
> I think the issue is that you aren't able to maintain the page cache
> pressure
>
Right. My implementation can shrink the page cache to whatever size is
desired. It just takes a lot more (10x) time and CPU on guests using the
shrinker API because of this back and forth.
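That extra time and CPU can be seen in a toy model (purely illustrative;
real vmscan behavior is far more complex, and `steps_to_inflate` is a
hypothetical name): if reclaim hands a page back from the balloon every few
allocation passes, the number of passes needed to reach the target balloon
size grows accordingly, reflecting the busy no-op described above.

```python
def steps_to_inflate(target, shrink_every=0):
    """Count allocation passes needed to reach `target` balloon pages
    when reclaim takes one page back from the balloon every
    `shrink_every` passes (0 = balloon shrinker not registered)."""
    balloon = steps = 0
    while balloon < target:
        steps += 1
        balloon += 1                      # driver inflates by one page
        if shrink_every and steps % shrink_every == 0:
            balloon -= 1                  # vmscan shrinks the balloon back
    return steps
```

With no shrinker, 100 pages take 100 passes; if reclaim hits the balloon on
every other pass, the same inflation takes 199 passes, i.e. net forward
progress is halved.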


> because your balloon is deflating as well which in turn is relieving the
> pressure. Ideally we would want to have some way of putting the pressure on
> the page cache without having to put enough stress on the memory though to
> get to the point of encountering OOM which is one of the reasons why I
> suspect the balloon driver does the allocation with things in place so that
> it will stop when it cannot fulfill the allocation and is willing to wait
> on other threads to trigger the reclaim.
>
>
>
> > In any case, the shrinker behavior when pressuring page cache is more of
> > an inefficiency than a bug. It's not clear to me that it necessitates
> > reverting. If there were/are reasons to be on the shrinker interface
> > then I think those carry similar weight as the problem itself.
> >
> >
> >
> >     I consider virtio-balloon to this very day a big hack. And I don't
> >     see it getting better with new config knobs. Having that said, the
> >     technologies that are candidates to replace it (free page reporting,
> >     taming the guest page cache, etc.) are still not ready - so we'll
> >     have to stick with it for now :( .
> >
> >     >
> >     > I'm actually not sure how you would safely do memory overcommit
> >     > without DEFLATE_ON_OOM. So I think it unlocks a huge use case.
> >
> >     Using better suited technologies that are not ready yet (well, some
> >     form of free page reporting is available under IBM z already but in
> >     a proprietary form) ;) Anyhow, I remember that DEFLATE_ON_OOM only
> >     makes it less likely to crash your guest, but not that you are safe
> >     to squeeze the last bit out of your guest VM.
> >
> > Can you elaborate on the danger of DEFLATE_ON_OOM? I haven't seen any
> > problems in testing but I'd really like to know about the dangers.
> > Is there a difference in safety between the OOM notifier callback and the
> > shrinker API?
>
> It's not about dangers as such. It's just that when linux hits OOM
> all kinds of error paths are being hit, latent bugs start triggering,
> latency goes up drastically.
>
> Doesn't this suggest that the shrinker is preferable to the OOM notifier
> in the case that we're actually OOMing (with DEFLATE_ON_OOM)?
>
>
> I think it all depends on the use case. For the use case you describe
> going to the shrinker might be preferable as you are wanting to exert just
> a light bit of pressure to start some page cache reclaim. However if you
> are wanting to make the deflation a last resort sort of thing then I would
> think the OOM would make more sense.
>
I agree that the desired behavior depends on the use case. But even for the
case that deflation is a last resort, it seems like we'd like to use the
shrinker API rather than the OOM notifier since the OOM notifier is more
likely to have bugs/errors. The shrinker API doesn't support this
functionality yet, but you could imagine configuring the API so that the
balloon is reclaimed from less frequently or only when shrinking other
sources is becoming difficult. That way we're not actually in the
error-prone OOM path.


> At a minimum I would think that the code needs to be reworked so that you
> either have the balloon inflating or deflating, not both at the same time.
>
DEFLATE_ON_OOM necessarily causes deflate activity regardless of whether
the device wants to continue inflating the balloon. Blocking the deflate
activity would cause an OOM in the guest.


> I think that is really what is at the heart of the issue for the current
> shrinker based approach since you can end up with the balloon driver
> essentially cycling pages as it is allocating them and freeing them at the
> same time.
>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Balloon pressuring page cache
  2020-02-05 18:43                                   ` Tyler Sanderson
@ 2020-02-06  9:30                                     ` Wang, Wei W
  0 siblings, 0 replies; 59+ messages in thread
From: Wang, Wei W @ 2020-02-06  9:30 UTC (permalink / raw)
  To: Tyler Sanderson
  Cc: David Hildenbrand, Michael S. Tsirkin, Nadav Amit,
	Alexander Duyck, virtualization, David Rientjes, linux-mm,
	Michal Hocko


Isn't the hint only useful during the first round?
After the first round if a page becomes free then we need to update the copy at the migration destination, so freeing a page that previously had contents should mark it dirty.


Nope. I think as long as it is a free page (no matter whether it is the 1st or 2nd round), we don’t need to send it.

Best,
Wei
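Wei's claim can be checked with a toy precopy model (purely illustrative,
not QEMU's actual algorithm; `migrate` and its shape are invented for this
sketch): a page freed after being sent leaves a stale copy at the
destination, but that is harmless because free-page contents are undefined,
and any later write re-dirties the page so it gets resent.

```python
def migrate(initial, rounds_of_events):
    """Precopy rounds over a tiny guest memory.

    `initial` is memory at migration start (None = free page). Each
    round first applies the guest's activity as (page, value) pairs,
    with value None meaning "freed", then sends every dirty page --
    except pages that are free at send time, which are skipped (the
    free-page hint).
    """
    src = list(initial)
    dest = [None] * len(src)
    dirty = set(range(len(src)))        # round 1: all pages are dirty
    for events in rounds_of_events:
        for page, value in events:
            src[page] = value
            dirty.add(page)             # writes (and frees) dirty pages
        for page in sorted(dirty):
            if src[page] is not None:   # free pages are never sent
                dest[page] = src[page]
        dirty.clear()
    return src, dest
```

In the test below, page 1 is sent in round 1, freed in round 2 (its stale
destination copy is a don't-care), then rewritten and correctly resent in
round 3; every allocated page matches at the end.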


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Balloon pressuring page cache
  2020-02-05 21:44                                   ` Tyler Sanderson
@ 2020-02-06 11:00                                     ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-02-06 11:00 UTC (permalink / raw)
  To: Tyler Sanderson, Alexander Duyck
  Cc: Michael S. Tsirkin, Wang, Wei W, virtualization, David Rientjes,
	linux-mm, Michal Hocko, namit

>>>     It's not about dangers as such. It's just that when linux hits OOM
>>>     all kinds of error paths are being hit, latent bugs start triggering,
>>>     latency goes up drastically.
>>     Doesn't this suggest that the shrinker is preferable to the OOM
>>     notifier in the case that we're actually OOMing (with DEFLATE_ON_OOM)?
> 
>     I think it all depends on the use case. For the use case you
>     describe going to the shrinker might be preferable as you are
>     wanting to exert just a light bit of pressure to start some page
>     cache reclaim. However if you are wanting to make the deflation a
>     last resort sort of thing then I would think the OOM would make more
>     sense.

Long story short: What you actually want is free page reporting combined
with

a) avoiding the guest page cache (emulated nvdimms, virtio-pmem). Not
always possible and has some issues in non-trusted environments (IOW,
cloud).

b) a way to tame the page cache (e.g., drop it completely similar to
drop_caches, or a way to drop a specific fraction, or things not touched
for the last $SECONDS) etc.

There are some nice discussions in response to Alexander's v16.1 posting.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2020-02-06 11:01 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAJuQAmpDUyve2S+oxp9tLUhuRcnddXnNztC5PmYOOCpY6c68xg@mail.gmail.com>
     [not found] ` <91270a68-ff48-88b0-219c-69801f0c252f@redhat.com>
     [not found]   ` <CAJuQAmoaK0Swytu2Os_SQRfG5_LqiCPaDa9yatatm9MtfncNTQ@mail.gmail.com>
2020-01-30 15:02     ` Balloon pressuring page cache David Hildenbrand
2020-01-30 15:20       ` Michael S. Tsirkin
2020-01-30 15:23         ` David Hildenbrand
2020-01-30 15:31       ` Wang, Wei W
2020-01-30 19:59         ` Tyler Sanderson
2020-02-03 13:11           ` Michael S. Tsirkin
2020-02-03 16:18             ` Alexander Duyck
2020-02-03 16:34               ` David Hildenbrand
2020-02-03 17:03                 ` Michael S. Tsirkin
2020-02-03 20:32                   ` Tyler Sanderson
2020-02-03 21:22                     ` Alexander Duyck
2020-02-03 23:16                       ` Tyler Sanderson
2020-02-04  0:10                         ` Alexander Duyck
2020-02-04  5:45                     ` Michael S. Tsirkin
2020-02-04  8:29                     ` David Hildenbrand
2020-02-04 18:52                       ` Tyler Sanderson
2020-02-04 18:56                         ` Michael S. Tsirkin
2020-02-04 19:17                         ` David Hildenbrand
2020-02-04 23:58                           ` Tyler Sanderson
2020-02-05  0:15                             ` Tyler Sanderson
2020-02-05  6:57                             ` Michael S. Tsirkin
2020-02-05 19:01                               ` Tyler Sanderson
2020-02-05 19:22                                 ` Alexander Duyck
2020-02-05 21:44                                   ` Tyler Sanderson
2020-02-06 11:00                                     ` David Hildenbrand
2020-02-03 22:50                 ` Nadav Amit
2020-02-04  8:35                   ` David Hildenbrand
2020-02-04  8:40                     ` Michael S. Tsirkin
2020-02-04  8:48                       ` David Hildenbrand
2020-02-04 14:30                       ` David Hildenbrand
2020-02-04 16:50                         ` Michael S. Tsirkin
2020-02-04 16:56                           ` David Hildenbrand
2020-02-04 20:33                             ` Michael S. Tsirkin
2020-02-05  8:31                               ` David Hildenbrand
2020-02-05  6:52                           ` Wang, Wei W
2020-02-05  7:05                             ` Michael S. Tsirkin
2020-02-05  8:50                               ` Wang, Wei W
2020-02-05  6:49                         ` Wang, Wei W
2020-02-05  8:19                           ` David Hildenbrand
2020-02-05  8:54                             ` Wang, Wei W
2020-02-05  8:56                               ` David Hildenbrand
2020-02-05  9:00                                 ` Wang, Wei W
2020-02-05  9:05                                   ` David Hildenbrand
2020-02-05  9:19                                     ` Wang, Wei W
2020-02-05  9:22                                       ` David Hildenbrand
2020-02-05  9:35                                         ` Wang, Wei W
2020-02-05  9:37                                           ` David Hildenbrand
2020-02-05  9:49                                             ` Wang, Wei W
2020-02-05  9:58                                               ` David Hildenbrand
2020-02-05 10:25                                                 ` Michael S. Tsirkin
2020-02-05 10:42                                                   ` David Hildenbrand
2020-02-05  9:35                                         ` Michael S. Tsirkin
2020-02-05 18:43                                   ` Tyler Sanderson
2020-02-06  9:30                                     ` Wang, Wei W
2020-02-05  7:35                   ` Nadav Amit
2020-02-05  8:19                     ` David Hildenbrand
2020-02-05 10:27                       ` Michael S. Tsirkin
2020-02-05 10:43                         ` David Hildenbrand
2020-01-30 22:46       ` Tyler Sanderson
