* On guest free page hinting and OOM
@ 2019-03-29 13:26 Michael S. Tsirkin
  2019-03-29 14:24 ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 13:26 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
	dhildenb, aarcange, alexander.duyck

On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.

Sorry about breaking the thread: the original subject was
	KVM: Guest Free Page Hinting
but the following isn't a response to a specific patch,
so I thought it reasonable to start a new one.

What bothers me (and others) about both Nitesh's asynchronous approach
to hinting and the hinting that is already supported in the balloon
driver right now is that it seems to have the potential to create a fake
OOM situation: a page that is in the process of being hinted cannot be
used.  How likely that is depends on the workload, so it is hard to
predict.

Alex's patches do not have this problem as they block the
VCPUs from attempting to get new pages during hinting. This solves the
fake OOM issue but adds blocking which, most of the time, is not necessary.

With both approaches there's a tradeoff: hinting is more efficient if it
hints about large chunks of memory at a time, but as that size
increases, the chance of being able to hold on to that much memory at a
time decreases. One can claim that this is a regular performance/memory
tradeoff, however there is a difference here: normally
guest performance is traded off for host memory (of which the host
knows how much it has); this trades guest performance
for guest memory, but the benefit is on the host, not on
the guest. Thus this is harder to manage.

I have an idea: how about allocating extra guest memory on the host?  An
extra hinting buffer would be appended to guest memory, with the
understanding that it is destined specifically to improve page hinting.
The balloon device would get an extra parameter specifying the
hinting buffer size - e.g. in the config space of the device.
At driver startup, it would get hold of the amount of
memory specified by the host as the hinting buffer size, and keep it around
in a buffer list - if no action is taken - forever.  Whenever the balloon
driver wanted to get hold of a page of memory and send it to the host for
hinting, it would release a page of the same size from the buffer into the
free list: a new page swaps places with a page in the buffer.

In this way the amount of useful free memory stays constant.

Once hinting is done, the page can be swapped back - or just stay
in the hinting buffer until the next hint.

Clearly this is a memory/performance tradeoff: the more memory the host can
allocate for the hinting buffer, the more batching we'll get, so hints
become cheaper. One notes that:
- if guest memory isn't pinned, this memory is virtual and can
  be reclaimed by the host. In particular, the guest can hint about the
  memory within the hinting buffer at startup.
- guest performance/host memory tradeoffs are reasonably well understood, and
  so it's easier to manage: the host knows how much memory it can
  sacrifice to gain the benefit of hinting.
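
Roughly, I imagine the guest side like this (just a sketch, none of these
symbols exist in the balloon driver, and a single hinting order is assumed
for simplicity):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct hint_buffer {
        struct list_head pages;         /* pages held back for hinting */
        spinlock_t lock;
};

/*
 * Take a free page for hinting and release one buffered page of the
 * same order in exchange, so the amount of usable free memory stays
 * constant while the hint is being processed.
 */
static struct page *hint_swap_out(struct hint_buffer *buf, unsigned int order)
{
        struct page *victim, *spare;

        victim = alloc_pages(GFP_NOWAIT, order);        /* page to be hinted */
        if (!victim)
                return NULL;

        spin_lock(&buf->lock);
        spare = list_first_entry_or_null(&buf->pages, struct page, lru);
        if (spare)
                list_del(&spare->lru);
        spin_unlock(&buf->lock);

        if (!spare) {
                /* buffer exhausted - skip hinting this time */
                __free_pages(victim, order);
                return NULL;
        }

        __free_pages(spare, order);     /* the spare takes victim's place */
        return victim;                  /* victim goes off to be hinted */
}

/* Once the host has processed the hint, the page re-enters the buffer. */
static void hint_swap_back(struct hint_buffer *buf, struct page *page)
{
        spin_lock(&buf->lock);
        list_add(&page->lru, &buf->pages);
        spin_unlock(&buf->lock);
}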

Thoughts?

-- 
MST


* Re: On guest free page hinting and OOM
  2019-03-29 13:26 On guest free page hinting and OOM Michael S. Tsirkin
@ 2019-03-29 14:24 ` David Hildenbrand
  2019-03-29 15:08   ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-03-29 14:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Nitesh Narayan Lal
  Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
	wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb,
	aarcange, alexander.duyck

On 29.03.19 14:26, Michael S. Tsirkin wrote:
> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Sorry about breaking the thread: the original subject was
> 	KVM: Guest Free Page Hinting
> but the following isn't in a response to a specific patch
> so I thought it's reasonable to start a new one.
> 
> What bothers both me (and others) with both Nitesh's asynchronous approach
> to hinting and the hinting that is already supported in the balloon
> driver right now is that it seems to have the potential to create a fake OOM situation:
> the page that is in the process of being hinted can not be used.  How
> likely that is would depend on the workload so is hard to predict.

We had a very simple idea in mind: as long as a hinting request is
pending, don't actually trigger any OOM activity, but wait for it to be
processed. This can be done using a simple atomic variable.
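
Something like this is all I have in mind (sketch only, names made up):

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/sched.h>

/* the "simple atomic variable" */
static atomic_t hints_in_flight = ATOMIC_INIT(0);

static void hint_begin(void)    /* before pages are sent off for hinting */
{
        atomic_inc(&hints_in_flight);
}

static void hint_end(void)      /* host processed the request, pages are back */
{
        atomic_dec(&hints_in_flight);
}

/*
 * OOM notifier: as long as a hinting request is pending, don't declare
 * OOM yet - the hinted pages will come back shortly.
 */
static int hinting_oom_notify(struct notifier_block *nb, unsigned long dummy,
                              void *parm)
{
        while (atomic_read(&hints_in_flight))
                cond_resched();         /* or yield to the hypervisor */
        return NOTIFY_OK;
}

static struct notifier_block hinting_oom_nb = {
        .notifier_call = hinting_oom_notify,
};
/* register_oom_notifier(&hinting_oom_nb) at driver init */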

This is a scenario that will only pop up when already pretty low on
memory. And the main difference to ballooning is that we *know* we will
get more memory soon.


> 
> Alex's patches do not have this problem as they block the
> VCPUs from attempting to get new pages during hinting. Solves the fake OOM
> issue but adds blocking which most of the time is not necessary.

+ not going via QEMU, which I consider problematic in the future when it
comes to various things:
1) VFIO notifications, if we ever want to support them
2) Verifying that the memory may actually be hinted. Remember when
people started to madvise(DONTNEED) the BIOS and we had to fix that in QEMU.

> 
> With both approaches there's a tradeoff: hinting is more efficient if it
> hints about large sized chunks of memory at a time, but as that size
> increases, chances of being able to hold on to that much memory at a
> time decrease. One can claim that this is a regular performance/memory
> tradeoff however there is a difference here: normally
> guest performance is traded off for host memory (which host
> knows how much there is of), this trades guest performance
> for guest memory, but the benefit is on the host, not on
> the guest. Thus this is harder to manage.

One nice thing is that, when only hinting larger chunks, smaller chunks
are more likely to remain available. It would be more of an
issue when hinting at any granularity.

> 
> I have an idea: how about allocating extra guest memory on the host?  An
> extra hinting buffer would be appended to guest memory, with the
> understanding that it is destined specifically to improve page hinting.
> Balloon device would get an extra parameter specifying the
> hinting buffer size - e.g. in the config space of the driver.
> At driver startup, it would get hold of the amount of
> memory specified by host as the hinting buffer size, and keep it around in a
> buffer list - if no action is taken - forever.  Whenever balloon would
> want to get hold of a page of memory and send it to host for hinting, it
> would release a page of the same size from the buffer into the free
> list: a new page swaps places with a page in the buffer.
> 
> In this way the amount of useful free memory stays constant.
> 
> Once hinting is done page can be swapped back - or just stay
> in the hinting buffer until the next hint.
> 
> Clearly this is a memory/performance tradeoff: the more memory host can
> allocate for the hinting buffer, the more batching we'll get so hints
> become cheaper. One notes that:
> - if guest memory isn't pinned, this memory is virtual and can
>   be reclaimed by host. In partucular guest can hint about the
>   memory within the hinting buffer at startup.
> - guest performance/host memory tradeoffs are reasonably well understood, and
>   so it's easier to manage: host knows how much memory it can
>   sacrifice to gain the benefit of hinting.
> 
> Thoughts?
> 

I first want to

a) See it being a real issue. Reproduce it.
b) See that we can't fix it using a simple approach (loop while requests
are not processed yet, always keep X pages ...).
c) See that an easy fix is not sufficient and that this is actually an issue.
d) See if we can document it so that people who care can live without
hinting, like they would live without ballooning.

What you describe sounds interesting, but really involved. And really
problematic. I consider many things about your approach unrealistic.

"appended to guest memory", "global list of memory", malicious guests
always using that memory: what about NUMA? What about different page
granularities? What about malicious guests? What about more hinting
requests than the buffer is capable of handling? And much, much more.

Honestly, requiring page hinting to make use of actual ballooning or
additional memory makes me shiver. I hope I don't get nightmares ;) In
the long term we might want to get rid of the inflation/deflation side
of virtio-balloon, not require it.

Please don't over-engineer an issue we haven't even seen yet. Especially
not using a mechanism that sounds more involved than actual hinting.


As always, I might be very wrong, but this sounds way too complicated to
me, both on the guest and the hypervisor side.

-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-03-29 14:24 ` David Hildenbrand
@ 2019-03-29 15:08   ` Michael S. Tsirkin
  2019-03-29 15:37     ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 15:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> 
> We had a very simple idea in mind: As long as a hinting request is
> pending, don't actually trigger any OOM activity, but wait for it to be
> processed. Can be done using simple atomic variable.
> 
> This is a scenario that will only pop up when already pretty low on
> memory. And the main difference to ballooning is that we *know* we will
> get more memory soon.

No we don't.  If we keep polling we are quite possibly keeping the CPU
busy and so delaying the hint request processing.  Again, the issue is a
tradeoff: one kind of performance for the other. It is very hard to know
which path you hit in advance, and in the real world no one has the time
to profile and tune things. By comparison, trading memory for performance
is well understood.


> "appended to guest memory", "global list of memory", malicious guests
> always using that memory like what about NUMA?

This can be up to the guest. A good approach would be to take
a chunk out of each node and add it to the hints buffer.

> What about different page
> granularity?

Seems like an orthogonal issue to me.

> What about malicious guests?

That's an interesting question.  The host can actually enforce that the
number of hinted free pages at least matches the hint buffer size.


> What about more hitning
> requests than the buffer is capable to handle?

The idea is that we don't send more hints than fit in the buffer.
In this way the host can actually control the overhead, which
is probably a good thing - the host knows how much benefit
can be derived from hinting. The guest doesn't.

> Honestly, requiring page hinting to make use of actual ballooning or
> additional memory makes me shiver. I hope I don't get nightmares ;) In
> the long term we might want to get rid of the inflation/deflation side
> of virtio-balloon, not require it.
> 
> Please don't over-engineer an issue we haven't even see yet.

All hinting patches are very lightly tested as it is. OOM especially is
very hard to test properly.  So *I* will sleep better at night if we
don't have corner cases.  Balloon is already involved in MM for
isolation and somehow we live with that.  So wait until you see actual
code before worrying about nightmares.

> Especially
> not using a mechanism that sounds more involved than actual hinting.

That would depend on the implementation.
It's just moving a page between two lists.


> 
> As always, I might be very wrong, but this sounds way too complicated to
> me, both on the guest and the hypervisor side.

On the hypervisor side it can be literally nothing if we don't
want to enforce buffer size.

> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: On guest free page hinting and OOM
  2019-03-29 15:08   ` Michael S. Tsirkin
@ 2019-03-29 15:37     ` David Hildenbrand
  2019-03-29 15:45       ` David Hildenbrand
  2019-03-29 16:09       ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-03-29 15:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On 29.03.19 16:08, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>
>> We had a very simple idea in mind: As long as a hinting request is
>> pending, don't actually trigger any OOM activity, but wait for it to be
>> processed. Can be done using simple atomic variable.
>>
>> This is a scenario that will only pop up when already pretty low on
>> memory. And the main difference to ballooning is that we *know* we will
>> get more memory soon.
> 
> No we don't.  If we keep polling we are quite possibly keeping the CPU
> busy so delaying the hint request processing.  Again the issue it's a

You can always yield. But that's a different topic.

> tradeoff. One performance for the other. Very hard to know which path do
> you hit in advance, and in the real world no one has the time to profile
> and tune things. By comparison trading memory for performance is well
> understood.
> 
> 
>> "appended to guest memory", "global list of memory", malicious guests
>> always using that memory like what about NUMA?
> 
> This can be up to the guest. A good approach would be to take
> a chunk out of each node and add to the hints buffer.

This might lead to you not using the buffer efficiently. But also,
different topic.

> 
>> What about different page
>> granularity?
> 
> Seems like an orthogonal issue to me.

It is similar, yes. But if you support multiple granularities (e.g.
MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
of buddy for the buffer. This is different from just a list for each node.
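
Just to illustrate what I mean structurally (made-up sketch, not a proposal):

#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/numa.h>
#include <linux/spinlock.h>

#define HINT_MIN_ORDER  (MAX_ORDER - 2)

/* a tiny "buddy" per node: one free list per supported order */
struct hint_buffer_node {
        struct list_head free[MAX_ORDER - HINT_MIN_ORDER];
        unsigned long nr_pages;
        spinlock_t lock;
};

/* one such buffer per NUMA node */
static struct hint_buffer_node hint_buffers[MAX_NUMNODES];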

> 
>> What about malicious guests?
> 
> That's an interesting question.  Host can actually enforce that # of
> hinted free pages at least matches the hint buffer size.

Well, you will have to assume that the hinting buffer might be
completely used by any guest. Just like ballooning, there are no
guarantees. If the buffer memory is part of initial boot memory, any
guest without proper modifications will simply make use of it. See more
on that below. At first, my impression was that you would actually
somehow want to expose new memory ("buffer") to the guest, which is why
I shivered.

> 
> 
>> What about more hitning
>> requests than the buffer is capable to handle?
> 
> The idea is that we don't send more hints than in the buffer.
> In this way host can actually control the overhead which
> is probably a good thing - host knows how much benefit
> can be derived from hinting. Guest doesn't.

So the buffer will require a global lock, just so we are on the same
page. "this way host can actually control the overhead" - you can
implement that more easily by specifying a desired limit via virtio and
tracking it in the guest, which is what your buffer would implicitly do.

> 
>> Honestly, requiring page hinting to make use of actual ballooning or
>> additional memory makes me shiver. I hope I don't get nightmares ;) In
>> the long term we might want to get rid of the inflation/deflation side
>> of virtio-balloon, not require it.
>>
>> Please don't over-engineer an issue we haven't even see yet.
> 
> All hinting patches are very lightly tested as it is. OOM especially is
> very hard to test properly.  So *I* will sleep better at night if we
> don't have corner cases.  Balloon is already involved in MM for
> isolation and somehow we live with that.  So wait until you see actual
> code before worrying about nightmares.

Yes, but I want to see attempts with simple solutions first.

> 
>> Especially
>> not using a mechanism that sounds more involved than actual hinting.
> 
> That would depend on the implementation.
> It's just moving a page between two lists.

Except NUMA and different granularities.

> 
> 
>>
>> As always, I might be very wrong, but this sounds way too complicated to
>> me, both on the guest and the hypervisor side.
> 
> On the hypervisor side it can be literally nothing if we don't
> want to enforce buffer size.


Just so we understand each other: what you mean by "appended to guest
memory" is "append to the guest memory size", not actually "append
memory via virtio-balloon", like adding memory regions and stuff.

Instead of "-m 4G" you would do "-m 5G -device virtio-balloon,hint_size=1G".


So to summarize my opinion, this looks like a *possible* improvement for
the future *if* we realize that a simple approach does not work. If we
simply add more memory for the hinting buffer to the VM memory size, we
will have to assume it will be used by guests, even if they are not
malicious (e.g. a guest without a virtio-balloon driver or without the
enhancement). NUMA and different page granularities in the guest are
tricky parts.

-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-03-29 15:37     ` David Hildenbrand
@ 2019-03-29 15:45       ` David Hildenbrand
  2019-03-29 16:51         ` Michael S. Tsirkin
  2019-03-29 16:09       ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-03-29 15:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On 29.03.19 16:37, David Hildenbrand wrote:
> On 29.03.19 16:08, Michael S. Tsirkin wrote:
>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>>
>>> We had a very simple idea in mind: As long as a hinting request is
>>> pending, don't actually trigger any OOM activity, but wait for it to be
>>> processed. Can be done using simple atomic variable.
>>>
>>> This is a scenario that will only pop up when already pretty low on
>>> memory. And the main difference to ballooning is that we *know* we will
>>> get more memory soon.
>>
>> No we don't.  If we keep polling we are quite possibly keeping the CPU
>> busy so delaying the hint request processing.  Again the issue it's a
> 
> You can always yield. But that's a different topic.
> 
>> tradeoff. One performance for the other. Very hard to know which path do
>> you hit in advance, and in the real world no one has the time to profile
>> and tune things. By comparison trading memory for performance is well
>> understood.
>>
>>
>>> "appended to guest memory", "global list of memory", malicious guests
>>> always using that memory like what about NUMA?
>>
>> This can be up to the guest. A good approach would be to take
>> a chunk out of each node and add to the hints buffer.
> 
> This might lead to you not using the buffer efficiently. But also,
> different topic.
> 
>>
>>> What about different page
>>> granularity?
>>
>> Seems like an orthogonal issue to me.
> 
> It is similar, yes. But if you support multiple granularities (e.g.
> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
> of buddy for the buffer. This is different than just a list for each node.

Oh, and before I forget, different zones might of course also be a problem.


-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-03-29 15:37     ` David Hildenbrand
  2019-03-29 15:45       ` David Hildenbrand
@ 2019-03-29 16:09       ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 16:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Fri, Mar 29, 2019 at 04:37:46PM +0100, David Hildenbrand wrote:
> Just so we understand each other. What you mean with "appended to guest
> memory" is "append to the guest memory size", not actually "append
> memory via virtio-balloon", like adding memory regions and stuff.
> 
> Instead of "-m 4G" you would do "-m 5G -device virtio-balloon,hint_size=1G".

Right.



* Re: On guest free page hinting and OOM
  2019-03-29 15:45       ` David Hildenbrand
@ 2019-03-29 16:51         ` Michael S. Tsirkin
  2019-04-01  8:17           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
> On 29.03.19 16:37, David Hildenbrand wrote:
> > On 29.03.19 16:08, Michael S. Tsirkin wrote:
> >> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> >>>
> >>> We had a very simple idea in mind: As long as a hinting request is
> >>> pending, don't actually trigger any OOM activity, but wait for it to be
> >>> processed. Can be done using simple atomic variable.
> >>>
> >>> This is a scenario that will only pop up when already pretty low on
> >>> memory. And the main difference to ballooning is that we *know* we will
> >>> get more memory soon.
> >>
> >> No we don't.  If we keep polling we are quite possibly keeping the CPU
> >> busy so delaying the hint request processing.  Again the issue it's a
> > 
> > You can always yield. But that's a different topic.
> > 
> >> tradeoff. One performance for the other. Very hard to know which path do
> >> you hit in advance, and in the real world no one has the time to profile
> >> and tune things. By comparison trading memory for performance is well
> >> understood.
> >>
> >>
> >>> "appended to guest memory", "global list of memory", malicious guests
> >>> always using that memory like what about NUMA?
> >>
> >> This can be up to the guest. A good approach would be to take
> >> a chunk out of each node and add to the hints buffer.
> > 
> > This might lead to you not using the buffer efficiently. But also,
> > different topic.
> > 
> >>
> >>> What about different page
> >>> granularity?
> >>
> >> Seems like an orthogonal issue to me.
> > 
> > It is similar, yes. But if you support multiple granularities (e.g.
> > MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
> > of buddy for the buffer. This is different than just a list for each node.

Right, but we don't plan to do it yet.

> Oh, and before I forget, different zones might of course also be a problem.

I would just split the hint buffer evenly between zones.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: On guest free page hinting and OOM
  2019-03-29 16:51         ` Michael S. Tsirkin
@ 2019-04-01  8:17           ` David Hildenbrand
  2019-04-01 13:24             ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-01  8:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On 29.03.19 17:51, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
>> On 29.03.19 16:37, David Hildenbrand wrote:
>>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>>>>
>>>>> We had a very simple idea in mind: As long as a hinting request is
>>>>> pending, don't actually trigger any OOM activity, but wait for it to be
>>>>> processed. Can be done using simple atomic variable.
>>>>>
>>>>> This is a scenario that will only pop up when already pretty low on
>>>>> memory. And the main difference to ballooning is that we *know* we will
>>>>> get more memory soon.
>>>>
>>>> No we don't.  If we keep polling we are quite possibly keeping the CPU
>>>> busy so delaying the hint request processing.  Again the issue it's a
>>>
>>> You can always yield. But that's a different topic.
>>>
>>>> tradeoff. One performance for the other. Very hard to know which path do
>>>> you hit in advance, and in the real world no one has the time to profile
>>>> and tune things. By comparison trading memory for performance is well
>>>> understood.
>>>>
>>>>
>>>>> "appended to guest memory", "global list of memory", malicious guests
>>>>> always using that memory like what about NUMA?
>>>>
>>>> This can be up to the guest. A good approach would be to take
>>>> a chunk out of each node and add to the hints buffer.
>>>
>>> This might lead to you not using the buffer efficiently. But also,
>>> different topic.
>>>
>>>>
>>>>> What about different page
>>>>> granularity?
>>>>
>>>> Seems like an orthogonal issue to me.
>>>
>>> It is similar, yes. But if you support multiple granularities (e.g.
>>> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
>>> of buddy for the buffer. This is different than just a list for each node.
> 
> Right but we don't plan to do it yet.

MAX_ORDER - 2 on x86-64 (order 9, i.e. 2MB chunks, matching the THP size)
seems to work just fine (no THP splits) and early performance numbers
indicate it might be the right thing to do. So it could be very desirable
once we do more performance tests.

> 
>> Oh, and before I forget, different zones might of course also be a problem.
> 
> I would just split the hint buffer evenly between zones.
> 

Thinking about your approach, there is one elementary thing to notice:

Giving the guest pages from the buffer while hinting requests are being
processed means that the guest can and will temporarily make use of more
memory than desired. Essentially up to the point where MADV_FREE is
finally called for the hinted pages. Even then the guest will logically
make use of more memory than desired until core MM takes pages away.

So:
1) Unmodified guests will make use of more memory than desired.
2) Malicious guests will make use of more memory than desired.
3) Sane, modified guests will make use of more memory than desired.

Instead, we could make our life much easier by doing the following:

1) Introduce a parameter to cap the amount of memory concurrently hinted,
similar to what you suggested, just don't consider it a buffer value:
"-device virtio-balloon,hinting_size=1G". This gives us control over the
hinting process.

hinting_size=0 (default) disables hinting

The admin can tweak the number along with the memory requirements of the
guest. We can make suggestions (e.g. calculate it depending on #cores,
#size of memory, or simply "1GB").

2) In the guest, track the size of hints in progress and cap it at the
hinting_size (rough sketch after this list).

3) Document hinting behavior

"When hinting is enabled, memory up to hinting_size might temporarily be
removed from your guest in order to be hinted to the hypervisor. This is
only for a very short time, but might affect applications. Consider the
hinting_size when sizing your guest. If your application was tested with
XGB and a hinting size of 1G is used, please configure X+1GB for the
guest. Otherwise, performance degradation might be possible."

4) Do the loop/yield on OOM as discussed, to improve performance when OOM
hits and avoid false OOM triggers, just to be sure.
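
Rough sketch for 2), just to show how little is needed (names made up,
hinting_size would come from the device config space):

#include <linux/atomic.h>
#include <linux/types.h>

static unsigned long hinting_size;      /* bytes, from the device config */
static atomic_long_t hinted_bytes;      /* currently isolated for hinting */

/* May another chunk be isolated and sent off for hinting right now? */
static bool may_hint(unsigned long bytes)
{
        if (atomic_long_add_return(bytes, &hinted_bytes) > hinting_size) {
                atomic_long_sub(bytes, &hinted_bytes);
                return false;           /* cap reached, try again later */
        }
        return true;
}

/* The host acknowledged the hint, the pages are usable again. */
static void hint_done(unsigned long bytes)
{
        atomic_long_sub(bytes, &hinted_bytes);
}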


BTW, one alternative I initially had in mind was to add pages from the
buffer from the OOM handler only and to put these pages back into the
buffer once freed. I thought this might help for certain memory offlining
scenarios where pages stuck in the buffer might hinder offlining of
memory. And of course, it would improve performance as the buffer is only
touched when really needed. But it would only help for memory (e.g. DIMMs)
added after boot, so it is also not 100% safe. Also, it has the same issues
as your given approach.

-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01  8:17           ` David Hildenbrand
@ 2019-04-01 13:24             ` Michael S. Tsirkin
  2019-04-01 14:09               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-01 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Mon, Apr 01, 2019 at 10:17:51AM +0200, David Hildenbrand wrote:
> On 29.03.19 17:51, Michael S. Tsirkin wrote:
> > On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
> >> On 29.03.19 16:37, David Hildenbrand wrote:
> >>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
> >>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> >>>>>
> >>>>> We had a very simple idea in mind: As long as a hinting request is
> >>>>> pending, don't actually trigger any OOM activity, but wait for it to be
> >>>>> processed. Can be done using simple atomic variable.
> >>>>>
> >>>>> This is a scenario that will only pop up when already pretty low on
> >>>>> memory. And the main difference to ballooning is that we *know* we will
> >>>>> get more memory soon.
> >>>>
> >>>> No we don't.  If we keep polling we are quite possibly keeping the CPU
> >>>> busy so delaying the hint request processing.  Again the issue it's a
> >>>
> >>> You can always yield. But that's a different topic.
> >>>
> >>>> tradeoff. One performance for the other. Very hard to know which path do
> >>>> you hit in advance, and in the real world no one has the time to profile
> >>>> and tune things. By comparison trading memory for performance is well
> >>>> understood.
> >>>>
> >>>>
> >>>>> "appended to guest memory", "global list of memory", malicious guests
> >>>>> always using that memory like what about NUMA?
> >>>>
> >>>> This can be up to the guest. A good approach would be to take
> >>>> a chunk out of each node and add to the hints buffer.
> >>>
> >>> This might lead to you not using the buffer efficiently. But also,
> >>> different topic.
> >>>
> >>>>
> >>>>> What about different page
> >>>>> granularity?
> >>>>
> >>>> Seems like an orthogonal issue to me.
> >>>
> >>> It is similar, yes. But if you support multiple granularities (e.g.
> >>> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
> >>> of buddy for the buffer. This is different than just a list for each node.
> > 
> > Right but we don't plan to do it yet.
> 
> MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits) and
> early performance numbers indicate it might be the right thing to do. So
> it could be very desirable once we do more performance tests.
> 
> > 
> >> Oh, and before I forget, different zones might of course also be a problem.
> > 
> > I would just split the hint buffer evenly between zones.
> > 
> 
> Thinking about your approach, there is one elementary thing to notice:
> 
> Giving the guest pages from the buffer while hinting requests are being
> processed means that the guest can and will temporarily make use of more
> memory than desired. Essentially up to the point where MADV_FREE is
> finally called for the hinted pages.

Right - but that seems like exactly the reverse of the issue with the current
approach, which is that the guest can temporarily use less memory than desired.

> Even then the guest will logicall
> make use of more memory than desired until core MM takes pages away.

That sounds more like a host issue though. If it wants to,
it can use e.g. MADV_DONTNEED.
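
For reference, the host-side difference boils down to this (plain
userspace illustration, not QEMU code):

#include <stddef.h>
#include <sys/mman.h>

/* what the host can do with a hinted range mapped at addr/len */
static int drop_hinted_range(void *addr, size_t len, int lazy)
{
        /*
         * MADV_FREE: pages stay mapped and are only reclaimed under
         * memory pressure; cheap for the guest if it touches them again
         * soon.  MADV_DONTNEED: backing pages are dropped right away and
         * the next access faults in a zero page, so the cost is paid up
         * front.
         */
        return madvise(addr, len, lazy ? MADV_FREE : MADV_DONTNEED);
}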

> So:
> 1) Unmodified guests will make use of more memory than desired.

One interesting possibility for this is to add the buffer memory
by hotplug after the feature has been negotiated.
I agree this sounds complex.

But I have an idea: how about we include the hint size in the
num_pages counter? Then unmodified guests put
it in the balloon and don't use it. Modified ones
will know to use it just for hinting.
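
Roughly like this (simplified sketch of the config space: num_pages and
actual exist today, the hint_size field and the subtraction are the new,
made-up bits):

struct virtio_balloon_config {
        __le32 num_pages;       /* inflate target; would include hint_size */
        __le32 actual;          /* pages currently in the balloon */
        __le32 hint_size;       /* NEW: pages set aside for hinting */
};

/*
 * An unmodified guest only looks at num_pages and inflates up to it, so
 * the hint area just sits in the balloon.  A modified guest subtracts
 * hint_size and uses those pages for hinting instead.
 */
static u32 inflate_target(const struct virtio_balloon_config *cfg,
                          bool supports_hinting)
{
        u32 target = le32_to_cpu(cfg->num_pages);

        if (supports_hinting)
                target -= le32_to_cpu(cfg->hint_size);
        return target;
}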


> 2) Malicious guests will make use of more memory than desired.

Well, this limitation is fundamental to the balloon, right?
If the host wants to add tracking of balloon memory, it
can enforce the limits. So far no one bothered,
but maybe with this feature we should start to do that.

> 3) Sane, modified guests will make use of more memory than desired.
>
> Instead, we could make our life much easier by doing the following:
> 
> 1) Introduce a parameter to cap the amount of memory concurrently hinted
> similar like you suggested, just don't consider it a buffer value.
> "-device virtio-balloon,hinting_size=1G". This gives us control over the
> hinting proceess.
> 
> hinting_size=0 (default) disables hinting
> 
> The admin can tweak the number along with memory requirements of the
> guest. We can make suggestions (e.g. calculate depending on #cores,#size
> of memory, or simply "1GB")

So if it's all up to the guest and for the benefit of the guest, and
with no cost/benefit to the host, then why are we supplying this value
from the host?

> 2) In the guest, track the size of hints in progress, cap at the
> hinting_size.
> 
> 3) Document hinting behavior
> 
> "When hinting is enabled, memory up to hinting_size might temporarily be
> removed from your guest in order to be hinted to the hypervisor. This is
> only for a very short time, but might affect applications. Consider the
> hinting_size when sizing your guest. If your application was tested with
> XGB and a hinting size of 1G is used, please configure X+1GB for the
> guest. Otherwise, performance degradation might be possible."

OK, so let's start with this. Now let us assume that the guest follows
the advice.  We thus know that 1GB is not needed for guest applications.
So why do we want to allow applications to still use this extra memory?

> 4) Do the loop/yield on OOM as discussed to improve performance when OOM
> and avoid false OOM triggers just to be sure.

Yes, I'm not against trying the simpler approach as a first step.  But
then we need this path actually tested to see whether hinting introduces
unreasonable overhead on this path.  And it is tricky to test OOM as you
are skating close to the system's limits. That's one reason I prefer
avoiding the OOM handler if possible.

When you say yield, I would guess that would involve config space access
to the balloon to flush out outstanding hints?

> 
> BTW, one alternatives I initially had in mind was to add pages from the
> buffer from the OOM handler only and putting these pages back into the
> buffer once freed.

I don't think that works easily - pages get used, so we can't
return them into the buffer. Another problem with only handling OOM
is that OOM is a guest decision. So the host really can't
enforce any limits even if it wants to.

> I thought this might help for certain memory offline
> scenarios where pages stuck in the buffer might hinder offlining of
> memory. And of course, improve performance as the buffer is only touched
> when really needed. But it would only help for memory (e.g. DIMM) added
> after boot, so it is also not 100% safe. Also, same issues as with your
> given approach.

So you can look at this approach as a combination of
- balloon inflate with separate accounting
- deflate on OOM
- hinting
?

Put this way, it seems rather uncontroversial, right?




> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 13:24             ` Michael S. Tsirkin
@ 2019-04-01 14:09               ` David Hildenbrand
  2019-04-01 14:11                 ` David Hildenbrand
  2019-04-01 14:45                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-01 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

>> Thinking about your approach, there is one elementary thing to notice:
>>
>> Giving the guest pages from the buffer while hinting requests are being
>> processed means that the guest can and will temporarily make use of more
>> memory than desired. Essentially up to the point where MADV_FREE is
>> finally called for the hinted pages.
> 
> Right - but that seems like exactly the reverse of the issue with the current
> approach which is guest can temporarily use less memory than desired.
> 
>> Even then the guest will logicall
>> make use of more memory than desired until core MM takes pages away.
> 
> That sounds more like a host issue though. If it wants to
> it can use e.g. MAD_DONTNEED.

Indeed. But MADV_DONTNEED is somewhat undesirable for performance reasons.
You want to do the work when swapping, not when hinting.

But what I wanted to say here: looking at the mere size of your guest
will at least not help you to identify whether more memory than desired
will be used.

> 
>> So:
>> 1) Unmodified guests will make use of more memory than desired.
> 
> One interesting possibility for this is to add the buffer memory
> by hotplug after the feature has been negotiated.
> I agree this sounds complex.

Yes it is, and it goes in the direction of virtio-mem, which essentially
does that. But bad news: memory hotplug is complicated stuff, both on
the hypervisor and the guest side. And things like NUMA make it more involved.

But even then, a malicious guest can simply fake feature negotiation and
make use of all hotplugged memory. So it won't work, at least not for
malicious guests.

> 
> But I have an idea: how about we include the hint size in the
> num_pages counter? Then unmodified guests put
> it in the balloon and don't use it. Modified ones
> will know to use it just for hinting.

These are the nightmares I was talking about. I would like to decouple
this feature as far as possible from balloon inflation/deflation.
Ballooning is 4k-based and might have other undesirable side effects.
Just because somebody wants to use page hinting does not mean he wants
to use ballooning. Effectively, many people will want to avoid
ballooning completely by using page hinting for their use case.

> 
> 
>> 2) Malicious guests will make use of more memory than desired.
> 
> Well this limitation is fundamental to balloon right?

Yep, it is the fundamental issue of ballooning. If memory is available
right from the boot, the system is free to do with it whatever it wants.
(one of the main things virtio-mem will do differently/better)

> If host wants to add tracking of balloon memory, it
> can enforce the limits. So far no one bothered,
> but maybe with this feature we should start to do that.

I think I already had endless rants about why this is not possible.
Ballooning as it is currently implemented by virtio-balloon is broken by
design. Period. You cannot, and never will be able to, distinguish unmodified
guests from malicious guests. Please don't design new approaches based
on a broken design.

> 
>> 3) Sane, modified guests will make use of more memory than desired.
>>
>> Instead, we could make our life much easier by doing the following:
>>
>> 1) Introduce a parameter to cap the amount of memory concurrently hinted
>> similar like you suggested, just don't consider it a buffer value.
>> "-device virtio-balloon,hinting_size=1G". This gives us control over the
>> hinting proceess.
>>
>> hinting_size=0 (default) disables hinting
>>
>> The admin can tweak the number along with memory requirements of the
>> guest. We can make suggestions (e.g. calculate depending on #cores,#size
>> of memory, or simply "1GB")
> 
> So if it's all up to the guest and for the benefit of the guest, and
> with no cost/benefit to the host, then why are we supplying this value
> from the host?

See 3), the admin has to be aware of hinting behavior.

> 
>> 2) In the guest, track the size of hints in progress, cap at the
>> hinting_size.
>>
>> 3) Document hinting behavior
>>
>> "When hinting is enabled, memory up to hinting_size might temporarily be
>> removed from your guest in order to be hinted to the hypervisor. This is
>> only for a very short time, but might affect applications. Consider the
>> hinting_size when sizing your guest. If your application was tested with
>> XGB and a hinting size of 1G is used, please configure X+1GB for the
>> guest. Otherwise, performance degradation might be possible."
> 
> OK, so let's start with this. Now let us assume that guest follows
> the advice.  We thus know that 1GB is not needed for guest applications.
> So why do we want to allow applications to still use this extra memory?

If the application does not need the 1GB, the 1GB will be hinted to the
hypervisor and is effectively only a buffer for the OOM scenario
(ignoring page cache discussions for now).

"So why do we want to allow applications to still use this extra memory"
is the EXACT same issue you have with your buffer approach. Any guest
can make use of the buffer and you won't be able to detect it. Very same
problem. Only in your approach, the guest might agree to play nicely by
not making use of the 1G you provided. Just as if the application does
not need/use the additional 1GB.

The interesting thing is most probably: will the hinting size usually be
reasonably small? At least I guess a guest with 4TB of RAM will not
suddenly get a hinting size of hundreds of GB. Most probably also only
something in the range of 1GB. But this is an interesting question to
look into.

Also, if the admin does not care about performance implications when
already close to hinting, no need to add the additional 1GB to the RAM size.

> 
>> 4) Do the loop/yield on OOM as discussed to improve performance when OOM
>> and avoid false OOM triggers just to be sure.
> 
> Yes, I'm not against trying the simpler approach as a first step.  But
> then we need this path actually tested so see whether hinting introduced
> unreasonable overhead on this path.  And it is tricky to test oom as you
> are skating close to system's limits. That's one reason I prefer
> avoiding oom handler if possible.

The thing about the actual issue we are chasing is that it can only
happen if (as far as I can see)

1) Application uses X MAX_ORDER - 1 pages
2) Application frees X MAX_ORDER - 1 pages
3) Application reallocates X MAX_ORDER - 1 pages

AND

a) There are not enough MAX_ORDER - 1 pages remaining while hinting
b) The allocation request cannot be satisfied from other free pages of
smaller order
c) We actually trigger hinting with the X freed pages
d) Time between 2 and 3 is not enough to complete hinting

Only then will the OOM handler become active. If a reasonable amount of
time passes between 2) and 3), it is not an issue.

And as I said right from the beginning, reproducing this might be
difficult. And especially for this reason, I prefer simpler approaches
if possible. Document it for applications that might be affected, let
the admin enable the feature explicitly, avoid complexity.

> 
> When you say yield, I would guess that would involve config space access
> to the balloon to flush out outstanding hints?

I rather meant yielding the CPU to the hypervisor, so it can process
hinting requests faster (like when waiting for a spinlock). This is the
simple case. More involved approaches might somehow indicate to the
hypervisor to not process queued requests but simply return them to the
guest so the guest can add the isolated pages to the buddy. If this is
"config space access to the balloon to flush out outstanding hints" then
yes, something like that might be a good idea if it doesn't harm
performance.

-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 14:09               ` David Hildenbrand
@ 2019-04-01 14:11                 ` David Hildenbrand
  2019-04-01 14:47                   ` Michael S. Tsirkin
  2019-04-01 14:45                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-01 14:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On 01.04.19 16:09, David Hildenbrand wrote:
>>> Thinking about your approach, there is one elementary thing to notice:
>>>
>>> Giving the guest pages from the buffer while hinting requests are being
>>> processed means that the guest can and will temporarily make use of more
>>> memory than desired. Essentially up to the point where MADV_FREE is
>>> finally called for the hinted pages.
>>
>> Right - but that seems like exactly the reverse of the issue with the current
>> approach which is guest can temporarily use less memory than desired.
>>
>>> Even then the guest will logicall
>>> make use of more memory than desired until core MM takes pages away.
>>
>> That sounds more like a host issue though. If it wants to
>> it can use e.g. MAD_DONTNEED.
> 
> Indeed. But MADV_DONTNEED is somewhat undesired for performance reasons.
> You want to do the work when swapping not when hinting.
> 
> But what I wanted to say here: Looking at the pure size of your guest
> will at least not help you to identify if more memory than desired will
> be used.
> 
>>
>>> So:
>>> 1) Unmodified guests will make use of more memory than desired.
>>
>> One interesting possibility for this is to add the buffer memory
>> by hotplug after the feature has been negotiated.
>> I agree this sounds complex.
> 
> Yes it is, and it goes into the direction of virtio-mem that essentially
> does that. But bad news: memory hotplug is complicated stuff, both on
> the hypervisor and guest side. And things like NUMA make it more involved.
> 
> But even then, malicious guest can simply fake feature negotiation and
> make use of all hotplugged memory. Won't work, at least not for
> malicious guests.
> 
>>
>> But I have an idea: how about we include the hint size in the
>> num_pages counter? Then unmodified guests put
>> it in the balloon and don't use it. Modified ones
>> will know to use it just for hinting.
> 
> These are the nightmares I was talking about. I would like to decouple
> this feature as far as possible from balloon inflation/deflation.
> Ballooning is 4k based and might have other undesirable side effect.
> Just because somebody wants to use page hinting does not mean he wants
> to use ballooning. Effectively, many people will want to avoid
> ballooning completely by using page hinting for their use case.
> 
>>
>>
>>> 2) Malicious guests will make use of more memory than desired.
>>
>> Well this limitation is fundamental to balloon right?
> 
> Yep, it is the fundamental issue of ballooning. If memory is available
> right from the boot, the system is free to do with it whatever it wants.
> (one of the main things virtio-mem will do differently/better)
> 
>> If host wants to add tracking of balloon memory, it
>> can enforce the limits. So far no one bothered,
>> but maybe with this feature we should start to do that.
> 
> I think I already had endless rants about why this is not possible.
> Ballooning as it is currently implemented by virtio-balloon is broken by
> design. Period. You can and never will be able to distinguish unmodified
> guests from malicious guests. Please don't design new approaches based
> on broken design.
> 
>>
>>> 3) Sane, modified guests will make use of more memory than desired.
>>>
>>> Instead, we could make our life much easier by doing the following:
>>>
>>> 1) Introduce a parameter to cap the amount of memory concurrently hinted
>>> similar like you suggested, just don't consider it a buffer value.
>>> "-device virtio-balloon,hinting_size=1G". This gives us control over the
>>> hinting proceess.
>>>
>>> hinting_size=0 (default) disables hinting
>>>
>>> The admin can tweak the number along with memory requirements of the
>>> guest. We can make suggestions (e.g. calculate depending on #cores,#size
>>> of memory, or simply "1GB")
>>
>> So if it's all up to the guest and for the benefit of the guest, and
>> with no cost/benefit to the host, then why are we supplying this value
>> from the host?
> 
> See 3), the admin has to be aware of hinting behavior.
> 
>>
>>> 2) In the guest, track the size of hints in progress, cap at the
>>> hinting_size.
>>>
>>> 3) Document hinting behavior
>>>
>>> "When hinting is enabled, memory up to hinting_size might temporarily be
>>> removed from your guest in order to be hinted to the hypervisor. This is
>>> only for a very short time, but might affect applications. Consider the
>>> hinting_size when sizing your guest. If your application was tested with
>>> XGB and a hinting size of 1G is used, please configure X+1GB for the
>>> guest. Otherwise, performance degradation might be possible."
>>
>> OK, so let's start with this. Now let us assume that guest follows
>> the advice.  We thus know that 1GB is not needed for guest applications.
>> So why do we want to allow applications to still use this extra memory?
> 
> If the application does not need the 1GB, the 1GB will be hinted to the
> hypervisor and are effectively only a buffer for the OOM scenario.
> (ignoring page cache discussions for now).
> 
> "So why do we want to allow applications to still use this extra memory"
> is the EXACT same issue you have with your buffer approach. Any guest
> can make use of the buffer and you won't be able to detect it. Very same
> problem. Only in your approach, the guest might agree to play nicely by
> not making use of the 1G you provided. Just as if the application does
> not need/use the additional 1GB.
> 
> The interesting thing is most probably: Will the hinting size usually be
> reasonable small? At least I guess a guest with 4TB of RAM will not
> suddenly get a hinting size of hundreds of GB. Most probably also only
> something in the range of 1GB. But this is an interesting question to
> look into.
> 
> Also, if the admin does not care about performance implications when
> already close to hinting, no need to add the additional 1Gb to the ram size.

"close to OOM" is what I meant.


-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 14:09               ` David Hildenbrand
  2019-04-01 14:11                 ` David Hildenbrand
@ 2019-04-01 14:45                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-01 14:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Mon, Apr 01, 2019 at 04:09:32PM +0200, David Hildenbrand wrote:
> > 
> > When you say yield, I would guess that would involve config space access
> > to the balloon to flush out outstanding hints?
> 
> I rather meant yield your CPU to the hypervisor, so it can process
> hinting requests faster (like waiting for a spinlock). This is the
> simple case. More involved approaches might somehow indicate to the
> hypervisor to not process queued requests but simply return them to the
> guest so the guest can add the isolated pages to the buddy. If this is
> "config space access to the balloon to flush out outstanding hints" then
> yes, something like that might be a good idea if it doesn't harm
> performance.

The problem would be in testing this, unfortunately. As with any
OOM hack, it is difficult to test well.

> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 14:11                 ` David Hildenbrand
@ 2019-04-01 14:47                   ` Michael S. Tsirkin
  2019-04-01 14:54                     ` David Hildenbrand
  2019-04-01 20:56                     ` Alexander Duyck
  0 siblings, 2 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-01 14:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
> > The interesting thing is most probably: Will the hinting size usually be
> > reasonable small? At least I guess a guest with 4TB of RAM will not
> > suddenly get a hinting size of hundreds of GB. Most probably also only
> > something in the range of 1GB. But this is an interesting question to
> > look into.
> > 
> > Also, if the admin does not care about performance implications when
> > already close to hinting, no need to add the additional 1Gb to the ram size.
> 
> "close to OOM" is what I meant.

The problem is, the host admin is the one adding memory. The guest admin
is the one who knows about performance.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 14:47                   ` Michael S. Tsirkin
@ 2019-04-01 14:54                     ` David Hildenbrand
  2019-04-01 20:56                     ` Alexander Duyck
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-01 14:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On 01.04.19 16:47, Michael S. Tsirkin wrote:
> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>> The interesting thing is most probably: Will the hinting size usually be
>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>> something in the range of 1GB. But this is an interesting question to
>>> look into.
>>>
>>> Also, if the admin does not care about performance implications when
>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>
>> "close to OOM" is what I meant.
> 
> Problem is, host admin is the one adding memory. Guest admin is
> the one that knows about performance.

If we think about guest admins only caring about performance, then a
guest admin will unload the virtio-balloon module to

a) get all the memory available (inflated memory returned to the guest)
b) not use hinting :)

But I get your idea. One side wants hinting, the other side has to agree.

-- 

Thanks,

David / dhildenb


* Re: On guest free page hinting and OOM
  2019-04-01 14:47                   ` Michael S. Tsirkin
  2019-04-01 14:54                     ` David Hildenbrand
@ 2019-04-01 20:56                     ` Alexander Duyck
  2019-04-02  7:42                       ` David Hildenbrand
  1 sibling, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2019-04-01 20:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
> > > The interesting thing is most probably: Will the hinting size usually be
> > > reasonable small? At least I guess a guest with 4TB of RAM will not
> > > suddenly get a hinting size of hundreds of GB. Most probably also only
> > > something in the range of 1GB. But this is an interesting question to
> > > look into.
> > >
> > > Also, if the admin does not care about performance implications when
> > > already close to hinting, no need to add the additional 1Gb to the ram size.
> >
> > "close to OOM" is what I meant.
>
> Problem is, host admin is the one adding memory. Guest admin is
> the one that knows about performance.

The thing we have to keep in mind with this is that we are not dealing
with the same behavior as the balloon driver. We don't need to inflate
a massive hint and hand that off. Instead we can focus on performing
the hints on much smaller amounts and do it incrementally over time
with the idea being as the system sits idle it frees up more and more
of the inactive memory on the system.

With that said, I still don't like the idea of us even trying to
target 1GB of RAM for hinting. I think it would be much better if we
stuck to smaller sizes and kept things down to a single digit multiple
of THP or higher order pages. Maybe something like 64MB of total
memory out for hinting.

All we really would need to make it work would be to possibly look at
seeing if we can combine PageType values. Specifically what I would be
looking at is a transition that looks something like Buddy -> Offline
-> (Buddy | Offline). We would have to hold the zone lock at each
transition, but that shouldn't be too big of an issue. If we are okay
with possibly combining the Offline and Buddy types we would have a
way of tracking which pages have been hinted and which have not. Then
we would just have to have a thread running in the background on the
guest that is looking at the higher order pages and pulling 64MB at a
time offline, and when the hinting is done put them back in the "Buddy
| Offline" state.

I view this all as working not too dissimilar to how a standard Rx
ring in a network device works. Only we would want to allocate from
the pool of "Buddy" pages, flag the pages as "Offline", and then when
the hint has been processed we would place them back in the "Buddy"
list with the "Offline" value still set. The only real changes needed
to the buddy allocator would be to add some logic for clearing/merging
the "Offline" setting as necessary, and to provide an allocator that
only works with non-"Offline" pages.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-01 20:56                     ` Alexander Duyck
@ 2019-04-02  7:42                       ` David Hildenbrand
  2019-04-02 15:04                         ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02  7:42 UTC (permalink / raw)
  To: Alexander Duyck, Michael S. Tsirkin
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 01.04.19 22:56, Alexander Duyck wrote:
> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>> The interesting thing is most probably: Will the hinting size usually be
>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>>> something in the range of 1GB. But this is an interesting question to
>>>> look into.
>>>>
>>>> Also, if the admin does not care about performance implications when
>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>>
>>> "close to OOM" is what I meant.
>>
>> Problem is, host admin is the one adding memory. Guest admin is
>> the one that knows about performance.
> 
> The thing we have to keep in mind with this is that we are not dealing
> with the same behavior as the balloon driver. We don't need to inflate
> a massive hint and hand that off. Instead we can focus on performing
> the hints on much smaller amounts and do it incrementally over time
> with the idea being as the system sits idle it frees up more and more
> of the inactive memory on the system.
> 
> With that said, I still don't like the idea of us even trying to
> target 1GB of RAM for hinting. I think it would be much better if we
> stuck to smaller sizes and kept things down to a single digit multiple
> of THP or higher order pages. Maybe something like 64MB of total
> memory out for hinting.

1GB was just a number I came up with. But please note, as VCPUs hint in
parallel, even though each request is only 64MB in size, things can sum up.

> 
> All we really would need to make it work would be to possibly look at
> seeing if we can combine PageType values. Specifically what I would be
> looking at is a transition that looks something like Buddy -> Offline
> -> (Buddy | Offline). We would have to hold the zone lock at each
> transition, but that shouldn't be too big of an issue. If we are okay
> with possibly combining the Offline and Buddy types we would have a
> way of tracking which pages have been hinted and which have not. Then
> we would just have to have a thread running in the background on the
> guest that is looking at the higher order pages and pulling 64MB at a
> time offline, and when the hinting is done put them back in the "Buddy
> | Offline" state.

That approach may have other issues to solve (1 thread vs. many VCPUs,
scanning all buddy pages over and over again) and other implications
that might be undesirable (hints performed even more delayed, additional
thread activity). I wouldn't call it the ultimate solution.

Your approach sounds very interesting to play with, however
at this point I would like to avoid throwing away Nitesh work once again
to follow some other approach that looks promising. If we keep going
like that, we'll spend another ~10 years working on free page hinting
without getting anything upstream. Especially if it involves more
core-MM changes. We've been there, we've done that. As long as the
guest-host interface is generic enough, we can play with such approaches
later in the guest. Important part is that the guest-host interface
allows for that.

> 
> I view this all as working not too dissimilar to how a standard Rx
> ring in a network device works. Only we would want to allocate from
> the pool of "Buddy" pages, flag the pages as "Offline", and then when
> the hint has been processed we would place them back in the "Buddy"
> list with the "Offline" value still set. The only real changes needed
> to the buddy allocator would be to add some logic for clearing/merging
> the "Offline" setting as necessary, and to provide an allocator that
> only works with non-"Offline" pages.

Sorry, I had to smile at the phrase "only" in combination with "provide
an allocator that only works with non-Offline pages" :) . I guess you
realize yourself that these are core-mm changes that might easily be
rejected upstream because "the virt guys try to teach core-MM yet
another special case". I agree that this is nice to play with,
eventually that approach could succeed and be accepted upstream. But I
consider this long term work.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02  7:42                       ` David Hildenbrand
@ 2019-04-02 15:04                         ` Alexander Duyck
  2019-04-02 15:25                           ` Michael S. Tsirkin
                                             ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 15:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.04.19 22:56, Alexander Duyck wrote:
> > On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>
> >> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
> >>>> The interesting thing is most probably: Will the hinting size usually be
> >>>> reasonable small? At least I guess a guest with 4TB of RAM will not
> >>>> suddenly get a hinting size of hundreds of GB. Most probably also only
> >>>> something in the range of 1GB. But this is an interesting question to
> >>>> look into.
> >>>>
> >>>> Also, if the admin does not care about performance implications when
> >>>> already close to hinting, no need to add the additional 1Gb to the ram size.
> >>>
> >>> "close to OOM" is what I meant.
> >>
> >> Problem is, host admin is the one adding memory. Guest admin is
> >> the one that knows about performance.
> >
> > The thing we have to keep in mind with this is that we are not dealing
> > with the same behavior as the balloon driver. We don't need to inflate
> > a massive hint and hand that off. Instead we can focus on performing
> > the hints on much smaller amounts and do it incrementally over time
> > with the idea being as the system sits idle it frees up more and more
> > of the inactive memory on the system.
> >
> > With that said, I still don't like the idea of us even trying to
> > target 1GB of RAM for hinting. I think it would be much better if we
> > stuck to smaller sizes and kept things down to a single digit multiple
> > of THP or higher order pages. Maybe something like 64MB of total
> > memory out for hinting.
>
> 1GB was just a number I came up with. But please note, as VCPUs hint in
> parallel, even though each request is only 64MB in size, things can sum up.

Why do we need them running in parallel for a single guest? I don't
think we need the hints so quickly that we would need to have multiple
VCPUs running in parallel to provide hints. In addition as it
currently stands in order to get pages into and out of the buddy
allocator we are going to have to take the zone lock anyway so we
could probably just assume a single thread for pulling the memory,
placing it on the ring, and putting it back into the buddy allocator
after the hint has been completed.

> >
> > All we really would need to make it work would be to possibly look at
> > seeing if we can combine PageType values. Specifically what I would be
> > looking at is a transition that looks something like Buddy -> Offline
> > -> (Buddy | Offline). We would have to hold the zone lock at each
> > transition, but that shouldn't be too big of an issue. If we are okay
> > with possibly combining the Offline and Buddy types we would have a
> > way of tracking which pages have been hinted and which have not. Then
> > we would just have to have a thread running in the background on the
> > guest that is looking at the higher order pages and pulling 64MB at a
> > time offline, and when the hinting is done put them back in the "Buddy
> > | Offline" state.
>
> That approach may have other issues to solve (1 thread vs. many VCPUs,
> scanning all buddy pages over and over again) and other implications
> that might be undesirable (hints performed even more delayed, additional
> thread activity). I wouldn't call it the ultimate solution.

So the problem with trying to provide the hint sooner is that you end
up creating a bottle-neck or you end up missing hints on pages
entirely and then have to fall back to such an approach. By just
letting the thread run in the background reporting the idle memory we
can avoid much of that.

Also there isn't a huge priority to report idle memory in real time.
That would be kind of pointless as it might be pulled back out and
reused as soon as it is added. What we need is to give the memory a
bit of time to "cool" so that we aren't constantly hinting away memory
that is still in use.

> Your approach sounds very interesting to play with, however
> at this point I would like to avoid throwing away Nitesh work once again
> to follow some other approach that looks promising. If we keep going
> like that, we'll spend another ~10 years working on free page hinting
> without getting anything upstream. Especially if it involves more
> core-MM changes. We've been there, we've done that. As long as the
> guest-host interface is generic enough, we can play with such approaches
> later in the guest. Important part is that the guest-host interface
> allows for that.

I'm not throwing anything away. One of the issues in Nitesh's design
is that he is going to either miss memory and have to run an
asynchronous thread to clean it up after the fact, or he is going to
cause massive OOM errors and/or have to start halting VCPUs while
waiting on the processing. All I am suggesting is that we can get away
from having to deal with both by just walking through the free pages
for the higher order and hinting only a few at a time without having
to try to provide the host with the hints on what is idle the second
it is freed.

> >
> > I view this all as working not too dissimilar to how a standard Rx
> > ring in a network device works. Only we would want to allocate from
> > the pool of "Buddy" pages, flag the pages as "Offline", and then when
> > the hint has been processed we would place them back in the "Buddy"
> > list with the "Offline" value still set. The only real changes needed
> > to the buddy allocator would be to add some logic for clearing/merging
> > the "Offline" setting as necessary, and to provide an allocator that
> > only works with non-"Offline" pages.
>
> Sorry, I had to smile at the phrase "only" in combination with "provide
> an allocator that only works with non-Offline pages" :) . I guess you
> realize yourself that these are core-mm changes that might easily be
> rejected upstream because "the virt guys try to teach core-MM yet
> another special case". I agree that this is nice to play with,
> eventually that approach could succeed and be accepted upstream. But I
> consider this long term work.

The actual patch for this would probably be pretty small and compared
to some of the other stuff that has gone in recently isn't too far out
of the realm of possibility. It isn't too different than the code that
has already gone in to determine the unused pages for virtio-balloon
free page hinting.

Basically what we would be doing is providing a means for
incrementally transitioning the buddy memory into the idle/offline
state to reduce guest memory overhead. It would require one function
that would walk the free page lists and pluck out pages that don't
have the "Offline" page type set, a one-line change to the logic for
allocating a page as we would need to clear that extra bit of state,
and optionally some bits for how to handle the merge of two "Offline"
pages in the buddy allocator (required for lower order support). It
solves most of the guest side issues with the free page hinting in
that trying to do it via the arch_free_page path is problematic at
best since it was designed for a synchronous setup, not an
asynchronous one.
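
For what it's worth, the "one-line change" on the allocation side might
amount to something like the following (hypothetical sketch only;
prep_new_page() is where freshly allocated pages are prepared today, but
nothing clears an "Offline" marker there in the current code):

	/* in prep_new_page(), as the page leaves the buddy allocator: */
	if (PageOffline(page))
		__ClearPageOffline(page);	/* page is in use again, so forget
						   that it was already hinted */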

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:04                         ` Alexander Duyck
@ 2019-04-02 15:25                           ` Michael S. Tsirkin
  2019-04-02 15:57                             ` David Hildenbrand
  2019-04-02 15:55                           ` David Hildenbrand
  2019-04-02 16:19                           ` David Hildenbrand
  2 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-02 15:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
> Basically what we would be doing is providing a means for
> incrementally transitioning the buddy memory into the idle/offline
> state to reduce guest memory overhead. It would require one function
> that would walk the free page lists and pluck out pages that don't
> have the "Offline" page type set,

I think we will need an interface that gets
an offline page and returns the next online free page.

If we restart the list walk each time we can't guarantee progress.

> a one-line change to the logic for
> allocating a page as we would need to clear that extra bit of state,
> and optionally some bits for how to handle the merge of two "Offline"
> pages in the buddy allocator (required for lower order support). It
> solves most of the guest side issues with the free page hinting in
> that trying to do it via the arch_free_page path is problematic at
> best since it was designed for a synchronous setup, not an
> asynchronous one.


-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:04                         ` Alexander Duyck
  2019-04-02 15:25                           ` Michael S. Tsirkin
@ 2019-04-02 15:55                           ` David Hildenbrand
  2019-04-02 17:30                             ` Alexander Duyck
  2019-04-02 16:19                           ` David Hildenbrand
  2 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 15:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 02.04.19 17:04, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.04.19 22:56, Alexander Duyck wrote:
>>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>>>> The interesting thing is most probably: Will the hinting size usually be
>>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>>>>> something in the range of 1GB. But this is an interesting question to
>>>>>> look into.
>>>>>>
>>>>>> Also, if the admin does not care about performance implications when
>>>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>>>>
>>>>> "close to OOM" is what I meant.
>>>>
>>>> Problem is, host admin is the one adding memory. Guest admin is
>>>> the one that knows about performance.
>>>
>>> The thing we have to keep in mind with this is that we are not dealing
>>> with the same behavior as the balloon driver. We don't need to inflate
>>> a massive hint and hand that off. Instead we can focus on performing
>>> the hints on much smaller amounts and do it incrementally over time
>>> with the idea being as the system sits idle it frees up more and more
>>> of the inactive memory on the system.
>>>
>>> With that said, I still don't like the idea of us even trying to
>>> target 1GB of RAM for hinting. I think it would be much better if we
>>> stuck to smaller sizes and kept things down to a single digit multiple
>>> of THP or higher order pages. Maybe something like 64MB of total
>>> memory out for hinting.
>>
>> 1GB was just a number I came up with. But please note, as VCPUs hint in
>> parallel, even though each request is only 64MB in size, things can sum up.
> 
> Why do we need them running in parallel for a single guest? I don't
> think we need the hints so quickly that we would need to have multiple
> VCPUs running in parallel to provide hints. In addition as it
> currently stands in order to get pages into and out of the buddy
> allocator we are going to have to take the zone lock anyway so we
> could probably just assume a single thread for pulling the memory,
> placing it on the ring, and putting it back into the buddy allocator
> after the hint has been completed.

VCPUs hint when they think the time has come. Hinting in parallel comes
naturally.

> 
>>>
>>> All we really would need to make it work would be to possibly look at
>>> seeing if we can combine PageType values. Specifically what I would be
>>> looking at is a transition that looks something like Buddy -> Offline
>>> -> (Buddy | Offline). We would have to hold the zone lock at each
>>> transition, but that shouldn't be too big of an issue. If we are okay
>>> with possibly combining the Offline and Buddy types we would have a
>>> way of tracking which pages have been hinted and which have not. Then
>>> we would just have to have a thread running in the background on the
>>> guest that is looking at the higher order pages and pulling 64MB at a
>>> time offline, and when the hinting is done put them back in the "Buddy
>>> | Offline" state.
>>
>> That approach may have other issues to solve (1 thread vs. many VCPUs,
>> scanning all buddy pages over and over again) and other implications
>> that might be undesirable (hints performed even more delayed, additional
>> thread activity). I wouldn't call it the ultimate solution.
> 
> So the problem with trying to provide the hint sooner is that you end
> up creating a bottle-neck or you end up missing hints on pages
> entirely and then have to fall back to such an approach. By just
> letting the thread run in the background reporting the idle memory we
> can avoid much of that.
> 
> Also there isn't a huge priority to report idle memory in real time.
> That would be kind of pointless as it might be pulled back out and
> reused as soon as it is added. What we need is to give the memory a
> bit of time to "cool" so that we aren't constantly hinting away memory
> that is still in use.

Depending on the setup, you don't want free memory lying around for too
long in your guest.

> 
>> Your approach sounds very interesting to play with, however
>> at this point I would like to avoid throwing away Nitesh work once again
>> to follow some other approach that looks promising. If we keep going
>> like that, we'll spend another ~10 years working on free page hinting
>> without getting anything upstream. Especially if it involves more
>> core-MM changes. We've been there, we've done that. As long as the
>> guest-host interface is generic enough, we can play with such approaches
>> later in the guest. Important part is that the guest-host interface
>> allows for that.
> 
> I'm not throwing anything away. One of the issues in Nitesh's design
> is that he is going to either miss memory and have to run an
> asynchronous thread to clean it up after the fact, or he is going to
> cause massive OOM errors and/or have to start halting VCPUs while

1. How are we going to miss memory? We are going to miss memory because
we hint on very large chunks, but we all agreed to live with that for now.

2. What are the "massive OOM" errors you are talking about? We have the
one scenario we described Nitesh was not even able to reproduce yet. And
we have ways to mitigate the problem (discussed in this thread).

We have something that seems to work. Let's work from there instead of
scrapping the general design once more, thinking "it is super easy". And
yes, what you propose is pretty much throwing away the current design in
the guest.

> waiting on the processing. All I am suggesting is that we can get away
> from having to deal with both by just walking through the free pages
> for the higher order and hinting only a few at a time without having
> to try to provide the host with the hints on what is idle the second
> it is freed.
> 
>>>
>>> I view this all as working not too dissimilar to how a standard Rx
>>> ring in a network device works. Only we would want to allocate from
>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
>>> the hint has been processed we would place them back in the "Buddy"
>>> list with the "Offline" value still set. The only real changes needed
>>> to the buddy allocator would be to add some logic for clearing/merging
>>> the "Offline" setting as necessary, and to provide an allocator that
>>> only works with non-"Offline" pages.
>>
>> Sorry, I had to smile at the phrase "only" in combination with "provide
>> an allocator that only works with non-Offline pages" :) . I guess you
>> realize yourself that these are core-mm changes that might easily be
>> rejected upstream because "the virt guys try to teach core-MM yet
>> another special case". I agree that this is nice to play with,
>> eventually that approach could succeed and be accepted upstream. But I
>> consider this long term work.
> 
> The actual patch for this would probably be pretty small and compared
> to some of the other stuff that has gone in recently isn't too far out
> of the realm of possibility. It isn't too different then the code that
> has already done in to determine the unused pages for virtio-balloon
> free page hinting.
> 
> Basically what we would be doing is providing a means for
> incrementally transitioning the buddy memory into the idle/offline
> state to reduce guest memory overhead. It would require one function
> that would walk the free page lists and pluck out pages that don't
> have the "Offline" page type set, a one-line change to the logic for
> allocating a page as we would need to clear that extra bit of state,
> and optionally some bits for how to handle the merge of two "Offline"
> pages in the buddy allocator (required for lower order support). It
> solves most of the guest side issues with the free page hinting in
> that trying to do it via the arch_free_page path is problematic at
> best since it was designed for a synchronous setup, not an
> asynchronous one.

This is throwing away work. No, I don't think this is the right path to
follow for now. Feel free to look into it while Nitesh gets something in
shape we know conceptually works and we are starting to know which
issues we are hitting.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:25                           ` Michael S. Tsirkin
@ 2019-04-02 15:57                             ` David Hildenbrand
  2019-04-02 16:18                               ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 15:57 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini,
	lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli

On 02.04.19 17:25, Michael S. Tsirkin wrote:
> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
>> Basically what we would be doing is providing a means for
>> incrementally transitioning the buddy memory into the idle/offline
>> state to reduce guest memory overhead. It would require one function
>> that would walk the free page lists and pluck out pages that don't
>> have the "Offline" page type set,
> 
> I think we will need an interface that gets
> an offline page and returns the next online free page.
> 
> If we restart the list walk each time we can't guarantee progress.

Yes, and essentially we are scanning all the time for chunks vs. we get
notified which chunks are possible hinting candidates. Totally different
design.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:57                             ` David Hildenbrand
@ 2019-04-02 16:18                               ` Alexander Duyck
  2019-04-02 17:08                                 ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 16:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.04.19 17:25, Michael S. Tsirkin wrote:
> > On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
> >> Basically what we would be doing is providing a means for
> >> incrementally transitioning the buddy memory into the idle/offline
> >> state to reduce guest memory overhead. It would require one function
> >> that would walk the free page lists and pluck out pages that don't
> >> have the "Offline" page type set,
> >
> > I think we will need an interface that gets
> > an offline page and returns the next online free page.
> >
> > If we restart the list walk each time we can't guarantee progress.
>
> Yes, and essentially we are scanning all the time for chunks vs. we get
> notified which chunks are possible hinting candidates. Totally different
> design.

The problem as I see it is that we can miss notifications if we become
too backlogged, and that will lead to us having to fall back to
scanning anyway. So instead of trying to implement both, why don't we
just focus on the scanning approach? Otherwise the only other option
is to hold up the guest and make it wait until the hint processing has
completed and at that point we are back to what is essentially just a
synchronous solution with batching anyway.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:04                         ` Alexander Duyck
  2019-04-02 15:25                           ` Michael S. Tsirkin
  2019-04-02 15:55                           ` David Hildenbrand
@ 2019-04-02 16:19                           ` David Hildenbrand
  2 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 16:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 02.04.19 17:04, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.04.19 22:56, Alexander Duyck wrote:
>>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>>>> The interesting thing is most probably: Will the hinting size usually be
>>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>>>>> something in the range of 1GB. But this is an interesting question to
>>>>>> look into.
>>>>>>
>>>>>> Also, if the admin does not care about performance implications when
>>>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>>>>
>>>>> "close to OOM" is what I meant.
>>>>
>>>> Problem is, host admin is the one adding memory. Guest admin is
>>>> the one that knows about performance.
>>>
>>> The thing we have to keep in mind with this is that we are not dealing
>>> with the same behavior as the balloon driver. We don't need to inflate
>>> a massive hint and hand that off. Instead we can focus on performing
>>> the hints on much smaller amounts and do it incrementally over time
>>> with the idea being as the system sits idle it frees up more and more
>>> of the inactive memory on the system.
>>>
>>> With that said, I still don't like the idea of us even trying to
>>> target 1GB of RAM for hinting. I think it would be much better if we
>>> stuck to smaller sizes and kept things down to a single digit multiple
>>> of THP or higher order pages. Maybe something like 64MB of total
>>> memory out for hinting.
>>
>> 1GB was just a number I came up with. But please note, as VCPUs hint in
>> parallel, even though each request is only 64MB in size, things can sum up.
> 
> Why do we need them running in parallel for a single guest? I don't
> think we need the hints so quickly that we would need to have multiple
> VCPUs running in parallel to provide hints. In addition as it
> currently stands in order to get pages into and out of the buddy
> allocator we are going to have to take the zone lock anyway so we
> could probably just assume a single thread for pulling the memory,
> placing it on the ring, and putting it back into the buddy allocator
> after the hint has been completed.
> 
>>>
>>> All we really would need to make it work would be to possibly look at
>>> seeing if we can combine PageType values. Specifically what I would be
>>> looking at is a transition that looks something like Buddy -> Offline
>>> -> (Buddy | Offline). We would have to hold the zone lock at each
>>> transition, but that shouldn't be too big of an issue. If we are okay
>>> with possibly combining the Offline and Buddy types we would have a
>>> way of tracking which pages have been hinted and which have not. Then
>>> we would just have to have a thread running in the background on the
>>> guest that is looking at the higher order pages and pulling 64MB at a
>>> time offline, and when the hinting is done put them back in the "Buddy
>>> | Offline" state.
>>
>> That approach may have other issues to solve (1 thread vs. many VCPUs,
>> scanning all buddy pages over and over again) and other implications
>> that might be undesirable (hints performed even more delayed, additional
>> thread activity). I wouldn't call it the ultimate solution.
> 
> So the problem with trying to provide the hint sooner is that you end
> up creating a bottle-neck or you end up missing hints on pages
> entirely and then have to fall back to such an approach. By just
> letting the thread run in the background reporting the idle memory we
> can avoid much of that.

BTW, what you propose was already suggested in a similar form by Wei
some time (weeks? months?) ago. Back then I thought about something like an
"escalation" mode. If too much MM activity is going on (e.g. close to
OOM, dropping hints, whatever), temporarily stop ordinary hinting and do
basically what you describe.

But I am not sure if dropping hints is actually still a problem.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 16:18                               ` Alexander Duyck
@ 2019-04-02 17:08                                 ` David Hildenbrand
  2019-04-02 17:45                                   ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 17:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 02.04.19 18:18, Alexander Duyck wrote:
> n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.04.19 17:25, Michael S. Tsirkin wrote:
>>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
>>>> Basically what we would be doing is providing a means for
>>>> incrementally transitioning the buddy memory into the idle/offline
>>>> state to reduce guest memory overhead. It would require one function
>>>> that would walk the free page lists and pluck out pages that don't
>>>> have the "Offline" page type set,
>>>
>>> I think we will need an interface that gets
>>> an offline page and returns the next online free page.
>>>
>>> If we restart the list walk each time we can't guarantee progress.
>>
>> Yes, and essentially we are scanning all the time for chunks vs. we get
>> notified which chunks are possible hinting candidates. Totally different
>> design.
> 
> The problem as I see it is that we can miss notifications if we become
> too backlogged, and that will lead to us having to fall back to
> scanning anyway. So instead of trying to implement both why don't we
> just focus on the scanning approach. Otherwise the only other option
> is to hold up the guest and make it wait until the hint processing has
> completed and at that point we are back to what is essentially just a
> synchronous solution with batching anyway.
> 

In general I am not a fan of "there might be a problem, let's try
something completely different". Expect the unexpected. At this point, I
prefer to think about easy solutions to eventual problems, not
completely new designs. As I said, we've been there already.

Related to "falling behind" with hinting. If this is indeed possible
(and I'd like to know under which conditions), I wonder at which point
we no longer care about missed hints. If our guest as a lot of MM
activity, could be that is good that we are dropping hints, because our
guest is so busy, it will reuse pages soon again.

One important point is - I think - that free page hinting does not have
to fit all possible setups. In certain environments it just makes sense
to disable it. Or live with it not giving you "all the hints". E.g.
databases that eat up all free memory either way. The other extreme
would be a simple webserver that is mostly idle.

We are already losing hints on quite some free memory due to the MAX_ORDER
- X discussion. Dropping a couple of other hints shouldn't really hurt.
The question is: are there scenarios where we can completely screw up?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 15:55                           ` David Hildenbrand
@ 2019-04-02 17:30                             ` Alexander Duyck
  2019-04-02 18:53                               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 17:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 2, 2019 at 8:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.04.19 17:04, Alexander Duyck wrote:
> > On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 01.04.19 22:56, Alexander Duyck wrote:
> >>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>
> >>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
> >>>>>> The interesting thing is most probably: Will the hinting size usually be
> >>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
> >>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
> >>>>>> something in the range of 1GB. But this is an interesting question to
> >>>>>> look into.
> >>>>>>
> >>>>>> Also, if the admin does not care about performance implications when
> >>>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
> >>>>>
> >>>>> "close to OOM" is what I meant.
> >>>>
> >>>> Problem is, host admin is the one adding memory. Guest admin is
> >>>> the one that knows about performance.
> >>>
> >>> The thing we have to keep in mind with this is that we are not dealing
> >>> with the same behavior as the balloon driver. We don't need to inflate
> >>> a massive hint and hand that off. Instead we can focus on performing
> >>> the hints on much smaller amounts and do it incrementally over time
> >>> with the idea being as the system sits idle it frees up more and more
> >>> of the inactive memory on the system.
> >>>
> >>> With that said, I still don't like the idea of us even trying to
> >>> target 1GB of RAM for hinting. I think it would be much better if we
> >>> stuck to smaller sizes and kept things down to a single digit multiple
> >>> of THP or higher order pages. Maybe something like 64MB of total
> >>> memory out for hinting.
> >>
> >> 1GB was just a number I came up with. But please note, as VCPUs hint in
> >> parallel, even though each request is only 64MB in size, things can sum up.
> >
> > Why do we need them running in parallel for a single guest? I don't
> > think we need the hints so quickly that we would need to have multiple
> > VCPUs running in parallel to provide hints. In addition as it
> > currently stands in order to get pages into and out of the buddy
> > allocator we are going to have to take the zone lock anyway so we
> > could probably just assume a single thread for pulling the memory,
> > placing it on the ring, and putting it back into the buddy allocator
> > after the hint has been completed.
>
> VCPUs hint when they think the time has come. Hinting in parallel comes
> naturally.

Actually it doesn't, because if we are doing it asynchronously we have
to pull pages out of the zone, which requires the zone lock.
That has been one of the reasons why the patches from Nitesh start
dropping in performance when you enable more than 1 VCPU. If
we are limited by the zone lock it doesn't make sense for us to try to
do things in parallel.

> >
> >>>
> >>> All we really would need to make it work would be to possibly look at
> >>> seeing if we can combine PageType values. Specifically what I would be
> >>> looking at is a transition that looks something like Buddy -> Offline
> >>> -> (Buddy | Offline). We would have to hold the zone lock at each
> >>> transition, but that shouldn't be too big of an issue. If we are okay
> >>> with possibly combining the Offline and Buddy types we would have a
> >>> way of tracking which pages have been hinted and which have not. Then
> >>> we would just have to have a thread running in the background on the
> >>> guest that is looking at the higher order pages and pulling 64MB at a
> >>> time offline, and when the hinting is done put them back in the "Buddy
> >>> | Offline" state.
> >>
> >> That approach may have other issues to solve (1 thread vs. many VCPUs,
> >> scanning all buddy pages over and over again) and other implications
> >> that might be undesirable (hints performed even more delayed, additional
> >> thread activity). I wouldn't call it the ultimate solution.
> >
> > So the problem with trying to provide the hint sooner is that you end
> > up creating a bottle-neck or you end up missing hints on pages
> > entirely and then have to fall back to such an approach. By just
> > letting the thread run in the background reporting the idle memory we
> > can avoid much of that.
> >
> > Also there isn't a huge priority to report idle memory in real time.
> > That would be kind of pointless as it might be pulled back out and
> > reused as soon as it is added. What we need is to give the memory a
> > bit of time to "cool" so that we aren't constantly hinting away memory
> > that is still in use.
>
> Depending on the setup, you don't want free memory lying around for too
> long in your guest.

Right, but you don't need it as soon as it is freed either. If it is
idle in the guest for a few seconds that shouldn't be an issue. The
free page hinting will hurt performance if we are doing it too often
simply because we are going to be triggering a much higher rate of
page faults.

> >
> >> Your approach sounds very interesting to play with, however
> >> at this point I would like to avoid throwing away Nitesh work once again
> >> to follow some other approach that looks promising. If we keep going
> >> like that, we'll spend another ~10 years working on free page hinting
> >> without getting anything upstream. Especially if it involves more
> >> core-MM changes. We've been there, we've done that. As long as the
> >> guest-host interface is generic enough, we can play with such approaches
> >> later in the guest. Important part is that the guest-host interface
> >> allows for that.
> >
> > I'm not throwing anything away. One of the issues in Nitesh's design
> > is that he is going to either miss memory and have to run an
> > asynchronous thread to clean it up after the fact, or he is going to
> > cause massive OOM errors and/or have to start halting VCPUs while
>
> 1. how are we going to miss memory. We are going to miss memory because
> we hint on very huge chunks, but we all agreed to live with that for now.

What I am talking about is that some application frees gigabytes of
memory. As I recall the queue length for a single cpu is only like 1G.
Are we going to be sitting on the backlog of most of system memory
while we process it 1G at a time?
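
(To put rough, purely illustrative numbers on it: an application dropping
64GB at once against a 1GB per-VCPU queue would need that queue drained 64
times over before the last of the memory is actually hinted, and whatever is
sitting isolated in the queue is unusable by the guest in the meantime.)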

> 2. What are the "massive OOM" errors you are talking about? We have the
> one scenario we described Nitesh was not even able to reproduce yet. And
> we have ways to mitigate the problem (discussed in this thread).

So I am referring to the last patch set I have seen. Last I knew, all
the code was doing was assembling lists of isolated pages and placing
them on a queue. I have seen no way that this really limits the length
of the virtqueue, and the length of the isolated page lists is the
only thing that has any specific limits to it. So I see it easily
being possible for a good portion of memory being consumed by the
queue when you consider that what you have is essentially the maximum
length of the isolated page list multiplied by the number of entries
in a virtqueue.

> We have something that seems to work. Let's work from there instead of
> scrapping the general design once more, thinking "it is super easy". And
> yes, what you propose is pretty much throwing away the current design in
> the guest.

Define "work"? The last patch set required massive fixes as it was
causing kernel panics if more than 1 VCPU was enabled and list
corruption in general. I'm sure there are a ton more bugs lurking as
we have only begun to be able to stress this code in any meaningful
way.

For example what happens if someone sits on the mm write semaphore for
an extended period of time on the host? That will shut down all of the
hinting until that is released, and at that point once again any
hinting queues will be stuck on the guest until they can be processed
by the host.

> > waiting on the processing. All I am suggesting is that we can get away
> > from having to deal with both by just walking through the free pages
> > for the higher order and hinting only a few at a time without having
> > to try to provide the host with the hints on what is idle the second
> > it is freed.
> >
> >>>
> >>> I view this all as working not too dissimilar to how a standard Rx
> >>> ring in a network device works. Only we would want to allocate from
> >>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
> >>> the hint has been processed we would place them back in the "Buddy"
> >>> list with the "Offline" value still set. The only real changes needed
> >>> to the buddy allocator would be to add some logic for clearing/merging
> >>> the "Offline" setting as necessary, and to provide an allocator that
> >>> only works with non-"Offline" pages.
> >>
> >> Sorry, I had to smile at the phrase "only" in combination with "provide
> >> an allocator that only works with non-Offline pages" :) . I guess you
> >> realize yourself that these are core-mm changes that might easily be
> >> rejected upstream because "the virt guys try to teach core-MM yet
> >> another special case". I agree that this is nice to play with,
> >> eventually that approach could succeed and be accepted upstream. But I
> >> consider this long term work.
> >
> > The actual patch for this would probably be pretty small and compared
> > to some of the other stuff that has gone in recently isn't too far out
> > of the realm of possibility. It isn't too different then the code that
> > has already done in to determine the unused pages for virtio-balloon
> > free page hinting.
> >
> > Basically what we would be doing is providing a means for
> > incrementally transitioning the buddy memory into the idle/offline
> > state to reduce guest memory overhead. It would require one function
> > that would walk the free page lists and pluck out pages that don't
> > have the "Offline" page type set, a one-line change to the logic for
> > allocating a page as we would need to clear that extra bit of state,
> > and optionally some bits for how to handle the merge of two "Offline"
> > pages in the buddy allocator (required for lower order support). It
> > solves most of the guest side issues with the free page hinting in
> > that trying to do it via the arch_free_page path is problematic at
> > best since it was designed for a synchronous setup, not an
> > asynchronous one.
>
> This is throwing away work. No I don't think this is the right path to
> follow for now. Feel free to look into it while Nitesh gets something in
> shape we know conceptually works and we are starting to know which
> issues we are hitting.

Yes, it is throwing away work. But if the work is running toward a
dead end does it add any value?

I've been looking into the stuff Nitesh has been doing. I don't know
about others, but I have been testing it. That is why I provided the
patches I did to get it stable enough for me to test and address the
regressions it was causing. That is the source of some of my concern.
I think we have been making this overly complex with all the per-cpu
bits and trying to place this in the free path itself. We really need
to scale this back and look at having a single thread with a walker of
some sort just hinting on what memory is sitting in the buddy but not
hinted on. It is a solution that would work, even in a multiple VCPU
case, and is achievable in the short term.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 17:08                                 ` David Hildenbrand
@ 2019-04-02 17:45                                   ` Alexander Duyck
  2019-04-02 17:53                                     ` Michael S. Tsirkin
  2019-04-02 18:21                                     ` David Hildenbrand
  0 siblings, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 17:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.04.19 18:18, Alexander Duyck wrote:
> > n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 02.04.19 17:25, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
> >>>> Basically what we would be doing is providing a means for
> >>>> incrementally transitioning the buddy memory into the idle/offline
> >>>> state to reduce guest memory overhead. It would require one function
> >>>> that would walk the free page lists and pluck out pages that don't
> >>>> have the "Offline" page type set,
> >>>
> >>> I think we will need an interface that gets
> >>> an offline page and returns the next online free page.
> >>>
> >>> If we restart the list walk each time we can't guarantee progress.
> >>
> >> Yes, and essentially we are scanning all the time for chunks vs. we get
> >> notified which chunks are possible hinting candidates. Totally different
> >> design.
> >
> > The problem as I see it is that we can miss notifications if we become
> > too backlogged, and that will lead to us having to fall back to
> > scanning anyway. So instead of trying to implement both why don't we
> > just focus on the scanning approach. Otherwise the only other option
> > is to hold up the guest and make it wait until the hint processing has
> > completed and at that point we are back to what is essentially just a
> > synchronous solution with batching anyway.
> >
>
> In general I am not a fan of "there might be a problem, let's try
> something completely different". Expect the unexpected. At this point, I
> prefer to think about easy solutions to eventual problems. not
> completely new designs. As I said, we've been there already.

The solution as we have it is not "easy". There are a number of race
conditions contained within the code and it doesn't practically scale
when you consider we are introducing multiple threads in both the
isolation and returning of pages to/from the buddy allocator that will
have to function within the zone lock.

> Related to "falling behind" with hinting. If this is indeed possible
> (and I'd like to know under which conditions), I wonder at which point
> we no longer care about missed hints. If our guest as a lot of MM
> activity, could be that is good that we are dropping hints, because our
> guest is so busy, it will reuse pages soon again.

This is making a LOT of assumptions. There are a few scenarios that
can hold up hinting on the host side. One of the limitations of
madvise is that we have to take the mm read semaphore. So if something
is sitting on the write semaphore all of the hints will be blocked
until it is released.
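
To make that dependency concrete, the host-side processing of a hint
essentially boils down to something like this (simplified userspace sketch;
it assumes the guest range has already been translated to a host virtual
address, which is what QEMU would do before calling madvise()):

	#include <sys/mman.h>
	#include <stdio.h>

	/* madvise() takes the mm read semaphore (mmap_sem) internally, so if
	 * any other thread in the process holds it for write, every queued
	 * hint is stuck behind that writer until it lets go. */
	static void process_free_page_hint(void *hva, size_t len)
	{
		if (madvise(hva, len, MADV_DONTNEED))
			perror("madvise");
	}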

> One important point is - I think - that free page hinting does not have
> to fit all possible setups. In certain environments it just makes sense
> to disable it. Or live with it not giving you "all the hints". E.g.
> databases that eat up all free memory either way. The other extreme
> would be a simple webserver that is mostly idle.

My concern is we are introducing massive buffer bloat in the mm
subsystem and it still has the potential for stalling VCPUs if we
don't have room in the VQs. We went through this back in the day with
networking. Adding more buffers is not the solution. The solution is
to have a way to gracefully recover and keep our hinting latency and
buffer bloat to a minimum.

> We are losing hitning of quite free memory already due to the MAX_ORDER
> - X discussion. Dropping a couple of other hints shouldn't really hurt.
> The question is, are there scenarios where we can completely screw up.

My concern is that it can hurt a ton. In my mind the target for a
feature like this is a guest that has something like an application
that will fire up a few times a day, eat up a massive amount of memory,
and then free it all when it is done. Now if that application is
freeing a massive block of memory, and for whatever reason the QEMU
thread that is translating our hint requests into madvise calls cannot
keep up, then we are going to spend the next several hours with that
memory still assigned to an idle guest.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 17:45                                   ` Alexander Duyck
@ 2019-04-02 17:53                                     ` Michael S. Tsirkin
  2019-04-02 20:32                                       ` Alexander Duyck
  2019-04-02 18:21                                     ` David Hildenbrand
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-02 17:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On Tue, Apr 02, 2019 at 10:45:43AM -0700, Alexander Duyck wrote:
> We went through this back in the day with
> networking. Adding more buffers is not the solution. The solution is
> to have a way to gracefully recover and keep our hinting latency and
> buffer bloat to a minimum.

That's an interesting approach, I think that things that end up working
well are NAPI (asychronous notifications), limited batching, XDP (big
aligned buffers) and BQL (accounting). Is that your perspective too?
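
If the BQL part of that analogy carries over here, I'd imagine the
accounting side could be as simple as a byte budget on memory that is out
for hinting. Nothing like this exists in the current patches, it is only a
sketch of the idea (kernel-side fragment, <linux/atomic.h>):

	static atomic64_t hint_bytes_outstanding = ATOMIC64_INIT(0);
	#define HINT_BYTE_LIMIT	(64UL << 20)	/* cap memory "in flight", e.g. 64MB */

	/* called before isolating pages for a hint */
	static bool may_queue_hint(size_t bytes)
	{
		if (atomic64_add_return(bytes, &hint_bytes_outstanding) > HINT_BYTE_LIMIT) {
			atomic64_sub(bytes, &hint_bytes_outstanding);
			return false;	/* skip or defer the hint rather than stall a VCPU */
		}
		return true;
	}

	/* called when the host acknowledges the hint and the pages go back to the buddy */
	static void hint_completed(size_t bytes)
	{
		atomic64_sub(bytes, &hint_bytes_outstanding);
	}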

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 17:45                                   ` Alexander Duyck
  2019-04-02 17:53                                     ` Michael S. Tsirkin
@ 2019-04-02 18:21                                     ` David Hildenbrand
  2019-04-02 19:49                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 18:21 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On 02.04.19 19:45, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.04.19 18:18, Alexander Duyck wrote:
>>> n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 02.04.19 17:25, Michael S. Tsirkin wrote:
>>>>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
>>>>>> Basically what we would be doing is providing a means for
>>>>>> incrementally transitioning the buddy memory into the idle/offline
>>>>>> state to reduce guest memory overhead. It would require one function
>>>>>> that would walk the free page lists and pluck out pages that don't
>>>>>> have the "Offline" page type set,
>>>>>
>>>>> I think we will need an interface that gets
>>>>> an offline page and returns the next online free page.
>>>>>
>>>>> If we restart the list walk each time we can't guarantee progress.
>>>>
>>>> Yes, and essentially we are scanning all the time for chunks vs. we get
>>>> notified which chunks are possible hinting candidates. Totally different
>>>> design.
>>>
>>> The problem as I see it is that we can miss notifications if we become
>>> too backlogged, and that will lead to us having to fall back to
>>> scanning anyway. So instead of trying to implement both why don't we
>>> just focus on the scanning approach. Otherwise the only other option
>>> is to hold up the guest and make it wait until the hint processing has
>>> completed and at that point we are back to what is essentially just a
>>> synchronous solution with batching anyway.
>>>
>>
>> In general I am not a fan of "there might be a problem, let's try
>> something completely different". Expect the unexpected. At this point, I
>> prefer to think about easy solutions to eventual problems, not
>> completely new designs. As I said, we've been there already.
> 
> The solution as we have is not "easy". There are a number of race
> conditions contained within the code and it doesn't practically scale
> when you consider we are introducing multiple threads in both the
> isolation and returning of pages to/from the buddy allocator that will
> have to function within the zone lock.

We are freeing pages already, so we are already using the zone lock. We
essentially only care about two zones (maybe three). There is a lot of
optimization potential.

Regarding the scaling issue, I'd love to have a benchmark showcase this
issue.

> 
>> Related to "falling behind" with hinting. If this is indeed possible
>> (and I'd like to know under which conditions), I wonder at which point
>> we no longer care about missed hints. If our guest has a lot of MM
>> activity, it could be that it is good that we are dropping hints, because our
>> guest is so busy it will reuse pages again soon.
> 
> This is making a LOT of assumptions. There are a few scenarios that
> can hold up hinting on the host side. One of the limitations of
> madvise is that we have to take the mm read semaphore. So if something
> is sitting on the write semaphore all of the hints will be blocked
> until it is released.
> 
>> One important point is - I think - that free page hinting does not have
>> to fit all possible setups. In certain environments it just makes sense
>> to disable it. Or live with it not giving you "all the hints". E.g.
>> databases that eat up all free memory either way. The other extreme
>> would be a simple webserver that is mostly idle.
> 
> My concern is we are introducing massive buffer bloat in the mm
> subsystem and it still has the potential for stalling VCPUs if we
> don't have room in the VQs. We went through this back in the day with
> networking. Adding more buffers is not the solution. The solution is
> to have a way to gracefully recover and keep our hinting latency and
> buffer bloat to a minimum.

I think the main point is that in contrast to real requests, we can
always skip or delay hinting if it "just isn't the right time". I think
that is what is special about "hinting". At least as far as "stalling
VCPUs if we don't have room in the VQs" is concerned.

I agree, adding buffers is not the solution, that's why I also didn't
like Michael's approach.

> 
>> We are losing hinting of quite some free memory already due to the MAX_ORDER
>> - X discussion. Dropping a couple of other hints shouldn't really hurt.
>> The question is, are there scenarios where we can completely screw up?
> 
> My concern is that it can hurt a ton. In my mind the target for a
> feature like this is a guest that has something like an application
> that will fire up a few times a day, eat up a massive amount of memory,
> and then free it all when it is done. Now if that application is
> freeing a massive block of memory and for whatever reason the QEMU
> thread that is translating our hint requests to madvise calls cannot
> keep up then we are going to spend the next several hours with that
> memory still assigned to an idle guest.
> 

That is the very special case I was mentioning. This is *AFAIK* not your
usual workload. It is an interesting use case to have in mind, though, and a
good corner case where things might go wrong.

The other extreme is a system that barely frees (MAX_ORDER - X) pages,
however your thread will still waste cycles scanning for them.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 17:30                             ` Alexander Duyck
@ 2019-04-02 18:53                               ` David Hildenbrand
  2019-04-02 23:43                                 ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 18:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

>>> Why do we need them running in parallel for a single guest? I don't
>>> think we need the hints so quickly that we would need to have multiple
>>> VCPUs running in parallel to provide hints. In addition as it
>>> currently stands in order to get pages into and out of the buddy
>>> allocator we are going to have to take the zone lock anyway so we
>>> could probably just assume a single thread for pulling the memory,
>>> placing it on the ring, and putting it back into the buddy allocator
>>> after the hint has been completed.
>>
>> VCPUs hint when they think the time has come. Hinting in parallel comes
>> naturally.
> 
> Actually it doesn't because if we are doing it asynchronously we are
> having to pull pages out of the zone which requires the zone lock.

Yes, and we already work with zones when freeing. At least one zone.

> That has been one of the reasons why the patches from Nitesh start
> dropping in performance when you start enabling more than 1 VCPU. If
> we are limited by the zone lock it doesn't make sense for us to try to
> do thing in parallel.

That is an interesting point and I'd love to see some performance numbers.

>>> Also there isn't a huge priority to report idle memory in real time.
>>> That would be kind of pointless as it might be pulled back out and
>>> reused as soon as it is added. What we need is to give the memory a
>>> bit of time to "cool" so that we aren't constantly hinting away memory
>>> that is still in use.
>>
>> Depending on the setup, you don't want free memory lying around for too
>> long in your guest.
> 
> Right, but you don't need it as soon as it is freed either. If it is
> idle in the guest for a few seconds that shouldn't be an issue. The
> free page hinting will hurt performance if we are doing it too often
> simply because we are going to be triggering a much higher rate of
> page faults.

Valid point.

> 
>>>
>>>> Your approach sounds very interesting to play with, however
>>>> at this point I would like to avoid throwing away Nitesh's work once again
>>>> to follow some other approach that looks promising. If we keep going
>>>> like that, we'll spend another ~10 years working on free page hinting
>>>> without getting anything upstream. Especially if it involves more
>>>> core-MM changes. We've been there, we've done that. As long as the
>>>> guest-host interface is generic enough, we can play with such approaches
>>>> later in the guest. Important part is that the guest-host interface
>>>> allows for that.
>>>
>>> I'm not throwing anything away. One of the issues in Nitesh's design
>>> is that he is going to either miss memory and have to run an
>>> asynchronous thread to clean it up after the fact, or he is going to
>>> cause massive OOM errors and/or have to start halting VCPUs while
>>
>> 1. How are we going to miss memory? We are going to miss memory because
>> we only hint on huge chunks, but we all agreed to live with that for now.
> 
> What I am talking about is that some application frees gigabytes of
> memory. As I recall the queue length for a single cpu is only like 1G.
> Are we going to be sitting on the backlog of most of system memory
> while we process it 1G at a time?

I think it is something around "pages that fit into a request" *
"numbers of entries in a virtqueue".

> 
>> 2. What are the "massive OOM" errors you are talking about? We have the
>> one scenario we described Nitesh was not even able to reproduce yet. And
>> we have ways to mitigate the problem (discussed in this thread).
> 
> So I am referring to the last patch set I have seen. Last I knew all
> the code was doing was assembling lists of isolated pages and placing
> them on a queue. I have seen no way that this really limits the length
> of the virtqueue, and the length of the isolated page lists is the

Ah yes, we are discussing something along the lines of capping in this
thread: once too much is already being hinted, skip hinting for now.
Which might have good and bad sides. Different discussion.

> only thing that has any specific limits to it. So I see it easily
> being possible for a good portion of memory being consumed by the
> queue when you consider that what you have is essentially the maximum
> length of the isolated page list multiplied by the number of entries
> in a virtqueue.
> 
>> We have something that seems to work. Let's work from there instead of
>> scrapping the general design once more, thinking "it is super easy". And
>> yes, what you propose is pretty much throwing away the current design in
>> the guest.
> 
> Define "work"? The last patch set required massive fixes as it was
> causing kernel panics if more than 1 VCPU was enabled and list
> corruption in general. I'm sure there are a ton more bugs lurking as
> we have only begun to be able to stress this code in any meaningful
> way.

"work" - we get performance numbers that look promising and sorting out
issues in the design we find. This is RFC. We are discussing design
details. If there are issues in the design, let's discuss. If there are
alternatives, let's discuss. Bashing on the quality of prototypes?
Please don't.

> 
> For example what happens if someone sits on the mm write semaphore for
> an extended period of time on the host? That will shut down all of the
> hinting until that is released, and at that point once again any
> hinting queues will be stuck on the guest until they can be processed
> by the host.

I remember that is why we are using asynchronous requests. Combined with
dropping hints when in such a situation (posted hints not getting
processed), nobody would be stuck. Or am I missing something? Yes, then
the issue of dropped hints arises, and that is a different discussion.

> 
>>> waiting on the processing. All I am suggesting is that we can get away
>>> from having to deal with both by just walking through the free pages
>>> for the higher order and hinting only a few at a time without having
>>> to try to provide the host with the hints on what is idle the second
>>> it is freed.
>>>
>>>>>
>>>>> I view this all as working not too dissimilar to how a standard Rx
>>>>> ring in a network device works. Only we would want to allocate from
>>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
>>>>> the hint has been processed we would place them back in the "Buddy"
>>>>> list with the "Offline" value still set. The only real changes needed
>>>>> to the buddy allocator would be to add some logic for clearing/merging
>>>>> the "Offline" setting as necessary, and to provide an allocator that
>>>>> only works with non-"Offline" pages.
>>>>
>>>> Sorry, I had to smile at the phrase "only" in combination with "provide
>>>> an allocator that only works with non-Offline pages" :) . I guess you
>>>> realize yourself that these are core-mm changes that might easily be
>>>> rejected upstream because "the virt guys try to teach core-MM yet
>>>> another special case". I agree that this is nice to play with,
>>>> eventually that approach could succeed and be accepted upstream. But I
>>>> consider this long term work.
>>>
>>> The actual patch for this would probably be pretty small and compared
>>> to some of the other stuff that has gone in recently isn't too far out
>>> of the realm of possibility. It isn't too different then the code that
>>> has already done in to determine the unused pages for virtio-balloon
>>> free page hinting.
>>>
>>> Basically what we would be doing is providing a means for
>>> incrementally transitioning the buddy memory into the idle/offline
>>> state to reduce guest memory overhead. It would require one function
>>> that would walk the free page lists and pluck out pages that don't
>>> have the "Offline" page type set, a one-line change to the logic for
>>> allocating a page as we would need to clear that extra bit of state,
>>> and optionally some bits for how to handle the merge of two "Offline"
>>> pages in the buddy allocator (required for lower order support). It
>>> solves most of the guest side issues with the free page hinting in
>>> that trying to do it via the arch_free_page path is problematic at
>>> best since it was designed for a synchronous setup, not an
>>> asynchronous one.
>>
>> This is throwing away work. No I don't think this is the right path to
>> follow for now. Feel free to look into it while Nitesh gets something in
>> shape we know conceptually works and we are starting to know which
>> issues we are hitting.
> 
> Yes, it is throwing away work. But if the work is running toward a
> dead end does it add any value?

"I'm not throwing anything away. " vs. "Yes, it is throwing away work.",
now we are on the same page.

So your main point here is that you are fairly sure we are running
towards a dead end, right?

> 
> I've been looking into the stuff Nitesh has been doing. I don't know
> about others, but I have been testing it. That is why I provided the
> patches I did to get it stable enough for me to test and address the
> regressions it was causing. That is the source of some of my concern.

Testing and feedback is very much appreciated. You have concerns, they
are valid. I do like discussing concerns, discussing possible solutions,
or finding out that it cannot be solved the easy way. Then throw it away.

Coming up with a clean design that considers problems that are not
directly visible is something I would like to see. But usually they
don't jump at you before prototyping.

The simplest approach so far was "scan for zero pages in the
hypervisor". No changes in the guest needed except setting pages to zero
when freeing. No additional threads in the guest. No hinting. And still
we decided against it.

> I think we have been making this overly complex with all the per-cpu
> bits and trying to place this in the free path itself. We really need

We already removed complexity, at least that is my impression. There are
bugs in there, yes.

> to scale this back and look at having a single thread with a walker of
> some sort just hinting on what memory is sitting in the buddy but not
> hinted on. It is a solution that would work, even in a multiple VCPU
> case, and is achievable in the short term.
Can you write up your complete proposal and start a new thread? What I
understood so far is:

1. Separate hinting thread

2. Use virtio-balloon mechanism similar to Nitesh's work

3. Iterate over !offline pages in the buddy. Take them temporarily out
of the buddy (similar to Nitesh's work). Send them to the hypervisor.
Mark them offline, put them back to the buddy.

4. When a page leaves the buddy, drop the offline marker.
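
(For 4., the allocation-path side could be as small as something like the
following -- illustrative only, the helper name is made up:)

#include <linux/mm.h>

/* Sketch: clear the marker when the allocator hands the page out, so an
 * in-use page can never be mistaken for an already-hinted one. */
static inline void page_hinting_clear_offline(struct page *page)
{
        if (PageOffline(page))
                __ClearPageOffline(page);
}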


Selected issues to be sorted out:
- We have to find a way to mark pages offline. We are effectively
touching pages we don't own (keeping flags set when returning pages to
the buddy). Core MM has to accept this change.
- We might teach other users how to treat buddy pages now. Offline
always has to be cleared.
- How to limit the cycles wasted scanning? Idle guests?
- How to efficiently scan a list that might always change between
hinting requests?
- How to avoid OOM that can still happen in corner cases, after all you
are taking pages out of the buddy temporarily.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 18:21                                     ` David Hildenbrand
@ 2019-04-02 19:49                                       ` Michael S. Tsirkin
  2019-04-02 20:32                                         ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-02 19:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On Tue, Apr 02, 2019 at 08:21:30PM +0200, David Hildenbrand wrote:
> The other extreme is a system that barely frees (MAX_ORDER - X) pages,
> however your thread will still waste cycles scanning for them.

I don't think we need to scan as such. An arch hook
that queues a job to a wq only when there's work to
do will fix the issue, right?
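
(A rough sketch of such a hook -- names made up, nothing here is from a
posted patch: the free path only queues work for large enough pages, so an
idle guest never wakes the hinting machinery:)

#include <linux/workqueue.h>
#include <linux/mm.h>

#define HINTING_ORDER (MAX_ORDER - 1)

static void page_hinting_work_fn(struct work_struct *work);
static DECLARE_WORK(page_hinting_work, page_hinting_work_fn);

static inline void page_hinting_notify_free(unsigned int order)
{
        if (order < HINTING_ORDER)
                return;
        /* No-op if already pending, so this stays cheap on the free path. */
        schedule_work(&page_hinting_work);
}

static void page_hinting_work_fn(struct work_struct *work)
{
        /* walk the free lists / feed the virtqueue, see the rest of the thread */
}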

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 17:53                                     ` Michael S. Tsirkin
@ 2019-04-02 20:32                                       ` Alexander Duyck
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On Tue, Apr 2, 2019 at 10:53 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Apr 02, 2019 at 10:45:43AM -0700, Alexander Duyck wrote:
> > We went through this back in the day with
> > networking. Adding more buffers is not the solution. The solution is
> > to have a way to gracefully recover and keep our hinting latency and
> > buffer bloat to a minimum.
>
> That's an interesting approach, I think that things that end up working
> well are NAPI (asynchronous notifications), limited batching, XDP (big
> aligned buffers) and BQL (accounting). Is that your perspective too?
>

Yes, that is kind of what I was getting at.

Basically we could have a kthread running somewhere that goes through
and pulls something like 64M of pages out of the MAX_ORDER - 1
freelist, does what is necessary to isolate them, puts them on a queue
somewhere, kicks the virtio ring, and waits for the response to come
back indicating that the hints have been processed. We would just have
to keep it running until the list doesn't have enough non-"Offline"
memory to fulfill the request. Then we just wait until we again reach
a level necessary to justify waking the thread back up and repeat.

In my mind it looks a lot like your standard Rx ring in that we
allocate some fixed number of buffers and wait for hardware to tell us
when the buffers are ready. The only extra complexity is having to add
tracking using the PageType "Offline" bit which should be cheap when
we are already having to manipulate the "Buddy" PageType anyway.

It would let us get away from having to do the per-cpu queues and
complicated coordination logic to translate free pages to their buddy.
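
(Roughly, such a thread could look like the following -- a sketch only, where
hint_isolate_free_pages(), hint_send_and_wait() and hint_return_to_buddy()
are placeholders for the real buddy/virtqueue plumbing and the 64M budget is
just the number used as an example above:)

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/list.h>
#include <linux/mm.h>

#define HINTING_ORDER   (MAX_ORDER - 1)
/* roughly 64M worth of HINTING_ORDER pages per batch */
#define HINTING_BATCH   ((64UL << 20) >> (PAGE_SHIFT + HINTING_ORDER))

/* Placeholders for the real implementation: */
unsigned int hint_isolate_free_pages(struct list_head *list, unsigned int order,
                                     unsigned int budget);
void hint_send_and_wait(struct list_head *list);
void hint_return_to_buddy(struct list_head *list, unsigned int order);

static int page_hinting_thread(void *arg)
{
        LIST_HEAD(batch);

        while (!kthread_should_stop()) {
                unsigned int got;

                /*
                 * Pluck pages that are free in the buddy but do not have
                 * the "Offline" type set yet, up to the batch budget.
                 */
                got = hint_isolate_free_pages(&batch, HINTING_ORDER,
                                              HINTING_BATCH);
                if (!got) {
                        /* Nothing left worth hinting, go back to sleep. */
                        schedule_timeout_interruptible(HZ);
                        continue;
                }

                /* Put the batch on the virtqueue and wait for the host. */
                hint_send_and_wait(&batch);

                /* Return the pages, keeping the "Offline" type set. */
                hint_return_to_buddy(&batch, HINTING_ORDER);
        }
        return 0;
}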

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 19:49                                       ` Michael S. Tsirkin
@ 2019-04-02 20:32                                         ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel,
	dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli,
	Dave Hansen

On 02.04.19 21:49, Michael S. Tsirkin wrote:
> On Tue, Apr 02, 2019 at 08:21:30PM +0200, David Hildenbrand wrote:
>> The other extreme is a system that barely frees (MAX_ORDER - X) pages,
>> however your thread will still waste cycles scanning for them.
> 
> I don't think we need to scan as such. An arch hook
> that queues a job to a wq only when there's work to
> do will fix the issue, right?
> 

Well, even a system that is basically idle can theoretically free a
handful of pages every now and then that you would like to report. There
would have to be a mechanism to detect if pages worth hinting have been
freed and need processing, IOW something like a callback after buddy
merging we are already dealing with.

That can then be a simple flag which the hinting thread considers (e.g.
wake up every X seconds to check if there is work) or kicking the
hinting thread like you suggest.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 18:53                               ` David Hildenbrand
@ 2019-04-02 23:43                                 ` Alexander Duyck
  2019-04-03 19:43                                   ` David Hildenbrand
  2019-04-04 13:28                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2019-04-02 23:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 2, 2019 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>> Why do we need them running in parallel for a single guest? I don't
> >>> think we need the hints so quickly that we would need to have multiple
> >>> VCPUs running in parallel to provide hints. In addition as it
> >>> currently stands in order to get pages into and out of the buddy
> >>> allocator we are going to have to take the zone lock anyway so we
> >>> could probably just assume a single thread for pulling the memory,
> >>> placing it on the ring, and putting it back into the buddy allocator
> >>> after the hint has been completed.
> >>
> >> VCPUs hint when they think the time has come. Hinting in parallel comes
> >> naturally.
> >
> > Actually it doesn't because if we are doing it asynchronously we are
> > having to pull pages out of the zone which requires the zone lock.
>
> Yes, and we already work with zones when freeing. At least one zone.
>
> > That has been one of the reasons why the patches from Nitesh start
> > dropping in performance when you start enabling more than 1 VCPU. If
> > we are limited by the zone lock it doesn't make sense for us to try to
> > do thing in parallel.
>
> That is an interesting point and I'd love to see some performance numbers.

So the last time I collected data it was with the virtio-balloon patch set,
for which I ended up having to make a number of fixes and tweaks. I believe the
patches can be found online as I emailed them to the list and Nitesh,
but I don't have them handy to point to.

Results w/ THP
Baseline
[root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,501001,93.72,500312,93.72,501001
2,918688,87.49,837092,87.51,1002002
3,1300535,81.22,1200746,81.39,1503003
4,1718865,75.01,1522041,75.20,2004004
5,2032902,68.77,1826264,69.26,2505005
6,2309724,62.55,1979819,63.89,3006006
7,2609748,56.30,1935436,60.20,3507007
8,2705883,50.07,1913416,57.45,4008008
9,2738392,43.84,2017198,51.24,4509009
10,2913739,37.63,1906649,48.65,5010010
11,2996000,31.41,1973332,41.86,5511011
12,2930790,25.19,1928318,37.33,6012012
13,2876603,18.97,2040026,31.83,6513013
14,2820274,12.77,2060417,27.19,7014014
15,2729018,6.55,2134531,24.33,7515015
16,2682826,0.36,2146440,21.25,8016016

My Patch Set
[root@localhost will-it-scale]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,459575,93.74,458546,93.73,459575
2,901990,87.47,841478,87.49,919150
3,1307078,81.26,1193380,81.43,1378725
4,1717429,75.03,1529761,75.30,1838300
5,2045371,68.79,1765334,70.01,2297875
6,2272685,62.56,1893388,65.42,2757450
7,2583919,56.34,2078468,59.85,3217025
8,2777732,50.10,2009627,57.08,3676600
9,2932699,43.90,1938315,52.00,4136175
10,2935508,37.70,1982124,46.55,4595750
11,2881811,31.45,2162791,41.36,5055325
12,2947880,25.27,2058337,38.93,5514900
13,2925530,19.11,1937080,32.13,5974475
14,2867833,12.89,2023161,25.80,6434050
15,2856156,6.69,2067414,24.67,6893625
16,2775991,0.53,2062535,17.46,7353200

Modified RH Virtio-Balloon based patch set
[root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,522672,93.73,524206,93.73,524206
2,914612,87.47,828489,87.66,1048412
3,1336109,81.25,1156889,82.15,1572618
4,1638776,75.01,1419247,76.75,2096824
5,1982146,68.77,1676836,71.27,2621030
6,2211653,62.57,1865976,65.60,3145236
7,2456776,56.33,2111887,57.98,3669442
8,2594395,50.10,2101993,54.17,4193648
9,2672871,43.90,1864173,53.01,4717854
10,2695456,37.69,2152126,45.82,5242060
11,2702604,31.44,1962406,42.50,5766266
12,2702415,25.22,2078596,35.01,6290472
13,2677250,19.02,2068953,35.42,6814678
14,2612990,12.80,2053951,30.77,7338884
15,2521812,6.67,1876602,26.42,7863090
16,2472506,0.53,1957658,20.16,8387296

Basically when you compare the direct approach from the patch set I
submitted versus the one using the virtio approach, the virtio has
better single thread performance, but doesn't scale as well as my
patch set did. That is why I am thinking that if we can avoid trying
to scale out to per-cpu and instead focus on just having one thread
handle the feeding of the hints to the virtio device we could avoid
the scaling penalty and instead get the best of both worlds.

> >>> Also there isn't a huge priority to report idle memory in real time.
> >>> That would be kind of pointless as it might be pulled back out and
> >>> reused as soon as it is added. What we need is to give the memory a
> >>> bit of time to "cool" so that we aren't constantly hinting away memory
> >>> that is still in use.
> >>
> >> Depending on the setup, you don't want free memory lying around for too
> >> long in your guest.
> >
> > Right, but you don't need it as soon as it is freed either. If it is
> > idle in the guest for a few seconds that shouldn't be an issue. The
> > free page hinting will hurt performance if we are doing it too often
> > simply because we are going to be triggering a much higher rate of
> > page faults.
>
> Valid point.
>
> >
> >>>
> >>>> Your approach sounds very interesting to play with, however
> >>>> at this point I would like to avoid throwing away Nitesh's work once again
> >>>> to follow some other approach that looks promising. If we keep going
> >>>> like that, we'll spend another ~10 years working on free page hinting
> >>>> without getting anything upstream. Especially if it involves more
> >>>> core-MM changes. We've been there, we've done that. As long as the
> >>>> guest-host interface is generic enough, we can play with such approaches
> >>>> later in the guest. Important part is that the guest-host interface
> >>>> allows for that.
> >>>
> >>> I'm not throwing anything away. One of the issues in Nitesh's design
> >>> is that he is going to either miss memory and have to run an
> >>> asynchronous thread to clean it up after the fact, or he is going to
> >>> cause massive OOM errors and/or have to start halting VCPUs while
> >>
> >> 1. How are we going to miss memory? We are going to miss memory because
> >> we only hint on huge chunks, but we all agreed to live with that for now.
> >
> > What I am talking about is that some application frees gigabytes of
> > memory. As I recall the queue length for a single cpu is only like 1G.
> > Are we going to be sitting on the backlog of most of system memory
> > while we process it 1G at a time?
>
> I think it is something around "pages that fit into a request" *
> "numbers of entries in a virtqueue".

Right, but then we start spinning if there are no free entries in the
virtqueue. That isn't a very asynchronous way to handle things if we
are just going to peg a VCPU busy waiting for the virtqueue to free up
a buffer.

If we are going to go asynchronous we should be completely
asynchronous. Right now we still can end up eating a large amount of
memory with request size * virtqueue size, and then when we hit that
limit there is a while loop that will spin until we have a free
buffer.
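
(To put purely illustrative numbers on it: if each hinted chunk is a
MAX_ORDER - 1 page, i.e. 4M with 4K base pages, and a request carries 16
such chunks, then a 256-entry virtqueue can have 4M * 16 * 256 = 16G
isolated and in flight at once. The real constants differ, but it is the
product that matters.)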

> >
> >> 2. What are the "massive OOM" errors you are talking about? We have the
> >> one scenario we described Nitesh was not even able to reproduce yet. And
> >> we have ways to mitigate the problem (discussed in this thread).
> >
> > So I am referring to the last patch set I have seen. Last I knew all
> > the code was doing was assembling lists of isolated pages and placing
> > them on a queue. I have seen no way that this really limits the length
> > of the virtqueue, and the length of the isolated page lists is the
>
> Ah yes, we are discussing something along the lines of capping in this
> thread: once too much is already being hinted, skip hinting for now.
> Which might have good and bad sides. Different discussion.

I'd say that it is another argument for trying to get away from the
per-cpu lists though if we need to put limits on it. Trying to
maintain a reasonable limit while coordinating between CPUs would be a
challenge.

> > only thing that has any specific limits to it. So I see it easily
> > being possible for a good portion of memory being consumed by the
> > queue when you consider that what you have is essentially the maximum
> > length of the isolated page list multiplied by the number of entries
> > in a virtqueue.
> >
> >> We have something that seems to work. Let's work from there instead of
> >> scrapping the general design once more, thinking "it is super easy". And
> >> yes, what you propose is pretty much throwing away the current design in
> >> the guest.
> >
> > Define "work"? The last patch set required massive fixes as it was
> > causing kernel panics if more than 1 VCPU was enabled and list
> > corruption in general. I'm sure there are a ton more bugs lurking as
> > we have only begun to be able to stress this code in any meaningful
> > way.
>
> "work" - we get performance numbers that look promising and sorting out
> issues in the design we find. This is RFC. We are discussing design
> details. If there are issues in the design, let's discuss. If there are
> alternatives, let's discuss. Bashing on the quality of prototypes?
> Please don't.

I'm not so much bashing the quality as the lack of data. It is hard to
say something is "working" when you have a hard time getting it to
stay up long enough to collect any reasonable data. My concern is that
the data looks really great when you don't have all the proper
critical sections handled correctly, but when you add the locking that
needs to be there it can make the whole point of an entire patch set
moot.

> >
> > For example what happens if someone sits on the mm write semaphore for
> > an extended period of time on the host? That will shut down all of the
> > hinting until that is released, and at that point once again any
> > hinting queues will be stuck on the guest until they can be processed
> > by the host.
>
> I remember that is why we are using asynchronous requests. Combined with
> dropping hints when in such a situation (posted hints not getting
> processed), nobody would be stuck. Or am I missing something? Yes, then
> the issue of dropped hints arises, and that is a different discussion.

Right, but while that lock is stuck you have all the memory sitting in
the queues. That is the reason why I want to keep the amount of memory
that can be sitting in these queues on the smaller side and closer to
something like 64M. I don't want the guest to have to be stuck either
dropping hints, or starving a guest for memory because it can't get
the hints serviced.

> >
> >>> waiting on the processing. All I am suggesting is that we can get away
> >>> from having to deal with both by just walking through the free pages
> >>> for the higher order and hinting only a few at a time without having
> >>> to try to provide the host with the hints on what is idle the second
> >>> it is freed.
> >>>
> >>>>>
> >>>>> I view this all as working not too dissimilar to how a standard Rx
> >>>>> ring in a network device works. Only we would want to allocate from
> >>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
> >>>>> the hint has been processed we would place them back in the "Buddy"
> >>>>> list with the "Offline" value still set. The only real changes needed
> >>>>> to the buddy allocator would be to add some logic for clearing/merging
> >>>>> the "Offline" setting as necessary, and to provide an allocator that
> >>>>> only works with non-"Offline" pages.
> >>>>
> >>>> Sorry, I had to smile at the phrase "only" in combination with "provide
> >>>> an allocator that only works with non-Offline pages" :) . I guess you
> >>>> realize yourself that these are core-mm changes that might easily be
> >>>> rejected upstream because "the virt guys try to teach core-MM yet
> >>>> another special case". I agree that this is nice to play with,
> >>>> eventually that approach could succeed and be accepted upstream. But I
> >>>> consider this long term work.
> >>>
> >>> The actual patch for this would probably be pretty small and compared
> >>> to some of the other stuff that has gone in recently isn't too far out
> >>> of the realm of possibility. It isn't too different than the code that
> >>> has already gone in to determine the unused pages for virtio-balloon
> >>> free page hinting.
> >>>
> >>> Basically what we would be doing is providing a means for
> >>> incrementally transitioning the buddy memory into the idle/offline
> >>> state to reduce guest memory overhead. It would require one function
> >>> that would walk the free page lists and pluck out pages that don't
> >>> have the "Offline" page type set, a one-line change to the logic for
> >>> allocating a page as we would need to clear that extra bit of state,
> >>> and optionally some bits for how to handle the merge of two "Offline"
> >>> pages in the buddy allocator (required for lower order support). It
> >>> solves most of the guest side issues with the free page hinting in
> >>> that trying to do it via the arch_free_page path is problematic at
> >>> best since it was designed for a synchronous setup, not an
> >>> asynchronous one.
> >>
> >> This is throwing away work. No I don't think this is the right path to
> >> follow for now. Feel free to look into it while Nitesh gets something in
> >> shape we know conceptually works and we are starting to know which
> >> issues we are hitting.
> >
> > Yes, it is throwing away work. But if the work is running toward a
> > dead end does it add any value?
>
> "I'm not throwing anything away. " vs. "Yes, it is throwing away work.",
> now we are on the same page.
>
> So your main point here is that you are fairly sure we are running
> towards a dead end, right?

Yes. There is a ton of code here that is adding complexity and bugs
that I would consider waste. If we can move to a single threaded
approach the code could become much simpler as we will only have a
single queue that we have to service and we could get away from 2
levels of lists and the allocation that goes with them. In addition I
think we could get away with much more code reuse as the
get_free_page_and_send function could likely be adapted to provide a
more generic function for adding a page of a specific size to the
queue versus the current assumption that the only page size is
VIRTIO_BALLOON_FREE_PAGE_ORDER.

> >
> > I've been looking into the stuff Nitesh has been doing. I don't know
> > about others, but I have been testing it. That is why I provided the
> > patches I did to get it stable enough for me to test and address the
> > regressions it was causing. That is the source of some of my concern.
>
> Testing and feedback is very much appreciated. You have concerns, they
> are valid. I do like discussing concerns, discussing possible solutions,
> or finding out that it cannot be solved the easy way. Then throw it away.
>
> Coming up with a clean design that considers problems that are not
> directly visible is something I would like to see. But usually they
> don't jump at you before prototyping.
>
> The simplest approach so far was "scan for zero pages in the
> hypervisor". No changes in the guest needed except setting pages to zero
> when freeing. No additional threads in the guest. No hinting. And still
> we decided against it.

Right, I get that. There is still going to be a certain amount of
overhead for zeroing the pages, and adding the scanning on the host will
not come cheap. I had considered something similar when I first looked
into this.

> > I think we have been making this overly complex with all the per-cpu
> > bits and trying to place this in the free path itself. We really need
>
> We already removed complexity, at least that is my impression. There are
> bugs in there, yes.

So one of the concerns I have at this point is the sheer number of
allocations and the list shuffling that is having to take place.

The design of the last patch set had us enqueueing addresses on the
per-cpu "free_pages_obj". Then when we hit a certain threshold it will
call guest_free_page_hinting which is allocated via kmalloc and then
populated with the pages that can be isolated. Then that function
calls guest_free_page_report which will yet again kmalloc a hint_req
object that just contains a pointer to our isolated pages list. If we
can keep the page count small we could just do away with all of that
and instead do something more like get_free_page_and_send.

> > to scale this back and look at having a single thread with a walker of
> > some sort just hinting on what memory is sitting in the buddy but not
> > hinted on. It is a solution that would work, even in a multiple VCPU
> > case, and is achievable in the short term.
> Can you write up your complete proposal and start a new thread? What I
> understood so far is:
>
> 1. Separate hinting thread
>
> 2. Use virtio-balloon mechanism similar to Nitesh's work
>
> 3. Iterate over !offline pages in the buddy. Take them temporarily out
> of the buddy (similar to Nitesh's work). Send them to the hypervisor.
> Mark them offline, put them back to the buddy.
>
> 4. When a page leaves the buddy, drop the offline marker.

Yep, that is pretty much it. I'll see if I can get a write-up of it tomorrow.

>
> Selected issues to be sorted out:
> - We have to find a way to mark pages offline. We are effectively
> touching pages we don't own (keeping flags set when returning pages to
> the buddy). Core MM has to accept this change.
> - We might teach other users how to treat buddy pages now. Offline
> always has to be cleared.
> - How to limit the cycles wasted scanning? Idle guests?

One thought I had would be to look at splitting nr_free_pages in each
free_area into two free running counters, one for the
number of pages added, and one for the number of pages removed. Then
it would be pretty straight forward to determine how many are
available as we could maintain a free running counter for each free
area in the balloon driver to determine if we need to scan a given
area.
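
(A sketch of that bookkeeping, field names made up:)

#include <linux/atomic.h>

/* Two free-running counters instead of a single nr_free. */
struct free_area_counters {
        atomic_long_t added;    /* pages that entered this free list */
        atomic_long_t removed;  /* pages that left this free list */
};

/* Per-(zone, order) snapshot kept by the balloon/hinting driver. */
struct hint_snapshot {
        unsigned long added;
};

static bool hint_area_needs_scan(struct free_area_counters *c,
                                 struct hint_snapshot *snap)
{
        unsigned long added = atomic_long_read(&c->added);
        unsigned long removed = atomic_long_read(&c->removed);
        bool new_frees = added != snap->added;

        snap->added = added;

        /*
         * Scan only if something was freed since we last looked and the
         * area is not currently empty (free-running counters, so the
         * subtraction is wrap-safe).
         */
        return new_frees && (added - removed);
}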

> - How to efficiently scan a list that might always change between
> hinting requests?
> - How to avoid OOM that can still happen in corner cases, after all you
> are taking pages out of the buddy temporarily.

Yes, but hopefully it should be a small enough amount that nobody will
notice. In many cases devices such as NICs can consume much more than
this regularly for just their Rx buffers and it is not an issue. There
has to be a certain amount of overhead that any given device is
allowed to consume. If we contain the balloon hinting to just 64M that
should be a small enough amount that nobody would notice in practice.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 23:43                                 ` Alexander Duyck
@ 2019-04-03 19:43                                   ` David Hildenbrand
  2019-04-04 13:28                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-03 19:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 03.04.19 01:43, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>> Why do we need them running in parallel for a single guest? I don't
>>>>> think we need the hints so quickly that we would need to have multiple
>>>>> VCPUs running in parallel to provide hints. In addition as it
>>>>> currently stands in order to get pages into and out of the buddy
>>>>> allocator we are going to have to take the zone lock anyway so we
>>>>> could probably just assume a single thread for pulling the memory,
>>>>> placing it on the ring, and putting it back into the buddy allocator
>>>>> after the hint has been completed.
>>>>
>>>> VCPUs hint when they think the time has come. Hinting in parallel comes
>>>> naturally.
>>>
>>> Actually it doesn't because if we are doing it asynchronously we are
>>> having to pull pages out of the zone which requires the zone lock.
>>
>> Yes, and we already work with zones when freeing. At least one zone.
>>
>>> That has been one of the reasons why the patches from Nitesh start
>>> dropping in performance when you start enabling more than 1 VCPU. If
>>> we are limited by the zone lock it doesn't make sense for us to try to
>>> do thing in parallel.
>>
>> That is an interesting point and I'd love to see some performance numbers.
> 
> So the last time I collected data it was with the virtio-balloon patch set,
> for which I ended up having to make a number of fixes and tweaks. I believe the
> patches can be found online as I emailed them to the list and Nitesh,
> but I don't have them handy to point to.

AFAIK Nitesh integrated them.

> 
> Results w/ THP
> Baseline
> [root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,501001,93.72,500312,93.72,501001
> 2,918688,87.49,837092,87.51,1002002
> 3,1300535,81.22,1200746,81.39,1503003
> 4,1718865,75.01,1522041,75.20,2004004
> 5,2032902,68.77,1826264,69.26,2505005
> 6,2309724,62.55,1979819,63.89,3006006
> 7,2609748,56.30,1935436,60.20,3507007
> 8,2705883,50.07,1913416,57.45,4008008
> 9,2738392,43.84,2017198,51.24,4509009
> 10,2913739,37.63,1906649,48.65,5010010
> 11,2996000,31.41,1973332,41.86,5511011
> 12,2930790,25.19,1928318,37.33,6012012
> 13,2876603,18.97,2040026,31.83,6513013
> 14,2820274,12.77,2060417,27.19,7014014
> 15,2729018,6.55,2134531,24.33,7515015
> 16,2682826,0.36,2146440,21.25,8016016
> 
> My Patch Set
> [root@localhost will-it-scale]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,459575,93.74,458546,93.73,459575
> 2,901990,87.47,841478,87.49,919150
> 3,1307078,81.26,1193380,81.43,1378725
> 4,1717429,75.03,1529761,75.30,1838300
> 5,2045371,68.79,1765334,70.01,2297875
> 6,2272685,62.56,1893388,65.42,2757450
> 7,2583919,56.34,2078468,59.85,3217025
> 8,2777732,50.10,2009627,57.08,3676600
> 9,2932699,43.90,1938315,52.00,4136175
> 10,2935508,37.70,1982124,46.55,4595750
> 11,2881811,31.45,2162791,41.36,5055325
> 12,2947880,25.27,2058337,38.93,5514900
> 13,2925530,19.11,1937080,32.13,5974475
> 14,2867833,12.89,2023161,25.80,6434050
> 15,2856156,6.69,2067414,24.67,6893625
> 16,2775991,0.53,2062535,17.46,7353200
> 
> Modified RH Virtio-Balloon based patch set
> [root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,522672,93.73,524206,93.73,524206
> 2,914612,87.47,828489,87.66,1048412
> 3,1336109,81.25,1156889,82.15,1572618
> 4,1638776,75.01,1419247,76.75,2096824
> 5,1982146,68.77,1676836,71.27,2621030
> 6,2211653,62.57,1865976,65.60,3145236
> 7,2456776,56.33,2111887,57.98,3669442
> 8,2594395,50.10,2101993,54.17,4193648
> 9,2672871,43.90,1864173,53.01,4717854
> 10,2695456,37.69,2152126,45.82,5242060
> 11,2702604,31.44,1962406,42.50,5766266
> 12,2702415,25.22,2078596,35.01,6290472
> 13,2677250,19.02,2068953,35.42,6814678
> 14,2612990,12.80,2053951,30.77,7338884
> 15,2521812,6.67,1876602,26.42,7863090
> 16,2472506,0.53,1957658,20.16,8387296
> 
> Basically when you compare the direct approach from the patch set I
> submitted versus the one using the virtio approach, the virtio has
> better single thread performance, but doesn't scale as well as my
> patch set did. That is why I am thinking that if we can avoid trying
> to scale out to per-cpu and instead focus on just having one thread
> handle the feeding of the hints to the virtio device we could avoid
> the scaling penalty and instead get the best of both worlds.
> 

Sounds like a solid analysis to me. Thanks.

[...]
>>> only thing that has any specific limits to it. So I see it easily
>>> being possible for a good portion of memory being consumed by the
>>> queue when you consider that what you have is essentially the maximum
>>> length of the isolated page list multiplied by the number of entries
>>> in a virtqueue.
>>>
>>>> We have something that seems to work. Let's work from there instead of
>>>> scrapping the general design once more, thinking "it is super easy". And
>>>> yes, what you propose is pretty much throwing away the current design in
>>>> the guest.
>>>
>>> Define "work"? The last patch set required massive fixes as it was
>>> causing kernel panics if more than 1 VCPU was enabled and list
>>> corruption in general. I'm sure there are a ton more bugs lurking as
>>> we have only begun to be able to stress this code in any meaningful
>>> way.
>>
>> "work" - we get performance numbers that look promising and sorting out
>> issues in the design we find. This is RFC. We are discussing design
>> details. If there are issues in the design, let's discuss. If there are
>> alternatives, let's discuss. Bashing on the quality of prototypes?
>> Please don't.
> 
> I'm not so much bashing the quality as the lack of data. It is hard to
> say something is "working" when you have a hard time getting it to
> stay up long enough to collect any reasonable data. My concern is that
> the data looks really great when you don't have all the proper
> critical sections handled correctly, but when you add the locking that
> needs to be there it can make the whole point of an entire patch set
> moot.

Yes, and that is something to figure out during review. And you did that
excellently by even sending fixes :)

Not arguing against "quality of the prototype should improve with RFCs".

[...]

> 
>>>
>>>>> waiting on the processing. All I am suggesting is that we can get away
>>>>> from having to deal with both by just walking through the free pages
>>>>> for the higher order and hinting only a few at a time without having
>>>>> to try to provide the host with the hints on what is idle the second
>>>>> it is freed.
>>>>>
>>>>>>>
>>>>>>> I view this all as working not too dissimilar to how a standard Rx
>>>>>>> ring in a network device works. Only we would want to allocate from
>>>>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when
>>>>>>> the hint has been processed we would place them back in the "Buddy"
>>>>>>> list with the "Offline" value still set. The only real changes needed
>>>>>>> to the buddy allocator would be to add some logic for clearing/merging
>>>>>>> the "Offline" setting as necessary, and to provide an allocator that
>>>>>>> only works with non-"Offline" pages.
>>>>>>
>>>>>> Sorry, I had to smile at the phrase "only" in combination with "provide
>>>>>> an allocator that only works with non-Offline pages" :) . I guess you
>>>>>> realize yourself that these are core-mm changes that might easily be
>>>>>> rejected upstream because "the virt guys try to teach core-MM yet
>>>>>> another special case". I agree that this is nice to play with,
>>>>>> eventually that approach could succeed and be accepted upstream. But I
>>>>>> consider this long term work.
>>>>>
>>>>> The actual patch for this would probably be pretty small and compared
>>>>> to some of the other stuff that has gone in recently isn't too far out
>>>>> of the realm of possibility. It isn't too different than the code that
>>>>> has already gone in to determine the unused pages for virtio-balloon
>>>>> free page hinting.
>>>>>
>>>>> Basically what we would be doing is providing a means for
>>>>> incrementally transitioning the buddy memory into the idle/offline
>>>>> state to reduce guest memory overhead. It would require one function
>>>>> that would walk the free page lists and pluck out pages that don't
>>>>> have the "Offline" page type set, a one-line change to the logic for
>>>>> allocating a page as we would need to clear that extra bit of state,
>>>>> and optionally some bits for how to handle the merge of two "Offline"
>>>>> pages in the buddy allocator (required for lower order support). It
>>>>> solves most of the guest side issues with the free page hinting in
>>>>> that trying to do it via the arch_free_page path is problematic at
>>>>> best since it was designed for a synchronous setup, not an
>>>>> asynchronous one.
>>>>
>>>> This is throwing away work. No I don't think this is the right path to
>>>> follow for now. Feel free to look into it while Nitesh gets something in
>>>> shape we know conceptually works and we are starting to know which
>>>> issues we are hitting.
>>>
>>> Yes, it is throwing away work. But if the work is running toward a
>>> dead end does it add any value?
>>
>> "I'm not throwing anything away. " vs. "Yes, it is throwing away work.",
>> now we are on the same page.
>>
>> So your main point here is that you are fairly sure we are running
>> towards a dead end, right?
> 
> Yes. There is a ton of code here that is adding complexity and bugs
> that I would consider waste. If we can move to a single threaded
> approach the code could become much simpler as we will only have a
> single queue that we have to service and we could get away from 2
> levels of lists and the allocation that goes with them. In addition I
> think we could get away with much more code reuse as the
> get_free_page_and_send function could likely be adapted to provide a
> more generic function for adding a page of a specific size to the
> queue versus the current assumption that the only page size is
> VIRTIO_BALLOON_FREE_PAGE_ORDER.

We could also implement a single-threaded approach on top of the current
approach. Instead of scanning for free pages "blindly", scan the frees
recorded by the vcpus via the arch_free_page callback or similar. One
issue is that one can drop hints once the thread can't keep up. Locking
is also something to care about. Maybe recording hints globally per zone
could mitigate the issue, as the zone lock already has to be involved
when freeing. (I remember we used to prototype something like that
before, at least it was similar but different :) )

We could then go into "exception" mode once we drop too many hints, and
eventually go back to normal mode. See below.

> 
>>>
>>> I've been looking into the stuff Nitesh has been doing. I don't know
>>> about others, but I have been testing it. That is why I provided the
>>> patches I did to get it stable enough for me to test and address the
>>> regressions it was causing. That is the source of some of my concern.
>>
>> Testing and feedback is very much appreciated. You have concerns, they
>> are valid. I do like discussing concerns, discussing possible solutions,
>> or finding out that it cannot be solved the easy way. Then throw it away.
>>
>> Coming up with a clean design that considers problems that are not
>> directly visible is something I would like to see. But usually they
>> don't jump at you before prototyping.
>>
>> The simplest approach so far was "scan for zero pages in the
>> hypervisor". No changes in the guest needed except setting pages to zero
>> when freeing. No additional threads in the guest. No hinting. And still
>> we decided against it.
> 
> Right, I get that. There is still going to be a certain amount of
> overhead for zeroing the pages, and adding the scanning on the host will
> not come cheap. I had considered something similar when I first looked
> into this.

Interestingly, there are some architectures (e.g. s390x) that can do
zeroing directly in the memory controller using some instructions - at
least that is what I heard. There, the overhead in the guest is very
small. Scanning for zero pages using KSM in the host is then the
problematic part.

But the real problem is that the host has to scan, and when it is
OOM/about to swap, it might be too late to scan random guests to detect
free pages.

> 
>>> I think we have been making this overly complex with all the per-cpu
>>> bits and trying to place this in the free path itself. We really need
>>
>> We already removed complexity, at least that is my impression. There are
>> bugs in there, yes.
> 
> So one of the concerns I have at this point is the sheer number of
> allocations and the list shuffling that is having to take place.
> 
> The design of the last patch set had us enqueueing addresses on the
> per-cpu "free_pages_obj". Then when we hit a certain threshold it will
> call guest_free_page_hinting which is allocated via kmalloc and then
> populated with the pages that can be isolated. Then that function
> calls guest_free_page_report which will yet again kmalloc a hint_req
> object that just contains a pointer to our isolated pages list. If we
> can keep the page count small we could just do away with all of that
> and instead do something more like get_free_page_and_send.

Yes, not denying that more simplifications like this might be possible.
This is what we are looking for :)

> 
>>> to scale this back and look at having a single thread with a walker of
>>> some sort just hinting on what memory is sitting in the buddy but not
>>> hinted on. It is a solution that would work, even in a multiple VCPU
>>> case, and is achievable in the short term.
>> Can you write up your complete proposal and start a new thread? What I
>> understood so far is:
>>
>> 1. Separate hinting thread
>>
>> 2. Use virtio-balloon mechanism similar to Nitesh's work
>>
>> 3. Iterate over !offline pages in the buddy. Take them temporarily out
>> of the buddy (similar to Nitesh's work). Send them to the hypervisor.
>> Mark them offline, put them back to the buddy.
>>
>> 4. When a page leaves the buddy, drop the offline marker.
> 
> Yep, that is pretty much it. I'll see if I can get a write-up of it tomorrow.

That would be good, I think a sane approach on how to limit hinting
overhead due to scanning in certain corner cases needs quite some thought.

> 
>>
>> Selected issues to be sorted out:
>> - We have to find a way to mark pages offline. We are effectively
>> touching pages we don't own (keeping flags set when returning pages to
>> the buddy). Core MM has to accept this change.
>> - We might teach other users how to treat buddy pages now. Offline
>> always has to be cleared.
>> - How to limit the cycles wasted scanning? Idle guests?
> 
> One thought I had would be to look at splitting nr_free_pages in each
> free_area into two free running counters, one for the
> number of pages added, and one for the number of pages removed. Then
> it would be pretty straight forward to determine how many are
> available as we could maintain a free running counter for each free
> area in the balloon driver to determine if we need to scan a given
> area.
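
If I read that right, the idea would look roughly like this - all names
made up, not actual mm/ code:

#include <linux/mmzone.h>

/*
 * Each free_area would carry two free-running counters instead of (or in
 * addition to) nr_free; the balloon driver keeps a snapshot of "added"
 * and only scans a free_area whose counter moved since the last pass.
 */
struct free_area_counters {
        unsigned long added;    /* bumped when a page enters this free_area */
        unsigned long removed;  /* bumped when a page leaves it */
};

/* balloon side: snapshot from the last hinting pass, per zone and order */
static unsigned long last_seen_added[MAX_NR_ZONES][MAX_ORDER];

static bool needs_scan(int zid, unsigned int order,
                       const struct free_area_counters *c)
{
        unsigned long avail = c->added - c->removed;    /* current nr_free */

        /* nothing new freed since the last pass, or too little to bother
         * (the threshold of 16 is arbitrary) */
        if (c->added == last_seen_added[zid][order] || avail < 16)
                return false;

        last_seen_added[zid][order] = c->added;
        return true;
}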

When thinking about this, we should always keep in mind some corner
cases that could happen. If you have a process that simply allocates and
frees 64MB continuously, at a frequency that makes hinting

1. search for the pages
2. isolate the pages
3. hint the pages

in a loop, basically consuming 100% of a host CPU just for hinting, then
this would be really bad.

I'm afraid scanning, if not done in a smart way, can result in a lot of
CPU overhead just for hinting.


See my comment above about possibly combining both approaches: once we
drop a lot of hints, switch from using what the CPUs record to scanning
the free list.

This would mean one hinting thread, minimal work during the freeing of
pages (e.g. recording into a per-zone array), no need to scan when there
is little activity, and no lost hints when there is a lot of activity.

Best of both worlds?
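
Roughly, with all names and sizes made up for illustration and locking
simplified (the freeing path would really need irqsave):

#include <linux/spinlock.h>
#include <linux/string.h>

#define HINT_RECORDS_PER_ZONE   64

struct zone_hint_state {
        spinlock_t lock;
        unsigned int nr;                /* recorded, not yet hinted */
        unsigned long pfns[HINT_RECORDS_PER_ZONE];
        bool overflowed;                /* dropped hints -> rescan free list */
};

/* hypothetical helpers - not existing kernel APIs */
void scan_free_list_and_hint(void);
void hint_recorded_pfns(unsigned long *pfns, unsigned int nr);

/* The only work in the freeing path: remember the pfn, or note the overflow. */
static void free_page_hint_record(struct zone_hint_state *s, unsigned long pfn)
{
        spin_lock(&s->lock);
        if (s->nr < HINT_RECORDS_PER_ZONE)
                s->pfns[s->nr++] = pfn;
        else
                s->overflowed = true;
        spin_unlock(&s->lock);
}

/* The hinting thread: use the recorded pfns, fall back to scanning on overflow. */
static void hinting_thread_process(struct zone_hint_state *s)
{
        unsigned long pfns[HINT_RECORDS_PER_ZONE];
        unsigned int nr;
        bool rescan;

        spin_lock(&s->lock);
        nr = s->nr;
        memcpy(pfns, s->pfns, nr * sizeof(pfns[0]));
        s->nr = 0;
        rescan = s->overflowed;
        s->overflowed = false;
        spin_unlock(&s->lock);

        if (rescan)
                scan_free_list_and_hint();      /* too much activity: walk the buddy */
        else if (nr)
                hint_recorded_pfns(pfns, nr);   /* report the batch to the host */
}

The freeing path only touches a small per-zone array; the scanning
fallback only kicks in once that array overflowed, i.e. when there was a
lot of activity anyway.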

> 
>> - How to efficiently scan a list that might always change between
>> hinting requests?
>> - How to avoid OOM that can still happen in corner cases; after all,
>> you are taking pages out of the buddy temporarily.
> 
> Yes, but hopefully it should be a small enough amount that nobody will
> notice. In many cases devices such as NICs can consume much more than
> this regularly for just their Rx buffers and it is not an issue. There
> has to be a certain amount of overhead that any given device is
> allowed to consume. If we contain the balloon hinting to just 64M that
> should be a small enough amount that nobody would notice in practice.

Depends on the setup. If your guest has 512MB-1024MB or less, it could
already be problematic, as 64MB would be roughly 6-12% of guest memory.
Most probably you won't find many 2MB/4MB pages either way ... At least
it has to be documented, so people enabling hinting are aware of this
(something along these lines was already suggested by me earlier in this
thread).


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: On guest free page hinting and OOM
  2019-04-02 23:43                                 ` Alexander Duyck
  2019-04-03 19:43                                   ` David Hildenbrand
@ 2019-04-04 13:28                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-04 13:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On Tue, Apr 02, 2019 at 04:43:03PM -0700, Alexander Duyck wrote:
> Yes, but hopefully it should be a small enough amount that nobody will
> notice. In many cases devices such as NICs can consume much more than
> this regularly for just their Rx buffers and it is not an issue. There
> has to be a certain amount of overhead that any given device is
> allowed to consume. If we contain the balloon hinting to just 64M that
> should be a small enough amount that nobody would notice in practice.

I am still inclined to add the ability for the balloon to exit to the
host on OOM if any hints are outstanding.

If nothing else, it will be helpful for stats.
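
Roughly along the lines of the existing deflate-on-OOM hook - the
vb_hint_ names below are made up, register_oom_notifier() is the
existing mechanism:

#include <linux/atomic.h>
#include <linux/notifier.h>
#include <linux/oom.h>

static atomic_t vb_hints_outstanding = ATOMIC_INIT(0);

/* hypothetical helper - completes/cancels queued hints, returns pages freed */
unsigned long vb_hint_flush_outstanding(void);

static int vb_hint_oom_notify(struct notifier_block *nb,
                              unsigned long unused, void *parm)
{
        unsigned long *freed = parm;

        if (!atomic_read(&vb_hints_outstanding))
                return NOTIFY_OK;

        /*
         * Kick the host so outstanding hints get completed or cancelled and
         * the isolated pages can go back to the buddy before the OOM killer
         * runs; also a convenient place to bump a stats counter.
         */
        *freed += vb_hint_flush_outstanding();
        return NOTIFY_OK;
}

static struct notifier_block vb_hint_oom_nb = {
        .notifier_call = vb_hint_oom_notify,
};

/* at probe time: register_oom_notifier(&vb_hint_oom_nb); */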

-- 
MST

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-04-04 13:28 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-29 13:26 On guest free page hinting and OOM Michael S. Tsirkin
2019-03-29 14:24 ` David Hildenbrand
2019-03-29 15:08   ` Michael S. Tsirkin
2019-03-29 15:37     ` David Hildenbrand
2019-03-29 15:45       ` David Hildenbrand
2019-03-29 16:51         ` Michael S. Tsirkin
2019-04-01  8:17           ` David Hildenbrand
2019-04-01 13:24             ` Michael S. Tsirkin
2019-04-01 14:09               ` David Hildenbrand
2019-04-01 14:11                 ` David Hildenbrand
2019-04-01 14:47                   ` Michael S. Tsirkin
2019-04-01 14:54                     ` David Hildenbrand
2019-04-01 20:56                     ` Alexander Duyck
2019-04-02  7:42                       ` David Hildenbrand
2019-04-02 15:04                         ` Alexander Duyck
2019-04-02 15:25                           ` Michael S. Tsirkin
2019-04-02 15:57                             ` David Hildenbrand
2019-04-02 16:18                               ` Alexander Duyck
2019-04-02 17:08                                 ` David Hildenbrand
2019-04-02 17:45                                   ` Alexander Duyck
2019-04-02 17:53                                     ` Michael S. Tsirkin
2019-04-02 20:32                                       ` Alexander Duyck
2019-04-02 18:21                                     ` David Hildenbrand
2019-04-02 19:49                                       ` Michael S. Tsirkin
2019-04-02 20:32                                         ` David Hildenbrand
2019-04-02 15:55                           ` David Hildenbrand
2019-04-02 17:30                             ` Alexander Duyck
2019-04-02 18:53                               ` David Hildenbrand
2019-04-02 23:43                                 ` Alexander Duyck
2019-04-03 19:43                                   ` David Hildenbrand
2019-04-04 13:28                                   ` Michael S. Tsirkin
2019-04-02 16:19                           ` David Hildenbrand
2019-04-01 14:45                 ` Michael S. Tsirkin
2019-03-29 16:09       ` Michael S. Tsirkin
