* On guest free page hinting and OOM
@ 2019-03-29 13:26 Michael S. Tsirkin
2019-03-29 14:24 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 13:26 UTC (permalink / raw)
To: Nitesh Narayan Lal
Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta,
wei.w.wang, yang.zhang.wz, riel, david, dodgen, konrad.wilk,
dhildenb, aarcange, alexander.duyck
On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables guests with no page cache to rapidly free and reclaim memory to and from the host, respectively.
Sorry about breaking the thread: the original subject was
KVM: Guest Free Page Hinting
but the following isn't in response to a specific patch,
so I thought it reasonable to start a new one.
What bothers me (and others) about both Nitesh's asynchronous approach
to hinting and the hinting that is already supported in the balloon
driver is that it seems to have the potential to create a fake OOM situation:
a page that is in the process of being hinted cannot be used. How
likely that is depends on the workload, so it is hard to predict.
Alex's patches do not have this problem, as they block the
VCPUs from attempting to get new pages during hinting. That solves the fake OOM
issue but adds blocking which, most of the time, is not necessary.
With both approaches there's a tradeoff: hinting is more efficient if it
hints about large chunks of memory at a time, but as that size
increases, the chances of being able to hold on to that much memory at a
time decrease. One can claim that this is a regular performance/memory
tradeoff; however, there is a difference here: normally,
guest performance is traded off for host memory (of which the host
knows how much there is). This trades guest performance
for guest memory, but the benefit is on the host, not on
the guest. That makes it harder to manage.
I have an idea: how about allocating extra guest memory on the host? An
extra hinting buffer would be appended to guest memory, with the
understanding that it is destined specifically to improve page hinting.
The balloon device would get an extra parameter specifying the
hinting buffer size - e.g. in the config space of the device.
At driver startup, it would get hold of the amount of
memory specified by the host as the hinting buffer size, and keep it around in a
buffer list - if no action is taken - forever. Whenever the balloon driver
wants to get hold of a page of memory and send it to the host for hinting, it
would release a page of the same size from the buffer into the free
list: the new page swaps places with a page in the buffer.
In this way the amount of useful free memory stays constant.
Once hinting is done, the page can be swapped back - or just stay
in the hinting buffer until the next hint.
Clearly this is a memory/performance tradeoff: the more memory host can
allocate for the hinting buffer, the more batching we'll get so hints
become cheaper. One notes that:
- if guest memory isn't pinned, this memory is virtual and can
be reclaimed by the host. In particular, the guest can hint about the
memory within the hinting buffer at startup.
- guest performance/host memory tradeoffs are reasonably well understood, and
so it's easier to manage: host knows how much memory it can
sacrifice to gain the benefit of hinting.
Thoughts?
--
MST
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-03-29 13:26 On guest free page hinting and OOM Michael S. Tsirkin @ 2019-03-29 14:24 ` David Hildenbrand 2019-03-29 15:08 ` Michael S. Tsirkin 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-03-29 14:24 UTC (permalink / raw) To: Michael S. Tsirkin, Nitesh Narayan Lal Cc: kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 29.03.19 14:26, Michael S. Tsirkin wrote: > On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote: >> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively. > > Sorry about breaking the thread: the original subject was > KVM: Guest Free Page Hinting > but the following isn't in a response to a specific patch > so I thought it's reasonable to start a new one. > > What bothers both me (and others) with both Nitesh's asynchronous approach > to hinting and the hinting that is already supported in the balloon > driver right now is that it seems to have the potential to create a fake OOM situation: > the page that is in the process of being hinted can not be used. How > likely that is would depend on the workload so is hard to predict. We had a very simple idea in mind: As long as a hinting request is pending, don't actually trigger any OOM activity, but wait for it to be processed. Can be done using simple atomic variable. This is a scenario that will only pop up when already pretty low on memory. And the main difference to ballooning is that we *know* we will get more memory soon. > > Alex's patches do not have this problem as they block the > VCPUs from attempting to get new pages during hinting. Solves the fake OOM > issue but adds blocking which most of the time is not necessary. 
+ not going via QEMU, which I consider problematic in the future when it
comes to various things:
1) VFIO notifications, if we ever want to support it.
2) Verifying that the memory may actually be hinted. Remember when people
started to madvise(DONTNEED) the BIOS and we had to fix that in QEMU.
> 
> With both approaches there's a tradeoff: hinting is more efficient if it
> hints about large sized chunks of memory at a time, but as that size
> increases, chances of being able to hold on to that much memory at a
> time decrease. One can claim that this is a regular performance/memory
> tradeoff however there is a difference here: normally
> guest performance is traded off for host memory (which host
> knows how much there is of), this trades guest performance
> for guest memory, but the benefit is on the host, not on
> the guest. Thus this is harder to manage.
One nice thing is that, when only hinting larger chunks, the probability
of smaller chunks being available is higher. It would be more of an issue
when hinting at any granularity.
> 
> I have an idea: how about allocating extra guest memory on the host? An
> extra hinting buffer would be appended to guest memory, with the
> understanding that it is destined specifically to improve page hinting.
> Balloon device would get an extra parameter specifying the
> hinting buffer size - e.g. in the config space of the driver.
> At driver startup, it would get hold of the amount of
> memory specified by host as the hinting buffer size, and keep it around in a
> buffer list - if no action is taken - forever. Whenever balloon would
> want to get hold of a page of memory and send it to host for hinting, it
> would release a page of the same size from the buffer into the free
> list: a new page swaps places with a page in the buffer.
> 
> In this way the amount of useful free memory stays constant.
> 
> Once hinting is done page can be swapped back - or just stay
> in the hinting buffer until the next hint.
> 
> Clearly this is a memory/performance tradeoff: the more memory host can
> allocate for the hinting buffer, the more batching we'll get so hints
> become cheaper. One notes that:
> - if guest memory isn't pinned, this memory is virtual and can
> be reclaimed by host. In particular, the guest can hint about the
> memory within the hinting buffer at startup.
> - guest performance/host memory tradeoffs are reasonably well understood, and
> so it's easier to manage: host knows how much memory it can
> sacrifice to gain the benefit of hinting.
> 
> Thoughts?
> 
I first want to
a) See it being a real issue. Reproduce it.
b) See that we can't fix it using a simple approach (loop when requests
not processed yet, always keep X pages ...).
c) See that an easy fix is not sufficient and actually an issue.
d) See if we can document it so that people who care about it can live
without hinting, like they would live without ballooning.

What you describe sounds interesting, but really involved. And really
problematic. I consider many things about your approach not realistic:
"appended to guest memory", "global list of memory", malicious guests
always using that memory - like what about NUMA? What about different page
granularity? What about malicious guests? What about more hinting
requests than the buffer is capable of handling? And much, much more.

Honestly, requiring page hinting to make use of actual ballooning or
additional memory makes me shiver. I hope I don't get nightmares ;) In
the long term we might want to get rid of the inflation/deflation side
of virtio-balloon, not require it.

Please don't over-engineer an issue we haven't even seen yet. Especially
not using a mechanism that sounds more involved than actual hinting.

As always, I might be very wrong, but this sounds way too complicated to
me, both on the guest and the hypervisor side.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM
2019-03-29 14:24 ` David Hildenbrand
@ 2019-03-29 15:08 ` Michael S. Tsirkin
2019-03-29 15:37 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 15:08 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
konrad.wilk, dhildenb, aarcange, alexander.duyck
On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> 
> We had a very simple idea in mind: As long as a hinting request is
> pending, don't actually trigger any OOM activity, but wait for it to be
> processed. Can be done using simple atomic variable.
> 
> This is a scenario that will only pop up when already pretty low on
> memory. And the main difference to ballooning is that we *know* we will
> get more memory soon.
No we don't. If we keep polling we are quite possibly keeping the CPU
busy, so delaying the hint request processing. Again, the issue is a
tradeoff: one performance for the other. Very hard to know which path
you hit in advance, and in the real world no one has the time to profile
and tune things. By comparison, trading memory for performance is well
understood.
> "appended to guest memory", "global list of memory", malicious guests
> always using that memory like what about NUMA?
This can be up to the guest. A good approach would be to take
a chunk out of each node and add it to the hints buffer.
> What about different page
> granularity?
Seems like an orthogonal issue to me.
> What about malicious guests?
That's an interesting question. The host can actually enforce that the # of
hinted free pages at least matches the hint buffer size.
> What about more hinting
> requests than the buffer is capable of handling?
The idea is that we don't send more hints than fit in the buffer.
In this way host can actually control the overhead which is probably a good thing - host knows how much benefit can be derived from hinting. Guest doesn't. > Honestly, requiring page hinting to make use of actual ballooning or > additional memory makes me shiver. I hope I don't get nightmares ;) In > the long term we might want to get rid of the inflation/deflation side > of virtio-balloon, not require it. > > Please don't over-engineer an issue we haven't even see yet. All hinting patches are very lightly tested as it is. OOM especially is very hard to test properly. So *I* will sleep better at night if we don't have corner cases. Balloon is already involved in MM for isolation and somehow we live with that. So wait until you see actual code before worrying about nightmares. > Especially > not using a mechanism that sounds more involved than actual hinting. That would depend on the implementation. It's just moving a page between two lists. > > As always, I might be very wrong, but this sounds way too complicated to > me, both on the guest and the hypervisor side. On the hypervisor side it can be literally nothing if we don't want to enforce buffer size. > -- > > Thanks, > > David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-03-29 15:08 ` Michael S. Tsirkin @ 2019-03-29 15:37 ` David Hildenbrand 2019-03-29 15:45 ` David Hildenbrand 2019-03-29 16:09 ` Michael S. Tsirkin 0 siblings, 2 replies; 34+ messages in thread From: David Hildenbrand @ 2019-03-29 15:37 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 29.03.19 16:08, Michael S. Tsirkin wrote: > On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote: >> >> We had a very simple idea in mind: As long as a hinting request is >> pending, don't actually trigger any OOM activity, but wait for it to be >> processed. Can be done using simple atomic variable. >> >> This is a scenario that will only pop up when already pretty low on >> memory. And the main difference to ballooning is that we *know* we will >> get more memory soon. > > No we don't. If we keep polling we are quite possibly keeping the CPU > busy so delaying the hint request processing. Again the issue it's a You can always yield. But that's a different topic. > tradeoff. One performance for the other. Very hard to know which path do > you hit in advance, and in the real world no one has the time to profile > and tune things. By comparison trading memory for performance is well > understood. > > >> "appended to guest memory", "global list of memory", malicious guests >> always using that memory like what about NUMA? > > This can be up to the guest. A good approach would be to take > a chunk out of each node and add to the hints buffer. This might lead to you not using the buffer efficiently. But also, different topic. > >> What about different page >> granularity? > > Seems like an orthogonal issue to me. It is similar, yes. But if you support multiple granularities (e.g. MAX_ORDER - 1, MAX_ORDER - 2 ...) 
you might have to implement some sort
of buddy for the buffer. This is different from just a list for each node.
> >> What about malicious guests?
> 
> That's an interesting question. Host can actually enforce that # of
> hinted free pages at least matches the hint buffer size.
Well, you will have to assume that the hinting buffer might be completely
used by any guest. Just like ballooning, there are no guarantees. If the
buffer memory is part of initial boot memory, any guest without proper
modifications will simply make use of it. See more on that below.
At first, my impression was that you would actually somehow want to
expose new memory ("buffer") to the guest, which is why I shivered.
> 
> >> What about more hinting
>> requests than the buffer is capable of handling?
> 
> The idea is that we don't send more hints than in the buffer.
> In this way host can actually control the overhead which
> is probably a good thing - host knows how much benefit
> can be derived from hinting. Guest doesn't.
So the buffer will require a global lock, just so we are on the same page.
"this way host can actually control the overhead" - you can implement
that more easily by specifying a desired limit via virtio and tracking it
in the guest. Which is what your buffer would implicitly do.
> 
>> Honestly, requiring page hinting to make use of actual ballooning or
>> additional memory makes me shiver. I hope I don't get nightmares ;) In
>> the long term we might want to get rid of the inflation/deflation side
>> of virtio-balloon, not require it.
>> 
>> Please don't over-engineer an issue we haven't even seen yet.
> 
> All hinting patches are very lightly tested as it is. OOM especially is
> very hard to test properly. So *I* will sleep better at night if we
> don't have corner cases. Balloon is already involved in MM for
> isolation and somehow we live with that. So wait until you see actual
> code before worrying about nightmares.
Yes, but I want to see attempts with simple solutions first.
> >> Especially >> not using a mechanism that sounds more involved than actual hinting. > > That would depend on the implementation. > It's just moving a page between two lists. Except NUMA and different granularities. > > >> >> As always, I might be very wrong, but this sounds way too complicated to >> me, both on the guest and the hypervisor side. > > On the hypervisor side it can be literally nothing if we don't > want to enforce buffer size. Just so we understand each other. What you mean with "appended to guest memory" is "append to the guest memory size", not actually "append memory via virtio-balloon", like adding memory regions and stuff. Instead of "-m 4G" you would do "-m 5G -device virtio-balloon,hint_size=1G". So to summarize my opinion, this looks like a *possible* improvement for the future *if* we realize that a simple approach does not work. If we simply add more memory for the hinting buffer to the VM memory size, we will have to assume it will be used by guests, even if they are not malicious (e.g. guest without virtio-balloon driver or without the enhancement). NUMA and different page granularities in the guest are tricky parts. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-03-29 15:37 ` David Hildenbrand @ 2019-03-29 15:45 ` David Hildenbrand 2019-03-29 16:51 ` Michael S. Tsirkin 2019-03-29 16:09 ` Michael S. Tsirkin 1 sibling, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-03-29 15:45 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 29.03.19 16:37, David Hildenbrand wrote: > On 29.03.19 16:08, Michael S. Tsirkin wrote: >> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote: >>> >>> We had a very simple idea in mind: As long as a hinting request is >>> pending, don't actually trigger any OOM activity, but wait for it to be >>> processed. Can be done using simple atomic variable. >>> >>> This is a scenario that will only pop up when already pretty low on >>> memory. And the main difference to ballooning is that we *know* we will >>> get more memory soon. >> >> No we don't. If we keep polling we are quite possibly keeping the CPU >> busy so delaying the hint request processing. Again the issue it's a > > You can always yield. But that's a different topic. > >> tradeoff. One performance for the other. Very hard to know which path do >> you hit in advance, and in the real world no one has the time to profile >> and tune things. By comparison trading memory for performance is well >> understood. >> >> >>> "appended to guest memory", "global list of memory", malicious guests >>> always using that memory like what about NUMA? >> >> This can be up to the guest. A good approach would be to take >> a chunk out of each node and add to the hints buffer. > > This might lead to you not using the buffer efficiently. But also, > different topic. > >> >>> What about different page >>> granularity? >> >> Seems like an orthogonal issue to me. > > It is similar, yes. 
But if you support multiple granularities (e.g. > MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort > of buddy for the buffer. This is different than just a list for each node. Oh, and before I forget, different zones might of course also be a problem. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-03-29 15:45 ` David Hildenbrand @ 2019-03-29 16:51 ` Michael S. Tsirkin 2019-04-01 8:17 ` David Hildenbrand 0 siblings, 1 reply; 34+ messages in thread From: Michael S. Tsirkin @ 2019-03-29 16:51 UTC (permalink / raw) To: David Hildenbrand Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote: > On 29.03.19 16:37, David Hildenbrand wrote: > > On 29.03.19 16:08, Michael S. Tsirkin wrote: > >> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote: > >>> > >>> We had a very simple idea in mind: As long as a hinting request is > >>> pending, don't actually trigger any OOM activity, but wait for it to be > >>> processed. Can be done using simple atomic variable. > >>> > >>> This is a scenario that will only pop up when already pretty low on > >>> memory. And the main difference to ballooning is that we *know* we will > >>> get more memory soon. > >> > >> No we don't. If we keep polling we are quite possibly keeping the CPU > >> busy so delaying the hint request processing. Again the issue it's a > > > > You can always yield. But that's a different topic. > > > >> tradeoff. One performance for the other. Very hard to know which path do > >> you hit in advance, and in the real world no one has the time to profile > >> and tune things. By comparison trading memory for performance is well > >> understood. > >> > >> > >>> "appended to guest memory", "global list of memory", malicious guests > >>> always using that memory like what about NUMA? > >> > >> This can be up to the guest. A good approach would be to take > >> a chunk out of each node and add to the hints buffer. > > > > This might lead to you not using the buffer efficiently. But also, > > different topic. > > > >> > >>> What about different page > >>> granularity? 
> >> > >> Seems like an orthogonal issue to me. > > > > It is similar, yes. But if you support multiple granularities (e.g. > > MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort > > of buddy for the buffer. This is different than just a list for each node. Right but we don't plan to do it yet. > Oh, and before I forget, different zones might of course also be a problem. I would just split the hint buffer evenly between zones. > > -- > > Thanks, > > David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-03-29 16:51 ` Michael S. Tsirkin @ 2019-04-01 8:17 ` David Hildenbrand 2019-04-01 13:24 ` Michael S. Tsirkin 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-01 8:17 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 29.03.19 17:51, Michael S. Tsirkin wrote: > On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote: >> On 29.03.19 16:37, David Hildenbrand wrote: >>> On 29.03.19 16:08, Michael S. Tsirkin wrote: >>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote: >>>>> >>>>> We had a very simple idea in mind: As long as a hinting request is >>>>> pending, don't actually trigger any OOM activity, but wait for it to be >>>>> processed. Can be done using simple atomic variable. >>>>> >>>>> This is a scenario that will only pop up when already pretty low on >>>>> memory. And the main difference to ballooning is that we *know* we will >>>>> get more memory soon. >>>> >>>> No we don't. If we keep polling we are quite possibly keeping the CPU >>>> busy so delaying the hint request processing. Again the issue it's a >>> >>> You can always yield. But that's a different topic. >>> >>>> tradeoff. One performance for the other. Very hard to know which path do >>>> you hit in advance, and in the real world no one has the time to profile >>>> and tune things. By comparison trading memory for performance is well >>>> understood. >>>> >>>> >>>>> "appended to guest memory", "global list of memory", malicious guests >>>>> always using that memory like what about NUMA? >>>> >>>> This can be up to the guest. A good approach would be to take >>>> a chunk out of each node and add to the hints buffer. >>> >>> This might lead to you not using the buffer efficiently. But also, >>> different topic. 
>>> >>>> >>>>> What about different page
>>>>> granularity?
>>>> 
>>>> Seems like an orthogonal issue to me.
>>> 
>>> It is similar, yes. But if you support multiple granularities (e.g.
>>> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
>>> of buddy for the buffer. This is different from just a list for each node.
> 
> Right but we don't plan to do it yet.
MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits) and
early performance numbers indicate it might be the right thing to do. So
it could be very desirable once we do more performance tests.
> 
>> Oh, and before I forget, different zones might of course also be a problem.
> 
> I would just split the hint buffer evenly between zones.
> 
Thinking about your approach, there is one elementary thing to notice:

Giving the guest pages from the buffer while hinting requests are being
processed means that the guest can and will temporarily make use of more
memory than desired. Essentially up to the point where MADV_FREE is
finally called for the hinted pages. Even then the guest will logically
make use of more memory than desired until core MM takes pages away.

So:
1) Unmodified guests will make use of more memory than desired.
2) Malicious guests will make use of more memory than desired.
3) Sane, modified guests will make use of more memory than desired.

Instead, we could make our life much easier by doing the following:

1) Introduce a parameter to cap the amount of memory concurrently hinted,
similar to what you suggested, just don't consider it a buffer value.
"-device virtio-balloon,hinting_size=1G". This gives us control over the
hinting process.

hinting_size=0 (default) disables hinting

The admin can tweak the number along with the memory requirements of the
guest. We can make suggestions (e.g. calculate depending on #cores, #size
of memory, or simply "1GB").

2) In the guest, track the size of hints in progress, cap at the
hinting_size.

3) Document hinting behavior

"When hinting is enabled, memory up to hinting_size might temporarily be
removed from your guest in order to be hinted to the hypervisor. This is
only for a very short time, but might affect applications. Consider the
hinting_size when sizing your guest. If your application was tested with
XGB and a hinting size of 1G is used, please configure X+1GB for the
guest. Otherwise, performance degradation might be possible."

4) Do the loop/yield on OOM as discussed to improve performance when OOM
and avoid false OOM triggers, just to be sure.

BTW, one alternative I initially had in mind was to add pages from the
buffer only from the OOM handler and to put these pages back into the
buffer once freed. I thought this might help for certain memory offline
scenarios where pages stuck in the buffer might hinder offlining of
memory. And of course, improve performance as the buffer is only touched
when really needed. But it would only help for memory (e.g. DIMM) added
after boot, so it is also not 100% safe. Also, same issues as with your
given approach.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 8:17 ` David Hildenbrand @ 2019-04-01 13:24 ` Michael S. Tsirkin 2019-04-01 14:09 ` David Hildenbrand 0 siblings, 1 reply; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-01 13:24 UTC (permalink / raw) To: David Hildenbrand Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On Mon, Apr 01, 2019 at 10:17:51AM +0200, David Hildenbrand wrote: > On 29.03.19 17:51, Michael S. Tsirkin wrote: > > On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote: > >> On 29.03.19 16:37, David Hildenbrand wrote: > >>> On 29.03.19 16:08, Michael S. Tsirkin wrote: > >>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote: > >>>>> > >>>>> We had a very simple idea in mind: As long as a hinting request is > >>>>> pending, don't actually trigger any OOM activity, but wait for it to be > >>>>> processed. Can be done using simple atomic variable. > >>>>> > >>>>> This is a scenario that will only pop up when already pretty low on > >>>>> memory. And the main difference to ballooning is that we *know* we will > >>>>> get more memory soon. > >>>> > >>>> No we don't. If we keep polling we are quite possibly keeping the CPU > >>>> busy so delaying the hint request processing. Again the issue it's a > >>> > >>> You can always yield. But that's a different topic. > >>> > >>>> tradeoff. One performance for the other. Very hard to know which path do > >>>> you hit in advance, and in the real world no one has the time to profile > >>>> and tune things. By comparison trading memory for performance is well > >>>> understood. > >>>> > >>>> > >>>>> "appended to guest memory", "global list of memory", malicious guests > >>>>> always using that memory like what about NUMA? > >>>> > >>>> This can be up to the guest. 
A good approach would be to take
> >>>> a chunk out of each node and add to the hints buffer.
> >>> 
> >>> This might lead to you not using the buffer efficiently. But also,
> >>> different topic.
> >>> 
> >>>> 
> >>>>> What about different page
> >>>>> granularity?
> >>>> 
> >>>> Seems like an orthogonal issue to me.
> >>> 
> >>> It is similar, yes. But if you support multiple granularities (e.g.
> >>> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
> >>> of buddy for the buffer. This is different from just a list for each node.
> > 
> > Right but we don't plan to do it yet.
> 
> MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits) and
> early performance numbers indicate it might be the right thing to do. So
> it could be very desirable once we do more performance tests.
> 
> > 
> >> Oh, and before I forget, different zones might of course also be a problem.
> > 
> > I would just split the hint buffer evenly between zones.
> > 
> 
> Thinking about your approach, there is one elementary thing to notice:
> 
> Giving the guest pages from the buffer while hinting requests are being
> processed means that the guest can and will temporarily make use of more
> memory than desired. Essentially up to the point where MADV_FREE is
> finally called for the hinted pages.
Right - but that seems like exactly the reverse of the issue with the current
approach, which is that the guest can temporarily use less memory than desired.
> Even then the guest will logically
> make use of more memory than desired until core MM takes pages away.
That sounds more like a host issue though. If it wants to
it can use e.g. MADV_DONTNEED.
> So:
> 1) Unmodified guests will make use of more memory than desired.
One interesting possibility for this is to add the buffer memory
by hotplug after the feature has been negotiated.
I agree this sounds complex. But I have an idea: how about we include the
hint size in the num_pages counter?
Then unmodified guests put it in the balloon and don't use it.
Modified ones will know to use it just for hinting.
> 2) Malicious guests will make use of more memory than desired.
Well, this limitation is fundamental to the balloon, right? If the host
wants to add tracking of balloon memory, it can enforce the limits. So far
no one bothered, but maybe with this feature we should start to do that.
> 3) Sane, modified guests will make use of more memory than desired.
> 
> Instead, we could make our life much easier by doing the following:
> 
> 1) Introduce a parameter to cap the amount of memory concurrently hinted,
> similar to what you suggested, just don't consider it a buffer value.
> "-device virtio-balloon,hinting_size=1G". This gives us control over the
> hinting process.
> 
> hinting_size=0 (default) disables hinting
> 
> The admin can tweak the number along with memory requirements of the
> guest. We can make suggestions (e.g. calculate depending on #cores, #size
> of memory, or simply "1GB")
So if it's all up to the guest and for the benefit of the guest, and with
no cost/benefit to the host, then why are we supplying this value from
the host?
> 2) In the guest, track the size of hints in progress, cap at the
> hinting_size.
> 
> 3) Document hinting behavior
> 
> "When hinting is enabled, memory up to hinting_size might temporarily be
> removed from your guest in order to be hinted to the hypervisor. This is
> only for a very short time, but might affect applications. Consider the
> hinting_size when sizing your guest. If your application was tested with
> XGB and a hinting size of 1G is used, please configure X+1GB for the
> guest. Otherwise, performance degradation might be possible."
OK, so let's start with this. Now let us assume that the guest follows the
advice. We thus know that 1GB is not needed for guest applications. So why
do we want to allow applications to still use this extra memory?
> 4) Do the loop/yield on OOM as discussed to improve performance when OOM > and avoid false OOM triggers just to be sure. Yes, I'm not against trying the simpler approach as a first step. But then we need this path actually tested to see whether hinting introduces unreasonable overhead on this path. And it is tricky to test OOM as you are skating close to the system's limits. That's one reason I prefer avoiding the OOM handler if possible. When you say yield, I would guess that would involve config space access to the balloon to flush out outstanding hints? > > BTW, one alternative I initially had in mind was to add pages from the > buffer from the OOM handler only and putting these pages back into the > buffer once freed. I don't think that works easily - pages get used so we can't return them into the buffer. Another problem with only handling OOM is that OOM is a guest decision. So the host really can't enforce any limits even if it wants to. > I thought this might help for certain memory offline > scenarios where pages stuck in the buffer might hinder offlining of > memory. And of course, improve performance as the buffer is only touched > when really needed. But it would only help for memory (e.g. DIMM) added > after boot, so it is also not 100% safe. Also, same issues as with your > given approach. So you can look at this approach as a combination of - balloon inflate with separate accounting - deflate on OOM - hinting? Put this way, it seems rather uncontroversial, right? > -- > > Thanks, > > David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 13:24 ` Michael S. Tsirkin @ 2019-04-01 14:09 ` David Hildenbrand 2019-04-01 14:11 ` David Hildenbrand 2019-04-01 14:45 ` Michael S. Tsirkin 0 siblings, 2 replies; 34+ messages in thread From: David Hildenbrand @ 2019-04-01 14:09 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck >> Thinking about your approach, there is one elementary thing to notice: >> >> Giving the guest pages from the buffer while hinting requests are being >> processed means that the guest can and will temporarily make use of more >> memory than desired. Essentially up to the point where MADV_FREE is >> finally called for the hinted pages. > > Right - but that seems like exactly the reverse of the issue with the current > approach, which is that the guest can temporarily use less memory than desired. > >> Even then the guest will logically >> make use of more memory than desired until core MM takes pages away. > > That sounds more like a host issue though. If it wants to > it can use e.g. MADV_DONTNEED. Indeed. But MADV_DONTNEED is somewhat undesired for performance reasons. You want to do the work when swapping, not when hinting. But what I wanted to say here: Looking at the pure size of your guest will at least not help you to identify if more memory than desired will be used. > >> So: >> 1) Unmodified guests will make use of more memory than desired. > > One interesting possibility for this is to add the buffer memory > by hotplug after the feature has been negotiated. > I agree this sounds complex. Yes it is, and it goes in the direction of virtio-mem, which essentially does that. But bad news: memory hotplug is complicated stuff, both on the hypervisor and guest side.
But even then, a malicious guest can simply fake feature negotiation and make use of all hotplugged memory. Won't work, at least not for malicious guests. > > But I have an idea: how about we include the hint size in the > num_pages counter? Then unmodified guests put > it in the balloon and don't use it. Modified ones > will know to use it just for hinting. These are the nightmares I was talking about. I would like to decouple this feature as far as possible from balloon inflation/deflation. Ballooning is 4k based and might have other undesirable side effects. Just because somebody wants to use page hinting does not mean he wants to use ballooning. Effectively, many people will want to avoid ballooning completely by using page hinting for their use case. > > >> 2) Malicious guests will make use of more memory than desired. > > Well, this limitation is fundamental to balloon, right? Yep, it is the fundamental issue of ballooning. If memory is available right from the boot, the system is free to do with it whatever it wants. (one of the main things virtio-mem will do differently/better) > If the host wants to add tracking of balloon memory, it > can enforce the limits. So far no one bothered, > but maybe with this feature we should start to do that. I think I already had endless rants about why this is not possible. Ballooning as it is currently implemented by virtio-balloon is broken by design. Period. You cannot and never will be able to distinguish unmodified guests from malicious guests. Please don't design new approaches based on broken design. > >> 3) Sane, modified guests will make use of more memory than desired. >> >> Instead, we could make our life much easier by doing the following: >> >> 1) Introduce a parameter to cap the amount of memory concurrently hinted >> similar to what you suggested, just don't consider it a buffer value. >> "-device virtio-balloon,hinting_size=1G". This gives us control over the >> hinting process.
>> >> hinting_size=0 (default) disables hinting >> >> The admin can tweak the number along with memory requirements of the >> guest. We can make suggestions (e.g. calculate depending on #cores, #size >> of memory, or simply "1GB") > > So if it's all up to the guest and for the benefit of the guest, and > with no cost/benefit to the host, then why are we supplying this value > from the host? See 3), the admin has to be aware of hinting behavior. > >> 2) In the guest, track the size of hints in progress, cap at the >> hinting_size. >> >> 3) Document hinting behavior >> >> "When hinting is enabled, memory up to hinting_size might temporarily be >> removed from your guest in order to be hinted to the hypervisor. This is >> only for a very short time, but might affect applications. Consider the >> hinting_size when sizing your guest. If your application was tested with >> XGB and a hinting size of 1G is used, please configure X+1GB for the >> guest. Otherwise, performance degradation might be possible." > > OK, so let's start with this. Now let us assume that the guest follows > the advice. We thus know that 1GB is not needed for guest applications. > So why do we want to allow applications to still use this extra memory? If the application does not need the 1GB, the 1GB will be hinted to the hypervisor and is effectively only a buffer for the OOM scenario (ignoring page cache discussions for now). "So why do we want to allow applications to still use this extra memory" is the EXACT same issue you have with your buffer approach. Any guest can make use of the buffer and you won't be able to detect it. Very same problem. Only in your approach, the guest might agree to play nicely by not making use of the 1G you provided. Just as if the application does not need/use the additional 1GB. The interesting thing is most probably: Will the hinting size usually be reasonably small? At least I guess a guest with 4TB of RAM will not suddenly get a hinting size of hundreds of GB.
Most probably only something in the range of 1GB. But this is an interesting question to look into. Also, if the admin does not care about performance implications when already close to hinting, no need to add the additional 1GB to the RAM size. > >> 4) Do the loop/yield on OOM as discussed to improve performance when OOM >> and avoid false OOM triggers just to be sure. > > Yes, I'm not against trying the simpler approach as a first step. But > then we need this path actually tested to see whether hinting introduces > unreasonable overhead on this path. And it is tricky to test OOM as you > are skating close to the system's limits. That's one reason I prefer > avoiding the OOM handler if possible. The thing about the actual issue we are chasing is that it can only happen if (as far as I can see) 1) Application uses X MAX_ORDER - 1 pages 2) Application frees X MAX_ORDER - 1 pages 3) Application reallocates X MAX_ORDER - 1 pages AND a) There are not enough MAX_ORDER - 1 pages remaining while hinting b) The allocation request cannot be satisfied from other free pages of smaller order c) We actually trigger hinting with the X freed pages d) Time between 2 and 3 is not enough to complete hinting Only then will the OOM handler become active. If a reasonable amount of time has passed between 2) and 3), it is not an issue. And as I said right from the beginning, reproducing this might be difficult. And especially for this reason, I prefer simpler approaches if possible. Document it for applications that might be affected, let the admin enable the feature explicitly, avoid complexity. > > When you say yield, I would guess that would involve config space access > to the balloon to flush out outstanding hints? I rather meant yielding your CPU to the hypervisor so it can process hinting requests faster (like when waiting for a spinlock). This is the simple case.
More involved approaches might somehow indicate to the hypervisor to not process queued requests but simply return them to the guest so the guest can add the isolated pages to the buddy. If this is "config space access to the balloon to flush out outstanding hints" then yes, something like that might be a good idea if it doesn't harm performance. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 14:09 ` David Hildenbrand @ 2019-04-01 14:11 ` David Hildenbrand 2019-04-01 14:47 ` Michael S. Tsirkin 2019-04-01 14:45 ` Michael S. Tsirkin 1 sibling, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-01 14:11 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 01.04.19 16:09, David Hildenbrand wrote: [snip] > Also, if the admin does not care about performance implications when > already close to hinting, no need to add the additional 1Gb to the ram size. "close to OOM" is what I meant. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 14:11 ` David Hildenbrand @ 2019-04-01 14:47 ` Michael S. Tsirkin 2019-04-01 14:54 ` David Hildenbrand 2019-04-01 20:56 ` Alexander Duyck 0 siblings, 2 replies; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-01 14:47 UTC (permalink / raw) To: David Hildenbrand Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: > > The interesting thing is most probably: Will the hinting size usually be > > reasonably small? At least I guess a guest with 4TB of RAM will not > > suddenly get a hinting size of hundreds of GB. Most probably only > > something in the range of 1GB. But this is an interesting question to > > look into. > > > > Also, if the admin does not care about performance implications when > > already close to hinting, no need to add the additional 1GB to the RAM size. > > "close to OOM" is what I meant. Problem is, the host admin is the one adding memory. The guest admin is the one who knows about performance. > > -- > > Thanks, > > David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 14:47 ` Michael S. Tsirkin @ 2019-04-01 14:54 ` David Hildenbrand 2019-04-01 20:56 ` Alexander Duyck 1 sibling, 0 replies; 34+ messages in thread From: David Hildenbrand @ 2019-04-01 14:54 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini, lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen, konrad.wilk, dhildenb, aarcange, alexander.duyck On 01.04.19 16:47, Michael S. Tsirkin wrote: > On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: >>> The interesting thing is most probably: Will the hinting size usually be >>> reasonably small? At least I guess a guest with 4TB of RAM will not >>> suddenly get a hinting size of hundreds of GB. Most probably only >>> something in the range of 1GB. But this is an interesting question to >>> look into. >>> >>> Also, if the admin does not care about performance implications when >>> already close to hinting, no need to add the additional 1GB to the RAM size. >> >> "close to OOM" is what I meant. > > Problem is, the host admin is the one adding memory. The guest admin is > the one who knows about performance. If we think about guest admins only caring about performance, then a guest admin will unload the virtio-balloon module to a) get all the memory available (inflated memory returned to the guest) and b) not use hinting :) But I get your idea. One side wants hinting, the other side has to agree. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 14:47 ` Michael S. Tsirkin 2019-04-01 14:54 ` David Hildenbrand @ 2019-04-01 20:56 ` Alexander Duyck 2019-04-02 7:42 ` David Hildenbrand 1 sibling, 1 reply; 34+ messages in thread From: Alexander Duyck @ 2019-04-01 20:56 UTC (permalink / raw) To: Michael S. Tsirkin Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: > > > The interesting thing is most probably: Will the hinting size usually be > > > reasonable small? At least I guess a guest with 4TB of RAM will not > > > suddenly get a hinting size of hundreds of GB. Most probably also only > > > something in the range of 1GB. But this is an interesting question to > > > look into. > > > > > > Also, if the admin does not care about performance implications when > > > already close to hinting, no need to add the additional 1Gb to the ram size. > > > > "close to OOM" is what I meant. > > Problem is, host admin is the one adding memory. Guest admin is > the one that knows about performance. The thing we have to keep in mind with this is that we are not dealing with the same behavior as the balloon driver. We don't need to inflate a massive hint and hand that off. Instead we can focus on performing the hints on much smaller amounts and do it incrementally over time with the idea being as the system sits idle it frees up more and more of the inactive memory on the system. With that said, I still don't like the idea of us even trying to target 1GB of RAM for hinting. I think it would be much better if we stuck to smaller sizes and kept things down to a single digit multiple of THP or higher order pages. Maybe something like 64MB of total memory out for hinting. 
All we really would need to make it work would be to possibly look at seeing if we can combine PageType values. Specifically what I would be looking at is a transition that looks something like Buddy -> Offline -> (Buddy | Offline). We would have to hold the zone lock at each transition, but that shouldn't be too big of an issue. If we are okay with possibly combining the Offline and Buddy types we would have a way of tracking which pages have been hinted and which have not. Then we would just have to have a thread running in the background on the guest that is looking at the higher order pages and pulling 64MB at a time offline, and when the hinting is done put them back in the "Buddy | Offline" state. I view this all as working not too dissimilar to how a standard Rx ring in a network device works. Only we would want to allocate from the pool of "Buddy" pages, flag the pages as "Offline", and then when the hint has been processed we would place them back in the "Buddy" list with the "Offline" value still set. The only real changes needed to the buddy allocator would be to add some logic for clearing/merging the "Offline" setting as necessary, and to provide an allocator that only works with non-"Offline" pages. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-01 20:56 ` Alexander Duyck @ 2019-04-02 7:42 ` David Hildenbrand 2019-04-02 15:04 ` Alexander Duyck 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 7:42 UTC (permalink / raw) To: Alexander Duyck, Michael S. Tsirkin Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On 01.04.19 22:56, Alexander Duyck wrote: > On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote: >> >> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: >>>> The interesting thing is most probably: Will the hinting size usually be >>>> reasonable small? At least I guess a guest with 4TB of RAM will not >>>> suddenly get a hinting size of hundreds of GB. Most probably also only >>>> something in the range of 1GB. But this is an interesting question to >>>> look into. >>>> >>>> Also, if the admin does not care about performance implications when >>>> already close to hinting, no need to add the additional 1Gb to the ram size. >>> >>> "close to OOM" is what I meant. >> >> Problem is, host admin is the one adding memory. Guest admin is >> the one that knows about performance. > > The thing we have to keep in mind with this is that we are not dealing > with the same behavior as the balloon driver. We don't need to inflate > a massive hint and hand that off. Instead we can focus on performing > the hints on much smaller amounts and do it incrementally over time > with the idea being as the system sits idle it frees up more and more > of the inactive memory on the system. > > With that said, I still don't like the idea of us even trying to > target 1GB of RAM for hinting. I think it would be much better if we > stuck to smaller sizes and kept things down to a single digit multiple > of THP or higher order pages. 
Maybe something like 64MB of total > memory out for hinting. 1GB was just a number I came up with. But please note, as VCPUs hint in parallel, even though each request is only 64MB in size, things can add up. > > All we really would need to make it work would be to possibly look at > seeing if we can combine PageType values. Specifically what I would be > looking at is a transition that looks something like Buddy -> Offline > -> (Buddy | Offline). We would have to hold the zone lock at each > transition, but that shouldn't be too big of an issue. If we are okay > with possibly combining the Offline and Buddy types we would have a > way of tracking which pages have been hinted and which have not. Then > we would just have to have a thread running in the background on the > guest that is looking at the higher order pages and pulling 64MB at a > time offline, and when the hinting is done put them back in the "Buddy > | Offline" state. That approach may have other issues to solve (1 thread vs. many VCPUs, scanning all buddy pages over and over again) and other implications that might be undesirable (hints performed with even more delay, additional thread activity). I wouldn't call it the ultimate solution. Your approach sounds very interesting to play with, however at this point I would like to avoid throwing away Nitesh's work once again to follow some other approach that looks promising. If we keep going like that, we'll spend another ~10 years working on free page hinting without getting anything upstream. Especially if it involves more core-MM changes. We've been there, we've done that. As long as the guest-host interface is generic enough, we can play with such approaches later in the guest. The important part is that the guest-host interface allows for that. > > I view this all as working not too dissimilar to how a standard Rx > ring in a network device works.
Only we would want to allocate from > the pool of "Buddy" pages, flag the pages as "Offline", and then when > the hint has been processed we would place them back in the "Buddy" > list with the "Offline" value still set. The only real changes needed > to the buddy allocator would be to add some logic for clearing/merging > the "Offline" setting as necessary, and to provide an allocator that > only works with non-"Offline" pages. Sorry, I had to smile at the phrase "only" in combination with "provide an allocator that only works with non-Offline pages" :) . I guess you realize yourself that these are core-mm changes that might easily be rejected upstream because "the virt guys try to teach core-MM yet another special case". I agree that this is nice to play with, eventually that approach could succeed and be accepted upstream. But I consider this long term work. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 7:42 ` David Hildenbrand @ 2019-04-02 15:04 ` Alexander Duyck 2019-04-02 15:25 ` Michael S. Tsirkin ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 15:04 UTC (permalink / raw) To: David Hildenbrand Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote: > > On 01.04.19 22:56, Alexander Duyck wrote: [snip] > > The thing we have to keep in mind with this is that we are not dealing > > with the same behavior as the balloon driver. We don't need to inflate > > a massive hint and hand that off. Instead we can focus on performing > > the hints on much smaller amounts and do it incrementally over time > > with the idea being as the system sits idle it frees up more and more > > of the inactive memory on the system. > > > > With that said, I still don't like the idea of us even trying to > > target 1GB of RAM for hinting.
I think it would be much better if we > > stuck to smaller sizes and kept things down to a single digit multiple > > of THP or higher order pages. Maybe something like 64MB of total > > memory out for hinting. > > 1GB was just a number I came up with. But please note, as VCPUs hint in > parallel, even though each request is only 64MB in size, things can sum up. Why do we need them running in parallel for a single guest? I don't think we need the hints so quickly that we would need to have multiple VCPUs running in parallel to provide hints. In addition as it currently stands in order to get pages into and out of the buddy allocator we are going to have to take the zone lock anyway so we could probably just assume a single thread for pulling the memory, placing it on the ring, and putting it back into the buddy allocator after the hint has been completed. > > > > All we really would need to make it work would be to possibly look at > > seeing if we can combine PageType values. Specifically what I would be > > looking at is a transition that looks something like Buddy -> Offline > > -> (Buddy | Offline). We would have to hold the zone lock at each > > transition, but that shouldn't be too big of an issue. If we are okay > > with possibly combining the Offline and Buddy types we would have a > > way of tracking which pages have been hinted and which have not. Then > > we would just have to have a thread running in the background on the > > guest that is looking at the higher order pages and pulling 64MB at a > > time offline, and when the hinting is done put them back in the "Buddy > > | Offline" state. > > That approach may have other issues to solve (1 thread vs. many VCPUs, > scanning all buddy pages over and over again) and other implications > that might be undesirable (hints performed even more delayed, additional > thread activity). I wouldn't call it the ultimate solution. 
So the problem with trying to provide the hint sooner is that you end up creating a bottle-neck or you end up missing hints on pages entirely and then have to fall back to such an approach. By just letting the thread run in the background reporting the idle memory we can avoid much of that. Also there isn't a huge priority to report idle memory in real time. That would be kind of pointless as it might be pulled back out and reused as soon as it is added. What we need is to give the memory a bit of time to "cool" so that we aren't constantly hinting away memory that is still in use. > Your approach sounds very interesting to play with, however > at this point I would like to avoid throwing away Nitesh work once again > to follow some other approach that looks promising. If we keep going > like that, we'll spend another ~10 years working on free page hinting > without getting anything upstream. Especially if it involves more > core-MM changes. We've been there, we've done that. As long as the > guest-host interface is generic enough, we can play with such approaches > later in the guest. Important part is that the guest-host interface > allows for that. I'm not throwing anything away. One of the issues in Nitesh's design is that he is going to either miss memory and have to run an asynchronous thread to clean it up after the fact, or he is going to cause massive OOM errors and/or have to start halting VCPUs while waiting on the processing. All I am suggesting is that we can get away from having to deal with both by just walking through the free pages for the higher order and hinting only a few at a time without having to try to provide the host with the hints on what is idle the second it is freed. > > > > I view this all as working not too dissimilar to how a standard Rx > > ring in a network device works. 
Only we would want to allocate from > > the pool of "Buddy" pages, flag the pages as "Offline", and then when > > the hint has been processed we would place them back in the "Buddy" > > list with the "Offline" value still set. The only real changes needed > > to the buddy allocator would be to add some logic for clearing/merging > > the "Offline" setting as necessary, and to provide an allocator that > > only works with non-"Offline" pages. > > Sorry, I had to smile at the phrase "only" in combination with "provide > an allocator that only works with non-Offline pages" :) . I guess you > realize yourself that these are core-mm changes that might easily be > rejected upstream because "the virt guys try to teach core-MM yet > another special case". I agree that this is nice to play with, > eventually that approach could succeed and be accepted upstream. But I > consider this long-term work. The actual patch for this would probably be pretty small and compared to some of the other stuff that has gone in recently isn't too far out of the realm of possibility. It isn't too different than the code that has already gone in to determine the unused pages for virtio-balloon free page hinting. Basically what we would be doing is providing a means for incrementally transitioning the buddy memory into the idle/offline state to reduce guest memory overhead. It would require one function that would walk the free page lists and pluck out pages that don't have the "Offline" page type set, a one-line change to the logic for allocating a page as we would need to clear that extra bit of state, and optionally some bits for how to handle the merge of two "Offline" pages in the buddy allocator (required for lower order support). It solves most of the guest side issues with the free page hinting in that trying to do it via the arch_free_page path is problematic at best since it was designed for a synchronous setup, not an asynchronous one. 
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 15:04 ` Alexander Duyck @ 2019-04-02 15:25 ` Michael S. Tsirkin 2019-04-02 15:57 ` David Hildenbrand 2019-04-02 15:55 ` David Hildenbrand 2019-04-02 16:19 ` David Hildenbrand 2 siblings, 1 reply; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-02 15:25 UTC (permalink / raw) To: Alexander Duyck Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: > Basically what we would be doing is providing a means for > incrementally transitioning the buddy memory into the idle/offline > state to reduce guest memory overhead. It would require one function > that would walk the free page lists and pluck out pages that don't > have the "Offline" page type set, I think we will need an interface that gets an offline page and returns the next online free page. If we restart the list walk each time we can't guarantee progress. > a one-line change to the logic for > allocating a page as we would need to clear that extra bit of state, > and optionally some bits for how to handle the merge of two "Offline" > pages in the buddy allocator (required for lower order support). It > solves most of the guest side issues with the free page hinting in > that trying to do it via the arch_free_page path is problematic at > best since it was designed for a synchronous setup, not an > asynchronous one. -- MST ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 15:25 ` Michael S. Tsirkin @ 2019-04-02 15:57 ` David Hildenbrand 2019-04-02 16:18 ` Alexander Duyck 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 15:57 UTC (permalink / raw) To: Michael S. Tsirkin, Alexander Duyck Cc: Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On 02.04.19 17:25, Michael S. Tsirkin wrote: > On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: >> Basically what we would be doing is providing a means for >> incrementally transitioning the buddy memory into the idle/offline >> state to reduce guest memory overhead. It would require one function >> that would walk the free page lists and pluck out pages that don't >> have the "Offline" page type set, > > I think we will need an interface that gets > an offline page and returns the next online free page. > > If we restart the list walk each time we can't guarantee progress. Yes, and essentially we are scanning all the time for chunks vs. we get notified which chunks are possible hinting candidates. Totally different design. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 15:57 ` David Hildenbrand @ 2019-04-02 16:18 ` Alexander Duyck 2019-04-02 17:08 ` David Hildenbrand 0 siblings, 1 reply; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 16:18 UTC (permalink / raw) To: David Hildenbrand Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote: > > On 02.04.19 17:25, Michael S. Tsirkin wrote: > > On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: > >> Basically what we would be doing is providing a means for > >> incrementally transitioning the buddy memory into the idle/offline > >> state to reduce guest memory overhead. It would require one function > >> that would walk the free page lists and pluck out pages that don't > >> have the "Offline" page type set, > > > > I think we will need an interface that gets > > an offline page and returns the next online free page. > > > > If we restart the list walk each time we can't guarantee progress. > > Yes, and essentially we are scanning all the time for chunks vs. we get > notified which chunks are possible hinting candidates. Totally different > design. The problem as I see it is that we can miss notifications if we become too backlogged, and that will lead to us having to fall back to scanning anyway. So instead of trying to implement both, why don't we just focus on the scanning approach? Otherwise the only other option is to hold up the guest and make it wait until the hint processing has completed and at that point we are back to what is essentially just a synchronous solution with batching anyway. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 16:18 ` Alexander Duyck @ 2019-04-02 17:08 ` David Hildenbrand 2019-04-02 17:45 ` Alexander Duyck 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 17:08 UTC (permalink / raw) To: Alexander Duyck Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On 02.04.19 18:18, Alexander Duyck wrote: > n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 02.04.19 17:25, Michael S. Tsirkin wrote: >>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: >>>> Basically what we would be doing is providing a means for >>>> incrementally transitioning the buddy memory into the idle/offline >>>> state to reduce guest memory overhead. It would require one function >>>> that would walk the free page lists and pluck out pages that don't >>>> have the "Offline" page type set, >>> >>> I think we will need an interface that gets >>> an offline page and returns the next online free page. >>> >>> If we restart the list walk each time we can't guarantee progress. >> >> Yes, and essentially we are scanning all the time for chunks vs. we get >> notified which chunks are possible hinting candidates. Totally different >> design. > > The problem as I see it is that we can miss notifications if we become > too backlogged, and that will lead to us having to fall back to > scanning anyway. So instead of trying to implement both why don't we > just focus on the scanning approach. Otherwise the only other option > is to hold up the guest and make it wait until the hint processing has > completed and at that point we are back to what is essentially just a > synchronous solution with batching anyway. > In general I am not a fan of "there might be a problem, let's try something completely different". Expect the unexpected. 
At this point, I prefer to think about easy solutions to eventual problems, not completely new designs. As I said, we've been there already. Related to "falling behind" with hinting. If this is indeed possible (and I'd like to know under which conditions), I wonder at which point we no longer care about missed hints. If our guest has a lot of MM activity, it could be that dropping hints is good, because our guest is so busy, it will reuse pages soon again. One important point is - I think - that free page hinting does not have to fit all possible setups. In certain environments it just makes sense to disable it. Or live with it not giving you "all the hints". E.g. databases that eat up all free memory either way. The other extreme would be a simple webserver that is mostly idle. We are losing hinting of quite a bit of free memory already due to the MAX_ORDER - X discussion. Dropping a couple of other hints shouldn't really hurt. The question is, are there scenarios where we can completely screw up. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 17:08 ` David Hildenbrand @ 2019-04-02 17:45 ` Alexander Duyck 2019-04-02 17:53 ` Michael S. Tsirkin 2019-04-02 18:21 ` David Hildenbrand 0 siblings, 2 replies; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 17:45 UTC (permalink / raw) To: David Hildenbrand Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@redhat.com> wrote: > > On 02.04.19 18:18, Alexander Duyck wrote: > > n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 02.04.19 17:25, Michael S. Tsirkin wrote: > >>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: > >>>> Basically what we would be doing is providing a means for > >>>> incrementally transitioning the buddy memory into the idle/offline > >>>> state to reduce guest memory overhead. It would require one function > >>>> that would walk the free page lists and pluck out pages that don't > >>>> have the "Offline" page type set, > >>> > >>> I think we will need an interface that gets > >>> an offline page and returns the next online free page. > >>> > >>> If we restart the list walk each time we can't guarantee progress. > >> > >> Yes, and essentially we are scanning all the time for chunks vs. we get > >> notified which chunks are possible hinting candidates. Totally different > >> design. > > > > The problem as I see it is that we can miss notifications if we become > > too backlogged, and that will lead to us having to fall back to > > scanning anyway. So instead of trying to implement both why don't we > > just focus on the scanning approach. 
Otherwise the only other option > > is to hold up the guest and make it wait until the hint processing has > > completed and at that point we are back to what is essentially just a > > synchronous solution with batching anyway. > > > > In general I am not a fan of "there might be a problem, let's try > something completely different". Expect the unexpected. At this point, I > prefer to think about easy solutions to eventual problems, not > completely new designs. As I said, we've been there already. The solution as we have it is not "easy". There are a number of race conditions contained within the code and it doesn't practically scale when you consider we are introducing multiple threads in both the isolation and returning of pages to/from the buddy allocator that will have to function within the zone lock. > Related to "falling behind" with hinting. If this is indeed possible > (and I'd like to know under which conditions), I wonder at which point > we no longer care about missed hints. If our guest has a lot of MM > activity, it could be that dropping hints is good, because our > guest is so busy, it will reuse pages soon again. This is making a LOT of assumptions. There are a few scenarios that can hold up hinting on the host side. One of the limitations of madvise is that we have to take the mm read semaphore. So if something is sitting on the write semaphore all of the hints will be blocked until it is released. > One important point is - I think - that free page hinting does not have > to fit all possible setups. In certain environments it just makes sense > to disable it. Or live with it not giving you "all the hints". E.g. > databases that eat up all free memory either way. The other extreme > would be a simple webserver that is mostly idle. My concern is we are introducing massive buffer bloat in the mm subsystem and it still has the potential for stalling VCPUs if we don't have room in the VQs. We went through this back in the day with networking. 
Adding more buffers is not the solution. The solution is to have a way to gracefully recover and keep our hinting latency and buffer bloat to a minimum. > We are losing hitning of quite free memory already due to the MAX_ORDER > - X discussion. Dropping a couple of other hints shouldn't really hurt. > The question is, are there scenarios where we can completely screw up. My concern is that it can hurt a ton. In my mind the target for a feature like this is a guest that has something like an application that will fire up a few times a day eat up a massive amount of memory, and then free it all when it is done. Now if that application is freeing a massive block of memory and for whatever reason the QEMU thread that is translating our hint requests to madvise calls cannot keep up then we are going to spend the next several hours with that memory still assigned to an idle guest. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 17:45 ` Alexander Duyck @ 2019-04-02 17:53 ` Michael S. Tsirkin 2019-04-02 20:32 ` Alexander Duyck 2019-04-02 18:21 ` David Hildenbrand 1 sibling, 1 reply; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-02 17:53 UTC (permalink / raw) To: Alexander Duyck Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On Tue, Apr 02, 2019 at 10:45:43AM -0700, Alexander Duyck wrote: > We went through this back in the day with > networking. Adding more buffers is not the solution. The solution is > to have a way to gracefully recover and keep our hinting latency and > buffer bloat to a minimum. That's an interesting approach, I think that things that end up working well are NAPI (asychronous notifications), limited batching, XDP (big aligned buffers) and BQL (accounting). Is that your perspective too? -- MST ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 17:53 ` Michael S. Tsirkin @ 2019-04-02 20:32 ` Alexander Duyck 0 siblings, 0 replies; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 20:32 UTC (permalink / raw) To: Michael S. Tsirkin Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On Tue, Apr 2, 2019 at 10:53 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Tue, Apr 02, 2019 at 10:45:43AM -0700, Alexander Duyck wrote: > > We went through this back in the day with > > networking. Adding more buffers is not the solution. The solution is > > to have a way to gracefully recover and keep our hinting latency and > > buffer bloat to a minimum. > > That's an interesting approach, I think that things that end up working > well are NAPI (asychronous notifications), limited batching, XDP (big > aligned buffers) and BQL (accounting). Is that your perspective too? > Yes, that is kind of what I was getting at. Basically we could have a kthread running somewhere that goes through and pulls something like 64M of pages out of the MAX_ORDER - 1 freelist, does what is necessary to isolate them, puts them on a queue somewhere, kicks the virtio ring, and waits for the response to come back indicating that the hints have been processed. We would just have to keep it running until the list doesn't have enough non-"Offline" memory to fulfill the request. Then we just wait until we again reach a level necessary to justify waking the thread back up and repeat. In my mind it looks a lot like your standard Rx ring in that we allocate some fixed number of buffers and wait for hardware to tell us when the buffers are ready. The only extra complexity is having to add tracking using the PageType "Offline" bit which should be cheap when we are already having to manipulate the "Buddy" PageType anyway. 
It would let us get away from having to do the per-cpu queues and complicated coordination logic to translate free pages to their buddy. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 17:45 ` Alexander Duyck 2019-04-02 17:53 ` Michael S. Tsirkin @ 2019-04-02 18:21 ` David Hildenbrand 2019-04-02 19:49 ` Michael S. Tsirkin 1 sibling, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 18:21 UTC (permalink / raw) To: Alexander Duyck Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On 02.04.19 19:45, Alexander Duyck wrote: > On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 02.04.19 18:18, Alexander Duyck wrote: >>> n Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 02.04.19 17:25, Michael S. Tsirkin wrote: >>>>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote: >>>>>> Basically what we would be doing is providing a means for >>>>>> incrementally transitioning the buddy memory into the idle/offline >>>>>> state to reduce guest memory overhead. It would require one function >>>>>> that would walk the free page lists and pluck out pages that don't >>>>>> have the "Offline" page type set, >>>>> >>>>> I think we will need an interface that gets >>>>> an offline page and returns the next online free page. >>>>> >>>>> If we restart the list walk each time we can't guarantee progress. >>>> >>>> Yes, and essentially we are scanning all the time for chunks vs. we get >>>> notified which chunks are possible hinting candidates. Totally different >>>> design. >>> >>> The problem as I see it is that we can miss notifications if we become >>> too backlogged, and that will lead to us having to fall back to >>> scanning anyway. So instead of trying to implement both why don't we >>> just focus on the scanning approach. 
Otherwise the only other option >>> is to hold up the guest and make it wait until the hint processing has >>> completed and at that point we are back to what is essentially just a >>> synchronous solution with batching anyway. >>> >> >> In general I am not a fan of "there might be a problem, let's try >> something completely different". Expect the unexpected. At this point, I >> prefer to think about easy solutions to eventual problems. not >> completely new designs. As I said, we've been there already. > > The solution as we have is not "easy". There are a number of race > conditions contained within the code and it doesn't practically scale > when you consider we are introducing multiple threads in both the > isolation and returning of pages to/from the buddy allocator that will > have to function within the zone lock. We are freeing pages already, so we are already using the zone lock. We essentially only care about two zones (maybe three). There is a lot of optimization potential. Regarding the scaling issue, I'd love to have a benchmark showcase this issue. > >> Related to "falling behind" with hinting. If this is indeed possible >> (and I'd like to know under which conditions), I wonder at which point >> we no longer care about missed hints. If our guest as a lot of MM >> activity, could be that is good that we are dropping hints, because our >> guest is so busy, it will reuse pages soon again. > > This is making a LOT of assumptions. There are a few scenarios that > can hold up hinting on the host side. One of the limitations of > madvise is that we have to take the mm read semaphore. So if something > is sitting on the write semaphore all of the hints will be blocked > until it is released. > >> One important point is - I think - that free page hinting does not have >> to fit all possible setups. In certain environments it just makes sense >> to disable it. Or live with it not giving you "all the hints". E.g. 
>> databases that eat up all free memory either way. The other extreme >> would be a simple webserver that is mostly idle. > > My concern is we are introducing massive buffer bloat in the mm > subsystem and it still has the potential for stalling VCPUs if we > don't have room in the VQs. We went through this back in the day with > networking. Adding more buffers is not the solution. The solution is > to have a way to gracefully recover and keep our hinting latency and > buffer bloat to a minimum. I think the main point is that, in contrast to real requests, we can always skip or delay hinting if it "just isn't the right time". I think that is special when it comes to "hinting". At least as far as "stalling VCPUs if we don't have room in the VQs" is concerned. I agree, adding buffers is not the solution; that's why I also didn't like Michael's approach. > >> We are losing hinting of quite free memory already due to the MAX_ORDER >> - X discussion. Dropping a couple of other hints shouldn't really hurt. >> The question is, are there scenarios where we can completely screw up. > > My concern is that it can hurt a ton. In my mind the target for a > feature like this is a guest that has something like an application > that will fire up a few times a day, eat up a massive amount of memory, > and then free it all when it is done. Now if that application is > freeing a massive block of memory and for whatever reason the QEMU > thread that is translating our hint requests to madvise calls cannot > keep up then we are going to spend the next several hours with that > memory still assigned to an idle guest. 
-- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 18:21 ` David Hildenbrand @ 2019-04-02 19:49 ` Michael S. Tsirkin 2019-04-02 20:32 ` David Hildenbrand 0 siblings, 1 reply; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-02 19:49 UTC (permalink / raw) To: David Hildenbrand Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On Tue, Apr 02, 2019 at 08:21:30PM +0200, David Hildenbrand wrote: > The other extreme is a system that barely frees (MAX_ORDER - X) pages, > however your thread will waste cycles scanning for such. I don't think we need to scan as such. An arch hook that queues a job to a wq only when there's work to do will fix the issue, right? -- MST ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 19:49 ` Michael S. Tsirkin @ 2019-04-02 20:32 ` David Hildenbrand 0 siblings, 0 replies; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 20:32 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli, Dave Hansen On 02.04.19 21:49, Michael S. Tsirkin wrote: > On Tue, Apr 02, 2019 at 08:21:30PM +0200, David Hildenbrand wrote: >> The other extreme is a system that barely frees (MAX_ORDER - X) pages, >> however your thread will waste cycles scanning for such. > > I don't think we need to scan as such. An arch hook > that queues a job to a wq only when there's work to > do will fix the issue, right? > Well, even a system that is basically idle can theoretically free a handful of pages every now and then that you would like to report. There would have to be a mechanism to detect if pages worth hinting have been freed and need processing, IOW something like the callback after buddy merging that we are already dealing with. That can then be a simple flag which the hinting thread considers (e.g. wake up every X seconds to check if there is work), or a kick of the hinting thread like you suggest. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM 2019-04-02 15:04 ` Alexander Duyck 2019-04-02 15:25 ` Michael S. Tsirkin @ 2019-04-02 15:55 ` David Hildenbrand 2019-04-02 17:30 ` Alexander Duyck 2019-04-02 16:19 ` David Hildenbrand 2 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 15:55 UTC (permalink / raw) To: Alexander Duyck Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On 02.04.19 17:04, Alexander Duyck wrote: > On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 01.04.19 22:56, Alexander Duyck wrote: >>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote: >>>> >>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: >>>>>> The interesting thing is most probably: Will the hinting size usually be >>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not >>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only >>>>>> something in the range of 1GB. But this is an interesting question to >>>>>> look into. >>>>>> >>>>>> Also, if the admin does not care about performance implications when >>>>>> already close to hinting, no need to add the additional 1Gb to the ram size. >>>>> >>>>> "close to OOM" is what I meant. >>>> >>>> Problem is, host admin is the one adding memory. Guest admin is >>>> the one that knows about performance. >>> >>> The thing we have to keep in mind with this is that we are not dealing >>> with the same behavior as the balloon driver. We don't need to inflate >>> a massive hint and hand that off. Instead we can focus on performing >>> the hints on much smaller amounts and do it incrementally over time >>> with the idea being as the system sits idle it frees up more and more >>> of the inactive memory on the system. 
>>> >>> With that said, I still don't like the idea of us even trying to >>> target 1GB of RAM for hinting. I think it would be much better if we >>> stuck to smaller sizes and kept things down to a single digit multiple >>> of THP or higher order pages. Maybe something like 64MB of total >>> memory out for hinting. >> >> 1GB was just a number I came up with. But please note, as VCPUs hint in >> parallel, even though each request is only 64MB in size, things can sum up. > > Why do we need them running in parallel for a single guest? I don't > think we need the hints so quickly that we would need to have multiple > VCPUs running in parallel to provide hints. In addition as it > currently stands in order to get pages into and out of the buddy > allocator we are going to have to take the zone lock anyway so we > could probably just assume a single thread for pulling the memory, > placing it on the ring, and putting it back into the buddy allocator > after the hint has been completed. VCPUs hint when they think the time has come. Hinting in parallel comes naturally. > >>> >>> All we really would need to make it work would be to possibly look at >>> seeing if we can combine PageType values. Specifically what I would be >>> looking at is a transition that looks something like Buddy -> Offline >>> -> (Buddy | Offline). We would have to hold the zone lock at each >>> transition, but that shouldn't be too big of an issue. If we are okay >>> with possibly combining the Offline and Buddy types we would have a >>> way of tracking which pages have been hinted and which have not. Then >>> we would just have to have a thread running in the background on the >>> guest that is looking at the higher order pages and pulling 64MB at a >>> time offline, and when the hinting is done put them back in the "Buddy >>> | Offline" state. >> >> That approach may have other issues to solve (1 thread vs. 
many VCPUs, >> scanning all buddy pages over and over again) and other implications >> that might be undesirable (hints performed even more delayed, additional >> thread activity). I wouldn't call it the ultimate solution. > > So the problem with trying to provide the hint sooner is that you end > up creating a bottle-neck or you end up missing hints on pages > entirely and then have to fall back to such an approach. By just > letting the thread run in the background reporting the idle memory we > can avoid much of that. > > Also there isn't a huge priority to report idle memory in real time. > That would be kind of pointless as it might be pulled back out and > reused as soon as it is added. What we need is to give the memory a > bit of time to "cool" so that we aren't constantly hinting away memory > that is still in use. Depending on the setup, you don't want free memory lying around for too long in your guest. > >> Your approach sounds very interesting to play with, however >> at this point I would like to avoid throwing away Nitesh work once again >> to follow some other approach that looks promising. If we keep going >> like that, we'll spend another ~10 years working on free page hinting >> without getting anything upstream. Especially if it involves more >> core-MM changes. We've been there, we've done that. As long as the >> guest-host interface is generic enough, we can play with such approaches >> later in the guest. Important part is that the guest-host interface >> allows for that. > > I'm not throwing anything away. One of the issues in Nitesh's design > is that he is going to either miss memory and have to run an > asynchronous thread to clean it up after the fact, or he is going to > cause massive OOM errors and/or have to start halting VCPUs while 1. how are we going to miss memory. We are going to miss memory because we hint on very huge chunks, but we all agreed to live with that for now. 2. 
What are the "massive OOM" errors you are talking about? We have the one scenario we described Nitesh was not even able to reproduce yet. And we have ways to mitigate the problem (discussed in this thread). We have something that seems to work. Let's work from there instead of scrapping the general design once more, thinking "it is super easy". And yes, what you propose is pretty much throwing away the current design in the guest. > waiting on the processing. All I am suggesting is that we can get away > from having to deal with both by just walking through the free pages > for the higher order and hinting only a few at a time without having > to try to provide the host with the hints on what is idle the second > it is freed. > >>> >>> I view this all as working not too dissimilar to how a standard Rx >>> ring in a network device works. Only we would want to allocate from >>> the pool of "Buddy" pages, flag the pages as "Offline", and then when >>> the hint has been processed we would place them back in the "Buddy" >>> list with the "Offline" value still set. The only real changes needed >>> to the buddy allocator would be to add some logic for clearing/merging >>> the "Offline" setting as necessary, and to provide an allocator that >>> only works with non-"Offline" pages. >> >> Sorry, I had to smile at the phrase "only" in combination with "provide >> an allocator that only works with non-Offline pages" :) . I guess you >> realize yourself that these are core-mm changes that might easily be >> rejected upstream because "the virt guys try to teach core-MM yet >> another special case". I agree that this is nice to play with, >> eventually that approach could succeed and be accepted upstream. But I >> consider this long term work. > > The actual patch for this would probably be pretty small and compared > to some of the other stuff that has gone in recently isn't too far out > of the realm of possibility. 
It isn't too different then the code that > has already done in to determine the unused pages for virtio-balloon > free page hinting. > > Basically what we would be doing is providing a means for > incrementally transitioning the buddy memory into the idle/offline > state to reduce guest memory overhead. It would require one function > that would walk the free page lists and pluck out pages that don't > have the "Offline" page type set, a one-line change to the logic for > allocating a page as we would need to clear that extra bit of state, > and optionally some bits for how to handle the merge of two "Offline" > pages in the buddy allocator (required for lower order support). It > solves most of the guest side issues with the free page hinting in > that trying to do it via the arch_free_page path is problematic at > best since it was designed for a synchronous setup, not an > asynchronous one. This is throwing away work. No I don't think this is the right path to follow for now. Feel free to look into it while Nitesh gets something in shape we know conceptually works and we are starting to know which issues we are hitting. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
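[Editor's note] The "Offline" page-type mechanism Alex describes in the quoted text above — a walker that plucks only un-hinted pages off the free lists, plus a one-line change that clears the bit on allocation — can be sketched as a self-contained toy model. Everything here (names, the single linked list standing in for the buddy free lists) is illustrative and is not the real mm/page_alloc.c interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: each free page carries a 'hinted' bit (standing in for
 * the "Offline" page type), the hinting walker collects only pages
 * without the bit, and the allocation path clears it again.
 */
struct toy_page {
    bool hinted;            /* models "Offline" set on a free page */
    struct toy_page *next;  /* models buddy free-list linkage */
};

/* Walk the free list and collect up to 'batch' un-hinted pages. */
static size_t collect_unhinted(struct toy_page *list,
                               struct toy_page **out, size_t batch)
{
    size_t n = 0;
    for (struct toy_page *p = list; p && n < batch; p = p->next)
        if (!p->hinted)
            out[n++] = p;
    return n;
}

/* After the host has processed the hint, mark the pages hinted. */
static void mark_hinted(struct toy_page **pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i]->hinted = true;
}

/* The "one-line change" on allocation: clear the extra bit of state. */
static struct toy_page *alloc_from(struct toy_page **list)
{
    struct toy_page *p = *list;
    if (p) {
        *list = p->next;
        p->hinted = false;
    }
    return p;
}

int toy_demo(void)
{
    struct toy_page pages[4] = { 0 };
    for (int i = 0; i < 3; i++)
        pages[i].next = &pages[i + 1];
    struct toy_page *list = &pages[0];
    struct toy_page *batch[2];

    /* First cycle: two un-hinted pages found and hinted. */
    size_t n = collect_unhinted(list, batch, 2);
    assert(n == 2);
    mark_hinted(batch, n);

    /* Second cycle skips those and picks the remaining two. */
    n = collect_unhinted(list, batch, 2);
    assert(n == 2 && batch[0] == &pages[2]);
    mark_hinted(batch, n);

    /* Allocation hands out a page with the bit cleared again. */
    struct toy_page *p = alloc_from(&list);
    assert(p && !p->hinted);
    return 0;
}
```

The merge handling for two "Offline" buddies that Alex mentions (needed for lower-order support) is deliberately left out of this sketch.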
* Re: On guest free page hinting and OOM 2019-04-02 15:55 ` David Hildenbrand @ 2019-04-02 17:30 ` Alexander Duyck 2019-04-02 18:53 ` David Hildenbrand 0 siblings, 1 reply; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 17:30 UTC (permalink / raw) To: David Hildenbrand Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 2, 2019 at 8:56 AM David Hildenbrand <david@redhat.com> wrote: > > On 02.04.19 17:04, Alexander Duyck wrote: > > On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 01.04.19 22:56, Alexander Duyck wrote: > >>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote: > >>>> > >>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote: > >>>>>> The interesting thing is most probably: Will the hinting size usually be > >>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not > >>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only > >>>>>> something in the range of 1GB. But this is an interesting question to > >>>>>> look into. > >>>>>> > >>>>>> Also, if the admin does not care about performance implications when > >>>>>> already close to hinting, no need to add the additional 1Gb to the ram size. > >>>>> > >>>>> "close to OOM" is what I meant. > >>>> > >>>> Problem is, host admin is the one adding memory. Guest admin is > >>>> the one that knows about performance. > >>> > >>> The thing we have to keep in mind with this is that we are not dealing > >>> with the same behavior as the balloon driver. We don't need to inflate > >>> a massive hint and hand that off. 
Instead we can focus on performing > >>> the hints on much smaller amounts and do it incrementally over time > >>> with the idea being as the system sits idle it frees up more and more > >>> of the inactive memory on the system. > >>> > >>> With that said, I still don't like the idea of us even trying to > >>> target 1GB of RAM for hinting. I think it would be much better if we > >>> stuck to smaller sizes and kept things down to a single digit multiple > >>> of THP or higher order pages. Maybe something like 64MB of total > >>> memory out for hinting. > >> > >> 1GB was just a number I came up with. But please note, as VCPUs hint in > >> parallel, even though each request is only 64MB in size, things can sum up. > > > > Why do we need them running in parallel for a single guest? I don't > > think we need the hints so quickly that we would need to have multiple > > VCPUs running in parallel to provide hints. In addition as it > > currently stands in order to get pages into and out of the buddy > > allocator we are going to have to take the zone lock anyway so we > > could probably just assume a single thread for pulling the memory, > > placing it on the ring, and putting it back into the buddy allocator > > after the hint has been completed. > > VCPUs hint when they think the time has come. Hinting in parallel comes > naturally. Actually it doesn't because if we are doing it asynchronously we are having to pull pages out of the zone which requires the zone lock. That has been one of the reasons why the patches from Nitesh start dropping in performance when you start enabling more than 1 VCPU. If we are limited by the zone lock it doesn't make sense for us to try to do thing in parallel. > > > >>> > >>> All we really would need to make it work would be to possibly look at > >>> seeing if we can combine PageType values. Specifically what I would be > >>> looking at is a transition that looks something like Buddy -> Offline > >>> -> (Buddy | Offline). 
We would have to hold the zone lock at each > >>> transition, but that shouldn't be too big of an issue. If we are okay > >>> with possibly combining the Offline and Buddy types we would have a > >>> way of tracking which pages have been hinted and which have not. Then > >>> we would just have to have a thread running in the background on the > >>> guest that is looking at the higher order pages and pulling 64MB at a > >>> time offline, and when the hinting is done put them back in the "Buddy > >>> | Offline" state. > >> > >> That approach may have other issues to solve (1 thread vs. many VCPUs, > >> scanning all buddy pages over and over again) and other implications > >> that might be undesirable (hints performed even more delayed, additional > >> thread activity). I wouldn't call it the ultimate solution. > > > > So the problem with trying to provide the hint sooner is that you end > > up creating a bottle-neck or you end up missing hints on pages > > entirely and then have to fall back to such an approach. By just > > letting the thread run in the background reporting the idle memory we > > can avoid much of that. > > > > Also there isn't a huge priority to report idle memory in real time. > > That would be kind of pointless as it might be pulled back out and > > reused as soon as it is added. What we need is to give the memory a > > bit of time to "cool" so that we aren't constantly hinting away memory > > that is still in use. > > Depending on the setup, you don't want free memory lying around for too > long in your guest. Right, but you don't need it as soon as it is freed either. If it is idle in the guest for a few seconds that shouldn't be an issue. The free page hinting will hurt performance if we are doing it too often simply because we are going to be triggering a much higher rate of page faults. 
> >
> >> Your approach sounds very interesting to play with, however
> >> at this point I would like to avoid throwing away Nitesh work once again
> >> to follow some other approach that looks promising. If we keep going
> >> like that, we'll spend another ~10 years working on free page hinting
> >> without getting anything upstream. Especially if it involves more
> >> core-MM changes. We've been there, we've done that. As long as the
> >> guest-host interface is generic enough, we can play with such approaches
> >> later in the guest. Important part is that the guest-host interface
> >> allows for that.
> >
> > I'm not throwing anything away. One of the issues in Nitesh's design
> > is that he is going to either miss memory and have to run an
> > asynchronous thread to clean it up after the fact, or he is going to
> > cause massive OOM errors and/or have to start halting VCPUs while
>
> 1. how are we going to miss memory. We are going to miss memory because
> we hint on very huge chunks, but we all agreed to live with that for now.

What I am talking about is that some application frees gigabytes of
memory. As I recall the queue length for a single cpu is only like 1G.
Are we going to be sitting on the backlog of most of system memory
while we process it 1G at a time?

> 2. What are the "massive OOM" errors you are talking about? We have the
> one scenario we described Nitesh was not even able to reproduce yet. And
> we have ways to mitigate the problem (discussed in this thread).

So I am referring to the last patch set I have seen. Last I knew all
the code was doing was assembling lists of isolated pages and placing
them on a queue. I have seen no way that this really limits the length
of the virtqueue, and the length of the isolated page lists is the
only thing that has any specific limits to it.
So I see it easily being possible for a good portion of memory being consumed by the queue when you consider that what you have is essentially the maximum length of the isolated page list multiplied by the number of entries in a virtqueue. > We have something that seems to work. Let's work from there instead of > scrapping the general design once more, thinking "it is super easy". And > yes, what you propose is pretty much throwing away the current design in > the guest. Define "work"? The last patch set required massive fixes as it was causing kernel panics if more than 1 VCPU was enabled and list corruption in general. I'm sure there are a ton more bugs lurking as we have only begun to be able to stress this code in any meaningful way. For example what happens if someone sits on the mm write semaphore for an extended period of time on the host? That will shut down all of the hinting until that is released, and at that point once again any hinting queues will be stuck on the guest until they can be processed by the host. > > waiting on the processing. All I am suggesting is that we can get away > > from having to deal with both by just walking through the free pages > > for the higher order and hinting only a few at a time without having > > to try to provide the host with the hints on what is idle the second > > it is freed. > > > >>> > >>> I view this all as working not too dissimilar to how a standard Rx > >>> ring in a network device works. Only we would want to allocate from > >>> the pool of "Buddy" pages, flag the pages as "Offline", and then when > >>> the hint has been processed we would place them back in the "Buddy" > >>> list with the "Offline" value still set. The only real changes needed > >>> to the buddy allocator would be to add some logic for clearing/merging > >>> the "Offline" setting as necessary, and to provide an allocator that > >>> only works with non-"Offline" pages. 
> >> > >> Sorry, I had to smile at the phrase "only" in combination with "provide > >> an allocator that only works with non-Offline pages" :) . I guess you > >> realize yourself that these are core-mm changes that might easily be > >> rejected upstream because "the virt guys try to teach core-MM yet > >> another special case". I agree that this is nice to play with, > >> eventually that approach could succeed and be accepted upstream. But I > >> consider this long term work. > > > > The actual patch for this would probably be pretty small and compared > > to some of the other stuff that has gone in recently isn't too far out > > of the realm of possibility. It isn't too different then the code that > > has already done in to determine the unused pages for virtio-balloon > > free page hinting. > > > > Basically what we would be doing is providing a means for > > incrementally transitioning the buddy memory into the idle/offline > > state to reduce guest memory overhead. It would require one function > > that would walk the free page lists and pluck out pages that don't > > have the "Offline" page type set, a one-line change to the logic for > > allocating a page as we would need to clear that extra bit of state, > > and optionally some bits for how to handle the merge of two "Offline" > > pages in the buddy allocator (required for lower order support). It > > solves most of the guest side issues with the free page hinting in > > that trying to do it via the arch_free_page path is problematic at > > best since it was designed for a synchronous setup, not an > > asynchronous one. > > This is throwing away work. No I don't think this is the right path to > follow for now. Feel free to look into it while Nitesh gets something in > shape we know conceptually works and we are starting to know which > issues we are hitting. Yes, it is throwing away work. But if the work is running toward a dead end does it add any value? 
I've been looking into the stuff Nitesh has been doing. I don't know about others, but I have been testing it. That is why I provided the patches I did to get it stable enough for me to test and address the regressions it was causing. That is the source of some of my concern. I think we have been making this overly complex with all the per-cpu bits and trying to place this in the free path itself. We really need to scale this back and look at having a single thread with a walker of some sort just hinting on what memory is sitting in the buddy but not hinted on. It is a solution that would work, even in a multiple VCPU case, and is achievable in the short term. ^ permalink raw reply [flat|nested] 34+ messages in thread
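[Editor's note] Alex's concern about memory parked in the hinting queues can be put into rough numbers: the worst case is about (isolated pages per request) × (virtqueue entries) × (bytes per hinted block). The constants below — a 256-entry virtqueue, 16 blocks per request, 4 MiB per MAX_ORDER-1 block on x86-64 — are assumptions for illustration, not values taken from Nitesh's patches:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case bytes sitting in hinting queues: every virtqueue entry
 * holds a full request of isolated page blocks at once.
 */
static uint64_t queued_bytes(uint64_t pages_per_req,
                             uint64_t vq_entries,
                             uint64_t block_bytes)
{
    return pages_per_req * vq_entries * block_bytes;
}
```

With those assumed numbers, a single in-flight request is 16 × 4 MiB = 64 MiB (matching the batch size Alex argues for), but a full 256-entry virtqueue could park 16 GiB of guest memory — which is why he wants the per-request amount kept small and the queue depth bounded.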
* Re: On guest free page hinting and OOM 2019-04-02 17:30 ` Alexander Duyck @ 2019-04-02 18:53 ` David Hildenbrand 2019-04-02 23:43 ` Alexander Duyck 0 siblings, 1 reply; 34+ messages in thread From: David Hildenbrand @ 2019-04-02 18:53 UTC (permalink / raw) To: Alexander Duyck Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli >>> Why do we need them running in parallel for a single guest? I don't >>> think we need the hints so quickly that we would need to have multiple >>> VCPUs running in parallel to provide hints. In addition as it >>> currently stands in order to get pages into and out of the buddy >>> allocator we are going to have to take the zone lock anyway so we >>> could probably just assume a single thread for pulling the memory, >>> placing it on the ring, and putting it back into the buddy allocator >>> after the hint has been completed. >> >> VCPUs hint when they think the time has come. Hinting in parallel comes >> naturally. > > Actually it doesn't because if we are doing it asynchronously we are > having to pull pages out of the zone which requires the zone lock. Yes, and we already work with zones already when freeing. At least one zone. > That has been one of the reasons why the patches from Nitesh start > dropping in performance when you start enabling more than 1 VCPU. If > we are limited by the zone lock it doesn't make sense for us to try to > do thing in parallel. That is an interesting point and I'd love to see some performance numbers. >>> Also there isn't a huge priority to report idle memory in real time. >>> That would be kind of pointless as it might be pulled back out and >>> reused as soon as it is added. What we need is to give the memory a >>> bit of time to "cool" so that we aren't constantly hinting away memory >>> that is still in use. 
>> >> Depending on the setup, you don't want free memory lying around for too >> long in your guest. > > Right, but you don't need it as soon as it is freed either. If it is > idle in the guest for a few seconds that shouldn't be an issue. The > free page hinting will hurt performance if we are doing it too often > simply because we are going to be triggering a much higher rate of > page faults. Valid point. > >>> >>>> Your approach sounds very interesting to play with, however >>>> at this point I would like to avoid throwing away Nitesh work once again >>>> to follow some other approach that looks promising. If we keep going >>>> like that, we'll spend another ~10 years working on free page hinting >>>> without getting anything upstream. Especially if it involves more >>>> core-MM changes. We've been there, we've done that. As long as the >>>> guest-host interface is generic enough, we can play with such approaches >>>> later in the guest. Important part is that the guest-host interface >>>> allows for that. >>> >>> I'm not throwing anything away. One of the issues in Nitesh's design >>> is that he is going to either miss memory and have to run an >>> asynchronous thread to clean it up after the fact, or he is going to >>> cause massive OOM errors and/or have to start halting VCPUs while >> >> 1. how are we going to miss memory. We are going to miss memory because >> we hint on very huge chunks, but we all agreed to live with that for now. > > What I am talking about is that some application frees gigabytes of > memory. As I recall the queue length for a single cpu is only like 1G. > Are we going to be sitting on the backlog of most of system memory > while we process it 1G at a time? I think it is something around "pages that fit into a request" * "numbers of entries in a virtqueue". > >> 2. What are the "massive OOM" errors you are talking about? We have the >> one scenario we described Nitesh was not even able to reproduce yet. 
And >> we have ways to mitigate the problem (discussed in this thread). > > So I am referring to the last patch set I have seen. Last I knew all > the code was doing was assembling lists if isolated pages and placing > them on a queue. I have seen no way that this really limits the length > of the virtqueue, and the length of the isolated page lists is the Ah yes, we are discussing something towards possible capping in this thread. Once there would be too much being hinted already, skip hinting for now. Which might have good and bad sides. Different discussion. > only thing that has any specific limits to it. So I see it easily > being possible for a good portion of memory being consumed by the > queue when you consider that what you have is essentially the maximum > length of the isolated page list multiplied by the number of entries > in a virtqueue. > >> We have something that seems to work. Let's work from there instead of >> scrapping the general design once more, thinking "it is super easy". And >> yes, what you propose is pretty much throwing away the current design in >> the guest. > > Define "work"? The last patch set required massive fixes as it was > causing kernel panics if more than 1 VCPU was enabled and list > corruption in general. I'm sure there are a ton more bugs lurking as > we have only begun to be able to stress this code in any meaningful > way. "work" - we get performance numbers that look promising and sorting out issues in the design we find. This is RFC. We are discussing design details. If there are issues in the design, let's discuss. If there are alternatives, let's discuss. Bashing on the quality of prototypes? Please don't. > > For example what happens if someone sits on the mm write semaphore for > an extended period of time on the host? That will shut down all of the > hinting until that is released, and at that point once again any > hinting queues will be stuck on the guest until they can be processed > by the host. 
I remember that is why we are using asynchronous requests. Combined with dropping hints when in such a situation (posted hints not getting processed), nobody would be stuck. Or am I missing something? Yes, then the issue of dropped hints arises, and that is a different discussion. > >>> waiting on the processing. All I am suggesting is that we can get away >>> from having to deal with both by just walking through the free pages >>> for the higher order and hinting only a few at a time without having >>> to try to provide the host with the hints on what is idle the second >>> it is freed. >>> >>>>> >>>>> I view this all as working not too dissimilar to how a standard Rx >>>>> ring in a network device works. Only we would want to allocate from >>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when >>>>> the hint has been processed we would place them back in the "Buddy" >>>>> list with the "Offline" value still set. The only real changes needed >>>>> to the buddy allocator would be to add some logic for clearing/merging >>>>> the "Offline" setting as necessary, and to provide an allocator that >>>>> only works with non-"Offline" pages. >>>> >>>> Sorry, I had to smile at the phrase "only" in combination with "provide >>>> an allocator that only works with non-Offline pages" :) . I guess you >>>> realize yourself that these are core-mm changes that might easily be >>>> rejected upstream because "the virt guys try to teach core-MM yet >>>> another special case". I agree that this is nice to play with, >>>> eventually that approach could succeed and be accepted upstream. But I >>>> consider this long term work. >>> >>> The actual patch for this would probably be pretty small and compared >>> to some of the other stuff that has gone in recently isn't too far out >>> of the realm of possibility. It isn't too different then the code that >>> has already done in to determine the unused pages for virtio-balloon >>> free page hinting. 
>>>
>>> Basically what we would be doing is providing a means for
>>> incrementally transitioning the buddy memory into the idle/offline
>>> state to reduce guest memory overhead. It would require one function
>>> that would walk the free page lists and pluck out pages that don't
>>> have the "Offline" page type set, a one-line change to the logic for
>>> allocating a page as we would need to clear that extra bit of state,
>>> and optionally some bits for how to handle the merge of two "Offline"
>>> pages in the buddy allocator (required for lower order support). It
>>> solves most of the guest side issues with the free page hinting in
>>> that trying to do it via the arch_free_page path is problematic at
>>> best since it was designed for a synchronous setup, not an
>>> asynchronous one.
>>
>> This is throwing away work. No I don't think this is the right path to
>> follow for now. Feel free to look into it while Nitesh gets something in
>> shape we know conceptually works and we are starting to know which
>> issues we are hitting.
>
> Yes, it is throwing away work. But if the work is running toward a
> dead end does it add any value?

"I'm not throwing anything away." vs. "Yes, it is throwing away work.",
now we are on the same page. So your main point here is that you are
fairly sure we are running towards a dead end, right?

>
> I've been looking into the stuff Nitesh has been doing. I don't know
> about others, but I have been testing it. That is why I provided the
> patches I did to get it stable enough for me to test and address the
> regressions it was causing. That is the source of some of my concern.

Testing and feedback is very much appreciated. You have concerns, they
are valid. I do like discussing concerns, discussing possible solutions,
or finding out that it cannot be solved the easy way. Then throw it
away. Coming up with a clean design that considers problems that are not
directly visible is something I would like to see.
But usually they don't jump at you before prototyping. The simplest approach so far was "scan for zero pages in the hypervisor". No changes in the guest needed except setting pages to zero when freeing. No additional threads in the guest. No hinting. And still we decided against it. > I think we have been making this overly complex with all the per-cpu > bits and trying to place this in the free path itself. We really need We already removed complexity, at least that is my impression. There are bugs in there, yes. > to scale this back and look at having a single thread with a walker of > some sort just hinting on what memory is sitting in the buddy but not > hinted on. It is a solution that would work, even in a multiple VCPU > case, and is achievable in the short term. Can you write up your complete proposal and start a new thread. What I understood so far is 1. Separate hinting thread 2. Use virtio-balloon mechanism similar to Nitesh's work 3. Iterate over !offline pages in the buddy. Take them temporarily out of the buddy (similar to Niteshs work). Send them to the hypervisor. Mark them offline, put them back to the buddy. 4. When a page leaves the buddy, drop the offline marker. Selected issues to be sorted out: - We have to find a way to mask pages offline. We are effectively touching pages we don't own (keeping flags set when returning pages to the buddy). Core MM has to accept this change. - We might teach other users how to treat buddy pages now. Offline always has to be cleared. - How to limit the cycles wasted scanning? Idle guests? - How to efficiently scan a list that might always change between hinting requests? - How to avoid OOM that can still happen in corner cases, after all you are taking pages out of the buddy temporarily. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
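[Editor's note] David's four-point summary of the proposal can be condensed into one cycle of a hinting thread. The sketch below is a self-contained simulation under assumed names (flag arrays instead of real page flags, a no-op in place of the virtqueue post); its one property worth noting is that at most `batch` pages are out of the "buddy" at any moment, which bounds how much memory hinting can temporarily steal from the guest — the OOM corner case in David's issue list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL 8

static bool offline[POOL];   /* models the offline/hinted marker   */
static bool in_buddy[POOL];  /* whether the page is in a free list */

/* Stand-in for posting a hint request to the hypervisor. */
static void fake_hint_to_host(int page) { (void)page; }

/*
 * One cycle: take up to 'batch' !offline free pages temporarily out
 * of the buddy, hint them, mark them offline, and put them back.
 */
static size_t hint_cycle(size_t batch)
{
    int taken[POOL];
    size_t n = 0;

    for (int i = 0; i < POOL && n < batch; i++)
        if (in_buddy[i] && !offline[i]) {
            in_buddy[i] = false;     /* temporarily out of the buddy */
            taken[n++] = i;
        }
    for (size_t i = 0; i < n; i++)
        fake_hint_to_host(taken[i]);
    for (size_t i = 0; i < n; i++) {
        offline[taken[i]] = true;    /* mark hinted... */
        in_buddy[taken[i]] = true;   /* ...and return to the buddy */
    }
    return n;
}

int cycle_demo(void)
{
    for (int i = 0; i < POOL; i++)
        in_buddy[i] = true;

    /* Two cycles of batch 4 cover the whole pool exactly once. */
    assert(hint_cycle(4) == 4);
    assert(hint_cycle(4) == 4);
    /* A further cycle finds nothing left to hint. */
    assert(hint_cycle(4) == 0);

    /* Step 4: a page leaving the buddy drops the offline marker. */
    in_buddy[0] = false;
    offline[0] = false;
    return 0;
}
```

The real open questions from David's list — who may legally set a page type on pages the hinting code does not own, and how to rescan a free list that changes between cycles — are exactly what this toy model does not answer.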
* Re: On guest free page hinting and OOM 2019-04-02 18:53 ` David Hildenbrand @ 2019-04-02 23:43 ` Alexander Duyck 2019-04-03 19:43 ` David Hildenbrand 2019-04-04 13:28 ` Michael S. Tsirkin 0 siblings, 2 replies; 34+ messages in thread From: Alexander Duyck @ 2019-04-02 23:43 UTC (permalink / raw) To: David Hildenbrand Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 2, 2019 at 11:53 AM David Hildenbrand <david@redhat.com> wrote: > > >>> Why do we need them running in parallel for a single guest? I don't > >>> think we need the hints so quickly that we would need to have multiple > >>> VCPUs running in parallel to provide hints. In addition as it > >>> currently stands in order to get pages into and out of the buddy > >>> allocator we are going to have to take the zone lock anyway so we > >>> could probably just assume a single thread for pulling the memory, > >>> placing it on the ring, and putting it back into the buddy allocator > >>> after the hint has been completed. > >> > >> VCPUs hint when they think the time has come. Hinting in parallel comes > >> naturally. > > > > Actually it doesn't because if we are doing it asynchronously we are > > having to pull pages out of the zone which requires the zone lock. > > Yes, and we already work with zones already when freeing. At least one zone. > > > That has been one of the reasons why the patches from Nitesh start > > dropping in performance when you start enabling more than 1 VCPU. If > > we are limited by the zone lock it doesn't make sense for us to try to > > do thing in parallel. > > That is an interesting point and I'd love to see some performance numbers. So the last time I ran data it was with the virtio-balloon patch set I ended up having to make a number of fixes and tweaks. 
I believe the patches can be found online as I emailed them to the list
and Nitesh, but I don't have them handy to point to.

Results w/ THP

Baseline
[root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,501001,93.72,500312,93.72,501001
2,918688,87.49,837092,87.51,1002002
3,1300535,81.22,1200746,81.39,1503003
4,1718865,75.01,1522041,75.20,2004004
5,2032902,68.77,1826264,69.26,2505005
6,2309724,62.55,1979819,63.89,3006006
7,2609748,56.30,1935436,60.20,3507007
8,2705883,50.07,1913416,57.45,4008008
9,2738392,43.84,2017198,51.24,4509009
10,2913739,37.63,1906649,48.65,5010010
11,2996000,31.41,1973332,41.86,5511011
12,2930790,25.19,1928318,37.33,6012012
13,2876603,18.97,2040026,31.83,6513013
14,2820274,12.77,2060417,27.19,7014014
15,2729018,6.55,2134531,24.33,7515015
16,2682826,0.36,2146440,21.25,8016016

My Patch Set
[root@localhost will-it-scale]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,459575,93.74,458546,93.73,459575
2,901990,87.47,841478,87.49,919150
3,1307078,81.26,1193380,81.43,1378725
4,1717429,75.03,1529761,75.30,1838300
5,2045371,68.79,1765334,70.01,2297875
6,2272685,62.56,1893388,65.42,2757450
7,2583919,56.34,2078468,59.85,3217025
8,2777732,50.10,2009627,57.08,3676600
9,2932699,43.90,1938315,52.00,4136175
10,2935508,37.70,1982124,46.55,4595750
11,2881811,31.45,2162791,41.36,5055325
12,2947880,25.27,2058337,38.93,5514900
13,2925530,19.11,1937080,32.13,5974475
14,2867833,12.89,2023161,25.80,6434050
15,2856156,6.69,2067414,24.67,6893625
16,2775991,0.53,2062535,17.46,7353200

Modified RH Virtio-Balloon based patch set
[root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,522672,93.73,524206,93.73,524206
2,914612,87.47,828489,87.66,1048412
3,1336109,81.25,1156889,82.15,1572618
4,1638776,75.01,1419247,76.75,2096824
5,1982146,68.77,1676836,71.27,2621030
6,2211653,62.57,1865976,65.60,3145236
7,2456776,56.33,2111887,57.98,3669442
8,2594395,50.10,2101993,54.17,4193648
9,2672871,43.90,1864173,53.01,4717854
10,2695456,37.69,2152126,45.82,5242060
11,2702604,31.44,1962406,42.50,5766266
12,2702415,25.22,2078596,35.01,6290472
13,2677250,19.02,2068953,35.42,6814678
14,2612990,12.80,2053951,30.77,7338884
15,2521812,6.67,1876602,26.42,7863090
16,2472506,0.53,1957658,20.16,8387296

Basically when you compare the direct approach from the patch set I
submitted versus the one using the virtio approach the virtio has better
single thread performance, but doesn't scale as well as my patch set
did. That is why I am thinking that if we can avoid trying to scale out
to per-cpu and instead focus on just having one thread handle the
feeding of the hints to the virtio device we could avoid the scaling
penalty and instead get the best of both worlds.

> >>> Also there isn't a huge priority to report idle memory in real time.
> >>> That would be kind of pointless as it might be pulled back out and
> >>> reused as soon as it is added. What we need is to give the memory a
> >>> bit of time to "cool" so that we aren't constantly hinting away memory
> >>> that is still in use.
> >>
> >> Depending on the setup, you don't want free memory lying around for too
> >> long in your guest.
> >
> > Right, but you don't need it as soon as it is freed either. If it is
> > idle in the guest for a few seconds that shouldn't be an issue. The
> > free page hinting will hurt performance if we are doing it too often
> > simply because we are going to be triggering a much higher rate of
> > page faults.
>
> Valid point.
>
> >
> >>>
> >>>> Your approach sounds very interesting to play with, however
> >>>> at this point I would like to avoid throwing away Nitesh work once again
> >>>> to follow some other approach that looks promising.
If we keep going > >>>> like that, we'll spend another ~10 years working on free page hinting > >>>> without getting anything upstream. Especially if it involves more > >>>> core-MM changes. We've been there, we've done that. As long as the > >>>> guest-host interface is generic enough, we can play with such approaches > >>>> later in the guest. Important part is that the guest-host interface > >>>> allows for that. > >>> > >>> I'm not throwing anything away. One of the issues in Nitesh's design > >>> is that he is going to either miss memory and have to run an > >>> asynchronous thread to clean it up after the fact, or he is going to > >>> cause massive OOM errors and/or have to start halting VCPUs while > >> > >> 1. how are we going to miss memory. We are going to miss memory because > >> we hint on very huge chunks, but we all agreed to live with that for now. > > > > What I am talking about is that some application frees gigabytes of > > memory. As I recall the queue length for a single cpu is only like 1G. > > Are we going to be sitting on the backlog of most of system memory > > while we process it 1G at a time? > > I think it is something around "pages that fit into a request" * > "numbers of entries in a virtqueue". Right, but then we start spinning if there are no free entries in the virtqueue. That isn't a very asynchronous way to handle things if we are just going to peg a VCPU busy waiting for the virtqueue to free up a buffer. If we are going to go asynchronous we should be completely asynchronous. Right now we still can end up eating a large amount of memory with request size * virtqueue size, and then when we hit that limit there is a while loop we hit that will spin until we have a free buffer. > > > >> 2. What are the "massive OOM" errors you are talking about? We have the > >> one scenario we described Nitesh was not even able to reproduce yet. And > >> we have ways to mitigate the problem (discussed in this thread). 
> > > > So I am referring to the last patch set I have seen. Last I knew all > > the code was doing was assembling lists of isolated pages and placing > > them on a queue. I have seen no way that this really limits the length > > of the virtqueue, and the length of the isolated page lists is the > > Ah yes, we are discussing something towards possible capping in this > thread. Once there would be too much being hinted already, skip hinting > for now. Which might have good and bad sides. Different discussion. I'd say that it is another argument for trying to get away from the per-cpu lists though if we need to put limits on it. Trying to maintain a reasonable limit while coordinating between CPUs would be a challenge. > > only thing that has any specific limits to it. So I see it easily > > being possible for a good portion of memory being consumed by the > > queue when you consider that what you have is essentially the maximum > > length of the isolated page list multiplied by the number of entries > > in a virtqueue. > > > >> We have something that seems to work. Let's work from there instead of > >> scrapping the general design once more, thinking "it is super easy". And > >> yes, what you propose is pretty much throwing away the current design in > >> the guest. > > > > Define "work"? The last patch set required massive fixes as it was > > causing kernel panics if more than 1 VCPU was enabled and list > > corruption in general. I'm sure there are a ton more bugs lurking as > > we have only begun to be able to stress this code in any meaningful > > way. > > "work" - we get performance numbers that look promising and we sort out > issues in the design as we find them. This is an RFC. We are discussing design > details. If there are issues in the design, let's discuss. If there are > alternatives, let's discuss. Bashing on the quality of prototypes? > Please don't. I'm not so much bashing the quality as the lack of data.
It is hard to say something is "working" when you have a hard time getting it to stay up long enough to collect any reasonable data. My concern is that the data looks really great when you don't have all the proper critical sections handled correctly, but when you add the locking that needs to be there it can make the whole point of an entire patch set moot. > > > > For example what happens if someone sits on the mm write semaphore for > > an extended period of time on the host? That will shut down all of the > > hinting until that is released, and at that point once again any > > hinting queues will be stuck on the guest until they can be processed > > by the host. > > I remember that is why we are using asynchronous requests. Combined with > dropping hints when in such a situation (posted hints not getting > processed), nobody would be stuck. Or am I missing something? Yes, then > the issue of dropped hints arises, and that is a different discussion. Right, but while that lock is stuck you have all the memory sitting in the queues. That is the reason why I want to keep the amount of memory that can be sitting in these queues on the smaller side and closer to something like 64M. I don't want the guest to have to be stuck either dropping hints, or starving a guest for memory because it can't get the hints serviced. > > > >>> waiting on the processing. All I am suggesting is that we can get away > >>> from having to deal with both by just walking through the free pages > >>> for the higher order and hinting only a few at a time without having > >>> to try to provide the host with the hints on what is idle the second > >>> it is freed. > >>> > >>>>> > >>>>> I view this all as working not too dissimilar to how a standard Rx > >>>>> ring in a network device works. 
Only we would want to allocate from > >>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when > >>>>> the hint has been processed we would place them back in the "Buddy" > >>>>> list with the "Offline" value still set. The only real changes needed > >>>>> to the buddy allocator would be to add some logic for clearing/merging > >>>>> the "Offline" setting as necessary, and to provide an allocator that > >>>>> only works with non-"Offline" pages. > >>>> > >>>> Sorry, I had to smile at the phrase "only" in combination with "provide > >>>> an allocator that only works with non-Offline pages" :) . I guess you > >>>> realize yourself that these are core-mm changes that might easily be > >>>> rejected upstream because "the virt guys try to teach core-MM yet > >>>> another special case". I agree that this is nice to play with, > >>>> eventually that approach could succeed and be accepted upstream. But I > >>>> consider this long term work. > >>> > >>> The actual patch for this would probably be pretty small and compared > >>> to some of the other stuff that has gone in recently isn't too far out > >>> of the realm of possibility. It isn't too different from the code that > >>> has already gone in to determine the unused pages for virtio-balloon > >>> free page hinting. > >>> > >>> Basically what we would be doing is providing a means for > >>> incrementally transitioning the buddy memory into the idle/offline > >>> state to reduce guest memory overhead. It would require one function > >>> that would walk the free page lists and pluck out pages that don't > >>> have the "Offline" page type set, a one-line change to the logic for > >>> allocating a page as we would need to clear that extra bit of state, > >>> and optionally some bits for how to handle the merge of two "Offline" > >>> pages in the buddy allocator (required for lower order support).
It > >>> solves most of the guest side issues with the free page hinting in > >>> that trying to do it via the arch_free_page path is problematic at > >>> best since it was designed for a synchronous setup, not an > >>> asynchronous one. > >> > >> This is throwing away work. No I don't think this is the right path to > >> follow for now. Feel free to look into it while Nitesh gets something in > >> shape we know conceptually works and we are starting to know which > >> issues we are hitting. > > > > Yes, it is throwing away work. But if the work is running toward a > > dead end does it add any value? > > "I'm not throwing anything away. " vs. "Yes, it is throwing away work.", > now we are on the same page. > > So your main point here is that you are fairly sure we are running > toward a dead end, right? Yes. There is a ton of code here that is adding complexity and bugs that I would consider waste. If we can move to a single-threaded approach the code could become much simpler as we will only have a single queue that we have to service and we could get away from 2 levels of lists and the allocation that goes with them. In addition I think we could get away with much more code reuse as the get_free_page_and_send function could likely be adapted to provide a more generic function for adding a page of a specific size to the queue versus the current assumption that the only page size is VIRTIO_BALLOON_FREE_PAGE_ORDER. > > > > I've been looking into the stuff Nitesh has been doing. I don't know > > about others, but I have been testing it. That is why I provided the > > patches I did to get it stable enough for me to test and address the > > regressions it was causing. That is the source of some of my concern. > > Testing and feedback is very much appreciated. You have concerns, they > are valid. I do like discussing concerns, discussing possible solutions, > or finding out that it cannot be solved the easy way. Then throw it away.
> > Coming up with a clean design that considers problems that are not > directly visible is something I would like to see. But usually they > don't jump at you before prototyping. > > The simplest approach so far was "scan for zero pages in the > hypervisor". No changes in the guest needed except setting pages to zero > when freeing. No additional threads in the guest. No hinting. And still > we decided against it. Right, I get that. There is still going to be a certain amount of overhead for zeroing the pages, and adding the scanning on the host will not come cheap. I had considered something similar when I first looked into this. > > I think we have been making this overly complex with all the per-cpu > > bits and trying to place this in the free path itself. We really need > > We already removed complexity, at least that is my impression. There are > bugs in there, yes. So one of the concerns I have at this point is the sheer number of allocations and the list shuffling that is having to take place. The design of the last patch set had us enqueueing addresses on the per-cpu "free_pages_obj". Then when we hit a certain threshold it will call guest_free_page_hinting which is allocated via kmalloc and then populated with the pages that can be isolated. Then that function calls guest_free_page_report which will yet again kmalloc a hint_req object that just contains a pointer to our isolated pages list. If we can keep the page count small we could just do away with all of that and instead do something more like get_free_page_and_send. > > to scale this back and look at having a single thread with a walker of > > some sort just hinting on what memory is sitting in the buddy but not > > hinted on. It is a solution that would work, even in a multiple VCPU > > case, and is achievable in the short term. > Can you write up your complete proposal and start a new thread? What I > understood so far is > > 1. Separate hinting thread > > 2.
Use virtio-balloon mechanism similar to Nitesh's work > > 3. Iterate over !offline pages in the buddy. Take them temporarily out > of the buddy (similar to Nitesh's work). Send them to the hypervisor. > Mark them offline, put them back to the buddy. > > 4. When a page leaves the buddy, drop the offline marker. Yep, that is pretty much it. I'll see if I can get a write-up of it tomorrow. > > Selected issues to be sorted out: > - We have to find a way to mask pages offline. We are effectively > touching pages we don't own (keeping flags set when returning pages to > the buddy). Core MM has to accept this change. > - We might teach other users how to treat buddy pages now. Offline > always has to be cleared. > - How to limit the cycles wasted scanning? Idle guests? One thought I had would be to look at splitting nr_free_pages in each free_area into two free-running counters, one for the number of pages added, and one for the number of pages removed. Then it would be pretty straightforward to determine how many are available as we could maintain a free-running counter for each free area in the balloon driver to determine if we need to scan a given area. > - How to efficiently scan a list that might always change between > hinting requests? > - How to avoid OOM that can still happen in corner cases, after all you > are taking pages out of the buddy temporarily. Yes, but hopefully it should be a small enough amount that nobody will notice. In many cases devices such as NICs can consume much more than this regularly for just their Rx buffers and it is not an issue. There has to be a certain amount of overhead that any given device is allowed to consume. If we contain the balloon hinting to just 64M that should be a small enough amount that nobody would notice in practice. ^ permalink raw reply [flat|nested] 34+ messages in thread
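The "two free-running counters" idea Alexander floats in the message above can be sketched in a few lines. This is only an illustration of the mechanism under discussion; the names (`free_area_counters`, `fa_nr_free`, `fa_needs_scan`) are invented for the sketch and are not kernel or balloon-driver API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Instead of one nr_free_pages value, each free area keeps a counter of
 * pages ever added and a counter of pages ever removed. Both only grow.
 */
struct free_area_counters {
	uint64_t added;		/* bumped on every free into this area */
	uint64_t removed;	/* bumped on every allocation from it */
};

/*
 * Current number of free pages. Because both counters only ever
 * increase, unsigned subtraction stays correct even after wraparound.
 */
uint64_t fa_nr_free(const struct free_area_counters *fa)
{
	return fa->added - fa->removed;
}

/*
 * The hinting thread keeps a snapshot of 'added' per area. If nothing
 * was freed since the last scan, the area can be skipped entirely,
 * limiting the cycles wasted scanning an idle guest.
 */
int fa_needs_scan(const struct free_area_counters *fa, uint64_t *snapshot)
{
	if (fa->added == *snapshot)
		return 0;	/* no new frees since the last scan */
	*snapshot = fa->added;
	return 1;
}
```

The point of the split is that the balloon driver can mirror the `added` counter cheaply, without taking the zone lock, and only walk a free area when the counter has moved.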
* Re: On guest free page hinting and OOM 2019-04-02 23:43 ` Alexander Duyck @ 2019-04-03 19:43 ` David Hildenbrand 2019-04-04 13:28 ` Michael S. Tsirkin 1 sibling, 0 replies; 34+ messages in thread From: David Hildenbrand @ 2019-04-03 19:43 UTC (permalink / raw) To: Alexander Duyck Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On 03.04.19 01:43, Alexander Duyck wrote: > On Tue, Apr 2, 2019 at 11:53 AM David Hildenbrand <david@redhat.com> wrote: >> >>>>> Why do we need them running in parallel for a single guest? I don't >>>>> think we need the hints so quickly that we would need to have multiple >>>>> VCPUs running in parallel to provide hints. In addition as it >>>>> currently stands in order to get pages into and out of the buddy >>>>> allocator we are going to have to take the zone lock anyway so we >>>>> could probably just assume a single thread for pulling the memory, >>>>> placing it on the ring, and putting it back into the buddy allocator >>>>> after the hint has been completed. >>>> >>>> VCPUs hint when they think the time has come. Hinting in parallel comes >>>> naturally. >>> >>> Actually it doesn't because if we are doing it asynchronously we are >>> having to pull pages out of the zone which requires the zone lock. >> >> Yes, and we already work with zones already when freeing. At least one zone. >> >>> That has been one of the reasons why the patches from Nitesh start >>> dropping in performance when you start enabling more than 1 VCPU. If >>> we are limited by the zone lock it doesn't make sense for us to try to >>> do thing in parallel. >> >> That is an interesting point and I'd love to see some performance numbers. > > So the last time I ran data it was with the virtio-balloon patch set I > ended up having to make a number of fixes and tweaks. 
I believe the > patches can be found online as I emailed them to the list and Nitesh, > but I don't have them handy to point to. AFAIK Nitesh integrated them.
>
> Results w/ THP
> Baseline
> [root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,501001,93.72,500312,93.72,501001
> 2,918688,87.49,837092,87.51,1002002
> 3,1300535,81.22,1200746,81.39,1503003
> 4,1718865,75.01,1522041,75.20,2004004
> 5,2032902,68.77,1826264,69.26,2505005
> 6,2309724,62.55,1979819,63.89,3006006
> 7,2609748,56.30,1935436,60.20,3507007
> 8,2705883,50.07,1913416,57.45,4008008
> 9,2738392,43.84,2017198,51.24,4509009
> 10,2913739,37.63,1906649,48.65,5010010
> 11,2996000,31.41,1973332,41.86,5511011
> 12,2930790,25.19,1928318,37.33,6012012
> 13,2876603,18.97,2040026,31.83,6513013
> 14,2820274,12.77,2060417,27.19,7014014
> 15,2729018,6.55,2134531,24.33,7515015
> 16,2682826,0.36,2146440,21.25,8016016
>
> My Patch Set
> [root@localhost will-it-scale]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,459575,93.74,458546,93.73,459575
> 2,901990,87.47,841478,87.49,919150
> 3,1307078,81.26,1193380,81.43,1378725
> 4,1717429,75.03,1529761,75.30,1838300
> 5,2045371,68.79,1765334,70.01,2297875
> 6,2272685,62.56,1893388,65.42,2757450
> 7,2583919,56.34,2078468,59.85,3217025
> 8,2777732,50.10,2009627,57.08,3676600
> 9,2932699,43.90,1938315,52.00,4136175
> 10,2935508,37.70,1982124,46.55,4595750
> 11,2881811,31.45,2162791,41.36,5055325
> 12,2947880,25.27,2058337,38.93,5514900
> 13,2925530,19.11,1937080,32.13,5974475
> 14,2867833,12.89,2023161,25.80,6434050
> 15,2856156,6.69,2067414,24.67,6893625
> 16,2775991,0.53,2062535,17.46,7353200
>
> Modified RH Virtio-Balloon based patch set
> [root@localhost ~]# cd ~/will-it-scale/; ./runtest.py page_fault1
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,522672,93.73,524206,93.73,524206
> 2,914612,87.47,828489,87.66,1048412
> 3,1336109,81.25,1156889,82.15,1572618
> 4,1638776,75.01,1419247,76.75,2096824
> 5,1982146,68.77,1676836,71.27,2621030
> 6,2211653,62.57,1865976,65.60,3145236
> 7,2456776,56.33,2111887,57.98,3669442
> 8,2594395,50.10,2101993,54.17,4193648
> 9,2672871,43.90,1864173,53.01,4717854
> 10,2695456,37.69,2152126,45.82,5242060
> 11,2702604,31.44,1962406,42.50,5766266
> 12,2702415,25.22,2078596,35.01,6290472
> 13,2677250,19.02,2068953,35.42,6814678
> 14,2612990,12.80,2053951,30.77,7338884
> 15,2521812,6.67,1876602,26.42,7863090
> 16,2472506,0.53,1957658,20.16,8387296
>
> Basically when you compare the direct approach from the patch set I > submitted versus the one using the virtio approach the virtio has > better single thread performance, but doesn't scale as well as my > patch set did. That is why I am thinking that if we can avoid trying > to scale out to per-cpu and instead focus on just having one thread > handle the feeding of the hints to the virtio device we could avoid > the scaling penalty and instead get the best of both worlds. > Sounds like a solid analysis to me. Thanks. [...] >>> only thing that has any specific limits to it. So I see it easily >>> being possible for a good portion of memory being consumed by the >>> queue when you consider that what you have is essentially the maximum >>> length of the isolated page list multiplied by the number of entries >>> in a virtqueue. >>> >>>> We have something that seems to work. Let's work from there instead of >>>> scrapping the general design once more, thinking "it is super easy". And >>>> yes, what you propose is pretty much throwing away the current design in >>>> the guest. >>> >>> Define "work"? The last patch set required massive fixes as it was >>> causing kernel panics if more than 1 VCPU was enabled and list >>> corruption in general.
I'm sure there are a ton more bugs lurking as >>> we have only begun to be able to stress this code in any meaningful >>> way. >> >> "work" - we get performance numbers that look promising and sorting out >> issues in the design we find. This is RFC. We are discussing design >> details. If there are issues in the design, let's discuss. If there are >> alternatives, let's discuss. Bashing on the quality of prototypes? >> Please don't. > > I'm not so much bashing the quality as the lack of data. It is hard to > say something is "working" when you have a hard time getting it to > stay up long enough to collect any reasonable data. My concern is that > the data looks really great when you don't have all the proper > critical sections handled correctly, but when you add the locking that > needs to be there it can make the whole point of an entire patch set > moot. Yes, and that is something to figure out during review. And you did that excellently by even sending fixes :) Not arguing against "quality of the prototype should improve with RFCs". [...] > >>> >>>>> waiting on the processing. All I am suggesting is that we can get away >>>>> from having to deal with both by just walking through the free pages >>>>> for the higher order and hinting only a few at a time without having >>>>> to try to provide the host with the hints on what is idle the second >>>>> it is freed. >>>>> >>>>>>> >>>>>>> I view this all as working not too dissimilar to how a standard Rx >>>>>>> ring in a network device works. Only we would want to allocate from >>>>>>> the pool of "Buddy" pages, flag the pages as "Offline", and then when >>>>>>> the hint has been processed we would place them back in the "Buddy" >>>>>>> list with the "Offline" value still set. The only real changes needed >>>>>>> to the buddy allocator would be to add some logic for clearing/merging >>>>>>> the "Offline" setting as necessary, and to provide an allocator that >>>>>>> only works with non-"Offline" pages. 
>>>>>> >>>>>> Sorry, I had to smile at the phrase "only" in combination with "provide >>>>>> an allocator that only works with non-Offline pages" :) . I guess you >>>>>> realize yourself that these are core-mm changes that might easily be >>>>>> rejected upstream because "the virt guys try to teach core-MM yet >>>>>> another special case". I agree that this is nice to play with, >>>>>> eventually that approach could succeed and be accepted upstream. But I >>>>>> consider this long term work. >>>>> >>>>> The actual patch for this would probably be pretty small and compared >>>>> to some of the other stuff that has gone in recently isn't too far out >>>>> of the realm of possibility. It isn't too different then the code that >>>>> has already done in to determine the unused pages for virtio-balloon >>>>> free page hinting. >>>>> >>>>> Basically what we would be doing is providing a means for >>>>> incrementally transitioning the buddy memory into the idle/offline >>>>> state to reduce guest memory overhead. It would require one function >>>>> that would walk the free page lists and pluck out pages that don't >>>>> have the "Offline" page type set, a one-line change to the logic for >>>>> allocating a page as we would need to clear that extra bit of state, >>>>> and optionally some bits for how to handle the merge of two "Offline" >>>>> pages in the buddy allocator (required for lower order support). It >>>>> solves most of the guest side issues with the free page hinting in >>>>> that trying to do it via the arch_free_page path is problematic at >>>>> best since it was designed for a synchronous setup, not an >>>>> asynchronous one. >>>> >>>> This is throwing away work. No I don't think this is the right path to >>>> follow for now. Feel free to look into it while Nitesh gets something in >>>> shape we know conceptually works and we are starting to know which >>>> issues we are hitting. >>> >>> Yes, it is throwing away work. 
But if the work is running toward a >>> dead end does it add any value? >> >> "I'm not throwing anything away. " vs. "Yes, it is throwing away work.", >> now we are on the same page. >> >> So your main point here is that you are fairly sure we are running >> toward a dead end, right? > > Yes. There is a ton of code here that is adding complexity and bugs > that I would consider waste. If we can move to a single-threaded > approach the code could become much simpler as we will only have a > single queue that we have to service and we could get away from 2 > levels of lists and the allocation that goes with them. In addition I > think we could get away with much more code reuse as the > get_free_page_and_send function could likely be adapted to provide a > more generic function for adding a page of a specific size to the > queue versus the current assumption that the only page size is > VIRTIO_BALLOON_FREE_PAGE_ORDER. We could also implement a single-threaded approach on top of the current approach. Instead of scanning for free pages "blindly", scan the frees recorded by the vcpus via the arch_free_pages callback or similar. One issue is that one can drop hints once the thread can't keep up. Locking is also something to care about. Maybe recording hints globally per zone could mitigate the issue, as the zone lock already has to be involved when freeing. (I remember we used to prototype something like that before, at least it was similar but different :) ) We could then go into "exception" mode once we drop too many hints, and eventually go back to normal mode. See below. > >>> >>> I've been looking into the stuff Nitesh has been doing. I don't know >>> about others, but I have been testing it. That is why I provided the >>> patches I did to get it stable enough for me to test and address the >>> regressions it was causing. That is the source of some of my concern. >> >> Testing and feedback is very much appreciated. You have concerns, they >> are valid.
I do like discussing concerns, discussing possible solutions, >> or finding out that it cannot be solved the easy way. Then throw it away. >> >> Coming up with a clean design that considers problems that are not >> directly visible is something I would like to see. But usually they >> don't jump at you before prototyping. >> >> The simplest approach so far was "scan for zero pages in the >> hypervisor". No changes in the guest needed except setting pages to zero >> when freeing. No additional threads in the guest. No hinting. And still >> we decided against it. > > Right, I get that. There is still going to be a certain amount of > overhead for zeroing the pages in adding the scanning on the host will > not come cheap. I had considered something similar when I first looked > into this. Interestingly, there are some architectures (e.g. s390x) that can do zeroing directly in the memory controller using some instructions - at least that is what I heard. There, the overhead in the guest is very small. Scanning for zero pages using KSM in the host is then the problematic part. But the real problem is that the host has to scan, and when it is OOM/about to swap, it might be too late to scan random guests to detect free pages. > >>> I think we have been making this overly complex with all the per-cpu >>> bits and trying to place this in the free path itself. We really need >> >> We already removed complexity, at least that is my impression. There are >> bugs in there, yes. > > So one of the concerns I have at this point is the sheer number of > allocations and the list shuffling that is having to take place. > > The design of the last patch set had us enqueueing addresses on the > per-cpu "free_pages_obj". Then when we hit a certain threshold it will > call guest_free_page_hinting which is allocated via kmalloc and then > populated with the pages that can be isolated. 
Then that function > calls guest_free_page_report which will yet again kmalloc a hint_req > object that just contains a pointer to our isolated pages list. If we > can keep the page count small we could just do away with all of that > and instead do something more like get_free_page_and_send. Yes, not denying that more simplifications like this might be possible. This is what we are looking for :) > >>> to scale this back and look at having a single thread with a walker of >>> some sort just hinting on what memory is sitting in the buddy but not >>> hinted on. It is a solution that would work, even in a multiple VCPU >>> case, and is achievable in the short term. >> Can you write up your complete proposal and start a new thread. What I >> understood so far is >> >> 1. Separate hinting thread >> >> 2. Use virtio-balloon mechanism similar to Nitesh's work >> >> 3. Iterate over !offline pages in the buddy. Take them temporarily out >> of the buddy (similar to Niteshs work). Send them to the hypervisor. >> Mark them offline, put them back to the buddy. >> >> 4. When a page leaves the buddy, drop the offline marker. > > Yep, that is pretty much it. I'll see if I can get a write-up of it tomorrow. That would be good, I think a sane approach on how to limit hinting overhead due to scanning in certain corner cases needs quite some thought. > >> >> Selected issues to be sorted out: >> - We have to find a way to mask pages offline. We are effectively >> touching pages we don't own (keeping flags set when returning pages to >> the buddy). Core MM has to accept this change. >> - We might teach other users how to treat buddy pages now. Offline >> always has to be cleared. >> - How to limit the cycles wasted scanning? Idle guests? > > One thought I had would be to look at splitting nr_free_pages in each > free_area and splitting it into two free running counters, one for the > number of pages added, and one for the number of pages removed. 
Then > it would be pretty straightforward to determine how many are > available as we could maintain a free-running counter for each free > area in the balloon driver to determine if we need to scan a given > area. When thinking about this, we should always keep in mind some corner cases that could happen. If you have a process that simply allocates and frees 64MB continuously, and could do that at a frequency that could make hinting
1. Search for the pages
2. Isolate the pages
3. Hint the pages
in a loop, basically consuming 100% of a host CPU just for hinting, then this would be really bad. I'm afraid scanning without taking proper actions can result in a lot of hinting overhead CPU-wise if not done in a smart way. See my comment above about possibly combining both approaches. Once we drop a lot of hints, switch from using what the CPUs record to scanning the free list. This would mean one hinting thread, minimal work during freeing of pages (e.g. record into an array per zone), no need to scan with little activity, don't lose hints on a lot of activity. Best of both worlds? > >> - How to efficiently scan a list that might always change between >> hinting requests? >> - How to avoid OOM that can still happen in corner cases, after all you >> are taking pages out of the buddy temporarily. > > Yes, but hopefully it should be a small enough amount that nobody will > notice. In many cases devices such as NICs can consume much more than > this regularly for just their Rx buffers and it is not an issue. There > has to be a certain amount of overhead that any given device is > allowed to consume. If we contain the balloon hinting to just 64M that > should be a small enough amount that nobody would notice in practice. Depends on the setup. If your guest has 512MB-1024MB or less, it could already be problematic, as it would be roughly around 10%. Most probably you won't find plenty of 2MB/4MB pages either way ...
At least it has to be documented, so people enabling hinting are aware of this. (something suggested along the lines in this thread by me) -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 34+ messages in thread
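The sizing argument at the end of the exchange above (64M in flight vs. small guests being "roughly around 10%") comes down to simple arithmetic; a minimal sketch, with `hint_overhead_percent` an illustrative helper rather than anything from the patches:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fraction of guest RAM temporarily held out of the buddy if hinting
 * keeps at most 'in_flight_mb' in flight. Integer division truncates,
 * so 64M of a 512M guest reports 12 (i.e. 12.5% rounded down).
 */
uint64_t hint_overhead_percent(uint64_t guest_mb, uint64_t in_flight_mb)
{
	return (in_flight_mb * 100) / guest_mb;
}
```

For a 512M guest the 64M cap is about an eighth of RAM, while for a 4G guest it is under 2%, which is why the overhead is noticeable only on small guests and why documenting the cap matters.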
* Re: On guest free page hinting and OOM 2019-04-02 23:43 ` Alexander Duyck 2019-04-03 19:43 ` David Hildenbrand @ 2019-04-04 13:28 ` Michael S. Tsirkin 1 sibling, 0 replies; 34+ messages in thread From: Michael S. Tsirkin @ 2019-04-04 13:28 UTC (permalink / raw) To: Alexander Duyck Cc: David Hildenbrand, Nitesh Narayan Lal, kvm list, LKML, linux-mm, Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang, Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb, Andrea Arcangeli On Tue, Apr 02, 2019 at 04:43:03PM -0700, Alexander Duyck wrote: > Yes, but hopefully it should be a small enough amount that nobody will > notice. In many cases devices such as NICs can consume much more than > this regularly for just their Rx buffers and it is not an issue. There > has to be a certain amount of overhead that any given device is > allowed to consume. If we contain the balloon hinting to just 64M that > should be a small enough amount that nobody would notice in practice. I am still inclined to add ability for balloon to exit to host on OOM if any hints are outstanding. If nothing else will be helpful for stats. -- MST ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: On guest free page hinting and OOM
2019-04-02 15:04 ` Alexander Duyck
2019-04-02 15:25 ` Michael S. Tsirkin
2019-04-02 15:55 ` David Hildenbrand
@ 2019-04-02 16:19 ` David Hildenbrand
2 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2019-04-02 16:19 UTC (permalink / raw)
To: Alexander Duyck
Cc: Michael S. Tsirkin, Nitesh Narayan Lal, kvm list, LKML, linux-mm,
	Paolo Bonzini, lcapitulino, pagupta, wei.w.wang, Yang Zhang,
	Rik van Riel, dodgen, Konrad Rzeszutek Wilk, dhildenb,
	Andrea Arcangeli

On 02.04.19 17:04, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.04.19 22:56, Alexander Duyck wrote:
>>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>>>> The interesting thing is most probably: will the hinting size
>>>>>> usually be reasonably small? At least I guess a guest with 4TB of
>>>>>> RAM will not suddenly get a hinting size of hundreds of GB. Most
>>>>>> probably also only something in the range of 1GB. But this is an
>>>>>> interesting question to look into.
>>>>>>
>>>>>> Also, if the admin does not care about performance implications
>>>>>> when already close to hinting, there is no need to add the
>>>>>> additional 1GB to the RAM size.
>>>>>
>>>>> "close to OOM" is what I meant.
>>>>
>>>> Problem is, the host admin is the one adding memory. The guest admin
>>>> is the one that knows about performance.
>>>
>>> The thing we have to keep in mind with this is that we are not dealing
>>> with the same behavior as the balloon driver. We don't need to inflate
>>> a massive hint and hand that off. Instead we can focus on performing
>>> the hints on much smaller amounts and do it incrementally over time,
>>> with the idea being that as the system sits idle it frees up more and
>>> more of the inactive memory on the system.
>>>
>>> With that said, I still don't like the idea of us even trying to
>>> target 1GB of RAM for hinting. I think it would be much better if we
>>> stuck to smaller sizes and kept things down to a single-digit multiple
>>> of THP or higher-order pages. Maybe something like 64MB of total
>>> memory out for hinting.
>>
>> 1GB was just a number I came up with. But please note that, as VCPUs
>> hint in parallel, even though each request is only 64MB in size, things
>> can add up.
>
> Why do we need them running in parallel for a single guest? I don't
> think we need the hints so quickly that we would need to have multiple
> VCPUs running in parallel to provide hints. In addition, as it
> currently stands, in order to get pages into and out of the buddy
> allocator we are going to have to take the zone lock anyway, so we
> could probably just assume a single thread for pulling the memory,
> placing it on the ring, and putting it back into the buddy allocator
> after the hint has been completed.
>
>>> All we really would need to make it work would be to possibly look at
>>> seeing if we can combine PageType values. Specifically, what I would
>>> be looking at is a transition that looks something like Buddy ->
>>> Offline -> (Buddy | Offline). We would have to hold the zone lock at
>>> each transition, but that shouldn't be too big of an issue. If we are
>>> okay with possibly combining the Offline and Buddy types, we would
>>> have a way of tracking which pages have been hinted and which have
>>> not. Then we would just have to have a thread running in the
>>> background on the guest that is looking at the higher-order pages and
>>> pulling 64MB at a time offline, and when the hinting is done, putting
>>> them back in the "Buddy | Offline" state.
>>
>> That approach may have other issues to solve (1 thread vs. many VCPUs,
>> scanning all buddy pages over and over again) and other implications
>> that might be undesirable (hints performed even more delayed,
>> additional thread activity). I wouldn't call it the ultimate solution.
>
> So the problem with trying to provide the hint sooner is that you end
> up creating a bottleneck, or you end up missing hints on pages entirely
> and then have to fall back to such an approach. By just letting the
> thread run in the background reporting the idle memory we can avoid
> much of that.

BTW, what you propose was already suggested in a similar form by Wei
some (weeks? months?) ago. Back then I thought about something like an
"escalation" mode: if too much MM activity is going on (e.g. close to
OOM, dropping hints, whatever), temporarily stop ordinary hinting and
do basically what you describe. But I am not sure if dropping hints is
actually still a problem.

--

Thanks,

David / dhildenb
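[Editorial note: the Buddy -> Offline -> (Buddy | Offline) idea Alex sketches above is essentially a two-bit state machine over free pages. A rough illustration of those transitions follows; it is hypothetical — the real kernel encodes PageType quite differently and takes the zone lock around each step — and all names here are invented.]

```python
from enum import Flag, auto

class PageState(Flag):
    BUDDY = auto()    # free in the buddy allocator, not yet hinted
    OFFLINE = auto()  # pulled out of the allocator, hint in flight

# free AND already hinted: the background thread skips these pages
HINTED = PageState.BUDDY | PageState.OFFLINE

def isolate_for_hint(state):
    """Buddy -> Offline: page leaves the allocator while being hinted."""
    assert state == PageState.BUDDY, "only unhinted free pages are pulled"
    return PageState.OFFLINE

def hint_complete(state):
    """Offline -> (Buddy | Offline): back in the allocator, marked hinted."""
    assert state == PageState.OFFLINE
    return HINTED

def allocate(state):
    """Any free state -> allocated; the hinted marker is lost on allocation."""
    assert state & PageState.BUDDY, "page must be free to allocate"
    return PageState(0)  # allocated: neither free nor hinted

s = PageState.BUDDY
s = isolate_for_hint(s)   # hint in flight, page temporarily unusable
s = hint_complete(s)      # usable again, and skipped by the hinting thread
assert s == HINTED
```

The combined (Buddy | Offline) state is what lets a background thread scan the free lists and pick up only pages it has not hinted yet, at the cost of re-scanning and the extra delay David notes above.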
* Re: On guest free page hinting and OOM
2019-04-01 14:09 ` David Hildenbrand
2019-04-01 14:11 ` David Hildenbrand
@ 2019-04-01 14:45 ` Michael S. Tsirkin
1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-04-01 14:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Mon, Apr 01, 2019 at 04:09:32PM +0200, David Hildenbrand wrote:
> >
> > When you say yield, I would guess that would involve a config space
> > access to the balloon to flush out outstanding hints?
>
> I rather meant yielding your CPU to the hypervisor, so it can process
> hinting requests faster (like waiting for a spinlock). This is the
> simple case. More involved approaches might somehow indicate to the
> hypervisor to not process queued requests but simply return them to
> the guest, so the guest can add the isolated pages back to the buddy.
> If this is "config space access to the balloon to flush out
> outstanding hints", then yes, something like that might be a good idea
> if it doesn't harm performance.

The problem would be in testing this, unfortunately. Same as any OOM
hack, it is difficult to test well.

> --
>
> Thanks,
>
> David / dhildenb
* Re: On guest free page hinting and OOM
2019-03-29 15:37 ` David Hildenbrand
2019-03-29 15:45 ` David Hildenbrand
@ 2019-03-29 16:09 ` Michael S. Tsirkin
1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2019-03-29 16:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: Nitesh Narayan Lal, kvm, linux-kernel, linux-mm, pbonzini,
	lcapitulino, pagupta, wei.w.wang, yang.zhang.wz, riel, dodgen,
	konrad.wilk, dhildenb, aarcange, alexander.duyck

On Fri, Mar 29, 2019 at 04:37:46PM +0100, David Hildenbrand wrote:
> Just so we understand each other: what you mean with "appended to guest
> memory" is "append to the guest memory size", not actually "append
> memory via virtio-balloon", like adding memory regions and stuff.
>
> Instead of "-m 4G" you would do "-m 5G -device virtio-balloon,hint_size=1G".

Right.
end of thread, other threads:[~2019-04-04 13:28 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-29 13:26 On guest free page hinting and OOM Michael S. Tsirkin
2019-03-29 14:24 ` David Hildenbrand
2019-03-29 15:08 ` Michael S. Tsirkin
2019-03-29 15:37 ` David Hildenbrand
2019-03-29 15:45 ` David Hildenbrand
2019-03-29 16:51 ` Michael S. Tsirkin
2019-04-01  8:17 ` David Hildenbrand
2019-04-01 13:24 ` Michael S. Tsirkin
2019-04-01 14:09 ` David Hildenbrand
2019-04-01 14:11 ` David Hildenbrand
2019-04-01 14:47 ` Michael S. Tsirkin
2019-04-01 14:54 ` David Hildenbrand
2019-04-01 20:56 ` Alexander Duyck
2019-04-02  7:42 ` David Hildenbrand
2019-04-02 15:04 ` Alexander Duyck
2019-04-02 15:25 ` Michael S. Tsirkin
2019-04-02 15:57 ` David Hildenbrand
2019-04-02 16:18 ` Alexander Duyck
2019-04-02 17:08 ` David Hildenbrand
2019-04-02 17:45 ` Alexander Duyck
2019-04-02 17:53 ` Michael S. Tsirkin
2019-04-02 20:32 ` Alexander Duyck
2019-04-02 18:21 ` David Hildenbrand
2019-04-02 19:49 ` Michael S. Tsirkin
2019-04-02 20:32 ` David Hildenbrand
2019-04-02 15:55 ` David Hildenbrand
2019-04-02 17:30 ` Alexander Duyck
2019-04-02 18:53 ` David Hildenbrand
2019-04-02 23:43 ` Alexander Duyck
2019-04-03 19:43 ` David Hildenbrand
2019-04-04 13:28 ` Michael S. Tsirkin
2019-04-02 16:19 ` David Hildenbrand
2019-04-01 14:45 ` Michael S. Tsirkin
2019-03-29 16:09 ` Michael S. Tsirkin