* initial ballooning amount on HVM+PoD
@ 2014-01-17 14:33 Jan Beulich
  2014-01-17 15:54 ` Boris Ostrovsky
  2014-01-17 17:13 ` Ian Campbell
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2014-01-17 14:33 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Boris Ostrovsky, Keir Fraser

While looking into Jürgen's issue with PoD setup causing soft lockups
in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
HVM-with-PoD case") just doesn't work - the BUG_ON() added there
triggers as soon as there's a reasonable amount of excess memory.
And that is despite me knowing that I spent a significant amount
of time testing that change - I must have tested something other
than what finally got checked in, or must have screwed up in some
other way.
Extremely embarrassing...

In the course of finding a proper solution I soon stumbled across
upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
number of existing RAM pages"), and hence went ahead and
compared three different calculations for initial bs.current_pages:

(a) upstream's (open coding get_num_physpages(), as I did this on
    an older kernel)
(b) plain old num_physpages (equaling the maximum RAM PFN)
(c) XENMEM_get_pod_target output (with the hypervisor altered
    to not refuse this for a domain doing it on itself)

The fourth (original) method, using totalram_pages, was already
known to result in the driver not ballooning down enough, and
hence setting up the domain for an eventual crash when the PoD
cache runs empty.

Interestingly, (a) too results in the driver not ballooning down
enough - there's a gap of exactly as many pages as are marked
reserved below the 1Mb boundary. Therefore aforementioned
upstream commit is presumably broken.

Short of a reliable (and ideally architecture independent) way of
knowing the necessary adjustment value, the next best solution
(not ballooning down too little, but also not ballooning down much
more than necessary) turns out to be using the minimum of (b)
and (c): When the domain only has memory below 4Gb, (b) is
more precise, whereas in the other cases (c) gets closest.
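
A rough sketch of what that would look like (untested, and
xen_pod_current_pages() is a made-up helper standing in for the
XENMEM_get_pod_target query discussed below):

    /* Sketch only: combine (b) and (c) as described above. */
    static unsigned long balloon_initial_pages(void)
    {
        /* (c): PoD-derived estimate, via the hypothetical helper. */
        unsigned long pod_pages = xen_pod_current_pages();

        /* (b): num_physpages equals the maximum RAM PFN. */
        return min(num_physpages, pod_pages);
    }

    ...
    bs.current_pages = balloon_initial_pages();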

Question now is: Considering that (a) is broken (and hard to fix)
and (b) is in presumably a large part of practical cases leading to
too much ballooning down, shouldn't we open up
XENMEM_get_pod_target for domains to query on themselves?
Alternatively, can anyone see another way to calculate a
reasonably precise value?

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 14:33 initial ballooning amount on HVM+PoD Jan Beulich
@ 2014-01-17 15:54 ` Boris Ostrovsky
  2014-01-17 16:03   ` Jan Beulich
  2014-01-17 17:13 ` Ian Campbell
  1 sibling, 1 reply; 18+ messages in thread
From: Boris Ostrovsky @ 2014-01-17 15:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Keir Fraser

On 01/17/2014 09:33 AM, Jan Beulich wrote:
> While looking into Jürgen's issue with PoD setup causing soft lockups
> in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
> 989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
> HVM-with-PoD case") just doesn't work - the BUG_ON() added there
> triggers as soon as there's a reasonable amount of excess memory.
> And that is despite me knowing that I spent a significant amount
> of time testing that change - I must have tested something other
> than what finally got checked in, or must have screwed up in some
> other way.
> Extremely embarrassing...
>
> In the course of finding a proper solution I soon stumbled across
> upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
> number of existing RAM pages"), and hence went ahead and
> compared three different calculations for initial bs.current_pages:
>
> (a) upstream's (open coding get_num_physpages(), as I did this on
>      an older kernel)
> (b) plain old num_physpages (equaling the maximum RAM PFN)
> (c) XENMEM_get_pod_target output (with the hypervisor altered
>      to not refuse this for a domain doing it on itself)
>
> The fourth (original) method, using totalram_pages, was already
> known to result in the driver not ballooning down enough, and
> hence setting up the domain for an eventual crash when the PoD
> cache runs empty.
>
> Interestingly, (a) too results in the driver not ballooning down
> enough - there's a gap of exactly as many pages as are marked
> reserved below the 1Mb boundary. Therefore aforementioned
> upstream commit is presumably broken.
>
> Short of a reliable (and ideally architecture independent) way of
> knowing the necessary adjustment value, the next best solution
> (not ballooning down too little, but also not ballooning down much
> more than necessary) turns out to be using the minimum of (b)
> and (c): When the domain only has memory below 4Gb, (b) is
> more precise, whereas in the other cases (c) gets closest.

I am not sure I understand why (b) would be the right answer for 
less-than-4G guests. The reason for c275a57f5e patch was that max_pfn 
includes MMIO space (which is not RAM) and thus the driver will 
unnecessarily balloon down that much memory.
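
For reference, after that commit the HVM side of the initialization
reads roughly like this (quoted from memory, so the exact expression
may differ slightly):

    balloon_stats.current_pages = xen_pv_domain()
        ? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
        : get_num_physpages();   /* HVM: count only present RAM pages */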

> Question now is: Considering that (a) is broken (and hard to fix)
> and (b) is in presumably a large part of practical cases leading to
> too much ballooning down, shouldn't we open up
> XENMEM_get_pod_target for domains to query on themselves?
> Alternatively, can anyone see another way to calculate a
> reasonably precise value?

I think hypervisor query is a good thing although I don't know whether 
exposing PoD-specific data (count and entry_count) to the guest is 
necessary. It's probably OK (or we can set these fields to zero for 
non-privileged domains).

-boris



* Re: initial ballooning amount on HVM+PoD
  2014-01-17 15:54 ` Boris Ostrovsky
@ 2014-01-17 16:03   ` Jan Beulich
  2014-01-17 16:08     ` Ian Campbell
  2014-01-17 16:13     ` Boris Ostrovsky
  0 siblings, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2014-01-17 16:03 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: George Dunlap, xen-devel, Keir Fraser

>>> On 17.01.14 at 16:54, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:
> On 01/17/2014 09:33 AM, Jan Beulich wrote:
>> While looking into Jürgen's issue with PoD setup causing soft lockups
>> in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
>> 989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
>> HVM-with-PoD case") just doesn't work - the BUG_ON() added there
>> triggers as soon as there's a reasonable amount of excess memory.
>> And that is despite me knowing that I spent a significant amount
>> of time testing that change - I must have tested something other
>> than what finally got checked in, or must have screwed up in some
>> other way.
>> Extremely embarrassing...
>>
>> In the course of finding a proper solution I soon stumbled across
>> upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
>> number of existing RAM pages"), and hence went ahead and
>> compared three different calculations for initial bs.current_pages:
>>
>> (a) upstream's (open coding get_num_physpages(), as I did this on
>>      an older kernel)
>> (b) plain old num_physpages (equaling the maximum RAM PFN)
>> (c) XENMEM_get_pod_target output (with the hypervisor altered
>>      to not refuse this for a domain doing it on itself)
>>
>> The fourth (original) method, using totalram_pages, was already
>> known to result in the driver not ballooning down enough, and
>> hence setting up the domain for an eventual crash when the PoD
>> cache runs empty.
>>
>> Interestingly, (a) too results in the driver not ballooning down
>> enough - there's a gap of exactly as many pages as are marked
>> reserved below the 1Mb boundary. Therefore aforementioned
>> upstream commit is presumably broken.
>>
>> Short of a reliable (and ideally architecture independent) way of
>> knowing the necessary adjustment value, the next best solution
>> (not ballooning down too little, but also not ballooning down much
>> more than necessary) turns out to be using the minimum of (b)
>> and (c): When the domain only has memory below 4Gb, (b) is
>> more precise, whereas in the other cases (c) gets closest.
> 
> I am not sure I understand why (b) would be the right answer for 
> less-than-4G guests. The reason for c275a57f5e patch was that max_pfn 
> includes MMIO space (which is not RAM) and thus the driver will 
> unnecessarily balloon down that much memory.

max_pfn/num_physpages isn't that far off for guest with less than
4Gb, the number calculated from the PoD data is a little worse.

>> Question now is: Considering that (a) is broken (and hard to fix)
>> and (b) is in presumably a large part of practical cases leading to
>> too much ballooning down, shouldn't we open up
>> XENMEM_get_pod_target for domains to query on themselves?
>> Alternatively, can anyone see another way to calculate a
>> reasonably precise value?
> 
> I think hypervisor query is a good thing although I don't know whether 
> exposing PoD-specific data (count and entry_count) to the guest is 
> necessary. It's probably OK (or we can set these fields to zero for 
> non-privileged domains).

That's pointless then - if no useful data is provided through the
call to non-privileged domains, we can as well keep it erroring for
them.

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:03   ` Jan Beulich
@ 2014-01-17 16:08     ` Ian Campbell
  2014-01-17 16:26       ` Jan Beulich
  2014-01-17 16:13     ` Boris Ostrovsky
  1 sibling, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2014-01-17 16:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

On Fri, 2014-01-17 at 16:03 +0000, Jan Beulich wrote:
> >>> On 17.01.14 at 16:54, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:
> > On 01/17/2014 09:33 AM, Jan Beulich wrote:
> >> While looking into Jürgen's issue with PoD setup causing soft lockups
> >> in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
> >> 989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
> >> HVM-with-PoD case") just doesn't work - the BUG_ON() added there
> >> triggers as soon as there's a reasonable amount of excess memory.
> >> And that is despite me knowing that I spent a significant amount
> >> of time testing that change - I must have tested something other
> >> than what finally got checked in, or must have screwed up in some
> >> other way.
> >> Extremely embarrassing...
> >>
> >> In the course of finding a proper solution I soon stumbled across
> >> upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
> >> number of existing RAM pages"), and hence went ahead and
> >> compared three different calculations for initial bs.current_pages:
> >>
> >> (a) upstream's (open coding get_num_physpages(), as I did this on
> >>      an older kernel)
> >> (b) plain old num_physpages (equaling the maximum RAM PFN)
> >> (c) XENMEM_get_pod_target output (with the hypervisor altered
> >>      to not refuse this for a domain doing it on itself)
> >>
> >> The fourth (original) method, using totalram_pages, was already
> >> known to result in the driver not ballooning down enough, and
> >> hence setting up the domain for an eventual crash when the PoD
> >> cache runs empty.
> >>
> >> Interestingly, (a) too results in the driver not ballooning down
> >> enough - there's a gap of exactly as many pages as are marked
> >> reserved below the 1Mb boundary. Therefore aforementioned
> >> upstream commit is presumably broken.
> >>
> >> Short of a reliable (and ideally architecture independent) way of
> >> knowing the necessary adjustment value, the next best solution
> >> (not ballooning down too little, but also not ballooning down much
> >> more than necessary) turns out to be using the minimum of (b)
> >> and (c): When the domain only has memory below 4Gb, (b) is
> >> more precise, whereas in the other cases (c) gets closest.
> > 
> > I am not sure I understand why (b) would be the right answer for 
> > less-than-4G guests. The reason for c275a57f5e patch was that max_pfn 
> > includes MMIO space (which is not RAM) and thus the driver will 
> > unnecessarily balloon down that much memory.
> 
> max_pfn/num_physpages isn't that far off for guest with less than
> 4Gb, the number calculated from the PoD data is a little worse.

On ARM RAM may not start at 0 and so using max_pfn can be very
misleading and in practice causes arm to balloon down to 0 as fast as it
can.




* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:03   ` Jan Beulich
  2014-01-17 16:08     ` Ian Campbell
@ 2014-01-17 16:13     ` Boris Ostrovsky
  2014-01-17 16:23       ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: Boris Ostrovsky @ 2014-01-17 16:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Keir Fraser

On 01/17/2014 11:03 AM, Jan Beulich wrote:
>>>> On 17.01.14 at 16:54, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:
>> On 01/17/2014 09:33 AM, Jan Beulich wrote:
>>> While looking into Jürgen's issue with PoD setup causing soft lockups
>>> in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
>>> 989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
>>> HVM-with-PoD case") just doesn't work - the BUG_ON() added there
>>> triggers as soon as there's a reasonable amount of excess memory.
>>> And that is despite me knowing that I spent a significant amount
>>> of time testing that change - I must have tested something other
>>> than what finally got checked in, or must have screwed up in some
>>> other way.
>>> Extremely embarrassing...
>>>
>>> In the course of finding a proper solution I soon stumbled across
>>> upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
>>> number of existing RAM pages"), and hence went ahead and
>>> compared three different calculations for initial bs.current_pages:
>>>
>>> (a) upstream's (open coding get_num_physpages(), as I did this on
>>>       an older kernel)
>>> (b) plain old num_physpages (equaling the maximum RAM PFN)
>>> (c) XENMEM_get_pod_target output (with the hypervisor altered
>>>       to not refuse this for a domain doing it on itself)
>>>
>>> The fourth (original) method, using totalram_pages, was already
>>> known to result in the driver not ballooning down enough, and
>>> hence setting up the domain for an eventual crash when the PoD
>>> cache runs empty.
>>>
>>> Interestingly, (a) too results in the driver not ballooning down
>>> enough - there's a gap of exactly as many pages as are marked
>>> reserved below the 1Mb boundary. Therefore aforementioned
>>> upstream commit is presumably broken.
>>>
>>> Short of a reliable (and ideally architecture independent) way of
>>> knowing the necessary adjustment value, the next best solution
>>> (not ballooning down too little, but also not ballooning down much
>>> more than necessary) turns out to be using the minimum of (b)
>>> and (c): When the domain only has memory below 4Gb, (b) is
>>> more precise, whereas in the other cases (c) gets closest.
>> I am not sure I understand why (b) would be the right answer for
>> less-than-4G guests. The reason for c275a57f5e patch was that max_pfn
>> includes MMIO space (which is not RAM) and thus the driver will
>> unnecessarily balloon down that much memory.
> max_pfn/num_physpages isn't that far off for guest with less than
> 4Gb, the number calculated from the PoD data is a little worse.

For a 4G guest it's 65K pages that are ballooned down so it's not 
insignificant.

And if you are increasing MMIO size (something that we had to do here)
it gets progressively worse.

>
>>> Question now is: Considering that (a) is broken (and hard to fix)
>>> and (b) is in presumably a large part of practical cases leading to
>>> too much ballooning down, shouldn't we open up
>>> XENMEM_get_pod_target for domains to query on themselves?
>>> Alternatively, can anyone see another way to calculate a
>>> reasonably precise value?
>> I think hypervisor query is a good thing although I don't know whether
>> exposing PoD-specific data (count and entry_count) to the guest is
>> necessary. It's probably OK (or we can set these fields to zero for
>> non-privileged domains).
> That's pointless then - if no useful data is provided through the
> call to non-privileged domains, we can as well keep it erroring for
> them.
>

I thought you are after d->tot_pages, no?

-boris


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:13     ` Boris Ostrovsky
@ 2014-01-17 16:23       ` Jan Beulich
  2014-01-20 15:31         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-01-17 16:23 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: George Dunlap, xen-devel, Keir Fraser

>>> On 17.01.14 at 17:13, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:
> On 01/17/2014 11:03 AM, Jan Beulich wrote:
>>>>> On 17.01.14 at 16:54, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote:
>>> On 01/17/2014 09:33 AM, Jan Beulich wrote:
>>>> While looking into Jürgen's issue with PoD setup causing soft lockups
>>>> in Dom0 I realized that what I did in linux-2.6.18-xen.hg's c/s
>>>> 989:a7781c0a3b9a ("xen/balloon: fix balloon driver accounting for
>>>> HVM-with-PoD case") just doesn't work - the BUG_ON() added there
>>>> triggers as soon as there's a reasonable amount of excess memory.
>>>> And that is despite me knowing that I spent a significant amount
>>>> of time testing that change - I must have tested something other
>>>> than what finally got checked in, or must have screwed up in some
>>>> other way.
>>>> Extremely embarrassing...
>>>>
>>>> In the course of finding a proper solution I soon stumbled across
>>>> upstream's c275a57f5e ("xen/balloon: Set balloon's initial state to
>>>> number of existing RAM pages"), and hence went ahead and
>>>> compared three different calculations for initial bs.current_pages:
>>>>
>>>> (a) upstream's (open coding get_num_physpages(), as I did this on
>>>>       an older kernel)
>>>> (b) plain old num_physpages (equaling the maximum RAM PFN)
>>>> (c) XENMEM_get_pod_target output (with the hypervisor altered
>>>>       to not refuse this for a domain doing it on itself)
>>>>
>>>> The fourth (original) method, using totalram_pages, was already
>>>> known to result in the driver not ballooning down enough, and
>>>> hence setting up the domain for an eventual crash when the PoD
>>>> cache runs empty.
>>>>
>>>> Interestingly, (a) too results in the driver not ballooning down
>>>> enough - there's a gap of exactly as many pages as are marked
>>>> reserved below the 1Mb boundary. Therefore aforementioned
>>>> upstream commit is presumably broken.
>>>>
>>>> Short of a reliable (and ideally architecture independent) way of
>>>> knowing the necessary adjustment value, the next best solution
>>>> (not ballooning down too little, but also not ballooning down much
>>>> more than necessary) turns out to be using the minimum of (b)
>>>> and (c): When the domain only has memory below 4Gb, (b) is
>>>> more precise, whereas in the other cases (c) gets closest.
>>> I am not sure I understand why (b) would be the right answer for
>>> less-than-4G guests. The reason for c275a57f5e patch was that max_pfn
>>> includes MMIO space (which is not RAM) and thus the driver will
>>> unnecessarily balloon down that much memory.
>> max_pfn/num_physpages isn't that far off for guest with less than
>> 4Gb, the number calculated from the PoD data is a little worse.
> 
> For a 4G guest it's 65K pages that are ballooned down so it's not 
> insignificant.

I didn't say (in the original mail) 4Gb guest - I said guest with
memory only below 4Gb. So yes, for 4Gb guest this is unacceptably
high, ...

> And if you are increasing MMIO size (something that we had to do here)
> it gets progressively worse.

... and growing with MMIO size, hence the PoD data yields better
results in that case.

>>>> Question now is: Considering that (a) is broken (and hard to fix)
>>>> and (b) is in presumably a large part of practical cases leading to
>>>> too much ballooning down, shouldn't we open up
>>>> XENMEM_get_pod_target for domains to query on themselves?
>>>> Alternatively, can anyone see another way to calculate a
>>>> reasonably precise value?
>>> I think hypervisor query is a good thing although I don't know whether
>>> exposing PoD-specific data (count and entry_count) to the guest is
>>> necessary. It's probably OK (or we can set these fields to zero for
>>> non-privileged domains).
>> That's pointless then - if no useful data is provided through the
>> call to non-privileged domains, we can as well keep it erroring for
>> them.
>>
> 
> I thought you are after d->tot_pages, no?

That can be obtained through another XENMEM_ operation. No,
what is needed is the difference between PoD entries and PoD
cache (which then needs to be added to tot_pages).
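
In code, something along these lines (a sketch only; the struct
layout is from Xen's public memory.h as I remember it, the kernel
copy of that header would need the XENMEM_get_pod_target bits
added, and the hypercall would of course have to stop failing for
self-queries first):

    struct xen_pod_target pod = { .domid = DOMID_SELF };
    unsigned long pages;

    if (HYPERVISOR_memory_op(XENMEM_get_pod_target, &pod) == 0)
        /* tot_pages plus PoD entries not yet backed by the cache */
        pages = pod.tot_pages + (pod.pod_entries - pod.pod_cache_pages);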

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:08     ` Ian Campbell
@ 2014-01-17 16:26       ` Jan Beulich
  2014-01-17 16:54         ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-01-17 16:26 UTC (permalink / raw)
  To: Ian Campbell; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

>>> On 17.01.14 at 17:08, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-01-17 at 16:03 +0000, Jan Beulich wrote:
>> max_pfn/num_physpages isn't that far off for guest with less than
>> 4Gb, the number calculated from the PoD data is a little worse.
> 
> On ARM RAM may not start at 0 and so using max_pfn can be very
> misleading and in practice causes arm to balloon down to 0 as fast as it
> can.

Ugly. Is that only due to the temporary workaround for there not
being an IOMMU?

And short of the initial value needing to be architecture specific -
can you see a calculation that would yield a decent result on ARM
that would also be suitable on x86?

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:26       ` Jan Beulich
@ 2014-01-17 16:54         ` Ian Campbell
  2014-01-20  8:01           ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2014-01-17 16:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

On Fri, 2014-01-17 at 16:26 +0000, Jan Beulich wrote:
> >>> On 17.01.14 at 17:08, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2014-01-17 at 16:03 +0000, Jan Beulich wrote:
> >> max_pfn/num_physpages isn't that far off for guest with less than
> >> 4Gb, the number calculated from the PoD data is a little worse.
> > 
> > On ARM RAM may not start at 0 and so using max_pfn can be very
> > misleading and in practice causes arm to balloon down to 0 as fast as it
> > can.
> 
> Ugly. Is that only due to the temporary workaround for there not
> being an IOMMU?

It's not to do with IOMMUs, no, and it isn't temporary.

Architecturally on ARM it's not required for RAM to be at address 0 and
it is not uncommon for it to start at 1, 2 or 3GB (as a property of the
SoC design).

If you have 128M of RAM at 0x80000000-0x88000000 then max_pfn is 0x88000
but target pages is just 0x8000; if current_pages is initialised to
max_pfn then the kernel immediately thinks it has to get rid of 0x80000
pages.

> And short of the initial value needing to be architecture specific -
> can you see a calculation that would yield a decent result on ARM
> that would also be suitable on x86?

I previously had a patch to use memblock_phys_mem_size(), but when I saw
Boris switch to get_num_physpages() I thought that would be OK, but I
didn't look into it very hard. Without checking I suspect they return
pretty much the same thing and so memblock_phys_mem_size will have the
same issue you observed (which I confess I haven't yet gone back and
understood).
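
For reference, the memblock variant would be about as simple as it
gets (untested, and it assumes CONFIG_HAVE_MEMBLOCK):

    #include <linux/memblock.h>

    /* total RAM known to memblock, in pages - independent of where RAM starts */
    balloon_stats.current_pages = PFN_DOWN(memblock_phys_mem_size());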

Ian.


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 14:33 initial ballooning amount on HVM+PoD Jan Beulich
  2014-01-17 15:54 ` Boris Ostrovsky
@ 2014-01-17 17:13 ` Ian Campbell
  2014-01-20  8:08   ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2014-01-17 17:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

On Fri, 2014-01-17 at 14:33 +0000, Jan Beulich wrote:
> Interestingly, (a) too results in the driver not ballooning down
> enough - there's a gap of exactly as many pages as are marked
> reserved below the 1Mb boundary. Therefore aforementioned
> upstream commit is presumably broken.

Can we count those reserved pages? (I guess you mean reserved in the
e820?)

> Short of a reliable (and ideally architecture independent) way of
> knowing the necessary adjustment value, the next best solution
> (not ballooning down too little, but also not ballooning down much
> more than necessary) turns out to be using the minimum of (b)
> and (c): When the domain only has memory below 4Gb, (b) is
> more precise, whereas in the other cases (c) gets closest.

I think I'd prefer an arch specific calculation (or an arch specific
adjustment to a generic calculation) to either of the above.

Ian.


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:54         ` Ian Campbell
@ 2014-01-20  8:01           ` Jan Beulich
  2014-01-20 10:42             ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-01-20  8:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

>>> On 17.01.14 at 17:54, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-01-17 at 16:26 +0000, Jan Beulich wrote:
>> >>> On 17.01.14 at 17:08, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Fri, 2014-01-17 at 16:03 +0000, Jan Beulich wrote:
>> >> max_pfn/num_physpages isn't that far off for guest with less than
>> >> 4Gb, the number calculated from the PoD data is a little worse.
>> > 
>> > On ARM RAM may not start at 0 and so using max_pfn can be very
>> > misleading and in practice causes arm to balloon down to 0 as fast as it
>> > can.
>> 
>> Ugly. Is that only due to the temporary workaround for there not
>> being an IOMMU?
> 
> It's not to do with IOMMUs, no, and it isn't temporary.
> 
> Architecturally on ARM it's not required for RAM to be at address 0 and
> it is not uncommon for it to start at 1, 2 or 3GB (as a property of the
> SoC design).
> 
> If you have 128M of RAM at 0x80000000-0x88000000 then max_pfn is 0x88000
> but target pages is just 0x8000, if current_pages is initialised to
> max_pfn then the kernel immediately thinks it has to get rid of 0x800000
> pages.

And there is some sort of benefit from also doing this for virtual
machines?

>> And short of the initial value needing to be architecture specific -
>> can you see a calculation that would yield a decent result on ARM
>> that would also be suitable on x86?
> 
> I previously had a patch to use memblock_phys_mem_size(), but when I saw
> Boris switch to get_num_physpages() I thought that would be OK, but I
> didn't look into it very hard. Without checking I suspect they return
> pretty much the same thing and so memblock_phys_mem_size will have the
> same issue you observed (which I confess I haven't yet gone back and
> understood).

If there's no reserved memory in that range, I guess ARM might
be fine as is.

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 17:13 ` Ian Campbell
@ 2014-01-20  8:08   ` Jan Beulich
  2014-01-20 10:50     ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2014-01-20  8:08 UTC (permalink / raw)
  To: Ian Campbell; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

>>> On 17.01.14 at 18:13, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-01-17 at 14:33 +0000, Jan Beulich wrote:
>> Interestingly, (a) too results in the driver not ballooning down
>> enough - there's a gap of exactly as many pages as are marked
>> reserved below the 1Mb boundary. Therefore aforementioned
>> upstream commit is presumably broken.
> 
> Can we count those reserved pages? (I guess you mean reserved in the
> e820?)

Yes, we could. But it's not logical to count the ones below 1Mb, but
not the ones above. Yet we can't (without knowledge of the tools/
firmware implementation) tell regions backed by RAM assigned to the
guest (e.g. the reserved pages below 1Mb, covering BIOS stuff)
from regions reserved for other reasons. A specific firmware could,
for example, have a larger BIOS region right below 4Gb (like many
non-virtual BIOSes do), which would then also be RAM covered and
hence also need accounting.

>> Short of a reliable (and ideally architecture independent) way of
>> knowing the necessary adjustment value, the next best solution
>> (not ballooning down too little, but also not ballooning down much
>> more than necessary) turns out to be using the minimum of (b)
>> and (c): When the domain only has memory below 4Gb, (b) is
>> more precise, whereas in the other cases (c) gets closest.
> 
> I think I'd prefer an arch specific calculation (or an arch specific
> adjustment to a generic calculation) to either of the above.

Hmm, interesting. I would have expected a generic calculation to
be deemed preferable.

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-20  8:01           ` Jan Beulich
@ 2014-01-20 10:42             ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2014-01-20 10:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

On Mon, 2014-01-20 at 08:01 +0000, Jan Beulich wrote:
> >>> On 17.01.14 at 17:54, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2014-01-17 at 16:26 +0000, Jan Beulich wrote:
> >> >>> On 17.01.14 at 17:08, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> >> > On Fri, 2014-01-17 at 16:03 +0000, Jan Beulich wrote:
> >> >> max_pfn/num_physpages isn't that far off for guest with less than
> >> >> 4Gb, the number calculated from the PoD data is a little worse.
> >> > 
> >> > On ARM RAM may not start at 0 and so using max_pfn can be very
> >> > misleading and in practice causes arm to balloon down to 0 as fast as it
> >> > can.
> >> 
> >> Ugly. Is that only due to the temporary workaround for there not
> >> being an IOMMU?
> > 
> > It's not to do with IOMMUs, no, and it isn't temporary.
> > 
> > Architecturally on ARM it's not required for RAM to be at address 0 and
> > it is not uncommon for it to start at 1, 2 or 3GB (as a property of the
> > SoC design).
> > 
> > If you have 128M of RAM at 0x80000000-0x88000000 then max_pfn is 0x88000
> > but target pages is just 0x8000; if current_pages is initialised to
> > max_pfn then the kernel immediately thinks it has to get rid of 0x80000
> > pages.
> 
> And there is some sort of benefit from also doing this for virtual
> machines?

For dom0 the address space layout mirrors that of the underlying
platform.

For domU we mostly ended up using the layout of the platform we happened
to develop on for the virtual guest layout without much thought. This
actually has some shortcomings, in that it limits the amount of RAM a
guest can have under 4GB quite significantly, so in 4.5 we will probably
change this.

But for domU when we do device assignment we may also want to do
something equivalent to x86 "e820_host" option which also mirrors the
underlying address map for guests.

In any case I don't want to be encoding restrictions like "RAM starts at
zero" into ARM's guest ABI.

> 
> >> And short of the initial value needing to be architecture specific -
> >> can you see a calculation that would yield a decent result on ARM
> >> that would also be suitable on x86?
> > 
> > I previously had a patch to use memblock_phys_mem_size(), but when I saw
> > Boris switch to get_num_physpages() I thought that would be OK, but I
> > didn't look into it very hard. Without checking I suspect they return
> > pretty much the same think and so memblock_phys_mem_size will have the
> > same issue you observed (which I confess I haven't yet gone back and
> > understood).
> 
> If there's no reserved memory in that range, I guess ARM might
> be fine as is.

Hrm I'm not sure what a DTB /memreserve/ statement turns into wrt the
memblock stuff and whether it is included in phys_mem_size -- I think
that stuff is still WIP upstream (i.e. the kernel currently ignores such
things, IIRC there was a fix but it was broken and reverted to try again
and then I've lost track).

Ian.


* Re: initial ballooning amount on HVM+PoD
  2014-01-20  8:08   ` Jan Beulich
@ 2014-01-20 10:50     ` Ian Campbell
  2014-01-20 11:35       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2014-01-20 10:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

On Mon, 2014-01-20 at 08:08 +0000, Jan Beulich wrote:
> >>> On 17.01.14 at 18:13, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2014-01-17 at 14:33 +0000, Jan Beulich wrote:
> >> Interestingly, (a) too results in the driver not ballooning down
> >> enough - there's a gap of exactly as many pages as are marked
> >> reserved below the 1Mb boundary. Therefore aforementioned
> >> upstream commit is presumably broken.
> > 
> > Can we count those reserved pages? (I guess you mean reserved in the
> > e820?)
> 
> Yes, we could. But it's not logical to count the ones below 1Mb, but
> not the ones above.

I can understand that PoV but it's not like the PC architecture isn't
full of weird quirks and assumptions which are specific to the low
1Mb...

>  Yet we can't (without knowledge of the tools/
> firmware implementation) tell regions backed by RAM assigned to the
> guest (e.g. the reserved pages below 1Mb, covering BIOS stuff)
> from regions reserved for other reasons. A specific firmware could,
> for example, have a larger BIOS region right below 4Gb (like many
> non-virtual BIOSes do), which would then also be RAM covered and
> hence also need accounting.

Couldn't this be accounted for in the toolstack when considering the
target and max_pages? But I suppose it is too late for that now if you
want to DTRT on existing systems.

> >> Short of a reliable (and ideally architecture independent) way of
> >> knowing the necessary adjustment value, the next best solution
> >> (not ballooning down too little, but also not ballooning down much
> >> more than necessary) turns out to be using the minimum of (b)
> >> and (c): When the domain only has memory below 4Gb, (b) is
> >> more precise, whereas in the other cases (c) gets closest.
> > 
> > I think I'd prefer an arch specific calculation (or an arch specific
> > adjustment to a generic calculation) to either of the above.
> 
> Hmm, interesting. I would have expected a generic calculation to
> be deemed preferable.

Yes, I'd much prefer an accurate per-arch calculation to a generic fudge
which only gets close for everyone.

Ian.


* Re: initial ballooning amount on HVM+PoD
  2014-01-20 10:50     ` Ian Campbell
@ 2014-01-20 11:35       ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2014-01-20 11:35 UTC (permalink / raw)
  To: Ian Campbell; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

>>> On 20.01.14 at 11:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Mon, 2014-01-20 at 08:08 +0000, Jan Beulich wrote:
>>  Yet we can't (without knowledge of the tools/
>> firmware implementation) tell regions backed by RAM assigned to the
>> guest (e.g. the reserved pages below 1Mb, covering BIOS stuff)
>> from regions reserved for other reasons. A specific firmware could,
>> for example, have a larger BIOS region right below 4Gb (like many
>> non-virtual BIOSes do), which would then also be RAM covered and
>> hence also need accounting.
> 
> Couldn't this be accounted for in the toolstack when considering the
> target and max_pages?

Perhaps it could, but ...

> But I suppose it is too late for that now if you
> want to DTRT on existing systems.

... yes, any tools side behavioral change would make the job even
harder for the balloon driver, and wouldn't necessarily help with
existing incarnations (and in fact wouldn't be unlikely to break some).

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-17 16:23       ` Jan Beulich
@ 2014-01-20 15:31         ` Konrad Rzeszutek Wilk
  2014-01-20 15:54           ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-01-20 15:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

> >>>> Question now is: Considering that (a) is broken (and hard to fix)
> >>>> and (b) is in presumably a large part of practical cases leading to
> >>>> too much ballooning down, shouldn't we open up
> >>>> XENMEM_get_pod_target for domains to query on themselves?
> >>>> Alternatively, can anyone see another way to calculate a
> >>>> reasonably precise value?
> >>> I think hypervisor query is a good thing although I don't know whether
> >>> exposing PoD-specific data (count and entry_count) to the guest is
> >>> necessary. It's probably OK (or we can set these fields to zero for
> >>> non-privileged domains).
> >> That's pointless then - if no useful data is provided through the
> >> call to non-privileged domains, we can as well keep it erroring for
> >> them.
> >>
> > 
> > I thought you are after d->tot_pages, no?
> 
> That can be obtained through another XENMEM_ operation. No,
> what is needed is the difference between PoD entries and PoD
> cache (which then needs to be added to tot_pages).

Won't that be racy? Meaning the moment you get that information and
kick off the balloon worker, said value might be different already?

> 
> Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-20 15:31         ` Konrad Rzeszutek Wilk
@ 2014-01-20 15:54           ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2014-01-20 15:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: George Dunlap, xen-devel, Boris Ostrovsky, Keir Fraser

>>> On 20.01.14 at 16:31, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> >>>> Question now is: Considering that (a) is broken (and hard to fix)
>> >>>> and (b) is in presumably a large part of practical cases leading to
>> >>>> too much ballooning down, shouldn't we open up
>> >>>> XENMEM_get_pod_target for domains to query on themselves?
>> >>>> Alternatively, can anyone see another way to calculate a
>> >>>> reasonably precise value?
>> >>> I think hypervisor query is a good thing although I don't know whether
>> >>> exposing PoD-specific data (count and entry_count) to the guest is
>> >>> necessary. It's probably OK (or we can set these fields to zero for
>> >>> non-privileged domains).
>> >> That's pointless then - if no useful data is provided through the
>> >> call to non-privileged domains, we can as well keep it erroring for
>> >> them.
>> >>
>> > 
>> > I thought you are after d->tot_pages, no?
>> 
>> That can be obtained through another XENMEM_ operation. No,
>> what is needed is the difference between PoD entries and PoD
>> cache (which then needs to be added to tot_pages).
> 
> Won't that be racy? Meaning the moment you get that information and
> kick off the balloon worker, said value might be different already?

There's a small risk for that, yes (albeit said difference ought to be
stable, and I don't immediately see how tot_pages would change
underneath the balloon driver initializing), but I am still awaiting
alternative suggestions...

Jan


* Re: initial ballooning amount on HVM+PoD
  2014-01-20 15:19 Boris Ostrovsky
@ 2014-01-20 15:23 ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2014-01-20 15:23 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: George.Dunlap, xen-devel, keir, JBeulich

On Mon, 2014-01-20 at 07:19 -0800, Boris Ostrovsky wrote:
> ----- Ian.Campbell@citrix.com wrote:
> 
> > > > 
> > > > I previously had a patch to use memblock_phys_mem_size(), but when I saw
> > > > Boris switch to get_num_physpages() I thought that would be OK, but I
> > > > didn't look into it very hard. Without checking I suspect they return
> > > > pretty much the same thing and so memblock_phys_mem_size will have the
> > > > same issue you observed (which I confess I haven't yet gone back and
> > > > understood).
> > > 
> > > If there's no reserved memory in that range, I guess ARM might
> > > be fine as is.
> > 
> > Hrm I'm not sure what a DTB /memreserve/ statement turns into wrt the
> > memblock stuff and whether it is included in phys_mem_size -- I think
> > that stuff is still WIP upstream (i.e. the kernel currently ignores such
> > things, IIRC there was a fix but it was broken and reverted to try again
> > and then I've lost track).
> 
> 
> FWIW, when I was looking at this code I also first thought of using
> memblock but IIRC I wasn't convinced that we always have
> CONFIG_HAVE_MEMBLOCK set so I ended up with physmem_size().

FWIW it's select'd by CONFIG_X86 and CONFIG_ARM and overall by about
half of all architectures. So I suppose it depends where your threshold
for caring about future architecture ports lies...

Ian.


* Re: initial ballooning amount on HVM+PoD
@ 2014-01-20 15:19 Boris Ostrovsky
  2014-01-20 15:23 ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Boris Ostrovsky @ 2014-01-20 15:19 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: George.Dunlap, xen-devel, keir, JBeulich


----- Ian.Campbell@citrix.com wrote:

> > > 
> > > I previously had a patch to use memblock_phys_mem_size(), but when I saw
> > > Boris switch to get_num_physpages() I thought that would be OK, but I
> > > didn't look into it very hard. Without checking I suspect they return
> > > pretty much the same thing and so memblock_phys_mem_size will have the
> > > same issue you observed (which I confess I haven't yet gone back and
> > > understood).
> > 
> > If there's no reserved memory in that range, I guess ARM might
> > be fine as is.
> 
> Hrm I'm not sure what a DTB /memreserve/ statement turns into wrt the
> memblock stuff and whether it is included in phys_mem_size -- I think
> that stuff is still WIP upstream (i.e. the kernel currently ignores such
> things, IIRC there was a fix but it was broken and reverted to try again
> and then I've lost track).


FWIW, when I was looking at this code I also first thought of using memblock but IIRC I wasn't convinced that we always have CONFIG_HAVE_MEMBLOCK set so I ended up with physmem_size().

-boris

