From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S966107AbdCXUbF (ORCPT <rfc822;w@1wt.eu>);
        Fri, 24 Mar 2017 16:31:05 -0400
Received: from mail-pg0-f50.google.com ([74.125.83.50]:34813 "EHLO
        mail-pg0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S964973AbdCXUav (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 24 Mar 2017 16:30:51 -0400
MIME-Version: 1.0
In-Reply-To: <1bf56d75-4ffb-ba41-4c96-76c120c7800c@suse.com>
References: <CAOZ2QJMkHqeTGGVKFjGw-Gn_zMLmM6Ls27Z=k3N-MYyzz7ihfQ@mail.gmail.com>
 <0628e2af-f7e7-056a-82ec-68860f9c4f29@oracle.com> <1bf56d75-4ffb-ba41-4c96-76c120c7800c@suse.com>
From: Dan Streetman <dan.streetman@canonical.com>
Date: Fri, 24 Mar 2017 16:30:10 -0400
Message-ID: <CAOZ2QJOJrJWxrNfkRKVz-EfwrJEMJ_uZvP1cXtr=i8+ie4Z7+Q@mail.gmail.com>
Subject: Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's
 initial state to number of existing RAM pages"
To: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
        Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
        xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 23, 2017 at 3:56 AM, Juergen Gross <jgross@suse.com> wrote:
> On 23/03/17 03:13, Boris Ostrovsky wrote:
>>
>>
>> On 03/22/2017 05:16 PM, Dan Streetman wrote:
>>> I have a question about a problem introduced by this commit:
>>> c275a57f5ec3056f732843b11659d892235faff7
>>> "xen/balloon: Set balloon's initial state to number of existing RAM
>>> pages"
>>>
>>> It changed the xen balloon current_pages calculation to start with the
>>> number of physical pages in the system, instead of max_pfn.  Since
>>> get_num_physpages() does not include holes, it's always less than the
>>> e820 map's max_pfn.
>>>
>>> However, the problem that commit introduced is, if the hypervisor sets
>>> the balloon target to equal to the e820 map's max_pfn, then the
>>> balloon target will *always* be higher than the initial current pages.
>>> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>>> the OS adds any holes, the balloon target will be higher than the
>>> current pages.  This is the situation, for example, for Amazon AWS
>>> instances.  The result is, the xen balloon will always immediately
>>> hotplug some memory at boot, but then make only (max_pfn -
>>> get_num_physpages()) available to the system.
>>>
>>> This balloon-hotplugged memory can cause problems, if the hypervisor
>>> wasn't expecting it; specifically, the system's physical page
>>> addresses now will exceed the e820 map's max_pfn, due to the
>>> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>>> DMA to/from those physical pages above the e820 max_pfn, it causes
>>> problems.  For example:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>>>
>>> The additional small amount of balloon memory can cause other problems
>>> as well, for example:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>>>
>>> Anyway, I'd like to ask, was the original commit added because
>>> hypervisors are supposed to set their balloon target to the guest
>>> system's number of phys pages (max_pfn - holes)?  The mailing list
>>> discussion and commit description seem to indicate that.
>>
>>
>> IIRC the problem that this was trying to fix was that since max_pfn
>> includes holes, upon booting we'd immediately balloon down by the
>> (typically, MMIO) hole size.
>>
>> If you boot a guest with ~4+GB memory you should see this.
>>
>>
>>> However I'm
>>> not sure how that is possible, because the kernel reserves its own
>>> holes, regardless of any predefined holes in the e820 map; for
>>> example, the kernel reserves 64k (by default) at phys addr 0 (the
>>> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW).  So
>>> the hypervisor really has no way to know what the "right" target to
>>> specify is; unless it knows the exact guest OS and kernel version, and
>>> kernel config values, it will never be able to correctly specify its
>>> target to be exactly (e820 max_pfn - all holes).
>>>
>>> Should this commit be reverted?  Should the xen balloon target be
>>> adjusted based on kernel-added e820 holes?
>>
>> I think the second one but shouldn't current_pages be updated, and not
>> the target? The latter is set by Xen (toolstack, via xenstore usually).
>
> Right.
>
> Looking into a HVM domU I can't see any problem related to
> CONFIG_X86_RESERVE_LOW: it is set to 64 on my system. The domU is

Sorry I brought that up; I was only giving an example.  It's not
directly relevant to this and may have distracted from the actual
problem.  In fact, on closer inspection, X86_RESERVE_LOW uses
memblock_reserve(), which removes the range from managed memory but
not from the e820 map (and thus doesn't remove it from
get_num_physpages()).  Only phys page 0 is actually reserved in the
e820 map.

> configured with 2048 MB of RAM, 8MB being video RAM. Looking into
> /sys/devices/system/xen_memory/xen_memory0 I can see the current
> size and target size do match: both are 2088960 kB (2 GB - 8 MB).
>
> Ballooning down and up to 2048 MB again doesn't change the picture.
>
> So which additional holes are added by the kernel on AWS via which
> functions?

I'll use two AWS types as examples, t2.micro (1G mem) and t2.large (8G mem).

On the micro, the results of ballooning are obvious, because the
hotplugged memory always goes into the Normal zone, while the base
memory is only 1G and so is contained entirely in the DMA and DMA32
zones.  So we get:

$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
        spanned  4095
        present  3997
        managed  3976
  start_pfn:         1
        spanned  258048
        present  258048
        managed  249606
  start_pfn:         4096
        spanned  32768
        present  32768
        managed  11
  start_pfn:         262144

As you can see, none of the e820 memory went into the Normal zone; the
balloon driver hotplugged 128M (32k pages), but only made 11 pages
available.  Having a memory zone with only 11 managed pages really
screwed with kswapd, since the zone's memory watermarks were all 0.
That was the second bug I referenced in my initial email.
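
To make the sizes above easier to read, here's a minimal sketch that
converts the t2.micro zoneinfo counters to MiB.  It assumes 4 KiB
pages, and the zone names are my reading of the zone order, since the
grep output doesn't print them.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# (start_pfn, spanned, present, managed), copied from the t2.micro output.
# Zone names are inferred from the order in /proc/zoneinfo, not shown by grep.
ZONES = {
    "DMA":    (1,      4095,   3997,   3976),
    "DMA32":  (4096,   258048, 258048, 249606),
    "Normal": (262144, 32768,  32768,  11),
}

def pages_to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE / (1024 * 1024)

for name, (start_pfn, spanned, present, managed) in ZONES.items():
    print(f"{name:7} pfn {start_pfn:#x}-{start_pfn + spanned:#x}  "
          f"present {pages_to_mib(present):8.1f} MiB  "
          f"managed {pages_to_mib(managed):8.3f} MiB")
```

The Normal zone starts at pfn 262144, i.e. exactly at the 1G boundary,
so everything in it came from the balloon hotplug.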


Anyway, on the large instance, the additional balloon memory isn't
really noticeable:

$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
        spanned  4095
        present  3997
        managed  3976
  start_pfn:         1
        spanned  1044480
        present  978944
        managed  958778
  start_pfn:         4096
        spanned  1146880
        present  1146880
        managed  1080666
  start_pfn:         1048576

But doing the actual math shows the problem:

$ printf "%x\n" $[ 1048576 + 1146880 ]
218000
$ printf "%x\n" $[ 1048576 + 1080666 ]
207d5a

$ dmesg|grep e820
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
[    0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
[    0.595083] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]

So we can see the Normal zone ends at pfn 0x218000, while the e820
last_pfn is 0x210000; the balloon driver hotplugged those extra 0x8000
pages, and made some of them available.
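
The same comparison as a small sketch, using only the numbers from the
zoneinfo and dmesg output above (4 KiB pages assumed; the variable
names are just for illustration):

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
E820_LAST_PFN = 0x210000  # from "e820: last_pfn = 0x210000"

# Normal zone on the t2.large, from /proc/zoneinfo above
normal_start_pfn = 1048576
normal_spanned = 1146880

zone_end_pfn = normal_start_pfn + normal_spanned
extra_pages = zone_end_pfn - E820_LAST_PFN
extra_mib = extra_pages * PAGE_SIZE // (1024 * 1024)

print(f"zone end pfn:  {zone_end_pfn:#x}")
print(f"e820 last_pfn: {E820_LAST_PFN:#x}")
print(f"beyond e820:   {extra_pages:#x} pages ({extra_mib} MiB)")
```

That is the same 128M hotplug section as on the micro, just hidden
inside a much larger Normal zone.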

The target has been set to:
$ printf "%x\n" $( cat /sys/devices/system/xen_memory/xen_memory0/target )
200000000

while the e820 map provides:
$ printf "%x\n" $[ 0x210000000 - 0x100000000 + 0xf0000000 - 0x100000 +
0x9e000 - 0x1000 ]
1fff9d000

and current memory is:
/sys/devices/system/xen_memory/xen_memory0$ printf "%x\n" \
    $[ $( cat info/current_kb ) * 1024 ]
1fffa8000

so the balloon driver has added...
$ echo $[ ( 0x1fffa8000 - 0x1fff9d000 ) / 4096 ]
11

exactly 11 pages, just like the micro instance type.  I'm not sure
where the balloon driver gets that 11-page figure, nor am I sure why
current_kb is actually less than the balloon target.
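
For reference, here's a minimal sketch that redoes the byte math from
the e820 map and compares it to the target and current values quoted
above.  It assumes 4 KiB pages and takes the usable ranges and the
page 0 reservation straight from the dmesg output; nothing here comes
from the balloon driver itself.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# "usable" ranges from the BIOS-e820 lines (inclusive end addresses)
USABLE = [
    (0x0000000000000000, 0x000000000009dfff),
    (0x0000000000100000, 0x00000000efffffff),
    (0x0000000100000000, 0x000000020fffffff),
]
# "e820: update [mem 0x00000000-0x00000fff] usable ==> reserved"
KERNEL_RESERVED = PAGE_SIZE

e820_bytes = sum(end - start + 1 for start, end in USABLE) - KERNEL_RESERVED
target_bytes = 0x200000000   # xen_memory0/target
current_bytes = 0x1fffa8000  # xen_memory0/info/current_kb * 1024

added_pages = (current_bytes - e820_bytes) // PAGE_SIZE
short_of_target = (target_bytes - current_bytes) // PAGE_SIZE
e820_short_of_target = (target_bytes - e820_bytes) // PAGE_SIZE

print(f"e820 usable:            {e820_bytes:#x}")
print(f"added by balloon:       {added_pages} pages")
print(f"current below target:   {short_of_target} pages")
print(f"e820 below target:      {e820_short_of_target} pages")
```

So the e820 usable memory is 99 pages short of the target, the balloon
driver made up 11 of those, and current is still 88 pages below the
target.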