Re: [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section

From: David Hildenbrand <david@redhat.com>
To: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	stable@vger.kernel.org,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Pavel Tatashin <pasha.tatashin@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Steven Sistare <steven.sistare@oracle.com>,
	Michal Hocko <mhocko@suse.com>, Bob Picco <bob.picco@oracle.com>,
	Oscar Salvador <osalvador@suse.de>
Subject: Re: [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section
Date: Tue, 10 Dec 2019 11:11:03 +0100
Message-ID: <c0733e11-bf06-8813-11de-019cdbddef34@redhat.com>
In-Reply-To: <20191209211502.zhbvzv2qwbvcperm@ca-dmjordan1.us.oracle.com>

On 09.12.19 22:15, Daniel Jordan wrote:
> Hi David,
> 
> On Mon, Dec 09, 2019 at 06:48:34PM +0100, David Hildenbrand wrote:
>> If max_pfn is not aligned to a section boundary, we can easily run into
>> BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
>> memory size that is not a multiple of 128MB (e.g., 4097MB, but also
>> 4160MB). I was told that on real HW, we can easily have this scenario
>> (esp., one of the main reasons sub-section hotadd of devmem was added).
>>
>> The issue is, that we have a valid memmap (pfn_valid()) for the
>> whole section, and the whole section will be marked "online".
>> pfn_to_online_page() will succeed, but the memmap contains garbage.
>>
>> E.g., doing a "cat /proc/kpageflags > /dev/null" results in
>>
>> [  303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
>> [  303.218899] #PF: supervisor read access in kernel mode
>> [  303.219344] #PF: error_code(0x0000) - not-present page
>> [  303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
>> [  303.220266] Oops: 0000 [#1] SMP NOPTI
>> [  303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
> 

Hi Daniel,

> I can't reproduce this on x86-64 qemu, next-20191128 or mainline, with either
> memory size.  What config are you using?  How often are you hitting it?

Thanks for verifying! Hah, there is one piece needed to reproduce via
"cat /proc/kpageflags > /dev/null" that I overlooked on my QEMU cmdline
(see below).

I can reproduce it reliably (QEMU with "-m 4160M") via:

[root@localhost ~]# uname -a
Linux localhost 5.5.0-rc1-next-20191209 #93 SMP Tue Dec 10 10:46:19 CET 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# ./page-types -r -a 0x144001
[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

(tool located in tools/vm/page-types.c, see also patch #2)
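For reference, a minimal sketch of the arithmetic behind that pfn. It assumes
the usual x86-64 values (4 KiB pages, 128 MiB sections) and that
/proc/kpageflags holds one 64-bit word per pfn. The helper names are made up
for illustration and are not taken from the kernel or from page-types.c. Note
that it only checks the RAM size against the section size; where max_pfn ends
up also depends on the machine's memory layout, which is not modelled here.

```python
# Sketch only: constants are the usual x86-64 values, not read from a kernel.
PAGE_SHIFT = 12                      # 4 KiB pages
SECTION_SIZE_BITS = 27               # 128 MiB sections
PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT)
KPF_ENTRY_SIZE = 8                   # one 64-bit flags word per pfn


def section_of(pfn):
    """Return (section number, first pfn, one-past-last pfn) for a pfn."""
    nr = pfn // PAGES_PER_SECTION
    start = nr * PAGES_PER_SECTION
    return nr, start, start + PAGES_PER_SECTION


def is_section_aligned_mb(size_mb):
    """True if a RAM size given in MiB is a multiple of the section size."""
    return (size_mb << 20) % (1 << SECTION_SIZE_BITS) == 0


def kpageflags_offset(pfn):
    """File offset of the flags word for a pfn in /proc/kpageflags."""
    return pfn * KPF_ENTRY_SIZE


if __name__ == "__main__":
    pfn = 0x144001
    nr, start, end = section_of(pfn)
    print(f"pfn {pfn:#x}: section {nr}, pfns [{start:#x}, {end:#x})")
    print(f"kpageflags offset: {kpageflags_offset(pfn):#x}")
    for mb in (4096, 4097, 4160):
        print(f"{mb} MiB section aligned: {is_section_aligned_mb(mb)}")
```

So pfn 0x144001 sits in the middle of a section rather than at a section
boundary, and neither 4097MB nor 4160MB is a multiple of the section size.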

To reproduce via "cat /proc/kpageflags > /dev/null", you have to
hot/coldplug one DIMM, to move max_pfn beyond the garbage memmap
(see also patch #2). My QEMU cmdline with Fedora 31:

qemu-system-x86_64 \
    --enable-kvm \
    -m 4160M,slots=4,maxmem=8G \
    -hda Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \
    -machine pc \
    -nographic \
    -nodefaults \
    -chardev stdio,id=serial,signal=off \
    -device isa-serial,chardev=serial \
    -object memory-backend-ram,id=mem0,size=1024M \
    -device pc-dimm,id=dimm0,memdev=mem0

[root@localhost ~]# uname -a
Linux localhost 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 

> 
> It may not have anything to do with the config, and I may be getting lucky with
> the garbage in my memory.
> 

Some things that might be relevant from my config:

# CONFIG_PAGE_POISONING is not set
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

The F31 default config should make it trigger.
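If it helps to compare configs, here is a rough sketch that checks a config
file's text against the options listed above. The function names are
hypothetical, and treating an option that is absent from the file as "not set"
is my assumption, not something the kernel tooling guarantees.

```python
import re

# Options quoted above; "n" stands for "# CONFIG_... is not set".
EXPECTED = {
    "CONFIG_PAGE_POISONING": "n",
    "CONFIG_DEFERRED_STRUCT_PAGE_INIT": "y",
    "CONFIG_SPARSEMEM_EXTREME": "y",
    "CONFIG_SPARSEMEM_VMEMMAP_ENABLE": "y",
    "CONFIG_SPARSEMEM_VMEMMAP": "y",
    "CONFIG_HAVE_MEMBLOCK_NODE_MAP": "y",
    "CONFIG_MEMORY_HOTPLUG": "y",
    "CONFIG_MEMORY_HOTPLUG_SPARSE": "y",
    "CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE": "y",
}


def parse_config(text):
    """Map option name -> value for the lines of a kernel .config."""
    opts = {}
    for line in text.splitlines():
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def mismatches(text, expected=EXPECTED):
    """Return {option: (expected, found)} for options that differ.

    Assumption: an option missing from the file counts as "n".
    """
    opts = parse_config(text)
    return {
        name: (want, opts.get(name, "n"))
        for name, want in expected.items()
        if opts.get(name, "n") != want
    }


if __name__ == "__main__":
    sample = "# CONFIG_PAGE_POISONING is not set\nCONFIG_MEMORY_HOTPLUG=y\n"
    for name, (want, found) in sorted(mismatches(sample).items()):
        print(f"{name}: expected {want}, found {found}")
```

To run it against a real config, pass the file's contents to mismatches(); an
empty result means all of the listed options match.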

Will update this patch description - thanks!

...

-- 
Thanks,

David / dhildenb