Re: [PATCH] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds

From: Dan Williams <dan.j.williams@intel.com>
To: Baoquan He <bhe@redhat.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	X86 ML <x86@kernel.org>, Kees Cook <keescook@chromium.org>,
	Thomas Garnier <thgarnie@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yasuaki Ishimatsu <yasu.isimatu@gmail.com>,
	Jinbum Park <jinb.park7@gmail.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yinghai Lu <yinghai@kernel.org>, Dave Young <dyoung@redhat.com>
Date: Mon, 1 May 2017 07:40:11 -0700
Message-ID: <CAPcyv4jfFK3ZMGtqFRjgtB1__73Ja-94XS3=FB1Wj9SNxn+-JQ@mail.gmail.com>
In-Reply-To: <1493638874-4014-1-git-send-email-bhe@redhat.com>

On Mon, May 1, 2017 at 4:41 AM, Baoquan He <bhe@redhat.com> wrote:
> Jeff Moyer reported that on his system with two memory regions, 0~64G and
> 1T~1T+192G, and the kernel option "memmap=192G!1024G" added, enabling KASLR
> makes the system hang intermittently during boot, while adding 'nokaslr'
> does not.
>
> This is because the for loop count calculation in sync_global_pgds() is
> not correct. When a mapping area crosses pgd entries, we should
> calculate the starting address of the region which the next pgd covers and
> assign it to the next for loop count, rather than adding PGDIR_SIZE
> directly. The old code works right only if the mapping area is a multiple
> of PGDIR_SIZE, otherwise the end region could be skipped so that it can't
> be synchronized to all other processes from the kernel pgd init_mm.pgd.
>
> In Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
> PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is 1T aligned, which
> makes this area be mapped inside one pgd entry. With KASLR enabled,
> this area could cross two pgd entries, and then the next pgd entry won't
> be synced to all other processes. That is why we saw an empty PGD.
>
> Fix it in this patch.
>
> The back trace is pasted below:
>
> [    9.988867] IP: memcpy_erms+0x6/0x10
> [    9.988868] PGD 0
> [    9.988868]
> [    9.988870] Oops: 0000 [#1] SMP
> [    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E) i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
> [    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
> [    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
> [    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
> [    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
> [    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
> [    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
> [    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
> [    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
> [    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
> [    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
> [    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
> [    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
> [    9.988897] Call Trace:
> [    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
> [    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
> [    9.988909]  bdev_read_page+0x81/0xb0
> [    9.988911]  do_mpage_readpage+0x56f/0x770
> [    9.988912]  ? I_BDEV+0x20/0x20
> [    9.988915]  ? lru_cache_add+0xe/0x10
> [    9.988917]  mpage_readpages+0x148/0x1e0
> [    9.988917]  ? I_BDEV+0x20/0x20
> [    9.988918]  ? I_BDEV+0x20/0x20
> [    9.988921]  ? alloc_pages_current+0x88/0x120
> [    9.988923]  blkdev_readpages+0x1d/0x20
> [    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
> [    9.988926]  force_page_cache_readahead+0xa2/0x100
> [    9.988927]  page_cache_sync_readahead+0x3f/0x50
> [    9.988930]  generic_file_read_iter+0x60d/0x8c0
> [    9.988931]  blkdev_read_iter+0x37/0x40
> [    9.988933]  __vfs_read+0xe0/0x150
> [    9.988934]  vfs_read+0x8c/0x130
> [    9.988936]  SyS_read+0x55/0xc0
> [    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> [    9.988940] RIP: 0033:0x7f1ee0822480
> [    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
> [    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
> [    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
> [    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
> [    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
> [    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> [    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
> [    9.988962] CR2: ffff9387bfff0000
> [    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
> [    9.998690] Kernel panic - not syncing: Fatal exception
> [   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Reported-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Thomas Garnier <thgarnie@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
> Cc: Jinbum Park <jinb.park7@gmail.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Young <dyoung@redhat.com>
> ---

Good catch!

>  arch/x86/mm/init_64.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 15173d3..dbf4f00 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -94,12 +94,14 @@ __setup("noexec32=", nonx32_setup);
>   */
>  void sync_global_pgds(unsigned long start, unsigned long end)
>  {
> -       unsigned long address;
> +       unsigned long address, address_next;
>
> -       for (address = start; address <= end; address += PGDIR_SIZE) {
> +       for (address = start; address <= end; address = address_next) {
>                 const pgd_t *pgd_ref = pgd_offset_k(address);
>                 struct page *page;
>
> +               address_next = (address & PGDIR_MASK) + PGDIR_SIZE;
> +

Let's change this to put the next address calculation in the for loop
directly and use the ALIGN macro. Something like:

 for (address = start; address <= end; address = ALIGN(address + 1, PGDIR_SIZE))
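
To illustrate why the stride matters, here is a rough userspace sketch, not kernel code. It only counts how many times each loop form iterates over a range. The PGDIR_SHIFT of 39 assumes 4-level paging, ALIGN_UP is a stand-in for the kernel's ALIGN(), and pgd_slot(), visits_old() and visits_aligned() are hypothetical helpers named for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4-level paging, so one PGD entry covers 512 GiB. */
#define PGDIR_SHIFT	39
#define PGDIR_SIZE	(UINT64_C(1) << PGDIR_SHIFT)
#define GIB		(UINT64_C(1) << 30)

/* Userspace stand-in for the kernel's ALIGN(); 'a' must be a power of two. */
#define ALIGN_UP(x, a)	(((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Index of the PGD entry covering 'address' (hypothetical helper). */
unsigned int pgd_slot(uint64_t address)
{
	return (address >> PGDIR_SHIFT) & 511;
}

/* Old loop form: fixed PGDIR_SIZE stride from a possibly unaligned start. */
unsigned int visits_old(uint64_t start, uint64_t end)
{
	unsigned int n = 0;
	uint64_t address;

	assert(start <= end);
	for (address = start; address <= end; address += PGDIR_SIZE)
		n++;
	return n;
}

/* Suggested loop form: step to the start of the next PGD entry. */
unsigned int visits_aligned(uint64_t start, uint64_t end)
{
	unsigned int n = 0;
	uint64_t address;

	assert(start <= end);
	for (address = start; address <= end;
	     address = ALIGN_UP(address + 1, PGDIR_SIZE))
		n++;
	return n;
}
```

With a made-up 192G range that starts 400G into a PGD entry, the range spans two entries. The old form iterates once, because start + PGDIR_SIZE is already past end, while the aligned form iterates twice, once per entry. When the start is PGDIR_SIZE aligned both forms agree, which matches the 'nokaslr' behaviour described in the changelog.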