Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds

From: Thomas Garnier <thgarnie@google.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Baoquan He <bhe@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	X86 ML <x86@kernel.org>, Kees Cook <keescook@chromium.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yasuaki Ishimatsu <yasu.isimatu@gmail.com>,
	Jinbum Park <jinb.park7@gmail.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yinghai Lu <yinghai@kernel.org>, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
Date: Thu, 4 May 2017 09:25:37 -0700
Message-ID: <CAJcbSZG4j_U6tML=TFzHCrOpUi49gEfaseTBSiLpzrNjx+J6zg@mail.gmail.com>
In-Reply-To: <CAPcyv4gsYqbrAV+z3_J+qn+-qk9uPVDAbkTqLdYPmmzs-ReVKw@mail.gmail.com>

On Wed, May 3, 2017 at 7:35 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 3, 2017 at 7:25 PM, Baoquan He <bhe@redhat.com> wrote:
>> Jeff Moyer reported that on his system with two memory regions, 0~64G and
>> 1T~1T+192G, and the kernel option "memmap=192G!1024G" added, enabling KASLR
>> makes the system hang intermittently during boot, while adding 'nokaslr'
>> does not.
>>
>> This is because the for loop count calculation in sync_global_pgds() is
>> not correct. When a mapping area crosses pgd entries, the loop should
>> advance to the starting address of the region covered by the next pgd
>> entry, rather than adding PGDIR_SIZE directly. The old code is only
>> reliable when the start of the mapping area is aligned to PGDIR_SIZE;
>> otherwise the end region can be skipped, so it is never synchronized
>> from the kernel pgd init_mm.pgd to all other processes.
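
The difference between the two loop increments can be sketched in user space. This is a minimal illustration, not the kernel code: it assumes x86-64 4-level paging (one PGD entry maps 512 GiB, PGDIR_SHIFT = 39), treats end as inclusive, and the helpers align_up(), pgds_visited_old(), pgds_visited_new() and compare_loops() are hypothetical names that only count how many PGD entries each loop body would visit.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: x86-64 4-level paging, one PGD entry maps 512 GiB. */
#define PGDIR_SHIFT 39
#define PGDIR_SIZE  (1ULL << PGDIR_SHIFT)
#define GiB         (1ULL << 30)

/* Hypothetical helper: round x up to a multiple of a (a power of two). */
static unsigned long long align_up(unsigned long long x, unsigned long long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Old loop: step by PGDIR_SIZE from a possibly unaligned start.
 * Returns how many PGD entries the loop body visits for [start, end]. */
static unsigned long long pgds_visited_old(unsigned long long start,
					   unsigned long long end)
{
	unsigned long long addr, n = 0;

	for (addr = start; addr <= end; addr += PGDIR_SIZE)
		n++;
	return n;
}

/* Fixed loop: jump to the first address covered by the next PGD entry. */
static unsigned long long pgds_visited_new(unsigned long long start,
					   unsigned long long end)
{
	unsigned long long addr, n = 0;

	for (addr = start; addr <= end; addr = align_up(addr + 1, PGDIR_SIZE))
		n++;
	return n;
}

/* Hypothetical helper: print both counts for an inclusive range. */
static void compare_loops(unsigned long long start, unsigned long long end)
{
	printf("[%lluG, %lluG]: old=%llu new=%llu\n",
	       start / GiB, end / GiB,
	       pgds_visited_old(start, end), pgds_visited_new(start, end));
}
```

For an aligned range such as [0, 600G) both loops visit two entries. For an unaligned range such as [300G, 600G), which crosses the 512G boundary, the old loop stops after the first entry (300G + 512G is already past the end), while the aligned increment also visits the entry covering [512G, 600G).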
>>
>> In Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
>> PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is then 1T aligned, which
>> places this area inside a single pgd entry. With KASLR enabled, the area
>> can cross two pgd entries, and the second pgd entry is not synced to all
>> other processes. That is why we saw an empty PGD.
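
Jeff's case can be worked through with plain arithmetic. This sketch assumes 4-level paging (PGDIR_SIZE = 512G) and models addresses as offsets from a 512G-aligned point, since only the position within a PGD entry matters. The 400G KASLR shift used below is a hypothetical value chosen to show a crossing, and pgds_spanned(), pgds_visited_old() and show_case() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 4-level paging, one PGD entry maps 512 GiB. */
#define PGDIR_SHIFT 39
#define PGDIR_SIZE  (1ULL << PGDIR_SHIFT)
#define GiB         (1ULL << 30)

/* Number of PGD entries touched by the inclusive range [start, end]. */
static unsigned long long pgds_spanned(unsigned long long start,
				       unsigned long long end)
{
	assert(start <= end);
	return (end >> PGDIR_SHIFT) - (start >> PGDIR_SHIFT) + 1;
}

/* Entries the old "addr += PGDIR_SIZE" loop visits for [start, end]. */
static unsigned long long pgds_visited_old(unsigned long long start,
					   unsigned long long end)
{
	unsigned long long addr, n = 0;

	for (addr = start; addr <= end; addr += PGDIR_SIZE)
		n++;
	return n;
}

/* Model the 192G pmem area at physical 1024G, mapped at base + 1024G,
 * where base is the (hypothetical) direct-map offset from a 512G boundary. */
static void show_case(const char *label, unsigned long long base)
{
	unsigned long long start = base + 1024 * GiB;
	unsigned long long end = start + 192 * GiB - 1;

	printf("%s: spans %llu PGD entries, old loop syncs %llu\n", label,
	       pgds_spanned(start, end), pgds_visited_old(start, end));
}
```

With base 0 (the aligned, 'nokaslr'-like case) the area [1024G, 1216G) sits inside one entry and the old loop syncs it. With a hypothetical 400G shift the area becomes [1424G, 1616G), which crosses the 1536G boundary: it spans two entries, but the old loop syncs only the first.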
>>
>> Fix it in this patch.
>>
>> The backtrace is pasted below:
>>
>> [    9.988867] IP: memcpy_erms+0x6/0x10
>> [    9.988868] PGD 0
>> [    9.988868]
>> [    9.988870] Oops: 0000 [#1] SMP
>> [    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
>> i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
>> [    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
>> [    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
>> [    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
>> [    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
>> [    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
>> [    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
>> [    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
>> [    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
>> [    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
>> [    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
>> [    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
>> [    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
>> [    9.988897] Call Trace:
>> [    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
>> [    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
>> [    9.988909]  bdev_read_page+0x81/0xb0
>> [    9.988911]  do_mpage_readpage+0x56f/0x770
>> [    9.988912]  ? I_BDEV+0x20/0x20
>> [    9.988915]  ? lru_cache_add+0xe/0x10
>> [    9.988917]  mpage_readpages+0x148/0x1e0
>> [    9.988917]  ? I_BDEV+0x20/0x20
>> [    9.988918]  ? I_BDEV+0x20/0x20
>> [    9.988921]  ? alloc_pages_current+0x88/0x120
>> [    9.988923]  blkdev_readpages+0x1d/0x20
>> [    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
>> [    9.988926]  force_page_cache_readahead+0xa2/0x100
>> [    9.988927]  page_cache_sync_readahead+0x3f/0x50
>> [    9.988930]  generic_file_read_iter+0x60d/0x8c0
>> [    9.988931]  blkdev_read_iter+0x37/0x40
>> [    9.988933]  __vfs_read+0xe0/0x150
>> [    9.988934]  vfs_read+0x8c/0x130
>> [    9.988936]  SyS_read+0x55/0xc0
>> [    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
>> [    9.988940] RIP: 0033:0x7f1ee0822480
>> [    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
>> [    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
>> [    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
>> [    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
>> [    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
>> [    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> [    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
>> [    9.988962] CR2: ffff9387bfff0000
>> [    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
>> [    9.998690] Kernel panic - not syncing: Fatal exception
>> [   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Reported-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Baoquan He <bhe@redhat.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Thomas Garnier <thgarnie@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
>> Cc: Jinbum Park <jinb.park7@gmail.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Yinghai Lu <yinghai@kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Young <dyoung@redhat.com>
>
> I think this needs a "Fixes:" tag and Cc: <stable@vger.kernel.org>.

Agreed.

>
> Other than that:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Thanks again!

Reviewed-by: Thomas Garnier <thgarnie@google.com>
-- 
Thomas