[PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds

* [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
@ 2017-05-04  2:25 Baoquan He
  2017-05-04  2:35 ` Dan Williams
  2017-05-05  8:11 ` [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() tip-bot for Baoquan He
  0 siblings, 2 replies; 8+ messages in thread
From: Baoquan He @ 2017-05-04  2:25 UTC
  To: linux-kernel, mingo
  Cc: Baoquan He, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Kees Cook, Thomas Garnier, Andrew Morton, Yasuaki Ishimatsu,
	Jinbum Park, Dave Hansen, Kirill A. Shutemov, Yinghai Lu,
	Dan Williams, Dave Young

Jeff Moyer reported that on his system, which has two memory regions,
0~64G and 1T~1T+192G, booting with the kernel option "memmap=192G!1024G"
hangs intermittently when kaslr is enabled. Adding 'nokaslr' avoids the
hang.

This is because the for loop counter in sync_global_pgds() is advanced
incorrectly. When a mapping area crosses pgd entries, the next iteration
should start at the first address covered by the next pgd entry, instead
of simply adding PGDIR_SIZE to the current address. The old code works
only if the mapping area is a multiple of PGDIR_SIZE; otherwise the end
of the region can be skipped, so it is never synchronized from the
kernel pgd, init_mm.pgd, to the pgds of all other processes.
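
The difference between the two step rules can be sketched in userspace.
This is only an illustration, not the kernel code: PGDIR_SHIFT = 39
assumes 4-level paging, and ALIGN_UP, visits_old(), visits_new() and
entries_spanned() are hypothetical names introduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4-level paging on x86-64, one pgd entry maps 512 GiB. */
#define PGDIR_SHIFT 39
#define PGDIR_SIZE  (UINT64_C(1) << PGDIR_SHIFT)
#define GiB         (UINT64_C(1) << 30)
/* Userspace stand-in for the kernel's ALIGN(); a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/* Iterations of the old loop, which steps by a fixed PGDIR_SIZE. */
static size_t visits_old(uint64_t start, uint64_t end)
{
	size_t n = 0;

	for (uint64_t addr = start; addr <= end; addr += PGDIR_SIZE)
		n++;
	return n;
}

/* Iterations of the fixed loop, which jumps to the next pgd boundary. */
static size_t visits_new(uint64_t start, uint64_t end)
{
	size_t n = 0;

	for (uint64_t addr = start; addr <= end;
	     addr = ALIGN_UP(addr + 1, PGDIR_SIZE))
		n++;
	return n;
}

/* Number of pgd entries that the inclusive range [start, end] touches. */
static size_t entries_spanned(uint64_t start, uint64_t end)
{
	return (size_t)((end >> PGDIR_SHIFT) - (start >> PGDIR_SHIFT)) + 1;
}
```

For an unaligned range such as [400G, 600G), which touches two pgd
entries, the old rule runs the loop body once and the new rule twice.
The sketch uses low addresses only, so it does not model wrap-around at
the top of the address space.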

In Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is then 1T aligned, so
the area is mapped inside a single pgd entry. With kaslr enabled, the
area can cross two pgd entries, and the second pgd entry is not synced
to all other processes. That is why we saw an empty PGD.
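
The arithmetic can be checked with a small userspace sketch. The
non-kaslr base 0xffff880000000000 is the default 4-level PAGE_OFFSET
for this kernel generation; the randomized base used in the note below
is a hypothetical value picked for illustration, and pgd_slot() and
crosses_pgd() are names introduced here, not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, 512 GiB per pgd entry, 512 entries. */
#define PGDIR_SHIFT 39
#define GiB         (UINT64_C(1) << 30)

/* Index (0..511) of the pgd entry that covers a virtual address. */
static unsigned int pgd_slot(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> PGDIR_SHIFT) & 511);
}

/*
 * True if the direct mapping of the physical range [phys, phys + size)
 * touches more than one pgd entry for the given direct-mapping base.
 */
static bool crosses_pgd(uint64_t page_offset, uint64_t phys, uint64_t size)
{
	uint64_t first = page_offset + phys;
	uint64_t last = first + size - 1;

	return (first >> PGDIR_SHIFT) != (last >> PGDIR_SHIFT);
}
```

With page_offset = 0xffff880000000000, crosses_pgd() is false for the
192G area at 1024G. With the hypothetical randomized base
0xffff9257c0000000 it is true: the area starts in pgd slot 294 and ends
in slot 295.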

Fix it in this patch.

The backtrace is pasted below:

[    9.988867] IP: memcpy_erms+0x6/0x10
[    9.988868] PGD 0
[    9.988868]
[    9.988870] Oops: 0000 [#1] SMP
[    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
[    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
[    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
[    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
[    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
[    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
[    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
[    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
[    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
[    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
[    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
[    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
[    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
[    9.988897] Call Trace:
[    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
[    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
[    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
[    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
[    9.988909]  bdev_read_page+0x81/0xb0
[    9.988911]  do_mpage_readpage+0x56f/0x770
[    9.988912]  ? I_BDEV+0x20/0x20
[    9.988915]  ? lru_cache_add+0xe/0x10
[    9.988917]  mpage_readpages+0x148/0x1e0
[    9.988917]  ? I_BDEV+0x20/0x20
[    9.988918]  ? I_BDEV+0x20/0x20
[    9.988921]  ? alloc_pages_current+0x88/0x120
[    9.988923]  blkdev_readpages+0x1d/0x20
[    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
[    9.988926]  force_page_cache_readahead+0xa2/0x100
[    9.988927]  page_cache_sync_readahead+0x3f/0x50
[    9.988930]  generic_file_read_iter+0x60d/0x8c0
[    9.988931]  blkdev_read_iter+0x37/0x40
[    9.988933]  __vfs_read+0xe0/0x150
[    9.988934]  vfs_read+0x8c/0x130
[    9.988936]  SyS_read+0x55/0xc0
[    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[    9.988940] RIP: 0033:0x7f1ee0822480
[    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
[    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
[    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
[    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
[    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
[    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
[    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
[    9.988962] CR2: ffff9387bfff0000
[    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
[    9.998690] Kernel panic - not syncing: Fatal exception
[   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Young <dyoung@redhat.com>
---
v1->v2:
    Use ALIGN(address + 1, PGDIR_SIZE) to calculate the next for loop count.
    It's suggested by Dan.
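
A minimal userspace sketch of why the "+ 1" matters, assuming 4-level
paging (PGDIR_SIZE = 512G); ALIGN_UP and next_pgd_start() are
hypothetical stand-ins, not kernel helpers. Without the "+ 1", an
address that is already PGDIR_SIZE aligned rounds up to itself and the
loop would not advance.

```c
#include <stdint.h>

/* Assumption: 4-level paging, one pgd entry maps 512 GiB. */
#define PGDIR_SIZE (UINT64_C(1) << 39)
/* Userspace stand-in for the kernel's ALIGN(); a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/* First address covered by the pgd entry after the one containing addr. */
static uint64_t next_pgd_start(uint64_t addr)
{
	return ALIGN_UP(addr + 1, PGDIR_SIZE);
}
```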

v2->v3:
   Ingo suggested renaming the local variable 'address' to 'addr' so that
   the for loop line fits into 80 columns.

 arch/x86/mm/init_64.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 745e5e1..97fe887 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -94,10 +94,10 @@ __setup("noexec32=", nonx32_setup);
  */
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
-	unsigned long address;
+	unsigned long addr;
 
-	for (address = start; address <= end; address += PGDIR_SIZE) {
-		pgd_t *pgd_ref = pgd_offset_k(address);
+	for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+		pgd_t *pgd_ref = pgd_offset_k(addr);
 		const p4d_t *p4d_ref;
 		struct page *page;
 
@@ -106,7 +106,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		 * handle synchonization on p4d level.
 		 */
 		BUILD_BUG_ON(pgd_none(*pgd_ref));
-		p4d_ref = p4d_offset(pgd_ref, address);
+		p4d_ref = p4d_offset(pgd_ref, addr);
 
 		if (p4d_none(*p4d_ref))
 			continue;
@@ -117,8 +117,8 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 			p4d_t *p4d;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			p4d = p4d_offset(pgd, address);
+			pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+			p4d = p4d_offset(pgd, addr);
 			/* the pgt_lock only for Xen */
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 			spin_lock(pgt_lock);
-- 
2.5.5
