linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
@ 2017-05-04  2:25 Baoquan He
  2017-05-04  2:35 ` Dan Williams
  2017-05-05  8:11 ` [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() tip-bot for Baoquan He
  0 siblings, 2 replies; 8+ messages in thread
From: Baoquan He @ 2017-05-04  2:25 UTC
  To: linux-kernel, mingo
  Cc: Baoquan He, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Kees Cook, Thomas Garnier, Andrew Morton, Yasuaki Ishimatsu,
	Jinbum Park, Dave Hansen, Kirill A. Shutemov, Yinghai Lu,
	Dan Williams, Dave Young

Jeff Moyer reported that on his system with two memory regions, 0~64G and
1T~1T+192G, and with the kernel option "memmap=192G!1024G" added, enabling
KASLR makes the system hang intermittently during boot, while adding
'nokaslr' does not.

This is because the for loop count calculation in sync_global_pgds() is
not correct. When a mapping area crosses PGD entries, we should
calculate the starting address of the region which the next PGD covers and
assign it to the next for loop count, rather than adding PGDIR_SIZE directly.
The old code works right only if the mapping area is a multiple of PGDIR_SIZE,
otherwise the end region could be skipped so that it can't be synchronized
to all other processes from the kernel PGD init_mm.pgd.

On Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is 1T aligned, so this
area is mapped inside one PGD entry. With KASLR enabled, this area can
cross two PGD entries, and the second PGD entry is then not synced to
all other processes. That is why we saw an empty PGD.

Fix it in this patch.
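
To illustrate the stepping problem, here is a minimal user-space sketch
(not kernel code). It assumes 4-level paging with PGDIR_SHIFT = 39 and
models only which PGD indexes the loop visits, not the page table copy.
The names walk_old(), walk_new() and ALIGN_UP() are hypothetical, and the
wrap-around guards exist only in this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4-level paging, one PGD entry covers 2^39 bytes (512G). */
#define PGDIR_SHIFT	39
#define PGDIR_SIZE	(UINT64_C(1) << PGDIR_SHIFT)
#define PTRS_PER_PGD	512
/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

struct walk {
	unsigned int visited;		/* number of loop iterations */
	unsigned int last_index;	/* PGD index of the last iteration */
};

static unsigned int pgd_index(uint64_t addr)
{
	return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* Old stepping: add PGDIR_SIZE to the (possibly unaligned) address. */
static struct walk walk_old(uint64_t start, uint64_t end)
{
	struct walk w = { 0, 0 };
	uint64_t addr;

	assert(start <= end);
	for (addr = start; addr <= end; addr += PGDIR_SIZE) {
		w.visited++;
		w.last_index = pgd_index(addr);
		if (addr > UINT64_MAX - PGDIR_SIZE)
			break;	/* sketch-only guard against wrap-around */
	}
	return w;
}

/* New stepping: jump to the start of the next PGD entry. */
static struct walk walk_new(uint64_t start, uint64_t end)
{
	struct walk w = { 0, 0 };
	uint64_t addr;

	assert(start <= end);
	for (addr = start; addr <= end; addr = ALIGN_UP(addr + 1, PGDIR_SIZE)) {
		w.visited++;
		w.last_index = pgd_index(addr);
		if (addr >= ~(PGDIR_SIZE - 1))
			break;	/* sketch-only guard: last PGD slot */
	}
	return w;
}
```

With a hypothetical start of 0xffff9357c0000000 and end = start + 192G - 1
(0xffff9387bfffffff), walk_old() visits only PGD index 294, while
walk_new() visits 294 and 295. The faulting address in CR2 in the back
trace in this message, ffff9387bfff0000, also has PGD index 295. With a
start of 0xffff890000000000 (assuming a non-randomized PAGE_OFFSET of
0xffff880000000000 plus 1T), both loops visit the single entry 274.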

The back trace is pasted below:

[    9.988867] IP: memcpy_erms+0x6/0x10
[    9.988868] PGD 0
[    9.988868]
[    9.988870] Oops: 0000 [#1] SMP
[    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
[    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
[    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
[    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
[    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
[    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
[    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
[    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
[    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
[    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
[    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
[    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
[    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
[    9.988897] Call Trace:
[    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
[    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
[    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
[    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
[    9.988909]  bdev_read_page+0x81/0xb0
[    9.988911]  do_mpage_readpage+0x56f/0x770
[    9.988912]  ? I_BDEV+0x20/0x20
[    9.988915]  ? lru_cache_add+0xe/0x10
[    9.988917]  mpage_readpages+0x148/0x1e0
[    9.988917]  ? I_BDEV+0x20/0x20
[    9.988918]  ? I_BDEV+0x20/0x20
[    9.988921]  ? alloc_pages_current+0x88/0x120
[    9.988923]  blkdev_readpages+0x1d/0x20
[    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
[    9.988926]  force_page_cache_readahead+0xa2/0x100
[    9.988927]  page_cache_sync_readahead+0x3f/0x50
[    9.988930]  generic_file_read_iter+0x60d/0x8c0
[    9.988931]  blkdev_read_iter+0x37/0x40
[    9.988933]  __vfs_read+0xe0/0x150
[    9.988934]  vfs_read+0x8c/0x130
[    9.988936]  SyS_read+0x55/0xc0
[    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[    9.988940] RIP: 0033:0x7f1ee0822480
[    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
[    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
[    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
[    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
[    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
[    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
[    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
[    9.988962] CR2: ffff9387bfff0000
[    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
[    9.998690] Kernel panic - not syncing: Fatal exception
[   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Young <dyoung@redhat.com>
---
v1->v2:
    Use ALIGN(address + 1, PGDIR_SIZE) to calculate the next for loop count,
    as suggested by Dan.

v2->v3:
    Ingo suggested renaming the local variable 'address' to 'addr' so that
    the for loop line fits into 80 columns.
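
A note on the '+ 1' in the ALIGN() expression: rounding an address that
already sits on a PGDIR_SIZE boundary up to PGDIR_SIZE returns the same
address, so a loop stepping that way would stop making progress after its
first aligned value. The user-space sketch below shows the difference. It
assumes PGDIR_SIZE = 2^39, and ALIGN_UP() and the two helper names are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4-level paging, one PGD entry covers 2^39 bytes. */
#define PGDIR_SIZE	(UINT64_C(1) << 39)
/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Stepping without the '+ 1': an aligned address maps to itself. */
static uint64_t next_without_plus_one(uint64_t addr)
{
	return ALIGN_UP(addr, PGDIR_SIZE);
}

/* Stepping with the '+ 1': always lands on the next PGD boundary. */
static uint64_t next_with_plus_one(uint64_t addr)
{
	assert(addr < ~(PGDIR_SIZE - 1));	/* sketch-only: no wrap-around */
	return ALIGN_UP(addr + 1, PGDIR_SIZE);
}
```

For the aligned address 0xffff938000000000, next_without_plus_one()
returns the same address, while next_with_plus_one() returns
0xffff940000000000. For an unaligned address such as 0xffff9357c0000000,
both return 0xffff938000000000.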

 arch/x86/mm/init_64.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 745e5e1..97fe887 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -94,10 +94,10 @@ __setup("noexec32=", nonx32_setup);
  */
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
-	unsigned long address;
+	unsigned long addr;
 
-	for (address = start; address <= end; address += PGDIR_SIZE) {
-		pgd_t *pgd_ref = pgd_offset_k(address);
+	for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+		pgd_t *pgd_ref = pgd_offset_k(addr);
 		const p4d_t *p4d_ref;
 		struct page *page;
 
@@ -106,7 +106,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		 * handle synchonization on p4d level.
 		 */
 		BUILD_BUG_ON(pgd_none(*pgd_ref));
-		p4d_ref = p4d_offset(pgd_ref, address);
+		p4d_ref = p4d_offset(pgd_ref, addr);
 
 		if (p4d_none(*p4d_ref))
 			continue;
@@ -117,8 +117,8 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 			p4d_t *p4d;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			p4d = p4d_offset(pgd, address);
+			pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+			p4d = p4d_offset(pgd, addr);
 			/* the pgt_lock only for Xen */
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 			spin_lock(pgt_lock);
-- 
2.5.5


* Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
  2017-05-04  2:25 [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds Baoquan He
@ 2017-05-04  2:35 ` Dan Williams
  2017-05-04 16:25   ` Thomas Garnier
  2017-05-05  8:11 ` [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() tip-bot for Baoquan He
  1 sibling, 1 reply; 8+ messages in thread
From: Dan Williams @ 2017-05-04  2:35 UTC
  To: Baoquan He
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86 ML, Kees Cook, Thomas Garnier, Andrew Morton,
	Yasuaki Ishimatsu, Jinbum Park, Dave Hansen, Kirill A. Shutemov,
	Yinghai Lu, Dave Young

On Wed, May 3, 2017 at 7:25 PM, Baoquan He <bhe@redhat.com> wrote:
> Jeff Moyer reported that on his system with two memory regions 0~64G and
> 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling kaslr
> will make system hang intermittently during boot. While adding 'nokaslr'
> won't.
>
> This is because the for loop count calculation in sync_global_pgds is
> not correct. When a mapping area crosses pgd entries, we should
> calculate the starting address of region which next pgd covers and assign
> it to next for loop count, but not add PGDIR_SIZE directly. The old
> code works right only if the mapping area is times of PGDIR_SIZE,
> otherwise the end region could be skipped so that it can't be synchronized
> to all other processes from kernel pgd init_mm.pgd.
>
> In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than
> PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it
> makes this area be mapped inside one pgd entry. With kaslr enabled,
> this area could cross two pgd entries, then the next pgd entry won't
> be synced to all other processes. That is why we saw empty PGD.
>
> Fix it in this patch.
>
> The back trace is pasted as below:
>
> [    9.988867] IP: memcpy_erms+0x6/0x10
> [    9.988868] PGD 0
> [    9.988868]
> [    9.988870] Oops: 0000 [#1] SMP
> [    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
> i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
> [    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
> [    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
> [    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
> [    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
> [    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
> [    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
> [    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
> [    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
> [    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
> [    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
> [    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
> [    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
> [    9.988897] Call Trace:
> [    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
> [    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
> [    9.988909]  bdev_read_page+0x81/0xb0
> [    9.988911]  do_mpage_readpage+0x56f/0x770
> [    9.988912]  ? I_BDEV+0x20/0x20
> [    9.988915]  ? lru_cache_add+0xe/0x10
> [    9.988917]  mpage_readpages+0x148/0x1e0
> [    9.988917]  ? I_BDEV+0x20/0x20
> [    9.988918]  ? I_BDEV+0x20/0x20
> [    9.988921]  ? alloc_pages_current+0x88/0x120
> [    9.988923]  blkdev_readpages+0x1d/0x20
> [    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
> [    9.988926]  force_page_cache_readahead+0xa2/0x100
> [    9.988927]  page_cache_sync_readahead+0x3f/0x50
> [    9.988930]  generic_file_read_iter+0x60d/0x8c0
> [    9.988931]  blkdev_read_iter+0x37/0x40
> [    9.988933]  __vfs_read+0xe0/0x150
> [    9.988934]  vfs_read+0x8c/0x130
> [    9.988936]  SyS_read+0x55/0xc0
> [    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> [    9.988940] RIP: 0033:0x7f1ee0822480
> [    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
> [    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
> [    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
> [    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
> [    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
> [    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> [    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
> [    9.988962] CR2: ffff9387bfff0000
> [    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
> [    9.998690] Kernel panic - not syncing: Fatal exception
> [   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Reported-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Thomas Garnier <thgarnie@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
> Cc: Jinbum Park <jinb.park7@gmail.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Young <dyoung@redhat.com>

I think this needs a "Fixes:" tag and Cc: <stable@vger.kernel.org>.

Other than that:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
  2017-05-04  2:35 ` Dan Williams
@ 2017-05-04 16:25   ` Thomas Garnier
  2017-05-08  2:27     ` Baoquan He
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Garnier @ 2017-05-04 16:25 UTC
  To: Dan Williams
  Cc: Baoquan He, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Kees Cook, Andrew Morton,
	Yasuaki Ishimatsu, Jinbum Park, Dave Hansen, Kirill A. Shutemov,
	Yinghai Lu, Dave Young

On Wed, May 3, 2017 at 7:35 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, May 3, 2017 at 7:25 PM, Baoquan He <bhe@redhat.com> wrote:
>> Jeff Moyer reported that on his system with two memory regions 0~64G and
>> 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling kaslr
>> will make system hang intermittently during boot. While adding 'nokaslr'
>> won't.
>>
>> This is because the for loop count calculation in sync_global_pgds is
>> not correct. When a mapping area crosses pgd entries, we should
>> calculate the starting address of region which next pgd covers and assign
>> it to next for loop count, but not add PGDIR_SIZE directly. The old
>> code works right only if the mapping area is times of PGDIR_SIZE,
>> otherwise the end region could be skipped so that it can't be synchronized
>> to all other processes from kernel pgd init_mm.pgd.
>>
>> In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than
>> PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it
>> makes this area be mapped inside one pgd entry. With kaslr enabled,
>> this area could cross two pgd entries, then the next pgd entry won't
>> be synced to all other processes. That is why we saw empty PGD.
>>
>> Fix it in this patch.
>>
>> The back trace is pasted as below:
>>
>> [    9.988867] IP: memcpy_erms+0x6/0x10
>> [    9.988868] PGD 0
>> [    9.988868]
>> [    9.988870] Oops: 0000 [#1] SMP
>> [    9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
>> i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
>> [    9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G            E   4.11.0-rc5+ #43
>> [    9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
>> [    9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
>> [    9.988890] RIP: 0010:memcpy_erms+0x6/0x10
>> [    9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
>> [    9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
>> [    9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
>> [    9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
>> [    9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
>> [    9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
>> [    9.988896] FS:  00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
>> [    9.988896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
>> [    9.988897] Call Trace:
>> [    9.988902]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
>> [    9.988904]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [    9.988905]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [    9.988907]  pmem_rw_page+0x3a/0x60 [nd_pmem]
>> [    9.988909]  bdev_read_page+0x81/0xb0
>> [    9.988911]  do_mpage_readpage+0x56f/0x770
>> [    9.988912]  ? I_BDEV+0x20/0x20
>> [    9.988915]  ? lru_cache_add+0xe/0x10
>> [    9.988917]  mpage_readpages+0x148/0x1e0
>> [    9.988917]  ? I_BDEV+0x20/0x20
>> [    9.988918]  ? I_BDEV+0x20/0x20
>> [    9.988921]  ? alloc_pages_current+0x88/0x120
>> [    9.988923]  blkdev_readpages+0x1d/0x20
>> [    9.988924]  __do_page_cache_readahead+0x1ce/0x2c0
>> [    9.988926]  force_page_cache_readahead+0xa2/0x100
>> [    9.988927]  page_cache_sync_readahead+0x3f/0x50
>> [    9.988930]  generic_file_read_iter+0x60d/0x8c0
>> [    9.988931]  blkdev_read_iter+0x37/0x40
>> [    9.988933]  __vfs_read+0xe0/0x150
>> [    9.988934]  vfs_read+0x8c/0x130
>> [    9.988936]  SyS_read+0x55/0xc0
>> [    9.988939]  entry_SYSCALL_64_fastpath+0x1a/0xa9
>> [    9.988940] RIP: 0033:0x7f1ee0822480
>> [    9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [    9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
>> [    9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
>> [    9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
>> [    9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
>> [    9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
>> [    9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> [    9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
>> [    9.988962] CR2: ffff9387bfff0000
>> [    9.989022] ---[ end trace fe34c0fc0fe685ab ]---
>> [    9.998690] Kernel panic - not syncing: Fatal exception
>> [   10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Reported-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Baoquan He <bhe@redhat.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Thomas Garnier <thgarnie@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
>> Cc: Jinbum Park <jinb.park7@gmail.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Yinghai Lu <yinghai@kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Young <dyoung@redhat.com>
>
> I think this needs a "Fixes:" tag and Cc: <stable@vger.kernel.org>.

Agreed.

>
> Other than that:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Thanks again!

Reviewed-by: Thomas Garnier <thgarnie@google.com>
-- 
Thomas


* [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
  2017-05-04  2:25 [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds Baoquan He
  2017-05-04  2:35 ` Dan Williams
@ 2017-05-05  8:11 ` tip-bot for Baoquan He
  2017-06-22  1:26   ` Dan Williams
  1 sibling, 1 reply; 8+ messages in thread
From: tip-bot for Baoquan He @ 2017-05-05  8:11 UTC
  To: linux-tip-commits
  Cc: peterz, tglx, linux-kernel, dan.j.williams, dyoung, yasu.isimatu,
	mingo, hpa, jinb.park7, jpoimboe, thgarnie, dvlasenk, bhe,
	kirill.shutemov, dave.hansen, akpm, luto, yinghai, jmoyer,
	brgerst, keescook, torvalds, bp

Commit-ID:  fc5f9d5f151c9fff21d3d1d2907b888a5aec3ff7
Gitweb:     http://git.kernel.org/tip/fc5f9d5f151c9fff21d3d1d2907b888a5aec3ff7
Author:     Baoquan He <bhe@redhat.com>
AuthorDate: Thu, 4 May 2017 10:25:47 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 5 May 2017 08:21:24 +0200

x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()

Jeff Moyer reported that on his system with two memory regions, 0~64G and
1T~1T+192G, and with the kernel option "memmap=192G!1024G" added, enabling
KASLR makes the system hang intermittently during boot, while adding
'nokaslr' does not.

The back trace is:

 Oops: 0000 [#1] SMP

 RIP: memcpy_erms()
 [ .... ]
 Call Trace:
  pmem_rw_page()
  bdev_read_page()
  do_mpage_readpage()
  mpage_readpages()
  blkdev_readpages()
  __do_page_cache_readahead()
  force_page_cache_readahead()
  page_cache_sync_readahead()
  generic_file_read_iter()
  blkdev_read_iter()
  __vfs_read()
  vfs_read()
  SyS_read()
  entry_SYSCALL_64_fastpath()

This crash happens because the for loop count calculation in sync_global_pgds()
is not correct. When a mapping area crosses PGD entries, we should
calculate the starting address of the region which the next PGD covers and
assign it to the next for loop count, rather than adding PGDIR_SIZE directly.
The old code works right only if the mapping area is an exact multiple of
PGDIR_SIZE, otherwise the end region could be skipped so that it can't be
synchronized to all other processes from the kernel PGD init_mm.pgd.

On Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is 1T aligned, so this
area is mapped inside one PGD entry. With KASLR enabled, this area can
cross two PGD entries, and the second PGD entry is then not synced to
all other processes. That is why we saw an empty PGD.

Fix it.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 745e5e1..97fe887 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -94,10 +94,10 @@ __setup("noexec32=", nonx32_setup);
  */
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
-	unsigned long address;
+	unsigned long addr;
 
-	for (address = start; address <= end; address += PGDIR_SIZE) {
-		pgd_t *pgd_ref = pgd_offset_k(address);
+	for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+		pgd_t *pgd_ref = pgd_offset_k(addr);
 		const p4d_t *p4d_ref;
 		struct page *page;
 
@@ -106,7 +106,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		 * handle synchonization on p4d level.
 		 */
 		BUILD_BUG_ON(pgd_none(*pgd_ref));
-		p4d_ref = p4d_offset(pgd_ref, address);
+		p4d_ref = p4d_offset(pgd_ref, addr);
 
 		if (p4d_none(*p4d_ref))
 			continue;
@@ -117,8 +117,8 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 			p4d_t *p4d;
 			spinlock_t *pgt_lock;
 
-			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			p4d = p4d_offset(pgd, address);
+			pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+			p4d = p4d_offset(pgd, addr);
 			/* the pgt_lock only for Xen */
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 			spin_lock(pgt_lock);


* Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds
  2017-05-04 16:25   ` Thomas Garnier
@ 2017-05-08  2:27     ` Baoquan He
  0 siblings, 0 replies; 8+ messages in thread
From: Baoquan He @ 2017-05-08  2:27 UTC
  To: Thomas Garnier
  Cc: Dan Williams, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Kees Cook, Andrew Morton,
	Yasuaki Ishimatsu, Jinbum Park, Dave Hansen, Kirill A. Shutemov,
	Yinghai Lu, Dave Young

On 05/04/17 at 09:25am, Thomas Garnier wrote:

> > I think this needs a "Fixes:" tag and Cc: <stable@vger.kernel.org>.

Sorry for the late response. Should I resend with them added?

> 
> Agreed.
> 
> >
> > Other than that:
> >
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> Thanks again!
> 
> Reviewed-by: Thomas Garnier <thgarnie@google.com>
> -- 
> Thomas


* Re: [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
  2017-05-05  8:11 ` [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() tip-bot for Baoquan He
@ 2017-06-22  1:26   ` Dan Williams
  2017-06-22  7:22     ` Ingo Molnar
  2017-06-27 11:12     ` Greg KH
  0 siblings, 2 replies; 8+ messages in thread
From: Dan Williams @ 2017-06-22  1:26 UTC
  To: Dan Williams, Dave Young, Peter Zijlstra, Thomas Gleixner,
	Linux Kernel Mailing List, H. Peter Anvin, jinb.park7, jpoimboe,
	Ingo Molnar, Yasuaki Ishimatsu, Andrew Morton, Jeff Moyer,
	yinghai, luto, Kirill A. Shutemov, Dave Hansen, Thomas Garnier,
	Baoquan He, dvlasenk, Borislav Petkov, brgerst, Linus Torvalds,
	Kees Cook, stable
  Cc: linux-tip-commits

[ adding -stable ]

The patch below is upstream as commit fc5f9d5f151c "x86/mm: Fix boot
crash caused by incorrect loop count calculation in
sync_global_pgds()". The referenced bug potentially affects all
KASLR-enabled kernels with more than 512GB of memory. Please apply this
patch to all current -stable kernels.
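
For reference, the 512GB figure matches the span of one PGD entry under
4-level paging with 4K pages. A small user-space sketch of that arithmetic
follows; the function name level_span() and the level numbering
(0 = PTE ... 3 = PGD) are hypothetical, and the 12-bit page offset with
9 index bits per level is an assumption about the paging mode in use.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4K pages */
#define BITS_PER_LEVEL	9	/* assumption: 512 entries per table */

/*
 * Bytes covered by one entry at the given level:
 * 0 = PTE (4K), 1 = PMD (2M), 2 = PUD (1G), 3 = PGD (512G).
 */
static uint64_t level_span(unsigned int level)
{
	assert(level <= 3);
	return UINT64_C(1) << (PAGE_SHIFT + BITS_PER_LEVEL * level);
}
```

level_span(3) evaluates to 2^39 bytes, i.e. 512G, which is the PGDIR_SIZE
the patch description refers to.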


On Fri, May 5, 2017 at 1:11 AM, tip-bot for Baoquan He <tipbot@zytor.com> wrote:
> Commit-ID:  fc5f9d5f151c9fff21d3d1d2907b888a5aec3ff7
> Gitweb:     http://git.kernel.org/tip/fc5f9d5f151c9fff21d3d1d2907b888a5aec3ff7
> Author:     Baoquan He <bhe@redhat.com>
> AuthorDate: Thu, 4 May 2017 10:25:47 +0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 5 May 2017 08:21:24 +0200
>
> x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
>
> Jeff Moyer reported that on his system with two memory regions 0~64G and
> 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling KASLR
> will make the system hang intermittently during boot. While adding 'nokaslr'
> won't.
>
> The back trace is:
>
>  Oops: 0000 [#1] SMP
>
>  RIP: memcpy_erms()
>  [ .... ]
>  Call Trace:
>   pmem_rw_page()
>   bdev_read_page()
>   do_mpage_readpage()
>   mpage_readpages()
>   blkdev_readpages()
>   __do_page_cache_readahead()
>   force_page_cache_readahead()
>   page_cache_sync_readahead()
>   generic_file_read_iter()
>   blkdev_read_iter()
>   __vfs_read()
>   vfs_read()
>   SyS_read()
>   entry_SYSCALL_64_fastpath()
>
> This crash happens because the for loop count calculation in sync_global_pgds()
> is not correct. When a mapping area crosses PGD entries, we should
> calculate the starting address of region which next PGD covers and assign
> it to next for loop count, but not add PGDIR_SIZE directly. The old
> code works right only if the mapping area is an exact multiple of PGDIR_SIZE,
> otherwise the end region could be skipped so that it can't be synchronized
> to all other processes from kernel PGD init_mm.pgd.
>
> In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than
> PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it
> makes this area be mapped inside one PGD entry. With KASLR enabled,
> this area could cross two PGD entries, then the next PGD entry won't
> be synced to all other processes. That is why we saw empty PGD.
>
> Fix it.
>
> Reported-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Jinbum Park <jinb.park7@gmail.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Garnier <thgarnie@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/mm/init_64.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 745e5e1..97fe887 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -94,10 +94,10 @@ __setup("noexec32=", nonx32_setup);
>   */
>  void sync_global_pgds(unsigned long start, unsigned long end)
>  {
> -       unsigned long address;
> +       unsigned long addr;
>
> -       for (address = start; address <= end; address += PGDIR_SIZE) {
> -               pgd_t *pgd_ref = pgd_offset_k(address);
> +       for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
> +               pgd_t *pgd_ref = pgd_offset_k(addr);
>                 const p4d_t *p4d_ref;
>                 struct page *page;
>
> @@ -106,7 +106,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
>                  * handle synchonization on p4d level.
>                  */
>                 BUILD_BUG_ON(pgd_none(*pgd_ref));
> -               p4d_ref = p4d_offset(pgd_ref, address);
> +               p4d_ref = p4d_offset(pgd_ref, addr);
>
>                 if (p4d_none(*p4d_ref))
>                         continue;
> @@ -117,8 +117,8 @@ void sync_global_pgds(unsigned long start, unsigned long end)
>                         p4d_t *p4d;
>                         spinlock_t *pgt_lock;
>
> -                       pgd = (pgd_t *)page_address(page) + pgd_index(address);
> -                       p4d = p4d_offset(pgd, address);
> +                       pgd = (pgd_t *)page_address(page) + pgd_index(addr);
> +                       p4d = p4d_offset(pgd, addr);
>                         /* the pgt_lock only for Xen */
>                         pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
>                         spin_lock(pgt_lock);


* Re: [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
  2017-06-22  1:26   ` Dan Williams
@ 2017-06-22  7:22     ` Ingo Molnar
  2017-06-27 11:12     ` Greg KH
  1 sibling, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2017-06-22  7:22 UTC (permalink / raw)
  To: Dan Williams, stable kernel team
  Cc: Dave Young, Peter Zijlstra, Thomas Gleixner,
	Linux Kernel Mailing List, H. Peter Anvin, jinb.park7, jpoimboe,
	Yasuaki Ishimatsu, Andrew Morton, Jeff Moyer, yinghai, luto,
	Kirill A. Shutemov, Dave Hansen, Thomas Garnier, Baoquan He,
	dvlasenk, Borislav Petkov, brgerst, Linus Torvalds, Kees Cook,
	stable, linux-tip-commits


* Dan Williams <dan.j.williams@intel.com> wrote:

> [ adding -stable ]
> 
> The patch below is upstream as commit fc5f9d5f151c "x86/mm: Fix boot
> crash caused by incorrect loop count calculation in
> sync_global_pgds()". The referenced bug potentially affects all kaslr
> enabled kernels with > 512GB of memory. Please apply this patch to all
> current -stable kernels.

Yeah, that looks like a fix worth having in -stable.

Thanks,

	Ingo


* Re: [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
  2017-06-22  1:26   ` Dan Williams
  2017-06-22  7:22     ` Ingo Molnar
@ 2017-06-27 11:12     ` Greg KH
  1 sibling, 0 replies; 8+ messages in thread
From: Greg KH @ 2017-06-27 11:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Young, Peter Zijlstra, Thomas Gleixner,
	Linux Kernel Mailing List, H. Peter Anvin, jinb.park7, jpoimboe,
	Ingo Molnar, Yasuaki Ishimatsu, Andrew Morton, Jeff Moyer,
	yinghai, luto, Kirill A. Shutemov, Dave Hansen, Thomas Garnier,
	Baoquan He, dvlasenk, Borislav Petkov, brgerst, Linus Torvalds,
	Kees Cook, stable, linux-tip-commits

On Wed, Jun 21, 2017 at 06:26:59PM -0700, Dan Williams wrote:
> [ adding -stable ]
> 
> The patch below is upstream as commit fc5f9d5f151c "x86/mm: Fix boot
> crash caused by incorrect loop count calculation in
> sync_global_pgds()". The referenced bug potentially affects all kaslr
> enabled kernels with > 512GB of memory. Please apply this patch to all
> current -stable kernels.

Doesn't apply to any stable kernels that I manage, can someone please
provide a working backport if they want to see it applied?

thanks,

greg k-h


Thread overview: 8+ messages
2017-05-04  2:25 [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds Baoquan He
2017-05-04  2:35 ` Dan Williams
2017-05-04 16:25   ` Thomas Garnier
2017-05-08  2:27     ` Baoquan He
2017-05-05  8:11 ` [tip:x86/urgent] x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() tip-bot for Baoquan He
2017-06-22  1:26   ` Dan Williams
2017-06-22  7:22     ` Ingo Molnar
2017-06-27 11:12     ` Greg KH
