* [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
@ 2021-08-19 13:27 Mike Rapoport
  2021-08-19 13:35 ` David Hildenbrand
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Mike Rapoport @ 2021-08-19 13:27 UTC (permalink / raw)
  To: x86
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, David Hildenbrand,
	Ingo Molnar, Jiri Olsa, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Peter Zijlstra, Thomas Gleixner, Borislav Petkov,
	linux-kernel, linux-fsdevel, stable

From: Mike Rapoport <rppt@linux.ibm.com>

Jiri Olsa reported a fault when running:

	# cat /proc/kallsyms | grep ksys_read
	ffffffff8136d580 T ksys_read
	# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore

	/proc/kcore:     file format elf64-x86-64

	Segmentation fault

krava33 login: [   68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
[   68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
[   68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
[   68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
[   68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
[   68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
[   68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
[   68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
[   68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
[   68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
[   68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
[   68.352609] FS:  00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
[   68.354638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
[   68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   68.361597] PKRU: 55555554
[   68.362460] Call Trace:
[   68.363252]  read_kcore+0x57f/0x920
[   68.364289]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.365630]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.366955]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.368277]  ? trace_hardirqs_on+0x1b/0xd0
[   68.369462]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.370793]  ? lock_acquire+0x195/0x2f0
[   68.371920]  ? lock_acquire+0x195/0x2f0
[   68.373035]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.374364]  ? lock_acquire+0x195/0x2f0
[   68.375498]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.376831]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.379883]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.381268]  ? lock_release+0x22b/0x3e0
[   68.382458]  ? _raw_spin_unlock+0x1f/0x30
[   68.383685]  ? __handle_mm_fault+0xcfc/0x15f0
[   68.384994]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.386389]  ? lock_acquire+0x195/0x2f0
[   68.387573]  ? rcu_read_lock_sched_held+0x12/0x80
[   68.388969]  ? lock_release+0x22b/0x3e0
[   68.390145]  proc_reg_read+0x55/0xa0
[   68.391257]  ? vfs_read+0x78/0x1b0
[   68.392336]  vfs_read+0xa7/0x1b0
[   68.393328]  ksys_read+0x68/0xe0
[   68.394308]  do_syscall_64+0x3b/0x90
[   68.395391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   68.396804] RIP: 0033:0x7fcc11cf92e2
[   68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[   68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
[   68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
[   68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
[   68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
[   68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
[   68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
[   68.419591] ---[ end trace e2c30f827226966b ]---
[   68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
[   68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
[   68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
[   68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
[   68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
[   68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
[   68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
[   68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
[   68.436423] FS:  00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
[   68.438354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
[   68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   68.447010] PKRU: 55555554

The fault happens because kern_addr_valid() dereferences an existing but
not present PMD in the high kernel mappings.

Such PMDs are created when free_kernel_image_pages() frees regions larger
than 2MB. In this case, part of the freed memory is mapped with PMDs, and
the set_memory_np_noalias() -> ... -> __change_page_attr() sequence marks
the PMD as not present rather than wiping it completely.

Make kern_addr_valid() check whether higher-level page table entries are
present before trying to dereference them. This fixes the issue and avoids
similar issues in the future.

Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>	# 4.4+
---

v2:
* drop pXd_none() checks and leave only pXd_present(), per David

v1: https://lore.kernel.org/lkml/20210817135854.25407-1-rppt@kernel.org

 arch/x86/mm/init_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba947eb3..879886c6cc53 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
 		return 0;
 
 	p4d = p4d_offset(pgd, addr);
-	if (p4d_none(*p4d))
+	if (!p4d_present(*p4d))
 		return 0;
 
 	pud = pud_offset(p4d, addr);
-	if (pud_none(*pud))
+	if (!pud_present(*pud))
 		return 0;
 
 	if (pud_large(*pud))
 		return pfn_valid(pud_pfn(*pud));
 
 	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return 0;
 
 	if (pmd_large(*pmd))
-- 
2.28.0


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
@ 2021-08-19 13:35 ` David Hildenbrand
  2021-08-19 15:33 ` Jiri Olsa
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2021-08-19 13:35 UTC (permalink / raw)
  To: Mike Rapoport, x86
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ingo Molnar,
	Jiri Olsa, Mike Rapoport, Oscar Salvador, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, linux-kernel, linux-fsdevel,
	stable

On 19.08.21 15:27, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Jiri Olsa reported a fault when running:
> 
> 	# cat /proc/kallsyms | grep ksys_read
> 	ffffffff8136d580 T ksys_read
> 	# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
> 
> 	/proc/kcore:     file format elf64-x86-64
> 
> 	Segmentation fault
> 
> The fault happens because kern_addr_valid() dereferences an existing but
> not present PMD in the high kernel mappings.
> 
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2MB. In this case, part of the freed memory is mapped with PMDs, and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence marks
> the PMD as not present rather than wiping it completely.
> 
> Make kern_addr_valid() check whether higher-level page table entries are
> present before trying to dereference them. This fixes the issue and avoids
> similar issues in the future.
> 
> Reported-by: Jiri Olsa <jolsa@redhat.com>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: <stable@vger.kernel.org>	# 4.4+
> ---
> 
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
> 
> v1: https://lore.kernel.org/lkml/20210817135854.25407-1-rppt@kernel.org
> 
>   arch/x86/mm/init_64.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
>   		return 0;
>   
>   	p4d = p4d_offset(pgd, addr);
> -	if (p4d_none(*p4d))
> +	if (!p4d_present(*p4d))
>   		return 0;
>   
>   	pud = pud_offset(p4d, addr);
> -	if (pud_none(*pud))
> +	if (!pud_present(*pud))
>   		return 0;
>   
>   	if (pud_large(*pud))
>   		return pfn_valid(pud_pfn(*pud));
>   
>   	pmd = pmd_offset(pud, addr);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>   		return 0;
>   
>   	if (pmd_large(*pmd))
> 

Hopefully we won't have other similar BUGs in the code because we leave 
fake swap entries lying around in the direct map.

Thanks!

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
  2021-08-19 13:35 ` David Hildenbrand
@ 2021-08-19 15:33 ` Jiri Olsa
  2021-08-25 18:47 ` Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Jiri Olsa @ 2021-08-19 15:33 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: x86, Andrew Morton, Andy Lutomirski, Dave Hansen,
	David Hildenbrand, Ingo Molnar, Mike Rapoport, Oscar Salvador,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, linux-kernel,
	linux-fsdevel, stable

On Thu, Aug 19, 2021 at 04:27:17PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Jiri Olsa reported a fault when running:
> 
> 	# cat /proc/kallsyms | grep ksys_read
> 	ffffffff8136d580 T ksys_read
> 	# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
> 
> 	/proc/kcore:     file format elf64-x86-64
> 
> 	Segmentation fault
> 
> The fault happens because kern_addr_valid() dereferences an existing but
> not present PMD in the high kernel mappings.
> 
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2MB. In this case, part of the freed memory is mapped with PMDs, and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence marks
> the PMD as not present rather than wiping it completely.
> 
> Make kern_addr_valid() check whether higher-level page table entries are
> present before trying to dereference them. This fixes the issue and avoids
> similar issues in the future.
> 
> Reported-by: Jiri Olsa <jolsa@redhat.com>

Tested-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: <stable@vger.kernel.org>	# 4.4+
> ---
> 
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
> 
> v1: https://lore.kernel.org/lkml/20210817135854.25407-1-rppt@kernel.org
> 
>  arch/x86/mm/init_64.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
>  		return 0;
>  
>  	p4d = p4d_offset(pgd, addr);
> -	if (p4d_none(*p4d))
> +	if (!p4d_present(*p4d))
>  		return 0;
>  
>  	pud = pud_offset(p4d, addr);
> -	if (pud_none(*pud))
> +	if (!pud_present(*pud))
>  		return 0;
>  
>  	if (pud_large(*pud))
>  		return pfn_valid(pud_pfn(*pud));
>  
>  	pmd = pmd_offset(pud, addr);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return 0;
>  
>  	if (pmd_large(*pmd))
> -- 
> 2.28.0
> 


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
  2021-08-19 13:35 ` David Hildenbrand
  2021-08-19 15:33 ` Jiri Olsa
@ 2021-08-25 18:47 ` Dave Hansen
  2021-09-08 10:35   ` Borislav Petkov
  2021-09-02  8:51 ` Mike Rapoport
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2021-08-25 18:47 UTC (permalink / raw)
  To: Mike Rapoport, x86
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, David Hildenbrand,
	Ingo Molnar, Jiri Olsa, Mike Rapoport, Oscar Salvador,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, linux-kernel,
	linux-fsdevel, stable

On 8/19/21 6:27 AM, Mike Rapoport wrote:
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2MB. In this case, part of the freed memory is mapped with PMDs, and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence marks
> the PMD as not present rather than wiping it completely.
> 
> Make kern_addr_valid() check whether higher-level page table entries are
> present before trying to dereference them. This fixes the issue and avoids
> similar issues in the future.
> 
> Reported-by: Jiri Olsa <jolsa@redhat.com>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: <stable@vger.kernel.org>	# 4.4...
>  	pmd = pmd_offset(pud, addr);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return 0;

Yeah, that seems like the right fix.  The one kern_addr_valid() user is
going to touch the memory so it *better* be present.  p*d_none() was
definitely the wrong check.

Acked-by: Dave Hansen <dave.hansen@intel.com>

* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
                   ` (2 preceding siblings ...)
  2021-08-25 18:47 ` Dave Hansen
@ 2021-09-02  8:51 ` Mike Rapoport
  2021-09-08  9:13 ` Mike Rapoport
  2021-09-08 19:03 ` [tip: x86/urgent] x86/mm: Fix kern_addr_valid() " tip-bot2 for Mike Rapoport
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Rapoport @ 2021-09-02  8:51 UTC (permalink / raw)
  To: x86
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, David Hildenbrand,
	Ingo Molnar, Jiri Olsa, Mike Rapoport, Oscar Salvador,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, linux-kernel,
	linux-fsdevel, stable

Any updates on this?

On Thu, Aug 19, 2021 at 04:27:17PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Jiri Olsa reported a fault when running:
> 
> 	# cat /proc/kallsyms | grep ksys_read
> 	ffffffff8136d580 T ksys_read
> 	# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
> 
> 	/proc/kcore:     file format elf64-x86-64
> 
> 	Segmentation fault
> 
> The fault happens because kern_addr_valid() dereferences an existing but
> not present PMD in the high kernel mappings.
> 
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2MB. In this case, part of the freed memory is mapped with PMDs, and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence marks
> the PMD as not present rather than wiping it completely.
> 
> Make kern_addr_valid() check whether higher-level page table entries are
> present before trying to dereference them. This fixes the issue and avoids
> similar issues in the future.
> 
> Reported-by: Jiri Olsa <jolsa@redhat.com>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: <stable@vger.kernel.org>	# 4.4+
> ---
> 
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
> 
> v1: https://lore.kernel.org/lkml/20210817135854.25407-1-rppt@kernel.org
> 
>  arch/x86/mm/init_64.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
>  		return 0;
>  
>  	p4d = p4d_offset(pgd, addr);
> -	if (p4d_none(*p4d))
> +	if (!p4d_present(*p4d))
>  		return 0;
>  
>  	pud = pud_offset(p4d, addr);
> -	if (pud_none(*pud))
> +	if (!pud_present(*pud))
>  		return 0;
>  
>  	if (pud_large(*pud))
>  		return pfn_valid(pud_pfn(*pud));
>  
>  	pmd = pmd_offset(pud, addr);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return 0;
>  
>  	if (pmd_large(*pmd))
> -- 
> 2.28.0
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
                   ` (3 preceding siblings ...)
  2021-09-02  8:51 ` Mike Rapoport
@ 2021-09-08  9:13 ` Mike Rapoport
  2021-09-08 19:03 ` [tip: x86/urgent] x86/mm: Fix kern_addr_valid() " tip-bot2 for Mike Rapoport
  5 siblings, 0 replies; 11+ messages in thread
From: Mike Rapoport @ 2021-09-08  9:13 UTC (permalink / raw)
  To: x86
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, David Hildenbrand,
	Ingo Molnar, Jiri Olsa, Mike Rapoport, Oscar Salvador,
	Peter Zijlstra, Thomas Gleixner, Borislav Petkov, linux-kernel,
	linux-fsdevel, stable

Ping?

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-08-25 18:47 ` Dave Hansen
@ 2021-09-08 10:35   ` Borislav Petkov
  2021-09-08 10:52     ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2021-09-08 10:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, x86, Andrew Morton, Andy Lutomirski, Dave Hansen,
	David Hildenbrand, Ingo Molnar, Jiri Olsa, Mike Rapoport,
	Oscar Salvador, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-fsdevel, stable

On Wed, Aug 25, 2021 at 11:47:10AM -0700, Dave Hansen wrote:
> On 8/19/21 6:27 AM, Mike Rapoport wrote:
> > Such PMDs are created when free_kernel_image_pages() frees regions larger
> > than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> > the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> > mark the PMD as not present rather than wipe it completely.
> > 
> > Make kern_addr_valid() to check whether higher level page table entries are
> > present before trying to dereference them to fix this issue and to avoid
> > similar issues in the future.
> > 
> > Reported-by: Jiri Olsa <jolsa@redhat.com>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: <stable@vger.kernel.org>	# 4.4...
> >  	pmd = pmd_offset(pud, addr);
> > -	if (pmd_none(*pmd))
> > +	if (!pmd_present(*pmd))
> >  		return 0;
> 
> Yeah, that seems like the right fix.  The one kern_addr_valid() user is
> going to touch the memory so it *better* be present.  p*d_none() was
> definitely the wrong check.
> 
> Acked-by: Dave Hansen <dave.hansen@intel.com>

So I did stare at this for a while, trying to make sense of it and David
Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
git archeology I think it should be:

c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")

because that thing added the clearing of the Present bit for the high
kernel image mapping of those areas.

Right?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-09-08 10:35   ` Borislav Petkov
@ 2021-09-08 10:52     ` Borislav Petkov
  2021-09-08 11:22       ` Mike Rapoport
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2021-09-08 10:52 UTC (permalink / raw)
  To: Dave Hansen, Mike Rapoport
  Cc: x86, Andrew Morton, Andy Lutomirski, Dave Hansen,
	David Hildenbrand, Ingo Molnar, Jiri Olsa, Mike Rapoport,
	Oscar Salvador, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-fsdevel, stable

On Wed, Sep 08, 2021 at 12:35:21PM +0200, Borislav Petkov wrote:
> So I did stare at this for a while, trying to make sense of it and David
> Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
> git archeology I think it should be:
> 
> c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")
> 
> because that thing added the clearing of the Present bit for the high
> kernel image mapping of those areas.
> 
> Right?

Hmm, but that commit is in v4.19. Mike has added

Cc: <stable@vger.kernel.org>    # 4.4+

Mike, why 4.4 and newer?

Hmmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-09-08 10:52     ` Borislav Petkov
@ 2021-09-08 11:22       ` Mike Rapoport
  2021-09-08 11:34         ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Rapoport @ 2021-09-08 11:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, x86, Andrew Morton, Andy Lutomirski, Dave Hansen,
	David Hildenbrand, Ingo Molnar, Jiri Olsa, Mike Rapoport,
	Oscar Salvador, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-fsdevel, stable

On Wed, Sep 08, 2021 at 12:52:45PM +0200, Borislav Petkov wrote:
> On Wed, Sep 08, 2021 at 12:35:21PM +0200, Borislav Petkov wrote:
> > So I did stare at this for a while, trying to make sense of it and David
> > Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
> > git archeology I think it should be:
> > 
> > c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")
> > 
> > because that thing added the clearing of the Present bit for the high
> > kernel image mapping of those areas.
> > 
> > Right?

Yes, in a sense.
Since the only user of kern_addr_valid() is kcore, and it only uses this
check for high kernel mappings, there should be no problem before 4.19.

But...


> Hmm, but that commit is in v4.19. Mike has added
> 
> Cc: <stable@vger.kernel.org>    # 4.4+
> 
> Mike, why 4.4 and newer?

kern_addr_valid() wrongly uses pxy_none() rather than pxy_present():
according to 9a14aefc1d28 ("x86: cpa, fix lookup_address"), there can be
cases where page table entries exist but are not valid.
So a call to kern_addr_valid() for an address in the direct map would oops.

I've stopped digging at 9a14aefc1d28 (which is in v2.6.26) and added the
oldest stable we still support (4.4).

I agree that before 4.19 it's more of a theoretical bug, but you know,
things happen...
 
> Hmmm.

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries
  2021-09-08 11:22       ` Mike Rapoport
@ 2021-09-08 11:34         ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2021-09-08 11:34 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, x86, Andrew Morton, Andy Lutomirski, Dave Hansen,
	David Hildenbrand, Ingo Molnar, Jiri Olsa, Mike Rapoport,
	Oscar Salvador, Peter Zijlstra, Thomas Gleixner, linux-kernel,
	linux-fsdevel, stable

On Wed, Sep 08, 2021 at 02:22:31PM +0300, Mike Rapoport wrote:
> kern_addr_valid() wrongly uses pxy_none() rather than pxy_present():
> according to 9a14aefc1d28 ("x86: cpa, fix lookup_address"), there can be
> cases where page table entries exist but are not valid.
> So a call to kern_addr_valid() for an address in the direct map would oops.
> 
> I've stopped digging at 9a14aefc1d28 (which is in v2.6.26) and added the
> oldest stable we still support (4.4).
> 
> I agree that before 4.19 it's more of a theoretical bug, but you know,
> things happen...

Hmmkay, I guess I should add the gist of that to the commit message so
that it is explained why 4.4.

I'm assuming the pxy_present() check is stricter than pxy_none() so
that backporting to all stable kernels should not introduce any risks...
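
That assumption can be checked on a simplified model. What follows is a
minimal userspace sketch, not the kernel's code: entry_none() and
entry_present() are hypothetical stand-ins for the pXY_none() and
pXY_present() helpers, and the bit layout (Present in bit 0, Accessed and
Dirty in bits 5 and 6, with Accessed/Dirty ignored by the "none" test) is an
assumption modelled loosely on x86. The real helpers may look at more bits.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical flag bits; positions follow the x86 layout. */
#define ENTRY_PRESENT   (UINT64_C(1) << 0)
#define ENTRY_ACCESSED  (UINT64_C(1) << 5)
#define ENTRY_DIRTY     (UINT64_C(1) << 6)

/* Old test: the entry is "none" when nothing but Accessed/Dirty is set. */
static int entry_none(uint64_t val)
{
	return (val & ~(ENTRY_ACCESSED | ENTRY_DIRTY)) == 0;
}

/* New test: the entry may be followed only when the Present bit is set. */
static int entry_present(uint64_t val)
{
	return (val & ENTRY_PRESENT) != 0;
}

/*
 * Returns 1 if, for every combination of the low 12 flag bits, an entry
 * accepted by the new check (present) is also accepted by the old one
 * (not none). A counterexample would be an entry that is present and none
 * at the same time.
 */
static int present_is_stricter(void)
{
	uint64_t val;

	for (val = 0; val < 4096; val++) {
		if (entry_present(val) && entry_none(val))
			return 0;
	}
	return 1;
}

/* Small usage example for the helpers above. */
static void strictness_self_check(void)
{
	assert(present_is_stricter());
	/* Exists (non-zero flags) but not present: old check passes it, new one does not. */
	assert(!entry_none(0x62));
	assert(!entry_present(0x62));
}
```

In this model every entry the new check lets through was also let through
by the old one, so the change can only turn a "valid" answer into "invalid",
never the other way round.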

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [tip: x86/urgent] x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
  2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
                   ` (4 preceding siblings ...)
  2021-09-08  9:13 ` Mike Rapoport
@ 2021-09-08 19:03 ` tip-bot2 for Mike Rapoport
  5 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Mike Rapoport @ 2021-09-08 19:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jiri Olsa, Mike Rapoport, Borislav Petkov, David Hildenbrand,
	Dave Hansen, stable, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     34b1999da935a33be6239226bfa6cd4f704c5c88
Gitweb:        https://git.kernel.org/tip/34b1999da935a33be6239226bfa6cd4f704c5c88
Author:        Mike Rapoport <rppt@linux.ibm.com>
AuthorDate:    Thu, 19 Aug 2021 16:27:17 +03:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 08 Sep 2021 20:50:32 +02:00

x86/mm: Fix kern_addr_valid() to cope with existing but not present entries

Jiri Olsa reported a fault when running:

  # cat /proc/kallsyms | grep ksys_read
  ffffffff8136d580 T ksys_read
  # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore

  /proc/kcore:     file format elf64-x86-64

  Segmentation fault

  general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
  CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
  RIP: 0010:kern_addr_valid
  Call Trace:
   read_kcore
   ? rcu_read_lock_sched_held
   ? rcu_read_lock_sched_held
   ? rcu_read_lock_sched_held
   ? trace_hardirqs_on
   ? rcu_read_lock_sched_held
   ? lock_acquire
   ? lock_acquire
   ? rcu_read_lock_sched_held
   ? lock_acquire
   ? rcu_read_lock_sched_held
   ? rcu_read_lock_sched_held
   ? rcu_read_lock_sched_held
   ? lock_release
   ? _raw_spin_unlock
   ? __handle_mm_fault
   ? rcu_read_lock_sched_held
   ? lock_acquire
   ? rcu_read_lock_sched_held
   ? lock_release
   proc_reg_read
   ? vfs_read
   vfs_read
   ksys_read
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

The fault happens because kern_addr_valid() dereferences an existent but
not present PMD in the high kernel mappings.
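
The non-canonical address in the splat can be reproduced from the register
dump in the original report. A minimal sketch, assuming that RAX
(ffff888000000000) held the direct map base and RCX (000ffffffcbff000) the
entry value with its flag bits masked off; both readings are assumptions
about the dump, and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the register dump in the report. */
#define OOPS_RAX UINT64_C(0xffff888000000000) /* assumed: direct map base */
#define OOPS_RCX UINT64_C(0x000ffffffcbff000) /* assumed: masked entry value */

/* Address formed if the entry is treated as pointing at a lower-level table. */
static uint64_t bogus_table_address(uint64_t base, uint64_t masked_entry)
{
	return base + masked_entry; /* unsigned addition wraps modulo 2^64 */
}

/* With 48-bit virtual addresses, bits 63..47 must all be equal. */
static int is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47; /* the 17 uppermost bits */

	return top == 0 || top == 0x1ffff;
}

/* Small usage example: the sum matches the address reported in the splat. */
static void oops_address_self_check(void)
{
	uint64_t addr = bogus_table_address(OOPS_RAX, OOPS_RCX);

	assert(addr == UINT64_C(0x000f887ffcbff000));
	assert(!is_canonical_48(addr));
}
```

The sum wraps around to 0x000f887ffcbff000, which is the address in the
"general protection fault" line and is not canonical, hence a #GP rather
than a page fault.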

Such PMDs are created when free_kernel_image_pages() frees regions larger
than 2MB. In this case, a part of the freed memory is mapped with PMDs and
the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
mark the PMD as not present rather than wipe it completely.

Have kern_addr_valid() check whether higher level page table entries are
present before trying to dereference them to fix this issue and to avoid
similar issues in the future.

Stable backporting note:
------------------------

Note that the stable marking is for all active stable branches because
there could be cases where pagetable entries exist but are not valid -
see 9a14aefc1d28 ("x86: cpa, fix lookup_address"), for example. So make
sure to be on the safe side here and use pXY_present() accessors rather
than pXY_none() which could #GP when accessing pages in the direct map.

Also see:

  c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")

for more info.

Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: <stable@vger.kernel.org>	# 4.4+
Link: https://lkml.kernel.org/r/20210819132717.19358-1-rppt@kernel.org
---
 arch/x86/mm/init_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba9..879886c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
 		return 0;
 
 	p4d = p4d_offset(pgd, addr);
-	if (p4d_none(*p4d))
+	if (!p4d_present(*p4d))
 		return 0;
 
 	pud = pud_offset(p4d, addr);
-	if (pud_none(*pud))
+	if (!pud_present(*pud))
 		return 0;
 
 	if (pud_large(*pud))
 		return pfn_valid(pud_pfn(*pud));
 
 	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return 0;
 
 	if (pmd_large(*pmd))
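
The patched checks share one pattern: test the Present bit at each level
before using the entry to locate the next one. A toy userspace model of that
pattern, assuming a hypothetical two-level table whose upper entries store a
lower-table index instead of a physical address; none of these names exist
in the kernel:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PRESENT  UINT64_C(1)  /* bit 0, as on x86 */
#define TOY_ENTRIES  4            /* hypothetical table size */

/* An upper entry stores (lower table index << 12) | flags. */
struct toy_tables {
	uint64_t upper[TOY_ENTRIES];
	uint64_t lower[TOY_ENTRIES][TOY_ENTRIES];
};

/* Returns 1 only if both levels are present for the (hi, lo) index pair. */
static int toy_addr_valid(const struct toy_tables *t,
			  unsigned int hi, unsigned int lo)
{
	uint64_t up, idx;

	if (hi >= TOY_ENTRIES || lo >= TOY_ENTRIES)
		return 0;

	up = t->upper[hi];
	if (!(up & TOY_PRESENT))
		return 0;	/* not present: the stored index is not followed */

	idx = up >> 12;
	if (idx >= TOY_ENTRIES)
		return 0;	/* toy-only guard against a malformed entry */

	return (t->lower[idx][lo] & TOY_PRESENT) != 0;
}

/* Small usage example: one mapped slot, one existing but not present entry. */
static void toy_walk_self_check(void)
{
	struct toy_tables t = { {0}, {{0}} };

	t.upper[0] = (UINT64_C(1) << 12) | TOY_PRESENT;
	t.lower[1][2] = TOY_PRESENT;
	t.upper[1] = (UINT64_C(1) << 12) | 0x62; /* non-zero, Present clear */

	assert(toy_addr_valid(&t, 0, 2));   /* both levels present */
	assert(!toy_addr_valid(&t, 1, 2));  /* upper entry exists, not present */
	assert(!toy_addr_valid(&t, 2, 0));  /* upper entry empty */
}
```

A check that only rejected all-zero entries would have followed upper[1]
here, which is the toy equivalent of the bug fixed above.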


Thread overview: 11+ messages
2021-08-19 13:27 [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries Mike Rapoport
2021-08-19 13:35 ` David Hildenbrand
2021-08-19 15:33 ` Jiri Olsa
2021-08-25 18:47 ` Dave Hansen
2021-09-08 10:35   ` Borislav Petkov
2021-09-08 10:52     ` Borislav Petkov
2021-09-08 11:22       ` Mike Rapoport
2021-09-08 11:34         ` Borislav Petkov
2021-09-02  8:51 ` Mike Rapoport
2021-09-08  9:13 ` Mike Rapoport
2021-09-08 19:03 ` [tip: x86/urgent] x86/mm: Fix kern_addr_valid() " tip-bot2 for Mike Rapoport
