[RFC] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero
diff mbox series

Message ID 1514460261-65222-1-git-send-email-guohanjun@huawei.com
State New, archived
Headers show
Series
  • [RFC] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero
Related show

Commit Message

Hanjun Guo Dec. 28, 2017, 11:24 a.m. UTC
From: Hanjun Guo <hanjun.guo@linaro.org>

When we using iounmap() to free the 4K mapping, it just clear the PTEs
but leave P4D/PUD/PMD unchanged, also will not free the memory of page
tables.

This will cause issues on ARM64 platform (not sure if other archs have
the same issue) for this case:

1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
   then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
   which will lead to kernel panic.

Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
zero.

Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Xuefeng Wang <wxf.wang@hisilicon.com>
---

Not sure if this is the right direction, this patch has a obvious
side effect that a mapped address with 4K will not back to 2M.  I may
miss something and just wrong, so this is just a RFC version, comments
are welcomed.

 lib/ioremap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Hanjun Guo Dec. 29, 2017, 8 a.m. UTC | #1
oops, the title of this patch is wrong, should be:
ioremap: skip setting up huge I/O mappings when p4d/pud/pmd is zero

On 2017/12/28 19:24, Hanjun Guo wrote:
> From: Hanjun Guo <hanjun.guo@linaro.org>
> 
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
> 
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.
> 
> Reported-by: Lei Li <lious.lilei@hisilicon.com>
> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
> Cc: Toshi Kani <toshi.kani@hpe.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Xuefeng Wang <wxf.wang@hisilicon.com>
> ---
> 
> Not sure if this is the right direction, this patch has a obvious
> side effect that a mapped address with 4K will not back to 2M.  I may
> miss something and just wrong, so this is just a RFC version, comments
> are welcomed.
> 
>  lib/ioremap.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index b808a39..4e6f19a 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -89,7 +89,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>  	do {
>  		next = pmd_addr_end(addr, end);
>  
> -		if (ioremap_pmd_enabled() &&
> +		if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
>  		    ((next - addr) == PMD_SIZE) &&
>  		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
>  			if (pmd_set_huge(pmd, phys_addr + addr, prot))
> @@ -115,7 +115,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
>  	do {
>  		next = pud_addr_end(addr, end);
>  
> -		if (ioremap_pud_enabled() &&
> +		if (ioremap_pud_enabled() && pud_none(*pud) &&
>  		    ((next - addr) == PUD_SIZE) &&
>  		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
>  			if (pud_set_huge(pud, phys_addr + addr, prot))
> @@ -141,7 +141,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
>  	do {
>  		next = p4d_addr_end(addr, end);
>  
> -		if (ioremap_p4d_enabled() &&
> +		if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
>  		    ((next - addr) == P4D_SIZE) &&
>  		    IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
>  			if (p4d_set_huge(p4d, phys_addr + addr, prot))
>
Kani, Toshi Jan. 5, 2018, 10:15 p.m. UTC | #2
On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
> From: Hanjun Guo <hanjun.guo@linaro.org>
> 
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
> 
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.

Hi Hanjun,

I tested the above steps on my x86 box, but was not able to reproduce
your kernel panic.  On x86, a 4K vaddr gets allocated from a small
fragmented free range, whereas a 2MB vaddr is from a larger free range. 
Their addrs have different alignments (4KB & 2MB) as well.  So, the
steps did not lead to use a same pmd entry.

However, I agree that zero'd pte entries will be leaked when a pmd map
is set if they are present under the pmd.

I also tested your patch on my x86 box.  Unfortunately, it effectively
disabled 2MB mappings.  While a 2MB vaddr gets allocated from a larger
free range, it sill comes from a free range covered by zero'd pte
entries.  So, it ends up with 4KB mappings with your changes.

I think we need to come up with other approach.
Thanks,
-Toshi
Hanjun Guo Jan. 6, 2018, 9:46 a.m. UTC | #3
On 2018/1/6 6:15, Kani, Toshi wrote:
> On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
>> From: Hanjun Guo <hanjun.guo@linaro.org>
>>
>> When we using iounmap() to free the 4K mapping, it just clear the PTEs
>> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
>> tables.
>>
>> This will cause issues on ARM64 platform (not sure if other archs have
>> the same issue) for this case:
>>
>> 1. ioremap a 4K size, valid page table will build,
>> 2. iounmap it, pte0 will set to 0;
>> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>>    then set the a new value for pmd;
>> 4. pte0 is leaked;
>> 5. CPU may meet exception because the old pmd is still in TLB,
>>    which will lead to kernel panic.
>>
>> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
>> zero.
> 
> Hi Hanjun,
> 
> I tested the above steps on my x86 box, but was not able to reproduce
> your kernel panic.  On x86, a 4K vaddr gets allocated from a small
> fragmented free range, whereas a 2MB vaddr is from a larger free range. 
> Their addrs have different alignments (4KB & 2MB) as well.  So, the
> steps did not lead to use a same pmd entry.

Thanks for the testing, I can only reproduce this on my ARM64 platform
which the CPU will cache the PMD in TLB, from my knowledge, only Cortex-A75
will do this, so ARM64 platforms which are not A75 based can't be reproduced
either.

Catalin, Will, I can reproduce this issue in about 3 minutes with following
simplified test case [1], and can trigger panic as [2], could you take a look
as well?

> 
> However, I agree that zero'd pte entries will be leaked when a pmd map
> is set if they are present under the pmd.

Thanks for the confirm.

> 
> I also tested your patch on my x86 box.  Unfortunately, it effectively
> disabled 2MB mappings.  While a 2MB vaddr gets allocated from a larger
> free range, it sill comes from a free range covered by zero'd pte
> entries.  So, it ends up with 4KB mappings with your changes.
> 
> I think we need to come up with other approach.

Yes, As I said in my patch, this is just RFC, comments are welcomed :)

[1]:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/random.h>
#include <asm/io.h>


void pcie_io_remap_test(u32 times_ms)
{
	u64 phy_addr = 0xd0000000;
	unsigned long timeout = jiffies + msecs_to_jiffies(times_ms);
	int rand_type;
	u32 mem_size;
	void *vir_addr;

	do {
		get_random_bytes(&rand_type, sizeof(u32));

		rand_type %= 6;
		switch (rand_type) {
		case 0:
			mem_size = 0x1000;
			break;
		case 1:
			mem_size = 0x4000;
			break;
		case 2:
			mem_size = 0x200000;
			break;
		case 3:
			mem_size = 0x300000;
			break;
		case 4:
			mem_size = 0x400;
			break;
		case 5:
			mem_size = 0x400000;
			break;
		default:
			mem_size = 0x400;
			break;
		}

		vir_addr = ioremap(phy_addr, mem_size);
		readl(vir_addr);

		iounmap(vir_addr);
		schedule();

	} while (time_before(jiffies, timeout));
}

[2]:

[ 2062.300139] Unable to handle kernel paging request at virtual address ffff000018600000
[ 2062.300139] Unable to handle kernel paging request at virtual address ffff000018600000
[ 2062.327051] Mem abort info:
[ 2062.327051] Mem abort info:
[ 2062.337134]   Exception class = DABT (current EL), IL = 32 bits
[ 2062.337134]   Exception class = DABT (current EL), IL = 32 bits
[ 2062.354614]   SET = 0, FnV = 0
[ 2062.354614]   SET = 0, FnV = 0
[ 2062.363818]   EA = 0, S1PTW = 0
[ 2062.363818]   EA = 0, S1PTW = 0
[ 2062.373131] Data abort info:
[ 2062.373131] Data abort info:
[ 2062.381671]   ISV = 0, ISS = 0x00000007
[ 2062.381671]   ISV = 0, ISS = 0x00000007
[ 2062.393099]   CM = 0, WnR = 0
[ 2062.393099]   CM = 0, WnR = 0
[ 2062.402155] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff000009cf6000
[ 2062.402155] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff000009cf6000
[ 2062.421477] [ffff000018600000] *pgd=00000021ffffe003, *pud=00000021ffffd003, *pmd=00e80000d0000705
[ 2062.421477] [ffff000018600000] *pgd=00000021ffffe003, *pud=00000021ffffd003, *pmd=00e80000d0000705
[ 2062.447913] Internal error: Oops: 96000007 [#1] SMP
[ 2062.447913] Internal error: Oops: 96000007 [#1] SMP
[ 2062.540141] CPU: 1 PID: 2149 Comm: unibsp.out Tainted: P           OEL  4.14.0 #1
[ 2062.540141] CPU: 1 PID: 2149 Comm: unibsp.out Tainted: P           OEL  4.14.0 #1
[ 2062.560853] task: ffff8021e6f8a100 task.stack: ffff000010aa0000
[ 2062.560853] task: ffff8021e6f8a100 task.stack: ffff000010aa0000
[ 2062.580070] PC is at pcie_io_remap_test+0x9c/0x228
[ 2062.580070] PC is at pcie_io_remap_test+0x9c/0x228
[ 2062.597307] LR is at pcie_io_remap_test+0x9c/0x228
[ 2062.597307] LR is at pcie_io_remap_test+0x9c/0x228
[ 2062.613720] pc : [<ffff0000010a5e74>] lr : [<ffff0000010a5e74>] pstate: 60400149
[ 2062.613720] pc : [<ffff0000010a5e74>] lr : [<ffff0000010a5e74>] pstate: 60400149
[ 2062.634012] sp : ffff000010aa3bb0
[ 2062.634012] sp : ffff000010aa3bb0
[ 2062.643202] x29: ffff000010aa3bb0 x28: ffff8021e6f8a100
[ 2062.643202] x29: ffff000010aa3bb0 x28: ffff8021e6f8a100
[ 2062.658122] x27: ffff000008c91000 x26: 000000000000001d
[ 2062.658122] x27: ffff000008c91000 x26: 000000000000001d
[ 2062.672948] x25: 0000000000000124 x24: 0000000000000003
[ 2062.672948] x25: 0000000000000124 x24: 0000000000000003
[ 2062.687841] x23: ffff0000093a9c88 x22: ffff0000010a88f0
[ 2062.687841] x23: ffff0000093a9c88 x22: ffff0000010a88f0
[ 2062.702658] x21: 000000002aaaaaab x20: ffff0000093a6a80
[ 2062.702658] x21: 000000002aaaaaab x20: ffff0000093a6a80
[ 2062.717535] x19: 0000000100085de7 x18: 0000000000000040
[ 2062.717535] x19: 0000000100085de7 x18: 0000000000000040
[ 2062.732292] x17: 0000000098b401b6 x16: 00000000af415618
[ 2062.732292] x17: 0000000098b401b6 x16: 00000000af415618
[ 2062.747048] x15: 000000000eed33f7 x14: 0140000000000000
[ 2062.747048] x15: 000000000eed33f7 x14: 0140000000000000
[ 2062.761863] x13: ffff0000093bd000 x12: 0000000000000000
[ 2062.761863] x13: ffff0000093bd000 x12: 0000000000000000
[ 2062.776589] x11: 0400000000000001 x10: 0000000000000001
[ 2062.776589] x11: 0400000000000001 x10: 0000000000000001
[ 2062.791482] x9 : 0040000000000001 x8 : ffff008018600000
[ 2062.791482] x9 : 0040000000000001 x8 : ffff008018600000
[ 2062.806166] x7 : ffff8021ffffd620 x6 : ffff000058600000
[ 2062.806166] x7 : ffff8021ffffd620 x6 : ffff000058600000
[ 2062.820842] x5 : ffff000018800000 x4 : 0000000000000000
[ 2062.820842] x5 : ffff000018800000 x4 : 0000000000000000
[ 2062.835538] x3 : 00e8000000000707 x2 : 00e8000000000707
[ 2062.835538] x3 : 00e8000000000707 x2 : 00e8000000000707
[ 2062.850273] x1 : 00000000d0000000 x0 : ffff000018600000
[ 2062.850273] x1 : 00000000d0000000 x0 : ffff000018600000
[ 2062.865258] Process unibsp.out (pid: 2149, stack limit = 0xffff000010aa0000)
[ 2062.865258] Process unibsp.out (pid: 2149, stack limit = 0xffff000010aa0000)
[ 2062.884578] Call trace:
[ 2062.884578] Call trace:
[ 2062.891581] Exception stack(0xffff000010aa3a70 to 0xffff000010aa3bb0)
[ 2062.891581] Exception stack(0xffff000010aa3a70 to 0xffff000010aa3bb0)
[ 2062.909389] 3a60:                                   ffff000018600000 00000000d0000000
[ 2062.909389] 3a60:                                   ffff000018600000 00000000d0000000
[ 2062.930931] 3a80: 00e8000000000707 00e8000000000707 0000000000000000 ffff000018800000
[ 2062.930931] 3a80: 00e8000000000707 00e8000000000707 0000000000000000 ffff000018800000
[ 2062.952554] 3aa0: ffff000058600000 ffff8021ffffd620 ffff008018600000 0040000000000001
[ 2062.952554] 3aa0: ffff000058600000 ffff8021ffffd620 ffff008018600000 0040000000000001
[ 2062.974189] 3ac0: 0000000000000001 0400000000000001 0000000000000000 ffff0000093bd000
[ 2062.974189] 3ac0: 0000000000000001 0400000000000001 0000000000000000 ffff0000093bd000
[ 2062.995645] 3ae0: 0140000000000000 000000000eed33f7 00000000af415618 0000000098b401b6
[ 2062.995645] 3ae0: 0140000000000000 000000000eed33f7 00000000af415618 0000000098b401b6
[ 2063.017204] 3b00: 0000000000000040 0000000100085de7 ffff0000093a6a80 000000002aaaaaab
[ 2063.017204] 3b00: 0000000000000040 0000000100085de7 ffff0000093a6a80 000000002aaaaaab
[ 2063.038889] 3b20: ffff0000010a88f0 ffff0000093a9c88 0000000000000003 0000000000000124
[ 2063.038889] 3b20: ffff0000010a88f0 ffff0000093a9c88 0000000000000003 0000000000000124
[ 2063.060433] 3b40: 000000000000001d ffff000008c91000 ffff8021e6f8a100 ffff000010aa3bb0
[ 2063.060433] 3b40: 000000000000001d ffff000008c91000 ffff8021e6f8a100 ffff000010aa3bb0
[ 2063.081744] 3b60: ffff0000010a5e74 ffff000010aa3bb0 ffff0000010a5e74 0000000060400149
[ 2063.081744] 3b60: ffff0000010a5e74 ffff000010aa3bb0 ffff0000010a5e74 0000000060400149
[ 2063.102974] 3b80: ffff000010aa3bb0 ffff0000010a5e74 0001000000000000 ffff0000093a6a80
[ 2063.102974] 3b80: ffff000010aa3bb0 ffff0000010a5e74 0001000000000000 ffff0000093a6a80
[ 2063.124379] 3ba0: ffff000010aa3bb0 ffff0000010a5e74
[ 2063.124379] 3ba0: ffff000010aa3bb0 ffff0000010a5e74
[ 2063.138959] [<ffff0000010a5e74>] pcie_io_remap_test+0x9c/0x228
[ 2063.138959] [<ffff0000010a5e74>] pcie_io_remap_test+0x9c/0x228
[ 2063.159168] [<ffff000000d38700>] MCSS_ioctl+0x118/0x218 [unibsp]
[ 2063.159168] [<ffff000000d38700>] MCSS_ioctl+0x118/0x218 [unibsp]
[ 2063.176220] [<ffff0000082c16bc>] do_vfs_ioctl+0xc4/0x7a4
[ 2063.176220] [<ffff0000082c16bc>] do_vfs_ioctl+0xc4/0x7a4
[ 2063.190966] [<ffff0000082c1e2c>] SyS_ioctl+0x90/0xa4
[ 2063.190966] [<ffff0000082c1e2c>] SyS_ioctl+0x90/0xa4
[ 2063.204706] Exception stack(0xffff000010aa3ec0 to 0xffff000010aa4000)
[ 2063.204706] Exception stack(0xffff000010aa3ec0 to 0xffff000010aa4000)
[ 2063.222471] 3ec0: 0000000000000003 00000000000bc006 0000ffffd8c3d698 0000000000000010
[ 2063.222471] 3ec0: 0000000000000003 00000000000bc006 0000ffffd8c3d698 0000000000000010
[ 2063.243921] 3ee0: 6572ffffffffffff 0000000000000000 0000ffffd8c3d6aa 65725f6f695f6569
[ 2063.243921] 3ee0: 6572ffffffffffff 0000000000000000 0000ffffd8c3d6aa 65725f6f695f6569
[ 2063.265403] 3f00: 000000000000001d 2f2f2f2f2f34feff 0000000000000000 0000ffff8310dcb0
[ 2063.265403] 3f00: 000000000000001d 2f2f2f2f2f34feff 0000000000000000 0000ffff8310dcb0
[ 2063.286856] 3f20: 0000000000000000 0000000000000509 0000ffff82ea2e64 0000000000000000
[ 2063.286856] 3f20: 0000000000000000 0000000000000509 0000ffff82ea2e64 0000000000000000
[ 2063.308332] 3f40: 0000ffff82f60680 0000000000415540 0000000000000000 0000ffffd8c3d960
[ 2063.308332] 3f40: 0000ffff82f60680 0000000000415540 0000000000000000 0000ffffd8c3d960
[ 2063.329809] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.329809] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.351309] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.351309] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.372763] 3fa0: 0000000000000000 0000ffffd8c3d640 0000000000401b98 0000ffffd8c3d640
[ 2063.372763] 3fa0: 0000000000000000 0000ffffd8c3d640 0000000000401b98 0000ffffd8c3d640
[ 2063.394311] 3fc0: 0000ffff82f6068c 0000000080000000 0000000000000003 000000000000001d
[ 2063.394311] 3fc0: 0000ffff82f6068c 0000000080000000 0000000000000003 000000000000001d
[ 2063.415922] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.415922] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.437650] [<ffff0000080837b0>] el0_svc_naked+0x24/0x28
[ 2063.437650] [<ffff0000080837b0>] el0_svc_naked+0x24/0x28
[ 2063.452308] Code: d280e0e2 d2ba0000 f2e01d02 95bfdf0d (b9400001)
[ 2063.452308] Code: d280e0e2 d2ba0000 f2e01d02 95bfdf0d (b9400001)
[ 2063.469741] ---[ end trace 36d530c5bf5fea7d ]---
[ 2063.469741] ---[ end trace 36d530c5bf5fea7d ]---
[ 2063.482519] Kernel panic - not syncing: Fatal exception
[ 2063.482519] Kernel panic - not syncing: Fatal exception
[ 2063.497131] SMP: stopping secondary CPUs
[ 2063.497131] SMP: stopping secondary CPUs
[ 2063.508593] Kernel Offset: disabled
[ 2063.508593] Kernel Offset: disabled
[ 2063.518572] CPU features: 0x000a18
[ 2063.518572] CPU features: 0x000a18
[ 2063.528209] Memory Limit: none
[ 2063.528209] Memory Limit: none
[ 2063.537067] ---[ end Kernel panic - not syncing: Fatal exception
[ 2063.537067] ---[ end Kernel panic - not syncing: Fatal exception


Thanks
Hanjun
Kani, Toshi Jan. 8, 2018, 11:36 p.m. UTC | #4
On Sat, 2018-01-06 at 17:46 +0800, Hanjun Guo wrote:
> On 2018/1/6 6:15, Kani, Toshi wrote:
> > On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
> > > From: Hanjun Guo <hanjun.guo@linaro.org>
> > > 
> > > When we using iounmap() to free the 4K mapping, it just clear the PTEs
> > > but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> > > tables.
> > > 
> > > This will cause issues on ARM64 platform (not sure if other archs have
> > > the same issue) for this case:
> > > 
> > > 1. ioremap a 4K size, valid page table will build,
> > > 2. iounmap it, pte0 will set to 0;
> > > 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> > >    then set the a new value for pmd;
> > > 4. pte0 is leaked;
> > > 5. CPU may meet exception because the old pmd is still in TLB,
> > >    which will lead to kernel panic.
> > > 
> > > Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> > > zero.
> > 
> > Hi Hanjun,
> > 
> > I tested the above steps on my x86 box, but was not able to reproduce
> > your kernel panic.  On x86, a 4K vaddr gets allocated from a small
> > fragmented free range, whereas a 2MB vaddr is from a larger free range. 
> > Their addrs have different alignments (4KB & 2MB) as well.  So, the
> > steps did not lead to use a same pmd entry.
> 
> Thanks for the testing, I can only reproduce this on my ARM64 platform
> which the CPU will cache the PMD in TLB, from my knowledge, only Cortex-A75
> will do this, so ARM64 platforms which are not A75 based can't be reproduced
> either.
> 
> Catalin, Will, I can reproduce this issue in about 3 minutes with following
> simplified test case [1], and can trigger panic as [2], could you take a look
> as well?

Yes, the test case looks good to me. (nit - it should check if vir_addr
is not NULL.)

> > However, I agree that zero'd pte entries will be leaked when a pmd map
> > is set if they are present under the pmd.
> 
> Thanks for the confirm.
> 
> > 
> > I also tested your patch on my x86 box.  Unfortunately, it effectively
> > disabled 2MB mappings.  While a 2MB vaddr gets allocated from a larger
> > free range, it sill comes from a free range covered by zero'd pte
> > entries.  So, it ends up with 4KB mappings with your changes.
> > 
> > I think we need to come up with other approach.
> 
> Yes, As I said in my patch, this is just RFC, comments are welcomed :)

I am wondering if we can follow the same approach in
arch/x86/mm/pageattr.c.  Like the ioremap case, populate_pmd() does not
check if there is a pte table under the pmd.  But its free function,
unmap_pte_range() calls try_to_free_pte_page() so that a pte table is
freed when all pte entries are zero'd.  It then calls pmd_clear().
iounmap()'s free function, vunmap_pte_range() does not free up a pte
table even if all pte entries are zero'd.

Thanks,
-Toshi
Chintan Pandya Feb. 20, 2018, 9:24 a.m. UTC | #5
On 12/28/2017 4:54 PM, Hanjun Guo wrote:
> From: Hanjun Guo <hanjun.guo@linaro.org>
> 
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
> 
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>     then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>     which will lead to kernel panic.
> 
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.
> 
One obvious problem I see here is, once any 2nd level entry has 3rd 
level mapping, this entry can't map 2M section ever in future. This way, 
we will fragment entire virtual space over time.

The code you are changing is common between 32-bit systems as well (I 
think). And running out of section mapping would be a reality in 
practical terms.

So, if we can do the following as a fix up, we would be saved.
1) Invalidate 2nd level entry from TLB, and
2) Free the page which holds last level page table

BTW, is there any further discussion going on this topic which I am 
missing ?

> Reported-by: Lei Li <lious.lilei@hisilicon.com>
> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
> Cc: Toshi Kani <toshi.kani@hpe.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Xuefeng Wang <wxf.wang@hisilicon.com>
> ---
> 
> Not sure if this is the right direction, this patch has a obvious
> side effect that a mapped address with 4K will not back to 2M.  I may
> miss something and just wrong, so this is just a RFC version, comments
> are welcomed.
> 
>   lib/ioremap.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index b808a39..4e6f19a 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -89,7 +89,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>   	do {
>   		next = pmd_addr_end(addr, end);
>   
> -		if (ioremap_pmd_enabled() &&
> +		if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
>   		    ((next - addr) == PMD_SIZE) &&
>   		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
>   			if (pmd_set_huge(pmd, phys_addr + addr, prot))
> @@ -115,7 +115,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
>   	do {
>   		next = pud_addr_end(addr, end);
>   
> -		if (ioremap_pud_enabled() &&
> +		if (ioremap_pud_enabled() && pud_none(*pud) &&
>   		    ((next - addr) == PUD_SIZE) &&
>   		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
>   			if (pud_set_huge(pud, phys_addr + addr, prot))
> @@ -141,7 +141,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
>   	do {
>   		next = p4d_addr_end(addr, end);
>   
> -		if (ioremap_p4d_enabled() &&
> +		if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
>   		    ((next - addr) == P4D_SIZE) &&
>   		    IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
>   			if (p4d_set_huge(p4d, phys_addr + addr, prot))
> 

Chintan
Kani, Toshi Feb. 21, 2018, 12:34 a.m. UTC | #6
On Tue, 2018-02-20 at 14:54 +0530, Chintan Pandya wrote:
> 
> On 12/28/2017 4:54 PM, Hanjun Guo wrote:
> > From: Hanjun Guo <hanjun.guo@linaro.org>
> > 
> > When we using iounmap() to free the 4K mapping, it just clear the PTEs
> > but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> > tables.
> > 
> > This will cause issues on ARM64 platform (not sure if other archs have
> > the same issue) for this case:
> > 
> > 1. ioremap a 4K size, valid page table will build,
> > 2. iounmap it, pte0 will set to 0;
> > 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> >     then set the a new value for pmd;
> > 4. pte0 is leaked;
> > 5. CPU may meet exception because the old pmd is still in TLB,
> >     which will lead to kernel panic.
> > 
> > Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> > zero.
> > 
> 
> One obvious problem I see here is, once any 2nd level entry has 3rd 
> level mapping, this entry can't map 2M section ever in future. This way, 
> we will fragment entire virtual space over time.
> 
> The code you are changing is common between 32-bit systems as well (I 
> think). And running out of section mapping would be a reality in 
> practical terms.
> 
> So, if we can do the following as a fix up, we would be saved.
> 1) Invalidate 2nd level entry from TLB, and
> 2) Free the page which holds last level page table
> 
> BTW, is there any further discussion going on this topic which I am 
> missing ?

Yes, I suggested to free up a pte table in my last reply.
https://patchwork.kernel.org/patch/10134581/

Thanks,
-Toshi
Will Deacon Feb. 21, 2018, 11:57 a.m. UTC | #7
[sorry, trying to deal with top-posting here]

On Wed, Feb 21, 2018 at 07:36:34AM +0000, Wangxuefeng (E) wrote:
>      The old flow of reuse the 4k page as 2M page does not follow the BBM flow
> for page table reconstruction,not only the memory leak problems.  If BBM flow
> is not followed,the speculative prefetch of tlb will made false tlb entries
> cached in MMU, the false address will be got, panic will happen.

If I understand Toshi's suggestion correctly, he's saying that the PMD can
be cleared when unmapping the last PTE (like try_to_free_pte_page). In this
case, there's no issue with the TLB because this is exactly BBM -- the PMD
is cleared and TLB invalidation is issued before the PTE table is freed. A
subsequent 2M map request will see an empty PMD and put down a block
mapping.

The downside is that freeing becomes more expensive as the last level table
becomes more sparsely populated and you need to ensure you don't have any
concurrent maps going on for the same table when you're unmapping. I also
can't see a neat way to fit this into the current vunmap code. Perhaps we
need an iounmap_page_range.

In the meantime, the code in lib/ioremap.c looks totally broken so I think
we should deselect CONFIG_HAVE_ARCH_HUGE_VMAP on arm64 until it's fixed.

Will
Will Deacon Feb. 26, 2018, 11:04 a.m. UTC | #8
On Mon, Feb 26, 2018 at 06:57:20PM +0800, Hanjun Guo wrote:
> On 2018/2/21 19:57, Will Deacon wrote:
> > [sorry, trying to deal with top-posting here]
> > 
> > On Wed, Feb 21, 2018 at 07:36:34AM +0000, Wangxuefeng (E) wrote:
> >>      The old flow of reuse the 4k page as 2M page does not follow the BBM flow
> >> for page table reconstruction,not only the memory leak problems.  If BBM flow
> >> is not followed,the speculative prefetch of tlb will made false tlb entries
> >> cached in MMU, the false address will be got, panic will happen.
> > 
> > If I understand Toshi's suggestion correctly, he's saying that the PMD can
> > be cleared when unmapping the last PTE (like try_to_free_pte_page). In this
> > case, there's no issue with the TLB because this is exactly BBM -- the PMD
> > is cleared and TLB invalidation is issued before the PTE table is freed. A
> > subsequent 2M map request will see an empty PMD and put down a block
> > mapping.
> > 
> > The downside is that freeing becomes more expensive as the last level table
> > becomes more sparsely populated and you need to ensure you don't have any
> > concurrent maps going on for the same table when you're unmapping. I also
> > can't see a neat way to fit this into the current vunmap code. Perhaps we
> > need an iounmap_page_range.
> > 
> > In the meantime, the code in lib/ioremap.c looks totally broken so I think
> > we should deselect CONFIG_HAVE_ARCH_HUGE_VMAP on arm64 until it's fixed.
> 
> Simply do something below at now (before the broken code is fixed)?
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b2b95f7..a86148c 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -84,7 +84,6 @@ config ARM64
>         select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>         select HAVE_ARCH_AUDITSYSCALL
>         select HAVE_ARCH_BITREVERSE
> -   select HAVE_ARCH_HUGE_VMAP
>         select HAVE_ARCH_JUMP_LABEL
>         select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
>         select HAVE_ARCH_KGDB

No, that actually breaks with the use of block mappings for the kernel
text. Anyway, see:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15122ee2c515a253b0c66a3e618bc7ebe35105eb

Will
Kani, Toshi Feb. 27, 2018, 7:49 p.m. UTC | #9
On Mon, 2018-02-26 at 20:53 +0800, Hanjun Guo wrote:
> On 2018/2/26 19:04, Will Deacon wrote:
> > On Mon, Feb 26, 2018 at 06:57:20PM +0800, Hanjun Guo wrote:
> > > On 2018/2/21 19:57, Will Deacon wrote:
> > > > [sorry, trying to deal with top-posting here]
> > > > 
> > > > On Wed, Feb 21, 2018 at 07:36:34AM +0000, Wangxuefeng (E) wrote:
> > > > >      The old flow of reuse the 4k page as 2M page does not follow the BBM flow
> > > > > for page table reconstruction,not only the memory leak problems.  If BBM flow
> > > > > is not followed,the speculative prefetch of tlb will made false tlb entries
> > > > > cached in MMU, the false address will be got, panic will happen.
> > > > 
> > > > If I understand Toshi's suggestion correctly, he's saying that the PMD can
> > > > be cleared when unmapping the last PTE (like try_to_free_pte_page). In this
> > > > case, there's no issue with the TLB because this is exactly BBM -- the PMD
> > > > is cleared and TLB invalidation is issued before the PTE table is freed. A
> > > > subsequent 2M map request will see an empty PMD and put down a block
> > > > mapping.
> > > > 
> > > > The downside is that freeing becomes more expensive as the last level table
> > > > becomes more sparsely populated and you need to ensure you don't have any
> > > > concurrent maps going on for the same table when you're unmapping. I also
> > > > can't see a neat way to fit this into the current vunmap code. Perhaps we
> > > > need an iounmap_page_range.
> > > > 
> > > > In the meantime, the code in lib/ioremap.c looks totally broken so I think
> > > > we should deselect CONFIG_HAVE_ARCH_HUGE_VMAP on arm64 until it's fixed.
> > > 
> > > Simply do something below at now (before the broken code is fixed)?
> > > 
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index b2b95f7..a86148c 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -84,7 +84,6 @@ config ARM64
> > >         select HAVE_ALIGNED_STRUCT_PAGE if SLUB
> > >         select HAVE_ARCH_AUDITSYSCALL
> > >         select HAVE_ARCH_BITREVERSE
> > > -   select HAVE_ARCH_HUGE_VMAP
> > >         select HAVE_ARCH_JUMP_LABEL
> > >         select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
> > >         select HAVE_ARCH_KGDB
> > 
> > No, that actually breaks with the use of block mappings for the kernel
> > text. Anyway, see:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15122ee2c515a253b0c66a3e618bc7ebe35105eb
> 
> Sorry, just back from holidays and didn't catch up with all the emails,
> thanks for taking care of this.

I will work on a fix for the common/x86 code.

Thanks,
-Toshi
Will Deacon Feb. 27, 2018, 7:59 p.m. UTC | #10
On Tue, Feb 27, 2018 at 07:49:42PM +0000, Kani, Toshi wrote:
> On Mon, 2018-02-26 at 20:53 +0800, Hanjun Guo wrote:
> > On 2018/2/26 19:04, Will Deacon wrote:
> > > On Mon, Feb 26, 2018 at 06:57:20PM +0800, Hanjun Guo wrote:
> > > > Simply do something below at now (before the broken code is fixed)?
> > > > 
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index b2b95f7..a86148c 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -84,7 +84,6 @@ config ARM64
> > > >         select HAVE_ALIGNED_STRUCT_PAGE if SLUB
> > > >         select HAVE_ARCH_AUDITSYSCALL
> > > >         select HAVE_ARCH_BITREVERSE
> > > > -   select HAVE_ARCH_HUGE_VMAP
> > > >         select HAVE_ARCH_JUMP_LABEL
> > > >         select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
> > > >         select HAVE_ARCH_KGDB
> > > 
> > > No, that actually breaks with the use of block mappings for the kernel
> > > text. Anyway, see:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15122ee2c515a253b0c66a3e618bc7ebe35105eb
> > 
> > Sorry, just back from holidays and didn't catch up with all the emails,
> > thanks for taking care of this.
> 
> I will work on a fix for the common/x86 code.

Ace, thanks. I'm more than happy to review any changes you make to the core
code from a break-before-make perspective. Just stick me on cc.

Cheers,

Will
Kani, Toshi Feb. 27, 2018, 8:02 p.m. UTC | #11
On Tue, 2018-02-27 at 19:59 +0000, Will Deacon wrote:
> On Tue, Feb 27, 2018 at 07:49:42PM +0000, Kani, Toshi wrote:
> > On Mon, 2018-02-26 at 20:53 +0800, Hanjun Guo wrote:
> > > On 2018/2/26 19:04, Will Deacon wrote:
> > > > On Mon, Feb 26, 2018 at 06:57:20PM +0800, Hanjun Guo wrote:
> > > > > Simply do something below at now (before the broken code is fixed)?
> > > > > 
> > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > > index b2b95f7..a86148c 100644
> > > > > --- a/arch/arm64/Kconfig
> > > > > +++ b/arch/arm64/Kconfig
> > > > > @@ -84,7 +84,6 @@ config ARM64
> > > > >         select HAVE_ALIGNED_STRUCT_PAGE if SLUB
> > > > >         select HAVE_ARCH_AUDITSYSCALL
> > > > >         select HAVE_ARCH_BITREVERSE
> > > > > -   select HAVE_ARCH_HUGE_VMAP
> > > > >         select HAVE_ARCH_JUMP_LABEL
> > > > >         select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
> > > > >         select HAVE_ARCH_KGDB
> > > > 
> > > > No, that actually breaks with the use of block mappings for the kernel
> > > > text. Anyway, see:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=15122ee2c515a253b0c66a3e618bc7ebe35105eb
> > > 
> > > Sorry, just back from holidays and didn't catch up with all the emails,
> > > thanks for taking care of this.
> > 
> > I will work on a fix for the common/x86 code.
> 
> Ace, thanks. I'm more than happy to review any changes you make to the core
> code from a break-before-make perspective. Just stick me on cc.

Thanks Will!  I will definitely keep you cc'd.
-Toshi

Patch
diff mbox series

diff --git a/lib/ioremap.c b/lib/ioremap.c
index b808a39..4e6f19a 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -89,7 +89,7 @@  static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	do {
 		next = pmd_addr_end(addr, end);
 
-		if (ioremap_pmd_enabled() &&
+		if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
 		    ((next - addr) == PMD_SIZE) &&
 		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
 			if (pmd_set_huge(pmd, phys_addr + addr, prot))
@@ -115,7 +115,7 @@  static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	do {
 		next = pud_addr_end(addr, end);
 
-		if (ioremap_pud_enabled() &&
+		if (ioremap_pud_enabled() && pud_none(*pud) &&
 		    ((next - addr) == PUD_SIZE) &&
 		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
 			if (pud_set_huge(pud, phys_addr + addr, prot))
@@ -141,7 +141,7 @@  static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	do {
 		next = p4d_addr_end(addr, end);
 
-		if (ioremap_p4d_enabled() &&
+		if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
 		    ((next - addr) == P4D_SIZE) &&
 		    IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
 			if (p4d_set_huge(p4d, phys_addr + addr, prot))