I was bisecting a problem on 64bit where any attempt to cause a crash
kernel to boot would hang. The bisect ended up on commit 722bc6b (x86/mm:
Fix the size calculation of mapping tables) and somehow, looking at the
calling function and the ranges printed on boot, I think the calculations
should only be done in the 32bit case.

On 64bit:
[    0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
[    0.000000]  [mem 0x00000000-0x77dfffff] page 2M
[    0.000000]  [mem 0x77e00000-0x77e87fff] page 4k

Attached patch would fix this if you agree with it. Thanks.

-Stefan

From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Fri, 13 Jul 2012 15:16:33 +0200
Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32

commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
did modify the extra space calculation for mapping tables in order
to make up for the first 2/4M memory range using 4K pages.
However this setup is only used when compiling for 32bit. On 64bit
there is only the trailing area of 4K pages (which is already added).

The code was already adapted once for things went wrong on a 8TB
machine (bd2753b x86/mm: Only add extra pages count for the first memory
range during pre-allocation early page table space), but it looks a bit
like it currently would overdo things for 64bit.
I only noticed while bisecting for the reason I could not make a crash
kernel boot (which ended up on this patch).
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 arch/x86/mm/init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index bc4e9d8..636bbfd 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -60,10 +60,11 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
 #ifdef CONFIG_X86_32
 		extra += PMD_SIZE;
-#endif
+
 		/* The first 2/4M doesn't use large pages. */
 		if (mr->start < PMD_SIZE)
 			extra += mr->end - mr->start;
+#endif
 
 		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	} else
-- 
1.7.9.5
On Fri, Jul 13, 2012 at 6:41 AM, Stefan Bader
<stefan.bader@canonical.com> wrote:
> I was bisecting a problem on 64bit where any attempt to cause a crash kernel to
> boot would hang. The bisect ended up on commit 722bc6b (x86/mm: Fix the size
> calculation of mapping tables) and somehow, looking at the calling function and
> the ranges printed on boot, I think the calculations should only be done in the
> 32bit case.
>
> On 64bit:
> [ 0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
> [ 0.000000] [mem 0x00000000-0x77dfffff] page 2M
> [ 0.000000] [mem 0x77e00000-0x77e87fff] page 4k
>
> Attached patch would fix this if you agree with it. Thanks.
That does not look like the cause of the hang on your system. Maybe it
is just because it changes the memblock allocation layout a bit.

Can you please post the whole boot log for both the working and the
non-working case?
Thanks
Yinghai
On 07/13/2012 11:12 AM, Yinghai Lu wrote:
> On Fri, Jul 13, 2012 at 6:41 AM, Stefan Bader
> <stefan.bader@canonical.com> wrote:
>> I was bisecting a problem on 64bit where any attempt to cause a crash kernel to
>> boot would hang. The bisect ended up on commit 722bc6b (x86/mm: Fix the size
>> calculation of mapping tables) and somehow, looking at the calling function and
>> the ranges printed on boot, I think the calculations should only be done in the
>> 32bit case.
>>
>> On 64bit:
>> [    0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
>> [    0.000000]  [mem 0x00000000-0x77dfffff] page 2M
>> [    0.000000]  [mem 0x77e00000-0x77e87fff] page 4k
>>
>> Attached patch would fix this if you agree with it. Thanks.
>
> it does not look like for the hang for your system. maybe just because
> it change a bit memblock allocation layout.
>
The hang is merely the effect of limited memory getting even more
limited; running out of it while trying to uncompress the initramfs
and/or kernel is not helping.

> can you please post whole boot log that is working and not?
>
I am traveling this week and have no access to the machine. But
basically you can see the issue relatively simply: 64bit does not map
the first 2/4M area as 4k pages. So with the current state of the patch
this would allocate extra space of about 3MB for the first range (about
1.9GB). Again, the problem is nothing bad going on beyond the fact that
it wastes memory; it just happens to be more than it used to be, so the
memory set aside for getting the crash kernel to boot suddenly failed
to be enough.

-Stefan

> Thanks
>
> Yinghai
>
On 07/13/2012 06:41 AM, Stefan Bader wrote:
> I was bisecting a problem on 64bit where any attempt to cause a crash kernel to
> boot would hang. The bisect ended up on commit 722bc6b (x86/mm: Fix the size
> calculation of mapping tables) and somehow, looking at the calling function and
> the ranges printed on boot, I think the calculations should only be done in the
> 32bit case.
>
> On 64bit:
> [    0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
> [    0.000000]  [mem 0x00000000-0x77dfffff] page 2M
> [    0.000000]  [mem 0x77e00000-0x77e87fff] page 4k
>
> Attached patch would fix this if you agree with it. Thanks.
>
Any news on this one? I thought it would be quite simple to check for
sanity, and not wasting memory sounds like a good thing to do. Even
though there is plenty of it around most of the time. ;)

-Stefan

> -Stefan
>
>
> From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Fri, 13 Jul 2012 15:16:33 +0200
> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>
> commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
> did modify the extra space calculation for mapping tables in order
> to make up for the first 2/4M memory range using 4K pages.
> However this setup is only used when compiling for 32bit. On 64bit
> there is only the trailing area of 4K pages (which is already added).
>
> The code was already adapted once for things went wrong on a 8TB
> machine (bd2753b x86/mm: Only add extra pages count for the first memory
> range during pre-allocation early page table space), but it looks a bit
> like it currently would overdo things for 64bit.
> I only noticed while bisecting for the reason I could not make a crash
> kernel boot (which ended up on this patch).
>
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> Cc: WANG Cong <xiyou.wangcong@gmail.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> ---
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index bc4e9d8..636bbfd 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -60,10 +60,11 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
>  		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
>  #ifdef CONFIG_X86_32
>  		extra += PMD_SIZE;
> -#endif
> +
>  		/* The first 2/4M doesn't use large pages. */
>  		if (mr->start < PMD_SIZE)
>  			extra += mr->end - mr->start;
> +#endif
>
>  		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	} else
>
On Fri, Jul 13, 2012 at 9:41 PM, Stefan Bader
<stefan.bader@canonical.com> wrote:
> I was bisecting a problem on 64bit where any attempt to cause a crash kernel to
> boot would hang. The bisect ended up on commit 722bc6b (x86/mm: Fix the size
> calculation of mapping tables) and somehow, looking at the calling function and
> the ranges printed on boot, I think the calculations should only be done in the
> 32bit case.
>
> On 64bit:
> [ 0.000000] init_memory_mapping: [mem 0x00000000-0x77e87fff]
> [ 0.000000] [mem 0x00000000-0x77dfffff] page 2M
> [ 0.000000] [mem 0x77e00000-0x77e87fff] page 4k
>
> Attached patch would fix this if you agree with it. Thanks.
>
> -Stefan
>
>
> From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Fri, 13 Jul 2012 15:16:33 +0200
> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>
> commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
> did modify the extra space calculation for mapping tables in order
> to make up for the first 2/4M memory range using 4K pages.
> However this setup is only used when compiling for 32bit. On 64bit
> there is only the trailing area of 4K pages (which is already added).
>
> The code was already adapted once for things went wrong on a 8TB
> machine (bd2753b x86/mm: Only add extra pages count for the first memory
> range during pre-allocation early page table space), but it looks a bit
> like it currently would overdo things for 64bit.
> I only noticed while bisecting for the reason I could not make a crash
> kernel boot (which ended up on this patch).
>
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> Cc: WANG Cong <xiyou.wangcong@gmail.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Sorry about that, I was not aware that x86_64 differs from x86_32 in
the first 2/4M.
On 07/24/2012 06:52 PM, Cong Wang wrote:
>> From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
>> From: Stefan Bader <stefan.bader@canonical.com>
>> Date: Fri, 13 Jul 2012 15:16:33 +0200
>> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>>
>> commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
>> did modify the extra space calculation for mapping tables in order
>> to make up for the first 2/4M memory range using 4K pages.
>> However this setup is only used when compiling for 32bit. On 64bit
>> there is only the trailing area of 4K pages (which is already added).
>>
>> The code was already adapted once for things went wrong on a 8TB
>> machine (bd2753b x86/mm: Only add extra pages count for the first memory
>> range during pre-allocation early page table space), but it looks a bit
>> like it currently would overdo things for 64bit.
>> I only noticed while bisecting for the reason I could not make a crash
>> kernel boot (which ended up on this patch).
>>
>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>> Cc: WANG Cong <xiyou.wangcong@gmail.com>
>> Cc: Yinghai Lu <yinghai@kernel.org>
>> Cc: Tejun Heo <tj@kernel.org>
>
> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> Sorry for that I was not aware of x86_64 is different with x86 in the
> first 2/4M.
Why would there be a difference?
Shouldn't the IO space at 0xa0000-0x100000 be mapped with uncacheable
attributes (or WC for VGA)? If it's done later, it can be done later
for both.
--
error compiling committee.c: too many arguments to function
On 25.07.2012 12:44, Avi Kivity wrote:
> On 07/24/2012 06:52 PM, Cong Wang wrote:
>
>>> From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
>>> From: Stefan Bader <stefan.bader@canonical.com>
>>> Date: Fri, 13 Jul 2012 15:16:33 +0200
>>> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>>>
>>> commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
>>> did modify the extra space calculation for mapping tables in order
>>> to make up for the first 2/4M memory range using 4K pages.
>>> However this setup is only used when compiling for 32bit. On 64bit
>>> there is only the trailing area of 4K pages (which is already added).
>>>
>>> The code was already adapted once for things went wrong on a 8TB
>>> machine (bd2753b x86/mm: Only add extra pages count for the first memory
>>> range during pre-allocation early page table space), but it looks a bit
>>> like it currently would overdo things for 64bit.
>>> I only noticed while bisecting for the reason I could not make a crash
>>> kernel boot (which ended up on this patch).
>>>
>>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>>> Cc: WANG Cong <xiyou.wangcong@gmail.com>
>>> Cc: Yinghai Lu <yinghai@kernel.org>
>>> Cc: Tejun Heo <tj@kernel.org>
>>
>> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
>>
>> Sorry for that I was not aware of x86_64 is different with x86 in the
>> first 2/4M.
>
> Why would there be a difference?
>
> Shouldn't the IO space at 0xa0000-0x100000 be mapped with uncacheable
> attributes (or WC for VGA)? If it's done later, it can be done later
> for both.
>
arch/x86/mm/init.c

unsigned long __init_refok init_memory_mapping(...
...
#ifdef CONFIG_X86_32
	/*
	 * Don't use a large page for the first 2/4MB of memory
	 * because there are often fixed size MTRRs in there
	 * and overlapping MTRRs into large pages can cause
	 * slowdowns.
	 */
On 07/25/2012 02:14 PM, Stefan Bader wrote:
> On 25.07.2012 12:44, Avi Kivity wrote:
>> On 07/24/2012 06:52 PM, Cong Wang wrote:
>>
>>>> From 6b679d1af20656929c0e829f29eed60b0a86a74f Mon Sep 17 00:00:00 2001
>>>> From: Stefan Bader <stefan.bader@canonical.com>
>>>> Date: Fri, 13 Jul 2012 15:16:33 +0200
>>>> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>>>>
>>>> commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
>>>> did modify the extra space calculation for mapping tables in order
>>>> to make up for the first 2/4M memory range using 4K pages.
>>>> However this setup is only used when compiling for 32bit. On 64bit
>>>> there is only the trailing area of 4K pages (which is already added).
>>>>
>>>> The code was already adapted once for things went wrong on a 8TB
>>>> machine (bd2753b x86/mm: Only add extra pages count for the first memory
>>>> range during pre-allocation early page table space), but it looks a bit
>>>> like it currently would overdo things for 64bit.
>>>> I only noticed while bisecting for the reason I could not make a crash
>>>> kernel boot (which ended up on this patch).
>>>>
>>>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>>>> Cc: WANG Cong <xiyou.wangcong@gmail.com>
>>>> Cc: Yinghai Lu <yinghai@kernel.org>
>>>> Cc: Tejun Heo <tj@kernel.org>
>>>
>>> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
>>>
>>> Sorry for that I was not aware of x86_64 is different with x86 in the
>>> first 2/4M.
>>
>> Why would there be a difference?
>>
>> Shouldn't the IO space at 0xa0000-0x100000 be mapped with uncacheable
>> attributes (or WC for VGA)? If it's done later, it can be done later
>> for both.
>>
> arch/x86/mm/init.c
>
> unsigned long __init_refok init_memory_mapping(...
> ...
> ifdef CONFIG_X86_32
> /*
> * Don't use a large page for the first 2/4MB of memory
> * because there are often fixed size MTRRs in there
> * and overlapping MTRRs into large pages can cause
> * slowdowns.
> */
>
That's equally true for X86_64.
Best would be to merge the MTRRs into PAT, but that might not work for SMM.
--
error compiling committee.c: too many arguments to function
On 25.07.2012 14:32, Avi Kivity wrote:
> On 07/25/2012 02:14 PM, Stefan Bader wrote:
>> On 25.07.2012 12:44, Avi Kivity wrote:
>>> On 07/24/2012 06:52 PM, Cong Wang wrote:
>>>
>>>>> Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>>>>> [...]
>>>>
>>>> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
>>>>
>>>> Sorry for that I was not aware of x86_64 is different with x86 in the
>>>> first 2/4M.
>>>
>>> Why would there be a difference?
>>>
>>> Shouldn't the IO space at 0xa0000-0x100000 be mapped with uncacheable
>>> attributes (or WC for VGA)? If it's done later, it can be done later
>>> for both.
>>>
>> arch/x86/mm/init.c
>>
>> unsigned long __init_refok init_memory_mapping(...
>> ...
>> ifdef CONFIG_X86_32
>> /*
>>  * Don't use a large page for the first 2/4MB of memory
>>  * because there are often fixed size MTRRs in there
>>  * and overlapping MTRRs into large pages can cause
>>  * slowdowns.
>>  */
>>
>
> That's equally true for X86_64.
>
> Best would be to merge the MTRRs into PAT, but that might not work for SMM.
>
Ok, true. Not sure why this was restricted to 32bit when reconsidering.
Except if in 64bit it was assumed (or asserted) that the regions are
aligned to 2M... But maybe this can be answered by someone knowing the
details. I would not mind either way (have the first range with 4K pages
in all cases or fixing the additional PTE allocation). Just as it is now
it is inconsistent.
On 07/25/2012 04:24 PM, Stefan Bader wrote:
>>> ...
>>> ifdef CONFIG_X86_32
>>> /*
>>> * Don't use a large page for the first 2/4MB of memory
>>> * because there are often fixed size MTRRs in there
>>> * and overlapping MTRRs into large pages can cause
>>> * slowdowns.
>>> */
>>>
>>
>> That's equally true for X86_64.
>>
>> Best would be to merge the MTRRs into PAT, but that might not work for SMM.
>>
>>
> Ok, true. Not sure why this was restricted to 32bit when reconsidering. Except
> if in 64bit it was assumed (or asserted) that the regions are aligned to 2M...
> But maybe this can be answered by someone knowing the details. I would not mind
> either way (have the first range with 4K pages in all cases or fixing the
> additional PTE allocation). Just as it is now it is inconsistent.
Sometimes CONFIG_X86_32 is used as an alias for "machines so old they
don't support x86_64". As a 32-bit kernel can be run on a machine that
does support x86_64, it should be replaced by a runtime test for
X86_FEATURE_LM, until a more accurate test can be found.
--
error compiling committee.c: too many arguments to function
On 25.07.2012 15:40, Avi Kivity wrote:
> On 07/25/2012 04:24 PM, Stefan Bader wrote:
>>>> ...
>>>> ifdef CONFIG_X86_32
>>>> /*
>>>>  * Don't use a large page for the first 2/4MB of memory
>>>>  * because there are often fixed size MTRRs in there
>>>>  * and overlapping MTRRs into large pages can cause
>>>>  * slowdowns.
>>>>  */
>>>>
>>>
>>> That's equally true for X86_64.
>>>
>>> Best would be to merge the MTRRs into PAT, but that might not work for SMM.
>>>
>>>
>> Ok, true. Not sure why this was restricted to 32bit when reconsidering. Except
>> if in 64bit it was assumed (or asserted) that the regions are aligned to 2M...
>> But maybe this can be answered by someone knowing the details. I would not mind
>> either way (have the first range with 4K pages in all cases or fixing the
>> additional PTE allocation). Just as it is now it is inconsistent.
>
> Sometimes CONFIG_X86_32 is used as an alias for "machines so old they
> don't support x86_64". As a 32-bit kernel can be run on a machine that
> does support x86_64, it should be replaced by a runtime test for
> X86_FEATURE_LM, until a more accurate test can be found.
>
So basically the first range being 4k exists because MTRRs might define
ranges there, and those are always aligned to 4k but not necessarily to
the bigger pages used. Reading through the Intel and AMD docs indicates
various levels of badness when this is not the case. Though afaict MTRRs
are not tied to long mode capable CPUs. For example Atom is 32bit only
(the earlier ones at least) and uses MTRRs. So testing for LM would miss
those.

Would it not be better to unconditionally have the first 2/4M as 4k
pages? At least as long as there is no check for the alignment of the
MTRR ranges. Or, thinking of it, the runtime test should look for
X86_FEATURE_MTRR, shouldn't it?
On 07/31/2012 12:48 PM, Stefan Bader wrote:
> On 25.07.2012 15:40, Avi Kivity wrote:
>> On 07/25/2012 04:24 PM, Stefan Bader wrote:
>>>>> ...
>>>>> ifdef CONFIG_X86_32
>>>>> /*
>>>>> * Don't use a large page for the first 2/4MB of memory
>>>>> * because there are often fixed size MTRRs in there
>>>>> * and overlapping MTRRs into large pages can cause
>>>>> * slowdowns.
>>>>> */
>>>>>
>>>>
>>>> That's equally true for X86_64.
>>>>
>>>> Best would be to merge the MTRRs into PAT, but that might not work for SMM.
>>>>
>>>>
>>> Ok, true. Not sure why this was restricted to 32bit when reconsidering. Except
>>> if in 64bit it was assumed (or asserted) that the regions are aligned to 2M...
>>> But maybe this can be answered by someone knowing the details. I would not mind
>>> either way (have the first range with 4K pages in all cases or fixing the
>>> additional PTE allocation). Just as it is now it is inconsistent.
>>
>> Sometimes CONFIG_X86_32 is used as an alias for "machines so old they
>> don't support x86_64". As a 32-bit kernel can be run on a machine that
>> does support x86_64, it should be replaced by a runtime test for
>> X86_FEATURE_LM, until a more accurate test can be found.
>>
>
> So basically the first range being 4k exist because MTRRs might define ranges
> there and those are always aligned to 4k but not necessarily to the bigger pages
> used. Reading through the Intel and AMD docs indicates various levels of badness
> when this is not the case. Though afaict MTRRs are not tied to long mode capable
> CPUs. For example Atom is 32bit only (the earlier ones at least) and uses MTRRs.
> So testing for LM would miss those.
> Would it not be better to unconditionally have the first 2/4M as 4k pages? At
> least as long as there is no check for the alignment of the MTRR ranges. Or
> thinking of it, the runtime test should look for X86_FEATURE_MTRR, shouldn't it?
MTRRs are indeed far older than x86_64; it's almost pointless to test
for them, since practically all processors have them.
The fact that the check is only done on i386 and not on x86_64 may come
from one of
- an oversight
- by the time x86_64 processors came along, the problem with
conflicting sizes was resolved
- the whole thing is bogus
Copying hpa who may be in a position to find out which.
--
error compiling committee.c: too many arguments to function
Avi wrote:
> The fact that the check is only done on i386 and not on x86_64
> may come from one of
>
>  - an oversight
>  - by the time x86_64 processors came along, the problem with
>    conflicting sizes was resolved
>  - the whole thing is bogus
>
> Copying hpa who may be in a position to find out which.

Talking to hpa it is more of the last, for more than just this
reason. Since the whole area of initial page tables seems to be
rather sensitive and easy to break, there have been discussions
and plans to come up with a rewrite to improve on all those
shortcomings.

The detail I am not agreeing with hpa on is the fixup for the
immediate breakage at head. IMO right now the code just has
regressed and that should be fixed as soon as possible.
Plus doing a specific and small fix allows that to be applicable
to stable (which again still depends on things being upstream).

Hence the re-send, in the hope that on the larger scale there may
be agreement on the immediate fix. I am not doubting the usefulness
or need of a better solution, but I think that having a remedy for
the current situation just until then has enough benefit to be
considered.

-Stefan


From 1d5cc3971716a039c91abc18cb6f9bcbe5dde490 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Fri, 13 Jul 2012 15:16:33 +0200
Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32

commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
did modify the extra space calculation for mapping tables in order
to make up for the first 2/4M memory range using 4K pages.
However this setup is only used when compiling for 32bit. On 64bit
there is only the trailing area of 4K pages (which is already added).

The code was already adapted once for things went wrong on a 8TB
machine (bd2753b x86/mm: Only add extra pages count for the first memory
range during pre-allocation early page table space), but it looks a bit
like it currently would overdo things for 64bit.
I only noticed while bisecting for the reason I could not make a crash
kernel boot (which ended up on this patch).

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@vger.kernel.org # v3.5
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 arch/x86/mm/init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..28a1c99 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -60,10 +60,11 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
 #ifdef CONFIG_X86_32
 		extra += PMD_SIZE;
-#endif
+
 		/* The first 2/4M doesn't use large pages. */
 		if (mr->start < PMD_SIZE)
 			extra += mr->end - mr->start;
+#endif
 
 		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	} else
-- 
1.7.10.4
I'm not saying we shouldn't patch the regression, but this house of cards *needs* to be replaced with something robust and correct by construction.
Stefan Bader <stefan.bader@canonical.com> wrote:
>Avi wrote:
>>The fact that the check is only done on i386 and not on x86_64
>> may come from one of
>>
>> - an oversight
>> - by the time x86_64 processors came along, the problem with
>> conflicting sizes was resolved
>> - the whole thing is bogus
>>
>> Copying hpa who may be in a position to find out which.
>
>Talking to hpa it is more of the last. For more than just this
>reason. Since the whole area of initial page tables seems to be
>rather sensitive and easy to break there have been discussions
>and plans to come up with a rewrite to improve on all those
>shortcomings.
>
>The detail I am not agreeing with hpa is the fixup for the
>immediate breakage at head. IMO right now the code just has
>regressed and that should be fixed as soon as possible.
>Plus doing a specific and small fix allows that to be applicable
>to stable (which again still depends on things being upstream).
>
>Hence the re-send in the hope that on the larger scale the may
>be agreement on the immediate fix. I am not doubting the usefulness
>or need of a better solution, but I think that having a remedy of
>the current situation just until then has enough benefit to be
>considered.
>
>-Stefan
>
>
>
>From 1d5cc3971716a039c91abc18cb6f9bcbe5dde490 Mon Sep 17 00:00:00 2001
>From: Stefan Bader <stefan.bader@canonical.com>
>Date: Fri, 13 Jul 2012 15:16:33 +0200
>Subject: [PATCH] x86/mm: Limit 2/4M size calculation to x86_32
>
>commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)
>did modify the extra space calculation for mapping tables in order
>to make up for the first 2/4M memory range using 4K pages.
>However this setup is only used when compiling for 32bit. On 64bit
>there is only the trailing area of 4K pages (which is already added).
>
>The code was already adapted once for things went wrong on a 8TB
>machine (bd2753b x86/mm: Only add extra pages count for the first
>memory
>range during pre-allocation early page table space), but it looks a bit
>like it currently would overdo things for 64bit.
>I only noticed while bisecting for the reason I could not make a crash
>kernel boot (which ended up on this patch).
>
>Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>Cc: stable@vger.kernel.org # v3.5
>Cc: WANG Cong <xiyou.wangcong@gmail.com>
>Cc: Yinghai Lu <yinghai@kernel.org>
>Cc: Tejun Heo <tj@kernel.org>
>---
> arch/x86/mm/init.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>index e0e6990..28a1c99 100644
>--- a/arch/x86/mm/init.c
>+++ b/arch/x86/mm/init.c
>@@ -60,10 +60,11 @@ static void __init find_early_table_space(struct
>map_range *mr, unsigned long en
> extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
> #ifdef CONFIG_X86_32
> extra += PMD_SIZE;
>-#endif
>+
> /* The first 2/4M doesn't use large pages. */
> if (mr->start < PMD_SIZE)
> extra += mr->end - mr->start;
>+#endif
>
> ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
> } else
>--
>1.7.10.4
--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
On 08/31/2012 09:41 AM, H. Peter Anvin wrote:
> I'm not saying we shouldn't patch the regression, but this house of cards
> *needs* to be replaced with something robust and correct by construction.

Then I did misunderstand/-interpret you about the former part and we
actually are agreeing on the whole topic. Sorry about that. So the
re-post just should serve as a reminder, as the last comment here was
quite a while ago.

> Stefan Bader <stefan.bader@canonical.com> wrote:
>
>> Avi wrote:
>>> The fact that the check is only done on i386 and not on x86_64 may come
>>> from one of
>>>
>>> - an oversight
>>> - by the time x86_64 processors came along, the problem with
>>>   conflicting sizes was resolved
>>> - the whole thing is bogus
>>>
>>> Copying hpa who may be in a position to find out which.
>>
>> Talking to hpa it is more of the last. For more than just this reason.
>> Since the whole area of initial page tables seems to be rather sensitive
>> and easy to break there have been discussions and plans to come up with a
>> rewrite to improve on all those shortcomings.
>>
>> The detail I am not agreeing with hpa is the fixup for the immediate
>> breakage at head. IMO right now the code just has regressed and that should
>> be fixed as soon as possible. Plus doing a specific and small fix allows
>> that to be applicable to stable (which again still depends on things being
>> upstream).
>>
>> Hence the re-send in the hope that on the larger scale the may be agreement
>> on the immediate fix. I am not doubting the usefulness or need of a better
>> solution, but I think that having a remedy of the current situation just
>> until then has enough benefit to be considered.
>>
>> -Stefan
On 31.08.2012 18:41, H. Peter Anvin wrote:
> I'm not saying we shouldn't patch the regression, but this house of cards
> *needs* to be replaced with something robust and correct by construction.

Could that patch then get applied? I got some feedback that the description
was not really well written, so I am attaching a version that tries to do
better. The code change itself is the same.

-Stefan

---
From 737a5ebdd7ac1f4106cb0b0c53cc8f73b6ff1aca Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Fri, 13 Jul 2012 15:16:33 +0200
Subject: [PATCH] x86/mm: Limit extra padding calculation to x86_32

Starting with kernel v3.5, kexec-based crash dumping was observed to fail
(without any apparent message) on x86_64 machines. This was traced to a
lack of memory triggered by a substantial increase (several megabytes) in
the size of the initial page tables.

After the regression (on a VM with 2GB of memory):

  kernel direct mapping tables up to 0x7fffcfff @ [mem 0x1fbfd000-0x1fffffff]
  size = 4206591 bytes

With this patch applied:

  kernel direct mapping tables up to 0x7fffcfff @ [mem 0x1fffc000-0x1fffffff]
  size = 16383 bytes

A bisection led to the commit below:

  commit 722bc6b (x86/mm: Fix the size calculation of mapping tables)

This change modified the extra space calculation to take into account that
the first 2/4M range of memory is mapped with 4K pages, as suggested in
chapter 11.11.9 of the Intel software developer's manual. However this is
currently only true for x86_32 (the reasons behind that are unclear, but
apparently the whole page table setup needs to be revisited, as it turns
out to be very easy to break and has flaws in its current form).

Until the logic can be revisited and combined, pair up the extra space
calculation with the logic which creates the extra mappings.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@vger.kernel.org # v3.5+
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 arch/x86/mm/init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index bc4e9d8..636bbfd 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -60,10 +60,11 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
 #ifdef CONFIG_X86_32
 		extra += PMD_SIZE;
-#endif
+
 		/* The first 2/4M doesn't use large pages. */
 		if (mr->start < PMD_SIZE)
 			extra += mr->end - mr->start;
+#endif
 
 		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	} else
-- 
1.7.9.5
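[Editorial note] To make the magnitude of the over-reservation concrete, the arithmetic in find_early_table_space() can be modeled with a short standalone sketch. This is an illustration only, not the kernel code: the range end 0x7fffd000 is an assumed value for a ~2 GiB guest's first map_range, and PMD/PUD table pages are ignored, so the figures only match the order of magnitude of the dmesg sizes quoted above.

```python
PAGE_SIZE = 1 << 12   # 4 KiB pages
PMD_SHIFT = 21        # 2 MiB large pages on x86_64
PMD_SIZE = 1 << PMD_SHIFT

def pte_table_bytes(start, end, with_regression):
    """Bytes of 4K-PTE tables reserved for one 2M-page map_range
    (simplified model of the x86_64 path of find_early_table_space())."""
    # Tail of the range that is not 2M-aligned and must use 4K pages:
    extra = end - ((end >> PMD_SHIFT) << PMD_SHIFT)
    if with_regression and start < PMD_SIZE:
        # The 32-bit-only "first 2/4M uses 4K pages" padding that commit
        # 722bc6b mistakenly applied on 64-bit as well:
        extra += end - start
    ptes = (extra + PAGE_SIZE - 1) // PAGE_SIZE
    return ptes * 8   # 8 bytes per 64-bit PTE

# Assumed first map_range on a ~2 GiB guest: [0, 0x7fffd000)
good = pte_table_bytes(0, 0x7fffd000, with_regression=False)
bad = pte_table_bytes(0, 0x7fffd000, with_regression=True)
print(good, bad, bad - good)   # 4072 4198352 4194280
```

The regressed path reserves roughly 4 MiB of extra low memory for PTE pages, which is the same order as the jump from "size = 16383 bytes" to "size = 4206591 bytes" in the commit message, and enough to starve a crash kernel's reserved region.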