linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2 v6] add reserved e820 ranges to the kdump kernel e820 table
@ 2018-11-14  7:29 Lianbo Jiang
  2018-11-14  7:29 ` [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name Lianbo Jiang
  2018-11-14  7:29 ` [PATCH 2/2 v6] x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table Lianbo Jiang
  0 siblings, 2 replies; 19+ messages in thread
From: Lianbo Jiang @ 2018-11-14  7:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: kexec, x86, tglx, mingo, bp, akpm, dyoung, bhe

At present, when use the kexec_file_load syscall to load the kernel image
and initramfs(for example: kexec -s -p xxx), the upstream kernel does not
pass the e820 reserved ranges to the second kernel, which might produce
two problems:

The first one is the MMCONFIG issue, although which does not make the
system crash or hang, this issue is still a potential risk, and also
might lead to the hot-plug device could not be recognized in kdump kernel.
Because the PCI MMCONFIG(extended mode) requires the reserved region
otherwise it falls back to legacy mode. For example, the kdump kernel
outputs the following log.

Example:
......
[   19.798354] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[   19.800653] [Firmware Info]: PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources
[   19.800995] PCI: not using MMCONFIG
......

The correct kernel log is like this:
......
[    0.082649] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    0.083610] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
......

The second issue is that the e820 reserved ranges do not setup in kdump
kernel, which will cause some functions that related to the e820 reserved
ranges to become invalid. For example:

early_memremap()->
early_memremap_pgprot_adjust()->
memremap_should_map_decrypted()->
e820__get_entry_type()

Please focus on these functions, early_memremap_pgprot_adjust() and
memremap_should_map_decrypted().

In the first kernel, these ranges sit in e820 reserved ranges, so the
memremap_should_map_decrypted() will return true, that is to say, the
reserved memory is decrypted, then the early_memremap_pgprot_adjust()
will call the pgprot_decrypted() to clear the memory encryption mask.

In the second kernel, because the e820 reserved ranges are not passed
to the second kernel, these ranges don't sit in the e820 reserved ranges,
so the memremap_should_map_decrypted() will return false, that is to say,
the reserved memory is encrypted, and then the early_memremap_pgprot_
adjust() will also call the pgprot_encrypted() to set the memory encryption
mask.

In fact, in the second kernel, the e820 reserved memory is still decrypted.
Obviously, it has gone wrong. So, this issue must be fixed, otherwise kdump
won't work in this case.

The e820 reserved range is useful in kdump kernel, so it is necessary to
pass the e820 reserved ranges to kdump kernel.

Changes since v1:
1. Modified the value of flags to "0", when walking through the whole
tree for e820 reserved ranges.

Changes since v2:
1. Modified the value of flags to "0", when walking through the whole
tree for e820 reserved ranges.
2. Modified the invalid SOB chain issue.

Changes since v3:
1. Dropped [PATCH 1/3 v3] resource: fix an error which walks through iomem
   resources. Please refer to this commit <010a93bf97c7> "resource: Fix
   find_next_iomem_res() iteration issue"

Changes since v4:
1. Improve the patch log, and add kernel log.

Changes since v5:
1. Rewrite these patches log.

Lianbo Jiang (2):
  x86/kexec_file: add e820 entry in case e820 type string matches to io
    resource name
  x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table

 arch/x86/include/asm/e820/api.h |  2 ++
 arch/x86/kernel/crash.c         | 10 +++++++++-
 arch/x86/kernel/e820.c          |  2 +-
 kernel/resource.c               |  1 +
 4 files changed, 13 insertions(+), 2 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-14  7:29 [PATCH 0/2 v6] add reserved e820 ranges to the kdump kernel e820 table Lianbo Jiang
@ 2018-11-14  7:29 ` Lianbo Jiang
  2018-11-14 11:26   ` Borislav Petkov
  2018-11-14  7:29 ` [PATCH 2/2 v6] x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table Lianbo Jiang
  1 sibling, 1 reply; 19+ messages in thread
From: Lianbo Jiang @ 2018-11-14  7:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: kexec, x86, tglx, mingo, bp, akpm, dyoung, bhe

When load the kernel image and initramfs by kexec_file_load syscall, it can
not add exact e820 reserved type to kdump kernel e820 table.

Kdump uses walk_iomem_res_desc() to iterate io resources, then adds matched
desc to e820 table for kdump kernel. But, when convert the e820 type into
the iores descriptors, several e820 types are converted to 'IORES_DES_NONE'
in this function e820_type_to_iores_desc(). So the walk_iomem_res_desc()
will get these unnecessary types(E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE
_KERN) when iterate io resources by the 'IORES_DES_NONE'.

It needs filter out these redundant type(such as E820_TYPE_RAM/E820_TYPE_
UNUSABLE/E820_TYPE_KERN) in order to add exact e820 reserved type to kdump
kernel e820 table. Thus it also needs an extra checking in memmap_entry_
callback() to match the e820 type and resource name.

Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
---
Changes since v5:
1. Improve the patch log

 arch/x86/include/asm/e820/api.h | 2 ++
 arch/x86/kernel/crash.c         | 6 +++++-
 arch/x86/kernel/e820.c          | 2 +-
 kernel/resource.c               | 1 +
 4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h
index 62be73b23d5c..6d5451b36e80 100644
--- a/arch/x86/include/asm/e820/api.h
+++ b/arch/x86/include/asm/e820/api.h
@@ -42,6 +42,8 @@ extern void e820__register_nosave_regions(unsigned long limit_pfn);
 
 extern int  e820__get_entry_type(u64 start, u64 end);
 
+extern const char *e820_type_to_string(struct e820_entry *entry);
+
 /*
  * Returns true iff the specified range [start,end) is completely contained inside
  * the ISA region.
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index f631a3f15587..ae724a6e0a5f 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -37,6 +37,7 @@
 #include <asm/reboot.h>
 #include <asm/virtext.h>
 #include <asm/intel_pt.h>
+#include <asm/e820/api.h>
 
 /* Used while preparing memory map entries for second kernel */
 struct crash_memmap_data {
@@ -314,11 +315,14 @@ static int memmap_entry_callback(struct resource *res, void *arg)
 	struct crash_memmap_data *cmd = arg;
 	struct boot_params *params = cmd->params;
 	struct e820_entry ei;
+	const char *name;
 
 	ei.addr = res->start;
 	ei.size = resource_size(res);
 	ei.type = cmd->type;
-	add_e820_entry(params, &ei);
+	name = e820_type_to_string(&ei);
+	if (res->name && !strcmp(name, res->name))
+		add_e820_entry(params, &ei);
 
 	return 0;
 }
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 50895c2f937d..4c1fe4f8db1e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1011,7 +1011,7 @@ void __init e820__finish_early_params(void)
 	}
 }
 
-static const char *__init e820_type_to_string(struct e820_entry *entry)
+const char *e820_type_to_string(struct e820_entry *entry)
 {
 	switch (entry->type) {
 	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
diff --git a/kernel/resource.c b/kernel/resource.c
index b0fbf685c77a..4ac07717e2b1 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -373,6 +373,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 	res->end = min(end, p->end);
 	res->flags = p->flags;
 	res->desc = p->desc;
+	res->name = p->name;
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/2 v6] x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table
  2018-11-14  7:29 [PATCH 0/2 v6] add reserved e820 ranges to the kdump kernel e820 table Lianbo Jiang
  2018-11-14  7:29 ` [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name Lianbo Jiang
@ 2018-11-14  7:29 ` Lianbo Jiang
  1 sibling, 0 replies; 19+ messages in thread
From: Lianbo Jiang @ 2018-11-14  7:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: kexec, x86, tglx, mingo, bp, akpm, dyoung, bhe

At present, when use the kexec_file_load syscall to load the kernel image
and initramfs(for example: kexec -s -p xxx), the upstream kernel does not
pass the e820 reserved ranges to the second kernel, which might produce
two problems:

The first one is the MMCONFIG issue, although which does not make the
system crash or hang, this issue is still a potential risk, and also
might lead to the hot-plug device could not be recognized in kdump kernel.
Because the PCI MMCONFIG(extended mode) requires the reserved region
otherwise it falls back to legacy mode. For example, the kdump kernel
outputs the following log.

Example:
......
[   19.798354] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[   19.800653] [Firmware Info]: PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources
[   19.800995] PCI: not using MMCONFIG
......

The correct kernel log is like this:
......
[    0.082649] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    0.083610] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
......

The second issue is that the e820 reserved ranges do not setup in kdump
kernel, which will cause some functions that related to the e820 reserved
ranges to become invalid. For example:

early_memremap()->
early_memremap_pgprot_adjust()->
memremap_should_map_decrypted()->
e820__get_entry_type()

Please focus on these functions, early_memremap_pgprot_adjust() and
memremap_should_map_decrypted().

In the first kernel, these ranges sit in e820 reserved ranges, so the
memremap_should_map_decrypted() will return true, that is to say, the
reserved memory is decrypted, then the early_memremap_pgprot_adjust()
will call the pgprot_decrypted() to clear the memory encryption mask.

In the second kernel, because the e820 reserved ranges are not passed
to the second kernel, these ranges don't sit in the e820 reserved ranges,
so the memremap_should_map_decrypted() will return false, that is to say,
the reserved memory is encrypted, and then the early_memremap_pgprot_
adjust() will also call the pgprot_encrypted() to set the memory encryption
mask.

In fact, in the second kernel, the e820 reserved memory is still decrypted.
Obviously, it has gone wrong. So, this issue must be fixed, otherwise kdump
won't work in this case.

The e820 reserved range is useful in kdump kernel, so it is necessary to
pass the e820 reserved ranges to kdump kernel.

Suggested-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
---
Changes since v5:
1. Improve the patch log

 arch/x86/kernel/crash.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index ae724a6e0a5f..d3167125800e 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -384,6 +384,10 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 	walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1, &cmd,
 			memmap_entry_callback);
 
+	cmd.type = E820_TYPE_RESERVED;
+	walk_iomem_res_desc(IORES_DESC_NONE, 0, 0, -1, &cmd,
+			   memmap_entry_callback);
+
 	/* Add crashk_low_res region */
 	if (crashk_low_res.end) {
 		ei.addr = crashk_low_res.start;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-14  7:29 ` [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name Lianbo Jiang
@ 2018-11-14 11:26   ` Borislav Petkov
  2018-11-15  5:44     ` lijiang
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2018-11-14 11:26 UTC (permalink / raw)
  To: Lianbo Jiang; +Cc: linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

On Wed, Nov 14, 2018 at 03:29:25PM +0800, Lianbo Jiang wrote:
> When load the kernel image and initramfs by kexec_file_load syscall, it can
> not add exact e820 reserved type to kdump kernel e820 table.
> 
> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds matched
> desc to e820 table for kdump kernel. But, when convert the e820 type into
> the iores descriptors, several e820 types are converted to 'IORES_DES_NONE'
> in this function e820_type_to_iores_desc(). So the walk_iomem_res_desc()
> will get these unnecessary types(E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE
> _KERN) when iterate io resources by the 'IORES_DES_NONE'.
> 
> It needs filter out these redundant type(such as E820_TYPE_RAM/E820_TYPE_
> UNUSABLE/E820_TYPE_KERN) in order to add exact e820 reserved type to kdump
> kernel e820 table. Thus it also needs an extra checking in memmap_entry_
> callback() to match the e820 type and resource name.

Ok, it took me a while to parse what this is trying to say so let's
start from the top:

* What resource type do you do need in the second kernel?

* The most important question: why?

* If it is the reserved resource, why aren't you adding
IORES_DESC_RESERVED or so which to look for instead of this hacky string
comparison?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-14 11:26   ` Borislav Petkov
@ 2018-11-15  5:44     ` lijiang
  2018-11-15  5:58       ` Dave Young
  2018-11-15 10:39       ` Borislav Petkov
  0 siblings, 2 replies; 19+ messages in thread
From: lijiang @ 2018-11-15  5:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

在 2018年11月14日 19:26, Borislav Petkov 写道:
> On Wed, Nov 14, 2018 at 03:29:25PM +0800, Lianbo Jiang wrote:
>> When load the kernel image and initramfs by kexec_file_load syscall, it can
>> not add exact e820 reserved type to kdump kernel e820 table.
>>
>> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds matched
>> desc to e820 table for kdump kernel. But, when convert the e820 type into
>> the iores descriptors, several e820 types are converted to 'IORES_DES_NONE'
>> in this function e820_type_to_iores_desc(). So the walk_iomem_res_desc()
>> will get these unnecessary types(E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE
>> _KERN) when iterate io resources by the 'IORES_DES_NONE'.
>>
>> It needs filter out these redundant type(such as E820_TYPE_RAM/E820_TYPE_
>> UNUSABLE/E820_TYPE_KERN) in order to add exact e820 reserved type to kdump
>> kernel e820 table. Thus it also needs an extra checking in memmap_entry_
>> callback() to match the e820 type and resource name.
> 
> Ok, it took me a while to parse what this is trying to say so let's
> start from the top:
> 
> * What resource type do you do need in the second kernel?
> 

Thanks for your comment.

The e820 reserved ranges need to be passed to the second kernel.

> * The most important question: why?
> 

At present, the upstream kernel does not pass the e820 reserved ranges to the
second kernel, which might cause two problems:

The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
the reserved region otherwise it falls back to legacy mode, which might lead to
the hot-plug device could not be recognized in kdump kernel.

Another one is that the e820 reserved ranges do not setup in kdump kernel, which
could cause kdump can't work in some machines. To know more information, please
refer to the [PATCH 2/2 v6] patch log.


> * If it is the reserved resource, why aren't you adding
> IORES_DESC_RESERVED or so which to look for instead of this hacky string
> comparison?
> 

Adding the new descriptor 'IORES_DESC_RESERVED' is also a good solution. I will
modify these patches based on your suggestion and post again.

Thanks.
Lianbo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15  5:44     ` lijiang
@ 2018-11-15  5:58       ` Dave Young
  2018-11-16  1:34         ` lijiang
  2018-11-15 10:39       ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Young @ 2018-11-15  5:58 UTC (permalink / raw)
  To: lijiang; +Cc: Borislav Petkov, linux-kernel, kexec, x86, tglx, mingo, akpm, bhe

On 11/15/18 at 01:44pm, lijiang wrote:
> 在 2018年11月14日 19:26, Borislav Petkov 写道:
> > On Wed, Nov 14, 2018 at 03:29:25PM +0800, Lianbo Jiang wrote:
> >> When load the kernel image and initramfs by kexec_file_load syscall, it can
> >> not add exact e820 reserved type to kdump kernel e820 table.
> >>
> >> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds matched
> >> desc to e820 table for kdump kernel. But, when convert the e820 type into
> >> the iores descriptors, several e820 types are converted to 'IORES_DES_NONE'
> >> in this function e820_type_to_iores_desc(). So the walk_iomem_res_desc()
> >> will get these unnecessary types(E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE
> >> _KERN) when iterate io resources by the 'IORES_DES_NONE'.
> >>
> >> It needs filter out these redundant type(such as E820_TYPE_RAM/E820_TYPE_
> >> UNUSABLE/E820_TYPE_KERN) in order to add exact e820 reserved type to kdump
> >> kernel e820 table. Thus it also needs an extra checking in memmap_entry_
> >> callback() to match the e820 type and resource name.
> > 
> > Ok, it took me a while to parse what this is trying to say so let's
> > start from the top:
> > 
> > * What resource type do you do need in the second kernel?
> > 
> 
> Thanks for your comment.
> 
> The e820 reserved ranges need to be passed to the second kernel.
> 
> > * The most important question: why?
> > 
> 
> At present, the upstream kernel does not pass the e820 reserved ranges to the
> second kernel, which might cause two problems:
> 
> The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
> the reserved region otherwise it falls back to legacy mode, which might lead to
> the hot-plug device could not be recognized in kdump kernel.
> 
> Another one is that the e820 reserved ranges do not setup in kdump kernel, which
> could cause kdump can't work in some machines. To know more information, please
> refer to the [PATCH 2/2 v6] patch log.
> 
> 
> > * If it is the reserved resource, why aren't you adding
> > IORES_DESC_RESERVED or so which to look for instead of this hacky string
> > comparison?
> > 
> 
> Adding the new descriptor 'IORES_DESC_RESERVED' is also a good solution. I will

I was not sure if something else depends on IORES_DESC_NONE and if it is
easy to split it and add IORES_DESC_RESERVED

But if you can prove it is safe then it would be a better way.

Thanks
Dave

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15  5:44     ` lijiang
  2018-11-15  5:58       ` Dave Young
@ 2018-11-15 10:39       ` Borislav Petkov
  2018-11-16  3:25         ` lijiang
                           ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Borislav Petkov @ 2018-11-15 10:39 UTC (permalink / raw)
  To: lijiang, Bjorn Helgaas
  Cc: linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

+ Bjorn.

On Thu, Nov 15, 2018 at 01:44:07PM +0800, lijiang wrote:
> At present, the upstream kernel does not pass the e820 reserved ranges to the
> second kernel, which might cause two problems:
> 
> The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
> the reserved region otherwise it falls back to legacy mode, which might lead to
> the hot-plug device could not be recognized in kdump kernel.

Well, this still doesn't explain it fully. Let's look at a box:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000997ff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000099800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000065642fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000065643000-0x0000000067fb8fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000067fb9000-0x00000000689e8fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000689e9000-0x0000000068bf5fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000068bf6000-0x000000006f7fffff] usable
[    0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000001007fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100800000-0x000000603fffffff] usable

this one has 8 reserved regions. Does that mean that we need to pass
them *all* 8 to the second kernel so that MMCONFIG works?

Or is it only one reserved region which is needed for MMCONFIG?

Bjorn, do you know what the detection logic should be to map the correct
reserved region (or regions) for MMCONFIG?

Now, even if we don't map that reserved region and MMCONFIG falls back
to legacy mode, why is that a problem for the kdump kernel? Why does
the kdump kernel need the hotplug device? What would be the use case?
Hotplug a SATA drive to store the memory dump to it ... or?

> Another one is that the e820 reserved ranges do not setup in kdump kernel, which
> could cause kdump can't work in some machines. To know more information, please
> refer to the [PATCH 2/2 v6] patch log.

Yah, I still don't understand *why* we need the reserved ranges in the
second kernel. Once we've figured out the *why* we can look at the *how*.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15  5:58       ` Dave Young
@ 2018-11-16  1:34         ` lijiang
  0 siblings, 0 replies; 19+ messages in thread
From: lijiang @ 2018-11-16  1:34 UTC (permalink / raw)
  To: Dave Young
  Cc: Borislav Petkov, linux-kernel, kexec, x86, tglx, mingo, akpm, bhe

在 2018年11月15日 13:58, Dave Young 写道:
> On 11/15/18 at 01:44pm, lijiang wrote:
>> 在 2018年11月14日 19:26, Borislav Petkov 写道:
>>> On Wed, Nov 14, 2018 at 03:29:25PM +0800, Lianbo Jiang wrote:
>>>> When load the kernel image and initramfs by kexec_file_load syscall, it can
>>>> not add exact e820 reserved type to kdump kernel e820 table.
>>>>
>>>> Kdump uses walk_iomem_res_desc() to iterate io resources, then adds matched
>>>> desc to e820 table for kdump kernel. But, when convert the e820 type into
>>>> the iores descriptors, several e820 types are converted to 'IORES_DES_NONE'
>>>> in this function e820_type_to_iores_desc(). So the walk_iomem_res_desc()
>>>> will get these unnecessary types(E820_TYPE_RAM/E820_TYPE_UNUSABLE/E820_TYPE
>>>> _KERN) when iterate io resources by the 'IORES_DES_NONE'.
>>>>
>>>> It needs filter out these redundant type(such as E820_TYPE_RAM/E820_TYPE_
>>>> UNUSABLE/E820_TYPE_KERN) in order to add exact e820 reserved type to kdump
>>>> kernel e820 table. Thus it also needs an extra checking in memmap_entry_
>>>> callback() to match the e820 type and resource name.
>>>
>>> Ok, it took me a while to parse what this is trying to say so let's
>>> start from the top:
>>>
>>> * What resource type do you do need in the second kernel?
>>>
>>
>> Thanks for your comment.
>>
>> The e820 reserved ranges need to be passed to the second kernel.
>>
>>> * The most important question: why?
>>>
>>
>> At present, the upstream kernel does not pass the e820 reserved ranges to the
>> second kernel, which might cause two problems:
>>
>> The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
>> the reserved region otherwise it falls back to legacy mode, which might lead to
>> the hot-plug device could not be recognized in kdump kernel.
>>
>> Another one is that the e820 reserved ranges do not setup in kdump kernel, which
>> could cause kdump can't work in some machines. To know more information, please
>> refer to the [PATCH 2/2 v6] patch log.
>>
>>
>>> * If it is the reserved resource, why aren't you adding
>>> IORES_DESC_RESERVED or so which to look for instead of this hacky string
>>> comparison?
>>>
>>
>> Adding the new descriptor 'IORES_DESC_RESERVED' is also a good solution. I will
> 
> I was not sure if something else depends on IORES_DESC_NONE and if it is
> easy to split it and add IORES_DESC_RESERVED
> 
> But if you can prove it is safe then it would be a better way.
> 
Thank you, Dave.

These two solutions should be feasible, they can work very well on my machine.

Regards,
Lianbo

> Thanks
> Dave
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15 10:39       ` Borislav Petkov
@ 2018-11-16  3:25         ` lijiang
  2018-11-18 11:52           ` Borislav Petkov
  2018-11-19  9:55         ` Dave Young
  2018-11-20 20:43         ` Bjorn Helgaas
  2 siblings, 1 reply; 19+ messages in thread
From: lijiang @ 2018-11-16  3:25 UTC (permalink / raw)
  To: Borislav Petkov, Bjorn Helgaas
  Cc: linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

在 2018年11月15日 18:39, Borislav Petkov 写道:
> + Bjorn.
> 
> On Thu, Nov 15, 2018 at 01:44:07PM +0800, lijiang wrote:
>> At present, the upstream kernel does not pass the e820 reserved ranges to the
>> second kernel, which might cause two problems:
>>
>> The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
>> the reserved region otherwise it falls back to legacy mode, which might lead to
>> the hot-plug device could not be recognized in kdump kernel.
> 
> Well, this still doesn't explain it fully. Let's look at a box:
> 
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000997ff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000000099800-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000065642fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000065643000-0x0000000067fb8fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000067fb9000-0x00000000689e8fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x00000000689e9000-0x0000000068bf5fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000068bf6000-0x000000006f7fffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000001007fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100800000-0x000000603fffffff] usable
> 
> this one has 8 reserved regions. Does that mean that we need to pass
> them *all* 8 to the second kernel so that MMCONFIG works?
> 
> Or is it only one reserved region which is needed for MMCONFIG?
> 

On my machine, the pci mmconfig region[mem 0x80000000-0x8fffffff] reserved in e820.
This address range belongs to the e820 reserved region[mem 0x0000000078000000-
0x000000008fffffff].

Kernel outputs the following log:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000029920fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029921000-0x0000000029921fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029922000-0x0000000062278fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000062279000-0x0000000062378fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000062379000-0x000000006238bfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006238c000-0x000000006238cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000006238d000-0x000000006240bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000006240c000-0x000000006264bfff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006264c000-0x000000006266dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000006266e000-0x00000000626cdfff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000626ce000-0x000000006278dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000006278e000-0x000000006278efff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006278f000-0x0000000062807fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000062808000-0x000000006280afff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006280b000-0x000000006280cfff] usable
[    0.000000] BIOS-e820: [mem 0x000000006280d000-0x000000006280dfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006280e000-0x000000006286afff] usable
[    0.000000] BIOS-e820: [mem 0x000000006286b000-0x000000006286efff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006286f000-0x00000000682f8fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000682f9000-0x0000000068b05fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000068b06000-0x0000000068b09fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000068b0a000-0x0000000068b1afff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000068b1b000-0x0000000068b1dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000068b1e000-0x0000000071d1dfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000071d1e000-0x0000000071d2dfff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000071d2e000-0x0000000071d3dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000071d3e000-0x0000000071d4dfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000071d4e000-0x0000000077ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved

......
[    0.082649] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    0.083610] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
......


For the pci mmconfig issue, it should be good enough that the e820 reserved region
[mem 0x0000000078000000-0x000000008fffffff] is only passed to the second kernel, but
the pci mmconfig region is not the same in another machine.

In addition, it has more serious problems that kdump could not work in some machine. 

> Bjorn, do you know what the detection logic should be to map the correct
> reserved region (or regions) for MMCONFIG?
> 
> Now, even if we don't map that reserved region and MMCONFIG falls back
> to legacy mode, why is that a problem for the kdump kernel? Why does
> the kdump kernel need the hotplug device? What would be the use case?
> Hotplug a SATA drive to store the memory dump to it ... or?
> 

A simple case, hotplug a pci network card and use the ssh/nfs to dump the vmcore.
If the pci mmconfig region is not reserved in kdump kernel, the pci hotplug device
could not be recognized. So the pci network card won't work.

>> Another one is that the e820 reserved ranges do not setup in kdump kernel, which
>> could cause kdump can't work in some machines. To know more information, please
>> refer to the [PATCH 2/2 v6] patch log.
> 
> Yah, I still don't understand *why* we need the reserved ranges in the
> second kernel. Once we've figured out the *why* we can look at the *how*.
> 

Here, there is an example about SME kdump. Maybe it can help to better understand.

The e820 reserved ranges do not setup in kdump kernel, which will cause some
functions that related to the e820 reserved ranges to become invalid.
    
early_memremap()->
early_memremap_pgprot_adjust()->
memremap_should_map_decrypted()->
e820__get_entry_type()

Please focus on these functions, early_memremap_pgprot_adjust() and
memremap_should_map_decrypted().

In the first kernel, these ranges sit in e820 reserved ranges, so the
memremap_should_map_decrypted() will return true, that is to say, the
reserved memory is decrypted, then the early_memremap_pgprot_adjust()
will call the pgprot_decrypted() to clear the memory encryption mask.

In the second kernel, because the e820 reserved ranges are not passed
to the second kernel, these ranges don't sit in the e820 reserved ranges,
so the memremap_should_map_decrypted() will return false, that is to say,
the reserved memory is encrypted, and then the early_memremap_pgprot_
adjust() will also call the pgprot_encrypted() to set the memory encryption
mask.

In fact, in the second kernel, the e820 reserved memory is still decrypted.
Obviously, it has gone wrong. So, this issue must be fixed, otherwise kdump
won't work in this case.

The e820 reserved range is useful in kdump kernel, so it is necessary to
pass the e820 reserved ranges to kdump kernel.

Hope this is helpful.

Thanks,
Lianbo

> Thx.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-16  3:25         ` lijiang
@ 2018-11-18 11:52           ` Borislav Petkov
  2018-11-20  4:07             ` lijiang
  2018-11-21 10:54             ` lijiang
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2018-11-18 11:52 UTC (permalink / raw)
  To: lijiang
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

On Fri, Nov 16, 2018 at 11:25:55AM +0800, lijiang wrote:
> For the pci mmconfig issue, it should be good enough that the e820 reserved region
> [mem 0x0000000078000000-0x000000008fffffff] is only passed to the second kernel, but
> the pci mmconfig region is not the same in another machine.

Yes. And now the question is, *which* reserved regions need to be mapped
for the second kernel to function properly? How do we figure that out?

> A simple case, hotplug a pci network card and use the ssh/nfs to dump the vmcore.
> If the pci mmconfig region is not reserved in kdump kernel, the pci hotplug device
> could not be recognized. So the pci network card won't work.

Yes that's a good example; put *that* example in your commit message.

> Here, there is an example about SME kdump. Maybe it can help to better understand.

You keep pasting that and I've read it already. And you keep repeating
that the reserved regions need to be mapped in the second kernel and I'm
asking, how do we determine *which* regions should we pass to the second
kernel?

If we should pass *all* reserved regions, why?

IOW, I'm looking for the *why* first.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15 10:39       ` Borislav Petkov
  2018-11-16  3:25         ` lijiang
@ 2018-11-19  9:55         ` Dave Young
  2018-11-19 10:28           ` Borislav Petkov
  2018-11-20 20:43         ` Bjorn Helgaas
  2 siblings, 1 reply; 19+ messages in thread
From: Dave Young @ 2018-11-19  9:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: lijiang, Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, bhe

On 11/15/18 at 11:39am, Borislav Petkov wrote:
> + Bjorn.
> 
> On Thu, Nov 15, 2018 at 01:44:07PM +0800, lijiang wrote:
> > At present, the upstream kernel does not pass the e820 reserved ranges to the
> > second kernel, which might cause two problems:
> > 
> > The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
> > the reserved region otherwise it falls back to legacy mode, which might lead to
> > the hot-plug device could not be recognized in kdump kernel.
> 
> Well, this still doesn't explain it fully. Let's look at a box:
> 
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000997ff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000000099800-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000065642fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000065643000-0x0000000067fb8fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000067fb9000-0x00000000689e8fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x00000000689e9000-0x0000000068bf5fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000068bf6000-0x000000006f7fffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000001007fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100800000-0x000000603fffffff] usable
> 
> this one has 8 reserved regions. Does that mean that we need to pass
> them *all* 8 to the second kernel so that MMCONFIG works?

We just copy 1st kernel memmap (/proc/iomem) to be used in 2nd kernel
e820,  I'm not sure we can get the exact memory range and pass it.
Because different io devices may have different ranges, it is hard to
get all the use cases. And there seems no easy way to get them.

Another thing is it is not worth to get the exact info, the 1st kernel
reserved part is just fine to be reserved as well in 2nd kernel, no 
side effects.  Actually there might be some obscure use cases we
do not find which rely those reserved memory ranges so it is better to
have.

> 
> Or is it only one reserved region which is needed for MMCONFIG?
> 
> Bjorn, do you know what the detection logic should be to map the correct
> reserved region (or regions) for MMCONFIG?
> 
> Now, even if we don't map that reserved region and MMCONFIG falls back
> to legacy mode, why is that a problem for the kdump kernel? Why does
> the kdump kernel need the hotplug device? What would be the use case?
> Hotplug a SATA drive to store the memory dump to it ... or?

According to an old bug report only devices on PCI segment 0 fall back
to legacy mode, those devices on segment 1 do not fall back, they just
do not work, and this seems not related to hotplug.

I found the old bug report, copy something here:
'''
When doing a kdump, the kdump kernel failed to boot due to not finding /dev/root.  The root drive is on a LSI Megaraid disk.

...
[    6.869903] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.1/input/input1
[    6.885358] generic-usb 0003:046B:FF10.0002: input,hidraw1: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0000:00:1a.0-1.3.1/input1
[    6.901927] usbcore: registered new interface driver usbhid
[    6.908145] usbhid: USB HID core driver
......................Could not find /dev/root.
Want me to fall back to /dev/disk/by-id/scsi-3600605b0049fac9018513918775bfc13-part4? (Y/n) 
y
Waiting for device /dev/disk/by-id/scsi-3600605b0049fac9018513918775bfc13-part4 to appear: ..............................not found -- exiting to /bin/sh
$

The basic problem is that this device is in PCI segment 1 and the kernel PCI probing cannot find it without all the e820 i/o reservations being present in the e820 table.  And the crash kernel does not have those reservations because the kexec command does not pass i/o reservation via the memmap= command line option. (This problem does not show up for other vendors, as SGI is apparently the only one using extended PCI. The lookup of devices in PCI segment 0 actually fails for everyone, but devices in segment 0 are then found by some legacy lookup method.)  The workaround for this is to fix kexec to pass i/o reserved areas to the crash kernel.  The patch will be attached.
'''

And here is some old patches in kexec-tools for fixing this:
http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
(patch from SGI, later reverted)

http://lists.infradead.org/pipermail/kexec/2014-April/011710.html
(patch from Chaowang)

But apparently we missed this issue in kexec_file code..

> 
> > Another one is that the e820 reserved ranges do not setup in kdump kernel, which
> > could cause kdump can't work in some machines. To know more information, please
> > refer to the [PATCH 2/2 v6] patch log.
> 
> Yah, I still don't understand *why* we need the reserved ranges in the
> second kernel. Once we've figured out the *why* we can look at the *how*.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Thanks
Dave

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-19  9:55         ` Dave Young
@ 2018-11-19 10:28           ` Borislav Petkov
  2018-11-20  3:37             ` lijiang
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2018-11-19 10:28 UTC (permalink / raw)
  To: Dave Young, lijiang
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, bhe

On Mon, Nov 19, 2018 at 05:55:15PM +0800, Dave Young wrote:
> Another thing is it is not worth to get the exact info, the 1st kernel
> reserved part is just fine to be reserved as well in 2nd kernel, no 
> side effects.  Actually there might be some obscure use cases we
> do not find which rely those reserved memory ranges so it is better to
> have.

That makes sense as an argument. The cleaner thing would be to figure
out only *which* ranges we're going to need but that is probably harder
than simply exporting what the first kernel sees. But why we're doing
it, needs to be in the commit message so that it is clear when bug
hunting later.

...

> The basic problem is that this device is in PCI segment 1 and
> the kernel PCI probing cannot find it without all the e820 i/o
> reservations being present in the e820 table. And the crash kernel
> does not have those reservations because the kexec command does not
> pass i/o reservation via the memmap= command line option. (This
> problem does not show up for other vendors, as SGI is apparently the
> only one using extended PCI. The lookup of devices in PCI segment 0
> actually fails for everyone, but devices in segment 0 are then found
> by some legacy lookup method.) The workaround for this is to fix kexec
> to pass i/o reserved areas to the crash kernel.

Yap, this is the *why* I'm looking for. Lianbo, in your next submission,
please add Dave's explanations to your commit messages.

If it says "we need to do X" in the commit message, without a reason
given *why* we need to, then there's no way for us to know *why* we did
it, when looking at this months from now. And we absolutely need the
*why* when staring at the code and fixing the next bug/issue.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-19 10:28           ` Borislav Petkov
@ 2018-11-20  3:37             ` lijiang
  2018-11-20 19:29               ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: lijiang @ 2018-11-20  3:37 UTC (permalink / raw)
  To: Borislav Petkov, Dave Young
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, bhe

在 2018年11月19日 18:28, Borislav Petkov 写道:
> On Mon, Nov 19, 2018 at 05:55:15PM +0800, Dave Young wrote:
>> Another thing is it is not worth to get the exact info, the 1st kernel
>> reserved part is just fine to be reserved as well in 2nd kernel, no 
>> side effects.  Actually there might be some obscure use cases we
>> do not find which rely those reserved memory ranges so it is better to
>> have.
> 
> That makes sense as an argument. The cleaner thing would be to figure
> out only *which* ranges we're going to need but that is probably harder
> than simply exporting what the first kernel sees. But why we're doing
> it, needs to be in the commit message so that it is clear when bug
> hunting later.
> 
> ...
> 
>> The basic problem is that this device is in PCI segment 1 and
>> the kernel PCI probing cannot find it without all the e820 i/o
>> reservations being present in the e820 table. And the crash kernel
>> does not have those reservations because the kexec command does not
>> pass i/o reservation via the memmap= command line option. (This
>> problem does not show up for other vendors, as SGI is apparently the
>> only one using extended PCI. The lookup of devices in PCI segment 0
>> actually fails for everyone, but devices in segment 0 are then found
>> by some legacy lookup method.) The workaround for this is to fix kexec
>> to pass i/o reserved areas to the crash kernel.
> 
> Yap, this is the *why* I'm looking for. Lianbo, in your next submission,
> please add Dave's explanations to your commit messages.
> Ok. Thank you, Dave and Boris. I will add Dave's explanations to patch log.

BTW: Boris has mentioned the solution which adds the new descriptor 'IORES_DESC_RESERVED'.

Which solution do you prefer? Add the new I/O resource descriptor 'IORES_DESC_RESERVED'(patch v7)
or exactly comparing a string(patch v6)?

These two solutions are good to me. 

Thanks.
Lianbo
> If it says "we need to do X" in the commit message, without a reason
> given *why* we need to, then there's no way for us to know *why* we did
> it, when looking at this months from now. And we absolutely need the
> *why* when staring at the code and fixing the next bug/issue.
> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-18 11:52           ` Borislav Petkov
@ 2018-11-20  4:07             ` lijiang
  2018-11-21 10:54             ` lijiang
  1 sibling, 0 replies; 19+ messages in thread
From: lijiang @ 2018-11-20  4:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

在 2018年11月18日 19:52, Borislav Petkov 写道:
> On Fri, Nov 16, 2018 at 11:25:55AM +0800, lijiang wrote:
>> For the pci mmconfig issue, it should be good enough that the e820 reserved region
>> [mem 0x0000000078000000-0x000000008fffffff] is only passed to the second kernel, but
>> the pci mmconfig region is not the same in another machine.
> 
> Yes. And now the question is, *which* reserved regions need to be mapped
> for the second kernel to function properly? How do we figure that out?
> 
>> A simple case, hotplug a pci network card and use the ssh/nfs to dump the vmcore.
>> If the pci mmconfig region is not reserved in kdump kernel, the pci hotplug device
>> could not be recognized. So the pci network card won't work.
> 
> Yes that's a good example; put *that* example in your commit message.
> 
>> Here, there is an example about SME kdump. Maybe it can help to better understand.
> 
> You keep pasting that and I've read it already. And you keep repeating
> that the reserved regions need to be mapped in the second kernel and I'm
> asking, how do we determine *which* regions should we pass to the second
> kernel?
> 
> If we should pass *all* reserved regions, why?
> 
> IOW, I'm looking for the *why* first.
> 
> Thx.
> 

I guess you have gotten the answer what you want from Dave's reply.

Thank you, Boris. Also thanks for Dave's explanation in detail.

Regards,
Lianbo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-20  3:37             ` lijiang
@ 2018-11-20 19:29               ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2018-11-20 19:29 UTC (permalink / raw)
  To: lijiang
  Cc: Dave Young, Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo,
	akpm, bhe

On Tue, Nov 20, 2018 at 11:37:26AM +0800, lijiang wrote:
> BTW: Boris has mentioned the solution which adds the new descriptor 'IORES_DESC_RESERVED'.
> 
> Which solution do you prefer? Add the new I/O resource descriptor 'IORES_DESC_RESERVED'(patch v7)
> or exactly comparing a string(patch v6)?

The new I/O descriptor of course - the string comparison thing is
fragile and error-prone.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-15 10:39       ` Borislav Petkov
  2018-11-16  3:25         ` lijiang
  2018-11-19  9:55         ` Dave Young
@ 2018-11-20 20:43         ` Bjorn Helgaas
  2018-11-21 18:42           ` Borislav Petkov
  2 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2018-11-20 20:43 UTC (permalink / raw)
  To: bp
  Cc: lijiang, Bjorn Helgaas, linux-kernel, kexec, x86, tglx,
	Ingo Molnar, Andrew Morton, dyoung, bhe

On Thu, Nov 15, 2018 at 4:40 AM Borislav Petkov <bp@alien8.de> wrote:
>
> + Bjorn.
>
> On Thu, Nov 15, 2018 at 01:44:07PM +0800, lijiang wrote:
> > At present, the upstream kernel does not pass the e820 reserved ranges to the
> > second kernel, which might cause two problems:
> >
> > The first one is the MMCONFIG issue, the PCI MMCONFIG(extended mode) requires
> > the reserved region otherwise it falls back to legacy mode, which might lead to
> > the hot-plug device could not be recognized in kdump kernel.
>
> Well, this still doesn't explain it fully. Let's look at a box:
>
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000997ff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000000099800-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000065642fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000065643000-0x0000000067fb8fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000067fb9000-0x00000000689e8fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x00000000689e9000-0x0000000068bf5fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000068bf6000-0x000000006f7fffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fec80000-0x00000000fed00fff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000ff800000-0x00000001007fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100800000-0x000000603fffffff] usable
>
> this one has 8 reserved regions. Does that mean that we need to pass
> them *all* 8 to the second kernel so that MMCONFIG works?
>
> Or is it only one reserved region which is needed for MMCONFIG?
>
> Bjorn, do you know what the detection logic should be to map the correct
> reserved region (or regions) for MMCONFIG?

MMCONFIG (aka ECAM) space is described in the ACPI MCFG table.  The
generic code to read that is in drivers/acpi/pci_mcfg.c (ignore all
the quirks at the top) and the generic code to use it is
drivers/pci/ecam.c.

Unfortunately x86 doesn't use any of that generic path.  It uses the
same MCFG table, but it's parsed in arch/x86/pci/mmconfig-shared.c,
and the code there checks to ensure the ECAM regions are reserved
somehow by firmware, e.g., via the e820 table.  There's a bunch of
grungy device-dependent code there, too, possibly to work around
firmware defects, or (just as likely) to compensate for Linux defects
that were *attributed* to firmware.

I think you should regard correct MCFG/ECAM usage in the kdump kernel
as a requirement.  If you don't have ECAM (a) PCI devices won't work
at all on non-x86 systems that use only ECAM for config access, (b)
you won't be able to access devices on non-0 segments (granted, there
aren't very many of these yet, but there will be more in the future),
and (c) you won't be able to access extended config space (addresses
0x100-0xfff), which means none of the Extended Capabilities will be
available (AER, ACS, ATS, etc).

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-18 11:52           ` Borislav Petkov
  2018-11-20  4:07             ` lijiang
@ 2018-11-21 10:54             ` lijiang
  2018-11-21 13:06               ` Boris Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: lijiang @ 2018-11-21 10:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

在 2018年11月18日 19:52, Borislav Petkov 写道:
> On Fri, Nov 16, 2018 at 11:25:55AM +0800, lijiang wrote:
>> For the pci mmconfig issue, it should be good enough that the e820 reserved region
>> [mem 0x0000000078000000-0x000000008fffffff] is only passed to the second kernel, but
>> the pci mmconfig region is not the same in another machine.
> 
> Yes. And now the question is, *which* reserved regions need to be mapped
> for the second kernel to function properly? How do we figure that out?
> 
>> A simple case, hotplug a pci network card and use the ssh/nfs to dump the vmcore.
>> If the pci mmconfig region is not reserved in kdump kernel, the pci hotplug device
>> could not be recognized. So the pci network card won't work.
> 
> Yes that's a good example; put *that* example in your commit message.
> 
>> Here, there is an example about SME kdump. Maybe it can help to better understand.
> 
> You keep pasting that and I've read it already. And you keep repeating
> that the reserved regions need to be mapped in the second kernel and I'm
> asking, how do we determine *which* regions should we pass to the second
> kernel?
> 

For the pci mmconfig issue, it is clear according to Dave and Bjorn's comment. So i'd like
to explain two things:

1. why the SME kdump kernel does not work without the reserved ranges.

The first kernel can get e820 table's information from BIOS or bootloader, but for kdump
kernel, it can not use this method to get e820 table. Maybe you have known who should pass
the e820 ranges to the second kernel, it is just the kexec-tools.

Unfortunately, kernel does not pass the e820 reserved ranges to the kdump kernel, when use
the kexec_file_load syscall to load the kernel image and initramfs.

At the early boot stage, the caller will use the early_memremap() to remap the address space,
such as DMI, ACPI and Firmware, etc.

Here, i only use an example to illustrate the issue what the DMI encountered in kdump kernel.

At the early boot stage, the DMI will use the early_memremap() to remap the address ranges
[0xF0000-0xFFFFF], and then check for the SMBIOS entry point signature.

And when the caller checks the SMBIOS entry point signature, the caller will use the early_memremap()
to remap another reserved ranges [0x0x6286B000-0x6286EFFF] again, which is reported by firmware(or
SMBIOS).

Please refer to the code: drivers/firmware/dmi_scan.c

dmi_scan_machine()-> / dmi_early_remap()-> early_memremap()
                     \ dmi_smbios3_present()->dmi_walk_early()->dmi_early_remap()->early_memremap()

Obviously, the DMI remapped ranges [0xF0000-0xFFFFF] and [0x6286b000-0x6286efff] are not in the e820
table for the kdump kernel. Therefore, when SME is enabled, these regions will be consider to be
encrypted according to the following code: arch/x86/mm/ioremap.c. Actually, these regions are still
decrypted in kdump kernel. It has gone wrong.

Note: 
When SME is enabled, if the memory regions are decrypted, but the caller marks the memory pages as
encrypted, the content of pages is unpredictable when read from memory.

early_memremap()->
early_memremap_pgprot_adjust()->
memremap_should_map_decrypted()->
e820__get_entry_type()

static bool memremap_should_map_decrypted(resource_size_t phys_addr,
                                          unsigned long size)
{
        int is_pmem;

        ......

        /* Check if the address is outside kernel usable area */
        switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
        case E820_TYPE_RESERVED:
        case E820_TYPE_ACPI:
        case E820_TYPE_NVS:
        case E820_TYPE_UNUSABLE:
                /* For SEV, these areas are encrypted */
                if (sev_active())
                        break;
                /* Fallthrough */

        case E820_TYPE_PRAM:
                return true;
        default:
                break;
        }

        return false;
}

Similarly, for ACPI, firmware, etc. They will also encounter the same problems, eventually, which
causes the system fails to start. That is the reason why the SME kdump kernel does not work without
the reserved ranges.

2. why the all reserved ranges are passed to the second kernel(or which regions should we pass to
   the second kernel?)

As previously mentioned, for the DMI, the reserved regions[0xF0000-0xFFFFF] and [0x0x6286B000-0x6286EFFF]
are accurately passed to the second kernel, the DMI can work well on my machine. But for the ACPI, firmware,
etc. How to deal with this?

I think that we have to find all places where to call the early_memremap(), and determine whether these
reserved regions need be passed to the second kernel, and then add the code for kdump kernel in these code
paths. It is really too expensive to do this.

Furthermore, i only did a same thing in kernel, just like the kexec-tools pass the all reserved ranges to the
second kernel.

If you understood this issue, you might ignore it, please.

Thanks.
Lianbo

> If we should pass *all* reserved regions, why?
> 
> IOW, I'm looking for the *why* first.
> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-21 10:54             ` lijiang
@ 2018-11-21 13:06               ` Boris Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Boris Petkov @ 2018-11-21 13:06 UTC (permalink / raw)
  To: lijiang
  Cc: Bjorn Helgaas, linux-kernel, kexec, x86, tglx, mingo, akpm, dyoung, bhe

On November 21, 2018 11:54:09 AM GMT+01:00, lijiang <lijiang@redhat.com> wrote:
>If you understood this issue, you might ignore it, please.

Yes,I got it. Simply figuring out which reserved regions to export and which not, is not worth the effort - we can do the simple thing of directly exporting them all.

Thx.

-- 
Sent from a small device: formatting sux and brevity is inevitable.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name
  2018-11-20 20:43         ` Bjorn Helgaas
@ 2018-11-21 18:42           ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2018-11-21 18:42 UTC (permalink / raw)
  To: bjorn
  Cc: lijiang, Bjorn Helgaas, linux-kernel, kexec, x86, tglx,
	Ingo Molnar, Andrew Morton, dyoung, bhe

On Tue, Nov 20, 2018 at 02:43:57PM -0600, Bjorn Helgaas wrote:
> MMCONFIG (aka ECAM) space is described in the ACPI MCFG table.  The
> generic code to read that is in drivers/acpi/pci_mcfg.c (ignore all
> the quirks at the top) and the generic code to use it is
> drivers/pci/ecam.c.

/me saves that for future reference.

> Unfortunately x86 doesn't use any of that generic path.  It uses the
> same MCFG table, but it's parsed in arch/x86/pci/mmconfig-shared.c,
> and the code there checks to ensure the ECAM regions are reserved
> somehow by firmware, e.g., via the e820 table.

As they are on my workstation here, for example:

[    0.552607] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820

vs the "ACPI motherboard resources" reservation thing where it traverses
some PNP devices.

> There's a bunch of grungy device-dependent code there, too, possibly
> to work around firmware defects, or (just as likely) to compensate for
> Linux defects that were *attributed* to firmware.

Nah, latter we simply fix. :-)

> I think you should regard correct MCFG/ECAM usage in the kdump kernel
> as a requirement.  If you don't have ECAM (a) PCI devices won't work
> at all on non-x86 systems that use only ECAM for config access, (b)
> you won't be able to access devices on non-0 segments (granted, there
> aren't very many of these yet, but there will be more in the future),

Ah. don't worry, someone will pop up from the woodwork with the need for
this ;-\

> and (c) you won't be able to access extended config space (addresses
> 0x100-0xfff), which means none of the Extended Capabilities will be
> available (AER, ACS, ATS, etc).

Yap, and this answers my question:

e820__mapped_all() is being called with E820_TYPE_RESERVED down the path
in pci_mmcfg_check_reserved(). Which basically says that we need to
provide all reserved regions in the second kernel so that MMCONFIG works
there.

Thanks a lot Bjorn!

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-11-21 18:42 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-14  7:29 [PATCH 0/2 v6] add reserved e820 ranges to the kdump kernel e820 table Lianbo Jiang
2018-11-14  7:29 ` [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name Lianbo Jiang
2018-11-14 11:26   ` Borislav Petkov
2018-11-15  5:44     ` lijiang
2018-11-15  5:58       ` Dave Young
2018-11-16  1:34         ` lijiang
2018-11-15 10:39       ` Borislav Petkov
2018-11-16  3:25         ` lijiang
2018-11-18 11:52           ` Borislav Petkov
2018-11-20  4:07             ` lijiang
2018-11-21 10:54             ` lijiang
2018-11-21 13:06               ` Boris Petkov
2018-11-19  9:55         ` Dave Young
2018-11-19 10:28           ` Borislav Petkov
2018-11-20  3:37             ` lijiang
2018-11-20 19:29               ` Borislav Petkov
2018-11-20 20:43         ` Bjorn Helgaas
2018-11-21 18:42           ` Borislav Petkov
2018-11-14  7:29 ` [PATCH 2/2 v6] x86/kexec_file: add reserved e820 ranges to kdump kernel e820 table Lianbo Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).