[PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file

* [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
@ 2015-09-29 12:58 ` Lee, Chun-Yi
  0 siblings, 0 replies; 12+ messages in thread
From: Lee, Chun-Yi @ 2015-09-29 12:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, Baoquan He, linux-kernel, akpm, kexec, Lee,
	Chun-Yi

This patch modifies the code in fill_up_crash_elf_data by using
walk_system_ram_res instead of walk_system_ram_range to count the max
number of crash memory ranges. That's because walk_system_ram_range
filters out small memory regions that reside in the same page, but
walk_system_ram_res does not.

The original issue is a page fault that sometimes happens on big machines
when preparing ELF headers:

[  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
[  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
[  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
[  305.315393] Oops: 0002 [#1] SMP
[...snip]
[  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
[  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
[...snip]

Tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback shows
that the code uses walk_system_ram_res to fill crash memory region
information into the program headers, so it includes small memory regions
that reside within a single page. But when the kernel used
walk_system_ram_range in fill_up_crash_elf_data to count the number of crash
memory regions, it filtered out those small regions. I printed those small
memory regions, for example:

kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the code in walk_system_ram_range, this memory region will be filtered
out:

pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
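
As a rough illustration (not part of the patch), the rounding above can be
reproduced in a few lines of userspace C. The helper name
whole_pages_in_region() is hypothetical, and the 4 KiB page size is an
assumption taken from the ">> 12" in the arithmetic:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/*
 * Hypothetical helper: number of whole page frames inside
 * [start, start + size). The start is rounded up to a frame boundary
 * and the exclusive end is rounded down, as in the arithmetic above.
 */
static uint64_t whole_pages_in_region(uint64_t start, uint64_t size)
{
	uint64_t pfn = (start + PAGE_SIZE - 1) >> PAGE_SHIFT;	/* round up */
	uint64_t end_pfn = (start + size) >> PAGE_SHIFT;	/* round down */

	/* A region that covers no whole frame is skipped entirely. */
	return end_pfn > pfn ? end_pfn - pfn : 0;
}
```

Calling whole_pages_in_region(0x77592258, 0xdc0) returns 0, so the region
is never reported to the callback. The log line prints sz=0xdc0 while the
arithmetic uses 0xfc0; both sizes end inside frame 0x77593, so the result
is the same either way.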

So, the max_nr_ranges counted by the kernel doesn't include small memory
regions. That causes the page fault in the later code path that prepares
the ELF headers.

This issue is not easy to reproduce on small machines that don't have
many CPUs, because the allocated page-aligned ELF buffer has more free space
to cover those small memory regions' PT_LOAD headers.
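
To make the masking effect concrete, here is a minimal sketch. It assumes
the buffer holds one 64-bit ELF header followed by one program header per
counted entry, and that its size is rounded up to a 4096-byte boundary.
The function names and the HDR_ALIGN constant are hypothetical; only the
structure sizes (64 bytes for Elf64_Ehdr, 56 bytes for Elf64_Phdr) come
from the ELF-64 format:

```c
#include <stddef.h>

/* Sizes of the 64-bit ELF structures (Elf64_Ehdr and Elf64_Phdr). */
#define EHDR_SIZE   64UL
#define PHDR_SIZE   56UL
/* Assumed alignment of the header buffer: one 4 KiB page. */
#define HDR_ALIGN   4096UL

/* Bytes actually needed for one ELF header plus nr_phdr program headers. */
static size_t elf_hdr_bytes(size_t nr_phdr)
{
	return EHDR_SIZE + nr_phdr * PHDR_SIZE;
}

/* Buffer size after rounding up to the assumed page alignment. */
static size_t elf_buf_bytes(size_t nr_phdr)
{
	return (elf_hdr_bytes(nr_phdr) + HDR_ALIGN - 1) & ~(HDR_ALIGN - 1);
}

/* Uncounted program headers that still fit in the rounding slack. */
static size_t spare_phdr_slots(size_t nr_phdr)
{
	return (elf_buf_bytes(nr_phdr) - elf_hdr_bytes(nr_phdr)) / PHDR_SIZE;
}
```

Under these assumptions, a count of 10 program headers leaves room for 62
uncounted ones, so a few missed regions stay inside the buffer. A count of
72 fills the page exactly, and the first uncounted PT_LOAD header would be
written past the end of the buffer.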

v3:
Changed the declaration of nr_ranges to be unsigned int*

v2:
Simplified the patch description by removing some details about the number
of CPUs, to avoid confusing patch reviewers.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
---
 arch/x86/kernel/crash.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..74ca2fe 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -185,10 +185,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 }

 #ifdef CONFIG_KEXEC_FILE
-static int get_nr_ram_ranges_callback(unsigned long start_pfn,
-				unsigned long nr_pfn, void *arg)
+static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
 {
-	int *nr_ranges = arg;
+	unsigned int *nr_ranges = arg;

 	(*nr_ranges)++;
 	return 0;
@@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,

 	ced->image = image;

-	walk_system_ram_range(0, -1, &nr_ranges,
+	walk_system_ram_res(0, -1, &nr_ranges,
 				get_nr_ram_ranges_callback);

 	ced->max_nr_ranges = nr_ranges;
-- 
2.1.4

* Re: [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-29 12:58 ` Lee, Chun-Yi
@ 2015-09-30  3:04   ` Dave Young
  -1 siblings, 0 replies; 12+ messages in thread
From: Dave Young @ 2015-09-30  3:04 UTC (permalink / raw)
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, Stephen Rothwell, akpm, Baoquan He, Takashi Iwai,
	Viresh Kumar, x86, kexec, linux-kernel, Lee, Chun-Yi,
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner,
	Jiang Liu

On 09/29/15 at 08:58pm, Lee, Chun-Yi wrote:
> This patch modifies the code in fill_up_crash_elf_data by using
> walk_system_ram_res instead of walk_system_ram_range to count the max
> number of crash memory ranges. That's because walk_system_ram_range
> filters out small memory regions that reside in the same page, but
> walk_system_ram_res does not.
> 
> The original issue is a page fault that sometimes happens on big machines
> when preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [...snip]
> 
> Tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback shows
> that the code uses walk_system_ram_res to fill crash memory region
> information into the program headers, so it includes small memory regions
> that reside within a single page. But when the kernel used
> walk_system_ram_range in fill_up_crash_elf_data to count the number of crash
> memory regions, it filtered out those small regions. I printed those small
> memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Based on the code in walk_system_ram_range, this memory region will be filtered
> out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
> 
> So, the max_nr_ranges counted by the kernel doesn't include small memory
> regions. That causes the page fault in the later code path that prepares
> the ELF headers.
> 
> This issue is not easy to reproduce on small machines that don't have
> many CPUs, because the allocated page-aligned ELF buffer has more free space
> to cover those small memory regions' PT_LOAD headers.
> 
> v3:
> Changed the declaration of nr_ranges to be unsigned int*
> 
> v2:
> Simplified the patch description by removing some details about the number
> of CPUs, to avoid confusing patch reviewers.
> 
> Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> ---
>  arch/x86/kernel/crash.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index e068d66..74ca2fe 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -185,10 +185,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  }
>  
>  #ifdef CONFIG_KEXEC_FILE
> -static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> -				unsigned long nr_pfn, void *arg)
> +static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
>  {
> -	int *nr_ranges = arg;
> +	unsigned int *nr_ranges = arg;
>  
>  	(*nr_ranges)++;
>  	return 0;
> @@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,
>  
>  	ced->image = image;
>  
> -	walk_system_ram_range(0, -1, &nr_ranges,
> +	walk_system_ram_res(0, -1, &nr_ranges,
>  				get_nr_ram_ranges_callback);
>  
>  	ced->max_nr_ranges = nr_ranges;

Acked-by: Dave Young <dyoung@redhat.com>

Thanks
Dave

* Re: [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-29 12:58 ` Lee, Chun-Yi
@ 2015-09-30 11:27   ` Minfei Huang
  -1 siblings, 0 replies; 12+ messages in thread
From: Minfei Huang @ 2015-09-30 11:27 UTC (permalink / raw)
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, Stephen Rothwell, akpm, Baoquan He, Takashi Iwai,
	Viresh Kumar, x86, kexec, linux-kernel, Lee, Chun-Yi,
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner,
	Jiang Liu

On 09/29/15 at 08:58pm, Lee, Chun-Yi wrote:
> This patch modifies the code in fill_up_crash_elf_data by using
> walk_system_ram_res instead of walk_system_ram_range to count the max
> number of crash memory ranges. That's because walk_system_ram_range
> filters out small memory regions that reside in the same page, but
> walk_system_ram_res does not.
> 
> The original issue is a page fault that sometimes happens on big machines
> when preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [...snip]
> 
> Tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback shows
> that the code uses walk_system_ram_res to fill crash memory region
> information into the program headers, so it includes small memory regions
> that reside within a single page. But when the kernel used
> walk_system_ram_range in fill_up_crash_elf_data to count the number of crash
> memory regions, it filtered out those small regions. I printed those small
> memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Based on the code in walk_system_ram_range, this memory region will be filtered
> out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
> 
> So, the max_nr_ranges counted by the kernel doesn't include small memory
> regions. That causes the page fault in the later code path that prepares
> the ELF headers.
> 
> This issue is not easy to reproduce on small machines that don't have
> many CPUs, because the allocated page-aligned ELF buffer has more free space
> to cover those small memory regions' PT_LOAD headers.
> 
> v3:
> Changed the declaration of nr_ranges to be unsigned int*
> 
> v2:
> Simplified the patch description by removing some details about the number
> of CPUs, to avoid confusing patch reviewers.
> 
> Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
> ---
>  arch/x86/kernel/crash.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index e068d66..74ca2fe 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -185,10 +185,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  }
>  
>  #ifdef CONFIG_KEXEC_FILE
> -static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> -				unsigned long nr_pfn, void *arg)
> +static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
>  {
> -	int *nr_ranges = arg;
> +	unsigned int *nr_ranges = arg;
>  
>  	(*nr_ranges)++;
>  	return 0;
> @@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,
>  
>  	ced->image = image;
>  
> -	walk_system_ram_range(0, -1, &nr_ranges,
> +	walk_system_ram_res(0, -1, &nr_ranges,
>  				get_nr_ram_ranges_callback);
>  
>  	ced->max_nr_ranges = nr_ranges;
> -- 
> 2.1.4
> 

Reviewed-by: Minfei Huang <mhuang@redhat.com>

* [tip:x86/urgent] x86/kexec: Fix kexec crash in syscall kexec_file_load()
  2015-09-29 12:58 ` Lee, Chun-Yi
                   ` (2 preceding siblings ...)
  (?)
@ 2015-10-01  8:54 ` tip-bot for Lee, Chun-Yi
  -1 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Lee, Chun-Yi @ 2015-10-01  8:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, peterz, tglx, torvalds, efault, jlee, linux-kernel, tiwai,
	joeyli.kernel, sfr, viresh.kumar, luto, jiang.liu, mingo, bhe,
	vgoyal

Commit-ID:  27c7b5b29a7aa2fde52ae8525f04857c306452a0
Gitweb:     http://git.kernel.org/tip/27c7b5b29a7aa2fde52ae8525f04857c306452a0
Author:     Lee, Chun-Yi <joeyli.kernel@gmail.com>
AuthorDate: Tue, 29 Sep 2015 20:58:57 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 1 Oct 2015 10:18:04 +0200

x86/kexec: Fix kexec crash in syscall kexec_file_load()

The original bug is a page fault crash that sometimes happens
on big machines when preparing ELF headers:

    BUG: unable to handle kernel paging request at ffffc90613fc9000
    IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260

The bug is caused by us under-counting the number of memory ranges
and subsequently not allocating enough ELF header space for them.
The bug is typically masked on smaller systems, because the ELF header
allocation is rounded up to the next page.

This patch modifies the code in fill_up_crash_elf_data() by using
walk_system_ram_res() instead of walk_system_ram_range() to correctly
count the max number of crash memory ranges. That's because the
walk_system_ram_range() filters out small memory regions that
reside in the same page, but walk_system_ram_res() does not.
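
The difference between the two counting strategies can be sketched in
userspace C. This is only a model under stated assumptions: struct
ram_region and both function names are hypothetical stand-ins, and the
region list is passed in as a plain array rather than walked from the
kernel's resource tree:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Hypothetical stand-in for one "System RAM" resource, end inclusive. */
struct ram_region {
	uint64_t start;
	uint64_t end;
};

/* Resource-based count: every region is reported, whatever its size. */
static unsigned int count_by_resource(const struct ram_region *r, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		(void)r[i];	/* no filtering on size or alignment */
		count++;
	}
	return count;
}

/* Page-frame-based count: regions without a whole page are skipped. */
static unsigned int count_by_page_frame(const struct ram_region *r, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t pfn = (r[i].start + PAGE_SIZE - 1) >> PAGE_SHIFT;
		uint64_t end_pfn = (r[i].end + 1) >> PAGE_SHIFT;

		if (end_pfn > pfn)
			count++;
	}
	return count;
}
```

Whenever a list contains a region smaller than a page that also crosses
no full frame, count_by_page_frame() returns less than
count_by_resource(). Sizing a buffer with the former and filling it with
one entry per region from the latter is the mismatch described here.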

Here's how I found the bug:

After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
the code uses walk_system_ram_res() to fill-in crash memory regions information
to the program header, so it counts those small memory regions that
reside in a page area.

But, when the kernel was using walk_system_ram_range() in
fill_up_crash_elf_data() to count the number of crash memory regions,
it filters out small regions.

I printed those small memory regions, for example:

  kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the code in walk_system_ram_range(), this memory region
will be filtered out:

  pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
  end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
  end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE

So, the max_nr_ranges that's counted by the kernel doesn't include
small memory regions - causing us to under-allocate the required space.
That causes the page fault crash that happens in a later code path
when preparing ELF headers.

This bug is not easy to reproduce on small machines that have few
CPUs, because the allocated page aligned ELF buffer has more free
space to cover those small memory regions' PT_LOAD headers.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/crash.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..74ca2fe 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -185,10 +185,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 }

 #ifdef CONFIG_KEXEC_FILE
-static int get_nr_ram_ranges_callback(unsigned long start_pfn,
-				unsigned long nr_pfn, void *arg)
+static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
 {
-	int *nr_ranges = arg;
+	unsigned int *nr_ranges = arg;

 	(*nr_ranges)++;
 	return 0;
@@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,

 	ced->image = image;

-	walk_system_ram_range(0, -1, &nr_ranges,
+	walk_system_ram_res(0, -1, &nr_ranges,
 				get_nr_ram_ranges_callback);

 	ced->max_nr_ranges = nr_ranges;

* Re: [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-09-29 12:58 ` Lee, Chun-Yi
@ 2015-10-01 23:07   ` Andrew Morton
  -1 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2015-10-01 23:07 UTC (permalink / raw)
  To: Lee, Chun-Yi
  Cc: Vivek Goyal, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, x86,
	Stephen Rothwell, Viresh Kumar, Takashi Iwai, Jiang Liu,
	Andy Lutomirski, Baoquan He, linux-kernel, kexec, Lee, Chun-Yi

On Tue, 29 Sep 2015 20:58:57 +0800 "Lee, Chun-Yi" <joeyli.kernel@gmail.com> wrote:

> This patch modifies the code in fill_up_crash_elf_data by using
> walk_system_ram_res instead of walk_system_ram_range to count the max
> number of crash memory ranges. That's because walk_system_ram_range
> filters out small memory regions that reside in the same page, but
> walk_system_ram_res does not.
> 
> The original issue is a page fault that sometimes happens on big machines
> when preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> [...snip]
> 
> Tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback shows
> that the code uses walk_system_ram_res to fill crash memory region
> information into the program headers, so it includes small memory regions
> that reside within a single page. But when the kernel used
> walk_system_ram_range in fill_up_crash_elf_data to count the number of crash
> memory regions, it filtered out those small regions. I printed those small
> memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> 
> Based on the code in walk_system_ram_range, this memory region will be filtered
> out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
> 
> So, the max_nr_ranges counted by the kernel doesn't include small memory
> regions. That causes the page fault in the later code path that prepares
> the ELF headers.
> 
> This issue is not easy to reproduce on small machines that don't have
> many CPUs, because the allocated page-aligned ELF buffer has more free space
> to cover those small memory regions' PT_LOAD headers.
> 

fyi, I added a cc:stable to my copy of this patch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()
  2015-10-01 23:07   ` Andrew Morton
@ 2015-10-02  7:13     ` Ingo Molnar
  -1 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2015-10-02  7:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee, Chun-Yi, Vivek Goyal, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, x86, Stephen Rothwell, Viresh Kumar, Takashi Iwai,
	Jiang Liu, Andy Lutomirski, Baoquan He, linux-kernel, kexec, Lee,
	Chun-Yi


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 29 Sep 2015 20:58:57 +0800 "Lee, Chun-Yi" <joeyli.kernel@gmail.com> wrote:
> 
> > This patch modifies the code in fill_up_crash_elf_data by using
> > walk_system_ram_res instead of walk_system_ram_range to count the max
> > number of crash memory ranges. That's because walk_system_ram_range
> > filters out small memory regions that reside in the same page, but
> > walk_system_ram_res does not.
> > 
> > The original issue is a page fault that sometimes happened on big machines
> > when preparing ELF headers:
> > 
> > [  305.291522] BUG: unable to handle kernel paging request at ffffc90613fc9000
> > [  305.299621] IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> > [  305.308300] PGD e000032067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> > [  305.315393] Oops: 0002 [#1] SMP
> > [...snip]
> > [  305.420953] task: ffff8e1c01ced600 ti: ffff8e1c03ec2000 task.ti: ffff8e1c03ec2000
> > [  305.429292] RIP: 0010:[<ffffffff8103d645>]  [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
> > [...snip]
> > 
> > After tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback,
> > I found that the code uses walk_system_ram_res to fill in crash memory region
> > information in the program headers, so it counts those small memory regions
> > that reside in a page area. But when the kernel used walk_system_ram_range in
> > fill_up_crash_elf_data to count the number of crash memory regions, it
> > filtered out small regions. I printed those small memory regions, for example:
> > 
> > kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
> > 
> > Based on the code in walk_system_ram_range, this memory region will be filtered
> > out:
> > 
> > pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> > end_pfn = (0x77592258 + 0xdc0 - 1 + 1) >> 12 = 0x77593
> > end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
> > 
> > So, the max_nr_ranges that's counted by the kernel doesn't include small memory
> > regions. That causes the page fault in the later code path that prepares
> > the ELF headers.
> > 
> > This issue is not easy to reproduce on small machines that don't have too
> > many CPUs because the allocated page aligned ELF buffer has more free space
> > to cover those small memory regions' PT_LOAD headers.
> > 
> 
> fyi, I added a cc:stable to my copy of this patch.

Note that I already have it applied, with a much improved changelog:

  e3c41e37b0f4 ("x86/kexec: Fix kexec crash in syscall kexec_file_load()")

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:x86/urgent] x86/kexec: Fix kexec crash in syscall kexec_file_load()
  2015-09-29 12:58 ` Lee, Chun-Yi
                   ` (4 preceding siblings ...)
  (?)
@ 2015-10-02  7:15 ` tip-bot for Lee, Chun-Yi
  -1 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Lee, Chun-Yi @ 2015-10-02  7:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jiang.liu, vgoyal, linux-kernel, hpa, mingo, stable, bhe, tiwai,
	peterz, viresh.kumar, sfr, joeyli.kernel, jlee, efault, tglx,
	torvalds, luto

Commit-ID:  e3c41e37b0f4b18cbd4dac76cbeece5a7558b909
Gitweb:     http://git.kernel.org/tip/e3c41e37b0f4b18cbd4dac76cbeece5a7558b909
Author:     Lee, Chun-Yi <joeyli.kernel@gmail.com>
AuthorDate: Tue, 29 Sep 2015 20:58:57 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 2 Oct 2015 09:13:06 +0200

x86/kexec: Fix kexec crash in syscall kexec_file_load()

The original bug is a page fault crash that sometimes happens
on big machines when preparing ELF headers:

    BUG: unable to handle kernel paging request at ffffc90613fc9000
    IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260

The bug is caused by us under-counting the number of memory ranges
and subsequently not allocating enough ELF header space for them.
The bug is typically masked on smaller systems, because the ELF header
allocation is rounded up to the next page.

This patch modifies the code in fill_up_crash_elf_data() by using
walk_system_ram_res() instead of walk_system_ram_range() to correctly
count the max number of crash memory ranges. That's because
walk_system_ram_range() filters out small memory regions that
reside in the same page, but walk_system_ram_res() does not.

Here's how I found the bug:

After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
I found that the code uses walk_system_ram_res() to fill in crash memory region
information in the program headers, so it counts those small memory regions
that reside in a page area.

But when the kernel used walk_system_ram_range() in
fill_up_crash_elf_data() to count the number of crash memory regions,
it filtered out small regions.

I printed those small memory regions, for example:

  kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the code in walk_system_ram_range(), this memory region
will be filtered out:

  pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
  end_pfn = (0x77592258 + 0xdc0 - 1 + 1) >> 12 = 0x77593
  end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE

So, the max_nr_ranges that's counted by the kernel doesn't include
small memory regions - causing us to under-allocate the required space.
That causes the page fault crash that happens in a later code path
when preparing ELF headers.

This bug is not easy to reproduce on small machines that have few
CPUs, because the allocated page aligned ELF buffer has more free
space to cover those small memory regions' PT_LOAD headers.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/crash.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..74ca2fe 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -185,10 +185,9 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 }

 #ifdef CONFIG_KEXEC_FILE
-static int get_nr_ram_ranges_callback(unsigned long start_pfn,
-				unsigned long nr_pfn, void *arg)
+static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
 {
-	int *nr_ranges = arg;
+	unsigned int *nr_ranges = arg;

 	(*nr_ranges)++;
 	return 0;
@@ -214,7 +213,7 @@ static void fill_up_crash_elf_data(struct crash_elf_data *ced,

 	ced->image = image;

-	walk_system_ram_range(0, -1, &nr_ranges,
+	walk_system_ram_res(0, -1, &nr_ranges,
 				get_nr_ram_ranges_callback);

 	ced->max_nr_ranges = nr_ranges;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-10-02  7:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
2015-09-29 12:58 [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load() Lee, Chun-Yi
2015-09-29 12:58 ` Lee, Chun-Yi
2015-09-30  3:04 ` Dave Young
2015-09-30  3:04   ` Dave Young
2015-09-30 11:27 ` Minfei Huang
2015-09-30 11:27   ` Minfei Huang
2015-10-01  8:54 ` [tip:x86/urgent] x86/kexec: Fix kexec crash " tip-bot for Lee, Chun-Yi
2015-10-01 23:07 ` [PATCH v3] kexec: fix out of the ELF headers buffer issue " Andrew Morton
2015-10-01 23:07   ` Andrew Morton
2015-10-02  7:13   ` Ingo Molnar
2015-10-02  7:13     ` Ingo Molnar
2015-10-02  7:15 ` [tip:x86/urgent] x86/kexec: Fix kexec crash " tip-bot for Lee, Chun-Yi
