linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/2] arm64/kdump: Fix OOPS and OOM issues in kdump kernel
@ 2020-07-01 22:14 Bhupesh Sharma
  2020-07-01 22:14 ` [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages() Bhupesh Sharma
  2020-07-01 22:14 ` [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA Bhupesh Sharma
  0 siblings, 2 replies; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-01 22:14 UTC (permalink / raw)
  To: cgroups, linux-mm, linux-arm-kernel
  Cc: Mark Rutland, Catalin Marinas, bhsharma, kexec, linux-kernel,
	Michal Hocko, James Morse, Vladimir Davydov, Johannes Weiner,
	bhupesh.linux, Will Deacon

Prabhakar recently reported a kdump kernel boot failure on ThunderX2
arm64 platforms (see [1]), which I was also able to reproduce on Ampere
arm64 machines. The failure is a corner case hit on some arm64 boards
when the kdump kernel runs with "cgroup_disable=memory" passed via
bootargs and the crashkernel was originally allocated either from
ZONE_DMA32 memory or from a mixture of memory chunks belonging to both
ZONE_DMA and ZONE_DMA32 regions.

While [PATCH 1/2] fixes the OOPS inside mem_cgroup_get_nr_swap_pages()
function, [PATCH 2/2] fixes the OOM seen inside the kdump kernel by
allocating the crashkernel inside ZONE_DMA region only.

[1]. https://marc.info/?l=kexec&m=158954035710703&w=4

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>

Bhupesh Sharma (2):
  mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()
  arm64: Allocate crashkernel always in ZONE_DMA

 arch/arm64/mm/init.c | 16 ++++++++++++++--
 mm/memcontrol.c      |  9 ++++++++-
 2 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.7.4


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()
  2020-07-01 22:14 [PATCH 0/2] arm64/kdump: Fix OOPS and OOM issues in kdump kernel Bhupesh Sharma
@ 2020-07-01 22:14 ` Bhupesh Sharma
  2020-07-02  6:00   ` Michal Hocko
  2020-07-01 22:14 ` [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA Bhupesh Sharma
  1 sibling, 1 reply; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-01 22:14 UTC (permalink / raw)
  To: cgroups, linux-mm, linux-arm-kernel
  Cc: Mark Rutland, Catalin Marinas, bhsharma, kexec, linux-kernel,
	Michal Hocko, James Morse, Vladimir Davydov, Johannes Weiner,
	bhupesh.linux, Will Deacon

Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
function in a corner case seen on some arm64 boards when kdump kernel
runs with "cgroup_disable=memory" passed to the kdump kernel via
bootargs.

The root cause is that mem_cgroup_swap_init() is currently registered
as a subsys_initcall() rather than a core_initcall(). Until it runs,
'cgroup_memory_noswap' keeps its default value (false), even when memcg
is disabled via the "cgroup_disable=memory" boot parameter.

Reclaim triggered from an earlier initcall (dma_atomic_pool_init() in
the trace below) can then OOPS inside mem_cgroup_get_nr_swap_pages():

  [    0.265617] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
  [    0.274495] Mem abort info:
  [    0.277311]   ESR = 0x96000006
  [    0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
  [    0.285751]   SET = 0, FnV = 0
  [    0.288830]   EA = 0, S1PTW = 0
  [    0.291995] Data abort info:
  [    0.294897]   ISV = 0, ISS = 0x00000006
  [    0.298765]   CM = 0, WnR = 0
  [    0.301757] [0000000000000188] user address but active_mm is swapper
  [    0.308174] Internal error: Oops: 96000006 [#1] SMP
  [    0.313097] Modules linked in:
  <..snip..>
  [    0.331384] pstate: 00400009 (nzcv daif +PAN -UAO BTYPE=--)
  [    0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
  [    0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
  [    0.347564] sp : fffffe0012b6f800
  [    0.350905] x29: fffffe0012b6f800 x28: fffffe00116b3000
  [    0.356268] x27: fffffe0012b6fb00 x26: 0000000000000020
  [    0.361631] x25: 0000000000000000 x24: fffffc00723ffe28
  [    0.366994] x23: fffffe0010d5b468 x22: fffffe00116bfa00
  [    0.372357] x21: fffffe0010aabda8 x20: 0000000000000000
  [    0.377720] x19: 0000000000000000 x18: 0000000000000010
  [    0.383082] x17: 0000000043e612f2 x16: 00000000a9863ed7
  [    0.388445] x15: ffffffffffffffff x14: 202c303d70617773
  [    0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
  [    0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
  [    0.404533] x9 : fffffe00100e9580 x8 : fffffe0010628160
  [    0.409895] x7 : 00000000000000a8 x6 : fffffe00118f5e5e
  [    0.415258] x5 : 0000000000000001 x4 : 0000000000000000
  [    0.420621] x3 : 0000000000000000 x2 : 0000000000000000
  [    0.425983] x1 : 0000000000000000 x0 : fffffc0060079000
  [    0.431346] Call trace:
  [    0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
  [    0.438735]  shrink_lruvec+0x404/0x4f8
  [    0.442516]  shrink_node+0x1a8/0x688
  [    0.446121]  do_try_to_free_pages+0xe8/0x448
  [    0.450429]  try_to_free_pages+0x110/0x230
  [    0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
  [    0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
  [    0.464938]  alloc_page_interleave+0x20/0x90
  [    0.469246]  alloc_pages_current+0xdc/0xf8
  [    0.473379]  atomic_pool_expand+0x60/0x210
  [    0.477514]  __dma_atomic_pool_init+0x50/0xa4
  [    0.481910]  dma_atomic_pool_init+0xac/0x158
  [    0.486220]  do_one_initcall+0x50/0x218
  [    0.490091]  kernel_init_freeable+0x22c/0x2d0
  [    0.494489]  kernel_init+0x18/0x110
  [    0.498007]  ret_from_fork+0x10/0x18
  [    0.501614] Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
  [    0.507770] ---[ end trace 9795948475817de4 ]---
  [    0.512429] Kernel panic - not syncing: Fatal exception
  [    0.517705] Rebooting in 10 seconds..
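
For readers less familiar with initcall levels, the ordering problem
described above can be sketched in plain userspace C. This is only an
illustration, not kernel code: the level numbers follow the usual
core (1) < postcore (2) < subsys (4) order, early_reclaim() is a
hypothetical stand-in for the allocation that falls into reclaim, and
placing it at the postcore level is an assumption (the trace only shows
that it runs from do_one_initcall()).

```c
#include <stdbool.h>
#include <stddef.h>

/* Subset of initcall levels, in the order they are run. */
enum initcall_level { LEVEL_CORE = 1, LEVEL_POSTCORE = 2, LEVEL_SUBSYS = 4 };

/* Simulated state; the names mirror the thread, the behaviour is a model. */
static bool memcg_disabled;           /* "cgroup_disable=memory" was passed */
static bool cgroup_memory_noswap;     /* default value: false */
static bool early_reclaim_saw_noswap; /* what the early reclaim path observed */

/* Model of mem_cgroup_swap_init(): no memory control -> no swap control. */
static void swap_init(void)
{
	if (memcg_disabled)
		cgroup_memory_noswap = true;
}

/* Hypothetical early initcall whose allocation ends up in reclaim. */
static void early_reclaim(void)
{
	early_reclaim_saw_noswap = cgroup_memory_noswap;
}

struct initcall {
	enum initcall_level level;
	void (*fn)(void);
};

/* Run the registered calls level by level, lowest level first. */
static void run_initcalls(const struct initcall *calls, size_t n)
{
	for (int level = LEVEL_CORE; level <= LEVEL_SUBSYS; level++)
		for (size_t i = 0; i < n; i++)
			if ((int)calls[i].level == level)
				calls[i].fn();
}

/* Returns true if early reclaim already saw cgroup_memory_noswap set. */
bool reclaim_sees_noswap(enum initcall_level swap_init_level)
{
	const struct initcall calls[] = {
		{ LEVEL_POSTCORE, early_reclaim },
		{ swap_init_level, swap_init },
	};

	memcg_disabled = true;
	cgroup_memory_noswap = false;
	early_reclaim_saw_noswap = false;
	run_initcalls(calls, sizeof(calls) / sizeof(calls[0]));
	return early_reclaim_saw_noswap;
}
```

In this model, reclaim_sees_noswap(LEVEL_SUBSYS) reports that the flag
was still false when reclaim ran, while reclaim_sees_noswap(LEVEL_CORE)
reports that it was already set, which is the effect the patch relies on.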

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
---
 mm/memcontrol.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19622328e4b5..8323e4b7b390 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7186,6 +7186,13 @@ static struct cftype memsw_files[] = {
 	{ },	/* terminate */
 };
 
+/*
+ * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
+ * instead of a core_initcall(), this could mean cgroup_memory_noswap still
+ * remains set to false even when memcg is disabled via "cgroup_disable=memory"
+ * boot parameter. This may result in premature OOPS inside 
+ * mem_cgroup_get_nr_swap_pages() function in corner cases.
+ */
 static int __init mem_cgroup_swap_init(void)
 {
 	/* No memory control -> no swap control */
@@ -7200,6 +7207,6 @@ static int __init mem_cgroup_swap_init(void)
 
 	return 0;
 }
-subsys_initcall(mem_cgroup_swap_init);
+core_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_MEMCG_SWAP */
-- 
2.7.4



* [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA
  2020-07-01 22:14 [PATCH 0/2] arm64/kdump: Fix OOPS and OOM issues in kdump kernel Bhupesh Sharma
  2020-07-01 22:14 ` [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages() Bhupesh Sharma
@ 2020-07-01 22:14 ` Bhupesh Sharma
  2020-07-02  7:50   ` Will Deacon
  1 sibling, 1 reply; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-01 22:14 UTC (permalink / raw)
  To: cgroups, linux-mm, linux-arm-kernel
  Cc: Mark Rutland, Catalin Marinas, bhsharma, kexec, linux-kernel,
	Michal Hocko, James Morse, Vladimir Davydov, Johannes Weiner,
	bhupesh.linux, Will Deacon

Commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
ZONE_DMA32") allocates the crashkernel for arm64 in ZONE_DMA32.

However, as reported by Prabhakar, this breaks kdump kernel booting on
ThunderX2-like arm64 systems. I have also seen this on an Ampere
arm64 machine. The OOM log in the kdump kernel looks like this:

  [    0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
  [    0.247713] swapper/0: page allocation failure: order:1, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
  <..snip..>
  [    0.274706] Call trace:
  [    0.277170]  dump_backtrace+0x0/0x208
  [    0.280863]  show_stack+0x1c/0x28
  [    0.284207]  dump_stack+0xc4/0x10c
  [    0.287638]  warn_alloc+0x104/0x170
  [    0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
  [    0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
  [    0.301530]  alloc_page_interleave+0x20/0x90
  [    0.305839]  alloc_pages_current+0xdc/0xf8
  [    0.309972]  atomic_pool_expand+0x60/0x210
  [    0.314108]  __dma_atomic_pool_init+0x50/0xa4
  [    0.318504]  dma_atomic_pool_init+0xac/0x158
  [    0.322813]  do_one_initcall+0x50/0x218
  [    0.326684]  kernel_init_freeable+0x22c/0x2d0
  [    0.331083]  kernel_init+0x18/0x110
  [    0.334600]  ret_from_fork+0x10/0x18

Limit the crashkernel allocation to the first 1GB of accessible RAM
(ZONE_DMA). Otherwise the crashkernel may be allocated either from
ZONE_DMA32 memory or from a mixture of memory chunks belonging to both
ZONE_DMA and ZONE_DMA32, and the kdump kernel can then fail GFP_DMA
allocations such as the atomic pool setup shown above.
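
As a rough userspace sketch of the range check this implies for a
user-specified base: the helper name crash_region_ok() and the
DMA_PHYS_LIMIT constant are made up for illustration, the limit is
assumed to be the end of the first 1GB with RAM starting at physical
address 0, and the 2MB alignment test reflects the boot protocol
requirement mentioned in the code comment. It is not the kernel
implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (2ULL * 1024 * 1024)
#define SZ_1G (1024ULL * 1024 * 1024)

/* Assumed stand-in for arm64_dma_phys_limit: end of the first 1GB. */
#define DMA_PHYS_LIMIT SZ_1G

/*
 * True if [base, base + size) starts on a 2MB boundary and ends at or
 * below limit. The comparison is arranged so base + size cannot overflow.
 */
bool crash_region_ok(uint64_t base, uint64_t size, uint64_t limit)
{
	if (base % SZ_2M != 0)
		return false;
	if (size > limit || base > limit - size)
		return false;
	return true;
}
```

With these assumptions, a 256MB region based at 768MB is accepted
because it ends exactly at the limit, while a 512MB region at the same
base is rejected.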

Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
---
 arch/arm64/mm/init.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..02ae4d623802 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
 	crash_size = PAGE_ALIGN(crash_size);
 
 	if (crash_base == 0) {
-		/* Current arm64 boot protocol requires 2MB alignment */
-		crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
+		/* Current arm64 boot protocol requires 2MB alignment.
+		 * Also limit the crashkernel allocation to the first
+		 * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
+		 * might run into OOM issues when crashkernel is executed,
+		 * as it might have been originally allocated from
+		 * either a ZONE_DMA32 memory or mixture of memory
+		 * chunks belonging to both ZONE_DMA and ZONE_DMA32.
+		 */
+		crash_base = memblock_find_in_range(0, arm64_dma_phys_limit,
 				crash_size, SZ_2M);
 		if (crash_base == 0) {
 			pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
@@ -101,6 +108,11 @@ static void __init reserve_crashkernel(void)
 		}
 	} else {
 		/* User specifies base address explicitly. */
+		if (crash_base + crash_size > arm64_dma_phys_limit) {
+			pr_warn("cannot reserve crashkernel: region is allocatable only in ZONE_DMA range\n");
+			return;
+		}
+
 		if (!memblock_is_region_memory(crash_base, crash_size)) {
 			pr_warn("cannot reserve crashkernel: region is not memory\n");
 			return;
-- 
2.7.4



* Re: [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()
  2020-07-01 22:14 ` [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages() Bhupesh Sharma
@ 2020-07-02  6:00   ` Michal Hocko
  2020-07-02 18:55     ` Bhupesh Sharma
  2020-07-03  6:43     ` Michal Hocko
  0 siblings, 2 replies; 10+ messages in thread
From: Michal Hocko @ 2020-07-02  6:00 UTC (permalink / raw)
  To: Bhupesh Sharma
  Cc: Mark Rutland, Catalin Marinas, kexec, linux-kernel, linux-mm,
	James Morse, Vladimir Davydov, Johannes Weiner, cgroups,
	bhupesh.linux, Will Deacon, linux-arm-kernel

On Thu 02-07-20 03:44:19, Bhupesh Sharma wrote:
> Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
> function in a corner case seen on some arm64 boards when kdump kernel
> runs with "cgroup_disable=memory" passed to the kdump kernel via
> bootargs.
> 
> The root-cause behind the same is that currently mem_cgroup_swap_init()
> function is implemented as a subsys_initcall() call instead of a
> core_initcall(), this means 'cgroup_memory_noswap' still
> remains set to the default value (false) even when memcg is disabled via
> "cgroup_disable=memory" boot parameter.
> 
> This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
> function in corner cases:
> 
>   [    0.265617] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
>   [    0.274495] Mem abort info:
>   [    0.277311]   ESR = 0x96000006
>   [    0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
>   [    0.285751]   SET = 0, FnV = 0
>   [    0.288830]   EA = 0, S1PTW = 0
>   [    0.291995] Data abort info:
>   [    0.294897]   ISV = 0, ISS = 0x00000006
>   [    0.298765]   CM = 0, WnR = 0
>   [    0.301757] [0000000000000188] user address but active_mm is swapper
>   [    0.308174] Internal error: Oops: 96000006 [#1] SMP
>   [    0.313097] Modules linked in:
>   <..snip..>
>   [    0.331384] pstate: 00400009 (nzcv daif +PAN -UAO BTYPE=--)
>   [    0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
>   [    0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
>   [    0.347564] sp : fffffe0012b6f800
>   [    0.350905] x29: fffffe0012b6f800 x28: fffffe00116b3000
>   [    0.356268] x27: fffffe0012b6fb00 x26: 0000000000000020
>   [    0.361631] x25: 0000000000000000 x24: fffffc00723ffe28
>   [    0.366994] x23: fffffe0010d5b468 x22: fffffe00116bfa00
>   [    0.372357] x21: fffffe0010aabda8 x20: 0000000000000000
>   [    0.377720] x19: 0000000000000000 x18: 0000000000000010
>   [    0.383082] x17: 0000000043e612f2 x16: 00000000a9863ed7
>   [    0.388445] x15: ffffffffffffffff x14: 202c303d70617773
>   [    0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
>   [    0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
>   [    0.404533] x9 : fffffe00100e9580 x8 : fffffe0010628160
>   [    0.409895] x7 : 00000000000000a8 x6 : fffffe00118f5e5e
>   [    0.415258] x5 : 0000000000000001 x4 : 0000000000000000
>   [    0.420621] x3 : 0000000000000000 x2 : 0000000000000000
>   [    0.425983] x1 : 0000000000000000 x0 : fffffc0060079000
>   [    0.431346] Call trace:
>   [    0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
>   [    0.438735]  shrink_lruvec+0x404/0x4f8
>   [    0.442516]  shrink_node+0x1a8/0x688
>   [    0.446121]  do_try_to_free_pages+0xe8/0x448
>   [    0.450429]  try_to_free_pages+0x110/0x230
>   [    0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
>   [    0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
>   [    0.464938]  alloc_page_interleave+0x20/0x90
>   [    0.469246]  alloc_pages_current+0xdc/0xf8
>   [    0.473379]  atomic_pool_expand+0x60/0x210
>   [    0.477514]  __dma_atomic_pool_init+0x50/0xa4
>   [    0.481910]  dma_atomic_pool_init+0xac/0x158
>   [    0.486220]  do_one_initcall+0x50/0x218
>   [    0.490091]  kernel_init_freeable+0x22c/0x2d0
>   [    0.494489]  kernel_init+0x18/0x110
>   [    0.498007]  ret_from_fork+0x10/0x18
>   [    0.501614] Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
>   [    0.507770] ---[ end trace 9795948475817de4 ]---
>   [    0.512429] Kernel panic - not syncing: Fatal exception
>   [    0.517705] Rebooting in 10 seconds..
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kexec@lists.infradead.org

Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for integration")

> Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>

This is subtle as hell, I have to say. I find the ordering in the init
calls very unintuitive and extremely hard to follow. The above commit
introduced the problem, but the code previously worked mostly by
luck because our default was flipped.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19622328e4b5..8323e4b7b390 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7186,6 +7186,13 @@ static struct cftype memsw_files[] = {
>  	{ },	/* terminate */
>  };
>  
> +/*
> + * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> + * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> + * remains set to false even when memcg is disabled via "cgroup_disable=memory"
> + * boot parameter. This may result in premature OOPS inside 
> + * mem_cgroup_get_nr_swap_pages() function in corner cases.
> + */
>  static int __init mem_cgroup_swap_init(void)
>  {
>  	/* No memory control -> no swap control */
> @@ -7200,6 +7207,6 @@ static int __init mem_cgroup_swap_init(void)
>  
>  	return 0;
>  }
> -subsys_initcall(mem_cgroup_swap_init);
> +core_initcall(mem_cgroup_swap_init);
>  
>  #endif /* CONFIG_MEMCG_SWAP */
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA
  2020-07-01 22:14 ` [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA Bhupesh Sharma
@ 2020-07-02  7:50   ` Will Deacon
  2020-07-02 19:22     ` Bhupesh Sharma
  0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2020-07-02  7:50 UTC (permalink / raw)
  To: Bhupesh Sharma
  Cc: Mark Rutland, Catalin Marinas, kexec, linux-kernel, Michal Hocko,
	linux-mm, James Morse, Vladimir Davydov, Johannes Weiner,
	cgroups, bhupesh.linux, linux-arm-kernel

On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
> commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
> ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
> 
> However as reported by Prabhakar, this breaks kdump kernel booting in
> ThunderX2 like arm64 systems. I have noticed this on another ampere
> arm64 machine. The OOM log in the kdump kernel looks like this:
> 
>   [    0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
>   [    0.247713] swapper/0: page allocation failure: order:1, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
>   <..snip..>
>   [    0.274706] Call trace:
>   [    0.277170]  dump_backtrace+0x0/0x208
>   [    0.280863]  show_stack+0x1c/0x28
>   [    0.284207]  dump_stack+0xc4/0x10c
>   [    0.287638]  warn_alloc+0x104/0x170
>   [    0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
>   [    0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
>   [    0.301530]  alloc_page_interleave+0x20/0x90
>   [    0.305839]  alloc_pages_current+0xdc/0xf8
>   [    0.309972]  atomic_pool_expand+0x60/0x210
>   [    0.314108]  __dma_atomic_pool_init+0x50/0xa4
>   [    0.318504]  dma_atomic_pool_init+0xac/0x158
>   [    0.322813]  do_one_initcall+0x50/0x218
>   [    0.326684]  kernel_init_freeable+0x22c/0x2d0
>   [    0.331083]  kernel_init+0x18/0x110
>   [    0.334600]  ret_from_fork+0x10/0x18
> 
> This patch limits the crashkernel allocation to the first 1GB of
> the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
> issues when crashkernel is executed, as it might have been originally
> allocated from either a ZONE_DMA32 memory or mixture of memory chunks
> belonging to both ZONE_DMA and ZONE_DMA32.

How does this interact with this ongoing series:

https://lore.kernel.org/r/20200628083458.40066-1-chenzhou10@huawei.com

(patch 4, in particular)

> Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kexec@lists.infradead.org
> Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> ---
>  arch/arm64/mm/init.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1e93cfc7c47a..02ae4d623802 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
>  	crash_size = PAGE_ALIGN(crash_size);
>  
>  	if (crash_base == 0) {
> -		/* Current arm64 boot protocol requires 2MB alignment */
> -		crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
> +		/* Current arm64 boot protocol requires 2MB alignment.
> +		 * Also limit the crashkernel allocation to the first
> +		 * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
> +		 * might run into OOM issues when crashkernel is executed,
> +		 * as it might have been originally allocated from
> +		 * either a ZONE_DMA32 memory or mixture of memory
> +		 * chunks belonging to both ZONE_DMA and ZONE_DMA32.
> +		 */

This comment needs help. Why does putting the crashkernel in ZONE_DMA
prevent "OOM issues"?

Will


* Re: [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()
  2020-07-02  6:00   ` Michal Hocko
@ 2020-07-02 18:55     ` Bhupesh Sharma
  2020-07-03  6:43     ` Michal Hocko
  1 sibling, 0 replies; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-02 18:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mark Rutland, Catalin Marinas, kexec mailing list,
	Linux Kernel Mailing List, linux-mm, James Morse,
	Vladimir Davydov, Johannes Weiner, cgroups, Bhupesh SHARMA,
	Will Deacon, linux-arm-kernel

Hi Michal,

On Thu, Jul 2, 2020 at 11:30 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 02-07-20 03:44:19, Bhupesh Sharma wrote:
> > Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in a corner case seen on some arm64 boards when kdump kernel
> > runs with "cgroup_disable=memory" passed to the kdump kernel via
> > bootargs.
> >
> > The root-cause behind the same is that currently mem_cgroup_swap_init()
> > function is implemented as a subsys_initcall() call instead of a
> > core_initcall(), this means 'cgroup_memory_noswap' still
> > remains set to the default value (false) even when memcg is disabled via
> > "cgroup_disable=memory" boot parameter.
> >
> > This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in corner cases:
> >
> >   [    0.265617] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
> >   [    0.274495] Mem abort info:
> >   [    0.277311]   ESR = 0x96000006
> >   [    0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
> >   [    0.285751]   SET = 0, FnV = 0
> >   [    0.288830]   EA = 0, S1PTW = 0
> >   [    0.291995] Data abort info:
> >   [    0.294897]   ISV = 0, ISS = 0x00000006
> >   [    0.298765]   CM = 0, WnR = 0
> >   [    0.301757] [0000000000000188] user address but active_mm is swapper
> >   [    0.308174] Internal error: Oops: 96000006 [#1] SMP
> >   [    0.313097] Modules linked in:
> >   <..snip..>
> >   [    0.331384] pstate: 00400009 (nzcv daif +PAN -UAO BTYPE=--)
> >   [    0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [    0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
> >   [    0.347564] sp : fffffe0012b6f800
> >   [    0.350905] x29: fffffe0012b6f800 x28: fffffe00116b3000
> >   [    0.356268] x27: fffffe0012b6fb00 x26: 0000000000000020
> >   [    0.361631] x25: 0000000000000000 x24: fffffc00723ffe28
> >   [    0.366994] x23: fffffe0010d5b468 x22: fffffe00116bfa00
> >   [    0.372357] x21: fffffe0010aabda8 x20: 0000000000000000
> >   [    0.377720] x19: 0000000000000000 x18: 0000000000000010
> >   [    0.383082] x17: 0000000043e612f2 x16: 00000000a9863ed7
> >   [    0.388445] x15: ffffffffffffffff x14: 202c303d70617773
> >   [    0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
> >   [    0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
> >   [    0.404533] x9 : fffffe00100e9580 x8 : fffffe0010628160
> >   [    0.409895] x7 : 00000000000000a8 x6 : fffffe00118f5e5e
> >   [    0.415258] x5 : 0000000000000001 x4 : 0000000000000000
> >   [    0.420621] x3 : 0000000000000000 x2 : 0000000000000000
> >   [    0.425983] x1 : 0000000000000000 x0 : fffffc0060079000
> >   [    0.431346] Call trace:
> >   [    0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [    0.438735]  shrink_lruvec+0x404/0x4f8
> >   [    0.442516]  shrink_node+0x1a8/0x688
> >   [    0.446121]  do_try_to_free_pages+0xe8/0x448
> >   [    0.450429]  try_to_free_pages+0x110/0x230
> >   [    0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
> >   [    0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
> >   [    0.464938]  alloc_page_interleave+0x20/0x90
> >   [    0.469246]  alloc_pages_current+0xdc/0xf8
> >   [    0.473379]  atomic_pool_expand+0x60/0x210
> >   [    0.477514]  __dma_atomic_pool_init+0x50/0xa4
> >   [    0.481910]  dma_atomic_pool_init+0xac/0x158
> >   [    0.486220]  do_one_initcall+0x50/0x218
> >   [    0.490091]  kernel_init_freeable+0x22c/0x2d0
> >   [    0.494489]  kernel_init+0x18/0x110
> >   [    0.498007]  ret_from_fork+0x10/0x18
> >   [    0.501614] Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
> >   [    0.507770] ---[ end trace 9795948475817de4 ]---
> >   [    0.512429] Kernel panic - not syncing: Fatal exception
> >   [    0.517705] Rebooting in 10 seconds..
> >
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: James Morse <james.morse@arm.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: kexec@lists.infradead.org
>
> Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for integration")
>
> > Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
>
> This is subtle as hell, I have to say. I find the ordering in the init
> calls very unintuitive and extremely hard to follow. The above commit
> introduced the problem, but the code previously worked mostly by
> luck because our default was flipped.
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks for reviewing the patch. Indeed, it's quite a corner case, seen
only on selected arm64 machines.

Regards,
Bhupesh

> > ---
> >  mm/memcontrol.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 19622328e4b5..8323e4b7b390 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7186,6 +7186,13 @@ static struct cftype memsw_files[] = {
> >       { },    /* terminate */
> >  };
> >
> > +/*
> > + * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> > + * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> > + * remains set to false even when memcg is disabled via "cgroup_disable=memory"
> > + * boot parameter. This may result in premature OOPS inside
> > + * mem_cgroup_get_nr_swap_pages() function in corner cases.
> > + */
> >  static int __init mem_cgroup_swap_init(void)
> >  {
> >       /* No memory control -> no swap control */
> > @@ -7200,6 +7207,6 @@ static int __init mem_cgroup_swap_init(void)
> >
> >       return 0;
> >  }
> > -subsys_initcall(mem_cgroup_swap_init);
> > +core_initcall(mem_cgroup_swap_init);
> >
> >  #endif /* CONFIG_MEMCG_SWAP */
> > --
> > 2.7.4
>
> --
> Michal Hocko
> SUSE Labs
>



* Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA
  2020-07-02  7:50   ` Will Deacon
@ 2020-07-02 19:22     ` Bhupesh Sharma
  2020-07-03  5:24       ` chenzhou
  0 siblings, 1 reply; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-02 19:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, Catalin Marinas, kexec mailing list,
	Linux Kernel Mailing List, Michal Hocko, linux-mm, James Morse,
	Vladimir Davydov, Johannes Weiner, cgroups, Bhupesh SHARMA,
	linux-arm-kernel

Hi Will,

On Thu, Jul 2, 2020 at 1:20 PM Will Deacon <will@kernel.org> wrote:
>
> On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
> > commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
> > ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
> >
> > However as reported by Prabhakar, this breaks kdump kernel booting in
> > ThunderX2 like arm64 systems. I have noticed this on another ampere
> > arm64 machine. The OOM log in the kdump kernel looks like this:
> >
> >   [    0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> >   [    0.247713] swapper/0: page allocation failure: order:1, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
> >   <..snip..>
> >   [    0.274706] Call trace:
> >   [    0.277170]  dump_backtrace+0x0/0x208
> >   [    0.280863]  show_stack+0x1c/0x28
> >   [    0.284207]  dump_stack+0xc4/0x10c
> >   [    0.287638]  warn_alloc+0x104/0x170
> >   [    0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
> >   [    0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
> >   [    0.301530]  alloc_page_interleave+0x20/0x90
> >   [    0.305839]  alloc_pages_current+0xdc/0xf8
> >   [    0.309972]  atomic_pool_expand+0x60/0x210
> >   [    0.314108]  __dma_atomic_pool_init+0x50/0xa4
> >   [    0.318504]  dma_atomic_pool_init+0xac/0x158
> >   [    0.322813]  do_one_initcall+0x50/0x218
> >   [    0.326684]  kernel_init_freeable+0x22c/0x2d0
> >   [    0.331083]  kernel_init+0x18/0x110
> >   [    0.334600]  ret_from_fork+0x10/0x18
> >
> > This patch limits the crashkernel allocation to the first 1GB of
> > the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
> > issues when crashkernel is executed, as it might have been originally
> > allocated from either a ZONE_DMA32 memory or mixture of memory chunks
> > belonging to both ZONE_DMA and ZONE_DMA32.
>
> How does this interact with this ongoing series:
>
> https://lore.kernel.org/r/20200628083458.40066-1-chenzhou10@huawei.com
>
> (patch 4, in particular)

Many thanks for having a look at this patchset. I was not aware that
Chen had sent out a new version.
I had noted in the v9 review of the high/low range allocation
<https://lists.gt.net/linux/kernel/3726052#3726052> that I was working
on a generic solution (irrespective of the crashkernel, low and high
range allocation) which resulted in this patchset.

The issue is two-fold: an OOPS in the memcg layer (PATCH 1/2, which has
been Acked-by the memcg maintainer) and an OOM in the kdump kernel due
to crashkernel allocation in ZONE_DMA32 region(s), which is addressed by
this PATCH.
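
To make the OOM half concrete, here is a minimal userspace sketch (not
kernel code) of how a reserved crashkernel range splits across zones.
It assumes, as the patch description does, that ZONE_DMA covers the
first 1GB of RAM and that ZONE_DMA32 ends at 4GB, with RAM starting at
physical address 0. The names split_crash_region() and
gfp_dma_can_succeed() are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G            (1ULL << 30)
#define ZONE_DMA_LIMIT   (1ULL * SZ_1G)  /* assumption: first 1GB of RAM */
#define ZONE_DMA32_LIMIT (4ULL * SZ_1G)  /* assumption: first 4GB of RAM */

/* Bytes of the crashkernel region that land in each zone. */
struct zone_split {
	uint64_t dma;
	uint64_t dma32;
	uint64_t normal;
};

/* Length of the overlap between [start, end) and [lo, hi). */
static uint64_t overlap(uint64_t start, uint64_t end, uint64_t lo, uint64_t hi)
{
	uint64_t s = start > lo ? start : lo;
	uint64_t e = end < hi ? end : hi;

	return e > s ? e - s : 0;
}

/* Hypothetical helper: classify a reserved region by zone. */
static struct zone_split split_crash_region(uint64_t base, uint64_t size)
{
	struct zone_split z;
	uint64_t end;

	assert(size <= UINT64_MAX - base);  /* no address wrap-around */
	end = base + size;

	z.dma    = overlap(base, end, 0, ZONE_DMA_LIMIT);
	z.dma32  = overlap(base, end, ZONE_DMA_LIMIT, ZONE_DMA32_LIMIT);
	z.normal = overlap(base, end, ZONE_DMA32_LIMIT, UINT64_MAX);
	return z;
}

/*
 * The kdump kernel only sees the reserved region, so a GFP_DMA request
 * can only be satisfied if some of that region lies in ZONE_DMA.
 */
static int gfp_dma_can_succeed(struct zone_split z)
{
	return z.dma > 0;
}
```

Under these assumptions, a 512MB region placed at 2GB has no ZONE_DMA
pages at all, which matches the GFP_KERNEL|GFP_DMA failure in the log,
while the same region placed at 256MB lies entirely within ZONE_DMA.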

I will have a closer look at the v10 patchset Chen shared, but it seems
to need some rework as per Dave's review comments which he shared
today.
IMO, in the meanwhile this patchset can be used to fix the existing
kdump issue with the upstream kernel.

> > Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: James Morse <james.morse@arm.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: kexec@lists.infradead.org
> > Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> > ---
> >  arch/arm64/mm/init.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1e93cfc7c47a..02ae4d623802 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
> >       crash_size = PAGE_ALIGN(crash_size);
> >
> >       if (crash_base == 0) {
> > -             /* Current arm64 boot protocol requires 2MB alignment */
> > -             crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
> > +             /* Current arm64 boot protocol requires 2MB alignment.
> > +              * Also limit the crashkernel allocation to the first
> > +              * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
> > +              * might run into OOM issues when crashkernel is executed,
> > +              * as it might have been originally allocated from
> > +              * either a ZONE_DMA32 memory or mixture of memory
> > +              * chunks belonging to both ZONE_DMA and ZONE_DMA32.
> > +              */
>
> This comment needs help. Why does putting the crashkernel in ZONE_DMA
> prevent "OOM issues"?

Sure, I can work on adding more details in the comment so that it
explains the potential OOM issue(s) better.

Thanks,
Bhupesh



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA
  2020-07-02 19:22     ` Bhupesh Sharma
@ 2020-07-03  5:24       ` chenzhou
  2020-07-03  7:39         ` Bhupesh Sharma
  0 siblings, 1 reply; 10+ messages in thread
From: chenzhou @ 2020-07-03  5:24 UTC (permalink / raw)
  To: Bhupesh Sharma, Will Deacon
  Cc: Mark Rutland, Catalin Marinas, kexec mailing list,
	Linux Kernel Mailing List, Michal Hocko, linux-mm, James Morse,
	Vladimir Davydov, Johannes Weiner, cgroups, Bhupesh SHARMA,
	linux-arm-kernel

Hi Bhupesh,


On 2020/7/3 3:22, Bhupesh Sharma wrote:
> Hi Will,
>
> On Thu, Jul 2, 2020 at 1:20 PM Will Deacon <will@kernel.org> wrote:
>> On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
>>> commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
>>> ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
>>>
>>> However as reported by Prabhakar, this breaks kdump kernel booting in
>>> ThunderX2 like arm64 systems. I have noticed this on another ampere
>>> arm64 machine. The OOM log in the kdump kernel looks like this:
>>>
>>>   [    0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
>>>   [    0.247713] swapper/0: page allocation failure: order:1, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
>>>   <..snip..>
>>>   [    0.274706] Call trace:
>>>   [    0.277170]  dump_backtrace+0x0/0x208
>>>   [    0.280863]  show_stack+0x1c/0x28
>>>   [    0.284207]  dump_stack+0xc4/0x10c
>>>   [    0.287638]  warn_alloc+0x104/0x170
>>>   [    0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
>>>   [    0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
>>>   [    0.301530]  alloc_page_interleave+0x20/0x90
>>>   [    0.305839]  alloc_pages_current+0xdc/0xf8
>>>   [    0.309972]  atomic_pool_expand+0x60/0x210
>>>   [    0.314108]  __dma_atomic_pool_init+0x50/0xa4
>>>   [    0.318504]  dma_atomic_pool_init+0xac/0x158
>>>   [    0.322813]  do_one_initcall+0x50/0x218
>>>   [    0.326684]  kernel_init_freeable+0x22c/0x2d0
>>>   [    0.331083]  kernel_init+0x18/0x110
>>>   [    0.334600]  ret_from_fork+0x10/0x18
>>>
>>> This patch limits the crashkernel allocation to the first 1GB of
>>> the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
>>> issues when crashkernel is executed, as it might have been originally
>>> allocated from either a ZONE_DMA32 memory or mixture of memory chunks
>>> belonging to both ZONE_DMA and ZONE_DMA32.
>> How does this interact with this ongoing series:
>>
>> https://lore.kernel.org/r/20200628083458.40066-1-chenzhou10@huawei.com
>>
>> (patch 4, in particular)
> Many thanks for having a look at this patchset. I was not aware that
> Chen had sent out a new version.
> I had noted in the v9 review of the high/low range allocation
> <https://lists.gt.net/linux/kernel/3726052#3726052> that I was working
> on a generic solution (irrespective of the crashkernel, low and high
> range allocation) which resulted in this patchset.
>
> The issue is two-fold: OOPs in memcfg layer (PATCH 1/2, which has been
> Acked-by memcfg maintainer) and OOM in the kdump kernel due to
> crashkernel allocation in ZONE_DMA32 regions(s) which is addressed by
> this PATCH.
>
> I will have a closer look at the v10 patchset Chen shared, but seems
> it needs some rework as per Dave's review comments which he shared
> today.
> IMO, in the meanwhile this patchset  can be used to fix the existing
> kdump issue with upstream kernel.
Thanks for your work.
There had been no progress on the issue for a long time, so I sent my
solution in the v8 comments and sent v9 recently.

I think directly limiting the crashkernel to ZONE_DMA isn't a good idea:
1. For parameter "crashkernel=Y", reserving the crashkernel in the first 1G of memory
will increase the probability of memory allocation failure.
Previous discussion at https://lkml.org/lkml/2019/10/21/725:
    "With ZONE_DMA=y, this config will fail to reserve 512M CMA on a server"

2. For parameter "crashkernel=Y@X", limiting the crashkernel to ZONE_DMA is unreasonable
for someone who really wants to reserve the crashkernel at a specified start address.
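
Point 1 can be illustrated with a small userspace sketch (not kernel
code). find_below() is a hypothetical, simplified stand-in for a
top-down search such as memblock_find_in_range(): it returns the base of
a free, aligned block below a limit, or 0 on failure. The free-range
layout used in the note below is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)  /* arm64 boot protocol alignment */

/* A free physical range [start, end); the array is sorted ascending. */
struct free_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: search the free ranges from the top down for
 * `size` bytes aligned to `align`, entirely below `limit`.
 * Returns the base address found, or 0 if nothing fits.
 */
static uint64_t find_below(const struct free_range *r, size_t n,
			   uint64_t limit, uint64_t size, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);  /* power of two */

	for (size_t i = n; i-- > 0; ) {
		uint64_t end = r[i].end < limit ? r[i].end : limit;
		uint64_t cand;

		/* Skip ranges above the limit or too small after clamping. */
		if (end <= r[i].start || end - r[i].start < size)
			continue;

		cand = (end - size) & ~(align - 1);
		if (cand >= r[i].start)
			return cand;
	}
	return 0;
}
```

With an invented layout where only 384MB is free below 1GB and 2GB is
free between 2GB and 4GB, a 512MB request succeeds with a 4GB limit but
returns 0 with a 1GB limit, which is the failure mode described in
point 1.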

I have sent v10: https://www.spinics.net/lists/arm-kernel/msg819408.html, any comments are welcome.

Thanks,
Chen Zhou
>
>>> Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>>> Cc: James Morse <james.morse@arm.com>
>>> Cc: Mark Rutland <mark.rutland@arm.com>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: cgroups@vger.kernel.org
>>> Cc: linux-mm@kvack.org
>>> Cc: linux-arm-kernel@lists.infradead.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: kexec@lists.infradead.org
>>> Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
>>> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
>>> ---
>>>  arch/arm64/mm/init.c | 16 ++++++++++++++--
>>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>> index 1e93cfc7c47a..02ae4d623802 100644
>>> --- a/arch/arm64/mm/init.c
>>> +++ b/arch/arm64/mm/init.c
>>> @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
>>>       crash_size = PAGE_ALIGN(crash_size);
>>>
>>>       if (crash_base == 0) {
>>> -             /* Current arm64 boot protocol requires 2MB alignment */
>>> -             crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
>>> +             /* Current arm64 boot protocol requires 2MB alignment.
>>> +              * Also limit the crashkernel allocation to the first
>>> +              * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
>>> +              * might run into OOM issues when crashkernel is executed,
>>> +              * as it might have been originally allocated from
>>> +              * either a ZONE_DMA32 memory or mixture of memory
>>> +              * chunks belonging to both ZONE_DMA and ZONE_DMA32.
>>> +              */
>> This comment needs help. Why does putting the crashkernel in ZONE_DMA
>> prevent "OOM issues"?
> Sure, I can work on adding more details in the comment so that it
> explains the potential OOM issue(s) better.
>
> Thanks,
> Bhupesh
>
>
> .
>




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages()
  2020-07-02  6:00   ` Michal Hocko
  2020-07-02 18:55     ` Bhupesh Sharma
@ 2020-07-03  6:43     ` Michal Hocko
  1 sibling, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2020-07-03  6:43 UTC (permalink / raw)
  To: Bhupesh Sharma
  Cc: Mark Rutland, Catalin Marinas, kexec, linux-kernel, linux-mm,
	James Morse, Vladimir Davydov, Johannes Weiner, cgroups,
	Andrew Morton, bhupesh.linux, Will Deacon, linux-arm-kernel

[Cc Andrew - the patch is http://lkml.kernel.org/r/1593641660-13254-2-git-send-email-bhsharma@redhat.com]

On Thu 02-07-20 08:00:27, Michal Hocko wrote:
> On Thu 02-07-20 03:44:19, Bhupesh Sharma wrote:
> > Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in a corner case seen on some arm64 boards when kdump kernel
> > runs with "cgroup_disable=memory" passed to the kdump kernel via
> > bootargs.
> > 
> > The root-cause behind the same is that currently mem_cgroup_swap_init()
> > function is implemented as a subsys_initcall() call instead of a
> > core_initcall(), this means 'cgroup_memory_noswap' still
> > remains set to the default value (false) even when memcg is disabled via
> > "cgroup_disable=memory" boot parameter.
> > 
> > This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
> > function in corner cases:
> > 
> >   [    0.265617] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
> >   [    0.274495] Mem abort info:
> >   [    0.277311]   ESR = 0x96000006
> >   [    0.280389]   EC = 0x25: DABT (current EL), IL = 32 bits
> >   [    0.285751]   SET = 0, FnV = 0
> >   [    0.288830]   EA = 0, S1PTW = 0
> >   [    0.291995] Data abort info:
> >   [    0.294897]   ISV = 0, ISS = 0x00000006
> >   [    0.298765]   CM = 0, WnR = 0
> >   [    0.301757] [0000000000000188] user address but active_mm is swapper
> >   [    0.308174] Internal error: Oops: 96000006 [#1] SMP
> >   [    0.313097] Modules linked in:
> >   <..snip..>
> >   [    0.331384] pstate: 00400009 (nzcv daif +PAN -UAO BTYPE=--)
> >   [    0.337014] pc : mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [    0.342289] lr : mem_cgroup_get_nr_swap_pages+0x68/0xf4
> >   [    0.347564] sp : fffffe0012b6f800
> >   [    0.350905] x29: fffffe0012b6f800 x28: fffffe00116b3000
> >   [    0.356268] x27: fffffe0012b6fb00 x26: 0000000000000020
> >   [    0.361631] x25: 0000000000000000 x24: fffffc00723ffe28
> >   [    0.366994] x23: fffffe0010d5b468 x22: fffffe00116bfa00
> >   [    0.372357] x21: fffffe0010aabda8 x20: 0000000000000000
> >   [    0.377720] x19: 0000000000000000 x18: 0000000000000010
> >   [    0.383082] x17: 0000000043e612f2 x16: 00000000a9863ed7
> >   [    0.388445] x15: ffffffffffffffff x14: 202c303d70617773
> >   [    0.393808] x13: 6f6e5f79726f6d65 x12: 6d5f70756f726763
> >   [    0.399170] x11: 2073656761705f70 x10: 6177735f726e5f74
> >   [    0.404533] x9 : fffffe00100e9580 x8 : fffffe0010628160
> >   [    0.409895] x7 : 00000000000000a8 x6 : fffffe00118f5e5e
> >   [    0.415258] x5 : 0000000000000001 x4 : 0000000000000000
> >   [    0.420621] x3 : 0000000000000000 x2 : 0000000000000000
> >   [    0.425983] x1 : 0000000000000000 x0 : fffffc0060079000
> >   [    0.431346] Call trace:
> >   [    0.433809]  mem_cgroup_get_nr_swap_pages+0x9c/0xf4
> >   [    0.438735]  shrink_lruvec+0x404/0x4f8
> >   [    0.442516]  shrink_node+0x1a8/0x688
> >   [    0.446121]  do_try_to_free_pages+0xe8/0x448
> >   [    0.450429]  try_to_free_pages+0x110/0x230
> >   [    0.454563]  __alloc_pages_slowpath.constprop.106+0x2b8/0xb48
> >   [    0.460366]  __alloc_pages_nodemask+0x2ac/0x2f8
> >   [    0.464938]  alloc_page_interleave+0x20/0x90
> >   [    0.469246]  alloc_pages_current+0xdc/0xf8
> >   [    0.473379]  atomic_pool_expand+0x60/0x210
> >   [    0.477514]  __dma_atomic_pool_init+0x50/0xa4
> >   [    0.481910]  dma_atomic_pool_init+0xac/0x158
> >   [    0.486220]  do_one_initcall+0x50/0x218
> >   [    0.490091]  kernel_init_freeable+0x22c/0x2d0
> >   [    0.494489]  kernel_init+0x18/0x110
> >   [    0.498007]  ret_from_fork+0x10/0x18
> >   [    0.501614] Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
> >   [    0.507770] ---[ end trace 9795948475817de4 ]---
> >   [    0.512429] Kernel panic - not syncing: Fatal exception
> >   [    0.517705] Rebooting in 10 seconds..
> > 
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: James Morse <james.morse@arm.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: kexec@lists.infradead.org
> 
> Fixes: eccb52e78809 ("mm: memcontrol: prepare swap controller setup for integration")
> 
> > Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> 
> This is subtle as hell, I have to say. I find the ordering in the init
> calls very unintuitive and extremely hard to follow. The above commit
> has introduced the problem but the code previously has worked mostly by
> a luck because our default was flipped.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> > ---
> >  mm/memcontrol.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 19622328e4b5..8323e4b7b390 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7186,6 +7186,13 @@ static struct cftype memsw_files[] = {
> >  	{ },	/* terminate */
> >  };
> >  
> > +/*
> > + * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> > + * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> > + * remains set to false even when memcg is disabled via "cgroup_disable=memory"
> > + * boot parameter. This may result in premature OOPS inside 
> > + * mem_cgroup_get_nr_swap_pages() function in corner cases.
> > + */
> >  static int __init mem_cgroup_swap_init(void)
> >  {
> >  	/* No memory control -> no swap control */
> > @@ -7200,6 +7207,6 @@ static int __init mem_cgroup_swap_init(void)
> >  
> >  	return 0;
> >  }
> > -subsys_initcall(mem_cgroup_swap_init);
> > +core_initcall(mem_cgroup_swap_init);
> >  
> >  #endif /* CONFIG_MEMCG_SWAP */
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA
  2020-07-03  5:24       ` chenzhou
@ 2020-07-03  7:39         ` Bhupesh Sharma
  0 siblings, 0 replies; 10+ messages in thread
From: Bhupesh Sharma @ 2020-07-03  7:39 UTC (permalink / raw)
  To: chenzhou
  Cc: Mark Rutland, Catalin Marinas, kexec mailing list,
	Linux Kernel Mailing List, Michal Hocko, linux-mm, James Morse,
	Vladimir Davydov, Johannes Weiner, cgroups, Bhupesh SHARMA,
	Will Deacon, linux-arm-kernel

Hi Chen,

On Fri, Jul 3, 2020 at 10:54 AM chenzhou <chenzhou10@huawei.com> wrote:
>
> Hi Bhupesh,
>
>
> On 2020/7/3 3:22, Bhupesh Sharma wrote:
> > Hi Will,
> >
> > On Thu, Jul 2, 2020 at 1:20 PM Will Deacon <will@kernel.org> wrote:
> >> On Thu, Jul 02, 2020 at 03:44:20AM +0530, Bhupesh Sharma wrote:
> >>> commit bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in
> >>> ZONE_DMA32") allocates crashkernel for arm64 in the ZONE_DMA32.
> >>>
> >>> However as reported by Prabhakar, this breaks kdump kernel booting in
> >>> ThunderX2 like arm64 systems. I have noticed this on another ampere
> >>> arm64 machine. The OOM log in the kdump kernel looks like this:
> >>>
> >>>   [    0.240552] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> >>>   [    0.247713] swapper/0: page allocation failure: order:1, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
> >>>   <..snip..>
> >>>   [    0.274706] Call trace:
> >>>   [    0.277170]  dump_backtrace+0x0/0x208
> >>>   [    0.280863]  show_stack+0x1c/0x28
> >>>   [    0.284207]  dump_stack+0xc4/0x10c
> >>>   [    0.287638]  warn_alloc+0x104/0x170
> >>>   [    0.291156]  __alloc_pages_slowpath.constprop.106+0xb08/0xb48
> >>>   [    0.296958]  __alloc_pages_nodemask+0x2ac/0x2f8
> >>>   [    0.301530]  alloc_page_interleave+0x20/0x90
> >>>   [    0.305839]  alloc_pages_current+0xdc/0xf8
> >>>   [    0.309972]  atomic_pool_expand+0x60/0x210
> >>>   [    0.314108]  __dma_atomic_pool_init+0x50/0xa4
> >>>   [    0.318504]  dma_atomic_pool_init+0xac/0x158
> >>>   [    0.322813]  do_one_initcall+0x50/0x218
> >>>   [    0.326684]  kernel_init_freeable+0x22c/0x2d0
> >>>   [    0.331083]  kernel_init+0x18/0x110
> >>>   [    0.334600]  ret_from_fork+0x10/0x18
> >>>
> >>> This patch limits the crashkernel allocation to the first 1GB of
> >>> the RAM accessible (ZONE_DMA), as otherwise we might run into OOM
> >>> issues when crashkernel is executed, as it might have been originally
> >>> allocated from either a ZONE_DMA32 memory or mixture of memory chunks
> >>> belonging to both ZONE_DMA and ZONE_DMA32.
> >> How does this interact with this ongoing series:
> >>
> >> https://lore.kernel.org/r/20200628083458.40066-1-chenzhou10@huawei.com
> >>
> >> (patch 4, in particular)
> > Many thanks for having a look at this patchset. I was not aware that
> > Chen had sent out a new version.
> > I had noted in the v9 review of the high/low range allocation
> > <https://lists.gt.net/linux/kernel/3726052#3726052> that I was working
> > on a generic solution (irrespective of the crashkernel, low and high
> > range allocation) which resulted in this patchset.
> >
> > The issue is two-fold: OOPs in memcfg layer (PATCH 1/2, which has been
> > Acked-by memcfg maintainer) and OOM in the kdump kernel due to
> > crashkernel allocation in ZONE_DMA32 regions(s) which is addressed by
> > this PATCH.
> >
> > I will have a closer look at the v10 patchset Chen shared, but seems
> > it needs some rework as per Dave's review comments which he shared
> > today.
> > IMO, in the meanwhile this patchset  can be used to fix the existing
> > kdump issue with upstream kernel.
> Thanks for your work.
> There is no progress on the issue for long time, so i sent my solution in v8 comments
> and sent v9 recently.

Thanks a lot for your inputs. Well, I was working on the OOPs seen
with cgroups layer even when the memory cgroup is disabled via kdump
command line. As the cgroup maintainer also noted during the review of
PATCH 1/2 of this series, it's quite a corner case and hence hard to
debug. Hence the delay in sending out this series.
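
For readers who have not followed PATCH 1/2, the ordering problem can be
modelled in userspace (this is not kernel code). The level numbers
follow include/linux/init.h (core = 1, postcore = 2, subsys = 4).
swap_init() and early_consumer() are hypothetical stand-ins for
mem_cgroup_swap_init() and for a caller that reaches
mem_cgroup_get_nr_swap_pages() during boot; that this caller runs at
postcore level is an assumption based on the dma_atomic_pool_init frame
in the trace.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the kernel's initcall levels (lower runs earlier). */
enum initcall_level {
	LEVEL_CORE     = 1,
	LEVEL_POSTCORE = 2,
	LEVEL_SUBSYS   = 4,
};

struct initcall {
	enum initcall_level level;
	void (*fn)(void);
};

static int memcg_disabled = 1;        /* models "cgroup_disable=memory" */
static int cgroup_memory_noswap;      /* default: false */
static int consumer_saw_noswap = -1;  /* what the early caller observed */

/* Stand-in for mem_cgroup_swap_init(): no memory control, no swap control. */
static void swap_init(void)
{
	if (memcg_disabled)
		cgroup_memory_noswap = 1;
}

/* Stand-in for an early caller that tests cgroup_memory_noswap. */
static void early_consumer(void)
{
	consumer_saw_noswap = cgroup_memory_noswap;
}

/* Run all registered calls in ascending level order. */
static void run_initcalls(const struct initcall *calls, size_t n)
{
	for (int level = 0; level <= 7; level++)
		for (size_t i = 0; i < n; i++) {
			assert(calls[i].fn != NULL);
			if ((int)calls[i].level == level)
				calls[i].fn();
		}
}

/* Returns the flag value seen by the postcore-level consumer. */
static int simulate(enum initcall_level swap_init_level)
{
	const struct initcall calls[] = {
		{ swap_init_level, swap_init },
		{ LEVEL_POSTCORE,  early_consumer },
	};

	cgroup_memory_noswap = 0;
	consumer_saw_noswap = -1;
	run_initcalls(calls, 2);
	return consumer_saw_noswap;
}
```

In this model, registering swap_init() at subsys level leaves the flag
at its default when the consumer runs, whereas registering it at core
level sets the flag first, which is the effect PATCH 1/2 relies on.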

> I think direct limiting the crashkernel in ZONE_DMA isn't a good idea:
> 1. For parameter "crashkernel=Y", reserving crashkernel in first 1G memory will increase
> the probability of memory allocation failure.
> Previous discuss from https://lkml.org/lkml/2019/10/21/725:
>     "With ZONE_DMA=y, this config will fail to reserve 512M CMA on a server"

That is correct. However, we have limited options anyways at the
moment, hence the need for the crashkernel hi/low support series which
you are already working on. Unfortunately as I noted in the review of
the v10 series today, it still needs rework to fix
the OOM issue seen on ThunderX2 and ampere boards with crashkernel=X
kind of format.

See <http://lists.infradead.org/pipermail/kexec/2020-July/020825.html>
 for details.

So, to work around the issue (while the crashkernel hi/low support
series is reworked), the idea is to have similar kdump behaviour as we
were having on these boards before the ZONE_DMA32 changes were introduced.

I am also working on fixing the '__dma_atomic_pool_init' behaviour
itself (inside 'kernel/dma/pool.c') to adapt to ZONE_DMA and
ZONE_DMA32 range availability in the kdump kernel, but this is a
complex implementation and requires thorough checks (especially with
drivers which can only work within ZONE_DMA memory regions in the
kdump kernel). Hence it might take some iterations to share an RFC
patch on the same.

I will send a v2 addressing Will's inputs shortly.

Thanks,
Bhupesh

> 2. For parameter "crashkernel=Y@X", limiting the crashkernel in ZONE_DMA is unreasonable
> for someone really want to reserve crashkernel from specified start address.
>
> I have sent v10: https://www.spinics.net/lists/arm-kernel/msg819408.html, any commets are welcome.
>
> Thanks,
> Chen Zhou
> >
> >>> Fixes: bff3b04460a8 ("arm64: mm: reserve CMA and crashkernel in ZONE_DMA32")
> >>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>> Cc: Michal Hocko <mhocko@kernel.org>
> >>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> >>> Cc: James Morse <james.morse@arm.com>
> >>> Cc: Mark Rutland <mark.rutland@arm.com>
> >>> Cc: Will Deacon <will@kernel.org>
> >>> Cc: Catalin Marinas <catalin.marinas@arm.com>
> >>> Cc: cgroups@vger.kernel.org
> >>> Cc: linux-mm@kvack.org
> >>> Cc: linux-arm-kernel@lists.infradead.org
> >>> Cc: linux-kernel@vger.kernel.org
> >>> Cc: kexec@lists.infradead.org
> >>> Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> >>> Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
> >>> ---
> >>>  arch/arm64/mm/init.c | 16 ++++++++++++++--
> >>>  1 file changed, 14 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >>> index 1e93cfc7c47a..02ae4d623802 100644
> >>> --- a/arch/arm64/mm/init.c
> >>> +++ b/arch/arm64/mm/init.c
> >>> @@ -91,8 +91,15 @@ static void __init reserve_crashkernel(void)
> >>>       crash_size = PAGE_ALIGN(crash_size);
> >>>
> >>>       if (crash_base == 0) {
> >>> -             /* Current arm64 boot protocol requires 2MB alignment */
> >>> -             crash_base = memblock_find_in_range(0, arm64_dma32_phys_limit,
> >>> +             /* Current arm64 boot protocol requires 2MB alignment.
> >>> +              * Also limit the crashkernel allocation to the first
> >>> +              * 1GB of the RAM accessible (ZONE_DMA), as otherwise we
> >>> +              * might run into OOM issues when crashkernel is executed,
> >>> +              * as it might have been originally allocated from
> >>> +              * either a ZONE_DMA32 memory or mixture of memory
> >>> +              * chunks belonging to both ZONE_DMA and ZONE_DMA32.
> >>> +              */
> >> This comment needs help. Why does putting the crashkernel in ZONE_DMA
> >> prevent "OOM issues"?
> > Sure, I can work on adding more details in the comment so that it
> > explains the potential OOM issue(s) better.
> >
> > Thanks,
> > Bhupesh
> >
> >
> > .
> >
>
>



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-07-03  7:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01 22:14 [PATCH 0/2] arm64/kdump: Fix OOPS and OOM issues in kdump kernel Bhupesh Sharma
2020-07-01 22:14 ` [PATCH 1/2] mm/memcontrol: Fix OOPS inside mem_cgroup_get_nr_swap_pages() Bhupesh Sharma
2020-07-02  6:00   ` Michal Hocko
2020-07-02 18:55     ` Bhupesh Sharma
2020-07-03  6:43     ` Michal Hocko
2020-07-01 22:14 ` [PATCH 2/2] arm64: Allocate crashkernel always in ZONE_DMA Bhupesh Sharma
2020-07-02  7:50   ` Will Deacon
2020-07-02 19:22     ` Bhupesh Sharma
2020-07-03  5:24       ` chenzhou
2020-07-03  7:39         ` Bhupesh Sharma
