From: Johannes Weiner <hannes@cmpxchg.org>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yury Norov <ynorov@caviumnetworks.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Michal Hocko <mhocko@suse.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH 2/6] mm: vmstat: move slab statistics from zone to node counters
Date: Mon, 5 Jun 2017 14:35:11 -0400	[thread overview]
Message-ID: <20170605183511.GA8915@cmpxchg.org> (raw)
In-Reply-To: <87mv9s2f8f.fsf@concordia.ellerman.id.au>

On Thu, Jun 01, 2017 at 08:07:28PM +1000, Michael Ellerman wrote:
> Yury Norov <ynorov@caviumnetworks.com> writes:
> 
> > On Wed, May 31, 2017 at 01:39:00PM +0200, Heiko Carstens wrote:
> >> On Wed, May 31, 2017 at 11:12:56AM +0200, Heiko Carstens wrote:
> >> > On Tue, May 30, 2017 at 02:17:20PM -0400, Johannes Weiner wrote:
> >> > > To re-implement slab cache vs. page cache balancing, we'll need the
> >> > > slab counters at the lruvec level, which, ever since lru reclaim was
> >> > > moved from the zone to the node, is the intersection of the node, not
> >> > > the zone, and the memcg.
> >> > > 
> >> > > We could retain the per-zone counters for when the page allocator
> >> > > dumps its memory information on failures, and have counters on both
> >> > > levels - which on all but NUMA node 0 is usually redundant. But let's
> >> > > keep it simple for now and just move them. If anybody complains we can
> >> > > restore the per-zone counters.
> >> > > 
> >> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >> > 
> >> > This patch causes an early boot crash on s390 (linux-next as of today).
> >> > CONFIG_NUMA on/off doesn't make any difference. I haven't looked any
> >> > further into this yet, maybe you have an idea?
> >
> > The same on arm64.
> 
> And powerpc.

It looks like we need the following on top. I can't reproduce the
crash, but it's verifiable with WARN_ONs in the vmstat functions that
the nodestat array isn't properly initialized when slab bootstraps:

---

From 89ed86b5b538d8debd3c29567d7e1d31257fa577 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Jun 2017 14:12:15 -0400
Subject: [PATCH] mm: vmstat: move slab statistics from zone to node counters
 fix

Unable to handle kernel paging request at virtual address 2e116007
pgd = c0004000
[2e116007] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc3-00153-gb6bc6724488a #200
Hardware name: Generic DRA74X (Flattened Device Tree)
task: c0d0adc0 task.stack: c0d00000
PC is at __mod_node_page_state+0x2c/0xc8
LR is at __per_cpu_offset+0x0/0x8
pc : [<c0271de8>]    lr : [<c0d07da4>]    psr: 600000d3
sp : c0d01eec  ip : 00000000  fp : c15782f4
r10: 00000000  r9 : c1591280  r8 : 00004000
r7 : 00000001  r6 : 00000006  r5 : 2e116000  r4 : 00000007
r3 : 00000007  r2 : 00000001  r1 : 00000006  r0 : c0dc27c0
Flags: nZCv  IRQs off  FIQs off  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 8000406a  DAC: 00000051
Process swapper (pid: 0, stack limit = 0xc0d00218)
Stack: (0xc0d01eec to 0xc0d02000)
1ee0:                            600000d3 c0dc27c0 c0271efc 00000001 c0d58864
1f00: ef470000 00008000 00004000 c029fbb0 01000000 c1572b5c 00002000 00000000
1f20: 00000001 00000001 00008000 c029f584 00000000 c0d58864 00008000 00008000
1f40: 01008000 c0c23790 c15782f4 a00000d3 c0d58864 c02a0364 00000000 c0819388
1f60: c0d58864 000000c0 01000000 c1572a58 c0aa57a4 00000080 00002000 c0dca000
1f80: efffe980 c0c53a48 00000000 c0c23790 c1572a58 c0c59e48 c0c59de8 c1572b5c
1fa0: c0dca000 c0c257a4 00000000 ffffffff c0dca000 c0d07940 c0dca000 c0c00a9c
1fc0: ffffffff ffffffff 00000000 c0c00680 00000000 c0c53a48 c0dca214 c0d07958
1fe0: c0c53a44 c0d0caa4 8000406a 412fc0f2 00000000 8000807c 00000000 00000000
[<c0271de8>] (__mod_node_page_state) from [<c0271efc>] (mod_node_page_state+0x2c/0x4c)
[<c0271efc>] (mod_node_page_state) from [<c029fbb0>] (cache_alloc_refill+0x5b8/0x828)
[<c029fbb0>] (cache_alloc_refill) from [<c02a0364>] (kmem_cache_alloc+0x24c/0x2d0)
[<c02a0364>] (kmem_cache_alloc) from [<c0c23790>] (create_kmalloc_cache+0x20/0x8c)
[<c0c23790>] (create_kmalloc_cache) from [<c0c257a4>] (kmem_cache_init+0xac/0x11c)
[<c0c257a4>] (kmem_cache_init) from [<c0c00a9c>] (start_kernel+0x1b8/0x3c0)
[<c0c00a9c>] (start_kernel) from [<8000807c>] (0x8000807c)
Code: e79e5103 e28c3001 e0833001 e1a04003 (e19440d5)
---[ end trace 0000000000000000 ]---

The zone counters work earlier than the node counters because the
zones have special boot pagesets, whereas the nodes do not.

Add boot nodestats against which we account until the dynamic per-cpu
allocator is available.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5f89cfaddc4b..7f341f84b587 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5107,6 +5107,7 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 static void setup_zone_pageset(struct zone *zone);
 
 /*
@@ -6010,6 +6011,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(node_lruvec(pgdat));
 
+	pgdat->per_cpu_nodestats = &boot_nodestats;
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
-- 
2.13.0
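For reference, the pattern the fix relies on can be modeled in a standalone sketch: point the node at static boot storage until the real per-cpu allocator is up, the same trick boot_pageset already plays for zones. The types and names below are simplified stand-ins, not the kernel's actual structures:

```c
#include <assert.h>

/* Simplified stand-in for the kernel's node statistics; a sketch of
 * the boot-time fallback pattern, not the real API. */
#define NR_NODE_STAT_ITEMS 4

struct per_cpu_nodestat {
	long vm_node_stat_diff[NR_NODE_STAT_ITEMS];
};

struct pglist_data {
	struct per_cpu_nodestat *per_cpu_nodestats;
	long vm_stat[NR_NODE_STAT_ITEMS];
};

/* Static "boot" storage, usable before any dynamic per-cpu allocator
 * exists: the analogue of boot_nodestats in the fix. */
static struct per_cpu_nodestat boot_nodestats;

/* Early node init. Without this assignment, the accounting below
 * chases an uninitialized per_cpu_nodestats pointer, which is the
 * early boot crash reported on s390, arm64, powerpc, and ARM. */
static void free_area_init_core(struct pglist_data *pgdat)
{
	pgdat->per_cpu_nodestats = &boot_nodestats;
}

static void mod_node_page_state(struct pglist_data *pgdat, int item,
				long delta)
{
	/* Accumulate into whatever per-cpu storage the node points at. */
	pgdat->per_cpu_nodestats->vm_node_stat_diff[item] += delta;
	pgdat->vm_stat[item] += delta;
}
```

In the real kernel, dynamically allocated per-cpu storage later replaces this pointer; the sketch only shows why the static fallback makes accounting safe during slab bootstrap.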

Thread overview: 77+ messages

2017-05-30 18:17 [PATCH 0/6] mm: per-lruvec slab stats Johannes Weiner
2017-05-30 18:17 ` [PATCH 1/6] mm: vmscan: delete unused pgdat_reclaimable_pages() Johannes Weiner
2017-05-30 21:50   ` Andrew Morton
2017-05-30 22:02     ` Johannes Weiner
2017-05-30 18:17 ` [PATCH 2/6] mm: vmstat: move slab statistics from zone to node counters Johannes Weiner
2017-05-31  9:12   ` Heiko Carstens
2017-05-31 11:39     ` Heiko Carstens
2017-05-31 17:11       ` Yury Norov
2017-06-01 10:07         ` Michael Ellerman
2017-06-05 18:35           ` Johannes Weiner [this message]
2017-06-05 21:38             ` Andrew Morton
2017-06-07 16:20               ` Johannes Weiner
2017-06-06  4:31             ` Michael Ellerman
2017-06-06 11:15               ` Michael Ellerman
2017-06-06 14:33                 ` Johannes Weiner
2017-05-30 18:17 ` [PATCH 3/6] mm: memcontrol: use the node-native slab memory counters Johannes Weiner
2017-06-03 17:39   ` Vladimir Davydov
2017-05-30 18:17 ` [PATCH 4/6] mm: memcontrol: use generic mod_memcg_page_state for kmem pages Johannes Weiner
2017-06-03 17:40   ` Vladimir Davydov
2017-05-30 18:17 ` [PATCH 5/6] mm: memcontrol: per-lruvec stats infrastructure Johannes Weiner
2017-05-31 17:14   ` Johannes Weiner
2017-05-31 18:18     ` Andrew Morton
2017-05-31 19:02       ` Tony Lindgren
2017-05-31 22:03         ` Stephen Rothwell
2017-06-01  1:44       ` Johannes Weiner
2017-06-03 17:50   ` Vladimir Davydov
2017-06-05 17:53     ` Johannes Weiner
2017-05-30 18:17 ` [PATCH 6/6] mm: memcontrol: account slab stats per lruvec Johannes Weiner
2017-06-03 17:54   ` Vladimir Davydov
2017-06-05 16:52   ` [6/6] " Guenter Roeck
2017-06-05 17:52     ` Johannes Weiner
