linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] memcg: reduce memory consumption by memcg stats
@ 2024-04-23  5:18 Shakeel Butt
  2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23  5:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song
  Cc: linux-mm, linux-kernel

Most of the memory overhead of a memcg object is due to memcg stats
maintained by the kernel. Since stats updates happen in performance
critical codepaths, the stats are maintained per-cpu and numa specific
stats are maintained per-node * per-cpu. This drastically increase the
overhead on large machines i.e. large of CPUs and multiple numa nodes.
This patch series tries to reduce the overhead by at least not
allocating the memory for stats which are not memcg specific.

Shakeel Butt (4):
  mm: rearrange node_stat_item to put memcg stats at start
  memcg: reduce memory for the lruvec and memcg stats
  memcg: use proper type for mod_memcg_state
  memcg: restrict __mod_memcg_lruvec_state to memcg stats

 include/linux/memcontrol.h | 25 +++++++++++++------------
 include/linux/mmzone.h     | 29 +++++++++++++++++------------
 mm/memcontrol.c            | 12 +++++++-----
 mm/vmstat.c                | 24 ++++++++++++------------
 4 files changed, 49 insertions(+), 41 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23  5:18 [PATCH 0/4] memcg: reduce memory consumption by memcg stats Shakeel Butt
@ 2024-04-23  5:18 ` Shakeel Butt
  2024-04-23 13:58   ` Johannes Weiner
  2024-04-25 18:01   ` Chris Li
  2024-04-23  5:18 ` [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats Shakeel Butt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23  5:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song
  Cc: linux-mm, linux-kernel

At the moment the memcg stats are sized based on the size of enum
node_stat_item but not all fields in node_stat_item corresponds to memcg
stats. So, rearrage the contents of node_stat_item such that all the
memcg specific stats are at the top and then the later patches will make
sure that the memcg code will not waste space for non-memcg stats.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/mmzone.h | 25 +++++++++++++------------
 mm/vmstat.c            | 24 ++++++++++++------------
 2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..989ca97402c6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,9 +166,6 @@ enum node_stat_item {
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_SLAB_RECLAIMABLE_B,
 	NR_SLAB_UNRECLAIMABLE_B,
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
-	WORKINGSET_NODES,
 	WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_FILE,
@@ -179,39 +176,43 @@ enum node_stat_item {
 	WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
 	WORKINGSET_RESTORE_FILE,
 	WORKINGSET_NODERECLAIM,
+	NR_PAGETABLE,		/* used for pagetables */
+	NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
+#ifdef CONFIG_SWAP
+	NR_SWAPCACHE,
+#endif
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_SHMEM_THPS,
-	NR_SHMEM_PMDMAPPED,
 	NR_FILE_THPS,
-	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	/* No memcg stats for the following fields. */
+	NR_SHMEM_PMDMAPPED,
+	NR_FILE_PMDMAPPED,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	WORKINGSET_NODES,
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
-	NR_KERNEL_STACK_KB,	/* measured in KiB */
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
-	NR_PAGETABLE,		/* used for pagetables */
-	NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
 #ifdef CONFIG_IOMMU_SUPPORT
 	NR_IOMMU_PAGES,		/* # of pages allocated by IOMMU */
 #endif
-#ifdef CONFIG_SWAP
-	NR_SWAPCACHE,
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..4eac2f6322a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-	"nr_isolated_anon",
-	"nr_isolated_file",
-	"workingset_nodes",
 	"workingset_refault_anon",
 	"workingset_refault_file",
 	"workingset_activate_anon",
@@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
 	"workingset_restore_anon",
 	"workingset_restore_file",
 	"workingset_nodereclaim",
+	"nr_page_table_pages",
+	"nr_sec_page_table_pages",
+	"nr_kernel_stack",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
+#ifdef CONFIG_SWAP
+	"nr_swapcached",
+#endif
 	"nr_dirty",
 	"nr_writeback",
-	"nr_writeback_temp",
 	"nr_shmem",
 	"nr_shmem_hugepages",
-	"nr_shmem_pmdmapped",
 	"nr_file_hugepages",
-	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_file_pmdmapped",
+	"nr_writeback_temp",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"workingset_nodes",
 	"nr_dirtied",
 	"nr_written",
 	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
-	"nr_kernel_stack",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
 #endif
-	"nr_page_table_pages",
-	"nr_sec_page_table_pages",
 #ifdef CONFIG_IOMMU_SUPPORT
 	"nr_iommu_pages",
 #endif
-#ifdef CONFIG_SWAP
-	"nr_swapcached",
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
  2024-04-23  5:18 [PATCH 0/4] memcg: reduce memory consumption by memcg stats Shakeel Butt
  2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
@ 2024-04-23  5:18 ` Shakeel Butt
  2024-04-23 14:40   ` kernel test robot
  2024-04-23 20:58   ` kernel test robot
  2024-04-23  5:22 ` [PATCH 3/4] memcg: use proper type for mod_memcg_state Shakeel Butt
  2024-04-23  5:23 ` [PATCH 4/4] memcg: restrict __mod_memcg_lruvec_state to memcg stats Shakeel Butt
  3 siblings, 2 replies; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23  5:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song
  Cc: linux-mm, linux-kernel

At the moment, the amount of memory allocated for stats related structs
in the mem_cgroup corresponds to the size of enum node_stat_item.
However not all fields in enum node_stat_item has corresponding memcg
stats. The fields of enum node_stat_item is sorted in such a way that
all the fields with corresponding memcg stats are at the start of the
enum node_stat_item. So, let's just make an explicit  boundary within
enum node_stat_item and use that boundary to allocate memory for stats
related structs of memcgs.

For a given x86_64 config, the size of stats with and without patch is:

structs size in bytes         w/o     with

struct lruvec_stats           1128     648
struct lruvec_stats_percpu     752     432
struct memcg_vmstats          1832    1352
struct memcg_vmstats_percpu   1280     960

The memory savings is further compounded by the fact that these structs
are allocated for each cpu and for node. To be precise, for each memcg,
the memory saved would be:

Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODS * NR_CPUS) +
               (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long)

Where 21 is the number of fields eliminated.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 12 ++++++------
 include/linux/mmzone.h     |  8 ++++++--
 mm/memcontrol.c            |  5 ++++-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9aba0d0462ca..d68db7a0e829 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,7 +32,7 @@ struct kmem_cache;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
-	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
+	MEMCG_SWAP = NR_VM_NODE_MEMCG_STAT_ITEMS,
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
 	MEMCG_VMALLOC,
@@ -92,21 +92,21 @@ struct mem_cgroup_reclaim_iter {
 
 struct lruvec_stats_percpu {
 	/* Local (CPU and cgroup) state */
-	long state[NR_VM_NODE_STAT_ITEMS];
+	long state[NR_VM_NODE_MEMCG_STAT_ITEMS];
 
 	/* Delta calculation for lockless upward propagation */
-	long state_prev[NR_VM_NODE_STAT_ITEMS];
+	long state_prev[NR_VM_NODE_MEMCG_STAT_ITEMS];
 };
 
 struct lruvec_stats {
 	/* Aggregated (CPU and subtree) state */
-	long state[NR_VM_NODE_STAT_ITEMS];
+	long state[NR_VM_NODE_MEMCG_STAT_ITEMS];
 
 	/* Non-hierarchical (CPU aggregated) state */
-	long state_local[NR_VM_NODE_STAT_ITEMS];
+	long state_local[NR_VM_NODE_MEMCG_STAT_ITEMS];
 
 	/* Pending child counts during tree propagation */
-	long state_pending[NR_VM_NODE_STAT_ITEMS];
+	long state_pending[NR_VM_NODE_MEMCG_STAT_ITEMS];
 };
 
 /*
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 989ca97402c6..59592f3c7d9b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -192,8 +192,12 @@ enum node_stat_item {
 	NR_SHMEM_THPS,
 	NR_FILE_THPS,
 	NR_ANON_THPS,
-	/* No memcg stats for the following fields. */
-	NR_SHMEM_PMDMAPPED,
+	/*
+	 * No memcg stats for the following fields. Please add stats which have
+	 * memcg counterpart above NR_VM_NODE_MEMCG_STAT_ITEMS.
+	 */
+	NR_VM_NODE_MEMCG_STAT_ITEMS,
+	NR_SHMEM_PMDMAPPED = NR_VM_NODE_MEMCG_STAT_ITEMS,
 	NR_FILE_PMDMAPPED,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_VMSCAN_WRITE,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 833d09c1d523..bb1bbf417a46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1648,6 +1648,9 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
 	int i;
 
+	/* Reduce by 1 for MEMCG_SWAP as that is not exposed in v2. */
+	BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
+
 	/*
 	 * Provide statistics on the state of the memory subsystem as
 	 * well as cumulative event counters that show past behavior.
@@ -5869,7 +5872,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 		lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);
 
-		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		for (i = 0; i < NR_VM_NODE_MEMCG_STAT_ITEMS; i++) {
 			delta = pn->lruvec_stats.state_pending[i];
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/4] memcg: use proper type for mod_memcg_state
  2024-04-23  5:18 [PATCH 0/4] memcg: reduce memory consumption by memcg stats Shakeel Butt
  2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
  2024-04-23  5:18 ` [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats Shakeel Butt
@ 2024-04-23  5:22 ` Shakeel Butt
  2024-04-23  5:23 ` [PATCH 4/4] memcg: restrict __mod_memcg_lruvec_state to memcg stats Shakeel Butt
  3 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23  5:22 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song
  Cc: linux-mm, linux-kernel

The memcg stats update functions can take arbitrary integer but the
only input which make sense is enum memcg_stat_item and we don't
want these functions to be called with arbitrary integer, so replace
the parameter type with enum memcg_stat_item and compiler will be able
to warn if memcg stat update functions are called with incorrect index
value.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 13 +++++++------
 mm/memcontrol.c            |  3 ++-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d68db7a0e829..1b4a6201c78c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -991,7 +991,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 void folio_memcg_lock(struct folio *folio);
 void folio_memcg_unlock(struct folio *folio);
 
-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item  idx,
+		       int val);
 
 /* try to stablize folio_memcg() for all the pages in a memcg */
 static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
@@ -1012,7 +1013,7 @@ static inline void mem_cgroup_unlock_pages(void)
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
-				   int idx, int val)
+				   enum memcg_stat_item idx, int val)
 {
 	unsigned long flags;
 
@@ -1022,7 +1023,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
 }
 
 static inline void mod_memcg_page_state(struct page *page,
-					int idx, int val)
+					enum memcg_stat_item idx, int val)
 {
 	struct mem_cgroup *memcg;
 
@@ -1541,19 +1542,19 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 static inline void __mod_memcg_state(struct mem_cgroup *memcg,
-				     int idx,
+				     enum memcg_stat_item idx,
 				     int nr)
 {
 }
 
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
-				   int idx,
+				   enum memcg_stat_item idx,
 				   int nr)
 {
 }
 
 static inline void mod_memcg_page_state(struct page *page,
-					int idx, int val)
+					enum memcg_stat_item idx, int val)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bb1bbf417a46..4e991e913393 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -816,7 +816,8 @@ static int memcg_state_val_in_pages(int idx, int val)
  * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
  * @val: delta to add to the counter, can be negative
  */
-void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
+void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+		       int val)
 {
 	if (mem_cgroup_disabled())
 		return;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] memcg: restrict __mod_memcg_lruvec_state to memcg stats
  2024-04-23  5:18 [PATCH 0/4] memcg: reduce memory consumption by memcg stats Shakeel Butt
                   ` (2 preceding siblings ...)
  2024-04-23  5:22 ` [PATCH 3/4] memcg: use proper type for mod_memcg_state Shakeel Butt
@ 2024-04-23  5:23 ` Shakeel Butt
  3 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23  5:23 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song
  Cc: linux-mm, linux-kernel

Currently __mod_memcg_lruvec_state takes enum node_stat_item as a
parameter and enum node_stat_item contains both memcg and non-memcg
stats but __mod_memcg_lruvec_state can only handle the memcg stats, so
simply only call __mod_memcg_lruvec_state for memcg stats.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e991e913393..531b6ff711c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -860,8 +860,6 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 		case NR_ANON_MAPPED:
 		case NR_FILE_MAPPED:
 		case NR_ANON_THPS:
-		case NR_SHMEM_PMDMAPPED:
-		case NR_FILE_PMDMAPPED:
 			if (WARN_ON_ONCE(!in_task()))
 				pr_warn("stat item index: %d\n", idx);
 			break;
@@ -899,7 +897,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 
 	/* Update memcg and lruvec */
-	if (!mem_cgroup_disabled())
+	if (!mem_cgroup_disabled() && idx < NR_VM_NODE_MEMCG_STAT_ITEMS)
 		__mod_memcg_lruvec_state(lruvec, idx, val);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
@ 2024-04-23 13:58   ` Johannes Weiner
  2024-04-23 17:44     ` Shakeel Butt
  2024-04-25 18:01   ` Chris Li
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2024-04-23 13:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Muchun Song,
	linux-mm, linux-kernel

On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item corresponds to memcg
> stats. So, rearrage the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will make
> sure that the memcg code will not waste space for non-memcg stats.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

This series is a great idea and the savings speak for themselves.

But rearranging and splitting vmstats along the memcg-nomemcg line
seems like an undue burden on the non-memcg codebase and interface.

- It messes with user-visible /proc/vmstat ordering, and sets things
  up to do so on an ongoing basis as stats are added to memcg.

- It also separates related stats (like the workingset ones) in
  /proc/vmstat when memcg only accounts a subset.

Would it make more sense to have a translation table inside memcg?
Like we have with memcg1_events.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
  2024-04-23  5:18 ` [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats Shakeel Butt
@ 2024-04-23 14:40   ` kernel test robot
  2024-04-23 20:58   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2024-04-23 14:40 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel

Hi Shakeel,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240422]
[cannot apply to akpm-mm/mm-everything linus/master v6.9-rc5 v6.9-rc4 v6.9-rc3 v6.9-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shakeel-Butt/mm-rearrange-node_stat_item-to-put-memcg-stats-at-start/20240423-132451
base:   next-20240422
patch link:    https://lore.kernel.org/r/20240423051826.791934-3-shakeel.butt%40linux.dev
patch subject: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
config: x86_64-buildonly-randconfig-002-20240423 (https://download.01.org/0day-ci/archive/20240423/202404232230.94gQwAI2-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240423/202404232230.94gQwAI2-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404232230.94gQwAI2-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/memcontrol.c:1651:2: error: call to '__compiletime_assert_963' declared with 'error' attribute: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
    1651 |         BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
         |         ^
   include/linux/build_bug.h:50:2: note: expanded from macro 'BUILD_BUG_ON'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^
   include/linux/build_bug.h:39:37: note: expanded from macro 'BUILD_BUG_ON_MSG'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^
   include/linux/compiler_types.h:460:2: note: expanded from macro 'compiletime_assert'
     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^
   include/linux/compiler_types.h:448:2: note: expanded from macro '_compiletime_assert'
     448 |         __compiletime_assert(condition, msg, prefix, suffix)
         |         ^
   include/linux/compiler_types.h:441:4: note: expanded from macro '__compiletime_assert'
     441 |                         prefix ## suffix();                             \
         |                         ^
   <scratch space>:40:1: note: expanded from here
      40 | __compiletime_assert_963
         | ^
   1 error generated.


vim +1651 mm/memcontrol.c

  1645	
  1646	static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
  1647	{
  1648		int i;
  1649	
  1650		/* Reduce by 1 for MEMCG_SWAP as that is not exposed in v2. */
> 1651		BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
  1652	
  1653		/*
  1654		 * Provide statistics on the state of the memory subsystem as
  1655		 * well as cumulative event counters that show past behavior.
  1656		 *
  1657		 * This list is ordered following a combination of these gradients:
  1658		 * 1) generic big picture -> specifics and details
  1659		 * 2) reflecting userspace activity -> reflecting kernel heuristics
  1660		 *
  1661		 * Current memory state:
  1662		 */
  1663		mem_cgroup_flush_stats(memcg);
  1664	
  1665		for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
  1666			u64 size;
  1667	
  1668			size = memcg_page_state_output(memcg, memory_stats[i].idx);
  1669			seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
  1670	
  1671			if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
  1672				size += memcg_page_state_output(memcg,
  1673								NR_SLAB_RECLAIMABLE_B);
  1674				seq_buf_printf(s, "slab %llu\n", size);
  1675			}
  1676		}
  1677	
  1678		/* Accumulated memory events */
  1679		seq_buf_printf(s, "pgscan %lu\n",
  1680			       memcg_events(memcg, PGSCAN_KSWAPD) +
  1681			       memcg_events(memcg, PGSCAN_DIRECT) +
  1682			       memcg_events(memcg, PGSCAN_KHUGEPAGED));
  1683		seq_buf_printf(s, "pgsteal %lu\n",
  1684			       memcg_events(memcg, PGSTEAL_KSWAPD) +
  1685			       memcg_events(memcg, PGSTEAL_DIRECT) +
  1686			       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
  1687	
  1688		for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
  1689			if (memcg_vm_event_stat[i] == PGPGIN ||
  1690			    memcg_vm_event_stat[i] == PGPGOUT)
  1691				continue;
  1692	
  1693			seq_buf_printf(s, "%s %lu\n",
  1694				       vm_event_name(memcg_vm_event_stat[i]),
  1695				       memcg_events(memcg, memcg_vm_event_stat[i]));
  1696		}
  1697	
  1698		/* The above should easily fit into one page */
  1699		WARN_ON_ONCE(seq_buf_has_overflowed(s));
  1700	}
  1701	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23 13:58   ` Johannes Weiner
@ 2024-04-23 17:44     ` Shakeel Butt
  2024-04-23 18:30       ` Johannes Weiner
  0 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2024-04-23 17:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Muchun Song,
	linux-mm, linux-kernel

On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > At the moment the memcg stats are sized based on the size of enum
> > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > stats. So, rearrage the contents of node_stat_item such that all the
> > memcg specific stats are at the top and then the later patches will make
> > sure that the memcg code will not waste space for non-memcg stats.
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
> This series is a great idea and the savings speak for themselves.
> 
> But rearranging and splitting vmstats along the memcg-nomemcg line
> seems like an undue burden on the non-memcg codebase and interface.
> 
> - It messes with user-visible /proc/vmstat ordering, and sets things
>   up to do so on an ongoing basis as stats are added to memcg.
> 
> - It also separates related stats (like the workingset ones) in
>   /proc/vmstat when memcg only accounts a subset.
> 
> Would it make more sense to have a translation table inside memcg?
> Like we have with memcg1_events.

Thanks for taking a look. I will look into the translation table
approach. The reason I went with this approach was that I am in parallel
looking into rearranging fields of important MM structs and also enums
to improve cache locality. For example, the field NR_SWAPCACHE is always
accessed together with NR_FILE_PAGES, so it makes sense to have them on
same cacheline. So, is the rearrangement of vmstats a big NO or a little
bit here and there is fine unlike what I did with this patch?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23 17:44     ` Shakeel Butt
@ 2024-04-23 18:30       ` Johannes Weiner
  2024-04-25 22:54         ` Chris Li
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2024-04-23 18:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Muchun Song,
	linux-mm, linux-kernel

On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > At the moment the memcg stats are sized based on the size of enum
> > > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > > stats. So, rearrage the contents of node_stat_item such that all the
> > > memcg specific stats are at the top and then the later patches will make
> > > sure that the memcg code will not waste space for non-memcg stats.
> > > 
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > 
> > This series is a great idea and the savings speak for themselves.
> > 
> > But rearranging and splitting vmstats along the memcg-nomemcg line
> > seems like an undue burden on the non-memcg codebase and interface.
> > 
> > - It messes with user-visible /proc/vmstat ordering, and sets things
> >   up to do so on an ongoing basis as stats are added to memcg.
> > 
> > - It also separates related stats (like the workingset ones) in
> >   /proc/vmstat when memcg only accounts a subset.
> > 
> > Would it make more sense to have a translation table inside memcg?
> > Like we have with memcg1_events.
> 
> Thanks for taking a look. I will look into the translation table
> approach. The reason I went with this approach was that I am in parallel
> looking into rearranging fields of important MM structs and also enums
> to improve cache locality. For example, the field NR_SWAPCACHE is always
> accessed together with NR_FILE_PAGES, so it makes sense to have them on
> same cacheline. So, is the rearrangement of vmstats a big NO or a little
> bit here and there is fine unlike what I did with this patch?

I'm curious what other folks think.

The cache optimization is a stronger argument, IMO, because it
directly benefits the users of /proc/vmstat. And it would be fairly
self contained inside the node_stat_item enum - "ordered for cache".

I was more hesitant about imposing a memcg requirement on the generic
vmstat ordering.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
  2024-04-23  5:18 ` [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats Shakeel Butt
  2024-04-23 14:40   ` kernel test robot
@ 2024-04-23 20:58   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2024-04-23 20:58 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel

Hi Shakeel,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240422]
[cannot apply to akpm-mm/mm-everything linus/master v6.9-rc5 v6.9-rc4 v6.9-rc3 v6.9-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shakeel-Butt/mm-rearrange-node_stat_item-to-put-memcg-stats-at-start/20240423-132451
base:   next-20240422
patch link:    https://lore.kernel.org/r/20240423051826.791934-3-shakeel.butt%40linux.dev
patch subject: [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20240424/202404240415.fucxk6Ix-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240424/202404240415.fucxk6Ix-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404240415.fucxk6Ix-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   In function 'memcg_stat_format',
       inlined from 'memory_stat_format.constprop' at mm/memcontrol.c:1707:3:
>> include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_633' declared with attribute error: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
     441 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   mm/memcontrol.c:1651:9: note: in expansion of macro 'BUILD_BUG_ON'
    1651 |         BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
         |         ^~~~~~~~~~~~
   In function 'memcg_stat_format',
       inlined from 'memory_stat_format' at mm/memcontrol.c:1707:3,
       inlined from 'memory_stat_show' at mm/memcontrol.c:6946:2:
>> include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_633' declared with attribute error: BUILD_BUG_ON failed: ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1
     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
     441 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   mm/memcontrol.c:1651:9: note: in expansion of macro 'BUILD_BUG_ON'
    1651 |         BUILD_BUG_ON(ARRAY_SIZE(memory_stats) != MEMCG_NR_STAT - 1);
         |         ^~~~~~~~~~~~


vim +/__compiletime_assert_633 +460 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  446  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  447  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  448  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  449  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  450  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  451   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  452   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  453   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  454   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  455   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  456   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  457   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  458   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  459  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @460  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  461  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
  2024-04-23 13:58   ` Johannes Weiner
@ 2024-04-25 18:01   ` Chris Li
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Li @ 2024-04-25 18:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, linux-mm, linux-kernel

On Mon, Apr 22, 2024 at 10:19 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item corresponds to memcg
> stats. So, rearrage the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will make
> sure that the memcg code will not waste space for non-memcg stats.

Is this patch meant to have no functional change other than the output order?
It would be better to clarify it in the commit message if that is the case.

Chris

>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  include/linux/mmzone.h | 25 +++++++++++++------------
>  mm/vmstat.c            | 24 ++++++++++++------------
>  2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8f9c9590a42c..989ca97402c6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -166,9 +166,6 @@ enum node_stat_item {
>         NR_UNEVICTABLE,         /*  "     "     "   "       "         */
>         NR_SLAB_RECLAIMABLE_B,
>         NR_SLAB_UNRECLAIMABLE_B,
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> -       WORKINGSET_NODES,
>         WORKINGSET_REFAULT_BASE,
>         WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
>         WORKINGSET_REFAULT_FILE,
> @@ -179,39 +176,43 @@ enum node_stat_item {
>         WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
>         WORKINGSET_RESTORE_FILE,
>         WORKINGSET_NODERECLAIM,
> +       NR_PAGETABLE,           /* used for pagetables */
> +       NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
> +       NR_KERNEL_STACK_KB,     /* measured in KiB */
>         NR_ANON_MAPPED, /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
>                            only modified from process context */
>         NR_FILE_PAGES,
> +#ifdef CONFIG_SWAP
> +       NR_SWAPCACHE,
> +#endif
>         NR_FILE_DIRTY,
>         NR_WRITEBACK,
> -       NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_SHMEM_THPS,
> -       NR_SHMEM_PMDMAPPED,
>         NR_FILE_THPS,
> -       NR_FILE_PMDMAPPED,
>         NR_ANON_THPS,
> +       /* No memcg stats for the following fields. */
> +       NR_SHMEM_PMDMAPPED,
> +       NR_FILE_PMDMAPPED,
> +       NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       WORKINGSET_NODES,
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
>         NR_THROTTLED_WRITTEN,   /* NR_WRITTEN while reclaim throttled */
>         NR_KERNEL_MISC_RECLAIMABLE,     /* reclaimable non-slab kernel pages */
>         NR_FOLL_PIN_ACQUIRED,   /* via: pin_user_page(), gup flag: FOLL_PIN */
>         NR_FOLL_PIN_RELEASED,   /* pages returned via unpin_user_page() */
> -       NR_KERNEL_STACK_KB,     /* measured in KiB */
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>         NR_KERNEL_SCS_KB,       /* measured in KiB */
>  #endif
> -       NR_PAGETABLE,           /* used for pagetables */
> -       NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
>  #ifdef CONFIG_IOMMU_SUPPORT
>         NR_IOMMU_PAGES,         /* # of pages allocated by IOMMU */
>  #endif
> -#ifdef CONFIG_SWAP
> -       NR_SWAPCACHE,
> -#endif
>  #ifdef CONFIG_NUMA_BALANCING
>         PGPROMOTE_SUCCESS,      /* promote successfully */
>         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8507c497218b..4eac2f6322a3 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
>         "nr_unevictable",
>         "nr_slab_reclaimable",
>         "nr_slab_unreclaimable",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
> -       "workingset_nodes",
>         "workingset_refault_anon",
>         "workingset_refault_file",
>         "workingset_activate_anon",
> @@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
>         "workingset_restore_anon",
>         "workingset_restore_file",
>         "workingset_nodereclaim",
> +       "nr_page_table_pages",
> +       "nr_sec_page_table_pages",
> +       "nr_kernel_stack",
>         "nr_anon_pages",
>         "nr_mapped",
>         "nr_file_pages",
> +#ifdef CONFIG_SWAP
> +       "nr_swapcached",
> +#endif
>         "nr_dirty",
>         "nr_writeback",
> -       "nr_writeback_temp",
>         "nr_shmem",
>         "nr_shmem_hugepages",
> -       "nr_shmem_pmdmapped",
>         "nr_file_hugepages",
> -       "nr_file_pmdmapped",
>         "nr_anon_transparent_hugepages",
> +       "nr_shmem_pmdmapped",
> +       "nr_file_pmdmapped",
> +       "nr_writeback_temp",
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "workingset_nodes",
>         "nr_dirtied",
>         "nr_written",
>         "nr_throttled_written",
>         "nr_kernel_misc_reclaimable",
>         "nr_foll_pin_acquired",
>         "nr_foll_pin_released",
> -       "nr_kernel_stack",
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>         "nr_shadow_call_stack",
>  #endif
> -       "nr_page_table_pages",
> -       "nr_sec_page_table_pages",
>  #ifdef CONFIG_IOMMU_SUPPORT
>         "nr_iommu_pages",
>  #endif
> -#ifdef CONFIG_SWAP
> -       "nr_swapcached",
> -#endif
>  #ifdef CONFIG_NUMA_BALANCING
>         "pgpromote_success",
>         "pgpromote_candidate",
> --
> 2.43.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start
  2024-04-23 18:30       ` Johannes Weiner
@ 2024-04-25 22:54         ` Chris Li
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Li @ 2024-04-25 22:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Andrew Morton, Michal Hocko, Roman Gushchin,
	Muchun Song, linux-mm, linux-kernel

On Tue, Apr 23, 2024 at 11:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> > On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > > At the moment the memcg stats are sized based on the size of enum
> > > > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > > > stats. So, rearrage the contents of node_stat_item such that all the
> > > > memcg specific stats are at the top and then the later patches will make
> > > > sure that the memcg code will not waste space for non-memcg stats.
> > > >
> > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > >
> > > This series is a great idea and the savings speak for themselves.
> > >
> > > But rearranging and splitting vmstats along the memcg-nomemcg line
> > > seems like an undue burden on the non-memcg codebase and interface.
> > >
> > > - It messes with user-visible /proc/vmstat ordering, and sets things
> > >   up to do so on an ongoing basis as stats are added to memcg.
> > >
> > > - It also separates related stats (like the workingset ones) in
> > >   /proc/vmstat when memcg only accounts a subset.
> > >
> > > Would it make more sense to have a translation table inside memcg?
> > > Like we have with memcg1_events.
> >
> > Thanks for taking a look. I will look into the translation table
> > approach. The reason I went with this approach was that I am in parallel
> > looking into rearranging fields of important MM structs and also enums
> > to improve cache locality. For example, the field NR_SWAPCACHE is always
> > accessed together with NR_FILE_PAGES, so it makes sense to have them on
> > same cacheline. So, is the rearrangement of vmstats a big NO or a little
> > bit here and there is fine unlike what I did with this patch?
>
> I'm curious what other folks think.
>
I slightly prefer not to change user visible ordering for no good reason.
It is not said the order is carved to stone. It depends on the ROI.

> The cache optimization is a stronger argument, IMO, because it
> directly benefits the users of /proc/vmstat. And it would be fairly
> self contained inside the node_stat_item enum - "ordered for cache".

Not sure how much of the cache optimization is measurable here.
I suspect it is going to be hard to measure a meaningful difference
just from the cache line order alone.

> I was more hesitant about imposing a memcg requirement on the generic
> vmstat ordering.

That is a valid reason.

Chris

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-04-25 22:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-23  5:18 [PATCH 0/4] memcg: reduce memory consumption by memcg stats Shakeel Butt
2024-04-23  5:18 ` [PATCH 1/4] mm: rearrange node_stat_item to put memcg stats at start Shakeel Butt
2024-04-23 13:58   ` Johannes Weiner
2024-04-23 17:44     ` Shakeel Butt
2024-04-23 18:30       ` Johannes Weiner
2024-04-25 22:54         ` Chris Li
2024-04-25 18:01   ` Chris Li
2024-04-23  5:18 ` [PATCH 2/4] memcg: reduce memory for the lruvec and memcg stats Shakeel Butt
2024-04-23 14:40   ` kernel test robot
2024-04-23 20:58   ` kernel test robot
2024-04-23  5:22 ` [PATCH 3/4] memcg: use proper type for mod_memcg_state Shakeel Butt
2024-04-23  5:23 ` [PATCH 4/4] memcg: restrict __mod_memcg_lruvec_state to memcg stats Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).