* [PATCH v5 0/7] Convert all THP vmstat counters to pages
@ 2020-12-17  3:43 Muchun Song
  2020-12-17  3:43 ` [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving Muchun Song
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

This patch series aims to convert all THP vmstat counters to pages.

Some vmstat counters are in units of pages, some in bytes, some in
HPAGE_PMD_NR, and some in KiB. When we want to expose these vmstat
counters to userspace, we have to know which unit each counter uses.
When the unit is bytes or kB, it is clearly distinguishable by the
B/KB suffix. But for the THP vmstat counters, it is easy to make
mistakes.

For example, below are some bug fixes for the THP vmstat counters:

  - 7de2e9f195b9 ("mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg")
  - The first commit in this series ("fix NR_ANON_THPS accounting in charge moving")

This patch series makes the code clearer and puts all the THP vmstat
counters in units of pages. In the end, the vmstat counters are in
units of pages, kB, or bytes. The B/KB suffix tells us that the unit
is bytes or kB; the rest, without a suffix, are in pages.

In this series, I changed the unit of the following vmstat counters
from HPAGE_PMD_NR to pages. However, there is no change to the print
format of the output to userspace (see the example after the list):

  - NR_ANON_THPS
  - NR_FILE_THPS
  - NR_SHMEM_THPS
  - NR_SHMEM_PMDMAPPED
  - NR_FILE_PMDMAPPED
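
For instance (illustrative numbers, not captured output), with two
2 MB anonymous THPs present, the user-visible values are the same
before and after this series:

  # grep AnonHugePages /proc/meminfo
  AnonHugePages:      4096 kB
  # grep nr_anon_transparent_hugepages /proc/vmstat
  nr_anon_transparent_hugepages 2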

Doing this also makes the statistics more accurate for the THP
vmstat counters. This series is consistent with 8f182270dfec
("mm/swap.c: flush lru pvecs on compound page arrival").

We use struct per_cpu_nodestat to cache the vmstat counters, which
leads to inaccurate statistics, especially for the THP vmstat
counters. On systems with hundreds of processors the error can be GBs
of memory. For example, on a 96-CPU system the per-cpu stat threshold
reaches its maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.
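
That figure works out as follows (assuming 4 KB base pages, so a
PMD-sized THP is HPAGE_PMD_NR = 512 pages = 2 MB, and the counters
are still in THP units):

  96 CPUs * 125 (per-cpu threshold) * 2 MB = 24000 MB = 23.4375 GB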

A THP is already a form of batched addition (it adds 512 pages'
worth of memory in one go), so skipping the per-cpu batching seems
sensible. Every THP stats update then overflows the per-cpu counter
and resorts to an atomic global update, but it makes the statistics
more accurate for the THP vmstat counters. From this point of view,
I think doing this conversion is reasonable.
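
For reference, here is a minimal sketch of the per-cpu update path
(modeled on __mod_node_page_state() in mm/vmstat.c; the names and
details are simplified, so treat this as an illustration rather than
the exact kernel code):

	/*
	 * Sketch: a delta is cached in the per-cpu diff until its
	 * absolute value exceeds the per-cpu threshold (at most
	 * 125), at which point it is flushed to the global atomic
	 * counter. A page-unit THP delta of +/-512 always exceeds
	 * the threshold, so it is never cached.
	 */
	static void sketch_mod_node_state(pg_data_t *pgdat,
					  enum node_stat_item item,
					  long delta)
	{
		struct per_cpu_nodestat __percpu *pcp =
					pgdat->per_cpu_nodestats;
		s8 __percpu *p = pcp->vm_node_stat_diff + item;
		long x = delta + __this_cpu_read(*p);
		long t = __this_cpu_read(pcp->stat_threshold);

		if (unlikely(x > t || x < -t)) {
			node_page_state_add(x, pgdat, item);
			x = 0;
		}
		__this_cpu_write(*p, x);
	}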

Thanks to Hugh for mentioning this. This was inspired by Johannes
and Roman; thanks to them.

Changes in v4 -> v5:
  - Add motivation to each patch. Thanks to Michal.
  - Replace some uses of HPAGE_PMD_NR with thp_nr_pages(). Thanks to Matthew.

Changes in v3 -> v4:
  - Rename the first commit subject to "mm: memcontrol: fix NR_ANON_THPS
    accounting in charge moving".
  - Fix /proc/vmstat printing. Thanks to Johannes for pointing that out.

Changes in v2 -> v3:
  - Change the series subject from "Convert all vmstat counters to pages or bytes"
    to "Convert all THP vmstat counters to pages".
  - Remove the conversion of KB to B.

Changes in v1 -> v2:
  - Change the series subject from "Convert all THP vmstat counters to pages"
    to "Convert all vmstat counters to pages or bytes".
  - Convert NR_KERNEL_SCS_KB account to bytes.
  - Convert vmstat slab counters to bytes.
  - Remove {global_}node_page_state_pages.

Muchun Song (7):
  mm: memcontrol: fix NR_ANON_THPS accounting in charge moving
  mm: memcontrol: convert NR_ANON_THPS account to pages
  mm: memcontrol: convert NR_FILE_THPS account to pages
  mm: memcontrol: convert NR_SHMEM_THPS account to pages
  mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages
  mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages
  mm: memcontrol: make the slab calculation consistent

 drivers/base/node.c    |  27 +++++-----
 fs/proc/meminfo.c      |  10 ++--
 include/linux/mmzone.h |  14 ++++++
 mm/filemap.c           |   4 +-
 mm/huge_memory.c       |  11 +++--
 mm/khugepaged.c        |   6 ++-
 mm/memcontrol.c        | 132 +++++++++++++++++++++++--------------------------
 mm/page_alloc.c        |   7 ++-
 mm/rmap.c              |  26 ++++++----
 mm/shmem.c             |   2 +-
 mm/vmstat.c            |  11 ++++-
 11 files changed, 139 insertions(+), 111 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving
  2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
@ 2020-12-17  3:43 ` Muchun Song
  2020-12-23 21:01     ` Shakeel Butt
  2020-12-17  3:43 ` [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages Muchun Song
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song,
	Michal Hocko, Pankaj Gupta

The unit of NR_ANON_THPS is already HPAGE_PMD_NR, so it should be
incremented/decremented by one rather than by nr_pages.

Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b80328f52fb4..8818bf64d6fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5653,10 +5653,8 @@ static int mem_cgroup_move_account(struct page *page,
 			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
 			if (PageTransHuge(page)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
+				__dec_lruvec_state(from_vec, NR_ANON_THPS);
+				__inc_lruvec_state(to_vec, NR_ANON_THPS);
 			}
 
 		}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages
  2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
  2020-12-17  3:43 ` [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving Muchun Song
@ 2020-12-17  3:43 ` Muchun Song
  2020-12-23 22:08     ` Shakeel Butt
  2020-12-17  3:43 ` [PATCH v5 3/7] mm: memcontrol: convert NR_FILE_THPS " Muchun Song
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Currently we use struct per_cpu_nodestat to cache the vmstat
counters, which leads to inaccurate statistics, especially
for the THP vmstat counters. On systems with hundreds of
processors the error can be GBs of memory. For example, on
a 96-CPU system the per-cpu stat threshold reaches its
maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512
pages' worth of memory in one go), so skipping the per-cpu
batching seems sensible. Every THP stats update then
overflows the per-cpu counter and resorts to an atomic
global update, but it makes the statistics more accurate
for the THP vmstat counters.

So we convert the NR_ANON_THPS account to pages. This patch
is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs
on compound page arrival"). Doing this also makes the units
of the vmstat counters more consistent. In the end, the
vmstat counters are in units of pages, kB, or bytes; the
B/KB suffix tells us that the unit is bytes or kB, and the
rest, without a suffix, are in pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 drivers/base/node.c    | 15 +++++++++------
 fs/proc/meminfo.c      |  2 +-
 include/linux/mmzone.h | 10 ++++++++++
 mm/huge_memory.c       |  3 ++-
 mm/memcontrol.c        | 20 ++++++--------------
 mm/page_alloc.c        |  2 +-
 mm/rmap.c              |  6 +++---
 mm/vmstat.c            | 11 +++++++++--
 8 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 04f71c7bc3f8..6da0c3508bc9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -461,8 +461,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(sunreclaimable)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     ,
-			     nid, K(node_page_state(pgdat, NR_ANON_THPS) *
-				    HPAGE_PMD_NR),
+			     nid, K(node_page_state(pgdat, NR_ANON_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				    HPAGE_PMD_NR),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
@@ -519,10 +518,14 @@ static ssize_t node_read_vmstat(struct device *dev,
 				     sum_zone_numa_state(nid, i));
 
 #endif
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		len += sysfs_emit_at(buf, len, "%s %lu\n",
-				     node_stat_name(i),
-				     node_page_state_pages(pgdat, i));
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		unsigned long pages = node_page_state_pages(pgdat, i);
+
+		if (vmstat_item_print_in_thp(i))
+			pages /= HPAGE_PMD_NR;
+		len += sysfs_emit_at(buf, len, "%s %lu\n", node_stat_name(i),
+				     pages);
+	}
 
 	return len;
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d6fc74619625..a635c8a84ddf 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -129,7 +129,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
-		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_ANON_THPS));
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..71029c09782b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -210,6 +210,16 @@ enum node_stat_item {
 };
 
 /*
+ * Returns true if the item should be printed in THPs (/proc/vmstat
+ * currently prints number of anon, file and shmem THPs. But the item
+ * is charged in pages).
+ */
+static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
+{
+	return item == NR_ANON_THPS;
+}
+
+/*
  * Returns true if the value is measured in bytes (most vmstat values are
  * measured in pages). This defines the API part, the internal representation
  * might be different.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10dd3cae5f53..66ec454120de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2178,7 +2178,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		lock_page_memcg(page);
 		if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
 			/* Last compound_mapcount is gone. */
-			__dec_lruvec_page_state(page, NR_ANON_THPS);
+			__mod_lruvec_page_state(page, NR_ANON_THPS,
+						-HPAGE_PMD_NR);
 			if (TestClearPageDoubleMap(page)) {
 				/* No need in mapcount reference anymore */
 				for (i = 0; i < HPAGE_PMD_NR; i++)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8818bf64d6fe..b18e25a5cdf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1532,7 +1532,7 @@ static struct memory_stat memory_stats[] = {
 	 * on some architectures, the macro of HPAGE_PMD_SIZE is not
 	 * constant(e.g. powerpc).
 	 */
-	{ "anon_thp", 0, NR_ANON_THPS },
+	{ "anon_thp", PAGE_SIZE, NR_ANON_THPS },
 	{ "file_thp", 0, NR_FILE_THPS },
 	{ "shmem_thp", 0, NR_SHMEM_THPS },
 #endif
@@ -1565,8 +1565,7 @@ static int __init memory_stats_init(void)
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (memory_stats[i].idx == NR_ANON_THPS ||
-		    memory_stats[i].idx == NR_FILE_THPS ||
+		if (memory_stats[i].idx == NR_FILE_THPS ||
 		    memory_stats[i].idx == NR_SHMEM_THPS)
 			memory_stats[i].ratio = HPAGE_PMD_SIZE;
 #endif
@@ -4088,10 +4087,6 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
 			continue;
 		nr = memcg_page_state_local(memcg, memcg1_stats[i]);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (memcg1_stats[i] == NR_ANON_THPS)
-			nr *= HPAGE_PMD_NR;
-#endif
 		seq_printf(m, "%s %lu\n", memcg1_stat_names[i], nr * PAGE_SIZE);
 	}
 
@@ -4122,10 +4117,6 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
 			continue;
 		nr = memcg_page_state(memcg, memcg1_stats[i]);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (memcg1_stats[i] == NR_ANON_THPS)
-			nr *= HPAGE_PMD_NR;
-#endif
 		seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
 						(u64)nr * PAGE_SIZE);
 	}
@@ -5653,10 +5644,11 @@ static int mem_cgroup_move_account(struct page *page,
 			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
 			if (PageTransHuge(page)) {
-				__dec_lruvec_state(from_vec, NR_ANON_THPS);
-				__inc_lruvec_state(to_vec, NR_ANON_THPS);
+				__mod_lruvec_state(from_vec, NR_ANON_THPS,
+						   -nr_pages);
+				__mod_lruvec_state(to_vec, NR_ANON_THPS,
+						   nr_pages);
 			}
-
 		}
 	} else {
 		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 469e28f95ce7..1700f52b7869 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5580,7 +5580,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
 					* HPAGE_PMD_NR),
-			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS)),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/rmap.c b/mm/rmap.c
index 08c56aaf72eb..c4d5c63cfd29 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1144,7 +1144,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound)
-			__inc_lruvec_page_state(page, NR_ANON_THPS);
+			__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
@@ -1186,7 +1186,7 @@ void page_add_new_anon_rmap(struct page *page,
 		if (hpage_pincount_available(page))
 			atomic_set(compound_pincount_ptr(page), 0);
 
-		__inc_lruvec_page_state(page, NR_ANON_THPS);
+		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1292,7 +1292,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_lruvec_page_state(page, NR_ANON_THPS);
+	__mod_lruvec_page_state(page, NR_ANON_THPS, -thp_nr_pages(page));
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 663f49ed1fd6..37c9e7b21e1e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1624,8 +1624,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	if (is_zone_first_populated(pgdat, zone)) {
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			unsigned long pages = node_page_state_pages(pgdat, i);
+
+			if (vmstat_item_print_in_thp(i))
+				pages /= HPAGE_PMD_NR;
 			seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
-				   node_page_state_pages(pgdat, i));
+				   pages);
 		}
 	}
 	seq_printf(m,
@@ -1745,8 +1749,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	v += NR_VM_NUMA_STAT_ITEMS;
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 		v[i] = global_node_page_state_pages(i);
+		if (vmstat_item_print_in_thp(i))
+			v[i] /= HPAGE_PMD_NR;
+	}
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 3/7] mm: memcontrol: convert NR_FILE_THPS account to pages
  2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
  2020-12-17  3:43 ` [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving Muchun Song
  2020-12-17  3:43 ` [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages Muchun Song
@ 2020-12-17  3:43 ` Muchun Song
  2020-12-17  3:43   ` Muchun Song
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Currently we use struct per_cpu_nodestat to cache the vmstat
counters, which leads to inaccurate statistics, especially
for the THP vmstat counters. On systems with hundreds of
processors the error can be GBs of memory. For example, on
a 96-CPU system the per-cpu stat threshold reaches its
maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512
pages' worth of memory in one go), so skipping the per-cpu
batching seems sensible. Every THP stats update then
overflows the per-cpu counter and resorts to an atomic
global update, but it makes the statistics more accurate
for the THP vmstat counters.

So we convert the NR_FILE_THPS account to pages. This patch
is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs
on compound page arrival"). Doing this also makes the units
of the vmstat counters more consistent. In the end, the
vmstat counters are in units of pages, kB, or bytes; the
B/KB suffix tells us that the unit is bytes or kB, and the
rest, without a suffix, are in pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 3 ++-
 mm/filemap.c           | 2 +-
 mm/huge_memory.c       | 5 ++++-
 mm/khugepaged.c        | 4 +++-
 mm/memcontrol.c        | 5 ++---
 7 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6da0c3508bc9..d5952f754911 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -466,8 +466,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 				    HPAGE_PMD_NR),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
 				    HPAGE_PMD_NR),
-			     nid, K(node_page_state(pgdat, NR_FILE_THPS) *
-				    HPAGE_PMD_NR),
+			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
 				    HPAGE_PMD_NR)
 #endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index a635c8a84ddf..7ea4679880c8 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -135,7 +135,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "ShmemPmdMapped: ",
 		    global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
 	show_val_kb(m, "FileHugePages:  ",
-		    global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_FILE_THPS));
 	show_val_kb(m, "FilePmdMapped:  ",
 		    global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 71029c09782b..19e77f656410 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -216,7 +216,8 @@ enum node_stat_item {
  */
 static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
 {
-	return item == NR_ANON_THPS;
+	return item == NR_ANON_THPS ||
+	       item == NR_FILE_THPS;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 78090ee08ac2..c5e6f5202476 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -207,7 +207,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 		if (PageTransHuge(page))
 			__dec_lruvec_page_state(page, NR_SHMEM_THPS);
 	} else if (PageTransHuge(page)) {
-		__dec_lruvec_page_state(page, NR_FILE_THPS);
+		__mod_lruvec_page_state(page, NR_FILE_THPS, -nr);
 		filemap_nr_thps_dec(mapping);
 	}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 66ec454120de..cdf61596ef76 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2745,10 +2745,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
+			int nr = thp_nr_pages(head);
+
 			if (PageSwapBacked(head))
 				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
 			else
-				__dec_lruvec_page_state(head, NR_FILE_THPS);
+				__mod_lruvec_page_state(head, NR_FILE_THPS,
+							-nr);
 		}
 
 		__split_huge_page(page, list, end);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 494d3cb0b58a..23f93a3e2e69 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1654,6 +1654,7 @@ static void collapse_file(struct mm_struct *mm,
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
 	bool is_shmem = shmem_file(file);
+	int nr;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
@@ -1865,11 +1866,12 @@ static void collapse_file(struct mm_struct *mm,
 		put_page(page);
 		goto xa_unlocked;
 	}
+	nr = thp_nr_pages(new_page);
 
 	if (is_shmem)
 		__inc_lruvec_page_state(new_page, NR_SHMEM_THPS);
 	else {
-		__inc_lruvec_page_state(new_page, NR_FILE_THPS);
+		__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
 		filemap_nr_thps_inc(mapping);
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b18e25a5cdf3..04985c8c6a0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1533,7 +1533,7 @@ static struct memory_stat memory_stats[] = {
 	 * constant(e.g. powerpc).
 	 */
 	{ "anon_thp", PAGE_SIZE, NR_ANON_THPS },
-	{ "file_thp", 0, NR_FILE_THPS },
+	{ "file_thp", PAGE_SIZE, NR_FILE_THPS },
 	{ "shmem_thp", 0, NR_SHMEM_THPS },
 #endif
 	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
@@ -1565,8 +1565,7 @@ static int __init memory_stats_init(void)
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (memory_stats[i].idx == NR_FILE_THPS ||
-		    memory_stats[i].idx == NR_SHMEM_THPS)
+		if (memory_stats[i].idx == NR_SHMEM_THPS)
 			memory_stats[i].ratio = HPAGE_PMD_SIZE;
 #endif
 		VM_BUG_ON(!memory_stats[i].ratio);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 4/7] mm: memcontrol: convert NR_SHMEM_THPS account to pages
@ 2020-12-17  3:43   ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Currently we use struct per_cpu_nodestat to cache the vmstat
counters, which leads to inaccurate statistics, especially
for the THP vmstat counters. On systems with hundreds of
processors the error can be GBs of memory. For example, on
a 96-CPU system the per-cpu stat threshold reaches its
maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512
pages' worth of memory in one go), so skipping the per-cpu
batching seems sensible. Every THP stats update then
overflows the per-cpu counter and resorts to an atomic
global update, but it makes the statistics more accurate
for the THP vmstat counters.

So we convert the NR_SHMEM_THPS account to pages. This patch
is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs
on compound page arrival"). Doing this also makes the units
of the vmstat counters more consistent. In the end, the
vmstat counters are in units of pages, kB, or bytes; the
B/KB suffix tells us that the unit is bytes or kB, and the
rest, without a suffix, are in pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 drivers/base/node.c    |  3 +--
 fs/proc/meminfo.c      |  2 +-
 include/linux/mmzone.h |  3 ++-
 mm/filemap.c           |  2 +-
 mm/huge_memory.c       |  3 ++-
 mm/khugepaged.c        |  2 +-
 mm/memcontrol.c        | 26 ++------------------------
 mm/page_alloc.c        |  2 +-
 mm/shmem.c             |  2 +-
 9 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index d5952f754911..6d5ac6ffb6e1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -462,8 +462,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			     ,
 			     nid, K(node_page_state(pgdat, NR_ANON_THPS)),
-			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
-				    HPAGE_PMD_NR),
+			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
 				    HPAGE_PMD_NR),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 7ea4679880c8..cfb107eaa3e6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -131,7 +131,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "AnonHugePages:  ",
 		    global_node_page_state(NR_ANON_THPS));
 	show_val_kb(m, "ShmemHugePages: ",
-		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_SHMEM_THPS));
 	show_val_kb(m, "ShmemPmdMapped: ",
 		    global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
 	show_val_kb(m, "FileHugePages:  ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 19e77f656410..3b45f39011ab 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -217,7 +217,8 @@ enum node_stat_item {
 static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
 {
 	return item == NR_ANON_THPS ||
-	       item == NR_FILE_THPS;
+	       item == NR_FILE_THPS ||
+	       item == NR_SHMEM_THPS;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e6f5202476..1952e923cc2e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -205,7 +205,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	if (PageSwapBacked(page)) {
 		__mod_lruvec_page_state(page, NR_SHMEM, -nr);
 		if (PageTransHuge(page))
-			__dec_lruvec_page_state(page, NR_SHMEM_THPS);
+			__mod_lruvec_page_state(page, NR_SHMEM_THPS, -nr);
 	} else if (PageTransHuge(page)) {
 		__mod_lruvec_page_state(page, NR_FILE_THPS, -nr);
 		filemap_nr_thps_dec(mapping);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdf61596ef76..5aa045e3b5dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2748,7 +2748,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			int nr = thp_nr_pages(head);
 
 			if (PageSwapBacked(head))
-				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
+				__mod_lruvec_page_state(head, NR_SHMEM_THPS,
+							-nr);
 			else
 				__mod_lruvec_page_state(head, NR_FILE_THPS,
 							-nr);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 23f93a3e2e69..8369d9620f6d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1869,7 +1869,7 @@ static void collapse_file(struct mm_struct *mm,
 	nr = thp_nr_pages(new_page);
 
 	if (is_shmem)
-		__inc_lruvec_page_state(new_page, NR_SHMEM_THPS);
+		__mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
 	else {
 		__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
 		filemap_nr_thps_inc(mapping);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 04985c8c6a0a..a40797a27f87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1515,7 +1515,7 @@ struct memory_stat {
 	unsigned int idx;
 };
 
-static struct memory_stat memory_stats[] = {
+static const struct memory_stat memory_stats[] = {
 	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
 	{ "file", PAGE_SIZE, NR_FILE_PAGES },
 	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
@@ -1527,14 +1527,9 @@ static struct memory_stat memory_stats[] = {
 	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
 	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	/*
-	 * The ratio will be initialized in memory_stats_init(). Because
-	 * on some architectures, the macro of HPAGE_PMD_SIZE is not
-	 * constant(e.g. powerpc).
-	 */
 	{ "anon_thp", PAGE_SIZE, NR_ANON_THPS },
 	{ "file_thp", PAGE_SIZE, NR_FILE_THPS },
-	{ "shmem_thp", 0, NR_SHMEM_THPS },
+	{ "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
 #endif
 	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
 	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
@@ -1559,23 +1554,6 @@ static struct memory_stat memory_stats[] = {
 	{ "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
 };
 
-static int __init memory_stats_init(void)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (memory_stats[i].idx == NR_SHMEM_THPS)
-			memory_stats[i].ratio = HPAGE_PMD_SIZE;
-#endif
-		VM_BUG_ON(!memory_stats[i].ratio);
-		VM_BUG_ON(memory_stats[i].idx >= MEMCG_NR_STAT);
-	}
-
-	return 0;
-}
-pure_initcall(memory_stats_init);
-
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1700f52b7869..720fb5a220b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5577,7 +5577,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_WRITEBACK)),
 			K(node_page_state(pgdat, NR_SHMEM)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
 					* HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_ANON_THPS)),
diff --git a/mm/shmem.c b/mm/shmem.c
index 53d84d2c9fe5..de261cfbf987 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -713,7 +713,7 @@ static int shmem_add_to_page_cache(struct page *page,
 		}
 		if (PageTransHuge(page)) {
 			count_vm_event(THP_FILE_ALLOC);
-			__inc_lruvec_page_state(page, NR_SHMEM_THPS);
+			__mod_lruvec_page_state(page, NR_SHMEM_THPS, nr);
 		}
 		mapping->nrpages += nr;
 		__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 5/7] mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages
  2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
                   ` (3 preceding siblings ...)
  2020-12-17  3:43   ` Muchun Song
@ 2020-12-17  3:43 ` Muchun Song
  2020-12-17  3:43   ` Muchun Song
  2020-12-17  3:43 ` [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent Muchun Song
  6 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Currently we use struct per_cpu_nodestat to cache the vmstat
counters, which leads to inaccurate statistics, especially
for the THP vmstat counters. On systems with hundreds of
processors the error can be GBs of memory. For example, on
a 96-CPU system the per-cpu stat threshold reaches its
maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512
pages' worth of memory in one go), so skipping the per-cpu
batching seems sensible. Every THP stats update then
overflows the per-cpu counter and resorts to an atomic
global update, but it makes the statistics more accurate
for the THP vmstat counters.

So we convert the NR_SHMEM_PMDMAPPED account to pages. This
patch is consistent with 8f182270dfec ("mm/swap.c: flush lru
pvecs on compound page arrival"). Doing this also makes the
units of the vmstat counters more consistent. In the end,
the vmstat counters are in units of pages, kB, or bytes; the
B/KB suffix tells us that the unit is bytes or kB, and the
rest, without a suffix, are in pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 drivers/base/node.c    |  3 +--
 fs/proc/meminfo.c      |  2 +-
 include/linux/mmzone.h |  3 ++-
 mm/page_alloc.c        |  3 +--
 mm/rmap.c              | 14 ++++++++++----
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6d5ac6ffb6e1..7a66aefe4e46 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -463,8 +463,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     ,
 			     nid, K(node_page_state(pgdat, NR_ANON_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS)),
-			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
-				    HPAGE_PMD_NR),
+			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
 			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
 				    HPAGE_PMD_NR)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cfb107eaa3e6..c61f440570f9 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -133,7 +133,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS));
 	show_val_kb(m, "ShmemPmdMapped: ",
-		    global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_SHMEM_PMDMAPPED));
 	show_val_kb(m, "FileHugePages:  ",
 		    global_node_page_state(NR_FILE_THPS));
 	show_val_kb(m, "FilePmdMapped:  ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3b45f39011ab..1ca6a34f6a4a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -218,7 +218,8 @@ static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
 {
 	return item == NR_ANON_THPS ||
 	       item == NR_FILE_THPS ||
-	       item == NR_SHMEM_THPS;
+	       item == NR_SHMEM_THPS ||
+	       item == NR_SHMEM_PMDMAPPED;
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 720fb5a220b6..575fbfeea4b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5578,8 +5578,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(node_page_state(pgdat, NR_SHMEM_THPS)),
-			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
-					* HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			K(node_page_state(pgdat, NR_ANON_THPS)),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
diff --git a/mm/rmap.c b/mm/rmap.c
index c4d5c63cfd29..1c1b576c0627 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1211,14 +1211,17 @@ void page_add_file_rmap(struct page *page, bool compound)
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
-		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
+		int nr_pages = thp_nr_pages(page);
+
+		for (i = 0, nr = 0; i < nr_pages; i++) {
 			if (atomic_inc_and_test(&page[i]._mapcount))
 				nr++;
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
 		if (PageSwapBacked(page))
-			__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
+			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
+						nr_pages);
 		else
 			__inc_node_page_state(page, NR_FILE_PMDMAPPED);
 	} else {
@@ -1252,14 +1255,17 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 
 	/* page still mapped by someone else? */
 	if (compound && PageTransHuge(page)) {
-		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
+		int nr_pages = thp_nr_pages(page);
+
+		for (i = 0, nr = 0; i < nr_pages; i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			return;
 		if (PageSwapBacked(page))
-			__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
+			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
+						-nr_pages);
 		else
 			__dec_node_page_state(page, NR_FILE_PMDMAPPED);
 	} else {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 6/7] mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages
@ 2020-12-17  3:43   ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Currently we use struct per_cpu_nodestat to cache the vmstat
counters, which leads to inaccurate statistics, especially
for the THP vmstat counters. On systems with hundreds of
processors the error can be GBs of memory. For example, on
a 96-CPU system the per-cpu stat threshold reaches its
maximum value of 125, and the per-cpu counters can cache
23.4375 GB in total.

A THP is already a form of batched addition (it adds 512
pages' worth of memory in one go), so skipping the per-cpu
batching seems sensible. Every THP stats update then
overflows the per-cpu counter and resorts to an atomic
global update, but it makes the statistics more accurate
for the THP vmstat counters.

So we convert the NR_FILE_PMDMAPPED account to pages. This
patch is consistent with 8f182270dfec ("mm/swap.c: flush lru
pvecs on compound page arrival"). Doing this also makes the
units of the vmstat counters more consistent. In the end,
the vmstat counters are in units of pages, kB, or bytes; the
B/KB suffix tells us that the unit is bytes or kB, and the
rest, without a suffix, are in pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 drivers/base/node.c    | 3 +--
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 3 ++-
 mm/rmap.c              | 6 ++++--
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7a66aefe4e46..d02d86aec19f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -465,8 +465,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(node_page_state(pgdat, NR_SHMEM_THPS)),
 			     nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
 			     nid, K(node_page_state(pgdat, NR_FILE_THPS)),
-			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
-				    HPAGE_PMD_NR)
+			     nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index c61f440570f9..6fa761c9cc78 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -137,7 +137,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "FileHugePages:  ",
 		    global_node_page_state(NR_FILE_THPS));
 	show_val_kb(m, "FilePmdMapped:  ",
-		    global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_FILE_PMDMAPPED));
 #endif
 
 #ifdef CONFIG_CMA
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ca6a34f6a4a..6139479ba417 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -219,7 +219,8 @@ static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
 	return item == NR_ANON_THPS ||
 	       item == NR_FILE_THPS ||
 	       item == NR_SHMEM_THPS ||
-	       item == NR_SHMEM_PMDMAPPED;
+	       item == NR_SHMEM_PMDMAPPED ||
+	       item == NR_FILE_PMDMAPPED;
 }
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c1b576c0627..5ebf16fae4b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1223,7 +1223,8 @@ void page_add_file_rmap(struct page *page, bool compound)
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
 		else
-			__inc_node_page_state(page, NR_FILE_PMDMAPPED);
+			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
+						nr_pages);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
 			VM_WARN_ON_ONCE(!PageLocked(page));
@@ -1267,7 +1268,8 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						-nr_pages);
 		else
-			__dec_node_page_state(page, NR_FILE_PMDMAPPED);
+			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
+						-nr_pages);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
 			return;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent
  2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
                   ` (5 preceding siblings ...)
  2020-12-17  3:43   ` Muchun Song
@ 2020-12-17  3:43 ` Muchun Song
  2020-12-23 21:21     ` Shakeel Butt
  6 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2020-12-17  3:43 UTC (permalink / raw)
  To: gregkh, rafael, adobriyan, akpm, hannes, mhocko, vdavydov.dev,
	hughd, shakeelb, guro, samitolvanen, feng.tang, neilb,
	iamjoonsoo.kim, rdunlap
  Cc: linux-kernel, linux-fsdevel, linux-mm, cgroups, Muchun Song

Although the ratio of the slab counters is one, we should still read
the ratio from the related memory_stats entry instead of hard-coding
it. And the local variable size already holds the slab_unreclaimable
value, so we do not need to read it again.

Doing that directly would require code like the following:

if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
-	size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
-	       memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
+	size += memcg_page_state(memcg, memory_stats[i - 1].idx) *
+		memory_stats[i - 1].ratio;

That would require a series of BUG_ONs or comments to ensure that
these two items are actually adjacent and in the right order. So it
is easier to implement this using a wrapper with a big switch() for
unit conversion.

This fixes the ratio inconsistency and gets rid of the ordering
requirement.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 66 insertions(+), 39 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a40797a27f87..eec44918d373 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1511,49 +1511,71 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
 
 struct memory_stat {
 	const char *name;
-	unsigned int ratio;
 	unsigned int idx;
 };
 
 static const struct memory_stat memory_stats[] = {
-	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
-	{ "file", PAGE_SIZE, NR_FILE_PAGES },
-	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
-	{ "pagetables", PAGE_SIZE, NR_PAGETABLE },
-	{ "percpu", 1, MEMCG_PERCPU_B },
-	{ "sock", PAGE_SIZE, MEMCG_SOCK },
-	{ "shmem", PAGE_SIZE, NR_SHMEM },
-	{ "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
-	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
-	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
+	{ "anon",			NR_ANON_MAPPED			},
+	{ "file",			NR_FILE_PAGES			},
+	{ "kernel_stack",		NR_KERNEL_STACK_KB		},
+	{ "pagetables",			NR_PAGETABLE			},
+	{ "percpu",			MEMCG_PERCPU_B			},
+	{ "sock",			MEMCG_SOCK			},
+	{ "shmem",			NR_SHMEM			},
+	{ "file_mapped",		NR_FILE_MAPPED			},
+	{ "file_dirty",			NR_FILE_DIRTY			},
+	{ "file_writeback",		NR_WRITEBACK			},
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{ "anon_thp", PAGE_SIZE, NR_ANON_THPS },
-	{ "file_thp", PAGE_SIZE, NR_FILE_THPS },
-	{ "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
+	{ "anon_thp",			NR_ANON_THPS			},
+	{ "file_thp",			NR_FILE_THPS			},
+	{ "shmem_thp",			NR_SHMEM_THPS			},
 #endif
-	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
-	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
-	{ "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
-	{ "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
-	{ "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
-
-	/*
-	 * Note: The slab_reclaimable and slab_unreclaimable must be
-	 * together and slab_reclaimable must be in front.
-	 */
-	{ "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
-	{ "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
+	{ "inactive_anon",		NR_INACTIVE_ANON		},
+	{ "active_anon",		NR_ACTIVE_ANON			},
+	{ "inactive_file",		NR_INACTIVE_FILE		},
+	{ "active_file",		NR_ACTIVE_FILE			},
+	{ "unevictable",		NR_UNEVICTABLE			},
+	{ "slab_reclaimable",		NR_SLAB_RECLAIMABLE_B		},
+	{ "slab_unreclaimable",		NR_SLAB_UNRECLAIMABLE_B		},
 
 	/* The memory events */
-	{ "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
-	{ "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
-	{ "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
-	{ "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
-	{ "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
-	{ "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
-	{ "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
+	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
+	{ "workingset_refault_file",	WORKINGSET_REFAULT_FILE		},
+	{ "workingset_activate_anon",	WORKINGSET_ACTIVATE_ANON	},
+	{ "workingset_activate_file",	WORKINGSET_ACTIVATE_FILE	},
+	{ "workingset_restore_anon",	WORKINGSET_RESTORE_ANON		},
+	{ "workingset_restore_file",	WORKINGSET_RESTORE_FILE		},
+	{ "workingset_nodereclaim",	WORKINGSET_NODERECLAIM		},
 };
 
+/* Translate stat items to the correct unit for memory.stat output */
+static int memcg_page_state_unit(int item)
+{
+	switch (item) {
+	case MEMCG_PERCPU_B:
+	case NR_SLAB_RECLAIMABLE_B:
+	case NR_SLAB_UNRECLAIMABLE_B:
+	case WORKINGSET_REFAULT_ANON:
+	case WORKINGSET_REFAULT_FILE:
+	case WORKINGSET_ACTIVATE_ANON:
+	case WORKINGSET_ACTIVATE_FILE:
+	case WORKINGSET_RESTORE_ANON:
+	case WORKINGSET_RESTORE_FILE:
+	case WORKINGSET_NODERECLAIM:
+		return 1;
+	case NR_KERNEL_STACK_KB:
+		return SZ_1K;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
+						    int item)
+{
+	return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
+}
+
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
@@ -1577,13 +1599,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
-		size = memcg_page_state(memcg, memory_stats[i].idx);
-		size *= memory_stats[i].ratio;
+		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
 
 		if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
-			size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
-			       memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
+			size += memcg_page_state_output(memcg,
+							NR_SLAB_RECLAIMABLE_B);
 			seq_buf_printf(&s, "slab %llu\n", size);
 		}
 	}
@@ -6377,6 +6398,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
 }
 
 #ifdef CONFIG_NUMA
+static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
+						     int item)
+{
+	return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
+}
+
 static int memory_numa_stat_show(struct seq_file *m, void *v)
 {
 	int i;
@@ -6394,8 +6421,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 			struct lruvec *lruvec;
 
 			lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-			size = lruvec_page_state(lruvec, memory_stats[i].idx);
-			size *= memory_stats[i].ratio;
+			size = lruvec_page_state_output(lruvec,
+							memory_stats[i].idx);
 			seq_printf(m, " N%d=%llu", nid, size);
 		}
 		seq_putc(m, '\n');
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving
  2020-12-17  3:43 ` [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving Muchun Song
@ 2020-12-23 21:01     ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2020-12-23 21:01 UTC (permalink / raw)
  To: Muchun Song
  Cc: Greg Kroah-Hartman, rafael, Alexey Dobriyan, Andrew Morton,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
	Roman Gushchin, Sami Tolvanen, Feng Tang, Neil Brown,
	Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel, Linux MM,
	Cgroups, Michal Hocko, Pankaj Gupta

On Wed, Dec 16, 2020 at 7:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> The unit of NR_ANON_THPS is HPAGE_PMD_NR already. So it should inc/dec
> by one rather than nr_pages.
>
> Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
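
(For context, the bug being fixed here is charge moving updating
NR_ANON_THPS by nr_pages while that counter was still maintained in
HPAGE_PMD_NR units. Sketched from the commit message above, rather than
copied from the posted diff, the fix amounts to:

	if (PageTransHuge(page)) {
		/* NR_ANON_THPS counts THPs here, not base pages */
		__mod_lruvec_state(from_vec, NR_ANON_THPS, -1);
		__mod_lruvec_state(to_vec, NR_ANON_THPS, 1);
	}

i.e. inc/dec by one THP instead of by nr_pages.)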

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent
  2020-12-17  3:43 ` [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent Muchun Song
  2020-12-23 21:21     ` Shakeel Butt
@ 2020-12-23 21:21     ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2020-12-23 21:21 UTC (permalink / raw)
  To: Muchun Song
  Cc: Greg Kroah-Hartman, rafael, Alexey Dobriyan, Andrew Morton,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
	Roman Gushchin, Sami Tolvanen, Feng Tang, Neil Brown,
	Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel, Linux MM,
	Cgroups

On Wed, Dec 16, 2020 at 7:46 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> Although the ratio for the slab counters is one, we should still read
> the ratio from the related memory_stats entry instead of hard-coding it.
> Also, the local variable size already holds the value of
> slab_unreclaimable, so there is no need to read it again.
>
> To do this we need some code like below:
>
> if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> -       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> -              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> +       size += memcg_page_state(memcg, memory_stats[i - 1].idx) *
> +               memory_stats[i - 1].ratio;
>
> It requires a series of BUG_ONs or comments to ensure these two
> items are actually adjacent and in the right order. So it would
> probably be easier to implement this using a wrapper that has a
> big switch() for unit conversion.
>
> This would fix the ratio inconsistency and get rid of the ordering
> requirement.
>

The commit message is really confusing. It is explaining an approach
that the patch does not actually take. I don't see any benefit in
mentioning BUG_ONs or [i-1]s in the message. The patch makes sure that
we use the right ratio for slab. Can you rewrite the commit message to
motivate it in just that regard?

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 66 insertions(+), 39 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a40797a27f87..eec44918d373 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1511,49 +1511,71 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>
>  struct memory_stat {
>         const char *name;
> -       unsigned int ratio;
>         unsigned int idx;
>  };
>
>  static const struct memory_stat memory_stats[] = {
> -       { "anon", PAGE_SIZE, NR_ANON_MAPPED },
> -       { "file", PAGE_SIZE, NR_FILE_PAGES },
> -       { "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> -       { "pagetables", PAGE_SIZE, NR_PAGETABLE },
> -       { "percpu", 1, MEMCG_PERCPU_B },
> -       { "sock", PAGE_SIZE, MEMCG_SOCK },
> -       { "shmem", PAGE_SIZE, NR_SHMEM },
> -       { "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> -       { "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> -       { "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> +       { "anon",                       NR_ANON_MAPPED                  },
> +       { "file",                       NR_FILE_PAGES                   },
> +       { "kernel_stack",               NR_KERNEL_STACK_KB              },
> +       { "pagetables",                 NR_PAGETABLE                    },
> +       { "percpu",                     MEMCG_PERCPU_B                  },
> +       { "sock",                       MEMCG_SOCK                      },
> +       { "shmem",                      NR_SHMEM                        },
> +       { "file_mapped",                NR_FILE_MAPPED                  },
> +       { "file_dirty",                 NR_FILE_DIRTY                   },
> +       { "file_writeback",             NR_WRITEBACK                    },
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -       { "anon_thp", PAGE_SIZE, NR_ANON_THPS },
> -       { "file_thp", PAGE_SIZE, NR_FILE_THPS },
> -       { "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
> +       { "anon_thp",                   NR_ANON_THPS                    },
> +       { "file_thp",                   NR_FILE_THPS                    },
> +       { "shmem_thp",                  NR_SHMEM_THPS                   },
>  #endif
> -       { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> -       { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> -       { "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> -       { "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> -       { "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> -
> -       /*
> -        * Note: The slab_reclaimable and slab_unreclaimable must be
> -        * together and slab_reclaimable must be in front.
> -        */
> -       { "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> -       { "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> +       { "inactive_anon",              NR_INACTIVE_ANON                },
> +       { "active_anon",                NR_ACTIVE_ANON                  },
> +       { "inactive_file",              NR_INACTIVE_FILE                },
> +       { "active_file",                NR_ACTIVE_FILE                  },
> +       { "unevictable",                NR_UNEVICTABLE                  },
> +       { "slab_reclaimable",           NR_SLAB_RECLAIMABLE_B           },
> +       { "slab_unreclaimable",         NR_SLAB_UNRECLAIMABLE_B         },
>
>         /* The memory events */
> -       { "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
> -       { "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
> -       { "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
> -       { "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
> -       { "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
> -       { "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
> -       { "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
> +       { "workingset_refault_anon",    WORKINGSET_REFAULT_ANON         },
> +       { "workingset_refault_file",    WORKINGSET_REFAULT_FILE         },
> +       { "workingset_activate_anon",   WORKINGSET_ACTIVATE_ANON        },
> +       { "workingset_activate_file",   WORKINGSET_ACTIVATE_FILE        },
> +       { "workingset_restore_anon",    WORKINGSET_RESTORE_ANON         },
> +       { "workingset_restore_file",    WORKINGSET_RESTORE_FILE         },
> +       { "workingset_nodereclaim",     WORKINGSET_NODERECLAIM          },
>  };
>
> +/* Translate stat items to the correct unit for memory.stat output */
> +static int memcg_page_state_unit(int item)
> +{
> +       switch (item) {
> +       case MEMCG_PERCPU_B:
> +       case NR_SLAB_RECLAIMABLE_B:
> +       case NR_SLAB_UNRECLAIMABLE_B:
> +       case WORKINGSET_REFAULT_ANON:
> +       case WORKINGSET_REFAULT_FILE:
> +       case WORKINGSET_ACTIVATE_ANON:
> +       case WORKINGSET_ACTIVATE_FILE:
> +       case WORKINGSET_RESTORE_ANON:
> +       case WORKINGSET_RESTORE_FILE:
> +       case WORKINGSET_NODERECLAIM:
> +               return 1;
> +       case NR_KERNEL_STACK_KB:
> +               return SZ_1K;
> +       default:
> +               return PAGE_SIZE;
> +       }
> +}
> +
> +static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
> +                                                   int item)
> +{
> +       return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
> +}
> +
>  static char *memory_stat_format(struct mem_cgroup *memcg)
>  {
>         struct seq_buf s;
> @@ -1577,13 +1599,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>                 u64 size;
>
> -               size = memcg_page_state(memcg, memory_stats[i].idx);
> -               size *= memory_stats[i].ratio;
> +               size = memcg_page_state_output(memcg, memory_stats[i].idx);
>                 seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
>
>                 if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> -                       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> -                              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> +                       size += memcg_page_state_output(memcg,
> +                                                       NR_SLAB_RECLAIMABLE_B);
>                         seq_buf_printf(&s, "slab %llu\n", size);
>                 }
>         }
> @@ -6377,6 +6398,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
>  }
>
>  #ifdef CONFIG_NUMA
> +static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
> +                                                    int item)
> +{
> +       return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
> +}
> +

No need to have lruvec_page_state_output() separately as there is just
one user. Just inline it.
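
(For concreteness, a sketch of what that inlining could look like inside
memory_numa_stat_show(), derived from the helpers in this patch rather
than from any posted follow-up:

	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
	size = lruvec_page_state(lruvec, memory_stats[i].idx) *
	       memcg_page_state_unit(memory_stats[i].idx);
	seq_printf(m, " N%d=%llu", nid, size);
)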

>  static int memory_numa_stat_show(struct seq_file *m, void *v)
>  {
>         int i;
> @@ -6394,8 +6421,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
>                         struct lruvec *lruvec;
>
>                         lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> -                       size = lruvec_page_state(lruvec, memory_stats[i].idx);
> -                       size *= memory_stats[i].ratio;
> +                       size = lruvec_page_state_output(lruvec,
> +                                                       memory_stats[i].idx);
>                         seq_printf(m, " N%d=%llu", nid, size);
>                 }
>                 seq_putc(m, '\n');
> --
> 2.11.0
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages
  2020-12-17  3:43 ` [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages Muchun Song
@ 2020-12-23 22:08     ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2020-12-23 22:08 UTC (permalink / raw)
  To: Muchun Song
  Cc: Greg Kroah-Hartman, rafael, Alexey Dobriyan, Andrew Morton,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
	Roman Gushchin, Sami Tolvanen, Feng Tang, Neil Brown,
	Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel, Linux MM,
	Cgroups

On Wed, Dec 16, 2020 at 7:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> Currently we use struct per_cpu_nodestat to cache the vmstat
> counters, which leads to inaccurate statistics expecially THP

*especially

> vmstat counters. In the systems with hundreads of processors

*hundreds

> it can be GBs of memory. For example, for a 96 CPUs system,
> the threshold is the maximum number of 125. And the per cpu
> counters can cache 23.4375 GB in total.
>
> The THP page is already a form of batched addition (it adds
> 512 pages' worth of memory in one go), so skipping the batching
> seems sensible. Although every THP stats update then overflows
> the per-cpu counter, resorting to atomic global updates, it
> makes the statistics more accurate for the THP vmstat
> counters.
>
> So we convert the NR_ANON_THPS account to pages. This patch
> is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs
> on compound page arrival"). Doing this also can make the unit
> of vmstat counters more unified. Finally, the unit of the vmstat
> counters are pages, kB and bytes. The B/KB suffix can tell us
> that the unit is bytes or kB. The rest which is without suffix
> are pages.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

I agree with the motivation behind this patch but I would like to see
some performance numbers in the commit message. We might agree to pay
the price but at least we will know what exactly that cost is.
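
(For reference, the 23.4375 GB figure in the quoted message works out as
a back-of-envelope bound, assuming 4 KiB base pages and 2 MiB THPs, i.e.
HPAGE_PMD_NR = 512:

	drift per CPU per counter = 125 (threshold) * 2 MiB = 250 MiB
	total across 96 CPUs      = 250 MiB * 96 = 24000 MiB = 23.4375 GB
)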

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent
  2020-12-23 21:21     ` Shakeel Butt
  (?)
@ 2020-12-24  2:43       ` Muchun Song
  -1 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-24  2:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Alexey Dobriyan,
	Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Hugh Dickins, Roman Gushchin, Sami Tolvanen, Feng Tang,
	Neil Brown, Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel,
	Linux MM, Cgroups

On Thu, Dec 24, 2020 at 5:21 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Dec 16, 2020 at 7:46 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > Although the ratio for the slab counters is one, we should still read
> > the ratio from the related memory_stats entry instead of hard-coding it.
> > Also, the local variable size already holds the value of
> > slab_unreclaimable, so there is no need to read it again.
> >
> > To do this we need some code like below:
> >
> > if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> > -       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > -              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> > +       size += memcg_page_state(memcg, memory_stats[i - 1].idx) *
> > +               memory_stats[i - 1].ratio;

Hi Shakeel,

Here is the [i - 1].

> >
> > It requires a series of BUG_ONs or comments to ensure these two
> > items are actually adjacent and in the right order. So it would
> > probably be easier to implement this using a wrapper that has a
> > big switch() for unit conversion.
> >
> > This would fix the ratio inconsistency and get rid of the ordering
> > requirement.
> >
>
> The commit message is really confusing. It is explaining an approach
> that the patch does not actually take. I don't see any benefit in
> mentioning BUG_ONs or [i-1]s in the message. The patch makes sure that
> we use the right ratio for slab. Can you rewrite the commit message to
> motivate it in just that regard?

Yeah, I need to rewrite the commit message to make it clearer.
However, there is already a discussion about this; see

    https://lore.kernel.org/patchwork/patch/1348611/

Thanks.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 66 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a40797a27f87..eec44918d373 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1511,49 +1511,71 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> >
> >  struct memory_stat {
> >         const char *name;
> > -       unsigned int ratio;
> >         unsigned int idx;
> >  };
> >
> >  static const struct memory_stat memory_stats[] = {
> > -       { "anon", PAGE_SIZE, NR_ANON_MAPPED },
> > -       { "file", PAGE_SIZE, NR_FILE_PAGES },
> > -       { "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> > -       { "pagetables", PAGE_SIZE, NR_PAGETABLE },
> > -       { "percpu", 1, MEMCG_PERCPU_B },
> > -       { "sock", PAGE_SIZE, MEMCG_SOCK },
> > -       { "shmem", PAGE_SIZE, NR_SHMEM },
> > -       { "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> > -       { "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> > -       { "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> > +       { "anon",                       NR_ANON_MAPPED                  },
> > +       { "file",                       NR_FILE_PAGES                   },
> > +       { "kernel_stack",               NR_KERNEL_STACK_KB              },
> > +       { "pagetables",                 NR_PAGETABLE                    },
> > +       { "percpu",                     MEMCG_PERCPU_B                  },
> > +       { "sock",                       MEMCG_SOCK                      },
> > +       { "shmem",                      NR_SHMEM                        },
> > +       { "file_mapped",                NR_FILE_MAPPED                  },
> > +       { "file_dirty",                 NR_FILE_DIRTY                   },
> > +       { "file_writeback",             NR_WRITEBACK                    },
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > -       { "anon_thp", PAGE_SIZE, NR_ANON_THPS },
> > -       { "file_thp", PAGE_SIZE, NR_FILE_THPS },
> > -       { "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
> > +       { "anon_thp",                   NR_ANON_THPS                    },
> > +       { "file_thp",                   NR_FILE_THPS                    },
> > +       { "shmem_thp",                  NR_SHMEM_THPS                   },
> >  #endif
> > -       { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> > -       { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> > -       { "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> > -       { "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> > -       { "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> > -
> > -       /*
> > -        * Note: The slab_reclaimable and slab_unreclaimable must be
> > -        * together and slab_reclaimable must be in front.
> > -        */
> > -       { "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> > -       { "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> > +       { "inactive_anon",              NR_INACTIVE_ANON                },
> > +       { "active_anon",                NR_ACTIVE_ANON                  },
> > +       { "inactive_file",              NR_INACTIVE_FILE                },
> > +       { "active_file",                NR_ACTIVE_FILE                  },
> > +       { "unevictable",                NR_UNEVICTABLE                  },
> > +       { "slab_reclaimable",           NR_SLAB_RECLAIMABLE_B           },
> > +       { "slab_unreclaimable",         NR_SLAB_UNRECLAIMABLE_B         },
> >
> >         /* The memory events */
> > -       { "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
> > -       { "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
> > -       { "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
> > -       { "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
> > -       { "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
> > -       { "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
> > -       { "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
> > +       { "workingset_refault_anon",    WORKINGSET_REFAULT_ANON         },
> > +       { "workingset_refault_file",    WORKINGSET_REFAULT_FILE         },
> > +       { "workingset_activate_anon",   WORKINGSET_ACTIVATE_ANON        },
> > +       { "workingset_activate_file",   WORKINGSET_ACTIVATE_FILE        },
> > +       { "workingset_restore_anon",    WORKINGSET_RESTORE_ANON         },
> > +       { "workingset_restore_file",    WORKINGSET_RESTORE_FILE         },
> > +       { "workingset_nodereclaim",     WORKINGSET_NODERECLAIM          },
> >  };
> >
> > +/* Translate stat items to the correct unit for memory.stat output */
> > +static int memcg_page_state_unit(int item)
> > +{
> > +       switch (item) {
> > +       case MEMCG_PERCPU_B:
> > +       case NR_SLAB_RECLAIMABLE_B:
> > +       case NR_SLAB_UNRECLAIMABLE_B:
> > +       case WORKINGSET_REFAULT_ANON:
> > +       case WORKINGSET_REFAULT_FILE:
> > +       case WORKINGSET_ACTIVATE_ANON:
> > +       case WORKINGSET_ACTIVATE_FILE:
> > +       case WORKINGSET_RESTORE_ANON:
> > +       case WORKINGSET_RESTORE_FILE:
> > +       case WORKINGSET_NODERECLAIM:
> > +               return 1;
> > +       case NR_KERNEL_STACK_KB:
> > +               return SZ_1K;
> > +       default:
> > +               return PAGE_SIZE;
> > +       }
> > +}
> > +
> > +static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
> > +                                                   int item)
> > +{
> > +       return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
> > +}
> > +
> >  static char *memory_stat_format(struct mem_cgroup *memcg)
> >  {
> >         struct seq_buf s;
> > @@ -1577,13 +1599,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> >         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >                 u64 size;
> >
> > -               size = memcg_page_state(memcg, memory_stats[i].idx);
> > -               size *= memory_stats[i].ratio;
> > +               size = memcg_page_state_output(memcg, memory_stats[i].idx);
> >                 seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
> >
> >                 if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> > -                       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > -                              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> > +                       size += memcg_page_state_output(memcg,
> > +                                                       NR_SLAB_RECLAIMABLE_B);
> >                         seq_buf_printf(&s, "slab %llu\n", size);
> >                 }
> >         }
> > @@ -6377,6 +6398,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > +static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
> > +                                                    int item)
> > +{
> > +       return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
> > +}
> > +
>
> No need to have lruvec_page_state_output() separately as there is just
> one user. Just inline it.
>
> >  static int memory_numa_stat_show(struct seq_file *m, void *v)
> >  {
> >         int i;
> > @@ -6394,8 +6421,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
> >                         struct lruvec *lruvec;
> >
> >                         lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> > -                       size = lruvec_page_state(lruvec, memory_stats[i].idx);
> > -                       size *= memory_stats[i].ratio;
> > +                       size = lruvec_page_state_output(lruvec,
> > +                                                       memory_stats[i].idx);
> >                         seq_printf(m, " N%d=%llu", nid, size);
> >                 }
> >                 seq_putc(m, '\n');
> > --
> > 2.11.0
> >



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent
@ 2020-12-24  2:43       ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-24  2:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Alexey Dobriyan,
	Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Hugh Dickins, Roman Gushchin, Sami Tolvanen, Feng Tang,
	Neil Brown, Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel,
	Linux MM, Cgroups

On Thu, Dec 24, 2020 at 5:21 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Dec 16, 2020 at 7:46 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > Although the ratio of the slab is one, we also should read the ratio
> > from the related memory_stats instead of hard-coding. And the local
> > variable of size is already the value of slab_unreclaimable. So we
> > do not need to read again.
> >
> > To do this we need some code like below:
> >
> > if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> > -       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > -              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> > +       size += memcg_page_state(memcg, memory_stats[i - 1].idx) *
> > +               memory_stats[i - 1].ratio;

Hi Shakeel,

Here is the [i - 1].

> >
> > It requires a series of BUG_ONs or comments to ensure these two
> > items are actually adjacent and in the right order. So it would
> > probably be easier to implement this using a wrapper that has a
> > big switch() for unit conversion.
> >
> > This would fix the ratio inconsistency and get rid of the order
> > guarantee.
> >
>
> The commit message is really confusing. It is explaining a situation
> which it did not do. I don't see any benefit of mentioning BUG_ONs or
> [i-1]s in the message. The patch makes sure that we use the right
> ratio for slab. Can you rewrite the commit message and motivate in
> just that regard?

Yeah, I need rewrite the commit message to make it more clear.
However, here is a discussion about this. See

    https://lore.kernel.org/patchwork/patch/1348611/

Thanks.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 66 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a40797a27f87..eec44918d373 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1511,49 +1511,71 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> >
> >  struct memory_stat {
> >         const char *name;
> > -       unsigned int ratio;
> >         unsigned int idx;
> >  };
> >
> >  static const struct memory_stat memory_stats[] = {
> > -       { "anon", PAGE_SIZE, NR_ANON_MAPPED },
> > -       { "file", PAGE_SIZE, NR_FILE_PAGES },
> > -       { "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> > -       { "pagetables", PAGE_SIZE, NR_PAGETABLE },
> > -       { "percpu", 1, MEMCG_PERCPU_B },
> > -       { "sock", PAGE_SIZE, MEMCG_SOCK },
> > -       { "shmem", PAGE_SIZE, NR_SHMEM },
> > -       { "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> > -       { "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> > -       { "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> > +       { "anon",                       NR_ANON_MAPPED                  },
> > +       { "file",                       NR_FILE_PAGES                   },
> > +       { "kernel_stack",               NR_KERNEL_STACK_KB              },
> > +       { "pagetables",                 NR_PAGETABLE                    },
> > +       { "percpu",                     MEMCG_PERCPU_B                  },
> > +       { "sock",                       MEMCG_SOCK                      },
> > +       { "shmem",                      NR_SHMEM                        },
> > +       { "file_mapped",                NR_FILE_MAPPED                  },
> > +       { "file_dirty",                 NR_FILE_DIRTY                   },
> > +       { "file_writeback",             NR_WRITEBACK                    },
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > -       { "anon_thp", PAGE_SIZE, NR_ANON_THPS },
> > -       { "file_thp", PAGE_SIZE, NR_FILE_THPS },
> > -       { "shmem_thp", PAGE_SIZE, NR_SHMEM_THPS },
> > +       { "anon_thp",                   NR_ANON_THPS                    },
> > +       { "file_thp",                   NR_FILE_THPS                    },
> > +       { "shmem_thp",                  NR_SHMEM_THPS                   },
> >  #endif
> > -       { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> > -       { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> > -       { "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> > -       { "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> > -       { "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> > -
> > -       /*
> > -        * Note: The slab_reclaimable and slab_unreclaimable must be
> > -        * together and slab_reclaimable must be in front.
> > -        */
> > -       { "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> > -       { "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> > +       { "inactive_anon",              NR_INACTIVE_ANON                },
> > +       { "active_anon",                NR_ACTIVE_ANON                  },
> > +       { "inactive_file",              NR_INACTIVE_FILE                },
> > +       { "active_file",                NR_ACTIVE_FILE                  },
> > +       { "unevictable",                NR_UNEVICTABLE                  },
> > +       { "slab_reclaimable",           NR_SLAB_RECLAIMABLE_B           },
> > +       { "slab_unreclaimable",         NR_SLAB_UNRECLAIMABLE_B         },
> >
> >         /* The memory events */
> > -       { "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
> > -       { "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
> > -       { "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
> > -       { "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
> > -       { "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
> > -       { "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
> > -       { "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
> > +       { "workingset_refault_anon",    WORKINGSET_REFAULT_ANON         },
> > +       { "workingset_refault_file",    WORKINGSET_REFAULT_FILE         },
> > +       { "workingset_activate_anon",   WORKINGSET_ACTIVATE_ANON        },
> > +       { "workingset_activate_file",   WORKINGSET_ACTIVATE_FILE        },
> > +       { "workingset_restore_anon",    WORKINGSET_RESTORE_ANON         },
> > +       { "workingset_restore_file",    WORKINGSET_RESTORE_FILE         },
> > +       { "workingset_nodereclaim",     WORKINGSET_NODERECLAIM          },
> >  };
> >
> > +/* Translate stat items to the correct unit for memory.stat output */
> > +static int memcg_page_state_unit(int item)
> > +{
> > +       switch (item) {
> > +       case MEMCG_PERCPU_B:
> > +       case NR_SLAB_RECLAIMABLE_B:
> > +       case NR_SLAB_UNRECLAIMABLE_B:
> > +       case WORKINGSET_REFAULT_ANON:
> > +       case WORKINGSET_REFAULT_FILE:
> > +       case WORKINGSET_ACTIVATE_ANON:
> > +       case WORKINGSET_ACTIVATE_FILE:
> > +       case WORKINGSET_RESTORE_ANON:
> > +       case WORKINGSET_RESTORE_FILE:
> > +       case WORKINGSET_NODERECLAIM:
> > +               return 1;
> > +       case NR_KERNEL_STACK_KB:
> > +               return SZ_1K;
> > +       default:
> > +               return PAGE_SIZE;
> > +       }
> > +}
> > +
> > +static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
> > +                                                   int item)
> > +{
> > +       return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
> > +}
> > +
> >  static char *memory_stat_format(struct mem_cgroup *memcg)
> >  {
> >         struct seq_buf s;
> > @@ -1577,13 +1599,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> >         for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >                 u64 size;
> >
> > -               size = memcg_page_state(memcg, memory_stats[i].idx);
> > -               size *= memory_stats[i].ratio;
> > +               size = memcg_page_state_output(memcg, memory_stats[i].idx);
> >                 seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
> >
> >                 if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
> > -                       size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > -                              memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
> > +                       size += memcg_page_state_output(memcg,
> > +                                                       NR_SLAB_RECLAIMABLE_B);
> >                         seq_buf_printf(&s, "slab %llu\n", size);
> >                 }
> >         }
> > @@ -6377,6 +6398,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > +static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
> > +                                                    int item)
> > +{
> > +       return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
> > +}
> > +
>
> No need to have lruvec_page_state_output() separately as there is just
> one user. Just inline it.
>
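(For illustration, the open-coded variant being suggested would presumably look like the hunk below — an untested sketch against this patch, not what was merged:

 			lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
-			size = lruvec_page_state_output(lruvec,
-							memory_stats[i].idx);
+			/* Read the stat and scale it by its output unit. */
+			size = lruvec_page_state(lruvec, memory_stats[i].idx) *
+			       memcg_page_state_unit(memory_stats[i].idx);
 			seq_printf(m, " N%d=%llu", nid, size);

with lruvec_page_state_output() and its one definition dropped.)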
> >  static int memory_numa_stat_show(struct seq_file *m, void *v)
> >  {
> >         int i;
> > @@ -6394,8 +6421,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
> >                         struct lruvec *lruvec;
> >
> >                         lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> > -                       size = lruvec_page_state(lruvec, memory_stats[i].idx);
> > -                       size *= memory_stats[i].ratio;
> > +                       size = lruvec_page_state_output(lruvec,
> > +                                                       memory_stats[i].idx);
> >                         seq_printf(m, " N%d=%llu", nid, size);
> >                 }
> >                 seq_putc(m, '\n');
> > --
> > 2.11.0
> >
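(As a sanity check on the units — illustrative only, with made-up values, not output from a real system — the printed format of memory.stat is unchanged by this refactor. Size-like stats still come out in bytes and the workingset counters remain raw event counts:

    anon 1060864
    kernel_stack 73728
    percpu 16512
    slab_reclaimable 495232
    slab_unreclaimable 262144
    slab 757376
    workingset_refault_anon 3

Here "kernel_stack" is stored in KiB and multiplied by SZ_1K, "percpu" and the slab entries are already in bytes (unit 1), the workingset entries are plain counts (unit 1), and everything else is pages multiplied by PAGE_SIZE. Note that "slab" is the sum of the two slab lines: 495232 + 262144 = 757376.)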



-- 
Yours,
Muchun


^ permalink raw reply	[flat|nested] 23+ messages in thread


* Re: [External] Re: [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages
  2020-12-23 22:08     ` Shakeel Butt
@ 2020-12-24  2:45       ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2020-12-24  2:45 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Alexey Dobriyan,
	Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Hugh Dickins, Roman Gushchin, Sami Tolvanen, Feng Tang,
	Neil Brown, Joonsoo Kim, Randy Dunlap, LKML, linux-fsdevel,
	Linux MM, Cgroups

On Thu, Dec 24, 2020 at 6:08 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Dec 16, 2020 at 7:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > Currently we use struct per_cpu_nodestat to cache the vmstat
> > counters, which leads to inaccurate statistics expecially THP
>
> *especially

Thanks.

>
> > vmstat counters. In the systems with hundreads of processors
>
> *hundreds

Thanks.

>
> > it can be GBs of memory. For example, for a 96 CPUs system,
> > the threshold is the maximum number of 125. And the per cpu
> > counters can cache 23.4375 GB in total.
> >
> > The THP page is already a form of batched addition (it will
> > add 512 worth of memory in one go) so skipping the batching
> > seems like sensible. Although every THP stats update overflows
> > the per-cpu counter, resorting to atomic global updates. But
> > it can make the statistics more accuracy for the THP vmstat
> > counters.
> >
> > So we convert the NR_ANON_THPS account to pages. This patch
> > is consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs
> > on compound page arrival"). Doing this also can make the unit
> > of vmstat counters more unified. Finally, the unit of the vmstat
> > counters are pages, kB and bytes. The B/KB suffix can tell us
> > that the unit is bytes or kB. The rest which is without suffix
> > are pages.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> I agree with the motivation behind this patch but I would like to see
> some performance numbers in the commit message. We might agree to pay
> the price but at least we will know what exactly that cost is.

Do you have any recommendations for benchmarks?
I can run a test. Thanks very much.
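
(For reference, the 23.4375 GB in the quoted commit message is straightforward arithmetic, assuming 4 KiB base pages: with THPs counted in HPAGE_PMD_NR units and a per-cpu threshold of 125, each CPU can defer up to 125 * 512 pages * 4 KiB = 250 MiB, so 96 CPUs * 250 MiB = 24000 MiB = 23.4375 GB.)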

-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 23+ messages in thread


end of thread

Thread overview: 23+ messages
2020-12-17  3:43 [PATCH v5 0/7] Convert all THP vmstat counters to pages Muchun Song
2020-12-17  3:43 ` [PATCH v5 1/7] mm: memcontrol: fix NR_ANON_THPS accounting in charge moving Muchun Song
2020-12-23 21:01   ` Shakeel Butt
2020-12-17  3:43 ` [PATCH v5 2/7] mm: memcontrol: convert NR_ANON_THPS account to pages Muchun Song
2020-12-23 22:08   ` Shakeel Butt
2020-12-24  2:45     ` [External] " Muchun Song
2020-12-17  3:43 ` [PATCH v5 3/7] mm: memcontrol: convert NR_FILE_THPS " Muchun Song
2020-12-17  3:43 ` [PATCH v5 4/7] mm: memcontrol: convert NR_SHMEM_THPS " Muchun Song
2020-12-17  3:43 ` [PATCH v5 5/7] mm: memcontrol: convert NR_SHMEM_PMDMAPPED " Muchun Song
2020-12-17  3:43 ` [PATCH v5 6/7] mm: memcontrol: convert NR_FILE_PMDMAPPED " Muchun Song
2020-12-17  3:43 ` [PATCH v5 7/7] mm: memcontrol: make the slab calculation consistent Muchun Song
2020-12-23 21:21   ` Shakeel Butt
2020-12-24  2:43     ` [External] " Muchun Song
