* [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace
@ 2020-09-10 20:26 Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 1/5] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Roman Gushchin
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin

Currently a non-slab kernel page which has been charged to a memory
cgroup can't be mapped to userspace. The underlying reason is simple:
the PageKmemcg flag is defined as a page type (like buddy, offline, etc),
so it takes a bit from the page->_mapcount counter. Pages with a type set
can't be mapped to userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps map vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag
into one of the free bits of the page->mem_cgroup pointer. It also
formalizes all accesses to page->mem_cgroup and page->obj_cgroups
using new helpers, adds several checks and removes a couple of obsolete
functions. As a result, the code becomes more robust, with fewer
open-coded bit tricks.
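
To illustrate the trick with a simplified userspace sketch (not kernel
code, names are made up for the example): the pointers stored in this
field are at least word-aligned, so their lowest bit is always zero and
can be used to tell a plain memcg pointer from an objcg vector:

#include <assert.h>
#include <stdio.h>

struct mem_cgroup { int id; };
struct obj_cgroup { int id; };

#define OBJCGS_FLAG	0x1UL		/* bit 0 is free: stored pointers are word-aligned */

static unsigned long memcg_data;	/* models page->memcg_data */

static void set_memcg(struct mem_cgroup *memcg)
{
	memcg_data = (unsigned long)memcg;
}

static void set_objcgs(struct obj_cgroup **vec)
{
	memcg_data = (unsigned long)vec | OBJCGS_FLAG;
}

static struct mem_cgroup *get_memcg(void)
{
	/* a set low bit means "objcg vector", not a memcg pointer */
	if (memcg_data & OBJCGS_FLAG)
		return NULL;
	return (struct mem_cgroup *)memcg_data;
}

int main(void)
{
	static struct mem_cgroup memcg = { .id = 1 };
	static struct obj_cgroup *vec[8];

	set_memcg(&memcg);
	assert(get_memcg() == &memcg);

	set_objcgs(vec);
	assert(get_memcg() == NULL);

	printf("tagged word: %#lx\n", memcg_data);
	return 0;
}

The kmemcg flag then becomes one more such low bit, so the page type bit
(and its conflict with mapping to userspace) can go away.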

The first patch in the series is a bugfix which I have already sent separately.
It is included in the rfc to make the whole series compile.


Roman Gushchin (5):
  mm: memcg/slab: fix racy access to page->mem_cgroup in
    mem_cgroup_from_obj()
  mm: memcontrol: use helpers to access page's memcg data
  mm: memcontrol/slab: use helpers to access slab page's memcg_data
  mm: introduce page memcg flags
  mm: convert page kmemcg type to a page memcg flag

 include/linux/memcontrol.h       | 161 +++++++++++++++++++++++++++++--
 include/linux/mm.h               |  22 -----
 include/linux/mm_types.h         |   5 +-
 include/linux/page-flags.h       |  11 +--
 include/trace/events/writeback.h |   2 +-
 mm/debug.c                       |   4 +-
 mm/huge_memory.c                 |   4 +-
 mm/memcontrol.c                  | 116 ++++++++++------------
 mm/migrate.c                     |   2 +-
 mm/page_alloc.c                  |   6 +-
 mm/page_io.c                     |   4 +-
 mm/slab.h                        |  28 +-----
 mm/workingset.c                  |   4 +-
 13 files changed, 221 insertions(+), 148 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH rfc 1/5] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj()
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
@ 2020-09-10 20:26 ` Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data Roman Gushchin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin, Vlastimil Babka

mem_cgroup_from_obj() checks the lowest bit of the page->mem_cgroup
pointer to determine if the page has an attached obj_cgroup vector
instead of a regular memcg pointer. If the bit is not set, it simply
returns the page->mem_cgroup value as a struct mem_cgroup pointer.

Commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations") changed the moment when this bit
is set: previously it was set on allocation of the slab page, but
now it can be set well after, when the first accounted object is
allocated on this page.

This opened a race: if page->mem_cgroup is set concurrently after the
page_has_obj_cgroups(page) check, a pointer to the obj_cgroups
array can be returned as a memory cgroup pointer.
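
To make the window concrete, here is a simplified userspace model of the
interleaving (not the kernel code; the store that would normally happen
on another CPU is performed inline between the check and the final read):

#include <stdio.h>

struct mem_cgroup { int id; };
struct obj_cgroup { int id; };

static unsigned long memcg_data;	/* models page->mem_cgroup storage, initially NULL */

static int page_has_obj_cgroups(void)
{
	return memcg_data & 0x1UL;
}

int main(void)
{
	static struct obj_cgroup *vec[4];
	struct mem_cgroup *memcg;

	/* CPU A: no accounted objects on the slab page yet, the check fails */
	if (!page_has_obj_cgroups()) {
		/*
		 * CPU B: the first accounted object is allocated on this
		 * page, memcg_data goes from 0 (NULL) to (vec | 0x1UL).
		 */
		memcg_data = (unsigned long)vec | 0x1UL;

		/*
		 * CPU A: falls through to the "all other pages" path and
		 * misreads the tagged objcg vector as a mem_cgroup pointer.
		 */
		memcg = (struct mem_cgroup *)memcg_data;
		printf("bogus memcg pointer: %p\n", (void *)memcg);
	}
	return 0;
}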

A simple check of the page->mem_cgroup pointer for NULL before the
page_has_obj_cgroups() check fixes the race. Indeed, if the pointer
is not NULL, it's either a simple mem_cgroup pointer or a pointer
to an obj_cgroup vector. The pointer can be asynchronously changed
from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
from a valid memcg pointer to an objcg vector or back.

If the object passed to mem_cgroup_from_obj() is a slab object
and page->mem_cgroup is NULL, it means that the object is not
accounted, so the function must return NULL.

I discovered the race by looking at the code; so far I haven't seen it
in the wild.

Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75cd1a1e66c8..093526fec4bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2923,6 +2923,17 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 	page = virt_to_head_page(p);
 
+	/*
+	 * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
+	 * or a pointer to obj_cgroup vector. In the latter case the lowest
+	 * bit of the pointer is set.
+	 * The page->mem_cgroup pointer can be asynchronously changed
+	 * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
+	 * from a valid memcg pointer to objcg vector or back.
+	 */
+	if (!page->mem_cgroup)
+		return NULL;
+
 	/*
 	 * Slab objects are accounted individually, not per-page.
 	 * Memcg membership data for each individual object is saved in
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 1/5] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Roman Gushchin
@ 2020-09-10 20:26 ` Roman Gushchin
  2020-09-17  0:58     ` Shakeel Butt
  2020-09-10 20:26 ` [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data Roman Gushchin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin

Currently there are many open-coded reads and writes of the page->mem_cgroup
pointer, as well as a couple of read helpers which are barely used.

This creates an obstacle to reusing some bits of the pointer for
storing additional information. In fact, we already do
this for slab pages, where the lowest bit indicates that the pointer has
an attached vector of objcg pointers instead of a regular memcg
pointer.

This commit introduces 4 new helper functions and converts all
raw accesses to page->mem_cgroup to calls of these helpers:
  struct mem_cgroup *page_mem_cgroup(struct page *page);
  struct mem_cgroup *page_mem_cgroup_check(struct page *page);
  void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
  void clear_page_mem_cgroup(struct page *page);

page_mem_cgroup_check() is intended to be used in cases when the page
can be a slab page and have a memcg pointer pointing at an objcg vector.
It checks the lowest bit and, if it is set, returns NULL.
page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().

To make sure nobody uses direct access, struct page's
mem_cgroup/obj_cgroups union is converted to an unsigned long memcg_data field.
Only the new helpers and a couple of slab accounting related functions
access this field directly.

The page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
The new page_mem_cgroup() is a direct analog of page_memcg(), while
page_memcg_rcu() has a single call site in a small rcu read lock
section, so it's not worth keeping a separate helper; it's replaced
with page_mem_cgroup() too.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h       |  72 +++++++++++++++++---
 include/linux/mm.h               |  22 ------
 include/linux/mm_types.h         |   5 +-
 include/trace/events/writeback.h |   2 +-
 mm/debug.c                       |   4 +-
 mm/huge_memory.c                 |   4 +-
 mm/memcontrol.c                  | 111 +++++++++++++------------------
 mm/migrate.c                     |   2 +-
 mm/page_alloc.c                  |   4 +-
 mm/page_io.c                     |   4 +-
 mm/slab.h                        |   7 +-
 mm/workingset.c                  |   4 +-
 12 files changed, 124 insertions(+), 117 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 924177502479..0997220c84ce 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -340,6 +340,41 @@ struct mem_cgroup {
 
 extern struct mem_cgroup *root_mem_cgroup;
 
+static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageSlab(page), page);
+	return (struct mem_cgroup *)page->memcg_data;
+}
+
+static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
+{
+	unsigned long memcg_data = page->memcg_data;
+
+	/*
+	 * The lowest bit set means that memcg isn't a valid
+	 * memcg pointer, but a obj_cgroups pointer.
+	 * In this case the page is shared and doesn't belong
+	 * to any specific memory cgroup.
+	 */
+	if (memcg_data & 0x1UL)
+		return NULL;
+
+	return (struct mem_cgroup *)memcg_data;
+}
+
+static inline void set_page_mem_cgroup(struct page *page,
+				       struct mem_cgroup *memcg)
+{
+	VM_BUG_ON_PAGE(PageSlab(page), page);
+	page->memcg_data = (unsigned long)memcg;
+}
+
+static inline void clear_page_mem_cgroup(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageSlab(page), page);
+	page->memcg_data = 0;
+}
+
 static __always_inline bool memcg_stat_item_in_bytes(int idx)
 {
 	if (idx == MEMCG_PERCPU_B)
@@ -740,15 +775,15 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
 static inline void __mod_memcg_page_state(struct page *page,
 					  int idx, int val)
 {
-	if (page->mem_cgroup)
-		__mod_memcg_state(page->mem_cgroup, idx, val);
+	if (page_mem_cgroup(page))
+		__mod_memcg_state(page_mem_cgroup(page), idx, val);
 }
 
 static inline void mod_memcg_page_state(struct page *page,
 					int idx, int val)
 {
-	if (page->mem_cgroup)
-		mod_memcg_state(page->mem_cgroup, idx, val);
+	if (page_mem_cgroup(page))
+		mod_memcg_state(page_mem_cgroup(page), idx, val);
 }
 
 static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
@@ -835,12 +870,12 @@ static inline void __mod_lruvec_page_state(struct page *page,
 	struct lruvec *lruvec;
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
-	if (!head->mem_cgroup) {
+	if (!page_mem_cgroup(head)) {
 		__mod_node_page_state(pgdat, idx, val);
 		return;
 	}
 
-	lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
+	lruvec = mem_cgroup_lruvec(page_mem_cgroup(head), pgdat);
 	__mod_lruvec_state(lruvec, idx, val);
 }
 
@@ -875,8 +910,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
 static inline void count_memcg_page_event(struct page *page,
 					  enum vm_event_item idx)
 {
-	if (page->mem_cgroup)
-		count_memcg_events(page->mem_cgroup, idx, 1);
+	if (page_mem_cgroup(page))
+		count_memcg_events(page_mem_cgroup(page), idx, 1);
 }
 
 static inline void count_memcg_event_mm(struct mm_struct *mm,
@@ -938,6 +973,25 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 
 struct mem_cgroup;
 
+static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
+{
+	return NULL;
+}
+
+static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
+{
+	return NULL;
+}
+
+static inline void set_page_mem_cgroup(struct page *page,
+				       struct mem_cgroup *memcg)
+{
+}
+
+static inline void clear_page_mem_cgroup(struct page *page)
+{
+}
+
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
 	return true;
@@ -1427,7 +1481,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
 	if (mem_cgroup_disabled())
 		return;
 
-	if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
+	if (unlikely(&page_mem_cgroup(page)->css != wb->memcg_css))
 		mem_cgroup_track_foreign_dirty_slowpath(page, wb);
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..dc4eb3b150fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1474,28 +1474,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 #endif
 }
 
-#ifdef CONFIG_MEMCG
-static inline struct mem_cgroup *page_memcg(struct page *page)
-{
-	return page->mem_cgroup;
-}
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
-{
-	WARN_ON_ONCE(!rcu_read_lock_held());
-	return READ_ONCE(page->mem_cgroup);
-}
-#else
-static inline struct mem_cgroup *page_memcg(struct page *page)
-{
-	return NULL;
-}
-static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
-{
-	WARN_ON_ONCE(!rcu_read_lock_held());
-	return NULL;
-}
-#endif
-
 /*
  * Some inline functions in vmstat.h depend on page_zone()
  */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..4856d23b1161 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -199,10 +199,7 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	union {
-		struct mem_cgroup *mem_cgroup;
-		struct obj_cgroup **obj_cgroups;
-	};
+	unsigned long memcg_data;
 #endif
 
 	/*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e7cbccc7c14c..b1fa3ac64fa5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -257,7 +257,7 @@ TRACE_EVENT(track_foreign_dirty,
 		__entry->ino		= inode ? inode->i_ino : 0;
 		__entry->memcg_id	= wb->memcg_css->id;
 		__entry->cgroup_ino	= __trace_wb_assign_cgroup(wb);
-		__entry->page_cgroup_ino = cgroup_ino(page->mem_cgroup->css.cgroup);
+		__entry->page_cgroup_ino = cgroup_ino(page_mem_cgroup(page)->css.cgroup);
 	),
 
 	TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
diff --git a/mm/debug.c b/mm/debug.c
index ccca576b2899..55d1c42c7da8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -182,8 +182,8 @@ void __dump_page(struct page *page, const char *reason)
 		pr_warn("page dumped because: %s\n", reason);
 
 #ifdef CONFIG_MEMCG
-	if (!page_poisoned && page->mem_cgroup)
-		pr_warn("page->mem_cgroup:%px\n", page->mem_cgroup);
+	if (!page_poisoned && page_mem_cgroup(page))
+		pr_warn("page->mem_cgroup:%px\n", page_mem_cgroup(page));
 #endif
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a468a4acb0a..7ca3baea9810 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -470,7 +470,7 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 #ifdef CONFIG_MEMCG
 static inline struct deferred_split *get_deferred_split_queue(struct page *page)
 {
-	struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(compound_head(page));
 	struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
 
 	if (memcg)
@@ -2728,7 +2728,7 @@ void deferred_split_huge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(compound_head(page));
 #endif
 	unsigned long flags;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 093526fec4bf..19180674e38a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -541,7 +541,7 @@ struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg;
 
-	memcg = page->mem_cgroup;
+	memcg = page_mem_cgroup(page);
 
 	if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		memcg = root_mem_cgroup;
@@ -568,16 +568,7 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	memcg = page->mem_cgroup;
-
-	/*
-	 * The lowest bit set means that memcg isn't a valid
-	 * memcg pointer, but a obj_cgroups pointer.
-	 * In this case the page is shared and doesn't belong
-	 * to any specific memory cgroup.
-	 */
-	if ((unsigned long) memcg & 0x1UL)
-		memcg = NULL;
+	memcg = page_mem_cgroup_check(page);
 
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
@@ -1058,7 +1049,7 @@ EXPORT_SYMBOL(get_mem_cgroup_from_mm);
  */
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 {
-	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1343,7 +1334,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  * @page: the page
  * @pgdat: pgdat of the page
  *
- * This function relies on page->mem_cgroup being stable - see the
+ * This function relies on page and memcg binding being stable - see the
  * access rules in commit_charge().
  */
 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
@@ -1357,7 +1348,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	memcg = page_mem_cgroup(page);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -2100,7 +2091,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 /**
- * lock_page_memcg - lock a page->mem_cgroup binding
+ * lock_page_memcg - lock a page and memcg binding
  * @page: the page
  *
  * This function protects unlocked LRU pages from being moved to
@@ -2132,7 +2123,7 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (mem_cgroup_disabled())
 		return NULL;
 again:
-	memcg = head->mem_cgroup;
+	memcg = page_mem_cgroup(head);
 	if (unlikely(!memcg))
 		return NULL;
 
@@ -2140,7 +2131,7 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 		return memcg;
 
 	spin_lock_irqsave(&memcg->move_lock, flags);
-	if (memcg != head->mem_cgroup) {
+	if (memcg != page_mem_cgroup(head)) {
 		spin_unlock_irqrestore(&memcg->move_lock, flags);
 		goto again;
 	}
@@ -2178,14 +2169,14 @@ void __unlock_page_memcg(struct mem_cgroup *memcg)
 }
 
 /**
- * unlock_page_memcg - unlock a page->mem_cgroup binding
+ * unlock_page_memcg - unlock a page and memcg binding
  * @page: the page
  */
 void unlock_page_memcg(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	__unlock_page_memcg(head->mem_cgroup);
+	__unlock_page_memcg(page_mem_cgroup(head));
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
@@ -2875,16 +2866,16 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg)
 {
-	VM_BUG_ON_PAGE(page->mem_cgroup, page);
+	VM_BUG_ON_PAGE(page_mem_cgroup(page), page);
 	/*
-	 * Any of the following ensures page->mem_cgroup stability:
+	 * Any of the following ensures page and memcg binding stability:
 	 *
 	 * - the page lock
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
 	 */
-	page->mem_cgroup = memcg;
+	set_page_mem_cgroup(page, memcg);
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -2899,8 +2890,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 	if (!vec)
 		return -ENOMEM;
 
-	if (cmpxchg(&page->obj_cgroups, NULL,
-		    (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
+	if (cmpxchg(&page->memcg_data, 0, (unsigned long)vec | 0x1UL))
 		kfree(vec);
 	else
 		kmemleak_not_leak(vec);
@@ -2923,17 +2913,6 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 	page = virt_to_head_page(p);
 
-	/*
-	 * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
-	 * or a pointer to obj_cgroup vector. In the latter case the lowest
-	 * bit of the pointer is set.
-	 * The page->mem_cgroup pointer can be asynchronously changed
-	 * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
-	 * from a valid memcg pointer to objcg vector or back.
-	 */
-	if (!page->mem_cgroup)
-		return NULL;
-
 	/*
 	 * Slab objects are accounted individually, not per-page.
 	 * Memcg membership data for each individual object is saved in
@@ -2952,7 +2931,7 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	}
 
 	/* All other pages use page->mem_cgroup */
-	return page->mem_cgroup;
+	return page_mem_cgroup_check(page);
 }
 
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
@@ -3090,7 +3069,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	if (memcg && !mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
-			page->mem_cgroup = memcg;
+			set_page_mem_cgroup(page, memcg);
 			__SetPageKmemcg(page);
 			return 0;
 		}
@@ -3106,7 +3085,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
  */
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
-	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(page);
 	unsigned int nr_pages = 1 << order;
 
 	if (!memcg)
@@ -3114,7 +3093,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
-	page->mem_cgroup = NULL;
+	clear_page_mem_cgroup(page);
 	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
@@ -3265,7 +3244,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
-	struct mem_cgroup *memcg = head->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(head);
 	int i;
 
 	if (mem_cgroup_disabled())
@@ -3273,7 +3252,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		css_get(&memcg->css);
-		head[i].mem_cgroup = memcg;
+		set_page_mem_cgroup(&head[i], memcg);
 	}
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -4649,7 +4628,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
 					     struct bdi_writeback *wb)
 {
-	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_mem_cgroup(page);
 	struct memcg_cgwb_frn *frn;
 	u64 now = get_jiffies_64();
 	u64 oldest_at = now;
@@ -5641,14 +5620,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at
-	 * page->mem_cgroup of its source page while we change it.
+	 * page_mem_cgroup(page) of its source page while we change it.
 	 */
 	ret = -EBUSY;
 	if (!trylock_page(page))
 		goto out;
 
 	ret = -EINVAL;
-	if (page->mem_cgroup != from)
+	if (page_mem_cgroup(page) != from)
 		goto out_unlock;
 
 	pgdat = page_pgdat(page);
@@ -5703,13 +5682,13 @@ static int mem_cgroup_move_account(struct page *page,
 	/*
 	 * All state has been migrated, let's switch to the new memcg.
 	 *
-	 * It is safe to change page->mem_cgroup here because the page
+	 * It is safe to change page's memcg here because the page
 	 * is referenced, charged, isolated, and locked: we can't race
 	 * with (un)charging, migration, LRU putback, or anything else
-	 * that would rely on a stable page->mem_cgroup.
+	 * that would rely on a stable page_mem_cgroup(page).
 	 *
 	 * Note that lock_page_memcg is a memcg lock, not a page lock,
-	 * to save space. As soon as we switch page->mem_cgroup to a
+	 * to save space. As soon as we switch page_mem_cgroup(page) to a
 	 * new memcg that isn't locked, the above state can change
 	 * concurrently again. Make sure we're truly done with it.
 	 */
@@ -5718,7 +5697,7 @@ static int mem_cgroup_move_account(struct page *page,
 	css_get(&to->css);
 	css_put(&from->css);
 
-	page->mem_cgroup = to;
+	set_page_mem_cgroup(page, to);
 
 	__unlock_page_memcg(from);
 
@@ -5784,7 +5763,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 * mem_cgroup_move_account() checks the page is valid or
 		 * not under LRU exclusion.
 		 */
-		if (page->mem_cgroup == mc.from) {
+		if (page_mem_cgroup(page) == mc.from) {
 			ret = MC_TARGET_PAGE;
 			if (is_device_private_page(page))
 				ret = MC_TARGET_DEVICE;
@@ -5828,7 +5807,7 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
 		return ret;
-	if (page->mem_cgroup == mc.from) {
+	if (page_mem_cgroup(page) == mc.from) {
 		ret = MC_TARGET_PAGE;
 		if (target) {
 			get_page(page);
@@ -6739,12 +6718,12 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 		/*
 		 * Every swap fault against a single page tries to charge the
 		 * page, bail as early as possible.  shmem_unuse() encounters
-		 * already charged pages, too.  page->mem_cgroup is protected
-		 * by the page lock, which serializes swap cache removal, which
-		 * in turn serializes uncharging.
+		 * already charged pages, too.  page and memcg binding is
+		 * protected by the page lock, which serializes swap cache
+		 * removal, which in turn serializes uncharging.
 		 */
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		if (compound_head(page)->mem_cgroup)
+		if (page_mem_cgroup(compound_head(page)))
 			goto out;
 
 		id = lookup_swap_cgroup_id(ent);
@@ -6828,21 +6807,21 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (!page->mem_cgroup)
+	if (!page_mem_cgroup(page))
 		return;
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * page->mem_cgroup at this point, we have fully
+	 * page_mem_cgroup(page) at this point, we have fully
 	 * exclusive access to the page.
 	 */
 
-	if (ug->memcg != page->mem_cgroup) {
+	if (ug->memcg != page_mem_cgroup(page)) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = page->mem_cgroup;
+		ug->memcg = page_mem_cgroup(page);
 
 		/* pairs with css_put in uncharge_batch */
 		css_get(&ug->memcg->css);
@@ -6859,7 +6838,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 	}
 
 	ug->dummy_page = page;
-	page->mem_cgroup = NULL;
+	clear_page_mem_cgroup(page);
 	css_put(&ug->memcg->css);
 }
 
@@ -6902,7 +6881,7 @@ void mem_cgroup_uncharge(struct page *page)
 		return;
 
 	/* Don't touch page->lru of any random page, pre-check: */
-	if (!page->mem_cgroup)
+	if (!page_mem_cgroup(page))
 		return;
 
 	uncharge_gather_clear(&ug);
@@ -6952,11 +6931,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 		return;
 
 	/* Page cache replacement: new page already charged? */
-	if (newpage->mem_cgroup)
+	if (page_mem_cgroup(newpage))
 		return;
 
 	/* Swapcache readahead pages can get replaced before being charged */
-	memcg = oldpage->mem_cgroup;
+	memcg = page_mem_cgroup(oldpage);
 	if (!memcg)
 		return;
 
@@ -7151,7 +7130,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
-	memcg = page->mem_cgroup;
+	memcg = page_mem_cgroup(page);
 
 	/* Readahead page, never charged */
 	if (!memcg)
@@ -7172,7 +7151,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(oldid, page);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
-	page->mem_cgroup = NULL;
+	clear_page_mem_cgroup(page);
 
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
@@ -7215,7 +7194,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
-	memcg = page->mem_cgroup;
+	memcg = page_mem_cgroup(page);
 
 	/* Readahead page, never charged */
 	if (!memcg)
@@ -7296,7 +7275,7 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
-	memcg = page->mem_cgroup;
+	memcg = page_mem_cgroup(page);
 	if (!memcg)
 		return false;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 659d3d8a3e1f..6c3b542395e7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -493,7 +493,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		struct lruvec *old_lruvec, *new_lruvec;
 		struct mem_cgroup *memcg;
 
-		memcg = page_memcg(page);
+		memcg = page_mem_cgroup(page);
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..a707671f3b6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1056,7 +1056,7 @@ static inline bool page_expected_state(struct page *page,
 	if (unlikely((unsigned long)page->mapping |
 			page_ref_count(page) |
 #ifdef CONFIG_MEMCG
-			(unsigned long)page->mem_cgroup |
+			(unsigned long)page_mem_cgroup(page) |
 #endif
 			(page->flags & check_flags)))
 		return false;
@@ -1081,7 +1081,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
 			bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
 	}
 #ifdef CONFIG_MEMCG
-	if (unlikely(page->mem_cgroup))
+	if (unlikely(page_mem_cgroup(page)))
 		bad_reason = "page still charged to cgroup";
 #endif
 	return bad_reason;
diff --git a/mm/page_io.c b/mm/page_io.c
index dc6de6962612..ffa3a7d20c58 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -282,11 +282,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 {
 	struct cgroup_subsys_state *css;
 
-	if (!page->mem_cgroup)
+	if (!page_mem_cgroup(page))
 		return;
 
 	rcu_read_lock();
-	css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
+	css = cgroup_e_css(page_mem_cgroup(page)->css.cgroup, &io_cgrp_subsys);
 	bio_associate_blkg_from_css(bio, css);
 	rcu_read_unlock();
 }
diff --git a/mm/slab.h b/mm/slab.h
index 4a24e1702923..b9787c4d6e78 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -247,13 +247,12 @@ static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
 	 * always set the lowest bit of obj_cgroups.
 	 */
-	return (struct obj_cgroup **)
-		((unsigned long)page->obj_cgroups & ~0x1UL);
+	return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
 }
 
 static inline bool page_has_obj_cgroups(struct page *page)
 {
-	return ((unsigned long)page->obj_cgroups & 0x1UL);
+	return page->memcg_data & 0x1UL;
 }
 
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
@@ -262,7 +261,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 	kfree(page_obj_cgroups(page));
-	page->obj_cgroups = NULL;
+	page->memcg_data = 0;
 }
 
 static inline size_t obj_full_size(struct kmem_cache *s)
diff --git a/mm/workingset.c b/mm/workingset.c
index 92e66113a577..c0b7a4faa0d8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -345,7 +345,7 @@ void workingset_refault(struct page *page, void *shadow)
 	 * However, the cgroup that will own the page is the one that
 	 * is actually experiencing the refault event.
 	 */
-	memcg = page_memcg(page);
+	memcg = page_mem_cgroup(page);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
@@ -407,7 +407,7 @@ void workingset_activation(struct page *page)
 	 * XXX: See workingset_refault() - this should return
 	 * root_mem_cgroup even for !CONFIG_MEMCG.
 	 */
-	memcg = page_memcg_rcu(page);
+	memcg = page_mem_cgroup(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
 	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 1/5] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data Roman Gushchin
@ 2020-09-10 20:26 ` Roman Gushchin
  2020-09-17  1:12     ` Shakeel Butt
  2020-09-10 20:26 ` [PATCH rfc 4/5] mm: introduce page memcg flags Roman Gushchin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin

To gather all direct accesses to struct page's memcg_data field
in one place, let's introduce 4 new helper functions to use in
the slab accounting code:
  struct obj_cgroup **page_obj_cgroups(struct page *page);
  struct obj_cgroup **page_obj_cgroups_check(struct page *page);
  bool set_page_obj_cgroups(struct page *page, struct obj_cgroup **objcgs);
  void clear_page_obj_cgroups(struct page *page);

They are similar to the corresponding API for generic pages, except
that the setter can return false, indicating that the value has
already been set from a different thread.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 48 ++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c            |  4 ++--
 mm/slab.h                  | 27 +++------------------
 3 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0997220c84ce..48d4c2c1ce81 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -375,6 +375,54 @@ static inline void clear_page_mem_cgroup(struct page *page)
 	page->memcg_data = 0;
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
+}
+
+static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
+{
+	unsigned long memcg_data = page->memcg_data;
+
+	if (memcg_data && (memcg_data & 0x1UL))
+		return (struct obj_cgroup **)memcg_data;
+
+	return NULL;
+}
+
+static inline bool set_page_obj_cgroups(struct page *page,
+					struct obj_cgroup **objcgs)
+{
+	return !cmpxchg(&page->memcg_data, 0, (unsigned long)objcgs | 0x1UL);
+}
+
+static inline void clear_page_obj_cgroups(struct page *page)
+{
+	page->memcg_data = 0;
+}
+#else
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	return NULL;
+}
+
+static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
+{
+	return NULL;
+}
+
+static inline bool set_page_obj_cgroups(struct page *page,
+					struct obj_cgroup **objcgs)
+{
+	return true;
+}
+
+static inline void clear_page_obj_cgroups(struct page *page)
+{
+}
+#endif
+
 static __always_inline bool memcg_stat_item_in_bytes(int idx)
 {
 	if (idx == MEMCG_PERCPU_B)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19180674e38a..ba9b053b1b88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2890,7 +2890,7 @@ int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 	if (!vec)
 		return -ENOMEM;
 
-	if (cmpxchg(&page->memcg_data, 0, (unsigned long)vec | 0x1UL))
+	if (!set_page_obj_cgroups(page, vec))
 		kfree(vec);
 	else
 		kmemleak_not_leak(vec);
@@ -2918,7 +2918,7 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	 * Memcg membership data for each individual object is saved in
 	 * the page->obj_cgroups.
 	 */
-	if (page_has_obj_cgroups(page)) {
+	if (page_obj_cgroups_check(page)) {
 		struct obj_cgroup *objcg;
 		unsigned int off;
 
diff --git a/mm/slab.h b/mm/slab.h
index b9787c4d6e78..9a46ab76cb61 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,29 +239,13 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
-{
-	/*
-	 * page->mem_cgroup and page->obj_cgroups are sharing the same
-	 * space. To distinguish between them in case we don't know for sure
-	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
-	 * always set the lowest bit of obj_cgroups.
-	 */
-	return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
-}
-
-static inline bool page_has_obj_cgroups(struct page *page)
-{
-	return page->memcg_data & 0x1UL;
-}
-
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 	kfree(page_obj_cgroups(page));
-	page->memcg_data = 0;
+	clear_page_obj_cgroups(page);
 }
 
 static inline size_t obj_full_size(struct kmem_cache *s)
@@ -322,7 +306,7 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
 
-			if (!page_has_obj_cgroups(page) &&
+			if (!page_obj_cgroups(page) &&
 			    memcg_alloc_page_obj_cgroups(page, s, flags)) {
 				obj_cgroup_uncharge(objcg, obj_full_size(s));
 				continue;
@@ -349,7 +333,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	if (!memcg_kmem_enabled())
 		return;
 
-	if (!page_has_obj_cgroups(page))
+	if (!page_obj_cgroups(page))
 		return;
 
 	off = obj_to_index(s, page, p);
@@ -367,11 +351,6 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 #else /* CONFIG_MEMCG_KMEM */
-static inline bool page_has_obj_cgroups(struct page *page)
-{
-	return false;
-}
-
 static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rfc 4/5] mm: introduce page memcg flags
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-09-10 20:26 ` [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data Roman Gushchin
@ 2020-09-10 20:26 ` Roman Gushchin
  2020-09-10 20:26 ` [PATCH rfc 5/5] mm: convert page kmemcg type to a page memcg flag Roman Gushchin
  2020-09-11 17:34   ` Shakeel Butt
  5 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin

The lowest bit in page->memcg_data is used to distinguish between a
struct mem_cgroup pointer and a pointer to an objcgs array.
All checks and modifications of this bit are open-coded.

Let's formalize it using page memcg flags, defined in the page_memcg_flags
enum, and replace all open-coded accesses with test_bit()/__set_bit().

A few additional flags might be added later. Flags are intended to be
mutually exclusive.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 48d4c2c1ce81..7ab5f92bb686 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -340,23 +340,25 @@ struct mem_cgroup {
 
 extern struct mem_cgroup *root_mem_cgroup;
 
+enum page_memcg_flags {
+	/* page->memcg_data is a pointer to a objcgs vector */
+	PG_MEMCG_OBJ_CGROUPS,
+};
+
 static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
 {
+	unsigned long memcg_data = page->memcg_data;
+
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return (struct mem_cgroup *)page->memcg_data;
+
+	return (struct mem_cgroup *)memcg_data;
 }
 
 static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
 {
 	unsigned long memcg_data = page->memcg_data;
 
-	/*
-	 * The lowest bit set means that memcg isn't a valid
-	 * memcg pointer, but a obj_cgroups pointer.
-	 * In this case the page is shared and doesn't belong
-	 * to any specific memory cgroup.
-	 */
-	if (memcg_data & 0x1UL)
+	if (test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data))
 		return NULL;
 
 	return (struct mem_cgroup *)memcg_data;
@@ -378,14 +380,20 @@ static inline void clear_page_mem_cgroup(struct page *page)
 #ifdef CONFIG_MEMCG_KMEM
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
-	return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
+	unsigned long memcg_data = page->memcg_data;
+
+	VM_BUG_ON_PAGE(memcg_data && !test_bit(PG_MEMCG_OBJ_CGROUPS,
+					       &memcg_data), page);
+	__clear_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data);
+
+	return (struct obj_cgroup **)memcg_data;
 }
 
 static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
 {
 	unsigned long memcg_data = page->memcg_data;
 
-	if (memcg_data && (memcg_data & 0x1UL))
+	if (memcg_data && test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data))
 		return (struct obj_cgroup **)memcg_data;
 
 	return NULL;
@@ -394,7 +402,11 @@ static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
 static inline bool set_page_obj_cgroups(struct page *page,
 					struct obj_cgroup **objcgs)
 {
-	return !cmpxchg(&page->memcg_data, 0, (unsigned long)objcgs | 0x1UL);
+	unsigned long memcg_data = (unsigned long)objcgs;
+
+	__set_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data);
+
+	return !cmpxchg(&page->memcg_data, 0, memcg_data);
 }
 
 static inline void clear_page_obj_cgroups(struct page *page)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH rfc 5/5] mm: convert page kmemcg type to a page memcg flag
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-09-10 20:26 ` [PATCH rfc 4/5] mm: introduce page memcg flags Roman Gushchin
@ 2020-09-10 20:26 ` Roman Gushchin
  2020-09-11 17:34   ` Shakeel Butt
  5 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-10 20:26 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin

The PageKmemcg flag is currently defined as a page type (like buddy,
offline, table and guard). Semantically it means that the page
was accounted as kernel memory by the page allocator and has
to be uncharged on release.

As a side effect of defining the flag as a page type, the accounted
page can't be mapped to userspace (look at page_has_type() and the
comments above it). In particular, this blocks the accounting of
vmalloc-backed memory used by some bpf maps, because these maps
do map the memory to userspace.
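
For reference, here is a simplified userspace model of the conflict (not
kernel code; the constants mirror include/linux/page-flags.h as of this
series): the page type shares storage with _mapcount, so a single
mapcount increment on a typed page already corrupts the type:

#include <stdio.h>

/* values as in include/linux/page-flags.h at the time of this series */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_kmemcg	0x00000200u	/* the page type bit this patch removes */

/* models the union of page->_mapcount and page->page_type */
static unsigned int page_type = 0xffffffffu;	/* _mapcount == -1: no mappings */

static int PageKmemcg(void)
{
	return (page_type & (PAGE_TYPE_BASE | PG_kmemcg)) == PAGE_TYPE_BASE;
}

int main(void)
{
	/* __SetPageKmemcg(): setting a type clears its bit under PAGE_TYPE_BASE */
	page_type &= ~PG_kmemcg;
	printf("typed page: PageKmemcg=%d\n", PageKmemcg());

	/*
	 * Mapping the page to userspace would increment _mapcount, which is
	 * the very same storage; one increment already destroys the type.
	 */
	page_type++;
	printf("after a mapcount increment: PageKmemcg=%d\n", PageKmemcg());
	return 0;
}

Keeping the flag next to the memcg pointer in page->memcg_data avoids
this conflict entirely.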

One option is to fix it by complicating the access to page->_mapcount,
which provides some free bits for page->page_type.

But it's way better to move this flag into the page->memcg_data flags.
Indeed, the flag makes no sense without memory cgroups enabled and,
in particular, without the memory cgroup pointer set.

This commit replaces PageKmemcg() and __SetPageKmemcg() with
PageMemcgKmem() and SetPageMemcgKmem(). __ClearPageKmemcg()
can simply be deleted because clear_page_mem_cgroup() already
does the job.

As a bonus, on a !CONFIG_MEMCG build the PageMemcgKmem() check will
be compiled out.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 35 ++++++++++++++++++++++++++++++++---
 include/linux/page-flags.h | 11 ++---------
 mm/memcontrol.c            | 14 ++++----------
 mm/page_alloc.c            |  2 +-
 4 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab5f92bb686..430d1ca925c9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -343,15 +343,22 @@ extern struct mem_cgroup *root_mem_cgroup;
 enum page_memcg_flags {
 	/* page->memcg_data is a pointer to a objcgs vector */
 	PG_MEMCG_OBJ_CGROUPS,
+	/* page has been accounted as a non-slab kernel page */
+	PG_MEMCG_KMEM,
+	/* the next bit after the last actual flag */
+	PG_MEMCG_LAST_FLAG,
 };
 
+#define MEMCG_FLAGS_MASK ((1UL << PG_MEMCG_LAST_FLAG) - 1)
+
 static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
 {
 	unsigned long memcg_data = page->memcg_data;
 
 	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data), page);
 
-	return (struct mem_cgroup *)memcg_data;
+	return (struct mem_cgroup *)(memcg_data & ~MEMCG_FLAGS_MASK);
 }
 
 static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
@@ -361,7 +368,7 @@ static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
 	if (test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data))
 		return NULL;
 
-	return (struct mem_cgroup *)memcg_data;
+	return (struct mem_cgroup *)(memcg_data & ~MEMCG_FLAGS_MASK);
 }
 
 static inline void set_page_mem_cgroup(struct page *page,
@@ -377,6 +384,16 @@ static inline void clear_page_mem_cgroup(struct page *page)
 	page->memcg_data = 0;
 }
 
+static inline bool PageMemcgKmem(struct page *page)
+{
+	return test_bit(PG_MEMCG_KMEM, &page->memcg_data);
+}
+
+static inline void SetPageMemcgKmem(struct page *page)
+{
+	__set_bit(PG_MEMCG_KMEM, &page->memcg_data);
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
@@ -385,6 +402,7 @@ static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 	VM_BUG_ON_PAGE(memcg_data && !test_bit(PG_MEMCG_OBJ_CGROUPS,
 					       &memcg_data), page);
 	__clear_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data);
+	VM_BUG_ON_PAGE(test_bit(PG_MEMCG_KMEM, &memcg_data), page);
 
 	return (struct obj_cgroup **)memcg_data;
 }
@@ -393,8 +411,10 @@ static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
 {
 	unsigned long memcg_data = page->memcg_data;
 
-	if (memcg_data && test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data))
+	if (memcg_data && test_bit(PG_MEMCG_OBJ_CGROUPS, &memcg_data)) {
+		VM_BUG_ON_PAGE(test_bit(PG_MEMCG_KMEM, &memcg_data), page);
 		return (struct obj_cgroup **)memcg_data;
+	}
 
 	return NULL;
 }
@@ -1052,6 +1072,15 @@ static inline void clear_page_mem_cgroup(struct page *page)
 {
 }
 
+static inline bool PageMemcgKmem(struct page *page)
+{
+	return false;
+}
+
+static inline void SetPageMemcgKmem(struct page *page)
+{
+}
+
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
 	return true;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fbbb841a9346..a7ca01ae78d9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -712,9 +712,8 @@ PAGEFLAG_FALSE(DoubleMap)
 #define PAGE_MAPCOUNT_RESERVE	-128
 #define PG_buddy	0x00000080
 #define PG_offline	0x00000100
-#define PG_kmemcg	0x00000200
-#define PG_table	0x00000400
-#define PG_guard	0x00000800
+#define PG_table	0x00000200
+#define PG_guard	0x00000400
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -765,12 +764,6 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
-/*
- * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
- * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
- */
-PAGE_TYPE_OPS(Kmemcg, kmemcg)
-
 /*
  * Marks pages in use as page tables.
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9b053b1b88..d4c21870dab9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3070,7 +3070,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
 			set_page_mem_cgroup(page, memcg);
-			__SetPageKmemcg(page);
+			SetPageMemcgKmem(page);
 			return 0;
 		}
 		css_put(&memcg->css);
@@ -3095,10 +3095,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	clear_page_mem_cgroup(page);
 	css_put(&memcg->css);
-
-	/* slab pages do not have PageKmemcg flag set */
-	if (PageKmemcg(page))
-		__ClearPageKmemcg(page);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -6830,12 +6826,10 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 	nr_pages = compound_nr(page);
 	ug->nr_pages += nr_pages;
 
-	if (!PageKmemcg(page)) {
-		ug->pgpgout++;
-	} else {
+	if (PageMemcgKmem(page))
 		ug->nr_kmem += nr_pages;
-		__ClearPageKmemcg(page);
-	}
+	else
+		ug->pgpgout++;
 
 	ug->dummy_page = page;
 	clear_page_mem_cgroup(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a707671f3b6c..3a61868113ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1197,7 +1197,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
-	if (memcg_kmem_enabled() && PageKmemcg(page))
+	if (memcg_kmem_enabled() && PageMemcgKmem(page))
 		__memcg_kmem_uncharge_page(page, order);
 	if (check_free)
 		bad += check_free_page(page);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace
  2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
@ 2020-09-11 17:34   ` Shakeel Butt
  2020-09-10 20:26 ` [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data Roman Gushchin
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2020-09-11 17:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
>
> Currently a non-slab kernel page which has been charged to a memory
> cgroup can't be mapped to userspace. The underlying reason is simple:
> PageKmemcg flag is defined as a page type (like buddy, offline, etc),
> so it takes a bit from a page->mapped counter. Pages with a type set
> can't be mapped to userspace.
>
> But in general the kmemcg flag has nothing to do with mapping to
> userspace. It only means that the page has been accounted by the page
> allocator, so it has to be properly uncharged on release.
>
> Some bpf maps are mapping the vmalloc-based memory to userspace, and
> their memory can't be accounted because of this implementation detail.
>
> This patchset removes this limitation by moving the PageKmemcg flag
> into one of the free bits of the page->mem_cgroup pointer. Also it
> formalizes all accesses to the page->mem_cgroup and page->obj_cgroups
> using new helpers, adds several checks and removes a couple of obsolete
> functions. As the result the code became more robust with fewer
> open-coded bits tricks.
>
> The first patch in the series is a bugfix, which I already sent separately.
> Including it in rfc to make the whole series compile.
>
>

This would be a really beneficial feature. I tried to fix a similar
issue for kvm_vcpu_mmap [1] using an actual page flag bit, but
your solution would be non-controversial.

I think this might also help the accounting of TCP zerocopy receive
mmapped memory. The memory is charged in skbs but once it is mmapped,
the skbs get uncharged and we can have a very large amount of
uncharged memory.

I will take a look at the series.

^ permalink raw reply	[flat|nested] 16+ messages in thread


* Re: [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace
  2020-09-11 17:34   ` Shakeel Butt
@ 2020-09-11 17:34     ` Shakeel Butt
  -1 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2020-09-11 17:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Fri, Sep 11, 2020 at 10:34 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Currently a non-slab kernel page which has been charged to a memory
> > cgroup can't be mapped to userspace. The underlying reason is simple:
> > PageKmemcg flag is defined as a page type (like buddy, offline, etc),
> > so it takes a bit from a page->mapped counter. Pages with a type set
> > can't be mapped to userspace.
> >
> > But in general the kmemcg flag has nothing to do with mapping to
> > userspace. It only means that the page has been accounted by the page
> > allocator, so it has to be properly uncharged on release.
> >
> > Some bpf maps are mapping the vmalloc-based memory to userspace, and
> > their memory can't be accounted because of this implementation detail.
> >
> > This patchset removes this limitation by moving the PageKmemcg flag
> > into one of the free bits of the page->mem_cgroup pointer. Also it
> > formalizes all accesses to the page->mem_cgroup and page->obj_cgroups
> > using new helpers, adds several checks and removes a couple of obsolete
> > functions. As the result the code became more robust with fewer
> > open-coded bits tricks.
> >
> > The first patch in the series is a bugfix, which I already sent separately.
> > Including it in rfc to make the whole series compile.
> >
> >
>
> This would be a really beneficial feature. I tried to fix the similar
> issue for kvm_vcpu_mmap [1] but using the actual page flag bit but
> your solution would be non controversial.
>
> I think this might also help the accounting of TCP zerocopy receive
> mmapped memory. The memory is charged in skbs but once it is mmapped,
> the skbs get uncharged and we can have a very large amount of
> uncharged memory.
>
> I will take a look at the series.

[1] https://lore.kernel.org/kvm/20190329012836.47013-1-shakeelb@google.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread


* Re: [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace
  2020-09-11 17:34     ` Shakeel Butt
  (?)
@ 2020-09-11 21:36     ` Roman Gushchin
  -1 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-11 21:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Fri, Sep 11, 2020 at 10:34:57AM -0700, Shakeel Butt wrote:
> On Fri, Sep 11, 2020 at 10:34 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Currently a non-slab kernel page which has been charged to a memory
> > > cgroup can't be mapped to userspace. The underlying reason is simple:
> > > PageKmemcg flag is defined as a page type (like buddy, offline, etc),
> > > so it takes a bit from a page->mapped counter. Pages with a type set
> > > can't be mapped to userspace.
> > >
> > > But in general the kmemcg flag has nothing to do with mapping to
> > > userspace. It only means that the page has been accounted by the page
> > > allocator, so it has to be properly uncharged on release.
> > >
> > > Some bpf maps are mapping the vmalloc-based memory to userspace, and
> > > their memory can't be accounted because of this implementation detail.
> > >
> > > This patchset removes this limitation by moving the PageKmemcg flag
> > > into one of the free bits of the page->mem_cgroup pointer. Also it
> > > formalizes all accesses to the page->mem_cgroup and page->obj_cgroups
> > > using new helpers, adds several checks and removes a couple of obsolete
> > > functions. As the result the code became more robust with fewer
> > > open-coded bits tricks.
> > >
> > > The first patch in the series is a bugfix, which I already sent separately.
> > > Including it in rfc to make the whole series compile.
> > >
> > >
> >
> > This would be a really beneficial feature. I tried to fix the similar
> > issue for kvm_vcpu_mmap [1] but using the actual page flag bit but
> > your solution would be non-controversial.
> >
> > I think this might also help the accounting of TCP zerocopy receive
> > mmapped memory. The memory is charged in skbs but once it is mmapped,
> > the skbs get uncharged and we can have a very large amount of
> > uncharged memory.
> >
> > I will take a look at the series.
> 
> [1] https://lore.kernel.org/kvm/20190329012836.47013-1-shakeelb@google.com/

Cool, thank you for the link!

It's very nice that this feature is useful beyond the bpf case.

Thanks!
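
The idea discussed above -- carrying a flag in otherwise-unused low bits of a
pointer-sized field -- can be sketched roughly as follows. This is purely
illustrative: the MEMCG_DATA_* names, the helper names and the exact bit
layout are assumptions made for the example, not the series' actual
definitions, and it relies on struct mem_cgroup pointers being aligned enough
to leave the low bits free.

/*
 * Illustrative sketch only, not the actual patch: keep the objcg-vector
 * marker in bit 0 and a "charged as kernel memory" flag in bit 1 of
 * page->memcg_data, masking both off when reading the memcg pointer.
 */
#define MEMCG_DATA_OBJCGS	0x1UL
#define MEMCG_DATA_KMEM		0x2UL
#define MEMCG_DATA_FLAGS_MASK	(MEMCG_DATA_OBJCGS | MEMCG_DATA_KMEM)

static inline struct mem_cgroup *page_memcg_ptr(struct page *page)
{
	return (struct mem_cgroup *)(page->memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}

static inline bool page_memcg_kmem(struct page *page)
{
	return page->memcg_data & MEMCG_DATA_KMEM;
}

static inline void set_page_memcg_kmem(struct page *page)
{
	page->memcg_data |= MEMCG_DATA_KMEM;
}

Because the flag lives in memcg_data rather than in page->mapping or the page
type bits, mapping such a page to userspace no longer conflicts with the
accounting state.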

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data
  2020-09-10 20:26 ` [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data Roman Gushchin
@ 2020-09-17  0:58     ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2020-09-17  0:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
>
> Currently there are many open-coded reads and writes of the page->mem_cgroup
> pointer, as well as a couple of read helpers, which are barely used.
>
> It creates an obstacle to reusing some bits of the pointer
> for storing additional bits of information. In fact, we already do
> this for slab pages, where the last bit indicates that a pointer has
> an attached vector of objcg pointers instead of a regular memcg
> pointer.
>
> This commit introduces 4 new helper functions and converts all
> raw accesses to page->mem_cgroup to calls of these helpers:
>   struct mem_cgroup *page_mem_cgroup(struct page *page);
>   struct mem_cgroup *page_mem_cgroup_check(struct page *page);
>   void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
>   void clear_page_mem_cgroup(struct page *page);
>
> page_mem_cgroup_check() is intended to be used in cases when the page
> can be a slab page and have a memcg pointer pointing at objcg vector.
> It does check the lowest bit, and if set, returns NULL.
> page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
> being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().
>
> To make sure nobody uses a direct access, struct page's
> mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
> Only new helpers and a couple of slab-accounting related functions
> access this field directly.
>
> page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
> New page_mem_cgroup() is a direct analog of page_memcg(), while
> page_memcg_rcu() has a single call site in a small rcu-read-lock
> section, so it's just not worth it to have a separate helper. So
> it's replaced with page_mem_cgroup() too.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

You need to update a couple of comments in kernel/fork.c, mm/slab.h,
mm/workingset.c, fs/buffer.c, fs/iomap/buffered-io.c.

Reviewed-by: Shakeel Butt <shakeelb@google.com>
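
For illustration, a minimal sketch of helpers with the semantics described in
the quoted commit message -- a plain memcg pointer stored in page->memcg_data,
the lowest bit reserved for an objcg vector, and slab pages rejected by the
non-_check variants. This is an approximation for readability, not the code
from the patch itself:

static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
{
	/* Slab pages carry an objcg vector instead; use the _check variant. */
	VM_BUG_ON_PAGE(PageSlab(page), page);
	return (struct mem_cgroup *)page->memcg_data;
}

static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
{
	unsigned long memcg_data = page->memcg_data;

	/* Lowest bit set: this is an objcg vector, not a memcg pointer. */
	if (memcg_data & 0x1UL)
		return NULL;

	return (struct mem_cgroup *)memcg_data;
}

static inline void set_page_mem_cgroup(struct page *page,
					struct mem_cgroup *memcg)
{
	VM_BUG_ON_PAGE(PageSlab(page), page);
	page->memcg_data = (unsigned long)memcg;
}

static inline void clear_page_mem_cgroup(struct page *page)
{
	VM_BUG_ON_PAGE(PageSlab(page), page);
	page->memcg_data = 0;
}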

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data
  2020-09-10 20:26 ` [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data Roman Gushchin
@ 2020-09-17  1:12     ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2020-09-17  1:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
>
> To gather all direct accesses to struct page's memcg_data field
> in one place, let's introduce 4 new helper functions to use in
> the slab accounting code:
>   struct obj_cgroup **page_obj_cgroups(struct page *page);
>   struct obj_cgroup **page_obj_cgroups_check(struct page *page);
>   bool set_page_obj_cgroups(struct page *page, struct obj_cgroup **objcgs);
>   void clear_page_obj_cgroups(struct page *page);
>
> They are similar to the corresponding API for generic pages, except
> that the setter can return false, indicating that the value has
> already been set from a different thread.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Nit below.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  include/linux/memcontrol.h | 48 ++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c            |  4 ++--
>  mm/slab.h                  | 27 +++------------------
>  3 files changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0997220c84ce..48d4c2c1ce81 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -375,6 +375,54 @@ static inline void clear_page_mem_cgroup(struct page *page)
>         page->memcg_data = 0;
>  }
>

I think adding comments on these functions would be helpful.

> +#ifdef CONFIG_MEMCG_KMEM
> +static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
> +{
> +       return (struct obj_cgroup **)(page->memcg_data & ~0x1UL);
> +}
> +
> +static inline struct obj_cgroup **page_obj_cgroups_check(struct page *page)
> +{
> +       unsigned long memcg_data = page->memcg_data;
> +
> +       if (memcg_data && (memcg_data & 0x1UL))
> +               return (struct obj_cgroup **)memcg_data;
> +
> +       return NULL;
> +}
> +
> +static inline bool set_page_obj_cgroups(struct page *page,
> +                                       struct obj_cgroup **objcgs)
> +{
> +       return !cmpxchg(&page->memcg_data, 0, (unsigned long)objcgs | 0x1UL);
> +}
> +
> +static inline void clear_page_obj_cgroups(struct page *page)
> +{
> +       page->memcg_data = 0;
> +}
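
As a usage note, a setter built on cmpxchg() like the one quoted above is
typically called with a freshly allocated vector, and the caller frees the
vector if it loses the race. A hedged sketch of that pattern follows; the
function name and error handling are invented for the example and are not
taken from the series:

static int slab_page_alloc_obj_cgroups(struct page *page,
					unsigned int objects, gfp_t gfp)
{
	struct obj_cgroup **vec;

	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
	if (!vec)
		return -ENOMEM;

	/* Another thread may have installed a vector first; drop ours. */
	if (!set_page_obj_cgroups(page, vec))
		kfree(vec);

	return 0;
}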

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data
  2020-09-17  1:12     ` Shakeel Butt
  (?)
@ 2020-09-17 16:49     ` Roman Gushchin
  -1 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-17 16:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Linux MM, Johannes Weiner, Michal Hocko,
	Kernel Team, LKML

On Wed, Sep 16, 2020 at 06:12:37PM -0700, Shakeel Butt wrote:
> On Thu, Sep 10, 2020 at 1:27 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > To gather all direct accesses to struct page's memcg_data field
> > in one place, let's introduce 4 new helper functions to use in
> > the slab accounting code:
> >   struct obj_cgroup **page_obj_cgroups(struct page *page);
> >   struct obj_cgroup **page_obj_cgroups_check(struct page *page);
> >   bool set_page_obj_cgroups(struct page *page, struct obj_cgroup **objcgs);
> >   void clear_page_obj_cgroups(struct page *page);
> >
> > They are similar to the corresponding API for generic pages, except
> > that the setter can return false, indicating that the value has
> > already been set from a different thread.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Nit below.
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> 
> > ---
> >  include/linux/memcontrol.h | 48 ++++++++++++++++++++++++++++++++++++++
> >  mm/memcontrol.c            |  4 ++--
> >  mm/slab.h                  | 27 +++------------------
> >  3 files changed, 53 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 0997220c84ce..48d4c2c1ce81 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -375,6 +375,54 @@ static inline void clear_page_mem_cgroup(struct page *page)
> >         page->memcg_data = 0;
> >  }
> >
> 
> I think adding comments on these functions would be helpful.

I agree, will add.

Thank you for taking a look!

Roman

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-17 18:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 20:26 [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Roman Gushchin
2020-09-10 20:26 ` [PATCH rfc 1/5] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Roman Gushchin
2020-09-10 20:26 ` [PATCH rfc 2/5] mm: memcontrol: use helpers to access page's memcg data Roman Gushchin
2020-09-17  0:58   ` Shakeel Butt
2020-09-17  0:58     ` Shakeel Butt
2020-09-10 20:26 ` [PATCH rfc 3/5] mm: memcontrol/slab: use helpers to access slab page's memcg_data Roman Gushchin
2020-09-17  1:12   ` Shakeel Butt
2020-09-17  1:12     ` Shakeel Butt
2020-09-17 16:49     ` Roman Gushchin
2020-09-10 20:26 ` [PATCH rfc 4/5] mm: introduce page memcg flags Roman Gushchin
2020-09-10 20:26 ` [PATCH rfc 5/5] mm: convert page kmemcg type to a page memcg flag Roman Gushchin
2020-09-11 17:34 ` [PATCH rfc 0/5] mm: allow mapping accounted kernel pages to userspace Shakeel Butt
2020-09-11 17:34   ` Shakeel Butt
2020-09-11 17:34   ` Shakeel Butt
2020-09-11 17:34     ` Shakeel Butt
2020-09-11 21:36     ` Roman Gushchin
