* [PATCH v2 0/3] mm: push down lock_page_memcg()
@ 2022-12-06 17:13 Johannes Weiner
  2022-12-06 17:13 ` [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere Johannes Weiner
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Johannes Weiner @ 2022-12-06 17:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Shakeel Butt, Michal Hocko,
	linux-mm, cgroups, linux-kernel

New series based on the discussion in the previous thread around
getting lock_page_memcg() out of rmap.

I beat on this with concurrent high-frequency moving of tasks that
partially share a swapped out shmem file. I didn't spot anything
problematic. That said, it is quite subtle, and Hugh, I'd feel better
if you could also subject it to your torture suite ;)

Thanks!

Against yesterday's mm-unstable.

 Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++-
 mm/memcontrol.c                                | 56 ++++++++++++++++++------
 mm/rmap.c                                      | 26 ++++-------
 3 files changed, 60 insertions(+), 33 deletions(-)
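
For illustration, a minimal sketch of the kind of stress test described above: a few children keep a shared shmem mapping hot while the parent bounces them between two cgroup1 memcgs that have charge moving enabled. The cgroup paths, mapping size and loop structure are assumptions for illustration, not the actual reproducer; to also exercise the swapcache paths, the groups would need memory limits low enough to push part of the file out to swap.

/*
 * Hedged sketch only: assumes /dev/shm is tmpfs and that
 * /sys/fs/cgroup/memory/{A,B} exist with
 * memory.move_charge_at_immigrate already set to 3.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NCHILD	4
#define SZ	(64UL << 20)	/* 64M shared shmem mapping */

static void move_task(const char *cg, pid_t pid)
{
	char path[256], buf[32];
	int fd;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	snprintf(buf, sizeof(buf), "%d\n", pid);
	write(fd, buf, strlen(buf));
	close(fd);
}

int main(void)
{
	const char *cg[2] = { "/sys/fs/cgroup/memory/A",
			      "/sys/fs/cgroup/memory/B" };
	int fd = open("/dev/shm/memcg-stress", O_RDWR | O_CREAT, 0600);
	pid_t pids[NCHILD];
	char *map;

	if (fd < 0 || ftruncate(fd, SZ))
		return 1;
	map = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	for (int i = 0; i < NCHILD; i++) {
		pids[i] = fork();
		if (pids[i] == 0) {
			/* children keep dirtying the shared pages */
			for (;;)
				for (size_t off = 0; off < SZ; off += 4096)
					map[off]++;
		}
	}

	/* parent bounces the children between the two memcgs forever */
	for (int round = 0; ; round++)
		for (int i = 0; i < NCHILD; i++)
			move_task(cg[round & 1], pids[i]);
}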




* [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere
  2022-12-06 17:13 [PATCH v2 0/3] mm: push down lock_page_memcg() Johannes Weiner
@ 2022-12-06 17:13 ` Johannes Weiner
  2022-12-07  1:51   ` Hugh Dickins
  2022-12-08  0:36   ` Shakeel Butt
  2022-12-06 17:13 ` [PATCH 2/3] mm: rmap: remove lock_page_memcg() Johannes Weiner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2022-12-06 17:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Shakeel Butt, Michal Hocko,
	linux-mm, cgroups, linux-kernel

During charge moving, the pte lock and the page lock cover nearly all
cases of stabilizing page_mapped(). The only exception is when we're
looking at a non-present pte and find a page in the page cache or in
the swapcache: if the page is mapped elsewhere, it can become unmapped
outside of our control. For this reason, rmap needs lock_page_memcg().

We don't like cgroup-specific locks in generic MM code - especially in
performance-critical MM code - and for a legacy feature that's
unlikely to have many users left - if any.

So remove the exception. Arguably that's better semantics anyway: the
page is shared, and another process seems to be the more active user.

Once we stop moving such pages, rmap doesn't need lock_page_memcg()
anymore. The next patch will remove it.

Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 52 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48c44229cf47..b696354c1b21 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5681,7 +5681,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * The page must be locked and not on the LRU.
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5698,20 +5698,13 @@ static int mem_cgroup_move_account(struct page *page,
 	int nid, ret;
 
 	VM_BUG_ON(from == to);
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON(compound && !folio_test_large(folio));
 
-	/*
-	 * Prevent mem_cgroup_migrate() from looking at
-	 * page's memory cgroup of its source page while we change it.
-	 */
-	ret = -EBUSY;
-	if (!folio_trylock(folio))
-		goto out;
-
 	ret = -EINVAL;
 	if (folio_memcg(folio) != from)
-		goto out_unlock;
+		goto out;
 
 	pgdat = folio_pgdat(folio);
 	from_vec = mem_cgroup_lruvec(from, pgdat);
@@ -5798,8 +5791,6 @@ static int mem_cgroup_move_account(struct page *page,
 	mem_cgroup_charge_statistics(from, -nr_pages);
 	memcg_check_events(from, nid);
 	local_irq_enable();
-out_unlock:
-	folio_unlock(folio);
 out:
 	return ret;
 }
@@ -5848,6 +5839,29 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	else if (is_swap_pte(ptent))
 		page = mc_handle_swap_pte(vma, ptent, &ent);
 
+	if (target && page) {
+		if (!trylock_page(page)) {
+			put_page(page);
+			return ret;
+		}
+		/*
+		 * page_mapped() must be stable during the move. This
+		 * pte is locked, so if it's present, the page cannot
+		 * become unmapped. If it isn't, we have only partial
+		 * control over the mapped state: the page lock will
+		 * prevent new faults against pagecache and swapcache,
+		 * so an unmapped page cannot become mapped. However,
+		 * if the page is already mapped elsewhere, it can
+		 * unmap, and there is nothing we can do about it.
+		 * Alas, skip moving the page in this case.
+		 */
+		if (!pte_present(ptent) && page_mapped(page)) {
+			unlock_page(page);
+			put_page(page);
+			return ret;
+		}
+	}
+
 	if (!page && !ent.val)
 		return ret;
 	if (page) {
@@ -5864,8 +5878,11 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 			if (target)
 				target->page = page;
 		}
-		if (!ret || !target)
+		if (!ret || !target) {
+			if (target)
+				unlock_page(page);
 			put_page(page);
+		}
 	}
 	/*
 	 * There is a swap entry and a page doesn't exist or isn't charged.
@@ -5905,6 +5922,10 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 		ret = MC_TARGET_PAGE;
 		if (target) {
 			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				return MC_TARGET_NONE;
+			}
 			target->page = page;
 		}
 	}
@@ -6143,6 +6164,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				}
 				putback_lru_page(page);
 			}
+			unlock_page(page);
 			put_page(page);
 		} else if (target_type == MC_TARGET_DEVICE) {
 			page = target.page;
@@ -6151,6 +6173,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				mc.precharge -= HPAGE_PMD_NR;
 				mc.moved_charge += HPAGE_PMD_NR;
 			}
+			unlock_page(page);
 			put_page(page);
 		}
 		spin_unlock(ptl);
@@ -6193,7 +6216,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			}
 			if (!device)
 				putback_lru_page(page);
-put:			/* get_mctgt_type() gets the page */
+put:			/* get_mctgt_type() gets & locks the page */
+			unlock_page(page);
 			put_page(page);
 			break;
 		case MC_TARGET_SWAP:
-- 
2.38.1



* [PATCH 2/3] mm: rmap: remove lock_page_memcg()
  2022-12-06 17:13 [PATCH v2 0/3] mm: push down lock_page_memcg() Johannes Weiner
  2022-12-06 17:13 ` [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere Johannes Weiner
@ 2022-12-06 17:13 ` Johannes Weiner
  2022-12-07  1:52   ` Hugh Dickins
  2022-12-08  0:36   ` Shakeel Butt
  2022-12-06 17:13 ` [PATCH 3/3] mm: memcontrol: deprecate charge moving Johannes Weiner
  2022-12-07 14:07 ` [PATCH v2 0/3] mm: push down lock_page_memcg() Michal Hocko
  3 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2022-12-06 17:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Shakeel Butt, Michal Hocko,
	linux-mm, cgroups, linux-kernel

The previous patch made sure charge moving only touches pages for
which page_mapped() is stable. lock_page_memcg() is no longer needed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/rmap.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..32e48b1c5847 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
 
-	if (unlikely(PageKsm(page)))
-		lock_page_memcg(page);
-
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1262,15 +1259,14 @@ void page_add_anon_rmap(struct page *page,
 	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 
-	if (unlikely(PageKsm(page)))
-		unlock_page_memcg(page);
-
-	/* address might be in next vma when migration races vma_adjust */
-	else if (first)
-		__page_set_anon_rmap(page, vma, address,
-				     !!(flags & RMAP_EXCLUSIVE));
-	else
-		__page_check_anon_rmap(page, vma, address);
+	if (likely(!PageKsm(page))) {
+		/* address might be in next vma when migration races vma_adjust */
+		if (first)
+			__page_set_anon_rmap(page, vma, address,
+					     !!(flags & RMAP_EXCLUSIVE));
+		else
+			__page_check_anon_rmap(page, vma, address);
+	}
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1329,7 +1325,6 @@ void page_add_file_rmap(struct page *page,
 	bool first;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
-	lock_page_memcg(page);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
@@ -1365,7 +1360,6 @@ void page_add_file_rmap(struct page *page,
 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
-	unlock_page_memcg(page);
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1394,8 +1388,6 @@ void page_remove_rmap(struct page *page,
 		return;
 	}
 
-	lock_page_memcg(page);
-
 	/* Is page being unmapped by PTE? Is this its last map to be removed? */
 	if (likely(!compound)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
@@ -1451,8 +1443,6 @@ void page_remove_rmap(struct page *page,
 	 * and remember that it's only reliable while mapped.
 	 */
 
-	unlock_page_memcg(page);
-
 	munlock_vma_page(page, vma, compound);
 }
 
-- 
2.38.1



* [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-06 17:13 [PATCH v2 0/3] mm: push down lock_page_memcg() Johannes Weiner
  2022-12-06 17:13 ` [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere Johannes Weiner
  2022-12-06 17:13 ` [PATCH 2/3] mm: rmap: remove lock_page_memcg() Johannes Weiner
@ 2022-12-06 17:13 ` Johannes Weiner
  2022-12-07  0:03   ` Shakeel Butt
  2022-12-07  1:58   ` Hugh Dickins
  2022-12-07 14:07 ` [PATCH v2 0/3] mm: push down lock_page_memcg() Michal Hocko
  3 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2022-12-06 17:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Hugh Dickins, Shakeel Butt, Michal Hocko,
	linux-mm, cgroups, linux-kernel

Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive. Pages need to be identified, locked and
isolated from various MM operations, and reassigned, one by one.

Second, it's unreliable. Once pages are charged to a cgroup, there
isn't always a clear owner task anymore. Cache isn't moved at all, for
example. Mapped memory is moved - but if trylocking or isolating a
page fails, it's arbitrarily left behind. Frequent moving between
domains may leave a task's memory scattered all over the place.

Third, it isn't really needed. Launcher tasks can kick off workload
tasks directly in their target cgroup. Using dedicated per-workload
groups allows fine-grained policy adjustments - no need to move tasks
and their physical pages between control domains. The feature was
never forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of
supporting it is enormous. Because pages are moved while they are live
and subject to various MM operations, the synchronization rules are
complicated. There are lock_page_memcg() calls in MM and FS code, which
non-cgroup people don't understand. In some cases we've been able to
shift code and cgroup API calls around such that we can rely on native
locking as much as possible. But that's fragile, and sometimes we need
to hold MM locks for longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
 mm/memcontrol.c                                |  4 ++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 60370f2c67b9..87d7877b98ec 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -86,6 +86,8 @@ Brief summary of control files.
  memory.swappiness		     set/show swappiness parameter of vmscan
 				     (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate     set/show controls of moving charges
+                                     This knob is deprecated and shouldn't be
+                                     used.
  memory.oom_control		     set/show oom controls.
  memory.numa_stat		     show the number of memory usage per numa
 				     node
@@ -717,9 +719,16 @@ Soft limits can be setup by using the following commands (in this example we
        It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. Move charges at task migration
+8. Move charges at task migration (DEPRECATED!)
 =================================
 
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
+
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b696354c1b21..e650a38d9a90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3919,6 +3919,10 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	pr_warn_once("Cgroup memory moving is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
 	if (val & ~MOVE_MASK)
 		return -EINVAL;
 
-- 
2.38.1
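
For illustration, the alternative the changelog recommends (launching the workload directly inside its target cgroup instead of moving charges afterwards) amounts to something like the hedged sketch below; the cgroup path and workload binary are assumptions:

/* Hedged sketch: enter the target cgroup, then exec the workload. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Assumed target group; a cgroup2 path works the same way. */
	const char *procs = "/sys/fs/cgroup/memory/workload/cgroup.procs";
	char *const argv[] = { "/usr/bin/workload", NULL };
	int fd = open(procs, O_WRONLY);

	if (fd < 0) {
		perror("open cgroup.procs");
		return 1;
	}
	/* writing "0" moves the writing task itself into the group */
	if (write(fd, "0", 1) != 1) {
		perror("write cgroup.procs");
		return 1;
	}
	close(fd);

	/* all memory the workload charges now lands in the target group */
	execv(argv[0], argv);
	perror("execv");
	return 1;
}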



* Re: [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-06 17:13 ` [PATCH 3/3] mm: memcontrol: deprecate charge moving Johannes Weiner
@ 2022-12-07  0:03   ` Shakeel Butt
  2022-12-07 21:51     ` Andrew Morton
  2022-12-07  1:58   ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2022-12-07  0:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Charge moving mode in cgroup1 allows memory to follow tasks as they
> migrate between cgroups. This is, and always has been, a questionable
> thing to do - for several reasons.
>
> First, it's expensive. Pages need to be identified, locked and
> isolated from various MM operations, and reassigned, one by one.
>
> Second, it's unreliable. Once pages are charged to a cgroup, there
> isn't always a clear owner task anymore. Cache isn't moved at all, for
> example. Mapped memory is moved - but if trylocking or isolating a
> page fails, it's arbitrarily left behind. Frequent moving between
> domains may leave a task's memory scattered all over the place.
>
> Third, it isn't really needed. Launcher tasks can kick off workload
> tasks directly in their target cgroup. Using dedicated per-workload
> groups allows fine-grained policy adjustments - no need to move tasks
> and their physical pages between control domains. The feature was
> never forward-ported to cgroup2, and it hasn't been missed.
>
> Despite it being a niche usecase, the maintenance overhead of
> supporting it is enormous. Because pages are moved while they are live
> and subject to various MM operations, the synchronization rules are
> complicated. There are lock_page_memcg() calls in MM and FS code, which
> non-cgroup people don't understand. In some cases we've been able to
> shift code and cgroup API calls around such that we can rely on native
> locking as much as possible. But that's fragile, and sometimes we need
> to hold MM locks for longer than we otherwise would (pte lock e.g.).
>
> Mark the feature deprecated. Hopefully we can remove it soon.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

I would request this patch to be backported to stable kernels as well
for early warnings to users who update to newer kernels very late.


* Re: [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere
  2022-12-06 17:13 ` [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere Johannes Weiner
@ 2022-12-07  1:51   ` Hugh Dickins
  2022-12-08  0:36   ` Shakeel Butt
  1 sibling, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2022-12-07  1:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Shakeel Butt,
	Michal Hocko, linux-mm, cgroups, linux-kernel

On Tue, 6 Dec 2022, Johannes Weiner wrote:

> During charge moving, the pte lock and the page lock cover nearly all
> cases of stabilizing page_mapped(). The only exception is when we're
> looking at a non-present pte and find a page in the page cache or in
> the swapcache: if the page is mapped elsewhere, it can become unmapped
> outside of our control. For this reason, rmap needs lock_page_memcg().
> 
> We don't like cgroup-specific locks in generic MM code - especially in
> performance-critical MM code - and for a legacy feature that's
> unlikely to have many users left - if any.
> 
> So remove the exception. Arguably that's better semantics anyway: the
> page is shared, and another process seems to be the more active user.
> 
> Once we stop moving such pages, rmap doesn't need lock_page_memcg()
> anymore. The next patch will remove it.
> 
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Hugh Dickins <hughd@google.com>

It ended up simpler than I'd expected: nice, thank you.

I was going to say that you'd missed the most important detail from
the commit message (that page lock prevents remapping unmapped pages):
but you've gone into good detail on that in the source comment,
so that's fine.

I almost thought you could remove the folio_memcg() check from
mem_cgroup_move_account() itself: but then it looks as if
get_mctgt_type_thp() does things in a slightly different order,
leaving a window open in which folio memcg could have been changed.
Okay, there's no need to go back and rearrange that.

(I notice that get_mctgt_type_thp() has never been updated
for shmem and file THPs, so will move them iff MOVE_ANON:
but that's irrelevant to your changes, and probably something
we're not at all interested in fixing, now it's deprecated code.)

My tmpfs swapping load has been running for five hours on this
(and the others) so far: going fine.  I hacked in some stats to
verify that it really is moving anon and shmem and file, mapped
and unmapped: yes it is, and the unmapped numbers are big enough
that I'm glad that we chose to include them.

> ---
>  mm/memcontrol.c | 52 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48c44229cf47..b696354c1b21 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5681,7 +5681,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>   * @from: mem_cgroup which the page is moved from.
>   * @to:	mem_cgroup which the page is moved to. @from != @to.
>   *
> - * The caller must make sure the page is not on LRU (isolate_page() is useful.)
> + * The page must be locked and not on the LRU.
>   *
>   * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
>   * from old cgroup.
> @@ -5698,20 +5698,13 @@ static int mem_cgroup_move_account(struct page *page,
>  	int nid, ret;
>  
>  	VM_BUG_ON(from == to);
> +	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  	VM_BUG_ON(compound && !folio_test_large(folio));
>  
> -	/*
> -	 * Prevent mem_cgroup_migrate() from looking at
> -	 * page's memory cgroup of its source page while we change it.
> -	 */
> -	ret = -EBUSY;
> -	if (!folio_trylock(folio))
> -		goto out;
> -
>  	ret = -EINVAL;
>  	if (folio_memcg(folio) != from)
> -		goto out_unlock;
> +		goto out;
>  
>  	pgdat = folio_pgdat(folio);
>  	from_vec = mem_cgroup_lruvec(from, pgdat);
> @@ -5798,8 +5791,6 @@ static int mem_cgroup_move_account(struct page *page,
>  	mem_cgroup_charge_statistics(from, -nr_pages);
>  	memcg_check_events(from, nid);
>  	local_irq_enable();
> -out_unlock:
> -	folio_unlock(folio);
>  out:
>  	return ret;
>  }
> @@ -5848,6 +5839,29 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  	else if (is_swap_pte(ptent))
>  		page = mc_handle_swap_pte(vma, ptent, &ent);
>  
> +	if (target && page) {
> +		if (!trylock_page(page)) {
> +			put_page(page);
> +			return ret;
> +		}
> +		/*
> +		 * page_mapped() must be stable during the move. This
> +		 * pte is locked, so if it's present, the page cannot
> +		 * become unmapped. If it isn't, we have only partial
> +		 * control over the mapped state: the page lock will
> +		 * prevent new faults against pagecache and swapcache,
> +		 * so an unmapped page cannot become mapped. However,
> +		 * if the page is already mapped elsewhere, it can
> +		 * unmap, and there is nothing we can do about it.
> +		 * Alas, skip moving the page in this case.
> +		 */
> +		if (!pte_present(ptent) && page_mapped(page)) {
> +			unlock_page(page);
> +			put_page(page);
> +			return ret;
> +		}
> +	}
> +
>  	if (!page && !ent.val)
>  		return ret;
>  	if (page) {
> @@ -5864,8 +5878,11 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  			if (target)
>  				target->page = page;
>  		}
> -		if (!ret || !target)
> +		if (!ret || !target) {
> +			if (target)
> +				unlock_page(page);
>  			put_page(page);
> +		}
>  	}
>  	/*
>  	 * There is a swap entry and a page doesn't exist or isn't charged.
> @@ -5905,6 +5922,10 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
>  		ret = MC_TARGET_PAGE;
>  		if (target) {
>  			get_page(page);
> +			if (!trylock_page(page)) {
> +				put_page(page);
> +				return MC_TARGET_NONE;
> +			}
>  			target->page = page;
>  		}
>  	}
> @@ -6143,6 +6164,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  				}
>  				putback_lru_page(page);
>  			}
> +			unlock_page(page);
>  			put_page(page);
>  		} else if (target_type == MC_TARGET_DEVICE) {
>  			page = target.page;
> @@ -6151,6 +6173,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  				mc.precharge -= HPAGE_PMD_NR;
>  				mc.moved_charge += HPAGE_PMD_NR;
>  			}
> +			unlock_page(page);
>  			put_page(page);
>  		}
>  		spin_unlock(ptl);
> @@ -6193,7 +6216,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  			}
>  			if (!device)
>  				putback_lru_page(page);
> -put:			/* get_mctgt_type() gets the page */
> +put:			/* get_mctgt_type() gets & locks the page */
> +			unlock_page(page);
>  			put_page(page);
>  			break;
>  		case MC_TARGET_SWAP:
> -- 
> 2.38.1
> 
> 
> 


* Re: [PATCH 2/3] mm: rmap: remove lock_page_memcg()
  2022-12-06 17:13 ` [PATCH 2/3] mm: rmap: remove lock_page_memcg() Johannes Weiner
@ 2022-12-07  1:52   ` Hugh Dickins
  2022-12-08  0:36   ` Shakeel Butt
  1 sibling, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2022-12-07  1:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Shakeel Butt,
	Michal Hocko, linux-mm, cgroups, linux-kernel

On Tue, 6 Dec 2022, Johannes Weiner wrote:

> The previous patch made sure charge moving only touches pages for
> which page_mapped() is stable. lock_page_memcg() is no longer needed.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/rmap.c | 26 ++++++++------------------
>  1 file changed, 8 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b616870a09be..32e48b1c5847 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1222,9 +1222,6 @@ void page_add_anon_rmap(struct page *page,
>  	bool compound = flags & RMAP_COMPOUND;
>  	bool first = true;
>  
> -	if (unlikely(PageKsm(page)))
> -		lock_page_memcg(page);
> -
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
>  		first = atomic_inc_and_test(&page->_mapcount);
> @@ -1262,15 +1259,14 @@ void page_add_anon_rmap(struct page *page,
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
>  
> -	if (unlikely(PageKsm(page)))
> -		unlock_page_memcg(page);
> -
> -	/* address might be in next vma when migration races vma_adjust */
> -	else if (first)
> -		__page_set_anon_rmap(page, vma, address,
> -				     !!(flags & RMAP_EXCLUSIVE));
> -	else
> -		__page_check_anon_rmap(page, vma, address);
> +	if (likely(!PageKsm(page))) {
> +		/* address might be in next vma when migration races vma_adjust */
> +		if (first)
> +			__page_set_anon_rmap(page, vma, address,
> +					     !!(flags & RMAP_EXCLUSIVE));
> +		else
> +			__page_check_anon_rmap(page, vma, address);
> +	}
>  
>  	mlock_vma_page(page, vma, compound);
>  }
> @@ -1329,7 +1325,6 @@ void page_add_file_rmap(struct page *page,
>  	bool first;
>  
>  	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> -	lock_page_memcg(page);
>  
>  	/* Is page being mapped by PTE? Is this its first map to be added? */
>  	if (likely(!compound)) {
> @@ -1365,7 +1360,6 @@ void page_add_file_rmap(struct page *page,
>  			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
>  	if (nr)
>  		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
> -	unlock_page_memcg(page);
>  
>  	mlock_vma_page(page, vma, compound);
>  }
> @@ -1394,8 +1388,6 @@ void page_remove_rmap(struct page *page,
>  		return;
>  	}
>  
> -	lock_page_memcg(page);
> -
>  	/* Is page being unmapped by PTE? Is this its last map to be removed? */
>  	if (likely(!compound)) {
>  		last = atomic_add_negative(-1, &page->_mapcount);
> @@ -1451,8 +1443,6 @@ void page_remove_rmap(struct page *page,
>  	 * and remember that it's only reliable while mapped.
>  	 */
>  
> -	unlock_page_memcg(page);
> -
>  	munlock_vma_page(page, vma, compound);
>  }
>  
> -- 
> 2.38.1
> 
> 


* Re: [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-06 17:13 ` [PATCH 3/3] mm: memcontrol: deprecate charge moving Johannes Weiner
  2022-12-07  0:03   ` Shakeel Butt
@ 2022-12-07  1:58   ` Hugh Dickins
  2022-12-07 13:00     ` Johannes Weiner
  1 sibling, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2022-12-07  1:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Shakeel Butt,
	Michal Hocko, linux-mm, cgroups, linux-kernel

On Tue, 6 Dec 2022, Johannes Weiner wrote:

> Charge moving mode in cgroup1 allows memory to follow tasks as they
> migrate between cgroups. This is, and always has been, a questionable
> thing to do - for several reasons.
> 
> First, it's expensive. Pages need to be identified, locked and
> isolated from various MM operations, and reassigned, one by one.
> 
> Second, it's unreliable. Once pages are charged to a cgroup, there
> isn't always a clear owner task anymore. Cache isn't moved at all, for
> example. Mapped memory is moved - but if trylocking or isolating a
> page fails, it's arbitrarily left behind. Frequent moving between
> domains may leave a task's memory scattered all over the place.
> 
> Third, it isn't really needed. Launcher tasks can kick off workload
> tasks directly in their target cgroup. Using dedicated per-workload
> groups allows fine-grained policy adjustments - no need to move tasks
> and their physical pages between control domains. The feature was
> never forward-ported to cgroup2, and it hasn't been missed.
> 
> Despite it being a niche usecase, the maintenance overhead of
> supporting it is enormous. Because pages are moved while they are live
> and subject to various MM operations, the synchronization rules are
> complicated. There are lock_page_memcg() calls in MM and FS code, which
> non-cgroup people don't understand. In some cases we've been able to
> shift code and cgroup API calls around such that we can rely on native
> locking as much as possible. But that's fragile, and sometimes we need
> to hold MM locks for longer than we otherwise would (pte lock e.g.).
> 
> Mark the feature deprecated. Hopefully we can remove it soon.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Hugh Dickins <hughd@google.com>

but I wonder if it would be helpful to mention move_charge_at_immigrate
in the deprecation message: maybe the first line should be
"Cgroup memory moving (move_charge_at_immigrate) is deprecated.\n"

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
>  mm/memcontrol.c                                |  4 ++++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 60370f2c67b9..87d7877b98ec 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -86,6 +86,8 @@ Brief summary of control files.
>   memory.swappiness		     set/show swappiness parameter of vmscan
>  				     (See sysctl's vm.swappiness)
>   memory.move_charge_at_immigrate     set/show controls of moving charges
> +                                     This knob is deprecated and shouldn't be
> +                                     used.
>   memory.oom_control		     set/show oom controls.
>   memory.numa_stat		     show the number of memory usage per numa
>  				     node
> @@ -717,9 +719,16 @@ Soft limits can be setup by using the following commands (in this example we
>         It is recommended to set the soft limit always below the hard limit,
>         otherwise the hard limit will take precedence.
>  
> -8. Move charges at task migration
> +8. Move charges at task migration (DEPRECATED!)
>  =================================
>  
> +THIS IS DEPRECATED!
> +
> +It's expensive and unreliable! It's better practice to launch workload
> +tasks directly from inside their target cgroup. Use dedicated workload
> +cgroups to allow fine-grained policy adjustments without having to
> +move physical pages between control domains.
> +
>  Users can move charges associated with a task along with task migration, that
>  is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
>  This feature is not supported in !CONFIG_MMU environments because of lack of
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b696354c1b21..e650a38d9a90 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3919,6 +3919,10 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
> +	pr_warn_once("Cgroup memory moving is deprecated. "
> +		     "Please report your usecase to linux-mm@kvack.org if you "
> +		     "depend on this functionality.\n");
> +
>  	if (val & ~MOVE_MASK)
>  		return -EINVAL;
>  
> -- 
> 2.38.1
> 
> 


* Re: [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-07  1:58   ` Hugh Dickins
@ 2022-12-07 13:00     ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2022-12-07 13:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Shakeel Butt, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Tue, Dec 06, 2022 at 05:58:14PM -0800, Hugh Dickins wrote:
> On Tue, 6 Dec 2022, Johannes Weiner wrote:
> 
> > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > migrate between cgroups. This is, and always has been, a questionable
> > thing to do - for several reasons.
> > 
> > First, it's expensive. Pages need to be identified, locked and
> > isolated from various MM operations, and reassigned, one by one.
> > 
> > Second, it's unreliable. Once pages are charged to a cgroup, there
> > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > example. Mapped memory is moved - but if trylocking or isolating a
> > page fails, it's arbitrarily left behind. Frequent moving between
> > domains may leave a task's memory scattered all over the place.
> > 
> > Third, it isn't really needed. Launcher tasks can kick off workload
> > tasks directly in their target cgroup. Using dedicated per-workload
> > groups allows fine-grained policy adjustments - no need to move tasks
> > and their physical pages between control domains. The feature was
> > never forward-ported to cgroup2, and it hasn't been missed.
> > 
> > Despite it being a niche usecase, the maintenance overhead of
> > supporting it is enormous. Because pages are moved while they are live
> > and subject to various MM operations, the synchronization rules are
> > complicated. There are lock_page_memcg() calls in MM and FS code, which
> > non-cgroup people don't understand. In some cases we've been able to
> > shift code and cgroup API calls around such that we can rely on native
> > locking as much as possible. But that's fragile, and sometimes we need
> > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> > 
> > Mark the feature deprecated. Hopefully we can remove it soon.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Hugh Dickins <hughd@google.com>

Thanks

> but I wonder if it would be helpful to mention move_charge_at_immigrate
> in the deprecation message: maybe the first line should be
> "Cgroup memory moving (move_charge_at_immigrate) is deprecated.\n"

Fair enough! Here is the updated patch.

---

From 0e791e6ab8ba2f75dd4205684c06bcc7308d9867 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Dec 2022 19:57:06 +0100
Subject: [PATCH] mm: memcontrol: deprecate charge moving

Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive. Pages need to be identified, locked and
isolated from various MM operations, and reassigned, one by one.

Second, it's unreliable. Once pages are charged to a cgroup, there
isn't always a clear owner task anymore. Cache isn't moved at all, for
example. Mapped memory is moved - but if trylocking or isolating a
page fails, it's arbitrarily left behind. Frequent moving between
domains may leave a task's memory scattered all over the place.

Third, it isn't really needed. Launcher tasks can kick off workload
tasks directly in their target cgroup. Using dedicated per-workload
groups allows fine-grained policy adjustments - no need to move tasks
and their physical pages between control domains. The feature was
never forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of
supporting it is enormous. Because pages are moved while they are live
and subject to various MM operations, the synchronization rules are
complicated. There are lock_page_memcg() calls in MM and FS code, which
non-cgroup people don't understand. In some cases we've been able to
shift code and cgroup API calls around such that we can rely on native
locking as much as possible. But that's fragile, and sometimes we need
to hold MM locks for longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
 mm/memcontrol.c                                |  4 ++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 60370f2c67b9..87d7877b98ec 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -86,6 +86,8 @@ Brief summary of control files.
  memory.swappiness		     set/show swappiness parameter of vmscan
 				     (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate     set/show controls of moving charges
+                                     This knob is deprecated and shouldn't be
+                                     used.
  memory.oom_control		     set/show oom controls.
  memory.numa_stat		     show the number of memory usage per numa
 				     node
@@ -717,9 +719,16 @@ Soft limits can be setup by using the following commands (in this example we
        It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. Move charges at task migration
+8. Move charges at task migration (DEPRECATED!)
 =================================
 
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
+
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b696354c1b21..9c9a42153b76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3919,6 +3919,10 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
 	if (val & ~MOVE_MASK)
 		return -EINVAL;
 
-- 
2.38.1



* Re: [PATCH v2 0/3] mm: push down lock_page_memcg()
  2022-12-06 17:13 [PATCH v2 0/3] mm: push down lock_page_memcg() Johannes Weiner
                   ` (2 preceding siblings ...)
  2022-12-06 17:13 ` [PATCH 3/3] mm: memcontrol: deprecate charge moving Johannes Weiner
@ 2022-12-07 14:07 ` Michal Hocko
  3 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2022-12-07 14:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Shakeel Butt,
	linux-mm, cgroups, linux-kernel

On Tue 06-12-22 18:13:38, Johannes Weiner wrote:
> New series based on the discussion in the previous thread around
> getting lock_page_memcg() out of rmap.
> 
> I beat on this with concurrent high-frequency moving of tasks that
> partially share a swapped out shmem file. I didn't spot anything
> problematic. That said, it is quite subtle, and Hugh, I'd feel better
> if you could also subject it to your torture suite ;)

For the whole series
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> 
> Thanks!
> 
> Against yesterday's mm-unstable.
> 
>  Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++-
>  mm/memcontrol.c                                | 56 ++++++++++++++++++------
>  mm/rmap.c                                      | 26 ++++-------
>  3 files changed, 60 insertions(+), 33 deletions(-)
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-07  0:03   ` Shakeel Butt
@ 2022-12-07 21:51     ` Andrew Morton
  2022-12-07 22:15       ` Shakeel Butt
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2022-12-07 21:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Linus Torvalds, Hugh Dickins, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Tue, 6 Dec 2022 16:03:54 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > migrate between cgroups. This is, and always has been, a questionable
> > thing to do - for several reasons.
> >
> > First, it's expensive. Pages need to be identified, locked and
> > isolated from various MM operations, and reassigned, one by one.
> >
> > Second, it's unreliable. Once pages are charged to a cgroup, there
> > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > example. Mapped memory is moved - but if trylocking or isolating a
> > page fails, it's arbitrarily left behind. Frequent moving between
> > domains may leave a task's memory scattered all over the place.
> >
> > Third, it isn't really needed. Launcher tasks can kick off workload
> > tasks directly in their target cgroup. Using dedicated per-workload
> > groups allows fine-grained policy adjustments - no need to move tasks
> > and their physical pages between control domains. The feature was
> > never forward-ported to cgroup2, and it hasn't been missed.
> >
> > Despite it being a niche usecase, the maintenance overhead of
> > supporting it is enormous. Because pages are moved while they are live
> > and subject to various MM operations, the synchronization rules are
> > complicated. There are lock_page_memcg() calls in MM and FS code, which
> > non-cgroup people don't understand. In some cases we've been able to
> > shift code and cgroup API calls around such that we can rely on native
> > locking as much as possible. But that's fragile, and sometimes we need
> > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> >
> > Mark the feature deprecated. Hopefully we can remove it soon.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Shakeel Butt <shakeelb@google.com>
> 
> I would request this patch to be backported to stable kernels as well
> for early warnings to users who update to newer kernels very late.

Sounds reasonable, but the changelog should have a few words in it
explaining why we're requesting the backport.  I guess I can type those
in.

We're at -rc8 and I'm not planning on merging these up until after
6.2-rc1 is out.  Please feel free to argue with me on that score.


* Re: [PATCH 3/3] mm: memcontrol: deprecate charge moving
  2022-12-07 21:51     ` Andrew Morton
@ 2022-12-07 22:15       ` Shakeel Butt
  0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2022-12-07 22:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Linus Torvalds, Hugh Dickins, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Wed, Dec 7, 2022 at 1:51 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 6 Dec 2022 16:03:54 -0800 Shakeel Butt <shakeelb@google.com> wrote:
>
> > On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > > migrate between cgroups. This is, and always has been, a questionable
> > > thing to do - for several reasons.
> > >
> > > First, it's expensive. Pages need to be identified, locked and
> > > isolated from various MM operations, and reassigned, one by one.
> > >
> > > Second, it's unreliable. Once pages are charged to a cgroup, there
> > > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > > example. Mapped memory is moved - but if trylocking or isolating a
> > > page fails, it's arbitrarily left behind. Frequent moving between
> > > domains may leave a task's memory scattered all over the place.
> > >
> > > Third, it isn't really needed. Launcher tasks can kick off workload
> > > tasks directly in their target cgroup. Using dedicated per-workload
> > > groups allows fine-grained policy adjustments - no need to move tasks
> > > and their physical pages between control domains. The feature was
> > > never forward-ported to cgroup2, and it hasn't been missed.
> > >
> > > Despite it being a niche usecase, the maintenance overhead of
> > > supporting it is enormous. Because pages are moved while they are live
> > > and subject to various MM operations, the synchronization rules are
> > > complicated. There are lock_page_memcg() calls in MM and FS code, which
> > > non-cgroup people don't understand. In some cases we've been able to
> > > shift code and cgroup API calls around such that we can rely on native
> > > locking as much as possible. But that's fragile, and sometimes we need
> > > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> > >
> > > Mark the feature deprecated. Hopefully we can remove it soon.
> > >
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Acked-by: Shakeel Butt <shakeelb@google.com>
> >
> > I would request this patch to be backported to stable kernels as well
> > for early warnings to users who update to newer kernels very late.
>
> Sounds reasonable, but the changelog should have a few words in it
> explaining why we're requesting the backport.  I guess I can type those
> in.

Thanks a lot.

>
> We're at -rc8 and I'm not planning on merging these up until after
> 6.2-rc1 is out.  Please feel free to argue with me on that score.

No, I totally agree with you. There is no such urgency in merging
these and a couple of weeks delay is totally fine.


* Re: [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere
  2022-12-06 17:13 ` [PATCH 1/3] mm: memcontrol: skip moving non-present pages that are mapped elsewhere Johannes Weiner
  2022-12-07  1:51   ` Hugh Dickins
@ 2022-12-08  0:36   ` Shakeel Butt
  1 sibling, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2022-12-08  0:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> During charge moving, the pte lock and the page lock cover nearly all
> cases of stabilizing page_mapped(). The only exception is when we're
> looking at a non-present pte and find a page in the page cache or in
> the swapcache: if the page is mapped elsewhere, it can become unmapped
> outside of our control. For this reason, rmap needs lock_page_memcg().
>
> We don't like cgroup-specific locks in generic MM code - especially in
> performance-critical MM code - and for a legacy feature that's
> unlikely to have many users left - if any.
>
> So remove the exception. Arguably that's better semantics anyway: the
> page is shared, and another process seems to be the more active user.
>
> Once we stop moving such pages, rmap doesn't need lock_page_memcg()
> anymore. The next patch will remove it.
>
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH 2/3] mm: rmap: remove lock_page_memcg()
  2022-12-06 17:13 ` [PATCH 2/3] mm: rmap: remove lock_page_memcg() Johannes Weiner
  2022-12-07  1:52   ` Hugh Dickins
@ 2022-12-08  0:36   ` Shakeel Butt
  1 sibling, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2022-12-08  0:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Hugh Dickins, Michal Hocko,
	linux-mm, cgroups, linux-kernel

On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The previous patch made sure charge moving only touches pages for
> which page_mapped() is stable. lock_page_memcg() is no longer needed.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

