linux-mm.kvack.org archive mirror
* [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable()
@ 2018-06-22 15:12 Sebastian Andrzej Siewior
  2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-22 15:12 UTC (permalink / raw)
  To: linux-mm; +Cc: tglx, Andrew Morton

A small series which avoids using local_irq_disable()/local_irq_enable()
and instead uses spin_lock_irq()/spin_unlock_irq(), so that disabling
interrupts stays within the context of the lock it belongs to.
Patch #1 is a cleanup where local_irq_.*() remained after the lock was
removed.
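
As a rough sketch of the pattern (made-up lock and function names, not
taken from the series):

	/* before: interrupt state managed separately from the lock */
	local_irq_disable();
	spin_lock(&example_lock);
	do_something();
	spin_unlock(&example_lock);
	local_irq_enable();

	/* after: the lock primitive itself disables/enables interrupts */
	spin_lock_irq(&example_lock);
	do_something();
	spin_unlock_irq(&example_lock);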

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes()
  2018-06-22 15:12 [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Sebastian Andrzej Siewior
@ 2018-06-22 15:12 ` Sebastian Andrzej Siewior
  2018-06-24 19:51   ` Vladimir Davydov
  2018-06-25 10:36   ` Kirill Tkhai
  2018-06-22 15:12 ` [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-22 15:12 UTC (permalink / raw)
  To: linux-mm; +Cc: tglx, Andrew Morton, Sebastian Andrzej Siewior, Kirill Tkhai

In commit 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items
lockless") the
	spin_lock(&nlru->lock);

statement was replaced with
	rcu_read_lock();

in __list_lru_count_one(). The comment in count_shadow_nodes() says that
the local_irq_disable() is required because the lock must be acquired
with disabled interrupts and (spin_lock()) does not do so.
Since the lock is replaced with rcu_read_lock() the local_irq_disable()
is no longer needed. The code path is
  list_lru_shrink_count()
    -> list_lru_count_one()
      -> __list_lru_count_one()
        -> rcu_read_lock()
        -> list_lru_from_memcg_idx()
        -> rcu_read_unlock()

Remove the local_irq_disable() statement.
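
For reference, a simplified sketch of the count path after the commit
above (trimmed down from mm/list_lru.c of that time); nothing in it
requires interrupts to be disabled:

	static unsigned long __list_lru_count_one(struct list_lru *lru,
						  int nid, int memcg_idx)
	{
		struct list_lru_node *nlru = &lru->node[nid];
		struct list_lru_one *l;
		unsigned long count;

		/* nlru->lock is not taken; the list is dereferenced under RCU */
		rcu_read_lock();
		l = list_lru_from_memcg_idx(nlru, memcg_idx);
		count = l->nr_items;
		rcu_read_unlock();

		return count;
	}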

Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/workingset.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..ed8151180899 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -366,10 +366,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	unsigned long nodes;
 	unsigned long cache;
 
-	/* list_lru lock nests inside the IRQ-safe i_pages lock */
-	local_irq_disable();
 	nodes = list_lru_shrink_count(&shadow_nodes, sc);
-	local_irq_enable();
 
 	/*
 	 * Approximate a reasonable limit for the radix tree nodes
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-22 15:12 [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Sebastian Andrzej Siewior
  2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
@ 2018-06-22 15:12 ` Sebastian Andrzej Siewior
  2018-06-24 19:57   ` Vladimir Davydov
  2018-06-22 15:12 ` [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Sebastian Andrzej Siewior
  2018-06-22 21:39 ` [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Andrew Morton
  3 siblings, 1 reply; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-22 15:12 UTC (permalink / raw)
  To: linux-mm; +Cc: tglx, Andrew Morton, Sebastian Andrzej Siewior

shadow_lru_isolate() disables interrupts and acquires a lock. It could
use spin_lock_irq() instead. It also uses local_irq_enable() while it
could use spin_unlock_irq()/xa_unlock_irq().

Use proper suffix for lock/unlock in order to enable/disable interrupts
during release/acquire of a lock.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/workingset.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index ed8151180899..529480c21f93 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -431,7 +431,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 	/* Coming from the list, invert the lock order */
 	if (!xa_trylock(&mapping->i_pages)) {
-		spin_unlock(lru_lock);
+		spin_unlock_irq(lru_lock);
 		ret = LRU_RETRY;
 		goto out;
 	}
@@ -469,13 +469,11 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 				 workingset_lookup_update(mapping));
 
 out_invalid:
-	xa_unlock(&mapping->i_pages);
+	xa_unlock_irq(&mapping->i_pages);
 	ret = LRU_REMOVED_RETRY;
 out:
-	local_irq_enable();
 	cond_resched();
-	local_irq_disable();
-	spin_lock(lru_lock);
+	spin_lock_irq(lru_lock);
 	return ret;
 }
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init()
  2018-06-22 15:12 [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Sebastian Andrzej Siewior
  2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
  2018-06-22 15:12 ` [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix Sebastian Andrzej Siewior
@ 2018-06-22 15:12 ` Sebastian Andrzej Siewior
  2018-06-24 20:09   ` Vladimir Davydov
  2018-06-22 21:39 ` [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Andrew Morton
  3 siblings, 1 reply; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-22 15:12 UTC (permalink / raw)
  To: linux-mm; +Cc: tglx, Andrew Morton, Sebastian Andrzej Siewior

scan_shadow_nodes() is the only user of __list_lru_walk_one() which
disables interrupts before invoking it. The reason is that nlru->lock is
nesting inside IRQ-safe i_pages lock. Some functions unconditionally
acquire the lock with the _irq() suffix.

__list_lru_walk_one() can't acquire the lock unconditionally with _irq()
suffix because it might invoke a callback which unlocks the nlru->lock
and invokes a sleeping function without enabling interrupts.

Add an argument to __list_lru_init() which identifies whether the
nlru->lock needs to be acquired with interrupts disabled or not.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/list_lru.h | 12 ++++++++----
 mm/list_lru.c            | 14 ++++++++++----
 mm/workingset.c          | 12 ++++--------
 3 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 96def9d15b1b..c2161c3a1809 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -51,18 +51,22 @@ struct list_lru_node {
 
 struct list_lru {
 	struct list_lru_node	*node;
+	bool			lock_irq;
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 	struct list_head	list;
 #endif
 };
 
 void list_lru_destroy(struct list_lru *lru);
-int __list_lru_init(struct list_lru *lru, bool memcg_aware,
+int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq,
 		    struct lock_class_key *key);
 
-#define list_lru_init(lru)		__list_lru_init((lru), false, NULL)
-#define list_lru_init_key(lru, key)	__list_lru_init((lru), false, (key))
-#define list_lru_init_memcg(lru)	__list_lru_init((lru), true, NULL)
+#define list_lru_init(lru)		__list_lru_init((lru), false, false, \
+							NULL)
+#define list_lru_init_key(lru, key)	__list_lru_init((lru), false, false, \
+							(key))
+#define list_lru_init_memcg(lru)	__list_lru_init((lru), true, false, \
+							NULL)
 
 int memcg_update_all_list_lrus(int num_memcgs);
 void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index fcfb6c89ed47..1c49d48078e4 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -204,7 +204,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
-	spin_lock(&nlru->lock);
+	if (lru->lock_irq)
+		spin_lock_irq(&nlru->lock);
+	else
+		spin_lock(&nlru->lock);
 	l = list_lru_from_memcg_idx(nlru, memcg_idx);
 restart:
 	list_for_each_safe(item, n, &l->list) {
@@ -251,7 +254,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 		}
 	}
 
-	spin_unlock(&nlru->lock);
+	if (lru->lock_irq)
+		spin_unlock_irq(&nlru->lock);
+	else
+		spin_unlock(&nlru->lock);
 	return isolated;
 }
 
@@ -553,7 +559,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
 }
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
-int __list_lru_init(struct list_lru *lru, bool memcg_aware,
+int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq,
 		    struct lock_class_key *key)
 {
 	int i;
@@ -580,7 +586,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 		lru->node = NULL;
 		goto out;
 	}
-
+	lru->lock_irq = lock_irq;
 	list_lru_register(lru);
 out:
 	memcg_put_cache_ids();
diff --git a/mm/workingset.c b/mm/workingset.c
index 529480c21f93..23ce00f48212 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -480,13 +480,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
 				       struct shrink_control *sc)
 {
-	unsigned long ret;
-
-	/* list_lru lock nests inside the IRQ-safe i_pages lock */
-	local_irq_disable();
-	ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL);
-	local_irq_enable();
-	return ret;
+	return list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate,
+				    NULL);
 }
 
 static struct shrinker workingset_shadow_shrinker = {
@@ -523,7 +518,8 @@ static int __init workingset_init(void)
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
 	       timestamp_bits, max_order, bucket_order);
 
-	ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
+	/* list_lru lock nests inside the IRQ-safe i_pages lock */
+	ret = __list_lru_init(&shadow_nodes, true, true, &shadow_nodes_key);
 	if (ret)
 		goto err;
 	ret = register_shrinker(&workingset_shadow_shrinker);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable()
  2018-06-22 15:12 [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2018-06-22 15:12 ` [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Sebastian Andrzej Siewior
@ 2018-06-22 21:39 ` Andrew Morton
  2018-06-24 20:10   ` Vladimir Davydov
  3 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2018-06-22 21:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Kirill Tkhai, Vladimir Davydov

On Fri, 22 Jun 2018 17:12:18 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> A small series which avoids using local_irq_disable()/local_irq_enable()
> and instead uses spin_lock_irq()/spin_unlock_irq(), so that disabling
> interrupts stays within the context of the lock it belongs to.
> Patch #1 is a cleanup where local_irq_.*() remained after the lock was
> removed.

Looks OK.

And we may as well do this...

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/list_lru.c: fold __list_lru_count_one() into its caller

__list_lru_count_one() has a single callsite.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |   12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff -puN mm/list_lru.c~mm-list_lruc-fold-__list_lru_count_one-into-its-caller mm/list_lru.c
--- a/mm/list_lru.c~mm-list_lruc-fold-__list_lru_count_one-into-its-caller
+++ a/mm/list_lru.c
@@ -162,26 +162,20 @@ void list_lru_isolate_move(struct list_l
 }
 EXPORT_SYMBOL_GPL(list_lru_isolate_move);
 
-static unsigned long __list_lru_count_one(struct list_lru *lru,
-					  int nid, int memcg_idx)
+unsigned long list_lru_count_one(struct list_lru *lru,
+				 int nid, struct mem_cgroup *memcg)
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 	unsigned long count;
 
 	rcu_read_lock();
-	l = list_lru_from_memcg_idx(nlru, memcg_idx);
+	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	count = l->nr_items;
 	rcu_read_unlock();
 
 	return count;
 }
-
-unsigned long list_lru_count_one(struct list_lru *lru,
-				 int nid, struct mem_cgroup *memcg)
-{
-	return __list_lru_count_one(lru, nid, memcg_cache_id(memcg));
-}
 EXPORT_SYMBOL_GPL(list_lru_count_one);
 
 unsigned long list_lru_count_node(struct list_lru *lru, int nid)
_

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes()
  2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
@ 2018-06-24 19:51   ` Vladimir Davydov
  2018-06-25 10:36   ` Kirill Tkhai
  1 sibling, 0 replies; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-24 19:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton, Kirill Tkhai

On Fri, Jun 22, 2018 at 05:12:19PM +0200, Sebastian Andrzej Siewior wrote:
> In commit 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items
> lockless") the
> 	spin_lock(&nlru->lock);
> 
> statement was replaced with
> 	rcu_read_lock();
> 
> in __list_lru_count_one(). The comment in count_shadow_nodes() says that
> the local_irq_disable() is required because the lock must be acquired
> with disabled interrupts and (spin_lock()) does not do so.
> Since the lock is replaced with rcu_read_lock() the local_irq_disable()
> is no longer needed. The code path is
>   list_lru_shrink_count()
>     -> list_lru_count_one()
>       -> __list_lru_count_one()
>         -> rcu_read_lock()
>         -> list_lru_from_memcg_idx()
>         -> rcu_read_unlock()
> 
> Remove the local_irq_disable() statement.
> 
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-22 15:12 ` [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix Sebastian Andrzej Siewior
@ 2018-06-24 19:57   ` Vladimir Davydov
  2018-06-26 21:25     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-24 19:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton

On Fri, Jun 22, 2018 at 05:12:20PM +0200, Sebastian Andrzej Siewior wrote:
> shadow_lru_isolate() disables interrupts and acquires a lock. It could
> use spin_lock_irq() instead. It also uses local_irq_enable() while it
> could use spin_unlock_irq()/xa_unlock_irq().
> 
> Use proper suffix for lock/unlock in order to enable/disable interrupts
> during release/acquire of a lock.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I don't like when a spin lock is locked with local_irq_disabled +
spin_lock and unlocked with spin_unlock_irq - it looks asymmetric.
IMHO the code is pretty easy to follow as it is - local_irq_disable in
scan_shadow_nodes matches local_irq_enable in shadow_lru_isolate.
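
To spell that out (condensed from scan_shadow_nodes() as it stands and
from the hunk quoted below; just a sketch of the asymmetry, not a new
change):

	/* scan_shadow_nodes(): */
	local_irq_disable();
	ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL);
	/* ^ takes nlru->lock with a plain spin_lock() */
	local_irq_enable();

	/* shadow_lru_isolate(), retry path with this patch applied: */
	if (!xa_trylock(&mapping->i_pages)) {
		spin_unlock_irq(lru_lock);	/* unlock also re-enables irqs */
		ret = LRU_RETRY;
		goto out;
	}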

> ---
>  mm/workingset.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ed8151180899..529480c21f93 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -431,7 +431,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  
>  	/* Coming from the list, invert the lock order */
>  	if (!xa_trylock(&mapping->i_pages)) {
> -		spin_unlock(lru_lock);
> +		spin_unlock_irq(lru_lock);
>  		ret = LRU_RETRY;
>  		goto out;
>  	}
> @@ -469,13 +469,11 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  				 workingset_lookup_update(mapping));
>  
>  out_invalid:
> -	xa_unlock(&mapping->i_pages);
> +	xa_unlock_irq(&mapping->i_pages);
>  	ret = LRU_REMOVED_RETRY;
>  out:
> -	local_irq_enable();
>  	cond_resched();
> -	local_irq_disable();
> -	spin_lock(lru_lock);
> +	spin_lock_irq(lru_lock);
>  	return ret;
>  }

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init()
  2018-06-22 15:12 ` [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Sebastian Andrzej Siewior
@ 2018-06-24 20:09   ` Vladimir Davydov
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-24 20:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton

On Fri, Jun 22, 2018 at 05:12:21PM +0200, Sebastian Andrzej Siewior wrote:
> scan_shadow_nodes() is the only user of __list_lru_walk_one() which
> disables interrupts before invoking it. The reason is that nlru->lock is
> nesting inside IRQ-safe i_pages lock. Some functions unconditionally
> acquire the lock with the _irq() suffix.
> 
> __list_lru_walk_one() can't acquire the lock unconditionally with _irq()
> suffix because it might invoke a callback which unlocks the nlru->lock
> and invokes a sleeping function without enabling interrupts.
> 
> Add an argument to __list_lru_init() which identifies whether the
> nlru->lock needs to be acquired with interrupts disabled or not.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/list_lru.h | 12 ++++++++----
>  mm/list_lru.c            | 14 ++++++++++----
>  mm/workingset.c          | 12 ++++--------
>  3 files changed, 22 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 96def9d15b1b..c2161c3a1809 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -51,18 +51,22 @@ struct list_lru_node {
>  
>  struct list_lru {
>  	struct list_lru_node	*node;
> +	bool			lock_irq;
>  #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
>  	struct list_head	list;
>  #endif
>  };

TBO I don't like this patch, because the new member of struct list_lru,
lock_irq, has rather obscure meaning IMHO: it makes list_lru_walk
disable irq before taking lru_lock, but at the same time list_lru_add
and list_lru_del never do that, no matter whether lock_irq is true or
false. That is, if a user of struct list_lru sets this flag, he's
supposed to disable irq for list_lru_add/del by himself (mm/workingset
does that). IMHO the code of mm/workingset is clear as it is. Since it
is the only place where this flag is used, I'd rather leave it as is.
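
A condensed sketch of what I mean (pieced together from this patch and
the existing list_lru code, not a proposed change):

	/* __list_lru_walk_one() with this patch: honours the flag */
	if (lru->lock_irq)
		spin_lock_irq(&nlru->lock);
	else
		spin_lock(&nlru->lock);

	/* list_lru_add()/list_lru_del(): unaffected, always a plain lock */
	spin_lock(&nlru->lock);
	/* ... add or remove the item ... */
	spin_unlock(&nlru->lock);

	/*
	 * So a user that passes lock_irq=true must still make sure interrupts
	 * are off around add/del, e.g. by already holding an IRQ-safe lock,
	 * which is what mm/workingset does via the i_pages lock.
	 */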

>  
>  void list_lru_destroy(struct list_lru *lru);
> -int __list_lru_init(struct list_lru *lru, bool memcg_aware,
> +int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq,
>  		    struct lock_class_key *key);
>  
> -#define list_lru_init(lru)		__list_lru_init((lru), false, NULL)
> -#define list_lru_init_key(lru, key)	__list_lru_init((lru), false, (key))
> -#define list_lru_init_memcg(lru)	__list_lru_init((lru), true, NULL)
> +#define list_lru_init(lru)		__list_lru_init((lru), false, false, \
> +							NULL)
> +#define list_lru_init_key(lru, key)	__list_lru_init((lru), false, false, \
> +							(key))
> +#define list_lru_init_memcg(lru)	__list_lru_init((lru), true, false, \
> +							NULL)
>  
>  int memcg_update_all_list_lrus(int num_memcgs);
>  void memcg_drain_all_list_lrus(int src_idx, int dst_idx);
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index fcfb6c89ed47..1c49d48078e4 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -204,7 +204,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
>  	struct list_head *item, *n;
>  	unsigned long isolated = 0;
>  
> -	spin_lock(&nlru->lock);
> +	if (lru->lock_irq)
> +		spin_lock_irq(&nlru->lock);
> +	else
> +		spin_lock(&nlru->lock);
>  	l = list_lru_from_memcg_idx(nlru, memcg_idx);
>  restart:
>  	list_for_each_safe(item, n, &l->list) {
> @@ -251,7 +254,10 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
>  		}
>  	}
>  
> -	spin_unlock(&nlru->lock);
> +	if (lru->lock_irq)
> +		spin_unlock_irq(&nlru->lock);
> +	else
> +		spin_unlock(&nlru->lock);
>  	return isolated;
>  }
>  
> @@ -553,7 +559,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
>  }
>  #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
>  
> -int __list_lru_init(struct list_lru *lru, bool memcg_aware,
> +int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool lock_irq,
>  		    struct lock_class_key *key)
>  {
>  	int i;
> @@ -580,7 +586,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
>  		lru->node = NULL;
>  		goto out;
>  	}
> -
> +	lru->lock_irq = lock_irq;
>  	list_lru_register(lru);
>  out:
>  	memcg_put_cache_ids();
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 529480c21f93..23ce00f48212 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -480,13 +480,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
>  				       struct shrink_control *sc)
>  {
> -	unsigned long ret;
> -
> -	/* list_lru lock nests inside the IRQ-safe i_pages lock */
> -	local_irq_disable();
> -	ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL);
> -	local_irq_enable();
> -	return ret;
> +	return list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate,
> +				    NULL);
>  }
>  
>  static struct shrinker workingset_shadow_shrinker = {
> @@ -523,7 +518,8 @@ static int __init workingset_init(void)
>  	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
>  	       timestamp_bits, max_order, bucket_order);
>  
> -	ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key);
> +	/* list_lru lock nests inside the IRQ-safe i_pages lock */
> +	ret = __list_lru_init(&shadow_nodes, true, true, &shadow_nodes_key);
>  	if (ret)
>  		goto err;
>  	ret = register_shrinker(&workingset_shadow_shrinker);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable()
  2018-06-22 21:39 ` [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Andrew Morton
@ 2018-06-24 20:10   ` Vladimir Davydov
  0 siblings, 0 replies; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-24 20:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Sebastian Andrzej Siewior, linux-mm, tglx, Kirill Tkhai

On Fri, Jun 22, 2018 at 02:39:00PM -0700, Andrew Morton wrote:
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/list_lru.c: fold __list_lru_count_one() into its caller
> 
> __list_lru_count_one() has a single callsite.
> 
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes()
  2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
  2018-06-24 19:51   ` Vladimir Davydov
@ 2018-06-25 10:36   ` Kirill Tkhai
  1 sibling, 0 replies; 43+ messages in thread
From: Kirill Tkhai @ 2018-06-25 10:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mm; +Cc: tglx, Andrew Morton

On 22.06.2018 18:12, Sebastian Andrzej Siewior wrote:
> In commit 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items
> lockless") the
> 	spin_lock(&nlru->lock);
> 
> statement was replaced with
> 	rcu_read_lock();
> 
> in __list_lru_count_one(). The comment in count_shadow_nodes() says that
> the local_irq_disable() is required because the lock must be acquired
> with disabled interrupts and (spin_lock()) does not do so.
> Since the lock is replaced with rcu_read_lock() the local_irq_disable()
> is no longer needed. The code path is
>   list_lru_shrink_count()
>     -> list_lru_count_one()
>       -> __list_lru_count_one()
>         -> rcu_read_lock()
>         -> list_lru_from_memcg_idx()
>         -> rcu_read_unlock()
> 
> Remove the local_irq_disable() statement.
> 
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Looks good to me.

Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
>  mm/workingset.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 40ee02c83978..ed8151180899 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -366,10 +366,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	unsigned long nodes;
>  	unsigned long cache;
>  
> -	/* list_lru lock nests inside the IRQ-safe i_pages lock */
> -	local_irq_disable();
>  	nodes = list_lru_shrink_count(&shadow_nodes, sc);
> -	local_irq_enable();
>  
>  	/*
>  	 * Approximate a reasonable limit for the radix tree nodes
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-24 19:57   ` Vladimir Davydov
@ 2018-06-26 21:25     ` Sebastian Andrzej Siewior
  2018-06-27  8:50       ` Vladimir Davydov
  0 siblings, 1 reply; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-26 21:25 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton

On 2018-06-24 22:57:53 [+0300], Vladimir Davydov wrote:
> On Fri, Jun 22, 2018 at 05:12:20PM +0200, Sebastian Andrzej Siewior wrote:
> > shadow_lru_isolate() disables interrupts and acquires a lock. It could
> > use spin_lock_irq() instead. It also uses local_irq_enable() while it
> > could use spin_unlock_irq()/xa_unlock_irq().
> > 
> > Use proper suffix for lock/unlock in order to enable/disable interrupts
> > during release/acquire of a lock.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> I don't like when a spin lock is locked with local_irq_disabled +
> spin_lock and unlocked with spin_unlock_irq - it looks asymmetric.
> IMHO the code is pretty easy to follow as it is - local_irq_disable in
> scan_shadow_nodes matches local_irq_enable in shadow_lru_isolate.

it is not asymmetric because a later patch makes it use spin_lock_irq(),
too. If you use local_irq_disable() and a spin_lock() (like you suggest
in 3/3 as well) then you separate the locking instruction. It works as
expected on vanilla but breaks other locking implementations like those
on RT. Also if the locking changes then the local_irq_disable() part
will be forgotten like you saw in 1/3 of this series.

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-26 21:25     ` Sebastian Andrzej Siewior
@ 2018-06-27  8:50       ` Vladimir Davydov
  2018-06-27  9:20         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-27  8:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton

On Tue, Jun 26, 2018 at 11:25:34PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-06-24 22:57:53 [+0300], Vladimir Davydov wrote:
> > On Fri, Jun 22, 2018 at 05:12:20PM +0200, Sebastian Andrzej Siewior wrote:
> > > shadow_lru_isolate() disables interrupts and acquires a lock. It could
> > > use spin_lock_irq() instead. It also uses local_irq_enable() while it
> > > could use spin_unlock_irq()/xa_unlock_irq().
> > > 
> > > Use proper suffix for lock/unlock in order to enable/disable interrupts
> > > during release/acquire of a lock.
> > > 
> > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > I don't like when a spin lock is locked with local_irq_disabled +
> > spin_lock and unlocked with spin_unlock_irq - it looks asymmetric.
> > IMHO the code is pretty easy to follow as it is - local_irq_disable in
> > scan_shadow_nodes matches local_irq_enable in shadow_lru_isolate.
> 
> it is not asymmetric because a later patch makes it use
> spin_lock_irq(), too. If you use local_irq_disable() and a spin_lock()
> (like you suggest in 3/3 as well) then you separate the locking
> instruction. It works as expected on vanilla but break other locking
> implementations like those on RT.

As I said earlier, I don't like patch 3 either, because I find the
notion of list_lru::lock_irq flag abstruse since it doesn't make all
code paths taking the lock disable irq: list_lru_add/del use spin_lock
no matter whether the flag is set or not. That is, when you initialize a
list_lru and pass lock_irq=true, you'll have to keep in mind that it
only protects list_lru_walk, while list_lru_add/del must be called with
irq disabled by the caller. Disabling irq before list_lru_walk
explicitly looks much more straightforward IMO.

As for RT, it wouldn't need mm/workingset altogether AFAIU. Anyway, it's
rather unusual to care about out-of-the-tree patches when changing the
vanilla kernel code IMO. Using local_irq_disable + spin_lock instead of
spin_lock_irq is a typical pattern, and I don't see how changing this
particular place would help RT.

> Also if the locking changes then the local_irq_disable() part will be
> forgotten like you saw in 1/3 of this series.

If the locking changes, we'll have to revise all list_lru users anyway.
Yeah, we missed it last time, but it didn't break anything, and it was
finally found and fixed (by you, thanks BTW).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-27  8:50       ` Vladimir Davydov
@ 2018-06-27  9:20         ` Sebastian Andrzej Siewior
  2018-06-28  9:30           ` Vladimir Davydov
  0 siblings, 1 reply; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-06-27  9:20 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton

On 2018-06-27 11:50:03 [+0300], Vladimir Davydov wrote:
> > it is not asymmetric because a later patch makes it use
> > spin_lock_irq(), too. If you use local_irq_disable() and a spin_lock()
> > (like you suggest in 3/3 as well) then you separate the locking
> > instruction. It works as expected on vanilla but breaks other locking
> > implementations like those on RT.
> 
> As I said earlier, I don't like patch 3 either, because I find the
> notion of list_lru::lock_irq flag abstruse since it doesn't make all
> code paths taking the lock disable irq: list_lru_add/del use spin_lock
> no matter whether the flag is set or not. That is, when you initialize a
> list_lru and pass lock_irq=true, you'll have to keep in mind that it
> only protects list_lru_walk, while list_lru_add/del must be called with
> irq disabled by the caller. Disabling irq before list_lru_walk
> explicitly looks much more straightforward IMO.

It helps to keep the locking annotation in one place. If it helps I
could add the _irqsave() suffix to list_lru_add/del like it is already
done in other places (in this file).
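
A sketch of what that would look like (hypothetical, not part of any
posted patch):

	unsigned long flags;

	/* in list_lru_add()/list_lru_del(), instead of spin_lock(&nlru->lock): */
	spin_lock_irqsave(&nlru->lock, flags);
	/* ... add or remove the item ... */
	spin_unlock_irqrestore(&nlru->lock, flags);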

> As for RT, it wouldn't need mm/workingset altogether AFAIU. 
Why wouldn't it need it?

> Anyway, it's
> rather unusual to care about out-of-the-tree patches when changing the
> vanilla kernel code IMO. 
The plan is not to stay out-of-tree forever. And I don't intend to make
impossible or hard to argue changes just for RT's sake. This is only to
keep the correct locking context/primitives in one place and not
scattered around.
The only reason for the separation is that most users don't disable
interrupts (one user does) and there are a few places which already use
irqsave() because they can be called from both places. This
list_lru_walk() is the only which can't do so due to the callback it
invokes. I could also add a different function (say
list_lru_walk_one_irq()) which behaves like list_lru_walk_one() but does
spin_lock_irq() instead.

> Using local_irq_disable + spin_lock instead of
> spin_lock_irq is a typical pattern, and I don't see how changing this
> particular place would help RT.
It is not that typical. It is how the locking primitives work, yes, but
there are not so many places that do so and those that did got cleaned
up.

> > Also if the locking changes then the local_irq_disable() part will be
> > forgotten like you saw in 1/3 of this series.
> 
> If the locking changes, we'll have to revise all list_lru users anyway.
> Yeah, we missed it last time, but it didn't break anything, and it was
> finally found and fixed (by you, thanks BTW).
You are very welcome. But having the locking primitives in one place you
would have less things to worry about.

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-27  9:20         ` Sebastian Andrzej Siewior
@ 2018-06-28  9:30           ` Vladimir Davydov
  2018-07-02 22:38             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 43+ messages in thread
From: Vladimir Davydov @ 2018-06-28  9:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-mm, tglx, Andrew Morton

On Wed, Jun 27, 2018 at 11:20:59AM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-06-27 11:50:03 [+0300], Vladimir Davydov wrote:
> > > it is not asymmetric because a later patch makes it use
> > > spin_lock_irq(), too. If you use local_irq_disable() and a spin_lock()
> > > (like you suggest in 3/3 as well) then you separate the locking
> > > instruction. It works as expected on vanilla but breaks other locking
> > > implementations like those on RT.
> > 
> > As I said earlier, I don't like patch 3 either, because I find the
> > notion of list_lru::lock_irq flag abstruse since it doesn't make all
> > code paths taking the lock disable irq: list_lru_add/del use spin_lock
> > no matter whether the flag is set or not. That is, when you initialize a
> > list_lru and pass lock_irq=true, you'll have to keep in mind that it
> > only protects list_lru_walk, while list_lru_add/del must be called with
> > irq disabled by the caller. Disabling irq before list_lru_walk
> > explicitly looks much more straightforward IMO.
> 
> It helps to keep the locking annotation in one place. If it helps I
> could add the _irqsave() suffix to list_lru_add/del like it is already
> done in other places (in this file).

AFAIK local_irqsave/restore don't come for free so using them just to
keep the code clean doesn't seem to be reasonable.

> 
> > As for RT, it wouldn't need mm/workingset altogether AFAIU. 
> Why wouldn't it need it?

I may be wrong, but AFAIU RT kernel doesn't do swapping.

> 
> > Anyway, it's
> > rather unusual to care about out-of-the-tree patches when changing the
> > vanilla kernel code IMO. 
> The plan is not to stay out-of-tree forever. And I don't intend to make
> impossible or hard to argue changes just for RT's sake. This is only to
> keep the correct locking context/primitives in one place and not
> scattered around.
> The only reason for the separation is that most users don't disable
> interrupts (one user does) and there are a few places which already use
> irqsave() because they can be called from both places. This
> list_lru_walk() is the only which can't do so due to the callback it
> invokes. I could also add a different function (say
> list_lru_walk_one_irq()) which behaves like list_lru_walk_one() but does
> spin_lock_irq() instead.

That would look better IMHO. I mean, passing the flag as an argument to
__list_lru_walk_one and introducing list_lru_shrink_walk_irq.

> 
> > Using local_irq_disable + spin_lock instead of
> > spin_lock_irq is a typical pattern, and I don't see how changing this
> > particular place would help RT.
> It is not that typical. It is how the locking primitives work, yes, but
> there are not so many places that do so and those that did got cleaned
> up.

Missed that. There used to be a lot of places like that in the past.
I guess things have changed.

> 
> > > Also if the locking changes then the local_irq_disable() part will be
> > > forgotten like you saw in 1/3 of this series.
> > 
> > If the locking changes, we'll have to revise all list_lru users anyway.
> > Yeah, we missed it last time, but it didn't break anything, and it was
> > finally found and fixed (by you, thanks BTW).
> You are very welcome. But having the locking primitives in one place you
> would have less things to worry about.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix
  2018-06-28  9:30           ` Vladimir Davydov
@ 2018-07-02 22:38             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-02 22:38 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton

On 2018-06-28 12:30:57 [+0300], Vladimir Davydov wrote:
> > It helps to keep the locking annotation in one place. If it helps I
> > could add the _irqsave() suffix to list_lru_add/del like it is already
> > done in other places (in this file).
> 
> AFAIK local_irqsave/restore don't come for free so using them just to
> keep the code clean doesn't seem to be reasonable.

exactly. So I kept those two as is since there is no need for it.

> > > As for RT, it wouldn't need mm/workingset altogether AFAIU. 
> > Why wouldn't it need it?
> 
> I may be wrong, but AFAIU RT kernel doesn't do swapping.

Swapping the RT task out would be bad indeed, but this does not stop you
from using it. You can mlock() your RT application (well, you should,
because you don't want RO data or code dropped from memory just because
it is unchanged on disk) and everything else that is not essential (say
SCHED_OTHER) could then be swapped out if memory goes low.
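
For example (userspace sketch, nothing to do with the patch itself), an
RT application would typically pin its memory up front:

	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		/* keep current and future mappings resident so the RT part
		 * never takes a page fault or gets swapped out
		 */
		if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
			perror("mlockall");
			return 1;
		}
		/* ... create the SCHED_FIFO threads and do the real work ... */
		return 0;
	}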

> > invokes. I could also add a different function (say
> > list_lru_walk_one_irq()) which behaves like list_lru_walk_one() but does
> > spin_lock_irq() instead.
> 
> That would look better IMHO. I mean, passing the flag as an argument to
> __list_lru_walk_one and introducing list_lru_shrink_walk_irq.

You think so? I had this earlier and decided to go with what I posted.
But hey, I will post it later as suggested here and we will see how it
goes.
I just wrote this here to let akpm know that I will do as asked here
(since he Cc'd me in another thread on this topic; thank you, I will act
on it).

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* (no subject)
  2018-06-24 20:09   ` Vladimir Davydov
@ 2018-07-03 14:52     ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 1/4] mm/list_lru: use list_lru_walk_one() in list_lru_walk_node() Sebastian Andrzej Siewior
                         ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton

My interpretation of the situation is that Vladimir Davydov is fine with
patches #1 and #2 of the series [0] but dislikes the irq argument and
struct member. It has been suggested to use list_lru_shrink_walk_irq()
instead of the approach I went with in "mm: list_lru: Add lock_irq member
to __list_lru_init()".

This series is based on the former two patches and introduces
list_lru_shrink_walk_irq() (and makes the third patch of series
obsolete).
In patch 1-3 I tried a tiny cleanup so the different locking
(spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
function.

[0] The patch
      mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
   and
      mm: workingset: make shadow_lru_isolate() use locking suffix

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/4] mm/list_lru: use list_lru_walk_one() in list_lru_walk_node()
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
@ 2018-07-03 14:52       ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 2/4] mm/list_lru: Move locking from __list_lru_walk_one() to its caller Sebastian Andrzej Siewior
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton, Sebastian Andrzej Siewior

list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
memcg_idx parameter. The same can be achieved by using list_lru_walk_one()
and passing NULL as the memcg argument, which then gets converted into -1.
This is a preparation step for when the spin_lock() call is lifted to the
caller of __list_lru_walk_one().
Invoke list_lru_walk_one() instead of __list_lru_walk_one() where possible.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/list_lru.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index fcfb6c89ed47..ddbffbdd3d72 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -272,8 +272,8 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 	long isolated = 0;
 	int memcg_idx;
 
-	isolated += __list_lru_walk_one(lru, nid, -1, isolate, cb_arg,
-					nr_to_walk);
+	isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
+				      nr_to_walk);
 	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
 		for_each_memcg_cache_index(memcg_idx) {
 			isolated += __list_lru_walk_one(lru, nid, memcg_idx,
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/4] mm/list_lru: Move locking from __list_lru_walk_one() to its caller
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 1/4] mm/list_lru: use list_lru_walk_one() in list_lru_walk_node() Sebastian Andrzej Siewior
@ 2018-07-03 14:52       ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 3/4] mm/list_lru: Pass struct list_lru_node as an argument __list_lru_walk_one() Sebastian Andrzej Siewior
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton, Sebastian Andrzej Siewior

Move the locking inside __list_lru_walk_one() to its caller. This is a
preparation step in order to introduce list_lru_walk_one_irq() which
does spin_lock_irq() instead of spin_lock() for the locking.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/list_lru.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index ddbffbdd3d72..819e0595303e 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -204,7 +204,6 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
-	spin_lock(&nlru->lock);
 	l = list_lru_from_memcg_idx(nlru, memcg_idx);
 restart:
 	list_for_each_safe(item, n, &l->list) {
@@ -250,8 +249,6 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 			BUG();
 		}
 	}
-
-	spin_unlock(&nlru->lock);
 	return isolated;
 }
 
@@ -260,8 +257,14 @@ list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 		  list_lru_walk_cb isolate, void *cb_arg,
 		  unsigned long *nr_to_walk)
 {
-	return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg),
-				   isolate, cb_arg, nr_to_walk);
+	struct list_lru_node *nlru = &lru->node[nid];
+	unsigned long ret;
+
+	spin_lock(&nlru->lock);
+	ret = __list_lru_walk_one(lru, nid, memcg_cache_id(memcg),
+				  isolate, cb_arg, nr_to_walk);
+	spin_unlock(&nlru->lock);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_one);
 
@@ -276,8 +279,13 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				      nr_to_walk);
 	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
 		for_each_memcg_cache_index(memcg_idx) {
+			struct list_lru_node *nlru = &lru->node[nid];
+
+			spin_lock(&nlru->lock);
 			isolated += __list_lru_walk_one(lru, nid, memcg_idx,
 						isolate, cb_arg, nr_to_walk);
+			spin_unlock(&nlru->lock);
+
 			if (*nr_to_walk <= 0)
 				break;
 		}
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/4] mm/list_lru: Pass struct list_lru_node as an argument __list_lru_walk_one()
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 1/4] mm/list_lru: use list_lru_walk_one() in list_lru_walk_node() Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 2/4] mm/list_lru: Move locking from __list_lru_walk_one() to its caller Sebastian Andrzej Siewior
@ 2018-07-03 14:52       ` Sebastian Andrzej Siewior
  2018-07-03 14:52       ` [PATCH 4/4] mm/list_lru: Introduce list_lru_shrink_walk_irq() Sebastian Andrzej Siewior
  2018-07-03 21:14       ` Andrew Morton
  4 siblings, 0 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton, Sebastian Andrzej Siewior

__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
the first two arguments. Those two are only used to retrieve struct
list_lru_node. Since this is already done by the caller of the function
for the locking, we can pass struct list_lru_node directly and avoid the
dance around it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/list_lru.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 819e0595303e..4d7f981e6144 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -194,12 +194,11 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
 static unsigned long
-__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
+__list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
 		    list_lru_walk_cb isolate, void *cb_arg,
 		    unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
@@ -261,8 +260,8 @@ list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long ret;
 
 	spin_lock(&nlru->lock);
-	ret = __list_lru_walk_one(lru, nid, memcg_cache_id(memcg),
-				  isolate, cb_arg, nr_to_walk);
+	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+				  nr_to_walk);
 	spin_unlock(&nlru->lock);
 	return ret;
 }
@@ -282,8 +281,9 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 			struct list_lru_node *nlru = &lru->node[nid];
 
 			spin_lock(&nlru->lock);
-			isolated += __list_lru_walk_one(lru, nid, memcg_idx,
-						isolate, cb_arg, nr_to_walk);
+			isolated += __list_lru_walk_one(nlru, memcg_idx,
+							isolate, cb_arg,
+							nr_to_walk);
 			spin_unlock(&nlru->lock);
 
 			if (*nr_to_walk <= 0)
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/4] mm/list_lru: Introduce list_lru_shrink_walk_irq()
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
                         ` (2 preceding siblings ...)
  2018-07-03 14:52       ` [PATCH 3/4] mm/list_lru: Pass struct list_lru_node as an argument __list_lru_walk_one() Sebastian Andrzej Siewior
@ 2018-07-03 14:52       ` Sebastian Andrzej Siewior
  2018-07-03 21:14       ` Andrew Morton
  4 siblings, 0 replies; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 14:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: linux-mm, tglx, Andrew Morton, Sebastian Andrzej Siewior

Provide list_lru_shrink_walk_irq() and let it behave like
list_lru_walk_one() except that it locks the spinlock with
spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
nests within the IRQ-safe i_pages lock.
This change allows the use of proper locking primitives instead of the
hand-crafted local_irq_disable() plus spin_lock().
There is no EXPORT_SYMBOL provided because the current user is in-kernel
only.

Add list_lru_shrink_walk_irq() which acquires the spinlock with the
proper locking primitives.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/list_lru.h | 25 +++++++++++++++++++++++++
 mm/list_lru.c            | 15 +++++++++++++++
 mm/workingset.c          |  8 ++------
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 96def9d15b1b..798c41743657 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -162,6 +162,23 @@ unsigned long list_lru_walk_one(struct list_lru *lru,
 				int nid, struct mem_cgroup *memcg,
 				list_lru_walk_cb isolate, void *cb_arg,
 				unsigned long *nr_to_walk);
+/**
+ * list_lru_walk_one_irq: walk a list_lru, isolating and disposing freeable items.
+ * @lru: the lru pointer.
+ * @nid: the node id to scan from.
+ * @memcg: the cgroup to scan from.
+ * @isolate: callback function that is resposible for deciding what to do with
+ *  the item currently being scanned
+ * @cb_arg: opaque type that will be passed to @isolate
+ * @nr_to_walk: how many items to scan.
+ *
+ * Same as @list_lru_walk_one except that the spinlock is acquired with
+ * spin_lock_irq().
+ */
+unsigned long list_lru_walk_one_irq(struct list_lru *lru,
+				    int nid, struct mem_cgroup *memcg,
+				    list_lru_walk_cb isolate, void *cb_arg,
+				    unsigned long *nr_to_walk);
 unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 list_lru_walk_cb isolate, void *cb_arg,
 				 unsigned long *nr_to_walk);
@@ -174,6 +191,14 @@ list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
 				 &sc->nr_to_scan);
 }
 
+static inline unsigned long
+list_lru_shrink_walk_irq(struct list_lru *lru, struct shrink_control *sc,
+			 list_lru_walk_cb isolate, void *cb_arg)
+{
+	return list_lru_walk_one_irq(lru, sc->nid, sc->memcg, isolate, cb_arg,
+				     &sc->nr_to_scan);
+}
+
 static inline unsigned long
 list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	      void *cb_arg, unsigned long nr_to_walk)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 4d7f981e6144..1bf53a08cda8 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -267,6 +267,21 @@ list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_one);
 
+unsigned long
+list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
+		      list_lru_walk_cb isolate, void *cb_arg,
+		      unsigned long *nr_to_walk)
+{
+	struct list_lru_node *nlru = &lru->node[nid];
+	unsigned long ret;
+
+	spin_lock_irq(&nlru->lock);
+	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+				  nr_to_walk);
+	spin_unlock_irq(&nlru->lock);
+	return ret;
+}
+
 unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 list_lru_walk_cb isolate, void *cb_arg,
 				 unsigned long *nr_to_walk)
diff --git a/mm/workingset.c b/mm/workingset.c
index 529480c21f93..aa75c0027079 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -480,13 +480,9 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
 				       struct shrink_control *sc)
 {
-	unsigned long ret;
-
 	/* list_lru lock nests inside the IRQ-safe i_pages lock */
-	local_irq_disable();
-	ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL);
-	local_irq_enable();
-	return ret;
+	return list_lru_shrink_walk_irq(&shadow_nodes, sc, shadow_lru_isolate,
+					NULL);
 }
 
 static struct shrinker workingset_shadow_shrinker = {
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re:
  2018-07-03 14:52     ` Sebastian Andrzej Siewior
                         ` (3 preceding siblings ...)
  2018-07-03 14:52       ` [PATCH 4/4] mm/list_lru: Introduce list_lru_shrink_walk_irq() Sebastian Andrzej Siewior
@ 2018-07-03 21:14       ` Andrew Morton
  2018-07-03 21:44         ` Re: Sebastian Andrzej Siewior
  4 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2018-07-03 21:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Vladimir Davydov, linux-mm, tglx,
	Kirill Tkhai


> Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()

Well that's messed up.

On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> My interpretation of the situation is that Vladimir Davydov is fine with
> patches #1 and #2 of the series [0] but dislikes the irq argument and
> struct member. It has been suggested to use list_lru_shrink_walk_irq()
> instead of the approach I went with in "mm: list_lru: Add lock_irq member
> to __list_lru_init()".
> 
> This series is based on the former two patches and introduces
> list_lru_shrink_walk_irq() (and makes the third patch of series
> obsolete).
> In patch 1-3 I tried a tiny cleanup so the different locking
> (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> function.
> 
> [0] The patch
>       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
>    and
>       mm: workingset: make shadow_lru_isolate() use locking suffix
> 

This isn't a very informative [0/n] changelog.  Some overall summary of
the patchset's objective, behaviour, use cases, testing results, etc.

I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
scalability (old complexity was O(n^2), new is O(n))" series, which I
merged eight milliseconds ago.  Kirill's patchset is large but fairly
straightforward so I expect it's good for 4.18.  So I suggest we leave
things a week or more then please take a look at redoing this patchset
on top of that work?  

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2018-07-03 21:14       ` Andrew Morton
@ 2018-07-03 21:44         ` Sebastian Andrzej Siewior
  2018-07-04 14:44           ` Re: Vladimir Davydov
  0 siblings, 1 reply; 43+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-07-03 21:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vladimir Davydov, linux-mm, tglx, Kirill Tkhai

On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote:
> 
> > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()
> 
> Well that's messed up.

indeed it is. This should get into Subject:

> On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > My interpretation of the situation is that Vladimir Davydov is fine with
> > patches #1 and #2 of the series [0] but dislikes the irq argument and
> > struct member. It has been suggested to use list_lru_shrink_walk_irq()
> > instead of the approach I went with in "mm: list_lru: Add lock_irq member
> > to __list_lru_init()".
> > 
> > This series is based on the former two patches and introduces
> > list_lru_shrink_walk_irq() (and makes the third patch of series
> > obsolete).
> > In patch 1-3 I tried a tiny cleanup so the different locking
> > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> > function.
> > 
> > [0] The patch
> >       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
> >    and
> >       mm: workingset: make shadow_lru_isolate() use locking suffix
> > 
> 
> This isn't a very informative [0/n] changelog.  Some overall summary of
> the patchset's objective, behaviour, use cases, testing results, etc.

The patches should be threaded as a reply to 3/3 of the series so I
assumed it was enough. And while Vladimir complained about 2/3 and 3/3
the discussion went on in 2/3 where he suggested to go on with the _irq
function. And testing, well with and without RT the function was invoked
as part of swapping (allocating memory until OOM) without complaints.

> I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
> scalability (old complexity was O(n^2), new is O(n))" series, which I
> merged eight milliseconds ago.  Kirill's patchset is large but fairly
> straightforward so I expect it's good for 4.18.  So I suggest we leave
> things a week or more then please take a look at redoing this patchset
> on top of that work?  

If Vladimir is okay with the redo and nobody else complains then I could
rebase these four patches on top of your tree next week.

Sebastian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2018-07-03 21:44         ` Re: Sebastian Andrzej Siewior
@ 2018-07-04 14:44           ` Vladimir Davydov
  0 siblings, 0 replies; 43+ messages in thread
From: Vladimir Davydov @ 2018-07-04 14:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Andrew Morton, linux-mm, tglx, Kirill Tkhai

On Tue, Jul 03, 2018 at 11:44:29PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-07-03 14:14:29 [-0700], Andrew Morton wrote:
> > 
> > > Reply-To: "[PATCH 0/4] mm/list_lru": add.list_lru_shrink_walk_irq@mail.linuxfoundation.org.and.use.it ()
> > 
> > Well that's messed up.
> 
> Indeed it is. That text should have gone into the Subject:
> 
> > On Tue,  3 Jul 2018 16:52:31 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > 
> > > My interpretation of the situation is that Vladimir Davydov is fine with
> > > patches #1 and #2 of the series [0] but dislikes the irq argument and
> > > struct member. It has been suggested to use list_lru_shrink_walk_irq()
> > > instead of the approach I took in "mm: list_lru: Add lock_irq member to
> > > __list_lru_init()".
> > > 
> > > This series is based on the former two patches and introduces
> > > list_lru_shrink_walk_irq() (and makes the third patch of series
> > > obsolete).
> > > In patches 1-3 I tried a tiny cleanup so the different locking
> > > (spin_lock() vs spin_lock_irq()) is simply lifted to the caller of the
> > > function.
> > > 
> > > [0] The patch
> > >       mm: workingset: remove local_irq_disable() from count_shadow_nodes() 
> > >    and
> > >       mm: workingset: make shadow_lru_isolate() use locking suffix
> > > 
> > 
> > This isn't a very informative [0/n] changelog.  Some overall summary of
> > the patchset's objective, behaviour, use cases, testing results, etc.
> 
> The patches should be threaded as a reply to 3/3 of the series, so I
> assumed that was enough. And while Vladimir complained about 2/3 and 3/3,
> the discussion went on in 2/3, where he suggested going with the _irq
> function. As for testing: with and without RT, the function was invoked
> as part of swapping (allocating memory until OOM) without complaints.
> 
> > I'm seeing significant conflicts with Kirill's "Improve shrink_slab()
> > scalability (old complexity was O(n^2), new is O(n))" series, which I
> > merged eight milliseconds ago.  Kirill's patchset is large but fairly
> > straightforward so I expect it's good for 4.18.  So I suggest we leave
> > things a week or more then please take a look at redoing this patchset
> > on top of that work?  
> 
> If Vladimir is okay with the redo and nobody else complains, then I could
> rebase these four patches on top of your tree next week.

IMHO this approach is more straightforward than the one with the per
list_lru flag. For all patches,

Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-08  1:06                       ` Yin, Fengwei
  1 sibling, 0 replies; 43+ messages in thread
From: Yin, Fengwei @ 2024-03-08  1:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 10:05 PM, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM, Ryan Roberts wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
Removing a folio from the deferred_list in folio_put() also requires the
split_queue_lock. So my understanding is that there is no deleting without
holding the correct lock. The local list iteration is impacted, but that's
not the issue Ryan hit here.
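
(Roughly the shape of that rule, as a sketch only -- not the exact
mm/huge_memory.c code: any removal from the deferred list happens with
split_queue_lock held.)

	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
	unsigned long flags;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);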

> 
>> But I did see another strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like the following:
>>                  if (order > 0 && folio_test_large_rmappable(folio))
>>                          folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>          if (folio_test_large_rmappable(folio))
>>                  folio_undo_large_rmappable(folio);
>>
>> Can that be connected to the folio with a zero refcount still being on the
>> deferred list with Matthew's patch?
>>
>> Looks like the folio order was cleared unexpectedly somewhere.
> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().
Oh. That's the key. Thanks a lot for the detailed explanation.

I thought there was a bug in some other place, covered up by
destroy_large_folio() but exposed by free_unref_folios()...


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07 16:24                         ` Re: Ryan Roberts
@ 2024-03-07 23:02                           ` Matthew Wilcox
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2024-03-07 23:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin, Fengwei, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 04:24:43PM +0000, Ryan Roberts wrote:
> > But if I run only with the deferred split fix and DO NOT revert the other
> > change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> > stalls where I can't even interact on the serial port. Sometimes (more usually)
> > everything just gets stuck trying to reclaim and allocate memory. And when I
> > kill the jobs, I still have barely any memory in the system - about 10% of what I
> > would expect.

(for the benefit of anyone trying to follow along, this is now
understood; it was my missing folio_put() in the 'folio_trylock failed'
path)

> I notice that before the commit, large folios are uncharged with
> __mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().
> 
> The former has an upfront check:
> 
> 	if (!folio_memcg(folio))
> 		return;
> 
> I'm not exactly sure what that's checking, but could the fact that this is
> missing after the change cause things to go wonky?

Honestly, I think that's stale.  uncharge_folio() checks the same
thing very early on, so all it's actually saving is a test of the LRU
flag.

Looks like the need for it went away in 2017 with commit a9d5adeeb4b2c73c
which stopped using page->lru to gather the single page onto a
degenerate list.  I'll try to remember to submit a patch to delete
that check.

By the way, something we could try to see if the problem goes away is to
re-narrow the window that I widened, i.e. something like this:

+++ b/mm/swap.c
@@ -1012,6 +1012,8 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                        free_huge_folio(folio);
                        continue;
                }
+               if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+                       folio_undo_large_rmappable(folio);

                __page_cache_release(folio, &lruvec, &flags);




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07 15:24                       ` Re: Ryan Roberts
@ 2024-03-07 16:24                         ` Ryan Roberts
  2024-03-07 23:02                           ` Re: Matthew Wilcox
  0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-03-07 16:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 15:24, Ryan Roberts wrote:
> On 07/03/2024 14:05, Matthew Wilcox wrote:
>> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/7/2024 4:56 PM, Ryan Roberts wrote:
>>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>>> is not the last reference, and it keeps iterating through the local
>>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>>> probably goes into the weeds?
>>>
>>> My understanding is this can not corrupt the folio->deferred_list as
>>> this folio was iterated already.
>>
>> I am not convinced about that at all.  It's possible this isn't the only
>> problem, but deleting something from a list without holding (the correct)
>> lock is something you have to think incredibly hard about to get right.
>> I didn't bother going any deeper into the analysis once I spotted the
>> locking problem, but the proof is very much on you that this is not a bug!
>>
>>> But I did see another strange thing:
>>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>>> index:0xffffbd0a0 pfn:0x2554a0
>>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>>
>>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>>
>>> In free_unref_folios(), there is code like the following:
>>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>>                         folio_undo_large_rmappable(folio);
>>>
>>> But with destroy_large_folio():
>>>         if (folio_test_large_rmappable(folio))
>>>                 folio_undo_large_rmappable(folio);
>>>
>>> Can that be connected to the folio with a zero refcount still being on the
>>> deferred list with Matthew's patch?
>>>
>>> Looks like the folio order was cleared unexpectedly somewhere.
> 
> I think there could be something to this...
> 
> I have a setup where, when running with Matthew's deferred split fix AND with
> commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
> everything works as expected. And at the end, I have the expected amount of
> memory free (seen in meminfo and buddyinfo).
> 
> But if I run only with the deferred split fix and DO NOT revert the other
> change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> stalls where I can't even interact on the serial port. Sometimes (more usually)
> everything just gets stuck trying to reclaim and allocate memory. And when I
> kill the jobs, I still have barely any memory in the system - about 10% of what I
> would expect.
> 
> So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
> free_unref_folios()", when freeing a 2M folio back to the buddy, we are actually
> only telling it about the first 4K page? So we end up leaking the rest?

I notice that before the commit, large folios are uncharged with
__mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios().

The former has an upfront check:

	if (!folio_memcg(folio))
		return;

I'm not exactly sure what that's checking, but could the fact that this is
missing after the change cause things to go wonky?


> 
>>
>> No, we intentionally clear it:
>>
>> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
>> page[1].flags &= ~PAGE_FLAGS_SECOND;
>>
>> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
>> away in folio->private so that we know what it is in the second loop.
>> So it's always been cleared by the time we call free_page_is_bad().
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07 14:05                     ` Re: Matthew Wilcox
@ 2024-03-07 15:24                       ` Ryan Roberts
  2024-03-07 16:24                         ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  1 sibling, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-03-07 15:24 UTC (permalink / raw)
  To: Matthew Wilcox, Yin, Fengwei
  Cc: Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On 07/03/2024 14:05, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/7/2024 4:56 PM, Ryan Roberts wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is this can not corrupt the folio->deferred_list as
>> this folio was iterated already.
> 
> I am not convinced about that at all.  It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the proof is very much on you that this is not a bug!
> 
>> But I did see another strange thing:
>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>> index:0xffffbd0a0 pfn:0x2554a0
>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>
>> In free_unref_folios(), there is code like the following:
>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>                         folio_undo_large_rmappable(folio);
>>
>> But with destroy_large_folio():
>>         if (folio_test_large_rmappable(folio))
>>                 folio_undo_large_rmappable(folio);
>>
>> Can that be connected to the folio with a zero refcount still being on the
>> deferred list with Matthew's patch?
>>
>> Looks like the folio order was cleared unexpectedly somewhere.

I think there could be something to this...

I have a setup where, when running with Matthew's deferred split fix AND with
commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
everything works as expected. And at the end, I have the expected amount of
memory free (seen in meminfo and buddyinfo).

But if I run only with the deferred split fix and DO NOT revert the other
change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
stalls where I can't even interact on the serial port. Sometimes (more usually)
everything just gets stuck trying to reclaim and allocate memory. And when I
kill the jobs, I still have barely any memory in the system - about 10% what I
would expect.

So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
free_unref_folios()", when freeing 2M folio back to the buddy, we are actually
only telling it about the first 4K page? So we end up leaking the rest?

> 
> No, we intentionally clear it:
> 
> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
> page[1].flags &= ~PAGE_FLAGS_SECOND;
> 
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07 13:50                   ` Yin, Fengwei
@ 2024-03-07 14:05                     ` Matthew Wilcox
  2024-03-07 15:24                       ` Re: Ryan Roberts
  2024-03-08  1:06                       ` Re: Yin, Fengwei
  0 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2024-03-07 14:05 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Ryan Roberts, Zi Yan, Andrew Morton, linux-mm, Yang Shi, Huang Ying

On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
> 
> 
> On 3/7/2024 4:56 PM, Ryan Roberts wrote:
> > I just want to make sure I've understood correctly: CPU1's folio_put()
> > is not the last reference, and it keeps iterating through the local
> > list. Then CPU2 does the final folio_put() which causes list_del_init()
> > to modify the local list concurrently with CPU1's iteration, so CPU1
> > probably goes into the weeds?
> 
> My understanding is this can not corrupt the folio->deferred_list as
> this folio was iterated already.

I am not convinced about that at all.  It's possible this isn't the only
problem, but deleting something from a list without holding (the correct)
lock is something you have to think incredibly hard about to get right.
I didn't bother going any deeper into the analysis once I spotted the
locking problem, but the proof is very much on you that this is not a bug!

> But I did see another strange thing:
> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
> index:0xffffbd0a0 pfn:0x2554a0
> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>
> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>
> In free_unref_folios(), there is code like the following:
>                 if (order > 0 && folio_test_large_rmappable(folio))
>                         folio_undo_large_rmappable(folio);
>
> But with destroy_large_folio():
>         if (folio_test_large_rmappable(folio))
>                 folio_undo_large_rmappable(folio);
>
> Can that be connected to the folio with a zero refcount still being on the
> deferred list with Matthew's patch?
>
> Looks like the folio order was cleared unexpectedly somewhere.

No, we intentionally clear it:

free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
page[1].flags &= ~PAGE_FLAGS_SECOND;

PAGE_FLAGS_SECOND includes the order, which is why we have to save it
away in folio->private so that we know what it is in the second loop.
So it's always been cleared by the time we call free_page_is_bad().
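
For anyone following along, the pattern is roughly this (an illustration
of the idea, not the literal mm/page_alloc.c code):

	/* first loop: stash the order before free_pages_prepare() clears
	 * page[1].flags (PAGE_FLAGS_SECOND, which encodes the order) */
	folio->private = (void *)(unsigned long)folio_order(folio);

	/* ... free_unref_page_prepare() / free_pages_prepare() ... */

	/* second loop: the flags no longer hold the order, so read it
	 * back out of folio->private */
	unsigned int order = (unsigned long)folio->private;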


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-03-07  8:56                 ` Ryan Roberts
@ 2024-03-07 13:50                   ` Yin, Fengwei
  2024-03-07 14:05                     ` Re: Matthew Wilcox
  0 siblings, 1 reply; 43+ messages in thread
From: Yin, Fengwei @ 2024-03-07 13:50 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox, Zi Yan
  Cc: Andrew Morton, linux-mm, Yang Shi, Huang Ying



On 3/7/2024 4:56 PM, Ryan Roberts wrote:
> I just want to make sure I've understood correctly: CPU1's folio_put()
> is not the last reference, and it keeps iterating through the local
> list. Then CPU2 does the final folio_put() which causes list_del_init()
> to modify the local list concurrently with CPU1's iteration, so CPU1
> probably goes into the weeds?

My understanding is this can not corrupt the folio->deferred_list as
this folio was iterated already.


But I did see another strange thing:
[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
index:0xffffbd0a0 pfn:0x2554a0
[   76.270483] note: kcompactd0[62] exited with preempt_count 1
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0

This large folio has order 0? Maybe folio->_flags_1 was screwed?

In free_unref_folios(), there is code like the following:
                 if (order > 0 && folio_test_large_rmappable(folio))
                         folio_undo_large_rmappable(folio);

But with destroy_large_folio():
         if (folio_test_large_rmappable(folio))
                 folio_undo_large_rmappable(folio);

Can that be connected to the folio with a zero refcount still being on the
deferred list with Matthew's patch?

Looks like the folio order was cleared unexpectedly somewhere.

Regards
Yin, Fengwei



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2024-01-22 10:13 ` Andi Kleen
@ 2024-01-22 11:53   ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2024-01-22 11:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-xfs, linux-mm

On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Thoughts, comments, etc?
> 
> The interesting part is if it will cause additional tail latencies
> allocating under fragmentation with direct reclaim, compaction
> etc. being triggered before it falls back to the base page path.

It's not like I don't know these problems exist with memory
allocation. Go have a look at xlog_kvmalloc(), which is an open-coded
kvmalloc() that allows the high-order kmalloc allocations to
fail fast without triggering all the expensive and unnecessary
direct reclaim overhead (e.g. compaction!) because we can fall back
to vmalloc without huge concerns. When high-order allocations start
to fail, we fall back to vmalloc and then hit the long-standing
vmalloc scalability problems before anything else in XFS or
the IO path becomes a bottleneck.
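
(Sketch of the fail-fast idea only, not the actual xlog_kvmalloc()
implementation -- try the high-order allocation without letting the
allocator retry hard or warn, then fall back to vmalloc:)

	static void *fail_fast_kvmalloc(size_t size)
	{
		/* give up quickly and stay quiet if the high-order
		 * allocation can't be satisfied cheaply */
		void *p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY |
					__GFP_NOWARN);

		return p ? p : vmalloc(size);
	}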

IOWs, we already know that fail-fast high-order allocation is a more
efficient and effective fast path than using vmalloc/vm_map_ram() all
the time. As this is an RFC, I haven't implemented stuff like this
yet - I haven't seen anything in the profiles indicating that high
order folio allocation is failing and causing lots of reclaim
overhead, so I simply haven't added fail-fast behaviour yet...

> In fact it is highly likely it will, the question is just how bad it is.
> 
> Unfortunately benchmarking for that isn't that easy, it needs artificial
> memory fragmentation and then some high stress workload, and then
> instrumenting the transactions for individual latencies. 

I stress test and measure XFS metadata performance under sustained
memory pressure all the time. This change has not caused any
obvious regressions in the short time I've been testing it.

I still need to do perf testing on large directory block sizes. That
is where high-order allocations will get stressed - that's where
xlog_kvmalloc() starts dominating the profiles as it trips over
vmalloc scalability issues...

> I would in any case add a tunable for it in case people run into this.

No tunables. It either works or it doesn't. If we can't make
it work reliably by default, we throw it in the dumpster, light it
on fire and walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata,
not so much - there's so many far worse long tail latencies in
metadata operations (like waiting for journal space) that memory
allocation latencies in the metadata IO path are largely noise....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2023-05-11 12:58 Ryan Roberts
@ 2023-05-11 13:13 ` Ryan Roberts
  0 siblings, 0 replies; 43+ messages in thread
From: Ryan Roberts @ 2023-05-11 13:13 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, SeongJae Park
  Cc: linux-kernel, linux-mm, damon

My apologies for the noise: a blank line between Cc and Subject has broken the
subject and grouping in lore.

Please Ignore this, I will resend.


On 11/05/2023 13:58, Ryan Roberts wrote:
> Date: Thu, 11 May 2023 11:38:28 +0100
> Subject: [PATCH v1 0/5] Encapsulate PTE contents from non-arch code
> 
> Hi All,
> 
> This series improves the encapsulation of pte entries by disallowing non-arch
> code from directly dereferencing pte_t pointers. Instead code must use a new
> helper, `pte_t ptep_deref(pte_t *ptep)`. By default, this helper does a direct
> dereference of the pointer, so generated code should be exactly the same. But
> its presence sets us up for arch code being able to override the default to
> "virtualize" the ptes without needing to maintain a shadow table.
> 
> I intend to take advantage of this for arm64 to enable use of its "contiguous
> bit" to coalesce multiple ptes into a single tlb entry, reducing pressure and
> improving performance. I have an RFC for the first part of this work at [1]. The
> cover letter there also explains the second part, which this series is enabling.
> 
> I intend to post an RFC for the contpte changes in due course, but it would be
> good to get the ball rolling on this enabler.
> 
> There are 2 reasons that I need the encapsulation:
> 
>   - Prevent leaking the arch-private PTE_CONT bit to the core code. If the core
>     code reads a pte that contains this bit, it could end up calling
>     set_pte_at() with the bit set which would confuse the implementation. So we
>     can always clear PTE_CONT in ptep_deref() (and ptep_get()) to avoid a leaky
>     abstraction.
>   - Contiguous ptes have a single access and dirty bit for the contiguous range.
>     So we need to "mix-in" those bits when the core is dereferencing a pte that
>     lies in the contig range. There is code that dereferences the pte then takes
>     different actions based on access/dirty (see e.g. write_protect_page()).
> 
> While ptep_get() and ptep_get_lockless() already exist, both of them are
> implemented using READ_ONCE() by default. While we could use ptep_get() instead
> of the new ptep_deref(), I didn't want to risk performance regression.
> Alternatively, all call sites that currently use ptep_get() that need the
> lockless behaviour could be upgraded to ptep_get_lockless() and ptep_get() could
> be downgraded to a simple dereference. That would be cleanest, but is a much
> bigger (and likely error prone) change because all the arch code would need to
> be updated for the new definitions of ptep_get().
> 
> The series is split up as follows:
> 
> patches 1-2: Fix bugs where code was _setting_ ptes directly, rather than using
>             set_pte_at() and friends.
> patch 3:    Fix highmem unmapping issue I spotted while doing the work.
> patch 4:    Introduce the new ptep_deref() helper with default implementation.
> patch 5:    Convert all direct dereferences to use ptep_deref().
> 
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan
> 
> 
> Ryan Roberts (5):
>   mm: vmalloc must set pte via arch code
>   mm: damon must atomically clear young on ptes and pmds
>   mm: Fix failure to unmap pte on highmem systems
>   mm: Add new ptep_deref() helper to fully encapsulate pte_t
>   mm: ptep_deref() conversion
> 
>  .../drm/i915/gem/selftests/i915_gem_mman.c    |   8 +-
>  drivers/misc/sgi-gru/grufault.c               |   2 +-
>  drivers/vfio/vfio_iommu_type1.c               |   7 +-
>  drivers/xen/privcmd.c                         |   2 +-
>  fs/proc/task_mmu.c                            |  33 +++---
>  fs/userfaultfd.c                              |   6 +-
>  include/linux/hugetlb.h                       |   2 +-
>  include/linux/mm_inline.h                     |   2 +-
>  include/linux/pgtable.h                       |  13 ++-
>  kernel/events/uprobes.c                       |   2 +-
>  mm/damon/ops-common.c                         |  18 ++-
>  mm/damon/ops-common.h                         |   4 +-
>  mm/damon/paddr.c                              |   6 +-
>  mm/damon/vaddr.c                              |  14 ++-
>  mm/filemap.c                                  |   2 +-
>  mm/gup.c                                      |  21 ++--
>  mm/highmem.c                                  |  12 +-
>  mm/hmm.c                                      |   2 +-
>  mm/huge_memory.c                              |   4 +-
>  mm/hugetlb.c                                  |   2 +-
>  mm/hugetlb_vmemmap.c                          |   6 +-
>  mm/kasan/init.c                               |   9 +-
>  mm/kasan/shadow.c                             |  10 +-
>  mm/khugepaged.c                               |  24 ++--
>  mm/ksm.c                                      |  22 ++--
>  mm/madvise.c                                  |   6 +-
>  mm/mapping_dirty_helpers.c                    |   4 +-
>  mm/memcontrol.c                               |   4 +-
>  mm/memory-failure.c                           |   6 +-
>  mm/memory.c                                   | 103 +++++++++---------
>  mm/mempolicy.c                                |   6 +-
>  mm/migrate.c                                  |  14 ++-
>  mm/migrate_device.c                           |  14 ++-
>  mm/mincore.c                                  |   2 +-
>  mm/mlock.c                                    |   6 +-
>  mm/mprotect.c                                 |   8 +-
>  mm/mremap.c                                   |   2 +-
>  mm/page_table_check.c                         |   4 +-
>  mm/page_vma_mapped.c                          |  26 +++--
>  mm/pgtable-generic.c                          |   2 +-
>  mm/rmap.c                                     |  32 +++---
>  mm/sparse-vmemmap.c                           |   8 +-
>  mm/swap_state.c                               |   4 +-
>  mm/swapfile.c                                 |  16 +--
>  mm/userfaultfd.c                              |   4 +-
>  mm/vmalloc.c                                  |  11 +-
>  mm/vmscan.c                                   |  14 ++-
>  virt/kvm/kvm_main.c                           |   9 +-
>  48 files changed, 302 insertions(+), 236 deletions(-)
> 
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2022-08-31 21:47 ` Yang Shi
@ 2022-09-01  0:24   ` Zach O'Keefe
  0 siblings, 0 replies; 43+ messages in thread
From: Zach O'Keefe @ 2022-09-01  0:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Aug 31, 2022 at 2:47 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Zach,
>
> I took a quick look at the series; basically no show stopper to me. But
> I haven't found time to review them thoroughly yet, as I'm quite busy with
> something else. Just a heads up, I didn't mean to ignore you. I will
> review them when I find some time.
>
> Thanks,
> Yang

Hey Yang,

Thanks for taking the time to look through, and thanks for giving me a
heads up, and no rush!

In the last day or so, while porting this series around, I encountered
some subtle edge cases I wanted to clean up / address - so it's good
you didn't do a thorough review yet. I was *hoping* to have a v3 out
last night (which evidently did not happen) and it does not seem like
it will happen today, so I'll leave this message as a request for
reviewers to hold off on a thorough review until v3.

Thanks for your time as always,
Zach


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2022-08-26 22:03 Zach O'Keefe
@ 2022-08-31 21:47 ` Yang Shi
  2022-09-01  0:24   ` Re: Zach O'Keefe
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2022-08-31 21:47 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

Hi Zach,

I took a quick look at the series; basically no show stopper to me. But
I haven't found time to review them thoroughly yet, as I'm quite busy with
something else. Just a heads up, I didn't mean to ignore you. I will
review them when I find some time.

Thanks,
Yang

On Fri, Aug 26, 2022 at 3:03 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Subject: [PATCH mm-unstable v2 0/9] mm: add file/shmem support to MADV_COLLAPSE
>
> v2 Forward
>
> Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from
> kernel test robot.
> --------------------------------
>
> This series builds on top of the previous "mm: userspace hugepage collapse"
> series which introduced the MADV_COLLAPSE madvise mode and added support
> for private, anonymous mappings[1], by adding support for file and shmem
> backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
>
> File and shmem support have been added with effort to align with existing
> MADV_COLLAPSE semantics and policy decisions[2].  Collapse of shmem-backed
> memory ignores kernel-guiding directives and heuristics including all
> sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
> options (shmem always supports large folios).  Like anonymous mappings, on
> successful return of MADV_COLLAPSE on file/shmem memory, the contents of
> memory mapped by the addresses provided will be synchronously pmd-mapped
> THPs.
>
> This functionality unlocks two important uses:
>
> (1)     Immediately back executable text by THPs.  Current support provided
>         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
>         system which might impair services from serving at their full rated
>         load after (re)starting.  Tricks like mremap(2)'ing text onto
>         anonymous memory to immediately realize iTLB performance prevents
>         page sharing and demand paging, both of which increase steady state
>         memory footprint.  Now, we can have the best of both worlds: Peak
>         upfront performance and lower RAM footprints.
>
> (2)     userfaultfd-based live migration of virtual machines satisfies UFFD
>         faults by fetching native-sized pages over the network (to avoid
>         latency of transferring an entire hugepage).  However, after guest
>         memory has been fully copied to the new host, MADV_COLLAPSE can
>         be used to immediately increase guest performance.
>
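> (Illustration only -- a minimal sketch of how userspace would drive this
> for case (1); the mapping itself and the error handling here are assumed,
> not part of this series:)
> 
> 	#include <stdio.h>
> 	#include <sys/mman.h>
> 
> 	/* addr/len: an existing, eligible (e.g. executable text) mapping
> 	 * that should be pmd-mapped right now rather than waiting for
> 	 * khugepaged */
> 	static void collapse_now(void *addr, size_t len)
> 	{
> 		if (madvise(addr, len, MADV_COLLAPSE))
> 			perror("madvise(MADV_COLLAPSE)");
> 	}
> 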
> khugepaged has received a small improvement by association and can now
> detect and collapse pte-mapped THPs.  However, there is still work to be
> done along the file collapse path.  Compound pages of arbitrary order still
> needs to be supported and THP collapse needs to be converted to using
> folios in general.  Eventually, we'd like to move away from the read-only
> and executable-mapped constraints currently imposed on eligible files and
> support any inode claiming huge folio support.  That said, I think the
> series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem
> memory.
>
> Patches 1-3     Implement the guts of the series.
> Patch 4         Is a tracepoint for debugging.
> Patches 5-8     Refactor existing khugepaged selftests to work with new
>                 memory types.
> Patch 9         Adds a userfaultfd selftest mode to mimic a functional test
>                 of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
>
> Applies against mm-unstable.
>
> [1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
> [2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
>
> v1 -> v2:
> - Add missing definition for khugepaged_add_pte_mapped_thp() in
>   !CONFIG_SHEM builds, in "mm/khugepaged: attempt to map
>   file/shmem-backed pte-mapped THPs by pmds"
> - Minor bugfixes in "mm/madvise: add file and shmem support to
>   MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some
>   compiler settings.
> - Rebased on latest mm-unstable
>
> Zach O'Keefe (9):
>   mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
>   mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by
>     pmds
>   mm/madvise: add file and shmem support to MADV_COLLAPSE
>   mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
>   selftests/vm: dedup THP helpers
>   selftests/vm: modularize thp collapse memory operations
>   selftests/vm: add thp collapse file and tmpfs testing
>   selftests/vm: add thp collapse shmem testing
>   selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
>
>  include/linux/khugepaged.h                    |  13 +-
>  include/linux/shmem_fs.h                      |  10 +-
>  include/trace/events/huge_memory.h            |  36 +
>  kernel/events/uprobes.c                       |   2 +-
>  mm/huge_memory.c                              |   2 +-
>  mm/khugepaged.c                               | 289 ++++--
>  mm/shmem.c                                    |  18 +-
>  tools/testing/selftests/vm/Makefile           |   2 +
>  tools/testing/selftests/vm/khugepaged.c       | 828 ++++++++++++------
>  tools/testing/selftests/vm/soft-dirty.c       |   2 +-
>  .../selftests/vm/split_huge_page_test.c       |  12 +-
>  tools/testing/selftests/vm/userfaultfd.c      | 171 +++-
>  tools/testing/selftests/vm/vm_util.c          |  36 +-
>  tools/testing/selftests/vm/vm_util.h          |   5 +-
>  14 files changed, 1040 insertions(+), 386 deletions(-)
>
> --
> 2.37.2.672.g94769d06f0-goog
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2022-04-21 23:17   ` Re: Yury Norov
@ 2022-04-21 23:21     ` John Hubbard
  0 siblings, 0 replies; 43+ messages in thread
From: John Hubbard @ 2022-04-21 23:21 UTC (permalink / raw)
  To: Yury Norov; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 16:17, Yury Norov wrote:
>> Let's add a Cc: line for Michan as well:
>>
>> Cc: Minchan Kim <minchan@kernel.org>
>   
> He's in CC already, I think...
>   

Here, I am talking about attribution in the commit log, as opposed
to the email Cc. In other words, I'm suggesting that you literally
add this line to the commit description:

Cc: Minchan Kim <minchan@kernel.org>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2022-04-21 23:04 ` Re: John Hubbard
  2022-04-21 23:09   ` Re: John Hubbard
@ 2022-04-21 23:17   ` Yury Norov
  2022-04-21 23:21     ` Re: John Hubbard
  1 sibling, 1 reply; 43+ messages in thread
From: Yury Norov @ 2022-04-21 23:17 UTC (permalink / raw)
  To: John Hubbard; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On Thu, Apr 21, 2022 at 04:04:44PM -0700, John Hubbard wrote:
> On 4/21/22 09:41, Yury Norov wrote:
> > Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
> > 
> 
> Hi Yuri,
> 
> Thanks for picking this up. I have been distracted and didn't trust
> myself to focus on this properly, so it's good to have help!
> 
> IT/admin point: somehow the first line of the commit description didn't
> make it into an actual email subject. The subject line was blank when it
> arrived in my inbox, and the subject is in the body here instead. Not
> sure how that happened.
> 
> Maybe check your git-sendemail setup?
 
git-sendemail is OK. I just accidentally added an empty line above the Subject,
which broke the format. My bad, sorry for this.
 
> > pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> > API requires struct page **pages to be provided (not NULL). However,
> > the comment to pin_user_pages() says:
> > 
> >      * @pages:      array that receives pointers to the pages pinned.
> >      *              Should be at least nr_pages long. Or NULL, if caller
> >      *              only intends to ensure the pages are faulted in.
> > 
> > This patch fixes comments along the pin_user_pages code, and also adds
> > WARN_ON(!pages), so that API users will have better understanding
> > on how to use it.
> 
> No need to quote the code in the commit log. Instead, just summarize.
> For example:
> 
> pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> API requires struct page **pages to be provided (not NULL). However, the
> comment to pin_user_pages() clearly allows for passing in a NULL @pages
> argument.
> 
> Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
> enforce the API.
> 
> > 
> > It has been independently spotted by Minchan Kim and confirmed with
> > John Hubbard:
> > 
> > https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/
> 
> Let's add a Cc: line for Michan as well:
> 
> Cc: Minchan Kim <minchan@kernel.org>
 
He's in CC already, I think...
 
> > Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
> > ---
> >   mm/gup.c | 26 ++++++++++++++++++++++----
> >   1 file changed, 22 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f598a037eb04..559626457585 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> 
> Please delete each and every one of these one-line comments, because
> they merely echo what the code says.

Sure.
 
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> >   }
> > @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages,
> >   	 */
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return 0;
> > +
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return 0;
> >   	/*
> >   	 * FOLL_FAST_ONLY is required in order to match the API description of
> >   	 * this routine: no fall back to regular ("slow") GUP.
> > @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
> >    * @nr_pages:	number of pages from start to pin
> >    * @gup_flags:	flags modifying lookup behaviour
> >    * @pages:	array that receives pointers to the pages pinned.
> > - *		Should be at least nr_pages long. Or NULL, if caller
> > - *		only intends to ensure the pages are faulted in.
> > + *		Should be at least nr_pages long.
> >    * @vmas:	array of pointers to vmas corresponding to each page.
> >    *		Or NULL if the caller does not require them.
> >    * @locked:	pointer to lock flag indicating whether lock is held and
> > @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
> >   				       pages, vmas, locked);
> > @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote);
> >    * @nr_pages:	number of pages from start to pin
> >    * @gup_flags:	flags modifying lookup behaviour
> >    * @pages:	array that receives pointers to the pages pinned.
> > - *		Should be at least nr_pages long. Or NULL, if caller
> > - *		only intends to ensure the pages are faulted in.
> > + *		Should be at least nr_pages long.
> >    * @vmas:	array of pointers to vmas corresponding to each page.
> >    *		Or NULL if the caller does not require them.
> >    *
> > @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return __gup_longterm_locked(current->mm, start, nr_pages,
> >   				     pages, vmas, gup_flags);
> > @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> >   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >   		return -EINVAL;
> > +	/* FOLL_PIN requires pages != NULL */
> > +	if (WARN_ON_ONCE(!pages))
> > +		return -EINVAL;
> > +
> >   	gup_flags |= FOLL_PIN;
> >   	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
> >   }
> 
> I hope we don't break any callers with the newly enforced !pages, but it's
> the right thing to do, in order to avoid misunderstandings.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

Let me test v2 and resend shortly.

Thanks,
Yury


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2022-04-21 23:04 ` Re: John Hubbard
@ 2022-04-21 23:09   ` John Hubbard
  2022-04-21 23:17   ` Re: Yury Norov
  1 sibling, 0 replies; 43+ messages in thread
From: John Hubbard @ 2022-04-21 23:09 UTC (permalink / raw)
  To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 16:04, John Hubbard wrote:
> On 4/21/22 09:41, Yury Norov wrote:
>> Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
>>
> 
> Hi Yuri,

...and I see that I have typo'd both Yury's and Minchan's names (further
down), in the same email!

I really apologize for screwing that up. It's Yury-with-a-"y", I know. :)


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
       [not found] <20220421164138.1250943-1-yury.norov@gmail.com>
@ 2022-04-21 23:04 ` John Hubbard
  2022-04-21 23:09   ` Re: John Hubbard
  2022-04-21 23:17   ` Re: Yury Norov
  0 siblings, 2 replies; 43+ messages in thread
From: John Hubbard @ 2022-04-21 23:04 UTC (permalink / raw)
  To: Yury Norov, Andrew Morton, Minchan Kim, linux-mm, linux-kernel

On 4/21/22 09:41, Yury Norov wrote:
> Subject: [PATCH] mm/gup: fix comments to pin_user_pages_*()
> 

Hi Yuri,

Thanks for picking this up. I have been distracted and didn't trust
myself to focus on this properly, so it's good to have help!

IT/admin point: somehow the first line of the commit description didn't
make it into an actual email subject. The subject line was blank when it
arrived in my inbox, and the subject is in the body here instead. Not
sure how that happened.

Maybe check your git-sendemail setup?


> pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
> API requires struct page **pages to be provided (not NULL). However,
> the comment to pin_user_pages() says:
> 
>      * @pages:      array that receives pointers to the pages pinned.
>      *              Should be at least nr_pages long. Or NULL, if caller
>      *              only intends to ensure the pages are faulted in.
> 
> This patch fixes comments along the pin_user_pages code, and also adds
> WARN_ON(!pages), so that API users will have better understanding
> on how to use it.

No need to quote the code in the commit log. Instead, just summarize.
For example:

pin_user_pages API forces FOLL_PIN in gup_flags, which means that the
API requires struct page **pages to be provided (not NULL). However, the
comment to pin_user_pages() clearly allows for passing in a NULL @pages
argument.

Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
enforce the API.

> 
> It has been independently spotted by Minchan Kim and confirmed with
> John Hubbard:
> 
> https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/

Let's add a Cc: line for Michan as well:

Cc: Minchan Kim <minchan@kernel.org>

> 
> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
> ---
>   mm/gup.c | 26 ++++++++++++++++++++++----
>   1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f598a037eb04..559626457585 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2871,6 +2871,10 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */

Please delete each and every one of these one-line comments, because
they merely echo what the code says.

> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
>   }
> @@ -2893,6 +2897,10 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages,
>   	 */
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return 0;
> +
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return 0;
>   	/*
>   	 * FOLL_FAST_ONLY is required in order to match the API description of
>   	 * this routine: no fall back to regular ("slow") GUP.
> @@ -2920,8 +2928,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
>    * @nr_pages:	number of pages from start to pin
>    * @gup_flags:	flags modifying lookup behaviour
>    * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long. Or NULL, if caller
> - *		only intends to ensure the pages are faulted in.
> + *		Should be at least nr_pages long.
>    * @vmas:	array of pointers to vmas corresponding to each page.
>    *		Or NULL if the caller does not require them.
>    * @locked:	pointer to lock flag indicating whether lock is held and
> @@ -2944,6 +2951,10 @@ long pin_user_pages_remote(struct mm_struct *mm,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
>   				       pages, vmas, locked);
> @@ -2957,8 +2968,7 @@ EXPORT_SYMBOL(pin_user_pages_remote);
>    * @nr_pages:	number of pages from start to pin
>    * @gup_flags:	flags modifying lookup behaviour
>    * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long. Or NULL, if caller
> - *		only intends to ensure the pages are faulted in.
> + *		Should be at least nr_pages long.
>    * @vmas:	array of pointers to vmas corresponding to each page.
>    *		Or NULL if the caller does not require them.
>    *
> @@ -2976,6 +2986,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return __gup_longterm_locked(current->mm, start, nr_pages,
>   				     pages, vmas, gup_flags);
> @@ -2994,6 +3008,10 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>   	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>   		return -EINVAL;
>   
> +	/* FOLL_PIN requires pages != NULL */
> +	if (WARN_ON_ONCE(!pages))
> +		return -EINVAL;
> +
>   	gup_flags |= FOLL_PIN;
>   	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
>   }

I hope we don't break any callers with the newly enforced !pages, but it's
the right thing to do, in order to avoid misunderstandings.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2021-08-12 20:19   ` Re: Andrew Morton
@ 2021-08-13  8:14     ` SeongJae Park
  0 siblings, 0 replies; 43+ messages in thread
From: SeongJae Park @ 2021-08-13  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park,  Valdis Klētnieks ,
	SeongJae Park, linux-mm, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

On Thu, 12 Aug 2021 13:19:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote:
> 
> > > +config PAGE_IDLE_FLAG
> > > +       bool "Add PG_idle and PG_young flags"
> > > +       help
> > > +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > > +         Accessed bit writers can set the state of the bit in the flags to let
> > > +         other PTE Accessed bit readers don't disturbed.
> > > 
> > > This needs to be converted to proper, or at least comprehensible, English....
> > 
> > Thank you for the comment.
> > 
> > How about below?
> > 
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
> >         bool "Add PG_idle and PG_young flags"
> >         select PAGE_EXTENSION if !64BIT
> >         help
> > -         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > -         Accessed bit writers can set the state of the bit in the flags to let
> > -         other PTE Accessed bit readers don't disturbed.
> > +         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
> > +         PTE Accessed bit writers can save the state of the bit in the flags
> > +         to let other PTE Accessed bit readers don't get disturbed.
> 
> How about this?
> 
> --- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix
> +++ a/mm/Kconfig
> @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
>  	bool "Add PG_idle and PG_young flags"
>  	select PAGE_EXTENSION if !64BIT
>  	help
> -	  This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> -	  Accessed bit writers can set the state of the bit in the flags to let
> -	  other PTE Accessed bit readers don't disturbed.
> +	  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
> +	  bit writers can set the state of the bit in the flags so that PTE
> +	  Accessed bit readers may avoid disturbance.
>  
>  config IDLE_PAGE_TRACKING
>  	bool "Enable idle page tracking"

Sounds good, thank you!

> 
> Also, is there any way in which we can avoid presenting this option to
> the user?  Because most users will have real trouble understanding what
> this thing is for.  Can we simply select it when needed, as dictated by
> other, higher-level config options?

I believe this is the right way to go!  I sent a patch for removing the prompt
of this option:
https://lore.kernel.org/linux-mm/20210813081238.34705-1-sj38.park@gmail.com/


Thanks,
SeongJae Park


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2021-08-12  9:42 ` SeongJae Park
@ 2021-08-12 20:19   ` Andrew Morton
  2021-08-13  8:14     ` Re: SeongJae Park
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2021-08-12 20:19 UTC (permalink / raw)
  To: SeongJae Park
  Cc:  Valdis Klētnieks , SeongJae Park, linux-mm, linux-kernel

On Thu, 12 Aug 2021 09:42:40 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> > +config PAGE_IDLE_FLAG
> > +       bool "Add PG_idle and PG_young flags"
> > +       help
> > +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> > +         Accessed bit writers can set the state of the bit in the flags to let
> > +         other PTE Accessed bit readers don't disturbed.
> > 
> > This needs to be converted to proper, or at least comprehensible, English....
> 
> Thank you for the comment.
> 
> How about below?
> 
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
>         bool "Add PG_idle and PG_young flags"
>         select PAGE_EXTENSION if !64BIT
>         help
> -         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> -         Accessed bit writers can set the state of the bit in the flags to let
> -         other PTE Accessed bit readers don't disturbed.
> +         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
> +         PTE Accessed bit writers can save the state of the bit in the flags
> +         to let other PTE Accessed bit readers don't get disturbed.

How about this?

--- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable-fix-fix
+++ a/mm/Kconfig
@@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
 	bool "Add PG_idle and PG_young flags"
 	select PAGE_EXTENSION if !64BIT
 	help
-	  This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
-	  Accessed bit writers can set the state of the bit in the flags to let
-	  other PTE Accessed bit readers don't disturbed.
+	  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
+	  bit writers can set the state of the bit in the flags so that PTE
+	  Accessed bit readers may avoid disturbance.
 
 config IDLE_PAGE_TRACKING
 	bool "Enable idle page tracking"

Also, is there any way in which we can avoid presenting this option to
the user?  Because most users will have real trouble understanding what
this thing is for.  Can we simply select it when needed, as dictated by
other, higher-level config options?



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2021-08-12  9:21 Valdis Klētnieks
@ 2021-08-12  9:42 ` SeongJae Park
  2021-08-12 20:19   ` Re: Andrew Morton
  0 siblings, 1 reply; 43+ messages in thread
From: SeongJae Park @ 2021-08-12  9:42 UTC (permalink / raw)
  To: Valdis Klētnieks
  Cc: SeongJae Park, Andrew Morton, linux-mm, linux-kernel

From: SeongJae Park <sjpark@amazon.de>

Hello Valdis,

On Thu, 12 Aug 2021 05:21:57 -0400 "Valdis =?utf-8?Q?Kl=c4=93tnieks?=" <valdis.kletnieks@vt.edu> wrote:

> In this commit:
> 
> commit fedc37448fb1be5d03e420ca7791d4286893d5ec
> Author: SeongJae Park <sjpark@amazon.de>
> Date:   Tue Aug 10 16:55:51 2021 +1000
> 
>     mm/idle_page_tracking: make PG_idle reusable
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 504336de9a1e..d0b85dc12429 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -739,10 +739,18 @@ config DEFERRED_STRUCT_PAGE_INIT
>           lifetime of the system until these kthreads finish the
>           initialisation.
> 
> +config PAGE_IDLE_FLAG
> +       bool "Add PG_idle and PG_young flags"
> +       help
> +         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
> +         Accessed bit writers can set the state of the bit in the flags to let
> +         other PTE Accessed bit readers don't disturbed.
> 
> This needs to be converted to proper, or at least comprehensible, English....

Thank you for the comment.

How about below?

--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -743,9 +743,9 @@ config PAGE_IDLE_FLAG
        bool "Add PG_idle and PG_young flags"
        select PAGE_EXTENSION if !64BIT
        help
-         This feature adds PG_idle and PG_young flags in 'struct page'.  PTE
-         Accessed bit writers can set the state of the bit in the flags to let
-         other PTE Accessed bit readers don't disturbed.
+         This feature adds 'PG_idle' and 'PG_young' flags in 'struct page'.
+         PTE Accessed bit writers can save the state of the bit in the flags
+         to let other PTE Accessed bit readers don't get disturbed.


Thanks,
SeongJae Park


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
  2018-06-28  3:48 ` Andrew Morton
@ 2018-06-29 18:44   ` Andrey Ryabinin
  0 siblings, 0 replies; 43+ messages in thread
From: Andrey Ryabinin @ 2018-06-29 18:44 UTC (permalink / raw)
  To: Andrew Morton, icytxw
  Cc: bugzilla-daemon, linux-mm, Alexander Potapenko, Dmitry Vyukov



On 06/28/2018 06:48 AM, Andrew Morton wrote:

>> Hi,
>> This bug was found in Linux Kernel v4.18-rc2
>>
>> $ cat report0 
>> ================================================================================
>> UBSAN: Undefined behaviour in mm/fadvise.c:76:10
>> signed integer overflow:
>> 4 + 9223372036854775805 cannot be represented in type 'long long int'
>> CPU: 0 PID: 13477 Comm: syz-executor1 Not tainted 4.18.0-rc1 #2
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x122/0x1c8 lib/dump_stack.c:113
>>  ubsan_epilogue+0x12/0x86 lib/ubsan.c:159
>>  handle_overflow+0x1c2/0x21f lib/ubsan.c:190
>>  __ubsan_handle_add_overflow+0x2a/0x31 lib/ubsan.c:198
>>  ksys_fadvise64_64+0xbf0/0xd10 mm/fadvise.c:76
>>  __do_sys_fadvise64 mm/fadvise.c:198 [inline]
>>  __se_sys_fadvise64 mm/fadvise.c:196 [inline]
>>  __x64_sys_fadvise64+0xa9/0x120 mm/fadvise.c:196
>>  do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290
> 
> That overflow is deliberate:
> 
> 	endbyte = offset + len;
> 	if (!len || endbyte < len)
> 		endbyte = -1;
> 	else
> 		endbyte--;		/* inclusive */
> 
> Or is there a hole in this logic?
> 
> If not, I guess we can do this another way to keep the checker happy.
 
It should be enough to do the addition in an unsigned type; unsigned overflow (wrap-around) is defined by the C standard.
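
As a minimal sketch of that idea (not necessarily the exact fix that should go
upstream), the sum could be done in u64 and the range checked explicitly:

	/*
	 * Sketch only: compute the sum in u64, where wrap-around is well
	 * defined, and detect "does not fit in loff_t" explicitly.  Assumes
	 * offset and len have already been checked to be non-negative.
	 */
	u64 uend = (u64)offset + (u64)len;

	if (!len || uend > LLONG_MAX)
		endbyte = -1;			/* keep the original sentinel */
	else
		endbyte = (loff_t)uend - 1;	/* inclusive */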

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE:
@ 2007-04-03 18:41 Royal VIP Casino
  0 siblings, 0 replies; 43+ messages in thread
From: Royal VIP Casino @ 2007-04-03 18:41 UTC (permalink / raw)
  To: linux-mm



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re:
       [not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com>
@ 2001-10-08 11:48 ` Joseph A Knapka
  0 siblings, 0 replies; 43+ messages in thread
From: Joseph A Knapka @ 2001-10-08 11:48 UTC (permalink / raw)
  To: Gavin Dolling; +Cc: linux-mm

Hi Gavin,

[Forwarded to linux-mm, since those guys will be able to
 answer your questions much more completely. Maybe someone
 has already solved your problem.]

Gavin Dolling wrote:
> 
> Your VM page has helped me immensely. I'm after some advice though about the
> following. No problem if you are too busy, etc. your site has already helped
> me a great deal so just hit that delete key now ...
> 
> I have an embedded linux system running out of 8M of RAM. It has no backing
> store and uses a RAM disk as its FS. It boots from a flash chip - at boot
> time things are uncompressed into RAM. Running an MTD type system with a
> flash FS is not an option.
> 
> Memory is very tight and it is unfortunate that the binaries effectively
> appear twice in memory. They are in the RAM FS in full and also get paged
> into memory. There is a lot of paging going on which I believe is drowning
> the system.
> 
> We have no swap file (that would obviously be stupid) but a large number of
> buffers (i.e. a lot of dirty pages). The application is networking stuff so
> it is supposed to perform at line rate - the paging appears to be preventing
> this.
> 
> What I wish to do is to page the user space binaries into the page cache,
> mark them so they are never evicted. Delete them from the RAMFS and recover
> the memory. This should be the optimal way of running the system - in
> terms of memory usage anyway.
> 
> I am planning to hack filemap.c. Going to use page_cache_read on each binary
> and then remove from RAM FS. If the page is not in use I will have to make
> sure that deleting the file does not result in the page being evicted.
> Basically some more hacking required. I am also concerned about the inode
> associated with the page; this is going to cause me pain, I think?
> 
> I am going to try this on my PC first. Going to try and force 'cat' to be
> fully paged in and then rename it. I should still be able to use cat at the
> command line.

I don't think this will work as a test case. The address_space mappings
are based on inode identity, and since you won't actually have
a "cat" program on your filesystem, the inode won't be found, so
the kernel will not have a way of knowing that the cached pages
are the right ones. You'd have to leave at least enough of the
filesystem intact for the kernel to be able to map the program
name to the correct inode. You might solve this by pinning the
inode buffers in main memory before reclaiming the RAMFS pages,
but that's pure speculation on my part.
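
To make that concrete: page cache lookups are keyed by the inode's
address_space plus a page index, never by the path name. Roughly (modern
kernel API shown purely for illustration):

	#include <linux/fs.h>
	#include <linux/pagemap.h>

	/*
	 * Illustrative sketch: cached pages are found via inode->i_mapping
	 * and a page index.  If the path no longer resolves to the same
	 * inode, there is no way to reach those pages through the name.
	 */
	static struct page *lookup_cached_page(struct inode *inode, pgoff_t index)
	{
		return find_get_page(inode->i_mapping, index);
	}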

> So basically:
> 
> a) Is this feasible?

See below.

> b) When I delete the binary can I prevent it from being evicted from the
> page cache?
> (I note with interest that if I mv my /usr/bin/emacs whilst emacs is running
>       e.g.   $ emacs &; mv /usr/bin/emacs /usr/bin/emacs2
> it allows me to do it and what's more nothing bad happens. This tells me I
> do not understand enough of what is going on - I would have expected this to
> fail in some manner).

The disk inode for a moved or deleted file (and the file's disk
blocks) don't get freed until all references to the inode are
gone. If the kernel has the file open (eg due to mmap()),
the file can still be used for paging until it's unmapped
by all the processes that are using it. (This is another
reason your test case above might be misleading.)
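
A quick way to see this from userspace (a minimal sketch, with error handling
omitted):

	/*
	 * Demonstrates that an open, mmap'd file survives unlink: the inode
	 * and its page cache pages stay alive until the last reference to
	 * the inode goes away.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("scratch.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
		const char msg[] = "still here after unlink\n";

		write(fd, msg, sizeof(msg));
		char *p = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);

		unlink("scratch.bin");	/* name gone, inode still referenced */
		printf("%s", p);	/* pages still readable via the mapping */

		munmap(p, sizeof(msg));
		close(fd);		/* now the inode and blocks can be freed */
		return 0;
	}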

> c) I must have to leave something in the RAMFS such that the instance of the
> binary still exists even if not its whole content.
> 
> d) Am I insane to try this? (Why would be more useful than just a yes ;-)  )

I don't know. This is a deeper hack than any I've contemplated.
However, I'm tempted to say that it would be easier to figure
out a way to directly add the RAMFS pages to the page cache,
and thus use a single page simultaneously as a cache page and
an FS page. I don't know how hard that's going to be, but I
think it might be easier than trying to yank the FS out from
under an in-use mapping.

Cheers,

-- Joe
# "You know how many remote castles there are along the
#  gorges? You can't MOVE for remote castles!" - Lu Tze re. Uberwald
# Linux MM docs:
http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2024-03-08  1:06 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-22 15:12 [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Sebastian Andrzej Siewior
2018-06-22 15:12 ` [PATCH 1/3] mm: workingset: remove local_irq_disable() from count_shadow_nodes() Sebastian Andrzej Siewior
2018-06-24 19:51   ` Vladimir Davydov
2018-06-25 10:36   ` Kirill Tkhai
2018-06-22 15:12 ` [PATCH 2/3] mm: workingset: make shadow_lru_isolate() use locking suffix Sebastian Andrzej Siewior
2018-06-24 19:57   ` Vladimir Davydov
2018-06-26 21:25     ` Sebastian Andrzej Siewior
2018-06-27  8:50       ` Vladimir Davydov
2018-06-27  9:20         ` Sebastian Andrzej Siewior
2018-06-28  9:30           ` Vladimir Davydov
2018-07-02 22:38             ` Sebastian Andrzej Siewior
2018-06-22 15:12 ` [PATCH 3/3] mm: list_lru: Add lock_irq member to __list_lru_init() Sebastian Andrzej Siewior
2018-06-24 20:09   ` Vladimir Davydov
2018-07-03 14:52     ` Sebastian Andrzej Siewior
2018-07-03 14:52       ` [PATCH 1/4] mm/list_lru: use list_lru_walk_one() in list_lru_walk_node() Sebastian Andrzej Siewior
2018-07-03 14:52       ` [PATCH 2/4] mm/list_lru: Move locking from __list_lru_walk_one() to its caller Sebastian Andrzej Siewior
2018-07-03 14:52       ` [PATCH 3/4] mm/list_lru: Pass struct list_lru_node as an argument __list_lru_walk_one() Sebastian Andrzej Siewior
2018-07-03 14:52       ` [PATCH 4/4] mm/list_lru: Introduce list_lru_shrink_walk_irq() Sebastian Andrzej Siewior
2018-07-03 21:14       ` Andrew Morton
2018-07-03 21:44         ` Re: Sebastian Andrzej Siewior
2018-07-04 14:44           ` Re: Vladimir Davydov
2018-06-22 21:39 ` [PATCH 0/3] mm: use irq locking suffix instead local_irq_disable() Andrew Morton
2018-06-24 20:10   ` Vladimir Davydov
  -- strict thread matches above, loose matches on Subject: below --
2024-02-27 17:42 [PATCH v3 00/18] Rearrange batched folio freeing Matthew Wilcox (Oracle)
2024-02-27 17:42 ` [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed Matthew Wilcox (Oracle)
2024-03-06 13:42   ` Ryan Roberts
2024-03-06 16:09     ` Matthew Wilcox
2024-03-06 16:19       ` Ryan Roberts
2024-03-06 17:41         ` Ryan Roberts
2024-03-06 18:41           ` Zi Yan
2024-03-06 19:55             ` Matthew Wilcox
2024-03-06 21:55               ` Matthew Wilcox
2024-03-07  8:56                 ` Ryan Roberts
2024-03-07 13:50                   ` Yin, Fengwei
2024-03-07 14:05                     ` Re: Matthew Wilcox
2024-03-07 15:24                       ` Re: Ryan Roberts
2024-03-07 16:24                         ` Re: Ryan Roberts
2024-03-07 23:02                           ` Re: Matthew Wilcox
2024-03-08  1:06                       ` Re: Yin, Fengwei
2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
2024-01-22 10:13 ` Andi Kleen
2024-01-22 11:53   ` Dave Chinner
2023-05-11 12:58 Ryan Roberts
2023-05-11 13:13 ` Ryan Roberts
2022-08-26 22:03 Zach O'Keefe
2022-08-31 21:47 ` Yang Shi
2022-09-01  0:24   ` Re: Zach O'Keefe
     [not found] <20220421164138.1250943-1-yury.norov@gmail.com>
2022-04-21 23:04 ` Re: John Hubbard
2022-04-21 23:09   ` Re: John Hubbard
2022-04-21 23:17   ` Re: Yury Norov
2022-04-21 23:21     ` Re: John Hubbard
2021-08-12  9:21 Valdis Klētnieks
2021-08-12  9:42 ` SeongJae Park
2021-08-12 20:19   ` Re: Andrew Morton
2021-08-13  8:14     ` Re: SeongJae Park
     [not found] <bug-200209-27@https.bugzilla.kernel.org/>
2018-06-28  3:48 ` Andrew Morton
2018-06-29 18:44   ` Andrey Ryabinin
2007-04-03 18:41 Royal VIP Casino
     [not found] <F265RQAOCop3wyv9kI3000143b1@hotmail.com>
2001-10-08 11:48 ` Joseph A Knapka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).