* [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()
@ 2024-02-27  3:02 Chengming Zhou
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-27  3:02 UTC (permalink / raw)
  To: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim
  Cc: linux-kernel, linux-mm, Chengming Zhou

Hello,

free_zspage() has to hold the locks of all pages, since the
zs_page_migrate() path relies on the page lock to protect against the
race with zs_free(), so that it can safely get the zspage from
page->private.

But this approach is neither simple nor elegant:

1. Since zs_free() can't sleep, it can only trylock the pages, or it
   has to use kick_deferred_free() to defer the freeing to a workqueue.

2. Even in the worker context, async_free_zspage() can't simply
   lock all pages in lock_zspage(); it still has to trylock because of
   the race between zs_free() and zs_page_migrate(). See
   commit 2505a981114d ("zsmalloc: fix races between asynchronous
   zspage free and page migration") for details.

Actually, all free_zspage() needs is a safe way to get the zspage from
a page, and we can use RCU to achieve that easily. Then free_zspage()
doesn't need to hold the locks of all pages, so the deferred free
mechanism isn't needed at all. This patchset implements that and
removes all of the deferred-free related code.
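
For background, this is just the classic RCU publish/lookup pattern
applied to page->private. Below is a minimal, self-contained sketch of
that pattern; the names (struct obj, publish(), etc.) are made up for
illustration and are not the zsmalloc code itself:

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj {
	int payload;
	struct rcu_head rcu_head;
};

/* Writer: fully initialize the object, then publish it. */
static void publish(struct obj __rcu **slot, struct obj *o)
{
	o->payload = 42;
	rcu_assign_pointer(*slot, o);	/* release-ordered store */
}

/* Lockless reader: only valid inside an RCU read-side section. */
static int read_payload(struct obj __rcu **slot)
{
	struct obj *o;
	int ret = -EINVAL;

	rcu_read_lock();
	o = rcu_dereference(*slot);
	if (o)
		ret = o->payload;
	rcu_read_unlock();
	return ret;
}

/* Writer: unpublish, then free only after a grace period elapses. */
static void unpublish(struct obj __rcu **slot)
{
	struct obj *o = rcu_dereference_protected(*slot, true);

	RCU_INIT_POINTER(*slot, NULL);
	kfree_rcu(o, rcu_head);		/* obj assumed kmalloc-allocated */
}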

Thanks for review and comments!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
      mm/zsmalloc: don't hold locks of all pages when free_zspage()
      mm/zsmalloc: remove the deferred free mechanism

 mm/zsmalloc.c | 206 ++++++++++++++++------------------------------------------
 1 file changed, 56 insertions(+), 150 deletions(-)
---
base-commit: ccbd06e764bac9bbf6b4e91c700fe6dd28f08fb3
change-id: 20240226-zsmalloc-zspage-rcu-b2c12f054fb4

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>


* [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-27  3:02 [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Chengming Zhou
@ 2024-02-27  3:02 ` Chengming Zhou
  2024-02-28  4:33   ` Sergey Senozhatsky
                     ` (2 more replies)
  2024-02-27  3:02 ` [PATCH 2/2] mm/zsmalloc: remove the deferred free mechanism Chengming Zhou
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-27  3:02 UTC (permalink / raw)
  To: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim
  Cc: linux-kernel, linux-mm, Chengming Zhou

free_zspage() has to hold the locks of all pages, since the
zs_page_migrate() path relies on the page lock to protect against the
race with zs_free(), so that it can safely get the zspage from
page->private.

But this approach is neither simple nor elegant:

1. Since zs_free() can't sleep, it can only trylock the pages, or it
   has to use kick_deferred_free() to defer the freeing to a workqueue.

2. Even in the worker context, async_free_zspage() can't simply
   lock all pages in lock_zspage(); it still has to trylock because of
   the race between zs_free() and zs_page_migrate(). See
   commit 2505a981114d ("zsmalloc: fix races between asynchronous
   zspage free and page migration") for details.

Actually, all free_zspage() needs is a safe way to get the zspage from
a page, and we can use RCU to achieve that easily. Then free_zspage()
doesn't need to hold the locks of all pages, so the deferred free
mechanism isn't needed at all.

The updated zs_page_migrate() now has two more cases to consider:

1. get_zspage_lockless() returns NULL: free_zspage() has already done
   reset_page() on this page and dropped its reference to it.

2. get_zspage_lockless() returns a zspage, but it's not on the pool's
   list: the zspage has been removed from the list and is in the
   process of being freed.

I'm not sure what value should be returned in these cases: -EINVAL,
-EAGAIN, or something else? If the migration caller can detect that the
page has no extra references and can just free it, I think we should
return -EAGAIN so that the caller retries this page later and frees it.
For now I chose -EINVAL to skip migration of this page; failing the
migration of a few pages doesn't seem like a big deal.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zsmalloc.c | 97 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 56 insertions(+), 41 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63ec385cd670..b153f2e5fc0f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -253,6 +253,7 @@ struct zspage {
 	struct list_head list; /* fullness list */
 	struct zs_pool *pool;
 	rwlock_t lock;
+	struct rcu_head rcu_head;
 };
 
 struct mapping_area {
@@ -310,6 +311,8 @@ static int create_cache(struct zs_pool *pool)
 static void destroy_cache(struct zs_pool *pool)
 {
 	kmem_cache_destroy(pool->handle_cachep);
+	/* Synchronize RCU to free zspage. */
+	synchronize_rcu();
 	kmem_cache_destroy(pool->zspage_cachep);
 }
 
@@ -335,6 +338,14 @@ static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
 	kmem_cache_free(pool->zspage_cachep, zspage);
 }
 
+static void rcu_free_zspage(struct rcu_head *h)
+{
+	struct zspage *zspage = container_of(h, struct zspage, rcu_head);
+	struct zs_pool *pool = zspage->pool;
+
+	kmem_cache_free(pool->zspage_cachep, zspage);
+}
+
 /* pool->lock(which owns the handle) synchronizes races */
 static void record_obj(unsigned long handle, unsigned long obj)
 {
@@ -710,14 +721,31 @@ static int fix_fullness_group(struct size_class *class, struct zspage *zspage)
 	return newfg;
 }
 
+static void set_zspage(struct page *page, struct zspage *zspage)
+{
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+
+	rcu_assign_pointer(*private, zspage);
+}
+
 static struct zspage *get_zspage(struct page *page)
 {
-	struct zspage *zspage = (struct zspage *)page_private(page);
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+	struct zspage *zspage;
 
+	zspage = rcu_dereference_protected(*private, true);
 	BUG_ON(zspage->magic != ZSPAGE_MAGIC);
 	return zspage;
 }
 
+/* Only used in zs_page_migrate() to get zspage locklessly. */
+static struct zspage *get_zspage_lockless(struct page *page)
+{
+	struct zspage __rcu **private = (struct zspage __rcu **)&page->private;
+
+	return rcu_dereference(*private);
+}
+
 static struct page *get_next_page(struct page *page)
 {
 	struct zspage *zspage = get_zspage(page);
@@ -793,32 +821,11 @@ static void reset_page(struct page *page)
 {
 	__ClearPageMovable(page);
 	ClearPagePrivate(page);
-	set_page_private(page, 0);
+	set_zspage(page, NULL);
 	page_mapcount_reset(page);
 	page->index = 0;
 }
 
-static int trylock_zspage(struct zspage *zspage)
-{
-	struct page *cursor, *fail;
-
-	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
-					get_next_page(cursor)) {
-		if (!trylock_page(cursor)) {
-			fail = cursor;
-			goto unlock;
-		}
-	}
-
-	return 1;
-unlock:
-	for (cursor = get_first_page(zspage); cursor != fail; cursor =
-					get_next_page(cursor))
-		unlock_page(cursor);
-
-	return 0;
-}
-
 static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 				struct zspage *zspage)
 {
@@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		next = get_next_page(page);
 		reset_page(page);
-		unlock_page(page);
 		dec_zone_page_state(page, NR_ZSPAGES);
 		put_page(page);
 		page = next;
 	} while (page != NULL);
 
-	cache_free_zspage(pool, zspage);
+	call_rcu(&zspage->rcu_head, rcu_free_zspage);
 
 	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
 	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
@@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
 	VM_BUG_ON(get_zspage_inuse(zspage));
 	VM_BUG_ON(list_empty(&zspage->list));
 
-	/*
-	 * Since zs_free couldn't be sleepable, this function cannot call
-	 * lock_page. The page locks trylock_zspage got will be released
-	 * by __free_zspage.
-	 */
-	if (!trylock_zspage(zspage)) {
-		kick_deferred_free(pool);
-		return;
-	}
-
 	remove_zspage(class, zspage);
 	__free_zspage(pool, class, zspage);
 }
@@ -929,7 +925,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
-		set_page_private(page, (unsigned long)zspage);
+		set_zspage(page, zspage);
 		page->index = 0;
 		if (i == 0) {
 			zspage->first_page = page;
@@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 		pages[i] = page;
 	}
 
-	create_page_chain(class, zspage, pages);
 	init_zspage(class, zspage);
 	zspage->pool = pool;
 	zspage->class = class->index;
+	/* RCU set_zspage() after zspage initialized. */
+	create_page_chain(class, zspage, pages);
 
 	return zspage;
 }
@@ -1765,17 +1762,35 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	/* The page is locked, so this pointer must remain valid */
-	zspage = get_zspage(page);
-	pool = zspage->pool;
+	rcu_read_lock();
+	zspage = get_zspage_lockless(page);
+	if (!zspage) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
 
 	/*
 	 * The pool's lock protects the race between zpage migration
-	 * and zs_free.
+	 * and zs_free. We check if the zspage is still in pool with
+	 * pool->lock protection. If the zspage isn't in pool anymore,
+	 * it should be freed by RCU soon.
 	 */
+	pool = zspage->pool;
 	spin_lock(&pool->lock);
 	class = zspage_class(pool, zspage);
 
+	if (list_empty(&zspage->list)) {
+		spin_unlock(&pool->lock);
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	/*
+	 * Now the zspage is still on pool, and we held pool->lock,
+	 * it can't be freed in the meantime.
+	 */
+	rcu_read_unlock();
+
 	/* the migrate_write_lock protects zpage access via zs_map_object */
 	migrate_write_lock(zspage);
 

-- 
b4 0.10.1


* [PATCH 2/2] mm/zsmalloc: remove the deferred free mechanism
  2024-02-27  3:02 [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Chengming Zhou
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
@ 2024-02-27  3:02 ` Chengming Zhou
  2024-02-28  1:57 ` [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Sergey Senozhatsky
  2024-02-28  3:54 ` Sergey Senozhatsky
  3 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-27  3:02 UTC (permalink / raw)
  To: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim
  Cc: linux-kernel, linux-mm, Chengming Zhou

Since the only user of kick_deferred_free() is gone, remove all of the
deferred-free related code.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zsmalloc.c | 109 ----------------------------------------------------------
 1 file changed, 109 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b153f2e5fc0f..1a044690b389 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -232,9 +232,6 @@ struct zs_pool {
 
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
-#endif
-#ifdef CONFIG_COMPACTION
-	struct work_struct free_work;
 #endif
 	spinlock_t lock;
 	atomic_t compaction_in_progress;
@@ -281,12 +278,8 @@ static void migrate_write_lock(struct zspage *zspage);
 static void migrate_write_unlock(struct zspage *zspage);
 
 #ifdef CONFIG_COMPACTION
-static void kick_deferred_free(struct zs_pool *pool);
-static void init_deferred_free(struct zs_pool *pool);
 static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
 #else
-static void kick_deferred_free(struct zs_pool *pool) {}
-static void init_deferred_free(struct zs_pool *pool) {}
 static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
 #endif
 
@@ -1632,50 +1625,6 @@ static int putback_zspage(struct size_class *class, struct zspage *zspage)
 	return fullness;
 }
 
-#ifdef CONFIG_COMPACTION
-/*
- * To prevent zspage destroy during migration, zspage freeing should
- * hold locks of all pages in the zspage.
- */
-static void lock_zspage(struct zspage *zspage)
-{
-	struct page *curr_page, *page;
-
-	/*
-	 * Pages we haven't locked yet can be migrated off the list while we're
-	 * trying to lock them, so we need to be careful and only attempt to
-	 * lock each page under migrate_read_lock(). Otherwise, the page we lock
-	 * may no longer belong to the zspage. This means that we may wait for
-	 * the wrong page to unlock, so we must take a reference to the page
-	 * prior to waiting for it to unlock outside migrate_read_lock().
-	 */
-	while (1) {
-		migrate_read_lock(zspage);
-		page = get_first_page(zspage);
-		if (trylock_page(page))
-			break;
-		get_page(page);
-		migrate_read_unlock(zspage);
-		wait_on_page_locked(page);
-		put_page(page);
-	}
-
-	curr_page = page;
-	while ((page = get_next_page(curr_page))) {
-		if (trylock_page(page)) {
-			curr_page = page;
-		} else {
-			get_page(page);
-			migrate_read_unlock(zspage);
-			wait_on_page_locked(page);
-			put_page(page);
-			migrate_read_lock(zspage);
-		}
-	}
-	migrate_read_unlock(zspage);
-}
-#endif /* CONFIG_COMPACTION */
-
 static void migrate_lock_init(struct zspage *zspage)
 {
 	rwlock_init(&zspage->lock);
@@ -1730,10 +1679,6 @@ static void replace_sub_page(struct size_class *class, struct zspage *zspage,
 
 static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 {
-	/*
-	 * Page is locked so zspage couldn't be destroyed. For detail, look at
-	 * lock_zspage in free_zspage.
-	 */
 	VM_BUG_ON_PAGE(PageIsolated(page), page);
 
 	return true;
@@ -1848,56 +1793,6 @@ static const struct movable_operations zsmalloc_mops = {
 	.putback_page = zs_page_putback,
 };
 
-/*
- * Caller should hold page_lock of all pages in the zspage
- * In here, we cannot use zspage meta data.
- */
-static void async_free_zspage(struct work_struct *work)
-{
-	int i;
-	struct size_class *class;
-	struct zspage *zspage, *tmp;
-	LIST_HEAD(free_pages);
-	struct zs_pool *pool = container_of(work, struct zs_pool,
-					free_work);
-
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
-		class = pool->size_class[i];
-		if (class->index != i)
-			continue;
-
-		spin_lock(&pool->lock);
-		list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0],
-				 &free_pages);
-		spin_unlock(&pool->lock);
-	}
-
-	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
-		list_del(&zspage->list);
-		lock_zspage(zspage);
-
-		spin_lock(&pool->lock);
-		class = zspage_class(pool, zspage);
-		__free_zspage(pool, class, zspage);
-		spin_unlock(&pool->lock);
-	}
-};
-
-static void kick_deferred_free(struct zs_pool *pool)
-{
-	schedule_work(&pool->free_work);
-}
-
-static void zs_flush_migration(struct zs_pool *pool)
-{
-	flush_work(&pool->free_work);
-}
-
-static void init_deferred_free(struct zs_pool *pool)
-{
-	INIT_WORK(&pool->free_work, async_free_zspage);
-}
-
 static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
 {
 	struct page *page = get_first_page(zspage);
@@ -1908,8 +1803,6 @@ static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
 		unlock_page(page);
 	} while ((page = get_next_page(page)) != NULL);
 }
-#else
-static inline void zs_flush_migration(struct zs_pool *pool) { }
 #endif
 
 /*
@@ -2121,7 +2014,6 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (!pool)
 		return NULL;
 
-	init_deferred_free(pool);
 	spin_lock_init(&pool->lock);
 	atomic_set(&pool->compaction_in_progress, 0);
 
@@ -2229,7 +2121,6 @@ void zs_destroy_pool(struct zs_pool *pool)
 	int i;
 
 	zs_unregister_shrinker(pool);
-	zs_flush_migration(pool);
 	zs_pool_stat_destroy(pool);
 
 	for (i = 0; i < ZS_SIZE_CLASSES; i++) {

-- 
b4 0.10.1


* Re: [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()
  2024-02-27  3:02 [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Chengming Zhou
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
  2024-02-27  3:02 ` [PATCH 2/2] mm/zsmalloc: remove the deferred free mechanism Chengming Zhou
@ 2024-02-28  1:57 ` Sergey Senozhatsky
  2024-02-28  2:22   ` [External] " Chengming Zhou
  2024-02-28  3:54 ` Sergey Senozhatsky
  3 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  1:57 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On (24/02/27 03:02), Chengming Zhou wrote:
> Hello,
> 
> free_zspage() has to hold the locks of all pages, since the
> zs_page_migrate() path relies on the page lock to protect against the
> race with zs_free(), so that it can safely get the zspage from
> page->private.
>
> But this approach is neither simple nor elegant:
>
> 1. Since zs_free() can't sleep, it can only trylock the pages, or it
>    has to use kick_deferred_free() to defer the freeing to a workqueue.
>
> 2. Even in the worker context, async_free_zspage() can't simply
>    lock all pages in lock_zspage(); it still has to trylock because of
>    the race between zs_free() and zs_page_migrate(). See
>    commit 2505a981114d ("zsmalloc: fix races between asynchronous
>    zspage free and page migration") for details.
>
> Actually, all free_zspage() needs is a safe way to get the zspage from
> a page, and we can use RCU to achieve that easily. Then free_zspage()
> doesn't need to hold the locks of all pages, so the deferred free
> mechanism isn't needed at all. This patchset implements that and
> removes all of the deferred-free related code.
> 
> Thanks for review and comments!
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

JFI, recovered from the SPAM folder
"The sender hasn't authenticated this message"


* Re: [External] Re: [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()
  2024-02-28  1:57 ` [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Sergey Senozhatsky
@ 2024-02-28  2:22   ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-28  2:22 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: yosryahmed, hannes, nphamcs, Andrew Morton, Minchan Kim,
	linux-kernel, linux-mm

On 2024/2/28 09:57, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
>> Hello,
>>
>> free_zspage() has to hold the locks of all pages, since the
>> zs_page_migrate() path relies on the page lock to protect against the
>> race with zs_free(), so that it can safely get the zspage from
>> page->private.
>>
>> But this approach is neither simple nor elegant:
>>
>> 1. Since zs_free() can't sleep, it can only trylock the pages, or it
>>    has to use kick_deferred_free() to defer the freeing to a workqueue.
>>
>> 2. Even in the worker context, async_free_zspage() can't simply
>>    lock all pages in lock_zspage(); it still has to trylock because of
>>    the race between zs_free() and zs_page_migrate(). See
>>    commit 2505a981114d ("zsmalloc: fix races between asynchronous
>>    zspage free and page migration") for details.
>>
>> Actually, all free_zspage() needs is a safe way to get the zspage from
>> a page, and we can use RCU to achieve that easily. Then free_zspage()
>> doesn't need to hold the locks of all pages, so the deferred free
>> mechanism isn't needed at all. This patchset implements that and
>> removes all of the deferred-free related code.
>>
>> Thanks for review and comments!
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> JFI, recovered from the SPAM folder
> "The sender hasn't authenticated this message"

Sorry about that. I thought the problem was fixed after testing with my
own Gmail account last time, but it turns out my corporate email still
has this problem sometimes.

I will always use my linux.dev email in the future to avoid these
problems.

Thanks for your time!


* Re: [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage()
  2024-02-27  3:02 [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Chengming Zhou
                   ` (2 preceding siblings ...)
  2024-02-28  1:57 ` [PATCH 0/2] mm/zsmalloc: simplify synchronization between zs_page_migrate() and free_zspage() Sergey Senozhatsky
@ 2024-02-28  3:54 ` Sergey Senozhatsky
  3 siblings, 0 replies; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  3:54 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On (24/02/27 03:02), Chengming Zhou wrote:
> free_zspage() has to hold the locks of all pages, since the
> zs_page_migrate() path relies on the page lock to protect against the
> race with zs_free(), so that it can safely get the zspage from
> page->private.
>
> But this approach is neither simple nor elegant:
>
> 1. Since zs_free() can't sleep, it can only trylock the pages, or it
>    has to use kick_deferred_free() to defer the freeing to a workqueue.
>
> 2. Even in the worker context, async_free_zspage() can't simply
>    lock all pages in lock_zspage(); it still has to trylock because of
>    the race between zs_free() and zs_page_migrate(). See
>    commit 2505a981114d ("zsmalloc: fix races between asynchronous
>    zspage free and page migration") for details.
>
> Actually, all free_zspage() needs is a safe way to get the zspage from
> a page, and we can use RCU to achieve that easily. Then free_zspage()
> doesn't need to hold the locks of all pages, so the deferred free
> mechanism isn't needed at all. This patchset implements that and
> removes all of the deferred-free related code.
> 
> Thanks for review and comments!
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
> Chengming Zhou (2):
>       mm/zsmalloc: don't hold locks of all pages when free_zspage()

That seems to be crashing on me:

[   28.123867] ==================================================================
[   28.125303] BUG: KASAN: null-ptr-deref in obj_malloc+0xa9/0x1f0
[   28.126289] Read of size 8 at addr 0000000000000028 by task mkfs.ext2/432
[   28.127414] 
[   28.127684] CPU: 8 PID: 432 Comm: mkfs.ext2 Tainted: G                 N 6.8.0-rc5+ #309
[   28.129015] Call Trace:
[   28.129442]  <TASK>
[   28.129805]  dump_stack_lvl+0x6f/0xab
[   28.130437]  print_report+0xe0/0x5e0
[   28.131050]  ? _printk+0x59/0x7b
[   28.131602]  ? kasan_report+0x96/0x120
[   28.132233]  ? obj_malloc+0xa9/0x1f0
[   28.132837]  kasan_report+0xe7/0x120
[   28.133441]  ? obj_malloc+0xa9/0x1f0
[   28.134046]  obj_malloc+0xa9/0x1f0
[   28.134633]  zs_malloc+0x22c/0x3e0
[   28.135211]  zram_submit_bio+0x44e/0xee0
[   28.135871]  ? lock_release+0x50c/0x700
[   28.136520]  submit_bio_noacct_nocheck+0x22a/0x650
[   28.137327]  __block_write_full_folio+0x48b/0x710
[   28.138119]  ? __cfi_blkdev_get_block+0x10/0x10
[   28.138885]  ? __cfi_block_write_full_folio+0x10/0x10
[   28.139737]  write_cache_pages+0x83/0xf0
[   28.140397]  ? __cfi_blkdev_get_block+0x10/0x10
[   28.141152]  blkdev_writepages+0x46/0x80
[   28.141810]  do_writepages+0x1be/0x400
[   28.142443]  file_write_and_wait_range+0x104/0x170
[   28.143254]  blkdev_fsync+0x4a/0x70
[   28.143846]  __x64_sys_fsync+0xe9/0x120
[   28.144491]  do_syscall_64+0x8d/0x130
[   28.145106]  entry_SYSCALL_64_after_hwframe+0x46/0x4e


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
@ 2024-02-28  4:33   ` Sergey Senozhatsky
  2024-02-28  5:14     ` Chengming Zhou
  2024-02-28  6:01   ` Sergey Senozhatsky
  2024-02-28  6:14   ` Sergey Senozhatsky
  2 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  4:33 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On (24/02/27 03:02), Chengming Zhou wrote:
[..]
> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  		pages[i] = page;
>  	}
>  
> -	create_page_chain(class, zspage, pages);
>  	init_zspage(class, zspage);
>  	zspage->pool = pool;
>  	zspage->class = class->index;
> +	/* RCU set_zspage() after zspage initialized. */
> +	create_page_chain(class, zspage, pages);

So this hasn't been tested, has it?

init_zspage() does not like to be invoked before create_page_chain(),
because we haven't set up the required pointers yet.

So when init_zspage() calls get_first_page(), it gets a NULL
zspage->first_page, which we then use in
is_first_page(first_page)->PagePrivate(page). As far as I can tell.


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  4:33   ` Sergey Senozhatsky
@ 2024-02-28  5:14     ` Chengming Zhou
  2024-02-28  5:29       ` Sergey Senozhatsky
  0 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2024-02-28  5:14 UTC (permalink / raw)
  To: Sergey Senozhatsky, Chengming Zhou
  Cc: yosryahmed, hannes, nphamcs, Andrew Morton, Minchan Kim,
	linux-kernel, linux-mm

On 2024/2/28 12:33, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
> [..]
>> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>  		pages[i] = page;
>>  	}
>>  
>> -	create_page_chain(class, zspage, pages);
>>  	init_zspage(class, zspage);
>>  	zspage->pool = pool;
>>  	zspage->class = class->index;
>> +	/* RCU set_zspage() after zspage initialized. */
>> +	create_page_chain(class, zspage, pages);
> 
> So this hasn't been tested, has it?
I have tested it in my test VM, but it doesn't have KASAN enabled. I
tested a kernel build in tmpfs with zswap enabled using the zsmalloc
pool; not sure why the kernel didn't crash then...

> 
> init_zspage() does not like to be invoked before create_page_chain(),
> because we haven't setup required pointers yet.

You're right. I can reproduce the problem with KASAN enabled this time:
create_page_chain() should be put before init_zspage(), which iterates
over the pages to build the free-object list, as shown below.
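
Something like the following ordering should work (just a rough sketch
of the idea, not tested yet):

	/* Fully initialize the zspage metadata first... */
	zspage->pool = pool;
	zspage->class = class->index;
	/* ...then build the chain, which RCU-publishes page->private... */
	create_page_chain(class, zspage, pages);
	/* ...and call init_zspage() last, since it walks the chain. */
	init_zspage(class, zspage);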

> 
> So when init_zspage() calls get_first_page() it gets NULL zspage->first_page
> which we then use in is_first_page(first_page)->PagePrivate(page). As far as
> I can tell.

Thanks! I will fix it and test thoroughly before sending an update.


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  5:14     ` Chengming Zhou
@ 2024-02-28  5:29       ` Sergey Senozhatsky
  2024-02-28  5:42         ` Chengming Zhou
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  5:29 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Sergey Senozhatsky, Chengming Zhou, yosryahmed, hannes, nphamcs,
	Andrew Morton, Minchan Kim, linux-kernel, linux-mm

On (24/02/28 13:14), Chengming Zhou wrote:
> On 2024/2/28 12:33, Sergey Senozhatsky wrote:
> > On (24/02/27 03:02), Chengming Zhou wrote:
> > [..]
> >> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> >>  		pages[i] = page;
> >>  	}
> >>  
> >> -	create_page_chain(class, zspage, pages);
> >>  	init_zspage(class, zspage);
> >>  	zspage->pool = pool;
> >>  	zspage->class = class->index;
> >> +	/* RCU set_zspage() after zspage initialized. */
> >> +	create_page_chain(class, zspage, pages);
> > 
> > So this hasn't been tested, has it?
> I have tested it in my test VM, but it doesn't have KASAN enabled. I
> tested a kernel build in tmpfs with zswap enabled using the zsmalloc
> pool; not sure why the kernel didn't crash then...

I hit the problem on non-kasan-enabled kernel.  KASAN was enabled
later on.

[..]

> > So when init_zspage() calls get_first_page(), it gets a NULL
> > zspage->first_page, which we then use in
> > is_first_page(first_page)->PagePrivate(page). As far as I can tell.
>
> Thanks! I will fix it and test thoroughly before sending an update.

I'm curious if we want to add RCU to the picture, given that zsmalloc
is quite often run under memory pressure.


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  5:29       ` Sergey Senozhatsky
@ 2024-02-28  5:42         ` Chengming Zhou
  2024-02-28  6:07           ` Sergey Senozhatsky
  0 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2024-02-28  5:42 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Chengming Zhou, yosryahmed, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On 2024/2/28 13:29, Sergey Senozhatsky wrote:
> On (24/02/28 13:14), Chengming Zhou wrote:
>> On 2024/2/28 12:33, Sergey Senozhatsky wrote:
>>> On (24/02/27 03:02), Chengming Zhou wrote:
>>> [..]
>>>> @@ -978,10 +974,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>>>  		pages[i] = page;
>>>>  	}
>>>>  
>>>> -	create_page_chain(class, zspage, pages);
>>>>  	init_zspage(class, zspage);
>>>>  	zspage->pool = pool;
>>>>  	zspage->class = class->index;
>>>> +	/* RCU set_zspage() after zspage initialized. */
>>>> +	create_page_chain(class, zspage, pages);
>>>
>>> So this hasn't been tested, has it?
>> I have tested it in my test VM, but it doesn't have KASAN enabled. I
>> tested a kernel build in tmpfs with zswap enabled using the zsmalloc
>> pool; not sure why the kernel didn't crash then...
> 
> I hit the problem on non-kasan-enabled kernel.  KASAN was enabled
> later on.

OK, it must have been a problem with my testing process then, sorry.

> 
> [..]
> 
>>> So when init_zspage() calls get_first_page(), it gets a NULL
>>> zspage->first_page, which we then use in
>>> is_first_page(first_page)->PagePrivate(page). As far as I can tell.
>>
>> Thanks! I will fix it and test thoroughly before sending an update.
> 
> I'm curious if we want to add RCU to the picture, given that zsmalloc
> is quite often run under memory pressure.

Yes, that's a reasonable point. But given that struct zspage is only 56
bytes, maybe it's not a problem to delay its freeing to RCU?

Thanks.


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
  2024-02-28  4:33   ` Sergey Senozhatsky
@ 2024-02-28  6:01   ` Sergey Senozhatsky
  2024-02-28  6:44     ` Chengming Zhou
  2024-02-28  6:14   ` Sergey Senozhatsky
  2 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  6:01 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On (24/02/27 03:02), Chengming Zhou wrote:
>  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  				struct zspage *zspage)
>  {
> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);

Who owns the page lock here, now that free_zspage() no longer does trylock_zspage()?

>  		next = get_next_page(page);
>  		reset_page(page);
> -		unlock_page(page);
>  		dec_zone_page_state(page, NR_ZSPAGES);
>  		put_page(page);
>  		page = next;
>  	} while (page != NULL);
>  
> -	cache_free_zspage(pool, zspage);
> +	call_rcu(&zspage->rcu_head, rcu_free_zspage);
>  
>  	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
>  	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
> @@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
>  	VM_BUG_ON(get_zspage_inuse(zspage));
>  	VM_BUG_ON(list_empty(&zspage->list));
>  
> -	/*
> -	 * Since zs_free couldn't be sleepable, this function cannot call
> -	 * lock_page. The page locks trylock_zspage got will be released
> -	 * by __free_zspage.
> -	 */
> -	if (!trylock_zspage(zspage)) {
> -		kick_deferred_free(pool);
> -		return;
> -	}
> -
>  	remove_zspage(class, zspage);
>  	__free_zspage(pool, class, zspage);
>  }


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  5:42         ` Chengming Zhou
@ 2024-02-28  6:07           ` Sergey Senozhatsky
  0 siblings, 0 replies; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  6:07 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Sergey Senozhatsky, Chengming Zhou, yosryahmed, hannes, nphamcs,
	Andrew Morton, Minchan Kim, linux-kernel, linux-mm

On (24/02/28 13:42), Chengming Zhou wrote:
> > I'm curious if we want to add RCU to the picture, given that zsmalloc
> > is quite often run under memory pressure.
> 
> Yes, that's a reasonable point. But given that struct zspage is only
> 56 bytes, maybe it's not a problem to delay its freeing to RCU?

Hmm, yeah, probably.  I think it makes sense to wait for more
"go for it"s from the Cc-ed folks before we land this series.


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-27  3:02 ` [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage() Chengming Zhou
  2024-02-28  4:33   ` Sergey Senozhatsky
  2024-02-28  6:01   ` Sergey Senozhatsky
@ 2024-02-28  6:14   ` Sergey Senozhatsky
  2024-02-28  6:49     ` Chengming Zhou
  2 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2024-02-28  6:14 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: yosryahmed, Sergey Senozhatsky, hannes, nphamcs, Andrew Morton,
	Minchan Kim, linux-kernel, linux-mm

On (24/02/27 03:02), Chengming Zhou wrote:
> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>  		next = get_next_page(page);
>  		reset_page(page);

reset_page()->__ClearPageMovable()->PageMovable() expects page to be
locked.

> -		unlock_page(page);
>  		dec_zone_page_state(page, NR_ZSPAGES);
>  		put_page(page);
>  		page = next;
>  	} while (page != NULL);


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  6:01   ` Sergey Senozhatsky
@ 2024-02-28  6:44     ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-28  6:44 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: yosryahmed, hannes, nphamcs, Andrew Morton, Minchan Kim,
	linux-kernel, linux-mm

On 2024/2/28 14:01, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
>>  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  				struct zspage *zspage)
>>  {
>> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
> 
> Who owns the page lock here, now that free_zspage() no longer does trylock_zspage()?

Right, it should be removed.

> 
>>  		next = get_next_page(page);
>>  		reset_page(page);
>> -		unlock_page(page);
>>  		dec_zone_page_state(page, NR_ZSPAGES);
>>  		put_page(page);
>>  		page = next;
>>  	} while (page != NULL);
>>  
>> -	cache_free_zspage(pool, zspage);
>> +	call_rcu(&zspage->rcu_head, rcu_free_zspage);
>>  
>>  	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
>>  	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
>> @@ -852,16 +858,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
>>  	VM_BUG_ON(get_zspage_inuse(zspage));
>>  	VM_BUG_ON(list_empty(&zspage->list));
>>  
>> -	/*
>> -	 * Since zs_free couldn't be sleepable, this function cannot call
>> -	 * lock_page. The page locks trylock_zspage got will be released
>> -	 * by __free_zspage.
>> -	 */
>> -	if (!trylock_zspage(zspage)) {
>> -		kick_deferred_free(pool);
>> -		return;
>> -	}
>> -
>>  	remove_zspage(class, zspage);
>>  	__free_zspage(pool, class, zspage);
>>  }


* Re: [PATCH 1/2] mm/zsmalloc: don't hold locks of all pages when free_zspage()
  2024-02-28  6:14   ` Sergey Senozhatsky
@ 2024-02-28  6:49     ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2024-02-28  6:49 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: yosryahmed, hannes, nphamcs, Andrew Morton, Minchan Kim,
	linux-kernel, linux-mm

On 2024/2/28 14:14, Sergey Senozhatsky wrote:
> On (24/02/27 03:02), Chengming Zhou wrote:
>> @@ -834,13 +841,12 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>>  		next = get_next_page(page);
>>  		reset_page(page);
> 
> reset_page()->__ClearPageMovable()->PageMovable() expects page to be
> locked.

This seems to mean the patch no longer works... I will think about it.

Thanks!

> 
>> -		unlock_page(page);
>>  		dec_zone_page_state(page, NR_ZSPAGES);
>>  		put_page(page);
>>  		page = next;
>>  	} while (page != NULL);

