* [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Johannes Weiner @ 2024-03-24 21:04 UTC
  To: Andrew Morton
  Cc: Zhongkun He, Chengming Zhou, Yosry Ahmed, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

Zhongkun He reports data corruption when combining zswap with zram.

The issue is the exclusive loads we're doing in zswap. They assume
that all reads are going into the swapcache, which can assume
authoritative ownership of the data and so the zswap copy can go.

However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
to bypass the swapcache. This results in an optimistic read of the
swap data into a page that will be dismissed if the fault fails due to
races. In this case, zswap mustn't drop its authoritative copy.

Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Cc: stable@vger.kernel.org	[6.5+]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 mm/zswap.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 535c907345e0..41a1170f7cfe 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct page *page = &folio->page;
+	bool swapcache = folio_test_swapcache(folio);
 	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 	u8 *dst;
@@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
 		spin_unlock(&tree->lock);
 		return false;
 	}
-	zswap_rb_erase(&tree->rbroot, entry);
+	/*
+	 * When reading into the swapcache, invalidate our entry. The
+	 * swapcache can be the authoritative owner of the page and
+	 * its mappings, and the pressure that results from having two
+	 * in-memory copies outweighs any benefits of caching the
+	 * compression work.
+	 *
+	 * (Most swapins go through the swapcache. The notable
+	 * exception is the singleton fault on SWP_SYNCHRONOUS_IO
+	 * files, which reads into a private page and may free it if
+	 * the fault fails. We remain the primary owner of the entry.)
+	 */
+	if (swapcache)
+		zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	if (entry->length)
@@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
-	zswap_entry_free(entry);
-
-	folio_mark_dirty(folio);
+	if (swapcache) {
+		zswap_entry_free(entry);
+		folio_mark_dirty(folio);
+	}
 
 	return true;
 }
-- 
2.44.0
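
For illustration, the failure mode described in the changelog comes down to
roughly the following sequence (a sketch of the pre-fix behavior; the function
names are real, but the flow is condensed and not the literal code):

	/*
	 * Fault on a SWP_SYNCHRONOUS_IO device, swapcache bypassed:
	 *
	 *   do_swap_page()
	 *     folio = vma_alloc_folio(...)   // private page, NOT in swapcache
	 *     swap_read_folio(folio, ...)
	 *       zswap_load(folio)
	 *         zswap_rb_erase(...)         // pre-fix: entry always dropped
	 *         zswap_entry_free(entry)     // last copy of the data is gone
	 *     pte_same() check fails          // PTE changed under us (race)
	 *     folio is freed                  // decompressed copy dismissed
	 *
	 * A later fault then finds neither a zswap entry nor valid data in
	 * the backing swap slot: the data is lost.
	 */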




* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-24 21:22 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Zhongkun He, Chengming Zhou, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Zhongkun He reports data corruption when combining zswap with zram.
>
> The issue is the exclusive loads we're doing in zswap. They assume
> that all reads are going into the swapcache, which can assume
> authoritative ownership of the data and so the zswap copy can go.
>
> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> to bypass the swapcache. This results in an optimistic read of the
> swap data into a page that will be dismissed if the fault fails due to
> races. In this case, zswap mustn't drop its authoritative copy.
>
> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> Cc: stable@vger.kernel.org      [6.5+]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

Do we also want to mention somewhere (commit log or comment) that
keeping the entry in the tree is fine because we are still protected
from concurrent loads/invalidations/writeback by swapcache_prepare()
setting SWAP_HAS_CACHE or so?
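
(For context, the fault-side path being referenced looks roughly like this.
This is a simplified sketch of the do_swap_page() swapcache-bypass branch,
with error handling and several checks elided; the exact guards and labels
differ in the real code:)

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/* Only one faulter may proceed with the bypass read. */
		if (swapcache_prepare(entry))	/* sets SWAP_HAS_CACHE */
			goto out_page;		/* raced; fault gets retried */
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
					vmf->address, false);
		swap_read_folio(folio, true, NULL); /* may hit zswap_load() */
		/* ... pte_same() check, install the PTE ... */
		swapcache_clear(si, entry);	/* drop SWAP_HAS_CACHE again */
	}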

Anyway, this LGTM.

Acked-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zswap.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..41a1170f7cfe 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       /*
> +        * When reading into the swapcache, invalidate our entry. The
> +        * swapcache can be the authoritative owner of the page and
> +        * its mappings, and the pressure that results from having two
> +        * in-memory copies outweighs any benefits of caching the
> +        * compression work.
> +        *
> +        * (Most swapins go through the swapcache. The notable
> +        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> +        * files, which reads into a private page and may free it if
> +        * the fault fails. We remain the primary owner of the entry.)
> +        */
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }
>
>         return true;
>  }
> --
> 2.44.0
>



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Chengming Zhou @ 2024-03-25  0:01 UTC
  To: Johannes Weiner, Andrew Morton
  Cc: Zhongkun He, Chengming Zhou, Yosry Ahmed, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

On 2024/3/25 05:04, Johannes Weiner wrote:
> Zhongkun He reports data corruption when combining zswap with zram.
> 
> The issue is the exclusive loads we're doing in zswap. They assume
> that all reads are going into the swapcache, which can assume
> authoritative ownership of the data and so the zswap copy can go.
> 
> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> to bypass the swapcache. This results in an optimistic read of the
> swap data into a page that will be dismissed if the fault fails due to
> races. In this case, zswap mustn't drop its authoritative copy.
> 
> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> Cc: stable@vger.kernel.org	[6.5+]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

Very nice solution!

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks.

> ---
>  mm/zswap.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..41a1170f7cfe 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct page *page = &folio->page;
> +	bool swapcache = folio_test_swapcache(folio);
>  	struct zswap_tree *tree = swap_zswap_tree(swp);
>  	struct zswap_entry *entry;
>  	u8 *dst;
> @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
>  		spin_unlock(&tree->lock);
>  		return false;
>  	}
> -	zswap_rb_erase(&tree->rbroot, entry);
> +	/*
> +	 * When reading into the swapcache, invalidate our entry. The
> +	 * swapcache can be the authoritative owner of the page and
> +	 * its mappings, and the pressure that results from having two
> +	 * in-memory copies outweighs any benefits of caching the
> +	 * compression work.
> +	 *
> +	 * (Most swapins go through the swapcache. The notable
> +	 * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> +	 * files, which reads into a private page and may free it if
> +	 * the fault fails. We remain the primary owner of the entry.)
> +	 */
> +	if (swapcache)
> +		zswap_rb_erase(&tree->rbroot, entry);
>  	spin_unlock(&tree->lock);
>  
>  	if (entry->length)
> @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
>  	if (entry->objcg)
>  		count_objcg_event(entry->objcg, ZSWPIN);
>  
> -	zswap_entry_free(entry);
> -
> -	folio_mark_dirty(folio);
> +	if (swapcache) {
> +		zswap_entry_free(entry);
> +		folio_mark_dirty(folio);
> +	}
>  
>  	return true;
>  }




* Re: [External] [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Zhongkun He @ 2024-03-25  3:01 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Chengming Zhou, Yosry Ahmed, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

On Mon, Mar 25, 2024 at 5:05 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Zhongkun He reports data corruption when combining zswap with zram.
>
> The issue is the exclusive loads we're doing in zswap. They assume
> that all reads are going into the swapcache, which can assume
> authoritative ownership of the data and so the zswap copy can go.
>
> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> to bypass the swapcache. This results in an optimistic read of the
> swap data into a page that will be dismissed if the fault fails due to
> races. In this case, zswap mustn't drop its authoritative copy.
>
> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> Cc: stable@vger.kernel.org      [6.5+]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  mm/zswap.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..41a1170f7cfe 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       /*
> +        * When reading into the swapcache, invalidate our entry. The
> +        * swapcache can be the authoritative owner of the page and
> +        * its mappings, and the pressure that results from having two
> +        * in-memory copies outweighs any benefits of caching the
> +        * compression work.
> +        *
> +        * (Most swapins go through the swapcache. The notable
> +        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> +        * files, which reads into a private page and may free it if
> +        * the fault fails. We remain the primary owner of the entry.)
> +        */
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }
>
>         return true;
>  }
> --
> 2.44.0
>

Good solution; it makes great sense to me.

Thanks.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Barry Song @ 2024-03-25  4:54 UTC
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Zhongkun He, Chengming Zhou,
	Chris Li, Nhat Pham, linux-mm, linux-kernel

On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Zhongkun He reports data corruption when combining zswap with zram.
> >
> > The issue is the exclusive loads we're doing in zswap. They assume
> > that all reads are going into the swapcache, which can assume
> > authoritative ownership of the data and so the zswap copy can go.
> >
> > However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> > to bypass the swapcache. This results in an optimistic read of the
> > swap data into a page that will be dismissed if the fault fails due to
> > races. In this case, zswap mustn't drop its authoritative copy.
> >
> > Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> > Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > Cc: stable@vger.kernel.org      [6.5+]
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

Acked-by: Barry Song <baohua@kernel.org>

>
> Do we also want to mention somewhere (commit log or comment) that
> keeping the entry in the tree is fine because we are still protected
> from concurrent loads/invalidations/writeback by swapcache_prepare()
> setting SWAP_HAS_CACHE or so?

It seems that Kairui's patch comprehensively addresses the issue at hand.
Johannes's solution, on the other hand, appears to align zswap behavior
more closely with that of a traditional swap device, only releasing an entry
when the corresponding swap slot is freed, particularly in the sync-io case.

Johannes' patch has inspired me to consider whether zRAM could achieve
a comparable outcome by immediately releasing objects in swap cache
scenarios.  When I have the opportunity, I plan to experiment with zRAM.

>
> Anyway, this LGTM.
>
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
>
> > ---
> >  mm/zswap.c | 23 +++++++++++++++++++----
> >  1 file changed, 19 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 535c907345e0..41a1170f7cfe 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct page *page = &folio->page;
> > +       bool swapcache = folio_test_swapcache(folio);
> >         struct zswap_tree *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry;
> >         u8 *dst;
> > @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
> >                 spin_unlock(&tree->lock);
> >                 return false;
> >         }
> > -       zswap_rb_erase(&tree->rbroot, entry);
> > +       /*
> > +        * When reading into the swapcache, invalidate our entry. The
> > +        * swapcache can be the authoritative owner of the page and
> > +        * its mappings, and the pressure that results from having two
> > +        * in-memory copies outweighs any benefits of caching the
> > +        * compression work.
> > +        *
> > +        * (Most swapins go through the swapcache. The notable
> > +        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> > +        * files, which reads into a private page and may free it if
> > +        * the fault fails. We remain the primary owner of the entry.)
> > +        */
> > +       if (swapcache)
> > +               zswap_rb_erase(&tree->rbroot, entry);
> >         spin_unlock(&tree->lock);
> >
> >         if (entry->length)
> > @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
> >         if (entry->objcg)
> >                 count_objcg_event(entry->objcg, ZSWPIN);
> >
> > -       zswap_entry_free(entry);
> > -
> > -       folio_mark_dirty(folio);
> > +       if (swapcache) {
> > +               zswap_entry_free(entry);
> > +               folio_mark_dirty(folio);
> > +       }
> >
> >         return true;
> >  }
> > --
> > 2.44.0

Thanks
Barry



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-25  7:06 UTC
  To: Barry Song
  Cc: Johannes Weiner, Andrew Morton, Zhongkun He, Chengming Zhou,
	Chris Li, Nhat Pham, linux-mm, linux-kernel, Kairui Song

On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Zhongkun He reports data corruption when combining zswap with zram.
> > >
> > > The issue is the exclusive loads we're doing in zswap. They assume
> > > that all reads are going into the swapcache, which can assume
> > > authoritative ownership of the data and so the zswap copy can go.
> > >
> > > However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> > > to bypass the swapcache. This results in an optimistic read of the
> > > swap data into a page that will be dismissed if the fault fails due to
> > > races. In this case, zswap mustn't drop its authoritative copy.
> > >
> > > Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> > > Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > > Cc: stable@vger.kernel.org      [6.5+]
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>
> Acked-by: Barry Song <baohua@kernel.org>
>
> >
> > Do we also want to mention somewhere (commit log or comment) that
> > keeping the entry in the tree is fine because we are still protected
> > from concurrent loads/invalidations/writeback by swapcache_prepare()
> > setting SWAP_HAS_CACHE or so?
>
> It seems that Kairui's patch comprehensively addresses the issue at hand.
> Johannes's solution, on the other hand, appears to align zswap behavior
> more closely with that of a traditional swap device, only releasing an entry
> when the corresponding swap slot is freed, particularly in the sync-io case.

It actually worked out quite well that Kairui's fix landed shortly
before this bug was reported, as this fix wouldn't have been possible
without it as far as I can tell.

>
> Johannes' patch has inspired me to consider whether zRAM could achieve
> a comparable outcome by immediately releasing objects in swap cache
> scenarios.  When I have the opportunity, I plan to experiment with zRAM.

That would be interesting. I am curious if it would be as
straightforward in zram to just mark the folio as dirty in this case
like zswap does, given its implementation as a block device.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Chengming Zhou @ 2024-03-25  7:33 UTC
  To: Yosry Ahmed, Barry Song
  Cc: Johannes Weiner, Andrew Morton, Zhongkun He, Chengming Zhou,
	Chris Li, Nhat Pham, linux-mm, linux-kernel, Kairui Song

On 2024/3/25 15:06, Yosry Ahmed wrote:
> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>
>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>
>>>> Zhongkun He reports data corruption when combining zswap with zram.
>>>>
>>>> The issue is the exclusive loads we're doing in zswap. They assume
>>>> that all reads are going into the swapcache, which can assume
>>>> authoritative ownership of the data and so the zswap copy can go.
>>>>
>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
>>>> to bypass the swapcache. This results in an optimistic read of the
>>>> swap data into a page that will be dismissed if the fault fails due to
>>>> races. In this case, zswap mustn't drop its authoritative copy.
>>>>
>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
>>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
>>>> Cc: stable@vger.kernel.org      [6.5+]
>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>
>> Acked-by: Barry Song <baohua@kernel.org>
>>
>>>
>>> Do we also want to mention somewhere (commit log or comment) that
>>> keeping the entry in the tree is fine because we are still protected
>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
>>> setting SWAP_HAS_CACHE or so?
>>
>> It seems that Kairui's patch comprehensively addresses the issue at hand.
>> Johannes's solution, on the other hand, appears to align zswap behavior
>> more closely with that of a traditional swap device, only releasing an entry
>> when the corresponding swap slot is freed, particularly in the sync-io case.
> 
> It actually worked out quite well that Kairui's fix landed shortly
> before this bug was reported, as this fix wouldn't have been possible
> without it as far as I can tell.
> 
>>
>> Johannes' patch has inspired me to consider whether zRAM could achieve
>> a comparable outcome by immediately releasing objects in swap cache
>> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
> 
> That would be interesting. I am curious if it would be as
> straightforward in zram to just mark the folio as dirty in this case
> like zswap does, given its implementation as a block device.
> 

This makes me wonder who is responsible for marking folio dirty in this swapcache
bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?




* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-25  8:38 UTC
  To: Chengming Zhou
  Cc: Barry Song, Johannes Weiner, Andrew Morton, Zhongkun He,
	Chengming Zhou, Chris Li, Nhat Pham, linux-mm, linux-kernel,
	Kairui Song

On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou
<chengming.zhou@linux.dev> wrote:
>
> On 2024/3/25 15:06, Yosry Ahmed wrote:
> > On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>
> >>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>>>
> >>>> Zhongkun He reports data corruption when combining zswap with zram.
> >>>>
> >>>> The issue is the exclusive loads we're doing in zswap. They assume
> >>>> that all reads are going into the swapcache, which can assume
> >>>> authoritative ownership of the data and so the zswap copy can go.
> >>>>
> >>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> >>>> to bypass the swapcache. This results in an optimistic read of the
> >>>> swap data into a page that will be dismissed if the fault fails due to
> >>>> races. In this case, zswap mustn't drop its authoritative copy.
> >>>>
> >>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> >>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> >>>> Cc: stable@vger.kernel.org      [6.5+]
> >>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>
> >> Acked-by: Barry Song <baohua@kernel.org>
> >>
> >>>
> >>> Do we also want to mention somewhere (commit log or comment) that
> >>> keeping the entry in the tree is fine because we are still protected
> >>> from concurrent loads/invalidations/writeback by swapcache_prepare()
> >>> setting SWAP_HAS_CACHE or so?
> >>
> >> It seems that Kairui's patch comprehensively addresses the issue at hand.
> >> Johannes's solution, on the other hand, appears to align zswap behavior
> >> more closely with that of a traditional swap device, only releasing an entry
> >> when the corresponding swap slot is freed, particularly in the sync-io case.
> >
> > It actually worked out quite well that Kairui's fix landed shortly
> > before this bug was reported, as this fix wouldn't have been possible
> > without it as far as I can tell.
> >
> >>
> >> Johannes' patch has inspired me to consider whether zRAM could achieve
> >> a comparable outcome by immediately releasing objects in swap cache
> >> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
> >
> > That would be interesting. I am curious if it would be as
> > straightforward in zram to just mark the folio as dirty in this case
> > like zswap does, given its implementation as a block device.
> >
>
> This makes me wonder who is responsible for marking folio dirty in this swapcache
> bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?

In shrink_folio_list(), we try to add anonymous folios to the
swapcache if they are not there before checking if they are dirty.
add_to_swap() calls folio_mark_dirty(), so this should take care of
it. There is an interesting comment there though. It says that PTE
should be dirty, so unmapping the folio should have already marked it
as dirty by the time we are adding it to the swapcache, except for the
MADV_FREE case.

However, I think we actually unmap the folio after we add it to the
swapcache in shrink_folio_list(). Also, I don't immediately see why
the PTE would be dirty. In do_swap_page(), making the PTE dirty seems
to be conditional on the fault being a write fault, but I didn't look
thoroughly, maybe I missed it. It is also possible that the comment is
just outdated.
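
(Roughly, the reclaim-side sequence described above, paraphrased from
shrink_folio_list() and add_to_swap() with everything else elided; the real
code carries more conditions:)

	/* shrink_folio_list(): anon folio not yet in the swapcache */
	if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
	    !folio_test_swapcache(folio)) {
		if (!add_to_swap(folio))  /* allocate slot, add to swapcache */
			goto activate_locked;
	}
	/* ... the dirty check and pageout() happen further down ... */

	/* add_to_swap() unconditionally ends with: */
	folio_mark_dirty(folio);	/* per the in-tree comment: needed at
					 * least for MADV_FREE folios whose
					 * PTE dirty bit was cleared */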



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Chengming Zhou @ 2024-03-25  9:22 UTC
  To: Yosry Ahmed
  Cc: Barry Song, Johannes Weiner, Andrew Morton, Zhongkun He,
	Chengming Zhou, Chris Li, Nhat Pham, linux-mm, linux-kernel,
	Kairui Song

On 2024/3/25 16:38, Yosry Ahmed wrote:
> On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou
> <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/3/25 15:06, Yosry Ahmed wrote:
>>> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>>>
>>>>>> Zhongkun He reports data corruption when combining zswap with zram.
>>>>>>
>>>>>> The issue is the exclusive loads we're doing in zswap. They assume
>>>>>> that all reads are going into the swapcache, which can assume
>>>>>> authoritative ownership of the data and so the zswap copy can go.
>>>>>>
>>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
>>>>>> to bypass the swapcache. This results in an optimistic read of the
>>>>>> swap data into a page that will be dismissed if the fault fails due to
>>>>>> races. In this case, zswap mustn't drop its authoritative copy.
>>>>>>
>>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
>>>>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
>>>>>> Cc: stable@vger.kernel.org      [6.5+]
>>>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>>>
>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>
>>>>>
>>>>> Do we also want to mention somewhere (commit log or comment) that
>>>>> keeping the entry in the tree is fine because we are still protected
>>>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
>>>>> setting SWAP_HAS_CACHE or so?
>>>>
>>>> It seems that Kairui's patch comprehensively addresses the issue at hand.
>>>> Johannes's solution, on the other hand, appears to align zswap behavior
>>>> more closely with that of a traditional swap device, only releasing an entry
>>>> when the corresponding swap slot is freed, particularly in the sync-io case.
>>>
>>> It actually worked out quite well that Kairui's fix landed shortly
>>> before this bug was reported, as this fix wouldn't have been possible
>>> without it as far as I can tell.
>>>
>>>>
>>>> Johannes' patch has inspired me to consider whether zRAM could achieve
>>>> a comparable outcome by immediately releasing objects in swap cache
>>>> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
>>>
>>> That would be interesting. I am curious if it would be as
>>> straightforward in zram to just mark the folio as dirty in this case
>>> like zswap does, given its implementation as a block device.
>>>
>>
>> This makes me wonder who is responsible for marking folio dirty in this swapcache
>> bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?
> 
> In shrink_folio_list(), we try to add anonymous folios to the
> swapcache if they are not there before checking if they are dirty.
> add_to_swap() calls folio_mark_dirty(), so this should take care of

Right, thanks for your clarification, so there should be no problem here,
although it was a fix just for the MADV_FREE case.

> it. There is an interesting comment there though. It says that PTE
> should be dirty, so unmapping the folio should have already marked it
> as dirty by the time we are adding it to the swapcache, except for the
> MADV_FREE case.

It seems to say the folio will be dirtied when it is unmapped later,
supposing the PTE is dirty.

> 
> However, I think we actually unmap the folio after we add it to the
> swapcache in shrink_folio_list(). Also, I don't immediately see why
> the PTE would be dirty. In do_swap_page(), making the PTE dirty seems

If all anon pages on the LRU list are faulted in by write, it should be true.
We could just use the zero page if faulted by read, right?

> to be conditional on the fault being a write fault, but I didn't look
> thoroughly, maybe I missed it. It is also possible that the comment is
> just outdated.

Yeah, the PTE is only marked dirty on a write fault.

Thanks.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-25  9:40 UTC
  To: Chengming Zhou
  Cc: Barry Song, Johannes Weiner, Andrew Morton, Zhongkun He,
	Chengming Zhou, Chris Li, Nhat Pham, linux-mm, linux-kernel,
	Kairui Song

On Mon, Mar 25, 2024 at 2:22 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/3/25 16:38, Yosry Ahmed wrote:
> > On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou
> > <chengming.zhou@linux.dev> wrote:
> >>
> >> On 2024/3/25 15:06, Yosry Ahmed wrote:
> >>> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>>
> >>>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>>>>>
> >>>>>> Zhongkun He reports data corruption when combining zswap with zram.
> >>>>>>
> >>>>>> The issue is the exclusive loads we're doing in zswap. They assume
> >>>>>> that all reads are going into the swapcache, which can assume
> >>>>>> authoritative ownership of the data and so the zswap copy can go.
> >>>>>>
> >>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> >>>>>> to bypass the swapcache. This results in an optimistic read of the
> >>>>>> swap data into a page that will be dismissed if the fault fails due to
> >>>>>> races. In this case, zswap mustn't drop its authoritative copy.
> >>>>>>
> >>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> >>>>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> >>>>>> Cc: stable@vger.kernel.org      [6.5+]
> >>>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>>>
> >>>> Acked-by: Barry Song <baohua@kernel.org>
> >>>>
> >>>>>
> >>>>> Do we also want to mention somewhere (commit log or comment) that
> >>>>> keeping the entry in the tree is fine because we are still protected
> >>>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
> >>>>> setting SWAP_HAS_CACHE or so?
> >>>>
> >>>> It seems that Kairui's patch comprehensively addresses the issue at hand.
> >>>> Johannes's solution, on the other hand, appears to align zswap behavior
> >>>> more closely with that of a traditional swap device, only releasing an entry
> >>>> when the corresponding swap slot is freed, particularly in the sync-io case.
> >>>
> >>> It actually worked out quite well that Kairui's fix landed shortly
> >>> before this bug was reported, as this fix wouldn't have been possible
> >>> without it as far as I can tell.
> >>>
> >>>>
> >>>> Johannes' patch has inspired me to consider whether zRAM could achieve
> >>>> a comparable outcome by immediately releasing objects in swap cache
> >>>> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
> >>>
> >>> That would be interesting. I am curious if it would be as
> >>> straightforward in zram to just mark the folio as dirty in this case
> >>> like zswap does, given its implementation as a block device.
> >>>
> >>
> >> This makes me wonder who is responsible for marking folio dirty in this swapcache
> >> bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?
> >
> > In shrink_folio_list(), we try to add anonymous folios to the
> > swapcache if they are not there before checking if they are dirty.
> > add_to_swap() calls folio_mark_dirty(), so this should take care of
>
> Right, thanks for your clarification, so there should be no problem here,
> although it was a fix just for the MADV_FREE case.
>
> > it. There is an interesting comment there though. It says that PTE
> > should be dirty, so unmapping the folio should have already marked it
> > as dirty by the time we are adding it to the swapcache, except for the
> > MADV_FREE case.
>
> It seems to say the folio will be dirtied when it is unmapped later,
> supposing the PTE is dirty.

Oh yeah, it could mean that the folio will be dirtied later.

>
> >
> > However, I think we actually unmap the folio after we add it to the
> > swapcache in shrink_folio_list(). Also, I don't immediately see why
> > the PTE would be dirty. In do_swap_page(), making the PTE dirty seems
>
> If all anon pages on the LRU list are faulted in by write, it should be true.
> We could just use the zero page if faulted by read, right?

This applies for the initial fault that creates the folio, but this is
a swap fault. It could be a read fault and in that case we still need
to make the folio dirty because it's not in the swapcache and we need
to write it out if it's reclaimed, right?
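
(Schematically, the conditional dirtying in do_swap_page() referred to here,
with the surrounding checks elided:)

	pte = mk_pte(page, vma->vm_page_prot);
	if (vmf->flags & FAULT_FLAG_WRITE) {
		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
		/* ... */
	}
	/* a read fault installs a clean PTE */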

>
> > to be conditional on the fault being a write fault, but I didn't look
> > thoroughly, maybe I missed it. It is also possible that the comment is
> > just outdated.
>
> Yeah, the PTE is only marked dirty on a write fault.
>
> Thanks.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Chengming Zhou @ 2024-03-25  9:46 UTC
  To: Yosry Ahmed
  Cc: Barry Song, Johannes Weiner, Andrew Morton, Zhongkun He,
	Chengming Zhou, Chris Li, Nhat Pham, linux-mm, linux-kernel,
	Kairui Song

On 2024/3/25 17:40, Yosry Ahmed wrote:
> On Mon, Mar 25, 2024 at 2:22 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/3/25 16:38, Yosry Ahmed wrote:
>>> On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou
>>> <chengming.zhou@linux.dev> wrote:
>>>>
>>>> On 2024/3/25 15:06, Yosry Ahmed wrote:
>>>>> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>>>
>>>>>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>>>>>
>>>>>>>> Zhongkun He reports data corruption when combining zswap with zram.
>>>>>>>>
>>>>>>>> The issue is the exclusive loads we're doing in zswap. They assume
>>>>>>>> that all reads are going into the swapcache, which can assume
>>>>>>>> authoritative ownership of the data and so the zswap copy can go.
>>>>>>>>
>>>>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
>>>>>>>> to bypass the swapcache. This results in an optimistic read of the
>>>>>>>> swap data into a page that will be dismissed if the fault fails due to
>>>>>>>> races. In this case, zswap mustn't drop its authoritative copy.
>>>>>>>>
>>>>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
>>>>>>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>>>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
>>>>>>>> Cc: stable@vger.kernel.org      [6.5+]
>>>>>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>>>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>>>>>>
>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>
>>>>>>>
>>>>>>> Do we also want to mention somewhere (commit log or comment) that
>>>>>>> keeping the entry in the tree is fine because we are still protected
>>>>>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
>>>>>>> setting SWAP_HAS_CACHE or so?
>>>>>>
>>>>>> It seems that Kairui's patch comprehensively addresses the issue at hand.
>>>>>> Johannes's solution, on the other hand, appears to align zswap behavior
>>>>>> more closely with that of a traditional swap device, only releasing an entry
>>>>>> when the corresponding swap slot is freed, particularly in the sync-io case.
>>>>>
>>>>> It actually worked out quite well that Kairui's fix landed shortly
>>>>> before this bug was reported, as this fix wouldn't have been possible
>>>>> without it as far as I can tell.
>>>>>
>>>>>>
>>>>>> Johannes' patch has inspired me to consider whether zRAM could achieve
>>>>>> a comparable outcome by immediately releasing objects in swap cache
>>>>>> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
>>>>>
>>>>> That would be interesting. I am curious if it would be as
>>>>> straightforward in zram to just mark the folio as dirty in this case
>>>>> like zswap does, given its implementation as a block device.
>>>>>
>>>>
>>>> This makes me wonder who is responsible for marking folio dirty in this swapcache
>>>> bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?
>>>
>>> In shrink_folio_list(), we try to add anonymous folios to the
>>> swapcache if they are not there before checking if they are dirty.
>>> add_to_swap() calls folio_mark_dirty(), so this should take care of
>>
>> Right, thanks for your clarification, so there should be no problem here,
>> although it was a fix just for the MADV_FREE case.
>>
>>> it. There is an interesting comment there though. It says that PTE
>>> should be dirty, so unmapping the folio should have already marked it
>>> as dirty by the time we are adding it to the swapcache, except for the
>>> MADV_FREE case.
>>
>> It seems to say the folio will be dirtied when it is unmapped later,
>> supposing the PTE is dirty.
> 
> Oh yeah, it could mean that the folio will be dirtied later.
> 
>>
>>>
>>> However, I think we actually unmap the folio after we add it to the
>>> swapcache in shrink_folio_list(). Also, I don't immediately see why
>>> the PTE would be dirty. In do_swap_page(), making the PTE dirty seems
>>
>> If all anon pages on the LRU list are faulted in by write, it should be true.
>> We could just use the zero page if faulted by read, right?
> 
> This applies for the initial fault that creates the folio, but this is
> a swap fault. It could be a read fault and in that case we still need
> to make the folio dirty because it's not in the swapcache and we need
> to write it out if it's reclaimed, right?

Yes, IMHO it should be marked as dirty here.

But there should be no problem given that unconditional folio_mark_dirty()
in add_to_swap(). Not sure if there are other issues.

> 
>>
>>> to be conditional on the fault being a write fault, but I didn't look
>>> thoroughly, maybe I missed it. It is also possible that the comment is
>>> just outdated.
>>
>> Yeah, the PTE is only marked dirty on a write fault.
>>
>> Thanks.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Johannes Weiner @ 2024-03-25 16:30 UTC
  To: Yosry Ahmed
  Cc: Andrew Morton, Zhongkun He, Chengming Zhou, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

On Sun, Mar 24, 2024 at 02:22:46PM -0700, Yosry Ahmed wrote:
> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Zhongkun He reports data corruption when combining zswap with zram.
> >
> > The issue is the exclusive loads we're doing in zswap. They assume
> > that all reads are going into the swapcache, which can assume
> > authoritative ownership of the data and so the zswap copy can go.
> >
> > However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> > to bypass the swapcache. This results in an optimistic read of the
> > swap data into a page that will be dismissed if the fault fails due to
> > races. In this case, zswap mustn't drop its authoritative copy.
> >
> > Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> > Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > Cc: stable@vger.kernel.org      [6.5+]
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> 
> Do we also want to mention somewhere (commit log or comment) that
> keeping the entry in the tree is fine because we are still protected
> from concurrent loads/invalidations/writeback by swapcache_prepare()
> setting SWAP_HAS_CACHE or so?

I don't think it's necessary, as zswap isn't doing anything special
here. It's up to the caller to follow the generic swap exclusion
protocol that zswap also adheres to. So IMO the relevant comment
should be, and is, above that swapcache_prepare() in do_swap_page().
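
(The gist of that comment and the call it annotates, paraphrased from
do_swap_page() rather than quoted verbatim:)

	/*
	 * Prevent a parallel swapin from proceeding: another thread could
	 * otherwise finish the swapin first, free the entry, and the slot
	 * could then be reused for a new swapout -- undetectable, since
	 * pte_same() would still succeed on the reused entry.
	 */
	if (swapcache_prepare(entry)) {
		/* relax a bit to prevent rapid repeated page faults */
		schedule_timeout_uninterruptible(1);
		goto out_page;
	}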

> Anyway, this LGTM.
> 
> Acked-by: Yosry Ahmed <yosryahmed@google.com>

Thanks!



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Nhat Pham @ 2024-03-25 17:09 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Zhongkun He, Chengming Zhou, Yosry Ahmed,
	Barry Song, Chris Li, linux-mm, linux-kernel

On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Zhongkun He reports data corruption when combining zswap with zram.
>
> The issue is the exclusive loads we're doing in zswap. They assume
> that all reads are going into the swapcache, which can assume
> authoritative ownership of the data and so the zswap copy can go.
>
> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> to bypass the swapcache. This results in an optimistic read of the
> swap data into a page that will be dismissed if the fault fails due to
> races. In this case, zswap mustn't drop its authoritative copy.
>
> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> Cc: stable@vger.kernel.org      [6.5+]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  mm/zswap.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..41a1170f7cfe 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       /*
> +        * When reading into the swapcache, invalidate our entry. The
> +        * swapcache can be the authoritative owner of the page and
> +        * its mappings, and the pressure that results from having two
> +        * in-memory copies outweighs any benefits of caching the
> +        * compression work.
> +        *
> +        * (Most swapins go through the swapcache. The notable
> +        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> +        * files, which reads into a private page and may free it if
> +        * the fault fails. We remain the primary owner of the entry.)
> +        */
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }

This LGTM!

Reviewed-by: Nhat Pham <nphamcs@gmail.com>



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-25 18:35 UTC
  To: Chengming Zhou
  Cc: Barry Song, Johannes Weiner, Andrew Morton, Zhongkun He,
	Chengming Zhou, Chris Li, Nhat Pham, linux-mm, linux-kernel,
	Kairui Song, Minchan Kim

On Mon, Mar 25, 2024 at 2:46 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/3/25 17:40, Yosry Ahmed wrote:
> > On Mon, Mar 25, 2024 at 2:22 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >>
> >> On 2024/3/25 16:38, Yosry Ahmed wrote:
> >>> On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou
> >>> <chengming.zhou@linux.dev> wrote:
> >>>>
> >>>> On 2024/3/25 15:06, Yosry Ahmed wrote:
> >>>>> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>>>>
> >>>>>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>>>>>>>
> >>>>>>>> Zhongkun He reports data corruption when combining zswap with zram.
> >>>>>>>>
> >>>>>>>> The issue is the exclusive loads we're doing in zswap. They assume
> >>>>>>>> that all reads are going into the swapcache, which can assume
> >>>>>>>> authoritative ownership of the data and so the zswap copy can go.
> >>>>>>>>
> >>>>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> >>>>>>>> to bypass the swapcache. This results in an optimistic read of the
> >>>>>>>> swap data into a page that will be dismissed if the fault fails due to
> >>>>>>>> races. In this case, zswap mustn't drop its authoritative copy.
> >>>>>>>>
> >>>>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> >>>>>>>> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>>>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> >>>>>>>> Cc: stable@vger.kernel.org      [6.5+]
> >>>>>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>>>>>> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >>>>>>
> >>>>>> Acked-by: Barry Song <baohua@kernel.org>
> >>>>>>
> >>>>>>>
> >>>>>>> Do we also want to mention somewhere (commit log or comment) that
> >>>>>>> keeping the entry in the tree is fine because we are still protected
> >>>>>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
> >>>>>>> setting SWAP_HAS_CACHE or so?
> >>>>>>
> >>>>>> It seems that Kairui's patch comprehensively addresses the issue at hand.
> >>>>>> Johannes's solution, on the other hand, appears to align zswap behavior
> >>>>>> more closely with that of a traditional swap device, only releasing an entry
> >>>>>> when the corresponding swap slot is freed, particularly in the sync-io case.
> >>>>>
> >>>>> It actually worked out quite well that Kairui's fix landed shortly
> >>>>> before this bug was reported, as this fix wouldn't have been possible
> >>>>> without it as far as I can tell.
> >>>>>
> >>>>>>
> >>>>>> Johannes' patch has inspired me to consider whether zRAM could achieve
> >>>>>> a comparable outcome by immediately releasing objects in swap cache
> >>>>>> scenarios.  When I have the opportunity, I plan to experiment with zRAM.
> >>>>>
> >>>>> That would be interesting. I am curious if it would be as
> >>>>> straightforward in zram to just mark the folio as dirty in this case
> >>>>> like zswap does, given its implementation as a block device.
> >>>>>
> >>>>
> >>>> This makes me wonder who is responsible for marking folio dirty in this swapcache
> >>>> bypass case? Should we call folio_mark_dirty() after the swap_read_folio()?
> >>>
> >>> In shrink_folio_list(), we try to add anonymous folios to the
> >>> swapcache if they are not there before checking if they are dirty.
> >>> add_to_swap() calls folio_mark_dirty(), so this should take care of
> >>
> >> Right, thanks for your clarification, so there should be no problem here,
> >> although it was a fix just for the MADV_FREE case.
> >>
> >>> it. There is an interesting comment there though. It says that PTE
> >>> should be dirty, so unmapping the folio should have already marked it
> >>> as dirty by the time we are adding it to the swapcache, except for the
> >>> MADV_FREE case.
> >>
> >> It seems to say the folio will be dirtied when it is unmapped later,
> >> supposing the PTE is dirty.
> >
> > Oh yeah, it could mean that the folio will be dirtied later.
> >
> >>
> >>>
> >>> However, I think we actually unmap the folio after we add it to the
> >>> swapcache in shrink_folio_list(). Also, I don't immediately see why
> >>> the PTE would be dirty. In do_swap_page(), making the PTE dirty seems
> >>
> >> If all anon pages on the LRU list are faulted in by write, it should be true.
> >> We could just use the zero page if faulted by read, right?
> >
> > This applies for the initial fault that creates the folio, but this is
> > a swap fault. It could be a read fault and in that case we still need
> > to make the folio dirty because it's not in the swapcache and we need
> > to write it out if it's reclaimed, right?
>
> Yes, IMHO it should be marked as dirty here.
>
> But there should be no problem given that unconditional folio_mark_dirty()
> in add_to_swap(). Not sure if there are other issues.

I don't believe there are any issues now. Dirtying the folio in
add_to_swap() was introduced before SWP_SYNCHRONOUS_IO, so I guess
things have always worked.

I think we should update the comment there though to mention that
dirtying the folio is also needed for this case (not just MADV_FREE),
or dirty the PTE during the fault. Otherwise, if someone is making
MADV_FREE changes they could end up breaking SWP_SYNCHRONOUS_IO
faults.

Adding Minchan here in case he can confirm that we in fact rely on
add_to_swap()->folio_mark_dirty() for SWP_SYNCHRONOUS_IO to work as
intended.



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Yosry Ahmed @ 2024-03-25 18:41 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Zhongkun He, Chengming Zhou, Barry Song, Chris Li,
	Nhat Pham, linux-mm, linux-kernel

On Mon, Mar 25, 2024 at 9:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Mar 24, 2024 at 02:22:46PM -0700, Yosry Ahmed wrote:
> > On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Zhongkun He reports data corruption when combining zswap with zram.
> > >
> > > The issue is the exclusive loads we're doing in zswap. They assume
> > > that all reads are going into the swapcache, which can assume
> > > authoritative ownership of the data and so the zswap copy can go.
> > >
> > > However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> > > to bypass the swapcache. This results in an optimistic read of the
> > > swap data into a page that will be dismissed if the fault fails due to
> > > races. In this case, zswap mustn't drop its authoritative copy.
> > >
> > > Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> > > Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > > Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> > > Cc: stable@vger.kernel.org      [6.5+]
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >
> > Do we also want to mention somewhere (commit log or comment) that
> > keeping the entry in the tree is fine because we are still protected
> > from concurrent loads/invalidations/writeback by swapcache_prepare()
> > setting SWAP_HAS_CACHE or so?
>
> I don't think it's necessary, as zswap isn't doing anything special
> here. It's up to the caller to follow the generic swap exclusion
> protocol that zswap also adheres to. So IMO the relevant comment
> should be, and is, above that swapcache_prepare() in do_swap_page().

From the perspective of someone looking at the zswap code, it isn't
immediately clear what protects the zswap entry in the non-exclusive
load case from being freed from under us. At some point we had a
refcount, then we used to remove it from the tree under lock so others
wouldn't have access to it. Now it's less clear because we rely on
protection outside of zswap code.

We also document other places where we rely on the swapcache for
synchronization, so I think it may be worth briefly mentioning this
here as well, especially since in this code we explicitly check for the
folio not being in the swapcache. That said, I don't feel strongly
about it. Tracking down the SWP_SYNCHRONOUS_IO code should eventually
make it clear. Also, the commit log will end up having a link to this
thread anyway so the details are not completely unfindable :)



* Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
From: Chris Li @ 2024-03-25 21:27 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Zhongkun He, Chengming Zhou, Yosry Ahmed,
	Barry Song, Nhat Pham, linux-mm, linux-kernel

On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Zhongkun He reports data corruption when combining zswap with zram.
>
> The issue is the exclusive loads we're doing in zswap. They assume
> that all reads are going into the swapcache, which can assume
> authoritative ownership of the data and so the zswap copy can go.
>
> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> to bypass the swapcache. This results in an optimistic read of the
> swap data into a page that will be dismissed if the fault fails due to
> races. In this case, zswap mustn't drop its authoritative copy.
>
> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> Cc: stable@vger.kernel.org      [6.5+]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

The change looks good to me. Thanks for the cleaner solution.
It conflicts with the zswap rb-tree-to-xarray patch, though.
I will resolve the conflict and resubmit the zswap xarray patch.

Acked-by: Chris Li <chrisl@kernel.org>

Chris
> ---
>  mm/zswap.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..41a1170f7cfe 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,20 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       /*
> +        * When reading into the swapcache, invalidate our entry. The
> +        * swapcache can be the authoritative owner of the page and
> +        * its mappings, and the pressure that results from having two
> +        * in-memory copies outweighs any benefits of caching the
> +        * compression work.
> +        *
> +        * (Most swapins go through the swapcache. The notable
> +        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> +        * files, which reads into a private page and may free it if
> +        * the fault fails. We remain the primary owner of the entry.)
> +        */
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1663,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }
>
>         return true;
>  }
> --
> 2.44.0
>


