From: Chengming Zhou <chengming.zhou@linux.dev>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: hannes@cmpxchg.org, nphamcs@gmail.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: Re: [PATCH] mm/zswap: invalidate old entry when store fail or !zswap_enabled
Date: Tue, 6 Feb 2024 10:23:33 +0800	[thread overview]
Message-ID: <e5315e2d-a03a-4b2f-9e12-1685fa0515e0@linux.dev> (raw)
In-Reply-To: <ZcFne336KJdbrvvS@google.com>

On 2024/2/6 06:55, Yosry Ahmed wrote:
> On Sun, Feb 04, 2024 at 08:34:11AM +0000, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> We may encounter a duplicate entry in zswap_store():
>>
>> 1. A swap slot freed to the per-cpu swap cache doesn't invalidate
>>    its zswap entry, then gets reused. This has been fixed.
>>
>> 2. In the non-exclusive load mode, a swapped-in folio leaves its
>>    zswap entry on the tree, then gets swapped out again. This mode
>>    has been removed.
>>
>> 3. A folio can be dirtied again after zswap_store(), so it needs
>>    to be stored again. This case should be handled correctly.
>>
>> So we must invalidate the old duplicate entry before inserting the
>> new one, which doesn't actually have to be done at the beginning of
>> zswap_store(). Since this is a normal situation, we shouldn't
>> WARN_ON(1) in this case, so delete it. (The WARN_ON(1) seems intended
>> to detect a swap entry use-after-free problem, but it isn't really
>> necessary here.)
>>
>> The upside is that we no longer need to take the tree lock twice in
>> the store success path.
>>
>> Note that we still need to invalidate the old duplicate entry in the
>> store failure path, otherwise the new data in the swapfile could be
>> overwritten by the old data in the zswap pool during LRU writeback.
> 
> I think this may have been introduced by 42c06a0e8ebe ("mm: kill
> frontswap"). Frontswap used to check if the page was present in
> frontswap and invalidate it before calling into zswap, so it would
> invalidate a previously stored page when it is dirtied and swapped out
> again, even if zswap is disabled.
> 
> Johannes, does this sound correct to you? If yes, I think we need a
> proper Fixes tag and a stable backport as this may cause data
> corruption.

I haven't looked into that commit. If this is true, I will add:

Fixes: 42c06a0e8ebe ("mm: kill frontswap")
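
For context, if I remember the removed code correctly, __frontswap_store()
handled duplicates before calling into the backend roughly like this
(paraphrased from memory, not a verbatim quote of the old mm/frontswap.c):

	/*
	 * The per-swapfile bitmap was tested locklessly, and a previously
	 * stored page was invalidated up front, so a dirtied-and-reswapped
	 * page could not be resurrected by writeback even if the new store
	 * was later rejected (e.g. because zswap was disabled).
	 */
	if (__frontswap_test(sis, offset)) {
		__frontswap_clear(sis, offset);
		frontswap_ops->invalidate_page(type, offset);
	}

If so, that is the invalidation this patch restores for the
failure/disabled paths.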

> 
>>
>> We have to do this even when !zswap_enabled, since zswap can be
>> disabled at any time. If the folio was stored successfully before,
>> then gets dirtied again while zswap is disabled, zswap_store() won't
>> invalidate the old duplicate entry. So later LRU writeback may
>> overwrite the new data in the swapfile.
>>
>> This fix is not ideal, since we have to grab the tree lock for the
>> check on every swapout even when zswap is disabled, but it is simple.
> 
> Frontswap had a bitmap that we could query locklessly to find out if
> there is an outdated stored page. I think we can overcome this with the
> xarray: do a lockless lookup first, and only take the lock if there is
> an outdated entry to remove.

Yes, agreed! We can do a lockless lookup once the xarray conversion lands.
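
Roughly something like this (untested sketch; the function name, the
"tree->xa" field and the exact lock/invalidate calls are placeholders
for whatever the xarray conversion ends up using):

	static void zswap_invalidate_stale(struct zswap_tree *tree,
					   pgoff_t offset)
	{
		struct zswap_entry *entry;

		/* Lockless peek: the common case has nothing stored here. */
		if (!xa_load(&tree->xa, offset))
			return;

		/* Slow path: re-check and invalidate under the tree lock. */
		spin_lock(&tree->lock);
		entry = xa_load(&tree->xa, offset);
		if (entry)
			zswap_invalidate_entry(tree, entry);
		spin_unlock(&tree->lock);
	}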

> 
> Meanwhile I am not sure if acquiring the lock on every swapout even with
> zswap disabled is acceptable, but I think it's the simplest fix for now,
> unless we revive the bitmap.

Yeah, it's simple. I think the bitmap won't be needed once we use the xarray.

> 
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 33 +++++++++++++++------------------
>>  1 file changed, 15 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index cd67f7f6b302..0b7599f4116d 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1518,18 +1518,8 @@ bool zswap_store(struct folio *folio)
>>  		return false;
>>  
>>  	if (!zswap_enabled)
>> -		return false;
>> +		goto check_old;
>>  
>> -	/*
>> -	 * If this is a duplicate, it must be removed before attempting to store
>> -	 * it, otherwise, if the store fails the old page won't be removed from
>> -	 * the tree, and it might be written back overriding the new data.
>> -	 */
>> -	spin_lock(&tree->lock);
>> -	entry = zswap_rb_search(&tree->rbroot, offset);
>> -	if (entry)
>> -		zswap_invalidate_entry(tree, entry);
>> -	spin_unlock(&tree->lock);
>>  	objcg = get_obj_cgroup_from_folio(folio);
>>  	if (objcg && !obj_cgroup_may_zswap(objcg)) {
>>  		memcg = get_mem_cgroup_from_objcg(objcg);
>> @@ -1608,15 +1598,11 @@ bool zswap_store(struct folio *folio)
>>  	/* map */
>>  	spin_lock(&tree->lock);
>>  	/*
>> -	 * A duplicate entry should have been removed at the beginning of this
>> -	 * function. Since the swap entry should be pinned, if a duplicate is
>> -	 * found again here it means that something went wrong in the swap
>> -	 * cache.
>> +	 * The folio may have been dirtied again, so invalidate the possibly
>> +	 * stale old entry before inserting this new one.
>>  	 */
>> -	while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
>> -		WARN_ON(1);
>> +	while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST)
>>  		zswap_invalidate_entry(tree, dupentry);
>> -	}
> 
> I always thought the loop here was confusing. We are holding the lock,
> so it should be guaranteed that if we get -EEXIST once and invalidate
> it, we won't find it the next time around.

Ah, right, this is obvious.

> 
> This should really be a cmpxchg operation, which is simple with the
> xarray. We can probably do the same with the rbtree, but perhaps it's
> not worth it if the xarray change is coming soon.
> 
> For now, I think an if condition is clearer:
> 
> if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> 	zswap_invalidate_entry(tree, dupentry);
> 	/* Must succeed, we just removed the dup under the lock */
> 	WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> }

This is clearer; I will change to this version.
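
And for reference, once the xarray lands, the insert plus invalidate-old
could probably collapse into a single replacement, since xa_store()
returns the previous entry at that index. A rough untested sketch (again
assuming a hypothetical tree->xa; zswap_entry_put() stands in for
whatever release helper the conversion keeps, and error handling is
only stubbed out):

	struct zswap_entry *old;

	old = xa_store(&tree->xa, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		/* allocation failure; real code needs an error path here */
	} else if (old) {
		/*
		 * The duplicate is already unlinked by the replacement,
		 * so only dropping its reference is still needed.
		 */
		zswap_entry_put(tree, old);
	}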

Thanks!

> 
>>  	if (entry->length) {
>>  		INIT_LIST_HEAD(&entry->lru);
>>  		zswap_lru_add(&entry->pool->list_lru, entry);
>> @@ -1638,6 +1624,17 @@ bool zswap_store(struct folio *folio)
>>  reject:
>>  	if (objcg)
>>  		obj_cgroup_put(objcg);
>> +check_old:
>> +	/*
>> +	 * If the store fails or zswap is disabled, we must invalidate the
>> +	 * possibly stale entry which was previously stored at this offset.
>> +	 * Otherwise, writeback could overwrite the new data in the swapfile.
>> +	 */
>> +	spin_lock(&tree->lock);
>> +	entry = zswap_rb_search(&tree->rbroot, offset);
>> +	if (entry)
>> +		zswap_invalidate_entry(tree, entry);
>> +	spin_unlock(&tree->lock);
>>  	return false;
>>  
>>  shrink:
>> -- 
>> 2.40.1
>>

