From: Charan Teja Kalla <quic_charante@quicinc.com>
To: Hugh Dickins <hughd@google.com>
Cc: <akpm@linux-foundation.org>, <willy@infradead.org>,
	<markhemm@googlemail.com>, <rientjes@google.com>,
	<surenb@google.com>, <shakeelb@google.com>, <fvdl@google.com>,
	<quic_pkondeti@quicinc.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH V7 2/2] mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem
Date: Mon, 24 Apr 2023 20:34:38 +0530
Message-ID: <eeeba374-9247-96fd-c9f5-8cba8761f1b9@quicinc.com>
In-Reply-To: <2d56e1dd-68b5-c99e-522f-f8dadf6ad69e@google.com>

Thanks Hugh for the valuable comments!!

On 4/21/2023 5:37 AM, Hugh Dickins wrote:
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> I'm sorry, but no, this is not yet ready for primetime. I came here
> expecting to be able just to add a patch on top with small fixes,
> but see today that it needs more than that, and my time has run out.
> 
> Though if Andrew is keen to go ahead with it in 6.4, and add fixes
> on top while it's in rc, that will be okay: except for one small bad
> bug, which must be fixed immediately - "luckily" nobody appears to
> be using or testing this since v5, but it cannot go further as is.
> Willneed is probably fine, but dontneed is not.

@Andrew: I will resend the patch soon with all these comments addressed.
> 
>> ---
>>  mm/shmem.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 116 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 448f393..1af8525 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -40,6 +40,9 @@
>>  #include <linux/fs_parser.h>
>>  #include <linux/swapfile.h>
>>  #include <linux/iversion.h>
>> +#include <linux/mm_inline.h>
>> +#include <linux/fadvise.h>
>> +#include <linux/page_idle.h>
>>  #include "swap.h"
>>  
>>  static struct vfsmount *shm_mnt;
>> @@ -2344,6 +2347,118 @@ static void shmem_set_inode_flags(struct inode *inode, unsigned int fsflags)
>>  #define shmem_initxattrs NULL
>>  #endif
>>  
>> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
>> +				loff_t end, struct list_head *list)
> loff_t? They are pgoff_t.
> 
>> +{
>> +	XA_STATE(xas, &mapping->i_pages, start);
>> +	struct folio *folio;
>> +
>> +	rcu_read_lock();
>> +	xas_for_each(&xas, folio, end) {
>> +		if (xas_retry(&xas, folio))
>> +			continue;
>> +		if (xa_is_value(folio))
>> +			continue;
>> +
>> +		if (!folio_try_get(folio))
>> +			continue;
>> +		if (folio_test_unevictable(folio) || folio_mapped(folio) ||
>> +				folio_isolate_lru(folio)) {
> There is the one small bad bug.  That should say !folio_isolate_lru(folio).
> In v5, it was isolate_lru_page(page), because isolate_lru_page() returned
> 0 for success or -EBUSY for unavailable; whereas folio_isolate_lru(folio)
> is a boolean, returning true if it successfully removed folio from LRU.
> 

That was a bad mistake on my side :(. Thanks a lot for catching it. This
time I will update the patch with unit tests too.
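
For the record, a rough sketch of the corrected check I intend to fold into
shmem_isolate_pages_range() (my reading of the fix described above; the rest
of the loop stays as in the patch):

		if (folio_test_unevictable(folio) || folio_mapped(folio) ||
				!folio_isolate_lru(folio)) {
			/* Could not take the folio off the LRU; skip it. */
			folio_put(folio);
			continue;
		}
		/* Folio is now isolated; drop the ref taken by folio_try_get(). */
		folio_put(folio);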

> The effect of that bug is that in v6 and v7, it has skipped all the folios
> it was expected to be reclaiming; except when one of them happened to be
> off LRU for other reasons (being reclaimed elsewhere, being migrated,
> whatever) - and precisely those folios which were not safe to touch,
> which have often been transferred to a private worklist, are the ones
> which the code below goes on to play with - corrupting either or both
> lists.  (I haven't tried to reproduce that in practice, just saw it
> in the code, and verified with a count that no pages were reclaimed.)

True.
> 
>> +			folio_put(folio);
>> +			continue;
>> +		}
>> +		folio_put(folio);
>> +
>> +		/*
>> +		 * Prepare the folios to be passed to reclaim_pages().
>> +		 * VM can't reclaim a folio unless young bit is
>> +		 * cleared in its flags.
>> +		 */
>> +		folio_clear_referenced(folio);
>> +		folio_test_clear_young(folio);
>> +		list_add(&folio->lru, list);
>> +		if (need_resched()) {
>> +			xas_pause(&xas);
>> +			cond_resched_rcu();
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +}
>> +
>> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
>> +				loff_t end)
> loff_t? They are pgoff_t. And why return an int which is always 0?
> 
>> +{
>> +	LIST_HEAD(folio_list);
>> +
>> +	if (!total_swap_pages || mapping_unevictable(mapping))
>> +		return 0;
>> +
>> +	lru_add_drain();
>> +	shmem_isolate_pages_range(mapping, start, end, &folio_list);
>> +	reclaim_pages(&folio_list);
>> +
>> +	return 0;
>> +}
>> +
>> +static int shmem_fadvise_willneed(struct address_space *mapping,
>> +				 pgoff_t start, pgoff_t long end)
> pgoff_t long? That's a new type to me! Again, why return an int always 0?
> 
Will remove the stray 'long' in the next revision; a sketch of the corrected
helper follows after the quoted hunk below.
>> +{
>> +	struct folio *folio;
>> +	pgoff_t index;
>> +
>> +	xa_for_each_range(&mapping->i_pages, index, folio, start, end) {
>> +		if (!xa_is_value(folio))
>> +			continue;
>> +		folio = shmem_read_folio(mapping, index);
>> +		if (!IS_ERR(folio))
>> +			folio_put(folio);
>> +	}
>> +
>> +	return 0;
>> +}
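
As mentioned above, something like this is what I have in mind (a sketch
only, keeping the body from the patch but with the stray 'long' gone and
the always-zero return value dropped):

static void shmem_fadvise_willneed(struct address_space *mapping,
				   pgoff_t start, pgoff_t end)
{
	struct folio *folio;
	pgoff_t index;

	/* Read back in any ranges of the file that are currently swapped out. */
	xa_for_each_range(&mapping->i_pages, index, folio, start, end) {
		if (!xa_is_value(folio))
			continue;
		folio = shmem_read_folio(mapping, index);
		if (!IS_ERR(folio))
			folio_put(folio);
	}
}

The call site in shmem_fadvise() would then simply invoke it instead of
assigning ret.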
>> +
>> +static int shmem_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>> +{
>> +	loff_t endbyte;
>> +	pgoff_t start_index;
>> +	pgoff_t end_index;
>> +	struct address_space *mapping;
>> +	struct inode *inode = file_inode(file);
>> +	int ret = 0;
>> +
>> +	if (S_ISFIFO(inode->i_mode))
>> +		return -ESPIPE;
>> +
>> +	mapping = file->f_mapping;
>> +	if (!mapping || len < 0 || !shmem_mapping(mapping))
>> +		return -EINVAL;
>> +
>> +	endbyte = fadvise_calc_endbyte(offset, len);
>> +
>> +	start_index = offset >> PAGE_SHIFT;
>> +	end_index   = endbyte >> PAGE_SHIFT;
>> +	switch (advice) {
>> +	case POSIX_FADV_DONTNEED:
> This is where I ran out of time.  I'm afraid all the focus on
> fadvise_calc_endbyte() has distracted you from looking at the DONTNEED
> in mm/fadvise.c: where there are detailed comments on why and how it
> then narrows the DONTNEED range.  And aside from needing to duplicate
> that here for shmem (or put it into another or combined helper), it
> implies to me that shmem_isolate_pages_range() needs to do a similar
> narrowing, when it finds that the range overlaps part of a large folio.
> 
Sure, will include those range calculations for shmem pages too.
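
To make sure I carry it over correctly, this is roughly the narrowing I have
in mind for the DONTNEED case (modelled on the POSIX_FADV_DONTNEED handling
in mm/fadvise.c, so only full pages are discarded and partial head/tail
pages are preserved; the exact shape may change in the next revision):

	case POSIX_FADV_DONTNEED:
		/* Operate only on full pages; keep partial head/tail pages. */
		start_index = (offset + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
		end_index   = endbyte >> PAGE_SHIFT;
		/*
		 * If endbyte is not the last byte of a page, the final page
		 * is only partially covered, so leave it alone.
		 */
		if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK)
			end_index--;

		if (end_index >= start_index)
			shmem_fadvise_dontneed(mapping, start_index, end_index);
		break;

And shmem_isolate_pages_range() would need a similar check where the range
overlaps only part of a large folio.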

> Something that has crossed my mind as a worry, but I've not had time
> to look further into (maybe it's no concern at all) is the question
> of this syscall temporarily isolating a very large number of folios,
> whether they need to be (or perhaps already are) counted in
> NR_ISOLATED_ANON, whether too many isolated needs to be limited.

They are _not_ counted in NR_ISOLATED_ANON now, since this operation
isolates them only for a short duration. I do see there are
too_many_isolated() checks in the direct reclaim/compaction paths, where it
is necessary to stop multiple processes in direct reclaim from isolating
too many pages.

I am not able to envisage such a problem here, where usually a single
process does the fadvise operation on a file. Even if the file is opened by
multiple processes that all issue fadvise, the operation is limited to the
pages of this file and doesn't impact the rest of the system.

Please let me know if I'm missing something and should be counting these in
NR_ISOLATED_ANON.

> 
>> +		ret = shmem_fadvise_dontneed(mapping, start_index, end_index);
>> +		break;
>> +	case POSIX_FADV_WILLNEED:
>> +		ret = shmem_fadvise_willneed(mapping, start_index, end_index);
>> +		break;
>> +	case POSIX_FADV_NORMAL:
>> +	case POSIX_FADV_RANDOM:
>> +	case POSIX_FADV_SEQUENTIAL:
>> +	case POSIX_FADV_NOREUSE:
>> +		/*
>> +		 * No bad return value, but ignore advice.
>> +		 */
>> +		break;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
>>  				     struct inode *dir, umode_t mode, dev_t dev,
>>  				     unsigned long flags)
>> @@ -3942,6 +4057,7 @@ static const struct file_operations shmem_file_operations = {
>>  	.splice_write	= iter_file_splice_write,
>>  	.fallocate	= shmem_fallocate,
>>  #endif
>> +	.fadvise	= shmem_fadvise,
> I'd say posix_fadvise() is an operation on an fd, and shmem_fadvise() and
> all its helpers should be under CONFIG_TMPFS (but oftentimes I do think
> CONFIG_TMPFS and CONFIG_SHMEM are more trouble than they are worth).

Sure, will move shmem_fadvise() and its helpers under CONFIG_TMPFS.
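Roughly, in shmem_file_operations (a sketch only; the helpers would be
wrapped in the same #ifdef):

#ifdef CONFIG_TMPFS
	.fadvise	= shmem_fadvise,
#endif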
> 
> Hugh
> 
