linux-mm.kvack.org archive mirror
* Re: [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read
       [not found]   ` <20201104213005.GB3365678@moria.home.lan>
@ 2020-11-05  4:52     ` Matthew Wilcox
       [not found]     ` <20201105001302.GF17076@casper.infradead.org>
  1 sibling, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2020-11-05  4:52 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-fsdevel, linux-mm, hch

On Wed, Nov 04, 2020 at 04:30:05PM -0500, Kent Overstreet wrote:
> On Wed, Nov 04, 2020 at 08:42:03PM +0000, Matthew Wilcox (Oracle) wrote:
> > Increasing the batch size runs into diminishing returns.  It's probably
> > better to make, eg, three calls to filemap_get_pages() than it is to
> > call into kmalloc().
> 
> I have to disagree. Working with PAGEVEC_SIZE pages is eventually going to be
> like working with 4k pages today, and have you actually read the slub code for
> the kmalloc fast path? It's _really_ fast, there's no atomic operations and it
> doesn't even have to disable preemption - which is why you never see it showing
> up in profiles ever since we switched to slub.

I've been puzzling over this, and trying to run some benchmarks to figure
it out.  My test VM is too noisy though; the error bars are too large to
get solid data.

There are three reasons why I think we hit diminishing returns:

1. Cost of going into the slab allocator (one alloc, one free).
Maybe that's not as high as I think it is.

2. Let's say the per-page overhead of walking i_pages is 10% of the
CPU time for a 128kB I/O with a batch size of 1.  Increasing the batch
size to 15 means we walk the array 3 times instead of 32 times, or 0.7%
of the CPU time -- total reduction in CPU time of 9.3%.  Increasing the
batch size to 32 means we only walk the array once, which cuts it down
from 10% to 0.3% -- reduction in CPU time of 9.7%.

If we are doing 2MB I/Os (and most applications I've looked at recently
only do 128kB), and the 10% remains constant, then the batch-size-15
case walks the tree 35 times instead of 512 times -- 0.7%, whereas the
batch-size-512 case walks the tree once -- 0.02%.  But that only looks
like an overall saving of 9.98% versus 9.3%.  And is an extra 0.7%
saving worth it?
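As a sanity check on the arithmetic above: the number of passes is just a
ceiling division (a 128kB I/O is 32 4kB pages, a 2MB I/O is 512 pages).
The `walks()` helper below is purely illustrative, not kernel code:

```c
/* Hypothetical helper: how many batched passes are needed to cover
 * `pages` pages when each pass can return at most `batch` pages.
 * This is a plain ceiling division. */
static unsigned long walks(unsigned long pages, unsigned long batch)
{
	return (pages + batch - 1) / batch;
}
```

So a 128kB I/O needs 3 passes at batch size 15 and 1 pass at batch size
32, while a 2MB I/O needs 35 passes at batch size 15 and 1 pass at batch
size 512.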

3. By the time we're doing such large I/Os, we're surely dominated by
memcpy() and not walking the tree.  Even if the file you're working on
is a terabyte in size, the radix tree is only 5 layers deep.  So that's
five pointer dereferences to find the struct page, and they should stay
in cache (maybe they'd fall out to L2, but surely not as far as L3).
And generally radix tree cachelines stay clean so there shouldn't be any
contention on them from other CPUs unless they're dirtying the pages or
writing them back.
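The "five layers" figure can be checked with a quick sketch: the kernel's
radix tree / XArray resolves 6 bits of index per level by default, and a
1TB file of 4kB pages spans 2^40 / 2^12 = 2^28 page indices.  The helper
below is an illustration of that arithmetic, not the kernel's
implementation:

```c
/* Illustrative sketch: tree height needed to cover indices up to
 * `max_index` when each level resolves `shift` bits of index
 * (6 per level in the kernel's radix tree / XArray by default). */
static unsigned int tree_levels(unsigned long long max_index,
				unsigned int shift)
{
	unsigned int levels = 1;

	while (max_index >> shift) {
		max_index >>= shift;
		levels++;
	}
	return levels;
}
```

For 2^28 page indices this gives 5 levels, i.e. five pointer
dereferences from the root to the struct page.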



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 01/18] mm/filemap: Rename generic_file_buffered_read subfunctions
       [not found] ` <20201104204219.23810-2-willy@infradead.org>
@ 2020-11-06  8:07   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:07 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

Not 100% my preferred naming, but much better than what we had before,
so:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read
       [not found] ` <20201104204219.23810-3-willy@infradead.org>
       [not found]   ` <20201104213005.GB3365678@moria.home.lan>
@ 2020-11-06  8:07   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:07 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Wed, Nov 04, 2020 at 08:42:03PM +0000, Matthew Wilcox (Oracle) wrote:
> Increasing the batch size runs into diminishing returns.  It's probably
> better to make, eg, three calls to filemap_get_pages() than it is to
> call into kmalloc().

Yes, this always looked weird.

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read
       [not found]     ` <20201105001302.GF17076@casper.infradead.org>
@ 2020-11-06  8:08       ` Christoph Hellwig
  2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Kent Overstreet, linux-fsdevel, linux-mm, hch

On Thu, Nov 05, 2020 at 12:13:02AM +0000, Matthew Wilcox wrote:
> I have the beginnings of a patch for that, but I got busy with other stuff
> and didn't finish it.

If we have numbers showing that we want larger batch sizes than the
current PAGEVEC_SIZE, this is the way to go instead of special-casing
it in one path.



* Re: [PATCH v2 03/18] mm/filemap: Convert filemap_get_pages to take a pagevec
       [not found] ` <20201104204219.23810-4-willy@infradead.org>
@ 2020-11-06  8:08   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:08 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 09/18] mm/filemap: Change filemap_read_page calling conventions
       [not found] ` <20201104204219.23810-10-willy@infradead.org>
@ 2020-11-06  8:11   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:11 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Wed, Nov 04, 2020 at 08:42:10PM +0000, Matthew Wilcox (Oracle) wrote:
> @@ -2288,7 +2270,18 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
>  	return page;
>  
>  readpage:
> +	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ)) {
> +		unlock_page(page);
> +		put_page(page);
> +		return ERR_PTR(-EAGAIN);
> +	}
> +	error = filemap_read_page(iocb->ki_filp, mapping, page);
> +	if (!error)

I still find the early return here weird.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 11/18] mm/filemap: Convert filemap_update_page to return an errno
       [not found] ` <20201104204219.23810-12-willy@infradead.org>
@ 2020-11-06  8:14   ` Christoph Hellwig
  2020-11-06  8:37     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:14 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Wed, Nov 04, 2020 at 08:42:12PM +0000, Matthew Wilcox (Oracle) wrote:
> Use AOP_TRUNCATED_PAGE to indicate that no error occurred, but the
> page we looked up is no longer valid.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  mm/filemap.c | 42 +++++++++++++++++++-----------------------
>  1 file changed, 19 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 25945fefdd39..93c054f51677 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2231,24 +2231,21 @@ static int filemap_read_page(struct file *file, struct address_space *mapping,
>  	return error;
>  }
>  
> +static int filemap_update_page(struct kiocb *iocb,
> +		struct address_space *mapping, struct iov_iter *iter,
> +		struct page *page, loff_t pos, loff_t count)
>  {
>  	struct inode *inode = mapping->host;
>  	int error;
>  
>  	if (iocb->ki_flags & IOCB_WAITQ) {
>  		error = lock_page_async(page, iocb->ki_waitq);
> +		if (error)
> +			goto error;
>  	} else {
>  		if (!trylock_page(page)) {
>  			put_and_wait_on_page_locked(page, TASK_KILLABLE);
> +			return AOP_TRUNCATED_PAGE;
>  		}
>  	}
>  
> @@ -2267,25 +2264,24 @@ static struct page *filemap_update_page(struct kiocb *iocb, struct file *filp,
>  		goto readpage;
>  uptodate:
>  	unlock_page(page);
> +	return 0;
>  
>  readpage:
>  	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ)) {
>  		unlock_page(page);
> +		error = -EAGAIN;
> +	} else {
> +		error = filemap_read_page(iocb->ki_filp, mapping, page);
> +		if (!error)
> +			return 0;
>  	}
> +error:
>  	put_page(page);
> +	return error;
>  truncated:
>  	unlock_page(page);
>  	put_page(page);
> +	return AOP_TRUNCATED_PAGE;

We could still consolidate the page unlocking by having another label.
Or even better, move the put_page into the caller like I did in my
series, which would conceptually fit in pretty nicely here:

> +			err = filemap_update_page(iocb, mapping, iter, page,
>  					pg_pos, pg_count);
> +			if (err)
>  				pvec->nr--;

But otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 13/18] mm/filemap: Add filemap_range_uptodate
       [not found] ` <20201104204219.23810-14-willy@infradead.org>
@ 2020-11-06  8:16   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:16 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

> +	if (mapping->host->i_blkbits >= (PAGE_SHIFT + thp_order(page)))
> +		return false;
> +
> +	pos = (loff_t) page->index << PAGE_SHIFT;

Should this use page_offset() ?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 14/18] mm/filemap: Split filemap_readahead out of filemap_get_pages
       [not found] ` <20201104204219.23810-15-willy@infradead.org>
@ 2020-11-06  8:19   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:19 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

>  		if (PageReadahead(page)) {
> +			err = filemap_readahead(iocb, filp, mapping, page,
> +					last_index);
> +			if (err) {
>  				pvec->nr--;
>  				goto err;
>  			}

Not really new in this patch, but don't we need to drop the page refcount
here?



* Re: [PATCH v2 15/18] mm/filemap: Restructure filemap_get_pages
       [not found] ` <20201104204219.23810-16-willy@infradead.org>
@ 2020-11-06  8:21   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:21 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Wed, Nov 04, 2020 at 08:42:16PM +0000, Matthew Wilcox (Oracle) wrote:
> Remove the got_pages label, remove indentation, rename find_page to retry,
> simplify error handling.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v2 16/18] mm/filemap: Don't relock the page after calling readpage
       [not found] ` <20201104204219.23810-17-willy@infradead.org>
@ 2020-11-06  8:21   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:21 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Wed, Nov 04, 2020 at 08:42:17PM +0000, Matthew Wilcox (Oracle) wrote:
> We don't need to get the page lock again; we just need to wait for
> the I/O to finish, so use wait_on_page_locked_killable() like the
> other callers of ->readpage.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

although I still think a little comment would not hurt.



* Re: [PATCH v2 11/18] mm/filemap: Convert filemap_update_page to return an errno
  2020-11-06  8:14   ` [PATCH v2 11/18] mm/filemap: Convert filemap_update_page to return an errno Christoph Hellwig
@ 2020-11-06  8:37     ` Christoph Hellwig
  2020-11-09 13:29       ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:37 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch, kent.overstreet

On Fri, Nov 06, 2020 at 09:14:20AM +0100, Christoph Hellwig wrote:
> We could still consolidate the page unlocking by having another label.
> Or even better move the put_page into the caller like I did in my
> series, which would conceptually fit in pretty nicely here:

FYI, this is what we should do for putting the page (on top of your
whole series), which also catches the leak for the readahead NOIO case:

diff --git a/mm/filemap.c b/mm/filemap.c
index 500e9fd4232cf9..dacee60b92d3d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2268,40 +2268,35 @@ static int filemap_update_page(struct kiocb *iocb,
 		struct address_space *mapping, struct iov_iter *iter,
 		struct page *page)
 {
-	int error = -EAGAIN;
+	int error = 0;
 
 	if (!trylock_page(page)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
-			goto error;
+			return -EAGAIN;
 		if (!(iocb->ki_flags & IOCB_WAITQ)) {
 			put_and_wait_on_page_locked(page, TASK_KILLABLE);
 			return AOP_TRUNCATED_PAGE;
 		}
 		error = __lock_page_async(page, iocb->ki_waitq);
 		if (error)
-			goto error;
+			return error;
 	}
 
-	error = AOP_TRUNCATED_PAGE;
-	if (!page->mapping)
-		goto unlock;
-	if (filemap_range_uptodate(iocb, mapping, iter, page)) {
+	if (!page->mapping) {
 		unlock_page(page);
-		return 0;
+		put_page(page);
+		return AOP_TRUNCATED_PAGE;
 	}
 
-	error = -EAGAIN;
-	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
-		goto unlock;
+	if (!filemap_range_uptodate(iocb, mapping, iter, page)) {
+		error = -EAGAIN;
+		if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
+			goto unlock;
+		return filemap_read_page(iocb->ki_filp, mapping, page);
+	}
 
-	error = filemap_read_page(iocb->ki_filp, mapping, page);
-	if (error)
-		goto error;
-	return 0;
 unlock:
 	unlock_page(page);
-error:
-	put_page(page);
 	return error;
 }
 
@@ -2396,8 +2391,9 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 
 	return 0;
 err:
-	pvec->nr--;
-	if (likely(pvec->nr))
+	if (err < 0)
+		put_page(page);
+	if (likely(--pvec->nr))
 		return 0;
 	if (err == AOP_TRUNCATED_PAGE)
 		goto retry;



* [PATCH 1/4] pagevec: Allow pagevecs to be different sizes
  2020-11-06  8:08       ` Christoph Hellwig
@ 2020-11-06 12:30         ` Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 2/4] pagevec: Increase the size of LRU pagevecs Matthew Wilcox (Oracle)
                             ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-11-06 12:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, hch, kent.overstreet; +Cc: Matthew Wilcox (Oracle)

Declaring a pagevec continues to create a pagevec which is the same size,
but functions which manipulate pagevecs no longer rely on this.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagevec.h | 20 ++++++++++++++++----
 mm/swap.c               |  8 ++++++++
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 875a3f0d9dd2..ee5d3c4da8da 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -18,9 +18,15 @@ struct page;
 struct address_space;
 
 struct pagevec {
-	unsigned char nr;
-	bool percpu_pvec_drained;
-	struct page *pages[PAGEVEC_SIZE];
+	union {
+		struct {
+			unsigned char sz;
+			unsigned char nr;
+			bool percpu_pvec_drained;
+			struct page *pages[PAGEVEC_SIZE];
+		};
+		void *__p[PAGEVEC_SIZE + 1];
+	};
 };
 
 void __pagevec_release(struct pagevec *pvec);
@@ -41,6 +47,7 @@ static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
 
 static inline void pagevec_init(struct pagevec *pvec)
 {
+	pvec->sz = PAGEVEC_SIZE;
 	pvec->nr = 0;
 	pvec->percpu_pvec_drained = false;
 }
@@ -50,6 +57,11 @@ static inline void pagevec_reinit(struct pagevec *pvec)
 	pvec->nr = 0;
 }
 
+static inline unsigned pagevec_size(struct pagevec *pvec)
+{
+	return pvec->sz;
+}
+
 static inline unsigned pagevec_count(struct pagevec *pvec)
 {
 	return pvec->nr;
@@ -57,7 +69,7 @@ static inline unsigned pagevec_count(struct pagevec *pvec)
 
 static inline unsigned pagevec_space(struct pagevec *pvec)
 {
-	return PAGEVEC_SIZE - pvec->nr;
+	return pvec->sz - pvec->nr;
 }
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
index 2ee3522a7170..d093fb30f038 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -52,6 +52,7 @@ struct lru_rotate {
 };
 static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
 	.lock = INIT_LOCAL_LOCK(lock),
+	.pvec.sz = PAGEVEC_SIZE,
 };
 
 /*
@@ -70,6 +71,13 @@ struct lru_pvecs {
 };
 static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 	.lock = INIT_LOCAL_LOCK(lock),
+	.lru_add.sz = PAGEVEC_SIZE,
+	.lru_deactivate_file.sz = PAGEVEC_SIZE,
+	.lru_deactivate.sz = PAGEVEC_SIZE,
+	.lru_lazyfree.sz = PAGEVEC_SIZE,
+#ifdef CONFIG_SMP
+	.activate_page.sz = PAGEVEC_SIZE,
+#endif
 };
 
 /*
-- 
2.28.0
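The union above sizes the struct to exactly PAGEVEC_SIZE + 1 pointers:
the three byte-sized header fields share the first pointer-sized slot,
and pages[] starts at the second.  A userspace sketch of the layout
(mirroring the patch for inspection only; the static asserts are
illustrative, not part of the patch):

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGEVEC_SIZE 15

struct page;

/* Mirror of the patched struct pagevec, for layout inspection only. */
struct pagevec {
	union {
		struct {
			unsigned char sz;
			unsigned char nr;
			bool percpu_pvec_drained;
			struct page *pages[PAGEVEC_SIZE];
		};
		void *__p[PAGEVEC_SIZE + 1];
	};
};

/* The union arm rounds the struct up to a whole number of pointers. */
_Static_assert(sizeof(struct pagevec) == (PAGEVEC_SIZE + 1) * sizeof(void *),
	       "header shares the first pointer slot");
_Static_assert(offsetof(struct pagevec, pages) == sizeof(void *),
	       "pages[] starts at the second slot");
```

This also explains why a later patch allocates sz + 1 pointers for an
sz-entry pagevec.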




* [PATCH 2/4] pagevec: Increase the size of LRU pagevecs
  2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
@ 2020-11-06 12:30           ` Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 3/4] pagevec: Add dynamically allocated pagevecs Matthew Wilcox (Oracle)
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-11-06 12:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, hch, kent.overstreet; +Cc: Matthew Wilcox (Oracle)

Tim Chen reports that increasing the size of LRU pagevecs is advantageous
with a workload he has:
https://lore.kernel.org/linux-mm/d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com/

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagevec.h |  5 +++++
 mm/swap.c               | 27 +++++++++++++++------------
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index ee5d3c4da8da..4dc45392d776 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -29,6 +29,11 @@ struct pagevec {
 	};
 };
 
+#define declare_pagevec(name, size) union {				\
+	struct pagevec name;						\
+	void *__p ##name [size + 1];					\
+}
+
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
 void pagevec_remove_exceptionals(struct pagevec *pvec);
diff --git a/mm/swap.c b/mm/swap.c
index d093fb30f038..1e6f50b312ea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,14 +45,17 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
+#define LRU_PAGEVEC_SIZE 63
+#define lru_pagevec(name)	declare_pagevec(name, LRU_PAGEVEC_SIZE)
+
 /* Protecting only lru_rotate.pvec which requires disabling interrupts */
 struct lru_rotate {
 	local_lock_t lock;
-	struct pagevec pvec;
+	lru_pagevec(pvec);
 };
 static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
 	.lock = INIT_LOCAL_LOCK(lock),
-	.pvec.sz = PAGEVEC_SIZE,
+	.pvec.sz = LRU_PAGEVEC_SIZE,
 };
 
 /*
@@ -61,22 +64,22 @@ static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
  */
 struct lru_pvecs {
 	local_lock_t lock;
-	struct pagevec lru_add;
-	struct pagevec lru_deactivate_file;
-	struct pagevec lru_deactivate;
-	struct pagevec lru_lazyfree;
+	lru_pagevec(lru_add);
+	lru_pagevec(lru_deactivate_file);
+	lru_pagevec(lru_deactivate);
+	lru_pagevec(lru_lazyfree);
 #ifdef CONFIG_SMP
-	struct pagevec activate_page;
+	lru_pagevec(activate_page);
 #endif
 };
 static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 	.lock = INIT_LOCAL_LOCK(lock),
-	.lru_add.sz = PAGEVEC_SIZE,
-	.lru_deactivate_file.sz = PAGEVEC_SIZE,
-	.lru_deactivate.sz = PAGEVEC_SIZE,
-	.lru_lazyfree.sz = PAGEVEC_SIZE,
+	.lru_add.sz = LRU_PAGEVEC_SIZE,
+	.lru_deactivate_file.sz = LRU_PAGEVEC_SIZE,
+	.lru_deactivate.sz = LRU_PAGEVEC_SIZE,
+	.lru_lazyfree.sz = LRU_PAGEVEC_SIZE,
 #ifdef CONFIG_SMP
-	.activate_page.sz = PAGEVEC_SIZE,
+	.activate_page.sz = LRU_PAGEVEC_SIZE,
 #endif
 };
 
-- 
2.28.0




* [PATCH 3/4] pagevec: Add dynamically allocated pagevecs
  2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 2/4] pagevec: Increase the size of LRU pagevecs Matthew Wilcox (Oracle)
@ 2020-11-06 12:30           ` Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 4/4] mm/filemap: Use a dynamically allocated pagevec in filemap_read Matthew Wilcox (Oracle)
  2020-11-07 17:08           ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Kent Overstreet
  3 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-11-06 12:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, hch, kent.overstreet; +Cc: Matthew Wilcox (Oracle)

Add pagevec_alloc() and pagevec_free() to allow for pagevecs up to 255
entries in size.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagevec.h |  7 +++++++
 mm/swap.c               | 28 ++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 4dc45392d776..4d5a48d7a372 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -36,6 +36,13 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+struct pagevec *pagevec_alloc(unsigned long sz, gfp_t gfp);
+
+static inline void pagevec_free(struct pagevec *pvec)
+{
+	kfree(pvec);
+}
+
 void pagevec_remove_exceptionals(struct pagevec *pvec);
 
 unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
diff --git a/mm/swap.c b/mm/swap.c
index 1e6f50b312ea..3f856a272cb2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -988,6 +988,34 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
+/**
+ * pagevec_alloc - Allocate a pagevec.
+ * @sz: Number of pages wanted.
+ * @gfp: Memory allocation flags.
+ *
+ * Allocates a new pagevec.  The @sz parameter is advisory; this function
+ * may allocate a pagevec that can contain fewer pages than requested.  If
+ * the caller cares how many were allocated, it can check pagevec_size(),
+ * but most callers will simply use as many as were allocated.
+ *
+ * Return: A new pagevec, or NULL if memory allocation failed.
+ */
+struct pagevec *pagevec_alloc(unsigned long sz, gfp_t gfp)
+{
+	struct pagevec *pvec;
+
+	if (sz > 255)
+		sz = 255;
+	pvec = kmalloc_array(sz + 1, sizeof(void *), gfp);
+	if (!pvec)
+		return NULL;
+	pvec->nr = 0;
+	pvec->sz = sz;
+
+	return pvec;
+}
+EXPORT_SYMBOL(pagevec_alloc);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* used by __split_huge_page_refcount() */
 void lru_add_page_tail(struct page *page, struct page *page_tail,
-- 
2.28.0
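A userspace sketch of the allocation arithmetic above: the clamp to 255
exists because sz and nr are stored as unsigned char, and the allocation
is sz + 1 pointers because the header shares the first pointer-sized
slot.  Here malloc stands in for kmalloc_array and the gfp argument is
dropped; this mirrors the patch but is not the kernel code:

```c
#include <stdlib.h>

struct page;

/* Minimal userspace mirror of the variable-sized pagevec header. */
struct pagevec {
	unsigned char sz;
	unsigned char nr;
	_Bool percpu_pvec_drained;
	struct page *pages[];	/* sz entries follow the header slot */
};

/* Sketch of pagevec_alloc(): sz is advisory and clamped to 255. */
struct pagevec *pagevec_alloc(unsigned long sz)
{
	struct pagevec *pvec;

	if (sz > 255)
		sz = 255;
	/* sz + 1 pointers: one slot for the header, sz for the pages */
	pvec = malloc((sz + 1) * sizeof(void *));
	if (!pvec)
		return NULL;
	pvec->nr = 0;
	pvec->sz = sz;
	return pvec;
}
```

A caller that asks for more than 255 entries simply gets a 255-entry
pagevec back, which is why the kernel-doc says callers should check
pagevec_size() if they care.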




* [PATCH 4/4] mm/filemap: Use a dynamically allocated pagevec in filemap_read
  2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 2/4] pagevec: Increase the size of LRU pagevecs Matthew Wilcox (Oracle)
  2020-11-06 12:30           ` [PATCH 3/4] pagevec: Add dynamically allocated pagevecs Matthew Wilcox (Oracle)
@ 2020-11-06 12:30           ` Matthew Wilcox (Oracle)
  2020-11-07 17:08           ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Kent Overstreet
  3 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-11-06 12:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, hch, kent.overstreet; +Cc: Matthew Wilcox (Oracle)

Restore Kent's optimisation for I/O sizes between 64kB and 1MB.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f41de0759e86..2b4d8ed241bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2427,7 +2427,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 	struct file_ra_state *ra = &filp->f_ra;
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	struct pagevec pvec;
+	struct pagevec pvec_stack, *pvec = NULL;
 	int i, error = 0;
 	bool writably_mapped;
 	loff_t isize, end_offset;
@@ -2436,6 +2436,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		return 0;
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
 
+	if (iter->count / PAGE_SIZE > PAGEVEC_SIZE)
+		pvec = pagevec_alloc(iter->count / PAGE_SIZE + 1, GFP_KERNEL);
+	if (!pvec)
+		pvec = &pvec_stack;
 	do {
 		cond_resched();
 
@@ -2447,7 +2451,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
 			iocb->ki_flags |= IOCB_NOWAIT;
 
-		error = filemap_get_pages(iocb, iter, &pvec);
+		error = filemap_get_pages(iocb, iter, pvec);
 		if (error < 0)
 			break;
 
@@ -2476,10 +2480,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 */
 		if (iocb->ki_pos >> PAGE_SHIFT !=
 		    ra->prev_pos >> PAGE_SHIFT)
-			mark_page_accessed(pvec.pages[0]);
+			mark_page_accessed(pvec->pages[0]);
 
-		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+		for (i = 0; i < pagevec_count(pvec); i++) {
+			struct page *page = pvec->pages[i];
 			size_t page_size = thp_size(page);
 			size_t offset = iocb->ki_pos & (page_size - 1);
 			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
@@ -2514,7 +2518,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_pages:
-		pagevec_release(&pvec);
+		pagevec_release(pvec);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
 	file_accessed(filp);
-- 
2.28.0
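The pvec_stack/pvec dance above is the usual heap-with-stack-fallback
pattern: try to allocate a right-sized buffer, and silently fall back to
a small on-stack one if the allocation fails or isn't worthwhile.  In
miniature, with a made-up workload (sum_squares and the size 16 are
illustrative only, not from the patch):

```c
#include <stdlib.h>

#define STACK_SLOTS 16

/* Made-up workload illustrating the fallback pattern: square each
 * value through a scratch buffer, batching the work to fit whichever
 * buffer we ended up with, and sum the results. */
long sum_squares(const int *vals, size_t n)
{
	long stack_buf[STACK_SLOTS], *buf = NULL;
	size_t batch = n, i, j;
	long total = 0;

	if (n > STACK_SLOTS)
		buf = malloc(n * sizeof(*buf));
	if (!buf) {		/* allocation failed, or wasn't needed */
		buf = stack_buf;
		batch = STACK_SLOTS;
	}

	for (i = 0; i < n; i += batch) {
		size_t count = (n - i < batch) ? n - i : batch;

		for (j = 0; j < count; j++)
			buf[j] = (long)vals[i + j] * vals[i + j];
		for (j = 0; j < count; j++)
			total += buf[j];
	}

	if (buf != stack_buf)
		free(buf);
	return total;
}
```

As in the patch, correctness never depends on the allocation succeeding;
a failed allocation only means more passes with the smaller buffer.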




* Re: [PATCH 1/4] pagevec: Allow pagevecs to be different sizes
  2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
                             ` (2 preceding siblings ...)
  2020-11-06 12:30           ` [PATCH 4/4] mm/filemap: Use a dynamically allocated pagevec in filemap_read Matthew Wilcox (Oracle)
@ 2020-11-07 17:08           ` Kent Overstreet
  2020-11-07 17:20             ` Matthew Wilcox
  3 siblings, 1 reply; 19+ messages in thread
From: Kent Overstreet @ 2020-11-07 17:08 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-fsdevel, linux-mm, hch

On Fri, Nov 06, 2020 at 12:30:37PM +0000, Matthew Wilcox (Oracle) wrote:
> Declaring a pagevec continues to create a pagevec which is the same size,
> but functions which manipulate pagevecs no longer rely on this.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagevec.h | 20 ++++++++++++++++----
>  mm/swap.c               |  8 ++++++++
>  2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index 875a3f0d9dd2..ee5d3c4da8da 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -18,9 +18,15 @@ struct page;
>  struct address_space;
>  
>  struct pagevec {
> -	unsigned char nr;
> -	bool percpu_pvec_drained;
> -	struct page *pages[PAGEVEC_SIZE];
> +	union {
> +		struct {
> +			unsigned char sz;
> +			unsigned char nr;
> +			bool percpu_pvec_drained;
This should probably be removed; it's only used by the swap code and I don't
think it belongs in the generic data structure. That would mean nr and size
(and let's please use more standard naming...) could be u32s, not u8s.

> +			struct page *pages[PAGEVEC_SIZE];
> +		};
> +		void *__p[PAGEVEC_SIZE + 1];

What's up with this union?

> +	};
>  };
>  
>  void __pagevec_release(struct pagevec *pvec);
> @@ -41,6 +47,7 @@ static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
>  
>  static inline void pagevec_init(struct pagevec *pvec)
>  {
> +	pvec->sz = PAGEVEC_SIZE;
>  	pvec->nr = 0;
>  	pvec->percpu_pvec_drained = false;
>  }
> @@ -50,6 +57,11 @@ static inline void pagevec_reinit(struct pagevec *pvec)
>  	pvec->nr = 0;
>  }
>  
> +static inline unsigned pagevec_size(struct pagevec *pvec)
> +{
> +	return pvec->sz;
> +}
> +
>  static inline unsigned pagevec_count(struct pagevec *pvec)
>  {
>  	return pvec->nr;
> @@ -57,7 +69,7 @@ static inline unsigned pagevec_count(struct pagevec *pvec)
>  
>  static inline unsigned pagevec_space(struct pagevec *pvec)
>  {
> -	return PAGEVEC_SIZE - pvec->nr;
> +	return pvec->sz - pvec->nr;
>  }
>  
>  /*
> diff --git a/mm/swap.c b/mm/swap.c
> index 2ee3522a7170..d093fb30f038 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -52,6 +52,7 @@ struct lru_rotate {
>  };
>  static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
>  	.lock = INIT_LOCAL_LOCK(lock),
> +	.pvec.sz = PAGEVEC_SIZE,
>  };
>  
>  /*
> @@ -70,6 +71,13 @@ struct lru_pvecs {
>  };
>  static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
>  	.lock = INIT_LOCAL_LOCK(lock),
> +	.lru_add.sz = PAGEVEC_SIZE,
> +	.lru_deactivate_file.sz = PAGEVEC_SIZE,
> +	.lru_deactivate.sz = PAGEVEC_SIZE,
> +	.lru_lazyfree.sz = PAGEVEC_SIZE,
> +#ifdef CONFIG_SMP
> +	.activate_page.sz = PAGEVEC_SIZE,
> +#endif
>  };
>  
>  /*
> -- 
> 2.28.0
> 



* Re: [PATCH 1/4] pagevec: Allow pagevecs to be different sizes
  2020-11-07 17:08           ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Kent Overstreet
@ 2020-11-07 17:20             ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2020-11-07 17:20 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-fsdevel, linux-mm, hch

On Sat, Nov 07, 2020 at 12:08:13PM -0500, Kent Overstreet wrote:
> On Fri, Nov 06, 2020 at 12:30:37PM +0000, Matthew Wilcox (Oracle) wrote:
> >  struct pagevec {
> > -	unsigned char nr;
> > -	bool percpu_pvec_drained;
> > -	struct page *pages[PAGEVEC_SIZE];
> > +	union {
> > +		struct {
> > +			unsigned char sz;
> > +			unsigned char nr;
> > +			bool percpu_pvec_drained;
> This should probably be removed; it's only used by the swap code and I don't
> think it belongs in the generic data structure. That would mean nr and size
> (and let's please use more standard naming...) could be u32s, not u8s.

Nevertheless, that's a very important user of pagevecs!  You and I have no
need for it in the fs code, but it doesn't hurt to leave it here.

I really don't think that increasing the size above 255 is worth anything.
See my earlier analysis of the effect of increasing the batch size.

> > +			struct page *pages[PAGEVEC_SIZE];
> > +		};
> > +		void *__p[PAGEVEC_SIZE + 1];
> 
> What's up with this union?

*blink*.  That was supposed to be 'struct page *pages[];'

And the reason to do it that way is to keep overly-clever instrumentation
like kmsan from saying "Hey, nr is bigger than 16, this is a buffer
overrun".  Clearly I didn't build a kernel with the various sanitisers
enabled, but I'll do that now.





* Re: [PATCH v2 11/18] mm/filemap: Convert filemap_update_page to return an errno
  2020-11-06  8:37     ` Christoph Hellwig
@ 2020-11-09 13:29       ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2020-11-09 13:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, kent.overstreet

On Fri, Nov 06, 2020 at 09:37:46AM +0100, Christoph Hellwig wrote:
>  
> -	error = AOP_TRUNCATED_PAGE;
> -	if (!page->mapping)
> -		goto unlock;
> -	if (filemap_range_uptodate(iocb, mapping, iter, page)) {
> +	if (!page->mapping) {
>  		unlock_page(page);
> -		return 0;
> +		put_page(page);
> +		return AOP_TRUNCATED_PAGE;
>  	}
>  
> -	error = -EAGAIN;
> -	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
> -		goto unlock;
> +	if (!filemap_range_uptodate(iocb, mapping, iter, page)) {
> +		error = -EAGAIN;
> +		if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
> +			goto unlock;
> +		return filemap_read_page(iocb->ki_filp, mapping, page);
> +	}
>  
> -	error = filemap_read_page(iocb->ki_filp, mapping, page);
> -	if (error)
> -		goto error;
> -	return 0;
>  unlock:
>  	unlock_page(page);
> -error:
> -	put_page(page);
>  	return error;
>  }

It works out to be a little more complicated than that because
filemap_read_page() can also return AOP_TRUNCATED_PAGE.  Here's what I
currently have:

        if (!page->mapping)
                goto truncated;

        error = 0;
        if (filemap_range_uptodate(iocb, mapping, iter, page))
                goto unlock;

        error = -EAGAIN;
        if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
                goto unlock;

        error = filemap_read_page(iocb->ki_filp, mapping, page);
        if (error == AOP_TRUNCATED_PAGE)
                put_page(page);
        return error;
truncated:
        unlock_page(page);
        put_page(page);
        return AOP_TRUNCATED_PAGE;
unlock:
        unlock_page(page);
        return error;
}





end of thread, other threads:[~2020-11-09 13:29 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201104204219.23810-1-willy@infradead.org>
     [not found] ` <20201104204219.23810-2-willy@infradead.org>
2020-11-06  8:07   ` [PATCH v2 01/18] mm/filemap: Rename generic_file_buffered_read subfunctions Christoph Hellwig
     [not found] ` <20201104204219.23810-3-willy@infradead.org>
     [not found]   ` <20201104213005.GB3365678@moria.home.lan>
2020-11-05  4:52     ` [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read Matthew Wilcox
     [not found]     ` <20201105001302.GF17076@casper.infradead.org>
2020-11-06  8:08       ` Christoph Hellwig
2020-11-06 12:30         ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Matthew Wilcox (Oracle)
2020-11-06 12:30           ` [PATCH 2/4] pagevec: Increase the size of LRU pagevecs Matthew Wilcox (Oracle)
2020-11-06 12:30           ` [PATCH 3/4] pagevec: Add dynamically allocated pagevecs Matthew Wilcox (Oracle)
2020-11-06 12:30           ` [PATCH 4/4] mm/filemap: Use a dynamically allocated pagevec in filemap_read Matthew Wilcox (Oracle)
2020-11-07 17:08           ` [PATCH 1/4] pagevec: Allow pagevecs to be different sizes Kent Overstreet
2020-11-07 17:20             ` Matthew Wilcox
2020-11-06  8:07   ` [PATCH v2 02/18] mm/filemap: Remove dynamically allocated array from filemap_read Christoph Hellwig
     [not found] ` <20201104204219.23810-4-willy@infradead.org>
2020-11-06  8:08   ` [PATCH v2 03/18] mm/filemap: Convert filemap_get_pages to take a pagevec Christoph Hellwig
     [not found] ` <20201104204219.23810-10-willy@infradead.org>
2020-11-06  8:11   ` [PATCH v2 09/18] mm/filemap: Change filemap_read_page calling conventions Christoph Hellwig
     [not found] ` <20201104204219.23810-12-willy@infradead.org>
2020-11-06  8:14   ` [PATCH v2 11/18] mm/filemap: Convert filemap_update_page to return an errno Christoph Hellwig
2020-11-06  8:37     ` Christoph Hellwig
2020-11-09 13:29       ` Matthew Wilcox
     [not found] ` <20201104204219.23810-14-willy@infradead.org>
2020-11-06  8:16   ` [PATCH v2 13/18] mm/filemap: Add filemap_range_uptodate Christoph Hellwig
     [not found] ` <20201104204219.23810-15-willy@infradead.org>
2020-11-06  8:19   ` [PATCH v2 14/18] mm/filemap: Split filemap_readahead out of filemap_get_pages Christoph Hellwig
     [not found] ` <20201104204219.23810-16-willy@infradead.org>
2020-11-06  8:21   ` [PATCH v2 15/18] mm/filemap: Restructure filemap_get_pages Christoph Hellwig
     [not found] ` <20201104204219.23810-17-willy@infradead.org>
2020-11-06  8:21   ` [PATCH v2 16/18] mm/filemap: Don't relock the page after calling readpage Christoph Hellwig
