Linux-XFS Archive on lore.kernel.org
* [RFC 0/8] Replacing the readpages a_op
@ 2020-01-13 15:37 Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 1/8] pagevec: Add an iterator Matthew Wilcox
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

I think everybody hates the readpages API.  The fundamental problem with
it is that it passes the pages to be read on a doubly linked list, using
the ->lru list in the struct page.  That means the filesystems have to
do the work of calling add_to_page_cache{,_lru,_locked}, and of handling
failures (insertion can fail because another task may already have added
a page for that chunk of the file).

This is an attempt to add a ->readahead op to replace ->readpages.  I've
converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
fscache support, and that's just because I didn't want to do that work;
I don't believe there's anything fundamental to it.  But I wanted to do
iomap because it is The Infrastructure Of The Future and cifs because it
is the sole remaining user of add_to_page_cache_locked(), which enables
the last two patches in the series.  By the way, that gives CIFS access
to the workingset shadow infrastructure, which it had to ignore before
because it couldn't put pages onto the lru list at the right time.

The fundamental question is, how do we indicate to the implementation of
->readahead what pages to operate on?  I've gone with passing a pagevec.
This has the obvious advantage that it's a data structure that already
exists and is used within filemap for batches of pages.  I had to add a
bit of new infrastructure to support iterating over the pages in the
pagevec, but with that done, it's quite nice.

I think the biggest problem is that the size of the pagevec is limited
to 15 pages (60kB).  So that'll mean that if the readahead window bumps
all the way up to 256kB, we may end up making 5 BIOs (and merging them)
instead of one.  I'd kind of like to be able to allocate variable length
pagevecs while allowing regular pagevecs to be allocated on the stack,
but I can't figure out a way to do that.  eg this doesn't work:

-       struct page *pages[PAGEVEC_SIZE];
+       union {
+               struct page *pages[PAGEVEC_SIZE];
+               struct page *_pages[];
+       }

and if we just allocate them, useful and wonderful tools are going to
point out that accessing pages[15] and beyond oversteps the end of
the array.

I have considered alternatives to the pagevec like just having the
->readahead implementation look up the pages in the i_pages XArray
directly.  That didn't work out too well.

Anyway, I want to draw your attention to the diffstat below.  Net 91 lines
deleted, and that's with adding all the infrastructure for ->readahead
and getting rid of none of the infrastructure for ->readpages.  There's
probably a good couple of hundred lines of code to be deleted there.

Matthew Wilcox (Oracle) (8):
  pagevec: Add an iterator
  mm: Fix the return type of __do_page_cache_readahead
  mm: Use a pagevec for readahead
  mm/fs: Add a_ops->readahead
  iomap,xfs: Convert from readpages to readahead
  cifs: Convert from readpages to readahead
  mm: Remove add_to_page_cache_locked
  mm: Unify all add_to_page_cache variants

 Documentation/filesystems/locking.rst |   8 +-
 Documentation/filesystems/vfs.rst     |   9 ++
 fs/cifs/file.c                        | 125 ++++-------------------
 fs/iomap/buffered-io.c                |  60 +++--------
 fs/iomap/trace.h                      |  18 ++--
 fs/xfs/xfs_aops.c                     |  12 +--
 include/linux/fs.h                    |   3 +
 include/linux/iomap.h                 |   4 +-
 include/linux/pagemap.h               |  23 +----
 include/linux/pagevec.h               |  20 ++++
 mm/filemap.c                          |  72 ++++---------
 mm/internal.h                         |   2 +-
 mm/readahead.c                        | 141 +++++++++++++++++---------
 13 files changed, 203 insertions(+), 294 deletions(-)

-- 
2.24.1



* [PATCH 1/8] pagevec: Add an iterator
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 2/8] mm: Fix the return type of __do_page_cache_readahead Matthew Wilcox
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

There's plenty of space in the pagevec for a loop counter, and
that will come in handy later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagevec.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..9b8c43661ab3 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -19,6 +19,7 @@ struct address_space;
 
 struct pagevec {
 	unsigned char nr;
+	unsigned char first;
 	bool percpu_pvec_drained;
 	struct page *pages[PAGEVEC_SIZE];
 };
@@ -55,12 +56,14 @@ static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
 static inline void pagevec_init(struct pagevec *pvec)
 {
 	pvec->nr = 0;
+	pvec->first = 0;
 	pvec->percpu_pvec_drained = false;
 }
 
 static inline void pagevec_reinit(struct pagevec *pvec)
 {
 	pvec->nr = 0;
+	pvec->first = 0;
 }
 
 static inline unsigned pagevec_count(struct pagevec *pvec)
@@ -88,4 +91,21 @@ static inline void pagevec_release(struct pagevec *pvec)
 		__pagevec_release(pvec);
 }
 
+static inline struct page *pagevec_last(struct pagevec *pvec)
+{
+	if (pvec->nr == 0)
+		return NULL;
+	return pvec->pages[pvec->nr - 1];
+}
+
+static inline struct page *pagevec_next(struct pagevec *pvec)
+{
+	if (pvec->first >= pvec->nr)
+		return NULL;
+	return pvec->pages[pvec->first++];
+}
+
+#define pagevec_for_each(pvec, page)	\
+	while ((page = pagevec_next(pvec)))
+
 #endif /* _LINUX_PAGEVEC_H */
-- 
2.24.1



* [PATCH 2/8] mm: Fix the return type of __do_page_cache_readahead
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 1/8] pagevec: Add an iterator Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 3/8] mm: Use a pagevec for readahead Matthew Wilcox
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

ra_submit(), which is a wrapper around __do_page_cache_readahead(),
already returns an unsigned long, and the 'nr_to_read' parameter is an
unsigned long, so fix __do_page_cache_readahead() to return an unsigned
long as well, even though I'm pretty sure we're never going to read
ahead more than 2^32 pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h  | 2 +-
 mm/readahead.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..41b93c4b3ab7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -49,7 +49,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
-extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
+extern unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..6bf73ef33b7e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -152,7 +152,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
  *
  * Returns the number of pages requested, or the maximum amount of I/O allowed.
  */
-unsigned int __do_page_cache_readahead(struct address_space *mapping,
+unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
@@ -161,7 +161,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	unsigned int nr_pages = 0;
+	unsigned long nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
-- 
2.24.1



* [PATCH 3/8] mm: Use a pagevec for readahead
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 1/8] pagevec: Add an iterator Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 2/8] mm: Fix the return type of __do_page_cache_readahead Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 4/8] mm/fs: Add a_ops->readahead Matthew Wilcox
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Instead of using a linked list, use a small array.  This does mean we
will allocate and then submit for I/O no more than 15 pages at a time
(60kB), but we have the block queue plugged so the bios can be combined
afterwards.  We generally don't readahead more than 256kB anyway,
so this is not a huge reduction in efficiency, and we'll make up
for it with later patches.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 97 +++++++++++++++++++++++++++-----------------------
 1 file changed, 52 insertions(+), 45 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 6bf73ef33b7e..76a70a4406b5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,35 +113,37 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static int read_pages(struct address_space *mapping, struct file *filp,
-		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
+/*
+ * We ignore I/O errors - they will be handled by the actual consumer of
+ * the data that we attempted to prefetch.
+ */
+static unsigned read_pages(struct address_space *mapping, struct file *filp,
+		struct pagevec *pvec, pgoff_t offset, gfp_t gfp)
 {
-	struct blk_plug plug;
-	unsigned page_idx;
-	int ret;
-
-	blk_start_plug(&plug);
+	struct page *page;
+	unsigned int nr_pages = pagevec_count(pvec);
 
 	if (mapping->a_ops->readpages) {
-		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
-		/* Clean up the remaining pages */
-		put_pages_list(pages);
-		goto out;
-	}
+		LIST_HEAD(pages);
 
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-		struct page *page = lru_to_page(pages);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
-			mapping->a_ops->readpage(filp, page);
-		put_page(page);
+		pagevec_for_each(pvec, page) {
+			page->index = offset++;
+			list_add(&page->lru, &pages);
+		}
+		mapping->a_ops->readpages(filp, mapping, &pages, nr_pages);
+		/* Clean up the remaining pages */
+		put_pages_list(&pages);
+	} else {
+		pagevec_for_each(pvec, page) {
+			if (!add_to_page_cache_lru(page, mapping, offset++,
+						gfp))
+				mapping->a_ops->readpage(filp, page);
+			put_page(page);
+		}
 	}
-	ret = 0;
 
-out:
-	blk_finish_plug(&plug);
-
-	return ret;
+	pagevec_reinit(pvec);
+	return nr_pages;
 }
 
 /*
@@ -159,59 +161,64 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
-	LIST_HEAD(page_pool);
+	struct pagevec pages;
 	int page_idx;
+	pgoff_t page_offset = offset - 1;
 	unsigned long nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	if (isize == 0)
 		goto out;
 
 	end_index = ((isize - 1) >> PAGE_SHIFT);
+	pagevec_init(&pages);
 
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
-		pgoff_t page_offset = offset + page_idx;
+		page_offset++;
 
 		if (page_offset > end_index)
 			break;
 
 		page = xa_load(&mapping->i_pages, page_offset);
+
+		/*
+		 * Page already present?  Kick off the current batch of
+		 * contiguous pages before continuing with the next batch.
+		 */
 		if (page && !xa_is_value(page)) {
-			/*
-			 * Page already present?  Kick off the current batch of
-			 * contiguous pages before continuing with the next
-			 * batch.
-			 */
-			if (nr_pages)
-				read_pages(mapping, filp, &page_pool, nr_pages,
-						gfp_mask);
-			nr_pages = 0;
+			unsigned int count = pagevec_count(&pages);
+
+			if (count)
+				nr_pages += read_pages(mapping, filp, &pages,
+						offset, gfp_mask);
+			offset = page_offset + 1;
 			continue;
 		}
 
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = page_offset;
-		list_add(&page->lru, &page_pool);
+		if (pagevec_add(&pages, page) == 0) {
+			nr_pages += read_pages(mapping, filp, &pages,
+					offset, gfp_mask);
+			offset = page_offset + 1;
+		}
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
-		nr_pages++;
 	}
 
-	/*
-	 * Now start the IO.  We ignore I/O errors - if the page is not
-	 * uptodate then the caller will launch readpage again, and
-	 * will then handle the error.
-	 */
-	if (nr_pages)
-		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
-	BUG_ON(!list_empty(&page_pool));
+	if (pagevec_count(&pages))
+		nr_pages += read_pages(mapping, filp, &pages, offset, gfp_mask);
 out:
+	blk_finish_plug(&plug);
+
 	return nr_pages;
 }
 
-- 
2.24.1



* [PATCH 4/8] mm/fs: Add a_ops->readahead
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (2 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 3/8] mm: Use a pagevec for readahead Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 18:22   ` Daniel Wagner
  2020-01-13 15:37 ` [PATCH 5/8] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This will replace ->readpages with a saner interface:
 - No return type (errors are ignored for read ahead anyway)
 - Pages are already in the page cache when ->readahead is called
 - Pages are passed in a pagevec instead of a linked list

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst |  8 +++++-
 Documentation/filesystems/vfs.rst     |  9 ++++++
 include/linux/fs.h                    |  3 ++
 mm/readahead.c                        | 40 ++++++++++++++++++++++++++-
 4 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..1e2f1186fd1a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -239,6 +239,8 @@ prototypes::
 	int (*readpage)(struct file *, struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
 	int (*set_page_dirty)(struct page *page);
+	void (*readahead)(struct file *, struct address_space *,
+			struct pagevec *, pgoff_t index);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +273,8 @@ writepage:		yes, unlocks (see below)
 readpage:		yes, unlocks
 writepages:
 set_page_dirty		no
-readpages:
+readpages:		no
+readahead:		yes, unlocks
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
@@ -298,6 +301,9 @@ completion.
 ->readpages() populates the pagecache with the passed pages and starts
 I/O against them.  They come unlocked upon I/O completion.
 
+->readahead() starts I/O against the pages.  They come unlocked upon
+I/O completion.
+
 ->writepage() is used for two purposes: for "memory cleansing" and for
 "sync".  These are quite different operations and the behaviour may differ
 depending upon the mode.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..63d0f0dbbf9c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -706,6 +706,8 @@ cache in your filesystem.  The following members are defined:
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
 		int (*set_page_dirty)(struct page *page);
+		void (*readahead)(struct file *, struct address_space *,
+				 struct pagevec *, pgoff_t index);
 		int (*readpages)(struct file *filp, struct address_space *mapping,
 				 struct list_head *pages, unsigned nr_pages);
 		int (*write_begin)(struct file *, struct address_space *mapping,
@@ -781,6 +783,13 @@ cache in your filesystem.  The following members are defined:
 	If defined, it should set the PageDirty flag, and the
 	PAGECACHE_TAG_DIRTY tag in the radix tree.
 
+``readahead``
+	called by the VM to read pages associated with the address_space
+	object.  This is essentially a vector version of readpage.
+	Instead of just one page, several pages are requested.
+	Since this is readahead, attempt to start I/O on each page and
+	let the I/O completion path set errors on the page.
+
 ``readpages``
 	called by the VM to read pages associated with the address_space
 	object.  This is essentially just a vector version of readpage.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..2769f89666fb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -52,6 +52,7 @@ struct hd_geometry;
 struct iovec;
 struct kiocb;
 struct kobject;
+struct pagevec;
 struct pipe_inode_info;
 struct poll_table_struct;
 struct kstatfs;
@@ -375,6 +376,8 @@ struct address_space_operations {
 	 */
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	void (*readahead)(struct file *, struct address_space *,
+			struct pagevec *, pgoff_t offset);
 
 	int (*write_begin)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
diff --git a/mm/readahead.c b/mm/readahead.c
index 76a70a4406b5..2fe0974173ea 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -123,7 +123,45 @@ static unsigned read_pages(struct address_space *mapping, struct file *filp,
 	struct page *page;
 	unsigned int nr_pages = pagevec_count(pvec);
 
-	if (mapping->a_ops->readpages) {
+	if (mapping->a_ops->readahead) {
+		/*
+		 * When we remove support for ->readpages, we'll call
+		 * add_to_page_cache_lru() in the parent and all this
+		 * grot goes away.
+		 */
+		unsigned char first = pvec->first;
+		unsigned char saved_nr = pvec->nr;
+		pgoff_t base = offset;
+		pagevec_for_each(pvec, page) {
+			if (!add_to_page_cache_lru(page, mapping, offset++,
+						gfp)) {
+				unsigned char saved_first = pvec->first;
+
+				pvec->nr = pvec->first - 1;
+				pvec->first = first;
+				mapping->a_ops->readahead(filp, mapping, pvec,
+						base + first);
+				first = pvec->nr + 1;
+				pvec->nr = saved_nr;
+				pvec->first = saved_first;
+
+				put_page(page);
+			}
+		}
+		pvec->first = first;
+		offset = base + first;
+		mapping->a_ops->readahead(filp, mapping, pvec, offset);
+		/*
+		 * Ideally the implementation would at least attempt to
+		 * start I/O against all the pages, but there are times
+		 * when it makes more sense to just give up.  Take care
+		 * of any un-attempted pages here.
+		 */
+		pagevec_for_each(pvec, page) {
+			unlock_page(page);
+			put_page(page);
+		}
+	} else if (mapping->a_ops->readpages) {
 		LIST_HEAD(pages);
 
 		pagevec_for_each(pvec, page) {
-- 
2.24.1



* [PATCH 5/8] iomap,xfs: Convert from readpages to readahead
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (3 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 4/8] mm/fs: Add a_ops->readahead Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 6/8] cifs: " Matthew Wilcox
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in XFS and iomap.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 60 +++++++++---------------------------------
 fs/iomap/trace.h       | 18 ++++++-------
 fs/xfs/xfs_aops.c      | 12 ++++-----
 include/linux/iomap.h  |  4 +--
 4 files changed, 29 insertions(+), 65 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..818fa5bbd643 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -8,6 +8,7 @@
 #include <linux/fs.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/uio.h>
 #include <linux/buffer_head.h>
 #include <linux/dax.h>
@@ -216,7 +217,7 @@ struct iomap_readpage_ctx {
 	bool			cur_page_in_bio;
 	bool			is_readahead;
 	struct bio		*bio;
-	struct list_head	*pages;
+	struct pagevec		*pages;
 };
 
 static void
@@ -337,7 +338,7 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	unsigned poff;
 	loff_t ret;
 
-	trace_iomap_readpage(page->mapping->host, 1);
+	trace_iomap_readpage(page->mapping->host, (loff_t)PAGE_SIZE);
 
 	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
@@ -367,36 +368,8 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 }
 EXPORT_SYMBOL_GPL(iomap_readpage);
 
-static struct page *
-iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
-		loff_t length, loff_t *done)
-{
-	while (!list_empty(pages)) {
-		struct page *page = lru_to_page(pages);
-
-		if (page_offset(page) >= (u64)pos + length)
-			break;
-
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
-				GFP_NOFS))
-			return page;
-
-		/*
-		 * If we already have a page in the page cache at index we are
-		 * done.  Upper layers don't care if it is uptodate after the
-		 * readpages call itself as every page gets checked again once
-		 * actually needed.
-		 */
-		*done += PAGE_SIZE;
-		put_page(page);
-	}
-
-	return NULL;
-}
-
 static loff_t
-iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
@@ -410,8 +383,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page = NULL;
 		}
 		if (!ctx->cur_page) {
-			ctx->cur_page = iomap_next_page(inode, ctx->pages,
-					pos, length, &done);
+			ctx->cur_page = pagevec_next(ctx->pages);
 			if (!ctx->cur_page)
 				break;
 			ctx->cur_page_in_bio = false;
@@ -423,23 +395,22 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 	return done;
 }
 
-int
-iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops)
+void iomap_readahead(struct address_space *mapping, struct pagevec *pages,
+		pgoff_t index, const struct iomap_ops *ops)
 {
 	struct iomap_readpage_ctx ctx = {
 		.pages		= pages,
 		.is_readahead	= true,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
+	loff_t pos = (loff_t)index << PAGE_SHIFT;
+	loff_t last = page_offset(pagevec_last(pages));
 	loff_t length = last - pos + PAGE_SIZE, ret = 0;
 
-	trace_iomap_readpages(mapping->host, nr_pages);
+	trace_iomap_readahead(mapping->host, length);
 
 	while (length > 0) {
 		ret = iomap_apply(mapping->host, pos, length, 0, ops,
-				&ctx, iomap_readpages_actor);
+				&ctx, iomap_readahead_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
 			goto done;
@@ -456,15 +427,8 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 			unlock_page(ctx.cur_page);
 		put_page(ctx.cur_page);
 	}
-
-	/*
-	 * Check that we didn't lose a page due to the arcance calling
-	 * conventions..
-	 */
-	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
-	return ret;
 }
-EXPORT_SYMBOL_GPL(iomap_readpages);
+EXPORT_SYMBOL_GPL(iomap_readahead);
 
 /*
  * iomap_is_partially_uptodate checks whether blocks within a page are
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 6dc227b8c47e..adbfd9fd4275 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -16,30 +16,30 @@
 struct inode;
 
 DECLARE_EVENT_CLASS(iomap_readpage_class,
-	TP_PROTO(struct inode *inode, int nr_pages),
-	TP_ARGS(inode, nr_pages),
+	TP_PROTO(struct inode *inode, loff_t length),
+	TP_ARGS(inode, length),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(u64, ino)
-		__field(int, nr_pages)
+		__field(loff_t, length)
 	),
 	TP_fast_assign(
 		__entry->dev = inode->i_sb->s_dev;
 		__entry->ino = inode->i_ino;
-		__entry->nr_pages = nr_pages;
+		__entry->length = length;
 	),
-	TP_printk("dev %d:%d ino 0x%llx nr_pages %d",
+	TP_printk("dev %d:%d ino 0x%llx length %lld",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
-		  __entry->nr_pages)
+		  __entry->length)
 )
 
 #define DEFINE_READPAGE_EVENT(name)		\
 DEFINE_EVENT(iomap_readpage_class, name,	\
-	TP_PROTO(struct inode *inode, int nr_pages), \
-	TP_ARGS(inode, nr_pages))
+	TP_PROTO(struct inode *inode, loff_t length), \
+	TP_ARGS(inode, length))
 DEFINE_READPAGE_EVENT(iomap_readpage);
-DEFINE_READPAGE_EVENT(iomap_readpages);
+DEFINE_READPAGE_EVENT(iomap_readahead);
 
 DECLARE_EVENT_CLASS(iomap_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a688eb5c5ae..e3db35bcfa34 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,14 +621,14 @@ xfs_vm_readpage(
 	return iomap_readpage(page, &xfs_read_iomap_ops);
 }
 
-STATIC int
-xfs_vm_readpages(
+STATIC void
+xfs_vm_readahead(
 	struct file		*unused,
 	struct address_space	*mapping,
-	struct list_head	*pages,
-	unsigned		nr_pages)
+	struct pagevec		*pages,
+	pgoff_t			index)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &xfs_read_iomap_ops);
+	iomap_readahead(mapping, pages, index, &xfs_read_iomap_ops);
 }
 
 static int
@@ -644,7 +644,7 @@ xfs_iomap_swapfile_activate(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readahead		= xfs_vm_readahead,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..1af1ec0920d8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -155,8 +155,8 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-int iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops);
+void iomap_readahead(struct address_space *mapping, struct pagevec *pages,
+		pgoff_t index, const struct iomap_ops *ops);
 int iomap_set_page_dirty(struct page *page);
 int iomap_is_partially_uptodate(struct page *page, unsigned long from,
 		unsigned long count);
-- 
2.24.1



* [PATCH 6/8] cifs: Convert from readpages to readahead
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (4 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 5/8] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-01-13 15:37 ` " Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 7/8] mm: Remove add_to_page_cache_locked Matthew Wilcox
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in CIFS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/cifs/file.c | 125 ++++++++-----------------------------------------
 1 file changed, 19 insertions(+), 106 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 043288b5c728..816670b501d8 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4280,70 +4280,10 @@ cifs_readpages_copy_into_pages(struct TCP_Server_Info *server,
 	return readpages_fill_pages(server, rdata, iter, iter->count);
 }
 
-static int
-readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
-		    unsigned int rsize, struct list_head *tmplist,
-		    unsigned int *nr_pages, loff_t *offset, unsigned int *bytes)
-{
-	struct page *page, *tpage;
-	unsigned int expected_index;
-	int rc;
-	gfp_t gfp = readahead_gfp_mask(mapping);
-
-	INIT_LIST_HEAD(tmplist);
-
-	page = lru_to_page(page_list);
-
-	/*
-	 * Lock the page and put it in the cache. Since no one else
-	 * should have access to this page, we're safe to simply set
-	 * PG_locked without checking it first.
-	 */
-	__SetPageLocked(page);
-	rc = add_to_page_cache_locked(page, mapping,
-				      page->index, gfp);
-
-	/* give up if we can't stick it in the cache */
-	if (rc) {
-		__ClearPageLocked(page);
-		return rc;
-	}
-
-	/* move first page to the tmplist */
-	*offset = (loff_t)page->index << PAGE_SHIFT;
-	*bytes = PAGE_SIZE;
-	*nr_pages = 1;
-	list_move_tail(&page->lru, tmplist);
-
-	/* now try and add more pages onto the request */
-	expected_index = page->index + 1;
-	list_for_each_entry_safe_reverse(page, tpage, page_list, lru) {
-		/* discontinuity ? */
-		if (page->index != expected_index)
-			break;
-
-		/* would this page push the read over the rsize? */
-		if (*bytes + PAGE_SIZE > rsize)
-			break;
-
-		__SetPageLocked(page);
-		if (add_to_page_cache_locked(page, mapping, page->index, gfp)) {
-			__ClearPageLocked(page);
-			break;
-		}
-		list_move_tail(&page->lru, tmplist);
-		(*bytes) += PAGE_SIZE;
-		expected_index++;
-		(*nr_pages)++;
-	}
-	return rc;
-}
-
-static int cifs_readpages(struct file *file, struct address_space *mapping,
-	struct list_head *page_list, unsigned num_pages)
+static void cifs_readahead(struct file *file, struct address_space *mapping,
+		struct pagevec *pages, pgoff_t index)
 {
 	int rc;
-	struct list_head tmplist;
 	struct cifsFileInfo *open_file = file->private_data;
 	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file);
 	struct TCP_Server_Info *server;
@@ -4358,11 +4298,10 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	 * After this point, every page in the list might have PG_fscache set,
 	 * so we will need to clean that up off of every page we don't use.
 	 */
-	rc = cifs_readpages_from_fscache(mapping->host, mapping, page_list,
-					 &num_pages);
+	rc = -ENOBUFS;	/* fscache readahead not yet converted */
 	if (rc == 0) {
 		free_xid(xid);
-		return rc;
+		return;
 	}
 
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
@@ -4373,24 +4312,12 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	rc = 0;
 	server = tlink_tcon(open_file->tlink)->ses->server;
 
-	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, file, mapping, num_pages);
+	cifs_dbg(FYI, "%s: file=%p mapping=%p index=%lu\n",
+		 __func__, file, mapping, index);
 
-	/*
-	 * Start with the page at end of list and move it to private
-	 * list. Do the same with any following pages until we hit
-	 * the rsize limit, hit an index discontinuity, or run out of
-	 * pages. Issue the async read and then start the loop again
-	 * until the list is empty.
-	 *
-	 * Note that list order is important. The page_list is in
-	 * the order of declining indexes. When we put the pages in
-	 * the rdata->pages, then we want them in increasing order.
-	 */
-	while (!list_empty(page_list)) {
-		unsigned int i, nr_pages, bytes, rsize;
-		loff_t offset;
-		struct page *page, *tpage;
+	while (pages->first < pages->nr) {
+		unsigned int i, nr_pages, rsize;
+		struct page *page;
 		struct cifs_readdata *rdata;
 		struct cifs_credits credits_on_stack;
 		struct cifs_credits *credits = &credits_on_stack;
@@ -4408,6 +4335,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		if (rc)
 			break;
 
+		nr_pages = rsize / PAGE_SIZE;
 		/*
 		 * Give up immediately if rsize is too small to read an entire
 		 * page. The VFS will fall back to readpage. We should never
@@ -4415,36 +4343,23 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		 * rsize is smaller than a cache page.
 		 */
 		if (unlikely(rsize < PAGE_SIZE)) {
-			add_credits_and_wake_if(server, credits, 0);
-			free_xid(xid);
-			return 0;
-		}
-
-		rc = readpages_get_pages(mapping, page_list, rsize, &tmplist,
-					 &nr_pages, &offset, &bytes);
-		if (rc) {
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
 
+		if (nr_pages > pagevec_count(pages))
+			nr_pages = pagevec_count(pages);
+
 		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
 		if (!rdata) {
 			/* best to give up if we're out of mem */
-			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-				list_del(&page->lru);
-				lru_cache_add_file(page);
-				unlock_page(page);
-				put_page(page);
-			}
-			rc = -ENOMEM;
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
 
 		rdata->cfile = cifsFileInfo_get(open_file);
 		rdata->mapping = mapping;
-		rdata->offset = offset;
-		rdata->bytes = bytes;
+	rdata->offset = (loff_t)index << PAGE_SHIFT;
 		rdata->pid = pid;
 		rdata->pagesz = PAGE_SIZE;
 		rdata->tailsz = PAGE_SIZE;
@@ -4452,9 +4367,10 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		rdata->copy_into_pages = cifs_readpages_copy_into_pages;
 		rdata->credits = credits_on_stack;
 
-		list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-			list_del(&page->lru);
-			rdata->pages[rdata->nr_pages++] = page;
+		for (i = 0; i < nr_pages; i++) {
+			rdata->pages[rdata->nr_pages++] = pagevec_next(pages);
+			index++;
+			rdata->bytes += PAGE_SIZE;
 		}
 
 		rc = adjust_credits(server, &rdata->credits, rdata->bytes);
@@ -4470,7 +4386,6 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 			add_credits_and_wake_if(server, &rdata->credits, 0);
 			for (i = 0; i < rdata->nr_pages; i++) {
 				page = rdata->pages[i];
-				lru_cache_add_file(page);
 				unlock_page(page);
 				put_page(page);
 			}
@@ -4486,9 +4401,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	 * the pagecache must be uncached before they get returned to the
 	 * allocator.
 	 */
-	cifs_fscache_readpages_cancel(mapping->host, page_list);
 	free_xid(xid);
-	return rc;
 }
 
 /*
@@ -4806,7 +4719,7 @@ cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations cifs_addr_ops = {
 	.readpage = cifs_readpage,
-	.readpages = cifs_readpages,
+	.readahead = cifs_readahead,
 	.writepage = cifs_writepage,
 	.writepages = cifs_writepages,
 	.write_begin = cifs_write_begin,
-- 
2.24.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 7/8] mm: Remove add_to_page_cache_locked
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (5 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 6/8] cifs: " Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 15:37 ` [PATCH 8/8] mm: Unify all add_to_page_cache variants Matthew Wilcox
  2020-01-13 16:42 ` [RFC 0/8] Replacing the readpages a_op Chris Mason
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The only remaining caller is add_to_page_cache(), and the only
caller of that is hugetlbfs, so move add_to_page_cache() into
filemap.c and call __add_to_page_cache_locked() directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 20 ++------------------
 mm/filemap.c            | 23 ++++++++---------------
 2 files changed, 10 insertions(+), 33 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 37a4d9e32cd3..3ce051fb9c73 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -604,8 +604,8 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 	return 0;
 }
 
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+				pgoff_t index, gfp_t gfp);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
@@ -614,22 +614,6 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
 
-/*
- * Like add_to_page_cache_locked, but used to add newly allocated pages:
- * the page is new, so we can just run __SetPageLocked() against it.
- */
-static inline int add_to_page_cache(struct page *page,
-		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
-{
-	int error;
-
-	__SetPageLocked(page);
-	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
-	if (unlikely(error))
-		__ClearPageLocked(page);
-	return error;
-}
-
 static inline unsigned long dir_pages(struct inode *inode)
 {
 	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..fb87f5fa75e6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -913,25 +913,18 @@ static int __add_to_page_cache_locked(struct page *page,
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
 
-/**
- * add_to_page_cache_locked - add a locked page to the pagecache
- * @page:	page to add
- * @mapping:	the page's address_space
- * @offset:	page index
- * @gfp_mask:	page allocation mode
- *
- * This function is used to add a page to the pagecache. It must be locked.
- * This function does not add the page to the LRU.  The caller must do that.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
+int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	return __add_to_page_cache_locked(page, mapping, offset,
+	int err;
+
+	__SetPageLocked(page);
+	err = __add_to_page_cache_locked(page, mapping, offset,
 					  gfp_mask, NULL);
+	if (unlikely(err))
+		__ClearPageLocked(page);
+	return err;
 }
-EXPORT_SYMBOL(add_to_page_cache_locked);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)
-- 
2.24.1



* [PATCH 8/8] mm: Unify all add_to_page_cache variants
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (6 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 7/8] mm: Remove add_to_page_cache_locked Matthew Wilcox
@ 2020-01-13 15:37 ` Matthew Wilcox
  2020-01-13 16:42 ` [RFC 0/8] Replacing the readpages a_op Chris Mason
  8 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 15:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), jlayton, hch

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We already have various bits of add_to_page_cache() executed conditionally
on !PageHuge(page); add the add_to_page_cache_lru() pieces as some
more code which isn't executed for huge pages.  This lets us remove
the old add_to_page_cache() and rename __add_to_page_cache_locked() to
add_to_page_cache().  Include a compatibility define so we don't have
to change all 20+ callers of add_to_page_cache_lru().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h |  5 ++--
 mm/filemap.c            | 65 ++++++++++++-----------------------------
 2 files changed, 21 insertions(+), 49 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3ce051fb9c73..753e8df6a5b1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -606,14 +606,15 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp);
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
 
+#define add_to_page_cache_lru(page, mapping, index, gfp) \
+	add_to_page_cache(page, mapping, index, gfp)
+
 static inline unsigned long dir_pages(struct inode *inode)
 {
 	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
diff --git a/mm/filemap.c b/mm/filemap.c
index fb87f5fa75e6..83f45f31a00a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -847,19 +847,18 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
-static int __add_to_page_cache_locked(struct page *page,
-				      struct address_space *mapping,
-				      pgoff_t offset, gfp_t gfp_mask,
-				      void **shadowp)
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+		pgoff_t offset, gfp_t gfp_mask)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
-	void *old;
+	void *old, *shadow = NULL;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
+	__SetPageLocked(page);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -884,8 +883,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+			shadow = old;
 		}
 		mapping->nrpages++;
 
@@ -899,45 +897,8 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (xas_error(&xas))
 		goto error;
 
-	if (!huge)
+	if (!huge) {
 		mem_cgroup_commit_charge(page, memcg, false, false);
-	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
-error:
-	page->mapping = NULL;
-	/* Leave page->index set: truncation relies upon it */
-	if (!huge)
-		mem_cgroup_cancel_charge(page, memcg, false);
-	put_page(page);
-	return xas_error(&xas);
-}
-ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-
-int add_to_page_cache(struct page *page, struct address_space *mapping,
-		pgoff_t offset, gfp_t gfp_mask)
-{
-	int err;
-
-	__SetPageLocked(page);
-	err = __add_to_page_cache_locked(page, mapping, offset,
-					  gfp_mask, NULL);
-	if (unlikely(err))
-		__ClearPageLocked(page);
-	return err;
-}
-
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
-{
-	void *shadow = NULL;
-	int ret;
-
-	__SetPageLocked(page);
-	ret = __add_to_page_cache_locked(page, mapping, offset,
-					 gfp_mask, &shadow);
-	if (unlikely(ret))
-		__ClearPageLocked(page);
-	else {
 		/*
 		 * The page might have been evicted from cache only
 		 * recently, in which case it should be activated like
@@ -951,9 +912,19 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
-	return ret;
+	trace_mm_filemap_add_to_page_cache(page);
+	return 0;
+error:
+	page->mapping = NULL;
+	/* Leave page->index set: truncation relies upon it */
+	if (!huge)
+		mem_cgroup_cancel_charge(page, memcg, false);
+	put_page(page);
+	__ClearPageLocked(page);
+	return xas_error(&xas);
 }
-EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
+ALLOW_ERROR_INJECTION(add_to_page_cache, ERRNO);
+EXPORT_SYMBOL_GPL(add_to_page_cache);
 
 #ifdef CONFIG_NUMA
 struct page *__page_cache_alloc(gfp_t gfp)
-- 
2.24.1



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
                   ` (7 preceding siblings ...)
  2020-01-13 15:37 ` [PATCH 8/8] mm: Unify all add_to_page_cache variants Matthew Wilcox
@ 2020-01-13 16:42 ` Chris Mason
  2020-01-13 17:40   ` Matthew Wilcox
  2020-01-13 17:54   ` Matthew Wilcox
  8 siblings, 2 replies; 25+ messages in thread
From: Chris Mason @ 2020-01-13 16:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch



On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> I think everybody hates the readpages API.  The fundamental problem 
> with
> it is that it passes the pages to be read on a doubly linked list, 
> using
> the ->lru list in the struct page.  That means the filesystems have to
> do the work of calling add_to_page_cache{,_lru,_locked}, and handling
> failures (because another task is also accessing that chunk of the 
> file,
> and so it fails).
>
> This is an attempt to add a ->readahead op to replace ->readpages.  
> I've
> converted two users, iomap/xfs and cifs.  The cifs conversion is 
> lacking
> fscache support, and that's just because I didn't want to do that 
> work;
> I don't believe there's anything fundamental to it.  But I wanted to 
> do
> iomap because it is The Infrastructure Of The Future and cifs because 
> it
> is the sole remaining user of add_to_page_cache_locked(), which 
> enables
> the last two patches in the series.  By the way, that gives CIFS 
> access
> to the workingset shadow infrastructure, which it had to ignore before
> because it couldn't put pages onto the lru list at the right time.

I've always kind of liked the compromise of sending the lists.  It's 
really good at the common case and doesn't have massive problems when 
things break down.   Just glancing through the patches, the old 
readpages is called in bigger chunks, so for massive reads we can do 
more effective readahead on metadata.  I don't think any of us actually 
do, but we could.

With this new operation, our window is constant, and much smaller.

>
> The fundamental question is, how do we indicate to the implementation 
> of
> ->readahead what pages to operate on?  I've gone with passing a 
> pagevec.
> This has the obvious advantage that it's a data structure that already
> exists and is used within filemap for batches of pages.  I had to add 
> a
> bit of new infrastructure to support iterating over the pages in the
> pagevec, but with that done, it's quite nice.
>
> I think the biggest problem is that the size of the pagevec is limited
> to 15 pages (60kB).  So that'll mean that if the readahead window 
> bumps
> all the way up to 256kB, we may end up making 5 BIOs (and merging 
> them)
> instead of one.  I'd kind of like to be able to allocate variable 
> length
> pagevecs while allowing regular pagevecs to be allocated on the stack,
> but I can't figure out a way to do that.  eg this doesn't work:
>
> -       struct page *pages[PAGEVEC_SIZE];
> +       union {
> +               struct page *pages[PAGEVEC_SIZE];
> +               struct page *_pages[];
> +       }
>
> and if we just allocate them, useful and wonderful tools are going to
> point out when pages[16] is accessed that we've overstepped the end of
> the array.
>
> I have considered alternatives to the pagevec like just having the
> ->readahead implementation look up the pages in the i_pages XArray
> directly.  That didn't work out too well.
>

Btrfs basically does this now, honestly iomap isn't that far away.  
Given how sensible iomap is for this, I'd rather see us pile into that 
abstraction than try to pass pagevecs for large ranges.  Otherwise, if 
the lists are awkward we can make some helpers to make it less error 
prone?

-chris



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 16:42 ` [RFC 0/8] Replacing the readpages a_op Chris Mason
@ 2020-01-13 17:40   ` Matthew Wilcox
  2020-01-13 18:00     ` Chris Mason
  2020-01-13 17:54   ` Matthew Wilcox
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 17:40 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 10:37, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > I think everybody hates the readpages API.  The fundamental problem 
> > with
> > it is that it passes the pages to be read on a doubly linked list, 
> > using
> > the ->lru list in the struct page.  That means the filesystems have to
> > do the work of calling add_to_page_cache{,_lru,_locked}, and handling
> > failures (because another task is also accessing that chunk of the 
> > file,
> > and so it fails).
> 
> I've always kind of liked the compromise of sending the lists.  It's 
> really good at the common case and doesn't have massive problems when 
> things break down.

I think we'll have to disagree on that point.  Linked lists are awful
for the CPU in the common case, and the error handling code for "things
break down" is painful.  I'm pretty sure I spotted three bugs in the
CIFS implementation.

> Just glancing through the patches, the old 
> readpages is called in bigger chunks, so for massive reads we can do 
> more effective readahead on metadata.  I don't think any of us actually 
> do, but we could.
> 
> With this new operation, our window is constant, and much smaller.
> 
> > The fundamental question is, how do we indicate to the implementation 
> > of
> > ->readahead what pages to operate on?  I've gone with passing a 
> > pagevec.
> > This has the obvious advantage that it's a data structure that already
> > exists and is used within filemap for batches of pages.  I had to add 
> > a
> > bit of new infrastructure to support iterating over the pages in the
> > pagevec, but with that done, it's quite nice.
> >
> > I think the biggest problem is that the size of the pagevec is limited
> > to 15 pages (60kB).  So that'll mean that if the readahead window 
> > bumps
> > all the way up to 256kB, we may end up making 5 BIOs (and merging 
> > them)
> > instead of one.  I'd kind of like to be able to allocate variable 
> > length
> > pagevecs while allowing regular pagevecs to be allocated on the stack,
> > but I can't figure out a way to do that.  eg this doesn't work:
> >
> > -       struct page *pages[PAGEVEC_SIZE];
> > +       union {
> > +               struct page *pages[PAGEVEC_SIZE];
> > +               struct page *_pages[];
> > +       }
> >
> > and if we just allocate them, useful and wonderful tools are going to
> > point out when pages[16] is accessed that we've overstepped the end of
> > the array.
> >
> > I have considered alternatives to the pagevec like just having the
> > ->readahead implementation look up the pages in the i_pages XArray
> > directly.  That didn't work out too well.
> >
> 
> Btrfs basically does this now, honestly iomap isn't that far away.  
> Given how sensible iomap is for this, I'd rather see us pile into that 
> abstraction than try to pass pagevecs for large ranges.  Otherwise, if 
> the lists are awkward we can make some helpers to make it less error 
> prone?

I did do a couple of helpers for lists for iomap before deciding the
whole thing was too painful.  I didn't look at btrfs until just now, but, um ...

int extent_readpages(struct address_space *mapping, struct list_head *pages,
                     unsigned nr_pages)
..
        struct page *pagepool[16];
..
        while (!list_empty(pages)) {
..
                        list_del(&page->lru);
                        if (add_to_page_cache_lru(page, mapping, page->index,
..
                        pagepool[nr++] = page;

you're basically doing exactly what i'm proposing to be the new interface!
OK, you get one extra page per batch ;-P


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 16:42 ` [RFC 0/8] Replacing the readpages a_op Chris Mason
  2020-01-13 17:40   ` Matthew Wilcox
@ 2020-01-13 17:54   ` Matthew Wilcox
  2020-01-13 22:19     ` Jens Axboe
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 17:54 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> Btrfs basically does this now, honestly iomap isn't that far away.  
> Given how sensible iomap is for this, I'd rather see us pile into that 
> abstraction than try to pass pagevecs for large ranges.  Otherwise, if 

I completely misread this at first and thought you were proposing we
pass a bio_vec to ->readahead.  Initially, this struck me as a layering
violation: it's completely unnecessary to have all these extra offset/size
fields allocated and passed around.  But ... the bio_vec and the skb_frag_t are
now the same data structure, so both block and network use it.  It may
make sense to have this as the common data structure for 'unit of IO'.
The bio supports having the bi_vec allocated externally to the data
structure while the skbuff would need to copy the array.

Maybe we need a more neutral name than bio_vec so as to not upset people.
page_frag, perhaps [1].

[1] Yes, I know about the one in include/linux/mm_types_task.h


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 17:40   ` Matthew Wilcox
@ 2020-01-13 18:00     ` Chris Mason
  2020-01-13 21:58       ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2020-01-13 18:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 13 Jan 2020, at 12:40, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
>
> I did do a couple of helpers for lists for iomap before deciding the
> whole thing was too painful.  I didn't look at btrfs until just now, 
> but, um ...
>
> int extent_readpages(struct address_space *mapping, struct list_head 
> *pages,
>                      unsigned nr_pages)
> ..
>         struct page *pagepool[16];
> ..
>         while (!list_empty(pages)) {
> ..
>                         list_del(&page->lru);
>                         if (add_to_page_cache_lru(page, mapping, 
> page->index,
> ..
>                         pagepool[nr++] = page;
>
> you're basically doing exactly what i'm proposing to be the new 
> interface!
> OK, you get one extra page per batch ;-P

This is true, I didn't explain that part well ;)  Depending on 
compression etc we might end up poking the xarray inside the actual IO 
functions, but the main difference is that btrfs is building a single 
bio.  You're moving the plug so you'll merge into single bio, but I'd 
rather build 2MB bios than merge them.

I guess it doesn't feel like enough of a win to justify the churn.  If 
we find a way to do much larger pagevecs, I think this makes more sense.

-chris



* Re: [PATCH 4/8] mm/fs: Add a_ops->readahead
  2020-01-13 15:37 ` [PATCH 4/8] mm/fs: Add a_ops->readahead Matthew Wilcox
@ 2020-01-13 18:22   ` Daniel Wagner
  2020-01-13 19:17     ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Wagner @ 2020-01-13 18:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

Hi,

On Mon, Jan 13, 2020 at 07:37:42AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This will replace ->readpages with a saner interface:
>  - No return type (errors are ignored for read ahead anyway)
>  - Pages are already in the page cache when ->readpages is called
>  - Pages are passed in a pagevec instead of a linked list
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  Documentation/filesystems/locking.rst |  8 +++++-
>  Documentation/filesystems/vfs.rst     |  9 ++++++
>  include/linux/fs.h                    |  3 ++
>  mm/readahead.c                        | 40 ++++++++++++++++++++++++++-
>  4 files changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..1e2f1186fd1a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,8 @@ prototypes::
>  	int (*readpage)(struct file *, struct page *);
>  	int (*writepages)(struct address_space *, struct writeback_control *);
>  	int (*set_page_dirty)(struct page *page);
> +	int (*readahead)(struct file *, struct address_space *,
> +			struct pagevec *, pgoff_t index);

Shouldn't this be no return type? Just trying to map your commit
message to the code.


>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
>  	int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +273,8 @@ writepage:		yes, unlocks (see below)
>  readpage:		yes, unlocks
>  writepages:
>  set_page_dirty		no
> -readpages:
> +readpages:              no
> +readahead:              yes, unlocks
>  write_begin:		locks the page		 exclusive
>  write_end:		yes, unlocks		 exclusive
>  bmap:
> @@ -298,6 +301,9 @@ completion.
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> +->readahead() starts I/O against the pages.  They come unlocked upon
> +I/O completion.
> +
>  ->writepage() is used for two purposes: for "memory cleansing" and for
>  "sync".  These are quite different operations and the behaviour may differ
>  depending upon the mode.
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..63d0f0dbbf9c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,8 @@ cache in your filesystem.  The following members are defined:
>  		int (*readpage)(struct file *, struct page *);
>  		int (*writepages)(struct address_space *, struct writeback_control *);
>  		int (*set_page_dirty)(struct page *page);
> +		int (*readahead)(struct file *, struct address_space *,
> +				 struct pagevec *, pgoff_t index);

Same here?

Maybe I just missed the point; in that case, sorry for the noise.

Thanks,
Daniel



* Re: [PATCH 4/8] mm/fs: Add a_ops->readahead
  2020-01-13 18:22   ` Daniel Wagner
@ 2020-01-13 19:17     ` Matthew Wilcox
  0 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 19:17 UTC (permalink / raw)
  To: Daniel Wagner; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 07:22:42PM +0100, Daniel Wagner wrote:
> > +	int (*readahead)(struct file *, struct address_space *,
> > +			struct pagevec *, pgoff_t index);
> 
> Shouldn't this be no return type? Just trying to map your commit
> message to the code.

Yes, I'll fix it.  I'm not really impressed with the filesystem
documentation at this point; I feel like more could be automated and
there's too much duplication.

> Maybe I just miss the point, in the case, sorry for the noise.

Well, mostly I'm looking for architectural and design feedback.  I have
compiled this code, but not run it.


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 18:00     ` Chris Mason
@ 2020-01-13 21:58       ` Matthew Wilcox
  2020-01-13 22:00         ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 21:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch, Jens Axboe

On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> This is true, I didn't explain that part well ;)  Depending on 
> compression etc we might end up poking the xarray inside the actual IO 
> functions, but the main difference is that btrfs is building a single 
> bio.  You're moving the plug so you'll merge into single bio, but I'd 
> rather build 2MB bios than merge them.

Why don't we store a bio pointer inside the plug?  You're opencoding that,
iomap is opencoding that, and I bet there's a dozen other places where
we pass a bio around.  Then blk_finish_plug can submit the bio.


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 21:58       ` Matthew Wilcox
@ 2020-01-13 22:00         ` Jens Axboe
  2020-01-13 22:10           ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2020-01-13 22:00 UTC (permalink / raw)
  To: Matthew Wilcox, Chris Mason
  Cc: linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>> This is true, I didn't explain that part well ;)  Depending on 
>> compression etc we might end up poking the xarray inside the actual IO 
>> functions, but the main difference is that btrfs is building a single 
>> bio.  You're moving the plug so you'll merge into single bio, but I'd 
>> rather build 2MB bios than merge them.
> 
> Why don't we store a bio pointer inside the plug?  You're opencoding that,
> iomap is opencoding that, and I bet there's a dozen other places where
> we pass a bio around.  Then blk_finish_plug can submit the bio.

Plugs aren't necessarily a bio, they can be callbacks too.

-- 
Jens Axboe



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:00         ` Jens Axboe
@ 2020-01-13 22:10           ` Matthew Wilcox
  2020-01-13 22:14             ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 22:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Chris Mason, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >> This is true, I didn't explain that part well ;)  Depending on 
> >> compression etc we might end up poking the xarray inside the actual IO 
> >> functions, but the main difference is that btrfs is building a single 
> >> bio.  You're moving the plug so you'll merge into a single bio, but I'd 
> >> rather build 2MB bios than merge them.
> > 
> > Why don't we store a bio pointer inside the plug?  You're opencoding that,
> > iomap is opencoding that, and I bet there's a dozen other places where
> > we pass a bio around.  Then blk_finish_plug can submit the bio.
> 
> Plugs aren't necessarily a bio, they can be callbacks too.

I'm thinking something as simple as this:

@@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)
 
        INIT_LIST_HEAD(&plug->mq_list);
        INIT_LIST_HEAD(&plug->cb_list);
+       plug->bio = NULL;
        plug->rq_count = 0;
        plug->multiple_queues = false;
 
@@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
 {
        if (plug != current->plug)
                return;
+       if (plug->bio)
+               submit_bio(plug->bio);
        blk_flush_plug_list(plug, false);
 
        current->plug = NULL;
@@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
 struct blk_plug {
        struct list_head mq_list; /* blk-mq requests */
        struct list_head cb_list; /* md requires an unplug callback */
+       struct bio *bio;
        unsigned short rq_count;
        bool multiple_queues;
 };

with accessors to 'get_current_bio()' and 'set_current_bio()'.
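
As a compile-and-run toy model of that proposal (every structure and function
here is a simplified stand-in, not the real block-layer API -- the kernel's
blk_plug, current->plug and submit_bio are only mimicked):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel structures under discussion. */
struct bio {
	int submitted;
};

struct blk_plug {
	struct bio *bio;	/* pending bio, submitted at unplug time */
};

/* Models 'current->plug' for a single task. */
static struct blk_plug *current_plug;

static void blk_start_plug(struct blk_plug *plug)
{
	plug->bio = NULL;
	current_plug = plug;
}

/* The proposed accessors: reach the pending bio via the current plug. */
static struct bio *get_current_bio(void)
{
	return current_plug ? current_plug->bio : NULL;
}

static void set_current_bio(struct bio *bio)
{
	if (current_plug)
		current_plug->bio = bio;
}

static void submit_bio(struct bio *bio)
{
	bio->submitted = 1;
}

static void blk_finish_plug(struct blk_plug *plug)
{
	if (plug != current_plug)
		return;
	/* The new behaviour: any bio still parked in the plug goes out. */
	if (plug->bio)
		submit_bio(plug->bio);
	current_plug = NULL;
}
```

The point of the sketch is only that the pending bio travels with the plug
and is submitted at blk_finish_plug() time; a real implementation would also
have to handle nested plugs and preemption.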


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:10           ` Matthew Wilcox
@ 2020-01-13 22:14             ` Jens Axboe
  2020-01-13 22:27               ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2020-01-13 22:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Mason, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>> This is true, I didn't explain that part well ;)  Depending on 
>>>> compression etc we might end up poking the xarray inside the actual IO 
>>>> functions, but the main difference is that btrfs is building a single 
>>>> bio.  You're moving the plug so you'll merge into a single bio, but I'd 
>>>> rather build 2MB bios than merge them.
>>>
>>> Why don't we store a bio pointer inside the plug?  You're opencoding that,
>>> iomap is opencoding that, and I bet there's a dozen other places where
>>> we pass a bio around.  Then blk_finish_plug can submit the bio.
>>
>> Plugs aren't necessarily a bio, they can be callbacks too.
> 
> I'm thinking something as simple as this:
> 
> @@ -1711,6 +1711,7 @@ void blk_start_plug(struct blk_plug *plug)
>  
>         INIT_LIST_HEAD(&plug->mq_list);
>         INIT_LIST_HEAD(&plug->cb_list);
> +       plug->bio = NULL;
>         plug->rq_count = 0;
>         plug->multiple_queues = false;
>  
> @@ -1786,6 +1787,8 @@ void blk_finish_plug(struct blk_plug *plug)
>  {
>         if (plug != current->plug)
>                 return;
> +       if (plug->bio)
> +               submit_bio(plug->bio);
>         blk_flush_plug_list(plug, false);
>  
>         current->plug = NULL;
> @@ -1160,6 +1160,7 @@ extern void blk_set_queue_dying(struct request_queue *);
>  struct blk_plug {
>         struct list_head mq_list; /* blk-mq requests */
>         struct list_head cb_list; /* md requires an unplug callback */
> +       struct bio *bio;
>         unsigned short rq_count;
>         bool multiple_queues;
>  };
> 
> with accessors to 'get_current_bio()' and 'set_current_bio()'.

It's a little odd imho, the plugging generally collects requests. Sounds like
what you're looking for is some plug owner private data, which just
happens to be a bio in this case?

Is this over repeated calls to some IO generating helper? Would it be
more efficient if that helper could generate the full bio in one go,
instead of piecemeal?

-- 
Jens Axboe



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 17:54   ` Matthew Wilcox
@ 2020-01-13 22:19     ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2020-01-13 22:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Mason, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 10:54 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 13, 2020 at 04:42:10PM +0000, Chris Mason wrote:
> > Btrfs basically does this now, honestly iomap isn't that far away.
> > Given how sensible iomap is for this, I'd rather see us pile into
> > that abstraction than try to pass pagevecs for large ranges.
> > Otherwise, if
>
> I completely misread this at first and thought you were proposing we
> pass a bio_vec to ->readahead.  Initially, this is a layering
> violation, completely unnecessary to have all these extra offset/size
> fields being allocated and passed around.  But ... the bio_vec and the
> skb_frag_t are now the same data structure, so both block and network
> use it.  It may make sense to have this as the common data structure
> for 'unit of IO'.  The bio supports having the bi_vec allocated
> externally to the data structure while the skbuff would need to copy
> the array.
>
> Maybe we need a more neutral name than bio_vec so as to not upset
> people.  page_frag, perhaps [1].
>
> [1] Yes, I know about the one in include/linux/mm_types_task.h

Note that bio_vecs support page merging, so page fragment isn't very
descriptive. Not sure what a good name would be, but fragment isn't it.
But any shared type would imply that others should support that as well
if you're passing it around (or by pointer).

-- 
Jens Axboe



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:14             ` Jens Axboe
@ 2020-01-13 22:27               ` Matthew Wilcox
  2020-01-13 22:30                 ` Jens Axboe
  2020-01-13 22:34                 ` Chris Mason
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-13 22:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Chris Mason, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
> >>>> This is true, I didn't explain that part well ;)  Depending on 
> >>>> compression etc we might end up poking the xarray inside the actual IO 
> >>>> functions, but the main difference is that btrfs is building a single 
> >>>> bio.  You're moving the plug so you'll merge into a single bio, but I'd 
> >>>> rather build 2MB bios than merge them.
> >>>
> >>> Why don't we store a bio pointer inside the plug?  You're opencoding that,
> >>> iomap is opencoding that, and I bet there's a dozen other places where
> >>> we pass a bio around.  Then blk_finish_plug can submit the bio.
> >>
> >> Plugs aren't necessarily a bio, they can be callbacks too.
> > 
> > I'm thinking something as simple as this:
>
> It's a little odd imho, the plugging generally collects requests. Sounds like
> what you're looking for is some plug owner private data, which just
> happens to be a bio in this case?
> 
> Is this over repeated calls to some IO generating helper? Would it be
> more efficient if that helper could generate the full bio in one go,
> instead of piecemeal?

The issue is around ->readpages.  Take a look at how iomap_readpages
works, for example.  We're under a plug (taken in mm/readahead.c),
but we still go through the rigamarole of keeping a pointer to the bio
in ctx.bio and passing ctx around so that we don't end up with many
fragments which have to be recombined into a single bio at the end.

I think what I want is a bio I can reach from current, somehow.  And the
plug feels like a natural place to keep it because it's basically saying
"I want to do lots of little IOs and have them combined".  The fact that
the iomap code has a bio that it precombines fragments into suggests to
me that the existing antifragmentation code in the plugging mechanism
isn't good enough.  So let's make it better by storing a bio in the plug
and then we can get rid of the bio in the iomap code.


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:27               ` Matthew Wilcox
@ 2020-01-13 22:30                 ` Jens Axboe
  2020-01-13 22:34                 ` Chris Mason
  1 sibling, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2020-01-13 22:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Mason, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 1/13/20 3:27 PM, Matthew Wilcox wrote:
> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on 
>>>>>> compression etc we might end up poking the xarray inside the actual IO 
>>>>>> functions, but the main difference is that btrfs is building a single 
>>>>>> bio.  You're moving the plug so you'll merge into a single bio, but I'd 
>>>>>> rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're opencoding that,
>>>>> iomap is opencoding that, and I bet there's a dozen other places where
>>>>> we pass a bio around.  Then blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collects requests. Sounds like
>> what you're looking for is some plug owner private data, which just
>> happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper? Would it be
>> more efficient if that helper could generate the full bio in one go,
>> instead of piecemeal?
> 
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the bio
> in ctx.bio and passing ctx around so that we don't end up with many
> fragments which have to be recombined into a single bio at the end.
> 
> I think what I want is a bio I can reach from current, somehow.  And the
> plug feels like a natural place to keep it because it's basically saying
> "I want to do lots of little IOs and have them combined".  The fact that
> the iomap code has a bio that it precombines fragments into suggests to
> me that the existing antifragmentation code in the plugging mechanism
> isn't good enough.  So let's make it better by storing a bio in the plug
> and then we can get rid of the bio in the iomap code.

But it doesn't fit the plug model very well. If you get preempted, it
could go away. Or if you have nested plugs, it also won't work at all. I
think it's much saner to keep the "current" bio _outside_ of the plug,
and work on it in peace until you want to submit it.

-- 
Jens Axboe



* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:27               ` Matthew Wilcox
  2020-01-13 22:30                 ` Jens Axboe
@ 2020-01-13 22:34                 ` Chris Mason
  2020-01-14  1:01                   ` Matthew Wilcox
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Mason @ 2020-01-13 22:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
>> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
>>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
>>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
>>>>> On Mon, Jan 13, 2020 at 06:00:52PM +0000, Chris Mason wrote:
>>>>>> This is true, I didn't explain that part well ;)  Depending on
>>>>>> compression etc we might end up poking the xarray inside the 
>>>>>> actual IO
>>>>>> functions, but the main difference is that btrfs is building a 
>>>>>> single
>>>>> bio.  You're moving the plug so you'll merge into a single bio, but 
>>>>>> I'd
>>>>>> rather build 2MB bios than merge them.
>>>>>
>>>>> Why don't we store a bio pointer inside the plug?  You're 
>>>>> opencoding that,
>>>>> iomap is opencoding that, and I bet there's a dozen other places 
>>>>> where
>>>>> we pass a bio around.  Then blk_finish_plug can submit the bio.
>>>>
>>>> Plugs aren't necessarily a bio, they can be callbacks too.
>>>
>>> I'm thinking something as simple as this:
>>
>> It's a little odd imho, the plugging generally collects requests. 
>> Sounds like
>> what you're looking for is some plug owner private data, which just
>> happens to be a bio in this case?
>>
>> Is this over repeated calls to some IO generating helper? Would it be
>> more efficient if that helper could generate the full bio in one go,
>> instead of piecemeal?
>
> The issue is around ->readpages.  Take a look at how iomap_readpages
> works, for example.  We're under a plug (taken in mm/readahead.c),
> but we still go through the rigamarole of keeping a pointer to the bio
> in ctx.bio and passing ctx around so that we don't end up with many
> fragments which have to be recombined into a single bio at the end.
>
> I think what I want is a bio I can reach from current, somehow.  And 
> the
> plug feels like a natural place to keep it because it's basically 
> saying
> "I want to do lots of little IOs and have them combined".  The fact 
> that
> the iomap code has a bio that it precombines fragments into suggests 
> to
> me that the existing antifragmentation code in the plugging mechanism
> isn't good enough.  So let's make it better by storing a bio in the 
> plug
> and then we can get rid of the bio in the iomap code.

Both btrfs and xfs do this, we have a bio that we pass around and build 
and submit.  We both also do some gymnastics in writepages to avoid 
waiting for the bios we've been building to finish while we're building 
them.

I love the idea of the plug api having a way to hold that for us, but 
sometimes we really are building the bios, and we don't want the plug to 
let it go if we happen to schedule.

-chris


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-13 22:34                 ` Chris Mason
@ 2020-01-14  1:01                   ` Matthew Wilcox
  2020-01-14  1:07                     ` Chris Mason
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-01-14  1:01 UTC (permalink / raw)
  To: Chris Mason; +Cc: Jens Axboe, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
> On 13 Jan 2020, at 17:27, Matthew Wilcox wrote:
> > On Mon, Jan 13, 2020 at 03:14:26PM -0700, Jens Axboe wrote:
> >> On 1/13/20 3:10 PM, Matthew Wilcox wrote:
> >>> On Mon, Jan 13, 2020 at 03:00:40PM -0700, Jens Axboe wrote:
> >>>> On 1/13/20 2:58 PM, Matthew Wilcox wrote:
> >>>>> Why don't we store a bio pointer inside the plug?  You're 
> >>>>> opencoding that,
> >>>>> iomap is opencoding that, and I bet there's a dozen other places 
> >>>>> where
> >>>>> we pass a bio around.  Then blk_finish_plug can submit the bio.
> >>>>
> >>>> Plugs aren't necessarily a bio, they can be callbacks too.
> >>
> >> It's a little odd imho, the plugging generally collects requests. 
> >> Sounds like
> >> what you're looking for is some plug owner private data, which just
> >> happens to be a bio in this case?
> >>
> >> Is this over repeated calls to some IO generating helper? Would it be
> >> more efficient if that helper could generate the full bio in one go,
> >> instead of piecemeal?
> >
> > The issue is around ->readpages.  Take a look at how iomap_readpages
> > works, for example.  We're under a plug (taken in mm/readahead.c),
> > but we still go through the rigamarole of keeping a pointer to the bio
> > in ctx.bio and passing ctx around so that we don't end up with many
> > fragments which have to be recombined into a single bio at the end.
> >
> > I think what I want is a bio I can reach from current, somehow.  And 
> > the
> > plug feels like a natural place to keep it because it's basically 
> > saying
> > "I want to do lots of little IOs and have them combined".  The fact 
> > that
> > the iomap code has a bio that it precombines fragments into suggests 
> > to
> > me that the existing antifragmentation code in the plugging mechanism
> > isn't good enough.  So let's make it better by storing a bio in the 
> > plug
> > and then we can get rid of the bio in the iomap code.
> 
> Both btrfs and xfs do this, we have a bio that we pass around and build 
> and submit.  We both also do some gymnastics in writepages to avoid 
> waiting for the bios we've been building to finish while we're building 
> them.
> 
> I love the idea of the plug api having a way to hold that for us, but 
> sometimes we really are building the bios, and we don't want the plug to 
> let it go if we happen to schedule.

The plug wouldn't have to let the bio go.  I appreciate the plug does let
requests go on context switch, but it wouldn't have to let the bio go.
This bio is being stored on the stack, just as now, so it's still there.


* Re: [RFC 0/8] Replacing the readpages a_op
  2020-01-14  1:01                   ` Matthew Wilcox
@ 2020-01-14  1:07                     ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2020-01-14  1:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, linux-xfs, linux-fsdevel, linux-mm, jlayton, hch

On 13 Jan 2020, at 20:01, Matthew Wilcox wrote:

> On Mon, Jan 13, 2020 at 10:34:16PM +0000, Chris Mason wrote:
>>> I think what I want is a bio I can reach from current, somehow.  And
>>> the
>>> plug feels like a natural place to keep it because it's basically
>>> saying
>>> "I want to do lots of little IOs and have them combined".  The fact
>>> that
>>> the iomap code has a bio that it precombines fragments into suggests
>>> to
>>> me that the existing antifragmentation code in the plugging 
>>> mechanism
>>> isn't good enough.  So let's make it better by storing a bio in the
>>> plug
>>> and then we can get rid of the bio in the iomap code.
>>
>> Both btrfs and xfs do this, we have a bio that we pass around and 
>> build
>> and submit.  We both also do some gymnastics in writepages to avoid
>> waiting for the bios we've been building to finish while we're 
>> building
>> them.
>>
>> I love the idea of the plug api having a way to hold that for us, but
>> sometimes we really are building the bios, and we don't want the plug 
>> to
>> let it go if we happen to schedule.
>
> The plug wouldn't have to let the bio go.  I appreciate the plug does 
> let
> requests go on context switch, but it wouldn't have to let the bio go.
> This bio is being stored on the stack, just as now, so it's still 
> there.

Plugging is great because it's building up state for the block layer 
across a wide variety of MM and FS code.

btrfs and xfs are building state for themselves across a tiny cross 
section of their own code.  I'd rather not complicate magic state in 
current in places we can easily pass the bio down.

-chris


end of thread, back to index

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 15:37 [RFC 0/8] Replacing the readpages a_op Matthew Wilcox
2020-01-13 15:37 ` [PATCH 1/8] pagevec: Add an iterator Matthew Wilcox
2020-01-13 15:37 ` [PATCH 2/8] mm: Fix the return type of __do_page_cache_readahead Matthew Wilcox
2020-01-13 15:37 ` [PATCH 3/8] mm: Use a pagevec for readahead Matthew Wilcox
2020-01-13 15:37 ` [PATCH 4/8] mm/fs: Add a_ops->readahead Matthew Wilcox
2020-01-13 18:22   ` Daniel Wagner
2020-01-13 19:17     ` Matthew Wilcox
2020-01-13 15:37 ` [PATCH 5/8] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
2020-01-13 15:37 ` [PATCH 6/8] cifs: " Matthew Wilcox
2020-01-13 15:37 ` [PATCH 7/8] mm: Remove add_to_page_cache_locked Matthew Wilcox
2020-01-13 15:37 ` [PATCH 8/8] mm: Unify all add_to_page_cache variants Matthew Wilcox
2020-01-13 16:42 ` [RFC 0/8] Replacing the readpages a_op Chris Mason
2020-01-13 17:40   ` Matthew Wilcox
2020-01-13 18:00     ` Chris Mason
2020-01-13 21:58       ` Matthew Wilcox
2020-01-13 22:00         ` Jens Axboe
2020-01-13 22:10           ` Matthew Wilcox
2020-01-13 22:14             ` Jens Axboe
2020-01-13 22:27               ` Matthew Wilcox
2020-01-13 22:30                 ` Jens Axboe
2020-01-13 22:34                 ` Chris Mason
2020-01-14  1:01                   ` Matthew Wilcox
2020-01-14  1:07                     ` Chris Mason
2020-01-13 17:54   ` Matthew Wilcox
2020-01-13 22:19     ` Jens Axboe

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git