* [PATCH 00/11] Remove remaining parts of congestion tracking code.
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

Here is a refresh of my "remove congestion tracking" series.

It makes some small changes to readahead so that the effect of the
inode_read_congested() test in readahead can be duplicated in the
filesystem; fuse uses this.
It adds some documentation for readahead and, for that to work, cleans
up the related documentation a bit.
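
To illustrate, here is a minimal sketch of that pattern, modelled on
the fuse change in this series.  fs_is_congested() and
fs_submit_read() are hypothetical stand-ins for whatever state and
submission path a real filesystem has:

static void example_readahead(struct readahead_control *rac)
{
	struct page *page;

	for (;;) {
		/* Once only the async tail remains, a congested
		 * filesystem may simply stop; the core cleans up any
		 * pages never taken with readahead_page().
		 */
		if (fs_is_congested() &&
		    rac->ra->async_size >= readahead_count(rac))
			break;

		page = readahead_page(rac);
		if (!page)
			break;
		fs_submit_read(page);	/* completion unlocks the page */
	}
}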

Andrew: please drop the version of this that is currently in your tree
even if you don't take this.  The changes to fuse/nfs/ceph are not
appropriate.

Thanks,
NeilBrown

---

NeilBrown (11):
      DOC: convert 'subsection' to 'section' in gfp.h
      MM: document and polish read-ahead code.
      MM: improve cleanup when ->readpages doesn't process all pages.
      fuse: remove reliance on bdi congestion
      nfs: remove reliance on bdi congestion
      ceph: remove reliance on bdi congestion
      Remove inode_congested()
      Remove bdi_congested() and wb_congested() and related functions
      f2fs: replace congestion_wait() calls with io_schedule_timeout()
      block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"
      Remove congestion tracking framework.


 Documentation/core-api/mm-api.rst |  19 ++++-
 Documentation/filesystems/vfs.rst |  16 ++--
 block/bfq-iosched.c               |   2 +-
 drivers/block/drbd/drbd_int.h     |   3 -
 drivers/block/drbd/drbd_req.c     |   3 +-
 fs/ceph/addr.c                    |  22 +++---
 fs/ceph/super.c                   |   1 +
 fs/ceph/super.h                   |   1 +
 fs/ext2/ialloc.c                  |   5 --
 fs/f2fs/compress.c                |   4 +-
 fs/f2fs/data.c                    |   3 +-
 fs/f2fs/f2fs.h                    |   6 ++
 fs/f2fs/segment.c                 |   8 +-
 fs/f2fs/super.c                   |   6 +-
 fs/fs-writeback.c                 |  37 ---------
 fs/fuse/control.c                 |  17 ----
 fs/fuse/dev.c                     |   8 --
 fs/fuse/file.c                    |  17 ++++
 fs/nfs/write.c                    |  14 +++-
 fs/nilfs2/segbuf.c                |  15 ----
 fs/xfs/xfs_buf.c                  |   3 -
 include/linux/backing-dev-defs.h  |   8 --
 include/linux/backing-dev.h       |  50 ------------
 include/linux/fs.h                |   9 ++-
 include/linux/nfs_fs_sb.h         |   1 +
 include/trace/events/writeback.h  |  28 -------
 mm/backing-dev.c                  |  57 --------------
 mm/fadvise.c                      |   5 +-
 mm/readahead.c                    | 126 +++++++++++++++++++++++++++---
 mm/vmscan.c                       |  21 +----
 30 files changed, 214 insertions(+), 301 deletions(-)

--
Signature



* [PATCH 01/11] DOC: convert 'subsection' to 'section' in gfp.h
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

Various DOC: sections in gfp.h have subsection headers (~~~) but the
place where they are included in mm-api.rst has no sections, only
chapters.
So convert them to section headers (---) to avoid confusion.
Specifically, if sections are added later in mm-api.rst, an error
would result.
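
For reference, mm-api.rst uses '===' underlines for chapters and '---'
for sections, e.g.:

    File Mapping and Page Cache
    ===========================

    Readahead
    ---------

so a kernel-doc comment included there must not start at the '~~~'
subsection level.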

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/gfp.h |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 80f63c862be5..20f6fbe12993 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -79,7 +79,7 @@ struct vm_area_struct;
  * DOC: Page mobility and placement hints
  *
  * Page mobility and placement hints
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ---------------------------------
  *
  * These flags provide hints about how mobile the page is. Pages with similar
  * mobility are placed within the same pageblocks to minimise problems due
@@ -112,7 +112,7 @@ struct vm_area_struct;
  * DOC: Watermark modifiers
  *
  * Watermark modifiers -- controls access to emergency reserves
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ------------------------------------------------------------
  *
  * %__GFP_HIGH indicates that the caller is high-priority and that granting
  * the request is necessary before the system can make forward progress.
@@ -144,7 +144,7 @@ struct vm_area_struct;
  * DOC: Reclaim modifiers
  *
  * Reclaim modifiers
- * ~~~~~~~~~~~~~~~~~
+ * -----------------
  * Please note that all the following flags are only applicable to sleepable
  * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them).
  *
@@ -224,7 +224,7 @@ struct vm_area_struct;
  * DOC: Action modifiers
  *
  * Action modifiers
- * ~~~~~~~~~~~~~~~~
+ * ----------------
  *
  * %__GFP_NOWARN suppresses allocation failure reports.
  *
@@ -256,7 +256,7 @@ struct vm_area_struct;
  * DOC: Useful GFP flag combinations
  *
  * Useful GFP flag combinations
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * ----------------------------
  *
  * Useful GFP flag combinations that are commonly used. It is recommended
  * that subsystems start with one of these combinations and then set/clear




* [PATCH 02/11] MM: document and polish read-ahead code.
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

Add some "big-picture" documentation for read-ahead and polish the code
to make it fit this documentation.

The meaning of ->async_size is clarified to match its name, i.e. any
request to ->readahead() has a sync part and an async part.  The
caller will wait for the sync pages to complete, but will not wait for
the async pages.  The first async page is still marked PG_readahead.

- in try_context_readahead(), async_size is now set correctly rather
  than being set to 1.  Prior to commit 2cad40180197 ("readahead: make
  context readahead more conservative") it was set to ra->size, which
  is not correct (that implies no sync component).  As this was too
  high and caused problems, it was reduced to 1 - again incorrect, but
  less problematic.  The setting provided with this patch does not
  reintroduce those problems, and is no longer arbitrary.
- The calculation of ->async_size in the initial_readahead section of
  ondemand_readahead() now makes sense - it is zero if the chosen
  size does not exceed the requested size.  This means that we will
  not set the PG_readahead flag in this case, but as the requested
  size has not been satisfied we can expect a subsequent readahead
  request anyway.

Note that the current function names page_cache_sync_ra() and
page_cache_async_ra() are misleading.  All ra requests are partly sync
and partly async, so either part can be empty.
A page_cache_sync_ra() request will usually set ->async_size non-zero,
implying it is not all synchronous.
When a non-zero req_count is passed to page_cache_async_ra(), the
implication is that some prefix of the request is synchronous, though
the calculation made there is incorrect - I haven't tried to fix it.
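
As a worked (hypothetical) example: with the usual 128-page maximum,
an initial sequential read of 8 pages at index 0 gets a 16-page window
from get_init_ra_size(), so after this patch

    ra->start      = 0
    ra->size       = 16
    ra->async_size = 16 - 8 = 8

Pages 0-7 are the sync part the reader waits for, pages 8-15 are
genuinely "ahead", and page 8 is marked PG_readahead so that accessing
it triggers the next, fully asynchronous, readahead.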

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/core-api/mm-api.rst |   19 ++++++-
 Documentation/filesystems/vfs.rst |   16 ++++--
 include/linux/fs.h                |    9 +++
 mm/readahead.c                    |  103 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 135 insertions(+), 12 deletions(-)

diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index 395835f9289f..f5b2f92822c8 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -58,15 +58,30 @@ Virtually Contiguous Mappings
 File Mapping and Page Cache
 ===========================
 
-.. kernel-doc:: mm/readahead.c
-   :export:
+Filemap
+-------
 
 .. kernel-doc:: mm/filemap.c
    :export:
 
+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+   :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+   :export:
+
+Writeback
+---------
+
 .. kernel-doc:: mm/page-writeback.c
    :export:
 
+Truncate
+--------
+
 .. kernel-doc:: mm/truncate.c
    :export:
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..b4a0baa46dcc 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -806,12 +806,16 @@ cache in your filesystem.  The following members are defined:
 	object.  The pages are consecutive in the page cache and are
 	locked.  The implementation should decrement the page refcount
 	after starting I/O on each page.  Usually the page will be
-	unlocked by the I/O completion handler.  If the filesystem decides
-	to stop attempting I/O before reaching the end of the readahead
-	window, it can simply return.  The caller will decrement the page
-	refcount and unlock the remaining pages for you.  Set PageUptodate
-	if the I/O completes successfully.  Setting PageError on any page
-	will be ignored; simply unlock the page if an I/O error occurs.
+	unlocked by the I/O completion handler.  The set of pages is
+	divided into some sync pages followed by some async pages;
+	rac->ra->async_size gives the number of async pages.  The
+	filesystem should attempt to read all sync pages but may decide
+	to stop once it reaches the async pages.  If it does decide to
+	stop attempting I/O, it can simply return.  The caller will
+	remove the remaining pages from the address space, unlock them
+	and decrement the page refcount.  Set PageUptodate if the I/O
+	completes successfully.  Setting PageError on any page will be
+	ignored; simply unlock the page if an I/O error occurs.
 
 ``readpages``
 	called by the VM to read pages associated with the address_space
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..8b5c486bd4a2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -930,10 +930,15 @@ struct fown_struct {
  * struct file_ra_state - Track a file's readahead state.
  * @start: Where the most recent readahead started.
  * @size: Number of pages read in the most recent readahead.
- * @async_size: Start next readahead when this many pages are left.
- * @ra_pages: Maximum size of a readahead request.
+ * @async_size: Number of pages that were/are not needed immediately
+ *      and so were/are genuinely "ahead".  Start next readahead when
+ *      the first of these pages is accessed.
+ * @ra_pages: Maximum size of a readahead request, copied from the bdi.
  * @mmap_miss: How many mmap accesses missed in the page cache.
  * @prev_pos: The last byte in the most recent read request.
+ *
+ * When this structure is passed to ->readahead(), the "most recent"
+ * readahead means the current readahead.
  */
 struct file_ra_state {
 	pgoff_t start;
diff --git a/mm/readahead.c b/mm/readahead.c
index cf0dcf89eb69..c44b2957f59f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -8,6 +8,105 @@
  *		Initial version.
  */
 
+/**
+ * DOC: Readahead Overview
+ *
+ * Readahead is used to read content into the page cache before it is
+ * explicitly requested by the application.  Readahead only ever
+ * attempts to read pages that are not yet in the page cache.  If a
+ * page is present but not up-to-date, readahead will not try to read
+ * it. In that case a simple ->readpage() will be requested.
+ *
+ * Readahead is triggered when an application read request (whether a
+ * system call or a page fault) finds that the requested page is not in
+ * the page cache, or that it is in the page cache and has the
+ * %PG_readahead flag set.  This flag indicates that the page was loaded
+ * as part of a previous read-ahead request and now that it has been
+ * accessed, it is time for the next read-ahead.
+ *
+ * Each readahead request is partly synchronous read, and partly async
+ * read-ahead.  This is reflected in the struct file_ra_state which
+ * contains ->size being the total number of pages, and ->async_size
+ * which is the number of pages in the async section.  The first page in
+ * this async section will have %PG_readahead set as a trigger for a
+ * subsequent read ahead.  Once a series of sequential reads has been
+ * established, there should be no need for a synchronous component and
+ * all read ahead requests will be fully asynchronous.
+ *
+ * When either of the triggers causes a readahead, three numbers need to
+ * be determined: the start of the region, the size of the region, and
+ * the size of the async tail.
+ *
+ * The start of the region is simply the first page address at or after
+ * the accessed address, which is not currently populated in the page
+ * cache.  This is found with a simple search in the page cache.
+ *
+ * The size of the async tail is determined by subtracting the size that
+ * was explicitly requested from the determined request size, unless
+ * this would be less than zero - then zero is used.  NOTE THIS
+ * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
+ * PAGE.
+ *
+ * The size of the region is normally determined from the size of the
+ * previous readahead which loaded the preceding pages.  This may be
+ * discovered from the struct file_ra_state for simple sequential reads,
+ * or from examining the state of the page cache when multiple
+ * sequential reads are interleaved.  Specifically: where the readahead
+ * was triggered by the %PG_readahead flag, the size of the previous
+ * readahead is assumed to be the number of pages from the triggering
+ * page to the start of the new readahead.  In these cases, the size of
+ * the previous readahead is scaled, often doubled, for the new
+ * readahead, though see get_next_ra_size() for details.
+ *
+ * If the size of the previous read cannot be determined, the number of
+ * preceding pages in the page cache is used to estimate the size of
+ * a previous read.  This estimate could easily be misled by random
+ * reads being coincidentally adjacent, so it is ignored unless it is
+ * larger than the current request, and it is not scaled up, unless it
+ * is at the start of file.
+ *
+ * In general read ahead is accelerated at the start of the file, as
+ * reads from there are often sequential.  There are other minor
+ * adjustments to the read ahead size in various special cases and these
+ * are best discovered by reading the code.
+ *
+ * The above calculation determines the readahead, to which any requested
+ * read size may be added.
+ *
+ * Readahead requests are sent to the filesystem using the ->readahead()
+ * address space operation, for which mpage_readahead() is a canonical
+ * implementation.  ->readahead() should normally initiate reads on all
+ * pages, but may fail to read any or all pages without causing an IO
+ * error.  The page cache reading code will issue a ->readpage() request
+ * for any page which ->readahead() does not provide, and only an error
+ * from this will be final.
+ *
+ * ->readahead() will generally call readahead_page() repeatedly to get
+ * each page from those prepared for read ahead.  It may fail to read a
+ * page by:
+ *
+ * * not calling readahead_page() sufficiently many times, effectively
+ *   ignoring some pages, as might be appropriate if the path to
+ *   storage is congested.
+ *
+ * * failing to actually submit a read request for a given page,
+ *   possibly due to insufficient resources, or
+ *
+ * * getting an error during subsequent processing of a request.
+ *
+ * In the last two cases, the page should be unlocked to indicate that
+ * the read attempt has failed.  In the first case the page will be
+ * unlocked by the caller.
+ *
+ * Those pages not in the final ``async_size`` of the request should be
+ * considered to be important and ->readahead() should not fail them due
+ * to congestion or temporary resource unavailability, but should wait
+ * for necessary resources (e.g.  memory or indexing information) to
+ * become available.  Pages in the final ``async_size`` may be
+ * considered less urgent and failure to read them is more acceptable.
+ * They will eventually be read individually using ->readpage().
+ */
+
 #include <linux/kernel.h>
 #include <linux/dax.h>
 #include <linux/gfp.h>
@@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping,
 
 	ra->start = index;
 	ra->size = min(size + req_size, max);
-	ra->async_size = 1;
+	ra->async_size = ra->size - req_size;
 
 	return 1;
 }
@@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 initial_readahead:
 	ra->start = index;
 	ra->size = get_init_ra_size(req_size, max_pages);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	ra->async_size = ra->size > req_size ? ra->size - req_size : 0;
 
 readit:
 	/*




* [PATCH 03/11] MM: improve cleanup when ->readpages doesn't process all pages.
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

If ->readpages doesn't process all the pages, then it is best to act as
though they weren't requested so that a subsequent readahead can try
again.
So:
  - remove any 'ahead' pages from the page cache so they can be loaded
    with ->readahead() rather than multiple ->readpage() calls
  - update the file_ra_state to reflect the reads that were actually
    submitted.

This allows ->readpages() to abort early, e.g. due to congestion,
which will then allow us to remove the inode_read_congested() test
from page_cache_async_ra().
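
As a worked (hypothetical) example: if ->readahead() is handed 16
pages with ra->size = 16 and ra->async_size = 12, and the filesystem
submits only the first 10 before aborting, the cleanup loop now trims
ra->size to 10 and ra->async_size to 6, and deletes the 6 unsubmitted
pages (all within the async tail) from the page cache so that a later
readahead can fetch them again with ->readahead().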

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/readahead.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c44b2957f59f..35a7ebfcb504 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -104,7 +104,13 @@
  * for necessary resources (e.g.  memory or indexing information) to
  * become available.  Pages in the final ``async_size`` may be
  * considered less urgent and failure to read them is more acceptable.
- * They will eventually be read individually using ->readpage().
+ * In this case it is best to use delete_from_page_cache() to remove the
+ * pages from the page cache as is automatically done for pages that
+ * were not fetched with readahead_page().  This will allow a
+ * subsequent synchronous read ahead request to try them again.  If they
+ * are left in the page cache, then they will be read individually using
+ * ->readpage().
+ *
  */
 
 #include <linux/kernel.h>
@@ -226,8 +232,17 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 	if (aops->readahead) {
 		aops->readahead(rac);
-		/* Clean up the remaining pages */
+		/*
+		 * Clean up the remaining pages.  The sizes in ->ra
+		 * may be used to size the next read-ahead, so make sure
+		 * they accurately reflect what happened.
+		 */
 		while ((page = readahead_page(rac))) {
+			rac->ra->size -= 1;
+			if (rac->ra->async_size > 0) {
+				rac->ra->async_size -= 1;
+				delete_from_page_cache(page);
+			}
 			unlock_page(page);
 			put_page(page);
 		}




* [PATCH 05/11] nfs: remove reliance on bdi congestion
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

The bdi congestion tracking is not widely used and will be removed.

NFS is one of a small number of filesystems that uses it, setting just
the async (write) congestion flag at what it determines are appropriate
times.

The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flag, set an internal flag and change:
 - .writepages to do nothing if WB_SYNC_NONE and the flag is set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag is set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.
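
For context, vmscan's pageout() handles that return value roughly as
follows (paraphrased from mm/vmscan.c of this era, trimmed):

	res = mapping->a_ops->writepage(page, &wbc);
	if (res == AOP_WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		return PAGE_ACTIVATE;
	}

and shrink_page_list() responds to PAGE_ACTIVATE by calling
SetPageActive(), which is where the extra delay before the next
writeout attempt comes from.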

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/write.c            |   14 +++++++++++---
 include/linux/nfs_fs_sb.h |    1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 987a187bd39a..7c986164018e 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -417,7 +417,7 @@ static void nfs_set_page_writeback(struct page *page)
 
 	if (atomic_long_inc_return(&nfss->writeback) >
 			NFS_CONGESTION_ON_THRESH)
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 1;
 }
 
 static void nfs_end_page_writeback(struct nfs_page *req)
@@ -433,7 +433,7 @@ static void nfs_end_page_writeback(struct nfs_page *req)
 
 	end_page_writeback(req->wb_page);
 	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 0;
 }
 
 /*
@@ -672,6 +672,10 @@ static int nfs_writepage_locked(struct page *page,
 	struct inode *inode = page_file_mapping(page)->host;
 	int err;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
 	nfs_pageio_init_write(&pgio, inode, 0,
 				false, &nfs_async_write_completion_ops);
@@ -719,6 +723,10 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	int priority = 0;
 	int err;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return 0;
+
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
 
 	if (!(mntflags & NFS_MOUNT_WRITE_EAGER) || wbc->for_kupdate ||
@@ -1893,7 +1901,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	}
 	nfss = NFS_SERVER(data->inode);
 	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(inode_to_bdi(data->inode), BLK_RW_ASYNC);
+		nfss->write_congested = 0;
 
 	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
 	nfs_commit_end(cinfo.mds);
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index ca0959e51e81..6aa2a200676a 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -138,6 +138,7 @@ struct nfs_server {
 	struct nlm_host		*nlm_host;	/* NLM client handle */
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	atomic_long_t		writeback;	/* number of writeback pages */
+	unsigned int		write_congested;/* flag set when writeback gets too high */
 	unsigned int		flags;		/* various flags */
 
 /* The following are for internal use only. Also see uapi/linux/nfs_mount.h */




* [PATCH 06/11] ceph: remove reliance on bdi congestion
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

The bdi congestion tracking is not widely used and will be removed.

CephFS is one of a small number of filesystems that uses it, setting
just the async (write) congestion flag at what it determines are
appropriate times.

The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flag, set an internal flag and change:
 - .writepages to do nothing if WB_SYNC_NONE and the flag is set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag is set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ceph/addr.c  |   22 +++++++++++++---------
 fs/ceph/super.c |    1 +
 fs/ceph/super.h |    1 +
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index c98e5238a1b6..dc7af34640dd 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -563,7 +563,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 
 	if (atomic_long_inc_return(&fsc->writeback_count) >
 	    CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = true;
 
 	req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode), page_off, &len, 0, 1,
 				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE, snapc,
@@ -623,7 +623,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 
 	if (atomic_long_dec_return(&fsc->writeback_count) <
 	    CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = false;
 
 	return err;
 }
@@ -635,6 +635,10 @@ static int ceph_writepage(struct page *page, struct writeback_control *wbc)
 	BUG_ON(!inode);
-	ihold(inode);
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    ceph_inode_to_client(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;
+
+	ihold(inode);
 	wait_on_page_fscache(page);
 
 	err = writepage_nounlock(page, wbc);
@@ -707,8 +711,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 			if (atomic_long_dec_return(&fsc->writeback_count) <
 			     CONGESTION_OFF_THRESH(
 					fsc->mount_options->congestion_kb))
-				clear_bdi_congested(inode_to_bdi(inode),
-						    BLK_RW_ASYNC);
+				fsc->write_congested = false;
 
 			ceph_put_snap_context(detach_page_private(page));
 			end_page_writeback(page);
@@ -760,6 +763,10 @@ static int ceph_writepages_start(struct address_space *mapping,
 	bool done = false;
 	bool caching = ceph_is_cache_enabled(inode);
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fsc->write_congested)
+		return 0;
+
 	dout("writepages_start %p (mode=%s)\n", inode,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
 	     (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
@@ -954,11 +961,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 			if (atomic_long_inc_return(&fsc->writeback_count) >
 			    CONGESTION_ON_THRESH(
-				    fsc->mount_options->congestion_kb)) {
-				set_bdi_congested(inode_to_bdi(inode),
-						  BLK_RW_ASYNC);
-			}
-
+				    fsc->mount_options->congestion_kb))
+				fsc->write_congested = true;
 
 			pages[locked_pages++] = page;
 			pvec.pages[i] = NULL;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index bf79f369aec6..4a3b77d049c7 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -802,6 +802,7 @@ static struct ceph_fs_client *create_fs_client(struct ceph_mount_options *fsopt,
 	fsc->have_copy_from2 = true;
 
 	atomic_long_set(&fsc->writeback_count, 0);
+	fsc->write_congested = false;
 
 	err = -ENOMEM;
 	/*
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 67f145e1ae7a..0bd97aea2319 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -121,6 +121,7 @@ struct ceph_fs_client {
 	struct ceph_mds_client *mdsc;
 
 	atomic_long_t writeback_count;
+	bool write_congested;
 
 	struct workqueue_struct *inode_wq;
 	struct workqueue_struct *cap_wq;




* [PATCH 04/11] fuse: remove reliance on bdi congestion
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

The bdi congestion tracking is not widely used and will be removed.

Fuse is one of a small number of filesystems that uses it, setting both
the sync (read) and async (write) congestion flags at what it determines
are appropriate times.

The only remaining effect of the sync flag is to cause read-ahead to be
skipped.
The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flags, change:
 - .readahead to stop when it has submitted all non-async pages
    for read.
 - .writepages to do nothing if WB_SYNC_NONE and the flag would be set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag would be set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/fuse/control.c |   17 -----------------
 fs/fuse/dev.c     |    8 --------
 fs/fuse/file.c    |   17 +++++++++++++++++
 3 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 000d2e5627e9..7cede9a3bc96 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -164,7 +164,6 @@ static ssize_t fuse_conn_congestion_threshold_write(struct file *file,
 {
 	unsigned val;
 	struct fuse_conn *fc;
-	struct fuse_mount *fm;
 	ssize_t ret;
 
 	ret = fuse_conn_limit_write(file, buf, count, ppos, &val,
@@ -178,22 +177,6 @@ static ssize_t fuse_conn_congestion_threshold_write(struct file *file,
 	down_read(&fc->killsb);
 	spin_lock(&fc->bg_lock);
 	fc->congestion_threshold = val;
-
-	/*
-	 * Get any fuse_mount belonging to this fuse_conn; s_bdi is
-	 * shared between all of them
-	 */
-
-	if (!list_empty(&fc->mounts)) {
-		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
-		if (fc->num_background < fc->congestion_threshold) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		} else {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
-	}
 	spin_unlock(&fc->bg_lock);
 	up_read(&fc->killsb);
 	fuse_conn_put(fc);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..e1b4a846c90d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -315,10 +315,6 @@ void fuse_request_end(struct fuse_req *req)
 				wake_up(&fc->blocked_waitq);
 		}
 
-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		fc->num_background--;
 		fc->active_background--;
 		flush_bg_queue(fc);
@@ -540,10 +536,6 @@ static bool fuse_request_queue_background(struct fuse_req *req)
 		fc->num_background++;
 		if (fc->num_background == fc->max_background)
 			fc->blocked = 1;
-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		list_add_tail(&req->list, &fc->bg_queue);
 		flush_bg_queue(fc);
 		queued = true;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 829094451774..94747bac3489 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -966,6 +966,14 @@ static void fuse_readahead(struct readahead_control *rac)
 		struct fuse_io_args *ia;
 		struct fuse_args_pages *ap;
 
+		if (fc->num_background >= fc->congestion_threshold &&
+		    rac->ra->async_size >= readahead_count(rac))
+			/*
+			 * Congested and only async pages left, so skip the
+			 * rest.
+			 */
+			break;
+
 		nr_pages = readahead_count(rac) - nr_pages;
 		if (nr_pages > max_pages)
 			nr_pages = max_pages;
@@ -1958,6 +1966,7 @@ static int fuse_writepage_locked(struct page *page)
 
 static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct fuse_conn *fc = get_fuse_conn(page->mapping->host);
 	int err;
 
 	if (fuse_page_is_writeback(page->mapping->host, page->index)) {
@@ -1973,6 +1982,10 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 		return 0;
 	}
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	err = fuse_writepage_locked(page);
 	unlock_page(page);
 
@@ -2226,6 +2239,10 @@ static int fuse_writepages(struct address_space *mapping,
 	if (fuse_is_bad(inode))
 		goto out;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return 0;
+
 	data.inode = inode;
 	data.wpa = NULL;
 	data.ff = NULL;




* [PATCH 07/11] Remove inode_congested()
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

inode_congested() reports if the backing-device for the inode is
congested.  No bdi reports congestion any more, so this always
returns 'false'.

So remove inode_congested() and related functions, and remove the call
sites, assuming that inode_congested() always returns 'false'.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/fs-writeback.c           |   37 -------------------------------------
 include/linux/backing-dev.h |   22 ----------------------
 mm/fadvise.c                |    5 ++---
 mm/readahead.c              |    6 ------
 mm/vmscan.c                 |   17 +----------------
 5 files changed, 3 insertions(+), 84 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f8d7fe6db989..42a3dfad40b8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -893,43 +893,6 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 }
 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
 
-/**
- * inode_congested - test whether an inode is congested
- * @inode: inode to test for congestion (may be NULL)
- * @cong_bits: mask of WB_[a]sync_congested bits to test
- *
- * Tests whether @inode is congested.  @cong_bits is the mask of congestion
- * bits to test and the return value is the mask of set bits.
- *
- * If cgroup writeback is enabled for @inode, the congestion state is
- * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
- * associated with @inode is congested; otherwise, the root wb's congestion
- * state is used.
- *
- * @inode is allowed to be NULL as this function is often called on
- * mapping->host which is NULL for the swapper space.
- */
-int inode_congested(struct inode *inode, int cong_bits)
-{
-	/*
-	 * Once set, ->i_wb never becomes NULL while the inode is alive.
-	 * Start transaction iff ->i_wb is visible.
-	 */
-	if (inode && inode_to_wb_is_valid(inode)) {
-		struct bdi_writeback *wb;
-		struct wb_lock_cookie lock_cookie = {};
-		bool congested;
-
-		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
-		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, &lock_cookie);
-		return congested;
-	}
-
-	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
-}
-EXPORT_SYMBOL_GPL(inode_congested);
-
 /**
  * wb_split_bdi_pages - split nr_pages to write according to bandwidth
  * @wb: target bdi_writeback to split @nr_pages to
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 483979c1b9f4..860b675c2929 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -162,7 +162,6 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 				    gfp_t gfp);
 void wb_memcg_offline(struct mem_cgroup *memcg);
 void wb_blkcg_offline(struct blkcg *blkcg);
-int inode_congested(struct inode *inode, int cong_bits);
 
 /**
  * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
@@ -390,29 +389,8 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 {
 }
 
-static inline int inode_congested(struct inode *inode, int cong_bits)
-{
-	return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
-}
-
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-static inline int inode_read_congested(struct inode *inode)
-{
-	return inode_congested(inode, 1 << WB_sync_congested);
-}
-
-static inline int inode_write_congested(struct inode *inode)
-{
-	return inode_congested(inode, 1 << WB_async_congested);
-}
-
-static inline int inode_rw_congested(struct inode *inode)
-{
-	return inode_congested(inode, (1 << WB_sync_congested) |
-				      (1 << WB_async_congested));
-}
-
 static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
 {
 	return wb_congested(&bdi->wb, cong_bits);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index d6baa4f451c5..338f16022012 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -109,9 +109,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!inode_write_congested(mapping->host))
-			__filemap_fdatawrite_range(mapping, offset, endbyte,
-						   WB_SYNC_NONE);
+		__filemap_fdatawrite_range(mapping, offset, endbyte,
+					   WB_SYNC_NONE);
 
 		/*
 		 * First and last FULL page! Partial pages are deliberately
diff --git a/mm/readahead.c b/mm/readahead.c
index 35a7ebfcb504..31127d5f909f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -709,12 +709,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
 
 	folio_clear_readahead(folio);
 
-	/*
-	 * Defer asynchronous read-ahead on IO congestion.
-	 */
-	if (inode_read_congested(ractl->mapping->host))
-		return;
-
 	if (blk_cgroup_congested())
 		return;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..ce8492939bd3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,17 +989,6 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
 }
 
-static int may_write_to_inode(struct inode *inode)
-{
-	if (current->flags & PF_SWAPWRITE)
-		return 1;
-	if (!inode_write_congested(inode))
-		return 1;
-	if (inode_to_bdi(inode) == current->backing_dev_info)
-		return 1;
-	return 0;
-}
-
 /*
  * We detected a synchronous write error writing a page out.  Probably
  * -ENOSPC.  We need to propagate that into the address_space for a subsequent
@@ -1199,8 +1188,6 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_inode(mapping->host))
-		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
 		int res;
@@ -1576,9 +1563,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		 * end of the LRU a second time.
 		 */
 		mapping = page_mapping(page);
-		if (((dirty || writeback) && mapping &&
-		     inode_write_congested(mapping->host)) ||
-		    (writeback && PageReclaim(page)))
+		if (writeback && PageReclaim(page))
 			stat->nr_congested++;
 
 		/*




* [PATCH 11/11] Remove congestion tracking framework.
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

This framework is no longer used - so discard it.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/backing-dev-defs.h |    8 -----
 include/linux/backing-dev.h      |    2 -
 include/trace/events/writeback.h |   28 -------------------
 mm/backing-dev.c                 |   57 --------------------------------------
 4 files changed, 95 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 993c5628a726..e863c88df95f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -207,14 +207,6 @@ struct backing_dev_info {
 #endif
 };
 
-enum {
-	BLK_RW_ASYNC	= 0,
-	BLK_RW_SYNC	= 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-
 struct wb_lock_cookie {
 	bool locked;
 	unsigned long flags;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2d764566280c..87ce24d238f3 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -135,8 +135,6 @@ static inline bool writeback_in_progress(struct bdi_writeback *wb)
 
 struct backing_dev_info *inode_to_bdi(struct inode *inode);
 
-long congestion_wait(int sync, long timeout);
-
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
 	return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index a345b1e12daf..86b2a82da546 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -735,34 +735,6 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
 	)
 );
 
-DECLARE_EVENT_CLASS(writeback_congest_waited_template,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed),
-
-	TP_STRUCT__entry(
-		__field(	unsigned int,	usec_timeout	)
-		__field(	unsigned int,	usec_delayed	)
-	),
-
-	TP_fast_assign(
-		__entry->usec_timeout	= usec_timeout;
-		__entry->usec_delayed	= usec_delayed;
-	),
-
-	TP_printk("usec_timeout=%u usec_delayed=%u",
-			__entry->usec_timeout,
-			__entry->usec_delayed)
-);
-
-DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eae96dfe0261..7176af65b103 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1005,60 +1005,3 @@ const char *bdi_dev_name(struct backing_dev_info *bdi)
 	return bdi->dev_name;
 }
 EXPORT_SYMBOL_GPL(bdi_dev_name);
-
-static wait_queue_head_t congestion_wqh[2] = {
-		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
-		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
-	};
-static atomic_t nr_wb_congested[2];
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-	enum wb_congested_state bit;
-
-	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (test_and_clear_bit(bit, &bdi->wb.congested))
-		atomic_dec(&nr_wb_congested[sync]);
-	smp_mb__after_atomic();
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(clear_bdi_congested);
-
-void set_bdi_congested(struct backing_dev_info *bdi, int sync)
-{
-	enum wb_congested_state bit;
-
-	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (!test_and_set_bit(bit, &bdi->wb.congested))
-		atomic_inc(&nr_wb_congested[sync]);
-}
-EXPORT_SYMBOL(set_bdi_congested);
-
-/**
- * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
- * write congestion.  If no backing_devs are congested then just wait for the
- * next write to be completed.
- */
-long congestion_wait(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(congestion_wait);




* [PATCH 10/11] block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

bfq_get_queue() expects a "bool" for the third arg, so pass "false"
rather than "BLK_RW_ASYNC" which will soon be removed.

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/bfq-iosched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0c612a911696..4e645ae1e066 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5448,7 +5448,7 @@ static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
 	bfqq = bic_to_bfqq(bic, false);
 	if (bfqq) {
 		bfq_release_process_ref(bfqd, bfqq);
-		bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic, true);
+		bfqq = bfq_get_queue(bfqd, bio, false, bic, true);
 		bic_set_bfqq(bic, bfqq, false);
 	}
 




* [PATCH 09/11] f2fs: replace congestion_wait() calls with io_schedule_timeout()
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

As congestion is no longer tracked, congestion_wait() is effectively
equivalent to io_schedule_timeout().
So introduce f2fs_io_schedule_timeout() which sets TASK_UNINTERRUPTIBLE
and call that instead.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/f2fs/compress.c |    4 +---
 fs/f2fs/data.c     |    3 +--
 fs/f2fs/f2fs.h     |    6 ++++++
 fs/f2fs/segment.c  |    8 +++-----
 fs/f2fs/super.c    |    6 ++----
 5 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index d0c3aeba5945..2f95559025ad 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1505,9 +1505,7 @@ static int f2fs_write_raw_pages(struct compress_ctx *cc,
 				if (IS_NOQUOTA(cc->inode))
 					return 0;
 				ret = 0;
-				cond_resched();
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 				goto retry_write;
 			}
 			return ret;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 8c417864c66a..d428ddfd42ee 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -3047,8 +3047,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 				} else if (ret == -EAGAIN) {
 					ret = 0;
 					if (wbc->sync_mode == WB_SYNC_ALL) {
-						cond_resched();
-						congestion_wait(BLK_RW_ASYNC,
+						f2fs_io_schedule_timeout(
 							DEFAULT_IO_TIMEOUT);
 						goto retry_write;
 					}
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 68b44015514f..467f5dbdc7d1 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -4426,6 +4426,12 @@ static inline bool f2fs_block_unit_discard(struct f2fs_sb_info *sbi)
 	return F2FS_OPTION(sbi).discard_unit == DISCARD_UNIT_BLOCK;
 }
 
+static inline void f2fs_io_schedule_timeout(long timeout)
+{
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	io_schedule_timeout(timeout);
+}
+
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
 #define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 1dabc8244083..6ff20da44ad7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -313,8 +313,7 @@ void f2fs_drop_inmem_pages_all(struct f2fs_sb_info *sbi, bool gc_failure)
 skip:
 		iput(inode);
 	}
-	congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
-	cond_resched();
+	f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 	if (gc_failure) {
 		if (++looped >= count)
 			return;
@@ -803,8 +802,7 @@ int f2fs_flush_device_cache(struct f2fs_sb_info *sbi)
 		do {
 			ret = __submit_flush_wait(sbi, FDEV(i).bdev);
 			if (ret)
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 		} while (ret && --count);
 
 		if (ret) {
@@ -3133,7 +3131,7 @@ static unsigned int __issue_discard_cmd_range(struct f2fs_sb_info *sbi,
 			blk_finish_plug(&plug);
 			mutex_unlock(&dcc->cmd_lock);
 			trimmed += __wait_all_discard_cmd(sbi, NULL);
-			congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
+			f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 			goto next;
 		}
 skip:
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index baefd398ec1a..ebd32daf052c 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2135,8 +2135,7 @@ static void f2fs_enable_checkpoint(struct f2fs_sb_info *sbi)
 	/* we should flush all the data to keep data consistency */
 	do {
 		sync_inodes_sb(sbi->sb);
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, DEFAULT_IO_TIMEOUT);
+		f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 	} while (get_pages(sbi, F2FS_DIRTY_DATA) && retry--);
 
 	if (unlikely(retry < 0))
@@ -2504,8 +2503,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
 							&page, &fsdata);
 		if (unlikely(err)) {
 			if (err == -ENOMEM) {
-				congestion_wait(BLK_RW_ASYNC,
-						DEFAULT_IO_TIMEOUT);
+				f2fs_io_schedule_timeout(DEFAULT_IO_TIMEOUT);
 				goto retry;
 			}
 			set_sbi_flag(F2FS_SB(sb), SBI_QUOTA_NEED_REPAIR);




* [PATCH 08/11] Remove bdi_congested() and wb_congested() and related functions
From: NeilBrown @ 2022-02-10  5:37 UTC
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

These functions are no longer useful as no BDIs report congestion any
more.

Removing the test on bdi_write_congested() in current_may_throttle()
could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
is set.

So replace the calls by 'false' and simplify the code - and remove the
functions.

Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> (for nilfs bits)
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/drbd/drbd_int.h |    3 ---
 drivers/block/drbd/drbd_req.c |    3 +--
 fs/ext2/ialloc.c              |    5 -----
 fs/nilfs2/segbuf.c            |   15 ---------------
 fs/xfs/xfs_buf.c              |    3 ---
 include/linux/backing-dev.h   |   26 --------------------------
 mm/vmscan.c                   |    4 +---
 7 files changed, 2 insertions(+), 57 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index f27d5b0f9a0b..f804b1bfb3e6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -638,9 +638,6 @@ enum {
 	STATE_SENT,		/* Do not change state/UUIDs while this is set */
 	CALLBACK_PENDING,	/* Whether we have a call_usermodehelper(, UMH_WAIT_PROC)
 				 * pending, from drbd worker context.
-				 * If set, bdi_write_congested() returns true,
-				 * so shrink_page_list() would not recurse into,
-				 * and potentially deadlock on, this drbd worker.
 				 */
 	DISCONNECT_SENT,
 
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3235532ae077..2e5fb7e442e3 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -909,8 +909,7 @@ static bool remote_due_to_read_balancing(struct drbd_device *device, sector_t se
 
 	switch (rbm) {
 	case RB_CONGESTED_REMOTE:
-		return bdi_read_congested(
-			device->ldev->backing_bdev->bd_disk->bdi);
+		return 0;
 	case RB_LEAST_PENDING:
 		return atomic_read(&device->local_cnt) >
 			atomic_read(&device->ap_pending_cnt) + atomic_read(&device->rs_pending_cnt);
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index df14e750e9fe..998dd2ac8008 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -170,11 +170,6 @@ static void ext2_preread_inode(struct inode *inode)
 	unsigned long offset;
 	unsigned long block;
 	struct ext2_group_desc * gdp;
-	struct backing_dev_info *bdi;
-
-	bdi = inode_to_bdi(inode);
-	if (bdi_rw_congested(bdi))
-		return;
 
 	block_group = (inode->i_ino - 1) / EXT2_INODES_PER_GROUP(inode->i_sb);
 	gdp = ext2_get_group_desc(inode->i_sb, block_group, NULL);
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 43287b0d3e9b..c4510f79037f 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -343,17 +343,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf,
 	struct bio *bio = wi->bio;
 	int err;
 
-	if (segbuf->sb_nbio > 0 &&
-	    bdi_write_congested(segbuf->sb_super->s_bdi)) {
-		wait_for_completion(&segbuf->sb_bio_event);
-		segbuf->sb_nbio--;
-		if (unlikely(atomic_read(&segbuf->sb_err))) {
-			bio_put(bio);
-			err = -EIO;
-			goto failed;
-		}
-	}
-
 	bio->bi_end_io = nilfs_end_bio_write;
 	bio->bi_private = segbuf;
 	bio_set_op_attrs(bio, mode, mode_flags);
@@ -365,10 +354,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf,
 	wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
 	wi->start = wi->end;
 	return 0;
-
- failed:
-	wi->bio = NULL;
-	return err;
 }
 
 /**
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b45e0d50a405..b7ebcfe6b8d3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -843,9 +843,6 @@ xfs_buf_readahead_map(
 {
 	struct xfs_buf		*bp;
 
-	if (bdi_read_congested(target->bt_bdev->bd_disk->bdi))
-		return;
-
 	xfs_buf_read_map(target, map, nmaps,
 		     XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops,
 		     __this_address);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 860b675c2929..2d764566280c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -135,11 +135,6 @@ static inline bool writeback_in_progress(struct bdi_writeback *wb)
 
 struct backing_dev_info *inode_to_bdi(struct inode *inode);
 
-static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
-{
-	return wb->congested & cong_bits;
-}
-
 long congestion_wait(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
@@ -391,27 +386,6 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
-{
-	return wb_congested(&bdi->wb, cong_bits);
-}
-
-static inline int bdi_read_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_sync_congested);
-}
-
-static inline int bdi_write_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_async_congested);
-}
-
-static inline int bdi_rw_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, (1 << WB_sync_congested) |
-				  (1 << WB_async_congested));
-}
-
 const char *bdi_dev_name(struct backing_dev_info *bdi);
 
 #endif	/* _LINUX_BACKING_DEV_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce8492939bd3..0b930556c4f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2362,9 +2362,7 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
  */
 static int current_may_throttle(void)
 {
-	return !(current->flags & PF_LOCAL_THROTTLE) ||
-		current->backing_dev_info == NULL ||
-		bdi_write_congested(current->backing_dev_info);
+	return !(current->flags & PF_LOCAL_THROTTLE);
 }
 
 /*



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/11] block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"
  2022-02-10  5:37 ` [PATCH 10/11] block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC" NeilBrown
@ 2022-02-10 10:02   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2022-02-10 10:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe,
	linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

On Thu 10-02-22 16:37:52, NeilBrown wrote:
> bfq_get_queue() expects a "bool" for the third arg, so pass "false"
> rather than "BLK_RW_ASYNC" which will soon be removed.
> 
> Acked-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: NeilBrown <neilb@suse.de>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/bfq-iosched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0c612a911696..4e645ae1e066 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -5448,7 +5448,7 @@ static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
>  	bfqq = bic_to_bfqq(bic, false);
>  	if (bfqq) {
>  		bfq_release_process_ref(bfqd, bfqq);
> -		bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic, true);
> +		bfqq = bfq_get_queue(bfqd, bio, false, bic, true);
>  		bic_set_bfqq(bic, bfqq, false);
>  	}
>  
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-10  5:37 ` [PATCH 02/11] MM: document and polish read-ahead code NeilBrown
@ 2022-02-10 12:24   ` Jan Kara
  2022-02-10 23:35     ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2022-02-10 12:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe,
	linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

Hi Neil!

On Thu 10-02-22 16:37:52, NeilBrown wrote:
> Add some "big-picture" documentation for read-ahead and polish the code
> to make it fit this documentation.
> 
> The meaning of ->async_size is clarified to match its name.
> i.e. Any request to ->readahead() has a sync part and an async part.
> The caller will wait for the sync pages to complete, but will not wait
> for the async pages.  The first async page is still marked PG_readahead

So I don't think this is how the code was meant. My understanding of
readahead comes from a comment:

/*
 * On-demand readahead design.
 *
....

in mm/readahead.c. The ra->size is how many pages should be read.
ra->async_size is the "lookahead size" meaning that we should place a
marker (PageReadahead) at "ra->size - ra->async_size" to trigger next
readahead.
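
A minimal sketch of that placement, loosely following the
page_cache_ra_unbounded() loop of this era (i runs over the pages being
added to the cache):

	/* mark the page sitting ra->size - ra->async_size into the
	 * batch; a later access to it triggers the next, asynchronous,
	 * readahead */
	if (i == ra->size - ra->async_size)
		SetPageReadahead(page);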

> 
> - in try_context_readahead(), the async_size is set correctly rather
>   than being set to 1.  Prior to Commit 2cad40180197 ("readahead: make
>   context readahead more conservative") it was set to ra->size which
>   is not correct (that implies no sync component).  As this was too
>   high and caused problems it was reduced to 1, again incorrect but less
>   problematic.  The setting provided with this patch does not restore
>   those problems, and is now not arbitrary.

I agree the 1 there looks strange as it effectively discards all the logic
handling the lookahead size. I agree with the tweak there but I would do
this behavioral change as a separate commit since it can have performance
implications.

> - The calculation of ->async_size in the initial_readahead section of
>   ondemand_readahead() now makes sense - it is zero if the chosen
>   size does not exceed the requested size.  This means that we will not
>   set the PG_readahead flag in this case, but as the requested size
>   has not been satisfied we can expect a subsequent read ahead request
>   anyway.

So I agree that setting of ->async_size to ->size in initial_readahead
section does not make great sense but if you look a bit below into the readit
section, you will notice the ->async_size is overwritten there to something
meaningful. So I think the code actually does something sensible, maybe it
could be written in a more readable way.
 
> Note that the current function names page_cache_sync_ra() and
> page_cache_async_ra() are misleading.  All ra requests are partly sync
> and partly async, so either part can be empty.

The meaning of these names IMO is:
page_cache_sync_ra() - tell readahead that we currently need the page at
ractl->_index and would prefer req_count pages fetched ahead.

page_cache_async_ra() - called when we hit the lookahead marker to give
the readahead code an opportunity to prefetch more pages.
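
A condensed sketch of how the two entry points are reached on the read
path (loosely after filemap_get_pages() of this era; locking, refcounts
and error handling omitted):

	page = find_get_page(mapping, index);
	if (!page)
		/* miss: read what is needed now, plus some ahead */
		page_cache_sync_ra(&ractl, req_count);
	else if (PageReadahead(page))
		/* hit the marker: top up the window asynchronously */
		page_cache_async_ra(&ractl, page, req_count);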

> A page_cache_sync_ra() request will usually set ->async_size non-zero,
> implying it is not all synchronous.
> When a non-zero req_count is passed to page_cache_async_ra(), the
> implication is that some prefix of the request is synchronous, though
> the calculation made there is incorrect - I haven't tried to fix it.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

								Honza


> ---
>  Documentation/core-api/mm-api.rst |   19 ++++++-
>  Documentation/filesystems/vfs.rst |   16 ++++--
>  include/linux/fs.h                |    9 +++
>  mm/readahead.c                    |  103 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 135 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
> index 395835f9289f..f5b2f92822c8 100644
> --- a/Documentation/core-api/mm-api.rst
> +++ b/Documentation/core-api/mm-api.rst
> @@ -58,15 +58,30 @@ Virtually Contiguous Mappings
>  File Mapping and Page Cache
>  ===========================
>  
> -.. kernel-doc:: mm/readahead.c
> -   :export:
> +Filemap
> +-------
>  
>  .. kernel-doc:: mm/filemap.c
>     :export:
>  
> +Readahead
> +---------
> +
> +.. kernel-doc:: mm/readahead.c
> +   :doc: Readahead Overview
> +
> +.. kernel-doc:: mm/readahead.c
> +   :export:
> +
> +Writeback
> +---------
> +
>  .. kernel-doc:: mm/page-writeback.c
>     :export:
>  
> +Truncate
> +--------
> +
>  .. kernel-doc:: mm/truncate.c
>     :export:
>  
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index bf5c48066fac..b4a0baa46dcc 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -806,12 +806,16 @@ cache in your filesystem.  The following members are defined:
>  	object.  The pages are consecutive in the page cache and are
>  	locked.  The implementation should decrement the page refcount
>  	after starting I/O on each page.  Usually the page will be
> -	unlocked by the I/O completion handler.  If the filesystem decides
> -	to stop attempting I/O before reaching the end of the readahead
> -	window, it can simply return.  The caller will decrement the page
> -	refcount and unlock the remaining pages for you.  Set PageUptodate
> -	if the I/O completes successfully.  Setting PageError on any page
> -	will be ignored; simply unlock the page if an I/O error occurs.
> +	unlocked by the I/O completion handler.  The set of pages is
> +	divided into some sync pages followed by some async pages,
> +	rac->ra->async_size gives the number of async pages.  The
> +	filesystem should attempt to read all sync pages but may decide
> +	to stop once it reaches the async pages.  If it does decide to
> +	stop attempting I/O, it can simply return.  The caller will
> +	remove the remaining pages from the address space, unlock them
> +	and decrement the page refcount.  Set PageUptodate if the I/O
> +	completes successfully.  Setting PageError on any page will be
> +	ignored; simply unlock the page if an I/O error occurs.
>  
>  ``readpages``
>  	called by the VM to read pages associated with the address_space
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e2d892b201b0..8b5c486bd4a2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -930,10 +930,15 @@ struct fown_struct {
>   * struct file_ra_state - Track a file's readahead state.
>   * @start: Where the most recent readahead started.
>   * @size: Number of pages read in the most recent readahead.
> - * @async_size: Start next readahead when this many pages are left.
> - * @ra_pages: Maximum size of a readahead request.
> + * @async_size: Number of pages that were/are not needed immediately
> + *      and so were/are genuinely "ahead".  Start next readahead when
> + *      the first of these pages is accessed.
> + * @ra_pages: Maximum size of a readahead request, copied from the bdi.
>   * @mmap_miss: How many mmap accesses missed in the page cache.
>   * @prev_pos: The last byte in the most recent read request.
> + *
> + * When this structure is passed to ->readahead(), the "most recent"
> + * readahead means the current readahead.
>   */
>  struct file_ra_state {
>  	pgoff_t start;
> diff --git a/mm/readahead.c b/mm/readahead.c
> index cf0dcf89eb69..c44b2957f59f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -8,6 +8,105 @@
>   *		Initial version.
>   */
>  
> +/**
> + * DOC: Readahead Overview
> + *
> + * Readahead is used to read content into the page cache before it is
> + * explicitly requested by the application.  Readahead only ever
> + * attempts to read pages that are not yet in the page cache.  If a
> + * page is present but not up-to-date, readahead will not try to read
> + * it. In that case a simple ->readpage() will be requested.
> + *
> + * Readahead is triggered when an application read request (whether a
> + * systemcall or a page fault) finds that the requested page is not in
> + * the page cache, or that it is in the page cache and has the
> + * %PG_readahead flag set.  This flag indicates that the page was loaded
> + * as part of a previous read-ahead request and now that it has been
> + * accessed, it is time for the next read-ahead.
> + *
> + * Each readahead request is partly synchronous read, and partly async
> + * read-ahead.  This is reflected in the struct file_ra_state which
> + * contains ->size being the total number of pages, and ->async_size
> + * which is the number of pages in the async section.  The first page in
> + * this async section will have %PG_readahead set as a trigger for a
> + * subsequent read ahead.  Once a series of sequential reads has been
> + * established, there should be no need for a synchronous component and
> + * all read ahead requests will be fully asynchronous.
> + *
> + * When either of the triggers causes a readahead, three numbers need to
> + * be determined: the start of the region, the size of the region, and
> + * the size of the async tail.
> + *
> + * The start of the region is simply the first page address at or after
> + * the accessed address, which is not currently populated in the page
> + * cache.  This is found with a simple search in the page cache.
> + *
> + * The size of the async tail is determined by subtracting the size that
> + * was explicitly requested from the determined request size, unless
> + * this would be less than zero - then zero is used.  NOTE THIS
> + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
> + * PAGE.
> + *
> + * The size of the region is normally determined from the size of the
> + * previous readahead which loaded the preceding pages.  This may be
> + * discovered from the struct file_ra_state for simple sequential reads,
> + * or from examining the state of the page cache when multiple
> + * sequential reads are interleaved.  Specifically: where the readahead
> + * was triggered by the %PG_readahead flag, the size of the previous
> + * readahead is assumed to be the number of pages from the triggering
> + * page to the start of the new readahead.  In these cases, the size of
> + * the previous readahead is scaled, often doubled, for the new
> + * readahead, though see get_next_ra_size() for details.
> + *
> + * If the size of the previous read cannot be determined, the number of
> + * preceding pages in the page cache is used to estimate the size of
> + * a previous read.  This estimate could easily be misled by random
> + * reads being coincidentally adjacent, so it is ignored unless it is
> + * larger than the current request, and it is not scaled up, unless it
> + * is at the start of file.
> + *
> + * In general read ahead is accelerated at the start of the file, as
> + * reads from there are often sequential.  There are other minor
> + * adjustments to the read ahead size in various special cases and these
> + * are best discovered by reading the code.
> + *
> + * The above calculation determines the readahead, to which any requested
> + * read size may be added.
> + *
> + * Readahead requests are sent to the filesystem using the ->readahead()
> + * address space operation, for which mpage_readahead() is a canonical
> + * implementation.  ->readahead() should normally initiate reads on all
> + * pages, but may fail to read any or all pages without causing an IO
> + * error.  The page cache reading code will issue a ->readpage() request
> + * for any page which ->readahead() does not provide, and only an error
> + * from this will be final.
> + *
> + * ->readahead() will generally call readahead_page() repeatedly to get
> + * each page from those prepared for read ahead.  It may fail to read a
> + * page by:
> + *
> + * * not calling readahead_page() sufficiently many times, effectively
> + *   ignoring some pages, as might be appropriate if the path to
> + *   storage is congested.
> + *
> + * * failing to actually submit a read request for a given page,
> + *   possibly due to insufficient resources, or
> + *
> + * * getting an error during subsequent processing of a request.
> + *
> + * In the last two cases, the page should be unlocked to indicate that
> + * the read attempt has failed.  In the first case the page will be
> + * unlocked by the caller.
> + *
> + * Those pages not in the final ``async_size`` of the request should be
> + * considered to be important and ->readahead() should not fail them due
> + * to congestion or temporary resource unavailability, but should wait
> + * for necessary resources (e.g.  memory or indexing information) to
> + * become available.  Pages in the final ``async_size`` may be
> + * considered less urgent and failure to read them is more acceptable.
> + * They will eventually be read individually using ->readpage().
> + */
> +
>  #include <linux/kernel.h>
>  #include <linux/dax.h>
>  #include <linux/gfp.h>
> @@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping,
>  
>  	ra->start = index;
>  	ra->size = min(size + req_size, max);
> -	ra->async_size = 1;
> +	ra->async_size = ra->size - req_size;
>  
>  	return 1;
>  }
> @@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  initial_readahead:
>  	ra->start = index;
>  	ra->size = get_init_ra_size(req_size, max_pages);
> -	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
> +	ra->async_size = ra->size > req_size ? ra->size - req_size : 0;
>  
>  readit:
>  	/*
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-10 12:24   ` Jan Kara
@ 2022-02-10 23:35     ` NeilBrown
  2022-02-24 18:26       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2022-02-10 23:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe,
	linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

On Thu, 10 Feb 2022, Jan Kara wrote:
> Hi Neil!
> 
> On Thu 10-02-22 16:37:52, NeilBrown wrote:
> > Add some "big-picture" documentation for read-ahead and polish the code
> > to make it fit this documentation.
> > 
> > The meaning of ->async_size is clarified to match its name.
> > i.e. Any request to ->readahead() has a sync part and an async part.
> > The caller will wait for the sync pages to complete, but will not wait
> > for the async pages.  The first async page is still marked PG_readahead

Thanks for the review!

> 
> So I don't think this is how the code was meant. My understanding of
> readahead comes from a comment:

I can't be sure what was "meant" but what I described is very close to
what the code actually does.

> 
> /*
>  * On-demand readahead design.
>  *
> ....
> 
> in mm/readahead.c. The ra->size is how many pages should be read.
> ra->async_size is the "lookahead size" meaning that we should place a
> marker (PageReadahead) at "ra->size - ra->async_size" to trigger next
> readahead.

This description of PageReadahead and ->async_size focuses on *what*
happens, not *why*.  Importantly it doesn't help answer the question "What
should I set ->async_size to?"

The implication in the code is that when we sequentially access a page
that was read-ahead (read before it was explicitly requested), we trigger
more read ahead.  So ->async_size should refer to that part of the
readahead request which was not explicitly requested.  With that
understanding, it becomes possible to audit all the places that
->async_size are set and to see if they make sense.
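
A worked example under that reading, with illustrative numbers only: if
an application has just read 16 pages and the readahead heuristics
decide to submit 32, then

	ra->size = 32;		/* total pages in this readahead */
	ra->async_size = 16;	/* the tail nobody has asked for yet */

and PG_readahead lands on the first page of that 16-page tail.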

> 
> > 
> > - in try_context_readahead(), the async_size is set correctly rather
> >   than being set to 1.  Prior to Commit 2cad40180197 ("readahead: make
> >   context readahead more conservative") it was set to ra->size which
> >   is not correct (that implies no sync component).  As this was too
> >   high and caused problems it was reduced to 1, again incorrect but less
> >   problematic.  The setting provided with this patch does not restore
> >   those problems, and is now not arbitrary.
> 
> I agree the 1 there looks strange as it effectively discards all the logic
> handling the lookahead size. I agree with the tweak there but I would do
> this behavioral change as a separate commit since it can have performance
> implications.
> 
> > - The calculation of ->async_size in the initial_readahead section of
> >   ondemand_readahead() now makes sense - it is zero if the chosen
> >   size does not exceed the requested size.  This means that we will not
> >   set the PG_readahead flag in this case, but as the requested size
> >   has not been satisfied we can expect a subsequent read ahead request
> >   anyway.
> 
> So I agree that setting of ->async_size to ->size in initial_readahead
> > section does not make great sense but if you look a bit below into the readit
> > section, you will notice the ->async_size is overwritten there to something
> > meaningful. So I think the code actually does something sensible, maybe it
> could be written in a more readable way.

I'm certainly focusing on making the code look sensible and be
consistent with the documentation, rather than fixing actual faults in
behaviour.  Code that makes sense is easier to maintain.

I came very close to removing that code after readit: but I agree it
needs a separate patch and needs more thought.  It looks like a bandaid
that addressed some specific problem which was probably caused by one of
the size fields being set "wrongly" earlier.

>  
> > Note that the current function names page_cache_sync_ra() and
> > > page_cache_async_ra() are misleading.  All ra requests are partly sync
> > and partly async, so either part can be empty.
> 
> The meaning of these names IMO is:
> page_cache_sync_ra() - tell readahead that we currently need a page
> ractl->_index and would prefer req_count pages fetched ahead.

I don't think that is what req_count means.  req_count is the number of
pages that are needed *now* to satisfy the current read request.
page_cache_sync_ra() has the job of determining how many more pages (if
any) to read-ahead to satisfy future requests.  Sometimes it reads
another req_count - sometimes not.
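
(Concretely, and purely as an illustration: a 128KiB read() on a
4KiB-page machine reaches page_cache_sync_ra() with req_count = 32;
those 32 pages are needed now, and anything read beyond them is the
heuristic's own decision.)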

> 
> page_cache_async_ra() - called when we hit the lookahead marker to give
> opportunity to readahead code to prefetch more pages.

Yes, but page_cache_async_ra() is given a req_count which, as above, is
the number of pages needed to satisfy *this* request.  That wouldn't
make sense if it was a pure future-readahead request.

In practice, the word "sync" is used to mean "page was missing" and
"async" here means "PG_readahead was found".  But that isn't what those
words usually mean.

They both call ondemand_readahead() passing False or True respectively
to hit_readahead_marker - which makes that meaning clear in the code...
but it still isn't clear in the name.

> 
> > A page_cache_sync_ra() request will usually set ->async_size non-zero,
> > implying it is not all synchronous.
> > When a non-zero req_count is passed to page_cache_async_ra(), the
> > implication is that some prefix of the request is synchronous, though
> > the calculation made there is incorrect - I haven't tried to fix it.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> 
> 								Honza


Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] Remove bdi_congested() and wb_congested() and related functions
  2022-02-10  5:37 ` [PATCH 08/11] Remove bdi_congested() and wb_congested() and related functions NeilBrown
@ 2022-02-24 18:10   ` Ryusuke Konishi
  0 siblings, 0 replies; 21+ messages in thread
From: Ryusuke Konishi @ 2022-02-24 18:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Darrick J. Wong, Philipp Reisner, Lars Ellenberg,
	Paolo Valente, Jens Axboe, linux-doc, Linux MM, linux-nilfs,
	linux-nfs, linux-fsdevel, linux-f2fs-devel, linux-ext4,
	ceph-devel, drbd-dev, LKML, linux-block

On Thu, Feb 10, 2022 at 2:41 PM NeilBrown <neilb@suse.de> wrote:
>
> These functions are no longer useful as no BDIs report congestion any
> more.
>
> Removing the test on bdi_write_congested() in current_may_throttle()
> could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE
> is set.
>
> So replace the calls by 'false' and simplify the code - and remove the
> functions.
>
> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> (for nilfs bits)
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  drivers/block/drbd/drbd_int.h |    3 ---
>  drivers/block/drbd/drbd_req.c |    3 +--
>  fs/ext2/ialloc.c              |    5 -----
>  fs/nilfs2/segbuf.c            |   15 ---------------
>  fs/xfs/xfs_buf.c              |    3 ---
>  include/linux/backing-dev.h   |   26 --------------------------
>  mm/vmscan.c                   |    4 +---
>  7 files changed, 2 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
> index f27d5b0f9a0b..f804b1bfb3e6 100644
> --- a/drivers/block/drbd/drbd_int.h
> +++ b/drivers/block/drbd/drbd_int.h
> @@ -638,9 +638,6 @@ enum {
>         STATE_SENT,             /* Do not change state/UUIDs while this is set */
>         CALLBACK_PENDING,       /* Whether we have a call_usermodehelper(, UMH_WAIT_PROC)
>                                  * pending, from drbd worker context.
> -                                * If set, bdi_write_congested() returns true,
> -                                * so shrink_page_list() would not recurse into,
> -                                * and potentially deadlock on, this drbd worker.
>                                  */
>         DISCONNECT_SENT,
>
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index 3235532ae077..2e5fb7e442e3 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -909,8 +909,7 @@ static bool remote_due_to_read_balancing(struct drbd_device *device, sector_t se
>
>         switch (rbm) {
>         case RB_CONGESTED_REMOTE:
> -               return bdi_read_congested(
> -                       device->ldev->backing_bdev->bd_disk->bdi);
> +               return 0;
>         case RB_LEAST_PENDING:
>                 return atomic_read(&device->local_cnt) >
>                         atomic_read(&device->ap_pending_cnt) + atomic_read(&device->rs_pending_cnt);
> diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
> index df14e750e9fe..998dd2ac8008 100644
> --- a/fs/ext2/ialloc.c
> +++ b/fs/ext2/ialloc.c
> @@ -170,11 +170,6 @@ static void ext2_preread_inode(struct inode *inode)
>         unsigned long offset;
>         unsigned long block;
>         struct ext2_group_desc * gdp;
> -       struct backing_dev_info *bdi;
> -
> -       bdi = inode_to_bdi(inode);
> -       if (bdi_rw_congested(bdi))
> -               return;
>
>         block_group = (inode->i_ino - 1) / EXT2_INODES_PER_GROUP(inode->i_sb);
>         gdp = ext2_get_group_desc(inode->i_sb, block_group, NULL);
> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
> index 43287b0d3e9b..c4510f79037f 100644
> --- a/fs/nilfs2/segbuf.c
> +++ b/fs/nilfs2/segbuf.c
> @@ -343,17 +343,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf,
>         struct bio *bio = wi->bio;
>         int err;
>
> -       if (segbuf->sb_nbio > 0 &&
> -           bdi_write_congested(segbuf->sb_super->s_bdi)) {
> -               wait_for_completion(&segbuf->sb_bio_event);
> -               segbuf->sb_nbio--;
> -               if (unlikely(atomic_read(&segbuf->sb_err))) {
> -                       bio_put(bio);
> -                       err = -EIO;
> -                       goto failed;
> -               }
> -       }
> -
>         bio->bi_end_io = nilfs_end_bio_write;
>         bio->bi_private = segbuf;
>         bio_set_op_attrs(bio, mode, mode_flags);
> @@ -365,10 +354,6 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf,
>         wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
>         wi->start = wi->end;
>         return 0;
> -
> - failed:
> -       wi->bio = NULL;
> -       return err;
>  }

In this revised version, "int err" is no longer used, so could you
delete it as well?

Regards,
Ryusuke Konishi

>
>  /**
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b45e0d50a405..b7ebcfe6b8d3 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -843,9 +843,6 @@ xfs_buf_readahead_map(
>  {
>         struct xfs_buf          *bp;
>
> -       if (bdi_read_congested(target->bt_bdev->bd_disk->bdi))
> -               return;
> -
>         xfs_buf_read_map(target, map, nmaps,
>                      XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops,
>                      __this_address);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 860b675c2929..2d764566280c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -135,11 +135,6 @@ static inline bool writeback_in_progress(struct bdi_writeback *wb)
>
>  struct backing_dev_info *inode_to_bdi(struct inode *inode);
>
> -static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
> -{
> -       return wb->congested & cong_bits;
> -}
> -
>  long congestion_wait(int sync, long timeout);
>
>  static inline bool mapping_can_writeback(struct address_space *mapping)
> @@ -391,27 +386,6 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
>
>  #endif /* CONFIG_CGROUP_WRITEBACK */
>
> -static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
> -{
> -       return wb_congested(&bdi->wb, cong_bits);
> -}
> -
> -static inline int bdi_read_congested(struct backing_dev_info *bdi)
> -{
> -       return bdi_congested(bdi, 1 << WB_sync_congested);
> -}
> -
> -static inline int bdi_write_congested(struct backing_dev_info *bdi)
> -{
> -       return bdi_congested(bdi, 1 << WB_async_congested);
> -}
> -
> -static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> -{
> -       return bdi_congested(bdi, (1 << WB_sync_congested) |
> -                                 (1 << WB_async_congested));
> -}
> -
>  const char *bdi_dev_name(struct backing_dev_info *bdi);
>
>  #endif /* _LINUX_BACKING_DEV_H */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ce8492939bd3..0b930556c4f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2362,9 +2362,7 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
>   */
>  static int current_may_throttle(void)
>  {
> -       return !(current->flags & PF_LOCAL_THROTTLE) ||
> -               current->backing_dev_info == NULL ||
> -               bdi_write_congested(current->backing_dev_info);
> +       return !(current->flags & PF_LOCAL_THROTTLE);
>  }
>
>  /*
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-10 23:35     ` NeilBrown
@ 2022-02-24 18:26       ` Jan Kara
  2022-02-28  4:28         ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2022-02-24 18:26 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, Andrew Morton, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe,
	linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

On Fri 11-02-22 10:35:17, NeilBrown wrote:
> On Thu, 10 Feb 2022, Jan Kara wrote:
> > Hi Neil!
> > 
> > On Thu 10-02-22 16:37:52, NeilBrown wrote:
> > > Add some "big-picture" documentation for read-ahead and polish the code
> > > to make it fit this documentation.
> > > 
> > > The meaning of ->async_size is clarified to match its name.
> > > i.e. Any request to ->readahead() has a sync part and an async part.
> > > The caller will wait for the sync pages to complete, but will not wait
> > > for the async pages.  The first async page is still marked PG_readahead
> 
> Thanks for the review!
> 
> > 
> > So I don't think this is how the code was meant. My understanding of
> > readahead comes from a comment:
> 
> I can't be sure what was "meant" but what I described is very close to
> what the code actually does.
> 
> > 
> > /*
> >  * On-demand readahead design.
> >  *
> > ....
> > 
> > in mm/readahead.c. The ra->size is how many pages should be read.
> > ra->async_size is the "lookahead size" meaning that we should place a
> > marker (PageReadahead) at "ra->size - ra->async_size" to trigger next
> > readahead.
> 
> This description of PageReadahead and ->async_size focuses on *what*
> happens, not *why*.  Importantly it doesn't help answer the question "What
> should I set ->async_size to?"

Sorry for the delayed reply. I was on vacation and then catching up with
stuff.  I know you have submitted another version of the patches but not
much has changed in this regard, so I figured it might be better to
continue the discussion here.

So my answer to "What should I set ->async_size to?" is: Ideally so that it
takes application to process data between ->async_size and ->size as long
as it takes the disk to load the next chunk of data into the page cache.
This is explained in the comment:

 * To overlap application thinking time and disk I/O time, we do
 * `readahead pipelining': Do not wait until the application consumed all
 * readahead pages and stalled on the missing page at readahead_index;
 * Instead, submit an asynchronous readahead I/O as soon as there are
 * only async_size pages left in the readahead window. Normally async_size
 * will be equal to size, for maximum pipelining.

Now because things such as think time or time to read pages are difficult
to estimate, we just end up triggering the next readahead as soon as we are
at least a bit confident the application is going to use the pages. But I
don't think
there was ever any intent to have "sync" and "async" parts of the request
or that ->size - ->async_size is what must be read. Any function in the
readahead code is free to return without doing anything regardless of the
passed parameters, and the caller needs to deal with that; ->size is just a
hint to the filesystem about how much we expect to be useful to read...

> The implication in the code is that when we sequentially access a page
> that was read-ahead (read before it was explicitly requested), we trigger
> more read ahead.  So ->async_size should refer to that part of the
> readahead request which was not explicitly requested.  With that
> understanding, it becomes possible to audit all the places that
> ->async_size are set and to see if they make sense.

I don't think this "implication" was ever intended :) But it may have
happened that some guesses at how big ->async_size should be have ended up
like that because of the impracticality of the original definition of how large
->async_size should be.

In fact I kind of like what you suggest ->async_size should be - it is
possible to actually implement that unlike the original definition - but it
is more of "let's redesign how readahead size is chosen" than "let's
document how readahead size is chosen" :).

> > > Note that the current function names page_cache_sync_ra() and
> > > > page_cache_async_ra() are misleading.  All ra requests are partly sync
> > > and partly async, so either part can be empty.
> > 
> > The meaning of these names IMO is:
> > page_cache_sync_ra() - tell readahead that we currently need the page at
> > ractl->_index and would prefer req_count pages fetched ahead.
> 
> I don't think that is what req_count means.  req_count is the number of
> pages that are needed *now* to satisfy the current read request.
> page_cache_sync_ra() has the job of determining how many more pages (if
> any) to read-ahead to satisfy future requests.  Sometimes it reads
> another req_count - sometimes not.

So this is certainly true for the page_cache_sync_readahead() call in
filemap_get_pages() but the call of page_cache_sync_ra() from
do_sync_mmap_readahead() does not quite follow what you say - we need only
one page there but request more.

> > page_cache_async_ra() - called when we hit the lookahead marker to give
> > the readahead code an opportunity to prefetch more pages.
> 
> Yes, but page_cache_async_ra() is given a req_count which, as above, is
> the number of pages needed to satisfy *this* request.  That wouldn't
> make sense if it was a pure future-readahead request.

Again, usage of req_count in page_cache_async_ra() is not always the number
of pages immediately needed.

> In practice, the word "sync" is used to mean "page was missing" and
> "async" here means "PG_readahead was found".  But that isn't what those
> words usually mean.
>
> They both call ondemand_readahead() passing False or True respectively
> to hit_readahead_marker - which makes that meaning clear in the code...
> but it still isn't clear in the name.

I agree the naming is somewhat confusing :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-24 18:26       ` Jan Kara
@ 2022-02-28  4:28         ` NeilBrown
  2022-02-28  4:47           ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2022-02-28  4:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jan Kara, Andrew Morton, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe,
	linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel,
	linux-block

On Fri, 25 Feb 2022, Jan Kara wrote:
> On Fri 11-02-22 10:35:17, NeilBrown wrote:
> > On Thu, 10 Feb 2022, Jan Kara wrote:
> > > Hi Neil!
> > > 
> > > On Thu 10-02-22 16:37:52, NeilBrown wrote:
> > > > Add some "big-picture" documentation for read-ahead and polish the code
> > > > to make it fit this documentation.
> > > > 
> > > > The meaning of ->async_size is clarified to match its name.
> > > > i.e. Any request to ->readahead() has a sync part and an async part.
> > > > The caller will wait for the sync pages to complete, but will not wait
> > > > for the async pages.  The first async page is still marked PG_readahead
> > 
> > Thanks for the review!
> > 
> > > 
> > > So I don't think this is how the code was meant. My understanding of
> > > readahead comes from a comment:
> > 
> > I can't be sure what was "meant" but what I described is very close to
> > what the code actually does.
> > 
> > > 
> > > /*
> > >  * On-demand readahead design.
> > >  *
> > > ....
> > > 
> > > in mm/readahead.c. The ra->size is how many pages should be read.
> > > ra->async_size is the "lookahead size" meaning that we should place a
> > > marker (PageReadahead) at "ra->size - ra->async_size" to trigger next
> > > readahead.
> > 
> > This description of PageReadahead and ->async_size focuses on *what*
> > happens, not *why*.  Importantly it doesn't help answer the question "What
> > should I set ->async_size to?"
> 
> Sorry for delayed reply. I was on vacation and then catching up with stuff.
> I know you have submitted another version of the patches but not much has
> changed in this regard so I figured it might be better to continue
> discussion here.
> 
> So my answer to "What should I set ->async_size to?" is: Ideally so that it
> takes the application as long to process the data between ->async_size and
> ->size as it takes the disk to load the next chunk of data into the page
> cache.
> This is explained in the comment:
> 
>  * To overlap application thinking time and disk I/O time, we do
>  * `readahead pipelining': Do not wait until the application consumed all
>  * readahead pages and stalled on the missing page at readahead_index;
>  * Instead, submit an asynchronous readahead I/O as soon as there are
>  * only async_size pages left in the readahead window. Normally async_size
>  * will be equal to size, for maximum pipelining.
> 
> Now because things such as think time or time to read pages are difficult
> to estimate, we just end up triggering the next readahead as soon as we are
> at least a bit confident the application is going to use the pages. But I
> don't think
> there was ever any intent to have "sync" and "async" parts of the request
> or that ->size - ->async_size is what must be read. Any function in the
> readahead code is free to return without doing anything regardless of
> passed parameters and the caller needs to deal with that, ->size is just a
> hint to the filesystem how much we expect to be useful to read...

I don't think it is only "difficult" to estimate.  It is impossible and
there is no evidence in the code of any attempt to estimate.  So I think
you are reading more into that comment than is there.

Certainly we want to overlap "think" and "read".  The issue I think is
not timing but size - how much to read.
At one extreme you can achieve maximum overlap by requesting the whole file when
the first byte is read.  There are obvious problems with that approach.
As you say, we need to be "at least a bit confident that the application
is going to use the pages".  The way I read the comment and the code,
that is the primary issue: How can we be confident that the data will be
used?

Obviously we cannot, but we can manage the risk.  The read ahead
calculations are all about risk management.  We limit the potential
waste by only pre-reading a fixed multiple of the number of pages that
have been explicitly requested - that multiple is 1.  So if 32 pages
have been read, it is now safe to read-ahead an extra 32 (or maybe it's
1.5... anyway it is a simple function of how much has been requested).
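
For reference, the scaling I'm alluding to lives in get_next_ra_size();
roughly, in the mm/readahead.c of this era:

	/* ramp the window up fast while it is small, then double it,
	 * clamping at the bdi-derived maximum */
	static unsigned long get_next_ra_size(struct file_ra_state *ra,
					      unsigned long max)
	{
		unsigned long cur = ra->size;

		if (cur < max / 16)
			return 4 * cur;
		if (cur <= max / 2)
			return 2 * cur;
		return max;
	}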

The "when to read it" question is based on finding a balance between
achieving maximum confidence that the pattern continues to be a
sequential read (so don't read too early) and loading the pages so
they'll be ready when needed (so don't read too late).  There is no
"obviously right" answer - but we can clearly see the solution that the
code chooses in *almost* all circumstances.  It starts a new
opportunistic read when the first page from the last opportunistic read
is first accessed.  I think that pattern is clear enough that it is
worth documenting, and I can see no justification for diverging from
that pattern.

You express a doubt that 'there was ever any intent to have "sync" and
"async" parts of the request' and you may well be correct.  But there is
nonetheless an observed reality that some of the pages passed to
->readahead() (or ->readpages()) are wanted immediately, and some are
not.  And the count of those that are not is what is placed in ->async_size.

When writing documentation the intent of the author is of some interest,
but the behaviour of the code is paramount.

> 
> > The implication in the code is that when we sequentially access a page
> > that was read-ahead (read before it was explicitly requested), we trigger
> > more read ahead.  So ->async_size should refer to that part of the
> > readahead request which was not explicitly requested.  With that
> > understanding, it becomes possible to audit all the places that
> > ->async_size are set and to see if they make sense.
> 
> I don't think this "implication" was ever intended :) But it may have
> happened that some guesses at how big ->async_size should be have ended up
> like that because of the impracticality of the original definition of how large
> ->async_size should be.
> 
> In fact I kind of like what you suggest ->async_size should be - it is
> possible to actually implement that unlike the original definition - but it
> is more of "let's redesign how readahead size is chosen" than "let's
> document how readahead size is chosen" :).

I don't think I'm changing the design - I'm describing the design in a
possibly novel way.

I accept that "number of pages that aren't needed immediately" and
"where to set the PG_readahead" flag are not intrinsically related and
that maybe I have been unduly swayed by the field name "async_size".

I certainly want the filesystem to know which pages are explicitly
requested, and which are being read opportunistically, so it can handle
them differently.  The filesystem already sees "async_size" so it is
easy to use that.  Maybe I should have added a new field which is the
actual async size, and kept it separate from the currently misnamed
"async_size".  But it didn't seem like a useful exercise.

> 
> > > > Note that the current function names page_cache_sync_ra() and
> > > > page_cache_async_ra() are misleading.  All ra requests are partly sync
> > > > and partly async, so either part can be empty.
> > > 
> > > The meaning of these names IMO is:
> > > page_cache_sync_ra() - tell readahead that we currently need the page at
> > > ractl->_index and would prefer req_count pages fetched ahead.
> > 
> > I don't think that is what req_count means.  req_count is the number of
> > pages that are needed *now* to satisfy the current read request.
> > page_cache_sync_ra() has the job of determining how many more pages (if
> > any) to read-ahead to satisfy future requests.  Sometimes it reads
> > another req_count - sometimes not.
> 
> So this is certainly true for the page_cache_sync_readahead() call in
> filemap_get_pages() but the call of page_cache_sync_ra() from
> do_sync_mmap_readahead() does not quite follow what you say - we need only
> one page there but request more.

Yes...  and no.  That code in do_sync_mmap_readahead() runs if
VM_SEQ_READ is set.  That means MADV_SEQUENTIAL has been passed with
madvise().  So the application has explicitly said "I will access this
mapping sequentially" which effectively mean "if I access any bytes, you
can assume I'll want to access a great many subsequent bytes".  So the
"request" here can be reasonably seen as asking for "as much as
possible".

So while this is read-ahead, it is quite different from the heuristic
read-ahead that we've been discussing so far.  This isn't "might be
needed, so don't bother if it seems too hard", this is "will be needed,
so treat it like any other read request".
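
The branch in question, condensed from do_sync_mmap_readahead() of this
era (the mmap-lock juggling around fpin is elided):

	if (vmf->vma->vm_flags & VM_SEQ_READ) {
		/* MADV_SEQUENTIAL: treat the fault as a promise of a
		 * long sequential read and ask for a full window */
		page_cache_sync_ra(&ractl, ra->ra_pages);
		return fpin;
	}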

> 
> > > page_cache_async_ra() - called when we hit the lookahead marker to give
> > > the readahead code an opportunity to prefetch more pages.
> > 
> > Yes, but page_cache_async_ra() is given a req_count which, as above, is
> > the number of pages needed to satisfy *this* request.  That wouldn't
> > make sense if it was a pure future-readahead request.
> 
> Again, usage of req_count in page_cache_async_ra() is not always the number
> of pages immediately needed.

I think it is ... or should be.
The use in f2fs/dir.c doesn't seem obvious, but I don't know the
context.
In every other case, the number comes either from an explicit
request, or from a confirmed desire (promise?) to read the file
sequentially.

> 
> > In practice, the word "sync" is used to mean "page was missing" and
> > "async" here means "PG_readahead was found".  But that isn't what those
> > words usually mean.
> >
> > They both call ondemand_readahead() passing False or True respectively
> > to hit_readahead_marker - which makes that meaning clear in the code...
> > but it still isn't clear in the name.
> 
> I agree the naming is somewhat confusing :)

Thanks :-)

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-28  4:28         ` NeilBrown
@ 2022-02-28  4:47           ` Andrew Morton
  2022-02-28  5:19             ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2022-02-28  4:47 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu, Jeff Layton,
	Ilya Dryomov, Miklos Szeredi, Trond Myklebust, Anna Schumaker,
	Ryusuke Konishi, Darrick J. Wong, Philipp Reisner,
	Lars Ellenberg, Paolo Valente, Jens Axboe, linux-doc, linux-mm,
	linux-nilfs, linux-nfs, linux-fsdevel, linux-f2fs-devel,
	linux-ext4, ceph-devel, drbd-dev, linux-kernel, linux-block

On Mon, 28 Feb 2022 15:28:39 +1100 "NeilBrown" <neilb@suse.de> wrote:

> When writing documentation the intent of the author is of some interest,
> but the behaviour of the code is paramount.

uh, er, ah, no.  The code describes the behaviour of the code.  The
comments are there to describe things other than the code's behaviour.
Things such as the author's intent.

Any deviation between the author's intent and the code's behaviour is
called a "bug", so it's pretty important to understand authorial
intent, no?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] MM: document and polish read-ahead code.
  2022-02-28  4:47           ` Andrew Morton
@ 2022-02-28  5:19             ` NeilBrown
  0 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2022-02-28  5:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu, Jeff Layton,
	Ilya Dryomov, Miklos Szeredi, Trond Myklebust, Anna Schumaker,
	Ryusuke Konishi, Darrick J. Wong, Philipp Reisner,
	Lars Ellenberg, Paolo Valente, Jens Axboe, linux-doc, linux-mm,
	linux-nilfs, linux-nfs, linux-fsdevel, linux-f2fs-devel,
	linux-ext4, ceph-devel, drbd-dev, linux-kernel, linux-block

On Mon, 28 Feb 2022, Andrew Morton wrote:
> On Mon, 28 Feb 2022 15:28:39 +1100 "NeilBrown" <neilb@suse.de> wrote:
> 
> > When writing documentation the intent of the author is of some interest,
> > but the behaviour of the code is paramount.
> 
> uh, er, ah, no.  The code describes the behaviour of the code.  The
> comments are there to describe things other than the code's behaviour.
> Things such as the author's intent.
> 
> Any deviation between the author's intent and the code's behaviour is
> called a "bug", so it's pretty important to understand authorial
> intent, no?

When the author is writing the documentation - then yes - definitely. 
When the "author" is several different people over a period of years,
then it is not even certain that there is a single unified "intent".

The author's intent is less interesting not so much because it is less
relevant, but because it is less available.

So when writing third-party post-hoc documentation, the focus has to be
on the code, though with reference to the intent to whatever extent it
is available.  Bugs then show up where the actual behaviour turns out to
be impossible to document coherently.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 10/11] block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"
  2022-02-22  3:17 [PATCH 00/11] Remove remaining parts of congestion tracking code NeilBrown
@ 2022-02-22  3:17 ` NeilBrown
  0 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2022-02-22  3:17 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara, Wu Fengguang, Jaegeuk Kim, Chao Yu,
	Jeff Layton, Ilya Dryomov, Miklos Szeredi, Trond Myklebust,
	Anna Schumaker, Ryusuke Konishi, Darrick J. Wong,
	Philipp Reisner, Lars Ellenberg, Paolo Valente, Jens Axboe
  Cc: linux-doc, linux-mm, linux-nilfs, linux-nfs, linux-fsdevel,
	linux-f2fs-devel, linux-ext4, ceph-devel, drbd-dev, linux-kernel

bfq_get_queue() expects a "bool" for the third arg, so pass "false"
rather than "BLK_RW_ASYNC" which will soon be removed.

Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/bfq-iosched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 36a66e97e3c2..ed9bb1054bf2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5448,7 +5448,7 @@ static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
 	bfqq = bic_to_bfqq(bic, false);
 	if (bfqq) {
 		bfq_release_process_ref(bfqd, bfqq);
-		bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic, true);
+		bfqq = bfq_get_queue(bfqd, bio, false, bic, true);
 		bic_set_bfqq(bic, bfqq, false);
 	}
 



^ permalink raw reply related	[flat|nested] 21+ messages in thread
