* [PATCH 0/12 v3] Writeback improvements
@ 2017-09-27 20:13 Jens Axboe
  2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds

We've had some issues with writeback in the presence of memory reclaim
at Facebook, and this patch set attempts to fix them up. The real
functional change for that issue is patch 10. The rest are cleanups,
as well as the removal of non-range-cyclic writeback. The users
of that were sync_inodes_sb() and wakeup_flusher_threads(), both of
which write back all of the dirty pages.

The basic idea is that we have callers that invoke
wakeup_flusher_threads() with nr_pages == 0, which means 'write back
everything'. For memory reclaim situations, we can end up queuing
a TON of these kinds of writeback units. This can cause softlockups
and further memory issues, since we allocate huge amounts of
struct wb_writeback_work to handle this writeback. Handle this
situation more gracefully.
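
As a rough illustration of the dedup guard that patch 10 introduces,
here is a minimal userspace model (hand-written for this posting; the
names are illustrative and this is a sketch, not the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Models the WB_start_all bit in wb->state added by patch 10. */
static atomic_bool start_all_pending;

/* Models wb_start_writeback() handling an 'everything' flush request. */
static bool queue_full_flush(void)
{
	/*
	 * As in the patch, a small race between the test and the set is
	 * tolerated; worst case we queue one extra work item.
	 */
	if (atomic_load(&start_all_pending))
		return false;		/* one already pending or in flight */
	atomic_store(&start_all_pending, true);
	/* ... allocate and queue the wb_writeback_work here ... */
	return true;
}

/* Models get_next_work_item() clearing the bit when work is dequeued. */
static void dequeue_full_flush(void)
{
	atomic_store(&start_all_pending, false);
}

int main(void)
{
	printf("%d\n", queue_full_flush());	/* 1: queued */
	printf("%d\n", queue_full_flush());	/* 0: deduplicated */
	dequeue_full_flush();
	printf("%d\n", queue_full_flush());	/* 1: one more may be queued */
	return 0;
}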


 drivers/md/bitmap.c                      |    2 
 drivers/staging/lustre/lustre/llite/rw.c |   25 ++-----
 fs/afs/write.c                           |   25 +------
 fs/btrfs/extent_io.c                     |   31 ++-------
 fs/buffer.c                              |   60 +++---------------
 fs/ceph/addr.c                           |   26 ++-----
 fs/cifs/file.c                           |   20 +-----
 fs/ext4/inode.c                          |   26 +++----
 fs/f2fs/data.c                           |   26 ++-----
 fs/fs-writeback.c                        |  103 +++++++++++++++++++------------
 fs/gfs2/aops.c                           |   27 ++------
 fs/ntfs/aops.c                           |    2 
 fs/ntfs/mft.c                            |    2 
 fs/sync.c                                |    2 
 include/linux/backing-dev-defs.h         |    1 
 include/linux/backing-dev.h              |    2 
 include/linux/buffer_head.h              |    2 
 include/linux/writeback.h                |    5 -
 include/trace/events/btrfs.h             |    2 
 include/trace/events/ext4.h              |    2 
 include/trace/events/f2fs.h              |    2 
 include/trace/events/writeback.h         |    4 -
 mm/page-writeback.c                      |   44 ++-----------
 mm/vmscan.c                              |    2 
 24 files changed, 159 insertions(+), 284 deletions(-)


Changes since v2:

- Removal of non-range_cyclic writeback.
- Cleanup of the buffer.c failure handling code, utilizing
  __GFP_NOFAIL instead of rolling our own.
- Reinstate cyclic writeback for laptop mode; it's now the only
  option available.
- Rebased on top of master, and series shuffled around.

Changes since v1:

- Rename WB_zero_pages to WB_start_all (Amir).
- Remove a test_bit() for a condition where we always expect the bit
  to be set.
- Remove 'nr_pages' from the wakeup flusher threads helpers, since
  everybody now passes in zero. Enables further cleanups in later
  patches too (Jan).
- Fix a case where I forgot to clear WB_start_all if 'work' allocation
  failed.
- Get rid of cond_resched() in the wb_do_writeback() loop.

-- 
Jens Axboe


* [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-28 14:08   ` Nikolay Borisov
  2017-10-02 15:02   ` Jan Kara
  2017-09-27 20:13 ` [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases Jens Axboe
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Instead of adding weird retry logic in alloc_page_buffers(), utilize
__GFP_NOFAIL to ensure that the VM takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/md/bitmap.c         |  2 +-
 fs/buffer.c                 | 33 ++++++++++-----------------------
 fs/ntfs/aops.c              |  2 +-
 fs/ntfs/mft.c               |  2 +-
 include/linux/buffer_head.h |  2 +-
 5 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index d2121637b4ab..4d8ed74efadf 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
 	pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
 		 (unsigned long long)index << PAGE_SHIFT);
 
-	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
+	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
 	if (!bh) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..1234ae343aef 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -861,16 +861,19 @@ int remove_inode_buffers(struct inode *inode)
  * which may not fail from ordinary buffer allocations.
  */
 struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
-		int retry)
+		bool retry)
 {
 	struct buffer_head *bh, *head;
+	gfp_t gfp = GFP_NOFS;
 	long offset;
 
-try_again:
+	if (retry)
+		gfp |= __GFP_NOFAIL;
+
 	head = NULL;
 	offset = PAGE_SIZE;
 	while ((offset -= size) >= 0) {
-		bh = alloc_buffer_head(GFP_NOFS);
+		bh = alloc_buffer_head(gfp);
 		if (!bh)
 			goto no_grow;
 
@@ -896,23 +899,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 		} while (head);
 	}
 
-	/*
-	 * Return failure for non-async IO requests.  Async IO requests
-	 * are not allowed to fail, so we have to wait until buffer heads
-	 * become available.  But we don't want tasks sleeping with 
-	 * partially complete buffers, so all were released above.
-	 */
-	if (!retry)
-		return NULL;
-
-	/* We're _really_ low on memory. Now we just
-	 * wait for old buffer heads to become free due to
-	 * finishing IO.  Since this is an async request and
-	 * the reserve list is empty, we're sure there are 
-	 * async buffer heads in use.
-	 */
-	free_more_memory();
-	goto try_again;
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
@@ -1021,7 +1008,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	/*
 	 * Allocate some buffers for this page
 	 */
-	bh = alloc_page_buffers(page, size, 0);
+	bh = alloc_page_buffers(page, size, false);
 	if (!bh)
 		goto failed;
 
@@ -1575,7 +1562,7 @@ void create_empty_buffers(struct page *page,
 {
 	struct buffer_head *bh, *head, *tail;
 
-	head = alloc_page_buffers(page, blocksize, 1);
+	head = alloc_page_buffers(page, blocksize, true);
 	bh = head;
 	do {
 		bh->b_state |= b_state;
@@ -2638,7 +2625,7 @@ int nobh_write_begin(struct address_space *mapping,
 	 * Be careful: the buffer linked list is a NULL terminated one, rather
 	 * than the circular one we're used to.
 	 */
-	head = alloc_page_buffers(page, blocksize, 0);
+	head = alloc_page_buffers(page, blocksize, false);
 	if (!head) {
 		ret = -ENOMEM;
 		goto out_release;
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index cc91856b5e2d..3a2e509c77c5 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -1739,7 +1739,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
 	spin_lock(&mapping->private_lock);
 	if (unlikely(!page_has_buffers(page))) {
 		spin_unlock(&mapping->private_lock);
-		bh = head = alloc_page_buffers(page, bh_size, 1);
+		bh = head = alloc_page_buffers(page, bh_size, true);
 		spin_lock(&mapping->private_lock);
 		if (likely(!page_has_buffers(page))) {
 			struct buffer_head *tail;
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index b6f402194f02..ee8392aee9f6 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -507,7 +507,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no,
 	if (unlikely(!page_has_buffers(page))) {
 		struct buffer_head *tail;
 
-		bh = head = alloc_page_buffers(page, blocksize, 1);
+		bh = head = alloc_page_buffers(page, blocksize, true);
 		do {
 			set_buffer_uptodate(bh);
 			tail = bh;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae555eccf..ae2d25f01b98 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -156,7 +156,7 @@ void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset);
 int try_to_free_buffers(struct page *);
 struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
-		int retry);
+		bool retry);
 void create_empty_buffers(struct page *, unsigned long,
 			unsigned long b_state);
 void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
-- 
2.7.4


* [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
  2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-28 14:11   ` Nikolay Borisov
  2017-10-03 12:10   ` Jan Kara
  2017-09-27 20:13 ` [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow() Jens Axboe
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

We currently use __GFP_NOFAIL for find_or_create_page(), which means
that it cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensures that it cannot fail.

After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/buffer.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1234ae343aef..3b60cd8456db 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -988,8 +988,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	gfp_mask |= __GFP_NOFAIL;
 
 	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
-	if (!page)
-		return ret;
 
 	BUG_ON(!PageLocked(page));
 
@@ -1008,9 +1006,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	/*
 	 * Allocate some buffers for this page
 	 */
-	bh = alloc_page_buffers(page, size, false);
-	if (!bh)
-		goto failed;
+	bh = alloc_page_buffers(page, size, true);
 
 	/*
 	 * Link the page to the buffers and initialise them.  Take the
-- 
2.7.4


* [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow()
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
  2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
  2017-09-27 20:13 ` [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-28 14:12   ` Nikolay Borisov
  2017-10-03 12:22   ` Jan Kara
  2017-09-27 20:13 ` [PATCH 04/12] fs: kill 'nr_pages' argument from wakeup_flusher_threads() Jens Axboe
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Since the previous commit removed any case where grow_buffers()
would return failure due to a failed memory allocation, we can
safely remove the case where we have to call free_more_memory()
in this function.

Since this is also the last user of free_more_memory(), kill
it off completely.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/buffer.c | 23 -----------------------
 1 file changed, 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b60cd8456db..bff571dc7bc3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -253,27 +253,6 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 }
 
 /*
- * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
- */
-static void free_more_memory(void)
-{
-	struct zoneref *z;
-	int nid;
-
-	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
-	yield();
-
-	for_each_online_node(nid) {
-
-		z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS), NULL);
-		if (z->zone)
-			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-						GFP_NOFS, NULL);
-	}
-}
-
-/*
  * I/O completion handler for block_read_full_page() - pages
  * which come unlocked at the end of I/O.
  */
@@ -1086,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
 		ret = grow_buffers(bdev, block, size, gfp);
 		if (ret < 0)
 			return NULL;
-		if (ret == 0)
-			free_more_memory();
 	}
 }
 
-- 
2.7.4


* [PATCH 04/12] fs: kill 'nr_pages' argument from wakeup_flusher_threads()
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (2 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow() Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback Jens Axboe
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Everybody is passing in 0 now; let's get rid of the argument.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c         | 9 ++++-----
 fs/sync.c                 | 2 +-
 include/linux/writeback.h | 2 +-
 mm/vmscan.c               | 2 +-
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430a2e41..bb6148dc6d24 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1947,12 +1947,12 @@ void wb_workfn(struct work_struct *work)
 }
 
 /*
- * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
- * the whole world.
+ * Wakeup the flusher threads to start writeback of all currently dirty pages
  */
-void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
+void wakeup_flusher_threads(enum wb_reason reason)
 {
 	struct backing_dev_info *bdi;
+	long nr_pages;
 
 	/*
 	 * If we are expecting writeback progress we must submit plugged IO.
@@ -1960,8 +1960,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 	if (blk_needs_flush_plug(current))
 		blk_schedule_flush_plug(current);
 
-	if (!nr_pages)
-		nr_pages = get_nr_dirty_pages();
+	nr_pages = get_nr_dirty_pages();
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
diff --git a/fs/sync.c b/fs/sync.c
index a576aa2e6b09..09f96a18dd93 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -108,7 +108,7 @@ SYSCALL_DEFINE0(sync)
 {
 	int nowait = 0, wait = 1;
 
-	wakeup_flusher_threads(0, WB_REASON_SYNC);
+	wakeup_flusher_threads(WB_REASON_SYNC);
 	iterate_supers(sync_inodes_one_sb, NULL);
 	iterate_supers(sync_fs_one_sb, &nowait);
 	iterate_supers(sync_fs_one_sb, &wait);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d5815794416c..1f9c6db5e29a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -189,7 +189,7 @@ bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 				   enum wb_reason reason);
 void sync_inodes_sb(struct super_block *);
-void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void wakeup_flusher_threads(enum wb_reason reason);
 void inode_wait_for_writeback(struct inode *inode);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711dd8776..42a7fdd52d87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1867,7 +1867,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		 * also allow kswapd to start writing pages during reclaim.
 		 */
 		if (stat.nr_unqueued_dirty == nr_taken) {
-			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
+			wakeup_flusher_threads(WB_REASON_VMSCAN);
 			set_bit(PGDAT_DIRTY, &pgdat->flags);
 		}
 
-- 
2.7.4


* [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (3 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 04/12] fs: kill 'nr_pages' argument from wakeup_flusher_threads() Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-10-03 12:43   ` Jan Kara
  2017-09-27 20:13 ` [PATCH 06/12] writeback: provide a wakeup_flusher_threads_bdi() Jens Axboe
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

We're writing back the full range of dirty pages on the devices, so
there's no point in making this special and not doing normal
range-cyclic writeback.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bb6148dc6d24..65e6992d8719 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1971,7 +1971,7 @@ void wakeup_flusher_threads(enum wb_reason reason)
 
 		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
 			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
-					   false, reason);
+						true, reason);
 	}
 	rcu_read_unlock();
 }
-- 
2.7.4


* [PATCH 06/12] writeback: provide a wakeup_flusher_threads_bdi()
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (4 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 07/12] writeback: pass in '0' for nr_pages writeback in laptop mode Jens Axboe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.

No functional changes in this patch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c         | 39 +++++++++++++++++++++++++++++----------
 include/linux/writeback.h |  2 ++
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 65e6992d8719..1a9b45250437 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1947,6 +1947,33 @@ void wb_workfn(struct work_struct *work)
 }
 
 /*
+ * Start writeback of `nr_pages' pages on this bdi. If `nr_pages' is zero,
+ * write back the whole world.
+ */
+static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+					 long nr_pages, enum wb_reason reason)
+{
+	struct bdi_writeback *wb;
+
+	if (!bdi_has_dirty_io(bdi))
+		return;
+
+	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+		wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+					true, reason);
+}
+
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+				enum wb_reason reason)
+{
+	long nr_pages = get_nr_dirty_pages();
+
+	rcu_read_lock();
+	__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+	rcu_read_unlock();
+}
+
+/*
  * Wakeup the flusher threads to start writeback of all currently dirty pages
  */
 void wakeup_flusher_threads(enum wb_reason reason)
@@ -1963,16 +1990,8 @@ void wakeup_flusher_threads(enum wb_reason reason)
 	nr_pages = get_nr_dirty_pages();
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
-		struct bdi_writeback *wb;
-
-		if (!bdi_has_dirty_io(bdi))
-			continue;
-
-		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
-			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
-						true, reason);
-	}
+	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
+		__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
 	rcu_read_unlock();
 }
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 1f9c6db5e29a..9c0091678af4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -190,6 +190,8 @@ bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 				   enum wb_reason reason);
 void sync_inodes_sb(struct super_block *);
 void wakeup_flusher_threads(enum wb_reason reason);
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+				enum wb_reason reason);
 void inode_wait_for_writeback(struct inode *inode);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
-- 
2.7.4


* [PATCH 07/12] writeback: pass in '0' for nr_pages writeback in laptop mode
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (5 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 06/12] writeback: provide a wakeup_flusher_threads_bdi() Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 08/12] writeback: make wb_start_writeback() static Jens Axboe
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Laptop mode really wants to write back the number of dirty
pages and inodes. Instead of calculating this in the caller,
just pass in 0 and let wakeup_flusher_threads() handle it.

Use the new wakeup_flusher_threads_bdi() instead of rolling
our own.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/page-writeback.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..8d1fc593bce8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1980,23 +1980,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS);
-	struct bdi_writeback *wb;
 
-	/*
-	 * We want to write everything out, not just down to the dirty
-	 * threshold
-	 */
-	if (!bdi_has_dirty_io(q->backing_dev_info))
-		return;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(wb, &q->backing_dev_info->wb_list, bdi_node)
-		if (wb_has_dirty_io(wb))
-			wb_start_writeback(wb, nr_pages, true,
-					   WB_REASON_LAPTOP_TIMER);
-	rcu_read_unlock();
+	wakeup_flusher_threads_bdi(q->backing_dev_info, WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.7.4


* [PATCH 08/12] writeback: make wb_start_writeback() static
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (6 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 07/12] writeback: pass in '0' for nr_pages writeback in laptop mode Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 09/12] writeback: move nr_pages == 0 logic to one location Jens Axboe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

We don't have any callers outside of fs-writeback.c anymore, so
make it private.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c           | 4 ++--
 include/linux/backing-dev.h | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a9b45250437..571a5702b568 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,8 +933,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
-			bool range_cyclic, enum wb_reason reason)
+static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+			       bool range_cyclic, enum wb_reason reason)
 {
 	struct wb_writeback_work *work;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..157e950a70dc 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -38,8 +38,6 @@ static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
 	return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
 }
 
-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
-			bool range_cyclic, enum wb_reason reason);
 void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
-- 
2.7.4


* [PATCH 09/12] writeback: move nr_pages == 0 logic to one location
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (7 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 08/12] writeback: make wb_start_writeback() static Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 10/12] writeback: only allow one inflight and pending full flush Jens Axboe
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 571a5702b568..d9be3dc34302 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,8 +933,19 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
-			       bool range_cyclic, enum wb_reason reason)
+/*
+ * Add in the number of potentially dirty inodes, because each inode
+ * write can dirty pagecache in the underlying blockdev.
+ */
+static unsigned long get_nr_dirty_pages(void)
+{
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
+		get_nr_dirty_inodes();
+}
+
+static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
+			       enum wb_reason reason)
 {
 	struct wb_writeback_work *work;
 
@@ -954,7 +965,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	}
 
 	work->sync_mode	= WB_SYNC_NONE;
-	work->nr_pages	= nr_pages;
+	work->nr_pages	= wb_split_bdi_pages(wb, get_nr_dirty_pages());
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
 	work->auto_free	= 1;
@@ -1814,17 +1825,6 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 	return work;
 }
 
-/*
- * Add in the number of potentially dirty inodes, because each inode
- * write can dirty pagecache in the underlying blockdev.
- */
-static unsigned long get_nr_dirty_pages(void)
-{
-	return global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS) +
-		get_nr_dirty_inodes();
-}
-
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (wb_over_bg_thresh(wb)) {
@@ -1951,7 +1951,7 @@ void wb_workfn(struct work_struct *work)
  * write back the whole world.
  */
 static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
-					 long nr_pages, enum wb_reason reason)
+					 enum wb_reason reason)
 {
 	struct bdi_writeback *wb;
 
@@ -1959,17 +1959,14 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 		return;
 
 	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
-		wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
-					true, reason);
+		wb_start_writeback(wb, true, reason);
 }
 
 void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 				enum wb_reason reason)
 {
-	long nr_pages = get_nr_dirty_pages();
-
 	rcu_read_lock();
-	__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+	__wakeup_flusher_threads_bdi(bdi, reason);
 	rcu_read_unlock();
 }
 
@@ -1979,7 +1976,6 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 void wakeup_flusher_threads(enum wb_reason reason)
 {
 	struct backing_dev_info *bdi;
-	long nr_pages;
 
 	/*
 	 * If we are expecting writeback progress we must submit plugged IO.
@@ -1987,11 +1983,9 @@ void wakeup_flusher_threads(enum wb_reason reason)
 	if (blk_needs_flush_plug(current))
 		blk_schedule_flush_plug(current);
 
-	nr_pages = get_nr_dirty_pages();
-
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-		__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+		__wakeup_flusher_threads_bdi(bdi, reason);
 	rcu_read_unlock();
 }
 
-- 
2.7.4


* [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (8 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 09/12] writeback: move nr_pages == 0 logic to one location Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-28 21:41   ` Andrew Morton
  2017-09-27 20:13 ` [PATCH 11/12] writeback: make sync_inodes_sb() use range cyclic writeback Jens Axboe
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:

1) We consume a ton of memory just allocating writeback work items.
   We've seen as much as 600 million of these writeback work items
   pending. That's a lot of memory to pointlessly hold hostage while
   the box is under memory pressure.

2) We spend so much time processing these work items that we
   introduce a softlockup in writeback processing. This is because
   each of the writeback work items doesn't end up doing any work
   (it's hard when you have millions of identical ones coming into
   the flush machinery), so we just sit in a tight loop pulling work
   items and deleting/freeing them.

Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.

This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c                | 25 +++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d9be3dc34302..2e46a1537e10 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,7 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
+	unsigned int start_all:1;	/* nr_pages == 0 (all) writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
 		return;
 
 	/*
+	 * All callers of this function want to start writeback of all
+	 * dirty pages. Places like vmscan can call this at a very
+	 * high frequency, causing pointless allocations of tons of
+	 * work items and keeping the flusher threads busy retrieving
+	 * that work. Ensure that we only allow one of them pending and
+	 * inflight at the time. It doesn't matter if we race a little
+	 * bit on this, so use the faster separate test/set bit variants.
+	 */
+	if (test_bit(WB_start_all, &wb->state))
+		return;
+
+	set_bit(WB_start_all, &wb->state);
+
+	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
 	work = kzalloc(sizeof(*work),
 		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
+		clear_bit(WB_start_all, &wb->state);
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
 		return;
@@ -969,6 +985,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
 	work->auto_free	= 1;
+	work->start_all = 1;
 
 	wb_queue_work(wb, work);
 }
@@ -1822,6 +1839,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 		list_del_init(&work->list);
 	}
 	spin_unlock_bh(&wb->work_lock);
+
+	/*
+	 * Once we start processing a work item that had !nr_pages,
+	 * clear the wb state bit for that so we can allow more.
+	 */
+	if (work && work->start_all)
+		clear_bit(WB_start_all, &wb->state);
+
 	return work;
 }
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..420de5c7c7f9 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -24,6 +24,7 @@ enum wb_state {
 	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
+	WB_start_all,		/* nr_pages == 0 (all) work pending */
 };
 
 enum wb_congested_state {
-- 
2.7.4


* [PATCH 11/12] writeback: make sync_inodes_sb() use range cyclic writeback
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (9 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 10/12] writeback: only allow one inflight and pending full flush Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:13 ` [PATCH 12/12] writeback: kill off ->range_cycle option Jens Axboe
  2017-09-28 13:19 ` [PATCH 0/12 v3] Writeback improvements John Stoffel
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

It's flushing all the dirty pages for this superblock, so there's
no point in not doing range-cyclic writeback.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2e46a1537e10..686d510e618c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2428,7 +2428,7 @@ void sync_inodes_sb(struct super_block *sb)
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
 		.nr_pages	= LONG_MAX,
-		.range_cyclic	= 0,
+		.range_cyclic	= 1,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
 		.for_sync	= 1,
-- 
2.7.4


* [PATCH 12/12] writeback: kill off ->range_cycle option
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (10 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 11/12] writeback: make sync_inodes_sb() use range cyclic writeback Jens Axboe
@ 2017-09-27 20:13 ` Jens Axboe
  2017-09-27 20:22   ` Jens Axboe
  2017-09-28 13:19 ` [PATCH 0/12 v3] Writeback improvements John Stoffel
  12 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:13 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds, Jens Axboe

All types of writeback are now range cyclic. Kill off the member
from struct wb_writeback_work and struct writeback_control.

Remove the various checks for whether or not this is range-cyclic
writeback in the writeback code and file systems.

For tracing, we leave the member there, and just always set it to 1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/staging/lustre/lustre/llite/rw.c | 25 +++++++------------------
 fs/afs/write.c                           | 25 ++++++-------------------
 fs/btrfs/extent_io.c                     | 31 +++++++------------------------
 fs/ceph/addr.c                           | 26 ++++++++------------------
 fs/cifs/file.c                           | 20 +++++---------------
 fs/ext4/inode.c                          | 26 ++++++++++----------------
 fs/f2fs/data.c                           | 26 ++++++++------------------
 fs/fs-writeback.c                        | 12 ++----------
 fs/gfs2/aops.c                           | 27 ++++++++-------------------
 include/linux/writeback.h                |  1 -
 include/trace/events/btrfs.h             |  2 +-
 include/trace/events/ext4.h              |  2 +-
 include/trace/events/f2fs.h              |  2 +-
 include/trace/events/writeback.h         |  4 ++--
 mm/page-writeback.c                      | 27 ++++++++-------------------
 15 files changed, 74 insertions(+), 182 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index e72090572bcc..d5960dd20405 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -1002,18 +1002,8 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	int result;
 	int ignore_layout = 0;
 
-	if (wbc->range_cyclic) {
-		start = mapping->writeback_index << PAGE_SHIFT;
-		end = OBD_OBJECT_EOF;
-	} else {
-		start = wbc->range_start;
-		end = wbc->range_end;
-		if (end == LLONG_MAX) {
-			end = OBD_OBJECT_EOF;
-			range_whole = start == 0;
-		}
-	}
-
+	start = mapping->writeback_index << PAGE_SHIFT;
+	end = OBD_OBJECT_EOF;
 	mode = CL_FSYNC_NONE;
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		mode = CL_FSYNC_LOCAL;
@@ -1034,12 +1024,11 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		result = 0;
 	}
 
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) {
-		if (end == OBD_OBJECT_EOF)
-			mapping->writeback_index = 0;
-		else
-			mapping->writeback_index = (end >> PAGE_SHIFT) + 1;
-	}
+	if (end == OBD_OBJECT_EOF)
+		mapping->writeback_index = 0;
+	else
+		mapping->writeback_index = (end >> PAGE_SHIFT) + 1;
+
 	return result;
 }
 
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 106e43db1115..2e21c197173f 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -570,24 +570,12 @@ int afs_writepages(struct address_space *mapping,
 
 	_enter("");
 
-	if (wbc->range_cyclic) {
-		start = mapping->writeback_index;
-		end = -1;
-		ret = afs_writepages_region(mapping, wbc, start, end, &next);
-		if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
-			ret = afs_writepages_region(mapping, wbc, 0, start,
-						    &next);
-		mapping->writeback_index = next;
-	} else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
-		end = (pgoff_t)(LLONG_MAX >> PAGE_SHIFT);
-		ret = afs_writepages_region(mapping, wbc, 0, end, &next);
-		if (wbc->nr_to_write > 0)
-			mapping->writeback_index = next;
-	} else {
-		start = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		ret = afs_writepages_region(mapping, wbc, start, end, &next);
-	}
+	start = mapping->writeback_index;
+	end = -1;
+	ret = afs_writepages_region(mapping, wbc, start, end, &next);
+	if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
+		ret = afs_writepages_region(mapping, wbc, 0, start, &next);
+	mapping->writeback_index = next;
 
 	_leave(" = %d", ret);
 	return ret;
@@ -685,7 +673,6 @@ int afs_writeback_all(struct afs_vnode *vnode)
 	struct writeback_control wbc = {
 		.sync_mode	= WB_SYNC_ALL,
 		.nr_to_write	= LONG_MAX,
-		.range_cyclic	= 1,
 	};
 	int ret;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3e5bb0cdd3cd..5c3bd7d50435 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3803,14 +3803,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 	int tag;
 
 	pagevec_init(&pvec, 0);
-	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		scanned = 1;
-	}
+	index = mapping->writeback_index; /* Start from prev offset */
+	end = -1;
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -3830,7 +3824,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!PagePrivate(page))
 				continue;
 
-			if (!wbc->range_cyclic && page->index > end) {
+			if (page->index > end) {
 				done = 1;
 				break;
 			}
@@ -3930,7 +3924,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
-	int range_whole = 0;
 	int scanned = 0;
 	int tag;
 
@@ -3947,16 +3940,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
 		return 0;
 
 	pagevec_init(&pvec, 0);
-	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		scanned = 1;
-	}
+	index = mapping->writeback_index; /* Start from prev offset */
+	end = -1;
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -3992,7 +3977,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				continue;
 			}
 
-			if (!wbc->range_cyclic && page->index > end) {
+			if (page->index > end) {
 				done = 1;
 				unlock_page(page);
 				continue;
@@ -4051,9 +4036,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
-		mapping->writeback_index = done_index;
-
+	mapping->writeback_index = done_index;
 	btrfs_add_delayed_iput(inode);
 	return ret;
 }
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3e3edc09d80..5aa5a901c448 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -791,7 +791,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 	unsigned int wsize = i_blocksize(inode);
 	struct ceph_osd_request *req = NULL;
 	struct ceph_writeback_ctl ceph_wbc;
-	bool should_loop, range_whole = false;
+	bool should_loop;
 	bool stop, done = false;
 
 	dout("writepages_start %p (mode=%s)\n", inode,
@@ -812,7 +812,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 
-	start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
+	start_index = mapping->writeback_index;
 	index = start_index;
 
 retry:
@@ -830,19 +830,11 @@ static int ceph_writepages_start(struct address_space *mapping,
 	should_loop = false;
 	if (ceph_wbc.head_snapc && snapc != last_snapc) {
 		/* where to start/end? */
-		if (wbc->range_cyclic) {
-			index = start_index;
-			end = -1;
-			if (index > 0)
-				should_loop = true;
-			dout(" cyclic, start at %lu\n", index);
-		} else {
-			index = wbc->range_start >> PAGE_SHIFT;
-			end = wbc->range_end >> PAGE_SHIFT;
-			if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-				range_whole = true;
-			dout(" not cyclic, %lu to %lu\n", index, end);
-		}
+		index = start_index;
+		end = -1;
+		if (index > 0)
+			should_loop = true;
+		dout(" cyclic, start at %lu\n", index);
 	} else if (!ceph_wbc.head_snapc) {
 		/* Do not respect wbc->range_{start,end}. Dirty pages
 		 * in that range can be associated with newer snapc.
@@ -1194,9 +1186,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = index;
-
+	mapping->writeback_index = index;
 out:
 	ceph_osdc_put_request(req);
 	ceph_put_snap_context(last_snapc);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 92fdf9c35de2..98498afc3f87 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2021,7 +2021,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
 			break;
 		}
 
-		if (!wbc->range_cyclic && page->index > end) {
+		if (page->index > end) {
 			*done = true;
 			unlock_page(page);
 			break;
@@ -2112,7 +2112,7 @@ static int cifs_writepages(struct address_space *mapping,
 {
 	struct cifs_sb_info *cifs_sb = CIFS_SB(mapping->host->i_sb);
 	struct TCP_Server_Info *server;
-	bool done = false, scanned = false, range_whole = false;
+	bool done = false, scanned = false;
 	pgoff_t end, index;
 	struct cifs_writedata *wdata;
 	int rc = 0;
@@ -2124,16 +2124,8 @@ static int cifs_writepages(struct address_space *mapping,
 	if (cifs_sb->wsize < PAGE_SIZE)
 		return generic_writepages(mapping, wbc);
 
-	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = true;
-		scanned = true;
-	}
+	index = mapping->writeback_index; /* Start from prev offset */
+	end = -1;
 	server = cifs_sb_master_tcon(cifs_sb)->ses->server;
 retry:
 	while (!done && index <= end) {
@@ -2214,9 +2206,7 @@ static int cifs_writepages(struct address_space *mapping,
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = index;
-
+	mapping->writeback_index = index;
 	return rc;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875bc7a1..f33f700b110c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2792,16 +2792,11 @@ static int ext4_writepages(struct address_space *mapping,
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 		range_whole = 1;
 
-	if (wbc->range_cyclic) {
-		writeback_index = mapping->writeback_index;
-		if (writeback_index)
-			cycled = 0;
-		mpd.first_page = writeback_index;
-		mpd.last_page = -1;
-	} else {
-		mpd.first_page = wbc->range_start >> PAGE_SHIFT;
-		mpd.last_page = wbc->range_end >> PAGE_SHIFT;
-	}
+	writeback_index = mapping->writeback_index;
+	if (writeback_index)
+		cycled = 0;
+	mpd.first_page = writeback_index;
+	mpd.last_page = -1;
 
 	mpd.inode = inode;
 	mpd.wbc = wbc;
@@ -2940,12 +2935,11 @@ static int ext4_writepages(struct address_space *mapping,
 	}
 
 	/* Update index */
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		/*
-		 * Set the writeback_index so that range_cyclic
-		 * mode will write it back later
-		 */
-		mapping->writeback_index = mpd.first_page;
+	/*
+	 * Set the writeback_index so that range_cyclic
+	 * mode will write it back later
+	 */
+	mapping->writeback_index = mpd.first_page;
 
 out_writepages:
 	trace_ext4_writepages_result(inode, wbc, ret,
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 36b535207c88..05edac5fdf2e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1632,7 +1632,6 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	pgoff_t last_idx = ULONG_MAX;
 	int cycled;
-	int range_whole = 0;
 	int tag;
 
 	pagevec_init(&pvec, 0);
@@ -1643,21 +1642,13 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 	else
 		clear_inode_flag(mapping->host, FI_HOT_DATA);
 
-	if (wbc->range_cyclic) {
-		writeback_index = mapping->writeback_index; /* prev offset */
-		index = writeback_index;
-		if (index == 0)
-			cycled = 1;
-		else
-			cycled = 0;
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		cycled = 1; /* ignore range_cyclic tests */
-	}
+	writeback_index = mapping->writeback_index; /* prev offset */
+	index = writeback_index;
+	if (index == 0)
+		cycled = 1;
+	else
+		cycled = 0;
+	end = -1;
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -1755,8 +1746,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 		end = writeback_index - 1;
 		goto retry;
 	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = done_index;
+	mapping->writeback_index = done_index;
 
 	if (last_idx != ULONG_MAX)
 		f2fs_submit_merged_write_cond(F2FS_M_SB(mapping), mapping->host,
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 686d510e618c..65bbc627addf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -49,7 +49,6 @@ struct wb_writeback_work {
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
@@ -945,8 +944,7 @@ static unsigned long get_nr_dirty_pages(void)
 		get_nr_dirty_inodes();
 }
 
-static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
-			       enum wb_reason reason)
+static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
 {
 	struct wb_writeback_work *work;
 
@@ -982,7 +980,6 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
 
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= wb_split_bdi_pages(wb, get_nr_dirty_pages());
-	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
 	work->auto_free	= 1;
 	work->start_all = 1;
@@ -1526,7 +1523,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.for_sync		= work->for_sync,
-		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
 	};
@@ -1698,7 +1694,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 	struct wb_writeback_work work = {
 		.nr_pages	= nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
-		.range_cyclic	= 1,
 		.reason		= reason,
 	};
 	struct blk_plug plug;
@@ -1858,7 +1853,6 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
 			.nr_pages	= LONG_MAX,
 			.sync_mode	= WB_SYNC_NONE,
 			.for_background	= 1,
-			.range_cyclic	= 1,
 			.reason		= WB_REASON_BACKGROUND,
 		};
 
@@ -1892,7 +1886,6 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 			.nr_pages	= nr_pages,
 			.sync_mode	= WB_SYNC_NONE,
 			.for_kupdate	= 1,
-			.range_cyclic	= 1,
 			.reason		= WB_REASON_PERIODIC,
 		};
 
@@ -1984,7 +1977,7 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
 		return;
 
 	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
-		wb_start_writeback(wb, true, reason);
+		wb_start_writeback(wb, reason);
 }
 
 void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
@@ -2428,7 +2421,6 @@ void sync_inodes_sb(struct super_block *sb)
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
 		.nr_pages	= LONG_MAX,
-		.range_cyclic	= 1,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
 		.for_sync	= 1,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 68ed06962537..4e9d86dd7413 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -384,25 +384,16 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
 	pgoff_t end;
 	pgoff_t done_index;
 	int cycled;
-	int range_whole = 0;
 	int tag;
 
 	pagevec_init(&pvec, 0);
-	if (wbc->range_cyclic) {
-		writeback_index = mapping->writeback_index; /* prev offset */
-		index = writeback_index;
-		if (index == 0)
-			cycled = 1;
-		else
-			cycled = 0;
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		cycled = 1; /* ignore range_cyclic tests */
-	}
+	writeback_index = mapping->writeback_index; /* prev offset */
+	index = writeback_index;
+	if (index == 0)
+		cycled = 1;
+	else
+		cycled = 0;
+	end = -1;
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -439,9 +430,7 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = done_index;
-
+	mapping->writeback_index = done_index;
 	return ret;
 }
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9c0091678af4..52319fbf300f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -87,7 +87,6 @@ struct writeback_control {
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
-	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index dc1d0df91e0b..734349288fec 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -502,7 +502,7 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__entry->range_end	= wbc->range_end;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
-		__entry->range_cyclic	= wbc->range_cyclic;
+		__entry->range_cyclic	= 1;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 		__entry->root_objectid	=
 				 BTRFS_I(inode)->root->root_key.objectid;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9c3bc3883d2f..e1de610895f1 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -394,7 +394,7 @@ TRACE_EVENT(ext4_writepages,
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 		__entry->sync_mode	= wbc->sync_mode;
 		__entry->for_kupdate	= wbc->for_kupdate;
-		__entry->range_cyclic	= wbc->range_cyclic;
+		__entry->range_cyclic	= 1;
 	),
 
 	TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld "
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 5d216f7fb05a..e291e4717bf7 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -1174,7 +1174,7 @@ TRACE_EVENT(f2fs_writepages,
 		__entry->for_background	= wbc->for_background;
 		__entry->tagged_writepages	= wbc->tagged_writepages;
 		__entry->for_reclaim	= wbc->for_reclaim;
-		__entry->range_cyclic	= wbc->range_cyclic;
+		__entry->range_cyclic	= 1;
 		__entry->for_sync	= wbc->for_sync;
 	),
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 9b57f014d79d..22e40e54943f 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -225,7 +225,7 @@ DECLARE_EVENT_CLASS(writeback_work_class,
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
 		__entry->for_kupdate = work->for_kupdate;
-		__entry->range_cyclic = work->range_cyclic;
+		__entry->range_cyclic = 1;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
@@ -328,7 +328,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
-		__entry->range_cyclic	= wbc->range_cyclic;
+		__entry->range_cyclic	= 1;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 		__entry->cgroup_ino	= __trace_wbc_assign_cgroup(wbc);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8d1fc593bce8..507c18eb33fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2149,25 +2149,16 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
 	int cycled;
-	int range_whole = 0;
 	int tag;
 
 	pagevec_init(&pvec, 0);
-	if (wbc->range_cyclic) {
-		writeback_index = mapping->writeback_index; /* prev offset */
-		index = writeback_index;
-		if (index == 0)
-			cycled = 1;
-		else
-			cycled = 0;
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		cycled = 1; /* ignore range_cyclic tests */
-	}
+	writeback_index = mapping->writeback_index; /* prev offset */
+	index = writeback_index;
+	if (index == 0)
+		cycled = 1;
+	else
+		cycled = 0;
+	end = -1;
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -2285,9 +2276,7 @@ int write_cache_pages(struct address_space *mapping,
 		end = writeback_index - 1;
 		goto retry;
 	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = done_index;
-
+	mapping->writeback_index = done_index;
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/12] writeback: kill off ->range_cycle option
  2017-09-27 20:13 ` [PATCH 12/12] writeback: kill off ->range_cycle option Jens Axboe
@ 2017-09-27 20:22   ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-27 20:22 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds

On 09/27/2017 10:13 PM, Jens Axboe wrote:
> All types of writeback are now range cyclic. Kill off the member
> from struct wb_writeback_work and struct writeback_control.
> 
> Remove the various checks for whether or not this is range cyclic
> writeback or not in the writeback code and file systems.
> 
> For tracing, we leave the member there, and just always set it to 1.

Disregard this patch, there are obviously places where we don't set
range_cyclic at all and write out a subset of the dirty pages (or
just one). Needs more looking into, for now just ignore this last
patch in the series.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread
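
For context, the kind of caller being referred to is writeback over an
explicit range, which fills in range_start/range_end and leaves
range_cyclic unset. A minimal sketch, with a made-up helper name; the
field usage mirrors what helpers like __filemap_fdatawrite_range() do:

	/* illustrative sketch only, not part of the posted series */
	static int write_subrange(struct address_space *mapping,
				  loff_t start, loff_t end, int sync_mode)
	{
		struct writeback_control wbc = {
			.sync_mode	= sync_mode,
			.nr_to_write	= LONG_MAX,
			.range_start	= start,
			.range_end	= end,
			/* no range_cyclic: only [start, end] is written back */
		};

		return do_writepages(mapping, &wbc);
	}

Killing ->range_cyclic outright would leave this class of users with no
way to express "this range only", which is why the patch was withdrawn.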

* Re: [PATCH 0/12 v3] Writeback improvements
  2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
                   ` (11 preceding siblings ...)
  2017-09-27 20:13 ` [PATCH 12/12] writeback: kill off ->range_cycle option Jens Axboe
@ 2017-09-28 13:19 ` John Stoffel
  2017-09-28 13:39   ` Jens Axboe
  12 siblings, 1 reply; 35+ messages in thread
From: John Stoffel @ 2017-09-28 13:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change for that issue is patch 10. The rest are cleanups,
> as well as the removal of doing non-range cyclic writeback. The users
> of that was sync_inodes_sb() and wakeup_flusher_threads(), both of
> which writeback all of the dirty pages.

So does this patch set make things faster?  Less bursty?  Does it make
writeout take longer, but with less spikes?  What is the performance
impact of this change?   I hate to be a pain, but this just smacks of
arm waving and I'm sure FB doesn't make changes without data... :-)

> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

Do you push back on the callers or slow them down?  Why do we even
allow callers to flush everything?

John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/12 v3] Writeback improvements
  2017-09-28 13:19 ` [PATCH 0/12 v3] Writeback improvements John Stoffel
@ 2017-09-28 13:39   ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-28 13:39 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On 09/28/2017 03:19 PM, John Stoffel wrote:
> On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
>> We've had some issues with writeback in presence of memory reclaim
>> at Facebook, and this patch set attempts to fix it up. The real
>> functional change for that issue is patch 10. The rest are cleanups,
>> as well as the removal of doing non-range cyclic writeback. The users
>> of that was sync_inodes_sb() and wakeup_flusher_threads(), both of
>> which writeback all of the dirty pages.
> 
> So does this patch set make things faster?  Less bursty?  Does it make
> writeout take longer, but with less spikes?  What is the performance
> impact of this change?   I hate to be a pain, but this just smacks of
> arm waving and I'm sure FB doesn't make changes without data... :-)

See patch 10, this isn't arm waving at all. The whole point is that
you can have millions of writeback work items, which don't do anything.
Not only are we wasting a full core doing nothing, it's bad enough
that we can trigger softlockups, since the flusher just sits there in
a loop pulling and freeing those work items.

It's all explained in that patch...

>> The basic idea is that we have callers that call
>> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
>> everything'. For memory reclaim situations, we can end up queuing
>> a TON of these kinds of writeback units. This can cause softlockups
>> and further memory issues, since we allocate huge amounts of
>> struct wb_writeback_work to handle this writeback. Handle this
>> situation more gracefully.
> 
> Do you push back on the callers or slow them down?  Why do we even
> allow callers to flush everything?

Ehm, because we have to? There are cases where flushing everything
makes sense. Laptop mode is one of them; the problematic case
here is memory reclaim. To clean dirty pages, you have to kick
the flusher threads.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread
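
A minimal sketch of the gating patch 10 adds for this situation, using
the WB_start_all bit in wb->state that is discussed further down the
thread. Names follow the posted series, but details (GFP flags, tracing)
are simplified, so treat this as an illustration rather than the patch
itself:

	static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
				       enum wb_reason reason)
	{
		struct wb_writeback_work *work;

		if (!wb_has_dirty_io(wb))
			return;

		/*
		 * One full flush in flight plus one pending is enough; if
		 * the bit is already set, a queued work item will write
		 * back these pages anyway, so don't allocate another one.
		 */
		if (test_bit(WB_start_all, &wb->state))
			return;
		set_bit(WB_start_all, &wb->state);

		work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
		if (!work) {
			/* clear the bit so a later wakeup can retry */
			clear_bit(WB_start_all, &wb->state);
			wb_wakeup(wb);
			return;
		}

		work->sync_mode		= WB_SYNC_NONE;
		work->nr_pages		= wb_split_bdi_pages(wb, get_nr_dirty_pages());
		work->range_cyclic	= range_cyclic;
		work->reason		= reason;
		work->auto_free		= 1;
		work->start_all		= 1;	/* WB_start_all is cleared when this runs */

		wb_queue_work(wb, work);
	}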

* Re: [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL
  2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
@ 2017-09-28 14:08   ` Nikolay Borisov
  2017-10-02 15:02   ` Jan Kara
  1 sibling, 0 replies; 35+ messages in thread
From: Nikolay Borisov @ 2017-09-28 14:08 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds



On 27.09.2017 23:13, Jens Axboe wrote:
> Instead of adding weird retry logic in that function, utilize
> __GFP_NOFAIL to ensure that the vm takes care of handling any
> potential retries appropriately. This means we don't have to
> call free_more_memory() from here.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  drivers/md/bitmap.c         |  2 +-
>  fs/buffer.c                 | 33 ++++++++++-----------------------
>  fs/ntfs/aops.c              |  2 +-
>  fs/ntfs/mft.c               |  2 +-
>  include/linux/buffer_head.h |  2 +-
>  5 files changed, 14 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index d2121637b4ab..4d8ed74efadf 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
>  	pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
>  		 (unsigned long long)index << PAGE_SHIFT);
>  
> -	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
> +	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
>  	if (!bh) {
>  		ret = -ENOMEM;
>  		goto out;
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df856bdb9..1234ae343aef 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -861,16 +861,19 @@ int remove_inode_buffers(struct inode *inode)
>   * which may not fail from ordinary buffer allocations.
>   */
>  struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> -		int retry)
> +		bool retry)
>  {
>  	struct buffer_head *bh, *head;
> +	gfp_t gfp = GFP_NOFS;
>  	long offset;
>  
> -try_again:
> +	if (retry)
> +		gfp |= __GFP_NOFAIL;
> +
>  	head = NULL;
>  	offset = PAGE_SIZE;
>  	while ((offset -= size) >= 0) {
> -		bh = alloc_buffer_head(GFP_NOFS);
> +		bh = alloc_buffer_head(gfp);
>  		if (!bh)
>  			goto no_grow;
>  
> @@ -896,23 +899,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
>  		} while (head);
>  	}
>  
> -	/*
> -	 * Return failure for non-async IO requests.  Async IO requests
> -	 * are not allowed to fail, so we have to wait until buffer heads
> -	 * become available.  But we don't want tasks sleeping with 
> -	 * partially complete buffers, so all were released above.
> -	 */
> -	if (!retry)
> -		return NULL;
> -
> -	/* We're _really_ low on memory. Now we just
> -	 * wait for old buffer heads to become free due to
> -	 * finishing IO.  Since this is an async request and
> -	 * the reserve list is empty, we're sure there are 
> -	 * async buffer heads in use.
> -	 */
> -	free_more_memory();
> -	goto try_again;
> +	return NULL;
>  }
>  EXPORT_SYMBOL_GPL(alloc_page_buffers);
>  
> @@ -1021,7 +1008,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	/*
>  	 * Allocate some buffers for this page
>  	 */
> -	bh = alloc_page_buffers(page, size, 0);
> +	bh = alloc_page_buffers(page, size, false);
>  	if (!bh)
>  		goto failed;
>  
> @@ -1575,7 +1562,7 @@ void create_empty_buffers(struct page *page,
>  {
>  	struct buffer_head *bh, *head, *tail;
>  
> -	head = alloc_page_buffers(page, blocksize, 1);
> +	head = alloc_page_buffers(page, blocksize, true);
>  	bh = head;
>  	do {
>  		bh->b_state |= b_state;
> @@ -2638,7 +2625,7 @@ int nobh_write_begin(struct address_space *mapping,
>  	 * Be careful: the buffer linked list is a NULL terminated one, rather
>  	 * than the circular one we're used to.
>  	 */
> -	head = alloc_page_buffers(page, blocksize, 0);
> +	head = alloc_page_buffers(page, blocksize, false);
>  	if (!head) {
>  		ret = -ENOMEM;
>  		goto out_release;
> diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
> index cc91856b5e2d..3a2e509c77c5 100644
> --- a/fs/ntfs/aops.c
> +++ b/fs/ntfs/aops.c
> @@ -1739,7 +1739,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
>  	spin_lock(&mapping->private_lock);
>  	if (unlikely(!page_has_buffers(page))) {
>  		spin_unlock(&mapping->private_lock);
> -		bh = head = alloc_page_buffers(page, bh_size, 1);
> +		bh = head = alloc_page_buffers(page, bh_size, true);
>  		spin_lock(&mapping->private_lock);
>  		if (likely(!page_has_buffers(page))) {
>  			struct buffer_head *tail;
> diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
> index b6f402194f02..ee8392aee9f6 100644
> --- a/fs/ntfs/mft.c
> +++ b/fs/ntfs/mft.c
> @@ -507,7 +507,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no,
>  	if (unlikely(!page_has_buffers(page))) {
>  		struct buffer_head *tail;
>  
> -		bh = head = alloc_page_buffers(page, blocksize, 1);
> +		bh = head = alloc_page_buffers(page, blocksize, true);
>  		do {
>  			set_buffer_uptodate(bh);
>  			tail = bh;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae555eccf..ae2d25f01b98 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -156,7 +156,7 @@ void set_bh_page(struct buffer_head *bh,
>  		struct page *page, unsigned long offset);
>  int try_to_free_buffers(struct page *);
>  struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> -		int retry);
> +		bool retry);
>  void create_empty_buffers(struct page *, unsigned long,
>  			unsigned long b_state);
>  void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-09-27 20:13 ` [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases Jens Axboe
@ 2017-09-28 14:11   ` Nikolay Borisov
  2017-09-28 18:12     ` Jens Axboe
  2017-10-03 12:10   ` Jan Kara
  1 sibling, 1 reply; 35+ messages in thread
From: Nikolay Borisov @ 2017-09-28 14:11 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds



On 27.09.2017 23:13, Jens Axboe wrote:
> We currently it it for find_or_create_page(), which means that it

nit: Perhaps you wanted to write "We currently use it for find_..."

otherwise:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> cannot fail. Ensure we also pass in 'retry == true' to
> alloc_page_buffers(), which also ensure that it cannot fail.
> 
> After this, there are no failure cases in grow_dev_page() that
> occur because of a failed memory allocation.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/buffer.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1234ae343aef..3b60cd8456db 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -988,8 +988,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	gfp_mask |= __GFP_NOFAIL;
>  
>  	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
> -	if (!page)
> -		return ret;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -1008,9 +1006,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	/*
>  	 * Allocate some buffers for this page
>  	 */
> -	bh = alloc_page_buffers(page, size, false);
> -	if (!bh)
> -		goto failed;
> +	bh = alloc_page_buffers(page, size, true);
>  
>  	/*
>  	 * Link the page to the buffers and initialise them.  Take the
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow()
  2017-09-27 20:13 ` [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow() Jens Axboe
@ 2017-09-28 14:12   ` Nikolay Borisov
  2017-10-03 12:22   ` Jan Kara
  1 sibling, 0 replies; 35+ messages in thread
From: Nikolay Borisov @ 2017-09-28 14:12 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds



On 27.09.2017 23:13, Jens Axboe wrote:
> Since the previous commit removed any case where grow_buffers()
> would return failure due to memory allocations, we can safely
> remove the case where we have to call free_more_memory() in
> this function.
> 
> Since this is also the last user of free_more_memory(), kill
> it off completely.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
>  fs/buffer.c | 23 -----------------------
>  1 file changed, 23 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 3b60cd8456db..bff571dc7bc3 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -253,27 +253,6 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
>  }
>  
>  /*
> - * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
> - */
> -static void free_more_memory(void)
> -{
> -	struct zoneref *z;
> -	int nid;
> -
> -	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
> -	yield();
> -
> -	for_each_online_node(nid) {
> -
> -		z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
> -						gfp_zone(GFP_NOFS), NULL);
> -		if (z->zone)
> -			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> -						GFP_NOFS, NULL);
> -	}
> -}
> -
> -/*
>   * I/O completion handler for block_read_full_page() - pages
>   * which come unlocked at the end of I/O.
>   */
> @@ -1086,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
>  		ret = grow_buffers(bdev, block, size, gfp);
>  		if (ret < 0)
>  			return NULL;
> -		if (ret == 0)
> -			free_more_memory();
>  	}
>  }
>  
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-09-28 14:11   ` Nikolay Borisov
@ 2017-09-28 18:12     ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-28 18:12 UTC (permalink / raw)
  To: Nikolay Borisov, linux-kernel, linux-fsdevel; +Cc: hannes, jack, torvalds

On 09/28/2017 04:11 PM, Nikolay Borisov wrote:
> 
> 
> On 27.09.2017 23:13, Jens Axboe wrote:
>> We currently it it for find_or_create_page(), which means that it
> 
> nit: Perhaps you wanted to write "We currently use it for find_..."
> 
> otherwise:
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>

Yeah, fixed up, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-27 20:13 ` [PATCH 10/12] writeback: only allow one inflight and pending full flush Jens Axboe
@ 2017-09-28 21:41   ` Andrew Morton
  2017-09-28 21:44     ` Linus Torvalds
  2017-09-29  0:15     ` Jens Axboe
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2017-09-28 21:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> When someone calls wakeup_flusher_threads() or
> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
> pages in the system (or on that bdi). If we are tight on memory, we
> can get tons of these queued from kswapd/vmscan. This causes (at
> least) two problems:
> 
> 1) We consume a ton of memory just allocating writeback work items.
>    We've seen as much as 600 million of these writeback work items
>    pending. That's a lot of memory to pointlessly hold hostage,
>    while the box is under memory pressure.
> 
> 2) We spend so much time processing these work items, that we
>    introduce a softlockup in writeback processing. This is because
>    each of the writeback work items don't end up doing any work (it's
>    hard when you have millions of identical ones coming in to the
>    flush machinery), so we just sit in a tight loop pulling work
>    items and deleting/freeing them.
> 
> Fix this by adding a 'start_all' bit to the writeback structure, and
> set that when someone attempts to flush all dirty pages. The bit is
> cleared when we start writeback on that work item. If the bit is
> already set when we attempt to queue !nr_pages writeback, then we
> simply ignore it.
> 
> This provides us one full flush in flight, with one pending as well,
> and makes for more efficient handling of this type of writeback.
> 
> ...
>
> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>  		return;
>  
>  	/*
> +	 * All callers of this function want to start writeback of all
> +	 * dirty pages. Places like vmscan can call this at a very
> +	 * high frequency, causing pointless allocations of tons of
> +	 * work items and keeping the flusher threads busy retrieving
> +	 * that work. Ensure that we only allow one of them pending and
> +	 * inflight at the time. It doesn't matter if we race a little
> +	 * bit on this, so use the faster separate test/set bit variants.
> +	 */
> +	if (test_bit(WB_start_all, &wb->state))
> +		return;
> +
> +	set_bit(WB_start_all, &wb->state);

test_and_set_bit()?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-28 21:41   ` Andrew Morton
@ 2017-09-28 21:44     ` Linus Torvalds
  2017-09-29  0:17       ` Jens Axboe
  2017-09-29  0:15     ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2017-09-28 21:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-fsdevel,
	Johannes Weiner, Jan Kara

On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> test_and_set_bit()?

If there aren't any atomicity concerns (either because of higher-level
locking, or because racing and having two people set the bit is fine),
it can be better to do them separately if the test_bit() is the common
case and you can avoid dirtying a cacheline that way.

But yeah, if that is the case, it might be worth documenting, because
test_and_set_bit() is the more obviously appropriate "there can be
only one" model.

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-28 21:41   ` Andrew Morton
  2017-09-28 21:44     ` Linus Torvalds
@ 2017-09-29  0:15     ` Jens Axboe
  1 sibling, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-29  0:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On 09/28/2017 11:41 PM, Andrew Morton wrote:
> On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
>> When someone calls wakeup_flusher_threads() or
>> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
>> pages in the system (or on that bdi). If we are tight on memory, we
>> can get tons of these queued from kswapd/vmscan. This causes (at
>> least) two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>>    We've seen as much as 600 million of these writeback work items
>>    pending. That's a lot of memory to pointlessly hold hostage,
>>    while the box is under memory pressure.
>>
>> 2) We spend so much time processing these work items, that we
>>    introduce a softlockup in writeback processing. This is because
>>    each of the writeback work items don't end up doing any work (it's
>>    hard when you have millions of identical ones coming in to the
>>    flush machinery), so we just sit in a tight loop pulling work
>>    items and deleting/freeing them.
>>
>> Fix this by adding a 'start_all' bit to the writeback structure, and
>> set that when someone attempts to flush all dirty pages. The bit is
>> cleared when we start writeback on that work item. If the bit is
>> already set when we attempt to queue !nr_pages writeback, then we
>> simply ignore it.
>>
>> This provides us one full flush in flight, with one pending as well,
>> and makes for more efficient handling of this type of writeback.
>>
>> ...
>>
>> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>>  		return;
>>  
>>  	/*
>> +	 * All callers of this function want to start writeback of all
>> +	 * dirty pages. Places like vmscan can call this at a very
>> +	 * high frequency, causing pointless allocations of tons of
>> +	 * work items and keeping the flusher threads busy retrieving
>> +	 * that work. Ensure that we only allow one of them pending and
>> +	 * inflight at the time. It doesn't matter if we race a little
>> +	 * bit on this, so use the faster separate test/set bit variants.
>> +	 */
>> +	if (test_bit(WB_start_all, &wb->state))
>> +		return;
>> +
>> +	set_bit(WB_start_all, &wb->state);
> 
> test_and_set_bit()?

Like Linus says, this is done purposely. I've even included a bit about
it in the comment above, though maybe it's not clear enough. I've used
this trick in blk-mq quite a bit as well, and for high frequency calls,
it can make a substantial difference not to redirty that cache line if
you can avoid it.

If you do care about atomicity, this works really well too:

if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
	...

just to avoid the locked operation. Also see this commit:
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
Author: Jens Axboe <axboe@fb.com>
Date:   Thu May 22 11:54:16 2014 -0700

    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

where there are some actual numbers on a specific case.

For the case at hand, we don't even need to do the test_and_set
case, since we don't care about a small race there.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread
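
For readers skimming the thread, the two variants discussed above side
by side, as a short sketch using the names from this series:

	/* common case: bit is usually already set, a small race is fine */
	if (test_bit(WB_start_all, &wb->state))
		return;
	set_bit(WB_start_all, &wb->state);

	/* variant that keeps "there can be only one" atomicity but still
	 * avoids the locked op (and dirtying the cacheline) in the hot path */
	if (test_bit(WB_start_all, &wb->state) ||
	    test_and_set_bit(WB_start_all, &wb->state))
		return;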

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-28 21:44     ` Linus Torvalds
@ 2017-09-29  0:17       ` Jens Axboe
  2017-09-29  5:21         ` Amir Goldstein
  2017-10-03 16:06         ` Matthew Wilcox
  0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2017-09-29  0:17 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Linux Kernel Mailing List, linux-fsdevel, Johannes Weiner, Jan Kara

On 09/28/2017 11:44 PM, Linus Torvalds wrote:
> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> test_and_set_bit()?
> 
> If there aren't any atomicity concerns (either because of higher-level
> locking, or because racing and having two people set the bit is fine),
> it can be better to do them separately if the test_bit() is the common
> case and you can avoid dirtying a cacheline that way.
> 
> But yeah, if that is the case, it might be worth documenting, because
> test_and_set_bit() is the more obviously appropriate "there can be
> only one" model.

It is documented though, but maybe not well enough...

I've actually had to document/explain it enough times now, that it
might be worth making a general construct. Though it has to be
used carefully, so perhaps it's better contained as separate use
cases.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-29  0:17       ` Jens Axboe
@ 2017-09-29  5:21         ` Amir Goldstein
  2017-10-03 16:06         ` Matthew Wilcox
  1 sibling, 0 replies; 35+ messages in thread
From: Amir Goldstein @ 2017-09-29  5:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Johannes Weiner, Jan Kara

On Fri, Sep 29, 2017 at 3:17 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
>> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>>
>>> test_and_set_bit()?
>>
>> If there aren't any atomicity concerns (either because of higher-level
>> locking, or because racing and having two people set the bit is fine),
>> it can be better to do them separately if the test_bit() is the common
>> case and you can avoid dirtying a cacheline that way.
>>
>> But yeah, if that is the case, it might be worth documenting, because
>> test_and_set_bit() is the more obviously appropriate "there can be
>> only one" model.
>
> It is documented though, but maybe not well enough...
>
> I've actually had to document/explain it enough times now, that it
> might be worth making a general construct. Though it has to be
> used carefully, so perhaps it's better contained as separate use
> cases.
>

Maybe change "Ensure that we only allow one of them pending"
in the comment above. Only the "allow one inflight" part is correct.

Or apply your follow-up patch and be done with it...

Amir.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL
  2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
  2017-09-28 14:08   ` Nikolay Borisov
@ 2017-10-02 15:02   ` Jan Kara
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Kara @ 2017-10-02 15:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed 27-09-17 14:13:48, Jens Axboe wrote:
> Instead of adding weird retry logic in that function, utilize
> __GFP_NOFAIL to ensure that the vm takes care of handling any
> potential retries appropriately. This means we don't have to
> call free_more_memory() from here.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  drivers/md/bitmap.c         |  2 +-
>  fs/buffer.c                 | 33 ++++++++++-----------------------
>  fs/ntfs/aops.c              |  2 +-
>  fs/ntfs/mft.c               |  2 +-
>  include/linux/buffer_head.h |  2 +-
>  5 files changed, 14 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index d2121637b4ab..4d8ed74efadf 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
>  	pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
>  		 (unsigned long long)index << PAGE_SHIFT);
>  
> -	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
> +	bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
>  	if (!bh) {
>  		ret = -ENOMEM;
>  		goto out;
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df856bdb9..1234ae343aef 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -861,16 +861,19 @@ int remove_inode_buffers(struct inode *inode)
>   * which may not fail from ordinary buffer allocations.
>   */
>  struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> -		int retry)
> +		bool retry)
>  {
>  	struct buffer_head *bh, *head;
> +	gfp_t gfp = GFP_NOFS;
>  	long offset;
>  
> -try_again:
> +	if (retry)
> +		gfp |= __GFP_NOFAIL;
> +
>  	head = NULL;
>  	offset = PAGE_SIZE;
>  	while ((offset -= size) >= 0) {
> -		bh = alloc_buffer_head(GFP_NOFS);
> +		bh = alloc_buffer_head(gfp);
>  		if (!bh)
>  			goto no_grow;
>  
> @@ -896,23 +899,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
>  		} while (head);
>  	}
>  
> -	/*
> -	 * Return failure for non-async IO requests.  Async IO requests
> -	 * are not allowed to fail, so we have to wait until buffer heads
> -	 * become available.  But we don't want tasks sleeping with 
> -	 * partially complete buffers, so all were released above.
> -	 */
> -	if (!retry)
> -		return NULL;
> -
> -	/* We're _really_ low on memory. Now we just
> -	 * wait for old buffer heads to become free due to
> -	 * finishing IO.  Since this is an async request and
> -	 * the reserve list is empty, we're sure there are 
> -	 * async buffer heads in use.
> -	 */
> -	free_more_memory();
> -	goto try_again;
> +	return NULL;
>  }
>  EXPORT_SYMBOL_GPL(alloc_page_buffers);
>  
> @@ -1021,7 +1008,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	/*
>  	 * Allocate some buffers for this page
>  	 */
> -	bh = alloc_page_buffers(page, size, 0);
> +	bh = alloc_page_buffers(page, size, false);
>  	if (!bh)
>  		goto failed;
>  
> @@ -1575,7 +1562,7 @@ void create_empty_buffers(struct page *page,
>  {
>  	struct buffer_head *bh, *head, *tail;
>  
> -	head = alloc_page_buffers(page, blocksize, 1);
> +	head = alloc_page_buffers(page, blocksize, true);
>  	bh = head;
>  	do {
>  		bh->b_state |= b_state;
> @@ -2638,7 +2625,7 @@ int nobh_write_begin(struct address_space *mapping,
>  	 * Be careful: the buffer linked list is a NULL terminated one, rather
>  	 * than the circular one we're used to.
>  	 */
> -	head = alloc_page_buffers(page, blocksize, 0);
> +	head = alloc_page_buffers(page, blocksize, false);
>  	if (!head) {
>  		ret = -ENOMEM;
>  		goto out_release;
> diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
> index cc91856b5e2d..3a2e509c77c5 100644
> --- a/fs/ntfs/aops.c
> +++ b/fs/ntfs/aops.c
> @@ -1739,7 +1739,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
>  	spin_lock(&mapping->private_lock);
>  	if (unlikely(!page_has_buffers(page))) {
>  		spin_unlock(&mapping->private_lock);
> -		bh = head = alloc_page_buffers(page, bh_size, 1);
> +		bh = head = alloc_page_buffers(page, bh_size, true);
>  		spin_lock(&mapping->private_lock);
>  		if (likely(!page_has_buffers(page))) {
>  			struct buffer_head *tail;
> diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
> index b6f402194f02..ee8392aee9f6 100644
> --- a/fs/ntfs/mft.c
> +++ b/fs/ntfs/mft.c
> @@ -507,7 +507,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no,
>  	if (unlikely(!page_has_buffers(page))) {
>  		struct buffer_head *tail;
>  
> -		bh = head = alloc_page_buffers(page, blocksize, 1);
> +		bh = head = alloc_page_buffers(page, blocksize, true);
>  		do {
>  			set_buffer_uptodate(bh);
>  			tail = bh;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae555eccf..ae2d25f01b98 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -156,7 +156,7 @@ void set_bh_page(struct buffer_head *bh,
>  		struct page *page, unsigned long offset);
>  int try_to_free_buffers(struct page *);
>  struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> -		int retry);
> +		bool retry);
>  void create_empty_buffers(struct page *, unsigned long,
>  			unsigned long b_state);
>  void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-09-27 20:13 ` [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases Jens Axboe
  2017-09-28 14:11   ` Nikolay Borisov
@ 2017-10-03 12:10   ` Jan Kara
  2017-10-03 12:25     ` Jan Kara
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Kara @ 2017-10-03 12:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed 27-09-17 14:13:49, Jens Axboe wrote:
> We currently it it for find_or_create_page(), which means that it
> cannot fail. Ensure we also pass in 'retry == true' to
> alloc_page_buffers(), which also ensure that it cannot fail.
> 
> After this, there are no failure cases in grow_dev_page() that
> occur because of a failed memory allocation.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Makes sense. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1234ae343aef..3b60cd8456db 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -988,8 +988,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	gfp_mask |= __GFP_NOFAIL;
>  
>  	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
> -	if (!page)
> -		return ret;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -1008,9 +1006,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  	/*
>  	 * Allocate some buffers for this page
>  	 */
> -	bh = alloc_page_buffers(page, size, false);
> -	if (!bh)
> -		goto failed;
> +	bh = alloc_page_buffers(page, size, true);
>  
>  	/*
>  	 * Link the page to the buffers and initialise them.  Take the
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow()
  2017-09-27 20:13 ` [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow() Jens Axboe
  2017-09-28 14:12   ` Nikolay Borisov
@ 2017-10-03 12:22   ` Jan Kara
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Kara @ 2017-10-03 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed 27-09-17 14:13:50, Jens Axboe wrote:
> Since the previous commit removed any case where grow_buffers()
> would return failure due to memory allocations, we can safely
> remove the case where we have to call free_more_memory() in
> this function.
> 
> Since this is also the last user of free_more_memory(), kill
> it off completely.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 23 -----------------------
>  1 file changed, 23 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 3b60cd8456db..bff571dc7bc3 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -253,27 +253,6 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
>  }
>  
>  /*
> - * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
> - */
> -static void free_more_memory(void)
> -{
> -	struct zoneref *z;
> -	int nid;
> -
> -	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
> -	yield();
> -
> -	for_each_online_node(nid) {
> -
> -		z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
> -						gfp_zone(GFP_NOFS), NULL);
> -		if (z->zone)
> -			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> -						GFP_NOFS, NULL);
> -	}
> -}
> -
> -/*
>   * I/O completion handler for block_read_full_page() - pages
>   * which come unlocked at the end of I/O.
>   */
> @@ -1086,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
>  		ret = grow_buffers(bdev, block, size, gfp);
>  		if (ret < 0)
>  			return NULL;
> -		if (ret == 0)
> -			free_more_memory();
>  	}
>  }
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-10-03 12:10   ` Jan Kara
@ 2017-10-03 12:25     ` Jan Kara
  2017-10-03 14:36       ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2017-10-03 12:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Tue 03-10-17 14:10:49, Jan Kara wrote:
> On Wed 27-09-17 14:13:49, Jens Axboe wrote:
> > We currently it it for find_or_create_page(), which means that it
> > cannot fail. Ensure we also pass in 'retry == true' to
> > alloc_page_buffers(), which also ensure that it cannot fail.
> > 
> > After this, there are no failure cases in grow_dev_page() that
> > occur because of a failed memory allocation.
> > 
> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Makes sense. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

I forgot one question though:

> >  	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
> > -	if (!page)
> > -		return ret;

Are we sure find_or_create_page() cannot fail for other reasons than
ENOMEM? Currently it seems to be the case AFAICT but it isn't obvious to me
that it is guaranteed in the future as well...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback
  2017-09-27 20:13 ` [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback Jens Axboe
@ 2017-10-03 12:43   ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2017-10-03 12:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, hannes, jack, torvalds

On Wed 27-09-17 14:13:52, Jens Axboe wrote:
> We're writing back the full range of dirty pages on the devices,
> there's no point in making this special and not do normal range
> cyclic writeback.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

OK, since this is just ordinary "memory cleaning" writeback, I agree that
range_cyclic probably makes more sense. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bb6148dc6d24..65e6992d8719 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1971,7 +1971,7 @@ void wakeup_flusher_threads(enum wb_reason reason)
>  
>  		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
>  			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
> -					   false, reason);
> +						true, reason);
>  	}
>  	rcu_read_unlock();
>  }
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-10-03 12:25     ` Jan Kara
@ 2017-10-03 14:36       ` Jens Axboe
  2017-10-03 15:52         ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-10-03 14:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, hannes, torvalds

On 10/03/2017 06:25 AM, Jan Kara wrote:
> On Tue 03-10-17 14:10:49, Jan Kara wrote:
>> On Wed 27-09-17 14:13:49, Jens Axboe wrote:
>>> We currently it it for find_or_create_page(), which means that it
>>> cannot fail. Ensure we also pass in 'retry == true' to
>>> alloc_page_buffers(), which also ensure that it cannot fail.
>>>
>>> After this, there are no failure cases in grow_dev_page() that
>>> occur because of a failed memory allocation.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> Makes sense. You can add:
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> I forgot one question though:
> 
>>>  	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
>>> -	if (!page)
>>> -		return ret;
> 
> Are we sure find_or_create_page() cannot fail for other reasons than
> ENOMEM? Currently it seems to be the case AFAICT but it isn't obvious to me
> that it is guaranteed in the future as well...

It looks up a page, or allocates and adds one. If we tell the allocator
that we cannot tolerate failure, then I don't see what else would be
able to make it fail. That function is such an integral part of
lookup/create for the page cache.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
  2017-10-03 14:36       ` Jens Axboe
@ 2017-10-03 15:52         ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2017-10-03 15:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, linux-kernel, linux-fsdevel, hannes, torvalds

On Tue 03-10-17 08:36:16, Jens Axboe wrote:
> On 10/03/2017 06:25 AM, Jan Kara wrote:
> > On Tue 03-10-17 14:10:49, Jan Kara wrote:
> >> On Wed 27-09-17 14:13:49, Jens Axboe wrote:
> >>> We currently it it for find_or_create_page(), which means that it
> >>> cannot fail. Ensure we also pass in 'retry == true' to
> >>> alloc_page_buffers(), which also ensure that it cannot fail.
> >>>
> >>> After this, there are no failure cases in grow_dev_page() that
> >>> occur because of a failed memory allocation.
> >>>
> >>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>
> >> Makes sense. You can add:
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > I forgot one question though:
> > 
> >>>  	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
> >>> -	if (!page)
> >>> -		return ret;
> > 
> > Are we sure find_or_create_page() cannot fail for other reasons than
> > ENOMEM? Currently it seems to be the case AFAICT but it isn't obvious to me
> > that it is guaranteed in the future as well...
> 
> It looks up a page, or allocates and adds one. If we tell the allocator
> that we cannot tolerate failure, then I don't see what else would be
> able to make it fail. That function is such an integral part of
> lookup/create for the page cache.

Well, there is memcg stuff in there, radix tree handling, etc. So it is
not only an allocation that could fail. But all of them take care of not
failing if __GFP_NOFAIL is set, so I guess there are good chances for this
to hold even
in the future.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-09-29  0:17       ` Jens Axboe
  2017-09-29  5:21         ` Amir Goldstein
@ 2017-10-03 16:06         ` Matthew Wilcox
  2017-10-03 16:11           ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2017-10-03 16:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Johannes Weiner, Jan Kara

On Fri, Sep 29, 2017 at 02:17:32AM +0200, Jens Axboe wrote:
> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
> > On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >>
> >> test_and_set_bit()?
> > 
> > If there aren't any atomicity concerns (either because of higher-level
> > locking, or because racing and having two people set the bit is fine),
> > it can be better to do them separately if the test_bit() is the common
> > case and you can avoid dirtying a cacheline that way.
> > 
> > But yeah, if that is the case, it might be worth documenting, because
> > test_and_set_bit() is the more obviously appropriate "there can be
> > only one" model.
> 
> It is documented though, but maybe not well enough...
> 
> I've actually had to document/explain it enough times now, that it
> might be worth making a general construct. Though it has to be
> used carefully, so perhaps it's better contained as separate use
> cases.

test_and_test_and_set_bit()?  It's an unusual name, so when either
reading it or writing it, people are going to say "something unusual
here", rather than "That Jens Axboe is such a n00b, he doesn't know how
to use test_and_set_bit()".  There are a few references out on the web
to test-and-test-and-set already, so it's not entirely unique to Linux.

Plus, some architectures might be able to optimise that, particularly
those which are ll/sc based.  It might be exactly the same as their
test_and_set().

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-10-03 16:06         ` Matthew Wilcox
@ 2017-10-03 16:11           ` Jens Axboe
  2017-10-03 17:03             ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-10-03 16:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Johannes Weiner, Jan Kara

On 10/03/2017 10:06 AM, Matthew Wilcox wrote:
> On Fri, Sep 29, 2017 at 02:17:32AM +0200, Jens Axboe wrote:
>> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
>>> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
>>> <akpm@linux-foundation.org> wrote:
>>>>
>>>> test_and_set_bit()?
>>>
>>> If there aren't any atomicity concerns (either because of higher-level
>>> locking, or because racing and having two people set the bit is fine),
>>> it can be better to do them separately if the test_bit() is the common
>>> case and you can avoid dirtying a cacheline that way.
>>>
>>> But yeah, if that is the case, it might be worth documenting, because
>>> test_and_set_bit() is the more obviously appropriate "there can be
>>> only one" model.
>>
>> It is documented though, but maybe not well enough...
>>
>> I've actually had to document/explain it enough times now, that it
>> might be worth making a general construct. Though it has to be
>> used carefully, so perhaps it's better contained as separate use
>> cases.
> 
> test_and_test_and_set_bit()?  It's an unusual name, so when either
> reading it or writing it, people are going to say "something unusual
> here", rather than "That Jens Axboe is such a n00b, he doesn't know how
> to use test_and_set_bit()".  There are a few references out on the web
> to test-and-test-and-set already, so it's not entirely unique to Linux.
> 
> Plus, some architectures might be able to optimise that, particularly
> those which are ll/sc based.  It might be exactly the same as their
> test_and_set().

I like that suggestion, but would suggest we make it
test_then_test_and_set_bit() since the 'then' naming would work for
having similar test_then_clear_bit() and not clash with
test_and_set_bit().

And yes, some archs would be able to optimize this nicely.

All worth it if I never have to explain it or add special comments
about it again :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 35+ messages in thread
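
A sketch of what the helper being discussed could look like; no such
function exists in the tree at this point, and the name is only one of
the candidates floated above:

	/* hypothetical generic construct, sketch only */
	static __always_inline bool test_and_test_and_set_bit(long nr,
						volatile unsigned long *addr)
	{
		/*
		 * Returns true if the bit was already set (someone else owns
		 * it), false if we just set it. The unlocked test_bit() in
		 * front avoids dirtying the cacheline in the common case.
		 */
		if (test_bit(nr, addr))
			return true;
		return test_and_set_bit(nr, addr);
	}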

* Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
  2017-10-03 16:11           ` Jens Axboe
@ 2017-10-03 17:03             ` Matthew Wilcox
  0 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2017-10-03 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Johannes Weiner, Jan Kara

On Tue, Oct 03, 2017 at 10:11:20AM -0600, Jens Axboe wrote:
> On 10/03/2017 10:06 AM, Matthew Wilcox wrote:
> > test_and_test_and_set_bit()?  It's an unusual name, so when either
> > reading it or writing it, people are going to say "something unusual
> > here", rather than "That Jens Axboe is such a n00b, he doesn't know how
> > to use test_and_set_bit()".  There are a few references out on the web
> > to test-and-test-and-set already, so it's not entirely unique to Linux.
> 
> I like that suggestion, but would suggest we make it
> test_then_test_and_set_bit() since the 'then' naming would work for
> having similar test_then_clear_bit() and not clash with
> test_and_set_bit().

'test-then-test-and-set' has the disadvantage of not being readily
searchable ... if you search for 'test-and-test-and-set', you find
discussions about why you might want to use this technique.  Also, I
don't like having set use a different name from clear; either we want
'test_and_test_and_(set|clear)_bit()' or 'test_then_(set|clear)_bit()'.

Usually I'd be in favour of the shorter name, but this should be a rare
thing for people to use, and if you search for test-then-clear you get
a lot of results about pregnancy tests ...

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-10-03 17:03 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-27 20:13 [PATCH 0/12 v3] Writeback improvements Jens Axboe
2017-09-27 20:13 ` [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL Jens Axboe
2017-09-28 14:08   ` Nikolay Borisov
2017-10-02 15:02   ` Jan Kara
2017-09-27 20:13 ` [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases Jens Axboe
2017-09-28 14:11   ` Nikolay Borisov
2017-09-28 18:12     ` Jens Axboe
2017-10-03 12:10   ` Jan Kara
2017-10-03 12:25     ` Jan Kara
2017-10-03 14:36       ` Jens Axboe
2017-10-03 15:52         ` Jan Kara
2017-09-27 20:13 ` [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow() Jens Axboe
2017-09-28 14:12   ` Nikolay Borisov
2017-10-03 12:22   ` Jan Kara
2017-09-27 20:13 ` [PATCH 04/12] fs: kill 'nr_pages' argument from wakeup_flusher_threads() Jens Axboe
2017-09-27 20:13 ` [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback Jens Axboe
2017-10-03 12:43   ` Jan Kara
2017-09-27 20:13 ` [PATCH 06/12] writeback: provide a wakeup_flusher_threads_bdi() Jens Axboe
2017-09-27 20:13 ` [PATCH 07/12] writeback: pass in '0' for nr_pages writeback in laptop mode Jens Axboe
2017-09-27 20:13 ` [PATCH 08/12] writeback: make wb_start_writeback() static Jens Axboe
2017-09-27 20:13 ` [PATCH 09/12] writeback: move nr_pages == 0 logic to one location Jens Axboe
2017-09-27 20:13 ` [PATCH 10/12] writeback: only allow one inflight and pending full flush Jens Axboe
2017-09-28 21:41   ` Andrew Morton
2017-09-28 21:44     ` Linus Torvalds
2017-09-29  0:17       ` Jens Axboe
2017-09-29  5:21         ` Amir Goldstein
2017-10-03 16:06         ` Matthew Wilcox
2017-10-03 16:11           ` Jens Axboe
2017-10-03 17:03             ` Matthew Wilcox
2017-09-29  0:15     ` Jens Axboe
2017-09-27 20:13 ` [PATCH 11/12] writeback: make sync_inodes_sb() use range cyclic writeback Jens Axboe
2017-09-27 20:13 ` [PATCH 12/12] writeback: kill off ->range_cycle option Jens Axboe
2017-09-27 20:22   ` Jens Axboe
2017-09-28 13:19 ` [PATCH 0/12 v3] Writeback improvements John Stoffel
2017-09-28 13:39   ` Jens Axboe
