* [PATCH/RFC] MM: fix writeback for NFS @ 2020-03-26 3:25 NeilBrown 2020-04-01 23:52 ` Writeback fixes " NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-03-26 3:25 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Tejun Heo Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 5175 bytes --] Page account happens a little differently for NFS, and writeback is aware of this, but handles it wrongly. With NFS, pages go throught three states which are accounted separately: NR_FILE_DIRTY - page is dirty and unwritten NR_WRITEBACK - page is being written back NR_UNSTABLE_NFS - page has been written, but might need to be written again if server restarts. Once an NFS COMMIT completes, page becomes freeable. writeback combines both the NF_FILE_DIRTY counters and NF_UNSTABLE_NFS together, which doesn't make a lot of sense. NR_UNSTABLE_NFS is more like NR_WRITEBACK in that there is nothing that can be done for these pages except wait. Current behaviour is that when there are a lot of NR_UNSTABLE_NFS pages, balance_dirty_pages() repeatedly schedules the writeback thread, even when the number of dirty pages is below the threshold. This result is each ->write_pages() call into NFS only finding relatively few DIRTY pages, so it creates requests that are well below the maximum size. Depending on performance of characteristics of the server, this can substantially reduce throughput. There are two places where balance_dirty_pages() schedules the writeback thread, and each of them schedule writeback too often. 1/ When balance_dirty_pages() determines that it must wait for the number of dirty pages to shrink, it unconditionally schedules writeback. This is probably not ideal for any filesystem but I measured it to be harmful for NFS. This writeback should only be scheduled if the number of dirty pages is above the threshold. While it is below, waiting for the pending writes to complete (and be committed) is a more appropriate behaviour. 2/ When balance_dirty_pages() finishes, it again shedules writeback if nr_reclaimable is above the background threshold. This is conceptually correct, but wrong because nr_reclaimable is wrong. The impliciation of this usage is that nr_reclaimable is the number of pages that can be reclaimed if writeback is triggered. In fact it is NR_FILE_DIRTY plus NR_UNSTABLE_NFS, which is not a meaningful number. So this patch removes NR_UNSTABLE_NFS from nr_reclaimable, and only schedules writeback when the new nr_reclaimable is greater than ->thresh or ->bg_thresh, depending on context. The effect of this can be measured by looking at the count of nfs_writepages calls while performing a large streaming write (larger than dirty_threshold) e.g. mount -t nfs server:/path /mnt dd if=/dev/zero of=/mnt/zeros bs=1M count=1000 awk '/events:/ {print $13}' /proc/self/mountstats Each writepages call should normally submit writes for 4M (1024 pages) or more (MIN_WRITEBACK_PAGES) (unless memory size is tiny). With the patch I measure typically 200-300 writepages calls with the above, which suggests an average of about 850 pages being made available to each writepages call. This is less than MIN_WRITEBACK_PAGES, but enough to assemble large 1MB WRITE requests fairly often. Without the patch, numbers range from about 2000 to over 10000, meaning an average of maybe 100 pages per call, which means we rarely get large 1M requests, and often get small requests Note that there are other places where NR_FILE_DIRTY and NR_UNSTABLE_NFS are combined (wb_over_bg_thresh() and get_nr_dirty_pages()). I suspect they are wrong too. I also suspect that unstable-NFS pages should not be included in the WB_RECLAIMABLE wb_stat, but I haven't explored that in detail yet. Note that the current behaviour of NFS w.r.t sending COMMIT requests is that as soon as a batch of writes all complete, a COMMIT is scheduled. Prior to commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete"), other mechanisms triggered the COMMIT. It is possible that the current writeback code made more sense before that commit. Signed-off-by: NeilBrown <neilb@suse.de> --- mm/page-writeback.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2caf780a42e7..99419cba1e7a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1593,10 +1593,11 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * written to the server's write cache, but has not yet * been flushed to permanent storage. */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); + gdtc->dirty = nr_reclaimable + + global_node_page_state(NR_WRITEBACK) + + global_node_page_state(NR_UNSTABLE_NFS); domain_dirty_limits(gdtc); @@ -1664,7 +1665,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb, break; } - if (unlikely(!writeback_in_progress(wb))) + if (nr_reclaimable > gdtc->thresh && + unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); mem_cgroup_flush_foreign(wb); -- 2.25.2 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Writeback fixes for NFS 2020-03-26 3:25 [PATCH/RFC] MM: fix writeback for NFS NeilBrown @ 2020-04-01 23:52 ` NeilBrown 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: NeilBrown @ 2020-04-01 23:52 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 2418 bytes --] Please ignore my previous patch (which this is in reply to), it was flawed in various ways. I now understand the code a bit better and have a somewhat simpler patch which appears to address the same problem. The problem is that writeback to NFS often produces lots of small writes (10s of K) rather than fewer large writes (1M). This pattern can often hurt throughput, but in certain circumstances it can hurt NFS throughput more than expected. Each nfs_writepages() call results in an NFS commit being sent to the server. If writeback triggers lots of smaller nfs_writepages calls, this means lots of COMMITs. If the server is slow to handle the COMMIT (I've seen the Ganesha NFS server take over 200ms per commit), these COMMITs can overlap, queue up, and choke the NFS server and cause order-of-magnitude drop in throughput. So we really want to only call nfs_writepages when there are a largish number of pages to be written - i.e. that are 'dirty'. For historical reasons that I didn't thoroughly research but I'm confident are no longer relevant, pages that have been written to the NFS server but have not yet been the subject of a COMMIT - so-called "unstable" pages - are effectively accounted that same as "dirty" pages (sometimes called "reclaimable"). This can result in writeback thinking there are lots of "dirty" pages to reclaim, while nfs_writepages can only find a few that it can write out. The second patch following changes the accounting for these "unstable" pages. They are now always accounted exactly the same was writeback pages. Conceptually they can be thought of as still in writeback, but the writeback is now happening on the server. A COMMIT will always automatically follow the writes generated by nfs_writepages, so from the perspective of the VM, there really is no difference: It has scheduled the write and there is nothing else it can do except wait. Testing this patch showed that loop-back NFS is prone to deadlocks again. I cannot see exactly how the change to 'unstable' accounting affected this, but I can see that the old +25% heuristic can no longer be justified given the complexity of writeback calculations. So the first patch following changes how writeback is handled for NFS servers handling loop-back requests (and other similar services) so that it is more obviously safe against excessive dirty pages scheduled for other devices. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-01 23:52 ` Writeback fixes " NeilBrown @ 2020-04-01 23:53 ` NeilBrown 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown ` (2 more replies) 2020-04-02 4:26 ` Hillf Danton 2020-04-06 23:42 ` Writeback fixes for NFS - V2 NeilBrown 2 siblings, 3 replies; 43+ messages in thread From: NeilBrown @ 2020-04-01 23:53 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 7556 bytes --] PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the loop block driver, where a daemon needs to write to one bdi in order to free up writes queued to another bdi. The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. This patch changes the heuristic to ignore the global limits and consider only the limit relevant to the bdi being written to. This approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and should not introduce surprises. This has the desired result of protecting the task from the consequences of large amounts of dirty data queued for other devices. This approach of "only consider the target bdi" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - remove the 25% bonus that that flag gives, and - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE set. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/block/loop.c | 2 +- fs/nfsd/vfs.c | 9 +++++---- include/linux/sched.h | 2 +- kernel/sys.c | 2 +- mm/page-writeback.c | 10 ++++++---- mm/vmscan.c | 4 ++-- 6 files changed, 16 insertions(+), 13 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 739b372a5112..2c59371ce936 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -897,7 +897,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 04278493bf15..5dcd27abc8cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1473,7 +1473,7 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2caf780a42e7..2afb09fa2fe0 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; + if (current->flags & PF_LOCAL_THROTTLE) + /* This task must only be throttled by its own writeback */ + strictlimit = true; for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; diff --git a/mm/vmscan.c b/mm/vmscan.c index 876370565455..c5cf25938c56 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1880,13 +1880,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); } -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-01 23:54 ` NeilBrown 2020-04-02 15:10 ` Christoph Hellwig 2020-04-02 19:55 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK Jan Kara 2020-04-02 16:35 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE Jan Kara 2020-04-03 15:15 ` Michal Hocko 2 siblings, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-04-01 23:54 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 5759 bytes --] After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT, it makes more sense to treat them as writeback pages. So this page deprecates NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (and there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this that, there are fewer writes which are each larger on average. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/fs-writeback.c | 1 - fs/nfs/internal.h | 7 +++++-- fs/nfs/write.c | 4 ++-- include/linux/mmzone.h | 2 +- mm/memcontrol.c | 1 - mm/page-writeback.c | 7 ++----- 6 files changed, 10 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 76ac9c7d32ec..c5bdf46e3b4b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS) + get_nr_dirty_inodes(); } diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index f80c47d5ff27..ba1ff5adeccd 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -660,8 +660,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) if (!cinfo->dreq) { struct inode *inode = page_file_mapping(page)->host; - inc_node_page_state(page, NR_UNSTABLE_NFS); - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); + /* This page is really still in write-back - just that the + * writeback is happening on the server now. + */ + inc_node_page_state(page, NR_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index c478b772cc49..2e15a56620b3 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -958,9 +958,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, static void nfs_clear_page_commit(struct page *page) { - dec_node_page_state(page, NR_UNSTABLE_NFS); + dec_node_page_state(page, NR_WRITEBACK); dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, - WB_RECLAIMABLE); + WB_WRITEBACK); } /* Called holding the request lock on @req */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 462f6873905a..227fcb8cd0e6 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -237,7 +237,7 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, - NR_UNSTABLE_NFS, /* NFS unstable pages */ + NR_UNSTABLE_NFS, /* NFS unstable pages - DEPRECATED DO NOT USE */ NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7ddf91c4295f..fad8e8a23235 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4317,7 +4317,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); - /* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2afb09fa2fe0..d1f03c799d11 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) unsigned long nr_pages = 0; nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); return nr_pages <= limit; @@ -1595,8 +1594,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * written to the server's write cache, but has not yet * been flushed to permanent storage. */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1940,8 +1938,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) * as we're trying to decide whether to put more under writeback. */ gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown @ 2020-04-02 15:10 ` Christoph Hellwig 2020-04-02 22:35 ` [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown 2020-04-02 19:55 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK Jan Kara 1 sibling, 1 reply; 43+ messages in thread From: Christoph Hellwig @ 2020-04-02 15:10 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Thu, Apr 02, 2020 at 10:54:07AM +1100, NeilBrown wrote: > > After an NFS page has been written it is considered "unstable" until a > COMMIT request succeeds. If the COMMIT fails, the page will be > re-written. > > These "unstable" pages are currently accounted as "reclaimable", > either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a > 'reclaimable' count. This might have made sense when sending the COMMIT > required a separate action by the VFS/MM (e.g. releasepage() used to > send a COMMIT). However now that all writes generated by ->writepages() > will automatically be followed by a COMMIT, it makes more sense to > treat them as writeback pages. > > So this page deprecates NR_UNSTABLE_NFS and accounts unstable pages in > NR_WRITEBACK and WB_WRITEBACK. Please remove it entirely if it isn't used any more. ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-02 15:10 ` Christoph Hellwig @ 2020-04-02 22:35 ` NeilBrown 2020-04-03 9:42 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-02 22:35 UTC (permalink / raw) To: Christoph Hellwig Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 13628 bytes --] After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (and there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Signed-off-by: NeilBrown <neilb@suse.de> --- NR_UNSTABLE_NFS completely removed as recommended by Christoph, removal of an unnecessary comment, and improvements to commit message. Thanks. Documentation/filesystems/proc.txt | 3 --- drivers/base/node.c | 2 -- fs/fs-writeback.c | 1 - fs/nfs/internal.h | 10 +++++++--- fs/nfs/write.c | 4 ++-- fs/proc/meminfo.c | 2 -- include/linux/mmzone.h | 1 - include/trace/events/writeback.h | 5 +---- mm/memcontrol.c | 1 - mm/page-writeback.c | 17 ++++------------- mm/page_alloc.c | 5 +---- mm/vmstat.c | 1 - 12 files changed, 15 insertions(+), 37 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 99ca040e3f90..690c712f5f79 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -904,7 +904,6 @@ Slab: 284364 kB SReclaimable: 159856 kB SUnreclaim: 124508 kB PageTables: 24448 kB -NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 7669796 kB @@ -975,8 +974,6 @@ SReclaimable: Part of Slab, that might be reclaimed, such as caches SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure PageTables: amount of memory dedicated to the lowest level of page tables. -NFS_Unstable: NFS pages sent to the server, but not yet committed to stable - storage Bounce: Memory used for block device "bounce buffers" WritebackTmp: Memory used by FUSE for temporary writeback buffers CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), diff --git a/drivers/base/node.c b/drivers/base/node.c index 98a31bafc8a2..7059021ce2af 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -416,7 +416,6 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d Shmem: %8lu kB\n" "Node %d KernelStack: %8lu kB\n" "Node %d PageTables: %8lu kB\n" - "Node %d NFS_Unstable: %8lu kB\n" "Node %d Bounce: %8lu kB\n" "Node %d WritebackTmp: %8lu kB\n" "Node %d KReclaimable: %8lu kB\n" @@ -439,7 +438,6 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(i.sharedram), nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), nid, K(sreclaimable + diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 76ac9c7d32ec..c5bdf46e3b4b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS) + get_nr_dirty_inodes(); } diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index f80c47d5ff27..749da02b547a 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -652,7 +652,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) } /* - * Record the page as unstable and mark its inode as dirty. + * Record the page as unstable (an extra writeback period) and mark its + * inode as dirty. */ static inline void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) @@ -660,8 +661,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) if (!cinfo->dreq) { struct inode *inode = page_file_mapping(page)->host; - inc_node_page_state(page, NR_UNSTABLE_NFS); - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); + /* This page is really still in write-back - just that the + * writeback is happening on the server now. + */ + inc_node_page_state(page, NR_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index c478b772cc49..2e15a56620b3 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -958,9 +958,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, static void nfs_clear_page_commit(struct page *page) { - dec_node_page_state(page, NR_UNSTABLE_NFS); + dec_node_page_state(page, NR_WRITEBACK); dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, - WB_RECLAIMABLE); + WB_WRITEBACK); } /* Called holding the request lock on @req */ diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 8c1f1bb1a5ce..1378a132ff7e 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -106,8 +106,6 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "PageTables: ", global_zone_page_state(NR_PAGETABLE)); - show_val_kb(m, "NFS_Unstable: ", - global_node_page_state(NR_UNSTABLE_NFS)); show_val_kb(m, "Bounce: ", global_zone_page_state(NR_BOUNCE)); show_val_kb(m, "WritebackTmp: ", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 462f6873905a..a18611197bea 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -237,7 +237,6 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, - NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index d94def25e4dc..45b5fbdb1f62 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -542,7 +542,6 @@ TRACE_EVENT(global_dirty_state, TP_STRUCT__entry( __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) - __field(unsigned long, nr_unstable) __field(unsigned long, background_thresh) __field(unsigned long, dirty_thresh) __field(unsigned long, dirty_limit) @@ -553,7 +552,6 @@ TRACE_EVENT(global_dirty_state, TP_fast_assign( __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); __entry->nr_written = global_node_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; @@ -561,12 +559,11 @@ TRACE_EVENT(global_dirty_state, __entry->dirty_limit = global_wb_domain.dirty_limit; ), - TP_printk("dirty=%lu writeback=%lu unstable=%lu " + TP_printk("dirty=%lu writeback=%lu " "bg_thresh=%lu thresh=%lu limit=%lu " "dirtied=%lu written=%lu", __entry->nr_dirty, __entry->nr_writeback, - __entry->nr_unstable, __entry->background_thresh, __entry->dirty_thresh, __entry->dirty_limit, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7ddf91c4295f..fad8e8a23235 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4317,7 +4317,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); - /* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2afb09fa2fe0..dbc73522609e 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) unsigned long nr_pages = 0; nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); return nr_pages <= limit; @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. * * Return: @wb's dirty limit in pages. The term "dirty" in the context of - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. + * dirty balancing includes all PG_dirty and PG_writeback pages. */ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? &mdtc_stor : NULL; struct dirty_throttle_control *sdtc; - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ + unsigned long nr_reclaimable; /* = file_dirty */ long period; long pause; long max_pause; @@ -1589,14 +1588,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, unsigned long m_thresh = 0; unsigned long m_bg_thresh = 0; - /* - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. - */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1940,8 +1932,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) * as we're trying to decide whether to put more under writeback. */ gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3c4eb750a199..6bd1112d590d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5239,7 +5239,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" " active_file:%lu inactive_file:%lu isolated_file:%lu\n" - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" + " unevictable:%lu dirty:%lu writeback:%lu\n" " slab_reclaimable:%lu slab_unreclaimable:%lu\n" " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" " free:%lu free_pcp:%lu free_cma:%lu\n", @@ -5252,7 +5252,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_UNEVICTABLE), global_node_page_state(NR_FILE_DIRTY), global_node_page_state(NR_WRITEBACK), - global_node_page_state(NR_UNSTABLE_NFS), global_node_page_state(NR_SLAB_RECLAIMABLE), global_node_page_state(NR_SLAB_UNRECLAIMABLE), global_node_page_state(NR_FILE_MAPPED), @@ -5285,7 +5284,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " anon_thp: %lukB" #endif " writeback_tmp:%lukB" - " unstable:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5307,7 +5305,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 78d53378db99..d1291537bbb9 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", - "nr_unstable", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied", -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-02 22:35 ` [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown @ 2020-04-03 9:42 ` Jan Kara 2020-04-03 11:03 ` Michal Hocko 2020-04-06 23:28 ` NeilBrown 0 siblings, 2 replies; 43+ messages in thread From: Jan Kara @ 2020-04-03 9:42 UTC (permalink / raw) To: NeilBrown Cc: Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Fri 03-04-20 09:35:21, NeilBrown wrote: > > After an NFS page has been written it is considered "unstable" until a > COMMIT request succeeds. If the COMMIT fails, the page will be > re-written. > > These "unstable" pages are currently accounted as "reclaimable", either > in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a > 'reclaimable' count. This might have made sense when sending the COMMIT > required a separate action by the VFS/MM (e.g. releasepage() used to > send a COMMIT). However now that all writes generated by ->writepages() > will automatically be followed by a COMMIT (since commit 919e3bd9a875 > ("NFS: Ensure we commit after writeback is complete")) it makes more > sense to treat them as writeback pages. > > So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in > NR_WRITEBACK and WB_WRITEBACK. > > A particular effect of this change is that when > wb_check_background_flush() calls wb_over_bg_threshold(), the latter > will report 'true' a lot less often as the 'unstable' pages are no > longer considered 'dirty' (and there is nothing that writeback can do > about them anyway). > > Currently wb_check_background_flush() will trigger writeback to NFS even > when there are relatively few dirty pages (if there are lots of unstable > pages), this can result in small writes going to the server (10s of > Kilobytes rather than a Megabyte) which hurts throughput. > With this patch, there are fewer writes which are each larger on average. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > > NR_UNSTABLE_NFS completely removed as recommended by Christoph, removal > of an unnecessary comment, and improvements to commit message. > Thanks. > > Documentation/filesystems/proc.txt | 3 --- > drivers/base/node.c | 2 -- > fs/fs-writeback.c | 1 - > fs/nfs/internal.h | 10 +++++++--- > fs/nfs/write.c | 4 ++-- > fs/proc/meminfo.c | 2 -- > include/linux/mmzone.h | 1 - > include/trace/events/writeback.h | 5 +---- > mm/memcontrol.c | 1 - > mm/page-writeback.c | 17 ++++------------- > mm/page_alloc.c | 5 +---- > mm/vmstat.c | 1 - > 12 files changed, 15 insertions(+), 37 deletions(-) > > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt > index 99ca040e3f90..690c712f5f79 100644 > --- a/Documentation/filesystems/proc.txt > +++ b/Documentation/filesystems/proc.txt > @@ -904,7 +904,6 @@ Slab: 284364 kB > SReclaimable: 159856 kB > SUnreclaim: 124508 kB > PageTables: 24448 kB > -NFS_Unstable: 0 kB > Bounce: 0 kB > WritebackTmp: 0 kB > CommitLimit: 7669796 kB > @@ -975,8 +974,6 @@ SReclaimable: Part of Slab, that might be reclaimed, such as caches > SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure > PageTables: amount of memory dedicated to the lowest level of page > tables. > -NFS_Unstable: NFS pages sent to the server, but not yet committed to stable > - storage > Bounce: Memory used for block device "bounce buffers" > WritebackTmp: Memory used by FUSE for temporary writeback buffers > CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 98a31bafc8a2..7059021ce2af 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -416,7 +416,6 @@ static ssize_t node_read_meminfo(struct device *dev, > "Node %d Shmem: %8lu kB\n" > "Node %d KernelStack: %8lu kB\n" > "Node %d PageTables: %8lu kB\n" > - "Node %d NFS_Unstable: %8lu kB\n" > "Node %d Bounce: %8lu kB\n" > "Node %d WritebackTmp: %8lu kB\n" > "Node %d KReclaimable: %8lu kB\n" > @@ -439,7 +438,6 @@ static ssize_t node_read_meminfo(struct device *dev, > nid, K(i.sharedram), > nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), > nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), > - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), > nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), > nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), > nid, K(sreclaimable + So I don't think we can just remove lines from procfs files like this. That has a high potential of breaking some userspace app that is not careful enough when parsing the file. So I think that we need to leave there the format string and just replace K(node_page_state(pgdat, NR_UNSTABLE_NFS)) with 0. > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c > index 8c1f1bb1a5ce..1378a132ff7e 100644 > --- a/fs/proc/meminfo.c > +++ b/fs/proc/meminfo.c > @@ -106,8 +106,6 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > show_val_kb(m, "PageTables: ", > global_zone_page_state(NR_PAGETABLE)); > > - show_val_kb(m, "NFS_Unstable: ", > - global_node_page_state(NR_UNSTABLE_NFS)); > show_val_kb(m, "Bounce: ", > global_zone_page_state(NR_BOUNCE)); > show_val_kb(m, "WritebackTmp: ", Similarly here. > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 78d53378db99..d1291537bbb9 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { > "nr_file_hugepages", > "nr_file_pmdmapped", > "nr_anon_transparent_hugepages", > - "nr_unstable", > "nr_vmscan_write", > "nr_vmscan_immediate_reclaim", > "nr_dirtied", This is probably the most tricky to deal with given how /proc/vmstat is formatted. OTOH for this file there's good chance we'd get away with just deleting nr_unstable line because there are entries added to it in the middle (e.g. in 60fbf0ab5da1 last September) and nobody complained yet. What do mm people think? How were changes to vmstat counters handled in the past? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-03 9:42 ` Jan Kara @ 2020-04-03 11:03 ` Michal Hocko 2020-04-06 0:14 ` NeilBrown 2020-04-06 23:28 ` NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: Michal Hocko @ 2020-04-03 11:03 UTC (permalink / raw) To: Jan Kara Cc: NeilBrown, Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, linux-mm, linux-nfs, LKML On Fri 03-04-20 11:42:20, Jan Kara wrote: [...] > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 78d53378db99..d1291537bbb9 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { > > "nr_file_hugepages", > > "nr_file_pmdmapped", > > "nr_anon_transparent_hugepages", > > - "nr_unstable", > > "nr_vmscan_write", > > "nr_vmscan_immediate_reclaim", > > "nr_dirtied", > > This is probably the most tricky to deal with given how /proc/vmstat is > formatted. OTOH for this file there's good chance we'd get away with just > deleting nr_unstable line because there are entries added to it in the > middle (e.g. in 60fbf0ab5da1 last September) and nobody complained yet. > > What do mm people think? How were changes to vmstat counters handled in the > past? Adding new counters in the middle seems to be generally OK. I would be more worried about removing counters though. So if we can simply print a phone value at the very end then this should be a reasonable workaround. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-03 11:03 ` Michal Hocko @ 2020-04-06 0:14 ` NeilBrown 2020-04-06 7:41 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-06 0:14 UTC (permalink / raw) To: Michal Hocko, Jan Kara Cc: Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 1356 bytes --] On Fri, Apr 03 2020, Michal Hocko wrote: > On Fri 03-04-20 11:42:20, Jan Kara wrote: > [...] >> > diff --git a/mm/vmstat.c b/mm/vmstat.c >> > index 78d53378db99..d1291537bbb9 100644 >> > --- a/mm/vmstat.c >> > +++ b/mm/vmstat.c >> > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { >> > "nr_file_hugepages", >> > "nr_file_pmdmapped", >> > "nr_anon_transparent_hugepages", >> > - "nr_unstable", >> > "nr_vmscan_write", >> > "nr_vmscan_immediate_reclaim", >> > "nr_dirtied", >> >> This is probably the most tricky to deal with given how /proc/vmstat is >> formatted. OTOH for this file there's good chance we'd get away with just >> deleting nr_unstable line because there are entries added to it in the >> middle (e.g. in 60fbf0ab5da1 last September) and nobody complained yet. >> >> What do mm people think? How were changes to vmstat counters handled in the >> past? > > Adding new counters in the middle seems to be generally OK. I would be > more worried about removing counters though. So if we can simply print a > phone value at the very end then this should be a reasonable workaround. At the very end? Do you mean not have "nr_unstable 0" appear at all, but having "dummy 0" appear at the end just so that the number of lines doesn't decrease? Am I misunderstanding? Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-06 0:14 ` NeilBrown @ 2020-04-06 7:41 ` Michal Hocko 0 siblings, 0 replies; 43+ messages in thread From: Michal Hocko @ 2020-04-06 7:41 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, linux-mm, linux-nfs, LKML On Mon 06-04-20 10:14:16, Neil Brown wrote: > On Fri, Apr 03 2020, Michal Hocko wrote: > > > On Fri 03-04-20 11:42:20, Jan Kara wrote: > > [...] > >> > diff --git a/mm/vmstat.c b/mm/vmstat.c > >> > index 78d53378db99..d1291537bbb9 100644 > >> > --- a/mm/vmstat.c > >> > +++ b/mm/vmstat.c > >> > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { > >> > "nr_file_hugepages", > >> > "nr_file_pmdmapped", > >> > "nr_anon_transparent_hugepages", > >> > - "nr_unstable", > >> > "nr_vmscan_write", > >> > "nr_vmscan_immediate_reclaim", > >> > "nr_dirtied", > >> > >> This is probably the most tricky to deal with given how /proc/vmstat is > >> formatted. OTOH for this file there's good chance we'd get away with just > >> deleting nr_unstable line because there are entries added to it in the > >> middle (e.g. in 60fbf0ab5da1 last September) and nobody complained yet. > >> > >> What do mm people think? How were changes to vmstat counters handled in the > >> past? > > > > Adding new counters in the middle seems to be generally OK. I would be > > more worried about removing counters though. So if we can simply print a > > phone value at the very end then this should be a reasonable workaround. > > At the very end? > Do you mean not have "nr_unstable 0" appear at all, but having "dummy 0" > appear at the end just so that the number of lines doesn't decrease? > Am I misunderstanding? Sorry for not being clear. I meant semething like diff --git a/mm/vmstat.c b/mm/vmstat.c index 78d53378db99..836e3f7a7aff 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1705,8 +1705,16 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos) { (*pos)++; - if (*pos >= NR_VMSTAT_ITEMS) + if (*pos >= NR_VMSTAT_ITEMS) { + /* + * deprecated counters which are no longer represented + * in vmstat arrays. We just lie about them to be always + * 0 to not break userspace which might expect them in + * int the output. + */ + seq_puts(m, "nr_unstable 0") return NULL; + } return (unsigned long *)m->private + *pos; } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-03 9:42 ` Jan Kara 2020-04-03 11:03 ` Michal Hocko @ 2020-04-06 23:28 ` NeilBrown 2020-04-07 7:33 ` Michal Hocko 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-06 23:28 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 443 bytes --] On Fri, Apr 03 2020, Jan Kara wrote: > > So I don't think we can just remove lines from procfs files like this. That > has a high potential of breaking some userspace app that is not careful > enough when parsing the file. So I think that we need to leave there the > format string and just replace K(node_page_state(pgdat, NR_UNSTABLE_NFS)) > with 0. OK. I assume changing the static trace points isn't a problem though? Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-06 23:28 ` NeilBrown @ 2020-04-07 7:33 ` Michal Hocko 0 siblings, 0 replies; 43+ messages in thread From: Michal Hocko @ 2020-04-07 7:33 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, linux-mm, linux-nfs, LKML On Tue 07-04-20 09:28:19, Neil Brown wrote: > On Fri, Apr 03 2020, Jan Kara wrote: > > > > So I don't think we can just remove lines from procfs files like this. That > > has a high potential of breaking some userspace app that is not careful > > enough when parsing the file. So I think that we need to leave there the > > format string and just replace K(node_page_state(pgdat, NR_UNSTABLE_NFS)) > > with 0. > > OK. I assume changing the static trace points isn't a problem though? It shouldn't be until we learn that somebody depends on it... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown 2020-04-02 15:10 ` Christoph Hellwig @ 2020-04-02 19:55 ` Jan Kara 1 sibling, 0 replies; 43+ messages in thread From: Jan Kara @ 2020-04-02 19:55 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Thu 02-04-20 10:54:07, NeilBrown wrote: > > After an NFS page has been written it is considered "unstable" until a > COMMIT request succeeds. If the COMMIT fails, the page will be > re-written. > > These "unstable" pages are currently accounted as "reclaimable", > either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a > 'reclaimable' count. This might have made sense when sending the COMMIT > required a separate action by the VFS/MM (e.g. releasepage() used to > send a COMMIT). However now that all writes generated by ->writepages() > will automatically be followed by a COMMIT, it makes more sense to > treat them as writeback pages. > > So this page deprecates NR_UNSTABLE_NFS and accounts unstable pages in > NR_WRITEBACK and WB_WRITEBACK. > > A particular effect of this change is that when > wb_check_background_flush() calls wb_over_bg_threshold(), the latter > will report 'true' a lot less often as the 'unstable' pages are no > longer considered 'dirty' (and there is nothing that writeback can do > about them anyway). > > Currently wb_check_background_flush() will trigger writeback to NFS even > when there are relatively few dirty pages (if there are lots of unstable > pages), this can result in small writes going to the server (10s of > Kilobytes rather than a Megabyte) which hurts throughput. > With this that, there are fewer writes which are each larger on average. > > Signed-off-by: NeilBrown <neilb@suse.de> The patch looks good to me. I agree with Christoph that it would be best to remove NR_UNSTABLE_NFS completely. We just have to be careful, not change format of any entries in /proc/ where it currently gets reported... Honza > --- > fs/fs-writeback.c | 1 - > fs/nfs/internal.h | 7 +++++-- > fs/nfs/write.c | 4 ++-- > include/linux/mmzone.h | 2 +- > mm/memcontrol.c | 1 - > mm/page-writeback.c | 7 ++----- > 6 files changed, 10 insertions(+), 12 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 76ac9c7d32ec..c5bdf46e3b4b 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > static unsigned long get_nr_dirty_pages(void) > { > return global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS) + > get_nr_dirty_inodes(); > } > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > index f80c47d5ff27..ba1ff5adeccd 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -660,8 +660,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) > if (!cinfo->dreq) { > struct inode *inode = page_file_mapping(page)->host; > > - inc_node_page_state(page, NR_UNSTABLE_NFS); > - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); > + /* This page is really still in write-back - just that the > + * writeback is happening on the server now. > + */ > + inc_node_page_state(page, NR_WRITEBACK); > + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); > __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > } > } > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index c478b772cc49..2e15a56620b3 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -958,9 +958,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, > static void > nfs_clear_page_commit(struct page *page) > { > - dec_node_page_state(page, NR_UNSTABLE_NFS); > + dec_node_page_state(page, NR_WRITEBACK); > dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, > - WB_RECLAIMABLE); > + WB_WRITEBACK); > } > > /* Called holding the request lock on @req */ > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 462f6873905a..227fcb8cd0e6 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -237,7 +237,7 @@ enum node_stat_item { > NR_FILE_THPS, > NR_FILE_PMDMAPPED, > NR_ANON_THPS, > - NR_UNSTABLE_NFS, /* NFS unstable pages */ > + NR_UNSTABLE_NFS, /* NFS unstable pages - DEPRECATED DO NOT USE */ > NR_VMSCAN_WRITE, > NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ > NR_DIRTIED, /* page dirtyings since bootup */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7ddf91c4295f..fad8e8a23235 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4317,7 +4317,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > > *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); > > - /* this should eventually include NR_UNSTABLE_NFS */ > *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); > *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + > memcg_exact_page_state(memcg, NR_ACTIVE_FILE); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 2afb09fa2fe0..d1f03c799d11 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) > unsigned long nr_pages = 0; > > nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); > - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); > nr_pages += node_page_state(pgdat, NR_WRITEBACK); > > return nr_pages <= limit; > @@ -1595,8 +1594,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > * written to the server's write cache, but has not yet > * been flushed to permanent storage. > */ > - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); > gdtc->avail = global_dirtyable_memory(); > gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); > > @@ -1940,8 +1938,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) > * as we're trying to decide whether to put more under writeback. > */ > gdtc->avail = global_dirtyable_memory(); > - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); > domain_dirty_limits(gdtc); > > if (gdtc->dirty > gdtc->bg_thresh) > -- > 2.26.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown @ 2020-04-02 16:35 ` Jan Kara 2020-04-03 15:15 ` Michal Hocko 2 siblings, 0 replies; 43+ messages in thread From: Jan Kara @ 2020-04-02 16:35 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Thu 02-04-20 10:53:20, NeilBrown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > loop block driver, where a daemon needs to write to one bdi in > order to free up writes queued to another bdi. > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > This patch changes the heuristic to ignore the global limits and > consider only the limit relevant to the bdi being written to. This > approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > should not introduce surprises. This has the desired result of > protecting the task from the consequences of large amounts of dirty data > queued for other devices. > > This approach of "only consider the target bdi" is consistent with the > other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes > attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - remove the 25% bonus that that flag gives, and > - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE > set. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Signed-off-by: NeilBrown <neilb@suse.de> This makes sense to me and the patch looks good. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Thanks. Honza > --- > drivers/block/loop.c | 2 +- > fs/nfsd/vfs.c | 9 +++++---- > include/linux/sched.h | 2 +- > kernel/sys.c | 2 +- > mm/page-writeback.c | 10 ++++++---- > mm/vmscan.c | 4 ++-- > 6 files changed, 16 insertions(+), 13 deletions(-) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 739b372a5112..2c59371ce936 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -897,7 +897,7 @@ static void loop_unprepare_queue(struct loop_device *lo) > > static int loop_kthread_worker_fn(void *worker_ptr) > { > - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; > + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; > return kthread_worker_fn(worker_ptr); > } > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 0aa02eb18bd3..c3fbab1753ec 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > /* > - * We want less throttling in balance_dirty_pages() > - * and shrink_inactive_list() so that nfs to > + * We want throttling in balance_dirty_pages() > + * and shrink_inactive_list() to only consider > + * the backingdev we are writing to, so that nfs to > * localhost doesn't cause nfsd to lock up due to all > * the client's dirty pages or its congested queue. > */ > - current->flags |= PF_LESS_THROTTLE; > + current->flags |= PF_LOCAL_THROTTLE; > > exp = fhp->fh_export; > use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); > @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > nfserr = nfserrno(host_err); > } > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > - current_restore_flags(pflags, PF_LESS_THROTTLE); > + current_restore_flags(pflags, PF_LOCAL_THROTTLE); > return nfserr; > } > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 04278493bf15..5dcd27abc8cd 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1473,7 +1473,7 @@ extern struct pid *cad_pid; > #define PF_KSWAPD 0x00020000 /* I am kswapd */ > #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ > #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ > -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ > diff --git a/kernel/sys.c b/kernel/sys.c > index d325f3ab624a..180a2fa33f7f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, > return -EINVAL; > } > > -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) > +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) > > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > unsigned long, arg4, unsigned long, arg5) > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 2caf780a42e7..2afb09fa2fe0 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) > * Calculate @dtc->thresh and ->bg_thresh considering > * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller > * must ensure that @dtc->avail is set before calling this function. The > - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and > - * real-time tasks. > + * dirty limits will be lifted by 1/4 for real-time tasks. > */ > static void domain_dirty_limits(struct dirty_throttle_control *dtc) > { > @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) > if (bg_thresh >= thresh) > bg_thresh = thresh / 2; > tsk = current; > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { > + if (rt_task(tsk)) { > bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; > thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; > } > @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) > else > dirty = vm_dirty_ratio * node_memory / 100; > > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) > + if (rt_task(tsk)) > dirty += dirty / 4; > > return dirty; > @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; > unsigned long start_time = jiffies; > > + if (current->flags & PF_LOCAL_THROTTLE) > + /* This task must only be throttled by its own writeback */ > + strictlimit = true; > for (;;) { > unsigned long now = jiffies; > unsigned long dirty, thresh, bg_thresh; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 876370565455..c5cf25938c56 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1880,13 +1880,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > > /* > * If a kernel thread (such as nfsd for loop-back mounts) services > - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. > + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. > * In that case we should only throttle if the backing device it is > * writing to is congested. In other cases it is safe to throttle. > */ > static int current_may_throttle(void) > { > - return !(current->flags & PF_LESS_THROTTLE) || > + return !(current->flags & PF_LOCAL_THROTTLE) || > current->backing_dev_info == NULL || > bdi_write_congested(current->backing_dev_info); > } > -- > 2.26.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown 2020-04-02 16:35 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE Jan Kara @ 2020-04-03 15:15 ` Michal Hocko 2020-04-03 21:40 ` NeilBrown 2 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2020-04-03 15:15 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Thu 02-04-20 10:53:20, Neil Brown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > loop block driver, where a daemon needs to write to one bdi in > order to free up writes queued to another bdi. > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > This patch changes the heuristic to ignore the global limits and > consider only the limit relevant to the bdi being written to. This > approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > should not introduce surprises. This has the desired result of > protecting the task from the consequences of large amounts of dirty data > queued for other devices. While I understand that you want to have per bdi throttling for those "special" files I am still missing how this is going to provide the additional room that the additnal 25% gave them previously. I might misremember or things have changed (what you mention as non-trivial heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a forward progress. Care to expan some more on how this is handled now? Maybe we do not need it anymore but calling that out explicitly would be really helpful. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-03 15:15 ` Michal Hocko @ 2020-04-03 21:40 ` NeilBrown 2020-04-06 7:44 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-03 21:40 UTC (permalink / raw) To: Michal Hocko Cc: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 2507 bytes --] On Fri, Apr 03 2020, Michal Hocko wrote: > On Thu 02-04-20 10:53:20, Neil Brown wrote: >> >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the >> loop block driver, where a daemon needs to write to one bdi in >> order to free up writes queued to another bdi. >> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty >> pages, so that it can still dirty pages after other processses have been >> throttled. >> >> This approach was designed when all threads were blocked equally, >> independently on which device they were writing to, or how fast it was. >> Since that time the writeback algorithm has changed substantially with >> different threads getting different allowances based on non-trivial >> heuristics. This means the simple "add 25%" heuristic is no longer >> reliable. >> >> This patch changes the heuristic to ignore the global limits and >> consider only the limit relevant to the bdi being written to. This >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and >> should not introduce surprises. This has the desired result of >> protecting the task from the consequences of large amounts of dirty data >> queued for other devices. > > While I understand that you want to have per bdi throttling for those > "special" files I am still missing how this is going to provide the > additional room that the additnal 25% gave them previously. I might > misremember or things have changed (what you mention as non-trivial > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a > forward progress. Care to expan some more on how this is handled now? > Maybe we do not need it anymore but calling that out explicitly would be > really helpful. The 25% was a means to an end, not an end in itself. The problem is that the NFS server needs to be able to write to the backing filesystem when the dirty memory limits have been reached by being totally consumed by dirty pages on the NFS filesystem. The 25% was just a way of giving an allowance of dirty pages to nfsd that could not be consumed by processes writing to an NFS filesystem. i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only really needs 1 page privately, but a few pages give better throughput and 25% seemed like a good idea at the time. per-bdi throttling focuses on the "PRIVATELY" (the important bit) and de-emphasises the 25% (the irrelevant detail). Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-03 21:40 ` NeilBrown @ 2020-04-06 7:44 ` Michal Hocko 2020-04-06 9:36 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2020-04-06 7:44 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Sat 04-04-20 08:40:17, Neil Brown wrote: > On Fri, Apr 03 2020, Michal Hocko wrote: > > > On Thu 02-04-20 10:53:20, Neil Brown wrote: > >> > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > >> loop block driver, where a daemon needs to write to one bdi in > >> order to free up writes queued to another bdi. > >> > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > >> pages, so that it can still dirty pages after other processses have been > >> throttled. > >> > >> This approach was designed when all threads were blocked equally, > >> independently on which device they were writing to, or how fast it was. > >> Since that time the writeback algorithm has changed substantially with > >> different threads getting different allowances based on non-trivial > >> heuristics. This means the simple "add 25%" heuristic is no longer > >> reliable. > >> > >> This patch changes the heuristic to ignore the global limits and > >> consider only the limit relevant to the bdi being written to. This > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > >> should not introduce surprises. This has the desired result of > >> protecting the task from the consequences of large amounts of dirty data > >> queued for other devices. > > > > While I understand that you want to have per bdi throttling for those > > "special" files I am still missing how this is going to provide the > > additional room that the additnal 25% gave them previously. I might > > misremember or things have changed (what you mention as non-trivial > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a > > forward progress. Care to expan some more on how this is handled now? > > Maybe we do not need it anymore but calling that out explicitly would be > > really helpful. > > The 25% was a means to an end, not an end in itself. > > The problem is that the NFS server needs to be able to write to the > backing filesystem when the dirty memory limits have been reached by > being totally consumed by dirty pages on the NFS filesystem. > > The 25% was just a way of giving an allowance of dirty pages to nfsd > that could not be consumed by processes writing to an NFS filesystem. > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only > really needs 1 page privately, but a few pages give better throughput > and 25% seemed like a good idea at the time. Yes this part is clear to me. > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and > de-emphasises the 25% (the irrelevant detail). It is still not clear to me how this patch is going to behave when the global dirty throttling is essentially equal to the per-bdi - e.g. there is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have anything private. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-06 7:44 ` Michal Hocko @ 2020-04-06 9:36 ` Jan Kara 2020-04-06 10:57 ` Michal Hocko 2020-04-06 11:58 ` NeilBrown 0 siblings, 2 replies; 43+ messages in thread From: Jan Kara @ 2020-04-06 9:36 UTC (permalink / raw) To: Michal Hocko Cc: NeilBrown, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML On Mon 06-04-20 09:44:53, Michal Hocko wrote: > On Sat 04-04-20 08:40:17, Neil Brown wrote: > > On Fri, Apr 03 2020, Michal Hocko wrote: > > > > > On Thu 02-04-20 10:53:20, Neil Brown wrote: > > >> > > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > > >> loop block driver, where a daemon needs to write to one bdi in > > >> order to free up writes queued to another bdi. > > >> > > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > > >> pages, so that it can still dirty pages after other processses have been > > >> throttled. > > >> > > >> This approach was designed when all threads were blocked equally, > > >> independently on which device they were writing to, or how fast it was. > > >> Since that time the writeback algorithm has changed substantially with > > >> different threads getting different allowances based on non-trivial > > >> heuristics. This means the simple "add 25%" heuristic is no longer > > >> reliable. > > >> > > >> This patch changes the heuristic to ignore the global limits and > > >> consider only the limit relevant to the bdi being written to. This > > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > > >> should not introduce surprises. This has the desired result of > > >> protecting the task from the consequences of large amounts of dirty data > > >> queued for other devices. > > > > > > While I understand that you want to have per bdi throttling for those > > > "special" files I am still missing how this is going to provide the > > > additional room that the additnal 25% gave them previously. I might > > > misremember or things have changed (what you mention as non-trivial > > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a > > > forward progress. Care to expan some more on how this is handled now? > > > Maybe we do not need it anymore but calling that out explicitly would be > > > really helpful. > > > > The 25% was a means to an end, not an end in itself. > > > > The problem is that the NFS server needs to be able to write to the > > backing filesystem when the dirty memory limits have been reached by > > being totally consumed by dirty pages on the NFS filesystem. > > > > The 25% was just a way of giving an allowance of dirty pages to nfsd > > that could not be consumed by processes writing to an NFS filesystem. > > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only > > really needs 1 page privately, but a few pages give better throughput > > and 25% seemed like a good idea at the time. > > Yes this part is clear to me. > > > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and > > de-emphasises the 25% (the irrelevant detail). > > It is still not clear to me how this patch is going to behave when the > global dirty throttling is essentially equal to the per-bdi - e.g. there > is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have > anything private. Let me think out loud so see whether I understand this properly. There are two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it simply NFS-bdi) and the bdi of the real filesystem that is backing NFS (let's call this real-bdi). The case we are concerned about is when NFS-bdi is full of dirty pages so that global dirty limit of the machine is exceeded. Then flusher thread will take dirty pages from NFS-bdi and send them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is set for nfsd, the fact that we are over global limit does not take effect and nfsd is still able to write to real-bdi until dirty limit on real-bdi is reached. So things should work as Neil writes AFAIU. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-06 9:36 ` Jan Kara @ 2020-04-06 10:57 ` Michal Hocko 2020-04-06 11:58 ` NeilBrown 1 sibling, 0 replies; 43+ messages in thread From: Michal Hocko @ 2020-04-06 10:57 UTC (permalink / raw) To: Jan Kara Cc: NeilBrown, Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, linux-mm, linux-nfs, LKML On Mon 06-04-20 11:36:01, Jan Kara wrote: > On Mon 06-04-20 09:44:53, Michal Hocko wrote: > > On Sat 04-04-20 08:40:17, Neil Brown wrote: > > > On Fri, Apr 03 2020, Michal Hocko wrote: > > > > > > > On Thu 02-04-20 10:53:20, Neil Brown wrote: > > > >> > > > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > > > >> loop block driver, where a daemon needs to write to one bdi in > > > >> order to free up writes queued to another bdi. > > > >> > > > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > > > >> pages, so that it can still dirty pages after other processses have been > > > >> throttled. > > > >> > > > >> This approach was designed when all threads were blocked equally, > > > >> independently on which device they were writing to, or how fast it was. > > > >> Since that time the writeback algorithm has changed substantially with > > > >> different threads getting different allowances based on non-trivial > > > >> heuristics. This means the simple "add 25%" heuristic is no longer > > > >> reliable. > > > >> > > > >> This patch changes the heuristic to ignore the global limits and > > > >> consider only the limit relevant to the bdi being written to. This > > > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > > > >> should not introduce surprises. This has the desired result of > > > >> protecting the task from the consequences of large amounts of dirty data > > > >> queued for other devices. > > > > > > > > While I understand that you want to have per bdi throttling for those > > > > "special" files I am still missing how this is going to provide the > > > > additional room that the additnal 25% gave them previously. I might > > > > misremember or things have changed (what you mention as non-trivial > > > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a > > > > forward progress. Care to expan some more on how this is handled now? > > > > Maybe we do not need it anymore but calling that out explicitly would be > > > > really helpful. > > > > > > The 25% was a means to an end, not an end in itself. > > > > > > The problem is that the NFS server needs to be able to write to the > > > backing filesystem when the dirty memory limits have been reached by > > > being totally consumed by dirty pages on the NFS filesystem. > > > > > > The 25% was just a way of giving an allowance of dirty pages to nfsd > > > that could not be consumed by processes writing to an NFS filesystem. > > > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only > > > really needs 1 page privately, but a few pages give better throughput > > > and 25% seemed like a good idea at the time. > > > > Yes this part is clear to me. > > > > > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and > > > de-emphasises the 25% (the irrelevant detail). > > > > It is still not clear to me how this patch is going to behave when the > > global dirty throttling is essentially equal to the per-bdi - e.g. there > > is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have > > anything private. > > Let me think out loud so see whether I understand this properly. There are > two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it > simply NFS-bdi) and the bdi of the real filesystem that is backing NFS > (let's call this real-bdi). The case we are concerned about is when NFS-bdi > is full of dirty pages so that global dirty limit of the machine is > exceeded. Then flusher thread will take dirty pages from NFS-bdi and send > them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take > these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is > set for nfsd, the fact that we are over global limit does not take effect > and nfsd is still able to write to real-bdi until dirty limit on real-bdi > is reached. So things should work as Neil writes AFAIU. Thanks for the clarification. I was not aware of the 2 bdi situation. This makes more sense now. Maybe this is a trivial fact for everybody who is more familiar with nfs internals but it would be so much esier to follow if it was explicit in the changelog. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-06 9:36 ` Jan Kara 2020-04-06 10:57 ` Michal Hocko @ 2020-04-06 11:58 ` NeilBrown 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2020-04-06 11:58 UTC (permalink / raw) To: Jan Kara, Michal Hocko Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 4617 bytes --] On Mon, Apr 06 2020, Jan Kara wrote: > On Mon 06-04-20 09:44:53, Michal Hocko wrote: >> On Sat 04-04-20 08:40:17, Neil Brown wrote: >> > On Fri, Apr 03 2020, Michal Hocko wrote: >> > >> > > On Thu 02-04-20 10:53:20, Neil Brown wrote: >> > >> >> > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the >> > >> loop block driver, where a daemon needs to write to one bdi in >> > >> order to free up writes queued to another bdi. >> > >> >> > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty >> > >> pages, so that it can still dirty pages after other processses have been >> > >> throttled. >> > >> >> > >> This approach was designed when all threads were blocked equally, >> > >> independently on which device they were writing to, or how fast it was. >> > >> Since that time the writeback algorithm has changed substantially with >> > >> different threads getting different allowances based on non-trivial >> > >> heuristics. This means the simple "add 25%" heuristic is no longer >> > >> reliable. >> > >> >> > >> This patch changes the heuristic to ignore the global limits and >> > >> consider only the limit relevant to the bdi being written to. This >> > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and >> > >> should not introduce surprises. This has the desired result of >> > >> protecting the task from the consequences of large amounts of dirty data >> > >> queued for other devices. >> > > >> > > While I understand that you want to have per bdi throttling for those >> > > "special" files I am still missing how this is going to provide the >> > > additional room that the additnal 25% gave them previously. I might >> > > misremember or things have changed (what you mention as non-trivial >> > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee a >> > > forward progress. Care to expan some more on how this is handled now? >> > > Maybe we do not need it anymore but calling that out explicitly would be >> > > really helpful. >> > >> > The 25% was a means to an end, not an end in itself. >> > >> > The problem is that the NFS server needs to be able to write to the >> > backing filesystem when the dirty memory limits have been reached by >> > being totally consumed by dirty pages on the NFS filesystem. >> > >> > The 25% was just a way of giving an allowance of dirty pages to nfsd >> > that could not be consumed by processes writing to an NFS filesystem. >> > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only >> > really needs 1 page privately, but a few pages give better throughput >> > and 25% seemed like a good idea at the time. >> >> Yes this part is clear to me. >> >> > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and >> > de-emphasises the 25% (the irrelevant detail). >> >> It is still not clear to me how this patch is going to behave when the >> global dirty throttling is essentially equal to the per-bdi - e.g. there >> is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have >> anything private. > > Let me think out loud so see whether I understand this properly. There are > two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it > simply NFS-bdi) and the bdi of the real filesystem that is backing NFS > (let's call this real-bdi). The case we are concerned about is when NFS-bdi > is full of dirty pages so that global dirty limit of the machine is > exceeded. Then flusher thread will take dirty pages from NFS-bdi and send > them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take > these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is > set for nfsd, the fact that we are over global limit does not take effect > and nfsd is still able to write to real-bdi until dirty limit on real-bdi > is reached. So things should work as Neil writes AFAIU. Exactly. The 'loop' block device follows a similar pattern - there is the 'loop' bdi that might consume all the allowed dirty pages, and the backing bdi that we need to write to so those dirty pages can be cleaned. The intention for PR_SET_IO_FLUSHER as described in 'man 2 prctl' is much the same. The thread that sets this is expected to be working on behalf of a "block layer or filesystem" such as "FUSE daemons, SCSI device emulation daemons" - each of these would be serving a bdi "above" by writing to a bdi "below". I'll add some more text to the changelog to make this clearer. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-01 23:52 ` Writeback fixes " NeilBrown 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-02 4:26 ` Hillf Danton 2020-04-02 4:57 ` NeilBrown 2020-04-06 3:58 ` Hillf Danton 2020-04-06 23:42 ` Writeback fixes for NFS - V2 NeilBrown 2 siblings, 2 replies; 43+ messages in thread From: Hillf Danton @ 2020-04-02 4:26 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Andrew Morton, linux-mm, linux-nfs, LKML On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > loop block driver, where a daemon needs to write to one bdi in > order to free up writes queued to another bdi. > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > This patch changes the heuristic to ignore the global limits and > consider only the limit relevant to the bdi being written to. This > approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > should not introduce surprises. This has the desired result of > protecting the task from the consequences of large amounts of dirty data > queued for other devices. > > This approach of "only consider the target bdi" is consistent with the > other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes > attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - remove the 25% bonus that that flag gives, and > - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE > set. /* * The strictlimit feature is a tool preventing mistrusted filesystems * from growing a large number of dirty pages before throttling. For Based on the comment snippet, I suspect it is applicable to IO flushers unless they are likely generating tons of dirty pages. If they are, however, cutting their bonuses seem questionable. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Signed-off-by: NeilBrown <neilb@suse.de> > =2D-- Hrm corrupted delivery? > drivers/block/loop.c | 2 +- > fs/nfsd/vfs.c | 9 +++++---- > include/linux/sched.h | 2 +- > kernel/sys.c | 2 +- > mm/page-writeback.c | 10 ++++++---- > mm/vmscan.c | 4 ++-- > 6 files changed, 16 insertions(+), 13 deletions(-) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-02 4:26 ` Hillf Danton @ 2020-04-02 4:57 ` NeilBrown 2020-04-06 3:58 ` Hillf Danton 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2020-04-02 4:57 UTC (permalink / raw) To: Hillf Danton Cc: Trond Myklebust, Anna Schumaker, Andrew Morton, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 4454 bytes --] On Thu, Apr 02 2020, Hillf Danton wrote: > On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote: >> >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the >> loop block driver, where a daemon needs to write to one bdi in >> order to free up writes queued to another bdi. >> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty >> pages, so that it can still dirty pages after other processses have been >> throttled. >> >> This approach was designed when all threads were blocked equally, >> independently on which device they were writing to, or how fast it was. >> Since that time the writeback algorithm has changed substantially with >> different threads getting different allowances based on non-trivial >> heuristics. This means the simple "add 25%" heuristic is no longer >> reliable. >> >> This patch changes the heuristic to ignore the global limits and >> consider only the limit relevant to the bdi being written to. This >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and >> should not introduce surprises. This has the desired result of >> protecting the task from the consequences of large amounts of dirty data >> queued for other devices. >> >> This approach of "only consider the target bdi" is consistent with the >> other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes >> attention to be focussed only on the target bdi. >> >> So this patch >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, >> - remove the 25% bonus that that flag gives, and >> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE >> set. > > /* > * The strictlimit feature is a tool preventing mistrusted filesystems > * from growing a large number of dirty pages before throttling. For > > Based on the comment snippet, I suspect it is applicable to IO flushers > unless they are likely generating tons of dirty pages. If they are, > however, cutting their bonuses seem questionable. The purpose of the strictlimit feature was to isolate one filesystem (bdi) from all others, so that the one cannot create dirty pages which unfairly disadvantage the others - this is what that comment says. But the implementation appears to focus on the isolation, not the specific purpose, and isolation works both ways. It protects the others from the one, and the one from the others. fuse needs to be isolated so it doesn't harm others. nfsd and loop need to be isolate so they aren't harmed by others. I'm less familiar with IO flushers but I suspect that have exactly the same need as nfsd and loop - they need to be isolated from dirty pages other than on the device they are writing to. The 25% bonus was never about giving them a bonus because they need it. It was about protecting them from excess usage elsewhere. I strongly suspect that my change will provide a conceptually better service for IO flushers. (whether it is better in a practical measurable sense I cannot say, but I'd be surprised if it was worse). One possible problem with strictlimit isolation is suggested by the comment * * In strictlimit case make decision based on the wb counters * and limits. Small writeouts when the wb limits are ramping * up are the price we consciously pay for strictlimit-ing. * This suggests that starting transients may be worse in some cases. I haven't noticed any problems in my (limited) testing so while I suspect there is a genuine difference here, I don't expect it to be problematic. > >> >> Note that previously realtime threads were treated the same as >> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for >> real-time threads, so it is now different from the behaviour of nfsd and >> loop tasks. I don't know what is wanted for realtime. >> >> Signed-off-by: NeilBrown <neilb@suse.de> >> =2D-- > > Hrm corrupted delivery? No, that's just normal email quotation for a multi-part message (needed for crypto-signing which I do by default) 'git am', for example, is quite capable of coping with it. Thanks, NeilBrown > >> drivers/block/loop.c | 2 +- >> fs/nfsd/vfs.c | 9 +++++---- >> include/linux/sched.h | 2 +- >> kernel/sys.c | 2 +- >> mm/page-writeback.c | 10 ++++++---- >> mm/vmscan.c | 4 ++-- >> 6 files changed, 16 insertions(+), 13 deletions(-) [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-02 4:26 ` Hillf Danton 2020-04-02 4:57 ` NeilBrown @ 2020-04-06 3:58 ` Hillf Danton 1 sibling, 0 replies; 43+ messages in thread From: Hillf Danton @ 2020-04-06 3:58 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Andrew Morton, linux-mm, linux-nfs, LKML On Thu, 02 Apr 2020 15:57:56 +1100 NeilBrown wrote: > > On Thu, Apr 02 2020, Hillf Danton wrote: > > > On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote: > >>=20 > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > >> loop block driver, where a daemon needs to write to one bdi in > >> order to free up writes queued to another bdi. > >>=20 > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > >> pages, so that it can still dirty pages after other processses have been > >> throttled. > >>=20 > >> This approach was designed when all threads were blocked equally, > >> independently on which device they were writing to, or how fast it was. > >> Since that time the writeback algorithm has changed substantially with > >> different threads getting different allowances based on non-trivial > >> heuristics. This means the simple "add 25%" heuristic is no longer > >> reliable. > >>=20 > >> This patch changes the heuristic to ignore the global limits and > >> consider only the limit relevant to the bdi being written to. This > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > >> should not introduce surprises. This has the desired result of > >> protecting the task from the consequences of large amounts of dirty data > >> queued for other devices. > >>=20 > >> This approach of "only consider the target bdi" is consistent with the > >> other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes > >> attention to be focussed only on the target bdi. > >>=20 > >> So this patch > >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > >> - remove the 25% bonus that that flag gives, and > >> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE > >> set. > > > > /* > > * The strictlimit feature is a tool preventing mistrusted filesystems > > * from growing a large number of dirty pages before throttling. For > > > > Based on the comment snippet, I suspect it is applicable to IO flushers > > unless they are likely generating tons of dirty pages. If they are, > > however, cutting their bonuses seem questionable. > > The purpose of the strictlimit feature was to isolate one filesystem > (bdi) from all others, so that the one cannot create dirty pages which > unfairly disadvantage the others - this is what that comment says. > But the implementation appears to focus on the isolation, not the > specific purpose, and isolation works both ways. It protects the others > from the one, and the one from the others. > > fuse needs to be isolated so it doesn't harm others. > nfsd and loop need to be isolate so they aren't harmed by others. For those working in emergency services, extra N95 face masks and Covid-19 testing kits, say 25%, would be preserved, too, if isolation doesn't help them. > I'm less familiar with IO flushers but I suspect that have exactly the > same need as nfsd and loop - they need to be isolated from dirty pages > other than on the device they are writing to. > The 25% bonus was never about giving them a bonus because they need it. > It was about protecting them from excess usage elsewhere. For example, > I strongly > suspect that my change will provide a conceptually better service for IO > flushers. (whether it is better in a practical measurable sense I cannot > say, but I'd be surprised if it was worse). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Writeback fixes for NFS - V2 2020-04-01 23:52 ` Writeback fixes " NeilBrown 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-02 4:26 ` Hillf Danton @ 2020-04-06 23:42 ` NeilBrown 2020-04-06 23:43 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-16 0:29 ` Writeback fixes for NFS - V3 NeilBrown 2 siblings, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-04-06 23:42 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, Michal Hocko Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 718 bytes --] This is a second version of my patches to improve writeback over NFS. This was prompted by a customer report of very slow throughput to a server running Ganesha NFS server which was taking about 200ms to handle COMMIT requests. While slow COMMIT requests do exacerbate the problem, there are still problems when COMMIT is comparatively fast. Latest test results from customer are that these changes provide substantial improvements. Changes over first version include improvements to the explanation in the commit message, and preservation of "NFS Unstable" counts in various statistics files, despite that fact that the internal accounting is completely gone (merged with writeback accounting). Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-06 23:42 ` Writeback fixes for NFS - V2 NeilBrown @ 2020-04-06 23:43 ` NeilBrown 2020-04-07 16:10 ` Chuck Lever 2020-04-16 0:29 ` Writeback fixes for NFS - V3 NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-06 23:43 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, Michal Hocko Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 8451 bytes --] PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the loop block driver, where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the deamon writing to the "final" bdi. This patch changes the heuristic to ignore the global limits and consider only the limit relevant to the bdi being written to. This approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and should not introduce surprises. This has the desired result of protecting the task from the consequences of large amounts of dirty data queued for other devices. This approach of "only consider the target bdi" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE set. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. Note that the worst-case situation with this patch is that the threshold might be calculated as zero. In that case the daemon may block when there are any dirty pages for the final bdi. These will eventually clear and the daemon will be able to proceed. The writing of those dirty pages will increase the apparent throughput of the final bdi and thus increase its threshold for future calculations. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/block/loop.c | 2 +- fs/nfsd/vfs.c | 9 +++++---- include/linux/sched.h | 3 ++- kernel/sys.c | 2 +- mm/page-writeback.c | 10 ++++++---- mm/vmscan.c | 4 ++-- 6 files changed, 17 insertions(+), 13 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index a42c49e04954..0e13b9fc8dfa 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -899,7 +899,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 4418f5cb8324..5955a089df32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, + * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 7326b54ab728..4c9875971de5 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; + if (current->flags & PF_LOCAL_THROTTLE) + /* This task must only be throttled by its own writeback */ + strictlimit = true; for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e8e690d2813..b776da4bb8c8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); } -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-06 23:43 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-07 16:10 ` Chuck Lever 0 siblings, 0 replies; 43+ messages in thread From: Chuck Lever @ 2020-04-07 16:10 UTC (permalink / raw) To: Neil Brown Cc: Trond Myklebust, Anna Schumaker, Andrew Morton, Jan Kara, Michal Hocko, linux-mm, Linux NFS Mailing List, LKML > On Apr 6, 2020, at 7:43 PM, NeilBrown <neilb@suse.de> wrote: > > > PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the > loop block driver, where a daemon needs to write to one bdi (the final > bdi) in order to free up writes queued to another bdi (the client bdi). > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > The important issue is not that the daemon needs a *larger* dirty page > allowance, but that it needs a *private* dirty page allowance, so that > dirty pages for the "client" bdi that it is helping to clear (the bdi for > an NFS filesystem or loop block device etc) do not affect the throttling > of the deamon writing to the "final" bdi. > > This patch changes the heuristic to ignore the global limits and > consider only the limit relevant to the bdi being written to. This > approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and > should not introduce surprises. This has the desired result of > protecting the task from the consequences of large amounts of dirty data > queued for other devices. > > This approach of "only consider the target bdi" is consistent with the > other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes > attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - removes the 25% bonus that that flag gives, and > - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE > set. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Note that the worst-case situation with this patch is that the threshold > might be calculated as zero. In that case the daemon may block when > there are any dirty pages for the final bdi. These will eventually > clear and the daemon will be able to proceed. The writing of those > dirty pages will increase the apparent throughput of the final bdi and > thus increase its threshold for future calculations. > > Reviewed-by: Jan Kara <jack@suse.cz> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > drivers/block/loop.c | 2 +- > fs/nfsd/vfs.c | 9 +++++---- > include/linux/sched.h | 3 ++- > kernel/sys.c | 2 +- > mm/page-writeback.c | 10 ++++++---- > mm/vmscan.c | 4 ++-- > 6 files changed, 17 insertions(+), 13 deletions(-) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index a42c49e04954..0e13b9fc8dfa 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -899,7 +899,7 @@ static void loop_unprepare_queue(struct loop_device *lo) > > static int loop_kthread_worker_fn(void *worker_ptr) > { > - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; > + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; > return kthread_worker_fn(worker_ptr); > } > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 0aa02eb18bd3..c3fbab1753ec 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, Assuming these patches are not going through the NFSD tree, so this hunk is Acked-by: Chuck Lever <chuck.lever@oracle.com> If this isn't necessary or appropriate, then ignore me :-) > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > /* > - * We want less throttling in balance_dirty_pages() > - * and shrink_inactive_list() so that nfs to > + * We want throttling in balance_dirty_pages() > + * and shrink_inactive_list() to only consider > + * the backingdev we are writing to, so that nfs to > * localhost doesn't cause nfsd to lock up due to all > * the client's dirty pages or its congested queue. > */ > - current->flags |= PF_LESS_THROTTLE; > + current->flags |= PF_LOCAL_THROTTLE; > > exp = fhp->fh_export; > use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); > @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > nfserr = nfserrno(host_err); > } > if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) > - current_restore_flags(pflags, PF_LESS_THROTTLE); > + current_restore_flags(pflags, PF_LOCAL_THROTTLE); > return nfserr; > } > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4418f5cb8324..5955a089df32 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; > #define PF_KSWAPD 0x00020000 /* I am kswapd */ > #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ > #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ > -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ > +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, > + * I am cleaning dirty pages from some other bdi. */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ > diff --git a/kernel/sys.c b/kernel/sys.c > index d325f3ab624a..180a2fa33f7f 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, > return -EINVAL; > } > > -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) > +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) > > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > unsigned long, arg4, unsigned long, arg5) > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 7326b54ab728..4c9875971de5 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) > * Calculate @dtc->thresh and ->bg_thresh considering > * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller > * must ensure that @dtc->avail is set before calling this function. The > - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and > - * real-time tasks. > + * dirty limits will be lifted by 1/4 for real-time tasks. > */ > static void domain_dirty_limits(struct dirty_throttle_control *dtc) > { > @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) > if (bg_thresh >= thresh) > bg_thresh = thresh / 2; > tsk = current; > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { > + if (rt_task(tsk)) { > bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; > thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; > } > @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) > else > dirty = vm_dirty_ratio * node_memory / 100; > > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) > + if (rt_task(tsk)) > dirty += dirty / 4; > > return dirty; > @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; > unsigned long start_time = jiffies; > > + if (current->flags & PF_LOCAL_THROTTLE) > + /* This task must only be throttled by its own writeback */ > + strictlimit = true; > for (;;) { > unsigned long now = jiffies; > unsigned long dirty, thresh, bg_thresh; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2e8e690d2813..b776da4bb8c8 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > > /* > * If a kernel thread (such as nfsd for loop-back mounts) services > - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. > + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. > * In that case we should only throttle if the backing device it is > * writing to is congested. In other cases it is safe to throttle. > */ > static int current_may_throttle(void) > { > - return !(current->flags & PF_LESS_THROTTLE) || > + return !(current->flags & PF_LOCAL_THROTTLE) || > current->backing_dev_info == NULL || > bdi_write_congested(current->backing_dev_info); > } > -- > 2.26.0 > -- Chuck Lever chucklever@gmail.com ^ permalink raw reply [flat|nested] 43+ messages in thread
* Writeback fixes for NFS - V3 2020-04-06 23:42 ` Writeback fixes for NFS - V2 NeilBrown 2020-04-06 23:43 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-16 0:29 ` NeilBrown 2020-04-16 0:30 ` [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-16 0:31 ` [PATCH 2/2 V3] " NeilBrown 1 sibling, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-04-16 0:29 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, Michal Hocko Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 1019 bytes --] This is version 3 (I think) of my patches to improve NFS writeaback. Changes: - the code for adding legacy values to /proc/vmstat was broken. I haven't taken the approach that Michal and Jan discussed but a simpler (I hope) approach that just seq_puts() the lines at an appropriate place. - I've modified the handling of PF_LOCAL_THROTTLE - sufficiently that I dropped Jan's reviewed-by. Rather than invoking the same behaviour as BDI_CAP_STRICTLIMIT, I now just use the part I needed. So if the global threshold is not exceeded, PF_LOCAL_THROTTLE tasks are not throttled. This is the case for normal processes, but not when BDI_CAP_STRICTLIMIT is in effect. If the global threshold *is* exceeded, only then to we check the local per-bdi threshold. If that is not exceeded then PF_LOCAL_THROTTLE again avoid any throttling. Only if both the thresholds are exceeded are these tasks throttled. I think this is more refletive if what we actually need. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-16 0:29 ` Writeback fixes for NFS - V3 NeilBrown @ 2020-04-16 0:30 ` NeilBrown 2020-04-16 6:54 ` Christoph Hellwig 2020-04-16 15:19 ` Jan Kara 2020-04-16 0:31 ` [PATCH 2/2 V3] " NeilBrown 1 sibling, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-04-16 0:30 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, Michal Hocko Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 8420 bytes --] PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the deamon writing to the "final" bdi. This patch changes the heuristic so that the task is only throttled if *both* the global threshhold *and* the per-wb threshold are exceeded. This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE task will be allowed to proceed unthrottled if the global threshold is not exceeded or if the local threshold is not exceeded. They need to both be exceeded before PF_LOCAL_THROTTLE tasks are throttled. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless both thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/block/loop.c | 2 +- fs/nfsd/vfs.c | 9 +++++---- include/linux/sched.h | 3 ++- kernel/sys.c | 2 +- mm/page-writeback.c | 18 ++++++++++++++---- mm/vmscan.c | 4 ++-- 6 files changed, 25 insertions(+), 13 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index da693e6a834e..d89c25ba3b89 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 4418f5cb8324..5955a089df32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, + * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 7326b54ab728..9692c553526b 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb, sdtc = mdtc; } + if (current->flags & PF_LOCAL_THROTTLE) + /* This task must only be throttled based on the bdi + * it is writing to - dirty pages for other bdis might + * be pages this task is trying to write out. So it + * gets a free pass unless both global and local + * thresholds are exceeded. i.e unless + * "dirty_exceeded". + */ + if (!dirty_exceeded) + break; + if (dirty_exceeded && !wb->dirty_exceeded) wb->dirty_exceeded = 1; diff --git a/mm/vmscan.c b/mm/vmscan.c index b06868fc4926..80ab523926aa 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); } -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-16 0:30 ` [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-16 6:54 ` Christoph Hellwig 2020-04-16 15:19 ` Jan Kara 1 sibling, 0 replies; 43+ messages in thread From: Christoph Hellwig @ 2020-04-16 6:54 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML > + if (current->flags & PF_LOCAL_THROTTLE) > + /* This task must only be throttled based on the bdi > + * it is writing to - dirty pages for other bdis might > + * be pages this task is trying to write out. So it > + * gets a free pass unless both global and local > + * thresholds are exceeded. i.e unless > + * "dirty_exceeded". > + */ This is not our normal multi-line comment style. The first line should be just a /* Otherwise this looks good. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-16 0:30 ` [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-16 6:54 ` Christoph Hellwig @ 2020-04-16 15:19 ` Jan Kara 2020-04-21 2:22 ` NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: Jan Kara @ 2020-04-16 15:19 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML On Thu 16-04-20 10:30:42, NeilBrown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the > loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a > daemon needs to write to one bdi (the final bdi) in order to free up > writes queued to another bdi (the client bdi). > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > The important issue is not that the daemon needs a *larger* dirty page > allowance, but that it needs a *private* dirty page allowance, so that > dirty pages for the "client" bdi that it is helping to clear (the bdi for > an NFS filesystem or loop block device etc) do not affect the throttling > of the deamon writing to the "final" bdi. > > This patch changes the heuristic so that the task is only throttled if > *both* the global threshhold *and* the per-wb threshold are exceeded. > This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the > global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE > task will be allowed to proceed unthrottled if the global threshold is > not exceeded or if the local threshold is not exceeded. They need to > both be exceeded before PF_LOCAL_THROTTLE tasks are throttled. > > This approach of "only throttle when target bdi is busy" is consistent > with the other use of PF_LESS_THROTTLE in current_may_throttle(), were > it causes attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - removes the 25% bonus that that flag gives, and > - If PF_LOCAL_THROTTLE is set, don't delay at all unless both > thresholds are exceeded. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Acked-by: Chuck Lever <chuck.lever@oracle.com> > Signed-off-by: NeilBrown <neilb@suse.de> ... > @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > sdtc = mdtc; > } > > + if (current->flags & PF_LOCAL_THROTTLE) > + /* This task must only be throttled based on the bdi > + * it is writing to - dirty pages for other bdis might > + * be pages this task is trying to write out. So it > + * gets a free pass unless both global and local > + * thresholds are exceeded. i.e unless > + * "dirty_exceeded". > + */ > + if (!dirty_exceeded) > + break; > + > if (dirty_exceeded && !wb->dirty_exceeded) > wb->dirty_exceeded = 1; Ok, but note that this will have one sideeffect you maybe didn't realize: Currently we try to throttle tasks softly - the heuristic rougly works like this: If dirty < (thresh + bg_thresh)/2, leave the task alone. (thresh+bg_thresh)/2 is called "freerun ceiling". If dirty is greater than this, we delay the task somewhat (the aim is to delay the task as long as it would take to write back the pages task has dirtied) in balance_dirty_pages() so ideally 'thresh' is never hit. Only if the heuristic consistently underestimates the time to writeback pages, we hit 'thresh' and then block the task as long as it takes flush worker to clean enough pages to get below 'thresh'. This all leads to task being usually gradually slowed down in balance_dirty_pages() which generally leads to smoother overall system behavior. What you did makes PF_LOCAL_THROTTLE tasks ignore any limits and then when local bdi limit is exceeded, they'll suddently hit the wall and be blocked for a long time in balance_dirty_pages(). So I like what you suggest in principle, just I think the implementation has undesirable sideeffects. I think it would be better to modify wb_position_ratio() to take PF_LOCAL_THROTTLE into account. It will be probably similar to how BDI_CAP_STRICTLIMIT is handled but different in some ways because BDI_CAP_STRICTLIMIT takes minimum from wb_pos_ratio and global pos_ratio, you rather want to take wb_pos_ratio only. Also there are some early bail out conditions when we are over global dirty limit which you need to handle differently for PF_LOCAL_THROTTLE. And then, when you have appropriate pos_ratio computed based on your policy, you can let the task wait for appropriate amount of time and things should just work (TM) ;). Thinking about it, you probably also want to add 'freerun' condition for PF_LOCAL_THROTTLE tasks like: if ((current->flags & PF_LOCAL_THROTTLE) && wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh)) go the freerun path... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-16 15:19 ` Jan Kara @ 2020-04-21 2:22 ` NeilBrown 2020-04-22 12:46 ` Jan Kara 0 siblings, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-04-21 2:22 UTC (permalink / raw) To: Jan Kara; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 7964 bytes --] On Thu, Apr 16 2020, Jan Kara wrote: > On Thu 16-04-20 10:30:42, NeilBrown wrote: >> >> PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the >> loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a >> daemon needs to write to one bdi (the final bdi) in order to free up >> writes queued to another bdi (the client bdi). >> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty >> pages, so that it can still dirty pages after other processses have been >> throttled. >> >> This approach was designed when all threads were blocked equally, >> independently on which device they were writing to, or how fast it was. >> Since that time the writeback algorithm has changed substantially with >> different threads getting different allowances based on non-trivial >> heuristics. This means the simple "add 25%" heuristic is no longer >> reliable. >> >> The important issue is not that the daemon needs a *larger* dirty page >> allowance, but that it needs a *private* dirty page allowance, so that >> dirty pages for the "client" bdi that it is helping to clear (the bdi for >> an NFS filesystem or loop block device etc) do not affect the throttling >> of the deamon writing to the "final" bdi. >> >> This patch changes the heuristic so that the task is only throttled if >> *both* the global threshhold *and* the per-wb threshold are exceeded. >> This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the >> global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE >> task will be allowed to proceed unthrottled if the global threshold is >> not exceeded or if the local threshold is not exceeded. They need to >> both be exceeded before PF_LOCAL_THROTTLE tasks are throttled. >> >> This approach of "only throttle when target bdi is busy" is consistent >> with the other use of PF_LESS_THROTTLE in current_may_throttle(), were >> it causes attention to be focussed only on the target bdi. >> >> So this patch >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, >> - removes the 25% bonus that that flag gives, and >> - If PF_LOCAL_THROTTLE is set, don't delay at all unless both >> thresholds are exceeded. >> >> Note that previously realtime threads were treated the same as >> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for >> real-time threads, so it is now different from the behaviour of nfsd and >> loop tasks. I don't know what is wanted for realtime. >> >> Acked-by: Chuck Lever <chuck.lever@oracle.com> >> Signed-off-by: NeilBrown <neilb@suse.de> > > ... > >> @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb, >> sdtc = mdtc; >> } >> >> + if (current->flags & PF_LOCAL_THROTTLE) >> + /* This task must only be throttled based on the bdi >> + * it is writing to - dirty pages for other bdis might >> + * be pages this task is trying to write out. So it >> + * gets a free pass unless both global and local >> + * thresholds are exceeded. i.e unless >> + * "dirty_exceeded". >> + */ >> + if (!dirty_exceeded) >> + break; >> + >> if (dirty_exceeded && !wb->dirty_exceeded) >> wb->dirty_exceeded = 1; > > Ok, but note that this will have one sideeffect you maybe didn't realize: > Currently we try to throttle tasks softly - the heuristic rougly works like > this: If dirty < (thresh + bg_thresh)/2, leave the task alone. > (thresh+bg_thresh)/2 is called "freerun ceiling". If dirty is greater than > this, we delay the task somewhat (the aim is to delay the task as long as > it would take to write back the pages task has dirtied) in > balance_dirty_pages() so ideally 'thresh' is never hit. Only if the > heuristic consistently underestimates the time to writeback pages, we hit > 'thresh' and then block the task as long as it takes flush worker to clean > enough pages to get below 'thresh'. This all leads to task being usually > gradually slowed down in balance_dirty_pages() which generally leads to > smoother overall system behavior. > > What you did makes PF_LOCAL_THROTTLE tasks ignore any limits and then when > local bdi limit is exceeded, they'll suddently hit the wall and be blocked > for a long time in balance_dirty_pages(). > > So I like what you suggest in principle, just I think the implementation > has undesirable sideeffects. I think it would be better to modify > wb_position_ratio() to take PF_LOCAL_THROTTLE into account. It will be > probably similar to how BDI_CAP_STRICTLIMIT is handled but different in > some ways because BDI_CAP_STRICTLIMIT takes minimum from wb_pos_ratio and > global pos_ratio, you rather want to take wb_pos_ratio only. Also there are > some early bail out conditions when we are over global dirty limit which > you need to handle differently for PF_LOCAL_THROTTLE. And then, when you > have appropriate pos_ratio computed based on your policy, you can let the > task wait for appropriate amount of time and things should just work (TM) ;). > Thinking about it, you probably also want to add 'freerun' condition for > PF_LOCAL_THROTTLE tasks like: > > if ((current->flags & PF_LOCAL_THROTTLE) && > wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh)) > go the freerun path... > Thanks..... I have 2 thoughts on this. One is that I'm not sure how much it really matters. The PF_LOCAL_THROTTLE task it always doing writeout on behalf of some other process. Some process writes to NFS or to a loop block device or somewhere, then the PF_LOCAL_THROTTLE task writes those dirty pages out to a different BDI. So the top level task will be throttled, an the PF_LOCAL_THROTTLE task won't get more than it can handle. There will be starting transients of course, but I doubt it would generally be a problem. However it would still be nice to find the "right" solution. My second thought is that I really don't understand the writeback code. I think I understand the general principle, and there are lots of big comments that try to explain things, but it just doesn't seem to help. I look at the code and see more questions than answers. What are the units for "dirty_ratelimit"?? I think it is pages per second, because it is initialized to INIT_BW which is documented as 100 MB/s. What is the difference between dirty_ratelimit and balanced_dirty_ratelimit? The later is "balanced" I guess. What does that mean? Apparently (from backing-dev-defs.h) dirty_ratelimit moves in smaller steps and is more smooth than balanced_dirty_ratelimit. How is being less smooth, more balanced?? What is pos_ratio? And what is RATELIMIT_CALC_SHIFT ??? Maybe pos_ratio is the ratio of the actual number of dirty pages to the desired number? And pos_ratio is calculated with fixed-point arithmetic and RATELIMIT_CALC_SHIFT tells where the point is? I think I understand freerun - half way between the dirty limit and the dirty_bg limit. Se below dirty_bg, no writeback happens. Between there and freerun, writeback happens, but nothing in throttled. From free up to the limit, tasks are progressively throttled. "setpoint" is the midpoint of this range. Is the goal that pos_ratio is computed for. (except that in the BDI_CAP_STRICTLIMIT part of wb_position_ratio) wb_setpoint is set to the bottom of this range, same as the freerun ceiling.) Then we have the control lines, which are cubic(?) for global counts and linear for per-wb - but truncated at 1/4. The comment says "so that wb_dirty can be smoothly throttled". It'll take me a while to work out what a hard edge results in smooth throttling. I suspect it makes sense but it doesn't jump out at me. So, you see, I don't feel at all confident changing any of this code because I just don't get it. So I'm inclined to stick with the patch that I have. :-( Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-21 2:22 ` NeilBrown @ 2020-04-22 12:46 ` Jan Kara 2020-05-13 7:16 ` NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2020-04-22 12:46 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML On Tue 21-04-20 12:22:59, NeilBrown wrote: > On Thu, Apr 16 2020, Jan Kara wrote: > > > On Thu 16-04-20 10:30:42, NeilBrown wrote: > >> > >> PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the > >> loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a > >> daemon needs to write to one bdi (the final bdi) in order to free up > >> writes queued to another bdi (the client bdi). > >> > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > >> pages, so that it can still dirty pages after other processses have been > >> throttled. > >> > >> This approach was designed when all threads were blocked equally, > >> independently on which device they were writing to, or how fast it was. > >> Since that time the writeback algorithm has changed substantially with > >> different threads getting different allowances based on non-trivial > >> heuristics. This means the simple "add 25%" heuristic is no longer > >> reliable. > >> > >> The important issue is not that the daemon needs a *larger* dirty page > >> allowance, but that it needs a *private* dirty page allowance, so that > >> dirty pages for the "client" bdi that it is helping to clear (the bdi for > >> an NFS filesystem or loop block device etc) do not affect the throttling > >> of the deamon writing to the "final" bdi. > >> > >> This patch changes the heuristic so that the task is only throttled if > >> *both* the global threshhold *and* the per-wb threshold are exceeded. > >> This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the > >> global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE > >> task will be allowed to proceed unthrottled if the global threshold is > >> not exceeded or if the local threshold is not exceeded. They need to > >> both be exceeded before PF_LOCAL_THROTTLE tasks are throttled. > >> > >> This approach of "only throttle when target bdi is busy" is consistent > >> with the other use of PF_LESS_THROTTLE in current_may_throttle(), were > >> it causes attention to be focussed only on the target bdi. > >> > >> So this patch > >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > >> - removes the 25% bonus that that flag gives, and > >> - If PF_LOCAL_THROTTLE is set, don't delay at all unless both > >> thresholds are exceeded. > >> > >> Note that previously realtime threads were treated the same as > >> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > >> real-time threads, so it is now different from the behaviour of nfsd and > >> loop tasks. I don't know what is wanted for realtime. > >> > >> Acked-by: Chuck Lever <chuck.lever@oracle.com> > >> Signed-off-by: NeilBrown <neilb@suse.de> > > > > ... > > > >> @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > >> sdtc = mdtc; > >> } > >> > >> + if (current->flags & PF_LOCAL_THROTTLE) > >> + /* This task must only be throttled based on the bdi > >> + * it is writing to - dirty pages for other bdis might > >> + * be pages this task is trying to write out. So it > >> + * gets a free pass unless both global and local > >> + * thresholds are exceeded. i.e unless > >> + * "dirty_exceeded". > >> + */ > >> + if (!dirty_exceeded) > >> + break; > >> + > >> if (dirty_exceeded && !wb->dirty_exceeded) > >> wb->dirty_exceeded = 1; > > > > Ok, but note that this will have one sideeffect you maybe didn't realize: > > Currently we try to throttle tasks softly - the heuristic rougly works like > > this: If dirty < (thresh + bg_thresh)/2, leave the task alone. > > (thresh+bg_thresh)/2 is called "freerun ceiling". If dirty is greater than > > this, we delay the task somewhat (the aim is to delay the task as long as > > it would take to write back the pages task has dirtied) in > > balance_dirty_pages() so ideally 'thresh' is never hit. Only if the > > heuristic consistently underestimates the time to writeback pages, we hit > > 'thresh' and then block the task as long as it takes flush worker to clean > > enough pages to get below 'thresh'. This all leads to task being usually > > gradually slowed down in balance_dirty_pages() which generally leads to > > smoother overall system behavior. > > > > What you did makes PF_LOCAL_THROTTLE tasks ignore any limits and then when > > local bdi limit is exceeded, they'll suddently hit the wall and be blocked > > for a long time in balance_dirty_pages(). > > > > So I like what you suggest in principle, just I think the implementation > > has undesirable sideeffects. I think it would be better to modify > > wb_position_ratio() to take PF_LOCAL_THROTTLE into account. It will be > > probably similar to how BDI_CAP_STRICTLIMIT is handled but different in > > some ways because BDI_CAP_STRICTLIMIT takes minimum from wb_pos_ratio and > > global pos_ratio, you rather want to take wb_pos_ratio only. Also there are > > some early bail out conditions when we are over global dirty limit which > > you need to handle differently for PF_LOCAL_THROTTLE. And then, when you > > have appropriate pos_ratio computed based on your policy, you can let the > > task wait for appropriate amount of time and things should just work (TM) ;). > > Thinking about it, you probably also want to add 'freerun' condition for > > PF_LOCAL_THROTTLE tasks like: > > > > if ((current->flags & PF_LOCAL_THROTTLE) && > > wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh)) > > go the freerun path... > > > > Thanks..... > I have 2 thoughts on this. > One is that I'm not sure how much it really matters. > The PF_LOCAL_THROTTLE task it always doing writeout on behalf of some > other process. Some process writes to NFS or to a loop block device or > somewhere, then the PF_LOCAL_THROTTLE task writes those dirty pages out > to a different BDI. So the top level task will be throttled, an the > PF_LOCAL_THROTTLE task won't get more than it can handle. > There will be starting transients of course, but I doubt it would > generally be a problem. However it would still be nice to find the > "right" solution. I'm not sure PF_LOCAL_THROTTLE "won't get more than it can handle". Once dirty pages on NFS BDI accumulate, flush worker will start to push them out as fast as it can. So the only thing that's limitting this is the dirty throttling on the receiving (NFS server - thus underlying BDI) side. When underlying BDI throttling triggers depends on that BDI dirty limits and those are proportional part of global dirty limits scaled by writeback throughput on underlying BDI compared to other BDIs. So depending on which BDIs are in the system and how active they are in dirtying pages 'underlying BDI' will get different dirty limits set. It's quite imaginable that in some configurations it will be easy to push NFS server to hit its dirty limit even with PF_LOCAL_THROTTLE. And then having NFS server undersponsive for couple seconds because it is blocked in balance_dirty_pages() just is not nice... > My second thought is that I really don't understand the writeback code. > I think I understand the general principle, and there are lots of big > comments that try to explain things, but it just doesn't seem to help. > I look at the code and see more questions than answers. I fully understand what you mean :). The logic is complex and while Fengguang wrote a lot of comments it is still rather hard to follow. > What are the units for "dirty_ratelimit"?? I think it is pages per > second, because it is initialized to INIT_BW which is documented as 100 > MB/s. Yes, that's what I think as well. > What is the difference between dirty_ratelimit and > balanced_dirty_ratelimit? > The later is "balanced" I guess. What does that mean? > Apparently (from backing-dev-defs.h) dirty_ratelimit moves in smaller > steps and is more smooth than balanced_dirty_ratelimit. How is being > less smooth, more balanced?? Yeah. So I cannot really explain the naming to you (not sure why Fengguang chose these names). But 'balanced_dirty_ratelimit' is pages/second value we want to limit task to based on the events in the most current time slice. 'dirty_ratelimit' is smoothed version of 'balanced_dirty_ratelimit' taking more of history into account. > What is pos_ratio? And what is RATELIMIT_CALC_SHIFT ??? > Maybe pos_ratio is the ratio of the actual number of dirty pages to the > desired number? And pos_ratio is calculated with fixed-point arithmetic > and RATELIMIT_CALC_SHIFT tells where the point is? So RATELIMIT_CALC_SHIFT is indeed the shift of fixed point arithmetic used in the computations. Pos_ratio is the multiplicative "correction" factor we apply to computed dirty_ratelimit (i.e., task_ratelimit = dirty_ratelimit * pos_ratio) - so if we see we are able to writeout say 100 MB/s but we are still relatively far from dirty limits, we let tasks dirty 200 MB/s (pos_ratio is 2). As we are nearing dirty limits, pos_ratio is dropping (it's appropriately scaled and shifted third order polynomial) so very close to dirty limits, we let tasks dirty only say 10 MB/s even though we are still able to write out 100 MB/s. > I think I understand freerun - half way between the dirty limit and the > dirty_bg limit. Se below dirty_bg, no writeback happens. Between there > and freerun, writeback happens, but nothing in throttled. From free up > to the limit, tasks are progressively throttled. Correct. > "setpoint" is the midpoint of this range. Is the goal that pos_ratio is > computed for. > (except that in the BDI_CAP_STRICTLIMIT part of wb_position_ratio) > wb_setpoint is set to the bottom of this range, same as the freerun ceiling.) Correct. > Then we have the control lines, which are cubic(?) for global counts and > linear for per-wb - but truncated at 1/4. The comment says "so that > wb_dirty can be smoothly throttled". It'll take me a while to work out > what a hard edge results in smooth throttling. I suspect it makes sense > but it doesn't jump out at me. > > So, you see, I don't feel at all confident changing any of this code > because I just don't get it. > > So I'm inclined to stick with the patch that I have. :-( OK, I'll try to write something and we'll see if it will work :) Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-04-22 12:46 ` Jan Kara @ 2020-05-13 7:16 ` NeilBrown 2020-05-13 7:17 ` [PATCH 1/2 V4] " NeilBrown 2020-05-13 7:18 ` [PATCH 2/2 V4] " NeilBrown 0 siblings, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-05-13 7:16 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1275 bytes --] I thought about this some more and come up with another "simple" approach that didn't require me understanding too much code, but does - I think - address your concerns. I've changed the heuristic to avoid any throttling on PF_LOCAL_THROTTLE task if: - the global dirty count is below the global free-run threshold. The code did this already. - (or) the per-wb dirty count is below the per-wb free-run threshold. This is the change. This means that: - in a steady stated, all bdis will be throttled based on their (steady state) throughput, which is equally appropriate for PF_LOCAL_THROTTLE tasks. - a PF_LOCAL_THROTTLE task will never be *completely* blocked by dirty pages queued for other devices. This means no deadlock, and that is the primary purpose of PF_LOCAL_THROTTLE. - when writes through the PF_LOCAL_THROTTLE task start up from idle - when there is no current throughput estimate - the PF_LOCAL_THROTTLE can be expected to get a fair share of the available memory, just as much as any other writer. This was the possible problem with treating PF_LOCAL_THROTTLE just like BDI_CAP_STRICTLIMIT. So I think this is a good solution. Thoughts? Patches follow - I've address the comment formatting issue. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-05-13 7:16 ` NeilBrown @ 2020-05-13 7:17 ` NeilBrown 2020-05-15 11:10 ` Jan Kara 2020-05-13 7:18 ` [PATCH 2/2 V4] " NeilBrown 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-05-13 7:17 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 9991 bytes --] PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the deamon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. Acked-by: Chuck Lever <chuck.lever@oracle.com> (for nfsd change) Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/block/loop.c | 2 +- fs/nfsd/vfs.c | 9 +++++---- include/linux/sched.h | 3 ++- kernel/sys.c | 2 +- mm/page-writeback.c | 41 +++++++++++++++++++++++++++++++++-------- mm/vmscan.c | 4 ++-- 6 files changed, 44 insertions(+), 17 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index da693e6a834e..d89c25ba3b89 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 4418f5cb8324..5955a089df32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, + * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 7326b54ab728..f02a71797781 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb, if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && (!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { - unsigned long intv = dirty_poll_interval(dirty, thresh); - unsigned long m_intv = ULONG_MAX; + unsigned long intv; + unsigned long m_intv; + + free_running: + intv = dirty_poll_interval(dirty, thresh); + m_intv = ULONG_MAX; current->dirty_paused_when = now; current->nr_dirtied = 0; @@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * Calculate global domain's pos_ratio and select the * global dtc by default. */ - if (!strictlimit) + if (!strictlimit) { wb_dirty_limits(gdtc); + if ((current->flags & PF_LOCAL_THROTTLE) && + gdtc->wb_dirty < + dirty_freerun_ceiling(gdtc->wb_thresh, + gdtc->wb_bg_thresh)) + /* + * LOCAL_THROTTLE tasks must not be throttled + * when below the per-wb freerun ceiling. + */ + goto free_running; + } + dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); @@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * both global and memcg domains. Choose the one * w/ lower pos_ratio. */ - if (!strictlimit) + if (!strictlimit) { wb_dirty_limits(mdtc); + if ((current->flags & PF_LOCAL_THROTTLE) && + mdtc->wb_dirty < + dirty_freerun_ceiling(mdtc->wb_thresh, + mdtc->wb_bg_thresh)) + /* + * LOCAL_THROTTLE tasks must not be + * throttled when below the per-wb + * freerun ceiling. + */ + goto free_running; + } dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && ((mdtc->dirty > mdtc->thresh) || strictlimit); diff --git a/mm/vmscan.c b/mm/vmscan.c index a37c87b5aee2..b2f5deb3603c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); } -- 2.26.2 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-05-13 7:17 ` [PATCH 1/2 V4] " NeilBrown @ 2020-05-15 11:10 ` Jan Kara 2020-06-01 0:46 ` Writeback fixes for NFS NeilBrown 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2020-05-15 11:10 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig On Wed 13-05-20 17:17:20, NeilBrown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the > loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a > daemon needs to write to one bdi (the final bdi) in order to free up > writes queued to another bdi (the client bdi). > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. The purpose of this is to avoid deadlock that happen when > the PF_LESS_THROTTLE process must write for any dirty pages to be freed, > but it is being thottled and cannot write. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > The important issue is not that the daemon needs a *larger* dirty page > allowance, but that it needs a *private* dirty page allowance, so that > dirty pages for the "client" bdi that it is helping to clear (the bdi for > an NFS filesystem or loop block device etc) do not affect the throttling > of the deamon writing to the "final" bdi. > > This patch changes the heuristic so that the task is not throttled when > the bdi it is writing to has a dirty page count below below (or equal > to) the free-run threshold for that bdi. This ensures it will always be > able to have some pages in flight, and so will not deadlock. > > In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might > still be throttled by global threshold, but that is acceptable as it is > only the deadlock state that is interesting for this flag. > > This approach of "only throttle when target bdi is busy" is consistent > with the other use of PF_LESS_THROTTLE in current_may_throttle(), were > it causes attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - removes the 25% bonus that that flag gives, and > - If PF_LOCAL_THROTTLE is set, don't delay at all unless the > global and the local free-run thresholds are exceeded. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Acked-by: Chuck Lever <chuck.lever@oracle.com> (for nfsd change) > Signed-off-by: NeilBrown <neilb@suse.de> The idea looks good to me. It will still have the "threads may hit hard wall" behavior when BDI freerun threshold is crossed at the moment we are very close to the global limit but that should be rare. So I think for now this good enough. The patch looks good to me so feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Throttling patches usually go through mm tree so ask Andrew to pick them up. Honza > @@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && > (!mdtc || > m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { > - unsigned long intv = dirty_poll_interval(dirty, thresh); > - unsigned long m_intv = ULONG_MAX; > + unsigned long intv; > + unsigned long m_intv; > + > + free_running: > + intv = dirty_poll_interval(dirty, thresh); > + m_intv = ULONG_MAX; > > current->dirty_paused_when = now; > current->nr_dirtied = 0; > @@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > * Calculate global domain's pos_ratio and select the > * global dtc by default. > */ > - if (!strictlimit) > + if (!strictlimit) { > wb_dirty_limits(gdtc); > > + if ((current->flags & PF_LOCAL_THROTTLE) && > + gdtc->wb_dirty < > + dirty_freerun_ceiling(gdtc->wb_thresh, > + gdtc->wb_bg_thresh)) > + /* > + * LOCAL_THROTTLE tasks must not be throttled > + * when below the per-wb freerun ceiling. > + */ > + goto free_running; > + } > + > dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && > ((gdtc->dirty > gdtc->thresh) || strictlimit); > > @@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > * both global and memcg domains. Choose the one > * w/ lower pos_ratio. > */ > - if (!strictlimit) > + if (!strictlimit) { > wb_dirty_limits(mdtc); > > + if ((current->flags & PF_LOCAL_THROTTLE) && > + mdtc->wb_dirty < > + dirty_freerun_ceiling(mdtc->wb_thresh, > + mdtc->wb_bg_thresh)) > + /* > + * LOCAL_THROTTLE tasks must not be > + * throttled when below the per-wb > + * freerun ceiling. > + */ > + goto free_running; > + } > dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && > ((mdtc->dirty > mdtc->thresh) || strictlimit); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a37c87b5aee2..b2f5deb3603c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > > /* > * If a kernel thread (such as nfsd for loop-back mounts) services > - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. > + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. > * In that case we should only throttle if the backing device it is > * writing to is congested. In other cases it is safe to throttle. > */ > static int current_may_throttle(void) > { > - return !(current->flags & PF_LESS_THROTTLE) || > + return !(current->flags & PF_LOCAL_THROTTLE) || > current->backing_dev_info == NULL || > bdi_write_congested(current->backing_dev_info); > } > -- > 2.26.2 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Writeback fixes for NFS 2020-05-15 11:10 ` Jan Kara @ 2020-06-01 0:46 ` NeilBrown 2020-06-01 0:48 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-06-01 0:49 ` [PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown 0 siblings, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-06-01 0:46 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 152 bytes --] Hi Andrew, could you please queue these two patches (following). I think they have sufficient review and no remaining complaints. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE 2020-06-01 0:46 ` Writeback fixes for NFS NeilBrown @ 2020-06-01 0:48 ` NeilBrown 2020-06-01 0:49 ` [PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2020-06-01 0:48 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 10029 bytes --] PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the deamon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> (for nfsd change) Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/block/loop.c | 2 +- fs/nfsd/vfs.c | 9 +++++---- include/linux/sched.h | 3 ++- kernel/sys.c | 2 +- mm/page-writeback.c | 41 +++++++++++++++++++++++++++++++++-------- mm/vmscan.c | 4 ++-- 6 files changed, 44 insertions(+), 17 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index da693e6a834e..d89c25ba3b89 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo) static int loop_kthread_worker_fn(void *worker_ptr) { - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; return kthread_worker_fn(worker_ptr); } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 0aa02eb18bd3..c3fbab1753ec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) /* - * We want less throttling in balance_dirty_pages() - * and shrink_inactive_list() so that nfs to + * We want throttling in balance_dirty_pages() + * and shrink_inactive_list() to only consider + * the backingdev we are writing to, so that nfs to * localhost doesn't cause nfsd to lock up due to all * the client's dirty pages or its congested queue. */ - current->flags |= PF_LESS_THROTTLE; + current->flags |= PF_LOCAL_THROTTLE; exp = fhp->fh_export; use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, nfserr = nfserrno(host_err); } if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) - current_restore_flags(pflags, PF_LESS_THROTTLE); + current_restore_flags(pflags, PF_LOCAL_THROTTLE); return nfserr; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 4418f5cb8324..5955a089df32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1481,7 +1481,8 @@ extern struct pid *cad_pid; #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ -#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ +#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to, + * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..180a2fa33f7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, return -EINVAL; } -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) +#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 7326b54ab728..f02a71797781 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void) * Calculate @dtc->thresh and ->bg_thresh considering * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller * must ensure that @dtc->avail is set before calling this function. The - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and - * real-time tasks. + * dirty limits will be lifted by 1/4 for real-time tasks. */ static void domain_dirty_limits(struct dirty_throttle_control *dtc) { @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) if (bg_thresh >= thresh) bg_thresh = thresh / 2; tsk = current; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { + if (rt_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat) else dirty = vm_dirty_ratio * node_memory / 100; - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) + if (rt_task(tsk)) dirty += dirty / 4; return dirty; @@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb, if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && (!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { - unsigned long intv = dirty_poll_interval(dirty, thresh); - unsigned long m_intv = ULONG_MAX; + unsigned long intv; + unsigned long m_intv; + + free_running: + intv = dirty_poll_interval(dirty, thresh); + m_intv = ULONG_MAX; current->dirty_paused_when = now; current->nr_dirtied = 0; @@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * Calculate global domain's pos_ratio and select the * global dtc by default. */ - if (!strictlimit) + if (!strictlimit) { wb_dirty_limits(gdtc); + if ((current->flags & PF_LOCAL_THROTTLE) && + gdtc->wb_dirty < + dirty_freerun_ceiling(gdtc->wb_thresh, + gdtc->wb_bg_thresh)) + /* + * LOCAL_THROTTLE tasks must not be throttled + * when below the per-wb freerun ceiling. + */ + goto free_running; + } + dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); @@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * both global and memcg domains. Choose the one * w/ lower pos_ratio. */ - if (!strictlimit) + if (!strictlimit) { wb_dirty_limits(mdtc); + if ((current->flags & PF_LOCAL_THROTTLE) && + mdtc->wb_dirty < + dirty_freerun_ceiling(mdtc->wb_thresh, + mdtc->wb_bg_thresh)) + /* + * LOCAL_THROTTLE tasks must not be + * throttled when below the per-wb + * freerun ceiling. + */ + goto free_running; + } dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && ((mdtc->dirty > mdtc->thresh) || strictlimit); diff --git a/mm/vmscan.c b/mm/vmscan.c index a37c87b5aee2..b2f5deb3603c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, /* * If a kernel thread (such as nfsd for loop-back mounts) services - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. * In that case we should only throttle if the backing device it is * writing to is congested. In other cases it is safe to throttle. */ static int current_may_throttle(void) { - return !(current->flags & PF_LESS_THROTTLE) || + return !(current->flags & PF_LOCAL_THROTTLE) || current->backing_dev_info == NULL || bdi_write_congested(current->backing_dev_info); } -- 2.26.2 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-06-01 0:46 ` Writeback fixes for NFS NeilBrown 2020-06-01 0:48 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-06-01 0:49 ` NeilBrown 1 sibling, 0 replies; 43+ messages in thread From: NeilBrown @ 2020-06-01 0:49 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 14001 bytes --] After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts Signed-off-by: NeilBrown <neilb@suse.de> --- Documentation/filesystems/proc.rst | 4 ++-- drivers/base/node.c | 2 +- fs/fs-writeback.c | 1 - fs/nfs/internal.h | 10 +++++++--- fs/nfs/write.c | 4 ++-- fs/proc/meminfo.c | 3 +-- include/linux/mmzone.h | 1 - include/trace/events/writeback.h | 5 +---- mm/memcontrol.c | 1 - mm/page-writeback.c | 17 ++++------------- mm/page_alloc.c | 5 +---- mm/vmstat.c | 11 +++++++++-- 12 files changed, 28 insertions(+), 36 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 38b606991065..092b7b44d158 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1042,8 +1042,8 @@ PageTables amount of memory dedicated to the lowest level of page tables. NFS_Unstable - NFS pages sent to the server, but not yet committed to stable - storage + Always zero. Previous counted pages which had been written to + the server, but has not been committed to stable storage. Bounce Memory used for block device "bounce buffers" WritebackTmp diff --git a/drivers/base/node.c b/drivers/base/node.c index 10d7e818e118..15f5ed6a8830 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(i.sharedram), nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), + nid, 0, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), nid, K(sreclaimable + diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 76ac9c7d32ec..c5bdf46e3b4b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS) + get_nr_dirty_inodes(); } diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 1f32a9fbfdaf..6673a77884d9 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) } /* - * Record the page as unstable and mark its inode as dirty. + * Record the page as unstable (an extra writeback period) and mark its + * inode as dirty. */ static inline void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) @@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) if (!cinfo->dreq) { struct inode *inode = page_file_mapping(page)->host; - inc_node_page_state(page, NR_UNSTABLE_NFS); - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); + /* This page is really still in write-back - just that the + * writeback is happening on the server now. + */ + inc_node_page_state(page, NR_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 1e767f779c49..639c34fec04a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, static void nfs_clear_page_commit(struct page *page) { - dec_node_page_state(page, NR_UNSTABLE_NFS); + dec_node_page_state(page, NR_WRITEBACK); dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, - WB_RECLAIMABLE); + WB_WRITEBACK); } /* Called holding the request lock on @req */ diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 8c1f1bb1a5ce..9bd94b5a9658 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "PageTables: ", global_zone_page_state(NR_PAGETABLE)); - show_val_kb(m, "NFS_Unstable: ", - global_node_page_state(NR_UNSTABLE_NFS)); + show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", global_zone_page_state(NR_BOUNCE)); show_val_kb(m, "WritebackTmp: ", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1b9de7d220fb..a89f47515eb1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -193,7 +193,6 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, - NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 85a33bea76f1..10f5d1fa7347 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -541,7 +541,6 @@ TRACE_EVENT(global_dirty_state, TP_STRUCT__entry( __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) - __field(unsigned long, nr_unstable) __field(unsigned long, background_thresh) __field(unsigned long, dirty_thresh) __field(unsigned long, dirty_limit) @@ -552,7 +551,6 @@ TRACE_EVENT(global_dirty_state, TP_fast_assign( __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); __entry->nr_written = global_node_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; @@ -560,12 +558,11 @@ TRACE_EVENT(global_dirty_state, __entry->dirty_limit = global_wb_domain.dirty_limit; ), - TP_printk("dirty=%lu writeback=%lu unstable=%lu " + TP_printk("dirty=%lu writeback=%lu " "bg_thresh=%lu thresh=%lu limit=%lu " "dirtied=%lu written=%lu", __entry->nr_dirty, __entry->nr_writeback, - __entry->nr_unstable, __entry->background_thresh, __entry->dirty_thresh, __entry->dirty_limit, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a3b97f103966..1db4b285c407 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); - /* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index f02a71797781..d11b097c8002 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) unsigned long nr_pages = 0; nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); return nr_pages <= limit; @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. * * Return: @wb's dirty limit in pages. The term "dirty" in the context of - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. + * dirty balancing includes all PG_dirty and PG_writeback pages. */ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? &mdtc_stor : NULL; struct dirty_throttle_control *sdtc; - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ + unsigned long nr_reclaimable; /* = file_dirty */ long period; long pause; long max_pause; @@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, unsigned long m_thresh = 0; unsigned long m_bg_thresh = 0; - /* - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. - */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1963,8 +1955,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) * as we're trying to decide whether to put more under writeback. */ gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..cc406ee17ad9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5319,7 +5319,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" " active_file:%lu inactive_file:%lu isolated_file:%lu\n" - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" + " unevictable:%lu dirty:%lu writeback:%lu\n" " slab_reclaimable:%lu slab_unreclaimable:%lu\n" " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" " free:%lu free_pcp:%lu free_cma:%lu\n", @@ -5332,7 +5332,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_UNEVICTABLE), global_node_page_state(NR_FILE_DIRTY), global_node_page_state(NR_WRITEBACK), - global_node_page_state(NR_UNSTABLE_NFS), global_node_page_state(NR_SLAB_RECLAIMABLE), global_node_page_state(NR_SLAB_UNRECLAIMABLE), global_node_page_state(NR_FILE_MAPPED), @@ -5365,7 +5364,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " anon_thp: %lukB" #endif " writeback_tmp:%lukB" - " unstable:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5387,7 +5385,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 96d21a792b57..6c719f184843 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) TEXT_FOR_HIGHMEM(xx) xx "_movable", const char * const vmstat_text[] = { - /* enum zone_stat_item countes */ + /* enum zone_stat_item counters */ "nr_free_pages", "nr_zone_inactive_anon", "nr_zone_active_anon", @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", - "nr_unstable", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied", @@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg) seq_puts(m, vmstat_text[off]); seq_put_decimal_ull(m, " ", *l); seq_putc(m, '\n'); + + if (off == NR_VMSTAT_ITEMS - 1) { + /* We've come to the end - add any deprecated counters + * to avoid breaking userspace which might depend on + * them being present. + */ + seq_puts(m, "nr_unstable 0\n"); + } return 0; } -- 2.26.2 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/2 V4] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-05-13 7:16 ` NeilBrown 2020-05-13 7:17 ` [PATCH 1/2 V4] " NeilBrown @ 2020-05-13 7:18 ` NeilBrown 2020-05-15 9:59 ` Jan Kara 1 sibling, 1 reply; 43+ messages in thread From: NeilBrown @ 2020-05-13 7:18 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 13963 bytes --] After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts Signed-off-by: NeilBrown <neilb@suse.de> --- Documentation/filesystems/proc.rst | 4 ++-- drivers/base/node.c | 2 +- fs/fs-writeback.c | 1 - fs/nfs/internal.h | 10 +++++++--- fs/nfs/write.c | 4 ++-- fs/proc/meminfo.c | 3 +-- include/linux/mmzone.h | 1 - include/trace/events/writeback.h | 5 +---- mm/memcontrol.c | 1 - mm/page-writeback.c | 17 ++++------------- mm/page_alloc.c | 5 +---- mm/vmstat.c | 11 +++++++++-- 12 files changed, 28 insertions(+), 36 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 38b606991065..092b7b44d158 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1042,8 +1042,8 @@ PageTables amount of memory dedicated to the lowest level of page tables. NFS_Unstable - NFS pages sent to the server, but not yet committed to stable - storage + Always zero. Previous counted pages which had been written to + the server, but has not been committed to stable storage. Bounce Memory used for block device "bounce buffers" WritebackTmp diff --git a/drivers/base/node.c b/drivers/base/node.c index 10d7e818e118..15f5ed6a8830 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(i.sharedram), nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), + nid, 0, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), nid, K(sreclaimable + diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 76ac9c7d32ec..c5bdf46e3b4b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS) + get_nr_dirty_inodes(); } diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 1f32a9fbfdaf..6673a77884d9 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) } /* - * Record the page as unstable and mark its inode as dirty. + * Record the page as unstable (an extra writeback period) and mark its + * inode as dirty. */ static inline void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) @@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) if (!cinfo->dreq) { struct inode *inode = page_file_mapping(page)->host; - inc_node_page_state(page, NR_UNSTABLE_NFS); - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); + /* This page is really still in write-back - just that the + * writeback is happening on the server now. + */ + inc_node_page_state(page, NR_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index df4b87c30ac9..d9ea824accb7 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, static void nfs_clear_page_commit(struct page *page) { - dec_node_page_state(page, NR_UNSTABLE_NFS); + dec_node_page_state(page, NR_WRITEBACK); dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, - WB_RECLAIMABLE); + WB_WRITEBACK); } /* Called holding the request lock on @req */ diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 8c1f1bb1a5ce..9bd94b5a9658 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "PageTables: ", global_zone_page_state(NR_PAGETABLE)); - show_val_kb(m, "NFS_Unstable: ", - global_node_page_state(NR_UNSTABLE_NFS)); + show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", global_zone_page_state(NR_BOUNCE)); show_val_kb(m, "WritebackTmp: ", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1b9de7d220fb..a89f47515eb1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -193,7 +193,6 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, - NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 85a33bea76f1..10f5d1fa7347 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -541,7 +541,6 @@ TRACE_EVENT(global_dirty_state, TP_STRUCT__entry( __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) - __field(unsigned long, nr_unstable) __field(unsigned long, background_thresh) __field(unsigned long, dirty_thresh) __field(unsigned long, dirty_limit) @@ -552,7 +551,6 @@ TRACE_EVENT(global_dirty_state, TP_fast_assign( __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); __entry->nr_written = global_node_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; @@ -560,12 +558,11 @@ TRACE_EVENT(global_dirty_state, __entry->dirty_limit = global_wb_domain.dirty_limit; ), - TP_printk("dirty=%lu writeback=%lu unstable=%lu " + TP_printk("dirty=%lu writeback=%lu " "bg_thresh=%lu thresh=%lu limit=%lu " "dirtied=%lu written=%lu", __entry->nr_dirty, __entry->nr_writeback, - __entry->nr_unstable, __entry->background_thresh, __entry->dirty_thresh, __entry->dirty_limit, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a3b97f103966..1db4b285c407 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); - /* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index f02a71797781..d11b097c8002 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) unsigned long nr_pages = 0; nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); return nr_pages <= limit; @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. * * Return: @wb's dirty limit in pages. The term "dirty" in the context of - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. + * dirty balancing includes all PG_dirty and PG_writeback pages. */ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? &mdtc_stor : NULL; struct dirty_throttle_control *sdtc; - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ + unsigned long nr_reclaimable; /* = file_dirty */ long period; long pause; long max_pause; @@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, unsigned long m_thresh = 0; unsigned long m_bg_thresh = 0; - /* - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. - */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1963,8 +1955,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) * as we're trying to decide whether to put more under writeback. */ gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..cc406ee17ad9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5319,7 +5319,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" " active_file:%lu inactive_file:%lu isolated_file:%lu\n" - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" + " unevictable:%lu dirty:%lu writeback:%lu\n" " slab_reclaimable:%lu slab_unreclaimable:%lu\n" " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" " free:%lu free_pcp:%lu free_cma:%lu\n", @@ -5332,7 +5332,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_UNEVICTABLE), global_node_page_state(NR_FILE_DIRTY), global_node_page_state(NR_WRITEBACK), - global_node_page_state(NR_UNSTABLE_NFS), global_node_page_state(NR_SLAB_RECLAIMABLE), global_node_page_state(NR_SLAB_UNRECLAIMABLE), global_node_page_state(NR_FILE_MAPPED), @@ -5365,7 +5364,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " anon_thp: %lukB" #endif " writeback_tmp:%lukB" - " unstable:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5387,7 +5385,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 96d21a792b57..6c719f184843 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) TEXT_FOR_HIGHMEM(xx) xx "_movable", const char * const vmstat_text[] = { - /* enum zone_stat_item countes */ + /* enum zone_stat_item counters */ "nr_free_pages", "nr_zone_inactive_anon", "nr_zone_active_anon", @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", - "nr_unstable", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied", @@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg) seq_puts(m, vmstat_text[off]); seq_put_decimal_ull(m, " ", *l); seq_putc(m, '\n'); + + if (off == NR_VMSTAT_ITEMS - 1) { + /* We've come to the end - add any deprecated counters + * to avoid breaking userspace which might depend on + * them being present. + */ + seq_puts(m, "nr_unstable 0\n"); + } return 0; } -- 2.26.2 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 V4] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-05-13 7:18 ` [PATCH 2/2 V4] " NeilBrown @ 2020-05-15 9:59 ` Jan Kara 0 siblings, 0 replies; 43+ messages in thread From: Jan Kara @ 2020-05-15 9:59 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Andrew Morton, Michal Hocko, linux-mm, linux-nfs, LKML, Christoph Hellwig On Wed 13-05-20 17:18:29, NeilBrown wrote: > > After an NFS page has been written it is considered "unstable" until a > COMMIT request succeeds. If the COMMIT fails, the page will be > re-written. > > These "unstable" pages are currently accounted as "reclaimable", either > in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a > 'reclaimable' count. This might have made sense when sending the COMMIT > required a separate action by the VFS/MM (e.g. releasepage() used to > send a COMMIT). However now that all writes generated by ->writepages() > will automatically be followed by a COMMIT (since commit 919e3bd9a875 > ("NFS: Ensure we commit after writeback is complete")) it makes more > sense to treat them as writeback pages. > > So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in > NR_WRITEBACK and WB_WRITEBACK. > > A particular effect of this change is that when > wb_check_background_flush() calls wb_over_bg_threshold(), the latter > will report 'true' a lot less often as the 'unstable' pages are no > longer considered 'dirty' (as there is nothing that writeback can do > about them anyway). > > Currently wb_check_background_flush() will trigger writeback to NFS even > when there are relatively few dirty pages (if there are lots of unstable > pages), this can result in small writes going to the server (10s of > Kilobytes rather than a Megabyte) which hurts throughput. > With this patch, there are fewer writes which are each larger on average. > > Where the NR_UNSTABLE_NFS count was included in statistics > virtual-files, the entry is retained, but the value is hard-coded as > zero. static trace points and warning printks which mentioned this > counter no longer report it. > > Reviewed-by: Christoph Hellwig <hch@lst.de> > Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> > Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts > Signed-off-by: NeilBrown <neilb@suse.de> The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > Documentation/filesystems/proc.rst | 4 ++-- > drivers/base/node.c | 2 +- > fs/fs-writeback.c | 1 - > fs/nfs/internal.h | 10 +++++++--- > fs/nfs/write.c | 4 ++-- > fs/proc/meminfo.c | 3 +-- > include/linux/mmzone.h | 1 - > include/trace/events/writeback.h | 5 +---- > mm/memcontrol.c | 1 - > mm/page-writeback.c | 17 ++++------------- > mm/page_alloc.c | 5 +---- > mm/vmstat.c | 11 +++++++++-- > 12 files changed, 28 insertions(+), 36 deletions(-) > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > index 38b606991065..092b7b44d158 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -1042,8 +1042,8 @@ PageTables > amount of memory dedicated to the lowest level of page > tables. > NFS_Unstable > - NFS pages sent to the server, but not yet committed to stable > - storage > + Always zero. Previous counted pages which had been written to > + the server, but has not been committed to stable storage. > Bounce > Memory used for block device "bounce buffers" > WritebackTmp > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 10d7e818e118..15f5ed6a8830 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev, > nid, K(i.sharedram), > nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), > nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), > - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), > + nid, 0, > nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), > nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), > nid, K(sreclaimable + > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 76ac9c7d32ec..c5bdf46e3b4b 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > static unsigned long get_nr_dirty_pages(void) > { > return global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS) + > get_nr_dirty_inodes(); > } > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > index 1f32a9fbfdaf..6673a77884d9 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) > } > > /* > - * Record the page as unstable and mark its inode as dirty. > + * Record the page as unstable (an extra writeback period) and mark its > + * inode as dirty. > */ > static inline > void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) > @@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) > if (!cinfo->dreq) { > struct inode *inode = page_file_mapping(page)->host; > > - inc_node_page_state(page, NR_UNSTABLE_NFS); > - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); > + /* This page is really still in write-back - just that the > + * writeback is happening on the server now. > + */ > + inc_node_page_state(page, NR_WRITEBACK); > + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); > __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > } > } > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index df4b87c30ac9..d9ea824accb7 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, > static void > nfs_clear_page_commit(struct page *page) > { > - dec_node_page_state(page, NR_UNSTABLE_NFS); > + dec_node_page_state(page, NR_WRITEBACK); > dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, > - WB_RECLAIMABLE); > + WB_WRITEBACK); > } > > /* Called holding the request lock on @req */ > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c > index 8c1f1bb1a5ce..9bd94b5a9658 100644 > --- a/fs/proc/meminfo.c > +++ b/fs/proc/meminfo.c > @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > show_val_kb(m, "PageTables: ", > global_zone_page_state(NR_PAGETABLE)); > > - show_val_kb(m, "NFS_Unstable: ", > - global_node_page_state(NR_UNSTABLE_NFS)); > + show_val_kb(m, "NFS_Unstable: ", 0); > show_val_kb(m, "Bounce: ", > global_zone_page_state(NR_BOUNCE)); > show_val_kb(m, "WritebackTmp: ", > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 1b9de7d220fb..a89f47515eb1 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -193,7 +193,6 @@ enum node_stat_item { > NR_FILE_THPS, > NR_FILE_PMDMAPPED, > NR_ANON_THPS, > - NR_UNSTABLE_NFS, /* NFS unstable pages */ > NR_VMSCAN_WRITE, > NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ > NR_DIRTIED, /* page dirtyings since bootup */ > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h > index 85a33bea76f1..10f5d1fa7347 100644 > --- a/include/trace/events/writeback.h > +++ b/include/trace/events/writeback.h > @@ -541,7 +541,6 @@ TRACE_EVENT(global_dirty_state, > TP_STRUCT__entry( > __field(unsigned long, nr_dirty) > __field(unsigned long, nr_writeback) > - __field(unsigned long, nr_unstable) > __field(unsigned long, background_thresh) > __field(unsigned long, dirty_thresh) > __field(unsigned long, dirty_limit) > @@ -552,7 +551,6 @@ TRACE_EVENT(global_dirty_state, > TP_fast_assign( > __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); > __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); > - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); > __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); > __entry->nr_written = global_node_page_state(NR_WRITTEN); > __entry->background_thresh = background_thresh; > @@ -560,12 +558,11 @@ TRACE_EVENT(global_dirty_state, > __entry->dirty_limit = global_wb_domain.dirty_limit; > ), > > - TP_printk("dirty=%lu writeback=%lu unstable=%lu " > + TP_printk("dirty=%lu writeback=%lu " > "bg_thresh=%lu thresh=%lu limit=%lu " > "dirtied=%lu written=%lu", > __entry->nr_dirty, > __entry->nr_writeback, > - __entry->nr_unstable, > __entry->background_thresh, > __entry->dirty_thresh, > __entry->dirty_limit, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index a3b97f103966..1db4b285c407 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > > *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); > > - /* this should eventually include NR_UNSTABLE_NFS */ > *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); > *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + > memcg_exact_page_state(memcg, NR_ACTIVE_FILE); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index f02a71797781..d11b097c8002 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) > unsigned long nr_pages = 0; > > nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); > - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); > nr_pages += node_page_state(pgdat, NR_WRITEBACK); > > return nr_pages <= limit; > @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, > * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. > * > * Return: @wb's dirty limit in pages. The term "dirty" in the context of > - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. > + * dirty balancing includes all PG_dirty and PG_writeback pages. > */ > static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) > { > @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? > &mdtc_stor : NULL; > struct dirty_throttle_control *sdtc; > - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ > + unsigned long nr_reclaimable; /* = file_dirty */ > long period; > long pause; > long max_pause; > @@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > unsigned long m_thresh = 0; > unsigned long m_bg_thresh = 0; > > - /* > - * Unstable writes are a feature of certain networked > - * filesystems (i.e. NFS) in which data may have been > - * written to the server's write cache, but has not yet > - * been flushed to permanent storage. > - */ > - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); > gdtc->avail = global_dirtyable_memory(); > gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); > > @@ -1963,8 +1955,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) > * as we're trying to decide whether to put more under writeback. > */ > gdtc->avail = global_dirtyable_memory(); > - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); > domain_dirty_limits(gdtc); > > if (gdtc->dirty > gdtc->bg_thresh) > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 13cc653122b7..cc406ee17ad9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5319,7 +5319,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > > printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" > " active_file:%lu inactive_file:%lu isolated_file:%lu\n" > - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" > + " unevictable:%lu dirty:%lu writeback:%lu\n" > " slab_reclaimable:%lu slab_unreclaimable:%lu\n" > " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" > " free:%lu free_pcp:%lu free_cma:%lu\n", > @@ -5332,7 +5332,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > global_node_page_state(NR_UNEVICTABLE), > global_node_page_state(NR_FILE_DIRTY), > global_node_page_state(NR_WRITEBACK), > - global_node_page_state(NR_UNSTABLE_NFS), > global_node_page_state(NR_SLAB_RECLAIMABLE), > global_node_page_state(NR_SLAB_UNRECLAIMABLE), > global_node_page_state(NR_FILE_MAPPED), > @@ -5365,7 +5364,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > " anon_thp: %lukB" > #endif > " writeback_tmp:%lukB" > - " unstable:%lukB" > " all_unreclaimable? %s" > "\n", > pgdat->node_id, > @@ -5387,7 +5385,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), > #endif > K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), > - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), > pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? > "yes" : "no"); > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 96d21a792b57..6c719f184843 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) > TEXT_FOR_HIGHMEM(xx) xx "_movable", > > const char * const vmstat_text[] = { > - /* enum zone_stat_item countes */ > + /* enum zone_stat_item counters */ > "nr_free_pages", > "nr_zone_inactive_anon", > "nr_zone_active_anon", > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { > "nr_file_hugepages", > "nr_file_pmdmapped", > "nr_anon_transparent_hugepages", > - "nr_unstable", > "nr_vmscan_write", > "nr_vmscan_immediate_reclaim", > "nr_dirtied", > @@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg) > seq_puts(m, vmstat_text[off]); > seq_put_decimal_ull(m, " ", *l); > seq_putc(m, '\n'); > + > + if (off == NR_VMSTAT_ITEMS - 1) { > + /* We've come to the end - add any deprecated counters > + * to avoid breaking userspace which might depend on > + * them being present. > + */ > + seq_puts(m, "nr_unstable 0\n"); > + } > return 0; > } > > -- > 2.26.2 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 2/2 V3] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-16 0:29 ` Writeback fixes for NFS - V3 NeilBrown 2020-04-16 0:30 ` [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown @ 2020-04-16 0:31 ` NeilBrown 2020-04-16 6:56 ` Christoph Hellwig 2020-04-16 15:24 ` Jan Kara 1 sibling, 2 replies; 43+ messages in thread From: NeilBrown @ 2020-04-16 0:31 UTC (permalink / raw) To: Trond Myklebust, Anna.Schumaker, Andrew Morton, Jan Kara, Michal Hocko Cc: linux-mm, linux-nfs, LKML [-- Attachment #1: Type: text/plain, Size: 13918 bytes --] After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts Signed-off-by: NeilBrown <neilb@suse.de> --- Documentation/filesystems/proc.rst | 4 ++-- drivers/base/node.c | 2 +- fs/fs-writeback.c | 1 - fs/nfs/internal.h | 10 +++++++--- fs/nfs/write.c | 4 ++-- fs/proc/meminfo.c | 3 +-- include/linux/mmzone.h | 1 - include/trace/events/writeback.h | 5 +---- mm/memcontrol.c | 1 - mm/page-writeback.c | 17 ++++------------- mm/page_alloc.c | 5 +---- mm/vmstat.c | 11 +++++++++-- 12 files changed, 28 insertions(+), 36 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 38b606991065..092b7b44d158 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1042,8 +1042,8 @@ PageTables amount of memory dedicated to the lowest level of page tables. NFS_Unstable - NFS pages sent to the server, but not yet committed to stable - storage + Always zero. Previous counted pages which had been written to + the server, but has not been committed to stable storage. Bounce Memory used for block device "bounce buffers" WritebackTmp diff --git a/drivers/base/node.c b/drivers/base/node.c index 10d7e818e118..15f5ed6a8830 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(i.sharedram), nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), + nid, 0, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), nid, K(sreclaimable + diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 76ac9c7d32ec..c5bdf46e3b4b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS) + get_nr_dirty_inodes(); } diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 1f32a9fbfdaf..6673a77884d9 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) } /* - * Record the page as unstable and mark its inode as dirty. + * Record the page as unstable (an extra writeback period) and mark its + * inode as dirty. */ static inline void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) @@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) if (!cinfo->dreq) { struct inode *inode = page_file_mapping(page)->host; - inc_node_page_state(page, NR_UNSTABLE_NFS); - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); + /* This page is really still in write-back - just that the + * writeback is happening on the server now. + */ + inc_node_page_state(page, NR_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index df4b87c30ac9..d9ea824accb7 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, static void nfs_clear_page_commit(struct page *page) { - dec_node_page_state(page, NR_UNSTABLE_NFS); + dec_node_page_state(page, NR_WRITEBACK); dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, - WB_RECLAIMABLE); + WB_WRITEBACK); } /* Called holding the request lock on @req */ diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 8c1f1bb1a5ce..9bd94b5a9658 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "PageTables: ", global_zone_page_state(NR_PAGETABLE)); - show_val_kb(m, "NFS_Unstable: ", - global_node_page_state(NR_UNSTABLE_NFS)); + show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", global_zone_page_state(NR_BOUNCE)); show_val_kb(m, "WritebackTmp: ", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1b9de7d220fb..a89f47515eb1 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -193,7 +193,6 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, - NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index d94def25e4dc..45b5fbdb1f62 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -542,7 +542,6 @@ TRACE_EVENT(global_dirty_state, TP_STRUCT__entry( __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) - __field(unsigned long, nr_unstable) __field(unsigned long, background_thresh) __field(unsigned long, dirty_thresh) __field(unsigned long, dirty_limit) @@ -553,7 +552,6 @@ TRACE_EVENT(global_dirty_state, TP_fast_assign( __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); __entry->nr_written = global_node_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; @@ -561,12 +559,11 @@ TRACE_EVENT(global_dirty_state, __entry->dirty_limit = global_wb_domain.dirty_limit; ), - TP_printk("dirty=%lu writeback=%lu unstable=%lu " + TP_printk("dirty=%lu writeback=%lu " "bg_thresh=%lu thresh=%lu limit=%lu " "dirtied=%lu written=%lu", __entry->nr_dirty, __entry->nr_writeback, - __entry->nr_unstable, __entry->background_thresh, __entry->dirty_thresh, __entry->dirty_limit, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5beea03dd58a..2db5bbcfc17a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); - /* this should eventually include NR_UNSTABLE_NFS */ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + memcg_exact_page_state(memcg, NR_ACTIVE_FILE); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9692c553526b..b3b08de01d12 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) unsigned long nr_pages = 0; nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); return nr_pages <= limit; @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. * * Return: @wb's dirty limit in pages. The term "dirty" in the context of - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. + * dirty balancing includes all PG_dirty and PG_writeback pages. */ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? &mdtc_stor : NULL; struct dirty_throttle_control *sdtc; - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ + unsigned long nr_reclaimable; /* = file_dirty */ long period; long pause; long max_pause; @@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, unsigned long m_thresh = 0; unsigned long m_bg_thresh = 0; - /* - * Unstable writes are a feature of certain networked - * filesystems (i.e. NFS) in which data may have been - * written to the server's write cache, but has not yet - * been flushed to permanent storage. - */ - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1948,8 +1940,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) * as we're trying to decide whether to put more under writeback. */ gdtc->avail = global_dirtyable_memory(); - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 69827d4fa052..238b5518f3c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5310,7 +5310,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" " active_file:%lu inactive_file:%lu isolated_file:%lu\n" - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" + " unevictable:%lu dirty:%lu writeback:%lu\n" " slab_reclaimable:%lu slab_unreclaimable:%lu\n" " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" " free:%lu free_pcp:%lu free_cma:%lu\n", @@ -5323,7 +5323,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_UNEVICTABLE), global_node_page_state(NR_FILE_DIRTY), global_node_page_state(NR_WRITEBACK), - global_node_page_state(NR_UNSTABLE_NFS), global_node_page_state(NR_SLAB_RECLAIMABLE), global_node_page_state(NR_SLAB_UNRECLAIMABLE), global_node_page_state(NR_FILE_MAPPED), @@ -5356,7 +5355,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " anon_thp: %lukB" #endif " writeback_tmp:%lukB" - " unstable:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -5378,7 +5376,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 96d21a792b57..6c719f184843 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) TEXT_FOR_HIGHMEM(xx) xx "_movable", const char * const vmstat_text[] = { - /* enum zone_stat_item countes */ + /* enum zone_stat_item counters */ "nr_free_pages", "nr_zone_inactive_anon", "nr_zone_active_anon", @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", - "nr_unstable", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied", @@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg) seq_puts(m, vmstat_text[off]); seq_put_decimal_ull(m, " ", *l); seq_putc(m, '\n'); + + if (off == NR_VMSTAT_ITEMS - 1) { + /* We've come to the end - add any deprecated counters + * to avoid breaking userspace which might depend on + * them being present. + */ + seq_puts(m, "nr_unstable 0\n"); + } return 0; } -- 2.26.0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 V3] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-16 0:31 ` [PATCH 2/2 V3] " NeilBrown @ 2020-04-16 6:56 ` Christoph Hellwig 2020-04-16 15:24 ` Jan Kara 1 sibling, 0 replies; 43+ messages in thread From: Christoph Hellwig @ 2020-04-16 6:56 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML On Thu, Apr 16, 2020 at 10:31:27AM +1000, NeilBrown wrote: > return global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS) + > get_nr_dirty_inodes(); Nit: this could a single line now: return global_node_page_state(NR_FILE_DIRTY) + get_nr_dirty_inodes(); > + /* This page is really still in write-back - just that the > + * writeback is happening on the server now. > + */ This needs to switch to the canonical kernel comment style. > + if (off == NR_VMSTAT_ITEMS - 1) { > + /* We've come to the end - add any deprecated counters > + * to avoid breaking userspace which might depend on > + * them being present. > + */ Same here. Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 2/2 V3] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead. 2020-04-16 0:31 ` [PATCH 2/2 V3] " NeilBrown 2020-04-16 6:56 ` Christoph Hellwig @ 2020-04-16 15:24 ` Jan Kara 1 sibling, 0 replies; 43+ messages in thread From: Jan Kara @ 2020-04-16 15:24 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna.Schumaker@Netapp.com, Andrew Morton, Jan Kara, Michal Hocko, linux-mm, linux-nfs, LKML On Thu 16-04-20 10:31:27, NeilBrown wrote: > > After an NFS page has been written it is considered "unstable" until a > COMMIT request succeeds. If the COMMIT fails, the page will be > re-written. > > These "unstable" pages are currently accounted as "reclaimable", either > in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a > 'reclaimable' count. This might have made sense when sending the COMMIT > required a separate action by the VFS/MM (e.g. releasepage() used to > send a COMMIT). However now that all writes generated by ->writepages() > will automatically be followed by a COMMIT (since commit 919e3bd9a875 > ("NFS: Ensure we commit after writeback is complete")) it makes more > sense to treat them as writeback pages. > > So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in > NR_WRITEBACK and WB_WRITEBACK. > > A particular effect of this change is that when > wb_check_background_flush() calls wb_over_bg_threshold(), the latter > will report 'true' a lot less often as the 'unstable' pages are no > longer considered 'dirty' (as there is nothing that writeback can do > about them anyway). > > Currently wb_check_background_flush() will trigger writeback to NFS even > when there are relatively few dirty pages (if there are lots of unstable > pages), this can result in small writes going to the server (10s of > Kilobytes rather than a Megabyte) which hurts throughput. > With this patch, there are fewer writes which are each larger on average. > > Where the NR_UNSTABLE_NFS count was included in statistics > virtual-files, the entry is retained, but the value is hard-coded as > zero. static trace points and warning printks which mentioned this > counter no longer report it. > > Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> > Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts > Signed-off-by: NeilBrown <neilb@suse.de> This looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> BTW I prefer the comment style Christoph wants but I don't insist... Honza > --- > Documentation/filesystems/proc.rst | 4 ++-- > drivers/base/node.c | 2 +- > fs/fs-writeback.c | 1 - > fs/nfs/internal.h | 10 +++++++--- > fs/nfs/write.c | 4 ++-- > fs/proc/meminfo.c | 3 +-- > include/linux/mmzone.h | 1 - > include/trace/events/writeback.h | 5 +---- > mm/memcontrol.c | 1 - > mm/page-writeback.c | 17 ++++------------- > mm/page_alloc.c | 5 +---- > mm/vmstat.c | 11 +++++++++-- > 12 files changed, 28 insertions(+), 36 deletions(-) > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > index 38b606991065..092b7b44d158 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -1042,8 +1042,8 @@ PageTables > amount of memory dedicated to the lowest level of page > tables. > NFS_Unstable > - NFS pages sent to the server, but not yet committed to stable > - storage > + Always zero. Previous counted pages which had been written to > + the server, but has not been committed to stable storage. > Bounce > Memory used for block device "bounce buffers" > WritebackTmp > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 10d7e818e118..15f5ed6a8830 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev, > nid, K(i.sharedram), > nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), > nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), > - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), > + nid, 0, > nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), > nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), > nid, K(sreclaimable + > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 76ac9c7d32ec..c5bdf46e3b4b 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > static unsigned long get_nr_dirty_pages(void) > { > return global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS) + > get_nr_dirty_inodes(); > } > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > index 1f32a9fbfdaf..6673a77884d9 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize) > } > > /* > - * Record the page as unstable and mark its inode as dirty. > + * Record the page as unstable (an extra writeback period) and mark its > + * inode as dirty. > */ > static inline > void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) > @@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) > if (!cinfo->dreq) { > struct inode *inode = page_file_mapping(page)->host; > > - inc_node_page_state(page, NR_UNSTABLE_NFS); > - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); > + /* This page is really still in write-back - just that the > + * writeback is happening on the server now. > + */ > + inc_node_page_state(page, NR_WRITEBACK); > + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); > __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > } > } > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index df4b87c30ac9..d9ea824accb7 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg, > static void > nfs_clear_page_commit(struct page *page) > { > - dec_node_page_state(page, NR_UNSTABLE_NFS); > + dec_node_page_state(page, NR_WRITEBACK); > dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, > - WB_RECLAIMABLE); > + WB_WRITEBACK); > } > > /* Called holding the request lock on @req */ > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c > index 8c1f1bb1a5ce..9bd94b5a9658 100644 > --- a/fs/proc/meminfo.c > +++ b/fs/proc/meminfo.c > @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > show_val_kb(m, "PageTables: ", > global_zone_page_state(NR_PAGETABLE)); > > - show_val_kb(m, "NFS_Unstable: ", > - global_node_page_state(NR_UNSTABLE_NFS)); > + show_val_kb(m, "NFS_Unstable: ", 0); > show_val_kb(m, "Bounce: ", > global_zone_page_state(NR_BOUNCE)); > show_val_kb(m, "WritebackTmp: ", > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 1b9de7d220fb..a89f47515eb1 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -193,7 +193,6 @@ enum node_stat_item { > NR_FILE_THPS, > NR_FILE_PMDMAPPED, > NR_ANON_THPS, > - NR_UNSTABLE_NFS, /* NFS unstable pages */ > NR_VMSCAN_WRITE, > NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ > NR_DIRTIED, /* page dirtyings since bootup */ > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h > index d94def25e4dc..45b5fbdb1f62 100644 > --- a/include/trace/events/writeback.h > +++ b/include/trace/events/writeback.h > @@ -542,7 +542,6 @@ TRACE_EVENT(global_dirty_state, > TP_STRUCT__entry( > __field(unsigned long, nr_dirty) > __field(unsigned long, nr_writeback) > - __field(unsigned long, nr_unstable) > __field(unsigned long, background_thresh) > __field(unsigned long, dirty_thresh) > __field(unsigned long, dirty_limit) > @@ -553,7 +552,6 @@ TRACE_EVENT(global_dirty_state, > TP_fast_assign( > __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); > __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); > - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); > __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); > __entry->nr_written = global_node_page_state(NR_WRITTEN); > __entry->background_thresh = background_thresh; > @@ -561,12 +559,11 @@ TRACE_EVENT(global_dirty_state, > __entry->dirty_limit = global_wb_domain.dirty_limit; > ), > > - TP_printk("dirty=%lu writeback=%lu unstable=%lu " > + TP_printk("dirty=%lu writeback=%lu " > "bg_thresh=%lu thresh=%lu limit=%lu " > "dirtied=%lu written=%lu", > __entry->nr_dirty, > __entry->nr_writeback, > - __entry->nr_unstable, > __entry->background_thresh, > __entry->dirty_thresh, > __entry->dirty_limit, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5beea03dd58a..2db5bbcfc17a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, > > *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); > > - /* this should eventually include NR_UNSTABLE_NFS */ > *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); > *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + > memcg_exact_page_state(memcg, NR_ACTIVE_FILE); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 9692c553526b..b3b08de01d12 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat) > unsigned long nr_pages = 0; > > nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); > - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); > nr_pages += node_page_state(pgdat, NR_WRITEBACK); > > return nr_pages <= limit; > @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc, > * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. > * > * Return: @wb's dirty limit in pages. The term "dirty" in the context of > - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. > + * dirty balancing includes all PG_dirty and PG_writeback pages. > */ > static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) > { > @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? > &mdtc_stor : NULL; > struct dirty_throttle_control *sdtc; > - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ > + unsigned long nr_reclaimable; /* = file_dirty */ > long period; > long pause; > long max_pause; > @@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > unsigned long m_thresh = 0; > unsigned long m_bg_thresh = 0; > > - /* > - * Unstable writes are a feature of certain networked > - * filesystems (i.e. NFS) in which data may have been > - * written to the server's write cache, but has not yet > - * been flushed to permanent storage. > - */ > - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); > gdtc->avail = global_dirtyable_memory(); > gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); > > @@ -1948,8 +1940,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) > * as we're trying to decide whether to put more under writeback. > */ > gdtc->avail = global_dirtyable_memory(); > - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + > - global_node_page_state(NR_UNSTABLE_NFS); > + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); > domain_dirty_limits(gdtc); > > if (gdtc->dirty > gdtc->bg_thresh) > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 69827d4fa052..238b5518f3c5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5310,7 +5310,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > > printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" > " active_file:%lu inactive_file:%lu isolated_file:%lu\n" > - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" > + " unevictable:%lu dirty:%lu writeback:%lu\n" > " slab_reclaimable:%lu slab_unreclaimable:%lu\n" > " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" > " free:%lu free_pcp:%lu free_cma:%lu\n", > @@ -5323,7 +5323,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > global_node_page_state(NR_UNEVICTABLE), > global_node_page_state(NR_FILE_DIRTY), > global_node_page_state(NR_WRITEBACK), > - global_node_page_state(NR_UNSTABLE_NFS), > global_node_page_state(NR_SLAB_RECLAIMABLE), > global_node_page_state(NR_SLAB_UNRECLAIMABLE), > global_node_page_state(NR_FILE_MAPPED), > @@ -5356,7 +5355,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > " anon_thp: %lukB" > #endif > " writeback_tmp:%lukB" > - " unstable:%lukB" > " all_unreclaimable? %s" > "\n", > pgdat->node_id, > @@ -5378,7 +5376,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), > #endif > K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), > - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), > pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? > "yes" : "no"); > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 96d21a792b57..6c719f184843 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) > TEXT_FOR_HIGHMEM(xx) xx "_movable", > > const char * const vmstat_text[] = { > - /* enum zone_stat_item countes */ > + /* enum zone_stat_item counters */ > "nr_free_pages", > "nr_zone_inactive_anon", > "nr_zone_active_anon", > @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = { > "nr_file_hugepages", > "nr_file_pmdmapped", > "nr_anon_transparent_hugepages", > - "nr_unstable", > "nr_vmscan_write", > "nr_vmscan_immediate_reclaim", > "nr_dirtied", > @@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg) > seq_puts(m, vmstat_text[off]); > seq_put_decimal_ull(m, " ", *l); > seq_putc(m, '\n'); > + > + if (off == NR_VMSTAT_ITEMS - 1) { > + /* We've come to the end - add any deprecated counters > + * to avoid breaking userspace which might depend on > + * them being present. > + */ > + seq_puts(m, "nr_unstable 0\n"); > + } > return 0; > } > > -- > 2.26.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2020-06-01 0:49 UTC | newest] Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-03-26 3:25 [PATCH/RFC] MM: fix writeback for NFS NeilBrown 2020-04-01 23:52 ` Writeback fixes " NeilBrown 2020-04-01 23:53 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-01 23:54 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK NeilBrown 2020-04-02 15:10 ` Christoph Hellwig 2020-04-02 22:35 ` [PATCH 2/2 - v2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown 2020-04-03 9:42 ` Jan Kara 2020-04-03 11:03 ` Michal Hocko 2020-04-06 0:14 ` NeilBrown 2020-04-06 7:41 ` Michal Hocko 2020-04-06 23:28 ` NeilBrown 2020-04-07 7:33 ` Michal Hocko 2020-04-02 19:55 ` [PATCH 2/2] Deprecate NR_UNSTABLE_NFS, use NR_WRITEBACK Jan Kara 2020-04-02 16:35 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE Jan Kara 2020-04-03 15:15 ` Michal Hocko 2020-04-03 21:40 ` NeilBrown 2020-04-06 7:44 ` Michal Hocko 2020-04-06 9:36 ` Jan Kara 2020-04-06 10:57 ` Michal Hocko 2020-04-06 11:58 ` NeilBrown 2020-04-02 4:26 ` Hillf Danton 2020-04-02 4:57 ` NeilBrown 2020-04-06 3:58 ` Hillf Danton 2020-04-06 23:42 ` Writeback fixes for NFS - V2 NeilBrown 2020-04-06 23:43 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-07 16:10 ` Chuck Lever 2020-04-16 0:29 ` Writeback fixes for NFS - V3 NeilBrown 2020-04-16 0:30 ` [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-04-16 6:54 ` Christoph Hellwig 2020-04-16 15:19 ` Jan Kara 2020-04-21 2:22 ` NeilBrown 2020-04-22 12:46 ` Jan Kara 2020-05-13 7:16 ` NeilBrown 2020-05-13 7:17 ` [PATCH 1/2 V4] " NeilBrown 2020-05-15 11:10 ` Jan Kara 2020-06-01 0:46 ` Writeback fixes for NFS NeilBrown 2020-06-01 0:48 ` [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE NeilBrown 2020-06-01 0:49 ` [PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead NeilBrown 2020-05-13 7:18 ` [PATCH 2/2 V4] " NeilBrown 2020-05-15 9:59 ` Jan Kara 2020-04-16 0:31 ` [PATCH 2/2 V3] " NeilBrown 2020-04-16 6:56 ` Christoph Hellwig 2020-04-16 15:24 ` Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).