* Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
       [not found] <20191026104656.15176-1-hdanton@sina.com>
@ 2019-11-08 21:00 ` Andrew Morton
  2019-11-14 12:17 ` Jan Kara
       [not found] ` <20191115033240.11236-1-hdanton@sina.com>
  2 siblings, 0 replies; 3+ messages in thread
From: Andrew Morton @ 2019-11-08 21:00 UTC
  To: Hillf Danton
  Cc: linux-mm, fsdev, linux-kernel, Fengguang Wu, Tejun Heo, Jan Kara,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman

On Sat, 26 Oct 2019 18:46:56 +0800 Hillf Danton <hdanton@sina.com> wrote:

> 
> The elastic bdi (ebdi) mirrors spinning disks, SSDs, USB sticks and
> other storage devices on the market. The performance of an ebdi
> goes up and down as the pattern of dispatched IO changes, roughly
> estimated as below.
> 
> 	P = j(..., IO pattern);
> 
> From the ebdi's point of view, the bandwidth currently measured in
> balancing dirty pages is closely related to its performance, because
> the former is a component of the latter.
> 
> 	B = y(P);
> 
> The functions above suggest a possible layering violation: the
> bandwidth might be better measured somewhere below the fs layer.
> 
> It is, however, measured accurately enough to satisfy everyone, and
> it plays a role in dispatching IO while entirely ignoring the IO
> pattern, which is volatile in nature.
> 
> It also helps to throttle the dirtying speed, while ignoring the
> fact that DRAM is in general ~10x faster than an ebdi. If B is half
> of P, for instance, then B is near 5% of the dirtying speed, just 2
> points from the figure in the snippet below.
> 
> /*
>  * If ratelimit_pages is too high then we can get into dirty-data overload
>  * if a large number of processes all perform writes at the same time.
>  * If it is too low then SMP machines will call the (expensive)
>  * get_writeback_state too often.
>  *
>  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
>  * thresholds.
>  */
> 
> To prevent the dirtying speed from running away from the laundering
> (writeback) speed, ebdi suggests the walk-the-dog method: put a
> leash on dirtiers in bdp, which seems to churn the IO pattern less.
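
A rough sketch of the arithmetic above, taking the changelog's own
assumptions that B = P/2 and that dirtying runs at ~10x device speed:

	#include <stdio.h>

	int main(void)
	{
		double P = 1.0;              /* device performance, normalized */
		double B = P / 2;            /* measured writeback bandwidth */
		double dirty_speed = 10 * P; /* DRAM assumed ~10x the device */

		/* B as a fraction of the dirtying speed: 0.5 / 10 = 5%,
		 * two points above the 3% (1/32) overshoot bound that
		 * ratelimit_pages aims for in the snippet quoted above. */
		printf("%.0f%%\n", 100.0 * B / dirty_speed);
		return 0;
	}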

I'm finding both the changelog and the patch rather hard to understand.
The absence of code comments doesn't help.  But the test robot
performance results look nice.

Presumably you did your own performance testing.  Please share the
results of that in the changelog.

> 
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -170,6 +170,8 @@ struct bdi_writeback {
>  
>  	struct list_head bdi_node;	/* anchored at bdi->wb_list */
>  
> +	struct wait_queue_head bdp_waitq;

Please add comments which help the reader find out what "bdp" stands
for.
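
A plausible reading (an assumption here, since the posting never
expands the abbreviation: "bdp" presumably stands for
balance_dirty_pages(), as in the subject line's "cgwb bdp") is a
comment along these lines:

	struct bdi_writeback {
		/* ... existing fields ... */

		/*
		 * bdp = balance_dirty_pages: tasks throttled in
		 * balance_dirty_pages_ratelimited() sleep on this
		 * waitqueue until wb_workfn() wakes them after a
		 * writeback pass completes.
		 */
		struct wait_queue_head bdp_waitq;
	};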


>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	struct percpu_ref refcnt;	/* used only for !root wb's */
>  	struct fprop_local_percpu memcg_completions;
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -324,6 +324,8 @@ static int wb_init(struct bdi_writeback
>  			goto out_destroy_stat;
>  	}
>  
> +	init_waitqueue_head(&wb->bdp_waitq);
> +
>  	return 0;
>  
>  out_destroy_stat:
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1551,6 +1551,39 @@ static inline void wb_dirty_limits(struc
>  	}
>  }
>
> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)

Please document this function.  Describe the "why" not the "what".

Comment should help readers understand what "cgwb" means.

> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +
> +	if (fatal_signal_pending(current))
> +		return false;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +
> +	domain_dirty_limits(&gdtc);
> +
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +			global_node_page_state(NR_UNSTABLE_NFS) +
> +			global_node_page_state(NR_WRITEBACK);
> +
> +	if (gdtc.dirty < gdtc.bg_thresh)
> +		return false;
> +
> +	if (!writeback_in_progress(wb))
> +		wb_start_background_writeback(wb);

Add a comment explaining what's going on here and why we're doing this.
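
A sketch of the kind of comment being asked for, guessing at the
intent from the surrounding code:

	/*
	 * Kick off background writeback if it is not already running,
	 * so that a task about to sleep on bdp_waitq has a flusher
	 * working on its behalf and a wakeup to wait for.
	 */
	if (!writeback_in_progress(wb))
		wb_start_background_writeback(wb);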

> +	return gdtc.dirty > gdtc.thresh &&
> +		wb_stat(wb, WB_DIRTIED) >
> +		wb_stat(wb, WB_WRITTEN) +
> +		wb_stat_error();

Why?
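
One plausible answer (an assumption; the posting does not say): keep
throttling while this wb has dirtied more pages than it has written
back, i.e. while writeback is lagging behind dirtying, with
wb_stat_error() absorbing per-CPU counter slop:

	/* Hypothetical helper naming the condition being questioned. */
	static bool wb_writeback_lagging(struct bdi_writeback *wb)
	{
		return wb_stat(wb, WB_DIRTIED) >
		       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
	}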

> +}
> +
> +static inline void cgwb_bdp(struct bdi_writeback *wb)
> +{
> +	wait_event_interruptible_timeout(wb->bdp_waitq,
> +			!cgwb_bdp_should_throttle(wb), HZ);
> +}
> +
>  /*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
> @@ -1910,7 +1943,7 @@ void balance_dirty_pages_ratelimited(str
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> -		balance_dirty_pages(wb, current->nr_dirtied);
> +		cgwb_bdp(wb);
>  
>  	wb_put(wb);
>  }
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
>  	if (nr_pages == LONG_MAX)
>  		return LONG_MAX;
>  
> +	return nr_pages;
> +
>  	/*
>  	 * This may be called on clean wb's and proportional distribution
>  	 * may not make sense, just use the original @nr_pages in those
> @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
>  		pages = min(pages, work->nr_pages);
>  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
>  				   MIN_WRITEBACK_PAGES);
> +		pages = work->nr_pages;
>  	}
>  
>  	return pages;
> @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
>  		wb_wakeup_delayed(wb);
>  
>  	current->flags &= ~PF_SWAPWRITE;
> +
> +	if (waitqueue_active(&wb->bdp_waitq))
> +		wake_up_all(&wb->bdp_waitq);
>  }

Does this patch affect both cgroup writeback and global writeback? 
Were both tested?  Performance results of both?



* Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
       [not found] <20191026104656.15176-1-hdanton@sina.com>
  2019-11-08 21:00 ` [RFC v2] writeback: add elastic bdi in cgwb bdp Andrew Morton
@ 2019-11-14 12:17 ` Jan Kara
       [not found] ` <20191115033240.11236-1-hdanton@sina.com>
  2 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2019-11-14 12:17 UTC
  To: Hillf Danton
  Cc: linux-mm, fsdev, Andrew Morton, linux-kernel, Fengguang Wu,
	Tejun Heo, Jan Kara, Johannes Weiner, Shakeel Butt, Minchan Kim,
	Mel Gorman

On Sat 26-10-19 18:46:56, Hillf Danton wrote:
> 
> The elastic bdi (ebdi) mirrors spinning disks, SSDs, USB sticks and
> other storage devices on the market. The performance of an ebdi
> goes up and down as the pattern of dispatched IO changes, roughly
> estimated as below.
> 
> 	P = j(..., IO pattern);
> 
> From the ebdi's point of view, the bandwidth currently measured in
> balancing dirty pages is closely related to its performance, because
> the former is a component of the latter.
> 
> 	B = y(P);
> 
> The functions above suggest a possible layering violation: the
> bandwidth might be better measured somewhere below the fs layer.
> 
> It is, however, measured accurately enough to satisfy everyone, and
> it plays a role in dispatching IO while entirely ignoring the IO
> pattern, which is volatile in nature.
> 
> It also helps to throttle the dirtying speed, while ignoring the
> fact that DRAM is in general ~10x faster than an ebdi. If B is half
> of P, for instance, then B is near 5% of the dirtying speed, just 2
> points from the figure in the snippet below.
> 
> /*
>  * If ratelimit_pages is too high then we can get into dirty-data overload
>  * if a large number of processes all perform writes at the same time.
>  * If it is too low then SMP machines will call the (expensive)
>  * get_writeback_state too often.
>  *
>  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
>  * thresholds.
>  */
> 
> To prevent the dirtying speed from running away from the laundering
> (writeback) speed, ebdi suggests the walk-the-dog method: put a
> leash on dirtiers in bdp, which seems to churn the IO pattern less.
> 
> V2 is based on next-20191025.

Honestly, the changelog is still pretty incomprehensible, as Andrew already
mentioned. I also completely miss what the benefits of this work are
compared to what we currently have.

> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1551,6 +1551,39 @@ static inline void wb_dirty_limits(struc
>  	}
>  }
>  
> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +
> +	if (fatal_signal_pending(current))
> +		return false;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +
> +	domain_dirty_limits(&gdtc);
> +
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +			global_node_page_state(NR_UNSTABLE_NFS) +
> +			global_node_page_state(NR_WRITEBACK);
> +
> +	if (gdtc.dirty < gdtc.bg_thresh)
> +		return false;
> +
> +	if (!writeback_in_progress(wb))
> +		wb_start_background_writeback(wb);
> +
> +	return gdtc.dirty > gdtc.thresh &&
> +		wb_stat(wb, WB_DIRTIED) >
> +		wb_stat(wb, WB_WRITTEN) +
> +		wb_stat_error();
> +}

This looks like a very primitive version of what we already have in
balance_dirty_pages(). Just without support for cgroup-aware writeback, or
any guarantees on the amount of written-out data.

> +
> +static inline void cgwb_bdp(struct bdi_writeback *wb)
> +{
> +	wait_event_interruptible_timeout(wb->bdp_waitq,
> +			!cgwb_bdp_should_throttle(wb), HZ);
> +}

This breaks dirty throttling, as no dirtier is ever delayed for more than 1
second. Under heavier IO load, it can clearly take longer to clean enough
pages before a dirtier can continue...
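
A minimal sketch of what Jan's point implies (hypothetical, not
something proposed in the thread): re-arm the wait until the throttle
condition actually clears, rather than giving up after one HZ timeout:

	static void cgwb_bdp(struct bdi_writeback *wb)
	{
		long ret;

		do {
			ret = wait_event_interruptible_timeout(wb->bdp_waitq,
					!cgwb_bdp_should_throttle(wb), HZ);
			/* ret == 0 means the timeout expired with the
			 * condition still true, so go around again; a
			 * signal (< 0) or a cleared condition (> 0)
			 * ends the throttle. */
		} while (ret == 0);
	}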

> +
>  /*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
> @@ -1910,7 +1943,7 @@ void balance_dirty_pages_ratelimited(str
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> -		balance_dirty_pages(wb, current->nr_dirtied);
> +		cgwb_bdp(wb);
>  
>  	wb_put(wb);
>  }
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
>  	if (nr_pages == LONG_MAX)
>  		return LONG_MAX;
>  
> +	return nr_pages;
> +

So you've just broken cgroup-aware writeback with this. I won't even speak
of the fact that this is a return in the middle of the function. But let's
take this as an experimental patch meant to demonstrate something...
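
The hunk annotated, to spell out what Jan is pointing at (the comment
is editorial, not from the patch):

	return nr_pages;	/* short-circuits the proportional split of
				 * nr_pages among the bdi's cgroup wbs done
				 * below, so every wb is asked for the full
				 * amount and cgroup-aware distribution is
				 * effectively disabled */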

>  	/*
>  	 * This may be called on clean wb's and proportional distribution
>  	 * may not make sense, just use the original @nr_pages in those
> @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
>  		pages = min(pages, work->nr_pages);
>  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
>  				   MIN_WRITEBACK_PAGES);
> +		pages = work->nr_pages;
>  	}

This breaks livelock prevention in the writeback code. Now we can write out
a single inode basically forever.
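
Again the hunk annotated (editorial comments) to show the bound that
is lost:

	pages = min(pages, work->nr_pages);
	pages = round_down(pages + MIN_WRITEBACK_PAGES,
			   MIN_WRITEBACK_PAGES);	/* bounded chunk */
	pages = work->nr_pages;	/* overrides the bound just computed, so
				 * one inode may consume the entire work
				 * item */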

> @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
>  		wb_wakeup_delayed(wb);
>  
>  	current->flags &= ~PF_SWAPWRITE;
> +
> +	if (waitqueue_active(&wb->bdp_waitq))
> +		wake_up_all(&wb->bdp_waitq);
>  }

If anyone submits a writeback work item for just a few pages, this will
release dirtying processes prematurely (before we are guaranteed to have
gotten below the dirty limits).
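
A hypothetical way out (not from the thread): gate the wakeup on the
same condition the sleepers check, so a small work item cannot release
them early:

	/* In wb_workfn(): wake throttled dirtiers only once the
	 * throttle condition has actually cleared.  Note that
	 * cgwb_bdp_should_throttle() may re-kick background
	 * writeback as a side effect. */
	if (waitqueue_active(&wb->bdp_waitq) &&
	    !cgwb_bdp_should_throttle(wb))
		wake_up_all(&wb->bdp_waitq);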

So to summarize, I'm sorry but this patch looks very broken to me and I
don't see how your proposed throttling method is any better than what we
already have.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC v2] writeback: add elastic bdi in cgwb bdp
       [not found] ` <20191115033240.11236-1-hdanton@sina.com>
@ 2019-11-15  9:53   ` Jan Kara
  0 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2019-11-15  9:53 UTC
  To: Hillf Danton
  Cc: Jan Kara, linux-mm, fsdev, Andrew Morton, linux-kernel,
	Fengguang Wu, Tejun Heo, Jan Kara, Johannes Weiner, Shakeel Butt,
	Minchan Kim, Mel Gorman

On Fri 15-11-19 11:32:40, Hillf Danton wrote:
> 
> On Thu, 14 Nov 2019 13:17:46 +0100 Jan Kara wrote:
> > 
> > On Sat 26-10-19 18:46:56, Hillf Danton wrote:
> > > 
> > > The elastic bdi (ebdi) mirrors spinning disks, SSDs, USB sticks and
> > > other storage devices on the market. The performance of an ebdi
> > > goes up and down as the pattern of dispatched IO changes, roughly
> > > estimated as below.
> > > 
> > > 	P = j(..., IO pattern);
> > > 
> > > From the ebdi's point of view, the bandwidth currently measured in
> > > balancing dirty pages is closely related to its performance, because
> > > the former is a component of the latter.
> > > 
> > > 	B = y(P);
> > > 
> > > The functions above suggest a possible layering violation: the
> > > bandwidth might be better measured somewhere below the fs layer.
> > > 
> > > It is, however, measured accurately enough to satisfy everyone, and
> > > it plays a role in dispatching IO while entirely ignoring the IO
> > > pattern, which is volatile in nature.
> > > 
> > > It also helps to throttle the dirtying speed, while ignoring the
> > > fact that DRAM is in general ~10x faster than an ebdi. If B is half
> > > of P, for instance, then B is near 5% of the dirtying speed, just 2
> > > points from the figure in the snippet below.
> > > 
> > > /*
> > >  * If ratelimit_pages is too high then we can get into dirty-data overload
> > >  * if a large number of processes all perform writes at the same time.
> > >  * If it is too low then SMP machines will call the (expensive)
> > >  * get_writeback_state too often.
> > >  *
> > >  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> > >  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> > >  * thresholds.
> > >  */
> > > 
> > > To prevent the dirtying speed from running away from the laundering
> > > (writeback) speed, ebdi suggests the walk-the-dog method: put a
> > > leash on dirtiers in bdp, which seems to churn the IO pattern less.
> > > 
> > > V2 is based on next-20191025.
> > 
> > Honestly, the changelog is still pretty incomprehensible, as Andrew already
> > mentioned. I also completely miss what the benefits of this work are
> > compared to what we currently have.
> > 
> Hey Jan
> 
> Within the headroom of somewhere between 3% and 5% that bdp has had
> since 143dfe8611a6 ("writeback: IO-less balance_dirty_pages()"), an
> alternative bdp is proposed, with the target of surviving test suites
> like LTP without introducing regressions. Overall, the benefit in
> question is that bdp becomes more diverse, assuming diversity under
> linux/fs is good for the 99%.

What do you mean by "balance_dirty_pages() is becoming more diverse"?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

