All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Roesch <shr@fb.com>
To: Jan Kara <jack@suse.cz>
Cc: io-uring@vger.kernel.org, kernel-team@fb.com, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com
Subject: Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
Date: Tue, 10 May 2022 13:16:30 -0700	[thread overview]
Message-ID: <84f8da94-1227-a351-56ba-eabdba91027b@fb.com> (raw)
In-Reply-To: <20220510095036.6tbbwwf5hxcevzkh@quack3.lan>



On 5/10/22 2:50 AM, Jan Kara wrote:
> Sorry for delayed reply. This has fallen through the cracks...
> 
> On Thu 28-04-22 13:16:19, Stefan Roesch wrote:
>> On 4/28/22 10:47 AM, Jan Kara wrote:
>>> On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
>>>> This change adds support for async write throttling in the function
>>>> balance_dirty_pages(). So far if throttling was required, the code was
>>>> waiting synchronously as long as the writes were throttled. This change
>>>> introduces asynchronous throttling. Instead of waiting in the function
>>>> balance_dirty_pages(), the timeout is set in the task_struct field
>>>> bdp_pause. Once the timeout has expired, the writes are no longer
>>>> throttled.
>>>>
>>>> - Add a new parameter to the balance_dirty_pages() function
>>>>   - This allows the caller to pass in the nowait flag
>>>>   - When the nowait flag is specified, the code does not wait in
>>>>     balance_dirty_pages(), but instead stores the wait expiration in the
>>>>     new task_struct field bdp_pause.
>>>>
>>>> - The function balance_dirty_pages_ratelimited() resets the new values
>>>>   in the task_struct, once the timeout has expired
>>>>
>>>> This change is required to support write throttling for the async
>>>> buffered writes. While the writes are throttled, io_uring still can make
>>>> progress with processing other requests.
>>>>
>>>> Signed-off-by: Stefan Roesch <shr@fb.com>
>>>
>>> Maybe I miss something but I don't think this will throttle writers enough.
>>> For three reasons:
>>>
>>> 1) The calculated throttling pauses should accumulate for the task so that
>>> if we compute that say it takes 0.1s to write 100 pages and the task writes
>>> 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
>>> be throttled as long as we expect the writeback to take.
>>>
>>> 2) We must not allow the amount of dirty pages to exceed the dirty limit.
>>> That can easily lead to page reclaim getting into trouble reclaiming pages
>>> and thus machine stalls, oom kills etc. So if we are coming close to dirty
>>> limit and we cannot sleep, we must just fail the nowait write.
>>>
>>> 3) Even with above two problems fixed I suspect results will be suboptimal
>>> because balance_dirty_pages() heuristics assume they get called reasonably
>>> often and throttle writes so if amount of dirty pages is coming close to
>>> dirty limit, they think we are overestimating writeback speed and update
>>> throttling parameters accordingly. So if io_uring code does not throttle
>>> writers often enough, I think dirty throttling parameters will be jumping
>>> wildly resulting in poor behavior.
>>>
>>> So what I'd probably suggest is that if balance_dirty_pages() is called in
>>> "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
>>> balance_dirty_pages() decides the task needs to wait, we store the pause
>>> and bail all the way up into the place where we can sleep (io_uring code I
>>> assume), sleep there, and then continue doing write.
>>>
>>
>> Jan, thanks for the feedback. Are you suggesting to change the following
>> check in the function balance_dirty_pages():
>>
>>                 /*
>>                  * Throttle it only when the background writeback cannot
>>                  * catch-up. This avoids (excessively) small writeouts
>>                  * when the wb limits are ramping up in case of !strictlimit.
>>                  *
>>                  * In strictlimit case make decision based on the wb counters
>>                  * and limits. Small writeouts when the wb limits are ramping
>>                  * up are the price we consciously pay for strictlimit-ing.
>>                  *
>>                  * If memcg domain is in effect, @dirty should be under
>>                  * both global and memcg freerun ceilings.
>>                  */
>>                 if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
>>                     (!mdtc ||
>>                      m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
>>                         unsigned long intv;
>>                         unsigned long m_intv;
>>
>> to include if we are in async mode?
> 
> Actually no. This condition is the one that gives any task a free pass
> until dirty_freerun_ceiling(). So there's no need to do any modification
> for that. Sorry, I've probably formulated my suggestion in a bit confusing
> way.
> 
>> There is no direct way to return that the process should sleep. Instead
>> two new fields are introduced in the proc structure. These two fields are
>> then used in io_uring to determine if the writes for a task need to be
>> throttled.
>>
>> In case the writes need to be throttled, the writes are not issued, but
>> instead inserted on a wait queue. We cannot sleep in the general io_uring
>> code path as we still want to process other requests which are affected
>> by the throttling.
> 
> Probably you wanted to say "are not affected by the throttling" in the
> above.
> 

Yes, that's correct.

> I know that you're using fields in task_struct to propagate the delay info.
> But IMHO that is unnecessary (although I don't care too much). Instead we
> could factor out a variant of balance_dirty_pages() that returns 'pause' to
> sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this
> for pause calculation, places wanting async throttling would only get the
> pause to sleep. So e.g. iomap_write_iter() would then check and if returned
> pause is > 0, it would abort the loop similary as we'd abort it for any
> other reason when NOWAIT write is aborted because we need to sleep. Iouring
> code then detects short write / EAGAIN and offloads the write to the
> workqueue where normal balance_dirty_pages() can sleep as needed.
> 
> This will make sure dirty limits are properly observed and we don't need
> that much special handling for it.
>

I like the idea of factoring out a function out balance_dirty_pages(), however

I see two challenges:
- the write operation has already completed at this point,
- so we can't really sleep on its completion in the io-worker in io-uring
- we don't know how long to sleep in io-uring

Currently balance_dirty_pages_ratelimited() is called at the end of the function
iomap_write_iter(). If the function balance_dirty_pages_ratelimited() would instead
be called at the beginning of the function iomap_write_iter() we could return -EAGAIN
and then complete it in the io-worker.

I'm not sure what the implications are of moving the function call to the beginning of
the function iomap_write_iter().
 
> 								Honza

  reply	other threads:[~2022-05-10 20:16 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio() Stefan Roesch
2022-04-26 19:06   ` Matthew Wilcox
2022-04-28 19:54     ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 05/18] iomap: add async buffered write support Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 06/18] xfs: add iomap " Stefan Roesch
2022-04-26 22:54   ` Dave Chinner
2022-04-28 20:03     ` Stefan Roesch
2022-04-28 21:44       ` Dave Chinner
2022-04-26 17:43 ` [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs() Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 09/18] fs: add pending file update time flag Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
2022-04-26 22:55   ` Dave Chinner
2022-04-27 12:07   ` Christian Brauner
2022-04-27 15:05   ` kernel test robot
2022-04-26 17:43 ` [RFC PATCH v1 11/18] xfs: add async buffered write support Stefan Roesch
2022-04-26 22:56   ` Dave Chinner
2022-04-28 19:58     ` Stefan Roesch
2022-04-28 21:54       ` Dave Chinner
2022-05-02 21:21         ` Stefan Roesch
2022-05-06  9:29           ` Dave Chinner
2022-05-09 19:32             ` Stefan Roesch
2022-05-09 23:24               ` Dave Chinner
2022-05-09 23:44                 ` Darrick J. Wong
2022-05-10  1:12                   ` Dave Chinner
2022-05-10  6:47                     ` Christoph Hellwig
2022-05-16  2:24                       ` Dave Chinner
2022-05-16 13:39                         ` Christoph Hellwig
2022-04-26 17:43 ` [RFC PATCH v1 12/18] io_uring: add support for async buffered writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 14/18] sched: add new fields to task_struct Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
2022-04-28 17:47   ` Jan Kara
2022-04-28 20:16     ` Stefan Roesch
2022-05-10  9:50       ` Jan Kara
2022-05-10 20:16         ` Stefan Roesch [this message]
2022-05-11 10:38           ` Jan Kara
2022-05-13 18:57             ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 16/18] iomap: User " Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
2022-04-26 22:37 ` [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=84f8da94-1227-a351-56ba-eabdba91027b@fb.com \
    --to=shr@fb.com \
    --cc=david@fromorbit.com \
    --cc=io-uring@vger.kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel-team@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.