linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>, Michal Hocko <mhocko@suse.com>,
	Dave Chinner <david@fromorbit.com>, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2] mm: implement write-behind policy for sequential file writes
Date: Fri, 20 Sep 2019 10:39:33 +0300	[thread overview]
Message-ID: <875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru> (raw)
In-Reply-To: <156896493723.4334.13340481207144634918.stgit@buzz>

[-- Attachment #1: Type: text/plain, Size: 4244 bytes --]

Script for trivial demo in attachment

$ bash test_writebehind.sh
SIZE
3,2G	dummy
vm.dirty_write_behind = 0
COPY

real	0m3.629s
user	0m0.016s
sys	0m3.613s
Dirty:           3254552 kB
SYNC

real	0m31.953s
user	0m0.002s
sys	0m0.000s
vm.dirty_write_behind = 1
COPY

real	0m32.738s
user	0m0.008s
sys	0m4.047s
Dirty:              2900 kB
SYNC

real	0m0.427s
user	0m0.000s
sys	0m0.004s
vm.dirty_write_behind = 2
COPY

real	0m32.168s
user	0m0.000s
sys	0m4.066s
Dirty:              3088 kB
SYNC

real	0m0.421s
user	0m0.004s
sys	0m0.001s


With vm.dirty_write_behind 1 or 2 files are written even faster and
during copying amount of dirty memory always stays around at 16MiB.


On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is worth strategy for extremely short-living files and for batching
> writes for saving battery power. But for workloads where disk latency is
> important this policy generates periodic disk load spikes which increases
> latency for concurrent operations.
> 
> Also dirty pages in file cache cannot be reclaimed and reused immediately.
> This way massive I/O like file copying affects memory allocation latency.
> 
> Present writeback engine allows to tune only dirty data size or expiration
> time. Such tuning cannot eliminate spikes - this just lowers and multiplies
> them. Other option is switching into sync mode which flushes written data
> right after each write, obviously this have significant performance impact.
> Such tuning is system-wide and affects memory-mapped and randomly written
> files, flusher threads handle them much better.
> 
> This patch implements write-behind policy which tracks sequential writes
> and starts background writeback when file have enough dirty pages.
> 
> Global switch in sysctl vm.dirty_write_behind:
> =0: disabled, default
> =1: enabled for strictly sequential writes (append, copying)
> =2: enabled for all sequential writes
> 
> The only parameter is window size: maximum amount of dirty pages behind
> current position and maximum amount of pages in background writeback.
> 
> Setup is per-disk in sysfs in file /sys/block/$DISK/bdi/write_behind_kb.
> Default: 16MiB, '0' disables write-behind for this disk.
> 
> When amount of unwritten pages exceeds window size write-behind starts
> background writeback for max(excess, max_sectors_kb) and then waits for
> the same amount of background writeback initiated at previously.
> 
>   |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
>   |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
>                current head-^    new head-^              file position-^
> 
> Remaining tail pages are flushed at closing file if async write-behind was
> started or this is new file and it is at least max_sectors_kb long.
> 
> Overall behavior depending on total data size:
> < max_sectors_kb - no writes
>> max_sectors_kb - write new files in background after close
>> write_behind_kb - streaming write, write tail at close
> 
> Special cases:
> 
> * files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored
> 
> * writing cursor for O_APPEND is aligned to covers previous small appends
>    Append might happen via multiple files or via new file each time.
> 
> * mode vm.dirty_write_behind=1 ignores non-append writes
>    This reacts only to completely sequential writes like copying files,
>    writing logs with O_APPEND or rewriting files after O_TRUNC.
> 
> Note: ext4 feature "auto_da_alloc" also writes cache at closing file
> after truncating it to 0 and after renaming one file over other.
> 
> Changes since v1 (2017-10-02):
> * rework window management:
> * change default window 1MiB -> 16MiB
> * change default request 256KiB -> max_sectors_kb
> * drop always-async behavior for O_NONBLOCK
> * drop handling POSIX_FADV_NOREUSE (should be in separate patch)
> * ignore writes with O_DIRECT, O_SYNC, O_DSYNC
> * align head position for O_APPEND
> * add strictly sequential mode
> * write tail pages for new files
> * make void, keep errors at mapping
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
> ---

[-- Attachment #2: test_writebehind.sh --]
[-- Type: application/x-shellscript, Size: 428 bytes --]

  reply	other threads:[~2019-09-20  7:39 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-20  7:35 [PATCH v2] mm: implement write-behind policy for sequential file writes Konstantin Khlebnikov
2019-09-20  7:39 ` Konstantin Khlebnikov [this message]
2019-09-23 14:52   ` Tejun Heo
2019-09-23 15:06     ` Konstantin Khlebnikov
2019-09-23 15:19       ` Tejun Heo
2019-09-24  7:39       ` Dave Chinner
2019-09-24  9:00         ` Konstantin Khlebnikov
2019-09-25  7:18           ` Dave Chinner
2019-09-25  8:15             ` Konstantin Khlebnikov
2019-09-25 23:25               ` Dave Chinner
2019-09-25 12:54             ` Theodore Y. Ts'o
2019-09-24 19:08         ` Linus Torvalds
2019-09-25  8:00           ` Dave Chinner
2019-09-20 23:05 ` Linus Torvalds
2019-09-20 23:10   ` Linus Torvalds
2019-09-23 15:36     ` Jens Axboe
2019-09-23 16:05       ` Konstantin Khlebnikov
2019-09-24  9:29   ` Konstantin Khlebnikov
2019-09-22  7:47 ` kbuild test robot
2019-09-23  0:36 ` [mm] e0e7df8d5b: will-it-scale.per_process_ops -7.3% regression kernel test robot
2019-09-23 19:11   ` Konstantin Khlebnikov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru \
    --to=khlebnikov@yandex-team.ru \
    --cc=axboe@kernel.dk \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).