From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>, Michal Hocko <mhocko@suse.com>,
Dave Chinner <david@fromorbit.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2] mm: implement write-behind policy for sequential file writes
Date: Fri, 20 Sep 2019 10:39:33 +0300
Message-ID: <875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru>
In-Reply-To: <156896493723.4334.13340481207144634918.stgit@buzz>
A script for a trivial demo is attached.
$ bash test_writebehind.sh
SIZE
3,2G dummy
vm.dirty_write_behind = 0
COPY
real 0m3.629s
user 0m0.016s
sys 0m3.613s
Dirty: 3254552 kB
SYNC
real 0m31.953s
user 0m0.002s
sys 0m0.000s
vm.dirty_write_behind = 1
COPY
real 0m32.738s
user 0m0.008s
sys 0m4.047s
Dirty: 2900 kB
SYNC
real 0m0.427s
user 0m0.000s
sys 0m0.004s
vm.dirty_write_behind = 2
COPY
real 0m32.168s
user 0m0.000s
sys 0m4.066s
Dirty: 3088 kB
SYNC
real 0m0.421s
user 0m0.004s
sys 0m0.001s
With vm.dirty_write_behind set to 1 or 2, files are written even faster
overall (copy plus sync), and during copying the amount of dirty memory
always stays at around 16MiB.
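
The comparison can be checked by adding the COPY and SYNC "real" times for each mode. The short Python sketch below does only that arithmetic; the numbers are copied from the transcript above, and the helper name total_seconds is made up for illustration.

```python
# "real" timings in seconds, copied from the test output above.
runs = {
    0: {"copy": 3.629, "sync": 31.953},
    1: {"copy": 32.738, "sync": 0.427},
    2: {"copy": 32.168, "sync": 0.421},
}


def total_seconds(mode: int) -> float:
    """Wall-clock time until the data is on disk: copy plus the final sync."""
    r = runs[mode]
    return r["copy"] + r["sync"]


if __name__ == "__main__":
    for mode in sorted(runs):
        print(f"vm.dirty_write_behind = {mode}: {total_seconds(mode):.3f} s")
```

The totals come out at roughly 35.6 s, 33.2 s and 32.6 s for modes 0, 1 and 2. This is a single run on one machine, so the gap should be read as indicative only.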
On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for
> batching writes to save battery power. But for workloads where disk latency
> is important, this policy generates periodic disk load spikes which increase
> latency for concurrent operations.
>
> Also, dirty pages in the file cache cannot be reclaimed and reused
> immediately. This way massive I/O such as file copying affects memory
> allocation latency.
>
> The present writeback engine allows tuning only the dirty data size or the
> expiration time. Such tuning cannot eliminate spikes; it just lowers and
> multiplies them. Another option is switching to sync mode, which flushes
> written data right after each write; obviously this has a significant
> performance impact. Such tuning is system-wide and also affects memory-mapped
> and randomly written files, which flusher threads handle much better.
>
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback when a file has enough dirty pages.
>
> Global switch in sysctl vm.dirty_write_behind:
> =0: disabled, default
> =1: enabled for strictly sequential writes (append, copying)
> =2: enabled for all sequential writes
>
> The only parameter is the window size: the maximum amount of dirty pages
> behind the current position and the maximum amount of pages in background
> writeback.
>
> It is set per disk in sysfs, in the file /sys/block/$DISK/bdi/write_behind_kb.
> Default: 16MiB, '0' disables write-behind for this disk.
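
For reference, both knobs can be inspected from userspace. The Python sketch below assumes a kernel with this patch applied, since neither file exists otherwise. The /proc/sys path is the usual mapping of the sysctl name vm.dirty_write_behind, and the helper names and the disk name "sda" are placeholders.

```python
from pathlib import Path
from typing import Optional

# Standard /proc/sys mapping of vm.dirty_write_behind; present only on a
# kernel carrying this patch.
SYSCTL = Path("/proc/sys/vm/dirty_write_behind")


def window_path(disk: str) -> Path:
    """Per-disk window file named in the patch description."""
    return Path("/sys/block") / disk / "bdi" / "write_behind_kb"


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored in path, or None if it is absent or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("vm.dirty_write_behind:", read_int(SYSCTL))
    print("sda write_behind_kb:", read_int(window_path("sda")))  # "sda" is an example
```

On an unpatched kernel both reads simply return None.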
>
> When the amount of unwritten pages exceeds the window size, write-behind
> starts background writeback for max(excess, max_sectors_kb) and then waits
> for the same amount of background writeback initiated previously.
>
> |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
> |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
>              current head-^    new head-^              file position-^
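
The window bookkeeping described above can be modelled in a few lines of Python. This is a toy userspace model, not the kernel code. The class, the field names and the 512 kB stand-in for max_sectors_kb are invented for illustration, and "waits for the same amount initiated previously" is interpreted here as waiting until at most one window of writeback remains in flight.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class WriteBehindModel:
    window_kb: int = 16 * 1024  # default window from the patch description
    chunk_kb: int = 512         # hypothetical stand-in for max_sectors_kb
    pos: int = 0                # file position: end of the newest data, in kB
    head: int = 0               # writeback has been started for [0, head)
    done: int = 0               # writeback has been waited for on [0, done)

    def write(self, size_kb: int) -> Tuple[int, int]:
        """Account one sequential write; return (sent_kb, waited_kb)."""
        self.pos += size_kb
        behind = self.pos - self.head        # dirty data not yet sent
        if behind <= self.window_kb:
            return 0, 0                      # still inside the window
        excess = behind - self.window_kb
        # Send max(excess, chunk), but never more than actually exists.
        send = min(max(excess, self.chunk_kb), behind)
        self.head += send
        # Assumed wait rule: keep at most one window of writeback in flight.
        target = max(self.done, self.head - self.window_kb)
        waited = target - self.done
        self.done = target
        return send, waited


if __name__ == "__main__":
    model = WriteBehindModel()
    for _ in range(1000):                    # 1000 sequential 64 kB writes
        model.write(64)
    print("dirty behind:", model.pos - model.head, "kB")
    print("in flight:   ", model.head - model.done, "kB")
```

In this model both the dirty data behind the file position and the writeback in flight stay within one window after every write.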
>
> Remaining tail pages are flushed when the file is closed if async
> write-behind was started, or if this is a new file and it is at least
> max_sectors_kb long.
>
> Overall behavior depending on total data size:
> < max_sectors_kb   - no writes
> > max_sectors_kb   - write new files in background after close
> > write_behind_kb  - streaming write, write tail at close
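
The three size classes above can be written as a small Python function. The function name is invented for illustration. The behaviour at the exact boundaries is an assumption: the table uses strict comparisons, and the sketch follows the "at least max_sectors_kb long" wording for the lower bound.

```python
def overall_behavior(total_kb: int, max_sectors_kb: int, write_behind_kb: int) -> str:
    """Classify a newly written file by its total size, per the table above."""
    if total_kb < max_sectors_kb:
        return "no writes"
    if total_kb <= write_behind_kb:  # boundary handling is an assumption
        return "write new files in background after close"
    return "streaming write, write tail at close"


if __name__ == "__main__":
    # 512 kB is a hypothetical max_sectors_kb; 16384 kB is the default window.
    for size_kb in (100, 4096, 1 << 20):
        print(size_kb, "kB ->", overall_behavior(size_kb, 512, 16 * 1024))
```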
>
> Special cases:
>
> * files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored
>
> * the writing cursor for O_APPEND is aligned to cover previous small appends.
>   Appends might happen via multiple files or via a new file each time.
>
> * mode vm.dirty_write_behind=1 ignores non-append writes.
>   It reacts only to completely sequential writes like copying files,
>   writing logs with O_APPEND or rewriting files after O_TRUNC.
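
The filtering rules in these special cases amount to a predicate like the one below. It is a userspace Python sketch for illustration only. The function and parameter names are hypothetical, the caller is assumed to have already classified the write as sequential, and "append-like" stands for the strictly sequential case (append, copy, rewrite after O_TRUNC).

```python
import os

# Open flags that opt a file out of write-behind. getattr() keeps the sketch
# importable on platforms that lack some of these constants.
IGNORED_FLAGS = (
    getattr(os, "O_DIRECT", 0) | getattr(os, "O_SYNC", 0) | getattr(os, "O_DSYNC", 0)
)


def write_behind_applies(mode: int, open_flags: int,
                         fadv_random: bool, append_like: bool) -> bool:
    """Decide whether a sequential write is subject to write-behind.

    mode        -- value of vm.dirty_write_behind (0, 1 or 2)
    open_flags  -- flags the file was opened with
    fadv_random -- True if POSIX_FADV_RANDOM was set on the file
    append_like -- True if the write is strictly sequential
    """
    if mode == 0:
        return False                 # disabled, the default
    if fadv_random or (open_flags & IGNORED_FLAGS):
        return False                 # explicitly ignored files
    if mode == 1:
        return append_like           # strictly sequential writes only
    return True                      # mode 2: all sequential writes


if __name__ == "__main__":
    flags = os.O_WRONLY | os.O_APPEND
    print(write_behind_applies(1, flags, False, True))
```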
>
> Note: the ext4 feature "auto_da_alloc" also writes out the cache when closing
> a file after truncating it to 0 and after renaming one file over another.
>
> Changes since v1 (2017-10-02):
> * rework window management:
> * change default window 1MiB -> 16MiB
> * change default request 256KiB -> max_sectors_kb
> * drop always-async behavior for O_NONBLOCK
> * drop handling POSIX_FADV_NOREUSE (should be in separate patch)
> * ignore writes with O_DIRECT, O_SYNC, O_DSYNC
> * align head position for O_APPEND
> * add strictly sequential mode
> * write tail pages for new files
> * make void, keep errors at mapping
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
> ---
[-- Attachment #2: test_writebehind.sh --]
[-- Type: application/x-shellscript, Size: 428 bytes --]