From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>, Michal Hocko <mhocko@suse.com>,
Dave Chinner <david@fromorbit.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2] mm: implement write-behind policy for sequential file writes
Date: Fri, 20 Sep 2019 10:39:33 +0300
Message-ID: <875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru>
In-Reply-To: <156896493723.4334.13340481207144634918.stgit@buzz>
[-- Attachment #1: Type: text/plain, Size: 4244 bytes --]
A script for a trivial demo is attached.
$ bash test_writebehind.sh
SIZE
3,2G dummy
vm.dirty_write_behind = 0
COPY
real 0m3.629s
user 0m0.016s
sys 0m3.613s
Dirty: 3254552 kB
SYNC
real 0m31.953s
user 0m0.002s
sys 0m0.000s
vm.dirty_write_behind = 1
COPY
real 0m32.738s
user 0m0.008s
sys 0m4.047s
Dirty: 2900 kB
SYNC
real 0m0.427s
user 0m0.000s
sys 0m0.004s
vm.dirty_write_behind = 2
COPY
real 0m32.168s
user 0m0.000s
sys 0m4.066s
Dirty: 3088 kB
SYNC
real 0m0.421s
user 0m0.004s
sys 0m0.001s
With vm.dirty_write_behind set to 1 or 2, files are written even faster
overall (copy plus sync takes about 33s instead of about 36s), and during
copying the amount of dirty memory always stays at around 16MiB.
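The totals behind that comparison can be checked by adding the copy and sync
wall-clock times from the runs above. A small sketch; the dictionary layout
and the helper name are my own and not part of the attached script:

```python
# Wall-clock ("real") seconds copied from the test output above,
# keyed by the vm.dirty_write_behind value used for the run.
runs = {
    0: {"copy": 3.629, "sync": 31.953},
    1: {"copy": 32.738, "sync": 0.427},
    2: {"copy": 32.168, "sync": 0.421},
}


def total_seconds(run):
    """Time until the data is on disk: copy plus the following sync."""
    return run["copy"] + run["sync"]


totals = {mode: round(total_seconds(run), 3) for mode, run in runs.items()}

for mode, total in sorted(totals.items()):
    print(f"vm.dirty_write_behind = {mode}: {total:.3f}s")
```

The point of summing both phases is that with write-behind disabled the copy
only fills the page cache, so the real cost shows up in the sync.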
On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for
> batching writes to save battery power. But for workloads where disk latency
> is important, this policy generates periodic disk load spikes, which
> increase latency for concurrent operations.
>
> Also, dirty pages in the file cache cannot be reclaimed and reused
> immediately. As a result, massive I/O such as file copying affects memory
> allocation latency.
>
> The present writeback engine allows tuning only the dirty data size or the
> expiration time. Such tuning cannot eliminate spikes; it just makes them
> lower and more frequent. Another option is switching to sync mode, which
> flushes written data right after each write; obviously this has a
> significant performance impact. Such tuning is also system-wide and affects
> memory-mapped and randomly written files, which flusher threads handle much
> better.
>
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback when a file has enough dirty pages.
>
> Global switch in sysctl vm.dirty_write_behind:
> =0: disabled, default
> =1: enabled for strictly sequential writes (append, copying)
> =2: enabled for all sequential writes
>
> The only parameter is the window size: the maximum amount of dirty pages
> behind the current position and the maximum amount of pages in background
> writeback.
>
> Setup is per-disk, in the sysfs file /sys/block/$DISK/bdi/write_behind_kb.
> Default: 16MiB; '0' disables write-behind for this disk.
>
> When the amount of unwritten pages exceeds the window size, write-behind
> starts background writeback for max(excess, max_sectors_kb) and then waits
> for the same amount of background writeback initiated previously.
>
> |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
> |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
>              current head-^    new head-^              file position-^
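The window step in the diagram can be modelled in user space. This is only a
sketch of my reading of the description, not the patch code: the function and
field names are hypothetical, all offsets are in KiB, the wait range is
assumed to start at the oldest in-flight offset, and the 1280 KiB request
size is just an illustrative max_sectors_kb value.

```python
from dataclasses import dataclass


@dataclass
class Step:
    wait: range    # previously started writeback to wait for
    send: range    # newly started background writeback
    new_head: int  # head position after this step


def write_behind_step(async_start, head, pos, window_kb, request_kb):
    """Return the ranges for one write-behind step, or None if the
    unwritten data behind `pos` still fits inside the window."""
    unwritten = pos - head
    if unwritten <= window_kb:
        return None
    excess = unwritten - window_kb
    # Send at least one full request, as in max(excess, max_sectors_kb).
    nr = max(excess, request_kb)
    # Assumption: never send more than is actually unwritten.
    nr = min(nr, unwritten)
    send = range(head, head + nr)
    # Assumption: wait for the same amount, starting at the oldest
    # in-flight offset and never reaching past the current head.
    wait = range(async_start, min(async_start + nr, head))
    return Step(wait=wait, send=send, new_head=head + nr)


# 16 MiB window, 4 MiB already in flight, 512 KiB over the window.
step = write_behind_step(async_start=0, head=4096, pos=20992,
                         window_kb=16384, request_kb=1280)
print(step)
```

In this example the excess (512 KiB) is smaller than the request size, so a
full 1280 KiB request is sent and the same amount of older writeback is
waited for.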
>
> Remaining tail pages are flushed when the file is closed if async
> write-behind was started, or if this is a new file and it is at least
> max_sectors_kb long.
>
> Overall behavior depending on total data size:
>   < max_sectors_kb  - no writes
>   > max_sectors_kb  - write new files in background after close
>   > write_behind_kb - streaming write, write tail at close
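The size table above can be written as a small classifier. A sketch under
stated assumptions: the function name and the example sizes are mine, the
middle case applies to new files only, and its boundary follows the "at least
max_sectors_kb long" wording from the paragraph before the table.

```python
def write_behind_behavior(total_kb, max_sectors_kb, write_behind_kb):
    """Classify the expected behavior by total data size (all in KiB)."""
    if total_kb > write_behind_kb:
        # Data exceeds the window: writeback streams during the write.
        return "streaming write, write tail at close"
    if total_kb >= max_sectors_kb:
        # Assumption: a new file; flushed in background once closed.
        return "write new files in background after close"
    return "no writes"


# Illustrative values: 1280 KiB requests, the default 16 MiB window.
for size_kb in (100, 4096, 3254552):
    print(size_kb, "->", write_behind_behavior(size_kb, 1280, 16384))
```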
>
> Special cases:
>
> * files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored
>
> * the write cursor for O_APPEND is aligned to cover previous small appends.
>   Appends might happen via multiple files or via a new file each time.
>
> * mode vm.dirty_write_behind=1 ignores non-append writes.
>   It reacts only to completely sequential writes like copying files,
>   writing logs with O_APPEND or rewriting files after O_TRUNC.
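The special cases above, combined with the sysctl modes, amount to a simple
eligibility check. The following is a user-space sketch only: the function
and its parameters (`fadv_random`, `extends_file`) are hypothetical names for
illustration, and the real decision is made inside the kernel.

```python
import os

# Flags that make write-behind ignore a file. getattr() keeps the sketch
# importable on platforms where a flag is not defined.
IGNORED_FLAGS = (getattr(os, "O_DIRECT", 0)
                 | getattr(os, "O_SYNC", 0)
                 | getattr(os, "O_DSYNC", 0))


def write_behind_applies(mode, open_flags, fadv_random=False,
                         extends_file=True):
    """mode is the vm.dirty_write_behind value (0, 1 or 2).
    extends_file says whether the write appends at the end of the file."""
    if mode == 0:
        return False                      # disabled (default)
    if open_flags & IGNORED_FLAGS or fadv_random:
        return False                      # O_DIRECT, O_[D]SYNC, FADV_RANDOM
    if mode == 1 and not extends_file:
        return False                      # strict mode: append-like only
    return True


print(write_behind_applies(1, os.O_WRONLY | os.O_APPEND))
print(write_behind_applies(1, os.O_WRONLY, extends_file=False))
```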
>
> Note: the ext4 feature "auto_da_alloc" also writes out the cache when
> closing a file after truncating it to 0 and after renaming one file over
> another.
>
> Changes since v1 (2017-10-02):
> * rework window management:
> * change default window 1MiB -> 16MiB
> * change default request 256KiB -> max_sectors_kb
> * drop always-async behavior for O_NONBLOCK
> * drop handling POSIX_FADV_NOREUSE (should be in separate patch)
> * ignore writes with O_DIRECT, O_SYNC, O_DSYNC
> * align head position for O_APPEND
> * add strictly sequential mode
> * write tail pages for new files
> * make void, keep errors at mapping
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
> ---
[-- Attachment #2: test_writebehind.sh --]
[-- Type: application/x-shellscript, Size: 428 bytes --]