From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>, Michal Hocko <mhocko@suse.com>,
	Dave Chinner <david@fromorbit.com>, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2] mm: implement write-behind policy for sequential file writes
Date: Fri, 20 Sep 2019 10:39:33 +0300
Message-ID: <875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru>
In-Reply-To: <156896493723.4334.13340481207144634918.stgit@buzz>

A script for a trivial demo is attached.

$ bash test_writebehind.sh
SIZE
3,2G	dummy
vm.dirty_write_behind = 0
COPY

real	0m3.629s
user	0m0.016s
sys	0m3.613s
Dirty:           3254552 kB
SYNC

real	0m31.953s
user	0m0.002s
sys	0m0.000s
vm.dirty_write_behind = 1
COPY

real	0m32.738s
user	0m0.008s
sys	0m4.047s
Dirty:              2900 kB
SYNC

real	0m0.427s
user	0m0.000s
sys	0m0.004s
vm.dirty_write_behind = 2
COPY

real	0m32.168s
user	0m0.000s
sys	0m4.066s
Dirty:              3088 kB
SYNC

real	0m0.421s
user	0m0.004s
sys	0m0.001s


With vm.dirty_write_behind set to 1 or 2, the files are written even faster
overall (copy plus sync), and during copying the amount of dirty memory
always stays at around 16MiB.
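
To make the "even faster" claim concrete, the "real" times from the output
above can be added up per mode. This is only arithmetic on the numbers
already shown; the dictionary layout and function name are mine.

```python
# "real" wall-clock times in seconds, copied from the demo output above.
runs = {
    0: {"copy": 3.629, "sync": 31.953},
    1: {"copy": 32.738, "sync": 0.427},
    2: {"copy": 32.168, "sync": 0.421},
}


def total_seconds(mode):
    """Time until the data is on disk: the copy plus the following sync."""
    r = runs[mode]
    return r["copy"] + r["sync"]


for mode in sorted(runs):
    saved = total_seconds(0) - total_seconds(mode)
    print(f"mode {mode}: total {total_seconds(mode):.3f}s, "
          f"saved vs mode 0: {saved:.3f}s")
```

With write-behind the time simply moves from the sync into the copy, and
the sum comes out a couple of seconds lower on this run.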


On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for
> batching writes to save battery power. But for workloads where disk
> latency is important, this policy generates periodic disk load spikes
> which increase latency for concurrent operations.
> 
> Also, dirty pages in the file cache cannot be reclaimed and reused
> immediately. This way massive I/O such as file copying affects memory
> allocation latency.
> 
> The present writeback engine allows tuning only the dirty data size or
> expiration time. Such tuning cannot eliminate spikes - it just lowers and
> multiplies them. The other option is switching to sync mode, which flushes
> written data right after each write; obviously this has a significant
> performance impact. Such tuning is system-wide and also affects
> memory-mapped and randomly written files, which flusher threads handle
> much better.
> 
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback when a file has enough dirty pages.
> 
> Global switch in sysctl vm.dirty_write_behind:
> =0: disabled, default
> =1: enabled for strictly sequential writes (append, copying)
> =2: enabled for all sequential writes
> 
> The only parameter is the window size: the maximum amount of dirty pages
> behind the current position and the maximum amount of pages in background
> writeback.
> 
> Setup is per-disk in sysfs, in the file
> /sys/block/$DISK/bdi/write_behind_kb.
> Default: 16MiB, '0' disables write-behind for this disk.
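
For anyone who wants to inspect the two knobs from a script, here is a
small sketch. It assumes a kernel with this patch applied; on any other
kernel neither file exists and the helper returns None. The procfs path is
just the usual mapping of the sysctl name, and the function names are mine.

```python
from pathlib import Path

# sysctl vm.dirty_write_behind as seen through procfs (patched kernels only).
SYSCTL_MODE = Path("/proc/sys/vm/dirty_write_behind")

# Meaning of each mode, as listed in the patch description.
MODES = {
    0: "disabled, default",
    1: "enabled for strictly sequential writes (append, copying)",
    2: "enabled for all sequential writes",
}


def window_path(disk, sysfs_root="/sys"):
    """Per-disk window size file, e.g. disk='sda'."""
    return Path(sysfs_root) / "block" / disk / "bdi" / "write_behind_kb"


def read_int(path):
    """Return the integer stored in path, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(disk):
    """Collect the current mode and window for one disk."""
    mode = read_int(SYSCTL_MODE)
    return {
        "mode": mode,
        "mode_meaning": MODES.get(mode, "unknown or not supported"),
        "write_behind_kb": read_int(window_path(disk)),
    }
```

Calling describe("sda") on an unpatched kernel gives None for both values,
which is a cheap way to check whether the patch is present.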
> 
> When the amount of unwritten pages exceeds the window size, write-behind
> starts background writeback for max(excess, max_sectors_kb) and then waits
> for the same amount of background writeback initiated previously.
> 
>   |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
>   |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
>                current head-^    new head-^              file position-^
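
The window rule and the diagram above can be modelled in a few lines. This
is a toy model of one possible reading of the description, not the patch
code: it counts kB instead of pages, request_kb stands in for
max_sectors_kb with an arbitrary value of 512, and capping the wait at
whatever is actually in flight is my assumption.

```python
from dataclasses import dataclass


@dataclass
class WriteBehindModel:
    window_kb: int = 16 * 1024  # write_behind_kb default from the description
    request_kb: int = 512       # stands in for max_sectors_kb (assumed value)
    pos_kb: int = 0             # file position, end of the last write
    head_kb: int = 0            # data before this was sent to writeback
    done_kb: int = 0            # data before this has been waited for

    def write(self, length_kb):
        """Account for one sequential write; return (sent_kb, waited_kb)."""
        self.pos_kb += length_kb
        pending = self.pos_kb - self.head_kb
        if pending <= self.window_kb:
            return (0, 0)  # still inside the window, nothing to do
        excess = pending - self.window_kb
        # "send-this": at least one request, never more than what is pending.
        send = min(max(excess, self.request_kb), pending)
        # "wait-this": same amount, taken from the oldest data in flight.
        in_flight = self.head_kb - self.done_kb
        wait = min(send, in_flight)
        self.head_kb += send
        self.done_kb += wait
        return (send, wait)


model = WriteBehindModel()
history = [model.write(1024) for _ in range(18)]
print(history[15], history[16], history[17])
# prints: (0, 0) (1024, 0) (1024, 1024)
```

The first sixteen 1MiB writes just fill the window. The seventeenth sends
the first 1MiB and has nothing older to wait for, and from then on each
write sends 1MiB and waits for the 1MiB sent before it, so the unwritten
part stays at the window size.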
> 
> The remaining tail pages are flushed when the file is closed if async
> write-behind was started, or if this is a new file and it is at least
> max_sectors_kb long.
> 
> Overall behavior depending on total data size:
> <  max_sectors_kb  - no writes
> >  max_sectors_kb  - write new files in background after close
> >  write_behind_kb - streaming write, write tail at close
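
The three size classes above can be written down as a small lookup. This
is a sketch for illustration only: the function name is made up, sizes are
in kB, write_behind_kb is assumed to be larger than max_sectors_kb, and a
file of exactly max_sectors_kb is put into the second class, following the
"at least max_sectors_kb long" wording of the tail rule.

```python
def overall_behavior(total_kb, max_sectors_kb, write_behind_kb=16 * 1024):
    """Map the total amount of written data to the behaviour listed above."""
    if total_kb > write_behind_kb:
        return "streaming write, write tail at close"
    if total_kb >= max_sectors_kb:
        return "write new files in background after close"
    return "no writes"


# 512 is only an example value for max_sectors_kb; it depends on the device.
for size_kb in (4, 1024, 32 * 1024):
    print(size_kb, "->", overall_behavior(size_kb, max_sectors_kb=512))
```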
> 
> Special cases:
> 
> * files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored
> 
> * the writing cursor for O_APPEND is aligned to cover previous small appends
>    Appends might happen via multiple files or via a new file each time.
> 
> * mode vm.dirty_write_behind=1 ignores non-append writes
>    This reacts only to completely sequential writes like copying files,
>    writing logs with O_APPEND or rewriting files after O_TRUNC.
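
The special cases above boil down to a filter in front of the window
logic. The sketch below is a user-space illustration, not the kernel check:
the function and its arguments are hypothetical, "append" is taken to mean
a write that starts at the current end of file, and letting mode 2 accept
appends as well as writes that continue the previous one is my assumption.

```python
import os

# Open flags for which write-behind is skipped, per the list above.
IGNORED_FLAGS = (
    getattr(os, "O_DIRECT", 0)
    | getattr(os, "O_SYNC", 0)
    | getattr(os, "O_DSYNC", 0)
)


def write_behind_applies(mode, open_flags, offset, prev_end, file_size,
                         fadv_random=False):
    """Decide whether one write would be tracked by write-behind.

    mode        value of vm.dirty_write_behind (0, 1 or 2)
    offset      where this write starts
    prev_end    where the previous write through this file ended
    file_size   file size before this write
    fadv_random True if POSIX_FADV_RANDOM was set on the file
    """
    if mode == 0 or fadv_random:
        return False
    if open_flags & IGNORED_FLAGS:
        return False
    is_append = offset == file_size
    if mode == 1:
        return is_append  # strictly sequential: append-style writes only
    return is_append or offset == prev_end
```

For example, a rewrite in the middle of a file that continues the previous
write is tracked in mode 2 but ignored in mode 1.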
> 
> Note: the ext4 feature "auto_da_alloc" also writes out the cache when
> closing a file after truncating it to 0 and after renaming one file over
> another.
> 
> Changes since v1 (2017-10-02):
> * rework window management:
> * change default window 1MiB -> 16MiB
> * change default request 256KiB -> max_sectors_kb
> * drop always-async behavior for O_NONBLOCK
> * drop handling POSIX_FADV_NOREUSE (should be in separate patch)
> * ignore writes with O_DIRECT, O_SYNC, O_DSYNC
> * align head position for O_APPEND
> * add strictly sequential mode
> * write tail pages for new files
> * make void, keep errors at mapping
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
> ---

[-- Attachment #2: test_writebehind.sh --]
[-- Type: application/x-shellscript, Size: 428 bytes --]
