From: Andres Freund <andres@anarazel.de>
To: linux-fsdevel@vger.kernel.org
Cc: linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?
Date: Tue, 29 Dec 2020 22:28:19 -0800
Message-ID: <20201230062819.yinrrp6uwfegsqo3@alap3.anarazel.de>

Hi,

For things like database journals, using fallocate(0) is not sufficient:
writing into the pre-allocated space with O_DIRECT | O_DSYNC writes
requires the unwritten extents to be converted, which in turn requires
journal operations.

The performance difference in a journalling workload (lots of
sequential, low-iodepth, often small, writes) is quite remarkable. Even
on quite fast devices:

    andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
    /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

    andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s

    andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
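
To put the dd numbers above in per-write terms, here is a small
back-of-the-envelope sketch. It only uses the record count and elapsed
times from the first pair of runs; the helper name per_write_us is made
up for illustration.

```python
# Per-write latency implied by the dd output above.
# WRITES is the record count dd reported (1 GiB in 4096-byte writes).
WRITES = 262144


def per_write_us(seconds, writes=WRITES):
    """Average time per write in microseconds (hypothetical helper)."""
    return seconds / writes * 1e6


# Elapsed times copied from the first two dd runs (after plain fallocate).
first_pass = per_write_us(117.587)   # writes into unwritten extents
second_pass = per_write_us(3.69125)  # overwrites of already-written extents
ratio = first_pass / second_pass

print(f"first pass:  {first_pass:.1f} us per write")
print(f"second pass: {second_pass:.1f} us per write")
print(f"slowdown:    {ratio:.1f}x")
```

These are averages over the whole run, so they say nothing about the
latency distribution, only about the overall cost per 4 KiB write.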


The way around that, from a database's perspective, is obviously to just
overwrite the file "manually" after fallocate()ing it, utilizing larger
writes, and then to recycle the file.
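
A minimal sketch of that workaround, for illustration only. The function
name preallocate_and_zero and the 1 MiB chunk size are assumptions, not
taken from any particular database. It uses plain buffered writes
followed by fsync() rather than O_DIRECT, to avoid buffer alignment
requirements in a short example.

```python
import os


def preallocate_and_zero(path, size, chunk=1024 * 1024):
    """Reserve `size` bytes for `path`, then overwrite them with zeros.

    Hypothetical helper: the explicit overwrite is what turns the
    preallocated (unwritten) extents into written ones, so that later
    small O_DSYNC writes do not need an extent conversion.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the space up front (roughly what fallocate(0) does).
        os.posix_fallocate(fd, 0, size)

        zeros = bytes(chunk)
        offset = 0
        while offset < size:
            n = min(chunk, size - offset)
            # pwrite may write fewer bytes than requested; advance by
            # what was actually written.
            offset += os.pwrite(fd, zeros[:n], offset)

        # Make sure the zeros and the extent state reach stable storage
        # before the file is used as a journal.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Once a file has been initialized like this, it can be reused in place
(recycled) instead of being deleted and recreated, which is the point of
the approach described above.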


But that's a fair bit of unnecessary IO from userspace, and it's IO that
the kernel can do more efficiently on a number of types of block
devices, e.g. by utilizing write-zeroes.


Which brings me to $subject:

Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
doesn't convert extents into unwritten extents, but instead uses
blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
myself, but ...

Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
sense, as that'd work reasonably efficiently to initialize newly
allocated space as well as for zeroing out previously used file space.


As blkdev_issue_zeroout() already has a fallback path it seems this
should be doable without too much concern for which devices have write
zeroes, and which do not?

Greetings,

Andres Freund

Thread overview: 19+ messages
2020-12-30  6:28 Andres Freund [this message]
2021-01-04 18:19 ` fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents? Darrick J. Wong
2021-01-04 19:10   ` Andres Freund
2021-01-04 19:57     ` Avi Kivity
2021-01-12 18:16       ` Christoph Hellwig
2021-01-12 18:39         ` Andreas Dilger
2021-01-12 18:43           ` Christoph Hellwig
2021-01-12 18:51             ` Andreas Dilger
2021-01-12 21:14               ` Darrick J. Wong
2021-01-12 21:36                 ` Andres Freund
2021-01-13  7:44                   ` Avi Kivity
2021-01-19  3:44                     ` Andreas Dilger
2021-01-04 19:17 ` Theodore Ts'o
2021-01-04 19:24   ` Matthew Wilcox
2021-01-04 20:29   ` Andres Freund
2021-01-04 22:40   ` Eric Sandeen
2021-01-06 22:52 ` Dave Chinner
2021-01-06 23:40   ` Andres Freund
2021-01-08 20:32     ` Dave Chinner
