From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andres Freund <andres@anarazel.de>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?
Date: Mon, 4 Jan 2021 10:19:58 -0800
Message-ID: <20210104181958.GE6908@magnolia>
In-Reply-To: <20201230062819.yinrrp6uwfegsqo3@alap3.anarazel.de>
On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
>
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
>
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small, writes) is quite remarkable. Even
> on quite fast devices:
>
> andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
> /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
>
> andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
>
> andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
>
>
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
>
>
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
>
>
> Which brings me to $subject:
>
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> myself, but ...
>
> Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
> sense, as that'd work reasonably efficiently to initialize newly
> allocated space as well as for zeroing out previously used file space.
>
>
> As blkdev_issue_zeroout() already has a fallback path it seems this
> should be doable without too much concern for which devices have write
> zeroes, and which do not?

Question: do you want the kernel to write zeroes even for devices that
don't support accelerated zeroing?  Since I assume that if the fallocate
fails you'll fall back to writing zeroes from userspace anyway...
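
For illustration only, a minimal sketch of that try-then-fall-back pattern
in userspace, written in Python via ctypes and assuming 64-bit Linux with
glibc. The helper name zero_file and the 1 MiB chunk size are hypothetical.
FALLOC_FL_ZERO_RANGE stands in for the proposed flag, which does not exist;
on xfs and ext4 today it may still leave unwritten extents, so only the
fallback branch actually writes the blocks.

```python
import ctypes
import os

FALLOC_FL_ZERO_RANGE = 0x10  # value from <linux/falloc.h>

# Assumption: 64-bit Linux, where off_t is 64 bits wide.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def zero_file(fd, length, chunk=1 << 20):
    """Zero bytes [0, length) of fd and report which path did the work."""
    # First ask the kernel to zero the range.
    if _libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, length) == 0:
        return "fallocate"
    # Any failure (e.g. EOPNOTSUPP): overwrite with zeroes from
    # userspace, using large writes rather than 4k ones.
    buf = bytes(chunk)
    offset = 0
    while offset < length:
        n = min(chunk, length - offset)
        offset += os.pwrite(fd, buf[:n], offset)
    os.fsync(fd)
    return "write"
```

A caller would open the journal file, call zero_file(fd, size) once, and
then recycle the file. The return value only says which branch ran; it
does not say whether the device used accelerated zeroing.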

Second question: Would it help to have a FALLOC_FL_DRY_RUN flag that
could be used to probe if a file supports fallocate without actually
changing anything?  I'm (separately) pursuing a fix for the loop device
not being able to figure out if a file actually supports a particular
fallocate mode.

--D

> Greetings,
>
> Andres Freund