linux-xfs.vger.kernel.org archive mirror
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Andres Freund <andres@anarazel.de>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?
Date: Mon, 4 Jan 2021 10:19:58 -0800
Message-ID: <20210104181958.GE6908@magnolia>
In-Reply-To: <20201230062819.yinrrp6uwfegsqo3@alap3.anarazel.de>

On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
> 
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
> 
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small, writes) is quite remarkable. Even
> on quite fast devices:
> 
>     andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
>     /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
> 
>     andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
> 
>     andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
> 
>     andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
>     262144+0 records in
>     262144+0 records out
>     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
> 
> 
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
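
A minimal sketch of that manual workaround, for concreteness. The function
name, the 1 MiB chunk size and the file name are illustrative assumptions,
not taken from any particular database. It reserves the space, overwrites it
with zeroes using large writes, and fsyncs, so that later small
O_DIRECT | O_DSYNC overwrites should no longer need an unwritten-extent
conversion:

```python
import os


def preallocate_and_zero(path, size, chunk=1 << 20):
    """Reserve `size` bytes for `path`, then overwrite them with zeroes.

    Hypothetical helper: the large explicit writes are what turn the
    preallocated (unwritten) extents into written ones.
    """
    zeroes = bytes(chunk)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        # Reserve the space up front (same effect as fallocate(fd, 0, ...)).
        os.posix_fallocate(fd, 0, size)
        offset = 0
        while offset < size:
            n = min(chunk, size - offset)
            # pwrite() may write fewer bytes than asked; advance by the result.
            offset += os.pwrite(fd, zeroes[:n], offset)
        # Make the zeroes (and the extent state) durable before first use.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Something like preallocate_and_zero("journal.000", 1 << 30) would be run once
when the journal file is created; recycling the file afterwards avoids paying
for it again.
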
> 
> 
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
> 
> 
> Which brings me to $subject:
> 
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
> myself, but ...
> 
> Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
> sense, as that'd work reasonably efficiently to initialize newly
> allocated space as well as for zeroing out previously used file space.
> 
> 
> As blkdev_issue_zeroout() already has a fallback path it seems this
> should be doable without too much concern for which devices have write
> zeroes, and which do not?

Question: do you want the kernel to write zeroes even for devices that
don't support accelerated zeroing?  Since I assume that if the fallocate
fails you'll fall back to writing zeroes from userspace anyway...
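
A rough sketch of that fallback pattern, under loud assumptions: the proposed
flag does not exist, so the existing FALLOC_FL_ZERO_RANGE (0x10 in
<linux/falloc.h>) stands in for it here, and on xfs/ext4 that flag may still
leave unwritten extents behind. Only the control flow is the point. The
helper name is hypothetical, and the ctypes signature assumes 64-bit Linux
where off_t is 64 bits:

```python
import ctypes
import errno
import os

# Stand-in for the proposed flag; value from <linux/falloc.h>.
FALLOC_FL_ZERO_RANGE = 0x10

# Assumes Linux with a libc that exports fallocate(2) and a 64-bit off_t.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def zero_range(fd, offset, length, chunk=1 << 20):
    """Zero [offset, offset + length) and report who did the work.

    Returns "kernel" if fallocate() succeeded, "userspace" if the mode is
    unsupported and the zeroes were written by hand instead.
    """
    if _libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) == 0:
        return "kernel"
    err = ctypes.get_errno()
    if err not in (errno.EOPNOTSUPP, errno.ENOSYS):
        # A real failure (ENOSPC, EBADF, ...), not a missing feature.
        raise OSError(err, os.strerror(err))
    zeroes = bytes(chunk)
    end = offset + length
    while offset < end:
        n = min(chunk, end - offset)
        offset += os.pwrite(fd, zeroes[:n], offset)
    return "userspace"
```

With a caller shaped like this, a kernel that refuses to zero slowly costs
the application nothing it was not already prepared to do itself.
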

Second question: Would it help to have a FALLOC_FL_DRY_RUN flag that
could be used to probe if a file supports fallocate without actually
changing anything?  I'm (separately) pursuing a fix for the loop device
not being able to figure out if a file actually supports a particular
fallocate mode.

--D

> Greetings,
> 
> Andres Freund

Thread overview: 19+ messages
2020-12-30  6:28 fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents? Andres Freund
2021-01-04 18:19 ` Darrick J. Wong [this message]
2021-01-04 19:10   ` Andres Freund
2021-01-04 19:57     ` Avi Kivity
2021-01-12 18:16       ` Christoph Hellwig
2021-01-12 18:39         ` Andreas Dilger
2021-01-12 18:43           ` Christoph Hellwig
2021-01-12 18:51             ` Andreas Dilger
2021-01-12 21:14               ` Darrick J. Wong
2021-01-12 21:36                 ` Andres Freund
2021-01-13  7:44                   ` Avi Kivity
2021-01-19  3:44                     ` Andreas Dilger
2021-01-04 19:17 ` Theodore Ts'o
2021-01-04 19:24   ` Matthew Wilcox
2021-01-04 20:29   ` Andres Freund
2021-01-04 22:40   ` Eric Sandeen
2021-01-06 22:52 ` Dave Chinner
2021-01-06 23:40   ` Andres Freund
2021-01-08 20:32     ` Dave Chinner
