* fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?
From: Andres Freund @ 2020-12-30  6:28 UTC
  To: linux-fsdevel; +Cc: linux-xfs, linux-ext4, linux-block

Hi,

For things like database journals, using fallocate(0) is not sufficient,
as writing into the pre-allocated space with O_DIRECT | O_DSYNC requires
the unwritten extents to be converted, which in turn requires filesystem
journal operations.
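
To make the pattern concrete, here is a minimal sketch of it in C. It is
an illustration under stated assumptions, not code from any particular
database: the function name, the 4096-byte buffer alignment, and the
requirement that len be a multiple of blksz are all made up for the
example. It uses posix_fallocate(), which on glibc normally ends up as
fallocate() with mode 0; on filesystems without fallocate support glibc
emulates it by writing, in which case no unwritten extents are involved.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* pwrite, posix_fallocate, posix_memalign */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate `len` bytes, then fill the file with
 * sequential `blksz`-sized writes. `extra_flags` is OR-ed into open().
 * Assumes `len` is a multiple of `blksz`.
 * Returns 0 on success, -1 on any error.
 */
int prealloc_then_write(const char *path, off_t len, size_t blksz,
                        int extra_flags)
{
    void *buf = NULL;
    int ret = -1;

    int fd = open(path, O_CREAT | O_WRONLY | extra_flags, 0644);
    if (fd < 0)
        return -1;

    /* Reserve the space; with real fallocate support the extents are
     * allocated but still marked unwritten. */
    if (posix_fallocate(fd, 0, len) != 0)
        goto out;

    /* Direct I/O needs an aligned buffer; 4096 is an assumed alignment. */
    if (posix_memalign(&buf, 4096, blksz) != 0)
        goto out;
    memset(buf, 0, blksz);

    for (off_t off = 0; off < len; off += (off_t)blksz) {
        /* The first write to each block is the one that has to convert
         * an unwritten extent. Short writes are treated as errors. */
        if (pwrite(fd, buf, blksz, off) != (ssize_t)blksz)
            goto out;
    }
    ret = 0;
out:
    free(buf);
    close(fd);
    return ret;
}
```

To mimic the journal case, pass O_DIRECT | O_DSYNC as extra_flags
(O_DIRECT needs _GNU_SOURCE on Linux and a filesystem that supports
direct I/O). Calling the function a second time on the same file then
exercises the pure-overwrite path.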

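The performance difference in a journalling workload (lots of
sequential, low-iodepth, often small writes) is quite remarkable, even
on quite fast devices. In each pair of dd runs below, the first pass
writes into freshly fallocate()d space and the second pass overwrites
blocks that have already been written: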

    andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
    /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

    andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s

    andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s

    andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s


The way around that, from a database's perspective, is obviously to just
overwrite the file "manually" after fallocate()ing it, utilizing larger
writes, and then to recycle the file.
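
A minimal sketch of that workaround, again in C and again only an
illustration: the function name and the chunk-size parameter are
assumptions, and a real implementation would pick its own write size and
error handling. The idea is to preallocate, overwrite the whole range
with zeroes in large buffered writes, and fsync once, so that later
small O_DIRECT | O_DSYNC writes land on already-written extents.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* pwrite, posix_fallocate */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate `len` bytes and overwrite them with
 * zeroes using writes of up to `chunk` bytes, then fsync.
 * Returns 0 on success, -1 on any error.
 */
int prezero_file(const char *path, off_t len, size_t chunk)
{
    char *buf = NULL;
    int ret = -1;

    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    /* Reserve the space up front so the allocation is done in one go. */
    if (posix_fallocate(fd, 0, len) != 0)
        goto out;

    buf = calloc(1, chunk);          /* zero-filled write buffer */
    if (buf == NULL)
        goto out;

    off_t off = 0;
    while (off < len) {
        off_t left = len - off;
        size_t want = left < (off_t)chunk ? (size_t)left : chunk;
        ssize_t n = pwrite(fd, buf, want, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted, retry the same range */
            goto out;
        }
        if (n == 0)
            goto out;                /* no progress, give up */
        off += n;                    /* short writes just advance less */
    }

    /* Make sure the zeroes are on stable storage before the file is
     * handed to the journal writer. */
    if (fsync(fd) != 0)
        goto out;
    ret = 0;
out:
    free(buf);
    close(fd);
    return ret;
}
```

For example, prezero_file("journal.0", 1024 * 1024 * 1024, 1024 * 1024)
would prepare a 1 GiB file using 1 MiB writes; the file name here is made
up. Recycling then means reusing that file for the next journal segment
rather than creating and zeroing a new one.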


But that's a fair bit of unnecessary IO from userspace, and it's IO that
the kernel can do more efficiently on a number of types of block
devices, e.g. by utilizing write-zeroes.


Which brings me to $subject:

Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
doesn't convert extents into unwritten extents, but instead uses
blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
myself, but ...

Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
sense, as that'd work reasonably efficiently to initialize newly
allocated space as well as for zeroing out previously used file space.


As blkdev_issue_zeroout() already has a fallback path, it seems this
should be doable without too much concern for which devices support
write zeroes and which do not?

Greetings,

Andres Freund
