From: Dave Chinner <david@fromorbit.com>
To: "Theodore Ts'o" <tytso@mit.edu>,
	Chris Mason <chris.mason@fusionio.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ric Wheeler <rwheeler@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
Date: Tue, 11 Dec 2012 11:52:17 +1100
Message-ID: <20121211005217.GR15784@dastard>
In-Reply-To: <20121210173739.GA1359@thunk.org>

On Mon, Dec 10, 2012 at 12:37:39PM -0500, Theodore Ts'o wrote:
> On Sat, Dec 08, 2012 at 11:17:05AM +1100, Dave Chinner wrote:
> > I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
> > The concept, however, implemented by a new fallocate()
> > flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
> > the application considers unwritten extents undesirable is exactly
> > the sort of thing that we should be considering implementing.
> 
> What's the point of using a new flag like this (or XFS's
> XFS_IOC_ALLOCSP) for writing zeros during preallocation as opposed to
> simply doing a fallocate() followed by zeroing the data via an O_DIRECT
> write system call?

There is no window where stale data is exposed to userspace, which
is unavoidable if you are doing the zeroing from userspace. If
the system crashes after allocation but before zeroing, how do you
recover that? The filesystem can ensure the allocation transactions
are not committed until the allocated extents are zeroed....
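
For reference, the userspace sequence being compared against might
look like the minimal sketch below. It is an illustration only, not
code from this thread: the function name and chunk size are made up,
it uses posix_fallocate() via Python's os module, and it uses ordinary
buffered writes plus fsync() rather than O_DIRECT (which needs aligned
buffers). The point is that allocation and zeroing are two separate
steps, with nothing tying them together across a crash.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB of zeros per write; an arbitrary choice


def preallocate_then_zero(path, size):
    """Hypothetical helper: allocate space, then zero it from userspace."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        # Step 1: reserve the space. How the filesystem represents the
        # new blocks is its own decision.
        os.posix_fallocate(fd, 0, size)

        # If the system crashes at this point, the range is allocated
        # but has not been zeroed by us. This is the window in question.

        # Step 2: overwrite the whole range with zeros.
        zeros = bytes(CHUNK)
        offset = 0
        while offset < size:
            want = min(CHUNK, size - offset)
            offset += os.pwrite(fd, zeros[:want], offset)

        # Make sure the zeros have reached stable storage before returning.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "prealloc.bin")
        preallocate_then_zero(target, 4 * CHUNK)
        print(os.path.getsize(target))
```

An in-filesystem implementation can instead hold back the allocation
transaction until the zeroing has completed, which is what closes the
window.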

> > Indeed, if the filesystem is on something with WRITE_SAME or
> > discards to zero, no data would need to be written, you wouldn't
> > have any unwritten extent overhead, and no stale data exposure.
> 
> And if you have a storage device which supports WRITE_SAME or
> persistent discards, you can do this automatically at preallocation
> time without needing a new fallocate(2) flag.

Because there are cases where unwritten extents are preferable or
WRITE_SAME functionality is unavailable.

Indeed, if we take the case of file-per-frame, uncompressed
real-time video ingest, I'm going to be wanting to use unwritten
extents to preallocate files in a known pattern with as little
latency as possible. If we are talking about 4k uncompressed video,
that's a data rate of 1.2GB/s for 24fps, and studios are now
shooting in 48fps in 3D, which gives a real-time data rate of
roughly 5GB/s.  This has an acceptable IO latency *per-frame* of
roughly 10-20ms, which we can do easily with unwritten extents. We
can preallocate thousands of files, look at their layout via fiemap
and select the order in which we write to them based on where they
were allocated.
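
The numbers above can be checked with a few lines of arithmetic. This
is only a sketch: it takes the 1.2GB/s at 24fps figure as given,
assumes the rate scales linearly with frame rate, and assumes 3D means
two views per frame. The function names are made up for illustration.

```python
BASE_RATE_GB_S = 1.2  # 4k uncompressed at 24fps, the figure quoted above
BASE_FPS = 24


def stream_rate_gb_s(fps, views):
    """Aggregate data rate, assuming linear scaling with fps and views."""
    return BASE_RATE_GB_S * (fps / BASE_FPS) * views


def frame_interval_ms(fps):
    """Time between consecutive frames at the given frame rate."""
    return 1000.0 / fps


def frame_size_mb():
    """Size of one frame of one view, using decimal units (1GB = 1000MB)."""
    return BASE_RATE_GB_S * 1000.0 / BASE_FPS


if __name__ == "__main__":
    print(f"48fps, 2 views: {stream_rate_gb_s(48, 2):.1f} GB/s")
    print(f"frame interval at 48fps: {frame_interval_ms(48):.1f} ms")
    print(f"per-view frame size: {frame_size_mb():.0f} MB")
```

Under those assumptions the 48fps 3D case works out to 4.8GB/s, and a
new frame arrives about every 21ms, so a per-frame IO latency of
10-20ms only just fits inside the frame interval.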

There is no way in hell that WRITE_SAME can be used for these sorts
of workloads, because preallocation then places a load on the storage
device that affects the latency of the real-time data stream.
Unwritten extent conversion is an after-the-fact overhead in these
cases that doesn't impact on the data stream throughput or latency.

IOWs, there are clear cases where discard optimisations will be
actively harmful to the workload using preallocation for performance
reasons. Hence, I'm not about to make the existing fallocate code in
XFS stop using unwritten extents by default even if the underlying
device supports WRITE_SAME.

> I certainly don't
> oppose adding such optimizations to ext4 or any other file system (I'm
> not entirely convinced that it's worth it to do this optimization at
> the VFS level), but it doesn't help for storage devices that don't
> support this feature.

Sure, this optimisation is a per-filesystem decision. ext4 is
unlikely to be used in the sorts of high-end environments we see XFS
being used in, so it might make sense for you to make it use
WRITE_SAME by default if it is supported.

This, however, does not change the fact that there are existing
applications using fallocate that absolutely do not want
preallocation to use WRITE_SAME semantics. It's not an optimisation
if it breaks a significant portion of your userbase's applications.
Hence adding a flag to allow applications to specify they want
WRITE_SAME preallocation behaviour rather than unwritten extents
makes sense.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
