linux-kernel.vger.kernel.org archive mirror
From: Joel Becker <jlbec@evilplan.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Ric Wheeler <rwheeler@redhat.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Chris Mason <chris.mason@fusionio.com>,
	Chris Mason <clmason@fusionio.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
Date: Fri, 7 Dec 2012 18:52:51 -0800
Message-ID: <20121208025250.GP22789@localhost>
In-Reply-To: <20121208003936.GP27172@dastard>

On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > >>The other things that I think we should try would be to convert over
> > >>larger chunks as we discussed on the list back in the summer (just
> > >>because the user writes 4KB does not mean that we cannot flip over
> > >>1MB and zero that).
> > >Writing a megabyte is not free.  If you assume that your HDD has a
> > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > >take 8-10ms.  It might be a win if you amortize it over a large number
> > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > >(99.9 percentile latency numbers matters because eventually you'll
> > >have a user request which hits multiple serial long latency
> > >operations, and then the delay looks **really** user visible.)
> > >
> > >	    	     	       	     		- Ted
> > 
> > Writing 4KB at a time to a disk costs XX units of time.
> > 
> > Writing to the same sector (especially for an HDD) costs XX units + a small amount.
> > 
> > I suggest that we try it out.
> > 
> > For SSD's, much better to use specific HW offload commands if
> > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > performance boost since no actual data is moved...
> 
> Yup, that could be done quite trivially in XFS. Just mark the
> preallocated extents as "busy" rather than unwritten, mark the
> transaction as synchronous and the transaction commit will issue a
> discard on the preallocated ranges before returning to userspace.
> The extra overhead to the preallocation command is unlikely to be
> noticed, and unwritten extent conversion overhead just goes away...
> 
> No fallocate() API changes necessary, though I think it would be
> better if the user application gave a hint that it preferred "writing
> zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> as there are workloads where one will always be clearly better than
> the other...

	Wait, I missed something.  We're letting fallocate be dumb?
Let's not do that, then.
	Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
extent if it is 1MB or less, or some MB multiple if it is large enough
to slice it.  This is for very similar reasons to unwritten clearing,
with the added benefit of less fragmentation from CoW.
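
A rough sketch of the chunking rule described above, under stated assumptions: the function name `cow_range`, the byte-based arguments, and the choice to align chunks relative to the start of the extent are all hypothetical and are not taken from the ocfs2 source. It only illustrates "the whole extent if it is 1MB or less, otherwise a 1MB multiple around the write".

```python
MB = 1 << 20  # CoW granularity mentioned in the text


def cow_range(extent_start, extent_len, write_off, write_len, chunk=MB):
    """Return (start, length) of the byte range to copy for one write.

    Hypothetical helper: chunk boundaries are measured from the start of
    the extent, which is an assumption made for this sketch.
    """
    if extent_len <= chunk:
        # Small extent: copy all of it in one go.
        return extent_start, extent_len
    extent_end = extent_start + extent_len
    # Round the start of the write down to a chunk boundary.
    lo = extent_start + ((write_off - extent_start) // chunk) * chunk
    # Round the end of the write up to a chunk boundary (ceiling division),
    # then clamp so the range never runs past the extent.
    hi_rel = -(-(write_off + write_len - extent_start) // chunk) * chunk
    hi = min(extent_start + hi_rel, extent_end)
    return lo, hi - lo
```

A 4KB write into a 4MB extent would therefore copy the single 1MB chunk that contains it, and a write that straddles a chunk boundary would copy both neighbouring chunks.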
	On spinning media, any read/write of up to 1MB carries roughly
the same penalty as reading/writing a sector.  You're already paying the
seek.  On SSD, WRITE_SAME is *way* better than leaking data.
	At the end of the day, you have to pay for zeroing.  You can do
it up front, or you can do it at write time.  A certain large commercial
database takes advantage of fallocate+unwritten by getting large swaths
of contiguous storage; it then writes to the whole space before using
it.  This allows the allocation benefits of fallocate, doesn't pay for
unneeded zeros, and yet performs correctly at runtime.
	We should not be leaking data so that we can be lazy.
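
A minimal sketch of the allocate-then-initialize pattern described above, assuming a Linux system and Python's `os.posix_fallocate`. The helper name `preallocate_and_initialize` and the 1 MiB write size are illustrative choices, and the zero-filled buffer stands in for whatever blocks a real application would write. Whether the reservation ends up as unwritten extents depends on the filesystem.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per write; an arbitrary size for this sketch


def preallocate_and_initialize(path, size):
    """Reserve `size` bytes for `path`, then write all of it before use."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to reserve the space up front.
        os.posix_fallocate(fd, 0, size)
        # Overwrite the whole reservation so later writes land on
        # already-initialized blocks (zeros are a placeholder payload).
        zeros = bytes(CHUNK)
        offset = 0
        while offset < size:
            n = min(CHUNK, size - offset)
            offset += os.pwrite(fd, zeros[:n], offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    return os.path.getsize(path)
```

The cost of initialization is paid once, up front, so no stale data is ever readable from the reserved range.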

Joel



-- 

Life's Little Instruction Book #182

	"Be romantic."

			http://www.jlbec.org/
			jlbec@evilplan.org
