linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] fs: revert commit bbdd6808 to fallocate UAPI
@ 2012-11-19 23:04 Dave Chinner
  2012-11-20 16:36 ` Christoph Hellwig
  2012-11-26  0:28 ` [PATCH, 3.7-rc7, RESEND] " Dave Chinner
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2012-11-19 23:04 UTC (permalink / raw)
  To: torvalds; +Cc: tytso, linux-kernel, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes the
fallocate(2) syscall interface. The flag that is reserved by this
commit is for functionality that has previously been NAKed on the
-fsdevel mailing list, and so exists out-of-tree.

The reserved syscall flag is completely undocumented, the commit
message doesn't tell us why the patches that use it exist out of
tree, or even why the flag needs to be in the kernel code and not
part of the out-of-tree patches. Further, the flag is not
implemented in any in-tree filesystems and probably never will be
due to the truck-sized security hole it opens up. Finally, we don't
change syscalls purely to support out-of-tree patches or kernel
modules.

The change to the syscall API was written and committed directly to
the ext4 tree by the ext4 maintainer, and merged through that tree
via the ext4 merge without review.  According to the commit message,
this was discussed at the Plumber's conference but no documentary
evidence of that discussion exists.  However, whether or not this
discussion took place is irrelevant as the proper venue for
discussion of this change is linux-fsdevel; discussions at a
conference are no substitute for a full airing of the change on the
appropriate mailing list.

The method of pushing of such a commit (i.e. written, committed and
pushed by a tree maintainer as part of a larger subsystem merge)
could be seen as designed to avoid review and discussion of a
controversial change that is likely to be NAKed. A long-term
subsystem maintainer should know better than to push changes in this
manner.

The lack of formal review and discussion for a syscall API change is
grounds for reverting patch, especially given the controversial
nature of the feature and the previous discussions and NAKs. The way
the change was pushed into mainline borders on an abuse of the trust
we place in maintainers and hence as a matter of principle this
change should be reverted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/uapi/linux/falloc.h |    1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 990c4cc..8a7935f 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -3,7 +3,6 @@
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
 #define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
-#define FALLOC_FL_NO_HIDE_STALE	0x04 /* reserved codepoint */
 
 
 #endif /* _UAPI_FALLOC_H_ */
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-19 23:04 [PATCH] fs: revert commit bbdd6808 to fallocate UAPI Dave Chinner
@ 2012-11-20 16:36 ` Christoph Hellwig
  2012-11-26  0:28 ` [PATCH, 3.7-rc7, RESEND] " Dave Chinner
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2012-11-20 16:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: torvalds, tytso, linux-kernel, linux-fsdevel

On Tue, Nov 20, 2012 at 10:04:27AM +1100, Dave Chinner wrote:
> The method of pushing of such a commit (i.e. written, committed and
> pushed by a tree maintainer as part of a larger subsystem merge)
> could be seen as designed to avoid review and discussion of a
> controversial change that is likely to be NAKed. A long-term
> subsystem maintainer should know better than to push changes in this
> manner.


Agreed, this needs to be reverted.

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-19 23:04 [PATCH] fs: revert commit bbdd6808 to fallocate UAPI Dave Chinner
  2012-11-20 16:36 ` Christoph Hellwig
@ 2012-11-26  0:28 ` Dave Chinner
  2012-11-26  2:55   ` Theodore Ts'o
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-11-26  0:28 UTC (permalink / raw)
  To: torvalds; +Cc: tytso, linux-kernel, linux-fsdevel

fs: revert commit bbdd6808 to fallocate UAPI

From: Dave Chinner <dchinner@redhat.com>

Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes the
fallocate(2) syscall interface. The flag that is reserved by this
commit is for functionality that has previously been NAKed on the
-fsdevel mailing list, and so exists out-of-tree.

The reserved syscall flag is completely undocumented, the commit
message doesn't tell us why the patches that use it exist out of
tree, or even why the flag needs to be in the kernel code and not
part of the out-of-tree patches. Further, the flag is not
implemented in any in-tree filesystems and probably never will be
due to the truck-sized security hole it opens up. Finally, we don't
change syscalls purely to support out-of-tree patches or kernel
modules.

The change to the syscall API was written and committed directly to
the ext4 tree by the ext4 maintainer, and merged through that tree
via the ext4 merge without review.  According to the commit message,
this was discussed at the Plumber's conference but no documentary
evidence of that discussion exists.  However, whether or not this
discussion took place is irrelevant as the proper venue for
discussion of this change is linux-fsdevel; discussions at a
conference are no substitute for a full airing of the change on the
appropriate mailing list.

The method of pushing of such a commit (i.e. written, committed and
pushed by a tree maintainer as part of a larger subsystem merge)
could be seen as designed to avoid review and discussion of a
controversial change that is likely to be NAKed. A long-term
subsystem maintainer should know better than to push changes in this
manner.

The lack of formal review and discussion for a syscall API change is
grounds for reverting patch, especially given the controversial
nature of the feature and the previous discussions and NAKs. The way
the change was pushed into mainline borders on an abuse of the trust
we place in maintainers and hence as a matter of principle this
change should be reverted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---

Reason for resend:
- No response to orignal posting here:

	http://www.spinics.net/lists/linux-fsdevel/msg59937.html

 include/uapi/linux/falloc.h |    1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 990c4cc..8a7935f 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -3,7 +3,6 @@
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
 #define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
-#define FALLOC_FL_NO_HIDE_STALE	0x04 /* reserved codepoint */
 
 
 #endif /* _UAPI_FALLOC_H_ */

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26  0:28 ` [PATCH, 3.7-rc7, RESEND] " Dave Chinner
@ 2012-11-26  2:55   ` Theodore Ts'o
  2012-11-26  6:14     ` Tao Ma
                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Theodore Ts'o @ 2012-11-26  2:55 UTC (permalink / raw)
  To: torvalds; +Cc: Dave Chinner, linux-kernel, linux-fsdevel

On Mon, Nov 26, 2012 at 11:28:14AM +1100, Dave Chinner wrote:
> fs: revert commit bbdd6808 to fallocate UAPI
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes the
> fallocate(2) syscall interface. The flag that is reserved by this
> commit is for functionality that has previously been NAKed on the
> -fsdevel mailing list, and so exists out-of-tree.

Hi Linus,

It doesn't change the interface or break anything; it just reserves a
bit so that out-of-tree patches don't collide with future allocations.
There are significant usages of this bit within Google and Tao Bao.
It is true that there has been significant pushback about adding this
functionality on linux-fsdevel; I find it personally frustrating that
in effect, if enough people scream, they can veto an optional feature
that might only be implemented by a single file system.

It's not like there is any shortage of flag bits, so what's the harm
of reserving the bit?

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26  2:55   ` Theodore Ts'o
@ 2012-11-26  6:14     ` Tao Ma
  2012-11-26  9:12     ` Dave Chinner
  2012-11-26 11:53     ` Alan Cox
  2 siblings, 0 replies; 69+ messages in thread
From: Tao Ma @ 2012-11-26  6:14 UTC (permalink / raw)
  To: Theodore Ts'o, torvalds, Dave Chinner, linux-kernel, linux-fsdevel

Hi Dave,
On 11/26/2012 10:55 AM, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 11:28:14AM +1100, Dave Chinner wrote:
>> fs: revert commit bbdd6808 to fallocate UAPI
>>
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes the
>> fallocate(2) syscall interface. The flag that is reserved by this
>> commit is for functionality that has previously been NAKed on the
>> -fsdevel mailing list, and so exists out-of-tree.
> 
> Hi Linus,
> 
> It doesn't change the interface or break anything; it just reserves a
> bit so that out-of-tree patches don't collide with future allocations.
> There are significant usages of this bit within Google and Tao Bao.
> It is true that there has been significant pushback about adding this
> functionality on linux-fsdevel; I find it personally frustrating that
> in effect, if enough people scream, they can veto an optional feature
> that might only be implemented by a single file system.
Sorry, but we(Tao Bao) should say it explicitly that it is currently
used in our product system. So we are with Ted that there should be no
side effect for reserving just a bit to avoid future conflict?

Thanks
Tao

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26  2:55   ` Theodore Ts'o
  2012-11-26  6:14     ` Tao Ma
@ 2012-11-26  9:12     ` Dave Chinner
  2012-12-05 10:48       ` Martin Steigerwald
  2012-11-26 11:53     ` Alan Cox
  2 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-11-26  9:12 UTC (permalink / raw)
  To: Theodore Ts'o, torvalds, linux-kernel, linux-fsdevel

On Sun, Nov 25, 2012 at 09:55:20PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 26, 2012 at 11:28:14AM +1100, Dave Chinner wrote:
> > fs: revert commit bbdd6808 to fallocate UAPI
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes the
> > fallocate(2) syscall interface. The flag that is reserved by this
> > commit is for functionality that has previously been NAKed on the
> > -fsdevel mailing list, and so exists out-of-tree.
> 
> Hi Linus,
> 
> It doesn't change the interface or break anything; it just reserves a
> bit so that out-of-tree patches don't collide with future allocations.
> There are significant usages of this bit within Google and Tao Bao.
> It is true that there has been significant pushback about adding this
> functionality on linux-fsdevel;

It's not the fact that you want to reserve a bit that is at issue
here - it's the way it's been pushed into the tree that is the
front-and-center issue.

> I find it personally frustrating that
> in effect, if enough people scream, they can veto an optional feature
> that might only be implemented by a single file system.

Having a significant portion of the wider fs development community
disagree with your patches is no reason for subverting the review
process.  Besides, that's irrelevant to the issue being discussed,
unless you are describing your motives in an effort to justify your
actions.

In fact, it's even more disturbing if this was your real motive.
That is, is sounds somewhat like you've just admitted that you
pushed this change silently through the ext4 tree to avoid review
and discussion and that you are blaming the rest of the FS community
for forcing you to take such actions.

> It's not like there is any shortage of flag bits, so what's the harm
> of reserving the bit?

The harm has already been done - to the trust we've placed in you as
a maintainer.  To argue that the code does no harm is to completely
miss the crux of the issue at hand: principles, process and trust
are far more important in our community than a single line of
code.

Ted, it comes down to trust. If we can't trust you not to push your
own changes to syscall APIs into the mainline tree via backdoor
channels, then how can we trust you not to push the entire
out-of-tree patch into the kernel the same way?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26  2:55   ` Theodore Ts'o
  2012-11-26  6:14     ` Tao Ma
  2012-11-26  9:12     ` Dave Chinner
@ 2012-11-26 11:53     ` Alan Cox
  2012-11-26 14:43       ` Theodore Ts'o
  2012-11-26 21:12       ` Dave Chinner
  2 siblings, 2 replies; 69+ messages in thread
From: Alan Cox @ 2012-11-26 11:53 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: torvalds, Dave Chinner, linux-kernel, linux-fsdevel

> It's not like there is any shortage of flag bits, so what's the harm
> of reserving the bit?

Why not just reserve a small group of bits for fs private use in that
case - for any fs.

Alan

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26 11:53     ` Alan Cox
@ 2012-11-26 14:43       ` Theodore Ts'o
  2012-11-26 21:12       ` Dave Chinner
  1 sibling, 0 replies; 69+ messages in thread
From: Theodore Ts'o @ 2012-11-26 14:43 UTC (permalink / raw)
  To: Alan Cox; +Cc: torvalds, Dave Chinner, linux-kernel, linux-fsdevel

On Mon, Nov 26, 2012 at 11:53:45AM +0000, Alan Cox wrote:
> > It's not like there is any shortage of flag bits, so what's the harm
> > of reserving the bit?
> 
> Why not just reserve a small group of bits for fs private use in that
> case - for any fs.

I think that would be a *fine* idea.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26 11:53     ` Alan Cox
  2012-11-26 14:43       ` Theodore Ts'o
@ 2012-11-26 21:12       ` Dave Chinner
  2012-11-27 13:44         ` Martin Steigerwald
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-11-26 21:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: Theodore Ts'o, torvalds, linux-kernel, linux-fsdevel

On Mon, Nov 26, 2012 at 11:53:45AM +0000, Alan Cox wrote:
> > It's not like there is any shortage of flag bits, so what's the harm
> > of reserving the bit?
> 
> Why not just reserve a small group of bits for fs private use in that
> case - for any fs.

Flawed - one bit, one function for all filesystems, otherwise the
same binary could behave very differently on different filesystems.

Besides, we already have a mechanism for adding filesystem specific
interfaces. It's called an ioctl.  That's what it's there for - a
free-form extensible interface that can be wholly defined and
contained in the out-of-tree patch.

Most filesystems implement ioctls for their own specific
functionality, including for one-off preallocation semantics (e.g.
XFS_IOC_ZERO_RANGE). There is no reason why ext4 can't do the same
thing and we can drop the whole issue of having to modify a syscall
API with magic, undocumented flag bits with unpredictable
behaviour....

ext4 is not a special snowflake that allows developers to bend rules
whenever they want. If the ext4 developers want to support out of
tree functionality for their filesystem, then they can do it within
their filesystem via ioctls like everyone else does.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26 21:12       ` Dave Chinner
@ 2012-11-27 13:44         ` Martin Steigerwald
  0 siblings, 0 replies; 69+ messages in thread
From: Martin Steigerwald @ 2012-11-27 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Chinner, Alan Cox, Theodore Ts'o, torvalds, linux-fsdevel

Am Montag, 26. November 2012 schrieb Dave Chinner:
> On Mon, Nov 26, 2012 at 11:53:45AM +0000, Alan Cox wrote:
> > > It's not like there is any shortage of flag bits, so what's the
> > > harm of reserving the bit?
> > 
> > Why not just reserve a small group of bits for fs private use in that
> > case - for any fs.
> 
> Flawed - one bit, one function for all filesystems, otherwise the
> same binary could behave very differently on different filesystems.
> 
> Besides, we already have a mechanism for adding filesystem specific
> interfaces. It's called an ioctl.  That's what it's there for - a
> free-form extensible interface that can be wholly defined and
> contained in the out-of-tree patch.
> 
> Most filesystems implement ioctls for their own specific
> functionality, including for one-off preallocation semantics (e.g.
> XFS_IOC_ZERO_RANGE). There is no reason why ext4 can't do the same
> thing and we can drop the whole issue of having to modify a syscall
> API with magic, undocumented flag bits with unpredictable
> behaviour....
> 
> ext4 is not a special snowflake that allows developers to bend rules
> whenever they want. If the ext4 developers want to support out of
> tree functionality for their filesystem, then they can do it within
> their filesystem via ioctls like everyone else does.

I do not develop for the kernel, just test here a bit, there a bit… and I 
didn´t read the previous discussions about this patch…

… but I think I can follow this argument of yours, Dave.

And Ted, this does not appear to be screaming to me.

So why no ioctl?

And what functionality is this about anyway? Sometimes I read in some 
kernel source and I am happy that sometimes can understood some of it. 
Reading "this is used for some magic Google or Tao Bao do with their 
filesystem" would decrease my chance to understand whats going on there. 
Not quite approbiate for an open source kernel, I think.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-11-26  9:12     ` Dave Chinner
@ 2012-12-05 10:48       ` Martin Steigerwald
  2012-12-05 15:45         ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-05 10:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Chinner, Theodore Ts'o, torvalds, linux-fsdevel

Am Montag, 26. November 2012 schrieb Dave Chinner:
> On Sun, Nov 25, 2012 at 09:55:20PM -0500, Theodore Ts'o wrote:
> > On Mon, Nov 26, 2012 at 11:28:14AM +1100, Dave Chinner wrote:
> > > fs: revert commit bbdd6808 to fallocate UAPI
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Commit bbdd6808 ("fs: reserve fallocate flag codepoint") changes
> > > the fallocate(2) syscall interface. The flag that is reserved by
> > > this commit is for functionality that has previously been NAKed on
> > > the -fsdevel mailing list, and so exists out-of-tree.
> > 
> > Hi Linus,
> > 
> > It doesn't change the interface or break anything; it just reserves a
> > bit so that out-of-tree patches don't collide with future
> > allocations. There are significant usages of this bit within Google
> > and Tao Bao. It is true that there has been significant pushback
> > about adding this functionality on linux-fsdevel;
> 
> It's not the fact that you want to reserve a bit that is at issue
> here - it's the way it's been pushed into the tree that is the
> front-and-center issue.
> 
> > I find it personally frustrating that
> > in effect, if enough people scream, they can veto an optional feature
> > that might only be implemented by a single file system.
> 
> Having a significant portion of the wider fs development community
> disagree with your patches is no reason for subverting the review
> process.  Besides, that's irrelevant to the issue being discussed,
> unless you are describing your motives in an effort to justify your
> actions.
> 
> In fact, it's even more disturbing if this was your real motive.
> That is, is sounds somewhat like you've just admitted that you
> pushed this change silently through the ext4 tree to avoid review
> and discussion and that you are blaming the rest of the FS community
> for forcing you to take such actions.
> 
> > It's not like there is any shortage of flag bits, so what's the harm
> > of reserving the bit?
> 
> The harm has already been done - to the trust we've placed in you as
> a maintainer.  To argue that the code does no harm is to completely
> miss the crux of the issue at hand: principles, process and trust
> are far more important in our community than a single line of
> code.
> 
> Ted, it comes down to trust. If we can't trust you not to push your
> own changes to syscall APIs into the mainline tree via backdoor
> channels, then how can we trust you not to push the entire
> out-of-tree patch into the kernel the same way?

Ping.

Linus, while I am interested in an answer I think that Dave and Christoph 
as Linux filesystem developers actually deserve one (instead of silently 
being ignored which is also a decision in this matter).

I did not see an answer in linux-2.6 commit log as of today so far.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 10:48       ` Martin Steigerwald
@ 2012-12-05 15:45         ` Linus Torvalds
  2012-12-05 16:18           ` Martin Steigerwald
  2012-12-06 12:05           ` Christoph Hellwig
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2012-12-05 15:45 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

On Wed, Dec 5, 2012 at 2:48 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
>
> Linus, while I am interested in an answer I think that Dave and Christoph
> as Linux filesystem developers actually deserve one (instead of silently
> being ignored which is also a decision in this matter).
>
> I did not see an answer in linux-2.6 commit log as of today so far.

Christ guys. This whole thread is retarded.

The *ONLY* reason people seem to have for reverting that is a "ooh, my
feelings are hurt by how this was done, and now I'm a pissy bitch and
I want to get this reverted".

Stop the f*cking around already.

If you want something reverted, you show me the *technical* reason for
it. Not the "ooh, I'm so annoyed by how this was done" reason for it.

And if your little feelings got hurt, get your mommy to tuck you in,
don't email me about it. Because I'm not exactly known for my deep
emotional understanding and supportive personality, am I?

               Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 15:45         ` Linus Torvalds
@ 2012-12-05 16:18           ` Martin Steigerwald
  2012-12-05 16:33             ` Theodore Ts'o
  2012-12-05 18:25             ` Linus Torvalds
  2012-12-06 12:05           ` Christoph Hellwig
  1 sibling, 2 replies; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-05 16:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

Am Mittwoch, 5. Dezember 2012 schrieb Linus Torvalds:
> On Wed, Dec 5, 2012 at 2:48 AM, Martin Steigerwald <Martin@lichtvoll.de> 
wrote:
> > Linus, while I am interested in an answer I think that Dave and
> > Christoph as Linux filesystem developers actually deserve one
> > (instead of silently being ignored which is also a decision in this
> > matter).
> > 
> > I did not see an answer in linux-2.6 commit log as of today so far.
> 
> Christ guys. This whole thread is retarded.
> 
> The *ONLY* reason people seem to have for reverting that is a "ooh, my
> feelings are hurt by how this was done, and now I'm a pissy bitch and
> I want to get this reverted".
> 
> Stop the f*cking around already.
> 
> If you want something reverted, you show me the *technical* reason for
> it. Not the "ooh, I'm so annoyed by how this was done" reason for it.
> 
> And if your little feelings got hurt, get your mommy to tuck you in,
> don't email me about it. Because I'm not exactly known for my deep
> emotional understanding and supportive personality, am I?

Did you actually *read* the thread, Linus?

Dave provided technical reasons.

First in the patch description and then in:

https://lkml.org/lkml/2012/11/26/700

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 16:18           ` Martin Steigerwald
@ 2012-12-05 16:33             ` Theodore Ts'o
  2012-12-05 17:24               ` Martin Steigerwald
  2012-12-05 18:25             ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-05 16:33 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On Wed, Dec 05, 2012 at 05:18:43PM +0100, Martin Steigerwald wrote:
> Dave provided technical reasons.
> 
> First in the patch description and then in:
> 
> https://lkml.org/lkml/2012/11/26/700

There were no technical reasons.  We are only reserving a bit.  And
different file systems don't support all of the various different
fallocate flags already --- for example, not all file systems support
the punch system call.

Yes, I could create an entrely new ioctl() that looks just like
fallocate, but supports the extra bit, just so that Dave and others
don't have to be offended about the existence of that extra bit ---
but Linus (and others) have considered ioctl()'s evil, since there is
no type checking, and it's just silly to create a separate interface
just because somebody doesn't think some other file system shouldn't
implement a particular feature --- especially since it's not like
we're have any kind of shortage of bits in the fallocate field.

Heck, I probably have more to complain about with the inode flags
field, which were originally created specifically for ext2/3/4, and
which has since been grabbed for use by other file systems, including
btrfs.  You haven't heard me kvetching because btrfs has grabbed
btrfs-specific inode flags for nocow and notail... no one even bother
to try to get it past the fs-devel shed painting crew before *those*
bits were allocated --- and I am absolutely fine with that.

     	  	    	      	 	    - Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 16:33             ` Theodore Ts'o
@ 2012-12-05 17:24               ` Martin Steigerwald
  2012-12-05 17:34                 ` Theodore Ts'o
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-05 17:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

Am Mittwoch, 5. Dezember 2012 schrieb Theodore Ts'o:
> On Wed, Dec 05, 2012 at 05:18:43PM +0100, Martin Steigerwald wrote:
> > Dave provided technical reasons.
> > 
> > First in the patch description and then in:
> > 
> > https://lkml.org/lkml/2012/11/26/700
> 
> There were no technical reasons.  We are only reserving a bit.  And
> different file systems don't support all of the various different
> fallocate flags already --- for example, not all file systems support
> the punch system call.
> 
> Yes, I could create an entrely new ioctl() that looks just like
> fallocate, but supports the extra bit, just so that Dave and others
> don't have to be offended about the existence of that extra bit ---
> but Linus (and others) have considered ioctl()'s evil, since there is
> no type checking, and it's just silly to create a separate interface
> just because somebody doesn't think some other file system shouldn't
> implement a particular feature --- especially since it's not like
> we're have any kind of shortage of bits in the fallocate field.

Thanks for explaining your technical view about this. I appreciate it.

> Heck, I probably have more to complain about with the inode flags
> field, which were originally created specifically for ext2/3/4, and
> which has since been grabbed for use by other file systems, including
> btrfs.  You haven't heard me kvetching because btrfs has grabbed
> btrfs-specific inode flags for nocow and notail... no one even bother
> to try to get it past the fs-devel shed painting crew before *those*
> bits were allocated --- and I am absolutely fine with that.

Thats no technical reason – thats exactly the process / patch review stuff 
Linus does not seem to give a shit about at least with this topic.

That aside I wondered about that inode flags in earlier days already. At 
some time chattr +i worked with XFS and then it only worked in Ext3. 
Before that I thought that chattr stuff would work with all filesystems.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 17:24               ` Martin Steigerwald
@ 2012-12-05 17:34                 ` Theodore Ts'o
  2012-12-05 17:55                   ` Martin Steigerwald
  2012-12-06  0:42                   ` Dave Chinner
  0 siblings, 2 replies; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-05 17:34 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On Wed, Dec 05, 2012 at 06:24:05PM +0100, Martin Steigerwald wrote:
> > Heck, I probably have more to complain about with the inode flags
> > field, which were originally created specifically for ext2/3/4, and
> > which has since been grabbed for use by other file systems, including
> > btrfs.  You haven't heard me kvetching because btrfs has grabbed
> > btrfs-specific inode flags for nocow and notail... no one even bother
> > to try to get it past the fs-devel shed painting crew before *those*
> > bits were allocated --- and I am absolutely fine with that.
> 
> Thats no technical reason – thats exactly the process / patch review stuff 
> Linus does not seem to give a shit about at least with this topic.

Exactly; just as Dave's complaint is not technical.

> That aside I wondered about that inode flags in earlier days already. At 
> some time chattr +i worked with XFS and then it only worked in Ext3. 
> Before that I thought that chattr stuff would work with all filesystems.

Historically, they were created only for ext2/3/4 file systems.  Over
time, other file systems have used it as a common interface.  It's for
historical reasons that it's still an ioctl, as opposed to a system
call.  For ext2/3/4 the inode flags is actually the on-disk encoding,
not just the userspace ABI.  For other file systems, it might be just
the ABI, or it might also be something that they use for their on-disk
encoding.

Because it's the on-disk encoding, when btrfs uses extra bits for its
btrfs-specific inode flags, it means that I need to avoid using those
bits in ext4, if it's a flag that needs to also be exposed via
chattr/lsattr.  That being said, you'll note that unlike Dave, I have
**not** thrown a hissy fit when btrfs grabbed bits from the inode
field, even though quite a bit more bits allocated for the inode flags
than the fallocate flags.

						- Ted

P.S.  The main reason why it would have been better for btrfs
developers to have consulted me is that they also depend on
lsattr/chattr, and those programs are part of e2fsprogs.  Since no one
told me about the nocow flag when it first went into the kernel, I
didn't add it to e2fsprogs until relatively recently --- with the
result that Ubuntu Lucid doesnt have a version of chattr which
supports the nocow flag.  (Not a big deal, I just have to convince
Canoncal to upgrade to a newer version of e2fsprogs.)  So I only get
annoyed when some btrfs users complain about the lack of support in
chattr, when the main reason why there was no support at least at
first was no one bothered to ask me to add support; it was not because
I had anything against btrfs.  As soon as I found out about the nocow
flag, I added support to chattr and lsattr and pushed out a new
release of e2fsprogs.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 17:34                 ` Theodore Ts'o
@ 2012-12-05 17:55                   ` Martin Steigerwald
  2012-12-06  0:42                   ` Dave Chinner
  1 sibling, 0 replies; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-05 17:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

Am Mittwoch, 5. Dezember 2012 schrieb Theodore Ts'o:
> On Wed, Dec 05, 2012 at 06:24:05PM +0100, Martin Steigerwald wrote:
> > > Heck, I probably have more to complain about with the inode flags
> > > field, which were originally created specifically for ext2/3/4, and
> > > which has since been grabbed for use by other file systems,
> > > including btrfs.  You haven't heard me kvetching because btrfs has
> > > grabbed btrfs-specific inode flags for nocow and notail... no one
> > > even bother to try to get it past the fs-devel shed painting crew
> > > before *those* bits were allocated --- and I am absolutely fine
> > > with that.
> > 
> > Thats no technical reason – thats exactly the process / patch review
> > stuff Linus does not seem to give a shit about at least with this
> > topic.
> 
> Exactly; just as Dave's complaint is not technical.

For me the ioctl argument looked like a technical argument which you 
answered to now.

And I noted that about the patch review aspects there seem to be quite 
different opinions. I tend to agree with Dave there.

> > That aside I wondered about that inode flags in earlier days already.
> > At some time chattr +i worked with XFS and then it only worked in
> > Ext3. Before that I thought that chattr stuff would work with all
> > filesystems.
> 
> Historically, they were created only for ext2/3/4 file systems.  Over
> time, other file systems have used it as a common interface.  It's for
> historical reasons that it's still an ioctl, as opposed to a system
> call.  For ext2/3/4 the inode flags is actually the on-disk encoding,
> not just the userspace ABI.  For other file systems, it might be just
> the ABI, or it might also be something that they use for their on-disk
> encoding.

Thanks for your detailed explaination.

> Because it's the on-disk encoding, when btrfs uses extra bits for its
> btrfs-specific inode flags, it means that I need to avoid using those
> bits in ext4, if it's a flag that needs to also be exposed via
> chattr/lsattr.  That being said, you'll note that unlike Dave, I have
> **not** thrown a hissy fit when btrfs grabbed bits from the inode
> field, even though quite a bit more bits allocated for the inode flags
> than the fallocate flags.

I do not think this that these kind of comparisons lead to anything 
useful. Cause other have done it is no reason for me doing it.

> P.S.  The main reason why it would have been better for btrfs
> developers to have consulted me is that they also depend on
> lsattr/chattr, and those programs are part of e2fsprogs.  Since no one
> told me about the nocow flag when it first went into the kernel, I
> didn't add it to e2fsprogs until relatively recently --- with the
> result that Ubuntu Lucid doesnt have a version of chattr which
> supports the nocow flag.  (Not a big deal, I just have to convince
> Canoncal to upgrade to a newer version of e2fsprogs.)  So I only get
> annoyed when some btrfs users complain about the lack of support in
> chattr, when the main reason why there was no support at least at
> first was no one bothered to ask me to add support; it was not because
> I had anything against btrfs.  As soon as I found out about the nocow
> flag, I added support to chattr and lsattr and pushed out a new
> release of e2fsprogs.

Which would be an argument for patch review via mailing list and CC´ing 
everyone involved. Cause then you would have known and the patch review 
process would likely have lead to avoid the technical issue the end user 
has had in the first place.

IMO exactly just the non-technical argument, Dave has made, beside his IMO 
technical arguments.

So regardless of the outcome of the review process, it may be that the bit 
goes in or not, the process has some importance here. Even if in the end 
you as the Ext4 maintainer decide that the bit goes in *after* the review 
process, after having read the acks / nacks and their explainations.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 16:18           ` Martin Steigerwald
  2012-12-05 16:33             ` Theodore Ts'o
@ 2012-12-05 18:25             ` Linus Torvalds
  2012-12-06  1:14               ` Dave Chinner
  1 sibling, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2012-12-05 18:25 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

On Wed, Dec 5, 2012 at 8:18 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
>
> Did you actually *read* the thread, Linus?

I did. And I actually understood it. Unlike some people.

> Dave provided technical reasons.
>
> First in the patch description and then in:
>
> https://lkml.org/lkml/2012/11/26/700

No. That technical argument is an argument *against* changing the
current "specific bit meaning" reservation into a "generic fs private
use" bit.

The current one actually has a specific meaning (documented in the
name, if very little else), and is *not* some kind of "generic fs
private use" bit.  So the email you quote was actually an argument
against changing the current status quo.

And there is an actual technical reason for the current situation,
described in the original commit. Now, people may not *like* the fact
that the bit is commonly used out-of-tree (and used that way, rather
than with an ioctl), but it's a fact.  And quite frankly, ioctl's
aren't any better. They are just another different way of messing
things up.

Everything else in that thread has basically been whining about "I
don't like the current reality".

Yes, people can argue that "process" is about technical issues too,
but let's be honest: our process is fluid. Not everything gets
reviewed on the mailing list, and people *do* talk about things
face-to-face at conferences. And none of that changes the current
actual situation.

                 Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 17:34                 ` Theodore Ts'o
  2012-12-05 17:55                   ` Martin Steigerwald
@ 2012-12-06  0:42                   ` Dave Chinner
  2012-12-06  9:24                     ` Martin Steigerwald
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-12-06  0:42 UTC (permalink / raw)
  To: Theodore Ts'o, Martin Steigerwald, Linus Torvalds,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 05, 2012 at 12:34:15PM -0500, Theodore Ts'o wrote:
> Because it's the on-disk encoding, when btrfs uses extra bits for its
> btrfs-specific inode flags, it means that I need to avoid using those
> bits in ext4, if it's a flag that needs to also be exposed via
> chattr/lsattr.

What, you mean the ones they posted to -fsdevel for peer review?

http://marc.info/?t=129914133000001&r=1&w=2

> P.S.  The main reason why it would have been better for btrfs
> developers to have consulted me is that they also depend on
> lsattr/chattr, and those programs are part of e2fsprogs.

They did consult you directly, Ted. Here's the header from V2 of the
patch indicating that you are directly cc'd:

  Message-ID: <4D871243.2060001@cn.fujitsu.com>
  Date: Mon, 21 Mar 2011 16:54:27 +0800
  From: liubo <liubo2009@cn.fujitsu.com>
  User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
          rv:1.9.1b3pre) Gecko/20090513 Fedora/3.0-2.3.beta2.fc11
          Thunderbird/3.0b2
  MIME-Version: 1.0
  To: Linux Btrfs <linux-btrfs@vger.kernel.org>
  CC: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
          Andreas Dilger <adilger@dilger.ca>, Christoph Hellwig <hch@lst.de>,
          tytso <tytso@mit.edu>, Chris Mason <chris.mason@oracle.com>
  Subject: [PATCH 1/2 v2] Btrfs: add datacow flag in inode flag

Apples and oranges, Ted. You didn't post your change anywhere,
unlike the btrfs guys who did the right thing. Please stop trying to
blame someone else for your actions.

> That being said, you'll note that unlike Dave, I have
> **not** thrown a hissy fit when btrfs grabbed bits from the inode
> field, even though quite a bit more bits allocated for the inode flags
> than the fallocate flags.

IOWs, pointing out that the standard peer review process has been
bypassed is "throwing a hissy fit"? And Linus calls it "being a
pissy bitch". To me this is acknoweldgement of the fact that
something wrong was done, but the actors can't bring themselves to
directly admit it and so are resorting to name calling and
blame-shifting to discredit the messenger and therefore the message.

Or perhaps Linus has given us the go-ahead to push random unreviewed
syscall changes through subsystem trees after holding private
discussions between a handful of developers like has happened here.

Cheers.

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 18:25             ` Linus Torvalds
@ 2012-12-06  1:14               ` Dave Chinner
  2012-12-06  3:03                 ` Linus Torvalds
  2012-12-06 12:06                 ` Christoph Hellwig
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-06  1:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Steigerwald, Linux Kernel Mailing List, Theodore Ts'o,
	linux-fsdevel

On Wed, Dec 05, 2012 at 10:25:17AM -0800, Linus Torvalds wrote:
> Yes, people can argue that "process" is about technical issues too,
> but let's be honest: our process is fluid. Not everything gets
> reviewed on the mailing list, and people *do* talk about things
> face-to-face at conferences.

Yes, that is true.

But we don't review code at conferences. We have mailing lists for
that.  I've lost count of the number of times I've heard "post your
code" or "let's take it to the list" or "better to discuss this on
the list where everyone can comment" at conferences such as
Plumbers. It's a standard practise - talk at conferences, review
on mailing lists.

And for changes to syscalls? That's something that must be peer
reviewed because we are going to be stuck with those changes forever
as we can't undo them at a later date. It doesn't matter who made the
change in question, I would have done exactly the same thing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06  1:14               ` Dave Chinner
@ 2012-12-06  3:03                 ` Linus Torvalds
  2012-12-06  9:37                   ` Martin Steigerwald
  2012-12-06 12:06                 ` Christoph Hellwig
  1 sibling, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2012-12-06  3:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin Steigerwald, Linux Kernel Mailing List, Theodore Ts'o,
	linux-fsdevel

On Wed, Dec 5, 2012 at 5:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> And for changes to syscalls? That's something that must be peer
> reviewed because we are going to be stuck with those changes forever
> as we can't undo them at a later date. It doesn't matter who made the
> change in question, I would have done exactly the same thing....

The thing that people are complaining about is exactly the reverse of
this. It's *protecting* us from making mistakes, and doesn't actually
add any new interfaces in itself.

This is why I'm so annoyed with this stupid thread. It's been going on
forever, and reverting that change WOULD BE OBJECTIVELY A BAD IDEA.

The change is a one-liner, doesn't add any code at all, seeks to
protect us from potential future mistakes, and had a good technical
rationale for it. Seriously. Why are we even discussing this two weeks
later?

Don't even bother to answer that question. This thread isn't worth it.

              Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06  0:42                   ` Dave Chinner
@ 2012-12-06  9:24                     ` Martin Steigerwald
  0 siblings, 0 replies; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-06  9:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel

Am Donnerstag, 6. Dezember 2012 schrieb Dave Chinner:
> > That being said, you'll note that unlike Dave, I have
> > **not** thrown a hissy fit when btrfs grabbed bits from the inode
> > field, even though quite a bit more bits allocated for the inode
> > flags than the fallocate flags.
> 
> IOWs, pointing out that the standard peer review process has been
> bypassed is "throwing a hissy fit"? And Linus calls it "being a
> pissy bitch". To me this is acknoweldgement of the fact that
> something wrong was done, but the actors can't bring themselves to
> directly admit it and so are resorting to name calling and
> blame-shifting to discredit the messenger and therefore the message.

+1

Anyway, the decision seems to be set, so indeed waste of energy to discuss 
it further.

I agree the process is usually fluid. Cause I did not recognize a thread 
like this every day here on this list. And what happened, happened.

But for the future I´d like that patches are posted.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06  3:03                 ` Linus Torvalds
@ 2012-12-06  9:37                   ` Martin Steigerwald
  2012-12-07  1:08                     ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-06  9:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Dave Chinner, Theodore Ts'o, linux-fsdevel

Am Donnerstag, 6. Dezember 2012 schrieb Linus Torvalds:
> On Wed, Dec 5, 2012 at 5:14 PM, Dave Chinner <david@fromorbit.com> 
wrote:
> > And for changes to syscalls? That's something that must be peer
> > reviewed because we are going to be stuck with those changes forever
> > as we can't undo them at a later date. It doesn't matter who made the
> > change in question, I would have done exactly the same thing....
> 
> The thing that people are complaining about is exactly the reverse of
> this. It's *protecting* us from making mistakes, and doesn't actually
> add any new interfaces in itself.
>
> This is why I'm so annoyed with this stupid thread. It's been going on
> forever, and reverting that change WOULD BE OBJECTIVELY A BAD IDEA.

See, thats where you have a problem with "reality".

It seems you cannot accept the fact that some developers disliked the 
process in which this change was pushed. It seems to be that you want this 
thread to vanish in thin air immediately. But unfortunately it didn´t.

So while the process generally is fluid, I agree to that, it wasn´t fluid 
here. Or would you call this fluid here?

I accept the fact that a decision has been made. And I am not deeply 
enough into the technical matters to have to continue discussing it. I 
also think all arguments have been said.

The process has been the way it is. Noted. For the future there is 
potential to do it differently with less churn involved. Will you, Ted and 
others involved take the chance? Up to you entirely. And depending on what 
effect you want to create.

BTW: I do not buy into your tough guy number. Scare someone else with 
that, if it makes you feel better about yourself. So thanks for the 
factual stuff that you delivered after your first post to the thread.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-05 15:45         ` Linus Torvalds
  2012-12-05 16:18           ` Martin Steigerwald
@ 2012-12-06 12:05           ` Christoph Hellwig
  2012-12-07  1:16             ` Ingo Molnar
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2012-12-06 12:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Steigerwald, Linux Kernel Mailing List, Dave Chinner,
	Theodore Ts'o, linux-fsdevel

On Wed, Dec 05, 2012 at 07:45:39AM -0800, Linus Torvalds wrote:
> On Wed, Dec 5, 2012 at 2:48 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> >
> > Linus, while I am interested in an answer I think that Dave and Christoph
> > as Linux filesystem developers actually deserve one (instead of silently
> > being ignored which is also a decision in this matter).
> >
> > I did not see an answer in linux-2.6 commit log as of today so far.
> 
> Christ guys. This whole thread is retarded.
> 
> The *ONLY* reason people seem to have for reverting that is a "ooh, my
> feelings are hurt by how this was done, and now I'm a pissy bitch and
> I want to get this reverted".
> 
> Stop the f*cking around already.
> 
> If you want something reverted, you show me the *technical* reason for
> it. Not the "ooh, I'm so annoyed by how this was done" reason for it.
> 
> And if your little feelings got hurt, get your mommy to tuck you in,
> don't email me about it. Because I'm not exactly known for my deep
> emotional understanding and supportive personality, am I?

No, the problem is that the thing is not just a) wrong, but b) only made
it in through sneaky ways.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06  1:14               ` Dave Chinner
  2012-12-06  3:03                 ` Linus Torvalds
@ 2012-12-06 12:06                 ` Christoph Hellwig
  2012-12-06 16:50                   ` Theodore Ts'o
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2012-12-06 12:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Martin Steigerwald, Linux Kernel Mailing List,
	Theodore Ts'o, linux-fsdevel

On Thu, Dec 06, 2012 at 12:14:02PM +1100, Dave Chinner wrote:
> On Wed, Dec 05, 2012 at 10:25:17AM -0800, Linus Torvalds wrote:
> > Yes, people can argue that "process" is about technical issues too,
> > but let's be honest: our process is fluid. Not everything gets
> > reviewed on the mailing list, and people *do* talk about things
> > face-to-face at conferences.
> 
> Yes, that is true.
> 
> But we don't review code at conferences. We have mailing lists for
> that.  I've lost count of the number of times I've heard "post your
> code" or "let's take it to the list" or "better to discuss this on
> the list where everyone can comment" at conferences such as
> Plumbers. It's a standard practise - talk at conferences, review
> on mailing lists.

Also the only conference outcome I remember is that everyone at LSF
except for Ted basically said "no fucking way".


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06 12:06                 ` Christoph Hellwig
@ 2012-12-06 16:50                   ` Theodore Ts'o
  2012-12-07  1:57                     ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-06 16:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Linus Torvalds, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 06, 2012 at 07:06:45AM -0500, Christoph Hellwig wrote:
> 
> Also the only conference outcome I remember is that everyone at LSF
> except for Ted basically said "no fucking way".
> 

At LSF, that's correct.  And as a result, the people who need this --
Google and Tao Bao -- have decided to keep the patch as an out-of-tree
patch, much like the Android wakelock patch was out of tree, and for
similar reasons --- because the community has rejected the
functionality.

At this point, I've only asked that the bit be reserved, so we don't
have to worry about codepoint collisions.  (We'd have the same issue
with an ioctl, BTW --- we would need to reserve an ioctl number to
avoid collisions, although granted there are ways to cleverly choose
an ioctl number that would reduce the chance of collisions even if it
isn't formally reserved.)

Note that this is not a kernel fork the same way the android wakelock
is not a kernel fork.  It's an out of tree patch which has been
rejected by upstream.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06  9:37                   ` Martin Steigerwald
@ 2012-12-07  1:08                     ` Ingo Molnar
  2012-12-07  2:40                       ` Dave Chinner
  2012-12-07 10:24                       ` Martin Steigerwald
  0 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2012-12-07  1:08 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-kernel, Linus Torvalds, Dave Chinner, Theodore Ts'o,
	linux-fsdevel


* Martin Steigerwald <Martin@lichtvoll.de> wrote:

> > The thing that people are complaining about is exactly the 
> > reverse of this. It's *protecting* us from making mistakes, 
> > and doesn't actually add any new interfaces in itself.
> >
> > This is why I'm so annoyed with this stupid thread. It's 
> > been going on forever, and reverting that change WOULD BE 
> > OBJECTIVELY A BAD IDEA.
> 
> See, thats where you have a problem with "reality".
> 
> It seems you cannot accept the fact that some developers 
> disliked the process in which this change was pushed. [...]

I don't think you have understood Linus's argument above.

The "process" does not change the object technical merits of a 
patch. Ever. This patch is _good_, and objectively good. No 
amount of 'bad process' can make this patch bad.

Now, hypothetically, if this was an objectively bad patch, then 
any "bad process" used to push it would add insult to injury and 
it could be reason enough to flame Tytso twice as hard.

But it turns out the patch was right and good, so kudos to Tytso 
for cutting through the bike shed painting and politicks of 
fsdevel - which "process" would have deprived us of a good 
patch...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06 12:05           ` Christoph Hellwig
@ 2012-12-07  1:16             ` Ingo Molnar
  2012-12-07  3:19               ` Dave Chinner
  2012-12-07 17:36               ` Ric Wheeler
  0 siblings, 2 replies; 69+ messages in thread
From: Ingo Molnar @ 2012-12-07  1:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, Theodore Ts'o, linux-fsdevel


* Christoph Hellwig <hch@infradead.org> wrote:

> No, the problem is that the thing is not just a) wrong, but b) 
> only made it in through sneaky ways.

People disagree with a), and b) only really matters if a) is 
true.

You never gave a technical reason for why protecting against 
future ABI clashes is 'wrong'. It looks like a marginally 
useful, practical patch to me.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-06 16:50                   ` Theodore Ts'o
@ 2012-12-07  1:57                     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-07  1:57 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig, Linus Torvalds,
	Martin Steigerwald, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 06, 2012 at 11:50:24AM -0500, Theodore Ts'o wrote:
> On Thu, Dec 06, 2012 at 07:06:45AM -0500, Christoph Hellwig wrote:
> > 
> > Also the only conference outcome I remember is that everyone at LSF
> > except for Ted basically said "no fucking way".
> > 
> 
> At LSF, that's correct.  And as a result, the people who need this --
> Google and Tao Bao -- have decided to keep the patch as an out-of-tree
> patch, much like the Android wakelock patch was out of tree, and for
> similar reasons --- because the community has rejected the
> functionality.

Sure. But your association argument can be shown to be a fallacy
very easily.  There was agreement that wakelock-like functionality
was needed, the problems were with the proposed implementation and
that everyone would work together on a solution. Hence the mainline
kernel now has integrated wakelock support.

Compare that to stale-no-hide: the -concept- was given a fairly
unanimous "not a chance in hell" send-off. There was no "lets rework
it into something acceptable" compromise - the concept was
rejected and so you simply cannot compare it to wakelocks.

> At this point, I've only asked that the bit be reserved, so we don't

Really? We wouldn't be having this discussion if you'd just asked...

> have to worry about codepoint collisions.  (We'd have the same issue
> with an ioctl, BTW --- we would need to reserve an ioctl number to
> avoid collisions, although granted there are ways to cleverly choose
> an ioctl number that would reduce the chance of collisions even if it
> isn't formally reserved.)

struct ext4_ioc_falloc {
...
};

/* security hole reserved for out-of-tree patches. */
#define EXT4_IOC_FALLOC_NOHIDE		_IOW('f', 10000, struct ext4_ioc_falloc)

Done. Not so hard, is it?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07  1:08                     ` Ingo Molnar
@ 2012-12-07  2:40                       ` Dave Chinner
  2012-12-07 10:24                       ` Martin Steigerwald
  1 sibling, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-07  2:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Steigerwald, linux-kernel, Linus Torvalds,
	Theodore Ts'o, linux-fsdevel

On Fri, Dec 07, 2012 at 02:08:37AM +0100, Ingo Molnar wrote:
> 
> * Martin Steigerwald <Martin@lichtvoll.de> wrote:
> 
> > > The thing that people are complaining about is exactly the 
> > > reverse of this. It's *protecting* us from making mistakes, 
> > > and doesn't actually add any new interfaces in itself.
> > >
> > > This is why I'm so annoyed with this stupid thread. It's 
> > > been going on forever, and reverting that change WOULD BE 
> > > OBJECTIVELY A BAD IDEA.
> > 
> > See, thats where you have a problem with "reality".
> > 
> > It seems you cannot accept the fact that some developers 
> > disliked the process in which this change was pushed. [...]
> 
> I don't think you have understood Linus's argument above.
> 
> The "process" does not change the object technical merits of a 
> patch. Ever. This patch is _good_, and objectively good. No 
> amount of 'bad process' can make this patch bad.

Yeah, Linus asserted the patch is good, and you're parroting that.
There isn't even any agreement on that level, let alone the other
problems like the fact it introduced undocumented changes to a
syscall API. people get raked over the coals frequently for doing
this sort of stuff, but in this case it's "good because Linus said
so".

Review gives the opportunity to make patches *better*. It might be
a good concept with a bad implementation (which is how I'd
categorise this patch), and review gives us the opportunity to sort
out those problems before we end up in a situation like this.

Especially for changes to syscall APIs....

> Now, hypothetically, if this was an objectively bad patch, then 
> any "bad process" used to push it would add insult to injury and 
> it could be reason enough to flame Tytso twice as hard.

Assumption: the patch is good

Reality: the concept is worthy, but the implmentation and and the
taken were terrible. Patch could be a lot better, but improvement
needs time and for syscall changes already in mainline we don't have
that time. either we revert it or we are stuck with it.

Right now, we're stuck with it....

> But it turns out the patch was right and good, so kudos to Tytso 
> for cutting through the bike shed painting and politicks of 
> fsdevel - which "process" would have deprived us of a good 
> patch...

Assumption: a) it's a good patch and b) review would have prevented
the patch from proceeding.

Reality: a) see above. b) Who can say - it never occurred?
Personally, I'm not opposed to reserving a bit but it needs
documentation and clarification over future scope and kernel
implementation, which review would have caught and the discussion
would have been in that context....

If you can't stand the heat, get out of the kitchen. IOWs, if you
can't defend your change on it's merits, then it shouldn't be made.
Sending your patch through a back door becuse you don't think you
can't defend it as you pass through the front door is simply *not
acceptable*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07  1:16             ` Ingo Molnar
@ 2012-12-07  3:19               ` Dave Chinner
  2012-12-07 17:36               ` Ric Wheeler
  1 sibling, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-07  3:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Martin Steigerwald,
	Linux Kernel Mailing List, Theodore Ts'o, linux-fsdevel

On Fri, Dec 07, 2012 at 02:16:28AM +0100, Ingo Molnar wrote:
> 
> * Christoph Hellwig <hch@infradead.org> wrote:
> 
> > No, the problem is that the thing is not just a) wrong, but b) 
> > only made it in through sneaky ways.
> 
> People disagree with a), and b) only really matters if a) is 
> true.

Wow.

Let me paraphrase that logic:

If (maintainer thinks their patch is right) {
	patch doesn't need review
} else {
	/* maintainer thinks the patch is wrong. */
	/* XXX: why would you think your own patch is wrong? */
	patch needs review
}

Ok, if you were to disagree with the maintainer of a *different
subsystem* about whether the patch is right or not, how would you
know that the maintainer has made the change in the first place? It
hasn't been posted anywhere public like everyone who is not a
maintainer does....

Indeed, it is Ted's application of your logic that is the *exact
problem* that started this thread. Restating it as justification
for what has happened is a circular argument.

> You never gave a technical reason for why protecting against 
> future ABI clashes is 'wrong'. It looks like a marginally 
> useful, practical patch to me.

The concept isn't wrong - but the implementation is definitely
sub-optimal. We document syscall API changes, write tests for them,
and explain when and where filesystems need to implement
functionality, etc, and none of this has been done. I'd call that a
good technical argument for reverting the change, especially as
review would have picked this up and it would have been corrected
before proceeding with the change.

Simply put:

(concept == good) != (implementation == good)

And that's why we have review - to make sure the implementation is
as good as it can possibly be.

When you bypass review with the justification of "the concept is
good", then you are basically saying code review is irrelevant to the
implementation quality. Nothing could be further from the truth, and
that's why b) matters more than a). And in this case, we haven't
even got past the question of whether the concept is good or not
because it hasn't even been asked....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07  1:08                     ` Ingo Molnar
  2012-12-07  2:40                       ` Dave Chinner
@ 2012-12-07 10:24                       ` Martin Steigerwald
  1 sibling, 0 replies; 69+ messages in thread
From: Martin Steigerwald @ 2012-12-07 10:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

Am Freitag, 7. Dezember 2012 schrieb Ingo Molnar:

> * Martin Steigerwald <Martin@lichtvoll.de> wrote:
> 
> > > The thing that people are complaining about is exactly the 
> > > reverse of this. It's *protecting* us from making mistakes, 
> > > and doesn't actually add any new interfaces in itself.
> > >
> > > This is why I'm so annoyed with this stupid thread. It's 
> > > been going on forever, and reverting that change WOULD BE 
> > > OBJECTIVELY A BAD IDEA.
> > 
> > See, thats where you have a problem with "reality".
> > 
> > It seems you cannot accept the fact that some developers 
> > disliked the process in which this change was pushed. [...]
> 
> I don't think you have understood Linus's argument above.
> 
> The "process" does not change the object technical merits of a 
> patch. Ever. This patch is _good_, and objectively good. No 
> amount of 'bad process' can make this patch bad.

A patch can´t be objectively good or bad. Unlike one of you developers see
yourself as a god who actually *really* knows it. Cause individual
developers write patches, review patches, have oppinions about patches.
Anything a subject created can´t be objective at all. Saying that one
patch is *objectively* good IMO carries a message like "I know it better
than you, go away" with it.

There have been different oppinions about the patch quality. And that
is what the review process was made for. At least so I thought.

> Now, hypothetically, if this was an objectively bad patch, then 
> any "bad process" used to push it would add insult to injury and 
> it could be reason enough to flame Tytso twice as hard.

I agree to Dave´s view here. If its good, why fear and bypass the review
process upsetting other developers? Actually my argument is that using
the review process the process can be more fluent.

This way comments of other kernel developers can contribute to
make this patch better than it is currently.

And if in the end the subsystem maintainer wants to take a patch despite
NACKs, he / she can still do it. At least thats how I understood it. But
then he / shes does it openly instead of sneaking a patch in, possibly hoping
other developers do not read git logs that closely.


Granted the patch can still be improved and actually I do think the patch flag
allocation deverses a better in source code comment about it.

So I suggest:

- On a next occasion Ted (or any other developer) goes through the
review process again. Especially put controversial patches through
the review process!

- For this time discuss constructively how to make the bit reservation
patch acceptable to Dave and Christoph, i.e. by adding some documentation.
As I read out of it Dave and Christoph can basically agree with the bit
reservation. Thus there is room for improving the patch.

> But it turns out the patch was right and good, so kudos to Tytso 
> for cutting through the bike shed painting and politicks of 
> fsdevel - which "process" would have deprived us of a good 
> patch...

I am astonished by the lack of confidence you seem to put into the review
process, Ingo.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07  1:16             ` Ingo Molnar
  2012-12-07  3:19               ` Dave Chinner
@ 2012-12-07 17:36               ` Ric Wheeler
  2012-12-07 18:18                 ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Ric Wheeler @ 2012-12-07 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

On 12/06/2012 08:16 PM, Ingo Molnar wrote:
> * Christoph Hellwig <hch@infradead.org> wrote:
>
>> No, the problem is that the thing is not just a) wrong, but b)
>> only made it in through sneaky ways.
> People disagree with a), and b) only really matters if a) is
> true.
>
> You never gave a technical reason for why protecting against
> future ABI clashes is 'wrong'. It looks like a marginally
> useful, practical patch to me.
>
> Thanks,
>
> 	Ingo
>

Hi Ingo,

The historical roots of the argument are not quite as clear here as you posit 
above. The need for the interface/ABI itself was the subject of the review.

The interface proposed - expose any deleted data without zeroing it - was 
requested not to enable a tool or fix a specific need. It was proposed in order 
to avoid tripping over an ext4 performance problem that occurs when we change 
allocated-but-unwritten extents into allocated and written.

This is a huge break with very long standing file system semantics - normally, 
we always promise to return to the application only data that you wrote or 
return zeroed blocks of data if you allocated it and did not write it.

This allows you to fallocate all unused space on disk, seek around and poke for 
other peoples' deleted data.  Aside from the obvious violation of expected 
privacy of deleted data (for non-root users at least), it could also break 
things that have the original expectations in place.

After LSF, we did try to reproduce the use case (not with a lot of success) and 
had several proposed ways to fix the ext4 performance challenge instead of using 
this hack to avoid it.

I would prefer to fix the performance issue in ext4 rather than add an interface 
that has no actual users of the actual feature - it exists for applications that 
want to avoid an unfortunate performance hit from something that we could work 
around.

If there are legitimate needs to expose the data to non-root users, it would be 
good to have that debate in the open and clarify the correct interface.

The process issue exposed is not one where "bike shedding" occurred - the 
proposed feature was discussed in person at LSF and on the mailing lists and 
debated and rejected.

Review is part of the way we work as a community and we should figure out how to 
fix our review process so that we can have meaningful results from the review or 
we lose confidence in the process and it makes it much harder to get reviewers 
to spend time reviewing when their reviews are ultimately ignored.

Regards,

Ric





^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 17:36               ` Ric Wheeler
@ 2012-12-07 18:18                 ` Linus Torvalds
  2012-12-07 19:03                   ` Chris Mason
  2012-12-07 19:30                   ` Steven Rostedt
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2012-12-07 18:18 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel



On Fri, 7 Dec 2012, Ric Wheeler wrote:
> 
> Review is part of the way we work as a community and we should figure out how
> to fix our review process so that we can have meaningful results from the
> review or we lose confidence in the process and it makes it much harder to get
> reviewers to spend time reviewing when their reviews are ultimately ignored.

Christ, I promised myself to not respond any more to this thread, but the 
insanity just continues, from people who damn well should know better.

The code wasn't merged. The review worked.

What you (and Dave, and Christoph) are trying to do is shut down a feature 
that somebody else decided they needed. That's not what code review is all 
about, and dammit, don't try to even claim it is.

So stop these dishonest and disingenious arguments. They are full of crap.

No amount of "review" has any meaning what-so-ever on whether somebody 
else decides they need a feature or not. You can review all you want, but 
it's irrelevant - if some company decides they are going to ship or use a 
feature, it's out of your hands.

What got merged was a ONE-LINER to make sure that possible future 
development didn't unnecessarily make things any more confusing, with the 
knowledge that there was a user of the code you didn't like. 

Every single argument I've heard of from the "please revert" camp has been 
inane. And they've been *transparently* inane, to the point where I don't 
understand how you can make them with a straght face and not be ashamed.

The arguments have been:

 - the code failed review, and shouldn't have been merged.

   Fine, and the code *wasn't* merged. What was merged was just a note in 
   the source code so that other people will know not to stomp on things 
   pointlessly so that things would get confusing for no reason.

 - "We have a process, and things should be done on the mailing list".

   Bullshit. What ended up in the tree was a one-liner patch that couldn't 
   possibly break anything, and fixed a small and trivial issue that 
   nobody should ever have even cared about (much less result in these 
   inane long threads)

   Anybody who claims that our "process" requires that things like that go 
   on the mailing list and pass long reviews and discussions IS JUST 
   LYING.

   Because it's not true. We discuss big features, and we want code 
   review, yes, but the fact is, most small obvious patches do *not* get 
   reviewed, they just get fixed. You all know that, why the hell are you 
   suddenly claiming that this is so magically different?

 - The "it's now open season and anything can be merged without review" 
   sky-is-falling argument (and yes, seriously, I've seen that insane 
   statement in this thread too, by a person who I thought was saner than 
   that)

   Read the above arguments, and realise how shrill and outright STUPID 
   that kind of "we can now do anything without review" argument is.

 - the totally unsupported claim that the patch that people are 
   complaining about is "bad". Christoph has now said so multiple times, 
   without ever actually backing it up in any way.

   Christoph: I can well imagine that you don't like the code that google 
   apparently uses. BUT THAT ISN'T WHAT YOU ARE ARGUING FOR REVERTING!

   You seem to seriously argue that it's a *bad* thing to put a note that 
   one bit is already in use. That's f*cking moronic. You are arguing that 
   it is bad idea to let people know about possible clashes. Really?

Guys, it's not like anybody else even wants to *use* the bit that was 
marked! If we were running out of bits, then the argument "That bit could 
be used for better, and we don't care about some random out-of-tree use" 
would be perfectly valid. And maybe we will revisit this in a year (or 
five) due to issues like that.

And that's _fine_. Once you have actual technical arguments ("I'd like to 
re-appropriate that bit, because xyzzy") you have real and valid 
arguments, and it would be easy to then do the sane "let's just use the 
bit for something more worthy" thing. But even then it's nice to have the 
knowledge about what the implications of such use would be, for chrissake!

But that's not what the insane and pointless arguments in this thread have 
been. The whole thread has been just choch-full of pure STUPID.

Please stop the inane and idiotic arguments already. The "we must review 
every one-liner, and this destroys and makes a mockery of our review 
process" argument in particular has been dishonest and pure crap. What has 
made this simple and obvious patch so special in your minds?

                   Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 18:18                 ` Linus Torvalds
@ 2012-12-07 19:03                   ` Chris Mason
  2012-12-07 20:43                     ` Theodore Ts'o
  2012-12-08  0:17                     ` Dave Chinner
  2012-12-07 19:30                   ` Steven Rostedt
  1 sibling, 2 replies; 69+ messages in thread
From: Chris Mason @ 2012-12-07 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

On Fri, Dec 07, 2012 at 11:18:00AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 7 Dec 2012, Ric Wheeler wrote:
> > 
> > Review is part of the way we work as a community and we should figure out how
> > to fix our review process so that we can have meaningful results from the
> > review or we lose confidence in the process and it makes it much harder to get
> > reviewers to spend time reviewing when their reviews are ultimately ignored.
> 
> Christ, I promised myself to not respond any more to this thread, but the 
> insanity just continues, from people who damn well should know better.
> 
> The code wasn't merged. The review worked.
> 
> What you (and Dave, and Christoph) are trying to do is shut down a feature 
> that somebody else decided they needed. That's not what code review is all 
> about, and dammit, don't try to even claim it is.
> 
> So stop these dishonest and disingenious arguments. They are full of crap.
> 
> No amount of "review" has any meaning what-so-ever on whether somebody 
> else decides they need a feature or not. You can review all you want, but 
> it's irrelevant - if some company decides they are going to ship or use a 
> feature, it's out of your hands.
> 
> What got merged was a ONE-LINER to make sure that possible future 
> development didn't unnecessarily make things any more confusing, with the 
> knowledge that there was a user of the code you didn't like. 
> 
> Every single argument I've heard of from the "please revert" camp has been 
> inane. And they've been *transparently* inane, to the point where I don't 
> understand how you can make them with a straght face and not be ashamed.

I really agree with Dave's statement that we should ioctl for private
features and system call for features other filesystems are likely to
implement.  So we really shouldn't have private bits in fallocate in use
in production systems.

That's not what happened though, and the right way forward from here is
to give the bit to the feature, maybe with a generic name like
FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  It should have been done
differently, but it wasn't.  And it's a problem we all have, so it makes
sense that we'll all want to address it somehow.

On a single flash drive doing random 4K writes, xfs does 950MB/s into
regular extents but only 400MB/s into preallocated extents.

http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png

ext4 has a bigger hit, but there's a little room for improvement all
around.

Maybe we should use this thread as the starting point for the proper
12-18 months of bike shedding for a real fix?

-chris


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 18:18                 ` Linus Torvalds
  2012-12-07 19:03                   ` Chris Mason
@ 2012-12-07 19:30                   ` Steven Rostedt
  2012-12-07 21:14                     ` Theodore Ts'o
  1 sibling, 1 reply; 69+ messages in thread
From: Steven Rostedt @ 2012-12-07 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ric Wheeler, Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, Theodore Ts'o,
	linux-fsdevel

On Fri, Dec 07, 2012 at 10:18:00AM -0800, Linus Torvalds wrote:
> 
> And that's _fine_. Once you have actual technical arguments ("I'd like to 
> re-appropriate that bit, because xyzzy") you have real and valid 
> arguments, and it would be easy to then do the sane "let's just use the 
> bit for something more worthy" thing. But even then it's nice to have the 
> knowledge about what the implications of such use would be, for chrissake!

First I want to say that I don't really give a crap about this bit, and
whether or not it's been merged or not. But I think Ric did have one
valid point:

>> I would prefer to fix the performance issue in ext4 rather than add an
>> interface that has no actual users of the actual feature - it exists
>> for applications that want to avoid an unfortunate performance hit
>> from something that we could work around.

This is similar to the preemptible big kernel lock. The BKL has been an
issue for almost everyone, and a really big one for the real-time
folks. But because it was converted into a preemtible lock, it was
dropped low on the priority list of things to get fixed. But it wasn't
until you reverted this code to kick our ass into gear to go about and fix
the issue.

How is this similar? By adding this bit, we removed incentive from a
group of developers that have the means to fix the real issue at hand
(the performance problem with ext4). Thus, it means that they have a work
around that's good enough for them, but the rest of us suffer.

Sure, they could just modify their kernel to keep this bit, but as it
wouldn't be reserved, they would have to worry about colisions in the
future. Thus the incentive to fix the problem at hand. Just like we had
the preempt BKL in the real-time patch, but it was just better to get it
fixed in mainline. Thanks to Frederic that started the work, and then Arnd
that finished it, we no longer have that nasty lock around to deal with.

You were very wise to revert the preemptible BKL, as I believe we
probably would still have BKL in the kernel if you didn't do that. Maybe
reverting this bit might do the same thing? Maybe not. It's your call.

-- Steve


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 19:03                   ` Chris Mason
@ 2012-12-07 20:43                     ` Theodore Ts'o
  2012-12-07 21:09                       ` Chris Mason
  2012-12-08  0:17                     ` Dave Chinner
  1 sibling, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-07 20:43 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> 
> That's not what happened though, and the right way forward from here is
> to give the bit to the feature, maybe with a generic name like
> FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  

I don't think that's a good idea, because the current name explicitly
calls out the fact that we are making a tradeoff between what
***might*** be a security exposure in some cases (but which might be
perfectly fine in others) for performance.  Using the generic name
would hide the fact that this tradeoff is being made, and the
arguments (which I've never seen backed up with a specific design) is
that it's possible to speed up random writes into preallocated space
on a flash device without making any kind of tradeoff that might imply
a security tradeoff.

If indeed it is possible to speed up this particular workload without
making any kind of no-hide-stale tradeoff, then we won't need the bit
--- writes into fallocated space will just get faster, with or without
the bit

I am sure it will be possible to do this in some cases (for example if
you have a device that supports persistent trim which can quickly
zeroize the blocks in question), but I would be very surprised if it's
possible to completely eliminate the performance degradation for all
devices and workloads.  (Not all storage devices support persistent
trim, just for starters.)

In answer's to Linus's question, the reason why people are
hyperventilating so badly about this is that in some circles,
revealing stale data is so horrible that anyone who even tries to
suggest this should be excommunicated.  The mere existence of the
code, or that people are using it, horribly offends those people.

(Witness how people reacted when Linus changed the default of ext3 to
data=writeback many years ago in order to solve a performance problem
that was impacting him and other desktop users very badly.  At the
time, I understood why he made that change, but I know a lot of people
were horrified.  These days we have a better solution which is used by
ext4, but for desktop users at that time, it was a completely fair
engineering choice to make.)

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 20:43                     ` Theodore Ts'o
@ 2012-12-07 21:09                       ` Chris Mason
  2012-12-07 21:27                         ` Theodore Ts'o
  2012-12-07 21:42                         ` Ric Wheeler
  0 siblings, 2 replies; 69+ messages in thread
From: Chris Mason @ 2012-12-07 21:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 01:43:25PM -0700, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> > 
> > That's not what happened though, and the right way forward from here is
> > to give the bit to the feature, maybe with a generic name like
> > FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  
> 
> I don't think that's a good idea, because the current name explicitly
> calls out the fact that we are making a tradeoff between what
> ***might*** be a security exposure in some cases (but which might be
> perfectly fine in others) for performance.  Using the generic name
> would hide the fact that this tradeoff is being made, and the
> arguments (which I've never seen backed up with a specific design) is
> that it's possible to speed up random writes into preallocated space
> on a flash device without making any kind of tradeoff that might imply
> a security tradeoff.

Grin, we're really good at debating names.  But I do see what you mean.
I'd hope that whatever generic facility we put in doesn't have the
security implications.

> 
> If indeed it is possible to speed up this particular workload without
> making any kind of no-hide-stale tradeoff, then we won't need the bit
> --- writes into fallocated space will just get faster, with or without
> the bit
> 
> I am sure it will be possible to do this in some cases (for example if
> you have a device that supports persistent trim which can quickly
> zeroize the blocks in question), but I would be very surprised if it's
> possible to completely eliminate the performance degradation for all
> devices and workloads.  (Not all storage devices support persistent
> trim, just for starters.)

Persistent trim is what I had in mind, but there are other ideas that do
imply a change in behavior as well.  Can we safely assume this feature
won't matter on spinning media?  New features like persistent
trim do make it much easier to solve securely, and using a bit for it
means we can toss back an error to the app if the underlying storage
isn't safe.

If google wants to have a block device patch that pretends to persistent
trim on devices that can't, great.

> 
> In answer's to Linus's question, the reason why people are
> hyperventilating so badly about this is that in some circles,
> revealing stale data is so horrible that anyone who even tries to
> suggest this should be excommunicated.  The mere existence of the
> code, or that people are using it, horribly offends those people.

So I've always said this was a real performance problem and that it
isn't just limited to ext4.  But can we please move past this part?  I
don't think it is completely accurate.

-chris

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 19:30                   ` Steven Rostedt
@ 2012-12-07 21:14                     ` Theodore Ts'o
  2012-12-07 21:47                       ` Ric Wheeler
                                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-07 21:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Ric Wheeler, Ingo Molnar, Christoph Hellwig,
	Martin Steigerwald, Linux Kernel Mailing List, Dave Chinner,
	linux-fsdevel

On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
> How is this similar? By adding this bit, we removed incentive from a
> group of developers that have the means to fix the real issue at hand
> (the performance problem with ext4). Thus, it means that they have a work
> around that's good enough for them, but the rest of us suffer.

That assumes that there **is** a way to claw back the performance
loss, and Chris Mason has demonstrated the performance hit exists with
xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
Sometimes, you have to make the engineering tradeoffs.  That's why
we're engineers, for goodness sakes.  Sometimes, it's just not
possible to square the circle.

I don't believe that the technique of forcing people who need that
performance to suffer in order to induce them to try to engineer a
solution which may or may not exist is really the best or fairest way
to go about things.

					- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:09                       ` Chris Mason
@ 2012-12-07 21:27                         ` Theodore Ts'o
  2012-12-07 21:43                           ` Chris Mason
  2012-12-07 21:42                         ` Ric Wheeler
  1 sibling, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-07 21:27 UTC (permalink / raw)
  To: Chris Mason, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> Persistent trim is what I had in mind, but there are other ideas that do
> imply a change in behavior as well.  Can we safely assume this feature
> won't matter on spinning media?  New features like persistent
> trim do make it much easier to solve securely, and using a bit for it
> means we can toss back an error to the app if the underlying storage
> isn't safe.

We originally implemented no hide stale for spinning media.  Some
folks have claimed that for XFS their superior technology means that
no hide stale doesn't buy them anything for HDD's.  I'm not entirely
sure I buy this, since if you need to update metadata, it means at
least one extra seek for each random write into 4k preallocated space,
and 7200 RPM disks only have about 200 seeks per second.

One of the problems that I've seen is that as disks get bigger, the
number of seeks per second have remained constant, and so an
application which required N TB spread out over a large number of
disks might now only require a fraction of the number of disks --- so
it's very easy for a cluster file system to become seek constrained by
the number of spindles that you have, and not capacity constrained.

This to me seems to be a fundamental problem, and I don't think it's
possible to wave one's hands to get rid of it.  All you can say is
that the people who care about this are crazy (that's OK, I don't mind
when Christoph or Dave call me crazy :-), and that their workload
doesn't matter.  But if you are trying to optimize out every last
seek, because you desperately care about latency and seeks are a
precious and scarce resource[1], then I don't see around away the
technique of not requiring an update to the metadata at the time that
you write the data block, and that kinda implies no-hide-stale.

Regards,

					- Ted

[1] Even if you don't care about the latency of the write operation,
the fact that the write operation has to do two seeks and not one can
very well slow down a subsequent high priority read request, where you
*do* care about latency.  The problem is that you only have about 200
seeks per spindle.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:09                       ` Chris Mason
  2012-12-07 21:27                         ` Theodore Ts'o
@ 2012-12-07 21:42                         ` Ric Wheeler
  2012-12-07 21:57                           ` Theodore Ts'o
  1 sibling, 1 reply; 69+ messages in thread
From: Ric Wheeler @ 2012-12-07 21:42 UTC (permalink / raw)
  To: Chris Mason, Theodore Ts'o, Chris Mason, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On 12/07/2012 04:09 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 01:43:25PM -0700, Theodore Ts'o wrote:
>> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
>>> That's not what happened though, and the right way forward from here is
>>> to give the bit to the feature, maybe with a generic name like
>>> FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.
>> I don't think that's a good idea, because the current name explicitly
>> calls out the fact that we are making a tradeoff between what
>> ***might*** be a security exposure in some cases (but which might be
>> perfectly fine in others) for performance.  Using the generic name
>> would hide the fact that this tradeoff is being made, and the
>> arguments (which I've never seen backed up with a specific design) is
>> that it's possible to speed up random writes into preallocated space
>> on a flash device without making any kind of tradeoff that might imply
>> a security tradeoff.
> Grin, we're really good at debating names.  But I do see what you mean.
> I'd hope that whatever generic facility we put in doesn't have the
> security implications.

I would suggest a name like "let me see other peoples data, pronto"

>> If indeed it is possible to speed up this particular workload without
>> making any kind of no-hide-stale tradeoff, then we won't need the bit
>> --- writes into fallocated space will just get faster, with or without
>> the bit
>>
>> I am sure it will be possible to do this in some cases (for example if
>> you have a device that supports persistent trim which can quickly
>> zeroize the blocks in question), but I would be very surprised if it's
>> possible to completely eliminate the performance degradation for all
>> devices and workloads.  (Not all storage devices support persistent
>> trim, just for starters.)
> Persistent trim is what I had in mind, but there are other ideas that do
> imply a change in behavior as well.  Can we safely assume this feature
> won't matter on spinning media?  New features like persistent
> trim do make it much easier to solve securely, and using a bit for it
> means we can toss back an error to the app if the underlying storage
> isn't safe.
>
> If google wants to have a block device patch that pretends to persistent
> trim on devices that can't, great.

The other things that I think we should try would be to convert over larger 
chunks as we discussed on the list back in the summer (just because the user 
writes 4KB does not mean that we cannot flip over 1MB and zero that).

>
>> In answer's to Linus's question, the reason why people are
>> hyperventilating so badly about this is that in some circles,
>> revealing stale data is so horrible that anyone who even tries to
>> suggest this should be excommunicated.  The mere existence of the
>> code, or that people are using it, horribly offends those people.
> So I've always said this was a real performance problem and that it
> isn't just limited to ext4.  But can we please move past this part?  I
> don't think it is completely accurate.

The thing that bothers me is that no one wants to use this "feature" to see the 
stale data, just to benefit from a coincidental performance bump

Most features need to have a defined use case as opposed to a side effect as 
their motivation.

Let's focus on fixing the performance in a way that would be useful to a broader 
swath of users. To be clear, I certainly would never ship this in a distro I was 
involved in.

With or without the bit, we need to fix this properly if it is a meaningful 
workload.

ric




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:27                         ` Theodore Ts'o
@ 2012-12-07 21:43                           ` Chris Mason
  2012-12-07 21:49                             ` Ric Wheeler
  0 siblings, 1 reply; 69+ messages in thread
From: Chris Mason @ 2012-12-07 21:43 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> > Persistent trim is what I had in mind, but there are other ideas that do
> > imply a change in behavior as well.  Can we safely assume this feature
> > won't matter on spinning media?  New features like persistent
> > trim do make it much easier to solve securely, and using a bit for it
> > means we can toss back an error to the app if the underlying storage
> > isn't safe.
> 
> We originally implemented no hide stale for spinning media.  Some
> folks have claimed that for XFS their superior technology means that
> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
> sure I buy this, since if you need to update metadata, it means at
> least one extra seek for each random write into 4k preallocated space,
> and 7200 RPM disks only have about 200 seeks per second.

True, 7200 RPM disks are slow, but even allowing them to expose stale
data just makes them a little less slow.

I know it's against the rules to pretend that disks don't matter.  But
really, once you're doing random IO into a spindle you've given up on
performance anyway.

-chris

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:14                     ` Theodore Ts'o
@ 2012-12-07 21:47                       ` Ric Wheeler
  2012-12-07 23:25                         ` Howard Chu
  2012-12-07 22:01                       ` Eric Sandeen
  2012-12-09 21:37                       ` Ric Wheeler
  2 siblings, 1 reply; 69+ messages in thread
From: Ric Wheeler @ 2012-12-07 21:47 UTC (permalink / raw)
  To: Theodore Ts'o, Steven Rostedt, Linus Torvalds, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
>> How is this similar? By adding this bit, we removed incentive from a
>> group of developers that have the means to fix the real issue at hand
>> (the performance problem with ext4). Thus, it means that they have a work
>> around that's good enough for them, but the rest of us suffer.
> That assumes that there **is** a way to claw back the performance
> loss, and Chris Mason has demonstrated the performance hit exists with
> xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
> Sometimes, you have to make the engineering tradeoffs.  That's why
> we're engineers, for goodness sakes.  Sometimes, it's just not
> possible to square the circle.
>
> I don't believe that the technique of forcing people who need that
> performance to suffer in order to induce them to try to engineer a
> solution which may or may not exist is really the best or fairest way
> to go about things.
>
> 					- Ted

This is not a generally useful feature and won't ship in a way that helps most 
users with this issue.

Let's fix the problem properly.

In the meantime, there are several obvious ways to avoid this performance hit 
without changing the kernel (fully allocate and write the data, certainly 
reasonable for even reasonable sized files).

Ric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:43                           ` Chris Mason
@ 2012-12-07 21:49                             ` Ric Wheeler
  2012-12-07 21:57                               ` Chris Mason
  0 siblings, 1 reply; 69+ messages in thread
From: Ric Wheeler @ 2012-12-07 21:49 UTC (permalink / raw)
  To: Chris Mason, Theodore Ts'o, Chris Mason, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On 12/07/2012 04:43 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
>> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
>>> Persistent trim is what I had in mind, but there are other ideas that do
>>> imply a change in behavior as well.  Can we safely assume this feature
>>> won't matter on spinning media?  New features like persistent
>>> trim do make it much easier to solve securely, and using a bit for it
>>> means we can toss back an error to the app if the underlying storage
>>> isn't safe.
>> We originally implemented no hide stale for spinning media.  Some
>> folks have claimed that for XFS their superior technology means that
>> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
>> sure I buy this, since if you need to update metadata, it means at
>> least one extra seek for each random write into 4k preallocated space,
>> and 7200 RPM disks only have about 200 seeks per second.
> True, 7200 RPM disks are slow, but even allowing them to expose stale
> data just makes them a little less slow.
>
> I know it's against the rules to pretend that disks don't matter.  But
> really, once you're doing random IO into a spindle you've given up on
> performance anyway.
>
> -chris

That's right.

And equally true, once you have moved the disk heads to that track, you can 
write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will also 
avoid fragmentation of the extents.

I think it would be good to see how much that gets back for us,

Ric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:42                         ` Ric Wheeler
@ 2012-12-07 21:57                           ` Theodore Ts'o
  2012-12-07 22:02                             ` Ric Wheeler
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-07 21:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Chris Mason, Linus Torvalds, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> The other things that I think we should try would be to convert over
> larger chunks as we discussed on the list back in the summer (just
> because the user writes 4KB does not mean that we cannot flip over
> 1MB and zero that).

Writing a megabyte is not free.  If you assume that your HDD has a
sustained write throughput of 100-125 MB/s, writing a megabyte will
take 8-10ms.  It might be a win if you amortize it over a large number
of writes, but it doesn't help your 99.9 percentile latency numbers.
(99.9 percentile latency numbers matters because eventually you'll
have a user request which hits multiple serial long latency
operations, and then the delay looks **really** user visible.)

	    	     	       	     		- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:49                             ` Ric Wheeler
@ 2012-12-07 21:57                               ` Chris Mason
  2012-12-07 22:51                                 ` Eric Sandeen
  2012-12-07 22:52                                 ` Eric Sandeen
  0 siblings, 2 replies; 69+ messages in thread
From: Chris Mason @ 2012-12-07 21:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Mason, Theodore Ts'o, Linus Torvalds, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Dave Chinner, linux-fsdevel

On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
> On 12/07/2012 04:43 PM, Chris Mason wrote:
> > On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
> >> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> >>> Persistent trim is what I had in mind, but there are other ideas that do
> >>> imply a change in behavior as well.  Can we safely assume this feature
> >>> won't matter on spinning media?  New features like persistent
> >>> trim do make it much easier to solve securely, and using a bit for it
> >>> means we can toss back an error to the app if the underlying storage
> >>> isn't safe.
> >> We originally implemented no hide stale for spinning media.  Some
> >> folks have claimed that for XFS their superior technology means that
> >> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
> >> sure I buy this, since if you need to update metadata, it means at
> >> least one extra seek for each random write into 4k preallocated space,
> >> and 7200 RPM disks only have about 200 seeks per second.
> > True, 7200 RPM disks are slow, but even allowing them to expose stale
> > data just makes them a little less slow.
> >
> > I know it's against the rules to pretend that disks don't matter.  But
> > really, once you're doing random IO into a spindle you've given up on
> > performance anyway.
> >
> > -chris
> 
> That's right.
> 
> And equally true, once you have moved the disk heads to that track, you can 
> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will also 
> avoid fragmentation of the extents.

When you do a 4K write, you have to remember that you've written just
those 4K.  When you do a 1MB write, you have to remember that you've
written just that 1MB.  It's the same operation, except with the 1MB
you've also had to setup all the bios and send down the zeros, and do
the proper locking to make sure you're not sending zeros down over
some concurrent IO.

The 1MB setup is actually more work, but it does greatly reduce the
amount of time the workload needs to run before it goes into a steady
state.  For smaller files it may work well, but for larger ones I don't
think it will be enough.

-chris

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:14                     ` Theodore Ts'o
  2012-12-07 21:47                       ` Ric Wheeler
@ 2012-12-07 22:01                       ` Eric Sandeen
  2012-12-09 21:37                       ` Ric Wheeler
  2 siblings, 0 replies; 69+ messages in thread
From: Eric Sandeen @ 2012-12-07 22:01 UTC (permalink / raw)
  To: Theodore Ts'o, Steven Rostedt, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On 12/7/12 3:14 PM, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
>> How is this similar? By adding this bit, we removed incentive from a
>> group of developers that have the means to fix the real issue at hand
>> (the performance problem with ext4). Thus, it means that they have a work
>> around that's good enough for them, but the rest of us suffer.
> 
> That assumes that there **is** a way to claw back the performance
> loss, and Chris Mason has demonstrated the performance hit exists with
> xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).

But he has not demonstrated that it can't be improved in XFS; I don't
think that anyone in the XFS community has even begun to look at
whether it can be improved ...

> Sometimes, you have to make the engineering tradeoffs.  That's why
> we're engineers, for goodness sakes.  Sometimes, it's just not
> possible to square the circle.

... so this strikes me as a bit premature.

> I don't believe that the technique of forcing people who need that
> performance to suffer in order to induce them to try to engineer a
> solution which may or may not exist is really the best or fairest way
> to go about things.

Have we exhausted efforts to improve ext4 as well?  Have we even identified
the performance bottlenecks yet via profiling?

What this seems to be is behavior nobody has asked for (expose
other users' stale data) in the name of solving a performance problem
(fine-grained conversion of unwritten extents comes at a non-negligible cost).

-Eric

> 					- Ted


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:57                           ` Theodore Ts'o
@ 2012-12-07 22:02                             ` Ric Wheeler
  2012-12-08  0:39                               ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Ric Wheeler @ 2012-12-07 22:02 UTC (permalink / raw)
  To: Theodore Ts'o, Chris Mason, Chris Mason, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
>> The other things that I think we should try would be to convert over
>> larger chunks as we discussed on the list back in the summer (just
>> because the user writes 4KB does not mean that we cannot flip over
>> 1MB and zero that).
> Writing a megabyte is not free.  If you assume that your HDD has a
> sustained write throughput of 100-125 MB/s, writing a megabyte will
> take 8-10ms.  It might be a win if you amortize it over a large number
> of writes, but it doesn't help your 99.9 percentile latency numbers.
> (99.9 percentile latency numbers matters because eventually you'll
> have a user request which hits multiple serial long latency
> operations, and then the delay looks **really** user visible.)
>
> 	    	     	       	     		- Ted

Writing 4KB at a time to a disk cost XX units of time.

Writing to the same sector (especially for a HDD), cost XX units + a small amount.

I suggest that we try it out.

For SSD's, much better to use specific HW offload commands if possible like 
WRITE_SAME (zeroed) or UNMAP/TRIM to get that performance boost since no actual 
data is moved...

ric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:57                               ` Chris Mason
@ 2012-12-07 22:51                                 ` Eric Sandeen
  2012-12-07 22:52                                 ` Eric Sandeen
  1 sibling, 0 replies; 69+ messages in thread
From: Eric Sandeen @ 2012-12-07 22:51 UTC (permalink / raw)
  To: Chris Mason, Ric Wheeler, Chris Mason, Theodore Ts'o,
	Linus Torvalds, Ingo Molnar, Christoph Hellwig,
	Martin Steigerwald, Linux Kernel Mailing List, Dave Chinner,
	linux-fsdevel

On 12/7/12 3:57 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
>> On 12/07/2012 04:43 PM, Chris Mason wrote:
>>> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
>>>> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
>>>>> Persistent trim is what I had in mind, but there are other ideas that do
>>>>> imply a change in behavior as well.  Can we safely assume this feature
>>>>> won't matter on spinning media?  New features like persistent
>>>>> trim do make it much easier to solve securely, and using a bit for it
>>>>> means we can toss back an error to the app if the underlying storage
>>>>> isn't safe.
>>>> We originally implemented no hide stale for spinning media.  Some
>>>> folks have claimed that for XFS their superior technology means that
>>>> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
>>>> sure I buy this, since if you need to update metadata, it means at
>>>> least one extra seek for each random write into 4k preallocated space,
>>>> and 7200 RPM disks only have about 200 seeks per second.
>>> True, 7200 RPM disks are slow, but even allowing them to expose stale
>>> data just makes them a little less slow.
>>>
>>> I know it's against the rules to pretend that disks don't matter.  But
>>> really, once you're doing random IO into a spindle you've given up on
>>> performance anyway.
>>>
>>> -chris
>>
>> That's right.
>>
>> And equally true, once you have moved the disk heads to that track, you can 
>> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will also 
>> avoid fragmentation of the extents.
> 
> When you do a 4K write, you have to remember that you've written just
> those 4K.  When you do a 1MB write, you have to remember that you've
> written just that 1MB.  It's the same operation, except with the 1MB
> you've also had to setup all the bios and send down the zeros, and do
> the proper locking to make sure you're not sending zeros down over
> some concurrent IO.
> 
> The 1MB setup is actually more work, but it does greatly reduce the
> amount of time the workload needs to run before it goes into a steady
> state.  For smaller files it may work well, but for larger ones I don't
> think it will be enough.

Ext4 already does this, actually, I think - see s_extent_max_zeroout_kb
and how it's used.

        /* If extent is less than s_max_zeroout_kb, zeroout directly */

It's not a tunable (*gasp* ;)) but it's currently set to "32" as in
32 kb.  Would be fun to bump that up and see how your test goes.

-Eric

> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:57                               ` Chris Mason
  2012-12-07 22:51                                 ` Eric Sandeen
@ 2012-12-07 22:52                                 ` Eric Sandeen
  1 sibling, 0 replies; 69+ messages in thread
From: Eric Sandeen @ 2012-12-07 22:52 UTC (permalink / raw)
  To: Chris Mason, Ric Wheeler, Chris Mason, Theodore Ts'o,
	Linus Torvalds, Ingo Molnar, Christoph Hellwig,
	Martin Steigerwald, Linux Kernel Mailing List, Dave Chinner,
	linux-fsdevel

On 12/7/12 3:57 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
>> On 12/07/2012 04:43 PM, Chris Mason wrote:
>>> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
>>>> On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
>>>>> Persistent trim is what I had in mind, but there are other ideas that do
>>>>> imply a change in behavior as well.  Can we safely assume this feature
>>>>> won't matter on spinning media?  New features like persistent
>>>>> trim do make it much easier to solve securely, and using a bit for it
>>>>> means we can toss back an error to the app if the underlying storage
>>>>> isn't safe.
>>>> We originally implemented no hide stale for spinning media.  Some
>>>> folks have claimed that for XFS their superior technology means that
>>>> no hide stale doesn't buy them anything for HDD's.  I'm not entirely
>>>> sure I buy this, since if you need to update metadata, it means at
>>>> least one extra seek for each random write into 4k preallocated space,
>>>> and 7200 RPM disks only have about 200 seeks per second.
>>> True, 7200 RPM disks are slow, but even allowing them to expose stale
>>> data just makes them a little less slow.
>>>
>>> I know it's against the rules to pretend that disks don't matter.  But
>>> really, once you're doing random IO into a spindle you've given up on
>>> performance anyway.
>>>
>>> -chris
>>
>> That's right.
>>
>> And equally true, once you have moved the disk heads to that track, you can 
>> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will also 
>> avoid fragmentation of the extents.
> 
> When you do a 4K write, you have to remember that you've written just
> those 4K.  When you do a 1MB write, you have to remember that you've
> written just that 1MB.  It's the same operation, except with the 1MB
> you've also had to setup all the bios and send down the zeros, and do
> the proper locking to make sure you're not sending zeros down over
> some concurrent IO.
> 
> The 1MB setup is actually more work, but it does greatly reduce the
> amount of time the workload needs to run before it goes into a steady
> state.  For smaller files it may work well, but for larger ones I don't
> think it will be enough.

Ext4 already does this, actually, I think - see s_extent_max_zeroout_kb
and how it's used.

        /* If extent is less than s_max_zeroout_kb, zeroout directly */

It's not a tunable (*gasp* ;)) but it's currently set to "32" as in
32 kb.  Would be fun to bump that up and see how your test goes.

-Eric

> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:47                       ` Ric Wheeler
@ 2012-12-07 23:25                         ` Howard Chu
  2012-12-08  0:50                           ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Howard Chu @ 2012-12-07 23:25 UTC (permalink / raw)
  To: Ric Wheeler, Theodore Ts'o, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

Ric Wheeler wrote:
> On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
>> On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
>>> How is this similar? By adding this bit, we removed incentive from a
>>> group of developers that have the means to fix the real issue at hand
>>> (the performance problem with ext4). Thus, it means that they have a work
>>> around that's good enough for them, but the rest of us suffer.
>> That assumes that there **is** a way to claw back the performance
>> loss, and Chris Mason has demonstrated the performance hit exists with
>> xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
>> Sometimes, you have to make the engineering tradeoffs.  That's why
>> we're engineers, for goodness sakes.  Sometimes, it's just not
>> possible to square the circle.
>>
>> I don't believe that the technique of forcing people who need that
>> performance to suffer in order to induce them to try to engineer a
>> solution which may or may not exist is really the best or fairest way
>> to go about things.
>>
>> 					- Ted
>
> This is not a generally useful feature and won't ship in a way that helps most
> users with this issue.

> Let's fix the problem properly.
>
> In the meantime, there are several obvious ways to avoid this performance hit
> without changing the kernel (fully allocate and write the data, certainly
> reasonable for even reasonable sized files).

I have to agree that, if this is going to be an ext4-specific feature, then it 
can just be implemented via an ext4-specific ioctl and be done with it. But 
I'm not convinced this should be an ext4-specific feature.

As for "fix the problem properly" - you're fixing the wrong problem. This type 
of feature is important to me, not just because of the performance issue. As 
has already been pointed out, the performance difference may even be negligible.

But on SSDs, the issue is write endurance. The whole point of preallocating a 
file is to avoid doing incremental metadata updates. Particularly when each of 
those 1-bit status updates costs entire blocks, and gratuitously shortens the 
life of the media. The fact that avoiding the unnecessary wear and tear may 
also yield a performance boost is just icing on the cake. (And if the perf 
boost is over a factor of 2:1 that's some pretty damn good icing.)

There are certainly ways in which a feature like this could be deployed 
safely, or at least, without violating anyone's expectations of security. For 
example, you have braindead filesystems like FAT that don't actually support 
per-file owner/group info. If you have a filesystem where all of the files are 
known to belong to the same user, then the whole argument about "seeing 
someone else's data" is moot. If you provide the uid=/gid= mount options 
generically across all (or most) filesystem types, then you can let a sysadmin 
decide if they want to play this way or not.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 19:03                   ` Chris Mason
  2012-12-07 20:43                     ` Theodore Ts'o
@ 2012-12-08  0:17                     ` Dave Chinner
  2012-12-08  1:39                       ` Chris Mason
  2012-12-10 17:37                       ` Theodore Ts'o
  1 sibling, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-08  0:17 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Theodore Ts'o, linux-fsdevel

On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 11:18:00AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 7 Dec 2012, Ric Wheeler wrote:
> > > 
> > > Review is part of the way we work as a community and we should figure out how
> > > to fix our review process so that we can have meaningful results from the
> > > review or we lose confidence in the process and it makes it much harder to get
> > > reviewers to spend time reviewing when their reviews are ultimately ignored.
> > 
> > Christ, I promised myself to not respond any more to this thread, but the 
> > insanity just continues, from people who damn well should know better.
> > 
> > The code wasn't merged. The review worked.
> > 
> > What you (and Dave, and Christoph) are trying to do is shut down a feature 
> > that somebody else decided they needed. That's not what code review is all 
> > about, and dammit, don't try to even claim it is.
> > 
> > So stop these dishonest and disingenious arguments. They are full of crap.
> > 
> > No amount of "review" has any meaning what-so-ever on whether somebody 
> > else decides they need a feature or not. You can review all you want, but 
> > it's irrelevant - if some company decides they are going to ship or use a 
> > feature, it's out of your hands.
> > 
> > What got merged was a ONE-LINER to make sure that possible future 
> > development didn't unnecessarily make things any more confusing, with the 
> > knowledge that there was a user of the code you didn't like. 
> > 
> > Every single argument I've heard of from the "please revert" camp has been 
> > inane. And they've been *transparently* inane, to the point where I don't 
> > understand how you can make them with a straght face and not be ashamed.
> 
> I really agree with Dave's statement that we should ioctl for private
> features and system call for features other filesystems are likely to
> implement.  So we really shouldn't have private bits in fallocate in use
> in production systems.
> 
> That's not what happened though, and the right way forward from here is
> to give the bit to the feature, maybe with a generic name like
> FALLOCATE_WITHOUT_BEING_HORRIBLY_SLOW.  It should have been done
> differently, but it wasn't.  And it's a problem we all have, so it makes
> sense that we'll all want to address it somehow.

Well, we could have a discussion about that if Linus were
to revert the original change. Not so much the name (that's just
bikeshedding), but if there is a better way to expand the fallocate
interface to allow people to sanely work around the supposedly
unfixable ext4 unwritten extent performance problems.

But he's not going to, so it is pointless to even suggest such
things.

> On a single flash drive doing random 4K writes, xfs does 950MB/s into
> regular extents but only 400MB/s into preallocated extents.
> 
> http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png

This is bordering on irrelevancy, but can you provide the workload
you were running to generate this graph?  Random 4k writes could be
anything, really.

In my experience, applications that actually do processing between
random write IOs don't see anywhere near the same degradation as
such micro-benchmarks tend to indicate can occur with unwritten
extents. Are you seeing this level of degradation in real-world applications?
If you give me a reason to fix it (and the hardware to test it on),
I'm pretty sure I can bring the overhead down to just a few percent
on fully featured SSDs like FusionIO devices...

[ slightly more on topic ]

FWIW, if this was your production workload and you are using XFS
then you could always use XFS_IOC_ALLOCSP to write zeros during
preallocation rather than using unwritten extents. i.e. trade off
setup-time overhead for higher run-time performance.

[ Have I mentioned before that XFS has several of custom ioctls for
issuing different forms of preallocation? :) ]

I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
The concept, however, implemented by a new fallocate()
flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
the application considers unwritten extents undesirable is exactly
the sort of thing that we should be considering implementing.
Indeed, if the filesystem is on something with WRITE_SAME or
discards to zero, no data would need to be written, you wouldn't
have any unwritten extent overhead, and no stale data exposure.
And it's not a filesystem specific interface or optimisation...

[ back to original topic ]

This is exactly why Ted should have posted the patch for review. He
may not have got the flag through, but the discussion might just end
up in a place that is *better for everyone*. By subverting the
review process, he's deprived the community of that opportunity. now
we're stuck with a shitty change that we can't improve on and will
have to explain repeatedly over the next 15 years why it's not
implemented in any kernel filesystem.

And further - what happens if we add changes like I've mentioned
above and Google moves to using them instead? We'll have a bit in
the interface that nobody uses, nobody will ever implement, and we
can't remove. There's many, many good reasons why a revert is the
only sane thing to do at this point....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 22:02                             ` Ric Wheeler
@ 2012-12-08  0:39                               ` Dave Chinner
  2012-12-08  2:52                                 ` Joel Becker
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-12-08  0:39 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Ts'o, Chris Mason, Chris Mason, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> >>The other things that I think we should try would be to convert over
> >>larger chunks as we discussed on the list back in the summer (just
> >>because the user writes 4KB does not mean that we cannot flip over
> >>1MB and zero that).
> >Writing a megabyte is not free.  If you assume that your HDD has a
> >sustained write throughput of 100-125 MB/s, writing a megabyte will
> >take 8-10ms.  It might be a win if you amortize it over a large number
> >of writes, but it doesn't help your 99.9 percentile latency numbers.
> >(99.9 percentile latency numbers matters because eventually you'll
> >have a user request which hits multiple serial long latency
> >operations, and then the delay looks **really** user visible.)
> >
> >	    	     	       	     		- Ted
> 
> Writing 4KB at a time to a disk cost XX units of time.
> 
> Writing to the same sector (especially for a HDD), cost XX units + a small amount.
> 
> I suggest that we try it out.
> 
> For SSD's, much better to use specific HW offload commands if
> possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> performance boost since no actual data is moved...

Yup, that could be done quite trivially in XFS. Just mark the
preallocated extents as "busy" rather than unwritten, mark the
transaction as synchronous and the transaction commit will issue a
discard on the preallocated ranges before returning to userspace.
The extra overhead to the preallocation command is unlikely to be
noticed, and unwritten extent conversion overhead just goes away...

No fallocate() API changes necessary, though I think it would be
better if the user application gave a hint that it preferred "writing
zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
as there are workloads where one will always be clearly better than
the other...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 23:25                         ` Howard Chu
@ 2012-12-08  0:50                           ` Dave Chinner
  2012-12-08 13:52                             ` Howard Chu
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2012-12-08  0:50 UTC (permalink / raw)
  To: Howard Chu
  Cc: Ric Wheeler, Theodore Ts'o, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
> Ric Wheeler wrote:
> >On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
> >>On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
> >>>How is this similar? By adding this bit, we removed incentive from a
> >>>group of developers that have the means to fix the real issue at hand
> >>>(the performance problem with ext4). Thus, it means that they have a work
> >>>around that's good enough for them, but the rest of us suffer.
> >>That assumes that there **is** a way to claw back the performance
> >>loss, and Chris Mason has demonstrated the performance hit exists with
> >>xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
> >>Sometimes, you have to make the engineering tradeoffs.  That's why
> >>we're engineers, for goodness sakes.  Sometimes, it's just not
> >>possible to square the circle.
> >>
> >>I don't believe that the technique of forcing people who need that
> >>performance to suffer in order to induce them to try to engineer a
> >>solution which may or may not exist is really the best or fairest way
> >>to go about things.
> >>
> >>					- Ted
> >
> >This is not a generally useful feature and won't ship in a way that helps most
> >users with this issue.
> 
> >Let's fix the problem properly.
> >
> >In the meantime, there are several obvious ways to avoid this performance hit
> >without changing the kernel (fully allocate and write the data, certainly
> >reasonable for even reasonable sized files).
> 
> I have to agree that, if this is going to be an ext4-specific
> feature, then it can just be implemented via an ext4-specific ioctl
> and be done with it. But I'm not convinced this should be an
> ext4-specific feature.
> 
> As for "fix the problem properly" - you're fixing the wrong problem.
> This type of feature is important to me, not just because of the
> performance issue. As has already been pointed out, the performance
> difference may even be negligible.
> 
> But on SSDs, the issue is write endurance. The whole point of
> preallocating a file is to avoid doing incremental metadata updates.
> Particularly when each of those 1-bit status updates costs entire
> blocks, and gratuitously shortens the life of the media. The fact
> that avoiding the unnecessary wear and tear may also yield a
> performance boost is just icing on the cake. (And if the perf boost
> is over a factor of 2:1 that's some pretty damn good icing.)

That's a filesystem implementation specific problem, not a generic
fallocate() or unwritten extent conversion problem.

Besides, ext4 doesn't write back every metadata modification that is
made - they are aggregated in memory and only written when the
journal is full or the metadata ages out. Hence unwritten extent
conversion has very little impact on the amount of writes that are
done to the flash because it is vastly dominated by the data writes.

Similarly, in XFS you might see a few thousand or tens of thousands
of metadata blocks get written once every 30s under such a random
write workload, but each metadata block might have gone through a
million changes in memory since the last time it was written.
Indeed, in that 30s, there would have been a few million random data
writes so the metadata writes are well and truly lost in the
noise...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  0:17                     ` Dave Chinner
@ 2012-12-08  1:39                       ` Chris Mason
  2012-12-10 16:02                         ` Chris Mason
  2012-12-10 17:37                       ` Theodore Ts'o
  1 sibling, 1 reply; 69+ messages in thread
From: Chris Mason @ 2012-12-08  1:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Theodore Ts'o, linux-fsdevel

On Fri, Dec 07, 2012 at 05:17:05PM -0700, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:

[ dead and beaten fallocate ponies ]

> 
> > On a single flash drive doing random 4K writes, xfs does 950MB/s into
> > regular extents but only 400MB/s into preallocated extents.
> > 
> > http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png
> 
> This is bordering on irrelevancy, but can you provide the workload
> you were running to generate this graph?  Random 4k writes could be
> anything, really.

This one was fio aio/dio, I'll dig out the job file and rerun it on
3.7-rc on Monday.  Any real random write is going to show this with
enough load.

> 
> In my experience, applications that actually do processing between
> random write IOs don't see anywhere near the same degradation as
> such micro-benchmarks tend to indicate can occur with unwritten
> extents. Are you seeing this level of degradation in real-world applications?
> If you give me a reason to fix it (and the hardware to test it on),
> I'm pretty sure I can bring the overhead down to just a few percent
> on fully featured SSDs like FusionIO devices...

We should have a card I can send, drop me the address.

For the workload...that's harder.  We can talk all day about what a
normal random write workload is, but if you have a fio job that you
think represents real world, I can run that.

[ much nodding ;) ]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  0:39                               ` Dave Chinner
@ 2012-12-08  2:52                                 ` Joel Becker
  2012-12-08  4:04                                   ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Joel Becker @ 2012-12-08  2:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Theodore Ts'o, Chris Mason, Chris Mason,
	Linus Torvalds, Ingo Molnar, Christoph Hellwig,
	Martin Steigerwald, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > >>The other things that I think we should try would be to convert over
> > >>larger chunks as we discussed on the list back in the summer (just
> > >>because the user writes 4KB does not mean that we cannot flip over
> > >>1MB and zero that).
> > >Writing a megabyte is not free.  If you assume that your HDD has a
> > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > >take 8-10ms.  It might be a win if you amortize it over a large number
> > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > >(99.9 percentile latency numbers matters because eventually you'll
> > >have a user request which hits multiple serial long latency
> > >operations, and then the delay looks **really** user visible.)
> > >
> > >	    	     	       	     		- Ted
> > 
> > Writing 4KB at a time to a disk cost XX units of time.
> > 
> > Writing to the same sector (especially for a HDD), cost XX units + a small amount.
> > 
> > I suggest that we try it out.
> > 
> > For SSD's, much better to use specific HW offload commands if
> > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > performance boost since no actual data is moved...
> 
> Yup, that could be done quite trivially in XFS. Just mark the
> preallocated extents as "busy" rather than unwritten, mark the
> transaction as synchronous and the transaction commit will issue a
> discard on the preallocated ranges before returning to userspace.
> The extra overhead to the preallocation command is unlikely to be
> noticed, and unwritten extent conversion overhead just goes away...
> 
> No fallocate() API changes necessary, though I think it would be
> better if the user application gave a hint that it preferred "writing
> zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> as there are workloads where one will always be clearly better than
> the other...

	Wait, I missed something.  We're letting fallocate be dumb?
Let's not do that, then.
	Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
extent if it is 1MB or less, or some MB multiple if it is large enough
to slice it.  This is for very similar reasons to unwritten clearing,
with the added benefit of less fragmentation from CoW.
	On spinning media, any read/write of up to 1MB is roughly about
the same penalty as reading/writing a sector.  You're already paying the
seek.  On SSD, WRITE_SAME is *way* better than leaking data.
	At the end of the day, you have to pay for zeroing.  You can do
it up front, or you can do it at write time.  A certain large commercial
database takes advantage of fallocate+unwritten by getting large swaths
of contiguous storage; it then writes to the whole space before using
it.  This allows the allocation benefits of fallocate, doesn't pay for
unneeded zeros, and yet peforms correctly at runtime.
	We should not be leaking data so that we can be lazy.

Joel


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

Life's Little Instruction Book #182

	"Be romantic."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  2:52                                 ` Joel Becker
@ 2012-12-08  4:04                                   ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-08  4:04 UTC (permalink / raw)
  To: Ric Wheeler, Theodore Ts'o, Chris Mason, Chris Mason,
	Linus Torvalds, Ingo Molnar, Christoph Hellwig,
	Martin Steigerwald, Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2012 at 06:52:51PM -0800, Joel Becker wrote:
> On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> > On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > > >>The other things that I think we should try would be to convert over
> > > >>larger chunks as we discussed on the list back in the summer (just
> > > >>because the user writes 4KB does not mean that we cannot flip over
> > > >>1MB and zero that).
> > > >Writing a megabyte is not free.  If you assume that your HDD has a
> > > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > > >take 8-10ms.  It might be a win if you amortize it over a large number
> > > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > > >(99.9 percentile latency numbers matters because eventually you'll
> > > >have a user request which hits multiple serial long latency
> > > >operations, and then the delay looks **really** user visible.)
> > > >
> > > >	    	     	       	     		- Ted
> > > 
> > > Writing 4KB at a time to a disk cost XX units of time.
> > > 
> > > Writing to the same sector (especially for a HDD), cost XX units + a small amount.
> > > 
> > > I suggest that we try it out.
> > > 
> > > For SSD's, much better to use specific HW offload commands if
> > > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > > performance boost since no actual data is moved...
> > 
> > Yup, that could be done quite trivially in XFS. Just mark the
> > preallocated extents as "busy" rather than unwritten, mark the
> > transaction as synchronous and the transaction commit will issue a
> > discard on the preallocated ranges before returning to userspace.
> > The extra overhead to the preallocation command is unlikely to be
> > noticed, and unwritten extent conversion overhead just goes away...
> > 
> > No fallocate() API changes necessary, though I think it would be
> > better if the user application gave a hint that it preferred "writing
> > zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> > as there are workloads where one will always be clearly better than
> > the other...
> 
> 	Wait, I missed something.  We're letting fallocate be dumb?
> Let's not do that, then.

No, not at all. Read again. There are workloads where explicitly
using unwritten extents are the best thing to do. For others,
zeroing rather than using unwritten extents may be better. All I
suggested was an additional flag that allows applications to tell
the filesystem preallocate zero space, but to do it via writing
zeros rather than unwritten extents.

> 	Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
> extent if it is 1MB or less, or some MB multiple if it is large enough
> to slice it.  This is for very similar reasons to unwritten clearing,
> with the added benefit of less fragmentation from CoW.
> 	On spinning media, any read/write of up to 1MB is roughly about
> the same penalty as reading/writing a sector.  You're already paying the
> seek.

Sure, but that's filesystem implementation details, and something
that I don't care about. Every filesystem implements preallocation
via fallocate in a similar manner (i.e. all have copied XFS's
unwritten extents technique) but that doesn't mean it's optimal for
every workload that needs preallocation.

The deficiencies of ext4's unwritten extent implementation is what
Ted has been trying to address by exposing stale data, rather than
looking at the problem as "is there a better way to preallocate for
this workload?"

That's where:

> On SSD, WRITE_SAME is *way* better than leaking data.

TRIM/WRITE_SAME can be used. It's way faster than actually writing
zeros, but from the user perspective, that's exactly how it appears
to them. IOWs, filesystems can implement the FALLOC_FL_WRITE_ZEROS method 
using this hardware offload, just be dumb and write zeros through
the page cache or ignore it altogether and just use unwritten
extents anyway...

> 	At the end of the day, you have to pay for zeroing.  You can do
> it up front, or you can do it at write time.

And the application should be able to tell us which it prefers....

....

> 	We should not be leaking data so that we can be lazy.

You're in violent agreement with me about that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  0:50                           ` Dave Chinner
@ 2012-12-08 13:52                             ` Howard Chu
  2012-12-08 14:02                               ` Ric Wheeler
  0 siblings, 1 reply; 69+ messages in thread
From: Howard Chu @ 2012-12-08 13:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Theodore Ts'o, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

Dave Chinner wrote:
> On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
>> I have to agree that, if this is going to be an ext4-specific
>> feature, then it can just be implemented via an ext4-specific ioctl
>> and be done with it. But I'm not convinced this should be an
>> ext4-specific feature.
>>
>> As for "fix the problem properly" - you're fixing the wrong problem.
>> This type of feature is important to me, not just because of the
>> performance issue. As has already been pointed out, the performance
>> difference may even be negligible.
>>
>> But on SSDs, the issue is write endurance. The whole point of
>> preallocating a file is to avoid doing incremental metadata updates.
>> Particularly when each of those 1-bit status updates costs entire
>> blocks, and gratuitously shortens the life of the media. The fact
>> that avoiding the unnecessary wear and tear may also yield a
>> performance boost is just icing on the cake. (And if the perf boost
>> is over a factor of 2:1 that's some pretty damn good icing.)
>
> That's a filesystem implementation specific problem, not a generic
> fallocate() or unwritten extent conversion problem.

> Besides, ext4 doesn't write back every metadata modification that is
> made - they are aggregated in memory and only written when the
> journal is full or the metadata ages out. Hence unwritten extent
> conversion has very little impact on the amount of writes that are
> done to the flash because it is vastly dominated by the data writes.
>
> Similarly, in XFS you might see a few thousand or tens of thousands
> of metadata blocks get written once every 30s under such a random
> write workload, but each metadata block might have gone through a
> million changes in memory since the last time it was written.
> Indeed, in that 30s, there would have been a few million random data
> writes so the metadata writes are well and truly lost in the
> noise...

That's only true if write caching is allowed. If you have a transactional 
database running, it's syncing every transaction to media.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08 13:52                             ` Howard Chu
@ 2012-12-08 14:02                               ` Ric Wheeler
  0 siblings, 0 replies; 69+ messages in thread
From: Ric Wheeler @ 2012-12-08 14:02 UTC (permalink / raw)
  To: Howard Chu
  Cc: Dave Chinner, Theodore Ts'o, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On 12/08/2012 08:52 AM, Howard Chu wrote:
> Dave Chinner wrote:
>> On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
>>> I have to agree that, if this is going to be an ext4-specific
>>> feature, then it can just be implemented via an ext4-specific ioctl
>>> and be done with it. But I'm not convinced this should be an
>>> ext4-specific feature.
>>>
>>> As for "fix the problem properly" - you're fixing the wrong problem.
>>> This type of feature is important to me, not just because of the
>>> performance issue. As has already been pointed out, the performance
>>> difference may even be negligible.
>>>
>>> But on SSDs, the issue is write endurance. The whole point of
>>> preallocating a file is to avoid doing incremental metadata updates.
>>> Particularly when each of those 1-bit status updates costs entire
>>> blocks, and gratuitously shortens the life of the media. The fact
>>> that avoiding the unnecessary wear and tear may also yield a
>>> performance boost is just icing on the cake. (And if the perf boost
>>> is over a factor of 2:1 that's some pretty damn good icing.)
>>
>> That's a filesystem implementation specific problem, not a generic
>> fallocate() or unwritten extent conversion problem.
>
>> Besides, ext4 doesn't write back every metadata modification that is
>> made - they are aggregated in memory and only written when the
>> journal is full or the metadata ages out. Hence unwritten extent
>> conversion has very little impact on the amount of writes that are
>> done to the flash because it is vastly dominated by the data writes.
>>
>> Similarly, in XFS you might see a few thousand or tens of thousands
>> of metadata blocks get written once every 30s under such a random
>> write workload, but each metadata block might have gone through a
>> million changes in memory since the last time it was written.
>> Indeed, in that 30s, there would have been a few million random data
>> writes so the metadata writes are well and truly lost in the
>> noise...
>
> That's only true if write caching is allowed. If you have a transactional 
> database running, it's syncing every transaction to media.
>

The math just does not add up - no device sustains millions of random IO's per 
second.

Each class of device has so many IOs it can do per second. S-ATA disks do say 
40-50 IOPS, SAS maybe twice that, enterprise arrays 10k IOPS and PCI-E cards 
100k IOPS.

Transactional databases accumulate multiple updates in memory and commit to disk 
*in transactions* to pack as much as possible into the IOPS that the device has. 
Batching is a core principle of database performance (just like we do in the 
guts of XFS or ext4 in our file system transactions).

If you use a transactional DB, you should pre-allocate the table space (and 
probably the log file), but it would be wise to pre-allocate it and zero the 
blocks since DB's are long lived items. Pay once for the initialization and off 
you go. More expensive for really large tables on slow devices of course.

Again, we are back to the core need here - we want to improve the performance of 
a random, small IO workload. This workload was historically *so* painful 
(remember that 40-50 IOP's for a s-ata disk :)), that pretty much every sane 
application avoided random IO like the plague.

With some of the newer devices like the new SSD's, random IO gets to be more 
reasonable and we need to fix the performance to accommodate workloads that were 
not normal.

Ric



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-07 21:14                     ` Theodore Ts'o
  2012-12-07 21:47                       ` Ric Wheeler
  2012-12-07 22:01                       ` Eric Sandeen
@ 2012-12-09 21:37                       ` Ric Wheeler
  2 siblings, 0 replies; 69+ messages in thread
From: Ric Wheeler @ 2012-12-09 21:37 UTC (permalink / raw)
  To: Theodore Ts'o, Steven Rostedt, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, Dave Chinner, linux-fsdevel

On 12/07/2012 04:14 PM, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
>> How is this similar? By adding this bit, we removed incentive from a
>> group of developers that have the means to fix the real issue at hand
>> (the performance problem with ext4). Thus, it means that they have a work
>> around that's good enough for them, but the rest of us suffer.
> That assumes that there **is** a way to claw back the performance
> loss, and Chris Mason has demonstrated the performance hit exists with
> xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).
> Sometimes, you have to make the engineering tradeoffs.  That's why
> we're engineers, for goodness sakes.  Sometimes, it's just not
> possible to square the circle.

Keep in mind that no one has tried to retune or adjust XFS (or other file 
systems) for this workload.

>
> I don't believe that the technique of forcing people who need that
> performance to suffer in order to induce them to try to engineer a
> solution which may or may not exist is really the best or fairest way
> to go about things.
>

The proposed solution will not ship in any enterprise distro. Google, I assume, 
would require some specific non-normal user access permission.

That means that this solution will not be usable by the vast majority of users, 
so it is clear that we do need to work on fixing the performance at the root 
instead of plastering over the behaviour.

ric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  1:39                       ` Chris Mason
@ 2012-12-10 16:02                         ` Chris Mason
  0 siblings, 0 replies; 69+ messages in thread
From: Chris Mason @ 2012-12-10 16:02 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	Theodore Ts'o, linux-fsdevel

On Fri, Dec 07, 2012 at 06:39:49PM -0700, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 05:17:05PM -0700, Dave Chinner wrote:
> > On Fri, Dec 07, 2012 at 02:03:06PM -0500, Chris Mason wrote:
> > > On a single flash drive doing random 4K writes, xfs does 950MB/s into
> > > regular extents but only 400MB/s into preallocated extents.
> > > 
> > > http://masoncoding.com/presentation/perf-linuxcon12/fallocate.png
> > 
> > This is bordering on irrelevancy, but can you provide the workload
> > you were running to generate this graph?  Random 4k writes could be
> > anything, really.
> 
> This one was fio aio/dio, I'll dig out the job file and rerun it on
> 3.7-rc on Monday.  Any real random write is going to show this with
> enough load.

Ok, I ran this against 3.6.  Since my box has two iodrives in it now, I
tossed them into lvm and ran striped over both.  A single drive is iop
bound at 1GB/s, and we're able to push 2GB/s over both.  LVM slows it
down slightly, and if you let the runs go long enough, you can see the
little log structured squirrels jumping in from time to time.

Long story short, on the lvm block device we average about 1.7GB/s over
the two drives.  This is iop bound, the two cards can push about 2.6GB/s
doing streaming writes.

XFS without preallocation comes very close to the iops bound number.
This is really impressive, but it also means every additional IO to track
the preallocation is going to hurt the bottom line.

With preallocation on, the speed is the same with one drive as with two.
Eric had asked me to do a run with holes, and they come out a little
worse than preallocated.

Graphs:

http://masoncoding.com/mason/benchmark/xfs-fallocate/xfs-random-write-compare.png

The fio job is in that xfs-fallocate directory, and included below.

-chris

[global]
bs=4k
direct=1
ioengine=aio
size=12g
rw=randwrite
norandommap
runtime=30
iodepth=1024

# set overwrite=1 to force us to fully overwrite
# the preallocated files before the random IO starts
#
#overwrite=1

# set fallocate=none to ues sparse files
#fallocate=none

# run 4 jobs where each job is operating on
# only one file.  This way there's no lock contention
# on the file itself.
#
[f1]
filename=/mnt/f1
[f1]
filename=/mnt/f2
[f1]
filename=/mnt/f3
[f1]
filename=/mnt/f4


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-08  0:17                     ` Dave Chinner
  2012-12-08  1:39                       ` Chris Mason
@ 2012-12-10 17:37                       ` Theodore Ts'o
  2012-12-10 18:05                         ` Steven Whitehouse
                                           ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-10 17:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Linus Torvalds, Ric Wheeler, Ingo Molnar,
	Christoph Hellwig, Martin Steigerwald, Linux Kernel Mailing List,
	linux-fsdevel

On Sat, Dec 08, 2012 at 11:17:05AM +1100, Dave Chinner wrote:
> I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
> The concept, however, implemented by a new fallocate()
> flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
> the application considers unwritten extents undesirable is exactly
> the sort of thing that we should be considering implementing.

What's the point of using a new flag like this (or XFS's
XFS_IOC_ALLOCSP) for writing zeros during preallocation as oppoised to
simply doing a fallocate() followed by zeroing the data via a O_DIRECT
write system call?

> Indeed, if the filesystem is on something with WRITE_SAME or
> discards to zero, no data would need to be written, you wouldn't
> have any unwritten extent overhead, and no stale data exposure.

And if you have a storage device which supports WRITE_SAME or
persistent discards, you can do this automatically at preallocation
time without needing a new fallocate(2) flag.  I certainly don't
oppose adding such optimizations to ext4 or any other file system (I'm
not entirely convinced that it's worth it to do this optimization at
the VFS level), but it doesn't help for storage devices that don't
support this feature.

> This is exactly why Ted should have posted the patch for review. He
> may not have got the flag through, but the discussion might just end
> up in a place that is *better for everyone*. 

Both of these suggestions have been made multiple times before when we
submitted the original patch to support NO_HIDE_STALE, and they aren't
sufficient for our purposes, so it's not like submitting a patch to
reserve the bit would have given us any new information.  We've had
this discussion **ALREADY**, multiple times before, with no one
beliving that the alternate solutions were sufficient for our needs.

This is why this discussion reminds me so much of the wakelocks
discussion, and why I've made the same decision the Android folks
made, except they wasted far more time and got far more frustrated ---
I'll just keep the dammned thing as a out-of-tree patch, until there
are enough other people willing to say that they need and are using
this patch because their workloads and use cases need it.  It will
save me a whole lot of time.

(BTW, on a similar subject, we have folks at Tao Bao, Oracle, and
Google saying that disabling stable page writes does improve their
workloads, despite everyone else claiming it doesn't matter, and both
Tao Bao and Google have out-of-tree patches that spike out stable page
writes.  Maybe this will be enough so we don't have to waste more time
convincing people that it's not an insane workload such benefits from
a way to disable stable page writes.  But if not, I'm not going to
waste time trying to convincing everyone else on fs-devel.  Keeping
the out of tree patch is just way less effort, and requires much less
resources.)

> And further - what happens if we add changes like I've mentioned
> above and Google moves to using them instead? We'll have a bit in
> the interface that nobody uses, nobody will ever implement, and we
> can't remove. There's many, many good reasons why a revert is the
> only sane thing to do at this point....

It's one bit, where we have plenty and plenty of bits.  The only other
possible uses for the fallocate flags (such as hot vs cold storage,
etc., have all been bike-sheeded to death on fs-devel already), so I'm
quite confident that we will never get to the point where even close
to running out of fallocate flag bits.

The only other bit I'm aware of that might happen soon is the volatile
ranges patch, and that's just one more bit.  So at this point we'll
still have 28 bits (out of 32 bits).  So when you talk about an
interface that we'll never remove, I think you're engaging in
hyperbole.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 17:37                       ` Theodore Ts'o
@ 2012-12-10 18:05                         ` Steven Whitehouse
  2012-12-10 18:13                           ` Theodore Ts'o
  2012-12-10 18:52                         ` Ric Wheeler
  2012-12-11  0:52                         ` Dave Chinner
  2 siblings, 1 reply; 69+ messages in thread
From: Steven Whitehouse @ 2012-12-10 18:05 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

Hi,

On Mon, 2012-12-10 at 12:37 -0500, Theodore Ts'o wrote:
> On Sat, Dec 08, 2012 at 11:17:05AM +1100, Dave Chinner wrote:
> > I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
> > The concept, however, implemented by a new fallocate()
> > flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
> > the application considers unwritten extents undesirable is exactly
> > the sort of thing that we should be considering implementing.
> 
> What's the point of using a new flag like this (or XFS's
> XFS_IOC_ALLOCSP) for writing zeros during preallocation as oppoised to
> simply doing a fallocate() followed by zeroing the data via a O_DIRECT
> write system call?
> 

If the block device can zero out blocks more efficiently than just
writing zeros (i.e. via sb_issue_zeroout()) then this could be faster.
In fact GFS2 already zeros out fallocate()d extents in this way because
it has no way to mark unwritten extents in its metadata, and we did not
want to leave them uninitialised,

Steve.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 18:05                         ` Steven Whitehouse
@ 2012-12-10 18:13                           ` Theodore Ts'o
  2012-12-10 18:20                             ` Theodore Ts'o
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-10 18:13 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Dave Chinner, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 10, 2012 at 06:05:59PM +0000, Steven Whitehouse wrote:
> 
> If the block device can zero out blocks more efficiently than just
> writing zeros (i.e. via sb_issue_zeroout()) then this could be faster.
> In fact GFS2 already zeros out fallocate()d extents in this way because
> it has no way to mark unwritten extents in its metadata, and we did not
> want to leave them uninitialised,

Sure, but if the block device supports WRITE_SAME or persistent
discard, then rpesumably fallocate() should do this automatically all
the time, and not require a flag to request this behavior.  That is, a
seek plus writing 1MB does take more time than the amount of disk time
fraction that it consumes if you compare it to a seek plus writing 4k
or 32k.  It may be that 32k is too low, and there may be some
disagreement about whether people want to the system to automatically
tune for average throughput vs 99.9 percentile latency.

Regardless, this is actually something which I think the file system
should try to do automatically if at all possible, via some kind of
auto-tuning hueristic, instead of using an explicit fallocate(2) flag.
(See, I don't propose using a new fallocate flag for everything.  :-)

      	      	      	  	    	     - Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 18:13                           ` Theodore Ts'o
@ 2012-12-10 18:20                             ` Theodore Ts'o
  2012-12-11 12:16                               ` Steven Whitehouse
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2012-12-10 18:20 UTC (permalink / raw)
  To: Steven Whitehouse, Dave Chinner, Chris Mason, Linus Torvalds,
	Ric Wheeler, Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

A sentence or two got chopped out during an editing pass.  Let me try
that again so it's a bit clearer what I was trying to say....

Sure, but if the block device supports WRITE_SAME or persistent
discard, then presumably fallocate() should do this automatically all
the time, and not require a flag to request this behavior.  The only
reason why you might not is if the WRITE_SAME is more costly.  That is
when a seek plus writing 1MB does take more time than the amount of
disk time fraction that it consumes if you compare it to a seek plus
writing 4k or 32k.

Ext4 currently uses a threshold of 32k for this break point (below
that, we will use sb_issue_zeroout; above that, we will break apart an
uninitialized extent when writing into a preallocated region).  It may
be that 32k is too low, especailly for certain types of devices (i.e.,
SSD's versus RAID 5, where it should be aligned on a RAID strip,
etc.).  More of an issue might be that there will be some disagreement
about whether people want to the system to automatically tune for
average throughput vs 99.9 percentile latency.

Regardless, this is actually something which I think the file system
should try to do automatically if at all possible, via some kind of
auto-tuning hueristic, instead of using an explicit fallocate(2) flag.
(See, I don't propose using a new fallocate flag for everything.  :-)

      	      	      	      - Ted


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 17:37                       ` Theodore Ts'o
  2012-12-10 18:05                         ` Steven Whitehouse
@ 2012-12-10 18:52                         ` Ric Wheeler
  2012-12-11  0:52                         ` Dave Chinner
  2 siblings, 0 replies; 69+ messages in thread
From: Ric Wheeler @ 2012-12-10 18:52 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Chinner, Chris Mason, Linus Torvalds,
	Ric Wheeler, Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On 12/10/2012 12:37 PM, Theodore Ts'o wrote:
> On Sat, Dec 08, 2012 at 11:17:05AM +1100, Dave Chinner wrote:
>> I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
>> The concept, however, implemented by a new fallocate()
>> flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
>> the application considers unwritten extents undesirable is exactly
>> the sort of thing that we should be considering implementing.
> What's the point of using a new flag like this (or XFS's
> XFS_IOC_ALLOCSP) for writing zeros during preallocation as oppoised to
> simply doing a fallocate() followed by zeroing the data via a O_DIRECT
> write system call?

I think that this is actually quite useful if you take it to mean "keep normal 
fallocate semantics and make sure that no meta-data twiddling needs be done"

The existing behaviour (the default one) is preallocate and it is OK to metadata 
twiddling (extent conversion, etc) later on.

Of course, we could well offload either to WRITE_SAME or TRIM, etc if the 
storage target supports it,

ric

>
>> Indeed, if the filesystem is on something with WRITE_SAME or
>> discards to zero, no data would need to be written, you wouldn't
>> have any unwritten extent overhead, and no stale data exposure.
> And if you have a storage device which supports WRITE_SAME or
> persistent discards, you can do this automatically at preallocation
> time without needing a new fallocate(2) flag.  I certainly don't
> oppose adding such optimizations to ext4 or any other file system (I'm
> not entirely convinced that it's worth it to do this optimization at
> the VFS level), but it doesn't help for storage devices that don't
> support this feature.
>
>> This is exactly why Ted should have posted the patch for review. He
>> may not have got the flag through, but the discussion might just end
>> up in a place that is *better for everyone*.
> Both of these suggestions have been made multiple times before when we
> submitted the original patch to support NO_HIDE_STALE, and they aren't
> sufficient for our purposes, so it's not like submitting a patch to
> reserve the bit would have given us any new information.  We've had
> this discussion **ALREADY**, multiple times before, with no one
> beliving that the alternate solutions were sufficient for our needs.
>
> This is why this discussion reminds me so much of the wakelocks
> discussion, and why I've made the same decision the Android folks
> made, except they wasted far more time and got far more frustrated ---
> I'll just keep the dammned thing as a out-of-tree patch, until there
> are enough other people willing to say that they need and are using
> this patch because their workloads and use cases need it.  It will
> save me a whole lot of time.
>
> (BTW, on a similar subject, we have folks at Tao Bao, Oracle, and
> Google saying that disabling stable page writes does improve their
> workloads, despite everyone else claiming it doesn't matter, and both
> Tao Bao and Google have out-of-tree patches that spike out stable page
> writes.  Maybe this will be enough so we don't have to waste more time
> convincing people that it's not an insane workload such benefits from
> a way to disable stable page writes.  But if not, I'm not going to
> waste time trying to convincing everyone else on fs-devel.  Keeping
> the out of tree patch is just way less effort, and requires much less
> resources.)
>
>> And further - what happens if we add changes like I've mentioned
>> above and Google moves to using them instead? We'll have a bit in
>> the interface that nobody uses, nobody will ever implement, and we
>> can't remove. There's many, many good reasons why a revert is the
>> only sane thing to do at this point....
> It's one bit, where we have plenty and plenty of bits.  The only other
> possible uses for the fallocate flags (such as hot vs cold storage,
> etc., have all been bike-sheeded to death on fs-devel already), so I'm
> quite confident that we will never get to the point where even close
> to running out of fallocate flag bits.
>
> The only other bit I'm aware of that might happen soon is the volatile
> ranges patch, and that's just one more bit.  So at this point we'll
> still have 28 bits (out of 32 bits).  So when you talk about an
> interface that we'll never remove, I think you're engaging in
> hyperbole.
>
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 17:37                       ` Theodore Ts'o
  2012-12-10 18:05                         ` Steven Whitehouse
  2012-12-10 18:52                         ` Ric Wheeler
@ 2012-12-11  0:52                         ` Dave Chinner
  2 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-11  0:52 UTC (permalink / raw)
  To: Theodore Ts'o, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 10, 2012 at 12:37:39PM -0500, Theodore Ts'o wrote:
> On Sat, Dec 08, 2012 at 11:17:05AM +1100, Dave Chinner wrote:
> > I wouldn't recommend XFS_IOC_ALLOCSP as a user-friendly interface.
> > The concept, however, implemented by a new fallocate()
> > flag (say FALLOC_FL_WRITE_ZEROS) so that the filesystem knows that
> > the application considers unwritten extents undesirable is exactly
> > the sort of thing that we should be considering implementing.
> 
> What's the point of using a new flag like this (or XFS's
> XFS_IOC_ALLOCSP) for writing zeros during preallocation as oppoised to
> simply doing a fallocate() followed by zeroing the data via a O_DIRECT
> write system call?

There is no window where stale data is exposed to userspace, which
is what you have to do if you are doing zeroing from userspace. If
the system crashes after allocation but before zeroing, how do you
recover that? The filesystem can ensure the allocation transactions
are not committed until the allocated extents are zeroed....

> > Indeed, if the filesystem is on something with WRITE_SAME or
> > discards to zero, no data would need to be written, you wouldn't
> > have any unwritten extent overhead, and no stale data exposure.
> 
> And if you have a storage device which supports WRITE_SAME or
> persistent discards, you can do this automatically at preallocation
> time without needing a new fallocate(2) flag.

Because there are cases where unwritten extents are preferrable or
WRITE_SAME functionality is unavailable.

Indeed, if we take the case of file-per-frame, uncompressed
real-time video ingest, I'm going to be wanting to use unwritten
extents to preallocate files in a known pattern with as little
latency as possible. If we are talking about 4k uncompressed video,
that's a data rate of 1.2GB/s for 24fps, and studios are now
shooting in 48fps in 3D, which gives a real-time data rate of
roughly 5GB/s.  This has an acceptible IO latency *per-frame* of
roughly 10-20ms, which we can do easily with unwritten extents. We
can preallocate thousands of files, look at their layout via fiemap
and select the order in which we write to them based on where they
were allocated.

There is no way in hell that WRITE_SAME can be used for these sorts
of workloads, because preallocation then places a load on the storage
device that affects the latency of the real-time data stream.
Unwritten extent conversion is an after-the-fact overhead in these
cases that doesn't impact on the data stream throughput or latency.

IOWs, there are clear cases where discard optimisations will be
actively harmful to the workload using preallocation for performance
reasons. Hence, I'm not about to make the existing fallocate code in
XFS stop using unwritten extents by default even if the underlying
device supports WRITE_SAME.

> I certainly don't
> oppose adding such optimizations to ext4 or any other file system (I'm
> not entirely convinced that it's worth it to do this optimization at
> the VFS level), but it doesn't help for storage devices that don't
> support this feature.

Sure, this optimisation is a per-filesystem decision. ext4 is
unlikely to be used in the sorts of high end enviroments we see XFS
being used in, so it might make sense for you to make it use
WRITE_SAME by default if it is supported.

This, however, does not change the fact that there are existing
applications using fallocate that absolutely do not want
preallocation to use WRITE_SAME semantics. It's not an optimisation
if it breaks a significant portion of your userbase's applications.
Hence adding a flag to allow applications to specify they want
WRITE_SAME preallocation behaviour rather than unwritten extents
makes sense.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-10 18:20                             ` Theodore Ts'o
@ 2012-12-11 12:16                               ` Steven Whitehouse
  2012-12-11 22:09                                 ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Steven Whitehouse @ 2012-12-11 12:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

Hi,

On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote:
> A sentence or two got chopped out during an editing pass.  Let me try
> that again so it's a bit clearer what I was trying to say....
> 
> Sure, but if the block device supports WRITE_SAME or persistent
> discard, then presumably fallocate() should do this automatically all
> the time, and not require a flag to request this behavior.  The only
> reason why you might not is if the WRITE_SAME is more costly.  That is
> when a seek plus writing 1MB does take more time than the amount of
> disk time fraction that it consumes if you compare it to a seek plus
> writing 4k or 32k.
> 
Well there are two cases here I think....

One is the GFS2 type case where the metadata doesn't support "these
blocks are allocated but zero" so that we must, for all fallocate
requests, zero out the blocks at fallocate time to avoid exposing stale
data to userspace.

The advantage over dd from userspace in this case is firstly that no
copy from userspace means that it should be faster. Also the use of
sb_issue_zeroout means that block devices which don't need an explicit
block of zeros to write should be able to do this faster - however that
is implemented at the block layer. The fs shouldn't need to care about
how is it implemented. In the case of GFS2, we implemented fallocate
because it was useful to have the feature of being able to allocate
beyond the end of file without changing the file size. This helped us
fix a bug in our fs grow code, so performance was a secondary (but
welcome!) consideration. 

The other case is ext4/XFS type case where the metadata does support
"these blocks are allocated but zero" which means that the metadata
needs to be changed twice. Once to "these blocks are allocated but zero"
at fallocate time and again to "these blocks have valid content" at
write time. As I understand the issue, the problem is that this second
metadata change is what is causing the performance issue.

> Ext4 currently uses a threshold of 32k for this break point (below
> that, we will use sb_issue_zeroout; above that, we will break apart an
> uninitialized extent when writing into a preallocated region).  It may
> be that 32k is too low, especailly for certain types of devices (i.e.,
> SSD's versus RAID 5, where it should be aligned on a RAID strip,
> etc.).  More of an issue might be that there will be some disagreement
> about whether people want to the system to automatically tune for
> average throughput vs 99.9 percentile latency.
> 
> Regardless, this is actually something which I think the file system
> should try to do automatically if at all possible, via some kind of
> auto-tuning hueristic, instead of using an explicit fallocate(2) flag.
> (See, I don't propose using a new fallocate flag for everything.  :-)
> 
>       	      	      	      - Ted
> 

It sounds like it might well be worth experimenting with the thresholds
as you suggest, 32k is really pretty small. I guess that the real
question here is what is the cost of the metadata change (to say what is
written and what remains unwritten) vs. simply zeroing out the unwritten
blocks in the extent when the write occurs.

There are likely to be a number of factors affecting that, and the
answer doesn't appear straightforward,

Steve.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
  2012-12-11 12:16                               ` Steven Whitehouse
@ 2012-12-11 22:09                                 ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2012-12-11 22:09 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Theodore Ts'o, Chris Mason, Linus Torvalds, Ric Wheeler,
	Ingo Molnar, Christoph Hellwig, Martin Steigerwald,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 11, 2012 at 12:16:03PM +0000, Steven Whitehouse wrote:
> On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote:
> > A sentence or two got chopped out during an editing pass.  Let me try
> > that again so it's a bit clearer what I was trying to say....
> > 
> > Sure, but if the block device supports WRITE_SAME or persistent
> > discard, then presumably fallocate() should do this automatically all
> > the time, and not require a flag to request this behavior.  The only
> > reason why you might not is if the WRITE_SAME is more costly.  That is
> > when a seek plus writing 1MB does take more time than the amount of
> > disk time fraction that it consumes if you compare it to a seek plus
> > writing 4k or 32k.
> > 
> Well there are two cases here I think....

/me nods

> The other case is ext4/XFS type case where the metadata does support
> "these blocks are allocated but zero" which means that the metadata
> needs to be changed twice. Once to "these blocks are allocated but zero"
> at fallocate time and again to "these blocks have valid content" at
> write time. As I understand the issue, the problem is that this second
> metadata change is what is causing the performance issue.

Apparently so ;)

> > Ext4 currently uses a threshold of 32k for this break point (below
> > that, we will use sb_issue_zeroout; above that, we will break apart an
> > uninitialized extent when writing into a preallocated region).  It may
> > be that 32k is too low, especailly for certain types of devices (i.e.,
> > SSD's versus RAID 5, where it should be aligned on a RAID strip,
> > etc.).  More of an issue might be that there will be some disagreement
> > about whether people want to the system to automatically tune for
> > average throughput vs 99.9 percentile latency.
> > 
> > Regardless, this is actually something which I think the file system
> > should try to do automatically if at all possible, via some kind of
> > auto-tuning hueristic, instead of using an explicit fallocate(2) flag.
> > (See, I don't propose using a new fallocate flag for everything.  :-)

Preallocation is primarily there for the application to optimise
file layout when the filesystem heuristics fail to do the right
thing.  IOWs, we should not be adding heuristics to "tune" fallocate
behaviour automatically because a) we are guaranteed to get
heuristics wrong for some applications, and b) fallocate is for
direct application control and hence avoid filesystem hueristic
based problems.

> It sounds like it might well be worth experimenting with the thresholds
> as you suggest, 32k is really pretty small. I guess that the real
> question here is what is the cost of the metadata change (to say what is
> written and what remains unwritten) vs. simply zeroing out the unwritten
> blocks in the extent when the write occurs.

Exactly my concern. 32k is a range that is arbitrary, and if it
cannot be changed to suit your workload then if it sucks....  Given
that fallocate is supposed to allow applications to avoid such
allocation heuristics if they are sub-optimal, this fixed
"zero-around" range seems like a bad
model to follow.

If I implement a zero-around for direct IO on unwritten extents for
XFS, I'll use the per-inode extent granularity hint that is held in
the inode to determine the range to write zeros to.  That will
enable us to have a per-file configurable allocation/unwritten
extent conversion granularity and hence bound the maximum size of
the extent tree for a given workload+data set.

The real problem is that to do zero-around efficiently, the zeroing
needs to be done as part of the direct IO submission process. That
code only supports block size zeroing rather than arbitrary ranges,
and it's already very messy because of the rat's nest of
micro-optimisations.

However, doing the zero-around as part of IO submission would
certainly remove a significant amount of the ongoing extent
conversion overhead from the workloads Chris has pointed out as
troublesome...

> There are likely to be a number of factors affecting that, and the
> answer doesn't appear straightforward,

Right, which is where making it per-file configurable makes sense.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2012-12-11 22:09 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-19 23:04 [PATCH] fs: revert commit bbdd6808 to fallocate UAPI Dave Chinner
2012-11-20 16:36 ` Christoph Hellwig
2012-11-26  0:28 ` [PATCH, 3.7-rc7, RESEND] " Dave Chinner
2012-11-26  2:55   ` Theodore Ts'o
2012-11-26  6:14     ` Tao Ma
2012-11-26  9:12     ` Dave Chinner
2012-12-05 10:48       ` Martin Steigerwald
2012-12-05 15:45         ` Linus Torvalds
2012-12-05 16:18           ` Martin Steigerwald
2012-12-05 16:33             ` Theodore Ts'o
2012-12-05 17:24               ` Martin Steigerwald
2012-12-05 17:34                 ` Theodore Ts'o
2012-12-05 17:55                   ` Martin Steigerwald
2012-12-06  0:42                   ` Dave Chinner
2012-12-06  9:24                     ` Martin Steigerwald
2012-12-05 18:25             ` Linus Torvalds
2012-12-06  1:14               ` Dave Chinner
2012-12-06  3:03                 ` Linus Torvalds
2012-12-06  9:37                   ` Martin Steigerwald
2012-12-07  1:08                     ` Ingo Molnar
2012-12-07  2:40                       ` Dave Chinner
2012-12-07 10:24                       ` Martin Steigerwald
2012-12-06 12:06                 ` Christoph Hellwig
2012-12-06 16:50                   ` Theodore Ts'o
2012-12-07  1:57                     ` Dave Chinner
2012-12-06 12:05           ` Christoph Hellwig
2012-12-07  1:16             ` Ingo Molnar
2012-12-07  3:19               ` Dave Chinner
2012-12-07 17:36               ` Ric Wheeler
2012-12-07 18:18                 ` Linus Torvalds
2012-12-07 19:03                   ` Chris Mason
2012-12-07 20:43                     ` Theodore Ts'o
2012-12-07 21:09                       ` Chris Mason
2012-12-07 21:27                         ` Theodore Ts'o
2012-12-07 21:43                           ` Chris Mason
2012-12-07 21:49                             ` Ric Wheeler
2012-12-07 21:57                               ` Chris Mason
2012-12-07 22:51                                 ` Eric Sandeen
2012-12-07 22:52                                 ` Eric Sandeen
2012-12-07 21:42                         ` Ric Wheeler
2012-12-07 21:57                           ` Theodore Ts'o
2012-12-07 22:02                             ` Ric Wheeler
2012-12-08  0:39                               ` Dave Chinner
2012-12-08  2:52                                 ` Joel Becker
2012-12-08  4:04                                   ` Dave Chinner
2012-12-08  0:17                     ` Dave Chinner
2012-12-08  1:39                       ` Chris Mason
2012-12-10 16:02                         ` Chris Mason
2012-12-10 17:37                       ` Theodore Ts'o
2012-12-10 18:05                         ` Steven Whitehouse
2012-12-10 18:13                           ` Theodore Ts'o
2012-12-10 18:20                             ` Theodore Ts'o
2012-12-11 12:16                               ` Steven Whitehouse
2012-12-11 22:09                                 ` Dave Chinner
2012-12-10 18:52                         ` Ric Wheeler
2012-12-11  0:52                         ` Dave Chinner
2012-12-07 19:30                   ` Steven Rostedt
2012-12-07 21:14                     ` Theodore Ts'o
2012-12-07 21:47                       ` Ric Wheeler
2012-12-07 23:25                         ` Howard Chu
2012-12-08  0:50                           ` Dave Chinner
2012-12-08 13:52                             ` Howard Chu
2012-12-08 14:02                               ` Ric Wheeler
2012-12-07 22:01                       ` Eric Sandeen
2012-12-09 21:37                       ` Ric Wheeler
2012-11-26 11:53     ` Alan Cox
2012-11-26 14:43       ` Theodore Ts'o
2012-11-26 21:12       ` Dave Chinner
2012-11-27 13:44         ` Martin Steigerwald

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).