From: Dave Chinner <david@fromorbit.com>
To: Theodore Tso <tytso@mit.edu>, Nick Piggin <nickpiggin@yahoo.com.au>,
    Daniel Phillips <phillips@phunq.net>, linux-fsdevel@vger.kernel.org,
    tux3@tux3.org, Andrew Morton <akpm@linux-foundation.org>,
    linux-kernel@vger.kernel.org
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available
Date: Mon, 16 Mar 2009 16:12:11 +1100
Message-ID: <20090316051211.GB26138@disturbed>
In-Reply-To: <20090315214426.GA6357@mit.edu>

On Sun, Mar 15, 2009 at 05:44:26PM -0400, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata
> > > block (i.e., what is currently called buffer cache) at the time it is
> > > dirtied. I don't know if this is the best thing to do, but it is
> > > interesting that you do the same thing. I also don't know if I want to
> > > trust a library to get this right, before having completely proved out
> > > the idea in a non-trivial filesystem. But good luck with that! It
> >
> > I'm not sure why it would be a big problem. fsblock isn't allocating
> > the block itself of course, it just asks the filesystem to. It's
> > trivial to do for fsblock.
>
> So the really unfortunate thing about allocating the block as soon as
> the page is dirty is that it spikes out delayed allocation. By
> delaying the physical allocation of the logical->physical mapping as
> long as possible, the filesystem can select the best possible physical
> location.

This is no different to the way delayed allocation with bufferheads
works. Both XFS and ext4 set the buffer_delay flag instead of
allocating up front so that later on in ->writepages we can do
optimal delayed allocation. AFAICT fsblock works the same way....

> XFS, for example, keeps a btree of free regions indexed by
> size so that it can select the perfect location for a newly written
> file which is 24k or 56k long.

Ah, no. It's far more complex than that. To begin with, XFS has
*two* freespace trees per allocation group - one indexed by extent
size, the other by extent starting block. XFS looks for an exact or
nearby extent start block match that is big enough in the by-block
tree. If it can't find a nearby match, then it looks up a size match
in the by-size tree. i.e. the fundamental allocation assumption is
that locality of data placement matters far more than filling holes
in the freespace trees.....

> In addition, XFS uses delayed allocation to avoid the problem of
> uninitialized data becoming visible in the event of a crash.

No it doesn't. Delayed allocation minimises the problem but doesn't
prevent it. It has been known for years (since before I joined SGI
in 2002) that there is a theoretical timing gap in XFS where the
allocation transaction can commit and a crash occur before data hits
the disk hence exposing stale data.

The reality is that no-one has ever reported exposing stale data in
this scenario, and there has been plenty of effort expended trying
to trigger it. Hence it has remained in the realm of a theoretical
problem....

> If
> fsblock immediately allocates the physical block, then either the
> uninitialized data might become available on a system crash (which
> is a security problem), or XFS is going to have to force all newly
> written data blocks to disk before a commit. If that sounds
> familiar it's what ext3's data=ordered mode does, and it's what is
> responsible for the Firefox 3.0 fsync performance problem.

If this was to occur, the obvious solution to this problem is to
allocate unwritten extents and do conversion after data I/O
completion. That would result in correct metadata/data ordering in
all cases with only a small performance impact and without
introducing ext3-sync-the-world-like issues...

Ted, I appreciate you telling the world over and over again how bad
XFS is and what you think needs to be done to fix it. Truth is, this
would have been a much better email had you written about it from an
ext4 perspective. That way it wouldn't have been full of errors or
sound like a kid caught with his hand in the cookie jar:

	"It's not my fault! I was only copying XFS! He did it first!"

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
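
[Editorial illustration, not part of the original message.] The two-tree
lookup order described above (by-block first, by-size only as a fallback)
can be sketched roughly as follows. This is a toy model under loud
assumptions: sorted Python lists stand in for the two freespace btrees of
one allocation group, distance is measured from extent start blocks only,
and the names pick_extent and max_distance are hypothetical, not XFS
identifiers. It is meant to show the preference for locality over hole
filling, not the real allocator.

```python
import bisect
from typing import List, Optional, Tuple

Extent = Tuple[int, int]  # (start_block, length) of one free extent


def pick_extent(free: List[Extent], target: int, length: int,
                max_distance: int) -> Optional[Extent]:
    """Pick a free extent of at least `length` blocks, preferring locality.

    `max_distance` is a hypothetical cutoff for what counts as "nearby".
    """
    # Stand-ins for the two per-AG indexes (assumption: plain sorted lists).
    by_block = sorted(free)                        # keyed by start block
    by_size = sorted((ln, st) for st, ln in free)  # keyed by extent size

    # Pass 1: walk outward from the target in the by-block index and take
    # the closest extent that is big enough and still within max_distance.
    pos = bisect.bisect_left(by_block, (target, 0))
    left, right = pos - 1, pos
    while left >= 0 or right < len(by_block):
        d_left = target - by_block[left][0] if left >= 0 else None
        d_right = by_block[right][0] - target if right < len(by_block) else None
        if d_right is not None and (d_left is None or d_right <= d_left):
            dist, cand = d_right, by_block[right]
            right += 1
        else:
            dist, cand = d_left, by_block[left]
            left -= 1
        if dist > max_distance:
            break  # every remaining extent is even further away
        if cand[1] >= length:
            return cand

    # Pass 2: no nearby match, so fall back to the by-size index and take
    # the smallest extent that is large enough.
    i = bisect.bisect_left(by_size, (length, 0))
    if i < len(by_size):
        ln, st = by_size[i]
        return (st, ln)
    return None
```

With free = [(100, 4), (200, 16), (5000, 8), (9000, 64)], a request for 8
blocks near block 210 with max_distance=100 takes the 16-block extent at
200 even though the 8-block extent at 5000 is an exact size match. Only
when nothing nearby is big enough (for example a 32-block request) does
the by-size fallback apply.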