Re: Request for information on bloated writes using Swift

From: Dilip Simha <nmdilipsimha@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@sandeen.net>, xfs@oss.sgi.com
Subject: Re: Request for information on bloated writes using Swift
Date: Wed, 3 Feb 2016 14:43:27 -0800
Message-ID: <CAFHL4X1gxP8B_JLHoOKxhX553nkOH+D_2DjgXXq7hPjxTi2t0g@mail.gmail.com>
In-Reply-To: <20160203215144.GE459@dastard>

On Wed, Feb 3, 2016 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:
> >
> >
> > On 2/3/16 2:30 AM, Dave Chinner wrote:
> > > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> > >> Hi Dave,
> > >>
> > >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > >>>> Hi Eric,
> > >>>>
> > >>>> Thank you for your quick reply.
> > >>>>
> > >>>> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> > >>>> However, I need to falloc for 256K and write for 257K to see this issue.
> > >>>>
> > >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> > >>>> # stat /srv/node/r1/t4.txt | grep Blocks
> > >>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> > >>>
> > >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> > >>>
> > >>> When you write *past the preallocated area* and do delayed
> > >>> allocation, the speculative preallocation beyond EOF is double the
> > >>> size of the extent at EOF. i.e. 512k, leading to 768k being
> > >>> allocated to the file (1536 blocks, exactly).
> > >>>
> > >>
> > >> Thank you for the details.
> > >> This is exactly where I am a bit perplexed. Since the reclamation logic
> > >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> > >> allocation logic allot more blocks on such an inode?
> > >
> > > To store the data you wrote outside the preallocated region, of
> > > course.
> >
> > I think what Dilip meant was, why does it do preallocation, not
> > why does it allocate blocks for the data.  That part is obvious
> > of course.  ;)
> >
> > IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
> > from being reclaimed, why is speculative preallocation added to files
> > with that flag set?
> >
> > Seems like a fair question, even if Swift's use of preallocation is
> > ill-advised.
> >
> > I don't have all the speculative preallocation heuristics in my
> > head like you do Dave, but if I have it right, and it's i.e.:
> >
> > 1) preallocate 256k
> > 2) inode gets XFS_DIFLAG_PREALLOC
> > 3) write 257k
> > 4) inode gets speculative preallocation added due to write past EOF
> > 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> >
> > that seems suboptimal.
>
> So do things the other way around:
>
> 1) write 257k
> 2) preallocate 256k beyond EOF and speculative prealloc region
> 3) inode gets XFS_DIFLAG_PREALLOC
> 4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
>
> This is correct behaviour.
>

I am sorry, but I don't agree with this. How can a user application know
about step 2? XFS may preallocate 256k or any other size depending on the
free space available on the system. Some other filesystem may not even do
speculative preallocation. So it makes little sense for a user application
to take responsibility for disk space that it doesn't know about.

>
> How do you tell them apart, and in what context can we actually
> determine that we need to remove the inode flag?
>
> Consider the fact that the 'write 257k' doesn't actually do any
> modification to the extent list. i.e. we still have 256k of
> persistent preallocation as unwritten extents. These do not get
> converted to written extents until writeback *completes*, so if we
> crash before writeback, the inode remains with only 256k of
> preallocated, unwritten extents. Speculative prealloc in memory occurs
> in the write() context, physical allocation occurs in the writeback
> context, and inode size updates occur at IO completion.
>
> i.e. none of these contexts have enough information to be able to
> determine whether the XFS_DIFLAG_PREALLOC needs to be removed,
> because it cannot be removed until all the persistent prealloc has
> been written over *and* the new EOF is stable on disk.
>
> Further, what about persistent preallocation in the middle of the
> file? Do we remove the XFS_DIFLAG_PREALLOC while that still exists
> as unwritten extents? This gets especially interesting once we
> consider the behaviour reflink, COW and dedupe should have on such
> extents....
>
> As I said: This is anything but simple, and it's not going to get
> any simpler any time soon.
>

I agree, removing the XFS_DIFLAG_PREALLOC flag is not a simple option and
needs careful thought.
However, as Eric suggested, it's easier to NOT do speculative preallocation
on inodes that already have this flag set. This is simply because XFS
assumes the user application issued fallocate with the best knowledge of
its workload. By the way, this need not be just Swift. Any user
application can experience this issue. Also, I am not associated with
Swift!

>
> > Never doing speculative preallocation on files with XFS_DIFLAG_PREALLOC
> > set, regardless of file offset, would seem sane to me.  App asked
> > to take control via prealloc; let it have it, and leave it at that.
>
> We don't do speculative prealloc on inodes that already have blocks
> beyond EOF. We already detect that case and don't do speculative
> prealloc. But when there aren't blocks beyond EOF, extending
> writes should use speculative preallocation.
>
> But if we decide that we don't do speculative prealloc when
> XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate
> (like swift), or use fallocate to fill sparse holes in files are
> going to fragment the hell out of their files when they extend
> them.
>

I don't understand why this would be the case. If XFS doesn't do
speculative preallocation, then the 256-byte write past EOF will simply
push the EOF ahead. So I see no harm if XFS doesn't do speculative
preallocation when XFS_DIFLAG_PREALLOC is set.
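
To illustrate what I mean, here is the minimum space the append case
needs. This is a rough sketch, ASSUMING a 4096-byte filesystem block size
(my assumption, not something measured here); it is a lower bound on the
allocation, not a claim about what XFS would actually allocate.

```shell
#!/bin/sh
# Lower bound for the append case, ASSUMING a 4096-byte filesystem block.
fs_block=4096
data_bytes=$((256 * 1024 + 256))                         # 256k write plus a 256-byte append
fs_blocks=$(( (data_bytes + fs_block - 1) / fs_block ))  # round up to whole blocks
min_stat_blocks=$((fs_blocks * fs_block / 512))          # in the 512-byte units stat reports

echo "data: ${data_bytes} bytes, minimum: ${min_stat_blocks} blocks"
```

This only counts the space required to hold the data; it says nothing
about how the extents would be laid out on disk.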

>
> In reality, if swift is really just writing 1k past the prealloc'd
> range it creates, then that is clearly an application bug. Further,
> if swift is only ever preallocating the first 256k of each file it
> writes, regardless of size, then that is also an application bug.
>

It's not a bug. Assume a use case like appending to a file. Would you say
append is a buggy operation?
An append operation can come at any time after the initial fallocate and
write have happened.

A simple way to recreate this bloated-write issue is:
xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" -c "pwrite 256k 256" t1.txt
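
To compare the file size against what is actually allocated afterwards,
something like the following can be used. It is a minimal sketch that
assumes GNU stat; report_alloc is a helper name I made up, and the target
path is a placeholder that must be on an XFS filesystem to show the
effect.

```shell
#!/bin/sh
# Print the apparent size and the allocated bytes of a file (GNU stat).
# %s = size in bytes, %b = blocks allocated, %B = bytes per reported block.
report_alloc() {
    stat -c '%s %b %B' "$1" | {
        read -r size blocks bsize
        echo "size=${size} allocated=$((blocks * bsize))"
    }
}

# Run the reproducer only when a target path is given and xfs_io exists.
target="${1:-}"
if [ -n "$target" ] && command -v xfs_io >/dev/null 2>&1; then
    xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" -c "pwrite 256k 256" "$target"
    report_alloc "$target"
fi
```

Saved as a script and run with a path as its only argument, it writes the
file and prints one line with both numbers.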

Thanks & Regards,
Dilip

> If such users don't like the fact their application is badly written
> and interacts badly with a filesystem feature that is, in general,
> the best behaviour to have, then they can either (1) get the
> application fixed, or (2) set mount options to turn off the feature
> that the application bugs interact badly with.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
