* Request for information on bloated writes using Swift
@ 2016-02-02 22:32 Dilip Simha
  2016-02-03  2:47 ` Eric Sandeen
  0 siblings, 1 reply; 14+ messages in thread
From: Dilip Simha @ 2016-02-02 22:32 UTC (permalink / raw)
  To: xfs



Hi,

I have a question regarding speculative preallocation in XFS, with respect
to kernel version 3.16.0-46-generic.
I am using Swift version 1.0 and mkfs.xfs version 3.2.1.

When I write a 256KiB file to Swift, I see that the underlying XFS uses 3x
the amount of space/blocks to write that data.
Upon performing detailed experiments, I see that when Swift uses fallocate
(the default approach), XFS doesn't reclaim the preallocated blocks it
allocated. Swift's fallocate doesn't exceed the body size (256 KiB).

Interestingly, when either allocsize=4k is set or Swift doesn't use
fallocate, XFS doesn't consume the additional space.

Can you please let me know if this is a known bug and if it's fixed in
later versions?

Thanks & Regards,
Dilip


* Re: Request for information on bloated writes using Swift
  2016-02-02 22:32 Request for information on bloated writes using Swift Dilip Simha
@ 2016-02-03  2:47 ` Eric Sandeen
  2016-02-03  3:40   ` Dilip Simha
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2016-02-03  2:47 UTC (permalink / raw)
  To: xfs



On 2/2/16 4:32 PM, Dilip Simha wrote:
> Hi,
> 
> I have a question regarding speculated preallocation in XFS, w.r.t
> kernel version: 3.16.0-46-generic. I am using Swift version: 1.0 and
> mkfs.xfs version 3.2.1
> 
> When I write a 256KiB file to Swift, I see that the underlying XFS
> uses 3x the amount of space/blocks to write that data. Upon
> performing detailed experiments, I see that when Swift uses fallocate
> (default approach), XFS doesn't reclaim the preallocated blocks that
> XFS allocated. Swift fallocate doesn't exceed the body size(256
> KiB).
> 
> Interestingly, when either allocsize=4k or when swift doesn't use
> fallocate, XFS doesn't consume additional space.
> 
> Can you please let me know if this is a known bug and if its fixed in
> the later versions?

Can you clarify the exact sequence of events?

i.e. -

xfs_io -f -c "fallocate 0 256k" -c "pwrite 0 256k" somefile

leads to unreclaimed preallocation, while

xfs_io -f -c "pwrite 0 256k" somefile

does not?  Or is it some other sequence?  I don't have a
3.16 handy to test, but if you can describe it in more detail
that'd help.  Some of this is influenced by fs geometry, too,
so xfs_info output would be good, along with any mount options
you might be using.

Are you preallocating with or without KEEP_SIZE?
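
For reference, the difference between the two is roughly this with xfs_io,
where -k asks for FALLOC_FL_KEEP_SIZE (the paths here are just examples):

xfs_io -f -c "falloc 0 256k" /tmp/no_keep_size    # preallocates and extends the file size to 256k
xfs_io -f -c "falloc -k 0 256k" /tmp/keep_size    # preallocates the blocks but leaves the file size at 0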

-Eric


* Re: Request for information on bloated writes using Swift
  2016-02-03  2:47 ` Eric Sandeen
@ 2016-02-03  3:40   ` Dilip Simha
  2016-02-03  3:42     ` Dilip Simha
  2016-02-03  6:37     ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Dilip Simha @ 2016-02-03  3:40 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs



Hi Eric,

Thank you for your quick reply.

Using xfs_io as per your suggestion, I am able to reproduce the issue.
However, I need to falloc for 256K and write for 257K to see this issue.

# xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
# stat /srv/node/r1/t4.txt | grep Blocks
  Size: 263168     Blocks: 1536       IO Block: 4096   regular file

# xfs_io -f -c "pwrite 0 257k" /srv/node/r1/t2.txt
# stat  /srv/node/r1/t2.txt | grep Blocks
Size: 263168     Blocks: 520        IO Block: 4096   regular file

# xfs_info /srv/node/r1
meta-data=/dev/mapper/35000cca05831283c-part2 isize=256    agcount=4,
agsize=183141504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=732566016, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=357698, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# cat /proc/mounts | grep r1

/dev/mapper/35000cca05831283c-part2 /srv/node/r1 xfs
rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
I waited for around 15 minutes before collecting the stat output, to give
the background reclamation logic a fair chance to do its job. I also tried
changing the value of speculative_prealloc_lifetime from 300 to 10, but it
made no difference.

cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
10
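
For reference, I changed the tunable with something like the following (run
as root; the value is in seconds):

echo 10 > /proc/sys/fs/xfs/speculative_prealloc_lifetime   # same tunable as sysctl fs.xfs.speculative_prealloc_lifetime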

Regards,
Dilip

On Tue, Feb 2, 2016 at 6:47 PM, Eric Sandeen <sandeen@sandeen.net> wrote:

>
>
> On 2/2/16 4:32 PM, Dilip Simha wrote:
> > Hi,
> >
> > I have a question regarding speculated preallocation in XFS, w.r.t
> > kernel version: 3.16.0-46-generic. I am using Swift version: 1.0 and
> > mkfs.xfs version 3.2.1
> >
> > When I write a 256KiB file to Swift, I see that the underlying XFS
> > uses 3x the amount of space/blocks to write that data. Upon
> > performing detailed experiments, I see that when Swift uses fallocate
> > (default approach), XFS doesn't reclaim the preallocated blocks that
> > XFS allocated. Swift fallocate doesn't exceed the body size(256
> > KiB).
> >
> > Interestingly, when either allocsize=4k or when swift doesn't use
> > fallocate, XFS doesn't consume additional space.
> >
> > Can you please let me know if this is a known bug and if its fixed in
> > the later versions?
>
> Can you clarify the exact sequence of events?
>
> i.e. -
>
> xfs_io -f -c "fallocate 0 256k" -c "pwrite 0 256k" somefile
>
> leads to unreaclaimed preallocation, while
>
> xfs_io -f -c "pwrite 0 256k" somefile
>
> does not?  Or is it some other sequence?  I don't have a
> 3.16 handy to test, but if you can describe it in more detail
> that'd help.  Some of this is influenced by fs geometry, too
> so xfs_info output would be good, along with any mount options
> you might be using.
>
> Are you preallocating with or without KEEP_SIZE?
>
> -Eric
>
>


* Re: Request for information on bloated writes using Swift
  2016-02-03  3:40   ` Dilip Simha
@ 2016-02-03  3:42     ` Dilip Simha
  2016-02-03  6:37     ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dilip Simha @ 2016-02-03  3:42 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs



Apologies, a small correction: the stat was taken on t1.txt, but I
mistakenly printed it as t4.txt.

On Tue, Feb 2, 2016 at 7:40 PM, Dilip Simha <nmdilipsimha@gmail.com> wrote:

> Hi Eric,
>
> Thank you for your quick reply.
>
> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> However, I need to falloc for 256K and write for 257K to see this issue.
>
> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> # stat /srv/node/r1/t4.txt | grep Blocks
>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
>
> # xfs_io -f -c "pwrite 0 257k" /srv/node/r1/t2.txt
> # stat  /srv/node/r1/t2.txt | grep Blocks
> Size: 263168    *Blocks*: 520        IO Block: 4096   regular file
>
> # xfs_info /srv/node/r1
> meta-data=/dev/mapper/35000cca05831283c-part2 isize=256    agcount=4,
> agsize=183141504 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=732566016, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=357698, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # cat /proc/mounts | grep r1
>
> /dev/mapper/35000cca05831283c-part2 /srv/node/*r1* xfs
> rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0
> 0
> I waited for around 15 mins before collecting the stat output to give the
> background reclamation logic a fair chance to do its job. I also tried
> changing the value of speculative_prealloc_lifetime from 300 to 10. But it
> was of no use.
>
> cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
> 10
>
> Regards,
> Dilip
>
> On Tue, Feb 2, 2016 at 6:47 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>
>>
>>
>> On 2/2/16 4:32 PM, Dilip Simha wrote:
>> > Hi,
>> >
>> > I have a question regarding speculated preallocation in XFS, w.r.t
>> > kernel version: 3.16.0-46-generic. I am using Swift version: 1.0 and
>> > mkfs.xfs version 3.2.1
>> >
>> > When I write a 256KiB file to Swift, I see that the underlying XFS
>> > uses 3x the amount of space/blocks to write that data. Upon
>> > performing detailed experiments, I see that when Swift uses fallocate
>> > (default approach), XFS doesn't reclaim the preallocated blocks that
>> > XFS allocated. Swift fallocate doesn't exceed the body size(256
>> > KiB).
>> >
>> > Interestingly, when either allocsize=4k or when swift doesn't use
>> > fallocate, XFS doesn't consume additional space.
>> >
>> > Can you please let me know if this is a known bug and if its fixed in
>> > the later versions?
>>
>> Can you clarify the exact sequence of events?
>>
>> i.e. -
>>
>> xfs_io -f -c "fallocate 0 256k" -c "pwrite 0 256k" somefile
>>
>> leads to unreaclaimed preallocation, while
>>
>> xfs_io -f -c "pwrite 0 256k" somefile
>>
>> does not?  Or is it some other sequence?  I don't have a
>> 3.16 handy to test, but if you can describe it in more detail
>> that'd help.  Some of this is influenced by fs geometry, too
>> so xfs_info output would be good, along with any mount options
>> you might be using.
>>
>> Are you preallocating with or without KEEP_SIZE?
>>
>> -Eric
>>
>>
>
>


* Re: Request for information on bloated writes using Swift
  2016-02-03  3:40   ` Dilip Simha
  2016-02-03  3:42     ` Dilip Simha
@ 2016-02-03  6:37     ` Dave Chinner
  2016-02-03  7:09       ` Dilip Simha
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-02-03  6:37 UTC (permalink / raw)
  To: Dilip Simha; +Cc: Eric Sandeen, xfs

On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> Hi Eric,
> 
> Thank you for your quick reply.
> 
> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> However, I need to falloc for 256K and write for 257K to see this issue.
> 
> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> # stat /srv/node/r1/t4.txt | grep Blocks
>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file

Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.

When you write *past the preallocated area* and do delayed
allocation, the speculative preallocation beyond EOF is double the
size of the extent at EOF, i.e. 512k, leading to 768k being
allocated to the file (1536 blocks, exactly).

This is expected behaviour.

> # xfs_io -f -c "pwrite 0 257k" /srv/node/r1/t2.txt
> # stat  /srv/node/r1/t2.txt | grep Blocks
> Size: 263168    *Blocks*: 520        IO Block: 4096   regular file

So for pure delayed allocation, speculative preallocation starts at a
64k file size, so it would have been (((64k + 64K) + 128K) + 256k) =
768k.


> I waited for around 15 mins before collecting the stat output to give the
> background reclamation logic a fair chance to do its job. I also tried
> changing the value of speculative_prealloc_lifetime from 300 to 10. But it
> was of no use.

The prealloc cleaner skips inodes with XFS_DIFLAG_PREALLOC set on
them.

Because the XFS_DIFLAG_PREALLOC flag is not set on the delayed
allocation inode, the EOF blocks cleaner runs, truncates it to EOF,
and 260k (520 blocks) remains allocated to the file.

i.e. you are seeing behaviour exactly as designed and intended.
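
A minimal way to see the two cases side by side, assuming a scratch XFS
mount (the /mnt/scratch paths are only examples):

xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /mnt/scratch/prealloc   # falloc sets XFS_DIFLAG_PREALLOC, so blocks beyond EOF are kept
xfs_io -f -c "pwrite 0 257k" /mnt/scratch/delalloc                      # pure delayed allocation, blocks beyond EOF get trimmed later
stat -c '%n: size=%s blocks=%b' /mnt/scratch/prealloc /mnt/scratch/delalloc   # compare now, and again after speculative_prealloc_lifetime expires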

The way swift is using fallocate is actively harmful. You do not
want preallocation for write-once files - this is exactly the
workload that delayed allocation was designed to be optimal for, as
delayed allocation sequentialises the IO from multiple files.

Using preallocation means writeback of the data cannot be optimised
across files, as the preallocation location will not be sequential to
the IO that was just issued; hence writeback will seek the disks
back and forth instead of seeing a nice sequential IO stream.

<sigh>

Yet another way that the swift storage back end tries to be smart
but ends up just making things go slow....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Request for information on bloated writes using Swift
  2016-02-03  6:37     ` Dave Chinner
@ 2016-02-03  7:09       ` Dilip Simha
  2016-02-03  8:30         ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Dilip Simha @ 2016-02-03  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



Hi Dave,

On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > Hi Eric,
> >
> > Thank you for your quick reply.
> >
> > Using xfs_io as per your suggestion, I am able to reproduce the issue.
> > However, I need to falloc for 256K and write for 257K to see this issue.
> >
> > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> > # stat /srv/node/r1/t4.txt | grep Blocks
> >   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
>
> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
>
> When you writing *past the preallocated area* and do delayed
> allocation, the speculative preallocation beyond EOF is double the
> size of the extent at EOF. i.e. 512k, leading to 768k being
> allocated to the file (1536 blocks, exactly).
>

Thank you for the details.
This is exactly where I am a bit perplexed. Since the reclamation logic
skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
allocation logic allocate extra blocks on such an inode?
My understanding is that the fallocate caller only requested 256K worth
of blocks to be available, sequentially if possible. On any subsequent write
beyond the EOF, the caller is completely unaware of the underlying
filesystem storing that data adjacent to the first 256K of data. Since XFS is
speculatively allocating additional space (512K) adjacent to the first 256K
of data, I would expect XFS to either treat these two allocations distinctly
and NOT mark XFS_DIFLAG_PREALLOC on the additional 512K of data (minus the
1K of additional data actually used), OR remove the XFS_DIFLAG_PREALLOC flag
on the entire inode.

Also, is there any way I can check for this flag?
The FLAGS column in the xfs_bmap output doesn't show any flags set. Am I
not looking at the right field?

# xfs_bmap -lpv
/srv/node/r16/objects/10/ff3/55517cd029bee36151a5098ce7cdeff3/1453771923.11401.data
/srv/node/r16/objects/10/ff3/55517cd029bee36151a5098ce7cdeff3/1453771923.11401.data:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..1535]: 1465876416..1465877951 1 (744384..745919) 1536 00000

Thanks & Regards,
Dilip


> This is expected behaviour.
>
> > # xfs_io -f -c "pwrite 0 257k" /srv/node/r1/t2.txt
> > # stat  /srv/node/r1/t2.txt | grep Blocks
> > Size: 263168    *Blocks*: 520        IO Block: 4096   regular file
>
> So pure delayed allocation, specualtive preallocation starts at 64k
> file size, so it would have been (((64k + 64K) + 128K) + 256k) =
> 768k.
>
>
> > I waited for around 15 mins before collecting the stat output to give the
> > background reclamation logic a fair chance to do its job. I also tried
> > changing the value of speculative_prealloc_lifetime from 300 to 10. But
> it
> > was of no use.
>
> The prealloc cleaner skips inodes with XFS_DIFLAG_PREALLOC set on
> them.
>
> Because the XFS_DIFLAG_PREALLOC flag is not set on the delayed
> allocation inode, the EOF blocks cleaner runs truncates it to EOF,
> and 260k (520 blocks) remains allocated to the file.
>
> i.e. you are seeing behaviour exactly as designed and intended.
>
> The way swift is using fallocate is actively harmful. You do not
> want preallocation for write once files - this is exactly the
> workload that delayed allocation was designed to be optimal for as
> delayed allocation sequentialises the IO from multiple files.
>
> Using preallocation means writeback of the data cannot be optimised
> across files as the preallocation location will not be sequential to
> the IO that was just issued, hence writeback will seek the disks
> back and forth instead of seeing a nice sequential IO stream.
>
> <sigh>
>
> Yet another way that the swift storage back end tries to be smart
> but ends up just making things go slow....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: Request for information on bloated writes using Swift
  2016-02-03  7:09       ` Dilip Simha
@ 2016-02-03  8:30         ` Dave Chinner
  2016-02-03 15:02           ` Eric Sandeen
  2016-02-03 16:10           ` Dilip Simha
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-03  8:30 UTC (permalink / raw)
  To: Dilip Simha; +Cc: Eric Sandeen, xfs

On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> Hi Dave,
> 
> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > > Hi Eric,
> > >
> > > Thank you for your quick reply.
> > >
> > > Using xfs_io as per your suggestion, I am able to reproduce the issue.
> > > However, I need to falloc for 256K and write for 257K to see this issue.
> > >
> > > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> > > # stat /srv/node/r1/t4.txt | grep Blocks
> > >   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> >
> > Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> >
> > When you writing *past the preallocated area* and do delayed
> > allocation, the speculative preallocation beyond EOF is double the
> > size of the extent at EOF. i.e. 512k, leading to 768k being
> > allocated to the file (1536 blocks, exactly).
> >
> 
> Thank you for the details.
> This is exactly where I am a bit perplexed. Since the reclamation logic
> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> allocation logic allot more blocks on such an inode?

To store the data you wrote outside the preallocated region, of
course.

> My understanding is that the fallocate caller only requested for 256K worth
> of blocks to be available sequentially if possible.

fallocate only guarantees the blocks are allocated - it does not
guarantee anything about the layout of the blocks.

> On any subsequent write beyond the EOF, the caller is completely
> unaware of the underlying file-system storing that data adjacent
> to the first 256K data.  Since XFS is speculatively allocating
> additional space (512K) adjacent to the first 256K data, I would
> expect XFS to either treat these two allocations distinctly and
> NOT mark XFS_DIFLAG_PREALLOC on the additional 512K data(minus the
> actually used additional data=1K), OR remove XFS_DIFLAG_PREALLOC
> flag on the entire inode.

Oh, if only it were that simple. It's way more complex than I have
time to explain here.

Fundamentally, XFS_DIFLAG_PREALLOC is used to indicate that
persistent preallocation has been done on the file, and so if that
has happened we need to turn off optimistic removal of blocks
anywhere in the file because we can't tell what blocks had
persistent preallocation done on them after the fact.  That's the
way it's been since unwritten extents were added to XFS back in
1998, and I don't really see the need for it to change right now.

If an application wants to mix fallocate and delayed allocation
writes to the same file in the same IO, then that's an application
bug. It's going to cause bad IO patterns and file fragmentation and
have other side effects (as you've noticed), and there's nothing the
filesystem can do about it. fallocate() requires expertise to use in
a beneficial manner - most developers do not have the required
expertise (and don't have enough expertise to realise this) and so
usually make things worse rather than better by using fallocate.

> Also, is there any way I can check for this flag?
> The FLAGS, as observed from xfs_bmap doesn't show any flags set to it. Am I
> not looking at the right flags?

xfs_io -c stat <file>
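
For reference, a rough way to check it (the path is just an example, and the
exact output format may differ between xfsprogs versions):

xfs_io -c "stat" /srv/node/r1/t1.txt     # the fsxattr.xflags line should carry the prealloc bit once falloc has been used
xfs_io -c "lsattr" /srv/node/r1/t1.txt   # prints the same inode flags as a short letter code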

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Request for information on bloated writes using Swift
  2016-02-03  8:30         ` Dave Chinner
@ 2016-02-03 15:02           ` Eric Sandeen
  2016-02-03 21:51             ` Dave Chinner
  2016-02-03 16:10           ` Dilip Simha
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2016-02-03 15:02 UTC (permalink / raw)
  To: Dave Chinner, Dilip Simha; +Cc: xfs



On 2/3/16 2:30 AM, Dave Chinner wrote:
> On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
>> Hi Dave,
>>
>> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
>>>> Hi Eric,
>>>>
>>>> Thank you for your quick reply.
>>>>
>>>> Using xfs_io as per your suggestion, I am able to reproduce the issue.
>>>> However, I need to falloc for 256K and write for 257K to see this issue.
>>>>
>>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
>>>> # stat /srv/node/r1/t4.txt | grep Blocks
>>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
>>>
>>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
>>>
>>> When you writing *past the preallocated area* and do delayed
>>> allocation, the speculative preallocation beyond EOF is double the
>>> size of the extent at EOF. i.e. 512k, leading to 768k being
>>> allocated to the file (1536 blocks, exactly).
>>>
>>
>> Thank you for the details.
>> This is exactly where I am a bit perplexed. Since the reclamation logic
>> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
>> allocation logic allot more blocks on such an inode?
> 
> To store the data you wrote outside the preallocated region, of
> course.

I think what Dilip meant was, why does it do preallocation, not
why does it allocate blocks for the data.  That part is obvious
of course.  ;)

IOWs, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
from being reclaimed, why is speculative preallocation added to files
with that flag set?

Seems like a fair question, even if Swift's use of preallocation is
ill-advised.

I don't have all the speculative preallocation heuristics in my
head like you do Dave, but if I have it right, and it's i.e.:

1) preallocate 256k
2) inode gets XFS_DIFLAG_PREALLOC
3) write 257k
4) inode gets speculative preallocation added due to write past EOF
5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC

that seems suboptimal.

Never doing speculative preallocation on files with XFS_DIFLAG_PREALLOC
set, regardless of file offset, would seem sane to me.  App asked
to take control via prealloc; let it have it, and leave it at that.

(Of course now I'll go read the code to see if I understand it
properly...)

-Eric


* Re: Request for information on bloated writes using Swift
  2016-02-03  8:30         ` Dave Chinner
  2016-02-03 15:02           ` Eric Sandeen
@ 2016-02-03 16:10           ` Dilip Simha
  2016-02-03 16:15             ` Dilip Simha
  1 sibling, 1 reply; 14+ messages in thread
From: Dilip Simha @ 2016-02-03 16:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



On Wed, Feb 3, 2016 at 12:30 AM, Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> > Hi Dave,
> >
> > On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com>
> wrote:
> >
> > > On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > > > Hi Eric,
> > > >
> > > > Thank you for your quick reply.
> > > >
> > > > Using xfs_io as per your suggestion, I am able to reproduce the
> issue.
> > > > However, I need to falloc for 256K and write for 257K to see this
> issue.
> > > >
> > > > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> > > > # stat /srv/node/r1/t4.txt | grep Blocks
> > > >   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> > >
> > > Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> > >
> > > When you writing *past the preallocated area* and do delayed
> > > allocation, the speculative preallocation beyond EOF is double the
> > > size of the extent at EOF. i.e. 512k, leading to 768k being
> > > allocated to the file (1536 blocks, exactly).
> > >
> >
> > Thank you for the details.
> > This is exactly where I am a bit perplexed. Since the reclamation logic
> > skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> > allocation logic allot more blocks on such an inode?
>
> To store the data you wrote outside the preallocated region, of
> course.
>
> > My understanding is that the fallocate caller only requested for 256K
> worth
> > of blocks to be available sequentially if possible.
>
> fallocate only guarantees the blocks are allocated - it does not
> guarantee anything about the layout of the blocks.
>
> > On any subsequent write beyond the EOF, the caller is completely
> > unaware of the underlying file-system storing that data adjacent
> > to the first 256K data.  Since XFS is speculatively allocating
> > additional space (512K) adjacent to the first 256K data, I would
> > expect XFS to either treat these two allocations distinctly and
> > NOT mark XFS_DIFLAG_PREALLOC on the additional 512K data(minus the
> > actually used additional data=1K), OR remove XFS_DIFLAG_PREALLOC
> > flag on the entire inode.
>
> Oh, if only it were that simple. It's way more complex than I have
> time to explain here.
>
> Fundamentally, XFS_DIFLAG_PREALLOC is used to indicate that
> persistent preallocation has been done on the file, and so if that
> has happened we need to turn off optimistic removal of blocks
> anywhere in the file because we can't tell what blocks had
> persistent preallocation done on them after the fact.  That's the
> way it's been since unwritten extents were added to XFS back in
> 1998, and I don't really see the need for it to change right now.
>

I completely understand the reasoning behind this reclamation logic and I
also agree with it.
But my question is about the allocation logic. I don't understand why XFS
allocates more blocks than necessary when this flag is set and when it
knows that it's not going to clean up the additional space.

A simple example would be:
1: Open File in Write mode.
2: Fallocate 256K
3: Write 256K
4: Close File

Stat shows that XFS allocated 512 blocks as expected.

5: Open file in append mode.
6: Write 256 bytes.
7: Close file.

The expectation is that the number of blocks allocated is either 512+1 or
512+8, depending on the block size.
However, XFS uses speculative preallocation to allocate 512K (as per your
explanation) to write 256 bytes and hence the overall disk usage goes up to
1536 blocks.
Now, who is responsible for clearing up the additional allocated blocks?
Clearly the application has no idea about the over-allocation.
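
In xfs_io terms, that sequence is roughly the following (the file name is
just an example):

xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" t1.txt   # steps 1-4: fallocate 256K, write 256K, close
xfs_io -c "pwrite 256k 256" t1.txt                       # steps 5-7: reopen and append 256 bytes at offset 256K
stat -c 'size=%s blocks=%b' t1.txt                       # shows the resulting allocation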

I agree that if an application uses fallocate and delayed allocation on the
same file in the same IO, then it's a badly structured application. But in
this case we have two different IOs on the same file. The first IO did not
expect an append and hence issued a fallocate. So that looks good to me.

Your thoughts on this?

Regards,
Dilip


> If an application wants to mix fallocate and delayed allocatin
> writes to the same file in the same IO, then that's an application
> bug. It's going to cause bad IO patterns and file fragmentation and
> have other side effects (as you've noticed), and there's nothing the
> filesystem can do about it. fallocate() requires expertise to use in
> a beneficial manner - most developers do not have the required
> expertise (and don't have enough expertise to realise this) and so
> usually make things worse rather than better by using fallocate.
>
> > Also, is there any way I can check for this flag?
> > The FLAGS, as observed from xfs_bmap doesn't show any flags set to it.
> Am I
> > not looking at the right flags?
>
> xfs_io -c stat <file>
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: Request for information on bloated writes using Swift
  2016-02-03 16:10           ` Dilip Simha
@ 2016-02-03 16:15             ` Dilip Simha
  0 siblings, 0 replies; 14+ messages in thread
From: Dilip Simha @ 2016-02-03 16:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



Thank you, Eric.
I am sorry, I missed reading your message before replying.
You got my question right.

Regards,
Dilip

On Wed, Feb 3, 2016 at 8:10 AM, Dilip Simha <nmdilipsimha@gmail.com> wrote:

> On Wed, Feb 3, 2016 at 12:30 AM, Dave Chinner <david@fromorbit.com> wrote:
>
>> On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
>> > Hi Dave,
>> >
>> > On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com>
>> wrote:
>> >
>> > > On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
>> > > > Hi Eric,
>> > > >
>> > > > Thank you for your quick reply.
>> > > >
>> > > > Using xfs_io as per your suggestion, I am able to reproduce the
>> issue.
>> > > > However, I need to falloc for 256K and write for 257K to see this
>> issue.
>> > > >
>> > > > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k"
>> /srv/node/r1/t1.txt
>> > > > # stat /srv/node/r1/t4.txt | grep Blocks
>> > > >   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
>> > >
>> > > Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
>> > >
>> > > When you writing *past the preallocated area* and do delayed
>> > > allocation, the speculative preallocation beyond EOF is double the
>> > > size of the extent at EOF. i.e. 512k, leading to 768k being
>> > > allocated to the file (1536 blocks, exactly).
>> > >
>> >
>> > Thank you for the details.
>> > This is exactly where I am a bit perplexed. Since the reclamation logic
>> > skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
>> > allocation logic allot more blocks on such an inode?
>>
>> To store the data you wrote outside the preallocated region, of
>> course.
>>
>> > My understanding is that the fallocate caller only requested for 256K
>> worth
>> > of blocks to be available sequentially if possible.
>>
>> fallocate only guarantees the blocks are allocated - it does not
>> guarantee anything about the layout of the blocks.
>>
>> > On any subsequent write beyond the EOF, the caller is completely
>> > unaware of the underlying file-system storing that data adjacent
>> > to the first 256K data.  Since XFS is speculatively allocating
>> > additional space (512K) adjacent to the first 256K data, I would
>> > expect XFS to either treat these two allocations distinctly and
>> > NOT mark XFS_DIFLAG_PREALLOC on the additional 512K data(minus the
>> > actually used additional data=1K), OR remove XFS_DIFLAG_PREALLOC
>> > flag on the entire inode.
>>
>> Oh, if only it were that simple. It's way more complex than I have
>> time to explain here.
>>
>> Fundamentally, XFS_DIFLAG_PREALLOC is used to indicate that
>> persistent preallocation has been done on the file, and so if that
>> has happened we need to turn off optimistic removal of blocks
>> anywhere in the file because we can't tell what blocks had
>> persistent preallocation done on them after the fact.  That's the
>> way it's been since unwritten extents were added to XFS back in
>> 1998, and I don't really see the need for it to change right now.
>>
>
> I completely understand the reasoning behind this reclamation logic and I
> also agree to it.
> But my question is with the allocation logic. I don't understand why XFS
> allocates more than necessary blocks when this flag is set and when it
> knows that its not going to clean up the additional space.
>
> A simple example would be:
> 1: Open File in Write mode.
> 2: Fallocate 256K
> 3: Write 256K
> 4: Close File
>
> Stat shows that XFS allocated 512 blocks as expected.
>
> 5: Open file in append mode.
> 6: Write 256 bytes.
> 7: Close file.
>
> Expectation is that the number of blocks allocated is either 512+1 or
> 512+8 depending on the block size.
> However, XFS uses speculative preallocation to allocate 512K (as per your
> explanation) to write 256 bytes and hence the overall disk usage goes up to
> 1536 blocks.
> Now, who is responsible for clearing up the additional allocated blocks?
> Clearly the application has no idea about the over-allocation.
>
> I agree that if an application uses fallocate and delayed allocation on
> the same file in the same IO, then its a badly structured application. But
> in this case we have two different IOs on the same file. The first IO did
> not expect an append and hence issued an fallocate. So that looks good to
> me.
>
> Your thoughts on this?
>
> Regards,
> Dilip
>
>
>> If an application wants to mix fallocate and delayed allocatin
>> writes to the same file in the same IO, then that's an application
>> bug. It's going to cause bad IO patterns and file fragmentation and
>> have other side effects (as you've noticed), and there's nothing the
>> filesystem can do about it. fallocate() requires expertise to use in
>> a beneficial manner - most developers do not have the required
>> expertise (and don't have enough expertise to realise this) and so
>> usually make things worse rather than better by using fallocate.
>>
>> > Also, is there any way I can check for this flag?
>> > The FLAGS, as observed from xfs_bmap doesn't show any flags set to it.
>> Am I
>> > not looking at the right flags?
>>
>> xfs_io -c stat <file>
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>
>


* Re: Request for information on bloated writes using Swift
  2016-02-03 15:02           ` Eric Sandeen
@ 2016-02-03 21:51             ` Dave Chinner
  2016-02-03 22:43               ` Dilip Simha
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-02-03 21:51 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dilip Simha, xfs

On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:
> 
> 
> On 2/3/16 2:30 AM, Dave Chinner wrote:
> > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> >> Hi Dave,
> >>
> >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> >>>> Hi Eric,
> >>>>
> >>>> Thank you for your quick reply.
> >>>>
> >>>> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> >>>> However, I need to falloc for 256K and write for 257K to see this issue.
> >>>>
> >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> >>>> # stat /srv/node/r1/t4.txt | grep Blocks
> >>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> >>>
> >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> >>>
> >>> When you writing *past the preallocated area* and do delayed
> >>> allocation, the speculative preallocation beyond EOF is double the
> >>> size of the extent at EOF. i.e. 512k, leading to 768k being
> >>> allocated to the file (1536 blocks, exactly).
> >>>
> >>
> >> Thank you for the details.
> >> This is exactly where I am a bit perplexed. Since the reclamation logic
> >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> >> allocation logic allot more blocks on such an inode?
> > 
> > To store the data you wrote outside the preallocated region, of
> > course.
> 
> I think what Dilip meant was, why does it do preallocation, not
> why does it allocate blocks for the data.  That part is obvious
> of course.  ;)
> 
> IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
> from being reclaimed, why is speculative preallocation added to files
> with that flag set?
> 
> Seems like a fair question, even if Swift's use of preallocation is
> ill-advised.
> 
> I don't have all the speculative preallocation heuristics in my
> head like you do Dave, but if I have it right, and it's i.e.:
> 
> 1) preallocate 256k
> 2) inode gets XFS_DIFLAG_PREALLOC
> 3) write 257k
> 4) inode gets speculative preallocation added due to write past EOF
> 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
>
> that seems suboptimal.

So do things the other way around:

1) write 257k
2) preallocate 256k beyond EOF and speculative prealloc region
3) inode gets XFS_DIFLAG_PREALLOC
4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC

This is correct behaviour.
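
For concreteness, that ordering corresponds to something like the following
(offsets are only illustrative, and falloc -k is xfs_io's FALLOC_FL_KEEP_SIZE):

xfs_io -f -c "pwrite 0 257k" -c "falloc -k 1m 256k" somefile   # extending buffered write first, then persistent prealloc well beyond EOF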

How do you tell them apart, and in what context can we actually
determine that we need to remove the inode flag?

Consider the fact that the 'write 257k' doesn't actually do any
modification to the extent list, i.e. we still have 256k of
persistent preallocation as unwritten extents. These are not
converted to written extents until writeback *completes*, so if we
crash before writeback, the inode remains with only 256k of
preallocated, unwritten extents. Speculative prealloc in memory occurs
in the write() context, physical allocation occurs in the writeback
context, and inode size updates occur at IO completion.

i.e. none of these contexts have enough information to be able to
determine whether the XFS_DIFLAG_PREALLOC needs to be removed,
because it cannot be removed until all the persistent prealloc has
been written over *and* the new EOF is stable on disk.

Further, what about persistent preallocation in the middle of the
file? Do we remove the XFS_DIFLAG_PREALLOC while that still exists
as unwritten extents? This gets especially interesting once we
consider the behaviour reflink, COW and dedupe should have on such
extents....

As I said: This is anything but simple, and it's not going to get
any simpler any time soon.

> Never doing speculative preallocation on files with XFS_DIFLAG_PREALLOC
> set, regardless of file offset, would seem sane to me.  App asked
> to take control via prealloc; let it have it, and leave it at that.

We already don't do speculative prealloc on inodes that have blocks
beyond EOF - we detect that case and skip it. But when there aren't
blocks beyond EOF, extending writes should use speculative
preallocation.

But if we decide that we don't do speculative prealloc when
XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate
(like swift), or use fallocate to fill sparse holes in files, are
going to fragment the hell out of their files when they extend
them.

In reality, if swift is really just writing 1k past the prealloc'd
range it creates, then that is clearly an application bug. Further,
if swift is only ever preallocating the first 256k of each file it
writes, regardless of size, then that is also an application bug.

If such users don't like the fact their application is badly written
and interacts badly with a filesystem feature that is, in general,
the best behaviour to have, then they can either (1) get the
application fixed, or (2) set mount options to turn off the feature
that the application bugs interact badly with.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Request for information on bloated writes using Swift
  2016-02-03 21:51             ` Dave Chinner
@ 2016-02-03 22:43               ` Dilip Simha
  2016-02-03 23:28                 ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Dilip Simha @ 2016-02-03 22:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



On Wed, Feb 3, 2016 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:
> >
> >
> > On 2/3/16 2:30 AM, Dave Chinner wrote:
> > > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> > >> Hi Dave,
> > >>
> > >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com>
> wrote:
> > >>
> > >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > >>>> Hi Eric,
> > >>>>
> > >>>> Thank you for your quick reply.
> > >>>>
> > >>>> Using xfs_io as per your suggestion, I am able to reproduce the
> issue.
> > >>>> However, I need to falloc for 256K and write for 257K to see this
> issue.
> > >>>>
> > >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k"
> /srv/node/r1/t1.txt
> > >>>> # stat /srv/node/r1/t4.txt | grep Blocks
> > >>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> > >>>
> > >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> > >>>
> > >>> When you writing *past the preallocated area* and do delayed
> > >>> allocation, the speculative preallocation beyond EOF is double the
> > >>> size of the extent at EOF. i.e. 512k, leading to 768k being
> > >>> allocated to the file (1536 blocks, exactly).
> > >>>
> > >>
> > >> Thank you for the details.
> > >> This is exactly where I am a bit perplexed. Since the reclamation
> logic
> > >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> > >> allocation logic allot more blocks on such an inode?
> > >
> > > To store the data you wrote outside the preallocated region, of
> > > course.
> >
> > I think what Dilip meant was, why does it do preallocation, not
> > why does it allocate blocks for the data.  That part is obvious
> > of course.  ;)
> >
> > IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
> > from being reclaimed, why is speculative preallocation added to files
> > with that flag set?
> >
> > Seems like a fair question, even if Swift's use of preallocation is
> > ill-advised.
> >
> > I don't have all the speculative preallocation heuristics in my
> > head like you do Dave, but if I have it right, and it's i.e.:
> >
> > 1) preallocate 256k
> > 2) inode gets XFS_DIFLAG_PREALLOC
> > 3) write 257k
> > 4) inode gets speculative preallocation added due to write past EOF
> > 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> >
> > that seems suboptimal.
>
> So do things the other way around:
>
> 1) write 257k
> 2) preallocate 256k beyond EOF and speculative prealloc region
> 3) inode gets XFS_DIFLAG_PREALLOC
> 4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
>
> This is correct behaviour.
>

I am sorry, but I don't agree with this. How can a user application know
about step 2? XFS may preallocate 256k or any other size depending on the
free space available on the system. Some other filesystem may not even do
speculative preallocation. So it makes little sense for a user application
to take responsibility for disk space that it doesn't know about.


>
> How do you tell them apart, and in what context can we actually
> determine that we need to remove the inode flag?
>
> Consider the fact that the 'write 257k' doesn't actually do any
> modification to the extent list. i.e. we still have 256k of
> persistent preallocation as unwritten extents. These do not
> converted to written extents until writeback *completes*, so if we
> crash before writeback, the inode remains with only 256k of
> preallocated, unwritten extents. speculative prealloc in memory occurs in
> the
> write() context, physical allocation occurs in the writeback
> context, and inode size updates occur at IO completion.
>
> i.e. none of these contexts have enough information to be able to
> determine whether the XFS_DIFLAG_PREALLOC needs to be removed,
> because it cannot be removed until all the persistent prealloc has
> been written over *and* the new EOF is stable on disk.
>
> Further, what about persistent preallocation in the middle of the
> file? Do we remove the XFS_DIFLAG_PREALLOC while that still exists
> as unwritten extents? This gets especially interesting once we
> consider the behaviour reflink, COW and dedupe should have on such
> extents....
>
> As I said: This is anything but simple, and it's not going to get
> any simpler any time soon.
>

I agree, having to remove the XFS_DIFLAG_PREALLOC flag is not a simple
option and needs careful thought.
However, as Eric suggested, it's easier to NOT do speculative preallocation
on inodes that already have this flag set. This is simply because
XFS assumes the user application issued fallocate with the best
knowledge of its workload. By the way, this need not be just Swift; any
user application can experience this issue. Also, I am not associated with
Swift!

>
> > Never doing speculative preallocation on files with XFS_DIFLAG_PREALLOC
> > set, regardless of file offset, would seem sane to me.  App asked
> > to take control via prealloc; let it have it, and leave it at that.
>
> We don't do speculative prealloc on inodes that already have blocks
> beyond EOF. We already detect that case and don't do speculative
> prealloc. But when there aren't blocks beyond EOF, extending
> writes should use speculative preallocation.
>
> But if we decide that we don't do speculative prealloc when
> XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate
> (like swift), or use fallocate to fill sparse holes in files are
> going fragment the hell out of their files when they extending
> them.
>

I don't understand why this would be the case. If XFS doesn't do
speculative preallocation, then the 256-byte write past EOF
will simply push the EOF ahead. So I see no harm if XFS
doesn't do speculative preallocation when XFS_DIFLAG_PREALLOC is set.


>
> In reality, if swift is really just writing 1k past the prealloc'd
> range it creates, then that is clearly an application bug. Further,
> if swift is only ever preallocating the first 256k of each file it
> writes, regardless of size, then that is also an application bug.
>

It's not a bug. Consider a use case like appending to a file. Would you say
append is a buggy operation?
An append operation can come at any time after the initial fallocate and
write have happened.

The simple steps to recreate this bloated-write issue are:
xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" -c "pwrite 256k 256" t1.txt

Thanks & Regards,
Dilip


> If such users don't like the fact their application is badly written
> and interacts badly with a filesystem feature that is, in general,
> the best behaviour to have, then they can either (1) get the
> application fixed, or (2) set mount options to turn off the feature
> that the application bugs interact badly with.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: Request for information on bloated writes using Swift
  2016-02-03 22:43               ` Dilip Simha
@ 2016-02-03 23:28                 ` Dave Chinner
  2016-02-04  6:16                   ` Dilip Simha
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-02-03 23:28 UTC (permalink / raw)
  To: Dilip Simha; +Cc: Eric Sandeen, xfs

On Wed, Feb 03, 2016 at 02:43:27PM -0800, Dilip Simha wrote:
> On Wed, Feb 3, 2016 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:
> > >
> > >
> > > On 2/3/16 2:30 AM, Dave Chinner wrote:
> > > > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> > > >> Hi Dave,
> > > >>
> > > >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com>
> > wrote:
> > > >>
> > > >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > > >>>> Hi Eric,
> > > >>>>
> > > >>>> Thank you for your quick reply.
> > > >>>>
> > > >>>> Using xfs_io as per your suggestion, I am able to reproduce the
> > issue.
> > > >>>> However, I need to falloc for 256K and write for 257K to see this
> > issue.
> > > >>>>
> > > >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k"
> > /srv/node/r1/t1.txt
> > > >>>> # stat /srv/node/r1/t4.txt | grep Blocks
> > > >>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> > > >>>
> > > >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> > > >>>
> > > >>> When you writing *past the preallocated area* and do delayed
> > > >>> allocation, the speculative preallocation beyond EOF is double the
> > > >>> size of the extent at EOF. i.e. 512k, leading to 768k being
> > > >>> allocated to the file (1536 blocks, exactly).
> > > >>>
> > > >>
> > > >> Thank you for the details.
> > > >> This is exactly where I am a bit perplexed. Since the reclamation
> > logic
> > > >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
> > > >> allocation logic allot more blocks on such an inode?
> > > >
> > > > To store the data you wrote outside the preallocated region, of
> > > > course.
> > >
> > > I think what Dilip meant was, why does it do preallocation, not
> > > why does it allocate blocks for the data.  That part is obvious
> > > of course.  ;)
> > >
> > > IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
> > > from being reclaimed, why is speculative preallocation added to files
> > > with that flag set?
> > >
> > > Seems like a fair question, even if Swift's use of preallocation is
> > > ill-advised.
> > >
> > > I don't have all the speculative preallocation heuristics in my
> > > head like you do Dave, but if I have it right, and it's i.e.:
> > >
> > > 1) preallocate 256k
> > > 2) inode gets XFS_DIFLAG_PREALLOC
> > > 3) write 257k
> > > 4) inode gets speculative preallocation added due to write past EOF
> > > 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> > >
> > > that seems suboptimal.
> >
> > So do things the other way around:
> >
> > 1) write 257k
> > 2) preallocate 256k beyond EOF and speculative prealloc region
> > 3) inode gets XFS_DIFLAG_PREALLOC
> > 4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> >
> > This is correct behaviour.
> >
> 
> I am sorry, but I don't agree to this. How can an user application know
> about step2.

Step 2 is fallocate(keep size) to a range well beyond EOF, e.g. in
preparation for a bunch of sparse writes that are about to take
place. So userspace will most definitely know about it. It's the
kernel that now doesn't have a clue what to do about the speculative
preallocation it already has, because the application is mixing its
IO models.

Fundamentally, if you mix writes across persistent preallocation and
adjacent holes, you are going to get a mess no matter what
filesystem you do this to. If you don't like the way XFS handles it,
either fix the application to not do this, or use the mount option
to turn off speculative preallocation.

Just like we say "don't mix direct IO and buffered IO on the same
file", it's a really good idea not to mix preallocated and
non-preallocated writes to the same file.

> > But if we decide that we don't do speculative prealloc when
> > XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate
> > (like swift), or use fallocate to fill sparse holes in files are
> > going fragment the hell out of their files when they extending
> > them.
> >
> 
> I don't understand why would this be the case. If XFS doesn't do
> speculative preallocation then for the 256 byte write after the end of EOF
> will simply result in pushing the EOF ahead. So I see no harm if XFS
> doesn't do speculative preallocation when XFS_DIFLAG_PREALLOC is set.

I see *potential harm* in changing a long standing default
behaviour.

> > In reality, if swift is really just writing 1k past the prealloc'd
> > range it creates, then that is clearly an application bug. Further,
> > if swift is only ever preallocating the first 256k of each file it
> > writes, regardless of size, then that is also an application bug.
> 
> Its not a bug. Assume a use-case like appending to a file. Would you say
> append is a buggy operation?

If the app is using preallocation to reduce append-workload file
fragmentation, and then doesn't use preallocation once it is used up,
then the app is definitely buggy because it's not being consistent in
its IO behaviour.  The app should always use fallocate() to control
file layout, or it should never use fallocate and leave the
filesystem to optimise the layout as it sees best.

In my experience, the filesystem will almost always do a better job
of optimising allocation for best throughput and minimum seeks than
applications using fallocate().

IOWs, the default behaviour of XFS has been around for more than 15
years and is sane for the majority of applications out there. Hence
the solution here is to either fix the application that is doing
stupid things with fallocate(), or use the allocsize mount option
to minimise the impact of the stupid thing the buggy application is
doing.
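
For completeness, the mount option form looks like the following (device and
mount point are placeholders, and the 64k value is only an example; it just
needs to be a power-of-two size):

mount -t xfs -o allocsize=64k /dev/sdXN /srv/node/r1   # caps speculative EOF preallocation at a fixed size instead of the dynamic default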

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Request for information on bloated writes using Swift
  2016-02-03 23:28                 ` Dave Chinner
@ 2016-02-04  6:16                   ` Dilip Simha
  0 siblings, 0 replies; 14+ messages in thread
From: Dilip Simha @ 2016-02-04  6:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



Hi Dave,

Thanks very much for the suggestions. Your point about not mixing
preallocated and non-preallocated writes on the same file makes sense
to me.
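
Concretely, for our case that would mean preallocating at least the
full object size up front (sizes as in the earlier reproduction), e.g.:

  # xfs_io -f -c "falloc 0 257k" -c "pwrite 0 257k" /srv/node/r1/t1.txt

so the write never extends past the preallocated range.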

Regards,
Dilip

On Wed, Feb 3, 2016 at 3:28 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Feb 03, 2016 at 02:43:27PM -0800, Dilip Simha wrote:
> > On Wed, Feb 3, 2016 at 1:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > > On Wed, Feb 03, 2016 at 09:02:40AM -0600, Eric Sandeen wrote:
> > > >
> > > >
> > > > On 2/3/16 2:30 AM, Dave Chinner wrote:
> > > > > On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
> > > > >> Hi Dave,
> > > > >>
> > > > >> On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > >>
> > > > >>> On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
> > > > >>>> Hi Eric,
> > > > >>>>
> > > > >>>> Thank you for your quick reply.
> > > > >>>>
> > > > >>>> Using xfs_io as per your suggestion, I am able to reproduce the issue.
> > > > >>>> However, I need to falloc for 256K and write for 257K to see this issue.
> > > > >>>>
> > > > >>>> # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
> > > > >>>> # stat /srv/node/r1/t4.txt | grep Blocks
> > > > >>>>   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
> > > > >>>
> > > > >>> Fallocate sets the XFS_DIFLAG_PREALLOC on the inode.
> > > > >>>
> > > > >>> When you write *past the preallocated area* and do delayed
> > > > >>> allocation, the speculative preallocation beyond EOF is double the
> > > > >>> size of the extent at EOF, i.e. 512k, leading to 768k being
> > > > >>> allocated to the file (1536 blocks, exactly).
> > > > >>>
> > > > >>
> > > > >> Thank you for the details.
> > > > >> This is exactly where I am a bit perplexed. Since the reclamation
> > > logic
> > > > >> skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did
> the
> > > > >> allocation logic allot more blocks on such an inode?
> > > > >
> > > > > To store the data you wrote outside the preallocated region, of
> > > > > course.
> > > >
> > > > I think what Dilip meant was, why does it do preallocation, not
> > > > why does it allocate blocks for the data.  That part is obvious
> > > > of course.  ;)
> > > >
> > > > IOWS, if XFS_DIFLAG_PREALLOC prevents speculative preallocation
> > > > from being reclaimed, why is speculative preallocation added to files
> > > > with that flag set?
> > > >
> > > > Seems like a fair question, even if Swift's use of preallocation is
> > > > ill-advised.
> > > >
> > > > I don't have all the speculative preallocation heuristics in my
> > > > head like you do Dave, but if I have it right, and it's i.e.:
> > > >
> > > > 1) preallocate 256k
> > > > 2) inode gets XFS_DIFLAG_PREALLOC
> > > > 3) write 257k
> > > > 4) inode gets speculative preallocation added due to write past EOF
> > > > 5) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> > > >
> > > > that seems suboptimal.
> > >
> > > So do things the other way around:
> > >
> > > 1) write 257k
> > > 2) preallocate 256k beyond EOF and speculative prealloc region
> > > 3) inode gets XFS_DIFLAG_PREALLOC
> > > 4) inode never gets preallocation trimmed due to XFS_DIFLAG_PREALLOC
> > >
> > > This is correct behaviour.
> > >
> >
> > I am sorry, but I don't agree with this. How can a user application know
> > about step 2?
>
> Step 2 is fallocate(keep size) to a range well beyond EOF, e.g. in
> preparation for a bunch of sparse writes that are about to take
> place. So userspace will most definitely know about it. It's the
> kernel that now doesn't have a clue what to do about the speculative
> preallocation it already has, because the application is mixing its
> IO models.
>
> Fundamentally, if you mix writes across persistent preallocation and
> adjacent holes, you are going to get a mess no matter what
> filesystem you do this to. If you don't like the way XFS handles it,
> either fix the application to not do this, or use the mount option
> to turn off speculative preallocation.
>
> Just like we say "don't mix direct IO and buffered IO on the same
> file", it's a really good idea not to mix preallocated and
> non-preallocated writes to the same file.
>
> > > But if we decide that we don't do speculative prealloc when
> > > XFS_DIFLAG_PREALLOC is set, then workloads that mis-use fallocate
> > > (like swift), or use fallocate to fill sparse holes in files, are
> > > going to fragment the hell out of their files when they extend
> > > them.
> > >
> >
> > I don't understand why this would be the case. If XFS doesn't do
> > speculative preallocation, then the 256 byte write beyond EOF
> > will simply result in pushing the EOF ahead. So I see no harm if XFS
> > doesn't do speculative preallocation when XFS_DIFLAG_PREALLOC is set.
>
> I see *potential harm* in changing a long standing default
> behaviour.
>
> > > In reality, if swift is really just writing 1k past the prealloc'd
> > > range it creates, then that is clearly an application bug. Further,
> > > if swift is only ever preallocating the first 256k of each file it
> > > writes, regardless of size, then that is also an application bug.
> >
> > It's not a bug. Assume a use-case like appending to a file. Would you say
> > append is a buggy operation?
>
> If the app is using preallocation to reduce append workload file
> fragmentation, and then doesn't use preallocation once it is used up,
> then the app is definitely buggy because it's not being consistent in
> its IO behaviour.  The app should always use fallocate() to control
> file layout, or it should never use fallocate and leave the
> filesystem to optimise the layout as it sees best.
>
> In my experience, the filesystem will almost always do a better job
> of optimising allocation for best throughput and minimum seeks than
> applications using fallocate().
>
> IOWs, the default behaviour of XFS has been around for more than 15
> years and is sane for the majority of applications out there. Hence
> the solution here is to either fix the application that is doing
> stupid things with fallocate(), or use the allocsize mount option
> to minimise the impact of the stupid thing the buggy application is
> doing.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

[-- Attachment #1.2: Type: text/html, Size: 8470 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-02-04  6:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-02 22:32 Request for information on bloated writes using Swift Dilip Simha
2016-02-03  2:47 ` Eric Sandeen
2016-02-03  3:40   ` Dilip Simha
2016-02-03  3:42     ` Dilip Simha
2016-02-03  6:37     ` Dave Chinner
2016-02-03  7:09       ` Dilip Simha
2016-02-03  8:30         ` Dave Chinner
2016-02-03 15:02           ` Eric Sandeen
2016-02-03 21:51             ` Dave Chinner
2016-02-03 22:43               ` Dilip Simha
2016-02-03 23:28                 ` Dave Chinner
2016-02-04  6:16                   ` Dilip Simha
2016-02-03 16:10           ` Dilip Simha
2016-02-03 16:15             ` Dilip Simha
