From: Dilip Simha <nmdilipsimha@gmail.com>
Date: Wed, 3 Feb 2016 08:15:34 -0800
Subject: Re: Request for information on bloated writes using Swift
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen, xfs@oss.sgi.com
List-Id: XFS Filesystem from SGI

Thank you Eric,
I am sorry, I missed reading your message before replying.
You got my question right.

Regards,
Dilip

On Wed, Feb 3, 2016 at 8:10 AM, Dilip Simha <nmdilipsimha@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 12:30 AM, Dave Chinner <david@fromorbit.com> wrote:
>
>> On Tue, Feb 02, 2016 at 11:09:15PM -0800, Dilip Simha wrote:
>> > Hi Dave,
>> >
>> > On Tue, Feb 2, 2016 at 10:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > > On Tue, Feb 02, 2016 at 07:40:34PM -0800, Dilip Simha wrote:
>> > > > Hi Eric,
>> > > >
>> > > > Thank you for your quick reply.
>> > > >
>> > > > Using xfs_io as per your suggestion, I am able to reproduce the issue.
>> > > > However, I need to falloc for 256K and write for 257K to see this issue.
>> > > >
>> > > > # xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /srv/node/r1/t1.txt
>> > > > # stat /srv/node/r1/t4.txt | grep Blocks
>> > > >   Size: 263168     Blocks: 1536       IO Block: 4096   regular file
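
A minimal way to see both sides of this, sketched with xfs_io (the /mnt/test
paths are only illustrative, and the block counts are the 512-byte units that
stat reports, as seen in this thread): a write that stays inside the
fallocated range leaves the block count alone, while a write that runs past
it leaves the doubled EOF preallocation behind.

    # paths are illustrative; run on an XFS mount
    # case 1: write stays within the fallocated 256k
    xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" /mnt/test/inside
    stat -c "size=%s blocks=%b" /mnt/test/inside    # expect blocks=512

    # case 2: write runs 1k past the fallocated 256k
    xfs_io -f -c "falloc 0 256k" -c "pwrite 0 257k" /mnt/test/past
    stat -c "size=%s blocks=%b" /mnt/test/past      # expect blocks=1536, as above
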
>> > >
>> > > Fallocate sets the XFS_DIFLAG_PREALLOC flag on the inode.
>> > >
>> > > When you write *past the preallocated area* and do delayed
>> > > allocation, the speculative preallocation beyond EOF is double the
>> > > size of the extent at EOF, i.e. 512k, leading to 768k being
>> > > allocated to the file (1536 blocks, exactly).
>> > >
>> >
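
As a back-of-the-envelope check of those figures (stat counts 512-byte blocks):

    #   fallocated range:                      256 KiB
    #   extent at EOF, doubled beyond EOF:   + 512 KiB of speculative preallocation
    #   total space carried by the inode:      768 KiB
    echo $(( (256 + 512) * 1024 / 512 ))   # 1536, matching the Blocks value above
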
>> > Thank you for the details.
>> > This is exactly where I am a bit perplexed. Since the reclamation logic
>> > skips inodes that have the XFS_DIFLAG_PREALLOC flag set, why did the
>> > allocation logic allot more blocks on such an inode?

>> To store the data you wrote outside the preallocated region, of
>> course.
>>
>> > My understanding is that the fallocate caller only requested for 256K worth
>> > of blocks to be available sequentially if possible.
>>
>> fallocate only guarantees the blocks are allocated - it does not
>> guarantee anything about the layout of the blocks.
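
If you want to see what layout a given fallocate actually produced, the extent
map can be dumped with xfs_bmap; a rough sketch (file name illustrative, and
the exact columns vary between xfsprogs versions):

    xfs_io -f -c "falloc 0 256k" /mnt/test/prealloc-only   # illustrative path
    xfs_bmap -v /mnt/test/prealloc-only
    # -v lists each extent with its block range; a fallocated-but-unwritten
    # extent is reported as an unwritten (preallocated) extent in the FLAGS
    # column. That is per-extent state - xfs_bmap does not report the inode's
    # XFS_DIFLAG_PREALLOC flag, which is why no flags show up when the file is
    # examined with xfs_bmap later in this thread.
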

>> > On any subsequent write beyond the EOF, the caller is completely
>> > unaware of the underlying file-system storing that data adjacent
>> > to the first 256K data. Since XFS is speculatively allocating
>> > additional space (512K) adjacent to the first 256K data, I would
>> > expect XFS to either treat these two allocations distinctly and
>> > NOT mark XFS_DIFLAG_PREALLOC on the additional 512K data (minus the
>> > actually used additional data = 1K), OR remove the XFS_DIFLAG_PREALLOC
>> > flag on the entire inode.

>> Oh, if only it were that simple. It's way more complex than I have
>> time to explain here.
>>
>> Fundamentally, XFS_DIFLAG_PREALLOC is used to indicate that
>> persistent preallocation has been done on the file, and so if that
>> has happened we need to turn off optimistic removal of blocks
>> anywhere in the file because we can't tell what blocks had
>> persistent preallocation done on them after the fact. That's the
>> way it's been since unwritten extents were added to XFS back in
>> 1998, and I don't really see the need for it to change right now.
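
For contrast, a sketch of the "optimistic removal" being described here (path
illustrative; exact final block counts depend on block size and writeback
timing): the same overshooting write without any fallocate also triggers
speculative preallocation beyond EOF, but since XFS_DIFLAG_PREALLOC is never
set, those EOF blocks remain eligible to be trimmed once the file is closed.

    xfs_io -f -c "pwrite 0 257k" /mnt/test/no-falloc   # buffered write, no fallocate
    sync
    stat -c "size=%s blocks=%b" /mnt/test/no-falloc
    # without the prealloc flag, the speculative blocks beyond EOF are expected
    # to be reclaimed (at close or by background trimming), so the block count
    # should settle near the written size instead of the inflated 1536 above
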

> I completely understand the reasoning behind this reclamation logic and I
> also agree with it.
> But my question is with the allocation logic. I don't understand why XFS
> allocates more blocks than necessary when this flag is set and when it
> knows that it's not going to clean up the additional space.

> A simple example would be:
> 1: Open File in Write mode.
> 2: Fallocate 256K
> 3: Write 256K
> 4: Close File
>
> Stat shows that XFS allocated 512 blocks as expected.
>
> 5: Open file in append mode.
> 6: Write 256 bytes.
> 7: Close file.

> The expectation is that the number of blocks allocated is either 512+1 or
> 512+8, depending on the block size.
> However, XFS uses speculative preallocation to allocate 512K (as per your
> explanation) to write 256 bytes, and hence the overall disk usage goes up
> to 1536 blocks.
> Now, who is responsible for clearing up the additional allocated blocks?
> Clearly the application has no idea about the over-allocation.
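
For what it's worth, the same sequence expressed with xfs_io rather than an
open/append from an application (path illustrative; the second invocation
writes 256 bytes at the old EOF):

    xfs_io -f -c "falloc 0 256k" -c "pwrite 0 256k" /mnt/test/append-case
    stat -c "blocks=%b" /mnt/test/append-case           # steps 1-4: 512 blocks
    xfs_io -c "pwrite 256k 256" /mnt/test/append-case    # steps 5-7: 256 bytes past EOF
    stat -c "blocks=%b" /mnt/test/append-case            # reported here: 1536 blocks
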

> I agree that if an application uses fallocate and delayed allocation on
> the same file in the same IO, then it's a badly structured application.
> But in this case we have two different IOs on the same file. The first IO
> did not expect an append and hence issued an fallocate. So that looks
> good to me.
>
> Your thoughts on this?

> Regards,
> Dilip


>> If an application wants to mix fallocate and delayed allocation
>> writes to the same file in the same IO, then that's an application
>> bug. It's going to cause bad IO patterns and file fragmentation and
>> have other side effects (as you've noticed), and there's nothing the
>> filesystem can do about it. fallocate() requires expertise to use in
>> a beneficial manner - most developers do not have the required
>> expertise (and don't have enough expertise to realise this) and so
>> usually make things worse rather than better by using fallocate.

>> > Also, is there any way I can check for this flag?
>> > The FLAGS, as observed from xfs_bmap, doesn't show any flags set on it. Am I
>> > not looking at the right flags?

>> xfs_io -c stat <file>
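
For example (output trimmed; the exact field names and flag string vary a
little between xfsprogs versions, so treat this as approximate):

    xfs_io -c "stat" /mnt/test/past     # illustrative path from the sketch above
    #   ...
    #   stat.size = 263168
    #   stat.blocks = 1536
    #   fsxattr.xflags = 0x2 [-p----------]
    #   ...
    # the 0x2 bit (shown as 'p') is XFS_XFLAG_PREALLOC, the user-visible form
    # of the on-disk XFS_DIFLAG_PREALLOC flag that fallocate set on this inode
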

>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com

