* file allocation problem
@ 2009-07-16 11:31 Stephan Kulow
  2009-07-16 15:58 ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Stephan Kulow @ 2009-07-16 11:31 UTC (permalink / raw)
  To: linux-ext4

Hi,

I played around with ext4 online defrag on 2.6.31-rc3 and noticed a problem. 
The core is this:

# filefrag -v /usr/bin/gimp-2.6 
File size of /usr/bin/gimp-2.6 is 4677400 (1142 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0  2884963              29 
   1      29  2890819  2884991     29 
   2      58  2906960  2890847     62 
   3     120  2893864  2907021     29 
   4     149  2898531  2893892     29 
   5     178  2887012  2898559     28 
   6     206  2887261  2887039     27 
   7     233  2888229  2887287     27 
   8     260  2907727  2888255     49 
   9     309  2907811  2907775     90 
  10     399  2889078  2907900     26 
  11     425  2890641  2889103     26 
  12     451  2908065  2890666     31 
  13     482  2908136  2908095     33 
  14     515  2908170  2908168     54 
  15     569  2908257  2908223     31 
  16     600  2908378  2908287     38 
  17     638  2886399  2908415     25 
  18     663  2908646  2886423     26 
  19     689  2909129  2908671     56 
  20     745  2909186  2909184     62 
  21     807  2909281  2909247     31 
  22     838  2902503  2909311     25 
  23     863   103690  2902527    161 
  24    1024   109621   103850    118 eof
/usr/bin/gimp-2.6: 25 extents found
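
The contiguity check behind that listing can be sketched in a few lines of Python (a hypothetical helper, not part of filefrag or any tool in this thread): an extent is contiguous when it starts exactly where the previous one ends, and filefrag prints an "expected" value whenever that check fails.

```python
# Hypothetical helper: given (logical, physical, length) rows as printed by
# "filefrag -v", count extents and count the physical discontinuities,
# i.e. extents that do not start right after the previous one ends.
def summarize_extents(extents):
    breaks = 0
    for prev, cur in zip(extents, extents[1:]):
        next_physical = prev[1] + prev[2]  # where a contiguous extent would start
        if cur[1] != next_physical:
            breaks += 1
    return {"extents": len(extents), "breaks": breaks}

# First three rows of the gimp-2.6 listing above: every extent is discontiguous.
rows = [(0, 2884963, 29), (29, 2890819, 29), (58, 2906960, 62)]
print(summarize_extents(rows))  # {'extents': 3, 'breaks': 2}
```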

ext4 defragmentation for /usr/bin/gimp-2.6
[1/1]/usr/bin/gimp-2.6: 100%  extents: 25 -> 25 [ OK ]
 Success:                       [1/1]

(filefrag now shows very much the same output)

But now the really interesting part starts: when I copy that file
away (as far as I understand the code, e4defrag allocates
space in /usr/bin too), I get:

cp -a /usr/bin/gimp-2.6{,.defrag} (I have 50% free, so I expect it to find 
room):

filefrag -v /usr/bin/gimp-2.6.defrag
File size of /usr/bin/gimp-2.6.defrag is 4677400 (1142 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0   452952              40 
   1      40   439168   452991     32 
   2      72   442912   439199     32 
   3     104   448544   442943     32 
   4     136   449472   448575     32 
   5     168   453920   449503     32 
   6     200   429625   453951     31 
   7     231   430714   429655     31 
   8     262   435296   430744     31 
   9     293   454842   435326     31 
  10     324   436410   454872     29 
  11     353   426832   436438     28 
  12     381   453651   426859     27 
  13     408   447705   453677     25 
  14     433   436510   447729     23 
  15     456   442421   436532     23 
  16     479   451098   442443     23 
  17     502   447082   451120     22 
  18     524   451647   447103     22 
  19     546   437950   451668     21 
  20     567   439293   437970     21 
  21     588   454464   439313     21 
  22     609   455776   454484     21 
  23     630   454624   455796     20 
  24     650   450592   454643     18 
  25     668   451136   450609     18 
  26     686   452305   451153     18 
  27     704   427088   452322     16 
  28     720   427568   427103     16 
  29     736   427952   427583     16 
  30     752   427984   427967     16 
  31     768   650240   427999    256 
  32    1024   634851   650495     69 
  33    1093   633344   634919     49 eof
/usr/bin/gimp-2.6.defrag: 34 extents found

Now that I call fragmented! Calling e4defrag again gives me
34->28 and now it moved _parts_

..
  24     781   478136   480191     56 
  25     837   475850   478191     54 
  26     891  1836751   475903    133 
  27    1024  1875978  1836883    118 eof
/usr/bin/gimp-2.6.defrag: 28 extents found

This looks really strange to me. Is this a problem with my particular
file system, or a bug?

Greetings, Stephan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: file allocation problem
  2009-07-16 11:31 file allocation problem Stephan Kulow
@ 2009-07-16 15:58 ` Theodore Tso
  2009-07-16 17:43   ` Stephan Kulow
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-07-16 15:58 UTC (permalink / raw)
  To: Stephan Kulow; +Cc: linux-ext4

On Thu, Jul 16, 2009 at 01:31:17PM +0200, Stephan Kulow wrote:
> Hi,
> 
> I played around with ext4 online defrag on 2.6.31-rc3 and noticed a problem. 
> The core is this:

Was your filesystem originally an ext3 filesystem which was converted
over to ext4?  What features are currently enabled?  (Sending a copy of
the output of "dumpe2fs -h /dev/XXX" would be helpful.)

If it is the case that this was originally an ext3 filesystem,
e4defrag does have some definite limitations that will prevent it from
doing a great job in such a case.  I'm guessing that's what's going on
here.

> Now that I call fragmented! Calling e4defrag again gives me
> 34->28 and now it moved _parts_

I'm not sure what you mean by moving _parts_?

						- Ted


* Re: file allocation problem
  2009-07-16 15:58 ` Theodore Tso
@ 2009-07-16 17:43   ` Stephan Kulow
  2009-07-17  1:12     ` Theodore Tso
  0 siblings, 1 reply; 13+ messages in thread
From: Stephan Kulow @ 2009-07-16 17:43 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Thursday 16 July 2009 17:58:32 Theodore Tso wrote:
> On Thu, Jul 16, 2009 at 01:31:17PM +0200, Stephan Kulow wrote:
> > Hi,
> >
> > I played around with ext4 online defrag on 2.6.31-rc3 and noticed a
> > problem. The core is this:
>
> Was your filesystem originally an ext3 filesystme which was converted
> over to ext4?  What features are currently enabled (sending a copy of
Yes, it was converted quite some time ago.

> the output of "dumpe2fs -h /dev/XXX" would be helpful.)

Filesystem volume name:   <none>
Last mounted on:          /root
Filesystem UUID:          ec4454af-a8db-42ad-9627-19c9c17a0220
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype 
needs_recovery extent sparse_super large_file
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              853440
Block count:              3409788
Reserved block count:     170489
Free blocks:              1156411
Free inodes:              615319
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      832
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8128
Inode blocks per group:   508
Filesystem created:       Fri Dec 12 17:01:57 2008
Last mount time:          Thu Jul 16 19:30:26 2009
Last write time:          Thu Jul 16 19:30:26 2009
Mount count:              718
Maximum mount count:      -1
Last checked:             Thu Jan 29 15:01:57 2009
Check interval:           0 (<none>)
Lifetime writes:          5211 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       650850
Default directory hash:   half_md4
Directory Hash Seed:      a262693d-9659-4212-8e5b-5901140edff8
Journal backup:           inode blocks
Journal size:             128M

>
> If it is the case that this was originally an ext3 filesystem,
> e4defrag does have some definite limitations that will prevent it from
> doing a great job in such a case.  I'm guessing that's what's going on
> here.
My problem is not so much with what e4defrag does, but the fact that
a new file I create with cp(1) contains 34 extents.

>
> > Now that I call fragmented! Calling e4defrag again gives me
> > 34->28 and now it moved _parts_
>
> I'm not sure what you mean by moving _parts_?
It moved a couple of blocks from 6XXX to 10XXX and most extents stayed in the 
area where they were (I guess close to the rest of /usr/bin?)

Greetings, Stephan


* Re: file allocation problem
  2009-07-16 17:43   ` Stephan Kulow
@ 2009-07-17  1:12     ` Theodore Tso
  2009-07-17  4:32       ` Andreas Dilger
  2009-07-17  5:17       ` Stephan Kulow
  0 siblings, 2 replies; 13+ messages in thread
From: Theodore Tso @ 2009-07-17  1:12 UTC (permalink / raw)
  To: Stephan Kulow; +Cc: linux-ext4

On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > If it is the case that this was originally an ext3 filesystem,
> > e4defrag does have some definite limitations that will prevent it from
> > doing a great job in such a case.  I'm guessing that's what's going on
> > here.
> My problem is not so much with what e4defrag does, but the fact that
> a new file I create with cp(1) contains 34 extents.

Well, because your filesystem is still fragmented; you asked e4defrag
to defragment a single file.  In fact, it wasn't able to do much --
the file previously had 25 extents, and the new file had 25 extents.
E4defrag is quite new, and still needs a lot of polishing; I'm not
sure it should have tried to swap files when the newly allocated file
has the same number of extents.  This might be a case of changing a
">=" to ">" in code.

The reason why "cp" still created a file with 34 extents is because
the free space was still fragmented.  As I said, e4defrag is quite
primitive; it doesn't know how to defrag free space; it simply tries
to reduce the number of extents for each file, on a file-by-file
basis.

The other problem is that an ext3 filesystem that has been converted
to ext4 does not have the flex_bg feature.  This is a feature that,
when set when the file system is formatted, combines several block
groups into a bigger allocation group, a flex_bg.  This helps avoid
fragmentation, especially for directories like /usr/bin which
typically hold more than 128 megs (a single block group) worth of
files.

Using an ext3 filesystem format, the filesystem driver will first try
to find space in the home block group of the directory, and if there
is no space there, it will look in other block groups.  With a freshly
formatted ext4 filesystem, the allocation group is the flex_bg, which
is much larger, and which gives us a better opportunity for allocating
contiguous blocks.

I suspect we could do better with our allocator in this case; maybe we
should use a flex_bg to give the block group allocator a bigger set of
block groups to search.  The inode tables will still not be optimally
laid out for flex_bg, but we might still be better off.  Or, if the
block group is terribly fragmented, maybe we should have the allocator
find some other bg, even if it isn't the ideal block group close to
the directory.  According to the dumpe2fs output, the filesystem is
only 66% or so full, so there are probably some completely unused
block groups we should be using instead.  One of the things that we
have _not_ had time to do is optimize the block allocator for heavily
fragmented filesystems, especially for fragmented filesystems that had
been converted from ext3.

In any case, I don't think anything went _wrong_ per se; it's just
that both e4defrag and our block allocator are insufficiently smart to
improve things for you given your current filesystem.  A backup,
reformat, and restore will result in a filesystem that works far
better.

Out of curiosity, what sort of workload had the file system received?
It looks like the filesystem hadn't been created that long ago, so
it's a bit surprising it was so fragmented.  Were you perhaps updating
your system (by doing a yum update or apt-get update) very frequently?

						- Ted


* Re: file allocation problem
  2009-07-17  1:12     ` Theodore Tso
@ 2009-07-17  4:32       ` Andreas Dilger
  2009-07-17  5:31         ` Stephan Kulow
  2009-07-17  5:17       ` Stephan Kulow
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2009-07-17  4:32 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Stephan Kulow, linux-ext4

On Jul 16, 2009  21:12 -0400, Theodore Ts'o wrote:
> On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > My problem is not so much with what e4defrag does, but the fact that
> > a new file I create with cp(1) contains 34 extents.
> 
> The other problem is that an ext3 filesystem that has been converted
> to ext4 does not have the flex_bg feature.  This is a feature that,
> when set at when the file system is formatted, creates a higher order
> flex_bg which combines several block groups into a bigger allocation
> group, a flex_bg.  This helps avoid fragmentation, especially for
> directories like /usr/bin which typically have more than 128 megs (a
> single block group) worth of files in it.

It seems quite odd to me that mballoc didn't find enough contiguous
free space for this relatively small file.  It might be worthwhile
to look at (though not necessarily post) the output from the file
/sys/fs/ext4/{dev}/mb_groups (or "dumpe2fs" has equivalent data)
and see if there are groups with a lot of contiguous free space.
In the mb_groups file this would be numbers in the 2^{high} column. 

I don't agree that flex_bg is necessary to have good block allocation,
since we do get about 125MB per group.  Maybe mballoc is being
constrained to look at too few block groups in this case?  Looking at
/sys/fs/ext4/{dev}/mb_history under the "groups" column will tell how
many groups were scanned to find that allocation, and the "original"
and "result" will show group/grpblock/count@logblock for recent writes.

$ dd if=/dev/zero of=/myth/tmp/foo bs=1M count=1

pid   inode    original             goal                  result
4423  110359   3448/14336/256@0     1646/18944/256@0      1646/19456/256@0
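
Those tuples can be pulled apart with a tiny parser (hypothetical code, assuming the group/grpblock/count@logblock format described above):

```python
# Hypothetical parser for one mb_history allocation descriptor such as
# "1646/19456/256@0", i.e. group/start-block-in-group/block-count@logical-block.
def parse_alloc(s):
    body, logical = s.split("@")
    group, start, count = (int(x) for x in body.split("/"))
    return {"group": group, "start": start, "count": count,
            "logical": int(logical)}

print(parse_alloc("1646/19456/256@0"))
```

Comparing the parsed "goal" and "result" columns for a write then shows directly whether mballoc had to settle for a different group or a shorter run than it asked for.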

You might also try to create a new temp directory elsewhere on the
filesystem, copy the file over to the temp directory, and then see
if it is less fragmented in the new directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: file allocation problem
  2009-07-17  1:12     ` Theodore Tso
  2009-07-17  4:32       ` Andreas Dilger
@ 2009-07-17  5:17       ` Stephan Kulow
  2009-07-17 14:26         ` Theodore Tso
  1 sibling, 1 reply; 13+ messages in thread
From: Stephan Kulow @ 2009-07-17  5:17 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Friday 17 July 2009 03:12:19 you wrote:
> On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > > If it is the case that this was originally an ext3 filesystem,
> > > e4defrag does have some definite limitations that will prevent it from
> > > doing a great job in such a case.  I'm guessing that's what's going on
> > > here.
> >
> > My problem is not so much with what e4defrag does, but the fact that
> > a new file I create with cp(1) contains 34 extents.
>
Hi,
>
> The reason why "cp" still created a file with 34 extents is because
> the free space was still fragmented.  As I said, e4defrag is quite
> primitive; it doesn't know how to defrag free space; it simply tries
> to reduce the number of extents for each file, on a file-by-file
> basis.
Well, is there a tool to check the overall state of the file system? I can't 
really believe it's 1010101010, but it's hard to say without a picture :)
>
> The other problem is that an ext3 filesystem that has been converted
> to ext4 does not have the flex_bg feature.  This is a feature that,
> when set at when the file system is formatted, creates a higher order
> flex_bg which combines several block groups into a bigger allocation
> group, a flex_bg.  This helps avoid fragmentation, especially for
> directories like /usr/bin which typically have more than 128 megs (a
> single block group) worth of files in it.

Oh, I enabled flex_bg after you asked and rebooted to get an e2fsck
run, and I still get 34 extents for my gimp-2.6.defrag.  From what I
understand, this doesn't help after the fact, but then again, how am I
supposed to fix my file system if even new files are created fragmented?

> In any case, I don't anything went _wrong_ per se, just that both
> e4defrag and our block allocator are insufficiently smart to help
> improve things for you given your current filesystem.  A backup,
> reformat, and restore will result in a filesystem that works far
> better.
I believe that, but my hope for online defrag was not having to rely on this 
'80s-style defrag method :)
>
> Out of curiosity, what sort of workload had the file system received?
> It looks like the filesystem hadn't been created that long ago, so
> it's bit surprising it was so fragmented.  Were you perhaps updating
> your system (by doing a yum update or apt-get update) very frequently,
> perhaps?
Yes, that's what I'm doing. I'm updating about every file in this file system 
every second day by means of rpm packages (openSUSE calls it Factory; you will
know it as Rawhide).

Greetings, Stephan


* Re: file allocation problem
  2009-07-17  4:32       ` Andreas Dilger
@ 2009-07-17  5:31         ` Stephan Kulow
  0 siblings, 0 replies; 13+ messages in thread
From: Stephan Kulow @ 2009-07-17  5:31 UTC (permalink / raw)
  To: Andreas Dilger, linux-ext4

On Friday 17 July 2009 06:32:42 Andreas Dilger wrote:

Hi,

> It seems quite odd to me that mballoc didn't find enough contiguous
> free space for this relatively small file.  It might be worthwhile
> to look at (though not necessarily post) the output from the file
> /sys/fs/ext4/{dev}/mb_groups (or "dumpe2fs" has equivalent data)
> and see if there are groups with a lot of contiguous free space.
> In the mb_groups file this would be numbers in the 2^{high} column.

I'm not sure what counts as "a lot", so I pasted the full file (which 
happens to live under /proc/fs here): http://ktown.kde.org/~coolo/sda6
>
> I don't agree that flex_bg is necessary to have good block allocation,
> since we do get about 125MB per group.  Maybe mballoc is being
> constrained to look at too few block groups in this case?  Looking at
> /sys/fs/ext4/{dev}/mb_history under the "groups" column will tell how
> many groups were scanned to find that allocation, and the "original"
> and "result" will show group/grpblock/count@logblock for recent writes.
>
> $ dd if=/dev/zero of=/myth/tmp/foo bs=1M count=1
>
> pid   inode    original             goal                  result
> 4423  110359   3448/14336/256@0     1646/18944/256@0      1646/19456/256@0
>
> You might also try to create a new temp directory elsewhere on the
> filesystem, copy the file over to the temp directory, and then see
> if it is less fragmented in the new directory.
>
cp /usr/bin/gimp-2.6{,.defrag}:

31548 106916   13/0/1142@0             13/0/1024@0             13/24152/59@0           
201   1     2  1056        0     0     
31548 106916   13/24211/1083@59        13/24211/965@59         13/26192/41@59          
201   1     2  1568        0     0     
31548 106916   13/26233/1042@100       13/26233/924@100        13/21777/34@100         
201   1     2  1568        0     0     
31548 106916   13/21811/1008@134       13/21811/890@134        13/6688/32@134          
201   1     2  1568        0     0     
31548 106916   13/6720/976@166         13/6720/858@166         13/10944/32@166         
201   1     2  1568        0     0     
31548 106916   13/6720/1@0             13/6720/1@0             13/513/1@0              
1     1     1  1024        0     0     
31548 106916   13/10976/944@198        13/10976/826@198        13/16896/32@198         
201   1     2  1568        0     0     
31548 106916   13/16928/912@230        13/16928/794@230        13/12564/31@230         
201   1     2  1568        0     0     
31548 106916   13/12595/881@261        13/12595/763@261        13/12724/31@261         
201   1     2  1568        0     0     
31548 106916   13/12755/850@292        13/12755/732@292        13/31700/31@292         
201   1     2  1568        0     0     
31548 106916   13/31731/819@323        13/31731/701@323        13/18103/30@323         
201   1     2  1568        0     0     
31548 106916   13/18133/789@353        13/18133/671@353        13/21691/30@353         
201   1     2  1568        0     0     
31548 106916   13/21721/759@383        13/21721/641@383        13/25881/30@383         
201   1     2  1568        0     0     
31548 106916   13/25911/729@413        13/25911/611@413        13/22196/29@413         
201   1     2  1568        0     0     
31548 106916   13/22225/700@442        13/22225/582@442        13/31380/29@442         
201   1     2  1568        0     0     
31548 106916   13/31409/671@471        13/31409/553@471        13/12954/27@471         
201   2     2  1568        0     0     
31548 106916   13/12981/644@498        13/12981/526@498        13/18176/27@498         
201   2     2  1568        0     0     
31548 106916   13/18203/617@525        13/18203/499@525        13/15161/26@525         
201   2     2  1568        0     0     
31548 106916   13/15187/591@551        13/15187/473@551        13/17625/26@551         
201   2     2  1568        0     0     
31548 106916   13/17651/565@577        13/17651/447@577        13/19936/26@577         
201   2     2  1568        0     0     
31548 106916   13/19962/539@603        13/19962/421@603        13/20247/26@603         
201   2     2  1568        0     0     
31548 106916   13/20273/513@629        13/20273/395@629        13/23515/26@629         
201   2     2  1568        0     0     
31548 106916   13/23541/487@655        13/23541/369@655        13/9949/25@655          
201   2     2  1568        0     0     
31548 106916   13/9974/462@680         13/9974/344@680         13/19832/25@680         
201   2     2  1568        0     0     
31548 106916   13/19857/437@705        13/19857/319@705        13/29244/25@705         
201   2     2  1568        0     0     
31548 106916   13/29269/412@730        13/29269/294@730        13/1344/24@730          
201   2     2  1568        0     0     
31548 106916   13/1368/388@754         13/1368/270@754         13/11776/23@754         
201   2     2  1568        0     0     
31548 106916   13/11799/365@777        13/11799/247@777        14/3104/26@777          
201   2     2  1568        0     0     
31548 106916   14/3130/339@803         14/3130/221@803         14/9984/50@803          
201   1     2  1568        0     0     
31548 106916   14/10034/289@853        14/10034/171@853        14/11264/46@853         
201   1     2  1568        0     0     
31548 106916   14/11310/243@899        14/11310/125@899        58/1024/125@899         
11    1     1  1568        125   128   
31548 106916   58/1149/118@1024        58/1149/1024@1024       
58/17408/445@1024       201   2     2  1568        0     0 

filefrag: 59 extents.

cp /usr/bin/gimp-2.6 /tmp/nd/

25449 650578   80/0/1@0                80/0/1@0                80/589/1@0              
4     1     1  0           0     0

Filesystem type is: ef53
File size of /tmp/nd/gimp-2.6 is 4677400 (1142 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0  2638592             588 
   1     588  2628896  2639179    436 
   2    1024  2637846  2629331    118 eof
/tmp/nd/gimp-2.6: 3 extents found

Greetings, Stephan


* Re: file allocation problem
  2009-07-17  5:17       ` Stephan Kulow
@ 2009-07-17 14:26         ` Theodore Tso
  2009-07-17 18:02           ` Stephan Kulow
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-07-17 14:26 UTC (permalink / raw)
  To: Stephan Kulow; +Cc: linux-ext4

On Fri, Jul 17, 2009 at 07:17:12AM +0200, Stephan Kulow wrote:
> Well, is there a tool to check the overall state of the file system? I can't 
> really believe it's 1010101010, but it's hard to say without a picture :)

Well, you can check the fragmentation of the free space by using
dumpe2fs and looking at the free blocks in each block group.
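
A rough way to summarize that dumpe2fs output in code (a sketch under the assumption that each group's free blocks are printed as comma-separated ranges like "12345-12360, 12400"; not a tool from this thread):

```python
# Sketch: turn one group's "Free blocks:" range list from dumpe2fs into a
# small fragmentation summary: how many free chunks there are, the largest
# contiguous run, and the total free blocks.
def free_space_summary(ranges_line):
    sizes = []
    for part in ranges_line.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            sizes.append(hi - lo + 1)   # inclusive range
        else:
            sizes.append(1)             # single free block
    return {"chunks": len(sizes), "largest": max(sizes), "free": sum(sizes)}

print(free_space_summary("12345-12360, 12400, 12402-12403"))
# {'chunks': 3, 'largest': 16, 'free': 19}
```

Many chunks with a small "largest" value in a group is exactly the situation where a 1142-block file cannot avoid being split into many extents.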

> > The other problem is that an ext3 filesystem that has been converted
> > to ext4 does not have the flex_bg feature.  This is a feature that,
> > when set at when the file system is formatted, creates a higher order
> > flex_bg which combines several block groups into a bigger allocation
> > group, a flex_bg.  This helps avoid fragmentation, especially for
> > directories like /usr/bin which typically have more than 128 megs (a
> > single block group) worth of files in it.
> 
> Oh, I enabled flex_bg after you asked, rebooted to get a e2fsck -
> and I still get 34 extents for my gimp-2.6.defrag. From what I
> understand, this doesn't help in the after fact, but then again how
> am I supposed to fix my file system if even new files are created
> fragmented.

Well, it's actually not enough to enable the flex_bg filesystem feature;
you also need to set the flex_bg size, like this:

debugfs -w /dev/XXX
debugfs: ssv log_groups_per_flex 4
debugfs: quit

(And no, this isn't something which we've done a lot of testing on.)

And this isn't necessarily going to help; if the 16 block groups
(2**4) making up the flex_bg for the /usr/bin directory are all badly
fragmented, then when you create new files in /usr/bin, they will still
be fragmented.

> > In any case, I don't anything went _wrong_ per se, just that both
> > e4defrag and our block allocator are insufficiently smart to help
> > improve things for you given your current filesystem.  A backup,
> > reformat, and restore will result in a filesystem that works far
> > better.
>
> I believe that, but my hope for online defrag was not having to rely on this 
> 80ties defrag method :)

Yeah, sorry, online defrag is a very new feature.  It will hopefully
get better, but it's a matter of resources.  Ultimately, though, the
problem is that the ext3 allocation algorithms are very different from
(and far more primitive than) the ext4 allocation algorithms.  So
undoing the ext3 allocation decisions is going to be non-trivial, and
even if we can eventually get e4defrag to the point where it can do
this on the whole filesystem, I suspect backup/reformat/restore will
almost always be faster.

> > Out of curiosity, what sort of workload had the file system received?
> > It looks like the filesystem hadn't been created that long ago, so
> > it's bit surprising it was so fragmented.  Were you perhaps updating
> > your system (by doing a yum update or apt-get update) very frequently,
> > perhaps?
>
> Yes, that's what I'm doing. I'm updating about every file in this
> file system every second day by means of rpm packages (openSUSE
> calls it factory, you will now it as rawhide).

Unfortunately, constantly updating every single file on a daily basis
is a very effective way of seriously aging a filesystem.  The ext4
allocator tries to keep files aligned on power of two boundaries,
which tends to help this a lot (although this means that dumpe2fs -h
will show a bunch of holes that makes the free space look more
fragmented than it really is), but the ext3 allocator doesn't have any
such smarts on it.

						- Ted


* Re: file allocation problem
  2009-07-17 14:26         ` Theodore Tso
@ 2009-07-17 18:02           ` Stephan Kulow
  2009-07-17 21:14             ` Andreas Dilger
  0 siblings, 1 reply; 13+ messages in thread
From: Stephan Kulow @ 2009-07-17 18:02 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Friday 17 July 2009 16:26:28 Theodore Tso wrote:
> And this isn't necessarily going to help; if 16 block groups around
> (2**4) for the flex_bg for the /usr/bin directory are all badly
> fragmented, then when you create new files in /usr/bin, it will still
> be fragmented.
Yeah, but even the file in /tmp/nd got 3 extents.  My file is 1142 blocks,
and my mb_groups says 2**9 is the highest possible value.  So I guess I will
indeed create the file system from scratch to test the allocator for 
real.
>
> > > In any case, I don't anything went _wrong_ per se, just that both
> > > e4defrag and our block allocator are insufficiently smart to help
> > > improve things for you given your current filesystem.  A backup,
> > > reformat, and restore will result in a filesystem that works far
> > > better.
> >
> > I believe that, but my hope for online defrag was not having to rely on
> > this 80ties defrag method :)
>
> Yeah, sorry, online defrag is a very new feature.  It will hopefully
> get better, but it's matter of resources.  Ultimately, though, the
> problem is that the ext3 allocation algorithms are very different (and
> far more primitive) than the ext4 allocation algorithms.  So undoing
> the ext3 allocation algorithm decisions is going to be non-trivial,
> and even if we can eventually get e4defrag to the point where it can
> do this on the whole filesystem, I suspect backup/reformat/restore
> will almost always be faster.
I don't have any experience in that field, but would it be possible
to allocate a big file that grabs all the free blocks, and then move
the extents of one group into it, basically freeing all blocks of that
group so it can be used purely by the ext4 allocator?  Or even go as far
as packing the blocks of every group?  As far as I can see, there is no
way with the current ioctl interface to achieve that once your file
system is fragmented enough, because the allocator will always create
new files fragmented, and the ioctl can only move extents from one
fragmented file to another.

And yes, backup/restore might be faster, but it's also a far more 
disruptive action than leaving defrag running overnight.
 
>
> > > Out of curiosity, what sort of workload had the file system received?
> > > It looks like the filesystem hadn't been created that long ago, so
> > > it's bit surprising it was so fragmented.  Were you perhaps updating
> > > your system (by doing a yum update or apt-get update) very frequently,
> > > perhaps?
> >
> > Yes, that's what I'm doing. I'm updating about every file in this
> > file system every second day by means of rpm packages (openSUSE
> > calls it factory, you will now it as rawhide).
>
> Unfortunately, constantly updating every single file on a daily basis
> is a very effective way of seriously aging a filesystem.  The ext4
Of course it is, guess why I'm so interested in having it :)

> allocator tries to keep files aligned on power of two boundaries,
> which tends to help this a lot (although this means that dumpe2fs -h
> will show a bunch of holes that makes the free space look more
> fragmented than it really is), but the ext3 allocator doesn't have any
> such smarts on it.
But there is nothing packing the blocks as the groups fill up, so these
holes will always cause fragmentation once the file system gets full, right?

So I guess online defragmentation first needs to pretend to do an online 
resize so it can use the gained free space. Now I have something to test.. :)

Greetings, Stephan


* Re: file allocation problem
  2009-07-17 18:02           ` Stephan Kulow
@ 2009-07-17 21:14             ` Andreas Dilger
  2009-07-18 21:16               ` Stephan Kulow
  2009-07-19 22:45               ` Ron Johnson
  0 siblings, 2 replies; 13+ messages in thread
From: Andreas Dilger @ 2009-07-17 21:14 UTC (permalink / raw)
  To: Stephan Kulow; +Cc: Theodore Tso, linux-ext4

On Jul 17, 2009  20:02 +0200, Stephan Kulow wrote:
> On Friday 17 July 2009 16:26:28 Theodore Tso wrote:
> > And this isn't necessarily going to help; if 16 block groups around
> > (2**4) for the flex_bg for the /usr/bin directory are all badly
> > fragmented, then when you create new files in /usr/bin, it will still
> > be fragmented.
>
> Yeah, but even the file in /tmp/nd got 3 extents. My file is 1142 blocks
> and my mb_groups says 2**9 is the highest possible value. So I guess I will
> indeed try to create the file system from scratch to test the allocator for 
> real.

The defrag code needs to become smarter, so that it finds small files 
in the middle of free space and migrates them to fit into a small gap.
That will allow larger files to be defragged once there are large chunks
of free space.

> > allocator tries to keep files aligned on power of two boundaries,
> > which tends to help this a lot (although this means that dumpe2fs -h
> > will show a bunch of holes that makes the free space look more
> > fragmented than it really is), but the ext3 allocator doesn't have any
> > such smarts on it.
> But there is nothing packing the blocks if the groups get full, so these
> holes will always cause fragmentation once the file system gets full, right?
 

Well, this isn't quite correct.  The mballoc code only tries to allocate
"large" files on power-of-two boundaries, where large is 64kB by default,
but is tunable in /proc.  For smaller files it tries to pack them together
into the same block group, or into gaps that are exactly the size of the file.
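The two-regime policy described above can be sketched roughly as follows. The threshold constant and the round-up-to-power-of-two rule are simplifications for illustration; the real mballoc code involves goal groups, buddy bitmaps, and preallocation, and the actual tunable is expressed in filesystem blocks.

```python
# Hedged sketch of the allocation policy described above, NOT the real
# mballoc algorithm: small requests keep their exact size (so they can
# best-fit into matching gaps), large requests round up to a power of two
# for alignment. Threshold value is the 64kB default mentioned in the text.
STREAM_THRESHOLD = 64 * 1024  # bytes; "large file" cutoff (tunable in ext4)

def normalize_request(size: int) -> int:
    """Return the allocation goal for a request of `size` bytes."""
    if size < STREAM_THRESHOLD:
        return size        # small file: exact-fit allocation
    goal = 1
    while goal < size:     # round up to the next power of two
        goal <<= 1
    return goal
```

So a 4kB file is placed exactly, while a 70kB request would be normalized to a 128kB-aligned goal, which is what leaves the power-of-two "holes" dumpe2fs shows.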

> So I guess online defragmentation first needs to pretend to do an online 
> resize so it can use the gained free space. Now I have something to test... :)

Yes, that would give you some good free space at the end of the filesystem.
Then find the largest files in the filesystem, migrate them there, then
defrag the smaller files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: file allocation problem
  2009-07-17 21:14             ` Andreas Dilger
@ 2009-07-18 21:16               ` Stephan Kulow
  2009-07-19 22:45               ` Ron Johnson
  1 sibling, 0 replies; 13+ messages in thread
From: Stephan Kulow @ 2009-07-18 21:16 UTC (permalink / raw)
  To: Andreas Dilger, linux-ext4

On Friday 17 July 2009 23:14:44 Andreas Dilger wrote:
> > Yeah, but even the file in /tmp/nd got 3 extents. my file is 1142 blocks
> > and my mb_groups says 2**9 is the highest possible value. So I guess I
> > will indeed try to create the file system from scratch to test the
> > allocator for real.
>
> The defrag code needs to become smarter, so that it finds small files
> in the middle of freespace and migrates those to fit into a small gap.
> That will allow larger files to be defragged once there is large chunks
> of free space.

Is there a way that user space can hint the allocator to fill these gaps? I 
don't see any obvious way. Relying on the allocator not to make matters worse
might be enough, but it doesn't sound ideal. Unless something urgent comes up,
I might actually continue experimenting next week :)

My resize2fs defrag actually worked pretty well, but then again I did it on an 
offline copy, and that approach won't work online.

Greetings, Stephan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: file allocation problem
  2009-07-17 21:14             ` Andreas Dilger
  2009-07-18 21:16               ` Stephan Kulow
@ 2009-07-19 22:45               ` Ron Johnson
  2009-07-20 21:18                 ` Andreas Dilger
  1 sibling, 1 reply; 13+ messages in thread
From: Ron Johnson @ 2009-07-19 22:45 UTC (permalink / raw)
  To: linux-ext4

On 2009-07-17 16:14, Andreas Dilger wrote:
[snip]
> 
> Well, this isn't quite correct.  The mballoc code only tries to allocate
> "large" files on power-of-two boundaries, where large is 64kB by default,
> but is tunable in /proc.  For smaller files it tries to pack them together
> into the same block group, or into gaps that are exactly the size of the file.

How does ext4 behave with growing files?  I.e., creating a tarball that 
obviously starts at 0 bytes and then grows to multiple GB?

-- 
Scooty Puff, Sr
The Doom-Bringer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: file allocation problem
  2009-07-19 22:45               ` Ron Johnson
@ 2009-07-20 21:18                 ` Andreas Dilger
  0 siblings, 0 replies; 13+ messages in thread
From: Andreas Dilger @ 2009-07-20 21:18 UTC (permalink / raw)
  To: Ron Johnson; +Cc: linux-ext4

On Jul 19, 2009  17:45 -0500, Ron Johnson wrote:
> On 2009-07-17 16:14, Andreas Dilger wrote:
>> Well, this isn't quite correct.  The mballoc code only tries to allocate
>> "large" files on power-of-two boundaries, where large is 64kB by default,
>> but is tunable in /proc.  For smaller files it tries to pack them together
>> into the same block group, or into gaps that are exactly the size of the file.
>
> How does ext4 behave with growing files?  I.e., creating a tarball that  
> obviously starts at 0 bytes and then grows to multiple GB?

ext4 has "delayed allocation" (delalloc) so that no blocks are allocated
during initial file writes, but rather only when RAM is running short or
when the data has been sitting around for a while.

Normally, with _most_ applications, the write rate is high enough that
within the 5-30s writeback interval the file has grown large enough to
give the allocator an idea of whether it will be small or large.
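When the final size is known up front, an application can sidestep the guessing entirely by preallocating, which lets the allocator reserve one contiguous run before any data is written. A minimal sketch using the standard `posix_fallocate` wrapper; the path is purely illustrative.

```python
# Minimal sketch: preallocate a file of known final size so the allocator
# can reserve the blocks in one contiguous request instead of guessing as
# the file grows under delayed allocation. Path is an example, not fixed.
import os

def preallocate(path: str, size: int) -> int:
    """Create `path` and reserve `size` bytes; return the resulting size."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)   # reserve blocks now, in one go
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

Tools like cp do not do this in general, but archivers or copy tools that know the destination size could, and it is also what fallocate(2)-aware databases rely on.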

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-07-20 21:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-16 11:31 file allocation problem Stephan Kulow
2009-07-16 15:58 ` Theodore Tso
2009-07-16 17:43   ` Stephan Kulow
2009-07-17  1:12     ` Theodore Tso
2009-07-17  4:32       ` Andreas Dilger
2009-07-17  5:31         ` Stephan Kulow
2009-07-17  5:17       ` Stephan Kulow
2009-07-17 14:26         ` Theodore Tso
2009-07-17 18:02           ` Stephan Kulow
2009-07-17 21:14             ` Andreas Dilger
2009-07-18 21:16               ` Stephan Kulow
2009-07-19 22:45               ` Ron Johnson
2009-07-20 21:18                 ` Andreas Dilger
