linux-lvm.redhat.com archive mirror
* Re: [linux-lvm] Re: ext2resize
@ 1999-07-06  7:56 Lennert Buytenhek
  1999-07-08 14:34 ` tytso
  1999-07-08 22:56 ` Theodore Y. Ts'o
  0 siblings, 2 replies; 17+ messages in thread
From: Lennert Buytenhek @ 1999-07-06  7:56 UTC (permalink / raw)
  To: LVM on Linux, Linux FS development list

Andreas Dilger writes:
>>>Have the "reserved" blocks be the first few data blocks in each
>>>group, rather than all in a single group.  This way the blocks can be
>>>used by root when the FS is very full, or they can be used for expanding
>>>the filesystem.  The one problem is that you may not have your extra
>>>blocks when you need them (ie when the FS is full) if it is root that
>>>is filling the FS.  The plus is that you don't keep more spare blocks
>>>that normally nobody uses in the FS.
>
>>How do you tell the kernel that it should use those 'reserved' blocks
>>only when free space is low? compat/rocompat/incompat feature?
>
>Actually, ext2 (like most Unix FS) already reserves space for root (or
>other specified users).  The default is 5% of the FS, but can be set at
>FS creation time or via tune2fs (-m parameter).  If we have mke2fs
>allocate the 5% reserved blocks spread across all groups using the blocks
>right after the current GDT, then we could use these data blocks later
>on instead of moving user blocks out of the way.  However, these blocks
>are not guaranteed to be anywhere in particular (currently they all get
>allocated in one group), so doing "tune2fs -m 0; tune2fs -m 5" will
>probably move all of these blocks around.

ext2 doesn't specify _which_ blocks are reserved, so you can't just
say: "Hey, I want those blocks to be reserved." ext2 only keeps track
of the # of free blocks and detects when an allocation would eat into
the reserved space.
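
To make that concrete, a rough sketch of the check (simplified, not the
actual kernel code -- the point is that only counts are compared, never
block numbers):

    #include <stdio.h>

    struct counts {
        unsigned long free_blocks;   /* like s_free_blocks_count */
        unsigned long r_blocks;      /* like s_r_blocks_count */
    };

    /* A non-root allocation may not eat into the reserved count;
     * no specific blocks are marked, only totals are compared. */
    static int may_allocate(const struct counts *c, int is_root)
    {
        return is_root || c->free_blocks > c->r_blocks;
    }

    int main(void)
    {
        struct counts c = { .free_blocks = 100, .r_blocks = 512 };
        printf("user: %d  root: %d\n",
               may_allocate(&c, 0), may_allocate(&c, 1)); /* user: 0  root: 1 */
        return 0;
    }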

>>Why don't we say: "every ext2 fs must have 32 gdt blocks per block group"
>>then? Same idea.
>
>Being able to tune it makes people happy.  If the max is actually 512
>blocks we definitely don't want that.  Also, people may want to have a
>few GB expansion room in each FS, but not 2TB worth...  Since at 1kB
>blocks we get 256MB growth/block, and at 4kB blocks it is 16GB per block,
>we probably don't need to reserve more than 4 GDT blocks over what people
>already allocate in order to give them the normal expansion needs.  If
>they need more expansion at a later date, they can unmount the FS and
>do block shuffling offline.

I just got an email from John Finlay saying he's got a 52GB fs with
6000+ block groups. So the 1024 block group limit is just bogus, and
the header which #defined the max # of block groups to be 1024 is
simply wrong.
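
(A quick check of the arithmetic -- my own numbers, assuming 1kB blocks
and taking 52GB as 52*2^30 bytes:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long long fs_bytes  = 52ULL << 30;    /* 52GB */
        unsigned long block_size     = 1024;           /* 1kB blocks */
        unsigned long blks_per_group = 8 * block_size; /* one bitmap block */
        printf("%llu block groups\n",
               fs_bytes / block_size / blks_per_group); /* 6656: 6000+ indeed */
        return 0;
    }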

Lennert Buytenhek


* Re: [linux-lvm] Re: ext2resize
  1999-07-06  7:56 [linux-lvm] Re: ext2resize Lennert Buytenhek
@ 1999-07-08 14:34 ` tytso
  1999-07-08 22:56 ` Theodore Y. Ts'o
  1 sibling, 0 replies; 17+ messages in thread
From: tytso @ 1999-07-08 14:34 UTC (permalink / raw)
  To: buytenh; +Cc: linux-lvm, linux-fsdevel

   From: "Lennert Buytenhek" <buytenh@dsv.nl>
   Date:   Tue, 6 Jul 1999 09:56:07 +0200

   I just got an email from John Finlay saying he's got a 52GB fs with
   6000+ block groups. So the 1024 block group limit is just bogus, and
   the header which #defined the max # of block groups to be 1024 is
   simply wrong.

I'm curious *where* people thought a header defined the max # of block
groups to be 1024?  Yes, that limit is completely bogus.

   From: "Lennert Buytenhek" <buytenh@dsv.nl>
   Date:   Tue, 6 Jul 1999 10:56:06 +0200

   Moving the inode/block bitmap blocks is not that difficult when
   unmounted, but moving the inode table is tricky to get right (w.r.t.
   atomicity). A while ago one ext2resize user got a segfault while the
   inode table was being moved (yes, that bug is fixed now), and we're
   still cleaning up the mess.

The trick here is to try *very* hard not to have to do an overlapping
move of the inode table (which you may be forced to do if the filesystem
is too full, but in most cases it can be avoided).  Then do a copy of
that portion of the inode table, and only update the GDT *after* the
inode table has been safely moved.  

You may have to move some extra data blocks belonging to files to do
this, though, so it's much easier to do this safely during an offline
resize. 
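
A rough sketch of that ordering (the helpers and the descriptor offset
here are made up for illustration -- this is not ext2resize code):

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Write the new inode table location into this group's descriptor;
     * 'gd_field_off' is wherever that 32-bit field lives on disk. */
    static int repoint_gdt(int fd, off_t gd_field_off, uint32_t new_blk)
    {
        return pwrite(fd, &new_blk, sizeof new_blk, gd_field_off)
               == (ssize_t)sizeof new_blk ? 0 : -1;
    }

    int move_inode_table(int fd, uint32_t blksize, off_t gd_field_off,
                         uint32_t old_blk, uint32_t new_blk, uint32_t nblocks)
    {
        char buf[4096];
        if (blksize > sizeof buf)
            return -1;
        /* 1. copy the table to a NON-overlapping destination */
        for (uint32_t i = 0; i < nblocks; i++) {
            if (pread(fd, buf, blksize, (off_t)(old_blk + i) * blksize)
                    != (ssize_t)blksize ||
                pwrite(fd, buf, blksize, (off_t)(new_blk + i) * blksize)
                    != (ssize_t)blksize)
                return -1;
        }
        /* 2. make sure the copy is on disk before anything points at it */
        if (fsync(fd) != 0)
            return -1;
        /* 3. only now repoint the GDT; a crash any earlier leaves the
         *    old, still-referenced table intact */
        return repoint_gdt(fd, gd_field_off, new_blk);
    }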

						- Ted


* Re: [linux-lvm] Re: ext2resize
  1999-07-06  7:56 [linux-lvm] Re: ext2resize Lennert Buytenhek
  1999-07-08 14:34 ` tytso
@ 1999-07-08 22:56 ` Theodore Y. Ts'o
  1 sibling, 0 replies; 17+ messages in thread
From: Theodore Y. Ts'o @ 1999-07-08 22:56 UTC (permalink / raw)
  To: buytenh; +Cc: linux-lvm, linux-fsdevel

   From: "Lennert Buytenhek" <buytenh@dsv.nl>
   Date:   Tue, 6 Jul 1999 09:56:07 +0200

   I just got an email from John Finlay saying he's got a 52GB fs with
   6000+ block groups. So the 1024 block group limit is just bogus, and
   the header which #defined the max # of block groups to be 1024 is
   simply wrong.

I'm curious *where* people thought a header defined the max # of block
groups to be 1024?  Yes, that limit is completely bogus.

   From: "Lennert Buytenhek" <buytenh@dsv.nl>
   Date:   Tue, 6 Jul 1999 10:56:06 +0200

   Moving the inode/block bitmap blocks is not that difficult when
   unmounted, but moving the inode table is tricky to get right (w.r.t.
   atomicity). A while ago one ext2resize user got a segfault while the
   inode table was being moved (yes, that bug is fixed now), and we're
   still cleaning up the mess.

The trick here is to try *very* hard not to have to do an overlapping
move of the inode table (which you may be forced to do if the filesystem
is too full, but in most cases it can be avoided).  Then do a copy of
that portion of the inode table, and only update the GDT *after* the
inode table has been safely moved.  

You may have to move some extra data blocks belonging to files to do
this, though, so it's much easier to do this safely during an offline
resize. 

						- Ted


* Re: [linux-lvm] Re: ext2resize
  1999-07-05 19:47 Andreas Dilger
@ 1999-07-07  5:27 ` John Finlay
  0 siblings, 0 replies; 17+ messages in thread
From: John Finlay @ 1999-07-07  5:27 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: LVM on Linux

Andreas Dilger wrote:

> John Finlay <finlay@moeraki.com> writes:
> > It seems that ext2 is not really suited for large filesystems: it seems like
> > there is too much redundancy in the block groups, which causes slowdowns in
> > operations like mount, etc.; e2fsck takes hours on a 52GB filesystem.
>
> Actually, the new "sparse superblock" version of ext2 available for Linux 2.2
> kernels removes much of the redundancy of superblocks/group descriptor blocks.
> Copies are only stored in group 0 and in groups which are powers of 3, 5, or
> 7.  The real issue with large filesystems isn't the redundancy, which mostly
> costs wasted space and a slowdown when unmounting, but rather that fsck
> has to verify the entire FS structure at mount time.  The preferred method

It's been my observation that there is an awful lot of disk activity at mount
time on a clean large file system (no fsck). It appears that every superblock is
being written in turn - maybe to set the dirty flag? It's particularly noticeable
if the filesystem uses a 1k blocksize; mounts are much faster when the
filesystem uses a 4k blocksize.

>
> is to have a transaction log/journal which keeps track of outstanding metadata
> changes in progress.  When you get a failure, then you only need to replay
> the log to see what parts of the FS were being modified at the time, and
> then only those areas need to be verified at fsck time.
>

Logging metadata does speed up lots of operations, including filesystem recovery
after an unclean shutdown, by avoiding fsck. This does seem like the only way to
meet the needs of large filesystem users.

>
> > Are there any projects underway to develop a new filesystem that is more
> > suitable for large filesystems?
>
> There are several log FS/JFS projects underway right now for Linux.

Are any of them nearing first release? Where's the best place to find pointers?

>
> Even SGI will release the source (or so I've read) to their IRIX
> filesystem, which is journalled, so this may be added to the mix soon.
>

I'm not hopeful that the XFS source will be released soon - it seems more like
a PR gesture.

John


* Re: [linux-lvm] Re: ext2resize
@ 1999-07-06  8:56 Lennert Buytenhek
  0 siblings, 0 replies; 17+ messages in thread
From: Lennert Buytenhek @ 1999-07-06  8:56 UTC (permalink / raw)
  To: linux-lvm, linux-fsdevel

Stephen Tweedie said:
>Hi,
>
>On Fri, 2 Jul 1999 10:58:27 -0600 (MDT), Andreas Dilger
><adilger@enel.ucalgary.ca> said:
>
>> I agree that reserving GDT blocks would be a small hack for v0 of ext2,
>> but we could add a "COMPATIBLE" extension to v1 of ext2 that gave the
>> number of GDT blocks reserved.  Also, since the v1 sparse superblock
>> filesystems leave a gap between the structures, there should not be a
>> real problem.  We just have to make sure that we agree (for example)
>> that all the blocks from the start of the group to the block bitmap are
>> reserved for the GDT.  This is not a big stretch, since this is already
>> how v0 ext2 filesystems are.
>
>If you're going to be moving user data around anyway (and you have to be
>able to do this for the shrink case), then relocating the
>inode/bitmap blocks is really not that much harder.

There's quite a difference between moving a non-metadata block and
a metadata block. Non-metadata blocks are easily moved; they have
at most one reference somewhere. Moving metadata blocks is a bit
harder, especially when the fs is mounted.
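
For example, a toy in-memory model of a data block move (illustration
only, not ext2resize code -- the point is the single pointer update):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { NBLOCKS = 16, BLKSIZE = 32 };

    static char    disk[NBLOCKS][BLKSIZE]; /* toy "device" */
    static uint8_t in_use[NBLOCKS];        /* toy block bitmap */

    /* '*ref' is the one and only reference (an inode or indirect
     * pointer), so the move is copy, one pointer update, free. */
    static void move_data_block(uint32_t *ref, uint32_t new_blk)
    {
        uint32_t old = *ref;
        memcpy(disk[new_blk], disk[old], BLKSIZE); /* 1. copy contents     */
        in_use[new_blk] = 1;
        *ref = new_blk;                  /* 2. the single pointer update */
        in_use[old] = 0;                 /* 3. free the old block        */
    }

    int main(void)
    {
        uint32_t ptr = 3;                /* the lone reference */
        in_use[3] = 1;
        strcpy(disk[3], "user data");
        move_data_block(&ptr, 9);
        printf("block %u: %s\n", (unsigned)ptr, disk[ptr]); /* block 9 */
        return 0;
    }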

Moving the inode/block bitmap blocks is not that difficult when
unmounted, but moving the inode table is tricky to get right (w.r.t.
atomicity). A while ago one ext2resize user got a segfault while the
inode table was being moved (yes, that bug is fixed now), and we're
still cleaning up the mess.

Lennert


* Re: [linux-lvm] Re: ext2resize
@ 1999-07-06  8:47 Lennert Buytenhek
  0 siblings, 0 replies; 17+ messages in thread
From: Lennert Buytenhek @ 1999-07-06  8:47 UTC (permalink / raw)
  To: linux-lvm

>John Finlay wrote:
>> Lennert Buytenhek wrote:
>> Yes, for a 1kb block size. One of my ext2 linux kernel headers #defines
>> the max number of groups to be 1024. Times 32 bytes per group
>> descriptor, that is 32kb, which is 32 blocks at a 1kb block size. Unless
>> the header is wrong, of course.
>
>I don't think this is correct. I have an ext2 filesystem that is 52GB. It
>appears to have 200+ blocks in the GDT and as I recall 6000+ block groups.
>I made an 86GB filesystem the other day with 1k blocks and it had 10000+
>block groups. With 4k blocks there were 650+ block groups.

ext2resize will probably not handle this (it assumes the max. is
1024). I'll fix this in a few days.

Lennert


* [linux-lvm] Re: ext2resize
@ 1999-07-05 20:49 Andreas Dilger
  0 siblings, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 1999-07-05 20:49 UTC (permalink / raw)
  To: LVM on Linux, Linux FS development list

Lennert Buytenhek writes:
>I tried to make an 8194-block filesystem with mke2fs once, and it told
>me it was going to make the fs 8193 blocks. Can you give me a recipe
>for making an fs which has a last group w/o sb/gdt?

I think this was a mistake on my part.  In alloc_tables.c it looked
like it was possible to have the last group without sb/gdt, but it
turns out that this only sets the range where blocks are allocated.
In initialize.c, where the "overhead" is calculated, it shows that if
the last group has fewer than 50 free data blocks it will be discarded.

>>Have the "reserved" blocks be the first few data blocks in each
>>group, rather than all in a single group.  This way the blocks can be
>>used by root when the FS is very full, or they can be used for expanding
>>the filesystem.  The one problem is that you may not have your extra
>>blocks when you need them (ie when the FS is full) if it is root that
>>is filling the FS.  The plus is that you don't keep more spare blocks
>>that normally nobody uses in the FS.

>How do you tell the kernel that it should use those 'reserved' blocks
>only when free space is low? compat/rocompat/incompat feature?

Actually, ext2 (like most Unix FS) already reserves space for root (or
other specified users).  The default is 5% of the FS, but can be set at
FS creation time or via tune2fs (-m parameter).  If we have mke2fs
allocate the 5% reserved blocks spread across all groups using the blocks
right after the current GDT, then we could use these data blocks later
on instead of moving user blocks out of the way.  However, these blocks
are not guaranteed to be anywhere in particular (currently they all get
allocated in one group), so doing "tune2fs -m 0; tune2fs -m 5" will
probably move all of these blocks around.

For this reason it is probably safer to use a different inode to make
sure the spare GDT blocks are where we want them to be and aren't in use.
However, if we have a good offline block moving utility, this may be
enough to allow online resizing of "most" filesystems, especially those
that the admin hasn't been messing with.  If they are desperate, they can
unmount and shuffle blocks, or online expand to fill the rest of the last
GDT block (average growth limit 128MB for 1kB blocks, 8GB for 4kB blocks)
even without adding ANY blocks to the GDT.

>Why don't we say: "every ext2 fs must have 32 gdt blocks per block group"
>then? Same idea.

Being able to tune it makes people happy.  If the max is actually 512
blocks we definitely don't want that.  Also, people may want to have a
few GB expansion room in each FS, but not 2TB worth...  Since at 1kB
blocks we get 256MB growth/block, and at 4kB blocks it is 16GB per block,
we probably don't need to reserve more than 4 GDT blocks over what people
already allocate in order to give them the normal expansion needs.  If
they need more expansion at a later date, they can unmount the FS and
do block shuffling offline.
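
(For concreteness, my own arithmetic behind those two figures -- each GDT
block holds blocksize/32 descriptors and each group spans 8*blocksize
blocks:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long sizes[] = { 1024, 4096 };
        for (int i = 0; i < 2; i++) {
            unsigned long long bs = sizes[i];
            unsigned long long gd_per_blk  = bs / 32; /* 32-byte descriptors */
            unsigned long long blk_per_grp = 8 * bs;  /* one block bitmap */
            printf("%llukB blocks: %lluMB growth per reserved GDT block\n",
                   bs / 1024, gd_per_blk * blk_per_grp * bs >> 20);
        }
        return 0;                        /* prints 256MB and 16384MB (16GB) */
    }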

Cheers, Andreas
-- 
Andreas Dilger   University of Calgary  \"If a man ate a pound of pasta and
                 Micronet Research Group \ a pound of antipasto, would they
Dept of Electrical & Computer Engineering \   cancel out, leaving him still
http://www-mddsp.enel.ucalgary.ca/People/adilger/       hungry?" -- Dogbert


* Re: [linux-lvm] Re: ext2resize
@ 1999-07-05 19:47 Andreas Dilger
  1999-07-07  5:27 ` John Finlay
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Dilger @ 1999-07-05 19:47 UTC (permalink / raw)
  To: LVM on Linux

John Finlay <finlay@moeraki.com> writes:
> It seems that ext2 is not really suited for large filesystems: it seems like
> there is too much redundancy in the block groups, which causes slowdowns in
> operations like mount, etc.; e2fsck takes hours on a 52GB filesystem.

Actually, the new "sparse superblock" version of ext2 available for Linux 2.2
kernels removes much of the redundancy of superblocks/group descriptor blocks.
Copies are only stored in group 0 and in groups which are powers of 3, 5, or
7.  The real issue with large filesystems isn't the redundancy, which mostly
costs wasted space and a slowdown when unmounting, but rather that fsck
has to verify the entire FS structure at mount time.  The preferred method
is to have a transaction log/journal which keeps track of outstanding metadata
changes in progress.  When you get a failure, then you only need to replay
the log to see what parts of the FS were being modified at the time, and
then only those areas need to be verified at fsck time.

> Are there any projects underway to develop a new filesystem that is more
> suitable for large filesystems?

There are several log FS/JFS projects underway right now for Linux.
Even SGI will release the source (or so I've read) to their IRIX
filesystem, which is journalled, so this may be added to the mix soon.

Cheers, Andreas
-- 
Andreas Dilger   University of Calgary  \"If a man ate a pound of pasta and
                 Micronet Research Group \ a pound of antipasto, would they
Dept of Electrical & Computer Engineering \   cancel out, leaving him still
http://www-mddsp.enel.ucalgary.ca/People/adilger/       hungry?" -- Dogbert


* Re: [linux-lvm] Re: ext2resize
  1999-07-05  8:47 Lennert Buytenhek
  1999-07-05 18:38 ` John Finlay
@ 1999-07-05 18:55 ` John Finlay
  1 sibling, 0 replies; 17+ messages in thread
From: John Finlay @ 1999-07-05 18:55 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: linux-lvm

Lennert Buytenhek wrote:

> >Lennert Buytenhek writes:
> >> I replied: "Depending on who you talk to of course. :-) The max. number
> >> of groups for ext2 is 1024 I believe. One extra gd block per block group gives
> >> you room for 32*8=256mb expansion (assuming 1kb blocks). This
> >> will cost you at most 1 meg of reserved gd blocks. Seems like a fair
> >> price. The max. number of gd blocks is 32. So doing this when making
> >> an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
> >> blocks. On modern drives, you'll probably not even notice a 32mb
> >> loss. Unless you have a lot of partitions, of course...."
> >Are you sure that the max number of GDT blocks is 32?  For a 1kB block
>
> Yes, for a 1kb block size. One of my ext2 linux kernel headers #defines
> the max number of groups to be 1024. Times 32 bytes per group
> descriptor, that is 32kb, which is 32 blocks at a 1kb block size. Unless
> the header is wrong, of course.
>

I don't think this is correct. I have an ext2 filesystem that is 52GB. It
appears to have 200+ blocks in the GDT and as I recall 6000+ block groups. I
made an 86GB filesystem the other day with 1k blocks and it had 10000+ block
groups. With 4k blocks there were 650+ block groups.

>
> >size, this would give a limit of 32 GDT blocks * 32 GD/GDT block * 8k
> >blocks/GD = 8GB max FS size.  With 4kB blocks we grow 4x for larger data
> >blocks, 4x for more GD/GDT block, and 4x for more blocks/GD, so 512 GB
> >max, not the expected 4TB limit.  If we wanted to reach 4TB with 1kB
> >blocks (possible since block numbers are 32-bit unsigned), then we would
> >need 512*32 GDT blocks, or 200% !!!  of all FS space, while with 4kB
> >blocks we need 256 GDT blocks, or 1/32 of FS space.
>
> Yes, well, I didn't invent this. Most people will use larger block
> sizes then, anyway.
>

It does seem peculiar that the largest block size is limited to 4k. 8k would
seem to be a reasonable size to me.

It seems that ext2 is not really suited for large filesystems: it seems like
there is too much redundancy in the block groups, which causes slowdowns in
operations like mount, etc.; e2fsck takes hours on a 52GB filesystem.

Are there any projects underway to develop a new filesystem that is more
suitable for large filesystems?

John


* Re: [linux-lvm] Re: ext2resize
  1999-07-05  8:47 Lennert Buytenhek
@ 1999-07-05 18:38 ` John Finlay
  1999-07-05 18:55 ` John Finlay
  1 sibling, 0 replies; 17+ messages in thread
From: John Finlay @ 1999-07-05 18:38 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: linux-lvm




* Re: [linux-lvm] Re: ext2resize
  1999-07-02 16:58 ` Andreas Dilger
@ 1999-07-05 17:03   ` Stephen C. Tweedie
  0 siblings, 0 replies; 17+ messages in thread
From: Stephen C. Tweedie @ 1999-07-05 17:03 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Linux LVM mailing list, Linux FS development list, Stephen Tweedie

Hi,

On Fri, 2 Jul 1999 10:58:27 -0600 (MDT), Andreas Dilger
<adilger@enel.ucalgary.ca> said:

> I agree that reserving GDT blocks would be a small hack for v0 of ext2, but
> we could add a "COMPATIBLE" extension to v1 of ext2 that gave the number of
> GDT blocks reserved.  Also, since the v1 sparse superblock filesystems
> leave a gap between the structures, there should not be a real problem.
> We just have to make sure that we agree (for example) that all the blocks
> from the start of the group to the block bitmap are reserved for the GDT.
> This is not a big stretch, since this is already how v0 ext2 filesystems are.

If you're going to be moving user data around anyway (and you have to be
able to do this for the shrink case), then relocating the
inode/bitmap blocks is really not that much harder.

--Stephen


* [linux-lvm] Re: ext2resize
@ 1999-07-05  8:47 Lennert Buytenhek
  1999-07-05 18:38 ` John Finlay
  1999-07-05 18:55 ` John Finlay
  0 siblings, 2 replies; 17+ messages in thread
From: Lennert Buytenhek @ 1999-07-05  8:47 UTC (permalink / raw)
  To: linux-lvm

>Lennert Buytenhek writes:
>> I replied: "Depending on who you talk to of course. :-) The max. number
>> of groups for ext2 is 1024 I believe. One extra gd block per block group gives
>> you room for 32*8=256mb expansion (assuming 1kb blocks). This
>> will cost you at most 1 meg of reserved gd blocks. Seems like a fair
>> price. The max. number of gd blocks is 32. So doing this when making
>> an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
>> blocks. On modern drives, you'll probably not even notice a 32mb
>> loss. Unless you have a lot of partitions, of course...."
>Are you sure that the max number of GDT blocks is 32?  For a 1kB block

Yes, for a 1kb block size. One of my ext2 linux kernel headers #defines
the max number of groups to be 1024. Times 32 bytes per group
descriptor, that is 32kb, which is 32 blocks at a 1kb block size. Unless
the header is wrong, of course.

>size, this would give a limit of 32 GDT blocks * 32 GD/GDT block * 8k
>blocks/GD = 8GB max FS size.  With 4kB blocks we grow 4x for larger data
>blocks, 4x for more GD/GDT block, and 4x for more blocks/GD, so 512 GB
>max, not the expected 4TB limit.  If we wanted to reach 4TB with 1kB
>blocks (possible since block numbers are 32-bit unsigned), then we would
>need 512*32 GDT blocks, or 200% !!!  of all FS space, while with 4kB
>blocks we need 256 GDT blocks, or 1/32 of FS space.

Yes, well, I didn't invent this. Most people will use larger block
sizes then, anyway.

>> >Mike had also suggested that when we are doing a major FS (offline)
>> >reorg, we could start removing blocks from the inode table instead of
>> >data blocks as there are usually free inodes in each group, but not
>> >always data blocks...
>> I think it's not worth the complexity. First of all, all your inodes
>> will be renumbered. You'll need a full directory scan-and-replace for
>> inodes which is very crash-sensitive. On the other hand, relocating a
>> block is atomic.
>
>There are two other possibilities:
>1) "#define EXT2_EXPAND_INO 7" and put it in include/linux/ext2_fs.h and
allocate
>   the first few data blocks in each group to this inode to reserve them.

Would be nice. But this way the blocks are really reserved and
can't be used for anything else. Why don't we say: "every ext2
fs must have 32 gdt blocks per block group" then? Same idea.

>2) Have the "reserved" blocks be the first few data blocks in each group,
>   rather than all in a single group.  This way the blocks can be used by
>   root when the FS is very full, or they can be used for expanding the
>   filesystem.  The one problem is that you may not have your extra blocks
>   when you need them (ie when the FS is full) if it is root that is
>   filling the FS.  The plus is that you don't keep more spare blocks that
>   normally nobody uses in the FS.

How do you tell the kernel that it should use those 'reserved' blocks
only when free space is low? compat/rocompat/incompat feature?

Lennert Buytenhek
<buytenh@dsv.nl>


* [linux-lvm] Re: ext2resize
@ 1999-07-05  8:40 Lennert Buytenhek
  0 siblings, 0 replies; 17+ messages in thread
From: Lennert Buytenhek @ 1999-07-05  8:40 UTC (permalink / raw)
  To: linux-lvm

>Lennert Buytenhek writes:
>> I mailed with Mike about this too. He said: "First guess is that it
>> would require one more field in the superblock - how many blocks are
>> reserved for group descriptors....". I replied with: "In the group
>> descriptor table there are pointers to the start of the block/inode
>> bitmap and inode table for each group. So you don't have to put any
>> info in the superblock. You can just leave free blocks between the gd
>> table and the block bitmap. I guess. But ext2 works in mysterious
>> ways.... :-)"
>>
>> He said: "Still, reserving extra blocks would only be a hack..."
>I agree that reserving GDT blocks would be a small hack for v0 of ext2, but
>we could add a "COMPATIBLE" extension to v1 of ext2 that gave the number of
>GDT blocks reserved.  Also, since the v1 sparse superblock filesystems
>leave a gap between the structures, there should not be a real problem.
>We just have to make sure that we agree (for example) that all the blocks
>from the start of the group to the block bitmap are reserved for the GDT.
>This is not a big stretch, since this is already how v0 ext2 filesystems
>are.
>
>If we really don't want to do this (I haven't seen any reason not to,
>however), then we could always allocate the initial datablocks in each
>group to a reserved inode (7 is unused in my ext2_fs.h) so they aren't
>available to any files.  Then we don't need to worry about moving user
>data around when we want to resize.

When I said "reserving" I meant: leave empty data blocks between the
end of the gd table and the block bitmap (just like the sparse sb flag
leaves holes). It's easy to move a data block out of the way. But not
when mounted, of course. For me it was moving a data block (which
is easily made atomic) vs. moving the entire bitmap/table structure
a few blocks ahead.

When you say reserving you mean: keep this block free. Okay, that's
more logical than my 'reserving' :-)

>> One extra gd block per block group gives
>> you room for 32*8=256mb expansion (assuming 1kb blocks). This
>> will cost you at most 1 meg of reserved gd blocks. Seems like a fair
>> price. The max. number of gd blocks is 32. So doing this when making
>> an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
>> blocks. On modern drives, you'll probably not even notice a 32mb
>> loss. Unless you have a lot of partitions, of course...."
>
>The good news is that the "reserved" GDT blocks are a constant percentage
>of the filesystem size, as you are not allocating GDT blocks in the groups
>that don't exist yet.

Ok, ok.

>> I suggested Mike putting the 'reserving extra blocks' feature in
>> mke2fs.
>I've worked on a change to mke2fs to give it a new parameter (-e) which
>will set the "expansion limit" in a new FS.  It needs a change to libext2fs
>in initialize.c to do a "set_field(desc_blocks,...)" to take the set value
>or fill it in with the default value.  One issue I had was whether the -e
>parameter should be in blocks/GDT (easier to code) or total FS blocks
>(easier for a user to understand)?

I think the latter (easier for the user). It's not that much math, or
is it?
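
(It really is just two divisions -- a sketch of the conversion, assuming
1kB blocks and an 8GB growth target:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long bs            = 1024;              /* block size */
        unsigned long target_blocks = 8UL * 1024 * 1024; /* 8GB at 1kB */
        unsigned long blk_per_grp   = 8 * bs;
        unsigned long gd_per_blk    = bs / 32;
        unsigned long groups = (target_blocks + blk_per_grp - 1) / blk_per_grp;
        printf("%lu GDT blocks\n", (groups + gd_per_blk - 1) / gd_per_blk);
        return 0;                                        /* prints 32 */
    }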

>> Huh? I thought a group must _always_ have an sb, gd table,
>> block bitmap, inode bitmap and inode table. Or am I wrong here?
>
>If you look at alloc_tables.c in libext2fs, there are several parts like
>
>    start_blk = group_blk + 3 + fs->desc_blocks;
>    if (start_blk > last_blk)
>        start_blk = group_blk;
>
>I take this to mean that if there isn't enough room for the superblock,
>GDT, inode bitmap, and block bitmap, then don't put one in the last group.
>The last group will start directly with the inode table.  However, this is
>easily "fixed" so that the last group will require a complete set of these
>tables, or it won't be allocated and the FS size will be reduced.

I tried to make an 8194-block filesystem with mke2fs once, and it told
me it was going to make the fs 8193 blocks. Can you give me a recipe
for making an fs which has a last group w/o sb/gdt?

>> So: add a flag that will cancel the resize if the gd table growth needs
>> to move blocks/metadata?
>
>I don't think it should be a flag.  If the FS is mounted and we need to
>move blocks, then cancel the operation automatically.  It shouldn't be
>a problem to move blocks if the filesystem is unmounted.

ext2resize doesn't even check whether the fs is mounted right
now. Lots of people try resizing their mounted fs'es. I'll implement
a -f (force) option.

Lennert Buytenhek
<buytenh@dsv.nl>


* [linux-lvm] Re: ext2resize
@ 1999-07-04 18:56 Andreas Dilger
  0 siblings, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 1999-07-04 18:56 UTC (permalink / raw)
  To: Linux LVM mailing list, Linux FS development list

Lennert Buytenhek writes:
> I replied: "Depending on who you talk to of course. :-) The max. number
> of groups for ext2 is 1024 I believe. One extra gd block per block group gives
> you room for 32*8=256mb expansion (assuming 1kb blocks). This
> will cost you at most 1 meg of reserved gd blocks. Seems like a fair
> price. The max. number of gd blocks is 32. So doing this when making
> an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
> blocks. On modern drives, you'll probably not even notice a 32mb
> loss. Unless you have a lot of partitions, of course...."

Are you sure that the max number of GDT blocks is 32?  For a 1kB block
size, this would give a limit of 32 GDT blocks * 32 GD/GDT block * 8k
blocks/GD = 8GB max FS size.  With 4kB blocks we grow 4x for larger data
blocks, 4x for more GD/GDT block, and 4x for more blocks/GD, so 512 GB
max, not the expected 4TB limit.  If we wanted to reach 4TB with 1kB
blocks (possible since block numbers are 32-bit unsigned), then we would
need 512*32 GDT blocks, or 200% !!!  of all FS space, while with 4kB
blocks we need 256 GDT blocks, or 1/32 of FS space.
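
(The figures above, recomputed -- my own check of the numbers:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long long four_tb = 4ULL << 40;
        unsigned long sizes[] = { 1024, 4096 };
        for (int i = 0; i < 2; i++) {
            unsigned long long bs = sizes[i];
            /* bytes mapped per GDT block: descriptors per block
             * times blocks per group times bytes per block */
            unsigned long long per_gdt = (bs / 32) * (8 * bs) * bs;
            printf("%llukB: 32 GDT blocks -> %lluGB; 4TB needs %llu GDT blocks\n",
                   bs / 1024, (32 * per_gdt) >> 30, four_tb / per_gdt);
        }
        return 0;          /* 1kB: 8GB, 16384 blocks; 4kB: 512GB, 256 blocks */
    }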

> I suggested Mike putting the 'reserving extra blocks' feature in
> mke2fs.

I started on this, and initially I thought I would have the parameter (-e)
take the maximum FS size in blocks; however, this makes mke2fs much more
complicated, as it replicates code in libext2fs which sets default values
for block size, blocks per group, etc.  It is much easier to just supply
a GDT block count and pass it to libext2fs.

> >Mike had also suggested that when we are doing a major FS (offline)
> >reorg, we could start removing blocks from the inode table instead of
> >data blocks as there are usually free inodes in each group, but not
> >always data blocks...
> I think it's not worth the complexity. First of all, all your inodes will be
> renumbered. You'll need a full directory scan-and-replace for inodes
> which is very crash-sensitive. On the other hand, relocating a block
> is atomic.

There are two other possibilities:
1) "#define EXT2_EXPAND_INO 7" and put it in include/linux/ext2_fs.h and allocate
   the first few data blocks in each group to this inode to reserve them.
2) Have the "reserved" blocks be the first few data blocks in each group,
   rather than all in a single group.  This way the blocks can be used by root when
   the FS is very full, or they can be used for expanding the filesystem.  The one
   problem is that you may not have your extra blocks when you need them (ie when the
   FS is full) if it is root that is filling the FS.  The plus is that you don't keep
   more spare blocks that normally nobody uses in the FS.

Cheers, Andreas
-- 
Andreas Dilger  University of Calgary \ "If a man ate a pound of pasta and
                Micronet Research Group \ a pound of antipasto, would they
Dept of Electrical & Computer Engineering \  cancel out, leaving him still
http://www-mddsp.enel.ucalgary.ca/People/adilger/      hungry?" -- Dogbert


* Re: [linux-lvm] Re: ext2resize
  1999-06-30  9:46 Admin
@ 1999-07-02 16:58 ` Andreas Dilger
  1999-07-05 17:03   ` Stephen C. Tweedie
  0 siblings, 1 reply; 17+ messages in thread
From: Andreas Dilger @ 1999-07-02 16:58 UTC (permalink / raw)
  To: Linux LVM mailing list; +Cc: Linux FS development list

Lennert Buytenhek writes:
> I mailed with Mike about this too. He said: "First guess is that it
> would require one more field in the superblock - how many blocks are
> reserved for group descriptors....". I replied with: "In the group
> descriptor table there are pointers to the start of the block/inode bitmap
> and inode table for each group. So you don't have to put any info in the
> superblock. You can just leave free blocks between the gd table and the
> block bitmap. I guess. But ext2 works in mysterious ways.... :-)"
> 
> He said: "Still, reserving extra blocks would only be a hack..."

I agree that reserving GDT blocks would be a small hack for v0 of ext2, but
we could add a "COMPATIBLE" extension to v1 of ext2 that gave the number of
GDT blocks reserved.  Also, since the v1 sparse superblock filesystems
leave a gap between the structures, there should not be a real problem.
We just have to make sure that we agree (for example) that all the blocks
from the start of the group to the block bitmap are reserved for the GDT.
This is not a big stretch, since this is already how v0 ext2 filesystems are.

If we really don't want to do this (I haven't seen any reason not to, however),
then we could always allocate the initial datablocks in each group to a
reserved inode (7 is unused in my ext2_fs.h) so they aren't available to any
files.  Then we don't need to worry about moving user data around when we
want to resize.

> One extra gd block per block gives
> you room for 32*8=256mb expansion (assuming 1kb blocks). This
> will cost you at most 1 meg of reserved gd blocks. Seems like a fair
> price. The max. number of gd blocks is 32. So doing this when making
> an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
> blocks. On modern drives, you'll probably not even notice a 32mb
> loss. Unless you have a lot of partitions, of course...."

The good news is that the "reserved" GDT blocks are a constant percentage
of the filesystem size, as you are not allocating GDT blocks in the groups
that don't exist yet.

> I suggested Mike putting the 'reserving extra blocks' feature in
> mke2fs.

I've worked on a change to mke2fs to give it a new parameter (-e) which
will set the "expansion limit" in a new FS.  It needs a change to libext2fs
in initialize.c to do a "set_field(desc_blocks,...)" to take the set value or
fill it in with the default value.  One issue I had was whether the -e
parameter should be in blocks/GDT (easier to code) or total FS blocks (easier
for a user to understand)?

> Huh? I thought a group must _always_ have an sb, gd table,
> block bitmap, inode bitmap and inode table. Or am I wrong here?

If you look at alloc_tables.c in libext2fs, there are several parts like

        start_blk = group_blk + 3 + fs->desc_blocks;
        if (start_blk > last_blk)
                start_blk = group_blk;

I take this to mean that if there isn't enough room for the superblock, GDT,
inode bitmap, and block bitmap, then don't put one in the last group.  The
last group will start directly with the inode table.  However, this is easily
"fixed" so that the last group will require a complete set of these tables,
or it won't be allocated and the FS size will be reduced.

The savings in doing this the old way was very small anyways - for a 256MB or
smaller FS, it was the difference of maybe 3 datablocks at the most,
increasing by 1 datablock for each 256MB.  The last group only lacks these
tables when

        start_blk = group_blk + 3 + 1 > last_blk

and in that case we would need one of the remaining 4 blocks for the
inode_table, with only 3 blocks left.  This is hardly worth the savings
compared to the difficulty in moving all of the blocks around when we want
to expand a filesystem.

> So: add a flag that will cancel the resize if the gd table growth needs
> to move blocks/metadata?

I don't think it should be a flag.  If the FS is mounted and we need to
move blocks, then cancel the operation automatically.  It shouldn't be
a problem to move blocks if the filesystem is unmounted.

Cheers, Andreas
-- 
Andreas Dilger  University of Calgary \ "If a man ate a pound of pasta and
                Micronet Research Group \ a pound of antipasto, would they
Dept of Electrical & Computer Engineering \  cancel out, leaving him still
http://www-mddsp.enel.ucalgary.ca/People/adilger/      hungry?" -- Dogbert


* [linux-lvm] Re: ext2resize
@ 1999-06-30  9:46 Admin
  1999-07-02 16:58 ` Andreas Dilger
  0 siblings, 1 reply; 17+ messages in thread
From: Admin @ 1999-06-30  9:46 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-lvm, linux-fsdevel

>Lennert Buytenhek writes:
>> Correct. ext2 is divided into block groups, which are 8mb
>> big when using 1k blocks. A block group looks like this:
>>
>> 1 block        superblock
>> ? blocks      group descriptor table
>> 1 block        block bitmap
>> 1 block        inode bitmap
>> ? blocks      inode table
>> ? blocks      data blocks (the bulk of the group)
>
>I was emailing with Mike Field about this, and according to the
>definition of ext2_super_block in ext2_fs.h, it should be possible to
>set the location of the block bitmap, inode bitmap, and inode table
>anywhere in the group, and have the datablocks follow.  If you set the
>pointers to these structures to start, say, 33 blocks into the group,
>this would allow you to grow the GDT to handle an 8GB filesystem before
>a reorg (block moving) is necessary.

I mailed with Mike about this too. He said: "First guess is that it
would require one more field in the superblock - how many blocks are
reserved for group descriptors....". I replied with: "In the group
descriptor table there are pointers to the start of the block/inode
bitmap and inode table for each group. So you don't have to put any info
in the superblock. You can just leave free blocks between the gd table
and the block bitmap. I guess. But ext2 works in mysterious ways.... :-)"

He said: "Still, reserving extra blocks would only be a hack..."

I replied: "Depending on who you talk to of course. :-) The max. number
of groups for ext2 is 1024 I believe. One extra gd block per block gives
you room for 32*8=256mb expansion (assuming 1kb blocks). This
will cost you at most 1 meg of reserved gd blocks. Seems like a fair
price. The max. number of gd blocks is 32. So doing this when making
an fs will cost you at most 32*1024 blocks, which is 32mb with 1k
blocks. On modern drives, you'll probably not even notice a 32mb
loss. Unless you have a lot of partitions, of course...."

Then I said: "You could always throw the fs down, free the soon-to-
be-needed gd blocks, do some moves, and mount it again, all very
quickly. Then you could do the add-another-group thingy. You don't
need to do the whole operation while unmounted. (Hey, resizing
mounted fs'es is tricky anyway, so why not be extra tricky.... :-)"

>[time passes]
>I looked into the code in e2fsprogs/lib/ext2fs (openfs.c, initialize.c)
>and the kernel (fs/ext2/balloc.c).  It looks like, while an ext2 reader
>will only (currently) calculate desc_blocks based on the number of group
>descriptors and the block size, it will gladly use the values supplied
>in the superblock for the location of the block bitmap, inode bitmap,
>inode table, and the number of data blocks - leaving a "gap" after the
>GDT for future growth (NB - need to check e2fsck for what it does).  If

e2fsck probably does the same. Make a small (~64mb) filesystem and mkfs
it with the sparse superblocks flag on. Then run dumpe2fs on it. This
will bring enlightenment w.r.t. metadata pointers.

>you "fix" initialize.c to have a larger number of desc_blocks than the
>minimum needed, existing kernels and e2fsck should work OK with this,
>which is a big plus.  Your ext2_resize could also do this without
>actually "growing" the filesystem - just get it ready to do so if
>needed.

I suggested Mike putting the 'reserving extra blocks' feature in
mke2fs.

>   mke2fs which says "start writing X groups into the FS".  The
>   only real issue is the last group, which appears to be able to
>   NOT have a superblock or GDT, which is a BIG problem...

Huh? I thought a group must _always_ have an sb, gd table,
block bitmap, inode bitmap and inode table. Or am I wrong here?

>> This is what ext2resize basically does (when enlarging).
>> But you'll need a way to get this through to the kernel (it
>> has it's own superblock copy). I haven't really looked at
>> the volume patch very well.
>
>As I suggested to Mike, it may be desirable to have two different
>implementations - an online resize which will not do much (if any) block
>moving, and can only resize up to the next 256MB boundary (or
>pre-allocated GDT size), and an offline resize which will do things like
>renumber inode and data blocks, remove inodes, add GDT blocks, etc.

So: add a flag that will cancel the resize if the gd table growth needs
to move blocks/metadata?

>Mike had also suggested that when we are doing a major FS (offline)
>reorg, we could start removing blocks from the inode table instead of
>data blocks as there are usually free inodes in each group, but not
>always data blocks...

I think it's not worth the complexity. First of all, all your inodes will be
renumbered. You'll need a full directory scan-and-replace for inodes
which is very crash-sensitive. On the other hand, relocating a block
is atomic.

>> You can remount an fs RO, ext2resize it, and remount it RW methinks.
>
>This would likely break many programs, as they would fail for the time
>it is in RO mode.  A more pleasant solution is to only allow growth to a
>pre-determined limit online (with a kernel lock), and then force the
>user to unmount the FS to do block shuffling.

Yes, well, remounting it RO and then remounting it RW will probably
not work, since (as Rolf has already mentioned) the kernel will not
reread metadata upon a remount.

>> About shrinking an existing fs: this would be even
>> messier. (Involves moving inodes around, and those
>> inodes might be in core. Et cetera. Hell on earth :-)
>> But growing an fs might be messy too, because of
>> the growing group descriptor table.
>
>I don't think shrinking a FS online is as big a need as growing it, and
>this can be left for a utility that works when the FS is unmounted.

Yep.

>Cheers, Andreas


Lennert Buytenhek
<buytenh@dsv.nl>


* [linux-lvm] Re: ext2resize
       [not found] <009c01bebee1$aa81b9a0$0102010a@adminstation.sgymsdam.nl>
@ 1999-06-30  7:34 ` Andreas Dilger
  0 siblings, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 1999-06-30  7:34 UTC (permalink / raw)
  To: buytenh; +Cc: Linux LVM mailing list, Linux FS development list

Lennert Buytenhek writes:
> Correct. ext2 is divided into block groups, which are 8mb
> big when using 1k blocks. A block group looks like this:
> 
> 1 block        superblock
> ? blocks      group descriptor table
> 1 block        block bitmap
> 1 block        inode bitmap
> ? blocks      inode table
> ? blocks      data blocks (the bulk of the group)

I was emailing with Mike Field about this, and according to the
definition of ext2_super_block in ext2_fs.h, it should be possible to
set the location of the block bitmap, inode bitmap, and inode table
anywhere in the group, and have the datablocks follow.  If you set the
pointers to these structures to start, say, 33 blocks into the group,
this would allow you to grow the GDT to handle an 8GB filesystem before
a reorg (block moving) is necessary.

[time passes]
I looked into the code in e2fsprogs/lib/ext2fs (openfs.c, initialize.c)
and the kernel (fs/ext2/balloc.c).  It looks like, while an ext2 reader
will only (currently) calculate desc_blocks based on the number of group
descriptors and the block size, it will gladly use the values supplied
in the superblock for the location of the block bitmap, inode bitmap,
inode table, and the number of data blocks - leaving a "gap" after the
GDT for future growth (NB - need to check e2fsck for what it does).  If
you "fix" initialize.c to have a larger number of desc_blocks than the
minimum needed, existing kernels and e2fsck should work OK with this,
which is a big plus.  Your ext2_resize could also do this without
actually "growing" the filesystem - just get it ready to do so if
needed.

When it comes time to grow the filesystem, all you need to
do is:

0) expand LV/partition/md/loopback file/etc to be larger.
1) userland - write into new groups the new FS data (superblock,
   GDT, inode bitmaps, inode blocks, etc).  This is what
   mke2fs + ext2extend from ext2-volume does to a new disk. It
   should be relatively straight forward, maybe a new flag to
   mke2fs which says "start writing X groups into the FS".  The
   only real issue is the last group, which appears to be able to
   NOT have a superblock or GDT, which is a BIG problem...
2) userland - write into the "spare" GDT for each existing group
   any needed values.  Since this is likely constant, it could
   even be done long in advance (eg FS creation, or
   ext2_offline_resize).  There should be no worry about this
   space being overwritten by the kernel, since it will never
   read or write these blocks.
3) userland - write into all "extra" superblocks the new FS
   configuration, updating blocks_count, free_blocks, r_blocks_count,
   inodes_count, free_inodes_count, groups_count.  Again, hopefully
   no worries about overwriting this on a running system because the
   kernel shouldn't touch these on an open filesystem.
4) lock FS in kernel
5) kernel - update kernel superblock data with new FS config as in (3).
   May need to "realloc" the GDT tables in memory, as the kernel
   will only have allocated enough based on old GDT size (or so it looks
   in my 2.0.36 balloc.c).
6) kernel - write primary superblock to disk.  This is the "real"
   copy, and the other superblocks are only estimates that will be
   overwritten when the FS is unmounted, I believe.  If system
   crashes without FS unmount, then primary superblock should be
   used on remount anyways, and e2fsck will fix others?
7) unlock FS in kernel
8) userland - proceed to use new space in FS ;-)
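
A sketch of the bookkeeping in step 3 (field names as in ext2_fs.h, but
the struct is trimmed to just these counters -- not the real on-disk
layout):

    #include <stdint.h>

    struct ext2_sb_counts {              /* trimmed, illustration only */
        uint32_t s_inodes_count;
        uint32_t s_blocks_count;
        uint32_t s_r_blocks_count;       /* reserved for root (~5%) */
        uint32_t s_free_blocks_count;
        uint32_t s_free_inodes_count;
    };

    /* Account for 'new_groups' freshly written, empty groups; 'overhead'
     * is the sb + GDT + bitmaps + inode table blocks in each new group. */
    void grow_counts(struct ext2_sb_counts *sb, uint32_t new_groups,
                     uint32_t blocks_per_group, uint32_t inodes_per_group,
                     uint32_t overhead)
    {
        sb->s_blocks_count      += new_groups * blocks_per_group;
        sb->s_inodes_count      += new_groups * inodes_per_group;
        sb->s_free_inodes_count += new_groups * inodes_per_group;
        sb->s_free_blocks_count += new_groups * (blocks_per_group - overhead);
        sb->s_r_blocks_count     = sb->s_blocks_count / 20; /* keep ~5% */
    }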

> This is what ext2resize basically does (when enlarging).
> But you'll need a way to get this through to the kernel (it
> has it's own superblock copy). I haven't really looked at
> the volume patch very well.

As I suggested to Mike, it may be desirable to have two different
implementations - an online resize which will not do much (if any) block
moving, and can only resize up to the next 256MB boundary (or
pre-allocated GDT size), and an offline resize which will do things like
renumber inode and data blocks, remove inodes, add GDT blocks, etc.

Mike had also suggested that when we are doing a major FS (offline)
reorg, we could start removing blocks from the inode table instead of
data blocks as there are usually free inodes in each group, but not
always data blocks...

> You can remount an fs RO, ext2resize it, and remount it RW methinks.

This would likely break many programs, as they would fail for the time
it is in RO mode.  A more pleasant solution is to only allow growth to a
pre-determined limit online (with a kernel lock), and then force the
user to unmount the FS to do block shuffling.

> About shrinking an existing fs: this would be even
> messier. (Involves moving inodes around, and those
> inodes might be in core. Et cetera. Hell on earth :-)
> But growing an fs might be messy too, because of
> the growing group descriptor table.

I don't think shrinking a FS online is as big a need as growing it, and
this can be left for a utility that works when the FS is unmounted.

Cheers, Andreas
-- 
Andreas Dilger  University of Calgary \ "If a man ate a pound of pasta and
                Micronet Research Group \ a pound of antipasto, would they
Dept of Electrical & Computer Engineering \  cancel out, leaving him still
http://www-mddsp.enel.ucalgary.ca/People/adilger/      hungry?" -- Dogbert


Thread overview: 17+ messages
1999-07-06  7:56 [linux-lvm] Re: ext2resize Lennert Buytenhek
1999-07-08 14:34 ` tytso
1999-07-08 22:56 ` Theodore Y. Ts'o
  -- strict thread matches above, loose matches on Subject: below --
1999-07-06  8:56 Lennert Buytenhek
1999-07-06  8:47 Lennert Buytenhek
1999-07-05 20:49 Andreas Dilger
1999-07-05 19:47 Andreas Dilger
1999-07-07  5:27 ` John Finlay
1999-07-05  8:47 Lennert Buytenhek
1999-07-05 18:38 ` John Finlay
1999-07-05 18:55 ` John Finlay
1999-07-05  8:40 Lennert Buytenhek
1999-07-04 18:56 Andreas Dilger
1999-06-30  9:46 Admin
1999-07-02 16:58 ` Andreas Dilger
1999-07-05 17:03   ` Stephen C. Tweedie
     [not found] <009c01bebee1$aa81b9a0$0102010a@adminstation.sgymsdam.nl>
1999-06-30  7:34 ` Andreas Dilger
