* ENOSPC design issues
From: Josef Bacik @ 2012-09-20 19:03 UTC
  To: linux-btrfs

Hello,

I'm going to look at fixing some of the performance issues that crop up because
of our reservation system.  Before I go and do a whole lot of work I want some
feedback.  I've done a brain dump here

https://btrfs.wiki.kernel.org/index.php/ENOSPC

This has an explanation of how our reservation system currently works and the
problems I think there are with it, and the solutions I'm going to try.  If
somebody has any better ideas please put them on that page along with your IRC
nick so I can find you and we can chat about your idea.  Once we have a few
things we all agree on I'll start hammering away at them.  Some of these things
I'm going to go ahead and do because they are just correct, but the bigger
things such as the flusher thread I'd like some more input on.  Any crazy ideas
welcome, but if your idea involves completely scrapping the current system it
will have to be an awesome idea since I don't really want to spend the next year
fixing early enospc problems again.  Thanks,

Josef


* Re: ENOSPC design issues
From: Mitch Harder @ 2012-09-24 16:59 UTC
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 20, 2012 at 2:03 PM, Josef Bacik <jbacik@fusionio.com> wrote:
> Hello,
>
> I'm going to look at fixing some of the performance issues that crop up because
> of our reservation system.  Before I go and do a whole lot of work I want some
> feedback.

When I was trying to figure out the problem with gzip ENOSPC issues, I
spent some time debugging and following the flow through the
reserve_metadata_bytes() function in extent-tree.c.

My observation was that the accounting around
space_info->bytes_may_use did not appear to be tightly closed.  The
space_info->bytes_may_use value would grow large (often 3 or 4 times
greater than space_info->total_bytes), and the flow through
reserve_metadata_bytes() would stay in overcommit.

I was unsuccessful in figuring out how to rework or close the loop on
the accounting for space_info->bytes_may_use.

I noticed that btrfs seemed to work OK even though the value in
space_info->bytes_may_use appeared inexplicably large, and btrfs was
always in overcommit.

So, since you're asking for possibly 'crazy ideas', I suggest finding a
way to ignore space_info->bytes_may_use in reserve_metadata_bytes().
Either make overcommit the default (which I found to approximate my
real-life case anyhow), or add a simple mechanism for quick fail-over
to overcommit.

I doubt this will be any kind of comprehensive fix for ENOSPC issues,
but simplifying reserve_metadata_bytes() may make it easier to find
the other issues.
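
For reference, the flow I kept ending up in looks roughly like this (a
simplified sketch of the reserve_metadata_bytes() logic, not the exact
kernel source; names approximate):

	used = space_info->bytes_used + space_info->bytes_reserved +
	       space_info->bytes_pinned + space_info->bytes_readonly +
	       space_info->bytes_may_use;	/* the term I'd consider ignoring */

	if (used + orig_bytes <= space_info->total_bytes)
		/* plain reservation succeeds */
	else if (can_overcommit(root, space_info, orig_bytes, flush))
		/* overcommit path: allow going past total_bytes by some margin */
	else
		/* flush, retry, eventually return -ENOSPC */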


* Re: ENOSPC design issues
From: David Sterba @ 2012-09-25 16:43 UTC
  To: Josef Bacik; +Cc: linux-btrfs

On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
> I'm going to look at fixing some of the performance issues that crop up because
> of our reservation system.  Before I go and do a whole lot of work I want some
> feedback.  I've done a brain dump here
> https://btrfs.wiki.kernel.org/index.php/ENOSPC

Thanks for writing it down, much appreciated.

My first and probably naive approach is described in the page, quoting
here:

 "Attempt to address how to flush less stated below. The
 over-reservation of a 4k block can go up to 96k as the worst case
 calculation (see above). This accounts for splitting the full tree path
 from 8th level root down to the leaf plus the node splits. My question:
 how often do we need to go up to the level N+1 from current level N?
 for levels 0 and 1 it may happen within one transaction, maybe not so
 often for level 2 and with exponentially decreasing frequency for the
 higher levels. Therefore, is it possible to check the tree level first
 and adapt the calculation according to that? Let's say we can reduce
 the 4k reservation size from 96k to 32k on average (for a many-gigabyte
 filesystem), thus increasing the space available for reservations by
 some factor. The expected gain is less pressure to the flusher because
 more reservations will succeed immediately.
 The idea behind this is to make the initial reservation more accurate
 to the current state than blindly overcommitting by some random factor
 (1/2).
 Another hint to the tree root level may be the usage of the root node:
 eg. if the root is less than half full, splitting will not happen
 unless there are K concurrent reservations running where K is
 proportional to overwriting the whole subtree (same exponential
 decrease with increasing level) and this will not be possible within
 one transaction or there will not be enough space to satisfy all
 reservations. (This attempts to fine-tune the currently hardcoded level
 8 up to the best value). The safe value for the level in the
 calculations would be like N+1, ie. as if all the possible splits
 happen with respect to current tree height."
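
For concreteness, the 96k worst case above falls out of the current
btrfs_calc_trans_metadata_size() formula (assuming 4k leafsize and
nodesize):

  (leafsize + nodesize * (BTRFS_MAX_LEVEL - 1)) * 3
    = (4096 + 4096 * 7) * 3
    = 98304 bytes = 96k reserved per 4k block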

implemented as follows on top of next/master, in short:
* disable overcommit completely
* make an optimistic best guess for the metadata and reserve only up
  to the current tree height

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2794,7 +2794,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
                                                 unsigned num_items)
 {
-       return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
+       int level = btrfs_header_level(root->node);
+
+       level = min(level, BTRFS_MAX_LEVEL);
+
+       return (root->leafsize + root->nodesize * (level - 1)) *
                3 * num_items;
 }

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index efb044e..c9fa7ed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3649,6 +3649,8 @@ static int can_overcommit(struct btrfs_root *root,
        u64 avail;
        u64 used;

+       return 0;
+
        used = space_info->bytes_used + space_info->bytes_reserved +
                space_info->bytes_pinned + space_info->bytes_readonly +
                space_info->bytes_may_use;
---
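
With this patch, a tree of current height 2 (root node at level 2)
reserves

  (4096 + 4096 * (2 - 1)) * 3 = 24576 bytes = 24k

per item instead of 96k, again assuming 4k leafsize and nodesize.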

I must be doing something wrong, because I don't see any unexpected
ENOSPC while performing some tests on 2G, 4G and 10G partitions with
the following loads:

fs_mark -F -k -S 0 -D 20 -N 100
- dumb, no file contents

fs_mark -F -k -S 0 -D 20000 -N 400000 -s 2048 -t 8
- metadata intense, files and inline contents

fs_mark -F -k -S 0 -D 20 -N 100 -s 3900 -t 24
- roughly the same as above, many writers

fs_mark -F -k -S 0 -D 20 -N 100 -s 8192 -t 24
- many writers, no inlines

tar xf linux-3.2.tar.bz2	(1G in total)
- simple untar


The fs_mark loads do not do any kind of sync, as this should stress the
amount of data in flight.  After each load finishes with ENOSPC, the
rest of the filesystem is filled with a file full of zeros.  Then 'fi
df' reports that all the space is used, and no unexpectedly large files
can be found (ie. the zero-fill file is a few hundred KBs if the load
was data-intense, or takes the whole remaining data space, eg. 8MB, if
it was meta-intense).

mkfs was default, so were the mount options; I did not push it through
xfstests, but at least verified md5sums of the kernel tree.

Sampled tree height was 2 for the extent tree and 3 for the fs tree, so
I think it exercised the case where the tree height increased during
the load (but maybe it was just lucky to get the +1 block from
somewhere else, dunno).

I'm running the tests on a 100G filesystem now.


david


* Re: ENOSPC design issues
From: Josef Bacik @ 2012-09-25 17:02 UTC
  To: David Sterba; +Cc: Josef Bacik, linux-btrfs

On Tue, Sep 25, 2012 at 10:43:36AM -0600, David Sterba wrote:
> On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
> > I'm going to look at fixing some of the performance issues that crop up because
> > of our reservation system.  Before I go and do a whole lot of work I want some
> > feedback.  I've done a brain dump here
> > https://btrfs.wiki.kernel.org/index.php/ENOSPC
> 
> Thanks for writing it down, much appreciated.
> 
> My first and probably naive approach is described in the page, quoting
> here:
> 
>  "Attempt to address how to flush less stated below. The
>  over-reservation of a 4k block can go up to 96k as the worst case
>  calculation (see above). This accounts for splitting the full tree path
>  from 8th level root down to the leaf plus the node splits. My question:
>  how often do we need to go up to the level N+1 from current level N?
>  for levels 0 and 1 it may happen within one transaction, maybe not so
>  often for level 2 and with exponentially decreasing frequency for the
>  higher levels. Therefore, is it possible to check the tree level first
>  and adapt the calculation according to that? Let's say we can reduce
>  the 4k reservation size from 96k to 32k on average (for a many-gigabyte
>  filesystem), thus increasing the space available for reservations by
>  some factor. The expected gain is less pressure to the flusher because
>  more reservations will succeed immediately.
>  The idea behind is to make the initial reservation more accurate to
>  current state than blindly overcommitting by some random factor (1/2).
>  Another hint to the tree root level may be the usage of the root node:
>  eg. if the root is less than half full, splitting will not happen
>  unless there are K concurrent reservations running where K is
>  proportional to overwriting the whole subtree (same exponential
>  decrease with increasing level) and this will not be possible within
>  one transaction or there will not be enough space to satisfy all
>  reservations. (This attempts to fine-tune the currently hardcoded level
>  8 up to the best value). The safe value for the level in the
>  calculations would be like N+1, ie. as if all the possible splits
>  happen with respect to current tree height."
> 
> implemented as follows on top of next/master, in short:
> * disable overcommit completely
> * make an optimistic best guess for the metadata and reserve only up
>   to the current tree height
> 

So I had tried to do this before; the problem is that when the height
changes, our reserve changes.  So for things like delalloc we say we have
X number of extents and we reserve that much space, but then when we run
delalloc we re-calculate the metadata size for the X extents we've
removed, and that number could come out differently since the height of
the tree would have changed.  One thing we could do is to store the
actual reservation with the extent in the io_tree, but I think we already
use the private field for something else, so we'd have to add it
somewhere else.  Thanks,

Josef
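
To make the hazard concrete (hypothetical numbers, using the patched
formula with 4k block sizes):

	reserve: tree height 2 -> 24k reserved per delalloc extent,
		 X extents -> X * 24k reserved
	... tree grows to height 3 while the delalloc is outstanding ...
	release: tree height 3 -> size recalculated as 36k per extent,
		 release tries to free X * 36k -> more than was ever
		 reserved, and the space_info counters go wrong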


* Re: ENOSPC design issues
From: Ahmet Inan @ 2012-09-26  7:55 UTC
  To: Josef Bacik; +Cc: David Sterba, linux-btrfs

when testing, please also do something like this:

# create big squashfs image somewhere:
# mksquashfs / /big.img -noappend -no-sparse -e big.img

# then unpack into fresh filesystem with (and without) compression:
# unsquashfs -f -d /subvol /big.img

this is how i was always able to trigger ENOSPC while trying to
make a full system installation from a squashfs image.

you should also try different compression algos (i only use lzo)
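
e.g. something like:

# mount -o compress=lzo /dev/sdX /subvol	# or compress=zlib, or neither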

btw: i was able to trigger ENOSPC with for-linus on 3.5.4 on an
i686 Pentium M notebook with only 1GB of memory and a fresh FS
this way; otherwise i haven't seen ENOSPC for a long time.

Ahmet

On Tue, Sep 25, 2012 at 7:02 PM, Josef Bacik <jbacik@fusionio.com> wrote:
> On Tue, Sep 25, 2012 at 10:43:36AM -0600, David Sterba wrote:
>> [...]
>>
>> implemented as follows on top of next/master, in short:
>> * disable overcommit completely
>> * make an optimistic best guess for the metadata and reserve only up
>>   to the current tree height
>>
>
> So I had tried to do this before; the problem is that when the height
> changes, our reserve changes.  [...]  Thanks,
>
> Josef



-- 

Ahmet Inan

Systemadministrator

Mathematisches Institut
Albert-Ludwigs-Universität Freiburg
Eckerstr. 1
79104 Freiburg im Breisgau
Tel.: +49(0)761 / 203-5552
Raum: 332
mailto:sysadm@email.mathematik.uni-freiburg.de

Abteilung für Angewandte Mathematik
Albert-Ludwigs-Universität Freiburg
Hermann-Herder-Str. 10
79104 Freiburg im Breisgau
Tel.: +49(0)761 / 203-5626
Raum: 221
mailto:admin@mathematik.uni-freiburg.de

Abteilung für Mathematische Stochastik
Albert-Ludwigs-Universität Freiburg
Eckerstr. 1
79104 Freiburg im Breisgau
Tel.: +49(0)761 / 203-5678
Raum: 249
mailto:helpdesk@stochastik.uni-freiburg.de


* Re: ENOSPC design issues
From: Josef Bacik @ 2012-09-26 13:00 UTC
  To: Ahmet Inan; +Cc: Josef Bacik, David Sterba, linux-btrfs

On Wed, Sep 26, 2012 at 01:55:47AM -0600, Ahmet Inan wrote:
> when testing, please also do something like this:
>
> # create big squashfs image somewhere:
> # mksquashfs / /big.img -noappend -no-sparse -e big.img
>
> # then unpack into fresh filesystem with (and without) compression:
> # unsquashfs -f -d /subvol /big.img
>
> this is how i was always able to trigger ENOSPC while trying to
> make a full system installation from a squashfs image.
>
> you should also try different compression algos (i only use lzo)
>
> btw: i was able to trigger ENOSPC with for-linus on 3.5.4 on an
> i686 Pentium M notebook with only 1GB of memory and a fresh FS
> this way; otherwise i haven't seen ENOSPC for a long time.
> 

Thanks, I will try that now,

Josef


* Re: ENOSPC design issues
From: Josef Bacik @ 2012-09-26 13:11 UTC
  To: Ahmet Inan; +Cc: Josef Bacik, David Sterba, linux-btrfs

On Wed, Sep 26, 2012 at 01:55:47AM -0600, Ahmet Inan wrote:
> when testing, please also do something like this:
>
> # create big squashfs image somewhere:
> # mksquashfs / /big.img -noappend -no-sparse -e big.img
>
> # then unpack into fresh filesystem with (and without) compression:
> # unsquashfs -f -d /subvol /big.img
>
> this is how i was always able to trigger ENOSPC while trying to
> make a full system installation from a squashfs image.
>
> you should also try different compression algos (i only use lzo)
>
> btw: i was able to trigger ENOSPC with for-linus on 3.5.4 on an
> i686 Pentium M notebook with only 1GB of memory and a fresh FS
> this way; otherwise i haven't seen ENOSPC for a long time.
>

Have you actually seen this problem on 3.5 with no compression?  I fixed a
problem in the 3.5 timeframe that should have fixed this for no-compression, and
then I've since fixed the compression part of it in btrfs-next.  Can you retest
with btrfs-next and see if you still see the problem?  In the meantime I'm
running mksquashfs, but it's going to take a few years to complete ;).  Thanks,

Josef 


* Re: ENOSPC design issues
From: Ahmet Inan @ 2012-09-27 15:39 UTC
  To: Josef Bacik; +Cc: David Sterba, linux-btrfs

Josef,

> Have you actually seen this problem on 3.5 with no compression?
you're right, no problems without compression.

> problem in the 3.5 timeframe that should have fixed this for no-compression, and
> then I've since fixed the compression part of it in btrfs-next.  Can you retest
i cherry-picked all the enospc patches and their dependencies from
your btrfs-next on top of 3.5.4 + for-linus, in this order:

ddabdd663aa819f206aaaf689b12475b1b11d71e
Btrfs: do not allocate chunks as agressively

dd314c3ef057dde30d1c8ae67c98019a71713ea4
Btrfs: turbo charge fsync

4d9f18a35d456c27e81c8e0de29c858f85d35c91
Btrfs: do not needlessly restart the transaction for enospc

82b8aa2bef5c5267d74ad8e5a82572d4fa85bf89
Btrfs: wait on async pages when shrinking delalloc

4af89eb8ce93a1ec64fc698f82f67a0e3bd8cbef
Btrfs: delay block group item insertion

i hope those are the patches you wanted me to test.
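
i.e. in one go (abbreviated SHAs from the list above):

# git cherry-pick ddabdd663aa8 dd314c3ef057 4d9f18a35d45 82b8aa2bef5c 4af89eb8ce93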

> with btrfs-next and see if you still see the problem?
i did, and got some nice results: no problems and it's faster!

these results are from an old i686 Pentium M notebook
with a 30GB HDD and 1.2GB RAM:

3.5.4 + for-linus + lzo: 6x enospc problems at 30% unsquashfs
*snip*
[================|                                         ] 269645/910412  29%
Write on output file failed because No space left on device
writer: failed to write data block 0
Failed to write /mnt/point/something, skipping
*snip*

# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   12G   16G  43% /mnt/point

one run:
real    28m44.882s
user    7m4.542s
sys     4m18.353s

3.5.4 + for-linus + no compression: no problems
# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   18G  9.0G  67% /mnt/point

one run:
real    20m57.157s
user    5m37.915s
sys     3m23.913s

3.5.4 + for-linus + enospc patches + lzo: no problems
# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   12G   17G  42% /mnt/point

first run:
real    19m35.852s
user    4m33.359s
sys     3m2.808s

second run:
real    23m12.371s
user    5m37.398s
sys     3m47.525s

3.5.4 + for-linus + enospc patches + no compression: no problems
# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   18G  9.5G  66% /mnt/point

one run:
real    22m56.940s
user    4m53.571s
sys     3m4.188s

great work.
i'm gonna keep those patches and deploy them on my systems here soon
after some more testing.

Ahmet


* Re: ENOSPC design issues
From: David Sterba @ 2012-10-01  0:00 UTC
  To: Josef Bacik; +Cc: David Sterba, linux-btrfs

On Tue, Sep 25, 2012 at 01:02:39PM -0400, Josef Bacik wrote:
> > implemented as follows on top of next/master, in short:
> > * disable overcommit completely
> > * make an optimistic best guess for the metadata and reserve only up
> >   to the current tree height
> 
> So I had tried to do this before; the problem is that when the height
> changes, our reserve changes.  So for things like delalloc we say we have
> X number of extents and we reserve that much space, but then when we run
> delalloc we re-calculate the metadata size for the X extents we've
> removed, and that number could come out differently since the height of
> the tree would have changed.

Understood, I did not come across this during testing because I did not
reproduce the required conditions, which do not seem frequent but
definitely are possible.

Let me proceed with V2:

* disable overcommit completely
* reserve only up to the current tree height + 1

You can object that the tree height may grow by 2, namely at the early
stage from 0 -> 2 if the number of operations is quite high and all of
this finishes within one transaction.

So, an improvement that also covers that:

* reserve up to MIN( BTRFS_MAX_LEVEL, max(tree->height + 1, 3) )

with the assumption that growing the tree from 3 -> 4 will take more
operations than could fit into one transaction period even on a fast
device.  (More accurate analysis is required, but this should give the
idea.)
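
A sketch of how that could look in btrfs_calc_trans_metadata_size()
(same shape as the V1 patch above, only the clamp changes):

static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
                                                 unsigned num_items)
{
	int level = btrfs_header_level(root->node);

	/* reserve up to MIN(BTRFS_MAX_LEVEL, max(height + 1, 3)) */
	level = min_t(int, BTRFS_MAX_LEVEL, max(level + 1, 3));

	return (root->leafsize + root->nodesize * (level - 1)) *
		3 * num_items;
}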

The estimated savings in metadata block reservations start at 3/8 (and
decrease with larger filesystem use).  This estimate does not have the
property of being 'almost exact' as the V1 proposal does, so some sort of
overcommit math should also take place.  I leave this out for now.

thanks,
david

