* [PATCH 0/3] Add comments describing how space reservation works
@ 2020-02-03 20:44 Josef Bacik
  2020-02-03 20:44 ` [PATCH 1/3] btrfs: add a comment describing block-rsvs Josef Bacik
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Josef Bacik @ 2020-02-03 20:44 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In talking with Nikolay about the data reservation patches it became clear that
there's way more about the space reservation system that exists solely in my
head than I thought.  I've written three big comments on the different sections
of the space reservation system to hopefully help people understand how the
system works generally.  Again this is from my point of view, and since I'm the
one who wrote it I'm not sure where the gaps in people's understanding are, so
if there's something that is not clear, now is the time to bring it up.
Hopefully this will make reviewing patches I submit to this system less scary
for reviewers.
Thanks,

Josef


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] btrfs: add a comment describing block-rsvs
  2020-02-03 20:44 [PATCH 0/3] Add comments describing how space reservation works Josef Bacik
@ 2020-02-03 20:44 ` Josef Bacik
  2020-02-04  9:30   ` Qu Wenruo
  2020-02-03 20:44 ` [PATCH 2/3] btrfs: add a comment describing delalloc space reservation Josef Bacik
  2020-02-03 20:44 ` [PATCH 3/3] btrfs: describe the space reservation system in general Josef Bacik
  2 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2020-02-03 20:44 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

This is a giant comment at the top of block-rsv.c describing generally
how block rsvs work.  It is purely about the block rsvs themselves, and
has nothing to do with how the actual reservation system works.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index d07bd41a7c1e..54380f477f80 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -6,6 +6,87 @@
 #include "space-info.h"
 #include "transaction.h"
 
+/*
+ * HOW DO BLOCK RSVS WORK
+ *
+ *   Think of block_rsv's as buckets for logically grouped reservations.  Each
+ *   block_rsv has a ->size and a ->reserved.  ->size is how large we want our
+ *   block rsv to be, ->reserved is how much space is currently reserved for
+ *   this block reserve.
+ *
+ *   ->failfast exists for the truncate case, and is described below.
+ *
+ * NORMAL OPERATION
+ *   We determine we need N items of reservation, we use the appropriate
+ *   btrfs_calc*() helper to determine the number of bytes.  We call into
+ *   reserve_metadata_bytes() and get our bytes, we then add this space to our
+ *   ->size and our ->reserved.
+ *
+ *   We go to modify the tree for our operation, we allocate a tree block, which
+ *   calls btrfs_use_block_rsv(), and subtracts nodesize from
+ *   block_rsv->reserved.
+ *
+ *   We finish our operation, we subtract our original reservation from ->size,
+ *   and then we subtract ->size from ->reserved if there is an excess and free
+ *   the excess back to the space info, by reducing space_info->bytes_may_use by
+ *   the excess amount.
+ *
+ *   In some cases we may return this excess to the global block reserve or
+ *   delayed refs reserve if either of their ->size is greater than their
+ *   ->reserved.
+ *
+ * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
+ *   These behave normally, as described above, just within the confines of the
+ *   lifetime of their particular operation (transaction for the whole trans
+ *   handle lifetime, for example).
+ *
+ * BLOCK_RSV_GLOBAL
+ *   This has existed forever, with diminishing degrees of importance.
+ *   Currently it exists to save us from ourselves.  We definitely over-reserve
+ *   space most of the time, but the nature of COW is that we do not know how
+ *   much space we may need to use for any given operation.  This is
+ *   particularly true about the extent tree.  Modifying one extent could
+ *   balloon into 1000 modifications of the extent tree, which we have no way of
+ *   properly predicting.  To cover this case we have the global reserve act as
+ *   the "root" space to allow us to not abort the transaction when things are
+ *   very tight.  As such we tend to treat this space as sacred, and only use it
+ *   if we are desperate.  Generally we should no longer be depending on its
+ *   space, and if new use cases arise we need to address them elsewhere.
+ *
+ * BLOCK_RSV_DELALLOC
+ *   The individual item sizes are determined by the per-inode size
+ *   calculations, which are described with the delalloc code.  This is pretty
+ *   straightforward; it's just that the calculation of ->size encodes a lot of
+ *   different items, and thus it gets used when updating inodes, inserting file
+ *   extents, and inserting checksums.
+ *
+ * BLOCK_RSV_DELREFS
+ *   We keep a running tally of how many delayed refs we have in the system.
+ *   We assume each one of these delayed refs is going to use a full
+ *   reservation.  We use the transaction items and pre-reserve space for every
+ *   operation, and use this reservation to refill any gap between ->size and
+ *   ->reserved that may exist.
+ *
+ *   From there it's straightforward, removing a delayed ref means we remove its
+ *   count from ->size and free up reservations as necessary.  Since this is the
+ *   most dynamic block rsv in the system, we will try to refill this block rsv
+ *   first with any excess returned by any other block reserve.
+ *
+ * BLOCK_RSV_EMPTY
+ *   This is the fallback block rsv to make us try to reserve space if we don't
+ *   have a specific bucket for this allocation.  It is mostly used for updating
+ *   the device tree and such, since that is a separate pool we're content to
+ *   just reserve space from the space_info on demand.
+ *
+ * BLOCK_RSV_TEMP
+ *   This is used by things like truncate and iput.  We will temporarily
+ *   allocate a block rsv, set it to some size, and then truncate bytes until we
+ *   have no space left.  With ->failfast set we'll simply return ENOSPC from
+ *   btrfs_use_block_rsv() to signal that we need to unwind and try to make a
+ *   new reservation.  This is because these operations are unbounded, so we
+ *   want to do as much work as we can, and then back off and re-reserve.
+ */
+
 static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_rsv *block_rsv,
 				    struct btrfs_block_rsv *dest, u64 num_bytes,
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
  2020-02-03 20:44 [PATCH 0/3] Add comments describing how space reservation works Josef Bacik
  2020-02-03 20:44 ` [PATCH 1/3] btrfs: add a comment describing block-rsvs Josef Bacik
@ 2020-02-03 20:44 ` Josef Bacik
  2020-02-04  9:48   ` Qu Wenruo
  2020-02-03 20:44 ` [PATCH 3/3] btrfs: describe the space reservation system in general Josef Bacik
  2 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2020-02-03 20:44 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

delalloc space reservation is tricky because it encompasses both data
and metadata.  Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c13d8609cc99..09a9c01fc1b5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -9,6 +9,96 @@
 #include "qgroup.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES THIS WORK
+ *
+ * There are two stages to delalloc space reservations: one for data, and one
+ * for metadata to handle the new extents and checksums generated by writing data.
+ *
+ *
+ * DATA RESERVATION
+ *   The data reservation stuff is relatively straightforward.  We want X bytes,
+ *   and thus need to make sure we have X bytes free in data space in order to
+ *   write that data.  If there is not X bytes free, allocate data chunks until
+ *   we can satisfy that reservation.  If we can no longer allocate data chunks,
+ *   attempt to flush space to see if we can now make the reservation.  See the
+ *   comment for data_flush_states to see how that flushing is accomplished.
+ *
+ *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
+ *   caller must keep track of this reservation and free it up if it is never
+ *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
+ *   bits on the inode's io_tree.  For direct IO it's more straightforward, we
+ *   take the reservation at the start of the operation, and if we write less
+ *   than we reserved we free the excess.
+ *
+ *   For the buffered case our reservation will take one of two paths
+ *
+ *   1) It is allocated.  In find_free_extent() we will call
+ *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
+ *   the size that we are covering with this allocation.  For non-compressed
+ *   these will be the same thing, but for compressed they could be different.
+ *   In any case, we increase space_info->bytes_reserved by the extent size, and
+ *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
+ *   the handling of this reserved space is the responsibility of the ordered
+ *   extent or the cow path.
+ *
+ *   2) There is an error, and we free it.  This is handled with the
+ *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
+ *   io_tree.
+ *
+ * METADATA RESERVATION
+ *   The general metadata reservation lifetimes are discussed elsewhere, this
+ *   will just focus on how it is used for delalloc space.
+ *
+ *   There are 3 things we are keeping reservations for.
+ *
+ *   1) Updating the inode item.  We hold a reservation for this inode as long
+ *   as there are dirty bytes outstanding for this inode.  This is because we
+ *   may update the inode multiple times throughout an operation, and there is
+ *   no telling when we may have to do a full cow back to that inode item.  Thus
+ *   we must always hold a reservation.
+ *
+ *   2) Adding an extent item.  This is trickier, so a few sub points
+ *
+ *     a) We keep track of how many extents an inode may need to create in
+ *     inode->outstanding_extents.  This is how many items we will have reserved
+ *     for the extents for this inode.
+ *
+ *     b) count_max_extents() is used to figure out how many extent items we
+ *     will need based on the contiguous area we have dirtied.  Thus if we are
+ *     writing 4k extents but they coalesce into a very large extent, we will
+ *     break this into smaller extents which means we'll need a reservation for
+ *     each of those extents.
+ *
+ *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
+ *     the number of extents needed for the contiguous area we just created,
+ *     and add that to inode->outstanding_extents.
+ *
+ *     d) We have no idea at reservation time how this new extent fits into
+ *     existing extents.  We unconditionally use count_max_extents() on the
+ *     reservation we are currently doing.  The reservation _must_ use
+ *     btrfs_delalloc_release_extents() once it has done its work to clean up
+ *     these outstanding extents.  This means that we will transiently have more
+ *     extent reservations for this inode than we need.  For example, say we
+ *     have a clean inode, and we do a buffered write of 4k.  The reservation
+ *     code will mod outstanding_extents to 1, and then set_delalloc will
+ *     increase it to 2.  Then once we are finished,
+ *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
+ *
+ *     e) Ordered extents take on the responsibility of their extent.  We know
+ *     that the ordered extent represents a single inode item, so it will modify
+ *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
+ *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
+ *     the ordered io is finished we drop the ->outstanding_extents by 1 and if
+ *     we are 0 we drop our inode item reservation as well.
+ *
+ *   3) Adding csums for the range.  This is more straightforward than the
+ *   extent items, as we just want to hold the number of bytes we'll need for
+ *   checksums until the ordered extent is removed.  If there is an error it is
+ *   cleared via the EXTENT_CLEAR_META_RESV bit when clearing EXTENT_DELALLOC
+ *   on the inode io_tree.
+ */
+
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 {
 	struct btrfs_root *root = inode->root;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] btrfs: describe the space reservation system in general
  2020-02-03 20:44 [PATCH 0/3] Add comments describing how space reservation works Josef Bacik
  2020-02-03 20:44 ` [PATCH 1/3] btrfs: add a comment describing block-rsvs Josef Bacik
  2020-02-03 20:44 ` [PATCH 2/3] btrfs: add a comment describing delalloc space reservation Josef Bacik
@ 2020-02-03 20:44 ` Josef Bacik
  2020-02-04 10:14   ` Qu Wenruo
  2 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2020-02-03 20:44 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Add another comment to cover how the space reservation system works
generally.  This covers the actual reservation flow, as well as how
flushing is handled.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 128 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d3befc536a7f..6de1fbe2835a 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -10,6 +10,134 @@
 #include "transaction.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES SPACE RESERVATION WORK
+ *
+ * If you want to know about delalloc specifically, there is a separate comment
+ * for that with the delalloc code.  This comment is about how the whole system
+ * works generally.
+ *
+ * BASIC CONCEPTS
+ *
+ *   1) space_info.  This is the ultimate arbiter of how much space we can use.
+ *   There's a description of the bytes_ fields with the struct declaration,
+ *   refer to that for specifics on each field.  Suffice it to say that for
+ *   reservations we care about total_bytes - SUM(space_info->bytes_) when
+ *   determining if there is space to make an allocation.
+ *
+ *   2) block_rsv's.  These are basically buckets for every different type of
+ *   metadata reservation we have.  You can see the comment in the block_rsv
+ *   code on the rules for each type, but generally block_rsv->reserved is how
+ *   much space is accounted for in space_info->bytes_may_use.
+ *
+ *   3) btrfs_calc*_size.  These are the worst case calculations we use based
+ *   on the number of items we will want to modify.  We have one for changing
+ *   items, and one for inserting new items.  Generally we use these helpers to
+ *   determine the size of the block reserves, and then use the actual bytes
+ *   values to adjust the space_info counters.
+ *
+ * MAKING RESERVATIONS, THE NORMAL CASE
+ *
+ *   Things wanting to make reservations will calculate the size that they want
+ *   and make a reservation request.  If there is sufficient space, and there
+ *   are no current reservations pending, we will adjust
+ *   space_info->bytes_may_use by this amount.
+ *
+ *   Once we allocate an extent, we will add that size to ->bytes_reserved and
+ *   subtract the size from ->bytes_may_use.  Once that extent is written out we
+ *   subtract that value from ->bytes_reserved and add it to ->bytes_used.
+ *
+ *   If there is an error at any point the reserver is responsible for dropping
+ *   its reservation from ->bytes_may_use.
+ *
+ * MAKING RESERVATIONS, FLUSHING
+ *
+ *   If we are unable to satisfy our reservation, or if there are pending
+ *   reservations already, we will create a reserve ticket and add ourselves to
+ *   the appropriate list.  This is controlled by btrfs_reserve_flush_enum.  For
+ *   simplicity's sake this boils down to two cases, priority and normal.
+ *
+ *   1) Priority.  These reservations are important and have limited ability to
+ *   flush space.  For example, the relocation code currently tries to make a
+ *   reservation under a transaction commit, thus it cannot wait on anything
+ *   that may want to commit the transaction.  These tasks will add themselves
+ *   to the priority list and thus get any new space first, and then they can
+ *   flush space directly in their own context that is safe for them to do
+ *   without causing a deadlock.
+ *
+ *   2) Normal.  These reservations can wait forever on anything, because they do
+ *   not hold resources that they would deadlock on.  These tickets simply go to
+ *   sleep and start an async thread that will flush space on their behalf.
+ *   Every time one of the ->bytes_* counters is adjusted for the space info, we
+ *   will check to see if there is enough space to satisfy the requests (in
+ *   order) on either of our lists.  If there is enough space we will set the
+ *   ticket->bytes = 0, and wake the task up.  If we flush a few times and fail
+ *   to make any progress we will wake up all of the tickets and fail them all.
+ *
+ * THE FLUSHING STATES
+ *
+ *   Generally speaking we will have two cases for each state, a "nice" state
+ *   and an "ALL THE THINGS" state.  In btrfs we delay a lot of work in order to
+ *   reduce the locking overhead on the various trees, and even to keep from
+ *   doing any work at all in the case of delayed refs.  Each of these delayed
+ *   things however hold reservations, and so letting them run allows us to
+ *   reclaim space so we can make new reservations.
+ *
+ *   FLUSH_DELAYED_ITEMS
+ *     Every inode has a delayed item to update the inode.  Take a simple write
+ *     for example, we would update the inode item at write time to update the
+ *     mtime, and then again at finish_ordered_io() time in order to update the
+ *     isize or bytes.  We keep these delayed items to coalesce these operations
+ *     into a single operation done on demand.  These are an easy way to reclaim
+ *     metadata space.
+ *
+ *   FLUSH_DELALLOC
+ *     Look at the delalloc comment to get an idea of how much space is reserved
+ *     for delayed allocation.  We can reclaim some of this space simply by
+ *     running delalloc, but usually we need to wait for ordered extents to
+ *     reclaim the bulk of this space.
+ *
+ *   FLUSH_DELAYED_REFS
+ *     We have a block reserve for the outstanding delayed refs space, and every
+ *     delayed ref operation holds a reservation.  Running these is a quick way
+ *     to reclaim space, but we want to hold this until the end because COW can
+ *     churn a lot and we can avoid making some extent tree modifications if we
+ *     are able to delay for as long as possible.
+ *
+ *   ALLOC_CHUNK
+ *     We will skip this the first time through space reservation, because of
+ *     overcommit: we don't want to allocate a lot of useless metadata space
+ *     when our worst case reservations will likely never come true.
+ *
+ *   RUN_DELAYED_IPUTS
+ *     If we're freeing inodes we're likely freeing checksums, file extent
+ *     items, and extent tree items.  Loads of space could be freed up by these
+ *     operations, however they won't be usable until the transaction commits.
+ *
+ *   COMMIT_TRANS
+ *     may_commit_transaction() is the ultimate arbiter on whether we commit the
+ *     transaction or not.  In order to avoid constantly churning we do all the
+ *     above flushing first and then commit the transaction as the last resort.
+ *     However we need to take into account things like pinned space that would
+ *     be freed, plus any delayed work we may not have gotten rid of in the case
+ *     of metadata.
+ *
+ * OVERCOMMIT
+ *   Because we hold so many reservations for metadata we will allow you to
+ *   reserve more space than is currently free in the currently allocated
+ *   metadata space.  This only happens with metadata; data does not allow
+ *   overcommitting.
+ *
+ *   You can see the current logic for when we allow overcommit in
+ *   btrfs_can_overcommit(), but it only applies to unallocated space.  If there
+ *   is no unallocated space to be had, all reservations are kept within the
+ *   free space in the allocated metadata chunks.
+ *
+ *   Because of overcommitting, you generally want to use the
+ *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
+ *   thing with or without extra unallocated space.
+ */
+
 u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 			  bool may_use_included)
 {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] btrfs: add a comment describing block-rsvs
  2020-02-03 20:44 ` [PATCH 1/3] btrfs: add a comment describing block-rsvs Josef Bacik
@ 2020-02-04  9:30   ` Qu Wenruo
  2020-02-04 10:32     ` Nikolay Borisov
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-02-04  9:30 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team





On 2020/2/4 4:44 AM, Josef Bacik wrote:
> This is a giant comment at the top of block-rsv.c describing generally
> how block rsvs work.  It is purely about the block rsv's themselves, and
> nothing to do with how the actual reservation system works.

Such a comment really helps!

Although it looks like there are a lot of words but very little ASCII
art or graphs, so I'm not sure it's really easy to read.

And some questions inlined below.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 81 insertions(+)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index d07bd41a7c1e..54380f477f80 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -6,6 +6,87 @@
>  #include "space-info.h"
>  #include "transaction.h"
>  
> +/*
> + * HOW DO BLOCK RSVS WORK
> + *
> + *   Think of block_rsv's as buckets for logically grouped reservations.  Each
> + *   block_rsv has a ->size and a ->reserved.  ->size is how large we want our
> + *   block rsv to be, ->reserved is how much space is currently reserved for
> + *   this block reserve.

This block_rsv system is for metadata only, right?
Since data size can be easily determined, there's no need for such a complex system.

> + *
> + *   ->failfast exists for the truncate case, and is described below.
> + *
> + * NORMAL OPERATION
> + *   We determine we need N items of reservation, we use the appropriate
> + *   btrfs_calc*() helper to determine the number of bytes.  We call into
> + *   reserve_metadata_bytes() and get our bytes, we then add this space to our
> + *   ->size and our ->reserved.

Since you mentioned bytes_may_use in the finish part, what about also
mentioning that here?

> + *
> + *   We go to modify the tree for our operation, we allocate a tree block, which
> + *   calls btrfs_use_block_rsv(), and subtracts nodesize from
> + *   block_rsv->reserved.
> + *
> + *   We finish our operation, we subtract our original reservation from ->size,
> + *   and then we subtract ->size from ->reserved if there is an excess and free
> + *   the excess back to the space info, by reducing space_info->bytes_may_use by
> + *   the excess amount.

So I find the workflow can be expressed like this using a timeline (?) graph:

+--- Reserve:
|    Entrance: btrfs_block_rsv_add(), btrfs_block_rsv_refill()
|
|    Calculate needed bytes by btrfs_calc*(), then add the needed space
|    to our ->size and our ->reserved.
|    This also contributes to space_info->bytes_may_use.
|
+--- Use:
|    Entrance: btrfs_use_block_rsv()
|
|    We're allocating a tree block, which will subtract nodesize from
|    block_rsv->reserved.
|
+--- Finish:
     Entrance: btrfs_block_rsv_release()

     we subtract our original reservation from ->size,
     and then we subtract ->size from ->reserved if there is an excess
     and free the excess back to the space info, by reducing
     space_info->bytes_may_use by the excess amount.

> + *
> + *   In some cases we may return this excess to the global block reserve or
> + *   delayed refs reserve if either of their ->size is greater than their
> + *   ->reserved.
> + *

Types of block_rsv:

> + * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
> + *   These behave normally, as described above, just within the confines of the
> + *   lifetime of their particular operation (transaction for the whole trans
> + *   handle lifetime, for example).
> + *
> + * BLOCK_RSV_GLOBAL
> + *   This has existed forever, with diminishing degrees of importance.
> + *   Currently it exists to save us from ourselves.  We definitely over-reserve
> + *   space most of the time, but the nature of COW is that we do not know how
> + *   much space we may need to use for any given operation.  This is
> + *   particularly true about the extent tree.  Modifying one extent could
> + *   balloon into 1000 modifications of the extent tree, which we have no way of
> + *   properly predicting.  To cover this case we have the global reserve act as
> + *   the "root" space to allow us to not abort the transaction when things are
> + *   very tight.  As such we tend to treat this space as sacred, and only use it
> + *   if we are desperate.  Generally we should no longer be depending on its
> + *   space, and if new use cases arise we need to address them elsewhere.

Although we all know global rsv is really important for essential tree
updates, can we make it a little simpler?
It looks too long to read though.

I guess we don't need to put all related info here.
Maybe just mentioning the usage of each type is enough?
(Since the reader will still go greping for more details)

This also applies to the remaining types.

Thanks,
Qu

> + *
> + * BLOCK_RSV_DELALLOC
> + *   The individual item sizes are determined by the per-inode size
> + *   calculations, which are described with the delalloc code.  This is pretty
> + *   straightforward, it's just the calculation of ->size encodes a lot of
> + *   different items, and thus it gets used when updating inodes, inserting file
> + *   extents, and inserting checksums.
> + *
> + * BLOCK_RSV_DELREFS
> + *   We keep a running tally of how many delayed refs we have in the system.
> + *   We assume each one of these delayed refs is going to use a full
> + *   reservation.  We use the transaction items and pre-reserve space for every
> + *   operation, and use this reservation to refill any gap between ->size and
> + *   ->reserved that may exist.
> + *
> + *   From there it's straightforward, removing a delayed ref means we remove its
> + *   count from ->size and free up reservations as necessary.  Since this is the
> + *   most dynamic block rsv in the system, we will try to refill this block rsv
> + *   first with any excess returned by any other block reserve.
> + *
> + * BLOCK_RSV_EMPTY
> + *   This is the fallback block rsv to make us try to reserve space if we don't
> + *   have a specific bucket for this allocation.  It is mostly used for updating
> + *   the device tree and such, since that is a separate pool we're content to
> + *   just reserve space from the space_info on demand.
> + *
> + * BLOCK_RSV_TEMP
> + *   This is used by things like truncate and iput.  We will temporarily
> + *   allocate a block rsv, set it to some size, and then truncate bytes until we
> + *   have no space left.  With ->failfast set we'll simply return ENOSPC from
> + *   btrfs_use_block_rsv() to signal that we need to unwind and try to make a
> + *   new reservation.  This is because these operations are unbounded, so we
> + *   want to do as much work as we can, and then back off and re-reserve.
> + */
> +
>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  				    struct btrfs_block_rsv *block_rsv,
>  				    struct btrfs_block_rsv *dest, u64 num_bytes,
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
  2020-02-03 20:44 ` [PATCH 2/3] btrfs: add a comment describing delalloc space reservation Josef Bacik
@ 2020-02-04  9:48   ` Qu Wenruo
  2020-02-04 12:27     ` Nikolay Borisov
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-02-04  9:48 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team





On 2020/2/4 4:44 AM, Josef Bacik wrote:
> delalloc space reservation is tricky because it encompasses both data
> and metadata.  Make it clear what each side does, the general flow of
> how space is moved throughout the lifetime of a write, and what goes
> into the calculations.

In fact, the lifespan of a write would be super helpful for newbies.

I still remember the pain when trying to understand the whole mechanism
years ago.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c13d8609cc99..09a9c01fc1b5 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -9,6 +9,96 @@
>  #include "qgroup.h"
>  #include "block-group.h"
>  
> +/*
> + * HOW DOES THIS WORK
> + *
> + * There are two stages to data reservations, one for data and one for metadata
> + * to handle the new extents and checksums generated by writing data.
> + *
> + *
> + * DATA RESERVATION
> + *   The data reservation stuff is relatively straightforward.  We want X bytes,
> + *   and thus need to make sure we have X bytes free in data space in order to
> + *   write that data.  If there is not X bytes free, allocate data chunks until
> + *   we can satisfy that reservation.  If we can no longer allocate data chunks,
> + *   attempt to flush space to see if we can now make the reservation.  See the
> + *   comment for data_flush_states to see how that flushing is accomplished.

What about a shorter version like this?
We want X bytes of data space.

If there is not enough, try the following methods in order:
- Allocate new data chunk
- Flush space
  See comment for data_flush_states

> + *
> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
> + *   caller must keep track of this reservation and free it up if it is never
> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
> + *   bits on the inode's io_tree.  For direct IO it's more straightforward, we
> + *   take the reservation at the start of the operation, and if we write less
> + *   than we reserved we free the excess.

This part involves the lifespan and state machine of data.
I guess more explanation on the state machine would help a lot.

Like:
Page clean
|
+- btrfs_buffered_write()
|  Reserve data space for data, metadata space for csum/file
|  extents/inodes.
|
Page dirty
|
+- run_delalloc_range()
|  Allocate data extents, submit ordered extents to do csum calculation
|  and bio submission
|
Page write back
|
+- finish_ordered_io()
|  Insert csum and file extents
|
Page clean

Although I'm not sure if such lifespan should belong to delalloc-space.c.

> + *
> + *   For the buffered case our reservation will take one of two paths
> + *
> + *   1) It is allocated.  In find_free_extent() we will call
> + *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
> + *   the size that we are covering with this allocation.  For non-compressed
> + *   these will be the same thing, but for compressed they could be different.
> + *   In any case, we increase space_info->bytes_reserved by the extent size, and
> + *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
> + *   the handling of this reserved space is the responsibility of the ordered
> + *   extent or the cow path.
> + *
> + *   2) There is an error, and we free it.  This is handled with the
> + *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
> + *   io_tree.
> + *
> + * METADATA RESERVATION
> + *   The general metadata reservation lifetimes are discussed elsewhere, this
> + *   will just focus on how it is used for delalloc space.
> + *
> + *   There are 3 things we are keeping reservations for.

It looks like the 3 things are too detailed, I guess?
It's definitely educational, but I'm not sure it fits the introductory
nature of such a comment.

I guess we only need to mention:
- Objective
  What this meta rsv is used for (inode item, file extents, csum)

- Location of interest
  Important details. (outstanding extents and DELALLOC bits for metadata
  rsv calculation)

- Timing of such rsv
  When we reserve/update, use and release. (function entrance)

Then it should be enough for people to dig for their own interest.

Thanks,
Qu

> + *
> + *   1) Updating the inode item.  We hold a reservation for this inode as long
> + *   as there are dirty bytes outstanding for this inode.  This is because we
> + *   may update the inode multiple times throughout an operation, and there is
> + *   no telling when we may have to do a full cow back to that inode item.  Thus
> + *   we must always hold a reservation.
> + *
> + *   2) Adding an extent item.  This is trickier, so a few sub points
> + *
> + *     a) We keep track of how many extents an inode may need to create in
> + *     inode->outstanding_extents.  This is how many items we will have reserved
> + *     for the extents for this inode.
> + *
> + *     b) count_max_extents() is used to figure out how many extent items we
> + *     will need based on the contiguous area we have dirtied.  Thus if we are
> + *     writing 4k extents but they coalesce into a very large extent, we will
> + *     break this into smaller extents which means we'll need a reservation for
> + *     each of those extents.
> + *
> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
> + *     the nummber of extents needed for the contiguous area we just created,
> + *     and add that to inode->outstanding_extents.
> + *
> + *     d) We have no idea at reservation time how this new extent fits into
> + *     existing extents.  We unconditionally use count_max_extents() on the
> + *     reservation we are currently doing.  The reservation _must_ use
> + *     btrfs_delalloc_release_extents() once it has done its work to clear up
> + *     these outstanding extents.  This means that we will transiently have
> + *     more extent reservations for this inode than we need.  For example say we
> + *     have a clean inode, and we do a buffered write of 4k.  The reservation
> + *     code will mod outstanding_extents to 1, and then set_delalloc will
> + *     increase it to 2.  Then once we are finished,
> + *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
> + *
> + *     e) Ordered extents take on the responsibility of their extent.  We know
> + *     that the ordered extent represents a single extent item, so it will modify
> + *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
> + *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
> + *     the ordered io is finished we drop the ->outstanding_extents by 1, and
> + *     if it reaches 0 we drop our inode item reservation as well.
> + *
> + *   3) Adding csums for the range.  This is more straightforward than the
> + *   extent items, as we just want to hold the number of bytes we'll need for
> + *   checksums until the ordered extent is removed.  If there is an error it is
> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
> + *   on the inode io_tree.
> + */
> +
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  {
>  	struct btrfs_root *root = inode->root;
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] btrfs: describe the space reservation system in general
  2020-02-03 20:44 ` [PATCH 3/3] btrfs: describe the space reservation system in general Josef Bacik
@ 2020-02-04 10:14   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-04 10:14 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs, kernel-team


[-- Attachment #1.1: Type: text/plain, Size: 8891 bytes --]



On 2020/2/4 4:44 AM, Josef Bacik wrote:
> Add another comment to cover how the space reservation system works
> generally.  This covers the actual reservation flow, as well as how
> flushing is handled.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 128 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 128 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index d3befc536a7f..6de1fbe2835a 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -10,6 +10,134 @@
>  #include "transaction.h"
>  #include "block-group.h"
>  
> +/*
> + * HOW DOES SPACE RESERVATION WORK
> + *
> + * If you want to know about delalloc specifically, there is a separate comment
> + * for that with the delalloc code.  This comment is about how the whole system
> + * works generally.
> + *
> + * BASIC CONCEPTS
> + *
> + *   1) space_info.  This is the ultimate arbiter of how much space we can use.
> + *   There's a description of the bytes_ fields with the struct declaration,
> + *   refer to that for specifics on each field.  Suffice it to say that for
> + *   reservations we care about total_bytes - SUM(space_info->bytes_) when
> + *   determining if there is space to make an allocation.

How about mentioning the 3 types of space info: DATA, META, SYS?

And it may be a good time to also update the comment of
btrfs_space_info::bytes_*?

> + *
> + *   2) block_rsv's.  These are basically buckets for every different type of
> + *   metadata reservation we have.  You can see the comment in the block_rsv
> + *   code on the rules for each type, but generally block_rsv->reserved is how
> + *   much space is accounted for in space_info->bytes_may_use.
> + *
> + *   3) btrfs_calc*_size.  These are the worst case calculations we use based
> + *   on the number of items we will want to modify.  We have one for changing
> + *   items, and one for inserting new items.  Generally we use these helpers to
> + *   determine the size of the block reserves, and then use the actual bytes
> + *   values to adjust the space_info counters.
> + *
> + * MAKING RESERVATIONS, THE NORMAL CASE
> + *
> + *   Things wanting to make reservations will calculate the size that they want
> + *   and make a reservation request.  If there is sufficient space, and there
> + *   are no current reservations pending, we will adjust
> + *   space_info->bytes_may_use by this amount.
> + *
> + *   Once we allocate an extent, we will add that size to ->bytes_reserved and
> + *   subtract the size from ->bytes_may_use.  Once that extent is written out we
> + *   subtract that value from ->bytes_reserved and add it to ->bytes_used.

Lifespan! And definitely a graph would be easier to read, with less words.

> + *
> + *   If there is an error at any point the reserver is responsible for dropping
> + *   its reservation from ->bytes_may_use.
> + *
> + * MAKING RESERVATIONS, FLUSHING
> + *
> + *   If we are unable to satisfy our reservation, or if there are pending
> + *   reservations already, we will create a reserve ticket and add ourselves to
> + *   the appropriate list.  This is controlled by btrfs_reserve_flush_enum.  For
> + *   simplicity's sake this boils down to two cases, priority and normal.

This is the core of the whole ticketing space rsv system.
It definitely needs something like an objective (to get some free space)
and the entrance functions.

> + *
> + *   1) Priority.  These reservations are important and have limited ability to
> + *   flush space.  For example, the relocation code currently tries to make a
> + *   reservation under a transaction commit, thus it cannot wait on anything
> + *   that may want to commit the transaction.  These tasks will add themselves
> + *   to the priority list and thus get any new space first, and then they can
> + *   flush space directly in their own context that is safe for them to do
> + *   without causing a deadlock.
> + *
> + *   2) Normal.  These reservations can wait forever on anything, because they do
> + *   not hold resources that they would deadlock on.  These tickets simply go to
> + *   sleep and start an async thread that will flush space on their behalf.
> + *   Every time one of the ->bytes_* counters is adjusted for the space info, we
> + *   will check to see if there is enough space to satisfy the requests (in
> + *   order) on either of our lists.  If there is enough space we will set the
> + *   ticket->bytes = 0, and wake the task up.  If we flush a few times and fail
> + *   to make any progress we will wake up all of the tickets and fail them all.
> + *
> + * THE FLUSHING STATES
> + *
> + *   Generally speaking we will have two cases for each state, a "nice" state
> + *   and a "ALL THE THINGS" state.  In btrfs we delay a lot of work in order to
> + *   reduce the locking over head on the various trees, and even to keep from
> + *   doing any work at all in the case of delayed refs.  Each of these delayed
> + *   things however hold reservations, and so letting them run allows us to
> + *   reclaim space so we can make new reservations.

So, it's just some delayed work which can free space if run.
The last sentence looks sufficient to me.

> + *
> + *   FLUSH_DELAYED_ITEMS
> + *     Every inode has a delayed item to update the inode (item).

The best explanation. This one sentence is enough to solve my question
on what delayed items are.

>  Take a simple write
> + *     for example, we would update the inode item at write time to update the
> + *     mtime, and then again at finish_ordered_io() time in order to update the
> + *     isize or bytes.  We keep these delayed items to coalesce these operations
> + *     into a single operation done on demand.  These are an easy way to reclaim
> + *     metadata space.
> + *
> + *   FLUSH_DELALLOC
> + *     Look at the delalloc comment to get an idea of how much space is reserved
> + *     for delayed allocation.  We can reclaim some of this space simply by
> + *     running delalloc, but usually we need to wait for ordered extents to
> + *     reclaim the bulk of this space.
> + *
> + *   FLUSH_DELAYED_REFS
> + *     We have a block reserve for the outstanding delayed refs space, and every
> + *     delayed ref operation holds a reservation.  Running these is a quick way
> + *     to reclaim space, but we want to hold this until the end because COW can
> + *     churn a lot and we can avoid making some extent tree modifications if we
> + *     are able to delay for as long as possible.
> + *
> + *   ALLOC_CHUNK
> + *     We will skip this the first time through space reservation, because of
> + *     overcommit and we don't want to have a lot of useless metadata space when
> + *     our worst case reservations will likely never come true.
> + *
> + *   RUN_DELAYED_IPUTS

Although I guess we all know what delayed iput is doing, a one-line
explanation would definitely help.

> + *     If we're freeing inodes we're likely freeing checksums, file extent
> + *     items, and extent tree items.  Loads of space could be freed up by these
> + *     operations, however they won't be usable until the transaction commits.
> + *
> + *   COMMIT_TRANS
> + *     may_commit_transaction() is the ultimate arbiter on whether we commit the
> + *     transaction or not.  In order to avoid constantly churning we do all the
> + *     above flushing first and then commit the transaction as the last resort.
> + *     However we need to take into account things like pinned space that would
> + *     be freed, plus any delayed work we may not have gotten rid of in the case
> + *     of metadata.

All these comments make sense, it would be fantastic if we could reduce
the number of lines though.

Thanks for all these comments, they really help,
Qu

> + *
> + * OVERCOMMIT
> + *   Because we hold so many reservations for metadata we will allow you to
> + *   reserve more space than is currently free in the currently allocated
> + *   metadata space.  This only happens with metadata, data does not allow
> + *   overcommitting.
> + *
> + *   You can see the current logic for when we allow overcommit in
> + *   btrfs_can_overcommit(), but it only applies to unallocated space.  If there
> + *   is no unallocated space to be had, all reservations are kept within the
> + *   free space in the allocated metadata chunks.
> + *
> + *   Because of overcommitting, you generally want to use the
> + *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
> + *   thing with or without extra unallocated space.
> + */
> +
>  u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
>  			  bool may_use_included)
>  {
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] btrfs: add a comment describing block-rsvs
  2020-02-04  9:30   ` Qu Wenruo
@ 2020-02-04 10:32     ` Nikolay Borisov
  0 siblings, 0 replies; 11+ messages in thread
From: Nikolay Borisov @ 2020-02-04 10:32 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik, linux-btrfs, kernel-team



On 4.02.20 at 11:30, Qu Wenruo wrote:
> 
> 
> On 2020/2/4 4:44 AM, Josef Bacik wrote:
>> This is a giant comment at the top of block-rsv.c describing generally
>> how block rsvs work.  It is purely about the block rsv's themselves, and
>> nothing to do with how the actual reservation system works.
> 
> Such comment really helps!
> 
> Although it looks like there are too many words but too little ascii
> art or graphs.
> I'm not sure it's really easy to read.
> 
> And some questions inlined below.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 81 insertions(+)
>>
>> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
>> index d07bd41a7c1e..54380f477f80 100644
>> --- a/fs/btrfs/block-rsv.c
>> +++ b/fs/btrfs/block-rsv.c
>> @@ -6,6 +6,87 @@
>>  #include "space-info.h"
>>  #include "transaction.h"
>>


<snip>

>> + *
>> + *   We go to modify the tree for our operation, we allocate a tree block, which
>> + *   calls btrfs_use_block_rsv(), and subtracts nodesize from
>> + *   block_rsv->reserved.
>> + *
>> + *   We finish our operation, we subtract our original reservation from ->size,
>> + *   and then we subtract ->size from ->reserved if there is an excess and free
>> + *   the excess back to the space info, by reducing space_info->bytes_may_use by
>> + *   the excess amount.
> 
> So I find the workflow can be expressed like this using timeline (?) graph:
> 
> +--- Reserve:
> |    Entrance: btrfs_block_rsv_add(), btrfs_block_rsv_refill()
> |
> |    Calculate needed bytes by btrfs_calc*(), then add the needed space
> |    to our ->size and our ->reserved.
> |    This also contributes to space_info->bytes_may_use.
> |
> +--- Use:
> |    Entrance: btrfs_use_block_rsv()
> |
> |    We're allocating a tree block, will subtracts nodesize from
> |    block_rsv->reserved.
> |
> +--- Finish:
>      Entrance: btrfs_block_rsv_release()
> 
>      we subtract our original reservation from ->size,
>      and then we subtract ->size from ->reserved if there is an excess
>      and free the excess back to the space info, by reducing
>      space_info->bytes_may_use by the excess amount.

I find this graphic helpful. Also IMO it's important to explicitly state
that ->size is based on an overestimation, whereas the space subtracted
from ->reserved is always based on real usage, hence we can have a case
where we end up with  excess space that can be returned.

Over reservation is mentioned in the BLOCK_RSV_GLOBAL paragraph but I
think it should be here and can be removed from there.
> 
>> + *
>> + *   In some cases we may return this excess to the global block reserve or
>> + *   delayed refs reserve if either of their ->size is greater than their
>> + *   ->reserved.
>> + *
> 
> Types of block_rsv:
> 
>> + * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
>> + *   These behave normally, as described above, just within the confines of the
>> + *   lifetime of their particular operation (transaction for the whole trans
>> + *   handle lifetime, for example).
>> + *
>> + * BLOCK_RSV_GLOBAL
>> + *   This has existed forever, with diminishing degrees of importance.
>> + *   Currently it exists to save us from ourselves.  We definitely over-reserve
>> + *   space most of the time, but the nature of COW is that we do not know how
>> + *   much space we may need to use for any given operation.  This is
>> + *   particularly true about the extent tree.  Modifying one extent could
>> + *   balloon into 1000 modifications of the extent tree, which we have no way of
>> + *   properly predicting.  To cover this case we have the global reserve act as
>> + *   the "root" space to allow us to not abort the transaciton when things are
nit: s/transaciton/transaction
>> + *   very tight.  As such we tend to treat this space as sacred, and only use it
>> + *   if we are desparate.  Generally we should no longer be depending on its
nit: s/desparate/desperate

>> + *   space, and if new use cases arise we need to address them elsewhere.
> 
> Although we all know global rsv is really important for essential tree
> updates, can we make it a little simpler?
> It looks too long to read though.

The 2nd sentence of the paragraph can be removed. Also it can be
mentioned that globalrsv is used for other trees apart from extent i.e
chunk/csum ones. Also isn't it used to ensure progress of unlink() ?

> 
> I guess we don't need to put all related info here.
> Maybe just mentioning the usage of each type is enough?
> (Since the reader will still go grepping for more details)
> 
> This also applies to the remaining types.


I disagree, those comments provide glimpses of the problem that
necessitated having block rsv in the first place. It's good to read this
before diving into the code.

<snip>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
  2020-02-04  9:48   ` Qu Wenruo
@ 2020-02-04 12:27     ` Nikolay Borisov
  2020-02-04 12:39       ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2020-02-04 12:27 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik, linux-btrfs, kernel-team



On 4.02.20 at 11:48, Qu Wenruo wrote:
>


<snip>

> 
>> + *
>> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
>> + *   caller must keep track of this reservation and free it up if it is never
>> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
>> + *   bits on the inode's io_tree.  For direct IO it's more straightforward, we
>> + *   take the reservation at the start of the operation, and if we write less
>> + *   than we reserved we free the excess.
> 
> This part involves the lifespan and state machine of data.
> I guess more explanation on the state machine would help a lot.
> 
> Like:
> Page clean
> |
> +- btrfs_buffered_write()
> |  Reserve data space for data, metadata space for csum/file
> |  extents/inodes.
> |
> Page dirty
> |
> +- run_delalloc_range()
> |  Allocate data extents, submit ordered extents to do csum calculation
> |  and bio submission
> Page write back
> |
> +- finish_ordered_io()
> |  Insert csum and file extents
> |
> Page clean
> 
> Although I'm not sure if such lifespan should belong to delalloc-space.c.

This omits a lot of critical details. For example it should be noted
that in btrfs_buffered_write we reserve as much space as is requested by
the user. Then in run_delalloc_range it must be mentioned that in the
case of compressed extents it can be called to allocate an extent which
is less than the space reserved in btrfs_buffered_write => that's where
the possible space savings in the compressed case come from.

As a matter of fact running ordered io doesn't really clean any space
apart from a bit of metadata space (unless we do overwrites, as per our
discussion with josef in the slack channel).

<snip>

>> + *
>> + *   1) Updating the inode item.  We hold a reservation for this inode as long
>> + *   as there are dirty bytes outstanding for this inode.  This is because we
>> + *   may update the inode multiple times throughout an operation, and there is
>> + *   no telling when we may have to do a full cow back to that inode item.  Thus
>> + *   we must always hold a reservation.
>> + *
>> + *   2) Adding an extent item.  This is trickier, so a few sub points
>> + *
>> + *     a) We keep track of how many extents an inode may need to create in
>> + *     inode->outstanding_extents.  This is how many items we will have reserved
>> + *     for the extents for this inode.
>> + *
>> + *     b) count_max_extents() is used to figure out how many extent items we
>> + *     will need based on the contiguous area we have dirtied.  Thus if we are
>> + *     writing 4k extents but they coalesce into a very large extent, we will

The way you have worded this is a bit confusing, because first you
mention that count_max_extents calculates how many extent items we'll
need for a contiguous area. Then you mention that if we make a bunch of
4k writes that coalesce into a single extent, i.e. create 1 large
contiguous one (that's what coalescing implies in this context), we'll
have to split them. This is counter-intuitive.

I guess what you meant here is physically contiguous as opposed to
logically contiguous?

>> + *     break this into smaller extents which means we'll need a reservation for
>> + *     each of those extents.
>> + *
>> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
>> + *     the nummber of extents needed for the contiguous area we just created,

nit: s/nummber/number

>> + *     and add that to inode->outstanding_extents.

<snip>

>> + *
>> + *   3) Adding csums for the range.  This is more straightforward than the
>> + *   extent items, as we just want to hold the number of bytes we'll need for
>> + *   checksums until the ordered extent is removed.  If there is an error it is
>> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC

nit: s/clearning/clearing

>> + *   on the inode io_tree.
>> + */
>> +
>>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>>  {
>>  	struct btrfs_root *root = inode->root;
>>
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
  2020-02-04 12:27     ` Nikolay Borisov
@ 2020-02-04 12:39       ` Qu Wenruo
  2020-02-05 13:44         ` David Sterba
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-02-04 12:39 UTC (permalink / raw)
  To: Nikolay Borisov, Josef Bacik, linux-btrfs, kernel-team



On 2020/2/4 8:27 PM, Nikolay Borisov wrote:
>
>
> On 4.02.20 г. 11:48 ч., Qu Wenruo wrote:
>>
>
>
> <snip>
>
>>
>>> + *
>>> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
>>> + *   caller must keep track of this reservation and free it up if it is never
>>> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
>>> + *   bits on the inode's io_tree.  For direct IO it's more straightforward, we
>>> + *   take the reservation at the start of the operation, and if we write less
>>> + *   than we reserved we free the excess.
>>
>> This part involves the lifespan and state machine of data.
>> I guess more explanation on the state machine would help a lot.
>>
>> Like:
>> Page clean
>> |
>> +- btrfs_buffered_write()
>> |  Reserve data space for data, metadata space for csum/file
>> |  extents/inodes.
>> |
>> Page dirty
>> |
>> +- run_delalloc_range()
>> |  Allocate data extents, submit ordered extents to do csum calculation
>> |  and bio submission
>> Page write back
>> |
>> +- finish_ordered_io()
>> |  Insert csum and file extents
>> |
>> Page clean
>>
>> Although I'm not sure if such lifespan should belong to delalloc-space.c.
>
> This omits a lot of critical details. For example it should be noted
> that in btrfs_buffered_write we reserve as much space as is requested by
> the user. Then in run_delalloc_range it must be mentioned that in the
> case of compressed extents it can be called to allocate an extent which
> is less than the space reserved in btrfs_buffered_write => that's where
> the possible space savings in the compressed case come from.

If you spoil everything in the introduction, I guess it's no longer an
introduction.
An introduction should only tell the overall picture, not every detail.
For details, we go read the mentioned functions.

And too many details would make the introduction pretty hard to
maintain. What if one day we drop the current always-reserve behavior
for buffered write?

So I tend to have just an overview and the entrance functions, with
minimal details unless it's a really complex design.

Thanks,
Qu

>
> As a matter of fact running ordered io doesn't really clean any space
> apart from a bit of metadata space (unless we do overwrites, as per our
> discussion with josef in the slack channel).
>
> <snip>
>
>>> + *
>>> + *   1) Updating the inode item.  We hold a reservation for this inode as long
>>> + *   as there are dirty bytes outstanding for this inode.  This is because we
>>> + *   may update the inode multiple times throughout an operation, and there is
>>> + *   no telling when we may have to do a full cow back to that inode item.  Thus
>>> + *   we must always hold a reservation.
>>> + *
>>> + *   2) Adding an extent item.  This is trickier, so a few sub points
>>> + *
>>> + *     a) We keep track of how many extents an inode may need to create in
>>> + *     inode->outstanding_extents.  This is how many items we will have reserved
>>> + *     for the extents for this inode.
>>> + *
>>> + *     b) count_max_extents() is used to figure out how many extent items we
>>> + *     will need based on the contiguous area we have dirtied.  Thus if we are
>>> + *     writing 4k extents but they coalesce into a very large extent, we will
>
> The way you have worded this is a bit confusing, because first you
> mention that count_max_extents calculates how many extent items we'll
> need for a contiguous area. Then you mention that if we make a bunch of
> 4k writes that coalesce into a single extent, i.e. create 1 large
> contiguous one (that's what coalescing implies in this context), we'll
> have to split them. This is counter-intuitive.
>
> I guess what you meant here is physically contiguous as opposed to
> logically contiguous?
>
>>> + *     break this into smaller extents which means we'll need a reservation for
>>> + *     each of those extents.
>>> + *
>>> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
>>> + *     the nummber of extents needed for the contiguous area we just created,
>
> nit: s/nummber/number
>
>>> + *     and add that to inode->outstanding_extents.
>
> <snip>
>
>>> + *
>>> + *   3) Adding csums for the range.  This is more straightforward than the
>>> + *   extent items, as we just want to hold the number of bytes we'll need for
>>> + *   checksums until the ordered extent is removed.  If there is an error it is
>>> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
>
> nit: s/clearning/clearing
>
>>> + *   on the inode io_tree.
>>> + */
>>> +
>>>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>>>  {
>>>  	struct btrfs_root *root = inode->root;
>>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
  2020-02-04 12:39       ` Qu Wenruo
@ 2020-02-05 13:44         ` David Sterba
  0 siblings, 0 replies; 11+ messages in thread
From: David Sterba @ 2020-02-05 13:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Josef Bacik, linux-btrfs, kernel-team

On Tue, Feb 04, 2020 at 08:39:19PM +0800, Qu Wenruo wrote:
> >> Although I'm not sure if such lifespan should belong to delalloc-space.c.
> >
> > This omits a lot of critical details. FOr example it should be noted
> > that in btrfs_buffered_write we reserve as much space as is requested by
> > the user. Then in run_delalloc_range it must be mentioned that in case
> > of compressed extents it can be called to allocate an extent which is
> > less than the space reserved in btrfs_buffered_write => that's where the
> > possible space savings in case of compressed come from.
> 
> If you spoil everything in the introduction, I guess it's no longer an
> introduction.
> An introduction should only tell the overall picture, not every detail.
> For details, we go read the mentioned functions.
> 
> And too many details would make the introduction pretty hard to
> maintain. What if one day we drop the current always-reserve behavior
> for buffered write?
> 
> So I tend to have just an overview and the entrance functions, with
> minimal details unless it's a really complex design.

I'd rather keep it as it is and enhance it further; the ascii graphics
might help but I don't insist. Even if it's introductory, giving just
pointers would IMHO be too little. I'm assuming the overview should be
enough that one can read it, then go to the code and already have a clue
what's happening.

We can update the docs further but we need a start so I've applied v2 to
misc-next and let's take that as starting point so we can discuss
suggested improvements as new patches. In case there's something that
really does not fit the .c file comment there's always the docs git
repository.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-02-05 13:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-03 20:44 [PATCH 0/3] Add comments describing how space reservation works Josef Bacik
2020-02-03 20:44 ` [PATCH 1/3] btrfs: add a comment describing block-rsvs Josef Bacik
2020-02-04  9:30   ` Qu Wenruo
2020-02-04 10:32     ` Nikolay Borisov
2020-02-03 20:44 ` [PATCH 2/3] btrfs: add a comment describing delalloc space reservation Josef Bacik
2020-02-04  9:48   ` Qu Wenruo
2020-02-04 12:27     ` Nikolay Borisov
2020-02-04 12:39       ` Qu Wenruo
2020-02-05 13:44         ` David Sterba
2020-02-03 20:44 ` [PATCH 3/3] btrfs: describe the space reservation system in general Josef Bacik
2020-02-04 10:14   ` Qu Wenruo
