All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: david@fromorbit.com, darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, xfs@oss.sgi.com
Subject: [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter
Date: Thu, 25 Aug 2016 16:27:23 -0700	[thread overview]
Message-ID: <147216764310.32447.6110430430725995824.stgit@birch.djwong.org> (raw)
In-Reply-To: <147216761636.32447.4229640006064129056.stgit@birch.djwong.org>

Move the discussion of short and long format btrees into a separate
chapter.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   59 ------
 design/XFS_Filesystem_Structure/btrees.asciidoc    |  196 ++++++++++++++++++++
 .../XFS_Filesystem_Structure/data_extents.asciidoc |   72 +------
 .../xfs_filesystem_structure.asciidoc              |    2 
 4 files changed, 204 insertions(+), 125 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/btrees.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 0633175..55bbc50 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -612,65 +612,6 @@ Checksum of the AGF sector.
 *agf_spare2*::
 Empty space in the unlogged part of the AGF sector.
 
-[[Short_Format_Btrees]]
-=== Short Format B+trees
-
-Each allocation group uses a ``short format'' B+tree to index various
-information about the allocation group.  The structure is called short format
-because all block pointers are AG block numbers.  The trees use the following
-header:
-
-[source, c]
-----
-struct xfs_btree_sblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be32                    bb_leftsib;
-     __be32                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be32                    bb_owner;
-     __le32                    bb_crc;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the per-AG B+tree block.
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-AG block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-AG block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
 [[AG_Free_Space_Btrees]]
 === AG Free Space B+trees
 
diff --git a/design/XFS_Filesystem_Structure/btrees.asciidoc b/design/XFS_Filesystem_Structure/btrees.asciidoc
new file mode 100644
index 0000000..306e061
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/btrees.asciidoc
@@ -0,0 +1,196 @@
+= B+trees
+
+XFS uses b+trees to index all metadata records.  This well known data structure
+is used to provide efficient random and sequential access to metadata records
+while minimizing seek times.  There are two btree formats: a short format
+for records pertaining to a single allocation group, since all block pointers
+in an AG are 32-bits in size; and a long format for records pertaining to a
+file, since file data can have 64-bit block offsets.  Each b+tree block is
+either a leaf node containing records, or an internal node containing keys and
+pointers to other b+tree blocks.  The tree consists of a root block which may
+point to some number of other blocks; blocks in the bottom level of the b+tree
+contains only records.
+
+Leaf blocks of both types of b+trees have the same general format: a header
+describing the data in the block, and an array of records.  The specific header
+formats are given in the next two sections, and the record format is provided
+by the b+tree client itself.  The generic b+tree code does not have any
+specific knowledge of the record format.
+
+----
++--------+------------+------------+
+| header |   record   | records... |
++--------+------------+------------+
+----
+
+Internal node blocks of both types of b+trees also have the same general
+format: a header describing the data in the block, an array of keys, and an
+array of pointers.  Each pointer may be associated with one or two keys.  The
+first key uniquely identifies the first record accessible via the leftmost path
+down the branch of the tree.
+
+If the records in a b+tree are indexed by an interval, then a range of keys can
+uniquely identify a single record.  For example, if a record covers blocks
+12-16, then any one of the keys 12, 13, 14, 15, or 16 return the same record.
+In this case, the key for the record describing "12-16" is 12.  If none of the
+records overlap, we only need to store one key.
+
+This is the format of a standard b+tree node:
+
+----
++--------+---------+---------+---------+---------+
+| header |   key   | keys... |   ptr   | ptrs... |
++--------+---------+---------+---------+---------+
+----
+
+If the b+tree records do not overlap, performing a b+tree lookup is simple.
+Start with the root.  If it is a leaf block, perform a binary search of the
+records until we find the record with a lower key than our search key.  If the
+block is a node block, perform a binary search of the keys until we find a
+key lower than our search key, then follow the pointer to the next block.
+Repeat until we find a record.
+
+However, if b+tree records contain intervals and are allowed to overlap, the
+internal nodes of the b+tree become larger:
+
+----
++--------+---------+----------+---------+-------------+---------+---------+
+| header | low key | high key | low key | high key... |   ptr   | ptrs... |
++--------+---------+----------+---------+-------------+---------+---------+
+----
+
+The low keys are exactly the same as the keys in the non-overlapping b+tree.
+High keys, however, are a little different.  Recall that a record with a key
+consisting of an interval can be referenced by a number of keys.  Since the low
+key of a record indexes the low end of that key range, the high key indexes the
+high end of the key range.  Returning to the example above, the high key for
+the record describing "12-16" is 16.  The high key recorded in a b+tree node
+is the largest of the high keys of all records accessible under the subtree
+rooted by the pointer.  For a level 1 node, this is the largest high key in
+the pointed-to leaf node; for any other node, this is the largest of the high
+keys in the pointed-to node.
+
+Nodes and leaves use the same magic numbers.
+
+[[Short_Format_Btrees]]
+== Short Format B+trees
+
+Each allocation group uses a ``short format'' B+tree to index various
+information about the allocation group.  The structure is called short format
+because all block pointers are AG block numbers.  The trees use the following
+header:
+
+[source, c]
+----
+struct xfs_btree_sblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be32                    bb_leftsib;
+     __be32                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be32                    bb_owner;
+     __le32                    bb_crc;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the per-AG B+tree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+AG block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+AG block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+[[Long_Format_Btrees]]
+== Long Format B+trees
+
+Long format B+trees are similar to short format B+trees, except that their
+block pointers are 64-bit filesystem block numbers instead of 32-bit AG block
+numbers.  Because of this, long format b+trees can be (and usually are) rooted
+in an inode's data or attribute fork.  The nodes and leaves of this B+tree use
+the +xfs_btree_lblock+ declaration:
+
+[source, c]
+----
+struct xfs_btree_lblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be64                    bb_leftsib;
+     __be64                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be64                    bb_owner;
+     __le32                    bb_crc;
+     __be32                    bb_pad;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the btree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+FS block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+FS block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+*bb_pad*::
+Pads the structure to 64 bytes.
diff --git a/design/XFS_Filesystem_Structure/data_extents.asciidoc b/design/XFS_Filesystem_Structure/data_extents.asciidoc
index a39045d..4f1109b 100644
--- a/design/XFS_Filesystem_Structure/data_extents.asciidoc
+++ b/design/XFS_Filesystem_Structure/data_extents.asciidoc
@@ -203,9 +203,10 @@ u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
 [[Btree_Extent_List]]
 == B+tree Extent List
 
-To manage extent maps that cannot fit in the inode fork area, XFS uses long
-format B+trees.  The root node of the B+tree is stored in the inode's data
-fork.  All block pointers for extent B+trees are 64-bit absolute block numbers.
+To manage extent maps that cannot fit in the inode fork area, XFS uses
+xref:Long_Format_Btrees[long format B+trees].  The root node of the B+tree is
+stored in the inode's data fork.  All block pointers for extent B+trees are
+64-bit filesystem block numbers.
 
 For a single level B+tree, the root node points to the B+tree's leaves. Each
 leaf occupies one filesystem block and contains a header and an array of extents
@@ -242,69 +243,8 @@ standard 256 byte inode before a new level of nodes is added between the root
 and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes
 are in use on the inode).
 
-[[Long_Format_Btrees]]
-=== Long Format B+trees
-
-The subsequent nodes and leaves of the B+tree use the +xfs_btree_lblock+
-declaration:
-
-[source, c]
-----
-struct xfs_btree_lblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be64                    bb_leftsib;
-     __be64                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be64                    bb_owner;
-     __le32                    bb_crc;
-     __be32                    bb_pad;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the BMBT block: ``BMAP'' (0x424d4150).
-On a v5 filesystem, this is ``BMA3'' (0x424d4133).
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-FS block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-FS block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
-*bb_pad*::
-Pads the structure to 64 bytes.
-
-// force-split the lists
+* The magic number for a BMBT block is ``BMAP'' (0x424d4150).  On a v5
+filesystem, this is ``BMA3'' (0x424d4133).
 
 * For intermediate nodes, the data following +xfs_btree_lblock+ is the same as
 the root node: array of +xfs_bmbt_key+ value followed by an array of
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index f580aab..62502b3 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -62,6 +62,8 @@ Global Structures
 
 :leveloffset: 1
 
+include::btrees.asciidoc[]
+
 include::allocation_groups.asciidoc[]
 
 include::journaling_log.asciidoc[]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2016-08-25 23:27 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
2016-08-25 23:27 ` [PATCH 2/7] xfsdocs: document known testing procedures Darrick J. Wong
2016-08-25 23:27 ` [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5 Darrick J. Wong
2016-08-25 23:27 ` Darrick J. Wong [this message]
2016-08-25 23:27 ` [PATCH 5/7] xfsdocs: reverse-mapping btree documentation Darrick J. Wong
2016-08-25 23:27 ` [PATCH 6/7] xfsdocs: document refcount btree and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
2016-09-08  1:38   ` Dave Chinner
2016-09-08  2:03     ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=147216764310.32447.6110430430725995824.stgit@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=xfs@oss.sgi.com \
    --subject='Re: [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.