All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink
@ 2016-08-25 23:26 Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Hi all,

This is the eighth revision of a patchset that adds to XFS kernel
support for mapping multiple file logical blocks to the same physical
block (reflink/deduplication), implements the beginnings of online
metadata scrubbing and preening, and implements reverse mapping for
the realtime device.  There shouldn't be any incompatible on-disk
format changes, pending a thorough review of the patches within.

The patches in this series fix various errors, bring the documentation
up to date as of kernel 4.6, and add entirely new chapters about the
reverse mapping and reference count btrees.  The new log redo items
are also documented.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], xfstests[3],
xfs-docs[4], and man-pages[5].  The kernel patches in the git trees
should apply to 4.8-rc3; xfsprogs patches to for-next; and xfstest to
master.

The patches have been xfstested with x64, ppc64, and armhf; all tests
in the clone and rmap groups pass.  AFAICT they don't cause any new
failures for the 'auto' group.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/djwong-devel
[2] https://github.com/djwong/xfsprogs/tree/djwong-devel
[3] https://github.com/djwong/xfstests/tree/djwong-devel
[4] https://github.com/djwong/xfs-documentation/tree/djwong-devel
[5] https://github.com/djwong/man-pages/tree/djwong-devel

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/7] journaling_log: fix some typos in the section about EFDs
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 2/7] xfsdocs: document known testing procedures Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../journaling_log.asciidoc                        |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index a67fcc2..67d209f 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -368,7 +368,7 @@ typedef struct xfs_efd_log_format {
 ----
 
 *efd_type*::
-The signature of an EFI operation, 0x1236.  This value is in host-endian order,
+The signature of an EFD operation, 0x1237.  This value is in host-endian order,
 not big-endian like the rest of XFS.
 
 *efd_size*::
@@ -382,9 +382,9 @@ A 64-bit number that binds the corresponding EFI log item to this EFD log item.
 
 *efd_extents*::
 Variable-length array of extents to be freed.  The array length is given by
-+efi_nextents+.  The record type will be either +xfs_extent_64_t+ or
++efd_nextents+.  The record type will be either +xfs_extent_64_t+ or
 +xfs_extent_32_t+; this can be determined from the log item size (+oh_len+) and
-the number of extents (+efi_nextents+).
+the number of extents (+efd_nextents+).
 
 [[Inode_Log_Item]]
 === Inode Updates

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/7] xfsdocs: document known testing procedures
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5 Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 design/XFS_Filesystem_Structure/docinfo.xml        |   14 ++++++++++++
 design/XFS_Filesystem_Structure/testing.asciidoc   |   23 ++++++++++++++++++++
 .../xfs_filesystem_structure.asciidoc              |    2 ++
 3 files changed, 39 insertions(+)
 create mode 100644 design/XFS_Filesystem_Structure/testing.asciidoc


diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index ba97809..cc5596d 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -108,4 +108,18 @@
 			</simplelist>
 		</revdescription>
 	</revision>
+	<revision>
+		<revnumber>3.14</revnumber>
+		<date>January 2016</date>
+		<author>
+			<firstname>Darrick</firstname>
+			<surname>Wong</surname>
+			<email></email>
+		</author>
+		<revdescription>
+			<simplelist>
+				<member>Document disk format change testing.</member>
+			</simplelist>
+		</revdescription>
+	</revision>
 </revhistory>
diff --git a/design/XFS_Filesystem_Structure/testing.asciidoc b/design/XFS_Filesystem_Structure/testing.asciidoc
new file mode 100644
index 0000000..f1c90bc
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/testing.asciidoc
@@ -0,0 +1,23 @@
+[[Testing]]
+= Testing Filesystem Changes
+
+People put a lot of trust in filesystems to preserve their data in a reliable
+fashion.  To that end, it is very important that users and developers have
+access to a suite of regression tests that can be used to prove correct
+operation of any given filesystem code, or to analyze failures to fix problems
+found in the code.  The XFS regression test suite, +xfstests+, is hosted at
++git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git+.  Most tests apply to
+filesystems in general, but the suite also contains tests for features specific
+to each filesystem.
+
+When fixing bugs, it is important to provide a testcase exposing the bug so
+that the developers can avoid a future re-occurrence of the regression.
+Furthermore, if you're developing a new user-visible feature for XFS, please
+help the rest of the development community to sustain and maintain the whole
+codebase by providing generous test coverage to check its behavior.
+
+When altering, adding, or removing an on-disk data structure, please remember
+to update both the in-kernel structure size checks in +xfs_ondisk.h+ and to
+ensure that your changes are reflected in xfstest xfs/122.  These regression
+tests enable us to detect compiler bugs, alignment problems, and anything
+else that might result in the creation of incompatible filesystem images.
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 53262bf..f580aab 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -52,6 +52,8 @@ include::common_types.asciidoc[]
 
 include::magic.asciidoc[]
 
+include::testing.asciidoc[]
+
 // return titles to normal
 :leveloffset: 0
 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 2/7] xfsdocs: document known testing procedures Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)


diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index 4aabc55..dc1fad2 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -66,9 +66,10 @@ of the literal area and +di_forkoff+. The attribute fork is located between
 [[Inode_Core]]
 == Inode Core
 
-The inode's core is 96 bytes in size and contains information about the file
-itself including most stat data information about data and attribute forks after
-the core within the inode. It uses the following structure:
+The inode's core is 96 bytes on a V4 filesystem and 176 bytes on a V5
+filesystem.  It contains information about the file itself including most stat
+data information about data and attribute forks after the core within the
+inode. It uses the following structure:
 
 [source, c]
 ----
@@ -313,8 +314,16 @@ Counts the number of changes made to the attributes in this inode.
 Log sequence number of the last inode write.
 
 *di_flags2*::
-Specifies extended flags associated with a v3 inode.  There are no flags defined
-currently.
+Specifies extended flags associated with a v3 inode.
+
+.Version 3 Inode flags
+[options="header"]
+|=====
+| Flag				| Description
+| +XFS_DIFLAG2_DAX+		|
+For a file, enable DAX to increase performance on persistent-memory storage.
+If set on a directory, files created in the directory will inherit this flag.
+|=====
 
 *di_pad2*::
 Padding for future expansion of the inode.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
                   ` (2 preceding siblings ...)
  2016-08-25 23:27 ` [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5 Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 5/7] xfsdocs: reverse-mapping btree documentation Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Move the discussion of short and long format btrees into a separate
chapter.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   59 ------
 design/XFS_Filesystem_Structure/btrees.asciidoc    |  196 ++++++++++++++++++++
 .../XFS_Filesystem_Structure/data_extents.asciidoc |   72 +------
 .../xfs_filesystem_structure.asciidoc              |    2 
 4 files changed, 204 insertions(+), 125 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/btrees.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 0633175..55bbc50 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -612,65 +612,6 @@ Checksum of the AGF sector.
 *agf_spare2*::
 Empty space in the unlogged part of the AGF sector.
 
-[[Short_Format_Btrees]]
-=== Short Format B+trees
-
-Each allocation group uses a ``short format'' B+tree to index various
-information about the allocation group.  The structure is called short format
-because all block pointers are AG block numbers.  The trees use the following
-header:
-
-[source, c]
-----
-struct xfs_btree_sblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be32                    bb_leftsib;
-     __be32                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be32                    bb_owner;
-     __le32                    bb_crc;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the per-AG B+tree block.
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-AG block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-AG block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
 [[AG_Free_Space_Btrees]]
 === AG Free Space B+trees
 
diff --git a/design/XFS_Filesystem_Structure/btrees.asciidoc b/design/XFS_Filesystem_Structure/btrees.asciidoc
new file mode 100644
index 0000000..306e061
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/btrees.asciidoc
@@ -0,0 +1,196 @@
+= B+trees
+
+XFS uses b+trees to index all metadata records.  This well known data structure
+is used to provide efficient random and sequential access to metadata records
+while minimizing seek times.  There are two btree formats: a short format
+for records pertaining to a single allocation group, since all block pointers
+in an AG are 32-bits in size; and a long format for records pertaining to a
+file, since file data can have 64-bit block offsets.  Each b+tree block is
+either a leaf node containing records, or an internal node containing keys and
+pointers to other b+tree blocks.  The tree consists of a root block which may
+point to some number of other blocks; blocks in the bottom level of the b+tree
+contains only records.
+
+Leaf blocks of both types of b+trees have the same general format: a header
+describing the data in the block, and an array of records.  The specific header
+formats are given in the next two sections, and the record format is provided
+by the b+tree client itself.  The generic b+tree code does not have any
+specific knowledge of the record format.
+
+----
++--------+------------+------------+
+| header |   record   | records... |
++--------+------------+------------+
+----
+
+Internal node blocks of both types of b+trees also have the same general
+format: a header describing the data in the block, an array of keys, and an
+array of pointers.  Each pointer may be associated with one or two keys.  The
+first key uniquely identifies the first record accessible via the leftmost path
+down the branch of the tree.
+
+If the records in a b+tree are indexed by an interval, then a range of keys can
+uniquely identify a single record.  For example, if a record covers blocks
+12-16, then any one of the keys 12, 13, 14, 15, or 16 return the same record.
+In this case, the key for the record describing "12-16" is 12.  If none of the
+records overlap, we only need to store one key.
+
+This is the format of a standard b+tree node:
+
+----
++--------+---------+---------+---------+---------+
+| header |   key   | keys... |   ptr   | ptrs... |
++--------+---------+---------+---------+---------+
+----
+
+If the b+tree records do not overlap, performing a b+tree lookup is simple.
+Start with the root.  If it is a leaf block, perform a binary search of the
+records until we find the record with a lower key than our search key.  If the
+block is a node block, perform a binary search of the keys until we find a
+key lower than our search key, then follow the pointer to the next block.
+Repeat until we find a record.
+
+However, if b+tree records contain intervals and are allowed to overlap, the
+internal nodes of the b+tree become larger:
+
+----
++--------+---------+----------+---------+-------------+---------+---------+
+| header | low key | high key | low key | high key... |   ptr   | ptrs... |
++--------+---------+----------+---------+-------------+---------+---------+
+----
+
+The low keys are exactly the same as the keys in the non-overlapping b+tree.
+High keys, however, are a little different.  Recall that a record with a key
+consisting of an interval can be referenced by a number of keys.  Since the low
+key of a record indexes the low end of that key range, the high key indexes the
+high end of the key range.  Returning to the example above, the high key for
+the record describing "12-16" is 16.  The high key recorded in a b+tree node
+is the largest of the high keys of all records accessible under the subtree
+rooted by the pointer.  For a level 1 node, this is the largest high key in
+the pointed-to leaf node; for any other node, this is the largest of the high
+keys in the pointed-to node.
+
+Nodes and leaves use the same magic numbers.
+
+[[Short_Format_Btrees]]
+== Short Format B+trees
+
+Each allocation group uses a ``short format'' B+tree to index various
+information about the allocation group.  The structure is called short format
+because all block pointers are AG block numbers.  The trees use the following
+header:
+
+[source, c]
+----
+struct xfs_btree_sblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be32                    bb_leftsib;
+     __be32                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be32                    bb_owner;
+     __le32                    bb_crc;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the per-AG B+tree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+AG block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+AG block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+[[Long_Format_Btrees]]
+== Long Format B+trees
+
+Long format B+trees are similar to short format B+trees, except that their
+block pointers are 64-bit filesystem block numbers instead of 32-bit AG block
+numbers.  Because of this, long format b+trees can be (and usually are) rooted
+in an inode's data or attribute fork.  The nodes and leaves of this B+tree use
+the +xfs_btree_lblock+ declaration:
+
+[source, c]
+----
+struct xfs_btree_lblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be64                    bb_leftsib;
+     __be64                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be64                    bb_owner;
+     __le32                    bb_crc;
+     __be32                    bb_pad;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the btree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+FS block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+FS block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+*bb_pad*::
+Pads the structure to 64 bytes.
diff --git a/design/XFS_Filesystem_Structure/data_extents.asciidoc b/design/XFS_Filesystem_Structure/data_extents.asciidoc
index a39045d..4f1109b 100644
--- a/design/XFS_Filesystem_Structure/data_extents.asciidoc
+++ b/design/XFS_Filesystem_Structure/data_extents.asciidoc
@@ -203,9 +203,10 @@ u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
 [[Btree_Extent_List]]
 == B+tree Extent List
 
-To manage extent maps that cannot fit in the inode fork area, XFS uses long
-format B+trees.  The root node of the B+tree is stored in the inode's data
-fork.  All block pointers for extent B+trees are 64-bit absolute block numbers.
+To manage extent maps that cannot fit in the inode fork area, XFS uses
+xref:Long_Format_Btrees[long format B+trees].  The root node of the B+tree is
+stored in the inode's data fork.  All block pointers for extent B+trees are
+64-bit filesystem block numbers.
 
 For a single level B+tree, the root node points to the B+tree's leaves. Each
 leaf occupies one filesystem block and contains a header and an array of extents
@@ -242,69 +243,8 @@ standard 256 byte inode before a new level of nodes is added between the root
 and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes
 are in use on the inode).
 
-[[Long_Format_Btrees]]
-=== Long Format B+trees
-
-The subsequent nodes and leaves of the B+tree use the +xfs_btree_lblock+
-declaration:
-
-[source, c]
-----
-struct xfs_btree_lblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be64                    bb_leftsib;
-     __be64                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be64                    bb_owner;
-     __le32                    bb_crc;
-     __be32                    bb_pad;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the BMBT block: ``BMAP'' (0x424d4150).
-On a v5 filesystem, this is ``BMA3'' (0x424d4133).
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-FS block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-FS block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
-*bb_pad*::
-Pads the structure to 64 bytes.
-
-// force-split the lists
+* The magic number for a BMBT block is ``BMAP'' (0x424d4150).  On a v5
+filesystem, this is ``BMA3'' (0x424d4133).
 
 * For intermediate nodes, the data following +xfs_btree_lblock+ is the same as
 the root node: array of +xfs_bmbt_key+ value followed by an array of
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index f580aab..62502b3 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -62,6 +62,8 @@ Global Structures
 
 :leveloffset: 1
 
+include::btrees.asciidoc[]
+
 include::allocation_groups.asciidoc[]
 
 include::journaling_log.asciidoc[]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 5/7] xfsdocs: reverse-mapping btree documentation
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
                   ` (3 preceding siblings ...)
  2016-08-25 23:27 ` [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 6/7] xfsdocs: document refcount btree and reflink Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Add chapters on the operation of the reverse mapping btree and future
things we could do with rmap data.

v2: Add magic number to the table.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   31 +-
 design/XFS_Filesystem_Structure/docinfo.xml        |   17 +
 .../journaling_log.asciidoc                        |  122 ++++++++
 design/XFS_Filesystem_Structure/magic.asciidoc     |    3 
 .../reconstruction.asciidoc                        |   53 +++
 design/XFS_Filesystem_Structure/rmapbt.asciidoc    |  305 ++++++++++++++++++++
 .../xfs_filesystem_structure.asciidoc              |    4 
 7 files changed, 526 insertions(+), 9 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/reconstruction.asciidoc
 create mode 100644 design/XFS_Filesystem_Structure/rmapbt.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 55bbc50..9fcf975 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -12,6 +12,7 @@ Each AG has the following characteristics:
          * A super block describing overall filesystem info
          * Free space management
          * Inode allocation and tracking
+         * Reverse block-mapping index (optional)
 
 Having multiple AGs allows XFS to handle most operations in parallel without
 degrading performance as the number of concurrent accesses increases.
@@ -379,6 +380,12 @@ it doesn't understand the flag.
 Free inode B+tree.  Each allocation group contains a B+tree to track inode chunks
 containing free inodes.  This is a performance optimization to reduce the time
 required to allocate inodes.
+
+| +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ |
+Reverse mapping B+tree.  Each allocation group contains a B+tree containing
+records mapping AG blocks to their owners.  See the section about
+xref:Reconstruction[reconstruction] for more details.
+
 |=====
 
 *sb_features_incompat*::
@@ -529,9 +536,7 @@ struct xfs_agf {
      __be32              agf_seqno;
      __be32              agf_length;
      __be32              agf_roots[XFS_BTNUM_AGF];
-     __be32              agf_spare0;
      __be32              agf_levels[XFS_BTNUM_AGF];
-     __be32              agf_spare1;
      __be32              agf_flfirst;
      __be32              agf_fllast;
      __be32              agf_flcount;
@@ -541,7 +546,9 @@ struct xfs_agf {
 
      /* version 5 filesystem fields start here */
      uuid_t              agf_uuid;
-     __be64              agf_spare64[16];
+     __be32              agf_rmap_blocks;
+     __be32              __pad;
+     __be64              agf_spare64[15];
 
      /* unlogged fields, written during buffer writeback. */
      __be64              agf_lsn;
@@ -550,9 +557,10 @@ struct xfs_agf {
 };
 ----
 
-The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 2:
-index 0 for the free space B+tree indexed by block number; and index 1 for the
-free space B+tree indexed by extent size.
+The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 3:
+index 0 for the free space B+tree indexed by block number; index 1 for the free
+space B+tree indexed by extent size; and index 2 for the reverse-mapping
+B+tree.
 
 *agf_magicnum*::
 Specifies the magic number for the AGF sector: ``XAGF'' (0x58414746).
@@ -570,11 +578,13 @@ this could be less than the +sb_agblocks+ value. It is this value that should
 be used to determine the size of the AG.
 
 *agf_roots*::
-Specifies the block number for the root of the two free space B+trees.
+Specifies the block number for the root of the two free space B+trees and the
+reverse-mapping B+tree, if enabled.
 
 *agf_levels*::
-Specifies the level or depth of the two free space B+trees. For a fresh AG, this
-will be one, and the ``roots'' will point to a single leaf of level 0.
+Specifies the level or depth of the two free space B+trees and the
+reverse-mapping B+tree, if enabled.  For a fresh AG, this value will be one,
+and the ``roots'' will point to a single leaf of level 0.
 
 *agf_flfirst*::
 Specifies the index of the first ``free list'' block. Free lists are covered in
@@ -600,6 +610,9 @@ used if the +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ bit is set in +sb_features2+.
 The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
 depending on which features are set.
 
+*agf_rmap_blocks*::
+The size of the reverse mapping B+tree in this allocation group, in blocks.
+
 *agf_spare64*::
 Empty space in the logged part of the AGF sector, for use for future features.
 
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index cc5596d..44f944a 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -122,4 +122,21 @@
 			</simplelist>
 		</revdescription>
 	</revision>
+	<revision>
+		<revnumber>3.141</revnumber>
+		<date>June 2016</date>
+		<author>
+			<firstname>Darrick</firstname>
+			<surname>Wong</surname>
+			<email></email>
+		</author>
+		<revdescription>
+			<simplelist>
+				<member>Document the reverse-mapping btree.</member>
+				<member>Move the b+tree info to a separate chapter.</member>
+				<member>Discuss overlapping interval b+trees.</member>
+				<member>Discuss new log items for atomic updates.</member>
+			</simplelist>
+		</revdescription>
+	</revision>
 </revhistory>
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index 67d209f..78ce436 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -209,6 +209,8 @@ magic number to distinguish themselves.  Buffer data items only appear after
 | +XFS_LI_DQUOT+		| 0x123d        | xref:Quota_Update_Log_Item[Update Quota]
 | +XFS_LI_QUOTAOFF+		| 0x123e        | xref:Quota_Off_Log_Item[Quota Off]
 | +XFS_LI_ICREATE+		| 0x123f        | xref:Inode_Create_Log_Item[Inode Creation]
+| +XFS_LI_RUI+			| 0x1240        | xref:RUI_Log_Item[Reverse Mapping Update Intent]
+| +XFS_LI_RUD+			| 0x1241        | xref:RUD_Log_Item[Reverse Mapping Update Done]
 |=====
 
 [[Log_Transaction_Headers]]
@@ -386,6 +388,126 @@ Variable-length array of extents to be freed.  The array length is given by
 +xfs_extent_32_t+; this can be determined from the log item size (+oh_len+) and
 the number of extents (+efd_nextents+).
 
+[[RUI_Log_Item]]
+=== Reverse Mapping Updates Intent
+
+The next two operation types work together to handle deferred reverse mapping
+updates.  Naturally, the mappings to be updated can be expressed in terms of
+mapping extents:
+
+[source, c]
+----
+struct xfs_map_extent {
+     __uint64_t                me_owner;
+     __uint64_t                me_startblock;
+     __uint64_t                me_startoff;
+     __uint32_t                me_len;
+     __uint32_t                me_flags;
+};
+----
+
+*me_owner*::
+Owner of this reverse mapping.  See the values in the section about
+xref:Reverse_Mapping_Btree[reverse mapping] for more information.
+
+*me_startblock*::
+Filesystem block of this mapping.
+
+*me_startoff*::
+Logical block offset of this mapping.
+
+*me_len*::
+The length of this mapping.
+
+*me_flags*::
+The lower byte of this field is a type code indicating what sort of
+reverse mapping operation we want.  The upper three bytes are flag bits.
+
+.Reverse mapping update log intent types
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_RMAP_EXTENT_MAP+		| Add a reverse mapping for file data.
+| +XFS_RMAP_EXTENT_MAP_SHARED+	| Add a reverse mapping for file data for a file with shared blocks.
+| +XFS_RMAP_EXTENT_UNMAP+	| Remove a reverse mapping for file data.
+| +XFS_RMAP_EXTENT_UNMAP_SHARED+	| Remove a reverse mapping for file data for a file with shared blocks.
+| +XFS_RMAP_EXTENT_CONVERT+	| Convert a reverse mapping for file data between unwritten and normal.
+| +XFS_RMAP_EXTENT_CONVERT_SHARED+	| Convert a reverse mapping for file data between unwritten and normal for a file with shared blocks.
+| +XFS_RMAP_EXTENT_ALLOC+	| Add a reverse mapping for non-file data.
+| +XFS_RMAP_EXTENT_FREE+	| Remove a reverse mapping for non-file data.
+|=====
+
+.Reverse mapping update log intent flags
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_RMAP_EXTENT_ATTR_FORK+	| Extent is for the attribute fork.
+| +XFS_RMAP_EXTENT_BMBT_BLOCK+	| Extent is for a block mapping btree block.
+| +XFS_RMAP_EXTENT_UNWRITTEN+	| Extent is unwritten.
+|=====
+
+The ``rmap update intent'' operation comes first; it tells the log that XFS
+wants to update some reverse mappings.  This record is crucial for correct log
+recovery because it enables us to spread a complex metadata update across
+multiple transactions while ensuring that a crash midway through the complex
+update will be replayed fully during log recovery.
+
+[source, c]
+----
+struct xfs_rui_log_format {
+     __uint16_t                rui_type;
+     __uint16_t                rui_size;
+     __uint32_t                rui_nextents;
+     __uint64_t                rui_id;	
+     struct xfs_map_extent     rui_extents[1];
+};
+----
+
+*rui_type*::
+The signature of an RUI operation, 0x1240.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*rui_size*::
+Size of this log item.  Should be 1.
+
+*rui_nextents*::
+Number of reverse mappings.
+
+*rui_id*::
+A 64-bit number that binds the corresponding RUD log item to this RUI log item.
+
+*rui_extents*::
+Variable-length array of reverse mappings to update.
+
+[[RUD_Log_Item]]
+=== Completion of Reverse Mapping Updates
+
+The ``reverse mapping update done'' operation complements the ``reverse mapping
+update intent'' operation.  This second operation indicates that the update
+actually happened, so that log recovery needn't replay the update.  The RUD and
+the actual updates are typically found in a new transaction following the
+transaction in which the RUI was logged.
+
+[source, c]
+----
+struct xfs_rud_log_format {
+      __uint16_t               rud_type;
+      __uint16_t               rud_size;
+      __uint32_t               __pad;
+      __uint64_t               rud_rui_id;
+};
+----
+
+*rud_type*::
+The signature of an RUD operation, 0x1241.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*rud_size*::
+Size of this log item.  Should be 1.
+
+*rud_rui_id*::
+A 64-bit number that binds the corresponding RUI log item to this RUD log item.
+
 [[Inode_Log_Item]]
 === Inode Updates
 
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index 301cfa0..10fd15f 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -44,6 +44,7 @@ relevant chapters.  Magic numbers tend to have consistent locations:
 | +XFS_ATTR_LEAF_MAGIC+		| 0xfbee	|     	| xref:Leaf_Attributes[Leaf Attribute]
 | +XFS_ATTR3_LEAF_MAGIC+	| 0x3bee	|     	| xref:Leaf_Attributes[Leaf Attribute], v5 only
 | +XFS_ATTR3_RMT_MAGIC+		| 0x5841524d	| XARM	| xref:Remote_Values[Remote Attribute Value], v5 only
+| +XFS_RMAP_CRC_MAGIC+		| 0x524d4233	| RMB3	| xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
 |=====
 
 The magic numbers for log items are at offset zero in each log item, but items
@@ -61,6 +62,8 @@ are not aligned to blocks.
 | +XFS_LI_DQUOT+		| 0x123d        |       | xref:Quota_Update_Log_Item[Update Quota Log Item]
 | +XFS_LI_QUOTAOFF+		| 0x123e        |       | xref:Quota_Off_Log_Item[Quota Off Log Item]
 | +XFS_LI_ICREATE+		| 0x123f        |       | xref:Inode_Create_Log_Item[Inode Creation Log Item]
+| +XFS_LI_RUI+			| 0x1240        |       | xref:RUI_Log_Item[Reverse Mapping Update Intent]
+| +XFS_LI_RUD+			| 0x1241        |       | xref:RUD_Log_Item[Reverse Mapping Update Done]
 |=====
 
 = Theoretical Limits
diff --git a/design/XFS_Filesystem_Structure/reconstruction.asciidoc b/design/XFS_Filesystem_Structure/reconstruction.asciidoc
new file mode 100644
index 0000000..f172e0f
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/reconstruction.asciidoc
@@ -0,0 +1,53 @@
+[[Reconstruction]]
+= Metadata Reconstruction
+
+[NOTE]
+This is a theoretical discussion of how reconstruction could work; none of this
+is implemented as of 2015.
+
+A simple UNIX filesystem can be thought of in terms of a directed acyclic graph.
+To a first approximation, there exists a root directory node, which points to
+other nodes.  Those other nodes can themselves be directories or they can be
+files.  Each file, in turn, points to data blocks.
+
+XFS adds a few more details to this picture:
+
+* The real root(s) of an XFS filesystem are the allocation group headers
+(superblock, AGF, AGI, AGFL).
+* Each allocation group’s headers point to various per-AG B+trees (free space,
+inode, free inodes, free list, etc.)
+* The free space B+trees point to unused extents;
+* The inode B+trees point to blocks containing inode chunks;
+* All superblocks point to the root directory and the log;
+* Hardlinks mean that multiple directories can point to a single file node;
+* File data block pointers are indexed by file offset;
+* Files and directories can have a second collection of pointers to data blocks
+which contain extended attributes;
+* Large directories require multiple data blocks to store all the subpointers;
+* Still larger directories use high-offset data blocks to store a B+tree of
+hashes to directory entries;
+* Large extended attribute forks similarly use high-offset data blocks to store
+a B+tree of hashes to attribute keys; and
+* Symbolic links can point to data blocks.
+
+The beauty of this massive graph structure is that under normal circumstances,
+everything known to the filesystem is discoverable (access controls
+notwithstanding) from the root.  The major weakness of this structure of course
+is that breaking a edge in the graph can render entire subtrees inaccessible.
++xfs_repair+ “recovers” from broken directories by scanning for unlinked inodes
+and connecting them to +/lost+found+, but this isn’t sufficiently general to
+recover from breaks in other parts of the graph structure.  Wouldn’t it be
+useful to have back pointers as a secondary data structure?  The current repair
+strategy is to reconstruct whatever can be rebuilt, but to scrap anything that
+doesn't check out.
+
+The xref:Reverse_Mapping_Btree[reverse-mapping B+tree] fills in part of the
+puzzle.  Since it contains copies of every entry in each inode’s data and
+attribute forks, we can fix a corrupted block map with these records.
+Furthermore, if the inode B+trees become corrupt, it is possible to visit all
+inode chunks using the reverse-mapping data.  Should XFS ever gain the ability
+to store parent directory information in each inode, it also becomes possible
+to resurrect damaged directory trees, which should reduce the complaints about
+inodes ending up in +/lost+found+.  Everything else in the per-AG primary
+metadata can already be reconstructed via +xfs_repair+.  Hopefully,
+reconstruction will not turn out to be a fool's errand.
diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
new file mode 100644
index 0000000..a8a210b
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
@@ -0,0 +1,305 @@
+[[Reverse_Mapping_Btree]]
+== Reverse-Mapping B+tree
+
+[NOTE]
+This data structure is under construction!  Details may change.
+
+If the feature is enabled, each allocation group has its own reverse
+block-mapping B+tree, which grows in the free space like the free space
+B+trees.  As mentioned in the chapter about
+xref:Reconstruction[reconstruction], this data structure is another piece of
+the puzzle necessary to reconstruct the data or attribute fork of a file from
+reverse-mapping records; we can also use it to double-check allocations to
+ensure that we are not accidentally cross-linking blocks, which can cause
+severe damage to the filesystem.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_RMAPBT+
+feature is enabled.  The feature requires a version 5 filesystem.
+
+Each record in the reverse-mapping B+tree has the following structure:
+
+[source, c]
+----
+struct xfs_rmap_rec {
+     __be32                     rm_startblock;
+     __be32                     rm_blockcount;
+     __be64                     rm_owner;
+     __be64                     rm_fork:1;
+     __be64                     rm_bmbt:1;
+     __be64                     rm_unwritten:1;
+     __be64                     rm_unused:7;
+     __be64                     rm_offset:54;
+};
+----
+
+*rm_startblock*::
+AG block number of this record.
+
+*rm_blockcount*::
+The length of this extent.
+
+*rm_owner*::
+A 64-bit number describing the owner of this extent.  This is typically the
+absolute inode number, but can also correspond to one of the following:
+
+.Special owner values
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_RMAP_OWN_NULL+           | No owner.  This should never appear on disk.
+| +XFS_RMAP_OWN_UNKNOWN+        | Unknown owner; for EFI recovery.  This should never appear on disk.
+| +XFS_RMAP_OWN_FS+             | Allocation group headers
+| +XFS_RMAP_OWN_LOG+            | XFS log blocks
+| +XFS_RMAP_OWN_AG+             | Per-allocation group B+tree blocks.  This means free space B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
+| +XFS_RMAP_OWN_INOBT+          | Per-allocation group inode B+tree blocks.  This includes free inode B+tree blocks.
+| +XFS_RMAP_OWN_INODES+         | Inode chunks
+|=====
+
+*rm_fork*::
+If +rm_owner+ describes an inode, this can be 1 if this record is for an
+attribute fork.
+
+*rm_bmbt*::
+If +rm_owner+ describes an inode, this can be 1 to signify that this record is
+for a block map B+tree block.  In this case, +rm_offset+ has no meaning.
+
+*rm_unwritten*::
+A flag indicating that the extent is unwritten.  This corresponds to the flag in
+the xref:Data_Extents[extent record] format which means +XFS_EXT_UNWRITTEN+.
+
+*rm_offset*::
+The 54-bit logical file block offset, if +rm_owner+ describes an inode.
+Meaningless otherwise.
+
+[NOTE]
+The single-bit flag values +rm_unwritten+, +rm_fork+, and +rm_bmbt+ are packed
+into the larger fields in the C structure definition.
+
+The key has the following structure:
+
+[source, c]
+----
+struct xfs_rmap_key {
+     __be32                     rm_startblock;
+     __be64                     rm_owner;
+     __be64                     rm_fork:1;
+     __be64                     rm_bmbt:1;
+     __be64                     rm_reserved:1;
+     __be64                     rm_unused:7;
+     __be64                     rm_offset:54;
+};
+----
+
+For the reverse-mapping B+tree on a filesystem that supports sharing of file
+data blocks, the key definition is larger than the usual AG block number.  On a
+classic XFS filesystem, each block has only one owner, which means that
++rm_startblock+ is sufficient to uniquely identify each record.  However,
+shared block support (reflink) on XFS breaks that assumption; now filesystem
+blocks can be linked to any logical block offset of any file inode.  Therefore,
+the key must include the owner and offset information to preserve the 1 to 1
+relation between key and record.
+
+* As the reference counting is AG relative, all the block numbers are only
+32-bits.
+* The +bb_magic+ value is "RMB3" (0x524d4233).
+* The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
+as the leaves.
+* Each pointer is associated with two keys.  The first of these is the "low
+key", which is the key of the smallest record accessible through the pointer.
+This low key has the same meaning as the key in all other btrees.  The second
+key is the high key, which is the maximum of the largest key that can be used
+to access a given record underneath the pointer.  Recall that each record
+in the reverse mapping b+tree describes an interval of physical blocks mapped
+to an interval of logical file block offsets; therefore, it makes sense that
+a range of keys can be used to find to a record.
+
+=== xfs_db rmapbt Example
+
+This example shows a reverse-mapping B+tree from a freshly populated root
+filesystem:
+
+----
+xfs_db> agf 0
+xfs_db> addr rmaproot
+xfs_db> p
+magic = 0x524d4233
+level = 1
+numrecs = 43
+leftsib = null
+rightsib = null
+bno = 56
+lsn = 0x3000004c8
+uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+owner = 0
+crc = 0x7cf8be6f (correct)
+keys[1-43] = [startblock,owner,offset]
+keys[1-43] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+	     offset_hi,attrfork_hi,bmbtblock_hi]
+        1:[0,-3,0,0,0,351,4418,66,0,0]
+        2:[417,285,0,0,0,827,4419,2,0,0]
+        3:[829,499,0,0,0,2352,573,55,0,0]
+        4:[1292,710,0,0,0,32168,262923,47,0,0]
+        5:[32215,-5,0,0,0,34655,2365,3411,0,0]
+        6:[34083,1161,0,0,0,34895,265220,1,0,1]
+        7:[34896,256191,0,0,0,36522,-9,0,0,0]
+        ...
+        41:[50998,326734,0,0,0,51430,-5,0,0,0]
+        42:[51431,327010,0,0,0,51600,325722,11,0,0]
+        43:[51611,327112,0,0,0,94063,23522,28375272,0,0]
+ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522
+----
+
+We arbitrarily pick pointer 17 to traverse downwards:
+
+----
+xfs_db> addr ptrs[17]
+xfs_db> p
+magic = 0x524d4233
+level = 0
+numrecs = 168
+leftsib = 36284
+rightsib = 37617
+bno = 294760
+lsn = 0x200002761
+uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+owner = 0
+crc = 0x2dad3fbe (correct)
+recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+        1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0]
+        3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0]
+        ...
+        127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0]
+        129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0]
+----
+
+Several interesting things pop out here.  The first record shows that inode
+259,615 has mapped AG block 40,326 at offset 0.  We confirm this by looking at
+the block map for that inode:
+
+----
+xfs_db> inode 259615
+xfs_db> bmap
+data offset 0 startblock 40326 (0/40326) count 1 flag 0
+----
+
+Next, notice records 127 and 128, which describe neighboring AG blocks that are
+mapped to non-contiguous logical blocks in inode 324,266.  Given the logical
+offset of 8,388,608 we surmise that this is a leaf directory, but let us
+confirm:
+
+----
+xfs_db> inode 324266
+xfs_db> p core.mode
+core.mode = 040755
+xfs_db> bmap
+data offset 0 startblock 40540 (0/40540) count 1 flag 0
+data offset 1 startblock 40542 (0/40542) count 2 flag 0
+data offset 3 startblock 40576 (0/40576) count 1 flag 0
+data offset 8388608 startblock 40541 (0/40541) count 1 flag 0
+xfs_db> p core.mode
+core.mode = 0100644
+xfs_db> dblock 0
+xfs_db> p dhdr.hdr.magic
+dhdr.hdr.magic = 0x58444433
+xfs_db> dblock 8388608
+xfs_db> p lhdr.info.hdr.magic
+lhdr.info.hdr.magic = 0x3df1
+----
+
+Indeed, this inode 324,266 appears to be a leaf directory, as it has regular
+directory data blocks at low offsets, and a single leaf block.
+
+Notice further the two reverse-mapping records with negative owners.  An owner
+of -7 corresponds to +XFS_RMAP_OWN_INODES+, which is an inode chunk, and an
+owner code of -5 corresponds to +XFS_RMAP_OWN_AG+, which covers free space
+B+trees and free space.  Let's see if block 40,544 is part of an inode chunk:
+
+----
+xfs_db> blockget
+xfs_db> fsblock 40544
+xfs_db> blockuse
+block 40544 (0/40544) type inode
+xfs_db> stack
+1:
+        byte offset 166068224, length 4096
+        buffer block 324352 (fsbno 40544), 8 bbs
+        inode 324266, dir inode 324266, type data
+xfs_db> type inode
+xfs_db> p
+core.magic = 0x494e
+----
+
+Our suspicions are confirmed.  Let's also see if 40,327 is part of a free space
+tree:
+
+----
+xfs_db> fsblock 40327
+xfs_db> blockuse
+block 40327 (0/40327) type btrmap
+xfs_db> type rmapbt
+xfs_db> p
+magic = 0x524d4233
+----
+
+As you can see, the reverse block-mapping B+tree is an important secondary
+metadata structure, which can be used to reconstruct damaged primary metadata.
+Now let's look at an extend rmap btree:
+
+----
+xfs_db> agf 0
+xfs_db> addr rmaproot
+xfs_db> p
+magic = 0x34524d42
+level = 1
+numrecs = 5
+leftsib = null
+rightsib = null
+bno = 6368
+lsn = 0x100000d1b
+uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+owner = 0
+crc = 0x8d4ace05 (correct)
+keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+1:[0,-3,0,0,0,705,132,681,0,0]
+2:[24,5761,0,0,0,548,5761,524,0,0]
+3:[24,5929,0,0,0,380,5929,356,0,0]
+4:[24,6097,0,0,0,212,6097,188,0,0]
+5:[24,6277,0,0,0,807,-7,0,0,0]
+ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
+----
+
+The second pointer stores both the low key [24,5761,0,0,0] and the high key
+[548,5761,524,0,0], which means that we can expect block 771 to contain records
+starting at physical block 24, inode 5761, offset zero; and that one of the
+records can be used to find a reverse mapping for physical block 548, inode
+5761, and offset 524:
+
+----
+xfs_db> addr ptrs[2]
+xfs_db> p
+magic = 0x34524d42
+level = 0
+numrecs = 168
+leftsib = 5
+rightsib = 9
+bno = 6168
+lsn = 0x100000d1b
+uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+owner = 0
+crc = 0xd58eff0e (correct)
+recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+1:[24,525,5761,0,0,0,0]
+2:[24,524,5762,0,0,0,0]
+3:[24,523,5763,0,0,0,0]
+...
+166:[24,360,5926,0,0,0,0]
+167:[24,359,5927,0,0,0,0]
+168:[24,358,5928,0,0,0,0]
+----
+
+Observe that the first record in the block starts at physical block 24, inode
+5761, offset zero, just as we expected.  Note that this first record is also
+indexed by the highest key as provided in the node block; physical block 548,
+inode 5761, offset 524 is the very last block mapped by this record.  Furthermore,
+note that record 168, despite being the last record in this block, has a lower
+maximum key (physical block 382, inode 5928, offset 23) than the first record.
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 62502b3..1b8658d 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -48,6 +48,8 @@ include::overview.asciidoc[]
 
 include::metadata_integrity.asciidoc[]
 
+include::reconstruction.asciidoc[]
+
 include::common_types.asciidoc[]
 
 include::magic.asciidoc[]
@@ -66,6 +68,8 @@ include::btrees.asciidoc[]
 
 include::allocation_groups.asciidoc[]
 
+include::rmapbt.asciidoc[]
+
 include::journaling_log.asciidoc[]
 
 include::internal_inodes.asciidoc[]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 6/7] xfsdocs: document refcount btree and reflink
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
                   ` (4 preceding siblings ...)
  2016-08-25 23:27 ` [PATCH 5/7] xfsdocs: reverse-mapping btree documentation Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
  6 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Document the reference count btree and talk a little bit about how
the reflink feature uses it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   25 ++-
 .../XFS_Filesystem_Structure/directories.asciidoc  |    1 
 design/XFS_Filesystem_Structure/docinfo.xml        |    2 
 .../journaling_log.asciidoc                        |  192 ++++++++++++++++++++
 design/XFS_Filesystem_Structure/magic.asciidoc     |    5 +
 .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |   25 ++-
 .../XFS_Filesystem_Structure/refcountbt.asciidoc   |  145 +++++++++++++++
 design/XFS_Filesystem_Structure/reflink.asciidoc   |   40 ++++
 design/XFS_Filesystem_Structure/rmapbt.asciidoc    |    2 
 .../xfs_filesystem_structure.asciidoc              |    4 
 10 files changed, 435 insertions(+), 6 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/refcountbt.asciidoc
 create mode 100644 design/XFS_Filesystem_Structure/reflink.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 9fcf975..cafa8b7 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -13,6 +13,7 @@ Each AG has the following characteristics:
          * Free space management
          * Inode allocation and tracking
          * Reverse block-mapping index (optional)
+         * Data block reference count index (optional)
 
 Having multiple AGs allows XFS to handle most operations in parallel without
 degrading performance as the number of concurrent accesses increases.
@@ -386,6 +387,12 @@ Reverse mapping B+tree.  Each allocation group contains a B+tree containing
 records mapping AG blocks to their owners.  See the section about
 xref:Reconstruction[reconstruction] for more details.
 
+| +XFS_SB_FEAT_RO_COMPAT_REFLINK+ |
+Reference count B+tree.  Each allocation group contains a B+tree to track the
+reference counts of AG blocks.  This enables files to share data blocks safely.
+See the section about xref:Reflink_Deduplication[reflink and deduplication] for
+more details.
+
 |=====
 
 *sb_features_incompat*::
@@ -547,8 +554,10 @@ struct xfs_agf {
      /* version 5 filesystem fields start here */
      uuid_t              agf_uuid;
      __be32              agf_rmap_blocks;
-     __be32              __pad;
-     __be64              agf_spare64[15];
+     __be32              agf_refcount_blocks;
+     __be32              agf_refcount_root;
+     __be32              agf_refcount_level;
+     __be64              agf_spare64[14];
 
      /* unlogged fields, written during buffer writeback. */
      __be64              agf_lsn;
@@ -613,6 +622,15 @@ depending on which features are set.
 *agf_rmap_blocks*::
 The size of the reverse mapping B+tree in this allocation group, in blocks.
 
+*agf_refcount_blocks*::
+The size of the reference count B+tree in this allocation group, in blocks.
+
+*agf_refcount_root*::
+Block number for the root of the reference count B+tree, if enabled.
+
+*agf_refcount_root*::
+Depth of the reference count B+tree, if enabled.
+
 *agf_spare64*::
 Empty space in the logged part of the AGF sector, for use for future features.
 
@@ -1243,4 +1261,5 @@ By placing the real time device (and the journal) on separate high-performance
 storage devices, it is possible to reduce most of the unpredictability in I/O
 response times that come from metadata operations.
 
-None of the XFS per-AG B+trees are involved with real time files.
+None of the XFS per-AG B+trees are involved with real time files.  It is not
+possible for real time files to share data blocks.
diff --git a/design/XFS_Filesystem_Structure/directories.asciidoc b/design/XFS_Filesystem_Structure/directories.asciidoc
index bccf912..1758c4e 100644
--- a/design/XFS_Filesystem_Structure/directories.asciidoc
+++ b/design/XFS_Filesystem_Structure/directories.asciidoc
@@ -1419,6 +1419,7 @@ The hash value of a particular record.
 The directory/attribute logical block containing all entries up to the
 corresponding hash value.
 
+//
 * The freeindex's +bests+ array starts from the end of the block and grows to the
 start of the block.
 
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 44f944a..f5e62bc 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -136,6 +136,8 @@
 				<member>Move the b+tree info to a separate chapter.</member>
 				<member>Discuss overlapping interval b+trees.</member>
 				<member>Discuss new log items for atomic updates.</member>
+				<member>Document the reference-count btree.</member>
+				<member>Discuss block sharing, reflink, &amp; deduplication.</member>
 			</simplelist>
 		</revdescription>
 	</revision>
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index 78ce436..0aec036 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -211,6 +211,10 @@ magic number to distinguish themselves.  Buffer data items only appear after
 | +XFS_LI_ICREATE+		| 0x123f        | xref:Inode_Create_Log_Item[Inode Creation]
 | +XFS_LI_RUI+			| 0x1240        | xref:RUI_Log_Item[Reverse Mapping Update Intent]
 | +XFS_LI_RUD+			| 0x1241        | xref:RUD_Log_Item[Reverse Mapping Update Done]
+| +XFS_LI_CUI+			| 0x1242        | xref:CUI_Log_Item[Reference Count Update Intent]
+| +XFS_LI_CUD+			| 0x1243        | xref:CUD_Log_Item[Reference Count Update Done]
+| +XFS_LI_BUI+			| 0x1244        | xref:BUI_Log_Item[File Block Mapping Update Intent]
+| +XFS_LI_BUD+			| 0x1245        | xref:BUD_Log_Item[File Block Mapping Update Done]
 |=====
 
 [[Log_Transaction_Headers]]
@@ -508,6 +512,194 @@ Size of this log item.  Should be 1.
 *rud_rui_id*::
 A 64-bit number that binds the corresponding RUI log item to this RUD log item.
 
+[[CUI_Log_Item]]
+=== Reference Count Updates Intent
+
+The next two operation types work together to handle reference count updates.
+Naturally, the ranges of extents having reference count updates can be
+expressed in terms of physical extents:
+
+[source, c]
+----
+struct xfs_phys_extent {
+     __uint64_t                pe_startblock;
+     __uint32_t                pe_len;
+     __uint32_t                pe_flags;
+};
+----
+
+*pe_startblock*::
+Filesystem block of this extent.
+
+*pe_len*::
+The length of this extent.
+
+*pe_flags*::
+The lower byte of this field is a type code indicating what sort of
+reverse mapping operation we want.  The upper three bytes are flag bits.
+
+.Reference count update log intent types
+[options="header"]
+|=====
+| Value				  | Description
+| +XFS_REFCOUNT_EXTENT_INCREASE+  | Increase the reference count for this extent.
+| +XFS_REFCOUNT_EXTENT_DECREASE+  | Decrease the reference count for this extent.
+| +XFS_REFCOUNT_EXTENT_ALLOC_COW+ | Reserve an extent for staging copy on write.
+| +XFS_REFCOUNT_EXTENT_FREE_COW+  | Unreserve an extent for staging copy on write.
+|=====
+
+The ``reference count update intent'' operation comes first; it tells the log
+that XFS wants to update some reference counts.  This record is crucial for
+correct log recovery because it enables us to spread a complex metadata update
+across multiple transactions while ensuring that a crash midway through the
+complex update will be replayed fully during log recovery.
+
+[source, c]
+----
+struct xfs_cui_log_format {
+     __uint16_t                cui_type;
+     __uint16_t                cui_size;
+     __uint32_t                cui_nextents;
+     __uint64_t                cui_id;
+     struct xfs_map_extent     cui_extents[1];
+};
+----
+
+*cui_type*::
+The signature of an CUI operation, 0x1242.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*cui_size*::
+Size of this log item.  Should be 1.
+
+*cui_nextents*::
+Number of reference count updates.
+
+*cui_id*::
+A 64-bit number that binds the corresponding RUD log item to this RUI log item.
+
+*cui_extents*::
+Variable-length array of reference count update information.
+
+[[CUD_Log_Item]]
+=== Completion of Reference Count Updates
+
+The ``reference count update done'' operation complements the ``reference count
+update intent'' operation.  This second operation indicates that the update
+actually happened, so that log recovery needn't replay the update.  The CUD and
+the actual updates are typically found in a new transaction following the
+transaction in which the CUI was logged.
+
+[source, c]
+----
+struct xfs_cud_log_format {
+      __uint16_t               cud_type;
+      __uint16_t               cud_size;
+      __uint32_t               __pad;
+      __uint64_t               cud_cui_id;
+};
+----
+
+*cud_type*::
+The signature of an RUD operation, 0x1243.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*cud_size*::
+Size of this log item.  Should be 1.
+
+*cud_cui_id*::
+A 64-bit number that binds the corresponding CUI log item to this CUD log item.
+
+[[BUI_Log_Item]]
+=== File Block Mapping Intent
+
+The next two operation types work together to handle deferred file block
+mapping updates.  The extents to be mapped are expressed via the
++xfs_map_extent+ structure discussed in the section about
+xref:RUI_Log_Item[reverse mapping intents].
+
+The lower byte of the +me_flags+ field is a type code indicating what sort of
+file block mapping operation we want.  The upper three bytes are flag bits.
+
+.File block mapping update log intent types
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_BMAP_EXTENT_MAP+		| Add a mapping for file data.
+| +XFS_BMAP_EXTENT_UNMAP+	| Remove a mapping for file data.
+|=====
+
+.File block mapping update log intent flags
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_BMAP_EXTENT_ATTR_FORK+	| Extent is for the attribute fork.
+| +XFS_BMAP_EXTENT_UNWRITTEN+	| Extent is unwritten.
+|=====
+
+The ``file block mapping update intent'' operation comes first; it tells the
+log that XFS wants to map or unmap some extents in a file.  This record is
+crucial for correct log recovery because it enables us to spread a complex
+metadata update across multiple transactions while ensuring that a crash midway
+through the complex update will be replayed fully during log recovery.
+
+[source, c]
+----
+struct xfs_bui_log_format {
+     __uint16_t                bui_type;
+     __uint16_t                bui_size;
+     __uint32_t                bui_nextents;
+     __uint64_t                bui_id;
+     struct xfs_map_extent     bui_extents[1];
+};
+----
+
+*bui_type*::
+The signature of an BUI operation, 0x1244.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*bui_size*::
+Size of this log item.  Should be 1.
+
+*bui_nextents*::
+Number of file mappings.  Should be 1.
+
+*bui_id*::
+A 64-bit number that binds the corresponding BUD log item to this BUI log item.
+
+*bui_extents*::
+Variable-length array of file block mappings to update.  There should only
+be one mapping present.
+
+[[BUD_Log_Item]]
+=== Completion of File Block Mapping Updates
+
+The ``file block mapping update done'' operation complements the ``file block
+mapping update intent'' operation.  This second operation indicates that the
+update actually happened, so that log recovery needn't replay the update.  The
+BUD and the actual updates are typically found in a new transaction following
+the transaction in which the BUI was logged.
+
+[source, c]
+----
+struct xfs_bud_log_format {
+      __uint16_t               bud_type;
+      __uint16_t               bud_size;
+      __uint32_t               __pad;
+      __uint64_t               bud_bui_id;
+};
+----
+
+*bud_type*::
+The signature of an BUD operation, 0x1245.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*bud_size*::
+Size of this log item.  Should be 1.
+
+*bud_bui_id*::
+A 64-bit number that binds the corresponding BUI log item to this BUD log item.
+
 [[Inode_Log_Item]]
 === Inode Updates
 
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index 10fd15f..bc172f3 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -45,6 +45,7 @@ relevant chapters.  Magic numbers tend to have consistent locations:
 | +XFS_ATTR3_LEAF_MAGIC+	| 0x3bee	|     	| xref:Leaf_Attributes[Leaf Attribute], v5 only
 | +XFS_ATTR3_RMT_MAGIC+		| 0x5841524d	| XARM	| xref:Remote_Values[Remote Attribute Value], v5 only
 | +XFS_RMAP_CRC_MAGIC+		| 0x524d4233	| RMB3	| xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
+| +XFS_REFC_CRC_MAGIC+		| 0x52334643	| R3FC	| xref:Reference_Count_Btree[Reference Count B+tree], v5 only
 |=====
 
 The magic numbers for log items are at offset zero in each log item, but items
@@ -64,6 +65,10 @@ are not aligned to blocks.
 | +XFS_LI_ICREATE+		| 0x123f        |       | xref:Inode_Create_Log_Item[Inode Creation Log Item]
 | +XFS_LI_RUI+			| 0x1240        |       | xref:RUI_Log_Item[Reverse Mapping Update Intent]
 | +XFS_LI_RUD+			| 0x1241        |       | xref:RUD_Log_Item[Reverse Mapping Update Done]
+| +XFS_LI_CUI+			| 0x1242        |       | xref:CUI_Log_Item[Reference Count Update Intent]
+| +XFS_LI_CUD+			| 0x1243        |       | xref:CUD_Log_Item[Reference Count Update Done]
+| +XFS_LI_BUI+			| 0x1244        |       | xref:BUI_Log_Item[File Block Mapping Update Intent]
+| +XFS_LI_BUD+			| 0x1245        |       | xref:BUD_Log_Item[File Block Mapping Update Done]
 |=====
 
 = Theoretical Limits
diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index dc1fad2..4415c38 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -109,7 +109,8 @@ struct xfs_dinode_core {
      __be64                    di_changecount;
      __be64                    di_lsn;
      __be64                    di_flags2;
-     __u8                      di_pad2[16];
+     __be32                    di_cowextsize;
+     __u8                      di_pad2[12];
      xfs_timestamp_t           di_crtime;
      __be64                    di_ino;
      uuid_t                    di_uuid;
@@ -215,7 +216,7 @@ including relevant metadata like B+trees. This does not include blocks used for
 extended attributes.
 
 *di_extsize*::
-Specifies the extent size for filesystems with real-time devices and an extent
+Specifies the extent size for filesystems with real-time devices or an extent
 size hint for standard filesystems. For normal filesystems, and with
 directories, the +XFS_DIFLAG_EXTSZINHERIT+ flag must be set in +di_flags+ if
 this field is used. Inodes created in these directories will inherit the
@@ -279,7 +280,7 @@ For directory inodes, new inodes inherit the +di_projid+ value.
 For directory inodes, symlinks cannot be created.
 
 | +XFS_DIFLAG_EXTSIZE+		|
-Specifies the extent size for real-time files or a and extent size hint for regular files.
+Specifies the extent size for real-time files or an extent size hint for regular files.
 
 | +XFS_DIFLAG_EXTSZINHERIT+	|
 For directory inodes, new inodes inherit the +di_extsize+ value.
@@ -323,8 +324,26 @@ Specifies extended flags associated with a v3 inode.
 | +XFS_DIFLAG2_DAX+		|
 For a file, enable DAX to increase performance on persistent-memory storage.
 If set on a directory, files created in the directory will inherit this flag.
+| +XFS_DIFLAG2_REFLINK+		|
+This inode shares (or has shared) data blocks with another inode.
+| +XFS_DIFLAG2_COWEXTSIZE+	|
+For files, this is the extent size hint for copy on write operations; see
++di_cowextsize+ for details.  For directories, the value in +di_cowextsize+
+will be copied to all newly created files and directories.
 |=====
 
+*di_cowextsize*::
+Specifies the extent size hint for copy on write operations.  When allocating
+extents for a copy on write operation, the allocator will be asked to align
+its allocations to either +di_cowextsize+ blocks or +di_extsize+ blocks,
+whichever is greater.  The +XFS_DIFLAG2_COWEXTSIZE+ flag must be set if this
+field is used.  If this field and its flag are set on a directory file, the
+value will be copied into any files or directories created within this
+directory.  During a block sharing operation, this value will be copied from
+the source file to the destination file if the sharing operation completely
+overwrites the destination file's contents and the destination file does not
+already have +di_cowextsize+ set.
+
 *di_pad2*::
 Padding for future expansion of the inode.
 
diff --git a/design/XFS_Filesystem_Structure/refcountbt.asciidoc b/design/XFS_Filesystem_Structure/refcountbt.asciidoc
new file mode 100644
index 0000000..dbbb98e
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/refcountbt.asciidoc
@@ -0,0 +1,145 @@
+[[Reference_Count_Btree]]
+== Reference Count B+tree
+
+[NOTE]
+This data structure is under construction!  Details may change.
+
+To support the sharing of file data blocks (reflink), each allocation group has
+its own reference count B+tree, which grows in the allocated space like the
+inode B+trees.  This data could be gleaned by performing an interval query of
+the reverse-mapping B+tree, but doing so would come at a huge performance
+penalty.  Therefore, this data structure is a cache of computable information.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_REFLINK+
+feature is enabled.  The feature requires a version 5 filesystem.
+
+Each record in the reference count B+tree has the following structure:
+
+[source, c]
+----
+struct xfs_refcount_rec {
+     __be32                     rc_startblock;
+     __be32                     rc_blockcount;
+     __be32                     rc_refcount;
+};
+----
+
+*rc_startblock*::
+AG block number of this record.
+
+*rc_blockcount*::
+The length of this extent.
+
+*rc_refcount*::
+Number of mappings of this filesystem extent.
+
+Node pointers are an AG relative block pointer:
+
+[source, c]
+----
+struct xfs_refcount_key {
+     __be32                     rc_startblock;
+};
+----
+
+* As the reference counting is AG relative, all the block numbers are only
+32-bits.
+* The +bb_magic+ value is "R3FC" (0x52334643).
+* The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
+as the leaves.
+
+=== xfs_db refcntbt Example
+
+For this example, an XFS filesystem was populated with a root filesystem and
+a deduplication program was run to create shared blocks:
+
+----
+xfs_db> agf 0
+xfs_db> addr refcntroot
+xfs_db> p
+magic = 0x52334643
+level = 1
+numrecs = 6
+leftsib = null
+rightsib = null
+bno = 36892
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0x75f35128 (correct)
+keys[1-6] = [startblock] 1:[14] 2:[65633] 3:[65780] 4:[94571] 5:[117201] 6:[152442]
+ptrs[1-6] = 1:7 2:25836 3:25835 4:18447 5:18445 6:18449
+xfs_db> addr ptrs[3]
+xfs_db> p
+magic = 0x52334643
+level = 0
+numrecs = 80
+leftsib = 25836
+rightsib = 18447
+bno = 51670
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0xc3962813 (correct)
+recs[1-80] = [startblock,blockcount,refcount]
+        1:[65780,1,2] 2:[65781,1,3] 3:[65785,2,2] 4:[66640,1,2]
+        5:[69602,4,2] 6:[72256,16,2] 7:[72871,4,2] 8:[72879,20,2]
+        9:[73395,4,2] 10:[75063,4,2] 11:[79093,4,2] 12:[86344,16,2]
+----
+
+Record 6 in the reference count B+tree for AG 0 indicates that the AG extent
+starting at block 72,256 and running for 16 blocks has a reference count of 2.
+This means that there are two files sharing the block:
+
+----
+xfs_db> blockget -n
+xfs_db> fsblock 72256
+xfs_db> blockuse
+block 72256 (0/72256) type rldata inode 25169197
+----
+
+The blockuse type changes to ``rldata'' to indicate that the block is shared
+data.  Unfortunately, blockuse only tells us about one block owner.  If we
+happen to have enabled the reverse-mapping B+tree, we can use it to find all
+inodes that own this block:
+
+----
+xfs_db> agf 0
+xfs_db> addr rmaproot
+...
+xfs_db> addr ptrs[3]
+...
+xfs_db> addr ptrs[7]
+xfs_db> p
+magic = 0x524d4233
+level = 0
+numrecs = 22
+leftsib = 65057
+rightsib = 65058
+bno = 291478
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0xed7da3f7 (correct)
+recs[1-22] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+        1:[68957,8,3201,0,0,0,0] 2:[68965,4,25260953,0,0,0,0]
+        ...
+        18:[72232,58,3227,0,0,0,0] 19:[72256,16,25169197,24,0,0,0]
+        20:[72290,75,3228,0,0,0,0] 21:[72365,46,3229,0,0,0,0]
+----
+
+Records 18 and 19 intersect the block 72,256; they tell us that inodes 3,227
+and 25,169,197 both claim ownership.  Let us confirm this:
+
+----
+xfs_db> inode 25169197
+xfs_db> bmap
+data offset 0 startblock 12632259 (3/49347) count 24 flag 0
+data offset 24 startblock 72256 (0/72256) count 16 flag 0
+data offset 40 startblock 12632299 (3/49387) count 18 flag 0
+xfs_db> inode 3227
+xfs_db> bmap
+data offset 0 startblock 72232 (0/72232) count 58 flag 0
+----
+
+Inodes 25,169,197 and 3,227 both contain mappings to block 0/72,256.
diff --git a/design/XFS_Filesystem_Structure/reflink.asciidoc b/design/XFS_Filesystem_Structure/reflink.asciidoc
new file mode 100644
index 0000000..8f52b90
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/reflink.asciidoc
@@ -0,0 +1,40 @@
+[[Reflink_Deduplication]]
+= Sharing Data Blocks
+
+On a traditional filesystem, there is a 1:1 mapping between a logical block
+offset in a file and a physical block on disk, which is to say that physical
+blocks are not shared.  However, there exist various use cases for being able
+to share blocks between files -- deduplicating files saves space on archival
+systems; creating space-efficient clones of disk images for virtual machines
+and containers facilitates efficient datacenters; and deferring the payment of
+the allocation cost of a file system tree copy as long as possible makes
+regular work faster.  In all of these cases, a write to one of the shared
+copies *must* not affect the other shared copies, which means that writes to
+shared blocks must employ a copy-on-write strategy.  Sharing blocks in this
+manner is commonly referred to as ``reflinking''.
+
+XFS implements block sharing in a fairly straightforward manner.  All existing
+data fork structures remain unchanged, save for the addition of a
+per-allocation group xref:Reference_Count_Btree[reference count B+tree].  This
+data structure tracks reference counts for all shared physical blocks, with a
+few rules to maintain compatibility with existing code: If a block is free, it
+will be tracked in the free space B+trees.  If a block is owned by a single
+file, it appears in neither the free space nor the reference count B+trees.  If
+a block is shared, it will appear in the reference count B+tree with a
+reference count >= 2.  The first two cases are established precedent in XFS, so
+the third case is the only behavioral change.
+
+When a filesystem block is shared, the block mapping in the destination file is
+updated to point to that filesystem block and the reference count B+tree records
+are updated to reflect the increased refcount.  If a shared block is written, a
+new block will be allocated, the dirty data written to this new block, and the
+file's block mapping updated to point to the new block.  If a shared block is
+unmapped, the reference count records are updated to reflect the decreased
+refcount and the block is also freed if its reference count becomes zero.  This
+enables users to create space efficient clones of disk images and to copy
+filesystem subtrees quickly, using the standard Linux coreutils packages.
+
+Deduplication employs the same mechanism to share blocks and copy them at write
+time.  However, the kernel confirms that the contents of both files are
+identical before updating the destination file's mapping.  This enables XFS to
+be used by userspace deduplication programs such as +duperemove+.
diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
index a8a210b..0ec72c1 100644
--- a/design/XFS_Filesystem_Structure/rmapbt.asciidoc
+++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
@@ -53,6 +53,8 @@ absolute inode number, but can also correspond to one of the following:
 | +XFS_RMAP_OWN_AG+             | Per-allocation group B+tree blocks.  This means free space B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
 | +XFS_RMAP_OWN_INOBT+          | Per-allocation group inode B+tree blocks.  This includes free inode B+tree blocks.
 | +XFS_RMAP_OWN_INODES+         | Inode chunks
+| +XFS_RMAP_OWN_REFC+           | Per-allocation group refcount B+tree blocks.  This will be used for reflink support.
+| +XFS_RMAP_OWN_COW+		| Blocks that have been reserved for a copy-on-write operation that has not completed.
 |=====
 
 *rm_fork*::
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 1b8658d..7916fbe 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -48,6 +48,8 @@ include::overview.asciidoc[]
 
 include::metadata_integrity.asciidoc[]
 
+include::reflink.asciidoc[]
+
 include::reconstruction.asciidoc[]
 
 include::common_types.asciidoc[]
@@ -70,6 +72,8 @@ include::allocation_groups.asciidoc[]
 
 include::rmapbt.asciidoc[]
 
+include::refcountbt.asciidoc[]
+
 include::journaling_log.asciidoc[]
 
 include::internal_inodes.asciidoc[]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree
  2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
                   ` (5 preceding siblings ...)
  2016-08-25 23:27 ` [PATCH 6/7] xfsdocs: document refcount btree and reflink Darrick J. Wong
@ 2016-08-25 23:27 ` Darrick J. Wong
  2016-09-08  1:38   ` Dave Chinner
  6 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2016-08-25 23:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |    8 +
 design/XFS_Filesystem_Structure/docinfo.xml        |   14 +
 .../internal_inodes.asciidoc                       |    2 
 design/XFS_Filesystem_Structure/magic.asciidoc     |    1 
 .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |    6 -
 design/XFS_Filesystem_Structure/rtrmapbt.asciidoc  |  234 ++++++++++++++++++++
 6 files changed, 263 insertions(+), 2 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/rtrmapbt.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index cafa8b7..7ba636a 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -105,6 +105,7 @@ struct xfs_sb
 	xfs_ino_t		sb_pquotino;
 	xfs_lsn_t		sb_lsn;
 	uuid_t			sb_meta_uuid;
+	xfs_ino_t		sb_rrmapino;
 };
 ----
 *sb_magicnum*::
@@ -449,6 +450,13 @@ If the +XFS_SB_FEAT_INCOMPAT_META_UUID+ feature is set, then the UUID field in
 all metadata blocks must match this UUID.  If not, the block header UUID field
 must match +sb_uuid+.
 
+*sb_rrmapino*::
+If the +XFS_SB_FEAT_COMPAT_RMAPBT+ feature is set and a real-time
+device is present (+sb_rblocks+ > 0), this field points to an inode
+that contains the root to the
+xref:Real_time_Reverse_Mapping_Btree[Real-Time Reverse Mapping B+tree].
+This field is zero otherwise.
+
 === xfs_db Superblock Example
 
 A filesystem is made on a single disk with the following command:
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index f5e62bc..5cdcf6c 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -141,4 +141,18 @@
 			</simplelist>
 		</revdescription>
 	</revision>
+	<revision>
+		<revnumber>3.1415</revnumber>
+		<date>July 2016</date>
+		<author>
+			<firstname>Darrick</firstname>
+			<surname>Wong</surname>
+			<email></email>
+		</author>
+		<revdescription>
+			<simplelist>
+				<member>Document the real-time reverse-mapping btree.</member>
+			</simplelist>
+		</revdescription>
+	</revision>
 </revhistory>
diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
index 9ace3ea..e6bf75f 100644
--- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
+++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
@@ -201,3 +201,5 @@ rtbitmap location, and positive if there are any.
 This data structure is not particularly space efficient, however it is a very
 fast way to provide the same data as the two free space B+trees for regular
 files since the space is preallocated and metadata maintenance is minimal.
+
+include::rtrmapbt.asciidoc[]
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index bc172f3..77bed6d 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -45,6 +45,7 @@ relevant chapters.  Magic numbers tend to have consistent locations:
 | +XFS_ATTR3_LEAF_MAGIC+	| 0x3bee	|     	| xref:Leaf_Attributes[Leaf Attribute], v5 only
 | +XFS_ATTR3_RMT_MAGIC+		| 0x5841524d	| XARM	| xref:Remote_Values[Remote Attribute Value], v5 only
 | +XFS_RMAP_CRC_MAGIC+		| 0x524d4233	| RMB3	| xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
+| +XFS_RTRMAP_CRC_MAGIC+	| 0x4d415052	| MAPR	| xref:Real_time_Reverse_Mapping_Btree[Real-Time Reverse Mapping B+tree], v5 only
 | +XFS_REFC_CRC_MAGIC+		| 0x52334643	| R3FC	| xref:Reference_Count_Btree[Reference Count B+tree], v5 only
 |=====
 
diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index 4415c38..02d44ac 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -141,7 +141,8 @@ the associated metadata or data; or ``btree'' where the inode contains a B+tree
 root node which points to filesystem blocks containing the metadata or data.
 Migration between the formats depends on the amount of metadata associated with
 the inode. ``dev'' is used for character and block devices while ``uuid'' is
-currently not used.
+currently not used.  ``rmap'' indicates that a reverse-mapping B+tree
+is rooted in the fork.
 
 [source, c]
 ----
@@ -150,7 +151,8 @@ typedef enum xfs_dinode_fmt {
      XFS_DINODE_FMT_LOCAL,
      XFS_DINODE_FMT_EXTENTS,
      XFS_DINODE_FMT_BTREE,
-     XFS_DINODE_FMT_UUID
+     XFS_DINODE_FMT_UUID,
+     XFS_DINODE_FMT_RMAP,
 } xfs_dinode_fmt_t;
 ----
 
diff --git a/design/XFS_Filesystem_Structure/rtrmapbt.asciidoc b/design/XFS_Filesystem_Structure/rtrmapbt.asciidoc
new file mode 100644
index 0000000..3a109b2
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/rtrmapbt.asciidoc
@@ -0,0 +1,234 @@
+[[Real_time_Reverse_Mapping_Btree]]
+=== Real-Time Reverse-Mapping B+tree
+
+[NOTE]
+This data structure is under construction!  Details may change.
+
+If the reverse-mapping B+tree and real-time storage device features
+are enabled, the real-time device has its own reverse block-mapping
+B+tree.
+
+As mentioned in the chapter about xref:Reconstruction[reconstruction],
+this data structure is another piece of the puzzle necessary to
+reconstruct the data or attribute fork of a file from reverse-mapping
+records; we can also use it to double-check allocations to ensure that
+we are not accidentally cross-linking blocks, which can cause severe
+damage to the filesystem.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_RMAPBT+
+feature is enabled and a real time device is present.  The feature
+requires a version 5 filesystem.
+
+The real-time reverse mapping B+tree is rooted in an inode's data
+fork; the inode number is given by the +sb_rrmapino+ field in the
+superblock.  The B+tree blocks themselves are stored in the regular
+filesystem.  The structures used for an inode's B+tree root are:
+
+[source, c]
+----
+struct xfs_rtrmap_root {
+     __be16                     bb_level;
+     __be16                     bb_numrecs;
+};
+----
+
+* On disk, the B+tree node starts with the +xfs_rtrmap_root+ header
+followed by an array of +xfs_rtrmap_key+ values and then an array of
++xfs_rtrmap_ptr_t+ values. The size of both arrays is specified by the
+header's +bb_numrecs+ value.
+
+* The root node in the inode can only contain up to 10 key/pointer
+pairs for a standard 512 byte inode before a new level of nodes is
+added between the root and the leaves.  +di_forkoff+ should always
+be zero, because there are no extended attributes.
+
+Each record in the real-time reverse-mapping B+tree has the following
+structure:
+
+[source, c]
+----
+struct xfs_rtrmap_rec {
+     __be64                     rm_startblock;
+     __be64                     rm_blockcount;
+     __be64                     rm_owner;
+     __be64                     rm_fork:1;
+     __be64                     rm_bmbt:1;
+     __be64                     rm_unwritten:1;
+     __be64                     rm_unused:7;
+     __be64                     rm_offset:54;
+};
+----
+
+*rm_startblock*::
+Real-time device block number of this record.
+
+*rm_blockcount*::
+The length of this extent, in real-time blocks.
+
+*rm_owner*::
+A 64-bit number describing the owner of this extent.  This must be an
+inode number, because the real-time device is for file data only.
+
+*rm_fork*::
+If +rm_owner+ describes an inode, this can be 1 if this record is for
+an attribute fork.  This value will always be zero for real-time
+extents.
+
+*rm_bmbt*::
+If +rm_owner+ describes an inode, this can be 1 to signify that this
+record is for a block map B+tree block.  In this case, +rm_offset+ has
+no meaning.  This value will always be zero for real-time extents.
+
+*rm_unwritten*::
+A flag indicating that the extent is unwritten.  This corresponds to
+the flag in the xref:Data_Extents[extent record] format which means
++XFS_EXT_UNWRITTEN+.
+
+*rm_offset*::
+The 54-bit logical file block offset, if +rm_owner+ describes an
+inode.
+
+[NOTE]
+The single-bit flag values +rm_unwritten+, +rm_fork+, and +rm_bmbt+
+are packed into the larger fields in the C structure definition.
+
+The key has the following structure:
+
+[source, c]
+----
+struct xfs_rtrmap_key {
+     __be64                     rm_startblock;
+     __be64                     rm_owner;
+     __be64                     rm_fork:1;
+     __be64                     rm_bmbt:1;
+     __be64                     rm_reserved:1;
+     __be64                     rm_unused:7;
+     __be64                     rm_offset:54;
+};
+----
+
+* All block numbers are 64-bit real-time device block numbers.
+
+* The +bb_magic+ value is ``MAPR'' (0x4d415052).
+
+* The +xfs_btree_lblock_t+ header is used for intermediate B+tree node as well
+as the leaves.
+
+* Each pointer is associated with two keys.  The first of these is the
+"low key", which is the key of the smallest record accessible through
+the pointer.  This low key has the same meaning as the key in all
+other btrees.  The second key is the high key, which is the maximum of
+the largest key that can be used to access a given record underneath
+the pointer.  Recall that each record in the real-time reverse mapping
+b+tree describes an interval of physical blocks mapped to an interval
+of logical file block offsets; therefore, it makes sense that a range
+of keys can be used to find to a record.
+
+==== xfs_db rtrmapbt Example
+
+This example shows a real-time reverse-mapping B+tree from a freshly
+populated root filesystem:
+
+----
+xfs_db> sb 0
+xfs_db> addr rrmapino
+xfs_db> p
+core.magic = 0x494e
+core.mode = 0100000
+core.version = 3
+core.format = 5 (rtrmapbt)
+...
+u3.rtrmapbt.level = 3
+u3.rtrmapbt.numrecs = 1
+u3.rtrmapbt.keys[1] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,
+		       owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+	1:[1,132,1,0,0,1705337,133,54431,0,0]
+u3.rtrmapbt.ptrs[1] = 1:671
+xfs_db> addr u3.rtrmapbt.ptrs[1]
+xfs_db> p
+magic = 0x4d415052
+level = 2
+numrecs = 8
+leftsib = null
+rightsib = null
+bno = 5368
+lsn = 0x400000000
+uuid = 98bbde42-67e7-46a5-a73e-d64a76b1b5ce
+owner = 131
+crc = 0x2560d199 (correct)
+keys[1-8] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+	     offset_hi,attrfork_hi,bmbtblock_hi]
+	1:[1,132,1,0,0,17749,132,17749,0,0]
+	2:[17751,132,17751,0,0,35499,132,35499,0,0]
+	3:[35501,132,35501,0,0,53249,132,53249,0,0]
+	4:[53251,132,53251,0,0,1658473,133,7567,0,0]
+	5:[1658475,133,7569,0,0,1667473,133,16567,0,0]
+	6:[1667475,133,16569,0,0,1685223,133,34317,0,0]
+	7:[1685225,133,34319,0,0,1694223,133,43317,0,0]
+	8:[1694225,133,43319,0,0,1705337,133,54431,0,0]
+ptrs[1-8] = 1:134 2:238 3:345 4:453 5:795 6:563 7:670 8:780
+----
+
+We arbitrarily pick pointer 7 (twice) to traverse downwards:
+
+----
+xfs_db> addr ptrs[7]
+xfs_db> p
+magic = 0x4d415052
+level = 1
+numrecs = 36
+leftsib = 563
+rightsib = 780
+bno = 5360
+lsn = 0
+uuid = 98bbde42-67e7-46a5-a73e-d64a76b1b5ce
+owner = 131
+crc = 0x6807761d (correct)
+keys[1-36] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+	      offset_hi,attrfork_hi,bmbtblock_hi]
+	1:[1685225,133,34319,0,0,1685473,133,34567,0,0]
+	2:[1685475,133,34569,0,0,1685723,133,34817,0,0]
+	3:[1685725,133,34819,0,0,1685973,133,35067,0,0]
+	...
+	34:[1693475,133,42569,0,0,1693723,133,42817,0,0]
+	35:[1693725,133,42819,0,0,1693973,133,43067,0,0]
+	36:[1693975,133,43069,0,0,1694223,133,43317,0,0]
+ptrs[1-36] = 1:669 2:672 3:674...34:722 35:723 36:725
+xfs_db> addr ptrs[7]
+xfs_db> p
+magic = 0x4d415052
+level = 0
+numrecs = 125
+leftsib = 678
+rightsib = 681
+bno = 5440
+lsn = 0
+uuid = 98bbde42-67e7-46a5-a73e-d64a76b1b5ce
+owner = 131
+crc = 0xefce34d4 (correct)
+recs[1-125] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+	1:[1686725,1,133,35819,0,0,0]
+	2:[1686727,1,133,35821,0,0,0]
+	3:[1686729,1,133,35823,0,0,0]
+	...
+	123:[1686969,1,133,36063,0,0,0]
+	124:[1686971,1,133,36065,0,0,0]
+	125:[1686973,1,133,36067,0,0,0]
+----
+
+Several interesting things pop out here.  The first record shows that
+inode 133 has mapped real-time block 1,686,725 at offset 35,819.  We
+confirm this by looking at the block map for that inode:
+
+----
+xfs_db> inode 133
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap
+data offset 35817 startblock 1686723 (1/638147) count 1 flag 0
+data offset 35819 startblock 1686725 (1/638149) count 1 flag 0
+data offset 35821 startblock 1686727 (1/638151) count 1 flag 0
+----
+
+Notice that inode 133 has the real-time flag set, which means that its
+data blocks are all allocated from the real-time device.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree
  2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
@ 2016-09-08  1:38   ` Dave Chinner
  2016-09-08  2:03     ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-09-08  1:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, xfs

On Thu, Aug 25, 2016 at 04:27:42PM -0700, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  .../allocation_groups.asciidoc                     |    8 +
>  design/XFS_Filesystem_Structure/docinfo.xml        |   14 +
>  .../internal_inodes.asciidoc                       |    2 
>  design/XFS_Filesystem_Structure/magic.asciidoc     |    1 
>  .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |    6 -
>  design/XFS_Filesystem_Structure/rtrmapbt.asciidoc  |  234 ++++++++++++++++++++
>  6 files changed, 263 insertions(+), 2 deletions(-)
>  create mode 100644 design/XFS_Filesystem_Structure/rtrmapbt.asciidoc
> 
> 
> diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> index cafa8b7..7ba636a 100644
> --- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> +++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> @@ -105,6 +105,7 @@ struct xfs_sb
>  	xfs_ino_t		sb_pquotino;
>  	xfs_lsn_t		sb_lsn;
>  	uuid_t			sb_meta_uuid;
> +	xfs_ino_t		sb_rrmapino;
>  };
>  ----
>  *sb_magicnum*::
> @@ -449,6 +450,13 @@ If the +XFS_SB_FEAT_INCOMPAT_META_UUID+ feature is set, then the UUID field in
>  all metadata blocks must match this UUID.  If not, the block header UUID field
>  must match +sb_uuid+.
>  
> +*sb_rrmapino*::
> +If the +XFS_SB_FEAT_COMPAT_RMAPBT+ feature is set and a real-time

XFS_SB_FEAT_RO_COMPAT_RMAPBT?

(yes, I am reading these patches!)

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree
  2016-09-08  1:38   ` Dave Chinner
@ 2016-09-08  2:03     ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2016-09-08  2:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, xfs

On Thu, Sep 08, 2016 at 11:38:38AM +1000, Dave Chinner wrote:
> On Thu, Aug 25, 2016 at 04:27:42PM -0700, Darrick J. Wong wrote:
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  .../allocation_groups.asciidoc                     |    8 +
> >  design/XFS_Filesystem_Structure/docinfo.xml        |   14 +
> >  .../internal_inodes.asciidoc                       |    2 
> >  design/XFS_Filesystem_Structure/magic.asciidoc     |    1 
> >  .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |    6 -
> >  design/XFS_Filesystem_Structure/rtrmapbt.asciidoc  |  234 ++++++++++++++++++++
> >  6 files changed, 263 insertions(+), 2 deletions(-)
> >  create mode 100644 design/XFS_Filesystem_Structure/rtrmapbt.asciidoc
> > 
> > 
> > diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > index cafa8b7..7ba636a 100644
> > --- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > +++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
> > @@ -105,6 +105,7 @@ struct xfs_sb
> >  	xfs_ino_t		sb_pquotino;
> >  	xfs_lsn_t		sb_lsn;
> >  	uuid_t			sb_meta_uuid;
> > +	xfs_ino_t		sb_rrmapino;
> >  };
> >  ----
> >  *sb_magicnum*::
> > @@ -449,6 +450,13 @@ If the +XFS_SB_FEAT_INCOMPAT_META_UUID+ feature is set, then the UUID field in
> >  all metadata blocks must match this UUID.  If not, the block header UUID field
> >  must match +sb_uuid+.
> >  
> > +*sb_rrmapino*::
> > +If the +XFS_SB_FEAT_COMPAT_RMAPBT+ feature is set and a real-time
> 
> XFS_SB_FEAT_RO_COMPAT_RMAPBT?
> 
> (yes, I am reading these patches!)

Woohoo!!!

Thank you for catching this! :)

--D

> 
> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-09-08  2:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
2016-08-25 23:27 ` [PATCH 2/7] xfsdocs: document known testing procedures Darrick J. Wong
2016-08-25 23:27 ` [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5 Darrick J. Wong
2016-08-25 23:27 ` [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter Darrick J. Wong
2016-08-25 23:27 ` [PATCH 5/7] xfsdocs: reverse-mapping btree documentation Darrick J. Wong
2016-08-25 23:27 ` [PATCH 6/7] xfsdocs: document refcount btree and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
2016-09-08  1:38   ` Dave Chinner
2016-09-08  2:03     ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.