From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: david@fromorbit.com, darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, xfs@oss.sgi.com
Subject: [PATCH 6/7] xfsdocs: document refcount btree and reflink
Date: Thu, 25 Aug 2016 16:27:35 -0700
Message-ID: <147216765581.32447.11893404900649503970.stgit@birch.djwong.org>
In-Reply-To: <147216761636.32447.4229640006064129056.stgit@birch.djwong.org>

Document the reference count btree and talk a little bit about how
the reflink feature uses it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   25 ++-
 .../XFS_Filesystem_Structure/directories.asciidoc  |    1 
 design/XFS_Filesystem_Structure/docinfo.xml        |    2 
 .../journaling_log.asciidoc                        |  192 ++++++++++++++++++++
 design/XFS_Filesystem_Structure/magic.asciidoc     |    5 +
 .../XFS_Filesystem_Structure/ondisk_inode.asciidoc |   25 ++-
 .../XFS_Filesystem_Structure/refcountbt.asciidoc   |  145 +++++++++++++++
 design/XFS_Filesystem_Structure/reflink.asciidoc   |   40 ++++
 design/XFS_Filesystem_Structure/rmapbt.asciidoc    |    2 
 .../xfs_filesystem_structure.asciidoc              |    4 
 10 files changed, 435 insertions(+), 6 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/refcountbt.asciidoc
 create mode 100644 design/XFS_Filesystem_Structure/reflink.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 9fcf975..cafa8b7 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -13,6 +13,7 @@ Each AG has the following characteristics:
          * Free space management
          * Inode allocation and tracking
          * Reverse block-mapping index (optional)
+         * Data block reference count index (optional)
 
 Having multiple AGs allows XFS to handle most operations in parallel without
 degrading performance as the number of concurrent accesses increases.
@@ -386,6 +387,12 @@ Reverse mapping B+tree.  Each allocation group contains a B+tree containing
 records mapping AG blocks to their owners.  See the section about
 xref:Reconstruction[reconstruction] for more details.
 
+| +XFS_SB_FEAT_RO_COMPAT_REFLINK+ |
+Reference count B+tree.  Each allocation group contains a B+tree to track the
+reference counts of AG blocks.  This enables files to share data blocks safely.
+See the section about xref:Reflink_Deduplication[reflink and deduplication] for
+more details.
+
 |=====
 
 *sb_features_incompat*::
@@ -547,8 +554,10 @@ struct xfs_agf {
      /* version 5 filesystem fields start here */
      uuid_t              agf_uuid;
      __be32              agf_rmap_blocks;
-     __be32              __pad;
-     __be64              agf_spare64[15];
+     __be32              agf_refcount_blocks;
+     __be32              agf_refcount_root;
+     __be32              agf_refcount_level;
+     __be64              agf_spare64[14];
 
      /* unlogged fields, written during buffer writeback. */
      __be64              agf_lsn;
@@ -613,6 +622,15 @@ depending on which features are set.
 *agf_rmap_blocks*::
 The size of the reverse mapping B+tree in this allocation group, in blocks.
 
+*agf_refcount_blocks*::
+The size of the reference count B+tree in this allocation group, in blocks.
+
+*agf_refcount_root*::
+Block number for the root of the reference count B+tree, if enabled.
+
+*agf_refcount_level*::
+Depth of the reference count B+tree, if enabled.
+
 *agf_spare64*::
 Empty space in the logged part of the AGF sector, for use for future features.
 
@@ -1243,4 +1261,5 @@ By placing the real time device (and the journal) on separate high-performance
 storage devices, it is possible to reduce most of the unpredictability in I/O
 response times that come from metadata operations.
 
-None of the XFS per-AG B+trees are involved with real time files.
+None of the XFS per-AG B+trees are involved with real time files.  It is not
+possible for real time files to share data blocks.
diff --git a/design/XFS_Filesystem_Structure/directories.asciidoc b/design/XFS_Filesystem_Structure/directories.asciidoc
index bccf912..1758c4e 100644
--- a/design/XFS_Filesystem_Structure/directories.asciidoc
+++ b/design/XFS_Filesystem_Structure/directories.asciidoc
@@ -1419,6 +1419,7 @@ The hash value of a particular record.
 The directory/attribute logical block containing all entries up to the
 corresponding hash value.
 
+//
 * The freeindex's +bests+ array starts from the end of the block and grows to the
 start of the block.
 
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 44f944a..f5e62bc 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -136,6 +136,8 @@
 				<member>Move the b+tree info to a separate chapter.</member>
 				<member>Discuss overlapping interval b+trees.</member>
 				<member>Discuss new log items for atomic updates.</member>
+				<member>Document the reference-count btree.</member>
+				<member>Discuss block sharing, reflink, &amp; deduplication.</member>
 			</simplelist>
 		</revdescription>
 	</revision>
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index 78ce436..0aec036 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -211,6 +211,10 @@ magic number to distinguish themselves.  Buffer data items only appear after
 | +XFS_LI_ICREATE+		| 0x123f        | xref:Inode_Create_Log_Item[Inode Creation]
 | +XFS_LI_RUI+			| 0x1240        | xref:RUI_Log_Item[Reverse Mapping Update Intent]
 | +XFS_LI_RUD+			| 0x1241        | xref:RUD_Log_Item[Reverse Mapping Update Done]
+| +XFS_LI_CUI+			| 0x1242        | xref:CUI_Log_Item[Reference Count Update Intent]
+| +XFS_LI_CUD+			| 0x1243        | xref:CUD_Log_Item[Reference Count Update Done]
+| +XFS_LI_BUI+			| 0x1244        | xref:BUI_Log_Item[File Block Mapping Update Intent]
+| +XFS_LI_BUD+			| 0x1245        | xref:BUD_Log_Item[File Block Mapping Update Done]
 |=====
 
 [[Log_Transaction_Headers]]
@@ -508,6 +512,194 @@ Size of this log item.  Should be 1.
 *rud_rui_id*::
 A 64-bit number that binds the corresponding RUI log item to this RUD log item.
 
+[[CUI_Log_Item]]
+=== Reference Count Updates Intent
+
+The next two operation types work together to handle reference count updates.
+Naturally, the ranges of extents having reference count updates can be
+expressed in terms of physical extents:
+
+[source, c]
+----
+struct xfs_phys_extent {
+     __uint64_t                pe_startblock;
+     __uint32_t                pe_len;
+     __uint32_t                pe_flags;
+};
+----
+
+*pe_startblock*::
+Filesystem block of this extent.
+
+*pe_len*::
+The length of this extent.
+
+*pe_flags*::
+The lower byte of this field is a type code indicating what sort of
+reference count operation we want.  The upper three bytes are flag bits.
+
+.Reference count update log intent types
+[options="header"]
+|=====
+| Value				  | Description
+| +XFS_REFCOUNT_EXTENT_INCREASE+  | Increase the reference count for this extent.
+| +XFS_REFCOUNT_EXTENT_DECREASE+  | Decrease the reference count for this extent.
+| +XFS_REFCOUNT_EXTENT_ALLOC_COW+ | Reserve an extent for staging copy on write.
+| +XFS_REFCOUNT_EXTENT_FREE_COW+  | Unreserve an extent for staging copy on write.
+|=====
+
+The ``reference count update intent'' operation comes first; it tells the log
+that XFS wants to update some reference counts.  This record is crucial for
+correct log recovery because it enables us to spread a complex metadata update
+across multiple transactions while ensuring that a crash midway through the
+complex update will be replayed fully during log recovery.
+
+[source, c]
+----
+struct xfs_cui_log_format {
+     __uint16_t                cui_type;
+     __uint16_t                cui_size;
+     __uint32_t                cui_nextents;
+     __uint64_t                cui_id;
+     struct xfs_phys_extent    cui_extents[1];
+};
+----
+
+*cui_type*::
+The signature of a CUI operation, 0x1242.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*cui_size*::
+Size of this log item.  Should be 1.
+
+*cui_nextents*::
+Number of reference count updates.
+
+*cui_id*::
+A 64-bit number that binds the corresponding CUD log item to this CUI log item.
+
+*cui_extents*::
+Variable-length array of reference count update information.
+
+[[CUD_Log_Item]]
+=== Completion of Reference Count Updates
+
+The ``reference count update done'' operation complements the ``reference count
+update intent'' operation.  This second operation indicates that the update
+actually happened, so that log recovery needn't replay the update.  The CUD and
+the actual updates are typically found in a new transaction following the
+transaction in which the CUI was logged.
+
+[source, c]
+----
+struct xfs_cud_log_format {
+      __uint16_t               cud_type;
+      __uint16_t               cud_size;
+      __uint32_t               __pad;
+      __uint64_t               cud_cui_id;
+};
+----
+
+*cud_type*::
+The signature of a CUD operation, 0x1243.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*cud_size*::
+Size of this log item.  Should be 1.
+
+*cud_cui_id*::
+A 64-bit number that binds the corresponding CUI log item to this CUD log item.
+
+[[BUI_Log_Item]]
+=== File Block Mapping Intent
+
+The next two operation types work together to handle deferred file block
+mapping updates.  The extents to be mapped are expressed via the
++xfs_map_extent+ structure discussed in the section about
+xref:RUI_Log_Item[reverse mapping intents].
+
+The lower byte of the +me_flags+ field is a type code indicating what sort of
+file block mapping operation we want.  The upper three bytes are flag bits.
+
+.File block mapping update log intent types
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_BMAP_EXTENT_MAP+		| Add a mapping for file data.
+| +XFS_BMAP_EXTENT_UNMAP+	| Remove a mapping for file data.
+|=====
+
+.File block mapping update log intent flags
+[options="header"]
+|=====
+| Value				| Description
+| +XFS_BMAP_EXTENT_ATTR_FORK+	| Extent is for the attribute fork.
+| +XFS_BMAP_EXTENT_UNWRITTEN+	| Extent is unwritten.
+|=====
+
+The ``file block mapping update intent'' operation comes first; it tells the
+log that XFS wants to map or unmap some extents in a file.  This record is
+crucial for correct log recovery because it enables us to spread a complex
+metadata update across multiple transactions while ensuring that a crash midway
+through the complex update will be replayed fully during log recovery.
+
+[source, c]
+----
+struct xfs_bui_log_format {
+     __uint16_t                bui_type;
+     __uint16_t                bui_size;
+     __uint32_t                bui_nextents;
+     __uint64_t                bui_id;
+     struct xfs_map_extent     bui_extents[1];
+};
+----
+
+*bui_type*::
+The signature of a BUI operation, 0x1244.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*bui_size*::
+Size of this log item.  Should be 1.
+
+*bui_nextents*::
+Number of file mappings.  Should be 1.
+
+*bui_id*::
+A 64-bit number that binds the corresponding BUD log item to this BUI log item.
+
+*bui_extents*::
+Variable-length array of file block mappings to update.  There should only
+be one mapping present.
+
+[[BUD_Log_Item]]
+=== Completion of File Block Mapping Updates
+
+The ``file block mapping update done'' operation complements the ``file block
+mapping update intent'' operation.  This second operation indicates that the
+update actually happened, so that log recovery needn't replay the update.  The
+BUD and the actual updates are typically found in a new transaction following
+the transaction in which the BUI was logged.
+
+[source, c]
+----
+struct xfs_bud_log_format {
+      __uint16_t               bud_type;
+      __uint16_t               bud_size;
+      __uint32_t               __pad;
+      __uint64_t               bud_bui_id;
+};
+----
+
+*bud_type*::
+The signature of a BUD operation, 0x1245.  This value is in host-endian order,
+not big-endian like the rest of XFS.
+
+*bud_size*::
+Size of this log item.  Should be 1.
+
+*bud_bui_id*::
+A 64-bit number that binds the corresponding BUI log item to this BUD log item.
+
 [[Inode_Log_Item]]
 === Inode Updates
 
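Illustrative note, not part of the asciidoc above: a minimal sketch of two
ideas from the CUI/CUD text, namely splitting +pe_flags+ into its type code
(lower byte) and flag bits (upper three bytes), and deciding during recovery
whether an intent still needs replay because no done item carries its id.
All `sketch_`/`SKETCH_` names are hypothetical; the patch does not give the
mask names or the numeric type codes, so none are assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical masks: the text only says "lower byte" / "upper three bytes". */
#define SKETCH_PE_TYPE_MASK   0x000000FFu
#define SKETCH_PE_FLAGS_MASK  0xFFFFFF00u

/* Mirrors the fields of struct xfs_phys_extent with standard C types. */
struct sketch_phys_extent {
	uint64_t pe_startblock;
	uint32_t pe_len;
	uint32_t pe_flags;
};

/* Type code: the lower byte of pe_flags. */
static inline uint32_t sketch_pe_type(const struct sketch_phys_extent *pe)
{
	return pe->pe_flags & SKETCH_PE_TYPE_MASK;
}

/* Flag bits: the upper three bytes of pe_flags, left in place. */
static inline uint32_t sketch_pe_flag_bits(const struct sketch_phys_extent *pe)
{
	return pe->pe_flags & SKETCH_PE_FLAGS_MASK;
}

/*
 * An intent needs replay only if no done item refers to it.  cud_ids[]
 * stands in for the cud_cui_id values found while scanning the log.
 */
static inline int sketch_cui_needs_replay(uint64_t cui_id,
					  const uint64_t *cud_ids, size_t n)
{
	assert(cud_ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (cud_ids[i] == cui_id)
			return 0;	/* matching CUD found: update completed */
	return 1;			/* no CUD: recovery must finish the update */
}
```

The same id-matching idea applies to BUI/BUD pairs via +bud_bui_id+; only the
extent structure and the meaning of the type codes differ.
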
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index 10fd15f..bc172f3 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -45,6 +45,7 @@ relevant chapters.  Magic numbers tend to have consistent locations:
 | +XFS_ATTR3_LEAF_MAGIC+	| 0x3bee	|     	| xref:Leaf_Attributes[Leaf Attribute], v5 only
 | +XFS_ATTR3_RMT_MAGIC+		| 0x5841524d	| XARM	| xref:Remote_Values[Remote Attribute Value], v5 only
 | +XFS_RMAP_CRC_MAGIC+		| 0x524d4233	| RMB3	| xref:Reverse_Mapping_Btree[Reverse Mapping B+tree], v5 only
+| +XFS_REFC_CRC_MAGIC+		| 0x52334643	| R3FC	| xref:Reference_Count_Btree[Reference Count B+tree], v5 only
 |=====
 
 The magic numbers for log items are at offset zero in each log item, but items
@@ -64,6 +65,10 @@ are not aligned to blocks.
 | +XFS_LI_ICREATE+		| 0x123f        |       | xref:Inode_Create_Log_Item[Inode Creation Log Item]
 | +XFS_LI_RUI+			| 0x1240        |       | xref:RUI_Log_Item[Reverse Mapping Update Intent]
 | +XFS_LI_RUD+			| 0x1241        |       | xref:RUD_Log_Item[Reverse Mapping Update Done]
+| +XFS_LI_CUI+			| 0x1242        |       | xref:CUI_Log_Item[Reference Count Update Intent]
+| +XFS_LI_CUD+			| 0x1243        |       | xref:CUD_Log_Item[Reference Count Update Done]
+| +XFS_LI_BUI+			| 0x1244        |       | xref:BUI_Log_Item[File Block Mapping Update Intent]
+| +XFS_LI_BUD+			| 0x1245        |       | xref:BUD_Log_Item[File Block Mapping Update Done]
 |=====
 
 = Theoretical Limits
diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index dc1fad2..4415c38 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -109,7 +109,8 @@ struct xfs_dinode_core {
      __be64                    di_changecount;
      __be64                    di_lsn;
      __be64                    di_flags2;
-     __u8                      di_pad2[16];
+     __be32                    di_cowextsize;
+     __u8                      di_pad2[12];
      xfs_timestamp_t           di_crtime;
      __be64                    di_ino;
      uuid_t                    di_uuid;
@@ -215,7 +216,7 @@ including relevant metadata like B+trees. This does not include blocks used for
 extended attributes.
 
 *di_extsize*::
-Specifies the extent size for filesystems with real-time devices and an extent
+Specifies the extent size for filesystems with real-time devices or an extent
 size hint for standard filesystems. For normal filesystems, and with
 directories, the +XFS_DIFLAG_EXTSZINHERIT+ flag must be set in +di_flags+ if
 this field is used. Inodes created in these directories will inherit the
@@ -279,7 +280,7 @@ For directory inodes, new inodes inherit the +di_projid+ value.
 For directory inodes, symlinks cannot be created.
 
 | +XFS_DIFLAG_EXTSIZE+		|
-Specifies the extent size for real-time files or a and extent size hint for regular files.
+Specifies the extent size for real-time files or an extent size hint for regular files.
 
 | +XFS_DIFLAG_EXTSZINHERIT+	|
 For directory inodes, new inodes inherit the +di_extsize+ value.
@@ -323,8 +324,26 @@ Specifies extended flags associated with a v3 inode.
 | +XFS_DIFLAG2_DAX+		|
 For a file, enable DAX to increase performance on persistent-memory storage.
 If set on a directory, files created in the directory will inherit this flag.
+| +XFS_DIFLAG2_REFLINK+		|
+This inode shares (or has shared) data blocks with another inode.
+| +XFS_DIFLAG2_COWEXTSIZE+	|
+For files, this is the extent size hint for copy on write operations; see
++di_cowextsize+ for details.  For directories, the value in +di_cowextsize+
+will be copied to all newly created files and directories.
 |=====
 
+*di_cowextsize*::
+Specifies the extent size hint for copy on write operations.  When allocating
+extents for a copy on write operation, the allocator will be asked to align
+its allocations to either +di_cowextsize+ blocks or +di_extsize+ blocks,
+whichever is greater.  The +XFS_DIFLAG2_COWEXTSIZE+ flag must be set if this
+field is used.  If this field and its flag are set on a directory file, the
+value will be copied into any files or directories created within this
+directory.  During a block sharing operation, this value will be copied from
+the source file to the destination file if the sharing operation completely
+overwrites the destination file's contents and the destination file does not
+already have +di_cowextsize+ set.
+
 *di_pad2*::
 Padding for future expansion of the inode.
 
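Illustrative note, not part of the asciidoc above: a minimal sketch of the
"whichever is greater" rule in the +di_cowextsize+ description.  The function
name is hypothetical, and treating the hint as zero when
+XFS_DIFLAG2_COWEXTSIZE+ is clear is an assumption made for the sketch; the
text only says the flag must be set for the field to be used.

```c
#include <stdint.h>

/*
 * Pick the alignment (in filesystem blocks) requested for a copy-on-write
 * allocation: the larger of the CoW hint and the regular extent size hint.
 * Assumption: a CoW hint without its flag is ignored (treated as 0).
 */
static inline uint32_t sketch_cow_alignment(uint32_t di_extsize,
					    uint32_t di_cowextsize,
					    int cowextsize_flag_set)
{
	uint32_t cow = cowextsize_flag_set ? di_cowextsize : 0;

	return cow > di_extsize ? cow : di_extsize;
}
```

For example, with +di_extsize+ = 16 and a flagged +di_cowextsize+ = 32 the
sketch returns 32; with the flag clear it falls back to 16.
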
diff --git a/design/XFS_Filesystem_Structure/refcountbt.asciidoc b/design/XFS_Filesystem_Structure/refcountbt.asciidoc
new file mode 100644
index 0000000..dbbb98e
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/refcountbt.asciidoc
@@ -0,0 +1,145 @@
+[[Reference_Count_Btree]]
+== Reference Count B+tree
+
+[NOTE]
+This data structure is under construction!  Details may change.
+
+To support the sharing of file data blocks (reflink), each allocation group has
+its own reference count B+tree, which grows in the allocated space like the
+inode B+trees.  This data could be gleaned by performing an interval query of
+the reverse-mapping B+tree, but doing so would come at a huge performance
+penalty.  Therefore, this data structure is a cache of computable information.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_REFLINK+
+feature is enabled.  The feature requires a version 5 filesystem.
+
+Each record in the reference count B+tree has the following structure:
+
+[source, c]
+----
+struct xfs_refcount_rec {
+     __be32                     rc_startblock;
+     __be32                     rc_blockcount;
+     __be32                     rc_refcount;
+};
+----
+
+*rc_startblock*::
+AG block number of this record.
+
+*rc_blockcount*::
+The length of this extent.
+
+*rc_refcount*::
+Number of mappings of this filesystem extent.
+
+Node keys contain the starting AG block; node pointers are AG relative block pointers:
+
+[source, c]
+----
+struct xfs_refcount_key {
+     __be32                     rc_startblock;
+};
+----
+
+* As the reference counting is AG relative, all the block numbers are only
+32 bits.
+* The +bb_magic+ value is "R3FC" (0x52334643).
+* The +xfs_btree_sblock_t+ header is used for intermediate B+tree nodes as well
+as the leaves.
+
+=== xfs_db refcntbt Example
+
+For this example, an XFS filesystem was populated with a root filesystem and
+a deduplication program was run to create shared blocks:
+
+----
+xfs_db> agf 0
+xfs_db> addr refcntroot
+xfs_db> p
+magic = 0x52334643
+level = 1
+numrecs = 6
+leftsib = null
+rightsib = null
+bno = 36892
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0x75f35128 (correct)
+keys[1-6] = [startblock] 1:[14] 2:[65633] 3:[65780] 4:[94571] 5:[117201] 6:[152442]
+ptrs[1-6] = 1:7 2:25836 3:25835 4:18447 5:18445 6:18449
+xfs_db> addr ptrs[3]
+xfs_db> p
+magic = 0x52334643
+level = 0
+numrecs = 80
+leftsib = 25836
+rightsib = 18447
+bno = 51670
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0xc3962813 (correct)
+recs[1-80] = [startblock,blockcount,refcount]
+        1:[65780,1,2] 2:[65781,1,3] 3:[65785,2,2] 4:[66640,1,2]
+        5:[69602,4,2] 6:[72256,16,2] 7:[72871,4,2] 8:[72879,20,2]
+        9:[73395,4,2] 10:[75063,4,2] 11:[79093,4,2] 12:[86344,16,2]
+----
+
+Record 6 in the reference count B+tree for AG 0 indicates that the AG extent
+starting at block 72,256 and running for 16 blocks has a reference count of 2.
+This means that there are two files sharing the block:
+
+----
+xfs_db> blockget -n
+xfs_db> fsblock 72256
+xfs_db> blockuse
+block 72256 (0/72256) type rldata inode 25169197
+----
+
+The blockuse type changes to ``rldata'' to indicate that the block is shared
+data.  Unfortunately, blockuse only tells us about one block owner.  If we
+happen to have enabled the reverse-mapping B+tree, we can use it to find all
+inodes that own this block:
+
+----
+xfs_db> agf 0
+xfs_db> addr rmaproot
+...
+xfs_db> addr ptrs[3]
+...
+xfs_db> addr ptrs[7]
+xfs_db> p
+magic = 0x524d4233
+level = 0
+numrecs = 22
+leftsib = 65057
+rightsib = 65058
+bno = 291478
+lsn = 0x200004ec2
+uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+owner = 0
+crc = 0xed7da3f7 (correct)
+recs[1-22] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+        1:[68957,8,3201,0,0,0,0] 2:[68965,4,25260953,0,0,0,0]
+        ...
+        18:[72232,58,3227,0,0,0,0] 19:[72256,16,25169197,24,0,0,0]
+        20:[72290,75,3228,0,0,0,0] 21:[72365,46,3229,0,0,0,0]
+----
+
+Records 18 and 19 intersect the block 72,256; they tell us that inodes 3,227
+and 25,169,197 both claim ownership.  Let us confirm this:
+
+----
+xfs_db> inode 25169197
+xfs_db> bmap
+data offset 0 startblock 12632259 (3/49347) count 24 flag 0
+data offset 24 startblock 72256 (0/72256) count 16 flag 0
+data offset 40 startblock 12632299 (3/49387) count 18 flag 0
+xfs_db> inode 3227
+xfs_db> bmap
+data offset 0 startblock 72232 (0/72232) count 58 flag 0
+----
+
+Inodes 25,169,197 and 3,227 both contain mappings to block 0/72,256.
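
Illustrative note, not part of the asciidoc above: a minimal sketch of
decoding one 12-byte big-endian refcount record and of finding the record
that covers a given AG block in a sorted leaf, as the xfs_db walk-through
does by eye.  The `sketch_` names are hypothetical, and the lookup works on
an in-memory array rather than on real B+tree blocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-endian copy of struct xfs_refcount_rec. */
struct sketch_refcount_rec {
	uint32_t startblock;
	uint32_t blockcount;
	uint32_t refcount;
};

/* Read a big-endian 32-bit value from a byte buffer. */
static inline uint32_t sketch_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode rc_startblock, rc_blockcount, rc_refcount in on-disk order. */
static inline struct sketch_refcount_rec
sketch_decode_rec(const uint8_t raw[12])
{
	struct sketch_refcount_rec r;

	r.startblock = sketch_be32(raw);
	r.blockcount = sketch_be32(raw + 4);
	r.refcount = sketch_be32(raw + 8);
	return r;
}

/*
 * Binary search over records sorted by startblock and not overlapping.
 * Returns the record covering agbno, or NULL if no record covers it.
 */
static inline const struct sketch_refcount_rec *
sketch_find_rec(const struct sketch_refcount_rec *recs, size_t n,
		uint32_t agbno)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (agbno < recs[mid].startblock)
			hi = mid;
		else if (agbno - recs[mid].startblock >= recs[mid].blockcount)
			lo = mid + 1;
		else
			return &recs[mid];
	}
	return NULL;
}
```

Fed the first six records of the example leaf, a lookup of AG block 72256
lands on record 6 (`[72256,16,2]`), while a block covered by no record
yields NULL.
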
diff --git a/design/XFS_Filesystem_Structure/reflink.asciidoc b/design/XFS_Filesystem_Structure/reflink.asciidoc
new file mode 100644
index 0000000..8f52b90
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/reflink.asciidoc
@@ -0,0 +1,40 @@
+[[Reflink_Deduplication]]
+= Sharing Data Blocks
+
+On a traditional filesystem, there is a 1:1 mapping between a logical block
+offset in a file and a physical block on disk, which is to say that physical
+blocks are not shared.  However, there exist various use cases for being able
+to share blocks between files -- deduplicating files saves space on archival
+systems; creating space-efficient clones of disk images for virtual machines
+and containers facilitates efficient datacenters; and deferring the payment of
+the allocation cost of a file system tree copy as long as possible makes
+regular work faster.  In all of these cases, a write to one of the shared
+copies *must* not affect the other shared copies, which means that writes to
+shared blocks must employ a copy-on-write strategy.  Sharing blocks in this
+manner is commonly referred to as ``reflinking''.
+
+XFS implements block sharing in a fairly straightforward manner.  All existing
+data fork structures remain unchanged, save for the addition of a
+per-allocation group xref:Reference_Count_Btree[reference count B+tree].  This
+data structure tracks reference counts for all shared physical blocks, with a
+few rules to maintain compatibility with existing code: If a block is free, it
+will be tracked in the free space B+trees.  If a block is owned by a single
+file, it appears in neither the free space nor the reference count B+trees.  If
+a block is shared, it will appear in the reference count B+tree with a
+reference count >= 2.  The first two cases are established precedent in XFS, so
+the third case is the only behavioral change.
+
+When a filesystem block is shared, the block mapping in the destination file is
+updated to point to that filesystem block and the reference count B+tree records
+are updated to reflect the increased refcount.  If a shared block is written, a
+new block will be allocated, the dirty data written to this new block, and the
+file's block mapping updated to point to the new block.  If a shared block is
+unmapped, the reference count records are updated to reflect the decreased
+refcount and the block is also freed if its reference count becomes zero.  This
+enables users to create space efficient clones of disk images and to copy
+filesystem subtrees quickly, using the standard Linux coreutils packages.
+
+Deduplication employs the same mechanism to share blocks and copy them at write
+time.  However, the kernel confirms that the contents of both files are
+identical before updating the destination file's mapping.  This enables XFS to
+be used by userspace deduplication programs such as +duperemove+.
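
Illustrative note, not part of the asciidoc above: a toy model of the sharing
rules in this chapter.  Each block gets a plain counter (0 = free, 1 = single
owner, 2 or more = shared); only shared blocks would appear in the reference
count B+tree.  The `sketch_` names and the flat `refs[]` array are
hypothetical and bear no relation to the on-disk layout or the kernel's
allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_block_state { SKETCH_FREE, SKETCH_SINGLE_OWNER, SKETCH_SHARED };

/* Classify a block by its toy reference count. */
static inline enum sketch_block_state sketch_classify(uint32_t refs)
{
	if (refs == 0)
		return SKETCH_FREE;
	if (refs == 1)
		return SKETCH_SINGLE_OWNER;
	return SKETCH_SHARED;
}

/* Only shared blocks (count >= 2) are tracked in the refcount B+tree. */
static inline int sketch_in_refcountbt(uint32_t refs)
{
	return refs >= 2;
}

/* Sharing: another file maps an already-owned block. */
static inline void sketch_share(uint32_t *refs, size_t blk)
{
	assert(refs[blk] >= 1);
	refs[blk]++;
}

/* Unmapping: drop one reference; the block is free once it reaches zero. */
static inline void sketch_unmap(uint32_t *refs, size_t blk)
{
	assert(refs[blk] >= 1);
	refs[blk]--;
}

/*
 * Writing: an unshared block is overwritten in place; a shared block is
 * copied to the first free block and the old block loses one reference.
 * Returns the block the writer maps afterwards, or nblocks if none is free.
 */
static inline size_t sketch_write(uint32_t *refs, size_t nblocks, size_t blk)
{
	if (refs[blk] < 2)
		return blk;
	for (size_t i = 0; i < nblocks; i++) {
		if (refs[i] == 0) {
			refs[i] = 1;
			refs[blk]--;
			return i;
		}
	}
	return nblocks;
}
```

Starting from one owned block, sharing it and then writing through one of the
two mappings leaves two single-owner blocks, so neither remains in the
reference count B+tree.
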
diff --git a/design/XFS_Filesystem_Structure/rmapbt.asciidoc b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
index a8a210b..0ec72c1 100644
--- a/design/XFS_Filesystem_Structure/rmapbt.asciidoc
+++ b/design/XFS_Filesystem_Structure/rmapbt.asciidoc
@@ -53,6 +53,8 @@ absolute inode number, but can also correspond to one of the following:
 | +XFS_RMAP_OWN_AG+             | Per-allocation group B+tree blocks.  This means free space B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
 | +XFS_RMAP_OWN_INOBT+          | Per-allocation group inode B+tree blocks.  This includes free inode B+tree blocks.
 | +XFS_RMAP_OWN_INODES+         | Inode chunks
+| +XFS_RMAP_OWN_REFC+           | Per-allocation group refcount B+tree blocks.  This will be used for reflink support.
+| +XFS_RMAP_OWN_COW+		| Blocks that have been reserved for a copy-on-write operation that has not completed.
 |=====
 
 *rm_fork*::
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 1b8658d..7916fbe 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -48,6 +48,8 @@ include::overview.asciidoc[]
 
 include::metadata_integrity.asciidoc[]
 
+include::reflink.asciidoc[]
+
 include::reconstruction.asciidoc[]
 
 include::common_types.asciidoc[]
@@ -70,6 +72,8 @@ include::allocation_groups.asciidoc[]
 
 include::rmapbt.asciidoc[]
 
+include::refcountbt.asciidoc[]
+
 include::journaling_log.asciidoc[]
 
 include::internal_inodes.asciidoc[]

