From: Robert White <rwhite@pobox.com>
To: Jeff Mahoney <jeffm@suse.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 00/10] btrfs: Support for DAX devices
Date: Thu, 6 Dec 2018 07:40:43 +0000	[thread overview]
Message-ID: <27107fd9-a8c8-9b3a-6d9f-3f020523f4ea@pobox.com> (raw)
In-Reply-To: <6a5b4bb8-896f-14c8-fc71-71b7122dd836@suse.com>

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation 
> at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
> new page faults, set the WP bit on affected ranges, do the snapshot, 
> commit, clear the flag, and wake up the waiting threads.  Neither of us 
> had any concrete idea of how well that would perform and it still 
> depends on finding a good way to resolve all open mmap ranges on a 
> subvolume.  Perhaps using the address_space->private_list anchored on 
> each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may
not get the exact terminology right.

So the essential problem of DAX here is basically the opposite of
data-deduplication. Instead of merging two duplicate data regions into
one, you want to mark a shared region as at-risk of in-place
modification while keeping the original content intact for any
snapshots that still reference it.

So suppose you _require_ data checksums and a data profile of "dup",
mirroring, or one of the other fault-tolerant layouts.

By definition, any block that gets written with content it didn't have
before will now have a bad (stale) checksum.

If the inode is flagged for direct IO, that's an indication that its
blocks may have been updated in place.
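
Just to be concrete about what I mean by "bad checksum", here is a tiny
toy sketch (plain Python, with crc32 standing in for the real checksum
machinery; obviously none of this is btrfs code):

import zlib

stored_csum = zlib.crc32(b"old contents")   # what the csum tree remembers
block = bytearray(b"old contents")          # the DAX-mapped block
block[0:3] = b"NEW"                         # a store straight through the mapping
was_modified = zlib.crc32(bytes(block)) != stored_csum   # True -> detected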

At this point you really just need to do the opposite of deduplication:
find/recover the original contents and assign them (or leave them
assigned) to the old/other snapshots, then compute the new checksum for
the "original" block and assign that block to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now the original storage region has only one user, so on sync or close
you can fix its checksums on the cheap.
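
Here is a toy userspace model of that step, just to pin down the
bookkeeping. All the names are made up (nothing here is real btrfs
code), and I am assuming the untouched "dup" copy is where the old
bytes get recovered from:

import zlib

class ToyExtent:
    def __init__(self, data):
        self.primary = bytearray(data)       # the copy DAX stores land in
        self.mirror = bytearray(data)        # the untouched "dup" copy
        self.csum = zlib.crc32(bytes(data))  # stored checksum of the old contents
        self.owners = set()                  # subvolume ids; refcount == len(owners)

def dax_write(extent, offset, new_bytes):
    # A store straight through the mapping: only the primary copy changes,
    # so extent.csum no longer matches extent.primary.
    extent.primary[offset:offset + len(new_bytes)] = new_bytes

def sync_or_close(extent, active_subvol, allocate_extent):
    # The "reverse dedup": hand everyone *else* a recovered copy of the
    # old bytes, then re-checksum the dirtied extent for the active
    # subvolume alone.
    recovered = None
    others = extent.owners - {active_subvol}
    if others:
        recovered = allocate_extent(bytes(extent.mirror))  # recover the old contents
        recovered.owners = others          # ...and they belong to the other users
        extent.owners = {active_subvol}    # the original region has one user left
    # Single user now, so fixing the checksum (and resyncing the dup copy)
    # is cheap and safe.
    extent.mirror = bytearray(extent.primary)
    extent.csum = zlib.crc32(bytes(extent.primary))
    return recovered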

Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.

So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.

Maybe, if the data isn't already duplicated, then the write mapping,
the DAX open, or the setting of the S_DUP flag could force the file
into an extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated with the
snapshots instead, since either copy of the data can serve any user
equally well.
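
With the toy model above, that is just the observation that the freshly
allocated extent is the one handed to the snapshots, while the active
subvolume keeps the dirtied original right where it is:

ext = ToyExtent(b"old contents")
ext.owners = {"subvol-A", "snap-1", "snap-2"}
dax_write(ext, 0, b"NEW")
old = sync_or_close(ext, "subvol-A", ToyExtent)
assert old.owners == {"snap-1", "snap-2"}   # new extent, old bytes
assert ext.owners == {"subvol-A"}           # original extent, new bytes, fresh csum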

The side note is that any of this only ever matters if the usage count
is greater than one, so at worst taking a snapshot, which is already a
_little_ racy anyway, would/could trigger a semi-lightweight copy pass
over any S_DAX files:

if S_DAX :
  if checksum invalid :
    copy the data as-is, checksum it, and store the copy in the snapshot
  else :
    look for a duplicate checksum
    if duplicate found :
      assign that extent to the snapshot
    else :
      if the file is open for writing and has any mmaps for write :
        copy the extent and assign the copy to the new snapshot
      else :
        increment the usage count and assign the current block to the snapshot
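
Spelled out in the same toy model (reusing ToyExtent and zlib from the
sketch above; csum_index is just a dict keyed by checksum, standing in
for whatever duplicate-lookup the real thing would use, and all names
are still hypothetical):

def snapshot_dax_extent(extent, new_snap, csum_index,
                        writable_and_mmapped, allocate_extent):
    current = zlib.crc32(bytes(extent.primary))
    if current != extent.csum:
        # Checksum invalid: already dirtied through the mapping, so the
        # snapshot gets its own stable copy, checksummed as-is.
        snap_copy = allocate_extent(bytes(extent.primary))
        snap_copy.owners = {new_snap}
        return snap_copy
    dup = csum_index.get(extent.csum)       # look for a duplicate checksum
    if dup is not None and dup is not extent:
        dup.owners.add(new_snap)            # share the existing duplicate
        return dup
    if writable_and_mmapped:
        snap_copy = allocate_extent(bytes(extent.primary))
        snap_copy.owners = {new_snap}       # copy now; it may change under us
        return snap_copy
    extent.owners.add(new_snap)             # just bump the usage count
    return extent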

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap (and DAX) is speed during the actual
updates, this idea is basically about amortizing the copy costs into
the task of maintaining the snapshots instead of leaving them in the
immediate hands of the time-critical updater.

A flush, munmap, or close by the user, or a system-wide sync event,
would also be good points at which to expense the bookkeeping time.


Thread overview: 32+ messages
2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
2018-12-05 12:42   ` Johannes Thumshirn
2018-12-05 12:43   ` Nikolay Borisov
2018-12-05 14:59     ` Adam Borowski
2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
2018-12-05 13:11   ` Nikolay Borisov
2018-12-05 13:22   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
2018-12-05 13:26   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2018-12-05 13:35   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
2018-12-05 13:56   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
2018-12-05 13:59   ` Johannes Thumshirn
2018-12-05 14:52   ` Christoph Hellwig
2018-12-06 11:46     ` Goldwyn Rodrigues
2018-12-12  8:07       ` Christoph Hellwig
2019-03-26 19:36   ` Dan Williams
2019-03-27 11:10     ` Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 08/10] btrfs: dax add read mmap path Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 10/10] btrfs: dax mmap write Goldwyn Rodrigues
2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
2018-12-05 21:36   ` Jeff Mahoney
2018-12-05 13:57 ` Adam Borowski
2018-12-05 21:37 ` Jeff Mahoney
2018-12-06  7:40   ` Robert White [this message]
2018-12-06 10:07 ` Johannes Thumshirn
2018-12-06 11:47   ` Goldwyn Rodrigues
