From: Robert White <rwhite@pobox.com>
To: Jeff Mahoney <jeffm@suse.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 00/10] btrfs: Support for DAX devices
Date: Thu, 6 Dec 2018 07:40:43 +0000	[thread overview]
Message-ID: <27107fd9-a8c8-9b3a-6d9f-3f020523f4ea@pobox.com> (raw)
In-Reply-To: <6a5b4bb8-896f-14c8-fc71-71b7122dd836@suse.com>

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation 
> at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
> new page faults, set the WP bit on affected ranges, do the snapshot, 
> commit, clear the flag, and wake up the waiting threads.  Neither of us 
> had any concrete idea of how well that would perform and it still 
> depends on finding a good way to resolve all open mmap ranges on a 
> subvolume.  Perhaps using the address_space->private_list anchored on 
> each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may
not get the exact terminology right.

So the essential problem of DAX here is basically the opposite of
data-deduplication. Instead of merging two duplicate data regions into
one, you want to mark a shared region as at-risk of in-place
modification while keeping the original content intact for any
snapshots that still reference it.

So suppose you _require_ data checksums and a data profile of "dup",
mirroring, or one of the other fault-tolerant layouts.

By definition, any block that gets written with content it didn't have
before will now have a bad (stale) checksum.

If the inode is flagged for direct IO, that's an indication that its
blocks may have been updated in place.
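
Just to be concrete about what I mean by "bad checksum", here is a tiny
toy sketch (plain Python, with crc32 standing in for the real checksum
machinery; obviously none of this is btrfs code):

import zlib

stored_csum = zlib.crc32(b"old contents")   # what the csum tree remembers
block = bytearray(b"old contents")          # the DAX-mapped block
block[0:3] = b"NEW"                         # a store straight through the mapping
was_modified = zlib.crc32(bytes(block)) != stored_csum   # True -> detected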

At this point you really just need to do the opposite of deduplication:
find/recover the original contents and assign them (or leave them
assigned) to the old/other snapshots, then compute the new checksum for
the "original" block and assign that block to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now the original storage region has only one user, so on sync or close
you can fix its checksums on the cheap.
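
Here is a toy userspace model of that step, just to pin down the
bookkeeping. All the names are made up (nothing here is real btrfs
code), and I am assuming the untouched "dup" copy is where the old
bytes get recovered from:

import zlib

class ToyExtent:
    def __init__(self, data):
        self.primary = bytearray(data)       # the copy DAX stores land in
        self.mirror = bytearray(data)        # the untouched "dup" copy
        self.csum = zlib.crc32(bytes(data))  # stored checksum of the old contents
        self.owners = set()                  # subvolume ids; refcount == len(owners)

def dax_write(extent, offset, new_bytes):
    # A store straight through the mapping: only the primary copy changes,
    # so extent.csum no longer matches extent.primary.
    extent.primary[offset:offset + len(new_bytes)] = new_bytes

def sync_or_close(extent, active_subvol, allocate_extent):
    # The "reverse dedup": hand everyone *else* a recovered copy of the
    # old bytes, then re-checksum the dirtied extent for the active
    # subvolume alone.
    recovered = None
    others = extent.owners - {active_subvol}
    if others:
        recovered = allocate_extent(bytes(extent.mirror))  # recover the old contents
        recovered.owners = others          # ...and they belong to the other users
        extent.owners = {active_subvol}    # the original region has one user left
    # Single user now, so fixing the checksum (and resyncing the dup copy)
    # is cheap and safe.
    extent.mirror = bytearray(extent.primary)
    extent.csum = zlib.crc32(bytes(extent.primary))
    return recovered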

Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.

So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.

Maybe, if the data isn't already duplicated, then the write mapping,
the DAX open, or the setting of the S_DUP flag could force the file
into an extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated with the
snapshots instead, since either copy of the data can serve any user
equally well.
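
With the toy model above, that is just the observation that the freshly
allocated extent is the one handed to the snapshots, while the active
subvolume keeps the dirtied original right where it is:

ext = ToyExtent(b"old contents")
ext.owners = {"subvol-A", "snap-1", "snap-2"}
dax_write(ext, 0, b"NEW")
old = sync_or_close(ext, "subvol-A", ToyExtent)
assert old.owners == {"snap-1", "snap-2"}   # new extent, old bytes
assert ext.owners == {"subvol-A"}           # original extent, new bytes, fresh csum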

The side note is that any of this only ever matters if the usage count
is greater than one, so at worst taking a snapshot, which is already a
_little_ racy anyway, would/could trigger a semi-lightweight copy pass
over any S_DAX files:

if S_DAX :
  if checksum invalid :
    copy the data as-is, checksum it, and store the copy in the snapshot
  else :
    look for a duplicate checksum
    if duplicate found :
      assign that extent to the snapshot
    else :
      if the file is open for writing and has any mmaps for write :
        copy the extent and assign the copy to the new snapshot
      else :
        increment the usage count and assign the current block to the snapshot
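
Spelled out in the same toy model (reusing ToyExtent and zlib from the
sketch above; csum_index is just a dict keyed by checksum, standing in
for whatever duplicate-lookup the real thing would use, and all names
are still hypothetical):

def snapshot_dax_extent(extent, new_snap, csum_index,
                        writable_and_mmapped, allocate_extent):
    current = zlib.crc32(bytes(extent.primary))
    if current != extent.csum:
        # Checksum invalid: already dirtied through the mapping, so the
        # snapshot gets its own stable copy, checksummed as-is.
        snap_copy = allocate_extent(bytes(extent.primary))
        snap_copy.owners = {new_snap}
        return snap_copy
    dup = csum_index.get(extent.csum)       # look for a duplicate checksum
    if dup is not None and dup is not extent:
        dup.owners.add(new_snap)            # share the existing duplicate
        return dup
    if writable_and_mmapped:
        snap_copy = allocate_extent(bytes(extent.primary))
        snap_copy.owners = {new_snap}       # copy now; it may change under us
        return snap_copy
    extent.owners.add(new_snap)             # just bump the usage count
    return extent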

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap (and DAX) is speed during the actual
updates, this idea is basically about amortizing the copy costs into
the task of maintaining the snapshots instead of leaving them in the
immediate hands of the time-critical updater.

A flush, munmap, or close by the user, or a system-wide sync event,
would also be good points at which to expense the bookkeeping time.


Thread overview: 32+ messages
2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
2018-12-05 12:42   ` Johannes Thumshirn
2018-12-05 12:43   ` Nikolay Borisov
2018-12-05 14:59     ` Adam Borowski
2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
2018-12-05 13:11   ` Nikolay Borisov
2018-12-05 13:22   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
2018-12-05 13:26   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2018-12-05 13:35   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
2018-12-05 13:56   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
2018-12-05 13:59   ` Johannes Thumshirn
2018-12-05 14:52   ` Christoph Hellwig
2018-12-06 11:46     ` Goldwyn Rodrigues
2018-12-12  8:07       ` Christoph Hellwig
2019-03-26 19:36   ` Dan Williams
2019-03-27 11:10     ` Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 08/10] btrfs: dax add read mmap path Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 10/10] btrfs: dax mmap write Goldwyn Rodrigues
2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
2018-12-05 21:36   ` Jeff Mahoney
2018-12-05 13:57 ` Adam Borowski
2018-12-05 21:37 ` Jeff Mahoney
2018-12-06  7:40   ` Robert White [this message]
2018-12-06 10:07 ` Johannes Thumshirn
2018-12-06 11:47   ` Goldwyn Rodrigues
