linux-btrfs.vger.kernel.org archive mirror
From: Adam Borowski <kilobyte@angband.pl>
To: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 00/10] btrfs: Support for DAX devices
Date: Wed, 5 Dec 2018 14:57:15 +0100
Message-ID: <20181205135715.glozremrekz2kesx@angband.pl>
In-Reply-To: <20181205122835.19290-1-rgoldwyn@suse.de>

On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel: you
mmap the file, then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.
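
For illustration, a rough userspace sketch of that write path.  Assumptions
that are mine rather than the patchset's: the helper names persist_range()
and mapped_store() are made up, the cache line is taken to be 64 bytes, and
the flush uses x86 CLFLUSH with an msync() fallback elsewhere.  On an
ordinary (non-DAX) file the flush only empties CPU caches, so treat this as
the shape of the code, not a durability guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#endif

#define CACHELINE 64            /* assumed cache line size */

/* Push [addr, addr+len) out of the CPU caches; returns 0 on success. */
int persist_range(void *addr, size_t len)
{
#if defined(__x86_64__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);   /* userspace flush, no syscall */
    _mm_sfence();                       /* order flushes before later stores */
    return 0;
#else
    /* Portable fallback: unlike the branch above, this enters the kernel. */
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);

    return msync((void *)start, (uintptr_t)addr + len - start, MS_SYNC);
#endif
}

/* Store len bytes at offset off of an existing file via a shared mapping. */
int mapped_store(const char *path, size_t off, const void *src, size_t len)
{
    struct stat st;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || off + len > (size_t)st.st_size) {
        close(fd);
        return -1;
    }
    /* On a real DAX mount one would pass MAP_SHARED_VALIDATE | MAP_SYNC. */
    unsigned char *map = mmap(NULL, (size_t)st.st_size,
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;

    memcpy(map + off, src, len);        /* plain CPU stores into the mapping */
    int ret = persist_range(map + off, len);

    munmap(map, (size_t)st.st_size);
    return ret;
}
```

Note that between the mmap() and the munmap() the filesystem never sees the
store, which is exactly why per-write CoW or checksumming has no hook here.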

CoW via page faults is fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

The only way I see would be to CoW once, then pretend the page is nodatacow
until the next commit, when we checksum it, add it to the metadata trees, and
mark it for CoWing on the next write.  Lots of complexity, and you still need
to copy the whole thing every commit (so no gain).
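
A toy model of the two policies, just to count faults.  Everything here is
hypothetical (struct page_state, page_write(), page_commit() are names made
up for the example, not btrfs code); it only tracks whether the page is
currently write-protected and how many faults and page copies result.

```c
#include <assert.h>
#include <stdbool.h>

enum policy {
    COW_EVERY_WRITE,        /* "real" btrfs CoW semantics */
    COW_ONCE_PER_COMMIT     /* CoW once, then nodatacow until commit */
};

struct page_state {
    bool writable;          /* false: the next store traps into the kernel */
    unsigned long faults;   /* page faults taken so far */
    unsigned long copies;   /* whole-page copies performed so far */
};

/* Model one store to the page under the given policy. */
void page_write(struct page_state *p, enum policy pol)
{
    if (!p->writable) {
        p->faults++;        /* trap, allocate a new extent ... */
        p->copies++;        /* ... and copy the whole page into it */
        p->writable = true;
    }
    /* the store itself goes straight to pmem here */
    if (pol == COW_EVERY_WRITE)
        p->writable = false;    /* re-protect immediately */
}

/* Model a transaction commit: checksum, update trees, re-arm CoW. */
void page_commit(struct page_state *p)
{
    p->writable = false;
}
```

With 1000 stores and a commit every 100 stores, the first policy takes 1000
faults and the second takes 10, but the second still copies the page once per
commit interval.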

I.e., we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...
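
Sketched as a userspace model (not kernel code; struct extent, struct
mapping and the function names are invented for the example, and one 4096
byte page stands in for an extent): taking a snapshot adds a sharer and
revokes write permission, and the next store faults and gets a private copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096          /* assumed page size */

struct extent {
    unsigned char data[PAGE_SIZE];
    int refs;                   /* number of sharers (file + snapshots) */
};

struct mapping {
    struct extent *ext;         /* extent currently backing the mapping */
    bool writable;              /* false: the next store faults */
    unsigned long faults;
};

struct extent *extent_new(void)
{
    struct extent *e = calloc(1, sizeof(*e));

    if (e)
        e->refs = 1;
    return e;
}

void mapping_init(struct mapping *m, struct extent *e)
{
    m->ext = e;
    m->writable = (e->refs == 1);
    m->faults = 0;
}

/* Snapshot/reflink: add a sharer and deny further writes via the mapping. */
struct extent *snapshot(struct mapping *m)
{
    m->ext->refs++;
    m->writable = false;
    return m->ext;
}

/* Store one byte; a write-protected mapping takes the fault path first. */
int mapping_write(struct mapping *m, size_t off, unsigned char byte)
{
    if (off >= PAGE_SIZE)
        return -1;
    if (!m->writable) {
        m->faults++;
        if (m->ext->refs > 1) {         /* still shared: copy before write */
            struct extent *copy = malloc(sizeof(*copy));

            if (!copy)
                return -1;
            memcpy(copy->data, m->ext->data, PAGE_SIZE);
            copy->refs = 1;
            m->ext->refs--;
            m->ext = copy;
        }
        m->writable = true;
    }
    m->ext->data[off] = byte;
    return 0;
}
```

No list of mmap()s to re-map is needed in this model; only the page
permissions change at snapshot time, and the fault does the redirection.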

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise used by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and the full set of btrfs features for the rest.  Write dax is
dangerous for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.
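
To show what a torn write looks like, here is a small simulation.  It is a
simplification under stated assumptions: 4096-byte pages, 64-byte cache
lines, and cache lines reaching the media strictly in order; the function
names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096          /* assumed page size */
#define CACHELINE 64            /* assumed cache line size */

/*
 * Model a crash in the middle of overwriting a page in place: only the
 * first lines_persisted cache lines of new_page have reached the media.
 */
void crash_during_page_write(unsigned char *media,
                             const unsigned char *new_page,
                             size_t lines_persisted)
{
    size_t n = lines_persisted * CACHELINE;

    if (n > PAGE_SIZE)
        n = PAGE_SIZE;
    memcpy(media, new_page, n);
}

/* A page is torn when it matches neither the old nor the new version. */
bool page_is_torn(const unsigned char *media,
                  const unsigned char *old_page,
                  const unsigned char *new_page)
{
    return memcmp(media, old_page, PAGE_SIZE) != 0 &&
           memcmp(media, new_page, PAGE_SIZE) != 0;
}
```

A program that assumes page-level atomicity expects to find either the old
or the new page after a crash; any value of lines_persisted strictly between
0 and 64 leaves a mix of both.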


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄⠀⠀⠀⠀ to the city of his birth to die.

