All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, "Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-nvdimm@lists.01.org
Subject: Re: [RFC v2] [PATCH 0/10] DAX page fault locking
Date: Wed, 23 Mar 2016 16:50:14 -0400	[thread overview]
Message-ID: <20160323205014.GW23727@linux.intel.com> (raw)
In-Reply-To: <20160323150939.GK4512@quack.suse.cz>

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > The basic idea is that we use a bit in an exceptional radix tree entry as
> > > a lock bit and use it similarly to how page lock is used for normal faults.
> > > That way we fix races between hole instantiation and read faults of the
> > > same index. For now I have disabled PMD faults since there the issues with
> > > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > > has landed, I can have a look into using that for proper locking of PMD faults
> > > but first I want normal pages sorted out.
> > 
> > FYI, the multi-order radix tree code that landed is unusably buggy.
> > Ross and I have been working like madmen for the past three weeks to fix
> > all of the bugs we've found and not introduce new ones.  The radix tree
> > test suite has been enormously helpful in this regard, but we're still
> > finding corner cases (thanks, RCU! ;-)
> > 
> > Our current best effort can be found hiding in
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> > but it's for sure not ready for review yet.  I just don't want other
> > people trying to use the facility and wasting their time.
> 
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used
> RADIX_TREE_MAP_SHIFT == 9? We would need to put slot pointers out of
> radix_tree_node structure (there'd be full page worth of them) but that's
> easy. More complications probably come from the fact that we don't want
> that unconditionally since radix tree for small files would consume
> considerably more memory and that could be an issue for some systems. For
> DAX as such we don't really care I think, at least for now, but for normal
> page cache we do. So we would have to make RADIX_TREE_MAP_SHIFT
> per-radix-tree property. What do you think? I can try to write some patches
> if you'd consider it's worth it...

I haven't tried it yet.  I think one of the problems is that there may be
architectures which have PMD_SHIFT-PAGE_SHIFT != PUD_SHIFT-PMD_SHIFT.
I have started evolving the radix tree code towards something
that can support variable height nodes (check the latest head of
radix-fixes-2016-03-15), but I didn't consider splitting the slot array
out of the radix_tree_node.

It'd absolutely be possible to mix different order nodes within the same
tree, but the problem becomes deciding when to use which shift at which
level.  If the first insertion is an order-9 entry, then that's easy, but
if you already have a few order-0 entries in a few places in an order-6
based tree then converting that tree to be order-9 based could be tricky.

Do we really want to introduce another pointer follow operation at each
level of the radix tree?  It'd be partially compensated for by having
fewer levels.  Eg: a file with 1TB entries (and 4k pages) would have 28
bits used for index.  With the current 6-bit MAP_SHIFT, that's 5 levels.
With a 9-bit MAP_SHIFT, that's 4 levels, or 8 indirections.  I seem to
have picked the worst possible case out of thin air there ;-)  A 512GB
file would also use 5 levels with a 6-bit MAP_SHIFT and only 3 with a
9-bit MAP_SHIFT (which would be 6 indirections).

Another way we could go here is removing all the metadata from the
tree, so that each level is only a page.  We could have a metadata tree
that shadows its structure and contains the parent, shift, tags, etc.
That way the lookup would be fast and the less common operations would
be slower.

I'm going to keep going with the sibling entries, but feel free to try
other ways of organising the radix tree!  May the best one win (and may
we all contribute to the test suite ...)

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: Matthew Wilcox <willy@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
	NeilBrown <neilb@suse.com>,
	"Wilcox, Matthew R" <matthew.r.wilcox@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC v2] [PATCH 0/10] DAX page fault locking
Date: Wed, 23 Mar 2016 16:50:14 -0400	[thread overview]
Message-ID: <20160323205014.GW23727@linux.intel.com> (raw)
In-Reply-To: <20160323150939.GK4512@quack.suse.cz>

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > The basic idea is that we use a bit in an exceptional radix tree entry as
> > > a lock bit and use it similarly to how page lock is used for normal faults.
> > > That way we fix races between hole instantiation and read faults of the
> > > same index. For now I have disabled PMD faults since there the issues with
> > > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > > has landed, I can have a look into using that for proper locking of PMD faults
> > > but first I want normal pages sorted out.
> > 
> > FYI, the multi-order radix tree code that landed is unusably buggy.
> > Ross and I have been working like madmen for the past three weeks to fix
> > all of the bugs we've found and not introduce new ones.  The radix tree
> > test suite has been enormously helpful in this regard, but we're still
> > finding corner cases (thanks, RCU! ;-)
> > 
> > Our current best effort can be found hiding in
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> > but it's for sure not ready for review yet.  I just don't want other
> > people trying to use the facility and wasting their time.
> 
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used
> RADIX_TREE_MAP_SHIFT == 9? We would need to put slot pointers out of
> radix_tree_node structure (there'd be full page worth of them) but that's
> easy. More complications probably come from the fact that we don't want
> that unconditionally since radix tree for small files would consume
> considerably more memory and that could be an issue for some systems. For
> DAX as such we don't really care I think, at least for now, but for normal
> page cache we do. So we would have to make RADIX_TREE_MAP_SHIFT
> per-radix-tree property. What do you think? I can try to write some patches
> if you'd consider it's worth it...

I haven't tried it yet.  I think one of the problems is that there may be
architectures which have PMD_SHIFT-PAGE_SHIFT != PUD_SHIFT-PMD_SHIFT.
I have started evolving the radix tree code towards something
that can support variable height nodes (check the latest head of
radix-fixes-2016-03-15), but I didn't consider splitting the slot array
out of the radix_tree_node.

It'd absolutely be possible to mix different order nodes within the same
tree, but the problem becomes deciding when to use which shift at which
level.  If the first insertion is an order-9 entry, then that's easy, but
if you already have a few order-0 entries in a few places in an order-6
based tree then converting that tree to be order-9 based could be tricky.

Do we really want to introduce another pointer follow operation at each
level of the radix tree?  It'd be partially compensated for by having
fewer levels.  Eg: a file with 1TB entries (and 4k pages) would have 28
bits used for index.  With the current 6-bit MAP_SHIFT, that's 5 levels.
With a 9-bit MAP_SHIFT, that's 4 levels, or 8 indirections.  I seem to
have picked the worst possible case out of thin air there ;-)  A 512GB
file would also use 5 levels with a 6-bit MAP_SHIFT and only 3 with a
9-bit MAP_SHIFT (which would be 6 indirections).

Another way we could go here is removing all the metadata from the
tree, so that each level is only a page.  We could have a metadata tree
that shadows its structure and contains the parent, shift, tags, etc.
That way the lookup would be fast and the less common operations would
be slower.

I'm going to keep going with the sibling entries, but feel free to try
other ways of organising the radix tree!  May the best one win (and may
we all contribute to the test suite ...)


  reply	other threads:[~2016-03-23 20:49 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-21 13:22 [RFC v2] [PATCH 0/10] DAX page fault locking Jan Kara
2016-03-21 13:22 ` Jan Kara
2016-03-21 13:22 ` [PATCH 01/10] DAX: move RADIX_DAX_ definitions to dax.c Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 17:25   ` Matthew Wilcox
2016-03-21 17:25     ` Matthew Wilcox
2016-03-21 13:22 ` [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 17:34   ` Matthew Wilcox
2016-03-21 17:34     ` Matthew Wilcox
2016-03-22  9:12     ` Jan Kara
2016-03-22  9:12       ` Jan Kara
2016-03-22  9:27       ` Matthew Wilcox
2016-03-22  9:27         ` Matthew Wilcox
2016-03-22 10:37         ` Jan Kara
2016-03-22 10:37           ` Jan Kara
2016-03-23 16:41           ` Ross Zwisler
2016-03-23 16:41             ` Ross Zwisler
2016-03-24 12:31             ` Jan Kara
2016-03-24 12:31               ` Jan Kara
2016-03-21 13:22 ` [PATCH 03/10] dax: Remove complete_unwritten argument Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:12   ` Ross Zwisler
2016-03-23 17:12     ` Ross Zwisler
2016-03-24 12:32     ` Jan Kara
2016-03-24 12:32       ` Jan Kara
2016-03-21 13:22 ` [PATCH 04/10] dax: Fix data corruption for written and mmapped files Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:39   ` Ross Zwisler
2016-03-23 17:39     ` Ross Zwisler
2016-03-24 12:51     ` Jan Kara
2016-03-24 12:51       ` Jan Kara
2016-03-29 15:17       ` Ross Zwisler
2016-03-29 15:17         ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 05/10] dax: Allow DAX code to replace exceptional entries Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:52   ` Ross Zwisler
2016-03-23 17:52     ` Ross Zwisler
2016-03-24 10:42     ` Jan Kara
2016-03-24 10:42       ` Jan Kara
2016-03-21 13:22 ` [PATCH 06/10] dax: Remove redundant inode size checks Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 21:08   ` Ross Zwisler
2016-03-23 21:08     ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 07/10] dax: Disable huge page handling Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 20:50   ` Ross Zwisler
2016-03-23 20:50     ` Ross Zwisler
2016-03-24 12:56     ` Jan Kara
2016-03-24 12:56       ` Jan Kara
2016-03-21 13:22 ` [PATCH 08/10] dax: New fault locking Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-29 21:57   ` Ross Zwisler
2016-03-29 21:57     ` Ross Zwisler
2016-03-31 16:27     ` Jan Kara
2016-03-31 16:27       ` Jan Kara
2016-03-21 13:22 ` [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 19:11   ` Matthew Wilcox
2016-03-21 19:11     ` Matthew Wilcox
2016-03-22  7:03     ` Jan Kara
2016-03-22  7:03       ` Jan Kara
2016-03-29 22:18   ` Ross Zwisler
2016-03-29 22:18     ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 10/10] dax: Remove i_mmap_lock protection Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-29 22:17   ` Ross Zwisler
2016-03-29 22:17     ` Ross Zwisler
2016-03-21 17:41 ` [RFC v2] [PATCH 0/10] DAX page fault locking Matthew Wilcox
2016-03-21 17:41   ` Matthew Wilcox
2016-03-23 15:09   ` Jan Kara
2016-03-23 15:09     ` Jan Kara
2016-03-23 20:50     ` Matthew Wilcox [this message]
2016-03-23 20:50       ` Matthew Wilcox
2016-03-24 10:00     ` Matthew Wilcox
2016-03-24 10:00       ` Matthew Wilcox
2016-03-22 19:32 ` Ross Zwisler
2016-03-22 19:32   ` Ross Zwisler
2016-03-22 21:07   ` Toshi Kani
2016-03-22 21:07     ` Toshi Kani
2016-03-22 21:15     ` Dave Chinner
2016-03-22 21:15       ` Dave Chinner
2016-03-23  9:45     ` Jan Kara
2016-03-23  9:45       ` Jan Kara
2016-03-23 15:11       ` Toshi Kani
2016-03-23 15:11         ` Toshi Kani
  -- strict thread matches above, loose matches on Subject: below --
2016-03-21 13:21 Jan Kara
2016-03-21 13:21 ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160323205014.GW23727@linux.intel.com \
    --to=willy@linux.intel.com \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.