linux-fsdevel.vger.kernel.org archive mirror
* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
@ 2019-06-29 16:39 Luke Kenneth Casson Leighton
  0 siblings, 0 replies; 15+ messages in thread
From: Luke Kenneth Casson Leighton @ 2019-06-29 16:39 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, axboe, darrick.wong, david, dchinner, josef,
	kent.overstreet, linux-bcache, linux-fsdevel, linux-kernel,
	peterz, tj, viro, zach.brown

hey linus, you made news again, all blown up and pointless again.
you're doing great: you're being honest. remember the offer i made to
put you in touch with my friend.

anecdotal story: andrew tridgell worked on the fujitsu sparc
supercomputer a couple decades ago: it had a really weird DMA ring
bus.

* memory-to-memory copy (in the same core) was 10 Mbytes/sec
* DMA memory-to-memory copy (in the same core) was 20 Mbytes/sec
* memory-to-memory copy (across the ring bus, i.e. to another machine) was
100 Mbytes/sec
* DMA memory-to-memory copy (across the ring bus) was *200* Mbytes/sec.

when andrew tried asking people, "hey everyone, we need a filesystem
that can work really well on this fast parallel system", he had to
continuously fend off "i got a great idea for in-core memory-to-memory
caching!!!!" suggestions, because they *just would never work*.

the point being: caches aren't always "fast".

/salutes.

l.


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 Kent Overstreet
  2019-06-10 20:46 ` Linus Torvalds
@ 2019-07-03  5:59 ` Stefan K
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan K @ 2019-07-03  5:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcache, Dave Chinner,
	Darrick J. Wong,
	Zach Brown, Peter Zijlstra, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Linus Torvalds, Tejun Heo

Hello,

Is there a chance to get this into kernel 5.3?
And thanks for this fs!


On Monday, June 10, 2019 9:14:08 PM CEST Kent Overstreet wrote:
> Last status update: https://lkml.org/lkml/2018/12/2/46
>
> Current status - I'm pretty much running out of things to polish and excuses to
> keep tinkering. The core featureset is _done_ and the list of known outstanding
> bugs is getting to be short and unexciting. The next big things on my todo list
> are finishing erasure coding and reflink, but there's no reason for merging to
> wait on those.
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review
>
> The list of non bcachefs patches is:
>
> closures: fix a race on wakeup from closure_sync
> closures: closure_wait_event()
> bcache: move closures to lib/
> bcache: optimize continue_at_nobarrier()
> block: Add some exports for bcachefs
> Propagate gfp_t when allocating pte entries from __vmalloc
> fs: factor out d_mark_tmpfile()
> fs: insert_inode_locked2()
> mm: export find_get_pages()
> mm: pagecache add lock
> locking: SIX locks (shared/intent/exclusive)
> Compiler Attributes: add __flatten
>
> Most of the patches are pretty small, of the ones that aren't:
>
>  - SIX locks have already been discussed, and seem to be pretty uncontroversial.
>
>  - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
>    page cache inconsistencies with dio and other operations, in particular
>    racing vs. page faults - honestly, it's criminal that we still don't have a
>    mechanism in the kernel to address this; other filesystems are susceptible to
>    these kinds of bugs too.
>
>    My patch is intentionally ugly in the hopes that someone else will come up
>    with a magical elegant solution, but in the meantime it's an "it's ugly but
>    it works" sort of thing, and I suspect in real world scenarios it's going to
>    beat any kind of range locking performance wise, which is the only
>    alternative I've heard discussed.
>
>  - Propagate gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
>    GFP_NOFS, that's all that is.
>
>  - and, moving closures out of drivers/md/bcache to lib/.
>
> The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
> be mailing out all of that as patches, but if any code reviewers have
> suggestions on what would make that go easier, go ahead and speak up. The last
> time I was mailing things out for review the main thing that came up was ioctls,
> but the ioctl interface hasn't really changed since then. I'm pretty confident
> in the on disk format stuff, which was the other thing that was mentioned.
>
> ----------
>
> This has been a monumental effort over a lot of years, and I'm _really_ happy
> with how it's turned out. I'm excited to finally unleash this upon the world.
>





* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 23:02         ` Dave Chinner
@ 2019-06-19  8:21           ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2019-06-19  8:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linus Torvalds, Dave Chinner, Waiman Long,
	Peter Zijlstra, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Darrick J. Wong, Zach Brown, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo

On Thu 13-06-19 09:02:24, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > This would simplify things a lot and eliminate a really nasty corner case - page
> > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > readahead can pull in pages that do overlap with the dio.
> 
> Page cache readahead needs to be moved under the filesystem IO
> locks. There was a recent thread about how readahead can race with
> hole punching and other fallocate() operations because page cache
> readahead bypasses the filesystem IO locks used to serialise page
> cache invalidation.
> 
> e.g. Readahead can be directed by userspace via fadvise, so we now
> have file->f_op->fadvise() so that filesystems can lock the inode
> before calling generic_fadvise() such that page cache instantiation
> and readahead dispatch can be serialised against page cache
> invalidation. I have a patch for XFS sitting around somewhere that
> implements the ->fadvise method.
> 
> I think there are some other patches floating around to address the
> other readahead mechanisms to only be done under filesystem IO locks,
> but I haven't had time to dig into it any further. Readahead from
> page faults most definitely needs to be under the MMAPLOCK at
> least so it serialises against fallocate()...

Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
well. I'll post it soon since the rest of the series isn't really dependent
on it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 16:21       ` Kent Overstreet
@ 2019-06-12 23:02         ` Dave Chinner
  2019-06-19  8:21           ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2019-06-12 23:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> > I just recently said this with reference to the range lock stuff I'm
> > working on in the background:
> > 
> > 	FWIW, it's to avoid problems with stupid userspace stuff
> > 	that nobody really should be doing that I want range locks
> > 	for the XFS inode locks.  If userspace overlaps the ranges
> > 	and deadlocks in that case, they get to keep all the
> > 	broken bits because, IMO, they are doing something
> > 	monumentally stupid. I'd probably be making it return
> > 	EDEADLOCK back out to userspace in that case rather than
> > 	deadlocking but, fundamentally, I think it's broken
> > 	behaviour that we should be rejecting with an error rather
> > 	than adding complexity trying to handle it.
> > 
> > So I think this recursive locking across a page fault case should
> > just fail, not add yet more complexity to try to handle a rare
> > corner case that exists more in theory than in reality. i.e. put the
> > lock context in the current task, then if the page fault requires a
> > conflicting lock context to be taken, we terminate the page fault,
> > back out of the IO and return EDEADLOCK out to userspace. This works
> > for all types of lock contexts - only the filesystem itself needs to
> > know what the lock context pointer contains....
> 
> Ok, I'm totally on board with returning EDEADLOCK.
> 
> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> in the same address space as the file being read/written to, even if the buffer
> and the IO don't technically overlap?

I'd say that depends on the lock granularity. For a range lock,
we'd be able to do the IO for non-overlapping ranges. For a normal
mutex or rwsem, then we risk deadlock if the page fault triggers on
the same address space host as we already have locked for IO. That's
the case we currently handle with the second IO lock in XFS, ext4,
btrfs, etc (XFS_MMAPLOCK_* in XFS).

One of the reasons I'm looking at range locks for XFS is to get rid
of the need for this second mmap lock, as there is no reason for it
existing if we can lock ranges and EDEADLOCK inside page faults and
return errors.

> This would simplify things a lot and eliminate a really nasty corner case - page
> faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> readahead can pull in pages that do overlap with the dio.

Page cache readahead needs to be moved under the filesystem IO
locks. There was a recent thread about how readahead can race with
hole punching and other fallocate() operations because page cache
readahead bypasses the filesystem IO locks used to serialise page
cache invalidation.

e.g. Readahead can be directed by userspace via fadvise, so we now
have file->f_op->fadvise() so that filesystems can lock the inode
before calling generic_fadvise() such that page cache instantiation
and readahead dispatch can be serialised against page cache
invalidation. I have a patch for XFS sitting around somewhere that
implements the ->fadvise method.
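
Something like this, as a rough sketch (names follow XFS convention;
details may differ from the actual patch):

    static int
    xfs_file_fadvise(
        struct file     *file,
        loff_t          start,
        loff_t          end,
        int             advice)
    {
        struct xfs_inode    *ip = XFS_I(file_inode(file));
        int                 lockflags = 0;
        int                 ret;

        /*
         * POSIX_FADV_WILLNEED kicks off readahead, so take the IO
         * lock shared to serialise it against truncate, hole punch
         * and other page cache invalidation.
         */
        if (advice == POSIX_FADV_WILLNEED) {
            lockflags = XFS_IOLOCK_SHARED;
            xfs_ilock(ip, lockflags);
        }
        ret = generic_fadvise(file, start, end, advice);
        if (lockflags)
            xfs_iunlock(ip, lockflags);
        return ret;
    }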

I think there are some other patches floating around to address the
other readahead mechanisms to only be done under filesystem IO locks,
but I haven't had time to dig into it any further. Readahead from
page faults most definitely needs to be under the MMAPLOCK at
least so it serialises against fallocate()...

> And on getting EDEADLOCK we could fall back to buffered IO, so
> userspace would never know....

Yup, that's a choice that individual filesystems can make.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-12 16:21       ` Kent Overstreet
  2019-06-12 23:02         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2019-06-12 16:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> I just recently said this with reference to the range lock stuff I'm
> working on in the background:
> 
> 	FWIW, it's to avoid problems with stupid userspace stuff
> 	that nobody really should be doing that I want range locks
> 	for the XFS inode locks.  If userspace overlaps the ranges
> 	and deadlocks in that case, they get to keep all the
> 	broken bits because, IMO, they are doing something
> 	monumentally stupid. I'd probably be making it return
> 	EDEADLOCK back out to userspace in that case rather than
> 	deadlocking but, fundamentally, I think it's broken
> 	behaviour that we should be rejecting with an error rather
> 	than adding complexity trying to handle it.
> 
> So I think this recursive locking across a page fault case should
> just fail, not add yet more complexity to try to handle a rare
> corner case that exists more in theory than in reality. i.e. put the
> lock context in the current task, then if the page fault requires a
> conflicting lock context to be taken, we terminate the page fault,
> back out of the IO and return EDEADLOCK out to userspace. This works
> for all types of lock contexts - only the filesystem itself needs to
> know what the lock context pointer contains....

Ok, I'm totally on board with returning EDEADLOCK.

Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
in the same address space as the file being read/written to, even if the buffer
and the IO don't technically overlap?

This would simplify things a lot and eliminate a really nasty corner case - page
faults trigger readahead. Even if the buffer and the direct IO don't overlap,
readahead can pull in pages that do overlap with the dio.

And on getting EDEADLOCK we could fall back to buffered IO, so userspace would
never know...


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  7:10       ` Dave Chinner
@ 2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2019-06-12  2:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 9:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
> in a couple of device drivers, and so on. i.e. This isn't an XFS
> issue I'm raising here - I'm raising a concern about the lack of
> validation of core infrastructure and its suitability for
> functionality extensions.

I haven't actually seen the reports.

That said, I do think this should be improving. The random
architecture-specific code is largely going away, and we'll have a
unified rwsem.

It might obviously cause some pain initially, but I think long-term we
should be much better off, at least avoiding the "on particular
configurations" issue.

              Linus


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:55     ` Linus Torvalds
@ 2019-06-11 14:26       ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2019-06-11 14:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:55:15PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > > Why does the regular page lock (at a finer granularity) not suffice?
> >
> > Because the lock needs to prevent pages from being _added_ to the page cache -
> > to do it with a page granularity lock it'd have to be part of the radix tree,
> 
> No, I understand that part, but I still think we should be able to do
> the locking per-page rather than over the whole mapping.
> 
> When doing dio, you need to iterate over old existing pages anyway in
> that range (otherwise the "no _new_ pages" part is kind of pointless
> when there are old pages there), so my gut feel is that you might as
> well at that point also "poison" the range you are doing dio on. With
> the xarray changes, we might be better at handling ranges. That was
> one of the arguments for the xarrays over the old radix tree model,
> after all.

We could do that -- if there are pages (or shadow entries) in the XArray,
replace them with "lock entries".  I think we'd want faults / buffered IO
to wait on those entries being removed.  I think
the DAX code is just about ready to switch over to lock entries instead
of having a special DAX lock bit.

The question is what to do when there are _no_ pages in the tree for a
range that we're about to do DIO on.  This should be the normal case --
as you say, DIO users typically have their own schemes for caching in
userspace, and rather resent the other users starting to cache their
file in the kernel.

Adding lock entries in the page cache for every DIO starts to look pretty
painful in terms of allocating radix tree nodes.  And it gets worse when
you have sub-page-size DIOs -- do we embed a count in the lock entry?
Or delay DIOs which hit in the same page as an existing DIO?

And then I start to question the whole reasoning behind how we do mixed
DIO and buffered IO; if there's a page in the page cache, why are we
writing it out, then doing a direct IO instead of doing a memcpy to the
page first, then writing the page back?

IOW, move to a model where:

 - If i_dio_count is non-zero, buffered I/O waits for i_dio_count to
   drop to zero before bringing pages in.
 - If i_pages is empty, DIOs increment i_dio_count, do the IO and
   decrement i_dio_count.
 - If i_pages is not empty, DIO is implemented by doing buffered I/O
   and waiting for the pages to finish writeback.

(needs a slight tweak to ensure that new DIOs can't hold off a buffered
I/O indefinitely)
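
A sketch of the write side of that model (hypothetical helper;
inode_dio_begin()/inode_dio_end()/inode_dio_wait() are the existing
i_dio_count helpers, everything else is illustrative, and the
anti-starvation tweak is left out):

    ssize_t sketch_file_write(struct kiocb *iocb, struct iov_iter *from)
    {
        struct inode *inode = file_inode(iocb->ki_filp);
        struct address_space *mapping = inode->i_mapping;
        ssize_t ret;

        if (!(iocb->ki_flags & IOCB_DIRECT)) {
            /* buffered I/O waits for i_dio_count to drop to zero */
            inode_dio_wait(inode);
            return generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
        }

        if (mapping->nrpages) {
            /* pages in the cache: do buffered I/O, wait for writeback */
            ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
            if (ret > 0) {
                int err = filemap_write_and_wait_range(mapping,
                        iocb->ki_pos, iocb->ki_pos + ret - 1);
                if (err)
                    ret = err;
            }
            return ret;
        }

        /* no pages: account the DIO and issue it directly */
        inode_dio_begin(inode);
        ret = mapping->a_ops->direct_IO(iocb, from);
        inode_dio_end(inode);
        return ret;
    }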


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:39     ` Linus Torvalds
@ 2019-06-11  7:10       ` Dave Chinner
  2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2019-06-11  7:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:39:00PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Please, no, let's not make the rwsems even more fragile than they
> > already are. I'm tired of the ongoing XFS customer escalations that
> > end up being root caused to yet another rwsem memory barrier bug.
> >
> > > Have you talked to Waiman Long about that?
> >
> > Unfortunately, Waiman has been unable to find/debug multiple rwsem
> > exclusion violations we've seen in XFS bug reports over the past 2-3
> > years.
> 
> Inside xfs you can do whatever you want.
>
> But in generic code, no, we're not saying "we don't trust the generic
> locking, so we cook our own random locking".

We use the generic rwsems in XFS, too, and it's the generic
rwsems that have been the cause of the problems I'm talking about.

The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
in a couple of device drivers, and so on. i.e. This isn't an XFS
issue I'm raising here - I'm raising a concern about the lack of
validation of core infrastructure and its suitability for
functionality extensions.

> If there really are exclusion issues, they should be fairly easy to
> try to find with a generic test-suite.  Have a bunch of readers that
> assert that some shared variable has a particular value, and a bunch of
> writers that then modify the value and set it back. Add some random
> timing and "yield" to them all, and show that the serialization is
> wrong.

Writing such a test suite would be the responsibility of the rwsem
maintainers, yes?

> Some kind of "XFS load Y shows problems" is undebuggable, and not
> necessarily due to locking.

Sure, but this wasn't isolated to XFS, and it wasn't one workload.

We had a growing pile of kernel crash dumps all with the same
signatures across multiple subsystems. When this happens, it falls
to the maintainer of that common element to more deeply analyse the
issue. One of the rwsem maintainers was unable to reproduce or find
the root cause of the pile of rwsem state corruptions, and so we've
been left hanging telling people "we think it's rwsems because the
state is valid right up to the rwsem state going bad, but we can't
prove it's a rwsem problem because the debug we've added to the
rwsem code makes the problem go away". Some time later, a bug would be
found in the upstream rwsem code....

This has played out several times over the past couple of years. No
locking bugs have been found in XFS, with the mmap_sem, the shrinker
rwsem, etc, but 4 or 5 bugs have been found in the rwsem code and
backports of those commits have been proven to solve _all_ the
issues that were reported.

That's the painful reality I'm telling you about here - that poor
upstream core infrastructure quality has had quite severe downstream
knock-on effects that cost a lot of time, resources, money and
stress to diagnose and rectify.  I don't want those same mistakes to
be made again for many reasons, not the least that the stress of
these situations has a direct and adverse impact on my mental
health....

> Because if the locking issues are real (and we did fix one bug
> recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
> count before increment") it needs to be fixed.

That's just one of the bugs we've tripped over. There's been a
couple of missed wakeups bugs that caused rwsem state hangs (e.g.
readers waiting with no holder), there was a power arch specific
memory barrier bug that caused read/write exclusion bugs, the
optimistic spinning caused some severe performance degradations on
the mmap_sem with some highly threaded workloads, the rwsem bias
changed from read biased to write biased (might be the other way
around, can't remember) some time around 4.10 causing a complete
inversion in mixed read-write IO characteristics, there was a
botched RHEL7 backport that had memory barrier bugs in it that
upstream didn't have that occurred because of the complexity of the
code, etc.

But this is all off-topic for bcachefs review - all we need to do
here is keep the SIX locking in a separate module and everything
rwsem related will be just fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-11  4:55     ` Linus Torvalds
  2019-06-11 14:26       ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
>
> > Why does the regular page lock (at a finer granularity) not suffice?
>
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page granularity lock it'd have to be part of the radix tree.

No, I understand that part, but I still think we should be able to do
the locking per-page rather than over the whole mapping.

When doing dio, you need to iterate over old existing pages anyway in
that range (otherwise the "no _new_ pages" part is kind of pointless
when there are old pages there), so my gut feel is that you might as
well at that point also "poison" the range you are doing dio on. With
the xarray changes, we might be better at handling ranges. That was
one of the arguments for the xarrays over the old radix tree model,
after all.

And I think the dio code would ideally want to have a range-based lock
anyway, rather than one global one. No?

Anyway, don't get me wrong. I'm not entirely against a "stop adding
pages" model per-mapping if it's just fundamentally simpler and nobody
wants anything fancier. So I'm certainly open to it, assuming it
doesn't add any real overhead to the normal case.

But I *am* against it when it has ad-hoc locking and random
anti-recursion things.

So I'm with Dave on the "I hope we can avoid the recursive hacks" by
making better rules. Even if I disagree with him on the locking thing
- I'd rather put _more_ stress on the standard locking and make sure it
really works, over having multiple similar locking models because they
don't trust each other.

               Linus


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:10   ` Dave Chinner
@ 2019-06-11  4:39     ` Linus Torvalds
  2019-06-11  7:10       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Please, no, let's not make the rwsems even more fragile than they
> already are. I'm tired of the ongoing XFS customer escalations that
> end up being root caused to yet another rwsem memory barrier bug.
>
> > Have you talked to Waiman Long about that?
>
> Unfortunately, Waiman has been unable to find/debug multiple rwsem
> exclusion violations we've seen in XFS bug reports over the past 2-3
> years.

Inside xfs you can do whatever you want.

But in generic code, no, we're not saying "we don't trust the generic
locking, so we cook our own random locking".

If there really are exclusion issues, they should be fairly easy to
try to find with a generic test-suite. Have a bunch of readers that
assert that some shared variable has a particular value, and a bunch of
writers that then modify the value and set it back. Add some random
timing and "yield" to them all, and show that the serialization is
wrong.
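
In userspace terms, the whole test is something like this (a pthread
rwlock standing in for the rwsem; a kernel version would use kthreads
and the real primitive):

    #include <assert.h>
    #include <pthread.h>
    #include <sched.h>

    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    static int shared;          /* writers flip this and put it back */

    static void *reader(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_rwlock_rdlock(&lock);
            assert(shared == 0);        /* fires if exclusion is broken */
            pthread_rwlock_unlock(&lock);
            if (!(i & 63))
                sched_yield();          /* randomise the timing a bit */
        }
        return NULL;
    }

    static void *writer(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_rwlock_wrlock(&lock);
            shared = 1;
            sched_yield();              /* widen the race window */
            shared = 0;
            pthread_rwlock_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];

        for (int i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, (i & 1) ? writer : reader, NULL);
        for (int i = 0; i < 8; i++)
            pthread_join(&t[i], NULL);
        return 0;
    }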

Some kind of "XFS load Y shows problems" is undebuggable, and not
necessarily due to locking.

Because if the locking issues are real (and we did fix one bug
recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
count before increment") it needs to be fixed. Some kind of "let's do
something else entirely" is simply not acceptable.

                  Linus


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:33     ` Dave Chinner
  2019-06-12 16:21       ` Kent Overstreet
  2019-06-11  4:55     ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2019-06-11  4:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 09:17:37PM -0400, Kent Overstreet wrote:
> On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> > On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> > That lock is somewhat questionable in the first place, and no, we
> > don't do those hacky recursive things anyway. A recursive lock is
> > almost always a buggy and mis-designed one.
> 
> You're preaching to the choir there, I still feel dirty about that code and I'd
> love nothing more than for someone else to come along and point out how stupid
> I've been with a much better way of doing it. 
> 
> > Why does the regular page lock (at a finer granularity) not suffice?
> 
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page granularity lock it'd have to be part of the radix tree.
> 
> > And no, nobody has ever cared. The dio people just don't care about
> > page cache anyway. They have their own thing going.
> 
> It's not just dio, it's even worse with the various fallocate operations. And
> the xfs people care, but IIRC even they don't have locking for pages being
> faulted in. This is an issue I've talked to other filesystem people quite a bit
> about - especially Dave Chinner, maybe we can get him to weigh in here.
> 
> And this inconsistency does result in _real_ bugs. It goes something like this:
>  - dio write shoots down the range of the page cache for the file it's writing
>    to, using invalidate_inode_pages2_range
>  - After the page cache shoot down, but before the write actually happens,
>    another process pulls those pages back in to the page cache
>  - Now the write happens: if that write was e.g. an allocating write, you're
>    going to have page cache state (buffer heads) that say that page doesn't have
>    anything on disk backing it, but it actually does because of the dio write.
> 
> xfs has additional locking (that the vfs does _not_ do) around both the buffered
> and dio IO paths to prevent this happening because of a buffered read pulling
> the pages back in, but no one has a solution for pages getting _faulted_ back in
> - either because of mmap or gup().
> 
> And there are some filesystem people who do know about this race, because at
> some point the dio code has been changed to shoot down the page cache _again_
> after the write completes. But that doesn't eliminate the race, it just makes it
> harder to trigger.
> 
> And dio writes actually aren't the worst of it, it's even worse with fallocate
> FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
> code, it looked _completely_ broken to me - the code seemed to think it was
> using the same mechanism truncate uses for shooting down the page cache and
> keeping pages from being re-added - but that only works for truncate because it's
> changing i_size and shooting down pages above i_size. Fallocate needs to shoot
> down pages that are still within i_size, so... yeah...

Yes, that ext4 code is broken, and Jan Kara is trying to work out
how to fix it. His recent patchset fell foul of taking the same lock
on either side of the mmap_sem in this path:

> The recursiveness is needed because otherwise, if you mmap a file, then do a dio
> write where you pass the address you mmapped to pwrite(), gup() from the dio
> write path will be trying to fault in the exact pages it's blocking from being
> added.
> 
> A better solution would be for gup() to detect that and return an error, so we
> can just fall back to buffered writes. Or just return an error to userspace
> because fuck anyone who would actually do that.

I just recently said this with reference to the range lock stuff I'm
working on in the background:

	FWIW, it's to avoid problems with stupid userspace stuff
	that nobody really should be doing that I want range locks
	for the XFS inode locks.  If userspace overlaps the ranges
	and deadlocks in that case, they get to keep all the
	broken bits because, IMO, they are doing something
	monumentally stupid. I'd probably be making it return
	EDEADLOCK back out to userspace in that case rather than
	deadlocking but, fundamentally, I think it's broken
	behaviour that we should be rejecting with an error rather
	than adding complexity trying to handle it.

So I think this recursive locking across a page fault case should
just fail, not add yet more complexity to try to handle a rare
corner case that exists more in theory than in reality. i.e. put the
lock context in the current task, then if the page fault requires a
conflicting lock context to be taken, we terminate the page fault,
back out of the IO and return EDEADLOCK out to userspace. This works
for all types of lock contexts - only the filesystem itself needs to
know what the lock context pointer contains....
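
In sketch form (every name below is hypothetical; in particular,
task_struct has no fs_lock_ctx field today):

    /* IO path publishes its lock context before it can take page faults: */
    static ssize_t sketch_dio_write(struct kiocb *iocb, struct iov_iter *from)
    {
        struct sketch_lock_ctx ctx = {
            .inode = file_inode(iocb->ki_filp),
        };
        ssize_t ret;

        sketch_lock_ctx_acquire(&ctx);
        current->fs_lock_ctx = &ctx;            /* hypothetical field */
        ret = sketch_do_dio(iocb, from);        /* gup() may fault here */
        current->fs_lock_ctx = NULL;
        sketch_lock_ctx_release(&ctx);

        /* a terminated fault surfaces as -EDEADLK; the caller can hand
         * it to userspace or fall back to buffered IO */
        return ret;
    }

    /* Fault path terminates instead of taking a conflicting lock: */
    static vm_fault_t sketch_filemap_fault(struct vm_fault *vmf)
    {
        struct sketch_lock_ctx *ctx = current->fs_lock_ctx;

        if (ctx && ctx->inode == file_inode(vmf->vma->vm_file))
            return VM_FAULT_SIGBUS;     /* back out of the IO */
        return filemap_fault(vmf);
    }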

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:10   ` Dave Chinner
  2019-06-11  4:39     ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2019-06-11  4:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".

Please, no, let's not make the rwsems even more fragile than they
already are. I'm tired of the ongoing XFS customer escalations that
end up being root caused to yet another rwsem memory barrier bug.

> Have you talked to Waiman Long about that?

Unfortunately, Waiman has been unable to find/debug multiple rwsem
exclusion violations we've seen in XFS bug reports over the past 2-3
years. Those memory barrier bugs have all been fixed by other people
long after Waiman has said "I can't reproduce any problems in my
testing" and essentially walked away from the problem. We've been
left multiple times wondering how the hell we even prove it's a
rwsem bug because there's no way to reproduce the inconsistent rwsem
state we see in the kernel crash dumps.

Hence, as a downstream rwsem user, I have relatively little
confidence in upstream's ability to integrate new functionality into
rwsems without introducing yet more subtle regressions that are only
exposed by heavy rwsem users like XFS. As such, I consider rwsems to
be extremely fragile, and they are now a prime suspect whenever we see
one-off memory corruption in a structure protected by a rwsem.

As such, please keep SIX locks separate to rwsems to minimise the
merge risk of bcachefs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` Linus Torvalds
@ 2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
  2019-06-11  4:55     ` Linus Torvalds
  2019-06-11  4:10   ` Dave Chinner
  1 sibling, 2 replies; 15+ messages in thread
From: Kent Overstreet @ 2019-06-11  1:17 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> > outside of fs/bcachefs/. My master branch has some performance optimizations for
> > the core buffered IO paths, but those are fairly tricky and invasive so I want
> > to hold off on those for now - this branch is intended to be more or less
> > suitable for merging as is.
> 
> Honestly, it really isn't.

Heh, I suppose that's what review is for :)

> There are obvious things wrong with it - like the fact that you've
> rebased it so that the original history is gone, yet you've not
> actually *fixed* the history, so you find things like reverts of
> commits that should simply have been removed, and fixes for things
> that should just have been fixed in the original commit the fix is
> for.

Yeah, I suppose I have dropped the ball on that lately. 
 
> But note that the cleanup should go further than just fix those kinds
> of technical issues. If you rebase, and you have fixes in your tree
> for things you rebase, just fix things as you rewrite history anyway
> (there are cases where the fix may be informative in itself and it's
> worth leaving around, but that's rare).

Yeah that has historically been my practice, I've just been moving away from
that kind of history editing as bcachefs has been getting more users. Hence the
in-between, worst-of-both-workflows state of the current tree.

But, I can certainly go through and clean things up like that one last time and
make everything bisectable again - I'll go through and write proper commit
messages too. Unless you'd be ok with just squashing most of the history down to
one commit - which would you prefer?

> Anyway, aside from that, I only looked at the non-bcachefs parts. Some
> of those are not acceptable either, like
> 
>     struct pagecache_lock add_lock
>         ____cacheline_aligned_in_smp; /* protects adding new pages */
> 
> in 'struct address_space', which is completely bogus, since that
> forces not only a potentially huge amount of padding, it also requires
> alignment that that struct simply fundamentally does not have, and
> _will_ not have.

Oh, good point.

> You can only use ____cacheline_aligned_in_smp for top-level objects,
> and honestly, it's almost never a win. That lock shouldn't be so hot.
> 
> That lock is somewhat questionable in the first place, and no, we
> don't do those hacky recursive things anyway. A recursive lock is
> almost always a buggy and mis-designed one.

You're preaching to the choir there, I still feel dirty about that code and I'd
love nothing more than for someone else to come along and point out how stupid
I've been with a much better way of doing it. 

> Why does the regular page lock (at a finer granularity) not suffice?

Because the lock needs to prevent pages from being _added_ to the page cache -
to do it with a page granularity lock it'd have to be part of the radix tree.
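
Roughly, the add lock is a two-mode shared lock over the whole mapping;
as an illustrative sketch (not the actual patch):

    struct pagecache_lock {
        atomic_long_t       v;      /* > 0: adders, < 0: blockers */
        wait_queue_head_t   wait;
    };

    /* shared among everyone adding pages; excludes dio/fallocate */
    static void pagecache_add_get(struct pagecache_lock *lock)
    {
        long v;

        for (;;) {
            v = atomic_long_read(&lock->v);
            if (v >= 0 &&
                atomic_long_cmpxchg(&lock->v, v, v + 1) == v)
                return;
            wait_event(lock->wait, atomic_long_read(&lock->v) >= 0);
        }
    }

    static void pagecache_add_put(struct pagecache_lock *lock)
    {
        if (atomic_long_dec_return(&lock->v) == 0)
            wake_up_all(&lock->wait);
    }

    /* pagecache_block_get()/put() are the mirror image (v <= 0, v - 1) */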

> And no, nobody has ever cared. The dio people just don't care about
> page cache anyway. They have their own thing going.

It's not just dio, it's even worse with the various fallocate operations. And
the xfs people care, but IIRC even they don't have locking for pages being
faulted in. This is an issue I've talked to other filesystem people quite a bit
about - especially Dave Chinner, maybe we can get him to weigh in here.

And this inconsistency does result in _real_ bugs. It goes something like this:
 - dio write shoots down the range of the page cache for the file it's writing
   to, using invalidate_inode_pages2_range
 - After the page cache shoot down, but before the write actually happens,
   another process pulls those pages back in to the page cache
 - Now the write happens: if that write was e.g. an allocating write, you're
   going to have page cache state (buffer heads) that say that page doesn't have
   anything on disk backing it, but it actually does because of the dio write.
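
In code terms, the window looks like this (sketch; submit_dio() is an
illustrative name):

    invalidate_inode_pages2_range(mapping, start, end); /* shoot down */

    /* <- a fault or buffered read can repopulate the range here */

    submit_dio(iocb, iter);  /* page cache is now stale vs. on-disk state */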

xfs has additional locking (that the vfs does _not_ do) around both the buffered
and dio IO paths to prevent this happening because of a buffered read pulling
the pages back in, but no one has a solution for pages getting _faulted_ back in
- either because of mmap or gup().

And there are some filesystem people who do know about this race, because at
some point the dio code has been changed to shoot down the page cache _again_
after the write completes. But that doesn't eliminate the race, it just makes it
harder to trigger.

And dio writes actually aren't the worst of it, it's even worse with fallocate
FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
code, it looked _completely_ broken to me - the code seemed to think it was
using the same mechanism truncate uses for shooting down the page cache and
keeping pages from being re-added - but that only works for truncate because it's
changing i_size and shooting down pages above i_size. Fallocate needs to shoot
down pages that are still within i_size, so... yeah...

The recursiveness is needed because otherwise, if you mmap a file, then do a dio
write where you pass the address you mmapped to pwrite(), gup() from the dio
write path will be trying to fault in the exact pages it's blocking from being
added.

A better solution would be for gup() to detect that and return an error, so we
can just fall back to buffered writes. Or just return an error to userspace
because fuck anyone who would actually do that.

But I fear plumbing that through gup() is going to be a hell of a lot uglier
than this patch.

I would really like Dave to weigh in here.

> Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

I don't want to do vmalloc in non-process context - but I do need to call
vmalloc when reading in btree nodes, from the filesystem IO path.

But I just learned today about this new memalloc_nofs_save() thing, so if that
works I'm more than happy to drop that patch.
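
(Which would look roughly like this - scoped GFP_NOFS, so even the pte
allocations inside vmalloc() drop __GFP_FS; everything but the
memalloc_nofs_*() calls is an illustrative name:)

    unsigned int flags = memalloc_nofs_save();

    b->data = vmalloc(btree_bytes(c));  /* all allocations here are NOFS */

    memalloc_nofs_restore(flags);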

> And the commit comments are very sparse. And not always signed off.

Yeah, I'll fix that.

> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".
> 
> Have you talked to Waiman Long about that?

No, I haven't, but I'm adding him to the list.

I really hate the idea of adding these sorts of special case features to the
core locking primitives though - I mean, look what's happened to the mutex code,
and the intent state isn't the only special feature they have. As is, they're
small and clean and they do their job well, I'd really prefer to have them just
remain their own thing instead of trying to cram it all into the
hyper-optimized rw semaphore code.

Also, six locks used to be in fs/bcachefs/, but last time I was mailing stuff
out for review Peter Zijlstra was dead set against exporting the osq lock stuff
- moving six locks to kernel/locking/ was actually his idea. 

I can say more about six locks tomorrow when I'm less sleep deprived, if you're
still not convinced.

Cheers.


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 Kent Overstreet
@ 2019-06-10 20:46 ` Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:10   ` Dave Chinner
  2019-07-03  5:59 ` Stefan K
  1 sibling, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2019-06-10 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Dave Chinner, Darrick J. Wong, Zach Brown, Peter Zijlstra,
	Jens Axboe, Josef Bacik, Alexander Viro, Andrew Morton,
	Tejun Heo

On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.

Honestly, it really isn't.

There are obvious things wrong with it - like the fact that you've
rebased it so that the original history is gone, yet you've not
actually *fixed* the history, so you find things like reverts of
commits that should simply have been removed, and fixes for things
that should just have been fixed in the original commit the fix is
for.

And this isn't just "if you rebase, just fix things". You have actual
bogus commit messages as a result of this all.

So to point at the revert, for example. The commit message is

    This reverts commit 36f389604294dfc953e6f5624ceb683818d32f28.

which is wrong to begin with - you should always explain *why* the
revert was done, not just state that it's a revert.

But since you rebased things, that commit 36f3896042 doesn't exist any
more to begin with.  So now you have a revert commit that doesn't
explain why it reverts things, but the only thing it *does* say is
simply wrong and pointless.

Some of the "fixes" commits have the exact same issue - they say things like

    gc lock must be held while invalidating buckets - fixes
    "1f7a95698e bcachefs: Invalidate buckets when writing to alloc btree"

and

    fixes b0f3e786995cb3b12975503f963e469db5a4f09b

both of which are dead and stale git object pointers since the commits
in question got rebased and have some other hash these days.

But note that the cleanup should go further than just fix those kinds
of technical issues. If you rebase, and you have fixes in your tree
for things you rebase, just fix things as you rewrite history anyway
(there are cases where the fix may be informative in itself and it's
worth leaving around, but that's rare).

Anyway, aside from that, I only looked at the non-bcachefs parts. Some
of those are not acceptable either, like

    struct pagecache_lock add_lock
        ____cacheline_aligned_in_smp; /* protects adding new pages */

in 'struct address_space', which is completely bogus, since that
forces not only a potentially huge amount of padding, it also requires
alignment that that struct simply fundamentally does not have, and
_will_ not have.

You can only use ____cacheline_aligned_in_smp for top-level objects,
and honestly, it's almost never a win. That lock shouldn't be so hot.

That lock is somewhat questionable in the first place, and no, we
don't do those hacky recursive things anyway. A recursive lock is
almost always a buggy and mis-designed one.

Why does the regular page lock (at a finer granularity) not suffice?

And no, nobody has ever cared. The dio people just don't care about
page cache anyway. They have their own thing going.

Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

And the commit comments are very sparse. And not always signed off.

I also get the feeling that the "intent" part of the six-locks could
just be done as a slight extension of the rwsem, where an "intent" is
the same as a write-lock, but without waiting for existing readers,
and then the write-lock part is just the "wait for readers to be
done".
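
In toy form, the semantics being discussed are roughly (illustrative
only, not the actual SIX implementation):

    struct six_sketch {
        spinlock_t  guard;
        unsigned    readers;    /* any number of shared holders */
        bool        intent;     /* at most one intent holder */
        bool        exclusive;
    };

    static bool six_sketch_tryread(struct six_sketch *s)
    {
        bool ok;

        spin_lock(&s->guard);
        ok = !s->exclusive;     /* intent does NOT block new readers */
        if (ok)
            s->readers++;
        spin_unlock(&s->guard);
        return ok;
    }

    static bool six_sketch_tryintent(struct six_sketch *s)
    {
        bool ok;

        spin_lock(&s->guard);
        ok = !s->intent && !s->exclusive;
        if (ok)
            s->intent = true;
        spin_unlock(&s->guard);
        return ok;
    }

    /* the "write-lock part": stop new readers, wait out existing ones */
    static void six_sketch_upgrade(struct six_sketch *s)
    {
        spin_lock(&s->guard);
        s->exclusive = true;
        spin_unlock(&s->guard);
        while (READ_ONCE(s->readers))
            cpu_relax();
    }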

Have you talked to Waiman Long about that?

                    Linus


* bcachefs status update (it's done cooking; let's get this sucker merged)
@ 2019-06-10 19:14 Kent Overstreet
  2019-06-10 20:46 ` Linus Torvalds
  2019-07-03  5:59 ` Stefan K
  0 siblings, 2 replies; 15+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache
  Cc: Kent Overstreet, Dave Chinner, Darrick J. Wong,
	Zach Brown, Peter Zijlstra, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Linus Torvalds, Tejun Heo

Last status update: https://lkml.org/lkml/2018/12/2/46

Current status - I'm pretty much running out of things to polish and excuses to
keep tinkering. The core featureset is _done_ and the list of known outstanding
bugs is getting to be short and unexciting. The next big things on my todo list
are finishing erasure coding and reflink, but there's no reason for merging to
wait on those.

So. Here's my bcachefs-for-review branch - this has the minimal set of patches
outside of fs/bcachefs/. My master branch has some performance optimizations for
the core buffered IO paths, but those are fairly tricky and invasive so I want
to hold off on those for now - this branch is intended to be more or less
suitable for merging as is.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review

The list of non bcachefs patches is:

closures: fix a race on wakeup from closure_sync
closures: closure_wait_event()
bcache: move closures to lib/
bcache: optimize continue_at_nobarrier()
block: Add some exports for bcachefs
Propagate gfp_t when allocating pte entries from __vmalloc
fs: factor out d_mark_tmpfile()
fs: insert_inode_locked2()
mm: export find_get_pages()
mm: pagecache add lock
locking: SIX locks (shared/intent/exclusive)
Compiler Attributes: add __flatten

Most of the patches are pretty small, of the ones that aren't:

 - SIX locks have already been discussed, and seem to be pretty uncontroversial.

 - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
   page cache inconsistencies with dio and other operations, in particular
   racing vs. page faults - honestly, it's criminal that we still don't have a
   mechanism in the kernel to address this; other filesystems are susceptible to
   these kinds of bugs too.

   My patch is intentionally ugly in the hopes that someone else will come up
   with a magical elegant solution, but in the meantime it's an "it's ugly but
   it works" sort of thing, and I suspect in real world scenarios it's going to
   beat any kind of range locking performance wise, which is the only
   alternative I've heard discussed.
   
 - Propagate gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
   GFP_NOFS, that's all that is.

 - and, moving closures out of drivers/md/bcache to lib/. 

The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
be mailing out all of that as patches, but if any code reviewers have
suggestions on what would make that go easier, go ahead and speak up. The last
time I was mailing things out for review the main thing that came up was ioctls,
but the ioctl interface hasn't really changed since then. I'm pretty confident
in the on disk format stuff, which was the other thing that was mentioned.

----------

This has been a monumental effort over a lot of years, and I'm _really_ happy
with how it's turned out. I'm excited to finally unleash this upon the world.


end of thread, other threads: [~2019-07-03 6:00 UTC | newest]

Thread overview: 15+ messages
2019-06-29 16:39 bcachefs status update (it's done cooking; let's get this sucker merged) Luke Kenneth Casson Leighton
  -- strict thread matches above, loose matches on Subject: below --
2019-06-10 19:14 Kent Overstreet
2019-06-10 20:46 ` Linus Torvalds
2019-06-11  1:17   ` Kent Overstreet
2019-06-11  4:33     ` Dave Chinner
2019-06-12 16:21       ` Kent Overstreet
2019-06-12 23:02         ` Dave Chinner
2019-06-19  8:21           ` Jan Kara
2019-06-11  4:55     ` Linus Torvalds
2019-06-11 14:26       ` Matthew Wilcox
2019-06-11  4:10   ` Dave Chinner
2019-06-11  4:39     ` Linus Torvalds
2019-06-11  7:10       ` Dave Chinner
2019-06-12  2:07         ` Linus Torvalds
2019-07-03  5:59 ` Stefan K
