From: Kent Overstreet <kent.overstreet@linux.dev>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 david@fromorbit.com, mcgrof@kernel.org, hch@lst.de,
	willy@infradead.org
Subject: Re: [PATCH 2/2] bcachefs: Buffered write path now can avoid the inode lock
Date: Thu, 29 Feb 2024 02:42:12 -0500
Message-ID: <z3zxghw5yok5qftgj7pygfrspwwiadcrg73cbvr3okwoti7tho@zwmw2naayz5c>
In-Reply-To: <CAHk-=whf9HsM6BP3L4EYONCjGawAV=X0aBDoUHXkND4fpqB2Ww@mail.gmail.com>

On Wed, Feb 28, 2024 at 11:20:44PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 22:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Non append, non extending buffered writes can now avoid taking the inode
> > lock.
> 
> I think this is buggy.
> 
> I think you still need to take the inode lock *shared* for the writes,
> because otherwise you can have somebody else that truncates the file
> and now you will do a write past the end of the size of the file. That
> will cause a lot of issues.
> 
> So it's not a "inode_lock or not" situation. I think it's a
> "inode_lock vs inode_locks_shared" situation.
> 
> Note that the reading side isn't all that critical - if a read races
> with a truncate, at worst it will read some zeroes because we used the
> old length and the page cache got cleared in the meantime.
> 
> But the writing side ends up having actual consistency issues on disk.
> You don't want to have a truncate that removes the pages past the end
> of the new size and clears the end of the new last page, and race with
> another write that used the old size and *thought* it was writing to
> the middle of the file, but is now actually accessing a folio that is
> past the end of the whole file and writing to it.
> 
> There may be some reason that I'm missing that would make this a
> non-issue, but I really think you want to get the inode lock at least
> shared for the duration of the write.

It's even mentioned in one of the comments - bcachefs's pagecache add
lock guards against that. The rules for that lock are

 - things that add to the pagecache take the add side of that lock
 - things that remove from the pagecache take the block side of that lock

I added that so that we wouldn't have pagecache inconsistency issues
with dio and mmap'd IO - without it anything that needs to shoot down
the pagecache while it's doing IO that bypasses the pagecache is buggy
(fpunch, fcollapse...).
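
To make those rules concrete, here is a minimal userspace sketch of a
two-state shared lock. It is not the bcachefs implementation: every name
in it (two_state_lock, tsl_lock, SIDE_ADD, SIDE_BLOCK, ...) is
hypothetical, and it uses a pthread mutex and condition variable purely
for illustration. Any number of holders may share one side, but the two
sides exclude each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the in-kernel bcachefs code. */
enum side { SIDE_ADD = 0, SIDE_BLOCK = 1 };

struct two_state_lock {
	pthread_mutex_t mu;
	pthread_cond_t  cv;
	long            holders[2];	/* current holders on each side */
};

void tsl_init(struct two_state_lock *l)
{
	pthread_mutex_init(&l->mu, NULL);
	pthread_cond_init(&l->cv, NULL);
	l->holders[SIDE_ADD] = 0;
	l->holders[SIDE_BLOCK] = 0;
}

/* Take side s only if nobody holds the opposite side; never sleeps. */
bool tsl_trylock(struct two_state_lock *l, enum side s)
{
	bool ok;

	pthread_mutex_lock(&l->mu);
	ok = l->holders[!s] == 0;
	if (ok)
		l->holders[s]++;
	pthread_mutex_unlock(&l->mu);
	return ok;
}

/* Take side s, sleeping until the opposite side has no holders. */
void tsl_lock(struct two_state_lock *l, enum side s)
{
	pthread_mutex_lock(&l->mu);
	while (l->holders[!s] != 0)
		pthread_cond_wait(&l->cv, &l->mu);
	l->holders[s]++;
	pthread_mutex_unlock(&l->mu);
}

/* Drop one hold on side s; the last holder wakes the waiters. */
void tsl_unlock(struct two_state_lock *l, enum side s)
{
	pthread_mutex_lock(&l->mu);
	assert(l->holders[s] > 0);
	if (--l->holders[s] == 0)
		pthread_cond_broadcast(&l->cv);
	pthread_mutex_unlock(&l->mu);
}
```

In this model a buffered write or page fault would hold SIDE_ADD while it
populates the pagecache, and dio, truncate or fpunch would hold
SIDE_BLOCK while they shoot it down. The sketch makes no attempt at
fairness, so a steady stream of holders on one side could starve the
other.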

> Also note that for similar reasons, you can't just look at "will I
> extend the file" and take the lock non-shared. No, in order to
> actually trust the size, you need to *hold* the lock, so the logic
> needs to be something like
> 
>  - take the lock exclusively if O_APPEND or if it *looks* like you
> might extend the file size.
> 
>  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> might have changed, so now you need to double-check that you're really
> not going to extend the size of the file, and if you are, you need to
> go back and take the inode lock exclusively after all.

That one - yes.

The pagecache add lock was also supposed to handle that, because anything
that changes i_size downward needs pagecache block, but I moved where we
take that lock for lock ordering reasons, and I really didn't need
to...

I'm undecided on that one. I dislike using the pagecache add lock to guard
i_size because that's really not what it's for, but I also hate hitting
the inode lock if we don't actually need it.

Kinda waiting for Al to drop in and mention the other super obscure
reason the inode lock actually is needed here...