Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>,
	LSM List <linux-security-module@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Mimi Zohar <zohar@linux.vnet.ibm.com>,
	Christoph Hellwig <hch@infradead.org>,
	"Theodore Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-integrity@vger.kernel.org
Subject: Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively
Date: Sun, 1 Oct 2017 15:20:04 -0700	[thread overview]
Message-ID: <CA+55aFwD=+5trJbmuc2SL-DYcdmt2p4gq1uPBp8mznj1JYSVTg@mail.gmail.com> (raw)
In-Reply-To: <87shf2jzfr.fsf@xmission.com>

On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Unless I misread something it was being pointed out there are some vfs
> operations today on which ima writes an ima xattr as a side effect.  And
> those operations hold the i_sem.  So perhaps I am misunderstanding
> things or writing the ima xattr needs to happen at some point.  Which
> implies something like queued work.

So the issue is indeed the inode semaphore, as it is used by IMA. But
all these IMA patches to work around the issue are just horribly ugly.
One adds a VFS-layer filesystem method that most filesystems end up
not really needing (it's the same as the regular read), and other
filesystems end up then having hacks with ("oh, I don't need to take
this lock because it was already taken by the caller").

The second patch attempt avoided the need for a new filesystem method,
but added a flag in an annoying place (for the same basic logic). The
advantage is that now most filesystems don't actually need to care any
more (and the filesystems that used to care now check that flag).

There was discussion about moving the flag to a mode convenient spot,
which would have made it a lot less intrusive.

But the basic issue is that almost always when you see lock
inversions, the problem can just be fixed by doing the locking
differently instead.

And that's what I was/am pushing for.

There really are two totally different issues:

 - the integrity _measurement_.

   This one wants to be serialized, so that you don't have multiple
concurrent measurements, and the serialization fundamentally has to be
around all the IO, so this lock pretty much has to be outside the
i_sem.

 - the integrity invalidation on certain operations.

   This one fundamentally had to be inside the i_sem, since some of
the operations that cause this end up already holding the i_sem at a
VFS layer.

so you had these two different requirements (inside _and_ outside),
and the IMA approach was basically to avoid the problem by making
i_sem *the* lock, and then making the IO routines aware of it already
being held. That does solve the inside/outside issue.

But the simpler way to fix it is to simply use two locks that nest
inside each other, with i_sem nesting in the middle.  That just avoids
the problem entirely, and doesn't require anybody to ever care about
i_sem semantic changes, because i_sem semantics simply didn't change
at all.

So that's the approach I'm pushing. I admittedly haven't actually
looked at the IMA details, but from a high-level standpoint you can
basically describe it (as above) without having to care too much about
exactly what IMA even wants.

The two-lock approach does require that the operations that invalidate
the integrity measurements always only invalidate it, and don't try to
re-compute it. But I suspect that would be entirely insane anyway
(imagine a world where "setxattr" would have to read the whole file
contents in order to revalidate the integrity measurement - even if
there is nobody who even *cares*).

           Linus