RE: [PATCH 03/11] fs: add new read_uptr and write_uptr file operations

From: David Laight <David.Laight@ACULAB.COM>
To: 'Linus Torvalds' <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>, Al Viro <viro@zeniv.linux.org.uk>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	"Iurii Zaikin" <yzaikin@google.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: RE: [PATCH 03/11] fs: add new read_uptr and write_uptr file operations
Date: Mon, 29 Jun 2020 08:21:22 +0000
Message-ID: <fcd951e164a3407295971e3a4236b418@AcuMS.aculab.com>
In-Reply-To: <CAHk-=wjxQczqZ96esvDrH5QZsLg6azXCGDgo+Bmm6r8t2ssasg@mail.gmail.com>

From: Linus Torvalds
> Sent: 27 June 2020 17:33
> On Sat, Jun 27, 2020 at 3:49 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > > Just keep the existing "set_fs()". It's not harmful if it's only used
> > > occasionally. We should rename it once it's rare enough, though.
> >
> > Am I right in thinking that it just sets a flag in 'current' ?
> 
> Basically, yes. That's what it has always done.

I could check, but I suspect it sets what TASK_SIZE uses to ~0u
so that access_ok() can't fail.

> Well "always" is not true - it used to set the %fs segment register
> originally (thus the name), but _conceptually_ it sets a flag for
> "should user accesses be kernel accesses instead".
> 
> On x86 - and most other architectures where user space and kernel
> space are in the same address space and accessed with the same
> instructions, that has then been implemented as just a "what is the
> limit for an access".
> 
> On other architectures - architectures that need different access
> methods (or different flags to the load/store instruction) - it's an
> actual flag that changes which access method you use.
> 
> > Although I don't remember access_ok() doing a suitable check
> > (would need to be (address - base) < limit).
> 
> So again, on the architectures with a unified address space,
> access_ok() is exactly that "address + access_size <= limit", although
> often done with some inline asm just to get the overflow case done
> efficiently.
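
Spelled out, that unified-address-space check is roughly the following.
This is only a userspace sketch: range_ok() and its 'limit' argument are
names made up for illustration, not the kernel's access_ok(), and the
real thing is per-architecture and often inline asm.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: 'limit' stands in for whatever value set_fs()
 * adjusts. Tests "addr + size <= limit" without letting the sum wrap.
 */
static bool range_ok(uintptr_t addr, uintptr_t size, uintptr_t limit)
{
	/* Rearranged so that no intermediate value can overflow. */
	return size <= limit && addr <= limit - size;
}
```

With the limit set to all-ones the only ranges that still fail are the
ones that would wrap past the top of the address space.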

I realised afterwards that the 'kernel address is actually user'
check isn't really done on architectures like x86 until stac/clac.

I had another thought.
While setting up a full-blown scatter-gather 'iter' structure for
functions like [gs]etsockopt, ioctl and fcntl is OTT and probably
measurably expensive, a lightweight 'buffer' structure that just
contained an address, a length and a user/kernel flag could be used.

Although the uses would need an extra level of indirection this
would be offset by reducing the number of parameters passed
through all the layers.
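
Something along these lines, perhaps. It is an untested userspace sketch
and every name in it (struct optbuf, optbuf_copy_in(),
fake_copy_from_user()) is made up for illustration; the stand-in copy
helper exists only so that it builds outside the kernel. It also treats
the length as a hard bound, which the SCTP case further down shows would
have to be relaxed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: one address, a length and a user/kernel flag. */
struct optbuf {
	void *addr;
	size_t len;
	bool is_kernel;
};

/*
 * Stand-in for copy_from_user() so the sketch compiles in userspace.
 * Returns the number of bytes NOT copied, like the real helper.
 */
static unsigned long fake_copy_from_user(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Copy n bytes in from the described buffer; 0 on success, -1 on error. */
static int optbuf_copy_in(void *dst, const struct optbuf *buf, size_t n)
{
	if (n > buf->len)
		return -1;	/* a real version would pick an errno */

	if (buf->is_kernel) {
		memcpy(dst, buf->addr, n);
		return 0;
	}
	return fake_copy_from_user(dst, buf->addr, n) ? -1 : 0;
}
```

The intermediate layers would then pass a single 'const struct optbuf *'
down instead of a separate pointer, length and "is this a kernel address"
indication.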

...
> I thought there was just one very specific case of "oh, in certain
> cases of setsockopt we don't know what size this address is and optlen
> is ignored", so we have to just pass the pointer down to the protocol,
> which is the point that knows how much of an address it wants..

I can't help feeling that userspace passes a suitable length but
the kernel doesn't verify it.

It is worse than that: one of the SCTP getsockopt() calls has to return
a length that is shorter than the buffer it wrote.

So any buffer descriptor length would have to be advisory.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)