Re: cifs.ko page management during reads

From: Shyam Prasad N <nspmangalore@gmail.com>
To: ronnie sahlberg <ronniesahlberg@gmail.com>
Cc: Pavel Shilovsky <piastryyy@gmail.com>,
	Steve French <smfrench@gmail.com>,
	CIFS <linux-cifs@vger.kernel.org>
Subject: Re: cifs.ko page management during reads
Date: Wed, 14 Jul 2021 12:20:56 +0530	[thread overview]
Message-ID: <CANT5p=oCJ5MRSVcNHV3-+Jg2TcGc3m785GGJBzScGThkS24eMw@mail.gmail.com> (raw)
In-Reply-To: <CAN05THQL3GTN0oG8rREQ3W9bxEoAV+NppbSgAuytdzPyH3cuhA@mail.gmail.com>

On Wed, Jul 14, 2021 at 5:45 AM ronnie sahlberg
<ronniesahlberg@gmail.com> wrote:
>
> On Wed, Jul 14, 2021 at 8:58 AM Pavel Shilovsky <piastryyy@gmail.com> wrote:
> >
> > пн, 12 июл. 2021 г. в 22:41,Steve French <smfrench@gmail.com>:
> > >
> > > And don't forget "esize" mount option to allow offload of decryption...
> > >
> > > On Tue, Jul 13, 2021, 00:37 Shyam Prasad N <nspmangalore@gmail.com> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I'm trying to understand our read path (big picture is to integrate
> > >> the new netfs helpers into cifs.ko), and I wanted to get some
> > >> understanding about why things are the way they are. @Pavel Shilovsky
> > >> I'm guessing that you've worked on most of this code, so hoping that
> > >> you'll be able to answer some of these:
> >
> > Thanks for taking a look at this.
> >
> > >>
> > >> 1. for cases where both encryption and signing are disabled, it looks
> > >> like we read from the socket page-by-page directly into pages and map
> > >> those to the inode address space
> >
> > Yes, we read directly into pre-allocated pages. For direct IO we read
> > into pages supplied by the user and for non-direct IO (including page
> > IO) we allocate intermediate pages.
> >
> > >>
> > >> 2. for cases where encryption is enabled, we read the encrypted data
> > >> into a set of pages first, then call crypto APIs, which decrypt the
> > >> data in-place to the same pages. We then copy out the decrypted data
> > >> from these pages into the pages in the address space that are already
> > >> mapped.
> >
> > Correct. Regardless of whether offloading is used or not there is an extra copy.
> >
> > >>
> > >> 3. similarly for the signing case, we read the data into a set of
> > >> pages first, then verify signature, then copy the pages to the mapped
> > >> pages in the inode address space.
> >
> > Yes.
> >
> > >>
> > >> for case 1, can't we read from the socket directly into the mapped
> > >> pages?
> >
> > Yes - assuming no signing or encryption, we should be able to read
> > directly into the mapped pages. But I doubt many people use SMB2
> > without signing or encryption although there may be some use cases
> > requiring performance over safety.
> >
> > >> for case 2 and 3, instead of allocating temporary pages to read
> > >> into, can't we read directly into the pages of the inode address
> > >> space? the only risk I see is that if decryption or sign verification
> > >> fails, it would overwrite the data that is already present in the page
> > >> cache. but I see that as acceptable, because we're reading from the
> > >> server, since the data we have in the cache is stale anyways.
> >
> > Theoretically we may try doing this with signing but we need to make
> > sure that no one can modify those pages locally which may lead to
> > signing errors. I don't think today we reconnect an SMB session on
> > those today but we probably should.
> >

How can someone modify those pages? Don't we keep the pages locked
when the read is in progress?

> > For encryption, we do not know where read data starts in an encrypted
> > SMB packet, so there is almost no point to read directly because we
> > would need to shift (copy) the data afterwards anyway.
>

Understood.

> Yes. But even if we knew where the data starts, it will not be aligned
> to a page anyway so we need to memcpy() it anyway.
>
> But I think it is possible, with a lot of effort. And probably not
> worth the effort.
>
> But, I have been toying with the idea of trying to do "zero-copy" SMB2
> READ in libsmb2.
> It, does read all the unencrypted READ payloads straight into the
> application buffer without a copy
> already, but just like cifs.ko it uses an intermediate buffer for
> encrypted packets.And thus an extra memcpy()
>
> The solution I was thinking of would be something like :
> 1, Read the first 4 + 64 bytes into a buffer.  This will be
> enough to hold, either
> 1a, unencrypted case:
>       inspect the SMB2 header to find the opcode, then read the
> response header  (16 bytes for smb2 read/write responses)
>       then read the payload straight into the application buffer.
>       I do this already and it works without too much pain.
>       Now we are finished for the unencrypted case.
> 2, encrypted case:
>       In this case we have the full transform header and also part of
> an encrypted smb2 header.
>       We know how big the transform header is so we can copy the
> partial smb2 header into a 64 byte
>       temp buffer and then read the remaining part of the smb2 header.
> 3,   Start decryption, but only run it for the first 64 bytes, i.e.
> the smb2 header.
> 3a, OPCODE != READ: Inspect the decrypted smb2 header, IF it is NOT a
> read response then just read all remaining data into
>       a temp buffer, continue the decryption and handle it the normal way.
> 4,   If the opcode is a READ, then read an additional 16 bytes into a
> temp buffer and continue the decryption of these 16 bytes.
> 5,   What follows now is the read response payload so read this
> decrypted data straight into the application buffer (or pages)
>       and again, continue the decryption and decrypt the data in-place.
>       Voila, zero-copy encrypted read handling.  Which would really
> help on a PS2 since memcpy() is pretty expensive on its very slow
> memory.
>
>
> I think this would work, and it would be very interesting to implement.
> Not sure if it is worth it to do in kernel code because there will be
> hairy special conditions to handle.
>

Very interesting.
Understood that this will be significant changes. But curious as to
what special conditions you're referring to?

>
> For kernel code I think it would be better/safer to have fast hw
> offload for memcpy()
>

Understood.

>
> regards
> ronnie sahlberg
>
>
> >
> > --
> > Best regards,
> > Pavel Shilovsky

-- 
Regards,
Shyam