Date: Mon, 12 Dec 2016 17:15:55 -0800 (PST)
From: Stefano Stabellini
To: Al Viro
cc: Stefano Stabellini, v9fs-developer@lists.sourceforge.net,
    ericvh@gmail.com, rminnich@sandia.gov, lucho@ionkov.net,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/5] 9p: introduce async read requests
In-Reply-To: <20161210015059.GD1555@ZenIV.linux.org.uk>
References: <1481230746-16741-1-git-send-email-sstabellini@kernel.org>
            <1481230746-16741-4-git-send-email-sstabellini@kernel.org>
            <20161210015059.GD1555@ZenIV.linux.org.uk>
User-Agent: Alpine 2.10 (DEB 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Hi Al,

thanks for looking at the patch.

On Sat, 10 Dec 2016, Al Viro wrote:
> On Thu, Dec 08, 2016 at 12:59:05PM -0800, Stefano Stabellini wrote:
> >
> > +	} else {
> > +		req = p9_client_get_req(clnt, P9_TREAD, "dqd", fid->fid, offset, rsize);
> > +		if (IS_ERR(req)) {
> > +			*err = PTR_ERR(req);
> > +			break;
> > +		}
> > +		req->rsize = iov_iter_get_pages_alloc(to, &req->pagevec,
> > +				(size_t)rsize, &req->offset);
> > +		req->kiocb = iocb;
> > +		for (i = 0; i < req->rsize; i += PAGE_SIZE)
> > +			page_cache_get_speculative(req->pagevec[i/PAGE_SIZE]);
> > +		req->callback = p9_client_read_complete;
> > +
> > +		*err = clnt->trans_mod->request(clnt, req);
> > +		if (*err < 0) {
> > +			clnt->status = Disconnected;
> > +			release_pages(req->pagevec,
> > +				      (req->rsize + PAGE_SIZE - 1) / PAGE_SIZE,
> > +				      true);
> > +			kvfree(req->pagevec);
> > +			p9_free_req(clnt, req);
> > +			break;
> > +		}
> > +
> > +		*err = -EIOCBQUEUED;
>
> IDGI.  AFAICS, your code will result in shitloads of short reads - every
> time when you give it a multi-iovec array, only the first one will be
> issued and the rest won't be even looked at.  Sure, it is technically
> legal, but I very much doubt that aio users will be happy with that.
>
> What am I missing here?

There is a problem with short reads, but I don't think it is the one you
described, unless I made a mistake somewhere. The code above issues one
request as big as possible (the size is limited by clnt->msize, which is
the size of the channel). No matter how many segments are in the
iov_iter, the request covers the maximum number of bytes allowed by the
channel, typically 4K but possibly larger. So multi-iovec arrays should
be handled correctly up to msize. Please let me know if I am wrong; I am
not an expert on this. I tried, but could not actually get any
multi-iovec arrays in my tests.

That said, reading the code again, there are indeed two possible causes
of short reads. The first is responses that carry a smaller byte count
than requested. Currently this case is not handled, and the short read
is simply forwarded up to the user. But I have written and successfully
tested a patch that issues another follow-up request from the completion
function.
Example:

  p9_client_read            issue request, size 4K
  p9_client_read_complete   receive response, count 2K
  p9_client_read_complete   issue request, size 4K - 2K = 2K
  p9_client_read_complete   receive response, count 2K
  p9_client_read_complete   call ki_complete

The second possible cause of short reads is requests that are bigger
than msize, for example a 2MB read over a 4K channel. In this patch we
simply issue one request as big as msize and eventually return a smaller
byte count to the caller. One option is to simply fall back to sync IO
in these cases. Another solution is to use the same technique described
above: we issue the first request as big as msize, then, from the
callback function, we issue a follow-up request, and again, and again,
until the large read is fully completed. This way, although we are not
issuing any simultaneous requests for a single large read, we can at
least issue other aio requests in parallel for other reads or writes. It
should still perform better than sync IO. I am thinking of implementing
this second option in the next version of the series.
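To make the callback-chaining idea concrete, below is a rough sketch of
a completion function that keeps re-issuing TREAD requests until the
originally requested byte count is satisfied. It is illustrative only:
p9_client_get_req(), req->callback and req->kiocb come from this patch
series rather than mainline, the callback signature is a guess, and the
a_fid/a_offset/a_count/a_total fields are hypothetical bookkeeping
invented for the example. Copying the payload into req->pagevec and
releasing the pages is omitted for brevity.

/*
 * Illustrative sketch, not the posted patch: chain TREAD requests from
 * the completion callback until the original iocb is satisfied, then
 * report the accumulated byte count via ki_complete().
 */
static void p9_client_read_complete(struct p9_client *clnt,
                                    struct p9_req_t *req, int status)
{
        struct p9_req_t *next;
        char *dataptr;
        int count = 0;

        if (status >= 0)
                /* parse the Rread byte count, as the sync read path does */
                status = p9pdu_readf(req->rc, clnt->proto_version, "D",
                                     &count, &dataptr);

        if (status < 0 || count == 0)
                goto complete;                  /* error or EOF */

        /* hypothetical bookkeeping fields, not part of the patch */
        req->a_total  += count;
        req->a_offset += count;
        req->a_count  -= count;
        if (req->a_count == 0)
                goto complete;                  /* read fully satisfied */

        /* short response: issue a follow-up TREAD for the leftover bytes */
        next = p9_client_get_req(clnt, P9_TREAD, "dqd", req->a_fid,
                                 req->a_offset, req->a_count);
        if (IS_ERR(next)) {
                status = PTR_ERR(next);
                goto complete;
        }

        /* carry the state over; handing over req->pagevec is omitted */
        next->kiocb    = req->kiocb;
        next->a_fid    = req->a_fid;
        next->a_offset = req->a_offset;
        next->a_count  = req->a_count;
        next->a_total  = req->a_total;
        next->callback = p9_client_read_complete;

        status = clnt->trans_mod->request(clnt, next);
        if (status < 0) {
                p9_free_req(clnt, next);
                goto complete;
        }

        p9_free_req(clnt, req);
        return;                                 /* wait for the next Rread */

complete:
        req->kiocb->ki_complete(req->kiocb,
                                req->a_total ? req->a_total : status, 0);
        p9_free_req(clnt, req);
}

The same loop would naturally cover reads larger than msize: the first
request is capped at msize and each completion issues the next chunk, so
a 2MB read over a 4K channel proceeds as a chain of msize-sized TREADs
while other aio requests can still be submitted in parallel.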