From: Chuck Lever <chuck.lever@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "mgorman@suse.de" <mgorman@suse.de>,
	"brouer@redhat.com" <brouer@redhat.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: alloc_pages_bulk()
Date: Tue, 9 Feb 2021 22:55:35 +0000	[thread overview]
Message-ID: <D683D2BD-FF98-4DD2-A6D2-BA10BB132011@oracle.com> (raw)
In-Reply-To: <20210209220127.GB308988@casper.infradead.org>



> On Feb 9, 2021, at 5:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
> 
> I'm not familiar with the sunrpc architecture, but this feels like you're
> trying to optimise something that shouldn't exist.  Ideally a write
> would ask the page cache for the pages that correspond to the portion
> of the file which is being written to.  I appreciate that doesn't work
> well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
> should be possible, right?

(Note there is room for improvement for both transport types).

Since you asked ;-) there are four broad categories of NFSD I/O.

1.  Receive an ingress RPC message (typically a Call)

2.  Read from a file

3.  Write to a file

4.  Send an egress RPC message (typically a Reply)

A server RPC transaction is some combination of these, usually
1, 2, and 4; or 1, 3, and 4.

To do 1, the server allocates a set of order-0 pages to form a
receive buffer and a set of order-0 pages to form a send buffer.
We want to handle this with bulk allocation. The Call is then
received into the receive buffer pages.
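
As a rough sketch of that bulk allocation step (assuming an
interface along the lines being discussed here, one that fills the
NULL slots of a page array and returns the populated count;
alloc_pages_bulk_array() is an assumed name, not a settled API):

/*
 * Illustrative only: refill a buffer's page array in one call,
 * falling back to single-page allocation for any shortfall.
 */
static bool svc_refill_pages(struct page **pages, unsigned long want)
{
	unsigned long filled;

	/* assumed semantics: fills NULL slots from the front,
	 * returns the number of populated entries */
	filled = alloc_pages_bulk_array(GFP_KERNEL, want, pages);
	for (; filled < want; filled++) {
		pages[filled] = alloc_page(GFP_KERNEL);
		if (!pages[filled])
			return false;	/* caller retries or waits */
	}
	return true;
}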

The receive buffer pages typically stay with the nfsd thread for
its lifetime, but the send buffer pages do not. We want a buffer
page size that matches the page cache page size (see below), and
one small enough that allocation is very unlikely to fail. The
largest transactions (NFS READ and WRITE) use up to 1MB worth of
pages.
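
(Back of the envelope, assuming 4KB pages: a 1MB payload is
1MB / 4KB = 256 order-0 pages, plus a few extra pages for the RPC
header; cf. RPCSVC_MAXPAGES in include/linux/sunrpc/svc.h.)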

Category 2 can be done by copying the file's pages into the send
buffer pages, or it can be done via a splice. When a splice is
done, the send buffer pages allocated above are released, and the
file's pages take their place in the buffer.
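
As a rough sketch of that swap (the helper and its name are made
up for illustration; only get_page()/put_page() are real):

/* Illustrative only: replace a preallocated send buffer page
 * with the file's page cache page after a splice. */
static void svc_swap_in_file_page(struct page **slot,
				  struct page *file_page)
{
	put_page(*slot);	/* release the preallocated page */
	get_page(file_page);	/* hold the file's page for the send */
	*slot = file_page;
}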

Category 3 is currently done only by copying receive buffer pages
into file pages. Pages are neither allocated nor released by this
category of I/O. There are various reasons for this, but it's an
area that could stand some attention.

Sending (category 4) passes the send buffer pages to
kernel_sendpage(), which takes its own reference on each page.
When kernel_sendpage() returns, the server does a put_page() on
all of those pages, then goes back to category 1 to replace the
consumed send buffer pages. The network layer releases its
references when it is finished with the pages.
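
In sketch form (the loop shape and the helper name are
illustrative; kernel_sendpage() and put_page() are the real calls):

/* Illustrative only: category-4 send of a page array */
static int svc_send_pages(struct socket *sock, struct page **pages,
			  int nr)
{
	int i, ret;

	for (i = 0; i < nr; i++) {
		/* kernel_sendpage() takes its own page reference */
		ret = kernel_sendpage(sock, pages[i], 0, PAGE_SIZE,
				      i < nr - 1 ? MSG_MORE : 0);
		if (ret < 0)
			return ret;
		put_page(pages[i]);	/* drop the server's reference */
		pages[i] = NULL;	/* refilled later via category 1 */
	}
	return 0;
}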

There are two reasons I can see for this:

1. A network send isn't complete until the server gets an ACK from
the client, which can take a while. I'm not aware of a TCP-provided
mechanism that indicates when the ACK has arrived, so the server
can't tell when it is safe to re-use those pages. (RDMA has an
affirmative send completion event that we can use to reduce send
buffer churn.)

2. If a splice was done, the send buffer pages that are also file
pages can't be re-used for the next RPC send buffer because
overwriting their content would corrupt the file. Thus they must
also be released and replaced.


--
Chuck Lever




