* Re: Read-only Mapping of Program Text using Large THP Pages
       [not found]   ` <07B3B085-C844-4A13-96B1-3DB0F1AF26F5@oracle.com>
@ 2019-02-20 14:43     ` Matthew Wilcox
  2019-02-20 16:39       ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2019-02-20 14:43 UTC (permalink / raw)
  To: William Kucharski
  Cc: lsf-pc, Linux-MM, linux-fsdevel, linux-nvme, linux-block


[adding linux-nvme and linux-block for opinions on the critical-page-first
idea in the second and third paragraphs below]

On Wed, Feb 20, 2019 at 07:07:29AM -0700, William Kucharski wrote:
> > On Feb 20, 2019, at 6:44 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > That interface would need to have some hint from the VFS as to what
> > range of file offsets it's looking for, and which page is the critical
> > one.  Maybe that's as simple as passing in pgoff and order, where pgoff is
> > not necessarily aligned to 1<<order.  Or maybe we want to explicitly
> > pass in start, end, critical.
> 
> The order is especially important, as I think it's vital that the FS can
> tell the difference between a caller wanting 2M in PAGESIZE pages
> (something that could be satisfied by taking multiple trips through the
> existing readahead) and needing to transfer ALL the content for a 2M page,
> as the fault can't be satisfied until the operation is complete.
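
To make that concrete, the sort of hint being discussed might look
roughly like the following; the structure and names are invented here
purely for illustration, and nothing like it exists in the VFS today:

struct large_readahead {
	unsigned long	start;		/* first page offset of the range */
	unsigned long	end;		/* one past the last page offset */
	unsigned long	critical;	/* page offset the fault is waiting on */
	unsigned int	order;		/* 9 for a 2MB PMD range, 18 for 1GB */
};

The point is just that the filesystem sees both the whole range and
which single page has to be valid before the fault can complete.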

There's an open question here (at least in my mind) whether it's worth
transferring the critical page first and creating a temporary PTE mapping
for just that one page, then filling in the other 511 pages around it
and replacing it with a PMD-sized mapping.  We've had similar discussions
around this with zeroing freshly-allocated PMD pages, but I'm not aware
of anyone showing any numbers.  The only reason this might be a win
is that we wouldn't have to flush remote CPUs when replacing the PTE
mapping with a PMD mapping because they would both map to the same page.
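
To spell out the sequence being compared, here is a toy model of the
critical-page-first path; every function below is a stand-in for
whatever the real fault handler would do, none of it is kernel code:

#include <stdio.h>

#define PAGES_PER_PMD	512

/* Stand-ins for the real work; they only narrate the ordering. */
static void read_page(unsigned int i)	{ printf("read page %u\n", i); }
static void map_pte(unsigned int i)	{ printf("map 4K PTE for page %u; fault returns\n", i); }
static void map_pmd(void)		{ printf("replace the PTE with a 2M PMD over the same pages\n"); }

int main(void)
{
	unsigned int critical = 10;	/* page the CPU actually faulted on */
	unsigned int i;

	/* 1. Satisfy the fault with only the critical page. */
	read_page(critical);
	map_pte(critical);

	/* 2. Fill in the other 511 pages around it. */
	for (i = 0; i < PAGES_PER_PMD; i++)
		if (i != critical)
			read_page(i);

	/*
	 * 3. Upgrade to a PMD mapping.  Both mappings cover the same
	 *    physical pages, which is why the remote TLB flush might
	 *    be avoidable.
	 */
	map_pmd();
	return 0;
}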

It might be a complete loss because IO systems are generally set up for
working well with large contiguous IOs rather than returning a page here,
12 pages there and then 499 pages there.  To a certain extent we fixed
that in NVMe; where SCSI required transferring bytes in order across the
wire, an NVMe device is provided with a list of pages and can transfer
bytes in whatever way makes most sense for it.  What NVMe doesn't have
is a way for the host to tell the controller "Here's a 2MB sized I/O;
bytes 40960 to 45056 are most important to me; please give me a completion
event once those bytes are valid and then another completion event once
the entire I/O is finished".

I have no idea if hardware designers would be interested in adding that
kind of complexity, but this is why we also have I/O people at the same
meeting, so we can get these kinds of whole-stack discussions going.

> It also
> won't be long before reading 1G at a time to map PUD-sized pages becomes
> more important, plus the need to support various sizes in-between for
> architectures like ARM that support them (see the non-standard size THP
> discussion for more on that.)

The critical-page-first notion becomes even more interesting at these
larger sizes.  If a memory system is capable of, say, 40GB/s, it can
only handle 40 1GB page faults per second, and each individual page
fault takes 25ms.  That's rotating rust latencies ;-)

> I'm also hoping the conference would have enough "mixer" time that MM folks
> can have a nice discussion with the FS folks to get their input - or at the
> very least these mail threads will get that ball rolling.

Yes, there are both joint sessions (sometimes plenary with all three
streams, sometimes two streams) and plenty of time allocated to
inter-session discussions.  There's usually substantial on-site meal
and coffee breaks during which many important unscheduled discussions
take place.



* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 14:43     ` Read-only Mapping of Program Text using Large THP Pages Matthew Wilcox
@ 2019-02-20 16:39       ` Keith Busch
  2019-02-20 17:19         ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2019-02-20 16:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, lsf-pc, Linux-MM, linux-fsdevel, linux-nvme,
	linux-block

On Wed, Feb 20, 2019 at 06:43:46AM -0800, Matthew Wilcox wrote:
> What NVMe doesn't have is a way for the host to tell the controller
> "Here's a 2MB sized I/O; bytes 40960 to 45056 are most important to
> me; please give me a completion event once those bytes are valid and
> then another completion event once the entire I/O is finished".
> 
> I have no idea if hardware designers would be interested in adding that
> kind of complexity, but this is why we also have I/O people at the same
> meeting, so we can get these kinds of whole-stack discussions going.

We have two unused PRP bits, so I guess there's room to define something
like a "me first" flag. I am skeptical we'd get committee approval for
that or partial completion events, though.

I think the host should just split the more important part of the transfer
into a separate command. The only hardware support we have to prioritize
that command ahead of others is with weighted priority queues, but we're
missing driver support for that at the moment.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 16:39       ` Keith Busch
@ 2019-02-20 17:19         ` Matthew Wilcox
  2019-04-08 11:36           ` William Kucharski
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2019-02-20 17:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: William Kucharski, lsf-pc, Linux-MM, linux-fsdevel, linux-nvme,
	linux-block

On Wed, Feb 20, 2019 at 09:39:22AM -0700, Keith Busch wrote:
> On Wed, Feb 20, 2019 at 06:43:46AM -0800, Matthew Wilcox wrote:
> > What NVMe doesn't have is a way for the host to tell the controller
> > "Here's a 2MB sized I/O; bytes 40960 to 45056 are most important to
> > me; please give me a completion event once those bytes are valid and
> > then another completion event once the entire I/O is finished".
> > 
> > I have no idea if hardware designers would be interested in adding that
> > kind of complexity, but this is why we also have I/O people at the same
> > meeting, so we can get these kinds of whole-stack discussions going.
> 
> We have two unused PRP bits, so I guess there's room to define something
> like a "me first" flag. I am skeptical we'd get committee approval for
> that or partial completion events, though.
> 
> I think the host should just split the more important part of the transfer
> into a separate command. The only hardware support we have to prioritize
> that command ahead of others is with weighted priority queues, but we're
> missing driver support for that at the moment.

Yes, on reflection, NVMe is probably an example where we'd want to send
three commands (one for the critical page, one for the part before and one
for the part after); it has low per-command overhead so it should be fine.
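
For what it's worth, the split itself is just bookkeeping; using the
byte range from the earlier example (a sketch of the arithmetic only,
not driver code):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PMD_SIZE	(512 * PAGE_SIZE)	/* 2MB */

int main(void)
{
	/* Offsets are relative to the start of the 2MB region. */
	unsigned long crit = 40960;	/* first byte the fault is waiting on */
	unsigned long crit_start = crit & ~(PAGE_SIZE - 1);
	unsigned long crit_end = crit_start + PAGE_SIZE;

	/* Issue the critical page as its own command, then the rest. */
	printf("cmd 1 (critical): bytes %lu-%lu\n", crit_start, crit_end);
	printf("cmd 2 (before):   bytes 0-%lu\n", crit_start);
	printf("cmd 3 (after):    bytes %lu-%lu\n", crit_end, PMD_SIZE);

	/* A real implementation would skip zero-length pieces. */
	return 0;
}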

Thinking about William's example of a 1GB page, with a x4 link running
at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
If we do end up wanting to support 1GB pages, I think we'll want that
low-priority queue support ... and to qualify drives which actually have
the ability to handle multiple commands in parallel.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 17:19         ` Matthew Wilcox
@ 2019-04-08 11:36           ` William Kucharski
  2019-04-28 20:08             ` Song Liu
  0 siblings, 1 reply; 6+ messages in thread
From: William Kucharski @ 2019-04-08 11:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Linux-MM, linux-fsdevel, linux-nvme, linux-block



> On Feb 20, 2019, at 10:19 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> Yes, on reflection, NVMe is probably an example where we'd want to send
> three commands (one for the critical page, one for the part before and one
> for the part after); it has low per-command overhead so it should be fine.
> 
> Thinking about William's example of a 1GB page, with a x4 link running
> at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
> If we do end up wanting to support 1GB pages, I think we'll want that
> low-priority queue support ... and to qualify drives which actually have
> the ability to handle multiple commands in parallel.

I just got my denial for LSF/MM, so I'm hoping someone who will
be attending can talk to the filesystem folks to determine the best
approach going forward for filling a PMD-sized page to satisfy a page
fault.

The two obvious solutions are to either read the full content of the
PMD-sized page before the fault can be satisfied, or, as Matthew
suggested, satisfy the fault temporarily with a single PAGESIZE page
and use readahead to populate the other 511 pages. The next page fault
would then be satisfied by replacing the PAGESIZE page already mapped
with a mapping for the full PMD page.

The latter approach seems like it could be a performance win at the
cost of some complexity. However, with the advent of faster storage
arrays and more SSDs, let alone NVMe, just reading the full contents
of a PMD-sized page may ultimately be the cleanest way to go as slow
physical media becomes less of a concern in the future.

Thanks in advance to anyone who wants to take this issue up.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-04-08 11:36           ` William Kucharski
@ 2019-04-28 20:08             ` Song Liu
  2019-04-30 12:12               ` William Kucharski
  0 siblings, 1 reply; 6+ messages in thread
From: Song Liu @ 2019-04-28 20:08 UTC (permalink / raw)
  To: William Kucharski
  Cc: Matthew Wilcox, Keith Busch, Linux-MM, Linux-Fsdevel, linux-nvme,
	linux-block

Hi William,

On Mon, Apr 8, 2019 at 4:37 AM William Kucharski
<william.kucharski@oracle.com> wrote:
>
>
>
> > On Feb 20, 2019, at 10:19 AM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Yes, on reflection, NVMe is probably an example where we'd want to send
> > three commands (one for the critical page, one for the part before and one
> > for the part after); it has low per-command overhead so it should be fine.
> >
> > Thinking about William's example of a 1GB page, with a x4 link running
> > at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
> > If we do end up wanting to support 1GB pages, I think we'll want that
> > low-priority queue support ... and to qualify drives which actually have
> > the ability to handle multiple commands in parallel.
>
> I just got my denial for LSF/MM, so I'm hoping someone who will
> be attending can talk to the filesystem folks to determine the best
> approach going forward for filling a PMD-sized page to satisfy a page
> fault.
>
> The two obvious solutions are to either read the full content of the
> PMD-sized page before the fault can be satisfied, or, as Matthew
> suggested, satisfy the fault temporarily with a single PAGESIZE page
> and use readahead to populate the other 511 pages. The next page fault
> would then be satisfied by replacing the PAGESIZE page already mapped
> with a mapping for the full PMD page.
>
> The latter approach seems like it could be a performance win at the
> cost of some complexity. However, with the advent of faster storage
> arrays and more SSDs, let alone NVMe, just reading the full contents
> of a PMD-sized page may ultimately be the cleanest way to go as slow
> physical media becomes less of a concern in the future.
>
> Thanks in advance to anyone who wants to take this issue up.

We will bring this proposal up in THP discussions. Would you like to share more
thoughts on pros and cons of the two solutions? Or in other words, do you have
strong reasons to dislike either of them?

Thanks,
Song


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-04-28 20:08             ` Song Liu
@ 2019-04-30 12:12               ` William Kucharski
  0 siblings, 0 replies; 6+ messages in thread
From: William Kucharski @ 2019-04-30 12:12 UTC (permalink / raw)
  To: Song Liu
  Cc: Matthew Wilcox, Keith Busch, Linux-MM, Linux-Fsdevel, linux-nvme,
	linux-block



> On Apr 28, 2019, at 2:08 PM, Song Liu <liu.song.a23@gmail.com> wrote:
> 
> We will bring this proposal up in THP discussions. Would you like to share more
> thoughts on pros and cons of the two solutions? Or in other words, do you have
> strong reasons to dislike either of them?

I think it's a performance issue that needs to be hashed out.

The obvious thing to do is read the whole large page and then map
it, but depending on the architecture or I/O speed, mapping one
PAGESIZE page to satisfy the single fault while the large page is
being read in could potentially be faster. However, as with any
guess made without actual data, who can say. You can also raise
the question of whether, with SSDs and NVMe storage, it still makes
sense to worry about how long it would take to read a 2M or even
1G page in from storage. I like the idea of simply reading the
entire large page purely for neatness reasons - recovering from an
error during readahead of a large page seems like it could become
rather complex.

One other issue is how this will interact with filesystems and how
to tell a filesystem I want a large page's worth of data. Matthew
mentioned that compound_order() can be used to detect the page size,
so that's one answer, but obviously no such code exists as of yet
and it would need to be propagated across all filesystems.
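
As a sketch of that idea only (the helper is hypothetical, though
compound_order() itself is real), a filesystem read path handed a
compound page could size the I/O from the page itself rather than
assuming PAGE_SIZE:

#include <linux/mm.h>

/* Hypothetical helper: bytes to read for the page we were handed. */
static inline size_t fs_read_len(struct page *page)
{
	/* compound_order() is 0 for a base page, 9 for 2MB, 18 for 1GB. */
	return PAGE_SIZE << compound_order(page);
}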

I really hope the discussions at LSFMM are productive.

-- Bill

