* [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
@ 2019-02-20 11:17 William Kucharski
  2019-02-20 12:10 ` Michal Hocko
  2019-02-20 13:44 ` Matthew Wilcox
  0 siblings, 2 replies; 12+ messages in thread
From: William Kucharski @ 2019-02-20 11:17 UTC (permalink / raw)
  To: lsf-pc, Linux-MM, linux-fsdevel

For the past year or so I have been working on further developing my original
prototype support of mapping read-only program text using large THP pages.

I developed a prototype described below which I continue to work on, but the
major issues I have yet to solve involve page cache integration and filesystem
support.

At present, the conventional methodology of reading a single base PAGE and
using readahead to fill in additional pages isn't useful, as the entire (in my
prototype) PMD page needs to be read in before the page can be mapped (and at
that point it is unclear whether readahead of additional PMD sized pages would
be of benefit or too costly).

Additionally, there are no good interfaces at present to tell filesystem layers
that content is desired in chunks larger than a hardcoded limit of 64K, or to
read disk blocks in chunks appropriate for PMD sized pages.

I very briefly discussed some of this work with Kirill in the past, and am
currently somewhat blocked on progress with my prototype due to issues with
multiorder page size support in the radix tree page cache. I don't feel it is
worth the time to debug those issues since the radix tree page cache is dead,
and it's much more useful to help Matthew Wilcox get multiorder page support
for XArray tested and approved upstream.

The following is a backgrounder on the work I have done to date and some
performance numbers.

Since it's just a prototype, I am unsure whether it would make a good topic for
a discussion talk per se, but should I be invited to attend, it could certainly
engender a good amount of discussion as a BOF/cross-discipline topic between
the MM and FS tracks.

Thanks,
    William Kucharski

========================================

One of the downsides of THP as currently implemented is that it only supports
large page mappings for anonymous pages.

I embarked upon this prototype on the theory that it would be advantageous to 
be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a 
PMD sized THP page upon first fault; if the allocation is successful the page 
will be populated (at present using a call to kernel_read()) and the page will 
be mapped at the PMD level. If memory allocation fails, the page fault routines 
will drop through to the conventional PAGESIZE-oriented routines for mapping 
the faulting page.
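
As a rough sketch of that logic (illustrative only, not the prototype code
itself; text_huge_insert_pmd() is just a placeholder for the PMD insertion
step, and error handling is minimal):

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/fs.h>
#include <linux/gfp.h>

static vm_fault_t text_huge_fault(struct vm_fault *vmf)
{
	struct file *filp = vmf->vma->vm_file;
	loff_t off = (loff_t)round_down(vmf->pgoff, HPAGE_PMD_NR) << PAGE_SHIFT;
	struct page *page;
	ssize_t nread;

	/* Try for a PMD sized page covering the aligned 2M range. */
	page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);
	if (!page)
		return VM_FAULT_FALLBACK;	/* fall back to the PAGESIZE fault path */

	/* Populate the entire range before it becomes visible. */
	nread = kernel_read(filp, page_address(page), HPAGE_PMD_SIZE, &off);
	if (nread != (ssize_t)HPAGE_PMD_SIZE) {
		__free_pages(page, HPAGE_PMD_ORDER);
		return VM_FAULT_FALLBACK;
	}

	/* Placeholder: install a read-only PMD mapping of the new page. */
	return text_huge_insert_pmd(vmf, page);
}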

Since this approach will map a PMD sized block of the memory map at a time, we
should see a slight uptick in time spent in disk I/O but a substantial drop in
page faults as well as a reduction in iTLB misses, as address ranges will be
mapped with the larger page. Analysis of a test program with a very large text
area (483,138,032 bytes) that thrashes the D$ and I$ shows this does occur,
along with a slight reduction in program execution time.

The text segment as seen from readelf:

LOAD          0x0000000000000000 0x0000000000400000 0x0000000000400000
              0x000000001ccc19f0 0x000000001ccc19f0 R E    0x200000

As currently implemented for test purposes, the prototype will only use large
pages to map an executable with a particular filename ("testr"), enabling easy
comparison of the same executable using 4K and 2M (x64) pages on the same
kernel. It is understood that this is just a proof-of-concept implementation and
that much more work on enabling the feature and on overall system usage of it
would need to be done before it could be submitted as a kernel patch. However, I
felt it would be worthwhile to send it out as an RFC to find out whether there
are strong objections from the community to doing this at all, or to gain a
better understanding of the major concerns that must be addressed before it
would even be considered.

I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the equivalent of "always"
and bypass some checks for anonymous pages by simply #ifdefing the code out;
obviously I would need to determine the right thing to do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" 
follow; the 4K pagesize program was named "foo" and the 2M pagesize program 
"testr" (as noted above) - please note that these numbers do vary from run to 
run, but the orders of magnitude of the differences between the two versions 
remain relatively constant:

4K Pages:
=========
Performance counter stats for './foo' (10 runs):

  307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
              0      context-switches:u        #    0.000 K/sec
              0      cpu-migrations:u          #    0.000 K/sec
          7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
1,401,295,823,265      cycles:u                #    4.564 GHz                      ( +-  0.19% )  (30.77%)
562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
 20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
      2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
 40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
    232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
     23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
<not supported>      L1-icache-loads:u
 74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
        707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
      5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
  1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
<not supported>      L1-dcache-prefetches:u
<not supported>      L1-dcache-prefetch-misses:u

307.093088771 seconds time elapsed                                          ( +-  0.20% )

2M Pages:
=========
Performance counter stats for './testr' (10 runs):

  289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
              0      context-switches:u        #    0.000 K/sec
              0      cpu-migrations:u          #    0.000 K/sec
            598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
1,323,835,488,984      cycles:u                #    4.573 GHz                      ( +-  0.19% )  (30.77%)
562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
 20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
      2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
 40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
    135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
      6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
<not supported>      L1-icache-loads:u
 74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
            835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
      6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
     51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
<not supported>      L1-dcache-prefetches:u
<not supported>      L1-dcache-prefetch-misses:u

289.551551387 seconds time elapsed                                          ( +-  0.20% )

A check of /proc/meminfo with the test program running shows the large mappings:

ShmemPmdMapped:   471040 kB

The obvious problem with this first swipe at things is that the large pages are
not placed into the page cache, so, for example, multiple concurrent executions
of the test program each allocate and map their own large pages.

A greater architectural issue is determining the best way to support large pages
in the page cache, which is something Matthew Wilcox's multiorder page support
in XArray should solve.

Some questions:

* What is the best approach to deal with large pages when PAGESIZE mappings exist?
At present, the prototype evicts PAGESIZE pages from the page cache, replacing
them with a mapping for the large page; future mappings of a PAGESIZE range
should then map using an offset into the PMD sized physical page used to map the
PMD sized virtual page (see the sketch following these questions).

* Do we need to create per-filesystem routines to handle large pages, or can
we delay that? (Ideally we would want to be able to read in the contents
of large pages without having to read_iter however many PAGESIZE pages
we need.)
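
The sub-page lookup mentioned in the first question amounts to something like
this (illustrative only; it assumes the PMD sized page is an ordinary
contiguous compound page, with @head being the page cache entry and @pgoff the
faulting file index):

static struct page *subpage_of_pmd(struct page *head, pgoff_t pgoff)
{
	/* Tail pages of a THP are physically contiguous with the head. */
	return head + (pgoff & (HPAGE_PMD_NR - 1));
}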

I am happy to take whatever approach is best to add large pages to the page
cache, but it seems useful and crucial that a way be provided for the system to
automatically use THP to map large text pages if so desired, read-only to begin
with but eventually read/write to accommodate applications that self-modify
code, such as databases and Java.

========================================


* Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 11:17 [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages William Kucharski
@ 2019-02-20 12:10 ` Michal Hocko
  2019-02-20 13:18   ` William Kucharski
  2019-02-20 13:44 ` Matthew Wilcox
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-02-20 12:10 UTC (permalink / raw)
  To: William Kucharski; +Cc: lsf-pc, Linux-MM, linux-fsdevel

On Wed 20-02-19 04:17:13, William Kucharski wrote:
> For the past year or so I have been working on further developing my original
> prototype support of mapping read-only program text using large THP pages.

Song Liu has already proposed a THP on FS topic [1]

[1] http://lkml.kernel.org/r/77A00946-D70D-469D-963D-4C4EA20AE4FA@fb.com
and I assume this is essentially leading to the same discussion, right?
So we can merge these requests.
-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 12:10 ` Michal Hocko
@ 2019-02-20 13:18   ` William Kucharski
  2019-02-20 13:27     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: William Kucharski @ 2019-02-20 13:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, Linux-MM, linux-fsdevel



> On Feb 20, 2019, at 5:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Wed 20-02-19 04:17:13, William Kucharski wrote:
>> For the past year or so I have been working on further developing my original
>> prototype support of mapping read-only program text using large THP pages.
> 
> Song Liu has already proposed a THP on FS topic [1]
> 
> [1] http://lkml.kernel.org/r/77A00946-D70D-469D-963D-4C4EA20AE4FA@fb.com
> and I assume this is essentially leading to the same discussion, right?
> So we can merge these requests.

Different approaches but the same basic issue, yes.



* Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 13:18   ` William Kucharski
@ 2019-02-20 13:27     ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2019-02-20 13:27 UTC (permalink / raw)
  To: William Kucharski; +Cc: lsf-pc, Linux-MM, linux-fsdevel

On Wed 20-02-19 06:18:47, William Kucharski wrote:
> 
> 
> > On Feb 20, 2019, at 5:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Wed 20-02-19 04:17:13, William Kucharski wrote:
> >> For the past year or so I have been working on further developing my original
> >> prototype support of mapping read-only program text using large THP pages.
> > 
> > Song Liu has already proposed a THP on FS topic [1]
> > 
> > [1] http://lkml.kernel.org/r/77A00946-D70D-469D-963D-4C4EA20AE4FA@fb.com
> > and I assume this is essentially leading to the same discussion, right?
> > So we can merge these requests.
> 
> Different approaches but the same basic issue, yes.

OK, I will mark it as a separate topic then.

-- 
Michal Hocko
SUSE Labs


* Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 11:17 [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages William Kucharski
  2019-02-20 12:10 ` Michal Hocko
@ 2019-02-20 13:44 ` Matthew Wilcox
  2019-02-20 14:07   ` William Kucharski
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-02-20 13:44 UTC (permalink / raw)
  To: William Kucharski; +Cc: lsf-pc, Linux-MM, linux-fsdevel

On Wed, Feb 20, 2019 at 04:17:13AM -0700, William Kucharski wrote:
> At present, the conventional methodology of reading a single base PAGE and
> using readahead to fill in additional pages isn't useful, as the entire (in my
> prototype) PMD page needs to be read in before the page can be mapped (and at
> that point it is unclear whether readahead of additional PMD sized pages would
> be of benefit or too costly).

I remember discussing readahead with Kirill in the past.  Here's my
understanding of how it works today and why it probably doesn't work any
more once we have THPs.  We mark some page partway to the current end of
the readahead window with the ReadAhead page flag.  Once we get to it,
we trigger more readahead and change the location of the page with the
ReadAhead flag.  Our current RA window is on the order of 256kB.
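
In code, that trigger is roughly the following (simplified from the
mm/filemap.c fault path; the exact form moves around between kernel versions,
and mapping/ra/file/offset come from the surrounding code):

	/* The page carrying the ReadAhead flag is the tripwire. */
	if (PageReadahead(page))
		page_cache_async_readahead(mapping, ra, file, page,
					   offset, ra->ra_pages);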

With THPs, we're mapping 2MB at a time.  We don't get a warning every
256kB that we're getting close to the end of our RA window.  We only get
to know every 2MB.  So either we can increase the RA window from 256kB
to 2MB, or we have to manage without RA at all.

Most systems these days have SSDs, so the whole concept of RA probably
needs to be rethought.  We should try to figure out if we care about
the performance of rotating rust, and other high-latency systems like
long-distance networking and USB sticks.  (I was going to say network
filesystems in general, but then I remembered clameter's example of
400Gbps networks being faster than DRAM, so local networks clearly aren't
a problem any more).

Maybe there's scope for a session on readahead in general, but I don't
know who'd bring data and argue for what actions based on it.

> Additionally, there are no good interfaces at present to tell filesystem layers
> that content is desired in chunks larger than a hardcoded limit of 64K, or to
> read disk blocks in chunks appropriate for PMD sized pages.

Right!  It's actually slightly worse than that.  The VFS allocates
pages on behalf of the filesystem and tells the filesystem to read them.
So there's no way to allow the filesystem to say "Actually, I'd rather
read in 32kB chunks because that's how big my compression blocks are".
See page_cache_read() and __do_page_cache_readahead().
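
For reference, today's single-page path boils down to roughly the following
(simplified from page_cache_read(), with gfp selection and error handling
elided):

	struct page *page = __page_cache_alloc(gfp);

	add_to_page_cache_lru(page, mapping, offset, gfp);
	mapping->a_ops->readpage(file, page);	/* one PAGE_SIZE page at a time */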

I've mentioned in the past my preferred interface for solving this is to
have a new address space operation called ->populate().  The VFS would
call this from both of the above functions, allowing a filesystem to
allocate, say, an order-3 page and place it in i_pages before starting
IO on it.  I haven't done any work towards this, though.  And my opinion
on it might change after having written some code.

That interface would need to have some hint from the VFS as to what
range of file offsets it's looking for, and which page is the critical
one.  Maybe that's as simple as passing in pgoff and order, where pgoff is
not necessarily aligned to 1<<order.  Or maybe we want to explicitly
pass in start, end, critical.
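
Purely to make the shape concrete, a hypothetical sketch (one possible
argument list, not something I'm committed to, and nothing like it exists
today):

	/*
	 * New entry in struct address_space_operations: allocate page(s) of
	 * whatever order suits the filesystem covering [start, end], add
	 * them to i_pages and start IO; @critical is the index the faulting
	 * thread is actually waiting on.
	 */
	int (*populate)(struct file *file, pgoff_t start, pgoff_t end,
			pgoff_t critical);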

I'm in favour of William attending LSFMM, for whatever my opinion
is worth.  Also Kirill, of course.


* Re: [LSF/MM TOPIC ][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 13:44 ` Matthew Wilcox
@ 2019-02-20 14:07   ` William Kucharski
  2019-02-20 14:43     ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: William Kucharski @ 2019-02-20 14:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, Linux-MM, linux-fsdevel



> On Feb 20, 2019, at 6:44 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> That interface would need to have some hint from the VFS as to what
> range of file offsets it's looking for, and which page is the critical
> one.  Maybe that's as simple as passing in pgoff and order, where pgoff is
> not necessarily aligned to 1<<order.  Or maybe we want to explicitly
> pass in start, end, critical.

The order is especially important, as I think it's vital that the FS can
tell the difference between a caller wanting 2M in PAGESIZE pages
(something that could be satisfied by taking multiple trips through the
existing readahead) and one needing to transfer ALL the content for a 2M page,
as the fault can't be satisfied until the operation is complete. It also
won't be long before reading 1G at a time to map PUD-sized pages becomes
more important, plus the need to support various sizes in between for
architectures like ARM that support them (see the non-standard size THP
discussion for more on that).

I'm also hoping the conference would have enough "mixer" time that MM folks
can have a nice discussion with the FS folks to get their input - or at the
very least these mail threads will get that ball rolling.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 14:07   ` William Kucharski
@ 2019-02-20 14:43     ` Matthew Wilcox
  2019-02-20 16:39       ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-02-20 14:43 UTC (permalink / raw)
  To: William Kucharski
  Cc: lsf-pc, Linux-MM, linux-fsdevel, linux-nvme, linux-block


[adding linux-nvme and linux-block for opinions on the critical-page-first
idea in the second and third paragraphs below]

On Wed, Feb 20, 2019 at 07:07:29AM -0700, William Kucharski wrote:
> > On Feb 20, 2019, at 6:44 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > That interface would need to have some hint from the VFS as to what
> > range of file offsets it's looking for, and which page is the critical
> > one.  Maybe that's as simple as passing in pgoff and order, where pgoff is
> > not necessarily aligned to 1<<order.  Or maybe we want to explicitly
> > pass in start, end, critical.
> 
> The order is especially important, as I think it's vital that the FS can
> tell the difference between a caller wanting 2M in PAGESIZE pages
> (something that could be satisfied by taking multiple trips through the
> existing readahead) and one needing to transfer ALL the content for a 2M page,
> as the fault can't be satisfied until the operation is complete.

There's an open question here (at least in my mind) whether it's worth
transferring the critical page first and creating a temporary PTE mapping
for just that one page, then filling in the other 511 pages around it
and replacing it with a PMD-sized mapping.  We've had similar discussions
around this with zeroing freshly-allocated PMD pages, but I'm not aware
of anyone showing any numbers.  The only reason this might be a win
is that we wouldn't have to flush remote CPUs when replacing the PTE
mapping with a PMD mapping because they would both map to the same page.

It might be a complete loss because IO systems are generally set up for
working well with large contiguous IOs rather than returning a page here,
12 pages there and then 499 pages there.  To a certain extent we fixed
that in NVMe; where SCSI required transferring bytes in order across the
wire, an NVMe device is provided with a list of pages and can transfer
bytes in whatever way makes most sense for it.  What NVMe doesn't have
is a way for the host to tell the controller "Here's a 2MB sized I/O;
bytes 40960 to 45056 are most important to me; please give me a completion
event once those bytes are valid and then another completion event once
the entire I/O is finished".

I have no idea if hardware designers would be interested in adding that
kind of complexity, but this is why we also have I/O people at the same
meeting, so we can get these kinds of whole-stack discussions going.

> It also
> won't be long before reading 1G at a time to map PUD-sized pages becomes
> more important, plus the need to support various sizes in-between for
> architectures like ARM that support them (see the non-standard size THP
> discussion for more on that.)

The critical-page-first notion becomes even more interesting at these
larger sizes.  If a memory system is capable of, say, 40GB/s, it can
only handle 40 1GB page faults per second, and each individual page
fault takes 25ms.  That's rotating rust latencies ;-)

> I'm also hoping the conference would have enough "mixer" time that MM folks
> can have a nice discussion with the FS folks to get their input - or at the
> very least these mail threads will get that ball rolling.

Yes, there are both joint sessions (sometimes plenary with all three
streams, sometimes two streams) and plenty of time allocated to
inter-session discussions.  There's usually substantial on-site meal
and coffee breaks during which many important unscheduled discussions
take place.



* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 14:43     ` Matthew Wilcox
@ 2019-02-20 16:39       ` Keith Busch
  2019-02-20 17:19         ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2019-02-20 16:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: William Kucharski, lsf-pc, Linux-MM, linux-fsdevel, linux-nvme,
	linux-block

On Wed, Feb 20, 2019 at 06:43:46AM -0800, Matthew Wilcox wrote:
> What NVMe doesn't have is a way for the host to tell the controller
> "Here's a 2MB sized I/O; bytes 40960 to 45056 are most important to
> me; please give me a completion event once those bytes are valid and
> then another completion event once the entire I/O is finished".
> 
> I have no idea if hardware designers would be interested in adding that
> kind of complexity, but this is why we also have I/O people at the same
> meeting, so we can get these kinds of whole-stack discussions going.

We have two unused PRP bits, so I guess there's room to define something
like a "me first" flag. I am skeptical we'd get committee approval for
that or partial completion events, though.

I think the host should just split the more important part of the transfer
into a separate command. The only hardware support we have to prioritize
that command ahead of others is with weighted priority queues, but we're
missing driver support for that at the moment.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 16:39       ` Keith Busch
@ 2019-02-20 17:19         ` Matthew Wilcox
  2019-04-08 11:36           ` William Kucharski
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-02-20 17:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: William Kucharski, lsf-pc, Linux-MM, linux-fsdevel, linux-nvme,
	linux-block

On Wed, Feb 20, 2019 at 09:39:22AM -0700, Keith Busch wrote:
> On Wed, Feb 20, 2019 at 06:43:46AM -0800, Matthew Wilcox wrote:
> > What NVMe doesn't have is a way for the host to tell the controller
> > "Here's a 2MB sized I/O; bytes 40960 to 45056 are most important to
> > me; please give me a completion event once those bytes are valid and
> > then another completion event once the entire I/O is finished".
> > 
> > I have no idea if hardware designers would be interested in adding that
> > kind of complexity, but this is why we also have I/O people at the same
> > meeting, so we can get these kinds of whole-stack discussions going.
> 
> We have two unused PRP bits, so I guess there's room to define something
> like a "me first" flag. I am skeptical we'd get committee approval for
> that or partial completion events, though.
> 
> I think the host should just split the more important part of the transfer
> into a separate command. The only hardware support we have to prioritize
> that command ahead of others is with weighted priority queues, but we're
> missing driver support for that at the moment.

Yes, on reflection, NVMe is probably an example where we'd want to send
three commands (one for the critical page, one for the part before and one
for the part after); it has low per-command overhead so it should be fine.
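
Roughly, the host-side split is just arithmetic (sketch only; fault_off is the
byte offset of the faulting address within the 2MB extent, ranges are
[start, end) relative to that extent, and each range would become its own
command):

#include <linux/kernel.h>
#include <linux/sizes.h>

static void split_critical_first(unsigned long fault_off,
				 unsigned long range[3][2])
{
	unsigned long crit = round_down(fault_off, SZ_4K);

	range[0][0] = crit;		range[0][1] = crit + SZ_4K;	/* critical, issued first */
	range[1][0] = 0;		range[1][1] = crit;		/* before */
	range[2][0] = crit + SZ_4K;	range[2][1] = SZ_2M;		/* after */
}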

Thinking about William's example of a 1GB page, with a x4 link running
at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
If we do end up wanting to support 1GB pages, I think we'll want that
low-priority queue support ... and to qualify drives which actually have
the ability to handle multiple commands in parallel.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-02-20 17:19         ` Matthew Wilcox
@ 2019-04-08 11:36           ` William Kucharski
  2019-04-28 20:08             ` Song Liu
  0 siblings, 1 reply; 12+ messages in thread
From: William Kucharski @ 2019-04-08 11:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Linux-MM, linux-fsdevel, linux-nvme, linux-block



> On Feb 20, 2019, at 10:19 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> Yes, on reflection, NVMe is probably an example where we'd want to send
> three commands (one for the critical page, one for the part before and one
> for the part after); it has low per-command overhead so it should be fine.
> 
> Thinking about William's example of a 1GB page, with a x4 link running
> at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
> If we do end up wanting to support 1GB pages, I think we'll want that
> low-priority queue support ... and to qualify drives which actually have
> the ability to handle multiple commands in parallel.

I just got my denial for LSF/MM, so I was hopeful someone who will
be attending can talk to the filesystem folks in an effort to determine what
the best approach may be going forward for filling a PMD sized page to satisfy
a page fault.

The two obvious solutions are to either read the full content of the PMD
sized page before the fault can be satisfied, or as Matthew suggested
perhaps satisfy the fault temporarily with a single PAGESIZE page and use a
readahead to populate the other 511 pages. The next page fault would then
be satisfied by replacing the PAGESIZE page already mapped with a mapping for
the full PMD page. 

The latter approach seems like it could be a performance win at the cost of some
complexity. However, with the advent of faster storage arrays and more SSDs, let
alone NVMe, just reading the full contents of a PMD sized page may ultimately be
the cleanest way to go as slow physical media becomes less of a concern in the
future.

Thanks in advance to anyone who wants to take this issue up.


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-04-08 11:36           ` William Kucharski
@ 2019-04-28 20:08             ` Song Liu
  2019-04-30 12:12               ` William Kucharski
  0 siblings, 1 reply; 12+ messages in thread
From: Song Liu @ 2019-04-28 20:08 UTC (permalink / raw)
  To: William Kucharski
  Cc: Matthew Wilcox, Keith Busch, Linux-MM, Linux-Fsdevel, linux-nvme,
	linux-block

Hi William,

On Mon, Apr 8, 2019 at 4:37 AM William Kucharski
<william.kucharski@oracle.com> wrote:
>
>
>
> > On Feb 20, 2019, at 10:19 AM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Yes, on reflection, NVMe is probably an example where we'd want to send
> > three commands (one for the critical page, one for the part before and one
> > for the part after); it has low per-command overhead so it should be fine.
> >
> > Thinking about William's example of a 1GB page, with a x4 link running
> > at 8Gbps, a 1GB transfer would take approximately a quarter of a second.
> > If we do end up wanting to support 1GB pages, I think we'll want that
> > low-priority queue support ... and to qualify drives which actually have
> > the ability to handle multiple commands in parallel.
>
> I just got my denial for LSF/MM, so I was hopeful someone who will
> be attending can talk to the filesystem folks in an effort to determine what
> the best approach may be going forward for filling a PMD sized page to satisfy
> a page fault.
>
> The two obvious solutions are to either read the full content of the PMD
> sized page before the fault can be satisfied, or as Matthew suggested
> perhaps satisfy the fault temporarily with a single PAGESIZE page and use a
> readahead to populate the other 511 pages. The next page fault would then
> be satisfied by replacing the PAGESIZE page already mapped with a mapping for
> the full PMD page.
>
> The latter approach seems like it could be a performance win at the cost of some
> complexity. However, with the advent of faster storage arrays and more SSDs, let
> alone NVMe, just reading the full contents of a PMD sized page may ultimately be
> the cleanest way to go as slow physical media becomes less of a concern in the
> future.
>
> Thanks in advance to anyone who wants to take this issue up.

We will bring this proposal up in THP discussions. Would you like to share more
thoughts on pros and cons of the two solutions? Or in other words, do you have
strong reasons to dislike either of them?

Thanks,
Song


* Re: Read-only Mapping of Program Text using Large THP Pages
  2019-04-28 20:08             ` Song Liu
@ 2019-04-30 12:12               ` William Kucharski
  0 siblings, 0 replies; 12+ messages in thread
From: William Kucharski @ 2019-04-30 12:12 UTC (permalink / raw)
  To: Song Liu
  Cc: Matthew Wilcox, Keith Busch, Linux-MM, Linux-Fsdevel, linux-nvme,
	linux-block



> On Apr 28, 2019, at 2:08 PM, Song Liu <liu.song.a23@gmail.com> wrote:
> 
> We will bring this proposal up in THP discussions. Would you like to share more
> thoughts on pros and cons of the two solutions? Or in other words, do you have
> strong reasons to dislike either of them?

I think it's a performance issue that needs to be hashed out.

The obvious thing to do is read the whole large page and then map
it, but depending on the architecture or I/O speed, mapping one
PAGESIZE page to satisfy the single fault while the large page is
being read in could potentially be faster. However, as with all
swags without actual data, who can say? You can also raise the
question of whether, with SSDs and NVMe storage, it makes sense
to worry anymore about how long it would take to read a 2M or even
1G page in from storage. I like the idea of simply reading the
entire large page purely for neatness reasons - recovering from an
error during readahead of a large page seems like it could become
rather complex.

One other issue is how this will interact with filesystems, and how
to tell a filesystem that I want a large page's worth of data.
Matthew mentioned that compound_order() can be used to detect the
page size, so that's one answer, but obviously no such code exists
as of yet and it would need to be propagated across all filesystems.
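
As a sketch of what using compound_order() might look like on the filesystem
side (an illustration of the concept only, not existing code):

#include <linux/mm.h>

/* Size the read from the page actually handed in, not from PAGE_SIZE. */
static inline size_t fs_read_size(struct page *page)
{
	return PAGE_SIZE << compound_order(page);
}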

I really hope the discussions at LSFMM are productive.

-- Bill

