linux-kernel.vger.kernel.org archive mirror
* question about page tables in DAX/FS/PMEM case
@ 2019-02-20 23:06 Larry Bassel
  2019-02-21 20:41 ` Jerome Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: Larry Bassel @ 2019-02-20 23:06 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel

I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.

If multiple processes would use the identical page of PMDs corresponding
to a 1 GiB address range of DAX/XFS/PMEM, presumably one can, instead of
populating a new PUD entry and page of PMDs, just atomically increment a
refcount and point the PUD entry at the same, already existing page of PMDs.

i.e.

OLD:
process 1:
VA -> levels of page tables -> PUD1 -> page of PMDs1
process 2:
VA -> levels of page tables -> PUD2 -> page of PMDs2

NEW:
process 1:
VA -> levels of page tables -> PUD1 -> page of PMDs1
process 2:
VA -> levels of page tables -> PUD1 -> page of PMDs1 (refcount 2)

There are several cases to consider:

1. New mapping
OLD:
make a new PUD, populate the associated page of PMDs
(at least partially) with PMD entries.
NEW:
same

2. A mapping by a process that is the same (same VA->PA, size, protections, etc.)
as one that already exists
OLD:
make a new PUD, populate the associated page of PMDs
(at least partially) with PMD entries.
NEW:
use the same PUD, increase the refcount (potentially even if this mapping is
private, in which case there may eventually be a copy-on-write -- see #5 below)

3. Unmapping of a mapping which is the same as that from another process
OLD:
destroy the process's copy of mapping, free PUD, etc.
NEW:
decrease refcount, only if now 0 do we destroy mapping, etc.

4. Unmapping of a mapping which is unique (refcount 1)
OLD:
destroy the process's copy of mapping, free PUD, etc.
NEW:
same

5. Mapping was private (but same as another process), process writes
OLD:
break the PMD into PTEs, destroy PMD mapping, free PUD, etc.
NEW:
decrease refcount, only if now 0 do we destroy mapping, etc.
we still break the PMD into PTEs.
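
Roughly, the refcount manipulation I have in mind for cases 2-4 would look
something like the sketch below (the function names are made up, and this
ignores the locking, TLB flushing and page table accounting that real code
would need):

    /*
     * Hypothetical sketch, case 2: share an already populated page of
     * PMDs covering a PUD-aligned 1 GiB range by taking an extra
     * reference on the page table page instead of allocating a new one.
     */
    static void dax_share_pmd_page(struct mm_struct *mm, pud_t *pud,
                                   struct page *pmd_page)
    {
            get_page(pmd_page);             /* refcount++ */
            pud_populate(mm, pud, (pmd_t *)page_address(pmd_page));
    }

    /*
     * Hypothetical sketch, cases 3 and 4: on munmap/exit, detach this
     * process's PUD entry and drop our reference; only the final put
     * frees the page of PMDs (case 4), otherwise other users keep it
     * alive (case 3).
     */
    static void dax_unshare_pmd_page(pud_t *pud, struct page *pmd_page)
    {
            pud_clear(pud);
            put_page(pmd_page);
    }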

If I have an mmap of a DAX/FS/PMEM file and I take
a page fault (either PTE- or PMD-sized) on access to this file,
the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).

If the process later munmaps this file or exits but there are still
other users of the shared page of PMDs, I would need to
detect that this has happened and act accordingly (#3 above).

Where will these page table entries be torn down?
In the same code where any other page table is torn down?
If this is the case, what would be the cleanest way of telling that these
page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
(look at the physical address pointed to?) so that
I could do the right thing here?

I understand that I may have missed something obvious here.

Thanks.

Larry


* Re: question about page tables in DAX/FS/PMEM case
  2019-02-20 23:06 question about page tables in DAX/FS/PMEM case Larry Bassel
@ 2019-02-21 20:41 ` Jerome Glisse
  2019-02-21 22:58   ` Larry Bassel
  0 siblings, 1 reply; 5+ messages in thread
From: Jerome Glisse @ 2019-02-21 20:41 UTC (permalink / raw)
  To: Larry Bassel; +Cc: linux-nvdimm, linux-kernel

On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> 
> If multiple processes would use the identical page of PMDs corresponding
> to a 1 GiB address range of DAX/XFS/PMEM, presumably one can, instead of
> populating a new PUD entry and page of PMDs, just atomically increment a
> refcount and point the PUD entry at the same, already existing page of PMDs.

I think page table sharing was discussed several times in the past and
the complexity involved versus the benefit was not clear. For 1GB
of virtual address you need:
    #pte pages = 1G/(512 * 2^12)       = 512 pte pages
    #pmd pages = 1G/(512 * 512 * 2^12) = 1   pmd page

So if we were to share the pmd directory page we would be saving a
total of 513 pages for every page table, or ~2MB. This goes up with
the number of processes that map the same range, i.e. if 10 processes map
the same range and share the same pmd then you are saving 9 * 2MB =
18MB of memory. This seems a relatively modest saving.

AFAIK there is no hardware benefit from sharing a page table
directory between different page tables. So the only benefit is the
amount of memory we save.

See below for comments on the complexity of achieving this.

> 
> i.e.
> 
> OLD:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD2 -> page of PMDs2
> 
> NEW:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD1 -> page of PMDs1 (refcount 2)
> 
> There are several cases to consider:
> 
> 1. New mapping
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> same
> 
> 2. A mapping by a process that is the same (same VA->PA, size, protections, etc.)
> as one that already exists
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> use the same PUD, increase the refcount (potentially even if this mapping is
> private, in which case there may eventually be a copy-on-write -- see #5 below)
> 
> 3. Unmapping of a mapping which is the same as that from another process
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
> 
> 4. Unmapping of a mapping which is unique (refcount 1)
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> same
> 
> 5. Mapping was private (but same as another process), process writes
> OLD:
> break the PMD into PTEs, destroy PMD mapping, free PUD, etc.
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
> we still break the PMD into PTEs.
> 
> If I have a mmap of a DAX/FS/PMEM file and I take
> a page (either pte or PMD sized) fault on access to this file,
> the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).

Not exactly; the page tables are allocated long before dax_iomap_fault()
gets called. They are allocated by handle_mm_fault() and its child
functions.
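
For reference, a condensed paraphrase of that path in mm/memory.c (from
memory, not a verbatim copy; the real __handle_mm_fault() has more cases,
checks and error handling than shown here):

    /* Condensed paraphrase of __handle_mm_fault(), mm/memory.c */
    static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                                        unsigned long address,
                                        unsigned int flags)
    {
            struct vm_fault vmf = {
                    .vma = vma,
                    .address = address & PAGE_MASK,
                    .flags = flags,
            };
            struct mm_struct *mm = vma->vm_mm;
            pgd_t *pgd = pgd_offset(mm, address);
            p4d_t *p4d = p4d_alloc(mm, pgd, address);   /* p4d level */

            if (!p4d)
                    return VM_FAULT_OOM;
            vmf.pud = pud_alloc(mm, p4d, address);      /* pud page */
            if (!vmf.pud)
                    return VM_FAULT_OOM;
            /* huge PUD faults are handled here via vm_ops->huge_fault() */
            vmf.pmd = pmd_alloc(mm, vmf.pud, address);  /* page of pmds */
            if (!vmf.pmd)
                    return VM_FAULT_OOM;
            /*
             * Huge PMD faults are handled here: for a DAX vma this calls
             * vm_ops->huge_fault(), which ends up in dax_iomap_fault().
             */
            return handle_pte_fault(&vmf);
    }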

> 
> If the process later munmaps this file or exits but there are still
> other users of the shared page of PMDs, I would need to
> detect that this has happened and act accordingly (#3 above)
> 
> Where will these page table entries be torn down?
> In the same code where any other page table is torn down?
> If this is the case, what would be the cleanest way of telling that these
> page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> (look at the physical address pointed to?) so that
> I could do the right thing here?
> 
> I understand that I may have missed something obvious here.
> 

There are many issues here; these are the ones I can think of:
    - finding a pmd/pud to share: you need to walk the reverse mapping
      of the range you are mapping to find whether any process or other
      virtual address already has a pud or pmd you can reuse. This can
      take more time than allocating page directory pages.
    - if one process munmaps some portion of a shared pud you need to
      break the sharing; this means that munmap (or mremap) would need
      to handle this page table directory sharing case first
    - many code paths in the kernel might need updates to understand this
      shared page table thing (mprotect, userfaultfd, ...)
    - the locking rules are bound to be painful
    - this might not work on all architectures, as some architectures
      associate information with the page table directory and that cannot
      always be shared (it would need to be enabled arch by arch)
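
To illustrate the munmap point above: the teardown path would first have
to notice that the page of pmds it is about to free is shared and, if so,
only drop a reference. A rough sketch (the helper and the refcounting
scheme are hypothetical, nothing like this exists today, and the real
thing would also need locking and TLB flushing):

    /*
     * Hypothetical: called from the munmap/exit teardown path.  Returns
     * true if the page of PMDs under this pud was shared and we merely
     * dropped our reference, in which case the caller must skip freeing
     * the lower levels.
     */
    static bool dax_pud_unshare(pud_t *pud)
    {
            struct page *pmd_page;

            if (!pud_present(*pud))
                    return false;
            pmd_page = pud_page(*pud);      /* page backing the pmd level */
            if (page_count(pmd_page) == 1)
                    return false;           /* not shared: normal teardown */
            pud_clear(pud);                 /* detach only this mm */
            put_page(pmd_page);             /* drop our shared reference */
            return true;
    }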

The nice thing:
    - unmapping for migration: when you unmap a shared pud/pmd you can
      decrement the mapcount by the shared pud/pmd count; this could
      speed up migration

This is what I could think of off the top of my head but there might be
other things. I believe the question is really one of benefit versus cost,
and to me at least the complexity cost outweighs the benefit for now.
Kirill Shutemov has proposed reworking how we do page tables and this
kind of rework might tip the balance the other way. So my suggestion would
be to look into how the page table management could be changed in a
beneficial way that could also achieve the page table sharing.

Cheers,
Jérôme


* Re: question about page tables in DAX/FS/PMEM case
  2019-02-21 20:41 ` Jerome Glisse
@ 2019-02-21 22:58   ` Larry Bassel
  2019-02-21 23:51     ` Dave Hansen
  2019-02-22  0:39     ` Jerome Glisse
  0 siblings, 2 replies; 5+ messages in thread
From: Larry Bassel @ 2019-02-21 22:58 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Larry Bassel, linux-nvdimm, linux-kernel, linux-mm

[adding linux-mm]

On 21 Feb 19 15:41, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> > I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> > 
> > If multiple processes would use the identical page of PMDs corresponding
> > to a 1 GiB address range of DAX/XFS/PMEM, presumably one can, instead of
> > populating a new PUD entry and page of PMDs, just atomically increment a
> > refcount and point the PUD entry at the same, already existing page of PMDs.

Thanks for your feedback. Some comments/clarification below.

> 
> I think page table sharing was discussed several times in the past and
> the complexity involved versus the benefit was not clear. For 1GB
> of virtual address you need:
>     #pte pages = 1G/(512 * 2^12)       = 512 pte pages
>     #pmd pages = 1G/(512 * 512 * 2^12) = 1   pmd page
> 
> So if we were to share the pmd directory page we would be saving a
> total of 513 pages for every page table, or ~2MB. This goes up with
> the number of processes that map the same range, i.e. if 10 processes map
> the same range and share the same pmd then you are saving 9 * 2MB =
> 18MB of memory. This seems a relatively modest saving.

The file blocksize = page size in what I am working on would
be 2 MiB (sharing puds/pages of pmds); I'm not trying to
support sharing pmds/pages of ptes. And yes, the savings in this
case are actually even less than in your example (but see my example below).

> 
> AFAIK there is no hardware benefit from sharing a page table
> directory between different page tables. So the only benefit is the
> amount of memory we save.

Yes, in our use case (a high-end Oracle database using DAX/XFS/PMEM/PMD)
the main benefit would be memory savings:

A future system might have 6 TiB of PMEM on it and
there might be 10000 processes each mapping all of this 6 TiB.
Here the savings would be approximately
(6 TiB / 2 MiB) * 8 bytes (size of one PMD entry) * 10000 = ~234 GiB
(and these page tables themselves would be in non-PMEM (ordinary RAM)).
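
Spelling out the arithmetic behind that figure:

    PMD entries per process    = 6 TiB / 2 MiB       = 3,145,728
    PMD table memory / process = 3,145,728 * 8 bytes = 24 MiB
    across 10000 processes     = 24 MiB * 10000      = ~234 GiB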

> 
> See below for comments on the complexity of achieving this.
> 
[trim]
> > 
> > If I have a mmap of a DAX/FS/PMEM file and I take
> > a page (either pte or PMD sized) fault on access to this file,
> > the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).
> 
> Not exactly; the page tables are allocated long before dax_iomap_fault()
> gets called. They are allocated by handle_mm_fault() and its child
> functions.

Yes, I misstated this; the fault is handled there, which may well
alter the PUD (in my case), but the original page tables are set up earlier.

> 
> > 
> > If the process later munmaps this file or exits but there are still
> > other users of the shared page of PMDs, I would need to
> > detect that this has happened and act accordingly (#3 above)
> > 
> > Where will these page table entries be torn down?
> > In the same code where any other page table is torn down?
> > If this is the case, what would be the cleanest way of telling that these
> > page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> > (look at the physical address pointed to?) so that
> > I could do the right thing here?
> > 
> > I understand that I may have missed something obvious here.
> > 
> 
> There are many issues here; these are the ones I can think of:
>     - finding a pmd/pud to share: you need to walk the reverse mapping
>       of the range you are mapping to find whether any process or other
>       virtual address already has a pud or pmd you can reuse. This can
>       take more time than allocating page directory pages.
>     - if one process munmaps some portion of a shared pud you need to
>       break the sharing; this means that munmap (or mremap) would need
>       to handle this page table directory sharing case first
>     - many code paths in the kernel might need updates to understand this
>       shared page table thing (mprotect, userfaultfd, ...)
>     - the locking rules are bound to be painful
>     - this might not work on all architectures, as some architectures
>       associate information with the page table directory and that cannot
>       always be shared (it would need to be enabled arch by arch)

Yes, some architectures don't support DAX at all (note again that
I'm not trying to share non-DAX page tables here).

> 
> The nice thing:
>     - unmapping for migration: when you unmap a shared pud/pmd you can
>       decrement the mapcount by the shared pud/pmd count; this could
>       speed up migration

A followup question: the kernel already does sharing of page tables for
hugetlbfs (also 2 MiB pages); why aren't the above issues relevant there
as well (or are they, but we support it anyhow)?
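
(For context, the hugetlbfs sharing I'm referring to is huge_pmd_share()
in mm/hugetlb.c. A condensed paraphrase of what it does, from memory --
the real function has additional shareability checks, locking and
accounting:)

    /* Condensed paraphrase of huge_pmd_share(), mm/hugetlb.c */
    pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
    {
            struct vm_area_struct *vma = find_vma(mm, addr);
            struct address_space *mapping = vma->vm_file->f_mapping;
            pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
            struct vm_area_struct *svma;
            unsigned long saddr;
            pte_t *spte = NULL;

            /* Walk the file's reverse mapping (i_mmap) for another VMA
             * that already has a page of PMDs for this PUD-sized range. */
            i_mmap_lock_write(mapping);
            vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
                    if (svma == vma)
                            continue;
                    saddr = svma->vm_start +
                            ((idx - svma->vm_pgoff) << PAGE_SHIFT);
                    spte = huge_pte_offset(svma->vm_mm, saddr,
                                           vma_mmu_pagesize(svma));
                    if (spte) {
                            /* Share it: extra ref on the pmd page. */
                            get_page(virt_to_page(spte));
                            break;
                    }
            }
            /* Point our own PUD entry at the shared page of PMDs. */
            if (spte && pud_none(*pud))
                    pud_populate(mm, pud,
                                 (pmd_t *)((unsigned long)spte & PAGE_MASK));
            i_mmap_unlock_write(mapping);

            return (pte_t *)pmd_alloc(mm, pud, addr);
    }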

> 
> This is what I could think of off the top of my head but there might be
> other things. I believe the question is really one of benefit versus cost,
> and to me at least the complexity cost outweighs the benefit for now.
> Kirill Shutemov has proposed reworking how we do page tables and this
> kind of rework might tip the balance the other way. So my suggestion would
> be to look into how the page table management could be changed in a
> beneficial way that could also achieve the page table sharing.
> 
> Cheers,
> Jérôme

Thanks.

Larry


* Re: question about page tables in DAX/FS/PMEM case
  2019-02-21 22:58   ` Larry Bassel
@ 2019-02-21 23:51     ` Dave Hansen
  2019-02-22  0:39     ` Jerome Glisse
  1 sibling, 0 replies; 5+ messages in thread
From: Dave Hansen @ 2019-02-21 23:51 UTC (permalink / raw)
  To: Larry Bassel, Jerome Glisse; +Cc: linux-nvdimm, linux-kernel, linux-mm

On 2/21/19 2:58 PM, Larry Bassel wrote:
> AFAIK there is no hardware benefit from sharing a page table
> directory between different page tables. So the only benefit is the
> amount of memory we save.

The hardware benefit from schemes like this is that the CPU caches are
better utilized.  If two processes share page tables, they don't share
TLB entries, but they *do* share the contents of the CPU's caches.  That
will make TLB misses faster.

It probably doesn't matter *that* much in practice because the page
walker doing TLB fills does a pretty good job of hiding all the latency,
but it might matter in extreme cases.


* Re: question about page tables in DAX/FS/PMEM case
  2019-02-21 22:58   ` Larry Bassel
  2019-02-21 23:51     ` Dave Hansen
@ 2019-02-22  0:39     ` Jerome Glisse
  1 sibling, 0 replies; 5+ messages in thread
From: Jerome Glisse @ 2019-02-22  0:39 UTC (permalink / raw)
  To: Larry Bassel; +Cc: linux-nvdimm, linux-kernel, linux-mm

On Thu, Feb 21, 2019 at 02:58:27PM -0800, Larry Bassel wrote:
> [adding linux-mm]
> 
> On 21 Feb 19 15:41, Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> > > I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> > > 
> > > If multiple processes would use the identical page of PMDs corresponding
> > > to a 1 GiB address range of DAX/XFS/PMEM, presumably one can, instead of
> > > populating a new PUD entry and page of PMDs, just atomically increment a
> > > refcount and point the PUD entry at the same, already existing page of PMDs.
> 
> Thanks for your feedback. Some comments/clarification below.
> 
> > 
> > I think page table sharing was discussed several times in the past and
> > the complexity involved versus the benefit was not clear. For 1GB
> > of virtual address you need:
> >     #pte pages = 1G/(512 * 2^12)       = 512 pte pages
> >     #pmd pages = 1G/(512 * 512 * 2^12) = 1   pmd page
> > 
> > So if we were to share the pmd directory page we would be saving a
> > total of 513 pages for every page table, or ~2MB. This goes up with
> > the number of processes that map the same range, i.e. if 10 processes map
> > the same range and share the same pmd then you are saving 9 * 2MB =
> > 18MB of memory. This seems a relatively modest saving.
> 
> The file blocksize = page size in what I am working on would
> be 2 MiB (sharing puds/pages of pmds); I'm not trying to
> support sharing pmds/pages of ptes. And yes, the savings in this
> case are actually even less than in your example (but see my example below).
> 
> > 
> > AFAIK there is no hardware benefit from sharing a page table
> > directory between different page tables. So the only benefit is the
> > amount of memory we save.
> 
> Yes, in our use case (a high-end Oracle database using DAX/XFS/PMEM/PMD)
> the main benefit would be memory savings:
> 
> A future system might have 6 TiB of PMEM on it and
> there might be 10000 processes each mapping all of this 6 TiB.
> Here the savings would be approximately
> (6 TiB / 2 MiB) * 8 bytes (size of one PMD entry) * 10000 = ~234 GiB
> (and these page tables themselves would be in non-PMEM (ordinary RAM)).

Damn, you have a lot of processes, that must mean many cores, I want one of those :)

[...]

> > > If the process later munmaps this file or exits but there are still
> > > other users of the shared page of PMDs, I would need to
> > > detect that this has happened and act accordingly (#3 above)
> > > 
> > > Where will these page table entries be torn down?
> > > In the same code where any other page table is torn down?
> > > If this is the case, what would be the cleanest way of telling that these
> > > page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> > > (look at the physical address pointed to?) so that
> > > I could do the right thing here?
> > > 
> > > I understand that I may have missed something obvious here.
> > > 
> > 
> > There are many issues here; these are the ones I can think of:
> >     - finding a pmd/pud to share: you need to walk the reverse mapping
> >       of the range you are mapping to find whether any process or other
> >       virtual address already has a pud or pmd you can reuse. This can
> >       take more time than allocating page directory pages.
> >     - if one process munmaps some portion of a shared pud you need to
> >       break the sharing; this means that munmap (or mremap) would need
> >       to handle this page table directory sharing case first
> >     - many code paths in the kernel might need updates to understand this
> >       shared page table thing (mprotect, userfaultfd, ...)
> >     - the locking rules are bound to be painful
> >     - this might not work on all architectures, as some architectures
> >       associate information with the page table directory and that cannot
> >       always be shared (it would need to be enabled arch by arch)
> 
> Yes, some architectures don't support DAX at all (note again that
> I'm not trying to share non-DAX page tables here).

DAX is irrelevant here; DAX is a property of the underlying filesystem
and for the most part the core mm is blissfully unaware of it. So all
of the above apply.

> > 
> > The nice thing:
> >     - unmapping for migration: when you unmap a shared pud/pmd you can
> >       decrement the mapcount by the shared pud/pmd count; this could
> >       speed up migration
> 
> A followup question: the kernel already does sharing of page tables for
> hugetlbfs (also 2 MiB pages); why aren't the above issues relevant there
> as well (or are they, but we support it anyhow)?

hugetlbfs is a thing of its own, like no other in the kernel, and I don't
think we want to repeat it. It has special cases all over the mm, so all
the cases that can go wrong are handled by the hugetlbfs code instead of
by core mm functions.

I would not follow it as an example; I don't think there is much love
for what hugetlbfs turned into.

Cheers,
Jérôme

