linux-mm.kvack.org archive mirror
* How to use huge pages in drivers?
@ 2019-09-03 18:26 Dominique Martinet
  2019-09-03 18:42 ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dominique Martinet @ 2019-09-03 18:26 UTC (permalink / raw)
  To: linux-mm

Hi; not quite sure where to ask, so I'll start here...


Some context first. I'm inquiring in the context of mckernel[1], a
lightweight kernel that works next to linux (basically it offlines a
few/most cores, reserves some memory and boots a second OS on that to
run HPC applications).
Being brutally honest here, this is mostly research and anyone here
looking into it will probably scream, but I might as well try not to add
too many more reasons to do so....

One of the mechanisms here is that sometimes we want to access the
mckernel memory from linux (either from the process that spawned the
mckernel side process or from a driver in linux), and to do that we have
mapped the mckernel side virtual memory range into that process so it
can page fault.
The (horrible) function doing that can be found here[2]: rus_vm_fault
sends a message to the other side to identify the physical address
corresponding to what we had reserved earlier and maps it quite
manually.

We can tell at this point whether it was a huge page (very likely) or
not; I'm observing a huge difference in performance with some
interconnects if I add a huge kludge emulating huge pages here (directly
manipulating the process' page table), so I'd very much like to use huge
pages when we know a huge page has been mapped on the other side.



What I'd like to know is:
 - we know (assuming the other side isn't too buggy, but if it is we're
fucked anyway) exactly what huge-page-sized physical memory range has
been mapped on the other side; is there a way to manually gather the
corresponding pages and merge them into a huge page?

 - from what I understand that does not seem possible/recommended, the
way to go being to have a userland process get huge pages and pass these
to a device (ioctl or something); but I assume that means said process
needs to keep running for as long as that memory is required?
If the page fault needs to split the page (because the other side handed
us a "small" page so we can only map a regular page here), can it be
merged back into a huge page the next time this physical region is used?


[1] https://github.com/RIKEN-SysSoft/mckernel
[2] https://github.com/RIKEN-SysSoft/mckernel/blob/development/executer/kernel/mcctrl/syscall.c#L538

Any input will be appreciated,
-- 
Dominique Martinet


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: How to use huge pages in drivers?
  2019-09-03 18:26 How to use huge pages in drivers? Dominique Martinet
@ 2019-09-03 18:42 ` Matthew Wilcox
  2019-09-03 21:28   ` Dominique Martinet
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2019-09-03 18:42 UTC (permalink / raw)
  To: Dominique Martinet; +Cc: linux-mm

On Tue, Sep 03, 2019 at 08:26:27PM +0200, Dominique Martinet wrote:
> Some context first. I'm inquiring in the context of mckernel[1], a
> lightweight kernel that works next to linux (basically it offlines a
> few/most cores, reserves some memory and boots a second OS on that to
> run HPC applications).
> Being brutally honest here, this is mostly research and anyone here
> looking into it will probably scream, but I might as well try not to add
> too many more reasons to do so....
> 
> One of the mechanisms here is that sometimes we want to access the
> mckernel memory from linux (either from the process that spawned the
> mckernel side process or from a driver in linux), and to do that we have
> mapped the mckernel side virtual memory range to that process so it can
> page fault.
> The (horrible) function doing that can be found here[2], rus_vm_fault -
> sends a message to the other side to identify the physical address
> corresponding from what we had reserved earlier and map it quite
> manually.
> 
> We could know at this point if it had been a huge page (very likely) or
> not; I'm observing a huge difference of performance with some
> interconnect if I add a huge kludge emulating huge pages here (directly
> manipulating the process' page table) so I'd very much like to use huge
> pages when we know a huge page has been mapped on the other side.
> 
> 
> 
> What I'd like to know is:
>  - we know (assuming the other side isn't too buggy, but if it is we're
> fucked anyway) exactly what huge-page-sized physical memory range has
> been mapped on the other side; is there a way to manually gather the
> corresponding pages and merge them into a huge page?

You're using the word 'page' here, but I suspect what you really mean is
"pfn" or "pte".  As you've described it, it doesn't matter what data structure
Linux is using for the memory, since Linux doesn't know about the memory.

We have vmf_insert_pfn_pmd() which is designed to be called from your
->huge_fault handler.  See dev_dax_huge_fault() -> __dev_dax_pmd_fault()
for an example.  It's a fairly new mechanism, so I don't think it's
popular with device drivers yet.

All you really need is the physical address of the memory to make this work.
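
To make that concrete, here is a hedged sketch of such a ->huge_fault
handler; all my_* names are hypothetical (my_lookup_pfn() stands in for
however the driver resolves the faulting offset to a physical frame),
loosely modeled on __dev_dax_pmd_fault():

```c
/*
 * Hypothetical sketch of a ->huge_fault handler built around
 * vmf_insert_pfn_pmd().  my_lookup_pfn() and my_fault() are made-up
 * names standing in for the driver's own pgoff-to-pfn resolution and
 * its regular PTE-sized fault path.
 */
static vm_fault_t my_huge_fault(struct vm_fault *vmf,
				enum page_entry_size pe_size)
{
	unsigned long pmd_addr = vmf->address & PMD_MASK;
	pfn_t pfn;

	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;

	/* The aligned PMD range must lie entirely inside the VMA */
	if (pmd_addr < vmf->vma->vm_start ||
	    pmd_addr + PMD_SIZE > vmf->vma->vm_end)
		return VM_FAULT_FALLBACK;

	pfn = my_lookup_pfn(vmf->vma, vmf->pgoff);	/* hypothetical */
	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}

static const struct vm_operations_struct my_vm_ops = {
	.fault		= my_fault,		/* PTE-sized fallback path */
	.huge_fault	= my_huge_fault,
};
```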



* Re: How to use huge pages in drivers?
  2019-09-03 18:42 ` Matthew Wilcox
@ 2019-09-03 21:28   ` Dominique Martinet
  2019-09-04 17:00     ` Dominique Martinet
  0 siblings, 1 reply; 8+ messages in thread
From: Dominique Martinet @ 2019-09-03 21:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > What I'd like to know is:
> >  - we know (assuming the other side isn't too buggy, but if it is we're
> > fucked anyway) exactly what huge-page-sized physical memory range has
> > been mapped on the other side; is there a way to manually gather the
> > corresponding pages and merge them into a huge page?
> 
> You're using the word 'page' here, but I suspect what you really mean is
> "pfn" or "pte".  As you've described it, it doesn't matter what data structure
> Linux is using for the memory, since Linux doesn't know about the memory.

Correct, we're already using vmf_insert_pfn.

> We have vmf_insert_pfn_pmd() which is designed to be called from your
> ->huge_fault handler.  See dev_dax_huge_fault() -> __dev_dax_pmd_fault()
> for an example.  It's a fairly new mechanism, so I don't think it's
> popular with device drivers yet.
> 
> All you really need is the physical address of the memory to make this work.

Great; I'm not sure how I had missed the pmd variant here. It's even
been around long enough to be available on our "old" el7 kernels, so
I'll be able to test this quickly.

Thanks!
-- 
Dominique



* Re: How to use huge pages in drivers?
  2019-09-03 21:28   ` Dominique Martinet
@ 2019-09-04 17:00     ` Dominique Martinet
  2019-09-04 17:50       ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dominique Martinet @ 2019-09-04 17:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

Dominique Martinet wrote on Tue, Sep 03, 2019:
> Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > > What I'd like to know is:
> > >  - we know (assuming the other side isn't too buggy, but if it is we're
> > > fucked anyway) exactly what huge-page-sized physical memory range has
> > > been mapped on the other side; is there a way to manually gather the
> > > corresponding pages and merge them into a huge page?
> > 
> > You're using the word 'page' here, but I suspect what you really mean is
> > "pfn" or "pte".  As you've described it, it doesn't matter what data structure
> > Linux is using for the memory, since Linux doesn't know about the memory.
> 
> Correct, we're already using vmf_insert_pfn

Actually let me take that back: vmf_insert_pfn is only used if
pfn_valid() is false, probably as a safeguard of sorts(?).
The normal case goes through pfn_to_page(pfn) + vm_insert_page(), as
things stand.
I do have a few more questions if you could humor me a bit more...

 - the vma was created with vm_flags including VM_MIXEDMAP for some
reason, I don't know why.
If I change it to VM_PFNMAP (which sounds better here from the little I
understand of this, as we do not need cow and it looks a bit simpler?), I
can remove the vm_insert_page() path and use the vmf_insert_pfn one
instead, which appears to work fine for simple programs... But the
kernel thread for my network adapter (bxi... which is not upstream
either I guess.. sigh..) no longer tries to fault via my custom .fault
vm operation... Which means I probably did need MIXEDMAP?

I'm honestly not sure where to read up on what these two flags imply,
looking at the page fault handler code I do not see why the request from
a kernel thread would care what kind of vma it is...


 - ignoring that for now (it's not like I need to switch to PFNMAP);
adding vmf_insert_pfn_pmd() for when the remote side uses large pages,
it complains that the vmf->pmd is not pmd_none nor huge nor a devmap
(this check appears specific to the rhel7 kernel; I could temporarily
test with an upstream kernel but the network adapter won't work there,
so I'll ultimately need this to work here)

It looks like handle_mm_fault() will always try to allocate a pmd so it
should never be empty in my fault handler, and I don't see anything
other than vmf_insert_pfn_pmd() setting the devmap flag, and it's not
huge either...
(on a dump, the pmd content is 175cb18067, so these flags according
to crash for x86_64 are (PRESENT|RW|USER|ACCESSED|DIRTY))

I tried adding a huge_fault vm op thinking it might be called with a
more appropriate pmd, but it doesn't seem to be called at all in my
case..? I would have assumed from the code that it would try every page
size.

and if I try to somehow force it by using pmd_mkdevmap on the vmf->pmd,
things appear to work until the process exits and zap_page does a null
deref in pgtable_trans_huge_withdraw because the pgtable was never
deposited - this looks gone in newer kernels, but once again I do not
see where these should come from; I'm just assuming I reap what I sow
messing with the flags.



Long story short, I think I have some deeper understanding problem about
the whole thing. Do I also need to use some specific flags when that
special file is mmap'd to allow huge_fault to be called?
I think transparent_hugepage_enabled(vma) is fine, but the vmf.pmd found
in __handle_mm_fault is probably already not none at this point...?



Thanks again, feel free to ignore me for a bit longer, I'll keep digging
my own grave; writing to a rubber duck that might have an idea of how
far the wrong way I've gone already helps... :D
-- 
Dominique




* Re: How to use huge pages in drivers?
  2019-09-04 17:00     ` Dominique Martinet
@ 2019-09-04 17:50       ` Matthew Wilcox
  2019-09-05 15:44         ` Dominique Martinet
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2019-09-04 17:50 UTC (permalink / raw)
  To: Dominique Martinet; +Cc: linux-mm

On Wed, Sep 04, 2019 at 07:00:56PM +0200, Dominique Martinet wrote:
> Dominique Martinet wrote on Tue, Sep 03, 2019:
> > Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > > > What I'd like to know is:
> > > >  - we know (assuming the other side isn't too buggy, but if it is we're
> > > > fucked anyway) exactly what huge-page-sized physical memory range has
> > > > been mapped on the other side; is there a way to manually gather the
> > > > corresponding pages and merge them into a huge page?
> > > 
> > > You're using the word 'page' here, but I suspect what you really mean is
> > > "pfn" or "pte".  As you've described it, it doesn't matter what data structure
> > > Linux is using for the memory, since Linux doesn't know about the memory.
> > 
> > Correct, we're already using vmf_insert_pfn
> 
> Actually let me take that back: vmf_insert_pfn is only used if
> pfn_valid() is false, probably as a safeguard of sorts(?).
> The normal case goes through pfn_to_page(pfn) + vm_insert_page(), as
> things stand.
> I do have a few more questions if you could humor me a bit more...
> 
>  - the vma was created with a vm_flags including VM_MIXEDMAP for some
> reason, I don't know why.
> If I change it to VM_PFNMAP (which sounds better here from the little I
> understand of this as we do not need cow and looks a bit simpler?), I
> can remove the vm_insert_page() path and use the vmf_insert_pfn one
> instead, which appears to work fine for simple programs... But the
> kernel thread for my network adapter (bxi... which is not upstream
> either I guess.. sigh..) no longer tries to fault via my custom .fault
> vm operation... Which means I probably did need MIXEDMAP ?

Strange ... PFNMAP absolutely should try to fault via the ->fault
vm operation (although see below)

>  - ignoring that for now (it's not like I need to switch to PFNMAP);
> adding vmf_insert_pfn_pmd() for when the remote side uses large pages,
> it complains that the vmf->pmd is not a pmd_none nor huge nor a devmap
> (this check appears specific to rhel7 kernel, I could temporarily test
> with an upstream kernel but the network adapter won't work there so I'll
> need this to work on this ultimately)
> 
> It looks like handle_mm_fault() will always try to allocate a pmd so it
> should never be empty in my fault handler, and I don't see anything else
> than vmf_insert_pfn_pmd() setting the mkdevmap flag, and it's not huge
> either...
> (on a dump, the pmd content is 175cb18067, so these flags according
> to crash for x86_64 are (PRESENT|RW|USER|ACCESSED|DIRTY))
> 
> I tried adding a huge_fault vm op thinking it might be called with a
> more appropriate pmd but it doesn't seem to be called at all in my
> case..? I would have assumed from the code that it would try every page

You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
handler, as by then the fault handler has already inserted a PMD.
The ->huge_fault handler is the place to call it from.

You may need to force PMD-alignment for your call to mmap().

> Long story short, I think I have some deeper understanding problem about
> the whole thing. Do I also need to use some specific flags when that
> special file is mmap'd to allow huge_fault to be called ?
> I think transparent_hugepage_enabled(vma) is fine, but the vmf.pmd found
> in __handle_mm_fault is probably already not none at this point...?
> 
> Thanks again, feel free to ignore me for a bit longer I'll keep digging
> my own grave, writing to a rubber duck that might have an idea of how
> far the wrong way I've gone already helps... :D

Hope these pointers are slightly more useful than a rubber duck ;-)



* Re: How to use huge pages in drivers?
  2019-09-04 17:50       ` Matthew Wilcox
@ 2019-09-05 15:44         ` Dominique Martinet
  2019-09-05 18:15           ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dominique Martinet @ 2019-09-05 15:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

Matthew Wilcox wrote on Wed, Sep 04, 2019:
> >  - the vma was created with a vm_flags including VM_MIXEDMAP for some
> > reason, I don't know why.
> > If I change it to VM_PFNMAP (which sounds better here from the little I
> > understand of this as we do not need cow and looks a bit simpler?), I
> > can remove the vm_insert_page() path and use the vmf_insert_pfn one
> > instead, which appears to work fine for simple programs... But the
> > kernel thread for my network adapter (bxi... which is not upstream
> > either I guess.. sigh..) no longer tries to fault via my custom .fault
> > vm operation... Which means I probably did need MIXEDMAP ?
> 
> Strange ... PFNMAP absolutely should try to fault via the ->fault
> vm operation (although see below)

It does fault in some contexts, just not in others.. A bit weird but
I'll stick to MIXEDMAP for now - I'm really curious as to what the
difference is; "normal" applications seem to work fine with either mode,
it's only the bxi driver that doesn't.

> > I tried adding a huge_fault vm op thinking it might be called with a
> > more appropriate pmd but it doesn't seem to be called at all in my
> > case..? I would have assumed from the code that it would try every page
> 
> You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
> handler, as by then the fault handler has already inserted a PMD.
> The ->huge_fault handler is the place to call it from.
> 
> You may need to force PMD-alignment for your call to mmap().

I was missing setting the VM_HUGE_FAULT vm_flags2 bit in the vma - the
huge_fault handler is now called, and I no longer have the pre-existing
pmd problem; that's a much better solution than manually fiddling with
flags :)

Question though - is it ok to insert small pages if the huge_fault
handler is called with PE_SIZE_PMD ?
(I think the pte insertion will automatically create the pmd, but would
be good to confirm)


Now that I've got this I'm back to where I stood with my kludge though:
programs work until they exit, and the zap_huge_pmd() function tries to
withdraw the pagetable from some magic field that was never set in my
case... I realize this is old code no longer upstream, but my new
workaround (looking at the zap_huge_pmd function) was to pretend my
file is dax.
Now that I've set it as dax I think it actually makes sense, as in
"there's memory here that points to something linux no longer manages
directly, just let it be", and we might benefit from the other
exceptions dax has; I'll need to look at what this implies in more
detail...


> Hope these pointers are slightly more useful than a rubber duck ;-)

Much appreciated, thank you for taking the time! :)

Off to debug my network driver for the PFNMAP behaviour next, and then
some more testing... I'm sure I broke something seemingly unrelated on
the other side of the project!

-- 
Dominique



* Re: How to use huge pages in drivers?
  2019-09-05 15:44         ` Dominique Martinet
@ 2019-09-05 18:15           ` Matthew Wilcox
  2019-09-05 18:50             ` Dominique Martinet
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2019-09-05 18:15 UTC (permalink / raw)
  To: Dominique Martinet; +Cc: linux-mm

On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
> > handler, as by then the fault handler has already inserted a PMD.
> > The ->huge_fault handler is the place to call it from.
> > 
> > You may need to force PMD-alignment for your call to mmap().
> 
> I was missing setting the VM_HUGE_FAULT vm_flags2 bit in the vma - the
> huge_fault handler is now called, and I no longer have the pre-existing
> pmd problem; that's a much better solution than manually fiddling with
> flags :)
> 
> Question though - is it ok to insert small pages if the huge_fault
> handler is called with PE_SIZE_PMD ?
> (I think the pte insertion will automatically create the pmd, but would
> be good to confirm)

No, you need to return VM_FAULT_FALLBACK, at which point the generic code
will create a PMD for you and then call your ->fault handler which can
insert PTEs.
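
In code, that fallback dance might look roughly like this (hedged
sketch; my_remote_lookup() is a made-up name standing in for mckernel's
round-trip to resolve the pfn and remote page size):

```c
/*
 * Sketch: a ->huge_fault handler that maps a PMD when the remote side
 * has a huge page, and returns VM_FAULT_FALLBACK otherwise so the
 * generic fault code retries through the regular ->fault handler.
 * my_remote_lookup() is a hypothetical helper, not a real API.
 */
static vm_fault_t my_huge_fault(struct vm_fault *vmf,
				enum page_entry_size pe_size)
{
	unsigned long size;
	pfn_t pfn;

	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;	/* no 1GB mappings expected */

	if (my_remote_lookup(vmf->vma, vmf->pgoff, &pfn, &size))
		return VM_FAULT_SIGBUS;

	if (size < PMD_SIZE)			/* remote handed a small page */
		return VM_FAULT_FALLBACK;	/* generic code calls ->fault */

	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}
```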

It works the same way from PUDs to PMDs by the way, in case you ever
have a 1GB mapping ;-)

> Now I've got this I'm back to where I stood with my kludge though,
> programs work until they exit, and the zap_huge_pmd() function tries to
> withdraw the pagetable from some magic field that was never set in my
> case... I realize this is old code no longer upstream, but my new
> workaround for this (looking at the zap_huge_pmd function) was to
> pretend my file is dax.
> Now that I've set it as dax I think it actually makes sense as in
> "there's memory here that points to something linux no longer manages
> directly, just let it be" and we might benefit from the other exceptions
> dax have, I'll need to look at what this implies in more details...

I think that should be fine, but I don't really know RHEL 7.3 all that
well ;-)

> > Hope these pointers are slightly more useful than a rubber duck ;-)
> 
> Much appreciated, thank you for taking the time! :)

No problem ... these APIs are relatively new and not necessarily all
that intuitive.



* Re: How to use huge pages in drivers?
  2019-09-05 18:15           ` Matthew Wilcox
@ 2019-09-05 18:50             ` Dominique Martinet
  0 siblings, 0 replies; 8+ messages in thread
From: Dominique Martinet @ 2019-09-05 18:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

Matthew Wilcox wrote on Thu, Sep 05, 2019:
> On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > Question though - is it ok to insert small pages if the huge_fault
> > handler is called with PE_SIZE_PMD ?
> > (I think the pte insertion will automatically create the pmd, but would
> > be good to confirm)
> 
> No, you need to return VM_FAULT_FALLBACK, at which point the generic code
> will create a PMD for you and then call your ->fault handler which can
> insert PTEs.

Hmm, that's a shame actually.
There is a rather costly round-trip between linux and mckernel to
determine what page size is used for this virtual address on the remote
side and to get the corresponding physical address, so basically when we
get the fault we do not know if this will be a PMD or PTE.

I'd rather avoid having to do one round-trip at the PMD stage, get told
this is a PTE, temporarily give up and wait to be called again with
PE_SIZE_PTE, and do a second round-trip in that case.
I didn't see anywhere in the vm_fault struct that I could piggy-back on
to remember something from the previous call, and I'm pretty sure it
would be a bad idea to use the vma's vm_private_data here because there
could be multiple faults in parallel on other threads.


Looking at vmf_insert_pfn(), it will allocate a pmd because of
insert_pfn's get_locked_pte, so it does end up working (I never return a
page - we always return VM_FAULT_NOPAGE on success, so I do not see the
harm in doing it early if we can).

Following the code in __handle_mm_fault, assuming the pmd fault had
returned fallback, I do not see any harm here - the pmd has actually
already been allocated (at the pmd level fault), it's just set to none.

Not exactly pretty, though, and very definitely no guarantee it'll keep
working... I'll stick a comment saying what we should do at least :P
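
That shortcut might be sketched like this (hedged: it leans on
insert_pfn()'s get_locked_pte() populating the missing page-table
levels, which as said above is not guaranteed to keep working;
my_remote_lookup() is a hypothetical helper name):

```c
/*
 * Sketch of the shortcut: answer a PMD-level fault with a plain PTE
 * insertion when the remote side only has a small page, saving the
 * second round-trip.  vmf_insert_pfn() allocates the intermediate
 * page-table levels it needs via get_locked_pte().
 */
static vm_fault_t my_huge_fault(struct vm_fault *vmf,
				enum page_entry_size pe_size)
{
	unsigned long size, raw_pfn;

	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;

	if (my_remote_lookup(vmf->vma, vmf->pgoff, &raw_pfn, &size))
		return VM_FAULT_SIGBUS;

	if (size < PMD_SIZE) {
		/*
		 * XXX: the "proper" way is VM_FAULT_FALLBACK plus caching
		 * the lookup for the subsequent ->fault; inserting the PTE
		 * here works today because insert_pfn() populates the
		 * missing levels, but nothing guarantees it stays so.
		 */
		return vmf_insert_pfn(vmf->vma, vmf->address, raw_pfn);
	}

	return vmf_insert_pfn_pmd(vmf, pfn_to_pfn_t(raw_pfn),
				  vmf->flags & FAULT_FLAG_WRITE);
}
```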

> It works the same way from PUDs to PMDs by the way, in case you ever
> have a 1GB mapping ;-)

Yes, I'm already returning fallback in this case - but I'm just assuming
that won't happen, so no round-trip here :)


> > Now that I've set it as dax I think it actually makes sense as in
> > "there's memory here that points to something linux no longer manages
> > directly, just let it be" and we might benefit from the other exceptions
> > dax have, I'll need to look at what this implies in more details...
> 
> I think that should be fine, but I don't really know RHEL 7.3 all that
> well ;-)

Good enough for me, tests will tell me what I broke :)


> No problem ... these APIs are relatively new and not necessarily all
> that intuitive.

Looking at a recent vanilla linux in the evening and rhel's kernel at
work didn't help on my side (some fun differences like the VM_HUGE_FAULT
flag in the vma, but now that I understand it was added for abi
compatibility it does make sense - on an older module the function
pointer could just have been left uninitialized and thus non-null yet
not valid)

Definitely did help to point at huge_fault() again.


Thanks,
-- 
Dominique



