* How to use huge pages in drivers?
From: Dominique Martinet @ 2019-09-03 18:26 UTC
To: linux-mm

Hi; not quite sure where to ask, so I will start here...

Some context first. I'm inquiring in the context of mckernel[1], a
lightweight kernel that works next to linux (basically it offlines a
few/most cores, reserves some memory and boots a second OS on that to
run HPC applications).
Being brutally honest here, this is mostly research and anyone here
looking into it will probably scream, but I might as well try not to add
too many more reasons to do so....

One of the mechanisms here is that sometimes we want to access the
mckernel memory from linux (either from the process that spawned the
mckernel side process or from a driver in linux), and to do that we have
mapped the mckernel side virtual memory range to that process so it can
page fault.
The (horrible) function doing that can be found here[2], rus_vm_fault -
it sends a message to the other side to identify the corresponding
physical address within what we had reserved earlier, and maps it quite
manually.

We could know at this point if it had been a huge page (very likely) or
not; I'm observing a huge difference in performance with some
interconnect if I add a huge kludge emulating huge pages here (directly
manipulating the process' page table) so I'd very much like to use huge
pages when we know a huge page has been mapped on the other side.


What I'd like to know is:
 - we know (assuming the other side isn't too bugged, but if it is we're
fucked up anyway) exactly what huge-page-sized physical memory range has
been mapped on the other side, is there a way to manually gather the
corresponding pages and merge them into a huge page?

 - from what I understand that does not seem possible/recommended, the
way to go being to have a userland process get huge pages and pass these
to a device (ioctl or something); but I assume that means said process
needs to keep on running all the time that memory is required?
If the page fault needs to split the page (because the other side handed
a "small" page so we can only map a regular page here), can it be merged
back into a huge page for the next time this physical region is used?

[1] https://github.com/RIKEN-SysSoft/mckernel
[2] https://github.com/RIKEN-SysSoft/mckernel/blob/development/executer/kernel/mcctrl/syscall.c#L538

Any input will be appreciated,
-- 
Dominique Martinet

* Re: How to use huge pages in drivers?
From: Matthew Wilcox @ 2019-09-03 18:42 UTC
To: Dominique Martinet
Cc: linux-mm

On Tue, Sep 03, 2019 at 08:26:27PM +0200, Dominique Martinet wrote:
> Some context first. I'm inquiring in the context of mckernel[1], a
> lightweight kernel that works next to linux (basically it offlines a
> few/most cores, reserves some memory and boots a second OS on that to
> run HPC applications).
> Being brutally honest here, this is mostly research and anyone here
> looking into it will probably scream, but I might as well try not to add
> too many more reasons to do so....
>
> One of the mechanisms here is that sometimes we want to access the
> mckernel memory from linux (either from the process that spawned the
> mckernel side process or from a driver in linux), and to do that we have
> mapped the mckernel side virtual memory range to that process so it can
> page fault.
> The (horrible) function doing that can be found here[2], rus_vm_fault -
> it sends a message to the other side to identify the corresponding
> physical address within what we had reserved earlier, and maps it quite
> manually.
>
> We could know at this point if it had been a huge page (very likely) or
> not; I'm observing a huge difference in performance with some
> interconnect if I add a huge kludge emulating huge pages here (directly
> manipulating the process' page table) so I'd very much like to use huge
> pages when we know a huge page has been mapped on the other side.
>
>
> What I'd like to know is:
>  - we know (assuming the other side isn't too bugged, but if it is we're
> fucked up anyway) exactly what huge-page-sized physical memory range has
> been mapped on the other side, is there a way to manually gather the
> corresponding pages and merge them into a huge page?

You're using the word 'page' here, but I suspect what you really mean is
"pfn" or "pte". As you've described it, it doesn't matter what data structure
Linux is using for the memory, since Linux doesn't know about the memory.

We have vmf_insert_pfn_pmd() which is designed to be called from your
->huge_fault handler. See dev_dax_huge_fault() -> __dev_dax_pmd_fault()
for an example. It's a fairly new mechanism, so I don't think it's
popular with device drivers yet.

All you really need is the physical address of the memory to make this work.

* Re: How to use huge pages in drivers?
From: Dominique Martinet @ 2019-09-03 21:28 UTC
To: Matthew Wilcox
Cc: linux-mm

Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > What I'd like to know is:
> >  - we know (assuming the other side isn't too bugged, but if it is we're
> > fucked up anyway) exactly what huge-page-sized physical memory range has
> > been mapped on the other side, is there a way to manually gather the
> > pages corresponding and merge them into a huge page?
>
> You're using the word 'page' here, but I suspect what you really mean is
> "pfn" or "pte". As you've described it, it doesn't matter what data structure
> Linux is using for the memory, since Linux doesn't know about the memory.

Correct, we're already using vmf_insert_pfn

> We have vmf_insert_pfn_pmd() which is designed to be called from your
> ->huge_fault handler. See dev_dax_huge_fault() -> __dev_dax_pmd_fault()
> for an example. It's a fairly new mechanism, so I don't think it's
> popular with device drivers yet.
>
> All you really need is the physical address of the memory to make this work.

Great; I'm not sure how I had missed the pmd variant here. It's even been
around for long enough to be available on our "old" el7 kernels so I'll
be able to test this quickly.

Thanks!
-- 
Dominique

* Re: How to use huge pages in drivers?
From: Dominique Martinet @ 2019-09-04 17:00 UTC
To: Matthew Wilcox
Cc: linux-mm

Dominique Martinet wrote on Tue, Sep 03, 2019:
> Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > > What I'd like to know is:
> > >  - we know (assuming the other side isn't too bugged, but if it is we're
> > > fucked up anyway) exactly what huge-page-sized physical memory range has
> > > been mapped on the other side, is there a way to manually gather the
> > > pages corresponding and merge them into a huge page?
> >
> > You're using the word 'page' here, but I suspect what you really mean is
> > "pfn" or "pte". As you've described it, it doesn't matter what data structure
> > Linux is using for the memory, since Linux doesn't know about the memory.
>
> Correct, we're already using vmf_insert_pfn

Actually let me take that back, vmf_insert_pfn is only used if
pfn_valid() is false, probably as a safeguard of sorts(?).
The normal case goes through pfn_to_page(pfn) + vm_insert_page(), as
things stand.


I do have a few more questions if you could humor me a bit more...

 - the vma was created with a vm_flags including VM_MIXEDMAP for some
reason, I don't know why.
If I change it to VM_PFNMAP (which sounds better here from the little I
understand of this as we do not need cow and looks a bit simpler?), I
can remove the vm_insert_page() path and use the vmf_insert_pfn one
instead, which appears to work fine for simple programs... But the
kernel thread for my network adapter (bxi... which is not upstream
either I guess.. sigh..) no longer tries to fault via my custom .fault
vm operation... Which means I probably did need MIXEDMAP ?

I'm honestly not sure where to read up on what these two flags imply,
looking at the page fault handler code I do not see why the request from
a kernel thread would care what kind of vma it is...


 - ignoring that for now (it's not like I need to switch to PFNMAP);
adding vmf_insert_pfn_pmd() for when the remote side uses large pages,
it complains that the vmf->pmd is neither pmd_none nor huge nor a devmap
(this check appears specific to the rhel7 kernel, I could temporarily
test with an upstream kernel but the network adapter won't work there so
I'll ultimately need this to work on rhel7)

It looks like handle_mm_fault() will always try to allocate a pmd so it
should never be empty in my fault handler, and I don't see anything else
than vmf_insert_pfn_pmd() setting the mkdevmap flag, and it's not huge
either...
(on a dump, the pmd content is 175cb18067, so these flags according
to crash for x86_64 are (PRESENT|RW|USER|ACCESSED|DIRTY))

I tried adding a huge_fault vm op thinking it might be called with a
more appropriate pmd but it doesn't seem to be called at all in my
case..? I would have assumed from the code that it would try every page
and if I try to somehow force it by using pmd_mkdevmap on the vmf->pmd,
things appear to work until the process exits and zap_page does a null
deref on pgtable_trans_huge_withdraw because the pgtable was never
deposited - this looks gone on newer kernels, but once again I do not
see where these should come from; I'm just assuming I reap what I sow
messing with the flags.


Long story short, I think I have some deeper understanding problem about
the whole thing. Do I also need to use some specific flags when that
special file is mmap'd to allow huge_fault to be called ?
I think transparent_hugepage_enabled(vma) is fine, but the vmf.pmd found
in __handle_mm_fault is probably already not none at this point...?

Thanks again, feel free to ignore me for a bit longer, I'll keep digging
my own grave; writing to a rubber duck that might have an idea of how
far the wrong way I've gone already helps... :D
-- 
Dominique

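For readers following the pmd dump in the message above, here is a minimal
userspace sketch (not kernel code) that decodes the low flag bits of an
x86_64 page-table entry. The macro and function names (PTE_*, describe_flags,
pmd_entry_is_huge) are made up for this illustration; the bit positions are
the architectural ones. It suggests why the quoted value looks like a PMD
pointing to a regular PTE table rather than a huge mapping: the PSE bit is
clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Low flag bits shared by x86_64 page-table entries at every level. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_RW       (UINT64_C(1) << 1)
#define PTE_USER     (UINT64_C(1) << 2)
#define PTE_PWT      (UINT64_C(1) << 3)
#define PTE_PCD      (UINT64_C(1) << 4)
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)
#define PTE_PSE      (UINT64_C(1) << 7) /* in a PMD: entry maps a 2 MiB page */

/* True if a PMD entry maps a huge page instead of pointing to a PTE table. */
static bool pmd_entry_is_huge(uint64_t entry)
{
    return (entry & PTE_PRESENT) && (entry & PTE_PSE);
}

/* Write an "A|B|C" description of the flag bits set in entry into buf. */
static void describe_flags(uint64_t entry, char *buf, size_t len)
{
    static const struct { uint64_t bit; const char *name; } names[] = {
        { PTE_PRESENT, "PRESENT" }, { PTE_RW, "RW" }, { PTE_USER, "USER" },
        { PTE_PWT, "PWT" }, { PTE_PCD, "PCD" }, { PTE_ACCESSED, "ACCESSED" },
        { PTE_DIRTY, "DIRTY" }, { PTE_PSE, "PSE" },
    };

    assert(len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        size_t used = strlen(buf);

        if (!(entry & names[i].bit))
            continue;
        /* snprintf always NUL-terminates; a short buffer only truncates. */
        snprintf(buf + used, len - used, "%s%s", used ? "|" : "", names[i].name);
    }
}
```

Calling describe_flags() on the dumped value 0x175cb18067 yields the same
flag list that crash reported, and pmd_entry_is_huge() returns false for it.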
* Re: How to use huge pages in drivers?
From: Matthew Wilcox @ 2019-09-04 17:50 UTC
To: Dominique Martinet
Cc: linux-mm

On Wed, Sep 04, 2019 at 07:00:56PM +0200, Dominique Martinet wrote:
> Dominique Martinet wrote on Tue, Sep 03, 2019:
> > Matthew Wilcox wrote on Tue, Sep 03, 2019:
> > > > What I'd like to know is:
> > > >  - we know (assuming the other side isn't too bugged, but if it is we're
> > > > fucked up anyway) exactly what huge-page-sized physical memory range has
> > > > been mapped on the other side, is there a way to manually gather the
> > > > pages corresponding and merge them into a huge page?
> > >
> > > You're using the word 'page' here, but I suspect what you really mean is
> > > "pfn" or "pte". As you've described it, it doesn't matter what data structure
> > > Linux is using for the memory, since Linux doesn't know about the memory.
> >
> > Correct, we're already using vmf_insert_pfn
>
> Actually let me take that back, vmf_insert_pfn is only used if
> pfn_valid() is false, probably as a safeguard of sorts(?).
> The normal case goes through pfn_to_page(pfn) + vm_insert_page(), as
> things stand.
>
> I do have a few more questions if you could humor me a bit more...
>
>  - the vma was created with a vm_flags including VM_MIXEDMAP for some
> reason, I don't know why.
> If I change it to VM_PFNMAP (which sounds better here from the little I
> understand of this as we do not need cow and looks a bit simpler?), I
> can remove the vm_insert_page() path and use the vmf_insert_pfn one
> instead, which appears to work fine for simple programs... But the
> kernel thread for my network adapter (bxi... which is not upstream
> either I guess.. sigh..) no longer tries to fault via my custom .fault
> vm operation... Which means I probably did need MIXEDMAP ?

Strange ... PFNMAP absolutely should try to fault via the ->fault
vm operation (although see below)

>  - ignoring that for now (it's not like I need to switch to PFNMAP);
> adding vmf_insert_pfn_pmd() for when the remote side uses large pages,
> it complains that the vmf->pmd is neither pmd_none nor huge nor a devmap
> (this check appears specific to the rhel7 kernel, I could temporarily
> test with an upstream kernel but the network adapter won't work there so
> I'll ultimately need this to work on rhel7)
>
> It looks like handle_mm_fault() will always try to allocate a pmd so it
> should never be empty in my fault handler, and I don't see anything else
> than vmf_insert_pfn_pmd() setting the mkdevmap flag, and it's not huge
> either...
> (on a dump, the pmd content is 175cb18067, so these flags according
> to crash for x86_64 are (PRESENT|RW|USER|ACCESSED|DIRTY))
>
> I tried adding a huge_fault vm op thinking it might be called with a
> more appropriate pmd but it doesn't seem to be called at all in my
> case..? I would have assumed from the code that it would try every page

You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
handler, as by then the fault handler has already inserted a PMD.
The ->huge_fault handler is the place to call it from.

You may need to force PMD-alignment for your call to mmap().

> Long story short, I think I have some deeper understanding problem about
> the whole thing. Do I also need to use some specific flags when that
> special file is mmap'd to allow huge_fault to be called ?
> I think transparent_hugepage_enabled(vma) is fine, but the vmf.pmd found
> in __handle_mm_fault is probably already not none at this point...?
>
> Thanks again, feel free to ignore me for a bit longer, I'll keep digging
> my own grave; writing to a rubber duck that might have an idea of how
> far the wrong way I've gone already helps... :D

Hope these pointers are slightly more useful than a rubber duck ;-)

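One possible way to get the PMD alignment mentioned above from userspace,
sketched under assumptions: the PMD size is taken to be 2 MiB (true on
x86_64, not everywhere), and reserve_aligned() is a hypothetical helper name.
The idea is to reserve more address space than needed with an anonymous
PROT_NONE mapping, trim it down to an aligned window, and then map the
driver's file on top of that window with MAP_FIXED. This is only one
approach and is not taken from the mckernel sources.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS with strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MiB PMD mappings, as on x86_64 with 4 KiB base pages. */
#define PMD_SIZE_ASSUMED ((size_t)2 << 20)

/*
 * Reserve an address range of len bytes (rounded up to a multiple of align)
 * whose start is a multiple of align. align must be a power of two and a
 * multiple of the base page size. Returns NULL on failure.
 */
static void *reserve_aligned(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    len = (len + align - 1) & ~(align - 1);
    size_t span = len + align; /* over-reserve so an aligned window fits */

    char *raw = mmap(NULL, span, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
    char *aligned = (char *)start;
    size_t head = (size_t)(aligned - raw);
    size_t tail = span - head - len;

    /* Give back the unaligned slack before and after the window. */
    if (head)
        munmap(raw, head);
    if (tail)
        munmap(aligned + len, tail);
    return aligned;
}
```

The caller would then mmap() the device file at the returned address with
MAP_FIXED, using a file offset that is also a multiple of the PMD size, so
that both the virtual and the physical side of the mapping can line up.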
* Re: How to use huge pages in drivers?
From: Dominique Martinet @ 2019-09-05 15:44 UTC
To: Matthew Wilcox
Cc: linux-mm

Matthew Wilcox wrote on Wed, Sep 04, 2019:
> >  - the vma was created with a vm_flags including VM_MIXEDMAP for some
> > reason, I don't know why.
> > If I change it to VM_PFNMAP (which sounds better here from the little I
> > understand of this as we do not need cow and looks a bit simpler?), I
> > can remove the vm_insert_page() path and use the vmf_insert_pfn one
> > instead, which appears to work fine for simple programs... But the
> > kernel thread for my network adapter (bxi... which is not upstream
> > either I guess.. sigh..) no longer tries to fault via my custom .fault
> > vm operation... Which means I probably did need MIXEDMAP ?
>
> Strange ... PFNMAP absolutely should try to fault via the ->fault
> vm operation (although see below)

It does fault in some context, just not in another.. A bit weird but
I'll stick to MIXEDMAP for now - I'm really curious as to what the
difference is, "normal" applications seem to work fine with either mode,
it's only the bxi driver that

> > I tried adding a huge_fault vm op thinking it might be called with a
> > more appropriate pmd but it doesn't seem to be called at all in my
> > case..? I would have assumed from the code that it would try every page
>
> You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
> handler, as by then the fault handler has already inserted a PMD.
> The ->huge_fault handler is the place to call it from.
>
> You may need to force PMD-alignment for your call to mmap().

I was missing setting the VM_HUGE_FAULT vm_flags2 bit in the vma - the
huge_fault handler is now called, and I no longer have the pre-existing
pmd problem; that's a much better solution than manually fiddling with
flags :)

Question though - is it ok to insert small pages if the huge_fault
handler is called with PE_SIZE_PMD ?
(I think the pte insertion will automatically create the pmd, but would
be good to confirm)


Now I've got this I'm back to where I stood with my kludge though,
programs work until they exit, and the zap_huge_pmd() function tries to
withdraw the pagetable from some magic field that was never set in my
case... I realize this is old code no longer upstream, but my new
workaround for this (looking at the zap_huge_pmd function) was to
pretend my file is dax.
Now that I've set it as dax I think it actually makes sense as in
"there's memory here that points to something linux no longer manages
directly, just let it be" and we might benefit from the other exceptions
dax have, I'll need to look at what this implies in more details...

> Hope these pointers are slightly more useful than a rubber duck ;-)

Much appreciated, thank you for taking the time! :)

Off to debug my network driver for the PFNMAP behaviour next, and then
some more testing... I'm sure I broke something seemingly unrelated on
the other side of the project!
-- 
Dominique

* Re: How to use huge pages in drivers?
From: Matthew Wilcox @ 2019-09-05 18:15 UTC
To: Dominique Martinet
Cc: linux-mm

On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > You shouldn't be calling vmf_insert_pfn_pmd() from a regular ->fault
> > handler, as by then the fault handler has already inserted a PMD.
> > The ->huge_fault handler is the place to call it from.
> >
> > You may need to force PMD-alignment for your call to mmap().
>
> I was missing setting the VM_HUGE_FAULT vm_flags2 bit in the vma - the
> huge_fault handler is now called, and I no longer have the pre-existing
> pmd problem; that's a much better solution than manually fiddling with
> flags :)
>
> Question though - is it ok to insert small pages if the huge_fault
> handler is called with PE_SIZE_PMD ?
> (I think the pte insertion will automatically create the pmd, but would
> be good to confirm)

No, you need to return VM_FAULT_FALLBACK, at which point the generic code
will create a PMD for you and then call your ->fault handler which can
insert PTEs.

It works the same way from PUDs to PMDs by the way, in case you ever
have a 1GB mapping ;-)

> Now I've got this I'm back to where I stood with my kludge though,
> programs work until they exit, and the zap_huge_pmd() function tries to
> withdraw the pagetable from some magic field that was never set in my
> case... I realize this is old code no longer upstream, but my new
> workaround for this (looking at the zap_huge_pmd function) was to
> pretend my file is dax.
> Now that I've set it as dax I think it actually makes sense as in
> "there's memory here that points to something linux no longer manages
> directly, just let it be" and we might benefit from the other exceptions
> dax have, I'll need to look at what this implies in more details...

I think that should be fine, but I don't really know RHEL 7.3 all that
well ;-)

> > Hope these pointers are slightly more useful than a rubber duck ;-)
>
> Much appreciated, thank you for taking the time! :)

No problem ... these APIs are relatively new and not necessarily all
that intuitive.

* Re: How to use huge pages in drivers?
From: Dominique Martinet @ 2019-09-05 18:50 UTC
To: Matthew Wilcox
Cc: linux-mm

Matthew Wilcox wrote on Thu, Sep 05, 2019:
> On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > Question though - is it ok to insert small pages if the huge_fault
> > handler is called with PE_SIZE_PMD ?
> > (I think the pte insertion will automatically create the pmd, but would
> > be good to confirm)
>
> No, you need to return VM_FAULT_FALLBACK, at which point the generic code
> will create a PMD for you and then call your ->fault handler which can
> insert PTEs.

Hmm, that's a shame actually.
There is a rather costly round-trip between linux and mckernel to
determine what page size is used for this virtual address on the remote
side and to get the corresponding physical address, so basically when we
get the fault we do not know if this will be a PMD or PTE.

I'd rather avoid having to do one round-trip at the PMD stage, get told
this is a PTE, temporarily give up and wait to be called again with
PE_SIZE_PTE and do a second round-trip in this case.
I didn't see anywhere in the vm_fault struct that I could piggy-back to
remember something from the previous call, and I'm pretty sure it would
be a bad idea to use the vma's vm_private_data here because there could
be multiple faults in parallel on other threads.


Looking at vmf_insert_pfn(), it will allocate a pmd because of
insert_pfn's get_locked_pte, so it does end up working (I never return a
page - we always return VM_FAULT_NOPAGE on success, so I do not see the
harm in doing it early if we can)

Following the code in __handle_mm_fault, assuming the pmd fault would
have returned fallback, I do not see any harm here - the pmd actually
already has been allocated here (at pmd level fault), it's just set to
none.

Not exactly pretty, though, and very definitely no guarantee it'll keep
working... I'll stick a comment saying what we should do at least :P

> It works the same way from PUDs to PMDs by the way, in case you ever
> have a 1GB mapping ;-)

Yes, already returning fallback in this case - but I'm just assuming
that won't happen so no round-trip here :)

> > Now that I've set it as dax I think it actually makes sense as in
> > "there's memory here that points to something linux no longer manages
> > directly, just let it be" and we might benefit from the other exceptions
> > dax have, I'll need to look at what this implies in more details...
>
> I think that should be fine, but I don't really know RHEL 7.3 all that
> well ;-)

Good enough for me, tests will tell me what I broke :)

> No problem ... these APIs are relatively new and not necessarily all
> that intuitive.

Looking at a recent vanilla linux in the evening and rhel's kernel at
work didn't help on my side (some fun differences like the VM_HUGE_FAULT
flag in the vma, but now that I understand it was added for abi
compatibility it does make sense - on an older module the function
pointer could just have been left uninitialized and thus non-null yet
not valid)

Definitely did help to point at huge_fault() again.

Thanks,
-- 
Dominique

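The fallback sequence discussed in this thread can be modelled in a few lines
of plain userspace C. This is a toy model, not kernel code: every name here
(fault_model, query_remote, model_huge_fault, and so on) is hypothetical, and
the comments only indicate where a real driver would call
vmf_insert_pfn_pmd() or vmf_insert_pfn(). What it tries to show is the cost
raised above: when the remote side turns out to use a small page, the
by-the-book path queries the remote side twice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_FAULT_NOPAGE and VM_FAULT_FALLBACK. */
enum fault_result { FAULT_NOPAGE, FAULT_FALLBACK };
enum entry_size { SIZE_NONE, SIZE_PTE, SIZE_PMD };

struct fault_model {
    bool remote_is_huge;       /* what the other kernel would answer */
    int round_trips;           /* how many times we had to ask it */
    enum entry_size installed; /* what ended up in the page table */
};

/* Models the costly query to the remote side for page size and address. */
static bool query_remote(struct fault_model *m)
{
    m->round_trips++;
    return m->remote_is_huge;
}

/* Models ->huge_fault called at PMD size. */
static enum fault_result model_huge_fault(struct fault_model *m)
{
    if (!query_remote(m))
        return FAULT_FALLBACK; /* small page: let the generic code retry */
    m->installed = SIZE_PMD;   /* real driver: vmf_insert_pfn_pmd() */
    return FAULT_NOPAGE;
}

/* Models the regular ->fault handler, which asks the remote side again. */
static enum fault_result model_fault(struct fault_model *m)
{
    query_remote(m);
    m->installed = SIZE_PTE;   /* real driver: vmf_insert_pfn() */
    return FAULT_NOPAGE;
}

/* Models the generic code: try the huge handler, fall back to PTEs. */
static enum fault_result model_handle_fault(struct fault_model *m)
{
    if (model_huge_fault(m) != FAULT_FALLBACK)
        return FAULT_NOPAGE;
    return model_fault(m);
}
```

In this model a huge remote page costs one query and a small one costs two,
which is the overhead the shortcut described above (inserting the PTE
directly from the PMD-level handler) is meant to avoid.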