From: Matthew Wilcox <willy@infradead.org>
To: Dominique Martinet <asmadeus@codewreck.org>
Cc: linux-mm@kvack.org
Subject: Re: How to use huge pages in drivers?
Date: Tue, 3 Sep 2019 11:42:30 -0700
Message-ID: <20190903184230.GJ29434@bombadil.infradead.org>
In-Reply-To: <20190903182627.GA6079@nautica>

On Tue, Sep 03, 2019 at 08:26:27PM +0200, Dominique Martinet wrote:
> Some context first. I'm inquiring in the context of mckernel[1], a
> lightweight kernel that runs alongside Linux (basically, it offlines
> a few/most cores, reserves some memory, and boots a second OS on that
> to run HPC applications).
> Being brutally honest here, this is mostly research and anyone here
> looking into it will probably scream, but I might as well try not to
> add too many more reasons to do so....
> 
> One of the mechanisms here is that sometimes we want to access the
> mckernel memory from Linux (either from the process that spawned the
> mckernel-side process or from a driver in Linux), and to do that we
> have mapped the mckernel-side virtual memory range into that process
> so it can page fault.
> The (horrible) function doing that, rus_vm_fault, can be found
> here[2] - it sends a message to the other side to identify the
> physical address corresponding to what we reserved earlier, and maps
> it quite manually.
> 
> At this point we can know whether it was a huge page (very likely) or
> not; I'm observing a huge performance difference with some
> interconnects if I add a big kludge emulating huge pages here
> (directly manipulating the process's page table), so I'd very much
> like to use huge pages when we know a huge page has been mapped on
> the other side.
> 
> What I'd like to know is:
>  - we know (assuming the other side isn't too buggy, but if it is
> we're screwed anyway) exactly which huge-page-sized physical memory
> range has been mapped on the other side; is there a way to manually
> gather the corresponding pages and merge them into a huge page?

You're using the word "page" here, but I suspect what you really mean
is "pfn" or "pte".  As you've described it, it doesn't matter what
data structure Linux is using for the memory, since Linux doesn't know
about the memory.
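
A quick illustrative sketch of the distinction ("phys" here is just a
placeholder for some physical address, not anything in your code):

	/* A pfn is simply the physical frame number of an address: */
	unsigned long pfn = phys >> PAGE_SHIFT;
	/*
	 * A pte (or pmd) is the page-table entry that maps a pfn at
	 * some virtual address, while a struct page only exists for
	 * memory that Linux itself manages:
	 */
	struct page *page = pfn_to_page(pfn);	/* invalid for memory
						   Linux doesn't know */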

We have vmf_insert_pfn_pmd() which is designed to be called from your
->huge_fault handler.  See dev_dax_huge_fault() -> __dev_dax_pmd_fault()
for an example.  It's a fairly new mechanism, so I don't think it's
popular with device drivers yet.

All you really need is the physical address of the memory to make this work.
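
To make that concrete, here is a minimal, untested sketch of such a
handler, loosely modelled on __dev_dax_pmd_fault().  Everything named
"my_*" is a hypothetical placeholder; it assumes the backing physical
range is contiguous and that the VMA starts PMD-aligned:

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>

/* Hypothetical: base of the reserved physical range, found at probe. */
static phys_addr_t my_dev_phys_base;

static vm_fault_t my_huge_fault(struct vm_fault *vmf,
				enum page_entry_size pe_size)
{
	unsigned long pmd_addr = vmf->address & PMD_MASK;
	phys_addr_t phys;
	pfn_t pfn;

	/* Let the core fall back to PTE-sized faults for anything else. */
	if (pe_size != PE_SIZE_PMD)
		return VM_FAULT_FALLBACK;

	/* Don't map past the ends of the VMA. */
	if (pmd_addr < vmf->vma->vm_start ||
	    pmd_addr + PMD_SIZE > vmf->vma->vm_end)
		return VM_FAULT_SIGBUS;

	/* Assumes a contiguous range behind a PMD-aligned vm_start. */
	phys = my_dev_phys_base + (pmd_addr - vmf->vma->vm_start);
	pfn = phys_to_pfn_t(phys, PFN_DEV | PFN_MAP);

	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}

static const struct vm_operations_struct my_vm_ops = {
	.huge_fault = my_huge_fault,
};

The driver's mmap() would install my_vm_ops, mark the VMA suitably,
and provide a regular ->fault handler for the fallback path; see
dax_mmap() in drivers/dax/device.c for the reference on that side.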



Thread overview: 8+ messages
2019-09-03 18:26 How to use huge pages in drivers? Dominique Martinet
2019-09-03 18:42 ` Matthew Wilcox [this message]
2019-09-03 21:28   ` Dominique Martinet
2019-09-04 17:00     ` Dominique Martinet
2019-09-04 17:50       ` Matthew Wilcox
2019-09-05 15:44         ` Dominique Martinet
2019-09-05 18:15           ` Matthew Wilcox
2019-09-05 18:50             ` Dominique Martinet
