* GRU driver feedback
@ 2008-07-23 14:12 Nick Piggin
  2008-07-24  2:41 ` Nick Piggin
  ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Nick Piggin @ 2008-07-23 14:12 UTC (permalink / raw)
To: Jack Steiner
Cc: Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

Hi Jack,

Some review of the GRU driver. Hope it helps. Some of it is trivial.

- I would put all the driver into a single patch. It's a logical change,
  and splitting them out is not really logical. Unless, for example, you
  start with a minimally functional driver and build more things on it,
  I don't think there is any point avoiding the one big patch. You have
  to look at the whole thing to understand it properly anyway.

  Although I guess you could leave the kernel API parts in another
  patch, and maybe even leave them out until you have a good user of
  them...

- GRU driver -- gru_intr finds the mm to fault pages from, then does an
  "atomic pte lookup" which looks up the pte atomically using a lockless
  pagetable walk similar to get_user_pages_fast. This only works because
  it can guarantee page table existence by disabling interrupts on the
  CPU where the mm is currently running. It looks like the atomic pte
  lookup can be run on mms which are not presently running on the local
  CPU. This would have been noticed if it had been using a specialised
  function in arch/*/mm/gup.c, because it would not have provided an
  mm_struct parameter ;)

- Which leads me to my next observation: the GRU TLB fault code _really_
  would like to work the same way as the CPU's fault handler, and that
  is to synchronously enter the kernel and run in process context of the
  faulting task. I guess we can't do this because the GRU isn't going to
  be accessing pages synchronously as the commands are sent to it,
  right? I guess you could have a kernel thread to handle GRU faults...
  but that could get into fairness and parallelism issues, and also
  wouldn't be able to take advantage of lockless pagetable walking
  either (although maybe they're not such a big deal? I imagine that
  asking the user process to call into the kernel to process faults
  isn't optimal either). Is there a completely different approach that
  can be taken? Ideally we want the

- "options" in non-static scope? Please change. Putting gru in any
  global symbols should be enough.

- gru_prefetch -- no users of this. This isn't intended to fault in a
  user virtual address, is it? (we already have generic kernel functions
  to do this). But as it isn't used, it may as well go away. And you
  have to remember that if you need page tables set up, they can be torn
  down again at any time. Oh, maybe it is for possible vmalloc
  addresses? If so, then OK, but please comment.

- start_instruction, wait_instruction_complete etc -- take a void
  pointer which is actually a pointer to one of a number of structures
  with a specific first int. Ugly. Don't know what the preferred form
  is; it's probably something device writers have to deal with. Wouldn't
  something like this be better for typing?

  typedef unsigned int gru_insn_register_t; /* register with cmd bit as lsb */

  struct context_configuration_handle {
          union {
                  gru_insn_register_t insn_reg;
                  struct {
                          unsigned int cmd:1;
                          /* etc */
                  };
          };
  };

  void start_instruction(gru_insn_register_t *reg);

  Even coding it as a macro would be better, because you'd then avoid
  the void * cast and get the type checking.

- You're using gru_flush_cache for a lot of things. Is the device not
  coherent with CPU caches? (in which case, should the DMA API be used
  instead?) Or is it simply higher performing if you invalidate the CPU
  cache ahead of time so the GRU probe doesn't have to?
- In gru_fault, the while() should be an if(); it would be much nicer to
  use a waitqueue or semaphore to allocate gru contexts (and if you
  actively need to steal a context to prevent starvation, you can use eg
  down_timeout()).

- You use prefetchw around the place and say it is sometimes required
  for correctness. Except it would be free to be ignored (say if the CPU
  didn't have a TLB entry or had a full prefetch queue, or even if the
  Linux implementation were a noop).

- In gru_fault, I don't think you've validated the size of the vma, and
  it definitely seems like you haven't taken the offset into the vma
  into account either. remap_pfn_range etc should probably validate the
  former, because I'm sure this isn't the only driver that might get it
  wrong. The latter can't really be detected though. Please fix or make
  it more obviously correct (eg a comment).

- And please make sure in general that this thing properly fixes/rejects
  nasty code and attempts to crash the kernel. I'm sure it can't be
  used, for example, to write into readonly pages or read from kernel
  memory etc. (I'm sure you're very conscious of this, but humour me!
  :P)

- I have a rough handle on it, but can you sketch out exactly how it is
  used, and how the virtual/physical/etc memory activity happens?
  Unfortunately, in its present state, this is going to have to be
  something that mm/ developers will have to get an understanding of,
  and I'm a bit lost when it comes to driver code. (Or have I missed
  some piece of documentation?) Although there are a lot of comments,
  most are fairly low level. I want a high level overview and
  description of important interactions. I agree with Hugh: in some
  respects GRU is special, and the bar is a bit higher than for the
  average driver.

As far as I understand it: a process can use GRU to accelerate some
memory and simple transformation operations. This can only occur within
the address space of a single mm (and by the looks you have some kernel
kva<->kva interfaces there too?)
OK, and the way for a user to operate the GRU you have provided is to
mmap /dev/gru, which gives the process essentially an mmio control page
(Oh, it's big, maybe that's more than a control area? Anyway, let's call
it a control area...). So the user can stuff some commands into it (that
can't possibly take down the machine? Or if they could, then /dev/gru is
root only? ;)) Such as giving it some user virtual addresses and asking
it to copy from one to the other, or something much cooler.

User faults writing to control area, driver steals context (by zapping
ptes) from process currently using it, and hands it to us. Then the GRU
faults on the new virtual address it is being asked to operate on, and
raises an interrupt. The interrupt handler finds the process's physical
address from the virtual one and stuffs it into the GRU. The GRU and the
process are each now merrily doing their own thing. If the process calls
fork() for example, and some virtual addresses need to be write
protected, we must also ensure the GRU can't write to these either, so
we need to invalidate its TLB before we can continue.

GRU TLB invalidation is done with mmu notifiers... I'll think about this
some more because I'll have more comments on mmu notifiers (and probably
GRU TLB invalidation).

- Finally: what is it / can it be used for? What sort of performance
  numbers do you see? This whole TLB and control page scheme presumably
  is to avoid kernel entry at all costs... the thing is, the GRU is out
  on the IO bus anyway, and is going to be invalidating the CPU's cache
  in operation. It's going to need to be working on some pretty big
  memory areas in order to be a speedup, I would have thought. In which
  case, how much extra overhead is a syscall or two? A syscall to
  register and unregister the memory in question would alleviate the
  need for the whole TLB scheme, although obviously it would want to be
  unregistered, or care taken with something like fork... So, a nice
  rationale, please.
This is a fair hunk of complexity here :P

Meanwhile, I hope that gives a bit to go on. I'm sorry it has come
relatively late in the game, but I had a week off a while back and then
had (have) some important work work I'm starting to get a handle on...

Thanks,
Nick

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GRU driver feedback
  2008-07-23 14:12 GRU driver feedback Nick Piggin
@ 2008-07-24  2:41 ` Nick Piggin
  2008-07-24  3:26 ` Jack Steiner
  2008-07-28 17:36 ` Jack Steiner
  2 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2008-07-24 2:41 UTC (permalink / raw)
To: Nick Piggin
Cc: Jack Steiner, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

On Thursday 24 July 2008 00:12, Nick Piggin wrote:
> Meanwhile, I hope that gives a bit to go on. I'm sorry it has come
> relatively late in the game, but I had a week off a while back then had
> (have) some important work work I'm starting to get a handle on...
>
> Thanks,
> Nick

Couple of other things I noticed today before I launch into the mmu
notifier and TLB invalidate code proper.

- gru_invalidate_range_end -- the atomic_dec can filter into
  wake_up_all, past the spin_lock in __wake_up, and past the loading of
  the list of tasks. So you can lose a wakeup, I believe (not on x86,
  but on ia64 with release-ordering spinlocks it would be possible).
  atomic_dec_and_test should do the trick, and you might also want to
  consider the memory ordering of the atomic_inc (haven't really looked,
  but it seems quite suspicious to allow it to be reordered).

- you seem to be using cache flushes and memory barriers in different
  ways, but each to push things out to the GRU device. For example,
  start_instruction does a wmb(), then a store, then a CPU cache flush.
  I'm lost as to how the mmio protocol actually works (not the low level
  protocol, but exactly what cache attributes are used, and how the CPU
  pushes things to the device and vice versa). For example, if you are
  using wmb(), this I think implies you are using UC or WC memory to map
  the device, in which case I don't see why you need the gru_flush_cache
  (which would suggest WB memory). Is this documented somewhere?
* Re: GRU driver feedback
  2008-07-23 14:12 GRU driver feedback Nick Piggin
  2008-07-24  2:41 ` Nick Piggin
@ 2008-07-24  3:26 ` Jack Steiner
  2008-07-24  6:52 ` Nick Piggin
  2008-07-28 17:36 ` Jack Steiner
  2 siblings, 1 reply; 13+ messages in thread
From: Jack Steiner @ 2008-07-24 3:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jul 23, 2008 at 04:12:30PM +0200, Nick Piggin wrote:
>
> Hi Jack,
>
> Some review of the GRU driver. Hope it helps. Some trivial.

Thanks for the feedback. I'm at OLS this week & barely reading email.
I'll go thru the comments as soon as I get home next week & will
respond in detail then.

> - I would put all the driver into a single patch. It's a logical change,
>   and splitting them out is not really logical. Unless for example you
>   start with a minimally functional driver and build more things on it,
>   I don't think there is any point avoiding the one big patch. You have to
>   look at the whole thing to understand it properly anyway really.

I would prefer that, too, but was told by one of the more verbose
kernel developers (who will remain nameless) that I should split the
code into multiple patches to make it easier to review. Oh well.....

More responses to follow....

--- jack
* Re: GRU driver feedback
  2008-07-24  3:26 ` Jack Steiner
@ 2008-07-24  6:52 ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2008-07-24 6:52 UTC (permalink / raw)
To: Jack Steiner
Cc: Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

On Wed, Jul 23, 2008 at 10:26:28PM -0500, Jack Steiner wrote:
> On Wed, Jul 23, 2008 at 04:12:30PM +0200, Nick Piggin wrote:
> >
> > Hi Jack,
> >
> > Some review of the GRU driver. Hope it helps. Some trivial.
>
> Thanks for the feedback. I'm at OLS this week & barely reading email.
> I'll go thru the comments as soon as I get home next week & will
> respond in detail then.

OK, no problem. Andrew said he's happy to hold off the driver merge for
a bit, and we should be able to drop it in post-rc1 if we can get it
sorted out.

> > - I would put all the driver into a single patch. It's a logical change,
> >   and splitting them out is not really logical. Unless for example you
> >   start with a minimally functional driver and build more things on it,
> >   I don't think there is any point avoiding the one big patch. You have to
> >   look at the whole thing to understand it properly anyway really.
>
> I would prefer that, too, but was told by one of the more verbose
> kernel developers (who will remain nameless) that I should split the code
> into multiple patches to make it easier to review. Oh well.....

I don't want to make a big stink out of it. But I don't understand why
half (or 1/5th) of a driver is a particularly good unit of change.
Something like 1. basic functionality, 2. more stuff, 3. documentation,
4. kconfig seems a more appropriate splitup if it really must be split,
but IMO each change should as much as possible result in a coherent,
complete source tree before and after.

Anyway, yeah, no big deal, so if Andrew decides to merge them together
or leave them split, I'll go along with that ;)
* Re: GRU driver feedback
  2008-07-23 14:12 GRU driver feedback Nick Piggin
  2008-07-24  2:41 ` Nick Piggin
  2008-07-24  3:26 ` Jack Steiner
@ 2008-07-28 17:36 ` Jack Steiner
  2008-07-28 17:44 ` Jack Steiner
  2008-07-29  2:00 ` Nick Piggin
  2 siblings, 2 replies; 13+ messages in thread
From: Jack Steiner @ 2008-07-28 17:36 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

I appreciate the thorough review. The GRU is a complicated device. I
tried to provide comments in the code, but I know it is still difficult
to understand. You appear to have a pretty good idea of how it works.
I've added a few new comments to the code to make it clearer in a few
cases.

> - I would put all the driver into a single patch. It's a logical change,
>   and splitting them out is not really logical. Unless for example you
>   start with a minimally functional driver and build more things on it,
>   I don't think there is any point avoiding the one big patch. You have to
>   look at the whole thing to understand it properly anyway really.
>
>   Although I guess you could leave the kernel API parts in another patch
>   and maybe even leave them out until you have a good user of them...

The GRU driver is divided into multiple files - each supporting a
different functional area. I thought it was clearer to send the files
as separate patches to 1) make the function of the file/patch more
obvious, and 2) reduce the size of individual patches.

In the future, if the community would prefer single patches (except for
kernel API changes, etc) for new drivers, I can do that. However, I
think it is too late for the GRU. If Andrew prefers, however, I can
resend as a single patch.

> - GRU driver -- gru_intr finds mm to fault pages from, does an "atomic pte
>   lookup" which looks up the pte atomically using similar lockless pagetable
>   walk from get_user_pages_fast.
> This only works because it can guarantee page
> table existence by disabling interrupts on the CPU where mm is currently
> running. It looks like atomic pte lookup can be run on mms which are not
> presently running on the local CPU. This would have been noticed if it had
> been using a specialised function in arch/*/mm/gup.c, because it would not
> have provided an mm_struct parameter ;)

Existence of the mm is guaranteed thru an indirect path. The mm struct
cannot go away until the GRU context that caused the interrupt is
unloaded. When the GRU hardware sends an interrupt, it locks the
context & prevents it from being unloaded until the interrupt is
serviced. If the atomic pte lookup is successful, the subsequent TLB
dropin will unlock the context to allow it to be unloaded. The mm can't
go away until the context is unloaded.

If the atomic pte lookup fails, the TLB miss is converted to a
different kind of TLB miss that must be handled in user context. This
is done using "tfh_user_polling_mode(tfh)" (see failupm:). This also
unlocks the context and allows it to be unloaded - possibly without
ever having serviced the fault.

Does this cover the case you are concerned about, or did I
misunderstand your question?

> - Which leads me to my next observation, the GRU TLB fault code _really_ would
>   like to work the same way as the CPU's fault handler, and that is to
>   synchronously enter the kernel and run in process context of the faulting
>   task. I guess we can't do this because the GRU isn't going to be accessing
>   pages synchronously as the commands are sent to it, right? I guess you could
>   have a kernel thread to handle GRU faults... but that could get into fairness
>   and parallelism issues, and also wouldn't be able to take advantage of
>   lockless pagetable walking either (although maybe they're not such a big deal?
>   I imagine that asking the user process to call into the kernel to process
>   faults isn't optimal either).
> Is there a completely different approach that
> can be taken? Ideally we want the

(Did some text get dropped from the above?)

Your observation is correct. However, some GRU instructions can take a
considerable amount of time to complete, and faults can occur at
various points throughout the execution of the instruction. For
example, a user could do a bcopy of a TB of data. I'm not sure how long
that takes, but it is not quick (wish it was :-). The user can also
have up to 128 instructions simultaneously active on each GRU context
(usually 1 per task but could be more). Using interrupts for TLB miss
notification seems like the only viable solution.

Long term, we do plan to support a kernel thread for TLB dropins.
Another option that has been discussed is to use the buddy HT in the
same cpu core as a helper to do the dropins. The buddy HT is generally
not useful for many classes of HPC codes. Using the HT to perform TLB
dropins is an intriguing idea.

> - "options" in non-static scope? Please change. Putting gru in any global
>   symbols should be enough.

Yuck... Fixed.

> - gru_prefetch -- no users of this. This isn't intended to fault in a user
>   virtual address, is it? (we already have generic kernel functions to do this).
>   But as it isn't used, it may as well go away. And you have to remember if
>   you need page tables set up then they can be torn down again at any time.
>   Oh, maybe it is for possible vmalloc addresses? If so, then OK but please
>   comment.

Fixed (deleted).

> - start_instruction, wait_instruction_complete etc -- take a void pointer
>   which is actually a pointer to one of a number of structures with a specific
>   first int. Ugly. Don't know what the preferred form is, it's probably
>   something device writers have to deal with. Wouldn't something like this
>   be better for typing?
> typedef unsigned int gru_insn_register_t; /* register with cmd bit as lsb */
>
> struct context_configuration_handle {
>         union {
>                 gru_insn_register_t insn_reg;
>                 struct {
>                         unsigned int cmd:1;
>                         /* etc */
>                 };
>         };
> };
>
> void start_instruction(gru_insn_register_t *reg);
>
> Even coding it as a macro would be better because you'd then avoid the
> void * cast and get the type checking.

I'll put this idea on a list for future consideration.

> - You're using gru_flush_cache for a lot of things. Is the device not coherent
>   with CPU caches? (in which case, should the DMA api be used instead?) Or is it
>   simply higher performing if you invalidate the CPU cache ahead of time so the
>   GRU probe doesn't have to?

The device IS coherent. All GRU contexts are mapped as normal WB
memory. The gru_flush_cache serves several purposes. For new
instructions, the cpu needs to kick the instruction out of the cpu
caches in order for it to be seen by the GRU. In other cases, the
flushes improve performance by eliminating GRU probes.

> - In gru_fault, the while() should be an if(), it would be much nicer to use a
>   waitqueue or semaphore to allocate gru contexts (and if you actively need to
>   steal a context to prevent starvation, you can use eg down_timeout()).

Fixed.

The logic to steal contexts is a terrible hack. Although it works, it
would do a poor job of handling oversubscription of GRU resources on a
real workload. This area will be replaced, but first we need to decide
how reservation of GRU resources will be done. It may involve changes
to the batch scheduler. I'll keep down_timeout() in mind...

> - You use prefetchw around the place and say it is sometimes required for
>   correctness. Except it would be free to be ignored (say if the CPU didn't
>   have a TLB entry or full prefetch queue, or even the Linux implementation
>   could be a noop).

AFAICT, all of the calls to prefetchw() are performance optimizations.
Nothing breaks if the requests are ignored.
The one exception is when running on the hardware simulator.
prefetchw() is required on the simulator. Most calls to prefetchw()
have the comment:

        /* Helps on hardware, required for emulator */

I added comments where it was missing.

> - In gru_fault, I don't think you've validated the size of the vma, and it
>   definitely seems like you haven't taken offset into the vma into account
>   either. remap_pfn_range etc should probably validate the former because I'm
>   sure this isn't the only driver that might get it wrong. The latter can't
>   really be detected though. Please fix or make it more obviously correct (eg
>   a comment).

ZZZ

> - And please make sure in general that this thing properly fixes/rejects
>   nasty code and attempts to crash the kernel. I'm sure it can't be used,
>   for example, write into readonly pages or read from kernel memory etc.
>   (I'm sure you're very conscious of this, but humour me! :P)
>
> - I have a rough handle on it, but can you sketch out exactly how it is used,
>   and how the virtual/physical/etc memory activity happens? Unfortunately, in
>   its present state, this is going to have to be something that mm/ developers
>   will have to get an understanding of, and I'm a bit lost when it comes to
>   driver code. (Or have I missed some piece of documentation?) Although there
>   are a lot of comments, most are fairly low level. I want a high level
>   overview and description of important interactions. I agree with Hugh in
>   some respects GRU is special, the bar is a bit higher than the average
>   driver.

I added some text and a diagram to "grutables.h".

> As far as I understand it A process can use GRU to accelerate some memory and
> simple transformation operations. This can only occur within the address
> space of a single mm (and by the looks you have some kernel kva<->kva
> interfaces there too?)

Correct. The GRU driver also exports a few interfaces to the XPC/XPMEM
drivers. These allow for use of the GRU for cross partition access.
The GRU uses physical addressing mode for all kernel level
instructions.

> OK, and the way for a user to operate the GRU you have provided is to mmap
> /dev/gru, which gives the process essentially an mmio control page (Oh, it's
> big, maybe that's more than a control area? Anyway let's call it a control
> area...).

Correct (although we refer to it as a gru context). And it's not mmio.

> So the user can stuff some commands into it (that can't possibly take down
> the machine?

Correct.

> Or if they could then /dev/gru is root only? ;)) Such as
> giving it some user virtual addresses and asking it to copy from one to
> the other or something much cooler.

The GRU is user-safe. GRU instructions in general can't do anything
that a user could not do using normal load/store/AMO instructions. The
one exception is that users can directly reference across SSIs using
numalink. However, these references still go thru the GRU TLB and have
full validation of all accesses.

> User faults writing to control area, driver steals context (by zapping ptes)
> from process currently using it, and hands it to us.

Correct, although in theory no "steal" is needed, because a batch
scheduler did the necessary resource reservation. However, in some
cases, a "steal" may occur.

> Then GRU faults on the new virtual address it is being asked to operate on,
> and raises an interrupt. Interrupt handler finds the process's physical
> address from virtual and stuffs it into the GRU. The GRU and the process
> are each now merrily doing their own thing. If the process calls a fork()
> for example and some virtual addresses need to be write protected, we must
> also ensure the GRU can't write to these either so we need to invalidate
> its TLB before we can continue.

Correct.

> GRU TLB invalidation is done with mmu notifiers... I'll think about this
> some more because I'll have more comments on mmu notifiers (and probably
> GRU TLB invalidation).

Correct.
> - Finally: what is it / can it be used for? What sort of performance numbers
>   do you see? This whole TLB and control page scheme presumably is to avoid
>   kernel entry at all costs... the thing is, the GRU is out on the IO bus

No. The GRU is not on an IO bus. It is an integral part of the chipset.
Specifically, it is on the node controller that connects directly to a
port on the cpu socket.

>   anyway, and is going to be invalidating the CPU's cache in operation. It's
>   going to need to be working on some pretty big memory areas in order to
>   be a speedup I would have thought. In which case, how much extra overhead
>   is a syscall or two? A syscall register and unregister the memory in
>   question would alleviate the need for the whole TLB scheme, although
>   obviously it would want to be unregistered or care taken with something
>   like fork... So, a nice rationale, please. This is a fair hunk of
>   complexity here :P

A single GRU (in the node controller) can support up to 16 user
contexts, each running asynchronously and all sharing the same TLB.
Since each user can potentially access ANY virtual address, it is not
practical to explicitly lock addresses. GRU instructions can also be
strided or can use scatter/gather lists. Again, explicit locking is not
practical, because the number of entries that would need to be locked
is unbounded.

On Thu, Jul 24, 2008 at 12:41:50PM +1000, Nick Piggin wrote:
>
> Couple of other things I noticed today before I launch into the mmu
> notifier and TLB invalidate code proper.
>
> - gru_invalidate_range_end -- atomic_dec can filter into wake_up_all, past
>   the spin_lock in __wake_up, and past the loading of the list of tasks. So
>   you can lose a wakeup I believe (not on x86, but on ia64 with release
>   ordering spinlocks it would be possible).
> atomic_dec_and_test should do
> the trick, and you might also want to consider memory ordering of the
> atomic_inc (haven't really looked, but it seems quite suspicious to allow
> it to be reordered).

Fixed. (Nice catch... I wonder if we could ever have debugged this
failure.)

> - you seem to be using cache flushes and memory barriers in different ways
>   but each to push out things to the GRU device. For example start_instruction
>   does a wmb() then a store, then a CPU cache flush.

Correct. Setting bit 0 (the cmd bit) MUST be done last. Starting a new
instruction consists of a number of stores of instruction parameters,
followed by setting the cmd bit to indicate a new instruction. The
wmb() ensures that all parameters are in the instruction BEFORE the cmd
bit is set. After setting the cmd bit, the cacheline is flushed to the
GRU to start the instruction.

> I'm lost as to how the mmio protocol actually works (not the low level
> protocol, but exactly what cache attributes are used, and how the CPU
> pushes things to the device and vice versa).

Not MMIO. All GRU accesses use fully coherent WB memory.

> For example, if you are using wmb(), this I think implies you are using
> UC or WC memory to map the device, in which case I don't see why you need
> the gru_flush_cache (which would suggest WB memory). Is this documented
> somewhere?

Added to the new text in gruhandles.h. Here is a copy of the text I
added:

-----------------------------------------------------------------
/*
 * GRU Chiplet:
 *   The GRU is a user addressable memory accelerator. It provides
 *   several forms of load, store, memset, bcopy instructions. In
 *   addition, it contains special instructions for AMOs, sending
 *   messages to message queues, etc.
 *
 *   The GRU is an integral part of the node controller. It connects
 *   directly to the cpu socket. In its current implementation, there
 *   are 2 GRU chiplets in the node controller on each blade (~node).
 *
 *   The entire context is fully coherent and cacheable by the cpus.
 *
 *   Each GRU chiplet has a physical memory map that looks like the
 *   following:
 *
 *      +-----------------+
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      |/////////////////|
 *      +-----------------+
 *      |  system control |
 *      +-----------------+        _______ +-------------+
 *      |/////////////////|       /        |             |
 *      |/////////////////|      /         |             |
 *      |/////////////////|     /          | instructions|
 *      |/////////////////|    /           |             |
 *      |/////////////////|   /            |             |
 *      |/////////////////|  /             |-------------|
 *      |/////////////////| /              |             |
 *      +-----------------+                |             |
 *      |   context 15    |                |  data       |
 *      +-----------------+                |             |
 *      |     ......      | \              |             |
 *      +-----------------+  \____________ +-------------+
 *      |   context 1     |
 *      +-----------------+
 *      |   context 0     |
 *      +-----------------+
 *
 *   Each of the "contexts" is a chunk of memory that can be mmaped into
 *   user space. The context consists of 2 parts:
 *
 *     - an instruction space that can be directly accessed by the user
 *       to issue GRU instructions and to check instruction status.
 *
 *     - a data area that acts as normal RAM.
 *
 *   User instructions contain virtual addresses of data to be accessed
 *   by the GRU. The GRU contains a TLB that is used to convert these
 *   user virtual addresses to physical addresses.
 *
 *   The "system control" area of the GRU chiplet is used by the kernel
 *   driver to manage user contexts and to perform functions such as
 *   TLB dropin and purging.
 *
 *   One context may be reserved for the kernel and used for
 *   cross-partition communication. The GRU will also be used to
 *   asynchronously zero out large blocks of memory (not currently
 *   implemented).
 */

--- jack
* Re: GRU driver feedback
  2008-07-28 17:36 ` Jack Steiner
@ 2008-07-28 17:44 ` Jack Steiner
  2008-07-29  2:00 ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Jack Steiner @ 2008-07-28 17:44 UTC (permalink / raw)
Cc: Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

Whoops:

> > - In gru_fault, I don't think you've validated the size of the vma, and it
> >   definitely seems like you haven't taken offset into the vma into account
> >   either. remap_pfn_range etc should probably validate the former because I'm
> >   sure this isn't the only driver that might get it wrong. The latter can't
> >   really be detected though. Please fix or make it more obviously correct (eg
> >   a comment).
>
> ZZZ

s/ZZZ/fixed/

--- jack
* Re: GRU driver feedback
  2008-07-28 17:36 ` Jack Steiner
  2008-07-28 17:44 ` Jack Steiner
@ 2008-07-29  2:00 ` Nick Piggin
  2008-07-29 18:53 ` Robin Holt
  1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2008-07-29 2:00 UTC (permalink / raw)
To: Jack Steiner
Cc: Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List

On Tuesday 29 July 2008 03:36, Jack Steiner wrote:
> I appreciate the thorough review. The GRU is a complicated device. I
> tried to provide comments in the code but I know it is still difficult
> to understand.
>
> You appear to have a pretty good idea of how it works. I've added a
> few new comments to the code to make it clearer in a few cases.

Hi Jack,

Thanks very much for your thorough comments in return. I will take
longer to digest them, but a quick reply now because you're probably
rushing to get things merged... So I think you've resolved all my
concerns except one.

> > - GRU driver -- gru_intr finds mm to fault pages from, does an "atomic
> >   pte lookup" which looks up the pte atomically using similar lockless
> >   pagetable walk from get_user_pages_fast. This only works because it can
> >   guarantee page table existence by disabling interrupts on the CPU where
> >   mm is currently running. It looks like atomic pte lookup can be run on
> >   mms which are not presently running on the local CPU. This would have
> >   been noticed if it had been using a specialised function in
> >   arch/*/mm/gup.c, because it would not have provided an mm_struct
> >   parameter ;)
>
> Existence of the mm is guaranteed thru an indirect path. The mm
> struct cannot go away until the GRU context that caused the interrupt
> is unloaded. When the GRU hardware sends an interrupt, it locks the
> context & prevents it from being unloaded until the interrupt is
> serviced. If the atomic pte is successful, the subsequent TLB dropin
> will unlock the context to allow it to be unloaded. The mm can't go
> away until the context is unloaded.
It is not existence of the mm that I am worried about, but existence of the page tables. get_user_pages_fast works the way it does on x86 because x86's pagetable shootdown and TLB flushing require that an IPI be sent to all running threads of a process before page tables are freed. So if `current` is one such thread, and wants to do a page table walk of its own mm, then it can guarantee page table existence by turning off interrupts (and so blocking the IPI). This will not work if you are trying to walk down somebody else's page tables because there is nothing to say the processor you are running on will get an IPI. This is why get_user_pages_fast cannot work if task != current or mm != current->mm. So I think there is still a problem.
* Re: GRU driver feedback 2008-07-29 2:00 ` Nick Piggin @ 2008-07-29 18:53 ` Robin Holt 2008-07-30 5:50 ` Nick Piggin 0 siblings, 1 reply; 13+ messages in thread From: Robin Holt @ 2008-07-29 18:53 UTC (permalink / raw) To: Nick Piggin Cc: Jack Steiner, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Tue, Jul 29, 2008 at 12:00:09PM +1000, Nick Piggin wrote: > On Tuesday 29 July 2008 03:36, Jack Steiner wrote: > > I appreciate the thorough review. The GRU is a complicated device. I > > tried to provide comments in the code but I know it is still difficult > > to understand. > > > > You appear to have a pretty good idea of how it works. I've added a > > few new comments to the code to make it clearer in a few cases. > > Hi Jack, > > Thanks very much for your thorough comments in return. I will take longer > to digest them, but quick reply now because you're probably rushing to > get things merged... So I think you've resolved all my concerns except one. > > > > > - GRU driver -- gru_intr finds mm to fault pages from, does an "atomic > > > pte lookup" which looks up the pte atomically using similar lockless > > > pagetable walk from get_user_pages_fast. This only works because it can > > > guarantee page table existence by disabling interrupts on the CPU where > > > mm is currently running. It looks like atomic pte lookup can be run on > > > mms which are not presently running on the local CPU. This would have > > > been noticed if it had been using a specialised function in > > > arch/*/mm/gup.c, because it would not have provided an mm_struct > > > parameter ;) > > > > Existence of the mm is guaranteed thru an indirect path. The mm > > struct cannot go away until the GRU context that caused the interrupt > > is unloaded. When the GRU hardware sends an interrupt, it locks the > > context & prevents it from being unloaded until the interrupt is > > serviced. 
If the atomic pte is successful, the subsequent TLB dropin > > will unlock the context to allow it to be unloaded. The mm can't go > > away until the context is unloaded. > > It is not existence of the mm that I am worried about, but existence > of the page tables. get_user_pages_fast works the way it does on x86 > because x86's pagetable shootdown and TLB flushing requires that an > IPI be sent to all running threads of a process before page tables > are freed. So if `current` is one such thread, and wants to do a page > table walk of its own mm, then it can guarantee page table existence > by turning off interrupts (and so blocking the IPI). > > This will not work if you are trying to walk down somebody else's > page tables because there is nothing to say the processor you are > running on will get an IPI. This is why get_user_pages_fast can not > work if task != current or mm != current->mm. > > So I think there is still a problem. I reserve the right to be wrong, but I think this is covered. First, let me be clear about what I gathered your concern is. I assume you are addressing the case of page tables being torn down. In the case where unmap_region is clearing page tables, the caller to unmap_region is expected to be holding the mmap_sem writably. Jack's fault handler will immediately return when it fails on the down_read_trylock(). In the exit_mmap case, the sequence is mmu_notifier_release(), ... some other stuff ..., free_pgtables(). Here is where special hardware comes into play again. The mmu_notifier_release() callout to the GRU will result in all of the GRU contexts for this MM being torn down. That process will free the actual hardware's context. The hardware-free portion of the teardown will not complete until all NUMA traffic associated with this context is finished and all fault interrupts have been either ack'd or terminated (last part of the interrupt handler). I am sorry for the rushed explanation.
I hope my understanding of your concern is correct and my explanation is clear. Thanks, Robin
* Re: GRU driver feedback 2008-07-29 18:53 ` Robin Holt @ 2008-07-30 5:50 ` Nick Piggin 2008-07-31 7:14 ` Nick Piggin 0 siblings, 1 reply; 13+ messages in thread From: Nick Piggin @ 2008-07-30 5:50 UTC (permalink / raw) To: Robin Holt, Torvalds, Linus Cc: Jack Steiner, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Wednesday 30 July 2008 04:53, Robin Holt wrote: > On Tue, Jul 29, 2008 at 12:00:09PM +1000, Nick Piggin wrote: > > On Tuesday 29 July 2008 03:36, Jack Steiner wrote: > > > I appreciate the thorough review. The GRU is a complicated device. I > > > tried to provide comments in the code but I know it is still difficult > > > to understand. > > > > > > You appear to have a pretty good idea of how it works. I've added a > > > few new comments to the code to make it clearer in a few cases. > > > > Hi Jack, > > > > Thanks very much for your thorough comments in return. I will take longer > > to digest them, but quick reply now because you're probably rushing to > > get things merged... So I think you've resolved all my concerns except > > one. > > > > > > - GRU driver -- gru_intr finds mm to fault pages from, does an > > > > "atomic pte lookup" which looks up the pte atomically using similar > > > > lockless pagetable walk from get_user_pages_fast. This only works > > > > because it can guarantee page table existence by disabling interrupts > > > > on the CPU where mm is currently running. It looks like atomic pte > > > > lookup can be run on mms which are not presently running on the local > > > > CPU. This would have been noticed if it had been using a specialised > > > > function in > > > > arch/*/mm/gup.c, because it would not have provided an mm_struct > > > > parameter ;) > > > > > > Existence of the mm is guaranteed thru an indirect path. The mm > > > struct cannot go away until the GRU context that caused the interrupt > > > is unloaded. 
When the GRU hardware sends an interrupt, it locks the > > > context & prevents it from being unloaded until the interrupt is > > > serviced. If the atomic pte is successful, the subsequent TLB dropin > > > will unlock the context to allow it to be unloaded. The mm can't go > > > away until the context is unloaded. > > > > It is not existence of the mm that I am worried about, but existence > > of the page tables. get_user_pages_fast works the way it does on x86 > > because x86's pagetable shootdown and TLB flushing requires that an > > IPI be sent to all running threads of a process before page tables > > are freed. So if `current` is one such thread, and wants to do a page > > table walk of its own mm, then it can guarantee page table existence > > by turning off interrupts (and so blocking the IPI). > > > > This will not work if you are trying to walk down somebody else's > > page tables because there is nothing to say the processor you are > > running on will get an IPI. This is why get_user_pages_fast can not > > work if task != current or mm != current->mm. > > > > So I think there is still a problem. > > I reserve the right to be wrong, but I think this is covered. You're most welcome to have that right :) I exercise mine all the time! In this case, I don't think you'll need it though. > First, let me be clear about what I gathered your concern is. I assume > you are addressing the case of page tables being torn down. That's what I was worried about in the above message, yes. > In the case where unmap_region is clearing page tables, the caller to > unmap_region is expected to be holding the mmap_sem writably. Jacks fault > handler will immediately return when it fails on the down_read_trylock(). No, you are right of course. 
I had in my mind the problems faced by lockless get_user_pages, in which case I was worried about the page table existence, but missed the fact that you're holding mmap_sem to provide existence (which it would, as you note, although one day we may want to reclaim page tables or something that doesn't take mmap_sem, so a big comment would be nice here). > In the exit_mmap case, the sequence is mmu_notifier_release(), ... some > other stuff ..., free_pgtables(). Here is where special hardware > comes into play again. The mmu_notifier_release() callout to the GRU > will result in all of the GRU contexts for this MM to be torn down. > That process will free the actual hardware's context. The hardware-free > portion of the hardware will not complete until all NUMA traffic > associated with this context are finished and all fault interrupts have > been either ack'd or terminated (last part of the interrupt handler). > > I am sorry for the rushed explanation. I hope my understanding of your > concern is correct and my explanation is clear. You are right I think. Hmm, isn't there a memory ordering problem in gru_try_dropin, between atomic_read(ms_range_active); and atomic_pte_lookup();? The write side goes like this: pte_clear(pte); atomic_dec_and_test(ms_range_active); So the atomic pte lookup could potentially execute its loads first, which finds the pte not yet cleared; and then the load of ms_range_active executes and finds ms_range_active is 0, and the should-have-been-invalidated pte gets inserted. I'm slightly scared of this flush-tlb-before-clearing-ptes design of tlb flushing, as I've said lots of times now. I *think* it seems OK after this (and the other) memory ordering issues are fixed, but you can just see that it is more fragile and complex. Anyway, I'll try to come up with some patches eventually to change all that, and will try to get them merged. I'll stop whining about it now :)
* Re: GRU driver feedback 2008-07-30 5:50 ` Nick Piggin @ 2008-07-31 7:14 ` Nick Piggin 2008-07-31 12:40 ` Jack Steiner 0 siblings, 1 reply; 13+ messages in thread From: Nick Piggin @ 2008-07-31 7:14 UTC (permalink / raw) To: Robin Holt Cc: Torvalds, Linus, Jack Steiner, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Wednesday 30 July 2008 15:50, Nick Piggin wrote: > On Wednesday 30 July 2008 04:53, Robin Holt wrote: > > In the case where unmap_region is clearing page tables, the caller to > > unmap_region is expected to be holding the mmap_sem writably. Jacks > > fault handler will immediately return when it fails on the > > down_read_trylock(). > > No, you are right of course. I had in my mind the problems faced by > lockless get_user_pages, in which case I was worried about the page table > existence, but missed the fact that you're holding mmap_sem to provide > existence (which it would, as you note, although one day we may want to > reclaim page tables or something that doesn't take mmap_sem, so a big > comment would be nice here). The other thing is... then GRU should get rid of the local_irq_disable in the atomic pte lookup. By definition it is worthless if we can be operating on an mm that is not running on current (and if I understand correctly, sn2 can avoid sending tlb flush IPIs completely sometimes?)
* Re: GRU driver feedback 2008-07-31 7:14 ` Nick Piggin @ 2008-07-31 12:40 ` Jack Steiner 2008-08-01 12:11 ` Hugh Dickins 0 siblings, 1 reply; 13+ messages in thread From: Jack Steiner @ 2008-07-31 12:40 UTC (permalink / raw) To: Nick Piggin Cc: Robin Holt, Torvalds, Linus, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Thu, Jul 31, 2008 at 05:14:04PM +1000, Nick Piggin wrote: > On Wednesday 30 July 2008 15:50, Nick Piggin wrote: > > On Wednesday 30 July 2008 04:53, Robin Holt wrote: > > > > In the case where unmap_region is clearing page tables, the caller to > > > unmap_region is expected to be holding the mmap_sem writably. Jacks > > > fault handler will immediately return when it fails on the > > > down_read_trylock(). > > > > No, you are right of course. I had in my mind the problems faced by > > lockless get_user_pages, in which case I was worried about the page table > > existence, but missed the fact that you're holding mmap_sem to provide > > existence (which it would, as you note, although one day we may want to > > reclaim page tables or something that doesn't take mmap_sem, so a big > > comment would be nice here). > > The other thing is... then GRU should get rid of the local_irq_disable > in the atomic pte lookup. By definition it is worthless if we can be > operating on an mm that is not running on current (and if I understand > correctly, sn2 can avoid sending tlb flush IPIs completely sometimes?) Done. I'm collecting the fixes & additional comments to be added & will send them upstream later. Thanks for the careful review. --- jack
* Re: GRU driver feedback 2008-07-31 12:40 ` Jack Steiner @ 2008-08-01 12:11 ` Hugh Dickins 2008-08-01 20:09 ` Jack Steiner 0 siblings, 1 reply; 13+ messages in thread From: Hugh Dickins @ 2008-08-01 12:11 UTC (permalink / raw) To: Jack Steiner Cc: Nick Piggin, Robin Holt, Torvalds, Linus, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Thu, 31 Jul 2008, Jack Steiner wrote: > > I'm collecting the fixes & additional comments to be added & will send > them upstream later. One small thing to remove if you've not already noticed: EXPORT_SYMBOL_GPL(follow_page) got added to mm/memory.c, despite our realization that GRU cannot and now does not use it. Hugh
* Re: GRU driver feedback 2008-08-01 12:11 ` Hugh Dickins @ 2008-08-01 20:09 ` Jack Steiner 0 siblings, 0 replies; 13+ messages in thread From: Jack Steiner @ 2008-08-01 20:09 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Robin Holt, Torvalds, Linus, Nick Piggin, Andrew Morton, Linux Memory Management List, Linux Kernel Mailing List On Fri, Aug 01, 2008 at 01:11:09PM +0100, Hugh Dickins wrote: > On Thu, 31 Jul 2008, Jack Steiner wrote: > > > > I'm collecting the fixes & additional comments to be added & will send > > them upstream later. > > One small thing to remove if you've not already noticed: > EXPORT_SYMBOL_GPL(follow_page) got added to mm/memory.c, > despite our realization that GRU cannot and now does not use it. > Thanks for catching this. I sent a patch upstream a few minutes ago to remove the export. --- jack
end of thread, other threads:[~2008-08-01 20:09 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-07-23 14:12 GRU driver feedback Nick Piggin 2008-07-24 2:41 ` Nick Piggin 2008-07-24 3:26 ` Jack Steiner 2008-07-24 6:52 ` Nick Piggin 2008-07-28 17:36 ` Jack Steiner 2008-07-28 17:44 ` Jack Steiner 2008-07-29 2:00 ` Nick Piggin 2008-07-29 18:53 ` Robin Holt 2008-07-30 5:50 ` Nick Piggin 2008-07-31 7:14 ` Nick Piggin 2008-07-31 12:40 ` Jack Steiner 2008-08-01 12:11 ` Hugh Dickins 2008-08-01 20:09 ` Jack Steiner