* [GIT PULL] please pull ummunotify
@ 2009-09-11  4:38 Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  4:38 UTC
To: torvalds, akpm, jsquyres; +Cc: linux-rdma, general, linux-kernel

Linus, please consider pulling from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementations (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings. My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:

    http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/

His summary of ummunotify:

    "It's elegant, doesn't require strange linker tricks, and seems to
    work in all cases. Yay!"

This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks. Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).

Roland Dreier (1):
      ummunotify: Userspace support for MMU notifications

 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++++++
 9 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h
* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
@ 2009-09-15 11:34 ` Pavel Machek
From: Pavel Machek @ 2009-09-15 11:34 UTC
To: Roland Dreier; +Cc: torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

Hi!

> Linus, please consider pulling from
>
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementations (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
>     http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
>     "It's elegant, doesn't require strange linker tricks, and seems to
>     work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I don't remember seeing discussion of this on lkml. Yes it is in
-next...

Basically it allows app to 'trace itself'? ...with interesting mmap()
interface, exporting int to userspace, hoping it behaves atomically...?

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15 11:34 ` Pavel Machek
@ 2009-09-15 14:57 ` Roland Dreier
From: Roland Dreier @ 2009-09-15 14:57 UTC
To: Pavel Machek; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds

 > I don't remember seeing discussion of this on lkml. Yes it is in
 > -next...

e.g. http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
and earlier patches.

 > Basically it allows app to 'trace itself'? ...with interesting mmap()
 > interface, exporting int to userspace, hoping it behaves atomically...?

Yes, it allows an app to trace what the kernel does to memory mappings.
I don't believe there's any real issue with the atomicity of mmap'ed
memory, since userspace really just tests whether the value it reads is
equal to the previously read value or not.

 - R.
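The equality test Roland describes can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain variable (`fake_counter`) stands in for the counter that would live in the mmap()ed page, and the names `counter_changed` and `last_seen` are hypothetical, not taken from the ummunotify patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the counter page that would be mmap()ed from
 * the device; in this sketch it is just a variable bumped by hand.
 */
static volatile uint64_t fake_counter;

/*
 * Return 1 if the counter moved since the caller last looked, 0 otherwise.
 * Only equality is tested, so the exact value never matters.
 */
static int counter_changed(const volatile uint64_t *counter,
                           uint64_t *last_seen)
{
    uint64_t now = *counter;    /* single read of the shared value */

    if (now == *last_seen)
        return 0;
    *last_seen = now;           /* remember the value for the next check */
    return 1;
}
```

Because the caller only asks "same as before or not", a changed value simply sends it down the slow path, where it would go and fetch the actual invalidation events.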
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15 14:57 ` Roland Dreier
@ 2009-09-28 20:49 ` Pavel Machek
From: Pavel Machek @ 2009-09-28 20:49 UTC
To: Roland Dreier; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds

On Tue 2009-09-15 07:57:56, Roland Dreier wrote:
>
>  > I don't remember seeing discussion of this on lkml. Yes it is in
>  > -next...
>
> e.g. http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> and earlier patches.

Well... it seems a little overspecialized. Just modifying libc to
provide the hooks you want looks like a better solution.

>  > Basically it allows app to 'trace itself'? ...with interesting mmap()
>  > interface, exporting int to userspace, hoping it behaves atomically...?
>
> Yes, it allows an app to trace what the kernel does to memory mappings.
> I don't believe there's any real issue with the atomicity of mmap'ed
> memory, since userspace really just tests whether the value it reads is
> equal to the previously read value or not.

That still needs memory barriers etc. to ensure reliable operation,
no?

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-28 20:49 ` Pavel Machek
@ 2009-09-28 21:40 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2009-09-28 21:40 UTC
To: Pavel Machek; +Cc: Roland Dreier, linux-rdma, linux-kernel, general, akpm, torvalds

On Mon, Sep 28, 2009 at 10:49:23PM +0200, Pavel Machek wrote:

> >  > I don't remember seeing discussion of this on lkml. Yes it is in
> >  > -next...
> >
> > e.g. http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> > and earlier patches.

> Well... it seems a little overspecialized. Just modifying libc to
> provide the hooks you want looks like a better solution.

That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.

The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
meant to go hand-in-hand with the specialized RDMA memory pinning
interfaces.

> >  > Basically it allows app to 'trace itself'? ...with interesting mmap()
> >  > interface, exporting int to userspace, hoping it behaves atomically...?
> >
> > Yes, it allows an app to trace what the kernel does to memory mappings.
> > I don't believe there's any real issue with the atomicity of mmap'ed
> > memory, since userspace really just tests whether the value it reads is
> > equal to the previously read value or not.
>
> That still needs memory barriers etc. to ensure reliable operation,
> no?

No, I don't think so. The application is expected to provide
sequencing of some sort between the memory call (mmap/munmap/brk/etc)
and the int check - usually just by running in the same thread, or
through some kind of locking scheme.

As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.

For example, the most common problem to solve looks like this:

    x = mmap(...)
    do RDMA with x
    [..]
    munmap(x);
    [..]
    y = mmap(..);
    do RDMA with y

If by chance x == y, things explode.

So this API puts the int test directly before 'do RDMA with'.

Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or to have a very low overhead check every time the memory region is
about to be used.

Jason
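Jason's "int test directly before 'do RDMA with'" might look something like the sketch below. It is a simplified model, not the ummunotify interface: `mapping_generation` is a hand-bumped stand-in for the kernel-maintained counter, and `struct cached_reg`, `reg_cache_store` and `reg_cache_usable` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One cached registration; every field name here is hypothetical. */
struct cached_reg {
    void    *addr;        /* start of the registered buffer */
    size_t   len;         /* length of the buffer in bytes */
    uint64_t generation;  /* counter value when the entry was created */
    int      valid;
};

/* Stand-in for the kernel-maintained counter; bumped by hand in tests. */
static uint64_t mapping_generation;

/* Record a registration together with the counter value seen at the time. */
static void reg_cache_store(struct cached_reg *e, void *addr, size_t len)
{
    e->addr = addr;
    e->len = len;
    e->generation = mapping_generation;
    e->valid = 1;
}

/*
 * The check placed directly before "do RDMA with x": the entry may be
 * reused only if the address matches and no mapping change has been
 * observed since it was cached.
 */
static int reg_cache_usable(const struct cached_reg *e, const void *addr)
{
    return e->valid && e->addr == addr &&
           e->generation == mapping_generation;
}
```

With this shape, the x == y case stops being dangerous: the munmap() in between moves the counter, so the stale entry for x is rejected even though y has the same address. A single global counter is deliberately coarse; any change invalidates every entry, and a real cache would presumably narrow that down per range.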
* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
@ 2009-09-11  5:56 ` KOSAKI Motohiro
From: KOSAKI Motohiro @ 2009-09-11  5:56 UTC
To: Roland Dreier; +Cc: kosaki.motohiro, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

Hi Roland,

> Linus, please consider pulling from
>
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementations (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
>     http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
>     "It's elegant, doesn't require strange linker tricks, and seems to
>     work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I'm sorry, I haven't reviewed this code and I didn't track the
discussion carefully, but I have one stupid question. May I ask?

Does this version already solve the fork() + COW issue? If so, could you
please explain what happens at fork? Obviously RDMA points to either the
parent's or the child's page, not both. But the current COW rule is that
the first process to touch the page gets the copied page and the other
process still owns the original page. I think that is unexpected
behavior from RDMA's point of view.
* [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-11  5:56 ` KOSAKI Motohiro
@ 2009-09-11  6:03 ` Roland Dreier
From: Roland Dreier @ 2009-09-11  6:03 UTC
To: KOSAKI Motohiro; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds

 > Does this version already solve the fork() + COW issue? If so, could you
 > please explain what happens at fork? Obviously RDMA points to either the
 > parent's or the child's page, not both. But the current COW rule is that
 > the first process to touch the page gets the copied page and the other
 > process still owns the original page. I think that is unexpected
 > behavior from RDMA's point of view.

No, ummunotify doesn't really help that much with fork() + COW. If a
parent forks and then touches pages that are actively in use for RDMA,
then of course they get COWed and RDMA goes to the wrong memory (from
the point of view of the parent).

ummunotify does deal with the case where a process forks and touches
memory that was used for RDMA but no longer is -- in that case, the MPI
library has a chance to flush its registration cache because it will get
a ummunotify event invalidating the old mapping.

The real purpose of ummunotify is to allow MPI implementations to cache
registrations, even when the MPI library is used with an application
that does funny things for allocation (mmap()/munmap() or brk(), etc).

 - Roland
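To make the "flush its registration cache" step concrete, here is a minimal sketch of what an MPI library might do when told that an address range was invalidated. The event format is not modeled here; `struct reg_entry`, `ranges_overlap` and `flush_invalidated` are hypothetical names, and the invalidated range is simply passed in as two integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry: a registered range [start, end). */
struct reg_entry {
    uintptr_t start;
    uintptr_t end;
    int       valid;
};

/* Two half-open ranges overlap when each starts before the other ends. */
static int ranges_overlap(uintptr_t a_start, uintptr_t a_end,
                          uintptr_t b_start, uintptr_t b_end)
{
    return a_start < b_end && b_start < a_end;
}

/*
 * Drop every cached registration touched by an invalidated range and
 * return how many entries were flushed.
 */
static size_t flush_invalidated(struct reg_entry *cache, size_t n,
                                uintptr_t inv_start, uintptr_t inv_end)
{
    size_t flushed = 0;

    for (size_t i = 0; i < n; i++) {
        if (cache[i].valid &&
            ranges_overlap(cache[i].start, cache[i].end,
                           inv_start, inv_end)) {
            cache[i].valid = 0;
            flushed++;
        }
    }
    return flushed;
}
```

Entries that do not overlap the invalidated range stay cached, which is the point of caching registrations in the first place: only the mappings the application actually disturbed have to be registered again.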
* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:03 ` Roland Dreier
@ 2009-09-11  6:11 ` KOSAKI Motohiro
From: KOSAKI Motohiro @ 2009-09-11  6:11 UTC
To: Roland Dreier; +Cc: kosaki.motohiro, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

Hi

Thank you for the explanation.

>  > Does this version already solve the fork() + COW issue? If so, could you
>  > please explain what happens at fork? Obviously RDMA points to either the
>  > parent's or the child's page, not both. But the current COW rule is that
>  > the first process to touch the page gets the copied page and the other
>  > process still owns the original page. I think that is unexpected
>  > behavior from RDMA's point of view.
>
> No, ummunotify doesn't really help that much with fork() + COW. If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).

So, can we assume that an Open MPI user process doesn't do such a thing?

Perhaps madvise(DONTFORK) or vfork() avoids this issue, but I'm not
sure every program in the world does that.

> ummunotify does deal with the case where a process forks and touches
> memory that was used for RDMA but no longer is -- in that case, the MPI
> library has a chance to flush its registration cache because it will get
> a ummunotify event invalidating the old mapping.
>
> The real purpose of ummunotify is to allow MPI implementations to cache
> registrations, even when the MPI library is used with an application
> that does funny things for allocation (mmap()/munmap() or brk(), etc).

Yup, that's very worthwhile.
* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:11 ` KOSAKI Motohiro
@ 2009-09-11 16:42 ` Gleb Natapov
From: Gleb Natapov @ 2009-09-11 16:42 UTC
To: KOSAKI Motohiro; +Cc: Roland Dreier, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

On Fri, Sep 11, 2009 at 03:11:36PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> Thank you for the explanation.
>
> >  > Does this version already solve the fork() + COW issue? If so, could you
> >  > please explain what happens at fork? Obviously RDMA points to either the
> >  > parent's or the child's page, not both. But the current COW rule is that
> >  > the first process to touch the page gets the copied page and the other
> >  > process still owns the original page. I think that is unexpected
> >  > behavior from RDMA's point of view.
> >
> > No, ummunotify doesn't really help that much with fork() + COW. If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
>
> So, can we assume that an Open MPI user process doesn't do such a thing?
>
> Perhaps madvise(DONTFORK) or vfork() avoids this issue, but I'm not
> sure every program in the world does that.
>
MPI (or is it libibverbs?) marks all registered memory as DONTFORK.

--
			Gleb.
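For reference, marking a buffer as DONTFORK comes down to one madvise() call with MADV_DONTFORK. The sketch below shows only that call on an anonymous mapping; the helper names `mark_dontfork` and `alloc_rdma_buffer` are made up for illustration, and nothing here reflects how Open MPI or libibverbs actually structure their code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MADV_DONTFORK and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask the kernel not to make this range available to children created by
 * fork(), so the parent's pages are not set up for copy-on-write.
 * The start address must be page-aligned.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mark_dontfork(void *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTFORK);
}

/* Map one anonymous region to stand in for a buffer used for RDMA. */
static void *alloc_rdma_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

A child created after this call does not see the range at all, so touching the buffer in the parent cannot move it away from the pages the hardware was given. The cost is that the child must not expect to use that memory.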
* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:03 ` Roland Dreier
@ 2009-09-11  6:15 ` Brice Goglin
From: Brice Goglin @ 2009-09-11  6:15 UTC
To: Roland Dreier; +Cc: KOSAKI Motohiro, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

Roland Dreier wrote:
>  > Does this version already solve the fork() + COW issue? If so, could you
>  > please explain what happens at fork? Obviously RDMA points to either the
>  > parent's or the child's page, not both. But the current COW rule is that
>  > the first process to touch the page gets the copied page and the other
>  > process still owns the original page. I think that is unexpected
>  > behavior from RDMA's point of view.
>
> No, ummunotify doesn't really help that much with fork() + COW. If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).
>

My understanding of the code is that fork will end up calling
copy_page_range() on all VMAs, and copy_page_range() calls
mmu_notifier_invalidate_range_start() if is_cow_mapping() is true,
which should be the case here. So you should get some invalidate events
on fork.

Brice
* Re: [GIT PULL] please pull ummunotify @ 2009-09-11 6:21 ` KOSAKI Motohiro 0 siblings, 0 replies; 82+ messages in thread From: KOSAKI Motohiro @ 2009-09-11 6:21 UTC (permalink / raw) To: Brice Goglin Cc: kosaki.motohiro, Roland Dreier, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel > Roland Dreier wrote: > > > Does this version already solve the fork() + COW issue? If so, could you > > > please explain what happens at fork. Obviously RDMA points to either the parent > > > or the child page, not both. But the current COW rule is that the first process to touch the page > > > gets the copied page and the other process still owns the original page. I think that's > > > unexpected behavior for RDMA. > > > > No, ummunotify doesn't really help that much with fork() + COW. If a > > parent forks and then touches pages that are actively in use for RDMA, > > then of course they get COWed and RDMA goes to the wrong memory (from > > the point of view of the parent). > > > > My understanding of the code is that fork will end up calling > copy_page_range() on all VMAs, and copy_page_range() calls > mmu_notifier_invalidate_range_start() if is_cow_mapping() is true, > which should be the case here. So you should get some invalidate events > on fork. I'm worried... Has nobody tested the fork() case yet? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [GIT PULL] please pull ummunotify @ 2009-09-11 6:22 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-11 6:22 UTC (permalink / raw) To: Brice Goglin Cc: KOSAKI Motohiro, torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel > My understanding of the code is that fork will end-up calling > copy_page_range() on all VMA, and copy_page_range() calls > mmu_notifier_invalidate_range_start() if is_cow_mapping() is true, > which should be the case here. So you should get some invalidate events > on fork. Yes, I agree (that's what the second half of my email tried to say). However, that doesn't help if the parent process is actively doing RDMA on the range being invalidated -- the MPI library or whatever will get the invalidate event via ummunotify, but what can it do? The event is basically saying "your data is going to the wrong place" and I don't see what useful thing MPI could do with that. As I said, it does mean that MPI can invalidate cached registrations for COWed memory, which might be useful in case a parent forks and then touches memory it used to use for RDMA, but I think that's the easier part of the fork/COW problem. - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
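The "invalidate cached registrations" step Roland describes can be sketched as a small range cache. This is only an illustration under stated assumptions: every name here (reg_cache, reg_cache_add, reg_cache_lookup, reg_cache_invalidate) is hypothetical, the event delivery from ummunotify is not modeled, and a real MPI library would also deregister the memory with the adapter.

```c
#include <stddef.h>
#include <stdint.h>

#define REG_CACHE_SLOTS 16

/* One cached registration covering the half-open range [start, end). */
struct reg_entry {
    uintptr_t start;
    uintptr_t end;
    int valid;
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
};

/* Remember a registration; returns 0 on success, -1 if the cache is full. */
static int reg_cache_add(struct reg_cache *c, uintptr_t start, uintptr_t end)
{
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        if (!c->slot[i].valid) {
            c->slot[i] = (struct reg_entry){ start, end, 1 };
            return 0;
        }
    }
    return -1;
}

/* Returns 1 if [start, end) lies inside a still-valid cached registration. */
static int reg_cache_lookup(const struct reg_cache *c,
                            uintptr_t start, uintptr_t end)
{
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        const struct reg_entry *e = &c->slot[i];
        if (e->valid && e->start <= start && end <= e->end)
            return 1;
    }
    return 0;
}

/*
 * Called when an invalidate event reports that [start, end) changed:
 * drop every cached registration that overlaps the range.
 * Returns the number of entries dropped.
 */
static size_t reg_cache_invalidate(struct reg_cache *c,
                                   uintptr_t start, uintptr_t end)
{
    size_t dropped = 0;
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && e->start < end && start < e->end) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

As the message notes, this only stops stale entries from being reused later; it does nothing for a transfer that is already in flight when the event arrives.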
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify 2009-09-11 6:22 ` Roland Dreier @ 2009-09-11 6:40 ` Jason Gunthorpe -1 siblings, 0 replies; 82+ messages in thread From: Jason Gunthorpe @ 2009-09-11 6:40 UTC (permalink / raw) To: Roland Dreier Cc: KOSAKI Motohiro, linux-rdma, linux-kernel, general, Brice Goglin, akpm, torvalds On Thu, Sep 10, 2009 at 11:22:20PM -0700, Roland Dreier wrote: > As I said, it does mean that MPI can invalidate cached registrations for > COWed memory, which might be useful in case a parent forks and then > touches memory it used to use for RDMA, but I think that's the easier > part of the fork/COW problem. What happens to all the other IB resources (PD, CQ, QP, etc) on fork? AFAIK, pretty much by design the IB stack cannot/does not duplicate these objects. The natural consequence is that a PD is always associated with a single process at a time, thus a memory registration which is associated with a PD must also be associated with a single process. So.. What is the problem with fork? The semantics of what should happen seem natural enough to me, the PD doesn't get copied to the child, so the MR stays with the parent. COW events on the pinned region must be resolved so that the physical page stays with the process that has pinned it - the pin is logically released in the child because the MR doesn't exist because the PD doesn't exist. Is this a general problem with the MR mechanism? If I mmap(MAP_SHARED|MAP_READONLY) and someone mmaps(MAP_PRIVATE|MAP_WRITE) on the same file I can generate COW events - will this make RDMAs go randomly too?? Jason ^ permalink raw reply [flat|nested] 82+ messages in thread
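The mmap scenario in Jason's last question can be tried from userspace. The sketch below assumes the message means a MAP_SHARED mapping with PROT_READ and a MAP_PRIVATE mapping with PROT_WRITE (MAP_READONLY and MAP_WRITE are not mmap flags); the function name and the temporary file path are hypothetical. It shows only the CPU-visible semantics and says nothing about what a pinned RDMA registration would observe.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one file page twice: shared read-only and private writable.
 * Write through the private mapping (triggering COW) and report whether
 * the shared view still shows the file's original byte.
 * Returns 1 if it does, 0 if it changed, -1 on setup error.
 */
static int shared_view_survives_private_write(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* file lives on until fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0 ||
        pwrite(fd, "A", 1, 0) != 1) {
        close(fd);
        return -1;
    }

    char *shared = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    char *priv = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    close(fd);

    int result = -1;
    if (shared != MAP_FAILED && priv != MAP_FAILED) {
        priv[0] = 'B';                 /* COW: private mapping gets a copy */
        result = (shared[0] == 'A' && priv[0] == 'B');
    }
    if (shared != MAP_FAILED)
        munmap(shared, (size_t)page);
    if (priv != MAP_FAILED)
        munmap(priv, (size_t)page);
    return result;
}
```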
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-11 16:58 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-11 16:58 UTC (permalink / raw) To: Jason Gunthorpe Cc: Brice Goglin, KOSAKI Motohiro, linux-rdma, linux-kernel, general, akpm, torvalds > So.. What is the problem with fork? The semantics of what should > happen seem natural enough to me, the PD doesn't get copied to the > child, so the MR stays with the parent. COW events on the pinned > region must be resolved so that the physical page stays with the > process that has pinned it - the pin is logically released in the > child because the MR doesn't exist because the PD doesn't exist. This is getting away from the problem that ummunotify is solving, but handling a COW fault generated by the parent by doing the copy in the child seems like a pretty major, tricky change to make. The child may have forked 100 more times in the meantime, meaning we now have to change 101 memory maps ... the cost of page faults goes through the roof probably... - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-15 7:03 ` KOSAKI Motohiro 0 siblings, 0 replies; 82+ messages in thread From: KOSAKI Motohiro @ 2009-09-15 7:03 UTC (permalink / raw) To: Roland Dreier Cc: kosaki.motohiro, Jason Gunthorpe, Brice Goglin, linux-rdma, linux-kernel, general, akpm, torvalds > > > So.. What is the problem with fork? The semantics of what should > > happen seem natural enough to me, the PD doesn't get copied to the > > child, so the MR stays with the parent. COW events on the pinned > > region must be resolved so that the physical page stays with the > > process that has pinned it - the pin is logically released in the > > child because the MR doesn't exist because the PD doesn't exist. > > This is getting away from the problem that ummunotify is solving, but > handling a COW fault generated by the parent by doing the copy in the > child seems like a pretty major, tricky change to make. The child may > have forked 100 more times in the meantime, meaning we now have to > change 101 memory maps ... the cost of page faults goes through the roof > probably... Ummm... Perhaps my first question was wrong. I don't intend to NAK your patch. I merely want to know the details of your patch... OK, let me ask again in other words. - I guess you have your MPI implementation w/ ummunotify, right? - I guess you have tested several patterns, right? If so, can we see your test results? - I think you can explain your MPI's advantages/disadvantages against current OpenMPI (or mpich et al). - I guess your patch dramatically improves the MPI implementation, but it's not free. It imposes some limitations on MPI applications, right? - I imagine multi-threading and fork. Are there other limitations? - In a past discussion, you said ummunotify users should not use multi-threading. Do you also think users should not fork? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify 2009-09-15 7:03 ` KOSAKI Motohiro @ 2009-09-15 8:27 ` Roland Dreier -1 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-15 8:27 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Jason Gunthorpe, Brice Goglin, linux-rdma, linux-kernel, general, akpm, torvalds, jsquyres > - I guess you have your MPI implementation w/ ummunotify, right? Yes, Jeff Squyres (cc'ed) has an Open MPI prototype (mercurial tree at http://bitbucket.org/jsquyres/ummunot/). > - I guess you have tested several patterns, right? > If so, can we see your test results? Open MPI has a pretty extensive automated test fabric -- I don't have a link handy but I believe all the tests that pass with unmodified Open MPI currently still pass with ummunotify. Maybe Jeff has a link. > - I think you can explain your MPI's advantages/disadvantages against > current OpenMPI (or mpich et al). The advantage is as Jeff explained in his blog post (http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/), namely the performance improvement of memory registration caching without the reliability problems caused by previous approaches to caching such as trying to hook malloc etc (which are fragile because the great diversity of MPI-using codes find ways to mess up all previous userspace-only approaches). > - I guess your patch dramatically improves the MPI implementation, but > it's not free. It imposes some limitations on MPI applications, right? Not that I know of, beyond already existing limitations. > - I imagine multi-threading and fork. Are there other limitations? There are no new limitations on multi-threaded codes or on use of fork that I know of. 
Of course, buggy code that does something like passing a buffer to MPI in one thread and then freeing that buffer from another thread before MPI is done with it is still buggy; but ummunotify actually increases the ability of the MPI implementation to detect such bugs and give useful diagnostic information. > - In a past discussion, you said ummunotify users should not use > multi-threading. Do you also think users should not fork? I don't recall where I said ummunotify users should not be multithreaded. I don't know of any problem with that. Also code using ummunotify can fork -- ummunotify simply does not fix issues with copy-on-write for buffers that are in use, just as it does not fix multithreaded code that has a race between using a buffer and freeing the same buffer. Hope this clarifies things. - Roland ^ permalink raw reply [flat|nested] 82+ messages in thread
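The copy-on-write behavior that ummunotify does not fix can be observed from userspace. In the sketch below (hypothetical function name, plain POSIX calls only), the parent writes to a buffer after fork() and the child reports what it still sees. It demonstrates that the two address spaces diverge; it cannot show which process keeps the original physical page, which is the part that matters for RDMA and which the thread describes as going to the first process that touches the page.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, let the parent overwrite buf[0], then ask the child what it sees.
 * Returns the byte observed by the child (as its exit status), or -1 if
 * the setup failed.
 */
static int child_view_after_parent_write(void)
{
    static char buf[1];
    int go[2];

    if (pipe(go) != 0)
        return -1;
    buf[0] = 'A';

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        return -1;
    }
    if (pid == 0) {
        char token;
        close(go[1]);
        /* Block until the parent has written to its side of the page. */
        if (read(go[0], &token, 1) != 1)
            _exit(255);
        _exit((unsigned char)buf[0]);
    }

    close(go[0]);
    buf[0] = 'B';                      /* parent writes: page is COWed */
    (void)!write(go[1], "x", 1);       /* wake the child */
    close(go[1]);

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```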
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify 2009-09-15 7:03 ` KOSAKI Motohiro @ 2009-09-15 12:38 ` Jeff Squyres -1 siblings, 0 replies; 82+ messages in thread From: Jeff Squyres @ 2009-09-15 12:38 UTC (permalink / raw) To: KOSAKI Motohiro Cc: linux-rdma, Roland Dreier (rdreier), linux-kernel, general, Brice Goglin, akpm, torvalds On Sep 15, 2009, at 3:03 AM, KOSAKI Motohiro wrote: > - I guess you have your MPI implementation w/ ummunotify, right? > - I guess you have tested several patterns, right? > If so, can we see your test results? > Roland's answers to the rest of these questions were spot-on, so I thought I'd just throw in a quick reply to the above questions: yes, we have a prototype Open MPI implementation with code that uses ummunotify (http://bitbucket.org/jsquyres/ummunot/). I just finished fixing a high-priority (but unrelated) bug in Open MPI, so merging the prototype ummunotify code into the upstream Open MPI repository is now at the top of my priority list. We have done quite a bit of testing with ummunotify, but since the code is not yet in the Open MPI mainline, most of the testing has been manual (not through our automated testing system). As far as we can tell, everything is working properly with Open MPI + ummunotify. We also anticipate that other MPI implementations will be able to use ummunotify, potentially using Open MPI as a reference ummunotify implementation. FWIW: we went through a bunch of design and implementation iterations with Roland to get code that everyone was happy with: - Roland likes it (and anticipated that the kernel community would be receptive to it) - we like it - performs correctly Hope that helps. -- Jeff Squyres jsquyres@cisco.com ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [GIT PULL] please pull ummunotify @ 2009-09-16 16:30 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-16 16:30 UTC (permalink / raw) To: torvalds; +Cc: akpm, jsquyres, linux-rdma, general, linux-kernel Hi Linus, Sorry to hassle you about this, but I would like to know where things stand. I know (from the reflink discussion if nothing else) that you're definitely not bashful about telling people when their code sucks, so this silent treatment has me really flustered. I've been showering and brushing my teeth and everything, honest! Seriously, this code solves a problem that the MPI/HPC people have been complaining about for quite a while, and if possible I'd like to get this upstream. Or if you have a better idea, I'm all ears... Thanks, Roland > Linus, please consider pulling from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This will get "ummunotify," a new character device that allows a > userspace library to register for MMU notifications; this is > particularly useful for MPI implementions (message passing libraries > used in HPC) to be able to keep track of what wacky things consumers > do to their memory mappings. My colleague Jeff Squyres from the Open > MPI project posted a blog entry about why MPI wants this: > > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/ > > His summary of ummunotify: > > "It’s elegant, doesn’t require strange linker tricks, and seems to > work in all cases. Yay!" > > This code went through several review iterations on lkml and was in > -mm and -next for quite a few weeks. Andrew is OK with merging it (I > think -- Andrew please correct me if I misunderstood you). 
> > Roland Dreier (1): > ummunotify: Userspace support for MMU notifications > > Documentation/Makefile | 3 +- > Documentation/ummunotify/Makefile | 7 + > Documentation/ummunotify/ummunotify.txt | 150 ++++++++ > Documentation/ummunotify/umn-test.c | 200 +++++++++++ > drivers/char/Kconfig | 12 + > drivers/char/Makefile | 1 + > drivers/char/ummunotify.c | 566 +++++++++++++++++++++++++++++++ > include/linux/Kbuild | 1 + > include/linux/ummunotify.h | 121 +++++++ > 9 files changed, 1060 insertions(+), 1 deletions(-) > create mode 100644 Documentation/ummunotify/Makefile > create mode 100644 Documentation/ummunotify/ummunotify.txt > create mode 100644 Documentation/ummunotify/umn-test.c > create mode 100644 drivers/char/ummunotify.c > create mode 100644 include/linux/ummunotify.h ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [GIT PULL] please pull ummunotify @ 2009-09-16 16:40 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2009-09-16 16:40 UTC (permalink / raw) To: Roland Dreier; +Cc: akpm, jsquyres, linux-rdma, general, linux-kernel On Wed, 16 Sep 2009, Roland Dreier wrote: > > Sorry to hassle you about this, but I would like to know where things > stand. I know (from the reflink discussion if nothing else) that you're > definitely not bashful about telling people when their code sucks, so > this silent treatment has me really flustered. I've been showering and > brushing my teeth and everything, honest! I just haven't had time to look into the issue, so I'm merging the code that I know I need to merge, and hopefully I'll have a breather later when I can actually look at code and the thread it all spawned. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [GIT PULL] please pull ummunotify @ 2009-09-17 11:30 ` Peter Zijlstra 0 siblings, 0 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-09-17 11:30 UTC (permalink / raw) To: Roland Dreier Cc: torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel, Anton Blanchard, Paul Mackerras, Ingo Molnar On Thu, 2009-09-10 at 21:38 -0700, Roland Dreier wrote: > Linus, please consider pulling from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This will get "ummunotify," a new character device that allows a > userspace library to register for MMU notifications; this is > particularly useful for MPI implementions (message passing libraries > used in HPC) to be able to keep track of what wacky things consumers > do to their memory mappings. My colleague Jeff Squyres from the Open > MPI project posted a blog entry about why MPI wants this: > > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/ > > His summary of ummunotify: > > "It’s elegant, doesn’t require strange linker tricks, and seems to > work in all cases. Yay!" > > This code went through several review iterations on lkml and was in > -mm and -next for quite a few weeks. Andrew is OK with merging it (I > think -- Andrew please correct me if I misunderstood you). Anton Blanchard suggested a while back that this might be integrated with perf-counters, since perf-counters already does mmap() tracking and also provides events through an mmap()'ed buffer. Has anybody looked into this? If someone did and I missed the discussion on why it isn't appropriate, kindly point me in the right direction ;-) ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify 2009-09-17 11:30 ` Peter Zijlstra (?) @ 2009-09-17 14:24 ` Roland Dreier [not found] ` <adafxalejiq.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> -1 siblings, 1 reply; 82+ messages in thread From: Roland Dreier @ 2009-09-17 14:24 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar > Anton Blanchard suggested a while back that this might be integrated > with perf-counters, since perf-counters already does mmap() tracking and > also provides events through an mmap()'ed buffer. > > Has anybody looked into this? I didn't see the original suggestion. Certainly hooking in to existing infrastructure for user/kernel communication would be good. The fit doesn't seem great to me, although I am rather naive about perf counters. The problem that ummunotify is trying to solve is to let an app say 'for these 1000 address ranges (that possibly only cover a small part of my total address space), please let me know when the mappings are invalidated for any reason'. So getting those events in the kernel is no problem -- we have the MMU notifier hooks that tell us exactly what we need to know. The issue is purely the way userspace registers interest in address ranges, and how the kernel returns the events. For perf counters it seems that one would have to create a new counter for each address range... is that correct? And also I don't know if perf counter has an analog for the fast path optimization that ummunotify provides via an mmap'ed generation counter (a quick way for userspace to see 'nothing happened since last time you checked'). - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
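The generation-counter fast path mentioned above can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions: the struct and function names (`umn_shared`, `umn_client`, `umn_check`, `umn_sim_invalidate`) are hypothetical and are not taken from the ummunotify patch, and the "kernel" side is simulated by a plain function instead of a real mmap()'ed page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the counter page the kernel would expose via mmap(). */
struct umn_shared {
    volatile uint64_t generation;   /* bumped on every invalidation of a watched range */
};

/* Hypothetical userspace-side state of a library that caches registrations. */
struct umn_client {
    const struct umn_shared *shared;
    uint64_t last_seen;             /* generation observed at the previous check */
    unsigned slow_path_calls;       /* how often events actually had to be fetched */
};

/* Simulates the kernel side: some registered range was invalidated. */
static void umn_sim_invalidate(struct umn_shared *s)
{
    s->generation++;
}

/* Returns true if cached state may be stale and events must be fetched. */
static bool umn_check(struct umn_client *c)
{
    uint64_t now = c->shared->generation;

    if (now == c->last_seen)
        return false;               /* fast path: one memory read, no system call */

    c->last_seen = now;
    c->slow_path_calls++;           /* a real client would read the events here */
    return true;
}
```

The point of the design is that the common case ("nothing happened since last time you checked") costs a single load from shared memory; several invalidations between two checks still trigger only one trip down the slow path.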
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-17 14:32 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-17 14:32 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar > So getting those events in the kernel is no problem -- we have the MMU > notifier hooks that tell us exactly what we need to know. The issue is > purely the way userspace registers interest in address ranges, and how > the kernel returns the events. > > For perf counters it seems that one would have to create a new counter > for each address range... is that correct? And also I don't know if > perf counter has an analog for the fast path optimization that > ummunotify provides via an mmap'ed generation counter (a quick way for > userspace to see 'nothing happened since last time you checked'). Oh I forgot... ummunotify also preallocates everything etc. so that there is no way for events to be lost. Which saves userspace from having to trash everything cached and start over, which it would have to do if it misses an invalidate event. And AFAIK, perf counters do have the possibility of overflowing a buffer and losing an event, right? - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-17 14:49 ` Peter Zijlstra 0 siblings, 0 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-09-17 14:49 UTC (permalink / raw) To: Roland Dreier Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar On Thu, 2009-09-17 at 07:32 -0700, Roland Dreier wrote: > > So getting those events in the kernel is no problem -- we have the MMU > > notifier hooks that tell us exactly what we need to know. The issue is > > purely the way userspace registers interest in address ranges, and how > > the kernel returns the events. > > > > For perf counters it seems that one would have to create a new counter > > for each address range... is that correct? And also I don't know if > > perf counter has an analog for the fast path optimization that > > ummunotify provides via an mmap'ed generation counter (a quick way for > > userspace to see 'nothing happened since last time you checked'). > > Oh I forgot... ummunotify also preallocates everything etc. so that > there is no way for events to be lost. Which saves userspace from > having to trash everything cached and start over, which it would have to > do if it misses an invalidate event. > > And AFAIK, perf counters do have the possibility of overflowing a > buffer and losing an event, right? Well, you cannot pre-allocate everything: either you get back-logged events in kernel space leading to a kernel DoS, or you lose events. Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode will automagically overwrite its tail data without regard for userspace having observed it. In the RW mode userspace has to advance the tail, the kernel will drop events when full and insert a PERF_EVENT_LOST event once there is room again. Hmm, or are you saying you can only get 1 event per registered range and allocate the thing on registration? That'd need some registration limit to avoid DoS scenarios. 
^ permalink raw reply [flat|nested] 82+ messages in thread
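The RW-mode behaviour described above (the consumer advances the tail; the producer drops events when the buffer is full and emits a "lost" record once there is room again) can be modelled with a toy ring buffer. This is a simplified simulation, not the perf counters ABI: the record layout, the names `ring_produce` / `ring_consume`, and the four-slot size are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 4u               /* deliberately tiny so overflow is easy to trigger */

enum rec_type { REC_EVENT = 1, REC_LOST = 2 };

struct rec {
    enum rec_type type;
    uint64_t value;                 /* event payload, or number of dropped events */
};

struct ring {
    struct rec slot[RING_SLOTS];
    uint64_t head;                  /* advanced by the producer ("kernel") */
    uint64_t tail;                  /* advanced by the consumer ("userspace") */
    uint64_t lost;                  /* drops not yet reported via a REC_LOST record */
};

static size_t ring_free(const struct ring *r)
{
    return RING_SLOTS - (size_t)(r->head - r->tail);
}

static void ring_put(struct ring *r, enum rec_type type, uint64_t value)
{
    r->slot[r->head % RING_SLOTS] = (struct rec){ type, value };
    r->head++;
}

/* Producer side: returns 1 if the event was stored, 0 if it was dropped. */
static int ring_produce(struct ring *r, uint64_t event)
{
    size_t need = r->lost ? 2 : 1;  /* a pending "lost" record needs its own slot */

    if (ring_free(r) < need) {
        r->lost++;                  /* never blocks, never allocates: just count the drop */
        return 0;
    }
    if (r->lost) {
        ring_put(r, REC_LOST, r->lost);
        r->lost = 0;
    }
    ring_put(r, REC_EVENT, event);
    return 1;
}

/* Consumer side: returns 1 and fills *out if a record was available. */
static int ring_consume(struct ring *r, struct rec *out)
{
    if (r->tail == r->head)
        return 0;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;                      /* advancing the tail frees the slot */
    return 1;
}
```

In this model a consumer that sees a `REC_LOST` record knows how many events were dropped but not which ones, which is why a cache of memory registrations would have to be discarded wholesale at that point.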
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-17 15:03 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-17 15:03 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar > Hmm, or are you saying you can only get 1 event per registered range and > allocate the thing on registration? That'd need some registration limit > to avoid DoS scenarios. Yes, that's what I do. You're right, I should add a limit... although there are lots of ways for userspace to consume arbitrary amounts of kernel resources already. - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
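The "one event per registered range, allocated on registration" scheme confirmed above can be sketched as follows. This is a userspace model under stated assumptions, not the driver's code: the names (`watch`, `watcher`, `watch_register`, `watch_invalidate`, `watch_next_event`) are hypothetical, and a linear list stands in for whatever lookup structure the real driver uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One watched range; it doubles as its own (single) preallocated event slot. */
struct watch {
    uint64_t start, end;            /* watched range [start, end) */
    bool queued;                    /* already on the pending list? */
    struct watch *next_pending;
    struct watch *next_all;
};

struct watcher {
    struct watch *all;              /* every registered range */
    struct watch *pending_head;     /* ranges with an undelivered event */
    struct watch *pending_tail;
};

/* Registration is the only step that allocates; failure is reported right here. */
static struct watch *watch_register(struct watcher *w, uint64_t start, uint64_t end)
{
    struct watch *n = calloc(1, sizeof(*n));

    if (!n)
        return NULL;
    n->start = start;
    n->end = end;
    n->next_all = w->all;
    w->all = n;
    return n;
}

/* Invalidation never allocates: each overlapping range is queued at most once. */
static void watch_invalidate(struct watcher *w, uint64_t start, uint64_t end)
{
    for (struct watch *n = w->all; n; n = n->next_all) {
        if (n->queued || n->end <= start || end <= n->start)
            continue;
        n->queued = true;
        n->next_pending = NULL;
        if (w->pending_tail)
            w->pending_tail->next_pending = n;
        else
            w->pending_head = n;
        w->pending_tail = n;
    }
}

/* Delivers the next pending range, or NULL if nothing was invalidated. */
static struct watch *watch_next_event(struct watcher *w)
{
    struct watch *n = w->pending_head;

    if (!n)
        return NULL;
    w->pending_head = n->next_pending;
    if (!w->pending_head)
        w->pending_tail = NULL;
    n->queued = false;
    return n;
}
```

Because the pending list can never hold more entries than there are registrations, memory use is bounded by the number of registered ranges, which is exactly why a limit on registrations is the remaining DoS concern.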
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-17 15:22 ` Peter Zijlstra 0 siblings, 0 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-09-17 15:22 UTC (permalink / raw) To: Roland Dreier Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar On Thu, 2009-09-17 at 08:03 -0700, Roland Dreier wrote: > > Hmm, or are you saying you can only get 1 event per registered range and > > allocate the thing on registration? That'd need some registration limit > > to avoid DoS scenarios. > > Yes, that's what I do. You're right, I should add a limit... although > there are lots of ways for userspace to consume arbitrary amounts of > kernel resources already. It'd be good to work at reducing that number, not adding to it ;-) But yeah, I currently don't see a very nice match to perf counters. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-17 15:45 ` Roland Dreier 0 siblings, 0 replies; 82+ messages in thread From: Roland Dreier @ 2009-09-17 15:45 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar > > > Hmm, or are you saying you can only get 1 event per registered range and > > > allocate the thing on registration? That'd need some registration limit > > > to avoid DoS scenarios. > > > > Yes, that's what I do. You're right, I should add a limit... although > > there are lots of ways for userspace to consume arbitrary amounts of > > kernel resources already. > > It'd be good to work at reducing that number, not adding to it ;-) Yes, definitely. I'll add a quick ummunotify module parameter that limits the number of registrations per process. > But yeah, I currently don't see a very nice match to perf counters. OK. It would be nice to tie into something more general, but I think I agree -- perf counters are missing the filtering and the "no lost events" that ummunotify does have. And I'm not sure it's worth messing up the perf counters design just to jam one more not totally related thing in. - R. ^ permalink raw reply [flat|nested] 82+ messages in thread
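A registration limit of the kind proposed above might look roughly like the following userspace model. Everything here is an assumption made for illustration: the variable `max_watches`, the `umn_ctx` structure, the choice of `-ENOSPC` as the error, and counting per context instead of per process. In a real kernel module the tunable would presumably be exposed with `module_param()`.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical tunable; a kernel module would expose this as a module parameter. */
static unsigned int max_watches = 2;

/* Hypothetical per-open-file (or per-process) accounting state. */
struct umn_ctx {
    unsigned int nr_watches;        /* registrations currently held by this context */
};

/* Returns 0 on success, or a negative errno once the limit is reached. */
static int umn_ctx_register(struct umn_ctx *ctx)
{
    if (ctx->nr_watches >= max_watches)
        return -ENOSPC;             /* refuse up front, before allocating anything */
    ctx->nr_watches++;
    return 0;
}

/* Releases one registration so that a later register call can succeed again. */
static void umn_ctx_unregister(struct umn_ctx *ctx)
{
    if (ctx->nr_watches)
        ctx->nr_watches--;
}
```

Checking the limit at registration time keeps the failure in the same place where userspace already has to handle allocation failure, so the event path itself stays unchanged.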
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-18 11:50 ` Ingo Molnar 0 siblings, 0 replies; 82+ messages in thread From: Ingo Molnar @ 2009-09-18 11:50 UTC (permalink / raw) To: Roland Dreier Cc: Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds * Roland Dreier <rdreier@cisco.com> wrote: > > But yeah, I currently don't see a very nice match to perf counters. > > OK. It would be nice to tie into something more general, but I think > I agree -- perf counters are missing the filtering and the "no lost > events" that ummunotify does have. And I'm not sure it's worth > messing up the perf counters design just to jam one more not totally > related thing in. The filtering can be done and has been done - see Li Zefan's patchset that uses filter expressions to do per event in-kernel filtering. The OOM DoS is a bug in your patches i think, which perfcounters solves ;-) Ingo ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-29 17:13 ` Pavel Machek 0 siblings, 0 replies; 82+ messages in thread From: Pavel Machek @ 2009-09-29 17:13 UTC (permalink / raw) To: Roland Dreier Cc: Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar On Thu 2009-09-17 08:45:29, Roland Dreier wrote: > > > > > Hmm, or are you saying you can only get 1 event per registered range and > > > > allocate the thing on registration? That'd need some registration limit > > > > to avoid DoS scenarios. > > > > > > Yes, that's what I do. You're right, I should add a limit... although > > > there are lots of ways for userspace to consume arbitrary amounts of > > > kernel resources already. > > > > It'd be good to work at reducing that number, not adding to it ;-) > > Yes, definitely. I'll add a quick ummunotify module parameter that > limits the number of registrations per process. > > > But yeah, I currently don't see a very nice match to perf counters. > > OK. It would be nice to tie into something more general, but I think I > agree -- perf counters are missing the filtering and the "no lost > events" that ummunotify does have. And I'm not sure it's worth messing > up the perf counters design just to jam one more not totally related > thing in. I believe that extending perf counters to do what you want is better than adding one more, very strange, user<->kernel interface. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify @ 2009-09-30 9:44 ` Ingo Molnar 0 siblings, 0 replies; 82+ messages in thread From: Ingo Molnar @ 2009-09-30 9:44 UTC (permalink / raw) To: Pavel Machek Cc: Roland Dreier, Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds * Pavel Machek <pavel@ucw.cz> wrote: > On Thu 2009-09-17 08:45:29, Roland Dreier wrote: > > > > [...] > > OK. It would be nice to tie into something more general, but I > > think I agree -- perf counters are missing the filtering and the "no > > lost events" that ummunotify does have. [...] Performance events filtering is being worked on and now with the proper non-DoS limit you've added you can lose events too, don't you? So it's all a question of how much buffering to add - and with perf events too you can buffer an arbitrarily large number of events. > > [...] And I'm not sure it's worth messing up the perf counters > > design just to jam one more not totally related thing in. Nobody suggested details for any redesign yet (so far it seems like a perfect match, to me at least) so i'm wondering what messup you are referring to. > I believe that extending perf counters to do what you want is better > than adding one more, very strange, user<->kernel interface. Agreed. Lemme react to the original description of the code: > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This will get "ummunotify," a new character device that allows a > userspace library to register for MMU notifications; this is > particularly useful for MPI implementions (message passing libraries > used in HPC) to be able to keep track of what wacky things consumers > do to their memory mappings. I test-pulled this code and had a look at it. I think this could be done in a simpler, less limited, more generic, more useful form by using some variation of perf events. 
You should be able to get all that you want by adding two TRACE_EVENT() tracepoints and using the existing perf event syscall to get the events to user-space. Meaning that this: 9 files changed, 1060 insertions(+), 1 deletions(-) Would be replaced with something like: 2 files changed, 100 insertions(+), 0 deletions(-) [ the +100 lines would (roughly) would add tracepoints to invalidate_page and invalidate_range_start. (possibly via mmu_notifier_register() like the ummunotify code does) Most of that linecount would be comments. ] Another upside, beyond the reduction in complexity is that we'd have one less special char driver based ABI. Which is a big plus in my opinion, especially if this goes towards HPC folks and if it's used for real. Why should such a MM capability hidden behind a character device and an ioctl? The perf event approach is beneficial to non-HPC as well: MM instrumentation for example - page range invalidates are interesting to all sorts of modi of analysis. A question: what is the typical size/scope of the rbtree of the watched regions of memory in practical (test) deployments of the ummunofity code? Per tracepoint filtering is possible via the perf event patches Li Zefan has posted to lkml recently, under this subject: [PATCH 0/6] perf trace: Add filter support They are still being worked on but it's very clear that flexible in-kernel filtering support will be a natural part of the perf event design in the very near future, so if that alone is your reason not to use it it would be better if you helped us complete/test the filter support and use that, instead of a parallel framework. Or if that's not desirable or not possible, or if there's any other technical roadblock, i'd like to know the particulars of that. Thanks, Ingo ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44 ` Ingo Molnar
@ 2009-09-30 16:02   ` Jason Gunthorpe
From: Jason Gunthorpe @ 2009-09-30 16:02 UTC
To: Ingo Molnar
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:

> > > OK. It would be nice to tie into something more general, but I
> > > think I agree -- perf counters are missing the filtering and the "no
> > > lost events" that ummunotify does have. [...]
>
> Performance events filtering is being worked on and now with the proper
> non-DoS limit you've added you can lose events too, don't you? So it's
> all a question of how much buffering to add - and with perf events too
> you can buffer an arbitrarily large amount of events.

No, ummunotify does not lose events; that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory; if one
fails, the entire operation fails and userspace can do whatever it does
on memory allocation failure.

After that point the scheme is perfectly lossless.

Performance event filtering would use the same kind of kernel memory:
call ibv_reg_mr, then install a filter. Both allocate kernel memory, and
if one fails the op fails. But then when the ring buffer overflows
you've lost events.

All the tracing schemes are lossy, since they lose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking.

So, here is a concrete proposal for how ummunotify could be absorbed by
perf events tracing, with filters.

- The filter expression must be able to trigger on an MMU event,
  triggering on the intersection of the MMU event address range and
  the filter expression address range.
- The traces must be chosen so that there is exactly one filter
  expression per ibv_reg_mr region.
- Each filter has a clearable saturating counter that increments
  every time the filter matches an event.
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
  filter tags, not trace events.

This would then duplicate the functions of ummunotify, including the
lossless collection of events.

The flow would more or less be the same:

 struct my_data *ptr = calloc()
 ptr->reg_handle = ibv_reg_mr(base,len)
 ptr->filter_handle = perf_filter_register("string matching base->len",ptr)

 [..]

 // fast path
 if (atomically(perf_map->head) != last_perf_map_head) {
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
     for (unsigned int i = 0; i != rc; i++)
         ((struct my_data *)foo[i])->invalid = 1;
     perf_empty_mmap_ring(perf_map);
 }

If 'optionally' is done then the app can trundle through the mmap and
only use the above syscall loop if the mmap overflows. That would be
quite ideal.

It also must be guaranteed that when a trace point is hit the mmap
atomics are updated and visible to another user space thread before
the trace point returns - otherwise it is not synchronous enough and
will be racy.

> A question: what is the typical size/scope of the rbtree of the watched
> regions of memory in practical (test) deployments of the ummunotify
> code?

Jeff, can you comment? IIRC it is many tens (hundreds?) of thousands of
watches.

> Per tracepoint filtering is possible via the perf event patches Li Zefan
> has posted to lkml recently, under this subject:

Performance of the filter add is probably a bit of a concern.

Regards,
Jason
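[ Editorial illustration, not part of the thread. The per-filter "clearable saturating counter" plus "read and clear non-zero counters" idea in Jason's proposal can be modelled in plain userspace C. This is only a sketch under assumptions: struct watch, watch_notify() and watch_read_and_clear() are hypothetical names invented here, and nothing below is a real perf or ummunotify interface. ]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one filter; not a kernel structure. */
struct watch {
	uint64_t start;  /* first byte watched */
	uint64_t end;    /* one past the last byte watched */
	uint64_t tag;    /* user-assigned 64 bit cookie */
	uint8_t  hits;   /* saturating counter, sticks at UINT8_MAX */
};

/* Record an invalidation of [start, end) against every overlapping watch.
 * No memory is allocated here, so recording an event cannot fail. */
void watch_notify(struct watch *w, size_t n, uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < n; i++) {
		int overlaps = start < w[i].end && w[i].start < end;

		if (overlaps && w[i].hits != UINT8_MAX)
			w[i].hits++;
	}
}

/* Copy the tags of all watches with a non-zero counter into out[],
 * clearing each counter that is reported. Returns the number of tags. */
size_t watch_read_and_clear(struct watch *w, size_t n,
			    uint64_t *out, size_t max)
{
	size_t got = 0;

	for (size_t i = 0; i < n && got < max; i++) {
		if (w[i].hits) {
			out[got++] = w[i].tag;
			w[i].hits = 0;
		}
	}
	return got;
}
```

[ The point of the model: however many invalidations hit a region, the counter only saturates, so the caller still learns that the region was touched. Storage is fixed when the watch is created, which is where any allocation failure would surface. ]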
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30 16:02 ` Jason Gunthorpe
@ 2009-10-12 18:19   ` Ingo Molnar
From: Ingo Molnar @ 2009-10-12 18:19 UTC
To: Jason Gunthorpe
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

* Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > > OK. It would be nice to tie into something more general, but I
> > > > think I agree -- perf counters are missing the filtering and the "no
> > > > lost events" that ummunotify does have. [...]
> >
> > Performance events filtering is being worked on and now with the
> > proper non-DoS limit you've added you can lose events too, don't you?
> > So it's all a question of how much buffering to add - and with perf
> > events too you can buffer an arbitrarily large amount of events.
>
> No, ummunotify does not lose events; that is the fundamental
> difference between it and all tracing schemes.
>
> Every call to ibv_reg_mr is paired with a call to ummunotify to create
> a matching watcher. Both calls allocate some kernel memory; if one
> fails, the entire operation fails and userspace can do whatever it does
> on memory allocation failure.

We already support signal notification for perf events, and we also
support two modi of perf ring-buffer overflow notification. Adding a
third one that sends a signal when events are lost would be in line with
that.

This would allow you to have the OOM semantics of requesting a SIGBUS -
or user-space could do other things: like print a warning in the app or
ignore the event overflow. Which are all interesting things to do.

(If you do that you might want to add that to 'perf top' or 'perf
record' as well.)

> After that point the scheme is perfectly lossless.

Well if it can OOM it's not lossless, obviously. You just define "event
loss" to be equivalent to "Destruction of the universe." ;-)

	Ingo
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 18:19 ` Ingo Molnar
@ 2009-10-12 19:30   ` Jason Gunthorpe
From: Jason Gunthorpe @ 2009-10-12 19:30 UTC
To: Ingo Molnar
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:

> > After that point the scheme is perfectly lossless.
>
> Well if it can OOM it's not lossless, obviously. You just define "event
> loss" to be equivalent to "Destruction of the universe." ;-)

It can't OOM once the ummunotify registration is done - when an event
occurs it doesn't allocate any memory and it doesn't lose events.

It has the same problem as perf - you either bound the number/size of
filters, or let user space allocate filters until the box OOMs.

perf has the additional problem that even with filters you can still
lose events if the event ring overflows.

Jason
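[ Editorial illustration, not part of the thread. The contrast Jason draws, a bounded event ring that can overflow versus per-registration state that cannot, can be shown with a toy userspace model. This is a sketch under assumptions: struct ring, ring_push() and flag_mark() are hypothetical names, RING_SLOTS is deliberately tiny, and none of this is the perf ring buffer or the ummunotify driver. ]

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4  /* deliberately tiny so overflow is easy to see */

/* Toy bounded event ring: one slot per event, no consumer in this model. */
struct ring {
	int    slot[RING_SLOTS];  /* region ids, in arrival order */
	size_t count;             /* slots in use */
	size_t lost;              /* events dropped because the ring was full */
};

/* Queue one event. Returns false, and counts a loss, when the ring is full. */
bool ring_push(struct ring *r, int region)
{
	if (r->count == RING_SLOTS) {
		r->lost++;
		return false;
	}
	r->slot[r->count++] = region;
	return true;
}

/* Per-region flag: the storage exists from registration time onwards,
 * so marking a region invalid needs no free slot and cannot be dropped. */
void flag_mark(bool *invalid, int region)
{
	invalid[region] = true;
}
```

[ With six invalidations and no reader, the ring in this model keeps four events and drops two, while the flag array ends up with all six regions marked. The flags lose the order and count of events, which is the trade-off discussed in the following messages. ]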
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 19:30 ` Jason Gunthorpe
@ 2009-10-12 20:20   ` Ingo Molnar
From: Ingo Molnar @ 2009-10-12 20:20 UTC
To: Jason Gunthorpe
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma, linux-kernel,
    Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Jeff Squyres

* Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
>
> > > After that point the scheme is perfectly lossless.
> >
> > Well if it can OOM it's not lossless, obviously. You just define
> > "event loss" to be equivalent to "Destruction of the universe." ;-)
>
> It can't OOM once the ummunotify registration is done - when an event
> occurs it doesn't allocate any memory and it doesn't lose events.

Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
any double events on the same range will cause an imprecise event to be
recorded and cause the loss of information.

Is that loss of information more acceptable than the loss of information
via the loss of events?

It might be more acceptable because the flag-hint mechanism can at most
cause over-flushing - while with perf events we might fail to invalidate
a range altogether.

	Ingo
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 20:20 ` Ingo Molnar
@ 2009-10-13  4:05   ` Jason Gunthorpe
From: Jason Gunthorpe @ 2009-10-13 4:05 UTC
To: Ingo Molnar
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:

> It might be more acceptable because the flag-hint mechanism can at most
> cause over-flushing - while with perf events we might fail to invalidate
> a range altogether.

Right. Overflushing is not important, but missing an event entirely is
not recoverable (at least within the current kernel APIs).

Jason
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  4:05 ` Jason Gunthorpe
@ 2009-10-13  6:40   ` Ingo Molnar
From: Ingo Molnar @ 2009-10-13 6:40 UTC
To: Jason Gunthorpe
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma, linux-kernel,
    Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Jeff Squyres

* Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > It might be more acceptable because the flag-hint mechanism can at most
> > cause over-flushing - while with perf events we might fail to invalidate
> > a range altogether.
>
> Right. Overflushing is not important, but missing an event entirely is
> not recoverable (at least within the current kernel APIs).

So if we detect event loss in the perf event case (should not happen
with sufficient buffering but it is a possibility the code should be
prepared for) then we can just flush the [0,-1ULL] range, right?

	Ingo
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  6:40 ` Ingo Molnar
@ 2009-10-13 16:27   ` Jason Gunthorpe
From: Jason Gunthorpe @ 2009-10-13 16:27 UTC
To: Ingo Molnar
Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Tue, Oct 13, 2009 at 08:40:06AM +0200, Ingo Molnar wrote:
>
> * Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>
> > On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > > It might be more acceptable because the flag-hint mechanism can at most
> > > cause over-flushing - while with perf events we might fail to invalidate
> > > a range altogether.
> >
> > Right. Overflushing is not important, but missing an event entirely is
> > not recoverable (at least within the current kernel APIs).
>
> So if we detect event loss in the perf event case (should not happen
> with sufficient buffering but it is a possibility the code should be
> prepared for) then we can just flush the [0,-1ULL] range, right?

No, the reason overflushing within a registration is OK is because of
how the MPI APIs are defined and typically used. The map and
registration window will typically be 1:1, i.e. you malloc something and
then register it. It is an error to register beyond your malloced
space. So, in truth, the hint stuff isn't really essential for MPI.

Flushing all ranges would result in data loss since ranges may be in
use at the time, and 'flush' is actually unregister/reregister - the
hardware cannot do an in-place atomic modify.

Jason
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 20:20 ` Ingo Molnar
@ 2009-10-13  5:43   ` Brice Goglin
From: Brice Goglin @ 2009-10-13 5:43 UTC
To: Ingo Molnar
Cc: Jason Gunthorpe, Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
    linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds,
    Jeff Squyres

Ingo Molnar wrote:
> * Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
>> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
>>
>>>> After that point the scheme is perfectly lossless.
>>>>
>>> Well if it can OOM it's not lossless, obviously. You just define
>>> "event loss" to be equivalent to "Destruction of the universe." ;-)
>>>
>> It can't OOM once the ummunotify registration is done - when an event
>> occurs it doesn't allocate any memory and it doesn't lose events.
>>
>
> Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
> any double events on the same range will cause an imprecise event to be
> recorded and cause the loss of information.
>

The target (MPI) application doesn't care about how many events are
coming here. It just needs to know whether something has been
invalidated in the range. If so, it destroys the whole RDMA window
anyway. So it's actually _good_ that multiple events are merged into a
single one: the application only has to process a single event per
partially-invalidated range.

Brice
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  5:43 ` Brice Goglin
@ 2009-10-13  6:38   ` Ingo Molnar
From: Ingo Molnar @ 2009-10-13 6:38 UTC
To: Brice Goglin
Cc: Jason Gunthorpe, Pavel Machek, Roland Dreier, Peter Zijlstra,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    Paul Mackerras, Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

* Brice Goglin <Brice.Goglin-MZpvjPyXg2s@public.gmane.org> wrote:

> Ingo Molnar wrote:
> > * Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> >
> >> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
> >>
> >>>> After that point the scheme is perfectly lossless.
> >>>>
> >>> Well if it can OOM it's not lossless, obviously. You just define
> >>> "event loss" to be equivalent to "Destruction of the universe." ;-)
> >>>
> >> It can't OOM once the ummunotify registration is done - when an event
> >> occurs it doesn't allocate any memory and it doesn't lose events.
> >>
> >
> > Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
> > any double events on the same range will cause an imprecise event to be
> > recorded and cause the loss of information.
> >
>
> The target (MPI) application doesn't care about how many events are
> coming here. It just needs to know whether something has been
> invalidated in the range. If so, it destroys the whole RDMA window
> anyway. So it's actually _good_ that multiple events are merged into a
> single one: the application only has to process a single event per
> partially-invalidated range.

It's not unconditionally good, as the fuzzy-merge-events rule:

	events[n].flags = UMMUNOTIFY_EVENT_FLAG_HINT;
	events[n].hint_start = max(reg->start, reg->hint_start);
	events[n].hint_end = min(reg->end, reg->hint_end);

in essence merges flushes into a single interval - which inevitably
might include areas of memory that were not flushed at all.

For example these two flushes:

  [...]          [...]

Would be merged into:

  [..................]

Btw., isn't the above max/min logic buggy, causing lost events?
Shouldn't it be:

	events[n].hint_start = min(reg->start, reg->hint_start);
	events[n].hint_end = max(reg->end, reg->hint_end);

?

	Ingo
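[ Editorial illustration, not part of the thread. The two pairs of expressions in Ingo's message compute different things: max-of-starts with min-of-ends gives the intersection of two ranges (a clamp), while min-of-starts with max-of-ends gives the smallest single range covering both. The sketch below only demonstrates that arithmetic on half-open ranges. struct range, range_clamp() and range_bound() are hypothetical names, and it makes no claim about which form the ummunotify driver needs. ]

```c
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Intersection: the part of 'hint' that lies inside 'reg'.
 * Same shape as max(start, hint_start) / min(end, hint_end). */
struct range range_clamp(struct range reg, struct range hint)
{
	struct range r = {
		max_u64(reg.start, hint.start),
		min_u64(reg.end, hint.end),
	};
	return r;
}

/* Bounding range: the smallest single range covering both inputs,
 * including any gap between them.
 * Same shape as min(start, hint_start) / max(end, hint_end). */
struct range range_bound(struct range a, struct range b)
{
	struct range r = {
		min_u64(a.start, b.start),
		max_u64(a.end, b.end),
	};
	return r;
}
```

[ For a registration [0x1000, 0x3000) and a hint [0x0800, 0x2000), the clamp yields [0x1000, 0x2000) and the bound yields [0x0800, 0x3000). Bounding two separate flushes [0x1000, 0x1400) and [0x2000, 0x2400) yields [0x1000, 0x2400), which covers the untouched gap, as in the diagram above. ]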
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44 ` Ingo Molnar
@ 2009-09-30 17:06   ` Roland Dreier
From: Roland Dreier @ 2009-09-30 17:06 UTC
To: Ingo Molnar
Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds

 > Performance events filtering is being worked on and now with the proper
 > non-DoS limit you've added you can lose events too, don't you? So it's
 > all a question of how much buffering to add - and with perf events too
 > you can buffer arbitrary large amount of events.

No, the idea for non-DoS for ummunotify is that we would limit the
number of regions the application can register; so an application might
hit the limit up front, but there is no runtime loss of events once a
region was registered successfully.

 > I think this could be done in a simpler, less limited, more generic,
 > more useful form by using some variation of perf events.
 >
 > You should be able to get all that you want by adding two TRACE_EVENT()
 > tracepoints and using the existing perf event syscall to get the events
 > to user-space.

Yes, I would like to use perf events too. Would it be plausible to
create a way for userspace to create a "counter" for each address range
being watched? Then events would not be lost, because those counters
would become non-zero.

 > Meaning that this:
 >  9 files changed, 1060 insertions(+), 1 deletions(-)

Note that lots of the files touched here are in Documentation or are
one-line changes to Makefiles etc.

 - R.
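The "fail up front, never at runtime" argument above can be sketched as a fixed table with one preallocated pending flag per registered region. This is a userspace model only, under stated assumptions: `MAX_REGIONS`, `struct region_table` and the three functions are hypothetical names, and the real driver's data structures are not shown in this thread. The point is that registration is the only step that can fail, and that recording an invalidation allocates nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 4	/* hypothetical per-process cap on registrations */

struct region {
	uint64_t start, end;	/* half-open range [start, end) */
	bool in_use;
	bool invalidated;	/* the one preallocated "event slot" */
};

struct region_table {
	struct region slot[MAX_REGIONS];
};

/* Returns a slot index, or -1 when the cap is hit (failure is up front). */
static int region_register(struct region_table *t, uint64_t start, uint64_t end)
{
	assert(t != NULL && start < end);
	for (int i = 0; i < MAX_REGIONS; i++) {
		if (!t->slot[i].in_use) {
			t->slot[i] = (struct region){ start, end, true, false };
			return i;
		}
	}
	return -1;
}

/*
 * Mark every registered region that overlaps [start, end). Nothing is
 * allocated here, so this step cannot fail or drop an event; repeated
 * invalidations of one region simply collapse into the same flag.
 */
static void region_invalidate(struct region_table *t, uint64_t start, uint64_t end)
{
	for (int i = 0; i < MAX_REGIONS; i++) {
		struct region *r = &t->slot[i];
		if (r->in_use && start < r->end && r->start < end)
			r->invalidated = true;
	}
}

/* Report and clear the pending flag for one registered region. */
static bool region_poll(struct region_table *t, int idx)
{
	assert(t != NULL && idx >= 0 && idx < MAX_REGIONS);
	bool hit = t->slot[idx].invalidated;
	t->slot[idx].invalidated = false;
	return hit;
}
```

The trade-off is the one discussed in this thread: the flag says that something in the region was invalidated, not how many times or exactly where.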
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44 ` Ingo Molnar
@ 2009-10-02 16:32   ` Roland Dreier
From: Roland Dreier @ 2009-10-02 16:32 UTC
To: Ingo Molnar
Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds

 > Per tracepoint filtering is possible via the perf event patches Li Zefan
 > has posted to lkml recently, under this subject:
 >
 >   [PATCH 0/6] perf trace: Add filter support
 >
 > They are still being worked on but it's very clear that flexible
 > in-kernel filtering support will be a natural part of the perf event
 > design in the very near future, so if that alone is your reason not to
 > use it it would be better if you helped us complete/test the filter
 > support and use that, instead of a parallel framework.
 >
 > Or if that's not desirable or not possible, or if there's any other
 > technical roadblock, i'd like to know the particulars of that.

So I looked a little deeper into this, and I don't think (even with the
filtering extensions) that perf events are directly applicable to this
problem. The first issue is that, assuming I'm understanding the
comment in perf_event.c:

	/*
	 * Raw tracepoint data is a severe data leak, only allow root to
	 * have these.
	 */

currently tracepoints can only be used by privileged processes. A key
feature of ummunotify is that ordinary unprivileged processes can use it.

So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
as a way of letting unprivileged userspace get access to just MMU events
for their own process? Clearly this touches core infrastructure and is
not as simple as just adding two tracepoints.

Then, assuming we have some way to create an "MMU notifier" perf event,
we need a way for userspace to specify which address ranges it would
like events for (I don't think the string filter expression used by
existing trace filtering works, because if userspace is looking at a few
hundred regions, then the size of the filtering expression explodes, and
adding or removing a single range becomes a pain). So I guess a new
ioctl() to add/remove ranges for MMU_NOTIFIER perf events?

I think filtering is needed, because otherwise events for ranges that
are not of interest are just a waste of resources to generate and
process, and make losing good events because of overflow much more
likely.

We still have the problem of lost events if the mmap buffer overflows,
but userspace should be able to size the buffer so that such events are
rare I guess.

In the end this seems to just take the ummunotify code I have, and make
it be a new type of perf counter instead of a character special device.
I'd actually be OK with that, since having an oddball new char dev
interface is not particularly nice. But on the other hand just
multiplexing a new type of thing under perf events is not all that much
better. What do you think?

Thanks,
  Roland
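The concern about the filter expression growing with the number of regions can be made concrete with a rough sketch. Everything here is an assumption made for illustration: the field names `start` and `end`, the expression syntax, and `build_filter()` itself are hypothetical and are not the grammar of the trace filter patches mentioned above. The sketch only shows that one string covering N ranges grows linearly with N and has to be rebuilt whenever a single range is added or removed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct range {
	uint64_t start, end;	/* half-open range [start, end) */
};

/*
 * Write an expression that matches any event overlapping one of the n
 * ranges. Like snprintf(), this returns the length the full expression
 * needs (excluding the terminating NUL) and truncates the output if buf
 * is too small; pass buf == NULL and size == 0 to measure only.
 */
static size_t build_filter(char *buf, size_t size,
			   const struct range *r, size_t n)
{
	size_t len = 0;

	assert(r != NULL || n == 0);
	if (size > 0)
		buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(len < size ? buf + len : NULL,
				 len < size ? size - len : 0,
				 "%s(start < 0x%" PRIx64 " && end > 0x%" PRIx64 ")",
				 i ? " || " : "", r[i].end, r[i].start);
		if (w < 0)
			return 0;	/* encoding error from snprintf() */
		len += (size_t)w;
	}
	return len;
}
```

In this made-up syntax a single range costs about 30 characters, so a few hundred ranges already need a string of several kilobytes, all of which is re-sent for every change. A per-range add/remove call avoids that.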
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-02 16:32 ` Roland Dreier
@ 2009-10-02 20:45   ` Pavel Machek
From: Pavel Machek @ 2009-10-02 20:45 UTC
To: Roland Dreier
Cc: Ingo Molnar, Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds

Hi!

> In the end this seems to just take the ummunotify code I have, and make
> it be a new type of perf counter instead of a character special device.
> I'd actually be OK with that, since having an oddball new char dev
> interface is not particularly nice. But on the other hand just
> multiplexing a new type of thing under perf events is not all that much
> better. What do you think?

I really hate the strange character device. So if you can hide it in
tracing infrastructure, I'd certainly like that.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-02 16:32 ` Roland Dreier
@ 2009-10-07 22:34   ` Roland Dreier
From: Roland Dreier @ 2009-10-07 22:34 UTC
To: Ingo Molnar
Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds

 > So I looked a little deeper into this, and I don't think (even with the
 > filtering extensions) that perf events are directly applicable to this
 > problem. The first issue is that, assuming I'm understanding the
 > comment in perf_event.c:
 >
 > 	/*
 > 	 * Raw tracepoint data is a severe data leak, only allow root to
 > 	 * have these.
 > 	 */
 >
 > currently tracepoints can only be used by privileged processes. A key
 > feature of ummunotify is that ordinary unprivileged processes can use it.
 >
 > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
 > as a way of letting unprivileged userspace get access to just MMU events
 > for their own process? Clearly this touches core infrastructure and is
 > not as simple as just adding two tracepoints.
 >
 > Then, assuming we have some way to create an "MMU notifier" perf event,
 > we need a way for userspace to specify which address ranges it would
 > like events for (I don't think the string filter expression used by
 > existing trace filtering works, because if userspace is looking at a few
 > hundred regions, then the size of the filtering expression explodes, and
 > adding or removing a single range becomes a pain). So I guess a new
 > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
 >
 > I think filtering is needed, because otherwise events for ranges that
 > are not of interest are just a waste of resources to generate and
 > process, and make losing good events because of overflow much more
 > likely.
 >
 > We still have the problem of lost events if the mmap buffer overflows,
 > but userspace should be able to size the buffer so that such events are
 > rare I guess.
 >
 > In the end this seems to just take the ummunotify code I have, and make
 > it be a new type of perf counter instead of a character special device.
 > I'd actually be OK with that, since having an oddball new char dev
 > interface is not particularly nice. But on the other hand just
 > multiplexing a new type of thing under perf events is not all that much
 > better. What do you think?

Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
ummunotify? To me it looks like a wash -- the main difference is how
userspace gets the magic ummunotify file descriptor, either by
open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
but pretty much everything else stays pretty much the same in terms of
how much kernel code is involved. We do reuse the perf events mmap
buffer code but I think that ends up being more complicated than
returning events via read().

Anyway, before I spend the time converting over to the new
infrastructure and causing the MPI guys to churn their code, I'd like to
make sure that this is what you guys have in mind.

(By the way, after thinking about this more, I really do think that
filtering events by address range is a must-have -- with filtering,
userspace can map sufficient buffer space to avoid losing events for a
given number of regions; without filtering, events might get lost just
because of invalidate events for ranges userspace didn't even care about)

Thanks,
  Roland
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-07 22:34 ` Roland Dreier
@ 2009-10-12 17:33   ` Peter Zijlstra
From: Peter Zijlstra @ 2009-10-12 17:33 UTC
To: Roland Dreier
Cc: Ingo Molnar, Pavel Machek, linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds

On Wed, 2009-10-07 at 15:34 -0700, Roland Dreier wrote:
> > So I looked a little deeper into this, and I don't think (even with the
> > filtering extensions) that perf events are directly applicable to this
> > problem. The first issue is that, assuming I'm understanding the
> > comment in perf_event.c:
> >
> > 	/*
> > 	 * Raw tracepoint data is a severe data leak, only allow root to
> > 	 * have these.
> > 	 */
> >
> > currently tracepoints can only be used by privileged processes. A key
> > feature of ummunotify is that ordinary unprivileged processes can use it.
> >
> > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
> > as a way of letting unprivileged userspace get access to just MMU events
> > for their own process? Clearly this touches core infrastructure and is
> > not as simple as just adding two tracepoints.
> >
> > Then, assuming we have some way to create an "MMU notifier" perf event,
> > we need a way for userspace to specify which address ranges it would
> > like events for (I don't think the string filter expression used by
> > existing trace filtering works, because if userspace is looking at a few
> > hundred regions, then the size of the filtering expression explodes, and
> > adding or removing a single range becomes a pain). So I guess a new
> > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
> >
> > I think filtering is needed, because otherwise events for ranges that
> > are not of interest are just a waste of resources to generate and
> > process, and make losing good events because of overflow much more
> > likely.
> >
> > We still have the problem of lost events if the mmap buffer overflows,
> > but userspace should be able to size the buffer so that such events are
> > rare I guess.
> >
> > In the end this seems to just take the ummunotify code I have, and make
> > it be a new type of perf counter instead of a character special device.
> > I'd actually be OK with that, since having an oddball new char dev
> > interface is not particularly nice. But on the other hand just
> > multiplexing a new type of thing under perf events is not all that much
> > better. What do you think?
>
> Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
> plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
> ummunotify? To me it looks like a wash -- the main difference is how
> userspace gets the magic ummunotify file descriptor, either by
> open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
> but pretty much everything else stays pretty much the same in terms of
> how much kernel code is involved. We do reuse the perf events mmap
> buffer code but I think that ends up being more complicated than
> returning events via read().
>
> Anyway, before I spend the time converting over to the new
> infrastructure and causing the MPI guys to churn their code, I'd like to
> make sure that this is what you guys have in mind.
>
> (By the way, after thinking about this more, I really do think that
> filtering events by address range is a must-have -- with filtering,
> userspace can map sufficient buffer space to avoid losing events for a
> given number of regions; without filtering, events might get lost just
> because of invalidate events for ranges userspace didn't even care about)

I think something like

  PERF_TYPE_SOFTWARE, PERF_COUNT_SW_MUNMAP + $filter

or

  PERF_TYPE_TRACEPOINT, //events/vm/munmap/id + $filter

As for the read/poll issue, I think we can do something like
PERF_FORMAT_BLOCK which would make read() block when ->count hasn't
changed, and make poll() work without requiring a mmap().

As to filter, we can do two things, add a simple single range filter to
perf_event_attr, which is something ia64 has hardware support for IIRC,
or we can possibly use this trace filter muck.

Would something like that be sufficient?

With such events only generating a wakeup (poll) when the unmap actually
happens, you'd not even need an mmap() buffer to keep up with that.
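The behaviour proposed above (read() blocks while the count is unchanged, poll() works without an mmap() buffer) can be approximated in userspace with an existing Linux primitive. The sketch below uses eventfd(2) purely as a stand-in for a counter-style file descriptor; it is not perf events, PERF_FORMAT_BLOCK is only a proposal in this message, and the `counter_*` helper names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Returns 1 if the counter behind fd is non-zero (fd is readable),
 * 0 if the timeout expired first, and -1 on error.
 */
static int counter_changed(int fd, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = POLLIN };
	int n = poll(&p, 1, timeout_ms);

	if (n < 0)
		return -1;
	return (n > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Add n to the counter; stands in for the kernel recording n events. */
static int counter_add(int fd, uint64_t n)
{
	return write(fd, &n, sizeof n) == (ssize_t)sizeof n ? 0 : -1;
}

/*
 * Read the counter and reset it to zero. On a blocking eventfd this
 * call sleeps while the counter is zero, i.e. while nothing changed.
 */
static int counter_take(int fd, uint64_t *out)
{
	assert(out != NULL);
	return read(fd, out, sizeof *out) == (ssize_t)sizeof *out ? 0 : -1;
}
```

A caller would create the descriptor with `eventfd(0, 0)`, poll it alongside its other descriptors, and call `counter_take()` only once `counter_changed()` reports activity. Several additions made before the read are returned as one sum, which matches the "did anything happen" question the MPI use case asks.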
* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 14:24 ` [ofa-general] Roland Dreier
@ 2009-09-17 14:43   ` Peter Zijlstra
From: Peter Zijlstra @ 2009-09-17 14:43 UTC
To: Roland Dreier
Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm, torvalds, Ingo Molnar

On Thu, 2009-09-17 at 07:24 -0700, Roland Dreier wrote:
> > Anton Blanchard suggested a while back that this might be integrated
> > with perf-counters, since perf-counters already does mmap() tracking and
> > also provides events through an mmap()'ed buffer.
> >
> > Has anybody looked into this?
>
> I didn't see the original suggestion. Certainly hooking in to existing
> infrastructure for user/kernel communication would be good.
>
> The fit doesn't seem great to me, although I am rather naive about perf
> counters. The problem that ummunotify is trying to solve is to let an
> app say 'for these 1000 address ranges (that possibly only cover a small
> part of my total address space), please let me know when the mappings
> are invalidated for any reason'.
>
> So getting those events in the kernel is no problem -- we have the MMU
> notifier hooks that tell us exactly what we need to know. The issue is
> purely the way userspace registers interest in address ranges, and how
> the kernel returns the events.
>
> For perf counters it seems that one would have to create a new counter
> for each address range... is that correct? And also I don't know if
> perf counter has an analog for the fast path optimization that
> ummunotify provides via a mmap'ed generation counter (a quick way for
> userspace to see 'nothing happened since last time you checked').

You're right in that perf-counter currently doesn't provide a way to
specify these ranges, we simply track all mmap() traffic.

The thing is that mmap() data is basically a side channel. For your
usage you'd basically have to open a NOP counter and only observe the
mmap data. We could look at ways of adding ranges..

We do have a means of detecting if new data is available, we keep a data
head index. If that moves, you've got new stuff.