All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] please pull ummunotify
@ 2009-09-11  4:38 ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  4:38 UTC (permalink / raw)
  To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Linus, please consider pulling from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementions (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings.  My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:

http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/

His summary of ummunotify:

  "It’s elegant, doesn’t require strange linker tricks, and seems to
   work in all cases.  Yay!"

This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks.  Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).

Roland Dreier (1):
      ummunotify: Userspace support for MMU notifications

 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++++++
 9 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [GIT PULL] please pull ummunotify
@ 2009-09-11  4:38 ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  4:38 UTC (permalink / raw)
  To: torvalds, akpm, jsquyres; +Cc: linux-rdma, general, linux-kernel

Linus, please consider pulling from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementions (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings.  My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:

http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/

His summary of ummunotify:

  "It’s elegant, doesn’t require strange linker tricks, and seems to
   work in all cases.  Yay!"

This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks.  Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).

Roland Dreier (1):
      ummunotify: Userspace support for MMU notifications

 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++++++
 9 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
@ 2009-09-11  5:56     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  5:56 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Roland,

> Linus, please consider pulling from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.  My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
> 
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
> 
> His summary of ummunotify:
> 
>   "It’s elegant, doesn’t require strange linker tricks, and seems to
>    work in all cases.  Yay!"
> 
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I'm sorry. I haven't review this code and I didn't track this discussion
carefully. but I have one stupid question. May I ask? 
Can I this version already solved fork() + COW issue? if so, could you
please explain what happen at fork. Obviously RDMA point to either parent
or child page, not both. but Corrent COW rule is, first touch process
get copyed page and other process still own original page. I think it's 
unpecected behavior form RDMA.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  5:56     ` KOSAKI Motohiro
  0 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  5:56 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro, torvalds, akpm, jsquyres, linux-rdma, general,
	linux-kernel

Hi Roland,

> Linus, please consider pulling from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.  My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
> 
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
> 
> His summary of ummunotify:
> 
>   "It’s elegant, doesn’t require strange linker tricks, and seems to
>    work in all cases.  Yay!"
> 
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I'm sorry. I haven't review this code and I didn't track this discussion
carefully. but I have one stupid question. May I ask? 
Can I this version already solved fork() + COW issue? if so, could you
please explain what happen at fork. Obviously RDMA point to either parent
or child page, not both. but Corrent COW rule is, first touch process
get copyed page and other process still own original page. I think it's 
unpecected behavior form RDMA.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-11  5:56     ` KOSAKI Motohiro
@ 2009-09-11  6:03       ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  6:03 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds


 > Can I this version already solved fork() + COW issue? if so, could you
 > please explain what happen at fork. Obviously RDMA point to either parent
 > or child page, not both. but Corrent COW rule is, first touch process
 > get copyed page and other process still own original page. I think it's 
 > unpecected behavior form RDMA.

No, ummunotify doesn't really help that much with fork() + COW.  If a
parent forks and then touches pages that are actively in use for RDMA,
then of course they get COWed and RDMA goes to the wrong memory (from
the point of view of the parent).

ummunotify does deal with the case where a process forks and touches
memory that was used for RDMA but no longer is -- in that case, the MPI
library has a chance to flush its registration cache because it will get
a ummunotify event invalidating the old mapping.

The real purpose of ummunotify is to allow MPI implementations to cache
registrations, even when the MPI library is used with an application
that does funny things for allocation (mmap()/munmap() or brk(), etc).

 - Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  6:03       ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  6:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel


 > Can I this version already solved fork() + COW issue? if so, could you
 > please explain what happen at fork. Obviously RDMA point to either parent
 > or child page, not both. but Corrent COW rule is, first touch process
 > get copyed page and other process still own original page. I think it's 
 > unpecected behavior form RDMA.

No, ummunotify doesn't really help that much with fork() + COW.  If a
parent forks and then touches pages that are actively in use for RDMA,
then of course they get COWed and RDMA goes to the wrong memory (from
the point of view of the parent).

ummunotify does deal with the case where a process forks and touches
memory that was used for RDMA but no longer is -- in that case, the MPI
library has a chance to flush its registration cache because it will get
a ummunotify event invalidating the old mapping.

The real purpose of ummunotify is to allow MPI implementations to cache
registrations, even when the MPI library is used with an application
that does funny things for allocation (mmap()/munmap() or brk(), etc).

 - Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:03       ` Roland Dreier
@ 2009-09-11  6:11           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  6:11 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi

Thank you explanation.

> 
>  > Can I this version already solved fork() + COW issue? if so, could you
>  > please explain what happen at fork. Obviously RDMA point to either parent
>  > or child page, not both. but Corrent COW rule is, first touch process
>  > get copyed page and other process still own original page. I think it's 
>  > unpecected behavior form RDMA.
> 
> No, ummunotify doesn't really help that much with fork() + COW.  If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).

So, Can we assume OpenMPI user process doesn't such thing?

Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
sure all program in the world do that.

> ummunotify does deal with the case where a process forks and touches
> memory that was used for RDMA but no longer is -- in that case, the MPI
> library has a chance to flush its registration cache because it will get
> a ummunotify event invalidating the old mapping.
> 
> The real purpose of ummunotify is to allow MPI implementations to cache
> registrations, even when the MPI library is used with an application
> that does funny things for allocation (mmap()/munmap() or brk(), etc).

Yup, that's very worth.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  6:11           ` KOSAKI Motohiro
  0 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  6:11 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro, torvalds, akpm, jsquyres, linux-rdma, general,
	linux-kernel

Hi

Thank you explanation.

> 
>  > Can I this version already solved fork() + COW issue? if so, could you
>  > please explain what happen at fork. Obviously RDMA point to either parent
>  > or child page, not both. but Corrent COW rule is, first touch process
>  > get copyed page and other process still own original page. I think it's 
>  > unpecected behavior form RDMA.
> 
> No, ummunotify doesn't really help that much with fork() + COW.  If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).

So, Can we assume OpenMPI user process doesn't such thing?

Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
sure all program in the world do that.

> ummunotify does deal with the case where a process forks and touches
> memory that was used for RDMA but no longer is -- in that case, the MPI
> library has a chance to flush its registration cache because it will get
> a ummunotify event invalidating the old mapping.
> 
> The real purpose of ummunotify is to allow MPI implementations to cache
> registrations, even when the MPI library is used with an application
> that does funny things for allocation (mmap()/munmap() or brk(), etc).

Yup, that's very worth.




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:03       ` Roland Dreier
  (?)
  (?)
@ 2009-09-11  6:15       ` Brice Goglin
       [not found]         ` <4AA9EAF7.5010401-MZpvjPyXg2s@public.gmane.org>
  -1 siblings, 1 reply; 82+ messages in thread
From: Brice Goglin @ 2009-09-11  6:15 UTC (permalink / raw)
  To: Roland Dreier
  Cc: KOSAKI Motohiro, torvalds, akpm, jsquyres, linux-rdma, general,
	linux-kernel

Roland Dreier wrote:
>  > Can I this version already solved fork() + COW issue? if so, could you
>  > please explain what happen at fork. Obviously RDMA point to either parent
>  > or child page, not both. but Corrent COW rule is, first touch process
>  > get copyed page and other process still own original page. I think it's 
>  > unpecected behavior form RDMA.
>
> No, ummunotify doesn't really help that much with fork() + COW.  If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).
>   

My understanding of the code is that fork will end-up calling
copy_page_range() on all VMA, and copy_page_range() calls
mmu_notifier_invalidate_range_start() if  is_cow_mapping() is true,
which should be the case here. So you should get some invalidate events
on fork.

Brice

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:15       ` Brice Goglin
@ 2009-09-11  6:21             ` KOSAKI Motohiro
  0 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  6:21 UTC (permalink / raw)
  To: Brice Goglin
  Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, Roland Dreier,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

> Roland Dreier wrote:
> >  > Can I this version already solved fork() + COW issue? if so, could you
> >  > please explain what happen at fork. Obviously RDMA point to either parent
> >  > or child page, not both. but Corrent COW rule is, first touch process
> >  > get copyed page and other process still own original page. I think it's 
> >  > unpecected behavior form RDMA.
> >
> > No, ummunotify doesn't really help that much with fork() + COW.  If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
> >   
> 
> My understanding of the code is that fork will end-up calling
> copy_page_range() on all VMA, and copy_page_range() calls
> mmu_notifier_invalidate_range_start() if  is_cow_mapping() is true,
> which should be the case here. So you should get some invalidate events
> on fork.

Worried...
Anybody haven't test fork() case yet???



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  6:21             ` KOSAKI Motohiro
  0 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-11  6:21 UTC (permalink / raw)
  To: Brice Goglin
  Cc: kosaki.motohiro, Roland Dreier, torvalds, akpm, jsquyres,
	linux-rdma, general, linux-kernel

> Roland Dreier wrote:
> >  > Can I this version already solved fork() + COW issue? if so, could you
> >  > please explain what happen at fork. Obviously RDMA point to either parent
> >  > or child page, not both. but Corrent COW rule is, first touch process
> >  > get copyed page and other process still own original page. I think it's 
> >  > unpecected behavior form RDMA.
> >
> > No, ummunotify doesn't really help that much with fork() + COW.  If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
> >   
> 
> My understanding of the code is that fork will end-up calling
> copy_page_range() on all VMA, and copy_page_range() calls
> mmu_notifier_invalidate_range_start() if  is_cow_mapping() is true,
> which should be the case here. So you should get some invalidate events
> on fork.

Worried...
Anybody haven't test fork() case yet???




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:15       ` Brice Goglin
@ 2009-09-11  6:22             ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  6:22 UTC (permalink / raw)
  To: Brice Goglin
  Cc: KOSAKI Motohiro, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA


 > My understanding of the code is that fork will end-up calling
 > copy_page_range() on all VMA, and copy_page_range() calls
 > mmu_notifier_invalidate_range_start() if  is_cow_mapping() is true,
 > which should be the case here. So you should get some invalidate events
 > on fork.

Yes, I agree (that's what the second half of my email tried to say).

However, that doesn't help if the parent process is actively doing RDMA
on the range being invalidated -- the MPI library or whatever will get
the invalidate event via ummunotify, but what can it do?  The event is
basically saying "your data is going to the wrong place" and I don't see
what useful thing MPI could do with that.

As I said, it does mean that MPI can invalidate cached registrations for
COWed memory, which might be useful in case a parent forks and then
touches memory it used to use for RDMA, but I think that's the easier
part of the fork/COW problem.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  6:22             ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11  6:22 UTC (permalink / raw)
  To: Brice Goglin
  Cc: KOSAKI Motohiro, torvalds, akpm, jsquyres, linux-rdma, general,
	linux-kernel


 > My understanding of the code is that fork will end-up calling
 > copy_page_range() on all VMA, and copy_page_range() calls
 > mmu_notifier_invalidate_range_start() if  is_cow_mapping() is true,
 > which should be the case here. So you should get some invalidate events
 > on fork.

Yes, I agree (that's what the second half of my email tried to say).

However, that doesn't help if the parent process is actively doing RDMA
on the range being invalidated -- the MPI library or whatever will get
the invalidate event via ummunotify, but what can it do?  The event is
basically saying "your data is going to the wrong place" and I don't see
what useful thing MPI could do with that.

As I said, it does mean that MPI can invalidate cached registrations for
COWed memory, which might be useful in case a parent forks and then
touches memory it used to use for RDMA, but I think that's the easier
part of the fork/COW problem.

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:22             ` Roland Dreier
@ 2009-09-11  6:40               ` Jason Gunthorpe
  -1 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-11  6:40 UTC (permalink / raw)
  To: Roland Dreier
  Cc: KOSAKI Motohiro, linux-rdma, linux-kernel, general, Brice Goglin,
	akpm, torvalds

On Thu, Sep 10, 2009 at 11:22:20PM -0700, Roland Dreier wrote:

> As I said, it does mean that MPI can invalidate cached registrations for
> COWed memory, which might be useful in case a parent forks and then
> touches memory it used to use for RDMA, but I think that's the easier
> part of the fork/COW problem.

What happens to all the other IB resources (PD, CQ, QP, etc) on fork?

AFAIK, pretty much by design the IB stack cannot/does not duplicate
these objects.

The natural consequence is that a PD is always associated with a
single process at a time, thus a memory registration which is
associated with a PD must also be associated with a single process.

So.. What is the problem with fork? The semantics of what should
happen seem natural enough to me, the PD doesn't get copied to the
child, so the MR stays with the parent. COW events on the pinned
region must be resolved so that the physical page stays with the
process that has pinned it - the pin is logically released in the
child because the MR doesn't exist because the PD doesn't exist.

Is this a general problem with the MR mechanism? If I
mmap(MAP_SHARED|MAP_READONLY) and someone mmaps(MAP_PRIVATE|MAP_WRITE)
on the same file I can generate COW events - will this make RDMAs go
randomly too?? 

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-11  6:40               ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-11  6:40 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Brice Goglin, KOSAKI Motohiro, linux-rdma, linux-kernel, general,
	akpm, torvalds

On Thu, Sep 10, 2009 at 11:22:20PM -0700, Roland Dreier wrote:

> As I said, it does mean that MPI can invalidate cached registrations for
> COWed memory, which might be useful in case a parent forks and then
> touches memory it used to use for RDMA, but I think that's the easier
> part of the fork/COW problem.

What happens to all the other IB resources (PD, CQ, QP, etc) on fork?

AFAIK, pretty much by design the IB stack cannot/does not duplicate
these objects.

The natural consequence is that a PD is always associated with a
single process at a time, thus a memory registration which is
associated with a PD must also be associated with a single process.

So.. What is the problem with fork? The semantics of what should
happen seem natural enough to me, the PD doesn't get copied to the
child, so the MR stays with the parent. COW events on the pinned
region must be resolved so that the physical page stays with the
process that has pinned it - the pin is logically released in the
child because the MR doesn't exist because the PD doesn't exist.

Is this a general problem with the MR mechanism? If I
mmap(MAP_SHARED|MAP_READONLY) and someone mmaps(MAP_PRIVATE|MAP_WRITE)
on the same file I can generate COW events - will this make RDMAs go
randomly too?? 

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:11           ` KOSAKI Motohiro
@ 2009-09-11 16:42               ` Gleb Natapov
  -1 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2009-09-11 16:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland Dreier, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 11, 2009 at 03:11:36PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> Thank you explanation.
> 
> > 
> >  > Can I this version already solved fork() + COW issue? if so, could you
> >  > please explain what happen at fork. Obviously RDMA point to either parent
> >  > or child page, not both. but Corrent COW rule is, first touch process
> >  > get copyed page and other process still own original page. I think it's 
> >  > unpecected behavior form RDMA.
> > 
> > No, ummunotify doesn't really help that much with fork() + COW.  If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
> 
> So, Can we assume OpenMPI user process doesn't such thing?
> 
> Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
> sure all program in the world do that.
> 
MPI (or is it libibverbs?) marks all registered memory as DONTFORK.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-11 16:42               ` Gleb Natapov
  0 siblings, 0 replies; 82+ messages in thread
From: Gleb Natapov @ 2009-09-11 16:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland Dreier, torvalds, akpm, jsquyres, linux-rdma, general,
	linux-kernel

On Fri, Sep 11, 2009 at 03:11:36PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> Thank you explanation.
> 
> > 
> >  > Can I this version already solved fork() + COW issue? if so, could you
> >  > please explain what happen at fork. Obviously RDMA point to either parent
> >  > or child page, not both. but Corrent COW rule is, first touch process
> >  > get copyed page and other process still own original page. I think it's 
> >  > unpecected behavior form RDMA.
> > 
> > No, ummunotify doesn't really help that much with fork() + COW.  If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
> 
> So, Can we assume OpenMPI user process doesn't such thing?
> 
> Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
> sure all program in the world do that.
> 
MPI (or is it libibverbs?) marks all registered memory as DONTFORK.

--
			Gleb.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-11  6:40               ` Jason Gunthorpe
@ 2009-09-11 16:58                   ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11 16:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brice Goglin, KOSAKI Motohiro, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 > So.. What is the problem with fork? The semantics of what should
 > happen seem natural enough to me, the PD doesn't get copied to the
 > child, so the MR stays with the parent. COW events on the pinned
 > region must be resolved so that the physical page stays with the
 > process that has pinned it - the pin is logically released in the
 > child because the MR doesn't exist because the PD doesn't exist.

This is getting away from the problem that ummunotify is solving, but
handling a COW fault generated by the parent by doing the copy in the
child seems like a pretty major, tricky change to make.  The child may
have forked 100 more times in the meantime, meaning we now have to
change 101 memory maps ... the cost of page faults goes through the roof
probably...

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-11 16:58                   ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-11 16:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Brice Goglin, KOSAKI Motohiro, linux-rdma, linux-kernel, general,
	akpm, torvalds


 > So.. What is the problem with fork? The semantics of what should
 > happen seem natural enough to me, the PD doesn't get copied to the
 > child, so the MR stays with the parent. COW events on the pinned
 > region must be resolved so that the physical page stays with the
 > process that has pinned it - the pin is logically released in the
 > child because the MR doesn't exist because the PD doesn't exist.

This is getting away from the problem that ummunotify is solving, but
handling a COW fault generated by the parent by doing the copy in the
child seems like a pretty major, tricky change to make.  The child may
have forked 100 more times in the meantime, meaning we now have to
change 101 memory maps ... the cost of page faults goes through the roof
probably...

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-11 16:58                   ` Roland Dreier
@ 2009-09-15  7:03                       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-15  7:03 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, Jason Gunthorpe,
	Brice Goglin, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

> 
>  > So.. What is the problem with fork? The semantics of what should
>  > happen seem natural enough to me, the PD doesn't get copied to the
>  > child, so the MR stays with the parent. COW events on the pinned
>  > region must be resolved so that the physical page stays with the
>  > process that has pinned it - the pin is logically released in the
>  > child because the MR doesn't exist because the PD doesn't exist.
> 
> This is getting away from the problem that ummunotify is solving, but
> handling a COW fault generated by the parent by doing the copy in the
> child seems like a pretty major, tricky change to make.  The child may
> have forked 100 more times in the meantime, meaning we now have to
> change 101 memory maps ... the cost of page faults goes through the roof
> probably...

Ummm...
Perhaps my first question was wrong. I'm not intent to NAK your patch.
I merely want to know your patch detail...

ok, I ask you again as another word.

 - I guess you have your MPI implementaion w/ ummunotify, right?
 - I guess you have test sevaral pattern, right?
   if so, can we see your test result?
 - I think you can explain your MPI advantage/disadvantage against
   current OpenMPI (or mpich et al).
 - I guess your patch dramatically improve MPI implementaion, but
   it's not free. it request some limitation to MPI application, right?
 - I imagine multi thread and fork. Is there another linmitaion?
 - In past discuttion, you said ummunotify user should not use
   multi threading. you also think user should not fork?



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-15  7:03                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 82+ messages in thread
From: KOSAKI Motohiro @ 2009-09-15  7:03 UTC (permalink / raw)
  To: Roland Dreier
  Cc: kosaki.motohiro, Jason Gunthorpe, Brice Goglin, linux-rdma,
	linux-kernel, general, akpm, torvalds

> 
>  > So.. What is the problem with fork? The semantics of what should
>  > happen seem natural enough to me, the PD doesn't get copied to the
>  > child, so the MR stays with the parent. COW events on the pinned
>  > region must be resolved so that the physical page stays with the
>  > process that has pinned it - the pin is logically released in the
>  > child because the MR doesn't exist because the PD doesn't exist.
> 
> This is getting away from the problem that ummunotify is solving, but
> handling a COW fault generated by the parent by doing the copy in the
> child seems like a pretty major, tricky change to make.  The child may
> have forked 100 more times in the meantime, meaning we now have to
> change 101 memory maps ... the cost of page faults goes through the roof
> probably...

Ummm...
Perhaps my first question was wrong. I'm not intent to NAK your patch.
I merely want to know your patch detail...

ok, I ask you again as another word.

 - I guess you have your MPI implementaion w/ ummunotify, right?
 - I guess you have test sevaral pattern, right?
   if so, can we see your test result?
 - I think you can explain your MPI advantage/disadvantage against
   current OpenMPI (or mpich et al).
 - I guess your patch dramatically improve MPI implementaion, but
   it's not free. it request some limitation to MPI application, right?
 - I imagine multi thread and fork. Is there another linmitaion?
 - In past discuttion, you said ummunotify user should not use
   multi threading. you also think user should not fork?




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15  7:03                       ` KOSAKI Motohiro
@ 2009-09-15  8:27                         ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-15  8:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jason Gunthorpe, Brice Goglin, linux-rdma, linux-kernel, general,
	akpm, torvalds, jsquyres


 >  - I guess you have your MPI implementaion w/ ummunotify, right?

Yes, Jeff Squyres (cc'ed) has an Open MPI prototype (mercurial tree at
http://bitbucket.org/jsquyres/ummunot/).

 >  - I guess you have test sevaral pattern, right?
 >    if so, can we see your test result?

Open MPI has a pretty extensive automated test fabric -- I don't have a
link handy but I believe all the tests that pass with unmodified Open
MPI currently still pass with ummunotify.  Maybe Jeff has a link.

 >  - I think you can explain your MPI advantage/disadvantage against
 >    current OpenMPI (or mpich et al).

The advantage is as Jeff explained in his blog post
(http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/),
namely the performance improvement of memory registration caching
without the reliability problems caused by previous approaches to
caching such as trying to hook malloc etc (which are fragile because the
great diversity of MPI-using codes find ways to mess up all previous
userspace-only approaches).

 >  - I guess your patch dramatically improve MPI implementaion, but
 >    it's not free. it request some limitation to MPI application, right?

Not that I know of, beyond already existing limitations.

 >  - I imagine multi thread and fork. Is there another linmitaion?

There are no new limitations on multi-threaded codes or on use of fork
that I know of.  Of course, buggy code that does something like passing
a buffer to MPI in one thread and then freeing that buffer from another
thread before MPI is done with it is still buggy; but ummunotify
actually increases the ability of the MPI implementation to detect such
bugs and give useful diagnostic information.

 >  - In past discuttion, you said ummunotify user should not use
 >    multi threading. you also think user should not fork?

I don't recall where I said ummunotify users should not be
multithreaded.  I don't know of any problem with that.

Also code using ummunotify can fork -- ummunotify simply does not fix
issues with copy-on-write for buffers that are in use, just as it does
not fix multithreaded code that has a race between using a buffer and
freeing the same buffer.

Hope this clarifies things.

 - Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-15  8:27                         ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-15  8:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Jason Gunthorpe, Brice Goglin, linux-rdma, linux-kernel, general,
	akpm, torvalds, jsquyres


 >  - I guess you have your MPI implementaion w/ ummunotify, right?

Yes, Jeff Squyres (cc'ed) has an Open MPI prototype (mercurial tree at
http://bitbucket.org/jsquyres/ummunot/).

 >  - I guess you have test sevaral pattern, right?
 >    if so, can we see your test result?

Open MPI has a pretty extensive automated test fabric -- I don't have a
link handy but I believe all the tests that pass with unmodified Open
MPI currently still pass with ummunotify.  Maybe Jeff has a link.

 >  - I think you can explain your MPI advantage/disadvantage against
 >    current OpenMPI (or mpich et al).

The advantage is as Jeff explained in his blog post
(http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/),
namely the performance improvement of memory registration caching
without the reliability problems caused by previous approaches to
caching such as trying to hook malloc etc (which are fragile because the
great diversity of MPI-using codes find ways to mess up all previous
userspace-only approaches).

 >  - I guess your patch dramatically improve MPI implementaion, but
 >    it's not free. it request some limitation to MPI application, right?

Not that I know of, beyond already existing limitations.

 >  - I imagine multi thread and fork. Is there another linmitaion?

There are no new limitations on multi-threaded codes or on use of fork
that I know of.  Of course, buggy code that does something like passing
a buffer to MPI in one thread and then freeing that buffer from another
thread before MPI is done with it is still buggy; but ummunotify
actually increases the ability of the MPI implementation to detect such
bugs and give useful diagnostic information.

 >  - In past discuttion, you said ummunotify user should not use
 >    multi threading. you also think user should not fork?

I don't recall where I said ummunotify users should not be
multithreaded.  I don't know of any problem with that.

Also code using ummunotify can fork -- ummunotify simply does not fix
issues with copy-on-write for buffers that are in use, just as it does
not fix multithreaded code that has a race between using a buffer and
freeing the same buffer.

Hope this clarifies things.

 - Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
  (?)
@ 2009-09-15 11:34 ` Pavel Machek
       [not found]   ` <20090915113434.GF1328-+ZI9xUNit7I@public.gmane.org>
  -1 siblings, 1 reply; 82+ messages in thread
From: Pavel Machek @ 2009-09-15 11:34 UTC (permalink / raw)
  To: Roland Dreier; +Cc: torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel

Hi!

> Linus, please consider pulling from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.  My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
> 
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
> 
> His summary of ummunotify:
> 
>   "It???s elegant, doesn???t require strange linker tricks, and seems to
>    work in all cases.  Yay!"
> 
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I don't remember seeing discussion of this on lkml. Yes it  is in
-next...

Basically it allows app to 'trace itself'? ...with interesting mmap()
interface, exporting int to userspace, hoping it behaves atomically...? 
	
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15  7:03                       ` KOSAKI Motohiro
@ 2009-09-15 12:38                         ` Jeff Squyres
  -1 siblings, 0 replies; 82+ messages in thread
From: Jeff Squyres @ 2009-09-15 12:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-rdma, Roland Dreier (rdreier),
	linux-kernel, general, Brice Goglin, akpm, torvalds

On Sep 15, 2009, at 3:03 AM, KOSAKI Motohiro wrote:

>  - I guess you have your MPI implementaion w/ ummunotify, right?
>  - I guess you have test sevaral pattern, right?
>    if so, can we see your test result?
>


Roland's answers to the rest of these questions were spot-on, so I  
thought I'd just throw in a quick reply to the above questions: yes,  
we have a prototype Open MPI implementation with code that uses  
ummunotify (http://bitbucket.org/jsquyres/ummunot/).  I just finished  
fixing a high-priority (but unrelated) bug in Open MPI, so merging the  
prototype ummunotify code into the upstream Open MPI repository is now  
at the top of my priority list.

We have done quite a bit of testing with ummunotify, but since the  
code is not yet in the Open MPI mainline, most of the testing has been  
manual (not through our automated testing system).  As far as we can  
tell, everything is working properly with Open MPI + ummunotify.  We  
also anticipate that other MPI implementations will be able to use  
ummunotify, potentially using Open MPI as a reference ummunotify  
implementation.

FWIW: we went through a bunch of design and implementation iterations  
with Roland to get code that everyone was happy with:

- Roland likes it (and anticipated that the kernel community would be  
receptive to)
- we like it
- performs correctly

Hope that helps.

-- 
Jeff Squyres
jsquyres@cisco.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-15 12:38                         ` Jeff Squyres
  0 siblings, 0 replies; 82+ messages in thread
From: Jeff Squyres @ 2009-09-15 12:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland Dreier (rdreier),
	Jason Gunthorpe, Brice Goglin, linux-rdma, linux-kernel, general,
	akpm, torvalds

On Sep 15, 2009, at 3:03 AM, KOSAKI Motohiro wrote:

>  - I guess you have your MPI implementaion w/ ummunotify, right?
>  - I guess you have test sevaral pattern, right?
>    if so, can we see your test result?
>


Roland's answers to the rest of these questions were spot-on, so I  
thought I'd just throw in a quick reply to the above questions: yes,  
we have a prototype Open MPI implementation with code that uses  
ummunotify (http://bitbucket.org/jsquyres/ummunot/).  I just finished  
fixing a high-priority (but unrelated) bug in Open MPI, so merging the  
prototype ummunotify code into the upstream Open MPI repository is now  
at the top of my priority list.

We have done quite a bit of testing with ummunotify, but since the  
code is not yet in the Open MPI mainline, most of the testing has been  
manual (not through our automated testing system).  As far as we can  
tell, everything is working properly with Open MPI + ummunotify.  We  
also anticipate that other MPI implementations will be able to use  
ummunotify, potentially using Open MPI as a reference ummunotify  
implementation.

FWIW: we went through a bunch of design and implementation iterations  
with Roland to get code that everyone was happy with:

- Roland likes it (and anticipated that the kernel community would be  
receptive to)
- we like it
- performs correctly

Hope that helps.

-- 
Jeff Squyres
jsquyres@cisco.com


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15 11:34 ` Pavel Machek
@ 2009-09-15 14:57       ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-15 14:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 > I don't remember seeing discussion of this on lkml. Yes it  is in
 > -next...

eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
and earlier patches.

 > Basically it allows app to 'trace itself'? ...with interesting mmap()
 > interface, exporting int to userspace, hoping it behaves atomically...? 

Yes, it allows app to trace what the kernel does to memory mappings.  I
don't believe there's any real issue to atomicity of mmap'ed memory,
since userspace really just tests whether read value is == to old read
value or not.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-15 14:57       ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-15 14:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds


 > I don't remember seeing discussion of this on lkml. Yes it  is in
 > -next...

eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
and earlier patches.

 > Basically it allows app to 'trace itself'? ...with interesting mmap()
 > interface, exporting int to userspace, hoping it behaves atomically...? 

Yes, it allows app to trace what the kernel does to memory mappings.  I
don't believe there's any real issue to atomicity of mmap'ed memory,
since userspace really just tests whether read value is == to old read
value or not.

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
@ 2009-09-16 16:30     ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-16 16:30 UTC (permalink / raw)
  To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Linus,

Sorry to hassle you about this, but I would like to know where things
stand.  I know (from the reflink discussion if nothing else) that you're
definitely not bashful about telling people when their code sucks, so
this silent treatment has me really flustered.  I've been showering and
brushing my teeth and everything, honest!

Seriously, this code solves a problem that the MPI/HPC people have been
complaining about for quite a while, and if possible I'd like to get
this upstream.  Or if you have a better idea, I'm all ears...

Thanks,
  Roland

 > Linus, please consider pulling from
 > 
 >     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This tree is also available from kernel.org mirrors at:
 > 
 >     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This will get "ummunotify," a new character device that allows a
 > userspace library to register for MMU notifications; this is
 > particularly useful for MPI implementions (message passing libraries
 > used in HPC) to be able to keep track of what wacky things consumers
 > do to their memory mappings.  My colleague Jeff Squyres from the Open
 > MPI project posted a blog entry about why MPI wants this:
 > 
 > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
 > 
 > His summary of ummunotify:
 > 
 >   "It’s elegant, doesn’t require strange linker tricks, and seems to
 >    work in all cases.  Yay!"
 > 
 > This code went through several review iterations on lkml and was in
 > -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
 > think -- Andrew please correct me if I misunderstood you).
 > 
 > Roland Dreier (1):
 >       ummunotify: Userspace support for MMU notifications
 > 
 >  Documentation/Makefile                  |    3 +-
 >  Documentation/ummunotify/Makefile       |    7 +
 >  Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 >  Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 >  drivers/char/Kconfig                    |   12 +
 >  drivers/char/Makefile                   |    1 +
 >  drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 >  include/linux/Kbuild                    |    1 +
 >  include/linux/ummunotify.h              |  121 +++++++
 >  9 files changed, 1060 insertions(+), 1 deletions(-)
 >  create mode 100644 Documentation/ummunotify/Makefile
 >  create mode 100644 Documentation/ummunotify/ummunotify.txt
 >  create mode 100644 Documentation/ummunotify/umn-test.c
 >  create mode 100644 drivers/char/ummunotify.c
 >  create mode 100644 include/linux/ummunotify.h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-16 16:30     ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-16 16:30 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, jsquyres, linux-rdma, general, linux-kernel

Hi Linus,

Sorry to hassle you about this, but I would like to know where things
stand.  I know (from the reflink discussion if nothing else) that you're
definitely not bashful about telling people when their code sucks, so
this silent treatment has me really flustered.  I've been showering and
brushing my teeth and everything, honest!

Seriously, this code solves a problem that the MPI/HPC people have been
complaining about for quite a while, and if possible I'd like to get
this upstream.  Or if you have a better idea, I'm all ears...

Thanks,
  Roland

 > Linus, please consider pulling from
 > 
 >     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This tree is also available from kernel.org mirrors at:
 > 
 >     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This will get "ummunotify," a new character device that allows a
 > userspace library to register for MMU notifications; this is
 > particularly useful for MPI implementions (message passing libraries
 > used in HPC) to be able to keep track of what wacky things consumers
 > do to their memory mappings.  My colleague Jeff Squyres from the Open
 > MPI project posted a blog entry about why MPI wants this:
 > 
 > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
 > 
 > His summary of ummunotify:
 > 
 >   "It’s elegant, doesn’t require strange linker tricks, and seems to
 >    work in all cases.  Yay!"
 > 
 > This code went through several review iterations on lkml and was in
 > -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
 > think -- Andrew please correct me if I misunderstood you).
 > 
 > Roland Dreier (1):
 >       ummunotify: Userspace support for MMU notifications
 > 
 >  Documentation/Makefile                  |    3 +-
 >  Documentation/ummunotify/Makefile       |    7 +
 >  Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 >  Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 >  drivers/char/Kconfig                    |   12 +
 >  drivers/char/Makefile                   |    1 +
 >  drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 >  include/linux/Kbuild                    |    1 +
 >  include/linux/ummunotify.h              |  121 +++++++
 >  9 files changed, 1060 insertions(+), 1 deletions(-)
 >  create mode 100644 Documentation/ummunotify/Makefile
 >  create mode 100644 Documentation/ummunotify/ummunotify.txt
 >  create mode 100644 Documentation/ummunotify/umn-test.c
 >  create mode 100644 drivers/char/ummunotify.c
 >  create mode 100644 include/linux/ummunotify.h

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-16 16:30     ` Roland Dreier
@ 2009-09-16 16:40       ` Linus Torvalds
  -1 siblings, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2009-09-16 16:40 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, akpm, linux-kernel, general



On Wed, 16 Sep 2009, Roland Dreier wrote:
> 
> Sorry to hassle you about this, but I would like to know where things
> stand.  I know (from the reflink discussion if nothing else) that you're
> definitely not bashful about telling people when their code sucks, so
> this silent treatment has me really flustered.  I've been showering and
> brushing my teeth and everything, honest!

I just haven't had time to look into the issue, so I'm merging the code 
that I know I need to merge, and hopefully I'll have a breather later when 
I can actually look at code and the thread it all spawned.,

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-16 16:40       ` Linus Torvalds
  0 siblings, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2009-09-16 16:40 UTC (permalink / raw)
  To: Roland Dreier; +Cc: akpm, jsquyres, linux-rdma, general, linux-kernel



On Wed, 16 Sep 2009, Roland Dreier wrote:
> 
> Sorry to hassle you about this, but I would like to know where things
> stand.  I know (from the reflink discussion if nothing else) that you're
> definitely not bashful about telling people when their code sucks, so
> this silent treatment has me really flustered.  I've been showering and
> brushing my teeth and everything, honest!

I just haven't had time to look into the issue, so I'm merging the code 
that I know I need to merge, and hopefully I'll have a breather later when 
I can actually look at code and the thread it all spawned.,

		Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
  2009-09-11  4:38 ` Roland Dreier
@ 2009-09-17 11:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 11:30 UTC (permalink / raw)
  To: Roland Dreier
  Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	jsquyres-FYB4Gu1CFyUAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Anton Blanchard,
	Paul Mackerras, Ingo Molnar

On Thu, 2009-09-10 at 21:38 -0700, Roland Dreier wrote:
> Linus, please consider pulling from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.  My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
> 
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
> 
> His summary of ummunotify:
> 
>   "It’s elegant, doesn’t require strange linker tricks, and seems to
>    work in all cases.  Yay!"
> 
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

Anton Blanchard suggested a while back that this might be integrated
with perf-counters, since perf-counters already does mmap() tracking and
also provides events through an mmap()'ed buffer.

Has anybody looked into this?

If someone did and I missed the discussion on why it isn't appropriate,
kindly point me in the right direction ;-)


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 11:30     ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 11:30 UTC (permalink / raw)
  To: Roland Dreier
  Cc: torvalds, akpm, jsquyres, linux-rdma, general, linux-kernel,
	Anton Blanchard, Paul Mackerras, Ingo Molnar

On Thu, 2009-09-10 at 21:38 -0700, Roland Dreier wrote:
> Linus, please consider pulling from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
> 
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.  My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
> 
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
> 
> His summary of ummunotify:
> 
>   "It’s elegant, doesn’t require strange linker tricks, and seems to
>    work in all cases.  Yay!"
> 
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

Anton Blanchard suggested a while back that this might be integrated
with perf-counters, since perf-counters already does mmap() tracking and
also provides events through an mmap()'ed buffer.

Has anybody looked into this?

If someone did and I missed the discussion on why it isn't appropriate,
kindly point me in the right direction ;-)



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 11:30     ` Peter Zijlstra
  (?)
@ 2009-09-17 14:24     ` Roland Dreier
       [not found]       ` <adafxalejiq.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
  -1 siblings, 1 reply; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar


 > Anton Blanchard suggested a while back that this might be integrated
 > with perf-counters, since perf-counters already does mmap() tracking and
 > also provides events through an mmap()'ed buffer.
 > 
 > Has anybody looked into this?

I didn't see the original suggestion.  Certainly hooking in to existing
infrastructure for user/kernel communication would be good.

The fit doesn't seem great to me, although I am rather naive about perf
counters.  The problem that ummunotify is trying to solve is to let an
app say 'for these 1000 address ranges (that possibly only cover a small
part of my total address space), please let me know when the mappings
are invalidated for any reason'.

So getting those events in the kernel is no problem -- we have the MMU
notifier hooks that tell us exactly what we need to know.  The issue is
purely the way userspace registers interest in address ranges, and how
to kernel returns the events.

For perf counters it seems that one would have to create a new counter
for each address range... is that correct?  And also I don't know if
perf counter has an analog for the fast path optimization that
ummunotify provides via a mmap'ed generation counter (a quick way for
userspace to see 'nothing happened since last time you checked').

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 14:24     ` [ofa-general] " Roland Dreier
@ 2009-09-17 14:32           ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 14:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar


 > So getting those events in the kernel is no problem -- we have the MMU
 > notifier hooks that tell us exactly what we need to know.  The issue is
 > purely the way userspace registers interest in address ranges, and how
 > to kernel returns the events.
 > 
 > For perf counters it seems that one would have to create a new counter
 > for each address range... is that correct?  And also I don't know if
 > perf counter has an analog for the fast path optimization that
 > ummunotify provides via a mmap'ed generation counter (a quick way for
 > userspace to see 'nothing happened since last time you checked').

Oh I forgot... ummunotify also preallocates everything etc. so that
there is no way for events to be lost.  Which saves userspace from
having to trash everything cached and start over, which it would have to
do if it misses an invalidate event.

And AFAIK, pref counters does have the possibility of overflowing a
buffer and losing an event, right?

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 14:32           ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 14:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar


 > So getting those events in the kernel is no problem -- we have the MMU
 > notifier hooks that tell us exactly what we need to know.  The issue is
 > purely the way userspace registers interest in address ranges, and how
 > to kernel returns the events.
 > 
 > For perf counters it seems that one would have to create a new counter
 > for each address range... is that correct?  And also I don't know if
 > perf counter has an analog for the fast path optimization that
 > ummunotify provides via a mmap'ed generation counter (a quick way for
 > userspace to see 'nothing happened since last time you checked').

Oh I forgot... ummunotify also preallocates everything etc. so that
there is no way for events to be lost.  Which saves userspace from
having to trash everything cached and start over, which it would have to
do if it misses an invalidate event.

And AFAIK, pref counters does have the possibility of overflowing a
buffer and losing an event, right?

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 14:24     ` [ofa-general] " Roland Dreier
@ 2009-09-17 14:43           ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 14:43 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar

On Thu, 2009-09-17 at 07:24 -0700, Roland Dreier wrote:
> > Anton Blanchard suggested a while back that this might be integrated
>  > with perf-counters, since perf-counters already does mmap() tracking and
>  > also provides events through an mmap()'ed buffer.
>  > 
>  > Has anybody looked into this?
> 
> I didn't see the original suggestion.  Certainly hooking in to existing
> infrastructure for user/kernel communication would be good.
> 
> The fit doesn't seem great to me, although I am rather naive about perf
> counters.  The problem that ummunotify is trying to solve is to let an
> app say 'for these 1000 address ranges (that possibly only cover a small
> part of my total address space), please let me know when the mappings
> are invalidated for any reason'.
> 
> So getting those events in the kernel is no problem -- we have the MMU
> notifier hooks that tell us exactly what we need to know.  The issue is
> purely the way userspace registers interest in address ranges, and how
> to kernel returns the events.
> 
> For perf counters it seems that one would have to create a new counter
> for each address range... is that correct?  And also I don't know if
> perf counter has an analog for the fast path optimization that
> ummunotify provides via a mmap'ed generation counter (a quick way for
> userspace to see 'nothing happened since last time you checked').

You're right in that perf-counter currently doesn't provide a way to
specify these ranges, we simply track all mmap() traffic.

The thing is that mmap() data is basically a side channel. For your
usage you'd basically have to open a NOP counter and only observe the
mmap data.

We could look at ways of adding ranges..

We do have a means of detecting if new data is available, we keep a data
head index. If that moves, you've got new stuff.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 14:43           ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 14:43 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar

On Thu, 2009-09-17 at 07:24 -0700, Roland Dreier wrote:
> > Anton Blanchard suggested a while back that this might be integrated
>  > with perf-counters, since perf-counters already does mmap() tracking and
>  > also provides events through an mmap()'ed buffer.
>  > 
>  > Has anybody looked into this?
> 
> I didn't see the original suggestion.  Certainly hooking in to existing
> infrastructure for user/kernel communication would be good.
> 
> The fit doesn't seem great to me, although I am rather naive about perf
> counters.  The problem that ummunotify is trying to solve is to let an
> app say 'for these 1000 address ranges (that possibly only cover a small
> part of my total address space), please let me know when the mappings
> are invalidated for any reason'.
> 
> So getting those events in the kernel is no problem -- we have the MMU
> notifier hooks that tell us exactly what we need to know.  The issue is
> purely the way userspace registers interest in address ranges, and how
> to kernel returns the events.
> 
> For perf counters it seems that one would have to create a new counter
> for each address range... is that correct?  And also I don't know if
> perf counter has an analog for the fast path optimization that
> ummunotify provides via a mmap'ed generation counter (a quick way for
> userspace to see 'nothing happened since last time you checked').

You're right in that perf-counter currently doesn't provide a way to
specify these ranges, we simply track all mmap() traffic.

The thing is that mmap() data is basically a side channel. For your
usage you'd basically have to open a NOP counter and only observe the
mmap data.

We could look at ways of adding ranges..

We do have a means of detecting if new data is available, we keep a data
head index. If that moves, you've got new stuff.




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 14:32           ` Roland Dreier
@ 2009-09-17 14:49               ` Peter Zijlstra
  -1 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 14:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar

On Thu, 2009-09-17 at 07:32 -0700, Roland Dreier wrote:
> > So getting those events in the kernel is no problem -- we have the MMU
>  > notifier hooks that tell us exactly what we need to know.  The issue is
>  > purely the way userspace registers interest in address ranges, and how
>  > to kernel returns the events.
>  > 
>  > For perf counters it seems that one would have to create a new counter
>  > for each address range... is that correct?  And also I don't know if
>  > perf counter has an analog for the fast path optimization that
>  > ummunotify provides via a mmap'ed generation counter (a quick way for
>  > userspace to see 'nothing happened since last time you checked').
> 
> Oh I forgot... ummunotify also preallocates everything etc. so that
> there is no way for events to be lost.  Which saves userspace from
> having to trash everything cached and start over, which it would have to
> do if it misses an invalidate event.
> 
> And AFAIK, pref counters does have the possibility of overflowing a
> buffer and losing an event, right?

Well, you cannot pre-allocate everything, either you get back-logged
evens in kernel space leading to a kernel DoS, or you loose events.

Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode
will automagically overwrite its tail data without regard for userspace
having observed it.

In the RW mode userspace has to advance the tail, the kernel will drop
events when full and insert a PERF_EVENT_LOST event once there is room
again.

Hmm, or are you saying you can only get 1 event per registered range and
allocate the thing on registration? That'd need some registration limit
to avoid DoS scenarios.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 14:49               ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 14:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar

On Thu, 2009-09-17 at 07:32 -0700, Roland Dreier wrote:
> > So getting those events in the kernel is no problem -- we have the MMU
>  > notifier hooks that tell us exactly what we need to know.  The issue is
>  > purely the way userspace registers interest in address ranges, and how
>  > to kernel returns the events.
>  > 
>  > For perf counters it seems that one would have to create a new counter
>  > for each address range... is that correct?  And also I don't know if
>  > perf counter has an analog for the fast path optimization that
>  > ummunotify provides via a mmap'ed generation counter (a quick way for
>  > userspace to see 'nothing happened since last time you checked').
> 
> Oh I forgot... ummunotify also preallocates everything etc. so that
> there is no way for events to be lost.  Which saves userspace from
> having to trash everything cached and start over, which it would have to
> do if it misses an invalidate event.
> 
> And AFAIK, pref counters does have the possibility of overflowing a
> buffer and losing an event, right?

Well, you cannot pre-allocate everything, either you get back-logged
evens in kernel space leading to a kernel DoS, or you loose events.

Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode
will automagically overwrite its tail data without regard for userspace
having observed it.

In the RW mode userspace has to advance the tail, the kernel will drop
events when full and insert a PERF_EVENT_LOST event once there is room
again.

Hmm, or are you saying you can only get 1 event per registered range and
allocate the thing on registration? That'd need some registration limit
to avoid DoS scenarios.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 14:49               ` Peter Zijlstra
@ 2009-09-17 15:03                 ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar


 > Hmm, or are you saying you can only get 1 event per registered range and
 > allocate the thing on registration? That'd need some registration limit
 > to avoid DoS scenarios.

Yes, that's what I do.  You're right, I should add a limit... although
their are lots of ways for userspace to consume arbitrary amounts of
kernel resources already.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 15:03                 ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar


 > Hmm, or are you saying you can only get 1 event per registered range and
 > allocate the thing on registration? That'd need some registration limit
 > to avoid DoS scenarios.

Yes, that's what I do.  You're right, I should add a limit... although
their are lots of ways for userspace to consume arbitrary amounts of
kernel resources already.

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 15:03                 ` Roland Dreier
@ 2009-09-17 15:22                     ` Peter Zijlstra
  -1 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 15:22 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar

On Thu, 2009-09-17 at 08:03 -0700, Roland Dreier wrote:
> > Hmm, or are you saying you can only get 1 event per registered range and
>  > allocate the thing on registration? That'd need some registration limit
>  > to avoid DoS scenarios.
> 
> Yes, that's what I do.  You're right, I should add a limit... although
> their are lots of ways for userspace to consume arbitrary amounts of
> kernel resources already.

I'd be good to work at reducing that number, not adding to it ;-)

But yeah, I currently don't see a very nice match to perf counters.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 15:22                     ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-09-17 15:22 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar

On Thu, 2009-09-17 at 08:03 -0700, Roland Dreier wrote:
> > Hmm, or are you saying you can only get 1 event per registered range and
>  > allocate the thing on registration? That'd need some registration limit
>  > to avoid DoS scenarios.
> 
> Yes, that's what I do.  You're right, I should add a limit... although
> their are lots of ways for userspace to consume arbitrary amounts of
> kernel resources already.

I'd be good to work at reducing that number, not adding to it ;-)

But yeah, I currently don't see a very nice match to perf counters.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 15:03                 ` Roland Dreier
@ 2009-09-17 15:45                     ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar


 > > > Hmm, or are you saying you can only get 1 event per registered range and
 > > > allocate the thing on registration? That'd need some registration limit
 > > > to avoid DoS scenarios.
 > > 
 > > Yes, that's what I do.  You're right, I should add a limit... although
 > > their are lots of ways for userspace to consume arbitrary amounts of
 > > kernel resources already.
 > 
 > I'd be good to work at reducing that number, not adding to it ;-)

Yes, definitely.  I'll add a quick ummunotify module parameter that
limits the number of registrations per process.

 > But yeah, I currently don't see a very nice match to perf counters.

OK.  It would be nice to tie into something more general, but I think I
agree -- perf counters are missing the filtering and the "no lost
events" that ummunotify does have.  And I'm not sure it's worth messing
up the perf counters design just to jam one more not totally related
thing in.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-17 15:45                     ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-17 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Ingo Molnar


 > > > Hmm, or are you saying you can only get 1 event per registered range and
 > > > allocate the thing on registration? That'd need some registration limit
 > > > to avoid DoS scenarios.
 > > 
 > > Yes, that's what I do.  You're right, I should add a limit... although
 > > their are lots of ways for userspace to consume arbitrary amounts of
 > > kernel resources already.
 > 
 > I'd be good to work at reducing that number, not adding to it ;-)

Yes, definitely.  I'll add a quick ummunotify module parameter that
limits the number of registrations per process.

 > But yeah, I currently don't see a very nice match to perf counters.

OK.  It would be nice to tie into something more general, but I think I
agree -- perf counters are missing the filtering and the "no lost
events" that ummunotify does have.  And I'm not sure it's worth messing
up the perf counters design just to jam one more not totally related
thing in.

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 15:45                     ` Roland Dreier
@ 2009-09-18 11:50                         ` Ingo Molnar
  -1 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-09-18 11:50 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


* Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:

>  > But yeah, I currently don't see a very nice match to perf counters.
> 
> OK.  It would be nice to tie into something more general, but I think 
> I agree -- perf counters are missing the filtering and the "no lost 
> events" that ummunotify does have.  And I'm not sure it's worth 
> messing up the perf counters design just to jam one more not totally 
> related thing in.

The filtering can be done and has been done - see Li Zefan's patchset 
that uses filter expressions to do per event in-kernel filtering.

The OOM DoS is a bug in your patches i think, which perfcounters solves 
;-)

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-18 11:50                         ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-09-18 11:50 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras,
	Anton Blanchard, general, akpm, torvalds


* Roland Dreier <rdreier@cisco.com> wrote:

>  > But yeah, I currently don't see a very nice match to perf counters.
> 
> OK.  It would be nice to tie into something more general, but I think 
> I agree -- perf counters are missing the filtering and the "no lost 
> events" that ummunotify does have.  And I'm not sure it's worth 
> messing up the perf counters design just to jam one more not totally 
> related thing in.

The filtering can be done and has been done - see Li Zefan's patchset 
that uses filter expressions to do per event in-kernel filtering.

The OOM DoS is a bug in your patches i think, which perfcounters solves 
;-)

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-15 14:57       ` Roland Dreier
@ 2009-09-28 20:49           ` Pavel Machek
  -1 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-09-28 20:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue 2009-09-15 07:57:56, Roland Dreier wrote:
> 
>  > I don't remember seeing discussion of this on lkml. Yes it  is in
>  > -next...
> 
> eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> and earlier patches.

Well... it seems little overspecialized. Just modifying libc to
provide hooks you want looks like better solution.

>  > Basically it allows app to 'trace itself'? ...with interesting mmap()
>  > interface, exporting int to userspace, hoping it behaves atomically...? 
> 
> Yes, it allows app to trace what the kernel does to memory mappings.  I
> don't believe there's any real issue to atomicity of mmap'ed memory,
> since userspace really just tests whether read value is == to old read
> value or not.

That still needs memory barriers etc.. to ensure reliable operation,
no?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-28 20:49           ` Pavel Machek
  0 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-09-28 20:49 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, linux-kernel, general, akpm, torvalds

On Tue 2009-09-15 07:57:56, Roland Dreier wrote:
> 
>  > I don't remember seeing discussion of this on lkml. Yes it  is in
>  > -next...
> 
> eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> and earlier patches.

Well... it seems little overspecialized. Just modifying libc to
provide hooks you want looks like better solution.

>  > Basically it allows app to 'trace itself'? ...with interesting mmap()
>  > interface, exporting int to userspace, hoping it behaves atomically...? 
> 
> Yes, it allows app to trace what the kernel does to memory mappings.  I
> don't believe there's any real issue to atomicity of mmap'ed memory,
> since userspace really just tests whether read value is == to old read
> value or not.

That still needs memory barriers etc.. to ensure reliable operation,
no?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-28 20:49           ` Pavel Machek
@ 2009-09-28 21:40               ` Jason Gunthorpe
  -1 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-28 21:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 28, 2009 at 10:49:23PM +0200, Pavel Machek wrote:

> >  > I don't remember seeing discussion of this on lkml. Yes it  is in
> >  > -next...
> > 
> > eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> > and earlier patches.
 
> Well... it seems little overspecialized. Just modifying libc to
> provide hooks you want looks like better solution.

That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.

The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
ment to go hand-in-hand with the specialized RDMA memory pinning
interfaces..

> >  > Basically it allows app to 'trace itself'? ...with interesting mmap()
> >  > interface, exporting int to userspace, hoping it behaves atomically...? 
> > 
> > Yes, it allows app to trace what the kernel does to memory mappings.  I
> > don't believe there's any real issue to atomicity of mmap'ed memory,
> > since userspace really just tests whether read value is == to old read
> > value or not.
> 
> That still needs memory barriers etc.. to ensure reliable operation,
> no?

No, I don't think so..

The application is expected to provide sequencing of some sort between
the memory call (mmap/munmap/brk/etc) and the int check - usually just
by running in the same thread, or through some kind of locking scheme.

As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.

For example, the most common problem to solve looks like this:

   x = mmap(...)
   do RDMA with x
   [..]
   mmunmap(x);

   [..]
   y = mmap(..);
   do RDMA with y
     if by chance x == y things explode.

So this API puts the int test directly before 'do RDMA with'.

Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or have a very low over head check every time the memory region is
about to be used.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-28 21:40               ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-28 21:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Roland Dreier, linux-rdma, linux-kernel, general, akpm, torvalds

On Mon, Sep 28, 2009 at 10:49:23PM +0200, Pavel Machek wrote:

> >  > I don't remember seeing discussion of this on lkml. Yes it  is in
> >  > -next...
> > 
> > eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> > and earlier patches.
 
> Well... it seems little overspecialized. Just modifying libc to
> provide hooks you want looks like better solution.

That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.

The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
ment to go hand-in-hand with the specialized RDMA memory pinning
interfaces..

> >  > Basically it allows app to 'trace itself'? ...with interesting mmap()
> >  > interface, exporting int to userspace, hoping it behaves atomically...? 
> > 
> > Yes, it allows app to trace what the kernel does to memory mappings.  I
> > don't believe there's any real issue to atomicity of mmap'ed memory,
> > since userspace really just tests whether read value is == to old read
> > value or not.
> 
> That still needs memory barriers etc.. to ensure reliable operation,
> no?

No, I don't think so..

The application is expected to provide sequencing of some sort between
the memory call (mmap/munmap/brk/etc) and the int check - usually just
by running in the same thread, or through some kind of locking scheme.

As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.

For example, the most common problem to solve looks like this:

   x = mmap(...)
   do RDMA with x
   [..]
   mmunmap(x);

   [..]
   y = mmap(..);
   do RDMA with y
     if by chance x == y things explode.

So this API puts the int test directly before 'do RDMA with'.

Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or have a very low over head check every time the memory region is
about to be used.

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-17 15:45                     ` Roland Dreier
@ 2009-09-29 17:13                         ` Pavel Machek
  -1 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-09-29 17:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Ingo Molnar

On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
> 
>  > > > Hmm, or are you saying you can only get 1 event per registered range and
>  > > > allocate the thing on registration? That'd need some registration limit
>  > > > to avoid DoS scenarios.
>  > > 
>  > > Yes, that's what I do.  You're right, I should add a limit... although
>  > > their are lots of ways for userspace to consume arbitrary amounts of
>  > > kernel resources already.
>  > 
>  > I'd be good to work at reducing that number, not adding to it ;-)
> 
> Yes, definitely.  I'll add a quick ummunotify module parameter that
> limits the number of registrations per process.
> 
>  > But yeah, I currently don't see a very nice match to perf counters.
> 
> OK.  It would be nice to tie into something more general, but I think I
> agree -- perf counters are missing the filtering and the "no lost
> events" that ummunotify does have.  And I'm not sure it's worth messing
> up the perf counters design just to jam one more not totally related
> thing in.

I believe that extending perf counters to do what you want is better
than adding one more, very strange, user<->kernel interface.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-29 17:13                         ` Pavel Machek
  0 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-09-29 17:13 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Peter Zijlstra, linux-rdma, linux-kernel, Paul Mackerras,
	Anton Blanchard, general, akpm, torvalds, Ingo Molnar

On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
> 
>  > > > Hmm, or are you saying you can only get 1 event per registered range and
>  > > > allocate the thing on registration? That'd need some registration limit
>  > > > to avoid DoS scenarios.
>  > > 
>  > > Yes, that's what I do.  You're right, I should add a limit... although
>  > > their are lots of ways for userspace to consume arbitrary amounts of
>  > > kernel resources already.
>  > 
>  > I'd be good to work at reducing that number, not adding to it ;-)
> 
> Yes, definitely.  I'll add a quick ummunotify module parameter that
> limits the number of registrations per process.
> 
>  > But yeah, I currently don't see a very nice match to perf counters.
> 
> OK.  It would be nice to tie into something more general, but I think I
> agree -- perf counters are missing the filtering and the "no lost
> events" that ummunotify does have.  And I'm not sure it's worth messing
> up the perf counters design just to jam one more not totally related
> thing in.

I believe that extending perf counters to do what you want is better
than adding one more, very strange, user<->kernel interface.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-29 17:13                         ` Pavel Machek
@ 2009-09-30  9:44                             ` Ingo Molnar
  -1 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-09-30  9:44 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Roland Dreier, Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


* Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org> wrote:

> On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
> > 
> >
[...]
> > OK.  It would be nice to tie into something more general, but I 
> > think I agree -- perf counters are missing the filtering and the "no 
> > lost events" that ummunotify does have. [...]

Performance events filtering is being worked on and now with the proper 
non-DoS limit you've added you can lose events too, dont you? So it's 
all a question of how much buffering to add - and with perf events too 
you can buffer arbitrary large amount of events.

> > [...]  And I'm not sure it's worth messing up the perf counters 
> > design just to jam one more not totally related thing in.

Nobody suggested details for any redesign yet (so far it seems like a 
perfect match, to me at least) so i'm wondering what messup you are 
referring to.

> I believe that extending perf counters to do what you want is better 
> than adding one more, very strange, user<->kernel interface.

Agreed.

Lemme react to the original description of the code:

>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a 
> userspace library to register for MMU notifications; this is 
> particularly useful for MPI implementions (message passing libraries 
> used in HPC) to be able to keep track of what wacky things consumers 
> do to their memory mappings.

I test-pulled this code and had a look at it.

I think this could be done in a simpler, less limited, more generic, 
more useful form by using some variation of perf events.

You should be able to get all that you want by adding two TRACE_EVENT() 
tracepoints and using the existing perf event syscall to get the events 
to user-space.

Meaning that this:

  9 files changed, 1060 insertions(+), 1 deletions(-)

Would be replaced with something like:

  2 files changed, 100 insertions(+), 0 deletions(-)

[ the +100 lines would (roughly) would add tracepoints to 
  invalidate_page and invalidate_range_start. (possibly via 
  mmu_notifier_register() like the ummunotify code does) Most of that 
  linecount would be comments. ]

Another upside, beyond the reduction in complexity is that we'd have one 
less special char driver based ABI. Which is a big plus in my opinion, 
especially if this goes towards HPC folks and if it's used for real. Why 
should such a MM capability hidden behind a character device and an 
ioctl?

The perf event approach is beneficial to non-HPC as well: MM 
instrumentation for example - page range invalidates are interesting to 
all sorts of modi of analysis.

A question: what is the typical size/scope of the rbtree of the watched 
regions of memory in practical (test) deployments of the ummunofity 
code?

Per tracepoint filtering is possible via the perf event patches Li Zefan 
has posted to lkml recently, under this subject:

   [PATCH 0/6] perf trace: Add filter support

They are still being worked on but it's very clear that flexible 
in-kernel filtering support will be a natural part of the perf event 
design in the very near future, so if that alone is your reason not to 
use it it would be better if you helped us complete/test the filter 
support and use that, instead of a parallel framework.

Or if that's not desirable or not possible, or if there's any other 
technical roadblock, i'd like to know the particulars of that.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-30  9:44                             ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-09-30  9:44 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Roland Dreier, Peter Zijlstra, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds


* Pavel Machek <pavel@ucw.cz> wrote:

> On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
> > 
> >
[...]
> > OK.  It would be nice to tie into something more general, but I 
> > think I agree -- perf counters are missing the filtering and the "no 
> > lost events" that ummunotify does have. [...]

Performance events filtering is being worked on and now with the proper 
non-DoS limit you've added you can lose events too, dont you? So it's 
all a question of how much buffering to add - and with perf events too 
you can buffer arbitrary large amount of events.

> > [...]  And I'm not sure it's worth messing up the perf counters 
> > design just to jam one more not totally related thing in.

Nobody suggested details for any redesign yet (so far it seems like a 
perfect match, to me at least) so i'm wondering what messup you are 
referring to.

> I believe that extending perf counters to do what you want is better 
> than adding one more, very strange, user<->kernel interface.

Agreed.

Lemme react to the original description of the code:

>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a 
> userspace library to register for MMU notifications; this is 
> particularly useful for MPI implementions (message passing libraries 
> used in HPC) to be able to keep track of what wacky things consumers 
> do to their memory mappings.

I test-pulled this code and had a look at it.

I think this could be done in a simpler, less limited, more generic, 
more useful form by using some variation of perf events.

You should be able to get all that you want by adding two TRACE_EVENT() 
tracepoints and using the existing perf event syscall to get the events 
to user-space.

Meaning that this:

  9 files changed, 1060 insertions(+), 1 deletions(-)

Would be replaced with something like:

  2 files changed, 100 insertions(+), 0 deletions(-)

[ the +100 lines would (roughly) would add tracepoints to 
  invalidate_page and invalidate_range_start. (possibly via 
  mmu_notifier_register() like the ummunotify code does) Most of that 
  linecount would be comments. ]

Another upside, beyond the reduction in complexity is that we'd have one 
less special char driver based ABI. Which is a big plus in my opinion, 
especially if this goes towards HPC folks and if it's used for real. Why 
should such a MM capability hidden behind a character device and an 
ioctl?

The perf event approach is beneficial to non-HPC as well: MM 
instrumentation for example - page range invalidates are interesting to 
all sorts of modi of analysis.

A question: what is the typical size/scope of the rbtree of the watched 
regions of memory in practical (test) deployments of the ummunofity 
code?

Per tracepoint filtering is possible via the perf event patches Li Zefan 
has posted to lkml recently, under this subject:

   [PATCH 0/6] perf trace: Add filter support

They are still being worked on but it's very clear that flexible 
in-kernel filtering support will be a natural part of the perf event 
design in the very near future, so if that alone is your reason not to 
use it it would be better if you helped us complete/test the filter 
support and use that, instead of a parallel framework.

Or if that's not desirable or not possible, or if there's any other 
technical roadblock, i'd like to know the particulars of that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44                             ` Ingo Molnar
@ 2009-09-30 16:02                                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-30 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > OK.  It would be nice to tie into something more general, but I 
> > > think I agree -- perf counters are missing the filtering and the "no 
> > > lost events" that ummunotify does have. [...]
> 
> Performance events filtering is being worked on and now with the proper 
> non-DoS limit you've added you can lose events too, dont you? So it's 
> all a question of how much buffering to add - and with perf events too 
> you can buffer arbitrary large amount of events.

No, the ummunotify does not loose events, that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory, if one
fails the entire operation fails and userspace can do whatever it does
on memory allocation failure.

After that point the scheme is perfectly lossless.

Performance event filtering would use the same kind of kernel memory,
call ibv_reg_mr, then install a filter, both allocate kernel memory,
if one fails the op fails. But then when the ring buffer overflows
you've lost events.

All the tracing schemes are lossy - since they loose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking..

So, here is a concrete proposal how ummunotify could be absorbed by
perf events tracing, with filters.
- The filter expression must be able to trigger on a MMU event,
  triggering on the intersection of the MMU event address range and
  filter expression address range.
- The traces must be choosen so that there is exactly one filter
  expression per ibv_reg_mr region
- Each filter has a clearable saturating counter that increments every
  time the filter matches an event
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
  filter tags, not trace events.

This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:

 struct my_data *ptr = calloc()
 ptr->reg_handle = ibv_reg_mr(base,len)
 ptr->filter_handle = perf_filter_register("string matching base->len",ptr)

 [..]
 // fast path
 if (atomically(perf_map->head) != last_perf_map_head) {
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
     for (unsigned int i = 0; i != rc; i++)
         ((struct my_data *)foo[i])->invalid = 1;

     perf_empty_mmap_ring(perf_map);
 }

If 'optionally' is done then the app can trundle through the mmap
and only use the above syscall loop if the mmap overflows. That would
be quite ideal.

It also must be guarenteed that when a trace point is hit the mmap
atomics are updated and visible to another user space thread before
the trace point returns - otherwise it is not synchronous enough and
will be racey.

> A question: what is the typical size/scope of the rbtree of the watched 
> regions of memory in practical (test) deployments of the ummunofity 
> code?

Jeff can you comment?

IIRC it is many tens (hundreds?) of thousands of watches.

> Per tracepoint filtering is possible via the perf event patches Li Zefan 
> has posted to lkml recently, under this subject:

Performance of the filter add is probably a bit of a concern..

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-30 16:02                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-09-30 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres

On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > OK.  It would be nice to tie into something more general, but I 
> > > think I agree -- perf counters are missing the filtering and the "no 
> > > lost events" that ummunotify does have. [...]
> 
> Performance events filtering is being worked on and now with the proper 
> non-DoS limit you've added you can lose events too, dont you? So it's 
> all a question of how much buffering to add - and with perf events too 
> you can buffer arbitrary large amount of events.

No, the ummunotify does not loose events, that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory, if one
fails the entire operation fails and userspace can do whatever it does
on memory allocation failure.

After that point the scheme is perfectly lossless.

Performance event filtering would use the same kind of kernel memory,
call ibv_reg_mr, then install a filter, both allocate kernel memory,
if one fails the op fails. But then when the ring buffer overflows
you've lost events.

All the tracing schemes are lossy - since they loose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking..

So, here is a concrete proposal how ummunotify could be absorbed by
perf events tracing, with filters.
- The filter expression must be able to trigger on a MMU event,
  triggering on the intersection of the MMU event address range and
  filter expression address range.
- The traces must be choosen so that there is exactly one filter
  expression per ibv_reg_mr region
- Each filter has a clearable saturating counter that increments every
  time the filter matches an event
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
  filter tags, not trace events.

This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:

 struct my_data *ptr = calloc()
 ptr->reg_handle = ibv_reg_mr(base,len)
 ptr->filter_handle = perf_filter_register("string matching base->len",ptr)

 [..]
 // fast path
 if (atomically(perf_map->head) != last_perf_map_head) {
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
     for (unsigned int i = 0; i != rc; i++)
         ((struct my_data *)foo[i])->invalid = 1;

     perf_empty_mmap_ring(perf_map);
 }

If 'optionally' is done then the app can trundle through the mmap
and only use the above syscall loop if the mmap overflows. That would
be quite ideal.

It also must be guarenteed that when a trace point is hit the mmap
atomics are updated and visible to another user space thread before
the trace point returns - otherwise it is not synchronous enough and
will be racey.

> A question: what is the typical size/scope of the rbtree of the watched 
> regions of memory in practical (test) deployments of the ummunofity 
> code?

Jeff can you comment?

IIRC it is many tens (hundreds?) of thousands of watches.

> Per tracepoint filtering is possible via the perf event patches Li Zefan 
> has posted to lkml recently, under this subject:

Performance of the filter add is probably a bit of a concern..

Regards,
Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44                             ` Ingo Molnar
@ 2009-09-30 17:06                                 ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-30 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 > Performance events filtering is being worked on and now with the proper 
 > non-DoS limit you've added you can lose events too, dont you? So it's 
 > all a question of how much buffering to add - and with perf events too 
 > you can buffer arbitrary large amount of events.

No, the idea for non-DoS for ummunotify is that we would limit the
number of regions the application can register; so an application might
hit the limit up front but no runtime loss of events once a region was
registered successfully.

 > I think this could be done in a simpler, less limited, more generic, 
 > more useful form by using some variation of perf events.
 > 
 > You should be able to get all that you want by adding two TRACE_EVENT() 
 > tracepoints and using the existing perf event syscall to get the events 
 > to user-space.

Yes, I would like to use perf events too.  Would it be plausible to
create a way for userspace to create a "counter" for each address range
being watched?  Then events would not be lost, because those counters
would become non-zero.

 > Meaning that this:
 >   9 files changed, 1060 insertions(+), 1 deletions(-)

Note that lots/ of the files touched here are in Documentation or are
one-line changes to Makefiles etc.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-09-30 17:06                                 ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-09-30 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds


 > Performance events filtering is being worked on and now with the proper 
 > non-DoS limit you've added you can lose events too, dont you? So it's 
 > all a question of how much buffering to add - and with perf events too 
 > you can buffer arbitrary large amount of events.

No, the idea for non-DoS for ummunotify is that we would limit the
number of regions the application can register; so an application might
hit the limit up front but no runtime loss of events once a region was
registered successfully.

 > I think this could be done in a simpler, less limited, more generic, 
 > more useful form by using some variation of perf events.
 > 
 > You should be able to get all that you want by adding two TRACE_EVENT() 
 > tracepoints and using the existing perf event syscall to get the events 
 > to user-space.

Yes, I would like to use perf events too.  Would it be plausible to
create a way for userspace to create a "counter" for each address range
being watched?  Then events would not be lost, because those counters
would become non-zero.

 > Meaning that this:
 >   9 files changed, 1060 insertions(+), 1 deletions(-)

Note that lots/ of the files touched here are in Documentation or are
one-line changes to Makefiles etc.

 - R.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30  9:44                             ` Ingo Molnar
@ 2009-10-02 16:32                                 ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-10-02 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 > Per tracepoint filtering is possible via the perf event patches Li Zefan 
 > has posted to lkml recently, under this subject:
 > 
 >    [PATCH 0/6] perf trace: Add filter support
 > 
 > They are still being worked on but it's very clear that flexible 
 > in-kernel filtering support will be a natural part of the perf event 
 > design in the very near future, so if that alone is your reason not to 
 > use it it would be better if you helped us complete/test the filter 
 > support and use that, instead of a parallel framework.
 > 
 > Or if that's not desirable or not possible, or if there's any other 
 > technical roadblock, i'd like to know the particulars of that.

So I looked a little deeper into this, and I don't think (even with the
filtering extensions) that perf events are directly applicable to this
problem.  The first issue is that, assuming I'm understanding the
comment in perf_event.c:

        /*
         * Raw tracepoint data is a severe data leak, only allow root to
         * have these.
         */

currently tracepoints can only be used by privileged processes.  A key
feature of ummunotify is that ordinary unprivileged processes can use it.

So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
as a way of letting unprivileged userspace get access to just MMU events
for their own process?  Clearly this touches core infrastructure and is
not as simple as just adding two tracepoints.

Then, assuming we have some way to create an "MMU notifier" perf event,
we need a way for userspace to specify which address ranges it would
like events for (I don't think the string filter expression used by
existing trace filtering works, because if userspace is looking at a few
hundred regions, then the size of the filtering expression explodes, and
adding or removing a single range becomes a pain).  So I guess a new
ioctl() to add/remove ranges for MMU_NOTIFIER perf events?

I think filtering is needed, because otherwise events for ranges that
are not of interest are just a waste of resources to generate and
process, and make losing good events because of overflow much more
likely.

We still have the problem of lost events if the mmap buffer overflows,
but userspace should be able to size the buffer so that such events are
rare I guess.

In the end this seems to just take the ummunotify code I have, and make
it be a new type of perf counter instead of a character special device.
I'd actually be OK with that, since having an oddball new char dev
interface is not particularly nice.  But on the other hand just
multiplexing a new type of thing under perf events is not all that much
better.  What do you think?

Thanks,
  Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-02 16:32                                 ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-10-02 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds


 > Per tracepoint filtering is possible via the perf event patches Li Zefan 
 > has posted to lkml recently, under this subject:
 > 
 >    [PATCH 0/6] perf trace: Add filter support
 > 
 > They are still being worked on but it's very clear that flexible 
 > in-kernel filtering support will be a natural part of the perf event 
 > design in the very near future, so if that alone is your reason not to 
 > use it it would be better if you helped us complete/test the filter 
 > support and use that, instead of a parallel framework.
 > 
 > Or if that's not desirable or not possible, or if there's any other 
 > technical roadblock, i'd like to know the particulars of that.

So I looked a little deeper into this, and I don't think (even with the
filtering extensions) that perf events are directly applicable to this
problem.  The first issue is that, assuming I'm understanding the
comment in perf_event.c:

        /*
         * Raw tracepoint data is a severe data leak, only allow root to
         * have these.
         */

currently tracepoints can only be used by privileged processes.  A key
feature of ummunotify is that ordinary unprivileged processes can use it.

So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
as a way of letting unprivileged userspace get access to just MMU events
for their own process?  Clearly this touches core infrastructure and is
not as simple as just adding two tracepoints.

Then, assuming we have some way to create an "MMU notifier" perf event,
we need a way for userspace to specify which address ranges it would
like events for (I don't think the string filter expression used by
existing trace filtering works, because if userspace is looking at a few
hundred regions, then the size of the filtering expression explodes, and
adding or removing a single range becomes a pain).  So I guess a new
ioctl() to add/remove ranges for MMU_NOTIFIER perf events?

I think filtering is needed, because otherwise events for ranges that
are not of interest are just a waste of resources to generate and
process, and make losing good events because of overflow much more
likely.

We still have the problem of lost events if the mmap buffer overflows,
but userspace should be able to size the buffer so that such events are
rare I guess.

In the end this seems to just take the ummunotify code I have, and make
it be a new type of perf counter instead of a character special device.
I'd actually be OK with that, since having an oddball new char dev
interface is not particularly nice.  But on the other hand just
multiplexing a new type of thing under perf events is not all that much
better.  What do you think?

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-02 16:32                                 ` Roland Dreier
@ 2009-10-02 20:45                                     ` Pavel Machek
  -1 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-10-02 20:45 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi!

> In the end this seems to just take the ummunotify code I have, and make
> it be a new type of perf counter instead of a character special device.
> I'd actually be OK with that, since having an oddball new char dev
> interface is not particularly nice.  But on the other hand just
> multiplexing a new type of thing under perf events is not all that much
> better.  What do you think?

I really hate the strange character device. So if you can hide it in
tracing infrastructure, I'd certainly like that.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-02 20:45                                     ` Pavel Machek
  0 siblings, 0 replies; 82+ messages in thread
From: Pavel Machek @ 2009-10-02 20:45 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Peter Zijlstra, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds

Hi!

> In the end this seems to just take the ummunotify code I have, and make
> it be a new type of perf counter instead of a character special device.
> I'd actually be OK with that, since having an oddball new char dev
> interface is not particularly nice.  But on the other hand just
> multiplexing a new type of thing under perf events is not all that much
> better.  What do you think?

I really hate the strange character device. So if you can hide it in
tracing infrastructure, I'd certainly like that.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-02 16:32                                 ` Roland Dreier
@ 2009-10-07 22:34                                     ` Roland Dreier
  -1 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-10-07 22:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 > So I looked a little deeper into this, and I don't think (even with the
 > filtering extensions) that perf events are directly applicable to this
 > problem.  The first issue is that, assuming I'm understanding the
 > comment in perf_event.c:
 > 
 >         /*
 >          * Raw tracepoint data is a severe data leak, only allow root to
 >          * have these.
 >          */
 > 
 > currently tracepoints can only be used by privileged processes.  A key
 > feature of ummunotify is that ordinary unprivileged processes can use it.
 > 
 > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
 > as a way of letting unprivileged userspace get access to just MMU events
 > for their own process?  Clearly this touches core infrastructure and is
 > not as simple as just adding two tracepoints.
 > 
 > Then, assuming we have some way to create an "MMU notifier" perf event,
 > we need a way for userspace to specify which address ranges it would
 > like events for (I don't think the string filter expression used by
 > existing trace filtering works, because if userspace is looking at a few
 > hundred regions, then the size of the filtering expression explodes, and
 > adding or removing a single range becomes a pain).  So I guess a new
 > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
 > 
 > I think filtering is needed, because otherwise events for ranges that
 > are not of interest are just a waste of resources to generate and
 > process, and make losing good events because of overflow much more
 > likely.
 > 
 > We still have the problem of lost events if the mmap buffer overflows,
 > but userspace should be able to size the buffer so that such events are
 > rare I guess.
 > 
 > In the end this seems to just take the ummunotify code I have, and make
 > it be a new type of perf counter instead of a character special device.
 > I'd actually be OK with that, since having an oddball new char dev
 > interface is not particularly nice.  But on the other hand just
 > multiplexing a new type of thing under perf events is not all that much
 > better.  What do you think?

Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
ummunotify?  To me it looks like a wash -- the main difference is how
userspace gets the magic ummunotify file descriptor, either by
open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
but pretty much everything else stays pretty much the same in terms of
how much kernel code is involved.  We do reuse the perf events mmap
buffer code but I think that ends up being more complicated than
returning events via read().

Anyway, before I spend the time converting over to the new
infrastructure and causing the MPI guys to churn their code, I'd like to
make sure that this is what you guys have in mind.

(By the way, after thinking about this more, I really do think that
filtering events by address range is a must-have -- with filtering,
userspace can map sufficient buffer space to avoid losing events for a
given number of regions; without filtering, events might get lost just
because of invalidate events for ranges userspace didn't even care about)

Thanks,
  Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-07 22:34                                     ` Roland Dreier
  0 siblings, 0 replies; 82+ messages in thread
From: Roland Dreier @ 2009-10-07 22:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Peter Zijlstra, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds


 > So I looked a little deeper into this, and I don't think (even with the
 > filtering extensions) that perf events are directly applicable to this
 > problem.  The first issue is that, assuming I'm understanding the
 > comment in perf_event.c:
 > 
 >         /*
 >          * Raw tracepoint data is a severe data leak, only allow root to
 >          * have these.
 >          */
 > 
 > currently tracepoints can only be used by privileged processes.  A key
 > feature of ummunotify is that ordinary unprivileged processes can use it.
 > 
 > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
 > as a way of letting unprivileged userspace get access to just MMU events
 > for their own process?  Clearly this touches core infrastructure and is
 > not as simple as just adding two tracepoints.
 > 
 > Then, assuming we have some way to create an "MMU notifier" perf event,
 > we need a way for userspace to specify which address ranges it would
 > like events for (I don't think the string filter expression used by
 > existing trace filtering works, because if userspace is looking at a few
 > hundred regions, then the size of the filtering expression explodes, and
 > adding or removing a single range becomes a pain).  So I guess a new
 > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
 > 
 > I think filtering is needed, because otherwise events for ranges that
 > are not of interest are just a waste of resources to generate and
 > process, and make losing good events because of overflow much more
 > likely.
 > 
 > We still have the problem of lost events if the mmap buffer overflows,
 > but userspace should be able to size the buffer so that such events are
 > rare I guess.
 > 
 > In the end this seems to just take the ummunotify code I have, and make
 > it be a new type of perf counter instead of a character special device.
 > I'd actually be OK with that, since having an oddball new char dev
 > interface is not particularly nice.  But on the other hand just
 > multiplexing a new type of thing under perf events is not all that much
 > better.  What do you think?

Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
ummunotify?  To me it looks like a wash -- the main difference is how
userspace gets the magic ummunotify file descriptor, either by
open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
but pretty much everything else stays pretty much the same in terms of
how much kernel code is involved.  We do reuse the perf events mmap
buffer code but I think that ends up being more complicated than
returning events via read().

Anyway, before I spend the time converting over to the new
infrastructure and causing the MPI guys to churn their code, I'd like to
make sure that this is what you guys have in mind.

(By the way, after thinking about this more, I really do think that
filtering events by address range is a must-have -- with filtering,
userspace can map sufficient buffer space to avoid losing events for a
given number of regions; without filtering, events might get lost just
because of invalidate events for ranges userspace didn't even care about)

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-07 22:34                                     ` Roland Dreier
@ 2009-10-12 17:33                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-10-12 17:33 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Pavel Machek, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, 2009-10-07 at 15:34 -0700, Roland Dreier wrote:
> > So I looked a little deeper into this, and I don't think (even with the
>  > filtering extensions) that perf events are directly applicable to this
>  > problem.  The first issue is that, assuming I'm understanding the
>  > comment in perf_event.c:
>  > 
>  >         /*
>  >          * Raw tracepoint data is a severe data leak, only allow root to
>  >          * have these.
>  >          */
>  > 
>  > currently tracepoints can only be used by privileged processes.  A key
>  > feature of ummunotify is that ordinary unprivileged processes can use it.
>  > 
>  > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
>  > as a way of letting unprivileged userspace get access to just MMU events
>  > for their own process?  Clearly this touches core infrastructure and is
>  > not as simple as just adding two tracepoints.
>  > 
>  > Then, assuming we have some way to create an "MMU notifier" perf event,
>  > we need a way for userspace to specify which address ranges it would
>  > like events for (I don't think the string filter expression used by
>  > existing trace filtering works, because if userspace is looking at a few
>  > hundred regions, then the size of the filtering expression explodes, and
>  > adding or removing a single range becomes a pain).  So I guess a new
>  > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
>  > 
>  > I think filtering is needed, because otherwise events for ranges that
>  > are not of interest are just a waste of resources to generate and
>  > process, and make losing good events because of overflow much more
>  > likely.
>  > 
>  > We still have the problem of lost events if the mmap buffer overflows,
>  > but userspace should be able to size the buffer so that such events are
>  > rare I guess.
>  > 
>  > In the end this seems to just take the ummunotify code I have, and make
>  > it be a new type of perf counter instead of a character special device.
>  > I'd actually be OK with that, since having an oddball new char dev
>  > interface is not particularly nice.  But on the other hand just
>  > multiplexing a new type of thing under perf events is not all that much
>  > better.  What do you think?
> 
> Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
> plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
> ummunotify?  To me it looks like a wash -- the main difference is how
> userspace gets the magic ummunotify file descriptor, either by
> open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
> but pretty much everything else stays pretty much the same in terms of
> how much kernel code is involved.  We do reuse the perf events mmap
> buffer code but I think that ends up being more complicated than
> returning events via read().
> 
> Anyway, before I spend the time converting over to the new
> infrastructure and causing the MPI guys to churn their code, I'd like to
> make sure that this is what you guys have in mind.
> 
> (By the way, after thinking about this more, I really do think that
> filtering events by address range is a must-have -- with filtering,
> userspace can map sufficient buffer space to avoid losing events for a
> given number of regions; without filtering, events might get lost just
> because of invalidate events for ranges userspace didn't even care about)

I think something like

PERF_TYPE_SOFTWARE, PERF_COUNT_SW_MUNMAP + $filter

or

PERF_TYPE_TRACEPOINT, //events/vm/munmap/id + $filter

As for the read/poll issue, I think we can do something like
PERF_FORMAT_BLOCK which would make read() block when ->count hasn't
changed, and make poll() work without requiring a mmap().

As to filter, we can do two things, add a simple single range filter to
perf_event_attr, which is something ia64 has hardware support for IIRC,
or we can possibly use this trace filter muck.

Would something like that be sufficient? With such events only
generating a wakeup (poll) when the unmap actually happens, you'd not
even need an mmap() buffer to keep up with that.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-12 17:33                                         ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-10-12 17:33 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Ingo Molnar, Pavel Machek, linux-rdma, linux-kernel,
	Paul Mackerras, Anton Blanchard, general, akpm, torvalds

On Wed, 2009-10-07 at 15:34 -0700, Roland Dreier wrote:
> > So I looked a little deeper into this, and I don't think (even with the
>  > filtering extensions) that perf events are directly applicable to this
>  > problem.  The first issue is that, assuming I'm understanding the
>  > comment in perf_event.c:
>  > 
>  >         /*
>  >          * Raw tracepoint data is a severe data leak, only allow root to
>  >          * have these.
>  >          */
>  > 
>  > currently tracepoints can only be used by privileged processes.  A key
>  > feature of ummunotify is that ordinary unprivileged processes can use it.
>  > 
>  > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
>  > as a way of letting unprivileged userspace get access to just MMU events
>  > for their own process?  Clearly this touches core infrastructure and is
>  > not as simple as just adding two tracepoints.
>  > 
>  > Then, assuming we have some way to create an "MMU notifier" perf event,
>  > we need a way for userspace to specify which address ranges it would
>  > like events for (I don't think the string filter expression used by
>  > existing trace filtering works, because if userspace is looking at a few
>  > hundred regions, then the size of the filtering expression explodes, and
>  > adding or removing a single range becomes a pain).  So I guess a new
>  > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
>  > 
>  > I think filtering is needed, because otherwise events for ranges that
>  > are not of interest are just a waste of resources to generate and
>  > process, and make losing good events because of overflow much more
>  > likely.
>  > 
>  > We still have the problem of lost events if the mmap buffer overflows,
>  > but userspace should be able to size the buffer so that such events are
>  > rare I guess.
>  > 
>  > In the end this seems to just take the ummunotify code I have, and make
>  > it be a new type of perf counter instead of a character special device.
>  > I'd actually be OK with that, since having an oddball new char dev
>  > interface is not particularly nice.  But on the other hand just
>  > multiplexing a new type of thing under perf events is not all that much
>  > better.  What do you think?
> 
> Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
> plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
> ummunotify?  To me it looks like a wash -- the main difference is how
> userspace gets the magic ummunotify file descriptor, either by
> open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
> but pretty much everything else stays pretty much the same in terms of
> how much kernel code is involved.  We do reuse the perf events mmap
> buffer code but I think that ends up being more complicated than
> returning events via read().
> 
> Anyway, before I spend the time converting over to the new
> infrastructure and causing the MPI guys to churn their code, I'd like to
> make sure that this is what you guys have in mind.
> 
> (By the way, after thinking about this more, I really do think that
> filtering events by address range is a must-have -- with filtering,
> userspace can map sufficient buffer space to avoid losing events for a
> given number of regions; without filtering, events might get lost just
> because of invalidate events for ranges userspace didn't even care about)

I think something like

PERF_TYPE_SOFTWARE, PERF_COUNT_SW_MUNMAP + $filter

or

PERF_TYPE_TRACEPOINT, //events/vm/munmap/id + $filter

As for the read/poll issue, I think we can do something like
PERF_FORMAT_BLOCK which would make read() block when ->count hasn't
changed, and make poll() work without requiring a mmap().

As to filter, we can do two things, add a simple single range filter to
perf_event_attr, which is something ia64 has hardware support for IIRC,
or we can possibly use this trace filter muck.

Would something like that be sufficient? With such events only
generating a wakeup (poll) when the unmap actually happens, you'd not
even need an mmap() buffer to keep up with that.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-09-30 16:02                                 ` Jason Gunthorpe
@ 2009-10-12 18:19                                     ` Ingo Molnar
  -1 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-10-12 18:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres


* Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > > OK.  It would be nice to tie into something more general, but I 
> > > > think I agree -- perf counters are missing the filtering and the "no 
> > > > lost events" that ummunotify does have. [...]
> > 
> > Performance events filtering is being worked on and now with the 
> > proper non-DoS limit you've added you can lose events too, dont you? 
> > So it's all a question of how much buffering to add - and with perf 
> > events too you can buffer arbitrary large amount of events.
> 
> No, the ummunotify does not loose events, that is the fundamental 
> difference between it and all tracing schemes.
> 
> Every call to ibv_reg_mr is paired with a call to ummunotify to create 
> a matching watcher. Both calls allocate some kernel memory, if one 
> fails the entire operation fails and userspace can do whatever it does 
> on memory allocation failure.

We already support signal notification for perf events, and we also 
support two modi of perf ring-buffer overflow notification. Adding a 
third one that sends a signal when events are lost would be in line with 
that.

This would allow you to have the OOM semantics of requesting a SIGBUS - 
or user-space could do other things: like print a warning in the app or 
ignore the event overflow.

Which are all interesting things to do. (If you do that you might want 
to add that to 'perf top' or 'perf record' as well.)

> After that point the scheme is perfectly lossless.

Well if it can OOM it's not lossless, obviously. You just define "event 
loss" to be equivalent to "Destruction of the universe." ;-)

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-12 18:19                                     ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-10-12 18:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres


* Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > > OK.  It would be nice to tie into something more general, but I 
> > > > think I agree -- perf counters are missing the filtering and the "no 
> > > > lost events" that ummunotify does have. [...]
> > 
> > Performance events filtering is being worked on and now with the 
> > proper non-DoS limit you've added you can lose events too, dont you? 
> > So it's all a question of how much buffering to add - and with perf 
> > events too you can buffer arbitrary large amount of events.
> 
> No, the ummunotify does not loose events, that is the fundamental 
> difference between it and all tracing schemes.
> 
> Every call to ibv_reg_mr is paired with a call to ummunotify to create 
> a matching watcher. Both calls allocate some kernel memory, if one 
> fails the entire operation fails and userspace can do whatever it does 
> on memory allocation failure.

We already support signal notification for perf events, and we also 
support two modi of perf ring-buffer overflow notification. Adding a 
third one that sends a signal when events are lost would be in line with 
that.

This would allow you to have the OOM semantics of requesting a SIGBUS - 
or user-space could do other things: like print a warning in the app or 
ignore the event overflow.

Which are all interesting things to do. (If you do that you might want 
to add that to 'perf top' or 'perf record' as well.)

> After that point the scheme is perfectly lossless.

Well if it can OOM it's not lossless, obviously. You just define "event 
loss" to be equivalent to "Destruction of the universe." ;-)

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 18:19                                     ` Ingo Molnar
@ 2009-10-12 19:30                                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-12 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:

> > After that point the scheme is perfectly lossless.
> 
> Well if it can OOM it's not lossless, obviously. You just define "event 
> loss" to be equivalent to "Destruction of the universe." ;-)

It can't OOM once the ummunotify registration is done - when an event
occurs it doesn't allocate any memory and it doesn't loose events.

It has the same problem as perf - you either bound the number/size of
filters, or let user space allocate filters until the box OOMs. perf
has the additonal problem that even with filters you can still loose
events if the event ring overflows.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-12 19:30                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-12 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres

On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:

> > After that point the scheme is perfectly lossless.
> 
> Well if it can OOM it's not lossless, obviously. You just define "event 
> loss" to be equivalent to "Destruction of the universe." ;-)

It can't OOM once the ummunotify registration is done - when an event
occurs it doesn't allocate any memory and it doesn't loose events.

It has the same problem as perf - you either bound the number/size of
filters, or let user space allocate filters until the box OOMs. perf
has the additonal problem that even with filters you can still loose
events if the event ring overflows.

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 19:30                                         ` Jason Gunthorpe
  (?)
@ 2009-10-12 20:20                                         ` Ingo Molnar
       [not found]                                           ` <20091012202046.GA7648-X9Un+BFzKDI@public.gmane.org>
  2009-10-13  5:43                                           ` Brice Goglin
  -1 siblings, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-10-12 20:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres


* Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
> 
> > > After that point the scheme is perfectly lossless.
> > 
> > Well if it can OOM it's not lossless, obviously. You just define 
> > "event loss" to be equivalent to "Destruction of the universe." ;-)
> 
> It can't OOM once the ummunotify registration is done - when an event 
> occurs it doesn't allocate any memory and it doesn't loose events.

Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism: 
any double events on the same range will cause an imprecise event to be 
recorded and cause the loss of information.

Is that loss of information more acceptable than the loss of information 
via the loss of events?

It might be more acceptable because the flag-hint mechanism can at most 
cause over-flushing - while with perf events we might miss to invalidate 
a range altogether.

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 20:20                                         ` Ingo Molnar
@ 2009-10-13  4:05                                               ` Jason Gunthorpe
  2009-10-13  5:43                                           ` Brice Goglin
  1 sibling, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-13  4:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> It might be more acceptable because the flag-hint mechanism can at most 
> cause over-flushing - while with perf events we might miss to invalidate 
> a range altogether.

Right. Overflushing is not important, but missing an event entirely is not
recoverable (at least within the current kernel APIs).

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-13  4:05                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-13  4:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres

On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> It might be more acceptable because the flag-hint mechanism can at most 
> cause over-flushing - while with perf events we might miss to invalidate 
> a range altogether.

Right. Overflushing is not important, but missing an event entirely is not
recoverable (at least within the current kernel APIs).

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-12 20:20                                         ` Ingo Molnar
       [not found]                                           ` <20091012202046.GA7648-X9Un+BFzKDI@public.gmane.org>
@ 2009-10-13  5:43                                           ` Brice Goglin
       [not found]                                             ` <4AD41373.8010108-MZpvjPyXg2s@public.gmane.org>
  1 sibling, 1 reply; 82+ messages in thread
From: Brice Goglin @ 2009-10-13  5:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Gunthorpe, Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Jeff Squyres

Ingo Molnar wrote:
> * Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
>   
>> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
>>
>>     
>>>> After that point the scheme is perfectly lossless.
>>>>         
>>> Well if it can OOM it's not lossless, obviously. You just define 
>>> "event loss" to be equivalent to "Destruction of the universe." ;-)
>>>       
>> It can't OOM once the ummunotify registration is done - when an event 
>> occurs it doesn't allocate any memory and it doesn't loose events.
>>     
>
> Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism: 
> any double events on the same range will cause an imprecise event to be 
> recorded and cause the loss of information.
>   

The target (MPI) application doesn't care about how many events are
coming here. It just needs to know whether something has been
invalidated in the range. If so, it destroy the whole RDMA window
anyway. So it's actually _good_ that multiple events are merged into a
single one: the application only has to process a single event per
partially-invalidated range.

Brice

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  5:43                                           ` Brice Goglin
@ 2009-10-13  6:38                                                 ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-10-13  6:38 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Jason Gunthorpe, Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres


* Brice Goglin <Brice.Goglin-MZpvjPyXg2s@public.gmane.org> wrote:

> Ingo Molnar wrote:
> > * Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> >
> >   
> >> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
> >>
> >>     
> >>>> After that point the scheme is perfectly lossless.
> >>>>         
> >>> Well if it can OOM it's not lossless, obviously. You just define 
> >>> "event loss" to be equivalent to "Destruction of the universe." ;-)
> >>>       
> >> It can't OOM once the ummunotify registration is done - when an event 
> >> occurs it doesn't allocate any memory and it doesn't loose events.
> >>     
> >
> > Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism: 
> > any double events on the same range will cause an imprecise event to be 
> > recorded and cause the loss of information.
> >   
> 
> The target (MPI) application doesn't care about how many events are 
> coming here. It just needs to know whether something has been 
> invalidated in the range. If so, it destroy the whole RDMA window 
> anyway. So it's actually _good_ that multiple events are merged into a 
> single one: the application only has to process a single event per 
> partially-invalidated range.

it's not unconditionally good as the fuzzy-merge-events rule:

                        events[n].flags      = UMMUNOTIFY_EVENT_FLAG_HINT;
                        events[n].hint_start = max(reg->start, reg->hint_start);
                        events[n].hint_end   = min(reg->end, reg->hint_end);

in essence merges flushes into a single interval - which inevitably 
might include areas of memory that were not flushed at all.

For example these two flushes:

  [...]          [...]

Would be merged into:

  [..................]

Btw., isnt the above max/min logic buggy, causing lost events? Shouldnt 
it be:

                        events[n].hint_start = min(reg->start, reg->hint_start);
                        events[n].hint_end   = max(reg->end, reg->hint_end);

?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-13  6:38                                                 ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2009-10-13  6:38 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Jason Gunthorpe, Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma, linux-kernel, Paul Mackerras, Anton Blanchard,
	general, akpm, torvalds, Jeff Squyres


* Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Ingo Molnar wrote:
> > * Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> >
> >   
> >> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
> >>
> >>     
> >>>> After that point the scheme is perfectly lossless.
> >>>>         
> >>> Well if it can OOM it's not lossless, obviously. You just define 
> >>> "event loss" to be equivalent to "Destruction of the universe." ;-)
> >>>       
> >> It can't OOM once the ummunotify registration is done - when an event 
> >> occurs it doesn't allocate any memory and it doesn't loose events.
> >>     
> >
> > Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism: 
> > any double events on the same range will cause an imprecise event to be 
> > recorded and cause the loss of information.
> >   
> 
> The target (MPI) application doesn't care about how many events are 
> coming here. It just needs to know whether something has been 
> invalidated in the range. If so, it destroy the whole RDMA window 
> anyway. So it's actually _good_ that multiple events are merged into a 
> single one: the application only has to process a single event per 
> partially-invalidated range.

it's not unconditionally good as the fuzzy-merge-events rule:

                        events[n].flags      = UMMUNOTIFY_EVENT_FLAG_HINT;
                        events[n].hint_start = max(reg->start, reg->hint_start);
                        events[n].hint_end   = min(reg->end, reg->hint_end);

in essence merges flushes into a single interval - which inevitably 
might include areas of memory that were not flushed at all.

For example these two flushes:

  [...]          [...]

Would be merged into:

  [..................]

Btw., isnt the above max/min logic buggy, causing lost events? Shouldnt 
it be:

                        events[n].hint_start = min(reg->start, reg->hint_start);
                        events[n].hint_end   = max(reg->end, reg->hint_end);

?

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  4:05                                               ` Jason Gunthorpe
  (?)
@ 2009-10-13  6:40                                               ` Ingo Molnar
       [not found]                                                 ` <20091013064006.GC9470-X9Un+BFzKDI@public.gmane.org>
  -1 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2009-10-13  6:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres


* Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > It might be more acceptable because the flag-hint mechanism can at most 
> > cause over-flushing - while with perf events we might miss to invalidate 
> > a range altogether.
> 
> Right. Overflushing is not important, but missing an event entirely is 
> not recoverable (at least within the current kernel APIs).

So if we detect event loss in the perf event case (should not happen 
with sufficient buffering but it is a possibility the code should be 
prepared for) then we can just flush the [0,-1ULL] range, right?

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
  2009-10-13  6:40                                               ` Ingo Molnar
@ 2009-10-13 16:27                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-13 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras,
	Anton Blanchard, general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jeff Squyres

On Tue, Oct 13, 2009 at 08:40:06AM +0200, Ingo Molnar wrote:
> 
> * Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> > On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > > It might be more acceptable because the flag-hint mechanism can at most 
> > > cause over-flushing - while with perf events we might miss to invalidate 
> > > a range altogether.
> > 
> > Right. Overflushing is not important, but missing an event entirely is 
> > not recoverable (at least within the current kernel APIs).
> 
> So if we detect event loss in the perf event case (should not happen 
> with sufficient buffering but it is a possibility the code should be 
> prepared for) then we can just flush the [0,-1ULL] range, right?

No, the reason overflushing within a registration is OK is because of
how the MPI APIs are defined and typically used. The map and
registration window will typically be 1:1 ie you malloc something and
then register it. It is an error to register beyond your malloced
space. So, in truth, the hint stuff isn't really essential for MPI.

flushing all ranges would result in data loss since ranges may be in
use at the time, and 'flush' is actually unregister/reregister - the
hardware cannot do in place atomic modify.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
@ 2009-10-13 16:27                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2009-10-13 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Roland Dreier, Peter Zijlstra, linux-rdma,
	linux-kernel, Paul Mackerras, Anton Blanchard, general, akpm,
	torvalds, Jeff Squyres

On Tue, Oct 13, 2009 at 08:40:06AM +0200, Ingo Molnar wrote:
> 
> * Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
> > On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > > It might be more acceptable because the flag-hint mechanism can at most 
> > > cause over-flushing - while with perf events we might miss to invalidate 
> > > a range altogether.
> > 
> > Right. Overflushing is not important, but missing an event entirely is 
> > not recoverable (at least within the current kernel APIs).
> 
> So if we detect event loss in the perf event case (should not happen 
> with sufficient buffering but it is a possibility the code should be 
> prepared for) then we can just flush the [0,-1ULL] range, right?

No, the reason overflushing within a registration is OK is because of
how the MPI APIs are defined and typically used. The map and
registration window will typically be 1:1 ie you malloc something and
then register it. It is an error to register beyond your malloced
space. So, in truth, the hint stuff isn't really essential for MPI.

flushing all ranges would result in data loss since ranges may be in
use at the time, and 'flush' is actually unregister/reregister - the
hardware cannot do in place atomic modify.

Jason

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2009-10-13 16:28 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-11  4:38 [GIT PULL] please pull ummunotify Roland Dreier
2009-09-11  4:38 ` Roland Dreier
2009-09-15 11:34 ` Pavel Machek
     [not found]   ` <20090915113434.GF1328-+ZI9xUNit7I@public.gmane.org>
2009-09-15 14:57     ` [ofa-general] " Roland Dreier
2009-09-15 14:57       ` Roland Dreier
     [not found]       ` <ada7hw0gsqz.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-28 20:49         ` Pavel Machek
2009-09-28 20:49           ` Pavel Machek
     [not found]           ` <20090928204923.GA1960-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
2009-09-28 21:40             ` Jason Gunthorpe
2009-09-28 21:40               ` Jason Gunthorpe
     [not found] ` <aday6omhz9d.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-11  5:56   ` KOSAKI Motohiro
2009-09-11  5:56     ` KOSAKI Motohiro
2009-09-11  6:03     ` [ofa-general] " Roland Dreier
2009-09-11  6:03       ` Roland Dreier
     [not found]       ` <adatyzahvbm.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-11  6:11         ` KOSAKI Motohiro
2009-09-11  6:11           ` KOSAKI Motohiro
     [not found]           ` <20090911150552.DB68.A69D9226-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2009-09-11 16:42             ` Gleb Natapov
2009-09-11 16:42               ` Gleb Natapov
2009-09-11  6:15       ` Brice Goglin
     [not found]         ` <4AA9EAF7.5010401-MZpvjPyXg2s@public.gmane.org>
2009-09-11  6:21           ` KOSAKI Motohiro
2009-09-11  6:21             ` KOSAKI Motohiro
2009-09-11  6:22           ` Roland Dreier
2009-09-11  6:22             ` Roland Dreier
2009-09-11  6:40             ` [ofa-general] " Jason Gunthorpe
2009-09-11  6:40               ` Jason Gunthorpe
     [not found]               ` <20090911064019.GZ4973-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2009-09-11 16:58                 ` Roland Dreier
2009-09-11 16:58                   ` Roland Dreier
     [not found]                   ` <adaljklifkt.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-15  7:03                     ` KOSAKI Motohiro
2009-09-15  7:03                       ` KOSAKI Motohiro
2009-09-15  8:27                       ` Roland Dreier
2009-09-15  8:27                         ` Roland Dreier
2009-09-15 12:38                       ` Jeff Squyres
2009-09-15 12:38                         ` Jeff Squyres
2009-09-16 16:30   ` Roland Dreier
2009-09-16 16:30     ` Roland Dreier
2009-09-16 16:40     ` [ofa-general] " Linus Torvalds
2009-09-16 16:40       ` Linus Torvalds
2009-09-17 11:30   ` Peter Zijlstra
2009-09-17 11:30     ` Peter Zijlstra
2009-09-17 14:24     ` [ofa-general] " Roland Dreier
     [not found]       ` <adafxalejiq.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-17 14:32         ` Roland Dreier
2009-09-17 14:32           ` Roland Dreier
     [not found]           ` <adaab0tej5c.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-17 14:49             ` Peter Zijlstra
2009-09-17 14:49               ` Peter Zijlstra
2009-09-17 15:03               ` Roland Dreier
2009-09-17 15:03                 ` Roland Dreier
     [not found]                 ` <adazl8td35u.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-17 15:22                   ` Peter Zijlstra
2009-09-17 15:22                     ` Peter Zijlstra
2009-09-17 15:45                   ` Roland Dreier
2009-09-17 15:45                     ` Roland Dreier
     [not found]                     ` <adatyz1d17q.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-09-18 11:50                       ` Ingo Molnar
2009-09-18 11:50                         ` Ingo Molnar
2009-09-29 17:13                       ` Pavel Machek
2009-09-29 17:13                         ` Pavel Machek
     [not found]                         ` <20090929171332.GD14405-I/5MKhXcvmPrBKCeMvbIDA@public.gmane.org>
2009-09-30  9:44                           ` Ingo Molnar
2009-09-30  9:44                             ` Ingo Molnar
     [not found]                             ` <20090930094456.GD24621-X9Un+BFzKDI@public.gmane.org>
2009-09-30 16:02                               ` Jason Gunthorpe
2009-09-30 16:02                                 ` Jason Gunthorpe
     [not found]                                 ` <20090930160232.GZ22310-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2009-10-12 18:19                                   ` Ingo Molnar
2009-10-12 18:19                                     ` Ingo Molnar
     [not found]                                     ` <20091012181944.GF17138-X9Un+BFzKDI@public.gmane.org>
2009-10-12 19:30                                       ` Jason Gunthorpe
2009-10-12 19:30                                         ` Jason Gunthorpe
2009-10-12 20:20                                         ` Ingo Molnar
     [not found]                                           ` <20091012202046.GA7648-X9Un+BFzKDI@public.gmane.org>
2009-10-13  4:05                                             ` Jason Gunthorpe
2009-10-13  4:05                                               ` Jason Gunthorpe
2009-10-13  6:40                                               ` Ingo Molnar
     [not found]                                                 ` <20091013064006.GC9470-X9Un+BFzKDI@public.gmane.org>
2009-10-13 16:27                                                   ` Jason Gunthorpe
2009-10-13 16:27                                                     ` Jason Gunthorpe
2009-10-13  5:43                                           ` Brice Goglin
     [not found]                                             ` <4AD41373.8010108-MZpvjPyXg2s@public.gmane.org>
2009-10-13  6:38                                               ` Ingo Molnar
2009-10-13  6:38                                                 ` Ingo Molnar
2009-09-30 17:06                               ` Roland Dreier
2009-09-30 17:06                                 ` Roland Dreier
2009-10-02 16:32                               ` Roland Dreier
2009-10-02 16:32                                 ` Roland Dreier
     [not found]                                 ` <ada3a61rc3j.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-10-02 20:45                                   ` Pavel Machek
2009-10-02 20:45                                     ` Pavel Machek
2009-10-07 22:34                                   ` Roland Dreier
2009-10-07 22:34                                     ` Roland Dreier
     [not found]                                     ` <ada3a5uq1dk.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2009-10-12 17:33                                       ` Peter Zijlstra
2009-10-12 17:33                                         ` Peter Zijlstra
2009-09-17 14:43         ` Peter Zijlstra
2009-09-17 14:43           ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.