From: Ingo Molnar <mingo@elte.hu>
To: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: linux-kernel@vger.kernel.org,
    Andrew Morton <akpm@linux-foundation.org>,
    Linus Torvalds <torvalds@linux-foundation.org>,
    Peter Zijlstra <a.p.zijlstra@chello.nl>,
    linux-mm@kvack.org
Subject: Re: [RFC][PATCH] Cross Memory Attach
Date: Wed, 15 Sep 2010 10:02:35 +0200
Message-ID: <20100915080235.GA13152@elte.hu>
In-Reply-To: <20100915104855.41de3ebf@lilo>

(Interesting patch found on lkml, more folks Cc:-ed)

* Christopher Yeoh <cyeoh@au1.ibm.com> wrote:

> The basic idea behind cross memory attach is to allow MPI programs
> doing intra-node communication to do a single copy of the message
> rather than a double copy of the message via shared memory.
>
> The following patch attempts to achieve this by allowing a destination
> process, given an address and size from a source process, to copy
> memory directly from the source process into its own address space via
> a system call. There is also a symmetrical ability to copy from the
> current process's address space into a destination process's address
> space.
>
> Use of vmsplice instead was considered, but has problems. Since you
> need the reader and writer working co-operatively if the pipe is not
> drained then you block. Which requires some wrapping to do non
> blocking on the send side or polling on the receive. In all to all
> communication it requires ordering otherwise you can deadlock. And in
> the example of many MPI tasks writing to one MPI task vmsplice
> serialises the copying.
>
> I've added the use of this capability to OpenMPI and run some MPI
> benchmarks on a 64-way (with SMT off) Power6 machine which see
> improvements in the following areas:
>
> HPCC results:
> =============
>
> MB/s                  Num Processes
> Naturally Ordered     4       8       16      32
> Base                  1235    935     622     419
> CMA                   4741    3769    1977    703
>
>
> MB/s                  Num Processes
> Randomly Ordered      4       8       16      32
> Base                  1227    947     638     412
> CMA                   4666    3682    1978    710
>
> MB/s                  Num Processes
> Max Ping Pong         4       8       16      32
> Base                  2028    1938    1928    1882
> CMA                   7424    7510    7598    7708
>
>
> NPB:
> ====
> BT - 12% improvement
> FT - 15% improvement
> IS - 30% improvement
> SP - 34% improvement
>
> IMB:
> ===
>
> Ping Pong - ~30% improvement
> Ping Ping - ~120% improvement
> SendRecv - ~100% improvement
> Exchange - ~150% improvement
> Gather(v) - ~20% improvement
> Scatter(v) - ~20% improvement
> AlltoAll(v) - 30-50% improvement
>
> Patch is as below. Any comments?

Impressive numbers!

What did those OpenMPI facilities use before your patch - shared memory
or sockets?

I have an observation about the interface:

> +asmlinkage long sys_copy_from_process(pid_t pid, unsigned long addr,
> +                                      unsigned long len,
> +                                      char __user *buf, int flags);
> +asmlinkage long sys_copy_to_process(pid_t pid, unsigned long addr,
> +                                    unsigned long len,
> +                                    char __user *buf, int flags);

A small detail: 'int flags' should probably be 'unsigned long flags' -
it leaves more space.

Also, note that there is a further performance optimization possible
here: if the other task's ->mm is the same as this task's (they share
the MM), then the copy can be done straight in this process context,
without GUP. User-space might not necessarily be aware of this so it
might make sense to express this special case in the kernel too.

More fundamentally, wouldn't it make sense to create an iovec interface
here? If the Gather(v) / Scatter(v) / AlltoAll(v) workloads have any
fragmentation on the user-space buffer side then the copy of multiple
areas could be done in a single syscall. (The MM lock has to be touched
only once, the target task has to be looked up only once, etc.)

Plus, a small naming detail, shouldn't the naming be more IO like:

  sys_process_vm_read()
  sys_process_vm_write()

Basically a regular read()/write() interface, but instead of fd's we'd
have (PID,addr) identifiers for remote buffers, and instant execution
(no buffering).

This makes these somewhat special syscalls a bit less special :-)

[ In theory we could also use this new ABI in a way to help the various
  RDMA efforts as well - but it looks way too complex. RDMA is rather
  difficult from an OS design POV - and this special case you have
  implemented is much easier to do, as we are in a single trust domain. ]

Thanks,

	Ingo
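
For illustration, here is a minimal user-space sketch of the iovec idea
suggested above. It assumes the vectored form of this interface that was
later merged into Linux as process_vm_readv(2) (declared in <sys/uio.h>
with _GNU_SOURCE), not the sys_copy_from_process() prototype quoted in
this message. The helper name gather_two_regions() and its parameters
are hypothetical; only process_vm_readv() itself is a real call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: gather two separate regions of process `pid` into
 * one contiguous local buffer with a single system call, so the target
 * task is looked up only once for both fragments.
 *
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t gather_two_regions(pid_t pid,
                           void *remote_a, size_t len_a,
                           void *remote_b, size_t len_b,
                           void *dst, size_t dst_len)
{
    struct iovec local[1];
    struct iovec remote[2];

    /* A NULL destination is a caller bug, not a runtime condition. */
    assert(dst != NULL);

    /* Refuse to overrun the local buffer (written to avoid overflow). */
    if (len_a > dst_len || len_b > dst_len - len_a) {
        errno = EINVAL;
        return -1;
    }

    /* One local destination that receives both fragments back to back. */
    local[0].iov_base = dst;
    local[0].iov_len = len_a + len_b;

    /* Two fragments in the remote address space, copied in this order. */
    remote[0].iov_base = remote_a;
    remote[0].iov_len = len_a;
    remote[1].iov_base = remote_b;
    remote[1].iov_len = len_b;

    /* The last argument is 'unsigned long flags'; 0 is passed here. */
    return process_vm_readv(pid, local, 1, remote, 2, 0);
}
```

Passing the caller's own PID (getpid()) is a cheap way to try the helper;
reading from another process requires ptrace-level access to it, and the
call can fail with EPERM or ENOSYS where it is restricted or unavailable.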