From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262802AbUDOTos (ORCPT ); Thu, 15 Apr 2004 15:44:48 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262849AbUDOTos (ORCPT ); Thu, 15 Apr 2004 15:44:48 -0400 Received: from e6.ny.us.ibm.com ([32.97.182.106]:32484 "EHLO e6.ny.us.ibm.com") by vger.kernel.org with ESMTP id S262802AbUDOToj (ORCPT ); Thu, 15 Apr 2004 15:44:39 -0400 Subject: Re: NUMA API for Linux From: Matthew Dobson Reply-To: colpatch@us.ibm.com To: Andi Kleen Cc: LKML , Andrew Morton , "Martin J. Bligh" In-Reply-To: <20040415123915.016523df.ak@suse.de> References: <1081373058.9061.16.camel@arrakis> <20040407232712.2595ac16.ak@suse.de> <1081989517.1206.206.camel@arrakis> <20040415123915.016523df.ak@suse.de> Content-Type: text/plain Organization: IBM LTC Message-Id: <1082058260.3111.73.camel@arrakis> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.4.5 (1.4.5-7) Date: Thu, 15 Apr 2004 12:44:21 -0700 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2004-04-15 at 03:39, Andi Kleen wrote: > On Wed, 14 Apr 2004 17:38:37 -0700 > Matthew Dobson wrote: > > > > > 1) Redefine the value of some of the MPOL_* flags > > I don't want to merge the flags the and the mode argument. It's ugly. Well, if you're against adding a pid argument, this change is useless anyway. > > 2) Rename check_* to mpol_check_* > > I really don't understand why you insist on renaming all my functions? > I like the current naming, thank you. I don't want to rename them *all*! ;) There's a million functions named check_foo(), so why not a few more? :) > > 3) Remove get_nodes(). This should be done in the same manner as > > sys_sched_setaffinity(). We shouldn't care about unused high bits. > > I disagree on that. This would break programs that are first tested > on a small machine and then later run on a big machine (common case) I don't follow this argument... How does this code work on a small machine, but break on a big one: DECLARE_BITMAP(nodes, MAX_NUMNODES); BITMAP_CLEAR(nodes, MAX_NUMNODES); set_bit(4, nodes); mbind(startaddr, 8 * PAGE_SIZE, MPOL_PREFERRED, nodes, MAX_NUMNODES, flags); ?? Userspace should be declaring an array of unsigned longs based on the size of the machine they're running on. If userspace programs are just declaring arbitrary sized array and hoping that they'll work, they're being stupid. If they're declaring oversized arrays and not zeroing the array before passing it in, then they're being lazy. Either way, we're going to throw the bits away, so do we care if they're being lazy or stupid? If the bits we care about are sane, then we do what userspace tells us. If the bits we care about aren't, we throw an error. Besides, we don't return an error when they pass us too little data, why do we return an error when they pass too much? I remember one of your concerns was checking that all the nodes with set bits need to be online. I think this is wrong. We need to be checking node_online_map at fault time, not at binding time. Sooner or later memory hotplug will go into the kernel, and then you're (or someone) is going to have to rewrite how this is handled. I'd rather do it right the first time. > > 4) Create mpol_check_flags() to, well, check the flags. As the number > > of flags and modes grows, it will be easier to do this check in its own > > function. > > 5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to > > a size_t, add __user to the declaration of 'nmask', change 'maxnode' to > > unsigned long is the standard for system calls. Check some others. Ok. I see this set of system calls as sort of the illegitimate love-child of mlock()/mprotect() and sched_set/getaffinity(). sys_mlock() was using a size_t, so I copied that... > > 'nmask_len', and condense 'flags' and 'mode' into 'flags'. The > > motivation here is to make this syscall similar to > > sys_sched_setaffinity(). These calls are basically the memory > > equivalent of set/getaffinity, and should look & behave that way. Also, > > dropping an argument leaves an opening for a pid argument, which I > > believe would be good. We should allow processes (with appropriate > > permissions, of course) to mbind other processes. > > Messing with other process' VM is a recipe for disaster. There > used to be tons of exploitable races in /proc/pid/mem, I don't want to repeat that. > Adding pid to set_mem_policy would be a bit easier, but it would require > to add a lock to the task struct for this. Currently it is nice and lockless > because it relies on the fact that only the current process can change > its own policy. I prefer to keep it lockless, because that keeps the memory > allocation fast paths faster. We're already grabbing the mmap_sem for writing when we modify the vma's in sys_mbind. I haven't looked at what kind of locking would be necessary if we were modifying a *different* processes vma's, but I figured that would at least be a good start. If it's significantly more complicated than that, you're probably right that it's not worth the effort. > > 6) Change how end is calculated as follows: > > end = PAGE_ALIGN(start+len); > > start &= PAGE_MASK; > > Basically, this allows users to pass in a non-page aligned 'start', and > > makes sure we mbind all pages from the page containing 'start' to the > > page containing 'start'+'len'. > > mprotect() does the EINVAL check on unalignment. I think it's better > to follow mprotect here. Ok. I'm usually a fan of not failing if we don't *have* to, and an unaligned start address is easily fixable. On the other hand, if other syscalls throw errors for it, then it's fine with me. BTW, any response to the idea of combining sys_mbind() & sys_set_mempolicy()? -Matt