linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Introduce nodemask_t ADT [0/7]
@ 2004-03-18 23:04 Matthew Dobson
  2004-03-18 23:23 ` Jesse Barnes
  2004-03-19  0:59 ` Paul Jackson
  0 siblings, 2 replies; 56+ messages in thread
From: Matthew Dobson @ 2004-03-18 23:04 UTC (permalink / raw)
  To: linux-kernel, mbligh, akpm, wli, haveblue

I've got a fairly good size patch set to implement an ADT (Abstract Data
Type) for nodemasks, which follows the path blazed by wli with the
cpumask_t code.  The basic idea is to create a generic,
platform/arch-agnostic nodemask data type, complete with operations to
do most anything you'd want to do with a nodemask.  This stops us from
open-coding nodemask operations, allows non-consecutive node numbering
(ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
numnodes entirely (replaced with num_online_nodes()), and will
facilitate the hotplugging of whole nodes.

As I mentioned, the code is heavily based on Bill Irwin's cpumask_t
code.  The changes are broken into seven patches:

nodemask_t-01-definitions.patch
	The basic definitions of the nodemask_t type: include/linux/nodemask.h,
include/asm-generic/{nodemask.h, nodemask_arith.h, nodemask_array.h,
nodemask_const_reference.h, nodemask_const_value.h, nodemask_nonuma.h},
and some small changes to include/linux/mmzone.h (removing extistant
definition of node_online_map and helper functions).

nodemask_t-02-core.patch
	Changes to arch-independent code.  Surprisingly few references to
numnodes, open-coded node loops, etc.  Most important result of this
patch is that no generic code assumes anything about node numbering. 
This allows individual arches to use sparse numbering if they care to.

nodemask_t-03-i386.patch
	Changes to i386 specific code.  As with most arch changes, it involves
close-coding loops (ie: for_each_online_node(nid) rather than
for(nid=0;nid<numnodes;nid++)) and replacing the use of numnodes with
num_online_nodes() and node_set_online(nid).

nodemask_t-04-ppc64.patch
	Changes to ppc64 specific code.  Untested.  Code review & testing
requested.

nodemask_t-05-x86_64.patch
	Changes to x86_64 specific code.  Untested.  Code review & testing
requested.

nodemask_t-06-ia64.patch
	Changes to ia64 specific code.  Untested.  Code review & testing
requested.

nodemask_t-07-other.patch
	Changes to other arch-specific code (alpha, arm, mips, sparc64 & sh). 
Untested.  Code review & testing requested.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:04 [PATCH] Introduce nodemask_t ADT [0/7] Matthew Dobson
@ 2004-03-18 23:23 ` Jesse Barnes
  2004-03-18 23:32   ` Martin J. Bligh
                     ` (2 more replies)
  2004-03-19  0:59 ` Paul Jackson
  1 sibling, 3 replies; 56+ messages in thread
From: Jesse Barnes @ 2004-03-18 23:23 UTC (permalink / raw)
  To: linux-kernel, colpatch; +Cc: mbligh, akpm, wli, haveblue

On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> do most anything you'd want to do with a nodemask.  This stops us from
> open-coding nodemask operations, allows non-consecutive node numbering
> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> numnodes entirely (replaced with num_online_nodes()), and will
> facilitate the hotplugging of whole nodes.

My hero! :)  I think this has been needed for awhile, but now that I
think about it, it begs the question of what a node is.  Is it a set
of CPUs and blocks of memory (that seems to be the most commonly used
definition in the code), just memory, just CPUs, or what?  On sn2
hardware, we have the concept of a node without CPUs.  And due to our
wacky I/O layout, we also have nodes without CPUs *or* memory!  (The
I/O guys call these "ionodes".)  And then of course, there are CPUs
that aren't particularly close to any memory (i.e. they have none of
their own, and have to go several hops and/or through other CPUs to
get at memory at all).

I'll take a look at the ia64 bits when I get them (I've only received
two of the seven patches thus far).

Jesse



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:23 ` Jesse Barnes
@ 2004-03-18 23:32   ` Martin J. Bligh
  2004-03-18 23:37     ` Jesse Barnes
  2004-03-19  0:01     ` Matthew Dobson
  2004-03-18 23:58   ` Matthew Dobson
  2004-03-19  2:55   ` William Lee Irwin III
  2 siblings, 2 replies; 56+ messages in thread
From: Martin J. Bligh @ 2004-03-18 23:32 UTC (permalink / raw)
  To: Jesse Barnes, linux-kernel, colpatch; +Cc: akpm, wli, haveblue

> On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
>> do most anything you'd want to do with a nodemask.  This stops us from
>> open-coding nodemask operations, allows non-consecutive node numbering
>> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
>> numnodes entirely (replaced with num_online_nodes()), and will
>> facilitate the hotplugging of whole nodes.
> 
> My hero! :)  I think this has been needed for awhile, but now that I
> think about it, it begs the question of what a node is.  Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what?  On sn2
> hardware, we have the concept of a node without CPUs.  And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory!  (The
> I/O guys call these "ionodes".)  And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).

I think the closest answer we have is that it's a grouping of cpus and
memory, where either may be NULL. 

I/O isn't directly associated with a node, though it should fit into the 
topo infrastructure, to give distances from io buses to nodes (for which 
I think we currently use cpumasks, which is probably wrong in retrospect, 
but then life is tough and flawed ;-))

M.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:32   ` Martin J. Bligh
@ 2004-03-18 23:37     ` Jesse Barnes
  2004-03-18 23:43       ` Martin J. Bligh
  2004-03-19  0:01     ` Matthew Dobson
  1 sibling, 1 reply; 56+ messages in thread
From: Jesse Barnes @ 2004-03-18 23:37 UTC (permalink / raw)
  To: linux-kernel, mbligh

On Thursday 18 March 2004 3:32 pm, Martin J. Bligh wrote:
> I think the closest answer we have is that it's a grouping of cpus and
> memory, where either may be NULL. 

Yep, that seems to make the most sense, but then part of me wants to
drop the term node and never use it again :)

> I/O isn't directly associated with a node, though it should fit into the 
> topo infrastructure, to give distances from io buses to nodes (for which 
> I think we currently use cpumasks, which is probably wrong in retrospect, 
> but then life is tough and flawed ;-))

It's probably not too late to change this to
pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
aren't that many callers, are there (my grep is still running)?

Thanks,
Jesse


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:37     ` Jesse Barnes
@ 2004-03-18 23:43       ` Martin J. Bligh
  2004-03-18 23:59         ` Jesse Barnes
                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Martin J. Bligh @ 2004-03-18 23:43 UTC (permalink / raw)
  To: Jesse Barnes, linux-kernel

--On Thursday, March 18, 2004 15:37:10 -0800 Jesse Barnes <jbarnes@sgi.com> wrote:

> On Thursday 18 March 2004 3:32 pm, Martin J. Bligh wrote:
>> I think the closest answer we have is that it's a grouping of cpus and
>> memory, where either may be NULL. 
> 
> Yep, that seems to make the most sense, but then part of me wants to
> drop the term node and never use it again :)

Hey, *I* wasn't the one who started splitting their h/w into wierdo pieces ;-)
Anyway, it's a damned sight shorter than "cpumemset".
 
>> I/O isn't directly associated with a node, though it should fit into the 
>> topo infrastructure, to give distances from io buses to nodes (for which 
>> I think we currently use cpumasks, which is probably wrong in retrospect, 
>> but then life is tough and flawed ;-))
> 
> It's probably not too late to change this to
> pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> aren't that many callers, are there (my grep is still running)?

It probably shouldn't have anything to do with PCI directly either,
so .... ;-) My former thought was that you might just want the most
local memory for DMAing into.

M.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:23 ` Jesse Barnes
  2004-03-18 23:32   ` Martin J. Bligh
@ 2004-03-18 23:58   ` Matthew Dobson
  2004-03-19  2:55   ` William Lee Irwin III
  2 siblings, 0 replies; 56+ messages in thread
From: Matthew Dobson @ 2004-03-18 23:58 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel, mbligh, akpm, wli, haveblue

On Thu, 2004-03-18 at 15:23, Jesse Barnes wrote:
> On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> > do most anything you'd want to do with a nodemask.  This stops us from
> > open-coding nodemask operations, allows non-consecutive node numbering
> > (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> > numnodes entirely (replaced with num_online_nodes()), and will
> > facilitate the hotplugging of whole nodes.
> 
> My hero! :)  I think this has been needed for awhile, but now that I

Anything for a damsel in distress! ;)

> think about it, it begs the question of what a node is.  Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what?

There have been arguments about exactly what a node is since there has
been a concept of a node at all.  In the kernel, it isn't defined.  A
node doesn't *have* to have CPUs on it (see nr_cpus_node()), doesn't
*have* to have memory, doesn't *have* to have I/O.  It's supposed to be
just a container for those 3 things, but the containers can be empty. 
This code doesn't get into what a node is, just makes sure they're used
properly... ;)

> On sn2
> hardware, we have the concept of a node without CPUs.  And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory!  (The
> I/O guys call these "ionodes".)

Yep...  I saw both numnodes and numionodes perusing the ia64 code.  You
should be able to put these CPU/memless nodes in the node_online_map
now...  If there's code that's assuming a node contains either CPUs or
memory, I'd like to find it! :)

> And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).

node_distance(from, to)

> I'll take a look at the ia64 bits when I get them (I've only received
> two of the seven patches thus far).
> 
> Jesse

Super.  I'd really like feedback on the ia64 code (well, actually all
the non-i386 code).  I did what I thought was right, but eyes more
familiar with the code are extremely welcome.

Cheers!

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:43       ` Martin J. Bligh
@ 2004-03-18 23:59         ` Jesse Barnes
  2004-03-19 16:20           ` Martin J. Bligh
  2004-03-19  0:58         ` Zwane Mwaikambo
  2004-03-19  1:08         ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Jesse Barnes @ 2004-03-18 23:59 UTC (permalink / raw)
  To: linux-kernel

On Thursday 18 March 2004 3:43 pm, Martin J. Bligh wrote:
> > It's probably not too late to change this to
> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> > aren't that many callers, are there (my grep is still running)?
> 
> It probably shouldn't have anything to do with PCI directly either,
> so .... ;-) My former thought was that you might just want the most
> local memory for DMAing into.

Right, we want local memory (or potentially remote memory) for DMA,
but what about interrupt redirection?  Some chipsets don't support
interrupt round robin, and just target interrupts at one CPU.  In that
case (and probably the round robin case too), you want to know which
CPU(s) to send the interrupt at.  Can't immediately think of other
in-kernel uses though (administrators will of course want to be able
to locate a given PCI device in a multirack system, but that's another
subject--one that Martin Hicks posted on yesterday).

Jesse


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:32   ` Martin J. Bligh
  2004-03-18 23:37     ` Jesse Barnes
@ 2004-03-19  0:01     ` Matthew Dobson
  1 sibling, 0 replies; 56+ messages in thread
From: Matthew Dobson @ 2004-03-19  0:01 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Jesse Barnes, linux-kernel, akpm, wli, haveblue

On Thu, 2004-03-18 at 15:32, Martin J. Bligh wrote:
> > On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> >> do most anything you'd want to do with a nodemask.  This stops us from
> >> open-coding nodemask operations, allows non-consecutive node numbering
> >> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> >> numnodes entirely (replaced with num_online_nodes()), and will
> >> facilitate the hotplugging of whole nodes.
> > 
> > My hero! :)  I think this has been needed for awhile, but now that I
> > think about it, it begs the question of what a node is.  Is it a set
> > of CPUs and blocks of memory (that seems to be the most commonly used
> > definition in the code), just memory, just CPUs, or what?  On sn2
> > hardware, we have the concept of a node without CPUs.  And due to our
> > wacky I/O layout, we also have nodes without CPUs *or* memory!  (The
> > I/O guys call these "ionodes".)  And then of course, there are CPUs
> > that aren't particularly close to any memory (i.e. they have none of
> > their own, and have to go several hops and/or through other CPUs to
> > get at memory at all).
> 
> I think the closest answer we have is that it's a grouping of cpus and
> memory, where either may be NULL. 
> 
> I/O isn't directly associated with a node, though it should fit into the 
> topo infrastructure, to give distances from io buses to nodes (for which 
> I think we currently use cpumasks, which is probably wrong in retrospect, 
> but then life is tough and flawed ;-))
> 
> M.

Yeah... We used cpumasks because that seemed like a good idea at the
time.  Nodemasks may be a better choice now...  We can write a quicky
inline function nodemask_to_cpumask() as well, to keep the current users
happy.

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:43       ` Martin J. Bligh
  2004-03-18 23:59         ` Jesse Barnes
@ 2004-03-19  0:58         ` Zwane Mwaikambo
  2004-03-19  1:11           ` Jesse Barnes
  2004-03-19  1:08         ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Zwane Mwaikambo @ 2004-03-19  0:58 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Jesse Barnes, linux-kernel

On Thu, 18 Mar 2004, Martin J. Bligh wrote:

> >> I/O isn't directly associated with a node, though it should fit into the
> >> topo infrastructure, to give distances from io buses to nodes (for which
> >> I think we currently use cpumasks, which is probably wrong in retrospect,
> >> but then life is tough and flawed ;-))
> >
> > It's probably not too late to change this to
> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> > aren't that many callers, are there (my grep is still running)?
>
> It probably shouldn't have anything to do with PCI directly either,
> so .... ;-) My former thought was that you might just want the most
> local memory for DMAing into.

That knowledge should be in the dma api thing shouldn't it? But in it's
current incarnation i don't see how that's possible.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:04 [PATCH] Introduce nodemask_t ADT [0/7] Matthew Dobson
  2004-03-18 23:23 ` Jesse Barnes
@ 2004-03-19  0:59 ` Paul Jackson
  2004-03-19  1:19   ` Matthew Dobson
  1 sibling, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  0:59 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

[-- Attachment #1: Type: text/plain, Size: 2170 bytes --]

Your nodemask_t reminds me of something I posted to linux-ia64 last
November 7, 2004, under the subject: "[PATCH preview] Adds nodemask_t
for use in cpusets (NUMA memory placement)".

Chris Hellwig responded to it at the time asking why I didn't provide a
single generic mask ADT, and make cpumask and nodemask instances of
that.

I coded that up, but then got distracted.  The remaining issue for which
I didn't have a good answer was that my proposal would break the optimum
handling for sparc64 or other systems that didn't handle passing
structures on the stack efficiently.

Bill Irwin was a party to my discussions of that effort, so I presume
that if he felt it had further merit, he would have mentioned it to
you, Matthew.

Could one of you, Bill or Matthew, speak to why this generic mask ADT,
underlying both cpumask and nodemask, should not be pursued further,
instead of duplicating the various details of cpumask, after a global
s/cpu/node/g change?

I am attaching the header file include/linux/mask.h for my current
version of this mask.h, in case someone reading wants more specifics of
what it is that I am referring to.

This version almost certainly does _not_ work on big endian 64 systems,
due to my ignorance of how kernel bitmasks were layed out when I last
worked on this mask.h header.  Unlike the sparc64 performance issues
with passing structs on the stack, I would hope that the big/little
endian issues could be fixed without messing things up too much.

If this mask.h could actually be made to work, including on sparc64,
then it would seem to be a much cleaner solution.  With it, we would
define cpumask_t and nodemask_t as simply:

  typedef __mask(NR_NODES) nodemask_t;
  typedef __mask(NR_CPUS) cpumask_t;

and either use operations such as mask_and() on both, or if one insisted
on keeping operations that specifically called out cpumask, add some
15 trivial defines such as:

  #define cpumask_and(dst, src1, src2) mask_and(dst, src1, src2)

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

[-- Attachment #2: mask.h --]
[-- Type: text/plain, Size: 7734 bytes --]

#ifndef _MASK_H
#define _MASK_H

/*
 * linux/include/linux/mask.h
 *
 * Copyright (c) 2003 Silicon Graphics, Inc. All rights reserved.
 *
 * Paul Jackson <pj@sgi.com>
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 */

/* Mask is an abstract data type for multi-word bit masks.  Masks are
 * wrapped in a struct, so that they can be passed about as lvalues.
 * The available mask operations and their rough meaning in a single
 * word, single-threaded, unsafe, hypothetical case are:
 *
 * mask_bitsize(mask)			sizeof(mask) * 8
 * mask_snprintf(buf, len, mask)	snprintf(buf, len, "%lx", mask)
 * mask_parse(ubuf, ubuflen, mask)	parse comma sep 32 bit words to mask
 * mask_parselist(buf, mask)		parse cpuset =/+/- list to mask
 * mask_bitset(bit, mask)		mask |= bit
 * mask_bitclear(bit, mask)		mask &= ~bit
 * mask_allset(mask)			mask = ~0UL
 * mask_allclear(mask)			mask = 0UL
 * mask_setnbits(mask, nbits)		mask = ~(~0UL << nbits)
 * mask_isset(bit, mask)		mask | bit
 * mask_test_and_set(bit, mask)		mask | bit ? 0 : mask |= bit
 * mask_assign(dst, src)		dst = src
 * mask_andnot(dst, src)		dst &= ~src
 * mask_orequal(dst, src)		dst |= src
 * mask_and(dst, src1, src2)		dst = src1 & src2
 * mask_or(dst, src1, src2)		dst = src1 | src2
 * mask_complement(mask)		mask = ~mask
 * mask_equal(mask1, mask2)		mask1 == mask2
 * mask_empty(mask)			mask == 0UL
 * mask_weight(mask)			Hamming Weight: number set bits
 * mask_shift_right(dst, src, n)	dst = src >> n
 * mask_shift_left(dst, src, n)		dst = src << n
 * mask_first(mask)			find_first_bit(mask)
 * mask_next(bit, mask)			find_next_bit(mask, size, bit)
 * mask_of_bit(n, mask)			(typeof(mask)) (1 << n)
 * MASK_ALL(mask, nbits)		(typeof(mask)) ~(~0UL << nbits)
 * MASK_NONE(mask)			(typeof(mask)) 0UL
 *
 * Example usage to establish a cpumask_t type and cpu_* operations:
 *   typedef __mask(NR_CPUS) cpumask_t;
 *   #define cpu_set(cpu, map) mask_bitset((cpu), (map))
 *   #define ... 15 or so more cpu_ops() using corresponding mask_ops().
 * Or just do the one line "typedef __mask[..] xxx_t" definition,
 *   skip the custom named op definitions, and use the mask_* ops
 *   directly.  Maybe the custom cpu_* ops should have been done that
 *   way in the first place.
 *
 * The above ops that set or clear bits within an existing mask
 *   do so with atomic instructions, but operations on multiple
 *   words are not thread safe - bring your own locks.
 *
 * The existing definitions of CPU_MASK_ALL were inconsistent.
 *   For cpumasks of one word size or less, they set exactly
 *   NR_CPUS bits, leaving any remaining high order bits zero.
 *   For cpumasks of multiple words, all bits were set, which
 *   might result in a cpumask of weight > NR_CPUS, if NR_CPUS
 *   is not an exact multiple of the number of bits in an
 *   unsigned long.  The MASK_ALL(mask, nbits) macro follows
 *   this inconsistency.
 */

#include "linux/bitops.h"
#include "linux/string.h"
#include "linux/types.h"
#include "linux/kernel.h"
#include "linux/bitmap.h"

/* Deliberately not using DECLARE_BITMAP() - gratuitously opaque. -pj */
#define __mask(bits)	struct { unsigned long _m[BITS_TO_LONGS(bits)]; }

extern int bitmap_snprintf(char *buf, unsigned int buflen,
	const unsigned long *maskp, unsigned int nmaskbits);

extern int bitmap_parse(const char __user *ubuf, unsigned int ubuflen,
	unsigned long *maskp, int nmaskbits);

extern int bitmap_parselist(char *buf,
	unsigned long *maskp, int nmaskbits);

extern void bitmap_setnbits(unsigned long *maskp,
	int nmaskbits, int nbits);

#define mask_bitsize(mask) (sizeof(mask) * 8)

/* Silently right truncates if buflen too short. */
#define mask_snprintf(buf, buflen, mask)		\
	bitmap_snprintf((buf), (buflen),		\
	((mask)._m), sizeof(mask)*8);

#define mask_parse(ubuf, ubuflen, mask)			\
	bitmap_parse((ubuf), (ubuflen), 		\
	((mask)._m), sizeof(mask)*8);

#define mask_parselist(buf, mask)			\
	bitmap_parse((buf),		 		\
	((mask)._m), sizeof(mask)*8);

#define mask_bitset(bit, mask)				\
	(!sizeof(mask) ? 1 :				\
	set_bit((bit), (mask)._m))

#define mask_bitclear(bit, mask)			\
	(!sizeof(mask) ? 0 :				\
	clear_bit((bit), (mask)._m))

#define mask_allset(mask) 				\
	memset((mask)._m, 0xff, sizeof(mask))

#define mask_allclear(mask) 				\
	memset((mask)._m, 0, sizeof(mask))

#define mask_setnbits(mask, nbits)			\
do {							\
	if (sizeof(mask) == sizeof(unsigned long))	\
		(mask)._m[0] = ~(~0UL << ((nbits)%BITS_PER_LONG));	\
	else						\
		bitmap_setnbits((mask)._m,		\
			sizeof(mask)*8, (nbits));	\
} while (0)

#define mask_isset(bit, mask)				\
	(!sizeof(mask) ? 1 : 				\
	test_bit((bit), (mask)._m))

#define mask_test_and_set(bit, mask)			\
	(!sizeof(mask) ? 1 : 				\
	test_and_set_bit(bit, (mask)._m))

#define mask_assign(dst, src)				\
		((dst) = (src))

#define mask_andnot(dst, src)				\
do {							\
	unsigned long *dstp = (dst)._m;			\
	const unsigned long *srcp = (src)._m;		\
	unsigned int masklen = ARRAY_SIZE((dst)._m);	\
	int i;						\
	for (i = 0; i < masklen; i++)			\
		dstp[i] &= ~srcp[i];			\
} while (0)

#define mask_orequal(dst, src)				\
do {							\
	unsigned long *dstp = (dst)._m;			\
	const unsigned long *srcp = (src)._m;		\
	unsigned int masklen = ARRAY_SIZE((dst)._m);	\
	int i;						\
	for (i = 0; i < masklen; i++)			\
		dstp[i] |= srcp[i];			\
} while (0)

#define mask_and(dst, src1, src2)			\
	bitmap_and((dst)._m, (src1)._m, (src2)._m,	\
		mask_bitsize(dst))

#define mask_or(dst, src1, src2)			\
	bitmap_or((dst)._m, (src1)._m, (src2)._m,	\
		mask_bitsize(dst))

#define mask_complement(mask)				\
	bitmap_complement((mask)._m, mask_bitsize(mask))

#define mask_equal(mask1, mask2)			\
	bitmap_equal((mask1)._m, (mask2)._m,		\
		mask_bitsize(mask1))

#define mask_empty(mask)				\
	bitmap_empty((mask)._m, mask_bitsize(mask))

#define mask_weight(mask)				\
	bitmap_weight((mask)._m, mask_bitsize(mask))

#define mask_shift_right(dst, src, shift)		\
	bitmap_shift_right((dst)._m, (src)._m, (shift),	\
		mask_bitsize(dst))

#define mask_shift_left(dst, src, shift)		\
	bitmap_shift_left((dst)._m, (src)._m, (shift),	\
		mask_bitsize(dst))

#define mask_first(mask)				\
	find_first_bit((mask)._m, mask_bitsize(mask))

#define mask_next(n, mask)				\
	(!sizeof(mask) ? 0 : 				\
	find_next_bit((mask)._m, mask_bitsize(mask), (n)+1))

/* Need input 'mask' argument just to get type */
#define mask_of_bit(n, mask)				\
({							\
	typeof(mask) __mask;				\
	mask_allclear(__mask);				\
	mask_bitset((n), __mask);			\
	__mask;						\
})

#define MASK_ALL(mask, nbits)				\
({							\
	typeof(mask) __mask;				\
	int nbitshift = (nbits) % BITS_PER_LONG;	\
	if (sizeof(mask) == sizeof(unsigned long))	\
		__mask._m[0] = ~(~0UL << nbitshift);	\
	else						\
		mask_setnbits(__mask, (nbits));		\
	__mask;						\
})

#define MASK_NONE(mask)					\
({							\
	typeof(mask) __mask;				\
	mask_allclear(__mask);				\
	__mask;						\
})

#endif /* _MASK_H */

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:43       ` Martin J. Bligh
  2004-03-18 23:59         ` Jesse Barnes
  2004-03-19  0:58         ` Zwane Mwaikambo
@ 2004-03-19  1:08         ` Paul Jackson
  2 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  1:08 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: jbarnes, linux-kernel

> Anyway, it's a damned sight shorter than "cpumemset".

Heh - watch it - I saw that! ... LOL

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  0:58         ` Zwane Mwaikambo
@ 2004-03-19  1:11           ` Jesse Barnes
  2004-03-19  1:34             ` Zwane Mwaikambo
  0 siblings, 1 reply; 56+ messages in thread
From: Jesse Barnes @ 2004-03-19  1:11 UTC (permalink / raw)
  To: linux-kernel, zwane

On Thursday 18 March 2004 4:58 pm, Zwane Mwaikambo wrote:
> > It probably shouldn't have anything to do with PCI directly either,
> > so .... ;-) My former thought was that you might just want the most
> > local memory for DMAing into.
> 
> That knowledge should be in the dma api thing shouldn't it? But in it's
> current incarnation i don't see how that's possible.

Couldn't we mostly hide it under the covers (though obviously not in
the intentionally remote case I mentioned earlier):


===== arch/ia64/sn/io/machvec/pci_dma.c 1.28 vs edited =====
--- 1.28/arch/ia64/sn/io/machvec/pci_dma.c	Sun Mar 14 11:17:06 2004
+++ edited/arch/ia64/sn/io/machvec/pci_dma.c	Thu Mar 18 17:08:13 2004
@@ -130,13 +130,11 @@
 	device_sysdata = SN_DEVICE_SYSDATA(hwdev);
 	vhdl = device_sysdata->vhdl;
 
-	/*
-	 * Allocate the memory.
-	 * FIXME: We should be doing alloc_pages_node for the node closest
-	 *        to the PCI device.
-	 */
-	if (!(cpuaddr = (void *)__get_free_pages(GFP_ATOMIC, get_order(size))))
+	cpuaddr = alloc_pages_node(pci_to_node(hwdev), GFP_ATOMIC, (get_order(size)));
+	if (!cpuaddr)
 		return NULL;
+
+	cpuaddr = page_to_virt(cpuaddr);
 
 	memset(cpuaddr, 0x0, size);
 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  0:59 ` Paul Jackson
@ 2004-03-19  1:19   ` Matthew Dobson
  2004-03-19  1:45     ` Paul Jackson
                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Matthew Dobson @ 2004-03-19  1:19 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

On Thu, 2004-03-18 at 16:59, Paul Jackson wrote:
> Your nodemask_t reminds me of something I posted to linux-ia64 last
> November 7, 2004, under the subject: "[PATCH preview] Adds nodemask_t
> for use in cpusets (NUMA memory placement)".
> 
> Chris Hellwig responded to it at the time asking why I didn't provide a
> single generic mask ADT, and make cpumask and nodemask instances of
> that.

That is a better idea, if it can be made to work.  My goal is to stop
the proliferation of open-coded references to node details as soon as
possible.  If we we nip this behavior in the bud and convert all users
of cpu/node data to cpumask_t/nodemask_t, we can (more) easily change
the underlying details of how all the cpumask and nodemask functions
work later.  If the users all call our macros, then it's easy to find
them ('grep -r "nodes_and" *' vs searching for every '&' in the kernel
that may or may not be a node or cpu mask) and test them.

> I coded that up, but then got distracted.  The remaining issue for which
> I didn't have a good answer was that my proposal would break the optimum
> handling for sparc64 or other systems that didn't handle passing
> structures on the stack efficiently.
> 
> Bill Irwin was a party to my discussions of that effort, so I presume
> that if he felt it had further merit, he would have mentioned it to
> you, Matthew.

He never mentioned it to me when I queried him for details on his
cpumask_t implementation...

> Could one of you, Bill or Matthew, speak to why this generic mask ADT,
> underlying both cpumask and nodemask, should not be pursued further,
> instead of duplicating the various details of cpumask, after a global
> s/cpu/node/g change?

Nope.  I think it should.  Though it is hard to optimize for generic
masks.  We've got the bitmap_* functions which work nicely for unsigned
long[].  These (cpumask_t/nodemask_t) are nice because they are
optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
as well as for long mask cases (passing by structs reference).  It's
hard to make those types of optimizations on generic masks.

> I am attaching the header file include/linux/mask.h for my current
> version of this mask.h, in case someone reading wants more specifics of
> what it is that I am referring to.
> 
> This version almost certainly does _not_ work on big endian 64 systems,
> due to my ignorance of how kernel bitmasks were layed out when I last
> worked on this mask.h header.  Unlike the sparc64 performance issues
> with passing structs on the stack, I would hope that the big/little
> endian issues could be fixed without messing things up too much.
> 
> If this mask.h could actually be made to work, including on sparc64,
> then it would seem to be a much cleaner solution.  With it, we would
> define cpumask_t and nodemask_t as simply:
> 
>   typedef __mask(NR_NODES) nodemask_t;
>   typedef __mask(NR_CPUS) cpumask_t;
> 
> and either use operations such as mask_and() on both, or if one insisted
> on keeping operations that specifically called out cpumask, add some
> 15 trivial defines such as:
> 
>   #define cpumask_and(dst, src1, src2) mask_and(dst, src1, src2)

I'll look it over and see how it looks.  As I said, I'm very much for a
generic mask type if we can do it properly and efficiently.

Cheers!

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:11           ` Jesse Barnes
@ 2004-03-19  1:34             ` Zwane Mwaikambo
  2004-03-19  1:40               ` Jesse Barnes
  0 siblings, 1 reply; 56+ messages in thread
From: Zwane Mwaikambo @ 2004-03-19  1:34 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel

On Thu, 18 Mar 2004, Jesse Barnes wrote:

> On Thursday 18 March 2004 4:58 pm, Zwane Mwaikambo wrote:
> > > It probably shouldn't have anything to do with PCI directly either,
> > > so .... ;-) My former thought was that you might just want the most
> > > local memory for DMAing into.
> >
> > That knowledge should be in the dma api thing shouldn't it? But in it's
> > current incarnation i don't see how that's possible.
>
> Couldn't we mostly hide it under the covers (though obviously not in
> the intentionally remote case I mentioned earlier):

How about honouring DMA masks? Would this work with a 32bit DMA mask on a
node with memory above the 4G mark. This is for the more impaired systems
out there of course =)

> ===== arch/ia64/sn/io/machvec/pci_dma.c 1.28 vs edited =====
> --- 1.28/arch/ia64/sn/io/machvec/pci_dma.c	Sun Mar 14 11:17:06 2004
> +++ edited/arch/ia64/sn/io/machvec/pci_dma.c	Thu Mar 18 17:08:13 2004
> @@ -130,13 +130,11 @@
>  	device_sysdata = SN_DEVICE_SYSDATA(hwdev);
>  	vhdl = device_sysdata->vhdl;
>
> -	/*
> -	 * Allocate the memory.
> -	 * FIXME: We should be doing alloc_pages_node for the node closest
> -	 *        to the PCI device.
> -	 */
> -	if (!(cpuaddr = (void *)__get_free_pages(GFP_ATOMIC, get_order(size))))
> +	cpuaddr = alloc_pages_node(pci_to_node(hwdev), GFP_ATOMIC, (get_order(size)));
> +	if (!cpuaddr)
>  		return NULL;
> +
> +	cpuaddr = page_to_virt(cpuaddr);
>
>  	memset(cpuaddr, 0x0, size);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:34             ` Zwane Mwaikambo
@ 2004-03-19  1:40               ` Jesse Barnes
  0 siblings, 0 replies; 56+ messages in thread
From: Jesse Barnes @ 2004-03-19  1:40 UTC (permalink / raw)
  To: linux-kernel

On Thursday 18 March 2004 5:34 pm, Zwane Mwaikambo wrote:
> How about honouring DMA masks? Would this work with a 32bit DMA mask on a
> node with memory above the 4G mark. This is for the more impaired systems
> out there of course =)

Yeah, systems w/o an IOMMU are going to have trouble with this...

Jesse


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:19   ` Matthew Dobson
@ 2004-03-19  1:45     ` Paul Jackson
  2004-03-19 22:51       ` Matthew Dobson
  2004-03-19  1:48     ` Paul Jackson
  2004-03-19  1:56     ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  1:45 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

> It's hard to make those types of optimizations on generic masks.

I would be assuming that by "generic" we meant arrays of unsigned longs
(or one unsigned long or something isomorphic to one or more unsigned
longs ...).

And I'm assuming that we mean of a size that would allow for putting a
couple of them on a kernel stack ... not _too_ big. Probably NR_CPUS
rough upper limit on the size that was practical to use.

I wouldn't want to get _too_ generic.


> I'll look it over and see how it looks.

Thanks.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:19   ` Matthew Dobson
  2004-03-19  1:45     ` Paul Jackson
@ 2004-03-19  1:48     ` Paul Jackson
  2004-03-19  1:56     ` Paul Jackson
  2 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  1:48 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

> My goal is to stop the proliferation of open-coded references
> to node details as soon as possible.

A worthy goal.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:19   ` Matthew Dobson
  2004-03-19  1:45     ` Paul Jackson
  2004-03-19  1:48     ` Paul Jackson
@ 2004-03-19  1:56     ` Paul Jackson
  2004-03-19 23:02       ` Matthew Dobson
  2 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  1:56 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

> These (cpumask_t/nodemask_t) are nice because they are
> optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
> as well as for long mask cases (passing by structs reference). 

When I looked at the assembly code generated on my one lung i386 box for
native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
struct of an array of unsigned longs, both for the single unsigned long
(<= 32 bits) and multiple unsigned long cases.

Except for the sparc64 guys and their friends who disparage passing
structs on the stack, I conjecture that the single implementation of a
struct of an array of unsigned longs is nearly ideal for all
architectures.

... go ahead ... prove me wrong.  It probably won't be hard ;).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:23 ` Jesse Barnes
  2004-03-18 23:32   ` Martin J. Bligh
  2004-03-18 23:58   ` Matthew Dobson
@ 2004-03-19  2:55   ` William Lee Irwin III
  2 siblings, 0 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-19  2:55 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel, colpatch, mbligh, akpm, haveblue

On Thu, Mar 18, 2004 at 03:23:10PM -0800, Jesse Barnes wrote:
> My hero! :)  I think this has been needed for awhile, but now that I
> think about it, it begs the question of what a node is.  Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what?  On sn2
> hardware, we have the concept of a node without CPUs.  And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory!  (The
> I/O guys call these "ionodes".)  And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).
> I'll take a look at the ia64 bits when I get them (I've only received
> two of the seven patches thus far).

You need a tripartite graph.
(a) are connections full or half duplex? (i.e. directed or undirected)
(b) do you need distinct weights on each edge? (i.e. weighted or unweighted)


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-18 23:59         ` Jesse Barnes
@ 2004-03-19 16:20           ` Martin J. Bligh
  0 siblings, 0 replies; 56+ messages in thread
From: Martin J. Bligh @ 2004-03-19 16:20 UTC (permalink / raw)
  To: Jesse Barnes, linux-kernel

--Jesse Barnes <jbarnes@sgi.com> wrote (on Thursday, March 18, 2004 15:59:42 -0800):

> On Thursday 18 March 2004 3:43 pm, Martin J. Bligh wrote:
>> > It's probably not too late to change this to
>> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
>> > aren't that many callers, are there (my grep is still running)?
>> 
>> It probably shouldn't have anything to do with PCI directly either,
>> so .... ;-) My former thought was that you might just want the most
>> local memory for DMAing into.
> 
> Right, we want local memory (or potentially remote memory) for DMA,
> but what about interrupt redirection?  Some chipsets don't support
> interrupt round robin, and just target interrupts at one CPU.  In that
> case (and probably the round robin case too), you want to know which
> CPU(s) to send the interrupt at.  Can't immediately think of other
> in-kernel uses though (administrators will of course want to be able
> to locate a given PCI device in a multirack system, but that's another
> subject--one that Martin Hicks posted on yesterday).

I think we need both ... maybe a cpumask and a zonelist as the destination
types.

M.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:45     ` Paul Jackson
@ 2004-03-19 22:51       ` Matthew Dobson
  2004-03-19 23:42         ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Dobson @ 2004-03-19 22:51 UTC (permalink / raw)
  To: Paul Jackson
  Cc: LKML, Martin J. Bligh, akpm, William Lee Irwin III, Dave Hansen, hch

On Thu, 2004-03-18 at 17:45, Paul Jackson wrote:
> > It's hard to make those types of optimizations on generic masks.
> 
> I would be assuming that by "generic" we meant arrays of unsigned longs
> (or one unsigned long or something isomorphic to one or more unsigned
> longs ...).
> 
> And I'm assuming that we mean of a size that would allow for putting a
> couple of them on a kernel stack ... not _too_ big. Probably NR_CPUS
> rough upper limit on the size that was practical to use.
> 
> I wouldn't want to get _too_ generic.

Well, if we're going to make a generic bitmap type, it shouldn't have
size limitations, as almost any limit we set will be too small
eventually...  Supporting arbitrary length bitmaps doesn't mean we can't
try to optimize for smaller masks, like less than a couple unsigned
longs, as well as single unsigned long optimizations.

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  1:56     ` Paul Jackson
@ 2004-03-19 23:02       ` Matthew Dobson
  2004-03-20  0:59         ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: Matthew Dobson @ 2004-03-19 23:02 UTC (permalink / raw)
  To: Paul Jackson
  Cc: LKML, Martin J. Bligh, akpm, William Lee Irwin III, Dave Hansen, hch

On Thu, 2004-03-18 at 17:56, Paul Jackson wrote:
> > These (cpumask_t/nodemask_t) are nice because they are
> > optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
> > as well as for long mask cases (passing by structs reference). 
> 
> When I looked at the assembly code generated on my one lung i386 box for
> native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
> struct of an array of unsigned longs, both for the single unsigned long
> (<= 32 bits) and multiple unsigned long cases.

The code you wrote, or my patch?

> Except for the sparc64 guys and their friends who disparage passing
> structs on the stack, I conjecture that the single implementation of a
> struct of an array of unsigned longs is nearly ideal for all
> architectures.
> 
> ... go ahead ... prove me wrong.  It probably won't be hard ;).

Sounds like a good idea.  We certainly shouldn't be passing huge masks
on the stack, but for small masks like, i dunno, <= 4 ULs (the same
optimization Bill's code makes) it's no problem.

The thing about code specific to node and cpu masks is that we *know*
what the masks we're manipulating are used for, and that lets us do
things like not letting callers set bits other than 0 on UP cpumasks, or
throwing a BUG when they do.  Or optimizing first_node() to be smarter
than just calling find_first_set() on the passed in mask by just
checking whether bit 0 is set.

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19 22:51       ` Matthew Dobson
@ 2004-03-19 23:42         ` Paul Jackson
  0 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-19 23:42 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

> Well, if we're going to make a generic bitmap type, it shouldn't have
> size limitations, as almost any limit we set will be too small
> eventually... 

A trillion bits ...

Nah - bitmasks are a bunch of bits, layed out in order.  This is
appropriate for tens, hundreds, even (on big iron) thousands of
bits.

There are limitations to the practical application of such bitmasks,
as I'm sure we both agree and both accept.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19 23:02       ` Matthew Dobson
@ 2004-03-20  0:59         ` Paul Jackson
  2004-03-20  3:18           ` William Lee Irwin III
  0 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-20  0:59 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, mbligh, akpm, wli, haveblue, hch

> > When I looked at the assembly code generated on my one lung i386 box for
> > native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
> > struct of an array of unsigned longs, both for the single unsigned long
> > (<= 32 bits) and multiple unsigned long cases.
> 
> The code you wrote, or my patch?

The code I wrote and appended yesterday.  Though, as I realize now in my
post of a few minutes ago, there are perhaps 8 places where instead of
calling various bitmap_ops() unconditionally ('and', 'or', ...) it would
be better to peel off the one-word case and inline it.


> Sounds like a good idea.  We certainly shouldn't be passing huge masks
> on the stack, but for small masks like, i dunno, <= 4 ULs (the same
> optimization Bill's code makes) it's no problem.

Don't we have quite a few places with one, two, even three local variables
of type cpumask_t?  Which live on the stack?  For all mask implementations?
Grep around for "cpumask_t.*,.*," and many of the lines you see appear to
be declarations of such local cpumask_t variables.

And we need to be careful of converting pass by value semantics for small
cpumasks into pass by reference semantics for large cpumasks, as a hidden
feature of the implementation.  One could code some cute bugs that way.

Better, I think, to provide a reasonably rich set of mask ops, so that the
using code need not have anymore than the essential number of distinctly
different masks hanging around at once in order to write clear code.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  0:59         ` Paul Jackson
@ 2004-03-20  3:18           ` William Lee Irwin III
  2004-03-20  6:09             ` Paul Jackson
  2004-03-20  8:02             ` Paul Jackson
  0 siblings, 2 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-20  3:18 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

At some point in the past, Matt Dobson wrote:
>> Sounds like a good idea.  We certainly shouldn't be passing huge masks
>> on the stack, but for small masks like, i dunno, <= 4 ULs (the same
>> optimization Bill's code makes) it's no problem.

On Fri, Mar 19, 2004 at 04:59:28PM -0800, Paul Jackson wrote:
> Don't we have quite a few places with one, two, even three local variables
> of type cpumask_t?  Which live on the stack?  For all mask implementations?
> Grep around for "cpumask_t.*,.*," and many of the lines you see appear to
> be declarations of such local cpumask_t variables.

The stack footprint of cpumasks is a concern in general. I don't have a
good answer to this. The half-answer I anticipate is that the truly
larger systems will configure themselves with deeper stacks. There isn't
truly a good answer to this. It's overhead for the larger systems, and
heap allocation would be overhead for smaller systems. Unfortunately,
our general design criteria require the larger systems to eat the
overhead until something imaginative is come up with to reduce the
overhead with low code impact and no cost to smaller systems. One thing
that would help is more expressiveness in the API; it's not entirely
clear how to better 3-address code, but OTOH, reducing the number of
intermediate operands is conceivable and would alleviate those overheads.


On Fri, Mar 19, 2004 at 04:59:28PM -0800, Paul Jackson wrote:
> And we need to be careful of converting pass by value semantics for small
> cpumasks into pass by reference semantics for large cpumasks, as a hidden
> feature of the implementation.  One could code some cute bugs that way.
> Better, I think, to provide a reasonably rich set of mask ops, so that the
> using code need not have anymore than the essential number of distinctly
> different masks hanging around at once in order to write clear code.

This is one of the areas where I believe I carried out some innovation.
cpumask_const_t allows more aggressive conversion to call-by-reference
in a safe manner as the constancy of the reference makes the difference
purely operational. It falls down only in scenarios where the input
would be modified. Also, when the argument is actually expected to be
modified, direct call by reference can be used. So only in the case
where a temporary copy is truly required are you forced to do it
(apart from getting the function prototype merged), and the fallback of
cpumask_const_t to call by value in the small case makes it cheap for
smaller machines also, by avoiding the indirection.

It may be worth investigating analogues of cpumask_const_t for more
generic bitmask types (though of course I'm not going to insist on a
force fit).


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  3:18           ` William Lee Irwin III
@ 2004-03-20  6:09             ` Paul Jackson
  2004-03-20  9:36               ` William Lee Irwin III
  2004-03-20  8:02             ` Paul Jackson
  1 sibling, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-20  6:09 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> The stack footprint of cpumasks is a concern in general. I don't have a
> good answer to this. The half-answer I anticipate is that the truly
> larger systems will configure themselves with deeper stacks.

Sounds like a good enough answer to me.


That is, a richer API can help - more 

> This is one of the areas where I believe I carried out some innovation.
> cpumask_const_t allows more aggressive conversion to call-by-reference

True, you did do some substantial work there, and for const objects,
calls by value and reference can be used interchangeably, for best
performance, without semantic impact.

However, something about the current cpus_*_const() macros doesn't seem
to be having the desired impact.  I see five uses in files matching
include/asm-i386/mach-*/mach_apic.h, and one in include/asm-x86_64/smp.h
That's all.  None, for example, in any ia64 code.

That's it.  And why should one have to code explicitly the choice of
the cpus_*_const() variant?  Shouldn't each macro know which of routines
it calls change things, and which don't, letting it pass a pointer to
the read-only routines if that helps?  It knows the sizes as well, so
it can even pick and choose which variation of code to generate.

If one needs an explicit call by reference to avoid passing a multi-word
object, one should ask the user to pass an explicit pointer, to a
routine or macro that expects a pointer to a non-const, not an apparent
value.  Shouldn't try to hide the reference semantics behind quasi-const
labels.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  3:18           ` William Lee Irwin III
  2004-03-20  6:09             ` Paul Jackson
@ 2004-03-20  8:02             ` Paul Jackson
  2004-03-20 11:13               ` William Lee Irwin III
  1 sibling, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-20  8:02 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

Other ways to reduce cpumask copies:

 1) Additional macros, for example a boolean cpus_intersect(x,y), which
    is true iff x overlaps y, and saves the tmp cpumask in the code:

        cpus_and(tmp, target_cpu_mask, allowed_mask);
        if (!cpus_empty(tmp)) {

    or a cpus_subset(x,y) "is x a subset of y" macro to replace this:

        cpus_and(tmp, cpumask, cpu_callout_map);
        BUG_ON(!cpus_equal(cpumask, tmp));

   with this:

	BUG_ON(!cpus_subset(cpumask, cpu_callout_map));

 2) Using existing macros more carefully, for example using cpu_set() to
    set the bit in the following, and saving the copy by assignment:

	pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu);

    or using the existing cpu_isset() macro, replacing the code (seen
    in part above ;):

        cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
        target_cpu_mask = cpumask_of_cpu(min_loaded);
        cpus_and(tmp, target_cpu_mask, allowed_mask);
        if (!cpus_empty(tmp)) {

    with the code:

	if (cpu_isset(min_loaded, cpu_online_map) && cpu_isset(min_loaded, irq_affinity[selected_irq]) {

    The current nested and ifdef'd complexity of the cpumask macro headers works
    against such efficient coding - it's non-trivial to see just what macros are
    available.  The youngins reading this may not appreciate the value of reducing
    brain load; but the oldins might.

 2) Pass a cpumask pointer to a function that generates a cpumask instead of
    taking one as a return value, letting the called function fill in the memory
    so referenced.  We should not be trying to hide such details, unless we can
    do so seamlessly and consistent with traditional 'C' semantics.

 3) Passing a const cpumask pointer to a function that examines a cpumask
    instead of passing the cpumask by value (on small systems, its one word
    either way - on big systems, it saves copying a multiword mask on the
    stack).

 4) Throwing away dead code:

	static int physical_balance = 0;
        cpumask_t tmp;
	cpus_shift_right(tmp, cpu_online_map, 2);
	if (smp_num_siblings > 1 && !cpus_empty(tmp)
		physical_balance = 1;

The above code fragments are from arch/i386 in 2.6.3-rc1-mm1.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  6:09             ` Paul Jackson
@ 2004-03-20  9:36               ` William Lee Irwin III
  2004-03-22 23:59                 ` Paul Jackson
                                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-20  9:36 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

At some point in the past, I wrote:
>> The stack footprint of cpumasks is a concern in general. I don't have a
>> good answer to this. The half-answer I anticipate is that the truly
>> larger systems will configure themselves with deeper stacks.

On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> Sounds like a good enough answer to me.
> That is, a richer API can help - more 

I know that a richer repertoire of operations will reduce the stack
footprint and mentioned it. There is one issue and one issue only: the
larger the "instruction set" grows, the more intrusive and complex-
looking the thing appears, the more of quagmire merging it becomes. The
bits about deeper stacks not really a "good enough" answer. It's an
answer that trades off code impact for overhead on the systems that
demand the mechanism. In industrial/corporate kernels, this wouldn't
fly as votes are dollars, but here votes are users, and the smaller
systems' performance concerns and broader userbase's maintenance
concerns took precedence over the technically superior solution, at
least for the initial merge.


At some point in the past, I wrote:
>> This is one of the areas where I believe I carried out some innovation.
>> cpumask_const_t allows more aggressive conversion to call-by-reference

On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> True, you did do some substantial work there, and for const objects,
> calls by value and reference can be used interchangeably, for best
> performance, without semantic impact.
> However, something about the current cpus_*_const() macros doesn't seem
> to be having the desired impact.  I see five uses in files matching
> include/asm-i386/mach-*/mach_apic.h, and one in include/asm-x86_64/smp.h
> That's all.  None, for example, in any ia64 code.

Yes, spreading the cpumask_const_t use is needed. I pretty much ran out
of steam before getting its use propagated around, and it was also a late
addition that the arch maintainers who sent in their own updates didn't
get a very wide window to adapt to. I think it's vaguely fair to ask
those who need it to propagate its use around themselves.


On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> That's it.  And why should one have to code explicitly the choice of
> the cpus_*_const() variant?  Shouldn't each macro know which of routines
> it calls change things, and which don't, letting it pass a pointer to
> the read-only routines if that helps?  It knows the sizes as well, so
> it can even pick and choose which variation of code to generate.

This explicitly forces an unnecessary indirection on smaller systems
where the thing may fit into one machine word anyway, meaning it doesn't
make sense to pass it by reference.

Or in a second possible interpretation, this may already be the current
state of affairs, and you're suggesting a different way to implement it
I don't see how to do offhand which would be equivalent but using
fewer #ifdefs. That would be completely innocuous, and in combination
with a more general set of wrappers, easy to endorse.


On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> If one needs an explicit call by reference to avoid passing a multi-word
> object, one should ask the user to pass an explicit pointer, to a
> routine or macro that expects a pointer to a non-const, not an apparent
> value.  Shouldn't try to hide the reference semantics behind quasi-const
> labels.

It's not quasi-const, it is const. If you need a caller to do destructive
updates on your behalf, you use a pointer in both large and small cases.
You may use explicit unwrapped const call by reference, but this forces
the overhead of indirection on smaller systems.

The advantage the type wrapper has is that when NR_CPUS is small and
things fit in one or few machine words, it can be automatically
switched back to call by value with zero source changes by just picking
a threshold in a #ifdef, avoiding the indirection for smaller systems.

Of course, if the second of the two possibilities from the previously
replied-to paragraph holds, you're keeping all the operational semantics
that small systems need anyway so it may not have been necessary to
reiterate all that.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  8:02             ` Paul Jackson
@ 2004-03-20 11:13               ` William Lee Irwin III
  2004-03-21  4:19                 ` Paul Jackson
  2004-03-23  1:12                 ` Paul Jackson
  0 siblings, 2 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-20 11:13 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> Other ways to reduce cpumask copies:
>  1) Additional macros, for example a boolean cpus_intersect(x,y), which
>     is true iff x overlaps y, and saves the tmp cpumask in the code:
>         cpus_and(tmp, target_cpu_mask, allowed_mask);
>         if (!cpus_empty(tmp)) {
>     or a cpus_subset(x,y) "is x a subset of y" macro to replace this:
>         cpus_and(tmp, cpumask, cpu_callout_map);
>         BUG_ON(!cpus_equal(cpumask, tmp));
>    with this:
> 	BUG_ON(!cpus_subset(cpumask, cpu_callout_map));

That probably wouldn't be tough to get merged.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
>  2) Using existing macros more carefully, for example using cpu_set() to
>     set the bit in the following, and saving the copy by assignment:
> 	pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu);
>     or using the existing cpu_isset() macro, replacing the code (seen
>     in part above ;):
>         cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
>         target_cpu_mask = cpumask_of_cpu(min_loaded);
>         cpus_and(tmp, target_cpu_mask, allowed_mask);
>         if (!cpus_empty(tmp)) {
>     with the code:
> 	if (cpu_isset(min_loaded, cpu_online_map) && cpu_isset(min_loaded, irq_affinity[selected_irq]) {
> The current nested and ifdef'd complexity of the cpumask macro
> headers works against such efficient coding - it's non-trivial to see
> just what macros are available. The youngins reading this may not
> appreciate the value of reducing brain load; but the oldins might.

It was painful enough just to preserve semantics across such a far-
ranging set of changes. Ideally, yes, I would have done this in the
first place.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> 2) Pass a cpumask pointer to a function that generates a cpumask
> instead of taking one as a return value, letting the called function
> fill in the memory so referenced.  We should not be trying to hide
> such details, unless we can do so seamlessly and consistent with
> traditional 'C' semantics.

This is a generally sane thing to do.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
>  3) Passing a const cpumask pointer to a function that examines a cpumask
>     instead of passing the cpumask by value (on small systems, its one word
>     either way - on big systems, it saves copying a multiword mask on the
>     stack).

IIRC the overhead of indirection on smaller systems was regarded as
unacceptable for these cases, which is what motivated the wrapper type.
If you can get the naked versions merged, more power to you.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
>  4) Throwing away dead code:
> 	static int physical_balance = 0;
>         cpumask_t tmp;
> 	cpus_shift_right(tmp, cpu_online_map, 2);
> 	if (smp_num_siblings > 1 && !cpus_empty(tmp)
> 		physical_balance = 1;
> The above code fragments are from arch/i386 in 2.6.3-rc1-mm1.

[warning! long-winded response follows]

This isn't quite dead; physical_balance isn't a local. it's a state
variable static to io_apic.c and it determines the behavior later after
boot. Frankly, this code fragment is inexplicably bizarre and I've
never seen the code in action, as I have something very closely
approximating zero access (or less) to the systems that use it. In
general, this means you're stuck being very literal about its semantics
until the situation is clarified. This used to be something like

if (smp_num_siblings > 1 && (cpu_online_map >> 2))
	physical_balance = 1;

which defied my attempts to reverse-engineer its true meaning in terms
of cpumasks. In terms of HT, it "wants to do":

if (logical_cpus_per_physical_cpu() > 1 && nr_physical_cpus() > 1)
	set_state_variable_to_balance_by_physical_cpus();

I'm not wild about this kind of thing and would rather not codify
breakage on mixed cpu revision systems any more so than its already
done. The semantics also differ slightly if/when cpu_online_map is
arranged in varying ways, which obscured the issue and what ultimately
stifled the otherwise strong urge to improve it.

A potential "cleanup" that occurred to me is:

if (cpus_weight(cpu_online_map) > smp_num_siblings)
	physical_balance = 1;

which isn't really "good enough" either, but the best that can be done
with the now-extant machine descriptions. I would rather leave the
thing as-is until something equivalent to the following can be done
and the rest of arch/i386 swept to properly deal with HT in general:

if (nr_logical_cpus() > nr_physical_cpus() && nr_physical_cpus() > 1)
	physical_balance = 1;

but this is not how the machine descriptions are arranged now, and I
personally have neither the access to machines nor time allotted to
implement the changes required to fix this code properly. Other,
similar things hold for other bizarre cases where there's fear of
general fragility and bizarre bit twiddling forms of checks that would
vary in semantics from the "obviously cleaner" variants that can't be
cleaned up willy-nilly without the backing to get at the systems to
verify the changes. The only real exceptions to this are when things
are broken as-is, and what you have is an immediate and relatively
localized fix.

It's really unclear that the ia32 APIC/SMP disaster can ever feasibly
be cleaned up, as the code already landed there, and churning it just
raises the spectre of version skew, a rather long, dreary fight to get
the whole of the corrections merged, and the still more terrifying
possibility of incomplete merges leaving various out-of-synch codebases
(e.g. distros) in broken states. Let it serve as an example as to why
things should be done right the first time.

In conclusion, I converted a lot of the i386 arch code very literally
for a good reason. I wish us all the best of noseplugs as we're all
stuck holding our noses while the dead woodchuck under the porch stinks
up the kernel in perpetuity.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20 11:13               ` William Lee Irwin III
@ 2004-03-21  4:19                 ` Paul Jackson
  2004-03-21  4:36                   ` William Lee Irwin III
  2004-03-23  1:12                 ` Paul Jackson
  1 sibling, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-21  4:19 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> This isn't quite dead; physical_balance isn't a local. it's a state
> variable static to io_apic.c and it determines the behavior later after
> boot.

Find by me if folks have their dirty laundry.  There are limits to my
powers to set things right.

Sorry to have provoked your length explanation of physical_balance, but
in the version of the kernel that I happened to do my research on,
2.6.3-rc1-mm1, this is _dead_ code.  The variable physical_balance is
never read, just written, and only appears on 3 lines total.

Obviously if it is in use in current versions of the kernel, then it's
not dead code anymore (at least not without a more profound
understanding of what's going on, which I make no claims to).

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-21  4:19                 ` Paul Jackson
@ 2004-03-21  4:36                   ` William Lee Irwin III
  2004-03-21  7:59                     ` Nick Piggin
  0 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-21  4:36 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Sat, Mar 20, 2004 at 08:19:54PM -0800, Paul Jackson wrote:
> Find by me if folks have their dirty laundry.  There are limits to my
> powers to set things right.
> Sorry to have provoked your length explanation of physical_balance, but
> in the version of the kernel that I happened to do my research on,
> 2.6.3-rc1-mm1, this is _dead_ code.  The variable physical_balance is
> never read, just written, and only appears on 3 lines total.
> Obviously if it is in use in current versions of the kernel, then it's
> not dead code anymore (at least not without a more profound
> understanding of what's going on, which I make no claims to).

There's probably something in -mm reducing its use that I haven't
looked at; the digression there was based on mainline.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-21  4:36                   ` William Lee Irwin III
@ 2004-03-21  7:59                     ` Nick Piggin
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Piggin @ 2004-03-21  7:59 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Paul Jackson, colpatch, linux-kernel, mbligh, akpm, haveblue, hch



William Lee Irwin III wrote:

>On Sat, Mar 20, 2004 at 08:19:54PM -0800, Paul Jackson wrote:
>
>>Find by me if folks have their dirty laundry.  There are limits to my
>>powers to set things right.
>>Sorry to have provoked your length explanation of physical_balance, but
>>in the version of the kernel that I happened to do my research on,
>>2.6.3-rc1-mm1, this is _dead_ code.  The variable physical_balance is
>>never read, just written, and only appears on 3 lines total.
>>Obviously if it is in use in current versions of the kernel, then it's
>>not dead code anymore (at least not without a more profound
>>understanding of what's going on, which I make no claims to).
>>
>
>There's probably something in -mm reducing its use that I haven't
>looked at; the digression there was based on mainline.
>
>

I think it is my patch that makes cpu_sibling_map a cpumask.

You don't need a special case for num_siblings == 2 anymore.
I forgot to clean up the last trace of physical_balance.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  9:36               ` William Lee Irwin III
@ 2004-03-22 23:59                 ` Paul Jackson
  2004-03-23  2:12                   ` William Lee Irwin III
  2004-03-23  1:21                 ` Paul Jackson
  2004-03-23  1:24                 ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-22 23:59 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> It's not quasi-const, it is const.

	Const cpumasks are not your Father's const references.
	======================================================

True, the current cpumask_const_t API makes essential use of the 'C'
const construct and semantics.

But it's not simply and entirely the usual const usage as seen in the
signature:

  char *strncpy(char *dest, const char *src, size_t n);

In this usual case, declaring something referenced by an argument
pointer as 'const' means the called function won't change the referenced
data.  For simple values such as the 'size_t n', it's not useful,
because these are simply pass by value, so changes by the called
function won't be visible to the caller anyway.  But when passing in
pointers, it has become useful in 'C' to indicate that the called
function won't mess with the pointed to data.

That's the usual use of the C keyword 'const', as I am sure you know
well.

You aren't simply marking parameters with the C 'const' keyword; you are
also labeling some functions and a cpumask typedef as const, embedding
the letters "_const" in the names.  Part of the cpumask_const_t
implementation uses the real C keyword 'const', but it's not simply the
classic usage shown above.

Each time I try to work with this, I have to scratch my head and think
a few minutes first.

Is the following a fair statement of this interface:

 1) Declaring a cpumask as type 'cpumask_const_t' means it might be passed
    to various cpumask '*_const()' methods either by value or by a const
    pointer, depending on the implementation (the size of the mask,
    presumably).

 2) The various cpumask '*_const()' methods expect to be passed such
    cpumask_const_t masks, and may actually pass by value or by const
    reference, at the discretion of the implementation, again.

So if I am not too confused, this non-C keyword embedded '_const' string
means something like "optionally pass masks by value or const pointer,
at the discretion of the implementation."

This is an abuse (or at the very least an extension) of our customary
use of the 'const' construct of the 'C' language, in my view; an abuse
with which users will never become comfortable.

You castigate yourself for not having converted more cpumask
manipulations to making full use of this quasi-const mechanism.

That way does not lay closure and sweet success.  You have provided a
weird API that will never become widely and comfortably used by others;
you will always find your available energies for putting it in use
yourself to be insufficient; you will be pushing this rock up the hill
throughout your years in purgatory (hopefully fewer years than I expect
to spend there ;).

I do not fault you for failing to put this interface to more wide
spread usage.  I do fault the interface for being a bit too weird
to obtain wide spread usage on its own, after it's original creation
and introduction.

Could you (Bill or any lurker) provide any specific examples of generic
code where it is important to pass by value on some architectures, but
by const reference on some other architectures?  Rather than distort the
API in general for such cases, I'd prefer to consider more narrowly
focused solutions that address such specific needs.  In general, I would
hope to be able to arrive at a cpumask (or even more generic mask as
multi-word bit mask) ADT that was always clear and explicit, using just
traditional 'C' idioms, as to whether arguments were pass by value or
reference, and which arguments were const reference or not-const.  If
some arch's (say sparc64) have arch-specific code that explicitly passes
a pointer to a cpumask, where similar code in some other arch passes by
value, that's fine, and an appropriate arch-specific optimization.  The
only challenges come in generic code for which arch's cannot agree on
any one form for passing a particular mask.  Examples anyone ... ?

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20 11:13               ` William Lee Irwin III
  2004-03-21  4:19                 ` Paul Jackson
@ 2004-03-23  1:12                 ` Paul Jackson
  2004-03-23  2:09                   ` William Lee Irwin III
  1 sibling, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  1:12 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> > The current nested and ifdef'd complexity of the cpumask macro
> > headers works against such efficient coding - it's non-trivial to see
> > just what macros are available. The youngins reading this may not
> > appreciate the value of reducing brain load; but the oldins might.
> 
> It was painful enough just to preserve semantics across such a far-
> ranging set of changes. Ideally, yes, I would have done this in the
> first place.

I'm not trying to get on your case, Bill, for not creating and applying
more various cpumask functions.  Rather I am looking for ways to make
that API easier for others to use and use well.  If the situations that
passed about temporary cpumasks can be collapsed into single calls that
are more efficient, then that is one way to make progress on this.

Taking masks to be a struct of an array of unsigned longs seems to come
pretty close.  The sparc64 arch would want to pass a pointer to this
apparently, rather than the struct itself - faster for them. Some other
smaller archs would _not_ want to pass a pointer, but rather the (one
word, for them) value - avoids a dereference for them.  In arch specific
code, each arch can choose which way works best for them.  I need to
identify any generic code where these preferences collide.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  9:36               ` William Lee Irwin III
  2004-03-22 23:59                 ` Paul Jackson
@ 2004-03-23  1:21                 ` Paul Jackson
  2004-03-23  2:10                   ` William Lee Irwin III
  2004-03-23  1:24                 ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  1:21 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> There is one issue and one issue only: the
> larger the "instruction set" grows, the more intrusive and complex-
> looking the thing appears, the more of quagmire merging it becomes.

Providing the additional mask "instructions" shouldn't create any quagmire.

It's the using of them that is intrusive.

Initially, some intrusive work, such as you (Bill) did was needed to get
masks implanted.  But now it should be appropriate to simply provide the
alternative calls that can make certain code sequences more efficient,
and then if someone complains that their old code sequence is too slow
or uses too much stack, we can recommend alternative code sequences that
will work better for them.

Passing the buck, division of labour and all that ...

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  9:36               ` William Lee Irwin III
  2004-03-22 23:59                 ` Paul Jackson
  2004-03-23  1:21                 ` Paul Jackson
@ 2004-03-23  1:24                 ` Paul Jackson
  2 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  1:24 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> I think it's vaguely fair to ask
> those who need it to propagate its use around themselves.

I think it's entirely fair to ask that.

Once the API is easy to grok.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  1:12                 ` Paul Jackson
@ 2004-03-23  2:09                   ` William Lee Irwin III
  2004-03-23  2:39                     ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-23  2:09 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Mon, Mar 22, 2004 at 05:12:43PM -0800, Paul Jackson wrote:
> I'm not trying to get on your case, Bill, for not creating and applying
> more various cpumask functions.  Rather I am looking for ways to make
> that API easier for others to use and use well.  If the situations that
> passed about temporary cpumasks can be collapsed into single calls that
> are more efficient, then that is one way to make progress on this.
> Taking masks to be a struct of an array of unsigned longs seems to come
> pretty close.  The sparc64 arch would want to pass a pointer to this
> apparently, rather than the struct itself - faster for them. Some other
> smaller archs would _not_ want to pass a pointer, but rather the (one
> word, for them) value - avoids a dereference for them.  In arch specific
> code, each arch can choose which way works best for them.  I need to
> identify any generic code where these preferences collide.

I generally like the idea of the arches getting their choice here (heck,
even wrt. representation; e.g. some arch might want an array of cpuid
numbers and not a bitmap at all due to extremely sparse cpuid's or some
such nonsense). The asm-generic stuff was largely a question of
reducing diffsize, preemptive code consolidation, etc.

I don't believe normal C (i.e. sans typedef) will allow needed
ambiguities that make UP/small SMP/etc. compile things out nicely, but
if you can get the requirement of the stuff totally compiling out
dropped or do it in normal C somehow, go for it. I'd call it a cleanup.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  1:21                 ` Paul Jackson
@ 2004-03-23  2:10                   ` William Lee Irwin III
  0 siblings, 0 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-23  2:10 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Mon, Mar 22, 2004 at 05:21:34PM -0800, Paul Jackson wrote:
> Providing the additional mask "instructions" shouldn't create any quagmire.
> It's the using of them that is intrusive.
> Initially, some intrusive work, such as you (Bill) did was needed to get
> masks implanted.  But now it should be appropriate to simply provide the
> alternative calls that can make certain code sequences more efficient,
> and then if someone complains that their old code sequence is too slow
> or uses too much stack, we can recommend alternative code sequences that
> will work better for them.
> Passing the buck, division of labour and all that ...

Higher-level constructs that improve runtime efficiency may also be
good cleanups that can sometimes fix bugs and/or generally consolidate
code. e.g. cpus_subset(x, y). I think those kinds of things should be
perfectly mergeable.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-22 23:59                 ` Paul Jackson
@ 2004-03-23  2:12                   ` William Lee Irwin III
  0 siblings, 0 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-23  2:12 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Mon, Mar 22, 2004 at 03:59:52PM -0800, Paul Jackson wrote:
> Could you (Bill or any lurker) provide any specific examples of generic
> code where it is important to pass by value on some architectures, but
> by const reference on some other architectures?  Rather than distort the
> API in general for such cases, I'd prefer to consider more narrowly
> focused solutions that address such specific needs.  In general, I would
> hope to be able to arrive at a cpumask (or even more generic mask as
> multi-word bit mask) ADT that was always clear and explicit, using just
> traditional 'C' idioms, as to whether arguments were pass by value or
> reference, and which arguments were const reference or not-const.  If
> some arch's (say sparc64) have arch-specific code that explicitly passes
> a pointer to a cpumask, where similar code in some other arch passes by
> value, that's fine, and an appropriate arch-specific optimization.  The
> only challenges come in generic code for which arch's cannot agree on
> any one form for passing a particular mask.  Examples anyone ... ?

I don't have examples of such, no. The requirement was imposed on me for
basically the reason that what cpu-related API's preceded cpumask_t
compiled out to nops on UP and no indirection on small SMP. If you've
got such an implementation that compiles out to nops on UP etc., or can
get the requirement(s) dropped, I'm all ears.

-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  2:09                   ` William Lee Irwin III
@ 2004-03-23  2:39                     ` Paul Jackson
  2004-03-23  3:13                       ` William Lee Irwin III
  0 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  2:39 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> go for it

I'll try.

The main thing I keep seeing, when I look at the resulting machine
code, is that things such as the following (stripped of everything
except the bare bones hardwired stuff I needed to compile):

    #define BITS_TO_LONGS(i) (i+63)/64
    #define NR_CPUS 64
    #define __mask(bits)    struct { unsigned long _m[BITS_TO_LONGS(bits)]; }
    typedef __mask(NR_CPUS) cpumask_t;

    #define mask_and(d,s1,s2)                                       \
    do {                                                            \
	    int i;                                                  \
	    for (i = 0; i < sizeof(d)/sizeof(unsigned long); i++)   \
		    d._m[i] = s1._m[i] & s2._m[i];                  \
    } while(0)

    unsigned long f(cpumask_t c, cpumask_t d, cpumask_t e)
    {
	mask_and(c,d,e);
	return c._m[0];
    }


end up producing quite fine code, such as a single inline 64 bit and
instruction, with no evidence of the for loop, array or struct wrapper,
on an ia64 (gcc 3.2.3 -O2) for the interesting guts of my silly little
test function f().

Objdump -d of function f() in above:

    4000000000000610 <f>:
    4000000000000610:       0d 60 c0 19 3f 23       [MFI]       adds r12=-16,r12
    4000000000000616:       00 00 00 02 00 00                   nop.f 0x0
    400000000000061c:       21 0a 31 80                         and r8=r34,r33;;
    4000000000000620:       11 00 00 00 01 00       [MIB]       nop.m 0x0
    4000000000000626:       c0 80 30 00 42 80                   adds r12=16,r12
    400000000000062c:       08 00 84 00                         br.ret.sptk.many b0;;

>From this I conjecture that I can provide a single call:

    cpumask_and(cpumask_t d, cpumask_t s1, cpumask_t s2);

that works on both normal (1 to 32 cpu) systems and on big iron systems,
with traditional 'C' pass by value semantics, all derived from a single
mask type that works for both node and cpu masks.

The one sticky point evident to me so far would be if some generic code
were passing a cpumask_t across a function call boundary, and needed to
be optimum for both small and sparc64 - one would want to pass by value,
the other would want to pass a pointer to the cpumask.

This is not your fathers 'C'.  The compile time inlining and
optimization provided by gcc enables it to do a lot more than Dennis
Ritchie's original C compiler that I learned on.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  2:39                     ` Paul Jackson
@ 2004-03-23  3:13                       ` William Lee Irwin III
  2004-03-23  3:36                         ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-23  3:13 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

On Mon, Mar 22, 2004 at 06:39:18PM -0800, Paul Jackson wrote:
> From this I conjecture that I can provide a single call:
>     cpumask_and(cpumask_t d, cpumask_t s1, cpumask_t s2);
> that works on both normal (1 to 32 cpu) systems and on big iron systems,
> with traditional 'C' pass by value semantics, all derived from a single
> mask type that works for both node and cpu masks.
> The one sticky point evident to me so far would be if some generic code
> were passing a cpumask_t across a function call boundary, and needed to
> be optimum for both small and sparc64 - one would want to pass by value,
> the other would want to pass a pointer to the cpumask.
> This is not your fathers 'C'.  The compile time inlining and
> optimization provided by gcc enables it to do a lot more than Dennis
> Ritchie's original C compiler that I learned on.

gcc flat out miscompiled such inlines last I checked (Zwane shipped the
bugreport IIRC). Either this kind of good behavior is not universally
observable or a miracle occurred and gcc's codegen went from incorrect
to 1980's (fscking patents).

Anyhow, this was also an observation of the code effectively made in
isolation; uninlining and other catastrophes do happen.

If people really thinks this works and/or don't care when it doesn't,
go for it. Last time I heard they did; who knows, the answer may be
different this time.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  3:13                       ` William Lee Irwin III
@ 2004-03-23  3:36                         ` Paul Jackson
  2004-03-23  3:59                           ` William Lee Irwin III
  0 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  3:36 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> gcc flat out miscompiled such inlines last I checked (Zwane shipped the
> bugreport IIRC).

The one error I've heard tell of recently is one that Andi Kleen hit in
include/linux/bitmap.h.  He had to compute the copy length explicitly in
a separate variable, or gcc forgot to do it (I forget the details).  The
change is:


 static inline void bitmap_copy(unsigned long *dst,
                        const unsigned long *src, int bits)
 {
-       memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
+       int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
+       memcpy(dst, src, len);
 }


Do you have any more pointers/info on the miscompile you quote above?
Like - how long ago - what gcc version ...

It would be an insult to wrap the entire cpumask design around the
axle of such non-specific allegations of gcc misconduct.  Better to
get the API right, and then deal with specific tool bugs as necessary.

Or, as Linus might say (did say, in a quite different context):

  Never _ever_ make your source-code look worse because your tools suck. Fix
  the tools instead.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  3:36                         ` Paul Jackson
@ 2004-03-23  3:59                           ` William Lee Irwin III
  2004-03-23  4:03                             ` Paul Jackson
       [not found]                             ` <20040325012457.51f708c7.pj@sgi.com>
  0 siblings, 2 replies; 56+ messages in thread
From: William Lee Irwin III @ 2004-03-23  3:59 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

At some point in the past, I wrote:
>> gcc flat out miscompiled such inlines last I checked (Zwane shipped the
>> bugreport IIRC).

On Mon, Mar 22, 2004 at 07:36:28PM -0800, Paul Jackson wrote:
> The one error I've heard tell of recently is one that Andi Kleen hit in
> include/linux/bitmap.h.  He had to compute the copy length explicitly in
> a separate variable, or gcc forgot to do it (I forget the details).  The
> change is:
>  static inline void bitmap_copy(unsigned long *dst,
>                         const unsigned long *src, int bits)
>  {
> -       memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
> +       int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
> +       memcpy(dst, src, len);
>  }
> Do you have any more pointers/info on the miscompile you quote above?
> Like - how long ago - what gcc version ...

It dates back to before the merge. It should have been posted to lkml
by Zwane.


On Mon, Mar 22, 2004 at 07:36:28PM -0800, Paul Jackson wrote:
> It would be an insult to wrap the entire cpumask design around the
> axle of such non-specific allegations of gcc misconduct.  Better to
> get the API right, and then deal with specific tool bugs as necessary.
> Or, as Linus might say (did say, in a quite different context):
>   Never _ever_ make your source-code look worse because your tools suck. Fix
>   the tools instead.

I can't answer this directly. Basically, if ppl are merging workarounds
for ancient compilers, my expectation is that operational semantics
specific to newer compiler versions (or specific architectures) can't
be assumed. I'm not the one making the calls as to whether the
requirements (compiling out completely on UP, compiling to no indirection
on small SMP) are relaxed or if "new enough compiler" meets them.


-- wli

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-23  3:59                           ` William Lee Irwin III
@ 2004-03-23  4:03                             ` Paul Jackson
       [not found]                             ` <20040325012457.51f708c7.pj@sgi.com>
  1 sibling, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-23  4:03 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch

> It dates back to before the merge.

what merge?

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
       [not found]                               ` <20040325101827.GO791@holomorphy.com>
@ 2004-03-26 22:36                                 ` Paul Jackson
  2004-03-26 22:54                                   ` David S. Miller
  0 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-26 22:36 UTC (permalink / raw)
  To: David S. Miller
  Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch,
	William Lee Irwin III

David,

I understand that you are the Sparc64 expert.  I am doing some more
work refining the cpumask_t (and nodemask_t) implementation, in an
effort to obtain a common "mask_t" underlying ADT that can be useful
for both cpu and node masks.

As part of this, I'm looking at having only one data representation
for these masks - an array of unsigned longs encapsulated inside a
struct.  For the processors and compilation environment that I am
most focused on - gcc 3.2 and above on Intel CPUs, this generates
just as good code as the other variants.

However, as I recall, you recommend against passing structs on the
Sparc64 processor and compilation environment.  Am I recalling
correctly?

The two places that I see sparc64 passing a cpumask_t across a real
function call boundary (as opposed to an inline function) is in the file
arch/sparc64/kernel/smp.c, by the functions cheetah_xcall_deliver() and
smp_cross_call_masked().

Could you describe to me the requirements here.

 1) Are the above the two places of interest, currently, for sparc64 work?

 2) What sizes of cpumask_t are needed?

 3) Which of passing by pointer, passing as a struct, and/or passing by
    simple value, are required or acceptable or unacceptable?

 4) Does your answer to (3) vary with the sizes needed in (2)?

 5) The cpu_clear(i, mask) on about line 534 of smp.c confuses me, as it
    seems to be changing a local copy of 'mask' that no one will examine
    later.  What purpose does it serve?  See this line annotated with
    "No affect??" in the changes, below.

 6) I'm not done yet, but so far I have the following changes in my
    workarea for sparc64 (along with larger changes elsewhere).  These
    changes are untested, unbuilt, incomplete and unreviewed.  They
    were generated on a 2.6.4 base, with Matthew Dobson's nodemask_t
    patch of this thread applied.  Do you see anything that looks
    wrong in the following so far?

===== smp.c 1.66 vs edited =====
--- 1.66/arch/sparc64/kernel/smp.c      Tue Feb 24 16:11:02 2004
+++ edited/smp.c        Fri Mar 26 14:01:23 2004
@@ -409,14 +409,8 @@
        int i;

        __asm__ __volatile__("rdpr %%pstate, %0" : "=r" (pstate));
-       for (i = 0; i < NR_CPUS; i++) {
-               if (cpu_isset(i, mask)) {
-                       spitfire_xcall_helper(data0, data1, data2, pstate, i);
-                       cpu_clear(i, mask);
-                       if (cpus_empty(mask))
-                               break;
-               }
-       }
+       for_each_cpu_mask(i, mask)
+               spitfire_xcall_helper(data0, data1, data2, pstate, i);
 }

 /* Cheetah now allows to send the whole 64-bytes of data in the interrupt
@@ -459,25 +453,19 @@

        nack_busy_id = 0;
        {
-               cpumask_t work_mask = mask;
                int i;

-               for (i = 0; i < NR_CPUS; i++) {
-                       if (cpu_isset(i, work_mask)) {
-                               u64 target = (i << 14) | 0x70;
-
-                               if (!is_jalapeno)
-                                       target |= (nack_busy_id << 24);
-                               __asm__ __volatile__(
-                                       "stxa   %%g0, [%0] %1\n\t"
-                                       "membar #Sync\n\t"
-                                       : /* no outputs */
-                                       : "r" (target), "i" (ASI_INTR_W));
-                               nack_busy_id++;
-                               cpu_clear(i, work_mask);
-                               if (cpus_empty(work_mask))
-                                       break;
-                       }
+               for_each_cpu_mask(i, mask) {
+                       u64 target = (i << 14) | 0x70;
+
+                       if (!is_jalapeno)
+                               target |= (nack_busy_id << 24);
+                       __asm__ __volatile__(
+                               "stxa   %%g0, [%0] %1\n\t"
+                               "membar #Sync\n\t"
+                               : /* no outputs */
+                               : "r" (target), "i" (ASI_INTR_W));
+                       nack_busy_id++;
                }
        }

@@ -521,6 +509,8 @@
                        /* Clear out the mask bits for cpus which did not
                         * NACK us.
                         */
+                       /* Could replace with for_each_cpu_mask(i, mask) loop,
+                        * were it not for the cpu_clear(i, mask) below. -pj */
                        for (i = 0; i < NR_CPUS; i++) {
                                if (cpu_isset(i, work_mask)) {
                                        u64 check_mask;
@@ -531,7 +521,7 @@
                                                check_mask = (0x2UL <<
                                                              this_busy_nack);
                                        if ((dispatch_stat & check_mask) == 0)
-                                               cpu_clear(i, mask);
+                                               cpu_clear(i, mask); /* No affect?? -pj */
                                        this_busy_nack += 2;
                                        cpu_clear(i, work_mask);
                                        if (cpus_empty(work_mask))


Thank-you for your considerations.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-26 22:36                                 ` Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT) Paul Jackson
@ 2004-03-26 22:54                                   ` David S. Miller
  2004-03-26 23:18                                     ` Paul Jackson
                                                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: David S. Miller @ 2004-03-26 22:54 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

On Fri, 26 Mar 2004 14:36:48 -0800
Paul Jackson <pj@sgi.com> wrote:

> However, as I recall, you recommend against passing structs on the
> Sparc64 processor and compilation environment.  Am I recalling
> correctly?

It's a problem moreso on sparc32.  Structures there have to be passed
"by reference", which means it gets copied onto the stack and then a
pointer to that data area is passed as the actual parameter.

Sparc64 does the same but only _iff_ the structure does not fit entirely
in the remaining argument registers or the structures size is greater
than 32-bytes.

The cross-call functions you reference are slow path code.  And whats
more currently none of that code supports more than 64 cpus, although
I'll have to add support for larger numbers later.

>  5) The cpu_clear(i, mask) on about line 534 of smp.c confuses me, as it
>     seems to be changing a local copy of 'mask' that no one will examine
>     later.  What purpose does it serve?  See this line annotated with
>     "No affect??" in the changes, below.

No, mask is the top level mask value, we assign it to local variable
in order to do this or that operation on it.  Once we've really send
the XCALL message to the processor, we clear it in 'mask' only then.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-26 22:54                                   ` David S. Miller
@ 2004-03-26 23:18                                     ` Paul Jackson
  2004-03-26 23:29                                     ` Paul Jackson
  2004-03-26 23:37                                     ` Paul Jackson
  2 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-26 23:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

> >  5) The cpu_clear(i, mask) on about line 534 of smp.c confuses me, as it
> >     seems to be changing a local copy of 'mask' that no one will examine
> >     later.  What purpose does it serve?  See this line annotated with
> >     "No affect??" in the changes, below.
> 
> No, mask is the top level mask value, we assign it to local variable
> in order to do this or that operation on it.  Once we've really send
> the XCALL message to the processor, we clear it in 'mask' only then.

Yes - I see where the input argument 'mask' is assigned to 'work_mask',
and then 'work_mask' is manipulated during the loops.

But inside the last loop of cheetah_xcall_deliver(), after 'mask' has
been read the last time to initialize 'work_mask', it is repeatedly
updated with the lines:

					if ((dispatch_stat & check_mask) == 0)
						cpu_clear(i, mask);

Why, why?  Can't these two lines just be _removed_?

What earthly purpose does that 'cpu_clear(i, mask)' have?

That copy of 'mask' will never be used again, at least not that I can see.

There's even a comment a few lines above, seemingly related to this:

			/* Clear out the mask bits for cpus which did not
			 * NACK us.
			 */

But that doesn't address the "why".

And then, if these two lines can be removed for lack of affect, the six
lines just before, that setup 'check_mask', can go too, also for lack of
any remaining affect.

That would be eight lines of dead code, with three lines of commentary.

Sorry - I must be missing something somewhere.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-26 22:54                                   ` David S. Miller
  2004-03-26 23:18                                     ` Paul Jackson
@ 2004-03-26 23:29                                     ` Paul Jackson
  2004-03-27  0:08                                       ` David S. Miller
  2004-03-26 23:37                                     ` Paul Jackson
  2 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-26 23:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

> It's a problem moreso on sparc32.

Ah good.  Then, since arch/sparc and include/asm-sparc make no use
of cpumask_t, can I therefore conclude that you don't care whether
it's really a struct, or not?

That would be good news, if so.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-26 22:54                                   ` David S. Miller
  2004-03-26 23:18                                     ` Paul Jackson
  2004-03-26 23:29                                     ` Paul Jackson
@ 2004-03-26 23:37                                     ` Paul Jackson
  2 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-26 23:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

> That copy of 'mask' will never be used again, at least not that I can see.

Aha - you're right.  I missed the 'goto retry'.  The 'mask' is used again,
next time around the body of that function.

Mea culpa.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-26 23:29                                     ` Paul Jackson
@ 2004-03-27  0:08                                       ` David S. Miller
  2004-03-27  0:50                                         ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: David S. Miller @ 2004-03-27  0:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

On Fri, 26 Mar 2004 15:29:27 -0800
Paul Jackson <pj@sgi.com> wrote:

> > It's a problem moreso on sparc32.
> 
> Ah good.  Then, since arch/sparc and include/asm-sparc make no use
> of cpumask_t, can I therefore conclude that you don't care whether
> it's really a struct, or not?
> 
> That would be good news, if so.

SMP on sparc32 is totally not working yet in 2.6.x, that is why
there are no cpumask_t references :-)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT)
  2004-03-27  0:08                                       ` David S. Miller
@ 2004-03-27  0:50                                         ` Paul Jackson
  0 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-27  0:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: colpatch, linux-kernel, mbligh, akpm, haveblue, hch, wli

> that is why there are no cpumask_t references :-)

Ok - thanks.  Guess I won't worry about it too much for now.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-20  0:47         ` Paul Jackson
@ 2004-03-20  1:14           ` Paul Jackson
  0 siblings, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-20  1:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: colpatch, ak, linux-kernel

>                bitmask_and(d._m, s1._m, s2._m);        \

Oops - need the size argument to bitmask_and:

		bitmask_and(d._m, s1._m, s2._m, sizeof(d)*8);        \

(I only tested the 1 word case earlier ...)

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19 23:09       ` Matthew Dobson
@ 2004-03-20  0:47         ` Paul Jackson
  2004-03-20  1:14           ` Paul Jackson
  0 siblings, 1 reply; 56+ messages in thread
From: Paul Jackson @ 2004-03-20  0:47 UTC (permalink / raw)
  To: colpatch; +Cc: ak, linux-kernel

> For the case of a single unsigned long, the bitmap operations aren't
> as efficient as just doing a 
> 	res = (mask1 & mask2);
> vs.
> 	bitmask_and(&res, &mask1, &mask2);

This is probably the same as Matt is thinking, but what I imagine has
the inline efficiency (a machine instruction or two) of the openly coded
first example above, but is always written more in the style of the
second example above.

The 'mask_and()' macro can be an inline or define that collapses in the
one word case (by far the most common) to the same machine instruction
or two as the open code does.  We _do_ know, at compile time, the
_exact_ size of any given instance of these, so gcc has sufficient
information to collapse the inline code, if we present it right, to the
legitimate minimum.

For example, mask_and might be:

=========================================================================
#define __mask(bits)    struct { unsigned long _m[BITS_TO_LONGS(bits)]; }
typedef __mask(NR_CPUS) cpumask_t;

#define mask_and(d,s1,s2)                               \
do {                                                    \
        if (sizeof(d) == sizeof(unsigned long))         \
                d._m[0] = s1._m[0] & s2._m[0];          \
        else                                            \
                bitmask_and(d._m, s1._m, s2._m);        \
} while(0)
=========================================================================

With this, mask_and() on some cpumask_t's, when NR_CPUS is 64,  _does_
collapse to a simple:

        and r32 = r8, r32

on my ia64 with gcc 3.2.3.

The mask.h I appended yesterday uses bitmap calls such as bitmap_and()
unconditionally (even in the one word case).  This is probably worth
fixing, in the manner shown above, at least for such cases such as
'and', 'or', 'empty', 'complement', 'shift_left' and 'shift_right'.

No use of multiple architecturally chosen implementations, such as we
do now with cpumask, is required.  Just a little bit of special case
code for the one-word case, in 6 or 8 of the macros.

I still think, that with just a few lines of compile time logic such as
above, we can have a single implementation that is pretty close to
ideal, for almost all architectures, using a struct holding an array of
unsigned longs.

The one obvious exception, sparc64, should have its own arch specific
implementation under include/asm-sparc64 and perhaps arch/sparc64.  It
should not be embedded in a thicket of pseudo-generic ifdefs and
conditional includes as part of the apparently generic implementation.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  2:04     ` [PATCH] Introduce nodemask_t ADT [0/7] Andi Kleen
  2004-03-19  2:38       ` Paul Jackson
@ 2004-03-19 23:09       ` Matthew Dobson
  2004-03-20  0:47         ` Paul Jackson
  1 sibling, 1 reply; 56+ messages in thread
From: Matthew Dobson @ 2004-03-19 23:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, pj

On Thu, 2004-03-18 at 18:04, Andi Kleen wrote:
> Matthew Dobson <colpatch@us.ibm.com> writes:
> 
> >> Chris Hellwig responded to it at the time asking why I didn't provide a
> >> single generic mask ADT, and make cpumask and nodemask instances of
> >> that.
> >
> > That is a better idea, if it can be made to work.  My goal is to stop
> 
> It already exists in linux/bitmap.h. I use that in NUMA API for the node masks.
> 
> It's just a bit ugly to write because you have to always pass MAX_NUMNODES.
> Some wrappers would be prettier.
> 
> -Andi

For MAX_NUMNODES > BITS_PER_LONG, thats just what these are.  There are
just built-in optimizations for the single bit (UP for cpumask, non-NUMA
for nodemask) case and the single unsigned long case.  For the case of a
single unsigned long, the bitmap operations aren't as efficient as just
doing a 
	res = (mask1 & mask2);
vs. 
	bitmask_and(&res, &mask1, &mask2);

Maybe I'm overly concerned about the speed of these ops at the expense
of code size.  If that's the consensus, I'll gladly look at a simpler
implementation.  As I said, my only real goal is to stop people open
coding these types of operations.

-Matt


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
  2004-03-19  2:04     ` [PATCH] Introduce nodemask_t ADT [0/7] Andi Kleen
@ 2004-03-19  2:38       ` Paul Jackson
  2004-03-19 23:09       ` Matthew Dobson
  1 sibling, 0 replies; 56+ messages in thread
From: Paul Jackson @ 2004-03-19  2:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: colpatch, linux-kernel

> Some wrappers would be prettier.

That's roughly what the mask.h is, along with enough other bits and
pieces so that it can serve as a complete basis for cpumasks and
nodemasks.  The bitmap.h stuff is an incomplete subset of what's
used.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] Introduce nodemask_t ADT [0/7]
       [not found]   ` <1BgZN-Vk-1@gated-at.bofh.it>
@ 2004-03-19  2:04     ` Andi Kleen
  2004-03-19  2:38       ` Paul Jackson
  2004-03-19 23:09       ` Matthew Dobson
  0 siblings, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2004-03-19  2:04 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, pj

Matthew Dobson <colpatch@us.ibm.com> writes:

>> Chris Hellwig responded to it at the time asking why I didn't provide a
>> single generic mask ADT, and make cpumask and nodemask instances of
>> that.
>
> That is a better idea, if it can be made to work.  My goal is to stop

It already exists in linux/bitmap.h. I use that in NUMA API for the node masks.

It's just a bit ugly to write because you have to always pass MAX_NUMNODES.
Some wrappers would be prettier.

-Andi


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2004-03-27  0:52 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-03-18 23:04 [PATCH] Introduce nodemask_t ADT [0/7] Matthew Dobson
2004-03-18 23:23 ` Jesse Barnes
2004-03-18 23:32   ` Martin J. Bligh
2004-03-18 23:37     ` Jesse Barnes
2004-03-18 23:43       ` Martin J. Bligh
2004-03-18 23:59         ` Jesse Barnes
2004-03-19 16:20           ` Martin J. Bligh
2004-03-19  0:58         ` Zwane Mwaikambo
2004-03-19  1:11           ` Jesse Barnes
2004-03-19  1:34             ` Zwane Mwaikambo
2004-03-19  1:40               ` Jesse Barnes
2004-03-19  1:08         ` Paul Jackson
2004-03-19  0:01     ` Matthew Dobson
2004-03-18 23:58   ` Matthew Dobson
2004-03-19  2:55   ` William Lee Irwin III
2004-03-19  0:59 ` Paul Jackson
2004-03-19  1:19   ` Matthew Dobson
2004-03-19  1:45     ` Paul Jackson
2004-03-19 22:51       ` Matthew Dobson
2004-03-19 23:42         ` Paul Jackson
2004-03-19  1:48     ` Paul Jackson
2004-03-19  1:56     ` Paul Jackson
2004-03-19 23:02       ` Matthew Dobson
2004-03-20  0:59         ` Paul Jackson
2004-03-20  3:18           ` William Lee Irwin III
2004-03-20  6:09             ` Paul Jackson
2004-03-20  9:36               ` William Lee Irwin III
2004-03-22 23:59                 ` Paul Jackson
2004-03-23  2:12                   ` William Lee Irwin III
2004-03-23  1:21                 ` Paul Jackson
2004-03-23  2:10                   ` William Lee Irwin III
2004-03-23  1:24                 ` Paul Jackson
2004-03-20  8:02             ` Paul Jackson
2004-03-20 11:13               ` William Lee Irwin III
2004-03-21  4:19                 ` Paul Jackson
2004-03-21  4:36                   ` William Lee Irwin III
2004-03-21  7:59                     ` Nick Piggin
2004-03-23  1:12                 ` Paul Jackson
2004-03-23  2:09                   ` William Lee Irwin III
2004-03-23  2:39                     ` Paul Jackson
2004-03-23  3:13                       ` William Lee Irwin III
2004-03-23  3:36                         ` Paul Jackson
2004-03-23  3:59                           ` William Lee Irwin III
2004-03-23  4:03                             ` Paul Jackson
     [not found]                             ` <20040325012457.51f708c7.pj@sgi.com>
     [not found]                               ` <20040325101827.GO791@holomorphy.com>
2004-03-26 22:36                                 ` Sparc64, cpumask_t and struct arguments (was: [PATCH] Introduce nodemask_t ADT) Paul Jackson
2004-03-26 22:54                                   ` David S. Miller
2004-03-26 23:18                                     ` Paul Jackson
2004-03-26 23:29                                     ` Paul Jackson
2004-03-27  0:08                                       ` David S. Miller
2004-03-27  0:50                                         ` Paul Jackson
2004-03-26 23:37                                     ` Paul Jackson
     [not found] <1BeOx-7ax-55@gated-at.bofh.it>
     [not found] ` <1BgGq-DU-5@gated-at.bofh.it>
     [not found]   ` <1BgZN-Vk-1@gated-at.bofh.it>
2004-03-19  2:04     ` [PATCH] Introduce nodemask_t ADT [0/7] Andi Kleen
2004-03-19  2:38       ` Paul Jackson
2004-03-19 23:09       ` Matthew Dobson
2004-03-20  0:47         ` Paul Jackson
2004-03-20  1:14           ` Paul Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).