* bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2
@ 2001-10-03 12:17 Vladimir V. Saveliev
  2001-10-03 13:16 ` [PATCH] " Alexander Viro
  2001-10-03 21:09 ` Buffer cache confusion? Re: [reiserfs-list] bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Eric Whiting
  0 siblings, 2 replies; 28+ messages in thread
From: Vladimir V. Saveliev @ 2001-10-03 12:17 UTC (permalink / raw)
To: linux-kernel, reiserfs-list

Hi

It looks like something goes wrong when writing to or reading from a block device using the generic read/write functions when one does:

  mke2fs /dev/hda1   (blocksize is 4096)
  mount /dev/hda1
  umount /dev/hda1
  mke2fs /dev/hda1   - FAILS with
    Warning: could not write 8 blocks in inode table starting at 492004:
    Attempt to write block from filesystem resulted in short write

(note that /dev/hda1 should be big enough - 3GB is enough, for example)

Explanation of what happens (could be wrong and unclear): the blocksize of /dev/hda1 was 1024, so /dev/hda1's inode->i_blkbits was set to 10. Mounting used set_blocksize() to change the blocksize to 4096 in blk_size[][], but the inode of /dev/hda1 still has the old i_blkbits, which makes block_prepare_write create buffers of 1024 bytes and call blkdev_get_block for each of them. fs/block_dev.c:/max_block calculates the number of blocks on the device using blk_size[][] and thinks that there are four times fewer blocks on the device.

Thanks, vs

PS: thanks to Elena <grev@namesys.botik.ru> for finding that

^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 2001-10-03 12:17 bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Vladimir V. Saveliev @ 2001-10-03 13:16 ` Alexander Viro 2001-10-03 16:18 ` Linus Torvalds 2001-10-03 21:09 ` Buffer cache confusion? Re: [reiserfs-list] bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Eric Whiting 1 sibling, 1 reply; 28+ messages in thread From: Alexander Viro @ 2001-10-03 13:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Vladimir V. Saveliev, linux-kernel, reiserfs-list On Wed, 3 Oct 2001, Vladimir V. Saveliev wrote: > Hi > > It looks like something wrong happens with writing/reading to block > device using generic read/write functions when one does: > > mke2fs /dev/hda1 (blocksize is 4096) > mount /dev/hda1 > umount /dev/hda1 > mke2fs /dev/hda1 - FAILS with > Warning: could not write 8 blocks in inode table starting at 492004: > Attempt to write block from filesystem resulted in short write > > (note that /dev/hda1 should be big enough - 3gb is enogh for example) Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits. Vladimir, see if the patch below helps: --- S11-pre2/fs/block_dev.c Mon Oct 1 17:56:00 2001 +++ /tmp/block_dev.c Wed Oct 3 09:12:31 2001 @@ -549,36 +549,23 @@ return res; } -int blkdev_get(struct block_device *bdev, mode_t mode, unsigned flags, int kind) +static int do_open(struct block_device *bdev, struct inode *inode, struct file *file) { - int ret = -ENODEV; - kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */ - down(&bdev->bd_sem); + int ret = -ENXIO; + kdev_t dev = to_kdev_t(bdev->bd_dev); + down(&bdev->bd_sem); lock_kernel(); if (!bdev->bd_op) - bdev->bd_op = get_blkfops(MAJOR(rdev)); + bdev->bd_op = get_blkfops(MAJOR(dev)); if (bdev->bd_op) { - /* - * This crockload is due to bad choice of ->open() type. - * It will go away. 
- * For now, block device ->open() routine must _not_ - * examine anything in 'inode' argument except ->i_rdev. - */ - struct file fake_file = {}; - struct dentry fake_dentry = {}; - ret = -ENOMEM; - fake_file.f_mode = mode; - fake_file.f_flags = flags; - fake_file.f_dentry = &fake_dentry; - fake_dentry.d_inode = bdev->bd_inode; ret = 0; if (bdev->bd_op->open) - ret = bdev->bd_op->open(bdev->bd_inode, &fake_file); + ret = bdev->bd_op->open(inode, file); if (!ret) { bdev->bd_openers++; - bdev->bd_inode->i_size = blkdev_size(rdev); - bdev->bd_inode->i_blkbits = blksize_bits(block_size(rdev)); + bdev->bd_inode->i_size = blkdev_size(dev); + bdev->bd_inode->i_blkbits = blksize_bits(block_size(dev)); } else if (!bdev->bd_openers) bdev->bd_op = NULL; } @@ -589,9 +576,26 @@ return ret; } +int blkdev_get(struct block_device *bdev, mode_t mode, unsigned flags, int kind) +{ + /* + * This crockload is due to bad choice of ->open() type. + * It will go away. + * For now, block device ->open() routine must _not_ + * examine anything in 'inode' argument except ->i_rdev. 
+ */ + struct file fake_file = {}; + struct dentry fake_dentry = {}; + fake_file.f_mode = mode; + fake_file.f_flags = flags; + fake_file.f_dentry = &fake_dentry; + fake_dentry.d_inode = bdev->bd_inode; + + return do_open(bdev, bdev->bd_inode, &fake_file); +} + int blkdev_open(struct inode * inode, struct file * filp) { - int ret; struct block_device *bdev; /* @@ -604,29 +608,8 @@ bd_acquire(inode); bdev = inode->i_bdev; - down(&bdev->bd_sem); - - ret = -ENXIO; - lock_kernel(); - if (!bdev->bd_op) - bdev->bd_op = get_blkfops(MAJOR(inode->i_rdev)); - if (bdev->bd_op) { - ret = 0; - if (bdev->bd_op->open) - ret = bdev->bd_op->open(inode,filp); - if (!ret) { - bdev->bd_openers++; - bdev->bd_inode->i_size = blkdev_size(inode->i_rdev); - } else if (!bdev->bd_openers) - bdev->bd_op = NULL; - } - - unlock_kernel(); - up(&bdev->bd_sem); - if (ret) - bdput(bdev); - return ret; + return do_open(bdev, inode, filp); } int blkdev_put(struct block_device *bdev, int kind) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 2001-10-03 13:16 ` [PATCH] " Alexander Viro @ 2001-10-03 16:18 ` Linus Torvalds 2001-10-03 21:43 ` Alexander Viro 0 siblings, 1 reply; 28+ messages in thread From: Linus Torvalds @ 2001-10-03 16:18 UTC (permalink / raw) To: Alexander Viro; +Cc: Vladimir V. Saveliev, linux-kernel, reiserfs-list On Wed, 3 Oct 2001, Alexander Viro wrote: > > Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits. Duh. I couldn't even _imagine_ that we'd be so stupid to have duplicated that code twice instead of just having blkdev_open() call blkdev_get(). Thanks. Linus ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2
  2001-10-03 16:18 ` Linus Torvalds
@ 2001-10-03 21:43 ` Alexander Viro
  2001-10-03 21:56 ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2001-10-03 21:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Vladimir V. Saveliev, linux-kernel, reiserfs-list

On Wed, 3 Oct 2001, Linus Torvalds wrote:
>
> On Wed, 3 Oct 2001, Alexander Viro wrote:
> >
> > Ehh... Linus, both blkdev_get() and blkdev_open() should set ->i_blkbits.
>
> Duh. I couldn't even _imagine_ that we'd be so stupid to have duplicated
> that code twice instead of just having blkdev_open() call blkdev_get().

Notice that (inode, file) is a bogus API for block_device ->open(). I've checked all instances of that method in 2.4.11-pre2. Results:

The _only_ part of inode they are using is ->i_rdev. Read-only. They use file->f_flags and file->f_mode (also read-only). There are 3 exceptions:
1) initrd sets file->f_op. The whole thing is a dirty hack - it should become a character device in 2.5.
2) drivers/s390/char/tapeblock.c does bogus (and useless) stuff with file, including putting a pointer to it into global structures. Since the file can be fake (allocated on the caller's stack) it's hardly a good idea. Fortunately, the driver never looks at that pointer. Ditto for the rest of the bogus stuff done there - it's dead code.
3) drivers/block/floppy.c calls permission(inode) and caches the result in file->private_data.

Summary on the floppy case: Alain uses "we have write permissions on /dev/fd<n>" as a security check in several ioctls. The reason why we can't just check that the file had been opened for write is that floppy_open() will refuse to open the thing for write if it's write-protected.
Notice that we could trivially move the check into fd_ioctl() itself - permission() is fast in all relevant cases and it's definitely much faster than the operations themselves (we are talking about an honest-to-$DEITY PC floppy controller here). That wouldn't require any userland changes.

In other words, for all we care it's (block_device, flags, mode). And that makes a lot of sense, since we don't _have_ file in quite a few cases. Moreover, we don't care what inode is used for open - access control is done in generic code, same way as for _any_ open(). Notice that even floppy_open() extra checks do not affect the success of open() - we just cache them for future calls of ioctl().

Moreover, ->release() for block_device also doesn't care for the junk we pass - it only uses inode->i_rdev. In all cases. And I'd rather see them as

	int (*open)(struct block_device *bdev, int flags, int mode);
	int (*release)(struct block_device *bdev);
	int (*check_media_change)(struct block_device *bdev);
	int (*revalidate)(struct block_device *bdev);

- that would make more sense than the current variant. They are block_device methods, not file or inode ones, after all.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2
  2001-10-03 21:43 ` Alexander Viro
@ 2001-10-03 21:56 ` Christoph Hellwig
  2001-10-03 22:51 ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2001-10-03 21:56 UTC (permalink / raw)
To: Alexander Viro
Cc: Vladimir V. Saveliev, linux-kernel, reiserfs-list, Linus Torvalds

Hi Al,

In article <Pine.GSO.4.21.0110031643130.23558-100000@weyl.math.psu.edu> you wrote:
> Moreover, ->release() for block_device also doesn't care for the junk
> we pass - it only uses inode->i_rdev. In all cases. And I'd rather
> see it them as
> 	int (*open)(struct block_device *bdev, int flags, int mode);
> 	int (*release)(struct block_device *bdev);
> 	int (*check_media_change)(struct block_device *bdev);
> 	int (*revalidate)(struct block_device *bdev);
> - that would make more sense than the current variant. They are block_device
> methods, not file or inode ones, after all.

How about starting 2.5 with that patch once 2.4.11 is done? Linus?

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 2001-10-03 21:56 ` Christoph Hellwig @ 2001-10-03 22:51 ` Alexander Viro 2001-10-03 19:55 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Rob Landley 0 siblings, 1 reply; 28+ messages in thread From: Alexander Viro @ 2001-10-03 22:51 UTC (permalink / raw) To: Christoph Hellwig Cc: Vladimir V. Saveliev, linux-kernel, reiserfs-list, Linus Torvalds On Wed, 3 Oct 2001, Christoph Hellwig wrote: > Hi Al, > > In article <Pine.GSO.4.21.0110031643130.23558-100000@weyl.math.psu.edu> you wrote: > > Moreover, ->release() for block_device also doesn't care for the junk > > we pass - it only uses inode->i_rdev. In all cases. And I'd rather > > see it them as > > int (*open)(struct block_device *bdev, int flags, int mode); > > int (*release)(struct block_device *bdev); > > int (*check_media_change)(struct block_device *bdev); > > int (*revalidate)(struct block_device *bdev); > > - that would make more sense than the current variant. They are block_device > > methods, not file or inode ones, after all. > > How about starting 2.5 with that patch ones 2.4.11 is done? Linus? I don't think that it's a good idea. Such patch is trivial - it can be done at any point in 2.5. Moreover, while it does clean some of the mess up, I don't see a lot of other stuff that would depend on it. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)
  2001-10-03 22:51 ` Alexander Viro
@ 2001-10-03 19:55 ` Rob Landley
  2001-10-04  0:38 ` Rik van Riel
  2001-10-04 21:02 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Alan Cox
  0 siblings, 2 replies; 28+ messages in thread
From: Rob Landley @ 2001-10-03 19:55 UTC (permalink / raw)
To: Alexander Viro, Christoph Hellwig; +Cc: linux-kernel, Linus Torvalds

On Wednesday 03 October 2001 18:51, Alexander Viro wrote:
> On Wed, 3 Oct 2001, Christoph Hellwig wrote:
> > How about starting 2.5 with that patch once 2.4.11 is done? Linus?
>
> I don't think that it's a good idea. Such patch is trivial - it can be
> done at any point in 2.5. Moreover, while it does clean some of the
> mess up, I don't see a lot of other stuff that would depend on it.

I think he's just trolling for anything that might bud off 2.5 at this point. Can you blame him? (Yes, Al, you may flame me now. :)

Out of morbid curiosity, when 2.5 does finally fork off (a purely academic question, I know), which VM will it use? I'm guessing Alan will still inherit the "stable" codebase, but the -ac and -linus trees are breaking new ground on divergence here. Which tree becomes 2.4 once Alan inherits it? (Is this part of what's holding up 2.5?) Are we waiting for Andrea's shiny new VM to get into Alan's tree first? I think Alan said something about somewhere freezing over, but don't quite recall. Is someone else (Andrea?) likely to become 2.4 maintainer?

What exactly still needs to happen before 2.4 can be locked down, encased in lucite, and put into bugfix-only mode? (Anybody who's tried to use 3D acceleration with Red Hat 7.1 and >= 2.4.9 is unlikely to be convinced that it's currently in bugfix-only mode. The DRI part, anyway.)

On a technical level, 2.4.10's VM is working fine for me on my laptop.
(And I've seen "not working". ANYTHING would have been an improvement over ~2.4.5-2.4.7 or so. I often went for a soda while it swapped trying to open its twentieth Konqueror window. (Yeah, I know, bad habit I picked up years ago under OS/2...) At least THAT problem is now history. Then again I did buy 256 megabytes of RAM for my laptop, so it might not have been the new kernel that fixed it. :)

What else is left? I'm curious.

Rob

(Oh, and what's the deal with "classzones"? Linus told Andrea classzones were a dumb idea, and we'd regret it when we tried to inflict NUMA architecture on 2.5, but then went with Andrea's VM anyway, which I thought was based on classzones... Was that ever resolved? Was the problem avoided? What IS a classzone, anyway? I'd be happy to RTFM, if anybody could tell me where TF the M is hiding...)

Gotta go, Star Trek: The Previous Generation is about to come on...

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)
  2001-10-03 19:55 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Rob Landley
@ 2001-10-04  0:38 ` Rik van Riel
  2001-10-03 22:27 ` Rob Landley
  2001-10-04 21:02 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Alan Cox
  1 sibling, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2001-10-04 0:38 UTC (permalink / raw)
To: Rob Landley
Cc: Alexander Viro, Christoph Hellwig, linux-kernel, Linus Torvalds

On Wed, 3 Oct 2001, Rob Landley wrote:
> (Oh, and what's the deal with "classzones"? Linus told Andrea
> classzones were a dumb idea, and we'd regret it when we tried to
> inflict NUMA architecture on 2.5, but then went with Andrea's VM
> anyway, which I thought was based on classzones... Was that ever
> resolved? Was the problem avoided? What IS a classzone, anyway?
> I'd be happy to RTFM, if anybody could tell me where TF the M is
> hiding...)

Classzones used to be a superset of the memory zones, so if you have memory zones A, B and C you'd have classzone Ac consisting of memory zone A, classzone Bc = {A + B} and Cc = {A + B + C}.

This gives obvious problems for NUMA: suppose you have 4 nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, 4B and 4C. Putting together classzones for these isn't quite obvious and memory balancing will be complex ;)

Of course, nobody knows the exact definitions of classzones in the new 2.4 VM since it's completely undocumented; let's hope Andrea will document his code or we'll see a repeat of the development chaos we had with the 2.2 VM...

cheers,

Rik
-- 
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)
http://www.surriel.com/ http://distro.conectiva.com/

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) 2001-10-04 0:38 ` Rik van Riel @ 2001-10-03 22:27 ` Rob Landley 2001-10-04 20:53 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2O Alan Cox 2001-10-04 23:39 ` NUMA & classzones (was Whining about 2.5) Martin J. Bligh 0 siblings, 2 replies; 28+ messages in thread From: Rob Landley @ 2001-10-03 22:27 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel On Wednesday 03 October 2001 20:38, Rik van Riel wrote: > On Wed, 3 Oct 2001, Rob Landley wrote: > > (Oh, and what's the deal with "classzones"? Linus told Andrea > > classzones were a dumb idea, and we'd regret it when we tried to > > inflict NUMA architecture on 2.5, but then went with Andrea's VM > > anyway, which I thought was based on classzones... Was that ever > > resolved? What the problem avoided? What IS a classzone, anyway? > > I'd be happy to RTFM, if anybody could tell me where TF the M is > > hiding...) > > Classzones used to be a superset of the memory zones, so > if you have memory zones A, B and C you'd have classzone > Ac consisting of memory zone A, classzone Bc = {A + B} > and Cc = {A + B + C}. Ah. Cumulative zones. A class being a collection of zones, the class-zone patch. Right. That makes a lot more sense... > This gives obvious problems for NUMA, suppose you have 4 > nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, > 4B and 4C. Is there really a NUMA machine out there where you can DMA out of another node's 16 bit ISA space? So far the differences in the zones seem to be purely a question of capabilities (what you can use this ram for), not performance. Now I know numa changes that, but I'm wondering how many performance-degraded memory zones we're likely to have that still have capabilities like "we can DMA straight out of this". 
Or better yet, "we WANT to DMA straight out of this". Zones where we wouldn't be better off having the capability in question invoked from whichever node is "closest" to that resource. Perhaps some kind of processor-specific tasklet.

So how often does node 1 care about the difference between DMAable and non-DMAable memory in node 2? And more importantly, should the kernel care about this difference, or have the function invoked over on the other processor?

Especially since, discounting straightforward memory access latency variations, it SEEMS like this is largely a driver question. Device X can DMA to/from these zones of memory. The memory isn't different to the processors, it's different to the various DEVICES. So it's not just a processor question, but an association between processors, memory, and devices. (Back to the concept of nodes.) Meaning drivers could be supplying zone lists, which is just going to be LOADS of fun...

<uninformed rant>

I thought a minimalistic approach to numa optimization was to think in terms of nodes, and treat each node as one or more processors with a set of associated "cheap" resources (memory, peripherals, etc). Multiple tiers of decreasing locality for each node sounds like a lot of effort for a first attempt at NUMA support. That's where the "hideously difficult to calculate" bits come in. A problem which could increase exponentially with the number of nodes...

I always think of numa as the middle of a continuum. Zillion-way SMP with enormous L1 caches on each processor starts acting a bit like NUMA (you don't wanna go out of cache and fight the big evil memory bus if you can at all avoid it, and we're already worrying about process locality (processor affinity) to preserve cache state...). Shared memory beowulf clusters that page fault through the network with a relatively low-latency interconnect like myrinet would act a bit like NUMA too.
(Obviously, I haven't played with the monster SGI hardware or the high-end stuff IBM's so proud of.)

In a way, swap space on the drives could be considered a performance-delimited physical memory zone. One the processor can't access directly, which involves the allocation of DRAM bounce buffers. Between that and actual bounce buffers we ALREADY handle problems a lot like page migration between zones (albeit not in a generic, unified way)...

So I thought the cheap and easy way out is to have each node know what resources it considers "local", what resources are a pain to access (possibly involving a tasklet on another node), and a way to determine when tasks requiring a lot of access to them might be better migrated directly to a node where they're significantly cheaper, to the point where the cost of migration gets paid back. This struck me as the 90% "duct tape" solution to NUMA.

</uninformed rant>

(Hopefully, anyway...)
It's just that a lot of DEVICES (like 128 megabyte video cards, and limited-range DMA controllers) need their own class/zone lists, too. This chunk of physical memory can be used as DMA buffers for this PCI bridge, which can only be addressed directly by this group of processors anyway because they share the IO-APIC it's wired to... Which involves challenging a LOT of assumptions about the global nature of system resources that previous kernels used to make, I know. (Memory for DMA needs the specific device in question, but we already do that for ISA vs PCI dma... The user level stuff is just hinting to avoid bounce buffers...)

Um, can bounce buffers cause permanent page migration to another zone? (Since we have to allocate the page ANYWAY, might as well leave it there till it's evicted, unless of course we're very likely to evict it again pronto in which case we want to avoid bouncing it back... Hmmm... Then under NUMA there would be the "processor X can't access page in new location easily to fill it with new data to DMA out..." Fun fun fun...)

> Putting together classzones for these isn't
> quite obvious and memory balancing will be complex ;)

And this differs from normal in what way? It seems like Andrea's approach is just changing where work is done. Moving deductive work from allocation time to boot time. Assembling class/zone lists is an init-time problem (boot time or hot-pluggable-hardware swap time). Having zones linked together into lists of "this pool of memory can be used for these tasks", possibly as linked lists in order of preference for allocations or some such optimization, doesn't strike me as unreasonable. (It is ENTIRELY possible I'm wrong about this. Bordering on "likely", I'll admit...)

Making sure that a list arrangement is OPTIMAL is another matter, but whatever method gets chosen to do that, people are probably going to be arguing about it for years. You can't swap to disk perfectly without being able to see the future, either...
The balancing issue is going to be fun, but that's true whatever happens. You inherently have multiple nodes (collections of processors with clear and conflicting preferences about resources) disagreeing with each other about allocation decisions during the course of operation. That's part of the reason the "cheap bucket" and "non-cheap bucket" approach always appealed to me (for zillion way SMP and shared memory clusters, anyway, where they're pretty much the norm anyway). Of course where cheap buckets overlap, there might need to be some variant of weighting to reduce thrashing... Hmmm. Wouldn't you need weighting for non-class zones anyway?

Classing zones doesn't necessarily make weighting undoable. The ability to make decisions about a class doesn't mean ALL decisions have to be just about the class. It's just that you quickly know what "world" you're starting with, and can narrow down from there. (I'll have to look more closely at Andrea's implementation now that I know what the heck it's supposed to be doing. Now that I THINK I know, anyway...)

> Of course, nobody knows the exact definitions of classzones
> in the new 2.4 VM since it's completely undocumented; let's

I'd noticed.

> hope Andrea will document his code or we'll see a repeat of
> the development chaos we had with the 2.2 VM...

Or, for that matter, early 2.4 up until the start of the use-once thread. For me, anyway.

Since 2.4 isn't supposed to handle NUMA anyway, I don't see what difference it makes. Just use ANYTHING that stops the swap storms, lockups, zone starvation, zero order allocation failures, bounce buffer shortages, and other such fun we were having a few versions back. (Once again, this part now seems to be in the "it works for me"(tm) stage.) Then rip it out and start over in 2.5 if there's stuff it can't do.

> cheers,
>
> Rik

thingy,

Rob Landley, master of stupid questions.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2O
  2001-10-03 22:27 ` Rob Landley
@ 2001-10-04 20:53 ` Alan Cox
  2001-10-04 23:59 ` Whining about NUMA. :) [Was whining about 2.5...] Rob Landley
  2001-10-04 23:39 ` NUMA & classzones (was Whining about 2.5) Martin J. Bligh
  1 sibling, 1 reply; 28+ messages in thread
From: Alan Cox @ 2001-10-04 20:53 UTC (permalink / raw)
To: landley; +Cc: Rik van Riel, linux-kernel

> Is there really a NUMA machine out there where you can DMA out of another
> node's 16 bit ISA space? So far the differences in the zones seem to be

DMA engines are tied to the node the device is tied to, not to the processor in question, in most NUMA systems.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...]
  2001-10-04 20:53 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2O Alan Cox
@ 2001-10-04 23:59 ` Rob Landley
  2001-10-05 14:51 ` Alan Cox
  0 siblings, 1 reply; 28+ messages in thread
From: Rob Landley @ 2001-10-04 23:59 UTC (permalink / raw)
To: Alan Cox; +Cc: Rik van Riel, linux-kernel

On Thursday 04 October 2001 16:53, Alan Cox wrote:
> > Is there really a NUMA machine out there where you can DMA out of another
> > node's 16 bit ISA space? So far the differences in the zones seem to be
>
> DMA engines are tied to the node the device is tied to, not to the processor
> in question, in most NUMA systems.

Oh good. I'd sort of guessed that part, but wasn't quite sure. (I've seen hardware people do some amazingly odd things before. Luckily not recently, though...)

So would a workable (if naive) attempt to use Andrea's memory-zones-grouped-into-classes approach on NUMA just involve making a class/zone list for each node? (Okay, you've got to identify nodes, and group together processors, bridges, DMAable devices, etc, but it seems like that has to be done anyway, class/zone or not.) How does what people want to do for NUMA improve on that?

Is a major goal of NUMA figuring out how to borrow resources from adjacent nodes (and less-adjacent nodes) in a "second choice, third choice, twelfth choice" kind of way? Or is a "this resource set is local to this node, and allocating beyond this group is some variant of swapping behavior" approach an acceptable first approximation?

If class/zone is so evil for NUMA, what's the alternative that's being considered? (Pointer to paper?)
I'm wondering how the class/zone approach is more evil than the alternative of having lots and lots of little zones which have different properties for each processor and DMAable device on the system, and then trying to figure out what to do from there at allocation time or during each attempt to inflict I/O upon buffers. Rob P.S. Rik pointed out in email (replying to my "master of stupid questions" signoff) that I am indeed confused about a great many things, but didn't elaborate. Of course I agree with this, but I do try to make it up on volume :). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...]
  2001-10-04 23:59 ` Whining about NUMA. :) [Was whining about 2.5...] Rob Landley
@ 2001-10-05 14:51 ` Alan Cox
  2001-10-08 17:57 ` Martin J. Bligh
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Cox @ 2001-10-05 14:51 UTC (permalink / raw)
To: landley; +Cc: Alan Cox, Rik van Riel, linux-kernel

> So would a workable (if naive) attempt to use Andrea's
> memory-zones-grouped-into-classes approach on NUMA just involve making a
> class/zone list for each node? (Okay, you've got to identify nodes, and
> group together processors, bridges, DMAable devices, etc, but it seems like
> that has to be done anyway, class/zone or not.) How does what people want to
> do for NUMA improve on that?

I fear it becomes an N! problem.

I'd like to hear what Andrea has planned, since without docs it's hard to speculate on how the 2.4.10 VM works anyway.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-05 14:51 ` Alan Cox @ 2001-10-08 17:57 ` Martin J. Bligh 2001-10-08 18:10 ` Alan Cox 0 siblings, 1 reply; 28+ messages in thread From: Martin J. Bligh @ 2001-10-08 17:57 UTC (permalink / raw) To: Alan Cox, landley; +Cc: Rik van Riel, linux-kernel >> So would a workable (if naieve) attempt to use Andrea's >> memory-zones-grouped-into-classes approach on NUMA just involve making a >> class/zone list for each node? (Okay, you've got to identify nodes, and >> group together processors, bridges, DMAable devices, etc, but it seems like >> that has to be done anyway, class/zone or not.) How does what people want to >> do for NUMA improve on that? > > I fear it becomes an N! problem. > > I'd like to hear what Andrea has planned since without docs its hard to > speculate on how the 2.4.10 vm works anyway Can you describe why it's N! ? Are you talking about the worst possible case, or a two level local / non-local problem? Thanks, Martin. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...]
  2001-10-08 17:57 ` Martin J. Bligh
@ 2001-10-08 18:10 ` Alan Cox
  2001-10-08 18:20 ` Martin J. Bligh
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Cox @ 2001-10-08 18:10 UTC (permalink / raw)
To: Martin.Bligh; +Cc: Alan Cox, landley, Rik van Riel, linux-kernel

> > speculate on how the 2.4.10 vm works anyway
>
> Can you describe why it's N! ? Are you talking about the worst possible case,
> or a two level local / non-local problem?

Worst possible. I don't think in reality it will be nearly that bad.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:10 ` Alan Cox @ 2001-10-08 18:20 ` Martin J. Bligh 2001-10-08 18:31 ` Alan Cox 2001-10-08 18:35 ` Jesse Barnes 0 siblings, 2 replies; 28+ messages in thread From: Martin J. Bligh @ 2001-10-08 18:20 UTC (permalink / raw) To: Alan Cox; +Cc: landley, Rik van Riel, linux-kernel >> > speculate on how the 2.4.10 VM works anyway >> >> Can you describe why it's N! ? Are you talking about the worst possible case, >> or a two-level local/non-local problem? > > Worst possible. I don't think in reality it will be nearly that bad The worst possible case I can conceive (in the future architectures that I know of) is 4 different levels. I don't think the number of access speed levels is ever related to the number of processors? (Users of other NUMA architectures feel free to slap me at this point.) So I *think* the worst possible case is still linear (to number of nodes) in terms of how many classzone-type things we'd need? And the number of classzone-type things any given access would have to search through for an access is constant? The number of zones searched would be (worst case) linear to number of nodes? As we're intending to code this real soon now, this is more than just idle speculation for my own amusement ;-) Thanks, Martin. ^ permalink raw reply [flat|nested] 28+ messages in thread
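Whether the bookkeeping really degenerates toward N! depends on how many distinct nearest-first node orderings the topology can actually produce. A toy model (plain Python, with an invented symmetric ring topology; nothing here comes from the kernel) suggests the count stays linear in the node count for regular interconnects:

```python
# Toy model (not kernel code): count the distinct nearest-first
# node orderings ("zonelists") implied by a hop-distance matrix,
# versus the N! worst case. The ring topology is invented.
from math import factorial

def hop_distance_ring(n):
    """Hop distance between nodes i and j on an n-node ring."""
    return [[min(abs(i - j), n - abs(i - j)) for j in range(n)]
            for i in range(n)]

def zonelist(dist, node):
    """Nodes ordered nearest-first from `node`, ties broken by id."""
    return tuple(sorted(range(len(dist)), key=lambda j: (dist[node][j], j)))

n = 8
dist = hop_distance_ring(n)
distinct = {zonelist(dist, node) for node in range(n)}
# At most one ordering per node: linear in n, nowhere near n! = 40320.
print(len(distinct), factorial(n))
```

An asymmetric topology could in principle need more orderings, but never more than one per node, which is the point being argued.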
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:20 ` Martin J. Bligh @ 2001-10-08 18:31 ` Alan Cox 2001-10-08 18:35 ` Jesse Barnes 1 sibling, 0 replies; 28+ messages in thread From: Alan Cox @ 2001-10-08 18:31 UTC (permalink / raw) To: Martin.Bligh; +Cc: Alan Cox, landley, Rik van Riel, linux-kernel > The worst possible case I can conceive (in the future architectures > that I know of) is 4 different levels. I don't think the number of access > speed levels is ever related to the number of processors ? > (users of other NUMA architectures feel free to slap me at this point). The classzone code seems to deal in combinations of memory zones, not in specific zones. It lacks docs and the comments seem at best bogus and from the old code, so I may be wrong. So it's relative weightings for each combination of memory we might want to consider for each case, Andrea? Alan ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:20 ` Martin J. Bligh 2001-10-08 18:31 ` Alan Cox @ 2001-10-08 18:35 ` Jesse Barnes 2001-10-08 18:55 ` Martin J. Bligh 1 sibling, 1 reply; 28+ messages in thread From: Jesse Barnes @ 2001-10-08 18:35 UTC (permalink / raw) To: Martin J. Bligh; +Cc: linux-kernel On Mon, 8 Oct 2001, Martin J. Bligh wrote: > The worst possible case I can conceive (in the future architectures > that I know of) is 4 different levels. I don't think the number of access > speed levels is ever related to the number of processors ? > (users of other NUMA architectures feel free to slap me at this point). So you're saying that at most any given node is 4 hops away from any other for your arch? > So I *think* the worst possible case is still linear (to number of nodes) > in terms of how many classzone type things we'd need? And the number > of classzone type things any given access would have to search through > for an access is constant? The number of zones searched would be > (worst case) linear to number of nodes? That's how we have our stuff coded at the moment, but with classzones you might be able to get that down even further. For instance, you could have classzones that correspond to the number of hops a set of nodes is from a given node. Having such classzones might make finding nearby memory easier. Jesse ^ permalink raw reply [flat|nested] 28+ messages in thread
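Jesse's hop-count classzones can be sketched as a toy model (Python for illustration only; the hop counts below are an invented 2x2 grid seen from node 0):

```python
# Toy sketch (not kernel code): for one node, build cumulative
# "classzones" where classzone k holds every node within k hops.
def classzones_by_hops(dist_row):
    """dist_row[j] = hops from this node to node j.
    Returns cumulative node sets, nearest first."""
    zones, members = [], set()
    for hops in sorted(set(dist_row)):
        members |= {j for j, d in enumerate(dist_row) if d == hops}
        zones.append(frozenset(members))
    return zones

# Invented 2x2 grid: node 0 is 0 hops from itself, 1 hop from
# nodes 1 and 2, and 2 hops from node 3.
zones = classzones_by_hops([0, 1, 1, 2])
print(zones)
```

Each classzone is a superset of the previous one, mirroring the cumulative A / A+B / A+B+C structure described for the non-NUMA classzones earlier in the thread.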
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:35 ` Jesse Barnes @ 2001-10-08 18:55 ` Martin J. Bligh 2001-10-08 17:48 ` Marcelo Tosatti 2001-10-08 19:12 ` Jesse Barnes 0 siblings, 2 replies; 28+ messages in thread From: Martin J. Bligh @ 2001-10-08 18:55 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-kernel >> The worst possible case I can conceive (in the future architectures >> that I know of) is 4 different levels. I don't think the number of access >> speed levels is ever related to the number of processors ? >> (users of other NUMA architectures feel free to slap me at this point). > > So you're saying that at most any given node is 4 hops away from any > other for your arch? For the current architecture (well, for NUMA-Q) it's 0 or 1. For future architectures, there will be more (forgive me for deliberately not being specific ... I'd have to ask for more blessing first). Up to about 4. Ish. Depending on how much extra latency each hop introduces, it may well not be worth adding the complexity of differentiating beyond local vs remote? At least at first ... Do you know how many hops SGI can get, and how much extra latency you introduce? I know we're something like 10:1 ratio at the moment between local and remote. I guess my main point was that the number of levels was more like constant than linear. Maybe for large interconnected switched systems with small switches, it's n log n, but in practice I think log n is small enough to be considered constant (the number of levels of switches). >> So I *think* the worst possible case is still linear (to number of nodes) >> in terms of how many classzone type things we'd need? And the number >> of classzone type things any given access would have to search through >> for an access is constant? The number of zones searched would be >> (worst case) linear to number of nodes? 
> > That's how we have our stuff coded at the moment, but with classzones you > might be able to get that down even further. For instance, you could have > classzones that correspond to the number of hops a set of nodes is from a > given node. Having such classzones might make finding nearby memory easier. That's what I was planning on ... we'd need m x n classzones, where m was the number of levels, and n the number of nodes. Each search would obviously be through m classzones. I'll go poke at the current code some more. M. ^ permalink raw reply [flat|nested] 28+ messages in thread
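The m x n bookkeeping Martin describes, and the at-most-m probes per allocation, might look roughly like this toy model (Python; the two-pair topology and free-page counts are invented for illustration):

```python
# Toy sketch (not kernel code) of m levels x n nodes bookkeeping:
# each node keeps its classzones nearest-first, so an allocation
# probes at most m of them.
def build_tables(dist):
    """Per-node cumulative classzones from a node distance matrix."""
    tables = {}
    for node, row in enumerate(dist):
        zones, members = [], set()
        for level in sorted(set(row)):
            members |= {j for j, d in enumerate(row) if d == level}
            zones.append(set(members))
        tables[node] = zones
    return tables

def allocate(tables, free, node):
    """Take a page from the nearest classzone with free memory."""
    for zone in tables[node]:                  # at most m probes
        donors = [j for j in sorted(zone) if free[j] > 0]
        if donors:
            free[donors[0]] -= 1
            return donors[0]
    return None                                # everything exhausted

# Two tightly coupled pairs (0,1) and (2,3): 3 levels per node.
dist = [[0, 1, 2, 2],
        [1, 0, 2, 2],
        [2, 2, 0, 1],
        [2, 2, 1, 0]]
tables = build_tables(dist)
free = [0, 5, 5, 5]              # node 0 is out of local pages
got = allocate(tables, free, 0)  # spills to its partner, node 1
print(got, free)
```

With n nodes and m levels there are at most m*n classzone sets in `tables`, matching the "m x n classzones" estimate.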
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:55 ` Martin J. Bligh @ 2001-10-08 17:48 ` Marcelo Tosatti 2001-10-08 19:20 ` Martin J. Bligh 2001-10-08 19:12 ` Jesse Barnes 1 sibling, 1 reply; 28+ messages in thread From: Marcelo Tosatti @ 2001-10-08 17:48 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Jesse Barnes, linux-kernel On Mon, 8 Oct 2001, Martin J. Bligh wrote: > >> The worst possible case I can conceive (in the future architectures > >> that I know of) is 4 different levels. I don't think the number of access > >> speed levels is ever related to the number of processors ? > >> (users of other NUMA architectures feel free to slap me at this point). > > > > So you're saying that at most any given node is 4 hops away from any > > other for your arch? > > For the current architecture (well, for NUMA-Q) it's 0 or 1. For future > architectures, there will be more (forgive me for deliberately not being > specific ... I'd have to ask for more blessing first). Up to about 4. Ish. > > Depending on how much extra latency each hop introduces, it may well > not be worth adding the complexity of differentiating beyond local vs > remote? At least at first ... > > Do you know how many hops SGI can get, and how much extra latency > you introduce? I know we're something like 10:1 ratio at the moment > between local and remote. > > I guess my main point was that the number of levels was more like constant > than linear. Maybe for large interconnected switched systems with small > switches, it's n log n, but in practice I think log n is small enough to be > considered constant (the number of levels of switches). > > >> So I *think* the worst possible case is still linear (to number of nodes) > >> in terms of how many classzone type things we'd need? And the number > >> of classzone type things any given access would have to search through > >> for an access is constant? The number of zones searched would be > >> (worst case) linear to number of nodes? 
> > That's how we have our stuff coded at the moment, but with classzones you > > might be able to get that down even further. For instance, you could have > > classzones that correspond to the number of hops a set of nodes is from a > > given node. Having such classzones might make finding nearby memory easier. > > That's what I was planning on ... we'd need m x n classzones, where m > was the number of levels, and n the number of nodes. Each search would > obviously be through m classzones. I'll go poke at the current code some more. You say "number of levels" as in each level being the set of nodes at a given "level" of distance? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 17:48 ` Marcelo Tosatti @ 2001-10-08 19:20 ` Martin J. Bligh 0 siblings, 0 replies; 28+ messages in thread From: Martin J. Bligh @ 2001-10-08 19:20 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Jesse Barnes, linux-kernel >> That's what I was planning on ... we'd need m x n classzones, where m >> was the number of levels, and n the number of nodes. Each search would >> obviously be through m classzones. I'll go poke at the current code some more. > > You say "numbers of levels" as in each level being a given number of nodes > on that "level" distance ? Yes. For example, if the only different access speeds you have were "on the local node" vs "on another node", and access times to all *other* nodes were the same, you'd have 2 levels. If you have "on the local node" (10 ns) vs "on any node 1 hop away" (100ns), "on any node 2 hops away" (110ns), that'd be 3 levels. (latency numbers picked out of my portable random number generator ;-) ). If the latencies on a 4 level system turn out to be 10,100,101,102 then it's only going to be worth defining 2 levels. If they turn out to be 10,100,1000, 10000, then it'll (probably) be worth doing 4 .... M. ^ permalink raw reply [flat|nested] 28+ messages in thread
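Martin's rule of thumb (only define a new level when the latency jump is big enough to be worth the complexity) can be sketched as a toy heuristic; the 1.5x growth cutoff here is an arbitrary illustrative threshold, not anything from a real VM:

```python
# Toy heuristic (nothing from the kernel): collapse a sorted list
# of per-level latencies into the levels actually worth modelling.
def worthwhile_levels(latencies, ratio=1.5):
    """Start a new level only when latency grows by >= `ratio`."""
    levels, base = 1, latencies[0]
    for lat in latencies[1:]:
        if lat >= base * ratio:
            levels += 1
            base = lat
    return levels

flat = worthwhile_levels([10, 100, 101, 102])      # top three collapse
steep = worthwhile_levels([10, 100, 1000, 10000])  # every hop matters
print(flat, steep)
```

Using the mail's own made-up numbers: 10, 100, 101, 102 collapses to 2 levels, while 10, 100, 1000, 10000 keeps all 4.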
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 18:55 ` Martin J. Bligh 2001-10-08 17:48 ` Marcelo Tosatti @ 2001-10-08 19:12 ` Jesse Barnes 2001-10-08 19:37 ` Peter Rival 1 sibling, 1 reply; 28+ messages in thread From: Jesse Barnes @ 2001-10-08 19:12 UTC (permalink / raw) To: Martin J. Bligh; +Cc: linux-kernel On Mon, 8 Oct 2001, Martin J. Bligh wrote: > Depending on how much extra latency each hop introduces, it may well > not be worth adding the complexity of differentiating beyond local vs > remote? At least at first ... Well, there's already some code to do that (mm/numa.c), but I'm not sure how applicable it will be to your arch. > Do you know how many hops SGI can get, and how much extra latency > you introduce? I know we're something like 10:1 ratio at the moment > between local and remote. I think we're something like 1.5:1, and we have machines with up to 256 nodes at the moment, so there can be quite a few hops in the worst case. > I guess my main point was that the number of levels was more like constant > than linear. Maybe for large interconnected switched systems with small > switches, it's n log n, but in practice I think log n is small enough to be > considered constant (the number of levels of switches). Depends on how big your node count gets I guess. > That's what I was planning on ... we'd need m x n classzones, where m > was the number of levels, and n the number of nodes. Each search would > obviously be through m classzones. I'll go poke at the current code some more. Yeah, classzones is one way to go about this. There are some other simple ways to do nearest node allocation though, given the current codebase. I'm still trying to figure out which is the most flexible. Jesse ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Whining about NUMA. :) [Was whining about 2.5...] 2001-10-08 19:12 ` Jesse Barnes @ 2001-10-08 19:37 ` Peter Rival 0 siblings, 0 replies; 28+ messages in thread From: Peter Rival @ 2001-10-08 19:37 UTC (permalink / raw) To: Jesse Barnes; +Cc: Martin J. Bligh, linux-kernel Just to put in my $0.02 on this... Compaq systems will span the range on this. The current Wildfire^WGS Series systems have two levels - either "local" or "remote", which is just under 3:1 latency vs. local. This is all public knowledge, if you care to dig through all the docs. ;) With the new EV7 systems coming out soon (next year?) every CPU has a switch and memory controller built in, so as you add CPUs (up to 64) you potentially add levels of latency. I can't say what they are, but the numbers I've been given so far are _much_ better than that. Just another data point. :) - Pete ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: NUMA & classzones (was Whining about 2.5) 2001-10-03 22:27 ` Rob Landley 2001-10-04 20:53 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2O Alan Cox @ 2001-10-04 23:39 ` Martin J. Bligh 2001-10-04 23:55 ` Rob Landley 1 sibling, 1 reply; 28+ messages in thread From: Martin J. Bligh @ 2001-10-04 23:39 UTC (permalink / raw) To: landley, Rik van Riel; +Cc: linux-kernel I'll preface this by saying I know a little about the IBM NUMA-Q (aka Sequent) hardware, but not very much about VM (or anyone else's NUMA hardware). >> Classzones used to be a superset of the memory zones, so >> if you have memory zones A, B and C you'd have classzone >> Ac consisting of memory zone A, classzone Bc = {A + B} >> and Cc = {A + B + C}. > > Ah. Cumulative zones. A class being a collection of zones, the class-zone > patch. Right. That makes a lot more sense... > >> This gives obvious problems for NUMA, suppose you have 4 >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, >> 4B and 4C. > > Is there really a NUMA machine out there where you can DMA out of another > node's 16 bit ISA space? So far the differences in the zones seem to be If I understand your question (and my hardware) correctly, then yes. I think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards, not ISA, but we could still use the ISA DMA zone). But it probably doesn't make sense to define A,B, and C for each node. For a start, we don't use ISA DMA (and probably no other NUMA box does either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a moment that we have 1Gb per node), then we probably don't need NORMAL+HIGHMEM for each node either. 
0-900Mb = NORMAL (1A)
900Mb-1Gb = HIGHMEM_NODE1 (1B)
1Gb-2Gb = HIGHMEM_NODE2 (2)
2Gb-3Gb = HIGHMEM_NODE3 (3)
3Gb-4Gb = HIGHMEM_NODE4 (4)
If we have less than 1Gb per node, then one of the other nodes will have 2 zones - whichever contains the transition point from NORMAL -> HIGHMEM. Thus number of zones = number of nodes + 1. (to my mind, if we're frigging with the zone patterns for NUMA, getting rid of DMA zone probably isn't too hard). If I were allowed to define classzones as a per-processor concept (and I don't know enough about VM to know if that's possible), it would seem to fit nicely. Taking the map above, the classzones for a processor on node 3 would be: {3}, {1A + 1B + 2 + 3 + 4} > Especially since, discounting straightforward memory access latency > variations, it SEEMS like this is largely a driver question. Device X can > DMA to/from these zones of memory. The memory isn't different to the > processors, it's different to the various DEVICES. So it's not just a > processor question, but an association between processors, memory, and > devices. (Back to the concept of nodes.) Meaning drivers could be supplying > zone lists, which is just going to be LOADS of fun... If possible, I'd like to avoid making every single driver NUMA aware. Partly because I'm lazy, but also because I think it can be simpler than this. The mem subsystem should just be able to allocate something that's as good as possible for that card, without the driver worrying explicitly about zones (though it may have to specify if it can do 32/64 bit DMA). See http://lse.sourceforge.net/numa - there should be some NUMA API proposals there for explicit stuff. > <uninformed rant> > > I thought a minimalistic approach to numa optimization was to think in terms > of nodes, and treat each node as one or more processors with a set of > associated "cheap" resources (memory, peripherals, etc).
Multiple tiers of > decreasing locality for each node sounds like a lot of effort for a first > attempt at NUMA support. That's where the "hideously difficult to calculate" > bits come in. A problem which could increase exponentially with the number of > nodes... Going back to the memory map above, say that nodes 1-2 are tightly coupled, and 3-4 are tightly coupled, but 1-3, 1-4, 2-3, 2-4 are loosely coupled. This gives us a possible hierarchical NUMA situation. So now the classzones for procs on node 3 would be: {3}, {3+4}, {1A + 1B + 2 + 3 + 4} which would make hierarchical NUMA easy enough. > I always think of numa as the middle of a continuum. Zillion-way SMP with > enormous L1 caches on each processor starts acting a bit like NUMA (you don't > wanna go out of cache and fight the big evil memory bus if you can at all > avoid it, and we're already worrying about process locality (processor > affinity) to preserve cache state...). Kind of, except you can explicitly specify which bits of memory you want to use, rather than the hardware working it out for you. > Shared memory beowulf clusters that > page fault through the network with a relatively low-latency interconnect > like myrinet would act a bit like NUMA too. Yes. > (Obviously, I haven't played > with the monster SGI hardware or the high-end stuff IBM's so proud of.) There's a 16-way NUMA (4x4) at OSDL (www.osdlab.org) that's running Linux and available for anyone to play with, if you're so inclined. It doesn't understand very much of its NUMA-ness, but it works. (This is the IBM NUMA-Q hardware ... I presume that's what you're referring to.) > In a way, swap space on the drives could be considered a > performance-delimited physical memory zone. One the processor can't access > directly, which involves the allocation of DRAM bounce buffers. Between that > and actual bounce buffers we ALREADY handle problems a lot like page > migration between zones (albeit not in a generic, unified way)...
I don't think it's quite that simple. For swap, you always want to page stuff back in before using it. For NUMA memory on remote nodes, it may or may not be worth migrating the page. If we chose to migrate a process between nodes, we could indeed set up a system where we'd page fault pages in from the remote node as we used them, or we could just migrate the working set with the process. Incidentally, swapping on NUMA will need per-zone swapping even more, so I don't see how we could do anything sensible for this without a physical to virtual mem map. But maybe someone knows how. > So I thought the cheap and easy way out is to have each node know what > resources it considers "local", what resources are a pain to access (possibly > involving a tasklet on another node), and a way to determine when tasks > requiring a lot of access to them might better be migrated directly to a > node where they're significantly cheaper to the point where the cost of > migration gets paid back. This struck me as the 90% "duct tape" solution to > NUMA. Pretty much. I don't know of any situation when we need a tasklet on another node - that's a pretty horrible thing to have to do. > So what hardware inherently requires a multi-tier NUMA approach beyond "local > stuff" and "everything else"? (I suppose there's bound to be some linearly > arranged system with a long gradual increase in memory access latency as you > go down the row, and of course a node in the middle which has a unique > resource everybody's fighting for. Is this a common setup in NUMA systems?) The next generation of hardware/chips will have more hierarchical stuff. The shorter/smaller a bus is, the faster it can go, so we can tightly couple small sets faster than big sets. > And then, of course, there's the whole question of 3D accelerated video card > texture memory, and trying to stick THAT into a zone. :) (Eew! Eew! Eew!) > Yeah, it IS a can of worms, isn't it?
Your big powerful NUMA server is going to be used to play Quake on? ;-) Same issue for net cards, etc., though, I guess. > But class/zone lists still seem fine for processors. It's just a question of > doing the detective work for memory allocation up front, as it were. If you > can't figure it out up front, how the heck are you supposed to do it > efficiently at allocation time? If I understand what you mean correctly, we should be able to lay out the topology at boot time, and work out which phys mem locations will be faster/slower from any given resource (proc, PCI, etc). > This > chunk of physical memory can be used as DMA buffers for this PCI bridge, > which can only be addressed directly by this group of processors anyway > because they share the IO-APIC it's wired to... Hmmm ... at least in the hardware I'm familiar with, we can access any PCI bridge or any IO-APIC from any processor. Slower, but functional. > Um, can bounce buffers do permanent page migration to another zone? (Since we > have to allocate the page ANYWAY, might as well leave it there till it's > evicted, unless of course we're very likely to evict it again pronto in which > case we want to avoid bouncing it back... As I understand zones, they're physical, therefore pages don't migrate between them. The data might be copied from the bounce buffer to a page in another zone, but ... Not sure if we're using quite the same terminology. Feel free to correct me. > Hmmm... Then under NUMA there > would be the "processor X can't access page in new location easily to fill it > with new data to DMA out..." Fun fun fun...) On the machines I'm used to, there's no problem with "can't access", just slower or faster. > Since 2.4 isn't supposed to handle NUMA anyway, I don't see what difference > it makes. Just use ANYTHING that stops the swap storms, lockups, zone > starvation, zero-order allocation failures, bounce buffer shortages, and > other such fun we were having a few versions back.
(Once again, this part > now seems to be in the "it works for me"(tm) stage.) > > Then rip it out and start over in 2.5 if there's stuff it can't do. I'm not convinced that changing directions all the time is the most efficient way to operate - it would be nice to keep building on work already done in 2.4 (on whatever subsystem that is) rather than rework it all, but maybe that'll happen anyway, so .... Martin. ^ permalink raw reply [flat|nested] 28+ messages in thread
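The per-node zone layout from Martin's mail above (one NORMAL zone up to roughly 900Mb, then one HIGHMEM zone per node, giving nodes + 1 zones) can be sketched as a toy calculation (Python; sizes in Mb, using the 1Gb-per-node example from the mail, and the 900Mb boundary is the mail's own rough figure):

```python
# Toy calculation (not kernel code) of the layout in the mail:
# one NORMAL zone up to ~900Mb, then one HIGHMEM zone per node.
NORMAL_LIMIT = 900  # Mb, rough direct-map boundary from the mail

def zone_layout(node_mb):
    """Return (name, start_mb, end_mb) zones for per-node sizes."""
    zones = [("NORMAL", 0, NORMAL_LIMIT)]
    start = NORMAL_LIMIT
    end = 0
    for i, mb in enumerate(node_mb):
        end += mb
        if end > start:
            zones.append((f"HIGHMEM_NODE{i + 1}", start, end))
            start = end
    return zones

layout = zone_layout([1024, 1024, 1024, 1024])
assert len(layout) == 4 + 1   # number of zones = number of nodes + 1
for zone in layout:
    print(zone)
```

With less than 1Gb per node the NORMAL boundary simply falls inside a later node, so the zone count still comes out to nodes + 1.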
* Re: NUMA & classzones (was Whining about 2.5) 2001-10-04 23:39 ` NUMA & classzones (was Whining about 2.5) Martin J. Bligh @ 2001-10-04 23:55 ` Rob Landley 2001-10-05 17:29 ` Martin J. Bligh 2001-10-06 1:44 ` Jesse Barnes 0 siblings, 2 replies; 28+ messages in thread From: Rob Landley @ 2001-10-04 23:55 UTC (permalink / raw) To: Martin J. Bligh, landley, Rik van Riel; +Cc: linux-kernel On Thursday 04 October 2001 19:39, Martin J. Bligh wrote: > I'll preface this by saying I know a little about the IBM NUMA-Q > (aka Sequent) hardware, but not very much about VM (or anyone > else's NUMA hardware). I saw the IBM guys in Austin give a talk on it last year, which A) had more handwaving than Star Wars episode zero, B) had FAR more info about politics in the AIX division than about NUMA, C) involved the main presenter letting us know he was leaving IBM at the end of the week... Kind of like getting details about CORBA out of IBM. And I worked there when I was trying to do that. (I was once in charge of implementing CORBA compliance for a project, and all they could find me to define it at the time was a marketing brochure. Sigh...) > >> This gives obvious problems for NUMA, suppose you have 4 > >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, > >> 4B and 4C. > > > > Is there really a NUMA machine out there where you can DMA out of another > > node's 16 bit ISA space? So far the differences in the zones seem to be > > If I understand your question (and my hardware) correctly, then yes. I > think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards, > not ISA, but we could still use the ISA DMA zone). Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm impressed. (It was more a "when do we care" question...) Two points: 1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it? Where does the NUMA come in? (I guess it's more expensive to DMA between certain devices/memory pages?
Or are we talking sheer processor access latency here, nothing to do with devices at all...?) 2) A processor-centric view of memory zones is not the whole story. Look at the zones we have now. The difference between the ISA zone, the PCI zone, and high memory has nothing to do with the processor*. It's a question of which devices (which bus/bridge really) can talk to which pages. In current UP/SMP systems, the processor can talk to all of them pretty much equally. * modulo the Intel 36-bit extension stuff, which I must admit I haven't looked closely at. Don't have the hardware. Then again that's sort of the traditional NUMA problem of "some memory is a bit funky for the processor to access". Obviously I'm not saying I/O is the ONLY potential difference between memory zones... So we need zones defined relative not just to processors (or groups of processors that have identical access profiles), but also defined relative to I/O devices and busses. Meaning zones may become a driver issue. This gets us back to the concept of "nodes". Groups of processors and devices that collectively have a similar view of the world, memory-wise. Is this a view of the problem that current NUMA thinking is using, or not? > But it probably doesn't make sense to define A,B, and C for each node. For > a start, we don't use ISA DMA (and probably no other NUMA box does > either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a > moment that we have 1Gb per node), then we probably don't need > NORMAL+HIGHMEM for each node either. > > 0-900Mb = NORMAL (1A) > 900-1Gb = HIGHMEM_NODE1 (1B) > 1G-2Gb = HIGHMEM_NODE2 (2) > 2G-3Gb = HIGHMEM_NODE3 (3) > 3Gb-4Gb = HIGHMEM_NODE4 (4) By highmem you mean memory our I/O devices can't DMA out of? Will all the I/O devices in the system share a single pool of buffer memory, or will devices be attached to nodes? (My thinking still turns to making shared memory beowulf clusters act like one big system.
The hardware for that will continue to be cheap: rackmount a few Tyan Thunder dual Athlon boards. You can distribute drives for storage and swap space (even RAID them if you like), and who says such a cluster has to put all external access through a single node?) > If we have less than 1Gb per node, then one of the other nodes will have 2 > zones - whichever contains the transition point from NORMAL-> HIGHMEM. So "normal" belongs to a specific node, so all devices basically belong to that node? > Thus number of zones = number of nodes + 1. > (to my mind, if we're frigging with the zone patterns for NUMA, getting rid > of DMA zone probably isn't too hard). You still have the problem of doing DMA. Now this is a separable problem boiling down to either allocation and locking of DMAable buffers the processor can directly access, or setting up bounce buffers when the actual I/O is kicked off. (Or doing memory-mapped I/O, or PIO. But all that is still a bit like one big black box, I'd think. And to do it right, you need to know which device you're doing I/O to, because I really wouldn't assume every I/O device on the system shares the same pool of DMAable memory. Or that we haven't got stuff like graphics cards that have their own RAM we map into our address space. Or, for that matter, that physical memory mapping in one node makes anything directly accessible from another node.) > If I were allowed to define classzones as a per-processor concept (and I > don't know enough about VM to know if that's possible), it would seem to > fit nicely. Taking the map above, the classzones for a processor on node 3 > would be: > > {3} , {1A + 1B + 2+ 3 + 4} Not just per processor. Think about a rackmount shared memory beowulf system, page faulting through the network. With quad-processor boards in each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to act like NUMA.
Each 1U has four processors with identical performance (and probably one set of page tables if they share a northbridge). Assembling NUMA systems out of closely interconnected SMP systems. > If possible, I'd like to avoid making every single driver NUMA aware. It may not be a driver issue. It may be a bus issue. If there are two PCI busses in the system, do they HAVE to share one set of physical memory mappings? (NUMA sort of implies we have more than one northbridge. Dontcha think we might have more than one southbridge, too?) > Partly because I'm lazy, but also because I think it can be simpler than > this. The mem subsystem should just be able to allocate something that's as > good as possible for that card, without the driver worrying explicitly > about zones (though it may have to specify if it can do 32/64 bit DMA). It's not just 32/64 bit DMA. You're assuming every I/O device in the system is talking to exactly the same pool of memory. The core assumption of NUMA is that the processors aren't doing that, so I don't know why the I/O devices necessarily should. (Maybe they do, what do I know. It would be nice to hear from somebody with actual information...) And if they ARE all talking to one pool of memory, than the whole NUMA question becomes a bit easier, actually... The flood of zones we were so worried about (Node 18's processor sending packets through a network card living on node 13) can't really happen, can it? > see http://lse.sourceforge.net/numa - there should be some NUMA API > proposals there for explicit stuff. Thanks for the link. :) > > I always think of numa as the middle of a continuum. Zillion-way SMP > > with enormous L1 caches on each processor starts acting a bit like NUMA > > (you don't wanna go out of cache and fight the big evil memory bus if you > > can at all avoid it, and we're already worrying about process locality > > (processor affinity) to preserve cache state...). 
> > Kind of, except you can explicitly specify which bits of memory you want to > use, rather than the hardware working it out for you. Ummm... Is the memory bus somehow physically reconfiguring itself to make some chunk of memory lower or higher latency when talking to a given processor? I'm confused... > > Shared memory beowulf clusters that > > page fault through the network with a relatively low-latency interconnect > > like myrinet would act a bit like NUMA too. > > Yes. But that's the bit that CLEARLY works in terms of nodes, and also which has devices attached to different nodes, requiring things like remote tasklets to access remote devices, and page migration between nodes to do repeated access on remote pages. (Not that this is much different than sending a page back and forth between processor caches in SMP. Hence the continuum I was talking about...) The multiplicative complexity I've heard fears about on this list seems to stem from an interaction between "I/O zones" and "processor access zones" creating an exponential number of gradations when the two qualities apply to the same page. But in a node setup, you don't have to worry about it. A node has its local memory, and it's local I/O, and it inflicts work on remote zones when it needs to deal with their resources. There may be one big shared pool of I/O memory or some such (IBM's NUMA-Q), but in that case it's the same for all processors. Each node has one local pool, one remote pool, and can just talk to a remote node when it needs to (about like SMP). I THOUGHT numa had a gradient, of "local, not as local, not very local at all, darn expensive" pages that differed from node to node, which would be a major pain to optimize for yes. (I was thinking motherboard trace length and relaying stuff several hops down a bus...) But I haven't seen it yet. And even so, "not local=remote" seems to cover the majority of the cases without exponential complexity... I am still highly confused. 
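The two-pool view sketched above (each node sees "local" plus one undifferentiated "remote" pool, about like SMP) can be written as a toy allocator (Python; the node names and pool sizes are invented):

```python
# Toy allocator (invented pool sizes) for the two-pool view:
# prefer local memory, treat every remote node as one big pool.
class Node:
    def __init__(self, name, pages):
        self.name, self.free = name, pages

def alloc_page(local, remote):
    """Local first; fall back to any remote node with free pages."""
    for node in [local] + remote:
        if node.free > 0:
            node.free -= 1
            return node.name
    return None

a, b, c = Node("A", 1), Node("B", 2), Node("C", 2)
first = alloc_page(a, [b, c])   # local pool still has a page
second = alloc_page(a, [b, c])  # local exhausted, spill remote
print(first, second)
```

A gradient topology would replace the flat `remote` list with an ordered one, which is exactly where the extra complexity being debated comes in.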
> > (Obviously, I haven't played > > with the monster SGI hardware or the high-end stuff IBM's so proud of.) > > There's a 16-way NUMA (4x4) at OSDL (www.osdlab.org) that's running > linux and available for anyone to play with, if you're so inclined. It > doesn't understand very much of its NUMA-ness, but it works. This is the > IBM NUMA-Q hardware ... (I presume that's what you're referring to). That's what I've heard the most about. I'm also under the impression that SGI was working on NUMA stuff up around the Origin line, and that Sun had some monsters in the works as well... It still seems to me that either clustering or zillion-way SMP is the most interesting area of future supercomputing, though. Sheer price to performance. For stuff that's not very easily separable into chunks, they've got 64 way SMP working in the lab. For stuff that IS chunkable, thousand box clusters are getting common. If the interconnects between boxes are a bottleneck, 10gigE is supposed to be out in late 2003, last I heard, meaning gigE will get cheap... And for just about everything else, there's Moore's Law... Think about big fast-interconnect shared memory clusters. Resources are either local or remote through the network, you don't care too much about gradients. So the "symmetrical" part of SMP applies to decisions between nodes. There's another layer of decisions in that a node may be an SMP box in and of itself (probably will), but there's only really two layers to worry about, not an exponential amount of complexity where each node has a potentially unique relationship with every other node... People wanting to run straightforward multithreaded programs using shared memory and semaphores on big clusters strikes me as an understandable goal, and the drive for fast (low latency) interconnects to make that feasible is something I can see a good bang for the buck coming out of. Here's the hardware that's widely/easily/cheaply available, here's what programmers want to do with it.
I can see that. The drive to support monster mainframes which are not only 1% of the market but which get totally redesigned every three or four years to stay ahead of Moore's Law... I'm not quite sure what's up there. How much of the market can throw that kind of money to constantly offset massive depreciation? Is the commodity hardware world going to inherit NUMA (via department level shared memory beowulf clusters, or just plain the hardware to do it getting cheap enough), or will it remain a niche application? As I said: master of stupid questions. The answers are taking a bit more time... > > In a way, swap space on the drives could be considered a > > performance-delimited physical memory zone. One the processor can't > > access directly, which involves the allocation of DRAM bounce buffers. > > Between that and actual bounce buffers we ALREADY handle problems a lot > > like page migration between zones (albeit not in a generic, unified > > way)... > > I don't think it's quite that simple. For swap, you always want to page > stuff back in before using it. For NUMA memory on remote nodes, it may or > may not be worth migrating the page. Bounce buffers. This is new? Seems like the same locking issues, even... > If we chose to migrate a process > between nodes, we could indeed set up a system where we'd page fault pages > in from the remote node as we used them, or we could just migrate the > working set with the process. Yup. This is a problem I've heard discussed a lot: deciding when to migrate resources. (Pages, processes, etc.) It also seems to be a separate layer of the problem, one that isn't too closely tied to the initial allocation strategy (although it may feed back into it, but really that seems to be just free/alloc and maybe adjusting weighting/ageing whatever. Am I wrong?) I.E. migration strategy and allocation strategy aren't necessarily the same thing...
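The "may or may not be worth migrating the page" decision above reduces to a cost comparison: leave the page remote and eat the per-access penalty, or pay a one-time move cost and access it locally afterwards. A deliberately naive sketch, with all names and cost units invented for illustration (real numbers would come from measured interconnect latencies):

```c
#include <assert.h>

/* Is it worth migrating a page to the local node?  Compare the total
 * cost of leaving it remote against the one-time move plus the cost of
 * the same accesses done locally.  Costs are in arbitrary time units. */
int worth_migrating(long expected_accesses,
                    long remote_cost,   /* per-access cost if it stays put */
                    long local_cost,    /* per-access cost once it's local */
                    long migrate_cost)  /* one-time cost of moving the page */
{
    long stay = expected_accesses * remote_cost;
    long move = migrate_cost + expected_accesses * local_cost;

    return move < stay;
}
```

A hot page (many expected accesses) amortizes the move; a page touched once or twice does not — which is why the migration decision is a policy layer separate from the allocation strategy itself.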
> Incidentally, swapping on NUMA will need per-zone swapping even more, > so I don't see how we could do anything sensible for this without a > physical to virtual mem map. But maybe someone knows how. There you got me. I DO know that you can have multiple virtual mappings for each physical page, so it's not as easy as the other way around, but this could be why the linked list was invented... (I believe Rik is working on patches that cover this bit. Haven't looked at them yet.) > > So I thought the cheap and easy way out is to have each node know what > > resources it considers "local", what resources are a pain to access > > (possibly involving a tasklet on another node), and a way to determine > > when tasks requiring a lot of access to them might better be migrated > > directly to a node where they're significantly cheaper to the point where > > the cost of migration gets paid back. This struck me as the 90% "duct > > tape" solution to NUMA. > > Pretty much. I don't know of any situation when we need a tasklet on > another node - that's a pretty horrible thing to have to do. Think shared memory beowulf. My node has a hard drive. Some other node wants to read and write to my hard drive, because it's part of a larger global file system or storage area network or some such. My node has a network card. There are three different connections to the internet, and they're on separate nodes to avoid single point of failure syndrome. My node has a video capture card. The cluster as a whole is doing realtime video acquisition and streaming for a cable company that saw the light and switched over to MP4 with a big storage cluster. Incoming signals from cable (or movies fed into the system for pay per view) get converted to MP4 (processor intensive, cluster needed to keep up with HDTV, especially multiple channels) and saved in the storage area network part, and subscriber channels get fetched and fed back out.
(Probably not as video, probably as a TCP/IP stream to a set top box. The REAL beauty of digital video isn't trying to do "movies on demand", it's having a cluster stuffed with old episodes of Mash, ER, The West Wing, Star Trek, The Incredible Hulk, Dark Shadows, and Dr. Who which you can call up and play at will. Syndicated content on demand. EASY task for a cluster to do. Doesn't NEED to think NUMA, that could be programmed as beowulf. But we could also be using the Mach microkernel on SMP boxes, it makes about as much sense. Beowulf is message passing, microkernels are message passing, CORBA is message passing... Get fast interconnects, message passing becomes less and less of a good idea...) > > So what hardware inherently requires a multi-tier NUMA approach beyond > > "local stuff" and "everything else"? (I suppose there's bound to be some > > linearly arranged system with a long gradual increase in memory access > > latency as you go down the row, and of course a node in the middle which > > has a unique resource everybody's fighting for. Is this a common setup > > in NUMA systems?) > > The next generation of hardware/chips will have more hierarchical stuff. > The shorter / smaller a bus is, the faster it can go, so we can tightly > couple small sets faster than big sets. Sure. This is electronics 101, the speed of light is not your friend. (Intel fought and lost this battle with the Pentium 4's pipeline, more haste less speed...) But the question of how much of a gradient we care about remains. It's either local, or it's not local. The question is latency, not throughput. (Rambus did this too, more throughput less latency...) Lots of things use loops in an attempt to get fixed latency: stuff wanders by at known intervals so it's easy to fill up slots on the bus because you know when your slot will be coming by... NUMA is also a question of latency. Gimme high end fiber stuff and I could have a multi-gigabit pipe between two machines in different buildings.
Latency will still make it less fun to try to page access DRAM through than your local memory bus, regardless of relative throughput. > > And then, of course, there's the whole question of 3D accelerated video > > card texture memory, and trying to stick THAT into a zone. :) (Eew! > > Eew! Eew!) Yeah, it IS a can of worms, isn't it? > > Your big powerful NUMA server is going to be used to play Quake on? ;-) > Same issue for net cards, etc though I guess. Not Quake, video capture and streaming. Big market there, which beowulf clusters can address today, but in a fairly clumsy way. (The sane way to program that is to have one node dispatching/accepting frames to other nodes, so beowulf isn't so bad. But message passing is not a way to control latency, and latency is your real problem when you want to avoid dropping frames. Buffering helps this, though. Five seconds of buffer space covers a multitude of sins...) > > But class/zone lists still seem fine for processors. It's just a > > question of doing the detective work for memory allocation up front, as > > it were. If you can't figure it out up front, how the heck are you > > supposed to do it efficiently at allocation time? > > If I understand what you mean correctly, we should be able to lay out > the topology at boot time, and work out which phys mem locations will > be faster / slower from any given resource (proc, PCI, etc). Ask Andrea. I THINK so, but I'm not the expert. (And Linus seems to disagree, and he tends to have good reasons. :) > > This > > chunk of physical memory can be used as DMA buffers for this PCI bridge, > > which can only be addressed directly by this group of processors anyway > > because they share the IO-APIC it's wired to... > > Hmmm ... at least in the hardware I'm familiar with, we can access any PCI > bridge or any IO-APIC from any processor. Slower, but functional. Is the speed difference along a noticeably long gradient, or more "this group is fast, the rest is not so fast"?
And do the bridges and IO-APICs cluster with processors into something that looks like nodes, or do they overlap in a less well defined way? > > Um, can bounce buffers become permanent page migration to another zone? (Since > > we have to allocate the page ANYWAY, might as well leave it there till > > it's evicted, unless of course we're very likely to evict it again pronto > > in which case we want to avoid bouncing it back... > > As I understand zones, they're physical, therefore pages don't migrate > between them. And processors are physical, so tasks don't migrate between them? > The data might be copied from the bounce buffer to a > page in another zone, but ... Virtual page, physical page... > Not sure if we're using quite the same terminology. Feel free to correct > me. I'm more likely to receive correction. I'm trying to learn and understand the problem... > > Hmmm... Then under NUMA there > > would be the "processor X can't access page in new location easily to > > fill it with new data to DMA out..." Fun fun fun...) > > On the machines I'm used to, there's no problem with "can't access", just > slower or faster. Well, with shared memory beowulf clusters you could have a tasklet on the other machine lock the page and spit you a copy of the data, so "can't" doesn't work there either. That's where the word "easily" came in... But an attempt to DMA into or out of that page from another node would involve bounce buffers on the other node... > > Since 2.4 isn't supposed to handle NUMA anyway, I don't see what > > difference it makes. Just use ANYTHING that stops the swap storms, > > lockups, zone starvation, zero order allocation failures, bounce buffer > > shortages, and other such fun we were having a few versions back. (Once > > again, this part now seems to be in the "it works for me"(tm) stage.) > > > > Then rip it out and start over in 2.5 if there's stuff it can't do.
> > I'm not convinced that changing directions all the time is the most > efficient way to operate No comment on the 2.4.0-2.4.10 VM development process will be made by me at this time. > - it would be nice to keep building on work > already done in 2.4 (on whatever subsystem that is) rather than rework > it all, but maybe that'll happen anyway, so .... At one point I thought the purpose of a stable series was to stabilize, debug, and tweak what you'd already done, and architectural changes went in development series. (Except for the occasional new driver.) As I said, I tend to be wrong about stuff... > Martin. Rob ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: NUMA & classzones (was Whining about 2.5) 2001-10-04 23:55 ` Rob Landley @ 2001-10-05 17:29 ` Martin J. Bligh 2001-10-06 1:44 ` Jesse Barnes 1 sibling, 0 replies; 28+ messages in thread From: Martin J. Bligh @ 2001-10-05 17:29 UTC (permalink / raw) To: landley, Rik van Riel; +Cc: linux-kernel >> I'll preface this by saying I know a little about the IBM NUMA-Q >> (aka Sequent) hardware, but not very much about VM (or anyone >> else's NUMA hardware). > > I saw the IBM guys in Austin give a talk on it last year, which A) had more > handwaving than Star Wars episode zero, B) had FAR more info about politics > in the AIX division than about NUMA, C) involved the main presenter letting > us know he was leaving IBM at the end of the week... Oops. I disclaim all knowledge. I gave a brief presentation at OLS. The slides are somewhere .... but they probably don't make much sense without words. http://lse.sourceforge.net/numa/mtg.2001.07.25/minutes.html under "Porting Linux to NUMA-Q". > Kind of like getting details about CORBA out of IBM. And I worked there > when I was trying to do that. (I was once in charge of implementing CORBA > compliance for a project, and all they could find me to define it at the time > was a marketing brochure. Sigh...) IBM is huge - don't tar us all with the same brush ;-) There are good parts and bad parts ... >> >> This gives obvious problems for NUMA, suppose you have 4 >> >> nodes with zones 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C, 4A, >> >> 4B and 4C. >> > >> > Is there really a NUMA machine out there where you can DMA out of another >> > node's 16 bit ISA space? So far the differences in the zones seem to be >> >> If I understand your question (and my hardware) correctly, then yes. I >> think we (IBM NUMA-Q) can DMA from anywhere to anywhere (using PCI cards, >> not ISA, but we could still use the ISA DMA zone). > > Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm > impressed.
> (It was more a "when do we care" question...) No, that's not what I said (though we do have some perverse bus in there ;-)) I said we can DMA out of the first physical 16Mb of RAM (known as the ISA DMA zone) on any node into any other node (using a PCI card). Or at least that's what I meant ;-) > Two points: > > 1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it? > Where does the NUMA come in? (I guess it's more expensive to DMA between > certain devices/memory pages? Or are we talking sheer processor access > latency here, nothing to do with devices at all...?) It takes about 10 times longer to DMA or read memory from a remote node's RAM than the local node's RAM. It's "Non-uniform" in terms of access speed, not capability. I don't have latency / bandwidth exact ratios to hand, but it's in the order of 10:1 on our boxes. > 2) A processor-centric view of memory zones is not the whole story. Look at > the zones we have now. The difference between the ISA zone, the PCI zone, > and high memory has nothing to do with the processor*. True (for the ISA and PCI). I'd say that the difference between NORMAL and HIGHMEM has everything to do with the processor. Above 4Gb you're talking about 36 bit stuff, but HIGHMEM is often just stuff from 900Mb or so to 4Gb. > It's a question of > which devices (which bus/bridge really) can talk to which pages. In current > UP/SMP systems, the processor can talk to all of them pretty much equally. That's true of the NUMA systems I know too (though maybe not all NUMA systems). > So we need zones defined relative not just to processors (or groups of > processors that have identical access profiles), but also defined relative to > i/o devices and busses. Meaning zones may become a driver issue. > > This gets us back to the concept of "nodes". Groups of processors and > devices that collectively have a similar view of the world, memory-wise. Is > this a view of the problem that current NUMA thinking is using, or not?
More or less. A node may not internally be symmetric - some processors may be closer to each other than others. I guess we can redefine "node" at that point to mean the more tightly coupled groups of processors, but those procs may still have uniform access to the same physical memory, so the definition gets looser. >> But it probably doesn't make sense to define A,B, and C for each node. For >> a start, we don't use ISA DMA (and probably no other NUMA box does >> either). If HIGHMEM is the stuff above 900Mb or so (and assuming for a >> moment that we have 1Gb per node), then we probably don't need >> NORMAL+HIGHMEM for each node either. >> >> 0-900Mb = NORMAL (1A) >> 900-1Gb = HIGHMEM_NODE1 (1B) >> 1G-2Gb = HIGHMEM_NODE2 (2) >> 2G-3Gb = HIGHMEM_NODE3 (3) >> 3Gb-4Gb = HIGHMEM_NODE4 (4) > By highmem you mean memory our I/O devices can't DMA out of? No. I mean stuff that's not permanently mapped to virtual memory (as I understand it, that's anything over about 900Mb, but then I don't understand it all that well, so ... ;-) ) > Will all the I/O devices in the system share a single pool of buffer memory, > or will devices be attached to nodes? > > (My thinking still turns to making shared memory beowulf clusters act like > one big system. The hardware for that will continue to be cheap: rackmount a > few Tyan Thunder dual Athlon boards. You can distribute drives for storage > and swap space (even RAID them if you like), and who says such a cluster has > to put all external access through a single node? In the case of the hardware I know about (and the shared mem beowulf clusters), you can attach I/O devices to any node (we have 2 PCI buses & 2 I/O APICs per node). OK, so at the moment the released code only uses the buses on the first node, but I have code to fix that that's in development (it works, but it's crude).
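The hypothetical per-node layout quoted above (NORMAL below ~900Mb on node 1, then one HIGHMEM zone per 1Gb node) amounts to a simple physical-address-to-zone lookup. A C sketch of it, using the example's boundaries and node numbering rather than any real kernel constants:

```c
#include <assert.h>

#define MB            (1024UL * 1024UL)
#define NODE_SIZE     (1024UL * MB)  /* the example's 1Gb of RAM per node */
#define NORMAL_LIMIT  (900UL * MB)   /* permanently mapped memory ends ~900Mb */

/* Which node a physical address lives on, with 1Gb contiguous per node
 * and nodes numbered from 1 as in the table above. */
unsigned long phys_to_node(unsigned long phys)
{
    return phys / NODE_SIZE + 1;
}

/* Returns 0 for the NORMAL zone (which lives on node 1), or the node
 * number of the HIGHMEM_NODEn zone the address falls in. */
unsigned long phys_to_zone(unsigned long phys)
{
    if (phys < NORMAL_LIMIT)
        return 0;                    /* NORMAL: below ~900Mb */
    return phys_to_node(phys);       /* HIGHMEM_NODEn */
}
```

Note how the 900Mb-1Gb sliver of node 1 falls into HIGHMEM_NODE1 while the rest of node 1 is NORMAL — the "one node ends up with 2 zones" case mentioned in the quote.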
What's really nice is to multi-drop connect a SAN using fibre-channel cards to each and every node (normally 2 cards per node for redundancy). A disk access on any node then gets routed through the local SAN interface, rather than across the interconnect. Much faster. Outbound net traffic is the same, inbound is harder. >> If we have less than 1Gb per node, then one of the other nodes will have 2 >> zones - whichever contains the transition point from NORMAL-> HIGHMEM. > So "normal" belongs to a specific node, so all devices basically belong to > that node? NORMAL = <900Mb. If I have >900Mb of mem in the first node, then NORMAL belongs to that node. There's nothing to stop any device DMAing into things outside the NORMAL (see Jens' patches to reduce bounce bufferness) zone - they use physaddrs - with 32bit PCI, that means the first 4Gb, with 64bit, pretty much anywhere. Or at least that's how I understand it until somebody tells me different. And no, that doesn't mean that all devices belong to that node. Even if you say I can only DMA into the normal zone, a device on node 3, with no local memory in the normal zone just DMAs over the interconnect. And, yes, that takes some hardware support - that's why NUMA boxes ain't cheap ;-) > You still have the problem of doing DMA. Now this is a separable problem > boiling down to either allocation and locking of DMAable buffers the > processor can directly access, or setting up bounce buffers when the actual > I/O is kicked off. (Or doing memory mapped I/O, or PIO. But all that is > still a bit like one big black box, I'd think. And to do it right, you need > to know which device you're doing I/O to, because I really wouldn't assume > every I/O device on the system shares the same pool of DMAable memory. Or Yes it does (for me at least). To access some things, I take the PCI dev pointer, work out which bus the PCI card is attached to, do a bus->node map, and that gives me the answer.
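The dev pointer -> bus -> node lookup described there is essentially a pair of table lookups. A minimal C sketch with invented structures and numbers (not the actual NUMA-Q code), assuming the 2-PCI-buses-per-node arrangement mentioned earlier:

```c
#include <assert.h>

#define MAX_BUSES 8

/* One entry per PCI bus: which node that bus hangs off.  With 2 PCI
 * buses per node the table might look like this (numbers invented). */
static const int bus_to_node[MAX_BUSES] = {
    0, 0,    /* buses 0-1 live on node 0 */
    1, 1,    /* buses 2-3 live on node 1 */
    2, 2,
    3, 3,
};

/* Stand-in for the PCI dev pointer: all we need here is the bus number. */
struct toy_pci_dev {
    int bus_number;
};

/* dev -> bus -> node, as described in the mail. */
int dev_to_node(const struct toy_pci_dev *dev)
{
    if (dev->bus_number < 0 || dev->bus_number >= MAX_BUSES)
        return -1;                   /* unknown bus */
    return bus_to_node[dev->bus_number];
}
```

Once a driver knows which node its card hangs off, it can prefer that node's memory for DMA buffers instead of DMAing over the interconnect.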
> that we haven't got stuff like graphics cards that have their own RAM we map > into our address space. Or, for that matter, that physical memory mapping in > one node makes anything directly accessible from another node.) For me at least, all memory, whether mem-mapped to a card or not, is accessible from everywhere. Port I/O I have to play some funky games for, but that's a different story. > Not just per processor. Think about a rackmount shared memory beowulf > system, page faulting through the network. With quad-processor boards in > each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to > act like NUMA. The interconnect would need hardware support for NUMA to do the cache coherency and transparent remote memory access. > Each 1U has four processors with identical performance (and probably one set > of page tables if they share a northbridge). Assembling NUMA systems out of > closely interconnected SMP systems. That's exactly what the NUMA-Q systems are. More or less standard 4-way Intel boxes, with an interconnect doing something like 10Gb/second with reasonably low latency. Up to 16 quads = 64 processors. Current code is limited to 32 procs on Linux. > It may not be a driver issue. It may be a bus issue. If there are two PCI > busses in the system, do they HAVE to share one set of physical memory > mappings? (NUMA sort of implies we have more than one northbridge. Dontcha > think we might have more than one southbridge, too?) I fear I don't understand this. Not remembering what north vs south did again (I did know once) probably isn't helping ;-) But yes, everyone shares the same physical memory map, at least on NUMA-Q. > It's not just 32/64 bit DMA. You're assuming every I/O device in the system > is talking to exactly the same pool of memory. The core assumption of NUMA > is that the processors aren't doing that, so I don't know why the I/O devices > necessarily should. (Maybe they do, what do I know.
> It would be nice to hear from somebody with actual information...) No, the core assumption of NUMA isn't that everyone's not talking to the same pool of memory, it's that talking to different parts of the pool isn't done at a uniform speed. > And if they ARE all talking to one pool of memory, then the whole NUMA > question becomes a bit easier, actually... The flood of zones we were so > worried about (Node 18's processor sending packets through a network card > living on node 13) can't really happen, can it? You still want to allocate memory locally for performance reasons, even though it'll work. My current port to NUMA-Q doesn't do that, and the performance will probably get a lot better when it does. We still need to split different nodes' memory into different zones (or find another similar solution). >> > I always think of numa as the middle of a continuum. Zillion-way SMP >> > with enormous L1 caches on each processor starts acting a bit like NUMA >> > (you don't wanna go out of cache and fight the big evil memory bus if you >> > can at all avoid it, and we're already worrying about process locality >> > (processor affinity) to preserve cache state...). >> >> Kind of, except you can explicitly specify which bits of memory you want to >> use, rather than the hardware working it out for you. > > Ummm... I mean if you have a process running on node 1, you can tell it to allocate memory on node 1 (or you could if the code was there ;-) ). Processes on node 3 get memory on node 3, etc. In an "enormous L1 cache" the hardware works out where to put things in the cache, not the OS. > Is the memory bus somehow physically reconfiguring itself to make some chunk > of memory lower or higher latency when talking to a given processor? I'm > confused... Each node has its own bank of RAM. If I access the RAM in another node, I go over the interconnect, which is a damned sight slower than just going over the local memory bus.
The interconnect plugs into the local memory bus of each node, and transparently routes requests around to other nodes for you. Think of it like a big IP based network. Each node is a LAN, with its own subnet. The interconnect is the router connecting the LANs. I can do 100Mbps over the local LAN, but only 10Mbps through the router to remote LANs. This email is far too big, and I have to go to a meeting. I'll reply to the rest of it later ;-) M. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: NUMA & classzones (was Whining about 2.5) 2001-10-04 23:55 ` Rob Landley 2001-10-05 17:29 ` Martin J. Bligh @ 2001-10-06 1:44 ` Jesse Barnes 1 sibling, 0 replies; 28+ messages in thread From: Jesse Barnes @ 2001-10-06 1:44 UTC (permalink / raw) Cc: linux-kernel These are some long messages... I'll do my best to reply w/respect to SGI boxes. Maybe it's time I got it together and put one of our NUMA machines on the 'net. A little background first though, for those that aren't familiar with our NUMA arch. All this info should be available at http://techpubs.sgi.com I think, but I'll briefly go over it here. Our newest systems are made of 'bricks'. There are a few different types of brick: c, r, i, p, d, etc. C-bricks are simply a collection of 0-4 CPUs and some amount of memory. R-bricks are simply a collection of NUMAlink ports. I-bricks have a few PCI busses and a couple of disks. P-bricks have a bunch of PCI busses, and D-bricks have a bunch of disks. Each brick has at least one IO port; C-bricks also have a NUMAlink port that can be connected to other C-bricks or R-bricks. So remote memory accesses have to go out a NUMAlink to an R-brick and in through another NUMAlink on another C-brick on all but the smallest systems. P and I bricks are connected directly to C-bricks, while D-bricks are connected via fibrechannel to SCSI cards on either P or I bricks. On Thu, 4 Oct 2001, Rob Landley wrote: > Somebody made a NUMA machine with an ISA bus? Wow. That's perverse. I'm > impressed. (It was more a "when do we care" question...) I certainly hope not!! It's bad enough that people have Pentium NUMA machines; I don't envy the people that had to bring those up. > Two points: > > 1) If you can DMA from anywhere to anywhere, it's one big zone, isn't it? > Where does the NUMA come in? (I guess it's more expensive to DMA between > certain devices/memory pages? Or are we talking sheer processor access > latency here, nothing to do with devices at all...?) On SGI, yes.
We've got both MIPS and IPF (formerly known as IA64) NUMA machines. Since both are 64 bit, things are much easier than with say, a Pentium. PCI cards that are 64 bit DMA capable can read/write any memory location on the machine. 32 bit cards can as well, with the help of our PCI bridge. Ideally though, you'd like to DMA to/from devices that are directly connected to the memory on their associated C-brick, otherwise you've got to hop over other nodes to get to your memory (=higher latency, possible bandwidth contention). I hope this answers your question. > 2) A processor-centric view of memory zones is not the whole story. Look at > the zones we have now. The difference between the ISA zone, the PCI zone, > and high memory has nothing to do with the processor*. It's a question of > which devices (which bus/bridge really) can talk to which pages. In current > UP/SMP systems, the processor can talk to all of them pretty much equally. You can think of all the NUMA systems I know of that way as well, but you get higher performance if you're careful about which pages you talk to. > So we need zones defined relative not just to processors (or groups of > processors that have identical access profiles), but also defined relative to > i/o devices and busses. Meaning zones may become a driver issue. > > This gets us back to the concept of "nodes". Groups of processors and > devices that collectively have a similar view of the world, memory-wise. Is > this a view of the problem that current NUMA thinking is using, or not? Yup. We have pg_data_t for just that purpose, although it currently only has information about memory, not total system topology (i.e. I/O devices, CPUs, etc.). > > But it probably doesn't make sense to define A,B, and C for each node. For > > a start, we don't use ISA DMA (and probably no other NUMA box does > > either). 
> > If HIGHMEM is the stuff above 900Mb or so (and assuming for a > > moment that we have 1Gb per node), then we probably don't need > > NORMAL+HIGHMEM for each node either. > > > > 0-900Mb = NORMAL (1A) > > 900-1Gb = HIGHMEM_NODE1 (1B) > > 1G-2Gb = HIGHMEM_NODE2 (2) > > 2G-3Gb = HIGHMEM_NODE3 (3) > > 3Gb-4Gb = HIGHMEM_NODE4 (4) > > By highmem you mean memory our I/O devices can't DMA out of? On the NUMA-Q platform, probably (but I'm not sure since I've never worked on one). > Will all the I/O devices in the system share a single pool of buffer memory, > or will devices be attached to nodes? Both. At least for us. > Not just per processor. Think about a rackmount shared memory beowulf > system, page faulting through the network. With quad-processor boards in > each 1U, and BLAZINGLY FAST interconnects in the cluster. Now teach that to > act like NUMA. Sounds familiar. It's much easier when your memory controllers are aware of this fact though... > It's not just 32/64 bit DMA. You're assuming every I/O device in the system > is talking to exactly the same pool of memory. The core assumption of NUMA > is that the processors aren't doing that, so I don't know why the I/O devices > necessarily should. (Maybe they do, what do I know. It would be nice to > hear from somebody with actual information...) I think the core assumption is exactly the opposite, but you're correct if you're talking about simple clusters. > And if they ARE all talking to one pool of memory, then the whole NUMA > question becomes a bit easier, actually... The flood of zones we were so > worried about (Node 18's processor sending packets through a network card > living on node 13) can't really happen, can it? I guess it depends on the machine. On our machine, you could do that, but it looks like the IBM machine would need bounce buffers for such things. You might also need bounce buffers for 32 bit PCI cards for some machines, since they might only be able to DMA to/from the first 4 GB of memory.
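The 32-bit-card limitation mentioned there is what DMA masks express: a card can only reach physical addresses that fit within its mask, and anything above needs a bounce buffer (or, on hardware like SGI's, a PCI bridge that remaps the access). A minimal sketch of the addressability check, simplified and with invented names:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_MASK_32BIT 0xffffffffULL   /* a 32-bit card sees the first 4GB */

/* Can a device with this DMA mask reach the physical address directly?
 * If any address bit falls outside the mask, the device can't, and the
 * kernel must bounce-buffer (or the bridge must remap). */
int dma_addressable(uint64_t phys_addr, uint64_t dma_mask)
{
    return (phys_addr & ~dma_mask) == 0;
}
```

So a buffer at 4GB+ fails the check for a 32-bit card but passes for a 64-bit-capable one — which is why 64-bit DMA-capable cards on these machines can read/write any memory location while 32-bit cards may need help.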
> I THOUGHT numa had a gradient, of "local, not as local, not very local at > all, darn expensive" pages that differed from node to node, which would be a That's pretty much right. But for most things it's not *too* bad to optimize for, until you get into *huge* machines (e.g. 1024p, lots of memory). > major pain to optimize for yes. (I was thinking motherboard trace length and > relaying stuff several hops down a bus...) But I haven't seen it yet. And > even so, "not local=remote" seems to cover the majority of the cases without > exponential complexity... Yeah, luckily, you can assume local==remote and things will work, albeit slowly (ask Ralf about forgetting to turn on CONFIG_NUMA on one of our MIPS machines). > That's what I've heard the most about. I'm also under the impression that > SGI was working on NUMA stuff up around the origin line, and that sun had > some monsters in the works as well... AFAIK, Sun just has SMP machines, but they might have a NUMA one in the pipe. And yes, we've had NUMA stuff for awhile, and recently got a 1024p system running. There were some weird bottlenecks exposed by that one. > People wanting to run straightforward multithreaded programs using shared > memory and semaphores on big clusters strikes me as an understandable goal, > and the drive for fast (low latency) interconnects to make that feasible is > something I can see a good bang for the buck coming out of. Here's the > hardware that's widely/easily/cheaply available, here's what programmers want > to do with it. I can see that. > > The drive to support monster mainframes which are not only 1% of the market > but which get totally redesigned every three or four years to stay ahead of > moore's law... I'm not quite sure what's up there. How much of the market > can throw that kind of money to constantly offset massive depreciation? 
> Is the commodity hardware world going to inherit NUMA (via department level
> shared memory beowulf clusters, or just plain the hardware to do it getting
> cheap enough), or will it remain a niche application?

Maybe?

> > Incidentally, swapping on NUMA will need per-zone swapping even more,
> > so I don't see how we could do anything sensible for this without a
> > physical to virtual mem map.  But maybe someone knows how.

I know Kanoj was talking about this awhile back; don't know if he ever
came up with any code though...

> The question is latency, not throughput.  (Rambus did this too, more

Bingo!  Latency is absolutely key to NUMA.  If you have really bad
latency, you've basically got a cluster.  The programming model is
greatly simplified though, as you mentioned above (i.e. shared mem
multithreading vs. MPI).

> Is the speed difference along a noticeably long gradient, or more "this group
> is fast, the rest is not so fast"?

Depends on the machine.  I think IBM's machines have something like a
10:1 ratio of remote vs. local memory access latency, while SGI's
latest have something like 1.5:1.  That's per-hop though, so big
machines can be pretty non-uniform.

I hope I've at least refrained from muddying the NUMA waters further
with my ramblings.  I'll keep an eye on subjects with 'NUMA' or 'zone'
though, just so I can be more informed about these things.

Jesse

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2)
  2001-10-03 19:55 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Rob Landley
  2001-10-04  0:38   ` Rik van Riel
@ 2001-10-04 21:02   ` Alan Cox
  1 sibling, 0 replies; 28+ messages in thread
From: Alan Cox @ 2001-10-04 21:02 UTC (permalink / raw)
  To: landley; +Cc: Alexander Viro, Christoph Hellwig, linux-kernel, Linus Torvalds

> question, I know), which VM will it use?  I'm guessing Alan will still
> inherit the "stable" codebase, but the -ac and -linus trees are breaking new
> ground on divergence here.  Which tree becomes 2.4 once Alan inherits it?
> (Is this part of what's holding up 2.5?)

For the moment I plan to maintain the 2.4.*-ac tree.  I don't know what
will happen about 2.4 longer term - that is a Linus question.

Looking at past VM history, I don't think we will eliminate enough
"2.4.10+ oops on my box" and "on this load the VM sucks" cases from
2.4.10 to fairly review Andrea's VM until Linus has done another 5 or 6
releases and the VM has been tuned, bugs removed, and other oops cases
proven not to be vm triggered.

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Buffer cache confusion? Re: [reiserfs-list] bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2
  2001-10-03 12:17 bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Vladimir V. Saveliev
  2001-10-03 13:16 ` [PATCH] " Alexander Viro
@ 2001-10-03 21:09 ` Eric Whiting
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Whiting @ 2001-10-03 21:09 UTC (permalink / raw)
  To: linux-kernel, jfs-discussion@oss.software.ibm.com
  Cc: reiserfs-list, Chris Mason

I see a similar odd failure with jfs in 2.4.11pre1.  Is this related to
the 2.4.11preX buffer cache improvements?

eric

# uname -a
2.4.11-pre1 #1 SMP Tue Oct 2 12:28:07 MDT 2001 i686

# mkfs.jfs /dev/hdc3
mkfs.jfs development version: $Name: v1_0_6 $
Warning!  All data on device /dev/hdc3 will be lost!
Continue? (Y/N) Y
Format completed successfully.
10241436 kilobytes total disk space.

# mount -t jfs /dev/hdc3 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/hdc3,
       or too many mounted file systems

"Vladimir V. Saveliev" wrote:
>
> Hi
>
> It looks like something wrong happens with writing/reading to block
> device using generic read/write functions when one does:
>
> mke2fs /dev/hda1 (blocksize is 4096)
> mount /dev/hda1
> umount /dev/hda1
> mke2fs /dev/hda1 - FAILS with
> Warning: could not write 8 blocks in inode table starting at 492004:
> Attempt to write block from filesystem resulted in short write
>
> (note that /dev/hda1 should be big enough - 3gb is enough for example)
>
> Explanation of what happens (could be wrong and unclear):
>
> blocksize of /dev/hda1 was 1024.  So, /dev/hda1's inode->i_blkbits is
> set to 10.
> mount-ing used set_blocksize() to change blocksize to 4096 in
> blk_size[][].
> But inode of /dev/hda1 still has i_blkbits which makes
> block_prepare_write create buffers of 1024 bytes and call
> blkdev_get_block for each of them.
> fs/block_dev.c:/max_block calculates number of blocks on the device
> using blk_size[][] and thinks that there are 4 times less blocks on
> the device.
>
> Thanks,
> vs
>
> PS: thanks to Elena <grev@namesys.botik.ru> for finding that

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2001-10-08 19:39 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-10-03 12:17 bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Vladimir V. Saveliev
2001-10-03 13:16 ` [PATCH] " Alexander Viro
2001-10-03 16:18 ` Linus Torvalds
2001-10-03 21:43 ` Alexander Viro
2001-10-03 21:56 ` Christoph Hellwig
2001-10-03 22:51 ` Alexander Viro
2001-10-03 19:55 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Rob Landley
2001-10-04  0:38 ` Rik van Riel
2001-10-03 22:27 ` Rob Landley
2001-10-04 20:53 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2O Alan Cox
2001-10-04 23:59 ` Whining about NUMA. :) [Was whining about 2.5...] Rob Landley
2001-10-05 14:51 ` Alan Cox
2001-10-08 17:57 ` Martin J. Bligh
2001-10-08 18:10 ` Alan Cox
2001-10-08 18:20 ` Martin J. Bligh
2001-10-08 18:31 ` Alan Cox
2001-10-08 18:35 ` Jesse Barnes
2001-10-08 18:55 ` Martin J. Bligh
2001-10-08 17:48 ` Marcelo Tosatti
2001-10-08 19:20 ` Martin J. Bligh
2001-10-08 19:12 ` Jesse Barnes
2001-10-08 19:37 ` Peter Rival
2001-10-04 23:39 ` NUMA & classzones (was Whining about 2.5) Martin J. Bligh
2001-10-04 23:55 ` Rob Landley
2001-10-05 17:29 ` Martin J. Bligh
2001-10-06  1:44 ` Jesse Barnes
2001-10-04 21:02 ` Whining about 2.5 (was Re: [PATCH] Re: bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2) Alan Cox
2001-10-03 21:09 ` Buffer cache confusion? Re: [reiserfs-list] bug? in using generic read/write functions to read/write block devices in 2.4.11-pre2 Eric Whiting