* vm lock contention reduction @ 2002-07-04 23:05 Andrew Morton 2002-07-04 23:26 ` Rik van Riel 2002-07-04 23:27 ` Rik van Riel 0 siblings, 2 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-04 23:05 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds I seem to now have a set of patches which fix the pagemap_lru_lock contention for some workloads. They also move the entire page allocation/reclaim/pagecache I/O functions away from page-at-a-time and make them use chunks of 16 pages at a time. The intent of this is to get the effect of large PAGE_CACHE_SIZE without actually doing that. Overall lock contention is reduced by 85-90% and pagemap_lru_lock contention is reduced by maybe 98%. For workloads where the inactive list is dominated by pagecache. If the machine is instead full of anon pages then everything is still crap because the page reclaim code is scanning zillions of pages and not doing much useful with them. In some ways the VM locking is more complex, because we need to cope with pages which aren't on the LRU. In some ways the locking is simpler because pagemap_lru_lock becomes an "innermost" lock. Relevant patches are: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/page-flags-atomicity.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/pagevec.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/shrink_cache-pagevec.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/anon-pagevec.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/mpage_writepages-batch.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/batched-lru-add.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/batched-lru-del.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/lru-lock-irq-off.patch http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/lru-mopup.patch My vague plan was to wiggle rmap on top of this work, for two reasons: 1: So it is easy to maintain an rmap backout patch, to aid in comparison and debugging and 2: to give a reasonable basis for evaluation of rmap CPU efficiency. But frankly, I've written and rewritten this code three times so far and I'm still not really happy with it. Probably it is more sensible to get the reverse mapping code into the tree first, and I get to reimplement the CPU efficiency work a fourth time :( So I'll flush the rest of my current patchpile at Linus and go take a look at O_DIRECT for a while. I'll shelve this lock contention work until we have an rmap patch for 2.5. Rik, do you have an estimate on that? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-04 23:05 vm lock contention reduction Andrew Morton @ 2002-07-04 23:26 ` Rik van Riel 2002-07-04 23:27 ` Rik van Riel 1 sibling, 0 replies; 95+ messages in thread From: Rik van Riel @ 2002-07-04 23:26 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, 4 Jul 2002, Andrew Morton wrote: > I'll shelve this lock contention work until we have an rmap patch > for 2.5. Rik, do you have an estimate on that? I've recovered from the flight back and the flu that hit me just after OLS, so it should be pretty soon. I've just applied Craig Kulesa's patch on top of the latest 2.5 bk tree and will put a few small fixes on top of that (eg. new ARM pagetable layout). regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-04 23:05 vm lock contention reduction Andrew Morton 2002-07-04 23:26 ` Rik van Riel @ 2002-07-04 23:27 ` Rik van Riel 2002-07-05 1:37 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: Rik van Riel @ 2002-07-04 23:27 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, 4 Jul 2002, Andrew Morton wrote: > If the machine is instead full of anon pages then everything is still > crap because the page reclaim code is scanning zillions of pages and not > doing much useful with them. This is something that can be fixed with rmap, because the kernel _will_ be able to do something useful with the anon pages. Now we just need to get Arjan to tune the O(1) page launder thing he was looking at ;) Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-04 23:27 ` Rik van Riel @ 2002-07-05 1:37 ` Andrew Morton 2002-07-05 1:49 ` Rik van Riel 0 siblings, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 1:37 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds Rik van Riel wrote: > > On Thu, 4 Jul 2002, Andrew Morton wrote: > > > If the machine is instead full of anon pages then everything is still > > crap because the page reclaim code is scanning zillions of pages and not > > doing much useful with them. > > This is something that can be fixed with rmap, because the > kernel _will_ be able to do something useful with the anon > pages. I think that would be quite useful - we just need to be sure that if the pages aren't added to swapcache we should park them up on the active list out of the way. The refill_inactive() logic in 2.5.24 seems a bit wonky at present. In here: ratio = (unsigned long)nr_pages * ps.nr_active / ((ps.nr_inactive | 1) * 2); refill_inactive(ratio); `ratio' tends to evaluate to zero if you have a ton of inactive or pagecache pages. So the VM is not recirculating old pages off the inactive list at all. I changed it to add 1 to ratio. Also to only call refill_inactive when the number-to-process reaches 32 pages - the kernel tends to settle into steady states where it's calling refill_inactive for just one or two pages, which is a waste of CPU. Didn't make much observable difference. > Now we just need to get Arjan to tune the O(1) page launder > thing he was looking at ;) We keep seeing mysterious references to this. What is the idea behind it? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 1:37 ` Andrew Morton @ 2002-07-05 1:49 ` Rik van Riel 2002-07-05 2:18 ` Andrew Morton 0 siblings, 1 reply; 95+ messages in thread From: Rik van Riel @ 2002-07-05 1:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, 4 Jul 2002, Andrew Morton wrote: > Rik van Riel wrote: > > This is something that can be fixed with rmap, because the > > kernel _will_ be able to do something useful with the anon > > pages. > > I think that would be quite useful - we just need to be sure that if the > pages aren't added to swapcache we should park them up on the active > list out of the way. Absolutely, that is critical. It is all too easy to end up with a TON of non-swappable pages on the inactive list and all the easily evictable pages on the active list. If only because the easily evictable pages tend to disappear quickly and the non-swappable ones stick around forever. > > Now we just need to get Arjan to tune the O(1) page launder > > thing he was looking at ;) > > We keep seeing mysterious references to this. What is the idea > behind it? The idea is that when pages are evicted from the system they traverse the inactive list _once_. If a page is dirty, IO is started and the page is added to the laundry list, if a page is clean it is moved to the clean list. Every time we need more free pages we first check the clean list (all pages there are freeable, guaranteed) and the first (few?) pages of the laundry list. We continue taking pages off of the laundry list until we've run into {a, too many} unfreeable pages. This way we won't scan the inactive pages over and over again every time we free a few. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 1:49 ` Rik van Riel @ 2002-07-05 2:18 ` Andrew Morton 2002-07-05 2:16 ` Rik van Riel 2002-07-05 23:11 ` William Lee Irwin III 0 siblings, 2 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-05 2:18 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds Rik van Riel wrote: > > ... > > > Now we just need to get Arjan to tune the O(1) page launder > > > thing he was looking at ;) > > > > We keep seeing mysterious references to this. What is the idea > > behind it? > > The idea is that when pages are evicted from the system they > traverse the inactive list _once_. > > If a page is dirty, IO is started and the page is added to the > laundry list, if a page is clean it is moved to the clean list. > > Every time we need more free pages we first check the clean list > (all pages there are freeable, guaranteed) and the first (few?) > pages of the laundry list. We continue taking pages off of the > laundry list until we've run into {a, too many} unfreeable pages. > > This way we won't scan the inactive pages over and over again > every time we free a few. > OK. One of the changes I made to pagemap_lru_lock was to always take it with spin_lock_irq(). It's never held for more than 10-20 microseconds, and disabling interrupts while holding it decreased contention by around 30% with four (slow) CPUs. So that's a good change regardless, and it lets us move pages off the laundry list within the IO completion interrupt (but not one-at-a-time!). In fact, they don't need to be on any list at all while under writeback. Of course, that change means that we wouldn't be able to throttle page allocators against IO any more, and we'd have to do something smarter. What a shame ;) - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 2:18 ` Andrew Morton @ 2002-07-05 2:16 ` Rik van Riel 2002-07-05 2:53 ` Andrew Morton 2002-07-05 4:47 ` Linus Torvalds 2002-07-05 23:11 ` William Lee Irwin III 1 sibling, 2 replies; 95+ messages in thread From: Rik van Riel @ 2002-07-05 2:16 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, 4 Jul 2002, Andrew Morton wrote: > Of course, that change means that we wouldn't be able to throttle > page allocators against IO any more, and we'd have to do something > smarter. What a shame ;) We want something smarter anyway. It just doesn't make sense to throttle on one page in one memory zone while the pages in another zone could have already become freeable by now. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 2:16 ` Rik van Riel @ 2002-07-05 2:53 ` Andrew Morton 2002-07-05 3:52 ` Benjamin LaHaise 2002-07-05 4:47 ` Linus Torvalds 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 2:53 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds Rik van Riel wrote: > > On Thu, 4 Jul 2002, Andrew Morton wrote: > > > Of course, that change means that we wouldn't be able to throttle > > page allocators against IO any more, and we'd have to do something > > smarter. What a shame ;) > > We want something smarter anyway. It just doesn't make > sense to throttle on one page in one memory zone while > the pages in another zone could have already become > freeable by now. > Or pages from the same zone, indeed. I think what we're saying here is: during writeback, park the pages somewhere out of the way. Page allocators go to sleep on some waitqueue somewhere. Disk completion interrupts put pages back onto the LRU and wake up waiting page allocators. That's all fairly straightforward, but one remaining problem is: who does the writeback? We can get large amounts of page allocation latency due to get_request_wait(), not just wait_on_page(). Generally, the pdflush pool should be sufficient for this but with many spindles it is possible that the kernel will run out of pdflush resources. So we may still need to make page allocating processes start I/O if wakeup_bdflush() fails to find a thread. If so, then the implementation should prefer to make processes which are dirtying memory start the IO. Or provide a non-blocking try_to_submit_bio() for pdflush. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 2:53 ` Andrew Morton @ 2002-07-05 3:52 ` Benjamin LaHaise 0 siblings, 0 replies; 95+ messages in thread From: Benjamin LaHaise @ 2002-07-05 3:52 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, Jul 04, 2002 at 07:53:59PM -0700, Andrew Morton wrote: > Or provide a non-blocking try_to_submit_bio() for pdflush. Ooooh, I need that too. =-) -ben -- "You will be reincarnated as a toad; and you will be much happier." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 2:16 ` Rik van Riel 2002-07-05 2:53 ` Andrew Morton @ 2002-07-05 4:47 ` Linus Torvalds 2002-07-05 5:38 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: Linus Torvalds @ 2002-07-05 4:47 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm On Thu, 4 Jul 2002, Rik van Riel wrote: > > We want something smarter anyway. It just doesn't make > sense to throttle on one page in one memory zone while > the pages in another zone could have already become > freeable by now. That's a load of bull, and a backed up by zero amount of arguments. It only sounds correct, while being absolute crap. We need to throttle. Full stop. Anybody who says that "it makes no sense to throttle on X because of Y" had better back up that statement with _facts_, not some handwaving. In the end, somebody needs to wait. There is no free lunch. And the fact is, the _reason_ for the waiting doesn't much matter, it only needs to be related to the act of freeing pages by some likely metric. And it should wait for _more_ than the one page we absolutely need. OF COURSE you can always end up avoiding to wait for reason Y. That only means that you end up having to wait on reason Z at some later date. The particular example Rik brought up is particularly stupid, since it's obviously crap and not true. The fact is, that when you allocate memory and you're low on memory, you _have_ to wait "more" than necessary. You want to try to get away from the bad situation, so when you start waiting, you should wait a bit "extra", so that you might not need to wait immediately for the very next allocation. And Rik's argument (and I saw people go "nod nod" like marionettes on a string without even thinking about it) is exactly the wrong way around and basically says that you should wait the _minimal_ amount possible. Yet that is NOT correct, because that will absolutely guarantee that _everybody_ ends up always waiting. I saw this argument at the kernel summit too, thinking that waiting the minimal amount of time is somehow "better". It's not. It's much better to try to maek processes wait slightly _longer_ times, and get into a nice balanced setup where you do intensive work for a while, then wait for a while, then do work again, then wait.. Instead of waiting a bit all the time. Think batching. It's _more_ efficient to batch stuff than it is to try to switch back and forth between working and waiting as quickly as you can. So don't just nod your heads when you see something that sounds sane. Think critically. And the critical thinking says: - you should wait the _maximum_ amount that (a) is fair (b) doesn't introduce bad latency issues (c) still allows overlap of IO and processing Get away from this "minimum wait" thing, because it is WRONG. Try to shoot me down, but do so with real logic and real arguments, not some fuzzy feeling about "we shouldn't wait unless we have to". We _do_ have to wait. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 4:47 ` Linus Torvalds @ 2002-07-05 5:38 ` Andrew Morton 2002-07-05 5:51 ` Linus Torvalds 0 siblings, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 5:38 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm Linus Torvalds wrote: > > ... > Think batching. It's _more_ efficient to batch stuff than it is to try to > switch back and forth between working and waiting as quickly as you can. Yup. I've been moaning about this for months. Trivial example: run `vmstat 1' and then start pounding the disk. vmstat will exhibit very long pauses when *clearly* thousands of pages are coming clean every second. Unreasonably long pauses. Sometimes in get_request_wait(), sometimes in shrink_cache->wait_on_page/buffer. We should be giving some of those pages to vmstat more promptly. After all, that process is not a heavy allocator of pages. > So don't just nod your heads when you see something that sounds sane. > Think critically. And the critical thinking says: > > - you should wait the _maximum_ amount that > (a) is fair > (b) doesn't introduce bad latency issues > (c) still allows overlap of IO and processing > > Get away from this "minimum wait" thing, because it is WRONG. Well yes, we do want to batch work up. And a crude way of doing that is "each time 64 pages have come clean, wake up one waiter". Or "as soon as the number of reclaimable pages exceeds zone->pages_min". Some logic would also be needed to prevent new page allocators from jumping the queue, of course. We're still throttling on I/O, but we're throttling against *any* I/O, and not a single randomly-chosen disk block. This scheme is more fair - processes which are allocating more pages get to wait more. > Try to shoot me down, but do so with real logic and real arguments, not > some fuzzy feeling about "we shouldn't wait unless we have to". We _do_ > have to wait. Sure, page allocators must throttle their allocation rate to that at which the IO system can retire writes. But by waiting on a randomly-chosen disk block, we're at the mercy of the elevator. If you happen to choose a page whose blocks are at the far side of the disk, you lose. There could easily be 100 megabytes of reclaimable memory by the time you start running again. We can fit 256 seeks into the request queue. That's 1-2 seconds. I started developing a dumb prototype of this a while back, but it locks up. I'll dust it off and get it going as a "technology demonstration". - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 5:38 ` Andrew Morton @ 2002-07-05 5:51 ` Linus Torvalds 2002-07-05 6:08 ` Linus Torvalds 2002-07-05 6:46 ` Andrew Morton 0 siblings, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-05 5:51 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm On Thu, 4 Jul 2002, Andrew Morton wrote: > > > > Get away from this "minimum wait" thing, because it is WRONG. > > Well yes, we do want to batch work up. And a crude way of doing that > is "each time 64 pages have come clean, wake up one waiter". That would probably work, but we also need to be careful that we don't get into some strange situations (ie we have 50 waiters that all needed memory at the same time, and less than 50*64 pages that caused us to be in trouble, so we only wake up the 46 first waiters and the last 4 waiters get stuck until the next batch even though we now have _lots_ of pages free). Dont't laugh - things like this has actually happened at times with some of our balancing work with HIGHMEM/NORMAL. Basically, the logic would go: "everybody who ends up having to wait for an allocation should free at least N pages", but then you would end up with 50*N pages total that the system thought it "needed" to free up, and that could be a big number that would cause the VM to want to free up stuff long after it was really done. > Or > "as soon as the number of reclaimable pages exceeds zone->pages_min". > Some logic would also be needed to prevent new page allocators from > jumping the queue, of course. Yeah, the unfairness is the thing that really can be nasty. On the other hand, some unfairness is ok too - and can improve throughput. So jumping the queue is fine, you just must not be able to _consistently_ jump the queue. (In fact, jumping the queue is inevitable to some degree - not allowing any queue jumping at all would imply that any time _anybody_ starts waiting, every single allocation afterwards will have to wait until the waiter got woken up. Which we have actually tried before at times, but which causes really really bad behaviour and horribly bad "pausing") You probably want the occasional allocator able to jump the queue, but the "big spenders" to be caught eventually. "Fairness" really doesn't mean that "everybody should wait equally much", it really means "people should wait roughly relative to how much as they 'spend' memory". If a process is frugal with it's memory footprint, and doesn't dirty a lot of pages/buffers, such a process should obviously wait less than one that allocates/dirties a lot. Right now, we get roughly this behaviour simply by way of statistical behaviour for the page allocator ("if somebody allocates 5 times as many pages, he's 5 times as likely to have to clean something up too"), but trying to be smarter about this could easily break this relative fairness. In particular, statefulness like "cannot jump the queue" breaks it. > > some fuzzy feeling about "we shouldn't wait unless we have to". We _do_ > > have to wait. > > Sure, page allocators must throttle their allocation rate to that at > which the IO system can retire writes. But by waiting on a randomly-chosen > disk block, we're at the mercy of the elevator. Yes. On the other hand, "random" is actually usually a fairly good choice in itself. And it a lot easier that many of the alternatives - especially when it comes to fairness. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 5:51 ` Linus Torvalds @ 2002-07-05 6:08 ` Linus Torvalds 2002-07-05 6:27 ` Alexander Viro 2002-07-05 6:33 ` Andrew Morton 2002-07-05 6:46 ` Andrew Morton 1 sibling, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-05 6:08 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm On Thu, 4 Jul 2002, Linus Torvalds wrote: > > Right now, we get roughly this behaviour simply by way of statistical > behaviour for the page allocator ("if somebody allocates 5 times as many > pages, he's 5 times as likely to have to clean something up too"), but > trying to be smarter about this could easily break this relative fairness. Side note: getting some higher-level locking wrong can _seriously_ break this statistical behaviour. In particular, the ext2 superblock lock at least used to be horribly broken and held in a lot of "bad" places: I doubt Al has gotten far enough to fix that brokenness. The superblock lock used to cause one process that blocked for something (usually reading in some bitmap or other) to cause a lot of _other_ processes to block quite unnecessarily on the badly placed lock, even though they really would have had all the resources they needed. That particular thing is really not a VM problem, but a ext2 issue. The superblock lock just isn't very well placed. I personally suspect that it should be replaced by a spinlock - just to force all blocking operations to be moved outside the lock (so that it would only protect the actual data structures - rather than be held around reading bitmap blocks into memory etc). But that's a rather painful kind of locking change to do and to test. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 6:08 ` Linus Torvalds @ 2002-07-05 6:27 ` Alexander Viro 2002-07-05 6:33 ` Andrew Morton 1 sibling, 0 replies; 95+ messages in thread From: Alexander Viro @ 2002-07-05 6:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Rik van Riel, Andrea Arcangeli, linux-mm On Thu, 4 Jul 2002, Linus Torvalds wrote: > In particular, the ext2 superblock lock at least used to be horribly > broken and held in a lot of "bad" places: I doubt Al has gotten far enough > to fix that brokenness. The superblock lock used to cause one process that As the matter of fact, I did. If you want lock_super() to be killed in ext2 (2.5) - just say so and I'll do the rest. Right now both ext2_new_block() and ext2_new_inode() look through the group descriptors for good one and reserve (block|inode) in it. That can be easily done under a spinlock. After that we read a bitmap (no need for any locks) and grab a bit in it (we are guaranteed to have one). The latter can be either done under a spinlock or by being clever and noticing that amount of contenders is always less or equal the number of free bits (with minimal use of set_bit()/etc. atomicity we can do that without spinlocks). After that we don't need any locks whatsoever. Andrew had just killed the last bit of crap there - LRU used to be protected by lock_super() and since it's no more... Rewrite of balloc.c and ialloc.c was done with killing lock_super() in mind - I didn't want to do that in 2.4 for obvious reasons, but for 2.5 it's very easy... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 6:08 ` Linus Torvalds 2002-07-05 6:27 ` Alexander Viro @ 2002-07-05 6:33 ` Andrew Morton 2002-07-05 7:33 ` Andrea Arcangeli 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 6:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm Linus Torvalds wrote: > > On Thu, 4 Jul 2002, Linus Torvalds wrote: > > > > Right now, we get roughly this behaviour simply by way of statistical > > behaviour for the page allocator ("if somebody allocates 5 times as many > > pages, he's 5 times as likely to have to clean something up too"), but > > trying to be smarter about this could easily break this relative fairness. > > Side note: getting some higher-level locking wrong can _seriously_ break > this statistical behaviour. > > In particular, the ext2 superblock lock at least used to be horribly > broken and held in a lot of "bad" places: I doubt Al has gotten far enough > to fix that brokenness. The superblock lock used to cause one process that > blocked for something (usually reading in some bitmap or other) to cause a > lot of _other_ processes to block quite unnecessarily on the badly placed > lock, even though they really would have had all the resources they > needed. ext2 is still performing synchronous bitmap reads inside lock_super() and yes, that shuts down the filesystem. But once the bitmaps are in cache, it's not a huge problem. Unless you have an Anton-class box. The context switch on the lock_super in the ext2 block allocator is now one of his major throughput bottlenecks. Although I must say, we're talking about filesystem loads here which are purely RAM-based - fifteen-second tests which never hit disk. This is kinda silly, because it's not a very interesting operating region. But it's fun - I think we've doubled 2.4 throughput now. > That particular thing is really not a VM problem, but a ext2 issue. The > superblock lock just isn't very well placed. I personally suspect that it > should be replaced by a spinlock - just to force all blocking operations > to be moved outside the lock (so that it would only protect the actual > data structures - rather than be held around reading bitmap blocks into > memory etc). It can become a per-blockgroup spinlock. That will scale splendidly. Removing the private bitmap LRUs gets us partway toward that. But the bitmap buffers tend to get shoved up onto the active list real quick when you start pushing things. > But that's a rather painful kind of locking change to do and to test. Well. First locks first. kmap_lock is a bad one on x86. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 6:33 ` Andrew Morton @ 2002-07-05 7:33 ` Andrea Arcangeli 2002-07-07 2:50 ` Andrew Morton 0 siblings, 1 reply; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-05 7:33 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Rik van Riel, linux-mm, Martin J. Bligh On Thu, Jul 04, 2002 at 11:33:45PM -0700, Andrew Morton wrote: > Well. First locks first. kmap_lock is a bad one on x86. Actually I thought about kmap_lock and the per-process kmaps a bit more with Martin (cc'ed) during OLS and there is an easy process-scalable solution to drop: the kmap_lock in turn the global pool in turn the global tlb flush The only problem is that it's not anymore both atomic *and* persistent, it's only persistent. It's also atomic if the mm_count == 1, but the kernel cannot rely on it, it has to assume it's a blocking operation always (you find it out if it's blocking only at runtime). In short the same design of the per-process kmaps will work just fine if we add a semaphore to the mm_struct. then before starting using the kmap entry we must acquire the semaphore. This way all the global locking and global tlb flush goes away completely for normal tasks, but still remains the contention of that per-mm semaphore with threads doing simutaneous pte manipulation or simultaneous pagecache I/O though. Furthmore this I/O will be serialized, threaded benchmark like dbench may perform poorly that way I suspect, or we should add a pool of userspace pages so more than 1 thread is allowed to go ahead, but still we may cacheline-bounce in the synchronization of the pool across threads (similar to what we do now in the global pool). Then there's the problem the pagecache/FS API should be changed to pass the vaddr through the stack because page->virtual would go away, the virtual address would be per-process protected by the mm->kmap_sem so we couldn't store it in a global, all tasks can kmap the same page at the same time at virtual vaddr. This as well will break some common code. Last but not the least, I hope in 2.6 production I won't be running benchmarks and profiling using a 32bit cpu anymore anyways. So I'm not very motivated anymore in doing that change after the comment from Linus about the issue with threads. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 7:33 ` Andrea Arcangeli @ 2002-07-07 2:50 ` Andrew Morton 2002-07-07 3:05 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-07 2:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm, Martin J. Bligh Andrea Arcangeli wrote: > > On Thu, Jul 04, 2002 at 11:33:45PM -0700, Andrew Morton wrote: > > Well. First locks first. kmap_lock is a bad one on x86. > > Actually I thought about kmap_lock and the per-process kmaps a bit more > with Martin (cc'ed) during OLS and there is an easy process-scalable > solution to drop: Martin is being bitten by the global invalidate more than by the lock. He increased the size of the kmap pool just to reduce the invalidate frequency and saw 40% speedups of some stuff. Those invalidates don't show up nicely on profiles. > the kmap_lock > in turn the global pool > in turn the global tlb flush > > The only problem is that it's not anymore both atomic *and* persistent, > it's only persistent. It's also atomic if the mm_count == 1, but the > kernel cannot rely on it, it has to assume it's a blocking operation > always (you find it out if it's blocking only at runtime). I was discussing this with sct a few days back. iiuc, the proposal was to create a small per-cpu pool (say, 4-8 pages) which is a "front-end" to regular old kmap(). Any time you have one of these pages in use, the process gets pinned onto the current CPU. If we run out of per-cpu kmaps, just fall back to traditional kmap(). It does mean that this variant of kmap() couldn't just return a `struct page *' - it would have to return something richer than that. > In short the same design of the per-process kmaps will work just fine if > we add a semaphore to the mm_struct. then before starting using the kmap > entry we must acquire the semaphore. This way all the global locking and > global tlb flush goes away completely for normal tasks, but still > remains the contention of that per-mm semaphore with threads doing > simutaneous pte manipulation or simultaneous pagecache I/O though. > Furthmore this I/O will be serialized, threaded benchmark like dbench > may perform poorly that way I suspect, or we should add a pool of > userspace pages so more than 1 thread is allowed to go ahead, but still > we may cacheline-bounce in the synchronization of the pool across > threads (similar to what we do now in the global pool). > > Then there's the problem the pagecache/FS API should be changed to pass > the vaddr through the stack because page->virtual would go away, the > virtual address would be per-process protected by the mm->kmap_sem so we > couldn't store it in a global, all tasks can kmap the same page at the > same time at virtual vaddr. This as well will break some common code. > > Last but not the least, I hope in 2.6 production I won't be running > benchmarks and profiling using a 32bit cpu anymore anyways. > > So I'm not very motivated anymore in doing that change after the comment > from Linus about the issue with threads. I believe that IBM have 32gig, 8- or 16-CPU ia32 machines just coming into production now. Presumably, they're not the only ones. We're stuck with this mess for another few years. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-07 2:50 ` Andrew Morton @ 2002-07-07 3:05 ` Linus Torvalds 2002-07-07 3:47 ` Andrew Morton 2002-07-07 5:16 ` vm lock contention reduction Martin J. Bligh 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh 2002-07-08 0:38 ` vm lock contention reduction William Lee Irwin III 2 siblings, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 3:05 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh On Sat, 6 Jul 2002, Andrew Morton wrote: > > Martin is being bitten by the global invalidate more than by the lock. > He increased the size of the kmap pool just to reduce the invalidate > frequency and saw 40% speedups of some stuff. > > Those invalidates don't show up nicely on profiles. I'd like to enhance the profiling support a bit, to create some infrastructure for doing different kinds of profiles, not just the current timer-based one (and not just for the kernel). There's also the P4 native support for "event buffers" or whatever intel calls them, that allows profiling at a lower level by interrupting not for every event, but only when the hw buffer overflows. I haven't had much time to look at the oprofile thing, but what I _have_ seen has made me rather unhappy (especially the horrid system call tracking kludges). I'd rather have some generic hooks (a notion of a "profile buffer" and events that cause us to have to synchronize with it, like process switches, mmap/munmap - oprofile wants these too), and some generic helper routines for profiling (turn any eip into a "dentry + offset" pair together with ways to tag specific dentries as being "worthy" of profiling). Depending on the regular timer interrupt will never give good profiles, simply because it can't be NMI, but also because you then don't have the choice of using other counters (cache miss etc). oprofile does much of this, but in a damn ugly manner. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-07 3:05 ` Linus Torvalds @ 2002-07-07 3:47 ` Andrew Morton 2002-07-08 11:39 ` John Levon 2002-07-07 5:16 ` vm lock contention reduction Martin J. Bligh 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-07 3:47 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, John Levon Linus Torvalds wrote: > > On Sat, 6 Jul 2002, Andrew Morton wrote: > > > > Martin is being bitten by the global invalidate more than by the lock. > > He increased the size of the kmap pool just to reduce the invalidate > > frequency and saw 40% speedups of some stuff. > > > > Those invalidates don't show up nicely on profiles. > > I'd like to enhance the profiling support a bit, to create some > infrastructure for doing different kinds of profiles, not just the current > timer-based one (and not just for the kernel). > > There's also the P4 native support for "event buffers" or whatever intel > calls them, that allows profiling at a lower level by interrupting not for > every event, but only when the hw buffer overflows. > > I haven't had much time to look at the oprofile thing, but what I _have_ > seen has made me rather unhappy (especially the horrid system call > tracking kludges). > > I'd rather have some generic hooks (a notion of a "profile buffer" and > events that cause us to have to synchronize with it, like process > switches, mmap/munmap - oprofile wants these too), and some generic helper > routines for profiling (turn any eip into a "dentry + offset" pair > together with ways to tag specific dentries as being "worthy" of > profiling). > > Depending on the regular timer interrupt will never give good profiles, > simply because it can't be NMI, but also because you then don't have the > choice of using other counters (cache miss etc). > > oprofile does much of this, but in a damn ugly manner. > I pinged John about an oprofile merge just the other day actually. He agrees with you on the syscall table thing. I think he says that it could be cleaned up if oprofile was in the tree. Ditto the issue with mmap. I was able to isolate and fix some fairly hairy performance problems at work with oprofile. It's a great tool - I use it all the time. And it profiles the entire system - right down to file-n-line in some random shared object. With NMIs. It is not just a kernel tool. So. John. Get coding :-) - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-07 3:47 ` Andrew Morton @ 2002-07-08 11:39 ` John Levon 0 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-08 11:39 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel [Excuse the quoting, I was out of the loop on this ...] On Sat, Jul 06, 2002 at 08:47:54PM -0700, Andrew Morton wrote: > Linus Torvalds wrote: > > > > I haven't had much time to look at the oprofile thing, but what I _have_ > > seen has made me rather unhappy (especially the horrid system call > > tracking kludges). It makes me very unhappy too. There are a number of horrible things there, mostly for the sake of convenience and performance. sys_call_table is just the most obviously foul thing. I'm glad to hear there is interest in getting some kernel support for such things to be done tastefully. > > I'd rather have some generic hooks (a notion of a "profile buffer" and > > events that cause us to have to synchronize with it, like process > > switches, mmap/munmap - oprofile wants these too), and some generic helper > > routines for profiling (turn any eip into a "dentry + offset" pair > > together with ways to tag specific dentries as being "worthy" of > > profiling). How do you see such dentry names being exported to user-space for the profiling daemon to access ? The current oprofile scheme is, um, less than ideal ... > So. John. Get coding :-) I'm interested in doing so but I'd like to hear some more on how people perceive this working. It essentially means a fork for a lot of the kernel-side code, so it'd mean a lot more work for us (at least until I can drop the 2.2/2.4 versions). regards john -- "If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared." - St. Augustine ^ permalink raw reply [flat|nested] 95+ messages in thread
* Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-08 11:39 ` John Levon 0 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-08 11:39 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel [Excuse the quoting, I was out of the loop on this ...] On Sat, Jul 06, 2002 at 08:47:54PM -0700, Andrew Morton wrote: > Linus Torvalds wrote: > > > > I haven't had much time to look at the oprofile thing, but what I _have_ > > seen has made me rather unhappy (especially the horrid system call > > tracking kludges). It makes me very unhappy too. There are a number of horrible things there, mostly for the sake of convenience and performance. sys_call_table is just the most obviously foul thing. I'm glad to hear there is interest in getting some kernel support for such things to be done tastefully. > > I'd rather have some generic hooks (a notion of a "profile buffer" and > > events that cause us to have to synchronize with it, like process > > switches, mmap/munmap - oprofile wants these too), and some generic helper > > routines for profiling (turn any eip into a "dentry + offset" pair > > together with ways to tag specific dentries as being "worthy" of > > profiling). How do you see such dentry names being exported to user-space for the profiling daemon to access ? The current oprofile scheme is, um, less than ideal ... > So. John. Get coding :-) I'm interested in doing so but I'd like to hear some more on how people perceive this working. It essentially means a fork for a lot of the kernel-side code, so it'd mean a lot more work for us (at least until I can drop the 2.2/2.4 versions). regards john -- "If a thing is not diminished by being shared, it is not rightly owned if it is only owned & not shared." - St. Augustine -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-08 11:39 ` John Levon @ 2002-07-08 17:52 ` Linus Torvalds -1 siblings, 0 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-08 17:52 UTC (permalink / raw) To: John Levon Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel On Mon, 8 Jul 2002, John Levon wrote: > > > I'd rather have some generic hooks (a notion of a "profile buffer" and > > events that cause us to have to synchronize with it, like process > > switches, mmap/munmap - oprofile wants these too), and some generic helper > > routines for profiling (turn any eip into a "dentry + offset" pair > > together with ways to tag specific dentries as being "worthy" of > > profiling). > > How do you see such dentry names being exported to user-space for the > profiling daemon to access ? The current oprofile scheme is, um, less > than ideal ... Ok, I'll outline my personal favourite interface, but I'd also better point out that while I've thought a bit about what I'd like to have and how it could be implemented in the kernel, I have _not_ actually tried any of it out, much less thought about what the user level stuff really needs. Anyway, here goes a straw-man: - I'd associate each profiling event with a dentry/offset pair, simply because that's the highest-level thing that the kernel knows about and that is "static". - I'd suggest that the profiler explicitly mark the dentries it wants profiled, so that the kernel can throw away events that we're not interested in. The marking function would return a cookie to user space, and increment the dentry count (along with setting the "profile" flag in the dentry) - the "cookie" (which would most easily just be the kernel address of the dentry) would be the thing that we give to user-space (along with offset) on profile read. The user app can turn it back into a filename. Whether it is the original "mark this file for profiling" phase that saves away the cookie<->filename association, or whether we also have a system call for "return the path of this cookie", I don't much care about. Details, details. Anyway, what would be the preferred interface from user level? Linus ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-08 17:52 ` Linus Torvalds 0 siblings, 0 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-08 17:52 UTC (permalink / raw) To: John Levon Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel On Mon, 8 Jul 2002, John Levon wrote: > > > I'd rather have some generic hooks (a notion of a "profile buffer" and > > events that cause us to have to synchronize with it, like process > > switches, mmap/munmap - oprofile wants these too), and some generic helper > > routines for profiling (turn any eip into a "dentry + offset" pair > > together with ways to tag specific dentries as being "worthy" of > > profiling). > > How do you see such dentry names being exported to user-space for the > profiling daemon to access ? The current oprofile scheme is, um, less > than ideal ... Ok, I'll outline my personal favourite interface, but I'd also better point out that while I've thought a bit about what I'd like to have and how it could be implemented in the kernel, I have _not_ actually tried any of it out, much less thought about what the user level stuff really needs. Anyway, here goes a straw-man: - I'd associate each profiling event with a dentry/offset pair, simply because that's the highest-level thing that the kernel knows about and that is "static". - I'd suggest that the profiler explicitly mark the dentries it wants profiled, so that the kernel can throw away events that we're not interested in. The marking function would return a cookie to user space, and increment the dentry count (along with setting the "profile" flag in the dentry) - the "cookie" (which would most easily just be the kernel address of the dentry) would be the thing that we give to user-space (along with offset) on profile read. The user app can turn it back into a filename. Whether it is the original "mark this file for profiling" phase that saves away the cookie<->filename association, or whether we also have a system call for "return the path of this cookie", I don't much care about. Details, details. Anyway, what would be the preferred interface from user level? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-08 17:52 ` Linus Torvalds @ 2002-07-08 18:41 ` Karim Yaghmour -1 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-08 18:41 UTC (permalink / raw) To: Linus Torvalds Cc: John Levon, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob Linus Torvalds wrote: > On Mon, 8 Jul 2002, John Levon wrote: > > How do you see such dentry names being exported to user-space for the > > profiling daemon to access ? The current oprofile scheme is, um, less > > than ideal ... > > Ok, I'll outline my personal favourite interface, but I'd also better > point out that while I've thought a bit about what I'd like to have and > how it could be implemented in the kernel, I have _not_ actually tried any > of it out, much less thought about what the user level stuff really needs. Sure. I've done some work on profiling using trace hooks. Hopefully the following is useful. > Anyway, here goes a straw-man: > > - I'd associate each profiling event with a dentry/offset pair, simply > because that's the highest-level thing that the kernel knows about and > that is "static". dentry + offset: on a 32bit machine, this is 8 bytes total per event being profiled. This is a lot of information if you are trying you have a high volume throughput. You can almost always skip the dentry since you know scheduling changes and since you can catch a system-state snapshot at the begining of the profiling. After that, the eip is sufficient and can easily be correlated to a meaningfull entry in a file in user-space. > - I'd suggest that the profiler explicitly mark the dentries it wants > profiled, so that the kernel can throw away events that we're not > interested in. The marking function would return a cookie to user > space, and increment the dentry count (along with setting the > "profile" flag in the dentry) Or the kernel can completely ignore this sort of selection and leave it all to the agent responsible for collecting the events. This is what is done in LTT. Currently, you can select one PID, GID, UID, but this is easily extendable to include many. Of course if you agree to having the task struct have "trace" or "profile" field, then this would be much easier. > - the "cookie" (which would most easily just be the kernel address of the > dentry) would be the thing that we give to user-space (along with > offset) on profile read. The user app can turn it back into a filename. That's the typical scheme and the one possible with the data retrieved by LTT. > Whether it is the original "mark this file for profiling" phase that saves > away the cookie<->filename association, or whether we also have a system > call for "return the path of this cookie", I don't much care about. > Details, details. > > Anyway, what would be the preferred interface from user level? The approach LTT takes is that no part in the kernel should actually care about the user level needs. Anything in user level that has to modify the tracing/profiling makes its requests to the trace driver, /dev/tracer. No additional system calls, no special cases in the main kernel code. Only 3 main files: kernel/trace.c: The main entry point for all events (trace_event()) drivers/trace/tracer.c: The trace driver include/linux/trace.h: The trace hook definitions Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-08 18:41 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-08 18:41 UTC (permalink / raw) To: Linus Torvalds Cc: John Levon, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob Linus Torvalds wrote: > On Mon, 8 Jul 2002, John Levon wrote: > > How do you see such dentry names being exported to user-space for the > > profiling daemon to access ? The current oprofile scheme is, um, less > > than ideal ... > > Ok, I'll outline my personal favourite interface, but I'd also better > point out that while I've thought a bit about what I'd like to have and > how it could be implemented in the kernel, I have _not_ actually tried any > of it out, much less thought about what the user level stuff really needs. Sure. I've done some work on profiling using trace hooks. Hopefully the following is useful. > Anyway, here goes a straw-man: > > - I'd associate each profiling event with a dentry/offset pair, simply > because that's the highest-level thing that the kernel knows about and > that is "static". dentry + offset: on a 32bit machine, this is 8 bytes total per event being profiled. This is a lot of information if you are trying you have a high volume throughput. You can almost always skip the dentry since you know scheduling changes and since you can catch a system-state snapshot at the begining of the profiling. After that, the eip is sufficient and can easily be correlated to a meaningfull entry in a file in user-space. > - I'd suggest that the profiler explicitly mark the dentries it wants > profiled, so that the kernel can throw away events that we're not > interested in. The marking function would return a cookie to user > space, and increment the dentry count (along with setting the > "profile" flag in the dentry) Or the kernel can completely ignore this sort of selection and leave it all to the agent responsible for collecting the events. This is what is done in LTT. Currently, you can select one PID, GID, UID, but this is easily extendable to include many. Of course if you agree to having the task struct have "trace" or "profile" field, then this would be much easier. > - the "cookie" (which would most easily just be the kernel address of the > dentry) would be the thing that we give to user-space (along with > offset) on profile read. The user app can turn it back into a filename. That's the typical scheme and the one possible with the data retrieved by LTT. > Whether it is the original "mark this file for profiling" phase that saves > away the cookie<->filename association, or whether we also have a system > call for "return the path of this cookie", I don't much care about. > Details, details. > > Anyway, what would be the preferred interface from user level? The approach LTT takes is that no part in the kernel should actually care about the user level needs. Anything in user level that has to modify the tracing/profiling makes its requests to the trace driver, /dev/tracer. No additional system calls, no special cases in the main kernel code. Only 3 main files: kernel/trace.c: The main entry point for all events (trace_event()) drivers/trace/tracer.c: The trace driver include/linux/trace.h: The trace hook definitions Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-08 18:41 ` Karim Yaghmour @ 2002-07-10 2:22 ` John Levon -1 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-10 2:22 UTC (permalink / raw) To: Karim Yaghmour Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob On Mon, Jul 08, 2002 at 02:41:00PM -0400, Karim Yaghmour wrote: > dentry + offset: on a 32bit machine, this is 8 bytes total per event being > profiled. This is a lot of information if you are trying you have a high > volume throughput. I haven't found that to be significant in profiling overhead, mainly because the hash table removes some of the "sting" of high sampling rates (and the interrupt handler dwarfs all other aspects). The situation is probably different for more general tracing purposes, but I'm dubious as to the utility of a general tracing mechanism. (besides, a profile buffer needs a sample context value too, for things like CPU number and perfctr event number). > You can almost always skip the dentry since you know scheduling > changes and since you can catch a system-state snapshot at the begining of > the profiling. After that, the eip is sufficient and can easily be correlated > to a meaningfull entry in a file in user-space. But as I point out in my other post, dentry-offset alone is not as useful as it could be... I just don't see a really good reason to introduce insidious tracing throughout. Both tracing and profiling are ugly ugly things to be doing by their very nature, and I'd much prefer to keep such intrusions to a bare minimum. The entry.S examine-the-registers approach is simple enough, but it's not much more tasteful than sys_call_table hackery IMHO regards john -- "I know I believe in nothing but it is my nothing" - Manic Street Preachers ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 2:22 ` John Levon 0 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-10 2:22 UTC (permalink / raw) To: Karim Yaghmour Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob On Mon, Jul 08, 2002 at 02:41:00PM -0400, Karim Yaghmour wrote: > dentry + offset: on a 32bit machine, this is 8 bytes total per event being > profiled. This is a lot of information if you are trying you have a high > volume throughput. I haven't found that to be significant in profiling overhead, mainly because the hash table removes some of the "sting" of high sampling rates (and the interrupt handler dwarfs all other aspects). The situation is probably different for more general tracing purposes, but I'm dubious as to the utility of a general tracing mechanism. (besides, a profile buffer needs a sample context value too, for things like CPU number and perfctr event number). > You can almost always skip the dentry since you know scheduling > changes and since you can catch a system-state snapshot at the begining of > the profiling. After that, the eip is sufficient and can easily be correlated > to a meaningfull entry in a file in user-space. But as I point out in my other post, dentry-offset alone is not as useful as it could be... I just don't see a really good reason to introduce insidious tracing throughout. Both tracing and profiling are ugly ugly things to be doing by their very nature, and I'd much prefer to keep such intrusions to a bare minimum. The entry.S examine-the-registers approach is simple enough, but it's not much more tasteful than sys_call_table hackery IMHO regards john -- "I know I believe in nothing but it is my nothing" - Manic Street Preachers -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 2:22 ` John Levon @ 2002-07-10 4:16 ` Karim Yaghmour -1 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 4:16 UTC (permalink / raw) To: John Levon Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob John Levon wrote: > I'm dubious as to the utility of a general tracing mechanism. ... > I just don't see a really good reason to introduce insidious tracing > throughout. Both tracing and profiling are ugly ugly things to be doing > by their very nature, and I'd much prefer to keep such intrusions to a > bare minimum. Tracing is essential for an entire category of problems which can only be solved by obtaining the raw data which describesg the dynamic behavior of the kernel. Have you ever tried to solve an inter-process synchronization problem using strace or gdb? In reality, only tracing built into the kernel can enable a developer to solve such problems. Have you ever tried to follow the exact reaction applications have to kernel input? It took lots of ad-hoc experimentation to isolate the thundering hurd problem. Tracing would have shown this immediately. Can you list the exact sequence of processes that are scheduled in reaction to input when you press the keyboard while running a terminal in X? This is out of reach of most user-space programmers but a trace shows this quite nicely. Ever had a box saturated with IRQs and still showing 0% CPU usage? This problem has been reported time and again. Lately someone was asking about the utility which soaks-up CPU cycles to show this sort of situation. Once more, tracing shows this right away ... without soaking up CPU cycles. Ever tried to get the exact time spent by an application in user-space vs. kernel space? Even better, can you tell the actually syscall which cost the most in kernel time? You can indeed get closer using random sampling, but it's just one more thing tracing gives you without any difficulty. And the list goes on. The fact that so many kernel subsystems already have their own tracing built-in (see other posting) clearly shows that there is a fundamental need for such a utility even for driver developers. If many driver developers can't develop drivers adequately without tracing, can we expect user-space developers to efficiently use the kernel if they have absolutely no idea about the dynamic interaction their processes have with the kernel and how this interaction is influenced by and influences the interaction with other processes? > The entry.S examine-the-registers approach is simple enough, but it's > not much more tasteful than sys_call_table hackery IMHO I guess we won't agree on this. From my point of view it is much better to have the code directly within entry.S for all to see instead of having some external software play around with the syscall table in a way kernel users can't trace back to the kernel's own code. Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 4:16 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 4:16 UTC (permalink / raw) To: John Levon Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob John Levon wrote: > I'm dubious as to the utility of a general tracing mechanism. ... > I just don't see a really good reason to introduce insidious tracing > throughout. Both tracing and profiling are ugly ugly things to be doing > by their very nature, and I'd much prefer to keep such intrusions to a > bare minimum. Tracing is essential for an entire category of problems which can only be solved by obtaining the raw data which describesg the dynamic behavior of the kernel. Have you ever tried to solve an inter-process synchronization problem using strace or gdb? In reality, only tracing built into the kernel can enable a developer to solve such problems. Have you ever tried to follow the exact reaction applications have to kernel input? It took lots of ad-hoc experimentation to isolate the thundering hurd problem. Tracing would have shown this immediately. Can you list the exact sequence of processes that are scheduled in reaction to input when you press the keyboard while running a terminal in X? This is out of reach of most user-space programmers but a trace shows this quite nicely. Ever had a box saturated with IRQs and still showing 0% CPU usage? This problem has been reported time and again. Lately someone was asking about the utility which soaks-up CPU cycles to show this sort of situation. Once more, tracing shows this right away ... without soaking up CPU cycles. Ever tried to get the exact time spent by an application in user-space vs. kernel space? Even better, can you tell the actually syscall which cost the most in kernel time? You can indeed get closer using random sampling, but it's just one more thing tracing gives you without any difficulty. And the list goes on. The fact that so many kernel subsystems already have their own tracing built-in (see other posting) clearly shows that there is a fundamental need for such a utility even for driver developers. If many driver developers can't develop drivers adequately without tracing, can we expect user-space developers to efficiently use the kernel if they have absolutely no idea about the dynamic interaction their processes have with the kernel and how this interaction is influenced by and influences the interaction with other processes? > The entry.S examine-the-registers approach is simple enough, but it's > not much more tasteful than sys_call_table hackery IMHO I guess we won't agree on this. From my point of view it is much better to have the code directly within entry.S for all to see instead of having some external software play around with the syscall table in a way kernel users can't trace back to the kernel's own code. Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 4:16 ` Karim Yaghmour @ 2002-07-10 4:38 ` John Levon -1 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-10 4:38 UTC (permalink / raw) To: Karim Yaghmour Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: [snip] > And the list goes on. Sure, there are all sorts of things where some tracing can come in useful. The question is whether it's really something the mainline kernel should be doing, and if the gung-ho approach is nice or not. > The fact that so many kernel subsystems already have their own tracing > built-in (see other posting) Your list was almost entirely composed of per-driver debug routines. This is not the same thing as logging trap entry/exits, syscalls etc etc, on any level, and I'm a bit perplexed that you're making such an assocation. > expect user-space developers to efficiently use the kernel if they > have > absolutely no idea about the dynamic interaction their processes have > with the kernel and how this interaction is influenced by and > influences > the interaction with other processes? This is clearly an exaggeration. And seeing as something like LTT doesn't (and cannot) tell the "whole story" either, I could throw the same argument directly back at you. The point is, there comes a point of no return where usefulness gets outweighed by ugliness. For the very few cases that such detailed information is really useful, the user can usually install the needed special-case tools. In contrast a profiling mechanism that improves on the poor lot that currently exists (gprof, readprofile) has a truly general utility, and can hopefully be done without too much ugliness. The primary reason I want to see something like this is to kill the ugly code I have to maintain. > > The entry.S examine-the-registers approach is simple enough, but > > it's > > not much more tasteful than sys_call_table hackery IMHO > > I guess we won't agree on this. From my point of view it is much > better > to have the code directly within entry.S for all to see instead of > having some external software play around with the syscall table in a > way kernel users can't trace back to the kernel's own code. Eh ? I didn't say sys_call_table hackery was better. I said the entry.S thing wasn't much better ... regards john ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 4:38 ` John Levon 0 siblings, 0 replies; 95+ messages in thread From: John Levon @ 2002-07-10 4:38 UTC (permalink / raw) To: Karim Yaghmour Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: [snip] > And the list goes on. Sure, there are all sorts of things where some tracing can come in useful. The question is whether it's really something the mainline kernel should be doing, and if the gung-ho approach is nice or not. > The fact that so many kernel subsystems already have their own tracing > built-in (see other posting) Your list was almost entirely composed of per-driver debug routines. This is not the same thing as logging trap entry/exits, syscalls etc etc, on any level, and I'm a bit perplexed that you're making such an assocation. > expect user-space developers to efficiently use the kernel if they > have > absolutely no idea about the dynamic interaction their processes have > with the kernel and how this interaction is influenced by and > influences > the interaction with other processes? This is clearly an exaggeration. And seeing as something like LTT doesn't (and cannot) tell the "whole story" either, I could throw the same argument directly back at you. The point is, there comes a point of no return where usefulness gets outweighed by ugliness. For the very few cases that such detailed information is really useful, the user can usually install the needed special-case tools. In contrast a profiling mechanism that improves on the poor lot that currently exists (gprof, readprofile) has a truly general utility, and can hopefully be done without too much ugliness. The primary reason I want to see something like this is to kill the ugly code I have to maintain. > > The entry.S examine-the-registers approach is simple enough, but > > it's > > not much more tasteful than sys_call_table hackery IMHO > > I guess we won't agree on this. From my point of view it is much > better > to have the code directly within entry.S for all to see instead of > having some external software play around with the syscall table in a > way kernel users can't trace back to the kernel's own code. Eh ? I didn't say sys_call_table hackery was better. I said the entry.S thing wasn't much better ... regards john -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 4:38 ` John Levon @ 2002-07-10 5:46 ` Karim Yaghmour -1 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 5:46 UTC (permalink / raw) To: John Levon Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob John Levon wrote: > > And the list goes on. > > Sure, there are all sorts of things where some tracing can come in > useful. The question is whether it's really something the mainline > kernel should be doing, and if the gung-ho approach is nice or not. I'm not sure how to read "gung-ho approach" but I would point out that LTT wasn't built overnight. It's been out there for 3 years now. As for whether the mainline kernel should have it or not, then I think the standard has always been set using the actual purpose: Is this useful to users or is this only useful to kernel developers? Whenever it was the later, then the answer has almost always been NO. In the case of tracing I think that not only is it "useful" to users, but I would say "essential". Would you ask a user to recompile his kernel with a ptrace patch because he needs to use gdb? I don't think so, and I don't think application developers or system administrators should have to recompile their kernel either in order to understand the system' dynamics. If facts are of interest, then I would point out that LTT is already part of a number of distributions out there. Most other distros that don't have it included find it very useful but don't want to get involved in maintaining it until it's part of the kernel. > > The fact that so many kernel subsystems already have their own tracing > > built-in (see other posting) > > Your list was almost entirely composed of per-driver debug routines. I don't see any contradiction here. This is part of what I'm pointing out. Mainly that understanding the dynamic behavior of the system is essential to software development, especially when complex interactions, such as in the Linux kernel, take place. > This is not the same thing as logging trap entry/exits, syscalls etc > etc, on any level, and I'm a bit perplexed that you're making such an > assocation. In regards to trap entry/exits, there was a talk a couple of years ago by Nat Friedman, I think, which discussed GNU rope. Basically, it identified the location of page fault misses at the start of programs and used this information to reorder the binary in order to accelerate its load time. That's just one example where traps are of interest. Not to mention that traps can result in scheduling changes and that knowing why a process has been scheduled out is all part of understanding the system's dynamics. As for syscalls, oprofile, for one, really needs this sort of info. So I don't see your point. There are 2 things LTT provides in the kernel: 1- A unified tracing and high-throughput data-collection system 2- A select list of trace points in the kernel Item #1 can easily replace the existing ad-hoc implementation while serving oprofile, DProbes and others. Item #2 is of prime interest to application developers and system administrators. > > expect user-space developers to efficiently use the kernel if they > > have > > absolutely no idea about the dynamic interaction their processes have > > with the kernel and how this interaction is influenced by and > > influences > > the interaction with other processes? > > This is clearly an exaggeration. And seeing as something like LTT > doesn't (and cannot) tell the "whole story" either, I could throw the > same argument directly back at you. The point is, there comes a point of > no return where usefulness gets outweighed by ugliness. For the very few > cases that such detailed information is really useful, the user can > usually install the needed special-case tools. I'd really be interested in seing what you mean by "ugliness". If there's a list of grievances you have with LTT then I'm all ears. Anything inserted by LTT is clean and clear. It doesn't change anything to the kernel's normal behavior and once a trace point is inserted it requires almost zero maintenance. Take for example the scheduling change trace point (kernel/sched.c:schedule()): TRACE_SCHEDCHANGE(prev, next); I don't see why this should be ugly. It's an inline that results in zero lines of code if you configure tracing as off. The cases I presented earlier clearly show the usefullness of this information. The developer who needs to solve his synchronization problem or the system administrator who wants to understand why his box is so slow doesn't really want to patch/recompile/reboot to fix his problem. You would like to paint these as "very few cases". Unfortunately these cases are much more common than you say they are. > In contrast a profiling mechanism that improves on the poor lot that > currently exists (gprof, readprofile) has a truly general utility, and > can hopefully be done without too much ugliness. Somehow it is not justifiable to add a feature the like of which didn't exist before but it is justifiable to add a feature which only "improves" on existing tools? As I said before, LTT and the information it provides has a truly general utility: Oprofile can use it, DProbes uses it, a slew of existing ad-hoc tracing systems in the kernel can be replaced by it, application developers can isolate synchronization problems with it, system administrators can identify performance bottlenecks with it, etc. An argument over which is more useful between LTT and oprofile is bound to be useless. If nothing else, I see LTT as having a more general use than oprofile. But let's not get into that. What I'm really interested in here is the possibilty of having one unified tracing and data collection system which serves many purposes instead of having each subsystem or profiler have its own tracing and data collection mechanism. > The primary reason I want to see something like this is to kill the ugly > code I have to maintain. I can't say that my goals are any different. > > > The entry.S examine-the-registers approach is simple enough, but > > > it's > > > not much more tasteful than sys_call_table hackery IMHO > > > > I guess we won't agree on this. From my point of view it is much > > better > > to have the code directly within entry.S for all to see instead of > > having some external software play around with the syscall table in a > > way kernel users can't trace back to the kernel's own code. > > Eh ? I didn't say sys_call_table hackery was better. I said the entry.S > thing wasn't much better ... I know you weren't saying that. I said that in _my point of view_ the entry.S "thing" is much better. Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 5:46 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 5:46 UTC (permalink / raw) To: John Levon Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob John Levon wrote: > > And the list goes on. > > Sure, there are all sorts of things where some tracing can come in > useful. The question is whether it's really something the mainline > kernel should be doing, and if the gung-ho approach is nice or not. I'm not sure how to read "gung-ho approach" but I would point out that LTT wasn't built overnight. It's been out there for 3 years now. As for whether the mainline kernel should have it or not, then I think the standard has always been set using the actual purpose: Is this useful to users or is this only useful to kernel developers? Whenever it was the later, then the answer has almost always been NO. In the case of tracing I think that not only is it "useful" to users, but I would say "essential". Would you ask a user to recompile his kernel with a ptrace patch because he needs to use gdb? I don't think so, and I don't think application developers or system administrators should have to recompile their kernel either in order to understand the system' dynamics. If facts are of interest, then I would point out that LTT is already part of a number of distributions out there. Most other distros that don't have it included find it very useful but don't want to get involved in maintaining it until it's part of the kernel. > > The fact that so many kernel subsystems already have their own tracing > > built-in (see other posting) > > Your list was almost entirely composed of per-driver debug routines. I don't see any contradiction here. This is part of what I'm pointing out. Mainly that understanding the dynamic behavior of the system is essential to software development, especially when complex interactions, such as in the Linux kernel, take place. > This is not the same thing as logging trap entry/exits, syscalls etc > etc, on any level, and I'm a bit perplexed that you're making such an > assocation. In regards to trap entry/exits, there was a talk a couple of years ago by Nat Friedman, I think, which discussed GNU rope. Basically, it identified the location of page fault misses at the start of programs and used this information to reorder the binary in order to accelerate its load time. That's just one example where traps are of interest. Not to mention that traps can result in scheduling changes and that knowing why a process has been scheduled out is all part of understanding the system's dynamics. As for syscalls, oprofile, for one, really needs this sort of info. So I don't see your point. There are 2 things LTT provides in the kernel: 1- A unified tracing and high-throughput data-collection system 2- A select list of trace points in the kernel Item #1 can easily replace the existing ad-hoc implementation while serving oprofile, DProbes and others. Item #2 is of prime interest to application developers and system administrators. > > expect user-space developers to efficiently use the kernel if they > > have > > absolutely no idea about the dynamic interaction their processes have > > with the kernel and how this interaction is influenced by and > > influences > > the interaction with other processes? > > This is clearly an exaggeration. And seeing as something like LTT > doesn't (and cannot) tell the "whole story" either, I could throw the > same argument directly back at you. The point is, there comes a point of > no return where usefulness gets outweighed by ugliness. For the very few > cases that such detailed information is really useful, the user can > usually install the needed special-case tools. I'd really be interested in seing what you mean by "ugliness". If there's a list of grievances you have with LTT then I'm all ears. Anything inserted by LTT is clean and clear. It doesn't change anything to the kernel's normal behavior and once a trace point is inserted it requires almost zero maintenance. Take for example the scheduling change trace point (kernel/sched.c:schedule()): TRACE_SCHEDCHANGE(prev, next); I don't see why this should be ugly. It's an inline that results in zero lines of code if you configure tracing as off. The cases I presented earlier clearly show the usefullness of this information. The developer who needs to solve his synchronization problem or the system administrator who wants to understand why his box is so slow doesn't really want to patch/recompile/reboot to fix his problem. You would like to paint these as "very few cases". Unfortunately these cases are much more common than you say they are. > In contrast a profiling mechanism that improves on the poor lot that > currently exists (gprof, readprofile) has a truly general utility, and > can hopefully be done without too much ugliness. Somehow it is not justifiable to add a feature the like of which didn't exist before but it is justifiable to add a feature which only "improves" on existing tools? As I said before, LTT and the information it provides has a truly general utility: Oprofile can use it, DProbes uses it, a slew of existing ad-hoc tracing systems in the kernel can be replaced by it, application developers can isolate synchronization problems with it, system administrators can identify performance bottlenecks with it, etc. An argument over which is more useful between LTT and oprofile is bound to be useless. If nothing else, I see LTT as having a more general use than oprofile. But let's not get into that. What I'm really interested in here is the possibilty of having one unified tracing and data collection system which serves many purposes instead of having each subsystem or profiler have its own tracing and data collection mechanism. > The primary reason I want to see something like this is to kill the ugly > code I have to maintain. I can't say that my goals are any different. > > > The entry.S examine-the-registers approach is simple enough, but > > > it's > > > not much more tasteful than sys_call_table hackery IMHO > > > > I guess we won't agree on this. From my point of view it is much > > better > > to have the code directly within entry.S for all to see instead of > > having some external software play around with the syscall table in a > > way kernel users can't trace back to the kernel's own code. > > Eh ? I didn't say sys_call_table hackery was better. I said the entry.S > thing wasn't much better ... I know you weren't saying that. I said that in _my point of view_ the entry.S "thing" is much better. Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 4:38 ` John Levon @ 2002-07-10 13:10 ` bob -1 siblings, 0 replies; 95+ messages in thread From: bob @ 2002-07-10 13:10 UTC (permalink / raw) To: John Levon Cc: Karim Yaghmour, Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob, okrieg John, I have been cc'ed on this email as I was an active participant at the RAS performance monitoring/tracing session at OLS. Let me preface by saying that my view may be a bit biased as I have worked on the the core tracing infrastructure that went into IRIX in the mid 90s as well as the tracing infrastructure for K42, our research OS (see http://www.research.ibm.com/K42/index.html), and in both cases helped solve performance issues that would not have been otherwise solved. From the below it doesn't appear that anyone is arguing that tracing is not useful. The debate (except for some of the details) appears over whether it should be included in the kernel in a first-class manner or individual mechanisms put in on an ad-hoc basis. While this is indeed philosophical, let me share some experiences and benefits from other systems I've worked on: 1) The mechanism proposed is very non-invasive, a single line of code (some TRACE_XXX macro or like) is added to the area of interest. Further, at OLS, the proposal was to add only a few trace points. Programming-wise this does not clutter the code - in fact having a single well-known unified mechanism is cleaner coding than a set of one-off ways, as when anyone sees a trace macro it will be clear what it is. 2) In the end, there will be less intrusion with a single unified approach. With a series of different mechanisms over time multiple events will get added in the same place creating performance issues and more importantly causing confusion. 3) A unified approach will uncover performance issues not explicitly being searched for and allow ones of known interest to be tracked down without adding a patch (that may be cumbersome to maintain) and re-compilation. 4) In both my experiences, I have had resistance to adding this tracing infrastructure, and in both experiences other kernel developers have come back after the fact and thanked me for my persistence :-), as it helped them solve timing sensitive or other such issues they were having great difficulty with. If there is interest, I would happy to set up a conference number so people who are interested could all speak. -bob Robert Wisniewski The K42 MP OS Project Advanced Operating Systems Scalable Parallel Systems IBM T.J. Watson Research Center 914-945-3181 http://www.research.ibm.com/K42/ bob@watson.ibm.com ---- John Levon writes: > On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: > > [snip] > > > And the list goes on. > > Sure, there are all sorts of things where some tracing can come in > useful. The question is whether it's really something the mainline > kernel should be doing, and if the gung-ho approach is nice or not. > > > The fact that so many kernel subsystems already have their own tracing > > built-in (see other posting) > > Your list was almost entirely composed of per-driver debug routines. > This is not the same thing as logging trap entry/exits, syscalls etc > etc, on any level, and I'm a bit perplexed that you're making such an > assocation. > > > expect user-space developers to efficiently use the kernel if they > > have > > absolutely no idea about the dynamic interaction their processes have > > with the kernel and how this interaction is influenced by and > > influences > > the interaction with other processes? > > This is clearly an exaggeration. And seeing as something like LTT > doesn't (and cannot) tell the "whole story" either, I could throw the > same argument directly back at you. The point is, there comes a point of > no return where usefulness gets outweighed by ugliness. For the very few > cases that such detailed information is really useful, the user can > usually install the needed special-case tools. > > In contrast a profiling mechanism that improves on the poor lot that > currently exists (gprof, readprofile) has a truly general utility, and > can hopefully be done without too much ugliness. > > The primary reason I want to see something like this is to kill the ugly > code I have to maintain. > > > > The entry.S examine-the-registers approach is simple enough, but > > > it's > > > not much more tasteful than sys_call_table hackery IMHO > > > > I guess we won't agree on this. From my point of view it is much > > better > > to have the code directly within entry.S for all to see instead of > > having some external software play around with the syscall table in a > > way kernel users can't trace back to the kernel's own code. > > Eh ? I didn't say sys_call_table hackery was better. I said the entry.S > thing wasn't much better ... > > regards > john > ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 13:10 ` bob 0 siblings, 0 replies; 95+ messages in thread From: bob @ 2002-07-10 13:10 UTC (permalink / raw) To: John Levon Cc: Karim Yaghmour, Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob, okrieg John, I have been cc'ed on this email as I was an active participant at the RAS performance monitoring/tracing session at OLS. Let me preface by saying that my view may be a bit biased as I have worked on the the core tracing infrastructure that went into IRIX in the mid 90s as well as the tracing infrastructure for K42, our research OS (see http://www.research.ibm.com/K42/index.html), and in both cases helped solve performance issues that would not have been otherwise solved. From the below it doesn't appear that anyone is arguing that tracing is not useful. The debate (except for some of the details) appears over whether it should be included in the kernel in a first-class manner or individual mechanisms put in on an ad-hoc basis. While this is indeed philosophical, let me share some experiences and benefits from other systems I've worked on: 1) The mechanism proposed is very non-invasive, a single line of code (some TRACE_XXX macro or like) is added to the area of interest. Further, at OLS, the proposal was to add only a few trace points. Programming-wise this does not clutter the code - in fact having a single well-known unified mechanism is cleaner coding than a set of one-off ways, as when anyone sees a trace macro it will be clear what it is. 2) In the end, there will be less intrusion with a single unified approach. With a series of different mechanisms over time multiple events will get added in the same place creating performance issues and more importantly causing confusion. 3) A unified approach will uncover performance issues not explicitly being searched for and allow ones of known interest to be tracked down without adding a patch (that may be cumbersome to maintain) and re-compilation. 4) In both my experiences, I have had resistance to adding this tracing infrastructure, and in both experiences other kernel developers have come back after the fact and thanked me for my persistence :-), as it helped them solve timing sensitive or other such issues they were having great difficulty with. If there is interest, I would happy to set up a conference number so people who are interested could all speak. -bob Robert Wisniewski The K42 MP OS Project Advanced Operating Systems Scalable Parallel Systems IBM T.J. Watson Research Center 914-945-3181 http://www.research.ibm.com/K42/ bob@watson.ibm.com ---- John Levon writes: > On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: > > [snip] > > > And the list goes on. > > Sure, there are all sorts of things where some tracing can come in > useful. The question is whether it's really something the mainline > kernel should be doing, and if the gung-ho approach is nice or not. > > > The fact that so many kernel subsystems already have their own tracing > > built-in (see other posting) > > Your list was almost entirely composed of per-driver debug routines. > This is not the same thing as logging trap entry/exits, syscalls etc > etc, on any level, and I'm a bit perplexed that you're making such an > assocation. > > > expect user-space developers to efficiently use the kernel if they > > have > > absolutely no idea about the dynamic interaction their processes have > > with the kernel and how this interaction is influenced by and > > influences > > the interaction with other processes? > > This is clearly an exaggeration. And seeing as something like LTT > doesn't (and cannot) tell the "whole story" either, I could throw the > same argument directly back at you. The point is, there comes a point of > no return where usefulness gets outweighed by ugliness. For the very few > cases that such detailed information is really useful, the user can > usually install the needed special-case tools. > > In contrast a profiling mechanism that improves on the poor lot that > currently exists (gprof, readprofile) has a truly general utility, and > can hopefully be done without too much ugliness. > > The primary reason I want to see something like this is to kill the ugly > code I have to maintain. > > > > The entry.S examine-the-registers approach is simple enough, but > > > it's > > > not much more tasteful than sys_call_table hackery IMHO > > > > I guess we won't agree on this. From my point of view it is much > > better > > to have the code directly within entry.S for all to see instead of > > having some external software play around with the syscall table in a > > way kernel users can't trace back to the kernel's own code. > > Eh ? I didn't say sys_call_table hackery was better. I said the entry.S > thing wasn't much better ... > > regards > john > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-08 17:52 ` Linus Torvalds (?) (?) @ 2002-07-09 16:57 ` John Levon 2002-07-09 19:56 ` Karim Yaghmour -1 siblings, 1 reply; 95+ messages in thread From: John Levon @ 2002-07-09 16:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote: > - I'd associate each profiling event with a dentry/offset pair, simply > because that's the highest-level thing that the kernel knows about and > that is "static". This makes sense, I think. > - I'd suggest that the profiler explicitly mark the dentries it wants > profiled, so that the kernel can throw away events that we're not > interested in. The marking function would return a cookie to user > space, and increment the dentry count (along with setting the > "profile" flag in the dentry) For a system-wide profiler, this needs to be /all/ dentries that get mapped in with executable permissions, or we lose any mappings of shared libraries we don't know about etc. Essentially, oprofile wants samples against any dentry that gets mmap()ed with PROT_EXEC, so this marking would really need to happen at mmap() time. Missing out on any dentry profiles amounts to data loss in the system profile and has the potential to mislead. > - the "cookie" (which would most easily just be the kernel address of the > dentry) would be the thing that we give to user-space (along with > offset) on profile read. The user app can turn it back into a filename. > > Whether it is the original "mark this file for profiling" phase that saves > away the cookie<->filename association, or whether we also have a system > call for "return the path of this cookie", I don't much care about. > Details, details. > > Anyway, what would be the preferred interface from user level? oprofile currently receives eip-pid pairs, along with the necessary syscall tracing info needed to reconstruct file offsets. The above scheme removes the dependency on the pid, but this also unfortunately throws away some useful information. It is often useful to be able to separate out shared-library samples on a per-process (and/or per-application) basis. Any really useful profile buffer facility really needs to preserve this info, but just including the raw pid isn't going to work when user-space can't reconstruct the "name" of the pid (where "name" would be something "/bin/bash") because the process exited in the meantime. The same goes for kernel samples that happen in process context. So this might work well in tandem with some global process-tree tracing scheme, but I don't know what form that might take ... (Then there are kernel modules, but that's probably best served by patching modutils) regards john ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-09 16:57 ` John Levon @ 2002-07-09 19:56 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-09 19:56 UTC (permalink / raw) To: John Levon Cc: Linus Torvalds, Andrew Morton, linux-kernel, Richard Moore, bob I've been following oprofile's development for a while and was quite happy to see a DCPI equivalent come to Linux. The LTT data engine collection is actually inspired by the one described in the DCPI paper. As I said earlier, much of the kernel behavior information oprofile requires is made available by LTT. Instead of redirecting the syscall table, for example, LTT hooks itself within the syscall path to obtain all syscall entries and exits (all contained within #ifdef's of course). The same goes for important events such as forks/reads/writes/scheduling etc. The mapping of PID to the name of an executable, for instance, is easily extracted from this information. As with other applications/uses which require insight into the dynamics of the kernel, it would seem to me that oprofile would greatly benefit from the integration of the LTT patch in the mainline kernel. If nothing else, oprofile could use the LTT collection engine to forward its data to user-space, much like DProbes currently does. By the same token, the LTT collection engine could replace the slew of per-driver tracing systems already in the kernel, providing therefore a uniform tracing system: drivers/char/ftape/lowlevel/ftape-tracing.c drivers/char/ip2/ip2trace.c drivers/char/dtlk.c drivers/char/mwavedd.h drivers/char/n_r3964 drivers/acpi/utilities/utdebug.c drivers/cdrom/sbpcd.c drivers/isdn/eicon/eicon_mod.c drivers/scsi/gdth.c drivers/scsi/megaraid.c drivers/scsi/qlogicfc.c drivers/scsi/qlogicisp.c drivers/net/wavelan.c drivers/net/skfp/hwmtm.c drivers/net/pcmcia/wavelan_cs.c drivers/net/wireless/orinoco.c drivers/net/wireless/orinoco_cs.c drivers/video/radeonfb.c drivers/usb/pwc.h drivers/usb/hpusbscsi.c include/linux/jdb.h for fs/ext3/*.c and fs/jdb/*.c net/irda/irnet/irnet.h etc. The above list is but a sample of the files containing actual code implementing tracing and/or data collection engines. There a great deal many files that actually have trace points already in them. A simple "grep -r TRACE *" provides an interesting insight to the number of subsystems requiring tracing, each implementing their own scheme. It is time to provide a uniform tracing and high-throughput data collection engine for all to use. LTT has already been field-tested for these purposes and is very easily extended to include any additional functionality required. Any comments/thoughts are greatly appreciated. Cheers, Karim John Levon wrote: > > On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote: > > > - I'd associate each profiling event with a dentry/offset pair, simply > > because that's the highest-level thing that the kernel knows about and > > that is "static". > > This makes sense, I think. > > > - I'd suggest that the profiler explicitly mark the dentries it wants > > profiled, so that the kernel can throw away events that we're not > > interested in. The marking function would return a cookie to user > > space, and increment the dentry count (along with setting the > > "profile" flag in the dentry) > > For a system-wide profiler, this needs to be /all/ dentries that get > mapped in with executable permissions, or we lose any mappings of shared > libraries we don't know about etc. Essentially, oprofile wants samples > against any dentry that gets mmap()ed with PROT_EXEC, so this marking > would really need to happen at mmap() time. Missing out on any dentry > profiles amounts to data loss in the system profile and has the > potential to mislead. > > > - the "cookie" (which would most easily just be the kernel address of the > > dentry) would be the thing that we give to user-space (along with > > offset) on profile read. The user app can turn it back into a filename. > > > > Whether it is the original "mark this file for profiling" phase that saves > > away the cookie<->filename association, or whether we also have a system > > call for "return the path of this cookie", I don't much care about. > > Details, details. > > > > Anyway, what would be the preferred interface from user level? > > oprofile currently receives eip-pid pairs, along with the necessary > syscall tracing info needed to reconstruct file offsets. The above > scheme removes the dependency on the pid, but this also unfortunately > throws away some useful information. > > It is often useful to be able to separate out shared-library samples on > a per-process (and/or per-application) basis. Any really useful profile > buffer facility really needs to preserve this info, but just including > the raw pid isn't going to work when user-space can't reconstruct the > "name" of the pid (where "name" would be something "/bin/bash") because > the process exited in the meantime. > > The same goes for kernel samples that happen in process context. > > So this might work well in tandem with some global process-tree tracing > scheme, but I don't know what form that might take ... > > (Then there are kernel modules, but that's probably best served by > patching modutils) > > regards > john > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-07 3:05 ` Linus Torvalds 2002-07-07 3:47 ` Andrew Morton @ 2002-07-07 5:16 ` Martin J. Bligh 1 sibling, 0 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-07 5:16 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm > I'd like to enhance the profiling support a bit, to create some > infrastructure for doing different kinds of profiles, not just > the current timer-based one (and not just for the kernel). > > There's also the P4 native support for "event buffers" or whatever intel > calls them, that allows profiling at a lower level by interrupting not for > every event, but only when the hw buffer overflows. kernprof is capable of monitoring the CPU's stats gathering counters I believe. The following is thanks to Mala Anand, and I think she was monitoring the cacheline misses like this, though the mechanism allows access to various things, I can't remember if there's a TLB hit rate counter of the top of my head ... M. ------------------------- The following script is what I used to collect performance counter profiling data using kernprof: sleep 5 kernprof -r -a 0xc0 -d pmc -f 1000 -t pc -b sleep 30 kernprof -e -i -m /usr/src/linux/System.map > kpg.out sort -nr +2 kpg.out > kpg26sort.out -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* scalable kmap (was Re: vm lock contention reduction) 2002-07-07 2:50 ` Andrew Morton 2002-07-07 3:05 ` Linus Torvalds @ 2002-07-07 6:13 ` Martin J. Bligh 2002-07-07 6:37 ` Andrew Morton ` (2 more replies) 2002-07-08 0:38 ` vm lock contention reduction William Lee Irwin III 2 siblings, 3 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-07 6:13 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm > Martin is being bitten by the global invalidate more than by the lock. True. > He increased the size of the kmap pool just to reduce the invalidate > frequency and saw 40% speedups of some stuff. I think that might have been when Andrea was using persistent kmap for highpte (since fixed), so we were really kicking the **** out of it. Nonetheless, your point is perfectly correct, it's the global invalidate that's really the expensive thing. Making the kmap pool 4096 instead of 1024 resulted in a tenfold reduction in the number of global TLB flushes (there's a static set of "fixed" maps which would presumably explain the difference). As we have larger memory machines, we're under more and more KVA pressure, more stuff gets shoved into highmem, kmap usage gets heavier ... and this starts to look even worse. > Those invalidates don't show up nicely on profiles. Not the standard profile, no. We can fairly easily count the cost of the time taken to do the flush, but not the side effects of the cache-trash. According to my P4 manual, I can see an ITLB miss counter, but nothing for the data (apart from some more generic things). > I was discussing this with sct a few days back. iiuc, the proposal > was to create a small per-cpu pool (say, 4-8 pages) which is a > "front-end" to regular old kmap(). > > Any time you have one of these pages in use, the process gets > pinned onto the current CPU. Ewww! That's gross ;-) I don't want to think about how that interacts with cpus_allowed being changed, etc .... but mainly it's just gross ;-) On the other hand I won't deny it may be the most practical solution .... The really sad thing about the global TLB flush for persisant kmap is that it's probably wholly unnecessary - most likely each entry is really only dirty on 1 CPU. At a random guess (ie I haven't actually checked this), I think the vast majority of the persistent kmaps are doing something like: "kmap; copy_to/from_user; kunmap", and are just using this mechanism in case the page they're copying to / from us going to generate a page fault & schedule - thus we don't use atomic kmap. The proportions are merely speculation - if someone wants to throw stones at that (or better still, empirical measurements), feel free. I talked this over with Andrea for a while at OLS, and some of the other things we covered are below (KVA = kernel virtual addr space, UKVA = user-kernel virtual address space, per process like user space, but with the protections of kernel space). 1. Just pin the damned page. Instead of pinning the process, pin the user page you're touching in memory. To me this is more elegant, though Andrea pointed out we'd be slowing down the common case when you don't actually page fault at all. Then I mentioned something about fixing it up in the pagefault path instead so that the process would reset it's map when it got restarted, but Andrea started laughing at that point ;-) 2. Per process kmaps This kills the global kmap_lock and is fairly easy to implement with just a per-process lock. Unfortunately it still doesn't really kill the problem of having to do a global TLB flush if your process has more than 1 thread. Whilst I can limply claim that it's kind of nice for most non-threaded processes, and the fallback position is no worse than we have now, it's not panning out the way I'd hoped. 3. Per task kmaps The way we've been doing UKVA up to now is to map it at the top of the user address space, wedging it above the stack, and it's on a per-process basis. If we moved the area up to the window in the top 128Mb of KVA instead, it might be easier to make it really per task instead, which'd be much more useful for kmap. Then you'd need to allocate 1 virtual frame per task and you'd get atomic, persistent, per-task kmap. Which I think isn't as bad as it sounds, because you can actually reuse the same virtual page for all tasks ... or at least I think you can ... it's late, and I need to twist my brain through that one some more. I think this'd stop you having a shared kernel top level pagetable for all tasks though, so it probably causes more problems than it solves. 4. Another kludge. Supposing we didn't really map the page at all, just left it as an invalid page table entry, and just shoved an entry into the TLB instead (there's a way of doing that somewhere) ... chances are very good it'd stay there long enough to do the operation, if it fell out, or we got migrated, you'd just get a pagefault and patch it back up again ... ;-) > I believe that IBM have 32gig, 8- or 16-CPU ia32 machines just > coming into production now. Presumably, they're not the only > ones. We're stuck with this mess for another few years. We do indeed - the latest ones we'll actually be selling can actually go even larger. I now have a 32 way 32Gb P3 box in the lab from older hardware for experiments, and have enough hardware to make something even bigger if I'm feeling particularly suicidal one day ... whilst I'd agree that large 32 bit boxes are a temporary transition phase (and a pain in the ass) it'll be a while before we have Hammer systems large enough for this sort of thing. Or if you have a few million dollars, Anton has a very nice PPC64 machine he can sell you ;-) M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh @ 2002-07-07 6:37 ` Andrew Morton 2002-07-07 7:53 ` Linus Torvalds 2002-07-08 17:29 ` Martin J. Bligh 2 siblings, 0 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-07 6:37 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm "Martin J. Bligh" wrote: > > ... > > I was discussing this with sct a few days back. iiuc, the proposal > > was to create a small per-cpu pool (say, 4-8 pages) which is a > > "front-end" to regular old kmap(). > > > > Any time you have one of these pages in use, the process gets > > pinned onto the current CPU. > > Ewww! That's gross ;-) Hey. So is highmem ;) But sys_sched_affinity() allows you to change the affinity of other tasks (it takes a pid). So that's torn that idea. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh 2002-07-07 6:37 ` Andrew Morton @ 2002-07-07 7:53 ` Linus Torvalds 2002-07-07 9:04 ` Andrew Morton 2002-07-07 16:00 ` Martin J. Bligh 2002-07-08 17:29 ` Martin J. Bligh 2 siblings, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 7:53 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm On Sat, 6 Jul 2002, Martin J. Bligh wrote: > > I think that might have been when Andrea was using persistent kmap > for highpte (since fixed), so we were really kicking the **** out > of it. Nonetheless, your point is perfectly correct, it's the > global invalidate that's really the expensive thing. I suspect that there really aren't that many places that care about the persistent mappings, and the atomic per-cpu stuff is inherently scalable (but due to being harder to cache, slower). So I wonder how much of a problem the kmap stuff really is. So if the main problem ends up being that some paths (a) really want the persistent version _and_ (b) you can make the paths hold them for long times (by writing to a blocking pipe/socket or similar) we may just have much simpler approaches - like a per-user kmap count. It's not hard to make kmap() do something the equivalent of down(current->user->kmap_max); and make kunmap() just do the "up()", and then just initialize the user kmap_max semaphore to something simple like 100. Which just guarantees that any user at any time can only hold 100 concurrent persistent kmap's open. Problem solved. (and yeah, you can make it configurable on a per-user level or something, so that if it turns out that oracle really has 100 threads all blocking on a socket at the same time, you can admin up the oracle count). The _performance_ scalability concerns should be fairly easily solvable (as far as I can tell - feel free to correct me) by making the persistent array bigger and finding things where persistency isn't needed (and possibly doesn't even help due to lack of locality), and just making those places use the per-cpu atomic ones. Eh? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 7:53 ` Linus Torvalds @ 2002-07-07 9:04 ` Andrew Morton 2002-07-07 16:13 ` Martin J. Bligh 2002-07-07 18:31 ` Linus Torvalds 2002-07-07 16:00 ` Martin J. Bligh 1 sibling, 2 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-07 9:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm Linus Torvalds wrote: > > ... > The _performance_ scalability concerns should be fairly easily solvable > (as far as I can tell - feel free to correct me) by making the persistent > array bigger and finding things where persistency isn't needed (and > possibly doesn't even help due to lack of locality), and just making those > places use the per-cpu atomic ones. > > Eh? You mean just tune up the existing code, basically. There's certainly plenty of opportunity to do that. Probably the biggest offenders are generic_file_read/write. In generic_file_write() we're already faulting in the user page(s) beforehand (somewhat racily, btw). We could formalise that into a pin_user_page_range() or whatever and use an atomic kmap in there. Use the same thing in file_read_actor. The fs/buffer.c code is very profilgate. It kmaps the page unconditionally, but in the great majority of cases, it never uses that kmap. Plus we're still kmapping pages twice on the prepare_write/commit_write path. That doesn't use another kmap. But. clear_user_highpage() is using atomic kmap already. Networking uses kmaps, but from a quick peek it seems that it's mostly using kmap_atomic already, and a pin_user_page_range()/unpin_user_page_range() API could be dropped in there quite easily if needed. ext2 directories will require some thought. Martin, what sort of workload were you seeing the problems with? I can fix the buffer.c and filemap.c stuff and then we can take another look at it. I'm just not too sure about the pin_user_page() thing. How expensive is a page table walk in there likely to be? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 9:04 ` Andrew Morton @ 2002-07-07 16:13 ` Martin J. Bligh 2002-07-07 18:31 ` Linus Torvalds 1 sibling, 0 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-07 16:13 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm > You mean just tune up the existing code, basically. > > There's certainly plenty of opportunity to do that. > > Probably the biggest offenders are generic_file_read/write. In > generic_file_write() we're already faulting in the user page(s) > beforehand (somewhat racily, btw). We could formalise that into > a pin_user_page_range() or whatever and use an atomic kmap > in there. That was basically what Andrea and I were discussing at OLS - his concern (IIRC) was that the pinning operation would slow things down (exactly the concern you mention later on). > Martin, what sort of workload were you seeing the problems with? I think I was just playing with kernel compile again, but I can get our performance group to gather some real numbers for a variety of benchmarks. With kernel compile there were only two large-scale callers, but I don't remember what they were off the top of my head. I'll give you some initial figures tommorow. > I'm just not too sure about the pin_user_page() thing. How > expensive is a page table walk in there likely to be? It looks like the code does try to recover from a page fault right now ... but it just falls over and says the write failed. Is there some way we could just make it abort the copy to user, and restart the copy sequence in the (unlikely) case of a page fault? If we're going to play around with that sort of thing (or indeed do what the current code does), should this all be wrapped in a "copy_to/from_high_user" function or something similar? Cleaner, easier to change later, and less places for people to screw up. M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 9:04 ` Andrew Morton 2002-07-07 16:13 ` Martin J. Bligh @ 2002-07-07 18:31 ` Linus Torvalds 2002-07-07 18:55 ` Linus Torvalds 2002-07-08 7:24 ` Andrew Morton 1 sibling, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 18:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm On Sun, 7 Jul 2002, Andrew Morton wrote: > > Probably the biggest offenders are generic_file_read/write. In > generic_file_write() we're already faulting in the user page(s) > beforehand (somewhat racily, btw). We could formalise that into > a pin_user_page_range() or whatever and use an atomic kmap > in there. I'd really prefer not to. We're talking of a difference between one single-cycle instruction (the address should be in the TLB 99% of all times), and a long slow TLB walk with various locks etc. Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the write itself may need to block for other reasons (ie socket buffer full etc). THAT is the one that can get misused - the others are not a big deal, I think. So kmap_atomic definitely doesn't work there. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 18:31 ` Linus Torvalds @ 2002-07-07 18:55 ` Linus Torvalds 2002-07-07 19:02 ` Linus Torvalds 2002-07-08 7:24 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 18:55 UTC (permalink / raw) To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm On Sun, 7 Jul 2002, Linus Torvalds wrote: > > Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the > write itself may need to block for other reasons (ie socket buffer full > etc). THAT is the one that can get misused - the others are not a big > deal, I think. > > So kmap_atomic definitely doesn't work there. We do actually have an alternate approach: get rid of the "kmap()" in file_send_actor() altogether, and require targets of sendfile() to always support the sendpage() interface (which can do a kmap at a lower level if they need to - it's not even guaranteed that they do need to). I suspect most uses of sendfile() are TCP, which already does sendpage(). That would largely get rid of the DoS worry (can anybody see any other ways of forcing a arbitrarily long sleep on a kmap'ed page?) It still leaves the scalability issue, but quite frankly, so far I'd still much rather optimize for the regular hardware and screw scalability if it slows down the 4GB UP (or 2P) case. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 18:55 ` Linus Torvalds @ 2002-07-07 19:02 ` Linus Torvalds 0 siblings, 0 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 19:02 UTC (permalink / raw) To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm On Sun, 7 Jul 2002, Linus Torvalds wrote: > > We do actually have an alternate approach: get rid of the "kmap()" in > file_send_actor() altogether, and require targets of sendfile() to always > support the sendpage() interface (which can do a kmap at a lower level if > they need to - it's not even guaranteed that they do need to). Ok, I'm going to just check that in to 2.5.x, and see who (if anybody) screams. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 18:31 ` Linus Torvalds 2002-07-07 18:55 ` Linus Torvalds @ 2002-07-08 7:24 ` Andrew Morton 2002-07-08 8:09 ` Andrea Arcangeli 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-08 7:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm Linus Torvalds wrote: > > On Sun, 7 Jul 2002, Andrew Morton wrote: > > > > Probably the biggest offenders are generic_file_read/write. In > > generic_file_write() we're already faulting in the user page(s) > > beforehand (somewhat racily, btw). We could formalise that into > > a pin_user_page_range() or whatever and use an atomic kmap > > in there. > > I'd really prefer not to. We're talking of a difference between one > single-cycle instruction (the address should be in the TLB 99% of all > times), and a long slow TLB walk with various locks etc. > > Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the > write itself may need to block for other reasons (ie socket buffer full > etc). THAT is the one that can get misused - the others are not a big > deal, I think. > > So kmap_atomic definitely doesn't work there. > OK, I've been through everything and all the filesystems and written four patches which I'll throw away. I think I know how to do all this now. - Convert buffer.c to atomic kmaps. - prepare_write/commit_write no longer do any implicit kmapping at all. - file_read_actor and generic_file_write do their own atomic_kmap (more on this below). - file_send_actor still does kmap. - If a filesystem wants its page kmapped between prepare and commit, it does it itself. So foo_prepare_write() { int ret; ret = block_prepare_write(); if (ret == 0) kmap(page); return ret; } foo_commit_write() { kunmap(page); return generic_commit_write(); } So in the case of ext2, we can split the directory and S_ISREG a_ops. The directory a_ops will kmap the page. The S_ISREG a_ops will not. Basically: no implicit kmaps. You do it yourself if you want it, and if you cannot do atomic kmaps. Now, file_read_actor and generic_file_write still have the problem of the target userspace page getting evited while they're holding an atomic kmap. But the rmap page eviction code has the mm_struct. So can we not do this: generic_file_write() { ... atomic_inc(¤t->mm->dont_unmap_pages); { volatile char dummy; __get_user(dummy, addr); __get_user(dummy, addr+bytes+1); } lock_page(); ->prepare_write() kmap_atomic() copy_from_user() kunmap_atomic() ->commit_write() atomic_dec(¤t->mm->dont_unmap_pages); unlock_page() } and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages. Obviously, all this is dependent on CONFIG_HIGHMEM. Workable? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 7:24 ` Andrew Morton @ 2002-07-08 8:09 ` Andrea Arcangeli 2002-07-08 14:50 ` William Lee Irwin III 2002-07-08 20:39 ` Andrew Morton 0 siblings, 2 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-08 8:09 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 12:24:09AM -0700, Andrew Morton wrote: > Linus Torvalds wrote: > > > > On Sun, 7 Jul 2002, Andrew Morton wrote: > > > > > > Probably the biggest offenders are generic_file_read/write. In > > > generic_file_write() we're already faulting in the user page(s) > > > beforehand (somewhat racily, btw). We could formalise that into > > > a pin_user_page_range() or whatever and use an atomic kmap > > > in there. > > > > I'd really prefer not to. We're talking of a difference between one > > single-cycle instruction (the address should be in the TLB 99% of all > > times), and a long slow TLB walk with various locks etc. > > > > Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the > > write itself may need to block for other reasons (ie socket buffer full > > etc). THAT is the one that can get misused - the others are not a big > > deal, I think. > > > > So kmap_atomic definitely doesn't work there. > > > > OK, I've been through everything and all the filesystems and > written four patches which I'll throw away. I think I know > how to do all this now. > > - Convert buffer.c to atomic kmaps. > > - prepare_write/commit_write no longer do any implicit kmapping > at all. > > - file_read_actor and generic_file_write do their own atomic_kmap > (more on this below). > > - file_send_actor still does kmap. > > - If a filesystem wants its page kmapped between prepare and commit, > it does it itself. So > > foo_prepare_write() > { > int ret; > > ret = block_prepare_write(); > if (ret == 0) > kmap(page); > return ret; > } > > foo_commit_write() > { > kunmap(page); > return generic_commit_write(); > } > > So in the case of ext2, we can split the directory and S_ISREG a_ops. > The directory a_ops will kmap the page. The S_ISREG a_ops will not. > > > Basically: no implicit kmaps. You do it yourself if you want it, and > if you cannot do atomic kmaps. > > > Now, file_read_actor and generic_file_write still have the problem > of the target userspace page getting evited while they're holding an > atomic kmap. > > But the rmap page eviction code has the mm_struct. So can we not do this: > > generic_file_write() > { > ... > atomic_inc(¤t->mm->dont_unmap_pages); > > { > volatile char dummy; > __get_user(dummy, addr); > __get_user(dummy, addr+bytes+1); > } > lock_page(); > ->prepare_write() > kmap_atomic() > copy_from_user() > kunmap_atomic() > ->commit_write() > atomic_dec(¤t->mm->dont_unmap_pages); > unlock_page() > } > > and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages. > > Obviously, all this is dependent on CONFIG_HIGHMEM. > > Workable? the above pseudocode still won't work correctly, if you don't pin the page as Martin proposed and you only rely on its virtual mapping to stay there because the page can go away under you despite the swap_out/rmap-unmapping work, if there's a parallel thread running munmap+re-mmap under you. So at the very least you need the mmap_sem at every generic_file_write to avoid other threads to change your virtual address under you. And you'll basically need to make the mmap_sem recursive, because you have to take it before running __get_user to avoid races. You could easily do that using my rwsem, I made two versions of them, with one that supports recursion, however this is just for your info, I'm not suggesting to make it recursive. furthmore rmap provides no advantages at all here, swap_out as well will have to learn about the mm_struct before it has a chance to try to unmap a mm_struct. side note: I heard a "I need rmap for this" from a number of people so far, and they were all wrong so far, none of them would get any advantage from rmap, one of them (the closer one to really need rmap) wasn't aware that we just have rmap for all shared mappings, and he needed the rmap information for the shared mappings for the same reason we need the rmap information to keep the shared mappings synchronized with truncate. The only reason I can imagine rmap useful in todays hardware for all kind of vma (what the patch provides compared to what we have now) is to more efficiently defragment ram with an algorithm in the memory balancing to provide largepages more efficiently from mixed zones, if somebody would suggest rmap for this reason (nobody did yet) I would have to agree completely that it is very useful for that, OTOH it seems everybody is reserving (or planning to reserve) a zone for largepages anyways so that we don't run into fragmentation in the first place. And btw - talking about largepages - we have three concurrent and controversial largepage implementations for linux available today, they all have different API, one is even shipped in production by a vendor, and while auditing the code I seen it also exports an API visible to userspace [ignoring the sysctl] (unlike what I was told): +#define MAP_BIGPAGE 0x40 /* bigpage mapping */ [..] _trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) | _trans(flags, MAP_DENYWRITE, VM_DENYWRITE) | + _trans(flags, MAP_BIGPAGE, VM_BIGMAP) | _trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE); return prot_bits | flag_bits; #undef _trans that's a new unofficial bitflag to mmap that any proprietary userspace can pass to mmap today. Other implementations of the largepage feature use madvise or other syscalls to tell the kernel to allocate largepages. At least the above won't return -EINVAL so the binaryonly app will work transparently on a mainline kernel, but it can eventually malfunction if we use 0x40 for something else in 2.5. So I think we should do something about the largepages too ASAP into 2.5 (like async-io). Returning to the above kmap hack (assuming you take the mmap_sem and you fix your instability), your hack will destabilize the vm by design and it will run the machine oom despite of lots of swap available, think all tasks taking the page fault in __get_user due a swapin at the same time, and not being able to swapout some memory to resolve the __get_user swapin because you pinned all the address spaces, they'll run oom despite there's still lots of swap free (of course with the oom killer and the infinite loop in the allocator such condition will deadlock the kernel instead, it's one of the cases where nobody is going to teach the oom killer to detect such condition as a case where it has to oom kill because there's still lots of vm available at that time; so to be accurate I meant with my vm updates applied the kernel will run oom, while a mainline kernel will silenty deadlock). So I I'm not really happy with this mm-pinning-during-page-fault design solution (regardless if you prefer to deadlock or to run oom, you know I prefer the latter :). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 8:09 ` Andrea Arcangeli @ 2002-07-08 14:50 ` William Lee Irwin III 2002-07-08 20:39 ` Andrew Morton 1 sibling, 0 replies; 95+ messages in thread From: William Lee Irwin III @ 2002-07-08 14:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 10:09:53AM +0200, Andrea Arcangeli wrote: > with truncate. The only reason I can imagine rmap useful in todays > hardware for all kind of vma (what the patch provides compared to what > we have now) is to more efficiently defragment ram with an algorithm in > the memory balancing to provide largepages more efficiently from mixed > zones, if somebody would suggest rmap for this reason (nobody did yet) I > would have to agree completely that it is very useful for that, OTOH it A number of uses of reverse mappings come to mind. There is the use of rmap for Linux running as a guest instance returning memory to host OS's and/or firmware. This involves evicting specific physically contiguous regions. I believe UML and some ppc64 ports would like to do this. I intend to experiment with using rmap for virtual cache invalidation on some of my sun4c machines in my spare time as well, though given the general importance of virtual caches that's not likely to be a good motivator for reverse mapping support. I believe the 2.4.x- based rmap tree also had a comment suggesting it could be used to more directly address general multipage allocation failures due to fragmentation, but I'm unaware of any particular attempts to use them for general online defragmentation. While it appears physical scanning could provide some benefit for page replacement in the presence of large amounts of shared memory, as the number of virtual pages present across all processes could be a large multiple of the number of physical pages present in a system, I'll leave the final judgment of its effectiveness there to those more frequently involved in page replacement issues. Perhaps collecting statistics on pte_chain lengths could be useful here, as that would give some notion of how much additional work is generated by sharing for the virtual scan algorithm. On the virtual scanning side, perhaps collecting statistics on how frequently the memclass() check fails in try_to_swap_out() might give some notion of how advantageous physical scanning is with respect to being able to pinpoint pressured regions. Other statistics like the scan rate and so on might be trickier with the virtual scan since some pages may be scanned multiple times. Maybe another thing to check is how often a pte is invalidated without actually being able to evict the page. I did something for several of these a while ago but am not sure what happened to it. I'll see what I can brew up today, especially since there are some unanswered questions still about another scenario I'm able to reproduce involving excess dirty memory I've been asked to collect more information on. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 8:09 ` Andrea Arcangeli 2002-07-08 14:50 ` William Lee Irwin III @ 2002-07-08 20:39 ` Andrew Morton 2002-07-08 21:08 ` Benjamin LaHaise 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-08 20:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm Andrea Arcangeli wrote: > > ... > > generic_file_write() > > { > > ... > > atomic_inc(¤t->mm->dont_unmap_pages); > > > > { > > volatile char dummy; > > __get_user(dummy, addr); > > __get_user(dummy, addr+bytes+1); > > } > > lock_page(); > > ->prepare_write() > > kmap_atomic() > > copy_from_user() > > kunmap_atomic() > > ->commit_write() > > atomic_dec(¤t->mm->dont_unmap_pages); > > unlock_page() > > } > > > > and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages. > > > > Obviously, all this is dependent on CONFIG_HIGHMEM. > > > > Workable? > > the above pseudocode still won't work correctly, Sure. It's crap. It can be used to get mlockall() for free. > if you don't pin the > page as Martin proposed and you only rely on its virtual mapping to stay > there because the page can go away under you despite the > swap_out/rmap-unmapping work, if there's a parallel thread running > munmap+re-mmap under you. So at the very least you need the mmap_sem at > every generic_file_write to avoid other threads to change your virtual > address under you. And you'll basically need to make the mmap_sem > recursive, because you have to take it before running __get_user to > avoid races. You could easily do that using my rwsem, I made two versions > of them, with one that supports recursion, however this is just for your > info, I'm not suggesting to make it recursive. I think I'll just go for pinning the damn page. It's a spinlock and maybe three cachelines but the kernel is about to do a 4k memcpy anyway. And get_user_pages() doesn't show up much on O_DIRECT profiles and it'll be a net win and we need to do SOMETHING, dammit. > ... > The only reason I can imagine rmap useful in todays > hardware for all kind of vma (what the patch provides compared to what > we have now) is to more efficiently defragment ram with an algorithm in > the memory balancing to provide largepages more efficiently from mixed > zones, if somebody would suggest rmap for this reason (nobody did yet) It has been discussed. But no action yet. > I > would have to agree completely that it is very useful for that, OTOH it > seems everybody is reserving (or planning to reserve) a zone for > largepages anyways so that we don't run into fragmentation in the first > place. And btw - talking about largepages - we have three concurrent and > controversial largepage implementations for linux available today, they > all have different API, one is even shipped in production by a vendor, What implementation do you favour? > and while auditing the code I seen it also exports an API visible to > userspace [ignoring the sysctl] (unlike what I was told): > > +#define MAP_BIGPAGE 0x40 /* bigpage mapping */ > [..] > _trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) | > _trans(flags, MAP_DENYWRITE, VM_DENYWRITE) | > + _trans(flags, MAP_BIGPAGE, VM_BIGMAP) | > _trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE); > return prot_bits | flag_bits; > #undef _trans > > that's a new unofficial bitflag to mmap that any proprietary userspace > can pass to mmap today. Other implementations of the largepage feature > use madvise or other syscalls to tell the kernel to allocate > largepages. At least the above won't return -EINVAL so the binaryonly > app will work transparently on a mainline kernel, but it can eventually > malfunction if we use 0x40 for something else in 2.5. So I think we > should do something about the largepages too ASAP into 2.5 (like > async-io). Yup. I don't think the -aa kernel has a large page patch, does it? Is that something which you have time to look into? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 20:39 ` Andrew Morton @ 2002-07-08 21:08 ` Benjamin LaHaise 2002-07-08 21:45 ` Andrew Morton 0 siblings, 1 reply; 95+ messages in thread From: Benjamin LaHaise @ 2002-07-08 21:08 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 01:39:04PM -0700, Andrew Morton wrote: > I think I'll just go for pinning the damn page. It's a spinlock and > maybe three cachelines but the kernel is about to do a 4k memcpy > anyway. And get_user_pages() doesn't show up much on O_DIRECT > profiles and it'll be a net win and we need to do SOMETHING, dammit. Pinning the page costs too much (remember, it's only a win with a reduced copy of more that 512 bytes). The right way of doing it is letting copy_*_user fail on a page fault for places like this where we need to drop locks before going into the page fault handler. -ben -- "You will be reincarnated as a toad; and you will be much happier." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 21:08 ` Benjamin LaHaise @ 2002-07-08 21:45 ` Andrew Morton 2002-07-08 22:24 ` Benjamin LaHaise 0 siblings, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-08 21:45 UTC (permalink / raw) To: Benjamin LaHaise Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm Benjamin LaHaise wrote: > > On Mon, Jul 08, 2002 at 01:39:04PM -0700, Andrew Morton wrote: > > I think I'll just go for pinning the damn page. It's a spinlock and > > maybe three cachelines but the kernel is about to do a 4k memcpy > > anyway. And get_user_pages() doesn't show up much on O_DIRECT > > profiles and it'll be a net win and we need to do SOMETHING, dammit. > > Pinning the page costs too much (remember, it's only a win with a > reduced copy of more that 512 bytes). Could you expand on that? > The right way of doing it is > letting copy_*_user fail on a page fault for places like this where > we need to drop locks before going into the page fault handler. OK. There are a few things which need to be fixed up in there. One is to drop and reacquire the atomic kmap. Another is the page lock (for the write-to-mmaped-page-from-the-same-file thing). Another is to undo the ->prepare_write call. Or to remember to not run it again on the retry. It's really the page lock which is the tricky one. It could be a new, uninitialised page. It's in pagecache and it is not fully uptodate. If we drop the page lock and that page is inside i_size then the kernel has exposed uninitialised data. Tricky. A sleazy approach would be to not unlock the page at all. ie: no change. Sure, the kernel can deadlock. But it's always been that way - the deadlock requires two improbable things, whereas the schedule-inside-atomic-kmap requires just one. hmm. Bit stumped on that one. Btw, is it safe to drop and reacquire an atomic kmap if you found out that you accidentally slept while holding it? - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 21:45 ` Andrew Morton @ 2002-07-08 22:24 ` Benjamin LaHaise 0 siblings, 0 replies; 95+ messages in thread From: Benjamin LaHaise @ 2002-07-08 22:24 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 02:45:35PM -0700, Andrew Morton wrote: > > Pinning the page costs too much (remember, it's only a win with a > > reduced copy of more that 512 bytes). > > Could you expand on that? I'm going from data that I gather while fiddling with aio and the pipe code. As a thought experiment, look at it this way: pinning the page involves a minimum 4-5 data dependent cache line accesses (mm struct, 2-3 page table levels, then a locked cycle on the page struct itself) compared to the use of tlb entries that are likely to be present (free, plus recent cpus have hardware to prefect them completely asynchronous to instruction execution). > > The right way of doing it is > > letting copy_*_user fail on a page fault for places like this where > > we need to drop locks before going into the page fault handler. > > OK. There are a few things which need to be fixed up in there. One > is to drop and reacquire the atomic kmap. Another is the page > lock (for the write-to-mmaped-page-from-the-same-file thing). > Another is to undo the ->prepare_write call. Or to remember to not > run it again on the retry. > > It's really the page lock which is the tricky one. It could be > a new, uninitialised page. It's in pagecache and it is not > fully uptodate. If we drop the page lock and that page is > inside i_size then the kernel has exposed uninitialised data. Hmmm, do we really need to insert a new, uninitialised page into the page cache before filling it with data? If we could defer that until the data is copied into the page (most of the time there would be no collisions during writes, so a spurious copy is unlikely) Side note: I did an alternative fix for this which just stuffed a copy of the struct page * into the task struct, and checked for this inside filemap.c. Very gross, but it worked. > Tricky. A sleazy approach would be to not unlock the page at > all. ie: no change. Sure, the kernel can deadlock. But it's > always been that way - the deadlock requires two improbable things, > whereas the schedule-inside-atomic-kmap requires just one. It's not unlikely if you've got a malicious user behind the shell. > Btw, is it safe to drop and reacquire an atomic kmap if you > found out that you accidentally slept while holding it? Yes and no: it works, but if debugging is enabled it bugs out. -ben -- "You will be reincarnated as a toad; and you will be much happier." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 7:53 ` Linus Torvalds 2002-07-07 9:04 ` Andrew Morton @ 2002-07-07 16:00 ` Martin J. Bligh 2002-07-07 18:28 ` Linus Torvalds 2002-07-08 7:00 ` Andrea Arcangeli 1 sibling, 2 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-07 16:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm >> I think that might have been when Andrea was using persistent kmap >> for highpte (since fixed), so we were really kicking the **** out >> of it. Nonetheless, your point is perfectly correct, it's the >> global invalidate that's really the expensive thing. > > I suspect that there really aren't that many places that care about the > persistent mappings, and the atomic per-cpu stuff is inherently scalable > (but due to being harder to cache, slower). So I wonder how much of a > problem the kmap stuff really is. > > So if the main problem ends up being that some paths (a) really want the > persistent version _and_ (b) you can make the paths hold them for long > times (by writing to a blocking pipe/socket or similar) we may just have > much simpler approaches - like a per-user kmap count. I don't think they really want something that's persistant over a long time, I think they want something they can hold over a potential reschedule - that's why they're not using kmap_atomic. > Which just guarantees that any user at any time can only hold 100 > concurrent persistent kmap's open. Problem solved. We're not running out of kmaps in the pool, we're just churning them (and dirtying them) at an unpleasant rate. Every time we exhaust the pool, we do a global TLB flush on all CPUs, which sucks for performance. > The _performance_ scalability concerns should be fairly easily solvable > (as far as I can tell - feel free to correct me) by making the persistent > array bigger Making the array bigger does help, but it consumes some more virtual address space, which the most critical resource on these machines ... at the moment we use up 1024 entries, which is 4Mb, I normally set things to 4096, which uses 16Mb - certainly that would be a better default for larger machines. But if I make it much bigger than that, I start to run out of vmalloc space ;-) Of course we could just add the size of the kmap pool to _VMALLOC_RESERVE, which would be somewhat better ... > and finding things where persistency isn't needed (and > possibly doesn't even help due to lack of locality), and just > making those places use the per-cpu atomic ones. I'm kind of handwaving at this point because I don't have the stats to hand. I had Keith gather some stats on this, and see what was actually calling kmap - I'll dig those out on Monday when I'm back in the office, and send them along. I was kind of hoping to find some elegant killer solution to all this, but it's been kicked around for a while now, and every solution seems to have its problems. If it can't be elegantly solved, we can probably kill the performance issue by just tuning, as you say ... M. PS. One interesting thing Keith found was this: on NUMA-Q, I currently do the IPI send for smp_call_function (amongst other things) as a sequenced unicast (send a seperate message to each CPU in turn), rather than the normal broadcast because it's harder to do in clustered apic mode. Whilst trying to switch this back, he found it ran faster as the sequenced unicast, not only for NUMA-Q, but also for standard SMP boxes!!! I'm guessing the timing offset generated helps cacheline or lock contention ... interesting anyway. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 16:00 ` Martin J. Bligh @ 2002-07-07 18:28 ` Linus Torvalds 2002-07-08 7:11 ` Andrea Arcangeli 2002-07-08 10:15 ` Eric W. Biederman 2002-07-08 7:00 ` Andrea Arcangeli 1 sibling, 2 replies; 95+ messages in thread From: Linus Torvalds @ 2002-07-07 18:28 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm On Sun, 7 Jul 2002, Martin J. Bligh wrote: > > > > So if the main problem ends up being that some paths (a) really want the > > persistent version _and_ (b) you can make the paths hold them for long > > times (by writing to a blocking pipe/socket or similar) we may just have > > much simpler approaches - like a per-user kmap count. > > I don't think they really want something that's persistant over a > long time, I think they want something they can hold over a potential > reschedule - that's why they're not using kmap_atomic. Some things could be _loong_, ie the "sendfile()" case to a socket. Worse, it can be intentionally made arbitrarily long by just setting up a socket that has no readers, which is why it is potentially a DoS thing. This is why I suggested the per-user kmap semaphore: not because it will ever trigger in normal load (where "normal load" means even _extremely_ high load by people who don't actively try to break the system), but because it will be a resource counter for a potential attack and limit the attacker down to the point where it isn't a problem. > > Which just guarantees that any user at any time can only hold 100 > > concurrent persistent kmap's open. Problem solved. > > We're not running out of kmaps in the pool, we're just churning them > (and dirtying them) at an unpleasant rate. Every time we exhaust the > pool, we do a global TLB flush on all CPUs, which sucks for performance. That's the "normal load" case, fixable by just making the kmap pool larger. That's ot what the semaphore is there for. > > The _performance_ scalability concerns should be fairly easily solvable > > (as far as I can tell - feel free to correct me) by making the persistent > > array bigger > > Making the array bigger does help, but it consumes some more virtual > address space, which the most critical resource on these machines ... > at the moment we use up 1024 entries, which is 4Mb, I normally set > things to 4096, which uses 16Mb - certainly that would be a better > default for larger machines. But if I make it much bigger than that, > I start to run out of vmalloc space ;-) Of course we could just add > the size of the kmap pool to _VMALLOC_RESERVE, which would be somewhat > better ... I don't see 16MB of virtual space as being a real problem on a 64GB machine. > PS. One interesting thing Keith found was this: on NUMA-Q, I currently > do the IPI send for smp_call_function (amongst other things) as a > sequenced unicast (send a seperate message to each CPU in turn), > rather than the normal broadcast because it's harder to do in > clustered apic mode. Whilst trying to switch this back, he found it ran > faster as the sequenced unicast, not only for NUMA-Q, but also for > standard SMP boxes!!! I'm guessing the timing offset generated helps > cacheline or lock contention ... interesting anyway. Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU is stupid enough to do a locked cycle to update the "A" bit on the segments (even if it is already set), you would see horrible cacheline bouncing for any interrupt. I don't know if that is the case. I'd _assume_ that the microcode was clever enough to not do this, but who knows. It should be fairly easily testable (just "SMOP") by duplicating the IDT/GDT across CPU's. I don't think the cross-calls should have any locks in them, although there does seem to be some silly things like "flush_cpumask" that should probably just be in the "cpu_tlbstate[cpu] array instead (no cacheline bouncing, and since we touch that array anyway, it should be better for the cache in other ways too). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 18:28 ` Linus Torvalds @ 2002-07-08 7:11 ` Andrea Arcangeli 2002-07-08 10:15 ` Eric W. Biederman 1 sibling, 0 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-08 7:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-mm On Sun, Jul 07, 2002 at 11:28:03AM -0700, Linus Torvalds wrote: > Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU > is stupid enough to do a locked cycle to update the "A" bit on the > segments (even if it is already set), you would see horrible cacheline > bouncing for any interrupt. > > I don't know if that is the case. I'd _assume_ that the microcode was > clever enough to not do this, but who knows. It should be fairly easily > testable (just "SMOP") by duplicating the IDT/GDT across CPU's. if that would be the problem, it would be not specific to IPI though, I would be surprised if the cpu would be that stupid to lock in such an always read-only place, it would showup in any smp regardless of the smp_call_function. OTOH I don't know why the hardware would even try to lock implicitly there. > I don't think the cross-calls should have any locks in them, although > there does seem to be some silly things like "flush_cpumask" that should > probably just be in the "cpu_tlbstate[cpu] array instead (no cacheline > bouncing, and since we touch that array anyway, it should be better for > the cache in other ways too). agreed, I overlooked it, flush_cpumask is also a good candidate to be made per-cpu. The others as said are the ->started/finished in the call_data array (that's generic for all smp_call_function). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 18:28 ` Linus Torvalds 2002-07-08 7:11 ` Andrea Arcangeli @ 2002-07-08 10:15 ` Eric W. Biederman 1 sibling, 0 replies; 95+ messages in thread From: Eric W. Biederman @ 2002-07-08 10:15 UTC (permalink / raw) To: Linus Torvalds Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm Linus Torvalds <torvalds@transmeta.com> writes: > Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU > is stupid enough to do a locked cycle to update the "A" bit on the > segments (even if it is already set), you would see horrible cacheline > bouncing for any interrupt. > > I don't know if that is the case. I'd _assume_ that the microcode was > clever enough to not do this, but who knows. It should be fairly easily > testable (just "SMOP") by duplicating the IDT/GDT across CPU's. If you don't carry about the "A" bit and I don't think we do this is trivial preventable. You can set when you initialize the GDT/IDT and it will never be updated. I had to make this change a while ago in LinuxBIOS because P4's lock up when you load a GDT from a ROM that doesn't have the accessed bit set. The fact it doesn't lock up is a fairly good proof that no writes happen when the accessed bit is already set. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 16:00 ` Martin J. Bligh 2002-07-07 18:28 ` Linus Torvalds @ 2002-07-08 7:00 ` Andrea Arcangeli 1 sibling, 0 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-08 7:00 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Linus Torvalds, Andrew Morton, Rik van Riel, linux-mm On Sun, Jul 07, 2002 at 09:00:27AM -0700, Martin J. Bligh wrote: > clustered apic mode. Whilst trying to switch this back, he found it ran > faster as the sequenced unicast, not only for NUMA-Q, but also for > standard SMP boxes!!! I'm guessing the timing offset generated helps > cacheline or lock contention ... interesting anyway. makes sense. but it sounds like it should be fixed in the way we synchronize with the ipis, rather than by executing them in sequence. We should just have the smp_call_function poll (read-only) a list of per-cpu data, and have all the other cpus inside the ipi modifying their own per-cpu cachelines. Right now the ipi callback works on a shared cacheline, that is call_data->started/finished, that could be probably per-cpu without much problems, just having the smp_call_function reading all the per-cpu fields rather than only the current global ones. things like tlb flushing are all per-cpu, with per-cpu tlbdata informations, the only bottleneck really only seems the smp_call_function_interrupt implementation that uses global counter instead of a per-cpu counters. it will be a bit similar to the big-reader-lock algorithm, the writer polling the per-cpu counters here is the smp_call_function, and the reader modifying its per-cpu counter is smp_call_function_interrupt. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh 2002-07-07 6:37 ` Andrew Morton 2002-07-07 7:53 ` Linus Torvalds @ 2002-07-08 17:29 ` Martin J. Bligh 2002-07-08 22:14 ` Linus Torvalds 2002-07-09 3:17 ` Andrew Morton 2 siblings, 2 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-08 17:29 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm OK, here's the data from Keith that I was promising on kmap. This was just for a kernel compile. So copy_strings and file_read_actor seem to be the main users (for this workload) by an order of magnitude. 0.00 0.00 1/3762429 remove_arg_zero [618] 0.00 0.00 10/3762429 ext2_set_link [576] 0.00 0.00 350/3762429 block_read_full_page [128] 0.00 0.00 750/3762429 ext2_empty_dir [476] 0.02 0.00 11465/3762429 ext2_delete_entry [217] 0.02 0.00 12983/3762429 ext2_inode_by_name [228] 0.03 0.00 15400/3762429 ext2_add_link [182] 0.03 0.00 16621/3762429 ext2_find_entry [198] 0.06 0.00 33016/3762429 ext2_readdir [79] 0.13 0.00 71900/3762429 generic_file_write [109] 0.17 0.00 95589/3762429 generic_commit_write [255] 2.25 0.00 1229513/3762429 file_read_actor [50] 4.15 0.00 2274831/3762429 copy_strings [36] [105] 0.2 6.87 0.00 3762429 kunmap_high [105] and 0.00 0.00 1/3762429 remove_arg_zero [618] 0.00 0.00 350/3762429 block_read_full_page [128] 0.22 0.01 71900/3762429 generic_file_write [109] 0.27 0.01 90245/3762429 ext2_get_page [176] 0.29 0.01 95589/3762429 _block_prepare_write [141] 3.72 0.11 1229513/3762429 file_read_actor [50] 6.88 0.21 2274831/3762429 copy_strings [36] [87] 0.3 11.38 0.35 3762429 kmap_high [87] 0.35 0.00 27/27 flush_all_zero_pkmaps [273] this kernel had a larger kmap area (which is why flush_all_zero_pkmaps is only called so little). It is a 2.4.18 tree with O(1) sched. ---------------------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 17:29 ` Martin J. Bligh @ 2002-07-08 22:14 ` Linus Torvalds 2002-07-09 0:16 ` Andrew Morton 2002-07-09 3:17 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: Linus Torvalds @ 2002-07-08 22:14 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm On Mon, 8 Jul 2002, Martin J. Bligh wrote: > > OK, here's the data from Keith that I was promising on kmap. This was just > for a kernel compile. So copy_strings and file_read_actor seem to be the > main users (for this workload) by an order of magnitude. Ok, both the top two (by far) users are basically just "copy_to_user()" and "copy_from_user()". What we could do is to make a special case for the copy_xx_user() stuff, and have page faulting fixing those two special cases up (kunmap before calling handle_mm_fault, and then re-kmap and fixing up the address just before returning). It's even easy to check hat to trigger at: if we have a magic "atomic kmap that handles page faults correctly" thing, such a thing would need to be preempt safe due to the atomic kmap anyway - so we could trigger the special case on the faulting code being non-preemptable. Basically, the only thing it would require would be a slightly magic "calling convention", where some register holds the page pointer, and another register holds the "mapped address" pointer, and then we'd have something like do_page_fault(..) { .... + if (current->preempt_count) + kunmap_atomic(ptregs->page_reg); switch (handle_mm_fault(mm, vma, address, write)) { .... } + if (current->preempt_count) + ptregs->addr_reg = (ptregs->addr_reg & ~PAGE_MASK) | kmap_atomic(ptregs->page_reg); ... which basically allows us to hold "atomic" kmap's over a page fault (and _only_ over a page fault, it wouldn't help for anything but the user copy case). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 22:14 ` Linus Torvalds @ 2002-07-09 0:16 ` Andrew Morton 0 siblings, 0 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-09 0:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm Linus Torvalds wrote: > > ... > do_page_fault(..) > { > .... > > + if (current->preempt_count) > + kunmap_atomic(ptregs->page_reg); > > switch (handle_mm_fault(mm, vma, address, write)) { > .... > } > > + if (current->preempt_count) > + ptregs->addr_reg = (ptregs->addr_reg & ~PAGE_MASK) | kmap_atomic(ptregs->page_reg); > > ... > > which basically allows us to hold "atomic" kmap's over a page fault (and > _only_ over a page fault, it wouldn't help for anything but the user copy > case). That'll work. We need to use a bit more than preempt_count, because that doesn't get incremented when CONFIG_PREEMPT=n, and we need to know stuff like the kmap type and the page and the virtual address within the fault handler. I'll work on the below patch. I'll need to audit the various copy_*_user implementations to make sure that esi and edi are really always the right registers to use. Sigh. I've been trying to avoid understanding the x86 instruction set. arch/i386/mm/fault.c | 39 +++++++++++++++++++++++++++++++++ include/asm-i386/kmap_types.h | 3 +- include/asm-ppc/kmap_types.h | 1 include/asm-sparc/kmap_types.h | 1 include/asm-x86_64/kmap_types.h | 1 include/linux/highmem.h | 46 ++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 7 ++++++ mm/filemap.c | 8 +++++- 8 files changed, 103 insertions, 3 deletions --- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/arch/i386/mm/fault.c Mon Jul 8 17:13:16 2002 @@ -18,6 +18,7 @@ #include <linux/ptrace.h> #include <linux/mman.h> #include <linux/mm.h> +#include <linux/highmem.h> #include <linux/smp.h> #include <linux/smp_lock.h> #include <linux/interrupt.h> @@ -138,6 +139,40 @@ void bust_spinlocks(int yes) } } +#ifdef CONFIG_HIGHMEM +static inline void highmem_remove_atomic_kmap(void) +{ + struct copy_user_state *cus = current->copy_user_state; + + printk("%s\n", __FUNCTION__); + if (cus) + kunmap_atomic(cus->kaddr, cus->type); +} + +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) { + long *reg; + unsigned offset; + + cus->kaddr = kmap_atomic(cus->page, cus->type); + if (cus->src) + reg = ®s->esi; + else + reg = ®s->edi; + offset = *reg & PAGE_SIZE; + *reg = ((long)cus->kaddr & ~(PAGE_SIZE - 1)) | offset; + } +} +#else +static inline void highmem_remove_atomic_kmap(void) +{} +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{} +#endif + asmlinkage void do_invalid_op(struct pt_regs *, unsigned long); extern unsigned long idt; @@ -206,6 +241,8 @@ asmlinkage void do_page_fault(struct pt_ } #endif + highmem_remove_atomic_kmap(); + down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -283,6 +320,7 @@ good_area: tsk->thread.screen_bitmap |= 1 << bit; } up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); return; /* @@ -291,6 +329,7 @@ good_area: */ bad_area: up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); /* User mode accesses just cause a SIGSEGV */ if (error_code & 4) { --- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/asm-i386/kmap_types.h Mon Jul 8 17:13:16 2002 @@ -19,7 +19,8 @@ D(5) KM_BIO_SRC_IRQ, D(6) KM_BIO_DST_IRQ, D(7) KM_PTE0, D(8) KM_PTE1, -D(9) KM_TYPE_NR +D(9) KM_FILEMAP, +D(10) KM_TYPE_NR }; #undef D --- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/asm-ppc/kmap_types.h Mon Jul 8 17:13:16 2002 @@ -15,6 +15,7 @@ enum km_type { KM_BIO_DST_IRQ, KM_PTE0, KM_PTE1, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/asm-sparc/kmap_types.h Mon Jul 8 17:13:16 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h Mon Jul 8 17:13:16 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/linux/highmem.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/linux/highmem.h Mon Jul 8 17:13:16 2002 @@ -10,12 +10,44 @@ extern struct page *highmem_start_page; #include <asm/highmem.h> +#include <asm/kmap_types.h> /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); extern void check_highmem_ptes(void); +/* + * Used when performing a copy_*_user while holding an atomic kmap + */ +struct copy_user_state { + struct page *page; /* The page which is kmap_atomiced */ + void *kaddr; /* Its mapping */ + enum km_type type; /* Its offset */ + int src; /* 1: fixup ESI. 0: Fixup EDI */ +}; + +/* + * `src' is true if the kmap_atomic virtual address is the source of the copy. + */ +static inline void +pre_kmap_copy_user(struct copy_user_state *state, struct page *page, + void *kaddr, enum km_type type, int src) +{ + state->page = page; + state->kaddr = kaddr; + state->type = type; + state->src = src; + BUG_ON(current->copy_user_state != NULL); + current->copy_user_state = state; +} + +static inline void post_kmap_copy_user(void) +{ + BUG_ON(current->copy_user_state == NULL); + current->copy_user_state = NULL; +} + #else /* CONFIG_HIGHMEM */ static inline unsigned int nr_free_highpages(void) { return 0; } @@ -26,6 +58,20 @@ static inline void *kmap(struct page *pa #define kmap_atomic(page,idx) kmap(page) #define kunmap_atomic(page,idx) kunmap(page) + +struct copy_user_state { + int is_gcc_still_buggy; /* ? */ +}; + +static inline void +pre_kmap_copy_user(struct copy_user_state *state, struct page *page, + void *kaddr, enum km_type type, int src) +{ +} + +static inline void post_kmap_copy_user(void) +{ +} #endif /* CONFIG_HIGHMEM */ --- 2.5.25/include/linux/sched.h~linus-copy_user-hack Mon Jul 8 17:13:16 2002 +++ 2.5.25-akpm/include/linux/sched.h Mon Jul 8 17:13:16 2002 @@ -248,6 +248,10 @@ extern struct user_struct root_user; typedef struct prio_array prio_array_t; +#ifdef CONFIG_HIGHMEM +struct copy_user_state; +#endif + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -365,6 +369,9 @@ struct task_struct { /* journalling filesystem info */ void *journal_info; struct dentry *proc_dentry; +#ifdef CONFIG_HIGHMEM + struct copy_user_state *copy_user_state; +#endif }; extern void __put_task_struct(struct task_struct *tsk); --- 2.5.25/mm/filemap.c~linus-copy_user-hack Mon Jul 8 17:13:22 2002 +++ 2.5.25-akpm/mm/filemap.c Mon Jul 8 17:13:30 2002 @@ -15,6 +15,7 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/mman.h> +#include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/iobuf.h> @@ -1186,14 +1187,17 @@ static ssize_t generic_file_direct_IO(in int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) { char *kaddr; + struct copy_user_state copy_user_state; unsigned long left, count = desc->count; if (size > count) size = count; - kaddr = kmap(page); + kaddr = kmap_atomic(page, KM_FILEMAP); + pre_kmap_copy_user(©_user_state, page, kaddr, KM_FILEMAP, 1); left = __copy_to_user(desc->buf, kaddr + offset, size); - kunmap(page); + post_kmap_copy_user(); + kunmap_atomic(kaddr, KM_FILEMAP); if (left) { size -= left; - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-08 17:29 ` Martin J. Bligh 2002-07-08 22:14 ` Linus Torvalds @ 2002-07-09 3:17 ` Andrew Morton 2002-07-09 4:28 ` Martin J. Bligh 2002-07-09 13:59 ` Benjamin LaHaise 1 sibling, 2 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-09 3:17 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm "Martin J. Bligh" wrote: > > OK, here's the data from Keith that I was promising on kmap. This was just > for a kernel compile. So copy_strings and file_read_actor seem to be the > main users (for this workload) by an order of magnitude. > > 0.00 0.00 1/3762429 remove_arg_zero [618] > 0.00 0.00 10/3762429 ext2_set_link [576] > 0.00 0.00 350/3762429 block_read_full_page [128] > 0.00 0.00 750/3762429 ext2_empty_dir [476] > 0.02 0.00 11465/3762429 ext2_delete_entry [217] > 0.02 0.00 12983/3762429 ext2_inode_by_name [228] > 0.03 0.00 15400/3762429 ext2_add_link [182] > 0.03 0.00 16621/3762429 ext2_find_entry [198] > 0.06 0.00 33016/3762429 ext2_readdir [79] > 0.13 0.00 71900/3762429 generic_file_write [109] > 0.17 0.00 95589/3762429 generic_commit_write [255] > 2.25 0.00 1229513/3762429 file_read_actor [50] > 4.15 0.00 2274831/3762429 copy_strings [36] > [105] 0.2 6.87 0.00 3762429 kunmap_high [105] It's a bit weird that copy_strings is the heaviest user of kmap(). I bet it's kmapping the same page over and over. A little cache there will help. I'll fix that up. So here's the patch. Seems to work. Across a `make -j6 bzImage' the number of calls to kmap_high() went from 490,429 down to 41,174. And guess what? Zero change in wallclock time. With: make -j6 bzImage 399.97s user 32.95s system 374% cpu 1:55.51 total kmap_high_counter: 41174 Without: make -j6 bzImage 400.06s user 32.74s system 377% cpu 1:54.78 total kmap_high_counter: 490429 Quad PIII Xeon, 2.5G RAM. Any theories? arch/i386/lib/usercopy.c | 10 +++++ arch/i386/mm/fault.c | 40 ++++++++++++++++++++ fs/exec.c | 6 ++- include/asm-i386/kmap_types.h | 3 + include/asm-i386/processor.h | 2 + include/asm-ppc/kmap_types.h | 1 include/asm-sparc/kmap_types.h | 1 include/asm-x86_64/kmap_types.h | 1 include/linux/highmem.h | 78 ++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 5 ++ mm/filemap.c | 11 +++-- 11 files changed, 151 insertions(+), 7 deletions(-) --- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/arch/i386/mm/fault.c Mon Jul 8 20:08:20 2002 @@ -18,6 +18,7 @@ #include <linux/ptrace.h> #include <linux/mman.h> #include <linux/mm.h> +#include <linux/highmem.h> #include <linux/smp.h> #include <linux/smp_lock.h> #include <linux/interrupt.h> @@ -138,6 +139,39 @@ void bust_spinlocks(int yes) } } +#ifdef CONFIG_HIGHMEM +static inline void highmem_remove_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) + kunmap_atomic(cus->kaddr, cus->type); +} + +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) { + long *reg; + unsigned offset; + + cus->kaddr = kmap_atomic(cus->page, cus->type); + if (cus->src) + reg = ®s->esi; + else + reg = ®s->edi; + offset = *reg & (PAGE_SIZE - 1); + *reg = ((long)cus->kaddr) | offset; + } +} +#else +static inline void highmem_remove_atomic_kmap(struct pt_regs *regs) +{} +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{} +#endif + asmlinkage void do_invalid_op(struct pt_regs *, unsigned long); extern unsigned long idt; @@ -206,6 +240,8 @@ asmlinkage void do_page_fault(struct pt_ } #endif + highmem_remove_atomic_kmap(regs); + down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -267,8 +303,10 @@ good_area: tsk->maj_flt++; break; case VM_FAULT_SIGBUS: + highmem_restore_atomic_kmap(regs); goto do_sigbus; case VM_FAULT_OOM: + highmem_restore_atomic_kmap(regs); goto out_of_memory; default: BUG(); @@ -283,6 +321,7 @@ good_area: tsk->thread.screen_bitmap |= 1 << bit; } up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); return; /* @@ -291,6 +330,7 @@ good_area: */ bad_area: up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); /* User mode accesses just cause a SIGSEGV */ if (error_code & 4) { --- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/asm-i386/kmap_types.h Mon Jul 8 19:58:03 2002 @@ -19,7 +19,8 @@ D(5) KM_BIO_SRC_IRQ, D(6) KM_BIO_DST_IRQ, D(7) KM_PTE0, D(8) KM_PTE1, -D(9) KM_TYPE_NR +D(9) KM_FILEMAP, +D(10) KM_TYPE_NR }; #undef D --- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/asm-ppc/kmap_types.h Mon Jul 8 19:58:03 2002 @@ -15,6 +15,7 @@ enum km_type { KM_BIO_DST_IRQ, KM_PTE0, KM_PTE1, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/asm-sparc/kmap_types.h Mon Jul 8 19:58:03 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h Mon Jul 8 19:58:03 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/linux/highmem.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/linux/highmem.h Mon Jul 8 20:13:05 2002 @@ -3,6 +3,7 @@ #include <linux/config.h> #include <linux/fs.h> +#include <asm/processor.h> #include <asm/cacheflush.h> #ifdef CONFIG_HIGHMEM @@ -10,6 +11,7 @@ extern struct page *highmem_start_page; #include <asm/highmem.h> +#include <asm/kmap_types.h> /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); @@ -72,4 +74,80 @@ static inline void copy_user_highpage(st kunmap_atomic(vto, KM_USER1); } +#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP) +/* + * Used when performing a copy_*_user while holding an atomic kmap + */ +struct copy_user_state { + struct page *page; /* The page which is kmap_atomiced */ + void *kaddr; /* Its mapping */ + enum km_type type; /* Its offset */ + int src; /* 1: fixup ESI. 0: Fixup EDI */ +}; + +/* + * `src' is true if the kmap_atomic virtual address is the source of the copy. + */ +static inline void * +kmap_copy_user(struct copy_user_state *cus, struct page *page, + enum km_type type, int src) +{ + cus->page = page; + cus->kaddr = kmap_atomic(page, type); + if (PageHighMem(page)) { + cus->type = type; + cus->src = src; + BUG_ON(current->copy_user_state != NULL); + current->copy_user_state = cus; + } + return cus->kaddr; +} + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + if (PageHighMem(cus->page)) { + BUG_ON(current->copy_user_state != cus); + kunmap_atomic(cus->kaddr, cus->type); + current->copy_user_state = NULL; + cus->page = NULL; /* debug */ + } +} + +/* + * After a copy_*_user, the kernel virtual address may be different. So + * use kmap_copy_user_addr() to get the new value. + */ +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return cus->kaddr; +} + +#else + +struct copy_user_state { + struct page *page; +}; + +/* + * This must be a macro because `type' may be undefined + */ + +#define kmap_copy_user(cus, page, type, src) \ + ({ \ + (cus)->page = (page); \ + kmap(page); \ + }) + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + kunmap(cus->page); +} + +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return page_address(cus->page); +} + +#endif + #endif /* _LINUX_HIGHMEM_H */ --- 2.5.25/include/linux/sched.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/linux/sched.h Mon Jul 8 20:14:08 2002 @@ -248,6 +248,8 @@ extern struct user_struct root_user; typedef struct prio_array prio_array_t; +struct copy_user_state; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -365,6 +367,9 @@ struct task_struct { /* journalling filesystem info */ void *journal_info; struct dentry *proc_dentry; +#ifdef CONFIG_HIGHMEM + struct copy_user_state *copy_user_state; +#endif }; extern void __put_task_struct(struct task_struct *tsk); --- 2.5.25/mm/filemap.c~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/mm/filemap.c Mon Jul 8 19:58:03 2002 @@ -15,6 +15,7 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/mman.h> +#include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/iobuf.h> @@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in return retval; } -int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) +int file_read_actor(read_descriptor_t *desc, struct page *page, + unsigned long offset, unsigned long size) { char *kaddr; + struct copy_user_state copy_user_state; unsigned long left, count = desc->count; if (size > count) size = count; - kaddr = kmap(page); + kaddr = kmap_copy_user(©_user_state, page, KM_FILEMAP, 1); left = __copy_to_user(desc->buf, kaddr + offset, size); - kunmap(page); - + kunmap_copy_user(©_user_state); + if (left) { size -= left; desc->error = -EFAULT; --- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/include/asm-i386/processor.h Mon Jul 8 19:58:03 2002 @@ -485,4 +485,6 @@ extern inline void prefetchw(const void #endif +#define ARCH_HAS_KMAP_FIXUP + #endif /* __ASM_I386_PROCESSOR_H */ --- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/arch/i386/lib/usercopy.c Mon Jul 8 19:58:03 2002 @@ -11,6 +11,16 @@ #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS +/* + * We cannot use the mmx functions here with the kmap_atomic fixup + * code. + * + * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere. + * Maybe kill this code? + */ + +#error this will not work + unsigned long __generic_copy_to_user(void *to, const void *from, unsigned long n) { --- 2.5.25/fs/exec.c~linus-copy_user-hack Mon Jul 8 19:58:03 2002 +++ 2.5.25-akpm/fs/exec.c Mon Jul 8 19:58:03 2002 @@ -203,6 +203,7 @@ int copy_strings(int argc,char ** argv, int i, new, err; struct page *page; int offset, bytes_to_copy; + struct copy_user_state copy_user_state; offset = pos % PAGE_SIZE; i = pos/PAGE_SIZE; @@ -215,7 +216,8 @@ int copy_strings(int argc,char ** argv, return -ENOMEM; new = 1; } - kaddr = kmap(page); + kaddr = kmap_copy_user(©_user_state, page, + KM_FILEMAP, 0); if (new && offset) memset(kaddr, 0, offset); @@ -226,7 +228,7 @@ int copy_strings(int argc,char ** argv, memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len); } err = copy_from_user(kaddr + offset, str, bytes_to_copy); - kunmap(page); + kunmap_copy_user(©_user_state); if (err) return -EFAULT; - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 3:17 ` Andrew Morton @ 2002-07-09 4:28 ` Martin J. Bligh 2002-07-09 5:28 ` Andrew Morton 2002-07-09 13:59 ` Benjamin LaHaise 1 sibling, 1 reply; 95+ messages in thread From: Martin J. Bligh @ 2002-07-09 4:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm > It's a bit weird that copy_strings is the heaviest user of kmap(). I can dig out the whole acg, and see what was calling copy_strings, that might help give us a clue. > I bet it's kmapping the same page over and over. A little cache > there will help. I'll fix that up. Sounds like a plan ... > So here's the patch. Seems to work. Across a `make -j6 bzImage' > the number of calls to kmap_high() went from 490,429 down to 41,174. > > And guess what? Zero change in wallclock time. Well, I was going to try to bench this tonight, but am having a problem with 2.5.25 right now (we've been on 2.4 for a while, but are shifting). Hopefully get you some numbers tommorow, and will get some other benchmarks done by people here on various machines. > Any theories? Maybe the cost of the atomic kmap counters the gain? Having to do a single line tlbflush every time ... a per cpu pool might help if that is the problem, but would have to make it a reasonable size to counter the cost. I'll do some more measurements first, and get some profile data to see if the number of ticks changes down in one function and up in the other? Thanks to all for looking at this, M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 4:28 ` Martin J. Bligh @ 2002-07-09 5:28 ` Andrew Morton 2002-07-09 6:15 ` Martin J. Bligh ` (2 more replies) 0 siblings, 3 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-09 5:28 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm "Martin J. Bligh" wrote: > > > It's a bit weird that copy_strings is the heaviest user of kmap(). > > I can dig out the whole acg, and see what was calling copy_strings, > that might help give us a clue. Well, it'll be exec(). > > I bet it's kmapping the same page over and over. A little cache > > there will help. I'll fix that up. > > Sounds like a plan ... Updated patch below. We don't need an atomic kmap in copy_strings at all. kmap is the right thing to do, but just be smarter about it. Hanging onto the existing kmap in there reduces the number of kmap() calls by a factor of 32 across a kernel compile. But still no aggregate speedup. > > So here's the patch. Seems to work. Across a `make -j6 bzImage' > > the number of calls to kmap_high() went from 490,429 down to 41,174. > > > > And guess what? Zero change in wallclock time. > > Well, I was going to try to bench this tonight, but am having a > problem with 2.5.25 right now (we've been on 2.4 for a while, > but are shifting). Hopefully get you some numbers tommorow, and > will get some other benchmarks done by people here on various > machines. Don't tell me those NUMAQ's are using IDE ;) But seriously, what's the problem? We really do need the big boxes to be able to test 2.5 right now, and any blockage needs to be cleared away. > > Any theories? > > Maybe the cost of the atomic kmap counters the gain? Having to do a > single line tlbflush every time ... a per cpu pool might help if that > is the problem, but would have to make it a reasonable size to counter > the cost. I'll do some more measurements first, and get some profile > data to see if the number of ticks changes down in one function and > up in the other? Well, certainly it's dumb to perform an atomic_kunmap() when we're just about to atomic_kmap() the same page again. But no change. arch/i386/lib/usercopy.c | 10 +++++ arch/i386/mm/fault.c | 40 ++++++++++++++++++++ fs/exec.c | 72 ++++++++++++++++++++++++++++-------- include/asm-i386/kmap_types.h | 3 + include/asm-i386/processor.h | 2 + include/asm-ppc/kmap_types.h | 1 include/asm-sparc/kmap_types.h | 1 include/asm-x86_64/kmap_types.h | 1 include/linux/highmem.h | 78 ++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 5 ++ mm/filemap.c | 11 +++-- 11 files changed, 202 insertions(+), 22 deletions(-) --- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/arch/i386/mm/fault.c Mon Jul 8 21:44:21 2002 @@ -18,6 +18,7 @@ #include <linux/ptrace.h> #include <linux/mman.h> #include <linux/mm.h> +#include <linux/highmem.h> #include <linux/smp.h> #include <linux/smp_lock.h> #include <linux/interrupt.h> @@ -138,6 +139,39 @@ void bust_spinlocks(int yes) } } +#ifdef CONFIG_HIGHMEM +static inline void highmem_remove_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) + kunmap_atomic(cus->kaddr, cus->type); +} + +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) { + long *reg; + unsigned offset; + + cus->kaddr = kmap_atomic(cus->page, cus->type); + if (cus->src) + reg = ®s->esi; + else + reg = ®s->edi; + offset = *reg & (PAGE_SIZE - 1); + *reg = ((long)cus->kaddr) | offset; + } +} +#else +static inline void highmem_remove_atomic_kmap(struct pt_regs *regs) +{} +static inline void highmem_restore_atomic_kmap(struct pt_regs *regs) +{} +#endif + asmlinkage void do_invalid_op(struct pt_regs *, unsigned long); extern unsigned long idt; @@ -206,6 +240,8 @@ asmlinkage void do_page_fault(struct pt_ } #endif + highmem_remove_atomic_kmap(regs); + down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -267,8 +303,10 @@ good_area: tsk->maj_flt++; break; case VM_FAULT_SIGBUS: + highmem_restore_atomic_kmap(regs); goto do_sigbus; case VM_FAULT_OOM: + highmem_restore_atomic_kmap(regs); goto out_of_memory; default: BUG(); @@ -283,6 +321,7 @@ good_area: tsk->thread.screen_bitmap |= 1 << bit; } up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); return; /* @@ -291,6 +330,7 @@ good_area: */ bad_area: up_read(&mm->mmap_sem); + highmem_restore_atomic_kmap(regs); /* User mode accesses just cause a SIGSEGV */ if (error_code & 4) { --- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/asm-i386/kmap_types.h Mon Jul 8 21:44:21 2002 @@ -19,7 +19,8 @@ D(5) KM_BIO_SRC_IRQ, D(6) KM_BIO_DST_IRQ, D(7) KM_PTE0, D(8) KM_PTE1, -D(9) KM_TYPE_NR +D(9) KM_FILEMAP, +D(10) KM_TYPE_NR }; #undef D --- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/asm-ppc/kmap_types.h Mon Jul 8 21:44:21 2002 @@ -15,6 +15,7 @@ enum km_type { KM_BIO_DST_IRQ, KM_PTE0, KM_PTE1, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/asm-sparc/kmap_types.h Mon Jul 8 21:44:21 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h Mon Jul 8 21:44:21 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/linux/highmem.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/linux/highmem.h Mon Jul 8 21:44:21 2002 @@ -3,6 +3,7 @@ #include <linux/config.h> #include <linux/fs.h> +#include <asm/processor.h> #include <asm/cacheflush.h> #ifdef CONFIG_HIGHMEM @@ -10,6 +11,7 @@ extern struct page *highmem_start_page; #include <asm/highmem.h> +#include <asm/kmap_types.h> /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); @@ -72,4 +74,80 @@ static inline void copy_user_highpage(st kunmap_atomic(vto, KM_USER1); } +#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP) +/* + * Used when performing a copy_*_user while holding an atomic kmap + */ +struct copy_user_state { + struct page *page; /* The page which is kmap_atomiced */ + void *kaddr; /* Its mapping */ + enum km_type type; /* Its offset */ + int src; /* 1: fixup ESI. 0: Fixup EDI */ +}; + +/* + * `src' is true if the kmap_atomic virtual address is the source of the copy. + */ +static inline void * +kmap_copy_user(struct copy_user_state *cus, struct page *page, + enum km_type type, int src) +{ + cus->page = page; + cus->kaddr = kmap_atomic(page, type); + if (PageHighMem(page)) { + cus->type = type; + cus->src = src; + BUG_ON(current->copy_user_state != NULL); + current->copy_user_state = cus; + } + return cus->kaddr; +} + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + if (PageHighMem(cus->page)) { + BUG_ON(current->copy_user_state != cus); + kunmap_atomic(cus->kaddr, cus->type); + current->copy_user_state = NULL; + cus->page = NULL; /* debug */ + } +} + +/* + * After a copy_*_user, the kernel virtual address may be different. So + * use kmap_copy_user_addr() to get the new value. + */ +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return cus->kaddr; +} + +#else + +struct copy_user_state { + struct page *page; +}; + +/* + * This must be a macro because `type' may be undefined + */ + +#define kmap_copy_user(cus, page, type, src) \ + ({ \ + (cus)->page = (page); \ + kmap(page); \ + }) + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + kunmap(cus->page); +} + +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return page_address(cus->page); +} + +#endif + #endif /* _LINUX_HIGHMEM_H */ --- 2.5.25/include/linux/sched.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/linux/sched.h Mon Jul 8 21:44:21 2002 @@ -248,6 +248,8 @@ extern struct user_struct root_user; typedef struct prio_array prio_array_t; +struct copy_user_state; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -365,6 +367,9 @@ struct task_struct { /* journalling filesystem info */ void *journal_info; struct dentry *proc_dentry; +#ifdef CONFIG_HIGHMEM + struct copy_user_state *copy_user_state; +#endif }; extern void __put_task_struct(struct task_struct *tsk); --- 2.5.25/mm/filemap.c~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/mm/filemap.c Mon Jul 8 21:44:21 2002 @@ -15,6 +15,7 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/mman.h> +#include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/iobuf.h> @@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in return retval; } -int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) +int file_read_actor(read_descriptor_t *desc, struct page *page, + unsigned long offset, unsigned long size) { char *kaddr; + struct copy_user_state copy_user_state; unsigned long left, count = desc->count; if (size > count) size = count; - kaddr = kmap(page); + kaddr = kmap_copy_user(©_user_state, page, KM_FILEMAP, 1); left = __copy_to_user(desc->buf, kaddr + offset, size); - kunmap(page); - + kunmap_copy_user(©_user_state); + if (left) { size -= left; desc->error = -EFAULT; --- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/include/asm-i386/processor.h Mon Jul 8 21:44:21 2002 @@ -485,4 +485,6 @@ extern inline void prefetchw(const void #endif +#define ARCH_HAS_KMAP_FIXUP + #endif /* __ASM_I386_PROCESSOR_H */ --- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/arch/i386/lib/usercopy.c Mon Jul 8 21:44:21 2002 @@ -11,6 +11,16 @@ #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS +/* + * We cannot use the mmx functions here with the kmap_atomic fixup + * code. + * + * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere. + * Maybe kill this code? + */ + +#error this will not work + unsigned long __generic_copy_to_user(void *to, const void *from, unsigned long n) { --- 2.5.25/fs/exec.c~linus-copy_user-hack Mon Jul 8 21:44:21 2002 +++ 2.5.25-akpm/fs/exec.c Mon Jul 8 22:18:12 2002 @@ -182,27 +182,45 @@ static int count(char ** argv, int max) * memory to free pages in kernel mem. These are in a format ready * to be put directly into the top of new user memory. */ +int akpm_hits; +int akpm_misses; +int akpm_kmaps; + int copy_strings(int argc,char ** argv, struct linux_binprm *bprm) { + struct page *kmapped_page = NULL; + char *kaddr = NULL; + int ret; + while (argc-- > 0) { char *str; int len; unsigned long pos; - if (get_user(str, argv+argc) || !(len = strnlen_user(str, bprm->p))) - return -EFAULT; - if (bprm->p < len) - return -E2BIG; + if (get_user(str, argv+argc) || + !(len = strnlen_user(str, bprm->p))) { + ret = -EFAULT; + goto out; + } + + if (bprm->p < len) { + ret = -E2BIG; + goto out; + } bprm->p -= len; /* XXX: add architecture specific overflow check here. */ - pos = bprm->p; + + /* + * The only sleeping function which we are allowed to call in + * this loop is copy_from_user(). Otherwise, copy_user_state + * could get trashed. + */ while (len > 0) { - char *kaddr; int i, new, err; - struct page *page; int offset, bytes_to_copy; + struct page *page; offset = pos % PAGE_SIZE; i = pos/PAGE_SIZE; @@ -211,32 +229,52 @@ int copy_strings(int argc,char ** argv, if (!page) { page = alloc_page(GFP_HIGHUSER); bprm->page[i] = page; - if (!page) - return -ENOMEM; + if (!page) { + ret = -ENOMEM; + goto out; + } new = 1; } - kaddr = kmap(page); + if (kmapped_page) { + if (page == kmapped_page) + akpm_hits++; + else + akpm_misses++; + } + + if (page != kmapped_page) { + if (kmapped_page) + kunmap(kmapped_page); + kmapped_page = page; + kaddr = kmap(kmapped_page); + akpm_kmaps++; + } if (new && offset) memset(kaddr, 0, offset); bytes_to_copy = PAGE_SIZE - offset; if (bytes_to_copy > len) { bytes_to_copy = len; if (new) - memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len); + memset(kaddr+offset+len, 0, + PAGE_SIZE-offset-len); + } + err = copy_from_user(kaddr+offset, str, bytes_to_copy); + if (err) { + ret = -EFAULT; + goto out; } - err = copy_from_user(kaddr + offset, str, bytes_to_copy); - kunmap(page); - - if (err) - return -EFAULT; pos += bytes_to_copy; str += bytes_to_copy; len -= bytes_to_copy; } } - return 0; + ret = 0; +out: + if (kmapped_page) + kunmap(kmapped_page); + return ret; } /* - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 5:28 ` Andrew Morton @ 2002-07-09 6:15 ` Martin J. Bligh 2002-07-09 6:30 ` William Lee Irwin III 2002-07-09 6:32 ` William Lee Irwin III 2002-07-09 16:08 ` Martin J. Bligh 2002-07-09 17:32 ` Andrea Arcangeli 2 siblings, 2 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-09 6:15 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm > Don't tell me those NUMAQ's are using IDE ;) No, that's one level of pain I don't have to deal with ;-) Now switched fibrechannel SANs on a machine that really needs NUMA aware multipath IO is more likely to be a problem, on the other hand ... but I can live without that for now ... > But seriously, what's the problem? We really do need the big > boxes to be able to test 2.5 right now, and any blockage needs > to be cleared away. You really want the current list? The whole of our team is shifting focus to 2.5, which'll make life more interesting ;-) wli might care to elaborate on 2 & 3, since I think he helped them identify / fix (helped maybe meaning did). 1. Irqbalance doesn't like clustered apic mode (have hack) 2. Using ioremap fairly early catches cpu_online_map set to 0 for some reason (have hack). 3. Something to do with BIO that I'll let Bill explain, but I am given to believe it's well known (have hack). After this, one of our team got one of the boxes booted today for the first time on a recent 2.5 kernel, but I don't have any numbers from it yet. 4. Starfire ethernet driver doesn't seem to want to compile in (crc32_le in set_rx_mode?) Can't recall the exact error, will recreate tommorow, may well work as a module. 5. Compiler errors that I haven't really looked at but am guessing may be due to me using an old compiler / tools (Redhat 6.2)? /tmp/cca8ibx8.s: Assembler messages: /tmp/cca8ibx8.s:255: Error: suffix or operands invalid for `lcall' gcc -Wp,-MD,./.setup.o.d -D__ASSEMBLY__ -D__KERNEL__ -I/home/mbligh/linux-2.5.25/include -nostdinc -iwithprefix include -traditional -DSVGA_MODE=NORMAL_VGA -D__BIG_KERNEL__ -c -o setup.o setup.S make[1]: *** [bootsect.o] Error 1 make[1]: *** Waiting for unfinished jobs.... /tmp/ccn0yhUe.s: Assembler messages: /tmp/ccn0yhUe.s:1292: Error: suffix or operands invalid for `lcall' make[1]: *** [setup.o] Error 1 make[1]: Leaving directory `/home/mbligh/linux-2.5.25/arch/i386/boot' make: *** [install] Error 2 At that point, I decided to wait until tommorow as I really need to switch machines, disks, distributions, and reinstall ;-) But then I need to do a few more things ... 6. Get someone to port forward the ia32 discontigmem support, test and submit. 7. Port forward the TSC disable changes. 8. Port forward the remote quad timer disable changes. 9. Port forward the pci-bridge remap hack that I'm too ashamed to publish. Then I might get some numbers out of her ;-) Comparing 2.4 to 2.5 will be most interesting ... In the meantime I'll get you some numbers from an 8 way SMP box, which should be much simpler to do. M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 6:15 ` Martin J. Bligh @ 2002-07-09 6:30 ` William Lee Irwin III 2002-07-09 6:32 ` William Lee Irwin III 1 sibling, 0 replies; 95+ messages in thread From: William Lee Irwin III @ 2002-07-09 6:30 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 11:15:52PM -0700, Martin J. Bligh wrote: > wli might care to elaborate on 2 & 3, since I think he helped > them identify / fix (helped maybe meaning did). > 1. Irqbalance doesn't like clustered apic mode (have hack) > 2. Using ioremap fairly early catches cpu_online_map set to 0 > for some reason (have hack). > 3. Something to do with BIO that I'll let Bill explain, but I > am given to believe it's well known (have hack). (1) irqbalance is blatantly stuffing flat bitmasks into ICR2 this breaks clustered hierarchical destination format everywhere all the time (2) ioremap wants to flush_tlb_all() before cpu_online_map is initialized. This results in smp_call_function() spinning until an atomic_t goes to -1, which never happens since it doesn't kick any cpu's to decrement the counter. (3) The bio thing is just the usual queue max_sectors where you've recommended decreasing the MPAGE max sectors so the bio constraints are satisfied before they hit elevator.c where a BUG_ON() is triggered instead of a bio split. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 6:15 ` Martin J. Bligh 2002-07-09 6:30 ` William Lee Irwin III @ 2002-07-09 6:32 ` William Lee Irwin III 1 sibling, 0 replies; 95+ messages in thread From: William Lee Irwin III @ 2002-07-09 6:32 UTC (permalink / raw) To: Martin J. Bligh Cc: Andrew Morton, Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm > > Don't tell me those NUMAQ's are using IDE ;) > > No, that's one level of pain I don't have to deal with ;-) > > Now switched fibrechannel SANs on a machine that really needs > NUMA aware multipath IO is more likely to be a problem, on > the other hand ... but I can live without that for now ... > > > But seriously, what's the problem? We really do need the big > > boxes to be able to test 2.5 right now, and any blockage needs > > to be cleared away. > > You really want the current list? The whole of our team is > shifting focus to 2.5, which'll make life more interesting ;-) > On Mon, Jul 08, 2002 at 11:15:52PM -0700, Martin J. Bligh wrote: > wli might care to elaborate on 2 & 3, since I think he helped > them identify / fix (helped maybe meaning did). Oh, I forgot, there was a bad v86 info thing on the 9th cpu woken that I never finished debugging, too. And the MAX_IO_APICS issue, which is easily solved by just increasing the constant #ifdef CONFIG_MULTIQUAD as usual. This panics before console_init() though, which makes it seem more painful than it really is. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 5:28 ` Andrew Morton 2002-07-09 6:15 ` Martin J. Bligh @ 2002-07-09 16:08 ` Martin J. Bligh 2002-07-09 17:32 ` Andrea Arcangeli 2 siblings, 0 replies; 95+ messages in thread From: Martin J. Bligh @ 2002-07-09 16:08 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm >> I can dig out the whole acg, and see what was calling copy_strings, >> that might help give us a clue. > > Well, it'll be exec(). Yup ;-) 0.95 18.17 30186/90554 copy_strings_kernel [68] 1.90 36.34 60368/90554 do_execve [15] [36] 1.5 2.85 54.52 90554 copy_strings [36] 40.55 0.00 2274831/2528191 _generic_copy_from_user [41] 6.88 0.21 2274831/3762429 kmap_high [87] 4.15 0.00 2274831/3762429 kunmap_high [105] 2.12 0.00 2273886/4517587 strnlen_user [120] 0.09 0.51 31139/4284514 _alloc_pages [24] 0.00 0.00 31139/4284514 alloc_pages [369] 0.00 0.00 2/30186 load_script <cycle 2> [553] 0.00 19.12 30184/30186 do_execve [15] [68] 0.5 0.00 19.12 30186 copy_strings_kernel [68] 0.95 18.17 30186/90554 copy_strings [36] > Updated patch below. We don't need an atomic kmap in copy_strings > at all. kmap is the right thing to do, but just be smarter about it. > Hanging onto the existing kmap in there reduces the number of kmap() > calls by a factor of 32 across a kernel compile. Sounds about right, looking at the data above. > But still no aggregate speedup. Now that really is odd. Will get you some more numbers. M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 5:28 ` Andrew Morton 2002-07-09 6:15 ` Martin J. Bligh 2002-07-09 16:08 ` Martin J. Bligh @ 2002-07-09 17:32 ` Andrea Arcangeli 2002-07-10 5:32 ` Andrew Morton 2 siblings, 1 reply; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-09 17:32 UTC (permalink / raw) To: Andrew Morton; +Cc: Martin J. Bligh, Linus Torvalds, Rik van Riel, linux-mm the patch with the hooks into the page fault handler is basically what Martin was suggesting me at ols when I was laughting. the ugliest part is the implicit dependency on every copy-user to store the current source and destination in esi/dsi. However your implementation is still not as optimizes as you can optimize it, if you take the max-performnace route you should do it all, you shouldn't kunmap_atomic/kmap_atomic blindly around the page fault handler like you're doing now. you should hold on the atomic kmap for the whole page fault until a new kmap_atomic of the same type happens on the current cpu under you during the page fault (either from the page fault handler itself or another task because the page fault handler blocked and scheduled in another task). To detect the underlying kmap_atomic you only need a per-cpu sequence counter that you increase during kmap_atomic (right after the implicit preempt_disable()). you only need to read this sequence counter at page-fault-entry-point (in the place where you are now doing the uncoditional kunmap_atomic), and re-read the global per-cpu sequence counter later before returning from the page fault to know if you've to execute a new kmap_atomic. You need a sequence counter per-cpu per-type of kmap. NOTE: you don't need to execute the kunmap_atomic at all in the recursive case, just disable the debugging (infact if you take this route you can as well drop the kunmap_atomic call enterely from the common code, you may want to verify the ppc or sparc guys aren't doing something magic in the kunmap first, in such case you may want to skip it only during recursion in the i386/mm/fault.c). Don't blame the fact we lose some debugging capability, you want max performance at all costs remeber? Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 17:32 ` Andrea Arcangeli @ 2002-07-10 5:32 ` Andrew Morton 2002-07-10 22:43 ` Martin J. Bligh 0 siblings, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-10 5:32 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Linus Torvalds, Rik van Riel, linux-mm Andrea Arcangeli wrote: > > the patch with the hooks into the page fault handler is basically > what Martin was suggesting me at ols when I was laughting. the ugliest > part is the implicit dependency on every copy-user to store the current > source and destination in esi/dsi. Ah, but note how I called the patch "linus-copy_user-hack". That clearly identifies who is to blame. But yes. However it seems that the esi/edi restriction isn't too bad. If someone has a smarter copy_*_user implementation then they'll have to use the old one in certain places, or use those registers, or write fixup code for the new implementation. I tested it with the IBM folks' faster copy_*_user patch which fixes the Intel alignment thing. Worked OK. > However your implementation is still not as optimizes as you can > optimize it, if you take the max-performnace route you should do it all, > you shouldn't kunmap_atomic/kmap_atomic blindly around the page fault > handler like you're doing now. you should hold on the atomic kmap for > the whole page fault until a new kmap_atomic of the same type happens on > the current cpu under you during the page fault (either from the page > fault handler itself or another task because the page fault handler > blocked and scheduled in another task). Good stuff, thanks. Did that. New patch is here. I stuck with KM_FILEMAP because KM_USER0 is used in the COW code, and that is in fact called inside the filemap fault handler. So basically the "oh drat" path was being taken all the time. > ... > NOTE: you don't need to execute the kunmap_atomic at all in the > recursive case, just disable the debugging (infact if you take this > route you can as well drop the kunmap_atomic call enterely from the > common code, you may want to verify the ppc or sparc guys aren't doing > something magic in the kunmap first, in such case you may want to skip > it only during recursion in the i386/mm/fault.c). Don't blame the fact > we lose some debugging capability, you want max performance at all costs > remeber? Well, the non-recursive case is the common case. And yes, maybe. Could use `ifdef CONFIG_HIGHMEM_DEBUG' and then open-code the preempt_enable() in there I guess... Here's the diff. The kmap() and kmap_atomic() rate is way down now. Still no benefit from it all through. Martin. Help. arch/i386/kernel/i386_ksyms.c | 5 ++ arch/i386/lib/usercopy.c | 10 +++++ arch/i386/mm/fault.c | 71 +++++++++++++++++++++++++++++++++++ fs/exec.c | 60 +++++++++++++++++++++--------- include/asm-i386/highmem.h | 5 ++ include/asm-i386/kmap_types.h | 3 + include/asm-i386/processor.h | 2 + include/asm-ppc/kmap_types.h | 1 include/asm-sparc/kmap_types.h | 1 include/asm-x86_64/kmap_types.h | 1 include/linux/highmem.h | 80 ++++++++++++++++++++++++++++++++++++++++ include/linux/sched.h | 5 ++ mm/filemap.c | 11 +++-- 13 files changed, 232 insertions(+), 23 deletions(-) --- 2.5.25/arch/i386/kernel/i386_ksyms.c~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/arch/i386/kernel/i386_ksyms.c Tue Jul 9 21:12:35 2002 @@ -14,6 +14,7 @@ #include <linux/kernel.h> #include <linux/string.h> #include <linux/tty.h> +#include <linux/highmem.h> #include <asm/semaphore.h> #include <asm/processor.h> @@ -76,6 +77,10 @@ EXPORT_SYMBOL(get_cmos_time); EXPORT_SYMBOL(apm_info); EXPORT_SYMBOL(gdt); +#ifdef CONFIG_HIGHMEM +EXPORT_SYMBOL(kmap_atomic_seq); +#endif + #ifdef CONFIG_DEBUG_IOVIRT EXPORT_SYMBOL(__io_virt_debug); #endif --- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/arch/i386/lib/usercopy.c Tue Jul 9 21:12:35 2002 @@ -11,6 +11,16 @@ #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS +/* + * We cannot use the mmx functions here with the kmap_atomic fixup + * code. + * + * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere. + * Maybe kill this code? + */ + +#error this will not work + unsigned long __generic_copy_to_user(void *to, const void *from, unsigned long n) { --- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/arch/i386/mm/fault.c Tue Jul 9 21:34:38 2002 @@ -18,6 +18,7 @@ #include <linux/ptrace.h> #include <linux/mman.h> #include <linux/mm.h> +#include <linux/highmem.h> #include <linux/smp.h> #include <linux/smp_lock.h> #include <linux/interrupt.h> @@ -138,6 +139,70 @@ void bust_spinlocks(int yes) } } +#ifdef CONFIG_HIGHMEM + +/* + * per-cpu, per-atomic-kmap sequence numbers. Incremented in kmap_atomic. + * If these change, we know that an atomic kmap slot has been reused. + */ +int kmap_atomic_seq[KM_TYPE_NR] __per_cpu_data = {0}; + +/* + * Note the CPU ID and the currently-held atomic kmap's sequence number + */ +static inline void note_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) { + cus->cpu = smp_processor_id(); + cus->seq = this_cpu(kmap_atomic_seq[cus->type]); + } +} + +/* + * After processing the fault, look to see whether we have switched CPUs + * or whether the fault handler has used the same kmap slot (it must have + * scheduled to another task). If so, drop the kmap and get a new one. + * And then fix up the machine register which copy_*_user() is using so + * that it gets the correct address relative to the the new kmap. + */ +static void +__check_atomic_kmap(struct copy_user_state *cus, struct pt_regs *regs) +{ + const int cpu = smp_processor_id(); + + if (cus->seq != per_cpu(kmap_atomic_seq[cus->type], cpu) || + cus->cpu != cpu) { + long *reg; + unsigned offset; + + kunmap_atomic(cus->kaddr, cus->type); + cus->kaddr = kmap_atomic(cus->page, cus->type); + if (cus->src) + reg = ®s->esi; + else + reg = ®s->edi; + offset = *reg & (PAGE_SIZE - 1); + *reg = ((long)cus->kaddr) | offset; + } +} + +static inline void check_atomic_kmap(struct pt_regs *regs) +{ + struct copy_user_state *cus = current->copy_user_state; + + if (cus) + __check_atomic_kmap(cus, regs); +} + +#else +static inline void note_atomic_kmap(struct pt_regs *regs) +{} +static inline void check_atomic_kmap(struct pt_regs *regs) +{} +#endif + asmlinkage void do_invalid_op(struct pt_regs *, unsigned long); extern unsigned long idt; @@ -206,6 +271,8 @@ asmlinkage void do_page_fault(struct pt_ } #endif + note_atomic_kmap(regs); + down_read(&mm->mmap_sem); vma = find_vma(mm, address); @@ -267,8 +334,10 @@ good_area: tsk->maj_flt++; break; case VM_FAULT_SIGBUS: + check_atomic_kmap(regs); goto do_sigbus; case VM_FAULT_OOM: + check_atomic_kmap(regs); goto out_of_memory; default: BUG(); @@ -283,6 +352,7 @@ good_area: tsk->thread.screen_bitmap |= 1 << bit; } up_read(&mm->mmap_sem); + check_atomic_kmap(regs); return; /* @@ -291,6 +361,7 @@ good_area: */ bad_area: up_read(&mm->mmap_sem); + check_atomic_kmap(regs); /* User mode accesses just cause a SIGSEGV */ if (error_code & 4) { --- 2.5.25/fs/exec.c~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/fs/exec.c Tue Jul 9 21:53:51 2002 @@ -184,25 +184,39 @@ static int count(char ** argv, int max) */ int copy_strings(int argc,char ** argv, struct linux_binprm *bprm) { + struct page *kmapped_page = NULL; + char *kaddr = NULL; + int ret; + while (argc-- > 0) { char *str; int len; unsigned long pos; - if (get_user(str, argv+argc) || !(len = strnlen_user(str, bprm->p))) - return -EFAULT; - if (bprm->p < len) - return -E2BIG; + if (get_user(str, argv+argc) || + !(len = strnlen_user(str, bprm->p))) { + ret = -EFAULT; + goto out; + } + + if (bprm->p < len) { + ret = -E2BIG; + goto out; + } bprm->p -= len; /* XXX: add architecture specific overflow check here. */ - pos = bprm->p; + + /* + * The only sleeping function which we are allowed to call in + * this loop is copy_from_user(). Otherwise, copy_user_state + * could get trashed. + */ while (len > 0) { - char *kaddr; int i, new, err; - struct page *page; int offset, bytes_to_copy; + struct page *page; offset = pos % PAGE_SIZE; i = pos/PAGE_SIZE; @@ -211,32 +225,44 @@ int copy_strings(int argc,char ** argv, if (!page) { page = alloc_page(GFP_HIGHUSER); bprm->page[i] = page; - if (!page) - return -ENOMEM; + if (!page) { + ret = -ENOMEM; + goto out; + } new = 1; } - kaddr = kmap(page); + if (page != kmapped_page) { + if (kmapped_page) + kunmap(kmapped_page); + kmapped_page = page; + kaddr = kmap(kmapped_page); + } if (new && offset) memset(kaddr, 0, offset); bytes_to_copy = PAGE_SIZE - offset; if (bytes_to_copy > len) { bytes_to_copy = len; if (new) - memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len); + memset(kaddr+offset+len, 0, + PAGE_SIZE-offset-len); + } + err = copy_from_user(kaddr+offset, str, bytes_to_copy); + if (err) { + ret = -EFAULT; + goto out; } - err = copy_from_user(kaddr + offset, str, bytes_to_copy); - kunmap(page); - - if (err) - return -EFAULT; pos += bytes_to_copy; str += bytes_to_copy; len -= bytes_to_copy; } } - return 0; + ret = 0; +out: + if (kmapped_page) + kunmap(kmapped_page); + return ret; } /* --- 2.5.25/include/asm-i386/highmem.h~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/include/asm-i386/highmem.h Tue Jul 9 21:12:35 2002 @@ -22,6 +22,7 @@ #include <linux/config.h> #include <linux/interrupt.h> +#include <linux/percpu.h> #include <asm/kmap_types.h> #include <asm/tlbflush.h> @@ -76,6 +77,8 @@ static inline void kunmap(struct page *p * be used in IRQ contexts, so in some (very limited) cases we need * it. */ +extern int kmap_atomic_seq[KM_TYPE_NR] __per_cpu_data; + static inline void *kmap_atomic(struct page *page, enum km_type type) { enum fixed_addresses idx; @@ -93,7 +96,7 @@ static inline void *kmap_atomic(struct p #endif set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); __flush_tlb_one(vaddr); - + this_cpu(kmap_atomic_seq[type])++; return (void*) vaddr; } --- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/include/asm-i386/processor.h Tue Jul 9 21:12:35 2002 @@ -485,4 +485,6 @@ extern inline void prefetchw(const void #endif +#define ARCH_HAS_KMAP_FIXUP + #endif /* __ASM_I386_PROCESSOR_H */ --- 2.5.25/include/linux/highmem.h~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/include/linux/highmem.h Tue Jul 9 21:54:59 2002 @@ -3,6 +3,7 @@ #include <linux/config.h> #include <linux/fs.h> +#include <asm/processor.h> #include <asm/cacheflush.h> #ifdef CONFIG_HIGHMEM @@ -10,6 +11,7 @@ extern struct page *highmem_start_page; #include <asm/highmem.h> +#include <asm/kmap_types.h> /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); @@ -72,4 +74,82 @@ static inline void copy_user_highpage(st kunmap_atomic(vto, KM_USER1); } +#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP) +/* + * Used when performing a copy_*_user while holding an atomic kmap + */ +struct copy_user_state { + struct page *page; /* The page which is kmap_atomiced */ + void *kaddr; /* Its mapping */ + enum km_type type; /* Its offset */ + int src; /* 1: fixup ESI. 0: Fixup EDI */ + int cpu; /* CPU which the kmap was taken on */ + int seq; /* The kmap's sequence number */ +}; + +/* + * `src' is true if the kmap_atomic virtual address is the source of the copy. + */ +static inline void * +kmap_copy_user(struct copy_user_state *cus, struct page *page, + enum km_type type, int src) +{ + cus->page = page; + cus->kaddr = kmap_atomic(page, type); + if (PageHighMem(page)) { + cus->type = type; + cus->src = src; + BUG_ON(current->copy_user_state != NULL); + current->copy_user_state = cus; + } + return cus->kaddr; +} + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + if (PageHighMem(cus->page)) { + BUG_ON(current->copy_user_state != cus); + kunmap_atomic(cus->kaddr, cus->type); + current->copy_user_state = NULL; + cus->page = NULL; /* debug */ + } +} + +/* + * After a copy_*_user, the kernel virtual address may be different. So + * use kmap_copy_user_addr() to get the new value. + */ +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return cus->kaddr; +} + +#else + +struct copy_user_state { + struct page *page; +}; + +/* + * This must be a macro because `type' may be undefined + */ + +#define kmap_copy_user(cus, page, type, src) \ + ({ \ + (cus)->page = (page); \ + kmap(page); \ + }) + +static inline void kunmap_copy_user(struct copy_user_state *cus) +{ + kunmap(cus->page); +} + +static inline void *kmap_copy_user_addr(struct copy_user_state *cus) +{ + return page_address(cus->page); +} + +#endif + #endif /* _LINUX_HIGHMEM_H */ --- 2.5.25/include/linux/sched.h~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/include/linux/sched.h Tue Jul 9 21:12:35 2002 @@ -248,6 +248,8 @@ extern struct user_struct root_user; typedef struct prio_array prio_array_t; +struct copy_user_state; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -365,6 +367,9 @@ struct task_struct { /* journalling filesystem info */ void *journal_info; struct dentry *proc_dentry; +#ifdef CONFIG_HIGHMEM + struct copy_user_state *copy_user_state; +#endif }; extern void __put_task_struct(struct task_struct *tsk); --- 2.5.25/mm/filemap.c~linus-copy_user-hack Tue Jul 9 21:12:35 2002 +++ 2.5.25-akpm/mm/filemap.c Tue Jul 9 21:53:51 2002 @@ -15,6 +15,7 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/mman.h> +#include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/file.h> #include <linux/iobuf.h> @@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in return retval; } -int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size) +int file_read_actor(read_descriptor_t *desc, struct page *page, + unsigned long offset, unsigned long size) { char *kaddr; + struct copy_user_state copy_user_state; unsigned long left, count = desc->count; if (size > count) size = count; - kaddr = kmap(page); + kaddr = kmap_copy_user(©_user_state, page, KM_FILEMAP, 1); left = __copy_to_user(desc->buf, kaddr + offset, size); - kunmap(page); - + kunmap_copy_user(©_user_state); + if (left) { size -= left; desc->error = -EFAULT; --- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack Tue Jul 9 21:13:02 2002 +++ 2.5.25-akpm/include/asm-i386/kmap_types.h Tue Jul 9 21:13:43 2002 @@ -19,7 +19,8 @@ D(5) KM_BIO_SRC_IRQ, D(6) KM_BIO_DST_IRQ, D(7) KM_PTE0, D(8) KM_PTE1, -D(9) KM_TYPE_NR +D(9) KM_FILEMAP, +D(10) KM_TYPE_NR }; #undef D --- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack Tue Jul 9 21:13:06 2002 +++ 2.5.25-akpm/include/asm-ppc/kmap_types.h Tue Jul 9 21:13:49 2002 @@ -15,6 +15,7 @@ enum km_type { KM_BIO_DST_IRQ, KM_PTE0, KM_PTE1, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack Tue Jul 9 21:13:09 2002 +++ 2.5.25-akpm/include/asm-sparc/kmap_types.h Tue Jul 9 21:13:55 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; --- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack Tue Jul 9 21:13:12 2002 +++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h Tue Jul 9 21:14:01 2002 @@ -9,6 +9,7 @@ enum km_type { KM_USER1, KM_BIO_SRC_IRQ, KM_BIO_DST_IRQ, + KM_FILEMAP, KM_TYPE_NR }; - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-10 5:32 ` Andrew Morton @ 2002-07-10 22:43 ` Martin J. Bligh 2002-07-10 23:08 ` Andrew Morton 0 siblings, 1 reply; 95+ messages in thread From: Martin J. Bligh @ 2002-07-10 22:43 UTC (permalink / raw) To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm > Here's the diff. The kmap() and kmap_atomic() rate is way down > now. Still no benefit from it all through. Martin. Help. Hmmm ... well I have some preliminary results on the 16-way NUMA for kernel compile, and it doesn't make things faster - if anything there's a barely perceptible slowdown (may just be statistical error). On the other hand, Hanna just did some dbench measurements on an 8-way SMP, and got about 15% improvement out of it (she mailed the results to lkml just now). The profile comparison between 2.4 and 2.5 is interesting. Unless I've screwed something up in the profiling, seems like we're spending a lot of time in do_page_fault (with or without your patch). It's *so* different, that I'm inclined to suspect my profiling .... (profile=2). 2.5: 46361 total 0.3985 36500 default_idle 570.3125 5556 do_page_fault 4.5653 1087 do_softirq 5.2260 673 pte_alloc_one 3.8239 529 schedule 0.5700 304 exit_notify 0.3800 192 __wake_up 1.7143 188 do_fork 0.0904 180 system_call 4.0909 174 pgd_alloc 2.7188 109 timer_bh 0.1548 2.4 + patches 22256 total 0.0237 13510 default_idle 259.8077 2042 _text_lock_swap 37.8148 585 lru_cache_add 6.3587 551 do_anonymous_page 1.6596 488 do_generic_file_read 0.4388 401 lru_cache_del 18.2273 385 _text_lock_namei 0.3688 363 __free_pages_ok 0.6927 267 __generic_copy_from_user 2.5673 222 __d_lookup 0.8672 211 zap_page_range 0.2409 188 _text_lock_dec_and_lock 7.8333 179 rmqueue 0.4144 178 __find_get_page 2.7812 161 _text_lock_read_write 1.3644 157 file_read_actor 0.6886 151 nr_free_pages 2.9038 123 link_path_walk 0.0478 118 set_page_dirty 1.2292 101 fput 0.4353 Interestingly, kmap doesn't show up in the virgin 2.5 profile at all, but it does in the 2.4 profile ... 7 flush_all_zero_pkmaps 0.0700 109 kmap_high 0.3028 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-10 22:43 ` Martin J. Bligh @ 2002-07-10 23:08 ` Andrew Morton 2002-07-10 23:26 ` Martin J. Bligh 2002-07-12 17:48 ` Martin J. Bligh 0 siblings, 2 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-10 23:08 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm "Martin J. Bligh" wrote: > > > Here's the diff. The kmap() and kmap_atomic() rate is way down > > now. Still no benefit from it all through. Martin. Help. > > Hmmm ... well I have some preliminary results on the 16-way NUMA > for kernel compile, and it doesn't make things faster - if anything there's > a barely perceptible slowdown (may just be statistical error). > > On the other hand, Hanna just did some dbench measurements on an > 8-way SMP, and got about 15% improvement out of it (she mailed the > results to lkml just now). Yes, thanks. Still scratching my head over that lot. > The profile comparison between 2.4 and 2.5 is interesting. Unless I've > screwed something up in the profiling, seems like we're spending a lot > of time in do_page_fault (with or without your patch). It's *so* different, > that I'm inclined to suspect my profiling .... (profile=2). Yes, I've seen the in-kernel profiler doing odd things. If you're not using the local APIC timer then I think the new IRQ balancing code will break the profiler by steering the clock interrupts away from busy CPUs (!). But ISTR that the profiler has gone whacky even with CONFIG_X86_LOCAL_APIC, which shouldn't be affected by the IRQ steering. But NMI-based oprofile is bang-on target so I recommend you use that. I'll publish my oprofile-for-2.5 asap. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-10 23:08 ` Andrew Morton @ 2002-07-10 23:26 ` Martin J. Bligh 2002-07-11 0:19 ` Andrew Morton 2002-07-12 17:48 ` Martin J. Bligh 1 sibling, 1 reply; 95+ messages in thread From: Martin J. Bligh @ 2002-07-10 23:26 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm > Yes, I've seen the in-kernel profiler doing odd things. If > you're not using the local APIC timer then I think the new > IRQ balancing code will break the profiler by steering the > clock interrupts away from busy CPUs (!). The first patch we apply is this diff -Nur linux-2.5.23-vanilla/arch/i386/kernel/io_apic.c linux-2.5.23-patched/arch/i386/kernel/io_apic.c --- linux-2.5.23-vanilla/arch/i386/kernel/io_apic.c Tue Jun 18 19:11:52 2002 +++ linux-2.5.23-patched/arch/i386/kernel/io_apic.c Thu Jun 27 14:28:51 2002 @@ -247,7 +247,7 @@ static inline void balance_irq(int irq) { -#if CONFIG_SMP +#if (CONFIG_SMP && !CONFIG_MULTIQUAD) irq_balance_t *entry = irq_balance + irq; unsigned long now = jiffies; Which should turn of IRQ balancing completely, I think ... > But ISTR that the profiler has gone whacky even with CONFIG_X86_LOCAL_APIC, > which shouldn't be affected by the IRQ steering. Interrupt 0 will only ever go the first quad (well, it should do if I actually fixed the timers), but CONFIG_X86_LOCAL_APIC=y is on. Wierd ... > But NMI-based oprofile is bang-on target so I recommend you use that. > I'll publish my oprofile-for-2.5 asap. That'd be good, but I'm not sure my box likes NMIs too much ;-) We'll see .... M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-10 23:26 ` Martin J. Bligh @ 2002-07-11 0:19 ` Andrew Morton 0 siblings, 0 replies; 95+ messages in thread From: Andrew Morton @ 2002-07-11 0:19 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm "Martin J. Bligh" wrote: > > ... > > But NMI-based oprofile is bang-on target so I recommend you use that. > > I'll publish my oprofile-for-2.5 asap. > > That'd be good, but I'm not sure my box likes NMIs too much ;-) > We'll see .... http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.25/oprofile.patch.gz Now, oprofile trick for young players: the most useful metric is CPU_CLK_UNHALTED. And the example commandline at http://oprofile.sourceforge.net/doc.php3 works just fine. But the kernel halts the clock in default_idle, and the numbers you get out of the profiler only reflect the amount of time which was spent with the clock unhalted. So for example if you run oprofile against an idle machine, it looks like the machine is spending 40% of its cycles handling the clock timer. Because it doesn't account for halted cycles. I find this interpolation hurts my brain too much, so I always use the `idle=poll' kernel boot parameter so the clock is never halted. This gives profiles which are comprehensible even to simple Australians. So. - Set NR_CPUS to 8. Otherwise oprofile does kmalloc(256kbytes) and won't start. This is Rusty's fault. - patch, build, install kernel - oprofile keeps stuff in /var/opd, and I'm never sure whether my profiles are fresh, or are a mixture of this one and the previous one. So I always blow away /var/opd first. rm -rf /var/opd <start benchmark> op_start --vmlinux=/boot/vmlinux --map-file=/boot/System.map \ --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=300000 op_stop <benchmark ends> oprofpp -dl -i /boot/vmlinux Easy. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-10 23:08 ` Andrew Morton 2002-07-10 23:26 ` Martin J. Bligh @ 2002-07-12 17:48 ` Martin J. Bligh 2002-07-13 11:18 ` Andrea Arcangeli 1 sibling, 1 reply; 95+ messages in thread From: Martin J. Bligh @ 2002-07-12 17:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm OK, preliminary results we've seen about another 15% reduction in CPU load on Apache Specweb99 on an 8-way machine with Andrew's kmap patches! Will send out some more detailed numbers later from the official specweb machine (thanks to Dave Hansen for running the prelim tests). Secondly, I'd like to propose yet another mechanism, which would also be a cheaper way to do things .... based vaguely on an RCU type mechanism: When you go to allocate a new global kmap, the danger is that its PTE entry has not been TLB flushed, and the old value is still in some CPUs TLB cache. If only this task context is going to use this kmap (eg copy_to_user), all we need do is check that we have context switched since we last used this kmap entry (since it was freed is easiest). If we have not, we merely do a local single line invalidate of that entry. If we switch to running on any other CPU in the future, we'll do a global TLB flush on the switch, so no problem there. I suspect that 99% of the time, this means no TLB flush at all, or even an invalidate. If multiple task contexts might use this kmap, we need to check that ALL cpus have done an context switch since this entry was last used. If not, we send a single line invalidate to only those other CPUs that have not switched, and thus might still have a dirty entry ... I believe RCU already has all the mechanisms for checking context switches. By context switch, I really mean TLB flush - ie switched processes, not just threads. Madness? M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-12 17:48 ` Martin J. Bligh @ 2002-07-13 11:18 ` Andrea Arcangeli 0 siblings, 0 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-13 11:18 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Linus Torvalds, Rik van Riel, linux-mm On Fri, Jul 12, 2002 at 10:48:06AM -0700, Martin J. Bligh wrote: > OK, preliminary results we've seen about another 15% reduction in CPU load > on Apache Specweb99 on an 8-way machine with Andrew's kmap patches! > Will send out some more detailed numbers later from the official specweb > machine (thanks to Dave Hansen for running the prelim tests). > > Secondly, I'd like to propose yet another mechanism, which would > also be a cheaper way to do things .... based vaguely on an RCU > type mechanism: > > When you go to allocate a new global kmap, the danger is that its PTE > entry has not been TLB flushed, and the old value is still in some CPUs > TLB cache. > > If only this task context is going to use this kmap (eg copy_to_user), > all we need do is check that we have context switched since we last you also must turn off the global bit from the kmap_prot first if you want this to work. > used this kmap entry (since it was freed is easiest). If we have not, we > merely do a local single line invalidate of that entry. If we switch to > running on any other CPU in the future, we'll do a global TLB flush on > the switch, so no problem there. I suspect that 99% of the time, this > means no TLB flush at all, or even an invalidate. > > If multiple task contexts might use this kmap, we need to check that > ALL cpus have done an context switch since this entry was last used. > If not, we send a single line invalidate to only those other CPUs that > have not switched, and thus might still have a dirty entry ... > > I believe RCU already has all the mechanisms for checking context > switches. By context switch, I really mean TLB flush - ie switched RCU doesn't have and doesn't need that kind of mechanism. First of all RCU don't even track context switches, it only tracks quiescent points (even if there's no context switch if somebody called schedule() that's a quiescent point even if we keep running the current task afterwards maybe because it's the only running task). But even tracking context switches isn't enough as you said, you've to track mm_switches that is yet in a different place. So the mm_switch code would need a new instrumentation but that's not the main issue, it's just the logic to keep track of this info seem overcomplex, doesn't really sound an obvious optimization to me. Also I would remind it should be implemented it in a zerocost way for the 64bit archs but that's certainly not a problem (just a reminder :). > processes, not just threads. > > Madness? > > M. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: scalable kmap (was Re: vm lock contention reduction) 2002-07-09 3:17 ` Andrew Morton 2002-07-09 4:28 ` Martin J. Bligh @ 2002-07-09 13:59 ` Benjamin LaHaise 1 sibling, 0 replies; 95+ messages in thread From: Benjamin LaHaise @ 2002-07-09 13:59 UTC (permalink / raw) To: Andrew Morton Cc: Martin J. Bligh, Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm On Mon, Jul 08, 2002 at 08:17:36PM -0700, Andrew Morton wrote: > -D(9) KM_TYPE_NR > +D(9) KM_FILEMAP, > +D(10) KM_TYPE_NR Reusing KM_USER[01] would be better than adding yet another kmap type. -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-07 2:50 ` Andrew Morton 2002-07-07 3:05 ` Linus Torvalds 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh @ 2002-07-08 0:38 ` William Lee Irwin III 2 siblings, 0 replies; 95+ messages in thread From: William Lee Irwin III @ 2002-07-08 0:38 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm, Martin J. Bligh On Sat, Jul 06, 2002 at 07:50:41PM -0700, Andrew Morton wrote: > Any time you have one of these pages in use, the process gets > pinned onto the current CPU. If we run out of per-cpu kmaps, > just fall back to traditional kmap(). This is not particularly difficult, it only requires a depth counter and a saved cpumask for when it becomes unpinned again. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 5:51 ` Linus Torvalds 2002-07-05 6:08 ` Linus Torvalds @ 2002-07-05 6:46 ` Andrew Morton 2002-07-05 14:25 ` Rik van Riel 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 6:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm Linus Torvalds wrote: > > On Thu, 4 Jul 2002, Andrew Morton wrote: > > > > > > Get away from this "minimum wait" thing, because it is WRONG. > > > > Well yes, we do want to batch work up. And a crude way of doing that > > is "each time 64 pages have come clean, wake up one waiter". > > That would probably work, but we also need to be careful that we don't get > into some strange situations (ie we have 50 waiters that all needed memory > at the same time, and less than 50*64 pages that caused us to be in > trouble, so we only wake up the 46 first waiters and the last 4 waiters > get stuck until the next batch even though we now have _lots_ of pages > free). > > Dont't laugh - things like this has actually happened at times with some > of our balancing work with HIGHMEM/NORMAL. Basically, the logic would go: > "everybody who ends up having to wait for an allocation should free at > least N pages", but then you would end up with 50*N pages total that the > system thought it "needed" to free up, and that could be a big number that > would cause the VM to want to free up stuff long after it was really done. Well we'd certainly need to make direct caller-reclaims the normal mode of operation. Avoid context switches in the page allocator. However it occurs to me that we could easily get in the situation where a page allocator find a PageDirty or PageWriteback page on the tail of the LRU and waits on it, but there are plenty of reclaimable pages further along in the LRU. In this situation it would better to just tag the page as "wait on it next time around" and then just skip it. This is basically what the PageLaunder/BH_Launder logic was doing. I think. For a long time it wasn't very effective because it wasn't able to wait on page/buffers which were written by someone else. Andrea finally sorted that by setting BH_Launder in submit_bh. All those deadlock problems have gone away now and we can (re)implement this much more simply. But what I don't like about it is that it's dumb. The kernel ends up doing these enormous list scans and not achieving very much, whereas we could achieve the same effect by doing a bit of page motion at interrupt time. It's a polled-versus-interrupt thing. And right now, that dumbness is multiplied by the CPU count because it happens under pagemap_lru_lock. But with the bustup of that, at least we can be scalably dumb ;) > > Or > > "as soon as the number of reclaimable pages exceeds zone->pages_min". > > Some logic would also be needed to prevent new page allocators from > > jumping the queue, of course. > > Yeah, the unfairness is the thing that really can be nasty. > > On the other hand, some unfairness is ok too - and can improve throughput. > So jumping the queue is fine, you just must not be able to _consistently_ > jump the queue. > > (In fact, jumping the queue is inevitable to some degree - not allowing > any queue jumping at all would imply that any time _anybody_ starts > waiting, every single allocation afterwards will have to wait until the > waiter got woken up. Which we have actually tried before at times, but > which causes really really bad behaviour and horribly bad "pausing") > > You probably want the occasional allocator able to jump the queue, but the > "big spenders" to be caught eventually. "Fairness" really doesn't mean > that "everybody should wait equally much", it really means "people should > wait roughly relative to how much as they 'spend' memory". Right. And that implies heuristics to divine which tasks are heavy page allocators. uh-oh. But as a first-order approximation: if a task is currently allocating pages from within generic_file_write(), then whack it hard. Here's PageLaunder-for-2.5. (Not tested enough - don't apply ;)) It seems to help vmstat somewhat, but it still gets stuck in shrink_cache->get_request_wait() a lot. include/linux/page-flags.h | 7 +++++++ mm/filemap.c | 1 + mm/vmscan.c | 2 +- 3 files changed, 9 insertions(+), 1 deletion(-) --- 2.5.24/mm/vmscan.c~second-chance-throttle Thu Jul 4 23:17:14 2002 +++ 2.5.24-akpm/mm/vmscan.c Thu Jul 4 23:18:40 2002 @@ -443,7 +443,7 @@ shrink_cache(int nr_pages, zone_t *class * IO in progress? Leave it at the back of the list. */ if (unlikely(PageWriteback(page))) { - if (may_enter_fs) { + if (may_enter_fs && TestSetPageThrottle(page)) { page_cache_get(page); spin_unlock(&pagemap_lru_lock); wait_on_page_writeback(page); --- 2.5.24/include/linux/page-flags.h~second-chance-throttle Thu Jul 4 23:18:35 2002 +++ 2.5.24-akpm/include/linux/page-flags.h Thu Jul 4 23:20:02 2002 @@ -65,6 +65,7 @@ #define PG_private 12 /* Has something at ->private */ #define PG_writeback 13 /* Page is under writeback */ #define PG_nosave 15 /* Used for system suspend/resume */ +#define PG_throttle 16 /* page allocator should throttle */ /* * Global page accounting. One instance per CPU. @@ -216,6 +217,12 @@ extern void get_page_state(struct page_s #define ClearPageNosave(page) clear_bit(PG_nosave, &(page)->flags) #define TestClearPageNosave(page) test_and_clear_bit(PG_nosave, &(page)->flags) +#define PageThrottle(page) test_bit(PG_throttle, &(page)->flags) +#define SetPageThrottle(page) set_bit(PG_throttle, &(page)->flags) +#define TestSetPageThrottle(page) test_and_set_bit(PG_throttle, &(page)->flags) +#define ClearPageThrottle(page) clear_bit(PG_throttle, &(page)->flags) +#define TestClearPageThrottle(page) test_and_clear_bit(PG_throttle, &(page)->flags) + /* * The PageSwapCache predicate doesn't use a PG_flag at this time, * but it may again do so one day. --- 2.5.24/mm/filemap.c~second-chance-throttle Thu Jul 4 23:20:08 2002 +++ 2.5.24-akpm/mm/filemap.c Thu Jul 4 23:20:23 2002 @@ -682,6 +682,7 @@ void end_page_writeback(struct page *pag { wait_queue_head_t *waitqueue = page_waitqueue(page); smp_mb__before_clear_bit(); + ClearPageThrottle(page); if (!TestClearPageWriteback(page)) BUG(); smp_mb__after_clear_bit(); - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 6:46 ` Andrew Morton @ 2002-07-05 14:25 ` Rik van Riel 0 siblings, 0 replies; 95+ messages in thread From: Rik van Riel @ 2002-07-05 14:25 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Andrea Arcangeli, linux-mm On Thu, 4 Jul 2002, Andrew Morton wrote: > Linus Torvalds wrote: > > You probably want the occasional allocator able to jump the queue, but the > > "big spenders" to be caught eventually. "Fairness" really doesn't mean > > that "everybody should wait equally much", it really means "people should > > wait roughly relative to how much as they 'spend' memory". > > Right. And that implies heuristics to divine which tasks are > heavy page allocators. uh-oh. This isn't too hard. In order to achieve this you: 1) wait for one kswapd loop when you get below a high water mark 2) allocate one page when kswapd wakes everybody up again (at this point we're NOT necessarily above the high water mark again...) This means that once the system is under a lot of pressure heavy allocators will be throttled a lot more than light allocators and the system gets a chance to free things. Of course, kswapd does everything (except get_request) asynchronously so a kswapd loop should be relatively short. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 2:18 ` Andrew Morton 2002-07-05 2:16 ` Rik van Riel @ 2002-07-05 23:11 ` William Lee Irwin III 2002-07-05 23:48 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: William Lee Irwin III @ 2002-07-05 23:11 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote: > Of course, that change means that we wouldn't be able to throttle > page allocators against IO any more, and we'd have to do something > smarter. What a shame ;) This is actually necessary IMHO. Some testing I've been able to do seems to reveal the current throttling mechanism as inadequate. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 23:11 ` William Lee Irwin III @ 2002-07-05 23:48 ` Andrew Morton 2002-07-06 0:11 ` Rik van Riel 0 siblings, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-05 23:48 UTC (permalink / raw) To: William Lee Irwin III Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds William Lee Irwin III wrote: > > On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote: > > Of course, that change means that we wouldn't be able to throttle > > page allocators against IO any more, and we'd have to do something > > smarter. What a shame ;) > > This is actually necessary IMHO. Some testing I've been able to do seems > to reveal the current throttling mechanism as inadequate. > I don't think so. If you're referring to the situation where your 4G machine had 3.5G dirty pages without triggering writeback. That's not a generic problem. It's something specific to your setup. You're going to have to repeat it and stick some printk's into balance_dirty_pages(). No other way of finding it. Possibly it's an arith overflow in there, but I more suspect that your nr_pagecache_pages() function is returning an incorrect value. This happened to David M-T just this week in the 2.4 kernel - the nr_buffermem_pages() function was returning a bad value due to an unaccounted-for hole in the memory map and the observed effect was just the same. So. Please debug it. - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-05 23:48 ` Andrew Morton @ 2002-07-06 0:11 ` Rik van Riel 2002-07-06 0:31 ` Linus Torvalds 2002-07-06 0:48 ` Andrew Morton 0 siblings, 2 replies; 95+ messages in thread From: Rik van Riel @ 2002-07-06 0:11 UTC (permalink / raw) To: Andrew Morton Cc: William Lee Irwin III, Andrea Arcangeli, linux-mm, Linus Torvalds On Fri, 5 Jul 2002, Andrew Morton wrote: > William Lee Irwin III wrote: > > On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote: > > > Of course, that change means that we wouldn't be able to throttle > > > page allocators against IO any more, and we'd have to do something > > > smarter. What a shame ;) > > > > This is actually necessary IMHO. Some testing I've been able to do seems > > to reveal the current throttling mechanism as inadequate. > > I don't think so. If you're referring to the situation where your > 4G machine had 3.5G dirty pages without triggering writeback. > > That's not a generic problem. But it is, mmap() and anonymous memory don't trigger writeback. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-06 0:11 ` Rik van Riel @ 2002-07-06 0:31 ` Linus Torvalds 2002-07-06 0:45 ` Rik van Riel 2002-07-06 0:48 ` Andrew Morton 1 sibling, 1 reply; 95+ messages in thread From: Linus Torvalds @ 2002-07-06 0:31 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, William Lee Irwin III, Andrea Arcangeli, linux-mm On Fri, 5 Jul 2002, Rik van Riel wrote: > > But it is, mmap() and anonymous memory don't trigger writeback. I don't think we can fix that, without going back to the approach of marking any writable memory as read-only and counting it at page-fault time. Wasn't it you who did that test-patch originally? From what I remember, it had basically zero downside for normal UNIX applications (shared memory that is written to is so rare that it doesn't end up on the radar), but was quite expensive for the DB kind of shmem usage where shared writable memory is the major component.. There might be some way to avoid the page fault badness (the large page stuff will do this automatically, for example), which might make the "let's keep track of dirty mappings explicitly" approach acceptable again. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-06 0:31 ` Linus Torvalds @ 2002-07-06 0:45 ` Rik van Riel 0 siblings, 0 replies; 95+ messages in thread From: Rik van Riel @ 2002-07-06 0:45 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, William Lee Irwin III, Andrea Arcangeli, linux-mm On Fri, 5 Jul 2002, Linus Torvalds wrote: > On Fri, 5 Jul 2002, Rik van Riel wrote: > > > > But it is, mmap() and anonymous memory don't trigger writeback. > > I don't think we can fix that, without going back to the approach of > marking any writable memory as read-only and counting it at page-fault > time. I don't think we have to fix it, as long as the shrink_caches/page_launder function is well balanced and throttling is done intelligently. > Wasn't it you who did that test-patch originally? I don't remember who did it, so I suspect it wasn't me. Could it be Ben ? > There might be some way to avoid the page fault badness (the large page > stuff will do this automatically, for example), which might make the > "let's keep track of dirty mappings explicitly" approach acceptable > again. That's another approach, I guess it'll be worth looking into both (and searching the web for what other people have done before us with both ;)) regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-06 0:11 ` Rik van Riel 2002-07-06 0:31 ` Linus Torvalds @ 2002-07-06 0:48 ` Andrew Morton 2002-07-08 0:59 ` William Lee Irwin III 1 sibling, 1 reply; 95+ messages in thread From: Andrew Morton @ 2002-07-06 0:48 UTC (permalink / raw) To: Rik van Riel Cc: William Lee Irwin III, Andrea Arcangeli, linux-mm, Linus Torvalds Rik van Riel wrote: > > On Fri, 5 Jul 2002, Andrew Morton wrote: > > William Lee Irwin III wrote: > > > On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote: > > > > Of course, that change means that we wouldn't be able to throttle > > > > page allocators against IO any more, and we'd have to do something > > > > smarter. What a shame ;) > > > > > > This is actually necessary IMHO. Some testing I've been able to do seems > > > to reveal the current throttling mechanism as inadequate. > > > > I don't think so. If you're referring to the situation where your > > 4G machine had 3.5G dirty pages without triggering writeback. > > > > That's not a generic problem. > > But it is, mmap() and anonymous memory don't trigger writeback. > That's different. Bill hit a problem just running tiobench. We can run balance_dirty_pages() when a COW copyout is performed, which will approximately improve things. But the whole idea of the dirty memory thresholds just seems bust, really. Because how do you pick the thresholds? 40%. Bah. One idea we kicked around a while back was to remove the throttling from the write(2) path altogether, and to perform the throttling inside the page allocator instead, where we have more information. And perhaps set PF_WRITER before doing so, so the VM can penalise the caller more heavily. But this doesn't work if the write(2) caller is redirtying existing pagecache, just like writes to anon pages, mmap pages, etc. Alternatively: the VM can periodically reach over and twiddle the dirty memory thresholds which balance_dirty_pages() uses. I don't like our chances of getting that right though ;) - -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: vm lock contention reduction 2002-07-06 0:48 ` Andrew Morton @ 2002-07-08 0:59 ` William Lee Irwin III 0 siblings, 0 replies; 95+ messages in thread From: William Lee Irwin III @ 2002-07-08 0:59 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds Rik van Riel wrote: >> But it is, mmap() and anonymous memory don't trigger writeback. On Fri, Jul 05, 2002 at 05:48:48PM -0700, Andrew Morton wrote: > That's different. Bill hit a problem just running tiobench. > We can run balance_dirty_pages() when a COW copyout is performed, > which will approximately improve things. > But the whole idea of the dirty memory thresholds just seems bust, > really. Because how do you pick the thresholds? 40%. Bah. I don't know what the answer should be, but I can certainly demonstrate this in a rather uninteresting situation (4GB, 4cpu's, 1 disk, 16 tasks). But I can concur with that evaluation. In my esteem fixed fractions of memory don't have a very direct relationship to what's going on. Cheers, Bill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 14:28 ` Richard J Moore 0 siblings, 0 replies; 95+ messages in thread From: Richard J Moore @ 2002-07-10 14:28 UTC (permalink / raw) To: John Levon Cc: Andrew Morton, Andrea Arcangeli, bob, Karim Yaghmour, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds >Sure, there are all sorts of things where some tracing can come in >useful. The question is whether it's really something the mainline >kernel should be doing, and if the gung-ho approach is nice or not. > >> The fact that so many kernel subsystems already have their own tracing >> built-in (see other posting) > >Your list was almost entirely composed of per-driver debug routines. >This is not the same thing as logging trap entry/exits, syscalls etc >etc, on any level, and I'm a bit perplexed that you're making such an >assocation. There's a balance to be struck with tracing. First we should point out that the recording mechanism doesn't have to intrude within the kernel unlss you want init time tracing. The bigger point of contention seems to be that of instrumentation. Yes, it is very ugly to have thousands of trace points littering the source. On the otherhand, for basic serviceability a minimal set should be present in a production system - these would typically allow the external interface of any component to be traced. For low-level tracing - i.e. internal routines etc - the dynamic trace can be used. This requires no modification to source. The tracepoint is implemanted dynamically in execting code. DProbes+LTT provides this capability. Some level of tracing (along with other complementary PD tools e.g. crash dump) needs to be readiliy available to deal with those types of problem we see with mature systems employed in the production environment. Typically such problems are not readily recreatable nor even prictable. I've often had to solve problems which impact a business environment severely, where one server out of 2000 gets hit each day, but its a different one each day. Its under those circumstances that trace along without other automated data capturing problem determination tools become invaluable. And its a fact of life that only those types of difficult problem remain once we've beaten a system to death in developments and test. Being able to use a common set of tools whatever the componets under investigation greatly eases problem determination. This is especially so where you have the ability to use dprobes with LTT to provide ad hoc tracepoints that were not originally included by the developers. Richard J Moore CEng, MIEE, Consulting IT Specialist, TSM RAS Project Lead - Linux Technology Centre (ATS-PIC). http://oss.software.ibm.com/developerworks/opensource/linux Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183 IBM UK Ltd, MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK The IBM Academy will hold a Conference on Performance Engineering in Toronto July 8-10. A High Availability Conference follows July 10-12. Details on http://w3.ibm.com/academy/ John Levon <movement@marcelothewonderp To: Karim Yaghmour <karim@opersys.com> enguin.com> cc: Linus Torvalds <torvalds@transmeta.com>, Andrew Morton Sent by: John Levon <akpm@zip.com.au>, Andrea Arcangeli <andrea@suse.de>, Rik van Riel <moz@compsoc.man.ac.uk> <riel@conectiva.com.br>, "linux-mm@kvack.org" <linux-mm@kvack.org>, mjbligh@linux.ibm.com, linux-kernel@vger.kernel.org, Richard J Moore/UK/IBM@IBMGB, bob <bob@watson.ibm.com> 10/07/2002 00:38 Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction) Please respond to John Levon On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: [snip] > And the list goes on. Sure, there are all sorts of things where some tracing can come in useful. The question is whether it's really something the mainline kernel should be doing, and if the gung-ho approach is nice or not. > The fact that so many kernel subsystems already have their own tracing > built-in (see other posting) Your list was almost entirely composed of per-driver debug routines. This is not the same thing as logging trap entry/exits, syscalls etc etc, on any level, and I'm a bit perplexed that you're making such an assocation. > expect user-space developers to efficiently use the kernel if they > have > absolutely no idea about the dynamic interaction their processes have > with the kernel and how this interaction is influenced by and > influences > the interaction with other processes? This is clearly an exaggeration. And seeing as something like LTT doesn't (and cannot) tell the "whole story" either, I could throw the same argument directly back at you. The point is, there comes a point of no return where usefulness gets outweighed by ugliness. For the very few cases that such detailed information is really useful, the user can usually install the needed special-case tools. In contrast a profiling mechanism that improves on the poor lot that currently exists (gprof, readprofile) has a truly general utility, and can hopefully be done without too much ugliness. The primary reason I want to see something like this is to kill the ugly code I have to maintain. > > The entry.S examine-the-registers approach is simple enough, but > > it's > > not much more tasteful than sys_call_table hackery IMHO > > I guess we won't agree on this. From my point of view it is much > better > to have the code directly within entry.S for all to see instead of > having some external software play around with the syscall table in a > way kernel users can't trace back to the kernel's own code. Eh ? I didn't say sys_call_table hackery was better. I said the entry.S thing wasn't much better ... regards john ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 14:28 ` Richard J Moore 0 siblings, 0 replies; 95+ messages in thread From: Richard J Moore @ 2002-07-10 14:28 UTC (permalink / raw) To: John Levon Cc: Andrew Morton, Andrea Arcangeli, bob, Karim Yaghmour, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds >Sure, there are all sorts of things where some tracing can come in >useful. The question is whether it's really something the mainline >kernel should be doing, and if the gung-ho approach is nice or not. > >> The fact that so many kernel subsystems already have their own tracing >> built-in (see other posting) > >Your list was almost entirely composed of per-driver debug routines. >This is not the same thing as logging trap entry/exits, syscalls etc >etc, on any level, and I'm a bit perplexed that you're making such an >assocation. There's a balance to be struck with tracing. First we should point out that the recording mechanism doesn't have to intrude within the kernel unlss you want init time tracing. The bigger point of contention seems to be that of instrumentation. Yes, it is very ugly to have thousands of trace points littering the source. On the otherhand, for basic serviceability a minimal set should be present in a production system - these would typically allow the external interface of any component to be traced. For low-level tracing - i.e. internal routines etc - the dynamic trace can be used. This requires no modification to source. The tracepoint is implemanted dynamically in execting code. DProbes+LTT provides this capability. Some level of tracing (along with other complementary PD tools e.g. crash dump) needs to be readiliy available to deal with those types of problem we see with mature systems employed in the production environment. Typically such problems are not readily recreatable nor even prictable. I've often had to solve problems which impact a business environment severely, where one server out of 2000 gets hit each day, but its a different one each day. Its under those circumstances that trace along without other automated data capturing problem determination tools become invaluable. And its a fact of life that only those types of difficult problem remain once we've beaten a system to death in developments and test. Being able to use a common set of tools whatever the componets under investigation greatly eases problem determination. This is especially so where you have the ability to use dprobes with LTT to provide ad hoc tracepoints that were not originally included by the developers. Richard J Moore CEng, MIEE, Consulting IT Specialist, TSM RAS Project Lead - Linux Technology Centre (ATS-PIC). http://oss.software.ibm.com/developerworks/opensource/linux Office: (+44) (0)1962-817072, Mobile: (+44) (0)7768-298183 IBM UK Ltd, MP135 Galileo Centre, Hursley Park, Winchester, SO21 2JN, UK The IBM Academy will hold a Conference on Performance Engineering in Toronto July 8-10. A High Availability Conference follows July 10-12. Details on http://w3.ibm.com/academy/ John Levon <movement@marcelothewonderp To: Karim Yaghmour <karim@opersys.com> enguin.com> cc: Linus Torvalds <torvalds@transmeta.com>, Andrew Morton Sent by: John Levon <akpm@zip.com.au>, Andrea Arcangeli <andrea@suse.de>, Rik van Riel <moz@compsoc.man.ac.uk> <riel@conectiva.com.br>, "linux-mm@kvack.org" <linux-mm@kvack.org>, mjbligh@linux.ibm.com, linux-kernel@vger.kernel.org, Richard J Moore/UK/IBM@IBMGB, bob <bob@watson.ibm.com> 10/07/2002 00:38 Subject: Re: Enhanced profiling support (was Re: vm lock contention reduction) Please respond to John Levon On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote: [snip] > And the list goes on. Sure, there are all sorts of things where some tracing can come in useful. The question is whether it's really something the mainline kernel should be doing, and if the gung-ho approach is nice or not. > The fact that so many kernel subsystems already have their own tracing > built-in (see other posting) Your list was almost entirely composed of per-driver debug routines. This is not the same thing as logging trap entry/exits, syscalls etc etc, on any level, and I'm a bit perplexed that you're making such an assocation. > expect user-space developers to efficiently use the kernel if they > have > absolutely no idea about the dynamic interaction their processes have > with the kernel and how this interaction is influenced by and > influences > the interaction with other processes? This is clearly an exaggeration. And seeing as something like LTT doesn't (and cannot) tell the "whole story" either, I could throw the same argument directly back at you. The point is, there comes a point of no return where usefulness gets outweighed by ugliness. For the very few cases that such detailed information is really useful, the user can usually install the needed special-case tools. In contrast a profiling mechanism that improves on the poor lot that currently exists (gprof, readprofile) has a truly general utility, and can hopefully be done without too much ugliness. The primary reason I want to see something like this is to kill the ugly code I have to maintain. > > The entry.S examine-the-registers approach is simple enough, but > > it's > > not much more tasteful than sys_call_table hackery IMHO > > I guess we won't agree on this. From my point of view it is much > better > to have the code directly within entry.S for all to see instead of > having some external software play around with the syscall table in a > way kernel users can't trace back to the kernel's own code. Eh ? I didn't say sys_call_table hackery was better. I said the entry.S thing wasn't much better ... regards john -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 14:28 ` Richard J Moore @ 2002-07-10 20:30 ` Karim Yaghmour -1 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 20:30 UTC (permalink / raw) To: Richard J Moore Cc: John Levon, Andrew Morton, Andrea Arcangeli, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds Richard J Moore wrote: > Some level of tracing (along with other complementary PD tools e.g. crash > dump) needs to be readiliy available to deal with those types of problem we > see with mature systems employed in the production environment. Typically > such problems are not readily recreatable nor even prictable. I've often > had to solve problems which impact a business environment severely, where > one server out of 2000 gets hit each day, but its a different one each day. > Its under those circumstances that trace along without other automated data > capturing problem determination tools become invaluable. And its a fact of > life that only those types of difficult problem remain once we've beaten a > system to death in developments and test. Being able to use a common set of > tools whatever the componets under investigation greatly eases problem > determination. This is especially so where you have the ability to use > dprobes with LTT to provide ad hoc tracepoints that were not originally > included by the developers. I definitely agree. One case which perfectly illustrates how extreme these situations can be is the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool very similar to LTT to locate the priority inversion problem the Pathfinder had while it was on Mars. The full account gives an interesting read (sorry for the link being on MS's website but its author works for MS research ...): http://research.microsoft.com/research/os/mbj/Mars_Pathfinder/Authoritative_Account.html Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 20:30 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-10 20:30 UTC (permalink / raw) To: Richard J Moore Cc: John Levon, Andrew Morton, Andrea Arcangeli, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds Richard J Moore wrote: > Some level of tracing (along with other complementary PD tools e.g. crash > dump) needs to be readiliy available to deal with those types of problem we > see with mature systems employed in the production environment. Typically > such problems are not readily recreatable nor even prictable. I've often > had to solve problems which impact a business environment severely, where > one server out of 2000 gets hit each day, but its a different one each day. > Its under those circumstances that trace along without other automated data > capturing problem determination tools become invaluable. And its a fact of > life that only those types of difficult problem remain once we've beaten a > system to death in developments and test. Being able to use a common set of > tools whatever the componets under investigation greatly eases problem > determination. This is especially so where you have the ability to use > dprobes with LTT to provide ad hoc tracepoints that were not originally > included by the developers. I definitely agree. One case which perfectly illustrates how extreme these situations can be is the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool very similar to LTT to locate the priority inversion problem the Pathfinder had while it was on Mars. The full account gives an interesting read (sorry for the link being on MS's website but its author works for MS research ...): http://research.microsoft.com/research/os/mbj/Mars_Pathfinder/Authoritative_Account.html Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 20:30 ` Karim Yaghmour @ 2002-07-10 21:41 ` Andrea Arcangeli -1 siblings, 0 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-10 21:41 UTC (permalink / raw) To: Karim Yaghmour Cc: Richard J Moore, John Levon, Andrew Morton, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds On Wed, Jul 10, 2002 at 04:30:42PM -0400, Karim Yaghmour wrote: > > Richard J Moore wrote: > > Some level of tracing (along with other complementary PD tools e.g. crash > > dump) needs to be readiliy available to deal with those types of problem we > > see with mature systems employed in the production environment. Typically > > such problems are not readily recreatable nor even prictable. I've often > > had to solve problems which impact a business environment severely, where > > one server out of 2000 gets hit each day, but its a different one each day. > > Its under those circumstances that trace along without other automated data > > capturing problem determination tools become invaluable. And its a fact of > > life that only those types of difficult problem remain once we've beaten a > > system to death in developments and test. Being able to use a common set of > > tools whatever the componets under investigation greatly eases problem > > determination. This is especially so where you have the ability to use > > dprobes with LTT to provide ad hoc tracepoints that were not originally > > included by the developers. > > I definitely agree. > > One case which perfectly illustrates how extreme these situations can be is > the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool > very similar to LTT to locate the priority inversion problem the Pathfinder > had while it was on Mars. btw, on the topic, with our semaphores there's no way to handle priority inversion with SCHED_RR tasks, if there's more than one task that runs in RT priority we may fall into starvation of RT tasks too the same way. No starvation can happen of course if all tasks in the systems belongs to the same scheduler policy (nice levels can have effects but they're not indefinite delays). The fix Ingo used for SCHED_IDLE is to have a special call to the scheduler while returning to userspace, so in the only place where we know the kernel isn't holding any lock. But while it's going to only generate some minor unexpected cpu load with SCHED_IDLE, generalizing that hack to make all tasks scheduling inside the kernel running with RT priority isn't going to provide a nice/fair behaviour (some task infact could run way too much if it's very system-hungry, in particular with -preempt, which could again generate starvation of userspace, even if not anymore because of kernel locks). Maybe I'm overlooking something simple but I guess it's not going to be easy to fix it, for the semaphores it isn't too bad, they could learn how to raise priority of a special holder when needed, but for any semaphore-by-hand (for example spinlock based) it would require some major auditing. Andrea ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) @ 2002-07-10 21:41 ` Andrea Arcangeli 0 siblings, 0 replies; 95+ messages in thread From: Andrea Arcangeli @ 2002-07-10 21:41 UTC (permalink / raw) To: Karim Yaghmour Cc: Richard J Moore, John Levon, Andrew Morton, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds On Wed, Jul 10, 2002 at 04:30:42PM -0400, Karim Yaghmour wrote: > > Richard J Moore wrote: > > Some level of tracing (along with other complementary PD tools e.g. crash > > dump) needs to be readiliy available to deal with those types of problem we > > see with mature systems employed in the production environment. Typically > > such problems are not readily recreatable nor even prictable. I've often > > had to solve problems which impact a business environment severely, where > > one server out of 2000 gets hit each day, but its a different one each day. > > Its under those circumstances that trace along without other automated data > > capturing problem determination tools become invaluable. And its a fact of > > life that only those types of difficult problem remain once we've beaten a > > system to death in developments and test. Being able to use a common set of > > tools whatever the componets under investigation greatly eases problem > > determination. This is especially so where you have the ability to use > > dprobes with LTT to provide ad hoc tracepoints that were not originally > > included by the developers. > > I definitely agree. > > One case which perfectly illustrates how extreme these situations can be is > the Mars Pathfinder. The folks at the Jet Propulsion Lab used a tracing tool > very similar to LTT to locate the priority inversion problem the Pathfinder > had while it was on Mars. btw, on the topic, with our semaphores there's no way to handle priority inversion with SCHED_RR tasks, if there's more than one task that runs in RT priority we may fall into starvation of RT tasks too the same way. No starvation can happen of course if all tasks in the systems belongs to the same scheduler policy (nice levels can have effects but they're not indefinite delays). The fix Ingo used for SCHED_IDLE is to have a special call to the scheduler while returning to userspace, so in the only place where we know the kernel isn't holding any lock. But while it's going to only generate some minor unexpected cpu load with SCHED_IDLE, generalizing that hack to make all tasks scheduling inside the kernel running with RT priority isn't going to provide a nice/fair behaviour (some task infact could run way too much if it's very system-hungry, in particular with -preempt, which could again generate starvation of userspace, even if not anymore because of kernel locks). Maybe I'm overlooking something simple but I guess it's not going to be easy to fix it, for the semaphores it isn't too bad, they could learn how to raise priority of a special holder when needed, but for any semaphore-by-hand (for example spinlock based) it would require some major auditing. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-10 21:41 ` Andrea Arcangeli (?) @ 2002-07-11 4:47 ` Karim Yaghmour 2002-07-11 4:59 ` Karim Yaghmour -1 siblings, 1 reply; 95+ messages in thread From: Karim Yaghmour @ 2002-07-11 4:47 UTC (permalink / raw) To: Andrea Arcangeli Cc: Richard J Moore, John Levon, Andrew Morton, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 4771 bytes --] Andrea Arcangeli wrote: > btw, on the topic, with our semaphores there's no way to handle priority > inversion with SCHED_RR tasks, if there's more than one task that runs > in RT priority we may fall into starvation of RT tasks too the same way. > > No starvation can happen of course if all tasks in the systems belongs > to the same scheduler policy (nice levels can have effects but they're > not indefinite delays). > > The fix Ingo used for SCHED_IDLE is to have a special call to the > scheduler while returning to userspace, so in the only place where we > know the kernel isn't holding any lock. But while it's going to only > generate some minor unexpected cpu load with SCHED_IDLE, generalizing > that hack to make all tasks scheduling inside the kernel running with RT > priority isn't going to provide a nice/fair behaviour (some task infact could > run way too much if it's very system-hungry, in particular with > -preempt, which could again generate starvation of userspace, even if > not anymore because of kernel locks). Maybe I'm overlooking something > simple but I guess it's not going to be easy to fix it, for the > semaphores it isn't too bad, they could learn how to raise priority of a > special holder when needed, but for any semaphore-by-hand (for example > spinlock based) it would require some major auditing. I wasn't aware of this particular problem, but I certainly can see LTT as being helpful in trying to understand the actual interactions. The custom event creation API is rather simple to use if you would like to instrument some of the semaphore code (the Sys V IPC code is already instrumented at a basic level): trace_create_event() trace_create_owned_event() trace_destroy_event() trace_std_formatted_event() trace_raw_event() Here's the complete description: int trace_create_event(char* pm_event_type, char* pm_event_desc, int pm_format_type, char* pm_format_data) pm_event_type is a short string describing the type of event. pm_event_desc is required if you are using trace_std_formatted_event(), more on this below. pm_format_type and pm_format_data are required if you are using trace_raw_event(). The function returns a unique custom event ID. int trace_create_owned_event(char* pm_event_type, char* pm_event_desc, int pm_format_type, char* pm_format_data, pid_t pm_owner_pid) Same as trace_create_event() except that all events created using this call will be deleted once the process with pm_owner_pid exits. Not really useful in kernel space, but essential for providing user-space events. void trace_destroy_event(int pm_event_id) Destroys the event with given pm_event_id. int trace_std_formatted_event(int pm_event_id, ...) Instead of having a slew of printk's all around the code, pm_event_desc is filled with a printk-like string at the event creation and the actual params used to fill this string are passed to trace_std_formatted_event(). Be aware that this function uses va_start/vsprintf/va_end. The resulting string is logged in the trace as such and is visible in the trace much like a printk in a log. Except that you can place this in paths where printk's can't go and you can be sure that events logged with this are delivered even in high throughput situations (granted the trace buffer size is adequate). int trace_raw_event(int pm_event_id, int pm_event_size, void* pm_event_data) This is the easiest way to log tons of binary data. Just give it the size of the data (pm_event_size) and a pointer to it (pm_event_data) and it gets written in the trace. This is what DProbes uses. The binary data is then formatted in userspace. All events logged using this API can easily be extracted using LibLTT's API. There is no requirement to use LTT's own visualization tool, although you can still use it to see your own custom events. As with the other traced events, each custom event gets timestamped (do_gettimeofday) and placed in order of occurrence in the trace buffer. I've attached an example module that uses the custom event API. See the Examples directory of the LTT package for an example custom trace reader which uses LibLTT. One clear advantage of this API is that you can avoid using any "#ifdef TRACE" or "#ifdef DEBUG". If the tracing module isn't loaded or if the trace daemon isn't running, then nothing gets logged. Cheers, Karim =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== [-- Attachment #2: custom1.c --] [-- Type: image/x-xbitmap, Size: 2287 bytes --] ^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: Enhanced profiling support (was Re: vm lock contention reduction) 2002-07-11 4:47 ` Karim Yaghmour @ 2002-07-11 4:59 ` Karim Yaghmour 0 siblings, 0 replies; 95+ messages in thread From: Karim Yaghmour @ 2002-07-11 4:59 UTC (permalink / raw) To: Andrea Arcangeli, Richard J Moore, John Levon, Andrew Morton, bob, linux-kernel, linux-mm, mjbligh, John Levon, Rik van Riel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 252 bytes --] Sorry, the attachment never made it ... =================================================== Karim Yaghmour karim@opersys.com Embedded and Real-Time Linux Expert =================================================== [-- Attachment #2: example.c --] [-- Type: text/plain, Size: 2325 bytes --] /* Example usage of custom event API */ #define MODULE #if 0 #define CONFIG_TRACE #endif #include <linux/config.h> #include <linux/module.h> #include <linux/trace.h> #include <asm/string.h> struct delta_event { int an_int; char a_char; }; static int alpha_id, omega_id, theta_id, delta_id, rho_id; int init_module(void) { uint8_t a_byte; char a_char; int an_int; int a_hex; char* a_string = "We are initializing the module"; struct delta_event a_delta_event; /* Create events */ alpha_id = trace_create_event("Alpha", "Number %d, String %s, Hex %08X", CUSTOM_EVENT_FORMAT_TYPE_STR, NULL); omega_id = trace_create_event("Omega", "Number %d, Char %c", CUSTOM_EVENT_FORMAT_TYPE_STR, NULL); theta_id = trace_create_event("Theta", "Plain string", CUSTOM_EVENT_FORMAT_TYPE_STR, NULL); delta_id = trace_create_event("Delta", NULL, CUSTOM_EVENT_FORMAT_TYPE_HEX, NULL); rho_id = trace_create_event("Rho", NULL, CUSTOM_EVENT_FORMAT_TYPE_HEX, NULL); /* Trace events */ an_int = 1; a_hex = 0xFFFFAAAA; trace_std_formatted_event(alpha_id, an_int, a_string, a_hex); an_int = 25; a_char = 'c'; trace_std_formatted_event(omega_id, an_int, a_char); trace_std_formatted_event(theta_id); memset(&a_delta_event, 0, sizeof(a_delta_event)); trace_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event); a_byte = 0x12; trace_raw_event(rho_id, sizeof(a_byte), &a_byte); return 0; } void cleanup_module(void) { uint8_t a_byte; char a_char; int an_int; int a_hex; char* a_string = "We are initializing the module"; struct delta_event a_delta_event; /* Trace events */ an_int = 324; a_hex = 0xABCDEF10; trace_std_formatted_event(alpha_id, an_int, a_string, a_hex); an_int = 789; a_char = 's'; trace_std_formatted_event(omega_id, an_int, a_char); trace_std_formatted_event(theta_id); memset(&a_delta_event, 0xFF, sizeof(a_delta_event)); trace_raw_event(delta_id, sizeof(a_delta_event), &a_delta_event); a_byte = 0xA4; trace_raw_event(rho_id, sizeof(a_byte), &a_byte); /* Destroy the events created */ trace_destroy_event(alpha_id); trace_destroy_event(omega_id); trace_destroy_event(theta_id); trace_destroy_event(delta_id); trace_destroy_event(rho_id); } ^ permalink raw reply [flat|nested] 95+ messages in thread
end of thread, other threads:[~2002-07-13 11:18 UTC | newest] Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-07-04 23:05 vm lock contention reduction Andrew Morton 2002-07-04 23:26 ` Rik van Riel 2002-07-04 23:27 ` Rik van Riel 2002-07-05 1:37 ` Andrew Morton 2002-07-05 1:49 ` Rik van Riel 2002-07-05 2:18 ` Andrew Morton 2002-07-05 2:16 ` Rik van Riel 2002-07-05 2:53 ` Andrew Morton 2002-07-05 3:52 ` Benjamin LaHaise 2002-07-05 4:47 ` Linus Torvalds 2002-07-05 5:38 ` Andrew Morton 2002-07-05 5:51 ` Linus Torvalds 2002-07-05 6:08 ` Linus Torvalds 2002-07-05 6:27 ` Alexander Viro 2002-07-05 6:33 ` Andrew Morton 2002-07-05 7:33 ` Andrea Arcangeli 2002-07-07 2:50 ` Andrew Morton 2002-07-07 3:05 ` Linus Torvalds 2002-07-07 3:47 ` Andrew Morton 2002-07-08 11:39 ` Enhanced profiling support (was Re: vm lock contention reduction) John Levon 2002-07-08 11:39 ` John Levon 2002-07-08 17:52 ` Linus Torvalds 2002-07-08 17:52 ` Linus Torvalds 2002-07-08 18:41 ` Karim Yaghmour 2002-07-08 18:41 ` Karim Yaghmour 2002-07-10 2:22 ` John Levon 2002-07-10 2:22 ` John Levon 2002-07-10 4:16 ` Karim Yaghmour 2002-07-10 4:16 ` Karim Yaghmour 2002-07-10 4:38 ` John Levon 2002-07-10 4:38 ` John Levon 2002-07-10 5:46 ` Karim Yaghmour 2002-07-10 5:46 ` Karim Yaghmour 2002-07-10 13:10 ` bob 2002-07-10 13:10 ` bob 2002-07-09 16:57 ` John Levon 2002-07-09 19:56 ` Karim Yaghmour 2002-07-07 5:16 ` vm lock contention reduction Martin J. Bligh 2002-07-07 6:13 ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh 2002-07-07 6:37 ` Andrew Morton 2002-07-07 7:53 ` Linus Torvalds 2002-07-07 9:04 ` Andrew Morton 2002-07-07 16:13 ` Martin J. Bligh 2002-07-07 18:31 ` Linus Torvalds 2002-07-07 18:55 ` Linus Torvalds 2002-07-07 19:02 ` Linus Torvalds 2002-07-08 7:24 ` Andrew Morton 2002-07-08 8:09 ` Andrea Arcangeli 2002-07-08 14:50 ` William Lee Irwin III 2002-07-08 20:39 ` Andrew Morton 2002-07-08 21:08 ` Benjamin LaHaise 2002-07-08 21:45 ` Andrew Morton 2002-07-08 22:24 ` Benjamin LaHaise 2002-07-07 16:00 ` Martin J. Bligh 2002-07-07 18:28 ` Linus Torvalds 2002-07-08 7:11 ` Andrea Arcangeli 2002-07-08 10:15 ` Eric W. Biederman 2002-07-08 7:00 ` Andrea Arcangeli 2002-07-08 17:29 ` Martin J. Bligh 2002-07-08 22:14 ` Linus Torvalds 2002-07-09 0:16 ` Andrew Morton 2002-07-09 3:17 ` Andrew Morton 2002-07-09 4:28 ` Martin J. Bligh 2002-07-09 5:28 ` Andrew Morton 2002-07-09 6:15 ` Martin J. Bligh 2002-07-09 6:30 ` William Lee Irwin III 2002-07-09 6:32 ` William Lee Irwin III 2002-07-09 16:08 ` Martin J. Bligh 2002-07-09 17:32 ` Andrea Arcangeli 2002-07-10 5:32 ` Andrew Morton 2002-07-10 22:43 ` Martin J. Bligh 2002-07-10 23:08 ` Andrew Morton 2002-07-10 23:26 ` Martin J. Bligh 2002-07-11 0:19 ` Andrew Morton 2002-07-12 17:48 ` Martin J. Bligh 2002-07-13 11:18 ` Andrea Arcangeli 2002-07-09 13:59 ` Benjamin LaHaise 2002-07-08 0:38 ` vm lock contention reduction William Lee Irwin III 2002-07-05 6:46 ` Andrew Morton 2002-07-05 14:25 ` Rik van Riel 2002-07-05 23:11 ` William Lee Irwin III 2002-07-05 23:48 ` Andrew Morton 2002-07-06 0:11 ` Rik van Riel 2002-07-06 0:31 ` Linus Torvalds 2002-07-06 0:45 ` Rik van Riel 2002-07-06 0:48 ` Andrew Morton 2002-07-08 0:59 ` William Lee Irwin III 2002-07-10 14:28 Enhanced profiling support (was Re: vm lock contention reduction) Richard J Moore 2002-07-10 14:28 ` Richard J Moore 2002-07-10 20:30 ` Karim Yaghmour 2002-07-10 20:30 ` Karim Yaghmour 2002-07-10 21:41 ` Andrea Arcangeli 2002-07-10 21:41 ` Andrea Arcangeli 2002-07-11 4:47 ` Karim Yaghmour 2002-07-11 4:59 ` Karim Yaghmour
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.