* vm lock contention reduction
From: Andrew Morton @ 2002-07-04 23:05 UTC
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

I seem to now have a set of patches which fix the pagemap_lru_lock
contention for some workloads.

They also move the entire page allocation/reclaim/pagecache I/O
functions away from page-at-a-time and make them use chunks of 16 pages
at a time.  The intent of this is to get the effect of large PAGE_CACHE_SIZE
without actually doing that.

Overall lock contention is reduced by 85-90%, and pagemap_lru_lock
contention is reduced by maybe 98%, for workloads where the inactive
list is dominated by pagecache.

If the machine is instead full of anon pages then everything is still crap
because the page reclaim code is scanning zillions of pages and not doing
much useful with them.

In some ways the VM locking is more complex, because we need to cope
with pages which aren't on the LRU.  In some ways the locking is simpler
because pagemap_lru_lock becomes an "innermost" lock.

Relevant patches are:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/page-flags-atomicity.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/pagevec.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/shrink_cache-pagevec.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/anon-pagevec.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/mpage_writepages-batch.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/batched-lru-add.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/batched-lru-del.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/lru-lock-irq-off.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.24/lru-mopup.patch

My vague plan was to wiggle rmap on top of this work, for two reasons:

1: So it is easy to maintain an rmap backout patch, to aid in comparison
   and debugging, and

2: to give a reasonable basis for evaluation of rmap CPU efficiency.

But frankly, I've written and rewritten this code three times so far
and I'm still not really happy with it.  Probably it is more sensible
to get the reverse mapping code into the tree first, and I get to
reimplement the CPU efficiency work a fourth time :(

So I'll flush the rest of my current patchpile at Linus and go take a
look at O_DIRECT for a while.

I'll shelve this lock contention work until we have an rmap patch
for 2.5.   Rik, do you have an estimate on that?


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-04 23:26 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, 4 Jul 2002, Andrew Morton wrote:

> I'll shelve this lock contention work until we have an rmap patch
> for 2.5.   Rik, do you have an estimate on that?

I've recovered from the flight back and the flu that
hit me just after OLS, so it should be pretty soon.

I've just applied Craig Kulesa's patch on top of the
latest 2.5 bk tree and will put a few small fixes on
top of that (e.g. the new ARM pagetable layout).

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-04 23:27 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, 4 Jul 2002, Andrew Morton wrote:

> If the machine is instead full of anon pages then everything is still
> crap because the page reclaim code is scanning zillions of pages and not
> doing much useful with them.

This is something that can be fixed with rmap, because the
kernel _will_ be able to do something useful with the anon
pages.

Now we just need to get Arjan to tune the O(1) page launder
thing he was looking at ;)

Rik


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  1:37 UTC
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

Rik van Riel wrote:
> 
> On Thu, 4 Jul 2002, Andrew Morton wrote:
> 
> > If the machine is instead full of anon pages then everything is still
> > crap because the page reclaim code is scanning zillions of pages and not
> > doing much useful with them.
> 
> This is something that can be fixed with rmap, because the
> kernel _will_ be able to do something useful with the anon
> pages.

I think that would be quite useful - we just need to be sure that if the
pages aren't added to swapcache we should park them up on the active list
out of the way.

The refill_inactive() logic in 2.5.24 seems a bit wonky at present.
In here:

        ratio = (unsigned long)nr_pages * ps.nr_active /
                                ((ps.nr_inactive | 1) * 2);
        refill_inactive(ratio);

`ratio' tends to evaluate to zero if you have a ton of inactive
or pagecache pages.  So the VM is not recirculating old pages
off the inactive list at all.

I changed it to add 1 to `ratio', and to call refill_inactive only
when the number-to-process reaches 32 pages - the kernel tends
to settle into steady states where it's calling refill_inactive
for just one or two pages, which is a waste of CPU.
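
Roughly, the change looks like this (sketch only - the nr_deferred
accumulator is illustrative, not the actual diff):

	/*
	 * Same calculation as above, but round the ratio up by one
	 * and batch the work so refill_inactive() isn't invoked for
	 * just one or two pages at a time.
	 */
	static int nr_deferred;		/* hypothetical accumulator */

	ratio = (unsigned long)nr_pages * ps.nr_active /
				((ps.nr_inactive | 1) * 2) + 1;
	nr_deferred += ratio;
	if (nr_deferred >= 32) {
		refill_inactive(nr_deferred);
		nr_deferred = 0;
	}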

Didn't make much observable difference.

> Now we just need to get Arjan to tune the O(1) page launder
> thing he was looking at ;)

We keep seeing mysterious references to this. What is the idea
behind it?


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-05  1:49 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, 4 Jul 2002, Andrew Morton wrote:
> Rik van Riel wrote:

> > This is something that can be fixed with rmap, because the
> > kernel _will_ be able to do something useful with the anon
> > pages.
>
> I think that would be quite useful - we just need to be sure that if the
> pages aren't added to swapcache we should park them up on the active
> list out of the way.

Absolutely, that is critical.  It is all too easy to end up with
a TON of non-swappable pages on the inactive list and all the
easily evictable pages on the active list, if only because the
easily evictable pages tend to disappear quickly and the
non-swappable ones stick around forever.

> > Now we just need to get Arjan to tune the O(1) page launder
> > thing he was looking at ;)
>
> We keep seeing mysterious references to this. What is the idea
> behind it?

The idea is that when pages are evicted from the system they
traverse the inactive list _once_.

If a page is dirty, IO is started and the page is added to the
laundry list; if a page is clean, it is moved to the clean list.

Every time we need more free pages we first check the clean list
(all pages there are freeable, guaranteed) and the first (few?)
pages of the laundry list. We continue taking pages off the
laundry list until we've run into {a, too many} unfreeable pages.

This way we won't scan the inactive pages over and over again
every time we free a few.
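
Something like this, in sketch form (the list names and the
page_dirty()/start_pageout_io() helpers are made up to illustrate
the idea, not real code):

	static LIST_HEAD(laundry_list);
	static LIST_HEAD(clean_list);

	/* Each page traverses the inactive list once, ever. */
	static void launder_pass(struct list_head *inactive)
	{
		while (!list_empty(inactive)) {
			struct page *page = list_entry(inactive->prev,
							struct page, lru);
			list_del(&page->lru);
			if (page_dirty(page)) {
				start_pageout_io(page);	/* async */
				list_add(&page->lru, &laundry_list);
			} else {
				list_add(&page->lru, &clean_list);
			}
		}
	}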

regards,

Rik


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-05  2:16 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, 4 Jul 2002, Andrew Morton wrote:

> Of course, that change means that we wouldn't be able to throttle
> page allocators against IO any more, and we'd have to do something
> smarter.  What a shame ;)

We want something smarter anyway.  It just doesn't make
sense to throttle on one page in one memory zone while
the pages in another zone could have already become
freeable by now.

regards,

Rik


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  2:18 UTC
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

Rik van Riel wrote:
> 
> ...
> > > Now we just need to get Arjan to tune the O(1) page launder
> > > thing he was looking at ;)
> >
> > We keep seeing mysterious references to this. What is the idea
> > behind it?
> 
> The idea is that when pages are evicted from the system they
> traverse the inactive list _once_.
> 
> If a page is dirty, IO is started and the page is added to the
> laundry list; if a page is clean, it is moved to the clean list.
> 
> Every time we need more free pages we first check the clean list
> (all pages there are freeable, guaranteed) and the first (few?)
> pages of the laundry list. We continue taking pages off the
> laundry list until we've run into {a, too many} unfreeable pages.
> 
> This way we won't scan the inactive pages over and over again
> every time we free a few.
> 

OK.

One of the changes I made to pagemap_lru_lock was to always
take it with spin_lock_irq().  It's never held for more than
10-20 microseconds, and disabling interrupts while holding it
decreased contention by around 30% with four (slow) CPUs.

So that's a good change regardless, and it lets us move pages
off the laundry list within the IO completion interrupt (but
not one-at-a-time!).

In fact, they don't need to be on any list at all while under
writeback.
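
As a sketch (the "done" list and the completion hook are hypothetical,
and the list names are approximate):

	/*
	 * Move a whole batch of just-written pages back onto the LRU
	 * from the IO completion path - legal now that the lock is
	 * always taken with interrupts disabled, and cheap because
	 * it isn't done one page at a time.
	 */
	static void writeback_batch_complete(struct list_head *done)
	{
		unsigned long flags;

		spin_lock_irqsave(&pagemap_lru_lock, flags);
		list_splice(done, &inactive_list);
		INIT_LIST_HEAD(done);
		spin_unlock_irqrestore(&pagemap_lru_lock, flags);
	}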

Of course, that change means that we wouldn't be able to throttle
page allocators against IO any more, and we'd have to do something
smarter.  What a shame ;)


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  2:53 UTC
  To: Rik van Riel; +Cc: Andrea Arcangeli, linux-mm, Linus Torvalds

Rik van Riel wrote:
> 
> On Thu, 4 Jul 2002, Andrew Morton wrote:
> 
> > Of course, that change means that we wouldn't be able to throttle
> > page allocators against IO any more, and we'd have to do something
> > smarter.  What a shame ;)
> 
> We want something smarter anyway.  It just doesn't make
> sense to throttle on one page in one memory zone while
> the pages in another zone could have already become
> freeable by now.
> 

Or pages from the same zone, indeed.

I think what we're saying here is: during writeback, park the
pages somewhere out of the way.  Page allocators go to sleep
on some waitqueue somewhere.  Disk completion interrupts put
pages back onto the LRU and wake up waiting page allocators.
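
In sketch form (every name here is invented; try_reclaim_one() would
take a page off the clean list and decrement the counter):

	static DECLARE_WAIT_QUEUE_HEAD(reclaim_waitq);
	static atomic_t nr_reclaimable = ATOMIC_INIT(0);

	/* Allocator side: sleep until some pages have come clean. */
	static struct page *alloc_page_throttled(void)
	{
		struct page *page;

		while ((page = try_reclaim_one()) == NULL)
			wait_event(reclaim_waitq,
				   atomic_read(&nr_reclaimable) > 0);
		return page;
	}

	/* Completion side: pages are back on the LRU, wake everyone. */
	static void pages_came_clean(int nr)
	{
		atomic_add(nr, &nr_reclaimable);
		wake_up(&reclaim_waitq);
	}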

That's all fairly straightforward, but one remaining problem
is: who does the writeback?  We can get large amounts of 
page allocation latency due to get_request_wait(), not just
wait_on_page().

Generally, the pdflush pool should be sufficient for this but
with many spindles it is possible that the kernel will run out
of pdflush resources.

So we may still need to make page allocating processes start I/O
if wakeup_bdflush() fails to find a thread.   If so, then the
implementation should prefer to make processes which are dirtying
memory start the IO.

Or provide a non-blocking try_to_submit_bio() for pdflush.


* Re: vm lock contention reduction
From: Benjamin LaHaise @ 2002-07-05  3:52 UTC
  To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, Jul 04, 2002 at 07:53:59PM -0700, Andrew Morton wrote:
> Or provide a non-blocking try_to_submit_bio() for pdflush.

Ooooh, I need that too. =-)

		-ben
-- 
"You will be reincarnated as a toad; and you will be much happier."

* Re: vm lock contention reduction
From: Linus Torvalds @ 2002-07-05  4:47 UTC
  To: Rik van Riel; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm


On Thu, 4 Jul 2002, Rik van Riel wrote:
>
> We want something smarter anyway.  It just doesn't make
> sense to throttle on one page in one memory zone while
> the pages in another zone could have already become
> freeable by now.

That's a load of bull, and backed up by zero arguments. It
only sounds correct, while being absolute crap.

We need to throttle. Full stop. Anybody who says that "it makes no sense
to throttle on X because of Y" had better back up that statement with
_facts_, not some handwaving.

In the end, somebody needs to wait. There is no free lunch. And the fact
is, the _reason_ for the waiting doesn't much matter; it only needs to be
related to the act of freeing pages by some likely metric. And it should
wait for _more_ than the one page we absolutely need.

OF COURSE you can always avoid waiting for reason Y. That only
means that you end up having to wait on reason Z at some later date.

The particular example Rik brought up is particularly stupid, since it's
obviously crap and not true. The fact is, that when you allocate memory
and you're low on memory, you _have_ to wait "more" than necessary. You
want to try to get away from the bad situation, so when you start waiting,
you should wait a bit "extra", so that you might not need to wait
immediately for the very next allocation.

And Rik's argument (and I saw people go "nod nod" like marionettes on a
string without even thinking about it) is exactly the wrong way around and
basically says that you should wait the _minimal_ amount possible. Yet
that is NOT correct, because that will absolutely guarantee that
_everybody_ ends up always waiting.

I saw this argument at the kernel summit too, thinking that waiting the
minimal amount of time is somehow "better". It's not. It's much better to
try to make processes wait slightly _longer_ times, and get into a nice
balanced setup where you do intensive work for a while, then wait for a
while, then do work again, then wait.. Instead of waiting a bit all the
time.

Think batching. It's _more_ efficient to batch stuff than it is to try to
switch back and forth between working and waiting as quickly as you can.

So don't just nod your heads when you see something that sounds sane.
Think critically. And the critical thinking says:

 - you should wait the _maximum_ amount that
   (a) is fair
   (b) doesn't introduce bad latency issues
   (c) still allows overlap of IO and processing

Get away from this "minimum wait" thing, because it is WRONG.

Try to shoot me down, but do so with real logic and real arguments, not
some fuzzy feeling about "we shouldn't wait unless we have to". We _do_
have to wait.

		Linus


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  5:38 UTC
  To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm

Linus Torvalds wrote:
> 
> ...
> Think batching. It's _more_ efficient to batch stuff than it is to try to
> switch back and forth between working and waiting as quickly as you can.

Yup.

I've been moaning about this for months.  Trivial example:  run
`vmstat 1' and then start pounding the disk.  vmstat will exhibit
very long pauses when *clearly* thousands of pages are coming
clean every second.  Unreasonably long pauses.   Sometimes in
get_request_wait(), sometimes in shrink_cache->wait_on_page/buffer.

We should be giving some of those pages to vmstat more promptly.
After all, that process is not a heavy allocator of pages.
 
> So don't just nod your heads when you see something that sounds sane.
> Think critically. And the critical thinking says:
> 
>  - you should wait the _maximum_ amount that
>    (a) is fair
>    (b) doesn't introduce bad latency issues
>    (c) still allows overlap of IO and processing
> 
> Get away from this "minimum wait" thing, because it is WRONG.

Well yes, we do want to batch work up.  And a crude way of doing that
is "each time 64 pages have come clean, wake up one waiter".  Or 
"as soon as the number of reclaimable pages exceeds zone->pages_min".
Some logic would also be needed to prevent new page allocators from
jumping the queue, of course.
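
The first variant is about as simple as it sounds; in sketch form
(names invented, and assuming allocators sleep as exclusive waiters
on some reclaim waitqueue):

	#define RECLAIM_BATCH	64
	static int pages_cleaned;	/* under the completion-path lock */

	/* Called for each page that comes clean at IO completion. */
	static void page_came_clean(void)
	{
		if (++pages_cleaned >= RECLAIM_BATCH) {
			pages_cleaned = 0;
			wake_up_nr(&reclaim_waitq, 1);	/* one waiter */
		}
	}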

We're still throttling on I/O, but we're throttling against
*any* I/O, and not a single randomly-chosen disk block.

This scheme is more fair - processes which are allocating more pages
get to wait more.

> Try to shoot me down, but do so with real logic and real arguments, not
> some fuzzy feeling about "we shouldn't wait unless we have to". We _do_
> have to wait.

Sure, page allocators must throttle their allocation rate to that at
which the IO system can retire writes.  But by waiting on a randomly-chosen
disk block, we're at the mercy of the elevator.  If you happen to
choose a page whose blocks are at the far side of the disk, you lose.
There could easily be 100 megabytes of reclaimable memory by the time
you start running again.

We can fit 256 seeks into the request queue.  That's 1-2 seconds.

I started developing a dumb prototype of this a while back, but
it locks up.   I'll dust it off and get it going as a "technology
demonstration".


* Re: vm lock contention reduction
From: Linus Torvalds @ 2002-07-05  5:51 UTC
  To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm


On Thu, 4 Jul 2002, Andrew Morton wrote:
> >
> > Get away from this "minimum wait" thing, because it is WRONG.
>
> Well yes, we do want to batch work up.  And a crude way of doing that
> is "each time 64 pages have come clean, wake up one waiter".

That would probably work, but we also need to be careful that we don't get
into some strange situations (i.e. we have 50 waiters that all needed memory
at the same time, and fewer than 50*64 pages that caused us to be in
trouble, so we only wake up the first 46 waiters and the last 4 waiters
get stuck until the next batch even though we now have _lots_ of pages
free).

Don't laugh - things like this have actually happened at times with some
of our balancing work with HIGHMEM/NORMAL. Basically, the logic would go:
"everybody who ends up having to wait for an allocation should free at
least N pages", but then you would end up with 50*N pages total that the
system thought it "needed" to free up, and that could be a big number that
would cause the VM to want to free up stuff long after it was really done.

> Or
> "as soon as the number of reclaimable pages exceeds zone->pages_min".
> Some logic would also be needed to prevent new page allocators from
> jumping the queue, of course.

Yeah, the unfairness is the thing that really can be nasty.

On the other hand, some unfairness is ok too - and can improve throughput.
So jumping the queue is fine, you just must not be able to _consistently_
jump the queue.

(In fact, jumping the queue is inevitable to some degree - not allowing
any queue jumping at all would imply that any time _anybody_ starts
waiting, every single allocation afterwards will have to wait until the
waiter got woken up. Which we have actually tried before at times, but
which causes really really bad behaviour and horribly bad "pausing")

You probably want the occasional allocator able to jump the queue, but the
"big spenders" to be caught eventually. "Fairness" really doesn't mean
that "everybody should wait equally much", it really means "people should
wait roughly relative to how much memory they 'spend'".

If a process is frugal with its memory footprint, and doesn't dirty a lot
of pages/buffers, such a process should obviously wait less than one that
allocates/dirties a lot.

Right now, we get roughly this behaviour simply by way of statistical
behaviour for the page allocator ("if somebody allocates 5 times as many
pages, he's 5 times as likely to have to clean something up too"), but
trying to be smarter about this could easily break this relative fairness.

In particular, statefulness like "cannot jump the queue" breaks it.

> > some fuzzy feeling about "we shouldn't wait unless we have to". We _do_
> > have to wait.
>
> Sure, page allocators must throttle their allocation rate to that at
> which the IO system can retire writes.  But by waiting on a randomly-chosen
> disk block, we're at the mercy of the elevator.

Yes. On the other hand, "random" is actually usually a fairly good choice
in itself. And it's a lot easier than many of the alternatives - especially
when it comes to fairness.

			Linus


* Re: vm lock contention reduction
From: Linus Torvalds @ 2002-07-05  6:08 UTC
  To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm


On Thu, 4 Jul 2002, Linus Torvalds wrote:
>
> Right now, we get roughly this behaviour simply by way of statistical
> behaviour for the page allocator ("if somebody allocates 5 times as many
> pages, he's 5 times as likely to have to clean something up too"), but
> trying to be smarter about this could easily break this relative fairness.

Side note: getting some higher-level locking wrong can _seriously_ break
this statistical behaviour.

In particular, the ext2 superblock lock at least used to be horribly
broken and held in a lot of "bad" places: I doubt Al has gotten far enough
to fix that brokenness. The superblock lock used to cause one process that
blocked for something (usually reading in some bitmap or other) to cause a
lot of _other_ processes to block quite unnecessarily on the badly placed
lock, even though they really would have had all the resources they
needed.

That particular thing is really not a VM problem, but a ext2 issue. The
superblock lock just isn't very well placed. I personally suspect that it
should be replaced by a spinlock - just to force all blocking operations
to be moved outside the lock (so that it would only protect the actual
data structures - rather than be held around reading bitmap blocks into
memory etc).

But that's a rather painful kind of locking change to do and to test.

		Linus


* Re: vm lock contention reduction
From: Alexander Viro @ 2002-07-05  6:27 UTC
  To: Linus Torvalds; +Cc: Andrew Morton, Rik van Riel, Andrea Arcangeli, linux-mm


On Thu, 4 Jul 2002, Linus Torvalds wrote:

> In particular, the ext2 superblock lock at least used to be horribly
> broken and held in a lot of "bad" places: I doubt Al has gotten far enough
> to fix that brokenness. The superblock lock used to cause one process that

As a matter of fact, I did.  If you want lock_super() to be killed in ext2
(2.5) - just say so and I'll do the rest.

Right now both ext2_new_block() and ext2_new_inode() look through the
group descriptors for a good one and reserve a (block|inode) in it.  That
can be easily done under a spinlock.  After that we read a bitmap
(no need for any locks) and grab a bit in it (we are guaranteed to
have one).  The latter can either be done under a spinlock or by being
clever and noticing that the number of contenders is always less than
or equal to the number of free bits (with minimal use of set_bit()/etc.
atomicity we can do that without spinlocks).  After that we don't need
any locks whatsoever.
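
A rough sketch of the two phases (the group structure and field names
are invented here; the real code would use the ext2 group descriptors):

	/* Phase 1: reserve a unit in a group, under a spinlock. */
	static int reserve_in_group(struct ext2_group_alloc *g)
	{
		int ok = 0;

		spin_lock(&g->lock);
		if (g->free_count > 0) {
			g->free_count--;	/* guarantees us a bit */
			ok = 1;
		}
		spin_unlock(&g->lock);
		return ok;
	}

	/* Phase 2: claim an actual bit with atomic bitops only.
	 * Safe without a lock because contenders <= free bits, so
	 * losing a race just means retrying on another zero bit. */
	static int claim_bit(unsigned long *bitmap, int nbits)
	{
		int bit;

		do {
			bit = find_first_zero_bit(bitmap, nbits);
		} while (test_and_set_bit(bit, bitmap));
		return bit;
	}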

Andrew had just killed the last bit of crap there - LRU used to be protected
by lock_super() and since it's no more...

Rewrite of balloc.c and ialloc.c was done with killing lock_super() in mind -
I didn't want to do that in 2.4 for obvious reasons, but for 2.5 it's very
easy...


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  6:33 UTC
  To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm

Linus Torvalds wrote:
> 
> On Thu, 4 Jul 2002, Linus Torvalds wrote:
> >
> > Right now, we get roughly this behaviour simply by way of statistical
> > behaviour for the page allocator ("if somebody allocates 5 times as many
> > pages, he's 5 times as likely to have to clean something up too"), but
> > trying to be smarter about this could easily break this relative fairness.
> 
> Side note: getting some higher-level locking wrong can _seriously_ break
> this statistical behaviour.
> 
> In particular, the ext2 superblock lock at least used to be horribly
> broken and held in a lot of "bad" places: I doubt Al has gotten far enough
> to fix that brokenness. The superblock lock used to cause one process that
> blocked for something (usually reading in some bitmap or other) to cause a
> lot of _other_ processes to block quite unnecessarily on the badly placed
> lock, even though they really would have had all the resources they
> needed.

ext2 is still performing synchronous bitmap reads inside lock_super()
and yes, that shuts down the filesystem.  But once the bitmaps are
in cache, it's not a huge problem.

Unless you have an Anton-class box.   The context switch on the lock_super
in the ext2 block allocator is now one of his major throughput bottlenecks.

Although I must say, we're talking about filesystem loads here which
are purely RAM-based - fifteen-second tests which never hit disk.
This is kinda silly, because it's not a very interesting operating
region.  But it's fun - I think we've doubled 2.4 throughput now.

> That particular thing is really not a VM problem, but a ext2 issue. The
> superblock lock just isn't very well placed. I personally suspect that it
> should be replaced by a spinlock - just to force all blocking operations
> to be moved outside the lock (so that it would only protect the actual
> data structures - rather than be held around reading bitmap blocks into
> memory etc).

It can become a per-blockgroup spinlock.  That will scale splendidly.
Removing the private bitmap LRUs gets us partway toward that.   But
the bitmap buffers tend to get shoved up onto the active list real
quick when you start pushing things.

> But that's a rather painful kind of locking change to do and to test.

Well.  First locks first.  kmap_lock is a bad one on x86.


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05  6:46 UTC
  To: Linus Torvalds; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm

Linus Torvalds wrote:
> 
> On Thu, 4 Jul 2002, Andrew Morton wrote:
> > >
> > > Get away from this "minimum wait" thing, because it is WRONG.
> >
> > Well yes, we do want to batch work up.  And a crude way of doing that
> > is "each time 64 pages have come clean, wake up one waiter".
> 
> That would probably work, but we also need to be careful that we don't get
> into some strange situations (i.e. we have 50 waiters that all needed memory
> at the same time, and fewer than 50*64 pages that caused us to be in
> trouble, so we only wake up the first 46 waiters and the last 4 waiters
> get stuck until the next batch even though we now have _lots_ of pages
> free).
> 
> Don't laugh - things like this have actually happened at times with some
> of our balancing work with HIGHMEM/NORMAL. Basically, the logic would go:
> "everybody who ends up having to wait for an allocation should free at
> least N pages", but then you would end up with 50*N pages total that the
> system thought it "needed" to free up, and that could be a big number that
> would cause the VM to want to free up stuff long after it was really done.

Well we'd certainly need to make direct caller-reclaims the normal
mode of operation.  Avoid context switches in the page allocator.

However it occurs to me that we could easily get into the situation
where a page allocator finds a PageDirty or PageWriteback page on
the tail of the LRU and waits on it, but there are plenty of
reclaimable pages further along in the LRU.

In this situation it would better to just tag the page as "wait on
it next time around" and then just skip it.  This is basically what
the PageLaunder/BH_Launder logic was doing.  I think.  For a long
time it wasn't very effective because it wasn't able to wait on
page/buffers which were written by someone else.  Andrea finally
sorted that by setting BH_Launder in submit_bh.

All those deadlock problems have gone away now and we can
(re)implement this much more simply.

But what I don't like about it is that it's dumb.  The kernel ends
up doing these enormous list scans and not achieving very much, 
whereas we could achieve the same effect by doing a bit of page motion
at interrupt time.  It's a polled-versus-interrupt thing.

And right now, that dumbness is multiplied by the CPU count because
it happens under pagemap_lru_lock.  But with the bustup of that, at
least we can be scalably dumb ;)


> > Or
> > "as soon as the number of reclaimable pages exceeds zone->pages_min".
> > Some logic would also be needed to prevent new page allocators from
> > jumping the queue, of course.
> 
> Yeah, the unfairness is the thing that really can be nasty.
> 
> On the other hand, some unfairness is ok too - and can improve throughput.
> So jumping the queue is fine, you just must not be able to _consistently_
> jump the queue.
> 
> (In fact, jumping the queue is inevitable to some degree - not allowing
> any queue jumping at all would imply that any time _anybody_ starts
> waiting, every single allocation afterwards will have to wait until the
> waiter got woken up. Which we have actually tried before at times, but
> which causes really really bad behaviour and horribly bad "pausing")
> 
> You probably want the occasional allocator able to jump the queue, but the
> "big spenders" to be caught eventually. "Fairness" really doesn't mean
> that "everybody should wait equally much", it really means "people should
> wait roughly relative to how much memory they 'spend'".

Right.  And that implies heuristics to divine which tasks are
heavy page allocators.  uh-oh.  But as a first-order approximation:
if a task is currently allocating pages from within generic_file_write(),
then whack it hard.


Here's PageLaunder-for-2.5.  (Not tested enough - don't apply ;))
It seems to help vmstat somewhat, but it still gets stuck in
shrink_cache->get_request_wait() a lot.

 include/linux/page-flags.h |    7 +++++++
 mm/filemap.c               |    1 +
 mm/vmscan.c                |    2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

--- 2.5.24/mm/vmscan.c~second-chance-throttle	Thu Jul  4 23:17:14 2002
+++ 2.5.24-akpm/mm/vmscan.c	Thu Jul  4 23:18:40 2002
@@ -443,7 +443,7 @@ shrink_cache(int nr_pages, zone_t *class
 		 * IO in progress? Leave it at the back of the list.
 		 */
 		if (unlikely(PageWriteback(page))) {
-			if (may_enter_fs) {
+			if (may_enter_fs && TestSetPageThrottle(page)) {
 				page_cache_get(page);
 				spin_unlock(&pagemap_lru_lock);
 				wait_on_page_writeback(page);
--- 2.5.24/include/linux/page-flags.h~second-chance-throttle	Thu Jul  4 23:18:35 2002
+++ 2.5.24-akpm/include/linux/page-flags.h	Thu Jul  4 23:20:02 2002
@@ -65,6 +65,7 @@
 #define PG_private		12	/* Has something at ->private */
 #define PG_writeback		13	/* Page is under writeback */
 #define PG_nosave		15	/* Used for system suspend/resume */
+#define PG_throttle		16	/* page allocator should throttle */
 
 /*
  * Global page accounting.  One instance per CPU.
@@ -216,6 +217,12 @@ extern void get_page_state(struct page_s
 #define ClearPageNosave(page)		clear_bit(PG_nosave, &(page)->flags)
 #define TestClearPageNosave(page)	test_and_clear_bit(PG_nosave, &(page)->flags)
 
+#define PageThrottle(page)	test_bit(PG_throttle, &(page)->flags)
+#define SetPageThrottle(page)	set_bit(PG_throttle, &(page)->flags)
+#define TestSetPageThrottle(page) test_and_set_bit(PG_throttle, &(page)->flags)
+#define ClearPageThrottle(page)	clear_bit(PG_throttle, &(page)->flags)
+#define TestClearPageThrottle(page) test_and_clear_bit(PG_throttle, &(page)->flags)
+
 /*
  * The PageSwapCache predicate doesn't use a PG_flag at this time,
  * but it may again do so one day.
--- 2.5.24/mm/filemap.c~second-chance-throttle	Thu Jul  4 23:20:08 2002
+++ 2.5.24-akpm/mm/filemap.c	Thu Jul  4 23:20:23 2002
@@ -682,6 +682,7 @@ void end_page_writeback(struct page *pag
 {
 	wait_queue_head_t *waitqueue = page_waitqueue(page);
 	smp_mb__before_clear_bit();
+	ClearPageThrottle(page);
 	if (!TestClearPageWriteback(page))
 		BUG();
 	smp_mb__after_clear_bit(); 


* Re: vm lock contention reduction
From: Andrea Arcangeli @ 2002-07-05  7:33 UTC
  To: Andrew Morton; +Cc: Linus Torvalds, Rik van Riel, linux-mm, Martin J. Bligh

On Thu, Jul 04, 2002 at 11:33:45PM -0700, Andrew Morton wrote:
> Well.  First locks first.  kmap_lock is a bad one on x86.

Actually I thought about kmap_lock and the per-process kmaps a bit more
with Martin (cc'ed) during OLS and there is an easy process-scalable
solution to drop:

	the kmap_lock
	in turn the global pool
	in turn the global tlb flush

The only problem is that it's no longer both atomic *and* persistent;
it's only persistent. It's also atomic if mm_count == 1, but the
kernel cannot rely on that; it has to assume it's always a blocking
operation (you only find out whether it blocks at runtime).

In short, the same design as the per-process kmaps will work just fine if
we add a semaphore to the mm_struct. Then, before starting to use the kmap
entry, we must acquire the semaphore. This way all the global locking and
the global tlb flush go away completely for normal tasks, but the
contention on that per-mm semaphore still remains for threads doing
simultaneous pte manipulation or simultaneous pagecache I/O.
Furthermore this I/O will be serialized; threaded benchmarks like dbench
may perform poorly that way I suspect, or we should add a pool of
userspace pages so more than one thread is allowed to go ahead, but then
we may cacheline-bounce in the synchronization of the pool across
threads (similar to what we do now in the global pool).
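
In sketch form (the kmap_sem field and both helpers are hypothetical):

	/* Map a page into the per-process kmap area.  May block on
	 * the per-mm semaphore, so no global lock and no global tlb
	 * flush - but also no longer usable from atomic context. */
	void *kmap_user(struct mm_struct *mm, struct page *page)
	{
		down(&mm->kmap_sem);
		return map_into_process_kmap(mm, page);
	}

	void kunmap_user(struct mm_struct *mm)
	{
		up(&mm->kmap_sem);
	}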

Then there's the problem that the pagecache/FS API should be changed to
pass the vaddr through the stack, because page->virtual would go away:
the virtual address would be per-process, protected by the mm->kmap_sem,
so we couldn't store it in a global - all tasks can kmap the same page
at the same time, each at its own virtual vaddr. This as well will break
some common code.

Last but not least, I hope that in 2.6 production I won't be running
benchmarks and profiling on a 32-bit cpu anymore anyway.

So I'm not very motivated anymore in doing that change after the comment
from Linus about the issue with threads.

Andrea

* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-05 14:25 UTC
  To: Andrew Morton; +Cc: Linus Torvalds, Andrea Arcangeli, linux-mm

On Thu, 4 Jul 2002, Andrew Morton wrote:
> Linus Torvalds wrote:

> > You probably want the occasional allocator able to jump the queue, but the
> > "big spenders" to be caught eventually. "Fairness" really doesn't mean
> > that "everybody should wait equally much", it really means "people should
> > wait roughly relative to how much as they 'spend' memory".
>
> Right.  And that implies heuristics to divine which tasks are
> heavy page allocators.  uh-oh.

This isn't too hard. In order to achieve this you:

1) wait for one kswapd loop when you get below a high water mark
2) allocate one page when kswapd wakes everybody up again
   (at this point we're NOT necessarily above the high water
   mark again...)

This means that once the system is under a lot of pressure
heavy allocators will be throttled a lot more than light
allocators and the system gets a chance to free things.
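
A sketch of that two-step throttle (everything named here is invented):

	static DECLARE_WAIT_QUEUE_HEAD(kswapd_done);
	static unsigned long kswapd_loops;	/* bumped once per loop */

	static struct page *alloc_page_throttled(zone_t *zone)
	{
		if (zone_below_high_watermark(zone)) {
			unsigned long seen = kswapd_loops;

			wakeup_kswapd();
			/* 1) wait for one full kswapd loop */
			wait_event(kswapd_done, kswapd_loops != seen);
		}
		/* 2) allocate just one page - a heavy allocator comes
		 * straight back here (and waits again), a light one
		 * barely notices. */
		return __alloc_page(zone);
	}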

Of course, kswapd does everything (except get_request)
asynchronously so a kswapd loop should be relatively short.

regards,

Rik


* Re: vm lock contention reduction
From: William Lee Irwin III @ 2002-07-05 23:11 UTC
  To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds

On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote:
> Of course, that change means that we wouldn't be able to throttle
> page allocators against IO any more, and we'd have to do something
> smarter.  What a shame ;)

This is actually necessary IMHO. Some testing I've been able to do seems
to reveal the current throttling mechanism as inadequate.


Cheers,
Bill

* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-05 23:48 UTC
  To: William Lee Irwin III
  Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds

William Lee Irwin III wrote:
> 
> On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote:
> > Of course, that change means that we wouldn't be able to throttle
> > page allocators against IO any more, and we'd have to do something
> > smarter.  What a shame ;)
> 
> This is actually necessary IMHO. Some testing I've been able to do seems
> to reveal the current throttling mechanism as inadequate.
> 

I don't think so, if you're referring to the situation where your
4G machine had 3.5G of dirty pages without triggering writeback.

That's not a generic problem.  It's something specific to your
setup.  You're going to have to repeat it and stick some printk's
into balance_dirty_pages().  No other way of finding it.

Possibly it's an arith overflow in there, but I more suspect that
your nr_pagecache_pages() function is returning an incorrect value.

This happened to David M-T just this week in the 2.4 kernel - the
nr_buffermem_pages() function was returning a bad value due to an
unaccounted-for hole in the memory map and the observed effect
was just the same.

So.  Please debug it.


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-06  0:11 UTC
  To: Andrew Morton
  Cc: William Lee Irwin III, Andrea Arcangeli, linux-mm, Linus Torvalds

On Fri, 5 Jul 2002, Andrew Morton wrote:
> William Lee Irwin III wrote:
> > On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote:
> > > Of course, that change means that we wouldn't be able to throttle
> > > page allocators against IO any more, and we'd have to do something
> > > smarter.  What a shame ;)
> >
> > This is actually necessary IMHO. Some testing I've been able to do seems
> > to reveal the current throttling mechanism as inadequate.
>
> I don't think so, if you're referring to the situation where your
> 4G machine had 3.5G of dirty pages without triggering writeback.
>
> That's not a generic problem.

But it is, mmap() and anonymous memory don't trigger writeback.

regards,

Rik


* Re: vm lock contention reduction
From: Linus Torvalds @ 2002-07-06  0:31 UTC
  To: Rik van Riel
  Cc: Andrew Morton, William Lee Irwin III, Andrea Arcangeli, linux-mm


On Fri, 5 Jul 2002, Rik van Riel wrote:
>
> But it is, mmap() and anonymous memory don't trigger writeback.

I don't think we can fix that, without going back to the approach of
marking any writable memory as read-only and counting it at page-fault
time.

Wasn't it you who did that test-patch originally?  From what I remember,
it had basically zero downside for normal UNIX applications (shared memory
that is written to is so rare that it doesn't end up on the radar), but
was quite expensive for the DB kind of shmem usage where shared writable
memory is the major component..

There might be some way to avoid the page fault badness (the large page
stuff will do this automatically, for example), which might make the
"let's keep track of dirty mappings explicitly" approach acceptable again.

			Linus


* Re: vm lock contention reduction
From: Rik van Riel @ 2002-07-06  0:45 UTC
  To: Linus Torvalds
  Cc: Andrew Morton, William Lee Irwin III, Andrea Arcangeli, linux-mm

On Fri, 5 Jul 2002, Linus Torvalds wrote:
> On Fri, 5 Jul 2002, Rik van Riel wrote:
> >
> > But it is, mmap() and anonymous memory don't trigger writeback.
>
> I don't think we can fix that, without going back to the approach of
> marking any writable memory as read-only and counting it at page-fault
> time.

I don't think we have to fix it, as long as the
shrink_caches/page_launder function is well
balanced and throttling is done intelligently.

> Wasn't it you who did that test-patch originally?

I don't remember who did it, so I suspect it wasn't me.
Could it be Ben ?

> There might be some way to avoid the page fault badness (the large page
> stuff will do this automatically, for example), which might make the
> "let's keep track of dirty mappings explicitly" approach acceptable
> again.

That's another approach; I guess it'll be worth looking
into both (and searching the web for what other people
have done before us with both ;))

regards,

Rik


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-06  0:48 UTC
  To: Rik van Riel
  Cc: William Lee Irwin III, Andrea Arcangeli, linux-mm, Linus Torvalds

Rik van Riel wrote:
> 
> On Fri, 5 Jul 2002, Andrew Morton wrote:
> > William Lee Irwin III wrote:
> > > On Thu, Jul 04, 2002 at 07:18:34PM -0700, Andrew Morton wrote:
> > > > Of course, that change means that we wouldn't be able to throttle
> > > > page allocators against IO any more, and we'd have to do something
> > > > smarter.  What a shame ;)
> > >
> > > This is actually necessary IMHO. Some testing I've been able to do seems
> > > to reveal the current throttling mechanism as inadequate.
> >
> > I don't think so, if you're referring to the situation where your
> > 4G machine had 3.5G of dirty pages without triggering writeback.
> >
> > That's not a generic problem.
> 
> But it is, mmap() and anonymous memory don't trigger writeback.
> 

That's different.  Bill hit a problem just running tiobench.

We can run balance_dirty_pages() when a COW copyout is performed,
which should improve things somewhat.

But the whole idea of the dirty memory thresholds just seems bust,
really.  Because how do you pick the thresholds?  40%.  Bah.

One idea we kicked around a while back was to remove the throttling
from the write(2) path altogether, and to perform the throttling
inside the page allocator instead, where we have more information.
And perhaps set PF_WRITER before doing so, so the VM can penalise
the caller more heavily.
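
As a sketch (PF_WRITER doesn't exist; the flag bit and the helper
are hypothetical):

	#define PF_WRITER	0x00800000	/* invented flag bit */

	static struct page *alloc_for_write(void)
	{
		struct page *page;
		int was_writer = current->flags & PF_WRITER;

		current->flags |= PF_WRITER;
		/* the allocator can now see it's dealing with a heavy
		 * write(2) caller and penalise it harder */
		page = alloc_page(GFP_HIGHUSER);
		if (!was_writer)
			current->flags &= ~PF_WRITER;
		return page;
	}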

But this doesn't work if the write(2) caller is redirtying existing
pagecache, just like writes to anon pages, mmap pages, etc.

Alternatively: the VM can periodically reach over and twiddle the
dirty memory thresholds which balance_dirty_pages() uses.  I don't
like our chances of getting that right though ;)


* Re: vm lock contention reduction
From: Andrew Morton @ 2002-07-07  2:50 UTC
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm, Martin J. Bligh

Andrea Arcangeli wrote:
> 
> On Thu, Jul 04, 2002 at 11:33:45PM -0700, Andrew Morton wrote:
> > Well.  First locks first.  kmap_lock is a bad one on x86.
> 
> Actually I thought about kmap_lock and the per-process kmaps a bit more
> with Martin (cc'ed) during OLS and there is an easy process-scalable
> solution to drop:

Martin is being bitten by the global invalidate more than by the lock.
He increased the size of the kmap pool just to reduce the invalidate
frequency and saw 40% speedups of some stuff.

Those invalidates don't show up nicely on profiles.

>         the kmap_lock
>         in turn the global pool
>         in turn the global tlb flush
> 
> The only problem is that it's no longer both atomic *and* persistent;
> it's only persistent. It's also atomic if mm_count == 1, but the
> kernel cannot rely on that; it has to assume it's always a blocking
> operation (you only find out whether it blocks at runtime).

I was discussing this with sct a few days back.  iiuc, the proposal
was to create a small per-cpu pool (say, 4-8 pages) which is a
"front-end" to regular old kmap().

Any time you have one of these pages in use, the process gets
pinned onto the current CPU. If we run out of per-cpu kmaps,
just fall back to traditional kmap().

It does mean that this variant of kmap() couldn't just return
a bare virtual address - it would have to return something richer
than that.
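
Something like this, perhaps (all names invented; the handle type
is the "something richer"):

	struct kmap_handle {
		void *vaddr;
		int slot;	/* -1 means we fell back to kmap() */
	};

	struct kmap_handle kmap_frontend(struct page *page)
	{
		struct kmap_handle h;
		int slot = grab_percpu_kmap_slot();	/* hypothetical */

		if (slot >= 0) {
			/* must stay on this CPU while the slot is used */
			pin_to_current_cpu();
			h.vaddr = map_page_in_slot(slot, page);
			h.slot = slot;
		} else {
			h.vaddr = kmap(page);	/* global pool fallback */
			h.slot = -1;
		}
		return h;
	}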

> In short, the same design as the per-process kmaps will work just fine if
> we add a semaphore to the mm_struct. Then, before starting to use the kmap
> entry, we must acquire the semaphore. This way all the global locking and
> the global tlb flush go away completely for normal tasks, but the
> contention on that per-mm semaphore still remains for threads doing
> simultaneous pte manipulation or simultaneous pagecache I/O.
> Furthermore this I/O will be serialized; threaded benchmarks like dbench
> may perform poorly that way I suspect, or we should add a pool of
> userspace pages so more than one thread is allowed to go ahead, but then
> we may cacheline-bounce in the synchronization of the pool across
> threads (similar to what we do now in the global pool).
> 
> Then there's the problem that the pagecache/FS API should be changed to
> pass the vaddr through the stack, because page->virtual would go away:
> the virtual address would be per-process, protected by the mm->kmap_sem,
> so we couldn't store it in a global - all tasks can kmap the same page
> at the same time, each at its own virtual vaddr. This as well will break
> some common code.
> 
> Last but not least, I hope that in 2.6 production I won't be running
> benchmarks and profiling on a 32-bit cpu anymore anyway.
> 
> So I'm not very motivated anymore in doing that change after the comment
> from Linus about the issue with threads.

I believe that IBM have 32gig, 8- or 16-CPU ia32 machines just
coming into production now.  Presumably, they're not the only
ones.  We're stuck with this mess for another few years.


* Re: vm lock contention reduction
From: Linus Torvalds @ 2002-07-07  3:05 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh


On Sat, 6 Jul 2002, Andrew Morton wrote:
>
> Martin is being bitten by the global invalidate more than by the lock.
> He increased the size of the kmap pool just to reduce the invalidate
> frequency and saw 40% speedups of some stuff.
>
> Those invalidates don't show up nicely on profiles.

I'd like to enhance the profiling support a bit, to create some
infrastructure for doing different kinds of profiles, not just the current
timer-based one (and not just for the kernel).

There's also the P4 native support for "event buffers" or whatever intel
calls them, that allows profiling at a lower level by interrupting not for
every event, but only when the hw buffer overflows.

I haven't had much time to look at the oprofile thing, but what I _have_
seen has made me rather unhappy (especially the horrid system call
tracking kludges).

I'd rather have some generic hooks (a notion of a "profile buffer" and
events that cause us to have to synchronize with it, like process
switches, mmap/munmap - oprofile wants these too), and some generic helper
routines for profiling (turn any eip into a "dentry + offset" pair
together with ways to tag specific dentries as being "worthy" of
profiling).
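
As a sketch of the "dentry + offset" helper (the structure and function
name here are invented, not code from any existing patch; the caller is
assumed to hold the mm's mmap_sem):

struct profile_hit {
	struct dentry *dentry;
	unsigned long offset;
};

int eip_to_dentry_offset(struct mm_struct *mm, unsigned long eip,
			 struct profile_hit *hit)
{
	struct vm_area_struct *vma = find_vma(mm, eip);

	/* only file-backed mappings can be attributed to a dentry */
	if (!vma || eip < vma->vm_start || !vma->vm_file)
		return -ENOENT;
	hit->dentry = vma->vm_file->f_dentry;
	hit->offset = (eip - vma->vm_start) +
			((unsigned long)vma->vm_pgoff << PAGE_SHIFT);
	return 0;
}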

Depending on the regular timer interrupt will never give good profiles,
simply because it can't be NMI, but also because you then don't have the
choice of using other counters (cache miss etc).

oprofile does much of this, but in a damn ugly manner.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: vm lock contention reduction
  2002-07-07  3:05                         ` Linus Torvalds
@ 2002-07-07  3:47                           ` Andrew Morton
  2002-07-08 11:39                               ` John Levon
  2002-07-07  5:16                           ` vm lock contention reduction Martin J. Bligh
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-07-07  3:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Rik van Riel, linux-mm, Martin J. Bligh, John Levon

Linus Torvalds wrote:
> 
> On Sat, 6 Jul 2002, Andrew Morton wrote:
> >
> > Martin is being bitten by the global invalidate more than by the lock.
> > He increased the size of the kmap pool just to reduce the invalidate
> > frequency and saw 40% speedups of some stuff.
> >
> > Those invalidates don't show up nicely on profiles.
> 
> I'd like to enhance the profiling support a bit, to create some
> infrastructure for doing different kinds of profiles, not just the current
> timer-based one (and not just for the kernel).
> 
> There's also the P4 native support for "event buffers" or whatever intel
> calls them, that allows profiling at a lower level by interrupting not for
> every event, but only when the hw buffer overflows.
> 
> I haven't had much time to look at the oprofile thing, but what I _have_
> seen has made me rather unhappy (especially the horrid system call
> tracking kludges).
> 
> I'd rather have some generic hooks (a notion of a "profile buffer" and
> events that cause us to have to synchronize with it, like process
> switches, mmap/munmap - oprofile wants these too), and some generic helper
> routines for profiling (turn any eip into a "dentry + offset" pair
> together with ways to tag specific dentries as being "worthy" of
> profiling).
> 
> Depending on the regular timer interrupt will never give good profiles,
> simply because it can't be NMI, but also because you then don't have the
> choice of using other counters (cache miss etc).
> 
> oprofile does much of this, but in a damn ugly manner.
> 

I pinged John about an oprofile merge just the other day actually.
He agrees with you on the syscall table thing.  I think he says
that it could be cleaned up if oprofile was in the tree.  Ditto
the issue with mmap.

I was able to isolate and fix some fairly hairy performance problems
at work with oprofile.  It's a great tool - I use it all the time.  And
it profiles the entire system - right down to file-n-line in some random
shared object.  With NMIs.  It is not just a kernel tool.

So.  John.  Get coding :-)

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: vm lock contention reduction
  2002-07-07  3:05                         ` Linus Torvalds
  2002-07-07  3:47                           ` Andrew Morton
@ 2002-07-07  5:16                           ` Martin J. Bligh
  1 sibling, 0 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-07  5:16 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm

> I'd like to enhance the profiling support a bit, to create some
> infrastructure for doing different kinds of profiles, not just 
> the current timer-based one (and not just for the kernel).
> 
> There's also the P4 native support for "event buffers" or whatever intel
> calls them, that allows profiling at a lower level by interrupting not for
> every event, but only when the hw buffer overflows.

kernprof is capable of monitoring the CPU's stats-gathering
counters, I believe. The following is thanks to Mala Anand,
and I think she was monitoring cacheline misses like this.
The mechanism allows access to various things, though I can't
remember off the top of my head whether there's a TLB hit-rate
counter ...

M.

-------------------------

The following script is what I used to collect performance counter
profiling data using kernprof:

sleep 5
kernprof -r -a 0xc0 -d pmc -f 1000 -t pc -b
sleep 30
kernprof -e -i -m /usr/src/linux/System.map > kpg.out
sort -nr +2 kpg.out > kpg26sort.out
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  2:50                       ` Andrew Morton
  2002-07-07  3:05                         ` Linus Torvalds
@ 2002-07-07  6:13                         ` Martin J. Bligh
  2002-07-07  6:37                           ` Andrew Morton
                                             ` (2 more replies)
  2002-07-08  0:38                         ` vm lock contention reduction William Lee Irwin III
  2 siblings, 3 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-07  6:13 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm

> Martin is being bitten by the global invalidate more than by the lock.

True.

> He increased the size of the kmap pool just to reduce the invalidate
> frequency and saw 40% speedups of some stuff.

I think that might have been when Andrea was using persistent kmap
for highpte (since fixed), so we were really kicking the **** out 
of it. Nonetheless, your point is perfectly correct, it's the 
global invalidate that's really the expensive thing.

Making the kmap pool 4096 instead of 1024 resulted in a tenfold 
reduction in the number of global TLB flushes (there's a static 
set of "fixed" maps which would presumably explain the difference).

As we have larger memory machines, we're under more and more KVA
pressure, more stuff gets shoved into highmem, kmap usage gets
heavier ... and this starts to look even worse.

> Those invalidates don't show up nicely on profiles.

Not the standard profile, no. We can fairly easily count the cost
of the time taken to do the flush, but not the side effects of the
cache-trash. According to my P4 manual, I can see an ITLB miss counter,
but nothing for the data (apart from some more generic things).

> I was discussing this with sct a few days back.  iiuc, the proposal
> was to create a small per-cpu pool (say, 4-8 pages) which is a
> "front-end" to regular old kmap().
> 
> Any time you have one of these pages in use, the process gets
> pinned onto the current CPU. 

Ewww! That's gross ;-) I don't want to think about how that interacts
with cpus_allowed being changed, etc .... but mainly it's just gross ;-)
On the other hand I won't deny it may be the most practical solution ....

The really sad thing about the global TLB flush for persistent kmap 
is that it's probably wholly unnecessary - most likely each entry
is really only dirty on 1 CPU.

At a random guess (ie I haven't actually checked this), I think the
vast majority of the persistent kmaps are doing something like:
"kmap; copy_to/from_user; kunmap", and are just using this mechanism
in case the page they're copying to/from is going to generate a 
page fault & schedule - thus we don't use atomic kmap. The proportions
are merely speculation - if someone wants to throw stones at that (or
better still, empirical measurements), feel free.

I talked this over with Andrea for a while at OLS, and some of the
other things we covered are below (KVA = kernel virtual addr space,
UKVA = user-kernel virtual address space, per process like user
space, but with the protections of kernel space).

1. Just pin the damned page.

Instead of pinning the process, pin the user page you're touching in
memory. To me this is more elegant, though Andrea pointed out we'd
be slowing down the common case when you don't actually page fault
at all.

Then I mentioned something about fixing it up in the pagefault path
instead so that the process would reset its map when it got restarted,
but Andrea started laughing at that point ;-)

2. Per process kmaps

This kills the global kmap_lock and is fairly easy to implement with
just a per-process lock. Unfortunately it still doesn't really kill
the problem of having to do a global TLB flush if your process has
more than 1 thread. Whilst I can limply claim that it's kind of nice
for most non-threaded processes, and the fallback position is no worse
than we have now, it's not panning out the way I'd hoped.

3. Per task kmaps

The way we've been doing UKVA up to now is to map it at the top of the
user address space, wedging it above the stack, and it's on a per-process
basis. If we moved the area up to the window in the top 128Mb of KVA
instead, it might be easier to make it really per task instead, which'd
be much more useful for kmap.

Then you'd need to allocate 1 virtual frame per task and you'd get 
atomic, persistent, per-task kmap. Which I think isn't as bad as it
sounds, because you can actually reuse the same virtual page for all
tasks ... or at least I think you can ... it's late, and I need to 
twist my brain through that one some more. I think this'd stop you
having a shared kernel top level pagetable for all tasks though, so
it probably causes more problems than it solves.

4. Another kludge.

Supposing we didn't really map the page at all, just left it as an
invalid page table entry, and just shoved an entry into the TLB 
instead (there's a way of doing that somewhere) ... chances are 
very good it'd stay there long enough to do the operation; if it 
fell out, or we got migrated, you'd just get a pagefault and patch
it back up again ... ;-)

> I believe that IBM have 32gig, 8- or 16-CPU ia32 machines just
> coming into production now.  Presumably, they're not the only
> ones.  We're stuck with this mess for another few years.

We do indeed - the latest ones we'll actually be selling can go even
larger. I now have a 32-way 32Gb P3 box in the lab, built from older
hardware for experiments, and have enough hardware to make something
even bigger if I'm feeling particularly suicidal one day ... whilst
I'd agree that large 32-bit boxes are a temporary transition phase
(and a pain in the ass), it'll be a while before we have Hammer systems
large enough for this sort of thing. Or if you have a few million
dollars, Anton has a very nice PPC64 machine he can sell you ;-)

M.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  6:13                         ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh
@ 2002-07-07  6:37                           ` Andrew Morton
  2002-07-07  7:53                           ` Linus Torvalds
  2002-07-08 17:29                           ` Martin J. Bligh
  2 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-07  6:37 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

"Martin J. Bligh" wrote:
> 
> ...
> > I was discussing this with sct a few days back.  iiuc, the proposal
> > was to create a small per-cpu pool (say, 4-8 pages) which is a
> > "front-end" to regular old kmap().
> >
> > Any time you have one of these pages in use, the process gets
> > pinned onto the current CPU.
> 
> Ewww! That's gross ;-)

Hey.  So is highmem ;)

But sys_sched_affinity() allows you to change the affinity of
other tasks (it takes a pid).  So that's torn that idea.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  6:13                         ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh
  2002-07-07  6:37                           ` Andrew Morton
@ 2002-07-07  7:53                           ` Linus Torvalds
  2002-07-07  9:04                             ` Andrew Morton
  2002-07-07 16:00                             ` Martin J. Bligh
  2002-07-08 17:29                           ` Martin J. Bligh
  2 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-07-07  7:53 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm


On Sat, 6 Jul 2002, Martin J. Bligh wrote:
>
> I think that might have been when Andrea was using persistent kmap
> for highpte (since fixed), so we were really kicking the **** out
> of it. Nonetheless, your point is perfectly correct, it's the
> global invalidate that's really the expensive thing.

I suspect that there really aren't that many places that care about the
persistent mappings, and the atomic per-cpu stuff is inherently scalable
(but due to being harder to cache, slower). So I wonder how much of a
problem the kmap stuff really is.

So if the main problem ends up being that some paths (a) really want the
persistent version _and_ (b) you can make the paths hold them for long
times (by writing to a blocking pipe/socket or similar) we may just have
much simpler approaches - like a per-user kmap count.

It's not hard to make kmap() do something the equivalent of

	down(current->user->kmap_max);

and make kunmap() just do the "up()", and then just initialize the user
kmap_max semaphore to something simple like 100.

Which just guarantees that any user at any time can only hold 100
concurrent persistent kmap's open. Problem solved.

(and yeah, you can make it configurable on a per-user level or something,
so that if it turns out that oracle really has 100 threads all blocking on
a socket at the same time, you can admin up the oracle count).
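
Concretely, something like this (kmap_max being a hypothetical
semaphore added to user_struct and initialized to ~100; __kmap/__kunmap
stand in for the existing implementation):

void *kmap(struct page *page)
{
	/* blocks once this user has ~100 kmaps outstanding */
	down(&current->user->kmap_max);
	return __kmap(page);
}

void kunmap(struct page *page)
{
	__kunmap(page);
	up(&current->user->kmap_max);
}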

The _performance_ scalability concerns should be fairly easily solvable
(as far as I can tell - feel free to correct me) by making the persistent
array bigger and finding things where persistency isn't needed (and
possibly doesn't even help due to lack of locality), and just making those
places use the per-cpu atomic ones.

Eh?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  7:53                           ` Linus Torvalds
@ 2002-07-07  9:04                             ` Andrew Morton
  2002-07-07 16:13                               ` Martin J. Bligh
  2002-07-07 18:31                               ` Linus Torvalds
  2002-07-07 16:00                             ` Martin J. Bligh
  1 sibling, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-07  9:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm

Linus Torvalds wrote:
> 
> ...
> The _performance_ scalability concerns should be fairly easily solvable
> (as far as I can tell - feel free to correct me) by making the persistent
> array bigger and finding things where persistency isn't needed (and
> possibly doesn't even help due to lack of locality), and just making those
> places use the per-cpu atomic ones.
> 
> Eh?

You mean just tune up the existing code, basically.

There's certainly plenty of opportunity to do that.

Probably the biggest offenders are generic_file_read/write.  In
generic_file_write() we're already faulting in the user page(s)
beforehand (somewhat racily, btw).  We could formalise that into
a pin_user_page_range() or whatever and use an atomic kmap
in there.

Use the same thing in file_read_actor.
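
A possible shape for that helper, sketched on top of get_user_pages()
(the name and the error handling are illustrative only):

int pin_user_page_range(unsigned long addr, size_t len,
			struct page **pages, int max_pages)
{
	int nr = (addr + len + PAGE_SIZE - 1) / PAGE_SIZE - addr / PAGE_SIZE;
	int ret;

	if (nr > max_pages)
		return -EINVAL;
	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, addr, nr,
			     1 /* write */, 0 /* force */, pages, NULL);
	up_read(&current->mm->mmap_sem);
	return ret;	/* number of pages pinned, or -errno */
}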

The fs/buffer.c code is very profligate.  It kmaps the page
unconditionally, but in the great majority of cases, it
never uses that kmap.

Plus we're still kmapping pages twice on the prepare_write/commit_write
path.  That doesn't consume another kmap slot (kmap of an
already-kmapped page just nests), but it is still avoidable work.

clear_user_highpage() is using atomic kmap already.

Networking uses kmaps, but from a quick peek it seems that it's mostly
using kmap_atomic already, and a pin_user_page_range()/unpin_user_page_range()
API could be dropped in there quite easily if needed.

ext2 directories will require some thought.

Martin, what sort of workload were you seeing the problems with?


I can fix the buffer.c and filemap.c stuff and then we can take
another look at it.

I'm just not too sure about the pin_user_page() thing.  How
expensive is a page table walk in there likely to be?

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  7:53                           ` Linus Torvalds
  2002-07-07  9:04                             ` Andrew Morton
@ 2002-07-07 16:00                             ` Martin J. Bligh
  2002-07-07 18:28                               ` Linus Torvalds
  2002-07-08  7:00                               ` Andrea Arcangeli
  1 sibling, 2 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-07 16:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm

>> I think that might have been when Andrea was using persistent kmap
>> for highpte (since fixed), so we were really kicking the **** out
>> of it. Nonetheless, your point is perfectly correct, it's the
>> global invalidate that's really the expensive thing.
> 
> I suspect that there really aren't that many places that care about the
> persistent mappings, and the atomic per-cpu stuff is inherently scalable
> (but due to being harder to cache, slower). So I wonder how much of a
> problem the kmap stuff really is.
> 
> So if the main problem ends up being that some paths (a) really want the
> persistent version _and_ (b) you can make the paths hold them for long
> times (by writing to a blocking pipe/socket or similar) we may just have
> much simpler approaches - like a per-user kmap count.

I don't think they really want something that's persistent over a
long time, I think they want something they can hold over a potential
reschedule - that's why they're not using kmap_atomic.

> Which just guarantees that any user at any time can only hold 100
> concurrent persistent kmap's open. Problem solved.

We're not running out of kmaps in the pool, we're just churning them
(and dirtying them) at an unpleasant rate. Every time we exhaust the
pool, we do a global TLB flush on all CPUs, which sucks for performance.
 
> The _performance_ scalability concerns should be fairly easily solvable
> (as far as I can tell - feel free to correct me) by making the persistent
> array bigger 

Making the array bigger does help, but it consumes some more virtual
address space, which is the most critical resource on these machines ... 
at the moment we use up 1024 entries, which is 4Mb, I normally set
things to 4096, which uses 16Mb - certainly that would be a better
default for larger machines. But if I make it much bigger than that,
I start to run out of vmalloc space ;-) Of course we could just add
the size of the kmap pool to _VMALLOC_RESERVE, which would be somewhat
better ...

> and finding things where persistency isn't needed (and
> possibly doesn't even help due to lack of locality), and just 
> making those places use the per-cpu atomic ones.

I'm kind of handwaving at this point because I don't have the stats
to hand. I had Keith gather some stats on this, and see what was 
actually calling kmap - I'll dig those out on Monday when I'm back
in the office, and send them along.

I was kind of hoping to find some elegant killer solution to all this,
but it's been kicked around for a while now, and every solution seems
to have its problems. If it can't be elegantly solved, we can probably
kill the performance issue by just tuning, as you say ...

M.

PS. One interesting thing Keith found was this: on NUMA-Q, I currently
do the IPI send for smp_call_function (amongst other things) as a 
sequenced unicast (send a separate message to each CPU in turn), 
rather than the normal broadcast because it's harder to do in 
clustered apic mode. Whilst trying to switch this back, he found it ran
faster as the sequenced unicast, not only for NUMA-Q, but also for
standard SMP boxes!!! I'm guessing the timing offset generated helps
cacheline or lock contention ... interesting anyway.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  9:04                             ` Andrew Morton
@ 2002-07-07 16:13                               ` Martin J. Bligh
  2002-07-07 18:31                               ` Linus Torvalds
  1 sibling, 0 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-07 16:13 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: Andrea Arcangeli, Rik van Riel, linux-mm

> You mean just tune up the existing code, basically.
> 
> There's certainly plenty of opportunity to do that.
> 
> Probably the biggest offenders are generic_file_read/write.  In
> generic_file_write() we're already faulting in the user page(s)
> beforehand (somewhat racily, btw).  We could formalise that into
> a pin_user_page_range() or whatever and use an atomic kmap
> in there.

That was basically what Andrea and I were discussing at OLS - his
concern (IIRC) was that the pinning operation would slow things
down (exactly the concern you mention later on).

> Martin, what sort of workload were you seeing the problems with?

I think I was just playing with kernel compile again, but I can
get our performance group to gather some real numbers for a variety
of benchmarks. With kernel compile there were only two large-scale
callers, but I don't remember what they were off the top of my head.
I'll give you some initial figures tomorrow.

> I'm just not too sure about the pin_user_page() thing.  How
> expensive is a page table walk in there likely to be?

It looks like the code does try to recover from a page fault right
now ... but it just falls over and says the write failed. Is there
some way we could just make it abort the copy to user, and restart
the copy sequence in the (unlikely) case of a page fault?

If we're going to play around with that sort of thing (or indeed
do what the current code does), should this all be wrapped in a
"copy_to/from_high_user" function or something similar? Cleaner,
easier to change later, and less places for people to screw up.

M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 16:00                             ` Martin J. Bligh
@ 2002-07-07 18:28                               ` Linus Torvalds
  2002-07-08  7:11                                 ` Andrea Arcangeli
  2002-07-08 10:15                                 ` Eric W. Biederman
  2002-07-08  7:00                               ` Andrea Arcangeli
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-07-07 18:28 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm


On Sun, 7 Jul 2002, Martin J. Bligh wrote:
> >
> > So if the main problem ends up being that some paths (a) really want the
> > persistent version _and_ (b) you can make the paths hold them for long
> > times (by writing to a blocking pipe/socket or similar) we may just have
> > much simpler approaches - like a per-user kmap count.
>
> I don't think they really want something that's persistent over a
> long time, I think they want something they can hold over a potential
> reschedule - that's why they're not using kmap_atomic.

Some things could be _loong_, ie the "sendfile()" case to a socket.

Worse, it can be intentionally made arbitrarily long by just setting up a
socket that has no readers, which is why it is potentially a DoS thing.

This is why I suggested the per-user kmap semaphore: not because it will
ever trigger in normal load (where "normal load" means even _extremely_
high load by people who don't actively try to break the system), but
because it will be a resource counter for a potential attack and limit the
attacker down to the point where it isn't a problem.

> > Which just guarantees that any user at any time can only hold 100
> > concurrent persistent kmap's open. Problem solved.
>
> We're not running out of kmaps in the pool, we're just churning them
> (and dirtying them) at an unpleasant rate. Every time we exhaust the
> pool, we do a global TLB flush on all CPUs, which sucks for performance.

That's the "normal load" case, fixable by just making the kmap pool
larger. That's not what the semaphore is there for.

> > The _performance_ scalability concerns should be fairly easily solvable
> > (as far as I can tell - feel free to correct me) by making the persistent
> > array bigger
>
> Making the array bigger does help, but it consumes some more virtual
> address space, which is the most critical resource on these machines ...
> at the moment we use up 1024 entries, which is 4Mb, I normally set
> things to 4096, which uses 16Mb - certainly that would be a better
> default for larger machines. But if I make it much bigger than that,
> I start to run out of vmalloc space ;-) Of course we could just add
> the size of the kmap pool to _VMALLOC_RESERVE, which would be somewhat
> better ...

I don't see 16MB of virtual space as being a real problem on a 64GB
machine.

> PS. One interesting thing Keith found was this: on NUMA-Q, I currently
> do the IPI send for smp_call_function (amongst other things) as a
> sequenced unicast (send a separate message to each CPU in turn),
> rather than the normal broadcast because it's harder to do in
> clustered apic mode. Whilst trying to switch this back, he found it ran
> faster as the sequenced unicast, not only for NUMA-Q, but also for
> standard SMP boxes!!! I'm guessing the timing offset generated helps
> cacheline or lock contention ... interesting anyway.

Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU
is stupid enough to do a locked cycle to update the "A" bit on the
segments (even if it is already set), you would see horrible cacheline
bouncing for any interrupt.

I don't know if that is the case. I'd _assume_ that the microcode was
clever enough to not do this, but who knows. It should be fairly easily
testable (just "SMOP") by duplicating the IDT/GDT across CPU's.

I don't think the cross-calls should have any locks in them, although
there do seem to be some silly things like "flush_cpumask" that should
probably just be in the "cpu_tlbstate[cpu]" array instead (no cacheline
bouncing, and since we touch that array anyway, it should be better for
the cache in other ways too).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  9:04                             ` Andrew Morton
  2002-07-07 16:13                               ` Martin J. Bligh
@ 2002-07-07 18:31                               ` Linus Torvalds
  2002-07-07 18:55                                 ` Linus Torvalds
  2002-07-08  7:24                                 ` Andrew Morton
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-07-07 18:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm


On Sun, 7 Jul 2002, Andrew Morton wrote:
>
> Probably the biggest offenders are generic_file_read/write.  In
> generic_file_write() we're already faulting in the user page(s)
> beforehand (somewhat racily, btw).  We could formalise that into
> a pin_user_page_range() or whatever and use an atomic kmap
> in there.

I'd really prefer not to. We're talking of a difference between one
single-cycle instruction (the address should be in the TLB 99% of all
times), and a long slow TLB walk with various locks etc.

Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the
write itself may need to block for other reasons (ie socket buffer full
etc). THAT is the one that can get misused - the others are not a big
deal, I think.

So kmap_atomic definitely doesn't work there.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 18:31                               ` Linus Torvalds
@ 2002-07-07 18:55                                 ` Linus Torvalds
  2002-07-07 19:02                                   ` Linus Torvalds
  2002-07-08  7:24                                 ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2002-07-07 18:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm


On Sun, 7 Jul 2002, Linus Torvalds wrote:
>
> Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the
> write itself may need to block for other reasons (ie socket buffer full
> etc). THAT is the one that can get misused - the others are not a big
> deal, I think.
>
> So kmap_atomic definitely doesn't work there.

We do actually have an alternate approach: get rid of the "kmap()" in
file_send_actor() altogether, and require targets of sendfile() to always
support the sendpage() interface (which can do a kmap at a lower level if
they need to - it's not even guaranteed that they do need to).

I suspect most uses of sendfile() are TCP, which already does sendpage().
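
Roughly, file_send_actor() would then reduce to (a sketch, not the
actual checked-in change):

static int file_send_actor(read_descriptor_t *desc, struct page *page,
			   unsigned long offset, unsigned long size)
{
	struct file *file = (struct file *)desc->buf;
	ssize_t written;

	/* no kmap: hand the page straight to the target's method */
	written = file->f_op->sendpage(file, page, offset, size,
				       &file->f_pos, size < desc->count);
	if (written < 0) {
		desc->error = written;
		written = 0;
	}
	desc->count -= written;
	desc->written += written;
	return written;
}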

That would largely get rid of the DoS worry (can anybody see any other
ways of forcing an arbitrarily long sleep on a kmap'ed page?)

It still leaves the scalability issue, but quite frankly, so far I'd still
much rather optimize for the regular hardware and screw scalability if it
slows down the 4GB UP (or 2P) case.

			Linus


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 18:55                                 ` Linus Torvalds
@ 2002-07-07 19:02                                   ` Linus Torvalds
  0 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-07-07 19:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm


On Sun, 7 Jul 2002, Linus Torvalds wrote:
>
> We do actually have an alternate approach: get rid of the "kmap()" in
> file_send_actor() altogether, and require targets of sendfile() to always
> support the sendpage() interface (which can do a kmap at a lower level if
> they need to - it's not even guaranteed that they do need to).

Ok, I'm going to just check that in to 2.5.x, and see who (if anybody)
screams.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: vm lock contention reduction
  2002-07-07  2:50                       ` Andrew Morton
  2002-07-07  3:05                         ` Linus Torvalds
  2002-07-07  6:13                         ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh
@ 2002-07-08  0:38                         ` William Lee Irwin III
  2 siblings, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-07-08  0:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm,
	Martin J. Bligh

On Sat, Jul 06, 2002 at 07:50:41PM -0700, Andrew Morton wrote:
> Any time you have one of these pages in use, the process gets
> pinned onto the current CPU. If we run out of per-cpu kmaps,
> just fall back to traditional kmap().

This is not particularly difficult; it only requires a depth counter
and a saved cpumask for when the task becomes unpinned again.
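
For example (a sketch only; the two task_struct fields are invented):

void kmap_pin_cpu(void)
{
	if (current->kmap_pin_depth++ == 0) {
		/* first pin: remember the old mask, pin to this CPU */
		current->kmap_saved_cpus_allowed = current->cpus_allowed;
		set_cpus_allowed(current, 1UL << smp_processor_id());
	}
}

void kmap_unpin_cpu(void)
{
	if (--current->kmap_pin_depth == 0)
		set_cpus_allowed(current, current->kmap_saved_cpus_allowed);
}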


Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: vm lock contention reduction
  2002-07-06  0:48               ` Andrew Morton
@ 2002-07-08  0:59                 ` William Lee Irwin III
  0 siblings, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-07-08  0:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, Andrea Arcangeli, linux-mm, Linus Torvalds

Rik van Riel wrote:
>> But it is, mmap() and anonymous memory don't trigger writeback.

On Fri, Jul 05, 2002 at 05:48:48PM -0700, Andrew Morton wrote:
> That's different.  Bill hit a problem just running tiobench.
> We can run balance_dirty_pages() when a COW copyout is performed,
> which will approximately improve things.
> But the whole idea of the dirty memory thresholds just seems bust,
> really.  Because how do you pick the thresholds?  40%.  Bah.

I don't know what the answer should be, but I can certainly demonstrate
this in a rather uninteresting situation (4GB, 4 CPUs, 1 disk, 16 tasks).

But I can concur with that evaluation.  In my estimation, fixed
fractions of memory don't have a very direct relationship to what's
going on.


Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 16:00                             ` Martin J. Bligh
  2002-07-07 18:28                               ` Linus Torvalds
@ 2002-07-08  7:00                               ` Andrea Arcangeli
  1 sibling, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-07-08  7:00 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Linus Torvalds, Andrew Morton, Rik van Riel, linux-mm

On Sun, Jul 07, 2002 at 09:00:27AM -0700, Martin J. Bligh wrote:
> clustered apic mode. Whilst trying to switch this back, he found it ran
> faster as the sequenced unicast, not only for NUMA-Q, but also for
> standard SMP boxes!!! I'm guessing the timing offset generated helps
> cacheline or lock contention ... interesting anyway.

Makes sense, but it sounds like it should be fixed in the way we
synchronize with the IPIs, rather than by executing them in sequence.
We should just have smp_call_function poll (read-only) a list of
per-cpu data, and have all the other cpus inside the IPI modify their
own per-cpu cachelines.  Right now the IPI callback works on a shared
cacheline, that is call_data->started/finished; that could probably be
per-cpu without much trouble, just by having smp_call_function read
all the per-cpu fields rather than only the current global ones.
Things like tlb flushing are all per-cpu, with per-cpu tlbdata
information; really the only bottleneck seems to be the
smp_call_function_interrupt implementation, which uses global counters
instead of per-cpu counters.

It will be a bit similar to the big-reader-lock algorithm: the writer
polling the per-cpu counters here is smp_call_function, and the reader
modifying its per-cpu counter is smp_call_function_interrupt.
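
A rough sketch of that shape (all names invented; each CPU dirties
only its own cacheline, and the caller only reads):

struct ipi_ack {
	int started;
	int finished;
} ____cacheline_aligned;

static struct ipi_ack ipi_acks[NR_CPUS];

/* run on each target CPU from smp_call_function_interrupt() */
void ipi_ack_started(void)
{
	ipi_acks[smp_processor_id()].started = 1;
	wmb();
}

void ipi_ack_finished(void)
{
	ipi_acks[smp_processor_id()].finished = 1;
	wmb();
}

/* caller side: poll every CPU's private line, read-only */
static void wait_for_acks(int ncpus, int finished)
{
	int cpu, done;

	do {
		done = 0;
		rmb();
		for (cpu = 0; cpu < ncpus; cpu++)
			done += finished ? ipi_acks[cpu].finished :
					   ipi_acks[cpu].started;
	} while (done < ncpus);
}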

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 18:28                               ` Linus Torvalds
@ 2002-07-08  7:11                                 ` Andrea Arcangeli
  2002-07-08 10:15                                 ` Eric W. Biederman
  1 sibling, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-07-08  7:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-mm

On Sun, Jul 07, 2002 at 11:28:03AM -0700, Linus Torvalds wrote:
> Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU
> is stupid enough to do a locked cycle to update the "A" bit on the
> segments (even if it is already set), you would see horrible cacheline
> bouncing for any interrupt.
> 
> I don't know if that is the case. I'd _assume_ that the microcode was
> clever enough to not do this, but who knows. It should be fairly easily
> testable (just "SMOP") by duplicating the IDT/GDT across CPU's.

If that were the problem, it would not be specific to IPIs though; I
would be surprised if the cpu were stupid enough to lock on such an
always read-only place, and it would show up on any SMP box regardless
of smp_call_function.  OTOH I don't know why the hardware would even
try to lock implicitly there.

> I don't think the cross-calls should have any locks in them, although
> there does seem to be some silly things like "flush_cpumask" that should
> probably just be in the "cpu_tlbstate[cpu] array instead (no cacheline
> bouncing, and since we touch that array anyway, it should be better for
> the cache in other ways too).

Agreed, I overlooked it; flush_cpumask is also a good candidate to be
made per-cpu.  The others, as said, are the ->started/finished in the
call_data structure (that's generic for all smp_call_function users).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 18:31                               ` Linus Torvalds
  2002-07-07 18:55                                 ` Linus Torvalds
@ 2002-07-08  7:24                                 ` Andrew Morton
  2002-07-08  8:09                                   ` Andrea Arcangeli
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-07-08  7:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm

Linus Torvalds wrote:
> 
> On Sun, 7 Jul 2002, Andrew Morton wrote:
> >
> > Probably the biggest offenders are generic_file_read/write.  In
> > generic_file_write() we're already faulting in the user page(s)
> > beforehand (somewhat racily, btw).  We could formalise that into
> > a pin_user_page_range() or whatever and use an atomic kmap
> > in there.
> 
> I'd really prefer not to. We're talking of a difference between one
> single-cycle instruction (the address should be in the TLB 99% of all
> times), and a long slow TLB walk with various locks etc.
> 
> Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the
> write itself may need to block for other reasons (ie socket buffer full
> etc). THAT is the one that can get misused - the others are not a big
> deal, I think.
> 
> So kmap_atomic definitely doesn't work there.
> 

OK, I've been through everything and all the filesystems and
written four patches which I'll throw away.  I think I know
how to do all this now.

- Convert buffer.c to atomic kmaps.

- prepare_write/commit_write no longer do any implicit kmapping
  at all.

- file_read_actor and generic_file_write do their own atomic_kmap
  (more on this below).

- file_send_actor still does kmap.

- If a filesystem wants its page kmapped between prepare and commit,
  it does it itself.  So

  foo_prepare_write(struct file *file, struct page *page,
		    unsigned from, unsigned to)
  {
	int ret;

	/* foo_get_block is the fs's usual block mapper */
	ret = block_prepare_write(page, from, to, foo_get_block);
	if (ret == 0)
		kmap(page);
	return ret;
  }

  foo_commit_write(struct file *file, struct page *page,
		   unsigned from, unsigned to)
  {
	kunmap(page);
	return generic_commit_write(file, page, from, to);
  }

  So in the case of ext2, we can split the directory and S_ISREG a_ops.
  The directory a_ops will kmap the page.  The S_ISREG a_ops will not.
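
For instance (a sketch: ext2's real method names, but the split itself
is hypothetical, and the ext2_dir_* entries stand for the foo_*
wrappers above):

static struct address_space_operations ext2_file_aops = {
	readpage:	ext2_readpage,
	writepage:	ext2_writepage,
	prepare_write:	ext2_prepare_write,	/* no kmap */
	commit_write:	generic_commit_write,
};

static struct address_space_operations ext2_dir_aops = {
	readpage:	ext2_readpage,
	writepage:	ext2_writepage,
	prepare_write:	ext2_dir_prepare_write,	/* kmaps the page */
	commit_write:	ext2_dir_commit_write,	/* kunmaps it */
};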


Basically: no implicit kmaps.  You do it yourself if you want it, and
if you cannot do atomic kmaps.


Now, file_read_actor and generic_file_write still have the problem
of the target userspace page getting evicted while they're holding an
atomic kmap.

But the rmap page eviction code has the mm_struct.  So can we not do this:

	generic_file_write()
	{
		...
		atomic_inc(&current->mm->dont_unmap_pages);

		{
			/* fault in the first and last byte of the
			 * user buffer before taking the atomic kmap */
			volatile char dummy;
			__get_user(dummy, addr);
			__get_user(dummy, addr + bytes - 1);
		}
		lock_page(page);
		->prepare_write()
		kmap_atomic()
		copy_from_user()
		kunmap_atomic()
		->commit_write()
		atomic_dec(&current->mm->dont_unmap_pages);
		unlock_page(page);
	}

and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages.
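
That check might look like (a sketch; dont_unmap_pages is the
hypothetical mm_struct field above):

	/* in mm/rmap.c:try_to_unmap_one(), before touching the pte */
	if (atomic_read(&mm->dont_unmap_pages))
		return SWAP_FAIL;	/* owner is inside an atomic kmap copy */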

Obviously, all this is dependent on CONFIG_HIGHMEM.

Workable?

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08  7:24                                 ` Andrew Morton
@ 2002-07-08  8:09                                   ` Andrea Arcangeli
  2002-07-08 14:50                                     ` William Lee Irwin III
  2002-07-08 20:39                                     ` Andrew Morton
  0 siblings, 2 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-07-08  8:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm

On Mon, Jul 08, 2002 at 12:24:09AM -0700, Andrew Morton wrote:
> Linus Torvalds wrote:
> > 
> > On Sun, 7 Jul 2002, Andrew Morton wrote:
> > >
> > > Probably the biggest offenders are generic_file_read/write.  In
> > > generic_file_write() we're already faulting in the user page(s)
> > > beforehand (somewhat racily, btw).  We could formalise that into
> > > a pin_user_page_range() or whatever and use an atomic kmap
> > > in there.
> > 
> > I'd really prefer not to. We're talking of a difference between one
> > single-cycle instruction (the address should be in the TLB 99% of all
> > times), and a long slow TLB walk with various locks etc.
> > 
> > Anyway, it couldn't be an atomic kmap in file_send_actor anyway, since the
> > write itself may need to block for other reasons (ie socket buffer full
> > etc). THAT is the one that can get misused - the others are not a big
> > deal, I think.
> > 
> > So kmap_atomic definitely doesn't work there.
> > 
> 
> OK, I've been through everything and all the filesystems and
> written four patches which I'll throw away.  I think I know
> how to do all this now.
> 
> - Convert buffer.c to atomic kmaps.
> 
> - prepare_write/commit_write no longer do any implicit kmapping
>   at all.
> 
> - file_read_actor and generic_file_write do their own atomic_kmap
>   (more on this below).
> 
> - file_send_actor still does kmap.
> 
> - If a filesystem wants its page kmapped between prepare and commit,
>   it does it itself.  So
> 
>   foo_prepare_write(struct file *file, struct page *page,
> 		    unsigned from, unsigned to)
>   {
> 	int ret;
> 
> 	/* foo_get_block is the fs's usual block mapper */
> 	ret = block_prepare_write(page, from, to, foo_get_block);
> 	if (ret == 0)
> 		kmap(page);
> 	return ret;
>   }
> 
>   foo_commit_write(struct file *file, struct page *page,
> 		   unsigned from, unsigned to)
>   {
> 	kunmap(page);
> 	return generic_commit_write(file, page, from, to);
>   }
> 
>   So in the case of ext2, we can split the directory and S_ISREG a_ops.
>   The directory a_ops will kmap the page.  The S_ISREG a_ops will not.
> 
> 
> Basically: no implicit kmaps.  You do it yourself if you want it, and
> if you cannot do atomic kmaps.
> 
> 
> Now, file_read_actor and generic_file_write still have the problem
> of the target userspace page getting evicted while they're holding an
> atomic kmap.
> 
> But the rmap page eviction code has the mm_struct.  So can we not do this:
> 
> 	generic_file_write()
> 	{
> 		...
> 		atomic_inc(&current->mm->dont_unmap_pages);
> 
> 		{
> 			/* fault in the first and last byte of the
> 			 * user buffer before taking the atomic kmap */
> 			volatile char dummy;
> 			__get_user(dummy, addr);
> 			__get_user(dummy, addr + bytes - 1);
> 		}
> 		lock_page(page);
> 		->prepare_write()
> 		kmap_atomic()
> 		copy_from_user()
> 		kunmap_atomic()
> 		->commit_write()
> 		atomic_dec(&current->mm->dont_unmap_pages);
> 		unlock_page(page);
> 	}
> 
> and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages.
> 
> Obviously, all this is dependent on CONFIG_HIGHMEM.
> 
> Workable?

The above pseudocode still won't work correctly: if you don't pin the
page as Martin proposed and you only rely on its virtual mapping
staying there, the page can go away under you despite the
swap_out/rmap-unmapping workaround, because a parallel thread can run
munmap+re-mmap under you.  So at the very least you need the mmap_sem
at every generic_file_write to prevent other threads from changing
your virtual address under you.  And you'll basically need to make the
mmap_sem recursive, because you have to take it before running
__get_user to avoid races.  You could easily do that using my rwsem; I
made two versions of them, one of which supports recursion.  However,
this is just for your info, I'm not suggesting to make it recursive.

Furthermore, rmap provides no advantage at all here; swap_out as well
would have to learn about the mm_struct before it has a chance to try
to unmap it.

Side note: I've heard an "I need rmap for this" from a number of
people so far, and they were all wrong; none of them would get any
advantage from rmap.  One of them (the closest to really needing rmap)
wasn't aware that we already have rmap information for all shared
mappings, and he needed it for the same reason we need the rmap
information to keep the shared mappings synchronized with truncate.
The only reason I can imagine rmap being useful on today's hardware
for all kinds of vmas (what the patch provides compared to what we
have now) is to defragment ram more efficiently, with an algorithm in
the memory balancing that provides largepages from mixed zones; if
somebody suggested rmap for this reason (nobody did yet) I would have
to agree completely that it is very useful for that.  OTOH it seems
everybody is reserving (or planning to reserve) a zone for largepages
anyway, so that we don't run into fragmentation in the first place.
And btw - talking about largepages - we have three concurrent and
controversial largepage implementations for linux available today;
they all have different APIs, one is even shipped in production by a
vendor, and while auditing the code I saw that it also exports an API
visible to userspace [ignoring the sysctl] (unlike what I was told):

+#define MAP_BIGPAGE	0x40		/* bigpage mapping */
[..]
 		_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) |
 		_trans(flags, MAP_DENYWRITE, VM_DENYWRITE) |
+		_trans(flags, MAP_BIGPAGE, VM_BIGMAP) |
 		_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE);
 	return prot_bits | flag_bits;
 #undef _trans

that's a new unofficial bitflag to mmap that any proprietary userspace
can pass to mmap today. Other implementations of the largepage feature
use madvise or other syscalls to tell the kernel to allocate
largepages. At least the above won't return -EINVAL, so a binary-only
app will work transparently on a mainline kernel, but it can eventually
malfunction if we use 0x40 for something else in 2.5. So I think we
should do something about largepages too ASAP in 2.5 (like async-io).

Returning to the above kmap hack (assuming you take the mmap_sem and
fix the instability): it will destabilize the vm by design, because it
can run the machine oom despite lots of swap being available.  Think
of all tasks taking the page fault in __get_user due to a swapin at
the same time, and not being able to swap out some memory to resolve
the __get_user swapin because you pinned all the address spaces:
they'll run oom despite there still being lots of swap free.  (Of
course, with the oom killer and the infinite loop in the allocator,
such a condition will deadlock the kernel instead; it's one of the
cases where nobody is going to teach the oom killer to detect it as a
case where it has to oom kill, because there's still lots of vm
available at that time.  So to be accurate: with my vm updates applied
the kernel will run oom, while a mainline kernel will silently
deadlock.)  So I'm not really happy with this
mm-pinning-during-page-fault design (regardless of whether you prefer
to deadlock or to run oom; you know I prefer the latter :).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07 18:28                               ` Linus Torvalds
  2002-07-08  7:11                                 ` Andrea Arcangeli
@ 2002-07-08 10:15                                 ` Eric W. Biederman
  1 sibling, 0 replies; 87+ messages in thread
From: Eric W. Biederman @ 2002-07-08 10:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin J. Bligh, Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm

Linus Torvalds <torvalds@transmeta.com> writes:

> Hmm.. Right now we have the same IDT and GDT on all CPU's, so _if_ the CPU
> is stupid enough to do a locked cycle to update the "A" bit on the
> segments (even if it is already set), you would see horrible cacheline
> bouncing for any interrupt.
> 
> I don't know if that is the case. I'd _assume_ that the microcode was
> clever enough to not do this, but who knows. It should be fairly easily
> testable (just "SMOP") by duplicating the IDT/GDT across CPU's.

If you don't care about the "A" bit, and I don't think we do, this is
trivially preventable: you can set it when you initialize the GDT/IDT
and it will never be updated.

I had to make this change a while ago in LinuxBIOS because P4s lock up
when you load a GDT from a ROM that doesn't have the accessed bit set.

The fact it doesn't lock up is a fairly good proof that no writes happen
when the accessed bit is already set.
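
For illustration, a stand-alone sketch of building a descriptor with
the Accessed bit (descriptor bit 40, the low bit of the type field)
forced on; the helper is invented:

#include <stdint.h>

static uint64_t make_desc(uint32_t base, uint32_t limit,
			  uint8_t type, uint8_t dpl)
{
	uint64_t d = 0;

	type |= 0x1;				/* pre-set the Accessed bit */
	d |= (uint64_t)(limit & 0xffff);	/* limit 15:0 */
	d |= (uint64_t)(base & 0xffffff) << 16;	/* base 23:0 */
	d |= (uint64_t)(type & 0xf) << 40;	/* type, A bit included */
	d |= 1ULL << 44;			/* S=1: code/data segment */
	d |= (uint64_t)(dpl & 3) << 45;		/* privilege level */
	d |= 1ULL << 47;			/* P=1: present */
	d |= (uint64_t)((limit >> 16) & 0xf) << 48;	/* limit 19:16 */
	d |= 1ULL << 54;			/* D/B=1: 32-bit */
	d |= 1ULL << 55;			/* G=1: 4K granularity */
	d |= (uint64_t)(base >> 24) << 56;	/* base 31:24 */
	return d;
}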

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-07  3:47                           ` Andrew Morton
@ 2002-07-08 11:39                               ` John Levon
  0 siblings, 0 replies; 87+ messages in thread
From: John Levon @ 2002-07-08 11:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, linux-mm,
	Martin J. Bligh, linux-kernel

[Excuse the quoting, I was out of the loop on this ...]

On Sat, Jul 06, 2002 at 08:47:54PM -0700, Andrew Morton wrote:

> Linus Torvalds wrote:
> > 
> > I haven't had much time to look at the oprofile thing, but what I _have_
> > seen has made me rather unhappy (especially the horrid system call
> > tracking kludges).

It makes me very unhappy too. There are a number of horrible things
there, mostly for the sake of convenience and performance.
sys_call_table is just the most obviously foul thing.  I'm glad to hear
there is interest in getting some kernel support for such things to be
done tastefully.

> > I'd rather have some generic hooks (a notion of a "profile buffer" and
> > events that cause us to have to synchronize with it, like process
> > switches, mmap/munmap - oprofile wants these too), and some generic helper
> > routines for profiling (turn any eip into a "dentry + offset" pair
> > together with ways to tag specific dentries as being "worthy" of
> > profiling).

How do you see such dentry names being exported to user-space for the
profiling daemon to access ? The current oprofile scheme is, um, less
than ideal ...

> So.  John.  Get coding :-)

I'm interested in doing so but I'd like to hear some more on how people
perceive this working. It essentially means a fork for a lot of the
kernel-side code, so it'd mean a lot more work for us (at least until I
can drop the 2.2/2.4 versions).

regards
john

-- 
"If a thing is not diminished by being shared, it is not rightly owned if
 it is only owned & not shared."
	- St. Augustine

^ permalink raw reply	[flat|nested] 87+ messages in thread


* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08  8:09                                   ` Andrea Arcangeli
@ 2002-07-08 14:50                                     ` William Lee Irwin III
  2002-07-08 20:39                                     ` Andrew Morton
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-07-08 14:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm

On Mon, Jul 08, 2002 at 10:09:53AM +0200, Andrea Arcangeli wrote:
> with truncate. The only reason I can imagine rmap useful in todays
> hardware for all kind of vma (what the patch provides compared to what
> we have now) is to more efficiently defragment ram with an algorithm in
> the memory balancing to provide largepages more efficiently from mixed
> zones, if somebody would suggest rmap for this reason (nobody did yet) I
> would have to agree completely that it is very useful for that, OTOH it

A number of uses of reverse mappings come to mind. There is the use of
rmap for Linux running as a guest instance returning memory to host
OS's and/or firmware. This involves evicting specific physically
contiguous regions. I believe UML and some ppc64 ports would like to do
this. I intend to experiment with using rmap for virtual cache
invalidation on some of my sun4c machines in my spare time as well,
though given the general importance of virtual caches that's not likely
to be a good motivator for reverse mapping support. I believe the 2.4.x-
based rmap tree also had a comment suggesting it could be used to more
directly address general multipage allocation failures due to
fragmentation, but I'm unaware of any particular attempts to use them
for general online defragmentation.

While it appears physical scanning could provide some benefit for page
replacement in the presence of large amounts of shared memory, as the
number of virtual pages present across all processes could be a large
multiple of the number of physical pages present in a system, I'll
leave the final judgment of its effectiveness there to those more
frequently involved in page replacement issues.

Perhaps collecting statistics on pte_chain lengths could be useful here,
as that would give some notion of how much additional work is generated
by sharing for the virtual scan algorithm. On the virtual scanning side,
perhaps collecting statistics on how frequently the memclass() check
fails in try_to_swap_out() might give some notion of how advantageous
physical scanning is with respect to being able to pinpoint pressured
regions. Other statistics like the scan rate and so on might be trickier
with the virtual scan since some pages may be scanned multiple times.
Maybe another thing to check is how often a pte is invalidated without
actually being able to evict the page. I did something for several of
these a while ago but am not sure what happened to it. I'll see what I
can brew up today, especially since there are still some unanswered
questions about another scenario I can reproduce, involving excess
dirty memory, which I've been asked to collect more information on.
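
For the pte_chain length statistic, something as small as this would do
(a rough sketch which assumes the rmap patch hangs a singly-linked
pte_chain directly off struct page -- untested, locking elided):

	/*
	 * Count how many ptes currently map this page, i.e. the amount
	 * of extra unmapping work sharing creates for the physical scan.
	 */
	static int pte_chain_length(struct page *page)
	{
		struct pte_chain *pc;
		int len = 0;

		for (pc = page->pte_chain; pc != NULL; pc = pc->next)
			len++;
		return len;
	}

Bucketing those lengths into a histogram at scan time would show how
heavy the tail really is.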

Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-07  6:13                         ` scalable kmap (was Re: vm lock contention reduction) Martin J. Bligh
  2002-07-07  6:37                           ` Andrew Morton
  2002-07-07  7:53                           ` Linus Torvalds
@ 2002-07-08 17:29                           ` Martin J. Bligh
  2002-07-08 22:14                             ` Linus Torvalds
  2002-07-09  3:17                             ` Andrew Morton
  2 siblings, 2 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-08 17:29 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm

OK, here's the data from Keith that I was promising on kmap. This was just
for a kernel compile. So copy_strings and file_read_actor seem to be the
main users (for this workload) by an order of magnitude.

                0.00    0.00       1/3762429     remove_arg_zero [618]
                0.00    0.00      10/3762429     ext2_set_link [576]
                0.00    0.00     350/3762429     block_read_full_page [128]
                0.00    0.00     750/3762429     ext2_empty_dir [476]
                0.02    0.00   11465/3762429     ext2_delete_entry [217]
                0.02    0.00   12983/3762429     ext2_inode_by_name [228]
                0.03    0.00   15400/3762429     ext2_add_link [182]
                0.03    0.00   16621/3762429     ext2_find_entry [198]
                0.06    0.00   33016/3762429     ext2_readdir [79]
                0.13    0.00   71900/3762429     generic_file_write [109]
                0.17    0.00   95589/3762429     generic_commit_write [255]
                2.25    0.00 1229513/3762429     file_read_actor [50]
                4.15    0.00 2274831/3762429     copy_strings [36]
[105]    0.2    6.87    0.00 3762429         kunmap_high [105]


and 

                0.00    0.00       1/3762429     remove_arg_zero [618]
                0.00    0.00     350/3762429     block_read_full_page [128]
                0.22    0.01   71900/3762429     generic_file_write [109]
                0.27    0.01   90245/3762429     ext2_get_page [176]
                0.29    0.01   95589/3762429     _block_prepare_write [141]
                3.72    0.11 1229513/3762429     file_read_actor [50]
                6.88    0.21 2274831/3762429     copy_strings [36]
[87]     0.3   11.38    0.35 3762429         kmap_high [87]
                0.35    0.00      27/27          flush_all_zero_pkmaps [273]

This kernel had a larger kmap area (which is why flush_all_zero_pkmaps
is called so rarely).  It is a 2.4.18 tree with the O(1) scheduler.

----------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-08 11:39                               ` John Levon
@ 2002-07-08 17:52                                 ` Linus Torvalds
  -1 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2002-07-08 17:52 UTC (permalink / raw)
  To: John Levon
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm,
	Martin J. Bligh, linux-kernel



On Mon, 8 Jul 2002, John Levon wrote:
>
> > I'd rather have some generic hooks (a notion of a "profile buffer" and
> > events that cause us to have to synchronize with it, like process
> > switches, mmap/munmap - oprofile wants these too), and some generic helper
> > routines for profiling (turn any eip into a "dentry + offset" pair
> > together with ways to tag specific dentries as being "worthy" of
> > profiling).
>
> How do you see such dentry names being exported to user-space for the
> profiling daemon to access ? The current oprofile scheme is, um, less
> than ideal ...

Ok, I'll outline my personal favourite interface, but I'd also better
point out that while I've thought a bit about what I'd like to have and
how it could be implemented in the kernel, I have _not_ actually tried any
of it out, much less thought about what the user level stuff really needs.

Anyway, here goes a straw-man:

 - I'd associate each profiling event with a dentry/offset pair, simply
   because that's the highest-level thing that the kernel knows about and
   that is "static".

 - I'd suggest that the profiler explicitly mark the dentries it wants
   profiled, so that the kernel can throw away events that we're not
   interested in. The marking function would return a cookie to user
   space, and increment the dentry count (along with setting the
   "profile" flag in the dentry)

 - the "cookie" (which would most easily just be the kernel address of the
   dentry) would be the thing that we give to user-space (along with
   offset) on profile read. The user app can turn it back into a filename.

Whether it is the original "mark this file for profiling" phase that saves
away the cookie<->filename association, or whether we also have a system
call for "return the path of this cookie", I don't much care about.
Details, details.
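
In (very rough) C, the kernel side might be as small as this -- note
that every name here is invented, none of it exists today:

	/* one record in the profile buffer */
	struct profile_event {
		unsigned long cookie;	/* kernel address of the dentry */
		unsigned long offset;	/* offset of the eip in the object */
	};

	/* mark a dentry as "worthy", hand the cookie back to user space */
	long profile_mark(int fd)
	{
		struct file *file = fget(fd);
		long cookie;

		if (!file)
			return -EBADF;
		dget(file->f_dentry);			/* pin it */
		file->f_dentry->d_flags |= DCACHE_PROFILE; /* invented flag */
		cookie = (long)file->f_dentry;
		fput(file);
		return cookie;
	}

with the profile buffer itself just being an array of profile_event
entries that the daemon reads and post-processes.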

Anyway, what would be the preferred interface from user level?

		Linus


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-08 17:52                                 ` Linus Torvalds
@ 2002-07-08 18:41                                   ` Karim Yaghmour
  -1 siblings, 0 replies; 87+ messages in thread
From: Karim Yaghmour @ 2002-07-08 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Levon, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob


Linus Torvalds wrote:
> On Mon, 8 Jul 2002, John Levon wrote:
> > How do you see such dentry names being exported to user-space for the
> > profiling daemon to access ? The current oprofile scheme is, um, less
> > than ideal ...
> 
> Ok, I'll outline my personal favourite interface, but I'd also better
> point out that while I've thought a bit about what I'd like to have and
> how it could be implemented in the kernel, I have _not_ actually tried any
> of it out, much less thought about what the user level stuff really needs.

Sure. I've done some work on profiling using trace hooks. Hopefully the
following is useful.

> Anyway, here goes a straw-man:
> 
>  - I'd associate each profiling event with a dentry/offset pair, simply
>    because that's the highest-level thing that the kernel knows about and
>    that is "static".

dentry + offset: on a 32-bit machine, this is 8 bytes total per event being
profiled. That is a lot of information if you are trying to sustain a
high-volume throughput. You can almost always skip the dentry, since you
know about scheduling changes and since you can take a system-state
snapshot at the beginning of the profiling. After that, the eip alone is
sufficient and can easily be correlated to a meaningful entry in a file
in user-space.

>  - I'd suggest that the profiler explicitly mark the dentries it wants
>    profiled, so that the kernel can throw away events that we're not
>    interested in. The marking function would return a cookie to user
>    space, and increment the dentry count (along with setting the
>    "profile" flag in the dentry)

Or the kernel can completely ignore this sort of selection and leave it
all to the agent responsible for collecting the events. This is what is
done in LTT. Currently you can select a single PID, GID, or UID, but
this is easily extendable to cover many. Of course, if you agree to the
task struct carrying a "trace" or "profile" field, then this would be
much easier.
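
With such a field, the filtering at collection time collapses into
something like this (field name invented):

	/* in trace_event(): cheap per-task filter */
	if (!current->trace_flags)
		return;

instead of comparing the current PID/GID/UID against a list on every
single event.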

>  - the "cookie" (which would most easily just be the kernel address of the
>    dentry) would be the thing that we give to user-space (along with
>    offset) on profile read. The user app can turn it back into a filename.

That's the typical scheme and the one possible with the data retrieved
by LTT.

> Whether it is the original "mark this file for profiling" phase that saves
> away the cookie<->filename association, or whether we also have a system
> call for "return the path of this cookie", I don't much care about.
> Details, details.
> 
> Anyway, what would be the preferred interface from user level?

The approach LTT takes is that no part in the kernel should actually care
about the user level needs. Anything in user level that has to modify
the tracing/profiling makes its requests to the trace driver, /dev/tracer.
No additional system calls, no special cases in the main kernel code. Only
3 main files:
kernel/trace.c: The main entry point for all events (trace_event())
drivers/trace/tracer.c: The trace driver
include/linux/trace.h: The trace hook definitions
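
So instrumenting a kernel code path boils down to one call at the site
of interest, roughly like this (the event id and payload struct are
invented for illustration):

	#include <linux/trace.h>

	void some_kernel_path(void)
	{
		struct foo_event ev;

		ev.value = 42;
		/* queued into the buffer that /dev/tracer exposes */
		trace_event(TRACE_EV_FOO, &ev);
	}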

Cheers,

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08  8:09                                   ` Andrea Arcangeli
  2002-07-08 14:50                                     ` William Lee Irwin III
@ 2002-07-08 20:39                                     ` Andrew Morton
  2002-07-08 21:08                                       ` Benjamin LaHaise
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-07-08 20:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Martin J. Bligh, Rik van Riel, linux-mm

Andrea Arcangeli wrote:
> 
> ...
> >       generic_file_write()
> >       {
> >               ...
> >               atomic_inc(&current->mm->dont_unmap_pages);
> >
> >               {
> >                       volatile char dummy;
> >                       __get_user(dummy, addr);
> >                       __get_user(dummy, addr+bytes+1);
> >               }
> >               lock_page();
> >               ->prepare_write()
> >               kmap_atomic()
> >               copy_from_user()
> >               kunmap_atomic()
> >               ->commit_write()
> >               atomic_dec(&current->mm->dont_unmap_pages);
> >               unlock_page()
> >       }
> >
> > and over in mm/rmap.c:try_to_unmap_one(), check mm->dont_unmap_pages.
> >
> > Obviously, all this is dependent on CONFIG_HIGHMEM.
> >
> > Workable?
> 
> the above pseudocode still won't work correctly,

Sure.  It's crap.  It can be used to get mlockall() for free.

>  if you don't pin the
> page as Martin proposed and you only rely on its virtual mapping to stay
> there because the page can go away under you despite the
> swap_out/rmap-unmapping work, if there's a parallel thread running
> munmap+re-mmap under you. So at the very least you need the mmap_sem at
> every generic_file_write to avoid other threads to change your virtual
> address under you. And you'll basically need to make the mmap_sem
> recursive, because you have to take it before running __get_user to
> avoid races. You could easily do that using my rwsem, I made two versions
> of them, with one that supports recursion, however this is just for your
> info, I'm not suggesting to make it recursive.

I think I'll just go for pinning the damn page.  It's a spinlock and
maybe three cachelines but the kernel is about to do a 4k memcpy
anyway.  And get_user_pages() doesn't show up much on O_DIRECT
profiles and it'll be a net win and we need to do SOMETHING, dammit.
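
Something along these lines in generic_file_write(), before taking the
page lock (untested sketch; error handling elided):

	struct page *user_page;
	int ret;

	/* pin the user page so unmapping can't yank it from under us */
	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, (unsigned long)buf,
				1, 0, 0, &user_page, NULL);
	up_read(&current->mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	/* ... ->prepare_write / copy_from_user / ->commit_write ... */

	page_cache_release(user_page);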
 
> ...
> The only reason I can imagine rmap useful in todays
> hardware for all kind of vma (what the patch provides compared to what
> we have now) is to more efficiently defragment ram with an algorithm in
> the memory balancing to provide largepages more efficiently from mixed
> zones, if somebody would suggest rmap for this reason (nobody did yet)

It has been discussed.  But no action yet.

> I
> would have to agree completely that it is very useful for that, OTOH it
> seems everybody is reserving (or planning to reserve) a zone for
> largepages anyways so that we don't run into fragmentation in the first
> place. And btw - talking about largepages - we have three concurrent and
> controversial largepage implementations for linux available today, they
> all have different API, one is even shipped in production by a vendor,

What implementation do you favour?

> and while auditing the code I seen it also exports an API visible to
> userspace [ignoring the sysctl] (unlike what I was told):
> 
> +#define MAP_BIGPAGE    0x40            /* bigpage mapping */
> [..]
>                 _trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) |
>                 _trans(flags, MAP_DENYWRITE, VM_DENYWRITE) |
> +               _trans(flags, MAP_BIGPAGE, VM_BIGMAP) |
>                 _trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE);
>         return prot_bits | flag_bits;
>  #undef _trans
> 
> that's a new unofficial bitflag to mmap that any proprietary userspace
> can pass to mmap today. Other implementations of the largepage feature
> use madvise or other syscalls to tell the kernel to allocate
> largepages. At least the above won't return -EINVAL so the binaryonly
> app will work transparently on a mainline kernel, but it can eventually
> malfunction if we use 0x40 for something else in 2.5. So I think we
> should do something about the largepages too ASAP into 2.5 (like
> async-io).

Yup.  I don't think the -aa kernel has a large page patch, does it?
Is that something which you have time to look into?
 
-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 20:39                                     ` Andrew Morton
@ 2002-07-08 21:08                                       ` Benjamin LaHaise
  2002-07-08 21:45                                         ` Andrew Morton
  0 siblings, 1 reply; 87+ messages in thread
From: Benjamin LaHaise @ 2002-07-08 21:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel,
	linux-mm

On Mon, Jul 08, 2002 at 01:39:04PM -0700, Andrew Morton wrote:
> I think I'll just go for pinning the damn page.  It's a spinlock and
> maybe three cachelines but the kernel is about to do a 4k memcpy
> anyway.  And get_user_pages() doesn't show up much on O_DIRECT
> profiles and it'll be a net win and we need to do SOMETHING, dammit.

Pinning the page costs too much (remember, it's only a win with a 
reduced copy of more than 512 bytes).  The right way of doing it is 
letting copy_*_user fail on a page fault for places like this where 
we need to drop locks before going into the page fault handler.

		-ben
-- 
"You will be reincarnated as a toad; and you will be much happier."
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 21:08                                       ` Benjamin LaHaise
@ 2002-07-08 21:45                                         ` Andrew Morton
  2002-07-08 22:24                                           ` Benjamin LaHaise
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-07-08 21:45 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel,
	linux-mm

Benjamin LaHaise wrote:
> 
> On Mon, Jul 08, 2002 at 01:39:04PM -0700, Andrew Morton wrote:
> > I think I'll just go for pinning the damn page.  It's a spinlock and
> > maybe three cachelines but the kernel is about to do a 4k memcpy
> > anyway.  And get_user_pages() doesn't show up much on O_DIRECT
> > profiles and it'll be a net win and we need to do SOMETHING, dammit.
> 
> Pinning the page costs too much (remember, it's only a win with a
> reduced copy of more than 512 bytes).

Could you expand on that?

>  The right way of doing it is
> letting copy_*_user fail on a page fault for places like this where
> we need to drop locks before going into the page fault handler.

OK.  There are a few things which need to be fixed up in there.  One
is to drop and reacquire the atomic kmap.  Another is the page
lock (for the write-to-mmaped-page-from-the-same-file thing).
Another is to undo the ->prepare_write call.  Or to remember to not
run it again on the retry.
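
In outline, assuming a copy_*_user variant which just returns short
instead of faulting while we hold the atomic kmap (untested, and see
below for why the unlock_page() is the dubious bit):

	again:
		/* ->prepare_write() -- must be skipped on the retry */
		kaddr = kmap_atomic(page, KM_FILEMAP);
		left = __copy_from_user(kaddr + offset, buf, bytes);
		kunmap_atomic(kaddr, KM_FILEMAP);
		if (left) {
			unlock_page(page);
			fault_in_page(buf);	/* invented: touch the user page */
			lock_page(page);
			goto again;
		}
		/* ->commit_write() */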

It's really the page lock which is the tricky one.  It could be
a new, uninitialised page.  It's in pagecache and it is not
fully uptodate.  If we drop the page lock and that page is
inside i_size then the kernel has exposed uninitialised data.

Tricky.   A sleazy approach would be to not unlock the page at
all. ie: no change.  Sure, the kernel can deadlock.  But it's
always been that way - the deadlock requires two improbable things,
whereas the schedule-inside-atomic-kmap requires just one.

hmm.  Bit stumped on that one.

Btw, is it safe to drop and reacquire an atomic kmap if you
found out that you accidentally slept while holding it?

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 17:29                           ` Martin J. Bligh
@ 2002-07-08 22:14                             ` Linus Torvalds
  2002-07-09  0:16                               ` Andrew Morton
  2002-07-09  3:17                             ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2002-07-08 22:14 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, linux-mm


On Mon, 8 Jul 2002, Martin J. Bligh wrote:
>
> OK, here's the data from Keith that I was promising on kmap. This was just
> for a kernel compile. So copy_strings and file_read_actor seem to be the
> main users (for this workload) by an order of magnitude.

Ok, both the top two (by far) users are basically just "copy_to_user()"
and "copy_from_user()".

What we could do is to make a special case for the copy_xx_user() stuff,
and have page faulting fixing those two special cases up (kunmap before
calling handle_mm_fault, and then re-kmap and fixing up the address just
before returning).

It's even easy to check what to trigger on: if we have a magic "atomic kmap
that handles page faults correctly" thing, such a thing would need to be
preempt safe due to the atomic kmap anyway - so we could trigger the
special case on the faulting code being non-preemptable.

Basically, the only thing it would require would be a slightly magic
"calling convention", where some register holds the page pointer, and
another register holds the "mapped address" pointer, and then we'd have
something like

	do_page_fault(..)
	{
		....

	+	if (current->preempt_count)
	+		kunmap_atomic(ptregs->page_reg);

		switch (handle_mm_fault(mm, vma, address, write)) {
		....
		}

	+	if (current->preempt_count)
	+		ptregs->addr_reg = (ptregs->addr_reg & ~PAGE_MASK) | kmap_atomic(ptregs->page_reg);

		...

which basically allows us to hold "atomic" kmap's over a page fault (and
_only_ over a page fault, it wouldn't help for anything but the user copy
case).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 21:45                                         ` Andrew Morton
@ 2002-07-08 22:24                                           ` Benjamin LaHaise
  0 siblings, 0 replies; 87+ messages in thread
From: Benjamin LaHaise @ 2002-07-08 22:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Linus Torvalds, Martin J. Bligh, Rik van Riel,
	linux-mm

On Mon, Jul 08, 2002 at 02:45:35PM -0700, Andrew Morton wrote:
> > Pinning the page costs too much (remember, it's only a win with a
> > reduced copy of more than 512 bytes).
> 
> Could you expand on that?

I'm going from data that I gathered while fiddling with aio and the pipe
code.  As a thought experiment, look at it this way: pinning the page
involves a minimum of 4-5 data-dependent cache line accesses (mm struct,
2-3 page table levels, then a locked cycle on the page struct itself),
compared to the use of tlb entries that are likely to be present already
(free, plus recent cpus have hardware to prefetch them completely
asynchronously to instruction execution).

> >  The right way of doing it is
> > letting copy_*_user fail on a page fault for places like this where
> > we need to drop locks before going into the page fault handler.
> 
> OK.  There are a few things which need to be fixed up in there.  One
> is to drop and reacquire the atomic kmap.  Another is the page
> lock (for the write-to-mmaped-page-from-the-same-file thing).
> Another is to undo the ->prepare_write call.  Or to remember to not
> run it again on the retry.
> 
> It's really the page lock which is the tricky one.  It could be
> a new, uninitialised page.  It's in pagecache and it is not
> fully uptodate.  If we drop the page lock and that page is
> inside i_size then the kernel has exposed uninitialised data.

Hmmm, do we really need to insert a new, uninitialised page into 
the page cache before filling it with data?  If we could defer the 
insertion until the data has been copied into the page, the problem 
would go away (most of the time there would be no collisions during 
writes, so a spurious copy is unlikely).

Side note: I did an alternative fix for this which just stuffed a 
copy of the struct page * into the task struct, and checked for this 
inside filemap.c.  Very gross, but it worked.

> Tricky.   A sleazy approach would be to not unlock the page at
> all. ie: no change.  Sure, the kernel can deadlock.  But it's
> always been that way - the deadlock requires two improbable things,
> whereas the schedule-inside-atomic-kmap requires just one.

It's not unlikely if you've got a malicious user behind the shell.

> Btw, is it safe to drop and reacquire an atomic kmap if you
> found out that you accidentally slept while holding it?

Yes and no: it works, but if debugging is enabled it bugs out.

		-ben
-- 
"You will be reincarnated as a toad; and you will be much happier."
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 22:14                             ` Linus Torvalds
@ 2002-07-09  0:16                               ` Andrew Morton
  0 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-09  0:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin J. Bligh, Andrea Arcangeli, Rik van Riel, linux-mm

Linus Torvalds wrote:
> 
> ...
>         do_page_fault(..)
>         {
>                 ....
> 
>         +       if (current->preempt_count)
>         +               kunmap_atomic(ptregs->page_reg);
> 
>                 switch (handle_mm_fault(mm, vma, address, write)) {
>                 ....
>                 }
> 
>         +       if (current->preempt_count)
>         +               ptregs->addr_reg = (ptregs->addr_reg & ~PAGE_MASK) | kmap_atomic(ptregs->page_reg);
> 
>                 ...
> 
> which basically allows us to hold "atomic" kmap's over a page fault (and
> _only_ over a page fault, it wouldn't help for anything but the user copy
> case).

That'll work.

We need to use a bit more than preempt_count, because that
doesn't get incremented when CONFIG_PREEMPT=n, and we need
to know stuff like the kmap type and the page and the virtual
address within the fault handler.

I'll work on the below patch.  I'll need to audit the various
copy_*_user implementations to make sure that esi and edi are
really always the right registers to use.  Sigh.  I've been
trying to avoid understanding the x86 instruction set.

 arch/i386/mm/fault.c            |   39 +++++++++++++++++++++++++++++++++
 include/asm-i386/kmap_types.h   |    3 +-
 include/asm-ppc/kmap_types.h    |    1 
 include/asm-sparc/kmap_types.h  |    1 
 include/asm-x86_64/kmap_types.h |    1 
 include/linux/highmem.h         |   46 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h           |    7 ++++++
 mm/filemap.c                    |    8 +++++-
 8 files changed, 103 insertions(+), 3 deletions(-)

--- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/arch/i386/mm/fault.c	Mon Jul  8 17:13:16 2002
@@ -18,6 +18,7 @@
 #include <linux/ptrace.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/interrupt.h>
@@ -138,6 +139,40 @@ void bust_spinlocks(int yes)
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+static inline void highmem_remove_atomic_kmap(void)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	printk("%s\n", __FUNCTION__);
+	if (cus)
+		kunmap_atomic(cus->kaddr, cus->type);
+}
+
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus) {
+		long *reg;
+		unsigned offset;
+
+		cus->kaddr = kmap_atomic(cus->page, cus->type);
+		if (cus->src)
+			reg = &regs->esi;
+		else
+			reg = &regs->edi;
+		offset = *reg & (PAGE_SIZE - 1);
+		*reg = ((long)cus->kaddr) | offset;
+	}
+}
+#else
+static inline void highmem_remove_atomic_kmap(void)
+{}
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{}
+#endif
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 extern unsigned long idt;
 
@@ -206,6 +241,8 @@ asmlinkage void do_page_fault(struct pt_
 	}
 #endif
 
+	highmem_remove_atomic_kmap();
+
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -283,6 +320,7 @@ good_area:
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 	return;
 
 /*
@@ -291,6 +329,7 @@ good_area:
  */
 bad_area:
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
--- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/asm-i386/kmap_types.h	Mon Jul  8 17:13:16 2002
@@ -19,7 +19,8 @@ D(5)	KM_BIO_SRC_IRQ,
 D(6)	KM_BIO_DST_IRQ,
 D(7)	KM_PTE0,
 D(8)	KM_PTE1,
-D(9)	KM_TYPE_NR
+D(9)	KM_FILEMAP,
+D(10)	KM_TYPE_NR
 };
 
 #undef D
--- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/asm-ppc/kmap_types.h	Mon Jul  8 17:13:16 2002
@@ -15,6 +15,7 @@ enum km_type {
 	KM_BIO_DST_IRQ,
 	KM_PTE0,
 	KM_PTE1,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/asm-sparc/kmap_types.h	Mon Jul  8 17:13:16 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h	Mon Jul  8 17:13:16 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/linux/highmem.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/linux/highmem.h	Mon Jul  8 17:13:16 2002
@@ -10,12 +10,44 @@
 extern struct page *highmem_start_page;
 
 #include <asm/highmem.h>
+#include <asm/kmap_types.h>
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
 
 extern void check_highmem_ptes(void);
 
+/*
+ * Used when performing a copy_*_user while holding an atomic kmap
+ */
+struct copy_user_state {
+	struct page *page;		/* The page which is kmap_atomiced */
+	void *kaddr;			/* Its mapping */
+	enum km_type type;		/* Its offset */
+	int src;			/* 1: fixup ESI.  0: Fixup EDI */
+};
+
+/*
+ * `src' is true if the kmap_atomic virtual address is the source of the copy.
+ */
+static inline void
+pre_kmap_copy_user(struct copy_user_state *state, struct page *page,
+			void *kaddr, enum km_type type, int src)
+{
+	state->page = page;
+	state->kaddr = kaddr;
+	state->type = type;
+	state->src = src;
+	BUG_ON(current->copy_user_state != NULL);
+	current->copy_user_state = state;
+}
+
+static inline void post_kmap_copy_user(void)
+{
+	BUG_ON(current->copy_user_state == NULL);
+	current->copy_user_state = NULL;
+}
+	
 #else /* CONFIG_HIGHMEM */
 
 static inline unsigned int nr_free_highpages(void) { return 0; }
@@ -26,6 +58,20 @@ static inline void *kmap(struct page *pa
 
 #define kmap_atomic(page,idx)		kmap(page)
 #define kunmap_atomic(page,idx)		kunmap(page)
+
+struct copy_user_state {
+	int is_gcc_still_buggy;		/* ? */
+};
+
+static inline void
+pre_kmap_copy_user(struct copy_user_state *state, struct page *page,
+			void *kaddr, enum km_type type, int src)
+{
+}
+
+static inline void post_kmap_copy_user(void)
+{
+}
 
 #endif /* CONFIG_HIGHMEM */
 
--- 2.5.25/include/linux/sched.h~linus-copy_user-hack	Mon Jul  8 17:13:16 2002
+++ 2.5.25-akpm/include/linux/sched.h	Mon Jul  8 17:13:16 2002
@@ -248,6 +248,10 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 
+#ifdef CONFIG_HIGHMEM
+struct copy_user_state;
+#endif
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -365,6 +369,9 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 	struct dentry *proc_dentry;
+#ifdef CONFIG_HIGHMEM
+	struct copy_user_state *copy_user_state;
+#endif
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
--- 2.5.25/mm/filemap.c~linus-copy_user-hack	Mon Jul  8 17:13:22 2002
+++ 2.5.25-akpm/mm/filemap.c	Mon Jul  8 17:13:30 2002
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
 #include <linux/iobuf.h>
@@ -1186,14 +1187,17 @@ static ssize_t generic_file_direct_IO(in
 int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
 {
 	char *kaddr;
+	struct copy_user_state copy_user_state;
 	unsigned long left, count = desc->count;
 
 	if (size > count)
 		size = count;
 
-	kaddr = kmap(page);
+	kaddr = kmap_atomic(page, KM_FILEMAP);
+	pre_kmap_copy_user(&copy_user_state, page, kaddr, KM_FILEMAP, 1);
 	left = __copy_to_user(desc->buf, kaddr + offset, size);
-	kunmap(page);
+	post_kmap_copy_user();
+	kunmap_atomic(kaddr, KM_FILEMAP);
 	
 	if (left) {
 		size -= left;

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-08 17:29                           ` Martin J. Bligh
  2002-07-08 22:14                             ` Linus Torvalds
@ 2002-07-09  3:17                             ` Andrew Morton
  2002-07-09  4:28                               ` Martin J. Bligh
  2002-07-09 13:59                               ` Benjamin LaHaise
  1 sibling, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-09  3:17 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

"Martin J. Bligh" wrote:
> 
> OK, here's the data from Keith that I was promising on kmap. This was just
> for a kernel compile. So copy_strings and file_read_actor seem to be the
> main users (for this workload) by an order of magnitude.
> 
>                 0.00    0.00       1/3762429     remove_arg_zero [618]
>                 0.00    0.00      10/3762429     ext2_set_link [576]
>                 0.00    0.00     350/3762429     block_read_full_page [128]
>                 0.00    0.00     750/3762429     ext2_empty_dir [476]
>                 0.02    0.00   11465/3762429     ext2_delete_entry [217]
>                 0.02    0.00   12983/3762429     ext2_inode_by_name [228]
>                 0.03    0.00   15400/3762429     ext2_add_link [182]
>                 0.03    0.00   16621/3762429     ext2_find_entry [198]
>                 0.06    0.00   33016/3762429     ext2_readdir [79]
>                 0.13    0.00   71900/3762429     generic_file_write [109]
>                 0.17    0.00   95589/3762429     generic_commit_write [255]
>                 2.25    0.00 1229513/3762429     file_read_actor [50]
>                 4.15    0.00 2274831/3762429     copy_strings [36]
> [105]    0.2    6.87    0.00 3762429         kunmap_high [105]


It's a bit weird that copy_strings is the heaviest user of kmap().
I bet it's kmapping the same page over and over.  A little cache
there will help.  I'll fix that up.

So here's the patch.  Seems to work.  Across a `make -j6 bzImage'
the number of calls to kmap_high() went from 490,429 down to 41,174.

And guess what?   Zero change in wallclock time.

With:
        make -j6 bzImage  399.97s user 32.95s system 374% cpu 1:55.51 total
        kmap_high_counter: 41174

Without:
        make -j6 bzImage  400.06s user 32.74s system 377% cpu 1:54.78 total
        kmap_high_counter: 490429

Quad PIII Xeon, 2.5G RAM.

Any theories?


 arch/i386/lib/usercopy.c        |   10 +++++
 arch/i386/mm/fault.c            |   40 ++++++++++++++++++++
 fs/exec.c                       |    6 ++-
 include/asm-i386/kmap_types.h   |    3 +
 include/asm-i386/processor.h    |    2 +
 include/asm-ppc/kmap_types.h    |    1 
 include/asm-sparc/kmap_types.h  |    1 
 include/asm-x86_64/kmap_types.h |    1 
 include/linux/highmem.h         |   78 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h           |    5 ++
 mm/filemap.c                    |   11 +++--
 11 files changed, 151 insertions(+), 7 deletions(-)

--- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/arch/i386/mm/fault.c	Mon Jul  8 20:08:20 2002
@@ -18,6 +18,7 @@
 #include <linux/ptrace.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/interrupt.h>
@@ -138,6 +139,39 @@ void bust_spinlocks(int yes)
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+static inline void highmem_remove_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus)
+		kunmap_atomic(cus->kaddr, cus->type);
+}
+
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus) {
+		long *reg;
+		unsigned offset;
+
+		cus->kaddr = kmap_atomic(cus->page, cus->type);
+		if (cus->src)
+			reg = &regs->esi;
+		else
+			reg = &regs->edi;
+		offset = *reg & (PAGE_SIZE - 1);
+		*reg = ((long)cus->kaddr) | offset;
+	}
+}
+#else
+static inline void highmem_remove_atomic_kmap(struct pt_regs *regs)
+{}
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{}
+#endif
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 extern unsigned long idt;
 
@@ -206,6 +240,8 @@ asmlinkage void do_page_fault(struct pt_
 	}
 #endif
 
+	highmem_remove_atomic_kmap(regs);
+
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -267,8 +303,10 @@ good_area:
 			tsk->maj_flt++;
 			break;
 		case VM_FAULT_SIGBUS:
+			highmem_restore_atomic_kmap(regs);
 			goto do_sigbus;
 		case VM_FAULT_OOM:
+			highmem_restore_atomic_kmap(regs);
 			goto out_of_memory;
 		default:
 			BUG();
@@ -283,6 +321,7 @@ good_area:
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 	return;
 
 /*
@@ -291,6 +330,7 @@ good_area:
  */
 bad_area:
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
--- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/asm-i386/kmap_types.h	Mon Jul  8 19:58:03 2002
@@ -19,7 +19,8 @@ D(5)	KM_BIO_SRC_IRQ,
 D(6)	KM_BIO_DST_IRQ,
 D(7)	KM_PTE0,
 D(8)	KM_PTE1,
-D(9)	KM_TYPE_NR
+D(9)	KM_FILEMAP,
+D(10)	KM_TYPE_NR
 };
 
 #undef D
--- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/asm-ppc/kmap_types.h	Mon Jul  8 19:58:03 2002
@@ -15,6 +15,7 @@ enum km_type {
 	KM_BIO_DST_IRQ,
 	KM_PTE0,
 	KM_PTE1,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/asm-sparc/kmap_types.h	Mon Jul  8 19:58:03 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h	Mon Jul  8 19:58:03 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/linux/highmem.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/linux/highmem.h	Mon Jul  8 20:13:05 2002
@@ -3,6 +3,7 @@
 
 #include <linux/config.h>
 #include <linux/fs.h>
+#include <asm/processor.h>
 #include <asm/cacheflush.h>
 
 #ifdef CONFIG_HIGHMEM
@@ -10,6 +11,7 @@
 extern struct page *highmem_start_page;
 
 #include <asm/highmem.h>
+#include <asm/kmap_types.h>
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
@@ -72,4 +74,80 @@ static inline void copy_user_highpage(st
 	kunmap_atomic(vto, KM_USER1);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP)
+/*
+ * Used when performing a copy_*_user while holding an atomic kmap
+ */
+struct copy_user_state {
+	struct page *page;		/* The page which is kmap_atomiced */
+	void *kaddr;			/* Its mapping */
+	enum km_type type;		/* Its offset */
+	int src;			/* 1: fixup ESI.  0: Fixup EDI */
+};
+
+/*
+ * `src' is true if the kmap_atomic virtual address is the source of the copy.
+ */
+static inline void *
+kmap_copy_user(struct copy_user_state *cus, struct page *page,
+			enum km_type type, int src)
+{
+	cus->page = page;
+	cus->kaddr = kmap_atomic(page, type);
+	if (PageHighMem(page)) {
+		cus->type = type;
+		cus->src = src;
+		BUG_ON(current->copy_user_state != NULL);
+		current->copy_user_state = cus;
+	}
+	return cus->kaddr;
+}
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	if (PageHighMem(cus->page)) {
+		BUG_ON(current->copy_user_state != cus);
+		kunmap_atomic(cus->kaddr, cus->type);
+		current->copy_user_state = NULL;
+		cus->page = NULL;	/* debug */
+	}
+}
+
+/*
+ * After a copy_*_user, the kernel virtual address may be different.  So
+ * use kmap_copy_user_addr() to get the new value.
+ */
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return cus->kaddr;
+}
+
+#else
+
+struct copy_user_state {
+	struct page *page;
+};
+
+/*
+ * This must be a macro because `type' may be undefined
+ */
+
+#define kmap_copy_user(cus, page, type, src)	\
+	({					\
+		(cus)->page = (page);		\
+		kmap(page);			\
+	})
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	kunmap(cus->page);
+}
+
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return page_address(cus->page);
+}
+
+#endif
+
 #endif /* _LINUX_HIGHMEM_H */
--- 2.5.25/include/linux/sched.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/linux/sched.h	Mon Jul  8 20:14:08 2002
@@ -248,6 +248,8 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 
+struct copy_user_state;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -365,6 +367,9 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 	struct dentry *proc_dentry;
+#ifdef CONFIG_HIGHMEM
+	struct copy_user_state *copy_user_state;
+#endif
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
--- 2.5.25/mm/filemap.c~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/mm/filemap.c	Mon Jul  8 19:58:03 2002
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
 #include <linux/iobuf.h>
@@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in
 	return retval;
 }
 
-int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+int file_read_actor(read_descriptor_t *desc, struct page *page,
+			unsigned long offset, unsigned long size)
 {
 	char *kaddr;
+	struct copy_user_state copy_user_state;
 	unsigned long left, count = desc->count;
 
 	if (size > count)
 		size = count;
 
-	kaddr = kmap(page);
+	kaddr = kmap_copy_user(&copy_user_state, page, KM_FILEMAP, 1);
 	left = __copy_to_user(desc->buf, kaddr + offset, size);
-	kunmap(page);
-	
+	kunmap_copy_user(&copy_user_state);
+
 	if (left) {
 		size -= left;
 		desc->error = -EFAULT;
--- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/include/asm-i386/processor.h	Mon Jul  8 19:58:03 2002
@@ -485,4 +485,6 @@ extern inline void prefetchw(const void 
 
 #endif
 
+#define ARCH_HAS_KMAP_FIXUP
+
 #endif /* __ASM_I386_PROCESSOR_H */
--- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/arch/i386/lib/usercopy.c	Mon Jul  8 19:58:03 2002
@@ -11,6 +11,16 @@
 
 #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS
 
+/*
+ * We cannot use the mmx functions here with the kmap_atomic fixup
+ * code.
+ *
+ * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere.
+ * Maybe kill this code?
+ */
+
+#error this will not work
+
 unsigned long
 __generic_copy_to_user(void *to, const void *from, unsigned long n)
 {
--- 2.5.25/fs/exec.c~linus-copy_user-hack	Mon Jul  8 19:58:03 2002
+++ 2.5.25-akpm/fs/exec.c	Mon Jul  8 19:58:03 2002
@@ -203,6 +203,7 @@ int copy_strings(int argc,char ** argv, 
 			int i, new, err;
 			struct page *page;
 			int offset, bytes_to_copy;
+			struct copy_user_state copy_user_state;
 
 			offset = pos % PAGE_SIZE;
 			i = pos/PAGE_SIZE;
@@ -215,7 +216,8 @@ int copy_strings(int argc,char ** argv, 
 					return -ENOMEM;
 				new = 1;
 			}
-			kaddr = kmap(page);
+			kaddr = kmap_copy_user(&copy_user_state, page,
+						KM_FILEMAP, 0);
 
 			if (new && offset)
 				memset(kaddr, 0, offset);
@@ -226,7 +228,7 @@ int copy_strings(int argc,char ** argv, 
 					memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len);
 			}
 			err = copy_from_user(kaddr + offset, str, bytes_to_copy);
-			kunmap(page);
+			kunmap_copy_user(&copy_user_state);
 
 			if (err)
 				return -EFAULT; 

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  3:17                             ` Andrew Morton
@ 2002-07-09  4:28                               ` Martin J. Bligh
  2002-07-09  5:28                                 ` Andrew Morton
  2002-07-09 13:59                               ` Benjamin LaHaise
  1 sibling, 1 reply; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-09  4:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

> It's a bit weird that copy_strings is the heaviest user of kmap().

I can dig out the whole acg and see what was calling copy_strings;
that might help give us a clue.

> I bet it's kmapping the same page over and over.  A little cache
> there will help.  I'll fix that up.

Sounds like a plan ...

> So here's the patch.  Seems to work.  Across a `make -j6 bzImage'
> the number of calls to kmap_high() went from 490,429 down to 41,174.
> 
> And guess what?   Zero change in wallclock time.

Well, I was going to try to bench this tonight, but am having a
problem with 2.5.25 right now (we've been on 2.4 for a while,
but are shifting). Hopefully I'll get you some numbers tomorrow, and 
will get some other benchmarks done by people here on various 
machines.

> Any theories?

Maybe the cost of the atomic kmap cancels out the gain? Having to do a 
single-line tlb flush every time ... a per-cpu pool might help if that
is the problem, but we'd have to make it a reasonable size to counter
the cost. I'll do some more measurements first, and get some profile
data to see whether the number of ticks goes down in one function and
up in the other.

Thanks to all for looking at this,

M.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  4:28                               ` Martin J. Bligh
@ 2002-07-09  5:28                                 ` Andrew Morton
  2002-07-09  6:15                                   ` Martin J. Bligh
                                                     ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-09  5:28 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

"Martin J. Bligh" wrote:
> 
> > It's a bit weird that copy_strings is the heaviest user of kmap().
> 
> I can dig out the whole acg, and see what was calling copy_strings,
> that might help give us a clue.

Well, it'll be exec().

> > I bet it's kmapping the same page over and over.  A little cache
> > there will help.  I'll fix that up.
> 
> Sounds like a plan ...

Updated patch below.  We don't need an atomic kmap in copy_strings
at all.  kmap is the right thing to do, but just be smarter about it.
Hanging onto the existing kmap in there reduces the number of kmap()
calls by a factor of 32 across a kernel compile.

But still no aggregate speedup.

> > So here's the patch.  Seems to work.  Across a `make -j6 bzImage'
> > the number of calls to kmap_high() went from 490,429 down to 41,174.
> >
> > And guess what?   Zero change in wallclock time.
> 
> Well, I was going to try to bench this tonight, but am having a
> problem with 2.5.25 right now (we've been on 2.4 for a while,
> but are shifting). Hopefully get you some numbers tommorow, and
> will get some other benchmarks done by people here on various
> machines.

Don't tell me those NUMAQ's are using IDE ;)

But seriously, what's the problem?  We really do need the big
boxes to be able to test 2.5 right now, and any blockage needs
to be cleared away.

> > Any theories?
> 
> Maybe the cost of the atomic kmap counters the gain? Having to do a
> single-line TLB flush every time ... a per-CPU pool might help if that
> is the problem, but it would have to be a reasonable size to amortize
> the cost. I'll do some more measurements first, and get some profile
> data to see whether the number of ticks goes down in one function and
> up in the other.

Well, certainly it's dumb to perform an atomic_kunmap() when we're
just about to atomic_kmap() the same page again.  But no change.



 arch/i386/lib/usercopy.c        |   10 +++++
 arch/i386/mm/fault.c            |   40 ++++++++++++++++++++
 fs/exec.c                       |   72 ++++++++++++++++++++++++++++--------
 include/asm-i386/kmap_types.h   |    3 +
 include/asm-i386/processor.h    |    2 +
 include/asm-ppc/kmap_types.h    |    1 
 include/asm-sparc/kmap_types.h  |    1 
 include/asm-x86_64/kmap_types.h |    1 
 include/linux/highmem.h         |   78 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h           |    5 ++
 mm/filemap.c                    |   11 +++--
 11 files changed, 202 insertions(+), 22 deletions(-)

--- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/arch/i386/mm/fault.c	Mon Jul  8 21:44:21 2002
@@ -18,6 +18,7 @@
 #include <linux/ptrace.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/interrupt.h>
@@ -138,6 +139,39 @@ void bust_spinlocks(int yes)
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+static inline void highmem_remove_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus)
+		kunmap_atomic(cus->kaddr, cus->type);
+}
+
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus) {
+		long *reg;
+		unsigned offset;
+
+		cus->kaddr = kmap_atomic(cus->page, cus->type);
+		if (cus->src)
+			reg = &regs->esi;
+		else
+			reg = &regs->edi;
+		offset = *reg & (PAGE_SIZE - 1);
+		*reg = ((long)cus->kaddr) | offset;
+	}
+}
+#else
+static inline void highmem_remove_atomic_kmap(struct pt_regs *regs)
+{}
+static inline void highmem_restore_atomic_kmap(struct pt_regs *regs)
+{}
+#endif
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 extern unsigned long idt;
 
@@ -206,6 +240,8 @@ asmlinkage void do_page_fault(struct pt_
 	}
 #endif
 
+	highmem_remove_atomic_kmap(regs);
+
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -267,8 +303,10 @@ good_area:
 			tsk->maj_flt++;
 			break;
 		case VM_FAULT_SIGBUS:
+			highmem_restore_atomic_kmap(regs);
 			goto do_sigbus;
 		case VM_FAULT_OOM:
+			highmem_restore_atomic_kmap(regs);
 			goto out_of_memory;
 		default:
 			BUG();
@@ -283,6 +321,7 @@ good_area:
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 	return;
 
 /*
@@ -291,6 +330,7 @@ good_area:
  */
 bad_area:
 	up_read(&mm->mmap_sem);
+	highmem_restore_atomic_kmap(regs);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
--- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/asm-i386/kmap_types.h	Mon Jul  8 21:44:21 2002
@@ -19,7 +19,8 @@ D(5)	KM_BIO_SRC_IRQ,
 D(6)	KM_BIO_DST_IRQ,
 D(7)	KM_PTE0,
 D(8)	KM_PTE1,
-D(9)	KM_TYPE_NR
+D(9)	KM_FILEMAP,
+D(10)	KM_TYPE_NR
 };
 
 #undef D
--- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/asm-ppc/kmap_types.h	Mon Jul  8 21:44:21 2002
@@ -15,6 +15,7 @@ enum km_type {
 	KM_BIO_DST_IRQ,
 	KM_PTE0,
 	KM_PTE1,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/asm-sparc/kmap_types.h	Mon Jul  8 21:44:21 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h	Mon Jul  8 21:44:21 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/linux/highmem.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/linux/highmem.h	Mon Jul  8 21:44:21 2002
@@ -3,6 +3,7 @@
 
 #include <linux/config.h>
 #include <linux/fs.h>
+#include <asm/processor.h>
 #include <asm/cacheflush.h>
 
 #ifdef CONFIG_HIGHMEM
@@ -10,6 +11,7 @@
 extern struct page *highmem_start_page;
 
 #include <asm/highmem.h>
+#include <asm/kmap_types.h>
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
@@ -72,4 +74,80 @@ static inline void copy_user_highpage(st
 	kunmap_atomic(vto, KM_USER1);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP)
+/*
+ * Used when performing a copy_*_user while holding an atomic kmap
+ */
+struct copy_user_state {
+	struct page *page;		/* The page which is kmap_atomiced */
+	void *kaddr;			/* Its mapping */
+	enum km_type type;		/* Its atomic kmap slot type */
+	int src;			/* 1: fixup ESI.  0: Fixup EDI */
+};
+
+/*
+ * `src' is true if the kmap_atomic virtual address is the source of the copy.
+ */
+static inline void *
+kmap_copy_user(struct copy_user_state *cus, struct page *page,
+			enum km_type type, int src)
+{
+	cus->page = page;
+	cus->kaddr = kmap_atomic(page, type);
+	if (PageHighMem(page)) {
+		cus->type = type;
+		cus->src = src;
+		BUG_ON(current->copy_user_state != NULL);
+		current->copy_user_state = cus;
+	}
+	return cus->kaddr;
+}
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	if (PageHighMem(cus->page)) {
+		BUG_ON(current->copy_user_state != cus);
+		kunmap_atomic(cus->kaddr, cus->type);
+		current->copy_user_state = NULL;
+		cus->page = NULL;	/* debug */
+	}
+}
+
+/*
+ * After a copy_*_user, the kernel virtual address may be different.  So
+ * use kmap_copy_user_addr() to get the new value.
+ */
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return cus->kaddr;
+}
+
+#else
+
+struct copy_user_state {
+	struct page *page;
+};
+
+/*
+ * This must be a macro because `type' may be undefined
+ */
+
+#define kmap_copy_user(cus, page, type, src)	\
+	({					\
+		(cus)->page = (page);		\
+		kmap(page);			\
+	})
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	kunmap(cus->page);
+}
+
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return page_address(cus->page);
+}
+
+#endif
+
 #endif /* _LINUX_HIGHMEM_H */
--- 2.5.25/include/linux/sched.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/linux/sched.h	Mon Jul  8 21:44:21 2002
@@ -248,6 +248,8 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 
+struct copy_user_state;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -365,6 +367,9 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 	struct dentry *proc_dentry;
+#ifdef CONFIG_HIGHMEM
+	struct copy_user_state *copy_user_state;
+#endif
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
--- 2.5.25/mm/filemap.c~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/mm/filemap.c	Mon Jul  8 21:44:21 2002
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
 #include <linux/iobuf.h>
@@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in
 	return retval;
 }
 
-int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+int file_read_actor(read_descriptor_t *desc, struct page *page,
+			unsigned long offset, unsigned long size)
 {
 	char *kaddr;
+	struct copy_user_state copy_user_state;
 	unsigned long left, count = desc->count;
 
 	if (size > count)
 		size = count;
 
-	kaddr = kmap(page);
+	kaddr = kmap_copy_user(&copy_user_state, page, KM_FILEMAP, 1);
 	left = __copy_to_user(desc->buf, kaddr + offset, size);
-	kunmap(page);
-	
+	kunmap_copy_user(&copy_user_state);
+
 	if (left) {
 		size -= left;
 		desc->error = -EFAULT;
--- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/include/asm-i386/processor.h	Mon Jul  8 21:44:21 2002
@@ -485,4 +485,6 @@ extern inline void prefetchw(const void 
 
 #endif
 
+#define ARCH_HAS_KMAP_FIXUP
+
 #endif /* __ASM_I386_PROCESSOR_H */
--- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/arch/i386/lib/usercopy.c	Mon Jul  8 21:44:21 2002
@@ -11,6 +11,16 @@
 
 #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS
 
+/*
+ * We cannot use the mmx functions here with the kmap_atomic fixup
+ * code.
+ *
+ * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere.
+ * Maybe kill this code?
+ */
+
+#error this will not work
+
 unsigned long
 __generic_copy_to_user(void *to, const void *from, unsigned long n)
 {
--- 2.5.25/fs/exec.c~linus-copy_user-hack	Mon Jul  8 21:44:21 2002
+++ 2.5.25-akpm/fs/exec.c	Mon Jul  8 22:18:12 2002
@@ -182,27 +182,45 @@ static int count(char ** argv, int max)
  * memory to free pages in kernel mem. These are in a format ready
  * to be put directly into the top of new user memory.
  */
+int akpm_hits;
+int akpm_misses;
+int akpm_kmaps;
+
 int copy_strings(int argc,char ** argv, struct linux_binprm *bprm) 
 {
+	struct page *kmapped_page = NULL;
+	char *kaddr = NULL;
+	int ret;
+
 	while (argc-- > 0) {
 		char *str;
 		int len;
 		unsigned long pos;
 
-		if (get_user(str, argv+argc) || !(len = strnlen_user(str, bprm->p)))
-			return -EFAULT;
-		if (bprm->p < len) 
-			return -E2BIG; 
+		if (get_user(str, argv+argc) ||
+				!(len = strnlen_user(str, bprm->p))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (bprm->p < len)  {
+			ret = -E2BIG;
+			goto out;
+		}
 
 		bprm->p -= len;
 		/* XXX: add architecture specific overflow check here. */ 
-
 		pos = bprm->p;
+
+		/*
+		 * The only sleeping function which we are allowed to call in
+		 * this loop is copy_from_user().  Otherwise, copy_user_state
+		 * could get trashed.
+		 */
 		while (len > 0) {
-			char *kaddr;
 			int i, new, err;
-			struct page *page;
 			int offset, bytes_to_copy;
+			struct page *page;
 
 			offset = pos % PAGE_SIZE;
 			i = pos/PAGE_SIZE;
@@ -211,32 +229,52 @@ int copy_strings(int argc,char ** argv, 
 			if (!page) {
 				page = alloc_page(GFP_HIGHUSER);
 				bprm->page[i] = page;
-				if (!page)
-					return -ENOMEM;
+				if (!page) {
+					ret = -ENOMEM;
+					goto out;
+				}
 				new = 1;
 			}
-			kaddr = kmap(page);
 
+			if (kmapped_page) {
+				if (page == kmapped_page)
+					akpm_hits++;
+				else
+					akpm_misses++;
+			}
+			
+			if (page != kmapped_page) {
+				if (kmapped_page)
+					kunmap(kmapped_page);
+				kmapped_page = page;
+				kaddr = kmap(kmapped_page);
+				akpm_kmaps++;
+			}
 			if (new && offset)
 				memset(kaddr, 0, offset);
 			bytes_to_copy = PAGE_SIZE - offset;
 			if (bytes_to_copy > len) {
 				bytes_to_copy = len;
 				if (new)
-					memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len);
+					memset(kaddr+offset+len, 0,
+						PAGE_SIZE-offset-len);
+			}
+			err = copy_from_user(kaddr+offset, str, bytes_to_copy);
+			if (err) {
+				ret = -EFAULT;
+				goto out;
 			}
-			err = copy_from_user(kaddr + offset, str, bytes_to_copy);
-			kunmap(page);
-
-			if (err)
-				return -EFAULT; 
 
 			pos += bytes_to_copy;
 			str += bytes_to_copy;
 			len -= bytes_to_copy;
 		}
 	}
-	return 0;
+	ret = 0;
+out:
+	if (kmapped_page)
+		kunmap(kmapped_page);
+	return ret;
 }
 
 /*

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  5:28                                 ` Andrew Morton
@ 2002-07-09  6:15                                   ` Martin J. Bligh
  2002-07-09  6:30                                     ` William Lee Irwin III
  2002-07-09  6:32                                     ` William Lee Irwin III
  2002-07-09 16:08                                   ` Martin J. Bligh
  2002-07-09 17:32                                   ` Andrea Arcangeli
  2 siblings, 2 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-09  6:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

> Don't tell me those NUMAQ's are using IDE ;)

No, that's one level of pain I don't have to deal with ;-)

Now, switched fibrechannel SANs on a machine that really needs
NUMA-aware multipath IO are more likely to be a problem, on
the other hand ... but I can live without that for now ...

> But seriously, what's the problem?  We really do need the big
> boxes to be able to test 2.5 right now, and any blockage needs
> to be cleared away.

You really want the current list? The whole of our team is 
shifting focus to 2.5, which'll make life more interesting ;-)

wli might care to elaborate on 2 & 3, since I think he helped
them identify / fix (helped maybe meaning did).

1. Irqbalance doesn't like clustered apic mode (have hack)
2. Using ioremap fairly early catches cpu_online_map set to 0
   for some reason (have hack).
3. Something to do with BIO that I'll let Bill explain, but I
   am given to believe it's well known (have hack).

After this, one of our team got one of the boxes booted today
for the first time on a recent 2.5 kernel, but I don't have any 
numbers from it yet.

4. Starfire ethernet driver doesn't seem to want to compile in
   (crc32_le in set_rx_mode?). Can't recall the exact error; will
   recreate tomorrow. It may well work as a module.
5. Compiler errors that I haven't really looked at, but am guessing
   they may be due to me using an old compiler / tools (Red Hat 6.2)?

/tmp/cca8ibx8.s: Assembler messages:
/tmp/cca8ibx8.s:255: Error: suffix or operands invalid for `lcall'
  gcc -Wp,-MD,./.setup.o.d -D__ASSEMBLY__ -D__KERNEL__ -I/home/mbligh/linux-2.5.25/include -nostdinc -iwithprefix include  -traditional -DSVGA_MODE=NORMAL_VGA  -D__BIG_KERNEL__  -c -o setup.o setup.S
make[1]: *** [bootsect.o] Error 1
make[1]: *** Waiting for unfinished jobs....
/tmp/ccn0yhUe.s: Assembler messages:
/tmp/ccn0yhUe.s:1292: Error: suffix or operands invalid for `lcall'
make[1]: *** [setup.o] Error 1
make[1]: Leaving directory `/home/mbligh/linux-2.5.25/arch/i386/boot'
make: *** [install] Error 2

At that point, I decided to wait until tomorrow, as I really need
to switch machines, disks, distributions, and reinstall ;-) But
then I need to do a few more things ...

6. Get someone to port forward the ia32 discontigmem support, test
   and submit.
7. Port forward the TSC disable changes.
8. Port forward the remote quad timer disable changes.
9. Port forward the pci-bridge remap hack that I'm too ashamed to
   publish.

Then I might get some numbers out of her ;-) Comparing 2.4 to 2.5
will be most interesting ...

In the meantime I'll get you some numbers from an 8 way SMP box,
which should be much simpler to do.

M.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  6:15                                   ` Martin J. Bligh
@ 2002-07-09  6:30                                     ` William Lee Irwin III
  2002-07-09  6:32                                     ` William Lee Irwin III
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-07-09  6:30 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

On Mon, Jul 08, 2002 at 11:15:52PM -0700, Martin J. Bligh wrote:
> wli might care to elaborate on 2 & 3, since I think he helped
> them identify / fix (helped maybe meaning did).
> 1. Irqbalance doesn't like clustered apic mode (have hack)
> 2. Using ioremap fairly early catches cpu_online_map set to 0
>    for some reason (have hack).
> 3. Something to do with BIO that I'll let Bill explain, but I
>    am given to believe it's well known (have hack).

(1) irqbalance is blatantly stuffing flat bitmasks into ICR2;
	this breaks clustered hierarchical destination format
	everywhere, all the time.

(2) ioremap wants to flush_tlb_all() before cpu_online_map is
	initialized. This results in smp_call_function() spinning
	until an atomic_t goes to -1, which never happens since it
	doesn't kick any CPUs to decrement the counter.

(3) The bio thing is just the usual queue max_sectors, where you've
	recommended decreasing the MPAGE max sectors so the bio
	constraints are satisfied before they hit elevator.c, where
	a BUG_ON() is triggered instead of a bio split.
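
To spell (2) out, the logic reduces to something like the sketch
below (abridged and from memory, not the literal 2.5 source):

/*
 * With cpu_online_map still zero during early boot, cpus computes to
 * -1: the !cpus early return is skipped, nobody receives the IPI, and
 * the wait loop spins forever waiting for started to reach -1.
 */
int smp_call_function_sketch(void (*func)(void *), void *info)
{
	struct call_data_struct data;
	int cpus = num_online_cpus() - 1;	/* -1 if the map is unset */

	if (!cpus)
		return 0;			/* not taken when cpus == -1 */

	data.func = func;
	data.info = info;
	atomic_set(&data.started, 0);
	send_IPI_allbutself(CALL_FUNCTION_VECTOR);

	while (atomic_read(&data.started) != cpus)
		barrier();		/* 0 can never count down to -1 */
	return 0;
}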


Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  6:15                                   ` Martin J. Bligh
  2002-07-09  6:30                                     ` William Lee Irwin III
@ 2002-07-09  6:32                                     ` William Lee Irwin III
  1 sibling, 0 replies; 87+ messages in thread
From: William Lee Irwin III @ 2002-07-09  6:32 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Morton, Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

> > Don't tell me those NUMAQ's are using IDE ;)
> 
> No, that's one level of pain I don't have to deal with ;-)
> 
> Now, switched fibrechannel SANs on a machine that really needs
> NUMA-aware multipath IO are more likely to be a problem, on
> the other hand ... but I can live without that for now ...
> 
> > But seriously, what's the problem?  We really do need the big
> > boxes to be able to test 2.5 right now, and any blockage needs
> > to be cleared away.
> 
> You really want the current list? The whole of our team is 
> shifting focus to 2.5, which'll make life more interesting ;-)
> 
On Mon, Jul 08, 2002 at 11:15:52PM -0700, Martin J. Bligh wrote:
> wli might care to elaborate on 2 & 3, since I think he helped
> them identify / fix (helped maybe meaning did).

Oh, I forgot, there was a bad v86 info thing on the 9th CPU woken
that I never finished debugging, too. And the MAX_IO_APICS issue,
which is easily solved by just increasing the constant #ifdef
CONFIG_MULTIQUAD as usual. This panics before console_init() though,
which makes it seem more painful than it really is.
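
The MAX_IO_APICS hack is just the usual constant bump, i.e. something
like the below (both values and the exact header it lives in are from
memory, so treat them as illustrative):

/*
 * Give MULTIQUAD boxes headroom for one IO-APIC per quad.  The
 * constants here are illustrative, not the shipped values.
 */
#ifdef CONFIG_MULTIQUAD
#define MAX_IO_APICS 32
#else
#define MAX_IO_APICS 8
#endif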


Cheers,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  3:17                             ` Andrew Morton
  2002-07-09  4:28                               ` Martin J. Bligh
@ 2002-07-09 13:59                               ` Benjamin LaHaise
  1 sibling, 0 replies; 87+ messages in thread
From: Benjamin LaHaise @ 2002-07-09 13:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin J. Bligh, Andrea Arcangeli, Linus Torvalds, Rik van Riel,
	linux-mm

On Mon, Jul 08, 2002 at 08:17:36PM -0700, Andrew Morton wrote:
> -D(9)	KM_TYPE_NR
> +D(9)	KM_FILEMAP,
> +D(10)	KM_TYPE_NR

Reusing KM_USER[01] would be better than adding yet another 
kmap type.

		-ben
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  5:28                                 ` Andrew Morton
  2002-07-09  6:15                                   ` Martin J. Bligh
@ 2002-07-09 16:08                                   ` Martin J. Bligh
  2002-07-09 17:32                                   ` Andrea Arcangeli
  2 siblings, 0 replies; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-09 16:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

>> I can dig out the whole acg and see what was calling copy_strings;
>> that might help give us a clue.
> 
> Well, it'll be exec().

Yup ;-) 

                 0.95   18.17   30186/90554       copy_strings_kernel [68]
                1.90   36.34   60368/90554       do_execve [15]
[36]     1.5    2.85   54.52   90554         copy_strings [36]
               40.55    0.00 2274831/2528191     _generic_copy_from_user [41]
                6.88    0.21 2274831/3762429     kmap_high [87]
                4.15    0.00 2274831/3762429     kunmap_high [105]
                2.12    0.00 2273886/4517587     strnlen_user [120]
                0.09    0.51   31139/4284514     _alloc_pages [24]
                0.00    0.00   31139/4284514     alloc_pages [369]


                0.00    0.00       2/30186       load_script <cycle 2> [553]
                0.00   19.12   30184/30186       do_execve [15]
[68]     0.5    0.00   19.12   30186         copy_strings_kernel [68]
                0.95   18.17   30186/90554       copy_strings [36]

> Updated patch below.  We don't need an atomic kmap in copy_strings
> at all.  kmap is the right thing to do, but just be smarter about it.
> Hanging onto the existing kmap in there reduces the number of kmap()
> calls by a factor of 32 across a kernel compile.

Sounds about right, looking at the data above.
 
> But still no aggregate speedup.

Now that really is odd. Will get you some more numbers.

M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-08 17:52                                 ` Linus Torvalds
  (?)
  (?)
@ 2002-07-09 16:57                                 ` John Levon
  2002-07-09 19:56                                   ` Karim Yaghmour
  -1 siblings, 1 reply; 87+ messages in thread
From: John Levon @ 2002-07-09 16:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel

On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote:

>  - I'd associate each profiling event with a dentry/offset pair, simply
>    because that's the highest-level thing that the kernel knows about and
>    that is "static".

This makes sense, I think.

>  - I'd suggest that the profiler explicitly mark the dentries it wants
>    profiled, so that the kernel can throw away events that we're not
>    interested in. The marking function would return a cookie to user
>    space, and increment the dentry count (along with setting the
>    "profile" flag in the dentry)

For a system-wide profiler, this needs to be /all/ dentries that get
mapped in with executable permissions, or we lose any mappings of shared
libraries we don't know about etc. Essentially, oprofile wants samples
against any dentry that gets mmap()ed with PROT_EXEC, so this marking
would really need to happen at mmap() time. Missing out on any dentry
profiles amounts to data loss in the system profile and has the
potential to mislead.

>  - the "cookie" (which would most easily just be the kernel address of the
>    dentry) would be the thing that we give to user-space (along with
>    offset) on profile read. The user app can turn it back into a filename.
> 
> Whether it is the original "mark this file for profiling" phase that saves
> away the cookie<->filename association, or whether we also have a system
> call for "return the path of this cookie", I don't much care about.
> Details, details.
> 
> Anyway, what would be the preferred interface from user level?

oprofile currently receives eip-pid pairs, along with the necessary
syscall tracing info needed to reconstruct file offsets. The above
scheme removes the dependency on the pid, but this also unfortunately
throws away some useful information.

It is often useful to be able to separate out shared-library samples on
a per-process (and/or per-application) basis. Any really useful profile
buffer facility needs to preserve this info, but just including
the raw pid isn't going to work when user-space can't reconstruct the
"name" of the pid (where "name" would be something like "/bin/bash")
because the process exited in the meantime.

The same goes for kernel samples that happen in process context.

So this might work well in tandem with some global process-tree tracing
scheme, but I don't know what form that might take ...

(Then there are kernel modules, but that's probably best served by
patching modutils)
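
To make that concrete, a per-sample record for the dentry/offset
scheme might look like the sketch below.  Every name in it is invented
for illustration; the open question is what goes into the context
field so that per-process attribution isn't lost:

/*
 * Hypothetical per-sample record.  "cookie" is the kernel's opaque
 * handle for a marked dentry (e.g. its address), which user-space
 * turns back into a pathname.
 */
struct profile_sample {
	unsigned long cookie;	/* which mapped file */
	unsigned long offset;	/* offset of the sampled eip within it */
	unsigned long context;	/* cpu, perfctr event ... and the pid? */
};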

regards
john

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09  5:28                                 ` Andrew Morton
  2002-07-09  6:15                                   ` Martin J. Bligh
  2002-07-09 16:08                                   ` Martin J. Bligh
@ 2002-07-09 17:32                                   ` Andrea Arcangeli
  2002-07-10  5:32                                     ` Andrew Morton
  2 siblings, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2002-07-09 17:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin J. Bligh, Linus Torvalds, Rik van Riel, linux-mm

The patch with the hooks into the page fault handler is basically
what Martin was suggesting to me at OLS when I was laughing. The ugliest
part is the implicit dependency on every copy-user to store the current
source and destination in esi/edi.

However, your implementation is still not as optimized as it could be.
If you take the max-performance route you should do it all: you shouldn't
kunmap_atomic/kmap_atomic blindly around the page fault handler like
you're doing now. You should hold on to the atomic kmap for the whole
page fault, until a new kmap_atomic of the same type happens on the
current cpu under you during the page fault (either from the page
fault handler itself or another task, because the page fault handler
blocked and scheduled in another task).

To detect the underlying kmap_atomic you only need a per-cpu sequence
counter that you increase during kmap_atomic (right after the implicit
preempt_disable()). You only need to read this sequence counter at the
page-fault entry point (in the place where you are now doing the
unconditional kunmap_atomic), and re-read the per-cpu sequence
counter later, before returning from the page fault, to know if you
have to execute a new kmap_atomic. You need a sequence counter per
cpu, per type of kmap.

NOTE: you don't need to execute the kunmap_atomic at all in the
recursive case, just disable the debugging (in fact, if you take this
route you can as well drop the kunmap_atomic call entirely from the
common code; you may want to verify first that the ppc or sparc guys
aren't doing something magic in the kunmap, in which case you may want
to skip it only during recursion in i386/mm/fault.c). Don't blame me
that we lose some debugging capability; you want max performance at
all costs, remember?
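
In pseudocode the whole scheme is something like the sketch below (the
helper names are invented placeholders, and the counter array would
really be per-cpu data):

/* Bumped in kmap_atomic(), right after the implicit preempt_disable(). */
static int kmap_seq[NR_CPUS][KM_TYPE_NR];

void do_page_fault_sketch(struct pt_regs *regs)
{
	struct copy_user_state *cus = current->copy_user_state;
	int cpu = 0, seq = 0;

	if (cus) {			/* faulted inside a copy_*_user */
		cpu = smp_processor_id();
		seq = kmap_seq[cpu][cus->type];
	}

	handle_the_fault(regs);		/* may block and reschedule */

	/*
	 * Re-establish the kmap only if the slot was reused under us or
	 * we migrated; in the common case this is a no-op.
	 */
	if (cus && (smp_processor_id() != cpu ||
		    kmap_seq[smp_processor_id()][cus->type] != seq)) {
		cus->kaddr = kmap_atomic(cus->page, cus->type);
		fixup_copy_user_reg(regs, cus);	/* rewrite esi or edi */
	}
}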

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-09 16:57                                 ` John Levon
@ 2002-07-09 19:56                                   ` Karim Yaghmour
  0 siblings, 0 replies; 87+ messages in thread
From: Karim Yaghmour @ 2002-07-09 19:56 UTC (permalink / raw)
  To: John Levon
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Richard Moore, bob


I've been following oprofile's development for a while and was quite
happy to see a DCPI equivalent come to Linux. The LTT data collection
engine is actually inspired by the one described in the DCPI paper.

As I said earlier, much of the kernel behavior information oprofile
requires is made available by LTT. Instead of redirecting the syscall
table, for example, LTT hooks itself within the syscall path to obtain
all syscall entries and exits (all contained within #ifdef's of course).
The same goes for important events such as forks/reads/writes/scheduling
etc. The mapping of PID to the name of an executable, for instance,
is easily extracted from this information.
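
(For illustration only, one event record in such an engine might look
like the sketch below; the field names are invented, not LTT's actual
layout.  The PID-to-name mapping is then rebuilt in user-space by
replaying the fork/exec events.)

/* Hypothetical shape of one trace event; invented, not LTT's format. */
struct trace_event {
	unsigned char type;	/* syscall entry/exit, fork, sched, ... */
	unsigned long time;	/* timestamp, e.g. from the TSC */
	unsigned long pid;	/* context the event happened in */
	unsigned long data;	/* syscall number, child pid, etc. */
};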

As with other applications/uses which require insight into the dynamics
of the kernel, it would seem to me that oprofile would greatly benefit
from the integration of the LTT patch in the mainline kernel. If
nothing else, oprofile could use the LTT collection engine to forward
its data to user-space, much like DProbes currently does.

By the same token, the LTT collection engine could replace the slew
of per-driver tracing systems already in the kernel, providing therefore
a uniform tracing system:
drivers/char/ftape/lowlevel/ftape-tracing.c
drivers/char/ip2/ip2trace.c
drivers/char/dtlk.c
drivers/char/mwavedd.h
drivers/char/n_r3964
drivers/acpi/utilities/utdebug.c
drivers/cdrom/sbpcd.c
drivers/isdn/eicon/eicon_mod.c
drivers/scsi/gdth.c
drivers/scsi/megaraid.c
drivers/scsi/qlogicfc.c
drivers/scsi/qlogicisp.c
drivers/net/wavelan.c
drivers/net/skfp/hwmtm.c
drivers/net/pcmcia/wavelan_cs.c
drivers/net/wireless/orinoco.c
drivers/net/wireless/orinoco_cs.c
drivers/video/radeonfb.c
drivers/usb/pwc.h
drivers/usb/hpusbscsi.c
include/linux/jbd.h for fs/ext3/*.c and fs/jbd/*.c
net/irda/irnet/irnet.h
etc.

The above list is but a sample of the files containing actual code
implementing tracing and/or data collection engines. There are a great
many more files that already have trace points in them. A
simple "grep -r TRACE *" provides interesting insight into the
number of subsystems requiring tracing, each implementing its own
scheme.

It is time to provide a uniform tracing and high-throughput data
collection engine for all to use. LTT has already been field-tested
for these purposes and is very easily extended to include any
additional functionality required.

Any comments/thoughts are greatly appreciated.

Cheers,

Karim

John Levon wrote:
> 
> On Mon, Jul 08, 2002 at 10:52:36AM -0700, Linus Torvalds wrote:
> 
> >  - I'd associate each profiling event with a dentry/offset pair, simply
> >    because that's the highest-level thing that the kernel knows about and
> >    that is "static".
> 
> This makes sense, I think.
> 
> >  - I'd suggest that the profiler explicitly mark the dentries it wants
> >    profiled, so that the kernel can throw away events that we're not
> >    interested in. The marking function would return a cookie to user
> >    space, and increment the dentry count (along with setting the
> >    "profile" flag in the dentry)
> 
> For a system-wide profiler, this needs to be /all/ dentries that get
> mapped in with executable permissions, or we lose any mappings of shared
> libraries we don't know about etc. Essentially, oprofile wants samples
> against any dentry that gets mmap()ed with PROT_EXEC, so this marking
> would really need to happen at mmap() time. Missing out on any dentry
> profiles amounts to data loss in the system profile and has the
> potential to mislead.
> 
> >  - the "cookie" (which would most easily just be the kernel address of the
> >    dentry) would be the thing that we give to user-space (along with
> >    offset) on profile read. The user app can turn it back into a filename.
> >
> > Whether it is the original "mark this file for profiling" phase that saves
> > away the cookie<->filename association, or whether we also have a system
> > call for "return the path of this cookie", I don't much care about.
> > Details, details.
> >
> > Anyway, what would be the preferred interface from user level?
> 
> oprofile currently receives eip-pid pairs, along with the necessary
> syscall tracing info needed to reconstruct file offsets. The above
> scheme removes the dependency on the pid, but this also unfortunately
> throws away some useful information.
> 
> It is often useful to be able to separate out shared-library samples on
> a per-process (and/or per-application) basis. Any really useful profile
> buffer facility really needs to preserve this info, but just including
> the raw pid isn't going to work when user-space can't reconstruct the
> "name" of the pid (where "name" would be something "/bin/bash") because
> the process exited in the meantime.
> 
> The same goes for kernel samples that happen in process context.
> 
> So this might work well in tandem with some global process-tree tracing
> scheme, but I don't know what form that might take ...
> 
> (Then there are kernel modules, but that's probably best served by
> patching modutils)
> 
> regards
> john
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-08 18:41                                   ` Karim Yaghmour
@ 2002-07-10  2:22                                     ` John Levon
  -1 siblings, 0 replies; 87+ messages in thread
From: John Levon @ 2002-07-10  2:22 UTC (permalink / raw)
  To: Karim Yaghmour
  Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob

On Mon, Jul 08, 2002 at 02:41:00PM -0400, Karim Yaghmour wrote:

> dentry + offset: on a 32bit machine, this is 8 bytes total per event being
> profiled. This is a lot of information if you are trying to sustain a
> high-volume throughput.

I haven't found that to be significant in profiling overhead, mainly
because the hash table removes some of the "sting" of high sampling
rates (and the interrupt handler dwarfs all other aspects). The
situation is probably different for more general tracing purposes, but
I'm dubious as to the utility of a general tracing mechanism.

(besides, a profile buffer needs a sample context value too, for things
like CPU number and perfctr event number).
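
For what it's worth, the aggregation amounts to something like the
sketch below (sizes, layout and flush_to_buffer() are invented for
illustration, not oprofile's actual code): repeated hits on the same
(eip, context) pair just bump a counter, and only evictions generate
buffer traffic.

#define SAMPLE_HASH_SIZE 2048

struct sample_entry {
	unsigned long eip;
	unsigned long context;	/* cpu, perfctr event number, ... */
	unsigned long count;
};

static struct sample_entry sample_hash[SAMPLE_HASH_SIZE];

static void record_sample(unsigned long eip, unsigned long context)
{
	struct sample_entry *e =
		&sample_hash[(eip ^ context) % SAMPLE_HASH_SIZE];

	if (e->eip == eip && e->context == context) {
		e->count++;		/* hot path: no buffer traffic */
		return;
	}
	if (e->count)
		flush_to_buffer(e);	/* evict the previous entry */
	e->eip = eip;
	e->context = context;
	e->count = 1;
}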

> You can almost always skip the dentry since you know scheduling
> changes and since you can catch a system-state snapshot at the beginning
> of the profiling. After that, the eip is sufficient and can easily be
> correlated to a meaningful entry in a file in user-space.

But as I point out in my other post, dentry-offset alone is not as
useful as it could be...

I just don't see a really good reason to introduce insidious tracing
throughout. Both tracing and profiling are ugly ugly things to be doing
by their very nature, and I'd much prefer to keep such intrusions to a
bare minimum.

The entry.S examine-the-registers approach is simple enough, but it's
not much more tasteful than sys_call_table hackery IMHO

regards
john

-- 
"I know I believe in nothing but it is my nothing"
	- Manic Street Preachers

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-10  2:22                                     ` John Levon
@ 2002-07-10  4:16                                       ` Karim Yaghmour
  -1 siblings, 0 replies; 87+ messages in thread
From: Karim Yaghmour @ 2002-07-10  4:16 UTC (permalink / raw)
  To: John Levon
  Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob


John Levon wrote:
> I'm dubious as to the utility of a general tracing mechanism.
...
> I just don't see a really good reason to introduce insidious tracing
> throughout. Both tracing and profiling are ugly ugly things to be doing
> by their very nature, and I'd much prefer to keep such intrusions to a
> bare minimum.

Tracing is essential for an entire category of problems which can only
be solved by obtaining the raw data which describes the dynamic behavior
of the kernel.

Have you ever tried to solve an inter-process synchronization problem
using strace or gdb? In reality, only tracing built into the kernel can
enable a developer to solve such problems.

Have you ever tried to follow the exact reaction applications have
to kernel input? It took lots of ad-hoc experimentation to isolate the
thundering herd problem. Tracing would have shown this immediately.

Can you list the exact sequence of processes that are scheduled
in reaction to input when you press the keyboard while running a
terminal in X? This is out of reach of most user-space programmers but
a trace shows this quite nicely.

Ever had a box saturated with IRQs and still showing 0% CPU usage? This
problem has been reported time and again. Lately someone was asking
about the utility which soaks up CPU cycles to show this sort of
situation. Once more, tracing shows this right away ... without soaking
up CPU cycles.

Ever tried to get the exact time spent by an application in user-space
vs. kernel space? Even better, can you tell the actual syscall which
cost the most in kernel time? You can indeed get closer using random
sampling, but it's just one more thing tracing gives you without any
difficulty.

And the list goes on.

The fact that so many kernel subsystems already have their own tracing
built-in (see other posting) clearly shows that there is a fundamental
need for such a utility even for driver developers. If many driver
developers can't develop drivers adequately without tracing, can we
expect user-space developers to efficiently use the kernel if they have
absolutely no idea about the dynamic interaction their processes have
with the kernel and how this interaction is influenced by and influences
the interaction with other processes?

> The entry.S examine-the-registers approach is simple enough, but it's
> not much more tasteful than sys_call_table hackery IMHO

I guess we won't agree on this. From my point of view it is much better
to have the code directly within entry.S for all to see instead of
having some external software play around with the syscall table in a
way kernel users can't trace back to the kernel's own code.

Cheers,

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-10  4:16                                       ` Karim Yaghmour
@ 2002-07-10  4:38                                         ` John Levon
  -1 siblings, 0 replies; 87+ messages in thread
From: John Levon @ 2002-07-10  4:38 UTC (permalink / raw)
  To: Karim Yaghmour
  Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob

On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote:
 
[snip]
 
> And the list goes on.
 
Sure, there are all sorts of things where some tracing can come in
useful. The question is whether it's really something the mainline
kernel should be doing, and whether the gung-ho approach is nice or not.

> The fact that so many kernel subsystems already have their own tracing
> built-in (see other posting)
 
Your list was almost entirely composed of per-driver debug routines.
This is not the same thing as logging trap entry/exits, syscalls etc.,
on any level, and I'm a bit perplexed that you're making such an
association.
 
> expect user-space developers to efficiently use the kernel if they have
> absolutely no idea about the dynamic interaction their processes have
> with the kernel and how this interaction is influenced by and influences
> the interaction with other processes?
 
This is clearly an exaggeration. And seeing as something like LTT
doesn't (and cannot) tell the "whole story" either, I could throw the
same argument directly back at you. The point is, there comes a point of
no return where usefulness gets outweighed by ugliness. For the very few
cases where such detailed information is really useful, the user can
usually install the needed special-case tools.
 
In contrast, a profiling mechanism that improves on the poor lot that
currently exists (gprof, readprofile) has a truly general utility, and
can hopefully be done without too much ugliness.
 
The primary reason I want to see something like this is to kill the ugly
code I have to maintain.

> > The entry.S examine-the-registers approach is simple enough, but it's
> > not much more tasteful than sys_call_table hackery IMHO
>
> I guess we won't agree on this. From my point of view it is much better
> to have the code directly within entry.S for all to see instead of
> having some external software play around with the syscall table in a
> way kernel users can't trace back to the kernel's own code.

Eh? I didn't say sys_call_table hackery was better. I said the entry.S
thing wasn't much better ...

regards
john


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-09 17:32                                   ` Andrea Arcangeli
@ 2002-07-10  5:32                                     ` Andrew Morton
  2002-07-10 22:43                                       ` Martin J. Bligh
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2002-07-10  5:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, Linus Torvalds, Rik van Riel, linux-mm

Andrea Arcangeli wrote:
> 
> The patch with the hooks into the page fault handler is basically
> what Martin was suggesting to me at OLS when I was laughing. The ugliest
> part is the implicit dependency on every copy-user to store the current
> source and destination in esi/edi.

Ah, but note how I called the patch "linus-copy_user-hack".  That clearly
identifies who is to blame.

But yes.

However, it seems that the esi/edi restriction isn't too bad.  If someone
has a smarter copy_*_user implementation then they'll have to use the old
one in certain places, or use those registers, or write fixup code for the
new implementation.

I tested it with the IBM folks' faster copy_*_user patch which
fixes the Intel alignment thing.  Worked OK.

> However, your implementation is still not as optimized as it could be.
> If you take the max-performance route you should do it all: you shouldn't
> kunmap_atomic/kmap_atomic blindly around the page fault handler like
> you're doing now. You should hold on to the atomic kmap for the whole
> page fault, until a new kmap_atomic of the same type happens on the
> current cpu under you during the page fault (either from the page
> fault handler itself or another task, because the page fault handler
> blocked and scheduled in another task).

Good stuff, thanks.  Did that.  New patch is here.  I stuck with
KM_FILEMAP because KM_USER0 is used in the COW code, and that
is in fact called inside the filemap fault handler.  So basically
the "oh drat" path was being taken all the time.

> ...
> NOTE: you don't need to execute the kunmap_atomic at all in the
> recursive case, just disable the debugging (in fact, if you take this
> route you can as well drop the kunmap_atomic call entirely from the
> common code; you may want to verify first that the ppc or sparc guys
> aren't doing something magic in the kunmap, in which case you may want
> to skip it only during recursion in i386/mm/fault.c). Don't blame me
> that we lose some debugging capability; you want max performance at
> all costs, remember?

Well, the non-recursive case is the common case.  And yes, maybe.  Could
use `#ifdef CONFIG_HIGHMEM_DEBUG' and then open-code the preempt_enable()
in there I guess...
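
Something like this, I mean (a sketch only; the debug-only body mimics
what the existing i386 kunmap_atomic does):

/*
 * Sketch: with the fixup scheme the unmap only exists to catch stale
 * use of a dead atomic kmap, so make it debug-only.
 */
static inline void kunmap_atomic_sketch(void *kvaddr, enum km_type type)
{
#ifdef CONFIG_HIGHMEM_DEBUG
	enum fixed_addresses idx = type + KM_TYPE_NR * smp_processor_id();

	/* Force a fault on any later access through this slot */
	pte_clear(kmap_pte - idx);
	__flush_tlb_one(__fix_to_virt(FIX_KMAP_BEGIN + idx));
#endif
	preempt_enable();
}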

Here's the diff.  The kmap() and kmap_atomic() rate is way down
now.  Still no benefit from it all, though.  Martin.  Help.


 arch/i386/kernel/i386_ksyms.c   |    5 ++
 arch/i386/lib/usercopy.c        |   10 +++++
 arch/i386/mm/fault.c            |   71 +++++++++++++++++++++++++++++++++++
 fs/exec.c                       |   60 +++++++++++++++++++++---------
 include/asm-i386/highmem.h      |    5 ++
 include/asm-i386/kmap_types.h   |    3 +
 include/asm-i386/processor.h    |    2 +
 include/asm-ppc/kmap_types.h    |    1 
 include/asm-sparc/kmap_types.h  |    1 
 include/asm-x86_64/kmap_types.h |    1 
 include/linux/highmem.h         |   80 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h           |    5 ++
 mm/filemap.c                    |   11 +++--
 13 files changed, 232 insertions(+), 23 deletions(-)

--- 2.5.25/arch/i386/kernel/i386_ksyms.c~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/arch/i386/kernel/i386_ksyms.c	Tue Jul  9 21:12:35 2002
@@ -14,6 +14,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/tty.h>
+#include <linux/highmem.h>
 
 #include <asm/semaphore.h>
 #include <asm/processor.h>
@@ -76,6 +77,10 @@ EXPORT_SYMBOL(get_cmos_time);
 EXPORT_SYMBOL(apm_info);
 EXPORT_SYMBOL(gdt);
 
+#ifdef CONFIG_HIGHMEM
+EXPORT_SYMBOL(kmap_atomic_seq);
+#endif
+
 #ifdef CONFIG_DEBUG_IOVIRT
 EXPORT_SYMBOL(__io_virt_debug);
 #endif
--- 2.5.25/arch/i386/lib/usercopy.c~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/arch/i386/lib/usercopy.c	Tue Jul  9 21:12:35 2002
@@ -11,6 +11,16 @@
 
 #ifdef CONFIG_X86_USE_3DNOW_AND_WORKS
 
+/*
+ * We cannot use the mmx functions here with the kmap_atomic fixup
+ * code.
+ *
+ * But CONFIG_X86_USE_3DNOW_AND_WORKS never gets defined anywhere.
+ * Maybe kill this code?
+ */
+
+#error this will not work
+
 unsigned long
 __generic_copy_to_user(void *to, const void *from, unsigned long n)
 {
--- 2.5.25/arch/i386/mm/fault.c~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/arch/i386/mm/fault.c	Tue Jul  9 21:34:38 2002
@@ -18,6 +18,7 @@
 #include <linux/ptrace.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/interrupt.h>
@@ -138,6 +139,70 @@ void bust_spinlocks(int yes)
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+
+/*
+ * per-cpu, per-atomic-kmap sequence numbers.  Incremented in kmap_atomic.
+ * If these change, we know that an atomic kmap slot has been reused.
+ */
+int kmap_atomic_seq[KM_TYPE_NR] __per_cpu_data = {0};
+
+/*
+ * Note the CPU ID and the currently-held atomic kmap's sequence number
+ */
+static inline void note_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus) {
+		cus->cpu = smp_processor_id();
+		cus->seq = this_cpu(kmap_atomic_seq[cus->type]);
+	}
+}
+
+/*
+ * After processing the fault, look to see whether we have switched CPUs
+ * or whether the fault handler has used the same kmap slot (it must have
+ * scheduled to another task).  If so, drop the kmap and get a new one.
+ * And then fix up the machine register which copy_*_user() is using so
+ * that it gets the correct address relative to the new kmap.
+ */
+static void
+__check_atomic_kmap(struct copy_user_state *cus, struct pt_regs *regs)
+{
+	const int cpu = smp_processor_id();
+
+	if (cus->seq != per_cpu(kmap_atomic_seq[cus->type], cpu) ||
+				cus->cpu != cpu) {
+		long *reg;
+		unsigned offset;
+
+		kunmap_atomic(cus->kaddr, cus->type);
+		cus->kaddr = kmap_atomic(cus->page, cus->type);
+		if (cus->src)
+			reg = &regs->esi;
+		else
+			reg = &regs->edi;
+		offset = *reg & (PAGE_SIZE - 1);
+		*reg = ((long)cus->kaddr) | offset;
+	}
+}
+
+static inline void check_atomic_kmap(struct pt_regs *regs)
+{
+	struct copy_user_state *cus = current->copy_user_state;
+
+	if (cus)
+		__check_atomic_kmap(cus, regs);
+}
+	
+#else
+static inline void note_atomic_kmap(struct pt_regs *regs)
+{}
+static inline void check_atomic_kmap(struct pt_regs *regs)
+{}
+#endif
+
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
 extern unsigned long idt;
 
@@ -206,6 +271,8 @@ asmlinkage void do_page_fault(struct pt_
 	}
 #endif
 
+	note_atomic_kmap(regs);
+
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);
@@ -267,8 +334,10 @@ good_area:
 			tsk->maj_flt++;
 			break;
 		case VM_FAULT_SIGBUS:
+			check_atomic_kmap(regs);
 			goto do_sigbus;
 		case VM_FAULT_OOM:
+			check_atomic_kmap(regs);
 			goto out_of_memory;
 		default:
 			BUG();
@@ -283,6 +352,7 @@ good_area:
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
 	up_read(&mm->mmap_sem);
+	check_atomic_kmap(regs);
 	return;
 
 /*
@@ -291,6 +361,7 @@ good_area:
  */
 bad_area:
 	up_read(&mm->mmap_sem);
+	check_atomic_kmap(regs);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
--- 2.5.25/fs/exec.c~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/fs/exec.c	Tue Jul  9 21:53:51 2002
@@ -184,25 +184,39 @@ static int count(char ** argv, int max)
  */
 int copy_strings(int argc,char ** argv, struct linux_binprm *bprm) 
 {
+	struct page *kmapped_page = NULL;
+	char *kaddr = NULL;
+	int ret;
+
 	while (argc-- > 0) {
 		char *str;
 		int len;
 		unsigned long pos;
 
-		if (get_user(str, argv+argc) || !(len = strnlen_user(str, bprm->p)))
-			return -EFAULT;
-		if (bprm->p < len) 
-			return -E2BIG; 
+		if (get_user(str, argv+argc) ||
+				!(len = strnlen_user(str, bprm->p))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (bprm->p < len)  {
+			ret = -E2BIG;
+			goto out;
+		}
 
 		bprm->p -= len;
 		/* XXX: add architecture specific overflow check here. */ 
-
 		pos = bprm->p;
+
+		/*
+		 * The only sleeping function which we are allowed to call in
+		 * this loop is copy_from_user().  Otherwise, copy_user_state
+		 * could get trashed.
+		 */
 		while (len > 0) {
-			char *kaddr;
 			int i, new, err;
-			struct page *page;
 			int offset, bytes_to_copy;
+			struct page *page;
 
 			offset = pos % PAGE_SIZE;
 			i = pos/PAGE_SIZE;
@@ -211,32 +225,44 @@ int copy_strings(int argc,char ** argv, 
 			if (!page) {
 				page = alloc_page(GFP_HIGHUSER);
 				bprm->page[i] = page;
-				if (!page)
-					return -ENOMEM;
+				if (!page) {
+					ret = -ENOMEM;
+					goto out;
+				}
 				new = 1;
 			}
-			kaddr = kmap(page);
 
+			if (page != kmapped_page) {
+				if (kmapped_page)
+					kunmap(kmapped_page);
+				kmapped_page = page;
+				kaddr = kmap(kmapped_page);
+			}
 			if (new && offset)
 				memset(kaddr, 0, offset);
 			bytes_to_copy = PAGE_SIZE - offset;
 			if (bytes_to_copy > len) {
 				bytes_to_copy = len;
 				if (new)
-					memset(kaddr+offset+len, 0, PAGE_SIZE-offset-len);
+					memset(kaddr+offset+len, 0,
+						PAGE_SIZE-offset-len);
+			}
+			err = copy_from_user(kaddr+offset, str, bytes_to_copy);
+			if (err) {
+				ret = -EFAULT;
+				goto out;
 			}
-			err = copy_from_user(kaddr + offset, str, bytes_to_copy);
-			kunmap(page);
-
-			if (err)
-				return -EFAULT; 
 
 			pos += bytes_to_copy;
 			str += bytes_to_copy;
 			len -= bytes_to_copy;
 		}
 	}
-	return 0;
+	ret = 0;
+out:
+	if (kmapped_page)
+		kunmap(kmapped_page);
+	return ret;
 }
 
 /*
--- 2.5.25/include/asm-i386/highmem.h~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/include/asm-i386/highmem.h	Tue Jul  9 21:12:35 2002
@@ -22,6 +22,7 @@
 
 #include <linux/config.h>
 #include <linux/interrupt.h>
+#include <linux/percpu.h>
 #include <asm/kmap_types.h>
 #include <asm/tlbflush.h>
 
@@ -76,6 +77,8 @@ static inline void kunmap(struct page *p
  * be used in IRQ contexts, so in some (very limited) cases we need
  * it.
  */
+extern int kmap_atomic_seq[KM_TYPE_NR] __per_cpu_data;
+
 static inline void *kmap_atomic(struct page *page, enum km_type type)
 {
 	enum fixed_addresses idx;
@@ -93,7 +96,7 @@ static inline void *kmap_atomic(struct p
 #endif
 	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
 	__flush_tlb_one(vaddr);
-
+	this_cpu(kmap_atomic_seq[type])++;
 	return (void*) vaddr;
 }
 
--- 2.5.25/include/asm-i386/processor.h~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/include/asm-i386/processor.h	Tue Jul  9 21:12:35 2002
@@ -485,4 +485,6 @@ extern inline void prefetchw(const void 
 
 #endif
 
+#define ARCH_HAS_KMAP_FIXUP
+
 #endif /* __ASM_I386_PROCESSOR_H */
--- 2.5.25/include/linux/highmem.h~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/include/linux/highmem.h	Tue Jul  9 21:54:59 2002
@@ -3,6 +3,7 @@
 
 #include <linux/config.h>
 #include <linux/fs.h>
+#include <asm/processor.h>
 #include <asm/cacheflush.h>
 
 #ifdef CONFIG_HIGHMEM
@@ -10,6 +11,7 @@
 extern struct page *highmem_start_page;
 
 #include <asm/highmem.h>
+#include <asm/kmap_types.h>
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
@@ -72,4 +74,82 @@ static inline void copy_user_highpage(st
 	kunmap_atomic(vto, KM_USER1);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(ARCH_HAS_KMAP_FIXUP)
+/*
+ * Used when performing a copy_*_user while holding an atomic kmap
+ */
+struct copy_user_state {
+	struct page *page;	/* The page which is kmap_atomiced */
+	void *kaddr;		/* Its mapping */
+	enum km_type type;	/* Its kmap slot type */
+	int src;		/* 1: fixup ESI.  0: Fixup EDI */
+	int cpu;		/* CPU which the kmap was taken on */
+	int seq;		/* The kmap's sequence number */
+};
+
+/*
+ * `src' is true if the kmap_atomic virtual address is the source of the copy.
+ */
+static inline void *
+kmap_copy_user(struct copy_user_state *cus, struct page *page,
+			enum km_type type, int src)
+{
+	cus->page = page;
+	cus->kaddr = kmap_atomic(page, type);
+	if (PageHighMem(page)) {
+		cus->type = type;
+		cus->src = src;
+		BUG_ON(current->copy_user_state != NULL);
+		current->copy_user_state = cus;
+	}
+	return cus->kaddr;
+}
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	if (PageHighMem(cus->page)) {
+		BUG_ON(current->copy_user_state != cus);
+		kunmap_atomic(cus->kaddr, cus->type);
+		current->copy_user_state = NULL;
+		cus->page = NULL;	/* debug */
+	}
+}
+
+/*
+ * After a copy_*_user, the kernel virtual address may be different.  So
+ * use kmap_copy_user_addr() to get the new value.
+ */
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return cus->kaddr;
+}
+
+#else
+
+struct copy_user_state {
+	struct page *page;
+};
+
+/*
+ * This must be a macro because `type' may be undefined
+ */
+
+#define kmap_copy_user(cus, page, type, src)	\
+	({					\
+		(cus)->page = (page);		\
+		kmap(page);			\
+	})
+
+static inline void kunmap_copy_user(struct copy_user_state *cus)
+{
+	kunmap(cus->page);
+}
+
+static inline void *kmap_copy_user_addr(struct copy_user_state *cus)
+{
+	return page_address(cus->page);
+}
+
+#endif
+
 #endif /* _LINUX_HIGHMEM_H */
--- 2.5.25/include/linux/sched.h~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/include/linux/sched.h	Tue Jul  9 21:12:35 2002
@@ -248,6 +248,8 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 
+struct copy_user_state;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -365,6 +367,9 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 	struct dentry *proc_dentry;
+#ifdef CONFIG_HIGHMEM
+	struct copy_user_state *copy_user_state;
+#endif
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
--- 2.5.25/mm/filemap.c~linus-copy_user-hack	Tue Jul  9 21:12:35 2002
+++ 2.5.25-akpm/mm/filemap.c	Tue Jul  9 21:53:51 2002
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
 #include <linux/iobuf.h>
@@ -1183,18 +1184,20 @@ static ssize_t generic_file_direct_IO(in
 	return retval;
 }
 
-int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+int file_read_actor(read_descriptor_t *desc, struct page *page,
+			unsigned long offset, unsigned long size)
 {
 	char *kaddr;
+	struct copy_user_state copy_user_state;
 	unsigned long left, count = desc->count;
 
 	if (size > count)
 		size = count;
 
-	kaddr = kmap(page);
+	kaddr = kmap_copy_user(&copy_user_state, page, KM_FILEMAP, 1);
 	left = __copy_to_user(desc->buf, kaddr + offset, size);
-	kunmap(page);
-	
+	kunmap_copy_user(&copy_user_state);
+
 	if (left) {
 		size -= left;
 		desc->error = -EFAULT;
--- 2.5.25/include/asm-i386/kmap_types.h~linus-copy_user-hack	Tue Jul  9 21:13:02 2002
+++ 2.5.25-akpm/include/asm-i386/kmap_types.h	Tue Jul  9 21:13:43 2002
@@ -19,7 +19,8 @@ D(5)	KM_BIO_SRC_IRQ,
 D(6)	KM_BIO_DST_IRQ,
 D(7)	KM_PTE0,
 D(8)	KM_PTE1,
-D(9)	KM_TYPE_NR
+D(9)	KM_FILEMAP,
+D(10)	KM_TYPE_NR
 };
 
 #undef D
--- 2.5.25/include/asm-ppc/kmap_types.h~linus-copy_user-hack	Tue Jul  9 21:13:06 2002
+++ 2.5.25-akpm/include/asm-ppc/kmap_types.h	Tue Jul  9 21:13:49 2002
@@ -15,6 +15,7 @@ enum km_type {
 	KM_BIO_DST_IRQ,
 	KM_PTE0,
 	KM_PTE1,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-sparc/kmap_types.h~linus-copy_user-hack	Tue Jul  9 21:13:09 2002
+++ 2.5.25-akpm/include/asm-sparc/kmap_types.h	Tue Jul  9 21:13:55 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 
--- 2.5.25/include/asm-x86_64/kmap_types.h~linus-copy_user-hack	Tue Jul  9 21:13:12 2002
+++ 2.5.25-akpm/include/asm-x86_64/kmap_types.h	Tue Jul  9 21:14:01 2002
@@ -9,6 +9,7 @@ enum km_type {
 	KM_USER1,
 	KM_BIO_SRC_IRQ,
 	KM_BIO_DST_IRQ,
+	KM_FILEMAP,
 	KM_TYPE_NR
 };
 

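To make the calling convention concrete, here is a sketch of what a
write-side user of this API might look like.  This is not part of the
patch, and copy_from_user_to_page() is an invented name; src=0 because
the kmap is the destination of the copy, so the fault fixup adjusts
EDI rather than ESI:

static inline unsigned long
copy_from_user_to_page(struct page *page, unsigned offset,
			const char *buf, unsigned bytes)
{
	struct copy_user_state cus;
	char *kaddr;
	unsigned long left;

	/* src=0: fix up EDI if the atomic kmap moves during the copy */
	kaddr = kmap_copy_user(&cus, page, KM_FILEMAP, 0);
	left = __copy_from_user(kaddr + offset, buf, bytes);
	kunmap_copy_user(&cus);
	return left;
}

If the kernel address is needed after the copy it must be refetched
with kmap_copy_user_addr() rather than by reusing `kaddr', because the
fault fixup may have switched the underlying atomic kmap.
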
^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-10  4:38                                         ` John Levon
@ 2002-07-10  5:46                                           ` Karim Yaghmour
  -1 siblings, 0 replies; 87+ messages in thread
From: Karim Yaghmour @ 2002-07-10  5:46 UTC (permalink / raw)
  To: John Levon
  Cc: Linus Torvalds, Andrew Morton, Andrea Arcangeli, Rik van Riel,
	linux-mm, Martin J. Bligh, linux-kernel, Richard Moore, bob


John Levon wrote:
> > And the list goes on.
> 
> Sure, there are all sorts of things where some tracing can come in
> useful. The question is whether it's really something the mainline
> kernel should be doing, and if the gung-ho approach is nice or not.

I'm not sure how to read "gung-ho approach" but I would point out that
LTT wasn't built overnight. It's been out there for 3 years now.

As for whether the mainline kernel should have it or not, then I
think the standard has always been set using the actual purpose:
Is this useful to users or is this only useful to kernel developers?
Whenever it was the latter, the answer has almost always been NO.
In the case of tracing I think that not only is it "useful" to users,
but I would say "essential". Would you ask a user to recompile his
kernel with a ptrace patch because he needs to use gdb? I don't
think so, and I don't think application developers or system
administrators should have to recompile their kernel either in
order to understand the system's dynamics.

If facts are of interest, then I would point out that LTT is already
part of a number of distributions out there. Most other distros that
don't have it included find it very useful but don't want to get
involved in maintaining it until it's part of the kernel.

> > The fact that so many kernel subsystems already have their own tracing
> > built-in (see other posting)
> 
> Your list was almost entirely composed of per-driver debug routines.

I don't see any contradiction here. This is part of what I'm pointing out.
Mainly that understanding the dynamic behavior of the system is essential
to software development, especially when complex interactions, such as
in the Linux kernel, take place.

> This is not the same thing as logging trap entry/exits, syscalls etc
> etc, on any level, and I'm a bit perplexed that you're making such an
> association.

In regards to trap entry/exits, there was a talk a couple of years ago
by Nat Friedman, I think, which discussed GNU rope. Basically, it
identified the location of page fault misses at the start of programs
and used this information to reorder the binary in order to accelerate
its load time. That's just one example where traps are of interest.

Not to mention that traps can result in scheduling changes and that
knowing why a process has been scheduled out is all part of understanding
the system's dynamics.

As for syscalls, oprofile, for one, really needs this sort of info. So
I don't see your point.

There are 2 things LTT provides in the kernel:
1- A unified tracing and high-throughput data-collection system
2- A select list of trace points in the kernel

Item #1 can easily replace the existing ad-hoc implementation while
serving oprofile, DProbes and others. Item #2 is of prime interest
to application developers and system administrators.

> > expect user-space developers to efficiently use the kernel if they
> > have
> > absolutely no idea about the dynamic interaction their processes have
> > with the kernel and how this interaction is influenced by and
> > influences
> > the interaction with other processes?
> 
> This is clearly an exaggeration. And seeing as something like LTT
> doesn't (and cannot) tell the "whole story" either, I could throw the
> same argument directly back at you. The point is, there comes a point of
> no return where usefulness gets outweighed by ugliness. For the very few
> cases that such detailed information is really useful, the user can
> usually install the needed special-case tools.

I'd really be interested in seeing what you mean by "ugliness". If there's
a list of grievances you have with LTT then I'm all ears.

Anything inserted by LTT is clean and clear. It doesn't change the
kernel's normal behavior, and once a trace point is inserted it
requires almost zero maintenance. Take for example the scheduling change
trace point (kernel/sched.c:schedule()):
	TRACE_SCHEDCHANGE(prev, next);

I don't see why this should be ugly. It's an inline that results in
zero lines of code if you configure tracing as off.
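
For concreteness, here is roughly what such a compile-away trace point
can look like.  This is a sketch with invented names: trace_event() and
TRACE_EV_SCHEDCHANGE are illustrative, not LTT's actual symbols.

#ifdef CONFIG_TRACE
extern void trace_event(int event_id, const void *data, int size);

#define TRACE_EV_SCHEDCHANGE	1	/* illustrative event id */

#define TRACE_SCHEDCHANGE(prev, next)					\
	do {								\
		struct { pid_t out, in; } _ev = {			\
			(prev)->pid, (next)->pid			\
		};							\
		trace_event(TRACE_EV_SCHEDCHANGE, &_ev, sizeof(_ev));	\
	} while (0)
#else
/* Tracing configured off: the trace point compiles to nothing */
#define TRACE_SCHEDCHANGE(prev, next)	do { } while (0)
#endif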

The cases I presented earlier clearly show the usefulness of this
information. The developer who needs to solve his synchronization
problem or the system administrator who wants to understand why his
box is so slow doesn't really want to patch/recompile/reboot to fix
his problem.

You would like to paint these as "very few cases". Unfortunately
these cases are much more common than you say they are.

> In contrast a profiling mechanism that improves on the poor lot that
> currently exists (gprof, readprofile) has a truly general utility, and
> can hopefully be done without too much ugliness.

Somehow it is not justifiable to add a feature the like of which didn't
exist before but it is justifiable to add a feature which only "improves"
on existing tools?

As I said before, LTT and the information it provides has a truly
general utility: Oprofile can use it, DProbes uses it, a slew of existing
ad-hoc tracing systems in the kernel can be replaced by it, application
developers can isolate synchronization problems with it, system
administrators can identify performance bottlenecks with it, etc.

An argument over which is more useful between LTT and oprofile is bound
to be useless. If nothing else, I see LTT as having a more general use
than oprofile. But let's not get into that. What I'm really interested in
here is the possibility of having one unified tracing and data collection
system which serves many purposes instead of having each subsystem or
profiler have its own tracing and data collection mechanism.

> The primary reason I want to see something like this is to kill the ugly
> code I have to maintain.

I can't say that my goals are any different.

> > > The entry.S examine-the-registers approach is simple enough, but
> > > it's
> > > not much more tasteful than sys_call_table hackery IMHO
> >
> > I guess we won't agree on this. From my point of view it is much
> > better
> > to have the code directly within entry.S for all to see instead of
> > having some external software play around with the syscall table in a
> > way kernel users can't trace back to the kernel's own code.
> 
> Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
> thing wasn't much better ...

I know you weren't saying that. I said that from _my point of view_ the
entry.S "thing" is much better.

Cheers,

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Enhanced profiling support (was Re: vm lock contention reduction)
  2002-07-10  4:38                                         ` John Levon
@ 2002-07-10 13:10                                           ` bob
  -1 siblings, 0 replies; 87+ messages in thread
From: bob @ 2002-07-10 13:10 UTC (permalink / raw)
  To: John Levon
  Cc: Karim Yaghmour, Linus Torvalds, Andrew Morton, Andrea Arcangeli,
	Rik van Riel, linux-mm, Martin J. Bligh, linux-kernel,
	Richard Moore, bob, okrieg

John,
     I have been cc'ed on this email as I was an active participant at the
RAS performance monitoring/tracing session at OLS.  Let me preface by
saying that my view may be a bit biased as I have worked on the core
tracing infrastructure that went into IRIX in the mid 90s as well as the
tracing infrastructure for K42, our research OS (see
http://www.research.ibm.com/K42/index.html), and in both cases helped solve
performance issues that would not have been otherwise solved.  From the
below it doesn't appear that anyone is arguing that tracing is not useful.
The debate (except for some of the details) appears over whether it should
be included in the kernel in a first-class manner or individual mechanisms
put in on an ad-hoc basis.  While this is indeed philosophical, let me
share some experiences and benefits from other systems I've worked on:
 1) The mechanism proposed is very non-invasive: a single line of
code (some TRACE_XXX macro or the like) is added to the area of interest.
Further, at OLS, the proposal was to add only a few trace points.
Programming-wise this does not clutter the code - in fact having a single
well-known unified mechanism is cleaner coding than a set of one-off
ways, as when anyone sees a trace macro it will be clear what it is.
 2) In the end, there will be less intrusion with a single unified
approach.  With a series of different mechanisms, over time multiple events
will get added in the same place, creating performance issues and, more
importantly, causing confusion.
 3) A unified approach will uncover performance issues not explicitly being
searched for, and allow ones of known interest to be tracked down without
adding a patch (which may be cumbersome to maintain) and recompiling.
 4) In both my experiences, I have had resistance to adding this
tracing infrastructure, and in both experiences other kernel developers
have come back after the fact and thanked me for my persistence :-), as it
helped them solve timing sensitive or other such issues they were having
great difficulty with.

If there is interest, I would be happy to set up a conference number so people
who are interested could all speak.

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
bob@watson.ibm.com

----

John Levon writes:
 > On Wed, Jul 10, 2002 at 12:16:05AM -0400, Karim Yaghmour wrote:
 >  
 > [snip]
 >  
 > > And the list goes on.
 >  
 > Sure, there are all sorts of things where some tracing can come in
 > useful. The question is whether it's really something the mainline
 > kernel should be doing, and if the gung-ho approach is nice or not.
 > 
 > > The fact that so many kernel subsystems already have their own tracing
 > > built-in (see other posting)
 >  
 > Your list was almost entirely composed of per-driver debug routines.
 > This is not the same thing as logging trap entry/exits, syscalls etc
 > etc, on any level, and I'm a bit perplexed that you're making such an
 > association.
 >  
 > > expect user-space developers to efficiently use the kernel if they
 > > have
 > > absolutely no idea about the dynamic interaction their processes have
 > > with the kernel and how this interaction is influenced by and
 > > influences
 > > the interaction with other processes?
 >  
 > This is clearly an exaggeration. And seeing as something like LTT
 > doesn't (and cannot) tell the "whole story" either, I could throw the
 > same argument directly back at you. The point is, there comes a point of
 > no return where usefulness gets outweighed by ugliness. For the very few
 > cases that such detailed information is really useful, the user can
 > usually install the needed special-case tools.
 >  
 > In contrast a profiling mechanism that improves on the poor lot that
 > currently exists (gprof, readprofile) has a truly general utility, and
 > can hopefully be done without too much ugliness.
 >  
 > The primary reason I want to see something like this is to kill the ugly
 > code I have to maintain.
 > 
 > > > The entry.S examine-the-registers approach is simple enough, but
 > > > it's
 > > > not much more tasteful than sys_call_table hackery IMHO
 > >
 > > I guess we won't agree on this. From my point of view it is much
 > > better
 > > to have the code directly within entry.S for all to see instead of
 > > having some external software play around with the syscall table in a
 > > way kernel users can't trace back to the kernel's own code.
 > 
 > Eh ? I didn't say sys_call_table hackery was better. I said the entry.S
 > thing wasn't much better ...
 > 
 > regards
 > john
 > 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-10  5:32                                     ` Andrew Morton
@ 2002-07-10 22:43                                       ` Martin J. Bligh
  2002-07-10 23:08                                         ` Andrew Morton
  0 siblings, 1 reply; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-10 22:43 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, linux-mm

> Here's the diff.  The kmap() and kmap_atomic() rate is way down
> now.  Still no benefit from it all through.  Martin.  Help.

Hmmm ... well I have some preliminary results on the 16-way NUMA
for kernel compile, and it doesn't make things faster - if anything there's
a barely perceptible slowdown (may just be statistical error). 

On the other hand, Hanna just did some dbench measurements on an
8-way SMP, and got about 15% improvement out of it (she mailed the
results to lkml just now). 

The profile comparison between 2.4 and 2.5 is interesting. Unless I've
screwed something up in the profiling, it seems like we're spending a lot
of time in do_page_fault (with or without your patch). It's *so* different
that I'm inclined to suspect my profiling .... (profile=2).

2.5:

 46361 total                                      0.3985
 36500 default_idle                             570.3125
  5556 do_page_fault                              4.5653
  1087 do_softirq                                 5.2260
   673 pte_alloc_one                              3.8239
   529 schedule                                   0.5700
   304 exit_notify                                0.3800
   192 __wake_up                                  1.7143
   188 do_fork                                    0.0904
   180 system_call                                4.0909
   174 pgd_alloc                                  2.7188
   109 timer_bh                                   0.1548

2.4 + patches

 22256 total                                      0.0237
 13510 default_idle                             259.8077
  2042 _text_lock_swap                           37.8148
   585 lru_cache_add                              6.3587
   551 do_anonymous_page                          1.6596
   488 do_generic_file_read                       0.4388
   401 lru_cache_del                             18.2273
   385 _text_lock_namei                           0.3688
   363 __free_pages_ok                            0.6927
   267 __generic_copy_from_user                   2.5673
   222 __d_lookup                                 0.8672
   211 zap_page_range                             0.2409
   188 _text_lock_dec_and_lock                    7.8333
   179 rmqueue                                    0.4144
   178 __find_get_page                            2.7812
   161 _text_lock_read_write                      1.3644
   157 file_read_actor                            0.6886
   151 nr_free_pages                              2.9038
   123 link_path_walk                             0.0478
   118 set_page_dirty                             1.2292
   101 fput                                       0.4353

Interestingly, kmap doesn't show up in the virgin 2.5 profile at all,
but it does in the 2.4 profile ... 

   7 flush_all_zero_pkmaps                      0.0700
109 kmap_high                                  0.3028



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-10 22:43                                       ` Martin J. Bligh
@ 2002-07-10 23:08                                         ` Andrew Morton
  2002-07-10 23:26                                           ` Martin J. Bligh
  2002-07-12 17:48                                           ` Martin J. Bligh
  0 siblings, 2 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-10 23:08 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

"Martin J. Bligh" wrote:
> 
> > Here's the diff.  The kmap() and kmap_atomic() rate is way down
> > now.  Still no benefit from it all through.  Martin.  Help.
> 
> Hmmm ... well I have some preliminary results on the 16-way NUMA
> for kernel compile, and it doesn't make things faster - if anything there's
> a barely perceptible slowdown (may just be statistical error).
> 
> On the other hand, Hanna just did some dbench measurements on an
> 8-way SMP, and got about 15% improvement out of it (she mailed the
> results to lkml just now).

Yes, thanks.  Still scratching my head over that lot.

> The profile comparison between 2.4 and 2.5 is interesting. Unless I've
> screwed something up in the profiling, seems like we're spending a lot
> of time in do_page_fault (with or without your patch). It's *so* different,
> that I'm inclined to suspect my profiling .... (profile=2).

Yes, I've seen the in-kernel profiler doing odd things.  If 
you're not using the local APIC timer then I think the new
IRQ balancing code will break the profiler by steering the
clock interrupts away from busy CPUs (!).

But ISTR that the profiler has gone whacky even with CONFIG_X86_LOCAL_APIC,
which shouldn't be affected by the IRQ steering.

But NMI-based oprofile is bang-on target so I recommend you use that.
I'll publish my oprofile-for-2.5 asap.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-10 23:08                                         ` Andrew Morton
@ 2002-07-10 23:26                                           ` Martin J. Bligh
  2002-07-11  0:19                                             ` Andrew Morton
  2002-07-12 17:48                                           ` Martin J. Bligh
  1 sibling, 1 reply; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-10 23:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

> Yes, I've seen the in-kernel profiler doing odd things.  If 
> you're not using the local APIC timer then I think the new
> IRQ balancing code will break the profiler by steering the
> clock interrupts away from busy CPUs (!).

The first patch we apply is this

diff -Nur linux-2.5.23-vanilla/arch/i386/kernel/io_apic.c linux-2.5.23-patched/arch/i386/kernel/io_apic.c
--- linux-2.5.23-vanilla/arch/i386/kernel/io_apic.c	Tue Jun 18 19:11:52 2002
+++ linux-2.5.23-patched/arch/i386/kernel/io_apic.c	Thu Jun 27 14:28:51 2002
@@ -247,7 +247,7 @@
 
 static inline void balance_irq(int irq)
 {
-#if CONFIG_SMP
+#if (CONFIG_SMP && !CONFIG_MULTIQUAD)
 	irq_balance_t *entry = irq_balance + irq;
 	unsigned long now = jiffies;

Which should turn off IRQ balancing completely, I think ...

> But ISTR that the profiler has gone whacky even with CONFIG_X86_LOCAL_APIC,
> which shouldn't be affected by the IRQ steering.

Interrupt 0 will only ever go to the first quad (well, it should do
if I actually fixed the timers), but CONFIG_X86_LOCAL_APIC=y
is on. Weird ...

> But NMI-based oprofile is bang-on target so I recommend you use that.
> I'll publish my oprofile-for-2.5 asap.

That'd be good, but I'm not sure my box likes NMIs too much ;-)
We'll see ....

M.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-10 23:26                                           ` Martin J. Bligh
@ 2002-07-11  0:19                                             ` Andrew Morton
  0 siblings, 0 replies; 87+ messages in thread
From: Andrew Morton @ 2002-07-11  0:19 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

"Martin J. Bligh" wrote:
> 
> ...
> > But NMI-based oprofile is bang-on target so I recommend you use that.
> > I'll publish my oprofile-for-2.5 asap.
> 
> That'd be good, but I'm not sure my box likes NMIs too much ;-)
> We'll see ....

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.25/oprofile.patch.gz

Now, an oprofile trick for young players: the most useful metric
is CPU_CLK_UNHALTED.  And the example command line at
http://oprofile.sourceforge.net/doc.php3 works just fine.

But the kernel halts the clock in default_idle, and the numbers
you get out of the profiler only reflect the amount of time which
was spent with the clock unhalted.

So for example if you run oprofile against an idle machine,
it looks like the machine is spending 40% of its cycles handling
the clock timer.  Because it doesn't account for halted cycles.

I find this interpolation hurts my brain too much, so I always
use the `idle=poll' kernel boot parameter so the clock is never
halted.  This gives profiles which are comprehensible even to simple
Australians.

So.

- Set NR_CPUS to 8.  Otherwise oprofile does kmalloc(256kbytes)
  and won't start.  This is Rusty's fault.

- patch, build, install kernel

- oprofile keeps stuff in /var/opd, and I'm never sure whether
  my profiles are fresh, or are a mixture of this one and the
  previous one.  So I always blow away /var/opd first.

	rm -rf /var/opd
	<start benchmark>
	op_start --vmlinux=/boot/vmlinux --map-file=/boot/System.map \
			--ctr0-event=CPU_CLK_UNHALTED --ctr0-count=300000
	op_stop
	<benchmark ends>
	oprofpp -dl -i /boot/vmlinux

Easy.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-10 23:08                                         ` Andrew Morton
  2002-07-10 23:26                                           ` Martin J. Bligh
@ 2002-07-12 17:48                                           ` Martin J. Bligh
  2002-07-13 11:18                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 87+ messages in thread
From: Martin J. Bligh @ 2002-07-12 17:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, linux-mm

OK, preliminary results: we've seen about another 15% reduction in CPU load
on Apache Specweb99 on an 8-way machine with Andrew's kmap patches!
Will send out some more detailed numbers later from the official specweb
machine (thanks to Dave Hansen for running the prelim tests).

Secondly, I'd like to propose yet another mechanism, which would
also be a cheaper way to do things .... based vaguely on an RCU
type mechanism:

When you go to allocate a new global kmap, the danger is that its PTE
entry has not been TLB flushed, and the old value is still in some CPU's
TLB cache.

If only this task context is going to use this kmap (eg copy_to_user), 
all we need do is check that we have context switched since we last 
used this kmap entry (since it was freed is easiest). If we have not, we 
merely do a local single line invalidate of that entry. If we switch to 
running on any other CPU in the future, we'll do a global TLB flush on 
the switch, so no problem there. I suspect that 99% of the time, this 
means no TLB flush at all, or even an invalidate.

If multiple task contexts might use this kmap, we need to check that
ALL cpus have done a context switch since this entry was last used.
If not, we send a single line invalidate to only those other CPUs that
have not switched, and thus might still have a dirty entry ...

I believe RCU already has all the mechanisms for checking context
switches. By context switch, I really mean TLB flush - ie switched
processes, not just threads.

Madness?

M.
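
To make the bookkeeping concrete, here is a back-of-the-envelope sketch
of the idea.  All names are invented (flush_count, freed_flush_count and
ipi_flush_one are not existing kernel symbols), and "context switch" is
tracked as a per-cpu count of full TLB flushes, per the clarification
above:

static unsigned long flush_count[NR_CPUS];	/* bump on every full TLB flush */
static unsigned long freed_flush_count[LAST_PKMAP][NR_CPUS];

/* At free time, record how many full flushes each CPU had done */
static void kmap_note_free(int slot)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		freed_flush_count[slot][cpu] = flush_count[cpu];
}

/* At reuse time, invalidate only on CPUs which have not flushed since */
static void kmap_invalidate_if_needed(int slot)
{
	unsigned long vaddr = PKMAP_ADDR(slot);
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (flush_count[cpu] != freed_flush_count[slot][cpu])
			continue;	/* flushed since free: stale entry gone */
		if (cpu == smp_processor_id())
			__flush_tlb_one(vaddr);		/* cheap local invalidate */
		else
			ipi_flush_one(cpu, vaddr);	/* targeted cross-call */
	}
}

In the common case every CPU has done a full flush since the slot was
freed, so the loop sends no invalidations at all.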

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: scalable kmap (was Re: vm lock contention reduction)
  2002-07-12 17:48                                           ` Martin J. Bligh
@ 2002-07-13 11:18                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2002-07-13 11:18 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Linus Torvalds, Rik van Riel, linux-mm

On Fri, Jul 12, 2002 at 10:48:06AM -0700, Martin J. Bligh wrote:
> OK, preliminary results we've seen about another 15% reduction in CPU load
> on Apache Specweb99 on an 8-way machine with Andrew's kmap patches!
> Will send out some more detailed numbers later from the official specweb
> machine (thanks to Dave Hansen for running the prelim tests).
> 
> Secondly, I'd like to propose yet another mechanism, which would
> also be a cheaper way to do things .... based vaguely on an RCU
> type mechanism:
> 
> When you go to allocate a new global kmap, the danger is that its PTE
> entry has not been TLB flushed, and the old value is still in some CPUs 
> TLB cache.
> 
> If only this task context is going to use this kmap (eg copy_to_user), 
> all we need do is check that we have context switched since we last 

you must also turn off the global bit in kmap_prot first if you
want this to work.

> used this kmap entry (since it was freed is easiest). If we have not, we 
> merely do a local single line invalidate of that entry. If we switch to 
> running on any other CPU in the future, we'll do a global TLB flush on 
> the switch, so no problem there. I suspect that 99% of the time, this 
> means no TLB flush at all, or even an invalidate.
> 
> If multiple task contexts might use this kmap, we need to check that
> ALL cpus have done an context switch since this entry was last used.
> If not, we send a single line invalidate to only those other CPUs that
> have not switched, and thus might still have a dirty entry ...
> 
> I believe RCU already has all the mechanisms for checking context
> switches. By context switch, I really mean TLB flush - ie switched

RCU doesn't have and doesn't need that kind of mechanism. First of all,
RCU doesn't even track context switches, it only tracks quiescent points
(if somebody called schedule(), that's a quiescent point even if there's
no context switch and we keep running the current task afterwards, maybe
because it's the only running task). But even tracking context switches
isn't enough; as you said, you have to track mm switches, which happen in
yet another place.

So the mm-switch code would need new instrumentation, but that's not
the main issue; it's just that the logic to keep track of this info seems
overcomplex, and doesn't really sound like an obvious optimization to me.
Also I would remind you that it should be implemented in a zero-cost way
for the 64bit archs, but that's certainly not a problem (just a reminder :).

> processes, not just threads.
> 
> Madness?
> 
> M.


Andrea
^ permalink raw reply	[flat|nested] 87+ messages in thread
