linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Export current_is_keventd() for libphy
@ 2006-12-03  5:50 Ben Collins
  2006-12-03  9:16 ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Ben Collins @ 2006-12-03  5:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

Fixes this problem when libphy is compiled as module:

WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 17c2f03..1cf226b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -608,6 +608,7 @@ int current_is_keventd(void)
 	return ret;
 
 }
+EXPORT_SYMBOL_GPL(current_is_keventd);
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* Take the work from this (downed) CPU. */


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-03  5:50 [PATCH] Export current_is_keventd() for libphy Ben Collins
@ 2006-12-03  9:16 ` Andrew Morton
  2006-12-04 19:17   ` Steve Fox
  2006-12-05 17:48   ` Maciej W. Rozycki
  0 siblings, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-03  9:16 UTC (permalink / raw)
  To: Ben Collins; +Cc: linux-kernel, Linus Torvalds, Maciej W. Rozycki, Jeff Garzik

On Sun, 03 Dec 2006 00:50:55 -0500
Ben Collins <ben.collins@ubuntu.com> wrote:

> Fixes this problem when libphy is compiled as module:
> 
> WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 17c2f03..1cf226b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -608,6 +608,7 @@ int current_is_keventd(void)
>  	return ret;
>  
>  }
> +EXPORT_SYMBOL_GPL(current_is_keventd);
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Take the work from this (downed) CPU. */

wtf?  That code was merged?  This bug has been known for months and after
several unsuccessful attempts at trying to understand what on earth that
hackery is supposed to be doing I gave up on it and asked Jeff to look after
it.

Maciej, please, in very small words written in a very large font, explain to
us what is going on in phy_stop_interrupts()?  Include pictures.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-03  9:16 ` Andrew Morton
@ 2006-12-04 19:17   ` Steve Fox
  2006-12-05 18:05     ` Maciej W. Rozycki
  2006-12-05 17:48   ` Maciej W. Rozycki
  1 sibling, 1 reply; 54+ messages in thread
From: Steve Fox @ 2006-12-04 19:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ben Collins, linux-kernel, Linus Torvalds, macro, afleming

On Sun, 2006-12-03 at 01:16 -0800, Andrew Morton wrote:
> On Sun, 03 Dec 2006 00:50:55 -0500
> Ben Collins <ben.collins@ubuntu.com> wrote:
> 
> > Fixes this problem when libphy is compiled as module:
> >
> > WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 17c2f03..1cf226b 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -608,6 +608,7 @@ int current_is_keventd(void)
> >  	return ret;
> >
> >  }
> > +EXPORT_SYMBOL_GPL(current_is_keventd);
> >
> >  #ifdef CONFIG_HOTPLUG_CPU
> >  /* Take the work from this (downed) CPU. */
> 
> wtf?  That code was merged?  This bug has been known for months and after
> several unsuccessful attempts at trying to understand what on earth that
> hackery is supposed to be doing I gave up on it and asked Jeff to look after
> it.
> 
> Maciej, please, in very small words written in a very large font, explain to
> us what is going on in phy_stop_interrupts()?  Include pictures.

Unfortunately Maciej didn't get cc'ed correctly on your mail. Hopefully
I've added the right person to this post as well as Andy who has also
recently changed this code.

We're also hitting this on one of the test.kernel.org machines, bl6-13.

-- 

Steve Fox
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-03  9:16 ` Andrew Morton
  2006-12-04 19:17   ` Steve Fox
@ 2006-12-05 17:48   ` Maciej W. Rozycki
  2006-12-05 18:07     ` Linus Torvalds
                       ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-05 17:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Sun, 3 Dec 2006, Andrew Morton wrote:

> wtf?  That code was merged?  This bug has been known for months and after
> several unsuccessful attempts at trying to understand what on earth that
> hackery is supposed to be doing I gave up on it and asked Jeff to look after
> it.

 I am surprised it got merged too -- my understanding from the discussion 
was it was to be sorted out somehow within libphy, one way or another.

> Maciej, please, in very small words written in a very large font, explain to
> us what is going on in phy_stop_interrupts()?  Include pictures.

 Would ASCII art qualify as pictures?  Regardless, I am providing a 
textual description, which please feel free to skip to the conclusion 
below if uninterested in the details.

 Essentially there is a race when disconnecting from a PHY, because 
interrupt delivery uses the event queue for processing.  The function to 
handle interrupts that is called from the event queue is phy_change().  
It takes a pointer to a structure that is associated with the PHY.  At the 
time phy_stop_interrupts() is called there may be one or more calls to 
phy_change() still pending on the event queue.  They may not be able to be 
processed until the structure passed to phy_change() have been freed, at 
which point calling the function is wrong.

 One way of avoiding it is calling flush_scheduled_work() from 
phy_stop_interrupts().  This is fine as long as a caller of 
phy_stop_interrupts() (not necessarily the immediate one calling into 
libphy) does not hold the netlink lock.

 If a caller indeed holds the netlink lock, then a driver effectively 
calling phy_stop_interrupts() may arrange for the function to be itself 
scheduled through the event queue.  This has the effect of avoiding the 
race as well, as the queue is processed in order, except it causes more 
hassle for the driver.  Hence the choice was left to the driver's author 
-- if a driver "knows" the netlink lock is not going to be held at that 
point, it may call phy_stop_interrupts() directly, otherwise it shall use 
the event queue.

 With such an assumption in place the function has to check somehow 
whether it has been scheduled through the queue or not and act 
accordingly, which is why that "if" clause is there.

 Now I gather the conclusion was the whole mess was going to be included 
within libphy and not exposed to Ethernet MAC drivers.  This way the 
library would schedule both phy_stop_interrupts() and mdiobus_unregister() 
(which is actually the function freeing the PHY device structure) through 
the event queue as needed without a MAC driver having to know.

 And the whole question that remains is whether it is Andy (cc-ed) or me 
who is more competent to implement this change. ;-)

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-04 19:17   ` Steve Fox
@ 2006-12-05 18:05     ` Maciej W. Rozycki
  0 siblings, 0 replies; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-05 18:05 UTC (permalink / raw)
  To: Steve Fox
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, afleming

On Mon, 4 Dec 2006, Steve Fox wrote:

> Unfortunately Maciej didn't get cc'ed correctly on your mail. Hopefully
> I've added the right person to this post as well as Andy who has also
> recently changed this code.

 Thanks -- my parser of the LKML does not always trigger.

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 17:48   ` Maciej W. Rozycki
@ 2006-12-05 18:07     ` Linus Torvalds
  2006-12-05 19:31       ` Andrew Morton
  2006-12-05 18:57     ` Andy Fleming
  2006-12-05 20:39     ` Andrew Morton
  2 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-05 18:07 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Andrew Morton, Andy Fleming, Ben Collins, linux-kernel, Jeff Garzik



On Tue, 5 Dec 2006, Maciej W. Rozycki wrote:
>
>  One way of avoiding it is calling flush_scheduled_work() from 
> phy_stop_interrupts().  This is fine as long as a caller of 
> phy_stop_interrupts() (not necessarily the immediate one calling into 
> libphy) does not hold the netlink lock.
> 
>  If a caller indeed holds the netlink lock, then a driver effectively 
> calling phy_stop_interrupts() may arrange for the function to be itself 
> scheduled through the event queue.  This has the effect of avoiding the 
> race as well, as the queue is processed in order, except it causes more 
> hassle for the driver.

I would personally be ok with "flush_scheduled_work()" _itself_ noticing 
that it is actually waiting to flush "itself", and just being a no-op in 
that case.

However, having outside code do that detection for it seems to be ugly as 
hell. And something like the network drievr layer shouldn't know about 
keventd details like this.

So if you can move that deadlock avoidance logic into 
"flush_scheduled_work()" itself, we'd avoid the stupid export, and we'd 
also avoid the driver layer having to care about these things. It would 
still be _ugly_, but I think it would be less so.

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 17:48   ` Maciej W. Rozycki
  2006-12-05 18:07     ` Linus Torvalds
@ 2006-12-05 18:57     ` Andy Fleming
  2006-12-06 12:31       ` Maciej W. Rozycki
  2006-12-05 20:39     ` Andrew Morton
  2 siblings, 1 reply; 54+ messages in thread
From: Andy Fleming @ 2006-12-05 18:57 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik


On Dec 5, 2006, at 11:48, Maciej W. Rozycki wrote:

>
>  Essentially there is a race when disconnecting from a PHY, because
> interrupt delivery uses the event queue for processing.  The  
> function to
> handle interrupts that is called from the event queue is phy_change().
> It takes a pointer to a structure that is associated with the PHY.   
> At the
> time phy_stop_interrupts() is called there may be one or more calls to
> phy_change() still pending on the event queue.  They may not be  
> able to be
> processed until the structure passed to phy_change() have been  
> freed, at
> which point calling the function is wrong.
>
>  One way of avoiding it is calling flush_scheduled_work() from
> phy_stop_interrupts().  This is fine as long as a caller of
> phy_stop_interrupts() (not necessarily the immediate one calling into
> libphy) does not hold the netlink lock.
>
>  If a caller indeed holds the netlink lock, then a driver effectively
> calling phy_stop_interrupts() may arrange for the function to be  
> itself
> scheduled through the event queue.  This has the effect of avoiding  
> the
> race as well, as the queue is processed in order, except it causes  
> more
> hassle for the driver.  Hence the choice was left to the driver's  
> author
> -- if a driver "knows" the netlink lock is not going to be held at  
> that 
> point, it may call phy_stop_interrupts() directly, otherwise it  
> shall use
> the event queue.


We need to make sure there are no more pending phy_change()  
invocations in the work queue, this is true, however I'm not  
convinced that this avoids the problem.  And now that I come back to  
this email after Linus's response, let me add that I agree with his  
suggestion.  I still don't think it solves the original problem,  
though.  Unless I'm missing something, Maciej's suggested fix (having  
the driver invoke phy_stop_interrupts() from a work queue) doesn't  
stop the race:

* Schedule stop_interrupts and freeing of memory.
* interrupt occurs, and schedules phy_change
* work_queue triggers, and stop_interrupts is invoked.  It doesn't  
call flush_scheduled_work, because it's being called from keventd.
* The invoker of stop_interrupts (presumably some function in the  
driver) frees up memory, including the phy_device.
* phy_change is invoked() from the work queue, and starts accessing  
freed memory

Am I misunderstanding how work queues operate?

If I'm right, an ugly solution would have to disable the PHY  
interrupts before scheduling the cleanup.  But is there really no way  
to tell the kernel to squash all pending work that came from *this*  
work_queue?



>
>  With such an assumption in place the function has to check somehow
> whether it has been scheduled through the queue or not and act
> accordingly, which is why that "if" clause is there.
>
>  Now I gather the conclusion was the whole mess was going to be  
> included
> within libphy and not exposed to Ethernet MAC drivers.  This way the
> library would schedule both phy_stop_interrupts() and  
> mdiobus_unregister()
> (which is actually the function freeing the PHY device structure)  
> through
> the event queue as needed without a MAC driver having to know.


I suggested this, mostly so that drivers wouldn't have to be aware of  
this.  But I'm not quite sure what happens when you unload a module.   
Does some stuff stay behind if needed?  If you unload the ethernet  
driver, that will usually remove the bus controller for the PHY,  
which would prevent any scheduled code from accessing the PHY.


Andy



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 18:07     ` Linus Torvalds
@ 2006-12-05 19:31       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Maciej W. Rozycki, Andy Fleming, Ben Collins, linux-kernel, Jeff Garzik

On Tue, 5 Dec 2006 10:07:21 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Tue, 5 Dec 2006, Maciej W. Rozycki wrote:
> >
> >  One way of avoiding it is calling flush_scheduled_work() from 
> > phy_stop_interrupts().  This is fine as long as a caller of 
> > phy_stop_interrupts() (not necessarily the immediate one calling into 
> > libphy) does not hold the netlink lock.
> > 
> >  If a caller indeed holds the netlink lock, then a driver effectively 
> > calling phy_stop_interrupts() may arrange for the function to be itself 
> > scheduled through the event queue.  This has the effect of avoiding the 
> > race as well, as the queue is processed in order, except it causes more 
> > hassle for the driver.
> 
> I would personally be ok with "flush_scheduled_work()" _itself_ noticing 
> that it is actually waiting to flush "itself", and just being a no-op in 
> that case.

It does do that:

static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
	if (cwq->thread == current) {
		/*
		 * Probably keventd trying to flush its own queue. So simply run
		 * it by hand rather than deadlocking.
		 */
		run_workqueue(cwq);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 17:48   ` Maciej W. Rozycki
  2006-12-05 18:07     ` Linus Torvalds
  2006-12-05 18:57     ` Andy Fleming
@ 2006-12-05 20:39     ` Andrew Morton
  2006-12-05 20:59       ` Andy Fleming
  2 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 20:39 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006 17:48:05 +0000 (GMT)
"Maciej W. Rozycki" <macro@linux-mips.org> wrote:

>  Essentially there is a race when disconnecting from a PHY, because 
> interrupt delivery uses the event queue for processing.  The function to 
> handle interrupts that is called from the event queue is phy_change().  
> It takes a pointer to a structure that is associated with the PHY.  At the 
> time phy_stop_interrupts() is called there may be one or more calls to 
> phy_change() still pending on the event queue.  They may not be able to be 
> processed until the structure passed to phy_change() have been freed, at 
> which point calling the function is wrong.
> 
>  One way of avoiding it is calling flush_scheduled_work() from 
> phy_stop_interrupts().  This is fine as long as a caller of 
> phy_stop_interrupts() (not necessarily the immediate one calling into 
> libphy) does not hold the netlink lock.

So let me try to rephrase...

- phy_change() is the workqueue callback function.  It is executed by
  keventd.

- Something under phy_change() takes rtnl_lock() (but what??)

- phy_stop_interrupts() does flush_scheduled_work().  This has to
  following logic:

  - if I am kevetnd, run phy_change() directly.

  - If I am not keventd, wait for keventd() to run phy_change()

- So if the caller of phy_stop_interrupt() already holds rtnl_lock(),
  and if that caller is keventd then it will recur onto rntl_lock() and
  will deadlock.

Problem is, if the caller of phy_stop_interrupt() is *not* keventd, that
caller will still deadlock, because that caller is waiting for keventd to
run phy_change(), and keventd cannot do that, because the not-keventd
process already holds rtnl_lock.


Now, afaict, there are only two callers of phy_stop_interrupts(): the
close() handlers of gianfar.c and fs_enet-main.c (confusingly held in
netdevice.stop (confusingly called by dev_close())).  Via phy_disconnect. 
Did I miss anything?

And the dev_close() caller holds rtnl_lock.


Summary:

a) Please tell us what code under phy_change() wants to take rtnl_lock

b) I think it should deadlock whether or not the caller of
   phy_stop_interrupt() is keventd.  What am I missing?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 20:39     ` Andrew Morton
@ 2006-12-05 20:59       ` Andy Fleming
  2006-12-05 21:26         ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Fleming @ 2006-12-05 20:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds,
	Jeff Garzik


On Dec 5, 2006, at 14:39, Andrew Morton wrote:

> On Tue, 5 Dec 2006 17:48:05 +0000 (GMT)
> "Maciej W. Rozycki" <macro@linux-mips.org> wrote:
>
>>  Essentially there is a race when disconnecting from a PHY, because
>> interrupt delivery uses the event queue for processing.  The  
>> function to
>> handle interrupts that is called from the event queue is phy_change 
>> ().
>> It takes a pointer to a structure that is associated with the  
>> PHY.  At the
>> time phy_stop_interrupts() is called there may be one or more  
>> calls to
>> phy_change() still pending on the event queue.  They may not be  
>> able to be
>> processed until the structure passed to phy_change() have been  
>> freed, at
>> which point calling the function is wrong.
>>
>>  One way of avoiding it is calling flush_scheduled_work() from
>> phy_stop_interrupts().  This is fine as long as a caller of
>> phy_stop_interrupts() (not necessarily the immediate one calling into
>> libphy) does not hold the netlink lock.
>
> So let me try to rephrase...
>
> - phy_change() is the workqueue callback function.  It is executed by
>   keventd.
>
> - Something under phy_change() takes rtnl_lock() (but what??)


I don't think it's phy_change().  It's something else that may be  
scheduled.


>
> - phy_stop_interrupts() does flush_scheduled_work().  This has to
>   following logic:
>
>   - if I am kevetnd, run phy_change() directly.
>
>   - If I am not keventd, wait for keventd() to run phy_change()
>
> - So if the caller of phy_stop_interrupt() already holds rtnl_lock(),
>   and if that caller is keventd then it will recur onto rntl_lock()  
> and
>   will deadlock.
>
> Problem is, if the caller of phy_stop_interrupt() is *not* keventd,  
> that
> caller will still deadlock, because that caller is waiting for  
> keventd to
> run phy_change(), and keventd cannot do that, because the not-keventd
> process already holds rtnl_lock.
>
>
> Now, afaict, there are only two callers of phy_stop_interrupts(): the
> close() handlers of gianfar.c and fs_enet-main.c (confusingly held in
> netdevice.stop (confusingly called by dev_close())).  Via  
> phy_disconnect.
> Did I miss anything?


Right now, that's probably about right.


>
> And the dev_close() caller holds rtnl_lock.
>


Ok, I think this is the summary:

- phy_change() is the work queue callback function (scheduled when a  
PHY interrupt occurs)

- dev_close() invokes the controller's stop/close/whatever function,  
and it calls phy_disconnect()

- phy_disconnect() calls phy_stop_interrupts().  To prevent any  
pending phy_change() calls from getting confused, phy_stop_interrupts 
() needs to flush the queue.  Otherwise, subsequent memory freeings  
will leave phy_change() hanging.

- If phy_stop_interrupts() calls flush_scheduled_work(), keventd will  
execute its queues while rtnl_lock is held, providing opportunity for  
other callbacks to deadlock.

- innocent puppies are slaughtered, and the world mourns.


Maciej's solution is to schedule phy_disconnect() to be called from a  
work queue.  That solution should work, but it sounds like it doesn't  
require the check for if keventd is running.

Of course, my objection to it is that it now requires the ethernet  
controller to be excessively aware of the details of how the PHY Lib  
is handling the PHY interrupts (by scheduling them on a work queue).

Andy



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 20:59       ` Andy Fleming
@ 2006-12-05 21:26         ` Andrew Morton
  2006-12-05 21:37           ` Roland Dreier
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 21:26 UTC (permalink / raw)
  To: Andy Fleming
  Cc: Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds,
	Jeff Garzik

On Tue, 5 Dec 2006 14:59:31 -0600
Andy Fleming <afleming@freescale.com> wrote:

> Ok, I think this is the summary:
> 
> - phy_change() is the work queue callback function (scheduled when a  
> PHY interrupt occurs)
> 
> - dev_close() invokes the controller's stop/close/whatever function,  
> and it calls phy_disconnect()
> 
> - phy_disconnect() calls phy_stop_interrupts().  To prevent any  
> pending phy_change() calls from getting confused, phy_stop_interrupts 
> () needs to flush the queue.  Otherwise, subsequent memory freeings  
> will leave phy_change() hanging.
> 
> - If phy_stop_interrupts() calls flush_scheduled_work(), keventd will  
> execute its queues while rtnl_lock is held, providing opportunity for  
> other callbacks to deadlock.
> 
> - innocent puppies are slaughtered, and the world mourns.
> 

ah, OK.  So it's some other queued-up callback which takes rtnl_lock.

But I still don't see what's special about keventd.  If, say, /sbin/ip is
running flush_scheduled_work() under rtnl_lock then it too should deadlock
in this scenario.

> 
> Maciej's solution is to schedule phy_disconnect() to be called from a  
> work queue.  That solution should work, but it sounds like it doesn't  
> require the check for if keventd is running.
> 
> Of course, my objection to it is that it now requires the ethernet  
> controller to be excessively aware of the details of how the PHY Lib  
> is handling the PHY interrupts (by scheduling them on a work queue).

So what's a good fix?

a) Ban the calling of flush_scheduled_work() from under rtnl_lock(). 
   Sounds hard.

b) Ban the queueing of callback functions which take rtnl_lock().  This
   sounds like a plain bad idea - callbacks are low-level things which
   ought to be able to take locks.

c) Cancel the phy_change() callback within phy_stop_interrupts() so we
   don't need to run flush_scheduled_work() at all.

   This will almost work, as long as it's done in workqueue.c with
   appropriate locking.  The bug occurs when some other CPU is running
   phy_change() right now - we'll end up freeing data which that CPU is
   presently playing with.

   But perhaps we can take care of this within workqueue.c.  We need a
   cancel function which will cancel the work and, if its callback is
   presently executing it will block until that execution has completed.

   If we require of the calling subsystem a) that the work will not get
   rescheduled (that means phy_interrupt()) and b) that the callback does
   not rearm the work then things get simpler.

   But still not very simple.  It gets ugly with per-CPU qorkqueues.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 21:26         ` Andrew Morton
@ 2006-12-05 21:37           ` Roland Dreier
  2006-12-05 21:57             ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2006-12-05 21:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

 > a) Ban the calling of flush_scheduled_work() from under rtnl_lock(). 
 >    Sounds hard.

Unfortunate if this is happening a lot.  It seems like the most
sensible fix -- flush_scheduled_work() is in effect calling into
an unknown and changeable in the future set of functions (since it
waits for them to finish), and it seems error-prone to hold a lock
across such a call.

 >    This will almost work, as long as it's done in workqueue.c with
 >    appropriate locking.  The bug occurs when some other CPU is running
 >    phy_change() right now - we'll end up freeing data which that CPU is
 >    presently playing with.
 > 
 >    But perhaps we can take care of this within workqueue.c.  We need a
 >    cancel function which will cancel the work and, if its callback is
 >    presently executing it will block until that execution has completed.

I may be misunderstanding you, but this seems to deadlock in exactly
the same way: if someone calls this cancel routine holding rtnl_lock,
and the work function that will also take rtnl_lock has just started,
it will get stuck when the work function tries to take rtnl_lock.

 - R.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 21:37           ` Roland Dreier
@ 2006-12-05 21:57             ` Andrew Morton
  2006-12-05 23:49               ` Roland Dreier
                                 ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-05 21:57 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

On Tue, 05 Dec 2006 13:37:37 -0800
Roland Dreier <rdreier@cisco.com> wrote:

>  > a) Ban the calling of flush_scheduled_work() from under rtnl_lock(). 
>  >    Sounds hard.
> 
> Unfortunate if this is happening a lot.  It seems like the most
> sensible fix -- flush_scheduled_work() is in effect calling into
> an unknown and changeable in the future set of functions (since it
> waits for them to finish), and it seems error-prone to hold a lock
> across such a call.

yes, I agree.  It's really bad to be calling flush_scheduled_work() with
any locks held at all.  Fragile, hard-to-maintain, source of
once-in-a-blue-moon failures, etc.  I guess lockdep will help.

But running flush_scheduled_work() from within dev_close() is a very
sensible thing to do, and dev_close is called under rtnl_lock().
davem is -> thattaway ;)


>  >    This will almost work, as long as it's done in workqueue.c with
>  >    appropriate locking.  The bug occurs when some other CPU is running
>  >    phy_change() right now - we'll end up freeing data which that CPU is
>  >    presently playing with.
>  > 
>  >    But perhaps we can take care of this within workqueue.c.  We need a
>  >    cancel function which will cancel the work and, if its callback is
>  >    presently executing it will block until that execution has completed.
> 
> I may be misunderstanding you, but this seems to deadlock in exactly
> the same way: if someone calls this cancel routine holding rtnl_lock,
> and the work function that will also take rtnl_lock has just started,
> it will get stuck when the work function tries to take rtnl_lock.

Ah.  The point is that the phy code doesn't want to flush _all_ pending
callbacks.  It only wants to flush its own one.  And its own one doesn't
take rtnl_lock().

IOW, the phy code has no interest in running some random other subsystem's
callback - it just wants to run its own.  Hence no deadlock.

Maybe the lesson here is that flush_scheduled_work() is a bad function.
It should really be flush_this_work(struct work_struct *w).  That is in
fact what approximately 100% of the flush_scheduled_work() callers actually
want to do.

hmm.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 21:57             ` Andrew Morton
@ 2006-12-05 23:49               ` Roland Dreier
  2006-12-05 23:52               ` Roland Dreier
  2006-12-06 15:25               ` Maciej W. Rozycki
  2 siblings, 0 replies; 54+ messages in thread
From: Roland Dreier @ 2006-12-05 23:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

 > But running flush_scheduled_work() from within dev_close() is a very
 > sensible thing to do, and dev_close is called under rtnl_lock().

I can't argue with that -- this has actually bitten me in the past.

Hmm, I'll try to understand why we need rtnl_lock() to cover dev_close...

 - R.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 21:57             ` Andrew Morton
  2006-12-05 23:49               ` Roland Dreier
@ 2006-12-05 23:52               ` Roland Dreier
  2006-12-06 15:25               ` Maciej W. Rozycki
  2 siblings, 0 replies; 54+ messages in thread
From: Roland Dreier @ 2006-12-05 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

 > Ah.  The point is that the phy code doesn't want to flush _all_ pending
 > callbacks.  It only wants to flush its own one.  And its own one doesn't
 > take rtnl_lock().

OK, got it.  You're absolutely correct.

 > Maybe the lesson here is that flush_scheduled_work() is a bad function.
 > It should really be flush_this_work(struct work_struct *w).  That is in
 > fact what approximately 100% of the flush_scheduled_work() callers actually
 > want to do.

I think flush_this_work() runs into trouble if it means "make sure
everything up to <this work> has completed" because it still syncs
with everything before <this work>, which has the same risk of
deadlock.  And I'm not totally sure everyone who does
flush_scheduled_work() really means "cancel my work" -- they might mean
"finish up my work".

For example I would have to do some archeology to remember exactly
what I needed flush_scheduled_work() when I wrote drivers/infiniband/ulp/ipoib

 - R.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 18:57     ` Andy Fleming
@ 2006-12-06 12:31       ` Maciej W. Rozycki
  0 siblings, 0 replies; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-06 12:31 UTC (permalink / raw)
  To: Andy Fleming
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006, Andy Fleming wrote:

> We need to make sure there are no more pending phy_change() invocations in the
> work queue, this is true, however I'm not convinced that this avoids the
> problem.  And now that I come back to this email after Linus's response, let
> me add that I agree with his suggestion.  I still don't think it solves the
> original problem, though.  Unless I'm missing something, Maciej's suggested
> fix (having the driver invoke phy_stop_interrupts() from a work queue) doesn't
> stop the race:
> 
> * Schedule stop_interrupts and freeing of memory.
> * interrupt occurs, and schedules phy_change
> * work_queue triggers, and stop_interrupts is invoked.  It doesn't call
> flush_scheduled_work, because it's being called from keventd.
> * The invoker of stop_interrupts (presumably some function in the driver)
> frees up memory, including the phy_device.
> * phy_change is invoked() from the work queue, and starts accessing freed
> memory

 This is not going to happen with my other changes to the file applied.  
The reason is at the time phy_stop_interrupts() is called phy_stop() has 
already run and switched the state of the PHY being handled to PHY_HALTED.  
As a result any subsequent calls to phy_interrupt() that might have 
happened after phy_stop() have not scheduled calls to phy_change() for 
this PHY as will not any that may happen up until free_irq() have 
unregistered the interrupt for the PHY.

> I suggested this, mostly so that drivers wouldn't have to be aware of this.
> But I'm not quite sure what happens when you unload a module.  Does some stuff
> stay behind if needed?  If you unload the ethernet driver, that will usually
> remove the bus controller for the PHY, which would prevent any scheduled code
> from accessing the PHY.

 Hmm, I am unsure if there is anything that would ensure flushing of the 
queue after its stop() call has finished and before a driver is removed 
(its module_exit() call is invoked), probably nothing, and that is why I 
put explicit flush_scheduled_work() in sb1250-mac.c:sbmac_remove() and the 
driver's open() call checks whether a possible previous instance of the 
structure used by phy_change() have not been freed yet.  There may be a 
cleaner way of doing it, but I will have to think about it.

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-05 21:57             ` Andrew Morton
  2006-12-05 23:49               ` Roland Dreier
  2006-12-05 23:52               ` Roland Dreier
@ 2006-12-06 15:25               ` Maciej W. Rozycki
  2006-12-06 15:57                 ` Andrew Morton
  2 siblings, 1 reply; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-06 15:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roland Dreier, Andy Fleming, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006, Andrew Morton wrote:

> But running flush_scheduled_work() from within dev_close() is a very
> sensible thing to do, and dev_close is called under rtnl_lock().
> davem is -> thattaway ;)

 And when within dev_close() there is quite a chance there is 
linkwatch_event() somewhere in the event queue already. ;-)

> Ah.  The point is that the phy code doesn't want to flush _all_ pending
> callbacks.  It only wants to flush its own one.  And its own one doesn't
> take rtnl_lock().
> 
> IOW, the phy code has no interest in running some random other subsystem's
> callback - it just wants to run its own.  Hence no deadlock.

 Both are true.  It's linkwatch_event() that's somewhere in the queue 
already that makes the trouble here.

> Maybe the lesson here is that flush_scheduled_work() is a bad function.
> It should really be flush_this_work(struct work_struct *w).  That is in
> fact what approximately 100% of the flush_scheduled_work() callers actually
> want to do.

 There may be cases where flush_scheduled_work() is indeed needed, but 
likely outside drivers, and I agree such a specific call would be useful 
and work here.

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 15:25               ` Maciej W. Rozycki
@ 2006-12-06 15:57                 ` Andrew Morton
  2006-12-06 17:17                   ` Linus Torvalds
  2006-12-06 17:43                   ` David Howells
  0 siblings, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-06 15:57 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Roland Dreier, Andy Fleming, Ben Collins, linux-kernel,
	Linus Torvalds, Jeff Garzik

On Wed, 6 Dec 2006 15:25:22 +0000 (GMT)
"Maciej W. Rozycki" <macro@linux-mips.org> wrote:

> > Maybe the lesson here is that flush_scheduled_work() is a bad function.
> > It should really be flush_this_work(struct work_struct *w).  That is in
> > fact what approximately 100% of the flush_scheduled_work() callers actually
> > want to do.
> 
>  There may be cases where flush_scheduled_work() is indeed needed, but 
> likely outside drivers, and I agree such a specific call would be useful 
> and work here.

I think so too.  But it would be imprudent to hang around waiting for me
to write it :(

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 15:57                 ` Andrew Morton
@ 2006-12-06 17:17                   ` Linus Torvalds
  2006-12-07  1:21                     ` Linus Torvalds
  2006-12-06 17:43                   ` David Howells
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 17:17 UTC (permalink / raw)
  To: Andrew Morton, David Howells
  Cc: Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins,
	Linux Kernel Mailing List, Jeff Garzik



On Wed, 6 Dec 2006, Andrew Morton wrote:
> 
> I think so too.  But it would be imprudent to hang around waiting for me
> to write it :(

How about something like this?

This
 (a) depends on the just-merged "struct work" cleanup
 (b) is totally untested
 (c) probably kills you slowly and painfully
 (d) may breed frikken sharks with lasers on their frikken heads!
 (e) does compile, but I don't guarantee anything else.
 (f) may, in other words, be totally broken

And, btw: it may not work. Just in case that wasn't clear. This is a quick 
hack from me just sitting down and seeing if I can still do kernel 
programming, or whether I'm just relegated to merge other peoples code.

		Linus

PS. It might be broken.

PPS. David Howells added to participant list, hopefully he can 
double-check all my assumptions, since he's touched the workqueue code 
last. Tag, you're it!

----
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 4044bb1..e175f39 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -587,8 +587,7 @@ int phy_stop_interrupts(struct phy_device *phydev)
 	 * Finish any pending work; we might have been scheduled
 	 * to be called from keventd ourselves, though.
 	 */
-	if (!current_is_keventd())
-		flush_scheduled_work();
+	run_scheduled_work(&phydev->phy_queue);
 
 	free_irq(phydev->irq, phydev);
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4a3ea83..a601ed5 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -160,6 +160,7 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 extern void FASTCALL(flush_workqueue(struct workqueue_struct *wq));
 
 extern int FASTCALL(schedule_work(struct work_struct *work));
+extern int FASTCALL(run_scheduled_work(struct work_struct *work));
 extern int FASTCALL(schedule_delayed_work(struct delayed_work *work, unsigned long delay));
 
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8d1e7cb..fcacf06 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -103,6 +103,79 @@ static inline void *get_wq_data(struct work_struct *work)
 	return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK);
 }
 
+static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work)
+{
+	int ret = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cwq->lock, flags);
+	/*
+	 * We need to re-validate the work info after we've gotten
+	 * the cpu_workqueue lock. We can run the work now iff:
+	 *
+	 *  - the wq_data still matches the cpu_workqueue_struct
+	 *  - AND the work is still marked pending
+	 *  - AND the work is still on a list (which will be this
+	 *    workqueue_struct list)
+	 *
+	 * All these conditions are important, because we
+	 * need to protect against the work being run right
+	 * now on another CPU (all but the last one might be
+	 * true if it's currently running and has not been
+	 * released yet, for example).
+	 */
+	if (get_wq_data(work) == cwq
+	    && test_bit(WORK_STRUCT_PENDING, &work->management)
+	    && !list_empty(&work->entry)) {
+		work_func_t f = work->func;
+		list_del_init(&work->entry);
+		spin_unlock_irqrestore(&cwq->lock, flags);
+
+		if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management))
+			work_release(work);
+		f(work);
+
+		spin_lock_irqsave(&cwq->lock, flags);
+		cwq->remove_sequence++;
+		wake_up(&cwq->work_done);
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&cwq->lock, flags);
+	return ret;
+}
+
+/**
+ * run_scheduled_work - run scheduled work synchronously
+ * @work: work to run
+ *
+ * This checks if the work was pending, and runs it
+ * synchronously if so. It returns a boolean to indicate
+ * whether it had any scheduled work to run or not.
+ *
+ * NOTE! This _only_ works for normal work_structs. You
+ * CANNOT use this for delayed work, because the wq data
+ * for delayed work will not point properly to the per-
+ * CPU workqueue struct, but will change!
+ */
+int fastcall run_scheduled_work(struct work_struct *work)
+{
+	for (;;) {
+		struct cpu_workqueue_struct *cwq;
+
+		if (!test_bit(WORK_STRUCT_PENDING, &work->management))
+			return 0;
+		if (list_empty(&work->entry))
+			return 0;
+		/* NOTE! This depends intimately on __queue_work! */
+		cwq = get_wq_data(work);
+		if (!cwq)
+			return 0;
+		if (__run_work(cwq, work))
+			return 1;
+	}
+}
+EXPORT_SYMBOL(run_scheduled_work);
+
 /* Preempt must be disabled. */
 static void __queue_work(struct cpu_workqueue_struct *cwq,
 			 struct work_struct *work)

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 15:57                 ` Andrew Morton
  2006-12-06 17:17                   ` Linus Torvalds
@ 2006-12-06 17:43                   ` David Howells
  2006-12-06 17:50                     ` Jeff Garzik
                                       ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: David Howells @ 2006-12-06 17:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, David Howells, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List,
	Jeff Garzik

Linus Torvalds <torvalds@osdl.org> wrote:

> How about something like this?

At first glance, this looks reasonable.

It also looks like it should be used to replace a lot of the
cancel_delayed_work() calls that attempt to cancel _undelayed_ work items.
That would allow a number of work items to be downgraded from delayed_work to
work_struct.

Also, the name "run_scheduled_work" sort of suggests that the work *will* be
run regardless of whether it was pending or not.  Given the confusion over
cancel_delayed_work(), I imagine this will rain confusion too.

+	if (get_wq_data(work) == cwq
+	    && test_bit(WORK_STRUCT_PENDING, &work->management)

I wonder if those can be combined, perhaps:

+	if ((work->management & ~WORK_STRUCT_NOAUTOREL) ==
+	    ((unsigned long) cwq | (1 << WORK_STRUCT_PENDING))

Otherwise for i386 the compiler can't combine them because test_bit() is done
with inline asm.

And:

+		if (!test_bit(WORK_STRUCT_PENDING, &work->management))

Should possibly be:

+		if (!work_pending(work))

David

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:43                   ` David Howells
@ 2006-12-06 17:50                     ` Jeff Garzik
  2006-12-06 18:07                       ` Linus Torvalds
  2006-12-06 17:53                     ` Linus Torvalds
  2006-12-06 18:02                     ` David Howells
  2 siblings, 1 reply; 54+ messages in thread
From: Jeff Garzik @ 2006-12-06 17:50 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Andrew Morton, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List

Honestly I wonder if some of these situations really want 
"kill_scheduled_work_unless_it_is_already_running_right_now_if_so_wait_for_it"

Since its during shutdown, usually the task just wants to know that the 
code in the workqueue won't be touching the hardware or data structures 
after <this> point.

	Jeff




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:43                   ` David Howells
  2006-12-06 17:50                     ` Jeff Garzik
@ 2006-12-06 17:53                     ` Linus Torvalds
  2006-12-06 17:58                       ` Linus Torvalds
  2006-12-06 18:02                     ` David Howells
  2 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 17:53 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik



On Wed, 6 Dec 2006, David Howells wrote:
> 
> +	if (get_wq_data(work) == cwq
> +	    && test_bit(WORK_STRUCT_PENDING, &work->management)
> 
> I wonder if those can be combined, perhaps:

Gcc should do it for us, afaik. I didn't check, but gcc is generally 
pretty good at combining logical operations like this, because it's very 
common.

> Otherwise for i386 the compiler can't combine them because test_bit() is done
> with inline asm.

Nope. Look again.

test_bit() with a constant number is done very much in C, and very much on 
purpose. _Exactly_ to allow the compiler to combine these kinds of things.

> And:
> 
> +		if (!test_bit(WORK_STRUCT_PENDING, &work->management))
> 
> Should possibly be:
> 
> +		if (!work_pending(work))

Yeah, that's a worthy cleanup.

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:53                     ` Linus Torvalds
@ 2006-12-06 17:58                       ` Linus Torvalds
  2006-12-06 18:33                         ` Linus Torvalds
  2006-12-06 18:43                         ` David Howells
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 17:58 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik



On Wed, 6 Dec 2006, Linus Torvalds wrote:
> 
> Gcc should do it for us, afaik. I didn't check, but gcc is generally 
> pretty good at combining logical operations like this, because it's very 
> common.

Sadly, gcc doesn't do it in this case. Still, I'd rather keep the source 
clean, and hope that the compiler improves eventually, than to make the 
code uglier.

I replaced both of the "test_bit(WORK_STRUCT_PENDING, &work->management)" 
with "work_pending(work)" in my tree.

Now somebody who knows how to trigger this thing should just _test_ it.

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:43                   ` David Howells
  2006-12-06 17:50                     ` Jeff Garzik
  2006-12-06 17:53                     ` Linus Torvalds
@ 2006-12-06 18:02                     ` David Howells
  2 siblings, 0 replies; 54+ messages in thread
From: David Howells @ 2006-12-06 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List,
	Jeff Garzik

Linus Torvalds <torvalds@osdl.org> wrote:

> test_bit() with a constant number is done very much in C, and very much on 
> purpose. _Exactly_ to allow the compiler to combine these kinds of things.

Ah...  I've read that several times, and each time I've assumed it's referring
to *addr not nr as being constant.

David

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:50                     ` Jeff Garzik
@ 2006-12-06 18:07                       ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 18:07 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List



On Wed, 6 Dec 2006, Jeff Garzik wrote:
>
> Honestly I wonder if some of these situations really want
> "kill_scheduled_work_unless_it_is_already_running_right_now_if_so_wait_for_it"

We could do that, and the code to do it would be fairly close to what the 
"run it" code is. Just replace the

	if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management))
		work_release(work);
	f(work);

with an unconditional

	work_release(work);

instead.

However, I think I'd prefer my patch for now, if only because that 
"work_release()" thing kind of worries me. You're supposed to either do 
the release yourself in the work function _after_ you've done all 
book-keeping, or if the thing doesn't need any book-keeping at all, you 
can do the "autorelease" thing. The "kill without running" breaks that 
conceptual model.

So a "kill_work()" usage should almost always end up being something like

	if (kill_work(work))
		release anything that running the work function would release

but then for the (probably common) case where there is nothing that the 
work function releases, that would obviously boil down to just a

	kill_work(work);

However, my point is that the _workqueue_ logic cannot know what the user 
of the workqueue wants, so the "run_scheduled_work()" approach is much 
saner in this respect.

NOTE NOTE NOTE! In neither case can we do anything at all about a 
workqueue entry that is actually _already_ running on another CPU. My 
suggested patch will simply not run it at all synchronously (and return 
0), and a "kill_work()" thing couldn't do anything either. The required 
synchronization for something like that simply doesn't exist.

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:58                       ` Linus Torvalds
@ 2006-12-06 18:33                         ` Linus Torvalds
  2006-12-06 18:37                           ` Linus Torvalds
  2006-12-06 18:43                         ` David Howells
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 18:33 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch



On Wed, 6 Dec 2006, Linus Torvalds wrote:
> 
> Sadly, gcc doesn't do it in this case. Still, I'd rather keep the source 
> clean, and hope that the compiler improves eventually, than to make the 
> code uglier.

Actually, it's our own damn fault.

We've long had our arguments "const volatile" to test_bit(), which 
basically means that gcc can't do the optimization. Damn.

And they need to be "volatile" not because we _want_ the thing to be 
volaile, but because these things are occationally used on volatile 
objects (so if the function doesn't take a volatile pointer, it would 
warn).

That's why so many of these helper functions use "const volatile" 
pointers, which on the face of it would seem strange if you don't realize 
that it's more about a C type issue than about the _actual_ type being 
either "const" or "volatile".

Sad. I guess we could remove the "const volatile" from the _cast_, but the 
thing is, if the underlying object we're actually looking at really _is_ 
volatile, we shouldn't do that. GAAH.

Really sad. I doubt any of the callers actually want the "volatile" access 
at all. Things like <linux/page-flags.h> definitely _don't_ want it.

I suspect we should just face up to the fact that 

 (a) "volatile" on kernel data is basically always a bug, and you should 
     use locking. "volatile" doesn't help anything at all with memory 
     ordering and friends, so it's insane to think it "solves" anything on 
     its own.
 (b) on "iomem" pointers it does make sense, but those need special 
     accessor functions _anyway_, so things like test_bit() wouldn't work 
     on them.
 (c) if you spin on a value changing, you should use "cpu_relax()" or 
     "barrier()" anyway, which will force gcc to re-load any values from 
     memory over the loop.

and remove the "volatile" from all the bitop accessor functions.

Or at least from "test_bit()". It's not like it's synchronous _anyway_ 
(there's no memory barriers etc).

Hmm? Comments? linux-arch added to Cc, just in case people care (this 
particular thing is in <asm-*/bitops.h>, so it _is_ architecture- 
specific, but we should probably all agree on it first)

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 18:33                         ` Linus Torvalds
@ 2006-12-06 18:37                           ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 18:37 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch



On Wed, 6 Dec 2006, Linus Torvalds wrote:
>
> and remove the "volatile" from all the bitop accessor functions.

It might also be interesting to see if this would change code-size at all. 

There's a number of things that check different bits in the same word 
right now, and they just reload the word unnecessarily and do multiple 
tests. Some of the page flags functions obviously already work around this 
by doing horrible things by hand instead, eg:

                (page->flags & (
                        1 << PG_lru     |
                        1 << PG_private |
                        1 << PG_locked  |
                        1 << PG_active  |
                        1 << PG_reclaim |
                        1 << PG_slab    |
                        1 << PG_swapcache |
                        1 << PG_writeback |
                        1 << PG_reserved |
                        1 << PG_buddy ))

in the free_pages_check() thing. It may make sense there, but we really 
_should_ allow gcc to just do things like this for us, and just use the 
proper functions to test bits.

			Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:58                       ` Linus Torvalds
  2006-12-06 18:33                         ` Linus Torvalds
@ 2006-12-06 18:43                         ` David Howells
  2006-12-06 19:02                           ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: David Howells @ 2006-12-06 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List,
	Jeff Garzik, linux-arch

Linus Torvalds <torvalds@osdl.org> wrote:

>  (a) "volatile" on kernel data is basically always a bug, and you should 
>      use locking.

But what about when you're building a lock?  Actually, I suspect correct usage
of asm constraints and memory barriers trumps volatile anyway even there.

David

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 18:43                         ` David Howells
@ 2006-12-06 19:02                           ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:02 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch



On Wed, 6 Dec 2006, David Howells wrote:
>
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> >  (a) "volatile" on kernel data is basically always a bug, and you should 
> >      use locking.
> 
> But what about when you're building a lock?  Actually, I suspect correct usage
> of asm constraints and memory barriers trumps volatile anyway even there.

The word you look for is not "suspect".

You _cannot_ build a lock using "volatile", unless your CPU is strictly 
in-order and has an in-order memory subsystem too (so, for example, while 
all ia64 implementations today are in-order, they do /not/ have an 
in-order memory subsystem). Only then could you do locking with volatile 
and some crazy Peterson's algorithm.

I don't think any such CPU actually exists.

Anyway, we've had this discussion before on linux-kernel, it really boils 
down to that "volatile" is basically never correct with the exception of 
flags that don't have any meaning and that you don't actually _care_ about 
the exact value (the low word of "jiffies" being the canonical example of 
something where "volatile" is actually fine, and where - as long as you 
can load it atomically - "volatile" really does make sense).

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-06 17:17                   ` Linus Torvalds
@ 2006-12-07  1:21                     ` Linus Torvalds
  2006-12-07  6:42                       ` Andrew Morton
  2006-12-07 15:28                       ` Maciej W. Rozycki
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-07  1:21 UTC (permalink / raw)
  To: Andrew Morton, David Howells
  Cc: Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins,
	Linux Kernel Mailing List, Jeff Garzik



On Wed, 6 Dec 2006, Linus Torvalds wrote:
> 
> How about something like this?

I didn't get any answers on this. I'd like to get this issue resolved, but 
since I don't even use libphy on my main machine, I need somebody else to 
test it for me.

Just to remind you all, here's the patch again. This is identical to the 
previous version except for the trivial cleanup to use "work_pending()" 
instead of open-coding it in two places.

		Linus

----
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 4044bb1..e175f39 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -587,8 +587,7 @@ int phy_stop_interrupts(struct phy_device *phydev)
 	 * Finish any pending work; we might have been scheduled
 	 * to be called from keventd ourselves, though.
 	 */
-	if (!current_is_keventd())
-		flush_scheduled_work();
+	run_scheduled_work(&phydev->phy_queue);
 
 	free_irq(phydev->irq, phydev);
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4a3ea83..a601ed5 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -160,6 +160,7 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 extern void FASTCALL(flush_workqueue(struct workqueue_struct *wq));
 
 extern int FASTCALL(schedule_work(struct work_struct *work));
+extern int FASTCALL(run_scheduled_work(struct work_struct *work));
 extern int FASTCALL(schedule_delayed_work(struct delayed_work *work, unsigned long delay));
 
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8d1e7cb..36f9b78 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -103,6 +103,79 @@ static inline void *get_wq_data(struct work_struct *work)
 	return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK);
 }
 
+static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work)
+{
+	int ret = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cwq->lock, flags);
+	/*
+	 * We need to re-validate the work info after we've gotten
+	 * the cpu_workqueue lock. We can run the work now iff:
+	 *
+	 *  - the wq_data still matches the cpu_workqueue_struct
+	 *  - AND the work is still marked pending
+	 *  - AND the work is still on a list (which will be this
+	 *    workqueue_struct list)
+	 *
+	 * All these conditions are important, because we
+	 * need to protect against the work being run right
+	 * now on another CPU (all but the last one might be
+	 * true if it's currently running and has not been
+	 * released yet, for example).
+	 */
+	if (get_wq_data(work) == cwq
+	    && work_pending(work)
+	    && !list_empty(&work->entry)) {
+		work_func_t f = work->func;
+		list_del_init(&work->entry);
+		spin_unlock_irqrestore(&cwq->lock, flags);
+
+		if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management))
+			work_release(work);
+		f(work);
+
+		spin_lock_irqsave(&cwq->lock, flags);
+		cwq->remove_sequence++;
+		wake_up(&cwq->work_done);
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&cwq->lock, flags);
+	return ret;
+}
+
+/**
+ * run_scheduled_work - run scheduled work synchronously
+ * @work: work to run
+ *
+ * This checks if the work was pending, and runs it
+ * synchronously if so. It returns a boolean to indicate
+ * whether it had any scheduled work to run or not.
+ *
+ * NOTE! This _only_ works for normal work_structs. You
+ * CANNOT use this for delayed work, because the wq data
+ * for delayed work will not point properly to the per-
+ * CPU workqueue struct, but will change!
+ */
+int fastcall run_scheduled_work(struct work_struct *work)
+{
+	for (;;) {
+		struct cpu_workqueue_struct *cwq;
+
+		if (!work_pending(work))
+			return 0;
+		if (list_empty(&work->entry))
+			return 0;
+		/* NOTE! This depends intimately on __queue_work! */
+		cwq = get_wq_data(work);
+		if (!cwq)
+			return 0;
+		if (__run_work(cwq, work))
+			return 1;
+	}
+}
+EXPORT_SYMBOL(run_scheduled_work);
+
 /* Preempt must be disabled. */
 static void __queue_work(struct cpu_workqueue_struct *cwq,
 			 struct work_struct *work)

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07  1:21                     ` Linus Torvalds
@ 2006-12-07  6:42                       ` Andrew Morton
  2006-12-07  7:49                         ` Andrew Morton
                                           ` (2 more replies)
  2006-12-07 15:28                       ` Maciej W. Rozycki
  1 sibling, 3 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-07  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik

On Wed, 6 Dec 2006 17:21:50 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 6 Dec 2006, Linus Torvalds wrote:
> > 
> > How about something like this?
> 
> I didn't get any answers on this. I'd like to get this issue resolved, but 
> since I don't even use libphy on my main machine, I need somebody else to 
> test it for me.
> 
> Just to remind you all, here's the patch again. This is identical to the 
> previous version except for the trivial cleanup to use "work_pending()" 
> instead of open-coding it in two places.
> 
> 		Linus
> 
> ...
>
> +static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work)
> +{
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cwq->lock, flags);
> +	/*
> +	 * We need to re-validate the work info after we've gotten
> +	 * the cpu_workqueue lock. We can run the work now iff:
> +	 *
> +	 *  - the wq_data still matches the cpu_workqueue_struct
> +	 *  - AND the work is still marked pending
> +	 *  - AND the work is still on a list (which will be this
> +	 *    workqueue_struct list)
> +	 *
> +	 * All these conditions are important, because we
> +	 * need to protect against the work being run right
> +	 * now on another CPU (all but the last one might be
> +	 * true if it's currently running and has not been
> +	 * released yet, for example).
> +	 */
> +	if (get_wq_data(work) == cwq
> +	    && work_pending(work)
> +	    && !list_empty(&work->entry)) {
> +		work_func_t f = work->func;
> +		list_del_init(&work->entry);
> +		spin_unlock_irqrestore(&cwq->lock, flags);
> +
> +		if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management))
> +			work_release(work);
> +		f(work);
> +
> +		spin_lock_irqsave(&cwq->lock, flags);
> +		cwq->remove_sequence++;
> +		wake_up(&cwq->work_done);
> +		ret = 1;
> +	}
> +	spin_unlock_irqrestore(&cwq->lock, flags);
> +	return ret;
> +}
> +
> +/**
> + * run_scheduled_work - run scheduled work synchronously
> + * @work: work to run
> + *
> + * This checks if the work was pending, and runs it
> + * synchronously if so. It returns a boolean to indicate
> + * whether it had any scheduled work to run or not.
> + *
> + * NOTE! This _only_ works for normal work_structs. You
> + * CANNOT use this for delayed work, because the wq data
> + * for delayed work will not point properly to the per-
> + * CPU workqueue struct, but will change!
> + */
> +int fastcall run_scheduled_work(struct work_struct *work)
> +{
> +	for (;;) {
> +		struct cpu_workqueue_struct *cwq;
> +
> +		if (!work_pending(work))
> +			return 0;

But this will return to the caller if the callback is presently running on
a different CPU.  The whole point here is to be able to reliably kill off
the pending work so that the caller can free resources.

> +		if (list_empty(&work->entry))
> +			return 0;
> +		/* NOTE! This depends intimately on __queue_work! */
> +		cwq = get_wq_data(work);
> +		if (!cwq)
> +			return 0;
> +		if (__run_work(cwq, work))
> +			return 1;
> +	}
> +}
> +EXPORT_SYMBOL(run_scheduled_work);

Also, I worry that this code can run the callback on the caller's CPU. 
Users of per-cpu workqueues can legitimately assume that each callback runs
on the right CPU.  I doubt if many callers _do_ do that - there's
schedule_delayed_work_on(), but that's a bit different.

A solution to both problems is of course to block the caller if the
callback is running.  We can perhaps borrow cwq->work_done for that.


But I wouldn't want to think about an implementation as long as we have
that WORK_STRUCT_NOAUTOREL horror in there.  Can we just nuke that?  Only
three drivers need it and I bet they can be modified to use the usual
mechanisms.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07  6:42                       ` Andrew Morton
@ 2006-12-07  7:49                         ` Andrew Morton
  2006-12-07 10:29                         ` David Howells
  2006-12-07 16:49                         ` Linus Torvalds
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-07  7:49 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List,
	Jeff Garzik

On Wed, 6 Dec 2006 22:42:07 -0800
Andrew Morton <akpm@osdl.org> wrote:

> But I wouldn't want to think about an implementation as long as we have
> that WORK_STRUCT_NOAUTOREL horror in there.  Can we just nuke that?  Only
> three drivers need it and I bet they can be modified to use the usual
> mechanisms.

I guess I don't understand exactly what problem the noautorel stuff is
trying to solve.  It _seems_ to me that in all cases we can simply stuff
the old `data' field in alongside the controlling work_struct or
delayed_work which wants to operate on it.

Bridge is the simple case..

diff -puN net/bridge/br_private.h~bridge-avoid-using-noautorel-workqueues net/bridge/br_private.h
--- a/net/bridge/br_private.h~bridge-avoid-using-noautorel-workqueues
+++ a/net/bridge/br_private.h
@@ -83,6 +83,7 @@ struct net_bridge_port
 	struct timer_list		message_age_timer;
 	struct kobject			kobj;
 	struct delayed_work		carrier_check;
+	struct net_device		*carrier_check_dev;
 	struct rcu_head			rcu;
 };
 
diff -puN net/bridge/br_if.c~bridge-avoid-using-noautorel-workqueues net/bridge/br_if.c
--- a/net/bridge/br_if.c~bridge-avoid-using-noautorel-workqueues
+++ a/net/bridge/br_if.c
@@ -83,14 +83,11 @@ static void port_carrier_check(struct wo
 	struct net_device *dev;
 	struct net_bridge *br;
 
-	dev = container_of(work, struct net_bridge_port,
-			   carrier_check.work)->dev;
-	work_release(work);
-
+	p = container_of(work, struct net_bridge_port, carrier_check.work);
+	dev = p->carrier_check_dev;
 	rtnl_lock();
-	p = dev->br_port;
-	if (!p)
-		goto done;
+	if (!dev->br_port)
+		goto done;	/* Can this happen? */
 	br = p->br;
 
 	if (netif_carrier_ok(dev))
@@ -280,7 +277,8 @@ static struct net_bridge_port *new_nbp(s
 	p->port_no = index;
 	br_init_port(p);
 	p->state = BR_STATE_DISABLED;
-	INIT_DELAYED_WORK_NAR(&p->carrier_check, port_carrier_check);
+	p->carrier_check_dev = dev;
+	INIT_DELAYED_WORK(&p->carrier_check, port_carrier_check);
 	br_stp_port_timer_init(p);
 
 	kobject_init(&p->kobj);
_


What am I missing?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07  6:42                       ` Andrew Morton
  2006-12-07  7:49                         ` Andrew Morton
@ 2006-12-07 10:29                         ` David Howells
  2006-12-07 10:42                           ` Andrew Morton
  2006-12-07 16:49                         ` Linus Torvalds
  2 siblings, 1 reply; 54+ messages in thread
From: David Howells @ 2006-12-07 10:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, David Howells, Maciej W. Rozycki, Roland Dreier,
	Andy Fleming, Ben Collins, Linux Kernel Mailing List,
	Jeff Garzik

Andrew Morton <akpm@osdl.org> wrote:

> I guess I don't understand exactly what problem the noautorel stuff is
> trying to solve.  It _seems_ to me that in all cases we can simply stuff
> the old `data' field in alongside the controlling work_struct or
> delayed_work which wants to operate on it.

The problem is that you have to be able to guarantee that the data is still
accessible once you clear the pending bit.  The pending bit is your only
guaranteed protection, and once it is clear, the containing structure might be
deallocated.

I would like to be able to get rid of the NAR bit too, but I'm not confident
that in all cases I can.  It'll take a bit more study of the code to be able
to do that.

David

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 10:29                         ` David Howells
@ 2006-12-07 10:42                           ` Andrew Morton
  2006-12-07 17:05                             ` Jeff Garzik
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-07 10:42 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, dhowells, macro, rdreier, afleming, ben.collins,
	linux-kernel, jeff

> On Thu, 07 Dec 2006 10:29:39 +0000 David Howells <dhowells@redhat.com> wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > I guess I don't understand exactly what problem the noautorel stuff is
> > trying to solve.  It _seems_ to me that in all cases we can simply stuff
> > the old `data' field in alongside the controlling work_struct or
> > delayed_work which wants to operate on it.
> 
> The problem is that you have to be able to guarantee that the data is still
> accessible once you clear the pending bit.  The pending bit is your only
> guaranteed protection, and once it is clear, the containing structure might be
> deallocated.
> 
> I would like to be able to get rid of the NAR bit too, but I'm not confident
> that in all cases I can.  It'll take a bit more study of the code to be able
> to do that.
> 

But anyone who is going to free the structure which contains the
work_struct would need to run flush_workqueue() beforehand, after having
ensured that the work won't reschedule itself.  So the
struct-which-contains-the-work_struct is safe during the callback's
execution.

If that's not being done then the code was buggy in 2.6.19, too..

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07  1:21                     ` Linus Torvalds
  2006-12-07  6:42                       ` Andrew Morton
@ 2006-12-07 15:28                       ` Maciej W. Rozycki
  1 sibling, 0 replies; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-07 15:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, David Howells, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> I didn't get any answers on this. I'd like to get this issue resolved, but 
> since I don't even use libphy on my main machine, I need somebody else to 
> test it for me.

 I tested it in the evening with the system I implemented it for 
originally.  It seems to work -- it does not oops as would the original 
code without the call to flush_scheduled_work() nor does it deadlock as 
would code with the call (and no special precautions).

 Thanks a lot (and congrats for still being capable of doing something 
else from merging).

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07  6:42                       ` Andrew Morton
  2006-12-07  7:49                         ` Andrew Morton
  2006-12-07 10:29                         ` David Howells
@ 2006-12-07 16:49                         ` Linus Torvalds
  2006-12-07 17:52                           ` Andrew Morton
  2 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-07 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik



On Wed, 6 Dec 2006, Andrew Morton wrote:
>
> But this will return to the caller if the callback is presently running on
> a different CPU.  The whole point here is to be able to reliably kill off
> the pending work so that the caller can free resources.

I mentioned that in one of the emails.

We do not _have_ the information to not do that. It simply doesn't exist. 
We can either wait for _all_ pending entries on the to complete (by 
waiting for the workqueue counters for added/removed to be the same), or 
we can have the race.

> Also, I worry that this code can run the callback on the caller's CPU. 

Right.

> Users of per-cpu workqueues can legitimately assume that each callback runs
> on the right CPU.

Not if they use this interface, they can't.

They asked for it, they get it. That's the unix philosophy. 

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 10:42                           ` Andrew Morton
@ 2006-12-07 17:05                             ` Jeff Garzik
  2006-12-07 17:57                               ` Andrew Morton
                                                 ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Jeff Garzik @ 2006-12-07 17:05 UTC (permalink / raw)
  To: Andrew Morton, torvalds, macro
  Cc: David Howells, rdreier, afleming, ben.collins, linux-kernel

Yes, I merged the code, but looking deeper at phy its clear I missed 
some things.

Looking into libphy's workqueue stuff, it has the following sequence:

	disable interrupts
	schedule_work()

	... time passes ...
	... workqueue routine is called ...

	enable interrupts
	handle interrupt

I really have to question if a workqueue was the best choice of 
direction for such a sequence.  You don't want to put off handling an 
interrupt, with interrupts disabled, for a potentially unbounded amount 
of time.

Maybe the best course of action is to mark it with CONFIG_BROKEN until 
it gets fixed.

	Jeff




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 16:49                         ` Linus Torvalds
@ 2006-12-07 17:52                           ` Andrew Morton
  2006-12-07 18:01                             ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-07 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik

On Thu, 7 Dec 2006 08:49:02 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 6 Dec 2006, Andrew Morton wrote:
> >
> > But this will return to the caller if the callback is presently running on
> > a different CPU.  The whole point here is to be able to reliably kill off
> > the pending work so that the caller can free resources.
> 
> I mentioned that in one of the emails.
> 
> We do not _have_ the information to not do that. It simply doesn't exist. 
> We can either wait for _all_ pending entries on the to complete (by 
> waiting for the workqueue counters for added/removed to be the same), or 
> we can have the race.

Well we'll need to add the infrastructure to be able to do this, won't we? 
The whole point of calling flush_scheduled_work() (which we're trying to
replace/simplify) is to block the caller until it is safe to release
resources.

It might be a challenge to do this without adding more stuff to work_struct
though.

umm..  Putting a work_struct* into struct cpu_workqueue_struct and then
doing appropriate things with cpu_workqueue_struct.lock might work.

<hack, hack>

Something along these lines.  The keventd-calls-flush_work() case rather
sucks though.


diff -puN kernel/workqueue.c~a kernel/workqueue.c
--- a/kernel/workqueue.c~a
+++ a/kernel/workqueue.c
@@ -323,6 +323,7 @@ static void run_workqueue(struct cpu_wor
 		work_func_t f = work->func;
 
 		list_del_init(cwq->worklist.next);
+		cwq->current_work = work;
 		spin_unlock_irqrestore(&cwq->lock, flags);
 
 		BUG_ON(get_wq_data(work) != cwq);
@@ -342,6 +343,7 @@ static void run_workqueue(struct cpu_wor
 		}
 
 		spin_lock_irqsave(&cwq->lock, flags);
+		cwq->current_work = NULL;
 		cwq->remove_sequence++;
 		wake_up(&cwq->work_done);
 	}
@@ -425,6 +427,64 @@ static void flush_cpu_workqueue(struct c
 	}
 }
 
+static void wait_on_work(struct cpu_workqueue_struct *cwq,
+				struct work_struct *work)
+{
+	DEFINE_WAIT(wait);
+
+	spin_lock_irq(&cwq->lock);
+	while (cwq->current_work == work) {
+		prepare_to_wait(&cwq->work_done, &wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&cwq->lock);
+		schedule();
+		spin_lock_irq(&cwq->lock);
+	}
+	finish_wait(&cwq->work_done, &wait);
+	spin_unlock_irq(&cwq->lock);
+}
+
+static void flush_one_work(struct cpu_workqueue_struct *cwq,
+				struct work_struct *work)
+{
+	spin_lock_irq(&cwq->lock);
+	if (test_and_clear_bit(WORK_STRUCT_PENDING, &work->management)) {
+		list_del_init(&work->entry);
+		spin_unlock_irq(&cwq->lock);
+		return;
+	}
+	spin_unlock_irq(&cwq->lock);
+
+	/* It's running, or it has completed */
+
+	if (cwq->thread == current) {
+		/* This stinks */
+		/*
+		 * Probably keventd trying to flush its own queue. So simply run
+		 * it by hand rather than deadlocking.
+		 */
+		run_workqueue(cwq);
+	} else {
+		wait_on_work(cwq, work);
+	}
+}
+
+void flush_work(struct work_struct *work)
+{
+	might_sleep();
+
+	if (is_single_threaded(wq)) {
+		/* Always use first cpu's area. */
+		flush_one_work(per_cpu_ptr(wq->cpu_wq, singlethread_cpu), work);
+	} else {
+		int cpu;
+
+		mutex_lock(&workqueue_mutex);
+		for_each_online_cpu(cpu)
+			flush_one_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		mutex_unlock(&workqueue_mutex);
+	}
+}
+
 /**
  * flush_workqueue - ensure that any scheduled work has run to completion.
  * @wq: workqueue to flush
_


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 17:05                             ` Jeff Garzik
@ 2006-12-07 17:57                               ` Andrew Morton
  2006-12-07 18:17                                 ` Andrew Morton
  2006-12-08 16:52                                 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet
  2006-12-07 18:08                               ` [PATCH] Export current_is_keventd() for libphy Maciej W. Rozycki
  2006-12-07 18:59                               ` Andy Fleming
  2 siblings, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-07 17:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: torvalds, macro, David Howells, rdreier, afleming, ben.collins,
	linux-kernel

On Thu, 07 Dec 2006 12:05:38 -0500
Jeff Garzik <jeff@garzik.org> wrote:

> Yes, I merged the code, but looking deeper at phy its clear I missed 
> some things.
> 
> Looking into libphy's workqueue stuff, it has the following sequence:
> 
> 	disable interrupts
> 	schedule_work()
> 
> 	... time passes ...
> 	... workqueue routine is called ...
> 
> 	enable interrupts
> 	handle interrupt
> 
> I really have to question if a workqueue was the best choice of 
> direction for such a sequence.  You don't want to put off handling an 
> interrupt, with interrupts disabled, for a potentially unbounded amount 
> of time.

That'll lock the box on UP, or if the timer fires on the current CPU?

> Maybe the best course of action is to mark it with CONFIG_BROKEN until 
> it gets fixed.

hm, maybe.  I wonder if as a short-term palliative we could remove the
current_is_keventd() call and drop rtnl_lock.  Or export current_is_keventd ;)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 17:52                           ` Andrew Morton
@ 2006-12-07 18:01                             ` Linus Torvalds
  2006-12-07 18:16                               ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2006-12-07 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik



On Thu, 7 Dec 2006, Andrew Morton wrote:
> 
> umm..  Putting a work_struct* into struct cpu_workqueue_struct and then
> doing appropriate things with cpu_workqueue_struct.lock might work.

Yeah, that looks sane. We can't hide anything in "struct work", because we 
can't trust it any more once it's been dispatched, but adding a pointer to 
the cpu_workqueue_struct that is only used to compare against another 
pointer sounds fine.

		Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 17:05                             ` Jeff Garzik
  2006-12-07 17:57                               ` Andrew Morton
@ 2006-12-07 18:08                               ` Maciej W. Rozycki
  2006-12-07 18:59                               ` Andy Fleming
  2 siblings, 0 replies; 54+ messages in thread
From: Maciej W. Rozycki @ 2006-12-07 18:08 UTC (permalink / raw)
  To: Jeff Garzik, Andy Fleming
  Cc: Andrew Morton, Linus Torvalds, David Howells, rdreier,
	ben.collins, linux-kernel

On Thu, 7 Dec 2006, Jeff Garzik wrote:

> Looking into libphy's workqueue stuff, it has the following sequence:
> 
> 	disable interrupts
> 	schedule_work()
> 
> 	... time passes ...
> 	... workqueue routine is called ...
> 
> 	enable interrupts
> 	handle interrupt
> 
> I really have to question if a workqueue was the best choice of direction for
> such a sequence.  You don't want to put off handling an interrupt, with
> interrupts disabled, for a potentially unbounded amount of time.

 This is because to ack the interrupt in the device the MDIO bus has to be 
accessed and I gather for some implementations it may be too obnoxiously 
slow for the interrupt context to cope with.  Note that only the interrupt 
line used for the PHY is disabled (though obviously with consequences to 
any sharers).

 Andy, could you please comment?

  Maciej

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 18:01                             ` Linus Torvalds
@ 2006-12-07 18:16                               ` Andrew Morton
  2006-12-07 18:27                                 ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-07 18:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik

On Thu, 7 Dec 2006 10:01:56 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Thu, 7 Dec 2006, Andrew Morton wrote:
> > 
> > umm..  Putting a work_struct* into struct cpu_workqueue_struct and then
> > doing appropriate things with cpu_workqueue_struct.lock might work.
> 
> Yeah, that looks sane. We can't hide anything in "struct work", because we 
> can't trust it any more once it's been dispatched,

We _can_ trust it in the context of

	void flush_work(struct work_struct *work)

because the caller "owns" the work_struct.  It'd be pretty nutty for the
caller to pass in a pointer to something which could be freed at any time. 
Most flush_work_queue() callers do something like:


	flush_scheduled_work();
	kfree(my_object_which_contains_a_work_struct);

hopefully libphy follows that model...

> but adding a pointer to 
> the cpu_workqueue_struct that is only used to compare against another 
> pointer sounds fine.

ho-hum.  I'll take a look at turning that into something which compiles,
then I'll convert a few oft-used flush_scheduled_work() callers over to use
it.  To do this on a sensible timescale perhaps means that we should export
current_is_keventd(), get the howling hordes off our backs.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 17:57                               ` Andrew Morton
@ 2006-12-07 18:17                                 ` Andrew Morton
  2006-12-08 16:52                                 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-07 18:17 UTC (permalink / raw)
  To: Jeff Garzik, torvalds, macro, David Howells, rdreier, afleming,
	ben.collins, linux-kernel

On Thu, 7 Dec 2006 09:57:15 -0800
Andrew Morton <akpm@osdl.org> wrote:

> On Thu, 07 Dec 2006 12:05:38 -0500
> Jeff Garzik <jeff@garzik.org> wrote:
> 
> > Yes, I merged the code, but looking deeper at phy its clear I missed 
> > some things.
> > 
> > Looking into libphy's workqueue stuff, it has the following sequence:
> > 
> > 	disable interrupts
> > 	schedule_work()
> > 
> > 	... time passes ...
> > 	... workqueue routine is called ...
> > 
> > 	enable interrupts
> > 	handle interrupt
> > 
> > I really have to question if a workqueue was the best choice of 
> > direction for such a sequence.  You don't want to put off handling an 
> > interrupt, with interrupts disabled, for a potentially unbounded amount 
> > of time.
> 
> That'll lock the box on UP, or if the timer fires on the current CPU?

oh. "disable interrupts" == disable_irq(), not local_irq_disable()?

Not so bad ;)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 18:16                               ` Andrew Morton
@ 2006-12-07 18:27                                 ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2006-12-07 18:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming,
	Ben Collins, Linux Kernel Mailing List, Jeff Garzik



On Thu, 7 Dec 2006, Andrew Morton wrote:
> 
> We _can_ trust it in the context of
> 
> 	void flush_work(struct work_struct *work)

Yes, but the way the bits are defined, the "pending" bit is not meaningful 
as a synchronization event, for example - because _other_ users can't 
trust it once they've dispatched the function. So even in the synchronous 
run/flush_scheduled_work() kind of situation, you end up having to work 
with the fact that nobody _else_ can rely on the data structures, and that 
they are designed to work that way..

> ho-hum.  I'll take a look at turning that into something which compiles,
> then I'll convert a few oft-used flush_scheduled_work() callers over to use
> it.  To do this on a sensible timescale perhaps means that we should export
> current_is_keventd(), get the howling hordes off our backs.

Well, I simply committed my work that doesn't guarantee synchronization - 
the synchronization can now be added in kernel/workqueue.c any way we 
want. It's better than what we used to have, for sure, in both compiling 
and solving the practical problem, but also as a "go forward" point.

			Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Export current_is_keventd() for libphy
  2006-12-07 17:05                             ` Jeff Garzik
  2006-12-07 17:57                               ` Andrew Morton
  2006-12-07 18:08                               ` [PATCH] Export current_is_keventd() for libphy Maciej W. Rozycki
@ 2006-12-07 18:59                               ` Andy Fleming
  2 siblings, 0 replies; 54+ messages in thread
From: Andy Fleming @ 2006-12-07 18:59 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, torvalds, macro, David Howells, rdreier,
	ben.collins, linux-kernel


On Dec 7, 2006, at 11:05, Jeff Garzik wrote:

> Yes, I merged the code, but looking deeper at phy its clear I  
> missed some things.
>
> Looking into libphy's workqueue stuff, it has the following sequence:
>
> 	disable interrupts
> 	schedule_work()
>
> 	... time passes ...
> 	... workqueue routine is called ...
>
> 	enable interrupts
> 	handle interrupt
>
> I really have to question if a workqueue was the best choice of  
> direction for such a sequence.  You don't want to put off handling  
> an interrupt, with interrupts disabled, for a potentially unbounded  
> amount of time.
>
> Maybe the best course of action is to mark it with CONFIG_BROKEN  
> until it gets fixed.


Yes, this is one of the design choices I made to be able to lock  
properly around MDIO transactions.

1) MDIO transactions take a long time
2) Some interfaces provide an interrupt to indicate the transaction  
has completed.

So I didn't want an irq-disabling spin lock.  It would prevent 2 from  
ever being used, and would mean all interrupts were disabled for  
thousands of cycles for MDIO transactions.

So I decreed that no MDIO transactions can happen in interrupt  
context.  But the registers to disable the specific PHY's interrupt  
are only accessible through MDIO, so in order to be able to follow  
that edict AND ever handle the interrupt, I need to disable the  
interrupt.  But I only disable the PHY's interrupt (which is likely  
shared among some devices).  I agree it's ugly, but I haven't yet  
figured out another way to do it.

I'd love to eliminate the work queue altogether.  I keep thinking  
that I could do something with semaphores, or spin_trylocks, but I'm  
not sure about the best way to let the interrupt handlers do their  
thing.

Andy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed
  2006-12-07 17:57                               ` Andrew Morton
  2006-12-07 18:17                                 ` Andrew Morton
@ 2006-12-08 16:52                                 ` Eric Dumazet
  2006-12-09  5:46                                   ` Andrew Morton
  2006-12-13 21:26                                   ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet
  1 sibling, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-08 16:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

This patch introduces a new structure called ktimed (Kernel Time Data), where 
some time keeping related variables are put together to share as few cache 
lines as possible. This avoid some false sharing, (since linker could put 
calc_load_count in a *random* cache line for example)

I also optimized calc_load() to not always call count_active_tasks() :
It should call it only once every 5 seconds (LOAD_FREQ=5*HZ)

Note : x86_64 was using an arch specific placement of __xtime and __xtime_lock 
(see arch/x86_64/kernel/vmlinux.lds.S). (vsyscall stuff)
It is now using a specific placement of __ktimed, since xtime and xtime_lock 
are now fields from __ktimed.

Note : I failed to move jiffies64 as well in ktimed : too many changes needed 
because of jiffies aliasing (and endianess), but it could be done.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 arch/x86_64/kernel/time.c        |    5 ++-
 arch/x86_64/kernel/vmlinux.lds.S |    7 +---
 arch/x86_64/kernel/vsyscall.c    |    1 
 include/asm-x86_64/vsyscall.h    |   13 +++------
 include/linux/sched.h            |    1 
 include/linux/time.h             |   18 ++++++++++--
 kernel/timer.c                   |   41 ++++++++++++-----------------
 7 files changed, 43 insertions(+), 43 deletions(-)


[-- Attachment #2: xtime_s.patch --]
[-- Type: text/plain, Size: 6905 bytes --]

--- linux-2.6.19/include/linux/time.h	2006-12-08 11:40:46.000000000 +0100
+++ linux-2.6.19-ed/include/linux/time.h	2006-12-08 16:58:57.000000000 +0100
@@ -88,9 +88,21 @@ static inline struct timespec timespec_s
 #define timespec_valid(ts) \
 	(((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC))
 
-extern struct timespec xtime;
-extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock;
+/*
+ * define a structure to keep all fields close to each others.
+ */
+struct ktimed_struct {
+	struct timespec _xtime;
+	struct timespec wall_to_monotonic;
+	seqlock_t lock;
+	unsigned long avenrun[3];
+	int calc_load_count;
+};
+extern struct ktimed_struct ktimed;
+#define xtime             ktimed._xtime
+#define wall_to_monotonic ktimed.wall_to_monotonic
+#define xtime_lock        ktimed.lock
+#define avenrun           ktimed.avenrun
 
 void timekeeping_init(void);
 
--- linux-2.6.19/kernel/timer.c	2006-12-08 11:50:11.000000000 +0100
+++ linux-2.6.19-ed/kernel/timer.c	2006-12-08 18:13:24.000000000 +0100
@@ -570,11 +570,13 @@ found:
  * however, we will ALWAYS keep the tv_nsec part positive so we can use
  * the usual normalization.
  */
-struct timespec xtime __attribute__ ((aligned (16)));
-struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
-
-EXPORT_SYMBOL(xtime);
-
+#ifndef ARCH_HAVE_KTIMED
+struct ktimed_struct ktimed __cacheline_aligned = {
+	.lock = __SEQLOCK_UNLOCKED(ktimed.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+EXPORT_SYMBOL(ktimed);
+#endif
 
 /* XXX - all of this timekeeping code should be later moved to time.c */
 #include <linux/clocksource.h>
@@ -995,9 +997,6 @@ static unsigned long count_active_tasks(
  *
  * Requires xtime_lock to access.
  */
-unsigned long avenrun[3];
-
-EXPORT_SYMBOL(avenrun);
 
 /*
  * calc_load - given tick count, update the avenrun load estimates.
@@ -1006,27 +1005,21 @@ EXPORT_SYMBOL(avenrun);
 static inline void calc_load(unsigned long ticks)
 {
 	unsigned long active_tasks; /* fixed-point */
-	static int count = LOAD_FREQ;
 
-	active_tasks = count_active_tasks();
-	for (count -= ticks; count < 0; count += LOAD_FREQ) {
-		CALC_LOAD(avenrun[0], EXP_1, active_tasks);
-		CALC_LOAD(avenrun[1], EXP_5, active_tasks);
-		CALC_LOAD(avenrun[2], EXP_15, active_tasks);
+	ktimed.calc_load_count -= ticks;
+
+	if (unlikely(ktimed.calc_load_count < 0)) {
+		active_tasks = count_active_tasks();
+		do {
+			ktimed.calc_load_count += LOAD_FREQ;
+			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
+			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
+			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
+		} while (ktimed.calc_load_count < 0);
 	}
 }
 
 /*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
- */
-#ifndef ARCH_HAVE_XTIME_LOCK
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
-
-EXPORT_SYMBOL(xtime_lock);
-#endif
-
-/*
  * This function runs timers and the timer-tq in bottom half context.
  */
 static void run_timer_softirq(struct softirq_action *h)
--- linux-2.6.19/include/linux/sched.h	2006-12-08 12:10:45.000000000 +0100
+++ linux-2.6.19-ed/include/linux/sched.h	2006-12-08 12:10:59.000000000 +0100
@@ -104,7 +104,6 @@ struct futex_pi_state;
  *    the EXP_n values would be 1981, 2034 and 2043 if still using only
  *    11 bit fractions.
  */
-extern unsigned long avenrun[];		/* Load averages */
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
--- linux-2.6.19/include/asm-x86_64/vsyscall.h	2006-12-08 11:47:03.000000000 +0100
+++ linux-2.6.19-ed/include/asm-x86_64/vsyscall.h	2006-12-08 18:04:29.000000000 +0100
@@ -20,8 +20,7 @@ enum vsyscall_num {
 #define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
 #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16)))
 #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16)))
-#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16)))
-#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16)))
+#define __section_ktimed __attribute__ ((unused, __section__ (".ktimed"), aligned(16)))
 
 #define VXTIME_TSC	1
 #define VXTIME_HPET	2
@@ -45,21 +44,19 @@ struct vxtime_data {
 /* vsyscall space (readonly) */
 extern struct vxtime_data __vxtime;
 extern int __vgetcpu_mode;
-extern struct timespec __xtime;
 extern volatile unsigned long __jiffies;
 extern struct timezone __sys_tz;
-extern seqlock_t __xtime_lock;
+extern struct ktimed_struct __ktimed;
+#define __xtime_lock __ktimed.lock
+#define __xtime      __ktimed._xtime
 
 /* kernel space (writeable) */
 extern struct vxtime_data vxtime;
 extern int vgetcpu_mode;
 extern struct timezone sys_tz;
 extern int sysctl_vsyscall;
-extern seqlock_t xtime_lock;
 
-extern int sysctl_vsyscall;
-
-#define ARCH_HAVE_XTIME_LOCK 1
+#define ARCH_HAVE_KTIMED 1
 
 #endif /* __KERNEL__ */
 
--- linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S	2006-12-08 16:03:13.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/vmlinux.lds.S	2006-12-08 17:52:29.000000000 +0100
@@ -94,8 +94,8 @@ SECTIONS
   __vsyscall_0 = VSYSCALL_VIRT_ADDR;
 
   . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) }
-  xtime_lock = VVIRT(.xtime_lock);
+  .ktimed : AT(VLOAD(.ktimed)) { *(.ktimed) }
+  ktimed = VVIRT(.ktimed);
 
   .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) }
   vxtime = VVIRT(.vxtime);
@@ -109,9 +109,6 @@ SECTIONS
   .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) }
   sysctl_vsyscall = VVIRT(.sysctl_vsyscall);
 
-  .xtime : AT(VLOAD(.xtime)) { *(.xtime) }
-  xtime = VVIRT(.xtime);
-
   . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
   .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
   jiffies = VVIRT(.jiffies);
--- linux-2.6.19/arch/x86_64/kernel/vsyscall.c	2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/vsyscall.c	2006-12-08 18:02:49.000000000 +0100
@@ -44,7 +44,6 @@
 #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
 int __vgetcpu_mode __section_vgetcpu_mode;
 
 #include <asm/unistd.h>
--- linux-2.6.19/arch/x86_64/kernel/time.c	2006-12-08 16:22:15.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/time.c	2006-12-08 18:13:24.000000000 +0100
@@ -77,7 +77,10 @@ unsigned long long monotonic_base;
 struct vxtime_data __vxtime __section_vxtime;	/* for vsyscalls */
 
 volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
-struct timespec __xtime __section_xtime;
+struct ktimed_struct __ktimed __section_ktimed = {
+	.lock =  __SEQLOCK_UNLOCKED(ktimed.lock),
+	.calc_load_count = LOAD_FREQ,
+};
 struct timezone __sys_tz __section_sys_tz;
 
 /*

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed
  2006-12-08 16:52                                 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet
@ 2006-12-09  5:46                                   ` Andrew Morton
  2006-12-09  6:07                                     ` Randy Dunlap
  2006-12-11 20:44                                     ` Eric Dumazet
  2006-12-13 21:26                                   ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet
  1 sibling, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-09  5:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

On Fri, 8 Dec 2006 17:52:09 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> This patch introduces a new structure called ktimed (Kernel Time Data), where 
> some time keeping related variables are put together to share as few cache 
> lines as possible. This avoid some false sharing, (since linker could put 
> calc_load_count in a *random* cache line for example)
> 
> I also optimized calc_load() to not always call count_active_tasks() :
> It should call it only once every 5 seconds (LOAD_FREQ=5*HZ)
> 
> Note : x86_64 was using an arch specific placement of __xtime and __xtime_lock 
> (see arch/x86_64/kernel/vmlinux.lds.S). (vsyscall stuff)
> It is now using a specific placement of __ktimed, since xtime and xtime_lock 
> are now fields from __ktimed.
> 
> Note : I failed to move jiffies64 as well in ktimed : too many changes needed 
> because of jiffies aliasing (and endianess), but it could be done.
> 

Sounds like you have about three patches there.

<save attachment, read from file, s/^/> />

>  
> -extern struct timespec xtime;
> -extern struct timespec wall_to_monotonic;
> -extern seqlock_t xtime_lock;
> +/*
> + * define a structure to keep all fields close to each others.
> + */
> +struct ktimed_struct {
> +	struct timespec _xtime;
> +	struct timespec wall_to_monotonic;
> +	seqlock_t lock;
> +	unsigned long avenrun[3];
> +	int calc_load_count;
> +};

crappy name, but I guess it doesn't matter as nobody will use it at this
stage.  But...

> +extern struct ktimed_struct ktimed;
> +#define xtime             ktimed._xtime
> +#define wall_to_monotonic ktimed.wall_to_monotonic
> +#define xtime_lock        ktimed.lock
> +#define avenrun           ktimed.avenrun

They'll use these instead.

Frankly, I think we'd be better off removing these macros and, longer-term,
use

	write_seqlock(time_data.xtime_lock);

The approach you have here would be a good transition-period thing.

>  void timekeeping_init(void);
>  
> --- linux-2.6.19/kernel/timer.c	2006-12-08 11:50:11.000000000 +0100
> +++ linux-2.6.19-ed/kernel/timer.c	2006-12-08 18:13:24.000000000 +0100
> @@ -570,11 +570,13 @@ found:
>   * however, we will ALWAYS keep the tv_nsec part positive so we can use
>   * the usual normalization.
>   */
> -struct timespec xtime __attribute__ ((aligned (16)));
> -struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> -
> -EXPORT_SYMBOL(xtime);
> -
> +#ifndef ARCH_HAVE_KTIMED

argh, another ARCH_HAVE_MESS.  Due to the x86_64 vsyscall code.

Could you please see if we can nuke this by making
kernel/timer.c:xtime_lock use attribute(weak)?  In a separate patch ;)

> +struct ktimed_struct ktimed __cacheline_aligned = {
> +	.lock = __SEQLOCK_UNLOCKED(ktimed.lock),
> +	.calc_load_count = LOAD_FREQ,
> +};
> +EXPORT_SYMBOL(ktimed);
> +#endif
>  
>  /* XXX - all of this timekeeping code should be later moved to time.c */
>  #include <linux/clocksource.h>
> @@ -995,9 +997,6 @@ static unsigned long count_active_tasks(
>   *
>   * Requires xtime_lock to access.
>   */
> -unsigned long avenrun[3];
> -
> -EXPORT_SYMBOL(avenrun);
>  
>  /*
>   * calc_load - given tick count, update the avenrun load estimates.
> @@ -1006,27 +1005,21 @@ EXPORT_SYMBOL(avenrun);
>  static inline void calc_load(unsigned long ticks)
>  {
>  	unsigned long active_tasks; /* fixed-point */
> -	static int count = LOAD_FREQ;
>  
> -	active_tasks = count_active_tasks();
> -	for (count -= ticks; count < 0; count += LOAD_FREQ) {
> -		CALC_LOAD(avenrun[0], EXP_1, active_tasks);
> -		CALC_LOAD(avenrun[1], EXP_5, active_tasks);
> -		CALC_LOAD(avenrun[2], EXP_15, active_tasks);
> +	ktimed.calc_load_count -= ticks;
> +
> +	if (unlikely(ktimed.calc_load_count < 0)) {
> +		active_tasks = count_active_tasks();
> +		do {
> +			ktimed.calc_load_count += LOAD_FREQ;
> +			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
> +			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
> +			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
> +		} while (ktimed.calc_load_count < 0);
>  	}
>  }
>  
> ...
>
> +extern struct ktimed_struct __ktimed;
> +#define __xtime_lock __ktimed.lock
> +#define __xtime      __ktimed._xtime
>  
>  /* kernel space (writeable) */
>  extern struct vxtime_data vxtime;
>  extern int vgetcpu_mode;
>  extern struct timezone sys_tz;
>  extern int sysctl_vsyscall;
> -extern seqlock_t xtime_lock;
>  
> -extern int sysctl_vsyscall;
> -
> -#define ARCH_HAVE_XTIME_LOCK 1
> +#define ARCH_HAVE_KTIMED 1
>  

hm, the patch seems to transform a mess into a mess.  I guess it's a messy
problem.

I agree that aggregating all the time-related things into a struct like
this makes some sense.  As does aggregating them all into a similar-looking
namespace, but that'd probably be too intrusive - too late for that.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed
  2006-12-09  5:46                                   ` Andrew Morton
@ 2006-12-09  6:07                                     ` Randy Dunlap
  2006-12-11 20:44                                     ` Eric Dumazet
  1 sibling, 0 replies; 54+ messages in thread
From: Randy Dunlap @ 2006-12-09  6:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Eric Dumazet, Andi Kleen, linux-kernel

On Fri, 8 Dec 2006 21:46:25 -0800 Andrew Morton wrote:

> On Fri, 8 Dec 2006 17:52:09 +0100
> Eric Dumazet <dada1@cosmosbay.com> wrote:

[snip]

> Sounds like you have about three patches there.
> 
> <save attachment, read from file, s/^/> />
> 
> >  
> > -extern struct timespec xtime;
> > -extern struct timespec wall_to_monotonic;
> > -extern seqlock_t xtime_lock;
> > +/*
> > + * define a structure to keep all fields close to each others.
> > + */
> > +struct ktimed_struct {
> > +	struct timespec _xtime;
> > +	struct timespec wall_to_monotonic;
> > +	seqlock_t lock;
> > +	unsigned long avenrun[3];
> > +	int calc_load_count;
> > +};
> 
> crappy name, but I guess it doesn't matter as nobody will use it at this
> stage.  But...

ktimedata would be better IMO.  somethingd looks like a daemon.

> 
> > +extern struct ktimed_struct ktimed;
> > +#define xtime             ktimed._xtime
> > +#define wall_to_monotonic ktimed.wall_to_monotonic
> > +#define xtime_lock        ktimed.lock
> > +#define avenrun           ktimed.avenrun
> 
> They'll use these instead.
> 
> Frankly, I think we'd be better off removing these macros and, longer-term,
> use
> 
> 	write_seqlock(time_data.xtime_lock);
> 
> The approach you have here would be a good transition-period thing.

[snip]

> hm, the patch seems to transform a mess into a mess.  I guess it's a messy
> problem.
> 
> I agree that aggregating all the time-related things into a struct like
> this makes some sense.  As does aggregating them all into a similar-looking
> namespace, but that'd probably be too intrusive - too late for that.

Just curious, is the change measurable (time/performance wise)?
Patch makes sense anyway.

---
~Randy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed
  2006-12-09  5:46                                   ` Andrew Morton
  2006-12-09  6:07                                     ` Randy Dunlap
@ 2006-12-11 20:44                                     ` Eric Dumazet
  2006-12-11 22:00                                       ` Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2006-12-11 20:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

Andrew Morton a écrit :
> 
> hm, the patch seems to transform a mess into a mess.  I guess it's a messy
> problem.
> 
> I agree that aggregating all the time-related things into a struct like
> this makes some sense.  As does aggregating them all into a similar-looking
> namespace, but that'd probably be too intrusive - too late for that.


Hi Andrew, thanks for your comments.

I sent two patches for the __attribute__((weak)) xtime_lock thing, and 
calc_load() optimization, which dont depend on ktimed.

Should I now send patches for aggregating things or is it considered too 
intrusive ? (Sorry if I didnt understand your last sentence)

If yes, should I send separate patches to :

1) define an empty ktimed (or with a placeholder for jiffies64, not yet used)
2) move xtime into ktimed
3) move xtime_lock into ktimed
4) move wall_to_monotonic into ktimed
5) move calc_load.count into ktimed
6) move avenrun into ktimed.
7) patches to use ktimed.jiffies64 on various arches (with the problem of 
aliasing jiffies)

Thank you
Eric


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed
  2006-12-11 20:44                                     ` Eric Dumazet
@ 2006-12-11 22:00                                       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2006-12-11 22:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

On Mon, 11 Dec 2006 21:44:34 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Andrew Morton a __crit :
> > 
> > hm, the patch seems to transform a mess into a mess.  I guess it's a messy
> > problem.
> > 
> > I agree that aggregating all the time-related things into a struct like
> > this makes some sense.  As does aggregating them all into a similar-looking
> > namespace, but that'd probably be too intrusive - too late for that.
> 
> 
> Hi Andrew, thanks for your comments.
> 
> I sent two patches for the __attribute__((weak)) xtime_lock thing, and 
> calc_load() optimization, which dont depend on ktimed.

yup, thanks.

> Should I now send patches for aggregating things or is it considered too 
> intrusive ?

The previous version didn't look too intrusive.  But it would be nice to
have a plan to get rid of the macros:

#define xtime_lock	ktimed.xtime_lock

and just open-code this everywhere.

> (Sorry if I didnt understand your last sentence)

What I meant was: if we're not going to to aggregate all these globals like
this:

	ktimed.xtime_lock
	ktimed.wall_to_monotonic

then it would be nice if they were at least aggregated by naming convention:

	time_management_time_lock
	time_management_wall_to_monotonic
	etc

so the reader can see that these things are all part of the same subsystem.

But the proposed ktimed.xtime_lock achieves that, and has runtime benefits
too.

Can we please not call it ktimed?  That sounds like a kernel thread to me. 
time_data would be better.

> If yes, should I send separate patches to :
> 
> 1) define an empty ktimed (or with a placeholder for jiffies64, not yet used)
> 2) move xtime into ktimed
> 3) move xtime_lock into ktimed
> 4) move wall_to_monotonic into ktimed
> 5) move calc_load.count into ktimed
> 6) move avenrun into ktimed.

A single patch there would suffice, I suspect.

> 7) patches to use ktimed.jiffies64 on various arches (with the problem of 
> aliasing jiffies)

That might be a sprinkle of per-arch patches, but I'm not sure what is
entailed here.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
  2006-12-08 16:52                                 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet
  2006-12-09  5:46                                   ` Andrew Morton
@ 2006-12-13 21:26                                   ` Eric Dumazet
  2006-12-15  5:24                                     ` Andrew Morton
  2006-12-15 16:21                                     ` Eric Dumazet
  1 sibling, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-13 21:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]

This patch introduces a new structure called time_data, where some time 
keeping related variables are put together to share as few cache lines as 
possible. This should reduce timer interrupt latency and cache lines ping 
pongs in SMP machines.

struct time_data {
        u64                    _jiffies_64;
        struct timespec _xtime;
        struct timespec _wall_to_monotonic;
        seqlock_t           lock;
        int                     calc_load_count;
        unsigned int     _avenrun[3];
};

_jiffies_64 is a place holder so that arches can (optionally) aliases 
jiffies_64/jiffies in time_data. This patch does the thing for i386 and 
x86_64. 

avenrun, xtime, xtime_lock, wall_to_monotonic, are now temporary defined as 
macros to make this patch not too invasive, but we can in future patches 
gradually deletes these macros.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 arch/i386/kernel/vmlinux.lds.S   |    3 +
 arch/s390/appldata/appldata_os.c |   18 ++++-----
 arch/x86_64/kernel/time.c        |    8 +++-
 arch/x86_64/kernel/vmlinux.lds.S |   14 ++-----
 arch/x86_64/kernel/vsyscall.c    |   15 +++-----
 include/asm-x86_64/vsyscall.h    |    8 +---
 include/linux/jiffies.h          |    2 -
 include/linux/sched.h            |   10 +++--
 include/linux/time.h             |   33 +++++++++++++++--
 kernel/timer.c                   |   54 ++++++++++++++---------------
 10 files changed, 92 insertions(+), 73 deletions(-)

[-- Attachment #2: time_data.patch --]
[-- Type: text/plain, Size: 14108 bytes --]

--- linux-2.6.19/include/linux/time.h	2006-12-13 20:21:23.000000000 +0100
+++ linux-2.6.19/include/linux/time.h	2006-12-13 18:06:44.000000000 +0100
@@ -88,21 +88,44 @@ static inline struct timespec timespec_s
 #define timespec_valid(ts) \
 	(((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC))
 
-extern struct timespec xtime;
-extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock __attribute__((weak));
+/*
+ * Structure time_data holds most timekeeping related variables
+ * to make them share as few cache lines as possible.
+ */
+struct time_data {
+	/*
+	 * We want to alias jiffies_64 to time_data._jiffies_64,
+	 * in vmlinux.lds.S files (jiffies_64 = time_data)
+	 * so keep _jiffies_64 at the beginning of time_data
+	 */
+	u64		_jiffies_64;
+	struct timespec _xtime;
+	struct timespec _wall_to_monotonic;
+	seqlock_t	lock;
+	int		calc_load_count;
+	unsigned int	_avenrun[3];
+};
+/*
+ * time_data has weak attribute because some arches (x86_64) want
+ * to have a read_only version in their vsyscall page
+ */
+extern struct time_data time_data __attribute__((weak));
+#define xtime             time_data._xtime
+#define wall_to_monotonic time_data._wall_to_monotonic
+#define xtime_lock        time_data.lock
+#define avenrun           time_data._avenrun
 
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
 {
-	return xtime.tv_sec;
+	return time_data._xtime.tv_sec;
 }
 
 struct timespec current_kernel_time(void);
 
 #define CURRENT_TIME		(current_kernel_time())
-#define CURRENT_TIME_SEC	((struct timespec) { xtime.tv_sec, 0 })
+#define CURRENT_TIME_SEC	((struct timespec) { time_data._xtime.tv_sec, 0 })
 
 extern void do_gettimeofday(struct timeval *tv);
 extern int do_settimeofday(struct timespec *tv);
--- linux-2.6.19/include/linux/jiffies.h	2006-12-12 00:32:00.000000000 +0100
+++ linux-2.6.19/include/linux/jiffies.h	2006-12-13 18:11:30.000000000 +0100
@@ -78,7 +78,7 @@
  * without sampling the sequence number in xtime_lock.
  * get_jiffies_64() will do this for you as appropriate.
  */
-extern u64 __jiffy_data jiffies_64;
+extern u64 __jiffy_data jiffies_64 __attribute__((weak));
 extern unsigned long volatile __jiffy_data jiffies;
 
 #if (BITS_PER_LONG < 64)
--- linux-2.6.19/include/linux/sched.h	2006-12-08 12:10:45.000000000 +0100
+++ linux-2.6.19/include/linux/sched.h	2006-12-13 15:54:29.000000000 +0100
@@ -104,7 +104,6 @@ struct futex_pi_state;
  *    the EXP_n values would be 1981, 2034 and 2043 if still using only
  *    11 bit fractions.
  */
-extern unsigned long avenrun[];		/* Load averages */
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -114,9 +113,12 @@ extern unsigned long avenrun[];		/* Load
 #define EXP_15		2037		/* 1/exp(5sec/15min) */
 
 #define CALC_LOAD(load,exp,n) \
-	load *= exp; \
-	load += n*(FIXED_1-exp); \
-	load >>= FSHIFT;
+	{ \
+	unsigned long temp = load; \
+	temp *= exp; \
+	temp += n*(FIXED_1-exp); \
+	load = temp >> FSHIFT; \
+	}
 
 extern unsigned long total_forks;
 extern int nr_threads;
--- linux-2.6.19/kernel/timer.c	2006-12-13 13:28:11.000000000 +0100
+++ linux-2.6.19/kernel/timer.c	2006-12-13 20:58:53.000000000 +0100
@@ -41,7 +41,7 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
-u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
+__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
 
@@ -570,10 +570,12 @@ found:
  * however, we will ALWAYS keep the tv_nsec part positive so we can use
  * the usual normalization.
  */
-struct timespec xtime __attribute__ ((aligned (16)));
-struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
-
-EXPORT_SYMBOL(xtime);
+__attribute__((weak)) struct time_data time_data __cacheline_aligned = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock = __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+EXPORT_SYMBOL(time_data);
 
 
 /* XXX - all of this timekeeping code should be later moved to time.c */
@@ -995,9 +997,6 @@ static unsigned long count_active_tasks(
  *
  * Requires xtime_lock to access.
  */
-unsigned long avenrun[3];
-
-EXPORT_SYMBOL(avenrun);
 
 /*
  * calc_load - given tick count, update the avenrun load estimates.
@@ -1006,29 +1005,20 @@ EXPORT_SYMBOL(avenrun);
 static inline void calc_load(unsigned long ticks)
 {
 	unsigned long active_tasks; /* fixed-point */
-	static int count = LOAD_FREQ;
 
-	count -= ticks;
-	if (unlikely(count < 0)) {
+	time_data.calc_load_count -= ticks;
+	if (unlikely(time_data.calc_load_count < 0)) {
 		active_tasks = count_active_tasks();
 		do {
-			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
-			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
-			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
-			count += LOAD_FREQ;
-		} while (count < 0);
+			CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks);
+			CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks);
+			CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks);
+			time_data.calc_load_count += LOAD_FREQ;
+		} while (time_data.calc_load_count < 0);
 	}
 }
 
 /*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
- */
-__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
-
-EXPORT_SYMBOL(xtime_lock);
-
-/*
  * This function runs timers and the timer-tq in bottom half context.
  */
 static void run_timer_softirq(struct softirq_action *h)
@@ -1253,6 +1243,16 @@ asmlinkage long sys_gettid(void)
 }
 
 /**
+ * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[]
+ * @idx: index (0..2) in time_data._avenrun[]
+ */
+static inline unsigned long avenrun_to_loads(unsigned int idx)
+{
+	unsigned long ret = time_data._avenrun[idx];
+	return ret << (SI_LOAD_SHIFT - FSHIFT);
+}
+
+/**
  * sys_sysinfo - fill in sysinfo struct
  * @info: pointer to buffer to fill
  */ 
@@ -1285,9 +1285,9 @@ asmlinkage long sys_sysinfo(struct sysin
 		}
 		val.uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
 
-		val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT);
+		val.loads[0] = avenrun_to_loads(0);
+		val.loads[1] = avenrun_to_loads(1);
+		val.loads[2] = avenrun_to_loads(2);
 
 		val.procs = nr_threads;
 	} while (read_seqretry(&xtime_lock, seq));
--- linux-2.6.19/include/asm-x86_64/vsyscall.h	2006-12-13 13:31:50.000000000 +0100
+++ linux-2.6.19/include/asm-x86_64/vsyscall.h	2006-12-13 21:20:18.000000000 +0100
@@ -17,11 +17,9 @@ enum vsyscall_num {
 
 #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16)))
 #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
 #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16)))
 #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16)))
-#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16)))
-#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16)))
+#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16)))
 
 #define VXTIME_TSC	1
 #define VXTIME_HPET	2
@@ -43,12 +41,10 @@ struct vxtime_data {
 #define hpet_writel(d,a)        writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a)
 
 /* vsyscall space (readonly) */
+extern struct time_data __time_data;
 extern struct vxtime_data __vxtime;
 extern int __vgetcpu_mode;
-extern struct timespec __xtime;
-extern volatile unsigned long __jiffies;
 extern struct timezone __sys_tz;
-extern seqlock_t __xtime_lock;
 
 /* kernel space (writeable) */
 extern struct vxtime_data vxtime;
--- linux-2.6.19/arch/x86_64/kernel/vsyscall.c	2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.19/arch/x86_64/kernel/vsyscall.c	2006-12-13 21:04:03.000000000 +0100
@@ -44,7 +44,6 @@
 #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
 int __vgetcpu_mode __section_vgetcpu_mode;
 
 #include <asm/unistd.h>
@@ -66,10 +65,10 @@ static __always_inline void do_vgettimeo
 	unsigned long sec, usec;
 
 	do {
-		sequence = read_seqbegin(&__xtime_lock);
+		sequence = read_seqbegin(&__time_data.lock);
 		
-		sec = __xtime.tv_sec;
-		usec = __xtime.tv_nsec / 1000;
+		sec = __time_data._xtime.tv_sec;
+		usec = __time_data._xtime.tv_nsec / 1000;
 
 		if (__vxtime.mode != VXTIME_HPET) {
 			t = get_cycles_sync();
@@ -83,7 +82,7 @@ static __always_inline void do_vgettimeo
 				   fix_to_virt(VSYSCALL_HPET) + 0xf0) -
 				  __vxtime.last) * __vxtime.quot) >> 32;
 		}
-	} while (read_seqretry(&__xtime_lock, sequence));
+	} while (read_seqretry(&__time_data.lock, sequence));
 
 	tv->tv_sec = sec + usec / 1000000;
 	tv->tv_usec = usec % 1000000;
@@ -131,8 +130,8 @@ time_t __vsyscall(1) vtime(time_t *t)
 	if (!__sysctl_vsyscall)
 		return time_syscall(t);
 	else if (t)
-		*t = __xtime.tv_sec;		
-	return __xtime.tv_sec;
+		*t = __time_data._xtime.tv_sec;		
+	return __time_data._xtime.tv_sec;
 }
 
 /* Fast way to get current CPU and node.
@@ -157,7 +156,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s
 	   We do this here because otherwise user space would do it on
 	   its own in a likely inferior way (no access to jiffies).
 	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = __jiffies)) {
+	if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) {
 		p = tcache->blob[1];
 	} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
--- linux-2.6.19/arch/x86_64/kernel/time.c	2006-12-08 16:22:15.000000000 +0100
+++ linux-2.6.19/arch/x86_64/kernel/time.c	2006-12-13 15:47:01.000000000 +0100
@@ -76,8 +76,12 @@ unsigned long long monotonic_base;
 
 struct vxtime_data __vxtime __section_vxtime;	/* for vsyscalls */
 
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
-struct timespec __xtime __section_xtime;
+struct time_data __time_data __section_time_data = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock =  __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+
 struct timezone __sys_tz __section_sys_tz;
 
 /*
--- linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S	2006-12-08 16:03:13.000000000 +0100
+++ linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S	2006-12-13 21:04:03.000000000 +0100
@@ -12,7 +12,6 @@
 OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
 OUTPUT_ARCH(i386:x86-64)
 ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
 	data PT_LOAD FLAGS(7);	/* RWE */
@@ -94,8 +93,10 @@ SECTIONS
   __vsyscall_0 = VSYSCALL_VIRT_ADDR;
 
   . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) }
-  xtime_lock = VVIRT(.xtime_lock);
+  .time_data : AT(VLOAD(.time_data)) { *(.time_data) }
+  time_data = VVIRT(.time_data);
+  jiffies = time_data;
+  jiffies_64 = time_data;
 
   .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) }
   vxtime = VVIRT(.vxtime);
@@ -109,13 +110,6 @@ SECTIONS
   .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) }
   sysctl_vsyscall = VVIRT(.sysctl_vsyscall);
 
-  .xtime : AT(VLOAD(.xtime)) { *(.xtime) }
-  xtime = VVIRT(.xtime);
-
-  . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
-  jiffies = VVIRT(.jiffies);
-
   .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) }
   .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) }
   .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) }
--- linux-2.6.19/arch/i386/kernel/vmlinux.lds.S	2006-12-13 16:09:56.000000000 +0100
+++ linux-2.6.19/arch/i386/kernel/vmlinux.lds.S	2006-12-13 20:35:45.000000000 +0100
@@ -12,7 +12,8 @@
 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
 OUTPUT_ARCH(i386)
 ENTRY(phys_startup_32)
-jiffies = jiffies_64;
+jiffies = time_data;
+jiffies_64 = time_data;
 
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
--- linux-2.6.19/arch/s390/appldata/appldata_os.c	2006-12-13 16:27:59.000000000 +0100
+++ linux-2.6.19/arch/s390/appldata/appldata_os.c	2006-12-13 16:27:59.000000000 +0100
@@ -68,7 +68,7 @@ struct appldata_os_data {
 
 	u32 nr_running;		/* number of runnable threads      */
 	u32 nr_threads;		/* number of threads               */
-	u32 avenrun[3];		/* average nr. of running processes during */
+	u32 _avenrun[3];		/* average nr. of running processes during */
 				/* the last 1, 5 and 15 minutes */
 
 	/* New in 2.6 */
@@ -98,11 +98,11 @@ static inline void appldata_print_debug(
 	P_DEBUG("nr_threads   = %u\n", os_data->nr_threads);
 	P_DEBUG("nr_running   = %u\n", os_data->nr_running);
 	P_DEBUG("nr_iowait    = %u\n", os_data->nr_iowait);
-	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0],
-		os_data->avenrun[1], os_data->avenrun[2]);
-	a0 = os_data->avenrun[0];
-	a1 = os_data->avenrun[1];
-	a2 = os_data->avenrun[2];
+	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0],
+		os_data->_avenrun[1], os_data->_avenrun[2]);
+	a0 = os_data->_avenrun[0];
+	a1 = os_data->_avenrun[1];
+	a2 = os_data->_avenrun[2];
 	P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n",
 		LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1),
 		LOAD_INT(a2), LOAD_FRAC(a2));
@@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
 	os_data->nr_iowait  = nr_iowait();
-	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
-	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
-	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
+	os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200);
+	os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200);
+	os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200);
 
 	j = 0;
 	for_each_online_cpu(i) {

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
  2006-12-13 21:26                                   ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet
@ 2006-12-15  5:24                                     ` Andrew Morton
  2006-12-15 11:21                                       ` Eric Dumazet
  2006-12-15 16:21                                     ` Eric Dumazet
  1 sibling, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2006-12-15  5:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

On Wed, 13 Dec 2006 22:26:26 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> This patch introduces a new structure called time_data, where some time 
> keeping related variables are put together to share as few cache lines as 
> possible.

ia64 refers to xtime_lock from assembly and hence doesn't link.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
  2006-12-15  5:24                                     ` Andrew Morton
@ 2006-12-15 11:21                                       ` Eric Dumazet
  0 siblings, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-15 11:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, Ingo Molnar,
	john stultz, Roman Zippel

[-- Attachment #1: Type: text/plain, Size: 2573 bytes --]

On Friday 15 December 2006 06:24, Andrew Morton wrote:
> On Wed, 13 Dec 2006 22:26:26 +0100
>
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> > This patch introduces a new structure called time_data, where some time
> > keeping related variables are put together to share as few cache lines as
> > possible.
>
> ia64 refers to xtime_lock from assembly and hence doesn't link.
>

I see, I missed this.

In this new version, I define linker aliases on IA64 for xtime, xtime_lock and 
wall_to_monotonic. An ia64 expert might change arch/ia64/kernel/fsys.S later 
to only use time_data.

Thank you

[PATCH] Introduce time_data, a new structure to hold jiffies, xtime, 
xtime_lock, wall_to_monotonic, calc_load_count and avenrun

This patch introduces a new structure called time_data, where some time 
keeping related variables are put together to share as few cache lines as 
possible. This should reduce timer interrupt latency and cache lines ping 
pongs in SMP machines.

struct time_data {
        u64                    _jiffies_64;
        struct timespec _xtime;
        struct timespec _wall_to_monotonic;
        seqlock_t           lock;
        int                     calc_load_count;
        unsigned int     _avenrun[3];
};

_jiffies_64 is a place holder so that arches can (optionally) aliases 
jiffies_64/jiffies in time_data. This patch does the thing for i386, x86_64 
and ia64 

avenrun, xtime, xtime_lock, wall_to_monotonic, are now temporary defined as 
macros to make this patch not too invasive, but we can in future patches 
gradually deletes these macros.

For ia64, I added aliases for xtime, xtime_lock and wall_to_monotonic because 
arch/ia64/kernel/fsys.S uses these symbols. This file might be changed to 
directly access time_data in the future (reducing registers use as well)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 arch/i386/kernel/vmlinux.lds.S   |    3 +
 arch/ia64/kernel/asm-offsets.c   |    3 +
 arch/ia64/kernel/vmlinux.lds.S   |    7 +++
 arch/s390/appldata/appldata_os.c |   18 ++++-----
 arch/x86_64/kernel/time.c        |    8 +++-
 arch/x86_64/kernel/vmlinux.lds.S |   14 ++-----
 arch/x86_64/kernel/vsyscall.c    |   15 +++-----
 include/asm-x86_64/vsyscall.h    |    8 +---
 include/linux/jiffies.h          |    2 -
 include/linux/sched.h            |   10 +++--
 include/linux/time.h             |   33 +++++++++++++++--
 kernel/timer.c                   |   54 ++++++++++++++---------------
 12 files changed, 101 insertions(+), 74 deletions(-)


[-- Attachment #2: time_data.patch --]
[-- Type: text/plain, Size: 15514 bytes --]

--- linux-2.6.19/include/linux/time.h	2006-12-13 20:21:23.000000000 +0100
+++ linux-2.6.19-ed/include/linux/time.h	2006-12-15 11:22:46.000000000 +0100
@@ -88,21 +88,44 @@ static inline struct timespec timespec_s
 #define timespec_valid(ts) \
 	(((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC))
 
-extern struct timespec xtime;
-extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock __attribute__((weak));
+/*
+ * Structure time_data holds most timekeeping related variables
+ * to make them share as few cache lines as possible.
+ */
+struct time_data {
+	/*
+	 * We want to alias jiffies_64 to time_data._jiffies_64,
+	 * in vmlinux.lds.S files (jiffies_64 = time_data)
+	 * so keep _jiffies_64 at the beginning of time_data
+	 */
+	u64		_jiffies_64;
+	struct timespec _xtime;
+	struct timespec _wall_to_monotonic;
+	seqlock_t	lock;
+	int		calc_load_count;
+	unsigned int	_avenrun[3];
+};
+/*
+ * time_data has weak attribute because some arches (x86_64) want
+ * to have a read_only version in their vsyscall page
+ */
+extern struct time_data time_data __attribute__((weak));
+#define xtime             time_data._xtime
+#define wall_to_monotonic time_data._wall_to_monotonic
+#define xtime_lock        time_data.lock
+#define avenrun           time_data._avenrun
 
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
 {
-	return xtime.tv_sec;
+	return time_data._xtime.tv_sec;
 }
 
 struct timespec current_kernel_time(void);
 
 #define CURRENT_TIME		(current_kernel_time())
-#define CURRENT_TIME_SEC	((struct timespec) { xtime.tv_sec, 0 })
+#define CURRENT_TIME_SEC	((struct timespec) { time_data._xtime.tv_sec, 0 })
 
 extern void do_gettimeofday(struct timeval *tv);
 extern int do_settimeofday(struct timespec *tv);
--- linux-2.6.19/include/linux/jiffies.h	2006-12-12 00:32:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/jiffies.h	2006-12-13 18:11:30.000000000 +0100
@@ -78,7 +78,7 @@
  * without sampling the sequence number in xtime_lock.
  * get_jiffies_64() will do this for you as appropriate.
  */
-extern u64 __jiffy_data jiffies_64;
+extern u64 __jiffy_data jiffies_64 __attribute__((weak));
 extern unsigned long volatile __jiffy_data jiffies;
 
 #if (BITS_PER_LONG < 64)
--- linux-2.6.19/include/linux/sched.h	2006-12-08 12:10:45.000000000 +0100
+++ linux-2.6.19-ed/include/linux/sched.h	2006-12-13 15:54:29.000000000 +0100
@@ -104,7 +104,6 @@ struct futex_pi_state;
  *    the EXP_n values would be 1981, 2034 and 2043 if still using only
  *    11 bit fractions.
  */
-extern unsigned long avenrun[];		/* Load averages */
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -114,9 +113,12 @@ extern unsigned long avenrun[];		/* Load
 #define EXP_15		2037		/* 1/exp(5sec/15min) */
 
 #define CALC_LOAD(load,exp,n) \
-	load *= exp; \
-	load += n*(FIXED_1-exp); \
-	load >>= FSHIFT;
+	{ \
+	unsigned long temp = load; \
+	temp *= exp; \
+	temp += n*(FIXED_1-exp); \
+	load = temp >> FSHIFT; \
+	}
 
 extern unsigned long total_forks;
 extern int nr_threads;
--- linux-2.6.19/kernel/timer.c	2006-12-13 13:28:11.000000000 +0100
+++ linux-2.6.19-ed/kernel/timer.c	2006-12-13 20:58:53.000000000 +0100
@@ -41,7 +41,7 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
-u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
+__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
 
@@ -570,10 +570,12 @@ found:
  * however, we will ALWAYS keep the tv_nsec part positive so we can use
  * the usual normalization.
  */
-struct timespec xtime __attribute__ ((aligned (16)));
-struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
-
-EXPORT_SYMBOL(xtime);
+__attribute__((weak)) struct time_data time_data __cacheline_aligned = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock = __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+EXPORT_SYMBOL(time_data);
 
 
 /* XXX - all of this timekeeping code should be later moved to time.c */
@@ -995,9 +997,6 @@ static unsigned long count_active_tasks(
  *
  * Requires xtime_lock to access.
  */
-unsigned long avenrun[3];
-
-EXPORT_SYMBOL(avenrun);
 
 /*
  * calc_load - given tick count, update the avenrun load estimates.
@@ -1006,29 +1005,20 @@ EXPORT_SYMBOL(avenrun);
 static inline void calc_load(unsigned long ticks)
 {
 	unsigned long active_tasks; /* fixed-point */
-	static int count = LOAD_FREQ;
 
-	count -= ticks;
-	if (unlikely(count < 0)) {
+	time_data.calc_load_count -= ticks;
+	if (unlikely(time_data.calc_load_count < 0)) {
 		active_tasks = count_active_tasks();
 		do {
-			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
-			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
-			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
-			count += LOAD_FREQ;
-		} while (count < 0);
+			CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks);
+			CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks);
+			CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks);
+			time_data.calc_load_count += LOAD_FREQ;
+		} while (time_data.calc_load_count < 0);
 	}
 }
 
 /*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
- */
-__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
-
-EXPORT_SYMBOL(xtime_lock);
-
-/*
  * This function runs timers and the timer-tq in bottom half context.
  */
 static void run_timer_softirq(struct softirq_action *h)
@@ -1253,6 +1243,16 @@ asmlinkage long sys_gettid(void)
 }
 
 /**
+ * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[]
+ * @idx: index (0..2) in time_data._avenrun[]
+ */
+static inline unsigned long avenrun_to_loads(unsigned int idx)
+{
+	unsigned long ret = time_data._avenrun[idx];
+	return ret << (SI_LOAD_SHIFT - FSHIFT);
+}
+
+/**
  * sys_sysinfo - fill in sysinfo struct
  * @info: pointer to buffer to fill
  */ 
@@ -1285,9 +1285,9 @@ asmlinkage long sys_sysinfo(struct sysin
 		}
 		val.uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
 
-		val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT);
+		val.loads[0] = avenrun_to_loads(0);
+		val.loads[1] = avenrun_to_loads(1);
+		val.loads[2] = avenrun_to_loads(2);
 
 		val.procs = nr_threads;
 	} while (read_seqretry(&xtime_lock, seq));
--- linux-2.6.19/include/asm-x86_64/vsyscall.h	2006-12-13 13:31:50.000000000 +0100
+++ linux-2.6.19-ed/include/asm-x86_64/vsyscall.h	2006-12-13 21:20:18.000000000 +0100
@@ -17,11 +17,9 @@ enum vsyscall_num {
 
 #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16)))
 #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
 #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16)))
 #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16)))
-#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16)))
-#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16)))
+#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16)))
 
 #define VXTIME_TSC	1
 #define VXTIME_HPET	2
@@ -43,12 +41,10 @@ struct vxtime_data {
 #define hpet_writel(d,a)        writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a)
 
 /* vsyscall space (readonly) */
+extern struct time_data __time_data;
 extern struct vxtime_data __vxtime;
 extern int __vgetcpu_mode;
-extern struct timespec __xtime;
-extern volatile unsigned long __jiffies;
 extern struct timezone __sys_tz;
-extern seqlock_t __xtime_lock;
 
 /* kernel space (writeable) */
 extern struct vxtime_data vxtime;
--- linux-2.6.19/arch/x86_64/kernel/vsyscall.c	2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/vsyscall.c	2006-12-13 21:04:03.000000000 +0100
@@ -44,7 +44,6 @@
 #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
 int __vgetcpu_mode __section_vgetcpu_mode;
 
 #include <asm/unistd.h>
@@ -66,10 +65,10 @@ static __always_inline void do_vgettimeo
 	unsigned long sec, usec;
 
 	do {
-		sequence = read_seqbegin(&__xtime_lock);
+		sequence = read_seqbegin(&__time_data.lock);
 		
-		sec = __xtime.tv_sec;
-		usec = __xtime.tv_nsec / 1000;
+		sec = __time_data._xtime.tv_sec;
+		usec = __time_data._xtime.tv_nsec / 1000;
 
 		if (__vxtime.mode != VXTIME_HPET) {
 			t = get_cycles_sync();
@@ -83,7 +82,7 @@ static __always_inline void do_vgettimeo
 				   fix_to_virt(VSYSCALL_HPET) + 0xf0) -
 				  __vxtime.last) * __vxtime.quot) >> 32;
 		}
-	} while (read_seqretry(&__xtime_lock, sequence));
+	} while (read_seqretry(&__time_data.lock, sequence));
 
 	tv->tv_sec = sec + usec / 1000000;
 	tv->tv_usec = usec % 1000000;
@@ -131,8 +130,8 @@ time_t __vsyscall(1) vtime(time_t *t)
 	if (!__sysctl_vsyscall)
 		return time_syscall(t);
 	else if (t)
-		*t = __xtime.tv_sec;		
-	return __xtime.tv_sec;
+		*t = __time_data._xtime.tv_sec;		
+	return __time_data._xtime.tv_sec;
 }
 
 /* Fast way to get current CPU and node.
@@ -157,7 +156,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s
 	   We do this here because otherwise user space would do it on
 	   its own in a likely inferior way (no access to jiffies).
 	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = __jiffies)) {
+	if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) {
 		p = tcache->blob[1];
 	} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
--- linux-2.6.19/arch/x86_64/kernel/time.c	2006-12-08 16:22:15.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/time.c	2006-12-13 15:47:01.000000000 +0100
@@ -76,8 +76,12 @@ unsigned long long monotonic_base;
 
 struct vxtime_data __vxtime __section_vxtime;	/* for vsyscalls */
 
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
-struct timespec __xtime __section_xtime;
+struct time_data __time_data __section_time_data = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock =  __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+
 struct timezone __sys_tz __section_sys_tz;
 
 /*
--- linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S	2006-12-08 16:03:13.000000000 +0100
+++ linux-2.6.19-ed/arch/x86_64/kernel/vmlinux.lds.S	2006-12-15 13:13:49.000000000 +0100
@@ -12,7 +12,6 @@
 OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
 OUTPUT_ARCH(i386:x86-64)
 ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
 	data PT_LOAD FLAGS(7);	/* RWE */
@@ -94,8 +93,10 @@ SECTIONS
   __vsyscall_0 = VSYSCALL_VIRT_ADDR;
 
   . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) }
-  xtime_lock = VVIRT(.xtime_lock);
+  .time_data : AT(VLOAD(.time_data)) { *(.time_data) }
+  time_data = VVIRT(.time_data);
+  jiffies = time_data;
+  jiffies_64 = time_data;
 
   .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) }
   vxtime = VVIRT(.vxtime);
@@ -109,13 +110,6 @@ SECTIONS
   .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) }
   sysctl_vsyscall = VVIRT(.sysctl_vsyscall);
 
-  .xtime : AT(VLOAD(.xtime)) { *(.xtime) }
-  xtime = VVIRT(.xtime);
-
-  . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
-  jiffies = VVIRT(.jiffies);
-
   .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) }
   .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) }
   .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) }
--- linux-2.6.19/arch/i386/kernel/vmlinux.lds.S	2006-12-13 16:09:56.000000000 +0100
+++ linux-2.6.19-ed/arch/i386/kernel/vmlinux.lds.S	2006-12-13 20:35:45.000000000 +0100
@@ -12,7 +12,8 @@
 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
 OUTPUT_ARCH(i386)
 ENTRY(phys_startup_32)
-jiffies = jiffies_64;
+jiffies = time_data;
+jiffies_64 = time_data;
 
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
--- linux-2.6.19/arch/ia64/kernel/asm-offsets.c	2006-12-15 11:22:46.000000000 +0100
+++ linux-2.6.19-ed/arch/ia64/kernel/asm-offsets.c	2006-12-15 11:27:35.000000000 +0100
@@ -268,4 +268,7 @@ void foo(void)
 	DEFINE(IA64_TIME_SOURCE_MMIO64, TIME_SOURCE_MMIO64);
 	DEFINE(IA64_TIME_SOURCE_MMIO32, TIME_SOURCE_MMIO32);
 	DEFINE(IA64_TIMESPEC_TV_NSEC_OFFSET, offsetof (struct timespec, tv_nsec));
+	DEFINE(IA64_XTIME_OFFSET, offsetof (struct time_data, _xtime));
+	DEFINE(IA64_WALL_TO_MONOTONIC_OFFSET, offsetof (struct time_data, _wall_to_monotonic));
+	DEFINE(IA64_XTIME_LOCK_OFFSET, offsetof (struct time_data, lock));
 }
--- linux-2.6.19/arch/ia64/kernel/vmlinux.lds.S	2006-12-15 11:00:32.000000000 +0100
+++ linux-2.6.19-ed/arch/ia64/kernel/vmlinux.lds.S	2006-12-15 11:27:35.000000000 +0100
@@ -3,6 +3,7 @@
 #include <asm/ptrace.h>
 #include <asm/system.h>
 #include <asm/pgtable.h>
+#include <asm/asm-offsets.h>
 
 #define LOAD_OFFSET	(KERNEL_START - KERNEL_TR_PAGE_SIZE)
 #include <asm-generic/vmlinux.lds.h>
@@ -15,7 +16,11 @@
 OUTPUT_FORMAT("elf64-ia64-little")
 OUTPUT_ARCH(ia64)
 ENTRY(phys_start)
-jiffies = jiffies_64;
+jiffies = time_data;
+jiffies_64 = time_data;
+xtime = time_data + IA64_XTIME_OFFSET;
+wall_to_monotonic = time_data + IA64_WALL_TO_MONOTONIC_OFFSET;
+xtime_lock = time_data + IA64_XTIME_LOCK_OFFSET;
 PHDRS {
   code   PT_LOAD;
   percpu PT_LOAD;
--- linux-2.6.19/arch/s390/appldata/appldata_os.c	2006-12-13 16:27:59.000000000 +0100
+++ linux-2.6.19-ed/arch/s390/appldata/appldata_os.c	2006-12-13 16:27:59.000000000 +0100
@@ -68,7 +68,7 @@ struct appldata_os_data {
 
 	u32 nr_running;		/* number of runnable threads      */
 	u32 nr_threads;		/* number of threads               */
-	u32 avenrun[3];		/* average nr. of running processes during */
+	u32 _avenrun[3];		/* average nr. of running processes during */
 				/* the last 1, 5 and 15 minutes */
 
 	/* New in 2.6 */
@@ -98,11 +98,11 @@ static inline void appldata_print_debug(
 	P_DEBUG("nr_threads   = %u\n", os_data->nr_threads);
 	P_DEBUG("nr_running   = %u\n", os_data->nr_running);
 	P_DEBUG("nr_iowait    = %u\n", os_data->nr_iowait);
-	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0],
-		os_data->avenrun[1], os_data->avenrun[2]);
-	a0 = os_data->avenrun[0];
-	a1 = os_data->avenrun[1];
-	a2 = os_data->avenrun[2];
+	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0],
+		os_data->_avenrun[1], os_data->_avenrun[2]);
+	a0 = os_data->_avenrun[0];
+	a1 = os_data->_avenrun[1];
+	a2 = os_data->_avenrun[2];
 	P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n",
 		LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1),
 		LOAD_INT(a2), LOAD_FRAC(a2));
@@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
 	os_data->nr_iowait  = nr_iowait();
-	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
-	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
-	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
+	os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200);
+	os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200);
+	os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200);
 
 	j = 0;
 	for_each_online_cpu(i) {

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
  2006-12-13 21:26                                   ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet
  2006-12-15  5:24                                     ` Andrew Morton
@ 2006-12-15 16:21                                     ` Eric Dumazet
  1 sibling, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-15 16:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]

Just in case, I re-based the patch against 2.6.20-rc1-mm1


[PATCH] Introduce time_data, a new structure to hold jiffies, xtime, 
xtime_lock, wall_to_monotonic, calc_load_count and avenrun

This patch introduces a new structure called time_data, where some time 
keeping related variables are put together to share as few cache lines as 
possible. This should reduce timer interrupt latency and cache lines ping 
pongs in SMP machines.

struct time_data {
        u64                    _jiffies_64;
        struct timespec _xtime;
        struct timespec _wall_to_monotonic;
        seqlock_t           lock;
        int                     calc_load_count;
        unsigned int     _avenrun[3];
};

_jiffies_64 is a place holder so that arches can (optionally) aliases 
jiffies_64/jiffies in time_data. This patch does the thing for i386, x86_64 
and ia64 

avenrun, xtime, xtime_lock, wall_to_monotonic, are now temporary defined as 
macros to make this patch not too invasive, but we can in future patches 
gradually deletes these macros.

For ia64, I added aliases for xtime, xtime_lock and wall_to_monotonic because 
arch/ia64/kernel/fsys.S uses these symbols. This file might be changed to 
directly access time_data in the future (reducing registers use as well)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 arch/i386/kernel/vmlinux.lds.S   |    3 +
 arch/ia64/kernel/asm-offsets.c   |    3 +
 arch/ia64/kernel/vmlinux.lds.S   |    7 +++
 arch/s390/appldata/appldata_os.c |   18 ++++-----
 arch/x86_64/kernel/time.c        |    8 +++-
 arch/x86_64/kernel/vmlinux.lds.S |   14 ++-----
 arch/x86_64/kernel/vsyscall.c    |   15 +++-----
 include/asm-x86_64/vsyscall.h    |    8 +---
 include/linux/jiffies.h          |    2 -
 include/linux/sched.h            |   10 +++--
 include/linux/time.h             |   34 +++++++++++++++---
 kernel/timer.c                   |   54 ++++++++++++++---------------
 12 files changed, 102 insertions(+), 74 deletions(-)

[-- Attachment #2: time_data.patch --]
[-- Type: text/plain, Size: 15720 bytes --]

--- linux-2.6.20-rc1-mm1/include/linux/time.h	2006-12-15 15:46:16.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/include/linux/time.h	2006-12-15 17:22:15.000000000 +0100
@@ -88,22 +88,46 @@ static inline struct timespec timespec_s
 #define timespec_valid(ts) \
 	(((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC))
 
-extern struct timespec xtime;
-extern struct timespec wall_to_monotonic;
-extern seqlock_t xtime_lock __attribute__((weak));
+
+/*
+ * Structure time_data holds most timekeeping related variables
+ * to make them share as few cache lines as possible.
+ */
+struct time_data {
+	/*
+	 * We want to alias jiffies_64 to time_data._jiffies_64,
+	 * in vmlinux.lds.S files (jiffies_64 = time_data)
+	 * so keep _jiffies_64 at the beginning of time_data
+	 */
+	u64		_jiffies_64;
+	struct timespec _xtime;
+	struct timespec _wall_to_monotonic;
+	seqlock_t	lock;
+	int		calc_load_count;
+	unsigned int	_avenrun[3];
+};
+/*
+ * time_data has weak attribute because some arches (x86_64) want
+ * to have a read_only version in their vsyscall page
+ */
+extern struct time_data time_data __attribute__((weak));
+#define xtime             time_data._xtime
+#define wall_to_monotonic time_data._wall_to_monotonic
+#define xtime_lock        time_data.lock
+#define avenrun           time_data._avenrun
 
 extern unsigned long read_persistent_clock(void);
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
 {
-	return xtime.tv_sec;
+	return time_data._xtime.tv_sec;
 }
 
 struct timespec current_kernel_time(void);
 
 #define CURRENT_TIME		(current_kernel_time())
-#define CURRENT_TIME_SEC	((struct timespec) { xtime.tv_sec, 0 })
+#define CURRENT_TIME_SEC	((struct timespec) { time_data._xtime.tv_sec, 0 })
 
 extern void do_gettimeofday(struct timeval *tv);
 extern int do_settimeofday(struct timespec *tv);
--- linux-2.6.20-rc1-mm1/include/linux/jiffies.h	2006-12-15 15:46:16.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/include/linux/jiffies.h	2006-12-15 17:19:07.000000000 +0100
@@ -78,7 +78,7 @@
  * without sampling the sequence number in xtime_lock.
  * get_jiffies_64() will do this for you as appropriate.
  */
-extern u64 __jiffy_data jiffies_64;
+extern u64 __jiffy_data jiffies_64 __attribute__((weak));
 extern unsigned long volatile __jiffy_data jiffies;
 
 #if (BITS_PER_LONG < 64)
--- linux-2.6.20-rc1-mm1/include/linux/sched.h	2006-12-15 15:46:16.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/include/linux/sched.h	2006-12-15 17:19:07.000000000 +0100
@@ -106,7 +106,6 @@ struct bio;
  *    the EXP_n values would be 1981, 2034 and 2043 if still using only
  *    11 bit fractions.
  */
-extern unsigned long avenrun[];		/* Load averages */
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -116,9 +115,12 @@ extern unsigned long avenrun[];		/* Load
 #define EXP_15		2037		/* 1/exp(5sec/15min) */
 
 #define CALC_LOAD(load,exp,n) \
-	load *= exp; \
-	load += n*(FIXED_1-exp); \
-	load >>= FSHIFT;
+	{ \
+	unsigned long temp = load; \
+	temp *= exp; \
+	temp += n*(FIXED_1-exp); \
+	load = temp >> FSHIFT; \
+	}
 
 extern unsigned long total_forks;
 extern int nr_threads;
--- linux-2.6.20-rc1-mm1/kernel/timer.c	2006-12-15 15:46:16.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/kernel/timer.c	2006-12-15 17:19:07.000000000 +0100
@@ -42,7 +42,7 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
-u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
+__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
 
@@ -771,10 +771,12 @@ unsigned long next_timer_interrupt(void)
  * however, we will ALWAYS keep the tv_nsec part positive so we can use
  * the usual normalization.
  */
-struct timespec xtime __attribute__ ((aligned (16)));
-struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
-
-EXPORT_SYMBOL(xtime);
+__attribute__((weak)) struct time_data time_data __cacheline_aligned = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock = __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+EXPORT_SYMBOL(time_data);
 
 
 /* XXX - all of this timekeeping code should be later moved to time.c */
@@ -1238,9 +1240,6 @@ static unsigned long count_active_tasks(
  *
  * Requires xtime_lock to access.
  */
-unsigned long avenrun[3];
-
-EXPORT_SYMBOL(avenrun);
 
 /*
  * calc_load - given tick count, update the avenrun load estimates.
@@ -1249,29 +1248,20 @@ EXPORT_SYMBOL(avenrun);
 static inline void calc_load(unsigned long ticks)
 {
 	unsigned long active_tasks; /* fixed-point */
-	static int count = LOAD_FREQ;
 
-	count -= ticks;
-	if (unlikely(count < 0)) {
+	time_data.calc_load_count -= ticks;
+	if (unlikely(time_data.calc_load_count < 0)) {
 		active_tasks = count_active_tasks();
 		do {
-			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
-			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
-			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
-			count += LOAD_FREQ;
-		} while (count < 0);
+			CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks);
+			CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks);
+			CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks);
+			time_data.calc_load_count += LOAD_FREQ;
+		} while (time_data.calc_load_count < 0);
 	}
 }
 
 /*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
- */
-__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
-
-EXPORT_SYMBOL(xtime_lock);
-
-/*
  * This function runs timers and the timer-tq in bottom half context.
  */
 static void run_timer_softirq(struct softirq_action *h)
@@ -1497,6 +1487,16 @@ asmlinkage long sys_gettid(void)
 }
 
 /**
+ * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[]
+ * @idx: index (0..2) in time_data._avenrun[]
+ */
+static inline unsigned long avenrun_to_loads(unsigned int idx)
+{
+	unsigned long ret = time_data._avenrun[idx];
+	return ret << (SI_LOAD_SHIFT - FSHIFT);
+}
+
+/**
  * sys_sysinfo - fill in sysinfo struct
  * @info: pointer to buffer to fill
  */ 
@@ -1529,9 +1529,9 @@ asmlinkage long sys_sysinfo(struct sysin
 		}
 		val.uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
 
-		val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);
-		val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT);
+		val.loads[0] = avenrun_to_loads(0);
+		val.loads[1] = avenrun_to_loads(1);
+		val.loads[2] = avenrun_to_loads(2);
 
 		val.procs = nr_threads;
 	} while (read_seqretry(&xtime_lock, seq));
--- linux-2.6.20-rc1-mm1/include/asm-x86_64/vsyscall.h	2006-12-15 15:46:16.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/include/asm-x86_64/vsyscall.h	2006-12-15 17:19:07.000000000 +0100
@@ -18,11 +18,9 @@ enum vsyscall_num {
 
 #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16)))
 #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16)))
-#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16)))
 #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16)))
 #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16)))
-#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16)))
-#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16)))
+#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16)))
 
 #define VXTIME_TSC	1
 #define VXTIME_HPET	2
@@ -44,12 +42,10 @@ struct vxtime_data {
 #define hpet_writel(d,a)        writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a)
 
 /* vsyscall space (readonly) */
+extern struct time_data __time_data;
 extern struct vxtime_data __vxtime;
 extern int __vgetcpu_mode;
-extern struct timespec __xtime;
-extern volatile unsigned long __jiffies;
 extern struct timezone __sys_tz;
-extern seqlock_t __xtime_lock;
 
 /* kernel space (writeable) */
 extern struct vxtime_data vxtime;
--- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/vsyscall.c	2006-12-14 02:14:23.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/vsyscall.c	2006-12-15 17:19:07.000000000 +0100
@@ -45,7 +45,6 @@
 #define __syscall_clobber "r11","rcx","memory"
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
 int __vgetcpu_mode __section_vgetcpu_mode;
 
 #include <asm/unistd.h>
@@ -67,10 +66,10 @@ static __always_inline void do_vgettimeo
 	unsigned long sec, usec;
 
 	do {
-		sequence = read_seqbegin(&__xtime_lock);
+		sequence = read_seqbegin(&__time_data.lock);
 		
-		sec = __xtime.tv_sec;
-		usec = __xtime.tv_nsec / 1000;
+		sec = __time_data._xtime.tv_sec;
+		usec = __time_data._xtime.tv_nsec / 1000;
 
 		if (__vxtime.mode != VXTIME_HPET) {
 			t = get_cycles_sync();
@@ -84,7 +83,7 @@ static __always_inline void do_vgettimeo
 				   fix_to_virt(VSYSCALL_HPET) + 0xf0) -
 				  __vxtime.last) * __vxtime.quot) >> 32;
 		}
-	} while (read_seqretry(&__xtime_lock, sequence));
+	} while (read_seqretry(&__time_data.lock, sequence));
 
 	tv->tv_sec = sec + usec / 1000000;
 	tv->tv_usec = usec % 1000000;
@@ -132,8 +131,8 @@ time_t __vsyscall(1) vtime(time_t *t)
 	if (!__sysctl_vsyscall)
 		return time_syscall(t);
 	else if (t)
-		*t = __xtime.tv_sec;		
-	return __xtime.tv_sec;
+		*t = __time_data._xtime.tv_sec;		
+	return __time_data._xtime.tv_sec;
 }
 
 /* Fast way to get current CPU and node.
@@ -158,7 +157,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s
 	   We do this here because otherwise user space would do it on
 	   its own in a likely inferior way (no access to jiffies).
 	   If you don't like it pass NULL. */
-	if (tcache && tcache->blob[0] == (j = __jiffies)) {
+	if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) {
 		p = tcache->blob[1];
 	} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
 		/* Load per CPU data from RDTSCP */
--- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/time.c	2006-12-15 15:46:15.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/time.c	2006-12-15 17:19:07.000000000 +0100
@@ -76,8 +76,12 @@ unsigned long long monotonic_base;
 
 struct vxtime_data __vxtime __section_vxtime;	/* for vsyscalls */
 
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
-struct timespec __xtime __section_xtime;
+struct time_data __time_data __section_time_data = {
+	._jiffies_64 = INITIAL_JIFFIES,
+	.lock =  __SEQLOCK_UNLOCKED(time_data.lock),
+	.calc_load_count = LOAD_FREQ,
+};
+
 struct timezone __sys_tz __section_sys_tz;
 
 /*
--- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/vmlinux.lds.S	2006-12-15 15:46:15.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/vmlinux.lds.S	2006-12-15 17:22:15.000000000 +0100
@@ -12,7 +12,6 @@
 OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
 OUTPUT_ARCH(i386:x86-64)
 ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
 _proxy_pda = 0;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
@@ -88,8 +87,10 @@ SECTIONS
   __vsyscall_0 = VSYSCALL_VIRT_ADDR;
 
   . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) }
-  xtime_lock = VVIRT(.xtime_lock);
+  .time_data : AT(VLOAD(.time_data)) { *(.time_data) }
+  time_data = VVIRT(.time_data);
+  jiffies = time_data;
+  jiffies_64 = time_data;
 
   .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) }
   vxtime = VVIRT(.vxtime);
@@ -103,13 +104,6 @@ SECTIONS
   .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) }
   sysctl_vsyscall = VVIRT(.sysctl_vsyscall);
 
-  .xtime : AT(VLOAD(.xtime)) { *(.xtime) }
-  xtime = VVIRT(.xtime);
-
-  . = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
-  .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
-  jiffies = VVIRT(.jiffies);
-
   .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) }
   .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) }
   .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) }
--- linux-2.6.20-rc1-mm1/arch/i386/kernel/vmlinux.lds.S	2006-12-15 15:46:15.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/i386/kernel/vmlinux.lds.S	2006-12-15 17:22:15.000000000 +0100
@@ -25,7 +25,8 @@
 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
 OUTPUT_ARCH(i386)
 ENTRY(phys_startup_32)
-jiffies = jiffies_64;
+jiffies = time_data;
+jiffies_64 = time_data;
 _proxy_pda = 0;
 
 PHDRS {
--- linux-2.6.20-rc1-mm1/arch/ia64/kernel/asm-offsets.c	2006-12-14 02:14:23.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/ia64/kernel/asm-offsets.c	2006-12-15 17:19:07.000000000 +0100
@@ -268,4 +268,7 @@ void foo(void)
 	DEFINE(IA64_TIME_SOURCE_MMIO64, TIME_SOURCE_MMIO64);
 	DEFINE(IA64_TIME_SOURCE_MMIO32, TIME_SOURCE_MMIO32);
 	DEFINE(IA64_TIMESPEC_TV_NSEC_OFFSET, offsetof (struct timespec, tv_nsec));
+	DEFINE(IA64_XTIME_OFFSET, offsetof (struct time_data, _xtime));
+	DEFINE(IA64_WALL_TO_MONOTONIC_OFFSET, offsetof (struct time_data, _wall_to_monotonic));
+	DEFINE(IA64_XTIME_LOCK_OFFSET, offsetof (struct time_data, lock));
 }
--- linux-2.6.20-rc1-mm1/arch/ia64/kernel/vmlinux.lds.S	2006-12-15 15:46:15.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/ia64/kernel/vmlinux.lds.S	2006-12-15 17:19:07.000000000 +0100
@@ -4,6 +4,7 @@
 #include <asm/system.h>
 #include <asm/sparsemem.h>
 #include <asm/pgtable.h>
+#include <asm/asm-offsets.h>
 
 #define LOAD_OFFSET	(KERNEL_START - KERNEL_TR_PAGE_SIZE)
 #include <asm-generic/vmlinux.lds.h>
@@ -16,7 +17,11 @@
 OUTPUT_FORMAT("elf64-ia64-little")
 OUTPUT_ARCH(ia64)
 ENTRY(phys_start)
-jiffies = jiffies_64;
+jiffies = time_data;
+jiffies_64 = time_data;
+xtime = time_data + IA64_XTIME_OFFSET;
+wall_to_monotonic = time_data + IA64_WALL_TO_MONOTONIC_OFFSET;
+xtime_lock = time_data + IA64_XTIME_LOCK_OFFSET;
 PHDRS {
   code   PT_LOAD;
   percpu PT_LOAD;
--- linux-2.6.20-rc1-mm1/arch/s390/appldata/appldata_os.c	2006-12-14 02:14:23.000000000 +0100
+++ linux-2.6.20-rc1-mm1-ed/arch/s390/appldata/appldata_os.c	2006-12-15 17:19:07.000000000 +0100
@@ -68,7 +68,7 @@ struct appldata_os_data {
 
 	u32 nr_running;		/* number of runnable threads      */
 	u32 nr_threads;		/* number of threads               */
-	u32 avenrun[3];		/* average nr. of running processes during */
+	u32 _avenrun[3];		/* average nr. of running processes during */
 				/* the last 1, 5 and 15 minutes */
 
 	/* New in 2.6 */
@@ -98,11 +98,11 @@ static inline void appldata_print_debug(
 	P_DEBUG("nr_threads   = %u\n", os_data->nr_threads);
 	P_DEBUG("nr_running   = %u\n", os_data->nr_running);
 	P_DEBUG("nr_iowait    = %u\n", os_data->nr_iowait);
-	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0],
-		os_data->avenrun[1], os_data->avenrun[2]);
-	a0 = os_data->avenrun[0];
-	a1 = os_data->avenrun[1];
-	a2 = os_data->avenrun[2];
+	P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0],
+		os_data->_avenrun[1], os_data->_avenrun[2]);
+	a0 = os_data->_avenrun[0];
+	a1 = os_data->_avenrun[1];
+	a2 = os_data->_avenrun[2];
 	P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n",
 		LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1),
 		LOAD_INT(a2), LOAD_FRAC(a2));
@@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
 	os_data->nr_iowait  = nr_iowait();
-	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
-	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
-	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
+	os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200);
+	os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200);
+	os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200);
 
 	j = 0;
 	for_each_online_cpu(i) {

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2006-12-15 16:21 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-03  5:50 [PATCH] Export current_is_keventd() for libphy Ben Collins
2006-12-03  9:16 ` Andrew Morton
2006-12-04 19:17   ` Steve Fox
2006-12-05 18:05     ` Maciej W. Rozycki
2006-12-05 17:48   ` Maciej W. Rozycki
2006-12-05 18:07     ` Linus Torvalds
2006-12-05 19:31       ` Andrew Morton
2006-12-05 18:57     ` Andy Fleming
2006-12-06 12:31       ` Maciej W. Rozycki
2006-12-05 20:39     ` Andrew Morton
2006-12-05 20:59       ` Andy Fleming
2006-12-05 21:26         ` Andrew Morton
2006-12-05 21:37           ` Roland Dreier
2006-12-05 21:57             ` Andrew Morton
2006-12-05 23:49               ` Roland Dreier
2006-12-05 23:52               ` Roland Dreier
2006-12-06 15:25               ` Maciej W. Rozycki
2006-12-06 15:57                 ` Andrew Morton
2006-12-06 17:17                   ` Linus Torvalds
2006-12-07  1:21                     ` Linus Torvalds
2006-12-07  6:42                       ` Andrew Morton
2006-12-07  7:49                         ` Andrew Morton
2006-12-07 10:29                         ` David Howells
2006-12-07 10:42                           ` Andrew Morton
2006-12-07 17:05                             ` Jeff Garzik
2006-12-07 17:57                               ` Andrew Morton
2006-12-07 18:17                                 ` Andrew Morton
2006-12-08 16:52                                 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet
2006-12-09  5:46                                   ` Andrew Morton
2006-12-09  6:07                                     ` Randy Dunlap
2006-12-11 20:44                                     ` Eric Dumazet
2006-12-11 22:00                                       ` Andrew Morton
2006-12-13 21:26                                   ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet
2006-12-15  5:24                                     ` Andrew Morton
2006-12-15 11:21                                       ` Eric Dumazet
2006-12-15 16:21                                     ` Eric Dumazet
2006-12-07 18:08                               ` [PATCH] Export current_is_keventd() for libphy Maciej W. Rozycki
2006-12-07 18:59                               ` Andy Fleming
2006-12-07 16:49                         ` Linus Torvalds
2006-12-07 17:52                           ` Andrew Morton
2006-12-07 18:01                             ` Linus Torvalds
2006-12-07 18:16                               ` Andrew Morton
2006-12-07 18:27                                 ` Linus Torvalds
2006-12-07 15:28                       ` Maciej W. Rozycki
2006-12-06 17:43                   ` David Howells
2006-12-06 17:50                     ` Jeff Garzik
2006-12-06 18:07                       ` Linus Torvalds
2006-12-06 17:53                     ` Linus Torvalds
2006-12-06 17:58                       ` Linus Torvalds
2006-12-06 18:33                         ` Linus Torvalds
2006-12-06 18:37                           ` Linus Torvalds
2006-12-06 18:43                         ` David Howells
2006-12-06 19:02                           ` Linus Torvalds
2006-12-06 18:02                     ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).