linux-kernel.vger.kernel.org archive mirror
* Re: [RFC] Semaphores used for daemon wakeup
       [not found] <3A42380B.6E9291D1@sgi.com>
@ 2000-12-21 19:30 ` Paul Cassella
  2000-12-21 22:19   ` Tim Wright
                     ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Paul Cassella @ 2000-12-21 19:30 UTC (permalink / raw)
  To: linux-kernel

The mechanism being developed here seems a lot like synchronization
variables (aka condition variables), which are a part of the "monitor"
synchronization construct.  There is a simple implementation of them in
the xfs patch.  I've been working on a more general version in order to
aid porting some other code; it is appended to this post.

I had been holding off on posting about it since I didn't have any code
that used it ready, and probably wouldn't until 2.5 anyway, but due to
this thread, I think bringing it up now might be helpful.


Daniel Phillips wrote:
> Tim Wright wrote:


> > p_sema_v_lock(&sema, priority, &lock);  /* atomically release the lock AND */
> >                         /* go to sleep on the semaphore */

This in particular looks like sv_wait(), which atomically releases a
lock and goes to sleep.

The style is somewhat different, as a sync variable (sv) is not really a
lock, but is still something that a process can block on.  A process goes
to sleep by calling sv_wait(sv), and is woken up by another process
calling sv_signal(sv) (wake one) or sv_broadcast(sv) (wake all).  If there
is no process sleeping in sv_wait, signals and broadcasts have no effect;
they do not affect any process which subsequently calls sv_wait(). 

Each sync variable is associated with another lock, which provides mutual
exclusion guarantees.  This lock must be held to call sv_wait(), and is
dropped atomically with the process going to sleep.  This lock must also
be held to call sv_signal() or sv_broadcast().  Currently, this lock can
be a spinlock or a semaphore; other types of locks could be added if
necessary. 

The sync variable version of the dmabuf code snippet (assuming the
dmabuf_mutex is never acquired from an interrupt) would look like this: 


    dmabuf_init(...);
    {
            ...
            spin_lock_init(&dmabuf_spin);
            sv_init(&dmabuf_sv, &dmabuf_spin, SV_MON_SPIN);
            ...
    }

    dmabuf_alloc(...)
    {

            ...
            while (1) {
                    spin_lock(&dmabuf_spin);
                    attempt to grab a free buffer;
                    if (success){
                            spin_unlock(&dmabuf_spin);
                            return;
                    } else {
                            sv_wait(&dmabuf_sv, 0, 0);
                    }
            }
    }

    dmabuf_free(...)
    {
            ...
            spin_lock(&dmabuf_spin);
            free up buffer;
            sv_broadcast(&dmabuf_sv);
            spin_unlock(&dmabuf_spin);
    }

If dmabuf_free() could be called from an interrupt, this would be modified
by passing the SV_INTS flag to sv_init(), using spin_lock_irq() in
dmabuf_alloc, and spin_lock_irqsave/restore in dmabuf_free().

> > As you can see, the spinlocks ensure no races, and the key is the atomicity
> > of p_sema_v_lock(). No-one can race in and sleep on dmabuf_wait, because
> > they have to hold dmabuf_mutex to do so. Exactly the same mechanism would
> > work for the bdflush problem.

The same protections are in place here, as the two methods are basically
the same. 


> described.  The main difference between this situation and bdflush is
> that dmabuf_free isn't really waiting on dmabuf_alloc to fulfill a
> condition (other than to get out of its exclusion region) while bdflush
> can have n waiters.

This could be done with two sv's.  After all, there are two conditions:
"someone needs bdflush to run", and "bdflush is done". 


> int atomic_read_and_clear(atomic_t *p)
> {
>         int n = atomic_read(p);
>         atomic_sub(p, n);
>         return n;
> }

I don't think this will work; consider two callers doing the atomic_read() 
at the same time, or someone else doing an atomic_dec() after the
atomic_read(). 


> the more involved wake_up_process().  What's clear is, they are all
> plenty fast enough for this application, and what I'm really trying for
> is readability.

I think sv's are pretty readable and clear, but I'd like to find out what
other people think.

If anyone finds this useful or finds any bugs, or has any questions or
suggestions, please let me know.  I'll be reading the list, but I'm going
on vacation tomorrow, so I'd appreciate a Cc: so I have a better chance of
answering before then.  Thanks.


-- 
Paul Cassella
pwc@sgi.com


And now, the code.

include/linux/sv.h:

/*
This is the version I'm using with a test8-pre1 kernel, so it uses the
old TASK_EXCLUSIVE semantics; it should be trivial to make it work with
new kernels.  I haven't yet done so, since we're going to be using the
test8-pre1 kernel for a few more weeks yet. 

In the interests of brevity, I've taken out most of the debugging code,
some typecasting, and some other things like the wrapper functions for
up() and spin_unlock(), which are needed because we need a function
pointer, and these may be macros or inline functions. 

This was originally based on the version Steve Lord did for the xfs port. 
This version had no problems when put into the xfs tree and run through a
stress test.  I don't remember what kernel version that tree was, but it
was before the TASK_EXCLUSIVE change.
*/

typedef void sv_mon_lock_t;
typedef void (*sv_mon_unlock_func_t)(sv_mon_lock_t *lock);

/* sv_flags values: */

#define SV_ORDER_FIFO        0x001
#define SV_ORDER_FILO        0x002

/* If at some point one order becomes preferable to others, we can
   use that order when sv_init() isn't given an order. */
#define SV_ORDER_DEFAULT     SV_ORDER_FIFO

#define SV_ORDER_MASK        0x00f


#define SV_MON_SEMA          0x010
#define SV_MON_SPIN          0x020

#define SV_MON_MASK          0x0f0


/*
   Set if the monitor lock can be acquired from interrupts.  Note that this
   is a superset of the cases in which the sv itself can be touched from
   interrupts.

   This is only valid when the monitor lock is a spinlock.

   If this is used, sv_wait(), sv_signal(), and sv_broadcast() must all be
   called with interrupts disabled, which had to have happened anyway to
   have acquired the monitor spinlock. 
 */
#define SV_INTS              0x100


/* sv_wait_flag values: */

/* Allow sv_wait() to be interrupted by a signal */
#define SV_WAIT_SIG          0x001

typedef struct sv_s {
	wait_queue_head_t sv_waiters;
        /* Lock held to ensure exclusive access to monitor. */
	sv_mon_lock_t *sv_mon_lock;
	sv_mon_unlock_func_t sv_mon_unlock_func;
	spinlock_t sv_lock;  /* Spinlock protecting the sv itself. */
	int sv_flags;
} sv_t;

#define DECLARE_SYNC_VARIABLE(sv, l, f) sv_t sv = sv_init(&sv, l, f)




kernel/sv.c:


static inline void sv_lock(sv_t *sv) {
	spin_lock(&sv->sv_lock);
}

static inline void sv_unlock(sv_t *sv) {
	spin_unlock(&sv->sv_lock);
}


/* up() and spin_unlock() may be inline or macros; the real version
   uses wrappers for them. */

static inline void sv_set_mon_type(sv_t *sv, int type) {
	switch (type) {
	case SV_MON_SPIN:
		sv->sv_mon_unlock_func = spin_unlock;
		break;
	case SV_MON_SEMA:
		sv->sv_mon_unlock_func = up;
		if(sv->sv_flags & SV_INTS) {
			printk(KERN_ERR "sv_set_mon_type: The monitor lock "
			       "cannot be shared with interrupts if it is a "
			       "semaphore!\n");
			BUG();
		}
		break;

	default:
		BUG();
	}
	sv->sv_flags |= type;
}

static inline void sv_set_ord(sv_t *sv, int ord) {
	if (!ord)
		ord = SV_ORDER_DEFAULT;

	if (ord != SV_ORDER_FIFO && ord != SV_ORDER_FILO) {
		BUG();
	}

	sv->sv_flags |= ord;
}

/*
 * sv            the sync variable to initialize
 * monitor_lock  the lock enforcing exclusive running in the monitor
 * flags         one of
 *   SV_MON_SEMA monitor_lock is a semaphore
 *   SV_MON_SPIN monitor_lock is a spinlock
 * and a bitwise or of some subset of
 *   SV_INTS - the monitor lock can be acquired from interrupts (and
 *             hence, whenever we hold it, interrupts are disabled (or
 *             we're in an interrupt.))  This is only valid when the sv
 *             is of type SV_MON_SPIN.
 */
void sv_init(sv_t *sv, sv_mon_lock_t *lock, int flags) 
{
	int ord = flags & SV_ORDER_MASK;
	int type = flags & SV_MON_MASK;

	/* Copy all non-order, non-type flags */
	sv->sv_flags = flags & ~(SV_ORDER_MASK | SV_MON_MASK);

	sv_set_ord(sv, ord);
	sv_set_mon_type(sv, type);

	sv->sv_mon_lock = lock;

	spin_lock_init(&sv->sv_lock);
	init_waitqueue_head(&sv->sv_waiters);
}

/*
 * The associated lock must be locked on entry.  It is unlocked on return.
 *
 * Set SV_WAIT_SIG in sv_wait_flags to let the sv_wait be interrupted
 * by signals.
 *
 * timeout is how long to wait before giving up, or 0 to wait
 * indefinitely.  It is given in jiffies, and is relative.
 *
 * Return values:
 *
 * n < 0 : interrupted,  -n jiffies remaining on timeout, or -1 if timeout == 0
 * n = 0 : timeout expired
 * n > 0 : sv_signal()'d, n jiffies remaining on timeout, or 1 if timeout == 0
 */
signed long sv_wait(sv_t *sv, int sv_wait_flags, unsigned long timeout) 
{
	DECLARE_WAITQUEUE( wait, current );
	signed long ret = 0;

	sv_lock(sv);

	sv->sv_mon_unlock_func(sv->sv_mon_lock);

	/* Add ourselves to the wait queue and set the state before
           releasing the sv_lock so as to avoid racing with wake_up in
           sv_signal() and sv_broadcast(). */
	switch(sv->sv_flags & SV_ORDER_MASK) {
	case SV_ORDER_FIFO:
		add_wait_queue_exclusive(&sv->sv_waiters, &wait);
		break;
	case SV_ORDER_FILO:
		add_wait_queue(&sv->sv_waiters, &wait);
		break;
	default:
		BUG();
	}

	if(sv_wait_flags & SV_WAIT_SIG)
		set_current_state(TASK_EXCLUSIVE | TASK_INTERRUPTIBLE  );
	else
		set_current_state(TASK_EXCLUSIVE | TASK_UNINTERRUPTIBLE);

	if(sv->sv_flags & SV_INTS)
		spin_unlock_irq(&sv->sv_lock);
	else
		spin_unlock(&sv->sv_lock);

	if (timeout)
		ret = schedule_timeout(timeout);
	else
		schedule();

	if(current->state != TASK_RUNNING) /* XXX Is this possible? */ {
		printk(KERN_ERR "sv_wait: state not TASK_RUNNING after "
		       "schedule().\n");
		set_current_state(TASK_RUNNING);
	}

	remove_wait_queue(&sv->sv_waiters, &wait);

	/* Return cases:
	   - woken by a sv_signal/sv_broadcast
	   - woken by a signal
	   - woken by timeout expiring
	*/

	/* FIXME: This isn't really accurate; we may have been woken
           before the signal anyway.... */
	if(signal_pending(current))
		return timeout ? -ret : -1;
	return timeout ? ret : 1;
}


void sv_signal(sv_t *sv) 
{
	/* If interrupts can acquire this lock, they can also acquire the
	   sv_mon_lock, which must be held to have called this, so
	   interrupts must be disabled already.  If interrupts cannot
	   contend for this lock, we don't have to disable them. */

	sv_lock(sv);
	wake_up(&sv->sv_waiters);
	sv_unlock(sv);
}

void sv_broadcast(sv_t *sv) 
{
	sv_lock(sv);
	wake_up_all(&sv->sv_waiters);
	sv_unlock(sv);
}


/*
 * These files are subject to the terms and conditions of the GNU General Public
 * License.
 *
 * Copyright (C) 2000 Silicon Graphics, Inc.  All rights reserved.
 *
 * Paul Cassella <pwc@sgi.com>
 */

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-21 19:30 ` [RFC] Semaphores used for daemon wakeup Paul Cassella
@ 2000-12-21 22:19   ` Tim Wright
  2000-12-22  1:12   ` Daniel Phillips
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Tim Wright @ 2000-12-21 22:19 UTC (permalink / raw)
  To: Paul Cassella; +Cc: linux-kernel

Looks good.
I'd like to play with your patch, but certainly from a first glance, it would
seem to be sufficiently powerful, and significantly cleaner/clearer (at least
to me :-) than the current mechanism involving the wait queue games.

Regards,

Tim

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-21 19:30 ` [RFC] Semaphores used for daemon wakeup Paul Cassella
  2000-12-21 22:19   ` Tim Wright
@ 2000-12-22  1:12   ` Daniel Phillips
  2000-12-22  1:50   ` Daniel Phillips
  2000-12-22 17:32   ` Brian Pomerantz
  3 siblings, 0 replies; 17+ messages in thread
From: Daniel Phillips @ 2000-12-22  1:12 UTC (permalink / raw)
  To: Paul Cassella, linux-kernel

Paul Cassella wrote:
> > int atomic_read_and_clear(atomic_t *p)
> > {
> >         int n = atomic_read(p);
> >         atomic_sub(p, n);
> >         return n;
> > }
> 
> I don't think this will work; consider two callers doing the atomic_read()
> at the same time, or someone else doing an atomic_dec() after the
> atomic_read().

Oh yes, mea culpa, this is a terrible primitive, yet it works for this
application.  1) We don't have two callers 2) We only have atomic_inc
from the other processes, and it's ok for the atomic_inc to occur after
the atomic_read because that means the atomic_inc'er will then proceed
to up() the atomic_sub'ers semaphore, and it won't block.

I much preferred my original waiters = xchg(&sem.count, 0), but as noted
it doesn't work with sparc.  A satisfying approach would be to create
the new primitive up_down, which simplifies everything dramatically.

--
Daniel

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-21 19:30 ` [RFC] Semaphores used for daemon wakeup Paul Cassella
  2000-12-21 22:19   ` Tim Wright
  2000-12-22  1:12   ` Daniel Phillips
@ 2000-12-22  1:50   ` Daniel Phillips
  2000-12-22  4:26     ` Paul Cassella
  2000-12-22 17:32   ` Brian Pomerantz
  3 siblings, 1 reply; 17+ messages in thread
From: Daniel Phillips @ 2000-12-22  1:50 UTC (permalink / raw)
  To: Paul Cassella, Tim Wright, linux-kernel

Paul Cassella wrote:
> The sync variable version of the dmabuf code snippet (assuming the
> dmabuf_mutex is never acquired from an interrupt) would look like this:
> 
>     dmabuf_init(...);
>     {
>             ...
>             spin_lock_init(&dmabuf_spin);
>             sv_init(&dmabuf_sv, &dmabuf_spin, SV_MON_SPIN);
>             ...
>     }
> 
>     dmabuf_alloc(...)
>     {
> 
>             ...
>             while (1) {
>                     spin_lock(&dmabuf_spin);
>                     attempt to grab a free buffer;
>                     if (success){
>                             spin_unlock(&dmabuf_spin);
>                             return;
>                     } else {
>                             sv_wait(&dmabuf_sv);
>                     }
>             }
>     }
> 
>     dmabuf_free(...)
>     {
>             ...
>             spin_lock(&dmabuf_spin);
>             free up buffer;
>             sv_broadcast(&dmabuf_sv);
>             spin_unlock(&dmabuf_spin);
>     }
> 

But isn't this actually a simple situation?  How about:

    dmabuf_alloc(...)
    {
            ...
            while (1) {
                    spin_lock(&dmabuf_lock);
                    attempt to grab a free buffer;
                    spin_unlock(&dmabuf_lock);
                    if (success)
                           return;
                    down(&dmabuf_wait);
            }
    }

    dmabuf_free(...)
    {
            ...
            spin_lock(&dmabuf_lock);
            free up buffer;
            spin_unlock(&dmabuf_lock);
            up(&dmabuf_wait);
    }

--
Daniel

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-22  1:50   ` Daniel Phillips
@ 2000-12-22  4:26     ` Paul Cassella
  2000-12-22 11:46       ` Daniel Phillips
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Cassella @ 2000-12-22  4:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Fri, 22 Dec 2000, Daniel Phillips wrote:

> But isn't this actually a simple situation?  How about:

I had only adapted that example because it had already been posted showing
one way to do it, and so provided something to compare the sv approach to.

>     dmabuf_alloc(...)
>     {
>             while (1) {
>                     spin_lock(&dmabuf_lock);
>                     attempt to grab a free buffer;
>                     spin_unlock(&dmabuf_lock);
>                     if (success)
>                            return;
>                     down(&dmabuf_wait);
>             }
>     }

>     dmabuf_free(...)
>     {
>             spin_lock(&dmabuf_lock);
>             free up buffer;
>             spin_unlock(&dmabuf_lock);
>             up(&dmabuf_wait);
>     }

This does things a little differently than the way the original did it.  
I thought the original implied that dmabuf_free() might free up multiple
buffers.  There's no indication in the comments that this is the case, but
the original, by using vall_sema(), wakes up all dmabuf_alloc()'s that had
gone to sleep.

The example wasn't meant to be an ideal use of sv's, but merely as an
example of how they could be used to achieve the same behavior as the code
that was posted.

--
Paul Cassella
pwc@sgi.com



* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-22  4:26     ` Paul Cassella
@ 2000-12-22 11:46       ` Daniel Phillips
  2000-12-22 15:33         ` Tim Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Phillips @ 2000-12-22 11:46 UTC (permalink / raw)
  To: Paul Cassella, linux-kernel

Paul Cassella wrote:
> >     dmabuf_alloc(...)
> >     {
> >             while (1) {
> >                     spin_lock(&dmabuf_lock);
> >                     attempt to grab a free buffer;
> >                     spin_unlock(&dmabuf_lock);
> >                     if (success)
> >                            return;
> >                     down(&dmabuf_wait);
> >             }
> >     }
> 
> >     dmabuf_free(...)
> >     {
> >             spin_lock(&dmabuf_lock);
> >             free up buffer;
> >             spin_unlock(&dmabuf_lock);
> >             up(&dmabuf_wait);
> >     }
> 
> This does things a little differently than the way the original did it.
> I thought the original implied that dmabuf_free() might free up multiple
> buffers.  There's no indication in the comments that this is the case, but
> the original, by using vall_sema(), wakes up all dmabuf_alloc()'s that had
> gone to sleep.

Granted, it's just an example, but it doesn't make sense to wake up more
dmabuf_alloc waiters than you actually have buffers for.  You do one
up() per freed buffer, and the semaphore's wait queue better be fifo or
have some other way of ensuring a task doesn't languish there.  (I don't
know, does it?)
 
> The example wasn't meant to be an ideal use of sv's, but merely as an
> example of how they could be used to achieve the same behavior as the code
> that was posted.

Yes, and a third example of the 'unlock/wakeup_and_sleep' kind of
primitive - there seems to be a pattern.  I should at least take a look
and see if up_down is easy or hard to implement.

--
Daniel

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-22 11:46       ` Daniel Phillips
@ 2000-12-22 15:33         ` Tim Wright
  2000-12-22 17:25           ` Daniel Phillips
  0 siblings, 1 reply; 17+ messages in thread
From: Tim Wright @ 2000-12-22 15:33 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Paul Cassella, linux-kernel

On Fri, Dec 22, 2000 at 12:46:28PM +0100, Daniel Phillips wrote:
[...]
> Granted, it's just an example, but it doesn't make sense to wake up more
> dmabuf_alloc waiters than you actually have buffers for.  You do one
> up() per freed buffer, and the semaphore's wait queue better be fifo or
> have some other way of ensuring a task doesn't languish there.  (I don't
> know, does it?)
>  

Sorry, I could have picked a clearer example. You are correct - it doesn't
make sense to wake up more waiters than you can satisfy if you know this at
the time. As Paul mentioned, the idea here is that we may have freed
multiple buffers, and so we wake all the consumers and let them fight it out.
At least in DYNIX/ptx, semaphores are exclusively FIFO. This gives wake-one
semantics by default, and prevents starvation.

> > The example wasn't meant to be an ideal use of sv's, but merely as an
> > example of how they could be used to achieve the same behavior as the code
> > that was posted.
> 
> Yes, and a third example of the 'unlock/wakeup_and_sleep' kind of
> primitive - there seems to be a pattern.  I should at least take a look
> and see if up_down is easy or hard to implement.
> 

One general rule to remember is that if you need to take multiple locks, then
you need to always take them in the same order to avoid deadlocks. One
simple, if crude way of doing this if they're not two fixed locks is to
always take the lock with the lower address first. I don't know if this
will help in this case, but it looks like you probably have to play with
the rw locks for the wait queues to make this atomic.

Tim

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-22 15:33         ` Tim Wright
@ 2000-12-22 17:25           ` Daniel Phillips
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Phillips @ 2000-12-22 17:25 UTC (permalink / raw)
  To: Tim Wright, linux-kernel

Tim Wright wrote:
> 
> On Fri, Dec 22, 2000 at 12:46:28PM +0100, Daniel Phillips wrote:
> [...]
> > Granted, it's just an example, but it doesn't make sense to wake up more
> > dmabuf_alloc waiters than you actually have buffers for.  You do one
> > up() per freed buffer, and the semaphore's wait queue better be fifo or
> > have some other way of ensuring a task doesn't languish there.  (I don't
> > know, does it?)
>
> ...You are correct - it doesn't
> make sense to wake up more waiters than you can satisfy if you know this at
> the time. As Paul mentioned, the idea here is that we may have freed
> multiple buffers, and so we wake all the consumers and let them fight it out.

With my model, if you were freeing several buffers you'd do several ups,
and especially in the common case where nobody is waiting that's
efficient enough that it would be hard to justify optimizing further
with say, an up_many(sem, n).

> At least in DYNIX/ptx, semaphores are exclusively FIFO. This gives wake-one
> semantics by default, and prevents starvation.

Then it's most probably that way in Linux too, but __wake_up_common will
take a little time to digest.

> One general rule to remember is that if you need to take multiple locks, then
> you need to always take them in the same order to avoid deadlocks.

Yes, I picked that one up while working on tailmerging and had to lock
multiple inodes.  Then I changed the design so it only takes one lock at
a time - what an improvement.  The moral of this story is that the best
lock is the one you never have to take.

> One simple, if crude way of doing this if they're not two fixed locks is to
> always take the lock with the lower address first. I don't know if this
> will help in this case, but it looks like you probably have to play with
> the rw locks for the wait queues to make this atomic.

I took a look at the code... the implementation details of these
primitives are a whole nuther world.  There's probably a simple
implementation of up_down but I'm far from being able to see it at this
point.

--
Daniel

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-21 19:30 ` [RFC] Semaphores used for daemon wakeup Paul Cassella
                     ` (2 preceding siblings ...)
  2000-12-22  1:50   ` Daniel Phillips
@ 2000-12-22 17:32   ` Brian Pomerantz
  3 siblings, 0 replies; 17+ messages in thread
From: Brian Pomerantz @ 2000-12-22 17:32 UTC (permalink / raw)
  To: Paul Cassella; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]

On Thu, Dec 21, 2000 at 01:30:03PM -0600, Paul Cassella wrote:
> The mechanism being developed here seems a lot like synchronization
> variables (aka condition variables), which are a part of the "monitor"
> synchronization construct.  There is a simple implementation of them in
> the xfs patch.  I've been working on a more general version in order to
> aid porting some other stuff, which I have appended to this post.
> 
> I had been holding off on posting about it since I didn't have any code
> that used it ready, and probably wouldn't until 2.5 anyway, but due to
> this thread, I think bringing it up now might be helpful.
> 

We have a similar set of locks that were developed while porting the
Quadrics device drivers over to Linux from Tru64.  These are pretty
close to the way pthreads mutexes and condition variables work in user
space and are similar to primitives that exist in Tru64 kernel land
(thus the need to develop them to make the port easier).  They allow
for use of semaphores or spinlocks depending on whether you are going
to sleep while holding the mutex.  The only caveat with it is that it
requires that wake_up_process() be exported in kernel/ksyms.c in order
to use it in modules.  We're in the process of making a Linux web site
here at the lab that has papers as well as patches that we have made
to help Linux along in the high performance computing area.  Until
that is up, here are the two files that implement mutexes and
condition variables.


BAPper

[-- Attachment #2: mutex.h --]
[-- Type: text/plain, Size: 6259 bytes --]

/*
 *    Copyright (C) 2000  Regents of the University of California
 *
 *    This program is free software; you can redistribute it and/or modify
 *    it under the terms of the GNU General Public License as published by
 *    the Free Software Foundation; either version 2 of the License, or
 *    (at your option) any later version.
 *
 *    This program is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *    GNU General Public License for more details.
 *
 *    You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software
 *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *    Mutex functions.  
 * 	MUTEX_SPIN      spinlock
 * 	MUTEX_SPIN_INTR spinlock with interrupts disabled
 * 	MUTEX_BLOCK     semaphore
 *
 *    Define -DDEBUG_MUTEX to get debugging spinlocks on alpha
 *    (will report file/line no. of stuck spinlocks)
 *
 *    Jim Garlick <garlick@llnl.gov>
 */

#if	!defined(_LINUX_MUTEX_H)
#define	_LINUX_MUTEX_H
#if	defined(__KERNEL__)

#if	defined(DEBUG_MUTEX) && defined(__alpha__) && !defined(DEBUG_SPINLOCK)
#define	DEBUG_SPINLOCK
#endif

#include <asm/smp.h>
#include <asm/spinlock.h>
#include <asm/semaphore.h>
#include <linux/debug.h>
#include <linux/interrupt.h>
#include <linux/version.h>

#if	!defined(DEBUG_SPINLOCK)
#define	debug_spin_lock(lock, file, line)	spin_lock(lock)
#define	debug_spin_trylock(lock, file, line)	spin_trylock(lock)
#endif

#define	mutex_trylock(l) debug_mutex_trylock(l, __BASE_FILE__, __LINE__)
#define	mutex_lock(l)	debug_mutex_lock(l, __BASE_FILE__, __LINE__)

#define PID_NONE	(PID_MAX + 1)
#define PID_INTR	(PID_MAX + 2 + smp_processor_id())	
#define MY_PID 		(in_interrupt() ? PID_INTR : current->pid)
#define MY_CPU		smp_processor_id()

#define MUTEX_MAX_NAME	16

typedef enum { MUTEX_SPIN, MUTEX_SPIN_INTR, MUTEX_BLOCK } lock_type_t;

typedef struct {
	lock_type_t	type;
	union {
		struct {
			spinlock_t		lock;
			unsigned long		flags;
		} spin;
		struct {
			struct semaphore	lock;
		} blocking;
	} mutex_u;
	pid_t		holder;
#if	defined(DEBUG_MUTEX)
	char		name[MUTEX_MAX_NAME];
#endif
} mutex_t;


/* binary semaphores */
#if	LINUX_VERSION_CODE < KERNEL_VERSION(2,3,0)
#define BS_INIT(s)      (s) = MUTEX
#else
#define BS_INIT(s)      init_MUTEX(&(s))
#endif
#define BS_TRY(s)       (down_trylock(&(s)) == 0)
#define BS_LOCK(s)      down(&(s))
#define BS_UNLOCK(s)    up(&(s))


extern __inline__ void
mutex_init(mutex_t *l, char *name, lock_type_t type)
{
	l->type = type;
	switch(l->type) {
		case MUTEX_BLOCK:
			BS_INIT(l->mutex_u.blocking.lock);
			break;
		case MUTEX_SPIN:
		case MUTEX_SPIN_INTR:
			l->mutex_u.spin.lock = SPIN_LOCK_UNLOCKED;
			break;
	}
	l->holder = PID_NONE;
#if	defined(DEBUG_MUTEX)
	strncpy(l->name, name, MUTEX_MAX_NAME);
#endif
}

extern __inline__ void
mutex_destroy(mutex_t *l)
{
	ASSERT(l->holder == PID_NONE);
}

/* 
 * Return nonzero if lock is held by this thread.  
 */
extern __inline__ int 
mutex_held(mutex_t *l)
{
	return (l->holder == MY_PID);
}

/*
 * really we want to be using spin_lock_irqsave/spin_unlock_irqrestore -
 * however there's no trylock functional interface.
 */

#if	LINUX_VERSION_CODE < KERNEL_VERSION(2,3,0)
#if	defined(__alpha__) || defined(__sparc__)
#define local_irq_save(x)	__save_and_cli(x)
#define local_irq_restore(x)	__restore_flags(x)
#elif	defined(__i386__)
#define local_irq_save(x)	__asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")
#define local_irq_restore(x)	__asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory")
#else
#error UNKNOWN ARCH.
#endif
#endif 

extern __inline__ void
debug_mutex_lock(mutex_t *l, char *file, int line)
{	
	unsigned long flags;

	ASSERT(!mutex_held(l));
	switch(l->type) {
		case MUTEX_BLOCK:
			ASSERT(!in_interrupt());
			BS_LOCK(l->mutex_u.blocking.lock);
			break;
		case MUTEX_SPIN_INTR:
			local_irq_save(flags);
			debug_spin_lock(&l->mutex_u.spin.lock, file, line);
			l->mutex_u.spin.flags = flags;
			break;
		case MUTEX_SPIN:
			debug_spin_lock(&l->mutex_u.spin.lock, file, line);
			break;
	}
	l->holder = MY_PID;
}

extern __inline__ void
mutex_unlock(mutex_t *l)
{
	unsigned long flags;

	ASSERT(mutex_held(l));
	l->holder = PID_NONE;
	switch(l->type) {
		case MUTEX_BLOCK:
			BS_UNLOCK(l->mutex_u.blocking.lock);
			break;
		case MUTEX_SPIN_INTR:
			flags = l->mutex_u.spin.flags;
			spin_unlock(&l->mutex_u.spin.lock);
			local_irq_restore(flags);
			break;
		case MUTEX_SPIN:
			spin_unlock(&l->mutex_u.spin.lock);
			break;
	}
}

/*
 * Unordered lock releasing - for spinlocks we must make sure that when
 * l1 is eventually released, local_irq_restore() is given the flags
 * value that was saved when l was acquired.
 * 
 * Usage:
 *   mutex_lock(l)
 *   mutex_lock(l1)
 *   mutex_unlock_unordered(l, l1)
 *   mutex_unlock(l1)
 */
extern __inline__ void
mutex_unlock_unordered(mutex_t *l, mutex_t *l1)
{
	unsigned long flags;

	ASSERT(mutex_held(l));
	ASSERT(mutex_held(l1));

	l->holder = PID_NONE;
	switch (l->type) {
		case MUTEX_BLOCK:
			BS_UNLOCK(l->mutex_u.blocking.lock);
			break;
		case MUTEX_SPIN_INTR:
			ASSERT(l1->type == MUTEX_SPIN_INTR);
			flags = l->mutex_u.spin.flags;
			spin_unlock(&l->mutex_u.spin.lock);
			local_irq_restore(l1->mutex_u.spin.flags);
			l1->mutex_u.spin.flags = flags;
			break;
		case MUTEX_SPIN:
			spin_unlock(&l->mutex_u.spin.lock);
			break;
	}
}

/* 
 * Returns nonzero if lock has been acquired.
 */
extern __inline__ int
debug_mutex_trylock(mutex_t *l, char *file, int line)
{
	int res = 0;
	unsigned long flags;

	ASSERT(!mutex_held(l));
	switch(l->type) {
		case MUTEX_BLOCK:
			ASSERT(!in_interrupt());
			res = BS_TRY(l->mutex_u.blocking.lock);
			break;
		case MUTEX_SPIN_INTR:
			local_irq_save(flags);
			res = (debug_spin_trylock(&l->mutex_u.spin.lock, 
					file, line) != 0);
			if (res)
				l->mutex_u.spin.flags = flags;
			else
				local_irq_restore(flags);
			break;
		case MUTEX_SPIN:
			res = (debug_spin_trylock(&l->mutex_u.spin.lock, file, line) != 0);
			break;
	}
	if (res)
		l->holder = MY_PID;
	return res;
}

#endif /* __KERNEL__ */
#endif /* _LINUX_MUTEX_H */

[-- Attachment #3: condvar.h --]
[-- Type: text/plain, Size: 4267 bytes --]

/*
 *    Copyright (C) 2000  Regents of the University of California
 *
 *    This program is free software; you can redistribute it and/or modify
 *    it under the terms of the GNU General Public License as published by
 *    the Free Software Foundation; either version 2 of the License, or
 *    (at your option) any later version.
 *
 *    This program is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *    GNU General Public License for more details.
 *
 *    You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software
 *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *    Condition variables.
 *    Jim Garlick <garlick@llnl.gov>
 */

#if	!defined(_LINUX_CONDVAR_H)
#define	_LINUX_CONDVAR_H

#if 	defined(__KERNEL__)

#include <linux/list.h>
#include <linux/debug.h>
#include <linux/mutex.h>

#define CV_RET_SIGPENDING	0
#define CV_RET_TIMEOUT		(-1)
#define CV_RET_NORMAL		1

#define CV_MAX_NAME		16

struct cv_task {
	struct task_struct	*task;		/* need to wrap task in this */
	struct list_head	list;		/*   to thread as a list */
	int 			blocked;
};

typedef struct {
	struct list_head	task_list; 	/* list of cv_task's */
						/*   that are waiting on cv */
#if	defined(DEBUG_CV)
	char 			name[CV_MAX_NAME];
#endif
} condvar_t;

#if	defined(DEBUG_CV)
#define LOCK_NAME(c)		(c)->name
#else
#define	LOCK_NAME(c)		"<unknown>"
#endif

#define cv_wait(c,l)		debug_cv_wait(c, l, 0, 0, __BASE_FILE__, __LINE__)
#define cv_waitsig(c,l)		debug_cv_wait(c, l, 0, 1, __BASE_FILE__, __LINE__)
#define cv_timedwait(c,l,to) 	debug_cv_wait(c, l, to, 0, __BASE_FILE__, __LINE__)
#define cv_timedwaitsig(c,l,to) debug_cv_wait(c, l, to, 1, __BASE_FILE__, __LINE__)
#define cv_wakeup_one(c,l)	cv_wakeup(c, l, 0)
#define cv_wakeup_all(c,l) 	cv_wakeup(c, l, 1)
 
extern __inline__ void
cv_init(condvar_t *c, char *name)
{
	INIT_LIST_HEAD(&c->task_list);
#if	defined(DEBUG_CV)
	strncpy(c->name, name, CV_MAX_NAME);
	c->name[CV_MAX_NAME - 1] = '\0';	/* strncpy may not terminate */
#endif
}

extern __inline__ void
cv_destroy(condvar_t *c)
{
	ASSERT(list_empty(&c->task_list));
}

/*
 * We thread a struct cv_task, allocated on the stack, onto the condvar_t's
 * task_list, and take it off again when we wake up.
 * Note: the reason we avoid TASK_UNINTERRUPTIBLE is that the avenrun
 * (load average) computation counts it like TASK_RUNNING.  The cvt.blocked flag 
 * distinguishes a signal wakeup from a cv_wakeup.
 */
extern __inline__ int
debug_cv_wait(condvar_t *c, mutex_t *l, long tmo, int interruptible, 
	char *file, int line)
{
	struct cv_task cvt;
	int ret = CV_RET_NORMAL;

	ASSERT(!in_interrupt());		/* we can block */
	ASSERT(mutex_held(l));			/* enter holding lock */
	/*DBF("cv_wait:\tsleep %s %d:%d\n", LOCK_NAME(c), MY_PID, MY_CPU);*/
	cvt.task = current;
	cvt.blocked = 1;
	list_add(&cvt.list, &c->task_list);
	do {
		current->state = TASK_INTERRUPTIBLE;
		mutex_unlock(l);		/* drop lock for sleep */
		if (tmo) {
			if (tmo <= jiffies || !schedule_timeout(tmo - jiffies))
				ret = CV_RET_TIMEOUT;
		} else
			schedule();
		debug_mutex_lock(l, file, line); /* pick up lock again */
		if (interruptible && signal_pending(current))
			ret = CV_RET_SIGPENDING;
	} while (cvt.blocked && ret == CV_RET_NORMAL);
	list_del(&cvt.list);
	/*DBF("cv_wait:\tawake %s %d:%d %s\n", 
	    LOCK_NAME(c), MY_PID, MY_CPU, ret == CV_RET_TIMEOUT ? "timeout" : 
	    (ret == CV_RET_SIGPENDING) ? "sigpending" : "normal");*/
	return ret;				/* return holding lock */
}

extern __inline__ void
cv_wakeup(condvar_t *c, mutex_t *l, int wakeall)
{
	struct list_head *lp;
	struct cv_task *cvtp;

	ASSERT(mutex_held(l)); 			/* already holding lock */
	for (lp = c->task_list.next; lp != &c->task_list; lp = lp->next) {
		cvtp = list_entry(lp, struct cv_task, list);
		if (cvtp->blocked) {
			/*DBF("cv_wakeup:\twaking %s pid=%d %d:%d\n", 
			    LOCK_NAME(c), p->pid, MY_PID, MY_CPU);*/
			cvtp->blocked = 0;
			/* wake_up_process added to kernel/ksyms.c */
			wake_up_process(cvtp->task); 
			if (!wakeall)
				break;
		}
	}
}						/* return still holding lock */


#endif /* __KERNEL__ */
#endif /* _LINUX_CONDVAR_H */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-20  1:34       ` Daniel Phillips
@ 2000-12-21 16:28         ` Tim Wright
  0 siblings, 0 replies; 17+ messages in thread
From: Tim Wright @ 2000-12-21 16:28 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Tim Wright, linux-kernel

On Wed, Dec 20, 2000 at 02:34:56AM +0100, Daniel Phillips wrote:
> 
> Yes, I see.  There are a lot of similarities to the situation I
> described.  The main difference between this situation and bdflush is
> that dmabuf_free isn't really waiting on dmabuf_alloc to fulfill a
> condition (other than to get out of its exclusion region) while bdflush
> can have n waiters.
> 
> If I could have a new primitive for this job it would be up_down(sem1,
> sem2), atomic with respect to a sleeper on sem1.  And please give me an
> up_all for good measure.  Then for a task wanting to wait on bdflush I
> could write:
> 
>         up_down(&bdflush_request, &bdflush_waiter);
> 
> And in bdflush, just:
> 
>         up_all(&bdflush_waiter);
>         down(&bdflush_request);
> 

OK,
I believe that this would look like the following on ptx (omitting all the
obvious stuff :-)

	lock_t bdflush_lock;
	sema_t bdflush_request;
	sema_t bdflush_waiters;
	...
	init_lock(&bdflush_lock);
	init_sema(&bdflush_request, 0);
	init_sema(&bdflush_waiters, 0);
	...

wakeup_bdflush(...)
{
	...
	(void) p_lock(&bdflush_lock, SPLBUF);
	v_sema(&bdflush_request);
	p_sema_v_lock(&bdflush_waiters, PZERO, &bdflush_lock);
}

bdflush(...)
{
	spl_t s;
	...
	s = p_lock(&bdflush_lock, SPLFS);
	vall_sema(&bdflush_waiters);
	v_lock(&bdflush_lock, s);

	if (!flushed || ...
	...
}

Once more, the use of p_sema_v_lock() avoids races.

> 
> > One can argue the relative merits of the different approaches. I suspect that
> > the above code is less bus-intensive relative to the atomic inc/dec/count ops,
> > but I may be wrong.
> 
> I couldn't say, because your mechanism would need to be elaborated a
> little to handle bdflush's multiple waiters, and I don't know exactly
> what your up_and_wait would look like.  Do spinlocks work for bdflush,
> or would you have to go to semaphores?  (If the latter you arrive at my
> up_down primitive, which is interesting.)  It's even hard to say whether
> my approach is faster or slower than the existing approach.  Ultimately,
> up() calls wake_up() and down() calls both add_wait_queue() and
> remove_wait_queue(), so I lose a little there.  I win in the common case
> of the non-blocking wakeup, which normally runs through Ben LaHaise's
> lovingly handcrafted fast path in up(), whereas the existing code uses
> the more involved wake_up_process().  What's clear is, they are all
> plenty fast enough for this application, and what I'm really trying for
> is readability.
> 

The above hopefully elaborates a little. I'm more than happy to give further
details etc. assuming it's not boring everybody to tears :-)
I agree with you that your changes improve the readability significantly.

Regards,
Tim

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-19 16:07     ` Tim Wright
@ 2000-12-20  1:34       ` Daniel Phillips
  2000-12-21 16:28         ` Tim Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Phillips @ 2000-12-20  1:34 UTC (permalink / raw)
  To: Tim Wright, linux-kernel

Tim Wright wrote:
> 
> Hi Daniel,
> On Tue, Dec 19, 2000 at 02:11:16PM +0100, Daniel Phillips wrote:
> [...]
> > I'm curious, is my method of avoiding the deadlock race the same as
> > yours?  My solution is to keep a count of tasks that 'intend' to take
> > the down():
> >
> >         atomic_inc(&bdflush_waiters);
> >         up(&bdflush_request);
> >         down(&bdflush_waiter);
> >
> > so that bdflush will issue the correct number of up's even if the waiter
> > has not yet gone to sleep.  IOW, is your approach in DYNIX the same only
> > in spirit, or in detail?
> >
> > --
> > Daniel
> 
> OK,
> this is not how we generally would achieve the goal, although the approach
> looks valid. We have a number of primitives available that are not currently
> used in Linux (unless I'm losing my eyesight :-)
> We use p_sema and v_sema for down and up respectively (this was done many
> years ago, and the names are in deference to Edsger Dijkstra).
> For normal semaphores (as opposed to read/writer or other variants), we have
> sema_t sema;
> init_sema(&sema, 1);    /* initialize semaphore & set initial count */
> p_sema(&sema, PZERO);   /* "grab" semaphore and set process priority */
>                         /* priority < PZERO == sleep uninterruptibly */
> v_sema(&sema);          /* release semaphore (i.e. increment count) */
> cp_sema(&sema);         /* Attempt to grab semaphore iff free else EBUSY */
> vall_sema(&sema);       /* Wake up all sleepers on this semaphore */
> blocked_sema(&sema);    /* boolean: any sleepers ? */
> p_sema_v_lock(&sema, priority, &lock);  /* atomically release the lock AND */
>                         /* go to sleep on the semaphore */
> 
> Simple spinlock primitives are similar (e.g. p_lock ...), but the last
> primitive above is the key to avoiding many races. The classic coding style
> in DYNIX/ptx (this for buffer allocation) is then:
> 
> dmabuf_init(...);
> {
>         ...
>         init_sema(&dmabuf_wait, 0);
>         init_lock(&dmabuf_mutex);
>         ...
> }
> 
> dmabuf_alloc(...)
> {
>         spl_t saved_spl;
>         ...
>         while (1) {
>                 saved_spl = p_lock(&dmabuf_mutex, SPLSWP);
>                 attempt to grab a free buffer;
>                 if (success){
>                         v_lock(&dmabuf_mutex, saved_spl);
>                         return;
>                 } else {
>                         p_sema_v_lock(&dmabuf_wait, PSWP+1, &dmabuf_mutex);
>                 }
>         }
> }
> 
> dmabuf_free(...)
> {
>         spl_t saved_spl;
>         ...
>         saved_spl = p_lock(&dmabuf_mutex, SPLHI);
>         free up buffer;
>         if (blocked_sema(&dmabuf_wait)) {
>                 vall_sema(&dmabuf_wait);
>         }
>         v_lock(&dmabuf_mutex, saved_spl);
> }
> 
> As you can see, the spinlocks ensure no races, and the key is the atomicity
> of p_sema_v_lock(). No-one can race in and sleep on dmabuf_wait, because
> they have to hold dmabuf_mutex to do so. Exactly the same mechanism would
> work for the bdflush problem.

Yes, I see.  There are a lot of similarities to the situation I
described.  The main difference between this situation and bdflush is
> that dmabuf_free isn't really waiting on dmabuf_alloc to fulfill a
condition (other than to get out of its exclusion region) while bdflush
can have n waiters.

If I could have a new primitive for this job it would be up_down(sem1,
sem2), atomic with respect to a sleeper on sem1.  And please give me an
up_all for good measure.  Then for a task wanting to wait on bdflush I
could write:

        up_down(&bdflush_request, &bdflush_waiter);

And in bdflush, just:

        up_all(&bdflush_waiter);
        down(&bdflush_request);

But I found I could do the job with existing primitives so I did.

Originally I wrote:

        int waiters = xchg(&bdflush_waiters.count, 0);
        while (waiters--)
                up(&bdflush_waiter);

which uses one less atomic op but, as Philipp Rumpf pointed out to me,
doesn't work on Sparc.  Oh well.  On Intel, the extra read is
practically free.  I could have gone at it by making a new primitive:

int atomic_read_and_clear(atomic_t *p)
{
	int n = atomic_read(p);
	atomic_sub(n, p);
	return n;
}

and on arch i386 it would become:

	#define atomic_read_and_clear(p) (xchg(p, 0))

> One can argue the relative merits of the different approaches. I suspect that
> the above code is less bus-intensive relative to the atomic inc/dec/count ops,
> but I may be wrong.

I couldn't say, because your mechanism would need to be elaborated a
little to handle bdflush's multiple waiters, and I don't know exactly
what your up_and_wait would look like.  Do spinlocks work for bdflush,
or would you have to go to semaphores?  (If the latter you arrive at my
up_down primitive, which is interesting.)  It's even hard to say whether
my approach is faster or slower than the existing approach.  Ultimately,
up() calls wake_up() and down() calls both add_wait_queue() and
remove_wait_queue(), so I lose a little there.  I win in the common case
of the non-blocking wakeup, which normally runs through Ben LaHaise's
lovingly handcrafted fast path in up(), whereas the existing code uses
the more involved wake_up_process().  What's clear is, they are all
plenty fast enough for this application, and what I'm really trying for
is readability.

--
Daniel


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-19 13:11   ` Daniel Phillips
@ 2000-12-19 16:07     ` Tim Wright
  2000-12-20  1:34       ` Daniel Phillips
  0 siblings, 1 reply; 17+ messages in thread
From: Tim Wright @ 2000-12-19 16:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Tim Wright, linux-kernel

Hi Daniel,
On Tue, Dec 19, 2000 at 02:11:16PM +0100, Daniel Phillips wrote:
[...]
> I'm curious, is my method of avoiding the deadlock race the same as
> yours?  My solution is to keep a count of tasks that 'intend' to take
> the down():
> 
>         atomic_inc(&bdflush_waiters);
>         up(&bdflush_request);
>         down(&bdflush_waiter);
> 
> so that bdflush will issue the correct number of up's even if the waiter
> has not yet gone to sleep.  IOW, is your approach in DYNIX the same only
> in spirit, or in detail?
> 
> --
> Daniel

OK,
this is not how we generally would achieve the goal, although the approach
looks valid. We have a number of primitives available that are not currently
used in Linux (unless I'm losing my eyesight :-)
We use p_sema and v_sema for down and up respectively (this was done many
years ago, and the names are in deference to Edsger Dijkstra).
For normal semaphores (as opposed to read/writer or other variants), we have
sema_t sema;
init_sema(&sema, 1);	/* initialize semaphore & set initial count */
p_sema(&sema, PZERO);	/* "grab" semaphore and set process priority */
			/* priority < PZERO == sleep uninterruptibly */
v_sema(&sema);		/* release semaphore (i.e. increment count) */
cp_sema(&sema);		/* Attempt to grab semaphore iff free else EBUSY */
vall_sema(&sema);	/* Wake up all sleepers on this semaphore */
blocked_sema(&sema);	/* boolean: any sleepers ? */
p_sema_v_lock(&sema, priority, &lock);	/* atomically release the lock AND */
			/* go to sleep on the semaphore */

Simple spinlock primitives are similar (e.g. p_lock ...), but the last
primitive above is the key to avoiding many races. The classic coding style
in DYNIX/ptx (this for buffer allocation) is then:

dmabuf_init(...);
{
	...
	init_sema(&dmabuf_wait, 0);
	init_lock(&dmabuf_mutex);
	...
}

dmabuf_alloc(...)
{
	spl_t saved_spl;
	...
	while (1) {
		saved_spl = p_lock(&dmabuf_mutex, SPLSWP);
		attempt to grab a free buffer;
		if (success){
			v_lock(&dmabuf_mutex, saved_spl);
			return;
		} else {
			p_sema_v_lock(&dmabuf_wait, PSWP+1, &dmabuf_mutex);
		}
	}
}

dmabuf_free(...)
{
	spl_t saved_spl;
	...
	saved_spl = p_lock(&dmabuf_mutex, SPLHI);
	free up buffer;
	if (blocked_sema(&dmabuf_wait)) {
		vall_sema(&dmabuf_wait);
	}
	v_lock(&dmabuf_mutex, saved_spl);
}

As you can see, the spinlocks ensure no races, and the key is the atomicity
of p_sema_v_lock(). No-one can race in and sleep on dmabuf_wait, because
they have to hold dmabuf_mutex to do so. Exactly the same mechanism would
work for the bdflush problem.

One can argue the relative merits of the different approaches. I suspect that
the above code is less bus-intensive relative to the atomic inc/dec/count ops,
but I may be wrong.

Regards,

Tim

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-19  3:34 ` Tim Wright
@ 2000-12-19 13:11   ` Daniel Phillips
  2000-12-19 16:07     ` Tim Wright
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Phillips @ 2000-12-19 13:11 UTC (permalink / raw)
  To: Tim Wright; +Cc: linux-kernel

Tim Wright wrote:
> 
> On Sun, Dec 17, 2000 at 01:06:10PM +0100, Daniel Phillips wrote:
> > This patch illustrates an alternative approach to waking and waiting on
> > daemons using semaphores instead of direct operations on wait queues.
> > The idea of using semaphores to regulate the cycling of a daemon was
> > suggested to me by Arjan Vos.  The basic idea is simple: on each cycle
> > a daemon down's a semaphore, and is reactivated when some other task
> > up's the semaphore.
> [...]
> >
> > OK, there it is.  Is this better, worse, or lateral?
> 
> Well, I have to confess I'm rather fond of this method, but that could have
> something to do with it being how we did it in DYNIX/ptx (Sequent).
> It certainly works, and I find it very clear, but of course I'm biased :-)

I'm curious, is my method of avoiding the deadlock race the same as
yours?  My solution is to keep a count of tasks that 'intend' to take
the down():

        atomic_inc(&bdflush_waiters);
        up(&bdflush_request);
        down(&bdflush_waiter);

so that bdflush will issue the correct number of up's even if the waiter
has not yet gone to sleep.  IOW, is your approach in DYNIX the same only
in spirit, or in detail?

--
Daniel


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-17 12:06 Daniel Phillips
  2000-12-19  0:14 ` Nigel Gamble
  2000-12-19  3:34 ` Tim Wright
@ 2000-12-19  9:01 ` Daniel Phillips
  2 siblings, 0 replies; 17+ messages in thread
From: Daniel Phillips @ 2000-12-19  9:01 UTC (permalink / raw)
  To: linux-kernel

Daniel Phillips wrote:
> The idea of using semaphores to regulate the cycling of a daemon was
> suggested to me by Arjan Vos.

Actually, his name is Arjan van de Ven - sorry Arjan :-o

Thanks also to Philipp Rumpf for auditing this patch for cross-platform
correctness.

--
Daniel


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-17 12:06 Daniel Phillips
  2000-12-19  0:14 ` Nigel Gamble
@ 2000-12-19  3:34 ` Tim Wright
  2000-12-19 13:11   ` Daniel Phillips
  2000-12-19  9:01 ` Daniel Phillips
  2 siblings, 1 reply; 17+ messages in thread
From: Tim Wright @ 2000-12-19  3:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Sun, Dec 17, 2000 at 01:06:10PM +0100, Daniel Phillips wrote:
> This patch illustrates an alternative approach to waking and waiting on
> daemons using semaphores instead of direct operations on wait queues.
> The idea of using semaphores to regulate the cycling of a daemon was
> suggested to me by Arjan Vos.  The basic idea is simple: on each cycle
> a daemon down's a semaphore, and is reactivated when some other task
> up's the semaphore.
[...]
> 
> OK, there it is.  Is this better, worse, or lateral?

Well, I have to confess I'm rather fond of this method, but that could have
something to do with it being how we did it in DYNIX/ptx (Sequent).
It certainly works, and I find it very clear, but of course I'm biased :-)

Tim

--
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI


* Re: [RFC] Semaphores used for daemon wakeup
  2000-12-17 12:06 Daniel Phillips
@ 2000-12-19  0:14 ` Nigel Gamble
  2000-12-19  3:34 ` Tim Wright
  2000-12-19  9:01 ` Daniel Phillips
  2 siblings, 0 replies; 17+ messages in thread
From: Nigel Gamble @ 2000-12-19  0:14 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Sun, 17 Dec 2000, Daniel Phillips wrote:
> This patch illustrates an alternative approach to waking and waiting on
> daemons using semaphores instead of direct operations on wait queues.
> The idea of using semaphores to regulate the cycling of a daemon was
> suggested to me by Arjan Vos.  The basic idea is simple: on each cycle
> a daemon down's a semaphore, and is reactivated when some other task
> up's the semaphore.

> Is this better, worse, or lateral?

This is much better, especially from a maintainability point of view.
It is also the method that a lot of operating systems already use.

Nigel Gamble                                    nigel@nrg.org
Mountain View, CA, USA.                         http://www.nrg.org/



* [RFC] Semaphores used for daemon wakeup
@ 2000-12-17 12:06 Daniel Phillips
  2000-12-19  0:14 ` Nigel Gamble
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Daniel Phillips @ 2000-12-17 12:06 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1972 bytes --]

This patch illustrates an alternative approach to waking and waiting on
daemons using semaphores instead of direct operations on wait queues.
The idea of using semaphores to regulate the cycling of a daemon was
suggested to me by Arjan Vos.  The basic idea is simple: on each cycle
a daemon down's a semaphore, and is reactivated when some other task
up's the semaphore.

Sometimes an activating task wants to wait until the daemon completes
whatever it's supposed to do - flushing memory in this case.  I
generalized the above idea by adding another semaphore for wakers to
sleep on, and a count variable to let the daemon know how many
sleepers it needs to activate.  This patch updates bdflush and
wakeup_bdflush to use that mechanism.

The implementation uses two semaphores and a counter:

	DECLARE_MUTEX_LOCKED(bdflush_request);
	DECLARE_MUTEX_LOCKED(bdflush_waiter);
	atomic_t bdflush_waiters /*= 0*/;

A task wanting to activate bdflush does:

	up(&bdflush_request);

A task wanting to activate bdflush and wait does:

	atomic_inc(&bdflush_waiters);
	up(&bdflush_request);
	down(&bdflush_waiter);

When bdflush has finished its work it does:

	waiters = atomic_read(&bdflush_waiters);
	atomic_sub(waiters, &bdflush_waiters);
	while (waiters--)
		up(&bdflush_waiter);
	down(&bdflush_request);

Since I wasn't sure whether the side effect in the existing code of
setting the current task RUNNING was really wanted, I wrote it in
explicitly in the places where the side effect was noted, with the
obligatory comment.

I've done some fairly heavy stress-testing of this new scheme (though
not on non-x86 or SMP) and it does seem to work much the same as the
existing one.  I doubt that there is a measurable difference in
execution overhead, nor is there a difference in correctness as far as
I can see.  But for me at least, it's considerably easier to verify
that the semaphore approach is correct.

OK, there it is.  Is this better, worse, or lateral?

-- 
Daniel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: semwake.patch.2.4.0-test10 --]
[-- Type: text/x-c; name="semwake.patch.2.4.0-test10", Size: 3925 bytes --]

--- ../2.4.0-test10.clean/fs/buffer.c	Thu Oct 12 23:19:32 2000
+++ ./fs/buffer.c	Mon Dec 18 03:03:01 2000
@@ -708,7 +708,8 @@
 static void refill_freelist(int size)
 {
 	if (!grow_buffers(size)) {
-		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
+		wakeup_bdflush(1);
+		__set_current_state(TASK_RUNNING); /* needed?? */
 		current->policy |= SCHED_YIELD;
 		schedule();
 	}
@@ -2469,33 +2470,28 @@
  * response to dirty buffers.  Once this process is activated, we write back
  * a limited number of buffers to the disks and then go back to sleep again.
  */
-static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
+
+/* Semaphore wakeups, Daniel Phillips, phillips@innominate.de, 2000/12 */
+
 struct task_struct *bdflush_tsk = 0;
+DECLARE_MUTEX_LOCKED(bdflush_request);
+DECLARE_MUTEX_LOCKED(bdflush_waiter);
+atomic_t bdflush_waiters /*= 0*/;
 
 void wakeup_bdflush(int block)
 {
-	DECLARE_WAITQUEUE(wait, current);
-
 	if (current == bdflush_tsk)
 		return;
 
-	if (!block) {
-		wake_up_process(bdflush_tsk);
+	if (!block)
+	{
+		up(&bdflush_request);
 		return;
 	}
 
-	/* kflushd can wakeup us before we have a chance to
-	   go to sleep so we must be smart in handling
-	   this wakeup event from kflushd to avoid deadlocking in SMP
-	   (we are not holding any lock anymore in these two paths). */
-	__set_current_state(TASK_UNINTERRUPTIBLE);
-	add_wait_queue(&bdflush_done, &wait);
-
-	wake_up_process(bdflush_tsk);
-	schedule();
-
-	remove_wait_queue(&bdflush_done, &wait);
-	__set_current_state(TASK_RUNNING);
+	atomic_inc(&bdflush_waiters);
+	up(&bdflush_request);
+	down(&bdflush_waiter);
 }
 
 /* This is the _only_ function that deals with flushing async writes
@@ -2640,7 +2636,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, waiters;
 	/*
 	 *	We have a bare-bones task_struct, and really should fill
 	 *	in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2660,6 +2656,7 @@
 	spin_unlock_irq(&tsk->sigmask_lock);
 
 	up((struct semaphore *)sem);
+	printk("Testing semwake bdflush synchronization.\n");
 
 	for (;;) {
 		CHECK_EMERGENCY_SYNC
@@ -2668,28 +2665,16 @@
 		if (free_shortage())
 			flushed += page_launder(GFP_BUFFER, 0);
 
-		/* If wakeup_bdflush will wakeup us
-		   after our bdflush_done wakeup, then
-		   we must make sure to not sleep
-		   in schedule_timeout otherwise
-		   wakeup_bdflush may wait for our
-		   bdflush_done wakeup that would never arrive
-		   (as we would be sleeping) and so it would
-		   deadlock in SMP. */
-		__set_current_state(TASK_INTERRUPTIBLE);
-		wake_up_all(&bdflush_done);
-		/*
-		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
-		 * go to sleep waiting a wakeup.
-		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
+		waiters = atomic_read(&bdflush_waiters);
+		atomic_sub(waiters, &bdflush_waiters);
+		while (waiters--)
+			up(&bdflush_waiter);
+
+		if (!flushed || balance_dirty_state(NODEV) < 0) 
+		{
 			run_task_queue(&tq_disk);
-			schedule();
+			down(&bdflush_request);
 		}
-		/* Remember to mark us as running otherwise
-		   the next schedule will block. */
-		__set_current_state(TASK_RUNNING);
 	}
 }
 
--- ../2.4.0-test10.clean/mm/highmem.c	Wed Oct 18 23:25:46 2000
+++ ./mm/highmem.c	Mon Dec 18 02:24:45 2000
@@ -309,7 +309,8 @@
 repeat_bh:
 	bh = kmem_cache_alloc(bh_cachep, SLAB_BUFFER);
 	if (!bh) {
-		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
+		wakeup_bdflush(1);
+		__set_current_state(TASK_RUNNING); /* needed?? */
 		current->policy |= SCHED_YIELD;
 		schedule();
 		goto repeat_bh;
@@ -323,7 +324,8 @@
 repeat_page:
 	page = alloc_page(GFP_BUFFER);
 	if (!page) {
-		wakeup_bdflush(1);  /* Sets task->state to TASK_RUNNING */
+		wakeup_bdflush(1);
+		__set_current_state(TASK_RUNNING); /* needed?? */
 		current->policy |= SCHED_YIELD;
 		schedule();
 		goto repeat_page;


end of thread, other threads:[~2000-12-22 18:03 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <3A42380B.6E9291D1@sgi.com>
2000-12-21 19:30 ` [RFC] Semaphores used for daemon wakeup Paul Cassella
2000-12-21 22:19   ` Tim Wright
2000-12-22  1:12   ` Daniel Phillips
2000-12-22  1:50   ` Daniel Phillips
2000-12-22  4:26     ` Paul Cassella
2000-12-22 11:46       ` Daniel Phillips
2000-12-22 15:33         ` Tim Wright
2000-12-22 17:25           ` Daniel Phillips
2000-12-22 17:32   ` Brian Pomerantz
2000-12-17 12:06 Daniel Phillips
2000-12-19  0:14 ` Nigel Gamble
2000-12-19  3:34 ` Tim Wright
2000-12-19 13:11   ` Daniel Phillips
2000-12-19 16:07     ` Tim Wright
2000-12-20  1:34       ` Daniel Phillips
2000-12-21 16:28         ` Tim Wright
2000-12-19  9:01 ` Daniel Phillips
