Hi Chris,

On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
> 		list_splice_init(&new_pending,&work_list);
> 		goto again;
> 	}
> +
> +	list_sort(NULL,&wake_list, list_comp);
> +	while (!list_empty(&wake_list)) {
> +		q = list_entry(wake_list.next, struct sem_queue, list);
> +		list_del_init(&q->list);
> +		wake_up_sem_queue(q, 0);
> +	}
> }
>
What about moving this step much later? There is no need to hold any
locks for the actual wake_up_process().

I've updated my patch:
- an improved update_queue that guarantees no O(N^2) for your workload.
- the actual wake-up moved to after all locks are dropped.
- optimized setting of sem_otime.
- the ipc spinlock cacheline aligned.

But the odd thing: it doesn't improve the sembench result at all
(AMD Phenom X4). The only thing that is reduced is the system time:
from ~1 min of system time for "sembench -t 250 -w 250 -r 30 -o 0"
down to ~30 sec.

CPU-binding the sembench threads improves the result by ~50% - at the
cost of a significant increase in system time (from 30 seconds to
1 min) and in user time (from 2 seconds to 14 seconds).

Are you sure that the problem is contention on the semaphore array
spinlock? With the above changes, the code that runs under the
spin_lock is very short.

Especially:
- Why does optimizing ipc/sem.c only reduce the system time [as
  reported by time] and not the sembench output?
- Why is there no improvement from the cacheline alignment
  (____cacheline_aligned_in_smp)? If there were contention, there
  should be cacheline thrashing from acquiring the lock, writing
  sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help
  much. But the arrays are 100% independent, and the ipc code scales
  linearly across them. Spreading the work over multiple spinlocks
  is - like cacheline aligning - practically guaranteed to help if
  there is contention.

I've attached a modified sembench.c and the proposal for ipc/sem.c.
Could you try it? What do you think? How many cores do you have in
your test system?

--
	Manfred
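
P.S.: To make the "wake-up after dropping all locks" point concrete,
here is a minimal sketch of the idea (illustration only, not the
attached proposal: the locking is simplified, and wake_up_deferred()
and collect_woken_tasks() are made-up names standing in for the queue
scanning that update_queue() actually does):

/*
 * Sketch (in ipc/sem.c context): collect all to-be-woken tasks on a
 * private list while holding the array spinlock, drop the lock, and
 * only then call wake_up_process(). Nothing in the wake-up itself
 * needs the lock.
 */
static void wake_up_deferred(struct sem_array *sma)
{
	LIST_HEAD(wake_list);
	struct sem_queue *q;

	spin_lock(&sma->sem_perm.lock);
	/* hypothetical helper: moves woken sem_queues onto wake_list */
	collect_woken_tasks(sma, &wake_list);
	spin_unlock(&sma->sem_perm.lock);

	/* no locks held: the expensive part now runs lock-free */
	while (!list_empty(&wake_list)) {
		q = list_entry(wake_list.next, struct sem_queue, list);
		list_del_init(&q->list);
		q->status = 0;	/* simplified completion handshake */
		wake_up_process(q->sleeper);
	}
}

The point is that only the list manipulation needs the array spinlock;
the wake_up_process() calls, which dominate the cost, do not.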
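And the cacheline alignment from the list above is the usual trick;
one way to write it (a sketch, field list abridged from the real
struct kern_ipc_perm - whether the second annotation is needed depends
on how the struct is embedded):

/*
 * Sketch: give the hot ipc spinlock its own cacheline. The first
 * annotation aligns the lock (and the struct) to a cacheline
 * boundary; the second pushes the read-mostly fields onto the next
 * cacheline, so lock acquisitions do not false-share with them.
 */
struct kern_ipc_perm {
	spinlock_t	lock ____cacheline_aligned_in_smp;
	int		deleted ____cacheline_aligned_in_smp;
	int		id;
	key_t		key;
	/* ... remaining fields ... */
};

Whether this helps is exactly the open question above: if the lock
were truly contended, the alignment should show up in the numbers.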