Hi Chris,

On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
> 		list_splice_init(&new_pending,&work_list);
> 		goto again;
> 	}
> +
> +	list_sort(NULL,&wake_list, list_comp);
> +	while (!list_empty(&wake_list)) {
> +		q = list_entry(wake_list.next, struct sem_queue, list);
> +		list_del_init(&q->list);
> +		wake_up_sem_queue(q, 0);
> +	}
> }
>
What about moving this step much later? There is no need to hold any
locks for the actual wake_up_process().

I've updated my patch:
- an improved update_queue that guarantees no O(N^2) for your workload.
- the actual wake-up moved to after all locks are dropped.
- optimized setting of sem_otime.
- the ipc spinlock cacheline aligned.

But the odd thing: it doesn't improve the sembench result at all
(AMD Phenom X4). The only thing that is reduced is the system time:
from ~1 min of system time for "sembench -t 250 -w 250 -r 30 -o 0"
down to ~30 sec.

CPU-binding the sembench threads improves the result by ~50% - at the
cost of a significant increase in system time (from 30 seconds to
1 min) and in user time (from 2 seconds to 14 seconds).

Are you sure that the problem is contention on the semaphore array
spinlock? With the above changes, the code that runs under the
spin_lock is very short.

Especially:
- Why does optimizing ipc/sem.c only reduce the system time [as
  reported by time] and not the sembench output?
- Why is there no improvement from the cacheline alignment
  (____cacheline_aligned_in_smp)? If there were contention, there
  should be cacheline thrashing from acquiring the lock, writing
  sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help
  much. But the arrays are 100% independent, and the ipc code scales
  linearly across them. Spreading the work over multiple spinlocks
  is - like cacheline aligning - practically guaranteed to help if
  there is contention.

I've attached a modified sembench.c and the proposal for ipc/sem.c.
Could you try it? What do you think? How many cores do you have in
your test system?

--
	Manfred
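
P.S.: To make the "wake-up after dropping all locks" point concrete,
here is a minimal sketch of the idea (illustration only, not the
attached proposal: the locking is simplified, and wake_up_deferred()
and collect_woken_tasks() are made-up names standing in for the queue
scanning that update_queue() actually does):

/*
 * Sketch (in ipc/sem.c context): collect all to-be-woken tasks on a
 * private list while holding the array spinlock, drop the lock, and
 * only then call wake_up_process(). Nothing in the wake-up itself
 * needs the lock.
 */
static void wake_up_deferred(struct sem_array *sma)
{
	LIST_HEAD(wake_list);
	struct sem_queue *q;

	spin_lock(&sma->sem_perm.lock);
	/* hypothetical helper: moves woken sem_queues onto wake_list */
	collect_woken_tasks(sma, &wake_list);
	spin_unlock(&sma->sem_perm.lock);

	/* no locks held: the expensive part now runs lock-free */
	while (!list_empty(&wake_list)) {
		q = list_entry(wake_list.next, struct sem_queue, list);
		list_del_init(&q->list);
		q->status = 0;	/* simplified completion handshake */
		wake_up_process(q->sleeper);
	}
}

The point is that only the list manipulation needs the array spinlock;
the wake_up_process() calls, which dominate the cost, do not.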
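And the cacheline alignment from the list above is the usual trick;
one way to write it (a sketch, field list abridged from the real
struct kern_ipc_perm - whether the second annotation is needed depends
on how the struct is embedded):

/*
 * Sketch: give the hot ipc spinlock its own cacheline. The first
 * annotation aligns the lock (and the struct) to a cacheline
 * boundary; the second pushes the read-mostly fields onto the next
 * cacheline, so lock acquisitions do not false-share with them.
 */
struct kern_ipc_perm {
	spinlock_t	lock ____cacheline_aligned_in_smp;
	int		deleted ____cacheline_aligned_in_smp;
	int		id;
	key_t		key;
	/* ... remaining fields ... */
};

Whether this helps is exactly the open question above: if the lock
were truly contended, the alignment should show up in the numbers.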