[PATCHSET] concurrency managed workqueue, take#3

From: Tejun Heo @ 2010-01-18  0:57 UTC
  To: torvalds, mingo, peterz, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, all.

This is the third take of cmwq (concurrency managed workqueue)
patchset.  It's on top of the current linus#master
066000dd856709b6980123eb39b957fe26993f7b (v2.6.33-rc3).  Git tree is
available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Quilt series is available at

  http://master.kernel.org/~tj/patches/review-cmwq.tar.gz

Changes from the last take[L]
=============================

* Scheduler code to select fallback cpu has changed and caused a
  problem with kthread_bind()ing from CPU_DOWN_PREP.  It is fixed by
  adding 0001-sched-consult-online-mask-instead-of-active-in-selec.patch.

* 0002-0028 haven't changed but are included for completeness.

* 0029-0040 are added to convert libata, async, fscache, cifs and gfs2
  to use workqueue and to kill slow-work, which doesn't have any user
  left after the conversion.

New patches in this series are

 0001-sched-consult-online-mask-instead-of-active-in-selec.patch
 0029-workqueue-add-system_wq-and-system_single_wq.patch
 0030-workqueue-implement-work_busy.patch
 0031-libata-take-advantage-of-cmwq-and-remove-concurrency.patch
 0032-async-introduce-workqueue-based-alternative-implemen.patch
 0033-async-convert-async-users-to-use-the-new-implementat.patch
 0034-async-kill-original-implementation.patch
 0035-fscache-convert-object-to-use-workqueue-instead-of-s.patch
 0036-fscache-convert-operation-to-use-workqueue-instead-o.patch
 0037-fscache-drop-references-to-slow-work.patch
 0038-cifs-use-workqueue-instead-of-slow-work.patch
 0039-gfs2-use-workqueue-instead-of-slow-work.patch
 0040-slow-work-kill-it.patch

0001 is the aforementioned scheduler fix.

0029-0030 prepare wq for conversions.

0031 converts libata to use cmwq and removes concurrency limitations.

0032-0034 reimplement async using two workqueues.

0035-0037 convert fscache to use workqueues instead of slow-work.

0038-0039 convert cifs and gfs2 to use workqueues instead of
slow-work.

0040 kills slow-work which doesn't have any user left.

Please note that slow-work conversion is missing a couple of
capabilities.

* sysctls to control concurrency level.

* workqueue business notification used to make fscache work yield
  its context and retry instead of waiting while holding the context.

The former can easily be added.  The latter isn't difficult to add
either but I was a bit doubtful about its usefulness.  David, do you
think this is really needed?

With the above omissions and removal of slow-work documentation, the
whole series ends up reducing line count by around a hundred lines.
I'll append diffstat output at the end of this email.

The libata conversion cuts 13 lines of code while removing two
annoying concurrency limitations.

The new async implementation is shorter by about two hundred lines
while providing about the same capability and removing a dedicated
thread pool.

Although there are some minor differences, the capability provided by
slow-work is basically identical to that provided by cmwq.  Other than
a few places where slow-work specific features are depended on, the
conversion of slow-work users to cmwq is fairly straightforward.  The
ref count is incremented on queue and decremented at the end of the
callback.  Module draining is replaced with workqueue flushing.
Concurrency limit is replaced with max_active.  The removal of
slow-work brings in the largest code reduction of about 2000 lines and
removes yet another dedicated thread pool.

slow-work is probably the largest chunk which can be replaced by cmwq
but, as shown in the libata case, small conversions can bring
noticeable benefits and there are other places which have had to deal
with similar limitations.

Please note that the slow-work conversions haven't been signed off
yet.  Those changes need careful review from David before going
anywhere.

Performance test
================

Another issue raised was the performance.  I tried a few things but
couldn't find a realistic and easy test scenario which could expose wq
performance difference.  As many have pointed out, wq just isn't a
very hot path.  I ended up writing a simplistic wq load generator.

wq workload is generated by the perf-wq.c module, which is a very
simple synthetic wq load generator (I'll post it as a reply to this
message).  A work is described by four parameters - burn_usecs,
mean_sleep_msecs, mean_resched_msecs and factor.  It randomly splits
burn_usecs into two, burns the first part, sleeps for 0 - 2 *
mean_sleep_msecs, burns what's left of burn_usecs and then reschedules
itself in 0 - 2 * mean_resched_msecs.  factor is used to tune the
number of cycles to match execution duration.
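
For illustration only, the per-work schedule described above can be
modelled in userspace Python.  This is not perf-wq.c: plan_work() and
its return layout are hypothetical names, the uniform distribution is
an assumption read from the "0 - 2 * mean" wording, and factor is left
out because it only calibrates the burn loop.

```python
import random


def plan_work(burn_usecs, mean_sleep_msecs, mean_resched_msecs, rng):
    """Plan one execution of a synthetic work item.

    Returns (first_burn_us, sleep_ms, second_burn_us, resched_ms).
    Uniform sampling is an assumption, not taken from perf-wq.c.
    """
    # Randomly split the burn budget into two parts.
    first_burn = rng.uniform(0, burn_usecs)
    second_burn = burn_usecs - first_burn
    # Sleep between the two burns; the mean is mean_sleep_msecs.
    sleep_ms = rng.uniform(0, 2 * mean_sleep_msecs)
    # Delay before the work requeues itself; the mean is mean_resched_msecs.
    resched_ms = rng.uniform(0, 2 * mean_resched_msecs)
    return first_burn, sleep_ms, second_burn, resched_ms


if __name__ == "__main__":
    rng = random.Random(0)
    # "short" work with burn/L: 50us burn, 1ms mean sleep, 10ms mean resched.
    print(plan_work(50, 1, 10, rng))
```

The two burn parts always add up to burn_usecs, so only the placement
of the sleep inside the burn varies from run to run.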

It issues three types of works - short, medium and long, each with two
burn durations L and S.

	burn/L(us)	burn/S(us)	mean_sleep(ms)	mean_resched(ms) cycles
 short	50		1		1		10		 454
 medium	50		2		10		50		 125
 long	50		4		100		250		 42

And then these works are put into the following workloads.  The lower
numbered workloads have more short/medium works.

 workload 0
 * 12 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 1
 *  8 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 2
 *  4 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 3
 *  2 wqs with 4 short works
 *  2 wqs with 2 short  and 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 4
 *  2 wqs with 4 short works
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work

 workload 5
 *  2 wqs with 2 medium works
 *  4 wqs with 2 medium and 1 long works
 *  8 wqs with 1 long work
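
Tallying the lists above makes the difference between the workloads
explicit.  A small Python sketch, where WORKLOADS and count_works()
are illustrative names and the tuples are transcribed from the lists:

```python
# Each entry is (number of wqs, {work type: works queued on each wq}).
TAIL = [(4, {"medium": 2, "long": 1}), (8, {"long": 1})]
MIXED = (2, {"short": 2, "medium": 2})

WORKLOADS = {
    0: [(12, {"short": 4}), MIXED] + TAIL,
    1: [(8, {"short": 4}), MIXED] + TAIL,
    2: [(4, {"short": 4}), MIXED] + TAIL,
    3: [(2, {"short": 4}), MIXED] + TAIL,
    4: [(2, {"short": 4}), (2, {"medium": 2})] + TAIL,
    5: [(2, {"medium": 2})] + TAIL,
}


def count_works(workload):
    """Return the total number of short, medium and long works."""
    totals = {"short": 0, "medium": 0, "long": 0}
    for n_wqs, works in workload:
        for kind, per_wq in works.items():
            totals[kind] += n_wqs * per_wq
    return totals


if __name__ == "__main__":
    for wl, spec in WORKLOADS.items():
        print("wl%d" % wl, count_works(spec))
```

Under this tally the number of short works falls from 52 in workload 0
to 0 in workload 5, while medium and long works stay at 12 each.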

The above wq loads are run in parallel with mencoder converting a 76M
mjpeg file into mpeg4, which takes 25.59 seconds with a standard
deviation of 0.19 without wq loading.  The CPU was an intel netburst
celeron running at 2.66GHz (chosen for its small cache size and
slowness).  wl0 and wl1 were tested only for burn/S.  Each test case
was run 11 times and the first run was discarded.

	 vanilla/L	cmwq/L		vanilla/S	cmwq/S
 wl0					26.18 d0.24	26.27 d0.29
 wl1					26.50 d0.45	26.52 d0.23
 wl2	26.62 d0.35	26.53 d0.23	26.14 d0.22	26.12 d0.32
 wl3	26.30 d0.25	26.29 d0.26	25.94 d0.25	26.17 d0.30
 wl4	26.26 d0.23	25.93 d0.24	25.90 d0.23	25.91 d0.29
 wl5	25.81 d0.33	25.88 d0.25	25.63 d0.27	25.59 d0.26

There is no significant difference between the two.  Maybe the code
overhead and benefits coming from context sharing are canceling each
other nicely.  With longer burns, cmwq looks better but it's nothing
significant.  With shorter burns, other than wl3 spiking up for
cmwq, which probably would go away if the test is repeated, the two
are performing virtually identically.
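
One crude way to read the table is to compare each cmwq-vanilla gap
with the larger of the two deviations.  The sketch below does that in
Python; it assumes the dN.NN figures are standard deviations in
seconds, compare() is an illustrative helper, and this is a sanity
check rather than a significance test.

```python
# (workload, burn) -> ((vanilla mean, vanilla dev), (cmwq mean, cmwq dev))
RESULTS = {
    ("wl0", "S"): ((26.18, 0.24), (26.27, 0.29)),
    ("wl1", "S"): ((26.50, 0.45), (26.52, 0.23)),
    ("wl2", "L"): ((26.62, 0.35), (26.53, 0.23)),
    ("wl2", "S"): ((26.14, 0.22), (26.12, 0.32)),
    ("wl3", "L"): ((26.30, 0.25), (26.29, 0.26)),
    ("wl3", "S"): ((25.94, 0.25), (26.17, 0.30)),
    ("wl4", "L"): ((26.26, 0.23), (25.93, 0.24)),
    ("wl4", "S"): ((25.90, 0.23), (25.91, 0.29)),
    ("wl5", "L"): ((25.81, 0.33), (25.88, 0.25)),
    ("wl5", "S"): ((25.63, 0.27), (25.59, 0.26)),
}


def compare(vanilla, cmwq):
    """Return (cmwq - vanilla in seconds, gap within one deviation?)."""
    (v_mean, v_dev), (c_mean, c_dev) = vanilla, cmwq
    delta = c_mean - v_mean
    within = abs(delta) <= max(v_dev, c_dev)
    return round(delta, 2), within


if __name__ == "__main__":
    for key, (vanilla, cmwq) in RESULTS.items():
        print(key, compare(vanilla, cmwq))
```

By this measure only wl4 with burn/L differs by more than one
deviation (0.33 seconds in favour of cmwq); every other pair is
within the noise.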

The above is an exaggerated synthetic test result and the performance
difference will be even less noticeable in either direction under
realistic workloads.

cmwq extends workqueue such that it can serve as a robust async
mechanism which can be used (mostly) universally without introducing
any noticeable performance degradation.

Thanks.

diffstat
========
 Documentation/slow-work.txt   |  322 -----
 arch/ia64/kernel/smpboot.c    |    2 
 arch/ia64/kvm/Kconfig         |    1 
 arch/powerpc/kvm/Kconfig      |    1 
 arch/s390/kvm/Kconfig         |    1 
 arch/x86/kernel/smpboot.c     |    2 
 arch/x86/kvm/Kconfig          |    1 
 drivers/acpi/battery.c        |    4 
 drivers/acpi/osl.c            |   41 
 drivers/ata/libata-core.c     |   50 
 drivers/ata/libata-eh.c       |    4 
 drivers/ata/libata-scsi.c     |   11 
 drivers/ata/libata.h          |    1 
 drivers/ata/pata_legacy.c     |    2 
 drivers/base/core.c           |    2 
 drivers/base/dd.c             |    2 
 drivers/md/raid5.c            |    4 
 drivers/s390/block/dasd.c     |    4 
 drivers/scsi/sd.c             |    8 
 fs/cachefiles/namei.c         |   28 
 fs/cachefiles/rdwr.c          |    4 
 fs/cifs/Kconfig               |    1 
 fs/cifs/cifsfs.c              |    6 
 fs/cifs/cifsglob.h            |    8 
 fs/cifs/dir.c                 |    2 
 fs/cifs/file.c                |   22 
 fs/cifs/misc.c                |   15 
 fs/fscache/Kconfig            |    1 
 fs/fscache/internal.h         |    2 
 fs/fscache/main.c             |   25 
 fs/fscache/object-list.c      |   12 
 fs/fscache/object.c           |   67 -
 fs/fscache/operation.c        |   67 -
 fs/fscache/page.c             |   36 
 fs/gfs2/Kconfig               |    1 
 fs/gfs2/incore.h              |    3 
 fs/gfs2/main.c                |    9 
 fs/gfs2/ops_fstype.c          |    8 
 fs/gfs2/recovery.c            |   52 
 fs/gfs2/recovery.h            |    4 
 fs/gfs2/sys.c                 |    3 
 include/linux/async.h         |   17 
 include/linux/fscache-cache.h |   49 
 include/linux/kvm_host.h      |    4 
 include/linux/libata.h        |    2 
 include/linux/preempt.h       |   48 
 include/linux/sched.h         |   71 -
 include/linux/slow-work.h     |  163 --
 include/linux/stop_machine.h  |    6 
 include/linux/workqueue.h     |  109 +
 init/Kconfig                  |   28 
 init/do_mounts.c              |    2 
 init/main.c                   |    4 
 kernel/Makefile               |    2 
 kernel/async.c                |  393 +-----
 kernel/irq/autoprobe.c        |    2 
 kernel/module.c               |    4 
 kernel/power/process.c        |   21 
 kernel/sched.c                |  334 +++--
 kernel/slow-work-debugfs.c    |  227 ---
 kernel/slow-work.c            | 1068 ----------------
 kernel/slow-work.h            |   72 -
 kernel/stop_machine.c         |  151 +-
 kernel/sysctl.c               |    8 
 kernel/trace/Kconfig          |    4 
 kernel/workqueue.c            | 2697 ++++++++++++++++++++++++++++++++++++------
 virt/kvm/kvm_main.c           |   26 
 67 files changed, 3120 insertions(+), 3231 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/929641
