* [RFC][PATCH] sched_wait_block: wait for blocked threads
@ 2009-11-15 19:04 Stijn Devriendt
  2009-11-15 19:04 ` [PATCH] Initial prototype version of sched_wait_block Stijn Devriendt
  2009-11-16  8:35 ` [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads) Ingo Molnar
  0 siblings, 2 replies; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-15 19:04 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel

Hi Ingo, Peter, all,

The attached patch is a prototype for a new system call which
allows threads to wait for another thread to block.

Its main use is to allow threading libraries to resume
executing more CPU-bound work when one of their threads is
blocked, without having to over-allocate threads in the
normal case.

The benefit over asynchronous I/O is that a threadpool
thread performing asynchronous I/O might not have enough
work in one item to keep the CPU busy for the whole
asynchronous operation, and that not all operations are
async-capable.
Giving control back to the library through a thread that
waits for the blocked one allows new work items to be
executed for as long as the former is blocked.

Code performing this wait could look like:
  pid_t parent = ...;
  while (waitpid(parent, NULL, WNOHANG) != 0)
  {
    if (sched_wait_block(parent, NULL) == 0)
    {
      // do work, possibly notify threadpool manager
      // to start another thread blocked on this one
      // first
    }
  }
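
There is no glibc wrapper for a prototype syscall, so user space
would presumably invoke it through syscall(2). A rough, untested
sketch of such a wrapper (299 is the x86-64 number the patch
assigns; other architectures use their own numbers):

  #include <sys/syscall.h>
  #include <unistd.h>
  #include <time.h>

  #ifndef __NR_sched_wait_block
  #define __NR_sched_wait_block 299  /* x86-64 number from the patch */
  #endif

  /* Returns 0 once 'pid' has blocked, -1 with errno set
   * (EINTR, EAGAIN, ESRCH, EINVAL) otherwise. */
  static inline int sched_wait_block(pid_t pid, const struct timespec *uts)
  {
    return syscall(__NR_sched_wait_block, pid, uts);
  }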

Any feedback on the concept is much appreciated.

Regards,
	Stijn Devriendt



* [PATCH] Initial prototype version of sched_wait_block
  2009-11-15 19:04 [RFC][PATCH] sched_wait_block: wait for blocked threads Stijn Devriendt
@ 2009-11-15 19:04 ` Stijn Devriendt
  2009-11-16  8:35 ` [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads) Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-15 19:04 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel; +Cc: Stijn Devriendt


Signed-off-by: Stijn Devriendt <HIGHGuY@gmail.com>
---
 arch/alpha/include/asm/unistd.h      |    3 +-
 arch/arm/include/asm/unistd.h        |    1 +
 arch/avr32/include/asm/unistd.h      |    3 +-
 arch/blackfin/include/asm/unistd.h   |    3 +-
 arch/cris/include/asm/unistd.h       |    3 +-
 arch/frv/include/asm/unistd.h        |    3 +-
 arch/h8300/include/asm/unistd.h      |    3 +-
 arch/ia64/include/asm/unistd.h       |    3 +-
 arch/m32r/include/asm/unistd.h       |    3 +-
 arch/m68k/include/asm/unistd.h       |    3 +-
 arch/microblaze/include/asm/unistd.h |    3 +-
 arch/mips/include/asm/unistd.h       |    4 +-
 arch/mn10300/include/asm/unistd.h    |    3 +-
 arch/parisc/include/asm/unistd.h     |    3 +-
 arch/powerpc/include/asm/unistd.h    |    3 +-
 arch/s390/include/asm/unistd.h       |    4 ++-
 arch/sh/include/asm/unistd_32.h      |    3 +-
 arch/sh/include/asm/unistd_64.h      |    3 +-
 arch/sparc/include/asm/unistd.h      |    3 +-
 arch/x86/ia32/ia32entry.S            |    1 +
 arch/x86/include/asm/unistd_32.h     |    1 +
 arch/x86/include/asm/unistd_64.h     |    2 +
 arch/x86/kernel/syscall_table_32.S   |    1 +
 arch/xtensa/include/asm/unistd.h     |    3 +-
 include/asm-generic/unistd.h         |    5 ++-
 include/linux/init_task.h            |    1 +
 include/linux/sched.h                |    3 +-
 include/linux/syscalls.h             |    1 +
 include/linux/wait.h                 |    1 +
 kernel/fork.c                        |    2 +
 kernel/sched.c                       |   64 ++++++++++++++++++++++++++++++++--
 31 files changed, 117 insertions(+), 25 deletions(-)

diff --git a/arch/alpha/include/asm/unistd.h b/arch/alpha/include/asm/unistd.h
index 5b5c174..3b1b200 100644
--- a/arch/alpha/include/asm/unistd.h
+++ b/arch/alpha/include/asm/unistd.h
@@ -433,10 +433,11 @@
 #define __NR_signalfd			476
 #define __NR_timerfd			477
 #define __NR_eventfd			478
+#define __NR_sched_wait_block 479
 
 #ifdef __KERNEL__
 
-#define NR_SYSCALLS			479
+#define NR_SYSCALLS			480
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 4e506d0..76364ce 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -391,6 +391,7 @@
 #define __NR_pwritev			(__NR_SYSCALL_BASE+362)
 #define __NR_rt_tgsigqueueinfo		(__NR_SYSCALL_BASE+363)
 #define __NR_perf_event_open		(__NR_SYSCALL_BASE+364)
+#define __NR_sched_wait_block   (__NR_SYSCALL_BASE+365)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/avr32/include/asm/unistd.h b/arch/avr32/include/asm/unistd.h
index 89861a2..eb892f6 100644
--- a/arch/avr32/include/asm/unistd.h
+++ b/arch/avr32/include/asm/unistd.h
@@ -299,9 +299,10 @@
 #define __NR_signalfd		279
 /* 280 was __NR_timerfd */
 #define __NR_eventfd		281
+#define __NR_sched_wait_block 282
 
 #ifdef __KERNEL__
-#define NR_syscalls		282
+#define NR_syscalls		283
 
 /* Old stuff */
 #define __IGNORE_uselib
diff --git a/arch/blackfin/include/asm/unistd.h b/arch/blackfin/include/asm/unistd.h
index 779be02..8535d69 100644
--- a/arch/blackfin/include/asm/unistd.h
+++ b/arch/blackfin/include/asm/unistd.h
@@ -388,8 +388,9 @@
 #define __NR_pwritev		367
 #define __NR_rt_tgsigqueueinfo	368
 #define __NR_perf_event_open	369
+#define __NR_sched_wait_block 370
 
-#define __NR_syscall		370
+#define __NR_syscall		371
 #define NR_syscalls		__NR_syscall
 
 /* Old optional stuff no one actually uses */
diff --git a/arch/cris/include/asm/unistd.h b/arch/cris/include/asm/unistd.h
index c170793..8360734 100644
--- a/arch/cris/include/asm/unistd.h
+++ b/arch/cris/include/asm/unistd.h
@@ -339,10 +339,11 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_sched_wait_block 335
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 335
+#define NR_syscalls 336
 
 #include <arch/unistd.h>
 
diff --git a/arch/frv/include/asm/unistd.h b/arch/frv/include/asm/unistd.h
index be6ef0f..4c90687 100644
--- a/arch/frv/include/asm/unistd.h
+++ b/arch/frv/include/asm/unistd.h
@@ -343,10 +343,11 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_sched_wait_block 337
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 337
+#define NR_syscalls 338
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 /* #define __ARCH_WANT_OLD_READDIR */
diff --git a/arch/h8300/include/asm/unistd.h b/arch/h8300/include/asm/unistd.h
index 99f3c35..eb73a9c 100644
--- a/arch/h8300/include/asm/unistd.h
+++ b/arch/h8300/include/asm/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_sched_wait_block 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index 5a5347f..ccb2154 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -311,11 +311,12 @@
 #define __NR_preadv			1319
 #define __NR_pwritev			1320
 #define __NR_rt_tgsigqueueinfo		1321
+#define __NR_sched_wait_block 1322
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls			298 /* length of syscall table */
+#define NR_syscalls			299 /* length of syscall table */
 
 /*
  * The following defines stop scripts/checksyscalls.sh from complaining about
diff --git a/arch/m32r/include/asm/unistd.h b/arch/m32r/include/asm/unistd.h
index cf701c9..f4a0030 100644
--- a/arch/m32r/include/asm/unistd.h
+++ b/arch/m32r/include/asm/unistd.h
@@ -330,10 +330,11 @@
 /* #define __NR_timerfd		322 removed */
 #define __NR_eventfd		323
 #define __NR_fallocate		324
+#define __NR_sched_wait_block 325
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 326
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_STAT64
diff --git a/arch/m68k/include/asm/unistd.h b/arch/m68k/include/asm/unistd.h
index 48b87f5..e792350 100644
--- a/arch/m68k/include/asm/unistd.h
+++ b/arch/m68k/include/asm/unistd.h
@@ -336,10 +336,11 @@
 #define __NR_pwritev		330
 #define __NR_rt_tgsigqueueinfo	331
 #define __NR_perf_event_open	332
+#define __NR_sched_wait_block 333
 
 #ifdef __KERNEL__
 
-#define NR_syscalls		333
+#define NR_syscalls		334
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/microblaze/include/asm/unistd.h b/arch/microblaze/include/asm/unistd.h
index cb05a07..df787b2 100644
--- a/arch/microblaze/include/asm/unistd.h
+++ b/arch/microblaze/include/asm/unistd.h
@@ -382,8 +382,9 @@
 #define __NR_pwritev		364 /* new */
 #define __NR_rt_tgsigqueueinfo	365 /* new */
 #define __NR_perf_event_open	366 /* new */
+#define __NR_sched_wait_block 367
 
-#define __NR_syscalls		367
+#define __NR_syscalls		368
 
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__
diff --git a/arch/mips/include/asm/unistd.h b/arch/mips/include/asm/unistd.h
index 8c9dfa9..ac3086f 100644
--- a/arch/mips/include/asm/unistd.h
+++ b/arch/mips/include/asm/unistd.h
@@ -355,11 +355,11 @@
 #define __NR_rt_tgsigqueueinfo		(__NR_Linux + 332)
 #define __NR_perf_event_open		(__NR_Linux + 333)
 #define __NR_accept4			(__NR_Linux + 334)
-
+#define __NR_sched_wait_block (__NR_Linux + 335)
 /*
  * Offset of the last Linux o32 flavoured syscall
  */
-#define __NR_Linux_syscalls		334
+#define __NR_Linux_syscalls		335
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
 
diff --git a/arch/mn10300/include/asm/unistd.h b/arch/mn10300/include/asm/unistd.h
index 2a98393..3d0ec30 100644
--- a/arch/mn10300/include/asm/unistd.h
+++ b/arch/mn10300/include/asm/unistd.h
@@ -348,10 +348,11 @@
 #define __NR_pwritev		335
 #define __NR_rt_tgsigqueueinfo	336
 #define __NR_perf_event_open	337
+#define __NR_sched_wait_block 338
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 339
 
 /*
  * specify the deprecated syscalls we want to support on this arch
diff --git a/arch/parisc/include/asm/unistd.h b/arch/parisc/include/asm/unistd.h
index cda1583..383332b 100644
--- a/arch/parisc/include/asm/unistd.h
+++ b/arch/parisc/include/asm/unistd.h
@@ -811,8 +811,9 @@
 #define __NR_pwritev		(__NR_Linux + 316)
 #define __NR_rt_tgsigqueueinfo	(__NR_Linux + 317)
 #define __NR_perf_event_open	(__NR_Linux + 318)
+#define __NR_sched_wait_block (__NR_Linux + 319)
 
-#define __NR_Linux_syscalls	(__NR_perf_event_open + 1)
+#define __NR_Linux_syscalls	(__NR_sched_wait_block + 1)
 
 
 #define __IGNORE_select		/* newselect */
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f6ca761..466aa45 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -345,10 +345,11 @@
 #define __NR_preadv		320
 #define __NR_pwritev		321
 #define __NR_rt_tgsigqueueinfo	322
+#define __NR_sched_wait_block 323
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		323
+#define __NR_syscalls		324
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..c0dabc5 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
-#define NR_syscalls 332
+#define __NR_sched_wait_block 332
+
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/sh/include/asm/unistd_32.h b/arch/sh/include/asm/unistd_32.h
index f3fd1b9..16516d3 100644
--- a/arch/sh/include/asm/unistd_32.h
+++ b/arch/sh/include/asm/unistd_32.h
@@ -345,8 +345,9 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_sched_wait_block 337
 
-#define NR_syscalls 337
+#define NR_syscalls 338
 
 #ifdef __KERNEL__
 
diff --git a/arch/sh/include/asm/unistd_64.h b/arch/sh/include/asm/unistd_64.h
index 343ce8f..6bf4ab4 100644
--- a/arch/sh/include/asm/unistd_64.h
+++ b/arch/sh/include/asm/unistd_64.h
@@ -385,10 +385,11 @@
 #define __NR_pwritev		362
 #define __NR_rt_tgsigqueueinfo	363
 #define __NR_perf_event_open	364
+#define __NR_sched_wait_block 365
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 365
+#define NR_syscalls 366
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h
index 42f2316..482664d 100644
--- a/arch/sparc/include/asm/unistd.h
+++ b/arch/sparc/include/asm/unistd.h
@@ -396,8 +396,9 @@
 #define __NR_pwritev		325
 #define __NR_rt_tgsigqueueinfo	326
 #define __NR_perf_event_open	327
+#define __NR_sched_wait_block 328
 
-#define NR_SYSCALLS		328
+#define NR_SYSCALLS		329
 
 #ifdef __32bit_syscall_numbers__
 /* Sparc 32-bit only has the "setresuid32", "getresuid32" variants,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..a63001a 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
 	.quad compat_sys_pwritev
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
+	.quad sys_sched_wait_block
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..91a9e56 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_sched_wait_block 337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..11e943e 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_sched_wait_block     299
+__SYSCALL(__NR_sched_wait_block, sys_sched_wait_block)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..2f93cef 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long sys_sched_wait_block
diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h
index c092c8f..834aa00 100644
--- a/arch/xtensa/include/asm/unistd.h
+++ b/arch/xtensa/include/asm/unistd.h
@@ -681,8 +681,9 @@ __SYSCALL(304, sys_signalfd, 3)
 __SYSCALL(305, sys_ni_syscall, 0)
 #define __NR_eventfd				306
 __SYSCALL(306, sys_eventfd, 1)
+#define __NR_sched_wait_block 307
 
-#define __NR_syscall_count			307
+#define __NR_syscall_count			308
 
 /*
  * sysxtensa syscall handler
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ff21a1b 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open 241
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 
+#define __NR_sched_wait_block 242
+__SYSCALL(__NR_sched_wait_block, sys_sched_wait_block)
+
 #undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 21a6f5d..21ed548 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -145,6 +145,7 @@ extern struct cred init_cred;
 	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
 	.ptraced	= LIST_HEAD_INIT(tsk.ptraced),			\
 	.ptrace_entry	= LIST_HEAD_INIT(tsk.ptrace_entry),		\
+	.task_waiters = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.task_waiters), \
 	.real_parent	= &tsk,						\
 	.parent		= &tsk,						\
 	.children	= LIST_HEAD_INIT(tsk.children),			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c46023..d4639b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,7 +1280,7 @@ struct task_struct {
 	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
 				 * execve */
 	unsigned in_iowait:1;
-
+  unsigned in_taskwait:1;
 
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;
@@ -1315,6 +1315,7 @@ struct task_struct {
 	struct list_head ptraced;
 	struct list_head ptrace_entry;
 
+	wait_queue_head_t task_waiters;
 	/*
 	 * This is the tracer handle for the ptrace BTS extension.
 	 * This field actually belongs to the ptracer task.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b50974a..e18d7e8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -406,6 +406,7 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
 					struct timespec __user *interval);
 asmlinkage long sys_setpriority(int which, int who, int niceval);
 asmlinkage long sys_getpriority(int which, int who);
+asmlinkage long sys_sched_wait_block(pid_t pid, struct timespec __user *uts);
 
 asmlinkage long sys_shutdown(int, int);
 asmlinkage long sys_reboot(int magic1, int magic2, unsigned int cmd,
diff --git a/include/linux/wait.h b/include/linux/wait.h
index a48e16b..4afb340 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -157,6 +157,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_nr(x, nr)		__wake_up(x, TASK_NORMAL, nr, NULL)
 #define wake_up_all(x)			__wake_up(x, TASK_NORMAL, 0, NULL)
 #define wake_up_locked(x)		__wake_up_locked((x), TASK_NORMAL)
+#define wake_up_sync_all(x) __wake_up_sync((x), TASK_NORMAL, 0)
 
 #define wake_up_interruptible(x)	__wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
 #define wake_up_interruptible_nr(x, nr)	__wake_up(x, TASK_INTERRUPTIBLE, nr, NULL)
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..e664442 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1121,6 +1121,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->blocked_on = NULL; /* not blocked yet */
 #endif
 
+	init_waitqueue_head(&p->task_waiters);
+
 	p->bts = NULL;
 
 	p->stack_start = stack_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index 7179c80..98f5480 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5440,6 +5440,7 @@ asmlinkage void __sched schedule(void)
 	struct rq *rq;
 	int cpu;
 
+
 need_resched:
 	preempt_disable();
 	cpu = smp_processor_id();
@@ -5450,6 +5451,8 @@ need_resched:
 
 	release_kernel_lock(prev);
 need_resched_nonpreemptible:
+	if (prev->state && !unlikely(prev->in_taskwait) && unlikely(waitqueue_active(&prev->task_waiters)))
+		wake_up_sync_all(&prev->task_waiters);
 
 	schedule_debug(prev);
 
@@ -5467,7 +5470,7 @@ need_resched_nonpreemptible:
 			deactivate_task(rq, prev, 1);
 		switch_count = &prev->nvcsw;
 	}
-
+	
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))
@@ -5717,8 +5720,8 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
 	if (unlikely(!q))
 		return;
 
-	if (unlikely(!nr_exclusive))
-		wake_flags = 0;
+	/*if (unlikely(!nr_exclusive))
+		wake_flags = 0;*/
 
 	spin_lock_irqsave(&q->lock, flags);
 	__wake_up_common(q, mode, nr_exclusive, wake_flags, key);
@@ -6889,6 +6892,61 @@ out_unlock:
 	return retval;
 }
 
+SYSCALL_DEFINE2(sched_wait_block, pid_t, pid, struct timespec __user *, uts)
+{
+  struct task_struct* p;
+	long retval;
+	struct timespec ts;
+	long timeout = MAX_SCHEDULE_TIMEOUT;
+
+	if (pid < 0)
+	{
+		printk(KERN_INFO "Unknown PID: %d\n", pid);
+		return -EINVAL;
+	}
+
+	if (uts)
+	{
+	  if (copy_from_user(&ts, uts, sizeof(struct timespec)))
+			return -EINVAL;
+
+	  timeout = timespec_to_jiffies(&ts);
+		printk(KERN_INFO "Timeout: %ld\n", timeout);
+
+		if (timeout < 0)
+			return -EINVAL;
+	}
+
+	read_lock(&tasklist_lock);
+	p = find_process_by_pid(pid);
+	if (!p)
+	{
+		read_unlock(&tasklist_lock);
+		return -ESRCH;
+	}
+	get_task_struct(p);
+	read_unlock(&tasklist_lock);
+
+	printk(KERN_INFO "Waiting...");
+	// logic
+	current->in_taskwait = 1;
+	retval = wait_event_interruptible_timeout(p->task_waiters, p->state, timeout);
+	current->in_taskwait = 0;
+	// endlogic
+	printk("Done: %ld\n", retval);
+
+	if (retval == -ERESTARTSYS)
+		retval = -EINTR;
+	else if (p->state)
+		retval = 0;
+	else
+		retval = -EAGAIN;
+
+  put_task_struct(p);
+
+	return retval;
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
-- 
1.6.3.3



* [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-15 19:04 [RFC][PATCH] sched_wait_block: wait for blocked threads Stijn Devriendt
  2009-11-15 19:04 ` [PATCH] Initial prototype version of sched_wait_block Stijn Devriendt
@ 2009-11-16  8:35 ` Ingo Molnar
  2009-11-16 18:02   ` Linus Torvalds
  2009-11-16 19:13   ` Stijn Devriendt
  1 sibling, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-11-16  8:35 UTC (permalink / raw)
  To: Stijn Devriendt, Linus Torvalds, Mike Galbraith, Peter Zijlstra,
	Andrea Arcangeli, Thomas Gleixner, Andrew Morton
  Cc: peterz, linux-kernel


(Cc:-ed more interested parties)

* Stijn Devriendt <highguy@gmail.com> wrote:

> Hi Ingo, Peter, all,
> 
> The attached patch is a prototype for a new system call which allows 
> threads to wait for another thread to block.
> 
> Its main use is to allow threading libraries to resume executing more 
> CPU-bound work when one of their threads is blocked, without having to 
> over-allocate threads in the normal case.
> 
> The benefit over asynchronous I/O is that a threadpool thread that 
> performs asynchronous I/O might not have enough work in one item to 
> keep the CPU busy for the whole asynchronous operation, and that not 
> all operations are async-capable. Giving control back to the library 
> through a thread that waits for the blocked one allows new work items 
> to be executed for as long as the former is blocked.
> 
> Code performing this wait could look like:
>   pid_t parent = ...;
>   while (waitpid(parent, NULL, WNOHANG) != 0)
>   {
>     if (sched_wait_block(parent, NULL) == 0)
>     {
>       // do work, possibly notify threadpool manager
>       // to start another thread blocked on this one
>       // first
>     }
>   }
> 
> Any feedback on the concept is much appreciated.

That is a ... rather interesting idea IMO.

Regarding the API and your patch, i think we can and should do something 
different and more capable - while still keeping your basic idea:

Let's turn it all around on its head and add the capability to user-space 
to observe the 'parallelism' of a workload (not limit it to the blocking 
of a single task) and allow the poll()ing of that quantity - without 
affecting workloads.

It should not be limited to a single task, and it should work with 
existing syscall APIs - i.e. be fd based.

Incidentally we already have a syscall and a kernel subsystem that is 
best suited to deal with such types of issues: perf events. I think we 
can create a new, special performance event type that observes 
task/workload (or CPU) parallelism:

	PERF_TYPE_PARALLELISM

With a 'parallelism_threshold' attribute. (which is '1' for a single 
task. See below.)

And then we can use poll() in the thread manager task to observe PIDs, 
workloads or full CPUs. The poll() implementation of perf events is fast 
and scalable.

( Note: there's no need to actually _capture_ the events into the 
  ring-buffer - this is done by not mmap()-ing the fd. I.e. we'll just 
  have a pure poll() wakeup overhead and no tracing overhead. )

The semantics are basically that we are observing task 
schedule/unschedule events and keep a count and a threshold - and can 
poll() on that. perf_event_attr can be used to inject a 'minimum 
parallelism' threshold value (and maybe a 'max parallelism' value as 
well).

Events are emitted (and poll() returns) if the observed workload gets 
'blocked' according to the parallelism threshold - i.e. if the number of 
runnable tasks drops below the threshold.

This fits very nicely into the existing perf events API and we wouldn't 
have to add a new syscall.

Usage is very simple and straightforward, and can happen on various 
levels of 'observation detail':

 - the new fd can be attached to a specific PID (like your syscall).
   perf_event_attr::threshold == 1 means we get the semantics of your
   sched_wait_block() system call. Note that poll() won't have to do a 
   PID lookup (as it is already attached) so it will be much faster than 
   sched_wait_block().

 - the new fd can be attached to a hierarchy of tasks and observe _all_ 
   of the parallelism there. This has the advantage of not having to 
   track each thread in a pool of threads. (this is done via inherited 
   events, see include/linux/perf_event.h:perf_event_attr::inherit) In 
   this case a parallelism threshold value larger than 1 makes sense 
   too, to allow the workload to spread to a number of CPUs. On a 4-CPU 
   system if we set threshold==4 it means that we'll return from poll()
   if the number of runnable tasks drops below 4.

 - the new fd can be attached to a CPU - observing parallelism of a full
   CPU without having to track all workloads. In this case threshold==1
   means that we'll return from poll() if the last task on that CPU 
   schedules out - i.e. if the CPU becomes idle.

etc.
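
As a rough illustration of the user-space side (purely a sketch:
PERF_TYPE_PARALLELISM does not exist, and where the threshold would
live in perf_event_attr - attr.config here - is just an assumption):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <string.h>
  #include <poll.h>
  #include <unistd.h>

  #define PERF_TYPE_PARALLELISM 6    /* hypothetical new event type */

  /* Open a parallelism event attached to 'pid' (or to a whole CPU when
   * pid == -1 and cpu >= 0), with the given runnable-tasks threshold. */
  static int parallelism_open(pid_t pid, int cpu, unsigned int threshold)
  {
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size   = sizeof(attr);
    attr.type   = PERF_TYPE_PARALLELISM;
    attr.config = threshold;         /* assumed home of the threshold */

    /* never mmap()ed: no ring buffer and no tracing overhead, we only
     * want the poll() wakeup when the number of runnable tasks drops
     * below 'threshold' */
    return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
  }

  /* manager: sleep until the observed workload loses parallelism */
  static void wait_for_underload(int fd)
  {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    poll(&pfd, 1, -1);
  }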

This would make a very powerful task queueing framework. It basically 
allows a 'lazy' user-space scheduler, which only activates if the kernel 
scheduler has run out of work.

What do you think?

	Ingo


* Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-16  8:35 ` [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads) Ingo Molnar
@ 2009-11-16 18:02   ` Linus Torvalds
  2009-11-16 18:18     ` Linus Torvalds
                       ` (3 more replies)
  2009-11-16 19:13   ` Stijn Devriendt
  1 sibling, 4 replies; 12+ messages in thread
From: Linus Torvalds @ 2009-11-16 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stijn Devriendt, Mike Galbraith, Peter Zijlstra,
	Andrea Arcangeli, Thomas Gleixner, Andrew Morton, peterz,
	linux-kernel



On Mon, 16 Nov 2009, Ingo Molnar wrote:
> 
> Regarding the API and your patch, i think we can and should do something 
> different and more capable - while still keeping your basic idea:

Actually, I'd suggest exactly the reverse.

Yes, do something different, but _less_ capable, and much simpler: 
introduce the notion of "grouped thread scheduling", where a _group_ of 
threads gets scheduled as one thread.

Think of it like a classic user-level threading package, where one process 
implements multiple threads entirely in user space, and switches between 
them. Except we'd do the exact reverse: create multiple threads in the 
kernel, but only run _one_ of them at a time. So as far as the scheduler 
is concerned, it acts as just a single thread - except it's a single 
thread that has multiple instances associated with it.

And every time the "currently active" thread in that group runs out of CPU 
time - or any time it sleeps - we'd just go on to the next thread in the 
group.

There are potentially lots of cases where you want to use multiple threads 
not because you want multiple CPU's, but because you want to have "another 
thread ready" for when one thread sleeps on IO. Or you may use threads as 
a container - again, you may not need a lot of CPU, but you split your 
single load up into multiple execution contexts just because you had some 
independent things going on (think UI threads).

As far as I can tell, that is pretty much what Stijn Devriendt wanted: he 
may have lots of threads, but he effectively really just wants "one CPU" 
worth of processing.

It's also what we often want with AIO-like threads: it's not that we want 
CPU parallelism, and if the data is in caches, we'd like to run the IO 
thread immediately and not switch CPU's at all, and actually do it all 
synchronously. It's just that _if_ the AIO thread blocks, we'd like to 
resume the original thread that may have better things to do. 

No "observe CPU parallelism" or anything fancy at all. Just a "don't 
consider these X threads to be parallel" flag to clone (or a separate 
system call).

Imagine doing async system calls with the equivalent of

 - create such an "affine" thread in kernel space
 - run the IO in that affine thread - if it runs to completion without 
   blocking and in a timeslice, never schedule at all.

where these "affine" threads would be even more lightweight than regular 
threads, because they don't even act as real scheduling entities, they are 
grouped together with the original one.
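
Spelled out in clone() terms it might look roughly like this
(CLONE_AFFINE is invented here purely for illustration; the flag,
its value and its exact semantics are all up for grabs):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <unistd.h>

  #define CLONE_AFFINE 0x00001000  /* hypothetical "same schedulable entity" flag */

  struct io_req { int fd; void *buf; size_t len; ssize_t ret; };

  static int io_helper(void *arg)
  {
    struct io_req *req = arg;

    /* If everything is cached this runs to completion in the caller's
     * timeslice; if it blocks, the scheduler simply resumes the
     * original thread of the group instead of picking a new one. */
    req->ret = read(req->fd, req->buf, req->len);
    return 0;
  }

  /* submit an async read via an "affine" helper thread */
  static int submit_read(struct io_req *req, char *stack_top)
  {
    return clone(io_helper, stack_top,
                 CLONE_VM | CLONE_FILES | CLONE_AFFINE, req);
  }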

			Linus


* Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-16 18:02   ` Linus Torvalds
@ 2009-11-16 18:18     ` Linus Torvalds
  2009-11-16 18:31     ` Alan Cox
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2009-11-16 18:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stijn Devriendt, Mike Galbraith, Peter Zijlstra,
	Andrea Arcangeli, Thomas Gleixner, Andrew Morton, peterz,
	linux-kernel



On Mon, 16 Nov 2009, Linus Torvalds wrote:
> 
> Think of it like a classic user-level threading package, where one process 
> implements multiple threads entirely in user space, and switches between 
> them. Except we'd do the exact reverse: create multiple threads in the 
> kernel, but only run _one_ of them at a time. So as far as the scheduler 
> is concerned, it acts as just a single thread - except it's a single 
> thread that has multiple instances associated with it.

Side note: before anybody asks why not do threads in user space to begin 
with, it's simple: IO, exceptions, and timers. All need kernel support. 

User-level threading works very well, and is usually efficient as hell. 
It's not that long since people constantly claimed that thread libraries 
in user space were much better, and tried to use complex NxM models to do 
them, exactly because user threads are so great.

But almost nobody does user-level threading now, except for specific 
languages that are built around threading. Why? It's not because they 
don't work, it's because they have a few very annoying problems that make 
them not work at all for enough situations that it gets very painful. And 
it's usually about IO and system calls, but sometimes it's about page 
faults, and sometimes it's about the pain of multiplexing that annoying 
timer signal and all the crap that involves.

So I'm suggesting that maybe we could look at doing kernel threads that do 
what user-level threading does. It sounds idiotic, and maybe it is - after 
all, traditionally one of the reasons for user-level threads is that you 
can avoid all the kernel overheads. But if we can make some really 
low-overhead "thread within a thread" model, maybe we could have close to 
the best of both worlds.

Some people really want threads for multi-CPU workloads and spreading 
things out. That's what we have now. But other loads want threads for 
entirely different reasons, like just hiding IO latencies.

			Linus


* Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-16 18:02   ` Linus Torvalds
  2009-11-16 18:18     ` Linus Torvalds
@ 2009-11-16 18:31     ` Alan Cox
  2009-11-16 19:49     ` Stijn Devriendt
  2009-11-21 11:26     ` Stijn Devriendt
  3 siblings, 0 replies; 12+ messages in thread
From: Alan Cox @ 2009-11-16 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Stijn Devriendt, Mike Galbraith, Peter Zijlstra,
	Andrea Arcangeli, Thomas Gleixner, Andrew Morton, peterz,
	linux-kernel

On Mon, 16 Nov 2009 10:02:50 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 16 Nov 2009, Ingo Molnar wrote:
> > 
> > Regarding the API and your patch, i think we can and should do something 
> > different and more capable - while still keeping your basic idea:
> 
> Actually, I'd suggest exactly the reverse.
> 
> Yes, do something different, but _less_ capable, and much simpler: 
> introduce the notion of "grouped thread scheduling", where a _group_ of 
> threads gets scheduled as one thread.

And preferably expose it so it can be used by groups of processes as well
as just "threads" in the userspace shared pid etc sense. There are quite
a few applications where both "only run one of us" and "don't pre-empt
within the group" are a useful semantic (you often don't want the
pre-empting within the group of threads because it messes up all the
improved locking opportunities). In particular the usual buzz locks don't
mix well with "only run one of us".

Alan


* Re: [RFC] observe and act upon workload parallelism:  PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked  threads)
  2009-11-16  8:35 ` [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads) Ingo Molnar
  2009-11-16 18:02   ` Linus Torvalds
@ 2009-11-16 19:13   ` Stijn Devriendt
  2009-11-16 20:22     ` Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-16 19:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel

> It should not be limited to a single task, and it should work with
> existing syscall APIs - i.e. be fd based.
>
> Incidentally we already have a syscall and a kernel subsystem that is
> best suited to deal with such types of issues: perf events. I think we
> can create a new, special performance event type that observes
> task/workload (or CPU) parallelism:
>
>        PERF_TYPE_PARALLELISM
>
> With a 'parallelism_threshold' attribute. (which is '1' for a single
> task. See below.)

On one side this looks like it's exactly where it belongs as you're
monitoring performance to keep it up to speed, but it does make
the userspace component depend on a profiling-oriented optional
kernel interface.

>
> And then we can use poll() in the thread manager task to observe PIDs,
> workloads or full CPUs. The poll() implementation of perf events is fast
> and scalable.

I've had a quick peek at the perf code and how it currently hooks into
the scheduler and at first glance it looks like 2 additional context switches
are required when using perf. The scheduler will first schedule the idle
thread to later find out that the schedule tail woke up another process
to run. My initial solution woke up the process before making a
scheduling decision. Depending on context switch times the original
blocking operation may have been unblocked (especially on SMP);
e.g. a blocked user-space mutex which was held shortly.
Feel free to correct me here as it was merely a quick peek.

>
> This would make a very powerful task queueing framework. It basically
> allows a 'lazy' user-space scheduler, which only activates if the kernel
> scheduler has run out of work.
>
> What do you think?
>
>        Ingo

I definitely like the way this approach can also work globally.

Stijn


* Re: [RFC] observe and act upon workload parallelism:  PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked  threads)
  2009-11-16 18:02   ` Linus Torvalds
  2009-11-16 18:18     ` Linus Torvalds
  2009-11-16 18:31     ` Alan Cox
@ 2009-11-16 19:49     ` Stijn Devriendt
  2009-11-16 20:13       ` Ingo Molnar
  2009-11-21 11:26     ` Stijn Devriendt
  3 siblings, 1 reply; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-16 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel

> Think of it like a classic user-level threading package, where one process
> implements multiple threads entirely in user space, and switches between
> them. Except we'd do the exact reverse: create multiple threads in the
> kernel, but only run _one_ of them at a time. So as far as the scheduler
> is concerned, it acts as just a single thread - except it's a single
> thread that has multiple instances associated with it.
>
> And every time the "currently active" thread in that group runs out of CPU
> time - or any time it sleeps - we'd just go on to the next thread in the
> group.

It almost makes me think of, excuse me, fibers ;)
One of the problems with the original approach is that the waiting threads
are woken even if the CPU load is high enough to keep the CPU going.
I had been thinking of inheriting the timeslice for the woken thread, but
this approach does way more than that. I also suspect the original approach
of introducing unfairness.

> No "observe CPU parallelism" or anything fancy at all. Just a "don't
> consider these X threads to be parallel" flag to clone (or a separate
> system call).

I wouldn't rule it out completely as in a sense, it's comparable to a JVM
monitoring performance counters to see if it needs to run the JIT optimizer.

> Imagine doing async system calls with the equivalent of
>
>  - create such an "affine" thread in kernel space
>  - run the IO in that affine thread - if it runs to completion without
>   blocking and in a timeslice, never schedule at all.
>
> where these "affine" threads would be even more lightweight than regular
> threads, because they don't even act as real scheduling entities, they are
> grouped together with the original one.
>
>                        Linus
>

One extra catch I didn't even think of in the original approach is
that you still need a way of saying to the kernel: no more work here.

My original approach fails bluntly and I will happily take credit for that ;)
The perf-approach perfectly allows for this, by waking up the "controller"
thread which does exactly nothing as there's no work left.
The grouped thread approach can end up with all threads blocked to
indicate no more work, but I wonder how the lightweight threads will
end up being scheduled. When a threadpool thread runs, you'll want
to try and run as much work as possible. This would mean that when
one thread blocks and another takes over, the latter will want to run
to completion (empty thread pool queue) starving the blocked thread.

Unless sched_yield is (ab?)used by these extra-threads to let the
scheduler consider the blocked thread again? With the risk that
the scheduler will schedule to the next extra-thread of the same group.

Basically, you're right about what I envisioned: threadpools are meant
to always have some extra work handy, so why not continue work when
one work item is blocked.

Stijn


* Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-16 19:49     ` Stijn Devriendt
@ 2009-11-16 20:13       ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2009-11-16 20:13 UTC (permalink / raw)
  To: Stijn Devriendt
  Cc: Linus Torvalds, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel


* Stijn Devriendt <highguy@gmail.com> wrote:

> One extra catch, I didn't even think of in the original approach is 
> that you still need a way of saying to the kernel: no more work here.
> 
> My original approach fails bluntly and I will happily take credit for 
> that ;) The perf-approach perfectly allows for this, by waking up the 
> "controller" thread which does exactly nothing as there's no work 
> left.

Note, the perf approach does not require a 'controller thread'.

The most efficient approach using perf-events would be:

 - have the pool threads block in poll(perf_event_fd). (all threads 
   block in poll() on the same fd).

 - blocking threads wake_up() the pool and cause them to drop out of
   poll() (with no intermediary). [if there's less than
   perf_event::min_concurrency tasks running.]

 - waking threads observe the event state and only run if we are still 
   below perf_event::max_concurrency - otherwise they re-queue to the
   poll() waitqueue.

Basically the perf-event fd creates the 'group of tasks'. This can be 
created voluntarily by cooperating threads - or involuntarily as well 
via PID attach or CPU attach.

There's no 'tracing' overhead or notification overhead: we maintain a 
shared state and the 'notifications' are straight wakeups that bring the 
pool members out of poll(), to drive the workload further.
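
A pool worker on top of that could be as small as the loop below
(only a sketch: MAX_CONCURRENCY, the work-queue helpers, and reading
the 'currently running' count straight off the event fd are all
assumptions about how such an event would be exposed):

  #include <poll.h>
  #include <unistd.h>

  #define MAX_CONCURRENCY 4                      /* placeholder */
  struct work_queue;
  extern void run_one_work_item(struct work_queue *q);  /* placeholder */

  /* all pool threads share 'event_fd', i.e. the same sw event */
  static void pool_worker(int event_fd, struct work_queue *q)
  {
    struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
    unsigned long long running;

    for (;;) {
      /* sleep until a blocking task drops the pool below
         perf_event::min_concurrency */
      poll(&pfd, 1, -1);

      /* observe the event state; if the pool is already saturated
         again, re-queue ourselves by going back to poll() */
      if (read(event_fd, &running, sizeof(running)) != sizeof(running))
        continue;
      if (running >= MAX_CONCURRENCY)
        continue;

      run_one_work_item(q);  /* may block - which frees a slot again */
    }
  }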

Such a special sw-event, with min_concurrency==max_concurrency==1 would 
implement Linus's interface - using standard facilities like poll(). 
(The only 'special' act is the set up of the group itself.)

So various concurrency controls could be implemented that way - 
including the one Linus suggest - even a HPC workload-queueing daemon 
could be done as well, which sheperds 100% uncooperative tasks.

I don't think this 'fancy' approach is actually a performance drag: it 
would really do precisely the same thing Linus's facility does (unless 
i'm missing something subtle - or something less subtle about Linus's 
scheme), with the two parameters set to '1'.

( It would also enable a lot of other things, and it would not tie the 
  queueing implementation into the scheduler. )

Only trying would tell us for sure though - maybe i'm wrong.

Thanks,

	Ingo


* Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)
  2009-11-16 19:13   ` Stijn Devriendt
@ 2009-11-16 20:22     ` Ingo Molnar
  2009-11-18  9:30       ` Stijn Devriendt
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2009-11-16 20:22 UTC (permalink / raw)
  To: Stijn Devriendt
  Cc: Linus Torvalds, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel


* Stijn Devriendt <highguy@gmail.com> wrote:

> > And then we can use poll() in the thread manager task to observe 
> > PIDs, workloads or full CPUs. The poll() implementation of perf 
> > events is fast and scalable.
> 
> I've had a quick peek at the perf code and how it currently hooks into 
> the scheduler and at first glance it looks like 2 additional context 
> switches are required when using perf. The scheduler will first 
> schedule the idle thread to later find out that the schedule tail woke 
> up another process to run. My initial solution woke up the process 
> before making a scheduling decision. Depending on context switch times 
> the original blocking operation may have been unblocked (especially on 
> SMP); e.g. a blocked user-space mutex which was held shortly. Feel 
> free to correct me here as it was merely a quick peek.

( Btw., the PERF_TYPE_PARALLELISM name sucks. A better name would be
  PERF_COUNT_SW_TASKS or PERF_COUNT_SW_THREAD_POOL or so. )

I'd definitely not advocate a 'controller thread' approach: it's an 
unnecessary extra intermediary and it doubles the context switch cost 
and tears cache footprint apart.

We want any such scheme to schedule 'naturally' and optimally: i.e. a 
blocking thread will schedule an available thread - no ifs and when.

The only limit we want is on concurrency - and we can do that by waking 
tasks from the poll() waitqueue if a task blocks - and by requeueing 
woken tasks to the poll() waitqueue if a task wakes (and if the 
concurrency threshold does not allow it to run)..

In a sense the poll() waitqueue becomes a mini-runqueue for 'ready' 
tasks - and the 'number of tasks running' value of the sw event object a 
rq->nr_running value. It does not make the tasks available to the real 
scheduler - but it's a list of tasks that are willing to run.

This would be a perfect and suitable use of poll() concepts i think - 
and well-optimized one as well. It could even be plugged into epoll().

	Ingo


* Re: [RFC] observe and act upon workload parallelism:  PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked  threads)
  2009-11-16 20:22     ` Ingo Molnar
@ 2009-11-18  9:30       ` Stijn Devriendt
  0 siblings, 0 replies; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-18  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel

> This would be a perfect and suitable use of poll() concepts i think -
> and well-optimized one as well. It could even be plugged into epoll().
>
>        Ingo

Sorry for not replying earlier, but I was knocked out on the couch
(and still am). I'll have a more thorough look later on when my head
stops spinning 'round and 'round.

I must say that even without applying this perf_event to userspace
scheduling, it's probably a nice event to have anyway in order
to measure concurrency in existing applications. That it would
be suitable to actually control parallelism is probably just a nice
extra. But let's look at things again when the clouds have left
the brain ;)

Stijn


* Re: [RFC] observe and act upon workload parallelism:  PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked  threads)
  2009-11-16 18:02   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2009-11-16 19:49     ` Stijn Devriendt
@ 2009-11-21 11:26     ` Stijn Devriendt
  3 siblings, 0 replies; 12+ messages in thread
From: Stijn Devriendt @ 2009-11-21 11:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, Andrea Arcangeli,
	Thomas Gleixner, Andrew Morton, peterz, linux-kernel

> Think of it like a classic user-level threading package, where one process
> implements multiple threads entirely in user space, and switches between
> them. Except we'd do the exact reverse: create multiple threads in the
> kernel, but only run _one_ of them at a time. So as far as the scheduler
> is concerned, it acts as just a single thread - except it's a single
> thread that has multiple instances associated with it.
>
> And every time the "currently active" thread in that group runs out of CPU
> time - or any time it sleeps - we'd just go on to the next thread in the
> group.

Without trying to sound selfish: after some thinking I can't see how this
solves my problem. This is fine for the case you mentioned later on,
like UI threads, but it's not powerful enough for what I'm trying to achieve.

Let's make the round-trip for the thread pool case and start with an empty
thread pool queue:
- All threads are blocked on the queue condition variable until new work
is queued.
- Thread 1 happily wakes up and runs the work item until it's blocked.
- A new work item arrives and Thread 2 is woken to handle the new work
  item.
- As long as new work arrives and Thread 2 is not blocked (regardless
  of preemption because the deal was that they will not preempt each
  other) Thread 2 keeps running this work.
  Even when Thread 1 is woken, it will not preempt Thread 2.

One solution would be to let Thread 2 call sched_yield, but the
question then is "when" and "how much". Every time a lightweight
thread yields, you'll incur context switches. Because you don't
know when or how much, you'll be penalized for context switching
even when not needed. (Consider 1 blocked thread and 4 extra threads
sched_yield'ing every 5 work items)
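
In code the work-around being described is roughly (the queue helpers
are placeholders, and YIELD_EVERY is exactly the number nobody knows
how to pick):

  #include <sched.h>

  #define YIELD_EVERY 5  /* arbitrary: too low wastes switches, too high starves */

  struct work_queue;
  extern int run_one_item(struct work_queue *q);  /* 0 while work remains */

  static void worker_loop(struct work_queue *q)
  {
    int done = 0;

    while (run_one_item(q) == 0) {
      /* give a previously blocked, now runnable sibling a chance to
         get the CPU back - at the price of a context switch */
      if (++done % YIELD_EVERY == 0)
        sched_yield();
    }
  }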

Another option is to have a group-leader. Non-leader threads will call
sched_yield once in a while in order to try and jump back to the group-leader.
The group-leader will always continue work without sched_yield'ing.
There's no preemption between these threads.
The down-side is that in case multiple of these threads are waiting for
an event, wake-ups must wake the group leader rather than the other
coop-scheduled threads for this model to work.
Another down-side is that when a non-leader thread is blocked and the
group-leader is run, the non-leader thread is treated unfairly.

Either solution's end result is a very unfair threadpool where one cannot
guarantee even a loose FIFO model (items handled more or less in the
order they are queued), and a library that needs to trade performance
to get this behaviour back.

The solution is great when the threads are blocked most of the time
and have little CPU processing to do (like UI threads), but doesn't
scale beyond that.

As ever, enlighten me when you have a great solution to this problem.

Stijn

