All of lore.kernel.org
 help / color / mirror / Atom feed
* fuse uring / wake_up on the same core
@ 2023-03-24 19:50 Bernd Schubert
  2023-03-24 22:44 ` Bernd Schubert
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-03-24 19:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Miklos Szeredi
  Cc: Dharmendra Singh, linux-fsdevel, Amir Goldstein, linux-kernel

Ingo, Peter,

I would like to ask how to wake up from a waitq on the same core. I have 
tried __wake_up_sync()/WF_SYNC, but I do not see any effect.

I'm currently working on fuse/uring communication patches, besides uring 
communication there is also a queue per core. Basic bonnie++ benchmarks 
with a zero file size to just create/read(0)/delete show a ~3x IOPs 
difference between CPU-bound bonnie++ and unbound - i.e. with these 
patches it is _not_ the fuse daemon that needs to be bound, but the 
application doing IO to the file system. We basically have

bonnie -> vfs                                (app/vfs)
   fuse_req                                   (app/fuse.ko)
   qid = task_cpu(current)                  (app/fuse.ko)
     ring(qid) / SQE completion (fuse.ko)   (app/fuse.ko/uring)
       wait_event(req->waitq, ...)          (app/fuse.ko)
       [app wait]
         daemon ring / handle CQE           (daemon)
           send-back result as SQE          (daemon/uring)
             fuse_request_end               (daemon/uring/fuse.ko)
           wake_up()  ---> random core      (daemon/uring/fuse.ko)
       [app wakeup/fuse/vfs/syscall return]
bonnie ==> different core


1) bound

[root@imesrv1 ~]# numactl --localalloc --physcpubind=0 bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20   6229  28 11289  41 12785  24  6615  28  7769  40 10020  25
Latency               411us     824us     816us     298us   10473us     200ms


2) not bound

[root@imesrv1 ~]# bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20   2064  33  2923  43  4556  28  2061  33  2186  42  4245  30
Latency               850us    3914us    2496us     738us     758us    6469us


With fewer files the difference becomes a bit smaller, but is still very 
visible. Besides cache line bouncing, I'm sure that CPU frequency and 
C-states will matter - I could tune that in the lab, but in the end I 
want to test what users actually run (I had recently checked with a large 
HPC center - Forschungszentrum Juelich - their HPC compute nodes are not 
tuned up, to save energy).
Also, in order to really tune down latencies, I want to add a 
struct file_operations::uring_cmd_iopoll thread, which will spin for a 
short time and avoid most of the kernel/userspace communication. If 
applications (with n-threads < n-cores) then get scheduled on different 
cores, different rings will be used, resulting in
n-threads-spinning > n-threads-application


There was already a related thread about fuse before

https://lore.kernel.org/lkml/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/

With the fuse-uring patches that part is basically solved - the waitq 
that thread is about is not used anymore. But as per above, what 
remains is the waitq of the incoming workq (not mentioned in the 
thread above). As I wrote, I have tried
__wake_up_sync((x), TASK_NORMAL), but it does not make a difference for 
me - similar to Miklos' testing before. I have also tried struct 
completion / swait - does not make a difference either.
I can see task_struct has wake_cpu, but there doesn't seem to be a good 
interface to set it.

Any ideas?


Thanks,
Bernd







^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-03-24 19:50 fuse uring / wake_up on the same core Bernd Schubert
@ 2023-03-24 22:44 ` Bernd Schubert
       [not found] ` <20230325002815.1703-1-hdanton@sina.com>
  2023-03-27 10:28 ` Peter Zijlstra
  2 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-03-24 22:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Miklos Szeredi
  Cc: Dharmendra Singh, linux-fsdevel, Amir Goldstein, linux-kernel

On 3/24/23 20:50, Bernd Schubert wrote:
> Ingo, Peter,
> 
> I would like to ask how to wake up from a waitq on the same core. I have 
> tried __wake_up_sync()/WF_SYNC, but I do not see any effect.
> 
> I'm currently working on fuse/uring communication patches, besides uring 
> communication there is also a queue per core. Basic bonnie++ benchmarks 
> with a zero file size to just create/read(0)/delete show a ~3x IOPs 
> difference between CPU-bound bonnie++ and unbound - i.e. with these 
> patches it is _not_ the fuse daemon that needs to be bound, but the 
> application doing IO to the file system. We basically have
> 

[...]

> With fewer files the difference becomes a bit smaller, but is still very 
> visible. Besides cache line bouncing, I'm sure that CPU frequency and 
> C-states will matter - I could tune that in the lab, but in the end I 
> want to test what users actually run (I had recently checked with a large 
> HPC center - Forschungszentrum Juelich - their HPC compute nodes are not 
> tuned up, to save energy).
> Also, in order to really tune down latencies, I want to add a 
> struct file_operations::uring_cmd_iopoll thread, which will spin for a 
> short time and avoid most of the kernel/userspace communication. If 
> applications (with n-threads < n-cores) then get scheduled on different 
> cores, different rings will be used, resulting in
> n-threads-spinning > n-threads-application
> 
> 
> There was already a related thread about fuse before
> 
> https://lore.kernel.org/lkml/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/
> 
> With the fuse-uring patches that part is basically solved - the waitq 
> that thread is about is not used anymore. But as per above, what 
> remains is the waitq of the incoming workq (not mentioned in the 
> thread above). As I wrote, I have tried
> __wake_up_sync((x), TASK_NORMAL), but it does not make a difference for 
> me - similar to Miklos' testing before. I have also tried struct 
> completion / swait - does not make a difference either.
> I can see task_struct has wake_cpu, but there doesn't seem to be a good 
> interface to set it.
> 
> Any ideas?
> 

How much of a hack is this patch?

[RFC] fuse: wake on the same core / disable migrate before wait

From: Bernd Schubert <bschubert@ddn.com>

Avoid bouncing between cores on wake; especially with uring, where
everything is core affine, bouncing badly decreases performance.
With read/write on /dev/fuse it is not good either - this needs to be
tested for negative impacts.
---
  fs/fuse/dev.c |   16 +++++++++++++---
  1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e82db13da8f6..d47b6a492434 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -372,12 +372,17 @@ static void request_wait_answer(struct fuse_req *req)
  	struct fuse_iqueue *fiq = &fc->iq;
  	int err;
  
+	/* avoid bouncing between cores on wake */
+	pr_devel("task=%p before wait on core: %u wake_cpu: %u\n",
+		 current, task_cpu(current), current->wake_cpu);
+	migrate_disable();
+
  	if (!fc->no_interrupt) {
  		/* Any signal may interrupt this */
  		err = wait_event_interruptible(req->waitq,
  					test_bit(FR_FINISHED, &req->flags));
  		if (!err)
-			return;
+			goto out;
  
  		set_bit(FR_INTERRUPTED, &req->flags);
  		/* matches barrier in fuse_dev_do_read() */
@@ -391,7 +396,7 @@ static void request_wait_answer(struct fuse_req *req)
  		err = wait_event_killable(req->waitq,
  					test_bit(FR_FINISHED, &req->flags));
  		if (!err)
-			return;
+			goto out;
  
  		spin_lock(&fiq->lock);
  		/* Request is not yet in userspace, bail out */
@@ -400,7 +405,7 @@ static void request_wait_answer(struct fuse_req *req)
  			spin_unlock(&fiq->lock);
  			__fuse_put_request(req);
  			req->out.h.error = -EINTR;
-			return;
+			goto out;
  		}
  		spin_unlock(&fiq->lock);
  	}
@@ -410,6 +415,11 @@ static void request_wait_answer(struct fuse_req *req)
  	 * Wait it out.
  	 */
  	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+out:
+	migrate_enable();
+	pr_devel("task=%p after wait on core: %u\n", current, task_cpu(current));
+
  }
  
  static void __fuse_request_send(struct fuse_req *req)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
       [not found] ` <20230325002815.1703-1-hdanton@sina.com>
@ 2023-03-25  7:08   ` K Prateek Nayak
  2023-03-27 14:35     ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2023-03-25  7:08 UTC (permalink / raw)
  To: Hillf Danton, Bernd Schubert
  Cc: Ingo Molnar, Peter Zijlstra, Miklos Szeredi, Amir Goldstein,
	linux-kernel, Andrei Vagin

Hello Hillf,

On 3/25/2023 5:58 AM, Hillf Danton wrote:
> On 24 Mar 2023 22:44:16 +0000 Bernd Schubert <bschubert@ddn.com>
>> How much of a hack is this patch?
> 
> It adds churn to my mind; then another RFC [1] comes to mind.
> 
> Feel free to make it work for you and resend it.
> 
> [1] Subject: [RFC PATCH 0/5] sched: Userspace Hinting for Task Placement
>     https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/

Thank you for pointing to my series.

Another possible way to tackle this is with a strategy Andrei is using in
his "seccomp: add the synchronous mode for seccomp_unotify" series
(https://lore.kernel.org/lkml/20230308073201.3102738-1-avagin@google.com/)

In patch 2, Andrei adds a WF_CURRENT_CPU that allows the task to always
wake on the CPU where the waker is running.
(https://lore.kernel.org/lkml/20230308073201.3102738-3-avagin@google.com/)

I believe Bernd's requirement calls for a WF_PREV_CPU flag which always 
wakes up the task on the CPU where it previously ran. I believe you can 
modify fuse_request_end() (in fs/fuse/dev.c) to use the WF_PREV_CPU flag 
when waking the tasks on req->waitq.

(P.S. I'm not familiar with the fuse subsystem but the comment
"Wake up waiter sleeping in request_wait_answer()" in fuse_request_end()
seems to suggest that is the right place to add this flag.)
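A purely illustrative, non-compiling sketch of what that could look like (both WF_PREV_CPU and the wake helper below are hypothetical; neither exists in mainline):

```c
/* Hypothetical sketch - WF_PREV_CPU and __wake_up_prev_cpu() do not exist
 * in mainline; shown only to mark where such a flag would be applied. */
void fuse_request_end(struct fuse_req *req)
{
	/* ... existing completion handling ... */

	/* Wake up waiter sleeping in request_wait_answer(), asking the
	 * scheduler to place it on the CPU it previously ran on. */
	__wake_up_prev_cpu(&req->waitq);	/* would pass WF_PREV_CPU down */
}
```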

Peter favored wake flag strategy of fixing wakeup target in a reply to an
earlier version of Andrei's series
(https://lore.kernel.org/lkml/Y8UgBnsaGDUJKH5A@hirez.programming.kicks-ass.net/)
but I'll let Peter respond with what he thinks is the right way to tackle
this.

> 
>>
>> [RFC] fuse: wake on the same core / disable migrate before wait
>>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> Avoid bouncing between cores on wake; especially with uring, where
>> everything is core affine, bouncing badly decreases performance.
>> With read/write on /dev/fuse it is not good either - this needs to be
>> tested for negative impacts.
>> ---
>>   fs/fuse/dev.c |   16 +++++++++++++---
>>   1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index e82db13da8f6..d47b6a492434 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -372,12 +372,17 @@ static void request_wait_answer(struct fuse_req *req)
>>   	struct fuse_iqueue *fiq = &fc->iq;
>>   	int err;
>>
>> +	/* avoid bouncing between cores on wake */
>> +	pr_devel("task=%p before wait on core: %u wake_cpu: %u\n",
>> +		 current, task_cpu(current), current->wake_cpu);
>> +	migrate_disable();
>> +
>>   	if (!fc->no_interrupt) {
>>   		/* Any signal may interrupt this */
>>   		err = wait_event_interruptible(req->waitq,
>>   					test_bit(FR_FINISHED, &req->flags));
>>   		if (!err)
>> -			return;
>> +			goto out;
>>
>>   		set_bit(FR_INTERRUPTED, &req->flags);
>>   		/* matches barrier in fuse_dev_do_read() */
>> @@ -391,7 +396,7 @@ static void request_wait_answer(struct fuse_req *req)
>>   		err = wait_event_killable(req->waitq,
>>   					test_bit(FR_FINISHED, &req->flags));
>>   		if (!err)
>> -			return;
>> +			goto out;
>>
>>   		spin_lock(&fiq->lock);
>>   		/* Request is not yet in userspace, bail out */
>> @@ -400,7 +405,7 @@ static void request_wait_answer(struct fuse_req *req)
>>   			spin_unlock(&fiq->lock);
>>   			__fuse_put_request(req);
>>   			req->out.h.error = -EINTR;
>> -			return;
>> +			goto out;
>>   		}
>>   		spin_unlock(&fiq->lock);
>>   	}
>> @@ -410,6 +415,11 @@ static void request_wait_answer(struct fuse_req *req)
>>   	 * Wait it out.
>>   	 */
>>   	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
>> +
>> +out:
>> +	migrate_enable();
>> +	pr_devel("task=%p after wait on core: %u\n", current, task_cpu(current));
>> +
>>   }
>>
>>   static void __fuse_request_send(struct fuse_req *req)
>>
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-03-24 19:50 fuse uring / wake_up on the same core Bernd Schubert
  2023-03-24 22:44 ` Bernd Schubert
       [not found] ` <20230325002815.1703-1-hdanton@sina.com>
@ 2023-03-27 10:28 ` Peter Zijlstra
  2023-04-26 22:40   ` Bernd Schubert
       [not found]   ` <20230427122417.2452-1-hdanton@sina.com>
  2 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2023-03-27 10:28 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ingo Molnar, Miklos Szeredi, Dharmendra Singh, linux-fsdevel,
	Amir Goldstein, linux-kernel

On Fri, Mar 24, 2023 at 07:50:12PM +0000, Bernd Schubert wrote:

> With the fuse-uring patches that part is basically solved - the waitq 
> that thread is about is not used anymore. But as per above, what 
> remains is the waitq of the incoming workq (not mentioned in the 
> thread above). As I wrote, I have tried
> __wake_up_sync((x), TASK_NORMAL), but it does not make a difference for 
> me - similar to Miklos' testing before. I have also tried struct 
> completion / swait - does not make a difference either.
> I can see task_struct has wake_cpu, but there doesn't seem to be a good 
> interface to set it.
> 
> Any ideas?

Does the stuff from:

  https://lkml.kernel.org/r/20230308073201.3102738-1-avagin@google.com

work for you?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-03-25  7:08   ` K Prateek Nayak
@ 2023-03-27 14:35     ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-03-27 14:35 UTC (permalink / raw)
  To: K Prateek Nayak, Hillf Danton
  Cc: Ingo Molnar, Peter Zijlstra, Miklos Szeredi, Amir Goldstein,
	linux-kernel, Andrei Vagin

On 3/25/23 08:08, K Prateek Nayak wrote:
> 
> Hello Hillf,
> 
> On 3/25/2023 5:58 AM, Hillf Danton wrote:
>> On 24 Mar 2023 22:44:16 +0000 Bernd Schubert <bschubert@ddn.com>
>>> How much of a hack is this patch?
>>
>> It adds churn to my mind; then another RFC [1] comes to mind.
>>
>> Feel free to make it work for you and resend it.
>>
>> [1] Subject: [RFC PATCH 0/5] sched: Userspace Hinting for Task Placement
>>      https://lore.kernel.org/lkml/20220910105326.1797-1-kprateek.nayak@amd.com/
> 
> Thank you for pointing to my series.
> 
> Another possible way to tackle this is with a strategy Andrei is using in
> his "seccomp: add the synchronous mode for seccomp_unotify" series
> (https://lore.kernel.org/lkml/20230308073201.3102738-1-avagin@google.com/)
> 
> In patch 2, Andrei adds a WF_CURRENT_CPU that allows the task to always
> wake on the CPU where the waker is running.
> (https://lore.kernel.org/lkml/20230308073201.3102738-3-avagin@google.com/)
> 
> I believe Bernd's requirement calls for a WF_PREV_CPU flag which always 
> wakes up the task on the CPU where it previously ran. I believe you can 
> modify fuse_request_end() (in fs/fuse/dev.c) to use the WF_PREV_CPU flag 
> when waking the tasks on req->waitq.
> 
> (P.S. I'm not familiar with the fuse subsystem but the comment
> "Wake up waiter sleeping in request_wait_answer()" in fuse_request_end()
> seems to suggest that is the right place to add this flag.)
> 
> Peter favored wake flag strategy of fixing wakeup target in a reply to an
> earlier version of Andrei's series
> (https://lore.kernel.org/lkml/Y8UgBnsaGDUJKH5A@hirez.programming.kicks-ass.net/)
> but I'll let Peter respond with what he thinks is the right way to tackle
> this.
> 

Thanks Hillf, Prateek and Peter! I'm going through Andrei's patches 
right now (will also check Prateek's patches later). At first glance 
WF_CURRENT_CPU is exactly what I need. At least for fuse/uring there is 
no need for another 'WF_PREV_CPU' flag - the request goes to and comes 
back from the ring on the 'current' cpu and only switches on the final 
completion - staying on the current cpu is all we need. Will test these 
patches later today.


Thanks again,
Bernd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-03-27 10:28 ` Peter Zijlstra
@ 2023-04-26 22:40   ` Bernd Schubert
       [not found]   ` <20230427122417.2452-1-hdanton@sina.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-04-26 22:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Miklos Szeredi, Dharmendra Singh, linux-fsdevel,
	Amir Goldstein, linux-kernel, K Prateek Nayak, Hillf Danton,
	Andrei Vagin

On 3/27/23 12:28, Peter Zijlstra wrote:
> On Fri, Mar 24, 2023 at 07:50:12PM +0000, Bernd Schubert wrote:
> 
>> With the fuse-uring patches that part is basically solved - the waitq
>> that thread is about is not used anymore. But as per above, what
>> remains is the waitq of the incoming workq (not mentioned in the
>> thread above). As I wrote, I have tried
>> __wake_up_sync((x), TASK_NORMAL), but it does not make a difference for
>> me - similar to Miklos' testing before. I have also tried struct
>> completion / swait - does not make a difference either.
>> I can see task_struct has wake_cpu, but there doesn't seem to be a good
>> interface to set it.
>>
>> Any ideas?
> 
> Does the stuff from:
> 
>    https://lkml.kernel.org/r/20230308073201.3102738-1-avagin@google.com

Thanks Peter, I have already replied in that thread - using 
__wake_up_on_current_cpu() helps to avoid cpu migrations. One update 
since my last mail in that thread (a few hours ago): more logging 
reveals that I still see a few cpu switches, but nothing compared to 
what I had before.
My issue now is that these patches are not enough and, contrary to 
previous testing, forcefully disabling cpu migration using 
migrate_disable() before wait_event_* in fuse's request_wait_answer()
and enabling it after does not help either - my process creating files
(bonnie++) still migrates to another cpu at some later point.
The only workaround I currently have is to set the ring thread 
processing vfs/fuse events in userspace to SCHED_IDLE. In combination 
with WF_CURRENT_CPU performance then goes from ~2200 to ~9000 file 
creates/s for a single thread in the latest branch (should be scalable). 
Which is very close to binding the bonnie++ process to a single core 
(~9400 creates/s).

Is there something available to mark ring threads as IO processing and 
that there is no need to migrate away the submitting thread from IO 
threads?

* application sends request -> forwards to ring and wake ring  -> wait
* ring wakes up (core bound) -> process request -> sends completion -> 
wake up application -> wait for next request
* application wakes up with request result

==> I don't understand why the application is moved to another core 
at all, after the wake issue is eliminated.

I also see SCHED_IDLE only as a workaround, as it would likely have 
side effects if anything else is running on the system, and it would 
consume cpus while another process is doing IO.
Is there a way to trace where and why a process is migrated away?
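(For the tracing question, one sketch, assuming root and tracefs mounted at the usual location, with event and option names as found in recent kernels: the sched:sched_migrate_task tracepoint fires on every migration, and enabling per-event stack traces shows which code path decided to migrate.)

```shell
cd /sys/kernel/tracing
# only report migrations of the workload of interest
echo 'comm == "bonnie++"' > events/sched/sched_migrate_task/filter
echo 1 > events/sched/sched_migrate_task/enable
echo 1 > options/stacktrace     # record the call chain of each migration
cat trace_pipe

# roughly equivalent with perf:
#   perf record -e sched:sched_migrate_task -g -- <workload>
#   perf script
```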


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
       [not found]   ` <20230427122417.2452-1-hdanton@sina.com>
@ 2023-04-27 13:35     ` Bernd Schubert
  2023-04-28  1:44       ` Hillf Danton
  0 siblings, 1 reply; 14+ messages in thread
From: Bernd Schubert @ 2023-04-27 13:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin, LKML

On 4/27/23 14:24, Hillf Danton wrote:
> On 26 Apr 2023 22:40:32 +0000 Bernd Schubert <bschubert@ddn.com>
>> My issue is now that these patches are not enough and contrary to
>> previous testing, forcefully disabling cpu migration using
>> migrate_disable() before wait_event_* in fuse's request_wait_answer()
>> and enabling it after does not help either - my process to create files
>> (bonnie++) somewhere migrates to another cpu at a later time.
> 
> Less than 2 migrates every ten minutes?

The test does not run that long... the process kind of migrates 
immediately, I think in less than a second.

> 
>> The only workaround I currently have is to set the ring thread
>> processing vfs/fuse events in userspace to SCHED_IDLE. In combination
>> with WF_CURRENT_CPU performance then goes from ~2200 to ~9000 file
>> creates/s for a single thread in the latest branch (should be scalable).
>> Which is very close to binding the bonnie++ process to a single core
>> (~9400 creates/s).
> 
> The scheduler is good at dispatching tasks to CPUs at least, and it works
> better with userspace hints as both Prateek and Andrei's works propose. 9400
> shows positive feedback from kernel, and the question is, is it feasible
> in your production environment to set CPU affinity? If yes, what else do
> you want?

Well, this is the fuse file system - each and every user would need to do that
and get core affinity right. I'm personally not setting core affinity for
any 'cp' or 'rsync' I'm doing.

Btw, a very hackish way to 'solve' the issue is this


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..dd32effb5010 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
         int err;
         int prev_cpu = task_cpu(current);
  
+       /* When running over uring and core affined userspace threads, we
+        * do not want to let the request-submitting process migrate away.
+        * Issue is that even after waking up on the right core, processes
+        * that have submitted requests might get migrated away, because
+        * the ring thread is still doing a bit of work or is in the process
+        * to go to sleep. Assumption here is that processes are started on
+        * the right core (i.e. idle cores) and can then stay on that core
+        * when they come and do file system requests.
+        * Another alternative way is to set SCHED_IDLE for ring threads,
+        * but that would have an issue if there are other processes keeping
+        * the cpu busy.
+        * SCHED_IDLE or this hack here result in about factor 3.5 for
+        * max meta request performance.
+        *
+        * Ideally we would tell the scheduler that ring threads are not
+        * disturbing, so that migration away from them should happen very rarely.
+        */
+       if (fc->ring.ready)
+               migrate_disable();
+
         if (!fc->no_interrupt) {
                 /* Any signal may interrupt this */
                 err = wait_event_interruptible(req->waitq,


So it disables migration and never re-enables it...
I'm still continuing to digg if there is a better way, any
hints are very welcome.


Thanks,
Bernd

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-04-27 13:35     ` Bernd Schubert
@ 2023-04-28  1:44       ` Hillf Danton
  2023-04-28 21:54         ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: Hillf Danton @ 2023-04-28  1:44 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com>
> Btw, a very hackish way to 'solve' the issue is this
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cd7aa679c3ee..dd32effb5010 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>          int err;
>          int prev_cpu = task_cpu(current);
>   
> +       /* When running over uring and core affined userspace threads, we
> +        * do not want to let the request-submitting process migrate away.
> +        * Issue is that even after waking up on the right core, processes
> +        * that have submitted requests might get migrated away, because
> +        * the ring thread is still doing a bit of work or is in the process
> +        * to go to sleep. Assumption here is that processes are started on
> +        * the right core (i.e. idle cores) and can then stay on that core
> +        * when they come and do file system requests.
> +        * Another alternative way is to set SCHED_IDLE for ring threads,
> +        * but that would have an issue if there are other processes keeping
> +        * the cpu busy.
> +        * SCHED_IDLE or this hack here result in about factor 3.5 for
> +        * max meta request performance.
> +        *
> +        * Ideally we would tell the scheduler that ring threads are not
> +        * disturbing, so that migration away from them should happen very rarely.
> +        */
> +       if (fc->ring.ready)
> +               migrate_disable();
> +
>          if (!fc->no_interrupt) {
>                  /* Any signal may interrupt this */
>                  err = wait_event_interruptible(req->waitq,
> 
If I understand it correctly, the seesaw workload hint to the scheduler 
looks like the diff below, leaving the scheduler free to pull the two 
players apart across CPUs and to migrate either one.

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
 		/* acquire extra reference, since request is still needed
 		   after fuse_request_end() */
 		__fuse_get_request(req);
+		current->seesaw = 1;
 		queue_request_and_unlock(fiq, req);
 
 		request_wait_answer(req);
@@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
 			   fc->max_write))
 		return -EINVAL;
 
+	current->seesaw = 1;
  restart:
 	for (;;) {
 		spin_lock(&fiq->lock);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,7 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
+	unsigned 			seesaw:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if (p->seesaw && current->seesaw)
+			return cpu;
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-04-28  1:44       ` Hillf Danton
@ 2023-04-28 21:54         ` Bernd Schubert
  2023-04-28 23:37           ` Hillf Danton
                             ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-04-28 21:54 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 4/28/23 03:44, Hillf Danton wrote:
> On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@ddn.com>
>> Btw, a very hackish way to 'solve' the issue is this
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index cd7aa679c3ee..dd32effb5010 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>>           int err;
>>           int prev_cpu = task_cpu(current);
>>    
>> +       /* When running over uring and core affined userspace threads, we
>> +        * do not want to let the request-submitting process migrate away.
>> +        * Issue is that even after waking up on the right core, processes
>> +        * that have submitted requests might get migrated away, because
>> +        * the ring thread is still doing a bit of work or is in the process
>> +        * to go to sleep. Assumption here is that processes are started on
>> +        * the right core (i.e. idle cores) and can then stay on that core
>> +        * when they come and do file system requests.
>> +        * Another alternative way is to set SCHED_IDLE for ring threads,
>> +        * but that would have an issue if there are other processes keeping
>> +        * the cpu busy.
>> +        * SCHED_IDLE or this hack here result in about factor 3.5 for
>> +        * max meta request performance.
>> +        *
>> +        * Ideally we would tell the scheduler that ring threads are not
>> +        * disturbing, so that migration away from them should happen very rarely.
>> +        */
>> +       if (fc->ring.ready)
>> +               migrate_disable();
>> +
>>           if (!fc->no_interrupt) {
>>                   /* Any signal may interrupt this */
>>                   err = wait_event_interruptible(req->waitq,
>>
> If I understand it correctly, the seesaw workload hint to the scheduler 
> looks like the diff below, leaving the scheduler free to pull the two 
> players apart across CPUs and to migrate either one.

Thank a lot Hillf! I had a day off / family day today, kernel is now 
eventually compiling.

> 
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
>   		/* acquire extra reference, since request is still needed
>   		   after fuse_request_end() */
>   		__fuse_get_request(req);
> +		current->seesaw = 1;
>   		queue_request_and_unlock(fiq, req);
>   
>   		request_wait_answer(req);
> @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
>   			   fc->max_write))
>   		return -EINVAL;
>   
> +	current->seesaw = 1;

fuse_dev_do_read is plain /dev/fuse (with read/write) and we don't know 
on which core these IO threads are running and which of them to wake up 
when an application comes with a request.

There is a patch to use __wake_up_sync to wake the IO thread, and there 
are reports that it helps performance, but I don't see that effect and 
I think Miklos doesn't either. For direct-io reads I had also already 
tested disabling migration - it didn't show any effect - so we better 
don't set current->seesaw = 1 in fuse_dev_do_read for now.

With my fuse-uring patches things are more clear
(https://lwn.net/Articles/926773/), there is one IO thread per core and 
libfuse side is binding these threads to a single core only.

nproc    /dev/fuse     /dev/fuse     fuse uring    fuse uring
           migrate on   migrate off  migrate on    migrate off
1         2023          1652          1151         3998
2         3375          2805          2221         7950
4         3823          4193          4540         15022
8         7796          8161          7846         22591
16        8520          8518          12235        27864
24        8361          8084          9415         27864
32        8361          8084          9124         12971

(in MiB/s)

So core affinity really matters and with core affinity it is always 
faster with fuse-uring over the existing code.

For single threaded metadata (file creates/stat/unlink) difference 
between migrate on/off is rather similar.  Going to run with multiple 
processes during the next days.

For paged (async) IO it behaves a bit differently, as there uring can show 
its strength and multiple requests can be combined during CQE processing - 
it is better to choose an idle ring thread on another core. I actually 
have a question about that as well - later.


>    restart:
>   	for (;;) {
>   		spin_lock(&fiq->lock);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,7 @@ struct task_struct {
>   	/* delay due to memory thrashing */
>   	unsigned                        in_thrashing:1;
>   #endif
> +	unsigned 			seesaw:1;
>   
>   	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>   
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>   	if (wake_flags & WF_TTWU) {
>   		record_wakee(p);
>   
> +		if (p->seesaw && current->seesaw)
> +			return cpu;
>   		if (sched_energy_enabled()) {
>   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>   			if (new_cpu >= 0)


Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests 
whether cpu is in cpus_ptr?  The combination of both patches results in

		if (p->seesaw && current->seesaw)
			return cpu;

		if ((wake_flags & WF_CURRENT_CPU) &&
		    cpumask_test_cpu(cpu, p->cpus_ptr))
			return cpu;



While writing this mail the kernel compilation finished, but it got 
late; I will test in the morning.


Thanks again,
Bernd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: fuse uring / wake_up on the same core
  2023-04-28 21:54         ` Bernd Schubert
@ 2023-04-28 23:37           ` Hillf Danton
  2023-05-01 21:44           ` Bernd Schubert
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
  2 siblings, 0 replies; 14+ messages in thread
From: Hillf Danton @ 2023-04-28 23:37 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 28 Apr 2023 21:54:51 +0000 Bernd Schubert <bschubert@ddn.com>
> > @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
> >   	if (wake_flags & WF_TTWU) {
> >   		record_wakee(p);
> >   
> > +		if (p->seesaw && current->seesaw)
> > +			return cpu;
> >   		if (sched_energy_enabled()) {
> >   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >   			if (new_cpu >= 0)
> 
> Hmm, WF_CURRENT_CPU works rather similar, except that it tests if cpu is 
> in cpus_ptr?  The combination of both patches results in

I missed checking cpu against p. Good catch.



* Re: fuse uring / wake_up on the same core
  2023-04-28 21:54         ` Bernd Schubert
  2023-04-28 23:37           ` Hillf Danton
@ 2023-05-01 21:44           ` Bernd Schubert
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
  2 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-05-01 21:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 4/28/23 23:54, Bernd Schubert wrote:
> On 4/28/23 03:44, Hillf Danton wrote:
>>    restart:
>>       for (;;) {
>>           spin_lock(&fiq->lock);
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,7 @@ struct task_struct {
>>       /* delay due to memory thrashing */
>>       unsigned                        in_thrashing:1;
>>   #endif
>> +    unsigned             seesaw:1;
>>       unsigned long            atomic_flags; /* Flags requiring atomic 
>> access. */
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>>       if (wake_flags & WF_TTWU) {
>>           record_wakee(p);
>> +        if (p->seesaw && current->seesaw)
>> +            return cpu;
>>           if (sched_energy_enabled()) {
>>               new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>               if (new_cpu >= 0)
> 
> 
> Hmm, WF_CURRENT_CPU works rather similar, except that it tests if cpu is 
> in cpus_ptr?  The combination of both patches results in
> 
>          if (p->seesaw && current->seesaw)
>              return cpu;
> 
>          if ((wake_flags & WF_CURRENT_CPU) &&
>              cpumask_test_cpu(cpu, p->cpus_ptr))
>              return cpu;
> 
> 
> 
> While writing the mail kernel compilation is ready, but it got late, 
> will test in the morning.

This works wonders!  The fuse-uring part is this

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..ec5853ca9646 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -373,6 +373,9 @@ static void request_wait_answer(struct fuse_req *req)
         int err;
         int prev_cpu = task_cpu(current);
  
+       if (fc->ring.per_core_queue)
+               current->seesaw = 1;
+
         if (!fc->no_interrupt) {
                 /* Any signal may interrupt this */
                 err = wait_event_interruptible(req->waitq,
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..715741ed58bf 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
                         /* XXX error injection or test with malicious daemon */
                 }
  
+               /* In combination with requesting process (application) seesaw
+                * setting (see request_wait_answer), the application will
+                * stay on the same core.
+                */
+               if (fc->ring.per_core_queue)
+                       current->seesaw = 1;
+
                 ret = fuse_uring_fetch(ring_ent, cmd);
                 break;
         case FUSE_URING_REQ_COMMIT_AND_FETCH:




I'm not at all familiar with scheduler code;
given that this works perfectly, does it suggest the same function is also
called without an explicit waitq, i.e. when the scheduler preempts a task?

I think there might be side effects - what if multiple
applications are on one core and another core would be available?
With this flag they would stay on the same core? Maybe two flags would be better?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..07783ddaec5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,8 @@ struct task_struct {
         /* delay due to memory thrashing */
         unsigned                        in_thrashing:1;
  #endif
+       unsigned                        seesaw_req:1;
+       unsigned                        seesaw_io:1;
  
         unsigned long                   atomic_flags; /* Flags requiring atomic access. */
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..474bf3657ef0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
         if (wake_flags & WF_TTWU) {
                 record_wakee(p);
  
+               /* current is handling requests on behalf of the waking process,
+                * both want to run on the same core in a seesaw manner.
+                */
+               if (p->seesaw_req && current->seesaw_io &&
+                   cpumask_test_cpu(cpu, p->cpus_ptr))
+                       return cpu;
+
                 if ((wake_flags & WF_CURRENT_CPU) &&
                     cpumask_test_cpu(cpu, p->cpus_ptr))
                         return cpu;

(not tested yet)


Thanks,
Bernd


* Re: fuse uring / wake_up on the same core
       [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
@ 2023-05-03 17:04             ` Bernd Schubert
  2023-05-04  2:16               ` Hillf Danton
  0 siblings, 1 reply; 14+ messages in thread
From: Bernd Schubert @ 2023-05-03 17:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 5/2/23 02:33, Hillf Danton wrote:
> On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com>
>> I'm not familiar at all with scheduler code,
>> given this works perfectly this suggests the same function is also
>> called without explicit waitq, when the scheduler preempts a task?
> 
> Please see comment below in the select_task function.
>>
>> I think there might be side effects - what is if multiple
>> applications are on one core and another core would be available?
>> With this flag they would stay on the same core? Maybe better two flags?
> 
> Gooddd question. Given multiple seesaws,
> 
> 	Player i1	Player j1
> 	|		|
> 	Seesaw i	Seesaw j
> 	|		|
> 	P i2		P j2
> 
> what is ideal is run Seesaws i and j on different CPU cores.
> 
> We can do that by replacing the seesaw flag in the task struct with
> seesaw id for instance. But I have no idea how complex it will become
> to set up multiple seesaw workloads on the fuse side, by grouping and
> binding players to different seesaws,
> 
> -	unsigned			seesaw:1;
> > +	unsigned int			seesaw;
>
> while the corresponding change to scheduler looks like.
> 
> +               if (p->seesaw && p->seesaw == current->seesaw &&
> +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> +                       return cpu;


Hmm, how is the seesaw id assigned, and assuming two processes land
on the same core but later another core becomes available, how does it
dynamically change the ID?
My idea with the two bits was that there is a fuse ring thread bound to a
specific core - it is the IO processor and gets the seesaw_proc bit.
Applications submitting requests get the seesaw_req bit set.
Two applications running on the same core won't disturb each other that way.

In addition, if the application is not submitting requests anymore, but
let's say is busy doing computations, we want a timeout to let
it move away if another core is more suitable. What do you think about
the new patch version at the end of the mail? It uses the two bits plus jiffies.
Just tested it and it works fine. The exact timeout period is
certainly debatable. I also feel a bit bad
about taking so many bits in struct task. If this approach is acceptable,
the jiffies parameter could probably be a u16.

> 
> Even after job done for fuse, the emerging question is, how to set up
> seesaw workloads for crypto for example, if no more than the seesaw id
> hint to scheduler is preferred.
> 
> And it makes me wonder why Prateek's work stalled for quite a while,
> as more is needed for userspace hint to scheduler to work for
> workloads other than seesaw.

I just quickly went over these a bit; assuming seesaw doesn't get accepted
and we need these, I think it would need a bit of modification for fuse


> +		/*
> +		 * Handle the case where a hint is set and the current CPU
> +		 * and the previous CPU where task ran don't share caches.
> +		 */
> +		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {

I'm testing on an older Xeon system (E5-2650) and tried different settings
with numa-binding the application (the fuse ring thread is bound anyway)

                                        
governor                                conservative    performance
                                           (default)
application cpu 0, ring cpu 0  creates/s   ~9200          ~9500
application cpu 0, ring cpu 16 creates/s   ~4200          ~8000
application cpu 0, ring cpu 1  creates/s   ~4200          ~8500


No idea why cpu 1 gives better results in performance mode than cpu 16; it might
be within measurement accuracy. CPU frequency definitely has the largest effect
here - the cpus_share_cache() condition is not ideal for fuse. And I guess asking
users to use the cpu performance governor for fuse is also too much to ask - other
file systems don't have that requirement...
So far your seesaw idea works best (the modified version in combination with
wake-on-same-core).

>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 63d242164b1a..07783ddaec5c 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,8 @@ struct task_struct {
>>           /* delay due to memory thrashing */
>>           unsigned                        in_thrashing:1;
>>    #endif
>> +       unsigned                        seesaw_req:1;
>> +       unsigned                        seesaw_io:1;
>>    
>>           unsigned long                   atomic_flags; /* Flags requiring atomic access. */
>>    
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9d6ed7585c6..474bf3657ef0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>           if (wake_flags & WF_TTWU) {
>>                   record_wakee(p);
> 
> Seesaw does not work without WF_TTWU as per define.

What does WF_TTWU actually mean? Something like work flow time to wake unit?

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..6da0de4ae9ca 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
  	 * Wait it out.
  	 */
  	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+	/*
+	 * __wake_up_on_current_cpu ensures we wake up on the right core,
+	 * after that we still want to stay on the same core, shared with
+	 * a ring thread to submit next request to it. Issue without seesaw
+	 * is that the while the ring thread is on its way to wait, it disturbs
+	 * the application and application might get migrated away
+	 */
+	if (fc->ring.per_core_queue) {
+		current->seesaw_req = 1;
+		current->seesaw_jiffies = jiffies;
+	}
+
+
  out:
  	if (prev_cpu != task_cpu(current))
  		pr_devel("%s cpu switch from=%d to=%d\n",
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..73adc2b16778 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
  			/* XXX error injection or test with malicious daemon */
  		}
  
+		/* In combination with requesting process (application) seesaw
+		 * setting (see request_wait_answer), the application will
+		 * stay on the same core.
+		 */
+		if (fc->ring.per_core_queue)
+			current->seesaw_proc = 1;
+
  		ret = fuse_uring_fetch(ring_ent, cmd);
  		break;
  	case FUSE_URING_REQ_COMMIT_AND_FETCH:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..53d9c77672b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,13 @@ struct task_struct {
  	/* delay due to memory thrashing */
  	unsigned                        in_thrashing:1;
  #endif
+	/* requesting task */
+	unsigned 			seesaw_req:1;
+	/* request processing task */
+	unsigned			seesaw_proc:1;
+
+	/* limit seesaw time slot */
+	unsigned long			seesaw_jiffies;
  
  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..a14161e6e456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
  	if (wake_flags & WF_TTWU) {
  		record_wakee(p);
  
+		/*
+		 * Current is handling requests on behalf of the waking process,
+		 * both want to run on the same core in a seesaw manner.
+		 * Typically current is bound to one core and only p is allowed
+		 * to freely move.
+		 */
+		if (p->seesaw_req && current->seesaw_proc &&
+		    time_after(jiffies, p->seesaw_jiffies + 10),
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
  		if ((wake_flags & WF_CURRENT_CPU) &&
  		    cpumask_test_cpu(cpu, p->cpus_ptr))
  			return cpu;


Thanks,
Bernd


* Re: fuse uring / wake_up on the same core
  2023-05-03 17:04             ` Bernd Schubert
@ 2023-05-04  2:16               ` Hillf Danton
  2023-05-05 13:10                 ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: Hillf Danton @ 2023-05-04  2:16 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 3 May 2023 17:04:25 +0000 Bernd Schubert <bschubert@ddn.com>
> On 5/2/23 02:33, Hillf Danton wrote:
> > On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@ddn.com>
> >> I'm not familiar at all with scheduler code,
> >> given this works perfectly this suggests the same function is also
> >> called without explicit waitq, when the scheduler preempts a task?
> > 
> > Please see comment below in the select_task function.
> >>
> >> I think there might be side effects - what is if multiple
> >> applications are on one core and another core would be available?
> >> With this flag they would stay on the same core? Maybe better two flags?
> > 
> > Gooddd question. Given multiple seesaws,
> > 
> > 	Player i1	Player j1
> > 	|		|
> > 	Seesaw i	Seesaw j
> > 	|		|
> > 	P i2		P j2
> > 
> > what is ideal is run Seesaws i and j on different CPU cores.
> > 
> > We can do that by replacing the seesaw flag in the task struct with
> > seesaw id for instance. But I have no idea how complex it will become
> > to set up multiple seesaw workloads on the fuse side, by grouping and
> > binding players to different seesaws,
> > 
> > -	unsigned			seesaw:1;
> > +	unsigned int			seesaw;
> > while the corresponding change to scheduler looks like.
> > 
> > +               if (p->seesaw && p->seesaw == current->seesaw &&
> > +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> > +                       return cpu;
> 
> 
> Hmm, how is the seesaw id assigned and assuming two processes landed
> on the same core but later on another core is available, how does it
> dynamically change the ID?

Gooood question again. The seesaw id is freely assigned in the space
unsigned int offers, 1 <= seesaw id < UINT_MAX, independent of
anything else, and once assigned it will not change throughout the
lifespan, because there is no link between the seesaw id and a CPU core id.

The main point of the seesaw id is to help group/bind players on the fuse
side, and to help select the current CPU on the scheduler side. It also
prevents player i2 from making a phone call to player j1.

Beyond that there is nothing more it can do, so no change in how the
scheduler handles an idle core.

> My idea with two bits was that there is a fuse ring thread bound to a
> specific core - it is the IO processor and gets the seesaw_proc bit.

Setting CPU affinity is another userspace hint, independent of seesaw.

> Application is submitting requests and get the seesaw_req bit set.
> Two applications running on the same core won't disturb each other that way.

Yeah, the scheduler is free to migrate any seesaw_req task away if needed,
but seesaw will pull it back at the next wakeup.

	req I1 ... In
	|
	seesaw-I on core-K 
	|
	proc I

After binding proc-I to core-W for instance, if req-I2 is migrated to
core-X then it makes no sense to pull proc-I to core-X upon wakeup.
In practice this will not happen, because of the cpumask test for proc-I.

		    cpumask_test_cpu(cpu, p->cpus_ptr))
> 
> As addition, if the application is not submitting requests anymore, but
> let's say is busy doing computations, we want to have a timeout to let
> it move away if another core is more suitable.

If I understand fuse correctly, in request_wait_answer() seesaw helps
only if the application sleeps on req->waitq (in other words, seesaw
does not work without WF_TTWU), and the scheduler does migration for free.

Note that migration is the scheduler's job and has nothing to do with
seesaw; otherwise PeterZ will roar like a lion.

> What do you think about
> the new patch version at the end of the mail? It uses two bits and jiffies.

See comment below in select_task_rq_fair().

What we are doing is just showing PeterZ that a userspace hint for the fuse
seesaw workload works; feel free to do whatever you like, because you
know fuse much better than I do.

> Just tested it and it works fine. The exact timeout period is
> certainly debatable. I also feel a bit bad
> to take so many bits in struct task. If this approach is acceptable,
> the jiffies parameter could be probably an u16.
> 
> > 
> > Even after job done for fuse, the emerging question is, how to set up
> > seesaw workloads for crypto for example, if no more than the seesaw id
> > hint to scheduler is preferred.
> > 
> > And it makes me wonder why Prateek's work stalled for quite a while,
> > as more is needed for userspace hint to scheduler to work for
> > workloads other than seesaw.
> 
> Just quickly went over these a bit, assuming seesaw doesn't get accepted
> and we need these, I think it would need a bit modification for fuse
> 
> 
> > +		/*
> > +		 * Handle the case where a hint is set and the current CPU
> > +		 * and the previous CPU where task ran don't share caches.
> > +		 */
> > +		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {
> 
> I'm testing on and older Xeon system (E5-2650) and tried different settings
> with numa binding the application (fuse ring thread is bound anyway)
> 
>                                         
> governor                                conservative    performance
>                                            (default)
> application cpu 0, ring cpu 0  creates/s   ~9200          ~9500
> application cpu 0. ring cpu 16 creates/s   ~4200          ~8000
> application cpu 0. ring cpu 1  creates/s   ~4200          ~8500
> 
> 
> No idea why cpu 1 gives better results in performance mode than cpu 16, might be within
> accuracy. CPU frequency definitely has the largest effect here - the cpus_share_cache()
> condition is not ideal for fuse. And I guess asking users to use cpu performance mode
> for fuse is also too much asked - other file systems don't have that requirement...
> So far your seesaw idea works best (the modified version in combination with
> wake-on-same-core).
> 
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 63d242164b1a..07783ddaec5c 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -953,6 +953,8 @@ struct task_struct {
> >>           /* delay due to memory thrashing */
> >>           unsigned                        in_thrashing:1;
> >>    #endif
> >> +       unsigned                        seesaw_req:1;
> >> +       unsigned                        seesaw_io:1;
> >>    
> >>           unsigned long                   atomic_flags; /* Flags requiring atomic access. */
> >>    
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b9d6ed7585c6..474bf3657ef0 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>           if (wake_flags & WF_TTWU) {
> >>                   record_wakee(p);
> > 
> > Seesaw does not work without WF_TTWU as per define.
> 
> What does WF_TTWU actually mean? Something like work flow time to wake unit?
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cd7aa679c3ee..6da0de4ae9ca 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
>   	 * Wait it out.
>   	 */
>   	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> +
> +	/*
> +	 * __wake_up_on_current_cpu ensures we wake up on the right core,
> +	 * after that we still want to stay on the same core, shared with
> +	 * a ring thread to submit next request to it. Issue without seesaw
> +	 * is that the while the ring thread is on its way to wait, it disturbs
> +	 * the application and application might get migrated away
> +	 */
> +	if (fc->ring.per_core_queue) {
> +		current->seesaw_req = 1;
> +		current->seesaw_jiffies = jiffies;
> +	}
> +
> +
>   out:
>   	if (prev_cpu != task_cpu(current))
>   		pr_devel("%s cpu switch from=%d to=%d\n",
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 7d327699b4c5..73adc2b16778 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>   			/* XXX error injection or test with malicious daemon */
>   		}
>   
> +		/* In combination with requesting process (application) seesaw
> +		 * setting (see request_wait_answer), the application will
> +		 * stay on the same core.
> +		 */
> +		if (fc->ring.per_core_queue)
> +			current->seesaw_proc = 1;
> +
>   		ret = fuse_uring_fetch(ring_ent, cmd);
>   		break;
>   	case FUSE_URING_REQ_COMMIT_AND_FETCH:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..53d9c77672b7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,13 @@ struct task_struct {
>   	/* delay due to memory thrashing */
>   	unsigned                        in_thrashing:1;
>   #endif
> +	/* requesting task */
> +	unsigned 			seesaw_req:1;
> +	/* request processing task */
> +	unsigned			seesaw_proc:1;
> +
> +	/* limit seesaw time slot */
> +	unsigned long			seesaw_jiffies;
>   
>   	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9d6ed7585c6..a14161e6e456 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   	if (wake_flags & WF_TTWU) {
>   		record_wakee(p);
>   
> +		/*
> +		 * Current is handling requests on behalf of the waking process,
> +		 * both want to run on the same core in seeswaw manner.
> +		 * Typically current is bound to one core.'and only p is allowed
> +		 * to freely move.
> +		 */
> +		if (p->seesaw_req && current->seesaw_proc &&

What is missing is: are p and current assigned the same seesaw id, given
multiple seesaw instances?

> +		    time_after(jiffies, p->seesaw_jiffies + 10),
> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
> +			return cpu;
> +
>   		if ((wake_flags & WF_CURRENT_CPU) &&
>   		    cpumask_test_cpu(cpu, p->cpus_ptr))
>   			return cpu;



* Re: fuse uring / wake_up on the same core
  2023-05-04  2:16               ` Hillf Danton
@ 2023-05-05 13:10                 ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2023-05-05 13:10 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Miklos Szeredi, K Prateek Nayak, Andrei Vagin,
	linux-mm, linux-kernel

On 5/4/23 04:16, Hillf Danton wrote:
> 
>> +		    time_after(jiffies, p->seesaw_jiffies + 10),
>> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
>> +			return cpu;

Above is a big typo; I don't even see at first glance how this 
compiled at all.

This was supposed to be

if (p->seesaw_req && current->seesaw_proc &&
     time_after(jiffies, p->seesaw_jiffies + 10) &&
     cpumask_test_cpu(cpu, p->cpus_ptr))


Anyway, I now understand that the WF_TTWU flag is related to the waitq - 
we don't need the timeout at all. But then, if the main issue is waitq 
migration, I don't yet understand why Andrei's WF_CURRENT_CPU is not 
sufficient. I'm going to investigate that next. It is probably much easier 
to get that accepted than a rather fuse-specific seesaw.


Thanks,
Bernd




end of thread, other threads:[~2023-05-05 13:10 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-24 19:50 fuse uring / wake_up on the same core Bernd Schubert
2023-03-24 22:44 ` Bernd Schubert
     [not found] ` <20230325002815.1703-1-hdanton@sina.com>
2023-03-25  7:08   ` K Prateek Nayak
2023-03-27 14:35     ` Bernd Schubert
2023-03-27 10:28 ` Peter Zijlstra
2023-04-26 22:40   ` Bernd Schubert
     [not found]   ` <20230427122417.2452-1-hdanton@sina.com>
2023-04-27 13:35     ` Bernd Schubert
2023-04-28  1:44       ` Hillf Danton
2023-04-28 21:54         ` Bernd Schubert
2023-04-28 23:37           ` Hillf Danton
2023-05-01 21:44           ` Bernd Schubert
     [not found]           ` <20230502003335.3253-1-hdanton@sina.com>
2023-05-03 17:04             ` Bernd Schubert
2023-05-04  2:16               ` Hillf Danton
2023-05-05 13:10                 ` Bernd Schubert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.